3.8. Using Character Codes

Several factors affect the choice of character encoding for a particular area of application or purpose. The factors range from the nature of use and technical possibilities and limitations to policy decisions and external requirements.

In typical situations, however, the choice is relatively simple. For example, if you live in Sweden and wish to communicate in Swedish, you normally choose ISO 8859-1 for email, web pages, and plain text files. That's what practically all people in Sweden can work with, and it is reasonably acceptable. There are good reasons for using a larger character repertoire, such as the need for some punctuation marks, but they are probably not good enough to justify the potential risks of using Windows Latin 1 or Unicode.

On the other hand, as soon as you really need to include words in, say, Eastern European languages, or special technical symbols, you should probably switch to Unicode, normally using the UTF-8 encoding. In that case, you should make a reasonable effort to ensure that the recipients have software that can handle the encoding and the characters you use.

3.8.1. Repertoire Requirements

Each character encoding allows a specific repertoire of characters to be written. Therefore, the set of characters that you need imposes restrictions on the encodings that you can use. However, as discussed in Chapter 2, in different data formats, there are escape mechanisms that let you enter characters that cannot be written directly in the selected encoding. Thus, if you write a web page in English and may occasionally need an omega character Ω, for example, you can use ISO-8859-1 or even ASCII, since you can represent the special character using the entity reference &Omega;.

Different languages have rather different requirements on the repertoire. In the section "What's in a Character" in Chapter 1, some sources of information on the character requirements of languages were mentioned.
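The repertoire restriction and the escape mechanism described above can be tried out in a few lines of Python. This is only a minimal sketch: the function names fits_repertoire and encode_with_references are hypothetical names chosen for this illustration, and the escape produced is the numeric character reference &#937;, which in HTML denotes the same omega character as a named entity reference would.

```python
def fits_repertoire(text: str, encoding: str) -> bool:
    """Return True if every character in text can be written in encoding.

    Hypothetical helper for illustration; it simply attempts the encoding.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def encode_with_references(text: str, encoding: str) -> bytes:
    """Encode text, replacing characters outside the repertoire of the
    encoding with numeric character references (an HTML/XML escape)."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# U+03A9 is GREEK CAPITAL LETTER OMEGA; U+00E9 is e with acute accent.
sample = "Resistance is measured in \u03a9."

print(fits_repertoire("caf\u00e9", "ascii"))       # False: e-acute is not in ASCII
print(fits_repertoire("caf\u00e9", "iso-8859-1"))  # True
print(fits_repertoire(sample, "iso-8859-1"))       # False: omega is not in ISO 8859-1
print(encode_with_references(sample, "ascii"))     # b'Resistance is measured in &#937;.'
```

The escaped form is meaningful only in a data format that defines such references, such as HTML or XML; in a plain text file the recipient would see the literal characters &#937;.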
The database at http://www.eki.ee/letter/ can be used to list the characters used in many languages (written in Latin or Cyrillic letters). The requirements are, however, largely debatable, and they are relative rather than absolute. Does English need é? Most sources don't mention it as a letter of the English alphabet, but many regard it as necessary for writing English texts correctly.

For normal modern English that does not contain special notations, the repertoire of Windows Latin 1 is sufficient, whereas, for example, the ISO Latin 1 repertoire is insufficient (due to the lack of some punctuation marks). This does not mean that you should use the Windows Latin 1 encoding (windows-1252). You can use UTF-8, or ISO-8859-1, or even ASCII, and accept the consequences.

3.8.2. Encodings and the Internet

Most important, make sure that any Internet-related software that you use to send data specifies the encoding correctly in suitable headers. Two things are involved: there must be a header that reflects the actual encoding used, and the encoding used must be one that is widely understood by the (potential) recipients' software. You often need to make compromises in regard to the latter aim: you may need to use an encoding that is not yet widely supported to get your message through at all.

In principle, you should first determine the character repertoire you need in a document, database, web page, or other context. Then you should proceed to determine the best possible character code and encoding. In practice, things don't quite work that way. We need to consider some widely recognized encodings and choose between them. Some rules of thumb:
If you use, say, Outlook Express to send email or to post to Usenet groups, make sure it sends the message in a reasonable form. In particular, make sure it does not send the message as HTML or duplicate it by sending it both as plain text and as HTML (select plain text only). In regard to character encoding, make sure it is something widely understood, such as ASCII, some ISO 8859 encoding, or UTF-8, depending on how large a character repertoire you need.

In particular, avoid sending data in a proprietary encoding (like the Macintosh encoding or a DOS or Windows encoding) to a public network. At the very least, if you do that, make sure that the message headers specify the encoding! There's nothing wrong with using such an encoding within a single computer or in data transfer between similar computers. But when sent to the Internet, data should be converted to a more widely known encoding by the sending program. If you cannot find a way to configure your program to do that, get another program.

In email programs, there's typically a "Tools" or "Settings" menu, where you can set things like the format and encoding of outgoing messages. The hard part is to understand what the settings and the options are about, but at this point, you should have most if not all of the information needed for that. Having checked the settings, you can test them by sending email to yourself and viewing the "hidden data."

The "hidden data" in an email (or Usenet) message consists of the message headers and the message body interpreted as plain text. Normally you see that data as formatted by your email program. This includes interpreting the content according to the specified encoding and displaying some (but usually not all) of the information in the headers, such as the sender's name and email address. There are different ways to view such information, or to view just the headers.
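One such way, independent of any particular mail program, is to save the raw message and inspect it programmatically. The following is a minimal sketch using the email package of the Python standard library. The raw message, including the sender address and the sample body, is invented for this illustration; the fallback to us-ascii is an assumption that mirrors the usual rule that ASCII is implied when no encoding is declared.

```python
from email import message_from_bytes
from email.policy import default

# Hypothetical raw message, roughly as a "message source" view would show it.
# The body octets are "Smorgasbord" with o-umlaut and a-ring in ISO 8859-1.
raw = (
    b"From: Anna Example <anna@example.org>\r\n"
    b"To: anna@example.org\r\n"
    b"Subject: Encoding test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Sm\xf6rg\xe5sbord\r\n"
)

msg = message_from_bytes(raw, policy=default)

# The charset parameter of the Content-Type header, lowercased,
# or None if the header does not declare one.
declared = msg.get_content_charset()

# Assumption: fall back to ASCII when no charset is declared.
charset = declared or "us-ascii"

# Undo the transfer encoding to get the body octets, then interpret
# them according to the charset found above.
body = msg.get_payload(decode=True).decode(charset)

print("Content-Type:", msg["Content-Type"])
print("Declared charset:", declared)
print("Body:", body.strip())
```

Sending a test message to yourself and running its saved source through a script like this shows whether the header reflects the encoding that the program actually used: if the declared charset is wrong, the decoded body comes out garbled or the decoding fails.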
In email programs, the ways are not always easy to find; in Outlook Express, for example, you can select the received message and then select File → Properties. In Mozilla Thunderbird you can use View → Message source, and you will see the message headers and raw content in a new window, as illustrated in Figure 3-7. The Content-Type header contains a charset parameter that specifies the encoding. In the absence of such information, the ASCII encoding would be implied.

3.8.3. Encoding in Offline Data

In regard to other forms of transfer of data in digital form, such as diskette or CD-ROM, information about encoding is important, too. The problem is typically handled by guesswork. Often the crucial thing is to know which program was used to generate the data, since the text data might be inside a file in, say, the MS Word format, which can only be read by (a suitable version of) MS Word or by a program that knows its internal data format. That format, once recognized, might contain information that specifies the character encoding used in the text data included; or it might not, in which case one has to ask the sender, or make a guess, or use trial and error: viewing the data using different encodings until something sensible appears.

Make sure you write down the encoding and make information about it available along with the data. This could mean a separate document on a CD-ROM, or a note written with a pen on the CD-ROM or its cover, or a sheet of paper you store and send with the data. This may sound trivial, but it is often neglected. It is best to specify the encoding in two ways: by its official name, and by its more widely known informal name. For example: "The files on this diskette are Windows Latin 2 (windows-1250) encoded."

Figure 3-7. Viewing an email message in raw format ("source"), with headers that should indicate the character encoding

3.8.4. Common Choices of Encoding

Some widely used choices of encoding for different languages are presented in Table 3-7, identified by name of language or a name for a collection of languages, as commonly used in menus in programs.
As you can see, some encodings are intended rather specifically for a single language, while some are for a wide group of languages. This depends mostly on character repertoire requirements rather than language family relationships.

For data that may contain a combination of languages, Unicode encodings are usually the best approach, and often the only possibility. You cannot find any widely understood encoding (other than Unicode encodings) that would let you write a plain text file that contains French and Thai, for example. An encoding that supports French accented letters does not support Thai characters, and vice versa.

Some ISO-8859 encodings and their Windows counterparts have been designed to cover a large set of languages. This especially applies to ISO-8859-1 and windows-1252. Such coverage is possible because many European languages use just the basic Latin letters with a small collection of additional letters.

3.8.5. Sources of Information

The following web sites contain useful information on character codes. This means code tables, conversion tables, prose descriptions, usage guidelines, etc.
3.8.6. Exercises

If possible, carry out the following exercises. If the book has been successful in explaining things, each exercise should take just about 10 to 15 minutes and give you some self-confidence and practice.

3.8.6.1. Testing encodings

Use an HTML document with an unspecified character encoding containing all octets in the range 160 through 255, like the document that you can copy from http://www.cs.tut.fi/~jkorpela/chars/test8.htm. View the document in your web browser, using at least two different 8-bit encodings other than ISO 8859-1. (In Internet Explorer, use View → Encoding.) Analyze which encoding your browser uses by default.

3.8.6.2. "Deciphering" text

You have received a text file of unknown origin and in unknown encoding but presumably containing text in English. When you view the file in a Windows environment, with Windows Latin 1 as the default encoding, using Notepad, you see the following:
Can you deduce what the real encoding is and what the content is?
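If you prefer to experiment with a program rather than by switching encodings in a viewer, the trial-and-error approach can be sketched in Python as follows. This is only one possible approach under stated assumptions: the function name candidate_readings and the list of candidate encodings are hypothetical choices for this illustration, and the sample string is invented (it is not the text of the exercise). The sketch also assumes that every character displayed has a code in the encoding used for display, which is not guaranteed for arbitrary data.

```python
def candidate_readings(seen: str, displayed_as: str, candidates: list[str]) -> dict[str, str]:
    """Undo a presumably wrong interpretation and try other encodings.

    seen          text as displayed by the viewing program
    displayed_as  encoding the viewer used (for example "cp1252")
    candidates    encodings to try instead
    """
    # Recover the original octets by reversing the viewer's interpretation.
    raw = seen.encode(displayed_as)
    readings = {}
    for encoding in candidates:
        try:
            readings[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # The octets are not valid in this encoding; skip it.
            continue
    return readings


# Invented sample: UTF-8 encoded text that was displayed as Windows Latin 1.
seen = "na\u00c3\u00afve caf\u00c3\u00a9"

for encoding, text in candidate_readings(seen, "cp1252", ["utf-8", "iso-8859-1"]).items():
    print(encoding, "->", text)
```

A program can only list the candidate readings; deciding which of them is "something sensible" still requires a human reader or some knowledge of the expected language.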