3.8. Using Character Codes

Several factors affect the choice of character encoding for some particular area of application or purpose. The factors range from the nature of use and technical possibilities and limitations to policy decisions and external requirements. In typical situations, however, the choice is relatively simple.

For example, if you live in Sweden and wish to communicate in Swedish, you normally choose ISO 8859-1 for email, web pages, and plain text files. That's what practically all people in Sweden can work with, and it is reasonably acceptable. There are good reasons for wanting a larger character repertoire, for example to get some additional punctuation marks, but they are probably not good enough to justify the potential risks of using Windows Latin 1 or Unicode. On the other hand, as soon as you really need to include words in, say, Eastern European languages, or special technical symbols, you should probably switch to Unicode, normally using the UTF-8 encoding. In that case, you should make a reasonable effort to ensure that the recipients have software that can handle the encoding and the characters you use.

3.8.1. Repertoire Requirements

Each character encoding allows a specific repertoire of characters to be written. Therefore, the set of characters that you need imposes restrictions on the encodings that you can use.

However, as discussed in Chapter 2, in different data formats, there are escape mechanisms that let you enter characters that cannot be written directly in the selected encoding. Thus, if you write a web page in English and may occasionally need an omega character Ω, for example, you can use ISO-8859-1 or even ASCII, since you can represent the special character using the entity reference &Omega;.
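The escape mechanism can be sketched in a few lines of Python. This is only an illustration, not a complete HTML serializer: the function name escape_for_encoding is hypothetical, and the sketch assumes that the text contains no characters that are special in HTML markup (such as & or <), which would need escaping of their own.

```python
from html.entities import codepoint2name


def escape_for_encoding(text: str, encoding: str = "ascii") -> str:
    """Replace characters that `encoding` cannot represent with HTML references.

    Named entity references are used when HTML defines one for the character;
    otherwise a decimal character reference is used.
    Note: markup characters such as & and < are NOT escaped here.
    """
    out = []
    for ch in text:
        try:
            ch.encode(encoding)          # representable directly?
            out.append(ch)
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(out)


# U+03A9 is GREEK CAPITAL LETTER OMEGA.
print(escape_for_encoding("resistance in \u03a9", "iso-8859-1"))
```

With ISO-8859-1 as the target, a character such as é is left as it is, whereas with ASCII as the target it is escaped as well.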

Different languages have rather different requirements on the repertoire. In the section "What's in a Character" in Chapter 1, some sources of information on the character requirements of languages were mentioned. The database at http://www.eki.ee/letter/ can be used to list the characters used in many languages (written in Latin or Cyrillic letters).

The requirements are, however, largely debatable, and they are relative rather than absolute. Does English need é? Most sources don't mention it as a letter of the English alphabet, but it is regarded by many as necessary for correct writing of English texts.

For normal modern English that does not contain special notations, the repertoire of Windows Latin 1 is sufficient, whereas, for example, the ISO Latin 1 repertoire is insufficient (due to the lack of some punctuation marks). This does not mean that you should use the Windows Latin 1 encoding (windows-1252). You can use UTF-8, or ISO-8859-1, or even ASCII, and accept the consequences.
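To see what "insufficient" means in practice, you can test which characters of a sample text fall outside the repertoire of a given encoding. The following Python sketch is one way to do that; the function name unencodable and the sample sentence are made up for illustration, and cp1252 is the name Python uses for windows-1252.

```python
def unencodable(text: str, encoding: str) -> list[str]:
    """Return the distinct characters of `text` that `encoding` cannot represent,
    in order of first appearance."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# English text with curly quotes, a curly apostrophe, an accented e, and an en dash.
sample = "\u201cIt\u2019s a caf\u00e9\u201d \u2013 she said."

for enc in ("ascii", "iso-8859-1", "cp1252", "utf-8"):
    print(enc, [f"U+{ord(ch):04X}" for ch in unencodable(sample, enc)])
```

For this sample, nothing is reported as missing for cp1252 and UTF-8, whereas ISO-8859-1 lacks the quotation marks, the apostrophe, and the dash, and ASCII lacks the é as well.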

3.8.2. Encodings and the Internet

Most important, make sure that any Internet-related software that you use to send data specifies the encoding correctly in suitable headers. There are two things involved: there must be a header that reflects the actual encoding used and the encoding used must be one that is widely understood by the (potential) recipients' software. You often need to make compromises in regard to the latter aim: you may need to use an encoding that is not yet widely supported to get your message through at all.

In principle, you should first determine the character repertoire you need in a document, database, web page, or other context. Then you should proceed to determining the best possible character code and encoding. In practice, things don't quite work that way. We need to consider some widely recognized encodings and choose between them. Some rules of thumb:

In email to people that you do not know, use US-ASCII if possible. If not possible, try to analyze whether the recipient(s) can handle some other encoding. When needed, ask for permission to send non-ASCII data as attachments.
In messages on various international discussion forums, use US-ASCII even if the forum software supports other characters. Check the rules of the forum for other alternatives.
In email to people in a particular cultural environment, or in discussion forums where a language other than English is used, find out what people mostly do there, and do the same. Usually there is one dominant encoding that you should use.
On web pages, try to express yourself in ISO-8859-1. For pages in languages other than English, you could often use some widely understood encoding (such as ISO-8859-2 for Central/East European languages). However, UTF-8 is fairly well supported, too, these days. Use UTF-8, if ISO-8859-1 is not practical and there's no particular reason to use one of the 8-bit encodings for different languages.
In projects and activities where information providers and editors work with different systems and tools, be conservative and try to live with ASCII or ISO 8859-1 or perhaps some other 8-bit code. The reason is that most tools, including simple text editors, can handle such encodings, whereas Unicode encodings often pose problems, and many people do not know how to work with them. Note that some data formats, such as HTML and XML, let you escape from the limitations set by the encoding, and this can be feasible if you need extra characters only rarely.

If you use, say, Outlook Express to send email or to post to Usenet groups, make sure it sends the message in a reasonable form. In particular, make sure it does not send the message as HTML or duplicate it by sending it both as plain text and as HTML (select plain text only). In regard to character encoding, make sure it is something widely understood, such as ASCII, some ISO 8859 encoding, or UTF-8, depending on how large a character repertoire you need.

In particular, avoid sending data in a proprietary encoding (like the Macintosh encoding or a DOS or Windows encoding) to a public network. At the very least, if you do that, make sure that the message heading specifies the encoding! There's nothing wrong with using such an encoding within a single computer or in data transfer between similar computers. But when sent to the Internet, data should be converted to a more widely known encoding by the sending program. If you cannot find a way to configure your program to do that, get another program.
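The conversion that a sending program should perform amounts to decoding the data from the local encoding and encoding it again in the target encoding. A minimal Python sketch, assuming that you know the source encoding and that the target encoding can represent all the characters (otherwise an exception is raised); the function name transcode is hypothetical, and mac_roman and cp850 are Python's names for the Macintosh Roman encoding and a common DOS code page:

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert `data` from the `source` encoding to the `target` encoding.

    Raises UnicodeDecodeError if the data is not valid in `source`, and
    UnicodeEncodeError if `target` lacks some character.
    """
    return data.decode(source).encode(target)


# Simulate data produced on a system that uses the Macintosh encoding.
mac_bytes = "caf\u00e9".encode("mac_roman")

print(transcode(mac_bytes, "mac_roman"))                 # as UTF-8
print(transcode(mac_bytes, "mac_roman", "iso-8859-1"))   # as ISO-8859-1
```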

In email programs, there's typically a "Tools" or "Settings" menu, where you can set things like the format and encoding of outgoing messages. The hard part is to understand what the settings and the options are about, but at this point, you should have most if not all the information needed for that. Having checked the settings, you can test them by sending email to yourself and viewing the "hidden data."

The "hidden data" in an email (or Usenet) message consists of the message headers, and the message body interpreted as plain text. Normally you see that data as formatted by your email program. This includes interpreting the content according to the specified encoding and displaying some (but usually not all) of the information in the headers, such as the sender's name and email address. There are different ways to view such information, or to view just the headers. The ways are not always easy to find; in Outlook Express, for example, you can select the received message and then select File → Properties. In Mozilla Thunderbird, you can use View → Message Source, and you will see the message headers and raw content in a new window, as illustrated in Figure 3-7. The Content-Type header contains a charset parameter that specifies the encoding. In the absence of such information, the ASCII encoding would be implied.
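If you prefer to check the headers programmatically, the following Python sketch uses the standard email package. It is an illustration under simple assumptions: the function name declared_charset is hypothetical, only single-part messages are considered, and the sample message texts are made up.

```python
from email import message_from_string
from email.message import EmailMessage


def declared_charset(raw_message: str) -> str:
    """Return the charset declared in the Content-Type header of a raw message,
    in lowercase, falling back to US-ASCII when nothing is declared."""
    msg = message_from_string(raw_message)
    return msg.get_content_charset() or "us-ascii"


# Composing a message with an explicitly declared encoding.
msg = EmailMessage()
msg["Subject"] = "Test"
msg.set_content("Hej p\u00e5 dig", charset="iso-8859-1")
print(msg["Content-Type"])

# Inspecting raw messages, one with and one without a charset parameter.
with_charset = 'Content-Type: text/plain; charset="ISO-8859-1"\n\nbody\n'
without_charset = "Subject: no declaration\n\nbody\n"
print(declared_charset(with_charset))
print(declared_charset(without_charset))
```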

3.8.3. Encoding in Offline Data

In regard to other forms of transfer of data in digital form, such as diskette or CD-ROM, information about encoding is important, too. The problem is typically handled by guesswork. Often the crucial thing is to know which program was used to generate the data, since the text data might be inside a file in, say, the MS Word format, which can only be read by (a suitable version of) MS Word or by a program that knows its internal data format. That format, once recognized, might contain information that specifies the character encoding used in the text data included; or it might not, in which case one has to ask the sender, or make a guess, or use trial and error: viewing the data using different encodings until something sensible appears.
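The trial-and-error approach can be partly automated. The Python sketch below decodes the same bytes under several candidate encodings so that you can judge by eye which result looks sensible. The function name try_decodings, the candidate list, and the sample text are assumptions made for illustration. Note that most 8-bit encodings accept nearly any byte sequence, so a successful decoding proves nothing by itself.

```python
def try_decodings(data: bytes, candidates: list[str]) -> dict[str, str]:
    """Decode `data` with each candidate encoding.

    Candidates for which the data is invalid, or whose name is unknown,
    are skipped. The result maps encoding name to decoded text.
    """
    results = {}
    for enc in candidates:
        try:
            results[enc] = data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return results


# Simulate a file of unknown origin; here it was really written as windows-1250.
original = "\u017dlu\u0165ou\u010dk\u00fd k\u016f\u0148"   # Czech sample text
data = original.encode("cp1250")

guesses = try_decodings(data, ["utf-8", "iso-8859-1", "iso-8859-2", "cp1250"])
for enc, text in guesses.items():
    print(f"{enc:12} {text!r}")
```

In this run, UTF-8 is ruled out automatically because the bytes are not valid UTF-8, while the remaining candidates all decode without error and only one of them yields readable Czech.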

Make sure you write down the encoding and make information about it available along with the data. This could mean a separate document on a CD-ROM, or a note written with a pen on the CD-ROM or its cover, or a sheet of paper you store and send with the data. This may sound trivial, but it is often neglected. It is best to specify the encoding in two ways: by its official name, and by its more widely known informal name. For example: "The files on this diskette are Windows Latin 2 (windows-1250) encoded."

Figure 3-7. Viewing an email message in raw format ("source"), with headers that should indicate the character encoding

3.8.4. Common Choices of Encoding

Some widely used choices of encoding for different languages are presented in Table 3-7, identified by name of language or a name for a collection of languages, as commonly used in menus in programs.

Table 3-7. Commonly used encodings for some languages
Language(s)	Encodings	Notes
Arabic	iso-8859-6, windows-1256
Armenian	ARMSCII-8
Baltic	iso-8859-4, windows-1257	Latvian, Lithuanian
Central European	iso-8859-2, windows-1250	Czech, Polish...
Chinese	gb2312, hz-gb-2312, big5
Cyrillic	koi8-r, koi8-u, windows-1251	koi8-r: Russian, koi8-u: Ukrainian
Farsi (Persian)	windows-1256, MacFarsi
Georgian	GEOSTD8
Greek	iso-8859-7, windows-1253
Hebrew	iso-8859-8, windows-1255
Japanese	euc-jp, iso-2022-jp, Shift_JIS
Korean	euc-kr, iso-2022-kr
Thai	windows-874, TIS-620
Turkish	iso-8859-9, windows-1254
Vietnamese	windows-1258
Western European	iso-8859-1, windows-1252	English, French, German, Italian...

As you can see, some encodings are intended rather specifically for a single language, while some are for a wide group of languages. This depends mostly on character repertoire requirements rather than language family relationships.

For data that may contain a combination of languages, Unicode encodings are usually the best approach, and often the only possibility. You cannot find any widely understood encoding (other than Unicode encodings) that would let you write a plain text file that contains French and Thai, for example. The encoding that supports French accented letters does not support Thai characters, and vice versa.
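The French and Thai case is easy to verify. The following Python sketch lists which of a few candidate encodings can represent a whole text; the function name encodings_covering and the sample strings are hypothetical, and cp1252 and cp874 are Python's names for windows-1252 and windows-874.

```python
def encodings_covering(text: str, candidates: list[str]) -> list[str]:
    """Return the candidate encodings that can represent every character of `text`."""
    covering = []
    for enc in candidates:
        try:
            text.encode(enc)
        except UnicodeEncodeError:
            continue
        covering.append(enc)
    return covering


candidates = ["iso-8859-1", "cp1252", "cp874", "utf-8"]

french = "d\u00e9j\u00e0 vu"
thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"   # a Thai greeting

print(encodings_covering(french, candidates))
print(encodings_covering(thai, candidates))
print(encodings_covering(french + " " + thai, candidates))
```

Each language alone is covered by its own 8-bit encodings, but for the combined text only the Unicode encoding remains.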

Some ISO-8859 encodings and their Windows counterparts have been designed to cover a large set of languages. This especially applies to ISO-8859-1 and windows-1252. Such coverage is possible due to the fact that many European languages use just the basic Latin letters with a small collection of additional letters.

3.8.5. Sources of Information

The following web sites contain useful information on character codes. This means code tables, conversion tables, prose descriptions, usage guidelines, etc.

Czyborra's site (http://czyborra.com): A widely known site, which contains good concise descriptions and comments. It is rather old, though, and has not been updated for years.
Fileformat.info on charsets (http://www.fileformat.info/info/charset/): This part of the Fileformat.info site contains character tables ("grids") for different encodings, tabular material.
Tex Texin's material (http://www.i18nguy.com/unicode/codepages.html): "Character Sets And Code Pages At The Push Of A Button." This might be called a real portal to detailed information on encodings.

3.8.6. Exercises

If possible, carry out the following exercises. If the book has been successful in explaining things, each exercise should take just about 10 to 15 minutes and give you some self-confidence and practice.

3.8.6.1. Testing encodings

Use an HTML document with an unspecified character encoding containing all octets in the range 160 through 255, like the document that you can copy from http://www.cs.tut.fi/~jkorpela/chars/test8.htm. View the document in your web browser, using at least two different 8-bit encodings other than ISO 8859-1. (In Internet Explorer, use View → Encoding.) Analyze which encoding your browser uses by default.

3.8.6.2. "Deciphering" text

You have got a text file of unknown origin and in unknown encoding but presumably containing text in English. When you view the file in a Windows environment, with Windows Latin 1 as the default encoding, using Notepad, you see the following:

The letters á and ù are the first and last letter of the Greek alphabet and are often used to symbolize beginning and end. In uppercase, they are Á and Ù; in uppercase with stress mark, they are ¢ and ¿.

Can you deduce what the real encoding is and what the content is?