10.1. Information About Encoding

When data is sent over the Internet, it needs to be encoded into digital format, ultimately as octets and bits. If the recipient program does not know the overall format (i.e., how the data has been encoded), it needs to make guesses, or it might simply fail to do anything sensible with the data. A sequence of octets could also be intended to represent something other than character data: an image in a bitmap format, a computer program in binary form, or numeric data in the internal format used in computers.

Moreover, if the data is text, the recipient needs to know the character encoding (i.e., how the octets are to be mapped to characters). If you only look at an octet sequence, you cannot even know whether each octet represents one character, or just part of a two-octet representation of a character, or something more complicated. Sometimes the recipient can guess the encoding, but data processing and transfer shouldn't be guesswork.
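
The ambiguity is easy to demonstrate. The following Python sketch (the two octet values are chosen purely for illustration) interprets the same octets in three ways; nothing in the octets themselves says which interpretation was intended.

```python
# Two octets as they might arrive over the network.
octets = bytes([0xC3, 0xA9])

as_utf8 = octets.decode("utf-8")            # one character, represented by two octets
as_latin1 = octets.decode("iso-8859-1")     # two characters, one octet each
as_number = int.from_bytes(octets, "big")   # a 16-bit unsigned integer

print(as_utf8)     # é
print(as_latin1)   # Ã©
print(as_number)   # 50089
```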

Information about the overall format and the character encoding should normally be included in Internet message headers. The headers contain other information, too. The MIME specification defines how the format, the encoding, and other information pertaining to character representation are expressed in Internet message headers. In particular, when non-ASCII data is sent by email, there should be a header that says that MIME is used in the first place (as opposed to old email formats, where ASCII was implied) and a header that indicates the data transmission method. For example:

MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8bit

The header Content-Transfer-Encoding: 8bit indicates that the octets representing the data (in this case, in the ISO 8859-15 encoding) are transmitted as such, as 8-bit quantities. The original design of Internet email postulated the use of 7-bit quantities only. Most email software can handle 8-bit quantities nowadays, but the octets can be encoded using 7-bit quantities when needed.
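
As a rough illustration of that last point, the sketch below uses Python's standard quopri and base64 modules to re-encode 8-bit octets so that only 7-bit values are transmitted. The sample text is an assumption made for the example; Quoted Printable and Base64 are the usual MIME transfer encodings for this purpose.

```python
import base64
import quopri

# The euro sign becomes the single octet 0xA4 in ISO-8859-15.
octets = "Price: 5 €".encode("iso-8859-15")

qp = quopri.encodestring(octets)   # Quoted Printable: mostly readable
b64 = base64.b64encode(octets)     # Base64: compact but unreadable

print(qp)    # b'Price: 5 =A4'
print(b64)

# Both results consist of 7-bit values only, and both are reversible.
print(max(qp) < 128, max(b64) < 128)
print(quopri.decodestring(qp) == octets, base64.b64decode(b64) == octets)
```

A message encoded this way would carry Content-Transfer-Encoding: quoted-printable or Content-Transfer-Encoding: base64 instead of 8bit.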

10.1.1. What Happens Without Information About Encoding

Because of default settings, you might work with computers and the Internet for quite a while without ever worrying about formats and encodings. Suppose that you use just English, or some other language of Western European origin, like Spanish. When you send email, your email program probably sends your message as plain text encoded in ASCII, ISO-8859-1, or windows-1252 (Windows Latin 1). The program may also automatically include a header that specifies the format and the encoding. A recipient's email program will often find that header and act accordingly, without bothering its user with any technicalities. In the absence of the header, the program will probably interpret the data as plain text in windows-1252 encoding, and get it right. (ASCII and ISO-8859-1 encoded data gets interpreted correctly when interpreted as windows-1252; see Chapter 3.)

Problems arise when defaults clash with each other. Suppose that you send email to Russia. Even if your message is in English, you might use some non-ASCII characters, such as curved quotation marks, dashes, or symbols like µ or €. Your email program might therefore decide to send the message as ISO-8859-1 or windows-1252 encoded. If it does not declare the encoding, or if the recipient's program does not use the declaration, the odds are that the recipient sees the non-ASCII characters wrong. Some Cyrillic letters or some special characters (but not the right ones) would appear when your message is interpreted according to one of the 8-bit encodings commonly used in Russia. This all works in the opposite direction, too. Someone writing in English in Russia might use the character № to mean "number" (incorrectly but understandably thinking that the symbol is used in English, too), but when his email program sends it in, for example, windows-1251 (Windows Cyrillic) encoding and your email program interprets it as windows-1252, you will see the symbol as ¹.
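
Both directions of this clash can be reproduced in a few lines of Python. This is only a sketch: the sample strings are invented, and windows-1251 stands in for whatever 8-bit Cyrillic encoding the other party happens to use.

```python
# Sent from Russia as windows-1251, read in the West as windows-1252.
sent_east = "№ 5".encode("windows-1251")
seen_west = sent_east.decode("windows-1252")
print(seen_west)   # ¹ 5

# Sent from the West as windows-1252, read in Russia as windows-1251.
sent_west = "10 €".encode("windows-1252")
seen_east = sent_west.decode("windows-1251")
print(seen_east)   # 10 Ђ
```

In both cases the octets arrive intact; only the interpretation differs.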

It is easy to guess wrong and never realize the truth, if the wrong guess affects a few characters only. This may happen when non-ASCII characters appear only rarely. It also happens when some commonly used encodings are rather similar to each other but not the same. For example, ISO-8859-1 and ISO-8859-15 differ in a few positions only. If you get a lump of data and notice that it looks ISO-8859-1 encoded, you might be quite happy even if the encoding is in fact ISO-8859-15. However, the data that you pass forward or print or otherwise process might contain some wrong characters. For example, octet A8 (hexadecimal) means the dieresis ¨ in ISO-8859-1, and since the dieresis has so little use as a separate character, the texts you look at probably don't contain it. One day, however, the data you get might contain that octet and you would see it as the dieresis, wondering what it means. If the encoding is in fact ISO-8859-15, the octet should be taken as meaning the letter š.
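
To see exactly how few positions are involved, one can decode every possible octet under both encodings and list the ones that disagree. The sketch below relies on the ISO-8859-1 and ISO-8859-15 codecs shipped with Python.

```python
differences = []
for value in range(256):
    octet = bytes([value])
    latin1 = octet.decode("iso-8859-1")
    latin9 = octet.decode("iso-8859-15")
    if latin1 != latin9:
        differences.append((value, latin1, latin9))

# Print each differing position as: hex octet, ISO-8859-1 character, ISO-8859-15 character.
for value, latin1, latin9 in differences:
    print(f"{value:02X}  {latin1}  {latin9}")

print(len(differences))   # 8
```

Text that never uses these eight octets looks identical under both encodings, which is why a wrong guess can go unnoticed for a long time.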

When very different encodings are implied by a sending program and a receiving program, the user will immediately see that there is something wrong. If you send Spanish text (using all accents correctly) to Russia and the recipient's program interprets it according to some encoding commonly used in Russia, all non-ASCII letters will appear replaced by Cyrillic letters. If someone sends you a message in Japanese, using one of the encodings commonly used in Japan for Japanese text, and your program interprets it according to windows-1252, the result will be completely illegible even if you read Japanese fluently.

When there is no information about encoding or the information is wrong, the user often has to try to set her program to show the data according to different encodings to find the right one. We discussed this in Chapter 1, but most users do not know about such features or they have problems using them. It is difficult to find information about these features in the documentation of most programs.
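
The trial-and-error procedure that a user performs through a program's menus can be imitated in code. The following is a sketch under stated assumptions: the sample text, the helper name try_encodings, and the list of candidate encodings are all invented for the illustration, and a human still has to judge which result looks right.

```python
def try_encodings(octets, encodings):
    """Decode the same octets under each candidate encoding (hypothetical helper)."""
    results = {}
    for name in encodings:
        # errors="replace" substitutes U+FFFD for octets that are invalid in an encoding.
        results[name] = octets.decode(name, errors="replace")
    return results

original = "日本語のメッセージ"
received = original.encode("shift_jis")   # arrives without any encoding information

candidates = ["windows-1252", "windows-1251", "euc-jp", "shift_jis", "utf-8"]
results = try_encodings(received, candidates)

for name, text in results.items():
    print(f"{name:14} {text}")
```

Only the shift_jis line reproduces the original text; the others show various kinds of garbage.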

10.1.2. Approaches to Specifying the Encoding

For reliable data transmission, a platform-independent method of specifying the general format and the encoding and other relevant information is needed. Such methods exist, although they are not always used widely enough. People still send each other data without specifying the encoding, and this may cause a lot of harm.

Attaching a human-readable note, such as a few words of explanation in an email message body, is better than nothing. You could write, for example: "The enclosed attachment contains the report you asked for, as plain text, in ISO-8859-1 encoding."

Before the Web, FTP (File Transfer Protocol) servers were used to make documents available on the Internet, and they are still in some use. In FTP, there is no way to indicate the format of documents at the protocol level, except by distinguishing between text ("ASCII") files and all other files, collectively called binary files. It is therefore common and advisable to include a text file in a directory on an FTP server so that this file, conventionally named README.TXT, contains a list of all files in the directory. That's a suitable place for explaining not only the content and purpose of each file, but also the file formats and character encodings.

However, since data is processed by programs that cannot understand such notes, the encoding should be specified in a standardized computer-readable form whenever possible. Ideally, computers would do this automatically when sending data, so that people would not need to know anything about it, unless they are computer specialists who work on technologies that make such things possible. In the real world, many people need to know something about the internals of sending information about encoding.

Thus, in most Internet contexts, the normal and recommended approach is to specify the encoding of data in a formalized manner, in a format that can easily be processed by programs. Usually, Internet message headers are used for the purpose.

10.1.3. Practical Recommendations

Most important, make sure that any Internet-related software that you use to send data (such as an email program) specifies the encoding correctly in suitable headers. There are two things involved:

The header must be present and it must reflect the actual encoding used.
The encoding used must be one that is widely understood by the (potential) recipients' software.

Figure 10-1. Normal view of an incoming email message in Thunderbird
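
The first requirement can be spot-checked mechanically. The sketch below uses Python's standard email package; the function name declared_charset_fits is hypothetical, and the check handles single-part messages only. It tests whether a charset is declared and whether the body is at least valid in that encoding.

```python
from email import message_from_bytes

def declared_charset_fits(raw_message: bytes) -> bool:
    """Return True if a charset is declared and the body decodes under it."""
    msg = message_from_bytes(raw_message)
    charset = msg.get_content_charset()
    if charset is None:
        return False                      # no encoding declared at all
    body = msg.get_payload(decode=True)   # undoes any transfer encoding
    if body is None:
        return False                      # multipart message: not handled in this sketch
    try:
        body.decode(charset)              # strict decoding: an invalid octet raises
    except (UnicodeDecodeError, LookupError):
        return False                      # invalid octets or unknown charset name
    return True

good = b"Content-Type: text/plain; charset=UTF-8\n\ncaf\xc3\xa9\n"
wrong = b"Content-Type: text/plain; charset=UTF-8\n\ncaf\xe9\n"   # body is really ISO-8859-1
missing = b"Subject: test\n\ncaf\xe9\n"

print(declared_charset_fits(good), declared_charset_fits(wrong), declared_charset_fits(missing))
```

Such a check catches only gross mismatches. Any octet sequence is valid in an encoding like ISO-8859-1, so a message wrongly labeled as ISO-8859-1 would pass.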

You often need to make compromises with regard to the latter aim: you may need to use an encoding that is not yet universally supported to get your message through at all. In practice, this mainly means that you may need to use UTF-8, even though not all email programs can handle it in incoming mail. In Chapter 3, we described some of the commonly available encodings and their suitability. ASCII is safe, ISO 8859 encodings are safe in many contexts (in communication between people who belong to the same language community), and UTF-8 is usually the best approach when you need a wide repertoire of characters.

Typically, you should check the headers sent by a program when you first use it, or the first time you intend to send anything but ASCII characters. We discussed this in Chapter 1. However, you should also check that the message has appropriate headers, rather than merely looking right by accident.

10.1.4. Looking at the Headers

When you view an incoming message normally, as in Figure 10-1, you see just the content, not the headers. However, some information extracted from the headers may appear; e.g., the "Subject" and "From" information has been taken from them.

Using some program-dependent method, you can change the display of an incoming email message so that all the message headers become visible. In Thunderbird, you would just click on the small box containing + at the start of the line with the Subject of the message, right above the message itself. The headers then appear before the message, as shown in Figure 10-2, and the content of the box changes to the minus sign, - (meaning that if you click on it, the headers are removed from the display).

Figure 10-2. View of an incoming email message with headers in Thunderbird

The structure of the email message headers, or MIME headers, will be discussed later in this chapter. Here it suffices to note that the last three headers specify the following:

The message is in MIME format (specifically, in MIME Version 1.0).
The content is in plain text format, UTF-8 encoded, and it is subject to a specific convention expressed by format=flowed (which says that the message may be reformatted for display according to certain rules, as opposed to having a fixed line structure).
The encoded (UTF-8) content is transferred directly as octets, instead of applying any particular transfer encoding such as Quoted Printable (see Chapter 6).
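
A program extracts the same three pieces of information by parsing the headers. In the sketch below, the header lines are an assumption reconstructed from the description above (the figure itself is not reproduced here), and Python's standard email package does the parsing.

```python
from email import message_from_string

# Hypothetical header lines matching the three items described above.
headers = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8; format=flowed\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
)

msg = message_from_string(headers)
summary = {
    "mime_version": msg["MIME-Version"],
    "media_type": msg.get_content_type(),
    "charset": msg.get_content_charset(),    # returned in lowercase
    "format": msg.get_param("format"),
    "transfer_encoding": msg["Content-Transfer-Encoding"],
}

for key, value in summary.items():
    print(f"{key}: {value}")
```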

Alternatively, when viewing a message in Thunderbird, you could select View → Message Source, or simply type Ctrl-U, to see the message as "source," or as "raw format." This means that the message is displayed as transmitted on the network and as received by an email program. This format contains first the headers, then a blank line, and then the message itself. Our test message is shown as "source" in Figure 10-3. In this case, the characters in the message are displayed as such, but if some special transfer encoding (such as Quoted Printable) had been used, they would appear as "raw," in the encoded form.
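
The difference between the "raw" and the decoded form can be sketched with a small invented message that uses Quoted Printable. The message text is an assumption for the example, not the test message shown in the figures.

```python
from email import message_from_string

source = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"                        # the blank line ends the headers
    "caf=C3=A9 au lait\n"
)

# The "source" view: headers, blank line, body exactly as transmitted.
header_block, _, raw_body = source.partition("\n\n")
print(raw_body)                 # caf=C3=A9 au lait

# The normal view: transfer encoding undone, then octets mapped to characters.
msg = message_from_string(source)
decoded_octets = msg.get_payload(decode=True)
text = decoded_octets.decode(msg.get_content_charset())
print(text)                     # café au lait
```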

Figure 10-3. View of email message "source" in Thunderbird

Other programs have different methods for making the headers visible or viewing message "source" or "raw format." Typically, the relevant commands are in a "File" menu or in a "View" menu. In Outlook Express, for example, you can normally use File → Properties to access both the source and the raw format.

To test that your email program behaves well, you could send a message with several special characters to a friend who works in a completely different environment (e.g., a Linux or Mac environment, if you use Windows) and ask her to forward the message back to you. Of course, if something goes wrong, you will not immediately see whether the problem is in your system or in hers. However, the headers will help in analyzing the situation.