10.4. Characters in Protocol HeadersThe original Internet message syntax restricts the character repertoire to ASCII. For most message headers, this does not cause problems, since the headers names are in ASCII, and most header values are code-like notations designed to be writeable in ASCII. There are some exceptions, though, such as the Subject header in email and on Usenet. The header should tell what the message is about, and naturally, it should be in the same language as the message content. The sender and recipient headers (such as From and To) contain Internet email addresses, which are normally in ASCII, but they may contain, as comments, real names of people and organizations. If your real name is Matti Meikäläinen, you would like to have it expressed as such, with the ä's, in the From field of your messages. Such practice is often recommended, but it immediately raises the character problems. Figure 10-9. Sample HTTP headers echoed (in Opera)In practice, if you include non-ASCII data in the message headers, things will usually work, if your program sends your messages by the MIME conventions. The headers will specify the encoding for the message body, and most programs that can handle MIME will apply the conventions to the message headers, too. The headers might even contain, for example, Latin 1 Supplement characters as "raw" 8-bit data by the ISO-8859-1 encoding, naturally assuming that there is a Content-Type header that specifies the encoding. In principle, such methods are not recommended, and they may cause practical problems to some software. Within a country where some 8-bit encoding, such as one of the ISO 8859 family, is widely used, you can probably send email with raw 8-bit data in headings without encountering problems with that. Sending such email to a country where people use dominantly just ASCII may result in unreadable headers, or even make programs crash, because people use software that cannot handle such data. As a consequence, when sending a message in an international group discussion, whether by email or on Usenet, it is safest to use ASCII only in headers, especially in the Subject line. The reason is that when people respond to your message, their messages get the Subject line content from the original message. Although most people's programs can handle MIME properly, sooner or later someone might respond using a program that cannot. It may mess up the Subject line quite a lot. 10.4.1. The Signature Convention May HelpIn some cases, you might avoid the problem by using a simplified version of the spelling of your name (e.g., From: Matti Meikalainen <mm@fi.example>) and specify the correct version in a signature. A signature, or "sig," is a short piece of text (recommended maximum length is four lines) appended automatically at the end of the email and Usenet messages that you send. It is preceded by a "sig separator," namely two hyphen-minus characters and one space "-- " on a line of its own. For example: -- Matti Meikäläinen freelance generalist Programs may treat signatures in a special way, distinguishing them from the message body proper. By the protocols, however, a signature is part of the body and may contain non-ASCII characters the same way and under the same conditions as the content. 10.4.2. The Q EncodingThe Q encoding is a general mechanism for overcoming the limitation to ASCII in Internet message headers. Technically, it means that the headers do not crash anything that expects ASCII only, since all octets are in the ASCII range. However, programs are expected to interpret some patterns as indicating a particular character encoding. In that case, part of the heading is to be interpreted according to that encoding. The Q encoding resembles the QP encoding discussed in Chapter 3 but differs from it in a few essential ways:
Thus, the general format is: =?encoding?Q?data?= For example, if you send email (on a MIME enabled program) and specify the recipient name as Matti Meikäläinen, the program will generate a header like the following: To: =?ISO-8859-1?Q?Matti_Meik=E4l=E4inen?= <mm@fi.example> A recipient who uses an old program that cannot handle MIME will see the name literally that way, but more likely, the recipient's program will interpret the Q encoding and display the name correctly. Here, as usual, things may fail if the recipient's program cannot handle the character encoding used, but ISO-8859-1 will probably work fine. 10.4.3. The B EncodingThe B encoding is similar to the Q encoding but uses Base64 encoding for the data. Since that encoding was described in Chapter 6, we will only give an example here: Subject: =?UTF-8?B?VMOkbcOkIG9uIMK1LXRlc3RpIGphIM6jLXRlc3Rp?= The point is that although modern software recognizes this and decodes the data, it is completely illegible without such decoding. A recipient who is not familiar with encodings might not even realize that there is some sensible data involved. 10.4.4. Summary: Dealing with Non-ASCII Characters in HeadersIf it seems that you need to use characters other than ASCII in email or Usenet messages, you can choose between the following options:
Some programs like Outlook Express can be used both for email and for posting to Usenet ("newsgroups"), and they have partly separate settings for these two types of use. You could for example allow 8-bit characters in headers when posting to Usenet but disallow them in email. It is not possible to give a comprehensive presentation of the ways that email programs should be configured and used with regards to character encoding. The discussion in this section is meant to present the basics for an analysis of the various settings that are available in each program. The bottom line is that anything beyond ASCII in message headers may cause problems, though modern email programs usually understand whatever another modern email program sends. |