10.4. Characters in Protocol Headers

The original Internet message syntax restricts the character repertoire to ASCII. For most message headers, this does not cause problems, since the header names are in ASCII, and most header values are code-like notations designed to be writable in ASCII.

There are some exceptions, though, such as the Subject header in email and on Usenet. The header should tell what the message is about, and naturally, it should be in the same language as the message content. The sender and recipient headers (such as From and To) contain Internet email addresses, which are normally in ASCII, but they may contain, as comments, real names of people and organizations. If your real name is Matti Meikäläinen, you would like to have it expressed as such, with the ä's, in the From field of your messages. Such practice is often recommended, but it immediately raises character encoding problems.

Figure 10-9. Sample HTTP headers echoed (in Opera)

In practice, if you include non-ASCII data in the message headers, things will usually work, provided that your program sends your messages according to the MIME conventions. The headers will specify the encoding for the message body, and most programs that can handle MIME will apply the conventions to the message headers, too. The headers might even contain, for example, Latin 1 Supplement characters as "raw" 8-bit data in the ISO-8859-1 encoding, naturally assuming that there is a Content-Type header that specifies the encoding.

In principle, such methods are not recommended, and they may cause practical problems to some software. Within a country where some 8-bit encoding, such as one of the ISO 8859 family, is widely used, you can probably send email with raw 8-bit data in headers without encountering problems. Sending such email to a country where people predominantly use ASCII only may result in unreadable headers, or even make programs crash, because people use software that cannot handle such data.

As a consequence, when sending a message in an international group discussion, whether by email or on Usenet, it is safest to use ASCII only in headers, especially in the Subject line. The reason is that when people respond to your message, their messages get the Subject line content from the original message. Although most people's programs can handle MIME properly, sooner or later someone might respond using a program that cannot. It may mess up the Subject line quite a lot.

10.4.1. The Signature Convention May Help

In some cases, you might avoid the problem by using a simplified version of the spelling of your name (e.g., From: Matti Meikalainen <mm@fi.example>) and specify the correct version in a signature. A signature, or "sig," is a short piece of text (recommended maximum length is four lines) appended automatically at the end of the email and Usenet messages that you send. It is preceded by a "sig separator," namely two hyphen-minus characters and one space "-- " on a line of its own. For example:

-- 
Matti Meikäläinen
freelance generalist

Programs may treat signatures in a special way, distinguishing them from the message body proper. By the protocols, however, a signature is part of the body and may contain non-ASCII characters the same way and under the same conditions as the content.
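
The sig separator convention is simple enough to sketch in code. The following is a minimal illustration, not a description of how any particular mail program works: the function name split_signature is hypothetical, and the choice to split at the last separator line (rather than the first) is an assumption made for this sketch.

```python
SIG_SEPARATOR = "-- "  # two hyphen-minus characters and one space


def split_signature(body: str) -> tuple[str, str]:
    """Split a message body into (content, signature).

    Hypothetical helper: looks for the last line that consists of
    exactly the sig separator. Returns an empty signature if none.
    """
    lines = body.split("\n")
    for i in range(len(lines) - 1, -1, -1):
        if lines[i] == SIG_SEPARATOR:
            return "\n".join(lines[:i]), "\n".join(lines[i + 1:])
    return body, ""


message = "See you on Monday.\n-- \nMatti Meikäläinen\nfreelance generalist"
content, signature = split_signature(message)
print(content)
print(signature)
```

Note that the comparison is exact, so a line containing only "--" without the trailing space is not treated as a separator here.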

10.4.2. The Q Encoding

The Q encoding is a general mechanism for overcoming the limitation to ASCII in Internet message headers. Technically, it means that the headers do not crash anything that expects ASCII only, since all octets are in the ASCII range. However, programs are expected to interpret some patterns as indicating a particular character encoding. In that case, part of the header is to be interpreted according to that encoding. The Q encoding resembles the QP encoding discussed in Chapter 3 but differs from it in a few essential ways:

Q encoding may be applied in a part of text (header) only.
A Q encoded part starts with the characters =? and ends with ?=.
The initial =? is followed by the name of the encoding and the string ?Q?.
In the data that follows, an octet (to be interpreted in the encoding specified) can be represented as =xx, where xx is its numeric value in hexadecimal. The octet 20 (corresponding to space in ASCII) may also be represented as _ (underline). Octets that correspond to printable ASCII characters, except the space and the characters =, ?, and _, may also be represented as those characters.

Thus, the general format is:

=?encoding?Q?data?=

For example, if you send email (with a MIME-enabled program) and specify the recipient name as Matti Meikäläinen, the program will generate a header like the following:

To: =?ISO-8859-1?Q?Matti_Meik=E4l=E4inen?= <mm@fi.example>

A recipient who uses an old program that cannot handle MIME will see the name literally that way, but more likely, the recipient's program will interpret the Q encoding and display the name correctly. Here, as usual, things may fail if the recipient's program cannot handle the character encoding used, but ISO-8859-1 will probably work fine.
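
To make the rules concrete, here is a minimal sketch of a Q encoder and decoder that follows the description above. The function names q_encode and q_decode are hypothetical, and the set of characters left unencoded is a deliberately conservative choice for this sketch. It also ignores practical details such as limits on the length of an encoded part.

```python
# Characters written literally in this sketch: letters, digits, and a few marks.
SAFE = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789!*+-/"
)


def q_encode(text: str, charset: str = "ISO-8859-1") -> str:
    """Return text as one Q encoded part: =?encoding?Q?data?="""
    parts = []
    for octet in text.encode(charset):
        ch = chr(octet)
        if octet == 0x20:
            parts.append("_")              # space may be written as underline
        elif ch in SAFE:
            parts.append(ch)               # safe printable ASCII stays as is
        else:
            parts.append(f"={octet:02X}")  # everything else becomes =xx
    return f"=?{charset}?Q?{''.join(parts)}?="


def q_decode(word: str) -> str:
    """Decode a single Q encoded part back to a string."""
    if not (word.startswith("=?") and word.endswith("?=")):
        raise ValueError("not an encoded part")
    charset, kind, data = word[2:-2].split("?", 2)
    if kind.upper() != "Q":
        raise ValueError("not Q encoding")
    octets = bytearray()
    i = 0
    while i < len(data):
        if data[i] == "=":                 # =xx: one octet in hexadecimal
            octets.append(int(data[i + 1:i + 3], 16))
            i += 3
        else:                              # _ means space; others literal
            octets.append(0x20 if data[i] == "_" else ord(data[i]))
            i += 1
    return octets.decode(charset)


encoded = q_encode("Matti Meikäläinen")
print(encoded)            # =?ISO-8859-1?Q?Matti_Meik=E4l=E4inen?=
print(q_decode(encoded))  # Matti Meikäläinen
```

In real programs you would normally rely on a mail library rather than hand-written code like this; the sketch is only meant to show that the encoded form contains nothing but ASCII octets.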

10.4.3. The B Encoding

The B encoding is similar to the Q encoding but uses Base64 encoding for the data. Since that encoding was described in Chapter 6, we will only give an example here:

Subject: =?UTF-8?B?VMOkbcOkIG9uIMK1LXRlc3RpIGphIM6jLXRlc3Rp?=

The point is that although modern software recognizes this and decodes the data, it is completely illegible without such decoding. A recipient who is not familiar with encodings might not even realize that there is some sensible data involved.
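
For illustration, the Subject value above can be decoded with a few lines of Python. This is a sketch using the standard library modules base64 and email.header; the variable names are arbitrary. It shows two routes: taking the encoded part apart by hand, and letting the library do the work.

```python
import base64
from email.header import decode_header, make_header

header_value = "=?UTF-8?B?VMOkbcOkIG9uIMK1LXRlc3RpIGphIM6jLXRlc3Rp?="

# Manual route: strip the =? and ?= delimiters, then split into the
# encoding name, the encoding letter, and the Base64 data.
charset, letter, data = header_value[2:-2].split("?")
if letter.upper() != "B":
    raise ValueError("not B encoding")
manual = base64.b64decode(data).decode(charset)

# Library route: decode_header finds the encoded parts, and make_header
# combines them into a single header object that converts to a string.
library = str(make_header(decode_header(header_value)))

print(manual)   # Tämä on µ-testi ja Σ-testi
print(library)  # same text
```

The decoded text mixes characters that no single ISO 8859 encoding covers, which is presumably why the example uses UTF-8.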

10.4.4. Summary: Dealing with Non-ASCII Characters in Headers

If it seems that you need to use characters other than ASCII in email or Usenet messages, you can choose between the following options:

Use ASCII only: This avoids the technical problems but creates problems in human communication. Consider how understandable the data is when mapped to ASCII (e.g., replacing ä with "a," or maybe "ae"; see the section "Escape sequences" in Chapter 2). This is often the only feasible approach in international discussion groups, worldwide email distribution lists, etc.
Use Q encoding: Modern software often applies Q encoding automatically, if you include non-ASCII characters in headers. This is usually adequate when sending messages in a culturally homogeneous environment where the languages normally used need non-ASCII characters, so that most people have MIME-capable software.
Use B encoding: This is hardly useful, since it normally has no significant benefits over Q encoding but serious drawbacks: when presented as such, B encoded data is illegible. Some programs use B encoding by default, at least in some situations.
Use 8-bit characters in headers: If the program you use has an option for sending 8-bit characters in headers, this means that it uses octets larger than 7F there, too, e.g., passing ISO-8859-1 data as such. This is risky but sometimes works better than Q encoding; for example, some Usenet software ("newsreaders") can deal with 8-bit data but cannot decode Q encoding. To use this feature, you would simply select that option, but remember that it will remain in effect until you change it.
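
The first three options can be compared side by side with a short Python sketch. The helper names ascii_fallback and encode_header are hypothetical, and the use of UTF-8 as the default encoding is an assumption of this sketch. The ASCII fallback shown simply drops diacritic marks; characters that have no such decomposition are lost entirely, so a real mapping would need a hand-made table.

```python
import unicodedata
from email.charset import BASE64, QP, Charset
from email.header import Header


def ascii_fallback(text: str) -> str:
    """Map text to ASCII by dropping diacritic marks (lossy by design)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def encode_header(text: str, method: str, charset: str = "utf-8") -> str:
    """Encode text for a header with Q encoding ("Q") or B encoding ("B")."""
    cs = Charset(charset)
    cs.header_encoding = QP if method == "Q" else BASE64
    return Header(text, cs).encode()


name = "Matti Meikäläinen"
print(ascii_fallback(name))      # Matti Meikalainen
print(encode_header(name, "Q"))  # ASCII letters remain readable
print(encode_header(name, "B"))  # illegible without decoding
```

Printing the Q and B results next to each other shows the difference in legibility that the list above describes: in the Q encoded form, the ASCII letters of the name are still recognizable, whereas the B encoded form is not.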

Some programs like Outlook Express can be used both for email and for posting to Usenet ("newsgroups"), and they have partly separate settings for these two types of use. You could for example allow 8-bit characters in headers when posting to Usenet but disallow them in email.

It is not possible to give a comprehensive presentation of the ways that email programs should be configured and used with regard to character encoding. The discussion in this section is meant to present the basics for an analysis of the various settings that are available in each program. The bottom line is that anything beyond ASCII in message headers may cause problems, though modern email programs usually understand whatever another modern email program sends.