C.4 Declarations | Text Processing in Python

We have seen how Unicode characters are actually encoded, at least briefly, but how do applications know to use a particular decoding procedure when Unicode is encountered? How applications are alerted to a Unicode encoding depends upon the type of data stream in question.

Normal text files do not have any special header information attached to them to explicitly specify type. However, some operating systems (like MacOS, OS/2, and BeOS, and in a more limited sense Windows and Linux) have mechanisms to attach extended attributes to files; increasingly, MIME header information is stored in such extended attributes. If this happens to be the case, it is possible to store MIME header information such as:

 Content-Type: text/plain; charset=UTF-8

Nonetheless, having MIME headers attached to files is not a safe, generic assumption. Fortunately, the actual byte sequences in Unicode files provide a tip to applications. A Unicode-aware application, absent contrary indication, is supposed to assume that a given file is encoded with UTF-8. A non-Unicode-aware application reading the same file will find a file that contains a mixture of ASCII characters and high-bit characters (for multibyte UTF-8 encodings). All the ASCII-range bytes will have the same values as if they were ASCII encoded. If any multibyte UTF-8 sequences were used, those will appear as non-ASCII bytes and should be treated as noncharacter data by the legacy application. This may result in nonprocessing of those extended characters, but that is pretty much the best we could expect from a legacy application (that, by definition, does not know how to deal with the extended characters).
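The ASCII compatibility described above can be checked directly. The following is a minimal sketch; the sample string and the variable names are illustrative assumptions, and decoding with errors="replace" merely stands in for one way a legacy tool might treat high-bit bytes as noncharacter data.

```python
# Sample text with two non-ASCII characters (i-diaeresis and e-acute).
text = "na\u00efve caf\u00e9"
data = text.encode("utf-8")

# ASCII-range characters keep their single-byte ASCII values.
ascii_bytes = bytes(b for b in data if b < 0x80)

# Every byte of a multibyte UTF-8 sequence has its high bit set.
high_bytes = bytes(b for b in data if b >= 0x80)

# One way a legacy, ASCII-only reader might cope: keep the ASCII bytes
# and substitute a placeholder for each byte it cannot interpret.
legacy_view = data.decode("ascii", errors="replace")

print(data)
print(ascii_bytes, high_bytes)
print(legacy_view)
```

Each of the two accented characters occupies two bytes in UTF-8, so the legacy view shows four placeholder characters while the ASCII letters pass through untouched.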

For UTF-16 encoded files, a special convention is followed for the first two bytes of the file. One of the sequences 0xFF 0xFE or 0xFE 0xFF acts as a small header to the file. The choice of header specifies the byte order (endianness) of the data that follows (most common platforms are little-endian and will use 0xFF 0xFE). It was decided that the collision risk of a legacy file beginning with these bytes was small and therefore these could be used as a reliable indicator for UTF-16 encoding. Within a UTF-16 encoded text file, plain ASCII characters will appear every other byte, interspersed with 0x00 (null) bytes. Of course, extended characters will produce non-null bytes and in some cases double-word (4-byte) representations. But a legacy tool that ignores embedded nulls will wind up doing the right thing with the ASCII-range content of UTF-16 encoded files, even without knowing about Unicode.
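As a rough illustration of the two-byte convention, the sketch below inspects the start of a byte string and reports the byte order it implies. The function name sniff_utf16_bom is hypothetical; the codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE constants come from the standard library. It checks only the two UTF-16 sequences discussed here and makes no attempt to recognize other encodings.

```python
import codecs

def sniff_utf16_bom(data):
    """Return 'utf-16-le', 'utf-16-be', or None, judging by the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_LE):   # 0xFF 0xFE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # 0xFE 0xFF
        return "utf-16-be"
    return None

# Build a small little-endian sample: header bytes followed by "Hi".
sample = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

encoding = sniff_utf16_bom(sample)
# Skip the two header bytes before decoding the payload.
text = sample[2:].decode(encoding)

print(encoding, repr(text))
print(sample[2:])  # ASCII letters alternate with 0x00 bytes
```

Passing the plain name "utf-16" to decode() would also consume the header automatically; the explicit version is used here only to make the two-byte check visible.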

Many communications protocols and more recent document specifications allow for explicit encoding specification. For example, an HTTP daemon application (a Web server) can return a header such as the following to provide explicit instructions to a client:

 HTTP/1.1 200 OK
 Content-Type: text/html; charset=UTF-8

Similarly, an NNTP or SMTP/POP3 message can carry a Content-Type: header field that makes explicit the encoding to follow (most likely as text/plain rather than text/html, however; or at least we can hope).
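A client that receives such a Content-Type: field needs to pull the charset parameter out of it. One possible sketch uses the standard library's email.message.Message class, which parses MIME-style header parameters; the helper name charset_from_content_type is an assumption made for illustration.

```python
from email.message import Message

def charset_from_content_type(value, default=None):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lowercase, or `default` if none is given.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/plain"))
print(charset_from_content_type("text/plain", default="us-ascii"))
```

The same helper applies to HTTP responses and to mail or news messages, since all of them use the MIME header syntax.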

HTML and XML documents can contain tags and declarations to make Unicode encoding explicit. An HTML document can provide a hint in a META tag, like:

 <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

However, a META tag should properly take lower precedence than an HTTP header, in a situation where both are part of the communication (but for a local HTML file, such an HTTP header does not exist).
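The precedence rule can be sketched as a small function that consults the HTTP header first and falls back to a META tag only when the header gives no charset. Everything named here (MetaCharsetFinder, choose_encoding, the default of "utf-8") is a hypothetical illustration, not a description of how any particular browser behaves. The raw bytes are decoded as ASCII with replacement purely to locate the tag, which assumes the document uses an ASCII-compatible encoding.

```python
from email.message import Message
from html.parser import HTMLParser

def _charset_of(content_type):
    """Return the lowercase charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

class MetaCharsetFinder(HTMLParser):
    """Remember the charset from the first META HTTP-EQUIV Content-Type tag."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        equiv = (attrs.get("http-equiv") or "").lower()
        content = attrs.get("content")
        if tag == "meta" and equiv == "content-type" and content and self.charset is None:
            self.charset = _charset_of(content)

def choose_encoding(http_content_type, body, default="utf-8"):
    """Prefer the HTTP header's charset; fall back to META, then to a default."""
    if http_content_type:
        charset = _charset_of(http_content_type)
        if charset:
            return charset
    finder = MetaCharsetFinder()
    finder.feed(body.decode("ascii", errors="replace"))
    return finder.charset or default

page = (b'<html><head><META HTTP-EQUIV="Content-Type" '
        b'CONTENT="text/html; charset=ISO-8859-1"></head></html>')

print(choose_encoding("text/html; charset=UTF-8", page))  # header wins
print(choose_encoding(None, page))                        # local file: META used
```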

In XML, the actual document declaration should indicate the Unicode encoding, as in:

 <?xml version="1.0" encoding="UTF-8"?>
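Because the XML declaration itself is written in ASCII-range characters, a reader can inspect the first bytes of a document before committing to a decoder. The sketch below assumes an ASCII-compatible encoding (a UTF-16 document would be recognized by its leading byte-order bytes instead); the name xml_declared_encoding and the regular expression are illustrative assumptions, and a real XML parser performs this detection more thoroughly.

```python
import re

# Match an XML declaration at the very start of the data and capture
# the value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb"""^<\?xml\s+version\s*=\s*["'][^"']*["']\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def xml_declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, lowercased.

    Falls back to `default` when no encoding is declared.
    """
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return default

print(xml_declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?><doc/>'))
print(xml_declared_encoding(b"<?xml version='1.0' encoding='ISO-8859-1'?><doc/>"))
print(xml_declared_encoding(b"<doc/>"))
```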

Other formats and protocols may provide explicit encoding specification by similar means.