Choosing an Encoding | Effective XML: 50 Specific Ways to Improve Your XML

Given that you've wisely chosen to use Unicode for your documents, the next question is which encoding of Unicode to pick. Unicode is a character set that assigns almost 100,000 characters to different numeric code points. The characters assigned code points from 0 to 65,535 are sometimes referred to as Plane 0 or the Basic Multilingual Plane (BMP for short). The BMP includes most common characters from most of the world's living languages including the Roman alphabet, Cyrillic, Arabic, Greek, Hebrew, Hangul, the most common Han ideographs, and many more. Plane 1, spanning code points 65,536 to 131,071, includes musical notation, many mathematical symbols, and several dead languages such as Old Italic. Plane 2 (code points 131,072 to 196,607) adds many less common Han ideographs. Plane 14 includes language tags XML developers can safely ignore. (They should use the xml:lang attribute instead.) The other planes are as yet unpopulated.
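As a small illustration of how planes relate to code points, the sketch below computes the plane of a character by dividing its code point by 65,536. The helper name `plane` is hypothetical and chosen only for this example.

```python
def plane(ch: str) -> int:
    """Return the Unicode plane number of a single character."""
    # Each plane holds 65,536 (0x10000) code points, so the plane
    # number is the code point shifted right by 16 bits.
    return ord(ch) >> 16


# A Latin letter, a Han ideograph, a musical symbol, and a rare ideograph.
for ch in ["A", "\u6728", "\U0001D11E", "\U00020000"]:
    print(f"U+{ord(ch):04X} is in plane {plane(ch)}")
```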

Unicode does not specify a single unique binary representation for any of the code points in any of the planes, however. They can be encoded as four-byte big-endian integers, four-byte little-endian integers, or in some more complex but efficient way. Indeed, there are several common encodings of the Unicode character set in practical use today. However, only two are worthy of serious consideration, UTF-8 and UTF-16. UTF-8 should be the default choice for most documents that don't contain large amounts of Chinese, Japanese, or Korean text. Documents that do contain significant amounts of Chinese, Japanese, or Korean should be encoded in UTF-16.
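To make the point concrete, the following sketch encodes one code point (U+00E9, the letter é, picked arbitrarily) in several encodings using Python's built-in codecs. The four-byte forms correspond to what Python calls `utf-32-be` and `utf-32-le`.

```python
ch = "\u00e9"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

encodings = ["utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"]

# Map each encoding name to the bytes it produces for the same character.
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")
```

The same abstract code point yields five different byte sequences, which is why a document has to declare or signal which encoding it uses.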

UTF-8

UTF-8 (Unicode Transformation Format, 8-bit encoding form) is a very clever encoding that uses different numbers of bytes for different characters. Characters in the ASCII range (0–127) occupy one byte each. Characters from 128 to 2,047 occupy two bytes each. The rest of the characters in Plane 0 occupy three bytes each. Finally, all the other characters in Planes 1 through 16 occupy four bytes each.
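The length rule can be sketched as a small function and checked against Python's own UTF-8 encoder. The function name `utf8_length` is hypothetical. The thresholds follow the UTF-8 definition, in which the two-byte range ends at code point 2,047 (0x7FF).

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a given code point."""
    if code_point < 0x80:       # 0 to 127: ASCII
        return 1
    if code_point < 0x800:      # 128 to 2,047
        return 2
    if code_point < 0x10000:    # rest of Plane 0
        return 3
    return 4                    # Planes 1 and above


# Compare against the real encoder at each range boundary.
# (Surrogate code points 0xD800 to 0xDFFF are skipped: they are not characters.)
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
for cp in boundaries:
    actual = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:06X}: predicted {utf8_length(cp)}, actual {actual}")
```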

This scheme has a number of useful characteristics. First among them is that UTF-8 is a strict superset of ASCII. Every ASCII text file is also a legal UTF-8 document. This makes UTF-8 much more compatible with the installed software base, especially when working primarily in English. Also important is that none of the ASCII characters ever appear as parts of another character. That is, when you encounter a byte with a value between 0 and 127, it is always an ASCII character.
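Both properties are easy to check. The sketch below, using arbitrary sample strings, confirms that ASCII text encodes to identical bytes in ASCII and UTF-8, and that every byte of a multibyte UTF-8 character has a value of 128 or above.

```python
ascii_text = '<root id="1">plain text</root>'

# ASCII text is byte-for-byte identical when encoded as UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# U+00E9 (two bytes), U+6728 (three bytes), U+1D11E (four bytes).
multibyte = "\u00e9\u6728\U0001D11E".encode("utf-8")

# No byte of a multibyte character falls in the ASCII range 0 to 127.
no_ascii_inside = all(b >= 0x80 for b in multibyte)

print(same_bytes, no_ascii_inside)
```

This is what lets byte-oriented tools search UTF-8 text for ASCII delimiters such as `<` without false matches inside other characters.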

Second, UTF-8 is byte order independent. A UTF-8 document on a big-endian UNIX system is byte-for-byte identical to the same document on a little-endian Windows system. Byte order marks are not necessary, though they are allowed.

Third, the particular encoding scheme used is such that by looking at any one byte, a program can determine, based on that byte alone, whether it's the only byte of a single-byte character, the first byte of a two-, three-, or four-byte character, or a continuation byte in the middle of a multibyte character. The character boundaries can be inferred from the bytes alone. You do not need to start at the beginning of a stream or a file to read text from the middle. By skipping at most three continuation bytes from any position in the file, a program can align itself on the character boundaries.
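A minimal sketch of this resynchronization follows. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, while single-byte characters and lead bytes never do. The function name `next_boundary` is hypothetical.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos."""
    # Continuation bytes look like 10xxxxxx; skip them until we reach
    # a single-byte character or the lead byte of a multibyte character.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "\u00e9\u6728".encode("utf-8")  # c3 a9 | e6 9c a8

# Byte 1 is the middle of the first character; the next boundary is byte 2.
start = next_boundary(data, 1)
print(start, data[start:].decode("utf-8"))
```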

Finally, for documents that use the Roman alphabet primarily, UTF-8 documents tend to be smaller than other Unicode encodings because each ASCII character takes up only one byte. The additional characters in other Roman alphabet languages (e.g., French, Turkish) don't make a huge difference. Non-Roman alphabets like Arabic and Greek use two bytes per character, which is no bigger than they are in other Unicode encodings. However, in languages with ideographic characters, such as Chinese, Japanese, and Korean, each character occupies three bytes or more, which makes text significantly larger than it would be in UTF-16.

UTF-16

UTF-16 (Unicode Transformation Format, 16-bit encoding form) is a more obvious encoding that uses two bytes for most characters, including the most common Chinese, Japanese, and Korean ideographs. Some less common ideographs are encoded in Plane 2 and represented with surrogate pairs of four bytes each. However, though you might use one or two of these, they're unlikely to make up the bulk of any document. Thus a document containing large amounts of Chinese, Japanese, or Korean text can be a third smaller in UTF-16 than it would be in UTF-8.
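The size trade-off can be measured directly. The sketch below compares encoded sizes for three short sample strings chosen only for illustration. It uses `utf-16-be` so that no byte order mark is counted.

```python
samples = {
    "English": "tree",
    "Greek": "αλφα",
    "Japanese": "日本語の文章",
}


def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


# Map each language to (UTF-8 size, UTF-16 size) in bytes.
sizes = {
    lang: (encoded_size(text, "utf-8"), encoded_size(text, "utf-16-be"))
    for lang, text in samples.items()
}

for lang, (utf8, utf16) in sizes.items():
    print(f"{lang:9} UTF-8: {utf8:2} bytes   UTF-16: {utf16:2} bytes")
```

For these samples the English text doubles in UTF-16, the Greek text is the same size in both, and the Japanese text shrinks by a third.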

Note

On the other hand, ideographic languages stuff a lot of information into a single character. For example, the Japanese word for tree is 木. That single character is three bytes in UTF-8 and two bytes in UTF-16. By contrast, the English word tree needs four bytes in UTF-8 and eight bytes in UTF-16. The English word grove takes five bytes in UTF-8. The Japanese equivalent, 林, takes only three bytes. The word forest takes six bytes in English, but its Japanese equivalent, 森, still takes only three bytes. Ideographic languages are quite space efficient to start with, regardless of encoding. Chinese is probably the most efficient, Korean the least, with Japanese somewhere in between.

However, UTF-16 does not have all the other nice qualities of UTF-8. Even when space is not an issue, it should not be your default Unicode encoding unless you're working with ideographic languages. UTF-16 is not byte order independent, it does not allow character boundaries to be easily detected, and it tends to contain many embedded nulls.

The normal solution to the byte order problem is to place a byte order mark (#xFEFF) at the beginning of the document. If the first two bytes the program reads are FE and FF, the document is written in big-endian UTF-16. However, if the first two bytes the program reads are FF and FE, the document is written in little-endian UTF-16. (FFFE is not a legal Unicode character, so there's no chance of misidentifying a legal character as the byte order mark in the opposite encoding.) XML explicitly allows a byte order mark to appear in the first two bytes of a document. This is the only thing that may appear before the XML declaration.
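A minimal sketch of the byte order check described above, using the BOM constants from Python's standard `codecs` module. The function name `utf16_byte_order` is hypothetical, and the sketch assumes the input is already known to be UTF-16 (a UTF-32 little-endian document also begins with FF FE).

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Inspect the first two bytes of a UTF-16 document for a byte order mark."""
    if data[:2] == codecs.BOM_UTF16_BE:   # FE FF
        return "big-endian"
    if data[:2] == codecs.BOM_UTF16_LE:   # FF FE
        return "little-endian"
    return None                           # no byte order mark present


big = codecs.BOM_UTF16_BE + "<a/>".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")

print(utf16_byte_order(big), utf16_byte_order(little))

# Python's generic "utf-16" codec reads the mark and strips it while decoding.
print(big.decode("utf-16"), little.decode("utf-16"))
```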

UTF-16 does not provide any foolproof mechanism to detect character boundaries. However, this shouldn't be an issue for streaming applications, which can simply start at the beginning and read two bytes at a time from that point forward. Some random access programs can simply assert that character boundaries occur only at even indexes (byte 0, byte 2, byte 4, and so on). If this is not an option, one useful heuristic is that, in big-endian UTF-16, zero bytes are much more likely to occur as the first byte of a character than the second, especially in markup and code where most characters are taken from the ASCII character set anyway. (In little-endian UTF-16 the reverse holds.)

Non-Unicode Character Sets

Having said all this, please keep in mind that this only applies to generating XML and processing XML with non-XML-aware tools such as text editors and grep. When an XML parser reads a document, it will translate the document's declared encoding to Unicode before presenting it to the client application. No properly designed XML system should ever depend on the document's original encoding.

Thus you can write XML documents in other encodings such as ISO-8859-1 (Latin-1) if this works better with your existing tools. Different branch offices in different countries can and do use different encodings, all of which are resolved by the parser when the document is processed. However, other than UTF-8 and UTF-16, XML processors are not required to recognize and understand other character sets. In practice, ISO-8859-1 does seem ubiquitous. However, the other standard character sets such as ISO-8859-2 through ISO-8859-16, ISO-2022-JP, and Big5 are often unsupported. Even ASCII is not recognized by all parsers, so it tends to get labeled as UTF-8, which is recognized. (Remember, any ASCII document is also a legal UTF-8 document.) UTF-8 and UTF-16 are much more interoperable across processes. Use UTF-8 if you plausibly can.
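The sketch below illustrates how a parser hides the original encoding from the client application. It uses Python's standard `xml.etree.ElementTree` module as one example of an XML parser and a made-up sample document. The same content is serialized once as ISO-8859-1 and once as UTF-8, and the parser reports identical text for both.

```python
import xml.etree.ElementTree as ET

# Sample document; \u00e9 is the letter e with an acute accent.
template = '<?xml version="1.0" encoding="{enc}"?><name>Jos\u00e9</name>'

# Same content, two different byte representations on disk.
latin1_doc = template.format(enc="ISO-8859-1").encode("iso-8859-1")
utf8_doc = template.format(enc="UTF-8").encode("utf-8")

# The parser reads the encoding declaration and hands back Unicode text.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_doc != utf8_doc, latin1_text == utf8_text)
```

The bytes differ, but application code sees the same string either way, so nothing downstream of the parser needs to know which encoding was used.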