5.5 Unicode | XML in a Nutshell, Third Edition

Unicode is an international standard character set that can be used to write documents in almost any language you're likely to speak, learn, or encounter in your lifetime, barring alien abduction. Version 4.0.1, the current version as of June, 2004, contains 96,447 characters from most of Earth's living languages as well as several dead ones. Unicode easily covers the Latin alphabet, in which most of this book is written. Unicode also covers Greek-derived scripts, including ancient and modern Greek and the Cyrillic scripts used in Serbia and much of the former Soviet Union. Unicode covers several ideographic scripts, including the Han character set used for Chinese and Japanese, the Korean Hangul syllabary, and phonetic representations of these languages, including Katakana and Hiragana. It covers the right-to-left Arabic and Hebrew scripts. It covers various scripts native to the Indian subcontinent, including Devanagari, Thai, Bengali, Tibetan, and many more. And that's still less than half of the scripts in Unicode 4.0. Probably less than one person in a thousand today speaks a language that cannot be reasonably represented in Unicode. In the future, Unicode will add still more characters, making this fraction even smaller. Unicode can potentially hold more than a million characters, but no one is willing to say in public where they think most of the remaining million characters will come from. ^[2]

^[2] After a few beers, some developers are willing to admit that they're preparing for a day when we're part of a Galactic Federation of thousands of intelligent species.

The Unicode character set assigns characters to code points; that is, numbers. These numbers can then be encoded in a variety of schemes, including:

UCS-2
UCS-4
UTF-8
UTF-16
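
To make the difference between a code point and its encoded bytes concrete, here is a small Python sketch that encodes the same character under each scheme. It rests on one assumption: Python has no codecs named UCS-2 or UCS-4, so `utf-16-be` and `utf-32-be` stand in for their big-endian forms. For the two characters used here, the bytes are the same. The names `SCHEMES` and `encode_all` are illustrative only.

```python
# Sketch: one code point, several byte encodings.
# Assumption: Python's "utf-16-be" and "utf-32-be" codecs stand in for
# big-endian UCS-2 and UCS-4; for these characters the bytes are identical.
SCHEMES = {
    "UCS-2": "utf-16-be",
    "UCS-4": "utf-32-be",
    "UTF-8": "utf-8",
    "UTF-16": "utf-16-be",
}


def encode_all(char):
    """Return {scheme name: hex string of the encoded bytes} for one character."""
    return {name: char.encode(codec).hex() for name, codec in SCHEMES.items()}


for char in ("A", "\u03a3"):  # Latin capital A and Greek capital sigma
    print(f"code point {ord(char)}:", encode_all(char))
```

The code point stays the same in every row; only the number and layout of the bytes change.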

5.5.1 UCS-2 and UTF-16

UCS-2, also known as ISO-10646-UCS-2, represents each character as a two-byte, unsigned integer between 0 and 65,535. Thus the capital letter A, code point 65 in Unicode, is represented by the two bytes 00 and 41 (in hexadecimal). The capital letter B, code point 66, is represented by the two bytes 00 and 42. The two bytes 03 and A3 represent the capital Greek letter Σ, code point 931.

UCS-2 comes in two variations, big endian and little endian. In big-endian UCS-2, the most significant byte of the character comes first. In little-endian UCS-2, the order is reversed. Thus, in big-endian UCS-2, the letter A is #x0041. ^[3] In little-endian UCS-2, the bytes are swapped, and A is #x4100. In big-endian UCS-2, the letter B is #x0042; in little-endian UCS-2, it's #x4200. In big-endian UCS-2, the letter Σ is #x03A3; in little-endian UCS-2, it's #xA303. In this book we use big-endian notation, but parsers cannot assume this. They must be able to determine the endianness from the document itself.

^[3] For reasons that will become apparent shortly, this book has adopted the convention that #x precedes hexadecimal numbers. Every two hexadecimal digits map to one byte.
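
The two byte orders can be reproduced by hand. The following Python sketch packs a code point as a two-byte unsigned integer in either order using the standard `struct` module. The function name `ucs2_bytes` and its `big_endian` parameter are hypothetical, chosen only for this illustration.

```python
import struct


def ucs2_bytes(code_point, big_endian=True):
    """Pack one code point (0 through 65,535) as a two-byte unsigned integer."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("UCS-2 can only represent code points 0 through 65,535")
    fmt = ">H" if big_endian else "<H"  # H = unsigned 16-bit integer
    return struct.pack(fmt, code_point)


for cp in (65, 66, 931):  # A, B, and capital sigma
    big = ucs2_bytes(cp).hex()
    little = ucs2_bytes(cp, big_endian=False).hex()
    print(cp, "big endian:", big, "little endian:", little)
```

For code point 931 this yields the bytes 03 A3 in big-endian order and A3 03 in little-endian order, matching the values given in the text.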

To distinguish between big-endian and little-endian UCS-2, a document encoded in UCS-2 customarily begins with Unicode character #xFEFF, the zero-width nonbreaking space, more commonly called the byte-order mark. This character has the advantage of being invisible. Furthermore, if its bytes are swapped, the resulting #xFFFE character doesn't actually exist. Thus, a program can look at the first two bytes of a UCS-2 document and tell immediately whether the document is big endian or little endian, depending on whether those bytes are #xFEFF or #xFFFE.
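
The byte-order check might be written along the following lines in Python. This is only a sketch: `detect_byte_order` and `decode_ucs2` are hypothetical helper names, the `utf-16-be` and `utf-16-le` codecs stand in for UCS-2, and the fallback to big endian when no mark is present is an assumption of this example, not a rule stated in the text.

```python
def detect_byte_order(data):
    """Guess the UCS-2 byte order from the first two bytes of a document.

    Returns "big", "little", or None when no byte-order mark is present.
    """
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return None


def decode_ucs2(data):
    """Decode UCS-2 bytes, using the byte-order mark to choose the byte order.

    Assumption: a document without a mark is treated as big endian.
    """
    order = detect_byte_order(data)
    body = data[2:] if order else data  # drop the mark itself if one was found
    codec = "utf-16-le" if order == "little" else "utf-16-be"
    return body.decode(codec)


print(detect_byte_order(b"\xfe\xff\x00\x41"), decode_ucs2(b"\xfe\xff\x00\x41"))
print(detect_byte_order(b"\xff\xfe\x41\x00"), decode_ucs2(b"\xff\xfe\x41\x00"))
```

Both sample documents decode to the single letter A even though their bytes are stored in opposite orders.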

UCS-2 has three major disadvantages, however:

Files containing mostly Latin text are about twice as large in UCS-2 as they are in a single-byte character set such as ASCII or Latin-1.
UCS-2 is not backward- or forward-compatible with ASCII. Tools that are accustomed to single-byte character sets often can't process a UCS-2 file in a reasonable way, even if the file only contains characters from the ASCII character set. For instance, a program written in C that expects the zero byte to terminate strings will choke on a UCS-2 file containing mostly English text because almost every other byte is zero.
UCS-2 is limited to 65,536 characters.
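
The first two disadvantages are easy to observe. The Python sketch below encodes a short ASCII-only string both ways and then mimics a C-style routine that stops reading at the first zero byte. The helper `c_string_view` is a hypothetical stand-in for such a routine, and `utf-16-be` and `utf-16-le` stand in for UCS-2, which is byte-for-byte identical for this text.

```python
text = "Hello, XML"  # ASCII-only sample text

ascii_data = text.encode("ascii")
ucs2_data = text.encode("utf-16-be")  # same bytes as big-endian UCS-2 here

# Disadvantage 1: the UCS-2 form is twice as large.
size_ratio = len(ucs2_data) / len(ascii_data)

# Disadvantage 2: every other byte is zero.
zero_count = ucs2_data.count(0)


def c_string_view(data):
    """Mimic a C routine that treats the first zero byte as the end of the string."""
    return data.split(b"\x00", 1)[0]


print("size ratio:", size_ratio)
print("zero bytes:", zero_count, "of", len(ucs2_data))
print("ASCII as seen by C:", c_string_view(ascii_data))
print("big-endian UCS-2 as seen by C:", c_string_view(ucs2_data))
print("little-endian UCS-2 as seen by C:", c_string_view(text.encode("utf-16-le")))
```

In big-endian order the very first byte is zero, so the simulated C routine sees an empty string. In little-endian order it sees only the first letter.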

The last problem isn't so important in practice, since the first 65,536 code points of Unicode nonetheless manage to cover most people's needs except for dead languages like Ugaritic, fictional scripts like Tengwar, and musical and some mathematical symbols. Unicode does, however, provide a means of representing code points beyond 65,535 by recognizing certain two-byte sequences as half of a surrogate pair. A Unicode document that uses UCS-2 plus surrogate pairs is said to be in the UTF-16 encoding.
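
The text does not spell out how a surrogate pair is computed. The Python sketch below uses the standard UTF-16 arithmetic: subtract #x10000 from the code point, then split the remaining 20 bits into two 10-bit halves added to #xD800 and #xDC00. The function name `to_surrogate_pair` is hypothetical, and the sample code point #x10380 is the first letter of the Ugaritic block.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 65,535 into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points from #x10000 to #x10FFFF need surrogates")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the first half
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the second half
    return high, low


high, low = to_surrogate_pair(0x10380)  # a Ugaritic letter
print(f"#x{high:04X} #x{low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x10380).encode("utf-16-be")
print(encoded.hex())
```

Each half is itself a two-byte value, so a character outside the first 65,536 code points occupies four bytes in UTF-16.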

The other two problems, however, are more likely to affect most developers. UTF-8 is an alternative encoding for Unicode that addresses both.

5.5.2 UTF-8

UTF-8 is a variable-length encoding of Unicode. Characters 0 through 127, that is, the ASCII character set, are encoded in one byte each, exactly as they would be in ASCII. In ASCII, the byte with value 65 represents the letter A . In UTF-8, the byte with the value 65 also represents the letter A . There is a one-to-one identity mapping from ASCII characters to UTF-8 bytes. Thus, pure ASCII files are also acceptable UTF-8 files.
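
The identity mapping can be checked mechanically. This short Python sketch encodes all 128 ASCII characters as UTF-8 and confirms that each becomes a single byte equal to its code point. The variable names are illustrative only.

```python
# Every ASCII character, code points 0 through 127.
ascii_chars = [chr(cp) for cp in range(128)]

# Each one encodes to a single UTF-8 byte whose value equals the code point.
identity_holds = all(ch.encode("utf-8") == bytes([ord(ch)]) for ch in ascii_chars)

# So bytes written as ASCII can be read back as UTF-8 unchanged.
ascii_file = "XML in a Nutshell".encode("ascii")
round_trip = ascii_file.decode("utf-8")

print(identity_holds, round_trip)
```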

UTF-8 represents the characters from 128 to 2,047, a range that covers the most common non-ideographic scripts, in two bytes each. Characters from 2,048 to 65,535, mostly from Chinese, Japanese, and Korean, are represented in three bytes each. Characters with code points above 65,535 are represented in four bytes each. For a file that's mostly Latin text, this effectively halves the file size from what it would be in UCS-2. However, for a file that's primarily Japanese, Chinese, Korean, or one of the languages of the Indian subcontinent, the file size can grow by 50%. For most other living languages, the file size is close to the same as it would be in UCS-2.
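
The ranges and size comparisons above can be sketched in Python as follows. The helpers `utf8_length` and `size_ratio` are hypothetical names for this illustration, the sample strings are arbitrary, and `utf-16-be` again stands in for UCS-2, which it matches for these samples.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point, following the ranges above."""
    if code_point < 0x80:       # 0 through 127
        return 1
    if code_point < 0x800:      # 128 through 2,047
        return 2
    if code_point < 0x10000:    # 2,048 through 65,535
        return 3
    return 4                    # above 65,535


def size_ratio(text):
    """Size of text in UTF-8 divided by its size in two-byte UCS-2."""
    return len(text.encode("utf-8")) / len(text.encode("utf-16-be"))


samples = {
    "Latin": "Unicode",
    "Greek": "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",  # "Greek" in Greek
    "Japanese": "\u65e5\u672c\u8a9e",  # "Japanese language" in Japanese
}

for name, text in samples.items():
    lengths = [utf8_length(ord(ch)) for ch in text]
    print(name, "bytes per character:", lengths, "ratio:", size_ratio(text))
```

A ratio of 0.5 means the UTF-8 form is half the size of the UCS-2 form, 1.0 means the two are equal, and 1.5 corresponds to the 50% growth mentioned for ideographic text.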

UTF-8 is probably the most broadly supported encoding of Unicode. For instance, it's how Java .class files store strings, it's the native encoding of the BeOS, and it's the default encoding an XML processor assumes unless told otherwise by a byte-order mark or an encoding declaration. Chances are pretty good that if a program tells you it's saving Unicode, it's really saving UTF-8.