Chapter 5. Internationalization | XML in a Nutshell, Third Edition

We've told you that XML documents contain text, but we haven't yet told you what kind of text they contain. In this chapter we rectify that omission. XML documents contain Unicode text. Unicode is a character set large enough to include all the world's living languages and a few dead ones. It can be written in a variety of encodings, including UCS-2 and the ASCII superset UTF-8. However, since Unicode text editors are not ubiquitous, XML documents may also be written in other character sets and encodings, which are converted to Unicode when the document is parsed. The encoding declaration specifies which character set a document uses. You can use character references, such as &#x3B8;, to insert Unicode characters like θ that aren't available in the legacy character set in which a document is written.
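To make the last two points concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document, including the element name angle, is made up for illustration. It is declared as ISO-8859-1, a legacy character set with no Greek letters, so the theta is supplied through a character reference, and the parser hands back Unicode text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The encoding declaration says the bytes
# are ISO-8859-1 (Latin-1). Byte 0xB0 is the degree sign in that set.
# Latin-1 has no theta, so it is written as the character reference &#x3B8;.
doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<angle>&#x3B8; = 45\xb0</angle>'
)

root = ET.fromstring(doc)

# The parser converts everything to Unicode: the reference becomes U+03B8
# and the Latin-1 byte 0xB0 becomes U+00B0.
text = root.text
print(text)
print([hex(ord(ch)) for ch in text if ord(ch) > 127])
```

Whether a given parser accepts a particular legacy encoding depends on the parser, so treat the choice of ISO-8859-1 here as an assumption that happens to be widely supported.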

Computers don't really understand text. They don't recognize the Latin letter Z, the Greek letter γ, or the Han ideograph 齵. All a computer understands are numbers such as 90, 947, or 40,821. A character set maps particular characters, like Z, to particular numbers, like 90. These numbers are called code points. A character encoding determines how those code points are represented in bytes. For instance, the code point 90 can be encoded as a signed byte, a little-endian unsigned short, a 4-byte, two's complement, big-endian integer, or in some still more complicated fashion.
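The distinction between a code point and its byte representation can be sketched in a few lines of Python. This is only an illustration, using the standard struct module to produce the three byte layouts named above; none of these layouts is a named XML encoding in itself.

```python
import struct

# Character set: map a character to its code point.
code_point = ord("Z")
print(code_point)

# Character encoding: decide how that number is laid out in bytes.
signed_byte = struct.pack("b", code_point)    # 1-byte signed integer
le_short = struct.pack("<H", code_point)      # 2-byte little-endian unsigned short
be_int = struct.pack(">i", code_point)        # 4-byte big-endian two's complement int

for label, raw in [("signed byte", signed_byte),
                   ("little-endian short", le_short),
                   ("big-endian int", be_int)]:
    print(label, raw.hex())

# The same mapping works in reverse: a code point identifies a character.
greek = chr(947)
han = chr(40821)
print(greek, han)
```

The same number, 90, comes out as one, two, or four bytes depending on the layout chosen, which is why a reader of the bytes must know the encoding before it can recover the code points.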

A human script like Cyrillic may be written in multiple character sets, such as KOI8-R, Unicode, or ISO-8859-5. A character set like Unicode may then be encoded in multiple encodings, such as UTF-8, UCS-2, or UTF-16. However, most simpler character sets, such as ASCII and KOI8-R, have only one encoding.
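As a rough illustration of this layering, the following Python sketch takes a single Cyrillic letter and writes it out in the character sets and encodings mentioned above. The codec names are those of Python's standard library. UCS-2 has no codec of its own there, so the sketch uses UTF-16, which yields the same two bytes for a character like this one.

```python
# One Cyrillic letter: U+042F CYRILLIC CAPITAL LETTER YA.
letter = "\u042f"

# Two legacy character sets, each with a single one-byte encoding.
koi8 = letter.encode("koi8_r")
iso = letter.encode("iso8859_5")

# One character set (Unicode) in two different encodings.
utf8 = letter.encode("utf-8")
utf16 = letter.encode("utf-16-be")

for name, raw in [("KOI8-R", koi8), ("ISO-8859-5", iso),
                  ("UTF-8", utf8), ("UTF-16 (big-endian)", utf16)]:
    print(f"{name:20} {raw.hex()}")

# Decoding with the matching name recovers the same letter every time.
decoded = [koi8.decode("koi8_r"), iso.decode("iso8859_5"),
           utf8.decode("utf-8"), utf16.decode("utf-16-be")]
print(decoded)
```

Every byte sequence differs, yet each decodes to the same letter, provided the decoder is told which character set and encoding were used.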