Character Set versus Character Encoding | Effective XML: 50 Specific Ways to Improve Your XML

XML is based on the Unicode character set. A character set is a collection of characters assigned to particular numbers called code points. Currently Unicode 4.0 defines more than 90,000 individual characters. Each character in the set is mapped to a number, such as 64, 812, or 87,000. These numbers are not ints, shorts, bytes, longs, or any other numeric data type. They are simply numbers. Other character sets, such as Shift-JIS and Latin-1, contain different collections of characters that are assigned to different numbers, although there's often substantial overlap with the Unicode character set. That is, many character sets assign some or all of their characters to the same numbers to which Unicode assigns those characters.
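As a small illustration of the idea of code points, the following Python sketch looks up the number assigned to a character and shows one case of overlap between Latin-1 and Unicode. The variable names are chosen for this example only, and the use of Python's built-in `ord`, `chr`, and the `latin-1` codec is an assumption of this sketch, not something the text prescribes.

```python
# chr() maps a code point (an abstract number) to its character;
# ord() maps a character back to its code point.
at_sign = chr(64)                      # code point 64 is "@"
e_acute = "\u00e9"                     # LATIN SMALL LETTER E WITH ACUTE

unicode_number = ord(e_acute)          # the number Unicode assigns
latin1_number = e_acute.encode("latin-1")[0]  # the number Latin-1 assigns

print(at_sign, ord(at_sign))
print(e_acute, unicode_number, latin1_number)

# Latin-1 and Unicode agree on this character's number.
same_number = unicode_number == latin1_number
print("same number in both sets:", same_number)
```

Here the code point is only a number; nothing has yet been said about how many bytes it occupies or in what order.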

A character encoding represents the members of a character set as bytes in a particular way. There are multiple encodings of Unicode, including UTF-8, UTF-16, UCS-2, UCS-4, UTF-32, and several other more obscure ones. Different encodings may encode the same code point using a different sequence of bytes and/or a different number of bytes. They may use big-endian or little-endian data. They can even use non-two's-complement representations. They may use two bytes or four bytes for each character. They may even use different numbers of bytes for different characters.
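To make the difference concrete, here is a minimal Python sketch that encodes one code point, the euro sign (U+20AC, chosen arbitrarily for this example), in several Unicode encodings. It relies on Python's standard codec names (`utf-8`, `utf-16-be`, `utf-16-le`, `utf-32-be`); the variable names are illustrative.

```python
euro = "\u20ac"  # EURO SIGN, code point 8364 (hex 20AC)

encoding_names = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]

# Same code point, different byte sequences and lengths.
encoded = {name: euro.encode(name) for name in encoding_names}
for name, data in encoded.items():
    print(f"{name:10} {len(data)} bytes: {data.hex()}")

# UTF-8 uses different numbers of bytes for different characters.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ["A", "\u03b1", euro]}
print(utf8_lengths)
```

The two UTF-16 results contain the same two bytes in opposite order, which is the big-endian versus little-endian distinction mentioned above.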

Changing the character set changes which characters can be represented. For instance, the ISO-8859-7 set includes Greek letters. The ISO-8859-1 set does not. Changing the character encoding does not change which characters can be used; it merely changes how each character is encoded in bytes.
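The following Python sketch illustrates both halves of this point, under the assumption that Python's `iso-8859-7` and `iso-8859-1` codecs faithfully model those character sets. The Greek letter alpha is used as a sample character, and the variable names are illustrative.

```python
alpha = "\u03b1"  # GREEK SMALL LETTER ALPHA

# ISO-8859-7 includes Greek letters, so alpha can be represented.
greek_bytes = alpha.encode("iso-8859-7")

# ISO-8859-1 does not include alpha, so encoding fails.
try:
    alpha.encode("iso-8859-1")
    representable_in_latin1 = True
except UnicodeEncodeError:
    representable_in_latin1 = False

# Switching between Unicode encodings changes the bytes,
# not the set of characters that can be used.
utf8_bytes = alpha.encode("utf-8")
utf16_bytes = alpha.encode("utf-16-be")
round_trips = (utf8_bytes.decode("utf-8") == alpha
               and utf16_bytes.decode("utf-16-be") == alpha)

print(greek_bytes.hex(), representable_in_latin1)
print(utf8_bytes.hex(), utf16_bytes.hex(), round_trips)
```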

XML parsers always convert characters in other sets to Unicode before reporting them to the client application. In effect, they treat other character sets as different encodings of some subset of Unicode. Thus, XML doesn't ever really let you change the character set. It is always Unicode. XML only lets you adjust how those characters are represented.
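One way to observe this behavior is to parse the same document stored in two different encodings and compare what the parser reports. The sketch below uses Python's standard `xml.etree.ElementTree` module as an example parser; the element name `word`, the helper `make_document`, and the sample text are hypothetical, invented for this illustration.

```python
import xml.etree.ElementTree as ET

greek_text = "\u03b1\u03b2\u03b3"  # alpha, beta, gamma


def make_document(encoding: str) -> bytes:
    """Serialize the same tiny document in the given encoding (hypothetical helper)."""
    source = f'<?xml version="1.0" encoding="{encoding}"?><word>{greek_text}</word>'
    return source.encode(encoding)


doc_greek = make_document("ISO-8859-7")
doc_utf8 = make_document("UTF-8")

# The bytes on disk differ between the two documents...
bytes_differ = doc_greek != doc_utf8

# ...but the parser reports the same Unicode characters to the application.
text_from_greek = ET.fromstring(doc_greek).text
text_from_utf8 = ET.fromstring(doc_utf8).text

print(bytes_differ)
print(text_from_greek == text_from_utf8 == greek_text)
```

From the client application's point of view, the declared encoding has disappeared by the time the content arrives; only Unicode characters remain.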