27.1 Character Tables | XML in a Nutshell, Third Edition

The XML specification divides Unicode into five overlapping sets:

Name characters

Characters that can appear in an element, attribute, or entity name. These characters are letters, ideographs, digits, and the punctuation marks _, -, ., and :. In the tables that follow, name characters are shown in bold type, such as A, 1, 2, 3, and _.

One of the major differences between XML 1.0 and 1.1 is in which characters are name characters. All XML 1.0 name characters are also XML 1.1 name characters. However, XML 1.1 also promotes many other characters to name characters. Some of these, such as the Burmese and Mongolian letters, reasonably deserve to be name characters. However, XML 1.1 also allows many problematic characters, including ligatures such as ĳ, currency symbols such as the Greek drachma sign, letter-like symbols, number forms such as Roman numerals, and presentation forms. Finally, it allows all characters not defined as of Unicode 3.1.1 and all characters from beyond the basic multilingual plane, including such strange things as the musical symbol for a six-string fretboard. Unless you are working in a language such as Burmese or Mongolian that requires these new characters, you should restrict your markup to characters that are legal in XML 1.0. The tables that follow are based on XML 1.0 rules.

Name start characters

Characters that can be the first character of an element, attribute, or entity name. These characters are letters, ideographs, and the underscore _. Strictly speaking, the XML 1.0 grammar also accepts a colon as the first character of a name, but colons are best left to namespace prefixes. In the tables that follow, name start characters are shown with a gray background, such as A and _. Because name start characters are a subset of name characters, they are also shown in bold.

Character data characters

All characters that can be used anywhere in an XML document, including element and attribute content, comments, and DTDs. This set includes almost all Unicode characters, except for surrogates and most C0 control characters. These characters are shown in a normal typeface. If they are name characters, they will be bold. If they are also name start characters, they'll have a gray background.

Illegal characters

Characters that may not appear anywhere in an XML document, such as in part of a name, character data, or comment text. These characters are shown in italic, such as NUL or BEL. Most of these characters are either C0 control characters or half of a surrogate pair.

XML 1.1 does allow the C0 control characters, except for NUL, to be included with a character reference such as &#x0B;. XML 1.0 does not allow this. XML 1.1 also requires C1 control characters, except for NEL, to be escaped with character references. XML 1.0 does not require this.

Unassigned code points

Code points that are not assigned to a character as of Unicode 4.0.1. Theoretically, a program could produce a file containing one of these code points, but their meaning is undefined and they should be avoided. They are represented in the following tables as n/a.

Figure 27-1 shows the relationship between these sets. Note that all name start characters are name characters and that all name characters are character data characters.

Figure 27-1. XML's division of Unicode characters
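The nesting shown in Figure 27-1 can be checked mechanically. The following Python sketch is an illustration only. The function `is_xml10_char` follows the `Char` production of the XML 1.0 specification, while the two name sets are deliberately restricted to ASCII, because the full XML 1.0 letter and ideograph tables are too long to reproduce here. The names `is_xml10_char`, `ASCII_NAME_START`, `ASCII_NAME`, and `is_ascii_xml_name` are invented for this example.

```python
import string

# ASCII-only approximations of the name sets (an assumption of this sketch).
# Real XML 1.0 names may also contain non-ASCII letters, ideographs, and digits.
ASCII_NAME_START = set(string.ascii_letters + "_")
ASCII_NAME = ASCII_NAME_START | set(string.digits + "-.:")


def is_xml10_char(ch: str) -> bool:
    """True if ch may appear somewhere in an XML 1.0 document (Char production)."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_ascii_xml_name(name: str) -> bool:
    """Conservative check: accepts only ASCII names and rejects some legal ones."""
    return (
        bool(name)
        and name[0] in ASCII_NAME_START
        and all(ch in ASCII_NAME for ch in name)
    )


# Each set is contained in the next one.
assert ASCII_NAME_START <= ASCII_NAME
assert all(is_xml10_char(ch) for ch in ASCII_NAME)

print(is_ascii_xml_name("_item-1"))  # True
print(is_ascii_xml_name("1item"))    # False: a digit cannot start a name
print(is_xml10_char("\x07"))         # False: BEL is an illegal C0 control
```

Because the name sets here are ASCII-only, a `False` result from `is_ascii_xml_name` does not prove that a name is illegal. It only means the name falls outside the conservative subset this sketch accepts.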

In all the tables that follow, each cell's upper lefthand corner contains the character's two-digit Unicode hexadecimal value, and the upper righthand corner contains the character's Unicode decimal value. You can insert a character in an XML document by prefixing the decimal value with &# and suffixing it with a semicolon. Thus, Unicode character 69, the capital letter E, can be written as &#69;. Hexadecimal values work the same way, except that you prefix them with &#x. In hexadecimal, the letter E is 45, so it can also be written as &#x45;.
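As a small illustration of the two reference forms, the following Python sketch builds the decimal and hexadecimal character references for a character and confirms that a parser resolves both to the same character. The helper name `char_refs` is hypothetical. The parser is the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def char_refs(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal character references for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


dec_ref, hex_ref = char_refs("E")
print(dec_ref, hex_ref)  # &#69; &#x45;

# Both references resolve to the same character when the document is parsed.
doc = f"<p>{dec_ref}{hex_ref}</p>"
print(ET.fromstring(doc).text)  # EE
```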

27.1.1 ASCII

Most character sets in common use today are supersets of ASCII. That is, code points 0 through 127 are assigned to the same characters to which ASCII assigns them. The only notable exceptions are the EBCDIC-derived character sets. In particular, Unicode is a superset of ASCII, and code points 0 through 127 identify the same characters in Unicode as they do in ASCII. Table 27-2 lists the ASCII character set.
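The superset relationship is easy to observe with the codecs that ship with Python. This sketch encodes all 128 ASCII characters with several codecs and compares the result with the plain ASCII bytes. The helper name `agrees_with_ascii` is invented for the example, and `cp037` is used here as one representative EBCDIC-derived code page.

```python
# All 128 ASCII characters, code points 0 through 127.
ASCII_TEXT = "".join(chr(cp) for cp in range(128))


def agrees_with_ascii(codec: str) -> bool:
    """True if codec encodes every ASCII character to the same byte as ASCII."""
    return ASCII_TEXT.encode(codec) == ASCII_TEXT.encode("ascii")


for codec in ("utf-8", "iso-8859-1", "iso-8859-2", "cp037"):
    print(codec, agrees_with_ascii(codec))

# EBCDIC assigns the letter A to byte 0xC1 rather than 0x41.
print("A".encode("cp037"))  # b'\xc1'
```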

Table 27-2. The first 128 Unicode characters (the ASCII character set)

Characters 0 through 31 are nonprinting control characters, sometimes called the C0 controls to distinguish them from the C1 controls used in the ISO-8859 character sets. Character 127, DEL, is also a nonprinting control character. Of the 32 C0 controls, only the carriage return, line feed, and horizontal tab may appear in XML 1.0 documents. The other 29 may not appear anywhere in an XML 1.0 document, including in tags, comments, or parsed character data. In XML 1.1 (but not XML 1.0), 28 of these 29 characters (all of them except NUL) can be inserted with character references, such as &#x0B;. DEL is legal in XML 1.0, although XML 1.1 requires it to be written as a character reference.
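The XML 1.0 restriction can be observed with any conforming XML 1.0 parser. The sketch below uses Python's `xml.etree.ElementTree`, which implements XML 1.0, to show that a vertical tab is rejected both literally and as a character reference, while a horizontal tab is accepted. The helper names `is_well_formed` and `strip_illegal_c0` are hypothetical. The second helper handles only the C0 range and is not a complete sanitizer.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """True if the XML 1.0 parser accepts doc."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


def strip_illegal_c0(text: str) -> str:
    """Drop the C0 controls other than tab, line feed, and carriage return."""
    return "".join(ch for ch in text if ch in "\t\n\r" or ord(ch) >= 0x20)


print(is_well_formed("<p>a&#x09;b</p>"))  # True: horizontal tab is legal
print(is_well_formed("<p>a&#x0B;b</p>"))  # False: vertical tab by reference
print(is_well_formed("<p>a\x0bb</p>"))    # False: literal vertical tab

cleaned = strip_illegal_c0("a\x0bb\tc")
print(is_well_formed(f"<p>{cleaned}</p>"))  # True
```

Filtering text before it is written into a document is usually easier than recovering after a parser has rejected the whole file.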

27.1.2 ISO-8859-1, Latin-1

Character sets defined by the ISO-8859 standard are popular supersets of the ASCII character set. These character sets all provide the normal ASCII characters from code points 0 through 127 and the C1 controls from 128 to 159. They provide different repertoires of characters in the range from 160 to 255.

In particular, many Western European and American systems use a character set called Latin-1. This set is the first code page defined in the ISO-8859 standard and is also called ISO-8859-1. Although all common encodings of Unicode, such as UTF-8 and UTF-16, represent code points 128 through 255 with different byte sequences than Latin-1 uses, those code points identify the same characters in both Latin-1 and Unicode. This is not the case for other character sets.
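The distinction between code points and encoded bytes can be made concrete with a short Python sketch. It checks that every Latin-1 byte decodes to the Unicode character with the same number, and then shows that UTF-8 stores one such character as a different byte sequence. The variable names are illustrative only.

```python
# Every Latin-1 byte value decodes to the Unicode code point of the same number.
same_code_points = all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))
print(same_code_points)  # True

# The encoded bytes differ, however. Take code point 233 (0xE9),
# LATIN SMALL LETTER E WITH ACUTE.
e_acute = "\u00e9"
print(e_acute.encode("latin-1"))  # b'\xe9'
print(e_acute.encode("utf-8"))    # b'\xc3\xa9'
```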

27.1.2.1 C1 controls

All ISO-8859 character sets begin with the same 32 extra nonprinting control characters in code points 128 through 159. These characters are used on terminals like the DEC VT-320 to provide graphics functionality not included in ASCII, such as erasing the screen and switching it to inverse video or graphics mode. These characters cause severe problems for anyone reading or editing an XML document on a terminal or terminal emulator.

Fortunately, these characters are not necessary in XML documents. Their inclusion in XML 1.0 was an oversight. They should have been banned like the C0 controls. Unfortunately, many editors incorrectly label documents written in the Cp1252 Windows character set as ISO-8859-1. This character set does use the code points between 128 and 159 for noncontrol graphics characters. When documents written with this character set are displayed or edited on a dumb terminal, they can effectively disable the user's terminal. Similar problems exist with most other Windows code pages for single-byte character sets. XML 1.1 corrects the oversight by requiring all of the C1 controls, except NEL, to be escaped with character references.

In the spirit of being liberal in what you accept and conservative in what you generate, you should never use Cp1252, correctly labeled or otherwise. You should also avoid using other nonstandard code pages for documents that move beyond a single system. On the other hand, if you receive a document labeled as Cp1252 (or any other Windows code page), it can be displayed if you're careful not to throw it at a terminal unfiltered. If you suspect that a document labeled as ISO-8859-1 that uses characters between 128 and 159 is in fact a Cp1252 document, you should probably reject it. This decision is difficult, however, given the prevalence of broken software that does not properly identify the documents it sends.
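One way to act on this advice is to look for bytes in the 128 through 159 range before trusting an ISO-8859-1 label. The following Python sketch is a heuristic under stated assumptions, not a definitive detector. The function names `has_c1_bytes` and `decode_labeled_latin1` and the `strict` parameter are invented for the example. Rejection is the default, as the text recommends.

```python
def has_c1_bytes(data: bytes) -> bool:
    """True if data contains any byte in the C1 range, 0x80 through 0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


def decode_labeled_latin1(data: bytes, strict: bool = True) -> str:
    """Decode bytes that claim to be ISO-8859-1.

    If C1-range bytes are present, the data is probably mislabeled Cp1252.
    Reject it when strict is true; otherwise decode it as Cp1252, which can
    still raise UnicodeDecodeError for the few bytes Cp1252 leaves undefined.
    """
    if has_c1_bytes(data):
        if strict:
            raise ValueError("bytes 0x80-0x9F found; probably mislabeled Cp1252")
        return data.decode("cp1252")
    return data.decode("iso-8859-1")


sample = b"\x93quoted\x94"  # Cp1252 curly quotation marks around a word
print(has_c1_bytes(sample))                                   # True
print(ascii(decode_labeled_latin1(sample, strict=False)))     # '\u201cquoted\u201d'
```

Decoding the same bytes as genuine ISO-8859-1 would instead yield the C1 controls U+0093 and U+0094, which is exactly the content a terminal should never receive unfiltered.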

27.1.2.2 Latin-1

Latin-1 covers most Western European languages that use some variant of the Latin alphabet. Characters 0 through 127 in this set are identical to the ASCII characters with the same code points. Characters 128 to 159 are the C1 control characters used only for dumb terminals. Character 160 is the nonbreaking space. Characters 161 through 255 are accented characters, non-U.S. punctuation marks, and a few new letters, such as the Icelandic thorn (Þ) and eth (ð). Table 27-3 shows the upper half of this character set. The lower half is identical to the ASCII character set shown in Table 27-2.

Table 27-3. Unicode characters between 160 and 255 and the second half of the Latin-1, ISO-8859-1 character set
