Chapter 5. Internationalization

CONTENTS

  •  5.1 Character-Set Metadata
  •  5.2 The Encoding Declaration
  •  5.3 Text Declarations
  •  5.4 XML-Defined Character Sets
  •  5.5 Unicode
  •  5.6 ISO Character Sets
  •  5.7 Platform-Dependent Character Sets
  •  5.8 Converting Between Character Sets
  •  5.9 The Default Character Set for XML Documents
  •  5.10 Character References
  •  5.11 xml:lang

We've told you that XML documents contain text, but we haven't yet told you what kind of text they contain. In this chapter we rectify that omission. XML documents contain Unicode text. Unicode is a character set large enough to include all the world's living languages and a few dead ones. It can be written in a variety of encodings, including UCS-2 and the ASCII superset UTF-8. However, since Unicode text editors are relatively uncommon, XML documents may also be written in other character sets and encodings, which are converted to Unicode when the document is parsed. The encoding declaration specifies which character set a document uses. You can use character references, such as &#x3B8;, to insert Unicode characters like θ that aren't available in the legacy character set in which a document is written.

Computers don't really understand text. They don't recognize the Latin letter Z, the Greek letter γ, or the Han ideograph 齵. All a computer understands are numbers such as 90, 947, or 40,821. A character set maps particular characters, like Z, to particular numbers, like 90. These numbers are called code points. A character encoding determines how those code points are represented in bytes. For instance, the code point 90 can be encoded as a signed byte, a little-endian unsigned short, a 4-byte, two's complement, big-endian integer, or in some still more complicated fashion.
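To make the distinction between a code point and its encoded bytes concrete, here is a small sketch. It uses Python's codec names, such as "utf-16-le", which differ from the names used in XML encoding declarations:

```python
# One code point, three byte representations.
ch = "Z"
assert ord(ch) == 90                       # Z is code point 90

assert ch.encode("ascii") == b"Z"          # a single byte, value 90
assert ch.encode("utf-16-le") == b"Z\x00"  # little-endian unsigned short
assert ch.encode("utf-32-be") == b"\x00\x00\x00Z"  # 4-byte big-endian integer
```

The same number 90 becomes one, two, or four bytes depending on the encoding chosen, which is exactly why a parser must know the encoding before it can read anything else.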

A human script like Cyrillic may be written in multiple character sets, such as KOI8-R, Unicode, or ISO-8859-5. A character set like Unicode may then be encoded in multiple encodings, such as UTF-8, UCS-2, or UTF-16. In general, however, simpler character sets like ASCII and KOI8-R have only one encoding.

5.1 Character-Set Metadata

Some environments keep track of the character set and encoding in which each document is written. For instance, web servers that transmit XML documents precede them with an HTTP header that looks something like this:

HTTP/1.1 200 OK
Date: Sun, 28 Oct 2001 11:05:42 GMT
Server: Apache/1.3.19 (Unix) mod_jk mod_perl/1.25 mod_fastcgi/2.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/xml; charset=iso-8859-1

The Content-Type field of the HTTP header provides the MIME media type of the document. This may, as shown here, specify in which character set the document is written. An XML parser reading this document from a web server should use this information to determine the document's character encoding.

Many web servers omit the charset parameter from the MIME media type. In this case, if the MIME media type is text/xml, then the document is assumed to be in the us-ascii encoding. If the MIME media type is application/xml, then the parser attempts to guess the character set by reading the first few bytes of the document.

Since ASCII is almost never an appropriate character set for an XML document, application/xml is much preferred over text/xml. Unfortunately, most web servers, including Apache 2.0.36 and earlier, are configured to use text/xml by default. It's worth editing your mime.types file to fix this. Alternatively, if you don't have root access to your web server, at least with Apache you can use the AddType and AddCharset directives in your .htaccess files to override the server-wide defaults.
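For instance, an .htaccess fragment along these lines would serve .xml files as application/xml with an explicit UTF-8 charset. AddType and AddCharset are real Apache directives, but the exact extension mapping here is our assumption; adapt it to your site:

```apache
# Hypothetical .htaccess fragment: serve .xml files as
# application/xml and declare their charset as UTF-8.
AddType application/xml .xml
AddCharset utf-8 .xml
```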

We've focused on MIME types in HTTP headers because that's the most common place where character-set metadata is applied to XML documents. However, MIME types are also used in some filesystems (e.g., the BeOS), in email, and in other environments. Other systems may provide other forms of character-set metadata. If such metadata is available for a document, whatever form it takes, the parser should use it, though in practice this is an area where not all parsers and programs are as conformant as they should be.

5.2 The Encoding Declaration

Every XML document should have an encoding declaration as part of its XML declaration. The encoding declaration tells the parser in which character set the document is written. It's used only when other metadata from outside the file is not available. For example, this XML declaration says that the document uses the character encoding US-ASCII:

<?xml version="1.0" encoding="US-ASCII" standalone="yes"?>

This one states that the document uses the Latin-1 character set, though it uses the more official name ISO-8859-1:

<?xml version="1.0" encoding="ISO-8859-1"?>

Even if metadata is not available, the encoding declaration can be omitted if the document is written in either the UTF-8 or UTF-16 encodings of Unicode. UTF-8 is a strict superset of ASCII, so ASCII files can be legal XML documents without an encoding declaration. Note, however, that this only applies to genuine, pure 7-bit ASCII files. It does not include the extended ASCII character sets that some editors produce, which add non-ASCII characters such as curly quotes and accented letters.

Even if character-set metadata is available, many parsers ignore it. Thus, we highly recommend including an encoding declaration in all your XML documents that are not written in UTF-8 or UTF-16. It certainly never hurts to do so.

5.3 Text Declarations

XML documents may be composed of multiple parsed entities, as you learned in Chapter 3. These external parsed entities may be DTD fragments or chunks of XML that will be inserted into the master document using external general entity references. In either case, the external parsed entity does not necessarily use the same character set as the master document. Indeed, one external parsed entity may be referenced in several different files, each of which is written in a different character set. Therefore, it is important to specify the character set for an external parsed entity independently of the character set that the including document uses.

To accomplish this task, each external parsed entity should have a text declaration. If present, the text declaration must be the very first thing in the external parsed entity. For example, this text declaration says that the associated entity is encoded in the KOI8-R character set:

<?xml version="1.0" encoding="KOI8-R"?>

The text declaration looks like an XML declaration. It has version info and an encoding declaration. However, a text declaration may not have a standalone declaration. Furthermore, the version info may be omitted. A legal text declaration that specifies the encoding as KOI8-R might look like this:

<?xml encoding="KOI8-R"?>

However, it is not a legal XML declaration.

Example 5-1 shows an external parsed entity containing several verses from Pushkin's The Bronze Horseman in a Cyrillic script. The text declaration identifies the encoding as KOI8-R. Example 5-1 is not itself a well-formed XML document because it has no root element. It exists only for inclusion in other documents.

Example 5-1. An external parsed entity with a text declaration identifying the character set as KOI8-R

[Figure: several verses of Pushkin's The Bronze Horseman in KOI8-R Cyrillic, preceded by the text declaration <?xml version="1.0" encoding="KOI8-R"?>]

External DTD subsets reside in external parsed entities and, thus, may have text declarations. Indeed, they should have text declarations if they're written in a character set other than one of Unicode's variants. Example 5-2 shows a DTD fragment written in KOI8-R that might be used to validate Example 5-1 after it is included as part of a larger document.

Example 5-2. A DTD with a text declaration identifying the character set as KOI8-R

[Figure: a DTD fragment in KOI8-R Cyrillic, preceded by the text declaration <?xml version="1.0" encoding="KOI8-R"?>]

5.4 XML-Defined Character Sets

An XML parser is required to handle the UTF-16 and UTF-8 encodings of Unicode (about which more follows). However, XML parsers are allowed to understand and process many other character sets. In particular, the specification recommends that processors recognize and be able to read these encodings:

  • UTF-8

  • UTF-16

  • ISO-10646-UCS-2

  • ISO-10646-UCS-4

  • ISO-8859-1

  • ISO-8859-2

  • ISO-8859-3

  • ISO-8859-4

  • ISO-8859-5

  • ISO-8859-6

  • ISO-8859-7

  • ISO-8859-8

  • ISO-8859-9

  • ISO-2022-JP

  • Shift_JIS

  • EUC-JP

Many XML processors understand other legacy encodings. For instance, processors written in Java often understand all character sets available in a typical Java virtual machine. For a list, see http://java.sun.com/products/jdk/1.3/docs/guide/intl/encoding.doc.html. Furthermore, some processors may recognize aliases for these encodings; both Latin-1 and 8859_1 are sometimes used as synonyms for ISO-8859-1. However, using these names limits your document's portability. We recommend that you use standard names for standard encodings. For encodings whose standard name isn't given by the XML 1.0 specification, use one of the names registered with the Internet Assigned Numbers Authority (IANA) and listed at ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets. However, knowing the name of a character set and saving a file in that set does not mean that your XML parser can read such a file. XML parsers are only required to support UTF-8 and UTF-16. They are not required to support the hundreds of different legacy encodings used around the world.

5.5 Unicode

Unicode is an international standard character set that can be used to write documents in almost any language you're likely to speak, learn, or encounter in your lifetime, barring alien abduction. Version 3.2, the current version as of May 2002, contains 95,156 characters from most of Earth's living languages as well as several dead ones. Unicode easily covers the Latin alphabet, in which most of this book is written. Unicode also covers Greek-derived scripts, including ancient and modern Greek and the Cyrillic scripts used in Serbia and much of the former Soviet Union. Unicode covers several ideographic scripts, including the Han character set used for Chinese and Japanese, the Korean Hangul syllabary, and phonetic representations of these languages, including Katakana and Hiragana. It covers the right-to-left Arabic and Hebrew scripts. It covers various scripts native to the Indian subcontinent, including Devanagari, Thai, Bengali, Tibetan, and many more. And that's still less than half of the scripts in Unicode 3.2. Probably less than one person in a thousand today speaks a language that cannot be reasonably represented in Unicode. In the future, Unicode will add still more characters, making this fraction even smaller. Unicode can potentially hold more than a million characters, but no one is willing to say in public where they think most of the remaining million characters will come from.[1]

The Unicode character set assigns characters to code points, that is, numbers. These numbers can then be encoded in a variety of schemes, including:

  • UCS-2

  • UCS-4

  • UTF-8

  • UTF-16

5.5.1 UCS-2 and UTF-16

UCS-2, also known as ISO-10646-UCS-2, is perhaps the most natural encoding of Unicode. It represents each character as a 2-byte, unsigned integer between 0 and 65,535. Thus the capital letter A, code point 65 in Unicode, is represented by the 2 bytes 00 and 41 (in hexadecimal). The capital letter B, code point 66, is represented by the 2 bytes 00 and 42. The 2 bytes 03 and A3 represent the capital Greek letter Σ, code point 931.

UCS-2 comes in two variations, big endian and little endian. In big-endian UCS-2, the most significant byte of the character comes first. In little-endian UCS-2, the order is reversed. Thus, in big-endian UCS-2, the letter A is #x0041.[2] In little-endian UCS-2, the bytes are swapped, and A is #x4100. In big-endian UCS-2, the letter B is #x0042; in little-endian UCS-2, it's #x4200. In big-endian UCS-2, the letter Σ is #x03A3; in little-endian UCS-2, it's #xA303. In this book we use big-endian notation, but parsers cannot assume this. They must be able to determine the endianness from the document itself.

To distinguish between big-endian and little-endian UCS-2, a document encoded in UCS-2 customarily begins with Unicode character #xFEFF, the zero-width nonbreaking space, more commonly called the byte-order mark. This character has the advantage of being invisible. Furthermore, if its bytes are swapped, the resulting #xFFFE character doesn't actually exist. Thus, a program can look at the first two bytes of a UCS-2 document and tell immediately whether the document is big endian or little endian, depending on whether those bytes are #xFEFF or #xFFFE.
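The trick is easy to verify; here is a sketch using Python's codecs:

```python
# The byte-order mark is simply U+FEFF serialized in each byte order.
bom = "\ufeff"
assert bom.encode("utf-16-be") == b"\xfe\xff"  # big-endian
assert bom.encode("utf-16-le") == b"\xff\xfe"  # little-endian
# The swapped value, 0xFFFE, is guaranteed never to be assigned to a
# character, so the first two bytes identify the byte order unambiguously.
```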

UCS-2 has three major disadvantages, however:

  • Files containing mostly Latin text are about twice as large in UCS-2 as they are in a single-byte character set, such as ASCII or Latin-1.

  • UCS-2 is not backward or forward compatible with ASCII. Tools that are accustomed to single-byte character sets often can't process a UCS-2 file in a reasonable way, even if the file only contains characters from the ASCII character set. For instance, a program written in C that expects the zero byte to terminate strings will choke on a UCS-2 file containing mostly English text because almost every other byte is zero.

  • UCS-2 is limited to 65,536 characters.

The last problem matters less in practice, since the first 65,536 code points of Unicode cover most people's needs; the exceptions are mostly dead languages like Ugaritic, fictional scripts like Tengwar, and a growing number of mathematical symbols. Unicode does, however, provide a means of representing code points beyond 65,535 by recognizing certain two-byte sequences as half of a surrogate pair. A Unicode document that uses UCS-2 plus surrogate pairs is said to be in the UTF-16 encoding.
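As an illustration of surrogate pairs, here is how a code point beyond 65,535 becomes two 16-bit code units. The specific character, U+1D11E (the musical symbol G clef), is our own example:

```python
# U+1D11E lies beyond 65,535, so UTF-16 spells it as a surrogate pair.
clef = "\U0001D11E"
data = clef.encode("utf-16-be")
assert len(data) == 4                  # two 16-bit code units, not one
print(data[:2].hex(), data[2:].hex())  # d834 dd1e
```

The first unit, #xD834, falls in the high-surrogate range and the second, #xDD1E, in the low-surrogate range, so a UTF-16 reader knows the two belong together.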

The other two problems, however, are more likely to affect most developers. UTF-8 is an alternative encoding for Unicode that addresses both.

5.5.2 UTF-8

UTF-8 is a variable-length encoding of Unicode. Characters 0 through 127, that is, the ASCII character set, are encoded in 1 byte each, exactly as they would be in ASCII. In ASCII, the byte with value 65 represents the letter A. In UTF-8, the byte with the value 65 also represents the letter A. There is a one-to-one identity mapping from ASCII characters to UTF-8 bytes. Thus, pure ASCII files are also acceptable UTF-8 files.

UTF-8 represents the characters from 128 to 2047, a range that covers the most common non-ideographic scripts, in two bytes each. Characters from 2048 to 65,535, mostly from Chinese, Japanese, and Korean, are represented in three bytes each. Characters with code points above 65,535 are represented in four bytes each. For a file that's mostly Latin text, this effectively halves the file size from what it would be in UCS-2. However, for a file that's primarily Japanese, Chinese, or Korean, the file size can grow by 50%. For most other living languages, the file size is close to the same.
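These byte counts are easy to verify; the sample characters below are our own illustrative picks from each range:

```python
# Byte counts per character in UTF-8, one sample per range.
samples = {
    "A":          1,  # U+0041, ASCII (0-127)
    "é":          2,  # U+00E9, two bytes (128-2047)
    "氣":         3,  # U+6C23, Han ideograph, three bytes (2048-65,535)
    "\U00010000": 4,  # U+10000, four bytes (beyond 65,535)
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected, ch
```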

UTF-8 is probably the most broadly supported encoding of Unicode. For instance, it's how Java .class files store strings; it's the native encoding of the BeOS, and it's the default encoding an XML processor assumes unless told otherwise by a byte-order mark or an encoding declaration. Chances are pretty good that if a program tells you it's saving Unicode, it's really saving UTF-8.

5.6 ISO Character Sets

Unicode has only recently become popular. Previously, the space and processing costs associated with Unicode files prompted vendors to prefer smaller, single-byte character sets that could handle only English and a few other languages of interest, not the full panoply of human language. The International Organization for Standardization (ISO) has standardized 14 of these character sets as ISO standard 8859. In all of these single-byte character sets, characters 0 through 127 are identical to the ASCII character set; characters 128 through 159 are the C1 controls; and characters 160 through 255 are the additional characters needed for scripts such as Greek, Cyrillic, and Turkish.

ISO-8859-1 (Latin-1)

ASCII plus the accented letters and other characters needed for most Latin-alphabet Western European languages, including Danish, Dutch, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish.

ISO-8859-2 (Latin-2)

ASCII plus the accented letters and other characters needed to write most Latin-alphabet Central and Eastern European languages, including Czech, English, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, and Sorbian.

ISO-8859-3 (Latin-3)

ASCII plus the accented letters and other characters needed to write Esperanto, Maltese, and Turkish.

ISO-8859-4 (Latin-4)

ASCII plus the accented letters and other characters needed to write most Baltic languages including Estonian, Latvian, Lithuanian, Greenlandic, and Lappish. Now deprecated. New applications should use 8859-10 (Latin-6) or 8859-13 (Latin-7) instead.

ISO-8859-5

ASCII plus the Cyrillic alphabet used for Russian and many other languages of the former Soviet Union and other Slavic countries, including Bulgarian, Byelorussian, Macedonian, Serbian, and Ukrainian.

ISO-8859-6

ASCII plus basic Arabic. However, the character set doesn't have the extra letters needed for non-Arabic languages written in the Arabic script, such as Farsi and Urdu.

ISO-8859-7

ASCII plus modern Greek. This set does not have the extra letters and accents necessary for ancient and Byzantine Greek.

ISO-8859-8

ASCII plus the Hebrew script used for Hebrew and Yiddish.

ISO-8859-9 (Latin-5)

Essentially the same as Latin-1, except six letters used in Icelandic have been replaced with six letters used in Turkish.

ISO-8859-10 (Latin-6)

ASCII plus accented letters and other characters needed to write most Baltic languages, including Estonian, Icelandic, Latvian, Lithuanian, Greenlandic, and Lappish.

ISO-8859-11

ASCII plus Thai.

ISO-8859-13 (Latin-7)

Yet another attempt to cover the Baltic region properly. Very similar to Latin-6, except for some question marks.

ISO-8859-14 (Latin-8)

ASCII plus the Celtic languages, including Gaelic and Welsh.

ISO-8859-15 (Latin-9, Latin-0)

A revised version of Latin-1 that replaces some unnecessary symbols, such as the fraction ¼, with extra French and Finnish letters. Instead of the international currency sign ¤, this set includes the Euro sign €.

ISO-8859-16 (Latin-10)

A revised version of Latin-2 that works better for Romanian. Other languages supported by this character set include Albanian, Croatian, English, Finnish, French, German, Hungarian, Italian, Polish, and Slovenian.

Various national standards bodies have produced other character sets to cover scripts and languages of interest within their geographic and political boundaries. For example, the Korea Industrial Standards Association developed the KS C 5601-1992 standard for encoding Korean. These national standard character sets can be used in XML documents as well, provided that you include the proper encoding declaration in the document and your parser knows how to translate these character sets into Unicode.

5.7 Platform-Dependent Character Sets

In addition to the standard character sets discussed previously, many vendors have at one time or another produced proprietary character sets to meet the needs of their specific platform. Often, they contain special characters the vendor saw a need for, such as Apple's trademarked open apple or the box-drawing characters used for cell boundaries in early DOS spreadsheets. Microsoft, IBM, and Apple are the three most prolific inventors of character sets. The single most common such set is probably Microsoft's Cp1252, a variant of Latin-1 that replaces the C1 controls with more graphic characters. Hundreds of such platform-dependent character sets are in use today. Documentation for these ranges from excellent to nonexistent.

Platform-specific character sets like these should be used only within a single system. They should never be placed on the wire or used to transfer data between systems. Doing so can lead to nasty surprises in unexpected places. For example, displaying a file that contains some of the extra Cp1252 characters €, ‚, ƒ, „, …, †, ‡, ˆ, ‰, Š, ‹, Œ, Ž, ‘, ’, “, ”, •, –, —, ˜, ™, š, ›, œ, ž, and Ÿ on a VT-220 terminal can effectively disable the screen. Nonetheless, these character sets are in common use and often seen on the Web even when they don't belong there. There's no absolute rule that says you can't use them for an XML document, provided that you include the proper encoding declaration and your parser understands it. The one advantage of using these sets is that existing text editors are likely to be much more comfortable with them than with Unicode and its friends. Nonetheless, we strongly recommend that you avoid them and stick to the documented standards that are much more broadly supported across platforms.

5.7.1 Cp1252

The most common platform-dependent character set, and the one you're most likely to encounter on the Internet, is Cp1252, also (and incorrectly) known as Windows ANSI. This is the default character set used by most American and Western European Windows PCs, which explains its ubiquity. Cp1252 is a single-byte character set almost identical to the standard ISO-8859-1 character set; indeed, many Cp1252 documents are incorrectly labeled as Latin-1 documents. However, this set replaces the C1 controls between code points 128 and 159 with additional graphic characters, such as ‰, ‡, and Ÿ. These characters won't cause problems on other Windows systems. However, other platforms will have difficulty viewing them properly and may even crash in extreme cases. Cp1252 (and its siblings used on non-Western Windows systems) should be avoided.
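A quick sketch shows the divergence; byte 0x89 is one of the code points where the two sets disagree:

```python
# Byte 0x89 is a graphic character in Cp1252 but an invisible
# C1 control in Latin-1.
b = b"\x89"
assert b.decode("cp1252") == "\u2030"  # the per mille sign
assert b.decode("latin-1") == "\x89"   # the C1 control U+0089
```

A document full of such bytes mislabeled as ISO-8859-1 will parse, but every ‰, †, or curly quote silently turns into a control character.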

5.7.2 MacRoman

The Mac OS uses a different nonstandard, single-byte character set that's a superset of ASCII. The version used in the Americas and most of Western Europe is called MacRoman. Variants for other countries include MacGreek, MacHebrew, MacIceland, and so forth. Most Java-based XML processors can make sense out of these encodings if they're properly labeled, but most other non-Macintosh tools cannot.

For instance, if the French sentence "Au cours des dernières années, XML a été adapté dans des domaines aussi divers que l'aéronautique, le multimédia, la gestion des hôpitaux, les télécommunications, la théologie, la vente au détail et la littérature médiévale" is written on a Macintosh and then read on a PC, what the PC user will see is "Au cours des derni?res annŽes, XML a ŽtŽ adaptŽ dans des domaines aussi divers que l'aŽronautique, le multimŽdia, la gestion des h™pitaux, les tŽlŽcommunications, la thŽologie, la vente au dŽtail et la littŽrature mŽdiŽvale," not the same thing at all. Generally, the result is at least marginally intelligible if most of the text is ASCII, but it certainly doesn't lend itself to high fidelity or quality. Mac-specific character sets should also be avoided.
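The mojibake arises mechanically: each MacRoman byte is reinterpreted under Cp1252. A small sketch using Python's mac_roman and cp1252 codecs shows how é becomes Ž:

```python
# The accented é is byte 0x8E in MacRoman; the same byte is the letter
# Ž in Cp1252, which is why "années" degrades into "annŽes".
assert "é".encode("mac_roman") == b"\x8e"
assert b"\x8e".decode("cp1252") == "Ž"
# Byte 0x8F (MacRoman è) has no assigned character in Cp1252 at all,
# which is why è tends to display as a question mark or an empty box.
```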

5.8 Converting Between Character Sets

The ultimate solution to this character-set morass is to use Unicode in either UTF-16 or UTF-8 format for all your XML documents. An increasing number of tools support one of these two formats natively; even the unassuming Notepad in Windows NT 4.0 and 2000 offers an option to save files in Unicode. Microsoft Word 97 and later save the text of their documents in Unicode, though unlike XML documents, Word files are hardly pure text; much of the binary data in a Word file is not Unicode or any other kind of text. However, Word 2000 can save plain text files in Unicode, and it is the authors' Unicode editor of choice these days when we need to type a document in several different, complex scripts. To save a Word 2000 document as plain Unicode text, select Encoded Text from the "Save as type" menu in Word's Save As dialog box, then select one of the four Unicode formats in the resulting File Conversion dialog box.

Nonetheless, most of our current tools are still adapted primarily for vendor-specific character sets that can't handle more than a few languages at one time. Thus, learning how to convert your documents from proprietary to more standard character sets is crucial.

Some of the better XML and HTML editors let you choose the character set you wish to save in and perform automatic conversions from the native character set you use for editing. On Unix, the native character set is likely one of the standard ISO character sets, and you can save into that format directly. On the Mac, you can avoid problems if you stick to pure ASCII documents. On Windows, you can go a little further and use Latin-1, if you're careful to stay away from the extra characters that aren't part of the official ISO-8859-1 specification. Otherwise, you'll have to convert your document from its native, platform-dependent encoding to one of the standard platform-independent character sets.

François Pinard has written an open source character-set conversion tool called recode for Linux and Unix, which you can download from http://www.iro.umontreal.ca/contrib/recode/, as well as from the usual GNU mirror sites. Wojciech Galazka has ported recode to DOS; see ftp://ftp.simtel.net/pub/simtelnet/gnu/djgpp/v2gnu/rcode34b.zip. You can also use the Java Development Kit's native2ascii tool, documented at http://java.sun.com/products/jdk/1.3/docs/tooldocs/win32/native2ascii.html: first convert the file from its native encoding to Java's special ASCII-encoded Unicode format, then use the same tool in reverse to convert from the Java format to the encoding you actually want. For example, to convert the file myfile.xml from the Windows Cp1252 encoding to UTF-8, execute these two commands in sequence:

% native2ascii -encoding Cp1252 myfile.xml myfile.jtx
% native2ascii -reverse -encoding UTF-8 myfile.jtx myfile.xml
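If you have Python at hand, the same conversion can be done in one step. This is a minimal sketch (the function name is ours), and note that it does not rewrite the encoding declaration inside the document, which you must still update by hand:

```python
def recode_file(path, src, dst="utf-8"):
    """Re-encode a text file in place (a sketch; no error recovery).

    This does NOT rewrite the XML encoding declaration inside the
    document; update that separately to match the new encoding.
    """
    with open(path, "r", encoding=src) as f:
        text = f.read()
    with open(path, "w", encoding=dst) as f:
        f.write(text)

# Example: recode_file("myfile.xml", "cp1252")  # Cp1252 -> UTF-8
```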

5.9 The Default Character Set for XML Documents

Before an XML parser can read a document, it must know which character set and encoding the document uses. In some cases, external metainformation tells the parser what encoding the document uses. For instance, an HTTP header may include a Content-type header like this:

Content-type: text/html; charset=ISO-8859-1

However, XML parsers generally can't count on the availability of such information. Even if they can, they can't necessarily assume that it's accurate. Therefore, an XML parser will attempt to guess the character set based on the first several bytes of the document. The main checks the parser makes include the following:

  • If the first two bytes of the document bytes are #xFEFF, then the parser recognizes the bytes as the Unicode byte-order mark. It then guesses that the document is written in the big-endian, UCS-2 encoding of Unicode. With that knowledge, it can read the rest of the document.

  • If the first two bytes of the document are #xFFFE, then the parser recognizes the little-endian form of the Unicode byte-order mark. It now knows that the document is written in the little-endian, UCS-2 encoding of Unicode, and with that knowledge it can read the rest of the document.

  • If the first four bytes of the document are #x3C3F786D, that is, the ASCII characters <?xm, then it guesses that the file is written in a superset of ASCII. In particular, it assumes that the file is written in the UTF-8 encoding of Unicode. Even if it's wrong, this information is sufficient to continue reading the document until it gets to the encoding declaration and finds out what the character set really is.
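Those first-pass checks can be sketched as a short function. A real parser would also handle UCS-4 and EBCDIC signatures and would then refine its guess from the encoding declaration:

```python
def guess_encoding(prefix: bytes) -> str:
    """First-pass guess from a document's opening bytes (a sketch)."""
    if prefix.startswith(b"\xfe\xff"):
        return "UCS-2, big-endian (byte-order mark)"
    if prefix.startswith(b"\xff\xfe"):
        return "UCS-2, little-endian (byte-order mark)"
    if prefix.startswith(b"<?xm"):
        # Some superset of ASCII; read as UTF-8 until the encoding
        # declaration says otherwise.
        return "ASCII superset, assume UTF-8 for now"
    return "UTF-8 (the default)"
```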

Parsers that understand EBCDIC or UCS-4 may also apply similar heuristics to detect those encodings. However, UCS-4 isn't really used yet and is mostly of theoretical interest, and EBCDIC is a legacy family of character sets that shouldn't be used in new documents. Neither of these sets are important in practice.

5.10 Character References

Unicode contains more than 95,000 different characters covering almost all of the world's written languages. Predefining entity references for each of these characters, most of which will never be used in any one document, would impose an excessive burden on XML parsers. Rather than pick and choose which characters are worthy of being encoded as entities, XML goes to the other extreme. It predefines entity references only for the five characters that have special meaning as markup in an XML document: <, >, &, ", and '. All five are ASCII characters that are easy to type in any text editor.

For other characters that may not be accessible from an ASCII text editor, XML lets you use character references. A character reference gives the number of the particular Unicode character it stands for, in either decimal or hexadecimal. Decimal character references look like &#1114;; hexadecimal character references have an extra x after the &#, that is, they look like &#x45A;. Both of these references refer to the same character, њ, the Cyrillic small letter "nje" used in Serbian and Macedonian. For example, suppose you want to include the Greek maxim "σοφός έαυτόν γιγνώσκει" ("The wise man knows himself") in your XML document. However, you only have an ASCII text editor at your disposal. You can replace each Greek letter with the correct character reference, like this:

<maxim>
  &#x3C3;&#x3BF;&#x3C6;&#x3CC;&#x3C2;
  &#x3AD;&#x3B1;&#x3C5;&#x3C4;&#x3CC;&#x3BD;
  &#x3B3;&#x3B9;&#x3B3;&#x3BD;&#x3CE;&#x3C3;&#x3BA;&#x3B5;&#x3B9;
</maxim>
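You need not compute such references by hand. As a sketch, Python's standard xmlcharrefreplace error handler emits decimal character references for anything the target encoding can't represent:

```python
# Generate decimal character references for non-ASCII characters.
maxim = "σοφός έαυτόν γιγνώσκει"
refs = maxim.encode("ascii", "xmlcharrefreplace").decode("ascii")
assert refs.startswith("&#963;&#959;&#966;&#972;&#962;")  # σοφός
print(refs)
```

Note that this produces decimal references (&#963;) rather than the hexadecimal ones (&#x3C3;) shown above; XML accepts both.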

To the XML processor, a document that uses character references for Unicode characters not present in the current encoding is equivalent to a Unicode document in which all character references are replaced by the actual characters to which they refer. In other words, this XML document is the same as the previous one:

<maxim>
  σοφός έαυτόν γιγνώσκει
</maxim>

Character references may be used in element content and attribute values. They may not be used in element and attribute names, processing instruction targets, CDATA sections, comments, or XML keywords, such as DOCTYPE or ELEMENT. They may be used in the DTD in attribute default values and entity replacement text. Tag and attribute names may be written in languages such as Greek, Russian, Arabic, or Chinese, but you must use a character set that allows you to include the appropriate characters natively. You can't insert these characters with character references. For instance, this is well-formed:

<λογος>
  &#x3C3;&#x3BF;&#x3C6;&#x3CC;&#x3C2;
</λογος>

This is not well-formed:

<&#x3BB;&#x3BF;&#x3B3;&#x3BF;&#x3C2;>
  &#x3C3;&#x3BF;&#x3C6;&#x3CC;&#x3C2;
</&#x3BB;&#x3BF;&#x3B3;&#x3BF;&#x3C2;>

There are more than 90,000 Unicode characters you can include in your XML documents with character references. Chapter 26 provides character codes in both decimal and hexadecimal for some of the most useful and widely used alphabetic scripts. The interested reader will find the complete set in The Unicode Standard, Version 3.0 by the Unicode Consortium (Addison-Wesley, 2000). You can also view the code charts online at http://www.unicode.org/charts/.

If you use a particular group of character references frequently, you may find it easier to define them as entities, then refer to the entities instead. Example 5-3 shows a DTD defining the entities you might use to spell out the Greek words in the previous several examples.

Example 5-3. A DTD defining general entity references for several Greek letters
<!ENTITY sigma              "&#x3C3;">
<!ENTITY omicron_with_tonos "&#x3CC;">
<!ENTITY phi                "&#x3C6;">
<!ENTITY omicron            "&#x3BF;">
<!ENTITY final_sigma        "&#x3C2;">
<!ENTITY epsilon_with_tonos "&#x3AD;">
<!ENTITY alpha              "&#x3B1;">
<!ENTITY lambda             "&#x3BB;">
<!ENTITY upsilon            "&#x3C5;">
<!ENTITY tau                "&#x3C4;">
<!ENTITY nu                 "&#x3BD;">
<!ENTITY gamma              "&#x3B3;">
<!ENTITY iota               "&#x3B9;">
<!ENTITY omega_with_tonos   "&#x3CE;">
<!ENTITY kappa              "&#x3BA;">
<!ENTITY epsilon            "&#x3B5;">

These entities can even be used in documents that aren't valid, provided either that the declarations appear in the document's internal DTD subset, which all XML parsers are required to process, or that your parser reads the external DTD subset. By convention, DTD fragments that do nothing but define entities have the three-letter suffix .ent. Generally, these fragments are imported into the document's DTD using external parameter entity references. Example 5-4 shows how the maxim might be written using these entities, assuming they can be found at the relative URL greek.ent.

Example 5-4. The maxim using entity references instead of character references
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE maxim [
  <!ENTITY % greek_alphabet SYSTEM "greek.ent">
  %greek_alphabet;
]>
<maxim>
  &sigma;&omicron_with_tonos;&phi;&omicron;&final_sigma;
  &epsilon_with_tonos;&alpha;&upsilon;&tau;&omicron_with_tonos;&nu;
  &gamma;&iota;&gamma;&nu;&omega_with_tonos;&sigma;&kappa;&epsilon;&iota;
</maxim>
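
Assuming the parser reads the internal DTD subset (as every conforming parser must), an entity declared there expands to exactly the text of its replacement value. A minimal sketch with Python's standard xml.etree module, using a single entity to keep it short:

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!DOCTYPE maxim [
  <!ENTITY sigma "&#x3C3;">
]>
<maxim>&sigma;</maxim>"""

root = ET.fromstring(doc)
print(root.text)  # σ -- &sigma; expanded via the character reference
```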

A few standard entity sets are widely available. The XHTML 1.0 DTD includes three useful sets you can adopt in your own work:

Latin-1 characters, http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent

The non-ASCII characters from 160 up in ISO-8859-1

Special characters, http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent

Letters from ISO-8859-2 (Latin-2) that aren't also in Latin-1, such as Œ, and various punctuation marks, such as the dagger, the Euro sign, and the em dash

Symbols, http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent

The Greek alphabet (though accented characters are missing) and various punctuation marks, mathematical operators, and other symbols commonly used in mathematics

Chapter 26 provides complete charts showing all characters in these entity sets. You can either use these directly from their relatively stable URLs at the W3C or copy them onto your own systems. For example, to use entities from the symbol set in a document, add the following to the document's DTD:

<!ENTITY % HTMLsymbol PUBLIC
   "-//W3C//ENTITIES Symbols for XHTML//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
%HTMLsymbol;

Since these are fairly standard DTDs, they have both Public IDs and URLs. Other groups and individuals have written entity sets you can use similarly, though no canonical collection of entity sets that covers all of Unicode exists. SGML included almost 20 separate entity sets covering Greek, Cyrillic, extended Latin, mathematical symbols, diacritical marks, box-drawing characters, and publishing marks. These aren't a standard part of XML, but several applications including DocBook (http://www.docbook.org/) and MathML (http://www.w3.org/TR/MathML2/chapter6.html#chars_entity-tables) have ported them to XML. MathML also has several useful entity sets containing more mathematical symbols.

5.11 xml:lang

Since XML documents are written in Unicode, XML is an excellent choice for multilingual documents, such as an Arabic commentary on a Greek text (something that couldn't be done with almost any other character set). In such multilingual documents, it's useful to identify in which language a particular section of text is written. For instance, a spellchecker that only knows English shouldn't try to check a French quote.

Each XML element may have an xml:lang attribute that specifies the language in which the content of that element is written. For example, the previous maxim might look like this:

<maxim xml:lang="el">
  &#x3C3;&#x3CC;&#x3C6;&#x3BF;&#x3C2; &#x3AD;&#x3B1;&#x3C5;&#x3C4;&#x3CC;&#x3BD;
  &#x3B3;&#x3B9;&#x3B3;&#x3BD;&#x3CE;&#x3C3;&#x3BA;&#x3B5;&#x3B9;
</maxim>

This identifies it as Greek. The specific code used, el, comes from the Greek word for Greek, ελληνικά.
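
Because the xml prefix is predeclared, namespace-aware APIs report xml:lang under the namespace URI http://www.w3.org/XML/1998/namespace. For instance, reading the attribute back with Python's standard xml.etree module:

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"  # bound to the 'xml' prefix

root = ET.fromstring(
    '<maxim xml:lang="el">&#x3C3;&#x3CC;&#x3C6;&#x3BF;&#x3C2;</maxim>')
print(root.get(f"{{{XML_NS}}}lang"))  # el
```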

5.11.1 Language Codes

The value of the xml:lang attribute should be one of the two-letter language codes defined in ISO-639, Codes for the Representation of Names of Languages (http://lcweb.loc.gov/standards/iso639-2/langhome.html), if such a code exists for the language in question.

For languages that aren't listed in ISO-639, you can use a language identifier registered with IANA; currently, about 20 of these identifiers exist, including i-navajo, i-klingon, and i-lux. The complete list can be found at ftp://ftp.isi.edu/in-notes/iana/assignments/languages/tags. All identifiers begin with i-. For example:

<maxim xml:lang="i-klingon">Heghlu'meH QaQ jajvam</maxim>

If the language you need isn't present in either of these two lists, you can create your own language tag, as long as it begins with the prefix x- or X- to identify it as a user-defined language code. For example, this journal title is written in J. R. R. Tolkien's fictional Quenya language:

<journal xml:lang="x-quenya">Tyalië Tyelelliéva</journal>

5.11.2 Subcodes

For some purposes, knowing the language is not enough. You also need to know the region where the language is spoken. For instance, French has slightly different vocabulary, spelling, and pronunciation in France, Quebec, Belgium, and Switzerland. Although written identically with an ideographic character set, Mandarin and Cantonese are actually quite different, mutually unintelligible dialects of Chinese. The United States and the United Kingdom are jocularly referred to as "two countries separated by a common language."

To handle these distinctions, the language code may be followed by any number of subcodes that further specify the language. Hyphens separate the language code from the subcode and subcodes from each other. If the language code is an ISO-639 code, the first subcode should be one of the two-letter country codes defined by ISO-3166, Codes for the Representation of Names of Countries, found at http://www.ics.uci.edu/pub/ietf/http/related/iso3166.txt. This xml:lang attribute indicates Canadian French:

<p xml:lang="fr-CA">Marie vient pour la fin de semaine.</p>

The language code is usually written in lowercase, and the country code is written in uppercase. However, this is just a convention, not a requirement.
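
If you generate xml:lang values programmatically, the lowercase/uppercase convention is easy to apply. A minimal sketch (normalize_lang_tag is my own helper, not part of any standard library):

```python
def normalize_lang_tag(tag):
    """Normalize 'FR-ca' to 'fr-CA': lowercase language code,
    uppercase first subcode when it is a two-letter country code."""
    parts = tag.split("-")
    parts[0] = parts[0].lower()
    if len(parts) > 1 and len(parts[1]) == 2:
        parts[1] = parts[1].upper()
    return "-".join(parts)

print(normalize_lang_tag("FR-ca"))  # fr-CA
print(normalize_lang_tag("EN"))     # en
```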

5.11.3 ATTLIST Declarations of xml:lang

Although the XML 1.0 specification defines the xml:lang attribute, you still have to declare it in the DTDs of your valid documents. For example, these declarations cover the maxim element used several times in this chapter:

<!ELEMENT maxim (#PCDATA)>
<!ATTLIST maxim xml:lang NMTOKEN #IMPLIED>

Here I've used the NMTOKEN type, since all legal language codes are well-formed XML name tokens.

You may declare the xml:lang attribute in any convenient way. For instance, if you want to require its presence on the maxim element, you could make it #REQUIRED:

<!ATTLIST maxim xml:lang NMTOKEN #REQUIRED>

Or, if you wanted to allow only French and English text in your documents, you might specify it as an enumerated type with a default of English like this:

<!ATTLIST maxim xml:lang (en | fr) 'en'>

Unless you use an enumerated type, the parser will not check that the value you give it follows the rules outlined here. It's your responsibility to make sure you use appropriate language codes and subcodes.
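
Such a check is straightforward to perform after parsing. The sketch below walks a parsed tree and flags xml:lang values outside an application-defined set (the allowed list here is illustrative, not part of any standard):

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ALLOWED = {"en", "fr", "el", "fr-CA"}  # whatever your application accepts

def bad_lang_elements(root):
    """Return elements whose xml:lang value is not in ALLOWED."""
    return [el for el in root.iter()
            if el.get(XML_LANG) is not None and el.get(XML_LANG) not in ALLOWED]

doc = ET.fromstring(
    '<doc><p xml:lang="fr-CA">ok</p><p xml:lang="zz-XX">bad</p></doc>')
print([el.get(XML_LANG) for el in bad_lang_elements(doc)])  # ['zz-XX']
```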

[1]  Privately, some developers are willing to admit that they're preparing for a day when we're part of a Galactic Federation of thousands of intelligent species.

[2]  For reasons that will become apparent shortly, this book has adopted the convention that #x precedes hexadecimal numbers. Every two hexadecimal digits map to one byte.

XML in a Nutshell, 2nd Edition (O'Reilly, 2001; ISBN 0596002920)
