5.7 Platform-Dependent Character Sets


In addition to the standard character sets discussed previously, many vendors have at one time or another produced proprietary character sets to meet the needs of their specific platforms. These sets often contain special characters the vendor saw a need for, such as Apple's trademarked open apple or the box-drawing characters used for cell boundaries in early DOS spreadsheets. Microsoft, IBM, and Apple are the three most prolific inventors of character sets. The single most common such set is probably Microsoft's Cp1252, a variant of Latin-1 that replaces the C1 controls with additional graphic characters. Hundreds of such platform-dependent character sets are in use today. Documentation for these ranges from excellent to nonexistent.

Platform-specific character sets like these should be used only within a single system. They should never be placed on the wire or used to transfer data between systems. Doing so can lead to nasty surprises in unexpected places. For example, displaying a file that contains some of the extra Cp1252 characters ‹, ‰, ˆ, ƒ, „, †, …, ‡, œ, Œ, ‘, ’, “, ”, –, Ÿ, š, ›, and ˜ on a VT-220 terminal can effectively disable the screen. Nonetheless, these character sets are in common use and often seen on the Web, even when they don't belong there. There's no absolute rule that says you can't use them for an XML document, provided that you include the proper encoding declaration and your parser understands it. The one advantage to using these sets is that existing text editors are likely to be much more comfortable with them than with Unicode and its friends. Even so, we strongly recommend that you avoid them and stick to the documented standards, which are much more broadly supported across platforms.
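One way to follow this advice is to transcode legacy text into a standard encoding before it leaves the system. The following is a minimal Python sketch, not a definitive implementation; the function name `cp1252_to_utf8` and the sample bytes are hypothetical, chosen only for illustration, and the sketch assumes the input really is Cp1252.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Reinterpret Cp1252 bytes as text, then re-encode the text as UTF-8.

    Hypothetical helper: it assumes the caller knows the input is Cp1252.
    Bytes that Cp1252 leaves undefined raise UnicodeDecodeError.
    """
    return data.decode("cp1252").encode("utf-8")


# Sample bytes using several of the extra Cp1252 characters:
# 0x93/0x94 are curly double quotes, 0x96 an en dash, 0x89 a per mille sign.
legacy = b"\x93XML\x94 \x96 100\x89"
utf8 = cp1252_to_utf8(legacy)

print(utf8.decode("utf-8"))
```

Once transcoded, the document can carry a UTF-8 encoding declaration (or none at all, since UTF-8 is one of the XML defaults), and the receiving system does not need to know anything about Cp1252.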

5.7.1 Cp1252

The most common platform-dependent character set, and the one you're most likely to encounter on the Internet, is Cp1252, also (and incorrectly) known as Windows ANSI. This is the default character set used by most American and Western European Windows PCs, which explains its ubiquity. Cp1252 is a single-byte character set almost identical to the standard ISO-8859-1 character set; indeed, many Cp1252 documents are incorrectly labeled as Latin-1 documents. However, this set replaces the C1 controls between code points 128 and 159 with additional graphics characters, such as ‰, ‡, and Ÿ. These characters won't cause problems on other Windows systems. However, other platforms will have difficulty viewing them properly and may even crash in extreme cases. Cp1252 (and its siblings used in non-Western Windows systems) should be avoided.
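The difference between the two sets can be checked directly. The sketch below uses Python's built-in `cp1252` and `latin-1` codecs; the variable names are illustrative. It decodes the same three bytes both ways and then lists every byte value on which the two sets disagree.

```python
import unicodedata

# Three byte values from the 128-159 range.
sample = bytes([0x89, 0x87, 0x9F])

as_cp1252 = sample.decode("cp1252")   # graphic characters
as_latin1 = sample.decode("latin-1")  # C1 control characters

print(as_cp1252)
print([unicodedata.category(ch) for ch in as_latin1])  # "Cc" means control

# Compare the two sets byte by byte. A few byte values are undefined in
# Python's cp1252 codec, so errors="replace" stands in for them rather
# than raising an exception.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace")
    != bytes([b]).decode("latin-1")
]
print(differing[0], differing[-1], len(differing))
```

Every disagreement falls between 128 and 159, which is why a Cp1252 document mislabeled as Latin-1 looks correct until one of these characters appears.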

5.7.2 MacRoman

The Mac OS uses a different, nonstandard, single-byte character set that's a superset of ASCII. The version used in the Americas and most of Western Europe is called MacRoman. Variants for other countries include MacGreek, MacHebrew, MacIceland, and so forth. Most Java-based XML processors can make sense out of these encodings if they're properly labeled, but most other non-Macintosh tools cannot.

For instance, if the French sentence "Au cours des dernières années, XML a été adapté dans des domaines aussi diverse que l'aéronautique, le multimédia, la gestion de hôpitaux, les télécommunications, la théologie, la vente au détail et la littérature médiévale" is written on a Macintosh and then read on a PC, what the PC user will see is "Au cours des derni?res annŽes, XML a ŽtŽ adaptŽ dans des domaines aussi diverse que l'aŽronautique, le multimŽdia, la gestion de h™pitaux, les tŽlŽcommunications, la thŽologie, la vente au dŽtail et la littŽrature mŽdiŽvale", not the same thing at all. Generally, the result is at least marginally intelligible if most of the text is ASCII, but it certainly doesn't lend itself to high fidelity or quality. Mac-specific character sets should also be avoided.
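This kind of damage is easy to reproduce. The sketch below assumes the PC interprets the bytes as Cp1252 and uses Python's built-in `mac_roman` and `cp1252` codecs; the variable names are illustrative, and the exact characters a real PC displays depend on the software doing the decoding.

```python
original = "Au cours des dernières années"

# Bytes as a MacRoman-based editor would save them.
mac_bytes = original.encode("mac_roman")

# The same bytes read as Cp1252. The byte MacRoman uses for "è" is
# undefined in Python's cp1252 codec, so errors="replace" substitutes
# U+FFFD for it instead of raising an exception.
seen_on_pc = mac_bytes.decode("cp1252", errors="replace")
print(seen_on_pc)

# Decoding with the correct label recovers the text exactly.
print(mac_bytes.decode("mac_roman"))
```

The ASCII letters survive because both sets are supersets of ASCII; only the bytes above 127 are reinterpreted.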

XML in a Nutshell, Third Edition
ISBN: 0596007647
Year: 2003
Pages: 232
