Item 38. Write in Unicode | Effective XML: 50 Specific Ways to Improve Your XML

You may work in English, but these days it's no great surprise if some of your coworkers or customers are more comfortable in French, Chinese, or Amharic. One of the most underrated advantages of XML is its internationalization support. Much of this is a direct result of its dependence on Unicode. In effect, every XML document is read in Unicode. Even if the document is written in a different character set such as ISO-8859-1 or SJIS, the parser converts it to Unicode on input. Thus it behooves you to know how to properly process Unicode data.
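To make the conversion concrete, here is a minimal sketch in Python 3 using the standard library's xml.etree.ElementTree. The element name and sample text are invented for illustration. The same document is serialized twice, once in ISO-8859-1 and once in UTF-8. The bytes differ, but the program sees the same Unicode string after parsing.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content; the element name is made up for this sketch.
body = "<name>François</name>"

# The same document stored in two different encodings.
latin1_doc = ("<?xml version='1.0' encoding='ISO-8859-1'?>" + body).encode("iso-8859-1")
utf8_doc = ("<?xml version='1.0' encoding='UTF-8'?>" + body).encode("utf-8")

# The parser reads the encoding declaration and hands back Unicode text.
latin1_name = ET.fromstring(latin1_doc).text
utf8_name = ET.fromstring(utf8_doc).text

print(latin1_doc == utf8_doc)    # the bytes on disk differ
print(latin1_name == utf8_name)  # the parsed strings are identical
```

Code that consumes the parsed text never needs to know which encoding the file was stored in.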

How difficult this is varies greatly from one language or environment to the next. In Python 2.2 it's relatively easy. In Java it's not too hard, but there are some pitfalls laid out to trap the unwary. In Perl 5.0 it's nearly impossible, but more recent versions of Perl are much better, especially Perl 5.8 and later. In C and C++, Unicode normally requires types other than string and char. In many cases (including C, C++, Python, Perl, and Java) a lot depends on exactly which version of the language you're using. In general, you should strive to use the most recent version of the language if at all possible. In all cases I'm aware of, the more recent version always has Unicode support that's as good as or better than the earlier versions.
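One reason byte-oriented types cause trouble is that a single character may occupy several bytes once it is encoded. The following sketch, written for Python 3 rather than the 2.2 release mentioned above, uses an arbitrary sample word to show the difference between counting characters and counting bytes, and what happens when byte data is cut in the middle of a character.

```python
# Arbitrary sample text: three characters, each needing three bytes in UTF-8.
word = "日本語"
utf8_bytes = word.encode("utf-8")

char_count = len(word)        # counts Unicode code points
byte_count = len(utf8_bytes)  # counts encoded bytes

# Slicing the bytes at an arbitrary offset can split a character in half.
try:
    utf8_bytes[:4].decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False

print(char_count, byte_count, truncation_ok)
```

Code that treats text as a plain array of bytes is prone to exactly this kind of truncation error, which is why a Unicode-aware string type matters.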

Note

This is not in conflict with Item 2, Mark Up with ASCII if Possible. ASCII is still the best choice for markup (that is, element names, attribute names, and so on), especially markup that needs to be shared among many different developers with many cultures and languages. The simple fact is that English and ASCII are the lowest common denominator for technical communication around the world.

However, the situation is very different for content; that is, for PCDATA and attribute values. Here, the text must be highly localized. For example, consider the MegaBank Statement Markup Language one more time. If the bank operates internationally, it may need to transmit information back and forth between branches in France, the United States, Japan, China, Brazil, and many other countries. Programs that process this data work more effectively if the structure and markup of the documents don't vary from country to country. The markup should be the same across national boundaries.

On the other hand, each individual document is probably local to a particular country. The information changes from one customer to the next. M. Béla Delanoë of Lyons should receive a statement that shows his name with all the accents in place. Iwahashi-san of Tokyo should receive a message written in Kanji, not Romaji. The content needs to be localized. By far the easiest way to do that for a worldwide audience is to use Unicode.
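The split between shared ASCII markup and localized Unicode content might look something like the following sketch. The element names (Statement, Customer, City) and the build_statement helper are hypothetical, not the actual MegaBank vocabulary, and the customer values are illustrative only. The point is that one function with one set of ASCII element names serves every branch, while the text it carries can be in any script.

```python
import xml.etree.ElementTree as ET

def build_statement(customer_name, city):
    """Build a tiny statement document (hypothetical element names)."""
    # The markup is plain ASCII and identical for every country.
    statement = ET.Element("Statement")
    ET.SubElement(statement, "Customer").text = customer_name
    ET.SubElement(statement, "City").text = city
    # Serialize as UTF-8 so any script survives the trip.
    return ET.tostring(statement, encoding="utf-8")

# Illustrative values only: one Japanese customer, one French customer.
tokyo_doc = build_statement("岩橋", "東京")
lyons_doc = build_statement("Béla Delanoë", "Lyon")

for doc in (tokyo_doc, lyons_doc):
    root = ET.fromstring(doc)
    print(root.tag, root.find("Customer").text, root.find("City").text)
```

A program that processes these statements can look for the same Customer element regardless of the branch that produced the document.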