Item 2. Mark Up with ASCII if Possible | Effective XML: 50 Specific Ways to Improve Your XML

Despite the rapid growth of Unicode in the last few years, the sad fact is that many text editors and other tools are still tied to platform- and nationality-dependent character sets such as Windows-1252, MacRoman, and SJIS. The only characters all these sets have in common are the 128 ASCII letters, digits, punctuation marks, and control characters. These characters are the only ones that can be reliably displayed and edited across the wide range of computers and software in use today. Thus, if it's not too big a problem, try to limit your markup to the ASCII character set. If you're writing in English, this is normally not a problem.
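The claim that only ASCII is shared can be checked with a minimal Python sketch. It assumes Python's built-in `cp1252`, `mac_roman`, and `shift_jis` codecs are reasonable stand-ins for the three character sets named above; the helper `bytes_by_encoding` is a hypothetical name used only for illustration.

```python
# Compare how the same name is stored under three legacy character sets.
# Assumption: these Python codecs stand in for Windows-1252, MacRoman, and SJIS.
ENCODINGS = ["cp1252", "mac_roman", "shift_jis"]

def bytes_by_encoding(name):
    """Map each encoding to the bytes it produces, or None if unencodable."""
    result = {}
    for enc in ENCODINGS:
        try:
            result[enc] = name.encode(enc)
        except UnicodeEncodeError:
            result[enc] = None  # this character set has no such character
    return result

ascii_name = bytes_by_encoding("Numero")
accented_name = bytes_by_encoding("Num\u00e9ro")  # "Numero" with an acute accent

# The ASCII name is byte-for-byte identical in every encoding.
print(len(set(ascii_name.values())) == 1)
# The accented name is stored differently, or may not be storable at all.
print(accented_name)
```

An ASCII-only name gives the same bytes everywhere, so any tool can read it back. The accented name gets a different byte under Windows-1252 than under MacRoman, which is exactly the mismatch the rest of this item describes.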

On the other hand, this principle is not written in stone, especially if you're not working in English. If you're writing a simple vocabulary for a local French bank without any international ambitions, you will probably want to include all the accents commonly used in French for words like relevé (statement) and numéro (number). For instance, a bank statement might look like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Relevé xmlns="http://namespaces.petitebanque.com/">
  <Banque>PetiteBanque</Banque>
  <Compte>
    <Numéro>00003145298</Numéro>
    <Type>épargne</Type>
    <Propriétaire>Jean Deaux</Propriétaire>
  </Compte>
  <Date>2003-30-02</Date>
  <SoldeDOuverture>5266.34</SoldeDOuverture>
  <Transaction type="dépôt">
    <Date>2003-02-07</Date>
    <Somme>300.00</Somme>
  </Transaction>
  <Transaction type="transfert">
    <Compte>
      <Numéro>0000271828</Numéro>
      <Type>courant</Type>
      <Propriétaire>Jean Deaux</Propriétaire>
    </Compte>
    <Date>2003-02-07</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <Transaction type="dépôt">
    <Date>2003-02-15</Date>
    <Somme>512.32</Somme>
  </Transaction>
  <Transaction type="retrait">
    <Date>2003-02-15</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <Transaction type="retrait">
    <Date>2003-02-25</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <SoldeDeFermeture>5478.64</SoldeDeFermeture>
</Relevé>

However, this code is likely to cause trouble if the document ever crosses national or linguistic boundaries. For instance, if programmers at the bank's Athens branch open the same document in a text editor, they're likely to see something like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Relevι xmlns="http://namespaces.petitebanque.com/">
  <Banque>PetiteBanque</Banque>
  <Compte>
    <Numιro>00003145298</Numιro>
    <Type>ιpargne</Type>
    <Propriιtaire>Jean Deaux</Propriιtaire>
  </Compte>
  <Date>2003-30-02</Date>
  <SoldeDOuverture>5266.34</SoldeDOuverture>
  <Transaction type="dιpτt">
    <Date>2003-02-07</Date>
    <Somme>300.00</Somme>
  </Transaction>
  <Transaction type="transfert">
    <Compte>
      <Numιro>0000271828</Numιro>
      <Type>courant</Type>
      <Propriιtaire>Jean Deaux</Propriιtaire>
    </Compte>
    <Date>2003-02-07</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <Transaction type="dιpτt">
    <Date>2003-02-15</Date>
    <Somme>512.32</Somme>
  </Transaction>
  <Transaction type="retrait">
    <Date>2003-02-15</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <Transaction type="retrait">
    <Date>2003-02-25</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <SoldeDeFermeture>5478.64</SoldeDeFermeture>
</Relevι>

The e's with acute accents have morphed into iotas, and the o's with circumflexes have turned into lowercase taus.

Indeed, even crossing platform boundaries within the same country may cause problems. Were the same document opened on a Mac, the developers would likely see something like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<RelevÈ xmlns="http://namespaces.petitebanque.com/">
  <Banque>PetiteBanque</Banque>
  <Compte>
    <NumÈro>00003145298</NumÈro>
    <Type>Èpargne</Type>
    <PropriÈtaire>Jean Deaux</PropriÈtaire>
  </Compte>
  <Date>2003-30-02</Date>
  <SoldeDOuverture>5266.34</SoldeDOuverture>
  <Transaction type="dÈpÙt">
    <Date>2003-02-07</Date>
    <Somme>300.00</Somme>
  </Transaction>
  <Transaction type="transfert">
    <Compte>
      <NumÈro>0000271828</NumÈro>
      <Type>courant</Type>
      <PropriÈtaire>Jean Deaux</PropriÈtaire>
    </Compte>
    <Date>2003-02-07</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <Transaction type="dÈpÙt">
    <Date>2003-02-15</Date>
    <Somme>512.32</Somme>
  </Transaction>
  <Transaction type="retrait">
    <Date>2003-02-15</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <Transaction type="retrait">
    <Date>2003-02-25</Date>
    <Somme>200.00</Somme>
  </Transaction>
  <SoldeDeFermeture>5478.64</SoldeDeFermeture>
</RelevÈ>

In this case, the lowercase e's with acute accents have changed to uppercase E's with grave accents, and the o's with circumflexes have become uppercase U's with grave accents. This isn't quite as bad, but it's more than enough to confuse most software applications and not a few people.
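Both garbled views can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the `saved` bytes are a shortened fragment of the statement rather than the whole document, `view_as` is a hypothetical helper, and Python's `iso-8859-7` and `mac_roman` codecs stand in for what a Greek editor and a classic Mac editor would assume.

```python
# Bytes as a French author would save them in ISO-8859-1 (Latin-1).
saved = '<Relev\u00e9><Transaction type="d\u00e9p\u00f4t"/></Relev\u00e9>'.encode("iso-8859-1")

def view_as(data, assumed_encoding):
    """Decode bytes the way an editor assuming `assumed_encoding` would."""
    return data.decode(assumed_encoding)

print(view_as(saved, "iso-8859-1"))  # as written: <Relevé><Transaction type="dépôt"/></Relevé>
print(view_as(saved, "iso-8859-7"))  # Greek view: <Relevι><Transaction type="dιpτt"/></Relevι>
print(view_as(saved, "mac_roman"))   # Mac view:   <RelevÈ><Transaction type="dÈpÙt"/></RelevÈ>
```

The bytes on disk never change. Only the table used to turn byte 0xE9 and byte 0xF4 into characters differs, which is why the ASCII parts of each name survive and the accented letters do not.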

The encoding declaration fixes these issues for XML-aware tools such as parsers and XML editors. However, it doesn't help non-XML-aware systems like plain text editors and regular expressions. (See Item 29.) There's a lot of flaky code out there in the world in less than perfect systems.
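The difference between XML-aware and non-XML-aware tools can be sketched in Python. The document below is a shortened, namespace-free version of the statement, and the regular expression stands in for a script typed in a UTF-8 environment; both are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A shortened statement, saved as ISO-8859-1 bytes with a matching declaration.
doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<Relev\u00e9><Num\u00e9ro>00003145298</Num\u00e9ro></Relev\u00e9>"
).encode("iso-8859-1")

# An XML parser reads the declaration and decodes the names correctly.
root = ET.fromstring(doc)
print(root.tag == "Relev\u00e9")  # True

# A byte-level regular expression typed in a UTF-8 environment does not:
# it searches for the UTF-8 bytes of the accented e, which never occur here.
pattern = re.compile("<Num\u00e9ro>".encode("utf-8"))
print(pattern.search(doc) is None)  # True
```

The parser succeeds because it honors the encoding declaration. The regular expression knows nothing about the declaration, so it silently finds nothing. With an ASCII element name such as `Numero`, the same pattern would match regardless of which of these encodings either side used.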

Using Unicode instead of ISO-8859-1 for the character set goes a long way toward fixing this particular problem. (See Item 38.) However, this opens up several lesser but still significant problems:

Many text editors can't handle Unicode. Even editors that can often have trouble recognizing Unicode documents.
The only glyphs (graphical representations of characters from a particular font) you can rely on being available across a variety of systems cover the ASCII range. Even a system that can process all Unicode characters may not be able to display them.
Keyboards don't have the right keys for more than a few languages. The basic ASCII characters are the only characters likely to be available anywhere in the world. Indeed, even a few ASCII characters are problematic. I once purchased a French keyboard in Montreal that did not have a single quote key.
The string facilities of many languages and operating systems implicitly assume single-byte characters. This includes the char data type in C. Java is slightly better because its char type is 16 bits wide, but a single char still can't represent every Unicode character.
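The last point in this list can be made concrete by counting storage units. The sketch below is in Python rather than C or Java, purely for brevity; `storage_units` is a hypothetical helper, and the three sample characters are chosen for illustration.

```python
# Three characters of increasing "width": ASCII, Latin-1 range, beyond the BMP.
samples = ["e", "\u00e9", "\U0001D11E"]  # e, e with acute accent, musical G clef

def storage_units(ch):
    """Return (UTF-8 bytes, UTF-16 code units) needed for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units

for ch in samples:
    print(f"U+{ord(ch):04X}", storage_units(ch))
# U+0065 (1, 1)
# U+00E9 (2, 1)
# U+1D11E (4, 2)
```

Code that assumes one byte per character already breaks on the accented e in UTF-8. Code that assumes one 16-bit unit per character breaks on the clef, which needs two. Only the ASCII letter is safe under both assumptions.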

None of these problems are insurmountable. Programmers' editors that properly handle Unicode are available for almost all systems of interest. You can purchase or download fonts that cover most Unicode blocks, though you may have to mix and match several fonts to get full coverage. Input methods, multi-key combinations, and graphical keymaps allow authors to type accented characters and ideographic characters even on U.S. keyboards. It is possible to write Unicode-savvy Java, C, and Perl code provided you have a solid understanding of Unicode and know exactly where those languages' usual string and character types are inadequate. Just be aware that if you do use non-ASCII characters for your markup, these issues will arise.

One final caveat: I am primarily concerned with markup here, that is, element and attribute names. I am not talking about element content and attribute values. To the extent that such content is written in a natural human language, that content really needs to be written in that language with all its native characters intact. For instance, if a customer's name is Thérèse Barrière, it should be written as Thérèse Barrière, not Therese Barriere. (In some jurisdictions there are even laws requiring this.) Non-ASCII content raises many of the same issues as non-ASCII markup. However, the need for non-ASCII characters is greater here, and the problems aren't quite as debilitating.
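The distinction between markup and content lends itself to an automated check. The following is one possible sketch, not a definitive tool: `local_name` and `non_ascii_markup` are hypothetical helpers, and the `Client`, `Prénom`, and `Nom` elements are invented for the example. It reports element and attribute names that contain non-ASCII characters while leaving element content and attribute values alone.

```python
import xml.etree.ElementTree as ET

def local_name(qualified):
    """Strip ElementTree's '{namespace-uri}' prefix from a name."""
    return qualified.rsplit("}", 1)[-1]

def non_ascii_markup(xml_bytes):
    """Return the sorted element and attribute names containing non-ASCII characters."""
    found = set()
    for element in ET.fromstring(xml_bytes).iter():
        names = [element.tag, *element.attrib]
        found.update(n for n in map(local_name, names) if not n.isascii())
    return sorted(found)

# Hypothetical sample: one accented element name, accented content throughout.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<Client><Pr\u00e9nom>Th\u00e9r\u00e8se</Pr\u00e9nom><Nom>Barri\u00e8re</Nom></Client>"
).encode("utf-8")

print(non_ascii_markup(sample))  # ['Prénom']
```

Only the element name is flagged. The customer's name keeps its accents, as it should. Note that `str.isascii` requires Python 3.7 or later.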

While the situation is improving slowly, for the time being, documents will be more easily processed in an international, heterogeneous environment if they contain only ASCII characters. ASCII is a lowest common denominator and a very imperfect one at that. However, it is the lowest common denominator. In the spirit of being liberal in what you accept but conservative in what you generate, you should use ASCII when possible. If the text you're marking up is written in any language other than English, you'll almost certainly have to use other character sets. Just don't choose to do so gratuitously. For example, don't pick ISO-8859-1 (Latin-1) just so you can tag a curriculum vitae with <résumé> instead of <resume>.