Item 4. Use Standard Entity References | Effective XML: 50 Specific Ways to Improve Your XML

No reasonably sized keyboard could possibly include all the characters in Unicode. U.S. keyboards are especially weak when it comes to typing in foreign languages with unusual accents and non-Latin scripts. XML allows you to use either character references or entity references to address this problem. In general, named entity references like &ecaron; should be preferred to character references like &#x11B; because they're easier on the human beings who have to read the source code.
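As a minimal sketch of the difference, the document below writes the same Czech word twice, once with a named entity reference and once with a character reference. The `words` and `word` element names and the sample text are invented for illustration, and the entity is declared in the internal DTD subset only to keep the example self-contained.

```xml
<?xml version="1.0"?>
<!DOCTYPE words [
  <!-- ecaron is the ISO Latin 2 name for U+011B, small e with caron -->
  <!ENTITY ecaron "&#x11B;">
]>
<words>
  <!-- Named entity reference: the name tells a reader what the character is -->
  <word>m&ecaron;sto</word>
  <!-- Character reference: same character, but only the code point is visible -->
  <word>m&#x11B;sto</word>
</words>
```

A parser reports identical text for both `word` elements; the only difference is how readable the source is.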

While there are no standards for how to name these entity references, there are some useful entity sets bundled with XHTML, DocBook, and MathML. Since these are all modular specifications, you can even use the DTDs that define their entity sets without pulling in the rest of the application. For example, if you just want to use the standard HTML entity references many designers have already memorized, like &copy; and &nbsp;, you could add the following lines to your DTD.

 <!ENTITY nbsp   "&#160;">
 <!ENTITY iexcl  "&#161;">
 <!ENTITY cent   "&#162;">
 <!ENTITY pound  "&#163;">
 <!ENTITY curren "&#164;">
 <!ENTITY yen    "&#165;">
 <!ENTITY brvbar "&#166;">
 <!ENTITY sect   "&#167;">
 <!ENTITY uml    "&#168;">
 <!ENTITY copy   "&#169;">
 ...

Better yet, you could store local copies of the relevant DTDs in the same directory as your own DTD and just point to them.

 <!ENTITY % HTMLlat1 PUBLIC
    "-//W3C//ENTITIES Latin 1 for XHTML//EN"
    "xhtml-lat1.ent">
 %HTMLlat1;

 <!ENTITY % HTMLsymbol PUBLIC
    "-//W3C//ENTITIES Symbols for XHTML//EN"
    "xhtml-symbol.ent">
 %HTMLsymbol;

 <!ENTITY % HTMLspecial PUBLIC
    "-//W3C//ENTITIES Special for XHTML//EN"
    "xhtml-special.ent">
 %HTMLspecial;
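To illustrate how an instance document might pick up these names, here is a rough sketch that loads the Latin-1 set from the internal DTD subset instead of a separate DTD file. The `notice` element and its text are hypothetical, and the example assumes a local copy of xhtml-lat1.ent sits next to the document.

```xml
<?xml version="1.0"?>
<!DOCTYPE notice [
  <!-- Load the XHTML Latin-1 entity set from a local copy (assumed location) -->
  <!ENTITY % HTMLlat1 PUBLIC
     "-//W3C//ENTITIES Latin 1 for XHTML//EN"
     "xhtml-lat1.ent">
  %HTMLlat1;
]>
<!-- nbsp and copy are both declared in the Latin-1 set -->
<notice>Copyright&nbsp;&copy; Example&nbsp;Corp.</notice>
```

Keep in mind that a non-validating parser is not required to read external entities, so whether `&nbsp;` and `&copy;` resolve here depends on the parser and how it is configured.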

If you're using catalogs (Item 47), you can use the public IDs to locate the local caches of these entity sets.
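As a sketch of what that might look like, the OASIS XML Catalog below maps the three XHTML public IDs to local files. The `uri` values are assumptions: they suppose the catalog file lives in the same directory as the cached entity sets, so adjust them to wherever your copies actually are.

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Each entry maps a public ID to a local copy (paths are assumed) -->
  <public publicId="-//W3C//ENTITIES Latin 1 for XHTML//EN"
          uri="xhtml-lat1.ent"/>
  <public publicId="-//W3C//ENTITIES Symbols for XHTML//EN"
          uri="xhtml-symbol.ent"/>
  <public publicId="-//W3C//ENTITIES Special for XHTML//EN"
          uri="xhtml-special.ent"/>
</catalog>
```

With a catalog-aware resolver in place, the system identifiers in the DTD become a fallback and the public IDs do the real work of locating the files.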

Even if you're defining your own entity references for a particular subset of Unicode, I still suggest using the standard names. The HTML names are far and away the most popular, so if there's an HTML name for a certain character, by all means use it. For example, I would never call Unicode character 0xA0, the nonbreaking space, anything other than &nbsp;.
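For instance, a DTD for French-language text might declare only the handful of characters its authors actually need. The selection below is a hypothetical subset chosen for illustration, but every name in it is the standard HTML name for that character.

```xml
<!-- A hand-picked subset for French text, reusing the standard HTML names -->
<!ENTITY nbsp   "&#160;">   <!-- nonbreaking space -->
<!ENTITY laquo  "&#171;">   <!-- left-pointing double angle quotation mark -->
<!ENTITY raquo  "&#187;">   <!-- right-pointing double angle quotation mark -->
<!ENTITY egrave "&#232;">   <!-- small e with grave accent -->
<!ENTITY eacute "&#233;">   <!-- small e with acute accent -->
<!ENTITY euro   "&#8364;">  <!-- euro sign -->
```

Anyone who already knows the HTML names can read and write documents against this DTD without consulting its entity declarations.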

If HTML does not have a standard name for a character, I normally turn to DocBook next. Its entity names are based on the standard SGML entity names. (These are among the SGML features that got dropped out of XML in the process of making it simple enough for mere mortals to use.) They aren't as well known as the HTML entity names, but they are much more comprehensive and are an international standard. The SGML entity names include those listed below.

ISO 8879:1986//ENTITIES Added Math Symbols: Arrow Relations//EN//XML

ISO 8879:1986//ENTITIES Added Math Symbols: Binary Operators//EN//XML

ISO 8879:1986//ENTITIES Added Math Symbols: Delimiters//EN//XML

ISO 8879:1986//ENTITIES Added Math Symbols: Negated Relations//EN//XML

ISO 8879:1986//ENTITIES Added Math Symbols: Ordinary//EN//XML

ISO 8879:1986//ENTITIES Added Math Symbols: Relations//EN//XML

ISO 8879:1986//ENTITIES Box and Line Drawing//EN//XML

ISO 8879:1986//ENTITIES Russian Cyrillic//EN//XML

ISO 8879:1986//ENTITIES Non-Russian Cyrillic//EN//XML

ISO 8879:1986//ENTITIES Diacritical Marks//EN//XML

ISO 8879:1986//ENTITIES Greek Letters//EN//XML

ISO 8879:1986//ENTITIES Monotoniko Greek//EN//XML

ISO 8879:1986//ENTITIES Greek Symbols//EN//XML

ISO 8879:1986//ENTITIES Alternative Greek Symbols//EN//XML

ISO 8879:1986//ENTITIES Added Latin 1//EN//XML

ISO 8879:1986//ENTITIES Added Latin 2//EN//XML

ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN//XML

ISO 8879:1986//ENTITIES Publishing//EN//XML

ISO 8879:1986//ENTITIES General Technical//EN//XML
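Any of these sets can be pulled into a DTD the same way as the XHTML sets. The fragment below is a sketch for the Added Latin 2 set, which defines names such as ecaron and ccaron. The parameter entity name `ISOlat2` is arbitrary, and the system identifier `iso-lat2.ent` is an assumption about what your local copy is called; the file name varies between DocBook releases.

```xml
<!-- Load the ISO Added Latin 2 names from a local copy (file name assumed) -->
<!ENTITY % ISOlat2 PUBLIC
   "ISO 8879:1986//ENTITIES Added Latin 2//EN//XML"
   "iso-lat2.ent">
%ISOlat2;
```

If you use catalogs, the public ID is what matters, and the system identifier serves only as a fallback.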

Finally, for mathematically oriented characters, I turn to MathML 2.0. Its entity sets, found at http://www.w3.org/TR/MathML2/chapter6.html, cover more of the special characters that Unicode 3.0 and later define for mathematics than the pure SGML mathematical entity sets do.