Character and Entity References | Effective XML: 50 Specific Ways to Improve Your XML

Entity and character references are also often abused. Some XML parser APIs let you see which entity a given character came from, but not all do, and in SAX and DOM not every parser implements this ability. You shouldn't rely on it, and no parser will tell you whether a character came from raw text or from a character reference.

The classic example of what not to do here is to mix XML's escaping mechanisms with your application's escaping mechanism. For instance, an application could specify that a string of text beginning with a literal dollar sign ($, Unicode character 36) is a variable reference. For example, the following Para element includes a variable reference.

 <Para>Hello $name</Para>

This is fine. However, it does require some way to escape the dollar sign when it's used as just a dollar sign. I've occasionally seen applications that attempt to use XML character references for such escaping. For example, this would not be a variable reference.

 <Para>Hello &#36;name</Para>

This is a bad design that makes it impossible to parse these documents correctly with standard APIs like SAX and DOM or standard parsers like Crimson and Ælfred because they won't distinguish between a literal $ and &#36;. Instead, a custom parser is required. This makes development much harder than it needs to be.
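The problem is easy to observe with any conforming parser. The following minimal sketch uses Python's standard xml.etree.ElementTree module, chosen here only for illustration (the text itself names SAX, DOM, Crimson, and Ælfred). It parses both forms of the Para element and compares the text the parser reports.

```python
import xml.etree.ElementTree as ET

# One document types the dollar sign directly; the other uses a
# character reference. The parser resolves the reference before
# the application ever sees the text.
literal = ET.fromstring("<Para>Hello $name</Para>")
escaped = ET.fromstring("<Para>Hello &#36;name</Para>")

print(repr(literal.text))
print(repr(escaped.text))
print(literal.text == escaped.text)
```

Both elements report the same string, so code working above the parser has nothing left with which to tell them apart.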

The mistake is tying application-level semantics (how to tell what's a variable and what isn't) to syntactic aspects of the document that the parser hides. The correct approach is to define a new escaping mechanism that's visible above the XML parser layer instead of below it. For example, you could declare that all variables begin with a $, whichever way that character was typed. However, a double dollar sign would be converted to a single plain text dollar sign. For example, these Para elements would each contain a variable reference.

 <Para>Hello $name</Para>
 <Para>Hello &#36;name</Para>

However, these two would not.

 <Para>Hello $$name</Para>
 <Para>Hello &#36;&#36;name</Para>
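One way such a rule might be implemented is sketched below in Python. The function names expand and expand_para, the rule for what counts as a variable name, and the choice to leave unknown variables untouched are all assumptions made for this illustration. The escaping is applied to the text the parser reports, so it behaves the same however the dollar signs were typed.

```python
import re
import xml.etree.ElementTree as ET

# "$$" is an escaped dollar sign; "$" followed by a name is a variable.
# The name syntax (letters, digits, underscore) is an assumption.
_TOKEN = re.compile(r"\$\$|\$([A-Za-z_][A-Za-z0-9_]*)")


def expand(text, variables):
    """Apply the application-level escaping rule to already-parsed text."""
    def replace(match):
        if match.group(0) == "$$":
            return "$"  # doubled dollar sign becomes one literal dollar sign
        # Unknown variables are left as written (a hypothetical policy).
        return variables.get(match.group(1), match.group(0))
    return _TOKEN.sub(replace, text)


def expand_para(xml_source, variables):
    """Parse a Para element and expand variables in its text content."""
    para = ET.fromstring(xml_source)
    return expand(para.text or "", variables)


values = {"name": "World"}
for source in (
    "<Para>Hello $name</Para>",
    "<Para>Hello &#36;name</Para>",
    "<Para>Hello $$name</Para>",
    "<Para>Hello &#36;&#36;name</Para>",
):
    print(expand_para(source, values))
```

The first two documents produce a substituted variable and the last two produce a literal dollar sign followed by the word name. Python's own string.Template class happens to follow the same doubled-dollar convention.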

Design your processing software and XML applications so that they depend only on those aspects of XML that parsers reliably report: element boundaries, text content, attribute values, and processing instructions. Do not write markup that depends on syntax that the parser may resolve before reporting to the client application: CDATA sections, entity references, attribute order, character references, comments, whether attributes are defaulted from the DTD or included in the instance document, and so on. You may indeed be able to write software that supports such lower-level syntax using one particular parser or API. However, you won't be able to validate it with standard schema languages, and I guarantee that you'll confuse document authors who won't always follow your rules. Worst of all, many and perhaps most XML parsers and APIs won't be able to fully process your documents, even if you can. Build applications on top of the structure layer, and let the parser do the hard work of sorting out the syntax.
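A few of these lower-level distinctions can be checked directly. The sketch below again assumes Python's xml.etree.ElementTree as the parser, and the element, attribute, and entity names are invented for the example. Other parsers and APIs may expose more or less of this syntax, which is exactly why it is unsafe to depend on it.

```python
import xml.etree.ElementTree as ET

# A CDATA section and escaped text are reported as the same content.
cdata = ET.fromstring("<Para><![CDATA[a < b]]></Para>")
escaped = ET.fromstring("<Para>a &lt; b</Para>")
print(cdata.text == escaped.text)

# Attribute order carries no information; the attribute sets compare equal.
first = ET.fromstring('<Para id="p1" lang="en"/>')
second = ET.fromstring('<Para lang="en" id="p1"/>')
print(first.attrib == second.attrib)

# An internal entity reference is replaced by its text before the
# application sees it. The entity name "co" is hypothetical.
with_entity = ET.fromstring(
    '<!DOCTYPE Para [<!ENTITY co "Example Corp">]><Para>&co;</Para>'
)
plain = ET.fromstring("<Para>Example Corp</Para>")
print(with_entity.text == plain.text)
```

In each pair the application receives the same structure, so any meaning attached to the difference would be lost.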