Hack 92 Use Elements Instead of Entities to Avoid the "amp Explosion Problem"

Use replaceable elements as a solution to the "amp explosion problem."

Search for the string & using your favorite search engine. Then search for the string &amp; and then &amp;amp; and so on. You will get lots of hits and see lots of interesting text. Here are some examples I found:

Why Choose Auto &amp;amp;amp;amp;amp; Home Insurance
po&amp;amp;amp;eacute;sie, nouvelles, th&amp;amp;amp;eacute;&amp;amp;amp;acirc;tre

These strange incantations can be traced back to the entity structure of XML (and SGML before it). Simply put, XML provides a number of ways in which textual units, known as entities [Hack #25], can be spliced into other textual units by an XML parser. The mechanism involves referring to these entities by name. The name is preceded by an ampersand character and followed by a semicolon.

Some of these entities are built into XML itself and thus are built into every XML parser. The five built-in entities (see Table 7-1) provide ways of encoding characters that would otherwise have special meaning to an XML parser because of their roles in markup.

Table 7-1. XML predefined entities
Entity reference	Description
`&lt;`	Less-than sign (`<`)
`&gt;`	Greater-than sign (`>`)
`&apos;`	Apostrophe (`'`)
`&quot;`	Quotation mark (`"`)
`&amp;`	Ampersand (`&`)

The troublesome entity here is the ampersand. Note that the escaped version of it, &amp;, itself contains an ampersand character, the very character we are trying to escape. This self-referencing ampersand is the source of the trouble illustrated in the two examples shown earlier. Unless you are very careful in XML processing (especially multistage XML processing), it is very easy to get your ampersand escaping into a muddle.

The simplest muddle is illustrated here again:

Why Choose Auto &amp;amp;amp;amp;amp; Home Insurance

The base text most probably started out as:

Why Choose Auto & Home Insurance

A program probably performed a global search and replace operation to escape the ampersand, yielding:

Why Choose Auto &amp; Home Insurance

So far, so good. Later, however, another application (or a second invocation of the first application) performed the same operation again, yielding:

Why Choose Auto &amp;amp; Home Insurance

This happened three more times to yield the final text:

Why Choose Auto &amp;amp;amp;amp;amp; Home Insurance

Note that the text is well-formed XML syntax every step of the way, which makes detecting this problem more difficult.
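The sequence above is easy to reproduce. The following Python sketch assumes that each processing stage is a naive global search and replace (the names `naive_escape` and `run_stages` are hypothetical, chosen for illustration) and checks that every intermediate result still parses once it is wrapped in an element:

```python
import xml.etree.ElementTree as ET


def naive_escape(text: str) -> str:
    """Escape every ampersand, even one that already begins &amp;."""
    return text.replace("&", "&amp;")


def run_stages(text: str, stages: int) -> list[str]:
    """Return the text after each of `stages` successive escaping passes."""
    results = []
    for _ in range(stages):
        text = naive_escape(text)
        # Each intermediate result is still well-formed XML content,
        # so the parser raises no error that would reveal the mistake.
        ET.fromstring(f"<doc>{text}</doc>")
        results.append(text)
    return results


history = run_stages("Why Choose Auto & Home Insurance", 5)
for line in history:
    print(line)
```

Each parse undoes only one level of escaping, so a single downstream parser cannot repair text that has been escaped several times.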

This explosion of escaped ampersands acts like the rings in the cross-section of a tree, providing a good guide to the age of the document in terms of document processing steps performed.

The second example is a more complex muddle caused by exactly the same ampersand explosion. Here it is again:

po&amp;amp;amp;eacute;sie, nouvelles, th&amp;amp;amp;eacute;&amp;amp;amp;acirc;tre

The original text here was most probably:

poésie, nouvelles, théâtre

In the first stage of processing, the accented characters were replaced with corresponding entity names commonly used in HTML/XML applications, namely the so-called ISO standard entity sets (http://www.ascc.net/xml/resource/entities/). In the ISO entity sets, an accented é is represented by the entity reference &eacute; and a circumflexed â is represented by the entity reference &acirc;. Performing these entity replacements yields:

po&eacute;sie, nouvelles, th&eacute;&acirc;tre

Later, in order to insulate any literal ampersands in the text, a program probably performed a global search and replace operation to escape the ampersands, yielding:

po&amp;eacute;sie, nouvelles, th&amp;eacute;&amp;acirc;tre

This was repeated twice more to yield the final text:

po&amp;amp;amp;eacute;sie, nouvelles, th&amp;amp;amp;eacute;&amp;amp;amp;acirc;tre
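This second muddle can be sketched the same way. The code below is an illustration under stated assumptions: Python's `html.entities.codepoint2name` table stands in for the ISO entity sets, and `to_entity_refs` and `naive_escape` are hypothetical helper names.

```python
from html.entities import codepoint2name


def to_entity_refs(text: str) -> str:
    """Replace non-ASCII characters that have a named entity with &name;."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch)) if ord(ch) > 127 else None
        out.append(f"&{name};" if name else ch)
    return "".join(out)


def naive_escape(text: str) -> str:
    """Escape every ampersand, including those that start an entity reference."""
    return text.replace("&", "&amp;")


stage = to_entity_refs("poésie, nouvelles, théâtre")
print(stage)
for _ in range(3):
    # Three blind escaping passes bury the entity names under &amp; layers.
    stage = naive_escape(stage)
    print(stage)
```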

The problems of the latter example are compounded by the fact that the &eacute; and &acirc; entities are not built into XML parsers. Consequently, to get a document to pass a well-formedness parse, it is necessary to define these entities. This can be done in a document itself using a document type declaration subset such as this (ents.xml):

<!DOCTYPE doc [
  <!ENTITY eacute "&amp;eacute;">
  <!ENTITY acirc "&amp;acirc;">
]>
<doc>
  <p>This document has an ampersand (&amp;) an apostrophe (&apos;) and a quotation mark (&quot;). These three entities are built into XML.</p>
  <p>This document also has an e acute (&eacute;) and an a circumflex (&acirc;).</p>
</doc>

However, defining these entities is more commonly done by adding the entity declarations to the external DTD.

Note that the replacement text for both the &eacute; and &acirc; entities in the previous example is designed to recreate the original entity markup. This is the safest thing to do when you wish to process XML without harming the entity markup. Unfortunately, it involves adding yet more troublesome ampersands to the XML document!
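To see the effect of such declarations, here is a small sketch using Python's standard `xml.etree.ElementTree` parser. The choice of parser and the shortened sample documents are assumptions made for illustration; only the two entity declarations come from ents.xml.

```python
import xml.etree.ElementTree as ET

UNDECLARED = "<doc><p>e acute (&eacute;)</p></doc>"

DECLARED = """<!DOCTYPE doc [
  <!ENTITY eacute "&amp;eacute;">
  <!ENTITY acirc "&amp;acirc;">
]>
<doc><p>e acute (&eacute;) and a circumflex (&acirc;)</p></doc>"""

try:
    ET.fromstring(UNDECLARED)
    undeclared_ok = True
except ET.ParseError as err:
    # Without a declaration, the reference is a well-formedness error.
    undeclared_ok = False
    print("parse failed:", err)

# With the declarations, the parser expands each reference to its
# replacement text, which spells out the original entity markup.
root = ET.fromstring(DECLARED)
text = root.find("p").text
print(text)

# Serializing the parsed tree escapes those ampersands once more.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The last line of output is the reason for caution: after one parse-and-serialize cycle the entity reference has already picked up an extra layer of ampersand escaping.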

One way to avoid this kind of trouble is to replace all entity references in your documents with elements. As soon as the content is parseable XML, use XML processing exclusively. Do not mix text processing (global search and replace) with XML processing.

Replacing the entity references in your documents with elements is straightforward. Simply create elements to act as placeholders for the real entity references while document processing is underway. Here is a good sequence to follow when starting out with plain text:

1. Replace all literal ampersand characters with an empty XML element such as <amp/>, and replace all literal less-than signs with an empty XML element such as <lt/>. This is a global search and replace operation.
2. Replace all non-built-in entity references with empty elements named after the entity, such as <eacute/> for the &eacute; entity and so on. This is a global search and replace operation.
3. Top and tail your document text with the start and end tags of a single XML element, say, <doc>. You now have well-formed XML.
4. Perform all subsequent text processing using XML tools; i.e., always start with an XML parse, and process the data emitted by the parser. Do not use any further global search and replace operations that involve either of XML's special characters (ampersand or less-than sign). If an ampersand needs to be added to the document during processing, insert an <amp/> element. Likewise, if a less-than sign needs to be added, insert an <lt/> element.
5. At the very last stage of processing, convert the placeholder elements created for entities back into entity syntax.
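The first three steps and the final one can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function names, the `PLACEHOLDERS` set, and the sample sentence are all hypothetical, and the sketch folds the first two steps into a single regular-expression pass so that replacing bare ampersands cannot clobber an entity reference such as &eacute;.

```python
import re
import xml.etree.ElementTree as ET

# One pattern covers the first two steps: a named entity reference such
# as &eacute;, or a bare ampersand, or a less-than sign.
_SPECIAL = re.compile(r"&([A-Za-z][A-Za-z0-9]*);|&|<")

# Assumption: the placeholder element names used in this pipeline.
PLACEHOLDERS = {"amp", "lt", "eacute", "acirc"}


def _to_placeholder(match: re.Match) -> str:
    name = match.group(1)
    if name:  # entity reference -> empty element of the same name
        return f"<{name}/>"
    return "<amp/>" if match.group(0) == "&" else "<lt/>"


def text_to_xml(text: str, root: str = "doc") -> str:
    """Insert placeholder elements, then top and tail with a root element."""
    return f"<{root}>{_SPECIAL.sub(_to_placeholder, text)}</{root}>"


def restore_entities(xml_text: str, names=PLACEHOLDERS) -> str:
    """Final stage: turn placeholder elements back into entity syntax."""
    pattern = re.compile(r"<(%s)\s*/>" % "|".join(sorted(names)))
    return pattern.sub(r"&\1;", xml_text)


source = "Fish & chips < caf&eacute; prices"
xml_text = text_to_xml(source)
print(xml_text)
ET.fromstring(xml_text)  # raises ParseError if the result is not well-formed
print(restore_entities(xml_text))
```

Restricting `restore_entities` to a known set of names is a design choice of this sketch: it keeps genuinely empty elements in the document from being mistaken for entity placeholders.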

For example, here is a file, ents.txt, that we will mark up in XML, avoiding the use of entities:

This document has an ampersand (&) an apostrophe (') and a quotation mark ("). These three entities are built into XML. This document also has an e acute (é) and an a circumflex (â).

Performing steps 1 through 3 of the procedure produces the following XML document (fixedents.xml):

<doc>
This document has an ampersand (<amp/>) an apostrophe (<apos/>) and a quotation mark (<quot/>). These three entities are built into XML. This document also has an e acute (<eacute/>) and an a circumflex (<acirc/>).
</doc>

The document is now well-formed XML, and will pass through an XML parser unharmed. Perform all further processing of this document using XML tools and you are unlikely to ever suffer from ampersand explosion.
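A quick way to check this claim is to push a placeholder-based document through several parse-and-serialize cycles and confirm that nothing changes. The sketch below assumes Python's `xml.etree.ElementTree` as the XML tool and uses a shortened version of fixedents.xml; `round_trip` is a hypothetical helper standing in for one pipeline stage.

```python
import xml.etree.ElementTree as ET

FIXED = (
    "<doc>This document has an ampersand (<amp/>) an apostrophe (<apos/>) "
    "and a quotation mark (<quot/>).</doc>"
)


def round_trip(xml_text: str) -> str:
    """Parse the document and serialize it again, as one stage would."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


once = round_trip(FIXED)
many = once
for _ in range(4):
    many = round_trip(many)

# The placeholders are ordinary elements, so repeated processing leaves
# them alone and no ampersand ever appears in the serialized text.
print(many)
print(once == many)
```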

The only place that literal ampersands can occur with the method I just presented is in attribute values. In my processing pipelines, I tend to model everything in terms of element structure, modeling attributes as sub-elements, and converting them to attribute syntax only at the very end of a processing pipeline.

Note that this approach makes XML parsing easier, because parsers that check only well-formedness do not require you to declare element types but do require you to declare any non-built-in entities. Using this technique removes the need to carry around entity declarations, either in internal document type declaration subsets or in external DTDs.

Newer schema languages, such as RELAX NG (http://www.relaxng.org) and W3C XML Schema (http://www.w3.org/XML/Schema), do not provide facilities for manipulating the entity structure and so this all-element approach plays well with them and with tools based on them.

Finally, when creating a DTD, RELAX NG, or W3C XML Schema model, I make a point of creating amp, quot, lt, and apos elements so that I can use the aforementioned techniques, even when performing schema-valid XML processing.