Item 15. Build on Top of Structures, Not Syntax | Effective XML: 50 Specific Ways to Improve Your XML

Entity references, CDATA sections, character references, empty-element tags, and the like are just syntax sugar. They make it a little easier to include certain hard-to-type constructs in XML documents. They do not in any way change a document's information content. Many parsers will not even tell you whether such syntax sugar was used or not. Your documents should convey the same meaning if each of these is replaced with an equivalent representation of the same content.
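One way to convince yourself of this is to parse two differently spelled documents and compare the results. The following is only a minimal sketch using the standard JAXP DOM API; the class name SyntaxSugarDemo and its helper methods are made up for illustration. It assumes the parser honors setCoalescing(true), which asks the builder to fold CDATA sections into ordinary text nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
class SyntaxSugarDemo {

    // Parses a document held in a string. Coalescing asks the DOM builder
    // to report CDATA sections as ordinary text.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse: " + xml, e);
        }
    }

    // True if the two documents have equal root elements at the structure layer.
    static boolean sameContent(String xml1, String xml2) {
        return parse(xml1).getDocumentElement()
                .isEqualNode(parse(xml2).getDocumentElement());
    }

    public static void main(String[] args) {
        // Empty-element tag versus start-tag/end-tag pair, and quote style
        System.out.println(sameContent("<a x=\"1\"/>", "<a x='1'></a>"));
        // Escaped text versus a CDATA section
        System.out.println(sameContent("<p>a &lt; b</p>", "<p><![CDATA[a < b]]></p>"));
        // Character reference versus the literal character
        System.out.println(sameContent("<p>&#65;</p>", "<p>A</p>"));
    }
}
```

If a program's behavior changes when one member of such a pair is swapped for the other, the program is depending on syntax rather than structure.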

XML processing can be thought of as a five-layer stack, as shown in Figure 15-1. Each layer of data is processed to generate the successively more abstract, more useful layer that follows it. Binary data is converted into characters. Characters are converted into syntax. Syntax is processed to form structures. Finally, structures are interpreted to form semantics. Each layer has its place and each layer is necessary. However, it's important not to mix them. A program processing XML can safely operate on only a single layer. Programs that attempt to operate on multiple layers simultaneously risk corrupting the clean, well-formed nature of XML.

Figure 15-1. The Five-Layer XML Processing Model

Normally processing begins with binary data that is translated into Unicode text according to a particular encoding. It may be necessary to first strip off and interpret metadata from the binary stream to locate the XML document. For example, when reading an XML document from a web server over a socket, you would have to read and remove the HTTP header while storing the information the header contained about the document's content type and encoding. (See Item 45.) Once the beginning of the document has been located, the parser will read ahead far enough to detect the encoding. Once the parser is confident it knows the encoding, it backs up to the beginning of the document and begins converting bytes into Unicode characters. This may happen before the XML parser begins its work and is technically not a part of XML, although for convenience most XML parsers at least have options to perform some of this work, especially encoding detection. In Java the APIs for the binary layer are java.io.InputStream and java.io.OutputStream.
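In practice this usually means handing the parser a byte stream and letting it work out the encoding for itself. The sketch below illustrates the idea with the JAXP DOM API; the class EncodingDetectionDemo and its methods are hypothetical names chosen for this example. It serializes the same small document in three encodings and checks that the parser recovers the same characters from each.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of any library.
class EncodingDetectionDemo {

    // Hands raw bytes to the parser, which detects the encoding from the
    // byte order mark and/or the encoding declaration.
    static String rootText(byte[] bytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse byte stream", e);
        }
    }

    // Serializes the same tiny document in the given encoding.
    // \u00E9 is e with an acute accent, which is encoded differently in each.
    static byte[] encode(String encodingName, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>"
                + "<word>caf\u00E9</word>";
        return xml.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = encode("ISO-8859-1", StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode("UTF-8", StandardCharsets.UTF_8);
        byte[] utf16 = encode("UTF-16", StandardCharsets.UTF_16); // Java adds a BOM

        // The byte counts differ, but the characters the parser reports do not.
        System.out.println(latin1.length + " " + utf8.length + " " + utf16.length);
        System.out.println(rootText(latin1).equals(rootText(utf8))
                && rootText(utf8).equals(rootText(utf16)));
    }
}
```

Wrapping the stream in a java.io.Reader before parsing would take this decision away from the parser, so the caller would then be responsible for choosing the right encoding.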

The Unicode characters form the lexical layer. In Java the APIs for this layer are java.io.Reader and java.io.Writer. These are not specifically XML APIs because this data is not necessarily XML until well-formedness has been verified. The only well-formedness check that can be performed at this level is verifying that the characters are all legal in an XML document; for example, that there are no vertical tabs or unmatched halves of surrogate pairs in the data stream.
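The character check described here can be sketched in a few lines. The code below assumes XML 1.0 and follows the Char production from that specification; the class XmlCharCheck and its method names are invented for this illustration and are not part of any parser's API.

```java
// Hypothetical helper class; not part of any library.
class XmlCharCheck {

    // The Char production from XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the index of the first illegal UTF-16 code unit,
    // or -1 if every character is legal in an XML 1.0 document.
    static int firstIllegalChar(CharSequence text) {
        int i = 0;
        while (i < text.length()) {
            // For an unmatched surrogate, codePointAt returns the surrogate
            // itself, which falls outside every legal range above.
            int codePoint = Character.codePointAt(text, i);
            if (!isXmlChar(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalChar("<a>fine</a>"));          // -1
        System.out.println(firstIllegalChar("a\u000Bb"));             // 1, vertical tab
        System.out.println(firstIllegalChar("x\uD83Dy"));             // 1, lone high surrogate
        System.out.println(firstIllegalChar("x\uD83D\uDE00y"));       // -1, matched pair
    }
}
```

Note that this check knows nothing about tags or entities. It operates purely on characters, which is exactly what distinguishes the lexical layer from the syntax layer above it.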

The parser then reads the raw Unicode characters to recognize the low-level syntax of an XML document: tags, characters, entity references, CDATA section delimiters, and so forth. ^[1] This is the layer where most of the well-formedness rules defined by XML's BNF grammar are checked. There are very few existing APIs that truly expose the constructs in this layer, partially because it's not always recognized as a separate layer and partially because few programs really need to operate at this level, mostly just source-code-level XML editors. However, a number of APIs have dug holes for themselves by mixing a few pieces of this layer in with the next higher structure layer.

^[1] In a traditional compiler we'd say this step is performed by the lexer rather than the parser. However, in XML the distinction between lexers and parsers is rarely made, and lexers are not normally available separate from parsers.

The parser combines these low-level syntax items into higher-level information structures: elements, attributes, text nodes, processing instructions, and so forth. During this process, the parser checks the XML well-formedness constraints that the XML specification calls out separately because they cannot be encoded in the BNF grammar. The most important of these is that each start-tag has a matching end-tag. At this point many of the details about exactly how the information was encoded are deliberately lost. For instance, the parser will merge the text inside a CDATA section with the text outside the CDATA section without in any way noting which characters came from inside and which from outside. Most common XML APIs operate primarily at the structure layer. These include SAX, DOM, JDOM, and XOM. Both DOM and SAX parsers can optionally mix in a lot of syntax layer information, but neither is required to support this.
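To make this concrete, here is a rough sketch using SAX through JAXP. The class StructureLayerDemo and its two methods are hypothetical. The handler registers only for character data and never learns where a CDATA section began or ended, while a start-tag without a matching end-tag is rejected before the client sees a complete document.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of any library.
class StructureLayerDemo {

    // Returns all character data in the document. The parser may deliver the
    // text in several chunks, so the handler appends each one to a buffer.
    static String characterData(String xml) {
        StringBuilder buffer = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors such as mismatched tags
            throw new IllegalArgumentException("Not well-formed: " + e.getMessage(), e);
        } catch (Exception e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
        return buffer.toString();
    }

    static boolean isWellFormed(String xml) {
        try {
            characterData(xml);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Text inside and outside the CDATA section arrives as plain characters
        System.out.println(characterData("<p>a<![CDATA[<b>]]>c</p>"));
        System.out.println(characterData("<p>a&lt;b&gt;c</p>"));
        // Overlapping tags violate a well-formedness constraint
        System.out.println(isWellFormed("<a><b></a></b>"));
    }
}
```

A SAX client that really wants the CDATA boundaries can ask for them through the optional LexicalHandler interface, which is one example of the syntax-layer information a parser is allowed, but not required, to mix in.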

Finally, the parser passes the information about these high-level structures to the client program that invoked the parser. This client program then acts on these structures to produce semantic objects and data structures that are appropriate for its local process. This is the domain of data binding APIs such as JAXB, Castor, and Zeus. These attempt to completely hide the fact that the data came from XML and treat it as some kind of programming object.
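The step from structures to semantics can also be written by hand. The following sketch is not how JAXB, Castor, or Zeus work internally; it merely illustrates the idea with an invented price element, an invented currency attribute, and a hypothetical Price class. The conversion reads only elements, attributes, and text, so it neither knows nor cares how those were spelled in the source document.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; the price vocabulary is invented for illustration.
class PriceBinding {

    // A semantic object: nothing about it reveals that it came from XML.
    static final class Price {
        final String currency;
        final BigDecimal amount;

        Price(String currency, BigDecimal amount) {
            this.currency = currency;
            this.amount = amount;
        }
    }

    // Transforms structure-layer data (an element, an attribute, and text)
    // into a semantic-layer object.
    static Price fromXml(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String currency = root.getAttribute("currency");
            BigDecimal amount = new BigDecimal(root.getTextContent().trim());
            return new Price(currency, amount);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a usable price document", e);
        }
    }

    public static void main(String[] args) {
        Price plain = fromXml("<price currency=\"USD\">12.50</price>");
        Price cdata = fromXml("<price currency='USD'><![CDATA[12.50]]></price>");
        // Both spellings yield the same semantic object
        System.out.println(plain.currency + " " + plain.amount);
        System.out.println(plain.amount.equals(cdata.amount));
    }
}
```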

A clean program that processes XML works exclusively with a single layer. Almost always, the appropriate layer to work with is the structure layer. In this layer, a well-designed program processes the elements, attributes, text, and other post-parse content. It is responsible for transforming from the structure layer to the semantic layer. It does not involve itself with syntactic issues such as whether a dollar sign was typed as $, &#36;, &dollar;, &#x24;, or even <![CDATA[$]]>. It has even less interest in lexical and binary layer issues such as which character encoding the document uses. The parser handles all of this before the program ever sees the document.
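The dollar sign case is easy to try out. In the sketch below, the class DollarSignDemo is a hypothetical name, and the dollar entity is declared in an internal DTD subset written just for this example, because XML does not predefine it. The example assumes a parser configured, as the JDK's is by default, to read the internal subset and expand entity references.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
class DollarSignDemo {

    // Five spellings of a document whose root element contains one dollar sign.
    static final String[] VARIANTS = {
        "<price>$</price>",
        "<price>&#36;</price>",
        "<price>&#x24;</price>",
        // The entity is not predefined, so this example declares it itself.
        "<!DOCTYPE price [<!ENTITY dollar \"&#36;\">]><price>&dollar;</price>",
        "<price><![CDATA[$]]></price>"
    };

    // Returns the text content of the root element, as seen after parsing.
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        for (String variant : VARIANTS) {
            // Each line should report the same single character
            System.out.println(rootText(variant));
        }
    }
}
```

A program that asks only for the text content cannot tell these five documents apart, and that is the point: there is nothing at the structure layer to tell.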

Note

There is perhaps one exception to this rule. Source-code-level, generic XML editors such as XMLSPY, XED, and jEdit do need access to the syntax layer in order to preserve the appearance of the document. For instance, they do not want to change a named entity reference to a numeric character reference or vice versa. They may even allow for partially malformed documents because users may want to type content after start-tags before they type the end-tags. Thus these tools tend to operate on the syntax layer rather than the structure layer. However, XML editors are a very special case in the realm of XML software. The very unusual needs of these tools should not influence the design of other, more conventional applications.

Particularly common confusions about layers include the following:

Treating empty-element tags differently from the equivalent start-tag and end-tag pairs
Using CDATA sections as pseudo-elements that contain malformed markup
Considering character and entity references as somehow different from their replacement text
Skipping or forbidding the document type declaration

Let's explore some of the problems that commonly arise as a result of these layer confusions.