18.2 Common XML Processing Issues | XML in a Nutshell, Third Edition

As with any technology, there are several ways to accomplish most design goals when developing a new XML application, as well as a few potential problems worth knowing about ahead of time. An understanding of the intended uses for these features can help ensure that new applications will be compatible not only with their intended target audience, but also with other XML processing systems that may not even exist yet.

18.2.1 What You Get Is Not What You Saw

The XML specification provides several loopholes that permit XML parsers to play fast and loose with your document's literal contents, while retaining the semantic meaning. Comments can be omitted and entity references silently replaced by the parser without any warning to the client application. Non-validating parsers aren't required to retrieve external DTDs or entities, although the parser should at least warn applications that this is happening. While reconstructing an XML document with exactly the same logical structure and content is possible, guaranteeing that it will match the original in a byte-by-byte comparison generally is not.

XML Canonicalization defines a more consistent form of XML and a process for producing it that permits a much higher degree of predictability in reconstructing a document from its logical model. For details, see http://www.w3.org/TR/xml-c14n.
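As a rough illustration of why a canonical form helps, the following Python sketch compares two serializations of the same logical document. It assumes Python 3.8 or later, whose `xml.etree.ElementTree.canonicalize` function implements Canonical XML 2.0 rather than the 1.0 version linked above; the `person` documents and the `same_logical_document` helper are made up for this example.

```python
import xml.etree.ElementTree as ET

# Two serializations of one logical document: attribute order,
# quote style, and empty-element syntax all differ.
DOC_A = "<person  born='1912' name=\"Alan Turing\"/>"
DOC_B = '<person name="Alan Turing" born="1912"></person>'


def same_logical_document(xml_a: str, xml_b: str) -> bool:
    """Compare two documents by canonical form instead of byte for byte."""
    return ET.canonicalize(xml_a) == ET.canonicalize(xml_b)


if __name__ == "__main__":
    print(DOC_A == DOC_B)                       # the raw strings differ
    print(same_logical_document(DOC_A, DOC_B))  # the canonical forms agree
```

A byte-level comparison reports a difference here, while the canonical forms match because canonicalization fixes the attribute order, the quoting, and the empty-element syntax.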

Authors of simple XML processing tools that act on data without storing or modifying it might not consider these constraints particularly restrictive. The ability to reconstruct an XML document precisely from in-memory data structures, however, becomes more critical for authors of XML editing tools and content-management solutions. While no parser is required to make all comments, whitespace, and entity references available from the parse stream, many do or can be made to do so with the proper configuration options.

The only real way to ensure that a parser reports documents as you want, and not just the minimum required by the XML specification, is to check its documentation and configure (or choose) the parser accordingly.

18.2.2 To Read the DTD or Not To Read the DTD?

A document's DTD can live in an internal subset, an external subset, or both. The XML specification requires all parsers to read the internal DTD subset. Validation requires reading the external DTD subset (if any), but if you don't validate, this is optional. Reading the external DTD subset takes extra time, especially if the DTD is large or stored on a remote network host, so you may not want to load it if you're not validating. Most parsers provide options to specify whether the external DTD subset and other external entities should be resolved. If validation were all a DTD did, then the decision of whether to load the DTD would be easy. Unfortunately, DTDs also augment a document's infoset with several important properties, including:

Entity definitions
Default attribute values
Whether boundary whitespace is ignorable

At the extreme, since a document with a malformed DTD is itself malformed, a DTD can make a document readable or unreadable. This means that whether a parser reads the external DTD subset can have a significant impact on what the parser reports. For maximum interoperability, documents should be served without external DTD subsets. In this case, parser behavior is deterministic and reproducible, regardless of configuration. On the flip side, a consumer of XML documents should attempt to read any external DTD subset the document references if it wants to be sure of receiving what the sender intended. Be conservative in what you send (don't use external DTD subsets) and liberal in what you accept (do read any external DTD subsets for documents you receive).
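To make the infoset point concrete, here is a small Python sketch showing a DTD changing what the parser reports. It uses an internal DTD subset, because that is the part every parser must read; the same declaration placed in an external subset would take effect only if the parser loaded it. The `profession` element, its default value, and the `attributes_seen` helper are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares a default value for the source attribute.
WITH_DTD = """<!DOCTYPE profession [
  <!ATTLIST profession source CDATA "unknown">
]>
<profession value="mathematician"/>"""

# The same element with no DTD at all.
WITHOUT_DTD = '<profession value="mathematician"/>'


def attributes_seen(xml_text: str) -> dict:
    """Return the attributes the parser reports for the root element."""
    return dict(ET.fromstring(xml_text).attrib)


if __name__ == "__main__":
    print(attributes_seen(WITH_DTD))     # includes the defaulted source attribute
    print(attributes_seen(WITHOUT_DTD))  # only the attribute written in the tag
```

The element's tag is identical in both documents, yet the application sees a different set of attributes depending on whether the declaration was read.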

18.2.3 Whitespace

How parsers treat whitespace is one of the most commonly misunderstood areas of XML processing. There are four basic rules you need to remember:

All whitespace in element content is always reported.
Whitespace in attribute values is normalized.
Whitespace in the prolog and epilog and within tags but outside attribute values is not reported.
All non-escaped line breaks (carriage returns, line feeds, carriage return-line feed pairs, and, in XML 1.1, NEL and line separator) are converted to line feeds.

Consider Example 18-2.

Example 18-2. Various kinds of whitespace

    <?xml version="1.0"?>
        <!DOCTYPE person  SYSTEM "person.dtd ">
        <person  source="Alan Turing: the Enigma,
                       Andrew Hodges, 1983">
      <name>
        <first>Alan</first>
        <last>Turing</last>
      </name>
      <profession  id="p1"
                    value="computer  scientist "
                   source="" />
      <profession  id="p2"
                   value="mathematician"/>
      <profession  id="p3"
                   value="cryptographer"/>
    </person>

When a parser reads this document, it will report all the whitespace in the element content to the client application. This includes boundary whitespace like that between the <name> and <first> start-tags and the </last> and </name> end-tags. If the DTD says that the name element cannot contain mixed content, the whitespace is considered to be whitespace in element content, also called ignorable whitespace. However, the parser still reports it. The client application receiving the content from the parser may choose to ignore boundary whitespace, whether it's ignorable or not, interpreting it as purely for the purpose of pretty printing; but that's up to the client application. The parser always reports it all.
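The following Python sketch shows boundary whitespace reaching the application. It uses the standard `xml.etree.ElementTree` module, where whitespace appears as the `text` of the parent and the `tail` of each child; the `reported_whitespace` helper is an assumption of this example. The input uses carriage return-line feed pairs to show line-break conversion as well.

```python
import xml.etree.ElementTree as ET

# A pretty-printed name element with Windows-style (CR LF) line breaks.
DOC = "<name>\r\n  <first>Alan</first>\r\n  <last>Turing</last>\r\n</name>"


def reported_whitespace(xml_text: str) -> list:
    """List the whitespace before the first child and after each child."""
    name = ET.fromstring(xml_text)
    return [name.text] + [child.tail for child in name]


if __name__ == "__main__":
    for chunk in reported_whitespace(DOC):
        print(repr(chunk))  # each chunk is whitespace-only text
```

Each chunk is pure pretty-printing whitespace, yet the parser hands it over, with every line break already converted to a single line feed. Dropping it is the application's job.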

The parser does not report the line breaks and other whitespace in the prolog and epilog. Nor does it report the line breaks and whitespace in the tags such as that between the id and value attributes in the profession elements. Nothing in your program should depend on this.

The parser will normalize all the whitespace in attribute values. At a minimum, this means it will turn line breaks like those in the source attribute into spaces. If the DTD says the attribute has type CDATA or does not declare it, or if the DTD has not been read or does not exist, then that's all. However, if the attribute has any other type such as ID, NMTOKENS, or an enumeration, then the parser will strip all leading and trailing whitespace from the attribute and compress all remaining runs of whitespace to a single space each. However, normalization is only performed on literal whitespace. Spaces, tabs, line feeds, and carriage returns embedded with character or entity references are converted to their replacement text and then retained. They are not normalized like literal whitespace.
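The three normalization cases just described can be seen side by side in a short Python sketch. The attribute names `kinds` and `note`, their values, and the `normalized_attributes` helper are invented for illustration, and the declarations sit in the internal DTD subset so that the parser is certain to read them.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<!DOCTYPE profession [\n"
    "  <!ATTLIST profession value CDATA    #IMPLIED\n"
    "                       kinds NMTOKENS #IMPLIED>\n"
    "]>\n"
    '<profession value=" computer\n scientist "\n'
    '            kinds=" science\n   maths "\n'
    '            note="line1&#10;line2"/>'
)


def normalized_attributes(xml_text: str) -> dict:
    """Return the root element's attributes as the parser reports them."""
    return dict(ET.fromstring(xml_text).attrib)


if __name__ == "__main__":
    for name, value in normalized_attributes(DOC).items():
        print(name, repr(value))
    # value: CDATA, so the line break becomes a space and nothing is trimmed
    # kinds: NMTOKENS, so the value is trimmed and whitespace runs collapse
    # note:  the line feed came from a character reference and is retained
```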

18.2.4 Entity References

There are three kinds of references in XML instance documents (plus another couple in the DTD we can ignore for the moment):

Numeric character references, such as &#xA0; and &#160;
The five predefined entity references, &amp;, &gt;, &quot;, &apos;, and &lt;
General entity references defined by the DTD, such as &chapter1; and

The first two kinds are easy to handle. The parser always resolves them and never tells you anything about them. As a parser client, you can simply ignore these and the right thing will happen. The parser will report the replacement text in the same way it reports regular text. It won't ever tell you that these entity references were used. On rare occasion you may be able to set a special property on the parser to have it tell you about these things, but you almost never want to do that. The only case where that might make sense is if you're writing an XML editor that tries to round-trip the source form of a document.
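A brief Python sketch of the first two cases: the application receives ordinary text and no hint that references were used. The `formula` element is made up for this example.

```python
import xml.etree.ElementTree as ET

# Predefined entity references plus decimal and hexadecimal character references.
DOC = "<formula>a &lt; b &amp;&amp; b &gt; c &#169; &#xA9;</formula>"

text = ET.fromstring(DOC).text

if __name__ == "__main__":
    print(text)  # a < b && b > c © ©
```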

The third case is trickier. These entity references may refer to external files on remote sites you don't necessarily want to connect to for reasons of performance, availability, or security. Even if they're internal entities, they may be defined in the external DTD subset in a remote document. Parsers vary in whether they load such entities by default. Most parsers and APIs do provide a means of specifying whether external entities should or should not be loaded, although this is not universal. For instance, XOM always resolves external entities, while the XML parser in Mozilla never resolves them. Parsers that do not resolve an external entity should nevertheless notify the client application that the entity was not loaded, for instance by calling skippedEntity() in SAX or inserting an EntityReference object into the tree in DOM. How the program responds to such notifications is a question that must be answered in the context of each application. Sometimes it's a fatal problem. Other times it's something you can work around or even ignore, but do be aware that you need to consider this possibility unless the parser is configured to always resolve external entities.
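The skipped-entity notification can be sketched with Python's `xml.sax` module, assuming its default expat-based driver. The parser is told not to load external general entities, so the external DTD subset is never fetched and the reference to an entity presumably declared there is reported as skipped. The file name `book.dtd`, the `SkipRecorder` class, and the helper function are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class SkipRecorder(ContentHandler):
    """Collect character data and the names of any skipped entities."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.skipped = []

    def characters(self, content):
        self.text.append(content)

    def skippedEntity(self, name):
        self.skipped.append(name)


# book.dtd is never opened; it only has to be referenced.
DOC = b'<!DOCTYPE book SYSTEM "book.dtd">\n<book>Before &chapter1; after</book>'


def parse_without_external_entities(data: bytes) -> SkipRecorder:
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)  # do not load external entities
    handler = SkipRecorder()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler


if __name__ == "__main__":
    result = parse_without_external_entities(DOC)
    print(result.skipped)        # names of the entities that were not expanded
    print("".join(result.text))  # the text, with a gap where the entity was
```

Whether a non-empty `skipped` list is a fatal error or a tolerable gap is the application-level decision described above.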

Recently, a few parser vendors have become concerned about the so-called billion laughs attack. In brief, it works by defining entities whose replacement text progressively doubles in size, especially in the internal DTD subset, where the entities must be resolved:

    <!ENTITY ha1 "Ha! ">
    <!ENTITY ha2 "&ha1; &ha1;">
    <!ENTITY ha3 "&ha2; &ha2;">
    <!ENTITY ha4 "&ha3; &ha3;">
    <!ENTITY ha5 "&ha4; &ha4;">
    <!ENTITY ha6 "&ha5; &ha5;">
    ...
    <!ENTITY ha31 "&ha30; &ha30;">
    <!ENTITY ha32 "&ha31; &ha31;">
    ...
    <root>&ha32;</root>

So far this attack is purely theoretical. Nonetheless, some parser vendors have started adding options to their parsers not to resolve entities defined in the internal DTD subset either (which is non-conformant to the XML Recommendation). Other palliatives include setting maximum limits on entity size or on recursion depth in entity references. In general, these options are not turned on by default because they are non-conformant.
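To see why a few dozen short declarations are a concern, the size of the fully expanded text can be computed without expanding anything. The Python sketch below assumes the pattern shown above, where each entity is two copies of the previous one joined by a single space; the function name is made up.

```python
def expanded_length(level: int, base: str = "Ha! ", separator: str = " ") -> int:
    """Length of the level-th entity after full expansion, without expanding it."""
    length = len(base)                        # ha1 is just the base string
    for _ in range(level - 1):
        length = 2 * length + len(separator)  # haN is "&haN-1; &haN-1;"
    return length


if __name__ == "__main__":
    for level in (1, 2, 3, 32):
        print(level, expanded_length(level))
    # Level 32 comes to 10,737,418,239 characters, roughly 10 GB of text
    # produced from 32 one-line declarations.
```

A parser that enforces a maximum entity size could apply the same arithmetic incrementally and stop as soon as the running total passes its limit.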

18.2.5 CDATA Sections

The golden rule of handling CDATA sections is this: ignore them. When writing code to process XML, pretend CDATA sections do not exist, and everything will work just fine. The content of a CDATA section is plain text. It will be reported to your application as plain text, just like any other text, whether enclosed in a CDATA section, escaped with character references, or typed out literally when escaping is not necessary. For example, these two example elements are exactly the same as far as anything in your code should know or care:

    <example><![CDATA[<?xml version="1.0"?>
    <root>
      Hello!
    </root>]]></example>

    <example>&lt;?xml version="1.0"?>
    &lt;root>
      Hello!
    &lt;/root></example>

Do not write programs or XML documents that depend on knowing the difference between the two. Parsers rarely (and never reliably) inform you of the difference. Furthermore, passing such documents through a processing chain often removes the CDATA sections completely, leaving only the content intact but represented differently, for instance with numeric character references representing the unserializable characters. CDATA sections are a minor convenience for human authors, nothing more. Do not treat them as markup.

This also means you should not attempt to nest one XML (or HTML) document inside another using CDATA sections. XML documents are not designed to nest inside one another. The correct solution to this problem is to use namespaces to sort out which markup is which, rather than trying to treat a document as an envelope for other documents. Similarly, do not use CDATA sections to escape malformed markup such as is found in many HTML systems. Instead, use a tool such as Tidy to correct the malformed HTML before embedding it in an XML document.
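A quick Python check of that equivalence, using the standard `xml.etree.ElementTree` module on documents of the same shape as the pair above:

```python
import xml.etree.ElementTree as ET

# The same text, once inside a CDATA section and once with escaped markup.
WITH_CDATA = (
    '<example><![CDATA[<?xml version="1.0"?>\n'
    "<root>\n  Hello!\n</root>]]></example>"
)
ESCAPED = (
    '<example>&lt;?xml version="1.0"?>\n'
    "&lt;root>\n  Hello!\n&lt;/root></example>"
)

cdata_text = ET.fromstring(WITH_CDATA).text
escaped_text = ET.fromstring(ESCAPED).text

if __name__ == "__main__":
    print(cdata_text == escaped_text)  # True: the application sees identical text
```

Nothing in the resulting tree records which form the author typed, which is why code should not try to depend on it.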

18.2.6 Comments

Despite a long history in HTML of using comments for tasks like Server-Side Includes (SSI) and for hiding JavaScript code and Cascading Style Sheets, using comments for anything other than human-readable notes is generally a bad idea in XML. XML parsers may (and frequently do) discard comments entirely, keeping them from reaching an application at all. Transformations generally discard comments as well.
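As one concrete case, Python's `xml.etree.ElementTree` discards comments by default and keeps them only when asked. The sketch below assumes Python 3.8 or later, where `TreeBuilder` accepts an `insert_comments` option; the sample document is invented.

```python
import xml.etree.ElementTree as ET

DOC = "<note>before<!-- reviewed by editor -->after</note>"

# Default behavior: the comment never reaches the tree.
plain = ET.fromstring(DOC)

# Opt-in behavior: ask the tree builder to keep comments as nodes.
keeping_parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
keeping = ET.fromstring(DOC, parser=keeping_parser)

if __name__ == "__main__":
    print(len(plain), repr(plain.text))  # no children; the text runs together
    print(len(keeping), keeping[0].tag is ET.Comment, repr(keeping[0].text))
```

An application that stored instructions in the comment would see them vanish under the default configuration without any error being raised.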

18.2.7 Processing Instructions

XML parsers are required to provide client applications access to XML processing instructions. Processing instructions provide a mechanism for document authors to communicate with XML-aware applications behind the scenes in a way that doesn't interfere with the content of the document. DTD and schema validation both ignore processing instructions, making it possible to use them anywhere in a document structure without changing the DTD or schema. The processing instruction's most widely recognized application is its ability to embed stylesheet references inside XML documents. The following XML fragment shows a stylesheet reference:

 <?xml-stylesheet type="text/css" href="test.css"?>

An XML-aware application, such as Internet Explorer 6.0, would be capable of recognizing the XML author's intention to display the document using the test.css stylesheet. This processing instruction can also be used to link to XSLT stylesheets or other kinds of stylesheets not yet developed, although the client application needs to understand how to process them to make this work. Applications that do not understand the processing instructions can still parse and use the information in the XML document while ignoring the unfamiliar processing instruction.

The furniture example from Chapter 21 (see Figure 21-1) gives a hypothetical application of processing instructions. A processing instruction in the bookcase.xml file signals the furniture example's processor to verify the parts list from the document against the true list of parts required to build the furniture item:

    <parts_list>
        <part_name id="A" count="1">END PANEL</part_name>
        <part_name id="B" count="2">SIDE PANEL</part_name>
        <part_name id="C" count="1">BACK PANEL</part_name>
        <part_name id="D" count="4">SHELF</part_name>
        <part_name id="E" count="8">HIDDEN CONNECTORS</part_name>
        <part_name id="F" count="8">CONNECTOR SCREWS</part_name>
        <part_name id="G" count="22">7/16" TACKS</part_name>
        <part_name id="H" count="16">SHELF PEGS</part_name>
    </parts_list>
    <?furniture_app    verify_parts_list?>

This processing instruction is meaningless unless the parsing application understands the given type of processing instruction.
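On the receiving side, a SAX application gets each processing instruction as a target and a data string. The Python sketch below collects them with `xml.sax`; the `bookcase` root element, the trimmed-down document, and the `PICollector` class are assumptions made for this example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PICollector(ContentHandler):
    """Record every processing instruction as a (target, data) pair."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def processingInstruction(self, target, data):
        self.instructions.append((target, data))


DOC = b"""<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="test.css"?>
<bookcase>
  <parts_list/>
  <?furniture_app    verify_parts_list?>
</bookcase>"""


def collect_instructions(data: bytes) -> list:
    handler = PICollector()
    xml.sax.parseString(data, handler)
    return handler.instructions


if __name__ == "__main__":
    for target, data in collect_instructions(DOC):
        print(target, "->", data)
```

The XML declaration on the first line is not a processing instruction and is not reported as one. What the application does with a target it does not recognize is up to it; ignoring it is the usual choice.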

The XML specification also permits the association of the processing instruction's target (the XML name immediately after the <?) with a notation, as described in the next section, but this is not required and is rarely used in XML.

18.2.8 Notations

The notation syntax of XML provides a way for the document author to specify an external unparsed entity's type within the XML document's framework. If an application requires access to external data that cannot be represented in XML, consider declaring a notation name and using it where appropriate when declaring external unparsed entities. For example, if an XML application were an annotated Java source-code format, the compiled bytecode could then be referenced as an external unparsed entity.

Notations effectively provide metadata, identifiers that applications may apply to information. Using notations requires making declarations in the DTD, as described in Chapter 3. One use of notations is with NOTATION-type attributes. For example, if a document contained various scripts designed for different environments, it might declare some notations and then use an attribute on a containing element to identify what kind of script it contained:

    <!NOTATION DOS PUBLIC "-//MS/DOS Batch File/">
    <!NOTATION BASH PUBLIC "-//UNIX/BASH Shell Script/">
    <!ELEMENT batch_code (#PCDATA)*>
    <!ATTLIST batch_code
         lang NOTATION (DOS | BASH) #REQUIRED>
    . . .
    <batch_code lang="DOS">
      echo Hello, world!
    </batch_code>

Applications that read this document and recognized the public identifier could interpret the foreign element data correctly, based on its type. (Notations can also have system identifiers, and applications can use either approach.)

Categorizing processing instructions is the other use of notations. For instance, the previous furniture_app processing-instruction example could have been declared as a notation in the DTD:

 <!NOTATION furniture_app SYSTEM "http://namespaces.example.com/furniture">

Then the furniture-document processing application could verify that the processing instruction was actually intended for itself and not for another application that used a processing instruction with the same name.
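One way such a check might look with Python's `xml.sax` module is sketched below. The handler records notation declarations as the DTD is read, then accepts a processing instruction only if its target was declared with the expected system identifier. The `bookcase` document, the second `other_app` instruction, and the class and function names are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler

FURNITURE_ID = "http://namespaces.example.com/furniture"


class FurnitureHandler(ContentHandler, DTDHandler):
    """Accept only processing instructions whose target maps to our notation."""

    def __init__(self):
        super().__init__()
        self.notations = {}  # notation name -> (public ID, system ID)
        self.accepted = []   # data of the instructions meant for this application

    def notationDecl(self, name, publicId, systemId):
        self.notations[name] = (publicId, systemId)

    def processingInstruction(self, target, data):
        public_id, system_id = self.notations.get(target, (None, None))
        if system_id == FURNITURE_ID:
            self.accepted.append(data)


DOC = b"""<!DOCTYPE bookcase [
  <!NOTATION furniture_app SYSTEM "http://namespaces.example.com/furniture">
]>
<bookcase>
  <?furniture_app verify_parts_list?>
  <?other_app do_something?>
</bookcase>"""


def parse_furniture(data: bytes) -> FurnitureHandler:
    handler = FurnitureHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setDTDHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler


if __name__ == "__main__":
    result = parse_furniture(DOC)
    print(result.notations)
    print(result.accepted)
```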

18.2.9 Unparsed Entities

Unparsed entities combine attribute and notation declarations to define references to content that requires further (unspecified) processing by the application. They are described in more detail in Chapter 3; although they are available to applications, they are rarely used.
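For completeness, here is a hedged Python sketch of how a SAX application might learn about an unparsed entity, loosely following the Java bytecode idea from the notations section. The parser reports only the declaration and never reads the referenced file; the `bytecode` notation, its identifier, the `Hello.class` file name, and the `source` element are all invented for illustration.

```python
import io
import xml.sax
from xml.sax.handler import DTDHandler


class EntityRecorder(DTDHandler):
    """Record unparsed entity declarations: name -> (system ID, notation name)."""

    def __init__(self):
        self.unparsed = {}

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.unparsed[name] = (systemId, ndata)


DOC = b"""<!DOCTYPE source [
  <!NOTATION bytecode SYSTEM "http://www.example.com/java-bytecode">
  <!ENTITY compiled SYSTEM "Hello.class" NDATA bytecode>
  <!ATTLIST source class ENTITY #IMPLIED>
]>
<source class="compiled"/>"""


def unparsed_entities(data: bytes) -> dict:
    recorder = EntityRecorder()
    parser = xml.sax.make_parser()
    parser.setDTDHandler(recorder)
    parser.parse(io.BytesIO(data))
    return recorder.unparsed


if __name__ == "__main__":
    print(unparsed_entities(DOC))
```

The application still has to connect the `class` attribute value to the recorded entity and decide how to fetch and interpret `Hello.class` itself, which is the further, unspecified processing mentioned above.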