Item 18. Include All Information in the Instance Document | Effective XML: 50 Specific Ways to Improve Your XML

An XML document is not the same thing as an XML file. XML provides a bewildering array of options for building documents and infosets out of multiple pieces. Among them are the following:

External parsed entity references
Internal parsed entity references
XInclude
Default attribute values from the DTD
Default attribute values from the schema

These can be useful shortcuts for authoring, but they are death traps for interoperable documents. Many XML processors do not read the external DTD subset and cannot resolve any entity references defined therein. They also cannot apply default attribute values. Few current XML parsers can read a schema and apply default attribute values found there. Almost no XML parsers perform XInclusion by default. And even though it doesn't conform to the XML specification, there are even a few processors that don't read the internal DTD subset, so attribute values defined there may not be accessible in all environments.

For maximum portability and robustness, include all necessary information in the instance document itself. Do not rely on default attribute values, notations, types, entity references, or anything else that can be discovered only by processing the schema or DTD. Make the instance document self-contained. You may choose to use schema defaults while authoring, but before publishing your documents to the world, be sure to resolve all of them so that a processor without access to anything other than the document itself can still correctly process the document.
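The effect of a default that never arrives is easy to observe. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which does not fetch the external DTD subset. The length element, its units attribute, and the file name length.dtd are hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose default is declared in the internal DTD subset.
INTERNAL = """<!DOCTYPE length [
  <!ATTLIST length units CDATA "cm">
]>
<length>5</length>"""

# The same instance, with the ATTLIST assumed to live in an external file.
# length.dtd is a made-up name; this parser never opens it.
EXTERNAL = """<!DOCTYPE length SYSTEM "length.dtd">
<length>5</length>"""

# The self-contained form: the value is written in the instance itself.
EXPLICIT = """<length units="cm">5</length>"""


def units(document):
    """Return the units attribute as this parser reports it, or None."""
    return ET.fromstring(document).get("units")


if __name__ == "__main__":
    for label, document in [("internal subset", INTERNAL),
                            ("external subset", EXTERNAL),
                            ("explicit attribute", EXPLICIT)]:
        print(label, "->", units(document))
```

Only the explicit form yields the same answer no matter which parts of the DTD a given processor chooses to read.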

For example, consider the following XHTML+SVG+MathML document.

<!DOCTYPE html PUBLIC
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
    "xhtml-math-svg/xhtml-math-svg-flat.dtd">
<html>
<head>
  <title>Equation of the Unit Circle</title>
</head>
<body>
<h1>Equation of the Unit Circle</h1>
<div>
  <math>
    <mrow>
      <mrow>
        <msup><mi>x</mi><mn>2</mn></msup>
        <mo>+</mo>
        <msup><mi>y</mi><mn>2</mn></msup>
      </mrow>
      <mo>=</mo>
      <mn>1</mn>
    </mrow>
  </math>
</div>
<div>
  <svg:svg width="5cm" height="5cm"
           viewBox="0 0 500 500" version="1.1">
      <svg:circle cx="250" cy="250" r="100"
                  stroke="black" stroke-width="10" />
  </svg:svg>
</div>
</body>
</html>

It relies on namespace declaration attributes that are not present in the instance document. Instead they are defaulted in from the DTD. A browser that does not read the DTD will not know about them, will not recognize the MathML or SVG elements, may not recognize the XHTML elements, will erroneously conclude that this document is namespace malformed, and may reject the document.

Instead, the document should be written like this, with all namespace declarations spelled out explicitly.

<!DOCTYPE html PUBLIC
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
    "xhtml-math-svg/xhtml-math-svg-flat.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Equation of the Unit Circle</title>
</head>
<body>
<h1>Equation of the Unit Circle</h1>
<div>
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <mrow>
      <mrow>
        <msup><mi>x</mi><mn>2</mn></msup>
        <mo>+</mo>
        <msup><mi>y</mi><mn>2</mn></msup>
      </mrow>
      <mo>=</mo>
      <mn>1</mn>
    </mrow>
  </math>
</div>
<div>
  <svg:svg xmlns:svg="http://www.w3.org/2000/svg"
           width="5cm" height="5cm" viewBox="0 0 500 500" version="1.1">
      <svg:circle cx="250" cy="250" r="100"
                  stroke="black" stroke-width="10" />
  </svg:svg>
</div>
</body>
</html>

The extra declarations are redundant with the information in the DTD, but they make the whole document more reliable. All processors will be able to handle this document, even if they don't read the DTD.
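To see how a namespace-aware parser that skips the DTD reacts to the two forms, consider this small sketch built on Python's xml.etree.ElementTree. The documents are cut-down stand-ins for the listings in this item, and element_names is a helper name invented here.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the first listing: the svg prefix is never declared
# in the instance, so a parser that skips the DTD cannot bind it.
MISSING = (
    '<div>'
    '<svg:svg width="5cm" height="5cm"><svg:circle r="100"/></svg:svg>'
    '</div>'
)

# Cut-down stand-in for the second listing: every declaration is explicit.
DECLARED = (
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="5cm" height="5cm">'
    '<svg:circle r="100"/></svg:svg>'
    '</div>'
)


def element_names(document):
    """Return the qualified element names, or None if the parser rejects it."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return None
    return [element.tag for element in root.iter()]


if __name__ == "__main__":
    print(element_names(MISSING))   # rejected: the svg prefix is unbound
    print(element_names(DECLARED))  # names in {namespace-URI}local form
```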

For another example, consider entity references, as in this fragment of XHTML.

<p>
  The Greek word for father is
  &pi;&alpha;&tau;&rho;&omicron;&sigmaf;.
</p>

In order to understand the content of this fragment, a processor must read the DTD. If it has failed to do so, it will not be able to correctly report the text. It would be much better to use the actual characters in the body of the document.

<p>The Greek word for father is πατρος.</p>

If this is simply not feasible given the limits of the local software and fonts, you should use character references instead.

<p>
  The Greek word for father is
  &#x3C0;&#x3B1;&#x3C4;&#x3C1;&#x3BF;&#x3C2;.
</p>

This is more opaque in source form but much more easily interpreted by a generic XML parser.
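The difference shows up immediately in a parser that has no DTD to consult. Here is a minimal sketch, again assuming Python's xml.etree.ElementTree as the generic parser. text_of is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Named entity references: meaningful only to a processor that has read
# the DTD in which they are declared.
NAMED = "<p>&pi;&alpha;&tau;&rho;&omicron;&sigmaf;</p>"

# Numeric character references: defined by XML itself, no DTD required.
NUMERIC = "<p>&#x3C0;&#x3B1;&#x3C4;&#x3C1;&#x3BF;&#x3C2;</p>"


def text_of(document):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    print(text_of(NAMED))    # None: the entities are undefined without a DTD
    print(text_of(NUMERIC))  # the six Greek letters
```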

None of this is to say that you should not use DTDs or schemas for validation. A document can stand completely on its own and still have a DTD or a schema or both. However, those features of DTDs and schemas that augment the infoset, as opposed to merely defining validation rules, have proven to be problematic. When publishing a document, strive to make it completely self-contained. Think of entities, default attribute values, and the like as authoring tools, not publishing tools. One easy way to merge these pieces together is by using an XSLT stylesheet that performs the identity transformation.

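<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>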

The resulting output will have added in all default attribute values and resolved all entities. A couple of XSLT engines can even be configured to resolve XIncludes as well.
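Where no XSLT engine is at hand, the same flattening can be approximated by parsing the authored document and serializing it again. The sketch below is not XSLT. It swaps in Python's xml.etree.ElementTree, and it covers only what that parser resolves: internal entities and defaults declared in the internal DTD subset. The note element, the publisher entity, and the flatten function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical authoring form: an internal entity plus a defaulted attribute.
AUTHORED = """<!DOCTYPE note [
  <!ENTITY publisher "Example Press">
  <!ATTLIST note lang CDATA "en">
]>
<note>Published by &publisher;.</note>"""


def flatten(document):
    """Parse, then serialize: entities and defaults end up in the instance."""
    root = ET.fromstring(document)
    # The serialized form carries no DOCTYPE; everything the parser resolved
    # is now written out explicitly.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(flatten(AUTHORED))
```

Note that this round trip also discards the document type declaration along with any comments and processing instructions, so it suits publishing copies rather than master sources.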

Reversing the perspective, consider not the producers of documents but the consumers. Consumers cannot assume that documents are self-contained. In the spirit of being liberal in what you accept but conservative in what you produce, when receiving a document from a third party you should do your utmost to read the external DTD subset, resolve all external entity references, and apply all default attribute values. You should only use parsers that can do this, especially for DTDs. While most APIs allow you to turn off resolution of external entity references and ignore the external DTD subset, you should almost never do so. Reading the external DTD subset will cost you little in the case where it isn't necessary and may save your bacon in the case where it is. Programs that repeatedly read the same external DTD subsets can use the public identifiers, catalogs, and various caching strategies to cache the DTDs so repeated trips back to a server are not necessary. (See Item 47.)
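As one illustration of the consumer side, the following sketch asks Python's xml.sax reader to load the external DTD subset and answers the request from a local cache keyed by public identifier. The public ID, the system ID at example.invalid, the DTD text, and the names CATALOG, CatalogResolver, Collector, and read_with_dtd are all invented for this example. A real catalog would be far larger and would need a fallback for unknown identifiers.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical public identifier and a locally cached copy of its DTD.
PUBLIC_ID = "-//Example//DTD Greek Demo//EN"
CATALOG = {
    PUBLIC_ID: (b'<!ENTITY pi "&#x3C0;">\n'
                b'<!ENTITY alpha "&#x3B1;">\n'
                b'<!ATTLIST p lang CDATA "el">\n'),
}

# A third-party document that is not self-contained. The system ID is a
# made-up URL that is never fetched.
DOCUMENT = (b'<!DOCTYPE p PUBLIC "-//Example//DTD Greek Demo//EN" '
            b'"http://example.invalid/greek.dtd">\n'
            b'<p>&pi;&alpha;</p>')


class CatalogResolver(EntityResolver):
    """Serve DTDs from the local cache instead of going back to a server."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        # An unknown public ID raises KeyError here; this sketch has no fallback.
        source.setByteStream(io.BytesIO(CATALOG[publicId]))
        return source


class Collector(ContentHandler):
    """Record the text and the attributes (defaults included) that arrive."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self.text = []

    def startElement(self, name, attrs):
        self.attributes[name] = dict(attrs.items())

    def characters(self, content):
        self.text.append(content)


def read_with_dtd(document):
    parser = xml.sax.make_parser()
    # Off by default in recent Python versions; turned on deliberately here.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(CatalogResolver())
    collector = Collector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(document))
    return "".join(collector.text), collector.attributes


if __name__ == "__main__":
    print(read_with_dtd(DOCUMENT))
```

Routing every request through a resolver keeps the parser from contacting arbitrary servers while still letting it read the declarations it needs.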

This is especially useful with web browsers and XHTML, and indeed it's sanctioned by the XHTML specification. Web browsers that support XHTML have, for practical intents and purposes, internally cached the XHTML DTD. Thus when they encounter a document whose public ID labels it as XHTML, they do not need to read the DTD. They already know how to resolve entity references like &Theta; and &alpha;. However, this option is not available to a general purpose XML parser that does not know which specific XML applications it will encounter. Such a processor must read the external DTD subset before it can confidently replace all entities.
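A program that knows it is reading XHTML can imitate the browsers' approach by preloading the entity names it expects. The sketch below does this with Python's xml.etree.ElementTree, filling the parser's entity table from the standard html.entities module. That table stands in for the XHTML entity sets and is an approximation chosen for illustration, and parse_with_builtin_entities is a made-up name.

```python
import html.entities
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The DOCTYPE names an external subset that this parser will not read.
XHTML_DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>&Theta; and &alpha;</p></body>'
    '</html>'
)


def parse_with_builtin_entities(document):
    """Parse XHTML with a preloaded entity table instead of the DTD."""
    parser = ET.XMLParser()
    # Assumption: the HTML entity names from the standard library are an
    # adequate stand-in for the entities the XHTML DTD declares.
    parser.entity.update(
        (name, chr(codepoint))
        for name, codepoint in html.entities.name2codepoint.items()
    )
    parser.feed(document)
    return parser.close()


if __name__ == "__main__":
    root = parse_with_builtin_entities(XHTML_DOC)
    print(root.find(f".//{{{XHTML_NS}}}p").text)
```

This shortcut is only as good as the table behind it, which is exactly why it is unavailable to a parser that cannot predict the vocabulary it will meet.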

In brief, be liberal in what you read and conservative in what you write. Always include all necessary information in the document itself, but be prepared to handle the case where other document producers have not been so wise.