Item 34. Read the Complete DTD | Effective XML: 50 Specific Ways to Improve Your XML

One of the innovations of XML was making validity optional. Many XML documents do not have DTDs at all. Even if a document has a DTD, there is no guarantee that the document is valid with respect to its DTD. Invalid, merely well-formed documents can be usefully processed. In fact, even if the document has a DTD the processor may choose to ignore it, and that's where a potential problem arises, because DTDs do more than merely determine whether or not a document is valid. They also make contributions to the document's information set (infoset) in several ways.

They define entity references such as &copy; and &signature;.
They provide default values for attributes.
They declare notations.
They declare unparsed entities.

The last two points aren't very important in practice (see Item 16), but the first two can be crucial. If a program fails to read the DTD, it may not have a complete picture of the document's information. For example, consider the following simple SVG document.

 <?xml version="1.0" standalone="no"?>
 <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
   "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
 <svg width="5cm" height="4cm" version="1.1">
   <rect x="1.5cm" y="1.5cm" width="2cm" height="1cm"/>
 </svg>

The namespace declaration seems to be missing. However, the SVG DTD provides this through a default attribute value. A program that does not read the DTD cannot correctly process this document. It will not find any elements in the SVG namespace.
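The effect of a defaulted namespace declaration can be sketched in a few lines of Java. This is only an illustration under stated assumptions: it uses the JAXP parser bundled with the JDK, and instead of the real SVG DTD it uses a one-line stand-in ATTLIST declaration in the internal DTD subset so that it runs without network access. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DefaultedNamespaceDemo {

    // Stand-in for the SVG DTD: one ATTLIST declaration that supplies the
    // xmlns attribute. The svg start-tag itself carries no namespace declaration.
    static final String DOC =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE svg [\n"
      + "  <!ATTLIST svg xmlns CDATA #FIXED \"http://www.w3.org/2000/svg\">\n"
      + "]>\n"
      + "<svg width=\"5cm\" height=\"4cm\" version=\"1.1\"/>";

    /** Returns the namespace URI the parser reports for the root element. */
    static String rootNamespace(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        // With the declaration read, the root element is in the SVG namespace.
        System.out.println(rootNamespace(DOC));
        // The same element without any DTD is in no namespace at all.
        System.out.println(rootNamespace("<svg/>"));
    }
}
```

A parser that skipped the declaration would report the second result for both documents, which is exactly the failure described above.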

The situation is even worse if the document uses namespace prefixes. For example, consider the document below.

 <?xml version="1.0" standalone="no"?>
 <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
   "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
   <!ENTITY % NS.prefixed "INCLUDE" >
   <!ENTITY % SVG.prefix "svg" >
 ]>
 <svg:svg width="5cm" height="4cm" version="1.1">
   <svg:rect x="1.5cm" y="1.5cm" width="2cm" height="1cm"/>
 </svg:svg>

Unless the parser reads the external DTD subset, it will conclude that this document is namespace malformed, throwing a potentially fatal error.

Problems can arise even without considering namespaces. Other attributes can be crucial too. For example, if an xlink:type attribute of a link element is defaulted from the DTD, a browser that doesn't read the DTD won't recognize the element as an XLink. Or consider a typical DocBook programlisting element like this one.

 <programlisting>if (x == 3.5) {
   doSomething();
 }</programlisting>

A browser that hasn't read the DocBook DTD won't know that it has xml:space="preserve" and format="linespecific" attributes, both crucial pieces of information for correct formatting.

Proper handling of entities also mandates processing the full DTD. Consider the following document.

 <?xml version="1.0"?>
 <!DOCTYPE Surgery SYSTEM "surgery.dtd">
 <Surgery>
   <Procedure>Thigh Liposuction</Procedure>
   <Step>Set vacuum meter to the amount of fat to be removed,
         not more than twelve ounces per procedure.</Step>
   &preop;
   <Step>Make one inch vertical incision six inches above
         knee.</Step>
   <Step>Insert suction tube.</Step>
   <Step>Suck out fat until the vacuum automatically shuts
         off.</Step>
   <Step>Remove suction tube from patient.</Step>
   <Step>Sew up incision.</Step>
   <Disclaimer>
     I am not a doctor. These instructions are not real.
     Do not try this at home.
   </Disclaimer>
 </Surgery>

Suppose that, when resolved, the &preop; entity points to the document fragment below.

 <Step>Verify patient has not eaten for at least 12 hours.</Step>
 <Step>Set mix to 65% oxygen, 35% ether.</Step>
 <Step>Turn on oxygen.</Step>
 <Step>Turn on ether.</Step>
 <Step>Verify that oxygen is flowing.</Step>
 <Step>Place mask on patient.</Step>
 <Step>Ask patient to begin counting backwards from 100.</Step>
 <Step>Verify patient is asleep.</Step>

I wouldn't want to be the patient on the table if these instructions were read by a parser that did not resolve entities.
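To make the danger concrete, the following sketch counts the Step elements a SAX parser reports with and without external general entities turned on. It assumes the parser bundled with the JDK, which silently skips an external entity reference when the feature is off. The document is a shortened stand-in for the one above, and the system identifier, class name, and method name are all hypothetical. The entity is served from memory by a resolver, so nothing is fetched from the network.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SkippedStepsDemo {

    // Hypothetical system identifier; the resolver below answers for it locally.
    static final String PREOP_ID = "http://example.invalid/preop.xml";

    static final String DOC =
        "<!DOCTYPE Surgery [\n"
      + "  <!ENTITY preop SYSTEM \"" + PREOP_ID + "\">\n"
      + "]>\n"
      + "<Surgery>&preop;<Step>Insert suction tube.</Step></Surgery>";

    // Replacement text of the external entity: two more steps.
    static final String PREOP =
        "<Step>Turn on oxygen.</Step><Step>Verify patient is asleep.</Step>";

    /** Parses DOC and returns how many Step elements the parser reported. */
    static int countSteps(boolean readExternalEntities) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(
            "http://xml.org/sax/features/external-general-entities",
            readExternalEntities);
        final int[] steps = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                if (PREOP_ID.equals(systemId)) {
                    return new InputSource(new StringReader(PREOP));
                }
                return null; // fall back to the parser's default behavior
            }
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("Step".equals(qName)) steps[0]++;
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(DOC)), handler);
        return steps[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("entities read:    " + countSteps(true));
        System.out.println("entities skipped: " + countSteps(false));
    }
}
```

With the feature off the parse still succeeds, so nothing warns the application that the anesthesia steps were never seen.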

I am not suggesting that you write documents that depend on the DTD, especially the external DTD subset, in these fashions. In fact, Item 18 recommends exactly the opposite. However, documents like these examples do exist and will continue to exist in the real world. They cannot be avoided. Not producing documents like these is being conservative in what you send. Reading such documents correctly is being liberal in what you accept. Both are important principles for robust software.

It is not necessary to validate in order to read the DTD. Although a fully validating parser will always read the complete DTD, apply all default attribute values, and resolve all entity references, you do not have to validate in order to do these things. Most parsers provide options to read the DTD without validating. Exactly how this is configured varies from parser to parser and API to API. Using SAX you simply have to turn on the http://xml.org/sax/features/external-general-entities and http://xml.org/sax/features/external-parameter-entities features, as shown here.

 XMLReader parser = XMLReaderFactory.createXMLReader();
 parser.setFeature(
   "http://xml.org/sax/features/external-general-entities", true);
 parser.setFeature(
   "http://xml.org/sax/features/external-parameter-entities", true);

There's no special feature for indicating whether the parser should apply default attribute values or not. However, if the parser does read the DTD, it is required to apply any default attribute values it finds there. Thus turning on these two features has the effect of ensuring that default attribute values are also resolved.

SAX parsers are not required to support these features, although most do. I have seen one or two that did not recognize these features but could validate, so turning on validation instead may be an option. However, if the parser supports neither external general entities nor validation, then it's time to find a better parser.
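The fallback just described might be packaged as a small helper like the one below. This is a sketch, not a definitive recipe: the class and method names are hypothetical, and it obtains the XMLReader through JAXP's SAXParserFactory rather than XMLReaderFactory. If the parser accepts neither the two entity features nor validation, the final exception simply propagates to the caller.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

final class DtdReadingParsers {

    static final String GENERAL =
        "http://xml.org/sax/features/external-general-entities";
    static final String PARAMETER =
        "http://xml.org/sax/features/external-parameter-entities";
    static final String VALIDATION =
        "http://xml.org/sax/features/validation";

    /**
     * Returns a reader configured to read the complete DTD. It prefers the
     * two entity features and turns on validation only if the parser does
     * not recognize or support them.
     */
    static XMLReader newDtdReadingReader()
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        try {
            reader.setFeature(GENERAL, true);
            reader.setFeature(PARAMETER, true);
        } catch (SAXNotRecognizedException | SAXNotSupportedException ex) {
            // A validating parser must read the whole DTD, so this has the
            // same effect. If it fails too, the exception reaches the caller.
            reader.setFeature(VALIDATION, true);
        }
        return reader;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newDtdReadingReader();
        System.out.println(GENERAL + " = " + reader.getFeature(GENERAL));
        System.out.println(PARAMETER + " = " + reader.getFeature(PARAMETER));
    }
}
```

If the validation branch is taken, validity errors are reported to the registered ErrorHandler. An application that only wants the DTD's infoset contributions can register a handler that ignores them.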

The primary reason most developers cite for not reading the DTD is performance. Especially if the DTD resides on a remote server, loading and applying it can take a significant amount of time. There are a couple of options to ameliorate this effect, especially if you have advance knowledge of the DTDs the application must process.

If the DTDs are identified by public URIs, catalogs can replace the remote canonical DTDs with local copies. (See Item 47.) If the DTDs are identified via system identifiers and you're using SAX, you can use an EntityResolver instead. For example, here's a simple SAX EntityResolver that loads all the XHTML 1.0 DTDs once and then serves them out of local memory.

 import org.xml.sax.*;
 import java.io.*;
 import java.net.URL;
 import java.util.Hashtable;

 public class FastXHTML implements EntityResolver {

   // Maps each public ID to the bytes of the DTD or entity set it names
   private static Hashtable<String, byte[]> entities
    = new Hashtable<String, byte[]>();

   // fill the list of URLs
   static {
     // The XHTML 1.0 DTDs
     addMapping("-//W3C//DTD XHTML 1.0 Strict//EN",
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
     addMapping("-//W3C//DTD XHTML 1.0 Transitional//EN",
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");
     addMapping("-//W3C//DTD XHTML 1.0 Frameset//EN",
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd");
     // The XHTML 1.0 entity sets
     addMapping("-//W3C//ENTITIES Latin 1 for XHTML//EN",
      "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent");
     addMapping("-//W3C//ENTITIES Symbols for XHTML//EN",
      "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent");
     addMapping("-//W3C//ENTITIES Special for XHTML//EN",
      "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent");
   }

   // Downloads the resource once and caches its bytes under the public ID
   private static void addMapping(String publicID, String url) {
     try (InputStream in = new URL(url).openStream()) {
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       int c;
       while ((c = in.read()) != -1) out.write(c);
       entities.put(publicID, out.toByteArray());
     }
     catch (Exception ex) {
       // exceptions in static blocks are a real pain
       throw new RuntimeException("Could not initialize " + url, ex);
     }
   }

   public InputSource resolveEntity(String publicID,
    String systemID) throws SAXException {
     // publicID is null when only a system identifier was given, and
     // Hashtable rejects null keys. Note that containsKey() tests the
     // keys; Hashtable.contains() would test the values instead.
     if (publicID != null && entities.containsKey(publicID)) {
       byte[] data = entities.get(publicID);
       InputSource source = new InputSource(systemID);
       source.setPublicId(publicID);
       source.setByteStream(new ByteArrayInputStream(data));
       return source;
     }
     else return null;
   }
 }
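Installing such a resolver takes one call to setEntityResolver(). Because FastXHTML contacts www.w3.org when the class is loaded, the sketch below uses a much smaller stand-in so that it can run offline: a hypothetical public ID, a hypothetical system ID, and a one-declaration DTD held in a string. It assumes the JDK's bundled SAX parser, and none of these names come from a real vocabulary.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class LocalDtdDemo {

    // Hypothetical identifiers for an imaginary "note" vocabulary.
    static final String PUBLIC_ID = "-//Example//DTD Note 1.0//EN";
    static final String SYSTEM_ID = "http://example.invalid/note.dtd";

    static final String DOC =
        "<!DOCTYPE note PUBLIC \"" + PUBLIC_ID + "\" \"" + SYSTEM_ID + "\">\n"
      + "<note>Remember the DTD</note>";

    // Local copies of known DTDs, keyed by public ID.
    static final Map<String, String> LOCAL_DTDS = new HashMap<>();
    static {
        LOCAL_DTDS.put(PUBLIC_ID, "<!ATTLIST note lang CDATA \"en\">");
    }

    // Serves known DTDs from memory. Returning null tells the parser to
    // load the entity from its system identifier as usual.
    static final EntityResolver RESOLVER = (publicId, systemId) -> {
        String dtd = LOCAL_DTDS.get(publicId);
        if (dtd == null) return null;
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    };

    /** Parses DOC and returns the lang attribute seen on the note element. */
    static String defaultedLang() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(RESOLVER);
        final String[] lang = new String[1];
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("note".equals(qName)) lang[0] = atts.getValue("lang");
            }
        });
        reader.parse(new InputSource(new StringReader(DOC)));
        return lang[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("lang = " + defaultedLang());
    }
}
```

The content handler sees the lang attribute even though the document never writes it and the remote DTD is never fetched.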

DOM Level 3 Load and Save provides a similar interface for use with DOM, LSResourceResolver (named DOMEntityResolver in early drafts of the specification).

Some programs that are designed to process only very specific vocabularies may hardwire knowledge of the DTD. The classic example of this is a web browser. The browser handles HTML and only HTML. It knows in advance the replacement text for all the HTML entities such as &nbsp; and &Omega;. It knows which attributes have which default values. It even knows and uses the content models of various elements.

Of course, an application that relies on a presumed copy of the actual DTD (whether by foreknowledge, a local catalog, or an EntityResolver) may be tripped up if it encounters a document that points to a modified copy of the DTD. Thus it's still better to read the actual DTD if you reasonably can. However, in cases such as HTML, a document that modifies the DTD but uses the usual public identifier is completely nonconformant, and the author of that document deserves what he or she gets. Modifying a DTD is OK, but you need to assign the modified DTD new system and public IDs.

Another alternative, this one for document producers rather than document consumers, is to set the standalone attribute in the XML declaration to yes. (See Item 1.) This is a promise that the external DTD subset makes no contributions to the document's infoset and thus can be safely skipped. That is, it says that no default attribute values are applied that are not also present in the document itself and that no entities used in the document are defined in the external DTD subset. If the document specifies standalone="yes", you may safely forgo reading the external DTD subset. Unfortunately, the standalone attribute can be set to yes only when the document contains no white space in element content. (See Item 10.) Since most documents do use such ignorable white space, the standalone option is not always available.
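A consumer that wants to take advantage of this promise has to look at the XML declaration before it decides how to configure its parser. One way to do that, sketched here with the StAX API, is to open a stream reader and ask it about the declaration without advancing past it. This assumes, as the JDK's implementation does, that the reader has scanned only the XML declaration at the point it is created. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StandaloneCheck {

    /**
     * Returns true only if the XML declaration explicitly says
     * standalone="yes". A missing declaration or standalone="no"
     * both mean the external DTD subset should be read.
     */
    static boolean declaresStandaloneYes(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new StringReader(xml));
        try {
            // The reader starts out positioned on START_DOCUMENT, where the
            // values from the XML declaration are available.
            return reader.standaloneSet() && reader.isStandalone();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(declaresStandaloneYes(
            "<?xml version=\"1.0\" standalone=\"yes\"?><doc/>"));
        System.out.println(declaresStandaloneYes(
            "<?xml version=\"1.0\" standalone=\"no\"?><doc/>"));
        System.out.println(declaresStandaloneYes("<doc/>"));
    }
}
```

Only when this check returns true is it reasonable to turn off the two external entity features for the real parse.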

None of this should be construed as giving programs any leeway in not reading the internal DTD subset. The XML specification absolutely requires all XML processors to read the internal DTD subset, report any well-formedness errors found therein, apply default attribute values found in attribute declarations in the internal DTD subset, and use the entity declarations found there to resolve internal entity references. A parser that does not read or use the internal DTD subset does not adhere to the minimum level of conformance required by XML 1.0.

In the long run, the most reliable results are produced by reading the complete DTD of any document that carries a document type declaration, even if you're not validating. Always insist on reading the complete DTD unless the standalone attribute is set to yes. It may take a little longer, but it may also be a lot more correct.