DOM | Effective XML: 50 Specific Ways to Improve Your XML

DOM Level 3 should finally make it possible to write completely implementation-independent DOM code, at least in Java. However, in DOM Level 2 some crucial parts of DOM require parser-dependent code, and in other languages this is likely to remain true in DOM Level 3. In particular, DOM2 defines no way to:

Construct an instance of the DOMImplementation interface
Parse a document from a stream

Everything else can be done with pure DOM, but these two operations require implementation-specific classes. For example, Xerces loads its DOMImplementation using a nonstandard static method in the org.apache.xerces.dom.DOMImplementationImpl class.

 DOMImplementation impl = DOMImplementationImpl.getDOMImplementation();

Parsing a document into a DOM tree with Xerces requires instantiating a nonstandard DOMParser class through its constructor as shown below.

 // DOMParser is the Xerces-specific class org.apache.xerces.parsers.DOMParser
 DOMParser parser = new DOMParser();
 parser.parse("http://www.example.com/");
 Document doc = parser.getDocument();

Other DOM implementations such as Crimson, GNU JAXP, and Oracle use different classes and patterns.

In Java, JAXP provides a partial solution for those DOM implementations that implement JAXP (which is most of the major DOM implementations for Java nowadays). The javax.xml.parsers.DocumentBuilderFactory class can load a javax.xml.parsers.DocumentBuilder object. This DocumentBuilder object can parse a document or locate a DOMImplementation object.

 DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
 DocumentBuilder parser = builderFactory.newDocumentBuilder();
 DOMImplementation impl = parser.getDOMImplementation();
 Document doc = parser.parse("http://www.example.com");
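As a slightly fuller sketch of the same JAXP pattern, the class below parses XML from a string rather than a URL so that it runs without network access. The class name JaxpLoadingSketch and the parse helper are illustrative names, not part of JAXP, and turning on namespace awareness is an assumption about what most applications want.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; only the javax.xml.parsers and org.w3c.dom calls are standard.
final class JaxpLoadingSketch {

    // Parses XML text with whichever DOM implementation JAXP selects at runtime.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories are not namespace aware by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Covers ParserConfigurationException, SAXException, and IOException
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<greeting>Hello</greeting>");
        // The standard DOM route from a document back to its implementation
        DOMImplementation impl = doc.getImplementation();
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(impl.getClass().getName());
    }
}
```

Nothing in this code names Xerces, Crimson, or any other parser, so the same source should compile and run against any JAXP-compliant implementation.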

The exact implementation is read from the javax.xml.parsers.DocumentBuilderFactory Java system property. If this property is not set, JAXP looks in the lib/jaxp.properties properties file in the JRE directory. If that fails to locate a parser, JAXP next looks for a META-INF/services/javax.xml.parsers.DocumentBuilderFactory file in all the JAR files in the classpath. Finally, if that fails, DocumentBuilderFactory.newInstance() returns a default implementation.
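To see which factory this lookup actually selected, a program can simply print the class of the object that newInstance() returns. The following is a minimal sketch; FactoryLookupSketch and its members are hypothetical names chosen for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic class for inspecting the JAXP lookup result.
final class FactoryLookupSketch {

    // Name of the system property that JAXP consults first.
    static final String FACTORY_PROPERTY = "javax.xml.parsers.DocumentBuilderFactory";

    // Reports the concrete factory class the lookup settled on.
    static String selectedFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Prints "null" for the property when it has not been set
        System.out.println("System property: " + System.getProperty(FACTORY_PROPERTY));
        System.out.println("Factory in use:  " + selectedFactoryClass());
    }
}
```

Assuming Xerces-J is on the classpath, running with -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl should switch the second line of output to that class without any change to the source.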

As with SAX parsers, you do have to worry a little about exactly which features the underlying implementation supports. However, for most applications the basic features shared by all conformant implementations suffice.
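One standard way to ask what the loaded implementation supports is the DOM2 method DOMImplementation.hasFeature(). The sketch below queries a handful of DOM Level 2 module names; the class name and the particular list of modules are assumptions made for illustration, and which modules matter depends on the application.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;

// Hypothetical helper class; hasFeature() itself is standard DOM Level 2.
final class FeatureCheckSketch {

    // Feature name and version pairs to probe (an illustrative selection).
    static final String[][] MODULES = {
        {"Core", "2.0"}, {"XML", "2.0"}, {"Traversal", "2.0"}, {"Range", "2.0"}, {"Events", "2.0"}
    };

    // Maps "name version" to whether the JAXP-selected implementation claims support.
    static Map<String, Boolean> supportedModules() {
        DOMImplementation impl;
        try {
            impl = DocumentBuilderFactory.newInstance().newDocumentBuilder().getDOMImplementation();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DocumentBuilder", e);
        }
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String[] module : MODULES) {
            result.put(module[0] + " " + module[1], impl.hasFeature(module[0], module[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        supportedModules().forEach((name, ok) -> System.out.println(name + ": " + ok));
    }
}
```

A program that depends on an optional module such as Traversal can check the corresponding entry at startup and fail early, instead of discovering the gap through a ClassCastException later.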

The second step in writing implementation-independent DOM code is making sure you use only standard DOM methods. Many DOM implementations including Xerces provide extra, useful methods in their implementation classes. Sometimes these methods can be very helpful and even let you do things that simply cannot be done via standard methods, such as adding a document type declaration to a document. However, using any of these methods ties your code to that one implementation, binding it far more tightly than just how you load the parser or the implementation. Implementation-specific methods can infect the entire design of your code, requiring you to completely redesign a program in order to move it to another implementation, rather than simply changing a few lines here or there. Think very carefully about whether you're willing to lock yourself into a single implementation before doing this.
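It is sometimes possible to stay within the standard by restructuring the task. For instance, although attaching a document type declaration to an existing document needs implementation-specific code, DOM Level 2 does let you supply one at the moment a new document is created. The sketch below shows that route; the class and method names are hypothetical, and the DocBook identifiers are only sample values.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical helper class; createDocumentType() and createDocument() are standard DOM Level 2.
final class StandardDoctypeSketch {

    // Creates an empty document whose root element and DOCTYPE share the given name.
    static Document newDocumentWithDoctype(String rootName, String publicId, String systemId) {
        try {
            DOMImplementation impl =
                DocumentBuilderFactory.newInstance().newDocumentBuilder().getDOMImplementation();
            DocumentType doctype = impl.createDocumentType(rootName, publicId, systemId);
            // A null namespace URI yields a root element in no namespace
            return impl.createDocument(null, rootName, doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DocumentBuilder", e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocumentWithDoctype(
            "book",
            "-//OASIS//DTD DocBook XML V4.2//EN",
            "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd");
        System.out.println(doc.getDoctype().getName());
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

This only helps when the program controls document creation. If the document arrives from a parser without a DOCTYPE, the trade-off described above still applies.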

By far the worst offender here is the Microsoft DOM implementation in MSXML and .NET. While MSXML supports all the standard parts of DOM, it also includes many, many nonstandard extensions. Worse yet, very little third-party documentation and almost no official Microsoft documentation bothers to note the difference between the standard parts and the Microsoft extensions. Most tutorials and sample code make very heavy use of the Microsoft extensions, even to the point where there's almost no standard DOM code left. (By contrast, the documentation for Xerces-J focuses almost exclusively on the standard DOM interfaces. The documentation for the nonstandard extensions is relatively hidden, and few books discuss it. It is meant primarily for "maintainers and developers of the Xerces2 reference implementation," not for end users, and its existence is an open secret for the initiated.) In particular, be wary of the following fields and methods, which are very common in programs that use the MSXML DOM:

IXMLDOMNode, IXMLDOMNodeList, IXMLDOMAttribute, IXMLDOMComment, IXMLDOMDocument, IXMLDOMElement, IXMLDOMText, and so on
innerXml, outerXml, and xml
innerText, outerText, and text
transformNode
transformNodeToObject
SelectSingleNode
SelectNodes
definition
dataType
baseName
nodeTypedValue
nodeTypeString

These are all nonstandard Microsoft extensions to DOM, and using any of them effectively ties your program to MSXML such that it cannot be easily ported to other implementations.
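For comparison, here is a rough Java sketch of how some of the same jobs can be done without vendor extensions. It is not a translation of MSXML code. XPath selection and serialization are handled by the JAXP packages javax.xml.xpath and javax.xml.transform, which sit alongside DOM rather than inside it, and getTextContent() is a DOM Level 3 method. The class and helper names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class built only on JAXP and W3C DOM interfaces.
final class PortableAlternativesSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // Plays the role of a node-selection extension, using the JAXP XPath API.
    static NodeList select(Node context, String expression) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, context, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath expression: " + expression, e);
        }
    }

    // Plays the role of an "xml" style property, using an identity transform.
    static String serialize(Node node) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<list><item>a</item><item>b</item></list>");
        NodeList items = select(doc, "/list/item");
        for (int i = 0; i < items.getLength(); i++) {
            System.out.println(items.item(i).getTextContent()); // DOM Level 3
        }
        System.out.println(serialize(doc));
    }
}
```

The point of the sketch is only that each helper depends on interfaces several implementations provide, so swapping the underlying parser should not require touching the callers.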

One final thing to keep in mind: Although with care you can swap one DOM implementation for another, you cannot generally mix and match different DOM implementations in the same program. For example, you cannot add a GNU JAXP Element object to a Xerces Document object. Internally, all DOM implementations I've encountered do make intimate use of the detailed implementation classes, rather than limiting themselves to the public interfaces. This may be necessary and even desirable for the implementation internals, but it is not an excuse for doing the same in the public code that lives above the interface.
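Within the standard API, the way to bring a node into a different document is Document.importNode(), which makes a copy owned by the target document. The sketch below demonstrates this with two documents from the same implementation, since only one is assumed to be available; whether importNode() also copes with a node from a different implementation is something to test rather than assume. The class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class; importNode() and appendChild() are standard DOM Level 2.
final class ImportNodeSketch {

    // Creates a document containing a single empty root element.
    static Document newDocument(String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement(rootName));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DocumentBuilder", e);
        }
    }

    // Returns true if appending a node owned by another document is rejected.
    static boolean directAppendFails(Document target, Node foreign) {
        try {
            target.getDocumentElement().appendChild(foreign);
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.WRONG_DOCUMENT_ERR;
        }
    }

    // Deep-copies the foreign node into the target document and appends the copy to its root.
    static Node copyInto(Document target, Node foreign) {
        Node copy = target.importNode(foreign, true);
        return target.getDocumentElement().appendChild(copy);
    }

    public static void main(String[] args) {
        Document source = newDocument("source");
        Element item = source.createElement("item");
        item.setTextContent("payload");
        source.getDocumentElement().appendChild(item);

        Document target = newDocument("target");
        System.out.println("Direct append rejected: " + directAppendFails(target, item));
        Node copy = copyInto(target, item);
        System.out.println("Copy owned by target: " + (copy.getOwnerDocument() == target));
    }
}
```

Because importNode() copies rather than moves, the original node stays in its source document.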