5.3 DOM | The O'Reilly Java Authors - Java™ Enterprise Best Practices


DOM, the Document Object Model, is a far better understood API than SAX. This is largely because it operates using more familiar Java principles: the XML input document is represented as a set of Java objects, output is performed via DOM serializers, and the ability to make significant programming errors is greatly lessened compared to SAX. However, DOM still presents more than its fair share of pitfalls to the unprepared developer.

When deciding to use DOM, you should always begin by first determining if you can use SAX. While SAX programming is generally more complex than DOM programming, it is almost always faster. However, if you were unable to answer "yes" to all the questions in Section 5.2, DOM might be a better solution for you.

In particular, there are two things that DOM excels at:

Providing an object model of an XML document
Providing access to all parts of an XML document at one time

These two aspects of DOM are interesting in that both appeal strongly to Java developers. On the whole, today's developers prefer using an object-oriented programming model and having an entire business unit (whether it be an XML document or some other construct) available for random access at all times. SAX forces a more procedural and generally unfamiliar programming model, while DOM's model seems more natural.

Of course, you pay performance and memory penalties for these features: keeping an entire document in memory and representing that document as objects expends both time and resources. That being said, DOM can be very powerful once you learn to account for its shortcomings. The following tips can help you use DOM as effectively as possible, decreasing performance penalties when possible.

5.3.1 Bootstrap DOM Correctly

One of the most misunderstood and abused aspects of using DOM is getting an initial DOM implementation for programming. This is typically referred to as bootstrapping, and it is often done incorrectly. There are two different approaches, depending on which version of the DOM specification your implementation adheres to.

5.3.1.1 DOM Levels 1 and 2

In DOM Levels 1 and 2, the process of getting a DOM implementation to work is difficult. Before I explain why, though, you should understand why you need a DOM implementation in the first place.

If you are reading in an XML document (e.g., from an existing file or input stream), this entire section is not applicable. In these cases, the reader.getDocument() method will return a DOM Document object, and you can then operate on that DOM tree without any problem. I'm also assuming that you've chosen not to use JAXP. If you are using JAXP, these issues are taken care of for you. In many cases, however, either JAXP is not available or you have software restrictions that make it a nonoption in your application or organization.

The end goal of bootstrapping is to get a vendor's implementation of the org.w3c.dom.Document interface. Most developers' instincts, therefore, are to write this line of code:

    Document doc = new org.apache.xerces.dom.DocumentImpl();

The obvious problem here is that your code is now tied to Xerces, and can't work with another parser without modification. In fact, it often won't function with a different version of the same parser you are using. In addition to obtaining a Document, using this approach will force you to configure additional vendor-specific details to obtain implementations of the org.w3c.dom.DocumentType interface. However, this is not the proper way to access a DOM-creating mechanism in the first place.

Instead, you should always use the org.w3c.dom.DOMImplementation interface, which acts as a factory for both Document and DocumentType. Rather than directly instantiating a DOM Document implementation, use this approach:

    DOMImplementation domImpl = new org.apache.xerces.dom.DOMImplementationImpl();
    DocumentType docType = domImpl.createDocumentType("rootElementName", "public ID",
                                                      "system ID");
    Document doc = domImpl.createDocument("", "rootElementName", docType);

Now, while this is a better approach, it is by no means a great solution. The code is still tied to a vendor-specific class! You can get around that, however, by using a system property, set either in the code through a resource file or at application startup (through the -D flag to the java interpreter). I prefer naming the property org.w3c.dom.DOMImplementationClass. In this case, the value of this property would be the Xerces implementation class: org.apache.xerces.dom.DOMImplementationImpl.

You can then use the simple helper class shown in Example 5-10 to handle creation of the needed DOM types.

Example 5-10. The DOMFactory helper class

    package com.oreilly.xml;

    import org.w3c.dom.Document;
    import org.w3c.dom.DocumentType;
    import org.w3c.dom.DOMImplementation;

    public class DOMFactory {

        /** System property name */
        private static final String IMPL_PROPERTY_NAME =
            "org.w3c.dom.DOMImplementationClass";

        /** Initialization flag */
        private static boolean initialized = false;

        /** The DOMImplementation to use */
        private static DOMImplementation domImpl;

        private static void initialize() throws Exception {
            domImpl =
                (DOMImplementation)Class.forName(
                    System.getProperty(IMPL_PROPERTY_NAME)).newInstance();
            initialized = true;
        }

        public static Document newDocument(String namespaceURI,
              String rootQName, DocumentType docType)
            throws Exception {

            if (!initialized) {
                initialize();
            }

            return domImpl.createDocument(namespaceURI, rootQName, docType);
        }

        public static DocumentType newDocumentType(String rootQName,
              String publicId, String systemId)
            throws Exception {

            if (!initialized) {
                initialize();
            }

            return domImpl.createDocumentType(rootQName, publicId, systemId);
        }
    }

Just add this class to your classpath, along with whatever parser JAR file you need, and set the system property before parsing. You'll be able to obtain DOM types without even dipping into vendor-specific code.

Note that I left the exception handling fairly vague; it's generally not a good idea to just throw Exception. It's preferable to throw something more specific, and here that should generally be an application-defined exception rather than one of the low-level reflection exceptions raised during initialization. I'll let you replace the generic Exception with your more specific one. Other than that, the class is ready to go as is.
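As a minimal sketch of that replacement, the loader below wraps every reflection failure in a single application exception. The names DomBootstrap and DomBootstrapException are hypothetical and not part of the helper class above, and the sketch assumes a recent JDK, so it calls getDeclaredConstructor().newInstance() instead of the older Class.newInstance().

```java
import org.w3c.dom.DOMImplementation;

class DomBootstrap {

    /** Hypothetical application exception; the name is an assumption. */
    static class DomBootstrapException extends Exception {
        DomBootstrapException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Property name suggested in the text above. */
    static final String IMPL_PROPERTY_NAME = "org.w3c.dom.DOMImplementationClass";

    /** Loads the implementation named by the system property. */
    static DOMImplementation loadFromProperty() throws DomBootstrapException {
        return load(System.getProperty(IMPL_PROPERTY_NAME));
    }

    /** Instantiates the named class and checks that it is a DOMImplementation. */
    static DOMImplementation load(String className) throws DomBootstrapException {
        if (className == null) {
            throw new DomBootstrapException(
                "System property " + IMPL_PROPERTY_NAME + " is not set", null);
        }
        try {
            Object instance =
                Class.forName(className).getDeclaredConstructor().newInstance();
            return (DOMImplementation) instance;
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Missing class, no usable no-argument constructor, or wrong type.
            throw new DomBootstrapException(
                "Cannot use " + className + " as a DOMImplementation", e);
        }
    }
}
```

Callers then deal with one exception type, and the original cause is still available through getCause() for logging.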

5.3.1.2 DOM Level 3

DOM Level 3 introduced a new means of bootstrapping, one that avoids the nasty vendor-specific problems discussed in the last section. It also avoids the need for a helper class. Through the introduction of a new DOM class, org.w3c.dom.DOMImplementationRegistry , it is possible to obtain a DOM implementation in a vendor-neutral way.

First, you or your parser vendor needs to set the system property org.w3c.dom.DOMImplementationSourceList. This property's value must consist of a space-separated list of class names that implement the org.w3c.dom.DOMImplementationSource interface. This is the key mechanism for DOM-implementing parsers.

Example 5-11 shows how the Apache Xerces parser might implement this interface.

Example 5-11. Apache Xerces implementation of DOMImplementationSource

    package org.apache.xerces.dom;

    import org.w3c.dom.DOMImplementation;
    import org.w3c.dom.DOMImplementationSource;

    public class XercesDOMImplementationSource implements DOMImplementationSource {

        public DOMImplementation getDOMImplementation(String features) {
            return new DOMImplementationImpl();
        }
    }

This is not the actual Xerces implementation class. In reality, the getDOMImplementation() method would need to verify the feature string, ensure that the Xerces implementation was sufficient, and perform other error checking before returning a DOMImplementation.

The system property could then be set to the value org.apache.xerces.dom.XercesDOMImplementationSource. Typically, this property is set through the parser's own code, or at startup of your application through a batch file or shell script:

    java -Dorg.w3c.dom.DOMImplementationSourceList=org.apache.xerces.dom.XercesDOMImplementationSource \
         some.application.class

With this machinery in place, you can then easily bootstrap a DOM implementation using the following line of code:

    DOMImplementation domImpl =
        DOMImplementationRegistry.getDOMImplementation("XML 1.0");

From here, it is simple to create a new DOM tree and perform other standard DOM operations. Because the system property handles the loading of parser- and vendor-specific details, your code remains free of vendor-specific idioms.

While this is a nice solution to the problems outlined earlier, it is important to realize that at print time, DOM Level 3 was still a bit off in the future. Some parsers include only a few DOM Level 3 features, while some include almost all of them . . . and some include none at all. As of this writing, I wasn't able to get this code working flawlessly with any of the major parser projects and products. However, getting this right should be just a matter of time, and as a Java Enterprise developer, you should know how to utilize these features as soon as they become available.
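For reference, in the DOM Level 3 Core bindings that later shipped with the Java platform, the registry ended up in the org.w3c.dom.bootstrap package and is obtained through a newInstance() factory method rather than called statically. The following is a minimal sketch under that assumption. The class name Level3Bootstrap and the root element name are illustrative only, and the sketch relies on the runtime supplying a default DOMImplementationSource when the system property is unset, which you should verify for your own environment.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

class Level3Bootstrap {

    /** Creates an empty document with the given root element, vendor-neutrally. */
    static Document newDocument(String rootName) throws ReflectiveOperationException {
        // Reads org.w3c.dom.DOMImplementationSourceList if it has been set.
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();

        // Returns null when no registered source supports the requested features.
        DOMImplementation domImpl = registry.getDOMImplementation("XML 1.0");
        if (domImpl == null) {
            throw new IllegalStateException("No DOM implementation supports XML 1.0");
        }

        // No namespace and no DOCTYPE in this sketch.
        return domImpl.createDocument(null, rootName, null);
    }

    public static void main(String[] args) throws Exception {
        Document doc = newDocument("order");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

No vendor class name appears anywhere in this code; swapping parsers is a matter of changing the system property or the classpath.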

5.3.2 Don't Be Afraid to Use Helper Classes

This is more of a general tip for working with DOM, but it still certainly belongs in the category of best practices. When working with the DOM, you should always write yourself a suite of helper classes, or at least look around for existing suites.

I have several versions of helper classes floating around my development environments at any given time, usually with names such as DOMHelper, DOMUtil, or some other permutation. I've omitted many of these from this chapter, as the methods in those classes are specific to the kinds of DOM manipulation I perform, which are likely different from the kinds of DOM manipulation you will perform. If, for example, you often need to walk trees, you may want a method such as the one shown in Example 5-12 to easily obtain the text of a node.

Example 5-12. Helper class for walking DOM trees

    // Get the text of a node without the extra TEXT node steps.
    public static String getText(Node node) {
        // Make sure this is an element.
        if (node instanceof Element) {
            // Make sure there are children.
            if (node.hasChildNodes()) {
                StringBuffer text = new StringBuffer();
                NodeList children = node.getChildNodes();
                for (int i=0; i<children.getLength(); i++) {
                    Node child = children.item(i);
                    if (child.getNodeType() == Node.TEXT_NODE) {
                        text.append(child.getNodeValue());
                    }
                }
                return text.toString();
            } else {
                return null;
            }
        } else {
            return null;
        }
    }

This is one of two types of methods that retrieve the text of an element. This variety retrieves all top-level text nodes, even if there are also nested elements. Here is a sample in which you might not get what you expect from such a method:

 <p>Here is some <i>emphasized</i> text.</p>

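In this case, getText() would return "Here is some  text." (with two spaces in the middle, because the whitespace on either side of the nested element is preserved and the emphasized word is skipped). Other varieties return only the first text node, which with the same example would return only "Here is some " (including its trailing space).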
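The two varieties can be compared side by side. The sketch below is an illustration rather than code from my own toolbox: the class name TextVarieties and the getFirstText() and parseRoot() helpers are hypothetical, and it uses JAXP only as a convenient way to parse the sample string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextVarieties {

    /** Concatenates every text node that is a direct child of the element. */
    static String getText(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE || !node.hasChildNodes()) {
            return null;
        }
        StringBuilder text = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    /** Returns only the first direct text child, or null if there is none. */
    static String getFirstText(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return null;
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                return child.getNodeValue();
            }
        }
        return null;
    }

    /** Parses a small XML string (via JAXP, for brevity) and returns its root. */
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element p = parseRoot("<p>Here is some <i>emphasized</i> text.</p>");
        // Brackets make the leading and trailing whitespace visible.
        System.out.println("[" + getText(p) + "]");
        System.out.println("[" + getFirstText(p) + "]");
    }
}
```

Neither variety descends into the nested element, so the word "emphasized" never appears in the result. Whether that is what you want depends entirely on the documents you process, which is exactly why these helpers tend to be application-specific.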

In any case, these sorts of methods can be grouped into toolboxes of DOM utilities that greatly simplify your programming. In short, you shouldn't be afraid to add your own ideas and tricks to the specification and API. There certainly isn't anything wrong with that, especially when the API as it stands is admittedly shallow in terms of convenience methods.

5.3.3 Avoid Class Comparisons

If you've worked with DOM, you know that one of the most common operations is tree walking. In fact, the last best practice showed a helper method to aid in this by walking a node's children to get its textual content. This tree walking is generally accomplished through the org.w3c.dom.Node interface, because all DOM node types (Element, Text, Comment, and so on) are interfaces that extend this base interface, and your parser provides the classes that implement them.

The problem is that there are several methods for determining a node's type, and then reacting to that type. Most Java developers familiar with polymorphism and inheritance would immediately reach for the methods of the java.lang.Class class. Using that approach, you might end up with code such as that in Example 5-13.

Example 5-13. Using Java class comparison

    NodeList children = rootNode.getChildNodes();

    // Figure out which node type you have and work with it.
    for (int i=0; i<children.getLength(); i++) {
        Node child = children.item(i);

        if (child.getClass().equals(org.w3c.dom.Element.class)) {
            Element element = (Element)child;
            // Do something with the element.
        } else if (child.getClass().equals(org.w3c.dom.Text.class)) {
            Text text = (Text)child;
            // Do something with the text node.
        } else if (child.getClass().equals(org.w3c.dom.Comment.class)) {
            Comment comment = (Comment)child;
            // Do something with the comment.
        } // etc . . .
    }

In a similar vein, I've also seen code that looks similar to Example 5-14.

Example 5-14. Using string comparisons for class names

    NodeList children = rootNode.getChildNodes();

    // Figure out which node type you have and work with it.
    for (int i=0; i<children.getLength(); i++) {
        Node child = children.item(i);

        if (child.getClass().getName().equals("org.w3c.dom.Element")) {
            Element element = (Element)child;
            // Do something with the element.
        } else if (child.getClass().getName().equals("org.w3c.dom.Text")) {
            Text text = (Text)child;
            // Do something with the text node.
        } else if (child.getClass().getName().equals("org.w3c.dom.Comment")) {
            Comment comment = (Comment)child;
            // Do something with the comment.
        } // etc . . .
    }

Before explaining why this doesn't work in relation to DOM, I should warn you that the second code fragment is a terrible idea in its own right. Comparing strings is considerably more expensive than comparing object references or integers; calling equals() like this, over and over again inside a loop, is a sure way to bog down your programs.

These might still look pretty innocuous, especially the first example. However, these code samples forget that DOM is a purely interface-based API. In other words, every concrete class in a DOM program is actually the implementation, provided by a parser project, of a DOM-standardized API. For example, you won't find in any program a concrete class called org.w3c.dom.Element, org.w3c.dom.Comment, org.w3c.dom.Text, or any other DOM construct. Instead, you will find classes such as org.apache.xerces.dom.ElementNSImpl and org.apache.xerces.dom.CommentImpl. These classes are the actual implementations of the DOM interfaces.

The point here is that these class-based comparisons will always fail. Each test compares a vendor's implementation class with a DOM interface, and the class returned by getClass() can never be an interface, because an interface cannot be instantiated. Instead of these class operations, you need to use the instanceof operator, as shown in Example 5-15.

Example 5-15. Using the instanceof operator

    NodeList children = rootNode.getChildNodes();

    // Figure out which node type you have and work with it.
    for (int i=0; i<children.getLength(); i++) {
        Node child = children.item(i);

        if (child instanceof org.w3c.dom.Element) {
            Element element = (Element)child;
            // Do something with the element.
        } else if (child instanceof org.w3c.dom.Text) {
            Text text = (Text)child;
            // Do something with the text node.
        } else if (child instanceof org.w3c.dom.Comment) {
            Comment comment = (Comment)child;
            // Do something with the comment.
        } // etc . . .
    }

Here, instanceof returns true if the object's class is the same as, is a subclass of, or is an implementation of the type on the righthand side of the operator.
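The difference is easy to observe on a parsed document. In the sketch below, which is illustrative only, the class name NodeTypeChecks and its helper methods are hypothetical, and JAXP is used simply to obtain an element from whatever parser the runtime provides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTypeChecks {

    /** Compares the runtime class with the DOM interface; false for any real node. */
    static boolean isElementByClass(Node node) {
        return node.getClass().equals(Element.class);
    }

    /** True when the node's class implements the Element interface. */
    static boolean isElementByInstanceof(Node node) {
        return node instanceof Element;
    }

    /** True when the node reports the ELEMENT_NODE type constant. */
    static boolean isElementByNodeType(Node node) {
        return node.getNodeType() == Node.ELEMENT_NODE;
    }

    /** Parses a small XML string (via JAXP, for brevity) and returns its root. */
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Node root = parseRoot("<root/>");
        // The printed class name is vendor-specific and varies by parser.
        System.out.println(root.getClass().getName());
        System.out.println(isElementByClass(root));      // false
        System.out.println(isElementByInstanceof(root)); // true
        System.out.println(isElementByNodeType(root));   // true
    }
}
```

The first line of output shows the vendor's implementation class, which is precisely why the getClass() comparison can never match the interface.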

Of course, you can also use the getNodeType( ) method on the org.w3c.dom.Node interface and perform integer comparisons, as shown in Example 5-16.

Example 5-16. Using integer comparisons

    NodeList children = rootNode.getChildNodes();

    // Figure out which node type you have and work with it.
    for (int i=0; i<children.getLength(); i++) {
        Node child = children.item(i);

        if (child.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element)child;
            // Do something with the element.
        } else if (child.getNodeType() == Node.TEXT_NODE) {
            Text text = (Text)child;
            // Do something with the text node.
        } else if (child.getNodeType() == Node.COMMENT_NODE) {
            Comment comment = (Comment)child;
            // Do something with the comment.
        } // etc . . .
    }

This turns out to be a more efficient way to do things. Comparison of numbers will always be a computer's strong suit. (You can also use a switch/case statement here to speed things up slightly.) Consider the case in which you have an implementation class, for example com.oreilly.dom.DeferredElementImpl. That particular class extends com.oreilly.dom.NamespacedElementImpl, which extends com.oreilly.dom.ElementImpl, which finally implements org.w3c.dom.Element. Using the instanceof approach would cause the Java Virtual Machine (JVM) to perform four class comparisons and chase an inheritance tree, all in lieu of comparing one small integer constant, such as 4, to another. It should be pretty obvious, then, that getClass() doesn't work, instanceof works but does more work per check, and getNodeType() is the proper way to do node type discovery.
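The switch/case form mentioned above might look like the following sketch. The class name NodeTypeSwitch and the countChildren() and parseRoot() helpers are hypothetical, chosen only so the result is easy to inspect; JAXP is used just to build a small tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeTypeSwitch {

    /** Returns { elements, text nodes, comments } among the direct children. */
    static int[] countChildren(Node parent) {
        int elements = 0;
        int texts = 0;
        int comments = 0;

        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // getNodeType() returns a short, so it can drive a switch directly.
            switch (children.item(i).getNodeType()) {
                case Node.ELEMENT_NODE:
                    elements++;
                    break;
                case Node.TEXT_NODE:
                    texts++;
                    break;
                case Node.COMMENT_NODE:
                    comments++;
                    break;
                default:
                    // Other node types are ignored in this sketch.
                    break;
            }
        }
        return new int[] { elements, texts, comments };
    }

    /** Parses a small XML string (via JAXP, for brevity) and returns its root. */
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        int[] counts = countChildren(parseRoot("<root><a/>text<!-- note --><b/></root>"));
        System.out.println(counts[0] + " elements, " + counts[1]
                + " text nodes, " + counts[2] + " comments");
    }
}
```

Because the node type is read once per child and dispatched on a constant, each branch can cast to the matching interface without any further checks.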
