Parsing Documents with a DOM Parser | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Unlike SAX, DOM does not have a class or interface that represents the XML parser. Each parser vendor provides its own unique class.

In Xerces, it's org.apache.xerces.parsers.DOMParser.
In Crimson, it's org.apache.crimson.jaxp.DocumentBuilderImpl.
In Ælfred, it's an inner class, gnu.xml.dom.JAXPFactory$JAXPBuilder.
In Oracle, it's oracle.xml.parser.v2.DOMParser.
In other implementations, it will be something else.

Furthermore, because these classes do not share a common interface or superclass, the methods they use to parse documents vary too. For example, in Xerces the two methods that read XML documents have these signatures:

public void parse(InputSource source) throws SAXException, IOException
public void parse(String systemID) throws SAXException, IOException

To get the Document object from the parser, you first call one of the parse methods and then call the getDocument() method.

public Document getDocument()

In this example, if parser is a Xerces DOMParser object, then these lines of code load the DOM Core 2.0 specification into a DOM Document object named spec:

parser.parse("http://www.w3.org/TR/DOM-Level-2-Core");
Document spec = parser.getDocument();

In Crimson's parser class, by contrast, the parse() method returns a Document object directly, so that no separate getDocument() method is needed. For example,

Document spec = parser.parse("http://www.w3.org/TR/DOM-Level-2-Core");

Furthermore, the Crimson parse() method is overloaded five ways instead of two:

public Document parse(InputSource source) throws SAXException, IOException
public Document parse(String uri) throws SAXException, IOException
public Document parse(File file) throws SAXException, IOException
public Document parse(InputStream in) throws SAXException, IOException
public Document parse(InputStream in, String systemID) throws SAXException, IOException

Example 9.3 is a simple program that uses Xerces to check documents for well-formedness. You can see that it depends directly on the org.apache.xerces.parsers.DOMParser class.

Example 9.3 A Program That Uses Xerces to Check Documents for Well-Formedness

import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.SAXException;
import java.io.IOException;
public class XercesChecker {
  public static void main(String[] args) {
    if (args.length <= 0) {
      System.out.println("Usage: java XercesChecker URL");
      return;
    }
    String document = args[0];
    DOMParser parser = new DOMParser();
    try {
      parser.parse(document);
      System.out.println(document + " is well-formed.");
    }
    catch (SAXException e) {
      System.out.println(document + " is not well-formed.");
    }
    catch (IOException e) {
      System.out.println("Due to an IOException, the parser could not check "
       + document);
    }
  }
}

It's not hard to port XercesChecker to a different parser such as the Oracle XML Parser for Java, but you do need to change the source code as shown in Example 9.4, and recompile.

Example 9.4 A Program That Uses the Oracle XML Parser to Check Documents for Well-Formedness

import oracle.xml.parser.v2.*;
import org.xml.sax.SAXException;
import java.io.IOException;
public class OracleChecker {
  public static void main(String[] args) {
    if (args.length <= 0) {
      System.out.println("Usage: java OracleChecker URL");
      return;
    }
    String document = args[0];
    DOMParser parser = new DOMParser();
    try {
      parser.parse(document);
      System.out.println(document + " is well-formed.");
    }
    catch (XMLParseException e) {
      System.out.println(document + " is not well-formed.");
      System.out.println(e);
    }
    catch (SAXException e) {
      System.out.println(document + " could not be parsed.");
    }
    catch (IOException e) {
      System.out.println("Due to an IOException, the parser could not check "
       + document);
    }
  }
}

Other parsers have slightly different methods still. What all of these have in common is that they read an XML document from a source of text, most commonly a file or a stream, and provide an org.w3c.dom.Document object. Once you have a reference to this Document object, you can work with it using only the standard methods of the DOM interfaces. There's no further need to use parser-specific classes.
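To make that last point concrete, here is a minimal sketch of code that touches nothing but the standard org.w3c.dom interfaces once it has a Document. The ElementCounter class and its method names are hypothetical, invented for this illustration. The Document is obtained through JAXP (discussed next) only because that parser ships with the JDK; a Document from any other parser should serve equally well.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of any parser's API.
class ElementCounter {

  // Counts element nodes at or below the given node, using only
  // methods declared by the standard org.w3c.dom.Node interface.
  static int countElements(Node node) {
    int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      count += countElements(child);
    }
    return count;
  }

  // Convenience wrapper for the demo. The way the Document is obtained
  // is an assumption; only countElements() matters here.
  static int countElementsIn(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return countElements(doc);
    }
    catch (Exception e) {
      throw new IllegalArgumentException("Could not parse the input", e);
    }
  }

  public static void main(String[] args) {
    // The sample document contains the elements a, b, b, and c.
    System.out.println(countElementsIn("<a><b/><b><c/></b></a>"));
  }
}
```

Because countElements() takes a plain Node, it does not need to change if the parser that produced the tree is swapped out.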

JAXP DocumentBuilder and DocumentBuilderFactory

The lack of a standard means of parsing an XML document is one of the holes that JAXP fills. If your parser implements JAXP, then instead of using the parser-specific classes, you can use the javax.xml.parsers.DocumentBuilderFactory and javax.xml.parsers.DocumentBuilder classes to parse the documents. The basic approach is as follows:

Use the static DocumentBuilderFactory.newInstance() factory method to return a DocumentBuilderFactory object.
Use the newDocumentBuilder() method of this DocumentBuilderFactory object to return a parser-specific instance of the abstract DocumentBuilder class.
Use one of the five parse() methods of DocumentBuilder to read the XML document and return an org.w3c.dom.Document object.

Example 9.5 demonstrates with a simple program that uses JAXP to check documents for well-formedness.

Example 9.5 A Program That Uses JAXP to Check Documents for Well-Formedness

import javax.xml.parsers.*; // JAXP
import org.xml.sax.SAXException;
import java.io.IOException;
public class JAXPChecker {
  public static void main(String[] args) {
    if (args.length <= 0) {
      System.out.println("Usage: java JAXPChecker URL");
      return;
    }
    String document = args[0];
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();
      parser.parse(document);
      System.out.println(document + " is well-formed.");
    }
    catch (SAXException e) {
      System.out.println(document + " is not well-formed.");
    }
    catch (IOException e) {
      System.out.println("Due to an IOException, the parser could not check "
       + document);
    }
    catch (FactoryConfigurationError e) {
      // JAXP suffers from excessive brain-damage caused by
      // intellectual in-breeding at Sun. (Basically the Sun
      // engineers spend way too much time talking to each other
      // and not nearly enough time talking to people outside
      // Sun.) Fortunately, you can happily ignore most of the
      // JAXP brain damage and not be any the poorer for it.
      // This, however, is one of the few problems you can't
      // avoid if you're going to use JAXP at all.
      // DocumentBuilderFactory.newInstance() should throw a
      // ClassNotFoundException if it can't locate the factory
      // class. However, what it does throw is an Error,
      // specifically a FactoryConfigurationError. Very few
      // programs are prepared to respond to errors as opposed
      // to exceptions. You should catch this error in your
      // JAXP programs as quickly as possible even though the
      // compiler won't require you to, and you should
      // never rethrow it or otherwise let it escape from the
      // method that produced it.
      System.out.println("Could not locate a factory class");
    }
    catch (ParserConfigurationException e) {
      System.out.println("Could not locate a JAXP parser");
    }
  }
}

For example, here's the output produced when I ran this program across this chapter's DocBook source code:

D:\books\XMLJAVA> java JAXPChecker file:///D:/books/xmljava/dom.xml
file:///D:/books/xmljava/dom.xml is well-formed.
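JAXPChecker uses the parse() overload that takes a URL string. When the document is already in memory, the overload that takes an InputSource is often more convenient. The following is a minimal sketch under that assumption; the StringChecker class and its isWellFormed() method are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; checks XML text held in a String.
class StringChecker {

  static boolean isWellFormed(String xml) {
    try {
      DocumentBuilder parser
       = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // Wrap the text in an InputSource so that no file or URL is needed
      parser.parse(new InputSource(new StringReader(xml)));
      return true;
    }
    catch (SAXException e) {
      // The parser reported a well-formedness error
      return false;
    }
    catch (IOException | ParserConfigurationException e) {
      throw new IllegalStateException("Could not run the check", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("<greeting>Hello</greeting>"));
    System.out.println(isWellFormed("<greeting>Hello</salutation>"));
  }
}
```

Some parsers also print a diagnostic to System.err when no ErrorHandler has been registered, so the second call may produce extra output in addition to the boolean result.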

How JAXP Chooses Parsers

You may be wondering which parser this program actually uses. JAXP, after all, is reasonably parser independent. The answer depends on which parsers are installed in your class path and how certain system properties are set. The default is to use the class named by the javax.xml.parsers.DocumentBuilderFactory system property. For example, if you want to make sure that Xerces is used to parse documents, then you would run JAXPChecker as follows:

D:\books\XMLJAVA> java -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl JAXPChecker file:///D:/books/xmljava/dom.xml
file:///D:/books/xmljava/dom.xml is well-formed.

If the javax.xml.parsers.DocumentBuilderFactory property is not set, then JAXP looks in the lib/jaxp.properties properties file in the JRE directory to determine a default value for the javax.xml.parsers.DocumentBuilderFactory system property. If you want to use a certain DOM parser consistently, for instance gnu.xml.dom.JAXPFactory, then place the following line in that file:

 javax.xml.parsers.DocumentBuilderFactory=gnu.xml.dom.JAXPFactory

If this fails to locate a parser, then JAXP next looks for a META-INF/services/javax.xml.parsers.DocumentBuilderFactory file in all JAR files available to the runtime to find the name of the concrete DocumentBuilderFactory subclass.

Finally, if that fails, then DocumentBuilderFactory.newInstance() returns a default class, generally the parser from the vendor that also provided the JAXP classes. For example, the JDK JAXP classes pick org.apache.crimson.jaxp.DocumentBuilderFactoryImpl by default, but the Ælfred JAXP classes pick gnu.xml.dom.JAXPFactory instead.
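If you are unsure which step of this lookup won on a particular installation, the simplest check is to ask the factory for its own class name. The following is a minimal sketch; the FactoryInspector class is a hypothetical name, and the class name it prints depends entirely on your JDK and class path.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic; reports what the JAXP lookup selected.
class FactoryInspector {

  static final String PROPERTY = "javax.xml.parsers.DocumentBuilderFactory";

  // Returns the name of the concrete factory class JAXP selected.
  static String chosenFactoryClass() {
    return DocumentBuilderFactory.newInstance().getClass().getName();
  }

  public static void main(String[] args) {
    // A null value here means the system property was not set, so the
    // choice was made by one of the later steps in the lookup.
    System.out.println("System property: " + System.getProperty(PROPERTY));
    System.out.println("Factory in use:  " + chosenFactoryClass());
  }
}
```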

Configuring DocumentBuilderFactory

The DocumentBuilderFactory has a number of options that allow you to determine exactly how the parsers it creates behave. Most of the setter methods take a boolean that turns the feature on if true or off if false. However, a couple of the features are defined as confusing double negatives, so read carefully.

Coalescing

The following two methods determine whether or not CDATA sections are merged with text nodes. If the coalescing feature is true, then the result tree will not contain any CDATA section nodes, even if the parsed XML document does contain CDATA sections.

public boolean isCoalescing()
public void setCoalescing(boolean coalescing)

The default is false, but in most situations you should set this to true, especially if you're just reading the document and are not going to write it back out again. CDATA sections should not be treated differently from any other text. Whether or not certain text is written in a CDATA section should be purely a matter of syntactic sugar for human convenience, not anything that has an effect on the data model.
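The effect is easy to observe by parsing the same text twice and counting CDATA section nodes. This is a minimal sketch; the CoalescingDemo class, its method, and the sample document are all invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo of the coalescing feature.
class CoalescingDemo {

  static final String XML = "<a>one <![CDATA[<two>]]> three</a>";

  // Counts CDATA section nodes among the children of the root element.
  static int cdataSectionCount(boolean coalescing) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setCoalescing(coalescing);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(XML)));
      NodeList children = doc.getDocumentElement().getChildNodes();
      int count = 0;
      for (int i = 0; i < children.getLength(); i++) {
        if (children.item(i).getNodeType() == Node.CDATA_SECTION_NODE) {
          count++;
        }
      }
      return count;
    }
    catch (Exception e) {
      throw new IllegalStateException("Could not parse the sample", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("coalescing off: " + cdataSectionCount(false));
    System.out.println("coalescing on:  " + cdataSectionCount(true));
  }
}
```

With coalescing off, the one CDATA section in the sample shows up as its own node; with coalescing on, none should remain.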

Expand Entity References

The following two methods determine whether the parsers that this factory produces will expand entity references:

public boolean isExpandEntityReferences()
public void setExpandEntityReferences(boolean expandEntityReferences)

The default is true. If a parser is validating, then it will expand entity references, even if this feature is set to false. That is, the validation feature overrides the expand-entity-references feature. The five predefined entity references &amp;, &lt;, &gt;, &quot;, and &apos; will always be expanded regardless of the value of this property.
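One way to see what a given parser does with this setting is to count the entity reference nodes it leaves in the tree. The sketch below assumes a document with an internal DTD subset that declares one general entity; the EntityExpansionDemo class and the sample text are hypothetical. What you see with the feature turned off varies from parser to parser, so the program simply prints both results rather than promising a particular one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo of the expand-entity-references feature.
class EntityExpansionDemo {

  static final String XML
   = "<!DOCTYPE a [<!ENTITY greeting \"hello\">]>"
   + "<a>&greeting; &amp; goodbye</a>";

  private static Element parseRoot(boolean expand) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setExpandEntityReferences(expand);
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(XML)))
          .getDocumentElement();
    }
    catch (Exception e) {
      throw new IllegalStateException("Could not parse the sample", e);
    }
  }

  // Counts entity reference nodes among the children of the root element.
  static int entityReferenceCount(boolean expand) {
    NodeList children = parseRoot(expand).getChildNodes();
    int count = 0;
    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i).getNodeType() == Node.ENTITY_REFERENCE_NODE) {
        count++;
      }
    }
    return count;
  }

  // Returns the text content of the root element.
  static String rootText(boolean expand) {
    return parseRoot(expand).getTextContent();
  }

  public static void main(String[] args) {
    System.out.println("expanding:     " + entityReferenceCount(true)
     + " entity reference node(s), text = " + rootText(true));
    System.out.println("not expanding: " + entityReferenceCount(false)
     + " entity reference node(s)");
  }
}
```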

Ignore Comments

The following two methods determine whether the parsers that this factory produces will generate comment nodes for comments seen in the input document. The default, false, means that comment nodes will be produced. (Watch out for the double negative here. False means include comments, and true means don't include comments. This confused me initially, and I was getting my poison pen all ready to write about the brain damage of throwing away comments even though the specification required them to be included, when I realized that the method was in fact behaving as it should.)

public boolean isIgnoringComments()
public void setIgnoringComments(boolean ignoringComments)
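If the double negative still seems slippery, a quick experiment settles it. The following minimal sketch counts the comment nodes under the root element with the feature off and then on; the CommentsDemo class and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo of the ignoring-comments feature.
class CommentsDemo {

  static final String XML = "<a><!-- a note -->text</a>";

  // Counts comment nodes among the children of the root element.
  static int commentCount(boolean ignoringComments) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setIgnoringComments(ignoringComments);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(XML)));
      NodeList children = doc.getDocumentElement().getChildNodes();
      int count = 0;
      for (int i = 0; i < children.getLength(); i++) {
        if (children.item(i).getNodeType() == Node.COMMENT_NODE) {
          count++;
        }
      }
      return count;
    }
    catch (Exception e) {
      throw new IllegalStateException("Could not parse the sample", e);
    }
  }

  public static void main(String[] args) {
    // false keeps the comment; true drops it
    System.out.println("ignoringComments=false: " + commentCount(false));
    System.out.println("ignoringComments=true:  " + commentCount(true));
  }
}
```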

Ignore Element-Content White Space

The following two methods determine whether the parsers that this factory produces will generate text nodes for so-called "ignorable white space"; that is, white space that occurs between tags where the DTD specifies that parsed character data cannot appear.

public boolean isIgnoringElementContentWhitespace()
public void setIgnoringElementContentWhitespace(boolean ignoreElementContentWhitespace)

The default is false; that is, include text nodes for ignorable white space. Setting this to true might well be useful in record-like documents. For this property to make a difference, however, the documents must have a DTD and should be valid or very nearly so. Otherwise the parser won't be able to tell which white space is ignorable and which isn't.
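Here is a minimal sketch of the difference, using a small record-like document whose internal DTD subset says that the root element may contain only a child element. The WhitespaceDemo class and the sample document are hypothetical. The sketch also turns validation on, an assumption made because the JAXP documentation describes this setting as relying on the parser being in validating mode; the sample document is valid, so no validity errors should be reported.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo of the ignoring-element-content-whitespace feature.
class WhitespaceDemo {

  // The DTD gives a element-only content, so the line breaks and
  // indentation around b are ignorable white space.
  static final String XML
   = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]>\n"
   + "<a>\n  <b>x</b>\n</a>";

  // Returns the number of child nodes of the root element.
  static int rootChildCount(boolean ignoreWhitespace) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setValidating(true); // assumption: see the note above
      factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(XML)));
      return doc.getDocumentElement().getChildNodes().getLength();
    }
    catch (Exception e) {
      throw new IllegalStateException("Could not parse the sample", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("keeping white space:  " + rootChildCount(false));
    System.out.println("ignoring white space: " + rootChildCount(true));
  }
}
```

With the white space kept, the root has a text node on either side of the b element; with it ignored, only the b element remains.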

Namespace Aware

The following two methods determine whether the parsers that this factory produces will be namespace aware. A namespace-aware parser will set the prefix and namespace URI properties of element and attribute nodes that are in a namespace. A non-namespace-aware parser won't.

public boolean isNamespaceAware()
public void setNamespaceAware(boolean namespaceAware)

The default is false, which is truly the wrong choice. You should always set this to true. For example,

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
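To see why the default is a poor choice, compare what the root element reports for its namespace URI in each mode. This is a minimal sketch; the NamespaceDemo class, the sample document, and the urn:example namespace name are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo of the namespace-aware feature.
class NamespaceDemo {

  static final String XML = "<p:a xmlns:p=\"urn:example\"/>";

  // Returns the namespace URI the DOM reports for the root element.
  static String rootNamespace(boolean namespaceAware) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(namespaceAware);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(XML)));
      return doc.getDocumentElement().getNamespaceURI();
    }
    catch (Exception e) {
      throw new IllegalStateException("Could not parse the sample", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("namespace aware:     " + rootNamespace(true));
    System.out.println("not namespace aware: " + rootNamespace(false));
  }
}
```

Only the namespace-aware parser reports urn:example; the other reports null even though the document declares the namespace.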

Validating

These methods determine whether or not the parsers that this factory produces will validate the document against its DTD.

public boolean isValidating()
public void setValidating(boolean validating)

The default is false; do not validate. If you want to validate your documents, set this property to true. You'll also need to register a SAX ErrorHandler with the DocumentBuilder using its setErrorHandler() method to receive notice of validity errors. Example 9.6 demonstrates with a program that uses JAXP to validate a document named on the command line.

Example 9.6 A Program That Uses JAXP to Validate Documents

import javax.xml.parsers.*; // JAXP
import org.xml.sax.*;
import java.io.IOException;
public class JAXPValidator {
  public static void main(String[] args) {
    if (args.length <= 0) {
      System.out.println("Usage: java JAXPValidator URL");
      return;
    }
    String document = args[0];
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      // Always turn on namespace awareness
      factory.setNamespaceAware(true);
      // Turn on validation
      factory.setValidating(true);
      DocumentBuilder parser = factory.newDocumentBuilder();
      // SAXValidator was developed in Chapter 7. It is declared by
      // its own type rather than as ErrorHandler because the
      // ErrorHandler interface has no isValid() method.
      SAXValidator handler = new SAXValidator();
      parser.setErrorHandler(handler);
      parser.parse(document);
      if (handler.isValid()) {
        System.out.println(document + " is valid.");
      }
      else {
        // If the document isn't well-formed, an exception has
        // already been thrown and this has been skipped.
        System.out.println(document + " is well-formed.");
      }
    }
    catch (SAXException e) {
      System.out.println(document + " is not well-formed.");
    }
    catch (IOException e) {
      System.out.println("Due to an IOException, the parser could not check "
       + document);
    }
    catch (FactoryConfigurationError e) {
      System.out.println("Could not locate a factory class");
    }
    catch (ParserConfigurationException e) {
      System.out.println("Could not locate a JAXP parser");
    }
  }
}
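If you don't have the Chapter 7 SAXValidator class at hand, any ErrorHandler that remembers whether error() was called will do. The following is a minimal stand-in, not the Chapter 7 class itself; the ValidityRecorder name, its validate() helper, and the sample documents in main() are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical stand-in for an ErrorHandler that tracks validity.
class ValidityRecorder implements ErrorHandler {

  private boolean valid = true;

  public void warning(SAXParseException e) {
    // Warnings do not affect validity
  }

  public void error(SAXParseException e) {
    // Validity errors are reported here; remember that one occurred
    valid = false;
  }

  public void fatalError(SAXParseException e) throws SAXParseException {
    // Well-formedness errors are fatal, so let them propagate
    throw e;
  }

  boolean isValid() {
    return valid;
  }

  // Validates XML text against the DTD named in its own DOCTYPE.
  static boolean validate(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      DocumentBuilder parser = factory.newDocumentBuilder();
      ValidityRecorder handler = new ValidityRecorder();
      parser.setErrorHandler(handler);
      parser.parse(new InputSource(new StringReader(xml)));
      return handler.isValid();
    }
    catch (Exception e) {
      throw new IllegalArgumentException("Could not parse the input", e);
    }
  }

  public static void main(String[] args) {
    String dtd = "<!DOCTYPE a [<!ELEMENT a EMPTY>]>";
    System.out.println(validate(dtd + "<a/>"));        // a is empty, as declared
    System.out.println(validate(dtd + "<a>text</a>")); // a has content it may not have
  }
}
```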

Parser-Specific Attributes

Many JAXP-aware parsers support various custom features. For example, Xerces has an http://apache.org/xml/features/dom/create-entity-ref-nodes feature that lets you choose whether or not to include entity reference nodes in the DOM tree. This is not the same as deciding whether or not to expand entity references; that setting determines whether the entity reference nodes that are placed in the tree have children representing their replacement text.

JAXP allows you to set and get these custom features as objects of the appropriate type using these two methods:

public Object getAttribute(String name) throws IllegalArgumentException
public void setAttribute(String name, Object value) throws IllegalArgumentException

For example, suppose you're using Xerces and you don't want to include entity reference nodes. Because they're included by default, you would need to set http://apache.org/xml/features/dom/create-entity-ref-nodes to false. You would use setAttribute() on the DocumentBuilderFactory, like this:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setAttribute(
  "http://apache.org/xml/features/dom/create-entity-ref-nodes",
  Boolean.FALSE);

The naming conventions for both attribute names and values depend on the underlying parser. Xerces uses URL strings like SAX feature names. Other parsers may do something different. JAXP 1.2 will add a couple of standard attributes related to schema validation.
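Because attribute names are parser specific, code that is meant to run on more than one parser should be ready for setAttribute() to throw an IllegalArgumentException. The following minimal sketch wraps the call so that an unrecognized attribute is reported rather than fatal. The AttributeProbe class and its trySetAttribute() method are hypothetical names, and whether the Xerces feature name is accepted depends on the parser behind your factory.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper for setting parser-specific attributes safely.
class AttributeProbe {

  // Returns true if the factory accepted the attribute, and false if it
  // rejected it with an IllegalArgumentException.
  static boolean trySetAttribute(DocumentBuilderFactory factory,
   String name, Object value) {
    try {
      factory.setAttribute(name, value);
      return true;
    }
    catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    String name
     = "http://apache.org/xml/features/dom/create-entity-ref-nodes";
    if (trySetAttribute(factory, name, Boolean.FALSE)) {
      System.out.println("Entity reference nodes turned off");
    }
    else {
      System.out.println("This parser does not recognize " + name);
    }
  }
}
```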

DOM3 Load and Save

JAXP only works for Java, and it is a Sun proprietary standard. Consequently, the W3C DOM working group is preparing an alternative cross-vendor means of parsing an XML document with a DOM parser. This will be published as part of DOM3. DOM3 is not close to a finished recommendation at the time of this writing and is not yet implemented by any parsers, but I can give you an idea of what the interface is likely to look like.

Parsing a document with DOM3 will require four steps:

Load a DOMImplementation object by passing the feature string "LS-Load 3.0" to the DOMImplementationRegistry.getDOMImplementation() factory method. (This class is also new in DOM3.)
Cast this DOMImplementation object to DOMImplementationLS , the subinterface that provides the extra methods you need.
Call the implementation's createDOMBuilder() method to create a new DOMBuilder object. This is the new DOM3 class that represents the parser. The first argument to createDOMBuilder() is a named constant that specifies whether the document is parsed synchronously or asynchronously. The second argument is a URL identifying the type of schema to be used during the parse, "http://www.w3.org/2001/XMLSchema" for W3C XML Schemas, "http://www.w3.org/TR/REC-xml" for DTDs. You can pass null to ignore all schemas.
Pass the document's URL to the builder object's parseURI() method to read the document and return a Document object.

Example 9.7 demonstrates with a simple program that uses DOM3 to check documents for well-formedness.

Example 9.7 A Program That Uses DOM3 to Check Documents for Well-Formedness

import org.w3c.dom.*;
import org.w3c.dom.ls.*;
import java.io.IOException; // needed for the catch block below
public class DOM3Checker {
  public static void main(String[] args) {
    if (args.length <= 0) {
      System.out.println("Usage: java DOM3Checker URL");
      return;
    }
    String document = args[0];
    try {
      DOMImplementationLS impl = (DOMImplementationLS)
       DOMImplementationRegistry
       .getDOMImplementation("LS-Load 3.0");
      DOMBuilder parser = impl.createDOMBuilder(
       DOMImplementationLS.MODE_SYNCHRONOUS,
       "http://www.w3.org/TR/REC-xml");
    // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    // Use DTDs when parsing
      Document doc = parser.parseURI(document);
      System.out.println(document + " is well-formed.");
    }
    catch (NullPointerException e) {
      System.err.println("The current DOM implementation does"
       + " not support DOM Level 3 Load and Save");
    }
    catch (DOMException e) {
      System.err.println(document + " is not well-formed");
    }
    catch (IOException e) {
      System.out.println("Due to an IOException, the parser could not check "
       + document);
    }
    catch (Exception e) {
      // Probably a ClassNotFoundException,
      // InstantiationException, or IllegalAccessException
      // thrown by DOMImplementationRegistry.getDOMImplementation
      System.out.println("Probable CLASSPATH problem.");
      e.printStackTrace();
    }
  }
}

For the time being, JAXP's DocumentBuilderFactory is the obvious choice because it works today and is supported by almost all DOM parsers written in Java. Longer term, DOM3 will provide a number of important capabilities JAXP does not, including parse progress notification and document filtering. Because these APIs are far from ready for prime time just yet, for the rest of this book I'm mostly going to use JAXP without further comment.
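The draft interface names shown in Example 9.7 changed before DOM Level 3 Load and Save was finalized. In the org.w3c.dom.ls and org.w3c.dom.bootstrap packages that ship with current JDKs, the parser is an LSParser, the registry is obtained with DOMImplementationRegistry.newInstance(), and the feature string is "LS". The following is a minimal sketch against those final names, not a port of Example 9.7; the LSChecker class and its isWellFormed() method are hypothetical.

```java
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSException;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical well-formedness checker built on DOM Level 3 Load and Save.
class LSChecker {

  static boolean isWellFormed(String xml) {
    try {
      DOMImplementationRegistry registry
       = DOMImplementationRegistry.newInstance();
      DOMImplementationLS impl
       = (DOMImplementationLS) registry.getDOMImplementation("LS");
      // A null schema type means no schema is used during the parse
      LSParser parser
       = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
      LSInput input = impl.createLSInput();
      input.setStringData(xml);
      parser.parse(input);
      return true;
    }
    catch (LSException e) {
      // The parser could not build a document from the input
      return false;
    }
    catch (ReflectiveOperationException e) {
      // newInstance() could not load a DOM implementation source
      throw new IllegalStateException(
       "No DOM Load and Save implementation found", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("<a><b/></a>"));
    System.out.println(isWellFormed("<a><b></a>"));
  }
}
```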