DOM Parsers for Java | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

DOM is defined almost completely in terms of interfaces rather than classes. Different parsers provide their own custom implementations of these standard interfaces. This offers a great deal of flexibility. Generally you do not install the DOM interfaces on their own. Instead they come bundled with a parser distribution that provides the detailed implementation classes. DOM isn't quite as broadly supported as SAX, but most of the major Java parsers provide it, including Crimson, Xerces, XML for Java, the Oracle XML Parser for Java, and GNU JAXP.

DOM is not complete in itself. Almost all significant DOM programs need to use some parser-specific classes. DOM programs are not too difficult to port from one parser to another, but a recompile is normally required. You can't just change a system property to switch from one parser to another, as you can with SAX. In particular, DOM2 does not specify how one parses a document, creates a new document, or serializes a document into a file or onto a stream. These important functions are all performed by parser-specific classes.

JAXP, the Java API for XML Processing, fills in a few of the holes in DOM by providing standard parser-independent means to parse existing documents, create new documents, and serialize in-memory DOM trees to XML files. Most current Java parsers that support DOM2 also support JAXP 1.1. JAXP is a standard part of Java 1.4. Although JAXP is not included in earlier versions of Java, it does work with Java 1.1 and later and is bundled with most parser class libraries. DOM3 promises to fill the same holes that JAXP fills (that is, parsing, serializing, and bootstrapping), but it is not yet finished and not yet widely supported by any parser.
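As a rough illustration of those three operations, the following sketch parses a document, creates a new one, and serializes both without naming any parser-specific class. The class name `JaxpRoundTrip` and its helper methods are invented for this example. Serialization here goes through a TrAX identity transform, which is one common way to write a DOM tree with JAXP, and the exact output formatting may vary with the installed implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of JAXP or of any parser distribution.
public class JaxpRoundTrip {

    // Locates whichever JAXP-compliant parser is installed.
    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder();
    }

    // Parses XML text into a DOM Document.
    static Document parse(String xml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Creates a new, otherwise empty document with the given root element.
    static Document create(String rootName) throws Exception {
        Document doc = newBuilder().newDocument();
        Element root = doc.createElement(rootName);
        doc.appendChild(root);
        return doc;
    }

    // Serializes a DOM tree to a string using an identity transform.
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document parsed = parse("<greeting>Hello</greeting>");
        System.out.println(serialize(parsed));

        Document fresh = create("inventory");
        fresh.getDocumentElement().appendChild(fresh.createTextNode("empty"));
        System.out.println(serialize(fresh));
    }
}
```

Because only `javax.xml` and `org.w3c.dom` types appear, the same source should compile unchanged against any parser that supports JAXP.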

Because DOM depends so heavily on parser classes, its performance characteristics vary widely from one parser to the next. Speed is something of a concern, but memory consumption is a much bigger issue for most applications. All DOM implementations I've seen use more space for the in-memory DOM tree than the actual file on the disk occupies. Generally the in-memory DOM trees range from three to ten times as large as the actual XML text. Some parsers, including Xerces, offer a "lazy DOM" that leaves most of the document on the disk and reads into memory only those parts of the document that the client actually requests.
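In Xerces, lazy node creation is controlled by the documented feature `http://apache.org/xml/features/dom/defer-node-expansion`. The sketch below tries to set that feature through JAXP and falls back gracefully if the installed parser does not recognize it. The class name `DeferredDomToggle` and the helper `trySetDeferred` are invented for this example, and whether the feature is accepted depends entirely on which parser JAXP finds.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class for toggling a Xerces-specific feature via JAXP.
public class DeferredDomToggle {

    // Xerces feature URI; other parsers are free to reject it.
    static final String DEFER =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    // Returns true if the parser accepted the setting, false if it
    // does not recognize the feature.
    static boolean trySetDeferred(DocumentBuilderFactory factory, boolean deferred) {
        try {
            factory.setFeature(DEFER, deferred);
            return true;
        } catch (ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

        // Ask for a fully expanded tree; a Xerces-based parser should accept this.
        boolean accepted = trySetDeferred(factory, false);
        System.out.println("defer-node-expansion recognized: " + accepted);

        // Parsing works the same way whether or not the feature was accepted.
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader("<root><child/></root>")));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

This is exactly the kind of nonportable switch the chapter warns about: the URI is meaningful only to Xerces-derived parsers, so the code has to tolerate a refusal.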

Another distinguishing factor between different DOM implementations is the extra features the parser provides. Most parsers provide methods to parse XML documents and serialize DOM trees to XML. Other useful features include schema validation, database access, XInclude, XSLT, XPath, support for different character sets, and application-specific DOMs like the MathML, SVG, and WML DOMs.

For example, the Oracle and Xerces parsers provide schema validation. Ælfred and Crimson don't. Ælfred has partial support for XInclude. The other three don't. The Oracle XML parser can produce a DOM Document object from a SQL query against a relational database or a JDBC ResultSet object. The other three can't. The Oracle XML parser can decode the WAP binary XML format. The other three can't. Xerces has specialized DOMs for HTML and WML documents. The other three don't. These are all nonstandard features; but if they're useful to you, that would be a good reason to choose one parser over another. Table 9.2 summarizes parser support for various useful features.

Measuring DOM Size

To test the memory usage of various implementations, I wrote a simple program that loaded the second edition of the XML 1.0 specification into a DOM Document object. The specification's text format is 197K (not including the DTD, which adds another 56K but isn't really modeled by DOM at all). Following is the approximate amount of memory that various parsers used to build Document objects from this file:

Xerces-J 2.0.1: 1489K
Crimson 1.1.3 (JDK 1.4 default): 1230K
Oracle XML Parser for Java 9.2.0.2.0: 2500K

I used a couple of different techniques to measure the memory used. In one case, I used OptimizeIt and the Java Virtual Machine Profiling Interface (JVMPI) to check the heap size. I ran the program both with and without loading the document. I subtracted the total heap memory used without loading the document from the memory used when the document was loaded to get the numbers reported above. In the other test, I used the Runtime class to measure the total memory and the free memory before and after the Document was created. In both cases, I garbage collected before taking the final measurements. The results from the separate tests were within 15 percent of each other. I performed all tests in Sun's JDK 1.4.0 using Hotspot on Windows NT 4.0SP6.
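The Runtime-based technique can be sketched roughly as follows. This is not the program used for the numbers above: the class name `DomMemoryEstimate`, the generated sample document, and the helper methods are all assumptions made so that the example is self-contained. Note that `System.gc()` is only a request to the virtual machine, so the result is a rough estimate at best.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; a rough estimate, not a profiler.
public class DomMemoryEstimate {

    // Requests a garbage collection, then reports heap currently in use.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // a hint only; the VM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    // Builds a synthetic document standing in for a real XML file.
    static String sampleXml(int items) {
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < items; i++) {
            sb.append("<item id=\"").append(i).append("\">value ")
              .append(i).append("</item>");
        }
        return sb.append("</items>").toString();
    }

    // Parses XML text with whichever parser JAXP locates.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = sampleXml(20000);

        long before = usedHeap();
        Document doc = parse(xml);
        long after = usedHeap();

        System.out.println("Text size:   " + xml.length() / 1024 + "K characters");
        System.out.println("Heap growth: " + (after - before) / 1024 + "K (rough estimate)");

        // Using doc after the measurement keeps the tree reachable while it is measured.
        System.out.println("Items: " + doc.getElementsByTagName("item").getLength());
    }
}
```

A parser that builds its tree lazily may report less heap growth here than it would after the whole tree has been traversed, which is one more reason to treat such figures as approximate.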

I don't claim these numbers to be exact, and I certainly don't think this one test document justifies any claims whatsoever about the relative efficiency of the different DOM implementations. The difference between Crimson and Xerces is well within my margin of error. A more serious test would have to look at how the different implementations scale with the size of the initial document, and perhaps graph the curves of memory size versus file size. For example, it's possible that each of these requires a minimum of 1024K per document, but grows relatively slowly after that point. I did run the same tests on a minimal document that contained a single empty element. The results ranged from 3K to 131K for this document. However, these numbers were extremely sensitive to exactly when and how garbage was collected. I wouldn't claim the results are accurate to better than ±300K. However, I do think that together these tests demonstrate just how inefficient DOM is.

Table 9.2. DOM Parser Features

Feature	Xerces	Ælfred	Oracle	Crimson
DTDs	X	X	X	X
Schemas	X		X
Namespaces	X	X	X	X
Lazy DOM	X
HTML DOM	X
Views
Stylesheets
CSS
CSS2
Events	X	X	X
UI events		X
Mouse events
Mutation events	X	X
HTML events		X
Traversal	X	Partial	X
Range			X
XSLT/XPath	Via Xalan-J		X
XInclude		X
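Several rows of Table 9.2 correspond to DOM Level 2 modules, and a program can ask the installed implementation about those at runtime through the standard `DOMImplementation.hasFeature()` method. The sketch below does so for the DOM2 module names. The class name `DomFeatureProbe` is invented for this example, and the answers it prints depend on whichever parser JAXP locates. Rows such as schemas, lazy DOM, XSLT, and XInclude are not DOM modules and cannot be probed this way.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;

// Hypothetical example class that reports which DOM2 modules are claimed.
public class DomFeatureProbe {

    // Feature names as defined by the DOM Level 2 specifications.
    static final String[] MODULES = {
        "Core", "XML", "HTML", "Views", "StyleSheets", "CSS", "CSS2",
        "Events", "UIEvents", "MouseEvents", "MutationEvents", "HTMLEvents",
        "Traversal", "Range"
    };

    // Obtains the DOMImplementation of the parser that JAXP finds.
    static DOMImplementation implementation() throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .getDOMImplementation();
    }

    public static void main(String[] args) throws Exception {
        DOMImplementation impl = implementation();
        for (String module : MODULES) {
            // An "X" mirrors the table's notation for a supported feature.
            String mark = impl.hasFeature(module, "2.0") ? "X" : "";
            System.out.println(module + "\t" + mark);
        }
    }
}
```

An implementation's own claim is not proof of complete support, so results like these are best read as a starting point for testing rather than a guarantee.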