Choosing between SAX and DOM | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

The single biggest factor in deciding whether to code your programs with SAX or with DOM is personal preference. SAX and DOM are very different APIs. Whereas SAX models the parser, DOM models the XML document. Most developers find the DOM approach more to their taste, at least initially. Its pull model (in which the client program extracts the information it wants from a document by invoking various methods on that document) is much more familiar than SAX's push model (in which the parser tells you what it reads when it reads it, whether you're ready for that information or not).

However, SAX's push model, unfamiliar as it is, can be much more efficient. SAX programs can be much faster than their DOM equivalents, and they almost always use far less memory. In particular, SAX works extremely well when documents are streamed, and the individual parts of each document can be processed in isolation from other parts. If complicated processes can be broken down into serial filters, then SAX is hard to beat. SAX lends itself to assembly-line-like automation wherein different stations perform small operations on just the parts of the document they have at hand right at that moment. By contrast, DOM is more like a factory in which each worker operates only on an entire car. Every time the worker receives a new car off the line, he or she must take the entire car apart to find the piece needed to work with, then do his or her job, then put the car back together again before moving it along to the next worker. This system is inefficient if there's more than one station. DOM lends itself to monolithic applications in which one program does everything. SAX works better when the program can be divided into small bits of independent work.
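The assembly line can be sketched with two small filters built on org.xml.sax.helpers.XMLFilterImpl. This is only an illustration under stated assumptions: the class names (SaxFilterChain, DraftRemover, UpperCaser, TextCollector) and the draft element are hypothetical, and the parser is obtained through JAXP's SAXParserFactory rather than any particular vendor class. Each station sees only the event in front of it and passes along whatever it chooses to keep.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SaxFilterChain {

    /** Station 1: drops every (hypothetical) draft element and its contents. */
    static class DraftRemover extends XMLFilterImpl {
        private int depth = 0; // greater than 0 while inside a draft element

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            if (depth > 0 || qName.equals("draft")) {
                depth++;
                return;
            }
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (depth > 0) {
                depth--;
                return;
            }
            super.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (depth == 0) super.characters(ch, start, length);
        }
    }

    /** Station 2: upper-cases whatever character data reaches it. */
    static class UpperCaser extends XMLFilterImpl {
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            char[] upper = new String(ch, start, length)
                    .toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    /** End of the line: collects the text that survived both filters. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static String process(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DraftRemover remover = new DraftRemover();
        remover.setParent(parser);      // parser -> remover
        UpperCaser upper = new UpperCaser();
        upper.setParent(remover);       // remover -> upper
        TextCollector collector = new TextCollector();
        upper.setContentHandler(collector);
        upper.parse(new InputSource(new StringReader(xml)));
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>keep</p><draft>drop <b>this</b></draft><p> me</p></doc>";
        System.out.println(process(xml)); // prints KEEP ME
    }
}
```

Neither filter knows about the other, and neither holds more than one event's worth of data, so further stations can be added without touching the existing ones.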

In particular, the following characteristics indicate that a program should probably use a streaming API such as SAX, XNI, or XMLPULL.

Documents will not fit into available memory. This is the only rule that really mandates one or the other. If your documents are too big for available memory, then you must use a streaming API such as SAX, painful though it may be. You really have no other choice.
You can process the document in small contiguous chunks of input. The entire document does not need to be available before you can do useful work. A slightly weaker variant of this is if the decisions you make depend only on preceding parts of the document, never on what comes later.
Processing can be divided up into a chain of successive operations.
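As a rough sketch of the second characteristic, the following handler adds up the contents of every amount element as the parser reports them. The AmountSummer class and the amount element name are hypothetical, chosen only for illustration. The point is that the handler keeps one small buffer and a running total, so memory use does not grow with the size of the document.

```java
import java.io.IOException;
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class AmountSummer extends DefaultHandler {

    private final StringBuilder buffer = new StringBuilder();
    private boolean inAmount = false;
    private BigDecimal total = BigDecimal.ZERO;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) {
        if (qName.equals("amount")) {
            inAmount = true;
            buffer.setLength(0); // start a fresh chunk
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several calls,
        // so accumulate until the end tag is seen.
        if (inAmount) buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (qName.equals("amount")) {
            try {
                total = total.add(new BigDecimal(buffer.toString().trim()));
            } catch (NumberFormatException e) {
                throw new SAXException("Bad amount: " + buffer, e);
            }
            inAmount = false;
        }
    }

    /** Parses the input once and returns the sum of all amount elements. */
    static BigDecimal sum(InputSource in)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        AmountSummer handler = new AmountSummer();
        parser.parse(in, handler);
        return handler.total;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders><order><amount>1.50</amount></order>"
                + "<order><amount>2.25</amount></order></orders>";
        System.out.println(sum(new InputSource(new StringReader(xml)))); // prints 3.75
    }
}
```

Each amount is handled and then forgotten. Nothing here depends on what comes later in the document, which is exactly what makes the streaming approach workable.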

On the other hand, if the problem matches this next set of characteristics, the program should probably use DOM or perhaps another of the tree-based APIs such as JDOM.

The program needs to access widely separated parts of the document at the same time. Even more so, it needs access to multiple documents at the same time.
The internal data structures are almost as complicated as the document itself.
The program must modify the document repeatedly.
The program must store the document for a significant amount of time through many method calls, not just process it once and forget it.
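By way of contrast, here is a small sketch of a task that suits a tree. The vocabulary (ref elements with a to attribute, note elements with an id attribute) and the NoteInliner class are invented for this illustration. Each reference in the body must be replaced with the text of a note that appears much later in the document, so the program needs both parts at once and must modify the tree as it goes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NoteInliner {

    /** Replaces each ref element with the text of the note it points to. */
    static Document inline(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Index the notes, wherever in the document they appear.
        Map<String, String> notes = new HashMap<>();
        NodeList noteList = doc.getElementsByTagName("note");
        for (int i = 0; i < noteList.getLength(); i++) {
            Element note = (Element) noteList.item(i);
            notes.put(note.getAttribute("id"), note.getTextContent());
        }

        // The NodeList is live, so walk it backward while replacing nodes.
        NodeList refs = doc.getElementsByTagName("ref");
        for (int i = refs.getLength() - 1; i >= 0; i--) {
            Element ref = (Element) refs.item(i);
            String text = notes.get(ref.getAttribute("to"));
            if (text != null) {
                ref.getParentNode().replaceChild(
                        doc.createTextNode(" (" + text + ")"), ref);
            }
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>See the spec<ref to=\"n1\"/>.</p>"
                + "<notes><note id=\"n1\">XML 1.0</note></notes></doc>";
        Document doc = inline(xml);
        // prints See the spec (XML 1.0).
        System.out.println(
                doc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

A SAX program could do the same job only by buffering the references until the notes arrived, at which point it would be building a tree of its own.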

On occasion, it's possible to use both SAX and DOM. In particular, you can parse the document using a SAX XMLReader attached to a series of SAX filters, then use the final output from that process to construct a DOM Document. Working in reverse, you can traverse a DOM tree while firing off SAX events to a SAX ContentHandler.

The approach is the same one that Example 9.14 used earlier to serialize a DOM Document onto a stream. You can use JAXP to perform an identity transform from a source to a result. JAXP supports SAX, DOM, and streams as sources and results. For example, the following code fragment reads an XML document from the InputStream in and parses it with the SAX XMLReader named saxParser. Then it transforms this input into the equivalent DOMResult, from which the DOM Document is extracted.

    XMLReader saxParser = XMLReaderFactory.createXMLReader();
    // SAXSource takes an InputSource, so the InputStream is wrapped in one
    Source input = new SAXSource(saxParser, new InputSource(in));
    DOMResult output = new DOMResult();
    TransformerFactory xformFactory = TransformerFactory.newInstance();
    Transformer idTransform = xformFactory.newTransformer();
    idTransform.transform(input, output);
    // The tree is retrieved from the DOMResult, not from the Transformer
    Node document = output.getNode();
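For readers who want something they can compile, the following is a self-contained sketch of the same SAX-to-DOM identity transform. The SaxToDom class name is hypothetical, and the parser is obtained from JAXP's SAXParserFactory instead of XMLReaderFactory, which is merely one reasonable choice. The cast to Document assumes the transformer fills an empty DOMResult with a Document node, which is what the JDK's built-in transformer does but is not spelled out by the return type of getNode().

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

class SaxToDom {

    /** Parses the stream with a SAX parser and returns the equivalent DOM tree. */
    static Document build(InputStream in)
            throws ParserConfigurationException, SAXException, TransformerException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader saxParser = factory.newSAXParser().getXMLReader();

        // SAX filters, if any, would be attached here in place of saxParser.
        SAXSource input = new SAXSource(saxParser, new InputSource(in));
        DOMResult output = new DOMResult();

        // A Transformer created without a stylesheet copies input to output.
        Transformer idTransform = TransformerFactory.newInstance().newTransformer();
        idTransform.transform(input, output);

        // Assumption: the result node of an empty DOMResult is a Document.
        return (Document) output.getNode();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<greeting>Hello</greeting>".getBytes(StandardCharsets.UTF_8);
        Document doc = build(new ByteArrayInputStream(xml));
        System.out.println(doc.getDocumentElement().getTagName()); // prints greeting
    }
}
```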

To go in the other direction, from DOM to SAX, you can just use a DOMSource and a SAXResult. The DOMSource is constructed from a DOM Document object, and the SAXResult is configured with a ContentHandler:

    Source input = new DOMSource(document);
    ContentHandler handler = new MyContentHandler();
    Result output = new SAXResult(handler);
    TransformerFactory xformFactory = TransformerFactory.newInstance();
    Transformer idTransform = xformFactory.newTransformer();
    // All output goes to handler as SAX events; there is no node to retrieve
    idTransform.transform(input, output);

The transform will walk the DOM tree, firing off events to the SAX ContentHandler.
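A minimal runnable sketch of the DOM-to-SAX direction follows. The DomToSax class, its ElementCounter handler, and the sampleDocument() helper are all hypothetical names made up for this example. The handler simply counts startElement() calls, which is enough to show that the identity transform really does replay the tree as SAX events.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomToSax {

    /** Counts the startElement events it receives. */
    static class ElementCounter extends DefaultHandler {
        int count = 0;

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) {
            count++;
        }
    }

    /** Walks the DOM tree via an identity transform and counts its elements. */
    static int countElements(Document document) throws TransformerException {
        ElementCounter handler = new ElementCounter();
        Transformer idTransform = TransformerFactory.newInstance().newTransformer();
        idTransform.transform(new DOMSource(document), new SAXResult(handler));
        return handler.count;
    }

    /** Builds a three-element document in memory: a root with two children. */
    static Document sampleDocument() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);
        root.appendChild(doc.createElement("first"));
        root.appendChild(doc.createElement("second"));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements(sampleDocument())); // prints 3
    }
}
```

Any existing ContentHandler can be substituted for ElementCounter, which makes this a convenient way to reuse SAX-based code on a document that is already in memory.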

Although TrAX is the most standard, parser-independent means of passing documents back and forth between SAX and DOM, many implementations of these APIs also provide their own utility classes for crossing the border between the APIs. For example, GNU JAXP has the gnu.xml.pipeline.DomConsumer class for building DOM Document objects from SAX event streams, and the gnu.xml.util.DomParser class for feeding a DOM Document into a SAX program. The Oracle XML Parser for Java provides the oracle.xml.parser.v2.DocumentBuilder, which is a SAX LexicalHandler/ContentHandler/DeclHandler that builds a DOM Document from a SAX XMLReader.