Choosing a Parsing Method | XML Programming Bible

You should choose your parser depending on the nature of the processing and the size of the XML documents. A tree-based parser usually needs to load the entire document into memory, so it can be impractical because of physical constraints on memory when processing documents like dictionaries or large databases. With a stream-based parser you can skip over elements that you aren't interested in (for example, when looking up a particular word in a dictionary). If your application needs to process certain elements in relation to other elements, however, a tree-based parser is much easier to work with. It's worth noting that a tree-based parser can be built on top of a stream-based parser and that the output of a tree-based parser can be "walked" to provide a stream-based interface to an application. In this section we cover the DOM and SAX parsing methods and provide example scenarios in which you can decide which method is appropriate for a given task.

Figure 3-5 A SAX Event Order.

The term "walked" refers to traversing the Document Object Model (DOM) tree node by node and handing the document to the application in pieces as each node is visited.

The DOM Method

DOM implementations are currently biased toward in-memory storage of the document, but this may change as Persistent DOM (PDOM) implementations become more popular. Even with memory limitations, however, DOM certainly has a place because of features that help it access and manipulate documents. The following are DOM benefits you should focus on:

It allows random access to the document.
Complex searches can be easily implemented.
The DTD or schema is available.
The DOM is read/write.

The first two benefits, random access and complex searches, give you a means of locating elements and retrieving information, such as data and attributes, from them. The DOM can also be bound to an XML DTD or schema, which means the document can be checked to make sure the data it contains is valid according to the rules of the DTD or schema. Finally, the DOM lets you both read data out of a document and write data to it.
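The random-access and read/write benefits can be sketched in a few lines. The example below is only an illustration using Python's standard xml.dom.minidom module; the sample document, its element and attribute names, and the rename_employee helper are assumptions made for this sketch, not part of any particular application.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document; the element and attribute names are assumptions.
SAMPLE = """<staff>
  <employee id="1"><name>Ann</name></employee>
  <employee id="2"><name>Bob</name></employee>
</staff>"""


def rename_employee(xml_text, employee_id, new_name):
    """Find one employee by id, change its <name>, and return the updated XML."""
    doc = parseString(xml_text)  # the whole document becomes an in-memory tree
    # Random access: any node in the tree can be reached, in any order.
    for employee in doc.getElementsByTagName("employee"):
        if employee.getAttribute("id") == employee_id:
            name = employee.getElementsByTagName("name")[0]
            # Read/write: replace the text of the node in place.
            name.firstChild.data = new_name
    return doc.documentElement.toxml()


updated = rename_employee(SAMPLE, "2", "Robert")
print(updated)
```

Because the tree stays in memory, the same doc object could be searched or modified again without reparsing the file.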

The DOM's simplicity, powerful access to the document, and well-defined specification make it a popular parsing method. It also pairs well with XSLT and other document-transformation solutions you might require. Therefore, if your project is small and you need to complete it quickly, using a DOM-based method is a great choice. However, if you are going to process large files and have the time to write a more robust application, you should look into a SAX-based implementation.

The SAX Method

If you need to parse and process huge XML documents, SAX implementations offer some benefits over DOM-based ones. You should first ask yourself, however, whether an improved design would remove the need for large documents. For example, prefiltering in a database that can stream XML might suit your needs. Keep in mind, too, that going with SAX leaves document manipulation up to you: you can hand transformations to XSLT, but your team must otherwise write the code that internally manages, stores, and rewrites the document.

Like the DOM, SAX has a particular set of benefits. The following list contains some of the most useful:

It can parse files of any size.
You can build your own data structure.
You can access only a small subset of the information if you desire.
It is fast.

The biggest advantage of SAX is, arguably, its ability to process files of any size. Because the parser streams the data through and exposes it as a sequence of events, the parser itself never needs to hold the whole document in memory. SAX is also useful when you want to build your own data structure, and it allows you to grab only subsets of the information in a given document. Finally, it can be a fast method of processing documents, especially when parsing large files.

SAX is best suited to sequential-scan applications when you want to go through the XML document quickly from start to finish. Also, sometimes you won't need the overhead of the full-blown DOM, so a SAX parser will be sufficient for creating a lightweight and compact internal data structure.
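As a rough illustration of building a lightweight structure of your own, the sketch below uses Python's standard xml.sax module to collect just an id-to-name dictionary and ignore everything else. The sample document, the NameIndexHandler class, and the build_name_index helper are hypothetical names chosen for this example.

```python
import xml.sax

# Hypothetical sample document; only <employee id> and <name> matter here.
SAMPLE = b"""<staff>
  <employee id="1"><name>Ann</name><office>Boston</office></employee>
  <employee id="2"><name>Bob</name><office>Denver</office></employee>
</staff>"""


class NameIndexHandler(xml.sax.ContentHandler):
    """Builds a plain dict of employee id -> name as events arrive."""

    def __init__(self):
        super().__init__()
        self.names = {}
        self._current_id = None
        self._in_name = False
        self._buffer = []

    def startElement(self, tag, attrs):
        if tag == "employee":
            self._current_id = attrs.get("id")
        elif tag == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered until the end tag.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names[self._current_id] = "".join(self._buffer).strip()
            self._in_name = False


def build_name_index(xml_bytes):
    handler = NameIndexHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.names


name_index = build_name_index(SAMPLE)
print(name_index)
```

The <office> elements pass through the parser but are never stored, so the resulting structure holds only what the application asked for.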

Example Scenarios

To help you choose a method, we have included a few example scenarios. XML processing is more widely adopted every day, but many of its lessons go unlearned for lack of experience. These scenarios should help by letting you walk down a decision path and choose the right approach. With the benefits of each method in mind, review the following scenarios and determine which parser is appropriate for each one.

Scenario 1

Company XYZ currently has 20,000 employees. The Human Resources data file is currently stored in XML format, and you are to write an application that returns the average annual salary of all employees.

If you use the DOM interface your application will need to load the entire employee database into memory and retrieve the document.employee[i].annual_salary[0] value for each employee, and then average the values.
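A minimal sketch of this DOM approach, assuming Python's standard xml.dom.minidom module and a hypothetical two-employee file standing in for the real Human Resources data, might look like this:

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for the Human Resources file; the layout is an assumption.
HR_XML = """<company>
  <employee><name>Ann</name><annual_salary>50000</annual_salary></employee>
  <employee><name>Bob</name><annual_salary>70000</annual_salary></employee>
</company>"""


def average_salary_dom(xml_text):
    """Load the whole document into a tree, then average every salary node."""
    doc = parseString(xml_text)  # the entire document is held in memory
    salaries = [
        float(node.firstChild.data)
        for node in doc.getElementsByTagName("annual_salary")
    ]
    return sum(salaries) / len(salaries)


print(average_salary_dom(HR_XML))
```

The code is short, but every name and every other field for all 20,000 employees would sit in memory just to read one value per record.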

Using the SAX approach you could write an event handler that looks for only the <annual_salary> element and ignores everything else. You could parse through the file systematically and efficiently. The solution in this scenario is clear: SAX makes one pass through a large file and looks for specific data.
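The event handler described here could be sketched as follows with Python's standard xml.sax module. The SalaryHandler class, the average_salary_sax helper, and the sample data are assumptions made for illustration only.

```python
import xml.sax

# Hypothetical stand-in for the Human Resources file; the layout is an assumption.
HR_XML = b"""<company>
  <employee><name>Ann</name><annual_salary>50000</annual_salary></employee>
  <employee><name>Bob</name><annual_salary>70000</annual_salary></employee>
</company>"""


class SalaryHandler(xml.sax.ContentHandler):
    """Keeps a running total of <annual_salary> values and ignores the rest."""

    def __init__(self):
        super().__init__()
        self.total = 0.0
        self.count = 0
        self._in_salary = False
        self._buffer = []

    def startElement(self, tag, attrs):
        if tag == "annual_salary":
            self._in_salary = True
            self._buffer = []

    def characters(self, content):
        # Only text inside <annual_salary> is collected.
        if self._in_salary:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "annual_salary":
            self.total += float("".join(self._buffer))
            self.count += 1
            self._in_salary = False


def average_salary_sax(xml_bytes):
    handler = SalaryHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.total / handler.count


print(average_salary_sax(HR_XML))
```

Only two numbers, the running total and the count, are kept while the file streams past, which is why this approach scales to a large employee file.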

Scenario 2

Company ABC has 375 employees. The Human Resources data file is stored in XML format, and you are to write an application that allows users to scroll through the list of employees and find detailed information on an employee.

If you tried this with the SAX approach you would either have to parse through the XML document every time you wanted to display information to the user, which is inefficient, or you would have to build your own memory structure so that you could parse it once and then access it multiple times. You'd need to keep track of all the information yourself and develop and maintain the code required to support this data storage scheme.

By using the DOM, you have access to the entire employee database as nodes in a tree. The data storage mechanism and the code that supports it are essentially provided by the parser. This is a much easier solution!
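To make the "parse once, look up many times" idea concrete, here is a small sketch using Python's standard xml.dom.minidom module. The EmployeeDirectory class, its method names, and the sample data are hypothetical, and a real application would add a user interface on top.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical sample data; element and attribute names are assumptions.
HR_XML = """<company>
  <employee id="1"><name>Ann</name><title>Engineer</title></employee>
  <employee id="2"><name>Bob</name><title>Analyst</title></employee>
</company>"""


def _text(node):
    """Return the text of a simple element, or an empty string if it has none."""
    return node.firstChild.data if node.firstChild else ""


class EmployeeDirectory:
    def __init__(self, xml_text):
        # The document is parsed once; the tree is reused for every lookup.
        self._doc = parseString(xml_text)
        self._employees = self._doc.getElementsByTagName("employee")

    def names(self):
        """List every employee name, for example to fill a scrolling list."""
        return [_text(e.getElementsByTagName("name")[0]) for e in self._employees]

    def details(self, employee_id):
        """Return all child fields of one employee as a dict, or None."""
        for employee in self._employees:
            if employee.getAttribute("id") == employee_id:
                return {
                    child.tagName: _text(child)
                    for child in employee.childNodes
                    if child.nodeType == Node.ELEMENT_NODE
                }
        return None


directory = EmployeeDirectory(HR_XML)
print(directory.names())
print(directory.details("2"))
```

With only 375 employees, keeping the whole tree in memory is cheap, and no custom storage code is needed.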

Scenario 3

Returning to the first scenario, suppose you're asked to modify your application to give 7 percent raises to employees whose salary is below the company average and 4 percent raises to those whose salary is above it.

Your application would need to parse the document with the DOM. SAX does not support modification of data. Even if it did, an event-based approach would make the job difficult: you would have to write two sets of event handlers, one to calculate the average on a first pass and one to update the data as the document is parsed a second time.
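A sketch of the DOM version, again assuming Python's standard xml.dom.minidom module and a hypothetical two-employee file, shows why the in-memory tree helps: both passes run over the same parsed document. The apply_raises helper is an invented name, and leaving a salary exactly equal to the average unchanged is an assumption, since the scenario does not cover that case.

```python
from decimal import Decimal
from xml.dom.minidom import parseString

# Hypothetical stand-in for the Human Resources file; the layout is an assumption.
HR_XML = """<company>
  <employee><name>Ann</name><annual_salary>50000</annual_salary></employee>
  <employee><name>Bob</name><annual_salary>70000</annual_salary></employee>
</company>"""


def apply_raises(xml_text):
    """Give 7% to salaries below the average and 4% to those above it."""
    doc = parseString(xml_text)
    nodes = doc.getElementsByTagName("annual_salary")

    # Pass 1: compute the company average from the tree.
    salaries = [Decimal(node.firstChild.data) for node in nodes]
    average = sum(salaries) / len(salaries)

    # Pass 2: update each salary in place, with no second parse of the file.
    for node, salary in zip(nodes, salaries):
        if salary < average:
            salary *= Decimal("1.07")
        elif salary > average:
            salary *= Decimal("1.04")
        # Assumption: a salary exactly equal to the average is left unchanged.
        node.firstChild.data = str(salary.quantize(Decimal("0.01")))

    return doc.documentElement.toxml()


raised = apply_raises(HR_XML)
print(raised)
```

Decimal is used instead of float so that the currency amounts are not affected by binary rounding.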