Introducing XML Parsing | Professional XML (Programmer to Programmer)

There are two widely used approaches to parsing XML data:

• Tree-based APIs
• Simple API for XML (SAX)

The following sections discuss each of these approaches.

Tree-Based APIs

One of the most popular XML APIs at the moment is the Document Object Model, which is a standard that was developed by the World Wide Web Consortium (w3.org). DOM is what is known as a tree-based API, which means that all of the information and content from the original document must be read into memory and stored in a tree structure before it can be accessed by a client program. After the document has been parsed and stored as an in-memory tree structure, the client application has full access to its contents. It is simple to follow references from one part of the document to another. It is also easy to modify the document by adding and removing nodes from the tree.

Although this approach has some obvious advantages, it has some equally obvious disadvantages. The size of the document affects the performance (and memory consumption) of the program. If the document is very large, it may not be possible to store the entire thing in memory at one time. Also, the whole document must be successfully parsed before any information is available to the client program.
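
The tree-based workflow described above can be sketched with the DOM parser that ships with the JDK (the javax.xml.parsers and org.w3c.dom packages). This is a minimal illustration rather than a recommended design; the sample catalog document and the DomSketch class name are assumptions made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomSketch {

    // Parses the whole document into an in-memory tree before returning.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Adds a <book> element with the given title under the root element.
    static void addBook(Document doc, String title) {
        Element book = doc.createElement("book");
        book.setAttribute("title", title);
        doc.getDocumentElement().appendChild(book);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, used only for illustration.
        String xml = "<catalog>"
                + "<book title=\"XML Basics\"/>"
                + "<book title=\"SAX in Depth\"/>"
                + "</catalog>";
        Document doc = parse(xml);

        // Random access: any node can be reached once the tree exists.
        NodeList books = doc.getElementsByTagName("book");
        Element second = (Element) books.item(1);
        System.out.println("Second title: " + second.getAttribute("title"));

        // Modification: append one node, then remove the first one.
        addBook(doc, "DOM Recipes");
        doc.getDocumentElement().removeChild(books.item(0));
        System.out.println("Books now in tree: "
                + doc.getElementsByTagName("book").getLength());
    }
}
```

Note that nothing can be read from the tree until parse() has returned, which is exactly the cost discussed above: the entire document is parsed and held in memory first.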

A Simple API for XML (SAX)

It was to solve these and other problems that the members of the XML-DEV mailing list (www.xml.org) developed SAX. Unlike DOM, SAX is an event-driven API. Rather than building an in-memory copy of the document and passing it to the client program, this API requires the client program to register itself to receive notifications when the parser recognizes various parts of an XML document.

In the event-driven scenario, the API itself doesn't allocate storage for the contents of the document. The required content is passed to the event notification method and then forgotten. Whether the document is 10 kilobytes or 10 megabytes, the parser's memory usage stays roughly constant, because it never holds the whole document at once. Unlike in the tree-based approach, the client application receives notifications as the document is parsed, so it can begin processing before the entire document has been read. For many Internet-based applications, where bandwidth may be an issue, this can be extremely useful.
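
As a rough sketch of this event-driven model, the following program registers a handler and lets the parser call back into it. It uses the SAX parser bundled with the JDK (obtained through javax.xml.parsers.SAXParserFactory) so that it runs without extra downloads; the SaxEventsSketch and CountingHandler names and the sample document are assumptions for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsSketch {

    // Receives notifications from the parser. It keeps only two counters,
    // so its memory use does not grow with the size of the document.
    static class CountingHandler extends DefaultHandler {
        int elements = 0;
        int characters = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver text in several chunks; sum them all.
            characters += length;
        }
    }

    // Registers a handler with the reader, then parses the given XML text.
    static CountingHandler count(String xml) throws Exception {
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CountingHandler handler = new CountingHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, used only for illustration.
        String xml = "<catalog><book>XML Basics</book>"
                + "<book>SAX in Depth</book></catalog>";
        CountingHandler result = count(xml);
        System.out.println("Elements seen: " + result.elements);
        System.out.println("Characters seen: " + result.characters);
    }
}
```

The handler's methods run while parsing is still in progress, which is why an application can start work before the end of the document arrives.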

There are, of course, drawbacks to this approach. Application developers are responsible for creating their own data structures to store any document information they must reference later. Because no comprehensive model of the document is available in memory, SAX is unsuitable for sophisticated editing applications. Also, for applications where random access to arbitrary points of the document is required (such as an XSLT implementation), a tree-based API would be more appropriate.
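
To illustrate the first drawback, here is a minimal sketch of an application that must build its own data structure, because SAX keeps nothing once an event has been delivered. The title element name, the sample document, and the SaxCollectSketch and TitleCollector names are assumptions for this example; it again relies on the parser bundled with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCollectSketch {

    // Stores the text of every <title> element in a list, because the
    // parser itself retains nothing after each notification.
    static class TitleCollector extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder text; // non-null only while inside <title>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("title".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(text.toString());
                text = null;
            }
        }
    }

    // Parses the XML text and returns the collected titles in document order.
    static List<String> collectTitles(String xml) throws Exception {
        TitleCollector collector = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return collector.titles;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, used only for illustration.
        String xml = "<catalog>"
                + "<book><title>XML Basics</title></book>"
                + "<book><title>SAX in Depth</title></book>"
                + "</catalog>";
        System.out.println(collectTitles(xml));
    }
}
```

Even this small task needs explicit state tracking (the text field), which hints at why editing tools and random-access consumers are usually better served by a tree-based API.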

Installing SAX

In reality, SAX is nothing more than a set of Java class and interface descriptions that document a system for writing event-driven XML applications. The SAX specification (along with the source code for a set of Java interfaces and classes) lives on its own Web site (http://www.saxproject.org) and is still maintained and extended by the members of the XML-DEV mailing list. To download SAX, you can go to the home page and then browse for the latest version, or you can go directly to the SourceForge project page at http://www.sourceforge.net/project/showfiles.php?group_id=29449.

The distribution contains all the Java interfaces, the extension interfaces, some helper files, and the documentation, but doesn't include a SAX parser. To actually use SAX, you need to download one of the many XML parsers that have been developed to work with SAX. The parser is the one that has a concrete implementation of the various interfaces and classes that make up the org.xml.sax and org.xml.sax.helpers Java packages. Some popular Java SAX parsers are shown in the following table:


Parser	Driver identifier	Description
Xerces-J	`org.apache.xerces.parsers.SAXParser`	Xerces-J, which is used throughout this chapter, is maintained by the Apache group. It is available at http://www.xml.apache.org/xerces2-j.
AElfred2	`gnu.xml.aelfred2.XmlReader`	The AElfred2 parser is highly conformant because it was written and modified by the creators of SAX. It is available as part of the GNUJAXP project at `gnu.org/software/classpathx/jaxp/`.
Crimson	`org.apache.crimson.parser.XMLReaderImpl`	The Crimson parser was originally part of the Crimson project at http://www.xml.apache.org/crimson/. It is now included as part of Sun's Java API for XML Parsing available at http://www.java.sun.com/xml.
Oracle	`oracle.xml.parser.v2.SAXParser`	Oracle maintains a SAX parser as part of its XML toolkit. It can be downloaded from the Oracle Technology Network at http://www.otn.oracle.com/tech/xml/index.html.
XP	`com.jclark.xml.sax.SAX2Driver`	XP is an XML 1.0 parser written by James Clark. A SAX2 driver was created for use with the latest versions of SAX. More information can be found at http://www.xmlmind.com_xpforjaxp/docs/.
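
The driver identifiers in the table are fully qualified class names that can be handed to org.xml.sax.helpers.XMLReaderFactory.createXMLReader(String), which is part of the SAX2 helper package. The sketch below shows one way this might look. The fallback to the JDK's bundled parser is my own assumption, added so that the example still runs when the named driver is not on the class path, and the DriverSketch class name is hypothetical. Note that recent JDK releases mark XMLReaderFactory as deprecated, although the method is still present.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class DriverSketch {

    // Tries to load the named SAX2 driver class. If it cannot be created
    // (for example, the parser's JAR is missing from the class path),
    // falls back to the parser bundled with the JDK. The fallback is an
    // assumption made for this sketch, not part of SAX itself.
    @SuppressWarnings("deprecation")
    static XMLReader createReader(String driverClassName) throws Exception {
        try {
            return XMLReaderFactory.createXMLReader(driverClassName);
        } catch (SAXException e) {
            System.err.println("Driver unavailable, using JDK default: "
                    + e.getMessage());
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        }
    }

    public static void main(String[] args) throws Exception {
        // Driver identifier taken from the table above.
        XMLReader reader = createReader("org.apache.xerces.parsers.SAXParser");
        System.out.println("Using parser class: " + reader.getClass().getName());

        // Any XMLReader is driven the same way, whichever driver was loaded.
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader("<greeting>hello</greeting>")));
        System.out.println("Parsed without errors.");
    }
}
```

Because the application code depends only on the XMLReader interface, switching parsers is a matter of changing the driver string.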

In this chapter, I use the Apache XML parsing library developed as part of the Apache Xerces project. You can download the Apache Xerces project code from http://www.xml.apache.org/xerces2-j/ or http://www.archive.apache.org/dist/xml/xerces-j/. After downloading the archive file, unzip it to the desired folder and follow the instructions to set up your environment.

After that, set the CLASSPATH environment variable to include the following:

• <SAX-Installation-Drive>\sax2r3\sax2.jar
• <Xerces-Installation-Drive>\Xerces-J\xerces-2_9_0\xercesImpl.jar

These entries allow java.exe (the JDK Java runtime) to locate the SAX and Xerces classes at runtime so you aren't required to supply their location on the command line.
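
One quick way to confirm that the class path is set up as intended is to ask the runtime whether it can locate a few key classes. The following is a minimal sketch; the ClasspathCheck class name and the particular classes probed are my own choices for illustration. Keep in mind that current JDKs bundle the org.xml.sax packages, so only the Xerces entry really depends on the CLASSPATH setting.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located by the class loader
    // that loaded this program, without initializing the class.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false,
                    ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.xml.sax.XMLReader",               // SAX interface
            "org.xml.sax.helpers.DefaultHandler",  // SAX helper class
            "org.apache.xerces.parsers.SAXParser"  // Xerces driver
        };
        for (String name : names) {
            System.out.println(name + ": "
                    + (isAvailable(name) ? "found" : "NOT found"));
        }
    }
}
```

If the Xerces driver is reported as not found, the xercesImpl.jar entry is the first thing to recheck.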

You also need a copy of the Java 2 SDK to compile and execute your SAX application. The examples in this book were compiled using the JDK version 1.5.10. As part of the Java setup, you must also add the directory that contains the Java executables to your PATH variable.