Recipe 21.3 Parsing XML with SAX


Problem

You want to make one quick pass over an XML file, extracting certain tags or other information as you go.

Solution

Simply use SAX to create a document handler and pass it to the SAX parser.

Discussion

The SAX ContentHandler interface (called DocumentHandler in SAX 1) specifies a number of "callbacks" that your code must provide. In one sense, this is similar to the Listener interfaces in AWT and Swing, as covered briefly in Recipe 14.4. The most commonly used methods are startElement( ), endElement( ), and characters( ). The first two are called at the start and end of an element, and characters( ) is called when there is character data. The characters are stored in a large array, and you are passed the array along with the offset and length of the characters that make up your text. Conveniently, there is a String constructor that takes exactly these arguments. Hmmm, I wonder if they thought of that . . . .

To demonstrate this, I wrote a simple program that uses SAX to extract the name and children elements from an XML file. It is shown in Example 21-5.

Example 21-5. SAXLister.java
import java.io.IOException;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

import com.darwinsys.util.Debug;

/**
 * Simple lister - extract name and children tags from a user file. Version for SAX 2.0
 * @version $Id: ch21.xml,v 1.5 2004/05/04 20:13:38 ian Exp $
 */
public class SAXLister {

    public static void main(String[] args) throws Exception {
        new SAXLister(args);
    }

    public SAXLister(String[] args) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory
                .createXMLReader("org.apache.xerces.parsers.SAXParser");
        // should load properties rather than hardcoding class name
        parser.setContentHandler(new PeopleHandler( ));
        parser.parse(args.length == 1 ? args[0] : "parents.xml");
    }

    /** Inner class provides DocumentHandler
     */
    class PeopleHandler extends DefaultHandler {

        boolean parent = false;
        boolean kids = false;

        public void startElement(String nsURI, String localName,
                String rawName, Attributes attributes) throws SAXException {
            Debug.println("docEvents", "startElement: " + localName + ","
                    + rawName);
            // Consult rawName since we aren't using xmlns prefixes here.
            if (rawName.equalsIgnoreCase("name"))
                parent = true;
            if (rawName.equalsIgnoreCase("children"))
                kids = true;
        }

        public void characters(char[] ch, int start, int length) {
            if (parent) {
                System.out.println("Parent:  " + new String(ch, start, length));
                parent = false;
            } else if (kids) {
                System.out.println("Children: " + new String(ch, start, length));
                kids = false;
            }
        }

        /** Needed for parent constructor */
        public PeopleHandler( ) throws org.xml.sax.SAXException {
            super( );
        }
    }
}

When run against the people.xml file shown in Example 21-2, it prints the listing:

$ java -classpath .:../jars/darwinsys.jar:../jars/xerces.jar SAXLister people.xml
Parent:  Ian Darwin
Parent:  Another Darwin
$
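One caveat about characters( ): SAX allows a parser to deliver a single run of text in more than one call, so printing inside characters( ) can split a long value across lines. A common remedy is to accumulate the text and act on it in endElement( ). The following is a minimal sketch of that idea, not code from the book. The class names CharactersDemo and NameCollector and the inline sample XML are made up for illustration, and it obtains its parser through the JDK's JAXP SAXParserFactory so that it needs no extra jars.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class; not part of the book's examples. */
class CharactersDemo {

    /** Collects the complete text of every <name> element. */
    static class NameCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        private StringBuilder text; // non-null only while inside <name>

        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes atts) {
            if (qName.equals("name")) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one element; just append.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("name")) {
                names.add(text.toString().trim());
                text = null;
            }
        }
    }

    /** Parses the given XML string and returns the text of its <name> elements. */
    static List<String> namesIn(String xml) throws Exception {
        NameCollector handler = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample data, loosely modeled on the output shown above.
        String xml = "<people><person><name>Ian Darwin</name></person>"
                + "<person><name>Another Darwin</name></person></people>";
        System.out.println(namesIn(xml));
    }
}
```

The sketch compares against qName because the default JAXP factory is not namespace-aware, the same reason Example 21-5 consults rawName.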

In Version 2 of the SAX API, you can use the new XMLReaderFactory.createXMLReader( ). One difficulty is that the SAX specification and code are maintained by the SAX Project (http://www.saxproject.org), not Sun. The no-argument form of createXMLReader( ) is expected first to try loading the class defined in the system property org.xml.sax.driver, and if that fails, to load an implementation-defined SAX parser. Unfortunately, Sun's implementation (on 1.4 and on 1.5 Beta) does not do so; it simply throws an exception to the effect of "System property org.xml.sax.driver not specified." An overloaded form of createXMLReader( ) takes the name of the parser as a string argument (e.g., "org.apache.xerces.parsers.SAXParser" or "org.apache.crimson.parser.XMLReaderImpl"). This class name would normally be loaded from a properties file (see Recipe 7.7) to avoid having the parser class name compiled into your application.
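One way to avoid compiling in a parser class name is sketched below. It is an illustration under stated assumptions, not the book's code. The class ReaderChooser and its methods are hypothetical. It honors org.xml.sax.driver when that property is set and otherwise asks JAXP's SAXParserFactory for whatever parser ships with the JDK. Note that XMLReaderFactory is marked deprecated in Java 9 and later, hence the suppressed warning.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

/** Hypothetical helper; not part of the book's examples. */
class ReaderChooser {

    /**
     * Returns the parser named by the org.xml.sax.driver system property
     * if it is set, otherwise the JDK's default parser obtained via JAXP.
     */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated as of Java 9
    static XMLReader chooseReader() throws Exception {
        String driver = System.getProperty("org.xml.sax.driver");
        if (driver != null) {
            return XMLReaderFactory.createXMLReader(driver);
        }
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Counts the elements in an XML string, using whichever reader was chosen. */
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        XMLReader reader = chooseReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><c/></a>")); // prints 3
    }
}
```

To select a specific parser at run time, you could pass something like -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser on the java command line, provided that parser's jar is on the classpath.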

One problem with SAX is that it is, well, simple, and therefore doesn't scale well to more complex processing, as you can see by thinking about this program. Imagine trying to handle 12 different tags and doing something different with each one. For more involved analysis of an XML file, the Document Object Model (DOM) or the JDOM API may be better suited. (On the other hand, DOM requires keeping the entire tree in memory, so it has scalability issues of its own with extremely large XML documents.) And with SAX, you can't really "navigate" a document since you have only a stream of events, not a real structure. For that, you want DOM or JDOM.
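If you do stay with SAX for a dozen tags, one way to keep the handler from turning into a pile of boolean flags is a lookup table that maps element names to actions. The sketch below shows the idea under stated assumptions; it is not the book's approach. DispatchDemo, DispatchingHandler, and the on( ) method are hypothetical names, and the email element and sample data are invented. It works only for leaf elements, because the text buffer is reset at every start tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class; not part of the book's examples. */
class DispatchDemo {

    /** Routes the text of each registered leaf element to its own action. */
    static class DispatchingHandler extends DefaultHandler {
        private final Map<String, Consumer<String>> actions = new HashMap<>();
        private final StringBuilder text = new StringBuilder();

        /** Registers an action to run with the text of the named element. */
        void on(String elementName, Consumer<String> action) {
            actions.put(elementName, action);
        }

        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes atts) {
            text.setLength(0); // start collecting text afresh
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            Consumer<String> action = actions.get(qName);
            if (action != null) {
                action.accept(text.toString().trim());
            }
        }
    }

    /** Parses the XML string and returns one report line per recognized element. */
    static List<String> report(String xml) throws Exception {
        List<String> lines = new ArrayList<>();
        DispatchingHandler handler = new DispatchingHandler();
        handler.on("name", s -> lines.add("Parent: " + s));
        handler.on("email", s -> lines.add("Email: " + s));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample data for illustration only.
        String xml = "<people><person><name>Ian Darwin</name>"
                + "<email>ian@example.com</email></person></people>";
        report(xml).forEach(System.out::println);
    }
}
```

Adding a thirteenth tag then means one more on( ) call rather than another flag and another branch in characters( ). It still gives you no way to navigate the document, so the case for DOM or JDOM stands.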



Java Cookbook, Second Edition
ISBN: 0596007019
Year: 2003
Pages: 409
Authors: Ian F Darwin
