Using the SAX Parser | Building an On Demand Computing Environment with IBM: How to Optimize Your Current Infrastructure for Today and Tomorrow (MaxFacts Guidebook series)

The DOM parser reads an XML document in its entirety into a tree data structure. For most practical applications, DOM works fine. However, it can be inefficient if the document is large and if your processing algorithm is simple enough that you can analyze nodes on the fly, without having to see all of the tree structure. In these cases, you should use the SAX parser instead. The SAX parser reports events as it parses the components of the XML input, but it does not store the document in any way; it is up to the event handlers whether they want to build a data structure. In fact, the DOM parser is built on top of the SAX parser. It builds the DOM tree as it receives the parser events.

Whenever you use a SAX parser, you need a handler that defines the event actions for the various parse events. The ContentHandler interface defines several callback methods that the parser executes as it parses the document. Here are the most important ones:

startElement and endElement are called when a start tag or end tag is encountered.
characters is called whenever character data are encountered.
startDocument and endDocument are called once each, at the start and the end of the document.

For example, when parsing the fragment

 <font>
    <name>Helvetica</name>
    <size units="pt">36</size>
 </font>

the parser generates the following callbacks (calls to characters that report only the whitespace between tags are omitted):

startElement, element name: font
startElement, element name: name
characters, content: Helvetica
endElement, element name: name
startElement, element name: size, attributes: units="pt"
characters, content: 36
endElement, element name: size
endElement, element name: font
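
To make the callback sequence concrete, here is a minimal sketch of a handler that records each event as a line of text. The class name EventTrace and the trace method are hypothetical names chosen for this illustration. The sketch buffers character data and drops whitespace-only runs, so the recorded lines should correspond to the calls listed above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the example program.
final class EventTrace {
    /** Parses the given XML text and returns one line per reported event. */
    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            // Emits buffered character data, skipping whitespace-only runs.
            private void flushText() {
                String s = text.toString().trim();
                if (!s.isEmpty()) events.add("characters: " + s);
                text.setLength(0);
            }

            @Override
            public void startElement(String uri, String lname, String qname, Attributes attrs) {
                flushText();
                StringBuilder line = new StringBuilder("startElement: " + qname);
                for (int i = 0; i < attrs.getLength(); i++)
                    line.append(' ').append(attrs.getQName(i))
                        .append("=\"").append(attrs.getValue(i)).append('"');
                events.add(line.toString());
            }

            @Override
            public void characters(char[] data, int start, int length) {
                text.append(data, start, length);
            }

            @Override
            public void endElement(String uri, String lname, String qname) {
                flushText();
                events.add("endElement: " + qname);
            }
        };
        // Namespace processing is off by default, so qname holds the element name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<font>\n   <name>Helvetica</name>\n   <size units=\"pt\">36</size>\n</font>";
        for (String event : trace(xml)) System.out.println(event);
    }
}
```

Recording the events in a list rather than printing them directly keeps the handler easy to check in a test.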

Your handler needs to override these methods and have them carry out whatever action you want as you parse the file. The program at the end of this section prints all links <a href="..."> in an HTML file. It simply overrides the startElement method of the handler to check for elements with name a and an attribute with name href. This is potentially useful for implementing a "web crawler," a program that reaches more and more web pages by following links.

NOTE

Unfortunately, most HTML pages deviate so much from proper XML that the example program will not be able to parse them. As already mentioned, the World Wide Web Consortium (W3C) recommends that web designers use XHTML, an HTML dialect that can be displayed by current web browsers and that is also proper XML. See http://www.w3.org/TR/xhtml1/ for more information on XHTML. Since the W3C "eats its own dog food," their web pages are written in XHTML. You can use those pages to test the example program. For example, if you run

 java SAXTest http://www.w3c.org/MarkUp

then you will see a list of the URLs of all links on that page.

The sample program is a good example for the use of SAX. We don't care at all in which context the a elements occur, and there is no need to store a tree structure.

Here is how you get a SAX parser:

 SAXParserFactory factory = SAXParserFactory.newInstance();
 SAXParser parser = factory.newSAXParser();

You can now process a document:

 parser.parse(source, handler);

Here, source can be a file, URL string, or input stream. The handler belongs to a subclass of DefaultHandler. The DefaultHandler class defines do-nothing methods for the four interfaces:

 ContentHandler
 DTDHandler
 EntityResolver
 ErrorHandler
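
As a sketch of how one DefaultHandler subclass can serve several of these interfaces, the following handler overrides a ContentHandler method (startElement) and an ErrorHandler method (fatalError), and is passed to two of the parse overloads. The names SourceDemo, CountingHandler, and countElements are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration.
final class SourceDemo {
    /** Counts start tags and remembers the last fatal error message. */
    static final class CountingHandler extends DefaultHandler {
        int elements;
        String fatalMessage;

        @Override // ContentHandler callback
        public void startElement(String uri, String lname, String qname, Attributes attrs) {
            elements++;
        }

        @Override // ErrorHandler callback
        public void fatalError(SAXParseException e) throws SAXException {
            fatalMessage = e.getMessage();
            throw e; // keep the default behavior of aborting the parse
        }
    }

    static SAXParser newParser() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser();
    }

    static int countElements(File file) throws Exception {
        CountingHandler handler = new CountingHandler();
        newParser().parse(file, handler);
        return handler.elements;
    }

    static int countElements(InputStream in) throws Exception {
        CountingHandler handler = new CountingHandler();
        newParser().parse(in, handler);
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "<font><name>Helvetica</name><size units=\"pt\">36</size></font>"
            .getBytes(StandardCharsets.UTF_8);
        File file = File.createTempFile("font", ".xml");
        file.deleteOnExit();
        Files.write(file.toPath(), bytes);

        System.out.println("From file: " + countElements(file));
        System.out.println("From stream: " + countElements(new ByteArrayInputStream(bytes)));

        // A document that is not well formed triggers fatalError.
        CountingHandler handler = new CountingHandler();
        byte[] broken = "<font><name></font>".getBytes(StandardCharsets.UTF_8);
        try {
            newParser().parse(new ByteArrayInputStream(broken), handler);
        } catch (SAXException e) {
            System.out.println("Not well formed: " + handler.fatalMessage);
        }
    }
}
```

The methods that are not overridden keep their inherited do-nothing behavior, which is why the handler can stay this short.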

The example program defines a handler that overrides the startElement method of the ContentHandler interface to watch out for a elements with an href attribute:

 DefaultHandler handler = new
    DefaultHandler()
    {
       public void startElement(String namespaceURI, String lname, String qname, Attributes attrs)
          throws SAXException
       {
          if (lname.equalsIgnoreCase("a") && attrs != null)
          {
             for (int i = 0; i < attrs.getLength(); i++)
             {
                String aname = attrs.getLocalName(i);
                if (aname.equalsIgnoreCase("href"))
                   System.out.println(attrs.getValue(i));
             }
          }
       }
    };

The startElement method has three parameters that describe the element name. The qname parameter reports the qualified name of the form alias:localname. If namespace processing is turned on, then the namespaceURI and lname parameters describe the namespace and local (unqualified) name.

As with the DOM parser, namespace processing is turned off by default. You activate namespace processing by calling the setNamespaceAware method of the factory class:

 SAXParserFactory factory = SAXParserFactory.newInstance();
 factory.setNamespaceAware(true);
 SAXParser saxParser = factory.newSAXParser();
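
The effect of this setting on the three name parameters can be observed with a small experiment. The sketch below parses the same one-element document twice, once with and once without namespace processing, and returns what startElement receives. The class NamespaceDemo, the method rootNames, and the prefix h are hypothetical choices made for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration.
final class NamespaceDemo {
    /** Returns { uri, lname, qname } as reported for the root element. */
    static String[] rootNames(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        SAXParser parser = factory.newSAXParser();
        String[] names = new String[3];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String lname, String qname, Attributes attrs) {
                if (names[0] == null) { // record the first (root) element only
                    names[0] = uri;
                    names[1] = lname;
                    names[2] = qname;
                }
            }
        };
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<h:a xmlns:h=\"http://www.w3.org/1999/xhtml\" href=\"http://www.w3.org/\"/>";
        for (boolean aware : new boolean[] { false, true }) {
            String[] n = rootNames(xml, aware);
            System.out.println("namespaceAware=" + aware
                + " uri=[" + n[0] + "] lname=[" + n[1] + "] qname=[" + n[2] + "]");
        }
    }
}
```

Because the handler in this section tests lname, it only finds links when the parser is namespace aware. A handler that must work in both modes could fall back to qname when lname is empty.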

Example 12-8 contains the code for the web crawler program. Later in this chapter, you will see another interesting use of SAX. An easy way of turning a non-XML data source into XML is to report the SAX events that an XML parser would report. See the section on XSL transformations for details.

Example 12-8. SAXTest.java

 import java.io.*;
 import java.net.*;
 import javax.xml.parsers.*;
 import org.xml.sax.*;
 import org.xml.sax.helpers.*;

 /**
    This program demonstrates how to use a SAX parser. The
    program prints all hyperlinks of an XHTML web page.
    Usage: java SAXTest url
 */
 public class SAXTest
 {
    public static void main(String[] args) throws Exception
    {
       String url;
       if (args.length == 0)
       {
          url = "http://www.w3c.org";
          System.out.println("Using " + url);
       }
       else
          url = args[0];

       DefaultHandler handler = new
          DefaultHandler()
          {
             public void startElement(String namespaceURI,
                String lname, String qname, Attributes attrs)
             {
                // print the href attribute of every a element
                if (lname.equalsIgnoreCase("a") && attrs != null)
                {
                   for (int i = 0; i < attrs.getLength(); i++)
                   {
                      String aname = attrs.getLocalName(i);
                      if (aname.equalsIgnoreCase("href"))
                         System.out.println(attrs.getValue(i));
                   }
                }
             }
          };

       SAXParserFactory factory = SAXParserFactory.newInstance();
       // the handler tests lname, which is only reported by a namespace-aware parser
       factory.setNamespaceAware(true);
       SAXParser saxParser = factory.newSAXParser();
       InputStream in = new URL(url).openStream();
       saxParser.parse(in, handler);
    }
 }

 javax.xml.parsers.SAXParserFactory 1.4

static SAXParserFactory newInstance()
returns an instance of the SAXParserFactory class.
SAXParser newSAXParser()
returns an instance of the SAXParser class.
boolean isNamespaceAware()
void setNamespaceAware(boolean value)
get or set the "namespaceAware" property of the factory. If set to true, the parsers that this factory generates are namespace aware.
boolean isValidating()
void setValidating(boolean value)
get or set the "validating" property of the factory. If set to true, the parsers that this factory generates validate their input.
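
The following is a minimal sketch of the "validating" property in use, assuming a document that carries its own DTD in an internal subset. Validation problems are reported through the error method of the handler, which does nothing unless you override it. The class ValidationDemo, the method validationErrors, and the tiny DTD are hypothetical and serve only this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration.
final class ValidationDemo {
    // A made-up DTD: a font element must contain exactly one name element.
    static final String DTD =
        "<!DOCTYPE font [<!ELEMENT font (name)><!ELEMENT name (#PCDATA)>]>";

    /** Returns the messages of all validation errors reported during the parse. */
    static List<String> validationErrors(String xml, boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        SAXParser parser = factory.newSAXParser();
        List<String> errors = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage()); // record the problem and keep parsing
            }
        };
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String valid = DTD + "<font><name>Helvetica</name></font>";
        String invalid = DTD + "<font><size>36</size></font>";
        System.out.println("valid, validating: " + validationErrors(valid, true));
        System.out.println("invalid, validating: " + validationErrors(invalid, true));
        System.out.println("invalid, not validating: " + validationErrors(invalid, false));
    }
}
```

The exact wording of the error messages depends on the parser implementation, so the sketch only collects them.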

 javax.xml.parsers.SAXParser 1.4

void parse(File f, DefaultHandler handler)
void parse(String url, DefaultHandler handler)
void parse(InputStream in, DefaultHandler handler)
parse an XML document from the given file, URL, or input stream and report parse events to the given handler.

 org.xml.sax.ContentHandler 1.4

void startDocument()
void endDocument()
are called at the start and the end of the document.
void startElement(String uri, String lname, String qname, Attributes attr)
void endElement(String uri, String lname, String qname)
are called at the start and the end of an element.
Parameters:
uri
The URI of the namespace (if the parser is namespace aware)

lname
The local name without alias prefix (if the parser is namespace aware)

qname
The element name if the parser is not namespace aware, or the qualified name with alias prefix if the parser reports qualified names in addition to local names

void characters(char[] data, int start, int length)
is called when the parser reports character data.
Parameters:
data
An array of character data

start
The index of the first character in the data array that is a part of the reported characters

length
The length of the reported character string
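
A parser is free to deliver the text of one element in several characters calls, each covering only the range from start to start + length - 1 of the data array. The sketch below therefore accumulates the chunks in a StringBuilder until the end tag arrives. The class ElementText and the method textOf are hypothetical names, and the sketch assumes that elements with the requested name are not nested inside each other.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration.
final class ElementText {
    /** Returns the text content of every element with the given name. */
    static List<String> textOf(String xml, String elementName) throws Exception {
        List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null while inside a matching element

            @Override
            public void startElement(String uri, String lname, String qname, Attributes attrs) {
                if (qname.equals(elementName)) current = new StringBuilder();
            }

            @Override
            public void characters(char[] data, int start, int length) {
                // Only the given range of the array belongs to this event.
                if (current != null) current.append(data, start, length);
            }

            @Override
            public void endElement(String uri, String lname, String qname) {
                if (current != null && qname.equals(elementName)) {
                    result.add(current.toString());
                    current = null;
                }
            }
        };
        // Namespace processing is off by default, so qname holds the element name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<fonts><name>Helvetica</name><name>Times &amp; Roman</name></fonts>";
        System.out.println(textOf(xml, "name"));
    }
}
```

Reading the whole data array, or assuming that one call delivers all of the text, are common mistakes that this pattern avoids.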

 org.xml.sax.Attributes 1.4

int getLength()
returns the number of attributes stored in this attribute collection.
String getLocalName(int index)
returns the local name (without alias prefix) of the attribute with the given index, or the empty string if the parser is not namespace aware.
String getURI(int index)
returns the namespace URI of the attribute with the given index, or the empty string if the node is not part of a namespace or if the parser is not namespace aware.
String getQName(int index)
returns the qualified name (with alias prefix) of the attribute with the given index, or the empty string if the qualified name is not reported by the parser.
String getValue(int index)
String getValue(String qname)
String getValue(String uri, String lname)
return the attribute value from a given index, qualified name, or namespace URI + local name. Return null if the value doesn't exist.
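
The three getValue forms can be compared side by side in a short sketch. It uses a namespace-aware parser and a made-up namespace URI, urn:example:metrics, bound to the prefix m. The class AttributeLookup, the method lookups, and the keys of the returned map are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration.
final class AttributeLookup {
    /** Looks up the attributes of the root element in three different ways. */
    static Map<String, String> lookups(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String lname, String qname, Attributes attrs) {
                // By index: walk the whole collection.
                for (int i = 0; i < attrs.getLength(); i++)
                    found.put("index:" + attrs.getQName(i), attrs.getValue(i));
                // By qualified name.
                found.put("qname:units", attrs.getValue("units"));
                found.put("qname:missing", attrs.getValue("missing")); // null if absent
                // By namespace URI and local name; unprefixed attributes have the URI "".
                found.put("ns:scale", attrs.getValue("urn:example:metrics", "scale"));
                found.put("ns:units", attrs.getValue("", "units"));
            }
        };
        factory.newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<size units=\"pt\" xmlns:m=\"urn:example:metrics\" m:scale=\"2\">36</size>";
        for (Map.Entry<String, String> e : lookups(xml).entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```

Looking up by namespace URI and local name is the more robust choice when the document author is free to pick a different alias prefix.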