Chapter 20. Simple API for XML (SAX)

     

The Simple API for XML (SAX) is an event-based API for reading XML documents. Many different XML parsers implement the SAX API, including Xerces, Crimson, the Oracle XML Parser for Java, and Ælfred. SAX was originally defined as a Java API and is primarily intended for parsers written in Java. Therefore, this chapter focuses on the Java version of the API. However, SAX has been ported to most other major object-oriented languages, including C++, Python, Perl, and Eiffel. The translation from Java is usually fairly obvious.

The SAX API is unusual among XML APIs because it's an event-based push model rather than a tree-based pull model. As the XML parser reads an XML document, it sends the program information from the document in real time. Each time the parser sees a start-tag, an end-tag, character data, or a processing instruction, it tells your program. The document is presented to your program one piece at a time from beginning to end. You can either save the pieces you're interested in until the entire document has been read, or process the information as soon as you receive it. You do not have to wait for the entire document to be read before acting on the data at the beginning of the document. Most importantly, the entire document does not have to reside in memory. This feature makes SAX the API of choice for very large documents that do not fit into available memory.
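To make the push model concrete, here is a minimal sketch that records each event in the order the parser reports it. The class name EventTrace and its trace( ) method are hypothetical names chosen for this illustration. The sketch uses the standard DefaultHandler helper class from org.xml.sax.helpers and reads the document from a string through an InputSource so that it runs without network access. It assumes a SAX2 parser is available to XMLReaderFactory, which is the case for the parser bundled with current JDKs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demo class; the name EventTrace is not part of SAX.
final class EventTrace {

    /** Parses the XML text and returns the events in the order they were reported. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
    static List<String> trace(String xml) throws SAXException, IOException {
        final List<String> events = new ArrayList<>();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("start " + localName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + localName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                events.add("text " + new String(ch, start, length));
            }

            @Override
            public void processingInstruction(String target, String data) {
                events.add("pi " + target);
            }
        });
        // Events are pushed to the handler while parse() is still running.
        parser.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace("<doc><?note keep?><p>Hello</p></doc>")) {
            System.out.println(event);
        }
    }
}
```

Nothing in this sketch builds a tree. The list holds only what the handler chose to keep, and a handler that kept nothing would use essentially constant memory regardless of document size.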

This chapter covers SAX2 exclusively. In 2004, all major parsers that support SAX also support SAX2. The major change in SAX2 from SAX1 is the addition of namespace support, which necessitated changing the names and signatures of almost every method and class in SAX. The old SAX1 methods and classes are still available, but they're now deprecated, and you shouldn't use them.
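As a rough illustration of what namespace support means in practice, the following sketch collects the namespace URI and local name that a SAX2 parser passes to startElement( ) for each element. The class NamespaceNames and the method expandedNames( ) are hypothetical names used only here. The feature URI http://xml.org/sax/features/namespaces is the standard SAX2 namespace feature; setting it explicitly is an assumption made for safety, since SAX2 parsers normally enable it by default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demo class; the name NamespaceNames is not part of SAX.
final class NamespaceNames {

    /** Returns each element name in the form {namespace-URI}local-name. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
    static List<String> expandedNames(String xml) throws SAXException, IOException {
        final List<String> names = new ArrayList<>();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        // Standard SAX2 feature; normally true already, set here to be explicit.
        parser.setFeature("http://xml.org/sax/features/namespaces", true);
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // uri is the empty string for elements in no namespace.
                names.add("{" + uri + "}" + localName);
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(expandedNames("<x:a xmlns:x='urn:demo'><b/></x:a>"));
    }
}
```

The SAX1 callbacks had no namespace URI or local name arguments at all, which is why adding namespace support required new method signatures rather than a small extension.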


SAX is primarily a collection of interfaces in the org.xml.sax package. One such interface is XMLReader. This interface represents the XML parser. It declares methods to parse a document and configure the parsing process, for instance, by turning validation on or off. To parse a document with SAX, first create an instance of XMLReader with the XMLReaderFactory class in the org.xml.sax.helpers package. This class has a static createXMLReader( ) factory method that produces the parser-specific implementation of the XMLReader interface. The Java system property org.xml.sax.driver specifies the concrete class to instantiate:

 try {
   XMLReader parser = XMLReaderFactory.createXMLReader( );
   // parse the document...
 } catch (SAXException ex) {
   // couldn't create the XMLReader
 }

The call to XMLReaderFactory.createXMLReader( ) is wrapped in a try-catch block that catches SAXException. This is the generic checked exception superclass for almost anything that can go wrong while parsing an XML document. In this case, it means either that the org.xml.sax.driver system property wasn't set, or that it was set to the name of a class that Java couldn't find in the class path.
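Configuring the parser, such as turning validation on or off, is done through the setFeature( ) method of XMLReader. The sketch below is one way this might look. The class ReaderConfig and the method newReader( ) are hypothetical names. The feature URI http://xml.org/sax/features/validation is the standard SAX2 validation feature, but not every parser supports every setting, in which case setFeature( ) throws a subclass of SAXException.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demo class; the name ReaderConfig is not part of SAX.
final class ReaderConfig {

    /** Standard SAX2 feature URI that controls validation. */
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Creates a parser with validation switched on or off. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
    static XMLReader newReader(boolean validate) throws SAXException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        // Throws SAXNotRecognizedException or SAXNotSupportedException
        // (both subclasses of SAXException) if the parser cannot comply.
        parser.setFeature(VALIDATION, validate);
        return parser;
    }

    public static void main(String[] args) throws Exception {
        XMLReader parser = newReader(true);
        System.out.println("validating: " + parser.getFeature(VALIDATION));
    }
}
```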

Do not use the SAXParserFactory and SAXParser classes included with JAXP. These classes were designed by Sun to fill a gap in SAX1. They are unnecessary and indeed actively harmful in SAX2. For instance, they are not namespace aware by default. SAX2 applications should use XMLReaderFactory and XMLReader instead.


You can choose which concrete class to instantiate by passing its name as a string to the createXMLReader() method. This code fragment instantiates the Xerces parser by name:

 try {
   XMLReader parser = XMLReaderFactory.createXMLReader(
    "org.apache.xerces.parsers.SAXParser");
   // parse the document...
 } catch (SAXException ex) {
   // couldn't create the XMLReader
 }
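Naming a concrete class ties the program to a parser that may not be installed. One possible way to soften that dependency, sketched here under the assumption that any SAX2 parser is acceptable as a substitute, is to fall back to the default parser when the named class can't be loaded. The class ReaderLookup and the method create( ) are hypothetical names for this illustration.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demo class; the name ReaderLookup is not part of SAX.
final class ReaderLookup {

    /** Tries the preferred parser class first, then the default parser. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
    static XMLReader create(String preferredClassName) throws SAXException {
        try {
            return XMLReaderFactory.createXMLReader(preferredClassName);
        } catch (SAXException ex) {
            // The named class could not be found or instantiated;
            // use whatever parser the factory locates on its own.
            return XMLReaderFactory.createXMLReader();
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader parser = create("org.apache.xerces.parsers.SAXParser");
        System.out.println("using " + parser.getClass().getName());
    }
}
```

If the fallback also fails, the second SAXException propagates to the caller, so the try-catch block shown above is still needed around the call to create( ).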

Now that you've created a parser, you're ready to parse some documents with it. Pass the system ID of the document you want to read to the parse( ) method. The system ID is either an absolute or a relative URL encoded in a string. For example, this code fragment parses the document at http://www.slashdot.org/slashdot.xml:

 try {
   XMLReader parser = XMLReaderFactory.createXMLReader( );
   parser.parse("http://www.slashdot.org/slashdot.xml");
 } catch (SAXParseException ex) {
   // Well-formedness error
 } catch (SAXException ex) {
   // Could not find an XMLReader implementation class
 } catch (IOException ex) {
   // Some sort of I/O error prevented the document from being completely
   // downloaded from the server
 }

The parse( ) method throws a SAXParseException if the document is malformed, an IOException if an I/O error such as a broken socket occurs while the document is being read, and a SAXException if anything else goes wrong. Otherwise, it returns void. To receive information from the parser as it reads the document, you must configure it with a ContentHandler.
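The following sketch puts these pieces together: it installs a ContentHandler with setContentHandler( ), parses a document, and distinguishes the three exceptions. The class ElementCounter and its count( ) and describe( ) methods are hypothetical names. Extending the standard DefaultHandler class is a convenience assumed here so that only one callback has to be written, and the document is read from a string rather than a URL so that the example runs offline.

```java
import java.io.IOException;
import java.io.StringReader;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demo class; the name ElementCounter is not part of SAX.
final class ElementCounter extends DefaultHandler {

    private int count;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        count++; // called once per start-tag (or empty-element tag)
    }

    /** Parses the XML text and returns the number of elements it contains. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
    static int count(String xml) throws SAXException, IOException {
        ElementCounter handler = new ElementCounter();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.count;
    }

    /** Summarizes the outcome of a parse, one message per kind of failure. */
    static String describe(String xml) {
        try {
            return count(xml) + " elements";
        } catch (SAXParseException ex) {
            // Must come before SAXException, its superclass.
            return "malformed at line " + ex.getLineNumber();
        } catch (SAXException ex) {
            return "SAX error: " + ex.getMessage();
        } catch (IOException ex) {
            return "I/O error: " + ex.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<r><i/><i/></r>"));
        System.out.println(describe("<r><i></r>"));
    }
}
```

Some parsers also print a message to System.err when they hit a well-formedness error and no error handler has been installed; the exception is thrown either way.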



XML in a Nutshell
XML in a Nutshell, Third Edition
ISBN: 0596007647
Year: 2003
Pages: 232
