Reading XML Documents Using the Simple API for XML (SAX) Parser

   

Java™ 2 Primer Plus
By Steven Haines, Steve Potts

Chapter 25.  XML


The main advantages of the SAX parser over the DOM include

  • It can parse documents of any size

  • Useful when you want to build your own data structures

  • Useful when you only want a small subset of information contained in the XML document

  • It is simple

  • It is fast

Its main disadvantages include

  • It does not provide random access to the document; it starts at the beginning and reads through serially to the end

  • Complex searches can be difficult to implement

  • Lexical information is not available

  • It is read-only

Reading an XML document with SAX requires two pieces: a SAX parser to read the document and a document handler to make meaningful use of the data the parser reads.

JAXP provides a SAX parser in its distribution: javax.xml.parsers.SAXParser; the SAX parser is made available to you through the SAX parser factory: javax.xml.parsers.SAXParserFactory. The SAXParserFactory class is an API for obtaining SAX-based parsers. Call its newSAXParser() method to obtain a preconfigured SAX parser (based on the settings you define in the factory). Note that both of these classes are abstract and are provided as base classes so that you, the developer, have a consistent programming interface irrespective of the underlying implementation.
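As a minimal sketch of the factory step described above, the following program configures a SAXParserFactory and asks it for a parser. The class name FactoryConfigSketch and the helper newParser() are illustrative names, not part of JAXP; the concrete parser class that is printed depends on whichever JAXP implementation is installed.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class FactoryConfigSketch {

    // Hypothetical helper: builds a parser from a factory configured with
    // the two settings discussed in this chapter.
    static SAXParser newParser(boolean namespaceAware, boolean validating)
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        factory.setValidating(validating);
        return factory.newSAXParser();
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = newParser(true, false);

        // The parser reports the settings it inherited from the factory.
        System.out.println("namespace aware: " + parser.isNamespaceAware());
        System.out.println("validating: " + parser.isValidating());

        // SAXParser is abstract; the concrete class varies by implementation.
        System.out.println("implementation: " + parser.getClass().getName());
    }
}
```

Because the settings are applied to the factory rather than to the parser, every parser obtained from the same factory shares them.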

Your job now is to create a document handler. A document handler is defined by the class org.xml.sax.helpers.DefaultHandler. It implements several interfaces, but the one that you will be most interested in is the org.xml.sax.ContentHandler interface; see Table 25.1.

Table 25.1. ContentHandler Interface

Method | Description
--- | ---
void characters(char[] ch, int start, int length) | Receive notification of character data
void endDocument() | Receive notification of the end of a document
void endElement(String namespaceURI, String localName, String qName) | Receive notification of the end of an element
void endPrefixMapping(String prefix) | End the scope of a prefix-URI mapping
void ignorableWhitespace(char[] ch, int start, int length) | Receive notification of ignorable whitespace in element content
void processingInstruction(String target, String data) | Receive notification of a processing instruction
void setDocumentLocator(Locator locator) | Receive an object for locating the origin of SAX document events
void skippedEntity(String name) | Receive notification of a skipped entity
void startDocument() | Receive notification of the beginning of a document
void startElement(String namespaceURI, String localName, String qName, Attributes atts) | Receive notification of the beginning of an element
void startPrefixMapping(String prefix, String uri) | Begin the scope of a prefix-URI Namespace mapping

Table 25.1 lists all the methods defined in the org.xml.sax.ContentHandler interface. The parser calls these methods on the class that implements this interface; for example, when the document starts, the startDocument() method is called, and when the <books> element is found, the startElement() method is called with a qName of books.
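To make the callback sequence concrete, here is a small sketch that records each notification as the parser delivers it. The class names EventTraceSketch and TraceHandler, the trace() helper, and the sample document are invented for illustration; the handler extends DefaultHandler so that only the callbacks of interest need to be written.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventTraceSketch {

    // Hypothetical handler: records each callback as one line of text.
    static class TraceHandler extends DefaultHandler {
        final StringBuilder log = new StringBuilder();

        @Override
        public void startDocument() {
            log.append("startDocument\n");
        }

        @Override
        public void endDocument() {
            log.append("endDocument\n");
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            log.append("startElement ").append(qName).append('\n');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            log.append("endElement ").append(qName).append('\n');
        }
    }

    // Parses the given XML text and returns the recorded event trace.
    static String trace(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TraceHandler handler = new TraceHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.log.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(trace("<books><book category=\"fiction\"/></books>"));
    }
}
```

For the sample document, the trace shows startDocument first, then the start and end notifications for books and book in nesting order, and endDocument last.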

The DefaultHandler class also implements the org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, and org.xml.sax.ErrorHandler interfaces to help the SAX parser resolve external entities and to handle errors. When creating a handler, it is best to extend DefaultHandler.
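The error-handling side of DefaultHandler can be sketched as follows. DefaultHandler's own warning() and error() methods do nothing, so a handler that wants to treat those conditions as failures has to override them. The names ErrorHandlingSketch, StrictHandler, and check() are hypothetical, and the wording of the returned message depends on the parser implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ErrorHandlingSketch {

    // Hypothetical handler: reports warnings and turns errors into exceptions.
    static class StrictHandler extends DefaultHandler {
        @Override
        public void warning(SAXParseException e) {
            System.err.println("warning at line " + e.getLineNumber()
                    + ": " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) throws SAXException {
            throw e; // treat recoverable errors (such as validity errors) as failures
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            throw e; // the document is not well formed; stop parsing
        }
    }

    // Returns null when the document parses cleanly; otherwise returns a
    // short description of the first problem that was reported.
    static String check(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new StrictHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<a><b>ok</b></a>")); // well formed
        System.out.println(check("<a><b></a>"));       // mismatched end tag
    }
}
```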

With that said, here are the steps to parsing and handling the content of an XML document using the SAX parser:

  1. Import the SAX classes, handler class, and parser classes into your Java program.

  2. Get an instance of the javax.xml.parsers.SAXParserFactory by calling its static method newInstance().

  3. Configure the SAXParserFactory's options (whether it is aware of namespaces and whether it is validating).

  4. Obtain a javax.xml.parsers.SAXParser by calling the SAXParserFactory's newSAXParser() method.

  5. Create an instance of your handler class, which extends org.xml.sax.helpers.DefaultHandler.

  6. Open a stream to your XML source.

  7. Ask the SAX parser to parse your XML stream by calling one of its parse() methods; see Table 25.2.

  8. Handle all the SAX parser notifications in your DefaultHandler class.
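The steps above can be sketched in one short program. This is only an outline under stated assumptions: ParseStepsSketch, CountingHandler, and countElements() are made-up names, the handler merely counts elements, and the XML comes from an in-memory stream rather than a file so that the example is self-contained.

```java
// Step 1: import the SAX, handler, and parser classes.
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ParseStepsSketch {

    // Step 8: a hypothetical handler that counts the elements it is told about.
    static class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }
    }

    static int countElements(InputStream xml) throws Exception {
        // Step 2: get a factory instance.
        SAXParserFactory factory = SAXParserFactory.newInstance();

        // Step 3: configure the factory's options.
        factory.setNamespaceAware(false);
        factory.setValidating(false);

        // Step 4: obtain a parser from the factory.
        SAXParser parser = factory.newSAXParser();

        // Step 5: create the handler.
        CountingHandler handler = new CountingHandler();

        // Step 7: parse; the handler is notified as the document is read.
        parser.parse(xml, handler);
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>T</title></book></books>";

        // Step 6: open a stream to the XML source.
        try (InputStream in =
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(countElements(in)); // prints 3
        }
    }
}
```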

Table 25.2. SAX Parser parse() Methods

Method | Description
--- | ---
void parse(java.io.File f, DefaultHandler dh) | Parse the content of the specified file as XML using the specified DefaultHandler.
void parse(org.xml.sax.InputSource is, DefaultHandler dh) | Parse the content of the given InputSource as XML using the specified DefaultHandler.
void parse(java.io.InputStream is, DefaultHandler dh) | Parse the content of the given InputStream instance as XML using the specified DefaultHandler.
void parse(java.io.InputStream is, DefaultHandler dh, String systemId) | Parse the content of the given InputStream instance as XML using the specified DefaultHandler; the systemId is used to resolve relative URIs.
void parse(String uri, DefaultHandler dh) | Parse the content described by the given Uniform Resource Identifier (URI) as XML using the specified DefaultHandler.

Table 25.2 shows the SAXParser's various parse() methods; the basic variation is that the source can be a file, a stream, a URI pointing to the file, or an InputSource (which can wrap an InputStream or a Reader; see the javadoc that accompanies JAXP for more information). When looking at the javadoc for the SAXParser class, you might notice references to the deprecated HandlerBase class in addition to the DefaultHandler class. HandlerBase is a remnant of the original SAX implementation; with the advent of SAX2, which is what we are studying, it has been deprecated in favor of DefaultHandler.
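A brief sketch of three of these variations follows: the same document is parsed from a File, from a URI string, and from an InputSource wrapping a Reader. The class and method names (ParseSourcesSketch, countFromFile(), and so on) and the temporary file are assumptions made so the example can run on its own.

```java
import java.io.File;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParseSourcesSketch {

    // Hypothetical handler: counts start-element notifications.
    static class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }
    }

    static SAXParser newParser() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser();
    }

    // parse(File, DefaultHandler)
    static int countFromFile(File file) throws Exception {
        CountingHandler handler = new CountingHandler();
        newParser().parse(file, handler);
        return handler.elements;
    }

    // parse(String uri, DefaultHandler)
    static int countFromUri(File file) throws Exception {
        CountingHandler handler = new CountingHandler();
        newParser().parse(file.toURI().toString(), handler);
        return handler.elements;
    }

    // parse(InputSource, DefaultHandler), with the InputSource wrapping a Reader
    static int countFromReader(File file) throws Exception {
        CountingHandler handler = new CountingHandler();
        try (Reader reader =
                Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
            newParser().parse(new InputSource(reader), handler);
        }
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        // Write a tiny document to a temporary file for the demonstration.
        File file = File.createTempFile("sample", ".xml");
        file.deleteOnExit();
        Files.write(file.toPath(), "<a><b/></a>".getBytes(StandardCharsets.UTF_8));

        System.out.println(countFromFile(file));   // prints 2
        System.out.println(countFromUri(file));    // prints 2
        System.out.println(countFromReader(file)); // prints 2
    }
}
```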

As an example, consider the aforementioned books.xml file in Listing 25.1. To read that XML file and do something meaningful with it, we need classes that represent an individual book and a collection of books. We will create two helper classes for this purpose: SAXBook and SAXBooks; see Figure 25.1.

Figure 25.1. SAXBook and SAXBooks class diagram.


Listings 25.3 and 25.4 show the code for the SAXBook and SAXBooks classes.

Listing 25.3 SAXBook.java
// A simple value class holding the data for one book.
public class SAXBook {
    private String title;
    private String author;
    private String category;
    private float price;

    public SAXBook() {
    }

    public SAXBook( String title,
                    String author,
                    String category,
                    float price ) {
        this.title = title;
        this.author = author;
        this.category = category;
        this.price = price;
    }

    public String getTitle() {
        return this.title;
    }

    public void setTitle( String title ) {
        this.title = title;
    }

    public String getAuthor() {
        return this.author;
    }

    public void setAuthor( String author ) {
        this.author = author;
    }

    public String getCategory() {
        return this.category;
    }

    public void setCategory( String category ) {
        this.category = category;
    }

    public float getPrice() {
        return this.price;
    }

    public void setPrice( float price ) {
        this.price = price;
    }

    // Returns the values of all four fields in one line.
    public String toString() {
        return "Book: " + title + ", " + category + ", " +
               author + ", " + price;
    }
}

Listing 25.3 has simple code to provide standard JavaBean-esque access to the four fields defined in the class: title, author, category, and price. It also overrides the toString() method to return the values contained in all the fields.

Listing 25.4 SAXBooks.java
import java.util.ArrayList;

// A collection of SAXBook objects, kept in the order they were added.
public class SAXBooks {
    private ArrayList bookList = new ArrayList();

    public SAXBooks() {
    }

    public void addBook( SAXBook book ) {
        this.bookList.add( book );
    }

    // Returns the book at the given index, or null if the index is out of
    // range (including the negative index produced by an empty list).
    public SAXBook getBook( int index ) {
        if( index < 0 || index >= bookList.size() ) {
            return null;
        }
        return ( SAXBook )bookList.get( index );
    }

    // Returns the most recently added book, or null if there are none.
    public SAXBook getLastBook() {
        return this.getBook( this.getBookSize() - 1 );
    }

    public int getBookSize() {
        return bookList.size();
    }
}

Listing 25.4 shows that the SAXBooks class maintains its collection of SAXBook objects in a java.util.ArrayList and provides methods to add a book, retrieve a book, and retrieve the total number of books in the ArrayList. There is one additional method, getLastBook(), that returns the last book in the ArrayList; the reason for that will become apparent in the MyHandler.java file in Listing 25.6. When you run this example be sure that the DTD file is in the same directory as the XML file or you may inadvertently receive a FileNotFoundException.

Listing 25.5 SAXSample.java
001:import java.io.*;
002:import org.xml.sax.*;
003:import org.xml.sax.helpers.DefaultHandler;
004:import javax.xml.parsers.SAXParserFactory;
005:import javax.xml.parsers.ParserConfigurationException;
006:import javax.xml.parsers.SAXParser;
007:
008:public class SAXSample {
009:    public static void main( String[] args ) {
010:        try {
011:            File file = new File( "book.xml" );
012:            if( !file.exists() ) {
013:                System.out.println( "Couldn't find file..." );
014:                return;
015:            }
016:
017:            // Use the default (non-validating) parser
018:            SAXParserFactory factory = SAXParserFactory.newInstance();
019:
020:            // Create an instance of our handler
021:            MyHandler handler = new MyHandler();
022:
023:            // Parse the file
024:            SAXParser saxParser = factory.newSAXParser();
025:            saxParser.parse( file, handler );
026:            SAXBooks books = handler.getBooks();
027:
028:            for( int i=0; i<books.getBookSize(); i++ ) {
029:                SAXBook book = books.getBook( i );
030:                System.out.println( book );
031:            }
032:
033:        }
034:        catch( Throwable t ) {
035:            t.printStackTrace();
036:        }
037:    }
038:}

Listing 25.5 shows the code for the SAXSample class; this is the main class that opens our XML file, creates a parser, and asks the parser to notify our handler.

Lines 11-15 create a java.io.File object that points to the book.xml file, which is expected to be in the directory from which the program is launched. The code verifies that the file exists and exits the program if it does not.

 018:            SAXParserFactory factory = SAXParserFactory.newInstance(); 

Line 18 creates a new instance of the SAXParserFactory class by calling the SAXParserFactory class's static newInstance() method; recall that the SAXParserFactory is responsible for configuring a SAXParser and returning it upon request.

 021:            MyHandler handler = new MyHandler(); 

Line 21 creates an instance of the MyHandler class that will be described in Listing 25.6.

 024:            SAXParser saxParser = factory.newSAXParser(); 

Line 24 asks the SAXParserFactory to create a new SAXParser by calling its newSAXParser() method.

 025:            saxParser.parse( file, handler ); 

Line 25 uses the SAXParser to parse the book.xml file and provide notifications to the MyHandler instance.

Lines 26-31 retrieve the SAXBooks from the MyHandler instance, and then iterate over all the SAXBook instances it contains, displaying the books to the screen (passing the SAXBook instance to System.out.println invokes the toString() that was overridden in Listing 25.3).

Listing 25.6 MyHandler.java
001:import org.xml.sax.*;
002:import org.xml.sax.helpers.DefaultHandler;
003:
004:public class MyHandler extends DefaultHandler {
005:    private SAXBooks books;
006:    private boolean readingAuthor;
007:    private boolean readingTitle;
008:    private boolean readingPrice;
009:
010:    public SAXBooks getBooks() {
011:        return this.books;
012:    }
013:
014:    public void startElement( String uri,
015:                              String localName,
016:                              String qName,
017:                              Attributes attributes ) {
018:        System.out.println( "Found element: " + qName );
019:        if( qName.equalsIgnoreCase( "books" ) ) {
020:            books = new SAXBooks();
021:        }
022:        else if( qName.equalsIgnoreCase( "book" ) ) {
023:            SAXBook book = new SAXBook();
024:            for( int i=0; i<attributes.getLength(); i++ ) {
025:                if( attributes.getQName( i ).equalsIgnoreCase( "category" ) ) {
026:                    book.setCategory( attributes.getValue( i ) );
027:                }
028:            }
029:            books.addBook( book );
030:        }
031:        else if( qName.equalsIgnoreCase( "author" ) ) {
032:            this.readingAuthor = true;
033:        }
034:        else if( qName.equalsIgnoreCase( "title" ) ) {
035:            this.readingTitle = true;
036:        }
037:        else if( qName.equalsIgnoreCase( "price" ) ) {
038:            this.readingPrice = true;
039:        }
040:        else {
041:            System.out.println( "Unknown element: " + qName );
042:        }
043:    }
044:
045:    public void startDocument() {
046:        System.out.println( "Starting..." );
047:    }
048:
049:    public void endDocument() {
050:        System.out.println( "Done..." );
051:    }
052:
053:    public void characters( char[] ch,
054:                            int start,
055:                            int length ) {
056:        String chars = new String( ch, start, length).trim();
057:        if( chars.length() == 0 ) {
058:            return;
059:        }
060:
061:        SAXBook book = books.getLastBook();
062:        if( readingAuthor ) {
063:            book.setAuthor( chars );
064:        }
065:        else if( readingTitle ) {
066:            book.setTitle( chars );
067:        }
068:        else if( readingPrice ) {
069:            book.setPrice( Float.parseFloat( chars ) );
070:        }
071:    }
072:
073:    public void endElement( String uri,
074:                            String localName,
075:                            String qName ) {
076:        System.out.println( "End Element: " + qName );
077:        if( qName.equalsIgnoreCase( "author" ) ) {
078:            this.readingAuthor = false;
079:        }
080:        else if( qName.equalsIgnoreCase( "title" ) ) {
081:            this.readingTitle = false;
082:        }
083:        else if( qName.equalsIgnoreCase( "price" ) ) {
084:            this.readingPrice = false;
085:        }
086:    }
087:}

Listing 25.6 defines the MyHandler class; it is the most complicated class in the sample and is responsible for handling all the SAXParser notifications and building our data structures from those notifications.

 004:public class MyHandler extends DefaultHandler { 

Line 4 shows that the MyHandler class extends the org.xml.sax.helpers.DefaultHandler class; this class has methods that can be overridden to respond to SAXParser notifications. The notifications that we will be interested in are

  • startDocument()

  • endDocument()

  • startElement()

  • endElement()

  • characters()

The startDocument() and endDocument() methods in the MyHandler class simply print out debug statements; all the real work happens in the startElement(), endElement(), and characters() methods. Because the SAX model delivers a series of separate notifications, the handler cannot process the document in one sequential pass of its own; it must instead run as a state machine that remembers where it is between calls.

The order of events in the SAX model is that the handler will receive a startElement() notification for the book element, at which time we can get its attributes (category), then a startElement() for the author element, then a characters() call containing the text for the author element, and then an endElement() for author. The process continues through title and price, and finally we will get an endElement() call on book; at this point we have a complete book.

When the MyHandler class gets a startElement() call for book, it creates a new book, retrieves its category attribute, and adds it to the SAXBooks class (lines 19-30).

The MyHandler class maintains three boolean variables to keep track of which element it is reading (lines 6-8). Invocations of startElement() for author, title, and price set these variables to note which element a subsequent characters() call applies to (lines 31-42); in a larger example you might instead use a single integer variable with a set of constant states. When the endElement() method is called for each of these elements, the handler resets the appropriate variable (lines 77-85).
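The single-variable alternative mentioned above might look like the following sketch. It is not the chapter's MyHandler: the names StateVariableSketch, FieldHandler, and readFields() are invented, and the handler simply collects name=value strings instead of filling SAXBook objects so that the example stands alone. Like Listing 25.6, it assumes that each element's text arrives in a single characters() call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StateVariableSketch {

    // Hypothetical handler: one integer state replaces three boolean flags.
    static class FieldHandler extends DefaultHandler {
        private static final int NONE = 0;
        private static final int AUTHOR = 1;
        private static final int TITLE = 2;
        private static final int PRICE = 3;

        private int state = NONE;
        final List<String> fields = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("author".equals(qName)) {
                state = AUTHOR;
            } else if ("title".equals(qName)) {
                state = TITLE;
            } else if ("price".equals(qName)) {
                state = PRICE;
            } else {
                state = NONE;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            state = NONE; // leaving any element clears the current state
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length).trim();
            if (text.isEmpty()) {
                return; // whitespace between elements
            }
            switch (state) {
                case AUTHOR: fields.add("author=" + text); break;
                case TITLE:  fields.add("title=" + text);  break;
                case PRICE:  fields.add("price=" + text);  break;
                default:     break; // text outside the elements of interest
            }
        }
    }

    static List<String> readFields(String xml) throws Exception {
        FieldHandler handler = new FieldHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book category=\"x\"><author>A</author>"
                + "<title>T</title><price>9.5</price></book></books>";
        System.out.println(readFields(xml));
    }
}
```

With one state variable, the handler can never believe it is inside two elements at once, which is a possibility the separate boolean flags leave open.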

Lines 53-71 define the characters() method. This method retrieves the last book added (which is the reason why the SAXBooks class has the getLastBook() method), which was added in the startElement() for the book element and is the book for which this characters() method is applicable. The characters() method has three parameters: a character array supplied by the parser, and the start index and length that mark the portion of that array holding the current character data. The array is the parser's own buffer and may contain more than this element's text, so the handler should read only the indicated range. The method builds a new String from this subsection of the character array, trims off white space, and then ensures that it has some characters. The reason we check the length is that elements such as book, which contain other elements, still might have some white space between the end of the <book> tag and the beginning of the <author> tag, and so on. After it verifies that it has data, it checks the state variables defined earlier, accesses the last book added to the SAXBooks instance, and updates the appropriate book property (author, title, or price).
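One caveat is worth a sketch. The SAX specification allows a parser to deliver an element's text in several characters() calls rather than one, so a handler that stores each chunk directly, as Listing 25.6 does, can end up keeping only the last piece. A common alternative, shown here under hypothetical names (TextBufferSketch, TitleHandler, readTitles()), is to accumulate the text in a buffer and use it when endElement() arrives.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextBufferSketch {

    // Hypothetical handler: collects the full text of every title element.
    static class TitleHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        final List<String> titles = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            text.setLength(0); // start a fresh buffer for each element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Append only the indicated range; more chunks may follow.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(text.toString().trim()); // the text is now complete
            }
        }
    }

    static List<String> readTitles(String xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        // Text containing an entity reference is a typical case in which a
        // parser may split the character data into more than one chunk.
        String xml = "<books><book><title>Tom &amp; Jerry</title></book></books>";
        System.out.println(readTitles(xml)); // prints [Tom & Jerry]
    }
}
```

The same buffering approach would apply to the author and price elements in MyHandler.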

It took a fair amount of work, but the end result is that the MyHandler instance has a complete SAXBooks property that maintains an in-memory representation of the XML file that can now be used elsewhere in the application.


       
    Java 2 Primer Plus
    ISBN: 0672324156
    EAN: 2147483647
    Year: 2001
    Pages: 332
