XML Parser APIs | Professional XML Development with Apache Tools: Xerces, Xalan, FOP, Cocoon, Axis, Xindice (Wrox Professional Guides)

Now that we’ve finished the XML refresher, let’s take a quick trip through the two major parser APIs: SAX and DOM. A third parser API, the Streaming API for XML (StAX), is currently making its way through the Java Community Process (JCP).

A parser API makes the various parts of an XML document available to your application. You’ll be seeing the SAX and DOM APIs in most of the other Apache XML tools, so it’s worth a brief review to make sure you’ll be comfortable during the rest of the book.

Let's look at a simple application to illustrate the use of the parser APIs. The application uses a parser API to parse the XML book description and turn it into a JavaBean that represents a book. This book object is a domain object in an application you’re building. The file Book.java contains the Java code for the Book JavaBean. This is a straightforward JavaBean that contains the fields needed for a book, along with getter and setter methods and a toString method:

  1: /*
  2:  *
  3:  * Book.java
  4:  *
  5:  * Example from "Professional XML Development with Apache Tools"
  6:  *
  7:  */
  8: package com.sauria.apachexml.ch1;
  9:
 10: public class Book {
 11:     String title;
 12:     String author;
 13:     String isbn;
 14:     String month;
 15:     int year;
 16:     String publisher;
 17:     String address;
 18:
 19:     public String getAddress() {
 20:         return address;
 21:     }
 22:
 23:     public String getAuthor() {
 24:         return author;
 25:     }
 26:
 27:     public String getIsbn() {
 28:         return isbn;
 29:     }
 30:
 31:     public String getMonth() {
 32:         return month;
 33:     }
 34:
 35:     public String getPublisher() {
 36:         return publisher;
 37:     }
 38:
 39:     public String getTitle() {
 40:         return title;
 41:     }
 42:
 43:     public int getYear() {
 44:         return year;
 45:     }
 46:
 47:     public void setAddress(String string) {
 48:         address = string;
 49:     }
 50:
 51:     public void setAuthor(String string) {
 52:         author = string;
 53:     }
 54:
 55:     public void setIsbn(String string) {
 56:         isbn = string;
 57:     }
 58:
 59:     public void setMonth(String string) {
 60:         month = string;
 61:     }
 62:
 63:     public void setPublisher(String string) {
 64:         publisher = string;
 65:     }
 66:
 67:     public void setTitle(String string) {
 68:         title = string;
 69:     }
 70:
 71:     public void setYear(int i) {
 72:         year = i;
 73:     }
 74:
 75:     public String toString() {
 76:         return title + " by " + author;
 77:     }
 78: }

SAX

Now that you have a JavaBean for Books, you can turn to the task of parsing XML that uses the book vocabulary. The SAX API is event driven. As Xerces parses an XML document, it calls methods on one or more event-handler classes that you provide. The following listing, SAXMain.java, shows a typical way to use SAX to parse a document. After importing all the necessary classes in lines 10-14, you create a new XMLReader instance in line 19 by instantiating Xerces’ SAXParser class. You then instantiate a BookHandler (line 20) and register it as both the XMLReader’s ContentHandler and its ErrorHandler (lines 21-22). You can do this because BookHandler implements both the ContentHandler and ErrorHandler interfaces. Once you’ve set up the callbacks, you’re ready to call the parser, which you do in line 24. The BookHandler’s callback methods build an instance of Book that contains the information from the XML document. You obtain this Book instance by calling the getBook method on the bookHandler instance, and then you print a human-readable representation of the Book using toString (line 25).

  1: /*
  2:  *
  3:  * SAXMain.java
  4:  *
  5:  * Example from "Professional XML Development with Apache Tools"
  6:  *
  7:  */
  8: package com.sauria.apachexml.ch1;
  9:
 10: import java.io.IOException;
 11:
 12: import org.apache.xerces.parsers.SAXParser;
 13: import org.xml.sax.SAXException;
 14: import org.xml.sax.XMLReader;
 15:
 16: public class SAXMain {
 17:
 18:     public static void main(String[] args) {
 19:         XMLReader r = new SAXParser();
 20:         BookHandler bookHandler = new BookHandler();
 21:         r.setContentHandler(bookHandler);
 22:         r.setErrorHandler(bookHandler);
 23:         try {
 24:             r.parse(args[0]);
 25:             System.out.println(bookHandler.getBook().toString());
 26:         } catch (SAXException se) {
 27:             System.out.println("SAX Error during parsing " +
 28:                 se.getMessage());
 29:             se.printStackTrace();
 30:         } catch (IOException ioe) {
 31:             System.out.println("I/O Error during parsing " +
 32:                 ioe.getMessage());
 33:             ioe.printStackTrace();
 34:         } catch (Exception e) {
 35:             System.out.println("Error during parsing " +
 36:                 e.getMessage());
 37:             e.printStackTrace();
 38:         }
 39:     }
 40: }
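To make the event-driven model concrete, here is a minimal sketch that records the order in which SAX delivers element events. It is not part of the book’s example: the class name SaxEventTrace and the in-memory sample document are assumptions for illustration, and it obtains its XMLReader through the JDK’s JAXP SAXParserFactory rather than by instantiating Xerces’ SAXParser directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not from the book: traces SAX element events.
public class SaxEventTrace {

    /** Parses the XML text and returns the element events in delivery order. */
    public static List<String> trace(String xml) throws Exception {
        final List<String> events = new ArrayList<>();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        // JAXP factories are not namespace aware unless asked to be.
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // The parser calls back into this handler as it reads the document.
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                events.add("start:" + localName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + localName);
            }
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        // Sample document is an assumption; any well-formed XML works.
        String xml = "<book><title>T</title><year>2003</year></book>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

The application never asks the parser for data; it only reacts to callbacks, which is why the handler has to keep whatever state it needs between events.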

The real work in a SAX-based application is done by the event handlers, so let’s turn our attention to the BookHandler class and see what’s going on. The following BookHandler class extends SAX’s DefaultHandler class, for two reasons. First, DefaultHandler implements the four core SAX callback interfaces (ContentHandler, ErrorHandler, DTDHandler, and EntityResolver), so you save the effort of writing all the implements clauses. Second, because DefaultHandler is a class that supplies default implementations, your code doesn’t have to implement every method in every callback interface. Instead, you override only the methods you’re interested in, shortening the class overall.

  1: /*
  2:  *
  3:  * BookHandler.java
  4:  *
  5:  * Example from "Professional XML Development with Apache Tools"
  6:  *
  7:  */
  8: package com.sauria.apachexml.ch1;
  9:
 10: import java.util.Stack;
 11:
 12: import org.xml.sax.Attributes;
 13: import org.xml.sax.SAXException;
 14: import org.xml.sax.SAXParseException;
 15: import org.xml.sax.helpers.DefaultHandler;
 16:
 17: public class BookHandler extends DefaultHandler {
 18:     private Stack elementStack = new Stack();
 19:     private Stack textStack = new Stack();
 20:     private StringBuffer currentText = null;
 21:     private Book book = null;
 22:
 23:     public Book getBook() {
 24:         return book;
 25:     }
 26:

We’ll start by looking at the methods you need from the ContentHandler interface. Almost all ContentHandlers need to manage a stack of elements and a stack of text, and the reason is simple: SAX reports events one at a time, so the handler must track the current level of nesting itself. The stack of elements records where you are in the document. You also need to keep track of any character data you’ve seen, organized by the level where you saw it, so you need a second stack for the text. These stacks, as well as a StringBuffer for accumulating text and an instance of Book, are declared in lines 18-21. The accessor for the book instance appears in lines 23-25.

The ContentHandler callback methods use the two stacks to create a Book instance and call the appropriate setter methods on the Book. The methods you’re using from ContentHandler are startElement, endElement, and characters. Each callback method is passed arguments containing the data associated with the event. For example, the startElement method is passed the namespace URI, the local part, and the QName of the element being processed. It’s also passed the attributes for that element:

 27:     public void startElement(
 28:         String uri,
 29:         String localPart,
 30:         String qName,
 31:         Attributes attributes)
 32:         throws SAXException {
 33:         currentText = new StringBuffer();
 34:         textStack.push(currentText);
 35:         elementStack.push(localPart);
 36:         if (localPart.equals("book")) {
 37:             String version = attributes.getValue("", "version");
 38:             if (version != null && !version.equals("1.0"))
 39:                 throw new SAXException("Incorrect book version");
 40:             book = new Book();
 41:         }
 42:     }

The startElement callback basically sets things up for new data to be collected each time it sees a new element. It creates a new currentText StringBuffer for collecting this element’s text content and pushes it onto the textStack. It also pushes the element’s name on the elementStack for placekeeping. This method must also do some processing of the attributes attached to the element, because the attributes aren’t available to the endElement callback. In this case, startElement verifies that you’re processing a version of the book schema that you understand (1.0).
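The attribute check can be tried in isolation. The following is a minimal sketch under a few assumptions: the class VersionCheckDemo, the method readVersion, and the one-element sample documents are hypothetical, and the parser comes from the JDK’s JAXP SAXParserFactory rather than from Xerces directly. It applies the same rule as the listing above: a missing version attribute is accepted, while any value other than 1.0 is rejected with a SAXException.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the book.
public class VersionCheckDemo {

    /** Remembers the version attribute seen on the book element. */
    static class VersionHandler extends DefaultHandler {
        String version;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes)
                throws SAXException {
            if (localName.equals("book")) {
                // An unprefixed attribute is in no namespace, hence the "" URI.
                version = attributes.getValue("", "version");
                if (version != null && !version.equals("1.0")) {
                    throw new SAXException("Incorrect book version: " + version);
                }
            }
        }
    }

    /** Returns the version attribute, or null if the document has none. */
    public static String readVersion(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for getValue(uri, localName)
        XMLReader reader = factory.newSAXParser().getXMLReader();
        VersionHandler handler = new VersionHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.version;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readVersion("<book version=\"1.0\"/>"));
        try {
            readVersion("<book version=\"2.0\"/>");
        } catch (SAXException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Throwing from a callback aborts the parse, and the exception surfaces from the call to parse, so the caller handles an unsupported version the same way it handles any other parsing failure.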

You can’t do most of the work until you’ve encountered the end tag for an element. At that point, you have seen all of the element’s child elements and all of its text content. The following endElement callback does the real heavy lifting. First, it pops the top off the textStack, which contains the text content for the element it’s processing. Depending on the name of the element being processed, endElement calls the appropriate setter on the Book instance to fill in the correct field. In the case of the year, it converts the String into an integer before calling the setter method. After all this, endElement pops the elementStack to make sure you keep your place.

 43:
 44:     public void endElement(String uri, String localPart,
 45:         String qName)
 46:         throws SAXException {
 47:         String text = textStack.pop().toString();
 48:         if (localPart.equals("book")) {
 49:         } else if (localPart.equals("title")) {
 50:             book.setTitle(text);
 51:         } else if (localPart.equals("author")) {
 52:             book.setAuthor(text);
 53:         } else if (localPart.equals("isbn")) {
 54:             book.setIsbn(text);
 55:         } else if (localPart.equals("month")) {
 56:             book.setMonth(text);
 57:         } else if (localPart.equals("year")) {
 58:             int year;
 59:             try {
 60:                 year = Integer.parseInt(text);
 61:             } catch (NumberFormatException e) {
 62:                 throw new SAXException("year must be a number");
 63:             }
 64:             book.setYear(year);
 65:         } else if (localPart.equals("publisher")) {
 66:             book.setPublisher(text);
 67:         } else if (localPart.equals("address")) {
 68:             book.setAddress(text);
 69:         } else {
 70:             throw new SAXException("Unknown element for book");
 71:         }
 72:         elementStack.pop();
 73:     }
 74:

The characters callback is called every time the parser encounters a piece of text content. SAX says that characters may be called more than once inside a startElement/endElement pair, so the implementation of characters appends each new piece of text to the currentText StringBuffer. This ensures that you collect all the text for an element:

 75:     public void characters(char[] ch, int start, int length)
 76:         throws SAXException {
 77:         currentText.append(ch, start, length);
 78:     }
 79:
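The interplay of the text stack and the characters callback can be seen in a small standalone sketch. Everything named here (TextStackDemo, collect, the sample document) is an assumption for illustration. It also varies the book’s approach slightly: characters appends to whichever buffer is on top of the stack instead of to a separate currentText field, so text that follows a child element still lands in the parent’s buffer.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the book.
public class TextStackDemo {

    /** Returns the text content collected for each element, keyed by local name. */
    public static Map<String, String> collect(String xml) throws Exception {
        final Map<String, String> textByElement = new LinkedHashMap<>();
        final Deque<StringBuilder> textStack = new ArrayDeque<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // One fresh buffer per open element.
                textStack.push(new StringBuilder());
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so always append.
                textStack.peek().append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // The element is complete, so its buffer is complete too.
                textByElement.put(localName, textStack.pop().toString());
            }
        };

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return textByElement;
    }

    public static void main(String[] args) throws Exception {
        // An entity reference is a common reason for text to arrive in pieces.
        String xml = "<book><title>Tools &amp; Techniques</title>"
                   + "<year>2003</year></book>";
        System.out.println(collect(xml));
    }
}
```

Because the handler only ever appends, it does not matter how the parser chooses to split the text across characters calls; the buffer holds the complete content by the time endElement runs.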

The remainder of BookHandler implements the three public methods of the ErrorHandler callback interface, which controls how errors are reported by the application. In this case, you’re just printing an extended error message to System.err. The warning, error, and fatalError methods use a shared private method, getLocationString, to process the contents of a SAXParseException, which is where they obtain position information about the location of the error. Note that fatalError also rethrows the exception, which stops the parse:

 80:     public void warning(SAXParseException ex) throws SAXException {
 81:         System.err.println(
 82:             "[Warning] " + getLocationString(ex) + ": " +
 83:             ex.getMessage());
 84:     }
 85:
 86:     public void error(SAXParseException ex) throws SAXException {
 87:         System.err.println(
 88:             "[Error] " + getLocationString(ex) + ": " +
 89:             ex.getMessage());
 90:     }
 91:
 92:     public void fatalError(SAXParseException ex)
 93:         throws SAXException {
 94:         System.err.println(
 95:             "[Fatal Error] " + getLocationString(ex) + ": " +
 96:             ex.getMessage());
 97:         throw ex;
 98:     }
 99:
100:     /** Returns a string of the location. */
101:     private String getLocationString(SAXParseException ex) {
102:         StringBuffer str = new StringBuffer();
103:
104:         String systemId = ex.getSystemId();
105:         if (systemId != null) {
106:             int index = systemId.lastIndexOf('/');
107:             if (index != -1)
108:                 systemId = systemId.substring(index + 1);
109:             str.append(systemId);
110:         }
111:         str.append(':');
112:         str.append(ex.getLineNumber());
113:         str.append(':');
114:         str.append(ex.getColumnNumber());
115:
116:         return str.toString();
117:
118:     }
119:
120: }
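To see the ErrorHandler callbacks fire, you can feed the parser a document that is not well formed. The sketch below is an illustration only: ErrorReportDemo, RecordingHandler, and parseAndReport are hypothetical names, the parser is obtained through the JDK’s JAXP factory, and the handler records its reports in a list instead of printing them so that the result is easy to inspect.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the book.
public class ErrorReportDemo {

    /** Records each report as "[Level] line:column: message". */
    static class RecordingHandler extends DefaultHandler {
        final List<String> reports = new ArrayList<>();

        private void record(String level, SAXParseException ex) {
            reports.add("[" + level + "] " + ex.getLineNumber() + ":"
                    + ex.getColumnNumber() + ": " + ex.getMessage());
        }

        @Override
        public void warning(SAXParseException ex) {
            record("Warning", ex);
        }

        @Override
        public void error(SAXParseException ex) {
            record("Error", ex);
        }

        @Override
        public void fatalError(SAXParseException ex) throws SAXException {
            record("Fatal Error", ex);
            throw ex; // the document is not well formed, so stop parsing
        }
    }

    /** Parses the text and returns whatever the error handler recorded. */
    public static List<String> parseAndReport(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException alreadyRecorded) {
            // Rethrown by fatalError above; the report is in the list.
        }
        return handler.reports;
    }

    public static void main(String[] args) throws Exception {
        // The title element is never closed, which is a well-formedness error.
        for (String report : parseAndReport("<book><title>T</book>")) {
            System.out.println(report);
        }
    }
}
```

The exact wording of the message depends on the parser implementation, so code that reacts to errors should rely on the callback that was invoked and on the location, not on the message text.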

DOM

Let’s look at how you can accomplish the same task using the DOM API. The DOM API is a tree-based API. The parser provides the application with a tree-structured object graph, which the application can then traverse to extract the data from the parsed XML document. This process is more convenient than using SAX, but you pay a price in performance because the parser creates a DOM tree whether you’re going to use it or not. If you’re using XML to represent data in an application, the DOM tends to be inefficient because you have to get the data you need out of the DOM tree; after that you have no use for the DOM tree, even though the parser spent time and memory to construct it. We’re going to reuse the class Book (in Book.java) for this example.

After importing all the necessary classes in lines 10-17, you declare a String constant whose value is the namespace URI for the book schema (lines 20-21):

  1: /*
  2:  *
  3:  * DOMMain.java
  4:  *
  5:  * Example from "Professional XML Development with Apache Tools"
  6:  *
  7:  */
  8: package com.sauria.apachexml.ch1;
  9:
 10: import java.io.IOException;
 11:
 12: import org.apache.xerces.parsers.DOMParser;
 13: import org.w3c.dom.Document;
 14: import org.w3c.dom.Element;
 15: import org.w3c.dom.Node;
 16: import org.w3c.dom.NodeList;
 17: import org.xml.sax.SAXException;
 18:
 19: public class DOMMain {
 20:     static final String bookNS =
 21:         "http://sauria.com/schemas/apache-xml-book/book";
 22:

In line 24 you create a new DOMParser. Next you ask it to parse the document (line 27). At this point the parser has produced the DOM tree, and you need to obtain it and traverse it to extract the data you need to create a Book object (lines 28-29):

 23:     public static void main(String args[]) {
 24:         DOMParser p = new DOMParser();
 25:
 26:         try {
 27:             p.parse(args[0]);
 28:             Document d = p.getDocument();
 29:             System.out.println(dom2Book(d).toString());
 30:
 31:         } catch (SAXException se) {
 32:             System.out.println("Error during parsing " +
 33:             se.getMessage());
 34:             se.printStackTrace();
 35:         } catch (IOException ioe) {
 36:             System.out.println("I/O Error during parsing " +
 37:             ioe.getMessage());
 38:             ioe.printStackTrace();
 39:         }
 40:     }
 41:

The dom2Book method creates the Book object:

 42:     private static Book dom2Book(Document d) throws SAXException {
 43:         NodeList nl = d.getElementsByTagNameNS(bookNS, "book");
 44:         Element bookElt = null;
 45:         Book book = null;
 46:         try {
 47:             if (nl.getLength() > 0) {
 48:                 bookElt = (Element) nl.item(0);
 49:                 book = new Book();
 50:             } else
 51:                 throw new SAXException("No book element found");
 52:         } catch (ClassCastException cce) {
 53:             throw new SAXException("No book element found");
 54:         }
 55:

In lines 43-54, you use the namespace-aware method getElementsByTagNameNS (as opposed to the non-namespace-aware getElementsByTagName) to find the root book element in the XML file. You check the resulting NodeList to make sure a book element was found before constructing a new Book instance.
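The difference between the two lookup methods shows up as soon as a document uses a namespace prefix. The following sketch is an illustration under stated assumptions: the class NamespaceLookupDemo, its helper methods, and the prefix b in the sample document are hypothetical, and the document is parsed with the JDK’s JAXP DocumentBuilderFactory rather than with Xerces’ DOMParser. The namespace URI is the one declared in DOMMain.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not from the book.
public class NamespaceLookupDemo {
    static final String BOOK_NS =
        "http://sauria.com/schemas/apache-xml-book/book";

    /** Parses XML text into a DOM tree with namespace processing turned on. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // JAXP factories are not namespace aware unless asked to be.
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    /** Counts book elements by namespace URI and local name. */
    public static int countByNamespace(Document d) {
        return d.getElementsByTagNameNS(BOOK_NS, "book").getLength();
    }

    /** Counts book elements by tag name, which includes any prefix. */
    public static int countByTagName(Document d) {
        return d.getElementsByTagName("book").getLength();
    }

    public static void main(String[] args) throws Exception {
        // The prefix "b" is arbitrary; only the URI identifies the vocabulary.
        String xml = "<b:book xmlns:b='" + BOOK_NS + "'>"
                   + "<b:title>T</b:title></b:book>";
        Document d = parse(xml);
        System.out.println("By namespace: " + countByNamespace(d));
        System.out.println("By tag name:  " + countByTagName(d));
    }
}
```

The tag name of the root element in this sample is b:book, so a lookup for the plain name book misses it, whereas the namespace-aware lookup matches regardless of which prefix the document author chose.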

Once you have the book element, you iterate through all of its children. The element nodes among them correspond to the child elements of the book element in the XML document; nodes of any other type, such as whitespace text nodes, are skipped. As you encounter each child element node, you need to get the text content for that element and call the appropriate Book setter. In the DOM, getting the text content for an element node is a little laborious. If an element node has text content, the element node has one or more children that are text nodes. The DOM provides a method called normalize that merges adjacent text nodes into a single text node (normalize also removes empty text nodes). Each time you process one of the children of the book element, you call normalize and then store the value of the element’s first child node in the String text. Note that this code assumes every child element has some text content: an empty element has no first child, so getFirstChild returns null. Then you compare the tag name of the element you’re processing and call the appropriate setter method. As with SAX, you have to convert the text to an integer for the Book’s year field:

 56:         for (Node child = bookElt.getFirstChild();
 57:             child != null;
 58:             child = child.getNextSibling()) {
 59:             if (child.getNodeType() != Node.ELEMENT_NODE)
 60:                 continue;
 61:             Element e = (Element) child;
 62:             e.normalize();
 63:             String text = e.getFirstChild().getNodeValue();
 64:
 65:             if (e.getTagName().equals("title")) {
 66:                 book.setTitle(text);
 67:             } else if (e.getTagName().equals("author")) {
 68:                 book.setAuthor(text);
 69:             } else if (e.getTagName().equals("isbn")) {
 70:                 book.setIsbn(text);
 71:             } else if (e.getTagName().equals("month")) {
 72:                 book.setMonth(text);
 73:             } else if (e.getTagName().equals("year")) {
 74:                 int y = 0;
 75:                 try {
 76:                     y = Integer.parseInt(text);
 77:                 } catch (NumberFormatException nfe) {
 78:                     throw new SAXException("Year must be a number");
 79:                 }
 80:                 book.setYear(y);
 81:             } else if (e.getTagName().equals("publisher")) {
 82:                 book.setPublisher(text);
 83:             } else if (e.getTagName().equals("address")) {
 84:                 book.setAddress(text);
 85:             }
 86:         }
 87:         return book;
 88:     }
 89: }
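As an alternative to the normalize-and-first-child idiom, DOM Level 3 added Node.getTextContent, which concatenates the text of all descendant nodes and returns an empty string for an empty element. The sketch below swaps in that method and getLocalName; it is not the listing’s approach. The names ElementTextDemo and childText and the sample document are assumptions, and the tree is built with the JDK’s JAXP DocumentBuilderFactory.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not from the book.
public class ElementTextDemo {

    /** Maps each child element of the root to its text content. */
    public static Map<String, String> childText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // so getLocalName returns a value
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));

        Map<String, String> result = new LinkedHashMap<>();
        for (Node child = doc.getDocumentElement().getFirstChild();
             child != null;
             child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text, comments, and so on
            }
            // getTextContent joins text and CDATA children and yields ""
            // for an element that has no children at all.
            result.put(child.getLocalName(), child.getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // The CDATA section becomes a separate node next to the text nodes.
        String xml = "<book><title>A<![CDATA[ & ]]>B</title>"
                   + "<address/></book>";
        System.out.println(childText(xml));
    }
}
```

Comparing on the local name rather than the tag name also keeps the lookup working when the document uses a namespace prefix, at the cost of requiring a namespace-aware parse.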

This concludes our review of the SAX and DOM APIs. Now we’re ready to go into the depths of Xerces.