Chapter 6. Programming SAX | JSP and XML: Integrating XML and Web Services in Your JSP Application


IN THIS CHAPTER

What Is SAX?
The Workings of SAX
Summary

Now that we've covered DOM and the easier-to-visualize model of processing XML files, let's go in a different direction. This chapter has only one goal: to teach you what SAX is and how to use it to parse an XML file. As a result, the SAX coverage will be intense as we explore the richness of the features found in SAX.

After a walk-through introduction to SAX, we will dive into a non-validating parser example. This example will serve as a baseline and we will spend the rest of the chapter expanding and changing the original example to show additional features of SAX.

While focusing on examples, we will also illustrate how to avoid common problems that programmers find in using SAX.

After illustrating what is possible with SAX, we will show a simple example of how to take an XML document and format it for output as an HTML table.

This chapter covers quite a bit of detail, and you might find it useful to review it after you've read through it once.

What Is SAX?

The users of the XML-DEV mailing list originally developed SAX, the Simple API for XML, in 1998. Simply stated, SAX is an event-based model for accessing XML document contents. SAX invokes methods when certain conditions (such as start tags and end tags) are encountered as an XML document is sequentially read from start to finish.

The Workings of SAX

The DOM, covered in the previous chapter, creates a tree-like structure of the parsed XML document in memory. This structure consumes a great deal of memory, but it is easy to visualize and simple to use programmatically. SAX takes a completely different approach to processing XML documents.

In essence, the SAX parser reads the XML document sequentially from start to finish, and along the way will invoke various callback methods when particular events occur. A callback is a method registered with the parser, written to enable the code to respond to events of interest to the programmer.

An example of a callback is the startElement() method. The SAX parser will invoke this method whenever an opening element tag is encountered. This enables us to do something as a result of this event, such as outputting the tag name that invoked this event.
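To make the callback idea concrete, here is a minimal sketch of a handler that records each start tag the parser reports. It assumes the JAXP SAXParserFactory that ships with the JDK rather than the Xerces setup used later in this chapter, and the class name ElementEcho and the sample document are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the name of every start tag the parser reports.
class ElementEcho extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        names.add(qName);   // invoked once per opening tag, in document order
    }

    // Parses an in-memory document and returns the tag names in the order seen.
    static List<String> echo(String xml) throws Exception {
        ElementEcho handler = new ElementEcho();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            echo("<BOOKS><BOOK><TITLE>Timberland</TITLE></BOOK></BOOKS>"));
    }
}
```

The parser drives the handler, not the other way around: the code never asks for the next element, it only reacts when startElement() is called.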

SAX Interfaces

SAX has four interfaces that contain all the callback methods. They are as follows:

ContentHandler: Defines all the available methods pertaining to XML markup, such as startElement() and processingInstruction().
ErrorHandler: Defines the methods used for the three kinds of parsing errors.
EntityResolver: Defines the methods used to customize handling of external entities; this includes a DTD reference.
DTDHandler: Defines the methods used to handle unparsed entities found in the DTD.

SAX2 provides a convenience class called DefaultHandler that implements all four of these interfaces. All of our examples in this chapter will extend this convenience class so that we don't have to define all the methods of each interface. Of course, it is possible to define separate classes to implement each interface.

Implementing these interfaces and registering the newly created class or classes with the parser allows the parser to call back the appropriate method pertaining to the event that has occurred. By defining the body of these callback methods, we can do anything we want in response to the parsing events.

Each parsing event will invoke the appropriate method if it is defined and registered. All methods are synchronous, meaning that once an event-handler callback is invoked, the parser cannot report another event until that callback returns. On the same note, once a parser begins to parse an XML document, it cannot be used to parse another XML document until it returns from parsing the first document.

The advantage of the event-driven, serial-access mechanism is that the SAX parser only reads a small part of the XML document at any one time. This creates a means of processing XML documents that is very fast and has very low memory usage. This means that SAX is the parser of choice when dealing with large XML documents. Now you may ask, What defines a large XML file? There isn't a direct answer to this question. It depends on a series of factors, including how fast your machine is, how much memory the machine has, what JVM you are using, the optimization of logic, and so on. If you are using another parser with an XML file and are having performance problems, it might be time to switch to SAX.

Downsides to SAX

SAX is great, but it does have a few shortcomings. The negative aspects of SAX include the following:

"Backing up" in an XML document is not possible because SAX is a read-forward process on the serial data stream.
SAX is a read-only process. XML source documents cannot be directly modified using SAX. However, an XML document can be read and a modified version output. Using this modified output as a source for another SAX reader finally results in the processing of a modified version of the original XML source document.
It takes a bit more programming to use SAX than it takes to use the DOM.
Determining the position of elements in terms of hierarchy and sibling relationships requires more programming effort than with the other parsers.
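The last point, the extra effort needed to know where an element sits in the hierarchy, can be illustrated with a small sketch. The handler below keeps its own stack of open elements, because SAX itself keeps no such record. The class name PathTracker and the path format are assumptions chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: rebuilds each element's position in the hierarchy.
class PathTracker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>(); // currently open elements
    final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        open.addLast(qName);                       // entering a child element
        paths.add("/" + String.join("/", open));   // path from the root to here
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();                         // back up to the parent
    }

    // Returns one path per start tag, in document order.
    static List<String> pathsOf(String xml) throws Exception {
        PathTracker handler = new PathTracker();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paths;
    }
}
```

With the DOM the same information is available through a node's parent; here the application has to maintain it as events arrive.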

Differences Between SAX1 and SAX2

The primary difference between SAX1 and SAX2 is the addition of namespace support for element and attribute processing. Through the use of SAX2 it is now possible to obtain namespace information. This improvement has caused several classes to be replaced and deprecated.

Also new to SAX2 is a standardized style of accessing properties and features of a SAX2 parser. Each property and feature is identified by a distinct URI, similar to namespace concepts. This prevents collisions in naming. It also permits any feature or property, including vendor-specific ones, to be accessed in a standardized way through the setFeature(), getFeature(), setProperty(), and getProperty() methods of the XMLReader interface.
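As a rough sketch of this mechanism, the following code sets the standard namespaces feature, reads it back, and probes a feature name that no parser is expected to know. It obtains an XMLReader through the JDK's JAXP factory, which is an assumption of this sketch; the class name FeatureDemo and the unknown feature name used in the usage note are invented.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.XMLReader;

class FeatureDemo {
    // Standard SAX2 feature name for namespace processing.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    // Sets a feature, then reads it back through the same name.
    static boolean setAndGet(String featureName, boolean value) throws Exception {
        XMLReader reader = newReader();
        reader.setFeature(featureName, value);
        return reader.getFeature(featureName);
    }

    // Returns true when the parser reports that it does not know the name.
    static boolean isUnrecognized(String featureName) throws Exception {
        try {
            newReader().getFeature(featureName);
            return false;
        } catch (SAXNotRecognizedException e) {
            return true;
        }
    }
}
```

For example, isUnrecognized("http://example.com/features/no-such-feature") lets an application test for a vendor-specific feature before relying on it.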

In this book, we're using the term SAX to refer to the combination of SAX1 and SAX2.

First SAX Example

This example echoes the events that are called by the parser. We will start with a simple example and expand it as the chapter progresses. This will be used to demonstrate the various callbacks that are invoked and what causes the changes in their usage.

We will begin by creating an instance of a non-validating parser and registering our SAX2Example class for some of the event handling. Next, our parser will parse an XML document that contains a reference to an external DTD, all the while invoking our callbacks found in SAX2Example.

Let's begin with the XML file defined in Listing 6.1. Save this file as \webapps\xmlbook\chapter6\Books.xml.

You should notice several things about this file:

Two namespaces are defined. The first is the default namespace because there is no prefix, and the second has the prefix mac.
One processing instruction exists. The target is color and the data is blue.
A DTD is associated with this XML document and is called, appropriately enough, Books.dtd. The source of this DTD can be found in Listing 6.2.
An external entity reference called info exists in the FOOTER element.

Listing 6.1 Books.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE BOOKS SYSTEM "Books.dtd">
<BOOKS xmlns="http://www.samspublishing.com/"
       xmlns:mac="http://mcp.com/">
    <mac:BOOK pubdate="1/1/2000">
        <mac:TITLE>Rising River</mac:TITLE>
        <mac:AUTHOR>
            <mac:NAME>John Smith &amp; Garith Green</mac:NAME>
        </mac:AUTHOR>
    </mac:BOOK>
<?color blue?>
    <BOOK pubdate="5/5/1998">
        <TITLE>Timberland</TITLE>
        <AUTHOR>
            <FNAME>Chris</FNAME>
            <LNAME>Hamilwitz</LNAME>
        </AUTHOR>
    </BOOK>
    <FOOTER>&info;</FOOTER>
</BOOKS>

Now that we have our XML document, let's take a look at the DTD associated with it. Save the DTD in Listing 6.2 as \webapps\xmlbook\chapter6\Books.dtd.

Listing 6.2 Books.dtd; DTD for Books.xml

<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY info "For more information contact Mary at mary@publish.com">
<!ELEMENT BOOKS (mac:BOOK*, BOOK*, FOOTER)>
<!ELEMENT BOOK (TITLE, AUTHOR)>
<!ATTLIST BOOK pubdate CDATA #REQUIRED>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (FNAME, LNAME)>
<!ELEMENT FNAME (#PCDATA)>
<!ELEMENT LNAME (#PCDATA)>
<!ELEMENT FOOTER (#PCDATA)>
<!ELEMENT mac:BOOK (mac:TITLE, mac:AUTHOR)>
<!ATTLIST mac:BOOK pubdate CDATA #REQUIRED>
<!ELEMENT mac:TITLE (#PCDATA)>
<!ELEMENT mac:AUTHOR (mac:NAME)>
<!ELEMENT mac:NAME (#PCDATA)>

There is nothing unusual about this DTD, but it will become very important as we change to a validating parser later in this chapter.

Next, we have our handler class, found in Listing 6.3. This is the class that defines those methods that the XML parser will invoke when particular events occur during the parsing of the XML document from Listing 6.1. Save this file as \webapps\xmlbook\WEB-INF\classes\xmlbook\chapter6\SAX2Example.java. Make sure to compile this class before using it and to restart Tomcat to make sure that it is registered. This applies to all classes throughout this chapter and will not be mentioned again.

Listing 6.3 SAX2Example.java; SAX2 Content Handler Class

package xmlbook.chapter6;

import java.io.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.*;

public class SAX2Example extends DefaultHandler{
    private Writer w;

    public SAX2Example(java.io.Writer new_w)
    {   w = new_w;    }

    public void startDocument() throws SAXException{
        try{ output ("<br/><b>Start Document</b>"); }
        catch(Exception e){throw new SAXException(e.toString());}
    }

    public void endDocument() throws SAXException{
        try{ output ("<br/><b>End Document</b>");  }
        catch(Exception e){throw new SAXException(e.toString());}
    }

    public void startElement(String uri, String localName, String elemName,
                             Attributes attrs) throws SAXException{
        try{
            output ("<br/>Start Element: \"" + elemName + "\"");
            output (" Uri: \"" + uri + "\"");
            output (" localName: \"" + localName + "\"");
            if (attrs.getLength() > 0){
                output("<br/>&nbsp;");
                for (int i = 0; i < attrs.getLength(); i++)
                {output ("&nbsp;&nbsp;attribute: ");
                 output (attrs.getQName(i) + "=\"" + attrs.getValue(i) + "\"");
                }
            }
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }

    public void endElement(String uri, String localName, String elemName)
    throws SAXException{
        try{ output ("<br/>End Element: \"" + elemName + "\""); }
        catch(Exception e){throw new SAXException(e.toString());}
    }

    public void characters(char[] ch, int start, int length)
    throws SAXException {
        try
        {  String s = new String(ch, start, length);
           output("&nbsp;&nbsp;<b>Characters encountered:</b> \""
           + s.trim()+ "\"");
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }

    public void processingInstruction(String target, String data)
    throws SAXException{
        try
        {   output("<BR>PROCESSING INSTRUCTION: target: "
                    + target + " data:" + data );
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }

    private void output (String strOut) throws SAXException{
        try { w.write (strOut);   }
        catch (IOException e)
            {throw new SAXException ("I/O error", e);}
    }
}

Again, DefaultHandler is the class that implements all four of the interfaces that make up the handlers for the SAX2 parser. We chose to extend this class instead of implementing each interface so that we were not required to define all the methods found in each interface.

In the handler class, a Writer object is passed to the constructor. When an instance of this class is created in our JSP, found in Listing 6.4, we will pass the implicit object out, which is of type javax.servlet.jsp.JspWriter, to the constructor. This will give the handler the ability to write to the JSP output stream.

From this point on, all the methods defined, except for the output method that is listed last, are callbacks. These methods override only a small portion of those defined in the DefaultHandler class and they have several things in common.

The bodies of all the callback methods in this example simply respond with a string that echoes the method being invoked and the tag or text that prompted the call. This will provide us with output that will explain the sequence of events that occur when a document is being parsed by SAX2.

When the parser encounters a start tag or end tag, the name of the tag is passed as a String to the startElement() or endElement() method, as appropriate. When a start tag is encountered, any attributes found therein are also passed in an Attributes object.

Each of these methods is declared by the interface to throw a SAXException. In order to provide a standard interface to the parsing behavior, this is the only type of checked exception that the SAX callbacks can throw. These exceptions can also wrap other exceptions, such as an IOException if there is an error writing to the output stream. To get at the wrapped exception, use the getException() method of the SAXException class.
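The wrapping and unwrapping described here can be sketched in a few lines. The class name WrapDemo and the simulated failure are assumptions made for illustration; only the SAXException constructor and getException() come from the SAX API.

```java
import java.io.IOException;
import org.xml.sax.SAXException;

class WrapDemo {
    // Stand-in for a write that fails; the message text is made up.
    static void failingWrite() throws SAXException {
        try {
            throw new IOException("stream closed");
        } catch (IOException e) {
            throw new SAXException("I/O error", e);   // wrap, keeping the original
        }
    }

    // Returns the exception wrapped inside the SAXException, or null if none.
    static Exception unwrap() {
        try {
            failingWrite();
            return null;
        } catch (SAXException e) {
            return e.getException();
        }
    }
}
```

Note that a SAXException built from a message alone, as in new SAXException(e.toString()), carries the text of the original exception but not the exception object, so getException() returns null for it.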

There will be more on error handling with SAX later in this chapter. For now, notice that all catch blocks in our SAX2Example handler class throw exceptions back to the parser, which in turn throws them to the code that invoked the parser. In our case, that's back to the JSP found in Listing 6.4. Eventually, we will be adding to our handler class to explore errors more.

Characters and Ignorable Whitespace

The output() and characters() methods found in our class need further explanation. The characters() callback is the method that will output any non-markup encountered anywhere in the XML document. At times this will include whitespace.

Why would characters() be called with blank space? Without a DTD or schema, the parser has no way of figuring out whether the blank space found before or after a tag is text data or not. It will simply call characters() every time anything that isn't markup is encountered. Remember, this parser can't see ahead to know the context of the blank space or characters that it's encountering. However, if a DTD or schema is present, the parser has the definitions of the elements, and can know whether text data is permitted within an element or not. The method ignorableWhitespace() will be invoked for whitespace that appears inside an element whose declaration permits only child elements, not text.
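The difference between the two callbacks can be observed with a small counting handler. This is a sketch under stated assumptions: it uses the JDK's JAXP parser, an internal DTD subset instead of an external file, and it switches validation on for the DTD case so that the reporting of ignorable whitespace does not depend on the behavior of a particular non-validating parser. The class name WhitespaceCounter and the sample document are invented.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WhitespaceCounter extends DefaultHandler {
    int viaCharacters;   // whitespace-only text delivered through characters()
    int viaIgnorable;    // text delivered through ignorableWhitespace()

    static final String BODY = "<a>\n  <b>text</b>\n</a>";
    static final String DTD =
        "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]>";

    @Override
    public void characters(char[] ch, int start, int length) {
        // Count only the chunks that consist entirely of whitespace.
        if (new String(ch, start, length).trim().isEmpty()) {
            viaCharacters += length;
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        viaIgnorable += length;
    }

    // Parses the document and returns the handler holding both counters.
    static WhitespaceCounter count(String xml, boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        WhitespaceCounter handler = new WhitespaceCounter();
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }
}
```

Calling count(BODY, false) sends the line breaks and indentation through characters(), while count(DTD + BODY, true) routes the same whitespace to ignorableWhitespace(), because the declaration of a allows only the child element b.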

The final item to note is that the output() method in the SAX2Example class from Listing 6.3 was written specifically to handle IOExceptions, which can occur while writing. If we just used out directly within the callbacks, we would have no means of catching the IOExceptions that might result.

Next, in Listing 6.4, we have our JSP that will glue everything together. Save this file as \webapps\xmlbook\chapter6\SAX2Example.jsp.

Listing 6.4 SAX2Example.jsp

<%@ page
  import="org.xml.sax.helpers.*,
  org.xml.sax.*,
  javax.xml.parsers.*,
  xmlbook.chapter6.*" %>
<html>
<head><title>SAX2 Parser Content Handler Example</title></head>
<body>
<%
try{
    XMLReader reader =
        XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    reader.setFeature("http://xml.org/sax/features/validation", false);
    //create instance of handler class we wrote
    SAX2Example se = new SAX2Example(out);
    //register handler class
    reader.setContentHandler(se);
    String  ls_path = request.getServletPath();
    ls_path = ls_path.substring(0,ls_path.indexOf("SAX2Example.jsp"));
    String  ls_xml  = application.getRealPath(ls_path + "Books.xml");
    //parse the XML document
    reader.parse(ls_xml);
}
catch (Exception e){
    out.print("<br/><br/><font color=\"red\">there was an error<BR>");
    out.print (e.toString() + "</font>");
}
%>
</body>
</html>

Examine the following lines from the listing:

XMLReader reader =
    XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");

This creates an instance of an XMLReader. Passing the Xerces class name to the factory makes this instance use the Xerces parser. If you have not yet placed the Xerces.jar file in the /lib directory found in the root of the Tomcat installation, do it now and restart the server.

Once the JSP has the reader, the code sets the validation feature of the parser to false. This is done by passing SAX the distinct URI that identifies the feature, along with the setting:

reader.setFeature("http://xml.org/sax/features/validation", false);

Our JSP next creates an instance of the SAX2Example handler found in Listing 6.3:

SAX2Example se = new SAX2Example(out);

Then the handler is registered with the XMLReader as the ContentHandler implementation:

reader.setContentHandler(se);

To conclude the code within the try block, this JSP finds the path to the current XML document and passes it into the reader to parse:

String  ls_path = request.getServletPath();
ls_path = ls_path.substring(0,ls_path.indexOf("SAX2Example.jsp"));
String  ls_xml  = application.getRealPath(ls_path + "Books.xml");
//parse the XML document
reader.parse(ls_xml);

Finally, the code ends with only the most basic error handling. The code will catch everything re-thrown by our SAX2Example class, and any error that may have occurred in the try block of this JSP. The error handling will be expanded in later examples.

The output is shown in Figure 6.1.

Figure 6.1. Output of SAX2Example.jsp.


In looking at the result of the processing, several things stand out:

Since this example uses namespaces, each opening tag causes startElement() to be called with the following parameters: the namespace URI associated with the element, the tag name without the namespace prefix (the local name), the literal tag string including any prefix, and the attributes of the element.
The same parameters are available within the endElement() method, which is called when the parser encounters an end tag.
The processing instruction data and target are available in the processingInstruction() method. Note that the XML declaration will not invoke this method. The XML declaration is for parsers only, not for applications.
Although we aren't using a validating parser now, notice that the entity info, declared in Books.dtd and referenced in Books.xml, was resolved in the output.
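The first point in this list can be checked with a short sketch that records the three name-related arguments of startElement(). It assumes the JDK's JAXP factory, which must be told explicitly to be namespace aware; the class name NameParts and the separator used in the recorded strings are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NameParts extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        // Order matches the callback signature: URI, local name, qualified name.
        seen.add(uri + " | " + localName + " | " + qName);
    }

    static List<String> describe(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // JAXP factories default to false
        NameParts handler = new NameParts();
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xml)), handler);
        return handler.seen;
    }
}
```

For an element written as mac:BOOK with the prefix mac bound to http://mcp.com/, the local name is BOOK and the qualified name is mac:BOOK.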

Processing Versus Validation

Doesn't it seem strange that a non-validating parser would even look at the DTD? It turns out that if a DTD or a schema is present, it will be processed (but not validated) by SAX. This means that entity declarations will be resolved to their replacement values.

Also, if the XML file doesn't have a DTD or schema, whitespace is returned in the characters() method. Since this example references a DTD, the characters() method doesn't return whitespace. Instead, the ignorableWhitespace() callback is invoked by SAX. This is due to the DTD giving the parser enough information to know when to ignore the whitespace.

Characters Revisited

Unless you understand its behavior, the characters() method can cause you a great deal of heartache. Notice in the output shown in Figure 6.1 that the characters() method was called three times in a row for text data found in the single element mac:NAME. (See the eighth line of the output.) The number of times the characters() method will be invoked for text data found within one element varies. With that in mind, don't depend on there being only one invocation.

Another trap caused by the characters() method involves the parameters that are passed into the method:

public void characters(char[] ch, int start, int length)

The trap appears when code loops through the whole array, as in the following line:

for(int i=0; i<ch.length; i++)

Do you see the problem with this? The loop starts at index 0 and uses the length of the ch array to end the loop, instead of the start and length parameters that were given to us through this method. The characters() method is defined this way to allow lower-level optimizations, such as reusing arrays and reading ahead of the current location. It depends on the wrapping application to stay within the range that begins at start and spans length characters.

An easy way around this problem is to immediately create a string from the parameters, like this:

String s = new String(ch, start, length);

This will ensure that the characters() method doesn't cause problems in your application. On the other hand, it will create a new string object each time this method is called. With that in mind, this would be a good place to pool resources.
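Both traps, multiple invocations and the array bounds, can be handled by appending each slice to a reusable buffer and reading the buffer only when the element ends. The following is a minimal sketch of that approach, not the method used by the listings in this chapter; the class name TextCollector is invented, and the sketch assumes elements that contain either text or child elements, not a mixture.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder(); // reused for every element
    final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        buffer.setLength(0);                  // start fresh for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);     // only the slice the parser handed us
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            texts.add(qName + "=" + text);    // complete text, however many chunks
        }
        buffer.setLength(0);
    }

    static List<String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.texts;
    }
}
```

Because the text is assembled in endElement(), it makes no difference whether the parser delivers the content of mac:NAME in one call or in three.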

Another aspect of the characters() method that needs more analysis has to do with the timing of the invocations. We already mentioned that when there is no DTD present, it will be invoked every time any space that isn't markup is encountered. Let's demonstrate this.

Comment out the DTD declaration in the Books.xml file found in Listing 6.1, as shown here:

<!--  <!DOCTYPE BOOKS SYSTEM "Books.dtd"> -->

Save the file; the output is shown in Figure 6.2.

Figure 6.2. Output of SAX2Example.jsp with DTD commented out.


Look how many times characters() is called when the DTD is absent. The only reason we can't see exactly how much whitespace is passed into the method is that we trim the string upon output. The method characters() is also called after elements are closed, and anywhere else that spaces exist in the XML document. If these spaces are removed and the XML document becomes one long string, these calls won't happen, but that's not very practical.

Notice that there is an error at the bottom of Figure 6.2. This fatal error is a result of the inability to resolve the entity reference info. This makes sense considering that the declaration is in the DTD whose reference was just commented out.

Error Handling

SAX provides the ErrorHandler interface for handling three different types of parsing errors. Each of these callbacks is able to throw SAXExceptions. The reason is that the body of a method may cause exceptions, and the method needs a way to pass exceptions back up the hierarchy.

In addition to the SAXExceptions, SAX also provides for three different types of error events:

warning: Conditions that are neither errors nor fatal errors as the XML specification defines them. The parser reports them and continues.
error: Recoverable errors. A validating parser uses this callback to report a violation of a validity constraint, such as an element that does not match its DTD declaration or a DTD that is missing when validation is turned on.
fatalError: Nonrecoverable errors, such as an XML document that is not well formed, an entity reference that cannot be resolved, or an unrecognizable encoding.

Now that we've briefly covered the three error handling events, let's create a new class with some implementations of these methods. This new class will contain the complete code of Listing 6.3, and the additional methods shown in Listing 6.5. Save the combined code as \webapps\xmlbook\WEB-INF\classes\xmlbook\chapter6\SAX2ExampleErr.java.

Listing 6.5 Additional Methods for SAX2ExampleErr Class

public void warning(SAXParseException e) throws SAXException{
    output("<BR><font color=\"red\">warning: " + e.getMessage() + "  "
            + e.getSystemId() + " Line: " + e.getLineNumber() + "</font>");
}

public void error(SAXParseException e) throws SAXException{
    output("<BR><font color=\"red\">error: " + e.getMessage() + "  "
            + e.getSystemId() + " Line: " + e.getLineNumber() + "</font>");
}

public void fatalError(SAXParseException e) throws SAXException{
    output("<BR><font color=\"red\">fatalError: " + e.getMessage() + "  "
            + e.getSystemId() + " Line: " + e.getLineNumber() + "</font>");
}

Remember to change the name of the class and the constructor from SAX2Example to SAX2ExampleErr, or the file will not compile. This applies to all classes in this chapter that are using code from a previous listing and will not be mentioned again.

Next, we need to make a JSP that will use the newly created error handling callbacks. This JSP, shown in Listing 6.6, has the same content as the one found in Listing 6.4 except for some minor changes. (These changes are noted in boldface print in the listing.) Save the code as \webapps\xmlbook\chapter6\SAX2ExampleErr.jsp.

Listing 6.6 SAX2ExampleErr.jsp

<%@ page
  import="org.xml.sax.helpers.*,
  org.xml.sax.*,
  javax.xml.parsers.*,
  xmlbook.chapter6.*" %>
<html>
<head><title>SAX2 Parser Error Handler Example</title></head>
<body>
<%
try{
    XMLReader reader =
       XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    reader.setFeature("http://xml.org/sax/features/validation", true);
    //create instance of handler class we wrote
    SAX2ExampleErr se = new SAX2ExampleErr(out);
    //register handler class
    reader.setContentHandler(se);
    reader.setErrorHandler(se);
    String  ls_path = request.getServletPath();
    ls_path = ls_path.substring(0,ls_path.indexOf("SAX2ExampleErr.jsp"));
    String  ls_xml  = application.getRealPath(ls_path + "Books.xml");
    //parse the XML document
    reader.parse(ls_xml);
}
catch (Exception e){
    out.print("<br/><br/><font color=\"red\">there was an error<BR>");
    out.print (e.toString() + "</font>");
}
%>
</body>
</html>

The first change is the reference to the handler class. Instead of using the old SAX2Example class, we will now be using the SAX2ExampleErr class with its error handling callback methods. Next, we need to register this class with the parser as the implementation of the ErrorHandler interface:

reader.setErrorHandler(se);

This will tell the parser to invoke the new methods from Listing 6.5 when any of the error classifications discussed earlier occur.

Once this is done, we need to turn validation on. This is done by changing the following line from Listing 6.4

reader.setFeature("http://xml.org/sax/features/validation", false);

to look like this in Listing 6.6:

reader.setFeature("http://xml.org/sax/features/validation", true);

Turning validation on means that the XML document structure will now be validated against the DTD. If for any reason the DTD cannot be found, or the structure is wrong, the new error() callback we just added will be invoked. Load SAX2ExampleErr.jsp and check out the results. If for some reason you have already uncommented the DTD declaration that we commented out, go ahead and comment it out again.

The results are shown in Figure 6.3.

Figure 6.3. Validation errors resulting from the absence of the DTD.


Notice that the errors resulting from the absence of the DTD when using a validating parser are non-fatal errors. A fatal error results from the parser's inability to resolve the entity reference. This is a fatal error regardless of whether validation is turned on or off.

The specifics of your application will determine how to deal with the various errors that occur when parsing XML documents. In general, your application should log all warnings and nonfatal errors in some way, and hopefully recover from them in some manner that the user doesn't see. Fatal errors, on the other hand, should display a user-friendly error message and gracefully exit after releasing all resources used.
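One way to follow this advice is sketched below: warnings and nonfatal errors are collected for logging while parsing continues, and a fatal error is recorded and then rethrown so that parsing stops. The class name ErrorCollector, the message format, and the use of the JDK's JAXP factory with an in-memory document are all assumptions of this sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ErrorCollector extends DefaultHandler {
    final List<String> problems = new ArrayList<>();
    boolean aborted;   // true when parsing stopped on an exception

    private void record(String kind, SAXParseException e) {
        problems.add(kind + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void warning(SAXParseException e) { record("warning", e); }

    @Override
    public void error(SAXParseException e) { record("error", e); }  // keep parsing

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        record("fatalError", e);
        throw e;                                  // stop parsing
    }

    // Parses with validation on and returns the handler with its findings.
    static ErrorCollector check(String xml) {
        ErrorCollector handler = new ErrorCollector();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            handler.aborted = true;               // fatal error or setup failure
        }
        return handler;
    }
}
```

A real application would hand the collected entries to its logging facility and show the user a friendly message only when aborted is true.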

Ignorable Whitespace

This is a good time to look at the ignorableWhitespace() method of the ContentHandler interface. The parser invokes it when it is able to discern ignorable whitespace within an XML document. It takes the same parameters as characters() and can be invoked whenever a schema or DTD is present. Validation does not have to be turned on, only DTD or schema processing. This method is useful when it's necessary to process the whitespace of an XML document for output.

Entity References

Using SAX, it is possible to perform some extra processing with entity references. That's done through the EntityResolver interface. Each time the parser comes across an external entity, it will pass the public ID and system ID for that entity to the resolveEntity() method. This method provides the ability to redirect the resolution of entities, an especially handy way to avoid the fatal entity reference errors shown in the earlier examples.

The method resolveEntity() will be invoked before any external reference, including a DTD reference, is used. When this method returns a new InputSource, it will be used in place of the original ID. If it returns null, the original entity reference ID is used to resolve the entity. Through this method, DTDs can be changed, and entity references resolved.
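The redirect-or-return-null behavior can be sketched with a resolver that looks up the system ID in a table of local replacements. Everything specific here is an assumption for illustration: the class name LocalResolver, the system ID under example.invalid, the in-memory DTD text, and the use of the JDK's JAXP parser.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class LocalResolver extends DefaultHandler {
    // Hypothetical table: system ID -> replacement content held locally.
    private final Map<String, String> local = new HashMap<>();
    private final StringBuilder text = new StringBuilder();

    LocalResolver() {
        local.put("http://example.invalid/Books.dtd",
                  "<!ENTITY info \"resolved from a local copy\">");
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String replacement = local.get(systemId);
        if (replacement == null) {
            return null;                     // let the parser resolve it as usual
        }
        InputSource source = new InputSource(new StringReader(replacement));
        source.setSystemId(systemId);
        return source;                       // used in place of the original
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Parses a document whose DTD reference is redirected by the resolver.
    static String footerText() throws Exception {
        String xml = "<!DOCTYPE FOOTER SYSTEM \"http://example.invalid/Books.dtd\">"
                   + "<FOOTER>&info;</FOOTER>";
        LocalResolver handler = new LocalResolver();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }
}
```

Because the lookup falls through to return null for unknown IDs, the resolver never hands the parser a half-built InputSource.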

For example, take the DTD found in Listing 6.2 and replace the entity declaration so that it now looks like Listing 6.7. (The new declaration appears in boldface print in the listing.) Save this file as \webapps\xmlbook\chapter6\Books2.dtd.

Listing 6.7 New DTD Books2.dtd for Books.xml

<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY info "Let us help you find the information you were looking for.
               E-mail Mary at mary@publish.com">
<!ELEMENT BOOKS (mac:BOOK*, BOOK*, FOOTER)>
<!ELEMENT BOOK (TITLE, AUTHOR)>
<!ATTLIST BOOK pubdate CDATA #REQUIRED>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (FNAME, LNAME)>
<!ELEMENT FNAME (#PCDATA)>
<!ELEMENT LNAME (#PCDATA)>
<!ELEMENT FOOTER (#PCDATA)>
<!ELEMENT mac:BOOK (mac:TITLE, mac:AUTHOR)>
<!ATTLIST mac:BOOK pubdate CDATA #REQUIRED>
<!ELEMENT mac:TITLE (#PCDATA)>
<!ELEMENT mac:AUTHOR (mac:NAME)>
<!ELEMENT mac:NAME (#PCDATA)>

Also uncomment the DTD reference in the Books.xml file.

Next, append the method shown in Listing 6.8 to the code from the SAX2ExampleErr class and save it as \webapps\xmlbook\WEB-INF\classes\xmlbook\chapter6\SAX2ExampleRef.java.

Listing 6.8 Appended Method to Create SAX2ExampleRef Class

public InputSource resolveEntity(String publicId, String systemId)
throws SAXException{
   try{
       output("<br>publicID: " + publicId + " systemId: " + systemId);
       return new InputSource
       ("file:///TomcatPath/webapps/xmlbook/chapter6/Books2.dtd");
   }
   catch(Exception e){throw new SAXException ("Resolve Entity Error", e);}
}

Notice that you will have to substitute your own Tomcat path for TomcatPath in the file URL passed to the InputSource constructor in this listing.

Finally, we need to create another JSP that will register the new handler with the parser. This JSP, shown in Listing 6.9, has the same content as the one found in Listing 6.6 except for some minor changes. (These changes are noted in boldface print in the listing.) Save the code as \webapps\xmlbook\chapter6\SAX2ExampleRef.jsp.

Listing 6.9 SAX2ExampleRef.jsp

<%@ page
  import="org.xml.sax.helpers.*,
  org.xml.sax.*,
  javax.xml.parsers.*,
  xmlbook.chapter6.*" %>
<html>
<head><title>SAX2 Parser Entity Resolver Example</title></head>
<body>
<%
try{
    XMLReader reader =
       XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    reader.setFeature("http://xml.org/sax/features/validation", true);
    //create instance of handler class we wrote
    SAX2ExampleRef se = new SAX2ExampleRef(out);
    //register handler class
    reader.setContentHandler(se);
    reader.setErrorHandler(se);
    reader.setEntityResolver(se);
    String  ls_path = request.getServletPath();
    ls_path = ls_path.substring(0,ls_path.indexOf("SAX2ExampleRef.jsp"));
    String  ls_xml  = application.getRealPath(ls_path + "Books.xml");
    //parse the XML document
    reader.parse(ls_xml);
}
catch (Exception e){
    out.print("<br/><br/><font color=\"red\">there was an error<BR>");
    out.print (e.toString() + "</font>");
}
%>
</body>
</html>

Besides changing the handler class registered with the SAX parser, this JSP also registers the class as the entity resolver.

reader.setEntityResolver(se);

Load SAX2ExampleRef.jsp to see the results shown in Figure 6.4.

Figure 6.4. Results from replacement of the entity reference.

graphics/06fig04.gif

When resolveEntity() is invoked, this example writes the string that appears on the first line of the output shown in Figure 6.4. It is at this point that the external entity, in this case the DTD reference, is replaced with our new DTD, Books2.dtd (see Listing 6.7). Notice that the entity reference info, found at the bottom of the output, was resolved to the value that was defined in Books2.dtd.

This interface allows any external reference, including a DTD association, to be programmatically changed. For example, imagine an application that uses XML documents whose PUBLIC DTD is only available on the Internet. What would happen if the application could no longer access the Internet? Instead of failing because the document refers to an inaccessible DTD, the application can use the resolveEntity() method to replace such external references with local ones.
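The offline scenario can be sketched as follows. This is only an illustration under stated assumptions: it uses the JDK's built-in JAXP parser obtained through SAXParserFactory rather than the Xerces class named in the listings, and the class name LocalDtdDemo, the example.invalid URL, and the DTD text held in a string are all made up for the example. The resolver never touches the network; it hands the parser a local copy of the DTD, and the entity reference in the document is expanded from that copy.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the URL and DTD content are illustrative only.
class LocalDtdDemo {
    // Stand-in for a local copy of a DTD that normally lives on the Internet.
    static final String LOCAL_DTD =
        "<!ELEMENT NOTE (#PCDATA)>\n"
        + "<!ENTITY info \"resolved locally\">";

    // Document whose DOCTYPE points at an unreachable (made-up) location.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE NOTE SYSTEM \"http://example.invalid/note.dtd\">\n"
        + "<NOTE>&info;</NOTE>";

    // Parses XML and returns the character data that the parser reported.
    static String parse() {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Substitute the local copy for the remote DTD.
                if (systemId != null && systemId.endsWith("note.dtd")) {
                    return new InputSource(new StringReader(LOCAL_DTD));
                }
                // Anything else: let the parser resolve it in the default way.
                return null;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so accumulate it.
                text.append(ch, start, length);
            }
        };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setEntityResolver(handler);
            reader.parse(new InputSource(new StringReader(XML)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }
}
```

In a real application the local copy would more likely be a file in the Web application, as in the resolveEntity() method shown earlier, than a string constant.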

External entities are not limited to strings or characters. They can refer to images, XML fragments, and much more.

Generally, when multiple entities are being replaced through the resolveEntity() method, there may be Java switch statements or nested if statements. Make sure that if an entity reference falls through these paths, a null is returned and not an InputSource that hasn't been properly created. Returning null tells the parser to resolve the entity in its default way, by opening the system identifier itself. If an InputSource that hasn't been properly created is returned from this method, expect unpredictable results.
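One way to keep the fall-through case safe is to look replacements up in a table instead of nesting conditionals. The following is a minimal sketch, not the handler used in this chapter: the class name MappedEntityResolver, the example.invalid system IDs, and the replacement text are hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver; the system IDs and replacement text are made up.
class MappedEntityResolver implements EntityResolver {
    // Maps a system ID to the text that should be read in its place.
    private final Map<String, String> replacements = new HashMap<>();

    MappedEntityResolver() {
        replacements.put("http://example.invalid/books.dtd",
                         "<!ELEMENT BOOKS ANY>");
        replacements.put("http://example.invalid/legal.xml",
                         "<LEGAL>All rights reserved</LEGAL>");
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String replacement = replacements.get(systemId);
        if (replacement == null) {
            // Unknown entity: return null so the parser uses its default
            // resolution instead of a half-built InputSource.
            return null;
        }
        InputSource source = new InputSource(new StringReader(replacement));
        // Keep the original system ID so error messages still identify the entity.
        source.setSystemId(systemId);
        return source;
    }
}
```

Every path through the method either returns a fully constructed InputSource or null, so there is no branch that can leak a partially initialized object.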

The Document Locator

What happens when we need to know the line number or column at which some markup ends? SAX gives us the means of accessing the exact location of the parser when any callback event occurs. It does so through the use of a Locator. By declaring a Locator variable in the ContentHandler class and storing in it the object that the SAX parser passes to the setDocumentLocator() method, we have access to this information.

The setDocumentLocator() method is the very first callback to be invoked by a SAX parser upon parsing a new XML document, even before the startDocument() callback. The parser passes a Locator object into setDocumentLocator(). Using that Locator object, it's possible to get location information (line and column) and ID information (public ID and system ID) when any other callback is invoked by SAX. If the handler does not override setDocumentLocator() and keep the reference, the Locator object will not be accessible from elsewhere in the code. For this reason, the next example declares a Locator variable inside the ContentHandler implementation and sets it in the setDocumentLocator() method. In this way, the example has access to the Locator object in every later callback.
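The ordering of these callbacks can be checked with a small handler that records each event as it arrives. This is a sketch under assumptions: it uses the JDK's built-in JAXP parser (which does supply a Locator) instead of the Xerces class named in the listings, and CallbackOrderDemo is a hypothetical class name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the order in which callbacks arrive.
class CallbackOrderDemo extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        events.add("setDocumentLocator");
    }

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        // qName is used because the default JAXP parser is not namespace aware.
        events.add("startElement:" + qName);
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    // Parses the given XML text and returns the recorded callback names.
    static List<String> run(String xml) {
        CallbackOrderDemo handler = new CallbackOrderDemo();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }
}
```

Calling CallbackOrderDemo.run("<a/>") and printing the returned list shows setDocumentLocator ahead of startDocument.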

We will demonstrate this in our last SAX handler class. This new class will contain the majority of the code from the previous handler classes with some additions. (These changes are noted in boldface print in the listing.) Save the code from Listing 6.10 as \webapps\xmlbook\WEB-INF\classes\xmlbook\chapter6\SAX2ExampleLoc.java.

Listing 6.10 SAX Document Locator Class

package xmlbook.chapter6;

import java.io.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.*;

public class SAX2ExampleLoc extends DefaultHandler{
    private Writer w;
    private Locator locator;

    public SAX2ExampleLoc(java.io.Writer new_w)
    {   w = new_w;    }

    public void setDocumentLocator(Locator locator){
        this.locator = locator;
    }
    public void startDocument() throws SAXException{
        try{ output ("<br/><b>Start Document</b>"); }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void endDocument() throws SAXException{
        try{ output ("<br/><b>End Document</b>");  }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void startElement(String uri, String localName, String elemName,
                             Attributes attrs) throws SAXException{
        try{
            output ("<br/>Start Element: \"" + elemName + "\"");
            output (" Uri: \"" + uri + "\"");
            output (" localName: \"" + localName + "\"");
            if (attrs.getLength() > 0){
                output("<br/>&nbsp;");
                for (int i = 0; i < attrs.getLength(); i++)
                {output ("&nbsp;&nbsp;attribute: ");
                 output (attrs.getQName(i) + "=\"" + attrs.getValue(i) + "\"");
                }
            }
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void endElement(String uri, String localName, String elemName)
    throws SAXException{
        try{
            output("<br/>End Element: \"" + elemName + "\"");
            output("<BR><font color=\"green\">line:" + locator.getLineNumber()
                    + " column: " + locator.getColumnNumber() + " system ID: "
                    + locator.getSystemId() + "</font>");
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void characters(char[] ch, int start, int length)
    throws SAXException {
        try
        {String s = new String(ch, start, length);
        output("&nbsp;&nbsp;<b>Characters encountered:</b> \"" + s.trim()+ "\"");
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void processingInstruction(String target, String data)
    throws SAXException{
        try
        { output("<BR>PROCESSING INSTRUCTION: target: "
                    + target + " data:" + data );
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    private void output (String strOut) throws SAXException{
        try { w.write (strOut);   }
        catch (IOException e) {throw new SAXException ("I/O error", e);}
    }
}

The setDocumentLocator() method simply sets our local Locator variable to the Locator object that our parser is passing into this callback. Through this, we will have access to location and ID information of the parser in the XML document.

The last step is to add some code to the endElement() method found in our SAX2ExampleLoc class so that we can see the Locator in action. The additional lines of code simply output the parser's location information each time the endElement() callback is invoked. That information includes which file the parser is working against and where it is in the parsing process. These pieces of information become more useful when multiple files are used together.

Finally, we need to create another JSP that will use our new handler. This JSP, shown in Listing 6.11, has the same content as the one found in Listing 6.9 except for some minor changes. (These changes are noted in boldface print in the listing.) Save the code as \webapps\xmlbook\chapter6\SAX2ExampleLoc.jsp.

Listing 6.11 SAX2ExampleLoc.jsp

<%@ page
  import="org.xml.sax.helpers.*,
  org.xml.sax.*,
  javax.xml.parsers.*,
  xmlbook.chapter6.*" %>
<html>
<head><title>SAX2 Parser Document Locator Example</title></head>
<body>
<%
try{
    XMLReader reader =
       XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    reader.setFeature("http://xml.org/sax/features/validation", false);

    //create instance of handler class we wrote
    SAX2ExampleLoc se = new SAX2ExampleLoc(out);

    //register handler class
    reader.setContentHandler(se);

    String  ls_path = request.getServletPath();
    ls_path = ls_path.substring(0,ls_path.indexOf("SAX2ExampleLoc.jsp"));
    String  ls_xml  = application.getRealPath(ls_path + "Books.xml");

    //parse the XML document
    reader.parse(ls_xml);
}
catch (Exception e){
    out.print("<br/><br/><font color=\"red\">there was an error<BR>");
    out.print (e.toString() + "</font>");
}
%>
</body>
</html>

The output of the additions is shown in Figure 6.5.

Figure 6.5. Output using the document locator.

graphics/06fig05.gif

We are only outputting the parser location during the endElement() invocation. However, the information is available within every callback that is available in SAX.
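To illustrate that the Locator can be read from callbacks other than endElement(), here is a minimal sketch that records the line number reported when each start tag is seen. It assumes the JDK's built-in JAXP parser rather than the Xerces class named in the listings, and ElementLineRecorder is a hypothetical class name. The null check is there because a SAX parser is encouraged, but not required, to supply a Locator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that remembers the line on which each element starts.
class ElementLineRecorder extends DefaultHandler {
    private Locator locator;
    // Element name -> line number reported by the Locator (-1 if unavailable).
    final Map<String, Integer> lines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        // The Locator reports where the parser is when this event fires.
        int line = (locator != null) ? locator.getLineNumber() : -1;
        lines.put(qName, line);
    }

    // Parses the given XML text and returns the recorded line numbers.
    static Map<String, Integer> record(String xml) {
        ElementLineRecorder handler = new ElementLineRecorder();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.lines;
    }
}
```

Because the map is keyed by element name, a document that repeats an element keeps only the last occurrence; a list of name and line pairs would be the natural extension.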

The Locator should be reinitialized for each document parse. If the same parser is going to be used over and over again, which is possible, make sure that you don't hold a reference to the Locator outside the ContentHandler implementation. If a Locator reference is held outside, it will become meaningless as soon as SAX finishes parsing an XML document.

As already mentioned, all parsing is sequential. Each callback must return before the parser can invoke the next event handler. This is also true of parsers being reused to parse other XML documents. The parser may not be used again on another XML document until it has returned from parsing the current XML document. It's fine to reuse parsers, but don't put one in a loop structure unless it is finished parsing before it is used again.
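Reusing one parser sequentially can be sketched like this. The example assumes the JDK's built-in JAXP parser rather than the Xerces class named in the listings, and ReusedReaderDemo is a hypothetical class name. Each call to parse() returns only when the document has been fully processed, so the loop never starts a second parse while the first is still running.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: one XMLReader, several documents, parsed one after another.
class ReusedReaderDemo {
    // Returns the number of elements found in each document, in order.
    static List<Integer> countElements(List<String> documents) {
        List<Integer> counts = new ArrayList<>();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            for (String xml : documents) {
                // Fresh per-document state, so nothing leaks between parses.
                final int[] count = {0};
                reader.setContentHandler(new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attrs) {
                        count[0]++;
                    }
                });
                // parse() blocks until this document is finished.
                reader.parse(new InputSource(new StringReader(xml)));
                counts.add(count[0]);
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return counts;
    }
}
```

Installing a new handler for every document also avoids holding on to a Locator or any other per-document state from an earlier parse.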

Breaking the System to See How It Works

Now that we've looked at most of the handlers in SAX, let's break the system in various ways to see which error callback is invoked, and what the results are. Try the following one at a time:

Change the version number found in the XML declaration of the XML document or the DTD document.
Add an element or attribute that is undeclared in the DTD.
Add an element in the DTD that is nonexistent in the XML document.
Change the DTD filename so that it can't be found.

By breaking things and seeing how the validating parser reacts, it's possible to get a better understanding of how things work. It would be a good idea to try inducing these errors with validation turned on and then repeating the process with validation turned off.
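A small harness makes these experiments easier to read, because it shows which ErrorHandler callback fired and whether the parse was aborted. The sketch below is only an illustration: it uses the JDK's built-in JAXP parser rather than the Xerces class named in the listings, validation is left off, and FatalErrorDemo is a hypothetical class name. It feeds the parser a document that is not well formed, which is one more way to break things.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical harness that records which error callbacks were invoked.
class FatalErrorDemo extends DefaultHandler {
    final List<String> reports = new ArrayList<>();
    boolean parseAborted = false;

    @Override
    public void warning(SAXParseException e) {
        reports.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        reports.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) {
        // Recorded rather than rethrown, to see what the parser does next.
        reports.add("fatalError: " + e.getMessage());
    }

    // Parses the given XML text and returns the handler with its findings.
    static FatalErrorDemo run(String xml) {
        FatalErrorDemo handler = new FatalErrorDemo();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // The parser gave up on the document.
            handler.parseAborted = true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

Running FatalErrorDemo.run("<a><b></a>") and inspecting reports and parseAborted shows how a well-formedness problem differs from the validation problems in the list above.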

Processing Versus Validation Revisited

It's important to repeat that when a DTD is present, the difference between validating and non-validating parsers is subtle. With a non-validating parser such as the Xerces parser used in this chapter, the DTD will still be processed. Entity declarations will be resolved, and the DTD structural definition will be used to avoid invoking the characters() method on whitespace that appears between elements. Instead, ignorableWhitespace() will be invoked.

The only difference in using a validating parser is that the parser will compare the structure of the XML document to the DTD or schema. If the XML document does not maintain the structure defined in the DTD or schema, error() will be invoked. This means that if error() is undefined or its class is not registered as the ErrorHandler, you will have no idea that your XML document is failing validation. The failure of an XML document to validate is not considered a fatal error and therefore, unless error() reports it, will pass by unnoticed.
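The point about error() can be made concrete with a handler that collects validation errors so the caller can inspect them after the parse. This is a minimal sketch, assuming the JDK's built-in JAXP parser with validation switched on through SAXParserFactory rather than the Xerces feature string used in the listings; ValidationReportDemo and the tiny inline DTD are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps validation errors instead of ignoring them.
class ValidationReportDemo extends DefaultHandler {
    // Illustrative internal DTD subset: BOOKS may contain only BOOK elements.
    static final String DOCTYPE =
        "<!DOCTYPE BOOKS [<!ELEMENT BOOKS (BOOK)*><!ELEMENT BOOK (#PCDATA)>]>";

    final List<String> errors = new ArrayList<>();

    @Override
    public void error(SAXParseException e) {
        // Validation failures arrive here; parsing continues afterward.
        errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
    }

    // Validates the given XML text and returns the collected error messages.
    static List<String> validate(String xml) {
        ValidationReportDemo handler = new ValidationReportDemo();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.errors;
    }
}
```

If the error() override is removed, the inherited DefaultHandler.error() does nothing, and the same invalid document parses to the end without any sign that validation failed.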

Using SAX to Output HTML

Now that we've had our crash course in SAX, we are going to introduce a short example that is slightly more realistic than simply outputting strings in response to events as we did earlier. In this example, we are going to create a new class that extends DefaultHandler to define the callbacks. This class is written specifically to handle the XML document that we defined in Listing 6.1.

The example will create a table to display book data. Upon encountering the processing instruction, we will change the color of the table contents from red to the value of the instruction, which is blue. All the while, we will be counting the rows and displaying the sum after the footer has been added to the table.

Let's begin with the JSP found in Listing 6.12. Save this file as \webapps\xmlbook\chapter6\XMLTable.jsp.

Listing 6.12 XMLTable.jsp

<%@ page
  import="org.xml.sax.helpers.*,
  org.xml.sax.*,
  javax.xml.parsers.*,
  xmlbook.chapter6.*" %>
<html>
<head><title>Using SAX to Create a Table</title></head>
<body>
<%
try{
    XMLReader reader = XMLReaderFactory.createXMLReader
        ("org.apache.xerces.parsers.SAXParser");
    reader.setFeature("http://xml.org/sax/features/validation", true);

    //create instance of handler class we wrote
    XMLTable xmlt = new XMLTable(out);

    //register handler class
    reader.setContentHandler(xmlt);
    reader.setErrorHandler(xmlt);

    String  ls_path = request.getServletPath();
    ls_path = ls_path.substring(0,ls_path.indexOf("XMLTable.jsp"));
    String  ls_xml  = application.getRealPath(ls_path + "Books.xml");

    //parse the XML document
    reader.parse(ls_xml);
}
catch (Exception e){
    out.print("<br/><br/><font color=\"red\">there was an error<BR>");
    out.print (e.toString() + "</font>");
}
%>
</body>
</html>

This JSP is very similar to the one found in Listing 6.4. The only differences, which appear as boldface print in Listing 6.12, are as follows:

HTML output title has changed.
The handler class we are instantiating for parser events is different.
The JSP for which we are finding the path has changed.

Next, the example needs a handler class as shown in Listing 6.13. Save this class as \webapps\xmlbook\WEB-INF\classes\xmlbook\chapter6\XMLTable.java.

Listing 6.13 XMLTable.java; Converting from XML to a Table

package xmlbook.chapter6;

import java.io.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.*;

public class XMLTable extends DefaultHandler{
    private Writer w;
    private int intRowCount = 0;
    private String strColor = "red";

    public XMLTable(java.io.Writer new_w){
        w = new_w;
    }
    public void startDocument() throws SAXException{
        try{
            output("<table border=\"2\">");
            output("<tr><th>Pub Date</th><th>Book Title</th>");
            output("<th>Authors</th></tr>");
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void endDocument() throws SAXException{
        try{
            output ("</table><br> Total Books Listed: " + intRowCount);
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes)
    throws SAXException{
        try{
            if(0 == localName.compareTo("BOOK")){
                intRowCount++;
                output("<tr><td>");
                output(attributes.getValue("pubdate") ) ;
            }
            if(0 == localName.compareTo("TITLE"))
                output("<td>");
            if(0 == localName.compareTo("AUTHOR"))
                output("<td>");
            if(0 == localName.compareTo("FOOTER")){
                strColor = "black";
                output("<td colspan=\"3\">");
            }
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void endElement(String uri, String localName, String qName)
    throws SAXException{
        try{
            if(0 == localName.compareTo("BOOK"))
                output("</tr>");
            if(0 == localName.compareTo("TITLE"))
                output("</td>");
            if(0 == localName.compareTo("AUTHOR"))
                output("</td>");
            if(0 == localName.compareTo("FOOTER"))
                output("</td>");
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void characters(char[] ch, int start, int length)
    throws SAXException{
        try{
            String s = new String(ch, start, length);
            output("&nbsp;<font color=\"" + strColor + "\">" +
                s.trim() + "</font>");
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    public void processingInstruction(String target, String data)
    throws SAXException{
        try{
            if(0 == target.compareTo("color"))
                strColor = data;
        }
        catch(Exception e){throw new SAXException(e.toString());}
    }
    private void output (String strOut) throws SAXException{
        try {
            w.write (strOut);
        }
        catch (IOException e) {throw new SAXException ("I/O error", e);}
    }
    public void warning(SAXParseException e)
    throws SAXException{
        output("<BR><font color=\"red\">warning: " + e.getMessage() + "  "
                + e.getSystemId() + " Line: " + e.getLineNumber() + "</font>");
    }
    public void error(SAXParseException e)
    throws SAXException{
        output("<BR><font color=\"red\">error: " + e.getMessage() + "  "
                + e.getSystemId() + " Line: " + e.getLineNumber() + "</font>");
    }
    public void fatalError(SAXParseException e)
    throws SAXException{
        output("<BR><font color=\"red\">fatalError: " + e.getMessage() + "  "
                + e.getSystemId() + " Line: " + e.getLineNumber() + "</font>");
    }
}

This class is somewhat similar to the SAX2Example class, which we created in Listing 6.3. It has the event handlers we need to define for the parser. The details are the same as in Listing 6.3; the difference is that this class writes HTML strings to the Writer upon encountering specific hard-coded XML markup.

Some highlights include the following:

The HTML font tag color attribute changes to the color specified in the color processing instruction.
The rows are counted through a variable incremented upon the startElement() invocation with the BOOK tag. Counters like this can be used in many different ways. In HTML reporting, counters can be used to add page breaks for printing purposes.
The error handling for this class is incomplete and not usable in production systems.
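The first two highlights can be isolated in a much smaller handler. The following sketch is not a replacement for Listing 6.13: it assumes the JDK's built-in JAXP parser rather than the Xerces class named in the listings, ColorAndCountHandler is a hypothetical class name, and it matches on qName because that parser is not namespace aware by default and may report an empty localName.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tracks the current color and counts BOOK elements.
class ColorAndCountHandler extends DefaultHandler {
    String color = "red";   // default until a color instruction is seen
    int bookCount = 0;

    @Override
    public void processingInstruction(String target, String data) {
        // <?color blue?> arrives as target "color" and data "blue".
        if ("color".equals(target)) {
            color = data;
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        if ("BOOK".equals(qName)) {
            bookCount++;
        }
    }

    // Parses the given XML text and returns the handler with its final state.
    static ColorAndCountHandler run(String xml) {
        ColorAndCountHandler handler = new ColorAndCountHandler();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

Keeping state in fields like this is the usual way a SAX handler carries information from one callback to the next, since the parser itself retains nothing about earlier events.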

The results produced by this example are shown in Figure 6.6.

Figure 6.6. Output from XMLTable.jsp.

graphics/06fig06.gif

This example gives you a taste of what is possible using SAX, especially when using extremely large XML documents where the DOM memory footprint is an issue.

Summary

This chapter covered the three most commonly used interfaces of SAX and provided enough information to enable you to get a good start with using SAX. It also discussed various ways that a SAX programmer could get into trouble and how to avoid those problems. The chapter then showed how to use the various callback events that SAX provides for our use in parsing XML.

While the chapter showed the richness of SAX, it also demonstrated that using SAX isn't especially hard. The state of the current development in XML means that many JSP programmers will only use SAX indirectly through various interfaces found in JAXP, JDOM, dom4j, or other XML APIs. However, this chapter has shown that using SAX directly isn't something to fear. It's important as a JSP developer to keep SAX in your back pocket for the tough problems, such as when your business logic requires every ounce of power and speed to work through some XML data. In these cases, armed with the information you learned from this chapter, you should be ready to tackle tough XML parsing.
