5.2 SAX


At the base of nearly all Java and XML APIs is SAX, the Simple API for XML. The first part of making good decisions with SAX is deciding whether to use SAX. Generally, alpha-geek types want to use SAX and nothing else, while everyone else avoids it like the plague. The mystique of using SAX and the complexity that makes it daunting are both poor reasons to decide for or against using SAX. Better criteria are presented in the following questions:

  • Am I only reading and not writing or outputting XML?

  • Is speed my primary concern (over usability, for example)?

  • Do I need to work with only portions of the input XML?

  • Are elements and attributes in the input XML independent (no one part of the document depends on or references another part of the document)?

If you can answer "yes" to all these questions, SAX is well-suited for your application. If you cannot, you might want to think about using DOM, as detailed later in this chapter.

5.2.1 Use the InputSource Class Correctly

When using the SAX API, all input begins with the org.xml.sax.InputSource class. This is a class that allows the specification of an input (e.g., a file or I/O stream), as well as a public and system ID. SAX then extracts this information from the InputSource at parse time and is able to resolve external entities and other document source-specific resources.

In fact, SAX uses the InputSource class even when you do not. Consider the code fragment in Example 5-6, which uses JAXP to initiate a SAX parse.

Example 5-6. Using JAXP to initiate a SAX parse
    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.helpers.DefaultHandler;

    File myFile = ...
    DefaultHandler myHandler = ...

    SAXParserFactory spf = SAXParserFactory.newInstance();
    SAXParser parser = spf.newSAXParser();

    parser.parse(myFile, myHandler);

Even though a java.io.File is passed in to the SAXParser parse( ) method, this is converted to a SAX InputSource before being handed off to the underlying SAX implementation. That's because this JAXP code will eventually hand off its unparsed data to the org.xml.sax.XMLReader class, which offers only the following two signatures for its parse( ) method:

    public void parse(InputSource inputSource);
    public void parse(String systemID);

You might think the second method is easier, but most SAX implementations actually turn around, convert the string-based system ID into an InputSource, and call the first parse( ) method. Put succinctly, all roads lead to the parse( ) method that takes an InputSource.

Because of this, it is better to create an InputSource yourself than to allow a JAXP or SAX implementation to do it for you. In fact, an implementation will often use internal code such as the following to construct the InputSource instance:

    InputSource inputSource = new InputSource();

    // Might be a null parameter
    inputSource.setByteStream(inputStream);

    // Might be a null parameter
    inputSource.setCharacterStream(reader);

    // Might be a null parameter
    inputSource.setSystemId(systemId);

    // Might be a null parameter
    inputSource.setPublicId(publicId);

    // Derived parameter
    inputSource.setEncoding(encoding);

However, many implementations pass these methods null parameters. And while this might not take a lot of time, every second in an XML parsing application can be critical. By constructing an InputSource yourself, you can cut this down to one or two method invocations:

    InputSource inputSource = new InputSource(myInputStream);
    inputSource.setSystemId("http://www.oreilly.com");

    // Note that if you use an input stream in the InputSource
    // constructor above, this step is not necessary.
    inputSource.setEncoding(myEncoding);

Note that you should also always use the setEncoding( ) method to tell the SAX parser which encoding to use; this is critical in XML applications in which internationalization is a concern or in which you are using multibyte character sets. Because this is generally the case when XML is being used in the first place, this should always be a consideration. Unfortunately, it's quite common to see the encoding manually set to a character encoding that is different from that of the supplied input (whether a java.io.InputStream or a java.io.Reader). This can cause all sorts of nasty parsing problems! To avoid this, you should always create your InputSource with an InputStream rather than a Reader or String system ID. When you take this approach, the SAX implementation will wrap the stream in an InputStreamReader and will automatically detect the correct character encoding from the stream, typically from its byte-order mark and XML declaration. [3]

[3] To be completely accurate, I should say that all SAX parser implementations I have ever come across perform this step. It would be rare to find a parser that does not wrap an InputStream in an InputStreamReader because the parser allows for automatic encoding detection.
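The following is a minimal sketch of that advice, not a definitive implementation: it builds the InputSource from a byte stream and leaves encoding detection to the parser. The class name InputSourceDemo, the helper parseText( ), and the sample document are assumptions made for illustration; the system ID is the one used in the fragment above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names here are illustrative only.
class InputSourceDemo {

    /** Parses a byte stream and returns all character data in the document. */
    static String parseText(InputStream in, String systemId) throws Exception {
        // Hand the parser raw bytes so it can detect the encoding itself.
        InputSource source = new InputSource(in);
        source.setSystemId(systemId);

        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The document declares ISO-8859-1 and is encoded that way; no
        // setEncoding() call is made.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<name>Jos\u00e9</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(
            parseText(new ByteArrayInputStream(bytes), "http://www.oreilly.com"));
    }
}
```

Because the parser sees the bytes and the XML declaration together, there is no separately specified encoding that could disagree with the stream.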

5.2.2 Understand How SAX Handles Entity Resolution

Another basic building block of the SAX API is the process of entity resolution. This process is handled through the org.xml.sax.EntityResolver interface. Like the aforementioned InputSource, the EntityResolver interface is often overlooked and ignored by SAX developers. However, through the use of a solid EntityResolver implementation, XML parsing speed can be dramatically enhanced.

At its simplest, an EntityResolver tells a SAX parser implementation how to look up resources specified in an XML document (such as entity references). For example, take a look at the following XML document fragment:

    <entityContainer>
      <entity>&reference;</entity>
    </entityContainer>

This document fragment illustrates an entity reference named reference. When a parser runs across this entity reference, it begins the process of resolving that entity. The parser will first consult the document's DTD for a definition, like this:

    <!ENTITY reference
        PUBLIC "-//O'Reilly//TEXT Best Practices Reference//EN"
        "reference.xml"
    >

From this, it gains both the public ID (-//O'Reilly//TEXT Best Practices Reference//EN) and system ID (reference.xml) of the entity reference. At this point, the parser checks to see if an implementation of the EntityResolver interface has been registered with the setEntityResolver( ) method on an XMLReader instance. If one has been registered, the parser invokes the resolveEntity( ) method with the public and system IDs extracted from the declaration. Example 5-7 shows an EntityResolver implementation at its simplest.

Example 5-7. The simplest EntityResolver
    import java.io.IOException;

    import org.xml.sax.SAXException;
    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;

    public class SimpleEntityResolver implements EntityResolver {

        public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {

            // Returning null means use normal resolution.
            return null;
        }
    }

The method in Example 5-7 does nothing except return null, which signifies to the parser that normal resolution should occur. This means that the public ID and then the system ID (if needed) are looked up using the Internet or local filesystem, as specified by the IDs of the reference.

The problem is that looking up resources on the Internet is time-consuming. One common practice is to download all required references and resources to a local filesystem. However, to ensure that these are used instead of the online resources, developers typically change the DTD or schema of the XML, pointing the system ID of the reference to a local copy of a file or resource. Such an approach is problematic because it forces a change in the constraint model of the document, and it ties the reference to a specific file, on a specific filesystem, in a specific location.

A much better solution is to leave the DTD and schema alone, and to package any needed resources in a JAR file with the XML document and Java classes doing the parsing. This does not affect the XML document or its constraints, and is also independent of the filesystem. Furthermore, it makes deployment very simple, as all required resources are in the JAR file. This archive should then be added to the Java classpath. The final step is to register an entity resolver that looks up named resources (the system ID of the reference, specifically) on the current classpath. Example 5-8 shows just such a resolver.

Example 5-8. Resolving entities through the classpath
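    import java.io.InputStream;
    import java.io.IOException;

    import org.xml.sax.SAXException;
    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;

    public class ClassPathEntityResolver implements EntityResolver {

        public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {

            if (systemId == null) {
                return null;
            }

            // Parsers commonly expand a relative system ID into an absolute
            // URI, so look up only the final path segment (e.g., reference.xml).
            String name = systemId.substring(systemId.lastIndexOf('/') + 1);

            // Use this class's own loader; the leading slash searches from the
            // top level of the classpath rather than from this class's package.
            InputStream inputStream =
                ClassPathEntityResolver.class.getResourceAsStream("/" + name);
            if (inputStream == null) {
                // Nothing found; returning null requests normal processing.
                return null;
            }

            InputSource inputSource = new InputSource(inputStream);
            inputSource.setSystemId(systemId);
            return inputSource;
        }
    }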

So, if the system ID of your XML reference is reference.xml, you simply place a resource file at the top level of your JAR file, and ensure that it is named reference.xml.
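To see registration end to end, the following sketch wires a resolver into an XMLReader with setEntityResolver( ) and parses the fragment shown earlier. So that it runs without a JAR file, it uses a hypothetical MapEntityResolver that serves entity content from an in-memory map, as a stand-in for a classpath lookup; ResolverDemo and parseWithResolver( ) are likewise illustrative names. It also assumes the parser is configured to read external general entities, which is typically the default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names here are illustrative only.
class ResolverDemo {

    /** Stand-in for a classpath lookup: serves entities keyed by file name. */
    static class MapEntityResolver implements EntityResolver {
        private final Map<String, String> resources;

        MapEntityResolver(Map<String, String> resources) {
            this.resources = resources;
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, IOException {
            if (systemId == null) {
                return null;
            }
            // The parser may pass an absolute URI; match on the last segment.
            String name = systemId.substring(systemId.lastIndexOf('/') + 1);
            String content = resources.get(name);
            if (content == null) {
                return null; // Fall back to normal resolution.
            }
            InputSource source = new InputSource(new StringReader(content));
            source.setSystemId(systemId);
            return source;
        }
    }

    /** Parses xml with the given resolver and returns its character data. */
    static String parseWithResolver(String xml, EntityResolver resolver)
            throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);

        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE entityContainer [<!ENTITY reference PUBLIC "
            + "\"-//O'Reilly//TEXT Best Practices Reference//EN\" "
            + "\"reference.xml\">]>"
            + "<entityContainer><entity>&reference;</entity></entityContainer>";
        EntityResolver resolver = new MapEntityResolver(
            Collections.singletonMap("reference.xml", "resolved locally"));
        System.out.println(parseWithResolver(xml, resolver));
    }
}
```

Swapping in a classpath-based resolver changes only the object passed to setEntityResolver( ); the document and its DTD stay untouched.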

5.2.3 Consider Using Partial Validation

Another cornerstone of SAX-based programming is validation. Of course, in standard XML parlance, validation means ensuring that an XML document conforms to a set of constraints, usually specified by a DTD or XML schema. [4] I state the obvious definition here, though, to challenge it.

[4] Although I should say that alternate constraint models such as Relax NG are looking very promising.

Certainly, this traditional type of validation has its place. If you are receiving XML documents from an untrusted source, or if you allow manual editing of XML documents, it is probably a good idea to validate these documents to ensure that nothing unexpected has occurred and that your applications don't crater on invalid XML. [5]

[5] Of course, you don't really know if you can trust the doctype in an untrusted XML document, but that's another problem.

Validation is achieved in SAX through the setFeature( ) method of the XMLReader object:

    XMLReader reader = XMLReaderFactory.createXMLReader();
    reader.setFeature("http://xml.org/sax/features/validation", true);
    reader.parse(myInputSource);

If you're using SAX through the JAXP wrapper layer, this would change to:

    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setValidating(true);
    SAXParser parser = spf.newSAXParser();
    parser.parse(myInputSource, myDefaultHandler);
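As a rough, runnable illustration of the JAXP form, the sketch below validates a document against a DTD carried in its internal subset and collects whatever the parser reports. Two points are assumptions layered on the text: validity problems arrive through the handler's error( ) callback, which DefaultHandler leaves empty, so the sketch overrides it; and the DTD is a trimmed-down purchaseOrder declaration made up for the example. ValidationDemo and validate( ) are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names here are illustrative only.
class ValidationDemo {

    // Trimmed-down DTD used only for this sketch.
    static final String DOCTYPE =
          "<!DOCTYPE purchaseOrder ["
        + "<!ELEMENT purchaseOrder (item+)>"
        + "<!ELEMENT item (#PCDATA)>"
        + "<!ATTLIST purchaseOrder"
        + " id CDATA #REQUIRED"
        + " tellerID CDATA #REQUIRED"
        + " orderDate CDATA #REQUIRED>"
        + "]>";

    /** Parses xml with DTD validation on and returns the reported errors. */
    static List<String> validate(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setValidating(true);
        SAXParser parser = spf.newSAXParser();

        List<String> errors = new ArrayList<>();
        parser.parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                // Validity errors are recoverable, so they are reported
                // here instead of aborting the parse.
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String valid = DOCTYPE
            + "<purchaseOrder id=\"1\" tellerID=\"t7\" orderDate=\"2002-11-01\">"
            + "<item>book</item></purchaseOrder>";
        String invalid = DOCTYPE
            + "<purchaseOrder id=\"1\" orderDate=\"2002-11-01\">"
            + "<item>book</item></purchaseOrder>";

        System.out.println("valid document errors:   " + validate(valid));
        System.out.println("invalid document errors: " + validate(invalid));
    }
}
```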

The problem with this blanket validation is that it is extremely processing-intensive. Validating every element, every attribute, the content within elements, the resolved content of entity references, and more can take a great deal of time. While you can certainly try to validate in development and avoid validation in production, this is not an option if XML is dynamically generated or passed around and, therefore, prone to errors. The typical case is that validation must be left on in production, and all the penalties associated with it remain.

A much better solution is to put some customized validation in place. This approach allows you to assign business rules to your validation. To get a better idea of this, consider the following fragment from a DTD:

    <!ELEMENT purchaseOrder (item+, billTo, shipTo, payment)>
    <!ATTLIST purchaseOrder
              id          CDATA    #REQUIRED
              tellerID    CDATA    #REQUIRED
              orderDate   CDATA    #REQUIRED
    >

With traditional validation, when a purchaseOrder element is processed, the parser must ensure it has at least one child item, as well as a billTo, shipTo, and payment child. Validation also ensures that the purchaseOrder element has an id, tellerID, and orderDate attribute. On its face, this sounds great. These are all required, so there should be no problem. However, all this data would rarely be used in the same business component. In one application component, you might need to know the ID of the teller who input the order and the date it was input; this would be common in an audit of employee transactions. In another, such as order fulfillment, you might need the element children but none of the attributes.

In both cases, only partial validation is really required. In other words, only a subset of the complete set of constraints needs to be checked. If you handle this partial validation yourself, and turn off validation on the parser, you can achieve some drastic performance improvements. For example, if you need to ensure that the tellerID and orderDate attributes are present, you could turn off validation:

 reader.setFeature("http://xml.org/sax/features/validation", false); 

You could then handle this custom, partial validation in the SAX startElement( ) callback, as shown in Example 5-9.

Example 5-9. Handling partial validation with SAX
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes attributes)
        throws SAXException {

        // Handle custom validation.
        // Note: localName is populated only when namespace processing is on.
        if (localName.equals("purchaseOrder")) {
            if (attributes.getIndex("tellerID") < 0) {
                throw new SAXException(
                    "Error: purchaseOrder elements must contain " +
                    "a tellerID attribute.");
            }
            if (attributes.getIndex("orderDate") < 0) {
                throw new SAXException(
                    "Error: purchaseOrder elements must contain " +
                    "an orderDate attribute.");
            }
        }

        // Normal XML processing
    }

This might seem overly simple; however, by implementing this instead of a complete validation, you will see tremendous performance improvements in your applications. Additionally, you get the benefit of some self-documentation in your code; through this code fragment, it is simple to see which business rules must be followed in your XML.
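For a version you can run, the sketch below places the same two checks in a complete handler and drives it through JAXP with validation switched off. The names PartialValidationDemo, PurchaseOrderHandler, requireAttribute( ), and passes( ) are assumptions made for illustration. The handler matches on qName rather than localName because a default JAXP factory is not namespace-aware, in which case localName may be empty.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names here are illustrative only.
class PartialValidationDemo {

    /** Checks only the attributes this component actually depends on. */
    static class PurchaseOrderHandler extends DefaultHandler {
        @Override
        public void startElement(String namespaceURI, String localName,
                                 String qName, Attributes attributes)
                throws SAXException {
            if (qName.equals("purchaseOrder")) {
                requireAttribute(attributes, "tellerID");
                requireAttribute(attributes, "orderDate");
            }
            // Normal XML processing would continue here.
        }

        private static void requireAttribute(Attributes attributes, String name)
                throws SAXException {
            if (attributes.getIndex(name) < 0) {
                throw new SAXException(
                    "purchaseOrder elements must contain a " + name
                    + " attribute.");
            }
        }
    }

    /** Returns true if xml parses and satisfies the partial rules. */
    static boolean passes(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setValidating(false); // Full DTD validation stays off.
        try {
            spf.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new PurchaseOrderHandler());
            return true;
        } catch (SAXException e) {
            // Covers both the custom rules and well-formedness errors.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(passes(
            "<purchaseOrder id=\"1\" tellerID=\"t7\" orderDate=\"2002-11-01\"/>"));
        System.out.println(passes(
            "<purchaseOrder id=\"1\" tellerID=\"t7\"/>"));
    }
}
```

Throwing from the callback stops the parse at the first violated rule, so a rejected document costs no more processing than the elements read up to that point.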



The O'Reilly Java Authors - Java Enterprise Best Practices
ISBN: N/A
EAN: N/A
Year: 2002
Pages: 96
