The XMLFilter Interface | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Example 8.1 shows the actual code for the XMLFilter interface. In addition to the methods it inherits from the XMLReader superinterface, XMLFilter has just two new methods, getParent() and setParent(). The parent of a filter is the XMLReader to which the filter delegates most of its work. (In the context of SAX filters, the parent is not normally understood to be the superclass of the filter class.)

Example 8.1 The XMLFilter Interface

    package org.xml.sax;

    public interface XMLFilter extends XMLReader {

      public void      setParent(XMLReader parent);
      public XMLReader getParent();

    }

A class that implements this interface must provide a minimum of 16 methods: the getParent() and setParent() methods declared here and the 14 methods of the XMLReader superinterface. Example 8.2 is a minimal XML filter that implements all of these methods but doesn't actually do anything.

Example 8.2 A Filter That Blocks All Events

    import org.xml.sax.*;

    public class OpaqueFilter implements XMLFilter {

      private XMLReader parent;

      public void setParent(XMLReader parent) {
        this.parent = parent;
      }

      public XMLReader getParent() {
        return this.parent;
      }

      public boolean getFeature(String name)
       throws SAXNotRecognizedException {
        throw new SAXNotRecognizedException(name);
      }

      public void setFeature(String name, boolean value)
       throws SAXNotRecognizedException {
        throw new SAXNotRecognizedException(name);
      }

      public Object getProperty(String name)
       throws SAXNotRecognizedException {
        throw new SAXNotRecognizedException(name);
      }

      public void setProperty(String name, Object value)
       throws SAXNotRecognizedException {
        throw new SAXNotRecognizedException(name);
      }

      public void setEntityResolver(EntityResolver resolver) {}

      public EntityResolver getEntityResolver() {
        return null;
      }

      public void setDTDHandler(DTDHandler handler) {}

      public DTDHandler getDTDHandler() {
        return null;
      }

      public void setContentHandler(ContentHandler handler) {}

      public ContentHandler getContentHandler() {
        return null;
      }

      public void setErrorHandler(ErrorHandler handler) {}

      public ErrorHandler getErrorHandler() {
        return null;
      }

      // Neither parse() method ever calls the parent, so no events
      // reach the client application
      public void parse(InputSource input) {}

      public void parse(String systemID) {}

    }

The effect of attaching this filter to a parser is to block events completely. It's like a brick wall between the application and the data in the XML document. A client application that wants to use this filter (although I really can't imagine why one would) would construct an instance of it and an instance of a real parser, then pass the real parser to the filter's setParent() method.

    XMLReader parser = XMLReaderFactory.createXMLReader();
    OpaqueFilter filter = new OpaqueFilter();
    filter.setParent(parser);

From this point forward, the client application should only interact with the filter and forget that the original parser exists. Going behind the back of the filter, for example, by calling setContentHandler() on parser instead of on filter, runs the risk of confusing the filter by violating constraints it expects to be true. In fact, if at all possible, you should eliminate any references to the original parser so that you can't accidentally access it later. For example,

    XMLReader parser = XMLReaderFactory.createXMLReader();
    OpaqueFilter filter = new OpaqueFilter();
    filter.setParent(parser);
    // Drop the only direct reference to the underlying parser
    parser = filter;

In some cases the filter may set up its own parent parser, typically in its constructor. This avoids the need for the client application to provide the filter with an XMLReader. For example,

    public OpaqueFilter(XMLReader parent) {
      this.parent = parent;
    }

    public OpaqueFilter() throws SAXException {
      this(XMLReaderFactory.createXMLReader());
    }

You might even design the setParent() method to make it impossible to change the parent parser once it has initially been set in the constructor. For example,

    public void setParent(XMLReader parent) {
      throw new UnsupportedOperationException(
       "Can't change this filter's parent"
      );
    }

This does tend to limit the flexibility of the filter; in particular, it prevents you from putting it in the middle of a long chain of filters.
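To see why a fixed parent gets in the way, it helps to picture a chain: each filter's parent is the next filter down, and only the last filter wraps a real parser. The following is a minimal sketch of such a chain, not one of this chapter's numbered examples. It uses the standard helper class org.xml.sax.helpers.XMLFilterImpl as a stand-in pass-through filter and obtains the parser through JAXP; the class name FilterChainSketch and the method countElementsThroughChain() are names I made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class FilterChainSketch {

  // Parses the XML text through two chained pass-through filters and
  // returns the number of startElement events the client receives.
  public static int countElementsThroughChain(String xml) {
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

      // The inner filter sits directly on top of the real parser
      XMLFilter inner = new XMLFilterImpl();
      inner.setParent(parser);

      // The outer filter's parent is another filter, not a parser.
      // This is the step a fixed-parent filter would make impossible.
      XMLFilter outer = new XMLFilterImpl();
      outer.setParent(inner);

      final int[] count = {0};
      // The client talks only to the outermost filter
      outer.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String uri, String localName,
         String qName, Attributes atts) {
          count[0]++;
        }
      });
      outer.parse(new InputSource(new StringReader(xml)));
      return count[0];
    }
    catch (Exception e) {
      // Wrapped only to keep this sketch's signature simple
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Three elements (a, b, c), so this prints 3
    System.out.println(countElementsThroughChain("<a><b/><c/></a>"));
  }

}
```

A filter whose setParent() always throws could only ever be the bottom link of such a chain, sitting directly on the parser it created for itself.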

Example 8.3 is a marginally more interesting implementation of the XMLFilter interface. It delegates to the parent XMLReader by forwarding all method calls from the client application. It does not change or filter anything.

Example 8.3 A Filter That Filters Nothing

    import org.xml.sax.*;
    import java.io.IOException;

    public class TransparentFilter implements XMLFilter {

      private XMLReader parent;

      public void setParent(XMLReader parent) {
        this.parent = parent;
      }

      public XMLReader getParent() {
        return this.parent;
      }

      public boolean getFeature(String name)
       throws SAXNotRecognizedException, SAXNotSupportedException {
        return parent.getFeature(name);
      }

      public void setFeature(String name, boolean value)
       throws SAXNotRecognizedException, SAXNotSupportedException {
        parent.setFeature(name, value);
      }

      public Object getProperty(String name)
       throws SAXNotRecognizedException, SAXNotSupportedException {
        return parent.getProperty(name);
      }

      public void setProperty(String name, Object value)
       throws SAXNotRecognizedException, SAXNotSupportedException {
        parent.setProperty(name, value);
      }

      public void setEntityResolver(EntityResolver resolver) {
        parent.setEntityResolver(resolver);
      }

      public EntityResolver getEntityResolver() {
        return parent.getEntityResolver();
      }

      public void setDTDHandler(DTDHandler handler) {
        parent.setDTDHandler(handler);
      }

      public DTDHandler getDTDHandler() {
        return parent.getDTDHandler();
      }

      public void setContentHandler(ContentHandler handler) {
        parent.setContentHandler(handler);
      }

      public ContentHandler getContentHandler() {
        return parent.getContentHandler();
      }

      public void setErrorHandler(ErrorHandler handler) {
        parent.setErrorHandler(handler);
      }

      public ErrorHandler getErrorHandler() {
        return parent.getErrorHandler();
      }

      public void parse(InputSource input)
       throws SAXException, IOException {
        parent.parse(input);
      }

      public void parse(String systemId)
       throws SAXException, IOException {
        parent.parse(systemId);
      }

    }

Of course, in most cases you're not going to go to either of these extremes. You're going to pass some events through unchanged, block others, and modify still others. Let's continue with a filter that adds a property to the list of properties normally supported by an XML parser. This property will provide the wall-clock time needed to parse an XML document, and it might be useful for benchmarking. I'll write it as a filter so that it can be attached to different underlying parsers and used in benchmarks that include various content handlers.

The property name will be http://cafeconleche.org/properties/wallclock/ . The value will be a java.lang.Long object containing the number of milliseconds needed to parse the last document. This can be stored in a private field initialized to null:

 private Long wallclock = null;

The wallclock time is available only after the parse() method has returned. At other times, requesting this property throws a SAXNotSupportedException. Because this will be a read-only property, trying to set it will always throw a SAXNotSupportedException. It will be implemented through the setProperty() and getProperty() methods:

    public Object getProperty(String name)
     throws SAXNotRecognizedException, SAXNotSupportedException {
      if ("http://cafeconleche.org/properties/wallclock/"
       .equals(name)) {
        if (wallclock != null) {
          return wallclock;
        }
        else {
          throw
           new SAXNotSupportedException("Timing not available");
        }
      }
      return parent.getProperty(name);
    }

    public void setProperty(String name, Object value)
     throws SAXNotRecognizedException, SAXNotSupportedException {
      if ("http://cafeconleche.org/properties/wallclock/"
       .equals(name)) {
        throw new SAXNotSupportedException(
         "Wallclock property is read-only");
      }
      parent.setProperty(name, value);
    }

For any property other than http://cafeconleche.org/properties/wallclock/ , these calls simply delegate the work to the parent parser.
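From the client's side, the two exceptions mean different things: SAXNotRecognizedException says the reader has never heard of the name, whereas SAXNotSupportedException says the name is known but the request can't be honored right now. The sketch below shows one way a client might tell the cases apart. It is an illustration under a few assumptions rather than part of the benchmarking filter: TimedFilter is a cut-down stand-in built on the standard org.xml.sax.helpers.XMLFilterImpl class, and recordTime() and probe() are hypothetical helpers that exist only so the property logic can be exercised without parsing anything.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class ReadOnlyPropertySketch {

  public static final String WALLCLOCK
   = "http://cafeconleche.org/properties/wallclock/";

  // Stand-in for the timing filter; only the property logic is shown
  public static class TimedFilter extends XMLFilterImpl {

    private Long wallclock = null;

    // Hypothetical hook; the real filter sets this inside parse()
    public void recordTime(long millis) {
      this.wallclock = Long.valueOf(millis);
    }

    @Override
    public Object getProperty(String name)
     throws SAXNotRecognizedException, SAXNotSupportedException {
      if (WALLCLOCK.equals(name)) {
        if (wallclock != null) return wallclock;
        throw new SAXNotSupportedException("Timing not available");
      }
      // XMLFilterImpl delegates to the parent, or reports the name
      // as unrecognized when no parent has been set
      return super.getProperty(name);
    }

    @Override
    public void setProperty(String name, Object value)
     throws SAXNotRecognizedException, SAXNotSupportedException {
      if (WALLCLOCK.equals(name)) {
        throw new SAXNotSupportedException(
         "Wallclock property is read-only");
      }
      super.setProperty(name, value);
    }

  }

  // Client-side view: describe what a reader says about a property
  public static String probe(XMLReader reader, String name) {
    try {
      return "value: " + reader.getProperty(name);
    }
    catch (SAXNotRecognizedException e) {
      return "not recognized";
    }
    catch (SAXNotSupportedException e) {
      return "not supported";
    }
  }

  public static void main(String[] args) {
    TimedFilter filter = new TimedFilter();
    System.out.println(probe(filter, WALLCLOCK)); // before any timing
    filter.recordTime(42);
    System.out.println(probe(filter, WALLCLOCK)); // after a timing
    System.out.println(probe(filter, "http://example.com/unknown"));
  }

}
```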

The parse() method is responsible for tracking the wallclock time. I'll put the work in the parse() method that takes an InputSource as an argument, and then call this method from the other overloaded parse() method that takes a system ID as the argument.

Using the filter enables some standard benchmarking techniques. First, I'll read the entire document into a byte array named cache so that it can be parsed from memory. This will eliminate most of the I/O time that would otherwise likely swamp the actual parsing time, especially if the test document were read from a slow network connection. This actually requires separate handling for the three possible sources an InputSource may offer: character stream, byte stream, and system ID:

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Reader charStream = input.getCharacterStream();
    InputStream byteStream = input.getByteStream();
    String encoding = null; // I will only set this variable if
                            // we have a reader because in this
                            // case we know the encoding is UTF-8
                            // regardless of what the encoding
                            // declaration says
    if (charStream != null) {
      OutputStreamWriter filter
       = new OutputStreamWriter(out, "UTF-8");
      int c;
      while ((c = charStream.read()) != -1) filter.write(c);
      // The writer buffers its output; without this flush the
      // tail of the document might never reach the byte array
      filter.flush();
      encoding = "UTF-8";
    }
    else if (byteStream != null) {
      int c;
      while ((c = byteStream.read()) != -1) out.write(c);
    }
    else {
      URL u = new URL(input.getSystemId());
      InputStream in = u.openStream();
      int c;
      while ((c = in.read()) != -1) out.write(c);
      in.close();
    }
    out.flush();
    out.close();
    byte[] cache = out.toByteArray();
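Copying one byte or character at a time is easy to read but slow for large documents. A buffered variant of the caching step might look like the following sketch. The class name DocumentCacheSketch and the methods cacheBytes() and cacheChars() are my own inventions, not part of the filter, and I/O failures are rethrown unchecked purely to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class DocumentCacheSketch {

  // Copies a byte stream into memory one block at a time
  public static byte[] cacheBytes(InputStream in) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) != -1) out.write(buffer, 0, n);
      return out.toByteArray();
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Re-encodes a character stream as UTF-8 bytes held in memory
  public static byte[] cacheChars(Reader reader) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      Writer writer
       = new OutputStreamWriter(out, StandardCharsets.UTF_8);
      char[] buffer = new char[8192];
      int n;
      while ((n = reader.read(buffer)) != -1) {
        writer.write(buffer, 0, n);
      }
      // Flush the encoder before asking for the bytes
      writer.flush();
      return out.toByteArray();
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

}
```

Whichever form you use, a document that arrives as a character stream is cached as UTF-8, so the InputSource later handed to the parent parser needs its encoding set to match.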

Next, I'll warm up the just-in-time (JIT) compiler with ten untimed parses of the document before I begin taking measurements:

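    // A fresh InputSource that reads from the cached bytes
    InputSource is = new InputSource();
    if (encoding != null) is.setEncoding(encoding);
    for (int i=0; i < 10; i++) {
      InputStream in = new ByteArrayInputStream(cache);
      is.setByteStream(in);
      parent.parse(is);
    }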

Finally, I'll parse the same document 1,000 times and set wallclock to the average of the 1,000 parses:

    Date start = new Date();
    for (int i=0; i < 1000; i++) {
      InputStream in = new ByteArrayInputStream(cache);
      is.setByteStream(in);
      parent.parse(is);
    }
    Date finish = new Date();
    long totalTime = finish.getTime() - start.getTime();
    // Average the time; Long.valueOf() replaces the deprecated
    // Long constructor
    this.wallclock = Long.valueOf(totalTime/1000);
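The warm-up-then-measure pattern isn't specific to XML parsing. As a rough sketch, it can be pulled out into a helper like the one below. Note the swapped-in technique: this version reads System.nanoTime(), which is meant for measuring elapsed intervals, rather than subtracting two Date objects as the filter does. The names AverageTimerSketch and averageMillis() are hypothetical.

```java
public class AverageTimerSketch {

  // Runs the task warmups times untimed, then runs times timed, and
  // returns the average elapsed time per run in whole milliseconds.
  public static long averageMillis(Runnable task, int warmups,
   int runs) {
    if (runs <= 0) {
      throw new IllegalArgumentException("runs must be positive");
    }
    // Untimed passes give the JIT a chance to compile the hot paths
    for (int i = 0; i < warmups; i++) task.run();

    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) task.run();
    long elapsedNanos = System.nanoTime() - start;

    // Integer division truncates, just as totalTime/1000 does
    return elapsedNanos / runs / 1_000_000L;
  }

  public static void main(String[] args) {
    // A trivial stand-in task; a parse would go here instead
    long ms = averageMillis(() -> Math.sqrt(12345.678), 10, 1000);
    System.out.println("Average: " + ms + " ms");
  }

}
```

Because parse() throws checked exceptions, a real parsing task would have to catch SAXException and IOException inside the Runnable, which is one reason the filter simply inlines the loops.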

Example 8.4 demonstrates the complete benchmarking filter. In addition to the previously described methods, it contains implementations of the other XMLReader methods that all forward their arguments to the equivalent method in the parent parser.

Example 8.4 A Filter That Times All Parsing

    import org.xml.sax.*;
    import java.io.*;
    import java.util.Date;
    import java.net.URL;

    public class WallclockFilter implements XMLFilter {

      private XMLReader parent;
      private Long wallclock = null;

      public Object getProperty(String name)
       throws SAXNotRecognizedException, SAXNotSupportedException {
        if ("http://cafeconleche.org/properties/wallclock/"
         .equals(name)) {
          if (wallclock != null) {
            return wallclock;
          }
          else {
            throw
             new SAXNotSupportedException("Timing not available");
          }
        }
        return parent.getProperty(name);
      }

      public void setProperty(String name, Object value)
       throws SAXNotRecognizedException, SAXNotSupportedException {
        if ("http://cafeconleche.org/properties/wallclock/"
         .equals(name)) {
          throw new SAXNotSupportedException(
           "Wallclock property is read-only");
        }
        parent.setProperty(name, value);
      }

      public void setParent(XMLReader parent) {
        this.parent = parent;
      }

      public XMLReader getParent() {
        return this.parent;
      }

      public void parse(InputSource input)
       throws SAXException, IOException {

        // Reset the time
        this.wallclock = null;

        // Cache the document
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Reader charStream = input.getCharacterStream();
        InputStream byteStream = input.getByteStream();
        String encoding = null; // I will only set this variable if
                                // we have a reader because in this
                                // case we know the encoding is UTF-8
                                // regardless of what the encoding
                                // declaration says
        if (charStream != null) {
          OutputStreamWriter filter
           = new OutputStreamWriter(out, "UTF-8");
          int c;
          while ((c = charStream.read()) != -1) filter.write(c);
          // The writer buffers its output; without this flush the
          // tail of the document might never reach the byte array
          filter.flush();
          encoding = "UTF-8";
        }
        else if (byteStream != null) {
          int c;
          while ((c = byteStream.read()) != -1) out.write(c);
        }
        else {
          URL u = new URL(input.getSystemId());
          InputStream in = u.openStream();
          int c;
          while ((c = in.read()) != -1) out.write(c);
          in.close();
        }
        out.flush();
        out.close();
        byte[] cache = out.toByteArray();

        InputSource is = new InputSource();
        if (encoding != null) is.setEncoding(encoding);

        // Warm up the JIT
        for (int i=0; i < 10; i++) {
          InputStream in = new ByteArrayInputStream(cache);
          is.setByteStream(in);
          parent.parse(is);
        }
        System.gc();

        // Parse 1000 times
        Date start = new Date();
        for (int i=0; i < 1000; i++) {
          InputStream in = new ByteArrayInputStream(cache);
          is.setByteStream(in);
          parent.parse(is);
        }
        Date finish = new Date();
        long totalTime = finish.getTime() - start.getTime();

        // Average the time; Long.valueOf() replaces the deprecated
        // Long constructor
        this.wallclock = Long.valueOf(totalTime/1000);

      }

      public void parse(String systemID)
       throws SAXException, IOException {
        this.parse(new InputSource(systemID));
      }

      // Methods that delegate to the parent XMLReader
      public boolean getFeature(String name)
       throws SAXNotRecognizedException, SAXNotSupportedException {
        return parent.getFeature(name);
      }

      public void setFeature(String name, boolean value)
       throws SAXNotRecognizedException, SAXNotSupportedException {
        parent.setFeature(name, value);
      }

      public void setEntityResolver(EntityResolver resolver) {
        parent.setEntityResolver(resolver);
      }

      public EntityResolver getEntityResolver() {
        return parent.getEntityResolver();
      }

      public void setDTDHandler(DTDHandler handler) {
        parent.setDTDHandler(handler);
      }

      public DTDHandler getDTDHandler() {
        return parent.getDTDHandler();
      }

      public void setContentHandler(ContentHandler handler) {
        parent.setContentHandler(handler);
      }

      public ContentHandler getContentHandler() {
        return parent.getContentHandler();
      }

      public void setErrorHandler(ErrorHandler handler) {
        parent.setErrorHandler(handler);
      }

      public ErrorHandler getErrorHandler() {
        return parent.getErrorHandler();
      }

    }

We still need a driver class that (1) constructs a filter XMLReader and a normal parser XMLReader, (2) connects them to each other, and (3) parses the test document. Example 8.5 demonstrates such a class that contains a simple main() method to benchmark a document named on the command line. Because no handlers are installed, it tests raw parsing time. If I wanted to test the behavior of different parsers with various callback interfaces, I could install them on the filter before parsing. After parsing, WallclockDriver reads the value of the wallclock property from the filter. To change which parser is tested, I would set different values for the org.xml.sax.driver system property.

Example 8.5 Parsing a Document through a Filter

    import org.xml.sax.*;
    import org.xml.sax.helpers.XMLReaderFactory;
    import java.io.IOException;

    public class WallclockDriver {

      public static void main(String[] args) {

        if (args.length <= 0) {
          System.out.println("Usage: java WallclockDriver URL");
          return;
        }
        String document = args[0];

        try {
          XMLFilter filter = new WallclockFilter();
          filter.setParent(XMLReaderFactory.createXMLReader());
          filter.parse(document);
          Long parseTime = (Long) filter.getProperty(
           "http://cafeconleche.org/properties/wallclock/");
          double seconds = parseTime.longValue()/1000.0;
          System.out.println("Parsing " + document + " took "
           + seconds + " seconds on average.");
        }
        catch (SAXException e) {
          e.printStackTrace();
          System.out.println(e);
        }
        catch (IOException e) {
          e.printStackTrace();
          System.out.println(
           "Due to an IOException, the parser could not check "
           + args[0]
          );
        }

      }

    }

I ran the XML form of the second edition of the XML 1.0 specification through this program with a few different parsers, using Sun's Java Runtime Environment 1.3.1 on my 300 MHz Pentium II running Windows NT 4.0SP6. This isn't a scientific test, but the output is nonetheless mildly interesting: ^[1]

^[1] At a minimum, a scientific test would require testing many different documents on multiple virtual and physical machines, considering pauses that might be caused by garbage collection, making sure background processes were kept to a minimum, and taking multiple measurements to test reproducibility of the results.

    % java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader
      WallclockDriver http://www.w3.org/TR/2000/REC-xml-20001006.xml
    Parsing http://www.w3.org/TR/2000/REC-xml-20001006.xml
     took 1.209 seconds on average.
    % java -Dorg.xml.sax.driver=org.apache.crimson.parser.XMLReaderImpl
      WallclockDriver http://www.w3.org/TR/2000/REC-xml-20001006.xml
    Parsing http://www.w3.org/TR/2000/REC-xml-20001006.xml
     took 1.414 seconds on average.
    % java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
      WallclockDriver http://www.w3.org/TR/2000/REC-xml-20001006.xml
    Parsing http://www.w3.org/TR/2000/REC-xml-20001006.xml
     took 1.133 seconds on average.
    % java -Dorg.xml.sax.driver=com.bluecast.xml.Piccolo
      WallclockDriver http://www.w3.org/TR/2000/REC-xml-20001006.xml
    Parsing http://www.w3.org/TR/2000/REC-xml-20001006.xml
     took 0.849 seconds on average.

The four parsers I tested here were all fairly close to one another in raw performance. In fact, given that I'm testing the wallclock time instead of the actual time used by this program alone, I'd venture that the differences are all within the margin of error for the test. Of course, when choosing a parser, you would want to run this across your own documents with your own content handlers in place.
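One rough way to put a number on that margin of error is to run the whole benchmark several times per parser and compare the spread of the results with the gap between parsers. The sketch below computes the mean and the sample standard deviation of a set of timings. The class and method names are hypothetical, and the figures in main() are invented placeholders, not measurements I took.

```java
public class TimingStatsSketch {

  // Arithmetic mean of the samples
  public static double mean(double[] samples) {
    if (samples.length == 0) {
      throw new IllegalArgumentException("no samples");
    }
    double sum = 0.0;
    for (double s : samples) sum += s;
    return sum / samples.length;
  }

  // Sample standard deviation (divides by n - 1)
  public static double sampleStdDev(double[] samples) {
    if (samples.length < 2) {
      throw new IllegalArgumentException(
       "need at least two samples");
    }
    double m = mean(samples);
    double squares = 0.0;
    for (double s : samples) squares += (s - m) * (s - m);
    return Math.sqrt(squares / (samples.length - 1));
  }

  public static void main(String[] args) {
    // Placeholder values standing in for repeated benchmark runs,
    // in seconds; substitute your own measurements
    double[] runs = {1.10, 1.25, 1.18, 1.31, 1.15};
    System.out.println("mean = " + mean(runs)
     + ", standard deviation = " + sampleStdDev(runs));
  }

}
```

If two parsers' means differ by less than a standard deviation or so, the benchmark hasn't really distinguished them.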