CyberNeko Tools for XNI | Professional XML Development with Apache Tools: Xerces, Xalan, FOP, Cocoon, Axis, Xindice (Wrox Professional Guides)

Andy Clark is one of the Xerces committers and was the driving force behind the design of XNI. He’s written a suite of tools called NekoXNI to showcase some of the things you can do with XNI. Even if you aren’t interested in using XNI, you might want to have a look, because some of the tools are pretty useful. In this section, we’ll look at a few of these tools.

NekoHTML

NekoHTML uses XNI to allow an application to process an HTML document as if it were an XML document. There are both SAX and DOM parsers in the org.cyberneko.html.parsers package. You use org.cyberneko.html.parsers.SAXParser just like the regular Xerces SAXParser; you can plug in your own ContentHandlers and so on using the regular SAX API. The org.cyberneko.html.parsers.DOMParser works like the Xerces DOMParser with one notable twist. Instead of using the Xerces XML DOM, it uses the Xerces HTML DOM, which means you get a DOM implementation that is aware of some of the rules of HTML. To use NekoHTML, you need to have nekohtml.jar in your classpath, in addition to the regular jars you need for Xerces. But if you need to process HTML, it’s worth it.

ManekiNeko

Another interesting and useful component of NekoXNI is a validator for Relax-NG called ManekiNeko. This validator is based on James Clark’s Jing validator for Relax-NG, and it works by creating a wrapper that converts XNI events into the SAX events that Jing already understands. This wrapped version of Jing is then inserted into the appropriate spot in the XNI pipeline within an XMLParserConfiguration called JingConfiguration. For ease of use, Andy has again provided convenience classes that work just like the Xerces SAX and DOM parser classes. For a Relax-NG aware SAX parser, use org.cyberneko.relaxng.parsers.SAXParser; for a DOM parser, use org.cyberneko.relaxng.parsers.DOMParser. You must set the SAX validation and namespace features to true. You must also set a property that tells the Relax-NG validator where to find the Relax-NG schema to be used for validation, because Relax-NG doesn’t specify a way of associating a schema with a document. This property is called http://cyberneko.org/xml/properties/relaxng/schema-location, and its value should be the URI for the schema file.
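The three settings above can be collected in one small helper. This is only a sketch: RelaxNgSetup and its configure method are hypothetical names, and we assume the Relax-NG aware parser is handed in as a standard org.xml.sax.XMLReader, as the description of the convenience classes suggests. The two feature URIs are the standard SAX ones; the property URI is the one given in the text.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: applies the settings the text says ManekiNeko needs. */
final class RelaxNgSetup {
    // Standard SAX feature identifiers.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";
    // Property named in the text; its value is the URI of the schema file.
    static final String SCHEMA_LOCATION =
        "http://cyberneko.org/xml/properties/relaxng/schema-location";

    private RelaxNgSetup() {}

    static void configure(XMLReader reader, String schemaUri) {
        try {
            reader.setFeature(NAMESPACES, true);
            reader.setFeature(VALIDATION, true);
            reader.setProperty(SCHEMA_LOCATION, schemaUri);
        } catch (SAXException e) {
            // SAXNotRecognizedException and SAXNotSupportedException both
            // extend SAXException, so a parser that rejects a setting ends up here.
            throw new IllegalArgumentException("reader rejected a Relax-NG setting", e);
        }
    }
}
```

With the NekoXNI jars on the classpath, you would presumably pass an instance of org.cyberneko.relaxng.parsers.SAXParser as the reader and then install your ContentHandler and ErrorHandler as usual before calling parse.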

NekoPull

The last CyberNeko tool is NekoPull, the CyberNeko pull parser. The commonly used APIs for XML, SAX and DOM, are push APIs. Once your program asks the parser to parse a document, your application doesn’t regain control until the parse completes. SAX calls your program code via its event callbacks, but that’s about as good as it gets. With the DOM, you have to wait until the entire tree has been built before you can do anything.

The difficulty with SAX is that for any non-trivial XML grammar, you end up maintaining a bunch of stacks and a state machine that remembers where you are in the grammar at any point in the parse. It also makes it very hard to modularize your application. If you have an XML grammar where the elements are turned into objects of various classes, you have to do a lot of work to keep the event-handling code for each class associated with each class. You end up trying to create ContentHandlers that handle only the section of the grammar for a particular class, and then you have to build infrastructure to multiplex between these ContentHandlers. It can be done, but the process is tedious and error prone.
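To make the bookkeeping concrete, here is a minimal sketch using the SAX parser that ships with the JDK. The grammar is hypothetical: a name element can appear under both author and publisher, so the handler has to track where it is in the document just to tell the two apart. NameHandler and its fields are illustrative names, not part of the Book example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: stores <name> text according to the enclosing element. */
final class NameHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();        // open elements
    private final Deque<StringBuilder> text = new ArrayDeque<>(); // text per open element
    String authorName;
    String publisherName;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        path.push(qName);
        text.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.peek().append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String value = text.pop().toString().trim();
        path.pop();
        String parent = path.peek(); // null once the root element has closed
        if (qName.equals("name") && "author".equals(parent)) {
            authorName = value;
        } else if (qName.equals("name") && "publisher".equals(parent)) {
            publisherName = value;
        }
    }

    /** Convenience wrapper so callers don't deal with checked exceptions. */
    static NameHandler parse(String xml) {
        try {
            NameHandler handler = new NameHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Even in this tiny case, the logic for authors and publishers lives in one handler rather than in the classes that own the data, which is the modularity problem described above.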

With the DOM, you can create a constructor that knows how to construct an instance of your class from an org.w3c.dom.Element node, and then you can pass the DOM tree around to instances of the various classes. You can handle contained objects by passing the right element in the DOM tree to the constructors for those contained object types. The disadvantage of the DOM is that you have to wait until the whole document is processed, even if you only need part of it. And, of course, there’s the usual problem of how much memory DOM trees take up.
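A minimal sketch of this constructor style, using the DOM parser in the JDK, might look like the following. DomBook and DomAuthor are hypothetical classes with just enough fields to show a containing object handing the right element to the constructor of a contained object.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical contained class, built from its own <author> element. */
final class DomAuthor {
    final String name;

    DomAuthor(Element authorElement) {
        this.name = authorElement.getTextContent().trim();
    }
}

/** Hypothetical containing class: reads <title> itself and delegates <author>. */
final class DomBook {
    String title;
    DomAuthor author;

    DomBook(Element bookElement) {
        for (Node n = bookElement.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text, comments, and so on
            }
            Element child = (Element) n;
            if (child.getTagName().equals("title")) {
                title = child.getTextContent().trim();
            } else if (child.getTagName().equals("author")) {
                author = new DomAuthor(child); // pass the subtree to the contained class
            }
        }
    }

    /** The whole tree is built before any constructor runs. */
    static DomBook parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return new DomBook(root);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```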

Pull-parsing APIs can give you the best of both worlds. In a pull-parsing API, the application asks the parser to parse the next unit in the XML document, regardless of whether that unit is an element, character data, a processing instruction, and so on. This means you can process the document in a streaming fashion, which is a benefit of SAX. You can also pass the parser instance around to your various object constructors. Because the parser instance remembers where it is in the document, the constructor can call the parser to ask for the next bits of XML, which should represent the data it needs to construct an object. Contained objects are handled just like the DOM case; you pass the parser instance (which again remembers its place) to the constructors for the contained objects. This is a much better API.
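The streaming side of this can be sketched without any extra jars. The example below does not use NekoPull; it uses javax.xml.stream (StAX), the pull API that came out of JSR-173 and now ships with the JDK, purely to illustrate the style. FirstTitle is a hypothetical class: it asks for one unit at a time and simply stops asking once it has what it needs.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class FirstTitle {
    private FirstTitle() {}

    /** Returns the text of the first <title> element, or null if there is none. */
    static String find(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next(); // the application asks for each unit
                    if (event == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals("title")) {
                        // Stop here; the application never asks for the rest.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the loop belongs to the application rather than to the parser, returning early is just an ordinary return statement.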

Let’s walk through a pull implementation of the Book object building program:

    /*
     *
     * NekoPullMain.java
     *
     * Example from "Professional XML Development with Apache Tools"
     *
     */
    package com.sauria.apachexml.ch1;

    import java.io.IOException;
    import java.util.Stack;

    import org.apache.xerces.xni.XMLAttributes;
    import org.apache.xerces.xni.XNIException;
    import org.apache.xerces.xni.parser.XMLInputSource;
    import org.cyberneko.pull.XMLEvent;
    import org.cyberneko.pull.XMLPullParser;
    import org.cyberneko.pull.event.CharactersEvent;
    import org.cyberneko.pull.event.ElementEvent;
    import org.cyberneko.pull.parsers.Xerces2;

    public class NekoPullMain {

        public static void main(String[] args) {
            try {
                XMLInputSource is =
                    new XMLInputSource(null, args[0], null);
                XMLPullParser pullParser = new Xerces2();
                pullParser.setInputSource(is);
                Book book = makeBook(pullParser);

You start by instantiating an instance of the pull parser and setting it up with the input source for the document. Then you pass the parser, which is at the correct position to start reading a book, to a constructor function for the Book class.

                System.out.println(book.toString());
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }

        private static Book makeBook(XMLPullParser pullParser)
            throws IOException {
            Book book = null;
            Stack textStack = new Stack();

When you ask the parser for the next bit of XML, you get back an event. That event is an object (a struct, really) that contains all the information about the piece of XML the parser saw. NekoPull includes event types for the document, elements, character data, CDATA, comments, text declaration, DOCTYPE declaration, processing instructions, entities, and namespace prefix mappings. The event types are determined by integer values in the type field of XMLEvent. Some of the events are bounded; that is, they correspond to a start/end pairing and are reported twice. The bounded events are DocumentEvent, ElementEvent, GeneralEntityEvent, CDATAEvent, and PrefixMappingEvent; a boolean field called start distinguishes start events from end events.

You loop and call pullParser’s nextEvent method to get events until there aren’t any more (or until you break out of the loop):

            XMLEvent evt;
            while ((evt = pullParser.nextEvent()) != null) {
                switch (evt.type) {
                    case XMLEvent.ELEMENT :
                        ElementEvent eltEvt = (ElementEvent) evt;
                        if (eltEvt.start) {
                            textStack.push(new StringBuffer());
                            String localPart = eltEvt.element.localpart;
                            if (localPart.equals("book")) {
                                XMLAttributes attrs = eltEvt.attributes;
                                String version =
                                    attrs.getValue(null, "version");
                                // Constant-first comparison, so a missing
                                // version attribute is reported as a bad
                                // version rather than a NullPointerException.
                                if ("1.0".equals(version)) {
                                    book = new Book();
                                    continue;
                                }
                                throw new XNIException("bad version");
                            }

If you see a starting ElementEvent for the book element, you check the version attribute to make sure it’s 1.0 and then instantiate a new Book object. For all starting ElementEvents, you push a new StringBuffer onto a textStack, just like for SAX. You do this to make sure you catch text in mixed content, which will be interrupted by markup. For example, in

    <blockquote>
        I really <em>didn’t</em> like what he had to say
    </blockquote>

the text "I really" and "like what he had to say" belongs inside the blockquote element, whereas the text "didn’t" belongs inside the em element. Keeping this text together is what the textStack is all about.
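The text stack idea can be tried in isolation. The sketch below is an assumption-laden stand-in: it uses the JDK’s javax.xml.stream reader instead of NekoPull, and TextStackDemo is a hypothetical class that records, for each element name, the text that sits directly inside that element.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class TextStackDemo {
    private TextStackDemo() {}

    /** Maps each element name to the text directly inside it (last one wins). */
    static Map<String, String> directText(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        Deque<StringBuilder> textStack = new ArrayDeque<>();
        try {
            XMLStreamReader r =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        textStack.push(new StringBuilder()); // fresh buffer per element
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (!textStack.isEmpty()) {
                            // Text always goes to the innermost open element.
                            textStack.peek().append(r.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        result.put(r.getLocalName(), textStack.pop().toString());
                        break;
                    default:
                        break;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }
}
```

Run on the blockquote example (with the surrounding whitespace removed), the em entry holds only the emphasized word, while the blockquote entry holds the text before and after it, joined together even though the markup interrupted it.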

The real work of building the object is done when you hit the end tag, where you get an ending ElementEvent. Here you grab the text you’ve been collecting for this element and, based on the tag you’re closing, call the appropriate Book setter method. You should be pretty familiar with this sort of code by now:

                        } else if (!eltEvt.empty) {
                            String localPart = eltEvt.element.localpart;
                            StringBuffer tos =
                                (StringBuffer) textStack.pop();
                            String text = tos.toString();
                            if (localPart.equals("title")) {
                                book.setTitle(text);
                            } else if (localPart.equals("author")) {
                                book.setAuthor(text);
                            } else if (localPart.equals("isbn")) {
                                book.setIsbn(text);
                            } else if (localPart.equals("month")) {
                                book.setMonth(text);
                            } else if (localPart.equals("year")) {
                                int year = 0;
                                year = Integer.parseInt(text);
                                book.setYear(year);
                            } else if (localPart.equals("publisher")) {
                                book.setPublisher(text);
                            } else if (localPart.equals("address")) {
                                book.setAddress(text);
                            }

When you see a CharactersEvent, you’re appending the characters in the event to the text you’re keeping for this element:

                        }
                        break;
                    case XMLEvent.CHARACTERS :
                        CharactersEvent chEvt = (CharactersEvent) evt;
                        StringBuffer tos =
                            (StringBuffer) textStack.peek();
                        tos.append(chEvt.text.toString());
                        break;
                }
            }
            return book;
        }
    }

As you can see, the style inside the constructor method is somewhat reminiscent of a SAX content handler. The difference is that when you get to contained objects, the code is dramatically simpler. You just have a bunch of methods that look like makeBook, except that as part of the processing of certain end ElementEvents, there’s a call to the constructor function of another class, with the only argument being the pull parser.
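Here is a rough sketch of what that delegation could look like. It is written against the JDK’s javax.xml.stream reader rather than NekoPull, and the grammar (an author element with first and last children) and the PullBook and PullAuthor classes are hypothetical. This sketch hands off when it sees the contained element’s start tag, so the contained factory reads that element’s content itself.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical contained class; its factory consumes exactly one <author>. */
final class PullAuthor {
    String first;
    String last;

    /** Precondition: the reader is positioned on the <author> start tag. */
    static PullAuthor make(XMLStreamReader reader) throws XMLStreamException {
        PullAuthor author = new PullAuthor();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("first")) {
                    author.first = reader.getElementText();
                } else if (name.equals("last")) {
                    author.last = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("author")) {
                break; // hand control back with the reader on </author>
            }
        }
        return author;
    }
}

/** Hypothetical containing class. */
final class PullBook {
    String title;
    PullAuthor author;

    static PullBook make(XMLStreamReader reader) throws XMLStreamException {
        PullBook book = new PullBook();
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String name = reader.getLocalName();
            if (name.equals("title")) {
                book.title = reader.getElementText();
            } else if (name.equals("author")) {
                // The only argument is the parser, which remembers its place.
                book.author = PullAuthor.make(reader);
            }
        }
        return book;
    }

    static PullBook parse(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                return make(reader);
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

PullBook needs no knowledge of what is inside an author element, and no multiplexing infrastructure is required to route events to PullAuthor.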

As we’re writing this, the first public review of JSR-173, the Streaming API for XML, has just begun. Perhaps by the time you’re reading this, NekoXNI’s pull parser will be implementing what’s in that JSR.

At the moment, the NekoXNI tools are separate from Xerces, but there have been some discussions about incorporating all or some of the tools into the main Xerces distribution.