XPath Engines | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

There are several good open source XPath engines for Java, most distributed as part of XSLT processors. They include the following:

Saxon 6.5.x [http://saxon.sourceforge.net]

A very fast XSLT processor written by Michael Kay and distributed under the Mozilla Public License 1.0. This is the processor I used to generate this book. (Saxon 7.x is also available, but it's an incomplete experimental implementation of XPath 2.0, which itself likely won't be finished until sometime in 2003. Both the Saxon 6.5 API and XPath 1.0 are much more stable and bug-free.)

Xalan-J [http://xml.apache.org/xalan-j]

An XSLT processor used by several Apache XML projects including Cocoon. It is of course distributed under the Apache license. If you happen to work for one of those dinosaur companies with a firm policy against using free software, you can buy the same product from IBM under the name LotusXSL [http://www.alphaworks.ibm.com/tech/LotusXSL].

Jaxen [http://www.jaxen.org]

A standalone XPath implementation that works with DOM, JDOM, dom4j, and ElectricXML.

Unfortunately, although the XPath data model and expression syntax are standardized, the API for integrating them into your Java programs is not. Each separate XPath engine does things differently. Saxon uses a custom DOM implementation that does not work with other DOM implementations such as Xerces or Crimson. Xalan-J is also based on DOM, but it only requires a generic DOM; it isn't limited to the Apache XML Project's Xerces DOM. Jaxen can work with any underlying data model, but the API still isn't portable to other XPath engines. Other implementations do something different still. This means that your code tends to become fairly closely tied to the XPath engine you choose.

To demonstrate the different APIs, let's revisit the Fibonacci SOAP client from Chapter 3. However, this time we'll use XPath to extract just the parts we want. Recall that the body of each request document contains a calculateFibonacci element in the http://namespaces.cafeconleche.org/xmljava/ch3/ namespace. This element contains a single positive integer:

<?xml version="1.0"?>
<SOAP-ENV:Envelope
 xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >
  <SOAP-ENV:Body>
    <calculateFibonacci
      xmlns="http://namespaces.cafeconleche.org/xmljava/ch3/"
      type="xsi:positiveInteger">5</calculateFibonacci>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

The server responds with a list of Fibonacci numbers enclosed in a SOAP response envelope. For example, here is the response to a request for the first five Fibonacci numbers:

<?xml version="1.0"?>
<SOAP-ENV:Envelope
 xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <Fibonacci_Numbers
      xmlns="http://namespaces.cafeconleche.org/xmljava/ch3/">
      <fibonacci index="1">1</fibonacci>
      <fibonacci index="2">1</fibonacci>
      <fibonacci index="3">2</fibonacci>
      <fibonacci index="4">3</fibonacci>
      <fibonacci index="5">5</fibonacci>
    </Fibonacci_Numbers>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

The client needs to find all of the fibonacci elements. There are many XPath expressions which will do this, the most obvious of which are

//fibonacci
/SOAP-ENV:Envelope/SOAP-ENV:Body/Fibonacci_Numbers/fibonacci

But there's a catch. XPath expressions cannot match the default namespace. That is, the fibonacci element in the expression is in no namespace at all. It will not match fibonacci elements in the http://namespaces.cafeconleche.org/xmljava/ch3/ namespace. So instead you have to give it a prefix, even though it doesn't have one in the original document. For example,

//f:fibonacci
/SOAP-ENV:Envelope/SOAP-ENV:Body/f:Fibonacci_Numbers/f:fibonacci

Having assigned it a prefix, you must then map that prefix to a namespace URI. Indeed you have to do this for the SOAP-ENV prefix as well, because the prefix will be used in a Java program instead of in the XML document where it was defined. Exactly how you do this varies from API to API, but generally you'll pass some collection of namespace bindings as an argument to the method that evaluates the expression, as well as the expression itself.

The second of these two location paths is more efficient in general. The // operator, and indeed any location step that uses the descendant, descendant-or-self, ancestor, or ancestor-or-self axis, will generally be slow relative to a more explicit spelling out of the hierarchy you expect. On the other hand, these axes are much more robust against unexpected changes in document structure. For example, //f:fibonacci would work even if somebody sent you an incorrect but well-formed document that left out the SOAP-ENV:Body element or used the SOAP 1.2 namespace instead of the SOAP 1.1 namespace. The more explicit path /SOAP-ENV:Envelope/SOAP-ENV:Body/f:Fibonacci_Numbers/f:fibonacci would not. Generally I recommend starting with the most robust path possible, and using the more explicit paths only if profiling proves performance to be a problem. In the latter case, I would also seriously consider checking each document I received against a schema, and rejecting it immediately if it wasn't valid.
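Both points, the need for prefix bindings and the robustness trade-off, can be checked with a small experiment. The following is only a minimal sketch and does not use any of the engines discussed in this section: it assumes the javax.xml.xpath package bundled with current JDKs. The class name XPathNamespaceDemo, the count() helper, and the abbreviated test documents are all made up for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; assumes the JDK's javax.xml.xpath API.
class XPathNamespaceDemo {

  static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";
  static final String FIB_NS
      = "http://namespaces.cafeconleche.org/xmljava/ch3/";

  // Abbreviated version of the response document shown earlier.
  static final String RESPONSE =
      "<SOAP-ENV:Envelope xmlns:SOAP-ENV='" + SOAP_NS + "'>"
    + "<SOAP-ENV:Body><Fibonacci_Numbers xmlns='" + FIB_NS + "'>"
    + "<fibonacci index='1'>1</fibonacci>"
    + "<fibonacci index='2'>1</fibonacci>"
    + "<fibonacci index='3'>2</fibonacci>"
    + "</Fibonacci_Numbers></SOAP-ENV:Body></SOAP-ENV:Envelope>";

  // Incorrect but well-formed: the SOAP-ENV:Body element is missing.
  static final String NO_BODY =
      "<SOAP-ENV:Envelope xmlns:SOAP-ENV='" + SOAP_NS + "'>"
    + "<Fibonacci_Numbers xmlns='" + FIB_NS + "'>"
    + "<fibonacci index='1'>1</fibonacci>"
    + "</Fibonacci_Numbers></SOAP-ENV:Envelope>";

  static final String EXPLICIT =
      "/SOAP-ENV:Envelope/SOAP-ENV:Body/f:Fibonacci_Numbers/f:fibonacci";

  // Prefix bindings for the expressions. The prefixes the document
  // itself uses (or omits) are irrelevant; only the URIs matter.
  static final Map<String, String> BINDINGS
      = Map.of("SOAP-ENV", SOAP_NS, "f", FIB_NS);

  // Counts the nodes the expression selects in the given document.
  static int count(String xml, String expression) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      xpath.setNamespaceContext(new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
          String uri = (prefix == null) ? null : BINDINGS.get(prefix);
          return (uri != null) ? uri : XMLConstants.NULL_NS_URI;
        }
        // Reverse lookups are not needed for this demonstration.
        @Override public String getPrefix(String uri) { return null; }
        @Override public Iterator<String> getPrefixes(String uri) {
          return Collections.emptyIterator();
        }
      });
      NodeList nodes = (NodeList) xpath.evaluate(
          expression, doc, XPathConstants.NODESET);
      return nodes.getLength();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(count(RESPONSE, "//fibonacci"));
    System.out.println(count(RESPONSE, "//f:fibonacci"));
    System.out.println(count(RESPONSE, EXPLICIT));
    System.out.println(count(NO_BODY, "//f:fibonacci"));
    System.out.println(count(NO_BODY, EXPLICIT));
  }
}
```

On the complete document, the unprefixed //fibonacci selects nothing, while both prefixed paths select all three elements. On the document without a SOAP-ENV:Body element, //f:fibonacci still finds its one element but the explicit path finds none.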

XPath with Saxon

The Saxon 6.5 API is rather convoluted, involving more than 200 different classes in 18 different packages. Fortunately you can ignore most of these for basic XPath searching. The most common sequence of steps to search a document is

Use JAXP to build a Saxon Document object.
Attach the document to a Context object.
Declare the namespaces used in the XPath expressions in a StandaloneContext.
Make an Expression from the StandaloneContext and the string form of your XPath expression.
Evaluate the Expression to return one of the four XPath data types.

Saxon requires a custom DOM that has been annotated with the information it needs. You can't just pass in a Xerces DOM or a Crimson DOM. Thus, before you use JAXP to parse the document, you have to set the javax.xml.parsers.DocumentBuilderFactory system property to com.icl.saxon.om.DocumentBuilderFactoryImpl. Because you know this at compile-time and do not want to allow the user to change it at runtime, use System.setProperty() in your code rather than passing it in on the command line. In case other parts of the program are using a different implementation, remember to save the old value and restore it when you're done. Otherwise, parsing a document with Saxon is the same as with any other parser. For example,

String oldFactory = System.getProperty(
 "javax.xml.parsers.DocumentBuilderFactory");
System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
 "com.icl.saxon.om.DocumentBuilderFactoryImpl");
DocumentBuilderFactory factory
 = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
// Use the factory...
if (oldFactory != null) {
  System.setProperty(
   "javax.xml.parsers.DocumentBuilderFactory", oldFactory);
}
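The save-and-restore idiom can be made a little more defensive. The sketch below is one possible variation, not part of Saxon or JAXP: the class ScopedSystemProperty and its withProperty() method are hypothetical names. It restores the property in a finally block so that an exception cannot skip the restore, and it clears the property if nothing was set beforehand.

```java
import java.util.function.Supplier;

// Hypothetical helper; not part of Saxon or JAXP.
class ScopedSystemProperty {

  // Runs the task with the system property temporarily set, then puts
  // the property back exactly as it was, even if the task throws.
  static <T> T withProperty(String key, String value, Supplier<T> task) {
    String old = System.getProperty(key);
    System.setProperty(key, value);
    try {
      return task.get();
    } finally {
      if (old != null) {
        System.setProperty(key, old);
      } else {
        // Nothing was set before, so remove our temporary value.
        System.clearProperty(key);
      }
    }
  }

  public static void main(String[] args) {
    String key = "javax.xml.parsers.DocumentBuilderFactory";
    // A real program would build and use the factory inside the task;
    // here the task only reports the value it sees.
    String seen = withProperty(key,
        "com.icl.saxon.om.DocumentBuilderFactoryImpl",
        () -> System.getProperty(key));
    System.out.println("inside the task: " + seen);
    System.out.println("afterwards:      " + System.getProperty(key));
  }
}
```

System properties are global to the virtual machine, so this protects only code that runs later, not other threads that create a DocumentBuilderFactory while the task is running.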

Once you've set the DocumentBuilderFactory, parse the input document as normal to produce a DOM Node or Document object. The exact type doesn't really matter because you'll immediately cast this to the Saxon implementation class com.icl.saxon.om.NodeInfo. For example,

DocumentBuilder builder = factory.newDocumentBuilder();
InputSource data = new InputSource(in); // InputSource is a SAX class
Node doc = builder.parse(data);         // Node is a DOM interface
NodeInfo info = (NodeInfo) doc;         // NodeInfo is a Saxon class

You'll notice that Saxon freely mixes classes from SAX, DOM, TrAX, JAXP, and its internal implementation. A typical Saxon program imports a lot of packages.

Before this document can be searched, you'll need to establish it as an XPath context node. Saxon uses the com.icl.saxon.Context class to represent context nodes. This is constructed with a no-args constructor. You then set its context node with the aptly named setContextNode() method, like this:

Context context = new Context();
context.setContextNode(info);

Here the root node is the context node, but you could use standard DOM methods to navigate through the tree and find another node to serve as the context node. Personally, I prefer to leave as much of the navigation work to XPath as possible.

The document we've just parsed defines its own namespace prefixes and URIs, but these may not be the same ones used in the XPath expression. In particular, any default namespaces in the document will have to be mapped to prefixes in the XPath expression. As always in XPath, the namespaces matter. The prefixes don't. There are two prefixes to map, SOAP-ENV and f. A Java program is not an XML document; therefore, these can't be mapped in the customary way with xmlns attributes. Instead they have to be added to a Saxon com.icl.saxon.expr.StandaloneContext object. Each such object needs access to the document's com.icl.saxon.om.NamePool to which the necessary namespaces can be added. This is all set up as follows:

DocumentInfo docInfo = info.getDocumentRoot();
NamePool pool = docInfo.getNamePool();
StandaloneContext sc = new StandaloneContext(pool);
sc.declareNamespace("SOAP-ENV",
 "http://schemas.xmlsoap.org/soap/envelope/");
sc.declareNamespace("f",
 "http://namespaces.cafeconleche.org/xmljava/ch3/");

That does it for the preliminaries. We're finally ready to search the document with XPath. The Saxon class that both represents and evaluates XPath expressions is com.icl.saxon.expr.Expression. You pass a String containing the XPath expression and the StandaloneContext object to the static Expression.make() factory method. This returns an Expression object. You then pass the Context object and a boolean specifying whether you want the result to be sorted in document order to the enumerate() method. This returns a com.icl.saxon.om.NodeEnumeration, one of Saxon's representations of node-sets. For example,

Expression xpath = Expression.make(
 "/SOAP-ENV:Envelope/SOAP-ENV:Body/f:Fibonacci_Numbers/f:fibonacci",
 sc);
NodeEnumeration nodes = xpath.enumerate(context, true);
while (nodes.hasMoreElements()) {
  NodeInfo result = nodes.nextElement();
  System.out.println(result.getStringValue());
}

The NodeEnumeration class is modeled after the Enumeration interface in the java.util package (but does not extend it). It allows you to iterate through the returned node-set. Each node in this set implements the Saxon NodeInfo interface. The getStringValue() method in this interface returns the XPath string value of that node.

NodeEnumeration is limited to a single use. That is, you cannot set it back to its beginning and iterate through a second time. If you need a persistent result, you can call evaluateAsNodeSet(), which returns a com.icl.saxon.expr.NodeSetValue instead. You can then sort and enumerate this object repeatedly. For example,

Expression xpath = Expression.make(
 "/SOAP-ENV:Envelope/SOAP-ENV:Body/f:Fibonacci_Numbers/f:fibonacci",
 sc);
NodeSetValue set = xpath.evaluateAsNodeSet(context);
set.sort();
NodeEnumeration nodes = set.enumerate();
while (nodes.hasMoreElements()) {
  NodeInfo result = nodes.nextElement();
  System.out.println(result.getStringValue());
}

Alternatively, if you want the expression to return a number, string, or boolean, you can call one of these three methods instead:

public boolean evaluateAsBoolean(Context context) throws XPathException
public double evaluateAsNumber(Context context) throws XPathException
public String evaluateAsString(Context context) throws XPathException

If the expression returns the wrong type, then Saxon will convert the result as if by the XPath number(), string(), or boolean() function. The only conversion Saxon can't perform is a primitive type to a node-set. If you try to evaluate an expression that returns one of the three basic types as a node-set, then evaluateAsNodeSet() throws an XPathException.
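These conversion rules come from XPath 1.0 itself, so they can be observed in any conforming engine. The sketch below is not Saxon code. It assumes the javax.xml.xpath API bundled with current JDKs, where the requested return type plays the role of Saxon's evaluateAsXXX() methods. The class name XPathConversionDemo and the namespace-free document are simplifications made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; assumes the JDK's javax.xml.xpath API.
class XPathConversionDemo {

  // Namespace-free version of the response, to keep expressions short.
  static final String XML =
      "<Fibonacci_Numbers>"
    + "<fibonacci index='1'>1</fibonacci>"
    + "<fibonacci index='2'>1</fibonacci>"
    + "<fibonacci index='3'>2</fibonacci>"
    + "<fibonacci index='4'>3</fibonacci>"
    + "<fibonacci index='5'>5</fibonacci>"
    + "</Fibonacci_Numbers>";

  // Evaluates the expression and converts the result to the given type.
  // A failed evaluation or conversion surfaces as an
  // IllegalArgumentException.
  static Object evaluate(String expression, QName type) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(XML)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
      return xpath.evaluate(expression, doc, type);
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    // A node-set converted as if by string(), number(), and boolean().
    System.out.println(evaluate("//fibonacci", XPathConstants.STRING));
    System.out.println(
        evaluate("//fibonacci[last()]", XPathConstants.NUMBER));
    System.out.println(
        evaluate("//fibonacci[@index > 5]", XPathConstants.BOOLEAN));
    // A number converted to a string.
    System.out.println(
        evaluate("count(//fibonacci)", XPathConstants.STRING));
    // A number cannot be converted to a node-set.
    try {
      evaluate("count(//fibonacci)", XPathConstants.NODESET);
    } catch (IllegalArgumentException e) {
      System.out.println("no conversion from number to node-set");
    }
  }
}
```

The string value of a node-set is the string value of its first node in document order, which is why the first call yields the text of the first fibonacci element rather than all five values.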

I can now show you the complete method that takes as an argument the InputStream from which the response document will be read and searches out the relevant parts with XPath:

public static void readResponse(InputStream in)
 throws IOException, SAXException, XPathException,
 ParserConfigurationException, TransformerException {

  String oldFactory = System.getProperty(
   "javax.xml.parsers.DocumentBuilderFactory");
  System.setProperty(
   "javax.xml.parsers.DocumentBuilderFactory",
   "com.icl.saxon.om.DocumentBuilderFactoryImpl");

  DocumentBuilderFactory factory
   = DocumentBuilderFactory.newInstance();
  factory.setNamespaceAware(true);
  DocumentBuilder builder = factory.newDocumentBuilder();

  InputSource data = new InputSource(in);
  Node doc = builder.parse(data);
  NodeInfo info = (NodeInfo) doc;
  Context context = new Context();
  context.setContextNode(info);

  NamePool pool = info.getDocumentRoot().getNamePool();
  StandaloneContext sc = new StandaloneContext(pool);
  sc.declareNamespace("SOAP",
   "http://schemas.xmlsoap.org/soap/envelope/");
  sc.declareNamespace("f",
   "http://namespaces.cafeconleche.org/xmljava/ch3/");

  Expression xpath = Expression.make(
   "/SOAP:Envelope/SOAP:Body/f:Fibonacci_Numbers/f:fibonacci",
   sc);
  NodeEnumeration nodes = xpath.enumerate(context, true);
  while (nodes.hasMoreElements()) {
    NodeInfo result = nodes.nextElement();
    System.out.println(result.getStringValue());
  }

  // Restore the original factory
  if (oldFactory != null) {
    System.setProperty(
     "javax.xml.parsers.DocumentBuilderFactory", oldFactory);
  }

}

Honestly, this is a little convoluted and perhaps more complex than the pure DOM, JDOM, or SAX equivalent. The advantage is that the code is never more complex than this. As the documents you're searching grow in complexity, the XPath expressions become only slightly more complex and the Java code becomes no more complex than what you see here. The more details you can defer to the declarative XPath syntax, the simpler and more robust your program will be.

XPath with Xalan

The Xalan-J XSLT processor from the Apache XML Project also includes an XPath API that's useful for navigation in DOM programs. Underneath the hood, the basic design is strikingly similar to Saxon's for two independently developed programs. However, Xalan does have one class that Saxon doesn't, which significantly simplifies life for developers: org.apache.xpath.XPathAPI . This class, shown in Example 16.5, provides static methods that handle many simple use cases without lots of preliminary configuration.

Example 16.5 The Xalan XPathAPI Class

package org.apache.xpath;

public class XPathAPI {

  public static Node selectSingleNode(Node context, String xpath)
   throws TransformerException;
  public static Node selectSingleNode(Node context, String xpath,
   Node namespaceContextNode) throws TransformerException;

  public static NodeIterator selectNodeIterator(Node context,
   String xpath) throws TransformerException;
  public static NodeIterator selectNodeIterator(Node context,
   String xpath, Node namespaceContextNode)
   throws TransformerException;

  public static NodeList selectNodeList(Node context,
   String xpath) throws TransformerException;
  public static NodeList selectNodeList(Node context,
   String xpath, Node namespaceContextNode)
   throws TransformerException;

  public static XObject eval(Node context, String xpath)
   throws TransformerException;
  public static XObject eval(Node context, String xpath,
   Node namespaceContextNode) throws TransformerException;
  public static XObject eval(Node context, String xpath,
   PrefixResolver prefixResolver) throws TransformerException;

}

Each method in this class takes two or three arguments:

The context node as a DOM Node object.
The XPath expression as a String.
The namespace prefix mappings as a DOM Node object or a Xalan PrefixResolver object. This can be omitted if the XPath expression does not use any namespace prefixes.

The methods differ primarily in return type. There are four possible return types:

A single DOM Node
A DOM NodeList
A DOM traversal NodeIterator
A Xalan XObject

The first three types you've encountered in previous chapters. I won't say anything more about them here, except to advise against the methods that return only a single Node. They are fragile against unexpected changes in document format.

Caution

Trust me on this one. No matter how sure you are that all of the documents you're processing contain exactly one node that matches an XPath expression, sooner or later you're going to encounter a document that is either missing the node completely or has two or more. List iteration is far more reliable than selecting a single node. If you disregard this warning and use selectSingleNode() anyway, then by all means use a schema to validate your document before accepting it for processing.
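The difference is easy to see when the same expression is run against documents with zero, one, and two matching elements. The sketch below assumes the JDK's javax.xml.xpath API rather than Xalan's XPathAPI class, so that it runs without extra libraries. Requesting XPathConstants.NODE stands in for selectSingleNode(), and requesting XPathConstants.NODESET stands in for selectNodeList(). The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; assumes the JDK's javax.xml.xpath API.
class NodeListIterationDemo {

  private static Object select(String xml, QName type) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return XPathFactory.newInstance().newXPath()
          .evaluate("//fibonacci", doc, type);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Single-node style: yields null when nothing matches and quietly
  // ignores every match after the first.
  static String single(String xml) {
    Node node = (Node) select(xml, XPathConstants.NODE);
    return (node == null) ? null : node.getTextContent();
  }

  // List style: the same loop handles zero, one, or many matches.
  static List<String> all(String xml) {
    NodeList nodes = (NodeList) select(xml, XPathConstants.NODESET);
    List<String> values = new ArrayList<>();
    for (int i = 0; i < nodes.getLength(); i++) {
      values.add(nodes.item(i).getTextContent());
    }
    return values;
  }

  public static void main(String[] args) {
    String[] documents = {
      "<r/>",
      "<r><fibonacci>1</fibonacci></r>",
      "<r><fibonacci>1</fibonacci><fibonacci>2</fibonacci></r>"
    };
    for (String xml : documents) {
      System.out.println(single(xml) + " versus " + all(xml));
    }
  }
}
```

With the single-node call, the empty document produces a null that the caller must remember to check, and the second element of the last document is silently dropped. The list version reports exactly what was found in all three cases.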

The XObject type is new. This is a class in the org.apache.xpath.objects package that represents the various kinds of XPath objects (string, number, boolean, and node-set) as well as a few XSLT objects (result tree fragment, unknown types, and unresolved variables). This class has a number of methods, intended mostly for use in XSLT. For XPath, all you really need are the following five methods for converting an XObject to a more specific type:

public boolean bool() throws TransformerException
public double num() throws TransformerException
public String str() throws TransformerException
public NodeIterator nodeset() throws TransformerException
public NodeList nodelist() throws TransformerException

The last two methods only work if the XObject returned by the eval() method is in fact a node-set. Otherwise, they throw a TransformerException.

For the sake of comparison, let's look at how we would use these classes to solve the Fibonacci SOAP client problem addressed earlier by Saxon.

public static void readResponse(InputStream in)
 throws IOException, SAXException, TransformerException,
 ParserConfigurationException {

  DocumentBuilderFactory factory
   = DocumentBuilderFactory.newInstance();
  factory.setNamespaceAware(true);
  DocumentBuilder builder = factory.newDocumentBuilder();

  InputSource data = new InputSource(in);
  Node doc = builder.parse(data);

  // set up a document purely to hold the namespace mappings
  DOMImplementation impl = builder.getDOMImplementation();
  Document namespaceHolder = impl.createDocument(
   "http://namespaces.cafeconleche.org/xmljava/ch3/",
   "f:namespaceMapping", null);
  Element root = namespaceHolder.getDocumentElement();
  root.setAttributeNS("http://www.w3.org/2000/xmlns/",
   "xmlns:SOAP",
   "http://schemas.xmlsoap.org/soap/envelope/");
  root.setAttributeNS("http://www.w3.org/2000/xmlns/", "xmlns:f",
   "http://namespaces.cafeconleche.org/xmljava/ch3/");

  NodeList results = XPathAPI.selectNodeList(doc,
   "/SOAP:Envelope/SOAP:Body/f:Fibonacci_Numbers/f:fibonacci",
   root);
  for (int i = 0; i < results.getLength(); i++) {
    Node result = results.item(i);
    XObject value = XPathAPI.eval(result, "string()");
    System.out.println(value.str());
  }

}

The input document is parsed in the usual way with the JAXP DocumentBuilder class. JAXP is also used to create a new Element that provides namespace bindings for the XPath expression. Alternatively, I could have implemented the org.apache.xml.utils.PrefixResolver interface in a separate class and used that instead, but using a node is simpler.

The input document (which serves as the XPath context node), the XPath expression, and the namespace context node (here the root element of the namespaceHolder document) are passed to XPathAPI.selectNodeList() to find all of the matching elements. This returns a standard DOM NodeList, which can be iterated through in the usual way. Because the last XPath axis in the expression, child, is a forward axis, this list is sorted in document order. The string value of each node in this list is determined by calling the XPath string() function with the node as the context. This returns an instance of the Xalan class XObject, which can be converted to a Java String using the str() method. The result is printed on System.out.

One crucial difference you'll note between Xalan and Saxon is that at no point does Xalan require or use any specific classes from the DOM implementation. All of the DOM nodes are generic DOM nodes. Thus in theory this same code should work with any DOM2- and JAXP 1.1-compliant implementation. In practice, I've verified that it does work with Xerces and Crimson but not with GNU JAXP.