TreeWalker | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

The purpose of TreeWalker is much the same as that of NodeIterator: traversing a subtree of a document rooted at a particular node and filtered by both node type and custom logic. TreeWalker differs from NodeIterator in that its traversal model is a tree with parent, child, and sibling nodes rather than a linear list with only previous and next nodes. Because this model is very similar to what's already available in the Node interface, tree-walkers aren't as commonly used as NodeIterator. But the ability to filter the nodes that appear in the tree can be very useful on occasion.

Example 12.7 summarizes the TreeWalker interface. It has getter methods that return the configuration of the TreeWalker, methods to get and set the current node, and methods to move from the current node to its parent, first child, last child, previous sibling, next sibling, previous node, and next node. In all cases, these methods return null if there is no such node (for example, if you ask for the last child of an empty element).

Example 12.7 The TreeWalker Interface

 package org.w3c.dom.traversal;

 public interface TreeWalker {

   public Node       getRoot();
   public int        getWhatToShow();
   public NodeFilter getFilter();
   public boolean    getExpandEntityReferences();

   public Node       getCurrentNode();
   public void       setCurrentNode(Node currentNode)
    throws DOMException;

   public Node       parentNode();
   public Node       firstChild();
   public Node       lastChild();
   public Node       previousSibling();
   public Node       nextSibling();
   public Node       previousNode();
   public Node       nextNode();

 }

A TreeWalker object is always positioned at one of the nodes in its subtree. It begins its existence positioned at the first node in document order. From there you can change the tree-walker's position by invoking nextNode(), previousNode(), parentNode(), firstChild(), lastChild(), previousSibling(), and nextSibling(). If the requested node does not exist relative to the current node within the tree-walker's tree, these methods return null and the tree-walker stays where it is. You can find out where the tree-walker is positioned with getCurrentNode().
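The following minimal sketch traces a few of these moves. The class name WalkerMoves and the tiny sample document are invented for illustration, and the sketch assumes that the default JAXP parser produces a Document that implements DocumentTraversal.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the book's examples
class WalkerMoves {

  // Name of the node a move returned, or "null" if the move failed
  static String name(Node node) {
    return node == null ? "null" : node.getNodeName();
  }

  // Performs a fixed series of moves and records where each one lands
  static String trace() {
    try {
      String xml = "<a><b><c/></b><d/></a>"; // invented sample document
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      // Assumption: this DOM implementation supports the traversal module
      TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
          doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);

      StringBuilder path = new StringBuilder();
      path.append(name(walker.getCurrentNode()));           // the root, a
      path.append(' ').append(name(walker.firstChild()));   // down to b
      path.append(' ').append(name(walker.nextSibling()));  // across to d
      path.append(' ').append(name(walker.firstChild()));   // d is empty: null
      path.append(' ').append(name(walker.getCurrentNode())); // still on d
      path.append(' ').append(name(walker.previousNode())); // back to c
      path.append(' ').append(name(walker.parentNode()));   // up to b
      return path.toString();
    } catch (Exception e) {
      throw new IllegalStateException("Could not run the sample", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(trace());
  }
}
```

Note that the failed firstChild() call on the empty d element does not move the walker. The subsequent getCurrentNode() call still reports d.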

TreeWalker objects are created in almost exactly the same way as NodeIterator objects. That is, you cast the Document object you want to walk to DocumentTraversal and invoke its createTreeWalker() method. The createTreeWalker() method takes the same four arguments, with the same meanings, as the createNodeIterator() method: the root node of the subtree to walk, an int constant specifying which types of nodes to display, a custom NodeFilter object or null, and a boolean indicating whether or not to expand entity references.
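As a rough sketch of this setup, the hypothetical WalkerSetup class below creates a tree-walker over an invented one-element document and reads its configuration back through the getter methods. The instanceof check is one possible way to guard the cast. It is not the only way to detect traversal support.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the book's examples
class WalkerSetup {

  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  static TreeWalker create(Document doc) {
    if (!(doc instanceof DocumentTraversal)) {
      throw new UnsupportedOperationException("No traversal support");
    }
    DocumentTraversal traversable = (DocumentTraversal) doc;
    return traversable.createTreeWalker(
        doc.getDocumentElement(),                        // root of the subtree
        NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT,  // node types to show
        null,                                            // no custom filter
        false);                                          // don't expand entity references
  }

  // Reads the configuration back, then lists every node below the root
  static String describe() {
    Document doc = parse("<order><item>clock</item></order>"); // invented sample
    TreeWalker walker = create(doc);
    StringBuilder out = new StringBuilder();
    out.append(walker.getRoot().getNodeName());
    out.append(' ').append(walker.getWhatToShow());
    out.append(' ').append(walker.getFilter() == null);
    out.append(' ').append(walker.getExpandEntityReferences());
    for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
      out.append(' ').append(n.getNodeName());
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(describe());
  }
}
```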

Note

If the root node is filtered out either by whatToShow or by the NodeFilter, then the subtree being walked may not have a single root. In other words, it's more like a DocumentFragment than a Document. As long as you're cognizant of this possibility, it is not a large problem.
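To make this concrete, here is a small sketch in which the filter hides the root element. The class name FilteredRoot, the list and entry element names, and the sample document are all invented. Seen through the walker, the three entry elements are siblings with no visible parent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the book's examples
class FilteredRoot {

  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Returns the number of top-level nodes the walker sees, followed by
  // whether the first of them has no parent in the walker's view
  static String topLevel() {
    Document doc = parse("<list><entry/><entry/><entry/></list>");

    // Accept entry elements only; the root element, list, is skipped
    NodeFilter onlyEntries = node -> "entry".equals(node.getNodeName())
        ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP;

    TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
        doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, onlyEntries, true);

    Node first = walker.firstChild();
    int count = (first == null) ? 0 : 1;
    while (walker.nextSibling() != null) {
      count++;
    }

    // Go back to the first entry and try to move up to the hidden root
    walker.setCurrentNode(first);
    boolean noVisibleParent = (walker.parentNode() == null);
    return count + " " + noVisibleParent;
  }

  public static void main(String[] args) {
    System.out.println(topLevel());
  }
}
```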

TreeWalkers are called for whenever the hierarchy matters; that is, whenever what's important is not just the node itself but also its parent and other ancestor nodes. For example, suppose you want to generate a list of examples in a DocBook document in the following format:

 Example 1.1: A plain text document that indicates an order for 12 Birdsong Clocks, SKU 244
 Example 1.2: An XML document that indicates an order for 12 Birdsong Clocks, SKU 244
 Example 1.3: A document that indicates an order for 12 Birdsong Clocks, SKU 244
 ...
 Example 2.1: An XML document that labels elements with schema simple types
 Example 2.2: URLGrabber
 Example 2.3: URLGrabberTest
 ...

To review from Chapter 11, DocBook documents are structured roughly as follows:

 <book>
   ...
   <chapter>
     ...
     <example id="filename.java">
       <title>Some Java Program</title>
       <programlisting>import javax.xml.parsers;
         // more Java code...
       </programlisting>
     </example>
     ...
     <example id="filename.xml">
       <title>Some XML document</title>
       <programlisting><![CDATA[<?xml version="1.0"?>
 <root>
   ...
 </root>]]></programlisting>
     </example>
     ...
   </chapter>
   more chapters...
 </book>

For maximum convenience, we want a TreeWalker that sees only book, chapter, example, and title elements. However, title elements should be allowed only when they represent the title of an example, not of a chapter, a figure, or anything else. We can set whatToShow to NodeFilter.SHOW_ELEMENT to limit the tree-walker to elements, and design a NodeFilter that picks out only these four elements. Example 12.8 demonstrates this filter.

Example 12.8 The ExampleFilter Class

 import org.w3c.dom.traversal.NodeFilter;
 import org.w3c.dom.*;

 public class ExampleFilter implements NodeFilter {

   public short acceptNode(Node node) {

     Element candidate = (Element) node;
     String name = candidate.getNodeName();
     if (name.equals("example")) return FILTER_ACCEPT;
     else if (name.equals("chapter")) return FILTER_ACCEPT;
     else if (name.equals("book")) return FILTER_ACCEPT;
     else if (name.equals("title")) {
       // Is this the title of an example, in which case we accept
       // it, or the title of something else, in which case we
       // reject it?
       Node parent = node.getParentNode();
       if ("example".equals(parent.getNodeName())) {
         return FILTER_ACCEPT;
       }
     }
     return FILTER_SKIP;

   }

 }

In each case when an element is rejected, acceptNode() returns FILTER_SKIP, not FILTER_REJECT. For TreeWalker, unlike NodeIterator, the difference is important. By returning FILTER_SKIP, acceptNode() indicates that this node should not be reported but that its children should be. If acceptNode() returned FILTER_REJECT for a node, then neither that node nor any of its descendants would be traversed.
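The difference can be seen by running the same walk twice with a filter that differs only in the value it returns for one element. The sketch below is illustrative only: the class name SkipVersusReject, the appendix element, and the sample document are invented and do not come from the DocBook example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the book's examples
class SkipVersusReject {

  static final String XML =
      "<book><appendix><example/><example/></appendix>"
    + "<chapter><example/></chapter></book>";

  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Counts the example elements the walker reaches when the filter
  // returns the given verdict for appendix elements
  static int countExamples(short verdictForAppendix) {
    Document doc = parse(XML);
    NodeFilter filter = node -> "appendix".equals(node.getNodeName())
        ? verdictForAppendix : NodeFilter.FILTER_ACCEPT;
    TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
        doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, filter, true);

    int examples = 0;
    for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
      if ("example".equals(n.getNodeName())) examples++;
    }
    return examples;
  }

  public static void main(String[] args) {
    // FILTER_SKIP hides the appendix but still shows its two examples
    System.out.println("SKIP:   " + countExamples(NodeFilter.FILTER_SKIP));
    // FILTER_REJECT hides the appendix and everything inside it
    System.out.println("REJECT: " + countExamples(NodeFilter.FILTER_REJECT));
  }
}
```

With FILTER_SKIP the walker reaches all three example elements. With FILTER_REJECT it reaches only the one inside the chapter.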

The TreeWalker is simply a view of the document. It does not itself change the document or the nodes that the document contains. For example, even though the ExampleFilter hides all text nodes, these can still be extracted from a title element. Example 12.9 walks the tree and pulls out these titles using this filter.
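A short sketch of this point follows, using an invented one-example document and a hypothetical class named HiddenText. It substitutes the DOM Level 3 getTextContent() method for the TextExtractor class, purely to keep the sketch self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the book's examples
class HiddenText {

  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  static String inspect() {
    Document doc = parse("<example><title>URLGrabber</title></example>");
    // The walker shows elements only, so text nodes are invisible to it
    TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
        doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);

    Node title = walker.firstChild();

    // Through the walker, the title appears to have no children
    boolean hiddenFromWalker = (walker.firstChild() == null);

    // In the underlying document, the text node is still there
    Node child = title.getFirstChild();
    boolean stillInDocument =
        child != null && child.getNodeType() == Node.TEXT_NODE;

    // getTextContent() stands in for the book's TextExtractor here
    return title.getTextContent() + " " + hiddenFromWalker
        + " " + stillInDocument;
  }

  public static void main(String[] args) {
    System.out.println(inspect());
  }
}
```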

Example 12.9 Navigating a Subtree with TreeWalker

 import javax.xml.parsers.*;
 import org.w3c.dom.*;
 import org.w3c.dom.traversal.*;
 import org.xml.sax.SAXException;
 import java.io.IOException;

 public class ExampleList {

   public static void printExampleTitles(Document doc) {

     // Create the TreeWalker
     DocumentTraversal traversable = (DocumentTraversal) doc;
     TreeWalker walker = traversable.createTreeWalker(
      doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT,
      new ExampleFilter(), true);

     // The TreeWalker starts out positioned at the root
     Node chapter = walker.firstChild();
     int chapterNumber = 0;
     while (chapter != null) {
       chapterNumber++;
       Node example = walker.firstChild();
       int exampleNumber = 0;
       while (example != null) {
         exampleNumber++;
         Node title = walker.firstChild();
         // firstChild() returns null and does not move the walker
         // if the example has no title
         if (title != null) {
           String titleText = TextExtractor.getText(title);
           titleText = "Example " + chapterNumber + "."
            + exampleNumber + ": " + titleText;
           System.out.println(titleText);
           // Back up to the example
           walker.parentNode();
         }
         example = walker.nextSibling();
       }
       // Reposition the walker on the chapter. setCurrentNode() is
       // used instead of parentNode() because the walker never left
       // the chapter if the chapter contains no examples.
       walker.setCurrentNode(chapter);
       // Go to the next chapter
       chapter = walker.nextSibling();
     }

   }

   public static void main(String[] args) {

     if (args.length <= 0) {
       System.out.println("Usage: java ExampleList URL");
       return;
     }
     String url = args[0];

     try {
       DocumentBuilderFactory factory
        = DocumentBuilderFactory.newInstance();
       DocumentBuilder parser = factory.newDocumentBuilder();

       // Check for the traversal module
       DOMImplementation impl = parser.getDOMImplementation();
       if (!impl.hasFeature("traversal", "2.0")) {
         System.out.println(
        "A DOM implementation that supports traversal is required."
         );
         return;
       }

       // Read the document
       Document doc = parser.parse(url);
       printExampleTitles(doc);

     }
     catch (SAXException e) {
       System.out.println(url + " is not well-formed.");
     }
     catch (IOException e) {
       System.out.println(
        "Due to an IOException, the parser could not check " + url
       );
     }
     catch (FactoryConfigurationError e) {
       System.out.println("Could not locate a factory class");
     }
     catch (ParserConfigurationException e) {
       System.out.println("Could not locate a JAXP parser");
     }

   } // end main

 }

The use of TreeWalker here and NodeIterator in TextExtractor makes this task a lot simpler than it otherwise would be. Hiding all of the irrelevant parts means, among other things, that you need not worry about the complexities of example elements that appear at different depths in the tree, or about insignificant white space that may sporadically add extra text nodes where you don't expect them. The traversal package enables you to boil down a document to the minimum structure relevant to your problem.