NodeIterator | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

The NodeIterator utility interface extracts a subset of the nodes in a DOM document and presents them as a list arranged in document order. In other words, the nodes appear in the order in which you would find them in a depth-first, preorder traversal of the tree. That is,

The document node comes first.
Parents come before their children; ancestors come before their descendants.
Sibling nodes appear in the same order as their start-tags in the text representation of the document.

This is pretty much the order you would expect just by reading an XML document from beginning to end. As soon as you see the first character of text from a node, that node is counted.

You can iterate through this list without concerning yourself with the tree structure of the XML document. For many operations, this flatter view is more convenient than the hierarchical tree view. For example, a spell-checker can check all text nodes one at a time. An outline program can extract the headings in an XHTML document while ignoring everything else. All of this is possible by iterating through a list without having to write recursive methods.

Example 12.1 summarizes the NodeIterator interface. The first four getter methods simply tell you how the iterator is choosing from all of the available nodes in the document. The nextNode() and previousNode() methods move forward and backward in the list and return the requested node. Finally, the detach() method cleans up after the iterator when you're done with it. It's analogous to closing a stream.

Example 12.1 The NodeIterator Interface

    package org.w3c.dom.traversal;

    import org.w3c.dom.DOMException;
    import org.w3c.dom.Node;

    public interface NodeIterator {

      public Node       getRoot();
      public int        getWhatToShow();
      public NodeFilter getFilter();
      public boolean    getExpandEntityReferences();

      public Node       nextNode() throws DOMException;
      public Node       previousNode() throws DOMException;

      public void       detach();

    }

As you see, the NodeIterator interface provides only the most basic methods for an iterator. Each iterator can be thought of as having a cursor, which is initially positioned before the first node in the list. The nextNode() method returns the node immediately following the cursor and advances the cursor one space. The previousNode() method returns the node immediately before the cursor and backs up the cursor one space. If the iterator is positioned at the end of the list, then nextNode() returns null. If the iterator is positioned at the beginning of the list, then previousNode() returns null. For example, given a NodeIterator variable named iterator positioned at the beginning of its list, the following code fragment prints the names of all the nodes:

    Node node;
    while ((node = iterator.nextNode()) != null) {
      System.out.println(node.getNodeName());
    }
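The cursor model can also be exercised in both directions. The following is a minimal sketch, not part of the original example set: the class name CursorDemo and its parse() and walk() helpers are hypothetical, and it assumes the parser that ships with the JDK, whose Document objects implement DocumentTraversal (how the iterator is created is explained in the next section). It walks forward until nextNode() returns null, then backward until previousNode() returns null.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

class CursorDemo {

  // Hypothetical helper: parses XML text and wraps checked exceptions.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Returns two lists of element names: one collected while moving the
  // cursor forward to the end, one collected while moving it back again.
  static List<List<String>> walk(String xml) {
    Document doc = parse(xml);
    NodeIterator iterator = ((DocumentTraversal) doc).createNodeIterator(
        doc, NodeFilter.SHOW_ELEMENT, null, true);

    List<String> forward = new ArrayList<>();
    List<String> backward = new ArrayList<>();
    Node node;
    while ((node = iterator.nextNode()) != null) {
      forward.add(node.getNodeName());
    }
    // The cursor now sits after the last node, so previousNode()
    // returns that last node first.
    while ((node = iterator.previousNode()) != null) {
      backward.add(node.getNodeName());
    }
    iterator.detach();

    List<List<String>> result = new ArrayList<>();
    result.add(forward);
    result.add(backward);
    return result;
  }

  public static void main(String[] args) {
    List<List<String>> result = walk("<a><b/><c/></a>");
    System.out.println("forward:  " + result.get(0));
    System.out.println("backward: " + result.get(1));
  }
}
```

For the sample document the forward pass visits a, b, and c, and the backward pass visits the same elements in reverse, because the cursor always rests between nodes rather than on one.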

Note

Design pattern aficionados will have recognized this as an instance of the iterator pattern (as if the name didn't already give it away). More precisely, it's a robust, external iterator: robust because the iterator still works even if its backing data structure (the Document object) changes underneath it; external because the client code is responsible for moving the iterator from one node to the next, rather than having the iterator move itself.

Constructing NodeIterators with DocumentTraversal

Not all DOM implementations are guaranteed to support the traversal module, although most do. You can check this by calling hasFeature("traversal", "2.0") on the DOMImplementation object. For example,

    if (!impl.hasFeature("traversal", "2.0")) {
      System.err.println(
       "A DOM implementation that supports traversal is required.");
      return;
    }

Assuming that the implementation does support traversal, the Document implementation class also implements the DocumentTraversal interface. This factory interface, shown in Example 12.2, allows you to create new NodeIterator and TreeWalker objects that traverse the nodes in that document.

Example 12.2 The DocumentTraversal Factory Interface

    package org.w3c.dom.traversal;

    import org.w3c.dom.DOMException;
    import org.w3c.dom.Node;

    public interface DocumentTraversal {

      public NodeIterator createNodeIterator(Node root,
       int whatToShow, NodeFilter filter,
       boolean entityReferenceExpansion) throws DOMException;

      public TreeWalker createTreeWalker(Node root,
       int whatToShow, NodeFilter filter,
       boolean entityReferenceExpansion) throws DOMException;

    }

Thus, to create a NodeIterator, you cast the Document object you want to iterate over to DocumentTraversal and then invoke its createNodeIterator() method. This method takes the following four arguments:

root

The Node in the document from which the iterator starts. Only this node and its descendants are traversed by the iterator. This means that you can easily design iterators that iterate over a subtree of the entire document. For example, by passing in the root element, it's possible to skip everything in the document's prolog and epilog.

whatToShow

An int bitfield constant specifying the node types the iterator will include. These constants are

NodeFilter.SHOW_ELEMENT = 1

NodeFilter.SHOW_ATTRIBUTE = 2

NodeFilter.SHOW_TEXT = 4

NodeFilter.SHOW_CDATA_SECTION = 8

NodeFilter.SHOW_ENTITY_REFERENCE = 16

NodeFilter.SHOW_ENTITY = 32

NodeFilter.SHOW_PROCESSING_INSTRUCTION = 64

NodeFilter.SHOW_COMMENT = 128

NodeFilter.SHOW_DOCUMENT = 256

NodeFilter.SHOW_DOCUMENT_TYPE = 512

NodeFilter.SHOW_DOCUMENT_FRAGMENT = 1024

NodeFilter.SHOW_NOTATION = 2048

NodeFilter.SHOW_ALL = 0xFFFFFFFF

filter

A NodeFilter against which all nodes in the subtree will be compared. Only nodes that pass the filter will be let through. By implementing this interface, you can define more specific filters, such as "all elements that have xlink:type="simple" attributes" or "all text nodes that contain the word fnord." Alternatively, you can pass null to indicate no custom filtering.

entityReferenceExpansion

Pass true if you want the iterator to descend through the children of entity reference nodes, false otherwise. Generally, this should be set to true.
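To make the four arguments concrete, here is a minimal sketch of the "all text nodes that contain the word fnord" filter mentioned above. The class names FilterDemo and ContainsWord and the helper methods are hypothetical, introduced only for illustration; the NodeFilter.acceptNode() method and the FILTER_ACCEPT and FILTER_SKIP constants come from the org.w3c.dom.traversal package. The sketch assumes the JDK's bundled parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

class FilterDemo {

  // Hypothetical filter: accepts only nodes whose value contains a word.
  static class ContainsWord implements NodeFilter {
    private final String word;

    ContainsWord(String word) {
      this.word = word;
    }

    @Override
    public short acceptNode(Node node) {
      String value = node.getNodeValue();
      // For a NodeIterator, FILTER_SKIP and FILTER_REJECT are equivalent.
      return (value != null && value.contains(word))
          ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP;
    }
  }

  // Hypothetical helper: parses XML text and wraps checked exceptions.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Collects the values of all text nodes that contain the given word.
  static List<String> textContaining(Document doc, String word) {
    NodeIterator iterator = ((DocumentTraversal) doc).createNodeIterator(
        doc,                     // root: iterate over the whole document
        NodeFilter.SHOW_TEXT,    // whatToShow: text nodes only
        new ContainsWord(word),  // filter: applied after whatToShow
        true);                   // entityReferenceExpansion
    List<String> matches = new ArrayList<>();
    Node node;
    while ((node = iterator.nextNode()) != null) {
      matches.add(node.getNodeValue());
    }
    iterator.detach();
    return matches;
  }

  public static void main(String[] args) {
    Document doc = parse(
        "<doc><p>hail fnord</p><p>nothing here</p><p>fnord again</p></doc>");
    System.out.println(textContaining(doc, "fnord"));
  }
}
```

The whatToShow mask does the coarse selection by node type, and the filter only sees nodes that already passed it, so the filter itself does not need to check for text nodes.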

Example 11.20 demonstrated a comment reader program that recursively descended an XML tree, printing out all of the comment nodes that were found. A NodeIterator makes it possible to write the program nonrecursively. When creating the iterator, the root argument is the document node; whatToShow is NodeFilter.SHOW_COMMENT; the node filter is null; and entityReferenceExpansion is true. Example 12.3 demonstrates.

Example 12.3 Using a NodeIterator to Extract All of the Comments from a Document

    import javax.xml.parsers.*;
    import org.w3c.dom.*;
    import org.w3c.dom.traversal.*;
    import org.xml.sax.SAXException;
    import java.io.IOException;

    public class CommentIterator {

      public static void main(String[] args) {

        if (args.length <= 0) {
          System.out.println("Usage: java CommentIterator URL");
          return;
        }
        String url = args[0];

        try {
          DocumentBuilderFactory factory
           = DocumentBuilderFactory.newInstance();
          DocumentBuilder parser = factory.newDocumentBuilder();

          // Check for the traversal module
          DOMImplementation impl = parser.getDOMImplementation();
          if (!impl.hasFeature("traversal", "2.0")) {
            System.out.println(
             "A DOM implementation that supports traversal is required."
            );
            return;
          }

          // Read the document
          Document doc = parser.parse(url);

          // Create the NodeIterator
          DocumentTraversal traversable = (DocumentTraversal) doc;
          NodeIterator iterator = traversable.createNodeIterator(
           doc, NodeFilter.SHOW_COMMENT, null, true);

          // Iterate over the comments
          Node node;
          while ((node = iterator.nextNode()) != null) {
            System.out.println(node.getNodeValue());
          }
        }
        catch (SAXException e) {
          System.out.println(e);
          System.out.println(url + " is not well-formed.");
        }
        catch (IOException e) {
          System.out.println(
           "Due to an IOException, the parser could not check " + url
          );
        }
        catch (FactoryConfigurationError e) {
          System.out.println("Could not locate a factory class");
        }
        catch (ParserConfigurationException e) {
          System.out.println("Could not locate a JAXP parser");
        }

      } // end main

    }

You can decide for yourself whether you prefer the explicit recursion and tree-walking of Example 11.20 or the hidden recursion of CommentIterator here. With a decent implementation, there shouldn't be any noticeable performance penalty, so feel free to use whichever feels more natural to you.

Liveness

Node iterators are live. That is, if the document changes while the program is walking the tree, then the iterator retains its state. For example, let's suppose that the program is at node C of a node iterator that's walking through nodes A, B, C, D, and E in that order. If you delete node D and then call nextNode(), you'll get node E. If you add node Z in between nodes B and C and then call previousNode(), you'll get node Z. The iterator's current position is always between two nodes (or before the first node or after the last node) but never on a node; thus, it is not invalidated by deleting the current node.
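The deletion case can be checked with a few lines of code. This is a minimal sketch under stated assumptions: the class name LivenessDemo and its helpers are hypothetical, the five sibling elements stand in for nodes A through E, and the JDK's bundled parser is assumed. It covers only the "delete D, then call nextNode()" scenario.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

class LivenessDemo {

  // Hypothetical helper: parses XML text and wraps checked exceptions.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Advances the iterator to C, removes D from the document, and
  // returns the name of the node that nextNode() delivers afterward.
  static String nextAfterRemovingD() {
    Document doc = parse("<list><A/><B/><C/><D/><E/></list>");
    Node list = doc.getDocumentElement();
    NodeIterator iterator = ((DocumentTraversal) doc).createNodeIterator(
        list, NodeFilter.SHOW_ELEMENT, null, true);

    // Move forward until C has just been returned.
    Node current = iterator.nextNode();
    while (!current.getNodeName().equals("C")) {
      current = iterator.nextNode();
    }

    // Change the document underneath the iterator.
    Node d = current.getNextSibling();
    list.removeChild(d);

    String next = iterator.nextNode().getNodeName();
    iterator.detach();
    return next;
  }

  public static void main(String[] args) {
    System.out.println("After removing D, nextNode() returns "
        + nextAfterRemovingD());
  }
}
```

Because the iterator consults the live tree each time it moves, the removed node D simply never appears and the next call yields E.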

For example, the following method deletes all of the comments in its Document argument. When the method returns, all of the comments have been removed.

    public static void deleteComments(Document doc) {

      // Create the NodeIterator
      DocumentTraversal traversable = (DocumentTraversal) doc;
      NodeIterator iterator = traversable.createNodeIterator(
       doc, NodeFilter.SHOW_COMMENT, null, true);

      // Iterate over the comments
      Node comment;
      while ((comment = iterator.nextNode()) != null) {
        Node parent = comment.getParentNode();
        parent.removeChild(comment);
      }

    }

This method changes the original Document object. It does not change the XML file from which the Document object was created, unless you specifically write the changed document back out into the original file after the comments have been deleted.
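If you do want the change to persist, one option is to serialize the modified Document with an identity transform from TrAX. The following is a minimal sketch rather than a definitive recipe: the class name CommentStripper and its strip() method are hypothetical, the XML is read from and written to strings to keep the example self-contained, and the JDK's bundled parser and transformer are assumed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

class CommentStripper {

  // Same logic as the deleteComments() method shown in the text.
  static void deleteComments(Document doc) {
    NodeIterator iterator = ((DocumentTraversal) doc).createNodeIterator(
        doc, NodeFilter.SHOW_COMMENT, null, true);
    Node comment;
    while ((comment = iterator.nextNode()) != null) {
      comment.getParentNode().removeChild(comment);
    }
    iterator.detach();
  }

  // Parses the XML text, removes every comment, and serializes the
  // modified document back to a string with an identity transform.
  static String strip(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      deleteComments(doc);

      Transformer identity = TransformerFactory.newInstance().newTransformer();
      identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      identity.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException("Could not process sample XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(strip("<a><!--x--><b>t<!--y--></b></a>"));
  }
}
```

To overwrite the original file instead, the StreamResult could be constructed from a java.io.File; the in-memory Document is the only thing deleteComments() itself touches.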

Filtering by Node Type

You can combine the various flags for whatToShow with the bitwise OR operator (|). For example, Chapter 11 used a rather convoluted recursive getText() method in the ExampleExtractor program to accumulate all of the text from both text and CDATA section nodes within an element. Example 12.4 shows how NodeIterator can accomplish this task in a much more straightforward fashion.

Example 12.4 Using a NodeIterator to Retrieve the Complete Text Content of an Element

    import org.w3c.dom.*;
    import org.w3c.dom.traversal.*;

    public class TextExtractor {

      public static String getText(Node node) {

        if (node == null) return "";

        // Set up the iterator. getOwnerDocument() returns null for a
        // Document node, so use the node itself in that case.
        Document doc = (node.getNodeType() == Node.DOCUMENT_NODE)
         ? (Document) node : node.getOwnerDocument();
        DocumentTraversal traversable = (DocumentTraversal) doc;
        int whatToShow
         = NodeFilter.SHOW_TEXT | NodeFilter.SHOW_CDATA_SECTION;
        NodeIterator iterator = traversable.createNodeIterator(node,
         whatToShow, null, true);

        // Extract the text
        StringBuffer result = new StringBuffer();
        Node current;
        while ((current = iterator.nextNode()) != null) {
          result.append(current.getNodeValue());
        }
        return result.toString();

      }

    }

I'll reuse this class a little later on. Something like this should definitely be in your toolbox for whenever you need to extract the text content of an element.

Note

DOM Level 3 is going to add an almost equivalent getTextContent() method to the Node interface:

    public String getTextContent() throws DOMException

The only difference is that this method will not operate on Document objects, whereas TextExtractor.getText() will.
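On Java versions whose org.w3c.dom.Node includes the DOM Level 3 getTextContent() method (Java 5 and later), the difference can be observed directly. This is a minimal sketch; the class name TextContentDemo and its helper methods are hypothetical, and the JDK's bundled parser is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TextContentDemo {

  // Hypothetical helper: parses XML text and wraps checked exceptions.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Text content of the root element: text and CDATA sections of all
  // descendants, concatenated in document order.
  static String elementText(String xml) {
    return parse(xml).getDocumentElement().getTextContent();
  }

  // Text content of the Document node itself, which DOM Level 3
  // defines to be null.
  static String documentText(String xml) {
    return parse(xml).getTextContent();
  }

  public static void main(String[] args) {
    String xml = "<p>Hello <b>big</b> <![CDATA[world]]></p>";
    System.out.println("element:  " + elementText(xml));
    System.out.println("document: " + documentText(xml));
  }
}
```

Calling getTextContent() on the root element gathers text from both ordinary text nodes and CDATA sections, while calling it on the Document node yields null rather than the document's text.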