NodeFilter | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

The whatToShow argument allows you to iterate over only certain node types in a subtree. Suppose you want to go beyond that. For example, you may have a program that reads XHTML documents and extracts all heading elements but ignores everything else. Or perhaps you want to find all SVG content in a document, or all the GIFT elements whose price attribute has a value greater than $10.00. Or perhaps you want to find those SKU elements containing the ID of a product that needs to be reordered, as determined by consulting an external database. All of these tasks and many more besides can be implemented through node filters on top of a NodeIterator or a TreeWalker.

Example 12.5 summarizes the NodeFilter interface. You implement this interface in a class of your own devising. The acceptNode() method contains the custom logic that decides whether any given node passes the filter or not. This method can return one of the three named constants NodeFilter.FILTER_ACCEPT, NodeFilter.FILTER_REJECT, or NodeFilter.FILTER_SKIP to indicate what it wants to do with that node.

Example 12.5 The NodeFilter Interface

package org.w3c.dom.traversal;

import org.w3c.dom.Node;

public interface NodeFilter {

  // Constants returned by acceptNode
  public static final short FILTER_ACCEPT = 1;
  public static final short FILTER_REJECT = 2;
  public static final short FILTER_SKIP   = 3;

  // Constants for whatToShow
  public static final int SHOW_ALL                    = 0xFFFFFFFF;
  public static final int SHOW_ELEMENT                = 0x00000001;
  public static final int SHOW_ATTRIBUTE              = 0x00000002;
  public static final int SHOW_TEXT                   = 0x00000004;
  public static final int SHOW_CDATA_SECTION          = 0x00000008;
  public static final int SHOW_ENTITY_REFERENCE       = 0x00000010;
  public static final int SHOW_ENTITY                 = 0x00000020;
  public static final int SHOW_PROCESSING_INSTRUCTION = 0x00000040;
  public static final int SHOW_COMMENT                = 0x00000080;
  public static final int SHOW_DOCUMENT               = 0x00000100;
  public static final int SHOW_DOCUMENT_TYPE          = 0x00000200;
  public static final int SHOW_DOCUMENT_FRAGMENT      = 0x00000400;
  public static final int SHOW_NOTATION               = 0x00000800;

  public short acceptNode(Node n);

}

For iterators, there are really only two options for the return value of acceptNode(): FILTER_ACCEPT and FILTER_SKIP. NodeIterator treats FILTER_REJECT the same as FILTER_SKIP. (Tree-walkers, by contrast, do make a distinction between these two.) Rejecting a node prevents it from appearing in the list, but it does not prevent the node's children and descendants from appearing. They will be tested separately.
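The difference is easiest to see by running the same filter through both kinds of traversal. The following is only a sketch: the class name SkipVersusReject, the four-element sample document, and the helper methods are invented for illustration, and the cast to DocumentTraversal assumes a DOM implementation that supports the traversal module (the parser bundled with the JDK is assumed here).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

public class SkipVersusReject {

  // Hypothetical sample document: b is a child of a.
  static final String XML = "<root><a><b/></a><c/></root>";

  // Returns the given verdict for <a> elements and accepts everything else.
  static NodeFilter verdictForA(short verdict) {
    return node -> "a".equals(node.getNodeName())
        ? verdict : NodeFilter.FILTER_ACCEPT;
  }

  static Document parse() throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(XML)));
  }

  // Names of the elements a NodeIterator returns, in order.
  static List<String> withIterator(short verdict) throws Exception {
    Document doc = parse();
    // Assumption: the implementation supports DOM traversal.
    NodeIterator iterator = ((DocumentTraversal) doc).createNodeIterator(
        doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT,
        verdictForA(verdict), true);
    List<String> names = new ArrayList<>();
    for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
      names.add(n.getNodeName());
    }
    iterator.detach();
    return names;
  }

  // Names of the elements a TreeWalker visits, in order.
  static List<String> withWalker(short verdict) throws Exception {
    Document doc = parse();
    TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
        doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT,
        verdictForA(verdict), true);
    List<String> names = new ArrayList<>();
    // A TreeWalker starts out positioned on its root, so record that first.
    names.add(walker.getCurrentNode().getNodeName());
    for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
      names.add(n.getNodeName());
    }
    return names;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("iterator, skip:   " + withIterator(NodeFilter.FILTER_SKIP));
    System.out.println("iterator, reject: " + withIterator(NodeFilter.FILTER_REJECT));
    System.out.println("walker, skip:     " + withWalker(NodeFilter.FILTER_SKIP));
    System.out.println("walker, reject:   " + withWalker(NodeFilter.FILTER_REJECT));
  }

}
```

Under those assumptions, both iterator runs and the skipping tree-walker should report root, b, and c, because b is still tested on its own after a is passed over. Only the rejecting tree-walker should report just root and c, because rejecting a prunes its entire subtree.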

The NodeFilter does not override whatToShow. The two work in concert. For example, whatToShow can limit the iterator to only elements. Then the acceptNode() method can confidently cast every node that is passed to it to Element without first checking its node type.

To configure an iterator with a filter, pass the NodeFilter object to the createNodeIterator() method. The NodeIterator will then pass each potential candidate node to the acceptNode() method to decide whether or not to include it in the iterator.
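As a small illustration of whatToShow and a filter working together, here is a sketch of the XHTML heading extractor mentioned at the start of this section. The HeadingLister and HeadingFilter names and the headings() helper are hypothetical, not part of any API, and the code again assumes that the parser's Document implements DocumentTraversal. Because the iterator is created with SHOW_ELEMENT, the filter casts to Element without checking the node type.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

public class HeadingLister {

  static final String XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml";

  // Accepts h1 through h6 in the XHTML namespace; skips every other element.
  static class HeadingFilter implements NodeFilter {
    @Override
    public short acceptNode(Node node) {
      // This cast is safe only because whatToShow is SHOW_ELEMENT below.
      Element candidate = (Element) node;
      String name = candidate.getLocalName();
      boolean heading = XHTML_NAMESPACE.equals(candidate.getNamespaceURI())
          && name != null && name.matches("h[1-6]");
      return heading ? FILTER_ACCEPT : FILTER_SKIP;
    }
  }

  // Returns the text of each heading in document order.
  static List<String> headings(String xhtml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // needed for getLocalName()
    Document document = factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));

    // Assumption: the implementation supports DOM traversal.
    DocumentTraversal traversal = (DocumentTraversal) document;
    NodeIterator iterator = traversal.createNodeIterator(
        document.getDocumentElement(), // start at root element
        NodeFilter.SHOW_ELEMENT,       // only see elements
        new HeadingFilter(),           // only see headings
        true                           // expand entities
    );

    List<String> result = new ArrayList<>();
    Element heading;
    while ((heading = (Element) iterator.nextNode()) != null) {
      result.add(heading.getTextContent());
    }
    iterator.detach();
    return result;
  }

  public static void main(String[] args) throws Exception {
    String sample = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
        + "<h1>Title</h1><p>Some text</p><h2>Part</h2></body></html>";
    System.out.println(headings(sample));
  }

}
```

The paragraph element is still passed to acceptNode(), but the filter skips it, so only the two headings should come back from the iterator.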

For an example, let's revisit the DOMSpider program demonstrated in Example 11.6. That program needed to recurse through the entire document, looking at each and every node to see whether or not it was an element and, if it was, whether or not it had an xlink:type attribute with the value simple. We can write that program much more simply using a NodeFilter to find the simple XLinks and a NodeIterator to walk through them. Example 12.6 demonstrates the necessary filter.

Example 12.6 An Implementation of the NodeFilter Interface

import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.*;

public class XLinkFilter implements NodeFilter {

  public static final String XLINK_NAMESPACE
   = "http://www.w3.org/1999/xlink";

  public short acceptNode(Node node) {
    // This cast assumes whatToShow limits the traversal to elements
    Element candidate = (Element) node;
    String type
     = candidate.getAttributeNS(XLINK_NAMESPACE, "type");
    if ("simple".equals(type)) return FILTER_ACCEPT;
    return FILTER_SKIP;
  }

}

The following is a spider() method that has been revised to take advantage of NodeIterator and this filter. This can replace both the spider() and findLinks() methods of the previous version. The filter replaces the isSimpleLink() method. The code is considerably simpler than the version in Example 11.6.

public void spider(String systemID) {

  currentDepth++;
  try {
    if (currentDepth < maxDepth) {

      Document document = parser.parse(systemID);
      process(document, systemID);

      Vector uris = new Vector();
      // search the document for URIs,
      // store them in vector, and print them
      DocumentTraversal traversal
       = (DocumentTraversal) document;
      NodeIterator xlinks = traversal.createNodeIterator(
        document.getDocumentElement(),// start at root element
        NodeFilter.SHOW_ELEMENT,      // only see elements
        new XLinkFilter(),            // only see simple XLinks
        true                          // expand entities
      );
      Element xlink;
      while ((xlink = (Element) xlinks.nextNode()) != null) {
        String uri = xlink.getAttributeNS(XLINK_NAMESPACE,
         "href");
        if (!uri.equals("")) {
          try {
            String wholePage = absolutize(systemID, uri);
            if (!visited.contains(wholePage)
             && !uris.contains(wholePage)) {
              uris.add(wholePage);
            }
          }
          catch (MalformedURLException e) {
            // If it's not a good URL, then we can't spider it
            // anyway, so just drop it on the floor.
          }
        } // end if
      } // end while
      xlinks.detach();

      Enumeration e = uris.elements();
      while (e.hasMoreElements()) {
        String uri = (String) e.nextElement();
        visited.add(uri);
        spider(uri);
      }

    }

  }
  catch (SAXException e) {
    // Couldn't load the document,
    // probably not well-formed XML, skip it
  }
  catch (IOException e) {
    // Couldn't load the document,
    // likely network failure, skip it
  }
  finally {
    currentDepth--;
    System.out.flush();
  }

}

There is, however, one feature in the earlier version that this NodeIterator-based variant doesn't have. The DOMSpider in Chapter 11 tracked xml:base attributes. Because the xml:base attributes may appear on ancestors of the XLinks rather than on the XLinks themselves, a NodeIterator really isn't appropriate for tracking them. The key problem is that xml:base has hierarchical scope. That is, an xml:base attribute only applies to the element on which it appears and its descendants. Although the filter could easily be adjusted to notice elements that have xml:base attributes as well as those that have xlink:type="simple" attributes, an iterator really can't distinguish the other elements to which any given xml:base attribute applies.

DOM3 will add a getBaseURI() method to the Node interface that will alleviate the need to track xml:base attributes manually. In fact, this will be even more effective than the manual tracking of the Chapter 11 example, because it will also notice different base URIs that arise from external entities. Revising the spider() method to take advantage of this requires changing only a couple of lines of code, as follows:

 String wholePage = absolutize(xlink.getBaseURI(), uri);

Unfortunately, this method is not yet supported by any of the common parsers, but it should be implemented in the not too distant future.
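Where a DOM Level 3 implementation is available, the effect can be sketched as follows. This example assumes that the parser bundled with the JDK honors xml:base in getBaseURI(). The BaseUriDemo class, the sample document, and the example.com URLs are invented for illustration, and java.net.URI stands in for the spider's absolutize() method, which is not shown here.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

public class BaseUriDemo {

  static final String XLINK_NAMESPACE = "http://www.w3.org/1999/xlink";

  // Hypothetical document: xml:base sits on an ancestor of the XLink.
  static final String XML =
      "<doc xmlns:xlink='http://www.w3.org/1999/xlink'>"
    + "<section xml:base='http://www.example.com/docs/'>"
    + "<link xlink:type='simple' xlink:href='page.xml'/>"
    + "</section></doc>";

  // Returns each simple XLink's href resolved against its base URI.
  static List<String> resolvedLinks() throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document document = factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(XML)));

    // Same test as XLinkFilter, written inline to keep this self-contained.
    NodeFilter simpleXLinks = node ->
        "simple".equals(((Element) node).getAttributeNS(XLINK_NAMESPACE, "type"))
            ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP;

    // Assumption: the implementation supports DOM traversal.
    NodeIterator xlinks = ((DocumentTraversal) document).createNodeIterator(
        document.getDocumentElement(), NodeFilter.SHOW_ELEMENT,
        simpleXLinks, true);

    List<String> result = new ArrayList<>();
    Element xlink;
    while ((xlink = (Element) xlinks.nextNode()) != null) {
      String href = xlink.getAttributeNS(XLINK_NAMESPACE, "href");
      // Assumption: getBaseURI() accounts for xml:base on ancestors.
      // It may return null if no base URI is known at all.
      String base = xlink.getBaseURI();
      if (base == null || href.isEmpty()) continue;
      result.add(URI.create(base).resolve(href).toString());
    }
    xlinks.detach();
    return result;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(resolvedLinks());
  }

}
```

If the implementation behaves as assumed, the relative href is resolved against the xml:base inherited from the enclosing section element, with no manual tracking of ancestors in the iterator loop.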