The Node Interface | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Once you've parsed the document and formed an org.w3c.dom.Document object, you can forget about the differences among the various parsers and just work with the standard DOM interfaces. ^[1]

^[1] At least until you want to write the document back out to a file again. Then you have to consider parser-specific classes or JAXP once more.

All of the nodes in the tree are represented by instances of the Node interface summarized in Example 9.8.

Example 9.8 The Node Interface

package org.w3c.dom;

public interface Node {

  // Node type constants
  public static final short ELEMENT_NODE                = 1;
  public static final short ATTRIBUTE_NODE              = 2;
  public static final short TEXT_NODE                   = 3;
  public static final short CDATA_SECTION_NODE          = 4;
  public static final short ENTITY_REFERENCE_NODE       = 5;
  public static final short ENTITY_NODE                 = 6;
  public static final short PROCESSING_INSTRUCTION_NODE = 7;
  public static final short COMMENT_NODE                = 8;
  public static final short DOCUMENT_NODE               = 9;
  public static final short DOCUMENT_TYPE_NODE          = 10;
  public static final short DOCUMENT_FRAGMENT_NODE      = 11;
  public static final short NOTATION_NODE               = 12;

  // Node properties
  public String   getNodeName();
  public String   getNodeValue() throws DOMException;
  public void     setNodeValue(String nodeValue)
   throws DOMException;
  public short    getNodeType();
  public String   getNamespaceURI();
  public String   getPrefix();
  public void     setPrefix(String prefix) throws DOMException;
  public String   getLocalName();

  // Navigation methods
  public Node         getParentNode();
  public boolean      hasChildNodes();
  public NodeList     getChildNodes();
  public Node         getFirstChild();
  public Node         getLastChild();
  public Node         getPreviousSibling();
  public Node         getNextSibling();
  public Document     getOwnerDocument();
  public boolean      hasAttributes();
  public NamedNodeMap getAttributes();

  // Manipulator methods
  public Node insertBefore(Node newChild, Node refChild)
   throws DOMException;
  public Node replaceChild(Node newChild, Node oldChild)
   throws DOMException;
  public Node removeChild(Node oldChild) throws DOMException;
  public Node appendChild(Node newChild) throws DOMException;

  // Utility methods
  public Node    cloneNode(boolean deep);
  public void    normalize();
  public boolean isSupported(String feature, String version);

}

You can do quite a lot with just this interface alone. You can add, move, remove, and copy nodes in the tree. You can walk the tree while reading the names and values of everything in the tree. This interface can be roughly divided into five sections:

Node type constants
Methods to set and get node properties
Methods to navigate the DOM tree
Methods to add and remove children of a node
A few utility methods.

Let's take them in that order.

Node Types

There are 12 constants, one for each of the 12 named node types defined in the DOM core, and a method, getNodeType(), that returns the type of the current node using one of these constants. To a Java developer, these are just weird all around. First of all, you'd probably expect to use instanceof, getClass(), and class names to test for types when necessary, instead of short constants and a getNodeType() method. And even if for some strange reason you did use named constants, you'd probably use the type-safe enum pattern if you were familiar with it, or ints if you weren't. Either way, a short constant is just plain weird.

What's going on here is that DOM is not designed in or for Java. It is written in IDL and intended for all object-oriented languages, including C++, Python, Perl, JavaScript, and more. And it has to make a lot of compromises to support the broad range of capabilities of those different languages. For example, AppleScript doesn't have any equivalent to Java's instanceof operator that allows it to test whether a variable is an instance of a particular class. Prior to version 1.4, JavaScript didn't have one either. Some older C++ compilers don't support runtime type information (RTTI) and no C compilers do. Consequently, DOM can't rely on these features because it has to work in those languages. Therefore, it has to reinvent things Java already has.

Note

Using a getNodeType() method also allows a single class to implement more than one of the standard interfaces, which is possible because Java supports multiple interface inheritance. For example, an implementation might use a single NodeImpl class for all 12 different subinterfaces of Node. Then, an object could simultaneously be an instance of Comment, Element, Text, and all the other things besides. I've seen exactly one DOM implementation that does this. The Saxon XSLT processor (discussed in Chapter 16) uses its NodeImpl class to represent all nondocument and nonelement nodes. However, all of the general-purpose DOM implementations I've encountered use a separate class for each separate node type.

The issue of the short constants is a little different. Here, DOM has simply chosen to implement idioms from a language other than Java. In this case, it's following the C++ conventions, where shorts and short constants are much more common than they are in Java. As for using integers instead of type-safe enums, I suspect that the DOM group simply felt that type-safe enums were too complicated to implement in IDL (if they considered the possibility at all). After all, this whole set of node types is really just a hack for languages whose reflection isn't as complete as Java's.
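One practical difference between the two styles of type test can be sketched as follows. Because CDATASection extends Text in the DOM interfaces, instanceof reports a CDATA section as a Text, while getNodeType() keeps the two apart. The class name TypeCheckDemo and its helper methods are hypothetical, and the sketch assumes the JAXP parser bundled with the JDK in its default (non-coalescing) configuration. It also borrows getDocumentElement() from the Document interface to reach the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class TypeCheckDemo {

  // Parses a string into a DOM Document using whatever parser JAXP finds.
  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Returns the last child of the root element, which is the CDATA section.
  public static Node cdataChild() {
    Document doc = parse("<root>a<![CDATA[b]]></root>");
    return doc.getDocumentElement().getLastChild();
  }

  public static void main(String[] args) {
    Node node = cdataChild();
    // CDATASection is a subinterface of Text, so this test is true.
    System.out.println(node instanceof Text);
    // The type constant distinguishes the CDATA section from plain text.
    System.out.println(node.getNodeType() == Node.CDATA_SECTION_NODE);
    System.out.println(node.getNodeType() == Node.TEXT_NODE);
  }
}
```

With the assumptions above, the three lines printed should be true, true, and false.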

Example 9.9 is a simple utility class that uses the getNodeType() method and these constants to return a string specifying the node type. In itself, it isn't very interesting, but I'll need it for a few of the later programs.

Example 9.9 Changing Short Type Constants to Strings

import org.w3c.dom.Node;

public class NodeTyper {

  public static String getTypeName(Node node) {

    int type = node.getNodeType();
    /* Yes, getNodeType() returns a short, but Java will
       almost always upcast this short to an int before
       using it in any operation, so we might as well just go
       ahead and use the int in the first place. */

    switch (type) {
      case Node.ELEMENT_NODE: return "Element";
      case Node.ATTRIBUTE_NODE: return "Attribute";
      case Node.TEXT_NODE: return "Text";
      case Node.CDATA_SECTION_NODE: return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: return "Entity Reference";
      case Node.ENTITY_NODE: return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE:
       return "Processing Instruction";
      case Node.COMMENT_NODE: return "Comment";
      case Node.DOCUMENT_NODE: return "Document";
      case Node.DOCUMENT_TYPE_NODE:
       return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE:
       return "Document Fragment";
      case Node.NOTATION_NODE: return "Notation";
      default: return "Unknown Type";
      /* It is possible for the default case to be
         reached. DOM only defines 12 kinds of nodes, but other
         application-specific DOMs can add their own as well.
         You're not likely to encounter these while parsing an
         XML document with a standard parser, but you might
         encounter such things with custom parsers designed for
         non-XML documents. DOM Level 3 XPath does define a
         13th kind of node, XPathNamespace. */
    }

  }

}

Node Properties

The next batch of methods allows you to get and, in a couple of cases, set the common node properties. Although all nodes have these methods, they don't necessarily return a sensible value for every kind of node. For example, only element and attribute nodes have namespace URIs. getNamespaceURI() returns null when invoked on any other kind of node. The getNodeName() method returns the complete name for nodes that have names, and #node-type for nodes that don't have names; that is, #document, #text, #comment, and so on.

public String getNodeName()
public String getNodeValue() throws DOMException
public void   setNodeValue(String value) throws DOMException
public short  getNodeType()
public String getNamespaceURI()
public String getPrefix()
public void   setPrefix(String prefix) throws DOMException
public String getLocalName()
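As a quick illustration of which getters return something useful for which kind of node, the following sketch parses a tiny namespaced document and prints the name properties of four nodes. The class NamePropertiesDemo, its describe() helper, and the sample namespace URI are all made up for this illustration. The parser is assumed to be the namespace-aware JAXP default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class NamePropertiesDemo {

  // Parses a string with namespace support turned on.
  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Formats the node name, local name, prefix, and namespace URI.
  public static String describe(Node node) {
    return node.getNodeName() + " | " + node.getLocalName() + " | "
        + node.getPrefix() + " | " + node.getNamespaceURI();
  }

  public static void main(String[] args) {
    Document doc = parse(
        "<a:root xmlns:a='http://example.com/ns'>hi<!--c--></a:root>");
    Node root = doc.getDocumentElement();
    System.out.println(describe(doc));                  // #document | null | null | null
    System.out.println(describe(root));                 // a:root | root | a | http://example.com/ns
    System.out.println(describe(root.getFirstChild())); // #text | null | null | null
    System.out.println(describe(root.getLastChild()));  // #comment | null | null | null
  }
}
```

Only the element has a local name, prefix, and namespace URI. The document, text, and comment nodes report just their #node-type names.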

Example 9.10 demonstrates another simple utility class that accepts a Node as an argument and prints out the values of its non-null properties. Again, I'll be using this class shortly in another program.

Example 9.10 A Class to Inspect the Properties of a Node

import org.w3c.dom.*;
import java.io.*;

public class PropertyPrinter {

  private Writer out;

  public PropertyPrinter(Writer out) {
    if (out == null) {
      throw new NullPointerException("Writer must be non-null.");
    }
    this.out = out;
  }

  public PropertyPrinter() {
    this(new OutputStreamWriter(System.out));
  }

  private int nodeCount = 0;

  public void writeNode(Node node) throws IOException {

    if (node == null) {
      throw new NullPointerException("Node must be non-null.");
    }
    if (node.getNodeType() == Node.DOCUMENT_NODE
     || node.getNodeType() == Node.DOCUMENT_FRAGMENT_NODE) {
      // starting a new document, reset the node count
      nodeCount = 1;
    }

    String name      = node.getNodeName(); // never null
    String type      = NodeTyper.getTypeName(node); // never null
    String localName = node.getLocalName();
    String uri       = node.getNamespaceURI();
    String prefix    = node.getPrefix();
    String value     = node.getNodeValue();

    StringBuffer result = new StringBuffer();
    result.append("Node " + nodeCount + ":\r\n");
    result.append("  Type: " + type + "\r\n");
    result.append("  Name: " + name + "\r\n");
    if (localName != null) {
      result.append("  Local Name: " + localName + "\r\n");
    }
    if (prefix != null) {
      result.append("  Prefix: " + prefix + "\r\n");
    }
    if (uri != null) {
      result.append("  Namespace URI: " + uri + "\r\n");
    }
    if (value != null) {
      result.append("  Value: " + value + "\r\n");
    }

    out.write(result.toString());
    out.write("\r\n");
    out.flush();

    nodeCount++;

  }

}

The writeNode() method operates on a Node object without any clue what its actual type is. It prints the properties of the node onto the configured Writer in the following form:

Node 16:
  Type: Text
  Name: #text
  Value: RHAT

The format changes depending on what kind of node is passed to it.

There are also two methods in the Node interface that can change a node. First, the setPrefix() method changes a node's namespace prefix. Trying to use an illegal or reserved prefix throws a DOMException. This method has no effect on anything except an element or an attribute node.

Second, the setNodeValue() method changes the node's string value. It can be used on attribute, comment, text, processing instruction, and CDATA section nodes. It has no effect on other kinds of nodes. It throws a DOMException if the node you're setting is read-only (as a text node might be inside an entity node).

The remaining properties cannot be set from the Node interface. To change names, URIs, and such you have to use the more specific interfaces, such as Element and Attr. Most of the time, you're better off using the more detailed subinterfaces if you're trying to change a tree, anyway.
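The two setters can be tried out with a sketch like the one below. SetterDemo, sample(), and applyChanges() are hypothetical names, and the results noted in the comments are what I would expect from the parser bundled with the JDK; other implementations should behave the same way for these cases, but that is an assumption. Note that setPrefix() only renames the node. It does not add a matching xmlns declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class SetterDemo {

  // Parses a small namespaced sample document.
  public static Document sample() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      String xml = "<a:root xmlns:a='http://example.com/ns'>old</a:root>";
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Applies both setters through nothing but the Node interface.
  public static void applyChanges(Document doc) {
    Node root = doc.getDocumentElement();
    Node text = root.getFirstChild();
    text.setNodeValue("new");     // text nodes have a settable value
    root.setNodeValue("ignored"); // elements have no value; no effect
    root.setPrefix("b");          // changes the qualified name only
  }

  public static void main(String[] args) {
    Document doc = sample();
    applyChanges(doc);
    Node root = doc.getDocumentElement();
    System.out.println(root.getFirstChild().getNodeValue()); // new
    System.out.println(root.getNodeValue());                 // null
    System.out.println(root.getNodeName());                  // b:root
  }
}
```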

Navigating the Tree

The third batch of methods allows you to navigate the tree by finding the parent, first child, last child, previous and next siblings, and attributes of any node. Because not all nodes have children, you should test for their presence with hasChildNodes() before calling the getFirstChild() and getLastChild() methods. You should also be prepared for any of these methods to return null in the event that the requested node doesn't exist. Similarly, you should check hasAttributes() before calling the getAttributes() method.

Example 9.11 is a simple program that recursively traverses the tree in a preorder fashion. As each node is visited, its name and value are printed using the previous section's PropertyPrinter class. Once again, Node is the only DOM class used. That's the power of polymorphism. You can do quite a lot without knowing exactly what you're doing it to.
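The null checks described above can be sketched in a few lines. NavigationDemo, childNames(), and attributeCount() are hypothetical helpers written for this illustration; they touch only the Node interface apart from getDocumentElement() on the Document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class NavigationDemo {

  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Walks the direct children; the loop ends when getNextSibling()
  // returns null, and never starts if getFirstChild() returns null.
  public static List<String> childNames(Node parent) {
    List<String> names = new ArrayList<>();
    for (Node child = parent.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      names.add(child.getNodeName());
    }
    return names;
  }

  // getAttributes() returns null for non-element nodes, so check first.
  public static int attributeCount(Node node) {
    return node.hasAttributes() ? node.getAttributes().getLength() : 0;
  }

  public static void main(String[] args) {
    Document doc = parse("<root id='r'><a/>text<b/></root>");
    Node root = doc.getDocumentElement();
    System.out.println(childNames(root));                        // [a, #text, b]
    System.out.println(attributeCount(root));                    // 1
    System.out.println(attributeCount(root.getFirstChild()));    // 0
    System.out.println(root.getFirstChild().getPreviousSibling()); // null
  }
}
```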

Example 9.11 Walking the Tree with the Node Interface

import javax.xml.parsers.*;  // JAXP
import org.w3c.dom.Node;
import org.xml.sax.SAXException;
import java.io.IOException;

public class TreeReporter {

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java TreeReporter URL");
      return;
    }

    TreeReporter iterator = new TreeReporter();
    try {
      // Use JAXP to find a parser
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();

      // Turn on namespace support
      factory.setNamespaceAware(true);
      DocumentBuilder parser = factory.newDocumentBuilder();

      // Read the entire document into memory
      Node document = parser.parse(args[0]);

      // Process it starting at the root
      iterator.followNode(document);
    }
    catch (SAXException e) {
      System.out.println(args[0] + " is not well-formed.");
      System.out.println(e.getMessage());
    }
    catch (IOException e) {
      System.out.println(e);
    }
    catch (ParserConfigurationException e) {
      System.out.println("Could not locate a JAXP parser");
    }

  } // end main

  private PropertyPrinter printer = new PropertyPrinter();

  // note use of recursion
  public void followNode(Node node) throws IOException {

    printer.writeNode(node);
    if (node.hasChildNodes()) {
      Node firstChild = node.getFirstChild();
      followNode(firstChild);
    }
    Node nextNode = node.getNextSibling();
    if (nextNode != null) followNode(nextNode);

  }

}

Following is the beginning of the output produced by running this program across Example 9.2:

% java TreeReporter getQuote.xml
Node 1:
  Type: Document
  Name: #document

Node 2:
  Type: Processing Instruction
  Name: xml-stylesheet
  Value: type="text/css" href="xml-rpc.css"

Node 3:
  Type: Comment
  Name: #comment
  Value:  It's unusual to have an xml-stylesheet processing
     instruction in an XML-RPC document but it is legal, unlike
     SOAP where processing instructions are forbidden.

Node 4:
  Type: Document Type Declaration
  Name: methodCall

Node 5:
  Type: Element
  Name: methodCall
  ...

The key to this program is the followNode() method. It first writes the node using the PropertyPrinter, then recursively invokes followNode() on the current node's first child and then its next sibling. This is equivalent to XPath document order (in which children come before siblings). The hasChildNodes() method tests whether there actually are children before asking for the first child node. For siblings, we have to retrieve the next sibling whether there is one or not, and then check to see whether it's null before dereferencing it.
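The same document-order walk can also be written so that it recurses only into children and loops across siblings, which keeps the recursion depth tied to the depth of the tree rather than to the number of siblings. The sketch below is one such variant, not the book's TreeReporter. DocumentOrderDemo and collect() are hypothetical names, and the node names are gathered into a list instead of being printed so the order is easy to inspect.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class DocumentOrderDemo {

  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Preorder: visit the node, then each child subtree from first to last.
  public static void collect(Node node, List<String> out) {
    out.add(node.getNodeName());
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      collect(child, out);
    }
  }

  public static List<String> namesInDocumentOrder(String xml) {
    List<String> names = new ArrayList<>();
    collect(parse(xml), names);
    return names;
  }

  public static void main(String[] args) {
    // prints [#document, a, b, #text, c, d]
    System.out.println(namesInDocumentOrder("<a><b/>text<c><d/></c></a>"));
  }
}
```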

TreeReporter is actually very raw. As you'll see, DOM provides a lot of helper classes that make operations such as this much simpler to code. However, it never hurts to keep in mind what all those helper classes are doing behind the scenes, which in fact is very much like this.

Modifying the Tree

The Node interface has four methods that change the tree by inserting, removing, replacing, and appending children at points specified by nodes in the tree:

public Node insertBefore(Node toBeInserted, Node toBeInsertedBefore)
 throws DOMException
public Node replaceChild(Node toBeInserted, Node toBeReplaced)
 throws DOMException
public Node removeChild(Node toBeRemoved) throws DOMException
public Node appendChild(Node toBeAppended) throws DOMException

Any of these four methods will throw a DOMException if you try to use it to make a document malformed; for instance, by removing the root element or appending a child to a text node. All four methods return the node being inserted/replaced/removed/appended.
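A minimal sketch of both points, the exception and the return value, follows. HierarchyDemo and its two methods are hypothetical. The error code I expect for appending to a text node is HIERARCHY_REQUEST_ERR, which is what the DOM specification calls for and what the JDK's bundled implementation reports.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class HierarchyDemo {

  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Tries to append an element to a text node and returns the
  // DOMException code, or 0 if no exception was thrown.
  public static short appendToText() {
    Document doc = parse("<root><child/>text</root>");
    Node root = doc.getDocumentElement();
    Node child = root.getFirstChild();
    Node text = root.getLastChild();
    try {
      text.appendChild(child);
      return 0; // not expected: text nodes cannot have children
    } catch (DOMException e) {
      return e.code;
    }
  }

  // removeChild() hands back the very node it detached.
  public static boolean removeReturnsSameNode() {
    Document doc = parse("<root><child/></root>");
    Node root = doc.getDocumentElement();
    Node child = root.getFirstChild();
    return root.removeChild(child) == child;
  }

  public static void main(String[] args) {
    System.out.println(
        appendToText() == DOMException.HIERARCHY_REQUEST_ERR); // true
    System.out.println(removeReturnsSameNode());               // true
  }
}
```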

The only use for these methods is to move nodes around in the same document. Although removeChild() and replaceChild() disconnect nodes from a document's tree, they do not change those nodes' owner document. The disconnected nodes cannot be placed in a different document. Nodes can only be placed in the document where they begin their life. Moving a node from one document to another requires importing it, a technique that I'll take up in Chapter 10.
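The ownership rule can be checked with a short experiment. In this sketch, OwnerDocumentDemo and its methods are hypothetical names. A node is removed from one document and then offered to a second; the DOM specification says the attempt should fail with WRONG_DOCUMENT_ERR, and that is what I expect from the JDK's bundled implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class OwnerDocumentDemo {

  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // A removed node loses its parent but keeps its owner document.
  public static boolean removedNodeKeepsOwner() {
    Document first = parse("<root><child/></root>");
    Node root = first.getDocumentElement();
    Node removed = root.removeChild(root.getFirstChild());
    return removed.getParentNode() == null
        && removed.getOwnerDocument() == first;
  }

  // Returns the DOMException code raised when a node from one
  // document is appended to another, or 0 if nothing was thrown.
  public static short moveToOtherDocument() {
    Document first = parse("<root><child/></root>");
    Document second = parse("<other/>");
    Node root = first.getDocumentElement();
    Node removed = root.removeChild(root.getFirstChild());
    try {
      second.getDocumentElement().appendChild(removed);
      return 0;
    } catch (DOMException e) {
      return e.code;
    }
  }

  public static void main(String[] args) {
    System.out.println(removedNodeKeepsOwner()); // true
    System.out.println(
        moveToOtherDocument() == DOMException.WRONG_DOCUMENT_ERR);
  }
}
```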

It's hard to come up with a plausible example of these methods until I've shown you how to create new nodes, also in Chapter 10. In the meantime, Example 9.12 is a program that moves all processing instruction nodes from inside the root element to before the root element, and all comment nodes from inside the root element to after the root element. For example, this document:

<?xml version="1.0"?>
<document>
  Some data
  <!-- first comment -->
  <?example first processing instruction ?>
  Some more data
  <!-- second comment -->
  <?example second processing instruction ?>
  <empty/>
</document>

would become this document:

<?xml version="1.0" encoding="utf-8"?>
<?example first processing instruction ?>
<?example second processing instruction ?><document>
  Some data
  Some more data
  <empty/>
</document><!-- first comment --><!-- second comment -->

I don't actually think this is a sensible thing to do. In particular, it inaccurately implies that comments and processing instructions can be removed and reordered willy-nilly without changing anything significant, which is not true in general. This is just the best example of these methods I could come up with without using too many classes and interfaces we haven't yet covered.

Example 9.12 A Method That Changes a Document by Reordering Nodes

import org.w3c.dom.*;

public class Restructurer {

  // Since this method only operates on its argument and does
  // not interact with any fields in the class, it's
  // plausibly made static.
  public static void processNode(Node current)
   throws DOMException {

    // I need to store a reference to the current node's next
    // sibling before we delete the node from the tree, in which
    // case it no longer has a sibling
    Node nextSibling = current.getNextSibling();
    int nodeType = current.getNodeType();

    if (nodeType == Node.COMMENT_NODE
     || nodeType == Node.PROCESSING_INSTRUCTION_NODE) {
      Node document = current.getOwnerDocument();
      // Find the root element by looping through the children of
      // the document until we find the only one that's an
      // element node. There's a quicker way to do this once we
      // learn more about the Document class in the next chapter.
      Node root = document.getFirstChild();
      while (!(root.getNodeType() == Node.ELEMENT_NODE)) {
        root = root.getNextSibling();
      }
      Node parent = current.getParentNode();
      parent.removeChild(current);
      if (nodeType == Node.COMMENT_NODE) {
        document.appendChild(current);
      }
      else if (nodeType == Node.PROCESSING_INSTRUCTION_NODE) {
        document.insertBefore(current, root);
      }
    }
    else if (current.hasChildNodes()) {
      Node firstChild = current.getFirstChild();
      processNode(firstChild);
    }

    if (nextSibling != null) {
      processNode(nextSibling);
    }

  }

}

This program walks the tree, calling the removeChild() method every time a comment or processing instruction node is spotted, and then inserting the processing instruction nodes before the root element with insertBefore() and the comment nodes after the root element with appendChild(). References to the document node, the root element node, and the nearest parent node all have to be stored while a node is being moved, and the next sibling has to be saved before the node is removed. The Document object is modified in place.

This program does not provide any means of outputting the changed document to a file where you can look at it. That too is coming.

Utility Methods

Finally, there are three assorted utility methods:

public Node    cloneNode(boolean deep)
public void    normalize()
public boolean isSupported(String feature, String version)

normalize()

The normalize() method descends the tree from the given node, merges all adjacent text nodes, and deletes empty text nodes. This operation makes DOM roughly equivalent to an XPath data model in which each text node contains the maximum contiguous run of text not interrupted by markup. However, normalize() does not merge CDATA section nodes, which XPath would require.

The easiest approach is to invoke normalize() on the Document object as soon as you get it. For example,

Document document = parser.parse(args[0]);
document.normalize();
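To see what normalize() actually does, you need adjacent text nodes, which a parser does not usually hand you. The sketch below manufactures them with Text.splitText() and then counts children before and after normalizing. It also checks the CDATA case mentioned above. NormalizeDemo and its methods are hypothetical, and the CDATA result assumes the JAXP factory is left in its default, non-coalescing configuration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class NormalizeDemo {

  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Returns the child counts of the root element after splitting its
  // text node in two, and again after normalizing the document.
  public static int[] splitThenNormalize() {
    Document doc = parse("<root>hello world</root>");
    Node root = doc.getDocumentElement();
    Text text = (Text) root.getFirstChild();
    text.splitText(5); // now "hello" and " world" are adjacent siblings
    int before = root.getChildNodes().getLength();
    doc.normalize();
    int after = root.getChildNodes().getLength();
    return new int[] {before, after};
  }

  // Text, CDATA section, text: normalize() leaves all three in place.
  public static int childrenAroundCdata() {
    Document doc = parse("<root>a<![CDATA[b]]>c</root>");
    doc.normalize();
    return doc.getDocumentElement().getChildNodes().getLength();
  }

  public static void main(String[] args) {
    int[] counts = splitThenNormalize();
    System.out.println(counts[0] + " -> " + counts[1]); // 2 -> 1
    System.out.println(childrenAroundCdata());          // 3
  }
}
```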

cloneNode()

The cloneNode() method makes a copy of the given node. If the deep argument is true, then the copy contains the full contents of the node including all of its descendants. If the deep argument is false, then the clone does not contain copies of the original node's children. The cloned node is disconnected; that is, it is not a child of the original node's parent. However, it does belong to the original node's document, even though it doesn't have a position in that document's tree. It can be added via insertBefore(), appendChild(), or replaceChild(). Conversely, the clone cannot be inserted into a different document. To make a copy for a different document, you would instead use the importNode() method in the Document interface. We'll look at this in Chapter 10.
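The difference between deep and shallow clones, and the disconnected state of a fresh clone, can be sketched as follows. CloneDemo and its helper methods are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class CloneDemo {

  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Clones the <a> element and reports whether the copy has children.
  public static boolean cloneHasChildren(boolean deep) {
    Document doc = parse("<root><a><b/></a></root>");
    Node a = doc.getDocumentElement().getFirstChild();
    return a.cloneNode(deep).hasChildNodes();
  }

  // A new clone has no parent but shares the original's owner document,
  // so it can be appended to that document's tree. Returns the number
  // of children the root has once the clone is attached.
  public static int attachClone() {
    Document doc = parse("<root><a><b/></a></root>");
    Node root = doc.getDocumentElement();
    Node copy = root.getFirstChild().cloneNode(true);
    if (copy.getParentNode() != null || copy.getOwnerDocument() != doc) {
      throw new IllegalStateException("unexpected clone state");
    }
    root.appendChild(copy);
    return root.getChildNodes().getLength();
  }

  public static void main(String[] args) {
    System.out.println(cloneHasChildren(false)); // false
    System.out.println(cloneHasChildren(true));  // true
    System.out.println(attachClone());           // 2
  }
}
```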

isSupported()

The isSupported() method determines whether or not this node provides a given feature. For example, you can pass the string "Events" to this method to find out whether or not this one node supports the events module. The version number for all DOM2 features is 2.0.

The isSupported() method isn't used much, since there's little point to asking for the features an individual node supports. A similar method named hasFeature() in the DOMImplementation interface is more useful.
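If you want to see what your own parser claims, a sketch along these lines asks the same question both ways. FeatureDemo and check() are hypothetical names, and the feature strings are just a few of the DOM2 module names. The answers depend entirely on the DOM implementation JAXP locates, so no output is shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not one of the book's numbered examples.
public class FeatureDemo {

  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Returns { node.isSupported(...), implementation.hasFeature(...) }
  // for one DOM2 feature name, asked of the root element and of the
  // DOMImplementation that produced its document.
  public static boolean[] check(String feature) {
    Document doc = parse("<root/>");
    Node root = doc.getDocumentElement();
    boolean onNode = root.isSupported(feature, "2.0");
    boolean onImpl = doc.getImplementation().hasFeature(feature, "2.0");
    return new boolean[] {onNode, onImpl};
  }

  public static void main(String[] args) {
    String[] features = {"Core", "XML", "Events", "Traversal"};
    for (String feature : features) {
      boolean[] result = check(feature);
      System.out.println(feature + ": isSupported=" + result[0]
          + ", hasFeature=" + result[1]);
    }
  }
}
```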