The Element Interface | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

The Element interface is perhaps the most important of all the DOM component interfaces. After all, it's possible to write XML documents without any comments, processing instructions, attributes, CDATA sections, entity references, or even text nodes. By contrast, every XML document has at least one element, and most XML documents have many more. Elements, more than any other component, define the structure of an XML document.

Example 11.1 summarizes the Element interface. This interface includes methods to get the prefixed name of the element, manipulate the attributes on the element, and select from the element's descendants. Of course, Element objects also have all the methods of the Node superinterface, such as appendChild() and getNamespaceURI().

Example 11.1 The Element Interface

    package org.w3c.dom;

    public interface Element extends Node {

      public String  getTagName();

      public boolean hasAttribute(String name);
      public boolean hasAttributeNS(String namespaceURI,
       String localName);

      public String  getAttribute(String name);
      public void    setAttribute(String name, String value)
       throws DOMException;
      public void    removeAttribute(String name)
       throws DOMException;

      public Attr    getAttributeNode(String name);
      public Attr    setAttributeNode(Attr newAttr)
       throws DOMException;
      public Attr    removeAttributeNode(Attr oldAttr)
       throws DOMException;

      public String  getAttributeNS(String namespaceURI,
       String localName);
      public void    setAttributeNS(String namespaceURI,
       String qualifiedName, String value) throws DOMException;
      public void    removeAttributeNS(String namespaceURI,
       String localName) throws DOMException;

      public Attr    getAttributeNodeNS(String namespaceURI,
       String localName);
      public Attr    setAttributeNodeNS(Attr newAttr)
       throws DOMException;

      public NodeList getElementsByTagName(String name);
      public NodeList getElementsByTagNameNS(String namespaceURI,
       String localName);

    }

The aesthetics of this interface are seriously marred by DOM's requirement to avoid method overloading. The differences in the argument lists are redundantly encoded in the method names. For example, if DOM had been written in pure Java, then there would probably be three setAttribute() methods with these signatures:

    public void setAttribute(String name, String value)
     throws DOMException
    public void setAttribute(String namespaceURI, String localName,
     String value) throws DOMException
    public void setAttribute(Attr attribute) throws DOMException

Instead, Element has these four methods with slightly varied names:

    public void setAttribute(String name, String value)
     throws DOMException
    public void setAttributeNS(String namespaceURI,
     String qualifiedName, String value) throws DOMException
    public Attr setAttributeNode(Attr newAttr) throws DOMException
    public Attr setAttributeNodeNS(Attr newAttr) throws DOMException

The distinction between setAttributeNode() and setAttributeNodeNS() is unnecessary. setAttributeNode() is used only with attributes in no namespace, whereas setAttributeNodeNS() is used only with attributes in a namespace. The only motivation I can imagine for this is symmetry with the getter methods, where the distinction is relevant because the argument lists are different. For the setter methods, however, this is frankly a mistake. Attr objects include their own namespace information. There's no need for separate methods to set nodes with and without namespaces.
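To illustrate that an Attr object carries its own namespace information, here is a minimal sketch. The class name AttrNamespaceDemo, the link element, and the choice of the XLink namespace are assumptions made purely for illustration; the DOM calls themselves are the standard ones from org.w3c.dom.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttrNamespaceDemo {

  // Namespace chosen only for illustration
  static final String XLINK = "http://www.w3.org/1999/xlink";

  public static String describe() {
    Document doc;
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      doc = factory.newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    Element link = doc.createElement("link");
    doc.appendChild(link);

    // An Attr created with createAttributeNS() records its own
    // namespace URI and local name
    Attr href = doc.createAttributeNS(XLINK, "xlink:href");
    href.setValue("http://www.example.com/");
    link.setAttributeNodeNS(href);

    // An Attr created with createAttribute() is in no namespace
    Attr id = doc.createAttribute("id");
    id.setValue("a1");
    link.setAttributeNode(id);

    return href.getNamespaceURI() + " " + href.getLocalName()
        + " | " + id.getNamespaceURI() + " " + id.getName()
        + " | " + link.getAttributeNS(XLINK, "href");
  }

  public static void main(String[] args) {
    System.out.println(describe());
  }
}
```

In both cases the node already knows whether it is in a namespace, so the setter method name tells the implementation nothing it could not learn from the argument.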

Extracting Elements

The getElementsByTagName() and getElementsByTagNameNS() methods behave the same as the similarly named methods in the Document interface discussed in Chapter 10. The only difference is that they search through a single element rather than the entire document. These methods return a NodeList that contains all of the elements with the specified name.

An asterisk (*) can be passed as either argument to indicate that all names or namespaces are desired. This is particularly useful for the local name passed to getElementsByTagNameNS(). For example, the following NodeList would contain all RDF elements that are descendants of element:

    NodeList rdfs = element.getElementsByTagNameNS(
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#", "*");

The list returned is sorted in document order. In other words, elements are arranged in order of the appearance of their start-tags. If the start-tag for element A appears earlier in the document than the start-tag for element B, then element A comes before element B in the list.
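The wildcard and the document-order guarantee can be seen together in a small sketch. The class name DocumentOrderDemo and the sample document are hypothetical, invented only to have something to search; the parser must be namespace aware for getElementsByTagNameNS() to match anything.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DocumentOrderDemo {

  static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

  // Hypothetical sample document mixing RDF and non-RDF elements
  static final String SAMPLE =
      "<page xmlns:rdf='" + RDF + "'>"
    + "<rdf:RDF><rdf:Description><title>x</title>"
    + "<rdf:Bag><rdf:li>a</rdf:li></rdf:Bag>"
    + "</rdf:Description></rdf:RDF>"
    + "<rdf:Seq/>"
    + "</page>";

  // Returns the local names of all RDF descendants of the root
  // element, in the order the NodeList reports them
  public static List<String> rdfElementNames(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      Element root = doc.getDocumentElement();
      NodeList rdfs = root.getElementsByTagNameNS(RDF, "*");
      List<String> names = new ArrayList<String>();
      for (int i = 0; i < rdfs.getLength(); i++) {
        names.add(rdfs.item(i).getLocalName());
      }
      return names;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(rdfElementNames(SAMPLE));
  }
}
```

The title element is skipped because it is not in the RDF namespace, and the remaining elements appear in the order of their start-tags.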

The next example was inspired by the source code for this very book. Prior to publication, I needed to extract all of the code examples from the source text and put them in separate directories by chapter. That is, the examples from Chapter 1 went into examples/1; the examples from Chapter 2 went into examples/2; and so forth. XSLT 1.0 isn't quite up to this task, but DOM and Java are more powerful. ^[1]

^[1] XSLT 2.0 could handle this, and many XSLT engines include extension functions that could pull this off in XSLT 1.0. But I needed the example. :-)

The source code for this book is structured as follows:

    <book>
      ...
      <chapter>
        ...
        <example id="filename.java">
          <title>Some Java Program</title>
          <programlisting>import javax.xml.parsers;
            // more Java code...
          </programlisting>
        </example>
        ...
        <example id="filename.xml">
          <title>Some XML document</title>
          <programlisting><![CDATA[<?xml version="1.0"?>
    <root>
      ...
    </root>]]></programlisting>
        </example>
        ...
      </chapter>
      more chapters...
    </book>

At least, that's the part that is relevant to this example. The advantage of getElementsByTagName() and getElementsByTagNameNS() is that a program can straightforwardly extract just the parts that interest it without explicitly walking the entire tree. ^[2] These methods effectively flatten the hierarchy to just the elements of interest. In this case, those elements are chapter and example. Inside each example, the complete structure is somewhat more relevant; therefore, the normal tree-walking methods of Node are called for.

^[2] A naive DOM implementation probably would implement getElementsByTagName() and getElementsByTagNameNS() by walking the tree or subtree, but there also exist more efficient implementations based on detailed knowledge of the data structures that implement the various interfaces. For example, a DOM that sits on top of a native XML database might have access to an index of all the elements in the document.
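The naive approach mentioned in the footnote might look something like the following sketch. NaiveTagSearch and its methods are hypothetical names, and this is not how any particular DOM implementation is known to work; it simply walks the subtree recursively and collects matching elements in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NaiveTagSearch {

  // Collects all descendant elements of start whose tag name equals
  // name ("*" matches every element), in document order
  public static List<Element> descendantsNamed(Element start, String name) {
    List<Element> result = new ArrayList<Element>();
    collect(start, name, result);
    return result;
  }

  private static void collect(Node parent, String name,
      List<Element> result) {
    for (Node child = parent.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      if (child.getNodeType() == Node.ELEMENT_NODE) {
        Element element = (Element) child;
        if ("*".equals(name) || element.getTagName().equals(name)) {
          result.add(element);
        }
      }
      // Recurse into every child; nodes without children are no-ops
      collect(child, name, result);
    }
  }

  // Helper for the demonstration only
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element root = parse("<book><chapter><example id='a'/>"
        + "<note><example id='b'/></note></chapter>"
        + "<example id='c'/></book>").getDocumentElement();
    NodeList builtIn = root.getElementsByTagName("example");
    List<Element> walked = descendantsNamed(root, "example");
    System.out.println(builtIn.getLength() + " " + walked.size());
  }
}
```

Visiting each parent before its children and each child before its following siblings is what yields document order.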

The program follows these steps:

1. Parse the entire book into a Document object.
2. Use Document's getElementsByTagName() method to retrieve a list of all chapter elements in the document. (DocBook doesn't use namespaces, so getElementsByTagName() is chosen over getElementsByTagNameNS().)
3. For each element in that list, use Element's getElementsByTagName() method to retrieve a list of all example elements in that chapter.
4. From each element in that list, extract its programlisting child element.
5. Write the text content of that programlisting element into a new file named by the ID of the example element.

This example is quite specific to one XML application, DocBook. Indeed, it won't even work with all DocBook documents because it relies on various private conventions of this specific DocBook document, in particular that the id attribute of each example element contains a file name. But that's all right. Most programs you write will be designed to process only certain XML documents in certain situations.

To increase robustness, I do require that the DocBook document be valid, and the parser does validate the document. If validation fails, this program aborts without extracting the examples, because it can't be sure whether the document meets the preconditions. Example 11.2 demonstrates.

Example 11.2 Extracting Examples from DocBook

    import javax.xml.parsers.*;
    import org.w3c.dom.*;
    import org.xml.sax.*;
    import java.io.*;

    public class ExampleExtractor {

      public static void extract(Document doc) throws IOException {

        NodeList chapters = doc.getElementsByTagName("chapter");

        for (int i = 0; i < chapters.getLength(); i++) {

          Element chapter = (Element) chapters.item(i);
          NodeList examples = chapter.getElementsByTagName("example");

          for (int j = 0; j < examples.getLength(); j++) {

            Element example = (Element) examples.item(j);
            String fileName = example.getAttribute("id");
            // All examples should have id attributes, but it's safer
            // not to assume that. getAttribute() returns the empty
            // string, not null, when the attribute is missing.
            if (fileName == null || fileName.length() == 0) {
              throw
               new IllegalArgumentException("Missing id on example");
            }
            NodeList programlistings
             = example.getElementsByTagName("programlisting");
            // Each example is supposed to contain exactly one
            // programlisting, but we should verify that
            if (programlistings.getLength() != 1) {
              throw new
               IllegalArgumentException("Missing programlisting");
            }
            Element programlisting = (Element) programlistings.item(0);

            // Extract text content; this is a little tricky because
            // these often contain CDATA sections and entity
            // references which can be represented as separate nodes,
            // so we can't just ask for the first text node child of
            // each program listing.
            String code = getText(programlisting);

            // write code into a file
            File dir = new File("examples2/" + i);
            dir.mkdirs();
            File file = new File(dir, fileName);
            System.out.println(file);
            FileOutputStream fout = new FileOutputStream(file);
            Writer out = new OutputStreamWriter(fout, "UTF-8");
            // Buffering almost always helps performance a lot
            out = new BufferedWriter(out);
            out.write(code);
            // Remember to flush and close your streams
            out.flush();
            out.close();

          } // end examples loop

        } // end chapters loop

      }

      public static String getText(Node node) {

        // We need to retrieve the text from elements, entity
        // references, CDATA sections, and text nodes; but not
        // comments or processing instructions
        int type = node.getNodeType();
        if (type == Node.COMMENT_NODE
         || type == Node.PROCESSING_INSTRUCTION_NODE) {
          return "";
        }

        StringBuffer text = new StringBuffer();

        String value = node.getNodeValue();
        if (value != null) text.append(value);
        if (node.hasChildNodes()) {
          NodeList children = node.getChildNodes();
          for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            text.append(getText(child));
          }
        }

        return text.toString();

      }

      public static void main(String[] args) {

        if (args.length <= 0) {
          System.out.println("Usage: java ExampleExtractor URL");
          return;
        }
        String url = args[0];

        try {
          DocumentBuilderFactory factory
           = DocumentBuilderFactory.newInstance();
          factory.setValidating(true);
          DocumentBuilder parser = factory.newDocumentBuilder();
          parser.setErrorHandler(new ValidityRequired());

          // Read the document
          Document document = parser.parse(url);

          // Extract the examples
          extract(document);

        }
        catch (SAXException e) {
          System.out.println(e);
        }
        catch (IOException e) {
          System.out.println(
           "Due to an IOException, the parser could not read " + url
          );
          System.out.println(e);
        }
        catch (FactoryConfigurationError e) {
          System.out.println("Could not locate a factory class");
        }
        catch (ParserConfigurationException e) {
          System.out.println("Could not locate a JAXP parser");
        }

      } // end main

    }

    // Make validity errors fatal
    class ValidityRequired implements ErrorHandler {

      public void warning(SAXParseException e)
        throws SAXException {
        // ignore warnings
      }

      public void error(SAXParseException e)
       throws SAXException {
        // Mostly validity errors. Make them fatal.
        throw e;
      }

      public void fatalError(SAXParseException e)
       throws SAXException {
        throw e;
      }

    }

Because ExampleExtractor is fairly involved, I've factored it into several relatively independent pieces. The main() method builds the parser and parses the document as usual. The nonpublic class ValidityRequired more or less converts all errors into fatal errors by rethrowing the exception passed to it. Assuming validation succeeds, the document is then passed to the extract() method.

The extract() method iterates through all the chapter and example elements in the book using getElementsByTagName(). Each example is supposed to have an id attribute and a single programlisting child element, but because this is just a convention for this one document rather than a rule enforced by the DTD, the code checks to make sure that's true. If it isn't true, then the code throws an IllegalArgumentException.

Next comes one of the trickiest parts of working with elements in DOM. I need to extract the text content of the programlisting element. This sounds simple enough, except that there's no method in either Element or Node that performs this routine task. You might expect getNodeValue() to do this, especially if you're accustomed to XPath. But in DOM, unlike XPath, the value of an element is null. Only its children have values. Thus I need to descend recursively through the children of the programlisting element, accumulating the values of all text nodes, entity references, CDATA sections, and other elements as I go. The getText() method accomplishes this.
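The difference between an element's own value and its children's values can be checked with a short sketch. ElementValueDemo and its sample string are hypothetical, and the helper only looks at direct children, so it is deliberately simpler than the recursive getText() method described above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementValueDemo {

  // Hypothetical listing that mixes plain text with a CDATA section
  static final String SAMPLE =
      "<programlisting>int x = 1; "
    + "<![CDATA[if (x < 2) x++;]]></programlisting>";

  static Element parseRoot(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // The node value of an element itself; DOM defines this as null
  public static String elementValue(String xml) {
    return parseRoot(xml).getNodeValue();
  }

  // Concatenates the values of the element's direct children only
  public static String joinChildValues(String xml) {
    StringBuilder text = new StringBuilder();
    NodeList children = parseRoot(xml).getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String value = children.item(i).getNodeValue();
      if (value != null) text.append(value);
    }
    return text.toString();
  }

  public static void main(String[] args) {
    System.out.println("element value: " + elementValue(SAMPLE));
    System.out.println("child values:  " + joinChildValues(SAMPLE));
  }
}
```

Because the text may be split across several Text and CDATASection children, asking only for the first child would lose part of the listing.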

Once I have the actual example code from the programlisting element, it can be written into a file. The file location is relative to the current working directory and the chapter number. The file name is read from the id attribute. UTF-8 works well as the default encoding.
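On more recent Java versions, the file-writing step could be sketched with try-with-resources so that the stream is closed even if the write fails. This is an alternative under stated assumptions, not the code the example uses: the class Utf8Save, its methods, and the temporary directory are all hypothetical.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf8Save {

  // Writes code into dir/fileName as UTF-8, creating dir if needed
  public static File save(File dir, String fileName, String code) {
    dir.mkdirs();
    File file = new File(dir, fileName);
    try (Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(file), StandardCharsets.UTF_8))) {
      out.write(code);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return file;
  }

  // Saves into a temporary directory and reads the bytes back,
  // purely to demonstrate that the encoding round-trips
  public static String roundTrip(String code) {
    try {
      File dir = Files.createTempDirectory("examples").toFile();
      File file = save(dir, "Demo.java", code);
      String read = new String(
          Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
      file.delete();
      dir.delete();
      return read;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.print(roundTrip("// caf\u00e9\nint x = 1;\n"));
  }
}
```

Naming the encoding explicitly avoids depending on the platform default, which varies from one machine to the next.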

Attributes

Although DOM has an Attr interface, the Element interface is the primary means of reading and writing attributes. Because each element can have no more than one attribute with the same name, attributes can be stored and retrieved just by their names. There's no need to manage complex list structures, as there is with other kinds of nodes.

Here are a few tips that help explain how the attribute methods work in DOM:

Most attributes are not in any namespace. In particular, unprefixed attributes are never in any namespace. For these attributes, simply use the name and value.
When setting attributes that are in a namespace, specify the prefixed name and URI. Specify the local name and the namespace URI when getting them.
Getting the value of a nonexistent attribute returns the empty string.
Setting an attribute that already exists changes the value of the existing attribute.
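The tips above can be exercised in a few lines. This is a minimal sketch: the class AttributeTipsDemo is hypothetical, and the XLink namespace and the attribute names are chosen only for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttributeTipsDemo {

  // Namespace chosen only for illustration
  static final String XLINK = "http://www.w3.org/1999/xlink";

  public static String run() {
    Document doc;
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      doc = factory.newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    Element fib = doc.createElement("fibonacci");
    StringBuilder log = new StringBuilder();

    // A nonexistent attribute yields the empty string, so use
    // hasAttribute() to tell "missing" from "present but empty"
    log.append('[').append(fib.getAttribute("index")).append(']');
    log.append('|').append(fib.hasAttribute("index"));

    // Setting an existing attribute replaces its value; it does
    // not add a second attribute with the same name
    fib.setAttribute("index", "1");
    fib.setAttribute("index", "2");
    log.append('|').append(fib.getAttribute("index"));
    log.append('|').append(fib.getAttributes().getLength());

    // Set with the prefixed name and URI; get with the local name
    // and URI
    fib.setAttributeNS(XLINK, "xlink:type", "simple");
    log.append('|').append(fib.getAttributeNS(XLINK, "type"));

    return log.toString();
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```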

With these few principles in mind, it's not complicated to write programs that read and write attributes. I'll demonstrate by revising the Fibonacci program from Example 10.6. That example just used elements. Now I'll add an index attribute to each fibonacci element, as shown in Example 11.3.

Example 11.3 A Document That Uses Attributes

    <?xml version="1.0"?>
    <Fibonacci_Numbers>
      <fibonacci index="1">1</fibonacci>
      <fibonacci index="2">1</fibonacci>
      <fibonacci index="3">2</fibonacci>
      <fibonacci index="4">3</fibonacci>
      <fibonacci index="5">5</fibonacci>
      <fibonacci index="6">8</fibonacci>
      <fibonacci index="7">13</fibonacci>
      <fibonacci index="8">21</fibonacci>
      <fibonacci index="9">34</fibonacci>
      <fibonacci index="10">55</fibonacci>
    </Fibonacci_Numbers>

This is quite simple to implement. You just need to calculate a string name and value for the attribute and call setAttribute() in the right place. Example 11.4 demonstrates.

Example 11.4 A DOM Program That Adds Attributes

    import org.w3c.dom.*;
    import javax.xml.parsers.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import java.math.BigInteger;

    public class FibonacciAttributeDOM {

      public static void main(String[] args) {

        try {

          // Find the implementation
          DocumentBuilderFactory factory
           = DocumentBuilderFactory.newInstance();
          factory.setNamespaceAware(true);
          DocumentBuilder builder = factory.newDocumentBuilder();
          DOMImplementation impl = builder.getDOMImplementation();

          // Create the document
          Document doc = impl.createDocument(null,
           "Fibonacci_Numbers", null);

          // Fill the document
          BigInteger low  = BigInteger.ONE;
          BigInteger high = BigInteger.ONE;

          Element root = doc.getDocumentElement();

          for (int i = 0; i < 10; i++) {
            Element number = doc.createElement("fibonacci");
            // The index attribute counts from 1, as in Example 11.3
            String value = Integer.toString(i + 1);
            number.setAttribute("index", value);
            Text text = doc.createTextNode(low.toString());
            number.appendChild(text);
            root.appendChild(number);

            BigInteger temp = high;
            high = high.add(low);
            low = temp;
          }

          // Serialize the document onto System.out
          TransformerFactory xformFactory
           = TransformerFactory.newInstance();
          Transformer idTransform = xformFactory.newTransformer();
          Source input = new DOMSource(doc);
          Result output = new StreamResult(System.out);
          idTransform.transform(input, output);

        }
        catch (FactoryConfigurationError e) {
          System.out.println("Could not locate a JAXP factory class");
        }
        catch (ParserConfigurationException e) {
          System.out.println(
            "Could not locate a JAXP DocumentBuilder class"
          );
        }
        catch (DOMException e) {
          System.err.println(e);
        }
        catch (TransformerConfigurationException e) {
          System.err.println(e);
        }
        catch (TransformerException e) {
          System.err.println(e);
        }

      }

    }