7.4. DOM

The DOM API, unlike the SAX API, allows programmers to construct an object model representing a document and then traverse and modify that representation. The DOM API is not Java-specific; it was developed by the W3C XML working group as a cross-platform API for manipulating XML files (see http://www.w3c.org/XML). As a result, it sometimes doesn't take the most direct Java-based path to a particular result. The JAXP 1.1 API incorporated DOM Level 2. In JAXP 1.3, this was updated to support DOM Level 3.

DOM is useful when programs need random access to a complex XML document or to a document whose format is not known ahead of time. This flexibility does come at a cost, however, as the parser must build a complete in-memory object representation of the document. For larger documents, the resource requirements mount quickly. Consequently, many applications use a combination of SAX and DOM, using SAX to parse longer documents (such as importing large amounts of transactional data from an enterprise reporting system) and using DOM to deal with smaller, more complex documents that may require alteration (such as processing configuration files or transforming existing XML documents).

7.4.1. Getting a DOM Parser

The DOM equivalent of a SAXParser is the javax.xml.parsers.DocumentBuilder. Many DocumentBuilder implementations actually use SAX to parse the underlying document, so the DocumentBuilder implementation itself can be thought of as a layer that sits on top of SAX to provide a different view of the structure of an XML document. We use the JAXP API to get a DocumentBuilder instance in the first place, via the DocumentBuilderFactory class:

 DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
 // Validation
 dbf.setValidating(false);
 // Keep whitespace-only text in element content (the default)
 dbf.setIgnoringElementContentWhitespace(false);
 // Expand XML entities according to the DTD
 dbf.setExpandEntityReferences(true);
 // Treat CDATA sections the same as text
 dbf.setCoalescing(true);

 DocumentBuilder db = null;
 try {
   db = dbf.newDocumentBuilder();
 } catch (ParserConfigurationException pce) {
   pce.printStackTrace();
 }

The set( ) methods, as with the SAXParserFactory, provide a simple method for configuring parser options:


setCoalescing( )

Joins XML CDATA nodes with adjoining text nodes. The default is false.


setExpandEntityReferences( )

Expands XML entity reference nodes. The default is true.


setIgnoringComments( )

Ignores XML comments. The default is false.


setIgnoringElementContentWhitespace( )

Ignores whitespace in areas defined as element-only by the DTD. The default is false.


setNamespaceAware( )

Requests a namespace-aware parser. The default is false.


setValidating( )

Requests a validating parser. The default is false.


setAttribute( )

Sets various standard and implementation-specific parser attributes.

Once the DocumentBuilder is instantiated, call the parse(String URI) method to return an org.w3c.dom.Document object.
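As a rough illustration of how the factory options change the resulting tree, the following sketch parses the same in-memory string twice and counts the nodes under the document element. The class and method names (FactoryOptionsDemo, parse, rootChildCount) and the sample XML are made up for this example; because parse(String) expects a URI, the literal XML is wrapped in an InputSource instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FactoryOptionsDemo {

  // Parses an in-memory XML string with the given factory options.
  // (Hypothetical helper; not part of JAXP.)
  static Document parse(String xml, boolean coalescing, boolean ignoreComments) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setCoalescing(coalescing);
      dbf.setIgnoringComments(ignoreComments);
      // parse(String) expects a URI, so wrap literal XML in an InputSource.
      return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Counts the nodes directly beneath the document element.
  static int rootChildCount(Document doc) {
    return doc.getDocumentElement().getChildNodes().getLength();
  }

  public static void main(String[] args) {
    String xml = "<a>x<![CDATA[y]]>z<!-- note --></a>";
    // With the defaults, the CDATA section and the comment are separate nodes.
    System.out.println("defaults:  " + rootChildCount(parse(xml, false, false)));
    // With coalescing and comment-skipping on, only merged text should remain.
    System.out.println("condensed: " + rootChildCount(parse(xml, true, true)));
  }
}
```

With the parser bundled in the JDK, the second count should be smaller than the first, since the CDATA section is folded into the surrounding text and the comment is dropped.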

7.4.2. Navigating the DOM Tree

The Document object provides the starting point for working with a DOM tree. Once the parser has produced a Document, your program can traverse the document structure and make changes. In addition, Document implements the Node interface, which is the core of DOM's tree structure, and provides methods for traversing the tree and retrieving information about the current node.

Each element, attribute, entity, and text string (indeed, every distinct component within an XML document) is represented in DOM as a node. To determine what kind of node you are working with, you can call the getNodeType( ) method. This returns one of the constants specified by the Node interface. All node objects have methods for dealing with child elements, although not all nodes may have children. The DOM API also provides a set of interfaces that map to each node type.

The most important DOM node types and their corresponding interfaces are listed in Table 7-1. If you attempt to add child elements to a node that doesn't support children, a DOMException is thrown.

Table 7-1. Important DOM node types

Interface              Name property contains     Value  Children  Node constant
---------------------  -------------------------  -----  --------  ---------------------------
Attr                   Name of attribute          Yes    No        ATTRIBUTE_NODE
CDATASection           #cdata-section             Yes    No        CDATA_SECTION_NODE
Comment                #comment                   Yes    No        COMMENT_NODE
Document               #document                  No     Yes       DOCUMENT_NODE
DocumentFragment       #document-fragment         No     Yes       DOCUMENT_FRAGMENT_NODE
DocumentType           Document type name         No     No        DOCUMENT_TYPE_NODE
Element                Tag name                   No     Yes       ELEMENT_NODE
Entity                 Entity name                No     No        ENTITY_NODE
EntityReference        Name of referenced entity  No     No        ENTITY_REFERENCE_NODE
ProcessingInstruction  PI target                  Yes    No        PROCESSING_INSTRUCTION_NODE
Text                   #text                      Yes    No        TEXT_NODE


For most applications, element nodes (identified by Node.ELEMENT_NODE) and text nodes (identified by Node.TEXT_NODE) are the most important. An element node is created when the parser encounters an XML markup tag. A text node is created when the parser encounters text that is not included within a tag. For example, if the input XML (we're using XHTML in this example) looks like this:

 <p>  Here is some <b>boldface</b> text. </p>

The parser creates a top-level node that is an element node with a local name of p. The top-level node contains three child nodes: a text node containing "Here is some," an element node named b, and another text node containing "text." The b element node contains a single child text node containing the word "boldface."

The getNodeValue( ) method returns the contents of a text node or the value of other node types. It returns null for element nodes.
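To make the node layout of the <p> example concrete, here is a small sketch that lists each child's name and value. The class name ChildSummaryDemo and its helper methods are invented for illustration; the XML is the example above with the surrounding whitespace trimmed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ChildSummaryDemo {

  // Parses a literal XML string and returns its document element.
  static Element parseRoot(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Returns one "name=value" entry per direct child of the given node.
  static List<String> childSummary(Node parent) {
    List<String> out = new ArrayList<>();
    for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
      out.add(c.getNodeName() + "=" + c.getNodeValue());
    }
    return out;
  }

  public static void main(String[] args) {
    Element p = parseRoot("<p>Here is some <b>boldface</b> text.</p>");
    // Prints: [#text=Here is some , b=null, #text= text.]
    System.out.println(childSummary(p));
  }
}
```

The element child reports a null value; its text lives one level down, in the text node beneath <b>.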

To iterate through a node's children, use the getFirstChild( ) method, which will return a Node reference. To retrieve subsequent child nodes, call the getNextSibling( ) method of the node that was returned by getFirstChild( ). To print the names of all the children of a particular Node (assume that the node variable is a valid Node):

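 // getNodeName() works for every node type; getLocalName() returns null
 // for text nodes and for trees built by a parser that is not namespace-aware
 for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
     System.out.println(c.getNodeName());
 }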

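Note that there is no getNextChild( ) method; apart from getNextSibling( ), the only other way to reach a node's children is the NodeList returned by getChildNodes( ). As a result, if you use the removeChild( ) method to remove one of a node's children while looping with getNextSibling( ), calls to the removed node's getNextSibling( ) method immediately return null. Either fetch the next sibling before removing a node, or defer the removal until the loop has moved on.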
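One way to remove children safely during iteration is to capture the next sibling before calling removeChild( ). The sketch below strips comment nodes from a parent; the class RemoveCommentsDemo and its sample data are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class RemoveCommentsDemo {

  // Removes every comment directly under parent and returns how many were removed.
  static int removeComments(Node parent) {
    int removed = 0;
    Node c = parent.getFirstChild();
    while (c != null) {
      // Capture the sibling first: once c is removed, c.getNextSibling() is null.
      Node next = c.getNextSibling();
      if (c.getNodeType() == Node.COMMENT_NODE) {
        parent.removeChild(c);
        removed++;
      }
      c = next;
    }
    return removed;
  }

  // Builds <root><!--first-->kept<!--second--></root> for the demonstration.
  static Element sample() {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element root = doc.createElement("root");
      root.appendChild(doc.createComment("first"));
      root.appendChild(doc.createTextNode("kept"));
      root.appendChild(doc.createComment("second"));
      return root;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element root = sample();
    System.out.println("removed: " + removeComments(root));                 // removed: 2
    System.out.println("remaining: " + root.getChildNodes().getLength());   // remaining: 1
  }
}
```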

7.4.2.1. Element attributes

Element attributes are accessed via the getAttributes( ) method, which returns a NamedNodeMap object. The NamedNodeMap contains a set of Node objects of type ATTRIBUTE_NODE. The getNodeValue( ) method can read the value of a particular attribute.

 NamedNodeMap atts = elementNode.getAttributes();
 if (atts != null) {
   // getNamedItem() returns null if the attribute is not present
   Node sizeNode = atts.getNamedItem("size");
   if (sizeNode != null) {
     String size = sizeNode.getNodeValue();
   }
 }

Alternatively, you can cast a node to its true type (in this case, an org.w3c.dom.Element) and retrieve attributes or other data more directly:

 if (myNode.getNodeType() == Node.ELEMENT_NODE) {
   Element myElement = (org.w3c.dom.Element) myNode;
   String attributeValue = myElement.getAttribute("attr");
   // attributeValue will be an empty string if "attr" does not exist
 }

This is often easier than retrieving attribute nodes from a NamedNodeMap.
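The two approaches can be compared side by side. The sketch below, using a made-up AttributeDemo class and a sample <font> element, copies every attribute out of the NamedNodeMap and then uses hasAttribute( ) to tell a missing attribute apart from one whose value is the empty string.

```java
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class AttributeDemo {

  // Copies all attributes into a sorted map; NamedNodeMap itself has no defined order.
  static Map<String, String> attributesOf(Element e) {
    Map<String, String> result = new TreeMap<>();
    NamedNodeMap atts = e.getAttributes();
    for (int i = 0; i < atts.getLength(); i++) {
      Node att = atts.item(i);
      result.put(att.getNodeName(), att.getNodeValue());
    }
    return result;
  }

  // Builds <font size="+2" color="red"/> for the demonstration.
  static Element sampleFont() {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element font = doc.createElement("font");
      font.setAttribute("size", "+2");
      font.setAttribute("color", "red");
      return font;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element font = sampleFont();
    System.out.println(attributesOf(font));          // {color=red, size=+2}
    // getAttribute() cannot distinguish "missing" from "empty"; hasAttribute() can.
    System.out.println(font.getAttribute("face").isEmpty()); // true
    System.out.println(font.hasAttribute("face"));           // false
  }
}
```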

7.4.3. Manipulating DOM Trees

DOM is particularly useful when you need to manipulate the structure of an XML file. Example 7-2 is an HTML document "condenser." It loads an HTML file (which must be well-formed XML, although not necessarily XHTML), iterates through the tree, and preserves only the important content. In this case, it's text within <em>, <b>, <th>, <title>, <li>, and <h1> through <h6>, plus text within <font> tags whose size attribute requests a larger-than-normal size. All text nodes that aren't contained within one of these tags are removed. A more sophisticated algorithm is no doubt possible, but this one is good enough to demonstrate the DOM principle.

Example 7-2. DocumentCondenser
 import javax.xml.parsers.*;
 import javax.xml.transform.*;
 import javax.xml.transform.dom.*;
 import javax.xml.transform.stream.*;
 import org.w3c.dom.*;

 public class DocumentCondenser {

   public static void main(String[] args) throws Exception {

     DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

     // For HTML, we don't want to validate without a DTD
     dbf.setValidating(false);
     // Keep whitespace-only text nodes (the default)
     dbf.setIgnoringElementContentWhitespace(false);
     dbf.setExpandEntityReferences(true);
     dbf.setCoalescing(true);

     // Ensure that getLocalName() returns the HTML element name
     dbf.setNamespaceAware(true);

     DocumentBuilder db = null;
     try {
       db = dbf.newDocumentBuilder();
     }
     catch (ParserConfigurationException pce) {
       pce.printStackTrace();
       return;
     }

     Document html = null;
     try {
       html = db.parse("enterprisexml.html");
       process(html);

       // Use the XSLT Transformer to see the output
       TransformerFactory tf = TransformerFactory.newInstance();
       Transformer output = tf.newTransformer();
       output.transform(new DOMSource(html), new StreamResult(System.out));
     }
     catch (Exception ex) {
       ex.printStackTrace();
       return;
     }
   }

   /* We want to keep text if the parent is <em>, <title>, <b>, <li>, <th>
      or <h1>..<h6>. We also want to keep text if it is in a <font> tag with
      a size attribute set to a larger than normal size */
   private static boolean keepText(Node parentNode) {
     if (parentNode == null) return true; // top level

     String parentName = parentNode.getLocalName();
     // getLocalName() is null for nodes that are not elements or attributes
     if (parentName == null) return false;

     if ((parentName.equalsIgnoreCase("em")) ||
         (parentName.equalsIgnoreCase("title")) ||
         (parentName.equalsIgnoreCase("b")) ||
         (parentName.equalsIgnoreCase("li")) ||
         (parentName.equalsIgnoreCase("th")) ||
         ((parentName.toLowerCase().startsWith("h")) &&
          (parentName.length() == 2))) {
       return true;
     }

     if ((parentNode.getNodeType() == Node.ELEMENT_NODE) &&
         (parentName.equalsIgnoreCase("font"))) {
       NamedNodeMap atts = parentNode.getAttributes();
       if (atts != null) {
         Node sizeNode = atts.getNamedItem("size"); // get an attribute Node
         if (sizeNode != null) {
           if (sizeNode.getNodeValue().startsWith("+")) {
             return true;
           }
         }
       }
     }
     return false;
   }

   private static void process(Node node) {

     Node c = null;
     Node delNode = null;

     for (c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
       // Remove the node flagged on the previous pass; deleting it any
       // earlier would have broken the getNextSibling() chain
       if (delNode != null) {
         delNode.getParentNode().removeChild(delNode);
       }
       delNode = null;
       if ((c.getNodeType() == Node.TEXT_NODE) &&
           (!keepText(c.getParentNode()))) {
         delNode = c;
       }
       else if (c.getNodeType() != Node.TEXT_NODE) {
         process(c);
       }
     } // End For

     if (delNode != null) // Delete, if the last child was text
       delNode.getParentNode().removeChild(delNode);
   }
 }

After the DOM tree has been processed, use the JAXP XSLT API to output new HTML. We will discuss how to use XSL with JAXP in the next section.
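As a small preview, the same identity-transform technique can write a tree to a String rather than to System.out. This is only a sketch: the SerializeDemo class and its sample document are invented here, and the output properties shown (OutputKeys.OMIT_XML_DECLARATION and OutputKeys.INDENT) are standard JAXP keys whose exact formatting effect can vary between transformer implementations.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class SerializeDemo {

  // Serializes a node with an identity Transformer and returns the markup.
  static String toXml(Node node, boolean indent) {
    try {
      Transformer t = TransformerFactory.newInstance().newTransformer();
      t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      t.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
      StringWriter out = new StringWriter();
      t.transform(new DOMSource(node), new StreamResult(out));
      return out.toString();
    } catch (TransformerException e) {
      throw new IllegalStateException("Serialization failed", e);
    }
  }

  // Builds the document <a><b>x</b></a> for the demonstration.
  static Document sample() {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element a = doc.createElement("a");
      Element b = doc.createElement("b");
      b.appendChild(doc.createTextNode("x"));
      a.appendChild(b);
      doc.appendChild(a);
      return doc;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toXml(sample(), false));
  }
}
```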

If you want to replace the text with a condensed version, call the setNodeValue( ) method of Node when processing a text node.
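For instance, a condenser might collapse runs of whitespace instead of deleting text outright. The sketch below is one possible take on that idea, not the book's implementation; the class name WhitespaceCondenser and the choice of a regular expression are assumptions made for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class WhitespaceCondenser {

  // Recursively replaces each run of whitespace in every text node with one space.
  static void condense(Node node) {
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (c.getNodeType() == Node.TEXT_NODE) {
        // setNodeValue() rewrites the text in place; the node stays in the tree,
        // so the getNextSibling() chain is unaffected.
        c.setNodeValue(c.getNodeValue().replaceAll("\\s+", " "));
      } else {
        condense(c);
      }
    }
  }

  // Creates an empty Document to build sample nodes from.
  static Document newDocument() {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = newDocument();
    Element p = doc.createElement("p");
    p.appendChild(doc.createTextNode("Here   is\n\tsome"));
    Element b = doc.createElement("b");
    b.appendChild(doc.createTextNode("  boldface  "));
    p.appendChild(b);
    condense(p);
    System.out.println("[" + p.getTextContent() + "]"); // [Here is some boldface ]
  }
}
```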

7.4.4. Extending DOM Trees

Manipulating DOM trees falls, broadly, into three categories. We can add, remove, and modify nodes on existing trees; we can create new trees; and finally, we can merge trees together.

Back in the last example, we saw how to delete nodes from a DOM tree with the removeChild( ) method. If we want to add new nodes, we have two options. While there is no direct way to instantiate a new Node object, we can copy an existing Node using its cloneNode( ) method. The cloneNode( ) method takes a single Boolean parameter, which specifies whether the node's children will be cloned as well:

 Node newNodeWithChildren = oldElementNode.cloneNode(true);
 Node childlessNode = oldElementNode.cloneNode(false);

Regardless of whether children are cloned, clones of an element node will include all of the attributes of the original node. A clone has no parent until it is inserted into a tree. The DOM specification leaves certain cloning behaviors, specifically for Document, DocumentType, Entity, and Notation nodes, up to the implementation.
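The difference between the two kinds of clone is easy to check. In this sketch (CloneDemo, sampleOrder, and describeClone are hypothetical names), an <order> element with one attribute and one child is cloned both ways.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class CloneDemo {

  // Builds <order orderno="123433"><item/></order> for the demonstration.
  static Element sampleOrder() {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element order = doc.createElement("order");
      order.setAttribute("orderno", "123433");
      order.appendChild(doc.createElement("item"));
      doc.appendChild(order);
      return order;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Clones the element and summarizes what the clone carries with it.
  static String describeClone(Element original, boolean deep) {
    Element clone = (Element) original.cloneNode(deep);
    return "orderno=" + clone.getAttribute("orderno")
        + ", children=" + clone.getChildNodes().getLength()
        + ", hasParent=" + (clone.getParentNode() != null);
  }

  public static void main(String[] args) {
    Element order = sampleOrder();
    System.out.println(describeClone(order, true));  // orderno=123433, children=1, hasParent=false
    System.out.println(describeClone(order, false)); // orderno=123433, children=0, hasParent=false
  }
}
```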

New nodes can also be created via the createXXX( ) methods of the Document object. The createElement( ) method accepts a String containing the new element name and returns a new element Node. The createElementNS( ) method does the same thing, but accepts two parameters, a namespace URI and a qualified element name. The createAttribute( ) method also has a version that is namespace-aware, createAttributeNS( ).
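The choice between the two methods affects what getLocalName( ) later reports: nodes created with the non-namespace method return null. A brief sketch follows; the class name NamespaceDemo, the prefix o, and the URI urn:example:orders are placeholders chosen for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NamespaceDemo {

  // Creates an empty Document to build sample nodes from.
  static Document newDocument() {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Summarizes the naming properties of an element.
  static String describe(Element e) {
    return "nodeName=" + e.getNodeName()
        + ", localName=" + e.getLocalName()
        + ", namespace=" + e.getNamespaceURI();
  }

  public static void main(String[] args) {
    Document doc = newDocument();
    // prints: nodeName=order, localName=null, namespace=null
    System.out.println(describe(doc.createElement("order")));
    // prints: nodeName=o:order, localName=order, namespace=urn:example:orders
    System.out.println(describe(doc.createElementNS("urn:example:orders", "o:order")));
  }
}
```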

Once a new Node is created, it can be inserted or appended into the tree using the appendChild( ), insertBefore( ), and replaceChild( ) methods. Attribute nodes can be inserted into the NamedNodeMap returned by the getAttributes( ) method of Node. You can also add attributes to an element by casting the Node to an Element and calling setAttribute( ).
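A compact sketch of the three insertion methods, plus setAttribute( ), is shown below. The element names and the TreeEditDemo class are invented for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class TreeEditDemo {

  // Creates an empty Document to build sample nodes from.
  static Document newDocument() {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Builds a <list> element, exercising each insertion method once.
  static Element buildList(Document doc) {
    Element list = doc.createElement("list");
    Element a = doc.createElement("a");
    Element c = doc.createElement("c");
    list.appendChild(a);                             // children: a
    list.appendChild(c);                             // children: a, c
    list.insertBefore(doc.createElement("b"), c);    // children: a, b, c
    list.replaceChild(doc.createElement("z"), a);    // children: z, b, c
    list.setAttribute("count", "3");
    return list;
  }

  // Joins the names of the direct children with commas.
  static String childNames(Node parent) {
    StringBuilder sb = new StringBuilder();
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      if (sb.length() > 0) sb.append(',');
      sb.append(n.getNodeName());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Element list = buildList(newDocument());
    System.out.println(childNames(list));            // z,b,c
    System.out.println(list.getAttribute("count"));  // 3
  }
}
```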

Creating new trees involves creating a new Document object. The easiest way to do this is via the DOMImplementation interface. An implementation of DOMImplementation can be retrieved from a DocumentBuilder object via the getDOMImplementation( ) method. Example 7-3 builds a version of the XML from a blank slate.

Example 7-3. TreeBuilder
 import javax.xml.parsers.*;
 import javax.xml.transform.*;
 import javax.xml.transform.dom.*;
 import javax.xml.transform.stream.*;
 import org.w3c.dom.*;

 public class TreeBuilder {

   public static void main(String[] args) {
     DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
     dbf.setValidating(false);

     DocumentBuilder db = null;
     try {
       db = dbf.newDocumentBuilder();
     }
     catch (ParserConfigurationException pce) {
       pce.printStackTrace();
       return;
     }

     Document doc =
       db.getDOMImplementation().createDocument(null, "orders", null);
     // create the initial document element
     Element orderNode = doc.createElement("order");
     orderNode.setAttribute("orderno", "123433");

     Node item = doc.createElement("item");
     Node subitem = doc.createElement("number");
     subitem.appendChild(doc.createTextNode("3AGM-5"));
     item.appendChild(subitem);

     subitem = doc.createElement("handling");
     subitem.appendChild(doc.createTextNode("With Care"));
     item.appendChild(subitem);

     orderNode.appendChild(item);
     doc.getDocumentElement().appendChild(orderNode);

     // View the output
     try {
       TransformerFactory tf = TransformerFactory.newInstance();
       Transformer output = tf.newTransformer();
       output.transform(new DOMSource(doc),
                        new StreamResult(System.out));
     }
     catch (TransformerException e) {
       e.printStackTrace();
     }
   }
 }

The second parameter to createDocument( ) specifies the name of the base document element (in this case, the <orders> tag). Subsequent tags can be appended to the base tag. If we were to look at the results of this program as regular XML, it would look like this (we've added some whitespace formatting to make it more readable):

 <?xml version="1.0" encoding="UTF-8"?>
 <orders>
      <order orderno="123433">
           <item>
                <number>3AGM-5</number>
                <handling>With Care</handling>
           </item>
      </order>
 </orders>

You've probably noticed that each Node implementation we've created has been based on a particular instance of the Document object. Since each node is related to its parent document, we can't go around inserting one document's nodes into another document without triggering an exception. The solution is to use the importNode( ) method of Document, which creates a copy of a node from another document. The original node from the source document is left untouched. Here's an example that takes the <orders> tag from the first document and puts it into a new document under an <ordersummary> tag:

 Document doc2 = db.getDOMImplementation().createDocument(
                null, "ordersummary", null);

 DocumentFragment df = doc.createDocumentFragment();
 df.appendChild(doc.getDocumentElement().cloneNode(true));
 doc2.getDocumentElement().appendChild(doc2.importNode(df, true));

We use a DocumentFragment object to hold the data we're moving. Document fragment nodes provide a lightweight structure for dealing with subsets of a document. Fragments must be well-formed XML, but don't need to be DTD-conformant and can have multiple top-level children. When appending a document fragment to a document tree, the DocumentFragment node itself is ignored, and its children are appended directly to the parent node. In this example, we cloned the source element when creating the document fragment, since assigning a node to a fragment releases the node's relationship with its previous parent. The XML in the second document object looks like this:

 <?xml version="1.0" encoding="UTF-8"?>
 <ordersummary>
      <orders>
           <order orderno="123433">
                <item>
                     <number>3AGM-5</number>
                     <handling>With Care</handling>
                </item>
           </order>
      </orders>
 </ordersummary>
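The need for importNode( ) can be demonstrated directly. The sketch below first tries to append a node owned by another document and then imports it instead. It assumes the parser bundled with the JDK, which reports the failed append as a DOMException with code WRONG_DOCUMENT_ERR; the ImportDemo class and its method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class ImportDemo {

  // Creates a new Document whose document element has the given name.
  static Document newDocument(String rootName) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .getDOMImplementation().createDocument(null, rootName, null);
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Returns true if appending a node owned by another document is rejected
  // with WRONG_DOCUMENT_ERR (the behavior assumed here for the JDK's parser).
  static boolean rejectsForeignNode(Document target, Node foreign) {
    try {
      target.getDocumentElement().appendChild(foreign);
      return false;
    } catch (DOMException e) {
      return e.code == DOMException.WRONG_DOCUMENT_ERR;
    }
  }

  // Copies the foreign node (and its subtree) into target; the source is untouched.
  static void copyInto(Document target, Node foreign) {
    target.getDocumentElement().appendChild(target.importNode(foreign, true));
  }

  public static void main(String[] args) {
    Document orders = newDocument("orders");
    Document summary = newDocument("ordersummary");

    System.out.println("direct append rejected: "
        + rejectsForeignNode(summary, orders.getDocumentElement()));

    copyInto(summary, orders.getDocumentElement());
    System.out.println("imported child: "
        + summary.getDocumentElement().getFirstChild().getNodeName());
    System.out.println("source still has root: "
        + (orders.getDocumentElement() != null));
  }
}
```

Because importNode( ) already makes a copy, importing the element directly leaves the source document intact without an explicit cloneNode( ) call; a DocumentFragment is mainly useful when several sibling nodes need to travel together.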



Java Enterprise in a Nutshell
ISBN: 0596101422
Year: 2004
Pages: 269
