Tree-Based Parsing with the DOM | XML Programming Bible

The DOM API defines a minimal set of language- and platform-independent interfaces for accessing and manipulating the content and structure of information stored in XML documents. In this section we cover the DOM's major interfaces and briefly touch on the minor ones.

In tree-based parsing with the DOM, the parser checks that the document is well-formed and, depending on the type of parser, that it is valid. The parser then converts the document's information into a tree of nodes. The entire document, no matter how simple or complex, is converted into a tree that starts from one root node, which, in DOM terms, is called a document object instance (hence Document Object Model). Once the document object tree is created, you can use the interfaces in the API to access its elements and to modify, delete, and create leaves and branches.
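As a minimal sketch of this parsing step, the following Java code builds a Document from an XML string. It assumes the JAXP classes that ship with the JDK (DocumentBuilderFactory and DocumentBuilder), which the DOM itself does not define. The class name ParseSketch, the parse helper, and the abbreviated inline document are hypothetical and serve only as illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParseSketch {
    // Parses an XML string into a DOM Document. Checked exceptions are
    // wrapped so that callers in this sketch need no throws clause.
    // Assumption: the default JAXP parser, which is non-validating and
    // therefore checks only that the input is well-formed.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<BookList><Book><book_id>BU1111</book_id></Book></BookList>";
        Document doc = parse(xml);
        // The Document node is the root of the tree; the document element
        // is its single Element child.
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "BookList"
    }
}
```

A document that is not well-formed, such as one with a missing end tag, makes parse throw instead of returning a partial tree.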

We are using Titles.xml as the example XML file during this discussion. This file, shown in Listing 3-1, presents a collection of books based on the sample pubs database that comes with Microsoft SQL Server.

Listing 3-1 Titles.xml: A sample XML document.

<?xml version="1.0" encoding="UTF-8"?>
<BookList>
  <Book>
    <book_id>BU1111</book_id>
    <title>Cooking with Computers: Surreptitious Balance Sheets</title>
    <type>business</type>
    <pub_id>1389</pub_id>
    <price>11.95</price>
    <advance>5000</advance>
    <royalty>10</royalty>
    <ytd_sales>3876</ytd_sales>
    <notes>Helpful hints on how to use your electronic...</notes>
    <pubdate>1991-06-09T05:00:00</pubdate>
  </Book>
  <Book>
    <book_id>BU7832</book_id>
    <title>Straight Talk About Computers</title>
    <type>business</type>
    <pub_id>1389</pub_id>
    <price>19.99</price>
    <advance>5000</advance>
    <royalty>10</royalty>
    <ytd_sales>4095</ytd_sales>
    <notes>Annotated analysis of what computers can do for you</notes>
    <pubdate>1991-06-22T05:00:00</pubdate>
  </Book>
</BookList>

Figure 3-1 shows how Titles.xml can be represented as a tree of nodes.

Figure 3-1 A DOM hierarchy representation of Titles.xml.

Everything in the Document object tree is a node. A node might have child nodes or hold information such as its tag name (nodeName) and value (nodeValue). This hierarchical organization of information is similar to a file system, in which folders contain files or other folders and everything descends from one root folder.
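To make the tree structure concrete, here is a small sketch that walks a parsed document recursively and records each node's nodeName and nodeValue. It assumes the JDK's JAXP parser; the class name TreeWalkSketch and the helpers parse and walk are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TreeWalkSketch {
    // Parses an XML string with the default (non-validating) JAXP parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Appends one line per node: two spaces of indentation per level, the
    // nodeName, and the nodeValue when it is not null. Element and Document
    // nodes have a null nodeValue; Text nodes carry the character data.
    static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName());
        if (node.getNodeValue() != null) {
            out.append(" = ").append(node.getNodeValue());
        }
        out.append("\n");
        // Visit the children in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<Book><title>Straight Talk About Computers</title></Book>");
        StringBuilder out = new StringBuilder();
        walk(doc, 0, out);
        System.out.print(out);
    }
}
```

In the output, the Document node appears as #document and the character data inside title appears as a separate #text node, which illustrates that text is held in a child node of its own rather than in the element.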

Important Interfaces in the DOM

The DOM represents a document as a hierarchy of Node objects. A node either has child nodes of its own or is a leaf node with nothing beneath it in the document structure. Three of the most important interfaces in the DOM are Node, Element, and NodeList.

Node

An XML Document object created after a DOM parser reads an XML file often contains a tree-like representation of Node object instances, while other interfaces are provided to create a more object-oriented environment. You can manipulate all the information in the DOM by using the Node interface. Even though the DOM Recommendation specifically states that the model isn't necessarily a tree, for the purposes of the discussions and examples in this chapter we focus on the tree-like representations. Figure 3-2 shows the inheritance relationships between some of the important interfaces.

Figure 3-2 DOM interfaces and inheritance relationships.

Because the Document object is a subclass of Node, the root Node object of the tree is also a Document object. Every DOM object must have a root. Figure 3-3 illustrates a sample XML Document object tree and describes some of the Node objects that it contains.

Figure 3-3 A Document object, where everything in the DOM is a Node.

You can find out if a Node has children by using the hasChildNodes( ) method. This method, which takes no parameters, returns a Boolean true if the node has children and false if not.

The getNodeType( ) method, which is part of the Java bindings defined by DOM, is another important method of Node. It returns the type of a particular Node. The type is a constant integer used to identify different types of Nodes. For example, the Node.ELEMENT_NODE type identifies a Node to be an element. Table 3-1 contains a list of the other methods available for the Node object as well.
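The following sketch shows hasChildNodes( ) and getNodeType( ) working together. It assumes the JDK's JAXP parser; the class name NodeTypeSketch and the helpers parse and describe are hypothetical, and describe covers only three of the node type constants.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeTypeSketch {
    // Parses an XML string with the default (non-validating) JAXP parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Maps a few of the node type constants to readable labels.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                return "document";
            case Node.ELEMENT_NODE:
                return "element";
            case Node.TEXT_NODE:
                return "text";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<Book><title>Straight Talk About Computers</title></Book>");
        Node title = doc.getDocumentElement().getFirstChild();
        Node text = title.getFirstChild();

        // The title element has one child, its Text node; the Text node is a leaf.
        System.out.println(describe(title) + " hasChildNodes=" + title.hasChildNodes());
        System.out.println(describe(text) + " hasChildNodes=" + text.hasChildNodes());
    }
}
```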

Table 3-1 Other Methods of the Node Object

Method	Description
appendChild( )	Adds a new child object, which is passed to the method, to the current Node.
cloneNode( )	Returns a duplicate of the Node.
hasAttributes( )	Returns a Boolean true if the Node has any attributes. This method was added in DOM Level 2.
insertBefore( )	Takes a new child Node and a reference child Node and inserts the new child Node before the reference Node.
isSupported( )	Tests whether or not this implementation of the DOM supports a specific feature. This method was added in DOM Level 2 and takes a feature name and a version number as parameters.
normalize( )	Puts all Text nodes in the full depth of the sub-tree underneath this Node into a normal form, in which there are no adjacent Text nodes and no empty Text nodes.
removeChild( )	Removes the specified child.
replaceChild( )	Replaces the specified child with the new child passed.
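A short sketch of several of these methods follows. It builds a small tree in memory rather than parsing a file, and it assumes the JDK's JAXP classes for creating the empty Document. The class name NodeEditSketch and the helpers newDocument and buildBook are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeEditSketch {
    // Creates an empty Document to act as the factory for new nodes.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM document", e);
        }
    }

    // Builds <Book><book_id/><title>...</title></Book>, with the title text
    // deliberately split across two adjacent Text nodes.
    static Element buildBook(Document doc) {
        Element book = doc.createElement("Book");
        doc.appendChild(book);

        Element title = doc.createElement("title");
        book.appendChild(title);
        title.appendChild(doc.createTextNode("Straight Talk "));
        title.appendChild(doc.createTextNode("About Computers"));

        // insertBefore( ) places the new child ahead of the reference child.
        Element bookId = doc.createElement("book_id");
        book.insertBefore(bookId, title);
        return book;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element book = buildBook(doc);
        Node title = book.getLastChild();

        System.out.println("Text nodes before normalize: " + title.getChildNodes().getLength());
        book.normalize(); // merges the adjacent Text nodes under title
        System.out.println("Text nodes after normalize: " + title.getChildNodes().getLength());

        book.removeChild(book.getFirstChild()); // removes book_id
        System.out.println("First child is now: " + book.getFirstChild().getNodeName());
    }
}
```

New nodes are created through the owning Document (createElement and createTextNode) and only become part of the tree once they are passed to appendChild( ) or insertBefore( ).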

Element

The Element interface, which is a subclass of Node, is another important interface. It can be used to access the elements in a DOM Document object tree, which allows you to read attributes and their values, as well as change, delete, or add to them. Table 3-2 contains the list of methods of the Element object.

Table 3-2 Methods of the Element Object

Method	Description
getAttribute( )	Retrieves the value of the specified attribute by name.
getAttributeNS( )	Retrieves the specified attribute by local name and namespace. This method was added in Level 2.
getAttributeNode( )	Retrieves an Attr node by name.
getAttributeNodeNS( )	Retrieves an Attr node by local name and namespace. This method was added in Level 2.
getElementsByTagName( )	Returns a NodeList of all descendant elements of a given tag name in the order in which they are encountered.
getElementsByTagNameNS( )	Returns a NodeList of all descendant elements of a given tag by local name and namespace in the order in which they are encountered. This method was added in Level 2.
hasAttribute( )	Returns a Boolean true if the specified attribute is present. Returns Boolean false otherwise.
hasAttributeNS( )	Returns a Boolean true if the specified attribute, by local name and namespace, is present. Returns Boolean false otherwise. This method was added in Level 2.
removeAttribute( )	Removes the specified attribute.
removeAttributeNS( )	Removes the attribute specified by local name and namespace. This method was added in Level 2.
removeAttributeNode( )	Removes the specified Attr node.
setAttribute( )	Adds a new attribute. If an attribute of the same name exists, its value is changed to the specified value.
setAttributeNS( )	Adds a new attribute. If an attribute of the same local name and namespace exists, its value is changed to the specified value. This method was added in Level 2.
setAttributeNode( )	Adds a new Attr node. If an Attr node of the same name exists, it is replaced by the new one.
setAttributeNodeNS( )	Adds a new Attr node. If an Attr node of the same local name and namespace exists, it is replaced by the new one. This method was added in Level 2.
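The sketch below exercises a few of the attribute methods. Titles.xml contains no attributes, so the currency attribute on the price element is a hypothetical addition made purely for illustration, as are the class name AttributeSketch and the helpers parse and firstElement. The JDK's JAXP parser is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AttributeSketch {
    // Parses an XML string with the default (non-validating) JAXP parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the first element with the given tag name, or null if none exists.
    static Element firstElement(Document doc, String tagName) {
        return (Element) doc.getElementsByTagName(tagName).item(0);
    }

    public static void main(String[] args) {
        // The currency attribute is hypothetical; it is not part of Titles.xml.
        Document doc = parse("<Book><price currency=\"USD\">11.95</price></Book>");
        Element price = firstElement(doc, "price");

        System.out.println(price.getAttribute("currency"));   // value read from the source
        price.setAttribute("currency", "EUR");                // overwrites the existing value
        System.out.println(price.getAttribute("currency"));
        price.removeAttribute("currency");
        System.out.println(price.hasAttribute("currency"));   // false once removed
    }
}
```

Note that getAttribute( ) returns an empty string, not null, when the attribute is absent, so hasAttribute( ) is the safer way to test for presence.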

NodeList

Some methods of the Node interface allow traversal of a Node tree. The getChildNodes( ) method is useful for gathering all the nodes directly inside a Node, including text and comment nodes as well as elements. This method returns the child Nodes in a NodeList, which is an ordered collection of Node objects; if the Node has no children, the list is empty. Figure 3-4 illustrates a NodeList.

Unlike Node and Element, NodeList is a very small interface. Its item( ) method returns the Node located at the indexed position passed to the method. For instance, if you want to retrieve the first Node, you call the method using item(0). In the Java bindings, the length of the list is available through the getLength( ) method.
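As an illustration, the following sketch loops over the NodeList returned by getChildNodes( ) and keeps only the element children. It relies on getLength( ), which the Java bindings provide alongside item( ), and it assumes the JDK's JAXP parser. The class name NodeListSketch and the helpers parse and childElementNames are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeListSketch {
    // Parses an XML string with the default (non-validating) JAXP parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the tag names of the element children of parent, skipping
    // whitespace Text nodes and any other non-element children.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<Book>\n  <book_id>BU1111</book_id>\n  <price>11.95</price>\n</Book>";
        Node book = parse(xml).getDocumentElement();

        // The line breaks and indentation between tags are child Text nodes
        // too, so the list is longer than the number of elements.
        System.out.println("All children: " + book.getChildNodes().getLength());
        System.out.println("Element children: " + childElementNames(book));
    }
}
```

The filter on Node.ELEMENT_NODE matters in practice: with a non-validating parser, whitespace used to indent the source file shows up as Text nodes between the elements.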

Other DOM Interfaces

Node, Element, and NodeList are not the only interfaces specified by the DOM. Because we do not cover all of them, we've included a list in Table 3-3, along with the types of child nodes each one may contain and a brief description. This table contains only those interfaces found in the DOM Level 1 and 2 Core and does not contain the HTML bindings.

Table 3-3 DOM Interfaces

Interface	Children	Description
Attr	Text, EntityReference	Represents an attribute of an Element object.
CDATASection	Contains no children	Used to escape characters of text that would otherwise be considered markup.
Comment	Contains no children	Stores the content of an XML or HTML comment.
Document	Element (max. of one), ProcessingInstruction, Comment, DocumentType (max. of one)	Represents the entire XML or HTML document.
DocumentFragment	Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference	A minimal Document object instance.
DocumentType	Contains no children	Represents the document type stored in the doctype attribute of the Document object instance.
Element	Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference	Represents an element in an XML or HTML document.
Entity	Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference	Represents an entity, whether it's parsed or unparsed.
EntityReference	Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference	Can be inserted into a structured model when an entity reference is in the source document or when you want to insert a reference.
Notation	Contains no children	Represents a notation declared in a DTD.
ProcessingInstruction	Contains no children	Used to represent a "processing instruction."
Text	Contains no children	Contains textual content (character data).

Figure 3-4 A DOM NodeList Object.