Trees | Processing XML with Javaв„ў: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

DOM is based on an implicit data model, which is similar to but not quite the same as the data models used by other XML technologies such as XPath, the XML Infoset, and SAX. Before we delve too deeply into the nitty-gritty details of the DOM API, it's helpful to have a higher level understanding of just what DOM thinks an XML document is.

According to DOM, an XML document is a tree made up of nodes of several types. The tree has a single root node, and all nodes in this tree except for the root have a single parent node. Furthermore, each node has a list of child nodes. In some cases, this list of children may be empty, in which case the node is called a leaf node.

There can also be nodes that are not part of the tree structure. For example, each attribute node belongs to one element node but is not considered to be a child of that element. Furthermore, nodes can be removed from the tree or created but not inserted in the tree. Thus a full DOM document is composed of the following:

A tree of nodes
Various nodes that are somehow associated with other nodes in the tree but are not themselves part of the tree
A random assortment of disconnected nodes

DOM trees are not red-black trees, binary trees, B-trees, or any other sort of special-purpose trees. From a data-structures point of view, they are just plain- vanilla trees. Recursion works very well on DOM data structures, as it does on any tree. You can use all of the techniques you learned for processing trees in Data Structures 201. Breadth-first search, depth-first search, inorder traversal, preorder traversal, postorder traversal, and so on all work with DOM data structures.

In addition to its tree connections, each node has a local name, a namespace URI, and a prefix; although for several kinds of nodes, these may be null. For example, the local name , namespace URI, and prefix of a comment are always null. Each node also has a node name. For an element or attribute, the node name is the prefixed name. For other named things such as notations or entities, the node name is the name of the thing. For nodes without names , such as text nodes, the node name is the value from the following list that matches the node type:

#document
#comment
#text
#cdata-section
#document-fragment

Finally each node has a string value. For text-like things such as text nodes and comments, this tends to be the text of the node. For attributes, it's the normalized value of the attribute. For everything else, including elements and documents, the value is null.

DOM divides nodes into twelve types, seven of which can potentially be part of a DOM tree:

Document nodes
Element nodes
Text nodes
Attribute nodes
Processing instruction nodes
Comment nodes
Document type nodes
Document fragment nodes
Notation nodes
CDATA section nodes
Entity nodes
Entity reference nodes

Of these twelve, the first seven are by far the most important; and often a tree built by an XML parser will contain only the first seven.

Document Nodes

Each DOM tree has a single root document node. This node has children. Because all documents have exactly one root element, a document node always has exactly one element-node child. If the document has a document type declaration, then it also has one document-type-node child. If the document contains any comments or processing instructions before or after the root element, then these are also child nodes of the document node. The order of all children is maintained . Consider the simple XML-RPC document shown in Example 9.2.

Example 9.2 An XML-RPC Request Document

 <?xml version="1.0"?> <?xml-stylesheet type="text/css" href="xml-rpc.css"?> <!-- It's unusual to have an xml-stylesheet processing      instruction in an XML-RPC document but it is legal, unlike      SOAP where processing instructions are forbidden. --> <!DOCTYPE methodCall SYSTEM "xml-rpc.dtd"> <methodCall>   <methodName>getQuote</methodName>   <params>     <param>       <value><string>RHAT</string></value>     </param>   </params> </methodCall>

The document node representing the root of this document has four child nodes in this order:

A processing instruction node for the xml-stylesheet processing instruction
A comment node for the comment
A document type node for the document type declaration
An element node for the root methodCall element

The XML declaration, the DOCTYPE declaration, and the white space between these nodes are not included in the tree. The document type node is available as a separate property of the document node. However, it is not a child and is not included in the list of the document's children. The XML declaration (including the version, standalone, and encoding declarations) and the white space are removed by the parser. They are not part of the model.

Element Nodes

Each element node has a name, a local name, a namespace URI (which may be null if the element is not in any namespace), and a prefix (which may also be null). The string also contains children. For example, consider this value element:

 <value><string>RHAT</string></value>

When represented in DOM, it becomes a single element node with the name value . This node has a single element-node child for the string element. The string also has a single text-node child containing the text RHAT .

Or consider this para element:

 <db:para xmlns:db="http://www.example.com/"    xmlns="http://namespaces.cafeconleche.org/">   Or consider this <markup>para</markup> element: </db:para>

In DOM it's represented as an element node with the name db:para , the local name para , the prefix db , and the namespace URI http://www.example.com/ . It has three children:

A text node containing the text Or consider this
An element node with the name markup , the local name markup , the namespace URI http://namespaces.cafeconleche.org/ , and a null prefix
Another text node containing the text element:

White space is included in text nodes, even if it's ignorable. For example, consider this methodCall element:

 <methodCall>    <methodName>getQuote</methodName>   <params>     <param>       <value><string>RHAT</string></value>     </param>   </params> </methodCall>

It is represented as an element node with the name methodCall and five child nodes:

A text node containing only white space
An element node with the name methodName
A text node containing only white space
An element node with the name params
A text node containing only white space

Of course, these element nodes also have their own child nodes.

In addition to containing element and text nodes, an element node may contain comment and processing instruction nodes. Depending on how the parser behaves, an element node might also contain some CDATA section nodes, entity reference nodes, or both. However, many parsers resolve these automatically into their component text and element nodes, and do not report them separately.

Attribute Nodes

An attribute node has a name, a local name, a prefix, a namespace URI, and a string value. The value is normalized as required by the XML 1.0 specification. That is, entity and character references in the value are resolved, and all white space characters are converted to a single space. If the attribute has any type other than CDATA, then leading and trailing white space is stripped from its value, and all other runs of white space are converted to a single space. An attribute node also has children, all of which are text and entity reference nodes forming the value of the attribute. However, it's unusual to access these directly instead of by the value.

If a validating parser builds an XML document from a file, then default attributes from the DTD are included in the DOM tree. If the parser supports schemas, then default attributes can be read from the schema as well. DOM does not provide the type of the attribute as specified by the DTD or schema, or the list of values available for an enumerated type attribute. This is a major shortcoming.

Attributes are not considered to be children of the element to which they are attached. Instead they are part of a separate set of nodes. For example, consider this Quantity element:

 <Quantity amount="17" />

This element has no children, but it does have a single attribute with the name amount and the value 17 .

Attributes that declare namespaces do not receive special treatment in DOM. They are reported by DOM parsers in the same way as any other attribute. Furthermore, DOM always provides the fully qualified names and namespace URIs for all element and attribute nodes.

Leaf Nodes

Only document, element, attribute, entity, and entity reference nodes can have children. The remaining node types are much simpler.

Text Nodes

Text nodes contain character data from the document stored as a String . Any characters like 4 from outside Unicode's Basic Multilingual Plane are represented as surrogate pairs. Characters like & and < that are represented in the document by predefined entity or character references are replaced by the actual characters they represent. If these nodes are written out again to an XML document, these characters need to be re-escaped.

When a parser reads an XML document to form a DOM Document , it puts as much text as possible into each text node before being interrupted by non-predefined entities, comments, tags, CDATA section delimiters, or other markup. Thus no text node immediately follows any other text node, as there is always an intervening nontext node. However, if a DOM is created or modified in memory, then the client program may divide text between immediately adjacent text nodes. As a result, it's not always guaranteed that each text node contains the maximum possible contiguous run of text, just that this is the case immediately after a document is parsed.

Comment Nodes

A comment node has a name (which is always #comment), a string value (the text of the comment) and a parent (the node that contains it). That's all. For example, consider this comment:

 <!-- Don't forget to fix this! -->

The value of this node is Don't forget to fix this! The white space at either end is included.

Processing Instruction Nodes

A processing instruction node has a name (the target of the processing instruction), a string value (the data of the processing instruction), and a parent (the node that contains it). That's all. For example, consider this processing instruction:

 <?xml-stylesheet type="text/css" href="xml-rpc.css"?>

The name of this node is xml-stylesheet . The value is type="text/css" href="xml-rpc.css" . The white space between the target and the data is not included, but the white space between the data and the closing ?> is included. Even if the processing instruction uses a pseudo-attribute format as this one does, it is not considered to have attributes or children. Its data is just a string that happens to have some equal signs and quote marks in suggestive positions .

CDATA Section Nodes

A CDATA section node is a special text node that represents the contents of a CDATA section. Its name is #cdata-section. Its value is the text content of the section. For example, consider this CDATA section:

 <![CDATA[<?xml-stylesheet type="text/css" href="xml-rpc.css"?>]]>

Its name is #cdata-section and its value is <?xml-stylesheet type="text/css" href="xml-rpc.css"?> .

Entity Reference Nodes

When a parser encounters a general entity reference such as Æ or &copyright_notice; , it may or may not replace it with the entity's replacement text. Validating parsers always replace entity references. Nonvalidating parsers may do so at their option.

If a parser does not replace entity references, then the DOM tree will include entity reference nodes. Each entity reference node has a name, and if the parser has read the DTD, then you should be able to look up the public and system IDs for this entity reference using the map of entity nodes available on the document type node. Furthermore, the child list of the entity will contain the replacement text for this entity reference. However, if the parser has not read the DTD and resolved external entity references, then the child list may be empty.

If a parser does replace entity references, then the DOM tree may or may not include entity reference nodes. Some parsers resolve all entity reference nodes completely and leave no trace of them in the parsed tree. Other parsers instead include entity reference nodes in the DOM tree that have a list of children. The child list contains text nodes, element nodes, comment nodes, and so forth, representing the replacement text of the entity.

For example, suppose an XML document contains this element:

 <para>&AElig;lfred is a very nice XML parser.</para>

If the parser is not resolving entity references, then the para element node contains two childrenan entity reference node with the name AElig and a text node containing the text "lfred is a very nice XML parser." The AElig entity reference node will not have any children.

Now suppose the parser is resolving entity references, and the replacement text for the AElig entity reference is the single ligature character . Now the parser has a choice: It can represent the children of the para element as a single text node containing the full sentence , " lfred is a very nice XML parser." Alternately, it can represent the children of the para element as an entity reference node with the name AElig followed by a text node containing the text, "lfred is a very nice XML parser." If it chooses the second option, then the AElig entity reference node contains a single read-only text-node child containing single ligature character .

DOM never includes entity reference nodes for the five predefined entity references: & , < , > , ' , and " . These are simply replaced by their respective characters and included in a text node. Similarly, character references such as &#xA0 and   are not specially represented in DOM as any kind of node. The characters they represent are simply added to the relevant text node.

Document Type Nodes

A document type node has a name (the name the document type declaration specifies for the root element), a public ID (which may be null), a system ID (required), an internal DTD subset (which may be null), a parent (the document that contains it), and lists of the notations and general entities declared in the DTD. The value of a document type node is always null. For example, consider this document type declaration:

 <!DOCTYPE mml:math PUBLIC "-//W3C//DTD MathML 2.0//EN"   "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" [   <!ENTITY % MATHML.prefixed "INCLUDE">   <!ENTITY % MATHML.prefix "mml"> ]>

The name of the corresponding node is mml:math . The public ID is -//W3C//DTD MathML 2.0//EN . The system ID is http://www.w3.org/TR/MathML2/dtd/mathml2.dtd . The internal DTD subset is the complete text between [ and ] .

Nontree Nodes

There are four kinds of DOM nodes that are part of the document but not the document's tree: attribute nodes, entity nodes, notation nodes, and document fragment nodes. You've already seen that attribute nodes are attached to element nodes but are not children of those nodes. Entity and notation nodes are available as special properties of the type node. Document fragment nodes are used only when building DOM trees in memory, not when reading them from a parsed file.

Entity Nodes

Entity nodes (not to be confused with entity reference nodes) represent the parsed and unparsed entities declared in the document's DTD. If the parser reads the DTD, then it will attach a map of entity nodes to the document type node. Because this map is indexed by the entity names, you can use it to match entity reference nodes to entity nodes.

Each entity node has a name and a system ID. It can also have a public ID if one was used in the DTD. Furthermore, if the parser reads the entity, then the entity node has a list of children containing the replacement text of the entity. However, these children are read-only and cannot be modified, unlike children of similar type elsewhere in the document. For example, suppose the following entity declaration appeared in the document's DTD:

 <!ENTITY AElig "&#x00C6;">

If the parser read the DTD, then it would create an entity node with the name AElig. This node would have a null public and system ID (because the entity would be purely internal) and one child, a read-only text node containing the single character .

For another example, suppose this entity declaration appeared in the document's DTD:

 <!ENTITY Copyright SYSTEM "copyright.xml">

If the parser read the DTD, then it would create an entity node with the name Copyright, the system ID copyright.xml, and a null public ID. The children of this node would depend on what was found at the relative URL copyright.xml. Suppose that document contained the following content:

 <copyright>    <year>2002</year>   <person>Elliotte Rusty Harold</person> </copyright>

Then the child list of the Copyright entity node would contain a single read-only element child with the name copyright. The element would contain its own read-only element and text node children.

Notation Nodes

Notation nodes represent the notations declared in the document's DTD. If the parser reads the DTD, then it will attach a map of notation nodes to the document type node. This map is indexed by the notation name. You can use it to look up the notation for each entity node that corresponds to an unparsed entity, or the notations associated with particular processing instruction targets.

In addition to its name, each notation node has a public ID or a system ID, whichever was used to declare it in the DTD. Notation nodes do not have any children. For example, suppose this notation declaration for PNG images was included in the DTD:

 <!NOTATION PNG SYSTEM "http://www.w3.org/TR/REC-png">

This would produce a notation node with the name PNG and the system ID http://www.w3.org/TR/REC-png . The public ID would be null.

For another example, suppose this notation declaration for TeX documents was included in the DTD:

 <!NOTATION TEX PUBLIC      "+//ISBN 0-201-13448-9::Knuth//NOTATION The TeXbook//EN">

This would produce a notation node with the name TEX and the public ID +//ISBN 0-201-13448-9::Knuth//NOTATION The TeXbook//EN. The system ID would be null. (XML doesn't allow notations to have both public and system IDs.)

Document Fragment Nodes

The document fragment node is an alternative root node for a DOM tree. It can contain anything an element can contain (for example, element nodes, text nodes, processing instruction nodes, comment nodes, and so on). Although a parser will never produce such a node, your own programs may create one when extracting part of an XML document in order to move it elsewhere.

In DOM, the nonroot nodes never exist alone. That is, there's never a text node or an element node or a comment node that's not part of a document or a document fragment. They may be temporarily disconnected from the main tree, but they always know which document or fragment they belong to. The document fragment node enables you to work with pieces of a document that are composed of more than one node.

What Is and Isn't in the Tree

Table 9.1 summarizes the DOM data model with the name, value, parent, and possible children for each kind of node. One thing to keep in mind is the parts of the XML document that are not exposed in this data model:

The XML declaration, including the version, standalone declaration, and encoding declaration. These will be added as properties of the document node in DOM3, but current parsers do not provide them.
Most information from the DTD and/or schema, including element and attribute types and content models, is not provided.
Any white space outside the root element.
Whether or not each character was provided by a character reference. Parsers may provide information about entity references but are not required to do so.

Table 9.1. Node Properties

Node Type	Name	Value	Parent	Children
Document	#document	Null	Null	Comment, processing instruction, one element, zero or one document type
Document type	Root element name specified by the DOCTYPE declaration	Null	Document	None
Element	Prefixed name	Null	Element, document, or document fragment	Comment, processing instruction, text, element, entity reference, CDATA section
Text	#text	Text of the node	Element, attribute, entity, or entity reference	None
Attr	Prefixed name	Normalized attribute value	Element	Text, entity reference
Comment	#comment	Text of the comment	Element, document, or document fragment	None
Processing instruction	Target	Data	Element, document, or document fragment	None
Entity Reference	Name	Null	Element or document fragment	Comment, processing instruction, text, element, entity reference, CDATA section
Entity	Entity name	Null	Null	Comment, processing instruction, text, element, entity reference, CDATA section
CDATA section	#cdata-section	Text of the section	Element, entity, or entity reference	None
Notation	Notation name	Null	Null	None
Document fragment	#document-fragment	Null	Null	Comment, processing instruction, text, element, entity reference, CDATA section

A DOM program cannot manipulate any of these constructs. It cannot, for example, read in an XML document and then write it out again in the same encoding as in the original document, because it doesn't know what encoding the original document used. It cannot treat $var differently from $var , because it doesn't know which was originally written.