A number of XML applications have built useful application-specific DOMs by extending the standard DOM interfaces. XML applications with their own custom DOMs include HTML and XHTML, the Wireless Markup Language (WML), Scalable Vector Graphics (SVG), and MathML.
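Whereas the generic DOM would use an Element object to represent every element, a WML-specific DOM might use a WMLPElement or another subinterface appropriate for the actual type of element it represents. These custom subclasses and subinterfaces have all the methods and properties of the standard interfaces, as well as others specific to the element type. For example, a WML p element can have align, mode, and xml:lang attributes: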
<p align="center" mode="wrap" xml:lang="en"> Hello! </p>
Therefore, the WMLPElement interface has getter and setter methods for those three attributes:
public void setMode(String mode)
public void setAlign(String align)
public void setXMLLang(String lang)
public String getMode()
public String getAlign()
public String getXMLLang()
An application-specific DOM can enforce application-specific rules such as, "The mode attribute must have one of the values wrap or nowrap ," though currently this practice is uncommon.
Of course, because WMLPElement extends Element , which extends Node , it also has the usual methods of any DOM node. When processing a WML document, you can use the generic DOM interfaces if you prefer, or you can use the more specific WML subclasses and subinterfaces.
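As a rough illustration of how an application-specific layer can add typed accessors and enforce a rule such as the one about the mode attribute, here is a minimal sketch. The class WmlParagraphView and its newParagraph helper are hypothetical names invented for this example; they are not part of any real WML DOM. The sketch simply wraps a generic DOM Element.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical wrapper, not a real WML DOM class: it adds a typed,
// validated accessor for the mode attribute on top of a generic Element.
class WmlParagraphView {
    private final Element element;

    WmlParagraphView(Element element) {
        this.element = element;
    }

    String getMode() {
        return element.getAttribute("mode");
    }

    void setMode(String mode) {
        // Enforce the application-specific rule: only wrap or nowrap is allowed.
        if (!"wrap".equals(mode) && !"nowrap".equals(mode)) {
            throw new IllegalArgumentException("mode must be wrap or nowrap: " + mode);
        }
        element.setAttribute("mode", mode);
    }

    // Creates an empty p element with the generic DOM API.
    static Element newParagraph() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            return doc.createElement("p");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element p = newParagraph();
        WmlParagraphView view = new WmlParagraphView(p);
        view.setMode("nowrap");
        // The generic interface still sees the same attribute.
        System.out.println(p.getAttribute("mode"));
    }
}
```

The underlying element remains usable through the generic interfaces, which mirrors the choice described above between generic and application-specific access.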
The big issue for most application-specific DOMs is parser support. To read these documents, you not only need a custom DOM; you also need a custom parser that builds it.
It's somewhat easier to create new WML, SVG, MathML, or similar documents in a particular vocabulary using an application-specific DOM. However, you do still need a concrete implementation of that DOM's abstract interfaces. Xerces includes HTML and WML implementations.
DOM is based on an implicit data model, which is similar to but not quite the same as the data models used by other XML technologies such as XPath, the XML Infoset, and SAX. Before we delve too deeply into the nitty-gritty details of the DOM API, it's helpful to have a higher-level understanding of just what DOM thinks an XML document is.
According to DOM, an XML document is a tree made up of nodes of several types. The tree has a single root node, and all nodes in this tree except for the root have a single parent node. Furthermore, each node has a list of child nodes. In some cases, this list of children may be empty, in which case the node is called a leaf node.
There can also be nodes that are not part of the tree structure. For example, each attribute node belongs to one element node but is not considered to be a child of that element.
DOM trees are not
In addition to its tree connections, each node has a local name, a namespace URI, and a prefix; although for several kinds of nodes, these may be null. For example, the local
Finally, each node has a string value. For text-like things such as text nodes and comments, this tends to be the text of the node. For attributes, it's the normalized value of the attribute. For everything else, including elements and documents, the value is null.
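The following sketch shows how these per-node properties can be read through the standard org.w3c.dom.Node interface. The class name NodeProperties and the sample document are assumptions made for illustration; the parser is whatever JAXP implementation ships with the JDK, configured to be namespace aware.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeProperties {
    // Parses a string with a namespace-aware parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for local names and namespace URIs
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Joins name, local name, namespace URI, prefix, and value into one line.
    static String describe(Node n) {
        return n.getNodeName() + "|" + n.getLocalName() + "|" + n.getNamespaceURI()
            + "|" + n.getPrefix() + "|" + n.getNodeValue();
    }

    public static void main(String[] args) {
        Document doc = parse("<db:para xmlns:db=\"http://www.example.com/\">Hi</db:para>");
        Node para = doc.getDocumentElement();
        System.out.println(describe(doc));                  // document node
        System.out.println(describe(para));                 // element node
        System.out.println(describe(para.getFirstChild())); // text node
    }
}
```

Only the element reports a local name, namespace URI, and prefix here, and only the text node reports a non-null value.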
DOM divides nodes into twelve types, seven of which can
Of these twelve, the first seven are by far the most important; and often a tree built by an XML parser will contain only the first seven.
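In the Java binding, the twelve node types are exposed as short constants on org.w3c.dom.Node, and getNodeType() returns one of them. The sketch below maps each constant to a readable label. The class name NodeTypeNames and the labels are illustrative choices, and the cases follow the numeric order of the constants, which is not necessarily the order in which the types are discussed in the text.

```java
import org.w3c.dom.Node;

class NodeTypeNames {
    // Maps a DOM node type code to a human-readable label.
    static String typeName(short type) {
        switch (type) {
            case Node.ELEMENT_NODE:                return "element";
            case Node.ATTRIBUTE_NODE:              return "attribute";
            case Node.TEXT_NODE:                   return "text";
            case Node.CDATA_SECTION_NODE:          return "CDATA section";
            case Node.ENTITY_REFERENCE_NODE:       return "entity reference";
            case Node.ENTITY_NODE:                 return "entity";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            case Node.COMMENT_NODE:                return "comment";
            case Node.DOCUMENT_NODE:               return "document";
            case Node.DOCUMENT_TYPE_NODE:          return "document type";
            case Node.DOCUMENT_FRAGMENT_NODE:      return "document fragment";
            case Node.NOTATION_NODE:               return "notation";
            default:                               return "unknown";
        }
    }

    public static void main(String[] args) {
        for (short t = 1; t <= 12; t++) {
            System.out.println(t + " " + typeName(t));
        }
    }
}
```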
Each DOM tree has a single root document node. This node has children. Because all documents have exactly one root element, a document node always has exactly one element-node child. If the document has a document type declaration, then it also has one document-type-node child. If the document contains any comments or processing instructions before or after the root element, then these are also child nodes of the document node. The order of all children is preserved.
Example 9.2 An XML-RPC Request Document
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="xml-rpc.css"?>
<!-- It's unusual to have an xml-stylesheet processing instruction in
     an XML-RPC document but it is legal, unlike SOAP where processing
     instructions are forbidden. -->
<!DOCTYPE methodCall SYSTEM "xml-rpc.dtd">
<methodCall>
  <methodName>getQuote</methodName>
  <params>
    <param>
      <value><string>RHAT</string></value>
    </param>
  </params>
</methodCall>
The document node representing the root of this document has four child nodes in this order:
1. A processing instruction node for the xml-stylesheet processing instruction
2. A comment node for the comment
3. A document type node for the document type declaration
4. An element node for the root methodCall element
The XML declaration and the white space between these nodes are not included in the tree. The document type node is one of the document's children, and it is also available as a separate property of the document node. The XML declaration (including the version, standalone, and encoding declarations) and the white space are removed by the parser. They are not part of the model.
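One way to check which prolog constructs end up as children of the document node is to parse a document and list the node types of its children. The sketch below uses a shortened version of the XML-RPC request. The class name DocumentChildren is made up for this example, and because xml-rpc.dtd is assumed not to exist on disk, an entity resolver substitutes an empty DTD for it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DocumentChildren {
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
      + "<?xml-stylesheet type=\"text/css\" href=\"xml-rpc.css\"?>\n"
      + "<!-- An XML-RPC request -->\n"
      + "<!DOCTYPE methodCall SYSTEM \"xml-rpc.dtd\">\n"
      + "<methodCall><methodName>getQuote</methodName></methodCall>\n";

    // Returns the node type of each child of the document node, in order.
    static List<Short> childTypes() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption: the external DTD is unavailable, so resolve it to an empty one.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            List<Short> types = new ArrayList<>();
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                types.add(n.getNodeType());
            }
            return types;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // No entry appears for the XML declaration or the white space in the prolog.
        System.out.println(childTypes());
    }
}
```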
Each element node has a name, a local name, a namespace URI (which may be null if the element is not in any namespace), and a prefix (which may also be null). The element also contains children. For example, consider this value element:
<value><string>RHAT</string></value>
When represented in DOM, it becomes a single element node with the name value. This node has a single element-node child for the string element. The string element node in turn has a single text-node child containing the text RHAT.
Or consider this para element:
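<db:para xmlns:db="http://www.example.com/"
         xmlns="http://namespaces.cafeconleche.org/">
  Or consider this <markup>para</markup> element:
</db:para>
In DOM it's represented as an element node with the name db:para, the local name para, the prefix db, and the namespace URI http://www.example.com/. It has three children:
1. A text node containing the text "Or consider this " (along with any leading white space)
2. An element node with the name markup, the local name markup, the namespace URI http://namespaces.cafeconleche.org/, and a null prefix
3. A text node containing the text " element:" (along with any trailing white space)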
White space is included in text nodes, even if it's ignorable. For example, consider this methodCall element:
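<methodCall>
  <methodName>getQuote</methodName>
  <params>
    <param>
      <value><string>RHAT</string></value>
    </param>
  </params>
</methodCall>
It is represented as an element node with the name methodCall and five child nodes:
1. A text node containing only white space
2. An element node with the name methodName
3. A text node containing only white space
4. An element node with the name params
5. A text node containing only white space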
Of course, these element nodes also have their own child nodes.
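To see the white space text nodes directly, the following sketch parses an indented methodCall element and lists the names of its children. The class name WhitespaceChildren is an illustrative assumption. No DTD is supplied, so the parser has no way to know that the white space is ignorable and reports all of it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitespaceChildren {
    static final String XML =
        "<methodCall>\n"
      + "  <methodName>getQuote</methodName>\n"
      + "  <params>\n"
      + "    <param>\n"
      + "      <value><string>RHAT</string></value>\n"
      + "    </param>\n"
      + "  </params>\n"
      + "</methodCall>";

    // Returns the node names of the root element's children, in document order.
    static List<String> childNames() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            List<String> names = new ArrayList<>();
            Node child = doc.getDocumentElement().getFirstChild();
            for (; child != null; child = child.getNextSibling()) {
                names.add(child.getNodeName()); // text nodes are always named #text
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childNames());
    }
}
```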
In addition to containing element and text nodes, an element node may contain comment and processing instruction nodes. Depending on how the parser behaves, an element node might also contain some CDATA section nodes, entity reference nodes, or both. However, many parsers resolve these automatically into their component text and element nodes, and do not report them separately.
An attribute node has a name, a local name, a prefix, a namespace URI, and a string value. The value is normalized as required by the XML 1.0 specification. That is, entity and character references in the value are resolved, and all white space characters are converted to spaces.
If a validating parser builds an XML document from a file, then default attributes from the DTD are included in the DOM tree. If the parser supports schemas, then default attributes can be read from the schema as well. DOM does not provide the type of the attribute as specified by the DTD or schema, or the list of values available for an enumerated type attribute. This is a major shortcoming.
Attributes are not considered to be children of the element to which they are attached. Instead they are part of a separate set of nodes. For example, consider this Quantity element:
<Quantity amount="17" />
This element has no children, but it does have a single attribute with the name amount and the value 17 .
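The separation between children and attributes is visible in the API: attributes are reached through getAttributes() or getAttributeNode(), never through getChildNodes(). Here is a minimal sketch using the Quantity element; the class name AttributesAreNotChildren is an assumption made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributesAreNotChildren {
    // Parses the Quantity example and returns its element node.
    static Element parseQuantity() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<Quantity amount=\"17\" />")))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element quantity = parseQuantity();
        Attr amount = quantity.getAttributeNode("amount");
        System.out.println("children:   " + quantity.getChildNodes().getLength());
        System.out.println("attributes: " + quantity.getAttributes().getLength());
        System.out.println("amount:     " + amount.getValue());
        // An attribute has an owner element but no parent node.
        System.out.println("parent:     " + amount.getParentNode());
        System.out.println("owner:      " + amount.getOwnerElement().getNodeName());
    }
}
```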
Attributes that declare namespaces do not receive special treatment in DOM. They are
Only document, document fragment, element, attribute, entity, and entity reference nodes can have children. The remaining node types are much simpler.
Text nodes contain character data from the document stored as a String. Any characters from outside Unicode's Basic Multilingual Plane are represented as surrogate pairs. Characters like & and < that are represented in the document by predefined entity or character references are replaced by the actual characters they represent.
When a parser reads an XML document to form a DOM Document, it puts as much text as possible into each text node before being interrupted by markup.
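The sketch below illustrates both points about text nodes: references are replaced by the characters they stand for, and a character outside the Basic Multilingual Plane occupies two Java chars. The class name TextNodeContent and the sample text are assumptions for illustration. The call to normalize() is a precaution that merges any adjacent text nodes before the first child is read.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TextNodeContent {
    // Returns the value of the first (normalized) text node child of the root element.
    static String firstTextValue(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            root.normalize(); // merge adjacent text nodes, if the parser produced any
            return root.getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // &#x1D11E; is MUSICAL SYMBOL G CLEF, a character outside the BMP.
        String value = firstTextValue("<p>AT&amp;T &lt; &#x1D11E;</p>");
        System.out.println(value);
        System.out.println("chars:       " + value.length());
        System.out.println("code points: " + value.codePointCount(0, value.length()));
    }
}
```

The char count exceeds the code point count by one because the G clef is stored as a surrogate pair.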
A comment node has a name (which is always #comment), a string value (the text of the comment) and a parent (the node that contains it). That's all. For example, consider this comment:
<!-- Don't forget to fix this! -->
The value of this node is Don't forget to fix this! The white space at either end is included.
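A short sketch confirms the name and value of a comment node, including the white space at either end. The class name CommentNodeDemo and the enclosing r element are hypothetical, added only so that the comment has a parent element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CommentNodeDemo {
    // Returns the first child of the root element of the given document.
    static Node firstChildOfRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getFirstChild();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node comment = firstChildOfRoot("<r><!-- Don't forget to fix this! --></r>");
        System.out.println(comment.getNodeName());
        // Brackets make the leading and trailing spaces visible.
        System.out.println("[" + comment.getNodeValue() + "]");
        System.out.println(comment.getParentNode().getNodeName());
    }
}
```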
Processing Instruction Nodes
A processing instruction node has a name (the target of the processing instruction), a string value (the data of the processing instruction), and a parent (the node that contains it). That's all. For example, consider this processing instruction:
<?xml-stylesheet type="text/css" href="xml-rpc.css"?>
The name of this node is xml-stylesheet. The value is type="text/css" href="xml-rpc.css". The white space between the target and the data is not included, but any white space between the data and the closing ?> is included. Even if the processing instruction uses a pseudo-attribute format as this one does, it is not considered to have attributes or children. Its data is just a string that happens to have some equal signs and quote marks in suggestive positions.
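The target and data of a processing instruction are available through the ProcessingInstruction interface. The following sketch reads them from the xml-stylesheet example; the class name ProcessingInstructionDemo and the empty r root element are assumptions added to make a complete document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class ProcessingInstructionDemo {
    // Parses a document whose first child is the xml-stylesheet processing instruction.
    static ProcessingInstruction parseStylesheetPI() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                    "<?xml-stylesheet type=\"text/css\" href=\"xml-rpc.css\"?><r/>")));
            return (ProcessingInstruction) doc.getFirstChild();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ProcessingInstruction pi = parseStylesheetPI();
        System.out.println("target: " + pi.getTarget());
        System.out.println("data:   " + pi.getData());
        // The pseudo-attributes are plain text: no attribute map and no children.
        System.out.println("attributes: " + pi.getAttributes());
        System.out.println("has children: " + pi.hasChildNodes());
    }
}
```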
CDATA Section Nodes
A CDATA section node is a special text node that represents the contents of a CDATA section. Its name is #cdata-section. Its value is the text content of the section. For example, consider this CDATA section:
<![CDATA[<?xml-stylesheet type="text/css" href="xml-rpc.css"?>]]>
Its name is #cdata-section and its value is <?xml-stylesheet type="text/css" href="xml-rpc.css"?> .
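As a sketch of how this looks in code, the example below parses the CDATA section with a default, non-coalescing DocumentBuilderFactory, which in the JDK's bundled parser keeps CDATA sections as separate nodes. The class name CdataSectionDemo and the enclosing r element are assumptions for illustration. A parser configured with setCoalescing(true) would merge the section into an ordinary text node instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CdataSectionDemo {
    static final String XML =
        "<r><![CDATA[<?xml-stylesheet type=\"text/css\" href=\"xml-rpc.css\"?>]]></r>";

    // Returns the first child of the root element, expected to be the CDATA section.
    static Node firstChildOfRoot() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(false); // the default: keep CDATA sections distinct
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDocumentElement().getFirstChild();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node cdata = firstChildOfRoot();
        System.out.println(cdata.getNodeName());
        // The markup-like characters are ordinary text inside a CDATA section.
        System.out.println(cdata.getNodeValue());
    }
}
```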
Entity Reference Nodes
When a parser encounters a general entity reference such as &AElig; or &copyright_notice;, it may or may not replace it with the entity's replacement text. Validating parsers always replace entity references. Nonvalidating parsers may do so at their option.
If a parser does not replace entity references, then the DOM tree will include entity reference nodes. Each entity reference node has a name, and if the parser has read the DTD, then you should be able to look up the public and system IDs for this entity reference using the map of entity nodes available on the document type node. Furthermore, the child list of the entity will contain the replacement text for this entity reference. However, if the parser has not read the DTD and resolved external entity references, then the child list may be empty.
If a parser does replace entity references, then the DOM tree may or may not include entity reference nodes. Some parsers resolve all entity reference nodes completely and leave no trace of them in the parsed tree. Other parsers instead include entity reference nodes in the DOM tree that have a list of children. The child list contains text nodes, element nodes, comment nodes, and so forth, representing the replacement text of the entity.
For example, suppose an XML document contains this element:
<para>&AElig;lfred is a very nice XML parser.</para>
If the parser is not resolving entity references, then the para element node contains two children: an entity reference node with the name AElig and a text node containing the text "lfred is a very nice XML parser." The AElig entity reference node will not have any children.
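Now suppose the parser is resolving entity references, and the replacement text for the AElig entity reference is the single ligature character Æ. Now the parser has a choice: It can represent the children of the para element as a single text node containing the full sentence "Ælfred is a very nice XML parser." Alternatively, it can represent them as an AElig entity reference node, whose single child is a text node containing Æ, followed by a text node containing "lfred is a very nice XML parser."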
DOM never includes entity reference nodes for the five predefined entity references: &amp;, &lt;, &gt;, &apos;, and &quot;. These are simply replaced by their respective characters and included in a text node. Similarly, character references such as &#xA0; and &#160; are not specially represented in DOM as any kind of node. The characters they represent are simply added to the relevant text node.
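The treatment of predefined entity references and character references can be checked with a small sketch. Even when the factory is asked not to expand entity references, these references should appear only as characters in text. The class name PredefinedEntities and the sample content are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PredefinedEntities {
    static final String XML = "<p>a &lt; b &amp; c&#xA0;&#160;</p>";

    static Element parseRoot() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Ask for entity reference nodes to be kept where the parser supports it.
            factory.setExpandEntityReferences(false);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts entity reference nodes among the direct children of an element.
    static int countEntityReferences(Element element) {
        int count = 0;
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Element p = parseRoot();
        System.out.println("entity reference nodes: " + countEntityReferences(p));
        System.out.println("text: " + p.getTextContent());
    }
}
```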
Document Type Nodes
A document type node has a name (the name the document type declaration specifies for the root element), a public ID (which may be null), a system ID (which may be null if the declaration has no external subset), an internal DTD subset (which may be null), a parent (the document that contains it), and lists of the notations and general entities declared in the DTD. The value of a document type node is always null. For example, consider this document type declaration:
<!DOCTYPE mml:math PUBLIC "-//W3C//DTD MathML 2.0//EN"
  "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" [
  <!ENTITY % MATHML.prefixed "INCLUDE">
  <!ENTITY % MATHML.prefix "mml">
]>
The name of the corresponding node is mml:math . The public ID is -//W3C//DTD MathML 2.0//EN . The system ID is http://www.w3.org/TR/MathML2/dtd/mathml2.dtd . The internal DTD subset is the complete text between [ and ] .
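These properties map onto the DocumentType interface. The sketch below reads them from the MathML declaration; the class name DoctypeDemo and the empty mml:math root element are assumptions added to form a complete document. To avoid fetching the real DTD over the network, an entity resolver substitutes an empty external subset, which is also an assumption of this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

class DoctypeDemo {
    static final String XML =
        "<!DOCTYPE mml:math PUBLIC \"-//W3C//DTD MathML 2.0//EN\"\n"
      + "  \"http://www.w3.org/TR/MathML2/dtd/mathml2.dtd\" [\n"
      + "  <!ENTITY % MATHML.prefixed \"INCLUDE\">\n"
      + "  <!ENTITY % MATHML.prefix \"mml\">\n"
      + "]>\n"
      + "<mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"/>";

    static DocumentType parseDoctype() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption: skip the network by resolving the external DTD to an empty one.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(XML))).getDoctype();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DocumentType doctype = parseDoctype();
        System.out.println("name:      " + doctype.getName());
        System.out.println("public ID: " + doctype.getPublicId());
        System.out.println("system ID: " + doctype.getSystemId());
        System.out.println("value:     " + doctype.getNodeValue());
        // The exact white space of the internal subset may vary by parser.
        System.out.println("internal subset: " + doctype.getInternalSubset());
    }
}
```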
There are four kinds of DOM nodes that are part of the document but not the document's tree: attribute nodes, entity nodes, notation nodes, and document fragment nodes. You've already seen that attribute nodes are attached to element nodes but are not children of those nodes. Entity and notation nodes are available as special properties of the document type node. Document fragment nodes are used only when building DOM trees in memory, not when reading them from a parsed file.
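Entity nodes (not to be confused with entity reference nodes) represent the parsed and unparsed entities declared in the document's DTD.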
Each entity node has a name and a system ID. It can also have a public ID if one was used in the DTD. Furthermore, if the parser reads the entity, then the entity node has a list of children containing the replacement text of the entity. However, these children are read-only and cannot be modified, unlike children of similar type elsewhere in the document. For example, suppose the following entity declaration appeared in the document's DTD:
<!ENTITY AElig "Æ">
If the parser read the DTD, then it would create an entity node with the name AElig. This node would have a null public and system ID (because the entity would be purely internal) and one child, a read-only text node containing the single character Æ.
For another example, suppose this entity declaration appeared in the document's DTD:
<!ENTITY Copyright SYSTEM "copyright.xml">
If the parser read the DTD, then it would create an entity node with the name Copyright, the system ID copyright.xml, and a null public ID. The children of this node would depend on what was found at the relative URL copyright.xml. Suppose that document contained the following content:
<copyright>
  <year>2002</year>
  <person>Elliotte Rusty Harold</person>
</copyright>
Then the child list of the Copyright entity node would contain a single read-only element child with the name copyright. The element would contain its own read-only element and text node children.
Notation nodes represent the notations declared in the document's DTD. If the parser reads the DTD, then it will attach a map of notation nodes to the document type node. This map is indexed by the notation name. You can use it to look up the notation for each entity node that corresponds to an unparsed entity, or the notations associated with particular processing instruction targets.
In addition to its name, each notation node has a public ID, a system ID, or both, depending on how it was declared in the DTD. Notation nodes do not have any children. For example, suppose this notation declaration for PNG images was included in the DTD:
<!NOTATION PNG SYSTEM "http://www.w3.org/TR/REC-png">
This would produce a notation node with the name PNG and the system ID http://www.w3.org/TR/REC-png . The public ID would be null.
For another example, suppose this notation declaration for TeX documents was included in the DTD:
<!NOTATION TEX PUBLIC "+//ISBN 0-201-13448-9::Knuth//NOTATION The TeXbook//EN">
This would produce a notation node with the name TEX and the public ID +//ISBN 0-201-13448-9::Knuth//NOTATION The TeXbook//EN. The system ID would be null. (XML also allows a notation declaration to supply both a public ID and a system ID.)
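The entity and notation maps described in this section are reachable through DocumentType.getEntities() and DocumentType.getNotations(). The sketch below declares the PNG notation and the AElig entity in an internal DTD subset and looks both up by name. The class name DtdMaps and the r root element are assumptions for illustration, and whether an entity node also exposes children depends on the parser, so the sketch does not rely on that.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;
import org.w3c.dom.Notation;
import org.xml.sax.InputSource;

class DtdMaps {
    static final String XML =
        "<!DOCTYPE r [\n"
      + "  <!NOTATION PNG SYSTEM \"http://www.w3.org/TR/REC-png\">\n"
      + "  <!ENTITY AElig \"&#xC6;\">\n"
      + "]>\n"
      + "<r/>";

    static DocumentType parseDoctype() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDoctype();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DocumentType doctype = parseDoctype();

        // Both maps are indexed by name.
        Notation png = (Notation) doctype.getNotations().getNamedItem("PNG");
        System.out.println(png.getNodeName() + " system ID: " + png.getSystemId());
        System.out.println(png.getNodeName() + " public ID: " + png.getPublicId());

        Entity aelig = (Entity) doctype.getEntities().getNamedItem("AElig");
        // An internal entity has neither a public ID nor a system ID.
        System.out.println(aelig.getNodeName() + " system ID: " + aelig.getSystemId());
        System.out.println(aelig.getNodeName() + " public ID: " + aelig.getPublicId());
    }
}
```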
Document Fragment Nodes
The document fragment node is an alternative root node for a DOM tree. It can contain anything an element can contain (for example, element nodes, text nodes, processing instruction nodes, comment nodes, and so on). Although a parser will never produce such a node, your own programs may create one when extracting part of an XML document in order to move it elsewhere.
In DOM, the nonroot nodes never exist alone. That is, there's never a text node or an element node or a comment node that's not part of a document or a document fragment. They may be temporarily disconnected from the main tree, but they always know which document or fragment they belong to. The document fragment node enables you to work with pieces of a document that are composed of more than one node.
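A short sketch of working with a document fragment follows. It builds a fragment holding two nodes and then appends the fragment to an element, which moves the fragment's children into the element and leaves the fragment empty. The class name FragmentDemo and the node names are assumptions chosen for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

class FragmentDemo {
    // Returns {fragment children before, element children after, fragment children after}.
    static int[] moveFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element params = doc.createElement("params");
            doc.appendChild(params);

            // A fragment can hold several sibling nodes with no single parent element.
            DocumentFragment fragment = doc.createDocumentFragment();
            fragment.appendChild(doc.createElement("param"));
            fragment.appendChild(doc.createComment(" second child "));
            int before = fragment.getChildNodes().getLength();

            // Appending the fragment inserts its children, not the fragment itself.
            params.appendChild(fragment);
            return new int[] {
                before,
                params.getChildNodes().getLength(),
                fragment.getChildNodes().getLength()
            };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = moveFragment();
        System.out.println("fragment before: " + counts[0]);
        System.out.println("element after:   " + counts[1]);
        System.out.println("fragment after:  " + counts[2]);
    }
}
```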
What Is and Isn't in the Tree
Table 9.1 summarizes the DOM data model with the name, value, parent, and possible children for each kind of node. One thing to keep in mind is the
Table 9.1. Node Properties
A DOM program cannot manipulate any of these constructs. It cannot, for example, read in an XML document and then write it out again in the same encoding as in the original document, because it doesn't know what encoding the original document used. It cannot treat &#36;var differently from $var, because it doesn't know which was originally written.