DOM | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

The Document Object Model defines a tree-based representation of XML documents. The org.w3c.dom package contains the basic node classes that represent the different components which make up the tree. The org.w3c.dom.traversal package includes some useful utility classes for navigating, searching, and querying the tree.

DOM Level 2, the version described here, is incomplete. It does not define how a DOMImplementation is loaded, how a document is parsed, or how a document is serialized. For the moment, JAXP provides a stopgap solution. Eventually, DOM3 will fill in these holes, but because it was far from complete at the time of this writing, this appendix covers DOM2 exclusively.

The DOM Data Model

Table A.1 summarizes the DOM data model with the name, value, parent, and possible children for each kind of node.

Table A.1. DOM2 Node Properties

Node Type	Name	Value	Parent	Children
Document	#document	Null	Null	Comment, processing instruction, zero or one document type, one element
Document type	Root element name specified by the DOCTYPE declaration	Null	Document	None
Element	Prefixed name	Null	Element, document, or document fragment	Comment, processing instruction, text, element, entity reference, CDATA section
Text	#text	Text of the node	Element, attr, entity, or entity reference	None
Attr	Prefixed name	Normalized attribute value	Element	Text, entity reference
Comment	#comment	Text of comment	Element, document, or document fragment	None
Processing instruction	Target	Data	Element, document, or document fragment	None
Entity reference	Name	Null	Element or document fragment	Comment, processing instruction, text, element, entity reference, CDATA section
Entity	Entity name	Null	Null	Comment, processing instruction, text, element, entity reference, CDATA section
CDATA section	#cdata-section	Text of the section	Element, entity, or entity reference	None
Notation	Notation name	Null	Null	None
Document fragment	#document-fragment	Null	Null	Comment, processing instruction, text, element, entity reference, CDATA section

One thing to keep in mind is the parts of the XML document that are not exposed in this data model:

The XML declaration, including the version, standalone, and encoding declarations. These will be added as properties of the document node in DOM3, but current parsers do not provide them.
Most information from the DTD and/or schema, including element and attribute types and content models. DOM3 will add some of this.
Any white space outside the root element.
Whether or not each character was provided by a character reference. Parsers may provide information about entity references but are not required to do so.

A DOM program cannot manipulate any of these constructs. It cannot, for example, read in an XML document and then write it out again in the same encoding as in the original document, because it doesn't know what encoding the original document used. It cannot treat $var differently than &#36;var, because it doesn't know which was originally written.

org.w3c.dom

The org.w3c.dom package contains the core interfaces that are used to form DOM documents. Node is the common superinterface that all of these node types share. In addition, this package contains a few data structures used to hold collections of DOM nodes and one exception class.

Attr

The Attr interface represents an attribute node. Its node properties are defined as follows:

Node name	The full name of the attribute, including a prefix and a colon if the attribute is in a namespace
Node value	The attribute's normalized value
Local name	The local part of the attribute's name
Namespace URI	The namespace URI of the attribute, or null if the attribute does not have a prefix
Namespace prefix	The namespace prefix of the attribute, or null if the attribute is not in a namespace

Attr objects are not part of the tree, and they have neither parents nor siblings. getParentNode(), getPreviousSibling(), and getNextSibling() all return null when invoked on an Attr object. Attr objects do have children (Text and EntityReference objects), but it's generally best to ignore this and simply use the getValue() method to read the value of an attribute.

    package org.w3c.dom;

    public interface Attr extends Node {

        public String  getName();
        public boolean getSpecified();
        public String  getValue();
        public void    setValue(String value) throws DOMException;
        public Element getOwnerElement();

    }
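As a rough illustration of these properties, the following sketch parses a small made-up document with a namespace-aware JAXP parser and summarizes two attributes. The class name AttrDemo, its helper methods, and the sample XML are assumptions introduced for this example only; they are not part of DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the DOM API.
class AttrDemo {

    static final String XLINK = "http://www.w3.org/1999/xlink";

    // Parses an XML string with a namespace-aware JAXP parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Summarizes an attribute using only methods declared in Attr and Node.
    static String describe(Attr attr) {
        return attr.getName() + "=" + attr.getValue()
                + " local=" + attr.getLocalName()
                + " ns=" + attr.getNamespaceURI()
                + " parent=" + attr.getParentNode()   // null: not part of the tree
                + " owner=" + attr.getOwnerElement().getTagName();
    }

    public static void main(String[] args) {
        Element root = parse("<item id='42' xlink:href='a.xml' xmlns:xlink='"
                + XLINK + "'/>").getDocumentElement();
        System.out.println(describe(root.getAttributeNode("id")));
        System.out.println(describe(root.getAttributeNodeNS(XLINK, "href")));
    }
}
```

Although getParentNode() returns null, getOwnerElement() still leads back to the element that carries the attribute.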

CDATASection

The CDATASection interface represents a CDATA section. DOM parsers are not required to use this interface to report CDATA sections. They may just use Text objects to report the content of CDATA sections. Do not write code that depends on recognizing CDATA sections in text. The node properties of CDATASection are defined as follows:

Node name	`#cdata-section`
Node value	The text of the CDATA section
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface CDATASection extends Text {

    }

CharacterData

The CharacterData interface is the generic superinterface for nodes composed of plain text: Comment, Text, and CDATASection. All actual instances of CharacterData should be instances of one of these subinterfaces. The node properties depend on the specific subinterface.

    package org.w3c.dom;

    public interface CharacterData extends Node {

        public String getData() throws DOMException;
        public void   setData(String data) throws DOMException;
        public int    getLength();
        public String substringData(int offset, int count)
            throws DOMException;
        public void   appendData(String s) throws DOMException;
        public void   insertData(int offset, String s)
            throws DOMException;
        public void   deleteData(int offset, int count)
            throws DOMException;
        public void   replaceData(int offset, int count, String s)
            throws DOMException;

    }
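The editing methods all work with zero-based character offsets. The sketch below applies each of them in turn to a Text node created in memory; the class name CharacterDataDemo and the sample strings are made up for this illustration, and the empty document comes from JAXP's DocumentBuilder.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

// Hypothetical demo class; not part of the DOM API.
class CharacterDataDemo {

    // Creates an empty document through JAXP.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Applies each CharacterData editing method once and reports the result.
    static String edit() {
        Text text = newDocument().createTextNode("Hello DOM");
        text.insertData(5, ",");          // Hello, DOM
        text.replaceData(7, 3, "world");  // Hello, world
        text.appendData("!");             // Hello, world!
        text.deleteData(0, 7);            // world!
        return text.getData() + ":" + text.getLength()
                + ":" + text.substringData(0, 5);
    }

    public static void main(String[] args) {
        System.out.println(edit());
    }
}
```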

Comment

The Comment interface represents a comment node. It inherits all of its methods from the CharacterData and Node superinterfaces. Its node properties are defined as follows:

Node name	`#comment`
Node value	The text of the comment, not including `<!--` and `-->`
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface Comment extends CharacterData {

    }

Document

The Document interface represents the root node of the tree. It also serves as an abstract factory to create the other kinds of nodes (element, attribute, comment, and so on) that will be stored in the tree. Its node properties are defined as follows:

Node name	`#document`
Node value	null
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface Document extends Node {

        public DocumentType      getDoctype();
        public DOMImplementation getImplementation();
        public Element           getDocumentElement();

        public Element createElement(String tagName)
            throws DOMException;
        public Element createElementNS(String namespaceURI,
            String qualifiedName) throws DOMException;
        public Attr createAttribute(String name)
            throws DOMException;
        public Attr createAttributeNS(String namespaceURI,
            String qualifiedName) throws DOMException;
        public DocumentFragment createDocumentFragment();
        public Text             createTextNode(String data);
        public Comment          createComment(String data);
        public CDATASection     createCDATASection(String data)
            throws DOMException;
        public ProcessingInstruction createProcessingInstruction(
            String target, String data) throws DOMException;
        public EntityReference createEntityReference(String name)
            throws DOMException;

        public NodeList getElementsByTagName(String tagName);
        public Node     importNode(Node importedNode, boolean deep)
            throws DOMException;
        public NodeList getElementsByTagNameNS(String namespaceURI,
            String localName);
        public Element  getElementById(String id);

    }
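The factory methods create nodes that belong to the document but are not yet attached anywhere; each one has to be inserted with a method such as appendChild(). A minimal sketch, assuming a JAXP DocumentBuilder supplies the empty document (the class name DocumentFactoryDemo and the element names are invented for this example):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class; not part of the DOM API.
class DocumentFactoryDemo {

    // Builds <order><!-- ... --><item>name</item>...</order> in memory.
    static Document build(String... itemNames) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("order");
        doc.appendChild(root);  // created nodes are detached until inserted
        root.appendChild(doc.createComment(" built in memory "));
        for (String name : itemNames) {
            Element item = doc.createElement("item");
            item.appendChild(doc.createTextNode(name));
            root.appendChild(item);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build("apple", "pear");
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            System.out.println(items.item(i).getFirstChild().getNodeValue());
        }
    }
}
```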

DocumentFragment

The DocumentFragment interface is used to hold lists of element, text, comment, CDATA section, and processing instruction nodes when those nodes do not have a parent. It's convenient for cutting and pasting or inserting and moving fragments of an XML document that don't necessarily contain a single element.

The node properties of DocumentFragment are defined as follows:

Node name	`#document-fragment`
Node value	null
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface DocumentFragment extends Node {

    }

This interface is for advanced use only. DOM trees created by a parser won't contain any DocumentFragment objects, and adding a DocumentFragment to a Document actually adds the contents of the fragment instead.
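The following sketch shows this behavior: three nodes are collected in a fragment, the fragment is appended to an element, and afterward the element holds the three nodes while the fragment is empty. The class name FragmentDemo and the node names are assumptions made for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of the DOM API.
class FragmentDemo {

    // Returns "<children of list>/<children left in fragment>/<first child name>".
    static String appendFragment() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element list = doc.createElement("list");
        doc.appendChild(list);

        DocumentFragment fragment = doc.createDocumentFragment();
        fragment.appendChild(doc.createElement("a"));
        fragment.appendChild(doc.createTextNode("between"));
        fragment.appendChild(doc.createElement("b"));

        // Appending the fragment moves its children, not the fragment itself.
        list.appendChild(fragment);

        return list.getChildNodes().getLength() + "/"
                + fragment.getChildNodes().getLength() + "/"
                + list.getFirstChild().getNodeName();
    }

    public static void main(String[] args) {
        System.out.println(appendFragment());
    }
}
```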

DocumentType

The DocumentType interface represents a document type declaration. It contains the root element name it declares, the system ID and public ID for the external DTD subset, and the complete internal DTD subset as a String. It also contains lists of the notations and general entities declared in the DTD. Otherwise it contains no information from the DTD. The node properties of a DocumentType object are defined as follows:

Node name	declared root element name
Node value	null
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface DocumentType extends Node {

        public String       getName();
        public String       getPublicId();
        public String       getSystemId();
        public String       getInternalSubset();
        public NamedNodeMap getEntities();
        public NamedNodeMap getNotations();

    }

In DOM2, the entire DocumentType object is read-only. No part of it can be modified. Furthermore, a Document object's DocumentType cannot be changed after the Document object is created. This restriction is lifted in DOM3.

DOM2 does not provide any representation of the document type definition (DTD) as distinguished from the document type declaration.

DOMImplementation

DOMImplementation is an abstract factory used to create new Document and DocumentType objects. In JAXP, the getDOMImplementation() method of the javax.xml.parsers.DocumentBuilder class returns a DOMImplementation object.

    package org.w3c.dom;

    public interface DOMImplementation {

        public DocumentType createDocumentType(String qualifiedName,
            String publicID, String systemID) throws DOMException;
        public Document createDocument(String namespaceURI,
            String qualifiedName, DocumentType doctype)
            throws DOMException;
        public boolean hasFeature(String feature, String version);

    }
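Because a DOM2 document's DocumentType has to be supplied when the document is created, createDocumentType() is normally called first and its result passed to createDocument(). A minimal sketch, assuming JAXP's DocumentBuilder.getDOMImplementation() as the way to obtain the implementation (the class name DomImplementationDemo is invented, and XHTML is used only as a familiar example vocabulary):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical demo class; not part of the DOM API.
class DomImplementationDemo {

    // Creates an empty XHTML document with a document type declaration.
    static Document create() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().getDOMImplementation();
            DocumentType doctype = impl.createDocumentType(
                    "html",
                    "-//W3C//DTD XHTML 1.0 Strict//EN",
                    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
            // The root element is created along with the document.
            return impl.createDocument(
                    "http://www.w3.org/1999/xhtml", "html", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = create();
        System.out.println(doc.getDoctype().getName());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
        System.out.println(doc.getImplementation().hasFeature("Core", "2.0"));
    }
}
```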

Element

The Element interface represents an element node. The most important methods for this interface are inherited from the Node superinterface. Its node properties are defined as follows:

Node name	The qualified name of the element, possibly including a prefix and a colon
Node value	null
Local name	The local part of the element name
Namespace URI	The namespace URI of the element, or null if this element is not in a namespace
Namespace prefix	The namespace prefix of the element, or null if this element is in the default namespace or no namespace at all

    package org.w3c.dom;

    public interface Element extends Node {

        public String   getTagName();
        public NodeList getElementsByTagNameNS(String namespaceURI,
            String localName);
        public NodeList getElementsByTagName(String name);

        public String getAttribute(String name);
        public void   setAttribute(String name, String value)
            throws DOMException;
        public void   removeAttribute(String name)
            throws DOMException;
        public Attr   getAttributeNode(String name);
        public Attr   setAttributeNode(Attr newAttr)
            throws DOMException;
        public Attr   removeAttributeNode(Attr oldAttr)
            throws DOMException;

        public String getAttributeNS(String namespaceURI,
            String localName);
        public void   setAttributeNS(String namespaceURI,
            String qualifiedName, String value) throws DOMException;
        public void   removeAttributeNS(String namespaceURI,
            String localName) throws DOMException;
        public Attr   getAttributeNodeNS(String namespaceURI,
            String localName);
        public Attr   setAttributeNodeNS(Attr newAttr)
            throws DOMException;

        public boolean hasAttribute(String name);
        public boolean hasAttributeNS(String namespaceURI,
            String localName);

    }
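Most of the methods Element declares itself deal with attributes. The sketch below sets, tests, reads, and removes attributes on an element created in memory. One point worth noticing is that getAttribute() returns an empty string rather than null for an attribute that is not present, so hasAttribute() is the way to tell "missing" from "empty". The class name ElementDemo and the SVG/XLink names are simply sample data chosen for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of the DOM API.
class ElementDemo {

    static final String SVG = "http://www.w3.org/2000/svg";
    static final String XLINK = "http://www.w3.org/1999/xlink";

    // Exercises the attribute methods and summarizes what they return.
    static String attributes() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element rect = doc.createElementNS(SVG, "svg:rect");
        rect.setAttribute("width", "10");
        rect.setAttributeNS(XLINK, "xlink:href", "#r");

        boolean hadWidth = rect.hasAttribute("width");
        rect.removeAttribute("width");

        return rect.getTagName() + " " + rect.getLocalName()
                + " " + hadWidth + " " + rect.hasAttribute("width")
                + " " + rect.getAttributeNS(XLINK, "href")
                + " [" + rect.getAttribute("missing") + "]";
    }

    public static void main(String[] args) {
        System.out.println(attributes());
    }
}
```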

Entity

The Entity interface represents an entity node. It does not appear directly in the tree; instead, an EntityReference node appears in the tree. The name of the EntityReference identifies a member of the document's entities map, which is accessible through the DocumentType interface. If the Entity object represents a parsed entity and the parser resolved the entity, then this node will have children that represent its replacement text. All aspects of the Entity object, including all of its children, are read-only. They cannot be modified or changed in any way.

The node properties of Entity are defined as follows:

Node name	The name of the entity
Node value	null
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface Entity extends Node {

        public String getPublicId();
        public String getSystemId();
        public String getNotationName();

    }

Because Entity objects are not part of the tree, they have neither parents nor siblings. getParentNode(), getPreviousSibling(), and getNextSibling() all return null when invoked on an Entity object.

EntityReference

The EntityReference interface represents a parsed entity reference that appears in the document tree. Parsers are not required to use this interface. Some parsers silently resolve all entity references to their replacement text. If a parser does not resolve external entity references, then it must include EntityReference objects instead, although the only information available from these objects will be the name. A parser that does resolve external entity references and chooses to include EntityReference objects anyway will also set the children of this node, so as to represent the entity's replacement text. In this case, you can use the methods inherited from the Node superinterface to walk the entity's tree. Note, however, that all of these children and their descendants are completely read-only, and you will not be able to change them in any way. If you need to modify them, you must first clone each of the EntityReference children, and then replace the EntityReference with the cloned children.

EntityReference objects are never used for the five predefined entity references (&lt;, &gt;, &amp;, &quot;, and &apos;) or for character references such as &#xA0; or &#160;.

The node properties of EntityReference are defined as follows:

Node name	The name of the entity
Node value	null
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface EntityReference extends Node {

    }

NamedNodeMap

DOM uses NamedNodeMap data structures to hold unordered sets of attributes, notations, and entities. You can iterate through a map using item() and getLength(). The first item in the map is at index 0. Note that the particular order the implementation chooses is not significant or even reproducible.

    package org.w3c.dom;

    public interface NamedNodeMap {

        public Node getNamedItem(String name);
        public Node setNamedItem(Node node) throws DOMException;
        public Node removeNamedItem(String name) throws DOMException;
        public Node item(int index);
        public int  getLength();
        public Node getNamedItemNS(String namespaceURI,
            String localName);
        public Node setNamedItemNS(Node node) throws DOMException;
        public Node removeNamedItemNS(String namespaceURI,
            String localName) throws DOMException;

    }

NamedNodeMaps are live. That is, adding an item to the map or removing an item from the map will add it to or remove it from whatever construct produced the map in the first place.
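A short sketch of both points, using the attribute map of an element built in memory: the names are sorted before use because the map's own order carries no meaning, and removing an item through the map removes the attribute from the element. The class NamedNodeMapDemo, its methods, and the sample element are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

// Hypothetical demo class; not part of the DOM API.
class NamedNodeMapDemo {

    // Builds <point y="2" x="1"/> in memory.
    static Element sampleElement() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element point = doc.createElement("point");
            point.setAttribute("y", "2");
            point.setAttribute("x", "1");
            return point;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Iterates with item() and getLength(), then imposes its own order.
    static List<String> sortedNames(Element element) {
        NamedNodeMap attributes = element.getAttributes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < attributes.getLength(); i++) {
            names.add(attributes.item(i).getNodeName());
        }
        Collections.sort(names);
        return names;
    }

    // Removing through the map also removes the attribute from the element.
    static boolean removalIsLive(Element element, String name) {
        NamedNodeMap attributes = element.getAttributes();
        int before = attributes.getLength();
        attributes.removeNamedItem(name);
        return !element.hasAttribute(name)
                && attributes.getLength() == before - 1;
    }

    public static void main(String[] args) {
        System.out.println(sortedNames(sampleElement()));
        System.out.println(removalIsLive(sampleElement(), "x"));
    }
}
```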

Node

Node is the key superinterface for almost all of the other interfaces in the org.w3c.dom package. It is the primary means by which you navigate, search, query, and occasionally even update an XML document with DOM.

    package org.w3c.dom;

    public interface Node {

        // Node type constants
        public static final short ELEMENT_NODE;
        public static final short ATTRIBUTE_NODE;
        public static final short TEXT_NODE;
        public static final short CDATA_SECTION_NODE;
        public static final short ENTITY_REFERENCE_NODE;
        public static final short ENTITY_NODE;
        public static final short PROCESSING_INSTRUCTION_NODE;
        public static final short COMMENT_NODE;
        public static final short DOCUMENT_NODE;
        public static final short DOCUMENT_TYPE_NODE;
        public static final short DOCUMENT_FRAGMENT_NODE;
        public static final short NOTATION_NODE;

        // Basic getter methods
        public String getNodeName();
        public String getNodeValue() throws DOMException;
        public void   setNodeValue(String value) throws DOMException;
        public short  getNodeType();
        public String getNamespaceURI();
        public String getPrefix();
        public void   setPrefix(String prefix) throws DOMException;
        public String getLocalName();

        // Navigation methods
        public Node     getParentNode();
        public boolean  hasChildNodes();
        public NodeList getChildNodes();
        public Node     getFirstChild();
        public Node     getLastChild();
        public Node     getPreviousSibling();
        public Node     getNextSibling();
        public Document getOwnerDocument();

        // Attribute methods
        public boolean      hasAttributes();
        public NamedNodeMap getAttributes();

        // Tree modification methods
        public Node insertBefore(Node newChild, Node refChild)
            throws DOMException;
        public Node replaceChild(Node newChild, Node oldChild)
            throws DOMException;
        public Node removeChild(Node oldChild) throws DOMException;
        public Node appendChild(Node newChild) throws DOMException;

        // Utility methods
        public Node    cloneNode(boolean deep);
        public void    normalize();
        public boolean isSupported(String feature, String version);

    }
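A typical use of these methods is a recursive walk that switches on getNodeType() and descends through getFirstChild() and getNextSibling(). The sketch below produces an indented outline of a document; the class name NodeOutlineDemo, the outline format, and the sample XML are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the DOM API.
class NodeOutlineDemo {

    // Parses the XML and returns one indented line per node.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            outline(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Describes this node, then recurses over its children in document order.
    static void outline(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                out.append("element ").append(node.getNodeName());
                break;
            case Node.TEXT_NODE:
                out.append("text \"").append(node.getNodeValue()).append('"');
                break;
            case Node.COMMENT_NODE:
                out.append("comment ").append(node.getNodeValue());
                break;
            default:
                out.append(node.getNodeName());
        }
        out.append('\n');
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            outline(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline("<a><b>hi</b><!--c--></a>"));
    }
}
```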

NodeList

NodeList is the basic DOM list type. Its most common use is for lists of children of an Element or Document. The index of the first item in the list is 0, as with Java arrays.

The actual data structure used to implement the list can vary from implementation to implementation, but one constant is that the lists are live. In other words, if a node is deleted or moved from its parent, then it is also deleted from all lists that were built from the children of that parent. Similarly, if a new node is added to some node, then it is also added to all lists that point to the children of that node.

    package org.w3c.dom;

    public interface NodeList {

        public Node item(int index);
        public int  getLength();

    }
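Liveness matters most when a loop modifies the tree it is iterating over: removing children while counting upward shifts the remaining items down and skips some of them. One common workaround, sketched here with a hypothetical NodeListDemo class and made-up sample XML, is to iterate from the end of the list toward the beginning.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the DOM API.
class NodeListDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Shows that a NodeList obtained earlier tracks later changes.
    static String liveness() {
        Element list = parse("<list><a/><b/><c/></list>").getDocumentElement();
        NodeList children = list.getChildNodes();
        int before = children.getLength();
        list.removeChild(list.getFirstChild());
        int after = children.getLength();  // same NodeList object as before
        return before + "->" + after;
    }

    // Removes every child; walking backward keeps the indexes valid.
    static void removeAllChildren(Node parent) {
        NodeList children = parent.getChildNodes();
        for (int i = children.getLength() - 1; i >= 0; i--) {
            parent.removeChild(children.item(i));
        }
    }

    public static void main(String[] args) {
        System.out.println(liveness());
        Element list = parse("<list><a/><b/><c/></list>").getDocumentElement();
        removeAllChildren(list);
        System.out.println(list.hasChildNodes());
    }
}
```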

Notation

The Notation interface represents a notation declared in the document's DTD. It does not have a position in the tree. The complete list of notations in the document is accessible through the getNotations() method of the DocumentType interface. Both this list and the individual Notation objects are read-only.

The node properties of Notation are defined as follows:

Node name	Notation name
Node value	null
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface Notation extends Node {

        public String getPublicId();
        public String getSystemId();

    }

ProcessingInstruction

The ProcessingInstruction interface represents a processing instruction node. Its node properties are defined as follows:

Node name	The target
Node value	The data
Local name	null
Namespace URI	null
Namespace prefix	null

    package org.w3c.dom;

    public interface ProcessingInstruction extends Node {

        public String getTarget();
        public String getData();
        public void   setData(String data) throws DOMException;

    }

Text

The Text interface represents a text node. It can contain any characters that are legal in XML text, including characters such as the less-than sign and ampersand that may need to be escaped when the document is serialized. When a parser reads an XML document and builds a DOM tree, each Text object will contain the longest-possible contiguous run of text. However, DOM does not maintain this constraint as the document is manipulated in memory. Its node properties are defined as follows:

Node name	`#text`
Node value	The text of the node
Local name	null
Namespace URI	null
Namespace prefix	null

The Text interface declares only one method of its own, splitText(). Most of its functionality is inherited from the superinterfaces CharacterData and Node.

    package org.w3c.dom;

    public interface Text extends CharacterData {

        public Text splitText(int offset) throws DOMException;

    }
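splitText() is one way a tree ends up with adjacent Text nodes, and the normalize() method inherited from Node merges them again. A minimal sketch of the round trip follows; the class name TextDemo and the sample text are invented for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Hypothetical demo class; not part of the DOM API.
class TextDemo {

    // Splits one text node in two, then merges the pieces with normalize().
    static String splitAndNormalize() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element p = doc.createElement("p");
        Text head = doc.createTextNode("HelloWorld");
        p.appendChild(head);

        // head keeps "Hello"; the returned node holds "World"
        // and is inserted as head's next sibling.
        Text tail = head.splitText(5);
        int afterSplit = p.getChildNodes().getLength();
        String tailData = tail.getData();

        // normalize() merges adjacent Text nodes below p.
        p.normalize();
        int afterNormalize = p.getChildNodes().getLength();

        return afterSplit + ":" + tailData + ":" + afterNormalize
                + ":" + p.getFirstChild().getNodeValue();
    }

    public static void main(String[] args) {
        System.out.println(splitAndNormalize());
    }
}
```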

Exceptions and Errors

DOM2 defines only one exception class, DOMException. This is a runtime exception used for almost anything that can go wrong while constructing or manipulating a DOM Document. The details are provided by a short field, code, which is set to any of several named constants.

    package org.w3c.dom;

    public class DOMException extends RuntimeException {

        public short code;

        public static final short INDEX_SIZE_ERR;
        public static final short DOMSTRING_SIZE_ERR;
        public static final short HIERARCHY_REQUEST_ERR;
        public static final short WRONG_DOCUMENT_ERR;
        public static final short INVALID_CHARACTER_ERR;
        public static final short NO_DATA_ALLOWED_ERR;
        public static final short NO_MODIFICATION_ALLOWED_ERR;
        public static final short NOT_FOUND_ERR;
        public static final short NOT_SUPPORTED_ERR;
        public static final short INUSE_ATTRIBUTE_ERR;
        public static final short INVALID_STATE_ERR;
        public static final short SYNTAX_ERR;
        public static final short INVALID_MODIFICATION_ERR;
        public static final short NAMESPACE_ERR;
        public static final short INVALID_ACCESS_ERR;

        public DOMException(short code, String message);

    }
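Because there is only one exception class, code that wants to react differently to different failures has to inspect the code field. The sketch below provokes three common errors and returns the code of each; the class DomExceptionDemo and its helper codeOf() are hypothetical, and the value 0 is used here only as a local convention meaning "no exception was thrown."

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Hypothetical demo class; not part of the DOM API.
class DomExceptionDemo {

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs the action; returns the DOMException code, or 0 if none was thrown.
    static short codeOf(Runnable action) {
        try {
            action.run();
            return 0;
        } catch (DOMException e) {
            return e.code;
        }
    }

    // A document may have only one root element.
    static short secondRoot() {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("root"));
        return codeOf(() -> doc.appendChild(doc.createElement("another")));
    }

    // A node created by one document cannot be inserted into another
    // without first being copied with importNode().
    static short wrongDocument() {
        Document first = newDocument();
        Document second = newDocument();
        Element root = first.createElement("root");
        first.appendChild(root);
        Element foreign = second.createElement("x");
        return codeOf(() -> root.appendChild(foreign));
    }

    // Offsets beyond the end of the text are rejected.
    static short badOffset() {
        Text text = newDocument().createTextNode("abc");
        return codeOf(() -> text.substringData(10, 1));
    }

    public static void main(String[] args) {
        System.out.println(secondRoot() == DOMException.HIERARCHY_REQUEST_ERR);
        System.out.println(wrongDocument() == DOMException.WRONG_DOCUMENT_ERR);
        System.out.println(badOffset() == DOMException.INDEX_SIZE_ERR);
    }
}
```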

org.w3c.dom.traversal

The DOM traversal API in the org.w3c.dom.traversal package provides some convenience classes for navigating and searching an XML document. The most useful aspect of this package is the capability to get lists and trees that contain the kinds of nodes that you're interested in while ignoring everything else.

DocumentTraversal

DocumentTraversal is a factory interface for creating new NodeIterator and TreeWalker objects that present a filtered view of the content of an element or a document. (You can filter other kinds of nodes, too, but there's not much point to this if they don't have any children.)

In implementations that support the traversal API (which can be determined by invoking hasFeature("Traversal", "2.0") on the DOMImplementation, or isSupported("Traversal", "2.0") on the Document), all objects that implement Document also implement DocumentTraversal. That is, to obtain a DocumentTraversal object, just cast a Document to DocumentTraversal.

    package org.w3c.dom.traversal;

    public interface DocumentTraversal {

        public NodeIterator createNodeIterator(Node root,
            int whatToShow, NodeFilter filter, boolean expandEntities)
            throws DOMException;
        public TreeWalker createTreeWalker(Node root,
            int whatToShow, NodeFilter filter, boolean expandEntities)
            throws DOMException;

    }
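The feature check and the cast can be sketched as follows. The example lists the names of all elements in document order with a NodeIterator; passing null as the filter means that every node selected by whatToShow is accepted. The class name TraversalDemo and the sample XML are assumptions made for this illustration, and the parser is whatever JAXP supplies.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the DOM API.
class TraversalDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects the names of all elements, root included, in document order.
    static List<String> elementNames(Document doc) {
        if (!doc.getImplementation().hasFeature("Traversal", "2.0")) {
            throw new UnsupportedOperationException("No traversal support");
        }
        DocumentTraversal traversal = (DocumentTraversal) doc;
        NodeIterator iterator = traversal.createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
        List<String> names = new ArrayList<>();
        for (Node node = iterator.nextNode(); node != null;
                node = iterator.nextNode()) {
            names.add(node.getNodeName());
        }
        iterator.detach();  // tells the implementation the iterator is finished
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames(parse("<a><b/><c><d/></c></a>")));
    }
}
```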

NodeFilter

The NodeFilter interface is used by NodeIterators and TreeWalkers to determine which nodes are included in the view of the document that they present to the client. Each node in the subtree will be passed to the filter's acceptNode() method. This returns one of the three named constants:

`NodeFilter.FILTER_ACCEPT`	Include the node.
`NodeFilter.FILTER_REJECT`	Do not include the node or any of its descendants when tree-walking; do not include the node but do include its descendants when iterating.
`NodeFilter.FILTER_SKIP`	Do not include the node but do include its children if they pass the filter individually.

In addition, this interface has 13 named constants that can be combined with the bitwise operators and passed to createNodeIterator() and createTreeWalker() to specify which kinds of nodes should be included in their views.

    package org.w3c.dom.traversal;

    public interface NodeFilter {

        public static final short FILTER_ACCEPT;
        public static final short FILTER_REJECT;
        public static final short FILTER_SKIP;

        public static final int SHOW_ALL;
        public static final int SHOW_ELEMENT;
        public static final int SHOW_ATTRIBUTE;
        public static final int SHOW_TEXT;
        public static final int SHOW_CDATA_SECTION;
        public static final int SHOW_ENTITY_REFERENCE;
        public static final int SHOW_ENTITY;
        public static final int SHOW_PROCESSING_INSTRUCTION;
        public static final int SHOW_COMMENT;
        public static final int SHOW_DOCUMENT;
        public static final int SHOW_DOCUMENT_TYPE;
        public static final int SHOW_DOCUMENT_FRAGMENT;
        public static final int SHOW_NOTATION;

        public short acceptNode(Node node);

    }
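The two mechanisms work together: whatToShow selects node types, and the filter then makes a finer decision about each node of those types. In the sketch below, SHOW_TEXT and SHOW_COMMENT are combined with the bitwise OR operator, and a filter written as a lambda skips nodes that contain only white space. The class name NodeFilterDemo and the sample XML are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the DOM API.
class NodeFilterDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the trimmed content of every non-blank text and comment node.
    static List<String> textAndComments(Document doc) {
        // Only text and comment nodes reach the filter (see whatToShow),
        // so getNodeValue() is never null here.
        NodeFilter nonBlank = node -> node.getNodeValue().trim().isEmpty()
                ? NodeFilter.FILTER_SKIP
                : NodeFilter.FILTER_ACCEPT;
        int whatToShow = NodeFilter.SHOW_TEXT | NodeFilter.SHOW_COMMENT;

        NodeIterator iterator = ((DocumentTraversal) doc)
                .createNodeIterator(doc, whatToShow, nonBlank, true);
        List<String> values = new ArrayList<>();
        for (Node node = iterator.nextNode(); node != null;
                node = iterator.nextNode()) {
            values.add(node.getNodeValue().trim());
        }
        iterator.detach();
        return values;
    }

    public static void main(String[] args) {
        System.out.println(textAndComments(
                parse("<a> <b>hi</b> <!-- note --> </a>")));
    }
}
```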

NodeIterator

The NodeIterator interface presents a subset of nodes from the document as a list in document order. The list is live; that is, changes to the document are reflected in the list.

    package org.w3c.dom.traversal;

    public interface NodeIterator {

        public Node       nextNode() throws DOMException;
        public Node       previousNode() throws DOMException;
        public Node       getRoot();
        public int        getWhatToShow();
        public NodeFilter getFilter();
        public boolean    getExpandEntityReferences();
        public void       detach();

    }

TreeWalker

The TreeWalker interface presents a subset of nodes from the document as a tree. Walking the TreeWalker is much like walking a full Document or Element, except that many of the node's descendants in which you aren't interested can be filtered out so they don't get in your way. The tree is live; that is, changes to the document are reflected in the tree.

    package org.w3c.dom.traversal;

    public interface TreeWalker {

        public Node parentNode();
        public Node firstChild();
        public Node lastChild();
        public Node previousSibling();
        public Node nextSibling();
        public Node previousNode();
        public Node nextNode();

        public Node       getRoot();
        public int        getWhatToShow();
        public NodeFilter getFilter();
        public boolean    getExpandEntityReferences();
        public Node       getCurrentNode();
        public void       setCurrentNode(Node node)
            throws DOMException;

    }
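The filtered tree can have a different shape from the underlying document. In the sketch below, a rejected element disappears along with its descendants, while the children of a skipped element are promoted to take its place among the siblings. The class name TreeWalkerDemo and the element names hidden and wrap are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the DOM API.
class TreeWalkerDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Lists the children of the root element as seen through the filter.
    static List<String> visibleChildren(Document doc) {
        NodeFilter filter = node -> {
            switch (node.getNodeName()) {
                case "hidden": return NodeFilter.FILTER_REJECT; // prune subtree
                case "wrap":   return NodeFilter.FILTER_SKIP;   // keep children
                default:       return NodeFilter.FILTER_ACCEPT;
            }
        };
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, filter, true);

        List<String> names = new ArrayList<>();
        // The walker starts at the root; step to its first visible child,
        // then move across the visible siblings.
        for (Node node = walker.firstChild(); node != null;
                node = walker.nextSibling()) {
            names.add(node.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(visibleChildren(parse(
                "<a><b/><hidden><c/></hidden><wrap><e/></wrap><d/></a>")));
    }
}
```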