Parsing a language refers to the process of taking a piece of code or data written in that language and breaking it into component
A validating parser can use a DTD or schema to verify that a document is properly
A non-validating parser only requires that the document be well-formed. Because of the design of XML, it's possible to parse well-formed documents without referring to a DTD or XSD schema. Additional information for being well-
Non-validating parsers are simpler, and many of the free parsers available over the Web are non-validating. They are usually adequate for processing XML documents generated within the same organization or documents whose validity constraints are so complex that they can't be
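The difference between the two kinds of parser can be sketched with the JAXP classes that ship with the JDK. This is a minimal illustration, not the chapter's own code: the class name `ValidationDemo`, the helper `validityErrors`, and the sample document with its internal DTD are all assumptions made for the example. The document is well-formed but breaks its DTD, so only the validating parse should report problems.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationDemo {
    // Hypothetical sample: well-formed, but <price> is not allowed by the DTD.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE BookList [\n"
      + "  <!ELEMENT BookList (Book*)>\n"
      + "  <!ELEMENT Book (title)>\n"
      + "  <!ELEMENT title (#PCDATA)>\n"
      + "]>\n"
      + "<BookList><Book><price>11.95</price></Book></BookList>";

    // Parses the text and returns the validity errors the parser reported.
    static List<String> validityErrors(String xml, boolean validating) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating); // true = check against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Reached only when the text is not well-formed or cannot be read.
            throw new IllegalStateException("could not parse document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("non-validating: " + validityErrors(SAMPLE, false).size() + " error(s)");
        System.out.println("validating:     " + validityErrors(SAMPLE, true).size() + " error(s)");
    }
}
```

Validity errors are recoverable, so they arrive through `ErrorHandler.error( )`; a document that is not well-formed is a fatal error in either mode.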
A parser can make the
You'll hear two common terms regarding these methods: Simple API for XML (SAX) and the Document Object Model (DOM). SAX is a standard developed informally by
The DOM API defines a minimal set of language and platform-independent interfaces for accessing and manipulating the content and structure of information stored in XML documents. In this section we will cover DOM's major interfaces and
In tree-based parsing with the DOM the document is checked to see if it is well-
We are using Titles.xml as the example XML file during this discussion. This file, shown in Listing 3-1,
Listing 3-1 Titles.xml: A sample XML document.
    <?xml version="1.0" encoding="UTF-8"?>
    <BookList>
      <Book>
        <book_id>BU1111</book_id>
        <title>Cooking with Computers: Surreptitious Balance Sheets</title>
        <type>business</type>
        <pub_id>1389</pub_id>
        <price>11.95</price>
        <advance>5000</advance>
        <royalty>10</royalty>
        <ytd_sales>3876</ytd_sales>
        <notes>Helpful hints on how to use your electronic...</notes>
        <pubdate>1991-06-09T05:00:00</pubdate>
      </Book>
      <Book>
        <book_id>BU7832</book_id>
        <title>Straight Talk About Computers</title>
        <type>business</type>
        <pub_id>1389</pub_id>
        <price>19.99</price>
        <advance>5000</advance>
        <royalty>10</royalty>
        <ytd_sales>4095</ytd_sales>
        <notes>Annotated analysis of what computers can do for you</notes>
        <pubdate>1991-06-22T05:00:00</pubdate>
      </Book>
    </BookList>
Figure 3-1 shows how Titles.xml can be represented as a tree of nodes.
Figure 3-1 A DOM hierarchy representation of Titles.xml.
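The same hierarchy can be printed from code. The sketch below assumes the JAXP parser bundled with the JDK and uses an abbreviated, hypothetical copy of Titles.xml (two fields per book, no whitespace between tags); the class name `TreeOutline` and its helpers are illustrative only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TreeOutline {
    // Abbreviated stand-in for Titles.xml (assumption: only two fields per book).
    static final String TITLES =
        "<BookList>"
      + "<Book><book_id>BU1111</book_id><price>11.95</price></Book>"
      + "<Book><book_id>BU7832</book_id><price>19.99</price></Book>"
      + "</BookList>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Appends one line per node, indented two spaces per level of depth.
    static void outline(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append('"').append(node.getNodeValue()).append('"');
        } else {
            out.append(node.getNodeName());
        }
        out.append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            outline(children.item(i), depth + 1, out);
        }
    }

    static String outline(String xml) {
        StringBuilder out = new StringBuilder();
        outline(parse(xml), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline(TITLES));
    }
}
```

The first line printed is `#document`, the name the DOM gives the Document node that sits above the `BookList` root element.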
Everything is a node in the Document object tree. These nodes might have child nodes or hold information like their tag
The DOM exposes a document through a hierarchy of interfaces built around Node. A node is either a parent that contains other nodes or a leaf with nothing beneath it in the document structure. Node, Element, and NodeList are among the DOM interfaces you will use most often; note that NodeList is a container of nodes rather than a node type itself.
An XML Document object created after a DOM parser reads an XML file often contains a tree-like representation of Node object instances, while other interfaces are provided to create a more object-oriented environment. You can manipulate all the information in the DOM by using the Node interface. Even though the DOM Recommendation
Figure 3-2 DOM interfaces and inheritance relationships.
Because the Document object is a subclass of Node, the root Node object of the tree is also a Document object. Every DOM object must have a root. Figure 3-3 illustrates a sample XML Document object tree and describes some of the Node objects that it contains.
Figure 3-3 A Document object, where everything in the DOM is a Node.
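A short sketch of that relationship, assuming the JDK's bundled parser: the Document can be handled through the Node interface without a cast, and the single root element reports the Document as its parent. The class name `DocumentRoot` and the `describe` helper are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DocumentRoot {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    static String describe(String xml) {
        Document doc = parse(xml);
        Node top = doc;                           // Document extends Node
        Element root = doc.getDocumentElement();  // the one root element
        boolean parentIsDocument = root.getParentNode().isSameNode(doc);
        return top.getNodeName() + " -> " + root.getNodeName()
                + " (parent is document: " + parentIsDocument + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("<BookList><Book/></BookList>"));
    }
}
```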
You can find out if a Node has children by using the hasChildNodes( ) method. This method, which takes no parameters, returns a Boolean true if the node has children and false if not.
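As a small illustration, assuming the JDK's bundled parser, the hypothetical helper `rootHasChildren` below parses a fragment and asks its root element whether it has children. Text counts as a child node, so an element that holds only character data still answers true.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class HasChildrenDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // True when the root element of the parsed text has at least one child node.
    static boolean rootHasChildren(String xml) {
        return parse(xml).getDocumentElement().hasChildNodes();
    }

    public static void main(String[] args) {
        System.out.println(rootHasChildren("<Book><price>19.99</price></Book>")); // element child
        System.out.println(rootHasChildren("<price>19.99</price>"));              // text child
        System.out.println(rootHasChildren("<Book/>"));                           // no children
    }
}
```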
The getNodeType( ) method, which is part of the Java bindings defined by DOM, is another important method of Node. It returns the type of a particular Node. The type is a constant integer used to identify different types of Nodes. For example, the Node.ELEMENT_NODE type identifies a Node as an element. Table 3-1 contains a list of the other
Table 3-1 Other Methods of the Node Object
| Method | Description |
|---|---|
| appendChild( ) | Adds a new child object, which is passed to the method, to the end of the current Node's list of children. |
| cloneNode( ) | Returns a duplicate of the Node. |
| hasAttributes( ) | Returns a Boolean true if the Node has any attributes. This method was added in DOM Level 2. |
| insertBefore( ) | Takes a new child Node and a reference child Node and |
| isSupported( ) | Tests whether or not this implementation of the DOM supports a specific feature. This method was added in DOM Level 2 and takes a feature name and a version number as parameters. |
| normalize( ) | Puts all text nodes in the full depth of the sub-tree underneath this Node into a normal form, in which no text nodes are adjacent or empty. |
| removeChild( ) | Removes the specified child. |
| replaceChild( ) | Replaces the specified child with the new child passed. |
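Several of the methods in Table 3-1, together with getNodeType( ), can be exercised in a few lines. The following is a sketch under stated assumptions: it uses the JDK's bundled parser, a one-book fragment written for the example, and a hypothetical class `NodeMethodsDemo`; the comments track the child list after each call.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeMethodsDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Joins the names of a node's direct children with commas.
    static String childNames(Node parent) {
        StringBuilder names = new StringBuilder();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(c.getNodeName());
        }
        return names.toString();
    }

    static String edit() {
        Document doc = parse(
            "<Book><title>Straight Talk About Computers</title><price>19.99</price></Book>");
        Node book = doc.getDocumentElement();
        Node title = book.getFirstChild();
        Node price = book.getLastChild();
        if (title.getNodeType() != Node.ELEMENT_NODE) {
            throw new IllegalStateException("expected an element");
        }

        Node notes = doc.createElement("notes");   // new nodes come from the Document
        book.appendChild(notes);                    // title,price,notes

        Node type = doc.createElement("type");
        book.insertBefore(type, price);             // title,type,price,notes

        Node advance = doc.createElement("advance");
        book.replaceChild(advance, notes);          // title,type,price,advance

        book.removeChild(type);                     // title,price,advance

        Node copy = title.cloneNode(true);          // deep copy, detached until inserted
        book.appendChild(copy);                     // title,price,advance,title
        return childNames(book);
    }

    public static void main(String[] args) {
        System.out.println(edit());
    }
}
```

Note the argument order: both insertBefore( ) and replaceChild( ) take the new node first and the existing child second.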
The Element interface, which is a subclass of Node, is another important interface. It can be used to access the elements in a DOM Document object tree, which allows you to read in attributes and their values, as well as change, delete, or add to them. Table 3-2 contains the list of methods of the Element object.
Table 3-2 Methods of the Element Object
| Method | Description |
|---|---|
| getAttribute( ) | Retrieves the value of the specified attribute by name. |
| getAttributeNS( ) | Retrieves the value of the specified attribute by local name and namespace. This method was added in Level 2. |
| getAttributeNode( ) | Retrieves an Attr node by name. |
| getAttributeNodeNS( ) | Retrieves an Attr node by local name and namespace. This method was added in Level 2. |
| getElementsByTagName( ) | Returns a NodeList of all descendant elements of a given tag name in the order in which they are encountered. |
| getElementsByTagNameNS( ) | Returns a NodeList of all descendant elements of a given tag by local name and namespace in the order in which they are encountered. This method was added in Level 2. |
| hasAttribute( ) | Returns a Boolean true if the specified attribute is present. Returns Boolean false otherwise. |
| hasAttributeNS( ) | Returns a Boolean true if the specified attribute, by local name and namespace, is present. Returns Boolean false otherwise. This method was added in Level 2. |
| removeAttribute( ) | Removes the specified attribute. |
| removeAttributeNS( ) | Removes the attribute specified by local name and namespace. This method was added in Level 2. |
| removeAttributeNode( ) | Removes the specified Attr node. |
| setAttribute( ) | Adds a new attribute. If an attribute of the same name exists, its value is changed to the specified value. |
| setAttributeNS( ) | Adds a new attribute. If an attribute of the same local name and namespace exists, its value is changed to the specified value. This method was added in Level 2. |
| setAttributeNode( ) | Adds a new Attr node. If an Attr node of the same name exists, it is replaced by the new one. |
| setAttributeNodeNS( ) | Adds a new Attr node. If an Attr node of the same local name and namespace exists, it is replaced by the new one. This method was added in Level 2. |
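The attribute methods in Table 3-2 can be tried on a single element. Titles.xml itself contains no attributes, so the `currency` attribute below is a hypothetical addition made only for this sketch, as are the class name `ElementAttributesDemo` and its `run` helper; the parser is the one bundled with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ElementAttributesDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Records the result of each call, separated by semicolons.
    static String run() {
        Element book = parse("<Book><price>11.95</price></Book>").getDocumentElement();
        Element price = (Element) book.getElementsByTagName("price").item(0);
        StringBuilder log = new StringBuilder();

        log.append(price.hasAttribute("currency")).append(';');   // not set yet

        price.setAttribute("currency", "USD");                     // adds the attribute
        price.setAttribute("currency", "EUR");                     // same name: value changes
        log.append(price.getAttribute("currency")).append(';');

        Attr attr = price.getAttributeNode("currency");            // the Attr node itself
        log.append(attr.getName()).append('=').append(attr.getValue()).append(';');

        price.removeAttribute("currency");
        log.append(price.hasAttribute("currency")).append(';');

        // getAttribute( ) returns an empty string for a missing attribute.
        log.append(price.getAttribute("currency").isEmpty());
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because getAttribute( ) returns an empty string for an absent attribute, hasAttribute( ) is the reliable way to tell "missing" from "present but empty".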
Some methods of the Node interface allow traversal of a Node tree. The getChildNodes( ) method is useful for gathering everything directly inside a Node. It returns all the child Nodes, whether they are elements, text, comments, or other node types, in a NodeList, which is an ordered collection of Node objects. If there are no children, the NodeList is empty. Figure 3-4 illustrates a NodeList.
Unlike Node and Element, NodeList is very small: it defines a single method, item( ), and a length attribute, which the Java bindings expose as getLength( ). The item( ) method returns the Node located at the indexed position passed to the method. For instance, if you want to retrieve the first Node, you call the method using item(0).
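A typical traversal loop over a NodeList might look like the sketch below. It assumes the JDK's bundled parser and a small fragment with whitespace between tags, which the parser keeps as text nodes; the class name `NodeListDemo` and its helpers are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeListDemo {
    // Indented on purpose: the line breaks and spaces become text nodes.
    static final String BOOK =
        "<Book>\n"
      + "  <book_id>BU1111</book_id>\n"
      + "  <price>11.95</price>\n"
      + "</Book>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Number of direct children of the root element, of any node type.
    static int childCount(String xml) {
        return parse(xml).getDocumentElement().getChildNodes().getLength();
    }

    // Names of the root's element children only, joined with commas.
    static String childElementNames(String xml) {
        NodeList children = parse(xml).getDocumentElement().getChildNodes();
        StringBuilder names = new StringBuilder();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (names.length() > 0) {
                    names.append(',');
                }
                names.append(child.getNodeName());
            }
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(childCount(BOOK) + " child nodes");
        System.out.println(childElementNames(BOOK));
    }
}
```

The count is larger than the number of elements because the whitespace around each element is kept as a text node, which is why the loop filters on getNodeType( ).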
Node, Element, and NodeList are not the only interfaces specified by the DOM. Because we do not cover all of them, Table 3-3 lists each one along with the node types it may contain as children and a brief description. This table contains only those interfaces found in the DOM Level 1 and 2 Core and does not contain the HTML bindings.
Table 3-3 DOM Interfaces
| Interface | Children | Description |
|---|---|---|
| Attr | Text, EntityReference | Represents an attribute of an Element object. |
| CDATASection | Contains no children | Used to escape |
| Comment | Contains no children | Stores the content of an XML or HTML comment. |
| Document | Element (max. of one), ProcessingInstruction, Comment, DocumentType (max. of one) | Represents the entire XML or HTML document. |
| DocumentFragment | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference | A minimal Document object instance. |
| DocumentType | Contains no children | Represents the document type stored in the doctype attribute of the Document object instance. |
| Element | Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference | Represents an element in an XML or HTML document. |
| Entity | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference | Represents an entity, whether it's parsed or unparsed. |
| EntityReference | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference | Can be inserted into a structured model when an entity reference is in the source document or when you want to insert a reference. |
| Notation | Contains no children | Represents a notation declared in a DTD. |
| ProcessingInstruction | Contains no children | Used to represent a "processing instruction." |
| Text | Contains no children | Contains textual content (character data). |
Figure 3-4 A DOM NodeList Object.
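Several of the interfaces in Table 3-3 can be seen in one small document. The sketch below assumes the JDK's bundled parser with its default settings, under which comments and CDATA sections are kept as separate nodes; the sample text and the class name `NodeKindsDemo` are hypothetical. It walks the tree from the root element and names the interface each node corresponds to.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeKindsDemo {
    // Hypothetical sample mixing a comment, a processing instruction,
    // a CDATA section, and plain text.
    static final String SAMPLE =
        "<Book><!-- draft --><?render bold?>"
      + "<title><![CDATA[Q & A]]></title>plain</Book>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Maps a node type constant to the name of the matching DOM interface.
    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:                return "Element";
            case Node.TEXT_NODE:                   return "Text";
            case Node.CDATA_SECTION_NODE:          return "CDATASection";
            case Node.COMMENT_NODE:                return "Comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "ProcessingInstruction";
            default:                               return "other";
        }
    }

    // Preorder walk: the node itself first, then each child in document order.
    static void collect(Node node, StringBuilder out) {
        if (out.length() > 0) {
            out.append(',');
        }
        out.append(kind(node));
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    static String kinds(String xml) {
        StringBuilder out = new StringBuilder();
        collect(parse(xml).getDocumentElement(), out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(kinds(SAMPLE));
    }
}
```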