What Does an XML Parser Do?

Parsing a language refers to the process of taking a piece of code or data written in that language and breaking it into component parts as defined by the rules of that language. XML parsers are classified along two independent dimensions: validating vs. non-validating and stream-based vs. tree-based.

Validating and Non-Validating Parsers

A validating parser can use a DTD or schema to verify that a document is properly constructed according to the rules for the XML application it's an instance of, and it is supposed to complain loudly if the rules aren't followed. A DTD can also specify default values for the attributes of various elements, and a validating parser can fill them in when it encounters elements that omit those attributes. This capability can be important when you're processing XML documents you've received from the outside world. For example, if vendors send XML-marked invoices to your company, you'll want to ensure that they contain the right elements in the right order.

A non-validating parser requires only that the document be well-formed. Because of the design of XML, it's possible to parse well-formed documents without referring to a DTD or XSD schema. Additional information on well-formedness, which was discussed in Chapter 2, can be found in the XML 1.0 Recommendation (http://www.w3.org/TR/2000/REC-xml-20001006).

Non-validating parsers are simpler, and many of the free parsers available over the Web are non-validating. They are usually adequate for processing XML documents generated within the same organization or documents whose validity constraints are so complex that they can't be expressed by a DTD and need to be verified by application logic instead.
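The difference between the two kinds of parser can be sketched with the JAXP classes that ship with the JDK. This is a minimal illustration, not the only way to configure validation: the invoice DTD, the element names, and the currency attribute are hypothetical, and the DTD is placed in the internal subset only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationSketch {
    // Hypothetical invoice DTD, declared inline so the example needs no external files.
    static final String DOCTYPE =
        "<!DOCTYPE invoice [\n"
        + "  <!ELEMENT invoice (number, total)>\n"
        + "  <!ELEMENT number (#PCDATA)>\n"
        + "  <!ELEMENT total (#PCDATA)>\n"
        + "  <!ATTLIST invoice currency CDATA \"USD\">\n"
        + "]>\n";

    static final String VALID =
        DOCTYPE + "<invoice><number>17</number><total>10.00</total></invoice>";
    // Well-formed, but the required <number> element is missing.
    static final String INVALID =
        DOCTYPE + "<invoice><total>10.00</total></invoice>";

    static Document parse(String xml, boolean validating) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating); // DTD validation on or off
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Turn every reported error into an exception so the caller notices it.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // True if the document parses without a reported error.
    static boolean accepts(String xml, boolean validating) {
        try {
            parse(xml, validating);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Reads the attribute default that the DTD supplies for <invoice>.
    static String defaultCurrency() {
        try {
            return parse(VALID, true).getDocumentElement().getAttribute("currency");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("non-validating accepts invalid doc: " + accepts(INVALID, false));
        System.out.println("validating accepts invalid doc:     " + accepts(INVALID, true));
        System.out.println("validating accepts valid doc:       " + accepts(VALID, true));
        System.out.println("default currency: " + defaultCurrency());
    }
}
```

The error handler is the important design choice here: without one, a validating parser may only log validity errors and carry on, which defeats the purpose of checking documents received from outside.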

Stream-Based and Tree-Based Parsers

A parser can make the components of an XML document known to an application in two ways. It can read through the document and signal the application every time a new component appears, or it can read the entire document and give the application a tree structure corresponding to the element structure of the document. Parsers that use the first method are called stream-based or event-driven parsers. Parsers that use the second method are tree-based parsers. Both methods will be discussed in greater detail later.

You'll hear two common terms regarding these methods: the Simple API for XML (SAX) and the Document Object Model (DOM). SAX is a standard developed informally by members of the xml-dev mailing list for how a stream-based parser should "talk" to an application (see http://www.megginson.com/SAX/index.html). The DOM is a formal Recommendation of the W3C on how an application can access and manipulate the tree structure of a document (see http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001 for the first edition, which is referred to as Level 1).
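To make the stream-based style concrete, here is a small sketch that uses the SAX support bundled with the JDK. The parser calls the handler each time an element starts, character data arrives, or an element ends; the handler collects the text of every title element. The class name and the abbreviated book list are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTitlesSketch {
    // Abbreviated, hypothetical input in the shape of a book list.
    static final String XML =
        "<BookList>"
        + "<Book><title>Cooking with Computers: Surreptitious Balance Sheets</title></Book>"
        + "<Book><title>Straight Talk About Computers</title></Book>"
        + "</BookList>";

    static List<String> titles(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a <title> element

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one run of text, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    found.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        for (String title : titles(XML)) {
            System.out.println(title);
        }
    }
}
```

Nothing is kept in memory except what the handler chooses to store, which is why this style suits large documents; the cost is that the application has to track its own position in the document.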

Tree-Based Parsing with the DOM

The DOM API defines a minimal set of language- and platform-independent interfaces for accessing and manipulating the content and structure of information stored in XML documents. In this section we will cover the DOM's major interfaces and briefly touch on the minor ones.

In tree-based parsing with the DOM, the document is checked to see that it is well-formed and, if the parser is a validating one, that it is valid. The parser then converts the document's information into a tree of nodes. The entire document, no matter how simple or complex, is converted into a tree that starts from one root node, which, in DOM terms, is called a document object instance (hence Document Object Model). Once a document object tree is created, you can modify, delete, and create leaves and branches by using the interfaces in the API.

We are using Titles.xml as the example XML file during this discussion. This file, shown in Listing 3-1, presents a collection of books based on the sample pubs database that comes with Microsoft SQL Server.

Listing 3-1 Titles.xml: A sample XML document.

<?xml version="1.0" encoding="UTF-8"?>
<BookList>
  <Book>
    <book_id>BU1111</book_id>
    <title>Cooking with Computers: Surreptitious Balance Sheets</title>
    <type>business</type>
    <pub_id>1389</pub_id>
    <price>11.95</price>
    <advance>5000</advance>
    <royalty>10</royalty>
    <ytd_sales>3876</ytd_sales>
    <notes>Helpful hints on how to use your electronic...</notes>
    <pubdate>1991-06-09T05:00:00</pubdate>
  </Book>
  <Book>
    <book_id>BU7832</book_id>
    <title>Straight Talk About Computers</title>
    <type>business</type>
    <pub_id>1389</pub_id>
    <price>19.99</price>
    <advance>5000</advance>
    <royalty>10</royalty>
    <ytd_sales>4095</ytd_sales>
    <notes>Annotated analysis of what computers can do for you</notes>
    <pubdate>1991-06-22T05:00:00</pubdate>
  </Book>
</BookList>

Figure 3-1 is a visual representation of how Titles.xml can be represented as a tree of nodes.

Figure 3-1 A DOM hierarchy representation of Titles.xml.

Everything is a node in the Document object tree. These nodes might have child nodes or hold information such as their tag name ( nodeName ) and value ( nodeValue ). This hierarchical organization of information is similar to a file system, where folders might contain files or other folders, and everything descends from one root folder.
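The node tree can be inspected with a few lines of Java. The sketch below parses a shortened, hypothetical excerpt of Titles.xml (written without whitespace between tags) and prints each node's nodeName and, where it is not null, its nodeValue. The class and method names are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTreeSketch {
    // Shortened excerpt in the shape of Titles.xml; an assumption for illustration.
    static final String XML =
        "<BookList><Book><book_id>BU1111</book_id><price>11.95</price></Book></BookList>";

    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Appends one indented line per node, depth first, starting at the given node.
    static void outline(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName());
        if (node.getNodeValue() != null) { // null for Document and Element nodes
            out.append(" = ").append(node.getNodeValue());
        }
        out.append('\n');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            outline(child, depth + 1, out);
        }
    }

    static String outline() {
        StringBuilder out = new StringBuilder();
        outline(parse(XML), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline());
    }
}
```

The output starts at the #document node, followed by BookList, Book, and the leaf elements, each of which holds its character data in a child #text node rather than in its own nodeValue.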

Important Interfaces in the DOM

The DOM exposes a document through a hierarchy of Node objects. Some kinds of node can have child nodes; others are leaf nodes that cannot contain anything beneath them in the document structure. Three of the interfaces you will use most often are Node , Element , and NodeList . The first two describe nodes in the tree, while NodeList describes an ordered collection of nodes.

Node

The Document object that a DOM parser creates after reading an XML file usually contains a tree-like representation of Node object instances, while other interfaces are provided to create a more object-oriented environment. You can manipulate all the information in the DOM by using the Node interface. Even though the DOM Recommendation specifically states that an implementation isn't necessarily a tree, for the purposes of the discussions and examples in this chapter we will focus on the tree-like representation. Figure 3-2 shows the inheritance relationships between some of the important interfaces.

Figure 3-2 DOM interfaces and inheritance relationships.

Because the Document interface inherits from Node , the root Node object of the tree is also a Document object. Every DOM tree must have a single root. Figure 3-3 illustrates a sample XML Document object tree and describes some of the Node objects that it contains.

Figure 3-3 A Document object, where everything in the DOM is a Node.

You can find out if a Node has children by using the hasChildNodes( ) method. This method, which takes no parameters, returns a Boolean true if the node has children and false if not.

The getNodeType( ) method, which is part of the Java bindings defined by the DOM, is another important method of Node . It returns the type of a particular Node as an integer constant that identifies the kind of node. For example, the Node.ELEMENT_NODE type identifies a Node as an element. Table 3-1 lists other methods available on the Node object.

Table 3-1 Other Methods of the Node Object

| Method | Description |
| --- | --- |
| appendChild( ) | Adds a new child object, which is passed to the method, to the end of the current Node 's list of children. |
| cloneNode( ) | Returns a duplicate of the Node . A Boolean argument selects whether the subtree beneath it is copied as well. |
| hasAttributes( ) | Returns a Boolean true if the Node has any attributes. This method was added in DOM Level 2. |
| insertBefore( ) | Takes a new child Node and a reference child Node and inserts the new child Node before the reference Node . |
| isSupported( ) | Tests whether or not this implementation of the DOM supports a specific feature. This method was added in DOM Level 2 and takes a feature name and a version number as parameters. |
| normalize( ) | Puts all text nodes in the full depth of the sub-tree underneath this Node into a normal form, in which there are no adjacent text nodes and no empty text nodes. |
| removeChild( ) | Removes the specified child. |
| replaceChild( ) | Replaces the specified child with the new child passed. |
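Several of the Node methods above can be seen working together in a short sketch. It uses the JDK's built-in DOM implementation and a hypothetical one-book document; the class name and the summary string it returns are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeMethodsSketch {
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns "textNodesBefore,textNodesAfter,bookCount,titleText".
    static String run() {
        Document doc = parse("<BookList><Book><title>Straight Talk</title></Book></BookList>");
        Node list = doc.getDocumentElement();
        Node book = list.getFirstChild();

        // getNodeType( ) yields an integer constant; hasChildNodes( ) takes no arguments.
        if (book.getNodeType() != Node.ELEMENT_NODE || !book.hasChildNodes()) {
            throw new IllegalStateException("expected an element with children");
        }

        Node copy = book.cloneNode(true); // true requests a deep copy of the subtree
        list.appendChild(copy);           // the copy becomes the last child of <BookList>

        // Appending text next to existing text leaves two adjacent text nodes.
        Node title = copy.getFirstChild();
        title.appendChild(doc.createTextNode(" About Computers"));
        int before = title.getChildNodes().getLength();

        copy.normalize();                 // merges the adjacent text nodes into one
        int after = title.getChildNodes().getLength();

        list.removeChild(book);           // drop the original <Book>
        int books = list.getChildNodes().getLength();

        return before + "," + after + "," + books + "," + title.getFirstChild().getNodeValue();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The call to normalize( ) matters whenever code builds text piece by piece: without it, later code that reads only the first text child would see just part of the content.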

Element

The Element interface, which inherits from Node , is another important interface. It can be used to access the elements in a DOM Document object tree, which allows you to read attributes and their values, as well as change, delete, or add to them. Table 3-2 contains the list of methods of the Element object.

Table 3-2 Methods of the Element Object

| Method | Description |
| --- | --- |
| getAttribute( ) | Retrieves the value of the specified attribute by name. |
| getAttributeNS( ) | Retrieves the value of the specified attribute by local name and namespace. This method was added in Level 2. |
| getAttributeNode( ) | Retrieves an Attr node by name. |
| getAttributeNodeNS( ) | Retrieves an Attr node by local name and namespace. This method was added in Level 2. |
| getElementsByTagName( ) | Returns a NodeList of all descendant elements with a given tag name, in the order in which they are encountered in the document. |
| getElementsByTagNameNS( ) | Returns a NodeList of all descendant elements with a given local name and namespace, in the order in which they are encountered in the document. This method was added in Level 2. |
| hasAttribute( ) | Returns a Boolean true if the specified attribute is present. Returns Boolean false otherwise. |
| hasAttributeNS( ) | Returns a Boolean true if the specified attribute, by local name and namespace, is present. Returns Boolean false otherwise. This method was added in Level 2. |
| removeAttribute( ) | Removes the specified attribute. |
| removeAttributeNS( ) | Removes the attribute specified by local name and namespace. This method was added in Level 2. |
| removeAttributeNode( ) | Removes the specified Attr node. |
| setAttribute( ) | Adds a new attribute. If an attribute of the same name exists, its value is changed to the specified value. |
| setAttributeNS( ) | Adds a new attribute. If an attribute of the same local name and namespace exists, its value is changed to the specified value. This method was added in Level 2. |
| setAttributeNode( ) | Adds a new Attr node. If an Attr node of the same name exists, it is replaced by the new one. |
| setAttributeNodeNS( ) | Adds a new Attr node. If an Attr node of the same local name and namespace exists, it is replaced by the new one. This method was added in Level 2. |
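The attribute methods can be tried out with a short sketch. Titles.xml carries no attributes, so the currency attribute below is a hypothetical addition made purely for illustration, as are the class name and the abbreviated document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ElementMethodsSketch {
    // Abbreviated, hypothetical book list with two books.
    static final String XML =
        "<BookList>"
        + "<Book><title>A</title><price>11.95</price></Book>"
        + "<Book><title>B</title><price>19.99</price></Book>"
        + "</BookList>";

    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String run() {
        Element root = parse(XML).getDocumentElement();
        StringBuilder log = new StringBuilder();

        // <title> elements are grandchildren of <BookList>, yet both are found:
        // the search covers all descendants, not only direct children.
        log.append("titles=").append(root.getElementsByTagName("title").getLength());

        Element price = (Element) root.getElementsByTagName("price").item(0);
        log.append(" has=").append(price.hasAttribute("currency"));

        price.setAttribute("currency", "USD"); // adds the attribute
        price.setAttribute("currency", "EUR"); // same name: the value is overwritten
        log.append(" value=").append(price.getAttribute("currency"));

        // The same attribute viewed as an Attr node.
        Attr attr = price.getAttributeNode("currency");
        log.append(" node=").append(attr.getName()).append('=').append(attr.getValue());

        price.removeAttribute("currency");
        log.append(" after=").append(price.hasAttribute("currency"));
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Calling hasAttribute( ) before reading is worthwhile because getAttribute( ) returns a string in either case, so a missing attribute and an empty one look the same to code that checks only the returned value.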

NodeList

Some methods of the Node interface allow traversal of a Node tree. The getChildNodes( ) method is useful for gathering everything directly inside a Node . This method returns all child nodes, if they exist, in a NodeList , which is an ordered collection of Node objects that you access by index. Figure 3-4 illustrates a NodeList .

Unlike Node and Element , NodeList is very small: it has a single method, item( ) , and a length attribute, which the Java binding exposes as getLength( ) . The item( ) method returns the Node located at the indexed position passed to the method. Indexes start at 0, so if you want to retrieve the first Node , you call the method using item(0) .
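A typical loop over a NodeList combines getLength( ) with item( ). The sketch below also shows a detail that often surprises newcomers: when a document is indented, the whitespace between tags is reported as text nodes, so the child list is longer than the number of child elements. The class name and the two-book document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListSketch {
    // Indented on purpose: the line breaks and spaces become text nodes.
    static final String XML = "<BookList>\n  <Book/>\n  <Book/>\n</BookList>";

    static NodeList children() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Number of entries in the list, whitespace text nodes included.
    static int childCount() {
        return children().getLength();
    }

    // Number of entries that are elements.
    static int elementChildCount() {
        NodeList list = children();
        int count = 0;
        for (int i = 0; i < list.getLength(); i++) {
            if (list.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("child nodes:    " + childCount());
        System.out.println("child elements: " + elementChildCount());
    }
}
```

Filtering on getNodeType( ) inside the loop is a simple way to skip the whitespace text nodes when only elements are of interest.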

Other DOM Interfaces

Node , Element , and NodeList are not the only interfaces specified by the DOM. Because we do not cover all of them, Table 3-3 lists them along with the kinds of child nodes each can have and a brief description. This table contains only those interfaces found in the DOM Level 1 and 2 Core and does not contain the HTML bindings.

Table 3-3 DOM Interfaces

| Interface | Children | Description |
| --- | --- | --- |
| Attr | Text, EntityReference | Represents an attribute of an Element object. |
| CDATASection | None | Used to escape characters of text that would otherwise be considered markup. |
| Comment | None | Stores the content of an XML or HTML comment. |
| Document | Element (max. of one), ProcessingInstruction, Comment, DocumentType (max. of one) | Represents the entire XML or HTML document. |
| DocumentFragment | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference | A minimal Document object instance. |
| DocumentType | None | Represents the document type stored in the doctype attribute of the Document object instance. |
| Element | Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference | Represents an element in an XML or HTML document. |
| Entity | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference | Represents an entity, whether it's parsed or unparsed. |
| EntityReference | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference | Can be inserted into the structure model when an entity reference is in the source document or when you want to insert a reference. |
| Notation | None | Represents a notation declared in a DTD. |
| ProcessingInstruction | None | Represents a processing instruction. |
| Text | None | Contains textual content (character data). |

Figure 3-4 A DOM NodeList Object.