Creating a Parser | Real World XML (2nd Edition)

This first Java example will get us started by parsing an XML document and displaying how many times a certain element appears in it. In this chapter, I'm taking a look at using the XML DOM with Java, and I'll use the Java DocumentBuilder class, which creates a W3C DOM tree as its output. Here's the document we'll parse:

Listing ch11_01.xml

    <?xml version = "1.0" standalone="yes"?>
    <DOCUMENT>
        <CUSTOMER>
            <NAME>
                <LAST_NAME>Smith</LAST_NAME>
                <FIRST_NAME>Sam</FIRST_NAME>
            </NAME>
            <DATE>October 15, 2003</DATE>
            <ORDERS>
                <ITEM>
                    <PRODUCT>Tomatoes</PRODUCT>
                    <NUMBER>8</NUMBER>
                    <PRICE>.25</PRICE>
                </ITEM>
                <ITEM>
                    <PRODUCT>Oranges</PRODUCT>
                    <NUMBER>24</NUMBER>
                    <PRICE>.98</PRICE>
                </ITEM>
            </ORDERS>
        </CUSTOMER>
        <CUSTOMER>
            <NAME>
                <LAST_NAME>Jones</LAST_NAME>
                <FIRST_NAME>Polly</FIRST_NAME>
            </NAME>
            <DATE>October 20, 2003</DATE>
            <ORDERS>
                <ITEM>
                    <PRODUCT>Bread</PRODUCT>
                    <NUMBER>12</NUMBER>
                    <PRICE>.95</PRICE>
                </ITEM>
                <ITEM>
                    <PRODUCT>Apples</PRODUCT>
                    <NUMBER>6</NUMBER>
                    <PRICE>.50</PRICE>
                </ITEM>
            </ORDERS>
        </CUSTOMER>
        <CUSTOMER>
            <NAME>
                <LAST_NAME>Weber</LAST_NAME>
                <FIRST_NAME>Bill</FIRST_NAME>
            </NAME>
            <DATE>October 25, 2003</DATE>
            <ORDERS>
                <ITEM>
                    <PRODUCT>Asparagus</PRODUCT>
                    <NUMBER>12</NUMBER>
                    <PRICE>.95</PRICE>
                </ITEM>
                <ITEM>
                    <PRODUCT>Lettuce</PRODUCT>
                    <NUMBER>6</NUMBER>
                    <PRICE>.50</PRICE>
                </ITEM>
            </ORDERS>
        </CUSTOMER>
    </DOCUMENT>

In this first example, the code will scan ch11_01.xml and report how many <CUSTOMER> elements the document has.

To start this program, I'll import the Java classes we'll need (which support the W3C DOM interfaces, such as Node and Element) and the XML parser classes we'll use:

    import javax.xml.parsers.*;
    import org.w3c.dom.*;
        .
        .
        .

I'll call this first program ch11_02.java, so the public class in that file is ch11_02:

    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class ch11_02
    {
        public static void main(String[] args)
        {
            .
            .
            .
        }
    }

To parse the XML document, you need a DocumentBuilderFactory object, which you use to create an object of the DocumentBuilder class (it's called a document builder factory because you can use it to create parsers using Java classes from different parser vendors, not just the default Java XML parser that we'll use here):

    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class ch11_02
    {
        public static void main(String[] args)
        {
            try {
                DocumentBuilderFactory dbf =
                    DocumentBuilderFactory.newInstance();
                DocumentBuilder db = null;
                try {
                    db = dbf.newDocumentBuilder();
                }
                catch (ParserConfigurationException pce) {}
                .
                .
                .
            } catch (Exception e) {
                e.printStackTrace(System.err);
            }
        }
    }

You can find the constructor and methods of the DocumentBuilderFactory class in Table 11-1 and those of the DocumentBuilder class in Table 11-2.

Table 11-1. Methods of the javax.xml.parsers.DocumentBuilderFactory Class

Method	Does This
`protected DocumentBuilderFactory()`	The default constructor
`abstract Object getAttribute(String name)`	Returns specific attribute values
`boolean isCoalescing()`	Is true if the factory is configured to produce parsers that convert `CDATA` nodes to text nodes
`boolean isExpandEntityReferences()`	Is true if the factory is configured to produce parsers that expand XML entity reference nodes
`boolean isIgnoringComments()`	Is true if the factory will produce parsers that ignore comments
`boolean isIgnoringElementContentWhitespace()`	Is true if the factory will produce parsers that ignore ignorable whitespace (such as that used to indent elements) in element content
`boolean isNamespaceAware()`	Is true if the factory will produce parsers that can use XML namespaces
`boolean isValidating()`	Is true if the factory will produce parsers that validate the XML content during parsing operations
`abstract DocumentBuilder newDocumentBuilder()`	Creates a new `DocumentBuilder` object
`static DocumentBuilderFactory newInstance()`	Returns a new `DocumentBuilderFactory` object
`abstract void setAttribute(String name, Object value)`	Sets specific attributes
`void setCoalescing(boolean coalescing)`	Requires the parser produced to convert `CDATA` nodes to text nodes
`void setExpandEntityReferences(boolean expandEntityRef)`	Requires the parser produced to expand XML entity reference nodes
`void setIgnoringComments(boolean ignoreComments)`	Requires the parser produced to ignore comments
`void setIgnoringElementContentWhitespace(boolean whitespace)`	Requires the parsers created to eliminate ignorable whitespace
`void setNamespaceAware(boolean awareness)`	Requires the parser produced to provide support for XML namespaces
`void setValidating(boolean validating)`	Requires the parser produced to validate documents as they are parsed
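
To make the setter and getter pairs in Table 11-1 concrete, here is a minimal sketch of configuring a factory before asking it for a parser. The class name FactoryConfigSketch and the helper newConfiguredBuilder are hypothetical names chosen for illustration, and the particular options switched on are only one plausible combination, not settings this chapter's examples require:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical class name; not one of the book's listings.
class FactoryConfigSketch {

    // Configures a factory, then asks it for a parser that inherits those settings.
    static DocumentBuilder newConfiguredBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);    // parsers will understand XML namespaces
        dbf.setIgnoringComments(true);  // comment nodes are left out of the DOM tree
        dbf.setCoalescing(true);        // CDATA sections become ordinary text nodes
        return dbf.newDocumentBuilder();
    }

    public static void main(String[] args) throws ParserConfigurationException {
        DocumentBuilder db = newConfiguredBuilder();
        // The builder reports the settings it was created with.
        System.out.println("namespace aware: " + db.isNamespaceAware());
        System.out.println("validating:      " + db.isValidating());
    }
}
```

The settings must be applied to the factory before newDocumentBuilder is called, because the DocumentBuilder class itself only exposes the is... query methods, not the setters.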

Table 11-2. Methods of the javax.xml.parsers.DocumentBuilder Class

Method	Does This
`protected DocumentBuilder()`	The default constructor
`abstract DOMImplementation getDOMImplementation()`	Returns a `DOMImplementation` object
`abstract boolean isNamespaceAware()`	Is true if this parser is configured to understand namespaces
`abstract boolean isValidating()`	Is true if this parser is configured to validate XML documents
`abstract Document newDocument()`	Returns a new instance of a DOM `Document` object to build a DOM tree
`Document parse(File f)`	Parses the content of the file as an XML document and returns a new DOM `Document` object
`abstract Document parse(InputSource is)`	Parses the content of the specified source as an XML document and returns a new DOM `Document` object
`Document parse(InputStream is)`	Parses the content of the specified `InputStream` as an XML document and returns a new DOM `Document` object
`Document parse(InputStream is, String systemId)`	Parses the content of the specified `InputStream` as an XML document, using `systemId` as the base for resolving relative URIs, and returns a new DOM `Document` object
`Document parse(String uri)`	Parses the content of the specified URI as an XML document and returns a new DOM `Document` object
`abstract void setEntityResolver(EntityResolver er)`	Specifies the `EntityResolver` object to be used to resolve entities
`abstract void setErrorHandler(ErrorHandler eh)`	Specifies the `ErrorHandler` to be used to report errors
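
As a rough illustration of two entries in Table 11-2, the sketch below parses XML held in a string through the parse(InputStream) overload and installs an error handler with setErrorHandler. The class ParseFromStreamSketch and its helpers are hypothetical, and using org.xml.sax.helpers.DefaultHandler as the handler is my own choice for brevity (it implements ErrorHandler and rethrows fatal errors); any ErrorHandler implementation could be used instead:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not one of the book's listings.
class ParseFromStreamSketch {

    // Parses XML text by wrapping its bytes in an InputStream.
    static Document parseString(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler ignores warnings and rethrows fatal errors as exceptions.
        db.setErrorHandler(new DefaultHandler());
        return db.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns false when the parser rejects the text as not well-formed.
    static boolean isWellFormed(String xml) {
        try {
            parseString(xml);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<DOCUMENT><CUSTOMER/></DOCUMENT>"));
        System.out.println(isWellFormed("<DOCUMENT><CUSTOMER></DOCUMENT>"));
    }
}
```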

To actually parse the XML document, you use the parse method of the DocumentBuilder object. I'll let the user specify the name of the document to parse on the command line by passing args[0] to the parse method. Note that you don't need to pass the name of a local file to the parse method; you can pass the URL of a document on the Internet, and the parse method will retrieve that document.

Here's how you can use the parse method:

    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class ch11_02
    {
        public static void main(String[] args)
        {
            try {
                DocumentBuilderFactory dbf =
                    DocumentBuilderFactory.newInstance();
                DocumentBuilder db = null;
                try {
                    db = dbf.newDocumentBuilder();
                }
                catch (ParserConfigurationException pce) {}

                Document doc = null;
                doc = db.parse(args[0]);
                .
                .
                .
            } catch (Exception e) {
                e.printStackTrace(System.err);
            }
        }
    }
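
The point that parse accepts either a file or a URI can be sketched as follows. This is only an illustration under my own assumptions: the class ParseOverloadsSketch, its helper methods, and the three-customer sample written to a temporary file are all hypothetical, and a file: URI stands in for a document on the Internet so that the example runs without a network connection:

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical class name; not one of the book's listings.
class ParseOverloadsSketch {

    // Counts <CUSTOMER> elements using the parse(File) overload.
    static int countViaFile(File file) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(file);
        return doc.getElementsByTagName("CUSTOMER").getLength();
    }

    // Counts <CUSTOMER> elements using the parse(String uri) overload.
    static int countViaUri(File file) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(file.toURI().toString());
        return doc.getElementsByTagName("CUSTOMER").getLength();
    }

    // Writes a small illustrative document to a temporary file.
    static File writeSample() throws Exception {
        File tmp = File.createTempFile("customers", ".xml");
        tmp.deleteOnExit();
        String xml = "<DOCUMENT><CUSTOMER/><CUSTOMER/><CUSTOMER/></DOCUMENT>";
        Files.write(tmp.toPath(), xml.getBytes(StandardCharsets.UTF_8));
        return tmp;
    }

    public static void main(String[] args) throws Exception {
        // Use the file named on the command line if there is one, else the sample.
        File file = args.length > 0 ? new File(args[0]) : writeSample();
        System.out.println("parse(File):   " + countViaFile(file));
        System.out.println("parse(String): " + countViaUri(file));
    }
}
```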

If the document is successfully parsed, this code creates a Document object based on the W3C DOM. The Document interface is part of the W3C DOM, and you can find the methods of this interface in Table 11-3.

Table 11-3. Methods of the org.w3c.dom.Document Interface

Method	Does This
`Attr createAttribute(String name)`	Creates an `Attr` object of the specified name
`Attr createAttributeNS(String namespaceURI, String qualifiedName)`	Creates an attribute of the specified name and namespace
`CDATASection createCDATASection(String data)`	Creates a `CDATASection` node whose value is the specified string
`Comment createComment(String data)`	Creates a `Comment` node using the specified string
`DocumentFragment createDocumentFragment()`	Creates an empty `DocumentFragment` object
`Element createElement(String tagName)`	Creates an element of the type specified
`Element createElementNS(String namespaceURI, String qualifiedName)`	Creates an element of the specified qualified name and namespace uniform resource identifier (URI)
`EntityReference createEntityReference(String name)`	Creates an `EntityReference` object
`ProcessingInstruction createProcessingInstruction(String target, String data)`	Creates a `ProcessingInstruction` node
`Text createTextNode(String data)`	Creates a text node using the specified string
`DocumentType getDoctype()`	Returns the document type definition (DTD) for this document
`Element getDocumentElement()`	Provides direct access to the document's root element
`Element getElementById(String elementId)`	Returns the element whose ID is specified
`NodeList getElementsByTagName(String tagname)`	Returns all the elements with a specified tag name
`NodeList getElementsByTagNameNS(String namespaceURI, String localName)`	Returns all the elements with a specified name and namespace
`DOMImplementation getImplementation()`	Gets the `DOMImplementation` object that handles this document
`Node importNode(Node importedNode, boolean deep)`	Imports a node from another document to this document
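
The create... methods in Table 11-3 can also be used to build a document from scratch rather than parsing one. The following is a minimal sketch, assuming an empty document obtained from newDocument (Table 11-2); the class BuildDocumentSketch and the helper appendTextElement are hypothetical names, and the element names simply mirror those in ch11_01.xml:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name; not one of the book's listings.
class BuildDocumentSketch {

    // Creates <tag>text</tag>, appends it to parent, and returns the new element.
    static Element appendTextElement(Document doc, Element parent, String tag, String text) {
        Element child = doc.createElement(tag);
        child.appendChild(doc.createTextNode(text));
        parent.appendChild(child);
        return child;
    }

    // Builds a one-customer document shaped like the start of ch11_01.xml.
    static Document buildCustomer(String lastName, String firstName)
            throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("DOCUMENT");
        doc.appendChild(root);              // the document element
        Element customer = doc.createElement("CUSTOMER");
        root.appendChild(customer);
        Element name = doc.createElement("NAME");
        customer.appendChild(name);
        appendTextElement(doc, name, "LAST_NAME", lastName);
        appendTextElement(doc, name, "FIRST_NAME", firstName);
        return doc;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = buildCustomer("Smith", "Sam");
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getElementsByTagName("LAST_NAME")
                .item(0).getFirstChild().getNodeValue());
    }
}
```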

The Document interface is based on the Node interface, which supports the W3C Node object. Each Node object represents a single node in the document tree (as you recall, everything in the document tree, including text and comments, is treated as a node). The Node interface has many methods that you can use to work with nodes. For example, you can use methods such as getNodeName and getNodeValue to get information about the node, and we'll use this kind of information a great deal in this chapter. This interface also has data members, called fields, which hold constant values corresponding to various node types; we'll see them in this chapter as well. You'll find the Node interface fields in Table 11-4 and the methods of this interface in Table 11-5. As you see in Table 11-5, the Node interface contains all the standard W3C DOM methods for navigating in a document that we've already used with JavaScript in Chapter 7, "Handling XML Documents with JavaScript." These include getNextSibling, getPreviousSibling, getFirstChild, getLastChild, and getParentNode. We'll put those methods to work here as well.

Table 11-4. Node Interface Fields

Field Summary
`static short ATTRIBUTE_NODE`
`static short CDATA_SECTION_NODE`
`static short COMMENT_NODE`
`static short DOCUMENT_FRAGMENT_NODE`
`static short DOCUMENT_NODE`
`static short DOCUMENT_TYPE_NODE`
`static short ELEMENT_NODE`
`static short ENTITY_NODE`
`static short ENTITY_REFERENCE_NODE`
`static short NOTATION_NODE`
`static short PROCESSING_INSTRUCTION_NODE`
`static short TEXT_NODE`

Table 11-5. Methods of the org.w3c.dom.Node Interface

Method	Does This
`Node appendChild(Node newChild)`	Adds the specified node to the end of the list of children of the current node
`Node cloneNode(boolean deep)`	Returns a duplicate of this node
`NamedNodeMap getAttributes()`	Returns the attributes of this node if it is an element
`NodeList getChildNodes()`	Returns all the children of this node
`Node getFirstChild()`	Returns the first child of this node
`Node getLastChild()`	Returns the last child of this node
`String getLocalName()`	Returns the local part of the full name of this node
`String getNamespaceURI()`	Returns the namespace URI of this node
`Node getNextSibling()`	Returns the node following this node
`String getNodeName()`	Returns the name of this node
`short getNodeType()`	Returns the type of the node's object
`String getNodeValue()`	Returns the value of this node
`Document getOwnerDocument()`	Returns the `Document` object for this node
`Node getParentNode()`	Returns the parent of this node
`String getPrefix()`	Returns the namespace prefix of this node
`Node getPreviousSibling()`	Returns the node preceding this node
`boolean hasAttributes()`	Is true if this node has any attributes
`boolean hasChildNodes()`	Is true if this node has any children
`Node insertBefore(Node newChild, Node refChild)`	Inserts the new node before the existing reference child node
`boolean isSupported(String feature, String version)`	Is true if the specific feature is implemented
`void normalize()`	Puts all text nodes into XML "normal" form by merging adjacent text nodes and removing empty ones
`Node removeChild(Node oldChild)`	Removes the child node and returns it
`Node replaceChild(Node newChild, Node oldChild)`	Replaces the child node in the list of children and returns the old child node
`void setNodeValue(String nodeValue)`	Sets the value of a node
`void setPrefix(String prefix)`	Sets the namespace prefix of this node
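
To show how the node-type fields in Table 11-4 and the navigation methods in Table 11-5 fit together, here is a small sketch that walks a node's children with getFirstChild and getNextSibling and counts only the element nodes. The class NodeWalkSketch, its helpers, and the two-customer sample string are assumptions made for illustration. With a default (non-validating) parser, the whitespace used for indentation shows up as text nodes, which is why the type check matters:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name; not one of the book's listings.
class NodeWalkSketch {

    // Counts the direct children of parent whose type is ELEMENT_NODE.
    static int countElementChildren(Node parent) {
        int count = 0;
        for (Node child = parent.getFirstChild();
                child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    // Parses XML text with a default DocumentBuilder.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<DOCUMENT>\n  <CUSTOMER/>\n  <CUSTOMER/>\n</DOCUMENT>");
        Node root = doc.getDocumentElement();
        // All children include the whitespace text nodes between elements.
        System.out.println("all children:     " + root.getChildNodes().getLength());
        System.out.println("element children: " + countElementChildren(root));
    }
}
```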

At this point, we've gotten access to the root node of the document in Java. Our goal here is to check how many <CUSTOMER> elements the document has. I'll use the getElementsByTagName method to get a Java NodeList object containing a list of all <CUSTOMER> elements:

    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class ch11_02
    {
        public static void main(String[] args)
        {
            try {
                DocumentBuilderFactory dbf =
                    DocumentBuilderFactory.newInstance();
                DocumentBuilder db = null;
                try {
                    db = dbf.newDocumentBuilder();
                }
                catch (ParserConfigurationException pce) {}

                Document doc = null;
                doc = db.parse(args[0]);

                NodeList nodelist = doc.getElementsByTagName("CUSTOMER");
                .
                .
                .
            } catch (Exception e) {
                e.printStackTrace(System.err);
            }
        }
    }

The NodeList interface supports an ordered collection of nodes. You can access nodes in such a collection by index, and we'll do that in this chapter. You can find the methods of the NodeList interface in Table 11-6.

Table 11-6. NodeList Interface Methods

Method	Summary
`int getLength()`	Gets the number of nodes in this list
`Node item(int index)`	Gets the item at index in the collection
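
Access by index, using the two methods in Table 11-6, might look like the following sketch. It collects the text of every <LAST_NAME> element by looping from 0 to getLength() and calling item. The class NodeListSketch, the helper lastNames, and the shortened sample document are hypothetical and stand in for ch11_01.xml:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; not one of the book's listings.
class NodeListSketch {

    // Returns the text of each <LAST_NAME> element, in document order.
    static List<String> lastNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodelist = doc.getElementsByTagName("LAST_NAME");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodelist.getLength(); i++) {
            // The element's text is stored in its first child, a text node.
            Node textNode = nodelist.item(i).getFirstChild();
            if (textNode != null) {
                names.add(textNode.getNodeValue());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DOCUMENT>"
                + "<CUSTOMER><NAME><LAST_NAME>Smith</LAST_NAME></NAME></CUSTOMER>"
                + "<CUSTOMER><NAME><LAST_NAME>Jones</LAST_NAME></NAME></CUSTOMER>"
                + "</DOCUMENT>";
        for (String name : lastNames(xml)) {
            System.out.println(name);
        }
    }
}
```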

If you take a look at Table 11-6, you'll see that the NodeList interface supports a getLength method that returns the number of nodes in the list. This means that we can find how many <CUSTOMER> elements there are in the document like this:

Listing ch11_02.java

    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class ch11_02
    {
        public static void main(String[] args)
        {
            try {
                DocumentBuilderFactory dbf =
                    DocumentBuilderFactory.newInstance();
                DocumentBuilder db = null;
                try {
                    db = dbf.newDocumentBuilder();
                }
                catch (ParserConfigurationException pce) {}

                Document doc = null;
                doc = db.parse(args[0]);

                NodeList nodelist = doc.getElementsByTagName("CUSTOMER");

                System.out.println(args[0] + " has " +
                    nodelist.getLength() + " <CUSTOMER> elements.");
            } catch (Exception e) {
                e.printStackTrace(System.err);
            }
        }
    }

And that's it. You can see the results of this code here, indicating that ch11_01.xml has three <CUSTOMER> elements, which is correct:

    %java ch11_02 ch11_01.xml
    ch11_01.xml has 3 <CUSTOMER> elements.

That's all it takes to get started with the Java XML parsers.