Before we dive into manipulating XML files, let's define the components that make up XML documents.
The XML Document
The core of an XML document is the document itself. It consists of the following components:
Prolog: Contains version information, comments, and references to Document Type Definition (DTD) files.
Body: Contains a document root and subelements.
Epilog: Contains comments and processing instructions.
Listing 25.1 shows a simple XML file that might be used in a bookstore to define a set of books.
Listing 25.1 books.xml
 1: <?xml version="1.0"?>
 2: <!DOCTYPE books SYSTEM "Books.dtd">
 3: <books>
 4:   <book category="computer-programming">
 5:     <author>Steven Haines</author>
 6:     <title>Java 2 From Scratch</title>
 7:     <price>39.95</price>
 8:   </book>
 9:   <book category="fiction">
10:     <author>Tim LaHaye</author>
11:     <title>Left Behind</title>
12:   </book>
13: </books>
Lines 1 and 2 define the prolog (the document's header). Line 1 notes that this XML document is written against version 1.0 of the XML specification; line 2 references a Document Type Definition (DTD) file that defines the syntax rules for this XML file.
Lines 3 through 13 define the body of the XML document. Like most XML documents that you will encounter, this one has no epilog, although the specification allows for one.
Document Type Definition (DTD)
A Document Type Definition file, or DTD for short, defines the syntactical rules by which XML files are written. It defines things such as:
The name of the root element in the document (for example, <books>).
The elements that can be contained inside of other elements (for example, <book> can be contained by <books>).
The multiplicity of elements (for example, <book> can appear multiple times).
The order of elements (for example, <author> must precede <title>, which must precede <price>).
Optional elements (for example, <price> appears in the first <book>, but not in the second <book>).
The type of data that is contained in the content of the element (for example, all elements are text elements).
The attributes that can appear in an element (for example, <book> contains the attribute category, which is a text string).
DTD syntax dates back to SGML, XML's forefather markup language, so it might initially appear cryptic, but its conventions are pretty straightforward. Listing 25.2 shows the DTD that defines the books.xml file in Listing 25.1.
Listing 25.2 books.dtd
1: <!ELEMENT books (book*)>
2: <!ELEMENT book (author, title, price?)>
3: <!ATTLIST book category CDATA #IMPLIED>
4: <!ELEMENT author (#PCDATA)>
5: <!ELEMENT title (#PCDATA)>
6: <!ELEMENT price (#PCDATA)>
Line 1 defines the element <books> as containing zero or more <book> elements:
1: <!ELEMENT books (book*)>
The <!ELEMENT prefix declares <books> as an element. The parentheses contain a comma-separated list of subelements that the <books> element can contain. In this case it can contain only the element <book>, and the asterisk following book denotes that it can contain zero or more instances of the <book> element. Note that if book were followed by a plus sign instead, <books> would have to contain one or more <book> elements (that is, at least one).
2: <!ELEMENT book (author, title, price?)>
Similarly, line 2 defines the subelements that the <book> element can contain: <author>, followed by <title>, followed by an optional <price> element. The question mark following price notes that price is optional and can appear at most once as a subelement of the <book> element. Note that the order is enforced by the DTD; if <price> appears, it must be after <title>, which must be after <author>.
3: <!ATTLIST book category CDATA #IMPLIED>
Line 3 declares category as an attribute of the <book> element and notes its type, CDATA, and its default, #IMPLIED. CDATA (character data) and its counterpart PCDATA (parsed character data) both refer to character data or text. Every attribute declaration must end with a default declaration: #IMPLIED marks the attribute as optional, whereas #REQUIRED would make it mandatory.
Lines 4 through 6 define the elements <author>, <title>, and <price> as containing character data. Note that if you were to consider an XML document as a tree, the <book> element would be a branch that contains other elements, and the <author>, <title>, and <price> elements would be leaf nodes (they do not have any children of their own).
Why all these rules? The answer lies in document validation. When an XML document is parsed (or read), the parser (the process that is responsible for reading the document) can validate the document against its DTD file and report back whether the document is valid. Note the distinction: a well-formed document merely obeys XML's basic syntax rules, such as properly nested and closed tags, whereas a valid document is well formed and also conforms to its DTD. If the document is valid, you can trust that you will be able to extract its data according to the rules in the DTD file; if not, you can reject the document. This is a valuable asset to you as the programmer.
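To make validation concrete, here is a minimal sketch that asks a JAXP DOM parser to validate two small documents. It assumes a JDK that bundles JAXP, and it embeds the rules from Listing 25.2 as an internal DTD subset so that no external file is needed. The class name DtdValidationSketch, the helper validate, and the #IMPLIED default on the category attribute are choices made for this illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical class for illustration; not part of JAXP.
class DtdValidationSketch {

    // The rules from Listing 25.2, embedded as an internal subset.
    // The #IMPLIED default on category is an assumption of this sketch.
    static final String DOCTYPE =
          "<!DOCTYPE books [\n"
        + "  <!ELEMENT books (book*)>\n"
        + "  <!ELEMENT book (author, title, price?)>\n"
        + "  <!ATTLIST book category CDATA #IMPLIED>\n"
        + "  <!ELEMENT author (#PCDATA)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "  <!ELEMENT price (#PCDATA)>\n"
        + "]>\n";

    /** Parses the XML with validation turned on and returns the validity errors. */
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // check the document against its DTD

        final List<String> errors = new ArrayList<>();
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                // Warnings are ignored in this sketch.
            }

            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage()); // validity errors are reported here
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // the document is not even well formed
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String head = "<?xml version=\"1.0\"?>\n" + DOCTYPE;

        String valid = head
            + "<books><book category=\"fiction\">"
            + "<author>Tim LaHaye</author><title>Left Behind</title>"
            + "</book></books>";

        // Well formed, but <title> precedes <author>, which the DTD forbids.
        String invalid = head
            + "<books><book category=\"fiction\">"
            + "<title>Left Behind</title><author>Tim LaHaye</author>"
            + "</book></books>";

        System.out.println("First document valid?  " + validate(valid).isEmpty());
        System.out.println("Second document valid? " + validate(invalid).isEmpty());
    }
}
```

The second document parses without a fatal error because it is well formed; it is reported as invalid only because validation was requested and the element order violates the DTD.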
Sun, through the Java API for XML Parsing (JAXP), provides two mechanisms for reading XML documents:
An event model
A tree model
The JDOM open-source project provides a solution for manipulating XML documents using Java-familiar Collection classes.
The Simple API for XML (SAX) Parser is event driven: the program registers a listener with the parser, and the parser streams through the file, firing notifications when it encounters XML elements. The Document Object Model (DOM) constructs a tree representation of the XML document and provides an Application Programming Interface (API) for accessing and manipulating the data in the tree.
SAX and DOM serve different application purposes, and each has its advantages and disadvantages. The SAX parser maintains a very small memory footprint because it does not save any information; it simply streams through the document and fires notifications of what it finds. It is extremely fast, but it requires you to build your own data-structure representation of the document for in-memory use of the data. The DOM is slow to build and consumes a lot of memory, but it maintains an in-memory representation of the data, provides an API to access and manipulate that data, and even allows for complex searching and reporting based on that data.
If you are reading an XML file to load data into your own data structures, you should use SAX. If you need only a subset of the information or are simply trying to compute a value based on what is contained in the XML file (for example, how many books are in my document), you should also use SAX.
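As a rough illustration of the event model, the following sketch counts the <book> elements in a document by listening for start-of-element notifications. The class name BookCounter and its countBooks helper are invented for this example, and the document is supplied as a string without its DOCTYPE line so that the parser does not go looking for an external DTD file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical listener for illustration: it counts <book> start tags.
class BookCounter extends DefaultHandler {

    private int count;

    // The parser calls this method each time it encounters an opening tag.
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("book".equals(qName)) {
            count++;
        }
    }

    int getCount() {
        return count;
    }

    /** Streams through the XML once and returns the number of <book> elements. */
    static int countBooks(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        BookCounter handler = new BookCounter();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getCount();
    }

    public static void main(String[] args) throws Exception {
        // The body of Listing 25.1, without the DOCTYPE declaration.
        String xml =
              "<?xml version=\"1.0\"?>"
            + "<books>"
            + "<book category=\"computer-programming\">"
            + "<author>Steven Haines</author>"
            + "<title>Java 2 From Scratch</title>"
            + "<price>39.95</price>"
            + "</book>"
            + "<book category=\"fiction\">"
            + "<author>Tim LaHaye</author>"
            + "<title>Left Behind</title>"
            + "</book>"
            + "</books>";

        System.out.println("Books found: " + countBooks(xml)); // prints "Books found: 2"
    }
}
```

Nothing from the document is kept in memory here except the running count, which is why this style suits computing a single value from a large file.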
On the other hand, if you want to access the entire document in memory, you should use DOM. If you want to manipulate the document, and then output the modified document to a destination (for example, save it to a file), you should use DOM.
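The tree model can be sketched in the same spirit. The example below loads a document into a DOM tree, adds a <price> element to every <book> that lacks one, and writes the modified tree back out as text. The class name DomPriceEditor, the helper addMissingPrices, and the price 9.99 are made up for this illustration; the serialization step uses the identity Transformer found in JAXP versions that include the javax.xml.transform package, which is one common way to output a DOM tree and not the only one.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class for illustration; not part of JAXP.
class DomPriceEditor {

    /** Adds a <price> child to every <book> that lacks one, then serializes the tree. */
    static String addMissingPrices(String xml, String defaultPrice) throws Exception {
        // Build the in-memory tree.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Manipulate the tree through the DOM API.
        NodeList books = doc.getElementsByTagName("book");
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            if (book.getElementsByTagName("price").getLength() == 0) {
                Element price = doc.createElement("price");
                price.setTextContent(defaultPrice);
                book.appendChild(price); // appended last, after <title>
            }
        }

        // Write the modified tree to a string (a file would work the same way).
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml =
              "<books>"
            + "<book category=\"fiction\">"
            + "<author>Tim LaHaye</author>"
            + "<title>Left Behind</title>"
            + "</book>"
            + "</books>";

        // 9.99 is a placeholder price chosen for this example.
        System.out.println(addMissingPrices(xml, "9.99"));
    }
}
```

Unlike the event model, the whole document stays in memory between the parse and the output step, which is what makes in-place edits like this possible.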
The choice is all a matter of how you are going to use the data obtained from the XML document.
Before we get started, you are going to have to obtain a copy of JAXP from Sun's Web site: http://java.sun.com/xml/jaxp.
When you download JAXP, simply decompress it to a directory on your computer and you're ready to go. The files of interest in the distribution are:
docs subdirectory: This contains all the javadocs for all the classes provided in the JAXP. It is the resource that you are going to make the most use of in your XML development.
crimson.jar: This archive contains all the World Wide Web Consortium (W3C) DOM classes and the SAX classes, along with the Apache Crimson JAXP implementation classes.
jaxp.jar: This archive contains all the JAXP interfaces.
You will need to add these two archives to your CLASSPATH when compiling and running Java applications that make use of the JAXP.
Both SAX and DOM are widely adopted standard interfaces (DOM is defined by the World Wide Web Consortium, www.w3.org, and SAX was developed by members of the XML-DEV mailing list), and implementations are available for a variety of programming languages. JDOM, by contrast, is a Java-specific solution that is usable only in the Java programming language. Then why use it? If you are a Java programmer and familiar with the Java Collection classes (as you are now), it is extremely easy to use! As we get into the examples later in the chapter you will see the difference, and why I personally use JDOM over both SAX and DOM whenever I am developing in Java.
You are going to need to obtain a copy of JDOM, which is available as a free download at http://www.jdom.org.
Download and decompress the latest version of JDOM to your local computer. The following files must be in your CLASSPATH when you compile programs that use JDOM:
build/jdom.jar: This is the implementation of the JDOM API.
lib/xerces.jar: This contains the open-source Xerces Java XML libraries that JDOM is built on top of.
The distribution also includes documentation, which does not need to be in your CLASSPATH:
build/apidocs/index.html: This is the root of the Javadoc that describes the JDOM API.