Motivating XML | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

If you're reading this book, you're a developer. (At least I hope you are. Otherwise a lot of what I say isn't going to make any sense :-) ) Doubtless over the course of your career you've written numerous programs that read and write files. And every time you wrote a new program you had to invent or learn a new file format. File formats I personally have had to deal with over the years include RTF, Word .doc files, tab-delimited text, FITS, PDF, PostScript, and many more. You've probably encountered a few of these yourself, and many other formats besides.

If you're like me, you've learned to dread encountering a new file format. If the format is documented at all, the documentation is likely incomplete or, worse yet, misleading. Important details like byte order and line-ending conventions are often left unspecified. Different tools that all claim to read and write the same format actually produce subtly different variants which are often incompatible in practice. When you think you've finally wrestled the last bug out of your code, you discover a file written by somebody else's software that you can't read. You realize you've made one too many assumptions about the format and have to go back to the drawing board.

Consequently, when designing new file formats, developers have tended to gravitate toward the simplest formats they can imagine, often tab-delimited text or comma-separated values. Nonetheless, even these plain, undecorated formats often present unexpected problems. For example, should two tabs in a row be interpreted as an empty string, null, or the same as one tab? In practice, all three variations are used. Java's StringTokenizer class takes the last interpretation (two consecutive tabs are the same as one tab), even though this is the least common approach in actual data files, a fact that has surprised many Java programmers and has led to not a few bugs in Java programs. ^[1]

^[1] This interpretation makes sense once you realize that java.util.StringTokenizer is designed for parsing Java source code, not for reading tab-delimited data files. Nonetheless, many programmers do use it for reading tab-delimited data.
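The difference is easy to see in a few lines of Java. This is only a minimal sketch: the class name `TabFields` and the sample record are made up for illustration, and `String.split` with a negative limit stands in for the "empty string" interpretation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper comparing two ways of reading one tab-delimited record.
class TabFields {

    // StringTokenizer treats a run of delimiters as a single separator,
    // so an empty field between two tabs silently disappears.
    static List<String> withTokenizer(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, "\t");
        while (tokenizer.hasMoreTokens()) {
            fields.add(tokenizer.nextToken());
        }
        return fields;
    }

    // String.split with a negative limit keeps every field, including empty ones.
    static List<String> withSplit(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        String line = "Chez Fred\t\tRI"; // the middle field is empty
        System.out.println(withTokenizer(line).size()); // 2
        System.out.println(withSplit(line).size());     // 3
    }
}
```

With the tokenizer, every field after the empty one shifts left by one position, which is exactly the kind of quiet corruption this chapter is concerned with.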

A Thought Experiment

With all that in mind, let's do a thought experiment. Imagine you've been tasked with writing a server-side program that accepts orders over the Internet for an e-commerce site. The web server must send each completed order to the internal system, one order at a time. You're responsible for writing the code on the server that sends the order to the internal system and for writing the code on the internal system that receives and processes the order. The only connection between the two systems is a TCP/IP network; that is, you don't have some sort of higher-level API like JDBC that lets you move data between the two systems. You need to invent a data format you can generate on one end and parse on the other end that's flexible enough to contain all the information in a typical order. This includes the customer name, the product ordered, its price, the manufacturer's stock keeping unit (SKU) number, the address to ship to, the tax, and the shipping and handling charges. One possibility is to place each piece of information on a separate line, as shown in Example 1.1.

Example 1.1 A Plain Text Document That Indicates an Order for 12 Birdsong Clocks, SKU 244

 c32
 Chez Fred
 Birdsong Clock
 244
 12
 USD
 21.95
 135 Airline Highway
 Narragansett
 RI
 02882
 USD
 263.40
 7.0
 USD
 18.44
 USPS
 USD
 8.95
 USD
 290.79

An alternative is to use a more complex and verbose XML format, as in Example 1.2.

Example 1.2 An XML Document That Indicates an Order for 12 Birdsong Clocks, SKU 244

 <?xml version="1.0" encoding="ISO-8859-1"?>
 <Order>
   <Customer id="c32">Chez Fred</Customer>
   <Product>
     <Name>Birdsong Clock</Name>
     <SKU>244</SKU>
     <Quantity>12</Quantity>
     <Price currency="USD">21.95</Price>
   </Product>
   <ShipTo>
     <Street>135 Airline Highway</Street>
     <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
   </ShipTo>
   <Subtotal currency='USD'>263.40</Subtotal>
   <Tax rate="7.0" currency='USD'>18.44</Tax>
   <Shipping method="USPS" currency='USD'>8.95</Shipping>
   <Total currency='USD'>290.79</Total>
 </Order>

Would you rather write the code to send and receive orders as nice, simple linefeed-delimited files as in Example 1.1, or as complex, marked-up XML documents as in Example 1.2? Both documents contain the same information. Most uninitiated developers prefer the first, simpler form. After all, each piece of information is presented on a line by itself with no extraneous markup characters getting in the way. It's my goal to convince you that, contrary to most developers' first intuition, the second form is more robust, more extensible, and much easier to work with.

Robustness

Let's consider robustness first. Suppose your program receives the order in Example 1.3.

Example 1.3 A Document That Indicates an Order for 12 Birdsong Clocks, SKU 244

 c32
 Chez Fred
 Birdsong Clock
 12
 244
 USD
 21.95
 135 Airline Highway
 Narragansett
 RI
 02882
 USD
 263.40
 7.0
 USD
 18.44
 USPS
 USD
 290.79
 USD
 8.95

It looks the same as Example 1.1, doesn't it? However, if you compare Example 1.1 and Example 1.3 very carefully, you will notice that the 12 and the 244 have changed places. What was an order for 12 bird clocks may now be an order for 244 whoopee cushions. Maybe somebody will notice the problem before the order is shipped, and maybe they won't. Worse yet, the shipping charge and the total price got flipped around. This entire order now costs $8.95. Again, maybe someone will notice the problem before it's too late, but maybe not. These sorts of problems aren't theoretical. More than one e-commerce site has lost both revenue and customer goodwill by mispricing items.
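To make the failure concrete, here is a rough sketch of the kind of reader one might write for the line-per-field format. Everything in it is an assumption made for illustration: the class name `PositionalOrder`, the choice of fields, and the zero-based line positions (3 for the SKU, 4 for the quantity, 18 for shipping, 20 for the total), which simply follow the order of the fields in Example 1.1.

```java
import java.math.BigDecimal;

// Hypothetical reader that trusts each field to be on a fixed line.
class PositionalOrder {
    final String sku;
    final int quantity;
    final BigDecimal shipping;
    final BigDecimal total;

    private PositionalOrder(String sku, int quantity, BigDecimal shipping, BigDecimal total) {
        this.sku = sku;
        this.quantity = quantity;
        this.shipping = shipping;
        this.total = total;
    }

    // Assumed layout: line 3 = SKU, 4 = quantity, 18 = shipping, 20 = total.
    static PositionalOrder parse(String text) {
        String[] lines = text.split("\n");
        return new PositionalOrder(lines[3], Integer.parseInt(lines[4]),
                new BigDecimal(lines[18]), new BigDecimal(lines[20]));
    }

    static final String EXAMPLE_1_1 = String.join("\n",
            "c32", "Chez Fred", "Birdsong Clock", "244", "12", "USD", "21.95",
            "135 Airline Highway", "Narragansett", "RI", "02882", "USD", "263.40",
            "7.0", "USD", "18.44", "USPS", "USD", "8.95", "USD", "290.79");

    static final String EXAMPLE_1_3 = String.join("\n",
            "c32", "Chez Fred", "Birdsong Clock", "12", "244", "USD", "21.95",
            "135 Airline Highway", "Narragansett", "RI", "02882", "USD", "263.40",
            "7.0", "USD", "18.44", "USPS", "USD", "290.79", "USD", "8.95");

    public static void main(String[] args) {
        for (String text : new String[] {EXAMPLE_1_1, EXAMPLE_1_3}) {
            PositionalOrder order = parse(text);
            // Both inputs parse without any error; only the meaning differs.
            System.out.println(order.quantity + " x SKU " + order.sku + ", total " + order.total);
        }
    }
}
```

Nothing here throws an exception or logs a warning. The second document is accepted as an order for 244 units of SKU 12 with a total of 8.95, because the format itself carries no information that could reveal the swap.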

In the XML version, this simply would not be an issue because each datum is marked up with what it means. You can freely reorder the quantity and the SKU or the shipping cost and the total price without any confusion, as Example 1.4 demonstrates. What can be devastating mistakes in a traditional system are harmless in XML.

Example 1.4 Still an Order for 12 Birdsong Clocks, SKU 244

 <?xml version="1.0" encoding="ISO-8859-1"?>
 <Order>
   <Customer id="c32">Chez Fred</Customer>
   <Product>
     <Name>Birdsong Clock</Name>
     <Quantity>12</Quantity>
     <SKU>244</SKU>
     <Price currency="USD">21.95</Price>
   </Product>
   <ShipTo>
     <Street>135 Airline Highway</Street>
     <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
   </ShipTo>
   <Subtotal currency='USD'>263.40</Subtotal>
   <Tax rate="7.0" currency='USD'>18.44</Tax>
   <Total currency='USD'>290.79</Total>
   <Shipping method="USPS" currency='USD'>8.95</Shipping>
 </Order>
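A short sketch shows why the order of the elements stops mattering once the code asks for values by name. It uses the standard JAXP DOM API; the class name `OrderByName`, its helper methods, and the two cut-down product fragments are assumptions made for this illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper that looks up values by element name rather than position.
class OrderByName {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Returns the text of the first element with the given name.
    static String text(Document doc, String elementName) {
        return doc.getElementsByTagName(elementName).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String skuFirst = "<Product><SKU>244</SKU><Quantity>12</Quantity></Product>";
        String quantityFirst = "<Product><Quantity>12</Quantity><SKU>244</SKU></Product>";
        for (String xml : new String[] {skuFirst, quantityFirst}) {
            Document doc = parse(xml);
            // Prints "12 x SKU 244" for both orderings.
            System.out.println(text(doc, "Quantity") + " x SKU " + text(doc, "SKU"));
        }
    }
}
```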

You may be objecting at this point that you would never let a mistake like that through your system. After all, you check every value for sensibility. You look up the SKU in the company database to make sure it matches the product name and price before completing an order. You check every return value from a method call to see if it's null, and you catch every exception. You write extensive tests to verify that each method is doing what you think it's doing. You use a source-code control system so you can always back out of changes, and you never check code in until it has passed all the regression tests. Every line of code is scrupulously documented. In fact, you write more documentation than actual code. And you've never, ever missed church on Sunday. In this case your name is Donald Knuth. The rest of us need a little more help making sure we don't do something stupid.

Even if you are that conscientious, are you really willing to gamble that everyone else who sends or receives data from you will be equally anal retentive? Wouldn't it make more sense to use the most robust format possible so that when the inevitable errors do creep in, they'll do less damage?

Of course, XML has a lot to offer the anal developer as well. When defining constraints such as "Every order must have a shipping address," "the currency must be one of the three-letter codes USD, CAD, or GBP," or "the total cost must be the sum of the unit price times the number of items, the tax, and the shipping," it's easiest to use a declarative language that specifies what the constraints are without elaborating the actual code to check these constraints. When your data is XML, you can use a declarative schema language to define and test such constraints. Indeed, you have a choice of several schema languages. The simplest and most broadly supported, the classic document type definition (DTD), allows you to verify that all required elements are present in the required order, and with any necessary attributes. The W3C XML Schema language goes further by allowing you to constrain the contents of particular elements and attributes, for example to guarantee that the total price is a decimal number greater than 1.00. Schematron, the most powerful schema language of all, allows you to state multi-element constraints such as "the actual price must be less than or equal to the suggested retail price."

I discuss all of these languages in more detail later in this chapter and throughout this book. For now all you need to know is that you can list all the constraints on a document in a simple fashion and then check those constraints without writing a lot of extra code. You feed your documents through a validator before you act on them. Validation becomes a separate, modular, and more maintainable part of the process. You can even change constraints or add new ones without recompiling your code.
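As a taste of what this looks like in practice, the following sketch validates an order against a DTD using the standard JAXP validating parser. The DTD is a hypothetical, heavily reduced stand-in for a real order schema (three child elements, one enumerated currency attribute written with ISO 4217 codes), and the class name `OrderValidator` is likewise invented for illustration. The constraints live in the DTD text, not in the Java code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical validator; the DTD below is a simplified assumption, not a real schema.
class OrderValidator {

    static final String DTD =
          "<!DOCTYPE Order [\n"
        + "  <!ELEMENT Order (Customer, ShipTo, Total)>\n"
        + "  <!ELEMENT Customer (#PCDATA)>\n"
        + "  <!ATTLIST Customer id ID #REQUIRED>\n"
        + "  <!ELEMENT ShipTo (#PCDATA)>\n"
        + "  <!ELEMENT Total (#PCDATA)>\n"
        + "  <!ATTLIST Total currency (USD|CAD|GBP) #REQUIRED>\n"
        + "]>\n";

    // Returns true if the Order element satisfies the DTD, false otherwise.
    static boolean isValid(String orderElement) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // By default validity errors are only reported, so rethrow them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + orderElement)));
            return true;
        } catch (SAXException e) {
            return false; // malformed or invalid
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failure", e);
        }
    }

    public static void main(String[] args) {
        String order = "<Order><Customer id=\"c32\">Chez Fred</Customer>"
                + "<ShipTo>135 Airline Highway</ShipTo>"
                + "<Total currency=\"USD\">290.79</Total></Order>";
        System.out.println(isValid(order)); // true
        // Removing the shipping address violates the content model.
        System.out.println(isValid(order.replace("<ShipTo>135 Airline Highway</ShipTo>", ""))); // false
    }
}
```

Adding a new rule, such as another permitted currency, means editing the DTD string (or, more realistically, an external DTD file) rather than touching `isValid`.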

Extensibility

Robustness isn't the only advantage of the XML approach. XML is also far more extensible. For example, suppose you suddenly needed to add a discount percentage to some products. The change to the XML code would be straightforward. You would simply add an extra element:

 <Product>
   <Name>Birdsong Clock</Name>
   <Quantity>12</Quantity>
   <SKU>244</SKU>
   <Price currency="USD">21.95</Price>
   <Discount>.10</Discount>
 </Product>

The change to the plain text file (or the equivalent binary file) would be much less obvious. Although you could certainly add an extra line of data, everything that followed it would then be out of order. You could put the new information at the end of the document, but then it wouldn't be close to the item with which it logically belonged. And suppose not all orders had discounts. Would there be blank lines for products without discounts? How would your program know to convert an empty string into a zero discount rather than NaN (Not a Number) or throwing an exception? This is not an insurmountable problem, but the simple solution is becoming more complex.
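On the XML side, by contrast, the "no discount" case needs no placeholder at all: the element is simply absent, and the reading code can say what that means. The sketch below assumes a missing Discount element should be treated as zero; that policy, the class name `DiscountReader`, and the sample fragments are all illustrative assumptions.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for an optional Discount element inside a Product.
class DiscountReader {

    // Returns the discount fraction, or zero when the element is absent (assumed policy).
    static BigDecimal discount(String productXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(productXml)));
            NodeList discounts = doc.getElementsByTagName("Discount");
            if (discounts.getLength() == 0) {
                return BigDecimal.ZERO;
            }
            return new BigDecimal(discounts.item(0).getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read product", e);
        }
    }

    public static void main(String[] args) {
        String plain = "<Product><Name>Birdsong Clock</Name><SKU>244</SKU></Product>";
        String discounted = "<Product><Name>Birdsong Clock</Name><SKU>244</SKU>"
                + "<Discount>.10</Discount></Product>";
        System.out.println(discount(plain));      // 0
        System.out.println(discount(discounted)); // 0.10
    }
}
```

Code written before the Discount element existed keeps working unchanged, because it never asks for that element in the first place.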

Now suppose someone wanted to add a gift message field whose value could contain line breaks. Now the data might contain the delimiter character! You could probably escape the line breaks as \n or some such, and then escape the backslash character as \\, but your nice simple solution would become quite a bit more complex. However, once again this would not be a problem for XML, as this solution demonstrates:

 <GiftMessage>
   Happy Birthday Monica!
   Love Always,
   Linda
 </GiftMessage>
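For comparison, here is roughly what the escaping scheme for the line-delimited format would involve. This is a sketch of one possible convention (backslash-n for a line break, a doubled backslash for a literal backslash), with the hypothetical class name `LineEscaper`; both ends of the connection would have to implement exactly the same rules.

```java
// Hypothetical escaping rules for a one-field-per-line format.
class LineEscaper {

    // Backslashes must be doubled first, or the "\n" we add would be re-escaped.
    static String escape(String field) {
        return field.replace("\\", "\\\\").replace("\n", "\\n");
    }

    // Reverses escape(): a backslash always introduces a two-character sequence.
    static String unescape(String line) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                char next = line.charAt(++i);
                out.append(next == 'n' ? '\n' : next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String message = "Happy Birthday Monica!\nLove Always,\nLinda";
        String oneLine = escape(message);
        System.out.println(oneLine);                            // fits on a single line
        System.out.println(unescape(oneLine).equals(message));  // true
    }
}
```

Even this small scheme has an ordering trap in `escape` and a look-ahead in `unescape`. An XML parser already handles the equivalent problem for you.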

Throughout this example, I've assumed each order to be for exactly one product, but that's probably not true. Some customers will order multiple products at a time. Thus each order will contain between one product and an indefinite number of products. Different products may even be going to different addresses. Would you break each individual item into a separate order document and repeat the customer information? If so, how would you calculate the total shipping and total cost? Or would you allow multiple products in a single order? If so, how would you tell where one product ended and the next began? Again, none of these problems is unsolvable, but the simple solution proves more and more complex as the needs grow. The XML approach, by contrast, scales very well to expanded functionality in a very obvious way. Example 1.5 is an XML document that accomplishes all of the above. The boundaries between the individual parts are obvious.

Example 1.5 An XML Document That Indicates an Order for Multiple Products Shipped to Multiple Addresses

 <?xml version="1.0" encoding="ISO-8859-1"?>
 <Order>
   <Customer id="c32">Chez Fred</Customer>
   <Product>
     <Name>Birdsong Clock</Name>
     <SKU>244</SKU>
     <Quantity>12</Quantity>
     <Price currency="USD">21.95</Price>
     <ShipTo>
       <Street>135 Airline Highway</Street>
       <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
     </ShipTo>
   </Product>
   <Product>
     <Name>Brass Ship's Bell</Name>
     <SKU>258</SKU>
     <Quantity>1</Quantity>
     <Price currency="USD">144.95</Price>
     <Discount>.10</Discount>
     <ShipTo>
       <GiftRecipient>Samuel Johnson</GiftRecipient>
       <Street>271 Old Homestead Way</Street>
       <City>Woonsocket</City> <State>RI</State> <Zip>02895</Zip>
     </ShipTo>
     <GiftMessage>
       Happy Father's Day to a great Dad!
       Love,
       Sam and Beatrice
     </GiftMessage>
   </Product>
   <Subtotal currency='USD'>393.85</Subtotal>
   <Tax rate="7.0" currency='USD'>28.20</Tax>
   <Shipping method="USPS" currency='USD'>8.95</Shipping>
   <Total currency='USD'>431.00</Total>
 </Order>
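Because each Product element carries its own boundaries, walking a multi-product order takes only a simple loop. The following sketch uses the standard DOM API on a trimmed-down version of Example 1.5; the class name `OrderItems`, the helper `childText`, and the summary format are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical walker over the Product elements of an order.
class OrderItems {

    // Text of the first descendant of parent with the given name.
    static String childText(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getTextContent().trim();
    }

    // One summary line per product: quantity, name, and destination city.
    static List<String> summarize(String orderXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(orderXml)));
            NodeList products = doc.getElementsByTagName("Product");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < products.getLength(); i++) {
                Element product = (Element) products.item(i);
                lines.add(childText(product, "Quantity") + " x " + childText(product, "Name")
                        + " to " + childText(product, "City"));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read order", e);
        }
    }

    public static void main(String[] args) {
        String order = "<Order><Customer id=\"c32\">Chez Fred</Customer>"
                + "<Product><Name>Birdsong Clock</Name><Quantity>12</Quantity>"
                + "<ShipTo><City>Narragansett</City></ShipTo></Product>"
                + "<Product><Name>Brass Ship's Bell</Name><Quantity>1</Quantity>"
                + "<ShipTo><City>Woonsocket</City></ShipTo></Product></Order>";
        for (String line : summarize(order)) {
            System.out.println(line);
        }
    }
}
```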

This example still isn't complete. Missing pieces include the credit card information, billing address, and more. Real-world examples are larger and more complex than can comfortably fit in a book. Adding these other parts would only stretch the flat format further and make the advantages of XML still more obvious. The more complex your data, the more important it becomes to use a hierarchical format such as XML rather than a flat format such as tab- or line-delimited text.

Ease-of-Use

Now here's the real kicker: Not only is the XML document far more robust. Not only is it much more extensible in the face of both expected and unexpected changes. Not only does it more easily adapt to more complex structures. It is also easier for your programs to read! Writing a program to accept orders written in XML will be many times easier than writing a program to accept orders delivered in simple line-delimited files. "How can that be?" you may be asking. After all, the program reading the XML document has to hunt for less-than signs and quotation marks, rather than simply picking each piece of data off a line. It has to distinguish between any less-than signs and quotation marks that appear in the data itself and those in the markup. It has to deal with data that may extend across multiple lines. And in fact, there are many more possibilities not evident in this simple example that a real program must handle.

Fortunately none of this matters to you as a developer because you don't have to do any of it. Instead of writing the code to process XML documents directly, you let an XML parser do the hard work for you. A parser is a software library that knows how to read XML documents and handle all the markup it finds. Your own code reads the XML document only through the parser's API. At this level, you can simply ask the parser to tell you what it saw in any particular element. Or you can ask the parser to tell you everything it sees as soon as it sees it. In either case, the parser simply gives you the data after resolving all of the markup. For instance, if you want to ask the parser what the total price was, it can tell you 290.79 in the currency USD. You don't have to concern yourself with stripping off the markup around the information you want. Nor do you necessarily have to take the information in the order it appears in the input document. If you want the total price before the customer name, you can have it. If you just want to look at the price and ignore the rest of the order completely, you can do that too. You take the information in the form that's convenient to you without worrying excessively about low-level serialization details.
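As a small illustration of the "tell me everything as you see it" style, the sketch below uses the standard SAX API to pick the total and its currency out of an order and ignore everything else. The class name `TotalReader` and the cut-down order document are assumptions for this example, and a production handler would need more care (for instance, with nested or repeated Total elements).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX handler that remembers only the Total element.
class TotalReader extends DefaultHandler {
    private boolean inTotal = false;
    private final StringBuilder amount = new StringBuilder();
    private String currency;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("Total")) {
            inTotal = true;
            currency = atts.getValue("currency");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so accumulate it.
        if (inTotal) {
            amount.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("Total")) {
            inTotal = false;
        }
    }

    // Returns the total as "amount currency", for example "290.79 USD".
    static String read(String orderXml) {
        try {
            TotalReader handler = new TotalReader();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(orderXml)), handler);
            return handler.amount.toString().trim() + " " + handler.currency;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read order", e);
        }
    }

    public static void main(String[] args) {
        String order = "<Order><Customer id=\"c32\">Chez Fred</Customer>"
                + "<Shipping method=\"USPS\" currency='USD'>8.95</Shipping>"
                + "<Total currency='USD'>290.79</Total></Order>";
        System.out.println(read(order)); // 290.79 USD
    }
}
```

The handler never looks at angle brackets or quotation marks. It sees element names, attribute values, and text, and it gives the same answer wherever the Total element appears in the order.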

Note

One of the ten original goals for XML was that "It shall be easy to write programs which process XML documents." Originally, this was interpreted as meaning that a "Desperate Perl Hacker" (DPH) could write an XML parser in a weekend. Later it became clear that XML was simply too complex, even in its simplest form, for this goal to be met. However, the understanding of this requirement evolved to mean that a typical programmer could use any of a number of free tools and libraries to process XML. Given this interpretation, the goal most certainly has been met.

The parser shields you from a lot of irrelevant details, including

How text is encoded: in Unicode, ASCII, Latin-1, SJIS, or something else
Whether lines are separated by carriage returns, line feeds, or both
How reserved characters such as < are escaped when used in the plain text parts of the document
Whether the byte order is big-endian or little-endian

None of these issues affects what the data means or what the format allows you to say. However, when designing a data format, you must decide all of them. As soon as you've said, "The underlying format of the data is XML," every one of these issues is resolved. Some are resolved simply by choosing one possible solution. (The less-than sign is escaped as &lt;.) Others are answered by allowing all possibilities and letting the parser sort things out (line endings). In all cases, the design problem is greatly simplified when you choose XML as the underlying format.
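The following sketch feeds a standard JAXP parser three small documents that differ in exactly these low-level details: one in Latin-1 with an escaped less-than sign, one with Windows-style line endings, and one in big-endian UTF-16 with a byte order mark. The class name `ParserDetails` and the sample documents are assumptions made for illustration; the calling code is identical in all three cases.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demonstration that encoding, escaping, and line endings
// are resolved by the parser before the application sees any text.
class ParserDetails {

    // Parses raw bytes and returns the text content of the root element.
    static String rootText(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Latin-1 bytes, declared in the XML declaration, with escaped markup characters.
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<Name>Caf\u00e9 &lt;Fred&gt;</Name>").getBytes(StandardCharsets.ISO_8859_1);
        // Carriage return plus line feed inside the content.
        byte[] crlf = "<Note>line one\r\nline two</Note>".getBytes(StandardCharsets.UTF_8);
        // Java's UTF_16 charset writes a byte order mark followed by big-endian data.
        byte[] utf16 = "<Zip>02882</Zip>".getBytes(StandardCharsets.UTF_16);

        System.out.println(rootText(latin1).equals("Caf\u00e9 <Fred>"));     // true
        System.out.println(rootText(crlf).equals("line one\nline two"));     // true
        System.out.println(rootText(utf16).equals("02882"));                 // true
    }
}
```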