To avoid the difficulties inherent in parsing raw XML input, almost all programs that need to process XML documents rely on an XML parser to actually read the document. The parser is a software library (in Java, it's a class) that reads the XML document and checks it for well-formedness. Client applications use method calls defined in the parser API to receive or request information that the parser retrieves from the XML document.
The parser shields the client application from all of the complex and not particularly relevant details of XML, including
One of the original goals of XML was that it be simple enough that a "Desperate Perl Hacker" (DPH) be able to write an XML parser. The exact interpretation of this requirement varied from person to person. At one extreme, the DPH was assumed to be a web designer accustomed to writing CGI scripts without any formal training in programming, who was going to hack it together in a weekend. At the other extreme, the DPH was assumed to be Larry Wall and he was allowed two months for the task. The middle ground was a smart grad student with a couple of weeks.
Whichever way you interpreted the requirement, it wasn't met. In fact, it took Larry Wall more than a couple of months just to add the Unicode support to Perl that XML assumed. Java developers already had adequate Unicode support, however, and thus Java parsers were a lot faster out of the gate. Nonetheless, it probably still isn't possible to write a fully conforming XML parser in a weekend, even in Java. Fortunately, you don't need to. There are several dozen XML parsers available under a variety of licenses that you can use. In 2002, there's very little need for any programmer to write his or her own parser. Unless you have very unusual requirements, the chance that you can write a better parser than Sun, IBM, the Apache XML Project, and numerous others have already written is quite small.
Java 1.4 is the first version of Java to include an XML parser as a standard feature. With earlier Java versions, you need to download a parser from the Web and install it in the usual way, typically by putting its .jar file in your jre/lib/ext directory. Even with Java 1.4, you may well want to replace the standard parser with a different one that provides additional features or is simply faster with your documents.
If you're using Windows, then chances are good you have two different ext directories, one where you installed the JDK, such as C:\jdk1.3.1\jre\lib\ext , and one in your Program Files folder, probably C:\Program Files\Javasoft\jre\1.3.1\lib\ext . The first is used for compiling Java programs, the second for running them. To install a new class library, you need to place the relevant JAR file in both directories. It is not sufficient to place the JAR archive in one and a shortcut in the other. You need to place full copies in each ext directory.
Choosing an XML API
The most important decision you'll make at the start of an XML project is choosing the application programming interface (API) that you'll use. Many APIs are implemented by multiple vendors, so if the specific parser gives you trouble, you can swap in an alternative, often without even recompiling your code. However, if you choose the wrong API, changing to a different one may well involve redesigning and rebuilding the entire application from scratch. Of course, as Fred Brooks taught us, "In most projects, the first system built is barely usable. It may be too slow, too big, awkward to use, or all three. There is no alternative but to start again, smarting but smarter, and build a redesigned version in which these problems are solved. Hence plan to throw one away; you will, anyhow." Still, it is much easier to change parsers than it is to change APIs.
There are two major standard APIs for processing XML documents with Java: the Simple API for XML (SAX) and the Document Object Model (DOM). Each of these comes in several versions. In addition, there is a host of other, somewhat idiosyncratic APIs, including JDOM, dom4j, ElectricXML, and XMLPULL. Finally, each specific parser generally has a native API that it exposes below the level of the standard APIs. For example, the Xerces parser has the Xerces Native Interface (XNI). However, picking such an API limits your choice of parser, and indeed may even tie you to one particular version of the parser, since parser vendors tend not to worry a great deal about maintaining native compatibility between releases. Each of these APIs has its own strengths and weaknesses.
SAX, the Simple API for XML, is the gold standard of XML APIs. It is the most complete and correct by far. Given a fully validating parser that supports all its optional features, there is very little you can't do with it. It has one or two holes, but those are really off in the weeds of the XML specifications, and you have to look pretty hard to find them. SAX is an event-driven API. The SAX classes and interfaces model the parser, the stream from which the document is read, and the client application receiving data from the parser. However, no class models the XML document itself. Instead the parser feeds content to the client application through a callback interface, much like the ones used in Swing and the AWT. This makes SAX very fast and very memory efficient (since it doesn't have to store the entire document in memory). However, SAX programs can be harder to design and code because you normally need to develop your own data structures to hold the content from the document.
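The callback style described above can be sketched with the JAXP classes bundled with Java 1.4. The ElementCounter class and the sample document here are my own invention for illustration, not taken from any parser's documentation; the parser invokes startElement() once for each start-tag it encounters, and the application accumulates whatever data structure it needs:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// A SAX client application: the parser pushes events to this handler,
// which keeps only a running count rather than the whole document.
public class ElementCounter extends DefaultHandler {

    private int count = 0;

    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        count++;  // called by the parser once per start-tag
    }

    public static int countElements(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        ElementCounter handler = new ElementCounter();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<order><item/><item/></order>"));
    }
}
```

Because only the count survives each callback, memory use is constant no matter how large the input document grows.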
SAX works best when processing is fairly local; that is, when all the information you need to use is close together in the document (for example, if you were processing one element at a time). Applications that require access to the entire document at once in order to take useful action would be better served by one of the tree-based APIs, such as DOM or JDOM. Finally, because SAX is so efficient, it's the only real choice for truly huge XML documents. Of course, "truly huge" needs to be defined relative to available memory. However, if the documents you're processing are in the gigabyte range, you really have no choice but to use SAX.
DOM, the Document Object Model, is a fairly complex API that models an XML document as a tree. Unlike SAX, DOM is a read-write API. It can both parse existing XML documents and create new ones. Each XML document is represented as a Document object. Documents are searched, queried, and updated by invoking methods on this Document object and the objects it contains. This makes DOM much more convenient when random access to widely separated parts of the original document is required. However, it is quite memory intensive compared with SAX, and not nearly as well suited to streaming applications.
JAXP, the Java API for XML Processing, bundles SAX and DOM together along with some factory classes and the TrAX XSLT API. (TrAX is not a general-purpose XML API like SAX and DOM. I'll get to it in Chapter 17.) JAXP is a standard part of Java 1.4 and later. However, it is not really a different API. When starting a new program, you ask yourself whether you should choose SAX or DOM. You don't ask yourself whether you should use SAX or JAXP, or DOM or JAXP. SAX and DOM are part of JAXP.
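A minimal DOM-through-JAXP sketch looks like the following; the DomInspector class name and sample document are hypothetical. The JAXP factory hands back some vendor's DocumentBuilder, which parses the whole document into an in-memory Document tree that can then be queried at leisure:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomInspector {

    // Parses the XML into a tree and returns the root element's name.
    public static String rootName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        return root.getTagName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootName("<inventory><book/><book/></inventory>"));
    }
}
```

Unlike the SAX version, the entire tree stays in memory after parse() returns, so the program can revisit any part of the document in any order.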
JDOM is a Java-native tree-based API that attempts to remove a lot of DOM's ugliness. The JDOM mission statement is, "There is no compelling reason for a Java API to manipulate XML to be complex, tricky, unintuitive, or a pain in the neck." And for the most part JDOM delivers. Like DOM, JDOM reads the entire document into memory before it begins to work on it; and the broad outline of JDOM programs tends to be the same as for DOM programs. However, the low-level code is a lot less tricky and ugly than the DOM equivalent. JDOM uses concrete classes and constructors rather than interfaces and factory methods. It uses standard Java coding conventions, methods, and classes throughout. JDOM programs often flow a lot more naturally than the equivalent DOM program.
I think JDOM often does make the easy problems easier; but in my experience, JDOM also makes the hard problems harder. Its design shows a very solid understanding of Java, but the XML side of the equation feels much rougher. It's missing some crucial pieces, such as a common node interface or superclass for navigation. JDOM works well (and much better than DOM) on fairly simple documents with no recursion, limited mixed content, and a well-known vocabulary. It begins to show some weakness when asked to process arbitrary XML. When I need to write programs that operate on any XML document, I tend to find DOM simpler despite its ugliness.
dom4j was forked from the JDOM project fairly early on. Like JDOM, it is a Java-native, tree-based, read-write API for processing generic XML. However, it uses interfaces and factory methods rather than concrete classes and constructors. This enables you to plug in your own node classes that put XML veneers on other forms of data such as objects or database records. (In theory, you could do this with DOM interfaces too; but in practice most DOM implementations are too tightly coupled to interoperate with each other's classes.) It does have a generic node type that can be used for navigation.
ElectricXML is yet another tree-based API for processing XML documents with Java. It's quite small, which makes it suitable for use in applets and other storage-limited environments. It's the only API I mention here that isn't open source, and the only one that requires its own parser rather than being able to plug into multiple different parsers. It has gained a reputation as a particularly easy-to-use API. However, I'm afraid its perceived ease-of-use often stems from catering to developers' misconceptions about XML. It is far and away the least correct of the tree-based APIs. For example, it tends to throw away a lot of white space it shouldn't, and its namespace handling is poorly designed. Ideally, an XML API should be as simple as it can be and no simpler. In particular, it should not be simpler than XML itself is. ElectricXML pretends that XML is less complex than it really is, which may work for a while as long as your needs are simple, but will ultimately fail when you encounter more complex documents. The only reason I mention it here is because the flaws in its design aren't always apparent to casual users, and I tend to get a lot of e-mail from ElectricXML users asking me why I'm ignoring it.
SAX is fast and very efficient, but its callback nature is uncomfortable for some programmers. Recently some effort has gone into developing pull parsers that can read streaming content the way that SAX does, but only when the client application requests it. The recently published standard API for such parsers is XMLPULL. XMLPULL shows promise for the future (especially for developers who need to read large documents quickly but just don't like callbacks). However, pull parsing is still clearly in its infancy. On the XML side, namespace support is turned off by default. Even worse, XMLPULL ignores the DOCTYPE declaration, even the internal DTD subset, unless you specifically ask it to read it. From the Java side of things, XMLPULL does not take advantage of polymorphism, relying instead on such un-OOP constructs as int type codes to distinguish nodes instead of making them instances of different classes or interfaces. I don't think XMLPULL is ready for prime time quite yet. None of this is unusual for such a new technology, however. Some of the flaws I cite were also present in earlier versions of SAX, DOM, and JDOM and were only corrected in later releases. In the next couple of years, as pull parsing evolves, XMLPULL may become a much more serious competitor to SAX.
Recently, there has been a flood of so-called data-binding APIs that try to map XML documents into Java classes. Although DOM, JDOM, and dom4j all map XML documents into Java classes, these data-binding APIs attempt to go further, mapping a Book document into a Book class rather than just a generic Document class, for example. These are sometimes useful in very limited and predictable domains. But they tend to make too many assumptions that simply aren't true in the general case to make them broadly suitable for XML processing. In particular, these products tend to depend implicitly on one or more of the following common fallacies:
The fundamental flaw in these schemes is an insistence on seeing the world through object-colored glasses. XML documents can be used for object serialization, and in that use case all of these assumptions are reasonably accurate, but XML is a lot more general than that. The vast majority of XML documents cannot plausibly be understood as serialized objects, although a lot of programmers come at it from that point of view because that's what they're familiar with. Once again, when you're an expert with a hammer, it's not surprising that the world looks like it's full of nails.
The truth is, XML documents are not objects and schemas are not classes. The constraints and structures that apply to objects simply do not apply to XML elements and vice versa. Unlike Java objects, XML elements routinely violate their declared types, if indeed they even have a type in the first place. Even valid XML elements often have different content in different locations. Mixed content is quite common. Recursive content isn't as common, but it does exist. A little more subtle but even more important, XML structures are based on hierarchy and position rather than on the explicit pointers of object systems. It is possible to map one to the other, but the resulting structures are ugly and fragile. When you're finished, what you've accomplished tends to merely reinvent DOM. XML needs to be approached and understood on its own terms, not Java's. Data-binding APIs are just a little too limited to interest me, and I do not plan to treat them in this book.
Choosing an XML Parser
When choosing a parser library, many factors come into play. These include what features the parser has, how much it costs, which APIs it implements, how buggy it is, and, last and certainly least, how fast the parser parses.
The XML 1.0 specification does allow parsers some leeway in how much of the specification they implement. Parsers can be divided roughly into three categories:
In practice there's also a fourth category of parsers that read the instance document but do not perform all of the mandated well-formedness checks. Technically such parsers are not allowed by the XML specification, but there are still a lot of them out there.
If the documents you're processing have DTDs, then you need to use a fully validating parser. You don't necessarily have to turn on validation if you don't want to. However, XML is designed such that you really can't be sure to get the full content of an XML document without reading its DTD. In some cases, the differences between a document whose DTD has been processed and the same document whose DTD has not been processed can be huge. For example, a parser that reads the DTD will report default attribute values, but one that doesn't won't. The handling of ignorable white space can vary between a validating parser and a parser that merely reads the DTD but does not validate. Furthermore, external entity references will be expanded by a validating parser, but not necessarily by a non-validating parser. You should use a non-validating parser only if you're confident none of the documents you'll process carry document type declarations. One situation in which this is reasonable is with a SOAP server or client, because SOAP specifically prohibits documents from using DOCTYPE declarations. (But even in that case, I still recommend that you check for a DOCTYPE declaration and throw an exception if you spot one.)
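The default-attribute behavior is easy to demonstrate. In this sketch (the class name and the price vocabulary are invented for illustration), the currency attribute never appears in the instance document itself; it exists only as a default declared in the internal DTD subset, yet a parser that processes the DTD reports it anyway:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DefaultAttributeDemo {

    // Parses a document whose only currency attribute is a DTD default,
    // and returns the value the parser reports for the root element.
    public static String defaultCurrency() throws Exception {
        String xml =
            "<!DOCTYPE price [\n"
            + "  <!ELEMENT price (#PCDATA)>\n"
            + "  <!ATTLIST price currency CDATA \"USD\">\n"
            + "]>\n"
            + "<price>10.00</price>";
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("currency");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(defaultCurrency());
    }
}
```

A parser that skipped the DTD entirely would report an empty string here instead, which is exactly the kind of silent data loss this section warns about.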
Beyond the lines set out by XML 1.0, parsers also differ in their support for subsequent specifications and technologies. In 2002, all parsers worth considering support namespaces and automatically check for namespace well-formedness as well as XML 1.0 well-formedness. Most of these parsers do allow you to disable these checks for the rare legacy documents that don't adhere to namespace rules. Currently, Xerces and Oracle are the only Java parsers that support schema validation, although other parsers are likely to add this in the future.
Some parsers also provide extra information not required for normal XML parsing. For example, at your request, Xerces can inform you of the ELEMENT, ATTLIST, and ENTITY declarations in the DTD. Crimson will not do this, so if you need to read the DTD, pick Xerces over Crimson.
Most of the major parsers support both SAX and DOM. But a few parsers only support SAX, and at least a couple only support their own proprietary API. If you want to use DOM or SAX, make sure you pick a parser that can handle it. Xerces, Oracle, and Crimson can.
SAX actually includes a number of optional features that parsers are not required to support. These include validation, reporting comments, reporting declarations in the DTD, reporting the original text of the document before parsing, and more. If any of these are important to you, you'll need to make sure that your parser supports them, too.
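Checking for an optional feature at runtime is straightforward with the SAX2 getFeature()/setFeature() methods, which identify each feature by URI. This sketch (the class name is invented) asks the underlying parser whether a given feature is currently on, treating features the parser has never heard of as simply absent:

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

public class FeatureCheck {

    // Returns true if the parser recognizes the feature and it is enabled.
    public static boolean isFeatureOn(String feature) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        try {
            return reader.getFeature(feature);
        } catch (SAXNotRecognizedException e) {
            return false;  // this parser has never heard of the feature
        } catch (SAXNotSupportedException e) {
            return false;  // recognized, but its state can't be reported now
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            isFeatureOn("http://xml.org/sax/features/namespaces"));
    }
}
```

The same try/catch pattern works for setFeature(), so an application can request validation or declaration reporting and degrade gracefully when the parser can't oblige.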
The other APIs, including JDOM and dom4j, generally don't provide parsers of their own. Instead they use an existing SAX or DOM parser to read a document, which they then convert into their own tree model. Thus they can work with any convenient parser. The notable exception here is ElectricXML, which does include its own built-in parser. ElectricXML is optimized for speed and size and does not interoperate well with SAX and DOM.
One often overlooked consideration when choosing a parser is the license under which the parser is published. Most parsers are free in the free-beer sense, and many are also free in the free-speech sense. However, license restrictions can still get in your way.
Because parsers are essentially class libraries that are dynamically linked to your code (as all Java libraries are), and because parsers are generally released under fairly lenient licenses, you don't have to worry about viral infections of your code with the GPL. In one case I'm aware of, Ælfred, any changes you make to the parser itself would have to be donated back to the community; but this would not affect the rest of your classes. That being said, you'll be happier and more productive if you do donate your changes back to the communities for the more liberally licensed parsers such as Xerces. It's better to have your changes rolled into the main code base than to have to keep applying them every time a new version is released.
There actually aren't too many parsers you can buy. If your company is really insistent about not using open source software, then you can probably talk IBM into selling you an overpriced license for their XML for Java parser (which is just an IBM-branded version of the open source Xerces). However, there isn't a shrink-wrapped parser you can buy, nor is one really needed. The free parsers are more than adequate.
An often overlooked criterion for choosing a parser is correctness: how much of the relevant specifications are implemented and how well. All of the parsers I've used have had nontrivial bugs in at least some versions. Although no parser is perfect, some parsers are definitely more reliable than others.
I wish I could say that there was one or more good choices here, but the fact is that every single parser I've ever tried has sooner or later exhibited significant conformance bugs. Most of the time these fall into two categories:
It's hard to say which is worse. On one hand, unnecessarily rejecting well-formed documents prevents you from handling data others send you. On the other hand, when a parser fails to report incorrect XML documents, it's virtually guaranteed to cause problems for people and systems who receive the malformed documents and correctly reject them.
One thing I will say is that well-formedness is the most important criterion of all. To be seriously considered, a parser has to be absolutely perfect in this area, and many aren't. A parser must allow you to confidently determine whether a document is or is not well-formed. Validity errors are not quite as important, but they're still significant. Many programs can ignore validity and consequently ignore any bugs in the validator.
Continuing downward in the hierarchy of seriousness are failures to properly implement the standard SAX and DOM APIs. A parser might correctly detect and report all well-formedness and validity errors, but fail to pass on the contents of the document. For example, it might throw away ignorable white space rather than making it available to the application. Even less serious but still important are violations of the contracts of the various public APIs. For example, DOM guarantees that each Text object read from a parsed document will contain the longest-possible string of characters uninterrupted by markup. However, I have seen parsers that occasionally passed in adjacent text nodes as separate objects rather than merging them.
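The text-node contract is easy to see with DOM's normalize() method. This sketch (the class name is invented) deliberately builds two adjacent text nodes by hand, then merges them into the single node that a conforming parser would have reported in the first place:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeDemo {

    // Creates an element with two adjacent text nodes, normalizes it,
    // and returns how many child nodes remain (should be exactly one).
    public static int childCountAfterNormalize() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element root = doc.createElement("p");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode("Hello, "));
        root.appendChild(doc.createTextNode("world"));
        // Before normalize(), root has two separate Text children.
        root.normalize();  // merges adjacent text nodes into one
        return root.getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childCountAfterNormalize());
    }
}
```

Defensive code that can't trust its parser sometimes calls normalize() on freshly parsed documents for exactly this reason.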
Java parsers are also subject to a number of edge conditions. For example, in SAX each attribute value is passed to the client application as a single string. Because the Java String class is backed by an array of chars indexed by an int, the maximum number of chars in a String is the same as the maximum size of an int: 2,147,483,647. However, there is no maximum number of characters that may appear in an attribute value. Admittedly a three-gigabyte attribute value doesn't seem too likely (perhaps a base64 encoded video?), and you'd probably run out of memory long before you bumped up against the maximum size of a string. Nonetheless, XML doesn't prohibit strings of such lengths, and it would be nice to think that Java could at least theoretically handle all XML documents within the limits of available memory.
The last consideration is efficiency, or how fast the parser runs and how much memory it uses. Let me stress that again: Efficiency should be your last concern when choosing a parser. As long as you use standard APIs and keep parser-dependent code to a minimum, you can always change the underlying parser later if the one you picked initially proves too inefficient.
The speed of parsing tends to be dominated by I/O considerations. If the XML document is served over the network, it's entirely possible that the bottleneck is the speed with which data can move over the network, not the XML parsing at all. When the XML is read from the disk instead, the time to read the data can still be significant, even if not quite the bottleneck it is in network applications.
Anytime you're reading data from a disk or the network, remember to buffer your streams. You can buffer at the byte level with a BufferedInputStream or at the character level with a BufferedReader. Perhaps a little counter-intuitively, you can gain extra speed by double buffering with both byte and character buffers. However, most parsers are happier if you feed them a raw InputStream and let them convert the bytes to characters (parsers normally detect the correct encoding better than most client code). Therefore, I prefer to use just a BufferedInputStream, and not a BufferedReader, unless speed is very important and I'm very sure of the encoding in advance. If you don't buffer your I/O, then I/O considerations will limit total performance no matter how fast the parser is.
Complicated programs can also be dominated by processing that happens after the document is parsed. For example, if the XML document lists store locations, and the client application is attempting to solve the traveling salesman problem for those store locations, then parsing the XML document is the least of your worries. In such a situation, changing the parser isn't going to help very much at all. The time taken to parse a document normally grows only linearly with the size of the document.
One area in which parser choice does make a significant difference is in the amount of memory used. SAX is generally quite efficient no matter which parser you pick. However, DOM is exactly the opposite. Building a DOM tree can easily eat up as much as ten times the size of the document itself. For example, given a one-megabyte document, the DOM object representing it could be ten megabytes. If you're using DOM or any other tree-based API to process large documents, then you want a parser that uses as little memory as possible. The initial batch of DOM-capable parsers were not really optimized for space, but more recent versions are doing a lot better. With some testing you should be able to find parsers that use only two to three times as much memory as the original document. Still, it's pretty much guaranteed that the memory usage will be larger than the document itself.
I now want to discuss a few of the more popular parsers and the relative advantages and disadvantages of each.
I'll begin with my parser of choice, Xerces-J [http://xml.apache.org/xerces-j/] from the Apache XML Project. In my experience, this very complete, validating parser has the best conformance to the XML 1.0 and Namespaces in XML specifications I've encountered. It fully supports the SAX2 and DOM Level 2 APIs, as well as JAXP, although I have encountered a few bugs in the DOM support. The latest versions feature experimental support for parts of the DOM Level 3 working drafts. Xerces-J is highly configurable and suitable for almost any parsing need. Xerces-J is also notable for being the first parser to support the W3C XML Schema Language, although that support is not yet 100 percent complete or bug-free.
The Apache XML Project publishes Xerces-J under the very liberal open source Apache license. Essentially, you can do anything you like with it except use the Apache name in your own advertising. Xerces-J 1.x was based on IBM's XML for Java parser [http://www.alphaworks.ibm.com/tech/xml4j], whose code base IBM donated to the Apache XML Project. Today, the relationship is reversed, and XML for Java is based on Xerces-J 2.x. However, in neither version is there significant technical difference between Xerces-J and XML for Java. The real difference is that if you work for a large company with a policy that prohibits using software from somebody you can't easily sue, then you can probably pay IBM a few thousand dollars for a support contract for XML for Java. Otherwise, you might as well just use Xerces-J.
The Apache XML Project also publishes Xerces-C, an open source XML parser written in C++, which is based on IBM's XML for C++ product. But because this is a book about Java, all future references to the undifferentiated name Xerces should be understood as referring strictly to the Java version (Xerces-J).
Crimson, previously known as Java Project X, is the parser Sun bundles with the JDK 1.4. Crimson supports more or less the same APIs and specifications as Xerces does (SAX2, DOM2, JAXP, XML 1.0, Namespaces in XML, and so on), with the notable exception of schemas. In my experience, Crimson is somewhat buggier than Xerces. I've encountered well-formed documents that Crimson incorrectly reported as malformed but which Xerces could parse without any problems. All of the bugs I've encountered with Xerces related to validation, not to the more basic criterion of well-formedness.
The reason Crimson exists is that some Sun engineers disagreed with some IBM engineers about the proper internal design for an XML parser. (Also, the IBM code was so convoluted that nobody outside of IBM could figure it out.) Crimson was supposed to be significantly faster, more scalable, and more memory efficient than Xerces, and not get soggy in milk either. However, whether it's actually faster than Xerces (much less significantly faster) is questionable. When first released, Sun claimed that Crimson was several times faster than Xerces. But IBM ran the same benchmarks and got almost exactly opposite results, claiming that Xerces was several times faster than Crimson. After a couple of weeks of hooting and hollering on several mailing lists, the true cause was tracked down. Sun had heavily optimized Crimson for the Sun virtual machine and just-in-time compiler and naturally ran the tests on Sun virtual machines. IBM publishes its own Java virtual machine and was optimizing for and benchmarking on that.
To no one's great surprise, Sun's optimizations didn't perform nearly as well when run on non-Sun virtual machines; and IBM's optimizations didn't perform nearly as well when run on non-IBM virtual machines. To quote Donald Knuth, "Premature optimization is the root of all evil."  Eventually both Sun and IBM began testing on multiple virtual machines and watching out for optimizations that were too tied to the architecture of any one virtual machine; and now both Xerces-J and Crimson seem to run about equally fast on equivalent hardware, regardless of the virtual machine.
The real benefit to Crimson is that it's bundled with the JDK 1.4. (Crimson does work with earlier virtual machines back to Java 1.1. It just isn't bundled with them in the base distribution.) Thus if you know you're running in a Java 1.4 or later environment, you don't have to worry about installing extra JAR archives and class libraries just to parse XML. You can write your code to the standard SAX and DOM classes and expect it to work out of the box. If you want to use a parser other than Crimson, then you can still install the JAR files for Xerces-J or some other parser and load its implementation explicitly. However, in Java 1.3 and earlier (which is the vast majority of the installed base at the time of this writing), you have to include some parser library with your own application.
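The JAXP lookup mechanism behind this can be seen in the following sketch (the class name is invented). The commented-out property setting shows how you would substitute a different implementation such as Xerces-J, assuming its jar is actually installed on the classpath; without that assumption, the code simply reports whichever factory class the JAXP lookup finds:

```java
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {

    // Returns the concrete class JAXP chose as the SAX parser factory.
    public static String factoryClassName() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        return factory.getClass().getName();
    }

    public static void main(String[] args) {
        // To force a specific implementation, set this property before the
        // first factory lookup. The class name below is Xerces-J's factory
        // and only works when the Xerces jar is on the classpath:
        //
        // System.setProperty("javax.xml.parsers.SAXParserFactory",
        //     "org.apache.xerces.jaxp.SAXParserFactoryImpl");

        // Otherwise JAXP falls back to the bundled default parser:
        System.out.println(factoryClassName());
    }
}
```

The same property-based override works for DocumentBuilderFactory, so a deployment can swap parsers without touching application code.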
Going forward, Sun and IBM are cooperating on Xerces-2, which will probably become the default parser in a future release of the JDK. Crimson is unlikely to be developed further or gain support for new technologies like XInclude and schemas.
The GNU Classpath Extensions Project's Ælfred [http://www.gnu.org/software/classpathx/jaxp/] is actually two parsers, gnu.xml.aelfred2.SAXDriver and gnu.xml.aelfred2.XmlReader. SAXDriver aims for a small footprint rather than a large feature set. It supports XML 1.0 and Namespaces in XML. However, it implements only the minimum part of XML 1.0 needed for conformance. For example, it neither resolves external entities nor validates. Nor does it make all the well-formedness checks it should make, so it can miss malformed documents. It supports SAX but not DOM. Its small size makes it particularly well suited for applets. For less resource-constrained environments, Ælfred provides XmlReader, a fully validating, fully conformant parser that supports both SAX and DOM.
Ælfred was originally written by the now-defunct Microstar, which placed it in the public domain. David Brownell picked up development of the parser and brought it under the aegis of the GNU Classpath Extensions Project [http://www.gnu.org/software/classpathx/], an attempt to reimplement Sun's Java extension libraries (the javax packages) as free software. Ælfred is published under the GNU General Public License with library exception. In brief, this means that as long as you only call Ælfred through its public API and don't modify the source code yourself, the GPL does not infect your code.
Yuval Oren's Piccolo [http://piccolo.sourceforge.net/] is the newest entry into the parser arena. It is a very small, very fast, open source, nonvalidating XML parser. However, it does read the external DTD subset, apply default attribute values, and resolve external entity references. Piccolo supports the SAX API exclusively; it does not have a DOM implementation.