18.1 Common XML Processing Models | XML in a Nutshell, Third Edition

XML's structured and tagged text can be processed by developers in several ways. Programs can look at XML as plain text, as a stream of events, as a tree, or as a serialization of some other structure. Tools supporting all of these options are widely available.

18.1.1 Text-Based XML Processing

At their foundation, XML documents are text. The content and markup are both represented as text, and text-editing tools can be extremely useful for XML document inspection, creation, and modification. XML's textual foundations make it possible for developers to work with XML directly, using XML-specific tools only when they choose to.

One of the original design goals of XML was for documents to be easy to parse. For very simple documents that do not depend on features such as attribute defaulting and validation, it is possible to parse tags, attributes, and text data using standard programming tools such as regular expressions and tokenizers, but the complexity of processing grows rapidly as documents use more features. Unless the application can completely control the content of incoming documents, it is almost always preferable to use one of the many high-quality XML parsers that are freely available for most programming languages.

Textual tools are a key part of the XML toolset, however. Many developers use text editors such as vi, Emacs, NotePad, WordPad, BBEdit, and UltraEdit to create or modify XML documents. Regular expressions, in environments such as sed, grep, Perl, and Python, can be used for search and replace or for tweaking documents prior to XML parsing or XSLT processing. Various standards are beginning to take advantage of regular expression matching after a particular document has been parsed. The W3C's XML Schema recommendation, for instance, includes regular-expression matching as one mechanism for validating data types, as discussed in Chapter 17.

Text-based processing can be performed in conjunction with other XML processing. Parsing and then serializing XML documents after other processing has taken place doesn't always produce the desired results. XSLT, for instance, will remove entity references and replace them with entity content. Preserving entities requires replacing them in the original document with unique placeholders, and then replacing the placeholder as it appears in the result. With regular expressions, this is quite easy to do.
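The placeholder technique described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete solution: the function names and the `[[ENT:name]]` placeholder format are assumptions made for this example, and the pattern does not distinguish entity references from look-alike text inside CDATA sections or comments.

```python
import re

# Matches general entity references such as &company; but not
# character references such as &#169; or &#xA9;.
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.\-:]*);")

# Hypothetical placeholder format. Square brackets cannot occur in an
# XML name, so the end of the placeholder is unambiguous.
PLACEHOLDER = re.compile(r"\[\[ENT:([^\]]+)\]\]")


def protect_entities(text):
    """Replace each entity reference with a plain-text placeholder."""
    return ENTITY_REF.sub(lambda m: "[[ENT:" + m.group(1) + "]]", text)


def restore_entities(text):
    """Turn placeholders in the processed result back into entity references."""
    return PLACEHOLDER.sub(lambda m: "&" + m.group(1) + ";", text)
```

The document would be passed through `protect_entities` before parsing or transformation and through `restore_entities` afterward. Any DTD declarations for the protected entities are no longer needed for the intermediate parse, since the placeholders are ordinary text.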

XML's dependence on Unicode means that developers need to be careful about the text-processing tools they choose. Many development environments have been upgraded to support Unicode, but there are still tools available that don't. Before using text-processing tools on the results of an XML parse, make sure they support Unicode. Text-processing tools being applied to raw XML documents must support the character encoding used for the document. Most modern languages (including Java, C#, Perl 5.6, and Python 2.2) and tools support Unicode. The difficult cases tend to arise in C and C++, where you have to worry about using wchar_t versus char and understand what a wchar_t actually is on a particular platform.

18.1.2 Event-Driven XML Processing

As an XML parser reads a document, it moves from the beginning of the document to the end. It may pause to retrieve external resources (for a DTD or an external entity, for instance), but it builds an understanding of the document as it moves along. Tree-based XML technologies (such as the DOM) combine these incremental parsing events into a monolithic image of an XML document once parsing has been completed successfully.

Event-based parsers, on the other hand, report these interim events to their client applications as they happen. Some common parsing events are element start-tag read, element content read, and element end-tag read. For example, consider the document in Example 18-1.

Example 18-1. Simple XML document

 <name><given>Keith</given><family>Johnson</family></name>

An event-based parser might report events such as this:

 startElement:name
 startElement:given
 content:Keith
 endElement:given
 startElement:family
 content:Johnson
 endElement:family
 endElement:name

The list and structure of events can become much more complex as features such as namespaces, attributes, whitespace between elements, comments, processing instructions, and entities are added, but the basic mechanism is quite simple and generally very efficient.
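As a rough illustration, the event list above can be reproduced with the SAX support in Python's standard library. The `EventLogger` class is a name invented for this sketch. A parser is free to deliver character data in more than one chunk, so real handlers should not assume a single `characters` call per text run.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each parsing event as a string, in the order it arrives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("startElement:" + name)

    def characters(self, content):
        self.events.append("content:" + content)

    def endElement(self, name):
        self.events.append("endElement:" + name)


DOC = b"<name><given>Keith</given><family>Johnson</family></name>"

handler = EventLogger()
xml.sax.parseString(DOC, handler)

for event in handler.events:
    print(event)
```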

Event-based applications are generally more complex than tree-based applications. Processing events typically means the creation of a state machine, code that understands the current context and can route the information in the events to the proper consumer. Because events occur as the document is read, applications must be prepared to discard results should a fatal error occur partway through the document. Also, accessing a wide variety of data scattered throughout a document is much more involved than it would be if the entire document were parsed into a tree structure.
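A small sketch of such a state machine, again using Python's standard SAX module, might look like the following. The handler keeps a stack of open elements as its context, routes text to the right slot, and the caller discards partial results if the parse fails. The names `NameHandler` and `parse_name` are hypothetical.

```python
import xml.sax


class NameHandler(xml.sax.ContentHandler):
    """Collect the given and family names from a <name> element."""

    def __init__(self):
        super().__init__()
        self.path = []      # stack of open element names: the current context
        self.buffer = []    # text chunks seen since the last start-tag
        self.result = {}

    def startElement(self, name, attrs):
        self.path.append(name)
        self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        self.buffer.append(content)

    def endElement(self, name):
        # Route text only when the context is name/given or name/family.
        if name in ("given", "family") and self.path[-2:] == ["name", name]:
            self.result[name] = "".join(self.buffer)
        self.path.pop()
        self.buffer = []


def parse_name(data):
    """Return the collected names, or None if the document is not well-formed."""
    handler = NameHandler()
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException:
        return None  # a fatal error occurred: discard the partial results
    return handler.result
```

Even for a document this small, the handler needs three pieces of state. A tree-based program would need none.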

The upside to an event-based API is speed and efficiency. Because event-based APIs stream the document to the client application, your program can begin working with the data from the beginning of the document before the end of the document is seen. It doesn't have to wait for the entire document to be read before commencing. For instance, a brokerage program receiving a long list of requests to buy individual stocks could execute the first trade before the parser reads the second trade, execute the second trade before the parser reads the third trade, and so forth. This could save crucial seconds on the initial trades if the document includes many separate orders.

Even more important than speed is size. XML documents can be quite large, sometimes ranging into the gigabytes. An event-based API does not need to store all this data in memory at one time. It can process the document in small, easily handled chunks, then reclaim that storage. In practice, even on the largest, beefiest servers with gigabytes of RAM, XML documents larger than a couple of hundred megabytes can't be processed with a tree-based API. In an embedded environment (like a cell phone), memory limitations mandate streaming APIs.
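The brokerage scenario can be sketched with Python's incremental SAX parser, which accepts a document in pieces through `feed`. The order format (an `order` element with `symbol` and `quantity` attributes) and the `OrderHandler` class are assumptions made for this example. Exactly when an event is delivered relative to a `feed` call depends on the underlying parser's buffering.

```python
import xml.sax


class OrderHandler(xml.sax.ContentHandler):
    """Call execute() for each order as soon as its start-tag has been read."""

    def __init__(self, execute):
        super().__init__()
        self.execute = execute

    def startElement(self, name, attrs):
        if name == "order":
            self.execute(attrs["symbol"], int(attrs["quantity"]))


executed = []
parser = xml.sax.make_parser()
parser.setContentHandler(OrderHandler(lambda s, q: executed.append((s, q))))

# The document arrives in two pieces, as it might from a network connection.
parser.feed('<orders><order symbol="AAA" quantity="100"/>')
first_batch = list(executed)  # orders handled before the rest has arrived

parser.feed('<order symbol="BBB" quantity="50"/></orders>')
parser.close()
```

Nothing from an order needs to be retained after `execute` returns, so memory use stays flat no matter how many orders the document contains.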

Event-based parsers also more naturally fit certain tasks, such as content filtering. Filters can process and modify events before passing them to another processor, efficiently performing a wide range of transformations. Filters can be chained, providing a relatively simple means of building XML processing pipelines, where the information from one processor flows directly into another. Applications that want to feed information directly from XML documents into their own internal structures may find events to be the most efficient means of doing that. Even parsers that report XML documents as complete trees, as described in the next section, typically build those trees from a stream of events.
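A filter chain of this kind can be sketched with `XMLFilterBase` and `XMLGenerator` from Python's `xml.sax.saxutils` module. The two filters and the `run_pipeline` function are hypothetical examples: one renames an element, the other uppercases text, and the generator at the end of the chain serializes whatever events reach it.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameFilter(XMLFilterBase):
    """Rename one element and pass every other event through unchanged."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self.old, self.new = old, new

    def startElement(self, name, attrs):
        super().startElement(self.new if name == self.old else name, attrs)

    def endElement(self, name):
        super().endElement(self.new if name == self.old else name)


class UppercaseText(XMLFilterBase):
    """Uppercase all character data."""

    def characters(self, content):
        super().characters(content.upper())


def run_pipeline(xml_text):
    out = io.StringIO()
    parser = xml.sax.make_parser()
    # Events flow: parser -> RenameFilter -> UppercaseText -> XMLGenerator.
    chain = UppercaseText(RenameFilter(parser, "family", "surname"))
    chain.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    chain.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


result = run_pipeline("<name><given>Keith</given><family>Johnson</family></name>")
```

Each filter overrides only the events it cares about. Everything else is forwarded by the base class, which is what makes filters easy to stack.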

The Simple API for XML (SAX), described in Chapter 20 and Chapter 26, is the most commonly used event-based API. SAX2, the current version, is hosted at http://sax.sourceforge.net/. Expat, a widely used XML parser written in C, also uses an event-based API. For information on the expat parser and its API, see http://expat.sourceforge.net.

18.1.3 Tree-Based XML Processing

XML documents, because of the requirements for well-formedness, can be readily described using tree structures. Elements are inherently hierarchical, as they may contain other elements, text content, comments, and so forth.

There is a wide variety of tree models for XML documents. XPath (described in Chapter 9), used in XSLT transformations, has a slightly different set of expectations than does the Document Object Model (DOM) API, which is also different from the XML Information Set (Infoset), another W3C project. XML Schema (described in Chapter 17 and Chapter 22) defines a Post-Schema Validation Infoset (PSVI), which has more information in it (derived from the XML Schema) than any of the others.

Developers who want to manipulate documents from their programs typically use APIs that provide access to an object model representing the XML document. Tree-based APIs typically present a model of an entire document to an application once parsing has successfully concluded. Applications don't have to worry about manually maintaining parsing context or partial processing when a parse error is encountered, as the tree-based parser generally handles errors on its own. Rather than following a stream of events, an application can just navigate through the tree to find the desired pieces of a document.

Working with a tree model has substantial advantages. The entire document is always available, and moving well-balanced portions of a document from one place to another or modifying them is fairly easy. The complete context for any given part of the document is always available. When using APIs that support it, developers can use XPath expressions to locate content and make decisions based on content anywhere in the document. (DOM Level 3 adds formal support for XPath, and various implementations already provide their own nonstandard support.)
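These advantages can be illustrated with Python's ElementTree module, used here as a stand-in for tree APIs in general. It is not a DOM implementation, and it supports only a subset of XPath. The `people` document below is invented for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<people>
  <person id="p1"><name><given>Keith</given><family>Johnson</family></name></person>
  <person id="p2"><name><given>Ada</given><family>Lovelace</family></name></person>
</people>"""

root = ET.fromstring(DOC)

# Locate content anywhere in the tree with a path expression.
families = [e.text for e in root.findall("./person/name/family")]

# Make a decision based on an attribute value elsewhere in the document.
ada = root.find("./person[@id='p2']/name/given").text

# Move a well-balanced portion of the document: the second person goes first.
second = root.find("./person[@id='p2']")
root.remove(second)
root.insert(0, second)
order = [p.get("id") for p in root.findall("person")]
```

None of these steps requires the program to track where it is in the document. The tree holds the full context at all times.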

Tree models of documents have a few drawbacks. They can take up large amounts of memory, typically three to ten times the original document's file size. Navigating documents can require additional processing after the parse, as developers have more options available to them. (Tree models don't impose the same kinds of discipline as event-based processing.) These issues can make it difficult to scale and share applications that rely on tree models, although they may still be appropriate where small numbers of documents or small documents are being used.

The Document Object Model (DOM), described in Chapter 19 and Chapter 25, is the most common tree-based API. JDOM (http://jdom.org/), DOM4J (http://dom4j.org/), and XOM (http://www.cafeconleche.org/XOM) are Java-only alternatives. (XOM is an object model promoted by Elliotte Rusty Harold, one of the authors.)

18.1.4 Pull-Based XML Processing

The most recent entrant into the XML processing arena is the so-called pull processing model. One of the most widely used pull processors is the Microsoft .NET XmlReader class. The pull model is most similar to the event-based model in that it makes the contents of the XML document available progressively as the document is parsed.

Unlike the event model, the pull approach relies on the client application to request content from the parser at its own pace. For example, a pull client might include the following code to parse the simple document shown in Example 18-1:

 reader.ReadStartElement("name")
 reader.ReadStartElement("given")
 givenName = reader.ReadString()
 reader.ReadEndElement()
 reader.ReadStartElement("family")
 familyName = reader.ReadString()
 reader.ReadEndElement()
 reader.ReadEndElement()

The pull client requests the XML content it expects to see from the pull parser. In practice, this makes pull client code easier to read and understand than the corresponding event-based code would be. It also tends to reduce the need to create stacks and structures to contain document information, as the code itself can be written to mirror recursive descent parsing.
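For comparison, a similar pull-style client can be sketched in Python with the standard `xml.dom.pulldom` module, which exposes the parse as an iterator of (event, node) pairs. The `expect` helper is a hypothetical function written for this example to play the role of the `ReadStartElement` and `ReadEndElement` calls shown above. The sketch assumes each text run arrives as a single characters event, which holds for a document this small but is not guaranteed in general.

```python
from xml.dom import pulldom

DOC = "<name><given>Keith</given><family>Johnson</family></name>"


def expect(stream, event_type, name=None):
    """Pull the next event and check that it is the one the caller expects."""
    event, node = next(stream)
    if event != event_type or (name is not None and node.nodeName != name):
        raise ValueError(
            "expected %s %s, got %s %s" % (event_type, name, event, node.nodeName)
        )
    return node


stream = iter(pulldom.parseString(DOC))

expect(stream, pulldom.START_DOCUMENT)
expect(stream, pulldom.START_ELEMENT, "name")
expect(stream, pulldom.START_ELEMENT, "given")
given_name = expect(stream, pulldom.CHARACTERS).data
expect(stream, pulldom.END_ELEMENT, "given")
expect(stream, pulldom.START_ELEMENT, "family")
family_name = expect(stream, pulldom.CHARACTERS).data
expect(stream, pulldom.END_ELEMENT, "family")
expect(stream, pulldom.END_ELEMENT, "name")
```

The client decides when to ask for the next event, so the order of calls in the code follows the expected structure of the document directly.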

In the Java world, BEA, Sun, and several individual developers have collaborated to create the Streaming API for XML (StAX). StAX and other pull parsers share the advantages of streaming with SAX, such as speed, parallelism, and memory efficiency, while offering an API that is more comfortable to many developers. In essence, SAX and other push parsers are based on the Observer design pattern. StAX, XmlReader, and other pull parsers are based on the Iterator design pattern.

18.1.5 Transformations

Another facility available to the XML programmer is document transformation. The Extensible Stylesheet Language Transformations (XSLT) language, covered in Chapter 8, is the most popular tool currently available for transforming XML to HTML, XML, or any other regular language that can be expressed in XSLT. In some cases, using a transformation to perform pre- or post-processing on XML data can reduce the complexity of a DOM or SAX application. For instance, XSLT could be used as a preprocessor for a screen-scraping application that starts from XHTML documents. The complex XHTML document could be transformed into a smaller, more accessible application-specific XML format that could then be read by a script.

Transformations may be used by themselves, in browsers, or at the command line, but many XSLT implementations and other transformation tools offer SAX or DOM interfaces, simplifying the task of using them to build document processing pipelines.

18.1.6 Abstracting XML Away

Developers who want to take advantage of XML's cross-platform benefits but have no patience for the details of markup can use various tools that rely on XML but don't require direct exposure to XML's structures. Web Services, mentioned in Chapter 16, can be seen as a move in this direction. You can still touch the XML directly if you need to, but toolkits make it easier to avoid doing so.

These kinds of applications are generally built as a layer on top of event- or tree-based processing, presenting their own API to the underlying information. We feel that in most cases, the underlying XML data is as clear and accessible as it can be. Additional layers of abstraction above the XML simply add to the overall complexity and rigidity of the application.

18.1.7 Standards and Extensions

The SAX and DOM specifications, along with the various core XML specifications, provide a foundation for XML processing. Implementations of these standards, especially implementations of the DOM, sometimes vary from the specification. Some extensions are formally specified: Scalable Vector Graphics (SVG), for instance, specifies extensions to the DOM that are specific to working with SVG. Others are just kind of tacked on, adding functionality that a programmer or vendor felt was important but wasn't in the original specification. The multiple levels and modules of the DOM have also led to developers claiming support for the DOM but actually supporting particular subsets (or extensions) of the available specifications.

Porting standards also leads to variations. SAX was developed for Java, and the core SAX project only defines a Java API. The DOM uses Interface Definition Language (IDL) to define its API, but various implementations have interpreted the IDL slightly differently. SAX2 and the DOM are somewhat portable, but moving between environments may require some unlearning and relearning.

Some environments also offer libraries well outside the SAX and DOM interfaces. Perl and Python both offer libraries that combine event and tree processing (for instance, permitting applications to work on partial trees rather than SAX events or full DOM trees). These nonstandard approaches do not make moving between environments easy, but they can be very useful.
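One example of the partial-tree approach in Python is `iterparse` from the ElementTree module, which reports events while it builds elements. The sketch below handles one small subtree at a time and then empties it. The `orders` document and the `iter_orders` function are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

DOC = b"""<orders>
  <order><symbol>AAA</symbol><quantity>100</quantity></order>
  <order><symbol>BBB</symbol><quantity>50</quantity></order>
</orders>"""


def iter_orders(source):
    """Yield (symbol, quantity) for each order, one partial tree at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "order":
            # The complete <order> subtree is available here as a small tree.
            yield elem.findtext("symbol"), int(elem.findtext("quantity"))
            # Discard its children so the subtree does not stay in memory.
            elem.clear()


orders = list(iter_orders(io.BytesIO(DOC)))
```

Within each `order` the code navigates a tree, but it never holds more than one populated `order` at once. The emptied elements remain attached to the root, so very long documents may need further cleanup.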

18.1.8 Combining Approaches

While text, events, trees, and transformations may seem very different, it isn't unusual to combine them. Most parsers that produce DOM trees also offer the option of SAX events, and there are a number of tools that can create DOM trees from SAX events or vice versa. Some tools that accept and generate SAX events actually build internal trees; many XSLT processors operate this way, using optimized internal models for their trees rather than the generic DOM. XSLT processors themselves often accept either SAX events or DOM trees as input and can produce these models (or text) for their output.
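Converting between the two models takes little code. The following sketch walks an in-memory tree (an ElementTree rather than a DOM, to keep the example short) and replays it as SAX events to any content handler. The `tree_to_sax` function is a hypothetical helper, and it ignores namespaces, comments, and processing instructions.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def tree_to_sax(elem, handler):
    """Replay an element and its descendants as SAX events."""
    handler.startElement(elem.tag, AttributesImpl(dict(elem.attrib)))
    if elem.text:
        handler.characters(elem.text)
    for child in elem:
        tree_to_sax(child, handler)
        if child.tail:
            # Text that follows a child element belongs to the parent.
            handler.characters(child.tail)
    handler.endElement(elem.tag)


root = ET.fromstring("<name><given>Keith</given><family>Johnson</family></name>")

out = io.StringIO()
generator = XMLGenerator(out, encoding="utf-8")
generator.startDocument()
tree_to_sax(root, generator)
generator.endDocument()
serialized = out.getvalue()
```

Because the events go to an ordinary content handler, the same function could feed a filter chain or any other SAX consumer instead of a serializer.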

Most programmers who want direct access to XML documents start with DOM trees, which are easier to figure out initially. If they have problems that are better solved in event-based environments, they can either rewrite their code for events (it's a big change) or mix and match event processing with tree processing.