This chapter briefly explains the most popular programming models for parsing and manipulating XML data in use today. XML processing includes a diverse set of tools, which require different approaches but offer distinct advantages and disadvantages.
XML's structured and labeled text can be processed by developers in several ways. Programs can look at XML as text, as a stream of events, as a tree, or as a serialization of some other structure. Tools supporting all of these options are widely available.
At their foundation, XML documents are text. The content and markup are both represented as text, and text-editing tools can be extremely useful for XML document inspection, creation, and modification. XML's textual foundations make it possible for developers to work with XML directly, using XML-specific tools only when they choose.
Despite this textual nature, however, XML presents some serious limitations for programs that attempt to process XML documents as text documents. It is possible to process extremely simple XML documents reliably using basic textual tools like regular expressions, but this becomes much more difficult as features such as attribute defaulting, entity processing, and namespaces are added to documents. Using these features is extremely difficult when treating a document purely as text.
Textual tools are a key part of the XML toolset, however. Many developers use text editors such as vi, Emacs, Notepad, WordPad, BBEdit, and UltraEdit to create or modify XML documents. Regular expressions in environments such as sed, grep, Perl, and Python can be used for search and replace or for tweaking documents prior to XML parsing or XSLT processing. These tools can also be very useful for searching and querying the information in XML documents, even without an understanding of the surrounding structure.
Textual tools may also be applied to the results of an XML parse. Regular expressions and similar text-processing tools can work usefully on a document once its XML-specific nature has been resolved. The W3C's XML Schema, for instance, includes regular-expression matching as one mechanism for validating data types, as discussed in Chapter 16. A smart search and replace or spell checker might process only the contents of elements (and perhaps attributes), not the markup that defines the structures.
Text-based processing can be performed in conjunction with other XML processing. Parsing and then reserializing XML documents after other processing has taken place doesn't always produce the desired results. XSLT, for instance, will remove entity references and replace them with entity content. Preserving entities requires replacing them in the original document with unique placeholders, and then replacing each placeholder as it appears in the result. With regular expressions, this is quite easy to do. Developers may also need to replace particular characters with references to images; this approach can be very useful where an obscure or nonstandard glyph is needed in XHTML.
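The placeholder technique described above can be sketched with Python's re module. This is a minimal illustration, not a complete solution: the [[ENTITY:name]] placeholder format, the function names, and the sample text are all assumptions made for this example, and the pattern does not account for CDATA sections or comments.

```python
import re

# Matches general entity references such as &copy; while skipping the five
# predefined entities and character references such as &#169;.
ENTITY_REF = re.compile(r"&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_][\w.-]*);")

# Matches the (hypothetical) placeholder format used by this sketch.
PLACEHOLDER = re.compile(r"\[\[ENTITY:([A-Za-z_][\w.-]*)\]\]")


def protect_entities(text):
    """Replace each entity reference with a placeholder that survives parsing."""
    return ENTITY_REF.sub(r"[[ENTITY:\1]]", text)


def restore_entities(text):
    """Turn the placeholders back into entity references."""
    return PLACEHOLDER.sub(r"&\1;", text)


source = "<p>Copyright &copy; 2002 &publisher; &amp; friends</p>"
protected = protect_entities(source)
# ... parse, transform, and reserialize `protected` here ...
restored = restore_entities(protected)

print(protected)
print(restored)
```

The placeholder only needs to be a string that cannot occur in the document's real content and that passes through the intermediate processing unchanged.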
As an XML parser reads a document, it moves from the beginning of the document to the end. It may pause to retrieve external resources (for a DTD or an external entity, for instance), but it builds an understanding of the document as it moves along. Enforcing well-formedness and validity constraints and applying namespaces requires keeping track of context; applying attribute defaults and entities requires keeping a list of appropriate content to insert; but the end result is a complete "reading" of the XML document.
Event-based parsers report this reading as it happens, in a stream of events representing the information in the document. The "events" are, for example, the start of an element, the content of an element, and the end of an element. For example, given this document:
    <name><given>Keith</given><family>Johnson</family></name>
an event-based parser might report events such as these:
    startElement:name
    startElement:given
    content:Keith
    endElement:given
    startElement:family
    content:Johnson
    endElement:family
    endElement:name
The list and structure of events can become much more complex as features such as namespaces, attributes, whitespace between elements, comments, processing instructions, and entities are added, but the basic mechanism is quite simple and generally very efficient.
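As a rough illustration of such an event stream, the following sketch uses the SAX support in Python's standard library to log events in the same notation. The EventLogger class name is an assumption of this example, and the sample document is simply one that is consistent with the events listed above.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX callback as a line of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("startElement:" + name)

    def characters(self, content):
        self.events.append("content:" + content)

    def endElement(self, name):
        self.events.append("endElement:" + name)


document = b"<name><given>Keith</given><family>Johnson</family></name>"

logger = EventLogger()
xml.sax.parseString(document, logger)
print("\n".join(logger.events))
```

A parser is free to deliver the text of an element in more than one characters event, so real handlers normally accumulate text rather than assume a single callback.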
Event-based parsers only have to keep track of a limited amount of information. They need to understand the contents of DTDs (and possibly schemas), if the documents use them, and they need to maintain context stacks for element names and namespace declarations. They don't need to build a complete record of the document as they parse it, which minimizes the amount of memory needed for the parse.
Event-based parsers require the consumer of the events to do a lot more work, however. Processing events typically means the creation of a state machine, i.e., code that understands current context and can route the information in the events to the proper consumer. Because events occur as the document is read, applications must be prepared to discard results should a fatal error occur partway through the document. Applications can't depend on information that occurs later in a document to interpret the current event, either, making it hard to use some kinds of XPaths, for instance, in an event-based environment. These factors can make it difficult to work directly with event-based parsers.
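The kind of state tracking described here might look like the following sketch, which keeps a stack of open element names and uses it to decide where each piece of text belongs. The people wrapper element, the second person, and the NameCollector class are hypothetical and exist only for this example.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Collects <given> and <family> text for each <name> inside <people>."""

    def __init__(self):
        super().__init__()
        self.path = []       # stack of open element names: the current context
        self.buffer = []     # text chunks seen since the last start or end tag
        self.current = None  # the record being filled in, if any
        self.names = []

    def startElement(self, name, attrs):
        self.path.append(name)
        if self.path == ["people", "name"]:
            self.current = {}
        self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        self.buffer.append(content)

    def endElement(self, name):
        if self.path[:2] == ["people", "name"] and len(self.path) == 3:
            if name in ("given", "family"):
                self.current[name] = "".join(self.buffer).strip()
        elif self.path == ["people", "name"]:
            self.names.append(self.current)
            self.current = None
        self.path.pop()
        self.buffer = []


document = (
    b"<people>"
    b"<name><given>Keith</given><family>Johnson</family></name>"
    b"<name><given>Ada</given><family>Lovelace</family></name>"
    b"</people>"
)

collector = NameCollector()
xml.sax.parseString(document, collector)
print(collector.names)
```

If the parser raised a fatal error partway through, collector.names would hold only a partial result, which is why applications have to be ready to discard it.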
Despite the potential difficulty, event-based parsers are very useful for a wide variety of tasks. Filters can process and modify events before passing them to another processor, efficiently performing a wide range of transformations. Filters can be stacked, providing a relatively simple means of building XML processing pipelines, where the information from one processor flows directly into another. Applications that want to feed information directly from XML documents into their own internal structures may find events to be the most efficient means of doing that. Even parsers that report XML documents as complete trees, as described in the next section, typically build those trees from a stream of events.
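A stacked pipeline of filters can be sketched with XMLFilterBase from Python's xml.sax.saxutils module. The two filter classes below are hypothetical examples, one dropping an element and its content and the other uppercasing element names; XMLGenerator simply serializes whatever events reach the end of the pipeline.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElement(XMLFilterBase):
    """Suppresses one element type, along with everything inside it."""

    def __init__(self, parent, unwanted):
        super().__init__(parent)
        self.unwanted = unwanted
        self.depth = 0  # greater than zero while inside a dropped element

    def startElement(self, name, attrs):
        if self.depth or name == self.unwanted:
            self.depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self.depth:
            self.depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self.depth:
            super().characters(content)


class UppercaseNames(XMLFilterBase):
    """Passes events along with element names converted to uppercase."""

    def startElement(self, name, attrs):
        super().startElement(name.upper(), attrs)

    def endElement(self, name):
        super().endElement(name.upper())


document = b"<name><given>Keith</given><family>Johnson</family></name>"

# parser -> DropElement -> UppercaseNames -> XMLGenerator
parser = xml.sax.make_parser()
pipeline = UppercaseNames(DropElement(parser, "family"))
output = io.StringIO()
pipeline.setContentHandler(XMLGenerator(output, encoding="utf-8"))
pipeline.parse(io.BytesIO(document))

result = output.getvalue()
print(result)
```

Each filter sees only events, so neither needs to know what comes before or after it in the chain.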
XML documents, because of the requirements for well-formedness, describe tree structures. Documents typically contain an element that then contains text, attributes, and other elements, and these may contain elements, text, and attributes, and so on. Declarations, comments, and processing instructions enrich the mix, but all basically hold positions in the overall tree.
There are a wide variety of tree models for XML documents. XPath (described in Chapter 9), used in XSLT transformations, has a slightly different set of expectations than does the Document Object Model (DOM) API, which is also different from the XML Information Set (Infoset), another W3C project. XML Schema (described in Chapter 16 and Chapter 21) defines a Post-Schema Validation Infoset (PSVI), which has more information in it (derived from the XML Schema) than any of the others.
Developers who want to manipulate documents from their programs typically use APIs that provide access to an object model representing the XML document. Tree-based APIs typically present a model of an entire document to an application once parsing has successfully concluded. Applications don't have to worry about figuring out context or dealing with rollback when an error is encountered, since the tree model and parsing already address those issues. Rather than following a stream of events, an application can just navigate a tree to find the desired pieces of a document. Browsers and editors can present or modify the tree in conformance with user or script requests, using the tree as a persistent reference to the current content of the document.
Working with a tree model of a document isn't very different conceptually from working with a document as text. The entire document is always available, and moving around well-formed portions of a document or modifying them is fairly easy. The complete set of context for any given part of the document is always available. Developers can use XPath expressions to locate content and make decisions based on content anywhere in the document where APIs support XPath. (DOM Level 3 adds formal support for XPath, and various implementations provide their own support.)
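As a small sketch of tree navigation, the example below uses Python's xml.etree.ElementTree. It is not a DOM implementation, but it offers a comparable in-memory tree along with a limited subset of XPath. The people document and its id attributes are invented for this illustration.

```python
import xml.etree.ElementTree as ET

document = """<people>
  <name id="p1"><given>Keith</given><family>Johnson</family></name>
  <name id="p2"><given>Ada</given><family>Lovelace</family></name>
</people>"""

root = ET.fromstring(document)

# The whole tree is available, so any part can be reached in any order.
families = [element.text for element in root.findall("./name/family")]

# Locate one element with an XPath-style predicate and modify it in place.
person = root.find("./name[@id='p2']")
person.find("given").text = "Augusta"

print(families)
print(ET.tostring(person, encoding="unicode").strip())
```

Because the tree persists after the parse, the modification is visible to any later code that navigates to the same element.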
Tree models of documents have a few drawbacks. They can take up large chunks of memory, typically multiplying the original document's size. Navigating documents can require additional processing after the parse, as developers have more options available to them. (Tree models don't impose the same kinds of discipline as event-based processing.) Both of these issues can make it difficult to scale and share applications that rely on tree models, though they may still be appropriate where small numbers of documents or small documents are being used.
Another facility available to the XML programmer is the XML transformation library. The Extensible Stylesheet Language Transformations (XSLT) language, covered in Chapter 8, is the most popular tool currently available for transforming XML to HTML, XML, or any other regular language that can be expressed in XSLT. In some cases, using a transformation to perform pre- or post-processing on XML data when processing it with either DOM or SAX might be simpler or more efficient. For instance, XSLT could be used as a preprocessor for a screen-scraping application that starts from XHTML documents. A script could extract the meaningful features from the XHTML document and pour them into an application-specific XML format.
Transformations may be used by themselves, in browsers, or at the command line, but many XSLT implementations and other transformation tools offer SAX or DOM interfaces, simplifying the task of using them to build pipelines.
Developers who want to take advantage of XML's cross-platform benefits but have no patience for the details of markup can use various tools that rely on XML but don't require direct exposure to XML's structures. Web Services, mentioned in Chapter 15, can be seen as a move in this direction. You can still touch the XML directly if you need to, but toolkits make it easier to avoid doing so.
These kinds of applications are generally built as a layer on top of event- or tree-based processing, presenting their own API to the underlying information. This level of abstraction may be very useful in some cases or an inefficient inconvenience in others. It's probably helpful to understand more direct connections to XML if you need to evaluate the advantages and disadvantages of abstraction, as well as provide a bridge to systems that don't support a particular abstraction layer but still need access to the information.
The SAX and DOM specifications, along with the various core XML specifications, provide a foundation for XML processing. Implementations of these standards, especially implementations of the DOM, sometimes vary from the specification. Some extensions are themselves formally specified: Scalable Vector Graphics (SVG), for instance, specifies extensions to the DOM that are specific to working with SVG. Others are simply tacked on, adding functionality that a programmer or vendor felt was important but wasn't in the original specification. The multiple levels and modules of the DOM have also led to developers claiming support for the DOM, but actually supporting particular subsets (or extensions) of the available specifications.
Porting standards also leads to variations. SAX was developed for Java, and the core SAX project only defines a Java API. The DOM uses Interface Definition Language (IDL) to define its API, but different implementations have interpreted the IDL slightly differently. SAX2 and the DOM are somewhat portable, but moving between environments may require some unlearning and relearning.
Some environments also offer libraries well outside the SAX and DOM interfaces. Perl and Python both offer libraries that combine event and tree processing, for instance, permitting applications to work on partial trees rather than SAX events or full DOM trees. Microsoft .NET's XmlReader offers similarly flexible processing. These approaches do not make moving between environments easy, but they can be very useful.
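In Python, one such library is xml.dom.pulldom, which delivers a stream of events and lets the application expand only selected elements into DOM subtrees. The following is a minimal sketch; the orders document is a made-up example.

```python
from xml.dom import pulldom

document = """<orders>
  <order id="1"><item>desk</item></order>
  <order id="2"><item>bookcase</item><item>shelf</item></order>
</orders>"""

events = pulldom.parseString(document)
item_counts = {}

for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "order":
        # Build a DOM subtree for this <order> only; the rest of the
        # document is still handled as a stream of events.
        events.expandNode(node)
        items = node.getElementsByTagName("item")
        item_counts[node.getAttribute("id")] = len(items)

print(item_counts)
```

Only one order is held in memory as a tree at a time, which keeps memory use close to that of pure event processing while retaining the convenience of tree navigation within each order.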
While text, events, trees, and transformations may seem very different, it isn't unusual to combine them. Most parsers that produce DOM trees also offer the option of SAX events, and there are a number of tools that can create DOM trees from SAX events or vice versa. Some tools that accept and generate SAX events actually build internal trees; many XSLT processors operate this way, using optimized internal models for their trees rather than the generic DOM. XSLT processors themselves often accept either SAX events or DOM trees as input and can produce these models (or text) for their output.
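Building a tree from events takes very little code. The sketch below forwards SAX events to the TreeBuilder class from Python's xml.etree.ElementTree; note that the result is an ElementTree element rather than a DOM tree, and the TreeFromEvents class name is an assumption of this example.

```python
import xml.sax
import xml.etree.ElementTree as ET


class TreeFromEvents(xml.sax.ContentHandler):
    """Feeds SAX events into an ElementTree TreeBuilder."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def startElement(self, name, attrs):
        self.builder.start(name, dict(attrs.items()))

    def characters(self, content):
        self.builder.data(content)

    def endElement(self, name):
        self.builder.end(name)


document = b'<name lang="en"><given>Keith</given><family>Johnson</family></name>'

handler = TreeFromEvents()
xml.sax.parseString(document, handler)
root = handler.builder.close()  # returns the root element of the finished tree

print(root.tag, root.attrib)
print([child.tag + "=" + child.text for child in root])
```

A filter could sit between the parser and this handler, so the tree that gets built reflects only the events the application cares about.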
Most programmers who want direct access to XML documents start with DOM trees, which are easier to figure out initially. If they have problems that are better solved in event-based environments, they can either rewrite their code for events (a big change) or mix and match event processing with tree processing.
As with any technology, there are several ways to accomplish most design goals when developing a new XML application, as well as a few potential problems worth knowing about ahead of time. An understanding of the intended uses for these features can help ensure that new applications will be compatible not only with their intended target audience, but also with other XML processing systems that may not even exist yet.
The XML specification provides several loopholes that permit XML parsers to play fast and loose with your document's literal contents, while retaining the semantic meaning. Comments can be omitted and entity references silently replaced by the parser without any warning to the client application. Nonvalidating parsers aren't required to retrieve external DTDs or entities, though the parser should at least warn applications that this is happening. While reconstructing an XML document with exactly the same logical structure and content is possible, guaranteeing that it will match the original in a byte-by-byte comparison is not.
Authors of simple XML processing tools that act on data without storing or modifying it might not consider these constraints particularly restrictive. The ability to reconstruct an XML document precisely from in-memory data structures, however, becomes more critical for authors of XML editing tools and content-management solutions. While no parser is required to make all comments, whitespace, and entity references available from the parse stream, many do or can be made to do so with the proper configuration options.
The only real option to ensure that your parser reports documents as you want, and not just the minimum required by the XML 1.0 specification, is to check its documentation and configure (or choose) your parser accordingly.
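As one example of such configuration, Python's xml.etree.ElementTree discards comments by default but can be asked to keep them. This sketch assumes Python 3.8 or later, where TreeBuilder accepts the insert_comments option; other parsers expose similar choices through their own settings.

```python
import xml.etree.ElementTree as ET

document = "<doc><!-- review this --><p>text</p></doc>"

# Default configuration: comments are silently dropped.
default_root = ET.fromstring(document)

# Explicit configuration: ask the tree builder to keep comments.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
configured_root = ET.fromstring(document, parser=parser)

default_text = ET.tostring(default_root, encoding="unicode")
configured_text = ET.tostring(configured_root, encoding="unicode")

print(default_text)
print(configured_text)
```

Both trees carry the same element content; only the configured one could be used to write the comment back out.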
XML parsers are required to provide client applications access to XML processing instructions. Processing instructions provide a mechanism for document authors to communicate with XML-aware applications behind the scenes in a way that doesn't interfere with the content of the document. DTD and schema validation both ignore processing instructions, making it possible to use them anywhere in a document structure without changing the DTD or schema. The processing instruction's most widely recognized application is its ability to embed stylesheet references inside XML documents. The following XML fragment shows a stylesheet reference:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/css" href="test.css"?>
An XML-aware application, such as Internet Explorer 5.5, would be capable of recognizing the XML author's intention to display the document using the test.css stylesheet. This processing instruction can also be used for XSLT stylesheets or other kinds of stylesheets not yet developed, though the application needs to understand how to process them to make this work. Applications that do not understand the processing instructions can still parse and use the information in the XML document while ignoring the unfamiliar processing instruction.
The furniture example from Chapter 20 (see Figure 20-1) gives a hypothetical application of processing instructions. A processing instruction in the bookcase.xml file signals the furniture example's processor to verify the parts list from the document against the true list of parts required to build the furniture item:
    <parts_list>
      <part_name id="A" count="1">END PANEL</part_name>
      <part_name id="B" count="2">SIDE PANEL</part_name>
      <part_name id="C" count="1">BACK PANEL</part_name>
      <part_name id="D" count="4">SHELF</part_name>
      <part_name id="E" count="8">HIDDEN CONNECTORS</part_name>
      <part_name id="F" count="8">CONNECTOR SCREWS</part_name>
      <part_name id="G" count="22">7/16" TACKS</part_name>
      <part_name id="H" count="16">SHELF PEGS</part_name>
    </parts_list>
    <?furniture_app verify_parts_list?>
This processing instruction is meaningless unless the parsing application understands the given type of processing instruction.
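An application typically receives processing instructions through a callback and acts only on the targets it recognizes. The following sketch uses the processingInstruction callback of Python's SAX ContentHandler; the PICollector class and the shortened parts list are assumptions made for brevity.

```python
import xml.sax


class PICollector(xml.sax.ContentHandler):
    """Records every processing instruction as a (target, data) pair."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def processingInstruction(self, target, data):
        self.instructions.append((target, data))


document = b"""<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="test.css"?>
<parts_list>
  <part_name id="A" count="1">END PANEL</part_name>
</parts_list>
<?furniture_app verify_parts_list?>"""

collector = PICollector()
xml.sax.parseString(document, collector)

# Act only on instructions addressed to this (hypothetical) application.
own_instructions = [data for target, data in collector.instructions
                    if target == "furniture_app"]

print(collector.instructions)
print(own_instructions)
```

The XML declaration on the first line looks like a processing instruction but is not reported as one.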
The XML specification also permits the association of the processing instruction's target (the XML name immediately after the <?) with a notation, as described in the next section, but this is not required and is rarely used in XML.
The notation syntax of XML provides a way for the document author to specify an external unparsed entity's type within the XML document's framework. If an application requires access to external data that cannot be represented in XML, consider declaring a notation name and using it where appropriate when declaring external unparsed entities. For example, if an XML application were an annotated Java source-code format, the compiled bytecode could then be referenced as an external unparsed entity.
Notations effectively provide metadata, identifiers that applications may apply to information. Using notations requires making declarations in the DTD, as described in Chapter 3. One use of notations is with NOTATION-type attributes. For example, if a document contained various scripts designed for different environments, it might declare some notations and then use an attribute on a containing element to identify what kind of script it contained:
    <!NOTATION DOS PUBLIC "-//MS/DOS Batch File/">
    <!NOTATION BASH PUBLIC "-//UNIX/BASH Shell Script/">
    <!ELEMENT batch_code (#PCDATA)*>
    <!ATTLIST batch_code lang NOTATION (DOS | BASH) #REQUIRED>
    . . .
    <batch_code lang="DOS">
      echo Hello, world!
    </batch_code>
Applications that read this document and recognize the public identifier can interpret the foreign element data correctly, based on its type. (Notations can also have system identifiers, and applications can use either approach.)
Categorizing processing instructions is the other use of notations important to custom XML applications. For instance, the previous furniture_app processing-instruction example could have been declared as a notation in the DTD:
<!NOTATION furniture_app SYSTEM "http://namespaces.example.com/furniture">
Then the furniture-document processing application could verify that the processing instruction was actually intended for itself and not for another application that used a processing instruction with the same name.
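This check can be sketched with Python's SAX interfaces, where notation declarations are reported through the DTDHandler callback notationDecl. The handler class, the other_app instruction, and the empty parts_list element are hypothetical; only the notation declaration itself comes from the example above.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler

FURNITURE_ID = "http://namespaces.example.com/furniture"


class FurnitureHandler(ContentHandler, DTDHandler):
    """Accepts only processing instructions whose target is a known notation."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.notations = {}  # notation name -> system (or public) identifier
        self.accepted = []

    def notationDecl(self, name, publicId, systemId):
        self.notations[name] = systemId or publicId

    def processingInstruction(self, target, data):
        if self.notations.get(target) == FURNITURE_ID:
            self.accepted.append(data)


document = b"""<!DOCTYPE parts_list [
  <!NOTATION furniture_app SYSTEM "http://namespaces.example.com/furniture">
]>
<parts_list/>
<?furniture_app verify_parts_list?>
<?other_app verify_parts_list?>"""

handler = FurnitureHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.setDTDHandler(handler)
parser.parse(io.BytesIO(document))

print(handler.notations)
print(handler.accepted)
```

The instruction aimed at other_app is ignored because no notation ties that target to the identifier this application recognizes.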
Unparsed entities combine attribute and notation declarations to define references to content that will require further (unspecified) processing by the application. Unparsed entities are described in more detail in Chapter 3. Although they are a feature available to applications, they are rarely used and not generally considered interoperable among XML processors. The linking and referencing tools described in the next section are more commonly used instead.
The ability to create links between and within documents is important to XML's long-term success, both on the World Wide Web and for other applications concerned about the relationships between information. The XLink specification, described in Chapter 10, defines the semantics of how these links can be created. Unlike simple HTML links, XLinks can express sophisticated relationships between the source and target elements of a link.
If an XML application requires the ability to encode relationships between various parts of an XML document, or between different documents, implementing this functionality using the XLink recommendation should be considered. Not only would it save the effort of defining a new (and incompatible) linking scheme, the resulting documents would be intelligible to new XML authoring tools and browsers as XLink support becomes more widespread. RDDL, described in Chapter 14, makes extensive use of XLink for machine-readable linking.
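Because XLink expresses links as namespaced attributes, an application can discover them without knowing anything else about the vocabulary. The sketch below finds simple links with Python's xml.etree.ElementTree; the catalog document is an invented example, while the namespace URI is the one defined by the XLink recommendation.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
XLINK_TYPE = "{%s}type" % XLINK_NS  # ElementTree's {namespace}local notation
XLINK_HREF = "{%s}href" % XLINK_NS

document = """<catalog xmlns:xlink="http://www.w3.org/1999/xlink">
  <item xlink:type="simple" xlink:href="bookcase.xml">Bookcase</item>
  <item>Desk</item>
</catalog>"""

root = ET.fromstring(document)

# Any element carrying xlink:type="simple" is a link, whatever its name.
links = [(element.text, element.get(XLINK_HREF))
         for element in root.iter()
         if element.get(XLINK_TYPE) == "simple"]

print(links)
```

The same loop would work unchanged on a document from a different vocabulary, which is the interoperability benefit of adopting a shared linking scheme.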