XPath and XML Infosets

An XML infoset is intended to hold all the information in an XML document in compact form. Reducing an XML document to its infoset presents the data in a standard way, which should make it easier to compare XML documents of all kinds. You can find the official XML Information Set specification at www.w3.org/TR/xml-infoset.

To understand what infosets are and what they're used for, imagine searching for data on the World Wide Web. If you search for a particular topic, such as XML, you'll turn up millions of matches. How could you possibly write software to compare those documents? The data in them isn't stored in any way that's directly comparable.

That's where infosets come in. The idea is to regularize how the data in an XML document is represented by setting up an abstract way of looking at the document, one that allows it to be compared to others. Ultimately, that should let you work with thousands of such documents.

XML infosets have their own data model, which is not the same as the XPath data model. An XML infoset can contain 15 different types of information items:

A document information item
Element information items
Attribute information items
Processing instruction information items
Reference to skipped entity information items
Character information items
Comment information items
A document type declaration information item
Entity information items
Notation information items
Entity start marker information items
Entity end marker information items
CDATA start marker information items
CDATA end marker information items
Namespace declaration information items

Each of these information items has a set of properties that contain more information. For example, the document information item has properties that let you access the children of the root node.

Over time, several XML standards have developed their own data models, and the W3C is trying to get them all reconciled. You won't have to know about infosets in this book, but if you're already familiar with them, it's useful to know how you can derive the nodes in the XPath data model from the information items provided by an XML infoset. Here's how that works:

The root node comes from the Infoset document information item. The children of the root node come from the children and children-comments properties.
Element nodes come from Infoset element information items. The children of an element node come from the children and children-comments properties. The attributes of an element node come from the attributes property.
Attribute nodes come from attribute information items. The string-value of the node comes from concatenating the character code property of each member of the children property.
Text nodes come from one or more consecutive character information items. The string-value of the node comes from concatenating the character code property of each of the character information items.
Processing instruction nodes come from processing instruction information items. The local part of the expanded-name of the node comes from the target property. The string-value of the node comes from the content property.
Comment nodes come from comment information items. The string-value of the node comes from the content property.
Namespace nodes come from namespace declaration information items. The local part of the expanded-name of the node comes from the prefix property. The string-value of the node comes from the namespace URI property.
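
To make the derivation above more concrete, here is a minimal Python sketch of a few of these rules. It is not taken from any specification or library: the class names (CharacterItem, CommentItem, PIItem, AttributeItem, ElementItem) and the helper functions are hypothetical stand-ins invented for illustration, comments are kept in the ordinary children list for simplicity, and the root and namespace nodes are left out.

```python
from dataclasses import dataclass, field


# Hypothetical, simplified stand-ins for Infoset information items.
@dataclass
class CharacterItem:
    character_code: int  # models the "character code" property


@dataclass
class CommentItem:
    content: str  # models the "content" property


@dataclass
class PIItem:
    target: str
    content: str


@dataclass
class AttributeItem:
    local_name: str
    children: list = field(default_factory=list)  # character items


@dataclass
class ElementItem:
    local_name: str
    attributes: list = field(default_factory=list)
    children: list = field(default_factory=list)


def chars(text):
    """Build one character information item per character of text."""
    return [CharacterItem(ord(c)) for c in text]


def string_from_chars(items):
    """Concatenate the character code property of each character item."""
    return "".join(chr(item.character_code) for item in items)


def children_to_nodes(children):
    """Map child items to nodes, merging consecutive characters into text nodes."""
    nodes, run = [], []
    for child in children:
        if isinstance(child, CharacterItem):
            run.append(child)  # extend the current run of characters
            continue
        if run:  # a non-character item ends the run: emit one text node
            nodes.append({"kind": "text", "string_value": string_from_chars(run)})
            run = []
        nodes.append(to_xpath_node(child))
    if run:  # flush a trailing run of characters
        nodes.append({"kind": "text", "string_value": string_from_chars(run)})
    return nodes


def to_xpath_node(item):
    """Derive an XPath-style node (a plain dict here) from an information item."""
    if isinstance(item, ElementItem):
        return {
            "kind": "element",
            "name": item.local_name,
            "attributes": [
                {
                    "kind": "attribute",
                    "name": attr.local_name,
                    "string_value": string_from_chars(attr.children),
                }
                for attr in item.attributes
            ],
            "children": children_to_nodes(item.children),
        }
    if isinstance(item, CommentItem):
        return {"kind": "comment", "string_value": item.content}
    if isinstance(item, PIItem):
        return {
            "kind": "processing-instruction",
            "name": item.target,
            "string_value": item.content,
        }
    raise TypeError(f"unsupported information item: {item!r}")


# Roughly <p id="x1">Hi<!-- note -->!</p>, written out as information items.
para = ElementItem(
    "p",
    attributes=[AttributeItem("id", chars("x1"))],
    children=chars("Hi") + [CommentItem(" note ")] + chars("!"),
)
node = to_xpath_node(para)
print([child["kind"] for child in node["children"]])  # ['text', 'comment', 'text']
print(node["attributes"][0]["string_value"])          # x1
```

In this sketch the two character items for "Hi" collapse into a single text node, while the comment between them and the final "!" keeps the two runs of characters from merging.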

In fact, one of the tasks of XPath 2.0 was to reconcile the data models used in the XPath and XML Infoset specifications, and we'll discuss that in Chapter 7.