4.1 The XPath Data Model

The foundation of XPath is its view of an XML document as a tree whose points of structure are called nodes. The tree model comes to us from traditional computer science, where it is a way of organizing data hierarchically. To illustrate the tree model, Figure 4-1 shows a rough representation of the XML document nodes.xml, found in examples/ch04, as a tree of nodes.

Each box in Figure 4-1 represents a node or point in the tree structure of the document. In the XPath data model, a node represents part of an XML document such as the root or starting point of the document, elements, attributes, text, and so on. In the traditional tree model, the lines connecting the nodes are called edges. If a node does not have children, it is called a leaf node. (The terms edge and leaf node are not used in the XPath spec.) If you follow the edges, you are following a path. The nodes in a tree have family relationships: parent-child, ancestor-descendant, sibling, and so forth.

Figure 4-1. A tree of nodes
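The tree vocabulary above (edges, leaf nodes, paths) can be made concrete with a few lines of Python. This is only a sketch under stated assumptions: the invoice document is invented for illustration, Python's xml.etree.ElementTree stands in for a real XPath processor, and only element nodes are considered, whereas the XPath data model also counts text, comments, and other items as nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this sketch (not a file from examples/ch04).
SOURCE = "<invoice><vendor><name/></vendor><amount/></invoice>"

root = ET.fromstring(SOURCE)

# Each (parent, child) pair is one edge of the tree.
edges = [(parent.tag, child.tag) for parent in root.iter() for child in parent]

# A leaf is a node without children.
leaves = [element.tag for element in root.iter() if len(element) == 0]

# ElementTree keeps no parent pointers, so record them to walk upward.
parent_of = {child: parent for parent in root.iter() for child in parent}

def path_to(element):
    """Return the tags on the path from the root down to element."""
    path = [element.tag]
    while element in parent_of:
        element = parent_of[element]
        path.append(element.tag)
    return list(reversed(path))

name = root.find("vendor/name")
print("edges: ", edges)
print("leaves:", leaves)
print("path:  ", path_to(name))
```

Following the recorded parents from name back to invoice is the same as walking up through its ancestors, which is the family relationship described above.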

4.1.1 XPath Nodes

An XML document, according to the XPath 1.0 data model, can be conceptually described as having seven possible node types:

Root (called the document node in XPath 2.0)
Element
Attribute
Text
Namespace
Comment
Processing instruction

You have already encountered nodes of all these types earlier in the book. For further illustration, the file nodes.xml contains at least one occurrence of each of these nodes:

<?xml-stylesheet href="tree-view.xsl" type="text/xsl"?>
<!-- Last invoice of day's batch -->
<amount vendor="314" xml:lang="en" xmlns="urn:wyeast-net:invoice">7598.00</amount>

Each node is labeled with its appropriate XPath 1.0 node type in Figure 4-2, and Table 4-1 describes each of the XPath node types.

Figure 4-2. The seven XPath 1.0 nodes in nodes.xml

Table 4-1. XPath node types
Node type	Description
Root (document) node	The whole document, starting conceptually at the beginning of the document, before the document (or root) element. The root node has exactly one element child: the document element. In the XPath model, the root node may also have processing instruction and comment children. Other parts of the document outside the document element, such as the XML declaration and the document type declaration, are not represented as nodes.
Element node	An element, such as `amount`, which is also the document element in nodes.xml.
Attribute node	An attribute, such as `vendor="314"` or `xml:lang="en"`.
Text node	Text inside of an element, such as `7598.00` inside `amount` (yes, it looks like a real number, but XPath just sees it as text here).
Namespace node	A namespace name, a URI such as the URN `urn:wyeast-net:invoice` (also includes a prefix, if applicable).
Comment node	A comment, such as `<!-- Last invoice of day's batch -->`.
Processing instruction node	A processing instruction, such as `<?xml-stylesheet href="tree-view.xsl" type="text/xsl"?>`.
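To see these node types turn up in code, the following sketch parses the content of nodes.xml with Python's xml.dom.minidom and labels what it finds. It rests on an assumption worth stating loudly: the DOM is a different model from XPath 1.0. The DOM stores xmlns declarations as attributes, so the sketch relabels them as namespace nodes by hand, and it does not show the implicit xml namespace node that an XPath processor reports. The describe function and the XPATH_NAMES table are names invented for this example.

```python
from xml.dom import minidom, Node

# Same content as nodes.xml, one construct per line.
NODES_XML = (
    '<?xml-stylesheet href="tree-view.xsl" type="text/xsl"?>\n'
    "<!-- Last invoice of day's batch -->\n"
    '<amount vendor="314" xml:lang="en" '
    'xmlns="urn:wyeast-net:invoice">7598.00</amount>\n'
)

# Map DOM node-type constants to the XPath 1.0 names used in Table 4-1.
XPATH_NAMES = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}

def describe(node, found):
    """Walk the DOM tree depth-first, collecting (type, label) pairs."""
    kind = XPATH_NAMES.get(node.nodeType)
    if kind is not None:
        if node.nodeType in (Node.TEXT_NODE, Node.COMMENT_NODE):
            label = node.data
        else:
            label = node.nodeName
        found.append((kind, label))
    if node.nodeType == Node.ELEMENT_NODE:
        # Assumption: the DOM keeps xmlns declarations as attributes;
        # XPath 1.0 treats them as namespace nodes, so relabel them here.
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            is_ns = attr.name == "xmlns" or attr.name.startswith("xmlns:")
            found.append(("namespace" if is_ns else "attribute", attr.name))
    for child in node.childNodes:
        describe(child, found)
    return found

doc = minidom.parseString(NODES_XML)
for kind, label in describe(doc, []):
    print(f"{kind}: {label}")
```

The labels are only an approximation of what an XSLT processor sees, but all seven names from Table 4-1 appear for this small document.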

XPath 2.0, which is not yet an approved recommendation of the W3C, takes a slightly different approach in regard to nodes and types, at least at this book's level of detail. You will be introduced to XPath 2.0 in Chapter 16. For more information, see http://www.w3.org/TR/xpath20/

4.1.2 A View of the Tree

To get a good idea of how the XPath 1.0 data model views an XML document as a tree, you can use the ASCII Tree Viewer (the stylesheet ascii-treeview.xsl) created by Mike Brown and Jeni Tennison. This stylesheet labels all seven node types using plain text (ASCII) output. An edited version of this stylesheet is available in examples/ch04.

When you process nodes.xml with ascii-treeview.xsl using Xalan, as follows:

xalan nodes.xml ascii-treeview.xsl

you will see each of the nodes labeled in the output:

root
   |_ _ _processing instruction target='xml-stylesheet' instruction= 'href="tree-view.xsl" type="text/xsl"'
   |_ _ _comment ' Last invoice of day's batch '
   |_ _ _element 'amount' in ns 'urn:wyeast-net:invoice' ('amount')
         |  \_ _ _attribute 'vendor' = '314'
         |  \_ _ _attribute 'lang' in ns 'http://www.w3.org/XML/1998/namespace' ('xml:lang') = 'en'
         |  \_ _ _namespace 'xml' = 'http://www.w3.org/XML/1998/namespace'
         |  \_ _ _namespace 'xmlns' = 'urn:wyeast-net:invoice'
         |_ _ _text '7598.00'

You can download the original, unedited version of ascii-treeview.xsl from http://skew.org/xml/stylesheets/treeview/ascii/. I have edited this stylesheet so that it will find and label namespace nodes and ignore insignificant whitespace.

The stylesheet referenced at the top of nodes.xml is tree-view.xsl. It is the Pretty XML Tree Viewer, also developed by Mike Brown and Jeni Tennison. It produces HTML output rather than ASCII. You can get tree-view.xsl, along with its required companion stylesheet tree-view.css, from http://skew.org/xml/stylesheets/treeview/html/. There already are edited copies of these stylesheets in examples/ch04.

If you open and view nodes.xml with IE, you will see the result shown in Figure 4-3. The seven node types are all represented, as you can see from the labels.

Figure 4-3. nodes.xml shown in IE

As with ascii-treeview.xsl, I have made a few small edits to tree-view.xsl. One edit changes a parameter to a nonzero value, switching on the behavior that makes the stylesheet show namespace nodes. I have also uncommented a line so that insignificant whitespace is stripped using the strip-space element. You will learn more about parameters in Chapter 7. You will learn about stripping and preserving insignificant whitespace later in the book.

The xml:lang Attribute

The document nodes.xml uses the xml:lang attribute. This attribute indicates that the content of the element that carries it, along with the values of that element's attributes, is written in the language identified by the value of xml:lang. It is a special attribute from the XML namespace, http://www.w3.org/XML/1998/namespace, which is associated with the xml prefix. Its value is a token that represents a language according to IETF RFC 1766, Tags for the Identification of Languages (see http://www.ietf.org/rfc/rfc1766.txt), in conjunction with the ISO/IEC 639 standards (see http://www.iso.ch). (IETF RFC, by the way, stands for Internet Engineering Task Force Request for Comments; see http://www.ietf.org/rfc.html.) Some examples of these tokens are en for English, fr for French, de for German (Deutsch), and es for Spanish (Español). A token can also carry a subtag, as in en-US for United States English and en-GB for British English.
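Because xml:lang belongs to the XML namespace, software generally looks it up by namespace URI plus local name rather than by the literal string xml:lang. The sketch below shows this with Python's xml.etree.ElementTree, which writes such names as {uri}local. The language_parts helper is hypothetical, and for simplicity it inspects only the element itself and splits off just the first subtag.

```python
import xml.etree.ElementTree as ET

# The namespace URI that is permanently bound to the xml prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The document element from nodes.xml.
AMOUNT = (
    '<amount vendor="314" xml:lang="en" '
    'xmlns="urn:wyeast-net:invoice">7598.00</amount>'
)

def language_parts(element):
    """Return (primary tag, subtag or None) from an element's xml:lang."""
    # ElementTree spells namespaced attribute names as {uri}local.
    value = element.get(f"{{{XML_NS}}}lang")
    if value is None:
        return None
    primary, _, subtag = value.partition("-")
    return primary, (subtag or None)

amount = ET.fromstring(AMOUNT)
print(language_parts(amount))
```

An element without the attribute yields None here, even if an ancestor declares a language; a fuller version would walk up the tree to find the nearest xml:lang.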

4.1.3 What's a Context?

In order to work properly, XPath and XSLT have to keep track of where processing occurs in the source document and what node it's working on at any particular moment. XPath and XSLT have developed a vocabulary to describe such things. The more familiar you are with the terms described in the following paragraphs, the better off you will be when working with XSLT. You will get more and more exposure to these terms throughout the remainder of this book.

Most of the terms revolve around something called a context. In XPath, the context node is the node that is currently selected and being processed. The context node is usually a node addressed by a select attribute, such as the one on the apply-templates element. The XSLT spec also refers to a current node, which is almost always the same thing as the context node. You can retrieve the current node with the current() function, an XSLT function that I'll discuss in Chapter 5.

The only time the context node and the current node are not the same thing is when a predicate is being evaluated. A predicate is a filter for nodes, contained in square brackets, such as in amount[@xml:lang='en']. While the expression inside the square brackets is evaluated, each node being tested temporarily becomes the context node, whereas the current node stays what it was outside the predicate. That is why current() is useful inside a predicate. You'll learn about predicates in Section 4.5, later in this chapter.
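The filtering behavior of a predicate can be tried out with Python's xml.etree.ElementTree, which supports a small subset of XPath. This is a rough sketch: the batch document is invented for illustration, it filters on the vendor attribute rather than xml:lang to keep namespaces out of the picture, and ElementTree has no notion of XSLT's current node.

```python
import xml.etree.ElementTree as ET

# Hypothetical batch of invoices, invented for this sketch.
BATCH = """
<batch>
  <amount vendor="314">7598.00</amount>
  <amount vendor="271">120.50</amount>
  <amount vendor="314">89.99</amount>
</batch>
"""

batch = ET.fromstring(BATCH)

# The predicate in square brackets keeps only the matching amount elements.
filtered = batch.findall("amount[@vendor='314']")

# The same filter written out by hand: each candidate is tested in turn,
# much as each node becomes the context node while a predicate is evaluated.
by_hand = [a for a in batch.findall("amount") if a.get("vendor") == "314"]

print([a.text for a in filtered])
```

Both lists hold the same two elements; the loop variable in the hand-written version plays the role that the context node plays inside the square brackets.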

A node-set is an unordered set of nodes that can be of different kinds. A node-set can consist of an unordered group of element, attribute, and text nodes, for example. The current node list is an XSLT term and refers to an ordered list of nodes, obtained when, for example, the select attribute of the apply-templates element is processed.

The context position, a positive integer, is an XPath term for the position of the context node within the list of nodes being processed, something like the current index when iterating through an array or vector in a programming language, though numbering starts at 1, not 0. The context size is the number of nodes in that list, much like an array's length, and is also a positive integer.
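The array analogy can be spelled out in Python. In this sketch, which uses an invented batch document and a hypothetical with_context helper, each node is paired with the values that the XPath functions position() and last() would report for it. It imitates the bookkeeping only; it is not how any particular processor is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document, invented for this sketch.
BATCH = (
    "<batch>"
    "<amount>7598.00</amount><amount>120.50</amount><amount>89.99</amount>"
    "</batch>"
)

def with_context(node_list):
    """Yield (node, context position, context size) for each node in turn."""
    size = len(node_list)  # context size: the same for every node in the list
    # Context positions start at 1, unlike Python's 0-based indexes.
    for position, node in enumerate(node_list, start=1):
        yield node, position, size

current_node_list = ET.fromstring(BATCH).findall("amount")
for node, position, size in with_context(current_node_list):
    print(f"{node.text}: position {position} of {size}")
```

The last node is the one whose context position equals the context size.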

The term document order refers to the order in which nodes are encountered when a source document is read from beginning to end. The current node list can be a subset of the nodes in a source tree, taken in document order. Along a given axis, nodes are counted in either forward or reverse document order, depending on the axis: the child axis is a forward axis, for example, while the ancestor axis is a reverse axis (see Section 4.6, later in the chapter, for a more thorough explanation).
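Document order is easy to observe with Python's xml.etree.ElementTree, whose iter() method visits elements in the order their start tags appear. The sketch below uses an invented invoice document and covers element nodes only, which is an assumption of the example rather than a limit of the XPath data model.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document, invented for this sketch.
SOURCE = (
    "<invoice>"
    "<vendor><name>Wyeast</name></vendor>"
    "<amount>7598.00</amount>"
    "<date/>"
    "</invoice>"
)

root = ET.fromstring(SOURCE)

# iter() yields the elements in forward document order.
forward = [element.tag for element in root.iter()]
backward = list(reversed(forward))

# A subset of nodes, picked out of order, then put back into document order.
order = {element: index for index, element in enumerate(root.iter())}
picked = [root.find("amount"), root.find("vendor/name")]
in_document_order = [e.tag for e in sorted(picked, key=order.get)]

print(forward)
print(backward)
print(in_document_order)
```

The name element sorts ahead of amount because its start tag comes first in the source, even though it sits deeper in the tree.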

If you don't feel like you've got your arms around all these terms, that's okay: you'll get more exposure to them over time and they'll eventually sink in. Now that you have a basic understanding of the XPath data model and some of its essential terminology, I'll start exploring expressions and patterns after a brief discussion of location paths.