XPath Node Trees

Working with XML documents as node trees is a conceptual way of looking at them. As you can tell from the name, the root node is at the base of the tree, and all other nodes are in a tree structure beginning at the root. Considering XML documents as node trees means that XPath can work with the relationships between nodes in different ways, and those ways are the XPath axes. When you use an axis, you tell XPath what relationships you want to explore in the node tree, starting with the context node. We'll see all the axes at work in Chapter 3.

Let's take a look at an example. Listing 2.3 shows ch02_03.xml, a short XML document holding two book titles.

Listing 2.3 A Short XML Document ( `ch02_03.xml` )

    <?xml version="1.0"?>
    <library>
        <book>
            <title>
                I Love XPath
            </title>
            <title>
                XPath is the BEST
            </title>
        </book>
    </library>

Here's how the XML document we just saw looks to an XPath processor as a tree of nodes:

                         root
                  element: <library>
                    element: <book>
              ---------------------------
      element: <title>          element: <title>
    text: "I Love XPath"    text: "XPath is the BEST"

Actually, the preceding tree diagram does not represent the whole picture from an XPath processor's point of view. I've left out one type of node that causes a great deal of confusion in XPath: text nodes that contain only whitespace. The sample XML document we've been working on so far is nicely indented to show the hierarchical structure of its elements, like this:

    <?xml version="1.0"?>
    <library>
        <book>
            <title>
                I Love XPath
            </title>
            <title>
                XPath is the BEST
            </title>
        </book>
    </library>

However, from an XPath point of view, the whitespace we've used to indent elements in this example actually represents text nodes. That means that by default, those spaces will be copied to the output document. The way whitespace works is a major source of confusion in XPath, so we'll see how it works in this example.

Four characters are treated as whitespace: spaces, carriage returns, line feeds, and tabs. That means that from an XSLT processor's point of view, the input document looks like this:

    <?xml version="1.0"?>
    <library>
    ....<book>
    ........<title>
    ............I Love XPath
    ........</title>
    ........<title>
    ............XPath is the BEST
    ........</title>
    ....</book>
    </library>

All the whitespace between the elements is treated as whitespace text nodes in XPath. That means that there are five whitespace text nodes we have to add to our diagram: one before the <book> element, one after the <book> element, as well as one before, after, and in between the <title> elements:

                                                  root
                                           element: <library>
                                 ---------------------------------------
                         text: whitespace    element: <book>    text: whitespace
            ---------------------------------------------------------------------------------
    text: whitespace    element: <title>    text: whitespace    element: <title>    text: whitespace
                      text: "I Love XPath"                  text: "XPath is the BEST"

Whitespace nodes like these are text nodes that contain nothing but whitespace. XPath processors preserve this whitespace by default. Note that text nodes that contain characters other than whitespace are not considered whitespace nodes, and so will never be stripped from a document.
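To make the count concrete, here is a minimal sketch that parses the sample document with Python's standard `xml.dom.minidom` module and collects the whitespace-only text nodes. Note the assumptions: minidom builds a DOM tree rather than an XPath tree (for this small document the two agree on where the text nodes are), and the names `XML_WHITESPACE`, `is_whitespace_node`, and `walk` are invented for this illustration.

```python
from xml.dom import minidom

# The four characters XML treats as whitespace: space, tab, CR, LF.
# (Python's str.isspace() accepts more characters, so it is not used here.)
XML_WHITESPACE = " \t\r\n"

DOC = """<?xml version="1.0"?>
<library>
    <book>
        <title>
            I Love XPath
        </title>
        <title>
            XPath is the BEST
        </title>
    </book>
</library>
"""


def is_whitespace_node(node):
    """True for a text node that contains nothing but XML whitespace."""
    return (node.nodeType == node.TEXT_NODE
            and all(ch in XML_WHITESPACE for ch in node.data))


def walk(node):
    """Yield a node and then all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)


document = minidom.parseString(DOC)
library = document.documentElement
book = document.getElementsByTagName("book")[0]

whitespace_nodes = [n for n in walk(library) if is_whitespace_node(n)]

print([n.nodeName for n in library.childNodes])
print([n.nodeName for n in book.childNodes])
print(len(whitespace_nodes))
```

The text nodes inside the two `<title>` elements also begin and end with whitespace, but because they contain other characters as well, `is_whitespace_node` does not count them.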

As we know, attributes are treated as nodes as well. Although attribute nodes are not considered child nodes of the elements in which they appear, the element is considered their parent node. Suppose you add an attribute to an element like this:

    <?xml version="1.0"?>
    <library>
        <book>
            <title>
                I Love XPath
            </title>
            <title pub_date="2003">
                XPath is the BEST
            </title>
        </book>
    </library>

Here's how this attribute would appear in the document tree:

                                                  root
                                           element: <library>
                                 ---------------------------------------
                         text: whitespace    element: <book>    text: whitespace
            ---------------------------------------------------------------------------------
    text: whitespace    element: <title>    text: whitespace    element: <title>    text: whitespace
                      text: "I Love XPath"                ------------------------------
                                              text: "XPath is the BEST"    attribute: pub_date="2003"
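The asymmetry between attributes and children can be checked with a short sketch, again using Python's standard `xml.dom.minidom` as a stand-in for an XPath processor. This is an assumption worth flagging: DOM is a different data model, so it reports the owning element through `ownerElement` rather than as a parent, but it shows the same point, namely that the attribute belongs to the element without being one of its children.

```python
from xml.dom import minidom

DOC = """<?xml version="1.0"?>
<library>
    <book>
        <title>
            I Love XPath
        </title>
        <title pub_date="2003">
            XPath is the BEST
        </title>
    </book>
</library>
"""

document = minidom.parseString(DOC)
second_title = document.getElementsByTagName("title")[1]

# Fetch the attribute as a node in its own right.
pub_date = second_title.getAttributeNode("pub_date")

# The element's children contain only its text node, not the attribute.
child_names = [child.nodeName for child in second_title.childNodes]
attribute_is_child = any(child is pub_date for child in second_title.childNodes)

print(child_names)
print(pub_date.name, pub_date.value)
print(attribute_is_child)

# DOM's ownerElement plays the role of the attribute's parent in XPath terms.
print(pub_date.ownerElement is second_title)
```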

When you consider an XML document as a tree of nodes, there are various relationships between those nodes. For example, take our simple example:

                         root
                  element: <library>
                    element: <book>
              ---------------------------
      element: <title>          element: <title>
    text: "I Love XPath"    text: "XPath is the BEST"

The root node is at the very top of the tree, followed by the root element's node, corresponding to the <library> element. This is followed by the <book> node, which has two <title> node children. These two <title> nodes are grandchildren of the <library> element. The parents, grandparents, and great-grandparents of a node, all the way back to and including the root node, are that node's ancestors. The nodes that are descended from a node (its children, grandchildren, great-grandchildren, and so on) are called its descendants. As we've seen, nodes on the same level that share a parent are called siblings.
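These relationships are easy to trace by hand in code. The following sketch walks the same document with Python's standard `xml.dom.minidom`. The helper names `ancestors`, `descendants`, and `siblings` are invented for this illustration and are not XPath functions; minidom's document node (reported as `#document`) stands in for the XPath root node.

```python
from xml.dom import minidom

DOC = """<?xml version="1.0"?>
<library>
    <book>
        <title>
            I Love XPath
        </title>
        <title>
            XPath is the BEST
        </title>
    </book>
</library>
"""


def ancestors(node):
    """Yield the parent, grandparent, and so on, up to the root node."""
    node = node.parentNode
    while node is not None:
        yield node
        node = node.parentNode


def descendants(node):
    """Yield children, grandchildren, and so on, in document order."""
    for child in node.childNodes:
        yield child
        yield from descendants(child)


def siblings(node):
    """Return the other nodes that share this node's parent."""
    return [n for n in node.parentNode.childNodes if n is not node]


def is_element(node):
    return node.nodeType == node.ELEMENT_NODE


document = minidom.parseString(DOC)
library = document.documentElement
first_title = document.getElementsByTagName("title")[0]

ancestor_names = [n.nodeName for n in ancestors(first_title)]
descendant_elements = [n.nodeName for n in descendants(library) if is_element(n)]
sibling_elements = [n.nodeName for n in siblings(first_title) if is_element(n)]

print(ancestor_names)
print(descendant_elements)
print(sibling_elements)
```

The element filter is there because the whitespace text nodes discussed earlier are also descendants and siblings; without it they would show up in both lists.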

XPath 1.0 formalizes these relationships with its 13 axes, which we're going to start using in Chapter 3. These axes include the child axis, which lets you indicate that you're interested in children of the context node, the descendant axis, which points to descendants of the context node, and so on.

You use these axes to navigate from the context node along the branches of the node tree to the node(s) you want. Here are a few examples:

`/descendant::planet[position() = 3]` Returns the third <planet> element in the document.
`preceding-sibling::name[position() = 2]` Returns the second previous <name> sibling element of the context node.
`ancestor::planet` Returns all <planet> ancestors of the context node.
`ancestor-or-self::planet` Returns the <planet> ancestors of the context node. If the context node is a <planet> as well, also returns the context node.
`child::*/child::planet` Returns all <planet> grandchildren of the context node.
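To give a feel for what three of these expressions select, here is a rough emulation in Python's standard `xml.etree.ElementTree`. Several things here are assumptions: ElementTree supports only a small subset of XPath and none of the axis names, so each axis is imitated with ordinary tree navigation; the planets document is hypothetical, invented for this sketch; and `parent_of` and `ancestor_or_self` are illustrative helpers, not library functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this sketch.
DOC = """<planets>
  <group>
    <planet><name>Mercury</name></planet>
    <planet><name>Venus</name></planet>
  </group>
  <group>
    <planet><name>Earth</name></planet>
  </group>
</planets>"""

root = ET.fromstring(DOC)

# ElementTree elements do not know their parents, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

# /descendant::planet[position() = 3]
# The third <planet> in document order, wherever it sits in the tree.
third_planet = list(root.iter("planet"))[2]

# child::*/child::planet, with <planets> as the context node.
# "./*/planet" is the abbreviated form that ElementTree understands.
grandchildren = root.findall("./*/planet")


def ancestor_or_self(node, tag):
    """Imitate ancestor-or-self::tag by walking up from the context node."""
    found = []
    while node is not None:
        if node.tag == tag:
            found.append(node)
        node = parent_of.get(node)
    return found


earth_name = third_planet.find("name")

print(third_planet.find("name").text)
print(len(grandchildren))
print(len(ancestor_or_self(earth_name, "planet")))
print(len(ancestor_or_self(third_planet, "planet")))
```

With the `<name>` element as the context node, the walk finds its enclosing `<planet>`; with the `<planet>` itself as the context node, the "or-self" part returns that same element.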

That completes our look at the XPath data model in this chapter. We started by taking a look at the various data types you can use in XPath: numbers, strings, Booleans, and node-sets. Then we took a closer look at the different types of XPath nodes that you can use in node-sets, and saw that when nodes are arranged into trees, you can use XPath axes to access them.

Now that we know how the XPath data model works (that is, how XPath views the data in an XML document) and have an introduction to using XPath axes to take advantage of the relationships that XPath knows about between nodes, we're ready to start working with real XPath expressions, and we'll do that in Chapter 3.

Before we finish with data models entirely, however, it's worth noting that there are other XML data models than the XPath data model (the Infoset and DOM models, for example), and we'll take a look at them and how they impact XPath 1.0 next. (If you prefer, you can skip this material and go directly to Chapter 3, or skim over it; I've added it for the sake of completeness for readers who use the Infoset and DOM data models.)
