Major Features of XPath | Professional XML (Programmer to Programmer)

What follows is a short introduction to some of the most important concepts and features of XPath. Here you will learn about node types, how to navigate the tree structure of an XML document with path expressions, and how to use predicates, axes, and sequences.

Nodes

XPath looks at an XML document as a tree of nodes. Let's see what those nodes are through the following example:

      <catalog>
          <product id="mug">
              <price>5.95</price>
              <description>Custom printed stainless steel coffee mug</description>
          </product>
          <product id="table">
              <price>119.95</price>
              <description>Natural maple bedside table</description>
          </product>
      </catalog>

From the perspective of XPath, everything in this document is a node. There are seven types of nodes in XPath. The following four are used most frequently:

- Element nodes, such as catalog or product.
- Attribute nodes, such as id="mug".
- Text nodes, such as 5.95 or Custom printed stainless steel coffee mug.
- The document node, a somewhat artificial node that stands as the root of the tree, with one of its children (and sometimes its only child) being the root element (such as catalog in the previous example).

The other three types of nodes that you might encounter occasionally in XPath are as follows:

- Processing instructions
- Namespaces
- Comments
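To make the node types concrete, here is a minimal sketch that parses a small document with Python's standard xml.dom.minidom module and reports the kind of each node. The DOM is not the XPath data model (for instance, it represents namespace declarations as attributes rather than as separate namespace nodes), so treat this as an approximation. The sample document, the audit processing instruction, and the kind() helper are made up for illustration.

```python
from xml.dom import Node, minidom

# Hypothetical sample document; no whitespace between tags, so there are
# no whitespace-only text nodes to account for.
XML = (
    '<catalog>'
    '<!-- sample data -->'
    '<?audit checked?>'
    '<product id="mug"><price>5.95</price></product>'
    '</catalog>'
)

# Map DOM node type constants to the names used in the text.
KIND = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
}


def kind(node):
    """Return a readable name for the node's type."""
    return KIND[node.nodeType]


doc = minidom.parseString(XML)              # the document node
catalog = doc.documentElement               # the root element
comment, pi, product = catalog.childNodes   # three children of catalog
id_attr = product.getAttributeNode("id")    # the id attribute node
price_text = product.firstChild.firstChild  # the text node inside <price>

for node in (doc, catalog, id_attr, price_text, comment, pi):
    print(kind(node))
```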

Note

Do not confuse elements with tags. Tags refer to the lexical structure of XML, where <product> and </product> are opening and closing tags. An element is the whole unit delimited by those tags: for a product, that includes its id attribute and its price and description child elements.

In XPath, you always talk about elements, not tags. If you write an expression that points to the first product element, it returns the whole element, including its attributes and anything else between the opening and closing tags in the textual representation of XML.

Tree Structure

Nodes are organized in a tree structure as follows:

- Every node has exactly one parent, except the document node, which doesn't have a parent.
- Nodes can have zero or more child nodes.
- Nodes that have the same parent are called siblings.
- The ancestors of a node are its parent, the parent of the parent, and so on until you reach the document node.
- The descendants of a node are its children, the children of those children, and so on, down to and including the nodes that don't have any children.
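These relationships can be sketched with Python's standard xml.dom.minidom module, whose parentNode, childNodes, and nextSibling properties mirror the terms above. The ancestors() and descendants() helpers are hypothetical names written for this example, not part of any library.

```python
from xml.dom import minidom

# The catalog from the earlier example, written without whitespace between
# tags so that every child node is an element or a real text value.
XML = (
    '<catalog>'
    '<product id="mug"><price>5.95</price>'
    '<description>Custom printed stainless steel coffee mug</description></product>'
    '<product id="table"><price>119.95</price>'
    '<description>Natural maple bedside table</description></product>'
    '</catalog>'
)

doc = minidom.parseString(XML)
catalog = doc.documentElement
mug, table = catalog.childNodes
price = mug.firstChild


def ancestors(node):
    """Parent, grandparent, and so on, up to and including the document node."""
    result = []
    while node.parentNode is not None:
        node = node.parentNode
        result.append(node)
    return result


def descendants(node):
    """Children, their children, and so on, in document order."""
    result = []
    for child in node.childNodes:
        result.append(child)
        result.extend(descendants(child))
    return result


print(doc.parentNode)                           # the document node has no parent
print(mug.nextSibling is table)                 # the two products are siblings
print([n.nodeName for n in ancestors(price)])   # product, catalog, #document
print(len(descendants(catalog)))                # elements and text nodes below catalog
```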

Path Expressions

The tree structure of an XML document is not unlike the structure of a file system. Instead of the elements and attributes used in XML, the file system has directories and files. On UNIX or Windows, you use a particular syntax, called a path, to point to a directory or file. The path to a file looks like C:\windows\system32\drivers\etc\hosts on Windows or /etc/hosts on UNIX. In both cases, you specify directory and file names starting from the root and separate them with a forward slash or a backslash. For example, A/B or A\B refers to the child B of A.

The same is true in XPath. Applied to the example document shown earlier, the following path expressions work like this:

- /catalog points to the catalog element.
- /catalog/product points to the two product elements, which are children of the catalog element.

Just as with path expressions on UNIX or Windows, you can use .. (two dots) to refer to the parent node. For example:

- /catalog/product/.. is another (albeit longer) way to point to the catalog element.
- /catalog/.. points to the parent of the catalog element, which is the document node.

Note

In the first expression, /catalog/product returns two product elements. So you might wonder whether /catalog/product/.. returns the parent of each of these two elements and, because that parent is the same, whether the expression returns the catalog element twice. This doesn't happen, because a path expression never returns duplicate nodes. So /catalog/product/.. returns just one node: the catalog element.

If you prefix a name with @ (the "at" symbol), it points to an attribute with that name. For instance, the following expression returns the two id attributes “mug” and “table”:

      /catalog/product/@id
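The following sketch tries these path expressions with Python's standard xml.etree.ElementTree module. Two assumptions apply: ElementTree implements only a small subset of XPath, and its paths are relative to the element you call them on, so "product" evaluated on the catalog element plays the role of /catalog/product. ElementTree also has no attribute selection step, so the ids are read with get().

```python
import xml.etree.ElementTree as ET

XML = """<catalog>
  <product id="mug">
    <price>5.95</price>
    <description>Custom printed stainless steel coffee mug</description>
  </product>
  <product id="table">
    <price>119.95</price>
    <description>Natural maple bedside table</description>
  </product>
</catalog>"""

catalog = ET.fromstring(XML)  # the root element, <catalog>

# Roughly /catalog/product: the product children of catalog.
products = catalog.findall("product")

# Roughly /catalog/product/..: the parent of both products, returned once.
parents = catalog.findall("product/..")

# Roughly /catalog/product/@id: ElementTree cannot select attribute nodes,
# so read the attribute value from each element instead.
ids = [p.get("id") for p in products]

print(len(products), len(parents), ids)
```

Even though two product elements are matched, the parent step yields the catalog element only once, which matches the no-duplicates rule described in the note.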

Predicates

What if you don't want to get all the products from the catalog, but only those with a price lower than 10 dollars? You can filter the nodes returned by a path expression by adding a condition between square brackets. So to return only the products with a price lower than 10 dollars, you would write this:

      /catalog/product[price < 10]

There are two types of predicates:

- When the expression in the predicate evaluates to a value of a numeric type, it is called a numeric predicate. A numeric predicate selects the node whose context position equals the specified value. Context positions are 1-based (not 0-based). For example:
  - /catalog/product[1] returns the first product, the one with id “mug”.
  - /catalog/product[0] doesn't return any product, because the context position of the first product is 1. Note that this is a valid expression that does not generate an error.
- When the expression does not evaluate to a value of a numeric type, it is taken as a Boolean value. If the expression does not evaluate to a Boolean value, it is converted with the boolean() function. For example:
  - /catalog/product[price >= 100 and price < 200] returns products with a price of at least 100 and less than 200 dollars.
  - /catalog/product[contains(description, 'table')] returns products with descriptions that contain the word "table". Note that the predicate expression uses the contains() function, and the 'table' string is within single quotes.
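As a rough illustration of what these predicates select, the sketch below applies the same filters in Python. It assumes the standard xml.etree.ElementTree module, which understands the positional predicate [1] but not comparisons or contains(), so those conditions are written as ordinary Python tests. The price() helper is a hypothetical name for this example.

```python
import xml.etree.ElementTree as ET

XML = """<catalog>
  <product id="mug">
    <price>5.95</price>
    <description>Custom printed stainless steel coffee mug</description>
  </product>
  <product id="table">
    <price>119.95</price>
    <description>Natural maple bedside table</description>
  </product>
</catalog>"""

catalog = ET.fromstring(XML)
products = catalog.findall("product")


def price(product):
    """Numeric value of the product's price child element."""
    return float(product.findtext("price"))


# Equivalent in spirit to /catalog/product[price < 10]
cheap = [p.get("id") for p in products if price(p) < 10]

# Equivalent in spirit to /catalog/product[price >= 100 and price < 200]
mid_range = [p.get("id") for p in products if 100 <= price(p) < 200]

# Equivalent in spirit to /catalog/product[contains(description, 'table')]
tables = [p.get("id") for p in products if "table" in p.findtext("description")]

# Numeric predicate, supported by ElementTree directly; positions start at 1.
first = [p.get("id") for p in catalog.findall("product[1]")]

print(cheap, mid_range, tables, first)
```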

Boolean expressions in predicates

Many developers don't know exactly how the boolean() function works, so the safest approach is to always write predicate expressions that return either a numeric value or a Boolean value. For example:

- boolean() converts an empty string to false and a nonempty string to true, even if the value of the string is the text “false”. So to get all the products with nonempty descriptions, you could write this:
```
 /catalog/product[string(description)] 
```
However, to make sure the predicate itself evaluates to a Boolean value, you should write this:
```
 /catalog/product[description != ''] 
```
- boolean() converts an empty sequence to false and a node to true. So you could use the following expression to return all the products that have an id attribute:
```
 /catalog/product[@id] 
```
However, to state your intention more clearly, you can use the exists() function, like this:
```
 /catalog/product[exists(@id)] 
```
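The conversion rules mentioned above can be modeled in a few lines of Python. This is a simplified sketch, not the full XPath definition: it assumes a Python list stands in for a sequence of nodes, and xpath_boolean() is a hypothetical name chosen for this example.

```python
import math


def xpath_boolean(value):
    """Simplified model of boolean(); a Python list stands in for a node sequence."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Numbers are false when they are zero or NaN.
        return not (value == 0 or math.isnan(value))
    if isinstance(value, str):
        # Only the empty string is false; the text "false" is still true.
        return len(value) > 0
    if isinstance(value, list):
        # An empty sequence is false; a sequence of nodes is true.
        return len(value) > 0
    raise TypeError("unsupported value for this sketch: %r" % (value,))


print(xpath_boolean(""))         # empty string
print(xpath_boolean("false"))    # nonempty string
print(xpath_boolean([]))         # empty sequence
print(xpath_boolean(["<id>"]))   # one placeholder standing in for a node
```

The second call is the surprising one: a nonempty string converts to true whatever its text says, which is why an explicit comparison such as description != '' states the intent more clearly.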

Axes

XPath expressions navigate through a tree. An axis is the direction in which this navigation happens. Let's see what this means for the expression /catalog/product that you have seen before:

- The first / refers to the document node.
- catalog selects the catalog child element of the document node.
- /product selects the product child elements of the catalog element.

The / operator is used here to select child elements. But you can also use it to navigate other axes, as they are called in XPath. For example, this expression selects the product elements that follow the first product:

      /catalog/product[1]/following-sibling::product

In the case of the document you saw earlier, this returns the second product. You select the following-sibling axis by prefixing the last occurrence of product with following-sibling::. When no axis is specified in front of an element name, the child axis is implied. So you could rewrite the /catalog/product expression as follows:

      /child::catalog/child::product
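To show what the following-sibling axis selects, here is a small sketch using Python's standard xml.etree.ElementTree module. ElementTree does not support axis syntax such as following-sibling::, so the axis is emulated by slicing the parent's list of children. The following_siblings() helper is a hypothetical name for this example.

```python
import xml.etree.ElementTree as ET

XML = """<catalog>
  <product id="mug">
    <price>5.95</price>
  </product>
  <product id="table">
    <price>119.95</price>
  </product>
</catalog>"""

catalog = ET.fromstring(XML)
products = catalog.findall("product")  # child axis: children named product


def following_siblings(parent, node, tag):
    """Children of parent that come after node and have the given tag."""
    siblings = list(parent)
    start = siblings.index(node) + 1
    return [s for s in siblings[start:] if s.tag == tag]


# Equivalent in spirit to /catalog/product[1]/following-sibling::product
after_first = following_siblings(catalog, products[0], "product")
print([p.get("id") for p in after_first])

# The last product has no following siblings.
after_last = following_siblings(catalog, products[1], "product")
print(after_last)
```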

There are 13 axes available in XPath. The eight axes that are most frequently used are these:

- descendant
- descendant-or-self
- following-sibling
- following
- ancestor
- preceding-sibling
- preceding
- ancestor-or-self

The remaining five axes are these:

- parent
- self
- attribute
- child
- namespace

In the previous example, the child axis was written explicitly as /child::catalog, which is just a long version of /catalog. Similarly, /catalog/product/attribute::id is a long version of /catalog/product/@id. Note that child and attribute are defined as two distinct axes. One consequence is that an attribute is not a child of the element on which it is defined. The @id attribute is not a child of the product element, but the product element is the parent of the @id attribute.
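The DOM draws a similar line between attributes and children, which makes it a convenient way to see the distinction. The sketch below uses Python's standard xml.dom.minidom module. One difference to keep in mind: the DOM exposes the element that carries an attribute as ownerElement, whereas XPath calls that element the attribute's parent.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<catalog><product id="mug"><price>5.95</price></product></catalog>'
)
product = doc.documentElement.firstChild
id_attr = product.getAttributeNode("id")

# The children of product are its child nodes only; the attribute is not among them.
child_names = [child.nodeName for child in product.childNodes]
is_child = any(child is id_attr for child in product.childNodes)

# The attribute still knows which element it belongs to.
owner = id_attr.ownerElement

print(child_names, is_child, owner is product)
```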

Sequences

You have seen expressions that return more than one element, like /catalog/product. They are said to return a sequence. Sequences in XPath are similar to lists in other languages: they can contain items of different types, they can contain duplicates, and their items are ordered. However, a sequence cannot contain other sequences; sequences cannot be nested.

Path expressions can return sequences, but you can also build your own sequences using the comma (,) operator. For example, the following expression returns a sequence with the two numbers 42 and 43:

      (42, 43)
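One way to picture the no-nesting rule is a constructor that always flattens its arguments. The sketch below models this in Python, with a list standing in for a sequence. The seq() function is a hypothetical helper written for this illustration, not an XPath API.

```python
def seq(*items):
    """Build a flat sequence: nested sequences are merged into their container."""
    flat = []
    for item in items:
        if isinstance(item, list):
            # A sequence inside a sequence contributes its items, not itself.
            flat.extend(seq(*item))
        else:
            flat.append(item)
    return flat


print(seq(42, 43))                 # two numbers, in order
print(seq(1, seq(2, 3), "a", 1))   # mixed types and duplicates, nesting flattened
print(seq(seq(), 1))               # an empty sequence contributes nothing
```

In this model, seq(1, seq(2, 3)) and seq(1, 2, 3) give the same result, just as the XPath expressions (1, (2, 3)) and (1, 2, 3) denote the same sequence.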