Lessons from the Trenches | Professional XML (Programmer to Programmer)

Even starting with little or no knowledge of XPath, you can become productive very quickly, because in most cases simple problems have a simple solution in XPath. But you can also solve complex problems with XPath. While simple in appearance, XPath is in fact a very powerful language. In this section you learn about some more advanced features of the language: not the obscure features no one knows about, but those we have found the most useful based on years of practical experience with XPath.

When A != B Is Different from not(A = B)

Are you a genius? Of course you are. Then the same question written in XPath, you = genius, returns true(). In this case, you != genius must return false(). No rocket science here: if A=B returns true(), you expect A!=B to return false(), and the other way around. In other words, you expect A != B and not(A = B) to be the same.

In most cases they are, but not always. Because of the way XPath compares sequences, the result of the comparison is true if and only if there is at least one value in the first sequence and one value in the second that, when compared with the specified operator, return true(). This causes the following:

- If at least one of the two sequences is empty, the comparison always returns false, so both () = 42 and () != 42 return false().
- For some sequences, you can find pairs of values, one in the first sequence and one in the second sequence, that match both the = and != operators. For example, (1, 2) = (1) returns true() because 1 is in both sequences. But (1, 2) != (1) also returns true(), because 2 from the first sequence is not equal to 1 from the second sequence.

Is this all here to confuse you? Certainly not. The way comparison works in XPath has a number of benefits, maybe the most important one in practice being that you can use the = operator to check if a value is present in a sequence, like some sort of contains() function. For example, $x = (1, 2, 3, 5, 8, 13, 21, 34, 55, 89), where $x is of type xs:double, returns true() if $x is one of the listed Fibonacci numbers lower than 100.
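The existential rule is easy to model outside XPath. The following Python sketch is only an illustration of the rule described above, not how any XPath engine is implemented; the function names general_eq and general_ne are made up for this example, and sequences are modeled as plain lists.

```python
from itertools import product
import operator


def general_compare(left, right, op):
    """Return True if at least one pair (a, b) satisfies op.

    This mirrors the rule for XPath general comparisons: one matching
    pair is enough, and an empty sequence can never produce a pair.
    """
    return any(op(a, b) for a, b in product(left, right))


def general_eq(left, right):
    # Model of: left = right
    return general_compare(left, right, operator.eq)


def general_ne(left, right):
    # Model of: left != right
    return general_compare(left, right, operator.ne)


# Both comparisons are false when one side is empty.
empty_eq = general_eq([], [42])
empty_ne = general_ne([], [42])

# Both comparisons are true for (1, 2) against (1).
both_eq = general_eq([1, 2], [1])
both_ne = general_ne([1, 2], [1])

# The "contains" idiom: is 13 one of the listed values?
is_listed = general_eq([13], [1, 2, 3, 5, 8, 13, 21, 34, 55, 89])
```

In this model, not(A = B) corresponds to `not general_eq(a, b)`, which is why it can differ from `general_ne(a, b)`.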

The Many Faces of a Document

From time to time, you will come across a function in XPath that returns an XML document. If you are using XSLT 1.0 or 2.0, XPath 2.0, XQuery, or XForms, you can use a mix of instance(), doc(), and document().

XForms is a technology used to create forms. The data you enter in the form is stored in one or more XML documents, which in the XForms jargon are called XForms instances. Each document in XForms has an id, as in the following:

      <xforms:instance id="address">
          <address>
              <street>1 Infinite Loop</street>
              <city>Cupertino</city>
              <state>California</state>
          </address>
      </xforms:instance>

Assuming that the document in the instance with the address id is also accessible at the URI http://www.example.org/address.xml, consider the following XPath expressions:

- instance('address')
- doc('http://www.example.org/address.xml')
- document('http://www.example.org/address.xml')

All three expressions return the same "address" document. You can use the first expression with instance() in XForms, the second one with doc() wherever you have XPath 2.0 or XQuery expressions, and the third with document() within an XPath expression in XSLT 1.0 or 2.0 stylesheets.

Although all three return the same address document, they don't return the same node of the document. instance() returns the root element, whereas doc() and document() return the root node, that is, the document node that sits above the root element. This means that to point to the street element, you will need to write the following:

- instance('address')/street
- doc('http://www.example.org/address.xml')/address/street
- document('http://www.example.org/address.xml')/address/street

Note how the name of the root element (address) is used with doc() and document() but not with instance(). Granted, the difference between doc() / document() on one side and instance() on the other side is trivial. But it is surprisingly easy to make a mistake when you're using both functions in the same day. So keep this in mind: the same document can have many faces depending on which function you use.
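The same distinction between the document node and the root element exists in most DOM APIs. As a rough analogy only, the following Python sketch uses xml.dom.minidom: the parsed document stands in for what doc() and document() return, and its documentElement stands in for what instance() returns. The helper child() is hypothetical and written just for this example.

```python
from xml.dom import minidom

ADDRESS_XML = (
    "<address>"
    "<street>1 Infinite Loop</street>"
    "<city>Cupertino</city>"
    "<state>California</state>"
    "</address>"
)


def child(node, name):
    """Return the first child element of node called name, or None."""
    for candidate in node.childNodes:
        if candidate.nodeType == candidate.ELEMENT_NODE and candidate.tagName == name:
            return candidate
    return None


# Analogous to doc()/document(): the node above the root element.
document_node = minidom.parseString(ADDRESS_XML)
# Analogous to instance(): the root element itself.
root_element = document_node.documentElement

# From the document node you need two steps: /address/street
street_via_document = child(child(document_node, "address"), "street")
# From the root element you need one step: /street
street_via_root = child(root_element, "street")

# Taking a single step from the document node finds nothing.
missing = child(document_node, "street")
```

Both routes end at the same street element; only the starting point differs.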

Tuning Your XPath Expressions

When you write an XPath expression, you describe what information you want to extract from an XML document, but you do not say how that information ought to be extracted. Consider, for example, the expression /phonebook/person[starts-with(phone-number, '323') and last-name = 'Lee']. Imagine you are running this query on a hypothetical XML document that contains the information found in a phone book. The query retrieves all the persons from Hollywood (area code 323) with the last name Lee. Here are a few ways in which the XPath engine could run this query:

- It can go through the list of persons and check the conditions in the order they are written. If the first three digits of the phone number are 323, it checks if the last name is Lee.
- A more advanced engine might figure that because the test on the phone number is more expensive than the straight comparison of the last name with Lee, it is more efficient to do the second comparison first and only perform the comparison on the area code if the last name is Lee.
- An even more advanced engine might maintain an index of the persons based on their last name. With this index, it can quickly locate the persons with the last name Lee. A standalone XPath engine wouldn't typically index XML documents, but this can be expected from an engine running in a database.

The XPath engine has a lot of freedom in the way it runs your XPath queries, and unless you know the engine you are using extremely well, you don't know if a query will run more efficiently because it is written one way instead of another way. So start by writing your queries optimizing for human readability, making your queries explicit and simple to understand. For example:

- Instead of //person use /phonebook/person, because of the following:
  - Using /phonebook/person might be more efficient. With //person, some engines will traverse every element of the document, but they would only need to go through child elements of the root element with /phonebook/person.
  - /phonebook/person states your intention more clearly and makes your code more readable.
- In large XPath expressions, avoid duplicating part of the expression. For example, the expression (count(/company/department[name = 'HR']/employee), avg(/company/department[name = 'HR']/employee/salary)) returns a sequence with two numbers: the number of employees in the HR department, and their average salary. Instead, write it as for $hr in /company/department[name = 'HR'] return (count($hr/employee), avg($hr/employee/salary)). Unlike XQuery, XPath doesn't have a let construct for you to declare variables. In some cases, however, you can get around this by using the for construct.
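The idea of binding a shared sub-expression once carries over to any language. The following Python sketch uses the standard xml.etree.ElementTree module, which supports only a small subset of XPath, on a made-up document shaped to fit the expression above (name and salary as child elements). The data is hypothetical and the code is an analogy for the for construct, not a translation an engine would produce.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical data: department/name and employee/salary are child elements.
company = ET.fromstring("""
<company>
  <department>
    <name>HR</name>
    <employee><salary>60000</salary></employee>
    <employee><salary>80000</salary></employee>
  </department>
  <department>
    <name>Sales</name>
    <employee><salary>50000</salary></employee>
  </department>
</company>
""")

# Like "for $hr in /company/department[name = 'HR'] return (...)":
# locate the HR department once, then reuse the binding twice.
results = []
for hr in company.findall("department[name='HR']"):
    employees = hr.findall("employee")
    head_count = len(employees)
    average_salary = mean(float(e.findtext("salary")) for e in employees)
    results.append((head_count, average_salary))
```

Here results holds one (count, average) pair per matching department, whereas the XPath expression returns the same numbers as a flat sequence.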

Don't try to optimize your XPath expression prematurely. Or as 37signals puts it in their book Getting Real, "it's a problem when it's a problem." Until then, just write clean and readable expressions.

Function Calls in Path Expressions

You may have seen an expression like /company/department[@name = 'Engineering']/employee[@firstname = 'Bruce'] used to retrieve the employee element that corresponds to that Bruce guy in the engineering department. This is a path expression, and each step of the expression selects nodes from the input document relative to the nodes selected by the previous step.

One new feature of XPath 2.0 is that you can have function calls as a step expression. Consider this document:

      <company>
          <department name="HR">
              <employee firstname="John" lastname="Smith" salary="60000"/>
              <employee firstname="Peter" lastname="Strain" salary="70000"/>
              <employee firstname="Carl" lastname="Thompson" salary="80000"/>
          </department>
          <department name="Engineering">
              <employee firstname="Letticia" lastname="Vallejo" salary="80000"/>
              <employee firstname="Bruce" lastname="Wilson" salary="90000"/>
          </department>
      </company>

What if you want to return the name of each department and the average salary of the employees working in that department? With imperative languages like Java, C++, or most scripting languages, you would typically use some type of iteration. In XPath, the equivalent would be to use the for construct to iterate over the departments and then run the avg() function to compute the average salary for each department, like this:

      for $d in /company/department return avg($d/employee/@salary)

When executed on the preceding document, this expression returns (70000, 85000). But instead of having just a list of average salaries, you also want the name of the department, as in: ('HR', 70000, 'Engineering', 85000). A simple addition to the previous query will get you the expected result:

      for $d in /company/department return (string($d/@name), avg($d/employee/@salary))

As mentioned earlier, with XPath 2.0 you can use function calls as step expressions. So instead of string($d/@name), you can write $d/@name/string(). As long as $d/@name returns an attribute node instead of an empty sequence, those two expressions are equivalent, so the one you use is a matter of personal choice. However, keeping simplicity and clarity in mind, it would probably be best to use $d/@name/string(), which does exactly what you'd think from looking at it: given the element in the variable $d, take the attribute name, and then take the string value of that attribute.

Note that in cases where $d/@name can potentially return an empty sequence, those two expressions are not equivalent anymore, because string() applied to an empty sequence returns a zero-length string. So the following occurs:

- string(()) returns a zero-length string.
- ()/string() returns an empty sequence.
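The difference comes down to who handles the empty sequence: the function, or the path step that never gets to call it. The following Python sketch models this with plain lists standing in for sequences. The names string_of and string_step are invented for this illustration, and items are assumed to already be simple values.

```python
def string_of(sequence):
    """Model of string($arg): the function receives the whole sequence.

    An empty sequence yields a zero-length string.
    """
    if len(sequence) > 1:
        raise TypeError("string() accepts at most one item")
    return str(sequence[0]) if sequence else ""


def string_step(sequence):
    """Model of $arg/string(): the step is evaluated once per item.

    With no items there is nothing to evaluate, so the result is empty.
    """
    return [str(item) for item in sequence]


# With one item the two forms agree on the value.
function_on_one = string_of(["HR"])
step_on_one = string_step(["HR"])

# With no items they differ: "" versus an empty sequence.
function_on_empty = string_of([])
step_on_empty = string_step([])
```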

With XPath 2.0, you can further simplify the expression you saw earlier, getting rid of the for construct altogether to create a much simpler expression such as this:

      /company/department/(string(@name), avg(employee/@salary))

You can push the envelope further. In addition to the average salary for each department, you can get the first name of the person who has the highest salary, like this:

      /company/department/(string(@name), avg(employee/@salary),
          employee[@salary = max(../employee/@salary)]/@firstname/string())

This returns ('HR', 70000, 'Carl', 'Engineering', 85000, 'Bruce'). Try to imagine how many lines of code you would need with a traditional programming language if you had to extract the same information from a text file with tab-separated fields, for example. Indeed, using XML to represent data, and XPath 2.0 to extract information from XML, can be quite a time-saver.
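For comparison, here is roughly what the same extraction looks like when written imperatively. This Python sketch uses the standard xml.etree.ElementTree module on the company document from this section; it is one possible equivalent written for illustration, and the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

COMPANY_XML = """
<company>
  <department name="HR">
    <employee firstname="John" lastname="Smith" salary="60000"/>
    <employee firstname="Peter" lastname="Strain" salary="70000"/>
    <employee firstname="Carl" lastname="Thompson" salary="80000"/>
  </department>
  <department name="Engineering">
    <employee firstname="Letticia" lastname="Vallejo" salary="80000"/>
    <employee firstname="Bruce" lastname="Wilson" salary="90000"/>
  </department>
</company>
"""

company = ET.fromstring(COMPANY_XML)

result = []
for department in company.findall("department"):
    employees = department.findall("employee")
    salaries = [float(e.get("salary")) for e in employees]
    highest = max(salaries)

    # string(@name)
    result.append(department.get("name"))
    # avg(employee/@salary)
    result.append(sum(salaries) / len(salaries))
    # employee[@salary = max(../employee/@salary)]/@firstname/string()
    result.extend(
        e.get("firstname") for e in employees if float(e.get("salary")) == highest
    )
```

The loop, the temporary lists, and the explicit numeric conversion are all things the single XPath expression handles implicitly.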

Using Comments and Nested Comments

Did you know you could have comments in XPath? You get lured into using XPath because of its simplicity, and as you get more and more familiar with the language, and recognize how powerful it is, your XPath expressions tend to grow in size. Then one day, they get to such a level of heftiness that adding comments within the expression becomes a requirement. And yes, you can do it. For example, the following adds a couple of comments to the expression you saw previously:

      /company/department/(
          string(@name), (: Department name :)
          avg(employee/@salary), (: Average salary :)
          employee[@salary = max(../employee/@salary)]
              /@firstname/string() (: Employee with highest salary :)
      )

You start a comment with (: and close it with :). One interesting feature of XPath comments is that they can be nested, a feature that is missing from many languages. Here is the use case: you have a complex expression and, maybe for the sole purpose of verifying a hypothesis, you would like to run only a subset of that expression. Say that in the preceding expression, you would like to return only the name of each department. For this, you need to comment out the rest of the expression: the part that computes the average salary per department and the first name of the employee with the highest pay. Because XPath supports nested comments, you don't have to worry about the :) after Average salary being interpreted as the end of your comment. You can write this:

      /company/department/(
          string(@name)
          (:
          , (: Department name :)
          avg(employee/@salary), (: Average salary :)
          employee[@salary = max(../employee/@salary)]
              /@firstname/string() (:Employee with highest salary :)
          :)
      )

When this expression is executed on the previous document, it returns the sequence ('HR', 'Engineering').

Because XPath 2.0 supports nested comments, in most cases you don't need to worry if part of an expression you are commenting out already contains comments. In most cases this just works, but not always. Consider the following expression:

      /company/department[@id = 1 and @name != ':)']

This is a valid XPath expression and when executed on the document shown earlier, it returns the element corresponding to the HR department, because its id is 1 and its name is not equal to the string ":)".

Now comment out the second condition in the predicate:

      /company/department[@id = 1 (: and @name != ':)' :)]

If you run this expression, the XPath engine will throw an error at you that will read something like "Unmatched quote in expression." This is because while the comment is being parsed, the parser only looks for the following two-character sequences:

- (:, which signals the beginning of a nested comment
- :), which signals the end of a comment, nested or not

When the parser finds the :) that was originally inside a string, it considers it the end of the comment. So essentially the previous expression becomes this:

      /company/department[@id = 1 ' :)]

Notice that the single quote that follows the :) is still there, hence the error message "Unmatched quote in expression." Fortunately, this only happens very rarely, and you can usually comment parts of your expressions without any worries, even if the part you are commenting out contains a comment.
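To see why the quote is left dangling, it helps to mimic what the parser does. The following Python sketch is a simplified model written for this explanation, not the tokenizer of any real XPath engine: outside comments it respects string literals, but inside a comment it looks only for the two-character sequences (: and :). The function name strip_comments is an assumption of this example.

```python
def strip_comments(expression):
    """Remove XPath-style (: ... :) comments, honoring nesting.

    Outside a comment, quoted strings are copied verbatim. Inside a
    comment, quotes are ignored and only "(:" and ":)" are recognized.
    """
    output = []
    depth = 0      # current comment nesting level
    quote = None   # delimiter of the string literal we are inside, if any
    i = 0
    while i < len(expression):
        char = expression[i]
        pair = expression[i:i + 2]
        if depth == 0 and quote is None and char in "'\"":
            quote = char              # a string literal starts
            output.append(char)
            i += 1
        elif depth == 0 and quote is not None:
            if char == quote:
                quote = None          # the string literal ends
            output.append(char)
            i += 1
        elif pair == "(:":
            depth += 1                # comment, or nested comment, starts
            i += 2
        elif pair == ":)" and depth > 0:
            depth -= 1                # innermost open comment ends
            i += 2
        else:
            if depth == 0:
                output.append(char)
            i += 1
    return "".join(output)


untouched = strip_comments("/company/department[@id = 1 and @name != ':)']")
nested = strip_comments("a (: x (: y :) z :) b")
broken = strip_comments("/company/department[@id = 1 (: and @name != ':)' :)]")
```

In this model, the first expression is left untouched because its :) sits inside a string, the nested comment is removed as a whole, and the third expression comes out with the stray quote still in place.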

Note

Support for nested comments is one of those features that you could wish every language had, especially XML. Maybe this will be considered for XML 2.0, if there is ever one.

Using Regular Expressions

With XPath 2.0, you get three new functions that let you use regular expressions. But you might be wondering why you need regular expressions in XPath. Regular expressions are useful to extract information from text, and because information is already clearly structured in XML, you should not need regular expressions, right? Although this is certainly true in theory, the documents you have to work with often contain information buried in strings. Consider this document, which represents an order from a customer:

      <order>
          <number>3837482006122593897</number>
          ...
      </order>

Here, the order number starts with a six-digit client number, followed by the year, month, and day when the order was processed, and it ends with digits that make the order number unique if the client sent multiple orders in the same day. You can extract the date from the order number with a series of calls to the substring() function. However, the replace() function in XPath 2.0 makes your job much easier:

      replace(/order/number,          '^[0-9]{6}([0-9]{4})([0-9]{2})([0-9]{2})[0-          '$1-$2-$3')

Execute this XPath expression on the document, and you will get 2006-12-25, where:

- [0-9] matches one digit (i.e., a character between 0 and 9).
- Adding {n} matches exactly n digits. For example, [0-9]{4} matches 4 digits.
- The ^ at the beginning and the $ at the end of the expression indicate that the expression matches the whole string, not just a subset of the string.
- Adding parentheses around parts of the expression enables you to refer to what was matched by this part of the expression with $x. You use $1 to refer to the first parenthesized expression, $2 to the second, and so on.
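The same building blocks exist in most regular expression libraries, so you can try the idea outside XPath. The following Python sketch uses the standard re module on the order number from this section. Two things here are assumptions of the sketch rather than part of the original expression: the trailing [0-9]+$ used to consume the remaining digits, and Python's \1 notation, which plays the role of $1 in XPath.

```python
import re

ORDER_NUMBER = "3837482006122593897"

# Six-digit client number, then year, month, and day as captured groups,
# then the remaining digits (this last part is an assumption of the sketch).
PATTERN = r"^[0-9]{6}([0-9]{4})([0-9]{2})([0-9]{2})[0-9]+$"

# \1, \2, \3 refer to the first, second, and third parenthesized groups.
order_date = re.sub(PATTERN, r"\1-\2-\3", ORDER_NUMBER)

# The same groups are also available individually.
year, month, day = re.match(PATTERN, ORDER_NUMBER).groups()
```

Because the pattern is anchored at both ends, the whole order number is replaced by the rebuilt date rather than just a fragment of it.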

Regular expressions that are available in different languages or libraries are similar, but there are quite a few variations. Regular expressions in XPath are a superset of those available in XML Schema, which in turn are based on Perl regular expressions. One difference is that regular expressions in XML Schema do not support the ^ and $ anchors (a schema pattern always has to match the whole value), and they give you no way to refer to what a parenthesized group matched. Both features are used in the previous expression, so it is quite fortunate that XPath offers significant extensions over what is available in XML Schema.

The unordered() Function: Quite an Oddity

There is a function in XPath that takes one parameter and that XPath engines are free to implement by just returning the parameter. It is the unordered() function.

The unordered() function takes a sequence of items as a parameter and returns a sequence that contains the same items, but not necessarily in the same order. The purpose of this function is not to shuffle items around, but to give an optimization hint to the XPath engine. Consider this expression:

      /company/department/employee[@salary > 80000]

This returns all the employees with a salary higher than 80,000 dollars. In XPath, any expression that uses the / path operator must return nodes in document order. This expression meets that criterion, so if one matching employee appears before another in the document, it must also appear first in the sequence of nodes returned by the expression.

If the document is stored in an XML database, you might have an index for salaries, and the XPath engine might be able to use that index to quickly retrieve the sequence of employees with a salary higher than 80,000 dollars. But because it is created based on the index, this sequence is ordered by increasing salary. To return nodes in document order, the XPath engine needs to reorder the nodes in the sequence. If you don't care about getting the nodes in document order, you can use unordered() to tell the engine that this last reordering is not necessary, like this:

      unordered(/company/department/employee[@salary > 80000])

If you think that unordered() adds clarity to your expressions, then you should use it. But most likely, adding unordered() will do more to clutter your XPath expressions. So you should find out if your engine does anything special with unordered() before using it. If you are using a standalone engine, most likely unordered() won't make any difference. XML databases might handle unordered() in a special way, but they are not guaranteed to do so. For example, using unordered() in the open source eXist XML database won't have any effect.
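The cost that unordered() lets an engine skip can be pictured with a toy index. The following Python sketch is purely hypothetical: the employee data, the index layout, and the function salaries_above are invented for illustration and do not describe how any particular XML database works.

```python
from bisect import bisect_right

# Hypothetical employees, listed in document order.
EMPLOYEES = [("Ann", 95000), ("Bob", 85000), ("Cy", 70000)]

# A toy salary index: (salary, document position), sorted by salary.
SALARY_INDEX = sorted(
    (salary, position) for position, (_, salary) in enumerate(EMPLOYEES)
)


def salaries_above(threshold, ordered=True):
    """Return names of employees earning strictly more than threshold.

    The index yields matches by increasing salary. With ordered=True
    they are re-sorted into document order, which is the extra step
    an engine could skip when the caller does not care about order.
    """
    start = bisect_right(SALARY_INDEX, (threshold, float("inf")))
    positions = [position for _, position in SALARY_INDEX[start:]]
    if ordered:
        positions.sort()
    return [EMPLOYEES[position][0] for position in positions]


in_document_order = salaries_above(80000)
in_index_order = salaries_above(80000, ordered=False)
```

Both calls return the same employees; only the order, and the work needed to produce it, differs.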

Union and Sequence Operators

Consider these two XPath expressions:

- /r/a | /r/b
- /r/a, /r/b

The first expression uses the union operator (|), which already existed in XPath 1.0. The second expression uses the sequence concatenation operator (,). When they are executed on the following document, they both return the same sequence, which contains element a first and element b second:

      <r>
          <a/>
          <b/>
      </r>

If you modify the expression to put /r/b before /r/a, the expression /r/b | /r/a still returns the same result, but /r/b, /r/a returns a sequence with the b element first. This illustrates one difference between the union and concatenation operators: the union operator always returns nodes in document order, and concatenation does not change the order you specified.

Also, the sequence returned by the union operator never contains duplicates, so the following occurs:

- /r/a | /r/a returns the a element.
- /r/a, /r/a returns a sequence that contains the a element twice.

You can only use the union operator on nodes, so using the previous example, note the following:

- /r/a | 1 is not a valid expression.
- /r/a, 1 returns a sequence with the a element first and the atomic value 1 second.

With XPath 2.0, instead of the | character, you can use union, which is strictly equivalent to |.
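The two operators behave like two familiar collection operations. In the following Python sketch, which is a model for illustration only, each node is represented by its position in the document (a is 1, b is 2); the function names union and concatenate are made up for this example.

```python
# Nodes are modeled by their position in document order.
A = 1  # the a element
B = 2  # the b element


def union(left, right):
    """Model of left | right: no duplicates, always in document order."""
    return sorted(set(left) | set(right))


def concatenate(left, right):
    """Model of left, right: keeps the order and the duplicates you wrote."""
    return list(left) + list(right)


# /r/b | /r/a versus /r/b, /r/a
union_reversed = union([B], [A])
concat_reversed = concatenate([B], [A])

# /r/a | /r/a versus /r/a, /r/a
union_twice = union([A], [A])
concat_twice = concatenate([A], [A])
```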

//h1[1] Different Than (//h1)[1]

Say you want to extract the main title from an XHTML document. Looking for h1 elements in the document is a good bet, but there might be more than one h1, so you decide to take the first one, in document order. At first, you might think that you could use the XPath expression //h1[1] to do this. Although it will return the first h1 on some documents, in some cases it will also return other h1 elements. Consider this document:

      <body>
          <div>
              <p/>
              <h1>A</h1>
              <h1>B</h1>
              <p/>
          </div>
          <div>
              <h1>C</h1>
              <p/>
              <h1>D</h1>
              <p/>
          </div>
      </body>

Here //h1[1] returns the h1 elements containing A and C, but not B and D, because A and C are each the first h1 child element of their parent. The expression //h1[1] works this way because the predicate operator ([]) has a higher precedence than the // operator. So //h1[1] is in fact equivalent to //(h1[1]), and not (//h1)[1]. In this case, it is the latter that you want to use to get the first h1 element in the document.
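You can observe the same behavior with Python's standard xml.etree.ElementTree module, which implements a small subset of XPath, including positional predicates. The sketch below is an illustration on the document from this section; because ElementTree does not accept the parenthesized form (//h1)[1], the second lookup indexes the full result list instead.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body>"
    "<div><p/><h1>A</h1><h1>B</h1><p/></div>"
    "<div><h1>C</h1><p/><h1>D</h1><p/></div>"
    "</body>"
)

# Like //h1[1]: the predicate applies per parent, so each div
# contributes its own first h1 child.
first_per_parent = [h1.text for h1 in body.findall(".//h1[1]")]

# Like (//h1)[1]: collect every h1 first, then take the first one.
first_in_document = body.findall(".//h1")[0].text
```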

The precedence of operators, sometimes also referred to as operator priority, is nothing new. In most languages, the multiplication operator has a higher precedence than the addition operator, so 1 + 2 * 3 reads 1 + (2 * 3), not (1 + 2) * 3. Operators can be ranked by their precedence, and for a given language you can assign a number to each operator that represents its precedence. The higher the precedence of an operator is, the higher the number is.

In XPath, the precedence of operators is formally defined by the grammar of the language. It can be quite time-consuming to look at the XPath grammar to figure out what the precedence of an operator is, because the grammar is scattered throughout the XPath specification. Fortunately, the editors have included a table in the appendix with the precedence of each operator. The following table is sorted by ascending precedence, so remember that the lower an operator is in the table, the higher its precedence.

Precedence Number	Operators
1	`,` (comma)
2	`for`, `some`, `every`, `if`
3	`or`
4	`and`
5	`eq`, `ne`, `lt`, `le`, `gt`, `ge`, `=`, `!=`, `<`, `<=`, `>`, `>=`, `is`, `<<`, `>>`
6	`to`
7	`+`, `-`
8	`*`, `div`, `idiv`, `mod`
9	`union`, `|`
10	`intersect`, `except`
11	`instance of`
12	`treat`
13	`castable`
14	`cast`
15	`-(unary)`, `+(unary)`
16	`/`, `//`
17	`[]`, `()`

Reverse Axis: Evil at Times

An XPath expression can return a sequence. Items in the sequence are in a certain order, and each of them has a context position. For example, consider this document, with three employees: John, Peter, and Carl.

      <company>
          <employee firstname="John"/>
          <employee firstname="Peter"/>
          <employee firstname="Carl"/>
      </company>

Now consider these two expressions:

- /company/employee[1]/following-sibling::employee
- /company/employee[3]/preceding-sibling::employee

The first expression returns the employees that follow the first employee. There is not much to be surprised about here: John is the first employee, so it returns Peter and Carl in that order. The second expression gets the employees before Carl. It returns John and Peter in this order, because all the path expressions in XPath return nodes in document order. This can be summarized as follows:

- The first expression returns Peter, Carl.
- The second expression returns John, Peter.

Now add the predicate [1] to both of those expressions, as follows:

- /company/employee[1]/following-sibling::employee[1]
- /company/employee[3]/preceding-sibling::employee[1]

When the value of a predicate is of a numeric type, as is the case here, the predicate is called a numeric predicate. A numeric predicate is true if the value is equal to the context position and false otherwise. So, the item in each sequence that has a context position equal to 1 is as follows:

- The first sequence is composed of Peter and Carl in that order, and Peter is the employee with context position equal to 1.
- The second sequence is composed of John and Peter in that order, and the second employee in the sequence (which in this case is Peter) is the one with a context position equal to 1 (not John, who is the first employee in the sequence).

The reason for this potentially surprising result is that when you use a reverse axis, such as preceding-sibling, position is assigned in reverse order. Because a reverse axis is used here, the context position of the last item in the sequence is 1.

You can think of the engine as assigning context position starting from the node where you start your search. If you are going down, as with following-sibling, context positions are assigned in document order, but if you are going up, as with preceding-sibling, then context positions are assigned in reverse document order. Even if context positions are assigned differently depending on the type of axis you are using, the nodes returned by a path expression are always in document order.
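One way to keep this straight is to picture each axis as a list that starts next to the context node. The following Python sketch models the three employees as a plain list; the functions following_sibling and preceding_sibling are written for this illustration, take a zero-based index for the context node, and accept an optional one-based position that plays the role of the numeric predicate.

```python
EMPLOYEES = ["John", "Peter", "Carl"]  # document order


def following_sibling(index, position=None):
    """Model of following-sibling::employee, optionally with [position]."""
    axis = EMPLOYEES[index + 1:]        # nearest sibling first
    if position is None:
        return axis                     # already in document order
    return axis[position - 1:position]


def preceding_sibling(index, position=None):
    """Model of preceding-sibling::employee, optionally with [position]."""
    axis = EMPLOYEES[:index][::-1]      # nearest sibling first (reverse order)
    if position is None:
        return axis[::-1]               # results come back in document order
    return axis[position - 1:position]


after_john = following_sibling(0)            # employee[1]/following-sibling
before_carl = preceding_sibling(2)           # employee[3]/preceding-sibling
first_after_john = following_sibling(0, 1)   # .../following-sibling::employee[1]
first_before_carl = preceding_sibling(2, 1)  # .../preceding-sibling::employee[1]
```

Positions are counted along the axis, starting from the context node, while results are delivered in document order. Both [1] lookups therefore land on Peter.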

Debugging with trace()

XPath is designed to be used within a host language, such as XSLT. Some host languages provide a tracing facility, such as the <xsl:message> construct in XSLT. Other host languages don't, such as XForms. For this reason, the XPath trace() function can be quite useful.

Note

trace() is an XPath 2.0 function. XForms 1.0 uses XPath 1.0, so unfortunately you can't use trace() in XForms, unless your XForms engine specifically supports XPath 2.0, as does Orbeon Forms.

trace() takes two arguments: a value, which is a sequence of items, and a label, which is a string. It returns the value and logs both the label and value in an implementation-dependent way. Consider what is logged when you execute the following XPath expressions with the open source Saxon engine on the employees document:

- The following expression:
trace(/company/employee[1]/@firstname, 'Name')

logs this:
```
 Name [1]: attribute(firstname, untypedAtomic): /company/employee[1]/@firstname 
```

- The following expression:

 trace(string(/company/employee[1]/@firstname),

logs this:

 Name: xs:string: John

- The following expression:

 trace(/company/employee, 'Employee')

logs this:

 Employee [1]: element(employee, untyped): /company/employee[1]
 Employee [2]: element(employee, untyped): /company/employee[2]
 Employee [3]: element(employee, untyped): /company/employee[3]

- The following expression:

 trace(/, 'Document node')

logs this:

 Document node: document-node(): /

When you use trace() in an expression, there is no guarantee that the function will be executed, because the engine might not need to run that part of the expression, which means that you might not see anything in the trace output. Consider the following expression, which has the form false() and …:

      false() and trace(true(), "This doesn't get displayed")

This is an and expression that starts with false(), so whatever comes after that doesn't matter: the result is always false(). In cases like this, the XPath engine typically does not run the trace() function call.
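The same pattern, a pass-through function that logs as a side effect, is easy to reproduce in other languages, and so is the short-circuit surprise. The following Python sketch is an analogy only: the trace() defined here is a stand-in written for this example, LOG is a made-up list that collects the output, and Python always short-circuits `and`, whereas an XPath engine is merely allowed to.

```python
LOG = []


def trace(value, label):
    """Record label and value, then hand the value back unchanged."""
    LOG.append(f"{label}: {value!r}")
    return value


# The left operand already decides the result, so the right operand,
# including its trace() call, is never evaluated.
skipped = False and trace(True, "This doesn't get displayed")
log_after_skip = list(LOG)

# A call that is actually evaluated returns its input and leaves a log entry.
first_name = trace("John", "Name")
```

If a trace message you expect never shows up, the surrounding expression may simply not have needed that operand.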

You can see what an expression returns by putting the whole expression inside a trace(), as shown in the previous examples. You can also use it inside a path expression. For example, this returns a sequence of names:

     /company/employee/string(@firstname)

To see which employees are taken into consideration by this expression, just add a trace() step within the path expression, like this:

      /company/employee/trace(., 'Employee')/string(@firstname)