Using Pattern Matching

A major addition to XPath in version 2.0 is the set of pattern-matching functions. These functions let you use regular expression syntax:

fn:matches returns true if a string is matched by a supplied regular expression.
fn:replace replaces every occurrence of a match to a regular expression with a replacement string.
fn:tokenize splits a string into a sequence of substrings, using matches to a regular expression as separators.

All of these functions use regular expressions. The regular expressions used in XPath 2.0 are the same as those used in XML Schema, with some additions.

Understanding Regular Expressions

Now that regular expressions are supported by XML Schema, many books on XML discuss how to create them. Nonetheless, we'll give a brief introduction to the topic here for those not familiar with the subject.

MORE ON REGULAR EXPRESSIONS

You can find the XML Schema support for regular expressions described at http://www.w3.org/TR/xmlschema-2/. This support is a subset of the regular expressions used in the Perl programming language, and you can find the complete documentation for Perl regular expressions at the Comprehensive Perl Archive Network (CPAN) Web site: www.cpan.org/doc/manual/html/pod/perlre.html.

Regular expressions are made up of patterns, and these patterns can be used to match text in your data. Each character matches itself by default, so if you have the pattern

 Hello

you can match the text "Hello". You can also use regular expression special characters and assertions, which start with a backslash, \, in your patterns. For example, to match the beginning or end of a word, called a word boundary, you use \b. That means that the regular expression pattern shown here will match the word "Hello":

 \bHello\b
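To see what the \b assertion adds, here is a minimal sketch that runs the pattern outside XPath. It uses Python's re module as a stand-in engine, which is an assumption made purely for illustration (Python follows Perl's syntax closely), and has_word is a hypothetical helper name:

```python
import re

def has_word(text, word):
    """Return True if `word` occurs in `text` as a whole word."""
    # re.escape keeps any special characters in `word` literal.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text) is not None

print(has_word("Hello there", "Hello"))   # True: "Hello" stands alone
print(has_word("Hellos there", "Hello"))  # False: no boundary after "Hello"
```

Without the two \b assertions, the plain pattern Hello would also be found inside "Hellos".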

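Here are the special characters, called metacharacters, that Perl-style regular expressions define. XML Schema and XPath 2.0 adopt only part of this list (for example, \d, \s, and \w, but not \Q, \U, or \x1A), so check your processor before relying on the others: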

\077 Octal char
\d Match a digit character
\D Match a non-digit character
\E End case modification
\e Escape
\f Form feed
\l Lowercase next char
\L Lowercase until \E found
\n Newline
\Q Quote (that is, disable) pattern metacharacters until \E found
\r Return
\S Match a non-whitespace character
\s Match a whitespace character
\t Tab
\u Uppercase next char
\U Uppercase until \E found
\w Match a word character (alphanumeric characters and "_")
\W Match a non-word character
\x1A Hex char

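And here are the assertions that Perl defines, each of which asserts that a particular condition is true. Of these, the XPath 2.0 Recommendation itself defines only ^ and $; the others, including \b, are available only if your processor hands patterns to an engine that understands them: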

^ Match the beginning of the line
$ Match the end of the line (or before newline at the end)
\b Match a word boundary
\B Match a non-word boundary
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string

For example, if you want to match a three-digit number, you can use the pattern \d\d\d. To match U.S. Social Security numbers, therefore, you can use this pattern:

 \d\d\d-\d\d-\d\d\d\d
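As a quick check of this pattern, the following sketch tests a few strings against it. Python's re module is used here only as a convenient stand-in engine (an assumption for illustration), and looks_like_ssn is a hypothetical helper name:

```python
import re

# The pattern from the text: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\d\d\d-\d\d-\d\d\d\d")

def looks_like_ssn(text):
    """Return True if the whole string has the shape ddd-dd-dddd."""
    # fullmatch insists that the entire string fits the pattern.
    return SSN_PATTERN.fullmatch(text) is not None

print(looks_like_ssn("123-45-6789"))  # True
print(looks_like_ssn("123-456-789"))  # False: digits grouped differently
print(looks_like_ssn("12a-45-6789"))  # False: "a" is not a digit
```

Note that the pattern checks only the shape of the text, not whether the number was ever issued.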

Here's another example, this time using character classes. The character class [abc] matches only the characters "a", "b", or "c". You can use a dash in a character class as a shortcut to indicate a range, as in [A-Z], which matches any uppercase letter. The regular expression below matches any word made up of lower- or uppercase letters. It combines the character class [A-Za-z], which matches any single lower- or uppercase letter, with a plus sign, +, which means "one or more of" in regular expressions:

 \b([A-Za-z]+)\b
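The sketch below applies this pattern to some sample text to show which words it picks out. It again assumes Python's re module as a stand-in engine, and letter_words is a hypothetical helper name:

```python
import re

# One or more letters, bounded on both sides by a word boundary.
WORD = re.compile(r"\b([A-Za-z]+)\b")

def letter_words(text):
    """Return every word in `text` that consists only of letters."""
    # With one parenthesized group, findall returns the group's text.
    return WORD.findall(text)

print(letter_words("Now is the time, 2 go"))  # ['Now', 'is', 'the', 'time', 'go']
print(letter_words("abc123 def"))             # ['def']
```

The second call skips "abc123" because there is no word boundary between "c" and "1".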

The + sign is a regular expression quantifier. You can use these quantifiers in regular expressions:

* Match zero or more times
+ Match one or more times
? Match one or zero times
{n} Match n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
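To make the quantifiers in this list concrete, the following sketch tries each one against short strings. It assumes Python's re module as a stand-in engine, and whole is a hypothetical helper that requires the entire string to match:

```python
import re

def whole(pattern, text):
    """Return True if `pattern` matches all of `text`."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"ab*", "a"))          # True:  * allows zero b's
print(whole(r"ab+", "a"))          # False: + needs at least one b
print(whole(r"ab?", "abb"))        # False: ? allows at most one b
print(whole(r"\d{3}", "123"))      # True:  exactly three digits
print(whole(r"\d{2,}", "12345"))   # True:  two or more digits
print(whole(r"\d{2,4}", "12345"))  # False: five digits is more than four
```

Using {n}, the Social Security number pattern shown earlier can be shortened to \d{3}-\d{2}-\d{4}.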

This regular expression matches any word (even if it includes digits):

 \b\w+\b

You can also group subexpressions in regular expressions using parentheses. For example, see if you can figure out how this regular expression works; it matches most common forms of email address:

 \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
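One way to work out what this expression does is to run it against a few candidates. The sketch below assumes Python's re module as a stand-in engine; is_email is a hypothetical helper name, and the sample addresses are made up for illustration:

```python
import re

# local part: word characters, optionally joined by -, + or .
# domain:     word characters joined by - or ., with at least one dot
EMAIL = re.compile(r"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*")

def is_email(text):
    """Return True if the whole string fits the email pattern."""
    return EMAIL.fullmatch(text) is not None

print(is_email("steve@example.com"))                 # True
print(is_email("first.last+tag@mail.example.co.uk")) # True
print(is_email("user@localhost"))                    # False: no dot after @
print(is_email("no-at-sign.example.com"))            # False: no @ at all
```

Each parenthesized group followed by * means "zero or more further segments", which is how the pattern accepts names and domains with any number of parts.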

The matches to the parenthesized subexpressions in a regular expression are preserved. The matched text can be accessed after the regular expression has been evaluated if you use the fn:replace function, as we'll see when we cover that function.

XPath 2.0 Versus XML Schema Difference

The regular expressions in XPath 2.0 are actually more powerful than those in XML Schema. XML Schema uses regular expressions only for validity checking, which means it doesn't support some powerful text-handling techniques.

In particular, two modes are defined in XPath 2.0 regular expressions: string mode and multiline mode (just as in Perl regular expressions). You specify which mode you want with flags, described later in this section.

In addition, two special characters, ^ and $, are supported in XPath 2.0 regular expressions. As in standard regular expressions, in string mode, the character ^ matches the start of the string, and $ matches the end of the string. In multiline mode, ^ matches the start of any line (lines are separated by newline characters, \n, which is #x0A in Unicode), and $ matches the end of any line.

The metacharacter . is not affected by multiline mode. In the XPath 2.0 Recommendation, . matches any character except a newline unless the s flag (dot-all mode) is set, in which case it matches any character at all. For example, the regular expression ^J.*d$ will match this text:

 James Bond

Minimal matching is also supported in XPath 2.0 regular expressions. For example, suppose you have the text, "That is some book, isn't it?" and you want to match the regular expression .*is to this string. In the default case, this expression will match as much as it can, so instead of matching "That is", this regular expression will match "That is some book, is". To indicate that you want to match as little as possible, you can use an additional question mark, so the regular expression .*?is would match "That is". Here is how minimal matching works in XPath 2.0 regular expressions:

X?? matches X, once or not at all.
X*? matches X, zero or more times.
X+? matches X, one or more times.
X{n}? matches X, exactly n times.
X{n,}? matches X, at least n times.
X{n,m}? matches X, at least n times, and not more than m times total.
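The difference between the default and minimal forms can be checked with the sentence used above. This sketch assumes Python's re module as a stand-in engine, since it supports the same ? suffix on quantifiers:

```python
import re

TEXT = "That is some book, isn't it?"

# Default (greedy): .* takes as much as it can before the final "is".
greedy = re.search(r".*is", TEXT).group()

# Minimal: .*? takes as little as it can before the first "is".
minimal = re.search(r".*?is", TEXT).group()

print(greedy)   # That is some book, is
print(minimal)  # That is
```

Both patterns start matching at the same place; only the amount of text the quantifier consumes changes.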

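The three regular expression functions support an optional parameter, $flags, that you use to set options. This parameter is a string, and individual letters are used to set the corresponding options. The presence of a letter in the string indicates that the option is on; if it's not present, the option is off. Letters may appear in any order and may be repeated. Here are the options:

m Makes a match operate in multiline mode. Otherwise, the match operates in string mode (the default).
i Makes a match operate in case-insensitive mode. Otherwise, the match operates in case-sensitive mode (the default).
s Makes a match operate in dot-all mode, in which . also matches a newline. Otherwise, . matches any character except a newline (the default).
x Removes whitespace characters from the pattern before matching, so that you can space out a complex pattern. Otherwise, whitespace in the pattern is matched literally (the default).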
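The effect of the m and i letters can be sketched with a small wrapper that accepts a flags string in the same style. This is only an illustration under stated assumptions: Python's re module stands in for the XPath engine, and find_all and FLAG_BITS are hypothetical names that map the two letters to Python's equivalent constants:

```python
import re

# Map the m and i flag letters to Python's corresponding constants.
FLAG_BITS = {"m": re.MULTILINE, "i": re.IGNORECASE}

def find_all(pattern, text, flags=""):
    """Return all matches of `pattern` in `text` under the given flag letters."""
    bits = 0
    for letter in flags:  # order and repetition of letters do not matter
        bits |= FLAG_BITS[letter]
    return re.findall(pattern, text, bits)

TWO_LINES = "James Bond\nJill Ford"

print(find_all(r"^J.*d$", TWO_LINES))       # []: ^ and $ anchor the whole string
print(find_all(r"^J.*d$", TWO_LINES, "m"))  # ['James Bond', 'Jill Ford']
print(find_all(r"bananas", "Bananas"))      # []: case-sensitive by default
print(find_all(r"bananas", "Bananas", "i")) # ['Bananas']
```

In string mode the two-line text fails to match because the end of the string is not reached after "Bond"; in multiline mode each line is tested on its own.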

The `fn:matches` Function

The fn:matches function returns true if a regular expression matches the text in a string, and false otherwise. Here are the two ways to use this function:

 fn:matches($srcval as xs:string?, $pattern as xs:string) as xs:boolean?
 fn:matches($srcval as xs:string?, $pattern as xs:string,
     $flags as xs:string) as xs:boolean?

Here's an example; we'll check whether we can find the word "bananas" in a string like this: fn:matches('Want some bananas today?', '\bbananas\b'). You can see how this works in ch10_15.xsl in Listing 10.15.

Listing 10.15 An XSLT Example Using the XPath Function `fn:matches` ( `ch10_15.xsl` )

 <xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema">

     <xsl:template match="/">
         <!-- Report whether "bananas" appears as a whole word -->
         <xsl:value-of select="if (matches('Want some bananas today?',
             '\bbananas\b'))
             then 'Yes, we have some bananas.'
             else 'No, we have no bananas.'"/>
     </xsl:template>

 </xsl:stylesheet>

And here's the result, where we matched the word "bananas":

 <?xml version="1.0" encoding="UTF-8"?> Yes, we have some bananas.

Using this function, then, you can perform regular expression matching.

The `fn:replace` Function

This function replaces matched text with other text. Here are the two ways to use it:

 fn:replace($srcval as xs:string?, $pattern as xs:string,
     $replacement as xs:string) as xs:string?
 fn:replace($srcval as xs:string?, $pattern as xs:string,
     $replacement as xs:string, $flags as xs:string) as xs:string?

In this case, the function replaces matches to $pattern in $srcval with $replacement.

For example, say that you wanted to replace "bananas" in our text "Want some bananas today?" with "oranges". You can do that with the fn:replace function, as you see in ch10_16.xsl (Listing 10.16).

Listing 10.16 An XSLT Example Using the XPath Function `fn:replace` ( `ch10_16.xsl` )

 <xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema">

     <xsl:template match="/">
         <!-- Swap the whole word "bananas" for "oranges" -->
         <xsl:value-of select="replace('Want some bananas today?',
             '\bbananas\b', 'oranges')"/>
     </xsl:template>

 </xsl:stylesheet>

And here is the result:

 <?xml version="1.0" encoding="UTF-8"?> Want some oranges today?

If you enclose subexpressions in parentheses, you can refer to matches to those subexpressions as $1, $2, and so on up to $9 in the $replacement string. For example, say that you want to extract the two words from the string "Bananas, Apples". To do that, you can use the regular expression (\w+), (\w+), and refer to the text that matched the first \w+ subexpression as $1 in the replacement text, and the text that matched the second \w+ subexpression as $2 in the replacement text. You can see how this works in ch10_17.xsl (Listing 10.17).

Listing 10.17 Matching Subexpressions with `fn:replace` ( `ch10_17.xsl` )

 <xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema">

     <xsl:template match="/">
         <!-- $1 and $2 stand for the text matched by the two (\w+) groups -->
         <xsl:value-of select="replace('Bananas, Apples',
             '(\w+), (\w+)', 'Item 1:$1 Item 2:$2')"/>
     </xsl:template>

 </xsl:stylesheet>

And here are the results you get from Saxon:

 <?xml version="1.0" encoding="UTF-8"?> Item 1:Bananas Item 2:Apples

As you can see, using the fn:replace function, you can perform replacements using regular expressions.
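If you want to experiment with subexpression references without an XSLT 2.0 processor at hand, the same idea can be sketched in Python. This is an analogue, not XPath: Python's re.sub writes group references as \1 and \2 where fn:replace uses $1 and $2, and label_items is a hypothetical helper name:

```python
import re

def label_items(text):
    """Rewrite 'A, B' as 'Item 1:A Item 2:B' using the two captured words."""
    # \1 and \2 here play the role of $1 and $2 in fn:replace.
    return re.sub(r"(\w+), (\w+)", r"Item 1:\1 Item 2:\2", text)

print(label_items("Bananas, Apples"))  # Item 1:Bananas Item 2:Apples
```

Text outside the match is left untouched, so only the part of the string that fits the pattern is rewritten.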

The `fn:tokenize` Function

This function is designed to break up text into smaller parts, or tokens. Specifically, it breaks the string you pass it into a sequence of strings, using substrings that match a given pattern as separators. Here's how you use this function:

 fn:tokenize($srcval as xs:string?, $pattern as xs:string) as xs:string*
 fn:tokenize($srcval as xs:string?, $pattern as xs:string,
     $flags as xs:string) as xs:string*

You use this function to split up the text in $srcval into pieces separated by text matching the pattern in $pattern. For example, say that you want to break up the text "Now is the time" into the words "Now", "is", "the", "time". You can do that if you instruct the fn:tokenize function to break on runs of whitespace characters, \s+.

You can see how this works in ch10_18.xsl (Listing 10.18).

Listing 10.18 An XSLT Example Using the XPath Function `fn:tokenize` ( `ch10_18.xsl` )

 <xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema">

     <xsl:template match="/">
         <!-- Split on whitespace, then join the tokens with ", " -->
         <xsl:value-of select="tokenize('Now is the time', '\s+')"
             separator=", "/>
     </xsl:template>

 </xsl:stylesheet>

And here are the results you get from Saxon:

 <?xml version="1.0" encoding="UTF-8"?> Now, is, the, time
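The split-then-join steps in this listing can also be tried in Python as a rough analogue. In this sketch re.split stands in for fn:tokenize and str.join for the separator attribute; both substitutions are assumptions made for illustration, not part of XPath:

```python
import re

def tokenize(text, pattern):
    """Split `text` wherever `pattern` matches, returning the pieces."""
    return re.split(pattern, text)

tokens = tokenize("Now is the time", r"\s+")
print(tokens)             # ['Now', 'is', 'the', 'time']
print(", ".join(tokens))  # Now, is, the, time
```

Because the separator pattern is \s+ rather than \s, several spaces in a row still count as a single break between tokens.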
