A big addition to XPath in version 2.0 is the set of pattern-matching functions. These functions let you use regular expression syntax:

    fn:matches
    fn:replace
    fn:tokenize

All of these functions use regular expressions. The regular expressions used in XPath 2.0 are the same as those used in XML Schema, with some additions.

Understanding Regular Expressions

Now that regular expressions are supported by XML Schema, books on XML discuss how to create them. Nonetheless, we'll give a brief introduction to the topic here for those not familiar with the subject.

MORE ON REGULAR EXPRESSIONS

You can find the XML Schema support for regular expressions discussed in http://www.w3.org/TR/xmlschema-2/. This support is a subset of the regular expressions used in the Perl programming language, and you can find the complete documentation for Perl regular expressions at the Comprehensive Perl Archive Network (CPAN) Web site: www.cpan.org/doc/manual/html/pod/perlre.html.

Regular expressions are made up of patterns, and these patterns can be used to match text in your data. Each character matches itself by default, so if you have the pattern Hello you can match the text "Hello".

You can also use regular expression special characters and assertions, which start with a backslash, \, in your patterns. For example, to match the beginning or ending of a word, called a word boundary, you use \b. That means that the regular expression pattern shown here will match the word "Hello":

    \bHello\b

Be aware that \b comes from Perl-style regular expressions; the XML Schema regular expression grammar does not define it, so whether it is accepted depends on your XPath 2.0 processor.

Here are the special characters, called metacharacters, that you can use in regular expressions:
And here are the available assertions, which assert that a particular condition is true:
For example, if you want to match a three-digit number, you can use the pattern \d\d\d. To match U.S. Social Security numbers, therefore, you can use this pattern:

    \d\d\d-\d\d-\d\d\d\d

Here's another example, this time using character classes. The character class [abc] matches only the characters "a", "b", or "c". You can use a dash in a character class as a shortcut to indicate a range, as in the character class that matches any uppercase letter, [A-Z].

The following regular expression matches any word made up of lower- or uppercase letters. It uses a character class, [A-Za-z], which matches any single lower- or uppercase letter, and a plus sign, +, which means "one or more of" in regular expressions:

    \b([A-Za-z]+)\b

The + sign is a regular expression quantifier. You can use these quantifiers in regular expressions:

    *        zero or more matches
    +        one or more matches
    ?        zero or one match
    {n}      exactly n matches
    {n,}     at least n matches
    {n,m}    at least n but not more than m matches

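As a minimal sketch of the patterns above, the following stylesheet tests two of them with the fn:matches function, which is covered later in this section. The sample strings are made up for illustration. The stylesheet assumes an XSLT 2.0 processor, and any well-formed XML document can serve as input because the template only matches the root.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Three digits, a dash, two digits, a dash, four digits.
         The sample string is hypothetical. -->
    <xsl:value-of
        select="matches('ID 123-45-6789 on file', '\d\d\d-\d\d-\d\d\d\d')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- [A-Za-z]+ needs at least one letter, and this string has none. -->
    <xsl:value-of select="matches('12345', '[A-Za-z]+')"/>
  </xsl:template>
</xsl:stylesheet>
```

The first test should produce true, because the pattern is found inside the string, and the second should produce false, because a string of digits contains no letters.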
This regular expression matches any word (even if it includes digits):

    \b\w+\b

You can also group subexpressions in regular expressions using parentheses. For example, see if you can figure out how this regular expression works; it matches email addresses of the usual form:

    \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*

The matches to the parenthesized subexpressions in a regular expression are preserved. The matched text can be accessed after the regular expression has been evaluated if you use the fn:replace function, as we'll see when we cover that function.

XPath 2.0 Versus XML Schema Difference

The regular expressions in XPath 2.0 are actually more powerful than those in XML Schema. XML Schema uses regular expressions only for validity checking, which means it doesn't support some powerful text-handling techniques.

In particular, two modes are defined in XPath 2.0 regular expressions: string mode and multiline mode (just as in Perl regular expressions). You specify which mode you want with flags, coming up in a page or two.

In addition, two special characters, ^ and $, are also supported in XPath 2.0 regular expressions. As in standard regular expressions, in string mode, the character ^ matches the start of the string, and $ matches the end of the string. In multiline mode, ^ matches the start of any line (lines are separated by newline characters, \n, which is #x0A in Unicode), and $ matches the end of any line.

By default, the metacharacter . matches any character except a newline character; the s flag makes it match newline characters as well. For example, the regular expression ^J.*d$ will match this text:

    James Bond

Minimal matching is also supported in XPath 2.0 regular expressions. For example, suppose you have the text, "That is some book, isn't it?" and you want to match the regular expression .*is to this string.
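In the default case, this expression will match as much as it can, so instead of matching "That is", this regular expression will match "That is some book, is". To indicate that you want to match as little as possible, you can add a question mark after the quantifier, so the regular expression .*?is would match "That is". Here is how minimal matching works in XPath 2.0 regular expressions:

    *?        zero or more matches, as few as possible
    +?        one or more matches, as few as possible
    ??        zero or one match, preferring zero
    {n}?      exactly n matches
    {n,}?     at least n matches, as few as possible
    {n,m}?    at least n but not more than m matches, as few as possible
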
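The difference between default and minimal matching can be sketched with the fn:replace function, which is covered later in this section. This is an illustration rather than one of the book's listings: each pattern captures the text up to "is" in parentheses, consumes the rest of the string, and replaces the whole string with the captured part ($1). The variable name $text is an assumption made for this example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Doubling the apostrophe escapes it inside an XPath string literal. -->
    <xsl:variable name="text" select="'That is some book, isn''t it?'"/>
    <!-- Default matching: .* takes as much text as it can before "is". -->
    <xsl:value-of select="replace($text, '^(.*is).*$', '$1')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Minimal matching: .*? takes as little text as it can before "is". -->
    <xsl:value-of select="replace($text, '^(.*?is).*$', '$1')"/>
  </xsl:template>
</xsl:stylesheet>
```

The first line of output should be "That is some book, is" and the second should be "That is", matching the behavior described above.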
The three regular expression functions support an optional parameter, $flags, that you use to set options. This parameter is a string, and individual letters are used to set the corresponding options. The presence of a letter in the string indicates that the option is on; if it's not present, the option is off. Letters may appear in any order and may be repeated. Here are the current options:

    s    dot-all mode: . matches any character, including a newline
    m    multiline mode: ^ and $ match at the start and end of each line
    i    case-insensitive matching
    x    whitespace in the pattern is ignored

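To see what the $flags parameter changes, here is a small sketch using fn:matches (covered next) with and without the i and m flags. The sample strings are invented for this example; &#10; is an XML character reference for the newline character.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- No flags: matching is case sensitive, so this test fails. -->
    <xsl:value-of select="matches('HELLO', 'hello')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- i flag: case is ignored, so this test succeeds. -->
    <xsl:value-of select="matches('HELLO', 'hello', 'i')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- No flags: ^ and $ anchor to the whole two-line string. -->
    <xsl:value-of select="matches('one&#10;two', '^two$')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- m flag: ^ and $ anchor to each line, so the second line matches. -->
    <xsl:value-of select="matches('one&#10;two', '^two$', 'm')"/>
  </xsl:template>
</xsl:stylesheet>
```

The four tests should produce false, true, false, and true, in that order. Flags can also be combined in one string, such as 'im'.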
The fn:matches Function

The fn:matches function returns true if a regular expression matches the text in a string, and false otherwise. Here are the two ways to use this function:

    fn:matches($srcval as xs:string?, $pattern as xs:string) as xs:boolean?
    fn:matches($srcval as xs:string?, $pattern as xs:string, $flags as xs:string) as xs:boolean?

Here's an example; we'll check whether we can find the word "bananas" in a string like this: fn:matches('Want some bananas today?', '\bbananas\b'). You can see how this works in ch10_15.xsl in Listing 10.15.

Listing 10.15 An XSLT Example Using the XPath Function fn:matches (ch10_15.xsl)

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xsl:template match="/">
            <xsl:value-of select="if(matches('Want some bananas today?', '\bbananas\b'))
                then 'Yes, we have some bananas.'
                else 'No, we have no bananas.'"/>
        </xsl:template>
    </xsl:stylesheet>

And here's the result, where we matched the word "bananas":

    <?xml version="1.0" encoding="UTF-8"?>
    Yes, we have some bananas.

Using this function, then, you can perform regular expression matching.

The fn:replace Function

This function replaces matched text with other text. Here are the two ways to use it:

    fn:replace($srcval as xs:string?, $pattern as xs:string, $replacement as xs:string) as xs:string?
    fn:replace($srcval as xs:string?, $pattern as xs:string, $replacement as xs:string, $flags as xs:string) as xs:string?

In this case, the function replaces matches to $pattern in $srcval with $replacement. For example, say that you wanted to replace "bananas" in our text "Want some bananas today?" with "oranges". You can do that with the fn:replace function, as you see in ch10_16.xsl (Listing 10.16).
Listing 10.16 An XSLT Example Using the XPath Function fn:replace (ch10_16.xsl)

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xsl:template match="/">
            <xsl:value-of select="replace('Want some bananas today?', '\bbananas\b', 'oranges')"/>
        </xsl:template>
    </xsl:stylesheet>

And here is the result:

    <?xml version="1.0" encoding="UTF-8"?>
    Want some oranges today?

If you enclose subexpressions in parentheses, you can refer to matches to those subexpressions as $1, $2, and so on up to $9 in the $replacement string. For example, say that you want to extract the two words from the string "Bananas, Apples". To do that, you can use the regular expression (\w+), (\w+), and refer to the text that matched the first \w+ subexpression as $1 in the replacement text, and the text that matched the second \w+ subexpression as $2 in the replacement text. You can see how this works in ch10_17.xsl (Listing 10.17).

Listing 10.17 Matching Subexpressions with fn:replace (ch10_17.xsl)

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xsl:template match="/">
            <xsl:value-of select="replace('Bananas, Apples', '(\w+), (\w+)', 'Item 1:$1 Item 2:$2')"/>
        </xsl:template>
    </xsl:stylesheet>

And here are the results you get from Saxon:

    <?xml version="1.0" encoding="UTF-8"?>
    Item 1:Bananas Item 2:Apples

As you can see, using the fn:replace function, you can perform replacements using regular expressions.

The fn:tokenize Function

This function is designed to break up text into smaller parts, or tokens. Specifically, it breaks the string you pass it into a sequence of strings, using substrings that match a given pattern as separators.
Here's how you use this function:

    fn:tokenize($srcval as xs:string?, $pattern as xs:string) as xs:string*
    fn:tokenize($srcval as xs:string?, $pattern as xs:string, $flags as xs:string) as xs:string*

You use this function to split up the text in $srcval into pieces separated by text matching the pattern in $pattern. For example, say that you want to break up the text "Now is the time" into the words "Now", "is", "the", "time". You can do that if you instruct the fn:tokenize function to break on runs of one or more whitespace characters, \s+. You can see how this works in ch10_18.xsl (Listing 10.18).

Listing 10.18 An XSLT Example Using the XPath Function fn:tokenize (ch10_18.xsl)

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xsl:template match="/">
            <xsl:value-of select="tokenize('Now is the time', '\s+')" separator=", "/>
        </xsl:template>
    </xsl:stylesheet>

And here are the results you get from Saxon:

    <?xml version="1.0" encoding="UTF-8"?>
    Now, is, the, time
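The separator pattern does not have to be whitespace. As a further sketch, not one of the book's listings, the following stylesheet splits a hypothetical comma-separated string on a comma followed by optional whitespace, counts the tokens, and joins them again with a different separator.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- ,\s* matches a comma and any whitespace that follows it.
         The input string is made up for this example. -->
    <xsl:variable name="colors"
        select="tokenize('red, green,blue', ',\s*')"/>
    <!-- count() reports how many tokens the split produced. -->
    <xsl:value-of select="count($colors)"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Rejoin the tokens, with | between them. -->
    <xsl:value-of select="$colors" separator="|"/>
  </xsl:template>
</xsl:stylesheet>
```

This should produce 3 on the first line and red|green|blue on the second. Because the separator itself is matched by the pattern, the inconsistent spacing after the commas in the input does not end up in the tokens.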