Recipe 2.11. Exploiting Regular Expressions | XSLT Cookbook: Solutions and Examples for XML and XSLT Developers, 2nd Edition

Recipe 2.11. Exploiting Regular Expressions

Problem

You have heard that regular expressions (regex) are a powerful new tool in XSLT 2.0, but you are unsure how to harness this power.

Solution

Matching text patterns

The most basic application of regex is matching text patterns. You can use matches( ) in a template pattern to extend XSLT's matching capabilities into the text of a node:

<!-- A date in the form May 3, 1964 -->
<xsl:template match="birthday[matches(.,'^[A-Z][a-z]+\s[0-9]+,\s[0-9]+$')]">
   <!-- ... -->
</xsl:template>

<!-- A date in the form 1964-05-03 -->
<xsl:template match="birthday[matches(.,'^[0-9]+-[0-9]+-[0-9]+$')]">
   <!-- ... -->
</xsl:template>

<!-- A date in the form 3 May 1964 -->
<xsl:template match="birthday[matches(.,'^[0-9]+\s[A-Z][a-z]+\s[0-9]+$')]">
   <!-- ... -->
</xsl:template>

Alternatively, you can use matches( ) in an xsl:if or xsl:choose instruction:

<xsl:choose>
   <xsl:when test="matches($date,'^[A-Z][a-z]+\s[0-9]+,\s[0-9]+$')">
   </xsl:when>
   <xsl:when test="matches($date,'^[0-9]+-[0-9]+-[0-9]+$')">
   </xsl:when>
   <xsl:when test="matches($date,'^[0-9]+\s[A-Z][a-z]+\s[0-9]+$')">
   </xsl:when>
</xsl:choose>
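As a minimal sketch of how these tests might sit in a complete stylesheet, the following reports which of the three layouts each birthday element uses. The output labels and the catch-all xsl:otherwise branch are assumptions added for illustration. Note that matches( ) succeeds when the pattern matches any substring, so the ^ and $ anchors are what force the whole value to conform.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Name the layout used by each birthday element -->
  <xsl:template match="birthday">
    <!-- normalize-space trims stray whitespace that would defeat ^ and $ -->
    <xsl:variable name="date" select="normalize-space(.)"/>
    <xsl:value-of select="$date"/>
    <xsl:text>: </xsl:text>
    <xsl:choose>
      <xsl:when test="matches($date, '^[A-Z][a-z]+\s[0-9]+,\s[0-9]+$')">month day, year</xsl:when>
      <xsl:when test="matches($date, '^[0-9]+-[0-9]+-[0-9]+$')">ISO</xsl:when>
      <xsl:when test="matches($date, '^[0-9]+\s[A-Z][a-z]+\s[0-9]+$')">day month year</xsl:when>
      <!-- Hypothetical fallback label for values in none of the three layouts -->
      <xsl:otherwise>unrecognized</xsl:otherwise>
    </xsl:choose>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Suppress all other text in the input -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

Given an input containing `<birthday>May 3, 1964</birthday>`, this should emit a line pairing the date with the label for the first layout.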

Tokenizing stylized text

Often one uses regex to split a string into tokens:

(: Break an ISO date (YYYY-MM-DD) into a sequence consisting of year, month, day :)
tokenize($date, '-')

(: Break an ISO dateTime (YYYY-MM-DDThh:mm:ss) into a sequence consisting of year, month, day, hour, min, sec :)
tokenize($date, '-|T|:')

(: Break a sentence into words :)
tokenize($text, '\W+')
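One detail worth knowing when splitting sentences: if the input begins or ends with a separator (a trailing period, for example), tokenize( ) returns a zero-length string at that position. The sketch below filters those out with a predicate. The sentence, words, and word element names are hypothetical, chosen only for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Wrap each word of a (hypothetical) sentence element in a word element -->
  <xsl:template match="sentence">
    <words>
      <!-- The predicate drops the empty token produced when the text
           starts or ends with non-word characters -->
      <xsl:for-each select="tokenize(., '\W+')[. ne '']">
        <word><xsl:value-of select="."/></word>
      </xsl:for-each>
    </words>
  </xsl:template>

</xsl:stylesheet>
```

Without the predicate, an input such as `Hello, world.` would yield a third, empty word element after the two real words.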

Replacing and augmenting text

There are two ways to use the XPath replace( ) function.

The first is simply to replace patterns in a string with other text. Sometimes you will replace the pattern with the empty string ('') because you want to strip the text that matches the pattern:

(: Replace the day of the month in an ISO date with 01 :)
replace($date,'\d\d$','01')

(: Strip away all but the year in an ISO date :)
replace($date,'-\d\d-\d\d$','')

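The second way to use replace( ) is to insert text into the string where a pattern matches while leaving the matched part intact. It may seem counterintuitive that you can use a function called replace to perform an insertion; however, this is exactly the effect you can achieve by using back references ($1, $2, and so on) in the replacement string. Each back reference stands for the text captured by the corresponding parenthesized group, so every character the pattern consumes must be captured and written back if it is to survive.

(: Insert a space after punctuation characters that are not followed by a space.
   The second group captures the following character so that it is written back rather than lost. :)
replace($text, '([,;:])(\S)', '$1 $2')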
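Back references can also reorder the captured pieces. The following sketch, which assumes birthday elements holding ISO dates, rewrites YYYY-MM-DD as MM/DD/YYYY. The target layout is an illustrative choice rather than anything the recipe prescribes. Values that do not match the pattern pass through unchanged, because replace( ) returns its input untouched when nothing matches.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Rewrite an ISO date (YYYY-MM-DD) as MM/DD/YYYY -->
  <xsl:template match="birthday">
    <birthday>
      <!-- $1 = year, $2 = month, $3 = day; the braces are safe here
           because select is not an attribute value template -->
      <xsl:value-of
          select="replace(normalize-space(.),
                          '^(\d{4})-(\d\d)-(\d\d)$',
                          '$2/$3/$1')"/>
    </birthday>
  </xsl:template>

</xsl:stylesheet>
```

Because `$` introduces a back reference in the replacement string, a literal dollar sign there must be written as `\$`.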

Parsing text to convert to XML

More powerful than either tokenize( ) or replace( ) is the new XSLT 2.0 xsl:analyze-string instruction. This instruction allows one to go beyond textual substitution and build up XML content from text. See Chapter 6 for recipes using xsl:analyze-string.
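To give a flavor of the instruction, here is a minimal sketch that turns an ISO date held in a birthday element into child elements. The year, month, and day element names are assumptions made for illustration, and the non-matching branch simply copies through any text that does not fit the pattern.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Build structured content from the text of a birthday element -->
  <xsl:template match="birthday">
    <birthday>
      <!-- The regex attribute is an attribute value template, so any
           curly braces in a pattern would have to be doubled; this
           pattern avoids them by using + instead of counted quantifiers -->
      <xsl:analyze-string select="normalize-space(.)"
                          regex="^(\d+)-(\d+)-(\d+)$">
        <xsl:matching-substring>
          <year><xsl:value-of select="regex-group(1)"/></year>
          <month><xsl:value-of select="regex-group(2)"/></month>
          <day><xsl:value-of select="regex-group(3)"/></day>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Leave text in any other layout as it was -->
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </birthday>
  </xsl:template>

</xsl:stylesheet>
```

Inside xsl:matching-substring, regex-group( ) plays the role that $1, $2, and so on play in the replacement string of replace( ).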

Discussion

Regular expressions (or simply regex) are such a rich and powerful tool for text processing that one could write a whole book dedicated to them. In fact, someone did. Jeffrey E. F. Friedl's book Mastering Regular Expressions (O'Reilly) is a classic on the topic, and I highly recommend it.

Regular expressions derive their power from pattern matching. Interestingly, pattern matching is also at the heart of XSLT's power. Where XSLT is ideally suited to matching patterns in the structure of an XML document, regular expressions are optimized for matching patterns in ad hoc text. However, the pattern language of regular expressions is more intricate than the XPath expressions used in XSLT. This is unavoidable simply because ad hoc text lacks the uniform tree structure of XML.

The keys to mastering regular expressions are practice and judicious borrowing from example expressions designed by others. Besides Friedl's book, one can find sample regex patterns in many of the books on Perl and online at RegExLib.com (http://regexlib.com/).