Recipe 6.7. Processing Unstructured Text with Regular Expressions | XSLT Cookbook: Solutions and Examples for XML and XSLT Developers, 2nd Edition

Recipe 6.7. Processing Unstructured Text with Regular Expressions

Problem

You need to transform XML documents that contain chunks of unstructured text that must be marked up into a proper document.

Solution

There are three XPath 2.0 functions for working with regular expressions: matches( ), replace( ), and tokenize( ). We covered these in Chapter 1. There is also a new XSLT instruction, xsl:analyze-string, which allows you to do even more advanced text processing.
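As a quick refresher, the three functions can be sketched in one small stylesheet. This is only an illustration: the $text parameter and the element names (demo, has-date, normalized, date) are made up for the example, and the stylesheet can be run against any XML input because it ignores the source document.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical sample value; any string would do. -->
  <xsl:param name="text" select="'2006-12-25, 2007/01/15'"/>

  <xsl:template match="/">
    <demo>
      <!-- matches( ): true if the regex matches anywhere in the string -->
      <has-date>
        <xsl:value-of select="matches($text, '\d{4}-\d\d-\d\d')"/>
      </has-date>

      <!-- replace( ): $1, $2, $3 refer to the captured groups -->
      <normalized>
        <xsl:value-of
            select="replace($text, '(\d{4})/(\d\d)/(\d\d)', '$1-$2-$3')"/>
      </normalized>

      <!-- tokenize( ): split the string wherever the regex matches -->
      <xsl:for-each select="tokenize($text, ',\s*')">
        <date><xsl:value-of select="."/></date>
      </xsl:for-each>
    </demo>
  </xsl:template>

</xsl:stylesheet>
```

Note that the regexes here appear in select attributes, which are plain XPath expressions. In the regex attribute of xsl:analyze-string, which is an attribute value template, curly braces would have to be doubled.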

The xsl:analyze-string instruction takes a select attribute for specifying the string to be processed, a regex attribute for specifying the regular expression to apply to the string, and an optional flags attribute to modify the action of the regex engine. The standard flags are:

i Case-insensitive mode.
m Multi-line mode makes the metacharacters ^ and $ match the beginnings and ends of lines rather than the beginning and end of the entire string (the default).
s Causes the metacharacter . to match newlines (character reference &#xA;). The default is not to match newlines. This mode is sometimes called single-line mode, but from its definition, it should be clear that it is not the opposite of multi-line mode. Indeed, one can use both the s and m flags together.
x Allows whitespace to be used in a regular expression as a separator rather than a significant character.
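The effect of each flag is easiest to see by running the same regex with and without it. The following sketch does that with matches( ), which accepts the same flags as its optional third argument. The variable $s and the element and attribute names are invented for the illustration, and any XML document can serve as input.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical two-line sample string; &#xA; is the newline. -->
  <xsl:variable name="s" select="'Alpha&#xA;beta'"/>

  <xsl:template match="/">
    <flags>
      <!-- i: ALPHA matches Alpha only when case is ignored -->
      <i without="{matches($s, 'ALPHA')}"
         with="{matches($s, 'ALPHA', 'i')}"/>

      <!-- m: ^ and $ match at line boundaries, so the second line qualifies -->
      <m without="{matches($s, '^beta$')}"
         with="{matches($s, '^beta$', 'm')}"/>

      <!-- s: the dot is allowed to match the newline between the two words -->
      <s without="{matches($s, 'Alpha.beta')}"
         with="{matches($s, 'Alpha.beta', 's')}"/>

      <!-- x: the space inside the regex is ignored instead of matched -->
      <x without="{matches($s, 'Al pha')}"
         with="{matches($s, 'Al pha', 'x')}"/>
    </flags>
  </xsl:template>

</xsl:stylesheet>
```

On a conforming XSLT 2.0 processor, each element should report false for the without attribute and true for the with attribute.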

The child element xsl:matching-substring is used to process the substrings that match the regex, and xsl:non-matching-substring is used to process the substrings that do not match the regex. Either may be omitted. It is also possible to refer to captured groups (parts of a regex surrounded by parentheses) using the regex-group( ) function within xsl:matching-substring:

<xsl:template match="date">
  <xsl:copy>
    <xsl:analyze-string select="normalize-space(.)"
         regex="(\d\d\d\d) ( / | - ) (\d\d) ( / | - ) (\d\d)"
         flags="x">
      <xsl:matching-substring>
        <year><xsl:value-of select="regex-group(1)"/></year>
        <month><xsl:value-of select="regex-group(3)"/></month>
        <day><xsl:value-of select="regex-group(5)"/></day>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <error><xsl:value-of select="."/></error>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:copy>
</xsl:template>

A nice complement to xsl:analyze-string is the XSLT function unparsed-text( ). This function allows you to read the contents of a text file as a string. Thus, as the name suggests, the file is not parsed and therefore need not be XML. In fact, except in the most unusual of circumstances, you would not normally use unparsed-text( ) on XML content.

The following stylesheet converts a simple comma-delimited file (one with no quoted strings) to XML:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!-- Name of the comma-delimited file to convert -->
  <xsl:param name="csv-file" select=" 'test.csv' "/>

  <xsl:template match="/">
    <converted-csv filename="{$csv-file}">
      <!-- Read the whole file as a string and split it into lines -->
      <xsl:for-each select="tokenize(unparsed-text($csv-file, 'UTF-8'),
                                     '\n')">
        <!-- Skip blank lines -->
        <xsl:if test="normalize-space(.)">
          <row>
            <!-- Everything between commas becomes a column -->
            <xsl:analyze-string select="." regex="," flags="x">
              <xsl:non-matching-substring>
                <col><xsl:value-of select="normalize-space(.)"/></col>
              </xsl:non-matching-substring>
            </xsl:analyze-string>
          </row>
        </xsl:if>
      </xsl:for-each>
    </converted-csv>
  </xsl:template>

</xsl:stylesheet>

Discussion

The regex capabilities of XSLT 2.0, along with unparsed-text( ), open up whole new processing possibilities that were next to impossible in XSLT 1.0. Still, XSLT would not be my first choice for non-XML processing unless I were working in a context where a multi-language solution (e.g., Java and XSLT or Perl and XSLT) was not practical. Of course, if XSLT is the only language you want to master, the new capabilities certainly open up new vistas for you to explore.

Part of my motivation for jumping the XSLT ship when entering the domain of unstructured text processing is the set of "missing features" in xsl:analyze-string. It would be nice if the position( ) and last( ) functions worked within xsl:matching-substring to tell you that this is match number position( ) of last( ) possible matches. I sometimes use xsl:for-each over a tokenize( ) instead of xsl:analyze-string, but that is also deficient because it returns only the non-matching portions. Further, you often feel compelled to use xsl:analyze-string for a complex parsing problem in which a single regex uses alternation (|) to cover many possible patterns. However, there is no direct way to tell which alternative matched without re-matching using the matches( ) function, which is a tad redundant and wasteful for my taste because surely the regex engine knows what part it just matched:

<xsl:template match="text()">
  <xsl:analyze-string select="."
                      regex='[\-+]?\d\.\d+\s*[eE][\-+]?\d+ |
                             [\-+]?\d+\.\d+                |
                             [\-+]?\d+                     |
                             "[^"]*?"                      '
                      flags="x">
    <xsl:matching-substring>
      <!-- Re-match to find out which alternative fired. The tests are
           anchored with ^ and $ so that, for example, a quoted string
           containing digits is not mistaken for an integer. -->
      <xsl:choose>
        <xsl:when test="matches(.,'^[\-+]?\d\.\d+\s*[eE][\-+]?\d+$')">
          <scientific><xsl:value-of select="."/></scientific>
        </xsl:when>
        <xsl:when test="matches(.,'^[\-+]?\d+\.\d+$')">
          <decimal><xsl:value-of select="."/></decimal>
        </xsl:when>
        <xsl:when test="matches(.,'^[\-+]?\d+$')">
          <integer><xsl:value-of select="."/></integer>
        </xsl:when>
        <xsl:when test='matches(.,"^ "" [^""]*? "" $", "x")'>
          <string><xsl:value-of select="."/></string>
        </xsl:when>
      </xsl:choose>
    </xsl:matching-substring>
  </xsl:analyze-string>
</xsl:template>
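One possible way to avoid the second round of matching, offered here only as a sketch and not as part of the original recipe, is to wrap each alternative in its own capturing group and then test which regex-group( ) is non-empty. It relies on two assumptions: that none of the alternatives can match the empty string (true for the four patterns above), and that regex-group( ) returns a zero-length string for a group that did not take part in the match.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="text()">
    <!-- Same four alternatives as above, each in its own numbered group -->
    <xsl:analyze-string select="."
                        regex='([\-+]?\d\.\d+\s*[eE][\-+]?\d+) |
                               ([\-+]?\d+\.\d+)                |
                               ([\-+]?\d+)                     |
                               ("[^"]*?")                      '
                        flags="x">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- A non-empty string is true, so each test simply asks
               whether that group captured anything. -->
          <xsl:when test="regex-group(1)">
            <scientific><xsl:value-of select="."/></scientific>
          </xsl:when>
          <xsl:when test="regex-group(2)">
            <decimal><xsl:value-of select="."/></decimal>
          </xsl:when>
          <xsl:when test="regex-group(3)">
            <integer><xsl:value-of select="."/></integer>
          </xsl:when>
          <xsl:when test="regex-group(4)">
            <string><xsl:value-of select="."/></string>
          </xsl:when>
        </xsl:choose>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The cost is that the group numbers must be kept in step with the order of the alternatives, which becomes harder to maintain if the individual patterns need capturing groups of their own.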

Now, hindsight is always 20/20, and there are, of course, all sorts of implementation issues and tradeoffs that one needs to overcome when enhancing a language; so, with all due respect to the XSLT 2.0 committee, it would have been sweeter if xsl:analyze-string worked as follows:

<!-- NOT VALID XSLT 2.0 - Author's wishful thinking -->
<xsl:template match="text()">
  <xsl:analyze-string select="."
                      flags="x">
    <xsl:matching-substring regex="[\-+]?\d\.\d+\s*[eE][\-+]?\d+">
      <scientific><xsl:value-of select="."/></scientific>
    </xsl:matching-substring>
    <xsl:matching-substring regex="[\-+]?\d+\.\d+">
      <decimal><xsl:value-of select="."/></decimal>
    </xsl:matching-substring>
    <xsl:matching-substring regex="[\-+]?\d+">
      <integer><xsl:value-of select="."/></integer>
    </xsl:matching-substring>
    <xsl:matching-substring regex=' "[^"]*?" '>
      <string><xsl:value-of select="."/></string>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <other><xsl:value-of select="."/></other>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>