Recipe 6.7. Processing Unstructured Text with Regular Expressions

Problem

You need to transform XML documents that contain chunks of unstructured text that must be marked up into a proper document.

Solution

There are three XPath 2.0 functions for working with regular expressions: matches(), replace(), and tokenize(). We covered these in Chapter 1. There is also a new XSLT instruction, xsl:analyze-string, which allows you to do even more advanced text processing.

The xsl:analyze-string instruction takes a select attribute for specifying the string to be processed, a regex attribute for specifying the regular expression to apply to the string, and an optional flags attribute to modify the action of the regex engine. The standard flags are:

    i    Case-insensitive mode.
    m    Multiline mode: ^ and $ match at the start and end of each line rather than of the whole string.
    s    Dot-all mode: . matches any character, including a newline.
    x    Whitespace in the regex is ignored, so you can use it to lay out a complex expression.

The child element xsl:matching-substring is used to process the substrings that match the regex, and xsl:non-matching-substring is used to process the substrings that do not match the regex. Either may be omitted. It is also possible to refer to captured groups (parts of a regex surrounded by parentheses) using the regex-group function within xsl:matching-substring:

    <xsl:template match="date">
      <xsl:copy>
        <xsl:analyze-string select="normalize-space(.)"
            regex="(\d\d\d\d) ( / | - ) (\d\d) ( / | - ) (\d\d)"
            flags="x">
          <xsl:matching-substring>
            <year><xsl:value-of select="regex-group(1)"/></year>
            <month><xsl:value-of select="regex-group(3)"/></month>
            <day><xsl:value-of select="regex-group(5)"/></day>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <error><xsl:value-of select="."/></error>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:copy>
    </xsl:template>

A nice complement to xsl:analyze-string is the XSLT function unparsed-text(). This function allows you to read the contents of a text file as a string. Thus, as the name suggests, the file is not parsed and therefore need not be XML. In fact, except in the most unusual of circumstances, you would not normally use unparsed-text() on XML content. The following stylesheet will convert a simple comma-delimited file (one with no quoted strings) to XML:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

      <xsl:param name="csv-file" select="'test.csv'"/>

      <xsl:template match="/">
        <converted-csv filename="{$csv-file}">
          <!-- One row per non-blank line of the text file -->
          <xsl:for-each select="tokenize(unparsed-text($csv-file, 'UTF-8'), '\n')">
            <xsl:if test="normalize-space(.)">
              <row>
                <!-- The commas are the matches; the fields are what lies between them -->
                <xsl:analyze-string select="." regex="," flags="x">
                  <xsl:non-matching-substring>
                    <col><xsl:value-of select="normalize-space(.)"/></col>
                  </xsl:non-matching-substring>
                </xsl:analyze-string>
              </row>
            </xsl:if>
          </xsl:for-each>
        </converted-csv>
      </xsl:template>

    </xsl:stylesheet>

Discussion

The regex capabilities of XSLT 2.0, along with unparsed-text(), open up whole new processing possibilities to XSLT that were next to impossible in XSLT 1.0. Still, XSLT would not be my first choice for non-XML processing unless I was working in a context where a multi-language solution (e.g., Java and XSLT or Perl and XSLT) was not practical. Of course, if XSLT is the only language you want to master, the new capabilities certainly open up new vistas for you to explore.

Part of my motivation for jumping the XSLT ship when entering the domain of unstructured text processing is the "missing features" of xsl:analyze-string. It would be nice if the position() and last() functions worked within xsl:matching-substring to tell you that this is match number position() of last() possible matches. I sometimes use xsl:for-each over a tokenize() instead of xsl:analyze-string, but that is also deficient because it returns only the non-matching portions.

Further, you often feel compelled to use xsl:analyze-string for a complex parsing problem involving many possible matches in a regex using alternation (|). However, xsl:analyze-string does not report which alternative matched, so the straightforward approach is to re-match with the matches() function, which is a tad redundant and wasteful for my taste because surely the regex engine knows what part it just matched:

    <xsl:template match="text()">
      <xsl:analyze-string select="."
          regex='[\-+]?\d\.\d+\s*[eE][\-+]?\d+ | [\-+]?\d+\.\d+ | [\-+]?\d+ | "[^"]*?" '
          flags="x">
        <xsl:matching-substring>
          <!-- Each test is anchored with ^ and $ because matches() succeeds on
               any substring; unanchored, the integer test would also accept a
               quoted string such as "42". -->
          <xsl:choose>
            <xsl:when test="matches(., '^[\-+]?\d\.\d+\s*[eE][\-+]?\d+$')">
              <scientific><xsl:value-of select="."/></scientific>
            </xsl:when>
            <xsl:when test="matches(., '^[\-+]?\d+\.\d+$')">
              <decimal><xsl:value-of select="."/></decimal>
            </xsl:when>
            <xsl:when test="matches(., '^[\-+]?\d+$')">
              <integer><xsl:value-of select="."/></integer>
            </xsl:when>
            <xsl:when test='matches(., "^ "" [^""]*? "" $", "x")'>
              <string><xsl:value-of select="."/></string>
            </xsl:when>
          </xsl:choose>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:template>

Now, hindsight is always 20/20, and there are, of course, all sorts of implementation issues and tradeoffs that one needs to overcome when enhancing a language; so, with all due respect to the XSLT 2.0 committee, it would have been sweeter if xsl:analyze-string worked as follows:

    <!-- NOT VALID XSLT 2.0 - Author's wishful thinking -->
    <xsl:template match="text()">
      <xsl:analyze-string select="." flags="x">
        <xsl:matching-substring regex="[\-+]?\d\.\d+\s*[eE][\-+]?\d+">
          <scientific><xsl:value-of select="."/></scientific>
        </xsl:matching-substring>
        <xsl:matching-substring regex="[\-+]?\d+\.\d+">
          <decimal><xsl:value-of select="."/></decimal>
        </xsl:matching-substring>
        <xsl:matching-substring regex="[\-+]?\d+">
          <integer><xsl:value-of select="."/></integer>
        </xsl:matching-substring>
        <xsl:matching-substring regex=' "[^"]*?" '>
          <string><xsl:value-of select="."/></string>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <other><xsl:value-of select="."/></other>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>
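
Short of that wished-for syntax, one workaround that stays within valid XSLT 2.0 is to wrap each alternative in its own capturing group and then check which regex-group() came back non-empty. The following is a minimal sketch rather than a definitive implementation. It assumes that every alternative must consume at least one character (true of the four used here), because regex-group() returns a zero-length string both for a group that did not take part in the match and for a group that matched nothing. The group numbering (1 through 4) is simply the left-to-right order of the parentheses chosen for this sketch.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Each alternative sits in its own capturing group:
         1 = scientific, 2 = decimal, 3 = integer, 4 = quoted string.
       Only the group belonging to the alternative that matched
       is non-empty for any given match. -->
  <xsl:template match="text()">
    <xsl:analyze-string select="."
        regex='([\-+]?\d\.\d+\s*[eE][\-+]?\d+) | ([\-+]?\d+\.\d+) | ([\-+]?\d+) | ("[^"]*?")'
        flags="x">
      <xsl:matching-substring>
        <xsl:choose>
          <xsl:when test="regex-group(1) != ''">
            <scientific><xsl:value-of select="."/></scientific>
          </xsl:when>
          <xsl:when test="regex-group(2) != ''">
            <decimal><xsl:value-of select="."/></decimal>
          </xsl:when>
          <xsl:when test="regex-group(3) != ''">
            <integer><xsl:value-of select="."/></integer>
          </xsl:when>
          <xsl:when test="regex-group(4) != ''">
            <string><xsl:value-of select="."/></string>
          </xsl:when>
        </xsl:choose>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The xsl:choose is still there, but each branch is now a cheap string comparison instead of a second trip through the regex engine. The price is that any capturing groups you need inside an alternative shift the numbering, so the group indexes have to be kept in step with the regex by hand.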