Recipe 2.9. Tokenizing a String

Problem

You want to break a string into a list of tokens based on the occurrence of one or more delimiter characters.

Solution

XSLT 1.0

Jeni Tennison implemented this solution (but the comments are my doing). The tokenizer returns each token as a token element whose text content is the token. It also falls back to character-level tokenization if the delimiter string is empty:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template name="tokenize">
  <xsl:param name="string" select="''" />
  <xsl:param name="delimiters" select="' &#x9;&#xA;'" />
  <xsl:choose>
    <!-- Nothing to do if empty string -->
    <xsl:when test="not($string)" />

    <!-- No delimiters signals character level tokenization. -->
    <xsl:when test="not($delimiters)">
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string" select="$string" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string" select="$string" />
        <xsl:with-param name="delimiters" select="$delimiters" />
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<xsl:template name="_tokenize-characters">
  <xsl:param name="string" />
  <xsl:if test="$string">
    <token><xsl:value-of select="substring($string, 1, 1)" /></token>
    <xsl:call-template name="_tokenize-characters">
      <xsl:with-param name="string" select="substring($string, 2)" />
    </xsl:call-template>
  </xsl:if>
</xsl:template>

<xsl:template name="_tokenize-delimiters">
  <xsl:param name="string" />
  <xsl:param name="delimiters" />
  <xsl:param name="last-delimit"/>

  <!-- Extract a delimiter -->
  <xsl:variable name="delimiter" select="substring($delimiters, 1, 1)" />
  <xsl:choose>
    <!-- If the delimiter is empty we have a token -->
    <xsl:when test="not($delimiter)">
      <token><xsl:value-of select="$string"/></token>
    </xsl:when>

    <!-- If the string contains at least one delimiter we must split it -->
    <xsl:when test="contains($string, $delimiter)">
      <!-- If it starts with the delimiter we don't need to handle the
           before part -->
      <xsl:if test="not(starts-with($string, $delimiter))">
        <!-- Handle the part that comes before the current delimiter
             with the next delimiter. If there is no next the first test
             in this template will detect the token -->
        <xsl:call-template name="_tokenize-delimiters">
          <xsl:with-param name="string"
                          select="substring-before($string, $delimiter)" />
          <xsl:with-param name="delimiters"
                          select="substring($delimiters, 2)" />
        </xsl:call-template>
      </xsl:if>

      <!-- Handle the part that comes after the delimiter using the
           current delimiter -->
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string"
                        select="substring-after($string, $delimiter)" />
        <xsl:with-param name="delimiters" select="$delimiters" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <!-- No occurrences of current delimiter so move on to next -->
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string"
                        select="$string" />
        <xsl:with-param name="delimiters"
                        select="substring($delimiters, 2)" />
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

</xsl:stylesheet>
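
As a rough illustration of how the tokenize template might be called, the sketch below splits a literal string on commas and semicolons. The file name tokenizer.xslt is an assumption made for this example (it stands for wherever you saved the templates above), and the tokens wrapper element is likewise only illustrative.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: the recipe's templates are saved as tokenizer.xslt
       (a hypothetical file name) next to this stylesheet. -->
  <xsl:include href="tokenizer.xslt"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- "tokens" is an illustrative wrapper, not part of the recipe -->
    <tokens>
      <xsl:call-template name="tokenize">
        <xsl:with-param name="string" select="'red,green;blue'"/>
        <!-- Each character of this string is treated as a delimiter -->
        <xsl:with-param name="delimiters" select="',;'"/>
      </xsl:call-template>
    </tokens>
  </xsl:template>

</xsl:stylesheet>
```

Run against any input document, this should produce three token elements containing red, green, and blue. Omitting the delimiters parameter falls back to the template's default of space, tab, and newline.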

XSLT 2.0

Use the XPath 2.0 tokenize() function covered in Recipe 2.11.
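
For comparison, here is a minimal sketch of the same split using tokenize() directly. It assumes an XSLT 2.0 processor; the tokens and token wrapper elements are only there to mirror the 1.0 output and are not required by the function.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <tokens>
      <!-- The second argument is a regular expression, so [,;] matches
           either delimiter character. -->
      <xsl:for-each select="tokenize('red,green;blue', '[,;]')">
        <token><xsl:value-of select="."/></token>
      </xsl:for-each>
    </tokens>
  </xsl:template>

</xsl:stylesheet>
```

Note that tokenize() returns a sequence of strings rather than elements, so wrapping each one in a token element is a choice made here for parity with the 1.0 solution.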

Discussion

Tokenization is a common string-processing task. In languages with powerful regular-expression engines, tokenization is trivial. In this area, languages such as Perl, Python, JavaScript, and Tcl currently outshine XSLT. However, this recipe shows that XSLT can deal with tokenization if you must stay within the bounds of pure XSLT. If you are willing to use extensions, then you can defer to another language for low-level string manipulations such as tokenization.
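
One extension route that stays inside XSLT syntax is the EXSLT strings module, which defines a str:tokenize function. The sketch below assumes your processor implements that module (not all do); the tokens wrapper element is illustrative.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:str="http://exslt.org/strings"
                extension-element-prefixes="str">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <tokens>
      <!-- Assumption: the processor supports EXSLT str:tokenize, which
           returns a node-set of token elements. The second argument is
           a set of delimiter characters, not a regular expression. -->
      <xsl:copy-of select="str:tokenize('red,green;blue', ',;')"/>
    </tokens>
  </xsl:template>

</xsl:stylesheet>
```

Because str:tokenize also yields token elements, it can often stand in for the pure XSLT template where portability across processors is not a concern.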

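If you use the XSLT approach and your processor does not optimize tail recursion, then you may want to use a divide-and-conquer algorithm for character tokenization. Splitting the string in half at each step keeps the recursion depth proportional to the logarithm of the string length rather than to the length itself: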

<xsl:template name="_tokenize-characters">
  <xsl:param name="string" />
  <xsl:param name="len" select="string-length($string)"/>
  <xsl:choose>
    <xsl:when test="$len = 1">
      <token><xsl:value-of select="$string"/></token>
    </xsl:when>
    <xsl:otherwise>
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string"
                        select="substring($string, 1, floor($len div 2))" />
        <xsl:with-param name="len" select="floor($len div 2)"/>
      </xsl:call-template>
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string"
                        select="substring($string, floor($len div 2) + 1)" />
        <xsl:with-param name="len" select="ceiling($len div 2)"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
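
Either version of _tokenize-characters is reached through the public tokenize template by passing an empty delimiters string. The following sketch shows such a call; as an assumption for the example, tokenizer.xslt is a hypothetical file holding the recipe's templates, and the tokens wrapper element is illustrative.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: tokenizer.xslt (hypothetical name) contains the
       tokenize template and one version of _tokenize-characters. -->
  <xsl:include href="tokenizer.xslt"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <tokens>
      <xsl:call-template name="tokenize">
        <xsl:with-param name="string" select="'abc'"/>
        <!-- An empty delimiter string selects character-level
             tokenization -->
        <xsl:with-param name="delimiters" select="''"/>
      </xsl:call-template>
    </tokens>
  </xsl:template>

</xsl:stylesheet>
```

With the divide-and-conquer version, the string abc is split into a and bc, and bc into b and c, so the tokens come out in the original order: a, b, c. Going through tokenize rather than calling _tokenize-characters directly also matters because tokenize filters out the empty string, which the divide-and-conquer version does not handle on its own.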

See Also

Chapter 12 shows how to access the regex facility in JavaScript if your XSLT processor allows JavaScript-based extensions. Java also has a built-in tokenizer (java.util.StringTokenizer).




XSLT Cookbook: Solutions and Examples for XML and XSLT Developers, 2nd Edition
ISBN: 0596009747
Year: 2003
Pages: 208
Authors: Sal Mangano
