Recipe 2.7. Replacing Text

Problem

You want to replace all occurrences of a substring within a target string with another string.

Solution

XSLT 1.0

The following recursive template replaces all occurrences of a search string with a replacement string:

<xsl:template name="search-and-replace">
  <xsl:param name="input"/>
  <xsl:param name="search-string"/>
  <xsl:param name="replace-string"/>
  <xsl:choose>
    <!-- See if the input contains the search string -->
    <xsl:when test="$search-string and contains($input,$search-string)">
      <!-- If so, then concatenate the substring before the search string
           to the replacement string and to the result of recursively
           applying this template to the remaining substring. -->
      <xsl:value-of select="substring-before($input,$search-string)"/>
      <xsl:value-of select="$replace-string"/>
      <xsl:call-template name="search-and-replace">
        <xsl:with-param name="input" select="substring-after($input,$search-string)"/>
        <xsl:with-param name="search-string" select="$search-string"/>
        <xsl:with-param name="replace-string" select="$replace-string"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <!-- There are no more occurrences of the search string so
           just return the current input string -->
      <xsl:value-of select="$input"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

If you want to replace only whole words, then you must ensure that the characters immediately before and after the search string are in the class of characters considered word delimiters. We chose the characters in the variable $punc plus whitespace to be word delimiters:

<xsl:template name="search-and-replace-whole-words-only">
  <xsl:param name="input"/>
  <xsl:param name="search-string"/>
  <xsl:param name="replace-string"/>
  <!-- The character that preceded $input in the original text. Callers
       normally omit it; the recursive call sets it so that a match which
       directly follows an earlier match (as in 'catcat') is not mistaken
       for a whole word. -->
  <xsl:param name="prev-char" select="' '"/>
  <!-- The quote characters must be written as entities inside the attribute -->
  <xsl:variable name="punc" select="concat('.,;:()[]!?$@&amp;&quot;',&quot;'&quot;)"/>
  <xsl:choose>
    <!-- See if the input contains the search string. An empty search
         string is rejected because it would recurse forever. -->
    <xsl:when test="$search-string and contains($input,$search-string)">
      <!-- If so, then test that the before and after characters
           are word delimiters. -->
      <xsl:variable name="before" select="substring-before($input,$search-string)"/>
      <xsl:variable name="before-char"
                    select="substring(concat($prev-char,$before),string-length($before) + 1, 1)"/>
      <xsl:variable name="after" select="substring-after($input,$search-string)"/>
      <xsl:variable name="after-char" select="substring($after,1,1)"/>
      <xsl:value-of select="$before"/>
      <xsl:choose>
        <xsl:when test="(not(normalize-space($before-char)) or
                         contains($punc,$before-char))
                        and
                        (not(normalize-space($after-char)) or
                         contains($punc,$after-char))">
          <xsl:value-of select="$replace-string"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$search-string"/>
        </xsl:otherwise>
      </xsl:choose>
      <xsl:call-template name="search-and-replace-whole-words-only">
        <xsl:with-param name="input" select="$after"/>
        <xsl:with-param name="search-string" select="$search-string"/>
        <xsl:with-param name="replace-string" select="$replace-string"/>
        <!-- The remaining input is preceded by the last character
             of the occurrence just processed -->
        <xsl:with-param name="prev-char"
                        select="substring($search-string,string-length($search-string))"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <!-- There are no more occurrences of the search string so
           just return the current input string -->
      <xsl:value-of select="$input"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
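To see how the two templates differ in practice, the following sketch calls each of them on the same sample text. It assumes that both named templates are stored in a file called replace.xslt alongside this stylesheet; that file name, the $sample variable, and the sample text are hypothetical choices made for illustration. Any well-formed XML document can serve as the input.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: the two named templates from this recipe live in
       replace.xslt (hypothetical file name) -->
  <xsl:include href="replace.xslt"/>

  <xsl:output method="text"/>

  <!-- Sample text chosen for illustration only -->
  <xsl:variable name="sample" select="'the cat sat on the cathedral'"/>

  <xsl:template match="/">
    <!-- Plain replacement: every occurrence of 'cat' is replaced,
         including the one inside 'cathedral' -->
    <xsl:call-template name="search-and-replace">
      <xsl:with-param name="input" select="$sample"/>
      <xsl:with-param name="search-string" select="'cat'"/>
      <xsl:with-param name="replace-string" select="'dog'"/>
    </xsl:call-template>

    <!-- Line break between the two results -->
    <xsl:text>&#10;</xsl:text>

    <!-- Whole-word replacement: only the free-standing 'cat' is replaced -->
    <xsl:call-template name="search-and-replace-whole-words-only">
      <xsl:with-param name="input" select="$sample"/>
      <xsl:with-param name="search-string" select="'cat'"/>
      <xsl:with-param name="replace-string" select="'dog'"/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```

Under these assumptions, the first call should produce "the dog sat on the doghedral" and the second "the dog sat on the cathedral", because the "cat" in "cathedral" is followed by a letter rather than a word delimiter.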
XSLT 2.0

The functionality of search-and-replace is built into the XSLT 2.0 function replace(). The functionality of search-and-replace-whole-words-only can be emulated using a regex that matches words:

<xsl:function name="ckbk:search-and-replace-whole-words-only">
  <xsl:param name="input" as="xs:string"/>
  <xsl:param name="search-string" as="xs:string"/>
  <xsl:param name="replace-string" as="xs:string"/>
  <xsl:sequence select="replace($input,
                                concat('(^|\W)',$search-string,'(\W|$)'),
                                concat('$1',$replace-string,'$2'))"/>
</xsl:function>

Here we build up a regex by surrounding $search-string with (^|\W) and (\W|$), where \W means "not \w" or "not a word character." The ^ and $ handle the case when the word appears at the beginning or end of the string. We also need to put the matched \W character back into the text using references to the captured groups $1 and $2. Note that replace() treats $search-string as a regular expression and $replace-string as a replacement pattern, so any regex metacharacters in the former, and any $ or \ in the latter, must be escaped. Also, because each match consumes its delimiters, when two occurrences are separated by a single delimiter character (as in "cat cat"), only the first is replaced.

The function replace() is more powerful than the preceding XSLT 1.0 solutions because it uses regular expressions and can remember parts of the match and use them in the replacement via the variables $1, $2, etc. We explore replace() further in Recipe 2.10.

Discussion

Searching and replacing is a common text-processing task. The solution shown here is the most straightforward implementation of search and replace written purely in terms of XSLT. When considering the performance of this solution, the reader might think it is inefficient. For each occurrence of the search string, the code will call contains(), substring-before(), and substring-after(). Presumably, each function will rescan the input string for the search string. It seems like this approach will perform two more searches than necessary. After some thought, you might come up with one of the following, seemingly more efficient, solutions shown in Example 2-4 and Example 2-5.

Example 2-4. Using a temp string in a failed attempt to improve search and replace

<xsl:template name="search-and-replace">
  <xsl:param name="input"/>
  <xsl:param name="search-string"/>
  <xsl:param name="replace-string"/>
  <!-- Find the substring before the search string and store it in a
       variable -->
  <xsl:variable name="temp" select="substring-before($input,$search-string)"/>
  <xsl:choose>
    <!-- If $temp is not empty or the input starts with the search
         string then we know we have to do a replace. This eliminates the
         need to use contains(). -->
    <xsl:when test="$temp or starts-with($input,$search-string)">
      <xsl:value-of select="concat($temp,$replace-string)"/>
      <xsl:call-template name="search-and-replace">
        <!-- We eliminate the need to call substring-after by using the
             length of temp and the search string to extract the remaining
             string in the recursive call. -->
        <xsl:with-param name="input"
                        select="substring($input,string-length($temp) +
                                string-length($search-string) + 1)"/>
        <xsl:with-param name="search-string" select="$search-string"/>
        <xsl:with-param name="replace-string" select="$replace-string"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$input"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

Example 2-5. Using a temp integer in a failed attempt to improve search and replace

<xsl:template name="search-and-replace">
  <xsl:param name="input"/>
  <xsl:param name="search-string"/>
  <xsl:param name="replace-string"/>
  <!-- Find the length of the substring before the search string and
       store it in a variable -->
  <xsl:variable name="temp"
                select="string-length(substring-before($input,$search-string))"/>
  <xsl:choose>
    <!-- If $temp is not 0 or the input starts with the search
         string then we know we have to do a replace. This eliminates the
         need to use contains(). -->
    <xsl:when test="$temp or starts-with($input,$search-string)">
      <xsl:value-of select="substring($input,1,$temp)"/>
      <xsl:value-of select="$replace-string"/>
      <!-- We eliminate the need to call substring-after by using temp
           and the length of the search string to extract the remaining
           string in the recursive call. -->
      <xsl:call-template name="search-and-replace">
        <xsl:with-param name="input"
                        select="substring($input,$temp +
                                string-length($search-string) + 1)"/>
        <xsl:with-param name="search-string" select="$search-string"/>
        <xsl:with-param name="replace-string" select="$replace-string"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$input"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

The idea behind both attempts is that if you remember the spot where substring-before() finds a match, then you can use this information to eliminate the need to call contains() and substring-after(). You are forced to introduce a call to starts-with() to disambiguate the case in which substring-before() returns the empty string; this can happen when the search string is absent or when the input string starts with the search string. However, starts-with() is presumably faster than contains() because it doesn't need to scan past the length of the search string. The idea that distinguishes the second attempt from the first is the thought that storing an integer offset might be more efficient than storing the entire substring.

Alas, these supposed optimizations fail to produce any improvement when using the Xalan XSLT implementation and actually produce timing results that are an order of magnitude slower on some inputs when using either Saxon or XT! My first hypothesis regarding this unintuitive result was that the use of the variable $temp in the recursive call interfered with Saxon's tail-recursion optimization (see Recipe 2.6). However, by experimenting with large inputs that have many matches, I failed to cause a stack overflow. My next suspicion was that for some reason, XSLT substring() is actually slower than the substring-before() and substring-after() calls.
Michael Kay, the author of Saxon, indicated that Saxon's implementation of substring() was slow due to the complicated rules that XSLT substring must implement, including floating-point rounding of arguments, handling special cases where the start or end point is outside the bounds of the string, and issues involving Unicode surrogate pairs. In contrast, substring-before() and substring-after() translate more directly into Java.

The real lesson here is that optimization is tricky business, especially in XSLT, where there can be a wide disparity between implementations and where new versions continually apply new optimizations. Unless you are prepared to profile frequently, it is best to stick with simple solutions. An added advantage of obvious solutions is that they are likely to behave consistently across different XSLT implementations.