xsl:analyze-string

The <xsl:analyze-string> instruction is used to process an input string using a regular expression (often abbreviated to regex). It is useful where the source document contains text whose structure is not fully marked up using XML elements and attributes, but has its own internal syntax: for example, the value of an attribute might be a list of numbers separated by commas.

I use the term regex to refer to regular expressions in this section, because it helps to avoid any confusion with XPath expressions.

Changes in 2.0

This instruction is new in XSLT 2.0.

Format

  <xsl:analyze-string
    select = expression
    regex = { string }
    flags? = { string }>
    <!-- Content: (xsl:matching-substring?,
                   xsl:non-matching-substring?,
                   xsl:fallback*) -->
  </xsl:analyze-string>

Position

<xsl:analyze-string> is an instruction, and is always used within a sequence constructor.

Attributes

Name	Value	Meaning
select mandatory	Expression	The input string to be analyzed using the regex. A type error occurs if the value of the expression cannot be converted to a string using the standard conversion rules described on page 476
regex mandatory	Attribute value template, returning a regular expression, as defined below	The regular expression (regex) used to analyze the string
flags optional	Attribute value template, returning regex flags, as defined below	Flags controlling how the regex is interpreted. Omitting the attribute is equivalent to supplying a zero-length string (no special flags)

The construct expression (meaning an XPath expression) is defined at the beginning of this chapter and more formally in Chapter 5 of XPath 2.0 Programmer's Reference.

The syntax of regular expressions permitted in the regex attribute is the same as the syntax accepted by the functions matches(), replace(), and tokenize() in XPath 2.0. This is described fully in Chapter 11 of XPath 2.0 Programmer's Reference, and is summarized below. It is based on the syntax used for regular expressions in XML Schema, with some extensions.

The regex attribute is an attribute value template. This makes it possible to construct the regex at runtime, using an XPath expression. For example, the regex can be supplied as a stylesheet parameter. The downside of this is that curly braces within the attribute value must be doubled if they are to be treated as part of the regex, rather than having their special meaning for attribute value templates. For example, to match a sequence of three digits, write «regex="[0-9]{{3}}"».
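As a minimal sketch of how the two uses of curly braces combine, the following stylesheet builds the quantifier from a parameter. The parameter name digits, the template name main, and the sample input are assumptions made for illustration only.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: how many digits make up one code -->
  <xsl:param name="digits" select="3"/>

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <codes>
      <!-- {{ and }} produce literal braces; {$digits} is evaluated.
           With the default value the effective regex is [0-9]{3} -->
      <xsl:analyze-string select="'rooms 101, 202 and 303'"
                          regex="[0-9]{{{$digits}}}">
        <xsl:matching-substring>
          <code><xsl:value-of select="."/></code>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </codes>
  </xsl:template>

</xsl:stylesheet>
```

With the default parameter value this should produce three <code> elements containing 101, 202, and 303; the text between the numbers is discarded because there is no <xsl:non-matching-substring> element.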

The flags attribute controls how the regex is to be interpreted. Four flags are defined, each denoted by a single letter, and they can be written in any order. Like the regex attribute, flags may be written as an attribute value template. The flags have the following meaning:

Flag	Meaning
i	Selects case-insensitive mode. In simple terms, this means that «X» and «x» will match each other. The detailed definition is much more complex, and is specified by reference to character properties in the Unicode character database. Without this flag, characters match only if the Unicode codepoints are identical
m	Selects multiline mode. In multiline mode, the metacharacters «^» and «$» match the beginning and end of a line, while in string mode, they match the beginning and end of the entire string. (People sometimes find the name confusing: think of multiline as meaning "treat each line separately," and string as meaning "treat all the lines as a single string")
s	Selects dot-all mode (this is called single-line mode in Perl, hence the abbreviation, but the term is confusing because it suggests the opposite of multiline mode). In dot-all mode the metacharacter «.» matches any character, whereas normally it matches any character except a newline (x0A)
x	Allows whitespace to be used as an insignificant separator within the regex. Without this setting, whitespace characters in a regex are ordinary characters that represent themselves
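The following sketch combines two of these flags: «i» lets «[a-z]» match the capital «W», and «x» allows the regex to be spaced out for readability. The template name main, the sample input, and the <setting> element are illustrative assumptions.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <!-- With flags="x" the spaces in the regex are ignored, so it is
         equivalent to ^([a-z]+)\s*=\s*([0-9]+)$ -->
    <xsl:analyze-string select="'Width = 20'"
                        regex="^ ([a-z]+) \s* = \s* ([0-9]+) $"
                        flags="ix">
      <xsl:matching-substring>
        <setting name="{regex-group(1)}" value="{regex-group(2)}"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The expected result is a single element <setting name="Width" value="20"/>. Without the «i» flag the regex would not match at all, and nothing would be output.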

Content

Zero or one <xsl:matching-substring> element

Zero or one <xsl:non-matching-substring> element

Zero or more <xsl:fallback> elements

These must appear in the order specified, if they appear at all. An XSLT 2.0 processor will ignore any <xsl:fallback> instructions; they are allowed so that a stylesheet can specify fallback actions to be taken by an XSLT 1.0 processor when it encounters this element, if it is working in forward-compatible mode.
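A sketch of how <xsl:fallback> might be used is shown below. It assumes a stylesheet that declares version="2.0" but may be run by an XSLT 1.0 processor in forward-compatible mode; the parameter name text and its default value are hypothetical.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter holding the string to be analyzed -->
  <xsl:param name="text" select="'Order 66 shipped'"/>

  <xsl:template match="/">
    <out>
      <xsl:analyze-string select="$text" regex="[0-9]+">
        <!-- Used by an XSLT 2.0 processor -->
        <xsl:matching-substring>
          <number><xsl:value-of select="."/></number>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
        <!-- Used only by a 1.0 processor in forward-compatible mode:
             output the text without any markup -->
        <xsl:fallback>
          <xsl:value-of select="$text"/>
        </xsl:fallback>
      </xsl:analyze-string>
    </out>
  </xsl:template>

</xsl:stylesheet>
```

The design choice here is to degrade gracefully: a 1.0 processor still outputs the text, just without the <number> markup.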

The elements <xsl:matching-substring> and <xsl:non-matching-substring> take no attributes, and their content is in each case a sequence constructor.

Effect

The XPath expression given in the select attribute is evaluated, and provides the input string to be matched by the regex. A type error occurs if the value of this expression can't be converted to a string using the standard conversion rules described on page 476.

The regex must not be one that matches a zero-length string. This rules out values such as «regex="" » or «regex="[0-9]*" » . The reason for this rule is that languages such as Perl have different ways of handling this situation, none of which are completely satisfactory, and which are sensitive to additional parameters such as "limit," which XSLT chose not to provide.

The input string is formed by evaluating the select expression, and the processor then analyzes this string to find all substrings that match the regex. The substrings that match the regex are processed using the instructions within the <xsl:matching-substring> element, while the intervening substrings are processed using the instructions in the <xsl:non-matching-substring> element. For example, if the regex is «[0-9]+ » , then any consecutive sequence of digits in the input string is passed to the <xsl:matching-substring> element, while consecutive sequences of non-digits are passed to the <xsl:non-matching-substring> element.

Within the <xsl:matching-substring> or <xsl:non-matching-substring> element, the substring in question can be referenced as the context item, using the XPath expression «. » . It is also possible within the <xsl:matching-substring> element to refer to the substrings that matched particular parts of the regex: see Captured Groups below.

Neither a matching substring nor a nonmatching substring will ever be zero-length. This means that if two matching substrings are adjacent to each other in the input string, there will be two consecutive calls on the <xsl:matching-substring> element, with no intervening call on the <xsl:non-matching-substring> element.

Omitting either the <xsl:matching-substring> element or the <xsl:non-matching-substring> element causes the relevant substring to be discarded (no output is produced in respect of this substring). It is legal, though pointless, to omit both these elements.
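To make the division into matching and nonmatching substrings concrete, here is a minimal sketch. The template name main, the sample input, and the element names <number> and <text> are assumptions for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <parts>
      <xsl:analyze-string select="'abc123def45'" regex="[0-9]+">
        <!-- Called once for each run of digits: 123, then 45 -->
        <xsl:matching-substring>
          <number><xsl:value-of select="."/></number>
        </xsl:matching-substring>
        <!-- Called once for each run of non-digits: abc, then def -->
        <xsl:non-matching-substring>
          <text><xsl:value-of select="."/></text>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </parts>
  </xsl:template>

</xsl:stylesheet>
```

The substrings are processed in the order in which they appear in the input, so the children of <parts> should be <text>abc</text>, <number>123</number>, <text>def</text>, and <number>45</number>.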

In working its way through the input string, the processor always looks for the first match that it can find: That is, it looks first for a match starting at the first character of the input string, then for a match starting at the second character, and so on. There are several situations that can result in several candidate matches occurring at the same position (that is, starting with the same character in the input). The rules that apply are:

The quantifiers «*» and «+» are greedy: they match as many characters as they can, consistent with the regular expression as a whole succeeding. For example, given the input «Here[1] or there[2]», the regex «\[.*\]» will match the string «[1] or there[2]».
The quantifiers «*?» and «+?» are non-greedy: they match as few characters as they can, consistent with the regular expression as a whole succeeding. For example, given the input «Here[1] or there[2]», the regex «\[.*?\]» will match the strings «[1]» and «[2]».
When there are two alternatives that both match at the same position in the input string, the first alternative is preferred, regardless of its length. For example, given the input «size = 5.2», the regular expression «[0-9]+|[0-9]*\.[0-9]*» will match «5» rather than «5.2».
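These three rules can be tried out with a sketch like the following. The template name main, the variable name input, and the wrapper element names are assumptions chosen for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:variable name="input" select="'Here[1] or there[2]'"/>

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <results>
      <greedy>
        <!-- One match: [1] or there[2] -->
        <xsl:analyze-string select="$input" regex="\[.*\]">
          <xsl:matching-substring>
            <match><xsl:value-of select="."/></match>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </greedy>
      <non-greedy>
        <!-- Two matches: [1] and [2] -->
        <xsl:analyze-string select="$input" regex="\[.*?\]">
          <xsl:matching-substring>
            <match><xsl:value-of select="."/></match>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </non-greedy>
      <alternatives>
        <!-- The first alternative wins at the digit 5, so the first
             match is 5; the remaining .2 is then matched separately
             by the second alternative -->
        <xsl:analyze-string select="'size = 5.2'"
                            regex="[0-9]+|[0-9]*\.[0-9]*">
          <xsl:matching-substring>
            <match><xsl:value-of select="."/></match>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </alternatives>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

Swapping the two alternatives, so that the regex reads «[0-9]*\.[0-9]*|[0-9]+», would cause «5.2» to be matched as a single substring.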

Regular Expression Syntax

The regular expression syntax accepted in the regex attribute is the same as that accepted by the matches(), tokenize(), and replace() functions, and is fully described in Chapter 11 of XPath 2.0 Programmer's Reference. This section provides a quick summary only; it makes no attempt to define details such as precedence rules.

In this summary, capital letters A and B represent arbitrary regular expressions. n and m represent a number (a sequence of digits). a, b, c represent an arbitrary character, which is either a normal character, or one of the metacharacters «.», «\», «?», «*», «+», «{», «}», «(», «)», «[», or «]» escaped by preceding it with a backslash «\», or one of the symbols «\n», «\r», «\t» representing a newline, carriage return, or tab respectively.

Construct	Matches a string S if...
A|B	S matches either A or B
AB	The first part of S matches A and the rest matches B
A?	S either matches A or is empty
A*	S is a sequence of zero or more strings that each match A
A+	S is a sequence of one or more strings that each match A
A{n,m}	S is a sequence of between n and m strings that each match A
A{n,}	S is a sequence of n or more strings that each match A
A{n}	S is a sequence of exactly n strings that each match A
Q?	Where Q is one of the regular expressions described in the previous six rows: matches the same strings as Q, but using nongreedy matching
(A)	S matches A
c	S consists of the single character c
[abc]	S consists of one of the characters a, b, or c
[^abc]	S consists of a single character that is not one of a, b, or c
[a-b]	S is a character whose Unicode codepoint is in the range a to b
\p{prop}	S is a character that has property prop in the Unicode database
\P{prop}	S is a character that does not have property prop in the Unicode database
.	S is any single character (in dot-all mode) or any single character other than a newline (when not in dot-all mode)
\s	S is a single space, tab, newline, or carriage return
\S	S is a character that does not match \s
\i	S is a character that can appear at the start of an XML Name
\I	S is a character that does not match \i
\c	S is a character that can appear in an XML Name
\C	S is a character that does not match \c
\d	S is a character classified in Unicode as a digit
\D	S is a character that does not match \d
\w	S is a character that does not match \W
\W	S is a character that is classified in Unicode as a punctuation, separator, or "other" character
^	Matches the start of the input string, or the start of a line if in multiline mode
$	Matches the end of the input string, or the end of a line if in multiline mode

The most useful properties that may be specified in the «\p » and «\P » constructs are described below; for a full list see Chapter 11 of XPath 2.0 Programmer's Reference :

Property	Meaning
L	All letters
Lu	Uppercase letters, for example, A, B
Ll	Lowercase letters, for example, a, b
N	All numbers
P	Punctuation (full stop, comma, semicolon, and so on)
Z	Separators (for example, space, non-breaking space, en space, em space, line separator)
S	Symbols (for example, currency symbols, mathematical symbols, dingbats, and musical symbols)
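Because the «\p» and «\P» constructs use curly braces, they are a common place to trip over the attribute value template rule for the regex attribute. The sketch below picks out capitalized words; the template name main, the sample input, and the <name> element are illustrative assumptions.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <names>
      <!-- The braces are doubled because regex is an attribute value
           template; the effective regex is \p{Lu}\p{Ll}+ , that is,
           one uppercase letter followed by one or more lowercase letters -->
      <xsl:analyze-string select="'from Paris to Rome'"
                          regex="\p{{Lu}}\p{{Ll}}+">
        <xsl:matching-substring>
          <name><xsl:value-of select="."/></name>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </names>
  </xsl:template>

</xsl:stylesheet>
```

For this input the matching substrings are «Paris» and «Rome».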

Captured Groups

Within the <xsl:matching-substring> element, it is possible to refer to the substring that matched the regular expression as «. » , because it is provided as the context item. Sometimes, however, it is useful to be able to determine the strings that matched particular parts of the regular expression.

Any subexpression of the regular expression that is enclosed in parentheses causes the string that it matches to be available as a captured group. For example, if the regex «([0-9]+)([A-Z]+)([0-9]+)» is used to match the string «13DEC1987», then the three captured groups will be «13», «DEC», and «1987». If the regular expression were written instead as «([0-9]+)([A-Z]+([0-9]+))», then the three captured groups would be «13», «DEC1987», and «1987». The subexpression that starts with the nth left parenthesis in the regular expression delivers the nth captured group in the result.

Some parenthesized subexpressions might not match any part of the string. For example, if the regex «([0-9]+)([A-Z]+)?» is used to match the string «12», the first captured group will be «12» and the second will be empty.

A parenthesized expression might also match more than one substring. For example, if the regex «([0-9]+)(,[0-9]+)*» is used to match the string «12,13,14», then the second part in parentheses matches both «,13» and «,14». In this case only the last one is captured. The first captured group in this example will be «12», and the second will be «,14».

While the <xsl:matching-substring> element is being evaluated, the captured groups found during the regular expression match are available using the regex-group() function. This takes an integer argument, which is the number of the captured group that is required. If there is no corresponding subexpression in the regular expression, or if that subexpression didn't match anything, the result is a zero-length string.
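As a minimal sketch of regex-group() in use, the following stylesheet splits a date written without separators into its three components. The template name main and the <date> element with its attribute names are assumptions for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <xsl:analyze-string select="'13DEC1987'"
                        regex="^([0-9]+)([A-Z]+)([0-9]+)$">
      <xsl:matching-substring>
        <!-- Groups are numbered by the position of their left parenthesis -->
        <date day="{regex-group(1)}"
              month="{regex-group(2)}"
              year="{regex-group(3)}"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The expected result is <date day="13" month="DEC" year="1987"/>. A call such as regex-group(4) would return a zero-length string, because the regex contains only three parenthesized subexpressions.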

Usage and Examples

There are three functions in the core function library (see XPath 2.0 Programmer's Reference, Chapter 10) that use regular expressions: matches(), replace(), and tokenize(). These are used as follows:

Function	Purpose
matches()	Tests whether a string matches a given regular expression
replace()	Replaces the parts of a string that match a given regular expression with a different string
tokenize()	Splits a string into a sequence of substrings, by finding occurrences of a separator that matches a given regular expression
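The three functions can be compared side by side with a sketch like this one. The template name main, the wrapper element names, and the sample strings are illustrative assumptions. The select attribute is not an attribute value template, so curly braces are written singly.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <results>
      <!-- true: the string has the form 999-AAAA-99 -->
      <matches>
        <xsl:value-of
            select="matches('123-ABCD-45', '^[0-9]{3}-[A-Z]{4}-[0-9]{2}$')"/>
      </matches>
      <!-- 'too many spaces': each run of whitespace becomes one space -->
      <replace>
        <xsl:value-of select="replace('too    many   spaces', '\s+', ' ')"/>
      </replace>
      <!-- 12|13|14: three tokens, joined here with a vertical bar -->
      <tokenize>
        <xsl:value-of select="tokenize('12, 13,14', ',\s*')" separator="|"/>
      </tokenize>
    </results>
  </xsl:template>

</xsl:stylesheet>
```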

There are many ways to use the XPath regex functions in an XSLT stylesheet. For example, you might write a template rule that matches customers with a customer number in the form 999-AAAA-99 (this might be the only way, for example, that you can recognize customers acquired as a result of a corporate takeover). Write this as:

  <xsl:template   match="customer[matches(cust-nr, '^[0-9]{3}-[A-Z]{4}-[0-9]{2}$')]">

There is no need to double the curly braces in this example. The match attribute of <xsl:template> is not an attribute value template, so curly braces have no special significance.

The <xsl:analyze-string> instruction is more powerful (but also more complex) than any of these three functions. In particular, none of the three XPath functions can produce new elements or other nodes. The <xsl:analyze-string> instruction can do so, which makes it very useful when you want to find a non-XML structure in the source text (for example, the comma-separated list of numbers mentioned earlier) and convert it into an XML representation (a sequence of elements, say). This is sometimes called up-conversion.

There are two main ways of using <xsl:analyze-string> , which I will describe as single-match and multiple-match applications. I shall give an example of each.

A Single-Match Example

In the single-match use of <xsl:analyze-string>, a regex is supplied that is designed to match the entire input string. The purpose is to extract and process the various parts of the string using the captured groups. This is all done within the <xsl:matching-substring> child element, which is only invoked once. The <xsl:non-matching-substring> element is used only to define error handling, to deal with the case where the input doesn't match the expected format.

For example, suppose you want to display a date as 13th March 2005, with the «th» set as an italic superscript. To achieve this, you need to generate the output «13<sup><i>th</i></sup> March 2005» (or rather, text nodes and element nodes corresponding to this serial XML representation). You can achieve the basic date formatting using the format-date() function described in Chapter 7, but to add the markup you need to post-process the output of this function.

Here is the code:

  <xsl:analyze-string
      select="format-date(current-date(), '[Do] [MNn] [Y]')"
      regex="^([0-9]+)([a-z]+)(.*)$">
    <xsl:matching-substring>
      <!-- group 1: day number; group 2: ordinal suffix; group 3: the rest -->
      <xsl:value-of select="regex-group(1)"/>
      <sup><i><xsl:value-of select="regex-group(2)"/></i></sup>
      <xsl:value-of select="regex-group(3)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <!-- unexpected format: output the date unchanged -->
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>

Note that the regex is anchored (it starts with «^» and ends with «$») to force it to match the whole input string. Unlike regular expressions used in the pattern facet in XML Schema, a regex used in the <xsl:analyze-string> instruction is not implicitly anchored.

In this example I chose in the <xsl:non-matching-substring> to output the whole date as returned by format-date() , without any markup. This error might occur, for example, because the stylesheet is being run in a locale that uses an unexpected representation of ordinal numbers. The alternative would be to call <xsl:message> to report an error and perhaps terminate.
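A sketch of that alternative is shown below: the same single-match design, but with the <xsl:non-matching-substring> branch reporting the problem and stopping the transformation. The template name main, the variable name formatted, the <p> wrapper, and the message wording are assumptions for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <xsl:variable name="formatted"
        select="format-date(current-date(), '[Do] [MNn] [Y]')"/>
    <p>
      <xsl:analyze-string select="$formatted"
                          regex="^([0-9]+)([a-z]+)(.*)$">
        <xsl:matching-substring>
          <xsl:value-of select="regex-group(1)"/>
          <sup><i><xsl:value-of select="regex-group(2)"/></i></sup>
          <xsl:value-of select="regex-group(3)"/>
        </xsl:matching-substring>
        <!-- Reached only if the formatted date does not have the
             expected shape; report it and stop -->
        <xsl:non-matching-substring>
          <xsl:message terminate="yes">
            <xsl:text>Unexpected date format: </xsl:text>
            <xsl:value-of select="."/>
          </xsl:message>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Whether to terminate or merely warn is a design choice: terminating makes the problem impossible to overlook, while outputting the unformatted date keeps the transformation usable in an unexpected locale.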

A Multiple-Match Example

In a multiple-match application, you supply a regular expression that will match the input string repeatedly, breaking it into a sequence of substrings. There are two main ways you can design this:

Match the parts of the string that you are interested in. For example, the regex «[0-9]+» will match any sequence of consecutive digits, and pass it to the <xsl:matching-substring> element to be processed. The characters that separate groups of digits are passed to the <xsl:non-matching-substring> element, if there is one (you might choose to ignore them completely).

There is a variant of this approach that is useful where there are no separators as such. For example, you might be dealing with a format such as the one used for ISO 8601 durations, which look like this: «PT12H30M10S», with the requirement to split out the components «12H», «30M», and «10S». The regex «[0-9]+[A-Z]» will achieve this, passing each component to the <xsl:matching-substring> element in turn.
Match the separators between the parts of the string that you are interested in. For example, if the string uses comma as a separator, the regex «,\s*» will match any comma followed optionally by spaces. The fields that appear between the commas will be passed, one at a time, to the <xsl:non-matching-substring> element, while the separators (if you want to look at them at all) are passed to the <xsl:matching-substring> element.
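The second design, matching the separators, might be sketched as follows for a comma-separated list of numbers. The template name main, the sample input, and the element names <list> and <item> are assumptions for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <list>
      <!-- The regex matches each separator: a comma plus any
           whitespace that follows it -->
      <xsl:analyze-string select="'12, 13,14'" regex=",\s*">
        <!-- The fields between separators arrive here one at a time;
             there is no xsl:matching-substring, so the separators
             themselves are discarded -->
        <xsl:non-matching-substring>
          <item><xsl:value-of select="."/></item>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </list>
  </xsl:template>

</xsl:stylesheet>
```

This should produce three <item> elements containing 12, 13, and 14. One point to bear in mind with this design is that an empty field (two adjacent commas) produces no <item> at all, because a nonmatching substring is never zero-length.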

The following example blends these techniques. It analyzes an XPath expression and lists all the variable names that it references. The regex chosen is one that matches things you're interested in (the variable names) but it also uses parentheses to provide access to a captured group from which the leading «$» sign is left out. It's not an industrial-quality solution to this problem; for example, it doesn't try to ignore the content of comments and string literals. But it does allow for the fact that a space between the «$» sign and the variable name is permitted. You can extend it to handle these extra challenges if you like. (And if you are really keen, you can extend it to extract the namespace prefix from the variable name, and look up the corresponding namespace URI using the namespace-uri-for-prefix() function.)

  <xsl:analyze-string select="$param" regex="\$\s*(\i\c*)">
    <xsl:matching-substring>
      <ref><xsl:value-of select="regex-group(1)"/></ref>
    </xsl:matching-substring>
  </xsl:analyze-string>

Note in this example that a «$» sign used to represent itself must be escaped using a backslash, and that we are taking advantage of the rather specialized regex constructs «\i» and «\c» to match an XML name. The output is a sequence of <ref> elements containing the names of the referenced variables.