Regular Expression Syntax | The Official XMLSPY Handbook

XML Schema regular expressions are used with the XML Schema pattern facet to restrict a string of characters to values that conform to a defined pattern. As you may recall from the discussion of XML Schemas in Chapter 5, the pattern facet is a constraint on the value space of a data type, achieved by constraining the lexical space to literals that match a specific pattern. Patterns are expressed by means of a regular expression, which is the subject of this appendix.

Regular expressions are not a programming language; rather, they are a concise syntax for describing character patterns, and they have been implemented by numerous languages and tools, including any validating XML Schema processor. The pattern-matching syntax is, for the most part, borrowed from the Perl scripting language: an XML Schema regular expression is essentially a Perl regular expression (or simply regex for short) with a few minor omissions and modifications. Perl stands for Practical Extraction and Report Language, and it is a great language for doing just that!

The following discussion pertains to the use of regular expressions in conjunction with validating instance documents against XML Schemas. However, many other freely available tools and APIs support the use of regular expressions, including Python, and the Unix programs grep (Global Regular Expression Print), sed (Stream EDitor), and egrep (Extended GREP).

Matching individual characters

If I use the value xmlspy as a regular expression to restrict a string element, I am indicating to the XML Schema validator that the element or attribute in question must exactly match the specified pattern of the characters x, m, l, s, p, y, consecutively and in the specified case. Therefore, the smallest unit of a pattern is a single character, and a regular expression consists of a pattern of characters. Of course, in this example, I pre-specified exactly what the pattern value had to be equal to. A more complex regular expression is constructed using metacharacters, which are placeholders for a specified range of character values. Table C-1 shows a listing of metacharacters.

Table C-1: METACHARACTERS FOR MATCHING INDIVIDUAL CHARACTER VALUES
Syntax Form	Description	Example	Possible Example Values
.	Any single character (except the newline or return character)	xmlspy.	xmlspyA, xmlspy4, xmlspy!
[...]	Matches any character inside the square brackets; uses a dash to indicate a range, multiple ranges are permitted	xmlspy[1-5]	xmlspy1, xmlspy2 ... xmlspy5
[^...]	Matches any character except those listed within the brackets	xmlspy[^6-9 ]	xmlspy1, xmlspy2 ... xmlspy5
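
The rows of Table C-1 can be tried out with a short Python sketch. Python's re module is a different regular expression dialect from the one used by XML Schema, so this is only an approximation; for these simple patterns the two should agree. The helper name matches is my own invention, and re.fullmatch is used because a pattern facet must match the entire value.

```python
import re

def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True if the WHOLE value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

# "." stands for exactly one character
print(matches(r"xmlspy.", "xmlspyA"))        # True
print(matches(r"xmlspy.", "xmlspy"))         # False: "." needs a character

# [1-5] accepts one character in the range 1 through 5
print(matches(r"xmlspy[1-5]", "xmlspy3"))    # True
print(matches(r"xmlspy[1-5]", "xmlspy6"))    # False

# [^6-9 ] accepts one character that is neither 6 through 9 nor a space
print(matches(r"xmlspy[^6-9 ]", "xmlspy2"))  # True
print(matches(r"xmlspy[^6-9 ]", "xmlspy7"))  # False
```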

Metacharacters are used as wildcards to specify permissible permutations inside a pattern. The period (.) character has special meaning, representing a wildcard for any single character. To explicitly specify that an actual period character should appear, you need to use an escape character, which is discussed later in this appendix. Inside a set of square brackets, you can list all the allowable characters. For example, [12345] indicates that 1, 2, 3, 4, or 5 are acceptable values. You can express this more concisely using the hyphen (-) character to specify a range of ordinal values, as in [1-5]. The hyphen only has a special meaning if there is a character on each side of it; a hyphen placed first or last inside the brackets is taken literally. You can indicate multiple ranges by writing them one after another, with no separator. For example, [0-9A-Za-z] satisfies any alphanumerical character. Do not separate the ranges with commas: a comma inside the brackets is simply one more allowable character. Finally, you can use the caret character (^) immediately after the opening bracket to denote exception and to specify the set of invalid characters. Some additional examples are provided in Table C-2.

Table C-2: EXAMPLES OF CHARACTER RANGES
Example	Possible Values	Illegal Values	Notes
[0-9A-Za-z]	3, t, Q	#, %, $	Any number or letter
[-cde]	-, c, d, e	a, b, f	A leading dash has no special meaning
[cde-]	c, d, e, -	a, b, f	A trailing dash has no special meaning
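
As a rough check of how ranges and literal dashes behave inside square brackets, the sketch below uses Python's re module, which treats these simple character classes the same way as far as I am aware. It is not an XML Schema validator, and the helper name matches is hypothetical.

```python
import re

def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True if the WHOLE value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

# Several ranges are written back to back, with no separator
alnum = r"[0-9A-Za-z]"
print(matches(alnum, "3"), matches(alnum, "t"), matches(alnum, "Q"))  # True True True
print(matches(alnum, "#"))                                            # False

# A comma inside the brackets is just another allowed character
print(matches(r"[0-9,A-Z,a-z]", ","))                                 # True

# A dash at the start or end of the brackets is a literal dash
print(matches(r"[-cde]", "-"), matches(r"[cde-]", "-"))               # True True
print(matches(r"[-cde]", "a"))                                        # False
```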

Quantifiers

The metacharacters specified in the previous section act as placeholders for a single character only; a bracket expression such as [1-5] cannot, by itself, match two of the values in its range. Metacharacters are therefore often used in conjunction with quantifiers, which allow you to explicitly specify the number of times a particular character or metacharacter should appear. Quantifiers are listed in Table C-3.

Table C-3: QUANTIFIERS FOR SPECIFYING CHARACTER OCCURRENCES
Syntax Form	Description	Example	Possible Example Values
?	The preceding character may appear zero or one times	abc?	abc, ab
*	The preceding character may appear zero or more times	abc*	ab, abc, abcccccc
+	The preceding character may appear one or more times	abc+	abc, abccccc
{x}	Matches the preceding element x times, where x is a positive integer value	abc{3}	abccc
{x,y}	Matches the preceding element at least x times, but not more than y times	abc{3,5}	abccc, abcccc, abccccc
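
The quantifiers in Table C-3 can be exercised with the following Python sketch. Python's re module is used here only as a convenient stand-in for an XML Schema validator, and the helper name matches is hypothetical. Note that in each pattern the quantifier applies only to the character c that precedes it.

```python
import re

def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True if the WHOLE value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

# ?  zero or one occurrence of the preceding character
print(matches(r"abc?", "ab"), matches(r"abc?", "abc"))        # True True
print(matches(r"abc?", "abcc"))                               # False

# *  zero or more occurrences
print(matches(r"abc*", "ab"), matches(r"abc*", "abcccccc"))   # True True

# +  one or more occurrences
print(matches(r"abc+", "ab"), matches(r"abc+", "abccccc"))    # False True

# {3}  exactly three occurrences
print(matches(r"abc{3}", "abccc"), matches(r"abc{3}", "abcc"))  # True False

# {3,5}  at least three and at most five occurrences
for value in ["abcc", "abccc", "abcccc", "abccccc", "abcccccc"]:
    print(value, matches(r"abc{3,5}", value))
```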

Escape characters

So far we’ve been using various special characters such as asterisks, periods, dashes, braces, and so on, to specify regular expression patterns. The question naturally arises: How do you match an actual occurrence of one of these special characters? The answer is to use an escape character: a backslash (\) placed in front of a special character removes its special meaning, so that it is treated as a normal character instead of as part of a metacharacter expression. The backslash is also used to write characters that are hard to type directly, such as tabs and newlines. Escape characters have the unfortunate side effect of making your regular expression rather unreadable, but they are an important aspect of properly using regular expressions. Table C-4 lists the most commonly used escapes.

Table C-4: ESCAPING SPECIAL CHARACTERS
Expression	Description
\s	Any whitespace (tab, space, etc)
\n	Newline character
\r	Return character
\t	Tab character
\\	Backslash character
\|	Vertical bar
\.	Period character
\-	Dash character
\^	Caret character
\?	Question mark character
\*	Asterisk character
\+	Addition sign character
\{	Opening brace character
\}	Closing brace character
\[	Opening bracket character
\]	Closing bracket character
\(	Opening parenthesis character
\)	Closing parenthesis character
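
To see the effect of escaping, the sketch below compares a few patterns with and without the backslash. It relies on Python's re module, which handles these particular escapes in the same way as far as I know; the helper name matches and the sample values are made up for illustration. Python raw strings (r"...") are used so that each backslash reaches the regular expression engine unchanged.

```python
import re

def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True if the WHOLE value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

# Unescaped, the period is a wildcard; escaped, it is a literal period
print(matches(r"3.14", "3x14"))    # True
print(matches(r"3\.14", "3x14"))   # False
print(matches(r"3\.14", "3.14"))   # True

# Unescaped, + is a quantifier; escaped, it is a literal plus sign
print(matches(r"a+b", "aab"))      # True
print(matches(r"a\+b", "a+b"))     # True
print(matches(r"a\+b", "aab"))     # False

# Escaped vertical bar, parentheses, and backslash are matched literally
print(matches(r"yes\|no", "yes|no"))     # True
print(matches(r"\(c\)", "(c)"))          # True
print(matches(r"C:\\temp", "C:\\temp"))  # True: the value is C:\temp
```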

Alternation

Alternation is equivalent to the logical OR operation, and can be used to specify an enumeration of patterns, one of which must be satisfied. Alternation uses the vertical bar symbol (|) to indicate the logical OR operator. As an example, suppose you wanted a pattern that allowed for both the U.S. spelling (center) and the British spelling (centre) of the same word. The following regular expression allows for both spellings: cent(re|er). Parentheses delimit a group, which contains the list of available choices separated by the OR character.
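
The cent(re|er) pattern can be checked with a small Python sketch. As before, Python's re module stands in for an XML Schema validator, and the helper name matches is hypothetical.

```python
import re

def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True if the WHOLE value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

pattern = r"cent(re|er)"

print(matches(pattern, "centre"))   # True: British spelling
print(matches(pattern, "center"))   # True: U.S. spelling
print(matches(pattern, "cent"))     # False: one of the choices is required
print(matches(pattern, "centrer"))  # False: only one choice may be used
```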

Differences compared to Perl regular expressions

If you are already familiar with Perl regular expressions, be advised that although the syntax for pattern matching in XML Schema is very similar, there are a few noteworthy exceptions. The first major difference is that XML Schema patterns are meant to match an entire string, and not substrings. In contrast, Perl is typically used to search through text, possibly across many documents, for any portion that matches the pattern. Because the whole value must match, anchor characters such as ^ and $ are not used at the beginning and end of an expression to indicate the beginning or end of the string (although the ^ symbol is still used inside square brackets as the exception character). Another important difference is that zero-width assertions (lookahead and lookbehind) and backreferences are not permitted.
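
The difference between searching for a substring and matching an entire string can be illustrated in Python, whose re module offers both behaviors. This is only an analogy: re.search behaves roughly like an unanchored Perl match, while re.fullmatch approximates what a pattern facet requires. The helper names contains and matches_whole are hypothetical.

```python
import re

def contains(pattern: str, text: str) -> bool:
    """Perl-style search: True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

def matches_whole(pattern: str, text: str) -> bool:
    """Pattern-facet style: True only if the entire text matches."""
    return re.fullmatch(pattern, text) is not None

text = "I use xmlspy daily"

print(contains(r"xmlspy", text))          # True: found as a substring
print(matches_whole(r"xmlspy", text))     # False: extra text around it
print(matches_whole(r"xmlspy", "xmlspy")) # True: the whole value matches

# With a search, anchors are needed to get the whole-string behavior
print(contains(r"^xmlspy$", text))        # False
print(contains(r"^xmlspy$", "xmlspy"))    # True
```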