Basic Expression Syntax | Professional .NET Framework 2.0 (Programmer to Programmer)

Regular expressions are simply arrangements of text that describe a pattern to match within some input text. They form a mini-language unto themselves. The purpose of an expression might be to extract specific pieces of information captured through matching an expression, to validate that some text is formatted correctly, or even just to verify the presence of a pattern within a larger body of text. Regular expression patterns can perform these operations in just a few lines of code; ordinary imperative constructs can often do the same job but usually take tens or even hundreds of lines to implement.

Simple variants of regular expressions pop up everywhere. The MS-DOS command 'dir *.cs' employs a very simple pattern matching technique, much like what regular expressions can do. The asterisk followed by the period and the two letters "cs" is a simple wildcard pattern. This instructs the dir program to list files in the current directory but to filter out any that do not end in .cs. While this is not a real, valid regular expression, the syntax and concepts are quite similar.

Regular expressions consist of meta-characters, symbols which are interpreted and treated in a special manner by the expression matcher, and literals, which are characters that match input literally without interpretation by the matcher. There are many variants of meta-characters, such as wildcards (a.k.a. character classes) and quantifiers, and most are represented with either a single or escaped character. Figure 13-1 shows a graphical depiction of some of these concepts, using the pattern (abc)*.{1,10}([0-9]|\1)? as an example.

Figure 13-1: Expression syntax highlights.

Realize also that when the regular expression library attempts to match some input — unless noted otherwise — the term "success" is used to mean that at least part of the input text is matched by an expression. Thus, unless we are explicit about doing so, a match does not indicate that the entire input string matched the expression. This is important for validation scenarios. You will see how to use the latter technique (matching the entire input) through the use of positional meta-characters.
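To make the distinction concrete, here is a minimal sketch of a partial match versus a whole-input match. The class and method names are illustrative only, and the ^ and $ anchors it relies on are the positional meta-characters covered later in this section.

```csharp
using System;
using System.Text.RegularExpressions;

static class PartialMatchDemo
{
    // Succeeds if the pattern matches anywhere inside the input.
    public static bool ContainsDigits(string input) => Regex.IsMatch(input, @"\d+");

    // Anchored with ^ and $, so letters before or after the digits cause failure.
    public static bool IsAllDigits(string input) => Regex.IsMatch(input, @"^\d+$");

    static void Main()
    {
        Console.WriteLine(ContainsDigits("abc123")); // True: "123" matches part of the input
        Console.WriteLine(IsAllDigits("abc123"));    // False: "abc" is not digits
        Console.WriteLine(IsAllDigits("123"));       // True
    }
}
```

For validation scenarios the anchored form is usually the one you want.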

This section will introduce the regular expression grammar for pattern matching, in particular drilling deep into the rich support for a variety of meta-characters. But first, we'll look at a set of brief examples to help illustrate some fundamental concepts. This should also give you a better idea of where regular expressions might come in handy.

Some (Simple) Pattern Examples

Before diving into the specifics of regular expressions, it might be useful to consider a few examples. This should help you to understand the purpose and syntax of regular expressions before learning the intricate details. Each sample will show you a pattern to parse some simple input. If you are unfamiliar with some of the constructs demonstrated, this is to be expected. Just focus on what the purpose of the expression is and come back as you learn new concepts to analyze in detail how they work. If you're already familiar with regular expressions, this section will probably be redundant. Moreover, the examples shown are admittedly very simple.

Matching Social Security Numbers

Say you were accepting input from a web page where the user had to enter their U.S. Social Security number (SSN). Of course, SSNs have very precise formatting requirements, so you would probably want to validate this input. Specifically, they consist of three digits, followed by a dash, followed by two digits, followed by a dash, followed by four digits. You could write this down much more concisely as a pattern nnn-nn-nnnn, where each n is any digit from 0 through 9. In fact, this is exactly what regular expressions enable you to do: they enable you to concisely capture textual formatting details within a computer-readable expression.

This brings us to our first incarnation of a regular expression for SSNs:

 [0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]

The characters between each set of brackets constitute a custom character class (we will explore these in more depth later). The class consists of an enumeration of characters or, as in this example, a range. The class [0-9] matches any character between '0' and '9', inclusively. "Between" in this example is determined by the numerical index of the target character set.

The presence of such classes says to the regular expression matcher that it should expect to find any one character in the class. If we pass in "011-01-0111" it will match this pattern successfully, whereas something like "abc-01-0111" will fail, for instance, because the letters 'a', 'b', and 'c' all fall outside of the character class range. If any part of the expression fails to match, the entire expression will report failure — so even a missing dash, for example, will cause a match to fail.

We can use quantifiers to make this expression a bit more concise:

 [0-9]{3}-[0-9]{2}-[0-9]{4}

The curly braces modify the expression preceding them and indicate that the pattern should match exactly n occurrences of it, where n is the number found between the curly braces. With this pattern, we describe the expression just as we would when explaining SSN formatting to another human being, and indeed it is very similar to the way we described it in the opening paragraph.

Another slight modification can make this pattern simpler yet. There is a set of prebuilt character classes available, one of which matches any digit. The presence of \d is nearly the same as the custom class [0-9]; it is not precisely the same because of some Unicode complexities that arise (it's better!), which are detailed further in the section on character classes later in this chapter. The following expression takes advantage of the \d class:

 \d{3}-\d{2}-\d{4}

To quickly introduce you to some of the System.Text.RegularExpressions APIs, the following code uses the Regex class to check a couple of strings for SSN pattern matches:

 Regex ssn = new Regex(@"\d{3}-\d{2}-\d{4}");
 bool isMatch1 = ssn.IsMatch("011-01-0111");
 bool isMatch2 = ssn.IsMatch("abc-01-0111");
 // isMatch1 == true, while isMatch2 == false...

In this example, the first call to IsMatch will return true, while the second returns false. We'll explore the available APIs further later in this chapter. The APIs do not get terribly more complicated than this example, although they do permit some interesting operations like extracting matching text, compilation, and replacement, among other features. For now, however, let's continue to explore some other simple samples.

Matching E-Mail Addresses

As with the SSN example from above, often you will want to validate or extract bits of an e-mail address provided by a user. While this can become extraordinarily complex if you are trying to cover all cases (e.g., when implementing RFC 822 completely, the pattern for which can be thousands of characters), a pattern such as the following will be suitable for many cases:

 \w+@(\w+\.)+\w+

\w is yet another prebuilt character class that matches any "word" character, meaning alphanumeric characters (and the underscore) both in the English and upper-Unicode character set. This pattern matches at least one word character followed by an '@' followed by a set of '.'-delimited words. Before the '@' is the user account, while the part after is the domain name.
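As a rough sketch of how this pattern behaves on a few inputs (the class name, method names, and sample addresses are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

static class EmailDemo
{
    static readonly Regex Email = new Regex(@"\w+@(\w+\.)+\w+");

    public static bool LooksLikeEmail(string input) => Email.IsMatch(input);

    // Returns the text that actually matched, or "" if nothing matched.
    public static string MatchedText(string input) => Email.Match(input).Value;

    static void Main()
    {
        Console.WriteLine(LooksLikeEmail("joe@example.com")); // True
        Console.WriteLine(LooksLikeEmail("joe@localhost"));   // False: no '.'-delimited domain
        // \w does not match '.', so only part of the address is matched:
        Console.WriteLine(MatchedText("first.last@example.com")); // last@example.com
    }
}
```

The last call shows that IsMatch alone would report success even though the pattern covered only part of the address.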

While this is an extremely simplified example compared to what might be required with a robust e-mail address parser, it brings to light the fact that "good enough" is very subjective and scenario-specific. Implementing a pattern that is 100% correct is often neither feasible nor worth the investment to try. In such cases, you should aim to avoid rejecting valid input while still reducing the number of corner cases a pattern must deal with. As stated, the decision here is very subjective; there are plenty of cases where preciseness and completeness are paramount. For example, an e-commerce web site might not need to be extraordinarily robust when dealing with e-mail address validation, but an e-mail client or SMTP routing software application would certainly need to be more precise.

Matching Money

Consider a situation where you wanted to parse a string that contains a quantity of money in a U.S. dollar denomination. This would be represented (typically) as a dollar sign '$', followed by at least one digit (the dollar part), followed by a period '.', followed by two digits (the cents part). Well, as you might expect, the pattern for this is as easy as transcribing the description in the preceding sentence into a regular expression:

 \$\d+\.\d{2}

This successfully matches the following input: "$0.99", "$1000000.00", and so on. (Notice that we had to escape both the '$' and '.' symbols with a backslash. This is because both characters are otherwise interpreted as special meta-characters by the regular expression parser. We want both to be interpreted literally; escaping them does exactly that.) There is one important thing that we left out of our pattern: In the United States, people will separate groups of three dollar digits using the comma character ','. It would be nice if our expression supported this. Indeed, it's quite simple to do:

 \$(\d{1,3},)*\d+\.\d{2}

This uses some concepts that will be introduced later, but really it isn't hard to break apart into chunks that are easy to understand. We know that the \$ part simply matches the literal character '$'. We haven't seen the grouping feature yet, however, which is what the set of parentheses does: (\d{1,3},) means to match a sequence of between one and three digits followed by a comma. Following it with an asterisk, as in (\d{1,3},)*, indicates that the matcher should match this expression anywhere from zero to an unbounded number of times. So, for example, it would match each of these: "000,", "000,000,", "000,000,000,000,000,", and so on. This is then followed by at least one digit. And then the rest is the same: there must be two digits for the cents part.

Note

Notice that this pattern isn't entirely correct. If somebody decides to use comma-separated digits, then we probably want to ensure that they are consistent. The above pattern would match input such as "$1,000000.00", for example. We also want to ensure that there are three digits in the group just prior to the period.

This is a recurring theme with regular expressions: it is often easy to get 90% of the way there, but the remaining 10% of the special cases result in tedious pattern refactoring. It's easy to get lost in this. Sometimes living with the 90% pattern is good enough. However, if you really need to guarantee that input is parsed correctly in the above example, you would have to do something along the lines of \$(((\d{1,3},)+\d{3})|\d+)\.\d{2}.
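The difference between the two patterns can be checked with a short sketch like the following. The names Loose and Strict are labels invented here for the simpler and the refactored pattern; the patterns themselves are the ones given above.

```csharp
using System;
using System.Text.RegularExpressions;

static class MoneyDemo
{
    // The "90%" pattern and the stricter refactoring from the text.
    static readonly Regex Loose = new Regex(@"\$(\d{1,3},)*\d+\.\d{2}");
    static readonly Regex Strict = new Regex(@"\$(((\d{1,3},)+\d{3})|\d+)\.\d{2}");

    public static bool IsLooseMatch(string input) => Loose.IsMatch(input);
    public static bool IsStrictMatch(string input) => Strict.IsMatch(input);

    static void Main()
    {
        foreach (string s in new[] { "$1,000,000.00", "$1000000.00", "$1,000000.00" })
        {
            // Only the inconsistently grouped last input is rejected by Strict.
            Console.WriteLine("{0}: loose={1}, strict={2}",
                s, IsLooseMatch(s), IsStrictMatch(s));
        }
    }
}
```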

Note that this pattern also doesn't handle internationalization issues. Alternative group separators, currency characters, decimal characters, and so on are used all over the world. Attempting to take all of these factors into account can drive you insane. Thankfully, if you need a robust parser for money, you should take a look at Decimal.Parse and related APIs. The Framework developers have already done the hard work for you.

Incidentally, you can get even fancier than just using the Boolean IsMatch method. You can use captured groups to mark specific pieces of the pattern that allow you to extract corresponding matching text later on. For example, consider the slightly modified pattern:

 \$((\d{1,3},)*\d+)\.(\d{2})

Notice that we added a few more parentheses to wrap interesting groups of text. If you use this pattern instead, you can then extract specific dollar and cents numbers from a piece of matching text after doing a match:

 Regex r = new Regex(@"\$((\d{1,3},)*\d+)\.(\d{2})");
 Match m = r.Match("$1,035,100.99");
 Console.WriteLine("Dollar part: {0}, cents: {1}",
     m.Groups[1].Captures[0].Value, m.Groups[3].Captures[0].Value);

Groups and capturing are very powerful features, detailed later on in the section on meta-characters.

Literals

Regular expressions are made up of literals and meta-characters. Literals are characters interpreted as they are, whereas meta-characters have special meaning and instruct the pattern matcher to take very precise actions. You can match on literal sequences of characters both in combination with or independent of using the special regular expression meta-characters, which are detailed in depth below.

As an example of a completely literal expression, consider the pattern Washington. As you might imagine, this matches any pieces of text that contain the literal text "Washington". You can modify literals using the meta-characters described below, for example to specify alternations, quantifiers, groups, and so on.

Mixing Literals with Meta-Characters

Most patterns mix literals with meta-characters that don't modify the literal. For example, the expression January 1, [0-9]{4} will match any occurrence of "January 1, " followed by a four-digit number. Because meta-characters are usually interpreted in a special way by the matcher, you can ask the parser to treat them as opaque literals by escaping them. This is done simply by preceding the character with a backslash (i.e., \). This instructs the parser that the meta-character following the backslash is to be treated literally. The pattern January 1, \[0-9]\{4} will only match the precise literal string "January 1, [0-9]{4}", for instance.

Notice in this example that the corresponding closing meta-character ] didn't have to be escaped explicitly. In other words, we used a backslash preceding [ but not ]. This is because the meta-character ] will only be treated specially if the parser knows it is closing a character class grouping, for example. The only characters that require escaping to prevent meta-character interpretation are: *, +, ?, ., |, {, [, (, ), \, $, and ^.
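A small sketch of the escaping behavior described above follows. It also uses Regex.Escape, a static .NET method not mentioned in the text, which produces an escaped form of a literal string; the class and method names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    public const string Literal = "January 1, [0-9]{4}";

    // Interprets [0-9]{4} as meta-characters: four digits.
    public static bool MatchesAsPattern(string input) =>
        Regex.IsMatch(input, "January 1, [0-9]{4}");

    // Hand-escaped: [ and { are treated as ordinary characters.
    public static bool MatchesAsLiteral(string input) =>
        Regex.IsMatch(input, @"January 1, \[0-9]\{4}");

    // Lets the library do the escaping instead.
    public static bool MatchesEscaped(string input) =>
        Regex.IsMatch(input, Regex.Escape(Literal));

    static void Main()
    {
        Console.WriteLine(MatchesAsPattern("January 1, 2005")); // True
        Console.WriteLine(MatchesAsLiteral("January 1, 2005")); // False
        Console.WriteLine(MatchesAsLiteral(Literal));           // True
        Console.WriteLine(MatchesEscaped(Literal));             // True
    }
}
```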

Standard Escapes

Lastly, there is a set of available escapes that you can use to match characters that are difficult to represent literally, shown in the table below.

Escape	Character #	Description
\0	\u0000 (0)	Null
\a	\u0007 (7)	Bell
\b	\u0008 (8)	Backspace
\t	\u0009 (9)	Tab
\v	\u000B (11)	Vertical tab
\f	\u000C (12)	Form feed
\n	\u000A (10)	New line
\r	\u000D (13)	Carriage return
\e	\u001B (27)	Escape

The \b escape is only valid to represent a backspace character within character groups. When found in any other location, it is treated as a word-boundary meta-character. This feature is described later in this chapter.

There are also pattern-oriented escape sequences, enabling you to provide more information to identify the target match than just a single character following the backslash. They are shown in the following table.

Escape	Description
\u[nnnn]	Matches a single character. [nnnn] is the character's Unicode index in hexadecimal representation. For example, \u004A matches the letter 'J'.
\x[nn]	Matches a single character. [nn] is the character's index in hexadecimal representation. For example, \x4A matches the letter 'J'.
\0[nn]	Matches a single character. [nn] is the character's index in octal representation. For example, \012 matches the newline character.
\c[n]	Matches an ASCII control character n. For example, Control-H represents an ASCII backspace (i.e., \b). \cH matches this character.

For the hexadecimal and octal escapes, the sequences are limited to representing characters from the ASCII character set. In fact, the octal escape is limited to characters in the range of 000–077 (decimal numbers 0–63). This is because any escape beginning with a nonzero number will always be interpreted by the expression parser as a group back reference, a meta-character construct outlined further in the text below.
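The escapes in the two tables above can be exercised with a short sketch like this one. The helper name Matches is invented for the example; note that the inputs use ordinary C# string escapes, while the patterns are verbatim strings so that the regular expression parser sees the backslashes.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharEscapeDemo
{
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    static void Main()
    {
        Console.WriteLine(Matches("J", @"\u004A"));  // True: Unicode escape
        Console.WriteLine(Matches("J", @"\x4A"));    // True: hexadecimal escape
        Console.WriteLine(Matches("\n", @"\012"));   // True: octal 012 is decimal 10
        Console.WriteLine(Matches("\b", @"\cH"));    // True: Control-H is backspace
        Console.WriteLine(Matches("\b", @"[\b]"));   // True: backspace inside a class
        Console.WriteLine(Matches("\b", @"\b"));     // False: word boundary outside a class
    }
}
```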

You should notice that most of the expression escapes are very similar to the support that the C# language has for escaping character literals. Interestingly, you don't want to use the language support for escapes, but rather you want the expression to interpret them. This means that you must either double escape them (that is, use \\ instead of \) or preferably use the C# syntax to avoid this problem by prefixing a string literal with the @ character. For example, consider this code:

 Regex r1 = new Regex(".*\n.*");
 Regex r2 = new Regex(".*\\n.*");
 Regex r3 = new Regex(@".*\n.*");

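The first expression is probably not what was intended. It embeds a true newline right in the middle of the expression rather than the \n escape. A raw newline happens to match a newline in the input, but it would be silently ignored if the IgnorePatternWhitespace option were turned on, and the same habit fails outright with an escape such as \b, which C# turns into a backspace character. The second and third pass the escape through to the expression parser, albeit in different ways. In most cases, the last of the three will end up being the most readable, especially when building complex expressions that are likely to have multiple escapes within them. (If you choose the second approach, be prepared for some difficult-to-read lengthy expressions.)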
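A quick way to see what the expression parser actually receives is to inspect the strings themselves. This is a minimal sketch with illustrative names; the \bcat pattern is an extra example chosen here because \b means different things to C# and to the expression parser.

```csharp
using System;
using System.Text.RegularExpressions;

static class StringEscapeDemo
{
    public static int RegularLength => ".*\n.*".Length;     // C# converts \n to one character
    public static int VerbatimLength => @".*\n.*".Length;   // backslash and 'n' are both kept
    public static bool SameText => ".*\\n.*" == @".*\n.*";  // two spellings of the same string

    // "\bcat" is a backspace followed by "cat"; @"\bcat" is a word boundary followed by "cat".
    public static bool RegularBoundary(string input) => Regex.IsMatch(input, "\bcat");
    public static bool VerbatimBoundary(string input) => Regex.IsMatch(input, @"\bcat");

    static void Main()
    {
        Console.WriteLine(RegularLength);             // 5
        Console.WriteLine(VerbatimLength);            // 6
        Console.WriteLine(SameText);                  // True
        Console.WriteLine(RegularBoundary("a cat"));  // False
        Console.WriteLine(VerbatimBoundary("a cat")); // True
    }
}
```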

Meta-Characters

Meta-characters are special tokens that ask the pattern matcher to perform some special logic. You've already seen examples of some meta-characters in the introduction to this chapter. For instance, character classes (e.g., [0-9]) and quantifiers (e.g., {3}, {2}, and {4}) are two examples of different types of meta-characters. In this section, we'll take a look at the most commonly used meta-characters available. Along the way, you will see some useful examples that detail how and why you would want to use them.

Quantifiers

A quantifier just tells the matcher to match a specific number of occurrences of a particular pattern. The quantity of occurrences can be an exact number or an open-ended or bounded range.

The first quantifier available is the asterisk, that is, *. This character indicates that there can be zero or more occurrences of the pattern that it follows. So, for example, [0-9]* matches a sequence of any digits, including none. This will match an empty string (i.e., ""), "100", and a string that contains thousands of digits, for example. Notice the implication of this: if your entire pattern is modified with *, it will always match regardless of input. That's because it can interpret zero occurrences as success. This is, of course, only true if there are no surrounding bits of the expression that do not result in a match on the input.

A quantifier modifies the pattern it follows. For instance, a* matches any number of 'a's. But ab* will match only a single 'a' followed by any number of 'b's (including none). You might have expected this pattern to match a sequence of 'ab' pairs instead. To accomplish this you must group a pattern with parentheses and then modify the grouping. For example, (ab)* will successfully match a sequence of zero-to-many 'ab' pairs. Grouping is a very powerful construct, which can be composed to build up complicated expressions. For example, ((ab)*c)* will match a sequence of zero or more units, each of which is a run of 'ab' pairs followed by a 'c'. Being able to decompose complex groupings is paramount to being able to read regular expressions. Grouping is detailed later in this chapter.
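The effect of grouping on a quantifier is easiest to see by looking at the text each pattern actually matches. The following is a minimal sketch; FirstMatch is a helper invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupedQuantifierDemo
{
    // Returns the text of the first match, or "" if there is none.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    static void Main()
    {
        Console.WriteLine(FirstMatch("abbb", "ab*"));           // abbb
        Console.WriteLine(FirstMatch("ababab", "ab*"));         // ab
        Console.WriteLine(FirstMatch("ababab", "(ab)*"));       // ababab
        Console.WriteLine(FirstMatch("ababcabc", "((ab)*c)*")); // ababcabc

        // A pattern made entirely of * succeeds even with no digits present:
        Match empty = Regex.Match("xyz", "[0-9]*");
        Console.WriteLine(empty.Success);      // True
        Console.WriteLine(empty.Value.Length); // 0
    }
}
```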

The second quantifier available is the plus symbol, that is, +. This character indicates that there can be one or more occurrences of the modified pattern. Notice that this differs slightly from * in that it requires at least one match for success. It is similar, however, in that it will match an open-ended quantity. For instance, [0-9]+ will match input that has at least one number up to an unbounded amount. The question mark, ?, will match either zero or one instances of a pattern. In other words, the pattern it modifies is optional, hence the use of a question mark.

If none of these quantifiers address your needs, you can specify an exact quantity or range using the {n} meta-character. We saw some examples of this above. For example, [0-9]{8} will match a string of exactly eight digits. You may also supply a range if input will vary. Say that you wanted to match anywhere between five and eight digits: [0-9]{5,8} would do the trick. You can even leave out the upper bound if you prefer it to be open-ended. That is, [0-9]{5,} will match any string of at least five digits, for example.

Note

*, +, and ? are just syntactic sugar for the more general-purpose quantifier {x,y}. However, it should be evident from the examples above that typing and reading the special-purpose quantifiers is significantly more convenient than {0,}, {1,}, and {0,1}, respectively.
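The counted quantifiers and their shorthand equivalents can be compared with a sketch along these lines. FullMatch is a hypothetical helper that wraps the pattern in ^(?:...)$ (anchors plus a non-capturing group, both described later) so that each test applies to the whole input rather than to any part of it.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierRangeDemo
{
    // Requires the quantified pattern to cover the entire input.
    public static bool FullMatch(string input, string pattern) =>
        Regex.IsMatch(input, "^(?:" + pattern + ")$");

    static void Main()
    {
        Console.WriteLine(FullMatch("12345678", "[0-9]{8}"));    // True
        Console.WriteLine(FullMatch("1234567", "[0-9]{8}"));     // False: only seven digits
        Console.WriteLine(FullMatch("123456", "[0-9]{5,8}"));    // True
        Console.WriteLine(FullMatch("123456789", "[0-9]{5,8}")); // False: nine digits
        Console.WriteLine(FullMatch("1234567890", "[0-9]{5,}")); // True

        // The shorthand and general forms agree:
        Console.WriteLine(FullMatch("", "a*") == FullMatch("", "a{0,}"));    // True
        Console.WriteLine(FullMatch("aa", "a?") == FullMatch("aa", "a{0,1}")); // True
    }
}
```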

Character Classes

As briefly noted above, you can describe patterns to match sets of characters using either literals or character classes. A literal is useful if you have a precise idea of what should be found at a particular position within input. For example, if you need to ensure that a web site URL must end in ".com", this is easy to express via literals. However, as in the case of the SSN example at the beginning of this chapter, sometimes you only know some general set of attributes contained within the text you are matching. Common classes include characters in the English alphabet, digits, whitespace, and so on; as noted, you can construct your own custom class containing arbitrary characters.

To construct a new class, surround the characters to be considered part of it inside square brackets (i.e., [ and ]). Say you want to match on the vowels 'a', 'e', 'i', 'o', or 'u'. The pattern [aeiou] will do the job. Notice that this is an easy way to perform alternation. Using a character class causes the regular expression to match only one of the possible characters in that class. As you will see later, alternation is a more general-purpose construct that enables you to do similar things by composing several complex expressions together.

This technique works well for small sets of characters, but what if you wanted to match any character of the alphabet? It's hardly convenient to have to type all 26 characters (plus another 26 if you intend to do case-sensitive matches and need to account for uppercase letters). This is where character ranges become useful. You simply specify the beginning and ending part of a consecutive set of characters, separated by a dash. For example: [a-zA-Z]. This matches any English letter (or more precisely, anything found at or between the Unicode 'a' to 'z' or 'A' to 'Z' code-points, inclusively). Similarly, you can write [0-9], [a-zA-Z0-9], and so on. You certainly don't need to use full ranges; for example, consider this pattern, which (poorly) matches input representing the current time in 24-hour format:

 [0-2]?[0-9]:[0-6][0-9].[0-6][0-9]

Here, we only accept ranges of numbers that make sense based on the position in the time string; for example, only '0', '1', or '2' are valid in the first position.

Note

Unfortunately, this example isn't very robust. "29:62.61" is a valid time according to this pattern, which (last I checked) is certainly incorrect. There are much better ways to accomplish this, in particular by using groups and alternation. As with the case of money, I'd recommend just using the DateTime.Parse family of methods for date and time parsing. Again, the Framework developers have done the hard work for you.
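The following sketch contrasts the pattern above with one member of the DateTime parsing family. The choice of DateTime.TryParseExact, the format string "HH:mm'.'ss" (with the period quoted as a literal), and the invariant culture are assumptions made for this illustration, not something the text prescribes.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class TimeDemo
{
    // The (deliberately poor) pattern from the text, unescaped '.' included.
    public static bool RegexAccepts(string input) =>
        Regex.IsMatch(input, @"[0-2]?[0-9]:[0-6][0-9].[0-6][0-9]");

    // Assumed format: 24-hour time with a literal '.' before the seconds.
    public static bool ParserAccepts(string input) =>
        DateTime.TryParseExact(input, "HH:mm'.'ss", CultureInfo.InvariantCulture,
                               DateTimeStyles.None, out _);

    static void Main()
    {
        Console.WriteLine(RegexAccepts("29:62.61"));  // True: every digit is in range for its position
        Console.WriteLine(ParserAccepts("29:62.61")); // False: not a real time
        Console.WriteLine(ParserAccepts("23:59.59")); // True
    }
}
```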

You can also supply a negated character class. The default is to match on the characters present in the class definition. However, in some cases you might want to indicate a match should be successful only if the absence of characters in the class is detected. To indicate this, you simply use the caret, ^, as your first character in the character class definition. For example, [^0-9] matches anything but a number character, and [^<] matches anything but the '<' character. If you want '^' to be part of your character class itself, you can escape it, for example [\^0-9].
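Here is a brief sketch of custom and negated classes in use. The method names and sample strings are invented for the example; Regex.Replace is the standard replacement API.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharClassDemo
{
    // Replaces every vowel with an asterisk.
    public static string MaskVowels(string input) => Regex.Replace(input, "[aeiou]", "*");

    // Returns the first run of characters matched by the given class.
    public static string FirstRun(string input, string cls) => Regex.Match(input, cls + "+").Value;

    static void Main()
    {
        Console.WriteLine(MaskVowels("hello world"));       // h*ll* w*rld
        Console.WriteLine(FirstRun("abc123", "[^0-9]"));    // abc
        Console.WriteLine(FirstRun("text<tag>", "[^<]"));   // text
        Console.WriteLine(Regex.IsMatch("^", @"[\^0-9]"));  // True: escaped caret is a member
        Console.WriteLine(Regex.IsMatch("a", @"[\^0-9]"));  // False
    }
}
```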

Out-of-The-Box Classes

There is a set of classes that have shorthand notation because they are used so frequently:

The dot class matches any non-newline character. It is essentially just a shortcut to the character class [^\n]. For example, .* will match a variable-length sequence of non-newline characters. To use a period as a literal, clearly you must escape it.
The \w class matches any word character. This is basically [a-zA-Z0-9_] for the lower English section of the Unicode character set. It does take into account word characters from other languages as well, which would be terribly complex to write out by hand. \W (notice the uppercase W) matches nonword characters, essentially the inverse of \w.
\d will match any digit character. This is similar to the class [0-9], although it is Unicode sensitive and will match numerical characters from other parts of the Unicode character set. \D (notice that this is uppercase) is the inverse of \d and matches any nondigit character. This, too, is Unicode aware.
Lastly, \s matches any whitespace character, including newlines. This can be thought of as being equivalent to [ \f\n\r\t\v], although there are some Unicode subtleties that we won't delve into here. This class matches a space, form feed, newline, carriage return, tab, or vertical tab character. As you might expect, there's also a \S class, which matches the inverse of \s, again the uppercase version of the normal class.
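The Unicode sensitivity of these shorthand classes can be observed directly. In this sketch, U+0663 (ARABIC-INDIC DIGIT THREE) and U+00E9 (é) are sample characters picked for illustration, and RegexOptions.ECMAScript is an option not discussed in the text that restricts \d to ASCII digits.

```csharp
using System;
using System.Text.RegularExpressions;

static class ShorthandDemo
{
    public static bool Matches(string input, string pattern,
                               RegexOptions options = RegexOptions.None) =>
        Regex.IsMatch(input, pattern, options);

    static void Main()
    {
        Console.WriteLine(Matches("\u0663", @"\d"));   // True: a non-ASCII decimal digit
        Console.WriteLine(Matches("\u0663", "[0-9]")); // False
        Console.WriteLine(Matches("\u0663", @"\d", RegexOptions.ECMAScript)); // False
        Console.WriteLine(Matches("_", @"\w"));        // True: underscore is a word character
        Console.WriteLine(Matches("\u00E9", @"\w"));   // True: accented letter
        Console.WriteLine(Matches("-", @"\W"));        // True
        Console.WriteLine(Matches("\n", "."));         // False: dot skips newlines
        Console.WriteLine(Matches("\t", @"\s"));       // True
    }
}
```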

Unicode Properties and Blocks

The Unicode specification not only defines characters and code-points but also general qualities and ranges of logically grouped characters. The \p{name} character class matches any character found with either the named property or found in the named Unicode block grouping. This enables you to reference parts of the Unicode character set either by using a quality about a character or by checking that it falls within a named range. Similarly, \P{name} matches if a character does not exhibit the named quality and is not found in the named group.

For example, \p{IsArabic} will match any character from the Arabic range of characters, and \p{L} will match any letter character from the Unicode character set. For a detailed reference to the kinds of properties and blocks available, please see the Unicode specification referred to in the "Further Reading" section at the end of this chapter.

There are many properties about characters that the Unicode specification defines; here is a quick reference to the most common of them.

Escape	Description
\p{L}	Letter
\p{Lu}	Uppercase letter
\p{Ll}	Lowercase letter
\p{Lm}	Modifier letter
\p{Lo}	Other letter
\p{M}	Mark character — characters whose purpose is to modify other characters that they appear with
\p{Mn}	Nonspacing mark
\p{Mc}	Spacing combining mark
\p{Me}	Enclosing mark
\p{N}	Number
\p{Nd}	Decimal digit number
\p{Nl}	Letter-based numbers, such as Roman numerals
\p{No}	Other number
\p{Z}	Separator
\p{Zs}	Space separator
\p{Zl}	Line separator
\p{Zp}	Paragraph separator
\p{C}	Other or miscellaneous characters
\p{Cc}	Control characters
\p{Cf}	Formatting characters
\p{Cs}	Surrogate characters
\p{Co}	"Private use" characters, reserved for custom definitions
\p{Cn}	Unassigned code-point characters
\p{P}	Punctuation
\p{Pc}	Connector
\p{Pd}	Dash
\p{Ps}	Open punctuation
\p{Pe}	Close punctuation
\p{Pi}	Initial quote
\p{Pf}	Final quote
\p{Po}	Other punctuation
\p{S}	Symbol
\p{Sm}	Mathematical symbol
\p{Sc}	Currency symbol
\p{Sk}	Modifier symbol
\p{So}	Other symbol
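A few of the categories in the table above, along with the IsArabic block, can be tried out with a sketch like the following. The sample characters (U+0628 ARABIC LETTER BEH and U+20AC EURO SIGN) are chosen here for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnicodeCategoryDemo
{
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    static void Main()
    {
        Console.WriteLine(Matches("A", @"\p{Lu}"));            // True: uppercase letter
        Console.WriteLine(Matches("a", @"\p{Lu}"));            // False
        Console.WriteLine(Matches("\u0628", @"\p{IsArabic}")); // True: in the Arabic block
        Console.WriteLine(Matches("A", @"\p{IsArabic}"));      // False
        Console.WriteLine(Matches("$", @"\p{Sc}"));            // True: currency symbol
        Console.WriteLine(Matches("\u20AC", @"\p{Sc}"));       // True: euro sign
        Console.WriteLine(Matches("-", @"\p{Pd}"));            // True: dash punctuation
        Console.WriteLine(Matches("7", @"\P{L}"));             // True: not a letter
    }
}
```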

Character Class Subtraction

You can subtract a set or range of characters from a custom character class by using a feature called subtraction. This has the effect of removing specific characters from a contiguous range of characters in the class. For example, the expression [a-z-[c-f]] matches any character in the range of 'a' through 'b' or in the range of 'g' to 'z'. The subtracted class can be any valid character class as defined above, so for example the following pattern will match any consonant lowercase character: [a-z-[aeiou]].

Subtracted character classes can be defined recursively. For example, [a-z-[c-f-[de]]] will match any character between 'a' and 'z' but not 'c' or 'f'. The expression [a-z-[c-f]] would normally prevent matches on 'd' or 'e', but we "add them back" by subtracting them from the subtraction.
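Subtraction is easy to verify one character at a time. This is a minimal sketch; IsMember is an illustrative helper that anchors the class so that exactly one character must match.

```csharp
using System;
using System.Text.RegularExpressions;

static class SubtractionDemo
{
    // True if the single-character input belongs to the given class.
    public static bool IsMember(string ch, string cls) =>
        Regex.IsMatch(ch, "^" + cls + "$");

    static void Main()
    {
        Console.WriteLine(IsMember("b", "[a-z-[c-f]]"));      // True
        Console.WriteLine(IsMember("d", "[a-z-[c-f]]"));      // False: subtracted
        Console.WriteLine(IsMember("g", "[a-z-[c-f]]"));      // True
        Console.WriteLine(IsMember("d", "[a-z-[c-f-[de]]]")); // True: added back
        Console.WriteLine(IsMember("c", "[a-z-[c-f-[de]]]")); // False
        Console.WriteLine(IsMember("k", "[a-z-[aeiou]]"));    // True: consonant
        Console.WriteLine(IsMember("e", "[a-z-[aeiou]]"));    // False: vowel
    }
}
```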

Commenting Your Expressions

If you are working with lengthy expressions, readability can go downhill very quickly. For such situations, you can provide inline comments to clarify subparts of your expression to aid those attempting to interpret them. This is much like commenting your source code.

The simplest form of a comment is to use a group in the format of (?#<comment>). The parser will see the (?# meta-character beginning sequence and ignore all of the characters found within the <comment> part leading up to the next ). Alternatively, if you use the IgnorePatternWhitespace option while constructing a Regex, the matcher enables you to format your expression in a more readable manner. With this option enabled, you can specify comments that act like the // construct in programming languages such as C#. The start of a comment is indicated with the # character and causes the matcher to ignore anything up to the end of the line. Note that having IgnorePatternWhitespace turned on necessitates that you escape any literal '#'s that are contained within your pattern (otherwise, characters following it would be interpreted as comments):

 string nl = Environment.NewLine;
 Regex r = new Regex(@"(" +
     @"(?<key>\w+):  # Match the key of a key-value pair" + nl +
     @"\s*           # Optional whitespace" + nl +
     @"(?<value>\w+) # Match the value" + nl +
     @"(?# Optional comma and/or whitespace)[,\s]*" +
     @")+            #Can have one or more pairs",
     RegexOptions.IgnorePatternWhitespace);

Notice that when using the # comment mechanism, you need to ensure that your lines are terminated with newline characters. Otherwise, the regular expression parser will interpret the following text as a continuing part of the comment. In most cases, this will result in errors.
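One way to get the required line terminators without concatenation is a multi-line verbatim string, since its line breaks are part of the pattern text. The sketch below is a variation on the key-value idea rather than the exact expression above: it matches one pair at a time, and the Parse helper and sample input are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CommentedPatternDemo
{
    // Each # comment ends at the line break embedded in the verbatim string.
    static readonly Regex Pair = new Regex(@"
        (?<key>\w+):      # the key of a key-value pair
        \s*               # optional whitespace
        (?<value>\w+)     # the value
        [,\s]*            # optional comma and/or whitespace
        ", RegexOptions.IgnorePatternWhitespace);

    // Collects every key-value pair found in the input.
    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(input))
            result[m.Groups["key"].Value] = m.Groups["value"].Value;
        return result;
    }

    static void Main()
    {
        foreach (var pair in Parse("name: Alice, age: 30"))
            Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
    }
}
```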

Alternation

An alternation indicates to the matcher that it should match on any one of a set of possible patterns. In other words, it provides an "or" construct very much like the one you might use in your favorite programming language. Alternation is indicated by separating expressions with the pipe character (i.e., |). For example, say that you wanted to match either a three-digit number or a single whitespace character. The following pattern uses alternation to do this: \d{3}|\s. You can similarly alternate between entire groups of expressions, as in (abc)|(xyz)|\d+. This matches either the character sequence 'abc', 'xyz', or a sequence of digits.

Note

Notice that alternation has lower precedence than concatenation in the regular expression grammar, eliminating the need to group characters: abc|xyz|\d+ matches the same input. The grouped form is equivalent and can improve readability, however: (abc)|(xyz)|\d+.

You may alternate between any number of choices. The regular expression matcher will try to match each from left to right, so if you have some input that might match more than one, the leftmost pattern will be matched first. In situations where you have complex groupings of expressions, this may lead the matcher down unexpected paths — and incur a dramatic performance penalty as a result of excess backtracking — so be sure to watch out for this.
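The left-to-right behavior of alternation is visible when one choice is a prefix of another. The following is a small sketch with invented sample inputs; FirstMatch is an illustrative helper.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationDemo
{
    // Returns the text of the first match, or "" if there is none.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    static void Main()
    {
        // The leftmost alternative that succeeds wins, even if a later one is longer.
        Console.WriteLine(FirstMatch("catalog", "cat|catalog")); // cat
        Console.WriteLine(FirstMatch("catalog", "catalog|cat")); // catalog

        Console.WriteLine(FirstMatch("xyz", @"(abc)|(xyz)|\d+")); // xyz
        Console.WriteLine(FirstMatch("42", @"abc|xyz|\d+"));      // 42

        // The earliest position in the input is tried first: the space at
        // index 1 is matched by \s before "123" is ever reached.
        Console.WriteLine(FirstMatch("a 123", @"\d{3}|\s") == " "); // True
    }
}
```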

Conditionals

Simple "or" functionality is the most straightforward alternation construct available. There is also a more advanced conditional alternation construct available, much like an if-then-else programming block (e.g., condition ? true_stmt : false_stmt). The form (?(<test>)<true>|<false>) is used to express this logic. <test> is either a pattern to be evaluated or a named capture. If it matches, the pattern matcher tries to match the consequent pattern <true>; otherwise, it tries to match the alternate pattern <false>. Named captures are discussed further in the upcoming section on groups.

(?(^\d)^\d+$|^\D+$) is an example of this feature. This instructs the matcher to perform the following logic: first evaluate ^\d, which simply checks to see if the text starts with a digit; if this matches, then match an entire line of digits with the pattern ^\d+$; otherwise, match a line of nondigit characters via ^\D+$.
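As a sketch of how this conditional behaves on single-line input (the class name, method name, and sample strings are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

static class ConditionalDemo
{
    static readonly Regex DigitsOrNonDigits = new Regex(@"(?(^\d)^\d+$|^\D+$)");

    // True if the input is entirely digits or entirely nondigits.
    public static bool IsUniform(string input) => DigitsOrNonDigits.IsMatch(input);

    static void Main()
    {
        Console.WriteLine(IsUniform("12345")); // True: starts with a digit, all digits
        Console.WriteLine(IsUniform("hello")); // True: starts with a nondigit, no digits
        Console.WriteLine(IsUniform("12abc")); // False: digit branch chosen, then fails
        Console.WriteLine(IsUniform("abc12")); // False: nondigit branch chosen, then fails
    }
}
```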

Positional Matching

Often it's useful to match parts of a pattern found in specific locations of your input. By default, a match is considered "successful" if a pattern matches any part within the input text. In fact, the Regex.IsMatch API does exactly that: it returns true if a match is found anywhere within the input. Say that you want to only report success if an entire line of text matches, for instance, or if the matched text was found at the beginning of a line; positional meta-characters enable you to do this.

These characters do not consume input but rather assert that a condition is true at the time the matcher evaluates them. They match positions within the text instead of matching characters. If the asserted condition is false, the match will fail. These are often referred to as zero-width assertions or anchors because they don't consume input.

Beginning and End of Line

Specifying \A indicates to the matcher that it should only report success if a match is at the start of the input text. Similarly, the ^ character indicates to the matcher that it should only report success if it's at the start of the input text or just after a newline character. For example, the pattern ^\d+ will match a sequence of digits found at the beginning of any line, while \A\d+ will only match digits found at the very beginning of the input text. Imagine the following text:

 abc
 123

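The first pattern (using ^) will successfully match this, because "123" is at the beginning of a new line, but the second (using \A) will not, because the input text begins with a letter rather than a digit. In the .NET Framework, ^ matches after a newline only when RegexOptions.Multiline is specified, as noted below; without it, ^ matches only at the start of the input.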

Much as with \A, you can use \z to indicate a match only if the matcher is positioned at the very end of the input. The uppercase variant \Z is slightly more lenient: it also matches just before a single trailing newline at the end of the input text. Much like ^, the $ character indicates that either the end of the input or the end of a line should be matched. For instance, \d+$ will match lines that end with a sequence of digits, and \d+\Z will match digits only at the end of the text. Unlike in the previous example, both patterns match the example input above (because "123" is found at the very end of the text).

You can combine these positional meta-characters to conveniently create patterns that match entire lines of text. Say that you wanted to match an entire line (and only an entire line) of numbers. The pattern ^\d+$ would do exactly that, and will match the string "123" but not "abc123", "123abc", or "1a2b3", for example. \A\d+\Z will match only if the entire input consists of digits (optionally followed by one trailing newline). Notice that you do not need to use these characters exclusively at the beginning or end of your pattern, meaning that you can create expressions that span multiple lines of text. As an example, the pattern (^\d+[\r\n]*)+ will match at least one line consisting entirely of a sequence of digits but will try to match as many consecutive lines as it can find. Note that to match multiple lines using the .NET Framework's regular expression APIs, you must use the RegexOptions.Multiline flag when creating your Regex.
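The differences between these anchors are easier to see side by side. The following is a sketch only; AnchorDemo and Test are hypothetical names, and the inputs are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Thin wrapper so that each call shows input, pattern, and options together.
    public static bool Test(string input, string pattern, RegexOptions options)
    {
        return Regex.IsMatch(input, pattern, options);
    }

    static void Main()
    {
        string text = "abc\n123";

        // Without Multiline, ^ matches only at the start of the input.
        Console.WriteLine(Test(text, @"^\d+", RegexOptions.None));        // False
        // With Multiline, ^ also matches just after a newline.
        Console.WriteLine(Test(text, @"^\d+", RegexOptions.Multiline));   // True
        // \A is unaffected by Multiline.
        Console.WriteLine(Test(text, @"\A\d+", RegexOptions.Multiline));  // False
        Console.WriteLine(Test(text, @"\d+\Z", RegexOptions.None));       // True

        // \Z tolerates one trailing newline; \z does not.
        Console.WriteLine(Test("123\n", @"\d+\Z", RegexOptions.None));    // True
        Console.WriteLine(Test("123\n", @"\d+\z", RegexOptions.None));    // False

        // Whole-input validation rejects a second line.
        Console.WriteLine(Test("123\n456", @"\A\d+\Z", RegexOptions.None)); // False
    }
}
```

For strict validation, \A with \z is the safest pairing, since neither is affected by RegexOptions.Multiline or by a trailing newline.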

Word Boundaries

You can also assert that a pattern is matched just before or after a word boundary, where a word is a contiguous sequence of word characters (those matched by \w: letters, digits, and the underscore) and a boundary is the position between a word character and a nonword character, or between a word character and the start or end of the input. \b asserts that the position is at either the start or the end of a word. To check that the position is precisely at the beginning or precisely at the end of a word, you will have to take advantage of the lookahead and lookbehind meta-characters described later in this section. Specifically, (?<!\w)(?=\w) and (?<=\w)(?!\w) will match the beginning and end of a word, respectively.
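As a rough illustration, the sketch below reports the index of every position at which each assertion succeeds. WordBoundaryDemo and Positions are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordBoundaryDemo
{
    // Returns the starting index of every match of the pattern in the input.
    public static int[] Positions(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        int[] result = new int[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Index;
        return result;
    }

    static void Print(string label, int[] positions)
    {
        Console.Write(label + ":");
        foreach (int p in positions)
            Console.Write(" " + p);
        Console.WriteLine();
    }

    static void Main()
    {
        // Only the standalone word matches, not "concat" or "cats".
        Print("whole word", Positions("cat concat cats", @"\bcat\b"));   // 0
        // Zero-width matches at the start and end of each word.
        Print("word starts", Positions("one two", @"(?<!\w)(?=\w)"));    // 0 4
        Print("word ends", Positions("one two", @"(?<=\w)(?!\w)"));      // 3 7
    }
}
```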

Continuing from a Previous Match

The \G meta-character specifies that the position in the input text of the match is identical to the ending position of the previous match (if continuing from a previous match). This can be used to ensure that only contiguous matches are found in the input text:

 Regex r = new Regex(@"^\G\d+$\n?", RegexOptions.Multiline);
 Match m = r.Match("15\n90\n22\n103");
 do
 {
     if (m.Success)
         Console.WriteLine("Matched: " + m.Value);
 }
 while ((m = m.NextMatch()).Success);

The typical behavior of Match.NextMatch here would be to start where the previous match left off and to search for the next occurrence of a match anywhere in the remaining input text. Here, however, the presence of the \G meta-character ensures that the expression matcher does not skip over characters that fail to match. If the next character fails the match, the match fails. Similar functionality is available by using the Regex.Matches API, which returns an enumerable collection of matches.
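To see the effect of \G, it helps to feed the same kind of pattern an input with a nonmatching line in the middle. This sketch assumes a made-up input and uses the hypothetical names ContiguousDemo and CountMatches; it relies on Regex.Matches rather than the NextMatch loop above.

```csharp
using System;
using System.Text.RegularExpressions;

static class ContiguousDemo
{
    // Counts the matches of a pattern, treating the input as multiple lines.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern, RegexOptions.Multiline).Count;
    }

    static void Main()
    {
        // The third line is not numeric.
        string input = "15\n90\nxx\n22";

        // With \G, matching stops at the first line that breaks the run.
        Console.WriteLine(CountMatches(input, @"^\G\d+$\n?"));   // 2
        // Without \G, the matcher skips ahead and also finds "22".
        Console.WriteLine(CountMatches(input, @"^\d+$\n?"));     // 3
    }
}
```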

Lookahead and Lookbehind

Similar to the positional assertion meta-characters, the lookahead and lookbehind features enable you to make an assertion about the state of the input surrounding the current position and proceed only if it is true. These meta-characters similarly do not consume characters. As you might guess, lookahead looks forward in the input and continues matching based on the specified assertion, while lookbehind examines recently consumed input.

Each has a positive and negative version. Positive assertions match if the specified pattern successfully matches, while negative assertions match if the pattern does not. The lookahead meta-characters are denoted with (?=<pattern>) and (?!<pattern>) for positive and negative assertion, respectively, while lookbehind is denoted with (?<=<pattern>) and (?<!<pattern>). <pattern> in each case can be any valid regular expression.

As an example, imagine that you would like to parse numbers out of input text but only those that are surrounded by whitespace (excluding numbers embedded within other words, for example):

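 Regex r = new Regex(@"(?<=\s|^)\d+(?=\s|$)");
 Console.WriteLine(r.IsMatch("abc123xyz"));    // False: digits are embedded in a word
 Console.WriteLine(r.IsMatch("abc 123 xyz"));  // True: digits are surrounded by whitespace
 Console.WriteLine(r.IsMatch("123"));          // True: digits span the entire input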

You could have done this with constructs discussed already, for example in \s\d+\s, but this would result in matching the surrounding whitespace. If you were extracting the match, you would have to trim off the excess. The above pattern demonstrates how to use positive lookahead and lookbehind assertions to accomplish the same thing without matching the surrounding whitespace.
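The difference shows up as soon as the matched text is extracted rather than merely tested. A small sketch, with LookaroundDemo and Extract as hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundDemo
{
    // Returns the matched text, or null when the pattern does not match.
    public static string Extract(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string input = "abc 123 xyz";

        // The consuming version drags the whitespace into the match.
        Console.WriteLine("[" + Extract(input, @"\s\d+\s") + "]");              // [ 123 ]
        // The lookaround version asserts the whitespace without consuming it.
        Console.WriteLine("[" + Extract(input, @"(?<=\s|^)\d+(?=\s|$)") + "]"); // [123]
    }
}
```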

Groups

Grouping enables you to treat individual components of a larger expression as separate addressable subexpressions. This is done by wrapping part of a larger expression in parentheses. Modifiers and meta-characters can then be applied to a group rather than individual elements of the expression. Moreover, groups enable you to capture and extract precise bits from the matched input text. Groups can be nested to any depth.

As an example of grouping, imagine that you need to create a pattern that matches an open-ended sequence of the letters "abc". An example of matching text is "abcabcabc". You might initially come up with the pattern abc for matching this, but remember, you must match an open-ended sequence of them. You must group the expression abc, as in (abc). This creates a group and currently matches the same exact thing as the previous pattern. But we can now modify the entire group with a quantifier, resulting in our desired pattern (abc)*.

Similarly, say that we wanted to match a never-ending sequence of "abc" and "xyz" characters. One's first inclination might be to try abc|xyz*, but due to expression precedence this won't work. This matches either "abc", or "xy" followed by any number of 'z' characters. We must use grouping to make the modifier * apply to the entire group: (abc|xyz)*. You might have expected (abc)|(xyz)* to work. But this is incorrect because the quantifier * modifies the fragment immediately preceding it. In this case, this is the group (xyz), not the entire alternation. And it results in the same exact meaning as the first attempt, abc|xyz*.
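These precedence rules can be checked directly. The sketch below uses the patterns from the text with made-up inputs; GroupingDemo and FirstMatch are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupingDemo
{
    // Returns the text of the first match, or null when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // The * applies only to the final 'z'.
        Console.WriteLine(FirstMatch("abcxyz", "abc|xyz*"));        // abc
        Console.WriteLine(FirstMatch("xyzzz", "abc|xyz*"));         // xyzzz
        // The * applies only to the group (xyz), not to the alternation.
        Console.WriteLine(FirstMatch("abcxyz", "(abc)|(xyz)*"));    // abc
        // The * applies to the whole alternation.
        Console.WriteLine(FirstMatch("abcxyzabc", "(abc|xyz)*"));   // abcxyzabc
    }
}
```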

Numbered Groups

The matcher auto-assigns a number to each group, enabling a feature called backreferencing. The number a given group receives is determined relative to its position in an entire expression. The matcher starts numbering at the number 1 and increases by one each time it finds a new opening parenthesis. (The 0th group is always the entire matched expression.) To illustrate this, consider how group numbers are assigned for the pattern ((abc)|(xyz))*, depicted in Figure 13-2.

Figure 13-2: Regular expression group numbers.
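The numbering can also be inspected at run time. In the following sketch (GroupNumberDemo and MatchGroups are hypothetical names, and the input is made up), each group's last captured value and capture count are printed for the pattern ((abc)|(xyz))*.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupNumberDemo
{
    // Group 1 is ((abc)|(xyz)), group 2 is (abc), group 3 is (xyz).
    public static Match MatchGroups(string input)
    {
        return Regex.Match(input, @"((abc)|(xyz))*");
    }

    static void Main()
    {
        Match m = MatchGroups("abcxyz");
        for (int i = 0; i < m.Groups.Count; i++)
        {
            // Group.Value is the text of the group's most recent capture.
            Console.WriteLine("Group {0}: '{1}' ({2} capture(s))",
                i, m.Groups[i].Value, m.Groups[i].Captures.Count);
        }
        // Group 0: 'abcxyz' (1 capture(s))
        // Group 1: 'xyz' (2 capture(s))
        // Group 2: 'abc' (1 capture(s))
        // Group 3: 'xyz' (1 capture(s))
    }
}
```

Because group 1 sits inside a quantifier, it captures once per iteration; Value reports only the last of those captures, while the Captures collection keeps them all.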

Not only is this feature great for composing larger patterns out of smaller individual patterns, but it enables you to capture a group's matched text and use that to drive later matching behavior within the same pattern. This is what is meant by the term backreferencing. There are some subtle differences that are introduced once you begin using named captures. This is described further below.

Backreferencing

Say you have a subexpression that must be found multiple times within a single piece of input text. You can easily accomplish this by repeating the expression more than once. For example, (abc)*x(abc)* repeats the pattern (abc)* to match "abc" more than once. However, consider if matches found later on needed to be identical to previous matches. So in this example, perhaps we want to ensure that we match the same number of "abc" sequences before and after the 'x'. If we have "abcabcabc" before the 'x', for a match to be successful we must also find "abcabcabc" after it.

Numbered groups and backreferences make this easy to accomplish: ((abc)*)x(\1). In this example, the backreference \1 says to match the exact text that group 1 matched. Not the same pattern — the same exact text. The number 1 following the backslash represents the group number to match, so for example group 2 would be \2, group 3 would be \3, and so on. Notice that in this example we have three groups (aside from the 0th group representing the whole match): ((abc)*) is 1, (abc) is 2, and (\1) is itself another group, 3.
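A short sketch of this backreference in action follows. Note one assumption that is not in the pattern above: the ^ and $ anchors are added so that the whole input must match, since an unanchored pattern is free to match a balanced substring instead. BackreferenceDemo and IsBalanced are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackreferenceDemo
{
    // Anchored variant: the entire input must be <run>x<same run>.
    public static bool IsBalanced(string input)
    {
        return Regex.IsMatch(input, @"^((abc)*)x(\1)$");
    }

    static void Main()
    {
        Console.WriteLine(IsBalanced("abcabcxabcabc"));   // True
        Console.WriteLine(IsBalanced("abcabcxabc"));      // False
        // Zero repetitions on both sides also balance.
        Console.WriteLine(IsBalanced("x"));               // True

        // Unanchored, the matcher finds the balanced substring instead.
        Console.WriteLine(Regex.Match("abcabcxabc", @"((abc)*)x(\1)").Value);   // abcxabc
    }
}
```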

A great illustration of where this might come in handy is matching XML elements. An XML element starts with an opening tag and must end with a matching closing tag. Take the XML text "<MyTag> Some content</MyTag>", for example. The following pattern would match this text successfully: <[^>]+>[^<]*</[^>]+>. But this isn't a very robust XML parser (if you can call it that). For example, it will match incorrectly formatted XML such as "<MyTag>Some content</MyOtherTag>". All we require with our pattern is that within the angle brackets we match a sequence of characters that are not '>' (signaling the end of a tag). But we say nothing about the ending tag having to match the beginning one.

We can use backreferences to solve this problem: <([^>]+)>([^<]*)</(\1)>. This preserves the property that the end tag must match the opening tag. It also captures the contents of the tag as its own group for convenience. Of course, it's still not perfect: it doesn't account for attributes very nicely, doesn't recursively parse inner tags, among other things. But it's a start.

Named Groups

You can also name groups rather than relying on automatic numbering. This can be particularly useful to enhance maintainability. If your pattern evolves over time (as most software does), it's very easy to accidentally change the numbering with a simple edit, which silently breaks any numbered backreferences. Conversely, changing a named group is an explicit operation.

To define a named group, use a question mark followed by the group name in either angle brackets or single quotes right after a group's opening parenthesis. The XML example above might be written as follows: <(?<tag>[^>]+)>(?<body>[^<]*)</(\k<tag>)>. Notice that you use \k<name> as the backreference. Because our pattern already has so many angle brackets (as a result of the XML), this is a good candidate for using single quotes instead, for example: <(?'tag'[^>]+)>(?'body'[^<]*)</(\k'tag')>.
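Named groups are retrieved by name through the Groups indexer. The sketch below uses the angle-bracket form of the pattern; NamedGroupDemo, TagName, and Body are hypothetical helper names. It also prints the numbers the matcher assigns, which illustrates one of the subtleties of mixing named and unnamed groups: in the .NET Framework, unnamed groups are numbered first and named groups afterward.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    public static readonly Regex Element =
        new Regex(@"<(?<tag>[^>]+)>(?<body>[^<]*)</(\k<tag>)>");

    // Returns the element's tag name, or null if the input is not matched.
    public static string TagName(string input)
    {
        Match m = Element.Match(input);
        return m.Success ? m.Groups["tag"].Value : null;
    }

    // Returns the element's text content, or null if the input is not matched.
    public static string Body(string input)
    {
        Match m = Element.Match(input);
        return m.Success ? m.Groups["body"].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(TagName("<MyTag>Some content</MyTag>"));   // MyTag
        Console.WriteLine(Body("<MyTag>Some content</MyTag>"));      // Some content
        // A mismatched closing tag is rejected.
        Console.WriteLine(TagName("<MyTag>Some content</MyOtherTag>") == null);   // True

        // The unnamed group (\k<tag>) is 1; the named groups follow.
        Console.WriteLine(Element.GroupNumberFromName("tag"));    // 2
        Console.WriteLine(Element.GroupNumberFromName("body"));   // 3
    }
}
```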

Accessing Captures

Groups capture the text that they match. This enables you to access the actual text that the parser matched for a given group in an expression programmatically:

 Regex r = new Regex(@"<([^>]+)>([^<]*)</(\1)>");
 Match m = r.Match("<MyTag>Some content</MyTag>");
 Console.WriteLine("Tag: {0}, InnerText: {1}",
     m.Groups[1].Value, m.Groups[2].Value);

This uses the pattern developed above for parsing simple XML elements (without attributes or nested elements, that is). It successfully parses the input string and extracts the interesting bits from it. It will print "Tag: MyTag, InnerText: Some content" to the console. This sample illustrates that matched text is kept around after performing a match for further analysis if needed. The Match class and its members are described later in this chapter, including details on how to work with captured text.

Noncapturing

In all of these cases, each new group will capture the text it matches. If you don't intend to use the captured group's value or backreference the group (perhaps you need the group only so that a quantifier can be applied to it), you can use (?: instead of just an opening parenthesis, as in (?:abc)*.
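The effect on group numbering is easy to observe. A minimal sketch with made-up input, where NonCapturingDemo and GroupCount are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

static class NonCapturingDemo
{
    // Returns the number of groups in a match, including group 0.
    public static int GroupCount(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    static void Main()
    {
        // Capturing: group 0, (abc), and (\d).
        Console.WriteLine(GroupCount("abcabc7", @"(abc)+(\d)"));     // 3
        // Noncapturing: only group 0 and (\d), which is now group 1.
        Console.WriteLine(GroupCount("abcabc7", @"(?:abc)+(\d)"));   // 2
        Console.WriteLine(Regex.Match("abcabc7", @"(?:abc)+(\d)").Groups[1].Value);   // 7
    }
}
```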

Greediness

The default behavior of quantifiers is greedy. This means that the matcher will try to match as much of the text as possible using a quantifier. This can sometimes cause certain groups to be skipped that you might have expected to match (resulting in missed captures, for example) or perhaps groups to be matched that you didn't anticipate. Greediness will never prevent a successful match. That is, if consuming too much input for a given quantifier would cause a match to fail where it could have otherwise succeeded (given an alternative matching strategy), the matcher will backtrack and keep trying alternative paths until it finds a match. In general, backtracking works by "giving back" input so that it can attempt to match other alternate expressions.

You should try to reduce the amount of backtracking an expression must perform by reducing ambiguity. Too much backtracking can harm the performance of your expressions and can be avoided by being more thoughtful about the expressions that you write. Sometimes ambiguity cannot be avoided, as in (abc)*abc. If fed the text "abcabc", it will successfully match, usually first consuming the entire input with (abc)* and only after that realizing it still has the abc part to match. It responds by placing groups of "abc" back into the input buffer and retrying the match until it succeeds. This means that generally speaking (abc)*abcabc will actually have to backtrack twice in order to match the text "abcabc". Of course, regular expression compilers are apt to find and optimize such obvious cases that can lead to backtracking, but this isn't always the case.

You can cause an entire subexpression to be matched greedily no matter whether it will cause the overall match to succeed or fail. To do so, you enclose an expression in (?>...). For example, using a slight variant on the above pattern, (?>(abc)*)abc will never match any possible input because the greedy (abc)* subexpression will always starve the latter abc part. Clearly, this is a case where you would not want to use this construct. There are examples where it is actually useful and can permit an expression compiler to optimize because it knows there is no possible chance of backtracking.
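The contrast between the backtracking and nonbacktracking forms can be sketched as follows; GreedySubexpressionDemo and Matches are hypothetical names introduced for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedySubexpressionDemo
{
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        // Ordinary greedy quantifier: gives back one "abc" so the tail can match.
        Console.WriteLine(Matches("abcabc", @"(abc)*abc"));        // True
        // (?>...) never gives anything back, so the trailing abc is starved.
        Console.WriteLine(Matches("abcabc", @"(?>(abc)*)abc"));    // False
        Console.WriteLine(Matches("abc", @"(?>(abc)*)abc"));       // False
    }
}
```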

Understanding What You Ask For

There are situations in which, while a match may be valid, there are alternative matching semantics that you desire. A few techniques will help you to deal with this. In general, you need to understand what you are asking the matcher for and be as precise as possible.

Consider a slightly more complex version of the XML parsing examples from above. In this case, input is a bit of text containing an XML element, which itself can contain either plain text (as before) or other simple XML elements that themselves contain good old text. For example, both "<MyTag> Plaintext data</MyTag>" and "<MyTag><InnerTag>Stuff</InnerTag><SomeTag>More stuff</SomeTag></MyTag>" are valid inputs. Ideally, we would like to tease out the tag names after matching with captured groups.

An initial attempt at an expression that does this might look like this: <(.+)>(.*|(<(.+)>.*</\4>)*)</\1>. Indeed, this does match both of the above inputs successfully, but the .* part of the inner expression matches the tag soup between "<MyTag>" and "</MyTag>" even when subelements are present. We would prefer that the alternate expression (<(.+)>.*</\4>)* match in this case. The reason is twofold: first, so that we can easily extract the captures afterward to see what inner tags were parsed, and second, so that the numbered backreference will validate that tags are being closed correctly.

A better expression in this case would be: <(.+)>(([^<].*)|(<(.+)>.*</\5>)*)</\1>. If this looks complex, well … it is! Regular expressions are not necessarily known for their readability. Notice that we solved the problem by being more precise than before in what plain text can appear between the outer tags. Instead of simply defining plain text as a sequence of any characters, we say that it is a sequence of any characters that does not begin with an opening angle bracket. This prevents the openness of the . character class and greediness of the * quantifier from stealing the inner tags from the other expression. This accomplishes both goals outlined above. Listing 13-1 illustrates parsing, capturing, and printing elements; it compares the initial attempt against a close variant of this pattern that writes the plain-text alternative as [^<]*, so the inner tag name remains group 4.

Listing 13-1: Capturing and printing simple XML elements

 String[] inputs = new String[] {
     "<MyTag><InnerTag>Stuff</InnerTag><SomeTag>More stuff</SomeCrapTag></MyTag>",
     "<MyTag>Plaintext data</MyTag>"
 };
 Regex[] rr = new Regex[] {
     new Regex(@"<(.+)>(.*|(<(.+)>.*</\4>)*)</\1>"),
     new Regex(@"<(.+)>([^<]*|(<(.+)>.*</\4>)*)</\1>")
 };
 foreach (String input in inputs)
 {
     Console.WriteLine(input);
     foreach (Regex r in rr)
     {
         Console.WriteLine("  {0}", r.ToString());
         Match m = r.Match(input);
         Console.WriteLine("    - Match: {0}", m.Success);
         if (m.Success)
         {
             for (int i = 0; i < m.Groups.Count; i++)
             {
                 for (int j = 0; j < m.Groups[i].Captures.Count; j++)
                 {
                     Console.WriteLine("    - Group {0}.{1}: {2}", i, j,
                         m.Groups[i].Captures[j].Value);
                 }
             }
         }
     }
 }

Lazy Quantifiers

There are additional problems that can arise when you need a quantified expression to match if it can, but as little as possible while not leading to a failure. A contrived illustration is the pattern (abc)*(abc)*. Both groups compete for the same characters, and the first will always starve the second. If you pass in "abcabc", the entire input text will be consumed by the first group. If you want the second group to match instead, you will have to mark the first as being lazy. You do this with a set of so-called lazy quantifiers. There are real scenarios where you want a later group to perform the matching and, thus, need to starve an earlier expression.

There is a lazy version of each of the quantifiers. You indicate this by following the ordinary quantifier meta-character with a ?. In other words, *? will match as few occurrences as possible, from 0 to an unbounded number. In our case, to cause the first group in the above example to be lazy you can use this pattern: (abc)*?(abc)*. Likewise, there are +? and ?? meta-characters. There are also lazy versions of the explicit quantifiers: {x,y}? and {x,}?. (The .NET Framework does not treat {,y} as a quantifier, so write {0,y}? when you need only an upper bound.) There is also a {x}? quantifier, which behaves the same as {x} because the repetition count is fixed.
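The sketch below compares greedy and lazy quantifiers by counting how many times each group captured. LazyDemo and CaptureCounts are hypothetical names, and the angle-bracket input at the end is a made-up example of the more common practical use.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyDemo
{
    // Returns the number of captures made by groups 1 and 2, in that order.
    public static int[] CaptureCounts(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return new int[] { m.Groups[1].Captures.Count, m.Groups[2].Captures.Count };
    }

    static void Main()
    {
        int[] greedy = CaptureCounts("abcabc", "(abc)*(abc)*");
        int[] lazy = CaptureCounts("abcabc", "(abc)*?(abc)*");

        // Greedy: the first group takes everything.
        Console.WriteLine("{0}, {1}", greedy[0], greedy[1]);   // 2, 0
        // Lazy: the first group takes nothing, leaving both for the second.
        Console.WriteLine("{0}, {1}", lazy[0], lazy[1]);       // 0, 2

        // Greedy runs to the last '>'; lazy stops at the first one.
        Console.WriteLine(Regex.Match("<a><b>", "<.+>").Value);    // <a><b>
        Console.WriteLine(Regex.Match("<a><b>", "<.+?>").Value);   // <a>
    }
}
```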