5.7. Regular Expressions

The use of strings and expressions to perform pattern matching dates from the earliest programming languages. In the mid-1960s SNOBOL was designed for the express purpose of text and string manipulation. It influenced the subsequent development of the grep tool in the Unix environment that makes extensive use of regular expressions. Those who have worked with grep or Perl or other scripting languages will recognize the similarity in the .NET implementation of regular expressions.

Pattern matching is based on the simple concept of applying a special pattern string to some text source in order to match an instance or instances of that pattern within the text. The pattern applied against the text is referred to as a regular expression, or regex, for short.

Entire books have been devoted to the topic of regular expressions. This section is intended to provide the essential knowledge required to get you started using regular expressions in the .NET world. The focus is on using the Regex class, and creating regular expressions from the set of characters and symbols available for that purpose.

The Regex Class

You can think of the Regex class as the engine that evaluates regular expressions and applies them to target strings. It provides both static and instance methods that use regexes for text searching, extraction, and replacement. The Regex class and all related classes are found in the System.Text.RegularExpressions namespace.

Syntax:

 Regex(string pattern)
 Regex(string pattern, RegexOptions options)

Parameters:

`pattern`	Regular expression used for pattern matching.
`RegexOptions`	An enum whose values control how the regex is applied. Values include:
`CultureInvariant`	Ignore culture.
`IgnoreCase`	Ignore upper- or lowercase.
`RightToLeft`	Process string right to left.

Example:

 Regex r1 = new Regex(" ");   // Regular expression is a single blank
 string[] words = r1.Split("red blue orange yellow");
 // Regular expression matches upper- or lowercase "at"
 Regex r2 = new Regex("at", RegexOptions.IgnoreCase);

As the example shows, creating a Regex object is quite simple. The first parameter to its constructor is a regular expression. The optional second parameter is one or more (separated by |) RegexOptions enum values that control how the regex is applied.
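To make the use of the | separator concrete, here is a minimal sketch that combines two of the option values listed above. The class and method names (RegexOptionsDemo, CountAt) are made up for illustration and are not part of the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Hypothetical helper: counts occurrences of "at" regardless of case.
    public static int CountAt(string input)
    {
        // Two RegexOptions values combined with the | operator.
        Regex r = new Regex("at",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
        return r.Matches(input).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountAt("AT bat At cat"));  // 4
    }
}
```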

Regex Methods

The Regex class contains a number of methods for pattern matching and text manipulation. These include IsMatch, Replace, Split, Match, and Matches. All have instance and static overloads that are similar, but not identical.

Core Recommendation

If you plan to use a regular expression repeatedly, it is more efficient to create a Regex object. When the object is created, it compiles the expression into a form that can be used as long as the object exists. In contrast, a static method must obtain a compiled form of the expression on every call: it either finds the pattern in an internal cache of recently used expressions or compiles it again.
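One way to follow this recommendation is to hold the Regex object in a field so that every call shares it. The sketch below assumes a hypothetical helper class (ReuseDemo) that counts how many lines contain the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReuseDemo
{
    // Created once; every call to CountMatchingLines reuses this object.
    private static readonly Regex AtRegex = new Regex("at");

    // Hypothetical helper: number of lines that contain "at".
    public static int CountMatchingLines(string[] lines)
    {
        int count = 0;
        foreach (string line in lines)
        {
            if (AtRegex.IsMatch(line))
                count++;
        }
        return count;
    }

    public static void Main()
    {
        string[] lines = { "He went that a way", "nothing here", "a cat" };
        Console.WriteLine(CountMatchingLines(lines));  // 2
    }
}
```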

Let's now examine some of the more important Regex methods. We'll keep the regular expressions simple for now because the emphasis at this stage is on understanding the methods, not regular expressions.

IsMatch()

This method matches the regular expression against an input string and returns a boolean value indicating whether a match is found.

 string searchStr = "He went that a way";
 Regex myRegex = new Regex("at");
 // Instance methods
 bool match = myRegex.IsMatch(searchStr);         // true
 // Begin search at position 12 in the string
 match = myRegex.IsMatch(searchStr, 12);          // false
 // Static methods: both return true
 match = Regex.IsMatch(searchStr, "at");
 match = Regex.IsMatch(searchStr, "AT", RegexOptions.IgnoreCase);

Replace()

This method returns a string that replaces occurrences of a matched pattern with a specified replacement string. This method has several overloads that permit you to specify a start position for the search or control how many replacements are made.

Syntax:

 static Replace(string input, string pattern, string replacement
                [, RegexOptions])
 Replace(string input, string replacement)
 Replace(string input, string replacement, int count)
 Replace(string input, string replacement, int count, int startat)

The count parameter denotes the maximum number of replacements to make; startat indicates where in the string to begin the matching process. Other overloads, which you may want to explore further, accept a MatchEvaluator delegate parameter. This delegate is called each time a match is found and can be used to customize the replacement process.

Here is a code segment that illustrates the static and instance forms of the method:

 string newStr;
 newStr = Regex.Replace("soft rose", "o", "i");   // sift rise
 // Instance method
 Regex myRegex = new Regex("o");                  // regex = "o"
 // Now specify that only one replacement may occur
 newStr = myRegex.Replace("soft rose", "i", 1);   // sift rose
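The MatchEvaluator overloads mentioned earlier can be sketched as follows. This is only an illustration: the class and method names (EvaluatorDemo, Upper, CapitalizeVowels) are invented, and the delegate simply uppercases each matched vowel.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorDemo
{
    // Called once for each match; the returned string replaces the match.
    private static string Upper(Match m)
    {
        return m.Value.ToUpperInvariant();
    }

    // Hypothetical helper: uppercases every lowercase vowel in the input.
    public static string CapitalizeVowels(string input)
    {
        Regex vowels = new Regex("[aeiou]");
        return vowels.Replace(input, new MatchEvaluator(Upper));
    }

    public static void Main()
    {
        Console.WriteLine(CapitalizeVowels("soft rose"));  // sOft rOsE
    }
}
```

Because the delegate receives the full Match object, it can base the replacement on the match's value, position, or groups rather than on a fixed string.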

Split()

This method splits a string at each point a match occurs and places the substrings between the matches in an array. It is similar to the String.Split method, except that the match is based on a regular expression rather than a character or character string.

Syntax:

 String[] Split(string input)
 String[] Split(string input, int count)
 String[] Split(string input, int count, int startat)
 static String[] Split(string input, string pattern)

Parameters:

`input`	The string to split.
`count`	The maximum number of array elements to return. A count value of 0 results in as many splits as possible. If the input contains more delimiters than count allows, the last element consists of the remainder of the string.
`startat`	The character position in input where the search begins.
`pattern`	The regex pattern to be matched against the input string.

This short example parses a string consisting of a list of artists' last names and places them in an array. A comma followed by zero or more blanks separates the names. The regular expression to match this delimiter string is: ",[ ]*". You will see how to construct this later in the section.

 string impressionists = "Manet,Monet, Degas, Pissarro,Sisley";
 // Regex to match a comma followed by 0 or more spaces
 string patt = @",[ ]*";
 // Static method
 string[] artists = Regex.Split(impressionists, patt);
 // Instance method is used to return a maximum of four elements
 Regex myRegex = new Regex(patt);
 string[] artists4 = myRegex.Split(impressionists, 4);
 foreach (string master in artists4)
    Console.Write("\"{0}\" ", master);
 // Output --> "Manet" "Monet" "Degas" "Pissarro,Sisley"

Match() and Matches()

These related methods search an input string for a match to the regular expression. Match() returns a single Match object and Matches() returns the object MatchCollection, a collection of all matches.

Syntax:

 Match Match(string input)
 Match Match(string input, int startat)
 Match Match(string input, int startat, int numchars)
 static Match Match(string input, string pattern [, RegexOptions])

The Matches method has similar overloads but returns a MatchCollection object.

Match and Matches are the most useful Regex methods. The Match object they return is rich in properties that expose the matched string, its length, and its location within the target string. It also includes a Groups property that allows the matched string to be further broken down into matching substrings. Table 5-7 shows selected members of the Match class.

Table 5-7. Selected Members of the `Match` Class
Member	Description
`Index`	Property returning the position in the string where the first character of the match is found.
`Groups`	A collection of groups within the match. Groups are created by enclosing sections of the regex in parentheses. The text that matches the pattern in parentheses is placed in the `Groups` collection.
`Length`	Length of the matched string.
`Success`	`true` or `false` depending on whether a match was found.
`Value`	Returns the matching substring.
`NextMatch()`	Returns a new `Match` with the results from the next match operation, beginning with the character after the previous match, if any.

The following code demonstrates the use of these class members. Note that the dot (.) in the regular expression functions as a wildcard character that matches any single character.

 string verse = "In Xanadu did Kubla Khan";
 string patt = ".an...";               // "." matches any character
 Match verseMatch = Regex.Match(verse, patt);
 Console.WriteLine(verseMatch.Value);  // Xanadu
 Console.WriteLine(verseMatch.Index);  // 3
 //
 string newPatt = "K(..)";             // contains group (..)
 Match kMatch = Regex.Match(verse, newPatt);
 while (kMatch.Success) {
    Console.Write(kMatch.Value);       // -->Kub -->Kha
    Console.Write(kMatch.Groups[1]);   // -->ub  -->ha
    kMatch = kMatch.NextMatch();
 }

This example uses NextMatch to iterate through the target string and assign each match to kMatch (if NextMatch is left out, an infinite loop results). The parentheses surrounding the two dots in newPatt break the pattern into groups without affecting the actual pattern matching. In this example, the two characters after K are assigned to group objects that are accessed in the Groups collection.

Sometimes, an application may need to collect all of the matches before processing them, which is the purpose of the MatchCollection class. This class is just a container for holding Match objects and is created using the Regex.Matches method discussed earlier. Its most useful properties are Count, which returns the number of matches, and Item, which returns an individual member of the collection. Here is how the NextMatch loop in the previous example could be rewritten:

 string verse = "In Xanadu did Kubla Khan";
 String newpatt = "K(..)";
 foreach (Match kMatch in Regex.Matches(verse, newpatt))
    Console.Write(kMatch.Value);  // -->Kub  -->Kha
 // Could also create explicit collection and work with it.
 MatchCollection mc = Regex.Matches(verse, newpatt);
 Console.WriteLine(mc.Count);     // 2

Creating Regular Expressions

The examples used to illustrate the Regex methods have employed only rudimentary regular expressions. Now, let's explore how to create regular expressions that are genuinely useful. If you are new to the subject, you will discover that designing Regex patterns tends to be a trial-and-error process; and the endeavor can yield a solution of simple elegance or maddening complexity. Fortunately, almost all of the commonly used patterns can be found on one of the Web sites that maintain a searchable library of Regex patterns (www.regexlib.com is one such site).

A regular expression can be broken down into four different types of metacharacters that have their own role in the matching process:

Matching characters. These match a specific type of character; for example, \d matches any digit from 0 to 9.
Repetition characters. Used to prevent having to repeat a matching character or item; for example, \d{3} can be used instead of \d\d\d to match three digits.
Positional characters. Designate the location in the target string where a match must occur; for example, ^\d{3} requires that the match occur at the beginning of the string.
Escape sequences. Use the backslash (\) in front of characters that otherwise have special meaning; for example, \} permits the right brace to be matched.

Table 5-8 summarizes the most frequently used patterns.

Table 5-8. Regular Expression Patterns
Pattern	Matching Criterion	Example
`+`	Match one or more occurrences of the previous item.	`to+` matches too and tooo. It does not match t.
`*`	Match zero or more occurrences of the previous item.	`to*` matches t or too or tooo.
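`?`	Match zero or one occurrence of the previous item. When placed after another repetition character (for example, `+?` or `*?`), it makes that repetition perform "non-greedy" matching.	`te?n` matches ten or tn. It does not match teen.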
`{n}`	Match exactly `n` occurrences of the previous character.	`te{2}n` matches teen. It does not match ten or teeen.
`{n,}`	Match at least `n` occurrences of the previous character.	`te{1,}n` matches ten and teen. It does not match tn.
`{n,m}`	Match at least `n` and no more than `m` occurrences of the previous character.	`te{1,2}n` matches ten and teen.
`\`	Treat the next character literally. Used to match characters that have special meaning such as the patterns `+`, `*`, and `?`.	`A\+B` matches A+B. The backslash (`\`) is required because `+` has special meaning.
`\d \D`	Match any digit (`\d`) or non-digit (`\D`). For ASCII text, this is equivalent to [0-9] or [^0-9], respectively.	`\d\d` matches 55. `\D\D` matches xx.
`\w \W`	Match any word character (`\w`: a letter, digit, or underscore) or non-word character (`\W`). For ASCII text, `\w` is equivalent to [a-zA-Z0-9_] and `\W` is equivalent to [^a-zA-Z0-9_].	`\w\w\w\w` matches A_19. `\W\W\W` matches ($).
`\n \r \t \v \f`	Match newline, carriage return, tab, vertical tab, or form feed, respectively.	N/A
`\s \S`	Match any whitespace (`\s`) or non-whitespace (`\S`) character. A whitespace character is usually a space or tab.	`\w\s\w\s\w` matches A B C.
`.` (dot)	Matches any single character. Does not match a newline.	`a.c` matches abc. It does not match abcc.
`|`	Logical OR.	`in|en` matches enquiry.
`[. . . ]`	Match any single character between the brackets. Hyphens may be used to indicate a range.	`[aeiou]` matches u. `[\d\D]` matches a single digit or non-digit.
`[^. . .]`	All characters except those in the brackets.	`[^aeiou]` matches x.
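A few of the table entries can be checked with a short program. The sketch below uses a hypothetical helper, FullMatch, that wraps a pattern in ^(?:...)$ so that it must match the whole input, which is how the examples in the table are meant to be read. The last two lines contrast greedy and non-greedy repetition.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: true only when the pattern matches the entire input.
    public static bool FullMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        Console.WriteLine(FullMatch("tooo", "to+"));       // True
        Console.WriteLine(FullMatch("t", "to+"));          // False
        Console.WriteLine(FullMatch("tn", "te?n"));        // True
        Console.WriteLine(FullMatch("teen", "te?n"));      // False
        Console.WriteLine(FullMatch("teen", "te{1,2}n"));  // True

        // Greedy + takes as much as it can; +? stops at the first possible match.
        Console.WriteLine(Regex.Match("<b>bold</b>", "<.+>").Value);   // <b>bold</b>
        Console.WriteLine(Regex.Match("<b>bold</b>", "<.+?>").Value);  // <b>
    }
}
```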

A Pattern Matching Example

Let's apply these character patterns to create a regular expression that matches a Social Security Number (SSN):

 bool iMatch = Regex.IsMatch("245-09-8444",
                             @"\d\d\d-\d\d-\d\d\d\d");

This is the most straightforward approach: Each character in the Social Security Number matches a corresponding pattern in the regular expression. It's easy to see, however, that simply repeating symbols can become unwieldy if a long string is to be matched. Repetition characters improve this:

 bool iMatch = Regex.IsMatch("245-09-8444",
                             @"\d{3}-\d{2}-\d{4}");

Another consideration in matching the Social Security Number may be to restrict where it exists in the text. You may want to ensure it is on a line by itself, or at the beginning or end of a line. This requires using position characters at the beginning or end of the matching sequence.

Let's alter the pattern so that it matches only if the Social Security Number exists by itself on the line. To do this, we need two characters: one to ensure the match is at the beginning of the line, and one to ensure that it is also at the end. According to Table 5-9, ^ and $ can be placed around the expression to meet these criteria. The new string is

 @"^\d{3}-\d{2}-\d{4}$"

Table 5-9. Characters That Specify Where a Match Must Occur
Position Character	Description
`^`	Following pattern must be at the start of a string or line.
`$`	Preceding pattern must be at end of a string or line.
`\A`	Following pattern must be at the start of a string.
`\b \B`	Match at a word boundary (`\b`), where a word character and non-word character meet, or at a non-word boundary (`\B`).
`\z \Z`	Pattern must be at the end of a string (`\z`), or at the end of a string or before a final newline (`\Z`).

These positional characters do not consume any characters in the target string; that is, they indicate where matching may occur but are not involved in the actual matching process.
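The following sketch shows two positional characters at work. It relies on RegexOptions.Multiline, an option not covered in this section, which makes ^ and $ match at the start and end of each line instead of the whole string. The class and method names are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical helper: counts occurrences of word as a whole word only.
    public static int CountWholeWord(string text, string word)
    {
        // \b on each side keeps "in" from matching inside "winter".
        string patt = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.Matches(text, patt).Count;
    }

    // Hypothetical helper: counts lines that consist solely of an SSN.
    public static int CountSsnLines(string text)
    {
        return Regex.Matches(text, @"^\d{3}-\d{2}-\d{4}$",
                             RegexOptions.Multiline).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWholeWord("in winter, stay in", "in"));  // 2
        string text = "245-09-8444\nSSN: 111-22-3333\n987-65-4321";
        Console.WriteLine(CountSsnLines(text));                         // 2
    }
}
```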

As a final refinement to the SSN pattern, let's break it into groups so that the three sets of numbers separated by dashes can be easily examined. To create a group, place parentheses around the parts of the expression that you want to examine independently. Here is a simple code example that uses the revised pattern:

 string ssn = "245-09-8444";
 string ssnPatt = @"^(\d{3})-(\d{2})-(\d{4})$";
 Match ssnMatch = Regex.Match(ssn, ssnPatt);
 if (ssnMatch.Success) {
    Console.WriteLine(ssnMatch.Value);         // 245-09-8444
    Console.WriteLine(ssnMatch.Groups.Count);  // 4
    // Count is 4 since Groups[0] is set to entire SSN
    Console.Write(ssnMatch.Groups[1]);         // 245
    Console.Write(ssnMatch.Groups[2]);         // 09
    Console.Write(ssnMatch.Groups[3]);         // 8444
 }

We now have a useful pattern that incorporates position, repetition, and group characters. The approach used to create this pattern (start with an obvious pattern and refine it through multiple stages) is a useful way to create complex regular expressions (see Figure 5-4).

Figure 5-4. Regular expression

Working with Groups

As we saw in the preceding example, the text resulting from a match can be automatically partitioned into substrings or groups by enclosing sections of the regular expression in parentheses. The text that matches the enclosed pattern becomes a member of the Match.Groups[] collection. This collection can be indexed as a zero-based array: the 0 element is the entire match, element 1 is the first group, element 2 the second, and so on.

Groups can be named to make them easier to work with. The name designator is placed adjacent to the opening parenthesis using the syntax ?<name>. To demonstrate the use of groups, let's suppose we need to parse a string containing the forecasted temperatures for the week (for brevity, only two days are included):

 string txt ="Monday Hi:88 Lo:56 Tuesday Hi:91 Lo:61";

The regex to match this includes two groups: day and temps. The following code creates a collection of matches and then iterates through the collection, printing the content of each group:

 string rgPatt = @"(?<day>[a-zA-Z]+)\s*(?<temps>Hi:\d+\s*Lo:\d+)";
 MatchCollection mc = Regex.Matches(txt, rgPatt);  // Get matches
 foreach (Match m in mc)
 {
    Console.WriteLine("{0} {1}",
                      m.Groups["day"], m.Groups["temps"]);
 }
 // Output:   Monday Hi:88 Lo:56
 //           Tuesday Hi:91 Lo:61

Core Note

There are times when you do not want the presence of parentheses to designate a group that captures a match. A common example is the use of parentheses to create an OR expression; for example, (an|in|on). To make this a non-capturing group, place ?: inside the parentheses; for example, (?:an|in|on).
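The effect of ?: can be seen by comparing the size of the Groups collection for the two forms. This is a small sketch; GroupDemo and GroupCount are names made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Hypothetical helper: number of entries in Match.Groups for a pattern.
    public static int GroupCount(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    public static void Main()
    {
        // Capturing: Groups[0] is the whole match, Groups[1] is the parenthesized part.
        Console.WriteLine(GroupCount("log on", "(an|in|on)"));    // 2
        // Non-capturing: only Groups[0] remains.
        Console.WriteLine(GroupCount("log on", "(?:an|in|on)"));  // 1
    }
}
```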

Backreferencing a Group

It is often useful to create a regular expression that includes matching logic based on the results of previous matches within the expression. For example, during a grammatical check, word processors flag any word that is a repeat of the preceding word(s). We can create a regular expression to perform the same operation. The secret is to define a group that matches a word and then use the matched value as part of the pattern. To illustrate, consider the following code:

 string speech = "Four score and and seven years";
 string patt = @"(\b[a-zA-Z]+\b)\s\1";   // Match repeated words
 MatchCollection mc = Regex.Matches(speech, patt);
 foreach (Match m in mc) {
    Console.WriteLine(m.Groups[1]);      // --> and
 }

This code matches only the repeated words. Let's examine the regular expression:

Pattern	Description
`(\b[a-zA-Z]+\b)\s`	Matches a word bounded on each side by a word boundary (`\b`) and followed by a whitespace character. In the sample text, this matches the first "and" and the space that follows it.
`\1`	The backreference indicator. Any group can be referenced with a backslash (`\`) followed by the group number. The effect is to insert the group's matched value into the expression. In the sample text, this matches the second "and".

A group can also be referenced by name rather than number. The syntax for this backreference is \k followed by the group name enclosed in <>:

 patt = @"(?<word>\b[a-zA-Z]+\b)\s\k<word>";
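The named form can be exercised as shown below. The sketch also illustrates a nuance: because nothing follows the backreference, the pattern as written also fires on text such as "the then". Appending \b is one possible refinement; it is an assumption added here, not part of the original pattern. The class and member names are invented.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackrefDemo
{
    // Pattern from the text, using a named group and \k<word>.
    public const string Loose = @"(?<word>\b[a-zA-Z]+\b)\s\k<word>";
    // Assumed refinement: require a word boundary after the repeated word.
    public const string Strict = Loose + @"\b";

    // Hypothetical helper: true if the text contains a repeated word.
    public static bool HasRepeat(string text, string pattern)
    {
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(HasRepeat("Four score and and seven", Loose));  // True
        Console.WriteLine(HasRepeat("the then", Loose));                  // True (false positive)
        Console.WriteLine(HasRepeat("the then", Strict));                 // False
    }
}
```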

Examples of Using Regular Expressions

This section closes with a quick look at some patterns that can be used to handle common pattern matching challenges. Two things should be clear from these examples: There are virtually unlimited ways to create expressions to solve a single problem, and many pattern matching problems involve nuances that are not immediately obvious.

Using Replace to Reverse Words

 string userName = "Claudel, Camille";
 userName = Regex.Replace(userName, @"(\w+),\s*(\w+)", "$2 $1");
 Console.WriteLine(userName);   // Camille Claudel

The regular expression assigns the last and first name to groups 1 and 2. The third parameter in the Replace method allows these groups to be referenced by placing $ in front of the group number. In this case, the effect is to replace the entire matched name with the match from group 2 (first name) followed by the match from group 1 (last name).
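Named groups can be used in the replacement string as well, written as ${name}. The following sketch is a variation on the example above; the group names (last, first) and the helper FirstLast are chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReverseNameDemo
{
    // Hypothetical helper: turns "Last, First" into "First Last".
    public static string FirstLast(string name)
    {
        return Regex.Replace(name,
                             @"(?<last>\w+),\s*(?<first>\w+)",
                             "${first} ${last}");
    }

    public static void Main()
    {
        Console.WriteLine(FirstLast("Claudel, Camille"));  // Camille Claudel
    }
}
```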

Parsing Numbers

 String myText = "98, 98.0, +98.0, +98";
 string numPatt = @"\d+";                     // Integer
 numPatt = @"(\d+\.?\d*)|(\.\d+)";            // Allow decimal
 numPatt = @"([+-]?\d+\.?\d*)|([+-]?\.\d+)";  // Allow + or -

Note the use of the OR (|) symbol in the third line of code to offer alternate patterns. In this case, it permits an optional number before the decimal.
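To see the final pattern in action, it can be applied to the sample text with Regex.Matches. This is a minimal sketch; NumberDemo and FindNumbers are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberDemo
{
    // Hypothetical helper: returns every number found in the text.
    public static string[] FindNumbers(string text)
    {
        string numPatt = @"([+-]?\d+\.?\d*)|([+-]?\.\d+)";
        MatchCollection mc = Regex.Matches(text, numPatt);
        string[] result = new string[mc.Count];
        for (int i = 0; i < mc.Count; i++)
            result[i] = mc[i].Value;
        return result;
    }

    public static void Main()
    {
        string[] nums = FindNumbers("98, 98.0, +98.0, +98");
        Console.WriteLine(string.Join("|", nums));  // 98|98.0|+98.0|+98
    }
}
```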

The following code uses the ^ character to anchor the pattern to the beginning of the line. The regular expression contains a group that matches four hex digits at a time. The * character causes the group to be repeated until there is nothing to match. Each time the group is applied, it captures a 4-digit hex number that is placed in the CaptureCollection object.

 string hex = "00AA001CFF0C";
 string hexPatt = @"^(?<hex4>[a-fA-F\d]{4})*";
 Match hexMatch = Regex.Match(hex, hexPatt);
 Console.WriteLine(hexMatch.Value);   // --> 00AA001CFF0C
 CaptureCollection cc = hexMatch.Groups["hex4"].Captures;
 foreach (Capture c in cc)
    Console.Write(c.Value + " ");     // --> 00AA 001C FF0C

Figure 5-5 shows the hierarchical relationship among the Match, GroupCollection, and CaptureCollection classes.

Figure 5-5. Hex numbers captured by regular expression
