Regular Expressions | Professional C# 2005 with .NET 3.0

Regular expressions are one of those small technology areas that are incredibly useful in a wide range of programs, yet rarely used by developers. You can think of regular expressions as a mini-programming language with one specific purpose: to locate substrings within a larger string. It is not a new technology; it originated in the Unix environment and is commonly used with the Perl programming language. Microsoft ported it onto Windows, where up until now it has been used mostly with scripting languages. Today, however, regular expressions are supported by a number of .NET classes in the namespace System.Text.RegularExpressions. You can also find regular expressions used in various parts of the .NET Framework; for instance, they are used within the ASP.NET validation server controls.

If you are not familiar with the regular expressions language, this section gives a very basic introduction to both regular expressions and their related .NET classes. If you are already familiar with regular expressions, you’ll probably want to just skim through this section to pick out the references to the .NET base classes. You might like to know that the .NET regular expression engine is designed to be mostly compatible with Perl 5 regular expressions, although it has a few extra features.

Introduction to Regular Expressions

The regular expressions language is designed specifically for string processing. It contains two features:

A set of escape codes for identifying specific types of characters. You will be familiar with the use of the * character to represent any substring in DOS expressions. (For example, the DOS command Dir Re* lists the files with names beginning with Re.) Regular expressions use many sequences like this to represent items such as any one character, a word break, one optional character, and so on.
A system for grouping parts of substrings and intermediate results during a search operation.

With regular expressions, you can perform quite sophisticated and high-level operations on strings. For example, you can:

Identify (and perhaps either flag or remove) all repeated words in a string (for example, “The computer books books” to “The computer books”)
Convert all words to title case (for example, “this is a Title” to “This Is A Title”)
Convert all words longer than three characters to title case (for example, “this is a Title” to “This is a Title”)
Ensure that sentences are properly capitalized
Separate the various elements of a URI (for example, given http://www.wrox.com, extract the protocol, computer name, file name, and so on)

Of course, all of these tasks can be performed in C# using the various methods on System.String and System.Text.StringBuilder. However, in some cases, this would involve writing a fair amount of C# code. If you use regular expressions, this code can normally be compressed to just a couple of lines. Essentially, you instantiate a System.Text.RegularExpressions.Regex object (or, even simpler, invoke one of the static methods of the Regex class), pass it the string to be processed, and pass in a regular expression (a string containing the instructions in the regular expressions language), and you’re done.

A regular expression string looks at first sight rather like a regular string, but interspersed with escape sequences and other characters that have a special meaning. For example, the sequence \b indicates the beginning or end of a word (a word boundary), so if you wanted to indicate you were looking for the characters th at the beginning of a word, you would search for the regular expression \bth (that is, the sequence word boundary-t-h). If you wanted to search for all occurrences of th at the end of a word, you would write th\b (the sequence t-h-word boundary). However, regular expressions are much more sophisticated than that and include, for example, facilities to store portions of text that are found in a search operation. This section merely scratches the surface of the power of regular expressions.

Suppose your application needed to convert U.S. phone numbers to an international format. In the United States, phone numbers have this format: 314-123-1234, which is often written as (314) 123-1234. When converting this national format to an international format, you have to include +1 (the country code of the United States) and add parentheses around the area code: +1 (314) 123-1234. As find-and-replace operations go, that’s not too complicated, but it would still require some coding effort if you were going to do it using only the methods available on System.String. The regular expressions language allows you to construct a short string that achieves the same result.
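As a rough illustration of how short such a conversion can be, here is a minimal sketch using Regex.Replace(). The class and method names (PhoneFormatDemo, ToInternational) are made up for this illustration, and the pattern assumes the number appears only in the hyphenated 314-123-1234 form; the (314) 123-1234 form would need a second pattern.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhoneFormatDemo
{
    // Hypothetical helper (not from the text): rewrites 314-123-1234
    // as +1 (314) 123-1234. Each parenthesized part of the pattern is a
    // group that the replacement string refers to as $1, $2, and $3.
    public static string ToInternational(string input)
    {
        return Regex.Replace(
            input,
            @"\b(\d{3})-(\d{3})-(\d{4})\b",
            "+1 ($1) $2-$3");
    }

    static void Main()
    {
        // Prints: Call +1 (314) 123-1234 today.
        Console.WriteLine(ToInternational("Call 314-123-1234 today."));
    }
}
```

The \b at each end keeps the pattern from matching in the middle of a longer run of digits.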

This section is intended only as a very simple example, so it concentrates on searching strings to identify certain substrings, not on modifying them.

The RegularExpressionsPlayaround Example

For the rest of this section, you develop a short example that illustrates some of the features of regular expressions and how to use the .NET regular expressions engine in C# by performing and displaying the results of some searches. The text you are going to use as your sample document is an introduction to a Wrox Press book on ASP.NET (Professional ASP.NET 2.0, ISBN 0-7645-7610-0):

  string Text = @"This comprehensive compendium provides a broad and thorough investigation of all aspects of programming with ASP.NET. Entirely revised and updated for the 2.0 Release of .NET, this book will give you the information you need to master ASP.NET and build a dynamic, successful, enterprise Web application.";

Tip

This code is valid C# code, despite all the line breaks. It nicely illustrates the utility of verbatim strings that are prefixed by the @ symbol.

This text is referred to as the input string. To get your bearings and get used to the regular expressions .NET classes, you start with a basic plain text search that doesn’t feature any escape sequences or regular expression commands. Suppose that you want to find all occurrences of the string ion. This search string is referred to as the pattern. Using regular expressions and the Text variable declared previously, you can write this:

  string Pattern = "ion";
  MatchCollection Matches = Regex.Matches(Text, Pattern,
                                          RegexOptions.IgnoreCase |
                                          RegexOptions.ExplicitCapture);
  foreach (Match NextMatch in Matches)
  {
     Console.WriteLine(NextMatch.Index);
  }

This code uses the static method Matches() of the Regex class in the System.Text.RegularExpressions namespace. This method takes as parameters some input text, a pattern, and a set of optional flags taken from the RegexOptions enumeration. In this case, you have specified that all searching should be case-insensitive. The other flag, ExplicitCapture, modifies the way that the match is collected in a way that, for your purposes, makes the search a bit more efficient; you see why later (it also has other uses that are not explored here). Matches() returns a reference to a MatchCollection object. A match is the technical term for the result of finding an instance of the pattern in the input string. It is represented by the class System.Text.RegularExpressions.Match. Therefore, you get back a MatchCollection that contains all the matches, each represented by a Match object. In the preceding code, you simply iterate over the collection and use the Index property of the Match class, which returns the index in the input text at which the match was found. Running this code results in three matches (in investigation, information, and application). The following table details some of the members of the RegexOptions enumeration.

Member Name	Description
CultureInvariant	Specifies that the culture of the string is ignored
ExplicitCapture	Modifies the way the match is collected by making sure that valid captures are the ones that are explicitly named
IgnoreCase	Ignores the case of the string that is input
IgnorePatternWhitespace	Removes unescaped whitespace from the string and enables comments that are specified with the pound or hash sign
Multiline	Changes the characters ^ and $ so that they are applied to the beginning and end of each line and not just to the beginning and end of the entire string
RightToLeft	Causes the search to proceed through the input string from right to left instead of the default left to right
Singleline	Specifies a single-line mode where the meaning of the dot (.) is changed to match every character, including the new-line character
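To get a feel for what some of these flags do, the following minimal sketch counts matches with and without a given option. The class name RegexOptionsDemo, the Count() helper, and the small input strings are assumptions made for this illustration; they are not part of the RegularExpressionsPlayaround example.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexOptionsDemo
{
    // Hypothetical helper: number of matches of pattern in input under the given options.
    public static int Count(string input, string pattern, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    static void Main()
    {
        string twoLines = "first line\nsecond line";

        // Without Multiline, ^ matches only at the start of the whole input.
        Console.WriteLine(Count(twoLines, @"^\w+", RegexOptions.None));      // 1
        Console.WriteLine(Count(twoLines, @"^\w+", RegexOptions.Multiline)); // 2

        // IgnoreCase lets a lowercase pattern match uppercase text.
        Console.WriteLine(Count("ASP.NET", "asp", RegexOptions.None));       // 0
        Console.WriteLine(Count("ASP.NET", "asp", RegexOptions.IgnoreCase)); // 1

        // Singleline lets the dot match the new-line character as well.
        Console.WriteLine(Count("a\nb", "a.b", RegexOptions.None));          // 0
        Console.WriteLine(Count("a\nb", "a.b", RegexOptions.Singleline));    // 1
    }
}
```

Flags can be combined with the | operator, as the earlier call to Regex.Matches() does.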

So far, nothing in the preceding example is really new apart from some .NET base classes. However, the power of regular expressions really comes from that pattern string. The reason is that the pattern string doesn’t have to contain only plain text. As hinted at earlier, it can also contain what are known as meta-characters, which are special characters that give commands, as well as escape sequences, which work in much the same way as C# escape sequences. They are characters preceded by a backslash (\) and have special meanings.

For example, suppose that you wanted to find words beginning with n. You could use the escape sequence \b, which indicates a word boundary (a word boundary is just a point where an alphanumeric character precedes or follows a whitespace character or punctuation symbol). You would write this:

  string Pattern = @"\bn";
  MatchCollection Matches = Regex.Matches(Text, Pattern,
                                          RegexOptions.IgnoreCase |
                                          RegexOptions.ExplicitCapture);

Notice the @ character in front of the string. You want the \b to be passed to the .NET regular expressions engine at runtime - you don’t want the backslash intercepted by a well-meaning C# compiler that thinks it’s an escape sequence intended for itself! If you want to find words ending with the sequence ion, you write this:

  string Pattern = @"ion\b";

If you want to find all words beginning with the letter a and ending with the sequence ion (which has as its only match the word application in the example), you will have to put a bit more thought into your code. You clearly need a pattern that begins with \ba and ends with ion\b, but what goes in the middle? You need to somehow tell the regular expressions engine that between the a and the ion there can be any number of characters as long as none of them are whitespace. In fact, the correct pattern looks like this:

  string Pattern = @"\ba\S*ion\b";

Eventually you will get used to seeing weird sequences of characters like this when working with regular expressions. It actually works quite logically. The escape sequence \S indicates any character that is not a whitespace character. The * is called a quantifier. It means that the preceding character can be repeated any number of times, including zero times. The sequence \S* means any number of characters as long as they are not whitespace characters. The preceding pattern will, therefore, match any single word that begins with a and ends with ion.

The following table lists some of the main special characters or escape sequences that you can use. It is not comprehensive, but a fuller list is available in the MSDN documentation.

Symbol	Meaning	Example	Matches
^	Beginning of input text	^B	B, but only if first character in text
$	End of input text	X$	X, but only if last character in text
.	Any single character except the new-line character (\n)	i.ation	isation, ization
*	Preceding character may be repeated 0 or more times	ra*t	rt, rat, raat, raaat, and so on
+	Preceding character may be repeated 1 or more times	ra+t	rat, raat, raaat, and so on (but not rt)
?	Preceding character may be repeated 0 or 1 times	ra?t	rt and rat only
\s	Any whitespace character	\sa	[space]a, \ta, \na (\t and \n have the same meanings as in C#)
\S	Any character that isn’t a whitespace character	\SF	aF, rF, cF, but not \tF
\b	Word boundary	ion\b	Any word ending in ion
\B	Any position that isn’t a word boundary	\BX\B	Any X in the middle of a word
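The quantifier rows of the table can be checked with a few lines of code. This is only a sketch: the class name QuantifierDemo and the Accepted() helper are invented for the illustration, and each pattern is wrapped in ^ and $ so that the whole candidate string has to match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: returns the candidates that the pattern matches in full.
    public static string Accepted(string pattern, string[] candidates)
    {
        List<string> hits = new List<string>();
        foreach (string candidate in candidates)
        {
            // ^ and $ anchor the pattern to the start and end of the candidate.
            if (Regex.IsMatch(candidate, "^" + pattern + "$"))
            {
                hits.Add(candidate);
            }
        }
        return string.Join(", ", hits.ToArray());
    }

    static void Main()
    {
        string[] candidates = { "rt", "rat", "raat" };
        Console.WriteLine("ra*t: " + Accepted("ra*t", candidates)); // ra*t: rt, rat, raat
        Console.WriteLine("ra+t: " + Accepted("ra+t", candidates)); // ra+t: rat, raat
        Console.WriteLine("ra?t: " + Accepted("ra?t", candidates)); // ra?t: rt, rat
    }
}
```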

If you want to search for one of the meta-characters, you can do so by escaping the corresponding character with a backslash. For example, . (a single period) means any single character other than the newline character, whereas \. means a dot.

You can request a match that contains alternative characters by enclosing them in square brackets. For example, [1c] means one character that can be either 1 or c. If you wanted to search for any occurrence of the words map or man, you would use the sequence ma[np]. Note that no separator is needed between the alternatives: inside square brackets the | character is treated as a literal, so ma[n|p] would also match the text ma|. Within the square brackets, you can also indicate a range, for example [a-z] to indicate any single lowercase letter, [A-E] to indicate any uppercase letter between A and E, or [0-9] to represent a single digit. If you want to search for an integer (that is, a sequence that contains only the characters 0 through 9), you could write [0-9]+ (note the use of the + character to indicate there must be at least one such digit, but there may be more than one - so this would match 9, 83, 854, and so on).
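The following minimal sketch tries out a few character classes and an escaped meta-character. The class name CharacterClassDemo, the Find() helper, and the short input strings are assumptions made for this illustration. It uses ma[np] for the map/man search because a | placed inside square brackets is matched literally, which the second call makes visible.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharacterClassDemo
{
    // Hypothetical helper: returns the text of every match of pattern in input.
    public static string[] Find(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    static void Main()
    {
        // Either n or p after ma.
        Console.WriteLine(string.Join(" ", Find("map man mat", "ma[np]")));     // map man

        // A | inside the brackets is just another character to match.
        Console.WriteLine(string.Join(" ", Find("ma| map", "ma[n|p]")));        // ma| map

        // One or more digits.
        Console.WriteLine(string.Join(" ", Find("9, 83 and 854", "[0-9]+")));   // 9 83 854

        // An unescaped dot matches any character; \. matches only a dot.
        Console.WriteLine(string.Join(" ", Find("v1.0 v1x0", "1.0")));          // 1.0 1x0
        Console.WriteLine(string.Join(" ", Find("v1.0 v1x0", @"1\.0")));        // 1.0
    }
}
```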

Displaying Results

In this section, you code the RegularExpressionsPlayaround example, so you can get a feel for how the regular expressions work.

The core of the example is a method called WriteMatches(), which writes out all the matches from a MatchCollection in a more detailed format. For each match, it displays the index of where the match was found in the input string, the string of the match, and a slightly longer string, which consists of the match plus up to ten surrounding characters from the input text - up to five characters before the match and up to five afterward (it is fewer than five characters if the match occurred within five characters of the beginning or end of the input text). In other words, a match on the word master that occurs in the middle of the input text quoted earlier would display d to master ASP. (five characters before and after the match), but a match on the final word application would display Web application. (only one character after the match), because after that you get to the end of the string. This longer string lets you see more clearly where the regular expression locates the match:

  static void WriteMatches(string text, MatchCollection matches)
  {
     Console.WriteLine("Original text was: \n\n" + text + "\n");
     Console.WriteLine("No. of matches: " + matches.Count);
     foreach (Match nextMatch in matches)
     {
        int Index = nextMatch.Index;
        string result = nextMatch.Value;
        // Show up to five characters of context on each side of the match,
        // clamped so that the substring stays inside the input text.
        int charsBefore = (Index < 5) ? Index : 5;
        int fromEnd = text.Length - Index - result.Length;
        int charsAfter = (fromEnd < 5) ? fromEnd : 5;
        int charsToDisplay = charsBefore + charsAfter + result.Length;
        Console.WriteLine("Index: {0}, \tString: {1}, \t{2}",
           Index, result,
           text.Substring(Index - charsBefore, charsToDisplay));
     }
  }

The bulk of the processing in this method is devoted to the logic of figuring out how many characters in the longer substring it can display without overrunning the beginning or end of the input text. Note that you use another property on the Match object, Value, which contains the string identified for the match. Other than that, RegularExpressionsPlayaround simply contains a number of methods with names like Find1, Find2, and so on, which perform some of the searches based on the examples in this section. For example, Find2 looks for any string that contains a at the beginning of a word:

  static void Find2()
  {
     string text = @"This comprehensive compendium provides a broad and thorough
       investigation of all aspects of programming with ASP.NET. Entirely revised and
       updated for the 2.0 Release of .NET, this book will give you the information
       you need to master ASP.NET and build a dynamic, successful, enterprise Web
       application.";
     string pattern = @"\ba";
     MatchCollection matches = Regex.Matches(text, pattern,
       RegexOptions.IgnoreCase);
     WriteMatches(text, matches);
  }

Along with this comes a simple Main() method that you can edit to select one of the Find<n>() methods:

  static void Main()
  {
     Find1();
     Console.ReadLine();
  }

The code also needs to make use of the RegularExpressions namespace:

  using System;
  using System.Text.RegularExpressions;

Running the example with a Find1() method that searches for the pattern \ba\S*ion\b discussed earlier gives these results:

  RegularExpressionsPlayaround
  Original text was:

  This comprehensive compendium provides a broad and thorough investigation of all aspects of programming with ASP.NET. Entirely revised and updated for the 2.0 Release of .NET, this book will give you the information you need to master ASP.NET and build a dynamic, successful, enterprise Web application.

  No. of matches: 1
  Index: 291,     String: application,     Web application.

Matches, Groups, and Captures

One nice feature of regular expressions is that you can group characters. It works the same way as compound statements in C#. In C# you can group any number of statements by putting them in braces, and the result is treated as one compound statement. In regular expression patterns, you can group any characters (including meta-characters and escape sequences), and the result is treated as a single character. The only difference is that you use parentheses instead of braces. The resultant sequence is known as a group.

For example, the pattern (an)+ locates any recurrences of the sequence an. The + quantifier applies only to the previous character, but because you have grouped the characters together, it now applies to repeats of an treated as a unit. This means that if you apply (an)+ to the input text, bananas came to Europe late in the annals of history, the anan from bananas is identified. On the other hand, if you write an+, the program selects the ann from annals, as well as two separate sequences of an from bananas. The expression (an)+ identifies occurrences of an, anan, ananan, and so on, whereas the expression an+ identifies occurrences of an, ann, annn, and so on.

Tip

You might wonder why, in the preceding example, (an)+ picks out anan from the word bananas but doesn’t identify either of the two occurrences of an from the same word. The rule is that matches must not overlap. The engine scans from left to right, and because quantifiers are greedy by default, it takes the longest sequence it can at the first position where a match is possible and then resumes searching after the end of that match.
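The difference between (an)+ and an+ is easy to see by running both against the sentence used above. This is a minimal sketch; the class name GroupQuantifierDemo and the Find() helper are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupQuantifierDemo
{
    public const string Input = "bananas came to Europe late in the annals of history";

    // Hypothetical helper: returns the text of every match of pattern in input.
    public static string[] Find(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    static void Main()
    {
        // + applies to the whole group: anan from bananas, then an from annals.
        Console.WriteLine(string.Join(", ", Find(Input, "(an)+"))); // anan, an

        // + applies only to the n: an and an from bananas, then ann from annals.
        Console.WriteLine(string.Join(", ", Find(Input, "an+")));   // an, an, ann
    }
}
```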

However, groups are actually more powerful than that. By default, when you form part of the pattern into a group, you are also asking the regular expression engine to remember any matches against just that group, as well as any matches against the entire pattern. In other words, you are treating that group as a pattern to be matched and returned in its own right. This can actually be extremely useful if you want to break up strings into component parts.

For example, URIs have the format: <protocol>://<address>:<port>, where the port is optional. An example of this is http://www.wrox.com:4355. Suppose that you want to extract the protocol, the address, and the port from a URI, where you know that there may or may not be whitespace (but no punctuation) immediately following the URI. You could do so using this expression:

  \b(\S+)://(\S+)(?::(\S+))?\b

Here is how this expression works: First, the leading and trailing \b sequences ensure that you only consider portions of text that are entire words. Within that, the first group, (\S+)://, identifies one or more characters that don’t count as whitespace, and that are followed by :// - the http:// at the start of an HTTP URI. The parentheses cause the http to be stored as a group. The subsequent (\S+) identifies the string www.wrox.com in the URI. When no port is present, this group ends at the end of the word (the closing \b). Be aware, though, that \S also matches a colon and that + is greedy, so for a URI that does include a port, this group as written runs on through the port (www.wrox.com:4355) and the optional port group is left empty; writing the address group as ([^\s:]+) instead makes it stop at the colon.

The next group identifies the port (:4355). The following ? indicates that this group is optional in the match - if there is no :xxxx, this won’t prevent a match from being marked. This is very important, because the port number is not always specified in a URI - in fact, it is absent most of the time. However, things are a bit more complicated than that. You want to indicate that the colon might or might not appear too, but you don’t want to store this colon in the group. You’ve achieved this by having two nested groups. The inner (\S+) identifies anything that follows the colon (for example, 4355). The outer group contains the inner group preceded by the colon, and this group in turn is preceded by the sequence ?:. This sequence indicates that the group in question should not be saved (you only want to save 4355; you don’t need :4355 as well!). Don’t get confused by the two colons following each other - the first colon is part of the ?: sequence that says “don’t save this group,” and the second is text to be searched for.

If you run this pattern on the following string, you’ll get one match: http://www.wrox.com.

 Hey I've just found this amazing URI at http:// what was it -- oh yes http://www.wrox.com

Within this match, you will find the three groups just mentioned as well as a fourth group, which represents the match itself. Theoretically, it is possible that each group itself might return no, one, or more than one match. Each of these individual matches is known as a capture. So, the first group, (\S+), has one capture, http. The second group also has one capture (www.wrox.com). The third group, however, has no captures, because there is no port number on this URI.

Notice that the string contains a second http://. Although this does match up to the first group, it will not be captured by the search, because the entire search expression does not match this part of the text.

The .NET RegularExpressions classes support groups and captures through classes known as Group and Capture. Also, the GroupCollection and CaptureCollection classes represent collections of groups and captures. The Match class exposes the Groups property, which returns the corresponding GroupCollection object. The Group class correspondingly exposes the Captures property, which returns a CaptureCollection. The relationship between the objects is shown in Figure 8-3.

Figure 8-3
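As a minimal sketch of how the Groups and Captures collections might be inspected, the following code runs the URI pattern from this section against the sample sentence. The class name UriGroupsDemo and the Describe() helper are invented for the illustration. VariantPattern is an assumption rather than something taken from the text: it swaps the address group for ([^\s:]+) so that the address stops at a colon and a port, when present, lands in the third group.

```csharp
using System;
using System.Text.RegularExpressions;

static class UriGroupsDemo
{
    // The pattern exactly as given in this section.
    public const string BookPattern = @"\b(\S+)://(\S+)(?::(\S+))?\b";

    // Assumed variant (not from the text): the address group excludes colons.
    public const string VariantPattern = @"\b(\S+)://([^\s:]+)(?::(\S+))?\b";

    public const string Sentence =
        "Hey I've just found this amazing URI at http:// what was it -- oh yes http://www.wrox.com";

    // Hypothetical helper: prints each match and its numbered groups.
    public static void Describe(string input, string pattern)
    {
        foreach (Match match in Regex.Matches(input, pattern))
        {
            Console.WriteLine("Match: " + match.Value);
            // Groups[0] is the match itself, so the numbered groups start at 1.
            for (int i = 1; i < match.Groups.Count; i++)
            {
                Group group = match.Groups[i];
                Console.WriteLine("  Group {0}: '{1}' (captures: {2})",
                    i, group.Value, group.Captures.Count);
            }
        }
    }

    static void Main()
    {
        Describe(Sentence, BookPattern);
        Describe("http://www.wrox.com:4355", VariantPattern);
    }
}
```

For the sample sentence, the first call reports one match with the groups http and www.wrox.com, and an empty third group with no captures. The second call reports http, www.wrox.com, and 4355.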

You might not want to return a Group object every time you just want to group some characters. A fair amount of overhead is involved in instantiating the object, which is not necessary if all you want is to group some characters as part of your search pattern. You can disable this by starting the group with the character sequence ?: for an individual group, as was done for the URI example, or for all unnamed groups by specifying the RegexOptions.ExplicitCapture flag on the Regex.Matches() method, as was done in the earlier examples.
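One way to see the effect of these two techniques is to look at how many entries the Groups collection of a match contains. This is a minimal sketch; the class name ExplicitCaptureDemo, the GroupCount() helper, and the group name syllable are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExplicitCaptureDemo
{
    // Hypothetical helper: number of groups reported for a match against "bananas".
    // The count always includes group 0, which is the match itself.
    public static int GroupCount(string pattern, RegexOptions options)
    {
        return Regex.Match("bananas", pattern, options).Groups.Count;
    }

    static void Main()
    {
        // A plain parenthesized group is captured.
        Console.WriteLine(GroupCount("(an)+", RegexOptions.None));                       // 2

        // ?: turns off capturing for this one group.
        Console.WriteLine(GroupCount("(?:an)+", RegexOptions.None));                     // 1

        // ExplicitCapture turns off capturing for every unnamed group.
        Console.WriteLine(GroupCount("(an)+", RegexOptions.ExplicitCapture));            // 1

        // Named groups are still captured under ExplicitCapture.
        Console.WriteLine(GroupCount("(?<syllable>an)+", RegexOptions.ExplicitCapture)); // 2
    }
}
```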