Regular Expressions | Pro Visual C++ 2005 for C# Developers

Regular expressions are one of those small technology areas that are incredibly useful in a wide range of programs, yet rarely used by developers. You can think of regular expressions as a mini programming language with one specific purpose: to locate substrings within a larger string. The technology is not new; it originated in the UNIX environment and is commonly used with the Perl programming language. Microsoft ported it onto Windows, where until now it has been used mostly with scripting languages. Today, however, regular expressions are supported by a number of .NET classes in the namespace System.Text.RegularExpressions. Regular expressions are also used in various parts of the .NET Framework; for instance, the ASP.NET validation server controls rely on them.

If you are not familiar with the regular expressions language, this section gives a very basic introduction to both regular expressions and their related .NET classes. If you are already familiar with regular expressions, you'll probably want to just skim through this section to pick out the references to the .NET base classes. You might like to know that the .NET regular expression engine is designed to be mostly compatible with Perl 5 regular expressions, although it has a few extra features.

Introduction to Regular Expressions

The regular expressions language is designed specifically for string processing. It contains two features:

A set of escape codes for identifying specific types of characters. You will be familiar with the use of the * character to represent any substring in DOS expressions. (For example, the DOS command Dir Re* lists the files with names beginning with Re.) Regular expressions use many sequences like this to represent items such as any one character, a word break, one optional character, and so on.
A system for grouping parts of substrings and intermediate results during a search operation.

With regular expressions, you can perform quite sophisticated and high-level operations on strings. For example, you can

Identify (and perhaps either flag or remove) all repeated words in a string (for example, "The computer books books" to "The computer books")
Convert all words to title case (for example, "this is a Title" to "This Is A Title")
Convert all words longer than three characters to title case (for example, "this is a Title" to "This is a Title")
Ensure that sentences are properly capitalized
Separate the various elements of a URI (for example, given http://www.wrox.com, extract the protocol, computer name, file name, and so on)

Of course, all of these tasks can be performed in C# using the various methods on System.String and System.Text.StringBuilder. However, in some cases, this would involve writing a fair amount of C# code. If you use regular expressions, this code can normally be compressed to just a couple of lines. Essentially, you instantiate a System.Text.RegularExpressions.Regex object (or, even simpler, invoke one of the static Regex methods), pass it the string to be processed, and pass in a regular expression (a string containing the instructions in the regular expressions language), and you're done.
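As a minimal sketch of the first task in the list above, the following uses the static Regex.Replace method to collapse repeated words. The class name, method name, and the pattern itself are illustrative assumptions rather than anything prescribed by this section; the pattern treats a "repeated word" as a word followed by one or more whitespace-separated copies of itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedWords
{
    // Assumed pattern (not from the text): a whole word, captured as group 1,
    // followed by one or more repeats of that same word separated by whitespace.
    private const string Pattern = @"\b(\w+)(\s+\1\b)+";

    public static string Collapse(string input)
    {
        // Static call: no Regex instance is created explicitly.
        // "$1" in the replacement keeps only the first copy of the word.
        return Regex.Replace(input, Pattern, "$1", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("The computer books books"));
    }
}
```

The same result with System.String methods alone would need a loop over the split words and some bookkeeping for the previous word, which is the kind of code the pattern replaces.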

A regular expression string looks at first sight rather like a regular string, but it is interspersed with escape sequences and other characters that have a special meaning. For example, the sequence \b indicates the beginning or end of a word (a word boundary), so if you wanted to look for the characters th at the beginning of a word, you would search for the regular expression \bth (that is, the sequence word boundary-t-h). If you wanted to search for all occurrences of th at the end of a word, you would write th\b (the sequence t-h-word boundary). However, regular expressions are much more sophisticated than that and include, for example, facilities to store portions of text that are found in a search operation. This section merely scratches the surface of the power of regular expressions.
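The two word-boundary patterns just described can be tried out with Regex.Matches. This is only a sketch: the sample sentence and the helper name CountMatches are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: counts how often a pattern matches in the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // Made-up sample text containing th at the start, middle, and end of words.
        string sample = "the path to both themes";

        // \bth matches th only where a word begins ("the", "themes").
        foreach (Match m in Regex.Matches(sample, @"\bth"))
        {
            Console.WriteLine(@"\bth matched at index " + m.Index);
        }

        // th\b matches th only where a word ends ("path", "both").
        foreach (Match m in Regex.Matches(sample, @"th\b"))
        {
            Console.WriteLine(@"th\b matched at index " + m.Index);
        }
    }
}
```

Note the use of verbatim strings (the @ prefix), which stops the C# compiler from interpreting \b as its own escape sequence before the regular expression engine sees it.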

Suppose your application needed to convert U.S. phone numbers to an international format. In the United States, phone numbers have the format 314-123-1234, which is often written as (314) 123-1234. When converting this national format to an international format, you have to include +1 (the country code of the United States) and add parentheses around the area code: +1 (314) 123-1234. As find-and-replace operations go, that's not too complicated, but it would still require some coding effort if you were restricted to the methods available on System.String. The regular expressions language allows you to construct a short string that achieves the same result.
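One way the phone number conversion might look is sketched below. It assumes the input uses only the dashed form 314-123-1234; the class name, method name, and pattern are illustrative choices, and a production version would need to handle the other notations as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneFormatter
{
    // Assumed pattern: three digits, dash, three digits, dash, four digits,
    // each digit group captured so it can be reused in the replacement.
    private const string NationalPattern = @"\b(\d{3})-(\d{3})-(\d{4})\b";

    public static string ToInternational(string input)
    {
        // $1 is the area code, $2 and $3 are the remaining digit groups.
        return Regex.Replace(input, NationalPattern, "+1 ($1) $2-$3");
    }

    public static void Main()
    {
        Console.WriteLine(ToInternational("Call 314-123-1234 today"));
    }
}
```

Because the pattern requires all three dash-separated groups, a number that is already in the +1 (314) 123-1234 form is left unchanged if the method is applied a second time.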

This section is intended only as a very simple example, so it concentrates on searching strings to identify certain substrings, not on modifying them.

The RegularExpressionsPlayaround Example

For the rest of this section, you develop a short example that illustrates some of the features of regular expressions and how to use the .NET regular expressions engine in C# by performing and displaying the results of some searches. The text you are going to use as your sample document is an introduction to a Wrox Press book on ASP.NET (Professional ASP.NET 2.0, ISBN 0-7645-7610-0):

string Text =
@"This comprehensive compendium provides a broad and thorough investigation of all
aspects of programming with ASP.NET. Entirely revised and updated for the 2.0
Release of .NET, this book will give you the information you need to master ASP.NET
and build a dynamic, successful, enterprise Web application.";