Working with Regular Expressions | Microsoft Visual C#.NET 2003 Kick Start

Regular expressions let you search and modify strings using a special pattern syntax, and C# has a great deal of support for them. Regular expressions are a language all their own, and not the easiest one to work with. Despite that, regular expressions are gaining popularity, and we'll see why as we work with them here.

A full treatment of creating regular expressions is beyond the scope of this book (this topic alone would take a complete chapter), but you can find many useful regular expressions already built into the Regular Expression Editor in the C# IDE. Here are some of the pre-built regular expressions you'll find there:

Internet Email Address : \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
Internet URL : http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
US Phone number : ((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}
German Phone Number : ((\(0\d\d\) |(\(0\d{3}\) )?\d )?\d\d \d\d \d\d|\(0\d{4}\) \d \d\d-\d\d?)
US Social Security Number : \d{3}-\d{2}-\d{4}
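As a rough sketch of how one of these pre-built patterns might be used from code, the following checks whole strings against the Social Security number and email patterns listed above, using the Regex class that this section introduces below. The class name, the helper names, and the sample inputs are made up for illustration, and the ^ and $ anchors are an added assumption so that the entire string has to match rather than just part of it.

```csharp
using System;
using System.Text.RegularExpressions;

static class PrebuiltPatternDemo
{
    // Patterns copied from the list above. The ^ and $ anchors are an
    // assumption added here so the whole input must match.
    const string SsnPattern = @"^\d{3}-\d{2}-\d{4}$";
    const string EmailPattern = @"^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$";

    // Hypothetical helpers: true when the entire string fits the pattern.
    public static bool IsSsn(string s)
    {
        return Regex.IsMatch(s, SsnPattern);
    }

    public static bool IsEmail(string s)
    {
        return Regex.IsMatch(s, EmailPattern);
    }

    static void Main()
    {
        Console.WriteLine(IsSsn("123-45-6789"));          // True
        Console.WriteLine(IsSsn("123-456-789"));          // False
        Console.WriteLine(IsEmail("steve@example.com"));  // True
        Console.WriteLine(IsEmail("not an address"));     // False
    }
}
```

These patterns check only the shape of the text; a string can pass and still not be a real address or number.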

CREATING REGULAR EXPRESSIONS

For more on regular expressions and how to create them, search for "Regular Expression Syntax" in the C# documentation in the IDE, or take a look at http://www.perldoc.com/perl5.8.0/pod/perlre.html. (C#'s regular expression handling is based on the Perl language's regular expression specification.)

As you can see, regular expressions are terse and pretty tightly packed. In the previous regular expressions, \w stands for a "word character" such as a letter, digit, or underscore, \d stands for a decimal digit, . matches any character except a newline (unless it's escaped as \., in which case it simply stands for a dot), + stands for "one or more of," * stands for "zero or more of," and so on.
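To make these tokens concrete, here is a small sketch that applies each one to a short string and prints the first piece of text it matches. The class name, the FirstMatch helper, and the sample strings are hypothetical; Regex.Match is the static method of the Regex class that this section covers below.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetacharacterDemo
{
    // Hypothetical helper: returns the first text matched by pattern,
    // or an empty string when nothing matches.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        // \w+ : one or more word characters (letters, digits, underscore)
        Console.WriteLine(FirstMatch(@"\w+", "order_42 shipped"));  // order_42
        // \d+ : one or more decimal digits
        Console.WriteLine(FirstMatch(@"\d+", "order_42 shipped"));  // 42
        // .   : any single character, so "a-c" is the first match
        Console.WriteLine(FirstMatch(@"a.c", "a-c a.c"));           // a-c
        // \.  : a literal dot only
        Console.WriteLine(FirstMatch(@"a\.c", "a-c a.c"));          // a.c
        // b*  : zero or more b characters, so a lone "a" is enough
        Console.WriteLine(FirstMatch(@"ab*", "a abbb"));            // a
        // b+  : one or more b characters, so the lone "a" is skipped
        Console.WriteLine(FirstMatch(@"ab+", "a abbb"));            // abbb
    }
}
```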

USING THE IDE'S PRE-BUILT REGULAR EXPRESSIONS

How do you open the IDE's Regular Expression Editor? You can do that by creating a Web application (see Chapter 8, "Creating C# Web Applications"). Then, add a RegularExpressionValidator control from the toolbox to a Web form, and click the control's ValidationExpression property in the Properties window, which opens the Regular Expression Editor. This editor displays many pre-built regular expressions, including the ones discussed here, ready and available for use.

Let's take a look at an example. Say that you wanted to pick all the lowercase words out of a string.

Using Regular Expression Matches

Say that we had some sample text, "Here is the text!", and we wanted to pull out all the words made up only of lowercase letters. To match such words, you start with the regular expression \b token, which stands for a word boundary: the transition in a string from a word character (like a or d) to a non-word character (like a space or punctuation mark), or vice versa. You can also create a character class, which lets you match a character from a set of characters using square brackets. For example, the character class [a-z] matches any lowercase letter, from a to z.

So to match lowercase words, we can use the regular expression \b[a-z]+\b. The + operator means "one or more of," so this is a word boundary, followed by one or more lowercase letters, followed by a word boundary.

Now we have the regular expression, so how do we use it? Using the System.Text.RegularExpressions.Regex class, we can create a new Regex object, which holds the regular expression:

    using System.Text.RegularExpressions;
    class ch02_25
    {
      static void Main()
      {
        string text = "Here is the text!";
        Regex regularExpression = new Regex(@"\b[a-z]+\b");
        .
        .
        .

Next, we can pass the regular expression object's Matches method the text we want to search, and this method will return an object of the MatchCollection class, holding the text that matched our regular expression:

      static void Main()
      {
        string text = "Here is the text!";
        Regex regularExpression = new Regex(@"\b[a-z]+\b");
        MatchCollection matches = regularExpression.Matches(text);
        .
        .
        .

This MatchCollection object contains the matches in text to our regular expression. All we have to do now is to loop over that object with a foreach loop to display those matches, as you see in ch02_25.cs, Listing 2.25.

Listing 2.25 Using Regular Expressions (ch02_25.cs)

    using System.Text.RegularExpressions;
    class ch02_25
    {
      static void Main()
      {
        string text = "Here is the text!";
        // \b[a-z]+\b matches whole words made up only of lowercase letters
        Regex regularExpression = new Regex(@"\b[a-z]+\b");
        MatchCollection matches = regularExpression.Matches(text);
        // Display each non-empty match on its own line
        foreach (Match match in matches)
        {
          if (match.Length != 0)
          {
            System.Console.WriteLine(match);
          }
        }
      }
    }

Here's what you see when you run this code:

    C:\>ch02_25
    is
    the
    text

As you can see, the code found and displayed the lowercase words in the test string.
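To see what the two \b tokens contribute, the following sketch runs the pattern with and without them against the same sample text. The class name and the FindAll helper are hypothetical; the static Regex.Matches method is used here as a shortcut for creating a Regex object first.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordBoundaryDemo
{
    // Hypothetical helper: collects the text of every match into an array.
    public static string[] FindAll(string pattern, string input)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Value;
        }
        return results;
    }

    static void Main()
    {
        string text = "Here is the text!";
        // Without \b, the lowercase tail of "Here" is matched as well.
        Console.WriteLine(string.Join(",", FindAll(@"[a-z]+", text)));      // ere,is,the,text
        // With \b, only complete lowercase words are matched.
        Console.WriteLine(string.Join(",", FindAll(@"\b[a-z]+\b", text)));  // is,the,text
    }
}
```

Without the boundaries, [a-z]+ picks up "ere" from "Here", because nothing requires the match to begin where a word begins.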

Using Regular Expression Groups

Regular expressions can also use groups to perform multiple matches in the same text string. For example, take a look at this string:

 "Order number: 1234 Customer number: 5678"

Say that you wanted to pick out the order number (1234) and customer number (5678) from this text. You can match a digit with \d, so to match a four-digit number, you use \d\d\d\d.

Here's a regular expression that will match the string (note that in regular expressions, ordinary characters match themselves, so "Order number" in the regular expression will match "Order number" in the text string, and so on):

 "Order number: (\d\d\d\d) Customer number: (\d\d\d\d)"

Note the parentheses around the four-digit numbers; these create match groups. The first group match will hold the order number and the second group match will hold the customer number. In the Perl language, there are various ways of accessing the text that a group matches; in C#, you can name the group. For example, including the directive ?<order> names a group order, so we can name our two groups order and customer like this:

 "Order number: (?<order>\d\d\d\d) Customer number: (?<customer>\d\d\d\d)"

After we apply this regular expression to the sample text, we need to recover the text that matched each group to get the order and customer numbers. You do that with a Match object's Groups property. For example, match.Groups["order"] will return the match to the order group. You can see this at work in ch02_26.cs, Listing 2.26, where we're recovering matches to the order and customer groups.

Listing 2.26 Using Regular Expression Groups (ch02_26.cs)

    using System.Text.RegularExpressions;
    class ch02_26
    {
      static void Main()
      {
        string text = "Order number: 1234 Customer number: 5678";
        // Two named groups, order and customer, each capture four digits
        Regex regularExpression =
          new Regex(@"Order number: (?<order>\d\d\d\d) " +
          @"Customer number: (?<customer>\d\d\d\d)");
        MatchCollection matches = regularExpression.Matches(text);
        foreach (Match match in matches)
        {
          if (match.Length != 0)
          {
            // Look up the text each group matched by the group's name
            System.Console.WriteLine("Order number: {0}",
              match.Groups["order"]);
            System.Console.WriteLine("Customer number: {0}",
              match.Groups["customer"]);
          }
        }
      }
    }

Here's what you see when you run this code. As you can see, the code picked out the order and customer numbers:

    C:\>ch02_26
    Order number: 1234
    Customer number: 5678
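Groups do not have to be named. As a minimal sketch, the unnamed version of the pattern shown earlier can be read back by position instead: Groups[0] is the entire match, and the parenthesized groups are numbered from 1 in the order their opening parentheses appear. The class name and the ParseOrder helper are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class NumberedGroupDemo
{
    // Hypothetical helper: returns { order, customer }, or null when the
    // text does not match the pattern.
    public static string[] ParseOrder(string text)
    {
        Match match = Regex.Match(text,
            @"Order number: (\d\d\d\d) Customer number: (\d\d\d\d)");
        if (!match.Success)
        {
            return null;
        }
        // Groups[0] is the whole match; Groups[1] and Groups[2] are the
        // two parenthesized groups, counted from the left.
        return new string[] { match.Groups[1].Value, match.Groups[2].Value };
    }

    static void Main()
    {
        string[] numbers = ParseOrder("Order number: 1234 Customer number: 5678");
        Console.WriteLine("Order number: {0}", numbers[0]);     // Order number: 1234
        Console.WriteLine("Customer number: {0}", numbers[1]);  // Customer number: 5678
    }
}
```

Named groups tend to be easier to maintain, because adding another pair of parentheses to the pattern shifts the numbers but not the names.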

Using Capture Collections

You can even use the same group name multiple times in the same regular expression. For example, take a look at this text:

 "Order, customer numbers: 1234, 5678"

Say that you wanted to get the two four-digit numbers here. In this case, you could use this regular expression with two named groups, both of which are named number:

 "(?<number>\d\d\d\d), (?<number>\d\d\d\d)"

Now when you try to display the matches to the number group, you might use this code:

    Regex regularExpression =
      new Regex(@"(?<number>\d\d\d\d), (?<number>\d\d\d\d)");
    MatchCollection matches = regularExpression.Matches(text);
    foreach (Match match in matches)
    {
      if (match.Length != 0)
      {
        System.Console.WriteLine(match.Groups["number"]);
      }
    }

The problem here is that when you run this code, you'll see only this result, which is the second of the two numbers we're looking for:

    5678

The number group reports only its last capture, so the second group match in effect overwrote the first. To find the matches to both groups even though they have the same name, you can use the Captures collection, which contains all the matches to groups with a particular name. In this case, match.Groups["number"].Captures will return a CaptureCollection of Capture objects holding all the matches to the number group. All we have to do is loop over that collection and display the matches we've found. You can see how this works in ch02_27.cs, Listing 2.27, which uses two groups named number in the same regular expression and displays the matches to both groups.

Listing 2.27 Using Regular Expression Capture Groups (ch02_27.cs)

    using System.Text.RegularExpressions;
    class ch02_27
    {
      static void Main()
      {
        string text = "Order, customer numbers: 1234, 5678";
        // Both groups share the name number
        Regex regularExpression =
          new Regex(@"(?<number>\d\d\d\d), (?<number>\d\d\d\d)");
        MatchCollection matches = regularExpression.Matches(text);
        foreach (Match match in matches)
        {
          if (match.Length != 0)
          {
            // Captures holds every match to the number group, in order
            foreach (Capture capture in
              match.Groups["number"].Captures)
            {
              System.Console.WriteLine(capture);
            }
          }
        }
      }
    }

Here are the results of this code, which picked up both matches:

    C:\>ch02_27
    1234
    5678
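The Captures collection also helps when you don't know in advance how many numbers the text holds. The sketch below is one possible variation, not taken from the listing above: it writes the number group once and repeats it with +, so each repetition adds another Capture. The class name, the FindNumbers helper, the three-number sample text, and the (?: ) non-capturing groups are all assumptions introduced for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureListDemo
{
    // Hypothetical helper: returns every four-digit number in a
    // comma-separated run such as "1234, 5678, 9012".
    public static string[] FindNumbers(string text)
    {
        // (?: ) groups without capturing. The number group sits inside a
        // repeated group, so every repetition is recorded as a Capture.
        Match match = Regex.Match(text, @"(?:(?<number>\d{4})(?:, )?)+");
        CaptureCollection captures = match.Groups["number"].Captures;
        string[] results = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            results[i] = captures[i].Value;
        }
        return results;
    }

    static void Main()
    {
        string text = "Order, customer, invoice numbers: 1234, 5678, 9012";
        foreach (string number in FindNumbers(text))
        {
            Console.WriteLine(number);  // prints 1234, 5678, 9012 on separate lines
        }
    }
}
```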

And that's it; now you're working with regular expressions in C#. We have some basic C# programming under our belts at this point; the next step, starting in Chapter 3, is where C# really starts to come into its own: object-oriented programming.