10.2 Regular Expressions | C# Programming: From Problem Analysis to Program Design

Regular expressions are a powerful language for describing and manipulating text. A regular expression is applied to a string, that is, to a sequence of characters. Often that string is an entire text document.

The result of applying a regular expression to a string is either to return a substring, or to return a new string representing a modification of some part of the original string. Remember that strings are immutable and so cannot be changed by the regular expression.

By applying a properly constructed regular expression to the following string:

One,Two,Three Liberty Associates, Inc.

you can return any or all of its substrings (e.g., Liberty or One), or modified versions of its substrings (e.g., LIBeRtY or OnE). What the regular expression does is determined by the syntax of the regular expression itself.
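As a minimal sketch of both outcomes, the following program extracts a substring and then builds a modified copy of the sample string. The class name SubstringDemo and the helper FirstMatch are hypothetical names chosen for this illustration; Regex.Match and Regex.Replace are the standard static methods.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstringDemo
{
    // Hypothetical helper: returns the first substring of input that the
    // pattern matches, or an empty string if nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string s1 = "One,Two,Three Liberty Associates, Inc.";

        // Return a substring of the original.
        Console.WriteLine(FirstMatch(s1, "Liberty"));

        // Return a modified version; Replace builds a new string.
        string modified = Regex.Replace(s1, "Liberty", "LIBeRtY");
        Console.WriteLine(modified);

        // The original string is unchanged, because strings are immutable.
        Console.WriteLine(s1);
    }
}
```

The last line prints the original text, which shows that the replacement produced a new string rather than altering s1.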

A regular expression consists of two types of characters: literals and metacharacters. A literal is a character you wish to match in the target string. A metacharacter is a special symbol that acts as a command to the regular expression parser. The parser is the engine responsible for understanding the regular expression. For example, if you create a regular expression:

^(From|To|Subject|Date):

this will match any substring with the letters "From," "To," "Subject," or "Date," so long as those letters start a new line (^) and end with a colon (:).

The caret (^) in this case indicates to the regular expression parser that the string you're searching for must begin a new line. The letters in "From" and "To" are literals. The left and right parentheses and the vertical bar (|) are metacharacters: the parentheses group the alternatives, and the vertical bar indicates that any one of the choices may match. (Note that ^ is a metacharacter as well, used to indicate the start of the line.)

Thus you would read this line:

^(From|To|Subject|Date):

as follows: "Match any string that begins a new line followed by any of the four literal strings From, To, Subject, or Date followed by a colon."
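The sketch below applies this pattern to a short, made-up mail message. One detail worth noting: in .NET, ^ matches only at the start of the whole string unless RegexOptions.Multiline is supplied, so the option is passed here to get the "begins a new line" behavior described above. The class name HeaderDemo, the helper FindHeaders, and the sample message are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderDemo
{
    // Hypothetical helper: returns every header name (with its colon)
    // that starts a line in the given text.
    public static List<string> FindHeaders(string text)
    {
        // Multiline makes ^ match at the start of each line,
        // not only at the start of the whole string.
        Regex headerRegex = new Regex(
            @"^(From|To|Subject|Date):", RegexOptions.Multiline);

        List<string> found = new List<string>();
        foreach (Match m in headerRegex.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // Made-up message; the "Date:" on the third line does not start the line.
        string message =
            "From: jesse@example.com\n" +
            "To: reader@example.com\n" +
            "Note: Date: appears mid-line here\n" +
            "Subject: Regular expressions\n";

        foreach (string header in FindHeaders(message))
        {
            Console.WriteLine(header);
        }
    }
}
```

The Date: on the third line is skipped because it does not begin the line, so only From:, To:, and Subject: are reported.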

A full explanation of regular expressions is beyond the scope of this book, but all the regular expressions used in the examples are explained. For a complete understanding of regular expressions, I highly recommend Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly).

10.2.1 Using Regular Expressions: Regex

The .NET Framework provides an object-oriented approach to regular expression matching and replacement.

C#'s regular expressions are based on Perl5 regexp, including lazy quantifiers (??, *?, +?, {n,m}?), positive and negative look ahead, and conditional evaluation.
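To make two of these features concrete, here is a small sketch contrasting a greedy quantifier with its lazy form, and showing positive and negative lookahead. The class name QuantifierDemo, the helper First, and the markup sample are hypothetical; conditional evaluation is not shown.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: value of the first match, or "" if none.
    public static string First(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string markup = "<b>one</b> and <b>two</b>";

        // Greedy: .* consumes as much as it can, up to the last </b>.
        Console.WriteLine(First(markup, "<b>.*</b>"));

        // Lazy: .*? consumes as little as it can, stopping at the first </b>.
        Console.WriteLine(First(markup, "<b>.*?</b>"));

        string s1 = "One,Two,Three Liberty Associates, Inc.";

        // Positive lookahead: a word that is followed by a comma
        // (the comma itself is not part of the match).
        Console.WriteLine(First(s1, @"\w+(?=,)"));

        // Negative lookahead: the first whole word NOT followed by a comma.
        Console.WriteLine(First(s1, @"\b\w+\b(?!,)"));
    }
}
```

The greedy pattern returns the whole markup string, while the lazy one returns only the first bold element. The lookahead patterns return One and Three, respectively.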

The Base Class Library namespace System.Text.RegularExpressions is the home to all the .NET Framework objects associated with regular expressions. The central class for regular expression support is Regex, which represents an immutable, compiled regular expression. Although instances of Regex can be created, the class also provides a number of useful static methods. The use of Regex is illustrated in Example 10-5.

Example 10-5. Using the Regex class for regular expressions

namespace Programming_CSharp
{
   using System;
   using System.Text;
   using System.Text.RegularExpressions;

   public class Tester
   {
      static void Main( )
      {
         string s1 =
            "One,Two,Three Liberty Associates, Inc.";
         Regex theRegex = new Regex(" |, |,");
         StringBuilder sBuilder = new StringBuilder( );
         int id = 1;

         foreach (string subString in theRegex.Split(s1))
         {
            sBuilder.AppendFormat(
               "{0}: {1}\n", id++, subString);
         }
         Console.WriteLine("{0}", sBuilder);
      }
   }
}

Output:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6: Inc.

Example 10-5 begins by creating a string, s1, that is identical to the string used in Example 10-4.

string s1 = "One,Two,Three Liberty Associates, Inc.";

It also creates a regular expression, which will be used to search that string:

Regex theRegex = new Regex(" |, |,");

One of the overloaded constructors for Regex takes a regular expression string as its parameter. This is a bit confusing. In the context of a C# program, which is the regular expression? Is it the text passed in to the constructor, or the Regex object itself? It is true that the text string passed to the constructor is a regular expression in the traditional sense of the term. From an object-oriented C# point of view, however, the argument to the constructor is just a string of characters; it is theRegex that is the regular expression object.

The rest of the program proceeds like the earlier Example 10-4, except that rather than calling Split( ) on string s1, the Split( ) method of Regex is called. Regex.Split( ) acts in much the same way as String.Split( ), returning an array of strings as a result of matching the regular expression pattern within theRegex.

Regex.Split( ) is overloaded. The simplest version is called on an instance of Regex, as shown in Example 10-5. There is also a static version of this method, which takes a string to search and the pattern to search with, as illustrated in Example 10-6.

Example 10-6. Using static Regex.Split( )

namespace Programming_CSharp
{
   using System;
   using System.Text;
   using System.Text.RegularExpressions;

   public class Tester
   {
      static void Main( )
      {
         string s1 =
            "One,Two,Three Liberty Associates, Inc.";
         StringBuilder sBuilder = new StringBuilder( );
         int id = 1;
         foreach (string subStr in Regex.Split(s1," |, |,"))
         {
            sBuilder.AppendFormat("{0}: {1}\n", id++, subStr);
         }
         Console.WriteLine("{0}", sBuilder);
      }
   }
}

Example 10-6 is identical to Example 10-5, except that it does not instantiate an object of type Regex. Instead, Example 10-6 uses the static version of Split( ), which takes two arguments: a string to search and a regular expression string that represents the pattern to match.

The instance method of Split( ) is also overloaded with versions that limit the number of times the split will occur and also determine the position within the target string where the search will begin.
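A brief sketch of these two overloads follows, using the same delimiter pattern as Example 10-5. The class name SplitOverloadDemo and the wrapper methods SplitAtMost and SplitFrom are hypothetical; they simply forward to the instance methods Split(string, int) and Split(string, int, int).

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitOverloadDemo
{
    // Same delimiter pattern as Example 10-5.
    private static readonly Regex Delimiters = new Regex(" |, |,");

    // Hypothetical wrapper: split into at most 'count' substrings.
    public static string[] SplitAtMost(string input, int count)
    {
        return Delimiters.Split(input, count);
    }

    // Hypothetical wrapper: as above, but begin looking for
    // delimiters at character position 'startat'.
    public static string[] SplitFrom(string input, int count, int startat)
    {
        return Delimiters.Split(input, count, startat);
    }

    public static void Main()
    {
        string s1 = "One,Two,Three Liberty Associates, Inc.";

        // At most three substrings; the last holds the unsplit remainder.
        foreach (string part in SplitAtMost(s1, 3))
        {
            Console.WriteLine(part);
        }

        // Start searching at index 8 (the 'T' of "Three"); the text
        // before that position stays in the first substring.
        foreach (string part in SplitFrom(s1, 2, 8))
        {
            Console.WriteLine(part);
        }
    }
}
```

With a count of 3, the third element is the untouched remainder "Three Liberty Associates, Inc."; with a start position of 8, the commas before "Three" are never treated as delimiters.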

10.2.2 Using Regex Match Collections

Two additional classes in the .NET RegularExpressions namespace allow you to search a string repeatedly, and to return the results in a collection. The collection returned is of type MatchCollection, which consists of zero or more Match objects. Two important properties of a Match object are its length and its value, each of which can be read as illustrated in Example 10-7.

Example 10-7. Using MatchCollection and Match

namespace Programming_CSharp
{
   using System;
   using System.Text.RegularExpressions;

   class Test
   {
      public static void Main( )
      {
         string string1 = "This is a test string";

         // find any nonwhitespace followed by whitespace
         Regex theReg = new Regex(@"(\S+)\s");

         // get the collection of matches
         MatchCollection theMatches =
            theReg.Matches(string1);

         // iterate through the collection
         foreach (Match theMatch in theMatches)
         {
            Console.WriteLine(
               "theMatch.Length: {0}", theMatch.Length);
            if (theMatch.Length != 0)
            {
               Console.WriteLine("theMatch: {0}",
                  theMatch.ToString( ));
            }
         }
      }
   }
}

Output:
theMatch.Length: 5
theMatch: This
theMatch.Length: 3
theMatch: is
theMatch.Length: 2
theMatch: a
theMatch.Length: 5
theMatch: test

Example 10-7 creates a simple string to search:

string string1 = "This is a test string";

and a trivial regular expression to search it:

Regex theReg = new Regex(@"(\S+)\s");

The string \S finds nonwhitespace, and the plus sign indicates one or more. The string \s (note lowercase) indicates whitespace. Thus, together, this string looks for any nonwhitespace characters followed by whitespace.

Remember the at (@) symbol before the string creates a verbatim string, which avoids the necessity of escaping the backslash (\) character.

The output shows that the first four words were found. The final word was not found because it is not followed by a space. If you insert a space after the word string and before the closing quotation marks, this program will find that word as well.

The Length property is the length of the captured substring, and is discussed in Section 10.2.4, later in this chapter.
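The lengths reported above include the trailing whitespace, because \s is part of the overall match. As a rough sketch, the program below prints each match's Index and Length alongside the text captured by the parenthesized subexpression (\S+), which excludes the whitespace. The class name MatchPropertiesDemo and the helper Words are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchPropertiesDemo
{
    // Hypothetical helper: returns the text captured by the first
    // parenthesized subexpression, (\S+), for every match.
    public static List<string> Words(string input)
    {
        Regex theReg = new Regex(@"(\S+)\s");
        List<string> words = new List<string>();
        foreach (Match theMatch in theReg.Matches(input))
        {
            // Groups[0] is the whole match; Groups[1] is (\S+).
            words.Add(theMatch.Groups[1].Value);
        }
        return words;
    }

    public static void Main()
    {
        string string1 = "This is a test string";
        Regex theReg = new Regex(@"(\S+)\s");

        foreach (Match theMatch in theReg.Matches(string1))
        {
            // Value includes the trailing whitespace; Groups[1] does not.
            Console.WriteLine("Index {0}, Length {1}, word '{2}'",
                theMatch.Index, theMatch.Length,
                theMatch.Groups[1].Value);
        }
    }
}
```

For the first match, Length is 5 (the four letters of This plus the space) while the captured word itself is four characters long.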

10.2.3 Using Regex Groups

It is often convenient to group subexpression matches together so that you can parse out pieces of the matching string. For example, you might want to match on IP addresses and group all IP addresses found anywhere within the string.

IP addresses are used to locate computers on a network, and typically have the form x.x.x.x, where each x is generally a number between 0 and 255 (such as 192.168.0.1).

The Group class allows you to create groups of matches based on regular expression syntax, and represents the results from a single grouping expression.

A grouping expression names a group and provides a regular expression; any substring matching the regular expression will be added to the group. For example, to create an ip group you might write:

@"(?<ip>(\d|\.)+)\s"

The Match class derives from Group, and has a collection called "Groups" that contains all the groups your Match finds.
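Before turning to the full example, here is a minimal sketch that applies just the ip grouping expression to a made-up sentence and reads the result through the Groups collection. The class name IpGroupDemo, the helper ExtractIp, and the sample input are assumptions for illustration. Note that the pattern is deliberately loose: it accepts any run of digits and dots followed by whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IpGroupDemo
{
    // Hypothetical helper: returns the text captured by the named group
    // "ip", or null when the pattern does not match.
    public static string ExtractIp(string input)
    {
        Match m = Regex.Match(input, @"(?<ip>(\d|\.)+)\s");
        if (!m.Success)
        {
            return null;
        }
        // The indexer on Groups accepts the group name as a string.
        return m.Groups["ip"].Value;
    }

    public static void Main()
    {
        // Made-up input text.
        Console.WriteLine(ExtractIp("Host 192.168.0.1 responded"));

        // No digits or dots followed by whitespace: no match.
        Console.WriteLine(ExtractIp("no address here") == null);
    }
}
```

The trailing \s is required for the match but sits outside the named group, so the value of the ip group is just 192.168.0.1.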

Creation and use of the Groups collection and Group classes is illustrated in Example 10-8.

Example 10-8. Using the Group class

namespace Programming_CSharp
{
    using System;
    using System.Text.RegularExpressions;

    class Test
    {
        public static void Main( )
        {
            string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com";

            // group time = one or more digits or colons followed by space
            Regex theReg = new Regex(@"(?<time>(\d|\:)+)\s" +
            // ip address = one or more digits or dots followed by space
            @"(?<ip>(\d|\.)+)\s" +
            // site = one or more characters
            @"(?<site>\S+)");

            // get the collection of matches
            MatchCollection theMatches = theReg.Matches(string1);

            // iterate through the collection
            foreach (Match theMatch in theMatches)
            {
                if (theMatch.Length != 0)
                {
                    Console.WriteLine("\ntheMatch: {0}",
                        theMatch.ToString( ));
                    Console.WriteLine("time: {0}",
                        theMatch.Groups["time"]);
                    Console.WriteLine("ip: {0}",
                        theMatch.Groups["ip"]);
                    Console.WriteLine("site: {0}",
                        theMatch.Groups["site"]);
                }
            }
        }
    }
}

Again, Example 10-8 begins by creating a string to search:

string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com";

This string might be one of many recorded in a web server log file or produced as the result of a search of the database. In this simple example, there are three columns: one for the time of the log entry, one for an IP address, and one for the site, each separated by spaces. Of course, in an example solving a real-life problem, you might need to do more complex searches and choose to use other delimiters.

In Example 10-8, we want to create a single Regex object to search strings of this type and break them into three groups: time, ip address, and site. The regular expression string is fairly simple, so the example is easy to understand. However, keep in mind that in a real search, you would probably only use a part of the source string rather than the entire source string, as shown here:

// group time = one or more digits or colons
// followed by space
Regex theReg = new Regex(@"(?<time>(\d|\:)+)\s" +
// ip address = one or more digits or dots
// followed by space
@"(?<ip>(\d|\.)+)\s" +
// site = one or more characters
@"(?<site>\S+)");

Let's focus on the characters that create the group:

(?<time>(\d|\:)+)

The parentheses create a group. Everything between the opening parenthesis (just before the question mark) and the closing parenthesis (in this case, after the + sign) belongs to a single group, which would be unnamed without the ?<time> prefix.

The string ?<time> names that group time, and the group is associated with the text matched by the subexpression (\d|\:)+, which can be interpreted as "one or more digits or colons." The \s that follows the group requires a whitespace character after the time, but that character is not part of the captured text.

Similarly, the string ?<ip> names the ip group, and ?<site> names the site group. As Example 10-7 does, Example 10-8 asks for a collection of all the matches:

MatchCollection theMatches = theReg.Matches(string1);

Example 10-8 iterates through the Matches collection, finding each Match object.

If the Length of the Match is greater than 0, a Match was found; it prints the entire match:

Console.WriteLine("\ntheMatch: {0}",     theMatch.ToString( ));

Here's the output:

theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com

It then gets the "time" group from theMatch.Groups collection and prints that value:

Console.WriteLine("time: {0}",     theMatch.Groups["time"]);

This produces the output:

time: 04:03:27

The code then obtains ip and site groups:

Console.WriteLine("ip: {0}",
    theMatch.Groups["ip"]);
Console.WriteLine("site: {0}",
    theMatch.Groups["site"]);

This produces the output:

ip: 127.0.0.0
site: LibertyAssociates.com

In Example 10-8, the Matches collection has only one Match. It is possible, however, to match more than one expression within a string. To see this, modify string1 in Example 10-8 to provide several logFile entries instead of one, as follows:

string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com " +
    "04:03:28 127.0.0.0 foo.com " +
    "04:03:29 127.0.0.0 bar.com " ;

This creates three matches in the MatchCollection, called theMatches. Here's the resulting output:

theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
time: 04:03:27
ip: 127.0.0.0
site: LibertyAssociates.com

theMatch: 04:03:28 127.0.0.0 foo.com
time: 04:03:28
ip: 127.0.0.0
site: foo.com

theMatch: 04:03:29 127.0.0.0 bar.com
time: 04:03:29
ip: 127.0.0.0
site: bar.com

In this example, theMatches contains three Match objects. Each time through the foreach loop, we find the next Match in the collection and display its contents:

foreach (Match theMatch in theMatches)

For each of the Match items found, you can print out the entire match, various groups, or both.
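Rather than printing the groups directly, a program will often copy them into its own data structure. The following sketch collects the three groups of every match into a small LogEntry class. LogEntry, LogParserDemo, and Parse are hypothetical names introduced here; they are not part of the .NET Framework or of Example 10-8.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed log line.
public class LogEntry
{
    public string Time;
    public string Ip;
    public string Site;
}

public static class LogParserDemo
{
    // Same pattern as Example 10-8.
    private static readonly Regex LogRegex = new Regex(
        @"(?<time>(\d|\:)+)\s" +
        @"(?<ip>(\d|\.)+)\s" +
        @"(?<site>\S+)");

    // Hypothetical helper: one LogEntry per Match in the input.
    public static List<LogEntry> Parse(string input)
    {
        List<LogEntry> entries = new List<LogEntry>();
        foreach (Match theMatch in LogRegex.Matches(input))
        {
            LogEntry entry = new LogEntry();
            entry.Time = theMatch.Groups["time"].Value;
            entry.Ip = theMatch.Groups["ip"].Value;
            entry.Site = theMatch.Groups["site"].Value;
            entries.Add(entry);
        }
        return entries;
    }

    public static void Main()
    {
        string string1 =
            "04:03:27 127.0.0.0 LibertyAssociates.com " +
            "04:03:28 127.0.0.0 foo.com " +
            "04:03:29 127.0.0.0 bar.com ";

        foreach (LogEntry entry in Parse(string1))
        {
            Console.WriteLine("{0} | {1} | {2}",
                entry.Time, entry.Ip, entry.Site);
        }
    }
}
```

Keeping the Regex in a static field means the pattern is parsed once and reused for every call to Parse.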

10.2.4 Using CaptureCollection

Each time a Regex object matches a subexpression, a Capture instance is created and added to a CaptureCollection collection. Each capture object represents a single capture. Each group has its own capture collection of the matches for the subexpression associated with the group.

A key property of the Capture object is its length, which is the length of the captured substring. When you ask Match for its length, it is Capture.Length that you retrieve, because Match derives from Group, which in turn derives from Capture.

The regular expression inheritance scheme in .NET allows Match to include in its interface the methods and properties of these parent classes. In a sense, a Group is-a capture: it is a capture that encapsulates the idea of grouping subexpressions. A Match, in turn, is-a Group: it is the encapsulation of all the groups of subexpressions making up the entire match for this regular expression. (See Chapter 5 for more about the is-a relationship and other relationships.)
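The inheritance chain can be checked directly with reflection, as in this short sketch. The class name InheritanceDemo is hypothetical; the types and members used are those of System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InheritanceDemo
{
    public static void Main()
    {
        // Match derives from Group, which derives from Capture.
        Console.WriteLine(typeof(Match).IsSubclassOf(typeof(Group)));
        Console.WriteLine(typeof(Group).IsSubclassOf(typeof(Capture)));

        Match theMatch = Regex.Match("This is a test", @"(\S+)\s");

        // Because a Match is-a Capture, it can be assigned to a
        // Capture variable without a cast.
        Capture asCapture = theMatch;

        // Length is declared on Capture, so both expressions read
        // the same property on the same object.
        Console.WriteLine(asCapture.Length == theMatch.Length);
    }
}
```

All three lines print True, confirming that the Length you read from a Match is the property it inherits from Capture.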

Typically, you will find only a single Capture in a CaptureCollection, but that need not be so. Consider what would happen if you were parsing a string in which the company name might occur in either of two positions. To group these together in a single match, create the ?<company> group in two places in your regular expression pattern:

Regex theReg = new Regex(@"(?<time>(\d|\:)+)\s" +
    @"(?<company>\S+)\s" +
    @"(?<ip>(\d|\.)+)\s" +
    @"(?<company>\S+)\s");

This regular expression group captures any matching string of characters that follows time, and also any matching string of characters that follows ip. Given this regular expression, you are ready to parse the following string:

string string1 = "04:03:27 Jesse 0.0.0.127 Liberty ";

The string includes names in both the positions specified. Here is the result:

theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty

What happened? Why is the Company group showing Liberty? Where is the first term, which also matched? The answer is that the second term overwrote the first. The group, however, has captured both, as its Captures collection demonstrates in Example 10-9.

Example 10-9. Examining the capture collection

namespace Programming_CSharp
{
   using System;
   using System.Text.RegularExpressions;

   class Test
   {
      public static void Main( )
      {
         // the string to parse
         // note that names appear in both
         // searchable positions
         string string1 =
            "04:03:27 Jesse 0.0.0.127 Liberty ";

         // regular expression which groups company twice
         Regex theReg = new Regex(@"(?<time>(\d|\:)+)\s" +
            @"(?<company>\S+)\s" +
            @"(?<ip>(\d|\.)+)\s" +
            @"(?<company>\S+)\s");

         // get the collection of matches
         MatchCollection theMatches =
            theReg.Matches(string1);

         // iterate through the collection
         foreach (Match theMatch in theMatches)
         {
            if (theMatch.Length != 0)
            {
               Console.WriteLine("theMatch: {0}",
                  theMatch.ToString( ));
               Console.WriteLine("time: {0}",
                  theMatch.Groups["time"]);
               Console.WriteLine("ip: {0}",
                  theMatch.Groups["ip"]);
               Console.WriteLine("Company: {0}",
                  theMatch.Groups["company"]);

               // iterate over the captures collection
               // in the company group within the
               // groups collection in the match
               foreach (Capture cap in
                  theMatch.Groups["company"].Captures)
               {
                  Console.WriteLine("cap: {0}",cap.ToString( ));
               }
            }
         }
      }
   }
}

Output:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
cap: Jesse
cap: Liberty

The inner foreach loop iterates through the Captures collection for the company group:

foreach (Capture cap in
    theMatch.Groups["company"].Captures)

Let's review how this line is parsed. The compiler begins by finding the collection that it will iterate over. theMatch is an object that has a collection named Groups. The Groups collection has an indexer that takes a string and returns a single Group object. Thus, the following line returns a single Group object:

theMatch.Groups["company"]

The Group object has a collection named Captures. Thus, the following line returns a Captures collection for the Group stored at Groups["company"] within the theMatch object:

theMatch.Groups["company"].Captures

The foreach loop iterates over the Captures collection, extracting each element in turn and assigning it to the local variable cap, which is of type Capture. You can see from the output that there are two capture elements: Jesse and Liberty. The second one overwrites the first in the group, and so the displayed value is just Liberty. However, by examining the Captures collection, you can find both values that were captured.
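Reusing a group name is one way to end up with several captures; another, shown in the sketch below, is placing a single group inside a repeated subexpression. The pattern, the group name item, the class name RepeatedCaptureDemo, and the helper CaptureValues are all assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RepeatedCaptureDemo
{
    // Hypothetical helper: the group "item" sits inside a repeated
    // noncapturing group (?: ... )+, so it captures once per repetition.
    public static List<string> CaptureValues(string input)
    {
        Match theMatch = Regex.Match(input, @"^(?:(?<item>\w+),?)+$");
        List<string> values = new List<string>();
        foreach (Capture cap in theMatch.Groups["item"].Captures)
        {
            values.Add(cap.Value);
        }
        return values;
    }

    public static void Main()
    {
        string input = "One,Two,Three";
        Match theMatch = Regex.Match(input, @"^(?:(?<item>\w+),?)+$");

        // The group's own value is only the last capture.
        Console.WriteLine("item: {0}", theMatch.Groups["item"].Value);

        // The Captures collection retains every capture, in order,
        // along with the position at which each was found.
        foreach (Capture cap in theMatch.Groups["item"].Captures)
        {
            Console.WriteLine("cap: {0} at index {1}",
                cap.Value, cap.Index);
        }
    }
}
```

As with the company group, the group's value reflects only the final capture (Three), while the Captures collection still holds One, Two, and Three.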