Regular Expressions

Chapter 5 - C# and the Base Classes

by Simon Robinson et al.
Wrox Press 2002

Regular expressions form one of those little technology areas that is incredibly useful in a wide range of programs, but despite that isn't really that widely known among developers. It could almost be thought of as a mini programming language with one specific purpose: to locate substrings within a large string expression. It is not a new technology; it originated in the UNIX environment, and is commonly used with Perl. Microsoft ported it onto Windows, where it has up until now mostly been used with scripting languages. Regular expressions are, however, supported by a number of .NET classes in the namespace System.Text.RegularExpressions .

Many readers will not be familiar with the regular expressions language, so we will use this section as a very basic introduction to both regular expressions and to the related .NET classes. If you are already familiar with regular expressions then you'll probably want to just skim through this section to pick out the references to the .NET base classes. You might like to note that the .NET regular expression engine is designed to be mostly compatible with Perl 5 regular expressions, though it has a few extra features.

Introduction to Regular Expressions

The regular expressions language is a language designed specifically for string processing. It broadly contains two features:

A set of escape codes for identifying types of characters . You will be familiar with the use of the * character to represent any substring in DOS expressions. (For example, the DOS command Dir Re* lists the names of files with names beginning Re. ) Regular expressions use many sequences like this to represent items such as 'any one character', 'a word break', 'one optional character', and so on.
A system for grouping parts of substrings and intermediate results during a search operation.

Using regular expressions, it is possible to perform quite sophisticated and high level operations on strings. For example, you could:

Identify (and perhaps either flag or remove) all repeated words in a string, for example, converting "The computer books books" to "The computer books"
Convert all words to title case, such as convert "this is a Title" to "This Is A Title".
Convert all words longer than three characters long to title case, for example convert "this is a Title" to "This is a Title"
Ensure that sentences are properly capitalized
Separate out the various elements of a URI (for example, given http://www.wrox.com , extract the protocol, computer name, file name , and so on)

These are all, of course, tasks that can be performed in C# using the various methods on System.String and System.Text.StringBuilder. However, in some cases, this would involve writing a fair amount of C# code. If you use regular expressions, this code can normally be compressed down to just a couple of lines. Essentially, you instantiate a System.Text.RegularExpressions.Regex object (or, even simpler, invoke one of the static methods of the Regex class), pass it the string to be processed and a regular expression (a string containing the instructions in the regular expressions language), and you're done.

A regular expression string looks at first sight rather like a normal string, but interspersed with escape sequences and other characters that have a special meaning. For example, the sequence \b indicates the beginning or end of a word (a word boundary), so if we wanted to indicate we were looking for the characters th at the beginning of a word, we would search for the regular expression, \bth . (that is, the sequence word boundary - t - h ). If we wanted to search for all occurrences of th at the end of a word, we would write th\b (the sequence t - h - word boundary). However, as we have hinted, regular expressions are much more sophisticated than that, and include, for example, facilities to store portions of text that are found in a search operation. In this section, we will merely scratch the surface of the power of regular expressions.
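As a minimal sketch of the two word-boundary patterns just described, the following counts the matches of \bth and th\b in a made-up sentence. The sample sentence, the class name WordBoundaryDemo, and the helper CountMatches() are illustrative assumptions, not part of the chapter's sample code; the static Regex.Matches() method it relies on is discussed later in this section.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordBoundaryDemo
{
    // Counts how many times the pattern matches in the input (case-sensitive).
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    static void Main()
    {
        // Hypothetical sample sentence, chosen only for illustration.
        string sample = "the north path is beneath this bath";

        // th at the start of a word: "the" and "this"
        Console.WriteLine(CountMatches(sample, @"\bth"));   // prints 2

        // th at the end of a word: "north", "path", "beneath", and "bath"
        Console.WriteLine(CountMatches(sample, @"th\b"));   // prints 4
    }
}
```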

As another example, suppose your application needed to convert UK phone numbers from national to international format. In the UK, national format would be something like 01233 345532, which would sometimes be written (01233) 345532. International format would mean this number should always be written +44 1233 345532, in other words the leading zero must be replaced by +44, and any brackets removed. As find and replace operations go, that's not too complicated, but would still require some coding effort if you were going to use the String class for this purpose (which would mean that you would have to write your code using the methods available on System.String ); you would need to locate any zeros that occur at the beginning of a number or immediately following a left bracket . Again, the regular expressions language allows us to construct a short string that will be interpreted to have this meaning.
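The phone number conversion described above might be sketched as follows. This is only one possible approach under stated assumptions: the input string holds exactly one number, and a national number is a leading zero plus four digits, optionally in brackets, followed by six digits. The pattern, the class name PhoneFormatDemo, and the helper ToInternational() are hypothetical; Regex.Replace() is the standard static method for find-and-replace operations.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhoneFormatDemo
{
    // Assumed shape of a national number: optional "(", a leading 0,
    // four digits, optional ")", optional whitespace, then six digits.
    const string NationalPattern = @"^\(?0(\d{4})\)?\s*(\d{6})$";

    // Rewrites a national-format number as +44 followed by the two digit groups.
    // Input that does not fit the assumed shape is returned unchanged.
    public static string ToInternational(string national)
    {
        return Regex.Replace(national, NationalPattern, "+44 $1 $2");
    }

    static void Main()
    {
        Console.WriteLine(ToInternational("01233 345532"));    // prints +44 1233 345532
        Console.WriteLine(ToInternational("(01233) 345532"));  // prints +44 1233 345532
    }
}
```

The $1 and $2 in the replacement string refer to the two bracketed groups of digits in the pattern; groups are covered later in this section.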

This section is intended only as a very simple example, so we will simply concentrate on searching strings to identify certain substrings, not on modifying them.

The RegularExpressionsPlayaround Example

For the rest of this section, we will develop a short sample that illustrates some of the features of regular expressions and how to use the .NET regular expressions engine in C# by performing and displaying the results of some searches. The text we are going to use as our sample 'document' to search through is the introduction to another Wrox Press book on XML (Professional XML 2nd Edition, ISBN 1-861005-05-9):

   string Text =
   @"XML has made a major impact in almost every aspect of
   software development. Designed as an open, extensible, self-describing
   language, it has become the standard for data and document delivery on
   the web. The panoply of XML-related technologies continues to develop
   at breakneck speed, to enable validation, navigation, transformation,
   linking, querying, description, and messaging of data.";

The above is valid C# code, despite all the line breaks. It nicely illustrates the utility of verbatim strings that are prefixed by the @ symbol.

We will refer to this text as the input string . To get our bearings and get used to the regular expressions .NET classes, we will start with a basic plain text search that doesn't feature any escape sequences or regular expression commands. Suppose that we want to find all occurrences of the string ion . We will refer to this search string as the pattern . Using regular expressions and the Text variable declared above, you could do it like this:

   string Pattern = "ion";
   MatchCollection Matches = Regex.Matches(Text, Pattern,
                                           RegexOptions.IgnoreCase |
                                           RegexOptions.ExplicitCapture);
   foreach (Match NextMatch in Matches)
   {
      Console.WriteLine(NextMatch.Index);
   }

In this code, we have used the static method Matches() of the Regex class in the System.Text.RegularExpressions namespace. This method takes as parameters some input text, a pattern, and a set of optional flags taken from the RegexOptions enumeration. In this case, we have specified that all searching should be case-insensitive. The other flag, ExplicitCapture , modifies the way that the match is collected in a way that, for our purposes, makes the search a bit more efficient - we will see why this is later (although it does have other uses that we won't explore here). Matches() returns a reference to a MatchCollection object. A match is the technical term for the results of finding an instance of the pattern in the expression. It is represented by the class System.Text.RegularExpressions.Match . Therefore, we return a MatchCollection that contains all the matches, each represented by a Match object. In the above code, we simply iterate over the collection, and use the Index property of the Match class, which returns the index in the input text of where the match was found. When I ran this code, it found four matches.

So far, there is not really anything new here apart from some new .NET base classes. However, the power of regular expressions really comes from that pattern string. The reason is that the pattern string doesn't only have to contain plain text. As hinted at earlier, it can also contain what are known as metacharacters, which are special characters that give commands, as well as escape sequences, which work in much the same way as C# escape sequences. They are characters preceded by a backslash, \, and also have special meanings.

For example, suppose we wanted to find words beginning with n. We could use the escape sequence \b, which indicates a word boundary (a word boundary is just a point where an alphanumeric character precedes or follows a whitespace character, a punctuation symbol, or the start or end of the text). We would simply write this:

   string Pattern = @"\bn";
   MatchCollection Matches = Regex.Matches(Text, Pattern,
                                           RegexOptions.IgnoreCase |
                                           RegexOptions.ExplicitCapture);

Notice the @ character in front of the string. We want the \b to be passed to the .NET regular expressions engine at runtime - we don't want the backslash intercepted by a well-meaning C# compiler that thinks it's an escape sequence intended for itself! If we want to find words ending with the sequence ion , then we could do this:

   string Pattern = @"ion\b";

What if we want to find all words beginning with the letter n and ending with the sequence ion? This would pick out the one word 'navigation' from the above text. That's a little more complicated. We clearly need a pattern that begins with \bn and ends with ion\b, but what goes in the middle? We need to somehow tell the application that between the n and the ion there can be any number of characters as long as none of them are whitespace. In fact, the correct pattern looks like this:

   string Pattern = @"\bn\S*ion\b";

One thing you will get used to with regular expressions is seeing weird sequences of characters like this, but it actually works quite logically. The escape sequence \S indicates any character that is not a whitespace character. The * is called a quantifier. It means that the preceding character can be repeated any number of times, including zero times. The sequence \S* means "any number of characters as long as they are not whitespace characters". The above pattern will, therefore, match any single word that begins with n and ends with ion.
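To see the pattern at work, here is a small self-contained sketch that applies \bn\S*ion\b to the last sentence of the sample text. The class name NavigationDemo and the helper FindWords() are illustrative assumptions rather than part of the chapter's sample.

```csharp
using System;
using System.Text.RegularExpressions;

static class NavigationDemo
{
    // Returns every word that starts with n and ends with ion, ignoring case.
    public static string[] FindWords(string input)
    {
        MatchCollection matches =
            Regex.Matches(input, @"\bn\S*ion\b", RegexOptions.IgnoreCase);
        string[] words = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            words[i] = matches[i].Value;
        }
        return words;
    }

    static void Main()
    {
        string sentence =
            "The panoply of XML-related technologies continues to develop " +
            "at breakneck speed, to enable validation, navigation, " +
            "transformation, linking, querying, description, and messaging of data.";

        foreach (string word in FindWords(sentence))
        {
            Console.WriteLine(word);   // prints navigation
        }
    }
}
```

Because \S also matches punctuation, the pattern can run across punctuation that is not followed by whitespace; for text like the sample, where words are separated by spaces, this does not matter.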

The table shows some of the main special characters or escape sequences that you can use. It is not comprehensive, but a fuller list is available in the MSDN documentation:

Symbol	Meaning	Example	Examples that this will match
^	Beginning of input text	^B	B , but only if first character in text
$	End of the input text	X$	X , but only if last character in text
.	Any single character except the newline character (\n)	i.ation	isation , ization
*	Preceding character may be repeated 0 or more times	ra*t	rt , rat , raat , raaat , and so on
+	Preceding character may be repeated 1 or more times	ra+t	rat , raat , raaat and so on, (but not rt )
?	Preceding character may be repeated 0 or 1 times	ra?t	rt and rat only
\s	Any whitespace character	\sa	[space]a , \ta , \na ( \t and \n have the same meanings as in C#)
\S	Any character that isn't a whitespace	\SF	aF , rF , cF , but not \tF
\b	Word boundary	ion\b	any word ending in ion
\B	Any position that isn't a word boundary	\BX\B	any X in the middle of a word

If you want to actually search for one of the metacharacters, you can do so by escaping the corresponding character with a backslash. For example, . (a single period) means any single character other than the newline character, while \. means a dot.

You can request a match that contains alternative characters by enclosing them in square brackets. For example [1c] means one character that can be either 1 or c . If you wanted to search for any occurrence of the words map or man , you would use the sequence ma[np] . Within the square brackets, you can also indicate a range, for example [a-z] to indicate any single lower case letter, [A-E] to indicate any uppercase letter between A and E , or [0-9] to represent a single digit. If you want to search for an integer (that is, a sequence that contains only the characters 0 through 9), you could write [0-9]+ (note the use of the + character to indicate there must be at least one such digit, but there may be more than one - so this would match 9, 83, 854, and so on).
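The quantifiers, escaped metacharacters, and bracket expressions described above can be tried out with a short sketch like the following. The input strings, the class name CharacterClassDemo, and the helper JoinMatches() are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharacterClassDemo
{
    // Returns all matches of the pattern in the input, joined with commas.
    public static string JoinMatches(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return string.Join(",", values);
    }

    static void Main()
    {
        // ? allows zero or one a between r and t.
        Console.WriteLine(JoinMatches("rt rat raat", "ra?t"));               // prints rt,rat

        // An unescaped . matches any character; \. matches only a dot.
        Console.WriteLine(JoinMatches("v1.0 and v1x0", "1.0"));              // prints 1.0,1x0
        Console.WriteLine(JoinMatches("v1.0 and v1x0", @"1\.0"));            // prints 1.0

        // Square brackets list the alternatives for a single character.
        Console.WriteLine(JoinMatches("a man drew a map on a mat", "ma[np]")); // prints man,map

        // A range plus the + quantifier picks out whole integers.
        Console.WriteLine(JoinMatches("9, 83 and 854", "[0-9]+"));           // prints 9,83,854
    }
}
```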

Displaying Results

Now that we have a flavor of what regular expressions are about, we will actually code up our RegularExpressionsPlayaround example. This is not really intended as a serious example of a real situation; it lets you set up a few regular expressions and displays the results so you can get a feel for how the regular expressions work.

The core of the example is a method called WriteMatches(), which writes out all the matches from a MatchCollection in a more detailed format. For each match, it displays the index of where the match was found in the input string, the string of the match, and a slightly longer string, which consists of the match plus up to ten surrounding characters from the input text: up to 5 characters before the match and up to 5 afterwards (fewer than 5 if the match occurred within 5 characters of the beginning or end of the input text). In other words, a match on the word messaging that occurs near the end of the input text quoted earlier would display " and messaging of d" (five characters before and after the match), but a match on the final word data would display "g of data." (only one character after the match), because after that we hit the end of the string. This longer string lets you see more clearly where the regular expression located the match:

   static void WriteMatches(string text, MatchCollection matches)
   {
      Console.WriteLine("Original text was: \n\n" + text + "\n");
      Console.WriteLine("No. of matches: " + matches.Count);
      foreach (Match nextMatch in matches)
      {
         int Index = nextMatch.Index;
         string result = nextMatch.ToString();
         int charsBefore = (Index < 5) ? Index : 5;
         int fromEnd = text.Length - Index - result.Length;
         int charsAfter = (fromEnd < 5) ? fromEnd : 5;
         int charsToDisplay = charsBefore + charsAfter + result.Length;
         Console.WriteLine("Index: {0}, \tString: {1}, \t{2}",
            Index, result,
            text.Substring(Index - charsBefore, charsToDisplay));
      }
   }

The bulk of the processing in this method is devoted to the logic of figuring out how many characters in the longer substring it can display without overrunning the beginning or end of the input text. Note that we call ToString() on the Match object, which returns the same string as its Value property: the text identified for the match. Other than that, RegularExpressionsPlayaround simply contains a number of methods with names like Find1, Find2, and so on, which perform some of the searches based on the examples in this section. For example, Find2 looks for any string that contains n at the beginning of a word:

   static void Find2()
   {
      string text = @"XML has made a major impact in almost every aspect of
      software development. Designed as an open, extensible, self-describing
      language, it has become the standard for data and document delivery on
      the web. The panoply of XML-related technologies continues to develop
      at breakneck speed, to enable validation, navigation, transformation,
      linking, querying, description, and messaging of data.";
      string pattern = @"\bn";
      MatchCollection matches = Regex.Matches(text, pattern,
         RegexOptions.IgnoreCase);
      WriteMatches(text, matches);
   }

Along with this is a simple Main() method that you can edit to select one of the Find<n>() methods:

   static void Main()
   {
      Find1();
      Console.ReadLine();
   }

The code also makes use of the RegularExpressions namespace:

   using System;
   using System.Text.RegularExpressions;

Running the example with the Find1() method as above gives these results:

   RegularExpressionsPlayaround
   Original text was:

   XML has made a major impact in almost every aspect of
   software development. Designed as an open, extensible, self-describing
   language, it has become the standard for data and document delivery on
   the web. The panoply of XML-related technologies continues to develop
   at breakneck speed, to enable validation, navigation, transformation,
   linking, querying, description, and messaging of data.

   No. of matches: 1
   Index: 364,     String: navigation,     ion, navigation, tra

Matches, Groups, and Captures

One nice feature of regular expressions is that you can group characters together. It works the same way as compound statements in C#. Recall that in C# you can group any number of statements together by putting them in braces, and the result is treated as one compound statement. In regular expression patterns, you can group any characters (including metacharacters and escape sequences) together, and the result is treated as a single character. The only difference is you use parentheses instead of braces. The resultant sequence is known as a group .

For example, the pattern (an)+ will locate any recurrences of the sequence an. The + quantifier applies only to the previous character, but because we have grouped the characters together, it now applies to repeats of an treated as a unit. This means that (an)+ applied to the input text, " bananas came to Europe late in the annals of history" will pick out the anan from bananas (and the single an at the start of annals). On the other hand, if we'd written an+, that would pick out the ann from annals, as well as two separate sequences of an from bananas. The expression (an)+ will pick out occurrences of an, anan, ananan, and so on, while the expression an+ will pick out occurrences of an, ann, annn, and so on.

You might wonder why, with the above example, (an)+ picks out anan from the word bananas, but doesn't identify either of the two occurrences of an from the same word. The rule is that matches must not overlap. The engine scans the text from left to right, and quantifiers are greedy by default, so where several overlapping possibilities exist, the longest possible sequence starting at the earliest position will be matched.
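The difference between (an)+ and an+ can be checked with a short sketch like this one, which runs both patterns over the sentence quoted above. The class name GroupQuantifierDemo and the helper JoinMatches() are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupQuantifierDemo
{
    // Returns all matches of the pattern in the input, joined with commas.
    public static string JoinMatches(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return string.Join(",", values);
    }

    static void Main()
    {
        string input = " bananas came to Europe late in the annals of history";

        // + applies to the whole group: repeats of "an" as a unit.
        Console.WriteLine(JoinMatches(input, "(an)+"));   // prints anan,an

        // + applies only to the n: an "a" followed by one or more "n".
        Console.WriteLine(JoinMatches(input, "an+"));     // prints an,an,ann
    }
}
```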

However, groups are actually more powerful than that. By default, when you form part of the pattern into a group, you are also asking the regular expression engine to remember any matches against just that group, as well as any matches against the entire pattern. In other words you are treating that group as a pattern to be matched and returned in its own right. This can actually be extremely useful if you want to break up strings into component parts.

For example, URIs have the format: <protocol>://<address>:<port>, where the port is optional. An example of this is http://www.wrox.com:4355. Suppose you want to extract the protocol, the address, and the port from a URI, where you know that there may or may not be whitespace (but no punctuation) immediately following the URI. You could do so using this expression:

   \b(\S+)://(\S+)(?::(\S+))?\b

This is the way the expression works. First, the leading and trailing \b sequences ensure that we only consider portions of text that are entire words. Within that, the first group, (\S+)://, will pick out one or more characters that don't count as whitespace, and which are followed by ://. This will pick out the http:// at the start of an HTTP URI. The brackets will cause the http to be stored as a group. The subsequent (\S+) will pick out expressions such as www.wrox.com in the above URI. This group is intended to end either when it hits the end of the word (the closing \b) or when it hits a colon (:) as marked by the next group. Be aware, though, that \S+ is greedy and a colon is itself a non-whitespace character, so when a port is present this group will in practice absorb the :4355 as well; writing the address group as ([^\s:]+) keeps the colon out of it.

The next group is intended to pick out the port ( :4355 in our case). The following ? indicates that this group is optional in the match - if there is no :xxxx then this won't prevent a match from being marked.

That's very important as the port number isn't always specified in a URI - in fact it is absent most of the time. However, things are a bit more complicated than that. We want to indicate that the colon might or might not appear too, but we don't want to store this colon in the group. We've achieved this by having two nested groups. The inner (\S+) will pick out anything that follows the colon (for example the 4355 in our example). The outer group contains the inner group preceded by the colon, and that in turn is preceded by the sequence ?: . This sequence indicates that the group in question should not be saved (we only want to save the 4355; we don't need the :4355 as well!). Don't get confused by the two colons following each other - the first is part of the ?: sequence that says 'don't save this group', and the second is text to be searched for.

If you run this pattern on this string:

 Hey I've just found this amazing URI at http:// what was it - oh yes http://www.wrox.com

you'll get one match: http://www.wrox.com. Within this match, there are the three groups just mentioned as well as a fourth group which represents the match itself. Theoretically, it is possible that each group itself might pick up no matches, one match, or more than one match. Each of these individual matches is known as a capture. So, the first group, (\S+), has one capture, http. The second group has one capture too, www.wrox.com, but the third group has no captures, since there is no port number on this URI.

Notice that the string contained a second http:// on its own. Although this does match up to our first group, it will not be picked out by the search because the entire search expression will not match this part of the text.
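The behavior just described can be reproduced with a sketch along the following lines, which uses the Groups collection of the Match class to read out each group. The class name UriGroupsDemo, the helper Describe(), and in particular the second pattern (which restricts the address to characters that are neither whitespace nor a colon) are assumptions added for illustration; only the first pattern comes from the text.

```csharp
using System;
using System.Text.RegularExpressions;

static class UriGroupsDemo
{
    // The pattern from the text, unchanged.
    public const string UriPattern = @"\b(\S+)://(\S+)(?::(\S+))?\b";

    // Hypothetical variant: the address may not contain a colon, so a
    // trailing :port is left for the third group to capture.
    public const string NoColonInAddress = @"\b(\S+)://([^\s:]+)(?::(\S+))?\b";

    // Formats protocol, address, and port of the first match in the input.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return "no match";
        }
        string port = m.Groups[3].Success ? m.Groups[3].Value : "(none)";
        return m.Groups[1].Value + " | " + m.Groups[2].Value + " | " + port;
    }

    static void Main()
    {
        string text = "Hey I've just found this amazing URI at http:// " +
                      "what was it - oh yes http://www.wrox.com";

        Console.WriteLine(Regex.Matches(text, UriPattern).Count);        // prints 1
        Console.WriteLine(Regex.Match(text, UriPattern).Groups.Count);   // prints 4
        Console.WriteLine(Describe(text, UriPattern));
        // prints http | www.wrox.com | (none)

        // With a port present, the greedy second group also takes the port.
        Console.WriteLine(Describe("http://www.wrox.com:4355", UriPattern));
        // prints http | www.wrox.com:4355 | (none)

        Console.WriteLine(Describe("http://www.wrox.com:4355", NoColonInAddress));
        // prints http | www.wrox.com | 4355
    }
}
```

Groups[0] is the match itself, which is why Groups.Count reports 4 even though the pattern contains only three capturing groups.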

We don't have space to show any examples of C# code that uses groups and captures, but we will mention that the .NET RegularExpressions classes support groups and captures, through classes known as Group and Capture. There are also the GroupCollection and CaptureCollection classes, which respectively represent collections of groups and captures. The Match class exposes a property, Groups, which returns the corresponding GroupCollection object. The Group class correspondingly exposes a property, Captures, which returns a CaptureCollection. The relationship between the objects is as shown in the diagram:

Returning a Group object every time you just want to group some characters together may not be what you want to do. There's a fair amount of overhead involved in instantiating the object, which is wasted if all you wanted was to group some characters together as part of your search pattern. You can disable this by starting the group with the character sequence ?: for an individual group, as we did for our URI example, or for all groups by specifying the RegexOptions.ExplicitCapture flag on the Regex.Matches() method, as we did in the earlier examples.
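As a rough illustration of the effect, the following sketch compares the number of groups returned for the (an)+ pattern with and without capturing, and also shows the captures recorded for a repeated group. The class name ExplicitCaptureDemo and the helper GroupCount() are hypothetical; Groups, Captures, and RegexOptions.ExplicitCapture are the members discussed above.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExplicitCaptureDemo
{
    // Returns how many groups (including group 0, the whole match) are reported.
    public static int GroupCount(string input, string pattern, RegexOptions options)
    {
        return Regex.Match(input, pattern, options).Groups.Count;
    }

    static void Main()
    {
        // Default: the bracketed group is captured alongside the whole match.
        Console.WriteLine(GroupCount("bananas", "(an)+", RegexOptions.None));            // prints 2

        // ?: switches off capturing for this one group.
        Console.WriteLine(GroupCount("bananas", "(?:an)+", RegexOptions.None));          // prints 1

        // ExplicitCapture switches it off for every unnamed group.
        Console.WriteLine(GroupCount("bananas", "(an)+", RegexOptions.ExplicitCapture)); // prints 1

        // A repeated group records one capture per repetition.
        Match m = Regex.Match("bananas", "(an)+");
        Console.WriteLine(m.Value);                      // prints anan
        Console.WriteLine(m.Groups[1].Captures.Count);   // prints 2
        Console.WriteLine(m.Groups[1].Value);            // prints an
    }
}
```

The Value of a group is its last capture, which is why Groups[1].Value is a single an even though the group matched twice.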