BCL Support


The System.Text.RegularExpressions namespace contains a small set of classes for working with regular expressions. Regex is the primary class with which you will work. Alongside it are the Match, Group, and Capture types, which hold detailed information about an attempt to match some input text. This section details their functionality and some of the options you can use when working with them.

Expressions

The Regex class is the core of regular expressions support in the .NET Framework. It hooks into all of the other functionality in the namespace. You've already seen it used in quite a few examples above, so this section only briefly details its mechanics and then swiftly moves on to some of its advanced capabilities.

To work with a regular expression, you normally create an instance of the Regex type. This is usually done via its single-argument public constructor: Regex(String pattern). This constructor sets up a new object that is able to interpret the provided pattern and perform matching based on it against arbitrary input text. This type is immutable; once you have an instance, it cannot be altered (e.g., to contain a different pattern). Regex overrides the ToString() method to return the pattern used during construction of the target instance. See below for more information on compiling regular expressions and how to use the results of a matching operation.

There is also a Regex constructor that takes a RegexOptions parameter. This flags-style enumeration enables you to pass in options to modify the behavior of the matcher. For example, the statement new Regex("...", RegexOptions.Compiled | RegexOptions.IgnoreCase) results in a regular expression that has both the compiled and ignore-case options turned on. Notice that you can also specify these using inline syntax rather than passing them to the regular expression constructor. This is described further below.

The RegexOptions enumeration offers the following values:

  • Compiled: Instructs the regular expression parser to generate executable IL for the matching of input. This results in more efficient matching performance at the cost of less performant Regex object construction. For more details, refer to the later section on expression compilation.

  • CultureInvariant: As described throughout this chapter, many of the regular expression features are culture sensitive and make use of the upper part of the Unicode character set. This is done by examining the Thread.CurrentThread.CurrentCulture instance of the System.Globalization.CultureInfo class. When you are certain that you won't be utilizing these capabilities, you can remove the need for the matcher to worry about this by specifying the CultureInvariant option. This can, in fact, improve performance. To learn more about culture-specific features in the .NET Framework, please refer to Chapter 8.

  • ECMAScript: By default, so-called canonical matching is used. For compatibility with ECMAScript formatted expressions, however, the ECMAScript option changes the behavior of this matching. This alters the way in which octal versus backreferences are interpreted and support for Unicode characters within character classes, among other things. More information on this mode can be found in the Platform SDK.

  • ExplicitCapture: Tells the matcher not to capture groups by default unless explicitly called out in the group itself, for example by using the (?<name>) syntax. This is appropriate for large expressions in which you have many groups but most of which you do not care to capture. This prevents you from having to specify (?:) for all of your noncaptured groups when this will be the majority case.

  • IgnoreCase: Causes the matcher to ignore case, in both the input text and the expression provided.

  • IgnorePatternWhitespace: Tells the regular expression parser to discard nonsignificant whitespace. This is defined as anything except whitespace within a character class, for example the space in [ \n\t], or escaped whitespace, as in \ . This allows you to format an expression in a more friendly manner (with comments even!) without disrupting its functionality.

  • Multiline: Enables multiple lines of input to be matched. Specifically, this enables both the beginning- and end-of-input characters ^ and $ to also match on the beginning and end of lines.

  • None: The default behavior of Regex. Has no effect when combined with other options.

  • RightToLeft: Requests that the regular expression matcher move from the right to left instead of performing a typical left-to-right search. For some patterns, this might result in less backtracking and, thus, a more efficient matching algorithm. This has no effect on whether a match will succeed or fail but could change the way that groups are captured with ambiguous expressions.

  • Singleline: Modifies the dot character class to match on newline characters.
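The effect of several of these options can be seen in a minimal sketch such as the following. The class and helper names (OptionsDemo, Count, GroupCount) and the sample inputs are illustrative assumptions, not part of the library:

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionsDemo
{
    // Counts the matches of a pattern in the input under the given options.
    public static int Count(string pattern, string input, RegexOptions options)
    {
        return new Regex(pattern, options).Matches(input).Count;
    }

    // Number of groups reported for a match, including the implicit group 0.
    public static int GroupCount(string pattern, string input, RegexOptions options)
    {
        return new Regex(pattern, options).Match(input).Groups.Count;
    }

    static void Main()
    {
        string lines = "alpha\nBeta\nbeta";

        // Multiline: ^ and $ anchor to each line instead of the whole input.
        Console.WriteLine(Count(@"^\w+$", lines, RegexOptions.None));      // 0
        Console.WriteLine(Count(@"^\w+$", lines, RegexOptions.Multiline)); // 3

        // Flags combine with bitwise OR.
        Console.WriteLine(Count(@"^beta$", lines,
            RegexOptions.Multiline | RegexOptions.IgnoreCase));            // 2

        // Singleline: the dot also matches \n.
        Console.WriteLine(Count("a.b", "a\nb", RegexOptions.None));        // 0
        Console.WriteLine(Count("a.b", "a\nb", RegexOptions.Singleline));  // 1

        // ExplicitCapture: only named groups are captured.
        Console.WriteLine(GroupCount(@"(\d+)-(?<ext>\d+)", "555-1234",
            RegexOptions.None));                                           // 3
        Console.WriteLine(GroupCount(@"(\d+)-(?<ext>\d+)", "555-1234",
            RegexOptions.ExplicitCapture));                                // 2

        // IgnorePatternWhitespace: layout and # comments are discarded.
        string zip = @"\d{5}      # five-digit code
                       (-\d{4})?  # optional four-digit extension";
        Console.WriteLine(Count(zip, "01925-4322",
            RegexOptions.IgnorePatternWhitespace));                        // 1
    }
}
```

Building a new Regex on every call keeps the sketch short; real code would normally construct each expression once and reuse it.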

Instead of providing a set of options via the RegexOptions type, you can do it inside an expression itself. This also enables you to make certain pieces of an expression adhere to different parsing rules. In some circumstances, an expression might cease to function correctly without certain options, in which case placing them right in the expression itself can be useful.

Options are indicated by the expression (?imnsx-imnsx:), where each letter is optional. Characters following the - character turn the specified option off, while those found before it (or without it) turn the option on. Each letter corresponds to one of the abovementioned options, only a subset of which is available inline. They map to the above options as follows: i corresponds to IgnoreCase, m to Multiline, n to ExplicitCapture, s to Singleline, and x to IgnorePatternWhitespace.

As an example, (?i:<group>) causes the <group> regular expression to be evaluated with the ignore-case option turned on. You may specify as few as one, and as many as all, of the characters to turn on specific options. For example, (?i-mx) is a valid pattern: it turns on IgnoreCase and turns off the Multiline and IgnorePatternWhitespace options. Similarly, (?-i:<group>) disables case insensitivity within the pattern <group>. Inline options affect only the group within which they are found. For instance, the pattern (?i:abc(?-i:\d+xyz)+) matches "abc" using case insensitivity, while the inner group matches "xyz" with case sensitivity. This means that the input "AbC123xyz" would successfully match, while "AbC123XyZ" would not.
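The scoping behavior just described can be checked with a short sketch. The class name InlineOptionsDemo and the "HELLO world" inputs are illustrative assumptions. The last two lines rely on the fact that an option written without a colon, such as (?i), applies from that point to the end of the enclosing group:

```csharp
using System;
using System.Text.RegularExpressions;

static class InlineOptionsDemo
{
    // Outer group ignores case; the inner group switches case sensitivity back on.
    public const string Pattern = @"(?i:abc(?-i:\d+xyz)+)";

    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        Console.WriteLine(Matches("AbC123xyz", Pattern)); // True
        Console.WriteLine(Matches("AbC123XyZ", Pattern)); // False

        // (?i) placed at the start affects the whole pattern...
        Console.WriteLine(Matches("HELLO world", @"(?i)hello world")); // True
        // ...but placed mid-pattern it leaves the text before it case sensitive.
        Console.WriteLine(Matches("HELLO world", @"hello (?i)world")); // False
    }
}
```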

Simple Matching

Once you've constructed a Regex object, you're ready to start working with input. You can simply test for a match with the IsMatch API, retrieve a Match object containing details about the matching process with the Match method (or a collection of them with the Matches method), replace occurrences of the pattern via Replace, or split the input at match boundaries using Split. We'll take a brief look at each mechanism now.

Testing for a Match

The most straightforward operation is simply to check whether a given expression matches some input. The IsMatch function takes an input string and returns true or false to indicate whether the expression matched the provided text:

    Regex r = new Regex(@"\d{5}(-\d{4})?");
    Console.WriteLine(r.IsMatch("01925-4322"));
    Console.WriteLine(r.IsMatch("01925"));
    Console.WriteLine(r.IsMatch("a9cd3"));

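This matches U.S. postal codes, in both their five-digit and nine-digit forms. Because the pattern is not anchored, it succeeds whenever such a code appears anywhere in the input. The result of executing this program will be "True", "True", and "False" output to the console.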

The IsMatch overload which takes an integer parameter startat is similar. But it only considers text starting from the startat position in the input string. Note that for RightToLeft expressions, the expression will consider everything to the left (or coming before) of the startat position.
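As a rough illustration of how startat narrows the search, consider the following sketch. The class name StartAtDemo, the helper HasFiveDigits, and the sample input are assumptions made for the example:

```csharp
using System;
using System.Text.RegularExpressions;

static class StartAtDemo
{
    // Tests for five consecutive digits, searching from the given position.
    public static bool HasFiveDigits(string input, int startat, RegexOptions options)
    {
        return new Regex(@"\d{5}", options).IsMatch(input, startat);
    }

    static void Main()
    {
        string input = "01925 is a zip";

        // Left to right: only text at or after startat is searched.
        Console.WriteLine(HasFiveDigits(input, 0, RegexOptions.None));        // True
        Console.WriteLine(HasFiveDigits(input, 3, RegexOptions.None));        // False

        // Right to left: only text before startat is searched.
        Console.WriteLine(HasFiveDigits(input, 5, RegexOptions.RightToLeft)); // True
        Console.WriteLine(HasFiveDigits(input, 3, RegexOptions.RightToLeft)); // False
    }
}
```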

Obtaining a Single Match

You will likely use the Match method the most when working with regular expressions. This returns a Match object that provides further details about what happened during the match process. Not only does it indicate success or failure (with the Boolean Match.Success instance property), but it also enables you to access groups and captured information. It additionally provides the ability to step through incremental matches with the NextMatch method. Notice that this method does not return null to indicate match failure but rather a Match object whose Success property evaluates to false.

As with the IsMatch method above, you can indicate the position in the input text at which to begin matching using the overload that accepts an integer startat.
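The following sketch shows what a Match reports for a successful and a failed attempt, and how startat can resume the search where a previous match ended. The names MatchDemo, First, and From, along with the sample strings, are illustrative assumptions:

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchDemo
{
    static readonly Regex Digits = new Regex(@"\d+");

    public static Match First(string input)
    {
        return Digits.Match(input);
    }

    public static Match From(string input, int startat)
    {
        return Digits.Match(input, startat);
    }

    static void Main()
    {
        string input = "order 42, item 7";

        Match m = First(input);
        Console.WriteLine("{0} '{1}' at {2}", m.Success, m.Value, m.Index);
        // True '42' at 6

        // Resume just past the first match.
        Match next = From(input, m.Index + m.Length);
        Console.WriteLine("{0} '{1}' at {2}", next.Success, next.Value, next.Index);
        // True '7' at 15

        // A failed match is still an object; check Success rather than null.
        Match none = First("no digits here");
        Console.WriteLine(none == null); // False
        Console.WriteLine(none.Success); // False
    }
}
```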

Obtaining Multiple Matches

The Matches method is similar to the Match method detailed above, except that it continues to apply the pattern matching until it has consumed all of the input or fails to match once, whichever occurs first. This is much like the way you'd use the NextMatch method found on Match, except that it encapsulates this very common idiom into a single operation.

This method returns a collection of Match objects within a MatchCollection instance, which implements the System.Collections.ICollection interface. This collection will lazily match input as you request instances from its contents, meaning that you pay for the incremental matching as you walk through the results.

This sample parses out sequences of digits from the input text:

    Regex r = new Regex(@"\d+");
    MatchCollection mm = r.Matches("532, 9322, 183, 0... 55, 67.");
    foreach (Match m in mm)
        Console.WriteLine(m.ToString());

As with the above methods, there is also an overload that takes a startat integer.

Replacing Matches

The Replace method searches for matches in the input text, replaces each with the supplied string argument (replacement), and returns the result as a string. This process is called substitution. The default behavior is to replace all occurrences of the specified pattern, although additional overloads are available with which to limit the number of occurrences to replace. For example, this code will replace occurrences of sequences of digits with the text "<num/>":

    string input = "53 cats went to town on the 13th of July, the year 2004.";
    Regex r = new Regex(@"\d+");
    Console.WriteLine(r.Replace(input, "<num/>"));

The result of running this program is that input is transformed into "<num/> cats went to town on the <num/>th of July, the year <num/>." and returned in the string. There are similar static methods available for Replace that obviate the need to construct throwaway Regex objects, as is the case for many of the other methods already mentioned.

You can also embed a number of special tokens within your replacement text. For example, you can use backreferences to insert captured text from the regular expression match. These references are specified using a dollar sign followed by the numbered or named capture. Enclosing the name in curly braces is required, while it is optional for numbered groups. For example, $3 and ${3} are both valid for referencing the third numbered capture, and ${tag} references a named capture "tag."

If you need to embed a literal $ in your replacement text, you must escape it by using $$. If you must follow a numbered capture by a literal number, you should enclose your numbered backreference in curly braces to prevent the replacement engine from interpreting it as a multi-digit backreference. For example, $24 would be interpreted as a reference to the 24th group, while ${2}4 is a reference to the 2nd group, followed by the literal character 4.

Here is an example of this feature:

    String input = "The date is October 31st, 2004.";
    Regex numberedCapture = new Regex(@"(\d+)");
    Console.WriteLine(numberedCapture.Replace(input, "<num>${1}</num>"));

The replacement here simply takes the numbers in the input and surrounds each with XML tags (<num>...</num>). Running this program results in the string "The date is October <num>31</num>st, <num>2004</num>."

This code illustrates the use of named captures in the replacement text:

    String input = "The date is October 31st, 2004.";
    Regex namedCapture =
        new Regex(@"(?<month>\w+) (?<day>\d+)(?<daysuffix>\w{2})?, (?<year>\d+)");
    Console.WriteLine(namedCapture.Replace(input,
        "the ${day}${daysuffix} of ${month} in the year of ${year}"));

The length of this makes it slightly more complex than the previous example, but the concepts are very similar. First, the regular expression breaks the same input up into month, day, daysuffix, and year components, using named captures for each. It then reorders the way in which these are found in the input, resulting in the string "The date is the 31st of October in the year of 2004."

Similarly, you can use any of the following special forms in the replacement text, causing the matcher to treat them specially. $& will insert a copy of the entire match (equivalent to match.Value), while $_ substitutes the entire input string. $` results in inserting all of the input leading up to, but not including, the matching text; $' similarly inserts the text following the match. Lastly, $+ will insert the very last group captured in the pattern.
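Each of these tokens can be tried against a single input to see what it inserts. The sketch below is only illustrative: the class name SubstitutionDemo, the helper Sub, and the input "abc123def" are assumptions. The pattern (\d)(\d+) matches "123" once, with group 1 holding "1" and group 2 holding "23":

```csharp
using System;
using System.Text.RegularExpressions;

static class SubstitutionDemo
{
    // Replaces the single match "123" in a fixed input using the given
    // replacement text, so each token's effect is easy to compare.
    public static string Sub(string replacement)
    {
        return Regex.Replace("abc123def", @"(\d)(\d+)", replacement);
    }

    static void Main()
    {
        Console.WriteLine(Sub("[$&]"));   // abc[123]def      (entire match)
        Console.WriteLine(Sub("$`"));     // abcabcdef        (text before the match)
        Console.WriteLine(Sub("$'"));     // abcdefdef        (text after the match)
        Console.WriteLine(Sub("$_"));     // abcabc123defdef  (entire input)
        Console.WriteLine(Sub("$+"));     // abc23def         (last group)
        Console.WriteLine(Sub("$$"));     // abc$def          (literal dollar sign)
        Console.WriteLine(Sub("${1}4"));  // abc14def         (group 1, then a literal 4)
    }
}
```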

Alternatively, if you need to perform more complex behavior, you can use the string Replace(string input, MatchEvaluator evaluator, ...) overloads, where evaluator is a delegate that takes a Match object and returns a string, which is used for replacement. The delegate signature is string MatchEvaluator(Match match). The evaluator is invoked once for each match found in the input text, and the matched text is replaced with whatever string it returns. This enables some very powerful constructs.

The code illustrated in Listing 13-2 demonstrates one possible use, replacing formatted text in some input with attributes on records retrieved from a database. This could be used in an enterprise content management system, for example, enabling content to reference data via patterns in the format of ${id:attribute}, where id is the primary key of a record and attribute is a field. Running such content through this regular expressions replacement routine causes these to be expanded to the real values retrieved from the database.

Listing 13-2: Regular expressions replacement

    public class CustomerEvaluator
    {
        private Dictionary<int, Customer> customers = new Dictionary<int, Customer>();

        private Customer LoadFromDb(int id)
        {
            if (customers.ContainsKey(id))
                return customers[id];
            Customer c = // look up customer from db
            customers.Add(id, c);
            return c;
        }

        public string Evaluate(Match match)
        {
            // Group names are case sensitive and must match the pattern in Process.
            Customer c = LoadFromDb(int.Parse(match.Groups["custid"].Value));
            switch (match.Groups["attrib"].Value)
            {
                case "name":
                    return c.Name;
                case "ssn":
                    return c.Ssn;
                case "company":
                    return c.Company;
                default:
                    throw new Exception("Invalid customer attribute found");
            }
        }

        public string Process(string input)
        {
            Regex r = new Regex(@"\$\{(?<custid>\d+):(?<attrib>\w+)}");
            return r.Replace(input, Evaluate);
        }
    }

    //...

    string input = "Customer ${1011:name} works at company ${1011:company}.";
    CustomerEvaluator ce = new CustomerEvaluator();
    Console.WriteLine(ce.Process(input));

The result of running this code is that the input string is expanded to contain the name and company of the record located using the primary key 1011. If the record's name and company fields were, say, "Jerry Smith" and "EMC", respectively, the result of executing ce.Process(input) would be "Customer Jerry Smith works at company EMC."
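For experimenting with the evaluator overload without a database, a self-contained variant along these lines may help. It is only a sketch: the TemplateExpander class, its in-memory Records dictionary, and the choice to leave unknown tokens untouched are all assumptions made for the example, not part of the listing above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TemplateExpander
{
    // Hypothetical stand-in for the database: "id:attribute" -> value.
    static readonly Dictionary<string, string> Records =
        new Dictionary<string, string>
        {
            { "1011:name", "Jerry Smith" },
            { "1011:company", "EMC" },
        };

    static readonly Regex Token =
        new Regex(@"\$\{(?<id>\d+):(?<attrib>\w+)\}");

    public static string Expand(string input)
    {
        // The lambda is converted to a MatchEvaluator and runs once per match.
        return Token.Replace(input, m =>
        {
            string key = m.Groups["id"].Value + ":" + m.Groups["attrib"].Value;
            string value;
            // Unknown tokens are left as they were instead of throwing.
            return Records.TryGetValue(key, out value) ? value : m.Value;
        });
    }

    static void Main()
    {
        Console.WriteLine(Expand(
            "Customer ${1011:name} works at company ${1011:company}."));
        // Customer Jerry Smith works at company EMC.
    }
}
```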

Splitting Strings Based on Matches

The string[] Split(string input, ...) method is an alternative to the Split(...) method found on the String class. It uses a regular expression to find delimiters in the text and returns an array of strings containing the individual elements found between these delimiters. If you use captured groups, the delimiter text that was matched is also placed into the resulting array. Otherwise, all that the array contains is the text between matches. As an illustration, pretend that we'd like to split some text using any sequence of whitespace characters as a delimiter:

    String input = "This is the text to split.";
    Regex r = new Regex(@"\s+");
    foreach (String s in r.Split(input))
        Console.WriteLine(s);

This simply prints each word in the input sentence to the console. Each word is represented by a single element of the array returned by the call to Split. Consider a similar situation where numbers are used to delimit the text. As illustrated in the following code, by default the numbers would not be included in the returned array (because they are the delimiters). But if you wish them to be in the array itself, you can capture the digits within a group:

    String input = "Some93text103to38276split.";
    foreach (String s in Regex.Split(input, @"\d+"))
        Console.WriteLine(s);
    foreach (String s in Regex.Split(input, @"(\d+)"))
        Console.WriteLine(s);

The array returned by the former of the two Split calls will not contain the digits, while the latter will.
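Joining the resulting arrays with a visible separator makes the difference easier to see. In this sketch the class name SplitDemo and the helper Show are illustrative assumptions. The last call also shows that a delimiter at the very start or end of the input produces an empty string in the array:

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Splits the input and joins the pieces with '|' so they are easy to read.
    public static string Show(string input, string pattern)
    {
        return string.Join("|", Regex.Split(input, pattern));
    }

    static void Main()
    {
        string input = "Some93text103to38276split.";

        Console.WriteLine(Show(input, @"\d+"));   // Some|text|to|split.
        Console.WriteLine(Show(input, @"(\d+)")); // Some|93|text|103|to|38276|split.

        // Delimiters at the edges yield empty leading and trailing elements.
        Console.WriteLine(Show("1a2", @"\d"));    // |a|
    }
}
```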

Static Helpers

The Regex class defines a set of static methods, IsMatch, Match, Matches, Replace, and Split, which can be used as shortcuts to instance methods demonstrated above. Each takes a string argument representing the regular expression pattern and constructs the Regex automatically for you.
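A brief sketch of the static forms follows. The class name StaticHelpersDemo, the helper IsZip, and the sample inputs are assumptions for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class StaticHelpersDemo
{
    // One-off validation without keeping a Regex instance around.
    public static bool IsZip(string text)
    {
        return Regex.IsMatch(text, @"^\d{5}(-\d{4})?$");
    }

    static void Main()
    {
        Console.WriteLine(IsZip("01925-4322"));                        // True
        Console.WriteLine(IsZip("a9cd3"));                             // False
        Console.WriteLine(Regex.Match("id=42", @"\d+").Value);         // 42
        Console.WriteLine(Regex.Matches("1, 22, 333", @"\d+").Count);  // 3
        Console.WriteLine(Regex.Replace("a1b22", @"\d+", "#"));        // a#b#
        Console.WriteLine(string.Join("|",
            Regex.Split("a, b,c", @",\s*")));                          // a|b|c

        // Overloads taking RegexOptions are available as well.
        Console.WriteLine(Regex.IsMatch("HELLO", "hello",
            RegexOptions.IgnoreCase));                                 // True
    }
}
```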

The Escape and Unescape methods take a string, create a modified copy with the standard metacharacters escaped or unescaped, respectively, and then return it. Refer to the earlier section on escaping for a complete list of characters needing escaping and exactly why you might want to:

    String patternToEscape = "Metachars: .*";
    String escaped = Regex.Escape(patternToEscape);
    Console.WriteLine(escaped);
    Console.WriteLine(Regex.Unescape(escaped));

This takes a string "Metachars: .*", which contains two meta-characters, . and *. It then uses the Escape method to escape this sequence, resulting in the text "Metachars:\ \.\*", which prevents the regular expression matcher from interpreting the characters as meta-characters. Lastly, it unescapes this text via Unescape, reverting it back to its original form "Metachars: .*".
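A common reason to call Escape is to search for text that arrives at runtime and should be treated literally. The sketch below assumes a hypothetical helper, ContainsLiteral, and sample strings chosen for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Treats 'literal' as plain text rather than as a pattern.
    public static bool ContainsLiteral(string input, string literal)
    {
        return Regex.IsMatch(input, Regex.Escape(literal));
    }

    static void Main()
    {
        string input = "so 1+1=2 holds";
        string literal = "1+1=2";

        // Unescaped, + is a quantifier: the pattern means "one or more 1s, then 1=2".
        Console.WriteLine(Regex.IsMatch(input, literal));   // False

        // Escaped, the text is matched character for character.
        Console.WriteLine(ContainsLiteral(input, literal)); // True
        Console.WriteLine(Regex.Escape(literal));           // 1\+1=2
    }
}
```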

Detailed Match Results

The results of a matching operation are captured in a set of Match, Group, and Capture objects. We saw above the methods that produce them, in addition to the simpler IsMatch, which returns only a Boolean to indicate success or failure. There is quite a bit you can do with these objects, including inspecting the individual components matched by an expression.

Match Objects

The Match type represents the results of an attempt to match a regular expression against some input text. You will obtain instances of this class through various methods on the Regex class, such as Match and Matches. Its primary purpose is to verify that an expression was successful, but it can also be used to inspect text captured while performing a match. For example, this code will get an instance of Match, regardless of whether a match was found or not:

    Regex r = new Regex(@"\d{3}-\d{2}-\d{4}");
    Match m = r.Match("001-01-0011");

The instance property Success indicates whether the expression matcher was able to find a match in the given input. In the above example, m.Success will return true because the expression does, in fact, match the test input. The Value property returns a string containing the portion of the input text that was matched by the pattern, unmodified. In the above example, this would return "001-01-0011". Had the input been "garbage001-01-0011garbage", Value would still be "001-01-0011"; in other words, it contains only the matched text, not the surrounding input. The Index and Length properties report where in the input that text was found. Match.ToString is overridden to return Value.

You can also access any captured groups using the Groups property. This returns an instance of GroupCollection, an enumerable and indexable ICollection of all of the groups captured from the given input. This information is stored as instances of the Group class, described in further detail below. This collection is indexable by number or string, depending on whether you are working with autonumbered or named groups, respectively.

Numbered groups can be accessed by the same numbering scheme applied by the regular expression matcher. The 0th group is the implicit group that contains the entire matched text (the same string as Value above). Likewise, to obtain a named group, you can pass a string containing its name to the indexer. The following code shows two variants on a pattern that matches comma-delimited name-value pairs, each written as a name, a colon, and a value; one uses numbered groups and the other uses named groups:

    Regex numbered = new Regex(@"(\w+):\s*(\w+)");
    Match numberedMatch = numbered.Match("Name: Joe, Company: Microsoft");
    Console.WriteLine("Field '{0}' = '{1}'",
        numberedMatch.Groups[1], numberedMatch.Groups[2]);

    Regex named = new Regex(@"(?<name>\w+):\s*(?<value>\w+)");
    Match namedMatch = named.Match("Name: Mark, Company: EMC");
    Console.WriteLine("Field '{0}' = '{1}'",
        namedMatch.Groups["name"], namedMatch.Groups["value"]);
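How Value, Index, Length, and the implicit group 0 relate to one another can be seen when the matched text is embedded in a longer input. The following is a small sketch; the class name MatchValueDemo, the helper FindDashedNumber, and the input are illustrative assumptions:

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchValueDemo
{
    public static Match FindDashedNumber(string input)
    {
        return Regex.Match(input, @"\d{3}-\d{2}-\d{4}");
    }

    static void Main()
    {
        Match m = FindDashedNumber("garbage011-01-0011garbage");

        // Value holds only the matched text, not the surrounding input.
        Console.WriteLine(m.Value);                      // 011-01-0011
        Console.WriteLine(m.Index);                      // 7
        Console.WriteLine(m.Length);                     // 11

        // Group 0 and ToString() both report the same text as Value.
        Console.WriteLine(m.Groups[0].Value == m.Value); // True
        Console.WriteLine(m.ToString() == m.Value);      // True
    }
}
```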

Lastly, the NextMatch method returns a new Match instance, the result of reapplying the pattern to the remaining input from where the previous match ended. Imagine if you wanted to use the pattern from the previous example to walk through all of the name-value pairs found in an entire body of text. You may have noticed that it only matched the "Name: ..." part of the input text but not the "Company: ..." part; to enumerate each pair in the text, you can use the NextMatch method:

    Match namedMatch = named.Match("Name: Bill, Company: EMC");
    while (namedMatch.Success)
    {
        Console.WriteLine("Field '{0}' = '{1}'",
            namedMatch.Groups["name"], namedMatch.Groups["value"]);
        namedMatch = namedMatch.NextMatch();
    }

This code continues matching and printing name-value pairs from the input string until it no longer finds a match, detected by examining the Success property on the Match returned by NextMatch.

Group and Capture Objects

The Group and Capture classes share many properties with the Match type. In fact, Match is a subclass of Group, which in turn is a subclass of Capture. Group exists so that you can test the success of matching individual groups of an expression and access the associated captured input. This is done in the same fashion as you would with a Match object, that is, by using the Success and Value properties, respectively. Additionally, there is a Captures property, similar to Groups found on Match, that returns an indexable collection of Capture instances.

The reason for having a collection of captures rather than just one may not be immediately obvious. If you have a group that is modified by a quantifier, it might actually capture multiple occurrences of that group. For example, the expression (\w+\s*)* captures a collection of words, where each word (with any trailing whitespace) is a separate capture of the same group. A more complex illustration can be created using a slight modification to the name-value pair expression from above:

    Regex named = new Regex(@"((?<name>\w+):\s*(?<value>\w+)[,\s]*)+",
        RegexOptions.ExplicitCapture);
    Match namedMatch = named.Match("Name: Bill, Company: EMC");
    Group nameGroup = namedMatch.Groups["name"];
    Group valueGroup = namedMatch.Groups["value"];
    for (int i = 0; i < nameGroup.Captures.Count; i++)
        Console.WriteLine("Field '{0}' = '{1}'",
            nameGroup.Captures[i].Value, valueGroup.Captures[i].Value);

Given that we captured multiple occurrences of the group (?<name>\w+), we need a way to access the text captured from both the "Name: ..." and "Company: ..." pairs. Accessing nameGroup.Value returns the last capture matched by the expression, while Captures enables you to access individual captures. Thus, nameGroup.Captures[0].Value will be "Name", taken from the "Name: ..." pair, while nameGroup.Captures[1].Value will be "Company", taken from the "Company: ..." pair.
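The difference between a group's Value and its Captures can be reduced to a smaller sketch. The class name CapturesDemo, the helper Words, and the input "one two three" are assumptions for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class CapturesDemo
{
    // The named group sits inside a repeated, noncapturing group, so it
    // records one capture per word.
    public static Group Words(string input)
    {
        return Regex.Match(input, @"(?:(?<word>\w+)\s*)+").Groups["word"];
    }

    static void Main()
    {
        Group g = Words("one two three");

        Console.WriteLine(g.Value);          // three (the last capture)
        Console.WriteLine(g.Captures.Count); // 3

        foreach (Capture c in g.Captures)
            Console.WriteLine("{0} at {1}", c.Value, c.Index);
        // one at 0
        // two at 4
        // three at 8
    }
}
```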

Compiled Expressions

There are three mechanisms with which the regular expressions library executes matches: interpretation, lazy compilation, or precompilation. Each has its own benefits and disadvantages. At first glance, it might seem odd that a regular expression might require compilation at all. But matching a regular expression against input is much like what a compiler's front end has to do with program input and can require some rather complex state machine generation and manipulation to execute efficiently. Using the right mode of compilation can lead to significant performance gains. You can control this process through the use of additional BCL APIs.

Interpretation

The default mode of execution is interpretation. In this mode, the regular expression APIs will parse your pattern and lay it out in memory in a parse tree structure. This representation is not executable code (as is the case with compilation) but rather a walkable expression that the matcher will interpret and react to while matching some input. The lack of any compilation or code generation here means that this is a very efficient way to create expressions, avoiding any expensive compilation. However, it is noticeably less performant while matching input.

If you will be using an expression multiple times during your program's execution, the accumulation of performance hits taken each time you perform a match will likely outweigh the potential cost for compilation. If you're using an expression from within a loop or hot segment of code, for example, you should seriously consider compilation as an option.

Lazy Compilation

Lazy compilation simply means that compilation will occur when you instantiate a new regular expression (Regex object) at runtime. To enable this mode, you must pass the RegexOptions.Compiled option to the Regex constructor. Similar to the way in which interpretation works, this will parse the expression into an in-memory data structure. However, this data structure is then compiled into executable IL code. Lightweight Code Generation (LCG) is used to generate code that is then stored in memory and is ready to be run whenever the matcher is called. (LCG is discussed in Chapter 14.)

Because generating this code is less efficient than just the parse tree generated with interpretation, you will take a slight performance hit when constructing a new expression. This hit comes in two ways: both in time and space (associated with the dynamic code). Worse yet, prior to using LCG in 2.0 this code memory wouldn't go away unless you unloaded the AppDomain in which the expression was created. What is lost in up-front cost can be recovered quickly if you use an expression more than once, especially for overly complex patterns.
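One way to make sure the construction cost is paid only once is to hold the compiled expression in a static field. The following is a sketch of that idea rather than a prescribed pattern; the class name LineClassifier and the method IsUniform are hypothetical, and the expression is the one used in the example later in this section:

```csharp
using System;
using System.Text.RegularExpressions;

static class LineClassifier
{
    // Constructed, and compiled to IL, once when the class is first used.
    static readonly Regex AllDigitsOrNoDigits =
        new Regex(@"(?(^\d)^\d+$|^\D+$)", RegexOptions.Compiled);

    // True when the line consists solely of digits or contains no digits at all.
    public static bool IsUniform(string line)
    {
        return AllDigitsOrNoDigits.IsMatch(line);
    }

    static void Main()
    {
        foreach (string line in new[] { "12345", "abc", "12a" })
            Console.WriteLine("{0}: {1}", line, IsUniform(line));
        // 12345: True
        // abc: True
        // 12a: False
    }
}
```

Every call to IsUniform then reuses the same generated code, so the up-front cost is spread across all subsequent matches.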

Precompilation

You can generate additional assemblies during your build process to contain the compiled code for your expressions. You typically don't need dynamic information to create an expression (they tend to be represented as literal strings), and thus there's no need to wait until runtime to compile expressions at all. With this approach, you shift the entire burden to compile time. This technique avoids the performance hit of compiling at runtime but does require you to load a new assembly into memory.

This option works by using the Regex.CompileToAssembly static method. It takes an array of expression descriptors of type RegexCompilationInfo and an AssemblyName type, which describes the assembly to create. The RegexCompilationInfo class has just a few properties: Pattern is the regular expression to compile; Options is an enumeration value of type RegexOptions, allowing you to specify expression options as you would when creating a normal expression; Namespace, Name, and IsPublic all control the type that gets compiled.

You will have to create some code that, when run, compiles your expression and stores it in an assembly. This might seem odd at first but will typically just be a simple script that looks like this:

    public static void Main(String[] args)
    {
        Regex.CompileToAssembly(...);
        Regex.CompileToAssembly(...);
        // Additional expressions...
    }

The sole purpose of this program is just to generate a dedicated regular expression assembly containing all of your code's patterns, which you will then reference from your main project. It will create a custom class, the name of which you control using arguments to the CompileToAssembly method. The generated class can then be instantiated from your program and used like any old Regex.

Evolution from Interpretation to Compilation

This section will demonstrate an incremental improvement to an initially poor use of regular expressions. Imagine that we wanted to use the expression (?(^\d)^\d+$|^\D+$) several times. We might begin by using interpretation, perhaps because we were previously unaware of compilation:

    Regex r = new Regex(@"(?(^\d)^\d+$|^\D+$)");
    foreach (String str in inputStrings)
    {
        if (r.IsMatch(str))
        {
            // Do something...
        }
        else
        {
            // Do something else...
        }
    }

This is not an uncommon practice. But for large sets of input, performance will suffer. Think about what happens if we had, say, 1,000,000 lines of input to process. The matcher would have to walk the regular expression's in-memory representation 1,000,000 times.

You can easily remedy this with a quick, one-line fix:

    Regex r = new Regex(@"(?(^\d)^\d+$|^\D+$)", RegexOptions.Compiled);
    foreach (String str in inputStrings)
    {
        if (r.IsMatch(str))
        {
            // Do something...
        }
        else
        {
            // Do something else...
        }
    }

In some basic testing I performed, this sped up execution by 2.25x (including the cost for compilation itself). That's more than a twofold increase, which isn't too shabby. But we're still paying the cost to compile the expression at runtime. We can do better.

To do so, we need to write a separate program to generate the assembly:

    using System;
    using System.Reflection;
    using System.Text.RegularExpressions;

    class RegexGenerator
    {
        public static void Main(String[] args)
        {
            RegexCompilationInfo myRegex = new RegexCompilationInfo(
                @"(?(^\d)^\d+$|^\D+$)", RegexOptions.None,
                "MyRegex", "MyNamespace", true);
            Console.Write("Compiling to assembly...");
            Regex.CompileToAssembly(new RegexCompilationInfo[] { myRegex },
                new AssemblyName("MyRegex"));
            Console.WriteLine(" Done!");
        }
    }

After running this code, a new assembly, MyRegex.dll, will be generated in the current directory. Inside it, you will find one public class, MyNamespace.MyRegex, which derives from the System.Text.RegularExpressions.Regex type. This contains your fully compiled expression. If you had passed multiple RegexCompilationInfo instances to CompileToAssembly, there would be one class per compilation info instance. You can manage the generation of such assemblies by placing the execution of this program into your standard build process, for example an MSBuild task that simply executes the program.

To use this newly generated expression, simply add a reference to the regular expression DLL, ensure that you import the namespace (if different from the class in which you are using it), and change your code where you instantiate the expression to something like this:

    Regex r = new MyRegex();
    foreach (String str in inputStrings)
    {
        if (r.IsMatch(str))
        {
            // Do something...
        }
        else
        {
            // Do something else...
        }
    }

There's no need to pass in the expression string or any other information to construct an instance. In fact, the generated type only has a single, no-args constructor that you must worry about. Everything else works the same. And my testing showed another 1.5x speedup over the lazily compiled version as a result.




Professional .NET Framework 2.0 (Programmer to Programmer)
ISBN: 0764571354
EAN: 2147483647
Year: N/A
Pages: 116
Authors: Joe Duffy
