9.6. Advanced .NET

The following pages cover a few features that haven't fit into the discussion so far: building a regex library with regex assemblies, using an interesting .NET-only regex feature for matching nested constructs, and working with the Capture object.

9.6.1. Regex Assemblies

.NET allows you to encapsulate Regex objects into an assembly, which is useful in creating a regex library. The sidebar "Creating Your Own Regex Library with an Assembly" shows how to build one.

When the sidebar example executes, it creates the file JfriedlsRegexLibrary.DLL in the project's bin directory.

I can then use that assembly in another project, after first adding it as a reference via Visual Studio .NET's Project > Add Reference dialog.

To make the classes in the assembly available, I first import them:

 Imports jfriedl

I can then use them just like any other class, as in this example:

 Dim FieldRegex as CSV.GetField = New CSV.GetField ' This makes a new Regex object
 Dim FieldMatch as Match = FieldRegex.Match(Line)  ' Apply the regex to a string...
 While FieldMatch.Success
    Dim Field as String
    If FieldMatch.Groups(1).Success Then
       Field = FieldMatch.Groups("QuotedField").Value
       Field = Regex.Replace(Field, """""", """") ' replace two double quotes with one
    Else
       Field = FieldMatch.Groups("UnquotedField").Value
    End If
    Console.WriteLine("[" & Field & "]")
    ' Can now work with 'Field'...
    FieldMatch = FieldMatch.NextMatch
 End While

In this example, I chose to import only from the jfriedl namespace, but I could just as easily have imported from the jfriedl.CSV namespace, which would then allow the Regex object to be created with:

 Dim FieldRegex as GetField = New GetField ' This makes a new Regex object

The difference is mostly a matter of style.

Creating Your Own Regex Library with an Assembly

This complete program builds a small regex library: an assembly (DLL) that holds three prebuilt Regex constructors I've named jfriedl.Mail.Subject, jfriedl.Mail.From, and jfriedl.CSV.GetField.

The first two are simple examples just to show how it's done, but the complexity of the final one really shows the promise of building your own library. Note that you don't have to give the RegexOptions.Compiled flag, as that's implied by the process of building an assembly.

See the main text of Section 9.6.1 for how to use the assembly after it's built.

 Option Explicit On
 Option Strict On

 Imports System.Text.RegularExpressions
 Imports System.Reflection

 Module BuildMyLibrary
 Sub Main()
   ' The calls to RegexCompilationInfo below provide the pattern, regex options, class name,
   ' namespace, and a Boolean indicating whether the new class is public. The first class, for example,
   ' will be available to programs that use this assembly as "jfriedl.Mail.Subject", a Regex constructor.
   Dim RCInfo() as RegexCompilationInfo = {                           _
     New RegexCompilationInfo(                                        _
       "^Subject:\s*(.*)", RegexOptions.IgnoreCase,                   _
       "Subject", "jfriedl.Mail", true),                              _
     New RegexCompilationInfo(                                        _
       "^From:\s*(.*)", RegexOptions.IgnoreCase,                      _
       "From", "jfriedl.Mail", true),                                 _
     New RegexCompilationInfo(                                        _
       "\G(?:^|,)                                        " &          _
       "(?:                                              " &          _
       "   (?# Either a double-quoted field... )         " &          _
       "   "" (?# field's opening quote )                " &          _
       "    (?<QuotedField> (?> [^""]+ | """" )* )       " &          _
       "   "" (?# field's closing quote )                " &          _
       " (?# ...or... )                                  " &          _
       " |                                               " &          _
       "   (?# ...some non-quote/non-comma text... )     " &          _
       "   (?<UnquotedField> [^"",]* )                   " &          _
       " )",                                                          _
       RegexOptions.IgnorePatternWhitespace,                          _
       "GetField", "jfriedl.CSV", true)                               _
   }
   ' Now do the heavy lifting to build and write out the whole thing...
   Dim AN as AssemblyName = new AssemblyName()
   AN.Name = "JfriedlsRegexLibrary" ' This will be the DLL's filename
   AN.Version = New Version("1.0.0.0")
   Regex.CompileToAssembly(RCInfo, AN) ' Build everything
 End Sub
 End Module


You can also choose to not import anything, but rather use them directly:

 Dim FieldRegex as jfriedl.CSV.GetField = New jfriedl.CSV.GetField

This is a bit more cumbersome, but documents clearly where exactly the object is coming from. Again, it's a matter of style.
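For readers who want to try the field-matching loop without first building the assembly, here is a minimal sketch that constructs a comparable Regex at run time. The module name, the SplitFields helper, and the sample line are assumptions made for illustration; they are not part of the library. RegexOptions.Compiled is given explicitly here because no assembly build implies it.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CsvFieldDemo
    ' Hypothetical stand-in for jfriedl.CSV.GetField, built at run time.
    Private ReadOnly FieldRegex As New Regex( _
        "\G(?:^|,)                                          " & _
        "(?:                                                " & _
        "    ""  (?<QuotedField> (?> [^""]+ | """" )* )  ""   " & _
        "  |                                                " & _
        "    (?<UnquotedField> [^"",]* )                    " & _
        ")", _
        RegexOptions.IgnorePatternWhitespace Or RegexOptions.Compiled)

    ' Returns the fields of one CSV line, with doubled quotes collapsed.
    Function SplitFields(ByVal Line As String) As List(Of String)
        Dim Fields As New List(Of String)
        Dim FieldMatch As Match = FieldRegex.Match(Line)
        While FieldMatch.Success
            If FieldMatch.Groups("QuotedField").Success Then
                ' Replace each pair of double quotes with a single one.
                Fields.Add(FieldMatch.Groups("QuotedField").Value.Replace("""""", """"))
            Else
                Fields.Add(FieldMatch.Groups("UnquotedField").Value)
            End If
            FieldMatch = FieldMatch.NextMatch()
        End While
        Return Fields
    End Function

    Sub Main()
        ' Sample line (an assumption): Ten,"a ""big"" deal",,last
        For Each Field As String In SplitFields("Ten,""a """"big"""" deal"",,last")
            Console.WriteLine("[" & Field & "]")
        Next
    End Sub
End Module
```

With the sample line, this should print four bracketed fields: Ten, a "big" deal, an empty field, and last.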

9.6.2. Matching Nested Constructs

Microsoft has included an interesting innovation for matching balanced constructs (historically, something not possible with a regular expression). It's not particularly easy to understand: this section is short, but be warned, it is very dense.

It's easiest to understand with an example, so I'll start with one:

 Dim R As Regex = New Regex(" \(                    " & _
                            "   (?>                 " & _
                            "       [^()]+          " & _
                            "     |                 " & _
                            "       \( (?<DEPTH>)   " & _
                            "     |                 " & _
                            "       \) (?<-DEPTH>)  " & _
                            "   )*                  " & _
                            "   (?(DEPTH)(?!))      " & _
                            " \)                    ", _
            RegexOptions.IgnorePatternWhitespace)

This matches the first properly-paired nested set of parentheses, such as the '(yes (here) okay)' portion of 'before (nope (yes (here) okay) after'. The first parenthesis isn't matched because it has no associated closing parenthesis.

Here's the super-short overview of how it works:

  • 1. With each '(' matched, (?<DEPTH>) adds one to the regex's idea of how deep the parentheses are currently nested (at least, nested beyond the initial \( at the start of the regex).

  • 2. With each ')' matched, (?<-DEPTH>) subtracts one from that depth.

  • 3. (?(DEPTH)(?!)) ensures that the depth is zero before allowing the final literal \) to match.
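As a rough analogy only, the three steps can be mirrored by an ordinary loop with an integer counter. This is not how the regex engine works internally, and BalancedLength is a hypothetical helper invented for this sketch; it simply shows the same bookkeeping in plain code.

```vbnet
Imports System

Module DepthCounterSketch
    ' Returns the length of the balanced run that starts at Text(Start),
    ' or -1 if there is none. Hypothetical helper, for illustration only.
    Function BalancedLength(ByVal Text As String, ByVal Start As Integer) As Integer
        If Start >= Text.Length OrElse Text(Start) <> "("c Then Return -1
        Dim Depth As Integer = 0              ' nesting beyond the initial "("
        For I As Integer = Start + 1 To Text.Length - 1
            If Text(I) = "("c Then
                Depth += 1                    ' step 1: like (?<DEPTH>)
            ElseIf Text(I) = ")"c Then
                ' Step 3: at depth zero, this is the final closing parenthesis.
                If Depth = 0 Then Return I - Start + 1
                Depth -= 1                    ' step 2: like (?<-DEPTH>)
            End If
        Next
        Return -1                             ' text ended with parentheses still open
    End Function

    Sub Main()
        Dim Text As String = "before (nope (yes (here) okay) after"
        Console.WriteLine(BalancedLength(Text, 7))   ' first "(": never closed, prints -1
        Console.WriteLine(BalancedLength(Text, 13))  ' second "(": balanced, prints 17
    End Sub
End Module
```

The regex version has one advantage this loop lacks: the engine tries each starting position for you and restores the depth automatically whenever it backtracks.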

This works because the engine's backtracking stack keeps track of successfully matched groupings. (?<DEPTH>) is just a named-capture version of (), which is always successful. Since it has been placed immediately after \(, its success (which remains on the stack until removed) is used as a marker for counting opening parentheses.

Thus, the number of successful 'DEPTH' groupings matched so far is maintained on the backtracking stack. We want to subtract from that whenever a closing parenthesis is found. That's accomplished by .NET's special (?<-DEPTH>) construct, which removes the most recent "successful DEPTH" notation from the stack. If it turns out that there aren't any, the (?<-DEPTH>) itself fails, thereby disallowing the regex from over-matching an extra closing parenthesis.

Finally, (?(DEPTH)(?!)) is a normal conditional that applies (?!) if the 'DEPTH' grouping is currently successful. If it's still successful by the time we get here, there was an unpaired opening parenthesis whose success had never been subtracted by a balancing (?<-DEPTH>). If that's the case, we want to exit the match (we don't want to match an unbalanced sequence), so we apply (?!), which is normal negative lookahead of an empty subexpression, and guaranteed to fail.

Phew! That's how to match nested constructs with .NET regular expressions.
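To see the construct in action, the following sketch wraps the nested-parentheses regex in a small function and applies it to two sample strings. The module name, the FirstBalanced helper, and both sample strings are assumptions made for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NestedParensDemo
    ' The nested-parentheses pattern discussed in this section.
    Private ReadOnly Nested As New Regex( _
        " \(                      " & _
        "   (?>                   " & _
        "       [^()]+            " & _
        "     |                   " & _
        "       \( (?<DEPTH>)     " & _
        "     |                   " & _
        "       \) (?<-DEPTH>)    " & _
        "   )*                    " & _
        "   (?(DEPTH)(?!))        " & _
        " \)                      ", _
        RegexOptions.IgnorePatternWhitespace)

    ' Returns the first balanced parenthesized run in Text, or Nothing.
    Function FirstBalanced(ByVal Text As String) As String
        Dim M As Match = Nested.Match(Text)
        If M.Success Then Return M.Value
        Return Nothing
    End Function

    Sub Main()
        ' The unclosed first "(" is skipped; prints: (yes (here) okay)
        Console.WriteLine(FirstBalanced("before (nope (yes (here) okay) after"))
        ' Nothing in this string is balanced; prints: True
        Console.WriteLine(FirstBalanced("(a(b") Is Nothing)
    End Sub
End Module
```

Note that a stray extra closing parenthesis is simply left out of the match: applied to '(a))', the regex should match only '(a)', because (?<-DEPTH>) fails when there is nothing left to remove.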

9.6.3. Capture Objects

There's an additional component to .NET's object model, the Capture object, which I haven't discussed yet. Depending on your point of view, it either adds an interesting new dimension to the match results, or adds confusion and bloat.

A Capture object is almost identical to a Group object in that it represents the text matched within a set of capturing parentheses. Like the Group object, it has the properties Value (the text matched), Length (the length of the text matched), and Index (the zero-based offset into the target string at which the match was found).

The main difference between a Group object and a Capture object is that each Group object contains a collection of Captures representing all the intermediary matches by the group during the match, as well as the final text matched by the group.

Here's an example with ^(..)+ applied to 'abcdefghijk':

 Dim M as Match = Regex.Match("abcdefghijk", "^(..)+")

The regex matches five sets of (..), which is most of the string: 'abcdefghij', everything but the final 'k'. Since the plus is outside of the parentheses, they recapture with each iteration of the plus, and are left with only 'ij' (that is, M.Groups(1).Value is 'ij'). However, M.Groups(1) also contains a collection of Captures representing the complete 'ab', 'cd', 'ef', 'gh', and 'ij' that (..) walked through during the match:

 M.Groups(1).Captures(0).Value is 'ab'
 M.Groups(1).Captures(1).Value is 'cd'
 M.Groups(1).Captures(2).Value is 'ef'
 M.Groups(1).Captures(3).Value is 'gh'
 M.Groups(1).Captures(4).Value is 'ij'
 M.Groups(1).Captures.Count is 5.

You'll notice that the last capture has the same 'ij' value as the group itself, M.Groups(1).Value. It turns out that the Value of a Group is really just shorthand for the group's final capture. M.Groups(1).Value is really:

 M.Groups(1).Captures(M.Groups(1).Captures.Count - 1).Value
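The relationship between a group and its captures can be checked with a short program. This is a minimal sketch; the module name and the CaptureValues helper are assumptions made for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CaptureDemo
    ' Returns the text of every capture recorded for group 1.
    Function CaptureValues(ByVal M As Match) As List(Of String)
        Dim Values As New List(Of String)
        For Each C As Capture In M.Groups(1).Captures
            Values.Add(C.Value)
        Next
        Return Values
    End Function

    Sub Main()
        Dim M As Match = Regex.Match("abcdefghijk", "^(..)+")
        Dim G As Group = M.Groups(1)

        ' Walk every intermediate capture, with its offset in the target string.
        For Each C As Capture In G.Captures
            Console.WriteLine("{0} at {1}", C.Value, C.Index)
        Next

        ' The group's Value is the same text as its final capture.
        Dim Last As Capture = G.Captures(G.Captures.Count - 1)
        Console.WriteLine(G.Value = Last.Value)
    End Sub
End Module
```

This should print the five pairs 'ab' through 'ij' at offsets 0, 2, 4, 6, and 8, followed by True.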

Here are some additional points about captures:

  • M.Groups(1).Captures is a CaptureCollection, which, like any collection, has Item and Count properties. However, it's common to forgo the Item property and index directly through the collection to its individual items, as with M.Groups(1).Captures(3) (M.Groups[1].Captures[3] in C#).

  • A Capture object does not have a Success property; check the Group's Success instead.

  • So far, we've seen that Capture objects are available from a Group object. A Match object also has a Captures property: M.Captures gives direct access to the Captures property of the zeroth group (that is, M.Captures is the same as M.Groups(0).Captures). Since the zeroth group represents the entire match, there are no iterations of it "walking through" a match, so the zeroth capture collection always has exactly one Capture. Because that Capture holds exactly the same information as the zeroth Group, neither M.Captures nor M.Groups(0).Captures is particularly useful.
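The last point is easy to confirm with a few lines. This is a minimal sketch; the module name and the WholeMatchCaptureCount helper are assumptions made for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchCapturesDemo
    ' Returns how many captures the zeroth group (the whole match) holds.
    Function WholeMatchCaptureCount(ByVal M As Match) As Integer
        Return M.Captures.Count
    End Function

    Sub Main()
        Dim M As Match = Regex.Match("abcdefghijk", "^(..)+")

        Console.WriteLine(WholeMatchCaptureCount(M))     ' prints 1
        Console.WriteLine(M.Captures(0).Value)           ' prints abcdefghij
        ' Both routes lead to the same text as the overall match.
        Console.WriteLine(M.Groups(0).Captures(0).Value = M.Value)   ' prints True
    End Sub
End Module
```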

.NET's Capture object is an interesting innovation that appears somewhat more complex and confusing than it really is by the way it's been "overly integrated" into the object model. After getting past the .NET documentation and actually understanding what these objects add, I've got mixed feelings about them. On one hand, it's an interesting innovation that I'd like to get to know. Uses for it don't immediately jump to mind, but that's likely because I've not had the same years of experience with it as I have with traditional regex features.

On the other hand, the construction of all these extra capture groups during a match, and then their encapsulation into objects after the match, seems an efficiency burden that I wouldn't want to pay unless I'd requested the extra information. The extra Capture groups won't be used in the vast majority of matches, but as it is, all Group and Capture objects (and their associated GroupCollection and CaptureCollection objects) are built when the Match object is built. So, you've got them whether you need them or not; if you can find a use for the Capture objects, by all means, use them.



Mastering Regular Expressions
ISBN: 0596528124
Year: 2004
Pages: 113
