9.6. Advanced .NET
The following pages cover a few features that haven't fit into the discussion so far: building a regex library with regex assemblies, using an interesting .NET-only regex feature for matching nested constructs, and a discussion of the Capture object.
9.6.1. Regex Assemblies
.NET allows you to encapsulate Regex objects into an assembly, which is useful in creating a regex library. The example in the sidebar on the facing page shows how to build one.
When the sidebar example executes, it creates the file JfriedlsRegexLibrary.DLL in the project's bin directory.
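For readers without the sidebar at hand, here is a minimal sketch of what such a build program might look like. It relies on Regex.CompileToAssembly and RegexCompilationInfo, which exist in the .NET Framework (CompileToAssembly is not supported on .NET Core and later). The module name and the CSV field pattern are assumptions made for illustration; only the assembly name, the jfriedl.CSV.GetField class name, and the QuotedField and UnquotedField group names come from the surrounding text.

```vbnet
Option Explicit On
Option Strict On

Imports System.Reflection
Imports System.Text.RegularExpressions

Module BuildRegexLibrary
    ' Hypothetical CSV-field pattern (an assumption, not necessarily the
    ' sidebar's). It defines the two named groups used elsewhere in this section.
    Public Const FieldPattern As String = _
        "\G(?:^|,)                                        " & _
        "(?:                                              " & _
        "   "" (?<QuotedField> (?> [^""]+ | """" )* ) ""  " & _
        " |                                               " & _
        "   (?<UnquotedField> [^"",]* )                   " & _
        ")"

    Sub Main()
        ' Arguments: pattern, options, class name, namespace, is-public flag.
        ' The class becomes available to other programs as jfriedl.CSV.GetField.
        Dim RCInfo() As RegexCompilationInfo = { _
            New RegexCompilationInfo(FieldPattern, _
                                     RegexOptions.IgnorePatternWhitespace, _
                                     "GetField", "jfriedl.CSV", True) _
        }

        Dim AN As New AssemblyName()
        AN.Name = "JfriedlsRegexLibrary"    ' This will be the DLL's filename
        AN.Version = New Version("1.0.0.0")

        Regex.CompileToAssembly(RCInfo, AN) ' Build everything and write the DLL
    End Sub
End Module
```

More RegexCompilationInfo entries can be added to the array to put several regex classes, in the same or different namespaces, into one assembly.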
I can then use that assembly in another project, after first adding it as a reference via Visual Studio .NET's Project > Add Reference dialog.
To make the classes in the assembly available, I first import them:
    Imports jfriedl
I can then use them just like any other class, as in this example:
    Dim FieldRegex as CSV.GetField = New CSV.GetField  ' This makes a new Regex object
    Dim FieldMatch as Match = FieldRegex.Match(Line)   ' Apply the regex to a string ...
    While FieldMatch.Success
        Dim Field as String
        If FieldMatch.Groups(1).Success Then
            Field = FieldMatch.Groups("QuotedField").Value
            Field = Regex.Replace(Field, """""", """")  ' replace two double quotes with one
        Else
            Field = FieldMatch.Groups("UnquotedField").Value
        End If
        Console.WriteLine("[" & Field & "]")
        ' Can now work with 'Field' ...
        FieldMatch = FieldMatch.NextMatch
    End While
In this example, I chose to import only from the jfriedl namespace, but could have just as easily imported from the jfriedl.CSV namespace, which then would allow the Regex object to be created with:
Dim FieldRegex as GetField = New GetField ' This makes a new Regex object
The difference is mostly a matter of style.
You can also choose to not import anything, but rather use them directly:
Dim FieldRegex as jfriedl.CSV.GetField = New jfriedl.CSV.GetField
This is a bit more cumbersome, but documents clearly where exactly the object is coming from. Again, it's a matter of style.
9.6.2. Matching Nested Constructs
Microsoft has included an interesting innovation for matching balanced constructs (historically, something not possible with a regular expression). It's not particularly easy to understand. This section is short, but be warned: it is very dense.
It's easiest to understand with an example, so I'll start with one:
    Dim R As Regex = New Regex(" \(                   " & _
                               "   (?>                " & _
                               "       [^()]+         " & _
                               "     |                " & _
                               "       \( (?<DEPTH>)  " & _
                               "     |                " & _
                               "       \) (?<-DEPTH>) " & _
                               "   )*                 " & _
                               "   (?(DEPTH)(?!))     " & _
                               " \)                   ", _
           RegexOptions.IgnorePatternWhitespace)
This matches the first properly-paired nested set of parentheses, such as the '(yes (here) okay)' portion of 'before (nope (yes (here) okay) after'. The first parenthesis isn't matched because it has no associated closing parenthesis.
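Here's the super-short overview of how it works:
1. With each '(' matched, (?<DEPTH>) adds one to the regex's idea of how deeply the parentheses are currently nested.
2. With each ')' matched, (?<-DEPTH>) subtracts one from that depth.
3. (?(DEPTH)(?!)) ensures that the depth is zero before allowing the final literal \) to match.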
This works because the engine's backtracking stack keeps track of successfully matched groupings. (?<DEPTH>) is just a named-capture version of (), which is always successful. Since it has been placed immediately after \(, its success (which remains on the stack until removed) is used as a marker for counting opening parentheses.
Thus, the number of successful 'DEPTH' groupings matched so far is maintained on the backtracking stack. We want to subtract from that whenever a closing parenthesis is found. That's accomplished by .NET's special (?<-DEPTH>) construct, which removes the most recent "successful DEPTH" notation from the stack. If it turns out that there aren't any, the (?<-DEPTH>) itself fails, thereby disallowing the regex from over-matching an extra closing parenthesis.
Finally, (?(DEPTH)(?!)) is a normal conditional that applies (?!) if the 'DEPTH' grouping is currently successful. If it's still successful by the time we get here, there was an unpaired opening parenthesis whose success had never been subtracted by a balancing (?<-DEPTH>). If that's the case, we want to exit the match (we don't want to match an unbalanced sequence), so we apply (?!), which is a normal negative lookahead of an empty subexpression, and guaranteed to fail.
Phew! That's how to match nested constructs with .NET regular expressions.
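To see the DEPTH bookkeeping in action, here is a small sketch that applies the pattern from this section, written with an explicit | between its three alternatives, to a couple of test strings. The module name, the FirstBalanced helper, and the test strings other than the one from the text are made up for illustration.

```vbnet
Option Explicit On
Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module NestedParens
    ' The pattern from the text: DEPTH is pushed for each "(" and popped for each ")".
    Private ReadOnly Balanced As New Regex( _
        " \(                      " & _
        "   (?>                   " & _
        "       [^()]+            " & _
        "     |                   " & _
        "       \( (?<DEPTH>)     " & _
        "     |                   " & _
        "       \) (?<-DEPTH>)    " & _
        "   )*                    " & _
        "   (?(DEPTH)(?!))        " & _
        " \)                      ", _
        RegexOptions.IgnorePatternWhitespace)

    ' Hypothetical helper: returns the first balanced parenthesized run,
    ' or Nothing if the text has none.
    Function FirstBalanced(ByVal Text As String) As String
        Dim M As Match = Balanced.Match(Text)
        If M.Success Then Return M.Value
        Return Nothing
    End Function

    Sub Main()
        ' The first "(" is never closed, so the match starts at the second one.
        Console.WriteLine(FirstBalanced("before (nope (yes (here) okay) after"))
        ' prints: (yes (here) okay)

        ' No closing parenthesis at all, so nothing can balance.
        Console.WriteLine(FirstBalanced("a (b (c") Is Nothing)
        ' prints: True
    End Sub
End Module
```

In the first call, the attempt starting at the first parenthesis fails at the final \), so the engine moves on and succeeds from the second opening parenthesis.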
9.6.3. Capture Objects
There's an additional component to .NET's object model, the Capture object, which I haven't discussed yet. Depending on your point of view, it either adds an interesting new dimension to the match results, or adds confusion and bloat.
A Capture object is almost identical to a Group object in that it represents the text matched within a set of capturing parentheses. Like the Group object, it has the properties Value (the text matched), Length (the length of the text matched), and Index (the zero-based offset into the target string at which the match was found).
The main difference between a Group object and a Capture object is that each Group object contains a collection of Captures representing all the intermediary matches by the group during the match, as well as the final text matched by the group.
Here's an example with ^(..)+ applied to 'abcdefghijk':
Dim M as Match = Regex.Match("abcdefghijk", "^(..)+")
The regex matches five sets of (..), which is most of the string: 'abcdefghij' (only the final 'k' is left unmatched). Since the plus is outside of the parentheses, they recapture with each iteration of the plus, and are left with only 'ij' (that is, M.Groups(1).Value is 'ij'). However, that M.Groups(1) also contains a collection of Captures representing the complete 'ab', 'cd', 'ef', 'gh', and 'ij' that (..) walked through during the match:
    M.Groups(1).Captures(0).Value is 'ab'
    M.Groups(1).Captures(1).Value is 'cd'
    M.Groups(1).Captures(2).Value is 'ef'
    M.Groups(1).Captures(3).Value is 'gh'
    M.Groups(1).Captures(4).Value is 'ij'
    M.Groups(1).Captures.Count is 5.
You'll notice that the last capture has the same 'ij' value as the group itself, M.Groups(1).Value. It turns out that the Value of a Group is really just a shorthand notation for the group's final capture. M.Groups(1).Value is really:
    M.Groups(1).Captures(M.Groups(1).Captures.Count - 1).Value
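The relationship between a Group and its Captures can be checked with a short program. This is only a sketch: the module name and the CaptureValues helper are invented for illustration, while Match, Group, Capture, and their Value, Index, and Length properties are the standard System.Text.RegularExpressions members.

```vbnet
Option Explicit On
Option Strict On

Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CaptureDemo
    ' Hypothetical helper: collects the text of every capture recorded
    ' for the given numbered group, in the order they were made.
    Function CaptureValues(ByVal M As Match, ByVal GroupNum As Integer) As List(Of String)
        Dim Result As New List(Of String)
        For Each C As Capture In M.Groups(GroupNum).Captures
            Result.Add(C.Value)
        Next
        Return Result
    End Function

    Sub Main()
        Dim M As Match = Regex.Match("abcdefghijk", "^(..)+")

        ' One line per iteration of the plus: ab, cd, ef, gh, ij.
        For Each C As Capture In M.Groups(1).Captures
            Console.WriteLine("{0} at index {1}, length {2}", C.Value, C.Index, C.Length)
        Next

        ' The group's Value is the value of its final capture.
        Dim Caps As CaptureCollection = M.Groups(1).Captures
        Console.WriteLine(M.Groups(1).Value = Caps(Caps.Count - 1).Value)
        ' prints: True
    End Sub
End Module
```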
Here are some additional points about captures:
.NET's Capture object is an interesting innovation that appears somewhat more complex and confusing than it really is by the way it's been "overly integrated" into the object model. After getting past the .NET documentation and actually understanding what these objects add, I've got mixed feelings about them. On one hand, it's an interesting innovation that I'd like to get to know. Uses for it don't immediately jump to mind, but that's likely because I've not had the same years of experience with it as I have with traditional regex features.
On the other hand, the construction of all these extra capture groups during a match, and then their encapsulation into objects after the match, seems an efficiency burden that I wouldn't want to pay unless I'd requested the extra information. The extra Capture groups won't be used in the vast majority of matches, but as it is, all Group and Capture objects (and their associated GroupCollection and CaptureCollection objects) are built when the Match object is built. So, you've got them whether you need them or not; if you can find a use for the Capture objects, by all means, use them.