9.2. Using .NET Regular Expressions
.NET regular expressions are powerful, clean, and provided through a complete and easy-to-use class interface. But as wonderful a job as Microsoft did building the package, the documentation is just the opposite: it's horrifically bad. It's woefully incomplete, poorly written, disorganized, and sometimes even wrong. It took me quite a while to figure the package out, so it's my hope that the presentation in this chapter makes the use of .NET regular expressions clear for you.
9.2.1. Regex Quickstart
You can get quite a bit of use out of the .NET regex package without even knowing the details of its regex class model. Knowing the details lets you get more information more efficiently, but the following are examples of how to do simple operations without explicitly creating any classes. These are just examples; all the details follow shortly.
Any program that uses the regex library must have the line
    Imports System.Text.RegularExpressions
at the beginning of the file (see page 415), so these examples assume that's there.
The following examples all work with the text in the String variable TestStr. As with all examples in this chapter, names I've chosen are in italic.
9.2.1.1. Quickstart: Checking a string for match
This example simply checks to see whether a regex matches a string:
    If Regex.IsMatch(TestStr, "^\s*$") Then
        Console.WriteLine("line is empty")
    Else
        Console.WriteLine("line is not empty")
    End If
This example uses a match option:
    If Regex.IsMatch(TestStr, "^subject:", RegexOptions.IgnoreCase) Then
        Console.WriteLine("line is a subject line")
    Else
        Console.WriteLine("line is not a subject line")
    End If
9.2.1.2. Quickstart: Matching and getting the text matched
This example identifies the text actually matched by the regex. If there's no match, TheNum is set to an empty string.
    Dim TheNum as String = Regex.Match(TestStr, "\d+").Value
    If TheNum <> "" Then
        Console.WriteLine("Number is: " & TheNum)
    End If
This example uses a match option:
    Dim ImgTag as String = Regex.Match(TestStr, "<img\b[^>]*>", _
                                       RegexOptions.IgnoreCase).Value
    If ImgTag <> "" Then
        Console.WriteLine("Image tag: " & ImgTag)
    End If
9.2.1.3. Quickstart: Matching and getting captured text
This example gets the first captured group (e.g., $1 ) as a string:
    Dim Subject as String = _
        Regex.Match(TestStr, "^Subject: (.*)").Groups(1).Value
    If Subject <> "" Then
        Console.WriteLine("Subject is: " & Subject)
    End If
Note that C# uses Groups[1] instead of Groups(1).
Here's the same thing, using a match option:
    Dim Subject as String = _
        Regex.Match(TestStr, "^subject: (.*)", _
                    RegexOptions.IgnoreCase).Groups(1).Value
    If Subject <> "" Then
        Console.WriteLine("Subject is: " & Subject)
    End If
This example is the same as the previous, but using named capture:
    Dim Subject as String = _
        Regex.Match(TestStr, "^subject: (?<Subj>.*)", _
                    RegexOptions.IgnoreCase).Groups("Subj").Value
    If Subject <> "" Then
        Console.WriteLine("Subject is: " & Subject)
    End If
9.2.1.4. Quickstart: Search and replace
This example makes our test string "safe" to include within HTML, converting characters special to HTML into HTML entities:
    TestStr = Regex.Replace(TestStr, "&", "&amp;")
    TestStr = Regex.Replace(TestStr, "<", "&lt;")
    TestStr = Regex.Replace(TestStr, ">", "&gt;")
    Console.WriteLine("Now safe in HTML: " & TestStr)
The replacement string (the third argument) is interpreted specially, as described in the sidebar on page 424. For example, within the replacement string, '$&' is replaced by the text actually matched by the regex. Here's an example that wraps <B>…</B> around capitalized words:
    TestStr = Regex.Replace(TestStr, "\b[A-Z]\w*", "<B>$&</B>")
    Console.WriteLine("Modified string: " & TestStr)
This example replaces <B>…</B> (in a case-insensitive manner) with <I>…</I>:
    TestStr = Regex.Replace(TestStr, "<b>(.*?)</b>", "<I>$1</I>", _
                            RegexOptions.IgnoreCase)
    Console.WriteLine("Modified string: " & TestStr)
9.2.2. Package Overview
You can get the most out of .NET regular expressions by working with their rich and convenient class structure. To give us an overview, here's a complete console application that shows a simple match using explicit objects:
    Option Explicit On ' These are not specifically required to use regexes,
    Option Strict On   ' but their use is good general practice.

    ' Make regex-related classes easily available.
    Imports System.Text.RegularExpressions

    Module SimpleTest
    Sub Main()
        Dim SampleText as String = "this is the 1st test string"
        Dim R as Regex = New Regex("\d+\w+")  ' Compile the pattern.
        Dim M as Match = R.Match(SampleText)  ' Check against a string.
        If Not M.Success Then
            Console.WriteLine("no match")
        Else
            Dim MatchedText as String = M.Value   ' Query the results...
            Dim MatchedFrom as Integer = M.Index
            Dim MatchedLen as Integer = M.Length
            Console.WriteLine("matched [" & MatchedText & "]" & _
                              " from char#" & MatchedFrom.ToString() & _
                              " for " & MatchedLen.ToString() & " chars.")
        End If
    End Sub
    End Module
When executed from a command prompt, it applies \d+\w+ to the sample text and displays:
matched [1st] from char#12 for 3 chars.
9.2.2.1. Importing the regex namespace
Notice the Imports System.Text.RegularExpressions line near the top of the program? That's required in any VB program that wishes to access the .NET regex objects, to make them available to the compiler.
The analogous statement in C# is:
using System.Text.RegularExpressions; // This is for C#
The example shows the use of the underlying raw regex objects. The two main action lines:
    Dim R as Regex = New Regex("\d+\w+")  ' Compile the pattern.
    Dim M as Match = R.Match(SampleText)  ' Check against a string.
can also be combined, as:
    Dim M as Match = Regex.Match(SampleText, "\d+\w+")  ' Check pattern against string.
The combined version is easier to work with, as there's less for the programmer to type, and fewer objects to keep track of. It does, however, come with a slight efficiency penalty (see page 432). Over the coming pages, we'll first look at the raw objects, and then at the "convenience" functions such as the Regex.Match static function, and when it makes sense to use them.
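To make the trade-off concrete, here is a minimal sketch of the raw-object style, in which one Regex object is built once and then applied to several strings. The module name ReuseDemo, the helper CountMatchingLines, and the sample strings are my own illustrative choices, not part of the .NET package:

```vbnet
Option Explicit On
Option Strict On
Imports System.Text.RegularExpressions

Module ReuseDemo
    ' Hypothetical helper: counts the strings in which an
    ' already-built Regex object finds a match.
    Function CountMatchingLines(ByVal R As Regex, ByVal Entries As String()) As Integer
        Dim Count As Integer = 0
        For Each Entry As String In Entries
            If R.IsMatch(Entry) Then Count += 1
        Next
        Return Count
    End Function

    Sub Main()
        Dim Entries As String() = {"the 1st line", "no digits here", "the 22nd line"}
        Dim R As Regex = New Regex("\d+\w+")   ' Build the Regex object once...
        ' ...then apply that same object to every string.
        Console.WriteLine("lines with a match: " & CountMatchingLines(R, Entries).ToString())
    End Sub
End Module
```

With these sample strings, the program prints "lines with a match: 2". Holding on to the Regex object means the pattern is stated in just one place, however many strings it is applied to.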
For brevity's sake, I'll generally not repeat the following lines in examples that are not complete programs:
    Option Explicit On
    Option Strict On
    Imports System.Text.RegularExpressions
It may also be helpful to look back at some of the VB examples earlier in the book, on pages 96, 99, 204, 219, and 237.
9.2.3. Core Object Overview
Before getting into the details, let's first take a step back and look at the .NET regex object model. An object model is the set of class structures through which regex functionality is provided. .NET regex functionality is provided through seven highly-interwoven classes, but in practice, you'll generally need to understand only the three shown visually in Figure 9-1 on the facing page, which depicts the repeated application of \s+(\d+) to the string 'May 16, 1998'.
Figure 9-1. .NET's Regex-related object model
9.2.3.1. Regex objects
The first step is to create a Regex object, as with:
Dim R as Regex = New Regex("\s+(\d+)")
Here, we've made a regex object representing \s+(\d+) and stored it in the R variable. Once you've got a Regex object, you can apply it to text with its Match(text) method, which returns information on the first match found:
Dim M as Match = R.Match("May 16, 1998")
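The "repeated application" mentioned above can be sketched as follows. This assumes the Match object's NextMatch method, which is not covered in this section, and uses names of my own (the module RepeatDemo and the helper CollectNumbers) purely for illustration:

```vbnet
Option Explicit On
Option Strict On
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RepeatDemo
    ' Hypothetical helper: applies \s+(\d+) repeatedly, collecting
    ' the text captured by the first set of parentheses each time.
    Function CollectNumbers(ByVal Source As String) As List(Of String)
        Dim Found As New List(Of String)
        Dim R As Regex = New Regex("\s+(\d+)")
        Dim M As Match = R.Match(Source)      ' First match, if any.
        While M.Success
            Found.Add(M.Groups(1).Value)
            M = M.NextMatch()                 ' Continue after the previous match.
        End While
        Return Found
    End Function

    Sub Main()
        For Each Num As String In CollectNumbers("May 16, 1998")
            Console.WriteLine("found: " & Num)
        Next
    End Sub
End Module
```

Run against 'May 16, 1998', this prints "found: 16" and then "found: 1998". The loop ends when NextMatch returns a Match object whose Success property is False.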
9.2.3.2. Match objects
A Regex object's Match(…) method provides information about a match result by creating and returning a Match object. A Match object has a number of properties, including Success (a Boolean value indicating whether the match was successful) and Value (a copy of the text actually matched, if the match was successful). We'll look at the full list of Match properties later.
Among the details you can get about a match from a Match object is information about the text matched within capturing parentheses. The Perl examples in earlier chapters used Perl's $1 variable to get the text matched within the first set of capturing parentheses. .NET offers two methods to retrieve this data: to get the raw text, you can index into a Match object's Groups property, such as with Groups(1).Value to get the equivalent of Perl's $1. (Note: C# requires a different syntax, Groups[1].Value, instead.) Another approach is to use the Result method, which is discussed starting on page 429.
9.2.3.3. Group objects
The Groups(1) part in the previous paragraph actually references a Group object, and the subsequent .Value references its Value property (the text associated with the group). There is a Group object for each set of capturing parentheses, and a "virtual group," numbered zero, which holds the information about the overall match.
Thus, MatchObj.Value and MatchObj.Groups(0).Value are the same: a copy of the entire text matched. It's more concise and convenient to use the first, shorter approach, but it's important to know about the zeroth group because MatchObj.Groups.Count (the number of groups known to the Match object) includes it. The MatchObj.Groups.Count resulting from a successful match with \s+(\d+) is two (the whole-match "zeroth" group, and the $1 group).
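Here's a short sketch that shows the zeroth group alongside the $1 group for the first match of \s+(\d+) against 'May 16, 1998'. The module GroupDemo and the helper GroupValues are hypothetical names chosen for this illustration:

```vbnet
Option Explicit On
Option Strict On
Imports System.Text.RegularExpressions

Module GroupDemo
    ' Hypothetical helper: copies the text of every group known to
    ' the Match object, including the whole-match zeroth group.
    Function GroupValues(ByVal M As Match) As String()
        Dim Values(M.Groups.Count - 1) As String
        For I As Integer = 0 To M.Groups.Count - 1
            Values(I) = M.Groups(I).Value
        Next
        Return Values
    End Function

    Sub Main()
        Dim M As Match = Regex.Match("May 16, 1998", "\s+(\d+)")
        Dim Values As String() = GroupValues(M)
        Console.WriteLine("Groups.Count: " & M.Groups.Count.ToString())
        Console.WriteLine("Value:     [" & M.Value & "]")
        Console.WriteLine("Groups(0): [" & Values(0) & "]")
        Console.WriteLine("Groups(1): [" & Values(1) & "]")
    End Sub
End Module
```

The count reported is 2. Value and Groups(0) both show [ 16], including the leading space matched by \s+, while Groups(1) shows just [16].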
9.2.3.4. Capture objects
There is also a Capture object. It's not used often, but it's discussed starting on page 437.
9.2.3.4.1. All results are computed at match time
When a regex is applied to a string, resulting in a Match object, all the results (where it matched, what each capturing group matched, etc.) are calculated and encapsulated into the Match object. Accessing properties and methods of the Match object, including its Group objects (and their properties and methods) merely fetches the results that have already been computed.