9.2. Using .NET Regular Expressions

.NET regular expressions are powerful, clean, and provided through a complete and easy-to-use class interface. But as wonderful a job as Microsoft did building the package, the documentation is just the opposite: it's horrifically bad. It's woefully incomplete, poorly written, disorganized, and sometimes even wrong. It took me quite a while to figure the package out, so it's my hope that the presentation in this chapter makes the use of .NET regular expressions clear for you.

9.2.1. Regex Quickstart

You can get quite a bit of use out of the .NET regex package without even knowing the details of its regex class model. Knowing the details lets you get more information more efficiently, but the following are examples of how to do simple operations without explicitly creating any classes. These are just examples; all the details follow shortly.

Any program that uses the regex library must have the line

    Imports System.Text.RegularExpressions

at the beginning of the file (see page 415), so these examples assume that's there.

The following examples all work with the text in the String variable TestStr. As with all examples in this chapter, names I've chosen are in italic.

9.2.1.1. Quickstart: Checking a string for match

This example simply checks to see whether a regex matches a string:

    If Regex.IsMatch(TestStr, "^\s*$")
       Console.WriteLine("line is empty")
    Else
       Console.WriteLine("line is not empty")
    End If

This example uses a match option:

    If Regex.IsMatch(TestStr, "^subject:", RegexOptions.IgnoreCase)
       Console.WriteLine("line is a subject line")
    Else
       Console.WriteLine("line is not a subject line")
    End If

9.2.1.2. Quickstart: Matching and getting the text matched

This example identifies the text actually matched by the regex. If there's no match, TheNum is set to an empty string.

    Dim TheNum as String = Regex.Match(TestStr, "\d+").Value
    If TheNum <> ""
       Console.WriteLine("Number is: " & TheNum)
    End If

This example uses a match option:

    Dim ImgTag as String = Regex.Match(TestStr, "<img\b[^>]*>", _
                                       RegexOptions.IgnoreCase).Value
    If ImgTag <> ""
       Console.WriteLine("Image tag: " & ImgTag)
    End If
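One caveat with comparing Value against an empty string: a regex that is allowed to match "nothing at all" can succeed and still report empty text. The following is a minimal sketch of that distinction, not part of the original examples; the module name, the sample string, and the \d* pattern are assumptions chosen only for illustration. It uses the Success property to tell the two cases apart.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SuccessSketch
    Sub Main()
        ' Hypothetical test string, chosen only for illustration.
        Dim TestStr As String = "no digits here"

        ' \d* may match zero digits, so this match succeeds with empty text.
        Dim M1 As Match = Regex.Match(TestStr, "\d*")
        Console.WriteLine("\d* : Success=" & M1.Success.ToString() & _
                          ", Length=" & M1.Length.ToString())

        ' \d+ needs at least one digit, so this match fails; its Value is also empty.
        Dim M2 As Match = Regex.Match(TestStr, "\d+")
        Console.WriteLine("\d+ : Success=" & M2.Success.ToString() & _
                          ", Length=" & M2.Length.ToString())
    End Sub
End Module
```

Both matches report an empty Value, so only Success distinguishes "matched nothing" from "did not match." For patterns such as \d+ that cannot match an empty string, the empty-string test used above is sufficient.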

9.2.1.3. Quickstart: Matching and getting captured text

This example gets the first captured group (e.g., $1) as a string:

    Dim Subject as String = _
        Regex.Match(TestStr, "^Subject: (.*)").Groups(1).Value
    If Subject <> ""
       Console.WriteLine("Subject is: " & Subject)
    End If

Note that C# uses Groups[1] instead of Groups(1).

Here's the same thing, using a match option:

    Dim Subject as String = _
        Regex.Match(TestStr, "^subject: (.*)", _
                    RegexOptions.IgnoreCase).Groups(1).Value
    If Subject <> ""
       Console.WriteLine("Subject is: " & Subject)
    End If

This example is the same as the previous, but using named capture:

    Dim Subject as String = _
        Regex.Match(TestStr, "^subject: (?<Subj>.*)", _
                    RegexOptions.IgnoreCase).Groups("Subj").Value
    If Subject <> ""
       Console.WriteLine("Subject is: " & Subject)
    End If

9.2.1.4. Quickstart: Search and replace

This example makes our test string "safe" to include within HTML, converting characters special to HTML into HTML entities:

    TestStr = Regex.Replace(TestStr, "&", "&amp;")
    TestStr = Regex.Replace(TestStr, "<", "&lt;")
    TestStr = Regex.Replace(TestStr, ">", "&gt;")
    Console.WriteLine("Now safe in HTML: " & TestStr)

The replacement string (the third argument) is interpreted specially, as described in the sidebar on page 424. For example, within the replacement string, '$&' is replaced by the text actually matched by the regex. Here's an example that wraps <B>…</B> around capitalized words:

    TestStr = Regex.Replace(TestStr, "\b[A-Z]\w*", "<B>$&</B>")
    Console.WriteLine("Modified string: " & TestStr)

This example replaces <B>…</B> (in a case-insensitive manner) with <I>…</I>:

    TestStr = Regex.Replace(TestStr, "<b>(.*?)</b>", "<I>$1</I>", _
                            RegexOptions.IgnoreCase)
    Console.WriteLine("Modified string: " & TestStr)

9.2.2. Package Overview

You can get the most out of .NET regular expressions by working with their rich and convenient class structure. To give us an overview, here's a complete console application that shows a simple match using explicit objects:

    Option Explicit On ' These are not specifically required to use regexes,
    Option Strict On   ' but their use is good general practice.

    ' Make regex-related classes easily available.
    Imports System.Text.RegularExpressions

    Module SimpleTest
    Sub Main()
        Dim SampleText as String = "this is the 1st test string"
        Dim R as Regex = New Regex("\d+\w+") ' Compile the pattern.
        Dim M as Match = R.Match(SampleText) ' Check against a string.
        If not M.Success
            Console.WriteLine("no match")
        Else
            Dim MatchedText as String  = M.Value ' Query the results...
            Dim MatchedFrom as Integer = M.Index
            Dim MatchedLen  as Integer = M.Length
            Console.WriteLine("matched [" & MatchedText & "]" & _
                              " from char#" & MatchedFrom.ToString() & _
                              " for " & MatchedLen.ToString() & " chars.")
        End If
    End Sub
    End Module

When executed from a command prompt, it applies \d+\w+ to the sample text and displays:

 matched [1st] from char#12 for 3 chars. 

9.2.2.1. Importing the regex namespace

Notice the Imports System.Text.RegularExpressions line near the top of the program? That's required in any VB program that wishes to access the .NET regex objects, to make them available to the compiler.

The analogous statement in C# is:

    using System.Text.RegularExpressions; // This is for C#

The example shows the use of the underlying raw regex objects. The two main action lines:

    Dim R as Regex = New Regex("\d+\w+")  ' Compile the pattern.
    Dim M as Match = R.Match(SampleText)  ' Check against a string.

can also be combined, as:

    Dim M as Match = Regex.Match(SampleText, "\d+\w+") ' Check pattern against string.

The combined version is easier to work with, as there's less for the programmer to type, and fewer objects to keep track of. It does, however, come with a slight efficiency penalty (see page 432). Over the coming pages, we'll first look at the raw objects, and then at the "convenience" functions such as the Regex.Match static function, and when it makes sense to use them.
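As a rough illustration of when the explicit-object form is natural, here is a minimal sketch that builds one Regex object and applies it to several strings in turn. The module name and the sample strings are assumptions invented for this sketch; only the \d+\w+ pattern comes from the example above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ReuseSketch
    Sub Main()
        ' Hypothetical sample data, chosen only for illustration.
        Dim Samples() As String = {"the 1st line", "no digits here", "the 22nd line"}

        ' Build the Regex object once...
        Dim R As Regex = New Regex("\d+\w+")

        ' ...and apply the same object to each string.
        For Each Sample As String In Samples
            Dim M As Match = R.Match(Sample)
            If M.Success Then
                Console.WriteLine("matched [" & M.Value & "] in: " & Sample)
            Else
                Console.WriteLine("no match in: " & Sample)
            End If
        Next
    End Sub
End Module
```

Whether this is measurably faster than calling the static Regex.Match inside the loop depends on details covered later in the chapter, so treat the sketch as a pattern of use rather than a performance claim.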

For brevity's sake, I'll generally not repeat the following lines in examples that are not complete programs:

    Option Explicit On
    Option Strict On
    Imports System.Text.RegularExpressions

It may also be helpful to look back at some of the VB examples earlier in the book, on pages 96, 99, 204, 219, and 237.

9.2.3. Core Object Overview

Before getting into the details, let's first take a step back and look at the .NET regex object model. An object model is the set of class structures through which regex functionality is provided. .NET regex functionality is provided through seven highly interwoven classes, but in practice, you'll generally need to understand only the three shown visually in Figure 9-1, which depicts the repeated application of \s+(\d+) to the string 'May 16, 1998'.

Figure 9-1. .NET's Regex-related object model

9.2.3.1. Regex objects

The first step is to create a Regex object, as with:

 Dim R as Regex = New Regex("\s+(\d+)") 

Here, we've made a regex object representing \s+(\d+) and stored it in the R variable. Once you've got a Regex object, you can apply it to text with its Match(text) method, which returns information on the first match found:

 Dim M as Match = R.Match("May 16, 1998") 

9.2.3.2. Match objects

A Regex object's Match(…) method provides information about a match result by creating and returning a Match object. A Match object has a number of properties, including Success (a Boolean value indicating whether the match was successful) and Value (a copy of the text actually matched, if the match was successful). We'll look at the full list of Match properties later.

Among the details you can get about a match from a Match object is information about the text matched within capturing parentheses. The Perl examples in earlier chapters used Perl's $1 variable to get the text matched within the first set of capturing parentheses. .NET offers two methods to retrieve this data: to get the raw text, you can index into a Match object's Groups property, such as with Groups(1).Value to get the equivalent of Perl's $1. (Note: C# requires a different syntax, Groups[1].Value, instead.) Another approach is to use the Result method, which is discussed starting on page 429.

9.2.3.3. Group objects

The Groups(1) part in the previous paragraph actually references a Group object, and the subsequent .Value references its Value property (the text associated with the group). There is a Group object for each set of capturing parentheses, and a "virtual group," numbered zero, which holds the information about the overall match.

Thus, MatchObj.Value and MatchObj.Groups(0).Value are the same: a copy of the entire text matched. It's more concise and convenient to use the first, shorter approach, but it's important to know about the zeroth group because MatchObj.Groups.Count (the number of groups known to the Match object) includes it. The MatchObj.Groups.Count resulting from a successful match with \s+(\d+) is two (the whole-match "zeroth" group, and the $1 group).
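The relationship between a Match object and its groups can be seen in a short sketch that applies \s+(\d+) to 'May 16, 1998' and prints the zeroth and first groups. The module name and the output labels are assumptions made for this sketch; the regex, the string, and the properties used are the ones discussed above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GroupsSketch
    Sub Main()
        Dim R As Regex = New Regex("\s+(\d+)")
        Dim M As Match = R.Match("May 16, 1998")

        If M.Success Then
            ' Group zero is the whole match; group one is the first set of parentheses.
            Console.WriteLine("Groups.Count    = " & M.Groups.Count.ToString())
            Console.WriteLine("Value           = [" & M.Value & "]")
            Console.WriteLine("Groups(0).Value = [" & M.Groups(0).Value & "]")
            Console.WriteLine("Groups(1).Value = [" & M.Groups(1).Value & "]")
        End If
    End Sub
End Module
```

The first match here is the whitespace and digits following 'May', so Value and Groups(0).Value both include the leading space, while Groups(1).Value holds only the digits.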

9.2.3.4. Capture objects

There is also a Capture object. It's not used often, but it's discussed starting on page 437.

9.2.3.4.1. All results are computed at match time

When a regex is applied to a string, resulting in a Match object, all the results (where it matched, what each capturing group matched, etc.) are calculated and encapsulated into the Match object. Accessing properties and methods of the Match object, including its Group objects (and their properties and methods), merely fetches the results that have already been computed.



Mastering Regular Expressions
ISBN: 0596528124
EAN: 2147483647
Year: 2004
Pages: 113