9.3. Core Object Details

Now that we've seen an overview, let's look at the details. First, we'll look at how to create a Regex object, followed by how to apply it to a string to yield a Match object, and how to work with that object and its Group objects.

In practice, you can often avoid having to explicitly create a Regex object, but it's good to be comfortable with them, so during this look at the core objects, I'll always explicitly create them. We'll see later what shortcuts .NET provides to make things more convenient.

In the lists that follow, I don't mention little-used methods that are merely inherited from the Object class.

9.3.1. Creating Regex Objects

The constructor for creating a Regex object is uncomplicated. It accepts either one argument (the regex, as a string), or two arguments (the regex and a set of options). Here's a one-argument example:

 Dim StripTrailWS = new Regex("\s+$") '  for removing trailing whitespace  

This just creates the Regex object, preparing it for use; no matching has been done to this point.

Here's a two-argument example:

 Dim GetSubject = new Regex("^subject: (.*)", RegexOptions.IgnoreCase) 

That passes one of the RegexOptions flags, but you can pass multiple flags if they're OR'd together, as with:

 Dim GetSubject = new Regex("^subject: (.*)", _
                        RegexOptions.IgnoreCase OR RegexOptions.Multiline)

9.3.1.1. Catching exceptions

An ArgumentException error is thrown if a regex with an invalid combination of metacharacters is given. You don't normally need to catch this exception when using regular expressions you know to work, but it's important to catch it if using regular expressions from "outside" the program (e.g., entered by the user, or read from a configuration file). Here's an example:

 Dim R As Regex
 Try
     R = New Regex(SearchRegex)
 Catch e As ArgumentException
     Console.WriteLine("*ERROR* bad regex: " & e.ToString)
     Exit Sub
 End Try

Of course, depending on the application, you may want to do something other than writing to the console upon detection of the exception.

9.3.1.2. Regex options

The following option flags are allowed when creating a Regex object:



RegexOptions.IgnoreCase

This option indicates that when the regex is applied, it should be done in a case-insensitive manner (see page 110).



RegexOptions.IgnorePatternWhitespace

This option indicates that the regex should be parsed in a free-spacing and comments mode (see page 111). If you use raw #… comments, be sure to include a newline at the end of each logical line, or the first raw comment "comments out" the entire rest of the regex.

In VB.NET, this can be achieved with chr(10), as in this example:

 Dim R as Regex = New Regex( _
       "# Match a floating-point number ...        " & chr(10) & _
       " \d+(?:\.\d*)? # with a leading digit ...  " & chr(10) & _
       " |             # or ...                    " & chr(10) & _
       " \.\d+         # with a leading decimal point", _
       RegexOptions.IgnorePatternWhitespace)

That's cumbersome; in VB.NET, (?#…) comments can be more convenient:

 Dim R as Regex = New Regex( _
       "(?# Match a floating-point number ...)" & _
       " \d+(?:\.\d*)? (?# with a leading digit ...)" & _
       " |             (?# or ...)" & _
       " \.\d+         (?# with a leading decimal point)", _
       RegexOptions.IgnorePatternWhitespace)



RegexOptions.Multiline

This option indicates that the regex should be applied in an enhanced line-anchor mode (see page 112). This allows ^ and $ to match at embedded newlines in addition to the normal beginning and end of string, respectively.



RegexOptions.Singleline

This option indicates that the regex should be applied in a dot-matches-all mode (see page 111). This allows dot to match any character, rather than any character except a newline.



RegexOptions.ExplicitCapture

This option indicates that even raw (…) parentheses, which are normally capturing parentheses, should not capture, but rather behave like (?:…) grouping-only non-capturing parentheses. This leaves named-capture (?<name>…) parentheses as the only type of capturing parentheses.

If you're using named capture and also want non-capturing parentheses for grouping, it makes sense to use normal (…) parentheses and this option, as it keeps the regex more visually clear.



RegexOptions.RightToLeft

This option sets the regex to a right-to-left match mode (see page 411).



RegexOptions.Compiled

This option indicates that the regex should be compiled, on the fly, to a highly-optimized format, which generally leads to much faster matching. This comes at the expense of increased compile time the first time it's used, and increased memory use for the duration of the program's execution.

If a regex is going to be used just once, or sparingly, it makes little sense to use RegexOptions.Compiled, since its extra memory remains used even when a Regex object created with it has been disposed of. But if a regex is used in a time-critical area, it's probably advantageous to use this flag.

You can see an example on page 237, where this option cuts the time for one benchmark about in half. Also, see the discussion about compiling to an assembly (see page 434).



RegexOptions.ECMAScript

This option indicates that the regex should be parsed in a way that's compatible with ECMAScript (see page 412). If you don't know what ECMAScript is, or don't need compatibility with it, you can safely ignore this option.



RegexOptions.None

This is a "no extra options" value that's useful for initializing a RegexOptions variable, should you need to. As you decide options are required, they can be OR'd into it.
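As a rough illustration of a few of the options above, here is a minimal self-contained sketch. The module name OptionsDemo, the variable names, and the sample strings are made up for this example; the values noted in the comments are what the option descriptions above lead one to expect.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OptionsDemo
    Sub Main()
        Dim Text As String = "first line" & Chr(10) & "second line"

        ' Without Multiline, ^ matches only at the start of the string.
        Dim Plain As New Regex("^\w+")
        Console.WriteLine(Plain.Matches(Text).Count)      ' 1

        ' With Multiline, ^ also matches after the embedded newline.
        Dim Multi As New Regex("^\w+", RegexOptions.Multiline)
        Console.WriteLine(Multi.Matches(Text).Count)      ' 2

        ' Without Singleline, dot cannot cross the newline.
        Dim DotNormal As New Regex("first.*second")
        Console.WriteLine(DotNormal.IsMatch(Text))        ' False

        ' With Singleline, dot matches any character, newline included.
        Dim DotAll As New Regex("first.*second", RegexOptions.Singleline)
        Console.WriteLine(DotAll.IsMatch(Text))           ' True

        ' Start from None and OR in options as they turn out to be needed.
        Dim Opts As RegexOptions = RegexOptions.None
        Opts = Opts Or RegexOptions.IgnoreCase
        Opts = Opts Or RegexOptions.ExplicitCapture

        ' With ExplicitCapture, only the named group captures; the plain
        ' parentheses merely group.
        Dim R As New Regex("(?<key>\w+)(=|:)(\w+)", Opts)
        Console.WriteLine(R.GetGroupNames().Length)       ' 2: "0" and "key"
    End Sub
End Module
```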

9.3.2. Using Regex Objects

Just having a regex object is not useful unless you apply it, so the following methods swing it into action.

 RegexObj.IsMatch(target)                 Return type: Boolean
 RegexObj.IsMatch(target, offset)

The IsMatch method applies the object's regex to the target string, returning a simple Boolean indicating whether the attempt is successful. Here's an example:

 Dim R as Regex = New Regex("^\s*$")
 If R.IsMatch(Line) Then
     ' Line is blank ...
 End If

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is first attempted.
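A small sketch of the offset form may help here. The module name, the variable names, and the sample string are invented for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module IsMatchOffsetDemo
    Sub Main()
        Dim HasDigit As New Regex("\d")
        Dim Line As String = "7 apples"

        ' Applied to the whole string, the "7" at index 0 is found.
        Console.WriteLine(HasDigit.IsMatch(Line))      ' True

        ' Bypass the first character; no digit remains in " apples".
        Console.WriteLine(HasDigit.IsMatch(Line, 1))   ' False
    End Sub
End Module
```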

 RegexObj.Match(target)                      Return type: Match object
 RegexObj.Match(target, offset)
 RegexObj.Match(target, offset, maxlength)

The Match method applies the object's regex to the target string, returning a Match object. With this Match object, you can query information about the results of the match (whether it was successful, the text matched, etc.), and initiate the "next" match of the same regex in the string. Details of the Match object follow, starting on page 427.

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is first attempted.

If you provide a maxlength argument, it puts matching into a special mode where the maxlength characters starting offset characters into the target string are taken as the entire target string, as far as the regex engine is concerned. It pretends that characters outside the range don't even exist, so, for example, ^ can match at offset characters into the original target string, and $ can match at maxlength characters after that. It also means that lookaround can't "see" the characters outside of that range. This is all very different from when only offset is provided, as that merely influences where the transmission begins applying the regex; the engine still "sees" the entire target string.

This table shows examples that illustrate the meaning of offset and maxlength:

                                           Results when RegexObj is built with ...
 Method call                               \d\d          ^\d\d         ^\d\d$
 RegexObj.Match("May 16, 1998")            match '16'    fail          fail
 RegexObj.Match("May 16, 1998", 9)         match '99'    fail          fail
 RegexObj.Match("May 16, 1998", 9, 2)      match '99'    match '99'    match '99'
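The table can be reproduced with a short program along these lines. This is only a sketch: the module name and the Describe helper are hypothetical, introduced here for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchOffsetDemo
    ' Hypothetical helper: renders a Match the way the table does.
    Function Describe(ByVal M As Match) As String
        If M.Success Then Return "match '" & M.Value & "'"
        Return "fail"
    End Function

    Sub Main()
        Dim Target As String = "May 16, 1998"
        For Each Pattern As String In New String() {"\d\d", "^\d\d", "^\d\d$"}
            Dim R As New Regex(Pattern)
            ' Columns: no offset / offset 9 / offset 9 with maxlength 2
            Console.WriteLine(Pattern & ": " & _
                              Describe(R.Match(Target)) & " / " & _
                              Describe(R.Match(Target, 9)) & " / " & _
                              Describe(R.Match(Target, 9, 2)))
        Next
        ' Expected, per the table:
        '   \d\d: match '16' / match '99' / match '99'
        '   ^\d\d: fail / fail / match '99'
        '   ^\d\d$: fail / fail / match '99'
    End Sub
End Module
```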


 RegexObj.Matches(target)                 Return type: MatchCollection
 RegexObj.Matches(target, offset)

The Matches method is similar to the Match method, except Matches returns a collection of Match objects representing all the matches in the target, rather than just one Match object representing the first match. The returned object is a MatchCollection.

For example, after this initialization:

 Dim R as New Regex("\w+")
 Dim Target as String = "a few words"

this code snippet

 Dim BunchOfMatches as MatchCollection = R.Matches(Target)
 Dim I as Integer
 For I = 0 to BunchOfMatches.Count - 1
     Dim MatchObj as Match = BunchOfMatches.Item(I)
     Console.WriteLine("Match: " & MatchObj.Value)
 Next

produces this output:

 Match: a
 Match: few
 Match: words

The following example, which produces the same output, shows that you can dispense with the MatchCollection object altogether:

 Dim MatchObj as Match
 For Each MatchObj in R.Matches(Target)
     Console.WriteLine("Match: " & MatchObj.Value)
 Next

Finally, as a comparison, here's how you can accomplish the same thing another way, with the Match (rather than Matches ) method:

 Dim MatchObj as Match = R.Match(Target)
 While MatchObj.Success
     Console.WriteLine("Match: " & MatchObj.Value)
     MatchObj = MatchObj.NextMatch()
 End While

 RegexObj.Replace(target, replacement)                   Return type: String
 RegexObj.Replace(target, replacement, count)
 RegexObj.Replace(target, replacement, count, offset)

The Replace method does a search and replace on the target string, returning a (possibly changed) copy of it. It applies the Regex object's regular expression, but instead of returning a Match object, it replaces the matched text. What the matched text is replaced with depends on the replacement argument. The replacement argument is overloaded; it can be either a string or a MatchEvaluator delegate. If replacement is a string, it is interpreted according to the sidebar "Special Per-Match Replacement Sequences" below. For example,

 Dim R_CapWord as New Regex("\b[A-Z]\w*")
 Text = R_CapWord.Replace(Text, "<B>$0</B>")

wraps each capitalized word with <B>…</B>.

If count is given, only that number of replacements is done. (The default is to do all replacements.) To replace just the first match found, for example, use a count of one. If you know that there will be only one match, using an explicit count of one is more efficient than letting the Replace mechanics go through the work of trying to find additional matches. A count of -1 means "replace all" (which, again, is the default when no count is given).

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is applied. Bypassed characters are copied through to the result unchanged.

For example, this canonicalizes all whitespace (that is, reduces sequences of whitespace down to a single space):

 Dim AnyWS as New Regex("\s+")
 Target = AnyWS.Replace(Target, " ")

This converts 'some    random     spacing' to 'some random spacing'. The following does the same, except it leaves any leading whitespace alone:

 Dim AnyWS as New Regex("\s+")
 Dim LeadingWS as New Regex("^\s+")
 Target = AnyWS.Replace(Target, " ", -1, LeadingWS.Match(Target).Length)

This converts '   some    random     spacing' to '   some random spacing'. It uses the length of what's matched by LeadingWS as the offset (as the count of characters to skip) when doing the search and replace. It uses a convenient feature of the Match object, returned here by LeadingWS.Match(Target), that its Length property may be used even if the match fails. (Upon failure, the Length property has a value of zero, which is exactly what we need to apply AnyWS to the entire target.)
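To see count and offset side by side, consider this minimal sketch. The module name, the variable names, and the sample string are invented for this example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceCountDemo
    Sub Main()
        Dim Dash As New Regex("-")
        Dim Target As String = "a-b-c-d"

        ' No count: every match is replaced.
        Console.WriteLine(Dash.Replace(Target, "+"))          ' a+b+c+d

        ' A count of one replaces only the first match.
        Console.WriteLine(Dash.Replace(Target, "+", 1))       ' a+b-c-d

        ' A count of -1 means "replace all"; the offset of 2 bypasses "a-",
        ' which is copied through to the result unchanged.
        Console.WriteLine(Dash.Replace(Target, "+", -1, 2))   ' a-b+c+d
    End Sub
End Module
```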

Special Per-Match Replacement Sequences

Both the Regex.Replace method and the Match.Result method accept a "replacement" string that's interpreted specially. Within it, the following sequences are replaced by appropriate text from the match:

 Sequence       Replaced by
 $&             text matched by the regex (also available as $0)
 $1, $2, ...    text matched by the corresponding set of capturing parentheses
 ${name}        text matched by the corresponding named capture
 $`             text of the target string before the match location
 $'             text of the target string after the match location
 $$             a single '$' character
 $_             a copy of the entire original target string
 $+             (see text below)


The $+ sequence is fairly useless as currently implemented. Its origins lie with Perl's useful $+ variable, which references the highest-numbered set of capturing parentheses that actually participated in the match. (There's an example of it in use on page 202.) This .NET replacement-string $+, though, merely references the highest-numbered set of capturing parentheses in the regex. It's particularly useless in light of the capturing-parentheses renumbering that's automatically done when named captures are used (see page 409).

Any uses of '$' in the replacement string in situations other than those described in the table are left unmolested.
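The sequences in the sidebar can be tried out with Match.Result, as in the following sketch. The module name, the regex, and the group names mon and day are made up for this example; note that the "before the match" sequence is a dollar sign followed by a backquote, while "after the match" uses an apostrophe.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplacementSequencesDemo
    Sub Main()
        ' Two named groups and no unnamed ones, so mon is $1 and day is $2.
        Dim R As New Regex("(?<mon>[A-Z][a-z]+) (?<day>\d+)")
        Dim Target As String = "on May 16, 1998"
        Dim M As Match = R.Match(Target)

        Console.WriteLine(M.Result("[$&]"))       ' [May 16]
        Console.WriteLine(M.Result("[$1]"))       ' [May]
        Console.WriteLine(M.Result("[${day}]"))   ' [16]
        Console.WriteLine(M.Result("[$`]"))       ' [on ]
        Console.WriteLine(M.Result("[$']"))       ' [, 1998]
        Console.WriteLine(M.Result("[$$]"))       ' [$]
        Console.WriteLine(M.Result("[$_]"))       ' [on May 16, 1998]

        ' The same sequences work in a Replace replacement string.
        Console.WriteLine(R.Replace(Target, "${day} $1"))   ' on 16 May, 1998
    End Sub
End Module
```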


9.3.2.1. Using a replacement delegate

The replacement argument isn't limited to a simple string. It can be a delegate (basically, a pointer to a function). The delegate function is called after each match to generate the text to use as the replacement. Since the function can do any processing you want, it's an extremely powerful replacement mechanism.

The delegate is of the type MatchEvaluator, and is called once per match. The function it refers to should accept the Match object for the match, do whatever processing you like, and return the text to be used as the replacement.

As examples for comparison, the following two code snippets produce identical results:

 Target = R.Replace(Target, "<<$&>>")
 -----------------------------------
 Function MatchFunc(ByVal M as Match) as String
     return M.Result("<<$&>>")
 End Function
 Dim Evaluator as MatchEvaluator = New MatchEvaluator(AddressOf MatchFunc)
 Target = R.Replace(Target, Evaluator)

Both snippets highlight each match by wrapping the matched text in <<…>>. The advantage of using a delegate is that you can include code as complex as you like in computing the replacement. Here's an example that converts Celsius temperatures to Fahrenheit:

 Function MatchFunc(ByVal M as Match) as String
     ' Get numeric temperature from $1, then convert to Fahrenheit
     Dim Celsius as Double = Double.Parse(M.Groups(1).Value)
     Dim Fahrenheit as Double = Celsius * 9/5 + 32
     Return Fahrenheit & "F" ' Append an "F", and return
 End Function

 Dim Evaluator as MatchEvaluator = New MatchEvaluator(AddressOf MatchFunc)
 Dim R_Temp as Regex = New Regex("(\d+)C\b", RegexOptions.IgnoreCase)
 Target = R_Temp.Replace(Target, Evaluator)

Given 'Temp is 37C.' in Target, it replaces it with 'Temp is 98.6F.'.

 RegexObj.Split(target)                   Return type: array of String
 RegexObj.Split(target, count)
 RegexObj.Split(target, count, offset)

The Split method applies the object's regex to the target string, returning an array of the strings separated by the matches. Here's a trivial example:

 Dim R as New Regex("\.")
 Dim Parts as String() = R.Split("209.204.146.22")

The R.Split returns the array of four strings ('209', '204', '146', and '22') that are separated by the three matches of \. in the text.

If a count is provided, no more than count strings will be returned (unless capturing parentheses are used; more on that in a bit). If count is not provided, Split returns as many strings as are separated by matches. Providing a count may mean that the regex stops being applied before the final match, and if so, the last string has the unsplit remainder of the line:

 Dim R as New Regex("\.")
 Dim Parts as String() = R.Split("209.204.146.22", 2)

This time, Parts receives two strings, '209' and '204.146.22'.

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is attempted. The bypassed text becomes part of the first string returned (unless RegexOptions.RightToLeft has been specified, in which case the bypassed text becomes part of the last string returned).
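The three forms of Split can be compared with a sketch like this one. The module and variable names are illustrative; the count of 0 in the last call is used to mean "no limit", which is how the .NET documentation describes that value.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitDemo
    Sub Main()
        Dim Dot As New Regex("\.")
        Dim Addr As String = "209.204.146.22"

        ' No count: split at every match.
        Console.WriteLine(String.Join(" | ", Dot.Split(Addr)))         ' 209 | 204 | 146 | 22

        ' Count of 2: at most two strings; the last holds the remainder.
        Console.WriteLine(String.Join(" | ", Dot.Split(Addr, 2)))      ' 209 | 204.146.22

        ' Offset of 4: the first dot is bypassed, so the bypassed text
        ' becomes part of the first string returned.
        Console.WriteLine(String.Join(" | ", Dot.Split(Addr, 0, 4)))   ' 209.204 | 146 | 22
    End Sub
End Module
```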

9.3.2.2. Using Split with capturing parentheses

If capturing parentheses of any type are used, additional entries for captured text are usually inserted into the array. (We'll see in what cases they might not be inserted in a bit.) As a simple example, to separate a string like '2006-12-31' or '04/12/2007' into its component parts, you might split on [-/], as with:

 Dim R as New Regex("[-/]")
 Dim Parts as String() = R.Split(MyDate)

This returns a list of the three numbers (as strings). However, adding capturing parentheses and using ([-/]) as the regex causes Split to return five strings: if MyDate contains '2006-12-31', the strings are '2006', '-', '12', '-', and '31'. The extra '-' elements are from the per-capture $1.

If there are multiple sets of capturing parentheses, they are inserted in their numerical ordering (which means that all named captures come after all unnamed captures; see page 409).

Split works consistently with capturing parentheses so long as all sets of capturing parentheses actually participate in the match. However, there's a bug with the current version of .NET such that if there is a set of capturing parentheses that doesn't participate in the match, it and all higher-numbered sets don't add an element to the returned list.

As a somewhat contrived example, consider wanting to split on a comma with optional whitespace around it, yet have the whitespace added to the list of elements returned. You might use (\s+)?,(\s+)? for this. When applied with Split to 'this , that', four strings are returned: 'this', ' ', ' ', and 'that'. However, when applied to 'this, that', the inability of the first set of capturing parentheses to match inhibits the element for it (and for all sets that follow) from being added to the list, so only two strings are returned, 'this' and 'that'. The inability to know beforehand exactly how many strings will be returned per match is a major shortcoming of the current implementation.

In this particular example, you could get around this problem simply by using (\s*),(\s*) (in which both groups are guaranteed to participate in any overall match). However, more complex expressions are not easily rewritten.
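The following sketch shows the effect of capturing parentheses on Split, using only regexes in which every group participates in every match, so the results should not depend on the behavior described above for non-participating groups. The module and variable names are invented for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitCaptureDemo
    Sub Main()
        Dim MyDate As String = "2006-12-31"

        ' Without capturing parentheses, only the separated text is returned.
        Dim Plain As New Regex("[-/]")
        Console.WriteLine(String.Join(" | ", Plain.Split(MyDate)))       ' 2006 | 12 | 31

        ' With capturing parentheses, each captured separator is inserted too.
        Dim Capturing As New Regex("([-/])")
        Console.WriteLine(String.Join(" | ", Capturing.Split(MyDate)))   ' 2006 | - | 12 | - | 31

        ' With (\s*),(\s*) both groups always participate (possibly matching
        ' nothing), so each match contributes exactly two extra elements.
        Dim Comma As New Regex("(\s*),(\s*)")
        Console.WriteLine(Comma.Split("this , that").Length)   ' 4
        Console.WriteLine(Comma.Split("this, that").Length)    ' 4 (one element is empty)
    End Sub
End Module
```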

 RegexObj.GetGroupNames()
 RegexObj.GetGroupNumbers()
 RegexObj.GroupNameFromNumber(number)
 RegexObj.GroupNumberFromName(name)

These methods allow you to query information about the names (both numeric and, if named capture is used, by name) of capturing groups in the regex. They don't refer to any particular match, but merely to the names and numbers of groups that exist in the regex. The sidebar below shows an example of their use.

 RegexObj.ToString()
 RegexObj.RightToLeft
 RegexObj.Options

These allow you to query information about the Regex object itself (as opposed to applying the regex object to a string). The ToString() method returns the pattern string originally passed to the regex constructor. The RightToLeft property returns a Boolean indicating whether RegexOptions.RightToLeft was specified with the regex. The Options property returns the RegexOptions that are associated with the regex. The following table shows the values of the individual options, which are added together when reported:

   0  None                  16  Singleline
   1  IgnoreCase            32  IgnorePatternWhitespace
   2  Multiline             64  RightToLeft
   4  ExplicitCapture      256  ECMAScript
   8  Compiled

The missing 128 value is for a Microsoft debugging option not available in the final product.

The sidebar shows an example of these methods in use.
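As a quick check of how the option values combine, here is a minimal sketch; the module name is made up, and the regex is the GetSubject example from earlier in this section.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OptionsValueDemo
    Sub Main()
        Dim R As New Regex("^subject: (.*)", _
                           RegexOptions.IgnoreCase Or RegexOptions.Multiline)

        ' IgnoreCase (1) plus Multiline (2) is reported as 3.
        Console.WriteLine(CInt(R.Options))        ' 3

        ' The enumeration can also describe itself by name.
        Console.WriteLine(R.Options.ToString())   ' IgnoreCase, Multiline

        ' Test for one particular flag by masking with And.
        Dim IsMultiline As Boolean = _
            (R.Options And RegexOptions.Multiline) <> RegexOptions.None
        Console.WriteLine(IsMultiline)            ' True
    End Sub
End Module
```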

9.3.3. Using Match Objects

Match objects are created by a Regex's Match method, the Regex.Match static function (discussed in a bit), and a Match object's own NextMatch method. A Match object encapsulates all information relating to a single application of a regex. It has the following properties and methods:



MatchObj.Success

This returns a Boolean indicating whether the match was successful. If not, the object is a copy of the static Match.Empty object (see page 433).



MatchObj.Value



MatchObj.ToString()

These return copies of the text actually matched.

Displaying Information about a Regex Object

The following code displays what's known about the Regex object in the variable R:

 ' Display information known about the Regex object in the variable R
 Console.WriteLine("Regex is: " & R.ToString())
 Console.WriteLine("Options are: " & R.Options)
 If R.RightToLeft Then
     Console.WriteLine("Is Right-To-Left: True")
 Else
     Console.WriteLine("Is Right-To-Left: False")
 End If
 Dim S as String
 For Each S in R.GetGroupNames()
     Console.WriteLine("Name """ & S & """ is Num #" & _
                       R.GroupNumberFromName(S))
 Next
 Console.WriteLine("---")
 Dim I as Integer
 For Each I in R.GetGroupNumbers()
     Console.WriteLine("Num #" & I & " is Name """ & _
                       R.GroupNameFromNumber(I) & """")
 Next

Run twice, once with each of the two Regex objects created with

 New Regex("^(\w+)://([^/]+)(/\S*)")
 New Regex("^(?<proto>\w+)://(?<host>[^/]+)(?<page>/\S*)", _
           RegexOptions.Compiled)

the following output is produced (with one regex cut off to fit the page):

 Regex is: ^(\w+)://([^/]+)(/\S*)      Regex is: ^(?<proto>\w+)://(?<host> …
 Options are: 0                        Options are: 8
 Is Right-To-Left: False               Is Right-To-Left: False
 Name "0" is Num #0                    Name "0" is Num #0
 Name "1" is Num #1                    Name "proto" is Num #1
 Name "2" is Num #2                    Name "host" is Num #2
 Name "3" is Num #3                    Name "page" is Num #3
 ---                                   ---
 Num #0 is Name "0"                    Num #0 is Name "0"
 Num #1 is Name "1"                    Num #1 is Name "proto"
 Num #2 is Name "2"                    Num #2 is Name "host"
 Num #3 is Name "3"                    Num #3 is Name "page"





MatchObj.Length

This returns the length of the text actually matched.



MatchObj.Index

This returns an integer indicating the position in the target text where the match was found. It's a zero-based index, so it's the number of characters from the start (left) of the string to the start (left) of the matched text. This is true even if RegexOptions.RightToLeft had been used to create the regex that generated this Match object.



MatchObj.Groups

This property is a GroupCollection object, in which a number of Group objects are encapsulated. It is a normal collection object, with Count and Item properties, but it's most commonly accessed by indexing into it, fetching an individual Group object. For example, M.Groups(3) is the Group object related to the third set of capturing parentheses, and M.Groups("HostName") is the Group object for the "HostName" named capture (e.g., after the use of (?<HostName>…) in a regex).

Note that C# requires M.Groups[3] and M.Groups["HostName"] instead.

The zeroth group represents the entire match itself. MatchObj.Groups(0).Value, for example, is the same as MatchObj.Value.



MatchObj.NextMatch()

The NextMatch() method re-invokes the original regex to find the next match in the original string, returning a new Match object.



MatchObj.Result(string)

Special sequences in the given string are processed as shown in the sidebar on page 424, returning the resulting text. Here's a simple example:

 Dim M as Match = Regex.Match(SomeString, "\w+")
 Console.WriteLine(M.Result("The first word is '$&'"))

You can use this to get a copy of the text to the left and right of the match, with

 M.Result("$`") ' This is the text to the left of the match
 M.Result("$'") ' This is the text to the right of the match

During debugging, it may be helpful to display something along the lines of:

 M.Result("[$`<$&>$']")

Given a Match object created by applying \d+ to the string 'May 16, 1998', it returns '[May <16>, 1998]', clearly showing the exact match.



MatchObj.Synchronized()

This returns a new Match object that's identical to the current one, except that it's safe for multi-threaded use.



MatchObj.Captures

The Captures property is not used often, but is discussed starting on page 437.
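To tie several of these Match properties together, here is a minimal sketch that walks all matches in a string. The module name, the regex, and the sample text are invented for illustration; the named group HostName echoes the example used above for the Groups property.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchObjectDemo
    Sub Main()
        Dim R As New Regex("(?<HostName>\w+)\.example\.com")
        Dim M As Match = R.Match("see www.example.com and ftp.example.com")

        ' Success is False once no further match exists, ending the loop.
        While M.Success
            Console.WriteLine(M.Value & " at " & M.Index & _
                              ", length " & M.Length & _
                              ", host " & M.Groups("HostName").Value)
            M = M.NextMatch()
        End While
        ' Expected output:
        '   www.example.com at 4, length 15, host www
        '   ftp.example.com at 24, length 15, host ftp
    End Sub
End Module
```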

9.3.4. Using Group Objects

A Group object contains the match information for one set of capturing parentheses (or, if a zeroth group, for an entire match). It has the following properties and methods:



GroupObj.Success

This returns a Boolean indicating whether the group participated in the match. Not all groups necessarily "participate" in a successful overall match. For example, if (this)|(that) matches successfully, one of the sets of parentheses is guaranteed to have participated, while the other is guaranteed to have not. See the footnote on page 139 for another example.



GroupObj.Value



GroupObj.ToString()

These both return a copy of the text captured by this group. If the match hadn't been successful, these return an empty string.



GroupObj.Length

This returns the length of the text captured by this group. If the match hadn't been successful, it returns zero.



GroupObj.Index

This returns an integer indicating where in the target text the group match was found. The return value is a zero-based index, so it's the number of characters from the start (left) of the string to the start (left) of the captured text. (This is true even if RegexOptions.RightToLeft had been used to create the regex that generated this Match object.)



GroupObj.Captures

The Group object also has a Captures property discussed starting on page 437.
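Group participation can be observed with an alternation such as (this)|(that), where only one set of parentheses can take part in any given match. The following is a small sketch; the module name and sample input are made up.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupObjectDemo
    Sub Main()
        Dim R As New Regex("(this)|(that)")
        Dim M As Match = R.Match("that")

        ' Only the second set of parentheses participated in the match.
        Console.WriteLine(M.Groups(1).Success)             ' False
        Console.WriteLine(M.Groups(2).Success)             ' True

        ' A non-participating group reports empty text and zero length.
        Console.WriteLine("[" & M.Groups(1).Value & "]")   ' []
        Console.WriteLine(M.Groups(1).Length)              ' 0

        ' The participating group reports what it captured and where.
        Console.WriteLine(M.Groups(2).Value)               ' that
        Console.WriteLine(M.Groups(2).Index)               ' 0
    End Sub
End Module
```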



Mastering Regular Expressions
ISBN: 0596528124
EAN: 2147483647
Year: 2004
Pages: 113
