Regular Expressions


Regular expressions often cause confusion, but they are extremely useful whenever you have to handle text input or process text. Chapter 5 discussed validation controls, including the regular expression validator, which uses a regular expression to check the value of an entry field such as an email address. This works very well for validating entry fields, but there are times when you need to process text outside of the validators, typically when writing custom text-processing or screen-scraping applications.

For example, without regular expressions, how easy would it be to extract all the links from the HTML of a web page? You could search for the "href" string, but then you would have to be flexible about the contents of the attribute string. Regular expressions allow this flexibility, by way of pattern matching.

Pattern Matching

Regular expressions allow you to search, extract, or replace substrings based on an expression, or a pattern. These expressions are where the power of regular expressions lies. The patterns available in regular expressions use special characters and sequences to identify what is being searched for. The following table lists some of the main pattern elements:

Element        Description
*              Quantifier: zero or more matches of the preceding expression
+              Quantifier: one or more matches of the preceding expression
()             Captures the matched substring into the next available capture group (a capture group holds zero, one, or more strings)
(?<name>)      Captures the matched substring into the capture group identified by name
\n             Backreference: matches the text of the nth captured group
|              Either of the expressions separated by the | character
.              Any character (except newline)
[]             Any single character within the brackets
[^]            Any single character not within the brackets
\s             Any whitespace character
\S             Any non-whitespace character
\d             Any digit character
\D             Any non-digit character

The following table shows some examples of regular expressions, and the content those expressions match:

Example          Matches
abc*             ab followed by zero or more 'c' characters
abc+             ab followed by one or more 'c' characters
abc(def)ghi      abcdefghi, and places def in the first capture group
ab(cd)ef(gh)i    abcdefghi, places cd into capture group 1, and gh into capture group 2
hello|goodbye    Either hello or goodbye
[abcdef]         Any of the characters abcdef
[a-f]            Any of the characters abcdef
[^a-f]           Any character other than abcdef
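
A few of these rows can be checked with a short console program. This is only a sketch: the module name is made up for illustration, and the ^ and $ anchors (which the table does not list) are added so that each test has to match the whole string rather than part of it.

```vb
Imports System
Imports System.Text.RegularExpressions

Module PatternTableDemo   ' hypothetical name, for illustration only
    Sub Main()
        ' abc* : "ab" followed by zero or more "c" characters
        Console.WriteLine(Regex.IsMatch("ab", "^abc*$"))                  ' True
        ' abc+ : at least one "c" is required
        Console.WriteLine(Regex.IsMatch("ab", "^abc+$"))                  ' False
        ' hello|goodbye : either word
        Console.WriteLine(Regex.IsMatch("goodbye", "^(hello|goodbye)$"))  ' True
        ' [^a-f] : any single character outside the range a to f
        Console.WriteLine(Regex.IsMatch("g", "^[^a-f]$"))                 ' True

        ' Two capture groups, numbered in the order they open
        Dim m As Match = Regex.Match("abcdefghi", "ab(cd)ef(gh)i")
        Console.WriteLine(m.Groups(1).Value)                              ' cd
        Console.WriteLine(m.Groups(2).Value)                              ' gh
    End Sub
End Module
```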

Pattern Ordering and Length

There are two important points to note about searching for patterns: the search uses the first place in the string where the pattern can match, and the match made there is the largest available, which may not be what you expect. For example, consider the following string:

  Alex Homer is an author. Despite his years, he's not the Homer that wrote Greek epics.

Let's say we use the following expression:

  Homer(.*)  

This expression looks for the word Homer, and places any characters found after it in a capture group. The thing to watch for is that the first occurrence found in the search string is used. So, what's captured is the following:

 is an author. Despite his years, he's not the Homer that wrote Greek epics. 

There are two instances of Homer, and it's the first one that is matched. This rule changes when the search expression is widened to include any characters at the start of the search string. If you use the following expression:

  .*Homer(.*)  

This looks for any characters, followed by Homer, and places any characters found after it in a capture group. However, since the leading .* takes as many characters as it can, Homer now matches the last occurrence in the string rather than the first. The overall match is larger, but the group now contains fewer characters. In this case, the following is captured:

   that wrote Greek epics.

The rules for these matches are entirely consistent, and they mean that you have to be careful in selecting match strings.
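
As a rough check of the two cases above, the following console sketch runs both expressions against the same sentence and prints the first capture group. The module and variable names are assumptions made for the example.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo   ' hypothetical name, for illustration only
    Sub Main()
        Dim source As String = "Alex Homer is an author. Despite his years, " &
                               "he's not the Homer that wrote Greek epics."

        ' The first Homer is used; the group takes everything after it.
        Console.WriteLine(Regex.Match(source, "Homer(.*)").Groups(1).Value)
        ' prints " is an author. Despite his years, he's not the Homer that wrote Greek epics."

        ' The leading .* is greedy, so Homer now lines up with the last occurrence.
        Console.WriteLine(Regex.Match(source, ".*Homer(.*)").Groups(1).Value)
        ' prints " that wrote Greek epics."
    End Sub
End Module
```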

Text Replacement

If you are using patterns to search and replace within a string, remember that the replacement text may invalidate the expression that was used to perform the search. You should therefore be careful of search patterns that pick the widest match. It's nearly always best to be as explicit as possible, by using narrow patterns.
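
To illustrate why narrow patterns are safer in replacements, here is a small sketch using the static Regex.Replace method. The sample HTML string and the replacement text X are invented for the example.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo   ' hypothetical name, for illustration only
    Sub Main()
        Dim html As String = "<b>one</b> and <b>two</b>"

        ' Wide pattern: .* runs from the first <b> to the last </b>,
        ' so the whole string is replaced in one go.
        Console.WriteLine(Regex.Replace(html, "<b>.*</b>", "X"))      ' X

        ' Narrow pattern: [^<]* cannot cross a tag boundary,
        ' so each bold element is replaced separately.
        Console.WriteLine(Regex.Replace(html, "<b>[^<]*</b>", "X"))   ' X and X
    End Sub
End Module
```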

Pattern Example

You've seen how to use the network classes to retrieve a web page from Amazon.com and extract the sales ranking for a book. Let's take a look at part of the HTML that the Amazon.com web page uses:

 <b>Amazon.com Sales Rank: </b> 52,504 </font><br> 

Notice that this is all on one line, so you need to extract the rank from the middle of text, rather than from a line on its own. Here's the search expression, this time only using one group, since you really only require the sales rank:

  <b>Amazon.com Sales Rank: </b>(?<rank>.*)</font><br>

There are several parts to this, some of which aren't directly relevant to the ranking. However, let's take the whole expression so you can see exactly what it's built from. Firstly you'll notice that you have one group (the part contained within parentheses), which is given a name. The name is defined by use of the ? character followed by a name contained within angle brackets, so here the group is called rank. The grouping doesn't affect how the expression is parsed; it is just used to allow easy access to part of the match once parsing has taken place.

It's clear which characters you need to match: those after : </b> and before the closing font tag. These are extracted by the group labeled rank.
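
A minimal sketch of this extraction follows. It assumes the HTML fragment shown earlier is already held in a string (retrieving the page is not shown), and it uses a pattern that ends in </font><br> so that it lines up with that fragment. Trim removes the spaces that surround the number inside the captured text.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SalesRankDemo   ' hypothetical name, for illustration only
    Sub Main()
        ' Assumed input: the one-line fragment quoted in the text
        Dim html As String = "<b>Amazon.com Sales Rank: </b> 52,504 </font><br>"
        Dim pattern As String = "<b>Amazon.com Sales Rank: </b>(?<rank>.*)</font><br>"

        Dim m As Match = Regex.Match(html, pattern)
        If m.Success Then
            ' Named groups are read by name from the Groups collection
            Console.WriteLine(m.Groups("rank").Value.Trim())   ' 52,504
        End If
    End Sub
End Module
```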

The Regular Expression Classes

The System.Text.RegularExpressions namespace contains eight classes for the manipulation of regular expressions. These are:

Class                   Represents
Regex                   A regular expression
Match                   The results from a single expression match
MatchCollection         A collection of results from iteratively applied matches
Group                   The results from a single captured group
GroupCollection         A collection of captured groups
Capture                 The results from a single sub-expression capture
CaptureCollection       A collection of captured sub-expressions
RegexCompilationInfo    Information about the compilation of expressions

As with pattern matching, we're not going to cover an exhaustive list of all the classes, properties, and methods. Instead we'll concentrate on the most useful scenarios.

The Regex Class

Regex is the root class for regular expressions, and represents an individual regular expression. It contains a number of methods to allow the creation and matching of expressions. For example:

  Dim expr As String = "hello"
  Dim re As New Regex(expr)
  re.Match("Hello everyone, hello one and all.")

This creates an expression and then uses the Match method to match the expression with the supplied string. In this case, only one part of the string can match, the second hello, since matching is case-sensitive by default.

The Regex class constructor is overloaded to allow options to be specified. For example:

  Dim expr As String = "hello"
  Dim re As New Regex(expr, RegexOptions.IgnoreCase)
  re.Match("Hello everyone, hello one and all.")

Now both Hello and hello qualify, since case is being ignored. The Match method returns the first of them; the Matches method returns them all.
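
The difference between the two constructors can be seen by looking at where the match starts and how many matches exist in total. The sketch below uses the Index property of Match and the Count property of the collection returned by Matches; the module and variable names are assumptions.

```vb
Imports System
Imports System.Text.RegularExpressions

Module IgnoreCaseDemo   ' hypothetical name, for illustration only
    Sub Main()
        Dim source As String = "Hello everyone, hello one and all."

        ' Case-sensitive: only the lowercase hello can match
        Dim sensitive As New Regex("hello")
        Console.WriteLine(sensitive.Match(source).Index)        ' 16
        Console.WriteLine(sensitive.Matches(source).Count)      ' 1

        ' Case-insensitive: Match returns the first occurrence,
        ' Matches returns both
        Dim insensitive As New Regex("hello", RegexOptions.IgnoreCase)
        Console.WriteLine(insensitive.Match(source).Index)      ' 0
        Console.WriteLine(insensitive.Matches(source).Count)    ' 2
    End Sub
End Module
```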

The options to specify come from the RegexOptions enumeration, shown in the following table. The options in effect can be read back from the Options property of the class:

RegexOption                Description
Compiled                   Specifies that the expression should be compiled to MSIL
ECMAScript                 Enables ECMAScript-compliant behavior for the expression
ExplicitCapture            Only captures explicitly named or numbered groups, so plain parentheses group without capturing
IgnoreCase                 Case-insensitive match
IgnorePatternWhitespace    Ignores un-escaped whitespace in the pattern
Multiline                  Makes ^ and $ match the beginning and end of any line, rather than of the entire string
None                       No options are set
RightToLeft                Searches from right to left. This sets the RightToLeft property of the class
Singleline                 Treats the search string as a single line (where . matches every character, including newline)
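
Options can be combined with the Or operator, because RegexOptions is a flags enumeration. The following sketch shows IgnoreCase and Multiline used together; the two-line test string is invented for the example.

```vb
Imports System
Imports System.Text.RegularExpressions

Module OptionsDemo   ' hypothetical name, for illustration only
    Sub Main()
        ' Two lines of text; hello starts the second line
        Dim source As String = "Hi everyone" & Environment.NewLine & "hello again"

        ' Without Multiline, ^ only matches at the start of the whole string
        Console.WriteLine(Regex.IsMatch(source, "^hello"))   ' False

        ' Combine options with Or
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Multiline
        Dim re As New Regex("^HELLO", opts)
        Console.WriteLine(re.IsMatch(source))                ' True
    End Sub
End Module
```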

The Match Class

The Match class contains the details of a single expression match, as returned by the Match method of the Regex class. For example:

  Dim mt As Match
  Dim expr As String = "hello"
  Dim re As New Regex(expr, RegexOptions.IgnoreCase)
  mt = re.Match("Hello everyone, hello one and all.")

You can then use the Success property to determine whether a match was made, and examine the Groups and Captures collections to identify what was matched.
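
A short console sketch of these properties in use is shown here. It also calls NextMatch, which is not covered in the text, to step on to the second occurrence; the module name is an assumption.

```vb
Imports System
Imports System.Text.RegularExpressions

Module MatchDemo   ' hypothetical name, for illustration only
    Sub Main()
        Dim re As New Regex("hello", RegexOptions.IgnoreCase)
        Dim mt As Match = re.Match("Hello everyone, hello one and all.")

        If mt.Success Then
            Console.WriteLine(mt.Value)    ' Hello
            Console.WriteLine(mt.Index)    ' 0
            Console.WriteLine(mt.Length)   ' 5

            ' NextMatch continues the search after the current match
            mt = mt.NextMatch()
            Console.WriteLine(mt.Index)    ' 16
        End If
    End Sub
End Module
```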

The Group Class

The Group class identifies a single captured group. Since an expression can contain multiple groups, the Match class has a Groups collection that contains a Group object for each group matched. For example, consider the match expression:

  (he(ll)o)  

This contains two explicit groups. One is for the entire word hello, and the other for the two l characters. There is also a third group, which is the entire expression. So, as far as matching is concerned, this expression is equivalent to:

  he(ll)o  

The only difference is the number of groups created.

Unlike the sales-ranking examples, these groups aren't explicitly named, so they are given names equivalent to their position in the collection (1, 2, and so on). You can access the groups directly, or through an enumeration. For simple expressions, it's marginally quicker to let the class name the groups, but for more complex expressions, explicit names make it clear exactly which group corresponds to which part of the match expression.

For example, consider the following expression:

  (l)+  

This expression matches one or more occurrences of the l character.

The following example demonstrates simple grouping in use:

  <%@ Page Language="VB" %>
  <%@ Import Namespace="System.Text.RegularExpressions" %>
  <%
    Dim mt As Match
    Dim gp As Group
    Dim expr As String = "h(e(ll)o)"
    Dim re As New Regex(expr, RegexOptions.IgnoreCase)
    mt = re.Match("Hello everyone, hello one and all.")
    ' Write each group of the first match on its own line
    For Each gp In mt.Groups
      Response.Write("<br />")
      Response.Write(gp.Value)
    Next
  %>

This returns the following:

  Hello
  ello
  ll

There are three groups. The first is the entire match (Hello, with a capital H, because the search ignores case and the first occurrence is used), the second corresponds to the group within the first set of parentheses, and the third is the group within the second set of parentheses.

The Group class also includes Index and Length properties, which indicate the position of the match within the search string, and the length of string that is matched.
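
As an illustration of named groups together with Index and Length, here is a console sketch. The group names rest and ls are made up for the example; they name the same two sets of parentheses used in the page above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GroupDemo   ' hypothetical name, for illustration only
    Sub Main()
        ' rest and ls are illustrative group names, not from the text
        Dim mt As Match = Regex.Match("Hello everyone, hello one and all.",
                                      "h(?<rest>e(?<ls>ll)o)",
                                      RegexOptions.IgnoreCase)

        Console.WriteLine(mt.Groups("rest").Value)   ' ello

        ' Index and Length are relative to the search string
        Dim ls As Group = mt.Groups("ls")
        Console.WriteLine("{0} at {1}, length {2}", ls.Value, ls.Index, ls.Length)
        ' prints "ll at 2, length 2"
    End Sub
End Module
```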

The Capture Class

The Capture class represents a single sub-expression capture. Each Group can have multiple captures. The Capture class really comes into its own when quantifiers are used within expressions. Quantifiers specify how many times a pattern may occur. Examples of quantifiers are * for zero or more occurrences and + for one or more occurrences. For example, consider the following expression, which searches for the first occurrence of one or more l characters:

  (l)+  

Putting this into a full example, you have:

  <%@ Page Language="VB" %>
  <%@ Import Namespace="System.Text.RegularExpressions" %>
  <%
    Dim mt As Match
    Dim gp As Group
    Dim cp As Capture
    Dim expr As String = "(l)+"
    Dim re As New Regex(expr, RegexOptions.IgnoreCase)
    mt = re.Match("Hello everyone, hello one and all.")
    ' List every group, and every capture made for that group
    For Each gp In mt.Groups
      Response.Write("Group: " & gp.Value)
      Response.Write("<br />")
      For Each cp In gp.Captures
        Response.Write(" Capture: " & cp.Value)
        Response.Write("<br />")
      Next
    Next
  %>

This gives the following result:

  Group: ll
    Capture: ll
  Group: l
    Capture: l
    Capture: l

The first group is the entire match, the ll in the first Hello, and it has a single capture. The second group is the explicit (l) group. Because the + quantifier applied it twice, it holds two captures, each a single l, and the value of the group is the last of them. This becomes clearer with another example. Let's consider the following:

 (abc)+ 

This matches one or more occurrences of the string abc. When matched against QQQabcabcabcWWWEEEabcab you get the following output:

  Group: abcabcabc
    Capture: abcabcabc
  Group: abc
    Capture: abc
    Capture: abc
    Capture: abc

The first group matches the widest expression, and there is only one occurrence of this. The second group matches the explicit group, and there are three occurrences of this.
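
The same output can be reproduced outside a page with a console sketch like the one below, which walks the Groups collection and then the Captures collection of each group. The module name is an assumption.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CaptureDemo   ' hypothetical name, for illustration only
    Sub Main()
        Dim mt As Match = Regex.Match("QQQabcabcabcWWWEEEabcab", "(abc)+")

        For Each gp As Group In mt.Groups
            ' Group 0 is the whole match; group 1 is the (abc) group,
            ' whose Value is its last capture
            Console.WriteLine("Group: " & gp.Value)
            For Each cp As Capture In gp.Captures
                Console.WriteLine("  Capture: " & cp.Value)
            Next
        Next
    End Sub
End Module
```

Note that the trailing abcab in the search string plays no part, because Match stops at the first place the expression succeeds.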

Substitutions

When using groups in expressions, you can reuse the group without having to retype it. This is known as substitution. For example, consider the expression:

  (abc)def  

This matches abcdef and places abc into the first group. Then, to match abcdefabc, you'd refer back to the first group with \1:

  (abc)def\1
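
A quick way to try reusing a group is sketched below. The ^ and $ anchors are added so the whole string has to match, and the last line shows the related $1 syntax that Regex.Replace accepts in replacement text; both go slightly beyond what the text covers, and the test strings are invented.

```vb
Imports System
Imports System.Text.RegularExpressions

Module BackreferenceDemo   ' hypothetical name, for illustration only
    Sub Main()
        ' \1 must repeat exactly what group 1 captured
        Console.WriteLine(Regex.IsMatch("abcdefabc", "^(abc)def\1$"))    ' True
        Console.WriteLine(Regex.IsMatch("abcdefxyz", "^(abc)def\1$"))    ' False

        ' In replacement text, $1 stands for the first group
        Console.WriteLine(Regex.Replace("abcdef", "(abc)def", "$1$1"))   ' abcabc
    End Sub
End Module
```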



Professional ASP.NET 1.1