How to Use Matching


The most basic use of regular expressions is that of matching substrings within a string. Matching can be for words, characters, or any other conceivable sequence required. For example, given the string "Angie called Albert" we could begin by locating the capital letter A only if it begins a word. The following regular expression would do just that:

 \bA 

Translated, it means "begin on a word boundary and find the letter A." This produces two matches, each consisting of the single letter A: the first letter of Angie and the first letter of Albert. No other characters are included in the match result. We could extend this pattern to locate all words that begin with the letter A. The regular expression to accomplish this would be as follows:

 \bA\w+ 

This expression translates to "begin at a word boundary, and locate the letter A followed by one or more word characters (letters, digits, or the underscore)." This expression again produces two matches. This time the matches are Angie and Albert.
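As a minimal sketch, the two patterns above can be tried with the static Regex.Matches method. The class name WordBoundaryDemo and the helper JoinMatches are illustrative names chosen for this example, not part of the original text.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class WordBoundaryDemo
{
    // Returns every match of pattern in input, joined with commas.
    public static string JoinMatches(string input, string pattern)
    {
        StringBuilder joined = new StringBuilder();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            if (joined.Length > 0) joined.Append(",");
            joined.Append(m.Value);
        }
        return joined.ToString();
    }

    public static void Main()
    {
        string input = "Angie called Albert";

        // Capital A at the start of a word: prints A,A
        Console.WriteLine(JoinMatches(input, @"\bA"));

        // Whole words that start with capital A: prints Angie,Albert
        Console.WriteLine(JoinMatches(input, @"\bA\w+"));
    }
}
```

The verbatim string prefix (@) keeps the backslashes from being treated as C# escape sequences.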

If we want to find words that do not begin with the letter A, but rather contain the letter A or a, the following expression could be used:

 \b[^aA\s]\w*[aA]\w+ 

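This expression reads, "begin at a word boundary with a character that is not a, A, or whitespace; skip zero or more word characters; match either a or A; then match one or more further word characters." Because of the trailing \w+, the a or A cannot be the last character of the word. In "Angie called Albert" the only match is called. Table 4.6 lists the basic escape sequences for regular expressions.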

Table 4.6. Regular Expression Single-Character Escape Sequences

Escape Sequence    Description
\a                 Matches the ASCII character 7 (system bell)
\b                 Matches the ASCII character 8 (backspace) if inside []; otherwise, represents a word boundary
\d                 Matches a decimal digit, 0-9
\D                 Matches any character that is not a decimal digit
\e                 Matches the escape character, 0x1B
\f                 Matches the form feed
\n                 Matches the newline
\r                 Matches the carriage return
\t                 Matches the tab character
\040               Matches ASCII characters as octal representations; 040 matches the space
\x20               Matches ASCII characters expressed in hexadecimal; exactly two digits
\cC                Matches ASCII control characters; \cC matches Control+C
\u0020             Matches Unicode characters using hexadecimal representation; exactly four digits
\                  When followed by a character that is not recognized as an escape sequence, matches that character literally


The items in Table 4.6 are included for completeness and are probably not going to be part of normal usage. Table 4.7 lists the standard character classes for regular expressions.
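A short sketch of how a few of these escape sequences behave when passed to Regex.IsMatch. The class name EscapeSequenceDemo and the helper MatchesWhole are illustrative names; the helper simply wraps the escape sequence in ^ and $ so the whole input must match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class EscapeSequenceDemo
{
    // True when the entire input is matched by the given escape sequence.
    public static bool MatchesWhole(string input, string escape)
    {
        return Regex.IsMatch(input, "^" + escape + "$");
    }

    public static void Main()
    {
        // Three spellings of the space character: octal, hexadecimal, Unicode.
        Console.WriteLine(MatchesWhole(" ", @"\040"));
        Console.WriteLine(MatchesWhole(" ", @"\x20"));
        Console.WriteLine(MatchesWhole(" ", @"\u0020"));

        // The tab character.
        Console.WriteLine(MatchesWhole("\t", @"\t"));

        // Inside a character class, \b means backspace (character 8).
        Console.WriteLine(MatchesWhole("\b", @"[\b]"));

        // \cC is Control+C (character 3).
        Console.WriteLine(MatchesWhole("\u0003", @"\cC"));
    }
}
```

Each call in Main is expected to print True.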

Table 4.7. Regular Expression Character Classes

Character Class    Description
.                  Matches any character except the newline character, \n
[abc]              Matches any character contained in the set
[^abc]             Matches any character not in the set
[0-9a-zA-Z]        Matches any character in the ranges denoted with hyphens
\p{name}           Matches any character in the named character class specified by name
\P{name}           Matches any character not in the named character class specified by name
\w                 Matches any word character; ECMAScript-compliant behavior is equivalent to [a-zA-Z_0-9]
\W                 Matches any nonword character; ECMAScript-compliant behavior is equivalent to [^a-zA-Z_0-9]
\s                 Matches any whitespace character
\S                 Matches any nonwhitespace character
\d                 Matches any decimal digit; ECMAScript-compliant behavior is equivalent to [0-9]
\D                 Matches any character that is not a decimal digit; ECMAScript-compliant behavior is equivalent to [^0-9]
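The difference between the default and ECMAScript-compliant behavior of \w and \d can be seen with a small sketch. With default options, .NET applies these classes to Unicode letters and digits; with RegexOptions.ECMAScript they are limited to ASCII. The class and method names below are illustrative, not part of the original text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class CharacterClassDemo
{
    // True when the single-character string s is matched by \w.
    public static bool IsWordChar(string s, RegexOptions options)
    {
        return Regex.IsMatch(s, @"^\w$", options);
    }

    // True when the single-character string s is matched by \d.
    public static bool IsDigit(string s, RegexOptions options)
    {
        return Regex.IsMatch(s, @"^\d$", options);
    }

    public static void Main()
    {
        string eAcute = "\u00E9";       // Latin small letter e with acute
        string arabicThree = "\u0663";  // Arabic-Indic digit three

        // The underscore counts as a word character in both modes.
        Console.WriteLine(IsWordChar("_", RegexOptions.None));           // True
        Console.WriteLine(IsWordChar("_", RegexOptions.ECMAScript));     // True

        // Non-ASCII letters and digits match only with default options.
        Console.WriteLine(IsWordChar(eAcute, RegexOptions.None));        // True
        Console.WriteLine(IsWordChar(eAcute, RegexOptions.ECMAScript));  // False
        Console.WriteLine(IsDigit(arabicThree, RegexOptions.None));      // True
        Console.WriteLine(IsDigit(arabicThree, RegexOptions.ECMAScript)); // False
    }
}
```

When input must be restricted to ASCII digits, an explicit [0-9] class or the ECMAScript option avoids surprises.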


At this point, you have seen the basic character classes and escape sequences as well as some regular expressions. Next, you'll make use of the .NET Regex class to match input and display the results. Listing 4.5 tests several phone numbers and shows how to match multiple formats with a single expression.

Listing 4.5. Using the Regex Class for String Matching
    using System;
    using System.Text;                       // StringBuilder
    using System.Text.RegularExpressions;    // Regular expression classes

    namespace Listing_4_5
    {
        class Class1
        {
            static void Main(string[] args)
            {
                string phoneNumber1             = "555-1212";
                string phoneNumber2             = "919-555-1212";
                string phoneNumber3             = "(919) 555-1212";
                string invalidPhoneNumberFormat = "919.555.1212";

                // Match the phone number for the following formats
                // 1: 555-1212
                // 2: 919-555-1212
                // 3: (919) 555-1212
                StringBuilder expressionBuilder = new StringBuilder( );

                // The @ symbol is used so the \ is not treated as a C# escape sequence.
                // Each alternative is anchored at the start (^) and end ($) so that
                // extra leading or trailing characters are rejected.

                // 3 digits, hyphen, 4 digits
                expressionBuilder.Append( @"^\d{3}-\d{4}$" );

                // or, another expression to follow
                expressionBuilder.Append( "|" );

                // 3 digits, hyphen, 3 digits, hyphen, 4 digits
                expressionBuilder.Append( @"^\d{3}-\d{3}-\d{4}$" );

                // or, the last expression to meet the criteria
                expressionBuilder.Append( "|" );

                // open paren, 3 digits, close paren, whitespace, 3 digits, hyphen, 4 digits
                // Note: the open and close parens must be escaped with the \ character.
                expressionBuilder.Append( @"^\(\d{3}\)\s\d{3}-\d{4}$" );

                // Now we have the regular expression, create the Regex object
                Regex phoneMatchExpression = new Regex( expressionBuilder.ToString( ) );

                // Match the phone numbers
                if( phoneMatchExpression.Match( phoneNumber1 ).Success )
                    Console.WriteLine( "phoneNumber1 matches" );
                else
                    Console.WriteLine( "phoneNumber1 has invalid format" );

                if( phoneMatchExpression.Match( phoneNumber2 ).Success )
                    Console.WriteLine( "phoneNumber2 matches" );
                else
                    Console.WriteLine( "phoneNumber2 has invalid format" );

                if( phoneMatchExpression.Match( phoneNumber3 ).Success )
                    Console.WriteLine( "phoneNumber3 matches" );
                else
                    Console.WriteLine( "phoneNumber3 has invalid format" );

                if( phoneMatchExpression.Match( invalidPhoneNumberFormat ).Success )
                    Console.WriteLine( "invalidPhoneNumberFormat matches" );
                else
                    Console.WriteLine( "invalidPhoneNumberFormat has invalid format" );
            }
        }
    }
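The same three formats can also be expressed as one pattern with an optional area-code group. This is an alternative sketch rather than the listing's approach; the class name PhoneFormatDemo and the method IsValid are illustrative names. The pattern is anchored with ^ and $ so that input with extra leading or trailing characters is rejected.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class PhoneFormatDemo
{
    // Optional area code, written either as "919-" or "(919) ",
    // followed by 3 digits, a hyphen, and 4 digits.
    private static readonly Regex PhonePattern =
        new Regex(@"^(?:\d{3}-|\(\d{3}\)\s)?\d{3}-\d{4}$");

    public static bool IsValid(string candidate)
    {
        return PhonePattern.IsMatch(candidate);
    }

    public static void Main()
    {
        string[] candidates =
        {
            "555-1212", "919-555-1212", "(919) 555-1212", "919.555.1212"
        };

        foreach (string candidate in candidates)
        {
            Console.WriteLine("{0}: {1}", candidate,
                IsValid(candidate) ? "matches" : "has invalid format");
        }
    }
}
```

The (?: ... ) syntax groups the two area-code spellings without capturing them, and the trailing ? makes the whole group optional.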

Validating Data with Regular Expressions

Whenever you hear the phrase data validation, think of applying regular expressions. By using regular expressions, you can validate character ranges, length, and format. A useful example is validating passwords against a policy, such as requiring an alphanumeric password with at least one uppercase letter and one digit. To handle such validation, it is necessary to understand and apply regular expression assertions. Table 4.8 lists the assertions and their descriptions.

Table 4.8. Regular Expression Assertions

Assertion       Description
(?=pattern)     Specifies that the pattern follows this location
(?!pattern)     Specifies that the pattern does not follow this location
(?<=pattern)    Specifies that the pattern precedes this location
(?<!pattern)    Specifies that the pattern does not precede this location
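A brief sketch of the four assertions applied to one sample string. The sample text, the class name AssertionDemo, and the helper JoinMatches are made up for illustration. Assertions check the surrounding text without including it in the match, so only the digits are returned.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class AssertionDemo
{
    // Returns every match of pattern in input, joined with commas.
    public static string JoinMatches(string input, string pattern)
    {
        StringBuilder joined = new StringBuilder();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            if (joined.Length > 0) joined.Append(",");
            joined.Append(m.Value);
        }
        return joined.ToString();
    }

    public static void Main()
    {
        string input = "cost: $40, tax 7%, qty 3";

        // (?=%)   digits that are followed by a percent sign: 7
        Console.WriteLine(JoinMatches(input, @"\d+(?=%)"));

        // (?!%)   whole numbers that are not followed by a percent sign: 40,3
        Console.WriteLine(JoinMatches(input, @"\b\d+\b(?!%)"));

        // (?<=\$) digits that are preceded by a dollar sign: 40
        Console.WriteLine(JoinMatches(input, @"(?<=\$)\d+"));

        // (?<!\$) whole numbers that are not preceded by a dollar sign: 7,3
        Console.WriteLine(JoinMatches(input, @"(?<!\$)\b\d+"));
    }
}
```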


Armed with assertions, it is now possible to validate a password whose length is 8 to 12 characters and that must include at least one digit, one lowercase letter, and one uppercase letter. The following expression uses three lookahead assertions to implement this validation:

 ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,12}$ 
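A minimal sketch of this kind of password check, assuming the 8-to-12-character length stated in the text and one lookahead each for a digit, a lowercase letter, and an uppercase letter. The class name PasswordPolicyDemo, the method IsAcceptable, and the sample passwords are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class PasswordPolicyDemo
{
    // Assumed policy: 8 to 12 characters, with at least one digit,
    // one lowercase letter, and one uppercase letter.
    // Each lookahead scans from the start of the string without consuming input.
    private static readonly Regex Policy =
        new Regex(@"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,12}$");

    public static bool IsAcceptable(string password)
    {
        return Policy.IsMatch(password);
    }

    public static void Main()
    {
        string[] samples = { "Passw0rd", "password1", "PASSWORD1", "Pass1" };
        foreach (string sample in samples)
        {
            Console.WriteLine("{0}: {1}", sample, IsAcceptable(sample));
        }
    }
}
```

Because . matches any character except the newline, this pattern does not limit the password to alphanumeric characters. Replacing .{8,12} with [a-zA-Z0-9]{8,12} would add that restriction.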

Grouping Matches

When parsing strings, locating substrings and quickly accessing them can be difficult with System.String and System.StringBuilder alone. Regular expression grouping support, however, allows quick access to matched substrings. By creating a named group, you can retrieve the data captured by a particular part of the pattern. Such grouping makes a task such as parsing and accessing web query string parameters very simple.

Grouping is accomplished by enclosing part of a pattern in parentheses. A named group is created using the following syntax:

 (?<group-name>pattern) 

With the group name specified, a returned Match object contains a Groups collection that can be indexed by the name of the captured group. Listing 4.6 shows how to use grouping to parse and capture data from the web query string param1=data1&param2=data2.

Listing 4.6. Grouping with Regular Expressions
    using System;
    using System.Text.RegularExpressions;

    namespace Listing_4_6
    {
        class Class1
        {
            static void Main(string[] args)
            {
                string queryString = "param1=data1&param2=data2";

                // Each named group captures everything up to the next &
                // (or the end of the string). The pattern must not contain
                // stray spaces, or it will not match the query string.
                Regex queryStringExpression = new Regex(
                    @"param1=(?<param1>[^&]+)&param2=(?<param2>[^&]+)" );

                Match match = queryStringExpression.Match( queryString );
                if( match.Success )
                {
                    // display the group data
                    Console.WriteLine( "param1 := {0}", match.Groups[ "param1" ].Value );
                    Console.WriteLine( "param2 := {0}", match.Groups[ "param2" ].Value );
                }
            }
        }
    }
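When the parameter names are not known in advance, one possible variation is to capture each key and value with a pair of named groups and walk through all matches. This is a sketch under that assumption, not the listing's method; the class name QueryStringDemo, the method GetValue, and the group names key and value are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class QueryStringDemo
{
    // A key is one or more characters other than & and =;
    // a value is everything up to the next & (it may be empty).
    private static readonly Regex PairPattern =
        new Regex(@"(?<key>[^&=]+)=(?<value>[^&]*)");

    // Returns the value for the given key, or null when the key is absent.
    public static string GetValue(string queryString, string key)
    {
        foreach (Match m in PairPattern.Matches(queryString))
        {
            if (m.Groups["key"].Value == key)
                return m.Groups["value"].Value;
        }
        return null;
    }

    public static void Main()
    {
        string queryString = "param1=data1&param2=data2";

        // Print every key/value pair found in the query string.
        foreach (Match m in PairPattern.Matches(queryString))
        {
            Console.WriteLine("{0} := {1}",
                m.Groups["key"].Value, m.Groups["value"].Value);
        }
    }
}
```

This sketch does not decode URL-encoded characters; it only illustrates indexing the Groups collection by name.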

Replacing Matched Strings

One of the more useful features of regular expressions is the ability to implement search-and-replace-style functionality with a very powerful language: regular expressions themselves. The replacement works for both named and unnamed groups and allows a new string to be created based on the matched pattern and a supplied replacement expression.

TIP

Unnamed groups are created whenever a pattern is enclosed in parentheses. These unnamed groups have a one-based ordinal number assigned to them and can be referenced as $1, $2, $3, ….


Looking back at the phone number validation example, if the phone number is 9195551212 and you want to display it as (919) 555-1212, you could use a matching expression in combination with a replacement expression. The code necessary to create the desired result is as follows:

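    string phoneNumber = "9195551212";
    string match = @"(\d{3})(\d{3})(\d{4})";
    string replace = @"($1) $2-$3";
    // The static Replace method takes the input, the pattern, and the replacement
    string result = Regex.Replace( phoneNumber, match, replace );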

After executing the Replace method, the result string would contain the newly formatted phone number.
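The same replacement can be written with named groups, which are referenced in the replacement string as ${name}. The sketch below assumes a 10-digit input; the class name PhoneReplaceDemo, the method Format, and the group names area, prefix, and line are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class PhoneReplaceDemo
{
    // Reformats exactly 10 digits as (area) prefix-line.
    // Input that does not match the pattern is returned unchanged.
    public static string Format(string digits)
    {
        return Regex.Replace(
            digits,
            @"^(?<area>\d{3})(?<prefix>\d{3})(?<line>\d{4})$",
            "(${area}) ${prefix}-${line}");
    }

    public static void Main()
    {
        Console.WriteLine(Format("9195551212"));  // (919) 555-1212
        Console.WriteLine(Format("5551212"));     // 5551212 (no match, unchanged)
    }
}
```

Named references keep the replacement string readable when a pattern contains many groups or when groups are later reordered.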



    Visual C#. NET 2003 Unleashed
    ISBN: 672326760
    EAN: N/A
    Year: 2003
    Pages: 316
