Regular Expressions

The Perl programming language has long been revered by many CGI programmers for its ability to quickly and effortlessly extract or replace sub-strings within a body of text. Not too long ago, you could probably ask any Perl guru to name the top three advantages of his language, and you would unwillingly receive a detailed lecture on the power of regular expressions. Well, those days are surely over, and there wouldn't be any mention of regular expressions in this chapter if they were still considered a Perl advantage. You can now think of your Perl friends as excellent resources when you need to perform a string parsing operation with .NET technology (that is, if they will even talk to you about .NET).

For the newcomer, the term regular expressions refers to an extremely rich set of informally standardized pattern-matching tools that removes every last bit of drudgery from the text-parsing process. Do you need to extract phone numbers that may be written with or without parentheses or hyphens from a 359K document? These days, you can grab them all at a cost of about five lines of code. Let's face it: nobody has time to iterate through characters and track positions in a string just to find a lousy ZIP code that might be written any of several different ways. Learn to use regular expressions, and you will not only save yourself countless hours and headaches when you are retrieving and parsing data while networking, but you will forever change your approach to the parsing of text.

This section covers some of the tools included in the System.Text.RegularExpressions namespace and how those tools can be used to greatly simplify your networking applications. We'll start with a simple example using the Regex and Match objects, followed by a crash course on regular expressions tools and syntax. Then, we'll use the very same tools to perform search-and-replace operations. Finally, you'll practice writing regular expressions with a Web application that allows you to extract information from the source code of any Web page you have a URL for.

Using Regular Expressions to Match Sub-Strings

Suppose you're a programmer for Bob's Used Cars, a large used-car dealership. An agreement has recently been set up in which Bob's Used Cars has agreed to purchase at a discount all vehicles accepted for trade-in by Tom's Auto Mall, a new-car dealership. At the end of each month, Bob's Used Cars will be sent a URL pointing to a report of all trade-ins received under the agreement, with the report formatted like the sample shown below.

    Used Cars Received Week of 3/8/2002
    Lot ID:  1742-208
    Make:    Saturn
    Model:   SL1
    Year:    1999
    Price:   $10,294
    Lot ID:  2283-517
    Make:    Honda
    Model:   Accord
    Year:    1993
    Price:   $3,309
    Lot ID:  2413-502
    Make:    Nissan
    Model:   Maxima
    Year:    1990
    Price:   $4700

Your job is to develop an application that will retrieve the data from the Tom's Auto Mall Web server, convert it to HTML, and present the HTML on the Bob's Used Cars Web site. To keep all focus on the functionality provided by the RegularExpressions namespace, let's assume that you have already made your application retrieve the report and are ready to pass it as a string parameter to the HTML conversion function we'll create.

To get the ball rolling, we'll create a Regex object that can be used to extract the value of one field. We'll get this working properly, and then we'll modify the regular expression to extract and format the remaining fields as well. (If this is confusing, skip down to read the section entitled "A Taste of Regular Expressions.")

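C#
    // requires: using System.Text.RegularExpressions;
    private string ConvertToHtml( string strData )
    {
        // store the regular expression
        string strExpr = "Lot ID:\\s*(?<lot_id>.*)";
        // create and initialize a Regex object
        Regex regEx = new Regex( strExpr );
        // (the matching code is added in the next steps)
    }
VB
    ' requires: Imports System.Text.RegularExpressions
    Private Function ConvertToHtml( ByVal strData As String ) As String
        ' VB string literals do not escape the backslash, so \s is written once
        Dim strExpr As String = "Lot ID:\s*(?<lot_id>.*)"
        Dim regEx As New Regex( strExpr )
        ' (the matching code is added in the next steps)
    End Function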

Now you have a Regex object that can eventually be used to extract the Lot ID from one or more records, but for now, it doesn't do anything you can actually benefit from. So to make your effort worthwhile, you must tell the Regex object to process a string using your regular expression. This can be done simply by passing the string you want processed to the Match() method of the Regex object, as follows:

C#
    Match myMatch = regEx.Match( strData );
VB
    Dim myMatch As Match = regEx.Match( strData )

That's all there is to it, really. You probably never thought parsing a string could be so simple! Now, all you have to do is check the Match object to see whether a match was found by evaluating the Boolean Success property. If a match was found, you will extract the value of group "lot_id" and append that value to strHtml. Then, just tell the Match object to look for another match by calling its NextMatch() method.

C#
    while( myMatch.Success )
    {
        strHtml += myMatch.Groups["lot_id"].Value + "<BR>\r\n";
        myMatch = myMatch.NextMatch();
    }
VB
    While myMatch.Success
        ' VB indexes with parentheses and has no \r\n escape, so use vbCrLf
        strHtml += myMatch.Groups("lot_id").Value + "<BR>" + vbCrLf
        myMatch = myMatch.NextMatch()
    End While

The NextMatch() method of the Match object is a particularly helpful method because it allows you to iterate through every occurrence of a match in a string without having to keep track of your position within the string. If you ran the program in the example, you would get the following output:

    1742-208
    2283-517
    2413-502

Here's the entire method:

C#
    private string ConvertToHtml( string strData )
    {
        string strHtml = ""; // value to be returned
        // store the regular expression
        string strExpr = "Lot ID:\\s*(?<lot_id>.*)";
        // create and initialize a Regex object
        Regex regEx = new Regex( strExpr );
        // perform the initial match
        Match myMatch = regEx.Match( strData );
        while( myMatch.Success ) // while match found
        {
            // add match group "lot_id" to return value
            strHtml += myMatch.Groups["lot_id"].Value + "<BR>\r\n";
            // attempt to match again
            myMatch = myMatch.NextMatch();
        }
        return strHtml;
    }
VB
    Private Function ConvertToHtml(ByVal strData As String) As String
        Dim strHtml As String = "" ' value to be returned
        ' store the regular expression (single backslash in VB)
        Dim strExpr As String = "Lot ID:\s*(?<lot_id>.*)"
        ' create and initialize a Regex object
        Dim regEx As New Regex(strExpr)
        ' perform the initial match
        Dim myMatch As Match = regEx.Match(strData)
        While myMatch.Success
            ' add match group "lot_id" to return value
            strHtml += myMatch.Groups("lot_id").Value + "<BR>" + vbCrLf
            ' attempt to match again
            myMatch = myMatch.NextMatch()
        End While
        Return strHtml
    End Function

If you have never used regular expressions before, you may be wondering exactly what just happened, so let's look at the regular expression itself in just a little more detail. To be completely honest, you will probably have a much clearer understanding of this material on your second time through, and especially after you have had some time to play around with some different regular expressions formulas. But just to get you started thinking about regular expressions, Table 14.1 shows a breakdown of the formula we just used.

To avoid some frustration, keep in mind that the C# compiler rejects escape sequences it does not recognize, so a regular expression metacharacter such as \s cannot appear as-is in an ordinary C# string literal. Escape the backslash itself and write \\s in your C# regular expression string. Visual Basic string literals do not treat the backslash as an escape character, so in VB you write \s with a single backslash.
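One way to sidestep the doubled backslashes in C# is a verbatim string literal (prefixed with @), which passes backslashes through to the Regex parser unchanged. The following is a minimal sketch; the class name EscapeDemo, the helper FirstLotId, and the one-line sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Ordinary literal: the C# compiler turns each \\ into a single backslash.
    public const string Escaped = "Lot ID:\\s*(?<lot_id>.*)";

    // Verbatim literal: backslashes reach the Regex parser untouched.
    public const string Verbatim = @"Lot ID:\s*(?<lot_id>.*)";

    // Hypothetical helper: returns the first lot_id capture, or "" if none.
    public static string FirstLotId(string pattern, string data)
    {
        Match m = Regex.Match(data, pattern);
        return m.Success ? m.Groups["lot_id"].Value : string.Empty;
    }

    public static void Main()
    {
        string sample = "Lot ID:  1742-208"; // made-up input
        Console.WriteLine(Escaped == Verbatim);          // True
        Console.WriteLine(FirstLotId(Verbatim, sample)); // 1742-208
    }
}
```

Both literals produce the same pattern text, so the choice is purely about readability in the source file.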

Table 14.1. The Regular Expressions Used in the Example

| Component | Explanation |
|---|---|
| Full Expression: | Lot ID:\s*(?<lot_id>.*) |
| Lot ID: | Simply tells the parser to match "Lot ID:". |
| \s* | Tells the parser to match zero or more whitespace characters. |
| .* | Tells the parser to match zero or more characters. In this case, the dot (.) will match anything but a line feed (\n), so .* effectively gives you the rest of the line after "Lot ID:" and zero or more spaces. (If the data uses \r\n line endings, the match can include the trailing carriage return.) Options passed to the Regex constructor can change how the dot behaves; we'll discuss them later. |
| (?<lot_id>.*) | Tells the parser to match .* and store the result of .* so that it may be referenced by the name "lot_id." A match group is always enclosed in parentheses. Notice that ?<lot_id> names the group, while any results of the match criteria that follow are placed in the group. If it were possible to write this as a C# expression, it might be written as lot_id = .*; |
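As a rough illustration of how a constructor option changes the dot, the sketch below runs the same pattern with and without RegexOptions.Singleline. By default the dot stops at a line feed; with Singleline it matches line feeds too. The class name DotDemo, the helper Capture, and the two-line sample (which uses \n line endings) are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Hypothetical helper: captures the lot_id group under the given options.
    public static string Capture(string data, RegexOptions options)
    {
        Regex regEx = new Regex(@"Lot ID:\s*(?<lot_id>.*)", options);
        return regEx.Match(data).Groups["lot_id"].Value;
    }

    public static void Main()
    {
        string data = "Lot ID:  1742-208\nMake:    Saturn"; // made-up input

        // Default options: .* stops at the line feed.
        Console.WriteLine(Capture(data, RegexOptions.None));

        // Singleline: the dot also matches \n, so .* runs to the end of the string.
        Console.WriteLine(Capture(data, RegexOptions.Singleline));
    }
}
```

The first call yields only the lot number; the second also swallows the "Make:" line, which is why the default is usually what you want for line-oriented data.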

We'll add some additional ammo to our regular expressions arsenal a little later, but for now, let's modify the above example to extract all fields from the automobile report and return an HTML table. Here's an expression we could pass to the Regex object's constructor:

C#
    string strExpr =
        "Lot ID:\\s*(?<lot_id>.*)\r\n" +
        "Make:\\s*(?<make>.*)\r\n" +
        "Model:\\s*(?<model>.*)\r\n" +
        "Year:\\s*(?<year>.*)\r\n" +
        "Price:\\s*(?<price>.*)";
VB
    ' single backslashes in VB; the Regex parser itself interprets \s, \r and \n
    Dim strExpr As String = _
        "Lot ID:\s*(?<lot_id>.*)\r\n" + _
        "Make:\s*(?<make>.*)\r\n" + _
        "Model:\s*(?<model>.*)\r\n" + _
        "Year:\s*(?<year>.*)\r\n" + _
        "Price:\s*(?<price>.*)"

As you can see, the regular expression ends up looking like a mask laid over one record of the actual data. This particular example produces a mask that is fairly readable, but with more general regular expressions, you may have to do some intense studying to figure out exactly what the expression is attempting to match. For example, a regular expression to match an e-mail address would look like this:

 \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)* 

Unless you're at least mildly experienced with regular expressions, the above line probably looks like something a comic book character might say in a fit of rage. If this is the case, don't worry; it will all make sense once you have had a little more exposure.
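To see the e-mail expression at work before decoding it, here is a small sketch that wraps it in ^ and $ so the whole input must match. The class name EmailDemo, the method LooksLikeEmail, and the sample addresses are made up for illustration; the pattern is a rough filter, not a complete address validator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailDemo
{
    // The e-mail expression from the text, as a verbatim string.
    public const string EmailPattern =
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Hypothetical helper: true when the entire candidate fits the pattern.
    public static bool LooksLikeEmail(string candidate)
    {
        // ^ and $ anchor the pattern to the start and end of the input.
        return Regex.IsMatch(candidate, "^" + EmailPattern + "$");
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeEmail("bob.smith@example.com")); // True
        Console.WriteLine(LooksLikeEmail("not an address"));        // False
    }
}
```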

Here is the finished method you could use to present the data from the Tom's Auto Mall report in an HTML table:

C#
    private string ConvertToHtml( string strData )
    {
        string strHtml;
        // The string for the expression.
        string strExpr =
            "Lot ID:\\s*(?<lot_id>.*)\r\n" +
            "Make:\\s*(?<make>.*)\r\n" +
            "Model:\\s*(?<model>.*)\r\n" +
            "Year:\\s*(?<year>.*)\r\n" +
            "Price:\\s*(?<price>.*)";
        // Create the Regex object and do the match.
        Regex regEx = new Regex( strExpr );
        Match myMatch = regEx.Match( strData );
        // Start the HTML string.
        strHtml = "<table border=\"1\">\r\n";
        while( myMatch.Success )
        {
            // Add another row of data.
            strHtml +=
                "\t<tr>\r\n" +
                "\t\t<td>" + myMatch.Groups["lot_id"].Value + "</td>\r\n" +
                "\t\t<td>" + myMatch.Groups["make"].Value + "</td>\r\n" +
                "\t\t<td>" + myMatch.Groups["model"].Value + "</td>\r\n" +
                "\t\t<td>" + myMatch.Groups["year"].Value + "</td>\r\n" +
                "\t\t<td>" + myMatch.Groups["price"].Value + "</td>\r\n" +
                "\t</tr>\r\n";
            // Find the next match.
            myMatch = myMatch.NextMatch();
        }
        // End the HTML table.
        strHtml += "</table>\r\n";
        return strHtml;
    }
VB
    Private Function ConvertToHtml(ByVal strData As String) As String
        Dim strHtml As String = ""
        ' The pattern uses single backslashes; the Regex parser interprets \s, \r and \n.
        Dim strExpr As String = _
            "Lot ID:\s*(?<lot_id>.*)\r\n" + _
            "Make:\s*(?<make>.*)\r\n" + _
            "Model:\s*(?<model>.*)\r\n" + _
            "Year:\s*(?<year>.*)\r\n" + _
            "Price:\s*(?<price>.*)"
        Dim regEx As New Regex(strExpr)
        Dim myMatch As Match = regEx.Match(strData)
        ' Output strings need vbTab and vbCrLf because VB has no \t or \r\n escapes.
        strHtml = "<table border=1>" + vbCrLf
        While myMatch.Success
            strHtml += _
                vbTab + "<tr>" + vbCrLf + _
                vbTab + vbTab + "<td>" + myMatch.Groups("lot_id").Value + "</td>" + vbCrLf + _
                vbTab + vbTab + "<td>" + myMatch.Groups("make").Value + "</td>" + vbCrLf + _
                vbTab + vbTab + "<td>" + myMatch.Groups("model").Value + "</td>" + vbCrLf + _
                vbTab + vbTab + "<td>" + myMatch.Groups("year").Value + "</td>" + vbCrLf + _
                vbTab + vbTab + "<td>" + myMatch.Groups("price").Value + "</td>" + vbCrLf + _
                vbTab + "</tr>" + vbCrLf
            myMatch = myMatch.NextMatch()
        End While
        strHtml += "</table>" + vbCrLf
        Return strHtml
    End Function

That's it. The result of this function is now a string containing an HTML table that neatly displays all the cars that have been purchased from the new-car dealership in the past month. What a large benefit for such a minute amount of code.
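The same loop can also be written with Regex.Matches, which returns every match at once, and a StringBuilder to avoid repeated string concatenation. The sketch below is one possible variation, not the chapter's method: the class name ReportDemo, the method ToRows, and the use of [^\r\n]* with \r?\n (so captured values never contain a stray carriage return and both line-ending styles are accepted) are choices made for this illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ReportDemo
{
    // Same field layout as the report; [^\r\n]* captures up to the end of the line.
    private const string Pattern =
        @"Lot ID:\s*(?<lot_id>[^\r\n]*)\r?\n" +
        @"Make:\s*(?<make>[^\r\n]*)\r?\n" +
        @"Model:\s*(?<model>[^\r\n]*)\r?\n" +
        @"Year:\s*(?<year>[^\r\n]*)\r?\n" +
        @"Price:\s*(?<price>[^\r\n]*)";

    private static readonly string[] Fields =
        { "lot_id", "make", "model", "year", "price" };

    // Hypothetical helper: one <tr> per record found in the report.
    public static string ToRows(string report)
    {
        StringBuilder html = new StringBuilder();
        foreach (Match m in Regex.Matches(report, Pattern))
        {
            html.Append("<tr>");
            foreach (string name in Fields)
            {
                html.Append("<td>").Append(m.Groups[name].Value).Append("</td>");
            }
            html.Append("</tr>\r\n");
        }
        return html.ToString();
    }

    public static void Main()
    {
        string report =
            "Lot ID:  1742-208\r\nMake:    Saturn\r\nModel:   SL1\r\n" +
            "Year:    1999\r\nPrice:   $10,294\r\n" +
            "Lot ID:  2283-517\r\nMake:    Honda\r\nModel:   Accord\r\n" +
            "Year:    1993\r\nPrice:   $3,309";
        Console.Write(ToRows(report));
    }
}
```

For a report with only a handful of records either style is fine; the StringBuilder version mainly pays off when the report grows large.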

A Taste of Regular Expressions

Now that you've learned to process regular expressions in C#, it will help to have some basic knowledge of regular expressions themselves. Table 14.2 is a limited dictionary containing some of the more common regular expressions metacharacters and their meanings. Table 14.3 offers a few examples of regular expressions syntax.

Table 14.2. Common Regular Expressions

| Metacharacter | Meaning |
|---|---|
| \w | Matches an alphanumeric character, including the underscore. |
| \W | Matches a non-alphanumeric character. |
| \d | Matches a digit character. |
| \D | Matches a non-digit character. |
| \s | Matches a whitespace character. |
| \S | Matches a non-whitespace character. |
| . | Using default options, matches any single character except a line feed (\n). |
| * | Modifies preceding character or grouping of characters to match zero or more times. For example, \w* would match zero or more alphanumeric characters. (Possible Matches: "", "A", "Apple") |
| + | Modifies preceding character or grouping of characters to match one or more times. For example, \d+ would match one or more digits. (Possible Matches: "0", "093", "0479") |
| [ ] | Brackets specify a collection of characters to be matched. For example, the regular expression [aeiou] would match any vowel. |
| Min-Max | Specifies a range of characters to be matched. Ranges are specified with a hyphen between the minimum and maximum values ("A-Z", "a-z", "0-9") and must be contained in a bracketed grouping. A range is also treated as a single character and can be used with single characters in bracketed groups. For example, [0-9A-Fa-f] or [0-9A-Fabcdef] would match any single hexadecimal digit. |
| ^ | Represents the beginning of the string. |
| $ | Represents the end of a string. |
| ? | Matches preceding character zero or one time. When used after a + or *, this character causes the match to be less "greedy." For example, .* would match an entire line, but .*?</font> would match only the text up to and including the first literal "</font>". The ? character is also used to define a group name in .NET regular expressions. For example, (?<my_group>\w+) would match a sequence of one or more alphanumeric characters and store the match in a group called "my_group." |
| {n} | Matches preceding character or grouping exactly n times. |
| {n1,n2} | Matches preceding character or grouping a minimum of n1 times and a maximum of n2 times. |
| {n,} | Matches preceding character or grouping a minimum of n times. |
| \metacharacter | Escapes a metacharacter. Because characters such as the plus sign, dollar sign, and period have special meaning to the Regex parser, use \+, \$, \. when you want to match those characters in a string. |
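A few of these metacharacters are easier to absorb by running them. The sketch below tries greedy and lazy quantifiers, a counted quantifier, the optional ?, and the ^ and $ anchors; the class name MetaDemo, the helper First, and all sample strings are invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaDemo
{
    // Hypothetical helper: text of the first match, or "" when nothing matches.
    public static string First(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string html = "<font>one</font> and <font>two</font>"; // made-up input

        // Greedy: .* grabs as much as it can, up to the last </font>.
        Console.WriteLine(First(html, @"<font>.*</font>"));
        // Lazy: .*? stops at the first </font>.
        Console.WriteLine(First(html, @"<font>.*?</font>"));   // <font>one</font>

        // {n}: exactly four digits.
        Console.WriteLine(First("Year: 1999", @"\d{4}"));      // 1999

        // ?: the preceding character is optional (zero or one time).
        Console.WriteLine(First("colour", "colou?r"));         // colour

        // ^ and $ with a bracketed range: the whole string must be a-f letters.
        Console.WriteLine(Regex.IsMatch("cab", "^[a-f]+$"));   // True
        Console.WriteLine(Regex.IsMatch("cab!", "^[a-f]+$"));  // False
    }
}
```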

Table 14.3. Regular Expressions that are Built into Visual Studio .NET

| Regular Expression | Syntax |
|---|---|
| US Zip+4 | (?<zip_five>\d{5})\D+(?<zip_four>\d{4}) |
| US Phone No. | (?<area_code>\d{3})\D+(?<prefix>\d{3})\D+(?<suffix>\d{4}) |
| Currency | \$(?<dollars>\d*)\.(?<cents>\d{2}) |
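As a sketch of how the US phone number expression might be used, the code below pulls every phone number out of a piece of text and normalizes it to one format. The class name PhoneDemo, the method ExtractPhones, and the sample sentence are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhoneDemo
{
    // The US phone number expression from Table 14.3.
    private const string PhonePattern =
        @"(?<area_code>\d{3})\D+(?<prefix>\d{3})\D+(?<suffix>\d{4})";

    // Hypothetical helper: returns each number found as area-prefix-suffix.
    public static List<string> ExtractPhones(string text)
    {
        List<string> phones = new List<string>();
        foreach (Match m in Regex.Matches(text, PhonePattern))
        {
            phones.Add(m.Groups["area_code"].Value + "-" +
                       m.Groups["prefix"].Value + "-" +
                       m.Groups["suffix"].Value);
        }
        return phones;
    }

    public static void Main()
    {
        string text = "Call (555) 123-4567 or 555-987-6543."; // made-up input
        foreach (string phone in ExtractPhones(text))
        {
            Console.WriteLine(phone);
        }
    }
}
```

Note that \D+ requires at least one non-digit between the parts, so a run such as 5551234567 is not matched, and it accepts any non-digit separator, so the expression is best treated as a starting point to tighten for real data.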

As mentioned earlier in the chapter, Perl's unrivaled support of regular expression pattern-matching has long been one of the major reasons for its popularity as a language. Fortunately, most regular expressions parsers, such as the one provided by the System.Text.RegularExpressions namespace, behave very similarly to Perl's, so you can easily dig up a wealth of regular expressions dictionaries and examples on the Internet, such as at www.regxlib.com. Just be sure to test your formulas in C# and not assume that they will all translate perfectly.

The Search-And-Replace Operation

Now that you've seen how regular expressions easily extract information from a string, you'll be delighted to find out that replacing substrings is even simpler.

In this cheesy but appropriate example, you'll see how regular expressions can literally be used to take the work out of networking. This example is so simple that we won't even create a new method.

Create a new ASP.NET/C# Web application and place a label named lblReplace and a button named btnReplace on a WebForm. For the btnReplace control's event handler, place the following code:

C#
    private void btnReplace_Click(object sender, System.EventArgs e)
    {
        // create new regex object to match 'work'
        Regex regEx = new Regex("work");
        // replace any matches of 'work' with ''
        lblReplace.Text = regEx.Replace("Networking", "");
    }
VB
    Private Sub btnReplace_Click( ByVal sender As Object, ByVal e As System.EventArgs )
        ' create new regex object to match 'work'
        Dim regEx As New Regex( "work" )
        ' replace any matches of 'work' with ''
        lblReplace.Text = regEx.Replace("Networking", "")
    End Sub

Run the application. When you press the button, the label should display "Neting" to show that you've literally taken the work out of networking.

You'd probably rarely want to take a word out of another word as shown in the example, but suppose your company is a member of a business legal association that requires each member company to dedicate a portion of its Web site to providing uniform and specific information. The association distributes an HTML template that must be used if the member companies are to remain in compliance. You are frustrated because the association keeps changing its layout and has been e-mailing you nearly every other day to notify you that an updated template is available.

Fortunately for you, the association posts the template on its Web site and maintains the same information, but it just can't decide on the look and feel of the documents. Rather than manually downloading the page, opening it in Notepad, and inserting the same information you've been inserting for the past two weeks, you decide to dole the task out to your new friend, the Regex object.

Let's assume the HTML page looks like the sample below, and that you have already downloaded it and stored the data in a string called strLegalData. Again, we'll keep it simple so we can focus on the important points. Notice that the creators of the template have conveniently placed any text to be replaced inside brackets.

    <HTML>
    <HEAD><TITLE>[Company Name] - Legal Web Watchers Association</TITLE>
    <BODY>
      <H1>[Company Name] Company Profile</H1>
      <P>Number of Employees: [Employee Count]</P>
      <P>Rate of Turnover in 2001: [Turnover Rate]</P>
    </BODY>
    </HTML>

All you have to do is create a method to accept a field name, a value to replace the field with, and the target string containing the template data.

C#
    private void ReplaceLegalField( string strFieldName,
        string strFieldValue, ref string strData )
    {
        // \\[ and \\] match the literal brackets around the field name
        Regex strExpr = new Regex("\\[" + strFieldName + "\\]");
        strData = strExpr.Replace( strData, strFieldValue );
    }
VB
    Private Sub ReplaceLegalField( ByVal strFieldName As String, _
        ByVal strFieldValue As String, ByRef strData As String )
        ' \[ and \] match the literal brackets (single backslash in VB)
        Dim strExpr As New Regex( "\[" + strFieldName + "\]" )
        strData = strExpr.Replace( strData, strFieldValue )
    End Sub

Now, simply call the method once for each field. Remember that strLegalData holds the template you downloaded from the association's Web site.

C#
    // a ref parameter must also be marked with ref at the call site
    ReplaceLegalField("Company Name", "Parsing, Inc.", ref strLegalData);
    ReplaceLegalField("Employee Count", "64", ref strLegalData);
    ReplaceLegalField("Turnover Rate", "14.2%", ref strLegalData);
VB
    ReplaceLegalField("Company Name", "Parsing, Inc.", strLegalData)
    ReplaceLegalField("Employee Count", "64", strLegalData)
    ReplaceLegalField("Turnover Rate", "14.2%", strLegalData)

That's it! Now, the value of strLegalData is as follows:

    <HTML>
    <HEAD><TITLE>Parsing, Inc. - Legal Web Watchers Association</TITLE>
    <BODY>
      <H1>Parsing, Inc. Company Profile</H1>
      <P>Number of Employees: 64</P>
      <P>Rate of Turnover in 2001: 14.2%</P>
    </BODY>
    </HTML>

Keep in mind that you are downloading and processing this document in real time, so you won't have to bother with it ever again, unless the association adds, deletes, or renames a field. Of course, you'd probably want to grab the data you're plugging into the fields from a location outside the program.
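If the field values do come from outside the program, one possible arrangement is to hold them in a dictionary and fill every bracketed field in a single pass, using the Regex.Replace overload that takes a MatchEvaluator. This is only a sketch: the class name TemplateDemo, the method Fill, and the choice to leave unknown fields untouched are assumptions, and the dictionary here is filled by hand where a real application might load it from a file or database.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateDemo
{
    // Hypothetical helper: replaces each [Field Name] with its value, if known.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        // \[ and \] match literal brackets; the group captures the name between them.
        return Regex.Replace(template, @"\[(?<field>[^\]]+)\]", m =>
        {
            string value;
            // Unknown fields are left as-is rather than blanked out.
            return values.TryGetValue(m.Groups["field"].Value, out value)
                ? value
                : m.Value;
        });
    }

    public static void Main()
    {
        Dictionary<string, string> values = new Dictionary<string, string>
        {
            { "Company Name", "Parsing, Inc." },
            { "Employee Count", "64" },
            { "Turnover Rate", "14.2%" }
        };
        string template = "<P>[Company Name] employs [Employee Count] people.</P>";
        Console.WriteLine(Fill(template, values));
    }
}
```

Because the evaluator's return value is inserted literally, a value containing a dollar sign is not mistaken for a substitution pattern, and the field names never have to be escaped for the Regex parser.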

The addition of regular expressions capabilities to the world of object-oriented programming is sure to be a big relief for those who have long been frustrated when useful scripting languages such as Perl have been so annoyingly convenient for one or two applications.



ASP. NET Solutions - 24 Case Studies. Best Practices for Developers
ISBN: 321159659
EAN: N/A
Year: 2003
Pages: 175
