Example Programs That Use Regular Expressions

We've created two Web-based example programs that show how to use the Regex classes on retrieved HTML data. The two programs are shown and explained in the next two sections.

Regular Expressions Html Parsing Demo

This program allows users to enter a URL and a regular expression. The data from the URL is retrieved when the Pull The Data button is clicked. The pulled data will show up in the TextField object.

Once the HTML data has been retrieved from the URL, regular expression parsing can be performed on the data. Users can enter any regular expression, click the Parse button, and see all matches that were found in the TextField below the regular expression field.

You can see this application running in Figure 14.1.

Figure 14.1. This Application Applies a Regular Expression to HTML Data That Has Been Retrieved.


Several predefined URLs and regular expressions can be selected from the DropDownList objects. For instance, you can select www.yahoo.com and the Email expression to find all e-mail addresses on that page. These presets make it easy to see how the program works without having to compose a regular expression first.

Pulling HTML Data

HTML data is retrieved in response to clicking the Pull The Data button. When users click this button, an event is fired that calls the Button1_Click() method, as shown in Listing 14.1.

NOTE: The complete code for this application can be found via links from www.ASPNET-Solutions.com/chapter_14.htm. There are two versions, one for C# and one for VB. You can also view the code for both via links at the bottom of each page of the application.

This program can be run from a link on the same page.


Pulling the HTML data is very simple and relies on the WebClient class. For a detailed understanding of this class, refer to Chapter 13 and the section entitled "Using the WebClient Class."

After a WebClient object is instantiated, the URL is obtained from the URLText TextField object. The user may or may not have preceded the URL with the "http://" protocol (or scheme) specifier. We check for it because the DownloadData() method requires it. If we don't find it, we'll add it.

Once we're sure we have a URL preceded with "http://", the DownloadData() method is called. This method returns an array of bytes, which is decoded as ASCII text and assigned to the HtmlData TextField object, the user interface object on the ASP.NET page that displays the retrieved HTML.

If an exception is thrown, the HtmlData object will contain the exception message.
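
One consequence of decoding the downloaded bytes as ASCII is worth seeing in isolation. The following is a minimal console sketch, not part of the book's application; the DecodingDemo class and its method names are hypothetical. It assumes the page was sent as UTF-8 and shows that each byte outside the ASCII range comes back as a question mark.

```csharp
using System;
using System.Text;

static class DecodingDemo
{
    // Hypothetical helper: decodes bytes the same way Listing 14.1 does.
    public static string AsAscii(byte[] data)
    {
        return Encoding.ASCII.GetString(data);
    }

    // Hypothetical helper: decodes the same bytes as UTF-8 for comparison.
    public static string AsUtf8(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }
}

static class Program
{
    static void Main()
    {
        // Stand-in for the byte array that DownloadData() would return.
        // "\u00E9" is e with an acute accent, which UTF-8 encodes as two bytes.
        byte[] data = Encoding.UTF8.GetBytes("caf\u00E9");

        // ASCII decoding replaces each byte above 0x7F with '?'.
        Console.WriteLine(DecodingDemo.AsAscii(data));

        // UTF-8 decoding round-trips the original text.
        Console.WriteLine(DecodingDemo.AsUtf8(data) == "caf\u00E9");
    }
}
```

For pages that contain only ASCII characters, the two decodings give the same result, which is why the simpler call is adequate for a demonstration.
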

Listing 14.1 This Code Retrieves Data from a URL.
// Requires: using System.Net; using System.Text;
private void Button1_Click(object sender, System.EventArgs e)
{
    try
    {
        WebClient wc = new WebClient();
        byte[] data;
        string strURL;

        // Prepend the scheme if the user left it off.
        if( URLText.Text.ToUpper().Substring( 0, 7 ) != "HTTP://" )
        {
            strURL = "http://" + URLText.Text;
        }
        else
        {
            strURL = URLText.Text;
        }

        // Download the raw bytes and display them as ASCII text.
        data = wc.DownloadData( strURL );
        HtmlData.Text = Encoding.ASCII.GetString( data );
    }
    catch( Exception ex )
    {
        // Any failure is reported in the same field.
        HtmlData.Text = ex.Message.ToString();
    }
}

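
The scheme check in Listing 14.1 calls Substring( 0, 7 ), which throws an ArgumentOutOfRangeException when the entered text is shorter than seven characters; the catch block then shows that message instead of the page. As a possible alternative, here is a small sketch that uses String.StartsWith() with a case-insensitive comparison. The UrlHelper class and the EnsureHttpScheme() name are hypothetical and are not part of the book's code.

```csharp
using System;

static class UrlHelper
{
    // Hypothetical helper: returns the input with "http://" prepended
    // unless it already starts with that scheme (in any letter case).
    public static string EnsureHttpScheme(string input)
    {
        string url = (input ?? "").Trim();
        if (url.StartsWith("http://", StringComparison.OrdinalIgnoreCase))
        {
            return url;
        }
        return "http://" + url;
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(UrlHelper.EnsureHttpScheme("www.yahoo.com"));
        Console.WriteLine(UrlHelper.EnsureHttpScheme("HTTP://www.microsoft.com"));
        // Shorter than seven characters, so Substring( 0, 7 ) would throw here.
        Console.WriteLine(UrlHelper.EnsureHttpScheme("a.io"));
    }
}
```

Like the listing, this sketch recognizes only the "http://" scheme; a URL entered with "https://" would still get "http://" prepended.
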
Parsing HTML Data

Listing 14.2 shows how the application parses the data using the regular expression that the user entered. The first part of the code considers the options that the user has selected by testing the state of the four CheckBox objects that are on the page. For each one that's selected, the enumeration value is ORed to form the combined value for all selected options. The object named options contains this value.

A Regex object is created using the user-supplied regular expression and the RegexOptions object that resulted from the user option choices.

With the Regex object created, the parsing is done by calling the Matches() method with the HTML data string as the only parameter. This method returns a collection of Match objects, and these are contained in a MatchCollection object.

Once the MatchCollection object has been obtained, it's a simple matter to walk through the collection and emit the results. For each Match object, the matched text (held in its Value property) is appended to the ParsedData TextField object along with a running match number.

Listing 14.2 This Code Parses the Data with the User-Provided Regular Expression.
// Requires: using System.Text.RegularExpressions;
private void Button2_Click(object sender, System.EventArgs e)
{
    try
    {
        // Combine the selected options into a single flags value.
        RegexOptions options = RegexOptions.None;
        if( IgnoreCase.Checked )
        {
            options |= RegexOptions.IgnoreCase;
        }
        if( Singleline.Checked )
        {
            options |= RegexOptions.Singleline;
        }
        if( Multiline.Checked )
        {
            options |= RegexOptions.Multiline;
        }
        if( IgnorePatternWhitespace.Checked )
        {
            options |= RegexOptions.IgnorePatternWhitespace;
        }

        Regex re = new Regex( RegexText.Text, options );
        MatchCollection mc = re.Matches( HtmlData.Text );

        // Append a numbered entry for each match.
        int nCount = 0;
        ParsedData.Text = "";
        foreach( Match m in mc )
        {
            nCount++;
            ParsedData.Text += ( "Match " + nCount + ":\r\n" + m.Value +
                "\r\n" );
        }
    }
    catch( Exception ex )
    {
        ParsedData.Text = ex.Message.ToString();
    }
}

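
To try the same steps outside an ASP.NET page, the logic of Listing 14.2 can be sketched as a console program. This is an illustration under stated assumptions: the MatchDemo class, its method names, the Boolean parameters (which stand in for the four CheckBox objects), and the sample HTML and pattern are all hypothetical.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MatchDemo
{
    // Hypothetical helper: the four flags stand in for the CheckBox states.
    public static RegexOptions BuildOptions(bool ignoreCase, bool singleline,
        bool multiline, bool ignorePatternWhitespace)
    {
        RegexOptions options = RegexOptions.None;
        if (ignoreCase) options |= RegexOptions.IgnoreCase;
        if (singleline) options |= RegexOptions.Singleline;
        if (multiline) options |= RegexOptions.Multiline;
        if (ignorePatternWhitespace) options |= RegexOptions.IgnorePatternWhitespace;
        return options;
    }

    // Hypothetical helper: returns one numbered line per match.
    public static string ListMatches(string pattern, string input, RegexOptions options)
    {
        Regex re = new Regex(pattern, options);
        StringBuilder sb = new StringBuilder();
        int count = 0;
        foreach (Match m in re.Matches(input))
        {
            count++;
            sb.Append("Match " + count + ": " + m.Value + "\n");
        }
        return sb.ToString();
    }
}

static class Program
{
    static void Main()
    {
        // Made-up HTML fragment standing in for the downloaded page.
        string html = "<a href=\"mailto:Ann@Example.com\">Ann</a> " +
                      "<a href=\"mailto:bob@example.com\">Bob</a>";

        // With IgnoreCase set, both addresses match the lowercase pattern.
        RegexOptions options = MatchDemo.BuildOptions(true, false, false, false);
        Console.Write(MatchDemo.ListMatches(@"[a-z]+@example\.com", html, options));
    }
}
```

The sketch collects its output in a StringBuilder rather than appending to a string in the loop, which avoids creating a new string for every match when a page produces many of them.
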
Housekeeping Tasks

I've found that most people want to use a demonstration program without having to think too hard. This is especially true for this demo and people new to regular expressions. For this reason, the demo program has some predefined choices that make it easy to use. The URLs that are predefined are http://www.yahoo.com, http://www.microsoft.com, and http://www.ASPNET-Solutions.com. The regular expressions that are predefined are Email, URL, U.S. Phone Number, U.S. Social Security, and U.S. Zip Code.

The code in Listing 14.3 shows the event handlers that fire when users select something from the DropDownList objects that contain the predefined choices.

Listing 14.3 These Two Methods Allow Users to Take Advantage of Some Predefined Choices.
private void ExampleExpressions_SelectedIndexChanged(object sender,
    System.EventArgs e)
{
    // Copy the selected expression into the field; "(Custom)" clears it.
    RegexText.Text = ExampleExpressions.SelectedItem.Value;
    if( RegexText.Text == "(Custom)" )
    {
        RegexText.Text = "";
    }
}

private void URLExamples_SelectedIndexChanged(object sender,
    System.EventArgs e)
{
    // Copy the selected URL into the field; "(Custom)" clears it.
    URLText.Text = ExampleURLs.SelectedItem.Value;
    if( URLText.Text == "(Custom)" )
    {
        URLText.Text = "";
    }
}
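
The text names the predefined expressions but does not show the patterns behind them. To give a feel for what such entries might look like, the sketch below pairs four of the names with simple patterns. These patterns are illustrative assumptions, not the ones the demo actually uses, and they are deliberately loose; the ExamplePatterns class, the FirstMatch() helper, and the sample sentence are hypothetical as well. The URL entry is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ExamplePatterns
{
    // Illustrative patterns only; assumptions, not the demo's real list.
    public static readonly Dictionary<string, string> ByName =
        new Dictionary<string, string>
        {
            { "Email", @"[\w.+-]+@[\w-]+(?:\.[\w-]+)+" },
            { "U.S. Phone Number", @"\b\d{3}-\d{3}-\d{4}\b" },
            { "U.S. Social Security", @"\b\d{3}-\d{2}-\d{4}\b" },
            { "U.S. Zip Code", @"\b\d{5}(?:-\d{4})?\b" }
        };

    // Hypothetical helper: returns the first match for a named pattern,
    // or an empty string when nothing matches.
    public static string FirstMatch(string name, string input)
    {
        Match m = Regex.Match(input, ByName[name]);
        return m.Success ? m.Value : "";
    }
}

static class Program
{
    static void Main()
    {
        // Made-up sample text.
        string text = "Call 336-555-0142 or write to sales@example.com, " +
                      "Reidsville, NC 27320.";
        foreach (string name in ExamplePatterns.ByName.Keys)
        {
            Console.WriteLine(name + ": " + ExamplePatterns.FirstMatch(name, text));
        }
    }
}
```
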

Regular Expressions Html Scraping Demo

This program allows users to enter a URL and two text fragments. The text fragments represent the start and end text of the data you want to extract. The data from the URL is retrieved when the Pull The Data button is clicked. The pulled data will show up in the TextField object.

Once the HTML data has been retrieved from the URL, the desired data can be extracted. Users can enter any starting and ending text, click the Scrape button, and see the scraped text in the TextField below the Starting and Ending TextField objects.

Several predefined URLs and Starting and Ending text fragments can be selected. For instance, you can select Get Word, Get Weather, and Get Quote. This makes it easier to get an idea of how the program works without having to think too much.

You can see this application running in Figure 14.2.

Figure 14.2. This Application Scrapes Data from HTML Data That Has Been Retrieved.


Pulling HTML Data

This code works identically to the code in Listing 14.1. HTML data is retrieved in response to clicking the Pull The Data button. When users click this button, an event is fired that calls the Button1_Click() method, as shown in Listing 14.4.

WEB CONTENT: The complete code for this application can be found via links from www.ASPNET-Solutions.com/chapter_14.htm. There are two versions, one for C# and one for VB. You can also view the code for both via links at the bottom of each page of the application.

This program can be run from a link on the same page.


Listing 14.4 This Code Retrieves Data from a URL.
// Requires: using System.Net; using System.Text;
private void Button1_Click(object sender, System.EventArgs e)
{
    try
    {
        WebClient wc = new WebClient();
        byte[] data;
        string strURL;

        // Prepend the scheme if the user left it off.
        if( URLText.Text.ToUpper().Substring( 0, 7 ) !=
            "HTTP://" )
        {
            strURL = "http://" + URLText.Text;
        }
        else
        {
            strURL = URLText.Text;
        }

        // Download the raw bytes and display them as ASCII text.
        data = wc.DownloadData( strURL );
        HtmlData.Text = Encoding.ASCII.GetString( data );
    }
    catch( Exception ex )
    {
        HtmlData.Text = ex.Message.ToString();
    }
}

Extracting Text From HTML Data

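Listing 14.5 shows how the application extracts text based on the Starting and Ending text that the user enters. Unlike the previous example, in which users could provide their own options, this demo is hard-coded to use the IgnoreCase and Singleline options. These are necessary for the application to work correctly: Singleline lets the period match newline characters, so the extracted text can span multiple lines, and IgnoreCase lets the fragments match HTML tags regardless of their case.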

A Regex object is created using a regular expression that is created by combining the starting text, some regular expression code, and the ending text.

With the Regex object created, the extraction is done by calling the Match() method with the HTML data string as the only parameter. This method returns a single Match object that represents the first match. The extracted text is in the Value property of the named group MYDATA, which is retrieved through the Groups collection.

Listing 14.5 This Code Extracts Data Based on Starting and Ending Text Fragments.
// Requires: using System.Text.RegularExpressions;
private void Button2_Click(object sender, System.EventArgs e)
{
    try
    {
        // Start text, then a lazy capture that stops just before the end text.
        Regex re = new Regex( StartText.Text +
            "(?<MYDATA>.*?(?=" + EndText.Text + "))",
            RegexOptions.IgnoreCase | RegexOptions.Singleline );
        Match m = re.Match( HtmlData.Text );
        ScrapedData.Text = m.Groups["MYDATA"].Value;
    }
    catch( Exception ex )
    {
        ScrapedData.Text = ex.Message.ToString();
    }
}

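
The pattern that Listing 14.5 builds can be tried on a small string to see how the pieces interact. The following console sketch reuses the same pattern construction; the Scraper class, the Extract() name, and the sample HTML are hypothetical. The capture uses a lazy quantifier (.*?) so that it stops at the first place where the lookahead (?=...) finds the ending text, and the ending text itself is not consumed.

```csharp
using System;
using System.Text.RegularExpressions;

static class Scraper
{
    // Hypothetical helper that mirrors the pattern built in Listing 14.5.
    // Both fragments are treated as regular expression patterns.
    public static string Extract(string html, string startPattern, string endPattern)
    {
        Regex re = new Regex(
            startPattern + "(?<MYDATA>.*?(?=" + endPattern + "))",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Match m = re.Match(html);
        return m.Success ? m.Groups["MYDATA"].Value : "";
    }
}

static class Program
{
    static void Main()
    {
        // Made-up fragment: uppercase tag, text that spans lines, closing rule.
        string html = "<HR size=\"1\">\nexpert: a person with\nspecial skill\n<hr>";

        // IgnoreCase lets "<hr" match "<HR"; Singleline lets "." cross newlines.
        Console.WriteLine(Scraper.Extract(html, "<hr.*?>", "<hr").Trim());
    }
}
```

Because the starting and ending fragments are inserted into the pattern as-is, characters such as ( or ? in them are interpreted as regular expression syntax. When a fragment is meant as literal text, passing it through Regex.Escape() first is one way to avoid surprises.
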
More Housekeeping Tasks

The code in Listing 14.6 shows the event handler that fires when users select one of the predefined choices from the DropDownList object. Each choice fills in the URL, the starting text, and the ending text.

Listing 14.6 Some Predefined Choices Can Make the Demonstration of the Program Simple.
private void DropDownList1_SelectedIndexChanged(object sender,
    System.EventArgs e)
{
    switch( Convert.ToInt32( Examples.SelectedItem.Value ) )
    {
        case 0:
            // Custom: clear all three fields.
            URLText.Text = "";
            StartText.Text = "";
            EndText.Text = "";
            break;
        case 1:
            // Get Word
            URLText.Text =
                "http://www.dictionary.com/search?q=expert";
            StartText.Text = "<hr.*?>";
            EndText.Text = "<hr";
            break;
        case 2:
            // Get Weather
            URLText.Text =
                "http://www.weather.com/weather/local/27320";
            StartText.Text = "<B> Reidsville, NC ";
            EndText.Text = "</table>.*?</table>";
            break;
        case 3:
            // Get Quote
            URLText.Text =
                "http://www.quotations.com/american.htm";
            StartText.Text = "<font ";
            EndText.Text = "<p>";
            break;
    }
}


ASP. NET Solutions - 24 Case Studies. Best Practices for Developers
ISBN: 321159659
EAN: N/A
Year: 2003
Pages: 175
