Data Scraping Web Services

Data scraping is the technique of extracting useful data from Web resources. It is one of the coolest things you can do with the System.Net namespace: go out on the Web, get some data, and then use it.

Before we get too far, I need to point out that you need permission to do this. You can't just go to a Web site, take its content (which is copyrighted whether or not it explicitly says so), and use it however you like. But there are a great number of legitimate uses for this process.

Imagine that your company has some legacy Web sites where you can't get direct access to the underlying data. But it may be important for you to use the data in your application. You can make requests to the legacy Web application, scrape the data, and use it as needed. In this case, you're leveraging your company's data, and you don't have to worry about or bear the expense of interfacing with the data source.

Another thing you can do is scrape information from .gov Web sites. The data on these sites is produced by the government, and works of the U.S. government are generally in the public domain, so the data can be used in your application.

A typical data-scraping session goes like this: data is downloaded from a Web resource such as a Web site; it is stored (in memory or on disk) in a convenient form, such as a string object; the desired information is extracted; and the extracted information is used.
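Those four steps can be sketched end to end in a few lines. This is a hedged, offline sketch: the byte array below stands in for data that a real session would download from a Web resource, and the <b>...</b> pattern is just one possible extraction rule.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

class ScrapingSessionSketch
{
    static void Main()
    {
        // Step 1 (simulated): in a real session these bytes would come
        // from a download; here they are built by hand so the sketch
        // runs without a network connection.
        byte[] downloaded = Encoding.ASCII.GetBytes(
            "<html><body>Temp: <b>72 F</b> Pressure: <b>29.9 in</b></body></html>" );

        // Step 2: store the data in a convenient form -- a string.
        string html = Encoding.ASCII.GetString( downloaded );

        // Step 3: extract the desired information.
        MatchCollection mc = Regex.Matches( html, "<b>(.*?)</b>" );

        // Step 4: use the extracted information.
        foreach( Match m in mc )
        {
            Console.WriteLine( m.Groups[1].Value );
        }
    }
}
```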

You might be asking, "What makes a Web Service a desirable mechanism for scraping data?" Good question, considering your application could easily do the information scraping within its own code. The two main reasons a Web Service is desirable for data-scraping processes are that it can be used by multiple applications in disparate locations, and that only the single Web Service needs updating if the format of the Web resource changes and breaks the scraping logic.

The entire Web Service can be seen in Listing 14.13. I'll describe how it works here.

The externally callable method is called GetWeather(). It returns a string collection to the client application that contains three strings. The first string is the temperature; the second, the barometric pressure; and the third, the humidity. It takes no arguments from the client application, although a more complete weather scraper would need information such as a Zip code.
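The positional contract of that returned collection is easy to see on the client side. This is a hypothetical client sketch: a real client would call GetWeather() through a proxy class generated from the service's WSDL, so here the StringCollection is simply built by hand to show how the three positions are read.

```csharp
using System;
using System.Collections.Specialized;

class ClientSketch
{
    static void Main()
    {
        // Stand-in for the StringCollection a generated proxy's
        // GetWeather() call would return.
        StringCollection weather = new StringCollection();
        weather.Add( "72 F" );     // index 0: temperature
        weather.Add( "29.9 in" );  // index 1: barometric pressure
        weather.Add( "45%" );      // index 2: humidity

        Console.WriteLine( "Temperature: " + weather[0] );
        Console.WriteLine( "Pressure: "    + weather[1] );
        Console.WriteLine( "Humidity: "    + weather[2] );
    }
}
```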

The GetWeather() method does three things. It pulls the Web data by calling the PullHtmlData() method. It extracts the desired data by calling the ParseHtmlData() method. And, finally, it adds the temperature, pressure, and humidity strings in the StringCollection object that will be returned to the client application.

The PullHtmlData() method instantiates a WebClient object and uses it to download the data. A call to the WebClient.DownloadData() method returns an array of bytes, which is then converted to a string object and handed back through the ref parameter. If an exception is thrown, PullHtmlData() returns false, indicating an error.
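The return-value/ref-parameter convention PullHtmlData() uses can be shown in isolation. In this hedged sketch the "download" is faked with a local byte array so the code runs offline; a real implementation would call WebClient.DownloadData() inside the try block instead.

```csharp
using System;
using System.Text;

class PullSketch
{
    // Same shape as PullHtmlData(): success or failure through the
    // return value, data (or error text) through the ref parameter.
    public static bool PullHtmlData( ref string strHtmlData )
    {
        try
        {
            // Stand-in for: byte[] data = wc.DownloadData( url );
            byte[] data = Encoding.ASCII.GetBytes( "<b>72 F</b>" );
            strHtmlData = Encoding.ASCII.GetString( data );
            return true;
        }
        catch( Exception ex )
        {
            strHtmlData = ex.Message;
            return false;
        }
    }

    static void Main()
    {
        string html = "";
        Console.WriteLine( PullHtmlData( ref html ) ); // True
        Console.WriteLine( html );                     // <b>72 F</b>
    }
}
```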

The ParseHtmlData() method relies heavily on the Regex class to scrape the data from the HTML. (For more information about regular expressions and parsing HTML data, see Chapter 10.) This method first creates a Regex object from the extraction pattern, setting options to ignore case and treat the data as a single line, and then runs it against the downloaded HTML. The regular expression that is used pulls all data between the <b> and </b> tags. For the page that was downloaded (which is located at www.DotNet-Networking.com/Weather/default.aspx), this works perfectly. For other situations, a different extraction expression is needed. By the way, you have explicit permission to use this Web page to get weather data.

Once the Regex.Matches() method is called, it's a simple matter to walk through the list of Match objects and get the data. The first string will be the temperature; the second, the barometric pressure; and the third, the humidity.
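The two RegexOptions flags are doing real work in that pattern. IgnoreCase lets the expression match <B> as well as <b>, and Singleline lets the lazy .*? span line breaks inside a tag pair. A small illustration (the HTML fragment here is invented for the demonstration):

```csharp
using System;
using System.Text.RegularExpressions;

class RegexOptionsSketch
{
    static void Main()
    {
        // Upper-case open tag and an embedded newline -- both are
        // handled only because of IgnoreCase and Singleline.
        string html = "<B>72\nF</b>";

        Match m = Regex.Match( html, "<b>(?<MYDATA>.*?(?=</b>))",
            RegexOptions.Singleline | RegexOptions.IgnoreCase );

        Console.WriteLine( m.Success );                // True
        Console.WriteLine( m.Groups["MYDATA"].Value ); // 72 (newline) F
    }
}
```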

Listing 14.13 This Web Service Scrapes Weather Data and Returns It to Be Used by the Client Application.
bool PullHtmlData( ref string strHtmlData )
{
    try
    {
        WebClient wc = new WebClient();
        byte[] data =
            wc.DownloadData( Convert.ToString( Application["WeatherPage"] ) );
        strHtmlData = Encoding.ASCII.GetString( data );
        return( true );
    }
    catch( Exception ex )
    {
        strHtmlData = ex.Message;
        return( false );
    }
}

bool ParseHtmlData( string strHtmlData, ref string strTemp,
    ref string strPressure, ref string strHumidity )
{
    try
    {
        // Build the Regex from the extraction pattern, then run it
        // against the downloaded HTML data.
        Regex re = new Regex( "<b>(?<MYDATA>.*?(?=</b>))",
            RegexOptions.Singleline | RegexOptions.IgnoreCase );
        MatchCollection mc = re.Matches( strHtmlData );

        int nCount = 0;
        foreach( Match m in mc )
        {
            switch( nCount )
            {
                case 0:
                    strTemp = m.Groups["MYDATA"].Value;
                    break;
                case 1:
                    strPressure = m.Groups["MYDATA"].Value;
                    break;
                case 2:
                    strHumidity = m.Groups["MYDATA"].Value;
                    break;
            }
            nCount++;
        }
    }
    catch( Exception )
    {
        return( false );
    }
    return( true );
}

[WebMethod]
public StringCollection GetWeather()
{
    string strTemp = "", strPressure = "", strHumidity = "";
    string strHtmlData = "";
    StringCollection WeatherInfo = new StringCollection();

    if( !PullHtmlData( ref strHtmlData ) )
    {
        return( WeatherInfo );
    }

    if( !ParseHtmlData( strHtmlData, ref strTemp,
        ref strPressure, ref strHumidity ) )
    {
        return( WeatherInfo );
    }

    WeatherInfo.Add( strTemp );
    WeatherInfo.Add( strPressure );
    WeatherInfo.Add( strHumidity );

    return( WeatherInfo );
}


ASP.NET Solutions: 24 Case Studies. Best Practices for Developers
ISBN: 321159659
Year: 2003
Pages: 175
