Hack 96 Making Your Resources Scrapable with Regular Expressions

[Hack difficulty: expert]

A few tricks can make your web page data easier to parse, without needing complicated HTML libraries or convoluted logic. The benefits extend to more than just visitors; your own HTML will be more understandable too.

Scraping is an attempt to address a common problem in development: an application to which someone needs automated access was built around interfaces meant only for human use. For whatever reason, the application cannot be changed or replaced. The only apparent way to talk to the application is to be a human operator.

Well, someone surmised that something simulating a human operator might work as well. Back when text forms on green screens first fell out of favor, the functions of applications that produced them were still in demand. So, developers built programmable terminals as components in their new applications. These automated terminals were capable of extracting the characters displayed at known column and row locations, in order to harvest (or scrape) data from the screen displayed in form fields. This terminal software could also fill in form fields and send control commands, just like a human user, in order to run a formerly manual application through its paces.

Although it was often a convoluted and error-prone process, this simulation of a human operator made the impossible task of automation possible. For older applications that previously required human interaction, the process extended their usefulness just a bit longer. After creating automated terminals, access to legacy applications could be wrapped in more modern facades. Despite the scripted contortions of a simulated terminal going on in the background, an end developer would see, ideally, only a standard database API in her editor.

The Challenge of Web Scraping

Today, accessing information on the Web presents another challenge to automation and scraping. Content and applications on the Web are usually intended for human users via web browsers, so they have little or no support for automated access or control.

Depending on the site or service provider, this lack of support can be intentional, but for the most part it's simply something most webmasters and developers haven't considered. Instead, most projects on the Web treat visual appearance and user experience as their top concerns, which often results in convoluted and inconsistent HTML code to produce the desired effects in web browsers.

Unfortunately, trying to automate the use of web resources presents us with a somewhat more complicated situation than when terminals were first automated. Whereas developers scraping terminal screens mostly had to worry about the location of fields on a two-dimensional screen, developers scraping web sites have to worry about the many dimensions of HTML tag soup accepted by modern browsers, not to mention the various browser tricks used for navigation and session management.

Despite the difficulty, though, the solution in both situations remains similar: build a programmable browser that can simulate human access to web resources. Generally, this involves two intertwined tasks: navigating between web resources and extracting specific information from these resources.

Navigating between web resources

Consider a site that provides current weather conditions. Say that this particular site is a paid service, so it requires a valid user account and login before giving access to the information. In order to pull weather conditions from this site, a developer will need to have his application authenticate with the site first, then hunt through the page that displays the weather conditions for the data he wants.

With Perl, the first task can be accomplished using the WWW::Mechanize module [Hack #21] from CPAN. This module can be used to easily automate many of the tasks involved in grabbing web resources, filling out and submitting forms, managing session cookies, and following links. For example:

 #!/usr/bin/perl -w
 use strict;
 use WWW::Mechanize;

 my $agent = WWW::Mechanize->new();

 # Get the site login page.
 $agent->get('http://my.weatherexample.com/login.html');

 # Step to the third form on the page, which we've identified
 # as the login form. Fill out account details and submit.
 $agent->form_number('3');
 $agent->field('user', 'my_name');
 $agent->field('password', 'my_password');
 $agent->submit();

 # Having logged in, we should have valid session cookies or
 # whatever the site requires. On the page following login,
 # follow the link labeled "Current Conditions".
 $agent->follow('Current Conditions');

 # Grab the contents of the current weather conditions page.
 my $content = $agent->content();

This code simulates the process of a human user logging in and navigating to the "Current Conditions" page on the site. The WWW::Mechanize module is capable of much more in the way of automating web browser activity, without requiring the use of a browser or human intervention. See [Hack #22] for more details on using WWW::Mechanize .
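As a small taste of what else it can do, here's a quick sketch (against the same hypothetical weather site) that uses find_all_links to list every link whose text mentions conditions; nothing here is specific to the weather example, just a couple of handy WWW::Mechanize conveniences:

 #!/usr/bin/perl -w
 use strict;
 use WWW::Mechanize;

 my $agent = WWW::Mechanize->new();
 $agent->get('http://my.weatherexample.com/');

 # Find every link whose visible text mentions "conditions",
 # case-insensitively, and print its text and destination.
 for my $link ( $agent->find_all_links( text_regex => qr/conditions/i ) ) {
     printf "%s -> %s\n", $link->text(), $link->url();
 }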

Extracting specific information

So, now that we have the HTML source of the page that displays the current weather conditions, we can work out an approach to extract the relevant data. Let's say that a peek at this HTML source looks like this:

 <html>
 <body>
 <table><tr><td>
 <table>
         <tr>
         <td>
 <table CELLPADDING=2 BORDER=0 CELLSPACING=1 width="100%">
   <tr valign=top align=center bgcolor=#000088>
    <td colspan=2><font color=#ffffff><b>Conditions</b></font></td>
   </tr>
 <tr BGCOLOR="#eeeeee">
 <td COLSPAN=2>
 Conditions at <b>Ann Arbor, Michigan</b><br>
 Updated: <b>4:53 PM EDT on June 22, 2003</b>
 </td></tr>
 <tr BGCOLOR="#FFFFFF"><td width="35%">Temperature</td>
      <td>
     <b> 83 </b>&nbsp;&#176;F
     /
         <b> 28 </b>&nbsp;&#176;C
  </td>
      </tr>
 <tr BGCOLOR="#ddeeff"><td>Humidity</td>
 <td><b> 32% </b></td></tr>
 </table>
    </td>
 </tr></table>
 </td>
 </TR></table>
 </body>
 </html>

This is some ugly tag soup, but it was generated by an application to look good in the browser. It was never meant to be seen by human eyes. Too bad for us, but don't worry; there are plenty of tricks left.

There are a few Perl packages that give us highly tolerant processing of HTML tag soup. One of them is HTML::Tree (http://search.cpan.org/author/sburke/HTML-Tree/), which has many modules and convenience methods for walking and searching the structure of an HTML document for content to extract. After some head scratching and staring at the HTML of the weather conditions page, a few patterns emerge, and we can use HTML::TreeBuilder [Hack #19] to pull out the basic information:

 #!/usr/bin/perl -w
 use strict;
 use HTML::TreeBuilder;

 # Build a tree from the HTML grabbed earlier.
 my $tree = HTML::TreeBuilder->new();
 $tree->parse($content);

 # Drill down and find the first
 # instance of 'Conditions' in bold.
 my ($curr) = $tree->look_down(
     _tag => 'b',
     sub { $_[0]->as_text() eq 'Conditions' }
 );

 # Step back up to the first
 # containing table from 'Conditions'.
 ($curr) = $curr->look_up(_tag => 'table');

 # Grab the containing table's rows.
 my @rows = $curr->look_down(_tag => 'tr');

 # Each table row after the first contains some info we want, and each
 # piece of info is set in bold. So, extract our info as text from the
 # bold tags in each row.
 my %data = ();
 ($data{location}, $data{time}) =
     map { $_->as_text() } $rows[1]->look_down(_tag => 'b');
 ($data{temp_f}, $data{temp_c}) =
     map { $_->as_text() } $rows[2]->look_down(_tag => 'b');
 ($data{humid}) =
     map { $_->as_text() } $rows[3]->look_down(_tag => 'b');

How to Be Nicer to Scrapers

It should be clear that, even with very clever tools, developers trying to automate interaction with web resources need to jump through a few (occasionally flaming) hoops to do the simplest things a human user can do by hand. And, as mentioned earlier, some web site owners prefer things that way.

But what if, as a site owner, you'd actually like to encourage automated use of your site and make things easier? Just as the scraping process has two tasks, there are two main ways to help: make resources easier to locate and acquire, and make the data within those resources easier to extract.

Make resources easier to locate and acquire

How can you make resources easier to find and obtain? Try to reduce the number of steps to reach a desired resource, ideally down to one step via a single URL.

Instead of a login process that spans several pages and uses custom cookies for session management, why not try HTTP authentication? Or you could allow the username and password to be passed as query parameters in the request for any resource. This does have security implications; if that's a major worry, consider requiring access to your site over secure HTTP. Then again, rethink your data: must it be secured at all? If our weather site simplified access to the current conditions page along these lines, we could replace the WWW::Mechanize-based code with the following:

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;

 # Define the username, password,
 # and URL to access current conditions.
 my $user = 'my_name';
 my $pass = 'my_pass';
 my $url  = 'http://my.weatherexample.com/current';

 # Grab the desired page.
 my $content = get("$url?username=$user&password=$pass");

Notice how few hoops this example has. In fact, most of the code is there for the sake of clarity, rather than as a part of the process to find the desired resource. This is almost always a good thing, especially for the novice scraper.
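If the site opted for standard HTTP Basic authentication instead of query parameters, the client stays almost as small. Here's a sketch using LWP::UserAgent's credentials method; the host, port, and realm name are assumptions:

 #!/usr/bin/perl -w
 use strict;
 use LWP::UserAgent;

 my $ua = LWP::UserAgent->new();

 # Register credentials for the (hypothetical) host, port, and realm.
 $ua->credentials('my.weatherexample.com:80', 'Weather', 'my_name', 'my_password');

 # Fetch the page; die with the HTTP status if something went wrong.
 my $response = $ua->get('http://my.weatherexample.com/current');
 die $response->status_line unless $response->is_success;
 my $content = $response->content;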

Removing the authentication entirely would reduce the code to simply this:

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;

 my $content = get('http://my.weatherexample.com/current');

Making data easier to extract

Now that we've made site resources more accessible to programs, how do we make the data within resources easier to extract? Ideally, we shouldn't need to do anything radical or drastic to the existing site or application.

Let's consider another extraction technique: regular expressions. Perl comes with built-in support for regular expressions, as do many other scripting languages; where support isn't built in, an external module is generally available. Although their syntax is among the most opaque around, regular expressions are a much more direct and lightweight way of plucking data from text such as HTML source.

As the name suggests, regular expressions are best at finding and extracting bits from regular patterns matched in data streams. Although the rich syntax of regular expressions can capture and describe complex patterns, simpler patterns call for simpler expressions. And, since we're trying to make it easier to get at data from our site, we should consider how to offer simpler patterns.
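To see why that matters, here's what pulling just the Fahrenheit temperature looks like against the unmodified tag soup shown earlier. The pattern leans on the literal Temperature label and the incidental table layout, so any cosmetic change to the page breaks it (a sketch only):

 # Match the "Temperature" label, the cell boundary, and the bold tag
 # around the number. Fragile: any markup change breaks the pattern.
 my ($temp_f) = ( $content =~ m{Temperature</td>\s*<td>\s*<b>\s*([\d.]+)\s*</b>}is );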

Here are some suggestions for simpler patterns that are easy to implement and have little or no impact on how HTML appears in the browser:

  • Add context to your HTML data, such as ID attributes or CSS classes.

  • Try to keep individual data items and surrounding tags on one line.

  • Where data takes up multiple lines, try bracketing the data with HTML comments; for example: <!-- start info --> DATA GOES HERE <!-- end info --> (see the sketch just below).
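
The comment markers in that last suggestion make multiline data easy to isolate in one step. Here's a sketch, assuming the hypothetical start info/end info markers above:

 # Grab everything between the (hypothetical) comment markers,
 # matching across line breaks, for later processing.
 my ($block) = ( $content =~ m{<!--\s*start info\s*-->(.*?)<!--\s*end info\s*-->}is );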

For the most part, making data within HTML documents more easily extractable is a matter of reducing noise and adding easily described context around the data items. For example, with just a few tweaks, we can make the weather service's current conditions page incredibly easy to handle:

 <html>
 <body>
 <table><tr><td>
 <table>
         <tr>
         <td>
 <table CELLPADDING=2 BORDER=0 CELLSPACING=1 width="100%">
   <tr valign=top align=center bgcolor=#000088>
    <td colspan=2><font color=#ffffff><b>Conditions</b></font></td>
   </tr>
 <tr BGCOLOR="#eeeeee">
 <td COLSPAN=2>
 Conditions at <b ID="location">Ann Arbor, Michigan</b><br>
 Updated: <b ID="time">4:53 PM EDT on June 22, 2003</b>
 </td></tr>
 <tr BGCOLOR="#FFFFFF"><td width="35%">Temperature</td>
      <td>
     <b ID="temp_f">83</b>&nbsp;&#176;F
     /
         <b ID="temp_c">28</b>&nbsp;&#176;C
  </td>
      </tr>
 <tr BGCOLOR="#ddeeff"><td>Humidity</td>
 <td><b ID="humid">32%</b></td></tr>
 </table>
    </td>
 </tr></table>
 </td>
 </TR></table>
 </body>
 </html>

The only changes we've made in this example to the original HTML are to remove all the line breaks around data and to add ID attributes to all the bold tags surrounding the data items. Now, the code used to extract this data can be reduced to this:

 my %data = ();
 foreach my $id (qw(location time temp_f temp_c humid)) {
     ($data{$id}) = ($content =~ m!<b ID="$id">(.+?)</b>!i);
 }

This code loops through the names of the desired data items and, for each one, extracts the text of the bold tag labeled with the corresponding ID attribute.
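
Putting the pieces together, the whole client (fetch plus extraction) now fits in a dozen lines or so. Here's a sketch, reusing the simplified URL and the ID attributes assumed above:

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;

 # Fetch the simplified current-conditions page (hypothetical URL).
 my $content = get('http://my.weatherexample.com/current')
     or die "Couldn't fetch current conditions\n";

 # Pluck each labeled value out of its ID'd bold tag.
 my %data = ();
 foreach my $id (qw(location time temp_f temp_c humid)) {
     ($data{$id}) = ($content =~ m!<b ID="$id">(.+?)</b>!i);
 }

 print "$data{location}: $data{temp_f}F / $data{temp_c}C, humidity $data{humid}\n";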

Hacking the Hack

Since regular expressions have saved us so much effort extracting the data, we could go further and break up the date with another one:

 my ($h, $m, $ampm, $tz, $mm, $dd, $yyyy) =
   ( $data{time} =~ m!(\d+):(\d+) (..) (...) on (\w+) (\d+), (\d+)! );

This breaks up the date format into its individual parts so that we can do whatever further processing or reformatting we might want. Alternatively, if the date is in a readily understandable format, the Date::Manip module (http://search.cpan.org/author/SBECK/DateManip/) would be a smarter and infinitely more flexible choice.
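For instance, here's a sketch using Date::Manip's ParseDate and UnixDate functions; the literal "on" in the site's format is nonstandard, so we strip it first (an assumption about what ParseDate will tolerate):

 use Date::Manip;

 # The site's "4:53 PM EDT on June 22, 2003" format includes a literal
 # "on" that date parsers won't expect, so drop it before parsing.
 (my $when = $data{time}) =~ s/\bon\b\s*//;

 # ParseDate returns an internal date string, or undef on failure;
 # UnixDate then reformats it however we like.
 if ( my $date = ParseDate($when) ) {
     print UnixDate($date, "%Y-%m-%d %H:%M %Z"), "\n";
 }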

With just a few simple changes, extracting data from the weather service page no longer requires any kind of HTML parser or document tree searching. Everything can be done with a single regular expression template. And, since the pattern is so simple, the regular expression needed to extract the data is also very simple.

l.m.orchard


