Hack 22 Scraping with WWW::Mechanize

Never miss another Buffy the Vampire Slayer episode with this easy-to-learn introduction to WWW::Mechanize and HTML::TokeParser.

Screen scraping is the process of emulating an interaction with a web site: not just downloading pages, but also filling out forms, navigating around, and dealing with the HTML received as a result. As well as for traditional information lookups (like the example we'll be exploring in this hack), you can use screen scraping to make a web service do something its designers didn't give us the power to do in the first place. Here's a quick example.

I do my banking online, but I quickly get bored with having to go to my bank's site, log in, navigate around to my accounts, and check the balance on each of them. One quick Perl module (Finance::Bank::HSBC) later, I can loop through each of my accounts and print their balances, all from a shell prompt. With some more code, I can do something the bank's site doesn't ordinarily let me do: I can treat my accounts as a whole instead of as individual accounts, and find out how much money I have, could possibly spend, and owe, all in total. Another step forward would be to schedule a cron entry [Hack #90] every day to use the HSBC option to download a copy of my transactions in Quicken's QIF format, and use Simon Cozens' Finance::QIF module to interpret the file and run those transactions against a budget, letting me know whether I'm spending too much lately. This takes a simple web-based system from being merely useful to being automated and bespoke; if you can think of how to write the code, you can do it.

It's probably wise for me to add the caveat that you should be extremely careful when working with banking information programmatically, and you should be even more careful if you're storing your login details in a Perl script somewhere.
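One way to keep login details out of the script file is to read them from the environment at run time. The following is a minimal sketch; SCRAPER_LOGIN is a hypothetical variable name chosen for illustration and is not part of any module mentioned here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# SCRAPER_LOGIN is a hypothetical variable name chosen for this sketch.
# Taking the environment as a hash reference keeps the function easy to test.
sub login_from_env {
    my ($env) = @_;
    my $login = $env->{SCRAPER_LOGIN};
    die "Set SCRAPER_LOGIN in the environment first\n"
        unless defined $login && length $login;
    return $login;
}

# Report whether the login is available without printing the value itself.
my $login = eval { login_from_env(\%ENV) };
print defined $login
    ? "Login loaded from the environment\n"
    : "Not configured: $@";
```

The script file can then be shared or checked into version control without exposing the credential, although anyone who can read the process environment can still see it.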

While that's very exciting, there are also more mundane tasks you can take care of with some Perl code and a couple of modules. Andy Lester's WWW::Mechanize [Hack #22] allows you to go to a URL and explore the site, following links by name, taking cookies, filling in forms, and clicking Submit buttons. We're also going to use HTML::TokeParser to process the HTML we're given back, which is a process I've written about previously; see http://www.perl.com/pub/a/2001/11/15/creatingrss.html.
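To get a feel for those calls before pointing them at a live site, here is a minimal sketch that runs against two throwaway pages written to a temporary directory. The page markup, file names, and email address are all invented for illustration. It uses the follow_link, form_number, field, and value methods documented in current WWW::Mechanize releases, so treat it as one possible arrangement rather than this hack's own code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;
use URI::file;
use WWW::Mechanize;

# Write two tiny stand-in pages to a temporary directory so the example
# needs no network access. The markup is invented for illustration.
my $dir   = tempdir(CLEANUP => 1);
my %pages = (
    'index.html' => '<html><head><title>Front</title></head><body>'
                  . '<a href="diary.html">My Diary</a></body></html>',
    'diary.html' => '<html><head><title>Diary</title></head><body>'
                  . '<form action="list.html"><select name="type">'
                  . '<option>Drama</option></select></form>'
                  . '<form action="login.html">'
                  . '<input name="email" type="text"></form>'
                  . '</body></html>',
);
for my $name (keys %pages) {
    my $path = File::Spec->catfile($dir, $name);
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $pages{$name};
    close $fh;
}

# Fetch the front page through a file:// URL.
my $agent = WWW::Mechanize->new(autocheck => 1);
my $start = URI::file->new_abs(File::Spec->catfile($dir, 'index.html'));
$agent->get($start->as_string);

# Follow the link whose text is "My Diary".
$agent->follow_link(text => 'My Diary');

# Forms are numbered from 1, so the second form on the page is number 2.
$agent->form_number(2);
$agent->field(email => 'me@example.com');

print $agent->title, ": email field holds ", $agent->value('email'), "\n";
```

The sketch stops before submitting the form, since the stand-in pages have nothing to submit to. Against a real site, a call to click would send the filled-in form.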

The site I've chosen to use for this demonstration is the BBC's Radio Times (http://www.radiotimes.beeb.com), which allows users to create a "Diary" for their favorite TV programs and tells them whenever any of the programs are showing on any channel. Being a London Perl M[ou]nger, I have an obsession with Buffy the Vampire Slayer. If I tell this to the BBC's site, they'll tell me the time and name of the next episode, so I can check if it's one I've seen previously. I'd have to remember to log into their site every few days to check if there was a new episode coming along, though. Perl to the rescue! Our script will check to see when the next episode is and let us know, along with the name of the episode being shown.

If you're going to run the script yourself, you should register with the Radio Times site (http://www.radiotimes.beeb.com/jsp/register.jsp) and create a Diary; the script requires the email you registered with. Figure 2-2 shows an example of the data we'll be scraping, which contains the Buffy episodes we'd like to be informed about.

Figure 2-2. Our Diary, configured with Buffy showings

The Code

Save the following code to a file called radiotimes.pl:

    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;
    use HTML::TokeParser;

    # Put the address you registered
    # with the Radio Times site here.
    my $email = 'your email address';
    die "Must provide an email address" unless $email ne '';

    # We create a WWW::Mechanize object and tell it the address of the site
    # we'll be working from. The Radio Times' front page has an image link
    # with an ALT text of "My Diary", so we can use that to get to the right
    # section of the site:
    my $agent = WWW::Mechanize->new();
    $agent->get("http://www.radiotimes.beeb.com/");
    $agent->follow("My Diary");

    # The returned page contains two forms - one to allow you to choose from a
    # list box of program types, and then a login form for the diary
    # function. We tell WWW::Mechanize to use the second form for input.
    # (Something to remember here is that WWW::Mechanize's list of forms,
    # unlike an array in Perl, is indexed starting at 1 rather than 0.
    # Therefore, our index is '2'.)
    $agent->form(2);

    # Now we can fill in our email address for the '<INPUT name="email"
    # type="text">' field and click the submit button. Nothing too
    # complicated here.
    $agent->field("email", $email);
    $agent->click();

    # WWW::Mechanize moves us on to our Diary page. This is the page
    # we need to process to find the date details. On looking at the
    # HTML source for this page, we can see the HTML we need to work
    # through is something like:
    #
    #  <input>
    #  <tr><td></td></tr>
    #  <tr><td></td><td></td><td class="bluetext">Date of episode</td></tr>
    #  <td></td><td></td>
    #  <td class="bluetext"><b>Time of episode</b></td></tr>
    #  <a href="page_with_episode_info"></a>
    #
    # This can be modelled with HTML::TokeParser as below. The important
    # methods to note are get_tag, which will move the stream on to the
    # next start of the tag given, and get_trimmed_text, which will take
    # the text between the current tag and a given tag. For example, for the
    # HTML code "<b>Bold text here</b>", my $tag = get_trimmed_text("/b")
    # would return "Bold text here" to $tag.

    # Also note that we're initializing HTML::TokeParser on
    # '$agent->{content}' - this is an internal variable for WWW::Mechanize,
    # exposing the HTML content of the current page. HTML::TokeParser treats
    # a plain string as a filename, so we pass a reference to the content.
    my $stream = HTML::TokeParser->new(\$agent->{content});
    my $date; # will hold the current show's datestamp.

    # <input>
    $stream->get_tag("input");

    # <tr><td></td></tr><tr>
    $stream->get_tag("tr"); $stream->get_tag("tr");

    # <td></td><td></td>
    $stream->get_tag("td"); $stream->get_tag("td");

    # <td class="bluetext">Date of episode</td></tr>
    my $tag = $stream->get_tag("td");
    if ($tag->[1]{class} and $tag->[1]{class} eq "bluetext") {
        $date = $stream->get_trimmed_text("/td");
        # The date contains '&nbsp;', which we'll translate to a space.
        $date =~ s/\xa0/ /g;
    }

    # <td></td><td></td>
    $stream->get_tag("td");

    # <td class="bluetext"><b>Time of episode</b>
    $tag = $stream->get_tag("td");
    if ($tag->[1]{class} and $tag->[1]{class} eq "bluetext") {
        $stream->get_tag("b");
        # This concatenates the time of the showing to the date.
        $date .= ", from " . $stream->get_trimmed_text("/b");
    }

    # </td></tr><a href="page_with_episode_info"></a>
    $tag = $stream->get_tag("a");

    # Match the URL to find the page giving episode information.
    $tag->[1]{href} =~ m!src=(http://.*?)'!;

    # The link text is taken to be the name of the show.
    my $show = $stream->get_trimmed_text("/a");

    # We have a scalar, $date, containing a string that looks something like
    # "Thursday 23 January, from 6:45pm to 7:30pm.", and we have a URL, in
    # $1, that will tell us more about that episode. We tell WWW::Mechanize
    # to go to the URL:
    $agent->get($1);

    # The navigation we want to perform on this page is far less complex than
    # on the last page, so we can avoid using a TokeParser for it - a regular
    # expression should suffice. The HTML we want to parse looks something
    # like this:
    #
    #  <br><b>Episode</b><br>  The Episode Title<br>
    #
    # We use a regex delimited with '!' in order to avoid having to escape the
    # slashes present in the HTML, and store the title text (which may contain
    # spaces and punctuation) that follows some whitespace, all in between
    # <br> tags after the Episode header:
    $agent->{content} =~ m!<br><b>Episode</b><br>\s+?([^<]+?)\s*<br>!;

    # $1 now contains our episode, and all that's
    # left to do is print out what we've found:
    my $episode = $1;
    print "The next episode of $show ($episode) is on $date.\n";
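The get_tag and get_trimmed_text calls can be tried out without logging in to any site. The following is a minimal sketch that parses an inline HTML string; the markup is an invented stand-in loosely shaped like the Diary rows described in the comments above, and the real page may well differ. Rather than counting tags, it loops over every td and a tag and keeps the cells whose class is bluetext, which is a different arrangement from the script's fixed sequence of get_tag calls.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Invented stand-in for one diary entry; the real page's markup may differ.
my $html = <<'HTML';
<table>
<tr><td></td><td class="bluetext">Thursday&nbsp;23&nbsp;January</td></tr>
<tr><td class="bluetext"><b>6:45pm to 7:30pm</b></td></tr>
</table>
<a href="episode.html">Buffy the Vampire Slayer</a>
HTML

# Collect the text of every "bluetext" cell, then the first link after them.
sub parse_listing {
    my ($source) = @_;

    # A reference tells HTML::TokeParser to parse the string itself
    # instead of opening a file with that name.
    my $stream = HTML::TokeParser->new(\$source)
        or die "Cannot create parser";

    my %found = (cells => []);
    while (my $tag = $stream->get_tag('td', 'a')) {
        my ($name, $attr) = @$tag;
        if ($name eq 'a') {
            $found{href} = $attr->{href};
            $found{show} = $stream->get_trimmed_text('/a');
            last;
        }
        next unless ($attr->{class} || '') eq 'bluetext';

        # Tags such as <b> inside the cell are skipped over.
        my $text = $stream->get_trimmed_text('/td');
        $text =~ s/\xa0/ /g;    # decoded &nbsp; becomes a plain space
        push @{ $found{cells} }, $text;
    }
    return \%found;
}

my $listing = parse_listing($html);
print "$listing->{show}: $listing->{cells}[0], from $listing->{cells}[1]\n";
```

Selecting cells by their class attribute tends to survive small layout changes better than counting tags, at the cost of picking up any other bluetext cells the page might contain.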

Running the Hack

Invoke the script on the command line and, with only Buffy configured in our Diary, it'll tell us what the next episode of Buffy is and when it's on:

    % perl radiotimes.pl
    The next episode of Buffy (Gone) is on Thursday 23 January, from 6:45pm to 7:30pm.

Note that even though my favorite show is Buffy, yours might be Farscape, and that's just fine; you can configure as many shows as you'd like in your Radio Times Diary, and the script will always show you what's coming next:

    % perl radiotimes.pl
    The next episode of Farscape (Crackers Don't Matter) is on Thursday 23 January, from 10:00am to 11:00am.
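A title like this one contains spaces and an apostrophe, so whatever pattern pulls it out of the episode page has to accept more than word characters. It is also worth checking that the match succeeded, because in Perl a failed match leaves $1 holding whatever an earlier match captured. The following is a minimal sketch; the episode_title helper is a hypothetical name, and the markup is assumed to have the shape described in the script's comments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the episode title, or undef if the
# expected markup is not present.
sub episode_title {
    my ($html) = @_;

    # [^<]+? captures everything up to the next tag, so spaces and
    # punctuation in the title are kept; surrounding whitespace is dropped.
    return $html =~ m!<br><b>Episode</b><br>\s*([^<]+?)\s*<br>!
        ? $1
        : undef;
}

my $page  = "<br><b>Episode</b><br>  Crackers Don't Matter<br>";
my $title = episode_title($page);
print defined $title ? "$title\n" : "no title found\n";
```

Returning undef on a failed match lets the caller decide whether to skip the listing or stop, instead of printing a stale capture.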

I hope this gives a light-hearted introduction to the usefulness of these modules; happy screen scraping, and may you never miss your favorite episodes again.

Chris Ball