Hack 21 WWW::Mechanize 101

Difficulty: moderate

While LWP::UserAgent and the rest of the LWP suite provide powerful tools for accessing and downloading web content, WWW::Mechanize can automate many of the tasks you'd otherwise have to code by hand.

Perl has great tools for handling web protocols, and LWP::UserAgent makes it easy, encapsulating the nitty-gritty details of creating HTTP::Request objects, sending the requests, parsing the HTTP::Response objects, and providing the results.

Simple fetching of web pages is, as it should be, simple. For example:

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(  );
my $response = $ua->get( "http://search.cpan.org" );
die $response->status_line unless $response->is_success;

print $response->title;
my $html = $response->content;

Behind the scenes of the get method, all the details of the HTTP protocol are hidden from view, leaving me free to think about the code itself. POSTing requests is almost as simple. To search CPAN by author for my last name, I use this:

my %fields = (
    query => 'lester',
    mode  => 'author',
);
my $response = $ua->post( "http://search.cpan.org", \%fields );

Although LWP::UserAgent makes things pretty simple when it comes to grabbing individual pages, it doesn't do much with the page itself. Once I have the results, I need to parse the page myself to handle the content.

For example, let's say I want to go through the search interface to find the CPAN home page for Andy Lester. The POST example does the searching and returns the results page, but that's not where I want to wind up. I still need to find the address pointed to by the "Andy Lester" link. Once I have the search results, how do I know which Lester author I want? I need to extract the links from the HTML, find the one whose text matches "Andy Lester", and then follow it to the next page. Maybe I don't even know what fields will be on the page, and I want to fill them in dynamically. All of this drudgery is taken care of by WWW::Mechanize.

Introducing WWW::Mechanize

WWW::Mechanize, or Mech for short, is a module that builds on the base of LWP::UserAgent and provides an easy interface for your most common web automation tasks (in fact, the first version of Mech was called WWW::Automate). While LWP::UserAgent is a pure component that makes no assumptions about how you're going to use it, Mech aims to be a miniature web browser in a single object, and it takes some liberties in the name of simplicity. For example, a Mech object keeps a history of the pages it's visited in memory and automatically supplies an HTTP Referer header.
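That history lets you step backward through the pages you've visited, much like a browser's Back button. Here's a minimal sketch using Mech's back method (the second URL is just an example):

use WWW::Mechanize;

my $mech = WWW::Mechanize->new(  );
$mech->get( "http://search.cpan.org" );
$mech->get( "http://search.cpan.org/recent" );  # Mech supplies the Referer header for us
$mech->back(  );                                # pop back to the previous page
print $mech->uri, "\n";                         # we're on the front page again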

My previous example of fetching is even simpler with Mech:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(  );
$mech->get( "http://search.cpan.org" );
die $mech->response->status_line unless $mech->success;

print $mech->title;
my $html = $mech->content; # Big text string of HTML

Now that Mech is working for me, I don't even have to deal with any HTTP::Response objects unless I specifically want to. The success method checks that the response, carried around by the $mech object, indicates a successful action. The content method returns whatever the content from the page is, and the title method returns the title for the page, if the page is HTML (which we can check with the is_html method).
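For instance, here's a small sketch that guards the HTML-only calls with is_html (the method names are Mech's; the output format is mine):

if ( $mech->is_html ) {
    print "Title: ", $mech->title, "\n";
}
else {
    print "Not HTML; got ", length( $mech->content ), " bytes of content\n";
}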

Using Mech's Navigation Tools

So far, Mech is just a couple of convenience methods. Mech really shines when it's pressed into action as a web client, extracting and following links and filling out and posting forms. Once you've successfully loaded a page, through either a GET or a POST, Mech goes to work on the HTML content. It finds all the links on the page, whether they're in an A tag as a link, or in any FRAME or IFRAME tags as page source. Mech also finds and parses the forms on the page.
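As a rough illustration of what Mech has already parsed for you, the links and forms methods return everything it found on the current page (the loop and print formats here are mine, not the module's):

# Every link on the page, whether from A, FRAME, or IFRAME tags
for my $link ( $mech->links ) {
    printf "%s => %s\n", $link->text || "(no text)", $link->url;
}

# Every form on the page, as HTML::Form objects
my @forms = $mech->forms;
print scalar @forms, " form(s) found\n";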

I'll put together all of Mech's talents into one little program that downloads all of my modules from CPAN. It will have to search for me by name, find my module listing, and then download each file to my current directory. (I could have had it go directly to my module listing, since I know my own CPAN ID, but that wouldn't show off form submission!)

The Code

Save the following code to a file called mechmod.pl :

#!/usr/bin/perl -w
use strict;
$|++;    # autoflush output so progress appears as it happens

use File::Basename;
use WWW::Mechanize 0.48;

my $mech = WWW::Mechanize->new(  );

# Get the starting search page
$mech->get( "http://search.cpan.org" );
$mech->success or die $mech->response->status_line;

# Select the form, fill the fields, and submit
$mech->form_number( 1 );
$mech->field( query => "Lester" );
$mech->field( mode => "author" );
$mech->submit(  );
$mech->success or die "post failed: ", $mech->response->status_line;

# Find the link for "Andy"
$mech->follow_link( text_regex => qr/Andy/ );
$mech->success or die "follow failed: ", $mech->response->status_line;

# Get all the tarballs
my @links = $mech->find_all_links( url_regex => qr/\.tar\.gz$/ );
my @urls = map { $_->[0] } @links;    # element 0 of each link is its URL

print "Found ", scalar @urls, " tarballs to download\n";

for my $url ( @urls ) {
    my $filename = basename( $url );
    print "$filename --> ";
    $mech->get( $url, ':content_file' => $filename );
    print -s $filename, " bytes\n";
}

Running the Hack

Invoke mechmod.pl on the command line, like so:

% perl mechmod.pl
Found 14 tarballs to download
Acme-Device-Plot-0.01.tar.gz --> 2025 bytes
Apache-Lint-0.02.tar.gz --> 2131 bytes
Apache-Pod-0.02.tar.gz --> 3148 bytes
Carp-Assert-More-0.04.tar.gz --> 4126 bytes
ConfigReader-Simple-1.16.tar.gz --> 7313 bytes
HTML-Lint-1.22.tar.gz --> 58005 bytes
...

This short introduction to the world of WWW::Mechanize should give you an idea of how simple it is to write spiders and other mechanized robots that extract content from the Web.

Andy Lester


