Hack 9 Simply Fetching with LWP::Simple


Suck web content easily using the aptly named LWP::Simple.

LWP (short for "Library for WWW in Perl") is a popular group of Perl modules for accessing data on the Web. Like most Perl module distributions, each of LWP's component modules comes with documentation that is a complete reference to its interface. However, there are so many modules in LWP that it's hard to know where to look for information on doing even the simplest things.

Introducing you to all aspects of using LWP would require a whole book, and one just so happens to exist, mind you (see Sean Burke's Perl & LWP at http://oreilly.com/catalog/perllwp/).


If you just want to access a particular URL, the simplest way to do so is to use LWP::Simple's functions. In a Perl program, you can simply call its get($url) routine, where $url is the location of the content you're interested in. LWP::Simple will try to fetch the content at the end of the URL. If it's successful, you'll be handed the content; if there's an error of some sort, the get function will return undef, the undefined value. The get represents an aptly named HTTP GET request, which reads as "get me the content at the end of this URL":

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    # Just an example: the URL for the most recent /Fresh Air/ show
    my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

    my $content = get($url);
    die "Couldn't get $url" unless defined $content;

    # Do things with $content:
    if ($content =~ m/jazz/i) {
        print "They're talking about jazz today on Fresh Air!\n";
    } else {
        print "Fresh Air is apparently jazzless today.\n";
    }
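The success and failure paths of get can be tried without touching the network, since LWP also understands file: URLs. The following is only a sketch under that assumption: the temporary file stands in for a real page, and jazz_report is a hypothetical helper name, not part of LWP::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use File::Temp qw(tempfile);
use URI::file;

# Write a small local "page" so the example runs without network access.
my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print {$fh} "Tonight: an hour of jazz.\n";
close $fh or die "Couldn't close $path: $!";

my $good_url = URI::file->new_abs($path)->as_string;
my $bad_url  = $good_url . '.missing';   # no such file, so get returns undef

# Hypothetical helper: returns 'jazz', 'no jazz', or 'fetch failed'.
sub jazz_report {
    my ($url) = @_;
    my $content = get($url);
    return 'fetch failed' unless defined $content;
    return $content =~ m/jazz/i ? 'jazz' : 'no jazz';
}

print "$good_url: ", jazz_report($good_url), "\n";
print "$bad_url: ",  jazz_report($bad_url),  "\n";
```

Checking defined rather than truth matters here: a page whose entire content is "0" or the empty string is a successful fetch, but would fail a plain boolean test.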

A handy variant of get is getprint, useful in Perl one-liners. If it can get the page whose URL you provide, it sends it straight to STDOUT; otherwise, it complains to STDERR (both are usually your screen):

    % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"
    MIRRORED.BY
    MIRRORING.FROM
    RECENT
    RECENT.html
    SITES
    SITES.html
    authors/00whois.html
    authors/01mailrc.txt.gz
    authors/id/A/AB/ABW/CHECKSUMS
    authors/id/A/AB/ABW/Pod-POM-0.17.tar.gz
    ...

The previous command fetches and prints a plain text file that lists new files added to CPAN in the past two weeks. You can easily make it part of a tidy little shell command, like this one that mails you the list of new Acme:: modules:

    % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'" \
        | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER
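Inside a longer script, getprint is also useful for its return value, which is the HTTP status code of the request. LWP::Simple exports the is_success and is_error helpers for testing such codes. The sketch below assumes a local file: URL as a stand-in for a remote listing, so that it runs offline; the file contents are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use File::Temp qw(tempfile);
use URI::file;

# A local stand-in for a remote listing, so no network is needed.
my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print {$fh} "MIRRORED.BY\nRECENT\n";
close $fh or die "Couldn't close $path: $!";
my $url = URI::file->new_abs($path)->as_string;

# getprint writes the body to STDOUT and returns the status code.
my $status = getprint($url);
warn "Fetching $url failed with status $status\n" unless is_success($status);

# For a missing document, getprint complains on STDERR and returns
# an error code instead of dying.
my $missing_status = getprint($url . '.missing');
print "Second fetch failed, as expected\n" if is_error($missing_status);
```

Because getprint never dies on its own, a script that must stop on failure has to check the returned code itself.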

There are a few other useful functions in LWP::Simple, such as head, which issues an HTTP HEAD request (surprising, eh?). A HEAD returns just the introductory bits of a response, rather than the full content returned by GET. HEAD returns a list of HTTP headers, key/value pairs that tell you more about the content in question (its size, content type, last-modified date, etc.), like so:

    % perl -MLWP::Simple -e 'print join "\n", head "http://cpan.org/RECENT"'
    text/html
    49675
    1059640198
    Apache/1.3.26 (Unix) PHP/4.2.1 mod_gzip/1.3.19.1a mod_perl/1.27

If successful, a HEAD request should return the content type (plain text, in this case), the document length (49675), modification time (1059640198 seconds since the Epoch, or July 31, 2003 at 01:29:58), content expiration date, if any, and a bit about the server itself (Apache under Unix with PHP, gzip, and mod_perl modules onboard).
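In a script, it is usually clearer to capture head's return values in named variables than to join them blindly. The following sketch assumes a local file: URL as a stand-in document so that it runs offline; for such a URL there is no expiration date or server string, which is why those two values are tested with defined.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use File::Temp qw(tempfile);
use URI::file;

# A local stand-in document, so the sketch runs without network access.
my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print {$fh} "hello, world\n";
close $fh or die "Couldn't close $path: $!";
my $url = URI::file->new_abs($path)->as_string;

# In list context, head returns these five values, in this order.
# On failure it returns an empty list, which triggers the die.
my ($type, $length, $mtime, $expires, $server) = head($url)
    or die "HEAD request for $url failed";

print "Type:     $type\n";
print "Length:   $length bytes\n";
print "Modified: ", scalar gmtime($mtime), " UTC\n";
print "Expires:  ", (defined $expires ? scalar gmtime($expires) : 'not given'), "\n";
print "Server:   ", (defined $server  ? $server : 'not given'), "\n";

# The modification time quoted in the text, converted the same way:
print scalar gmtime(1059640198), " UTC\n";   # Thu Jul 31 08:29:58 2003 UTC
```

The last line prints the time in UTC; the 01:29:58 quoted above corresponds to a local clock seven hours behind UTC.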

If you're going to use the CGI module with your LWP::Simple hacks, be sure to tell LWP::Simple not to import the head routine; otherwise, it'll conflict with a similarly named routine from CGI. To do this, use:

    use LWP::Simple qw(!head);
    use CGI qw{:standard};
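The effect of the !head import can be checked directly. This sketch deliberately leaves CGI out, on the assumption that it may not be installed everywhere, and only inspects which names LWP::Simple placed in the current package.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(!head);   # import the defaults, minus head

# get was imported as usual; head was not, leaving the name free for CGI's.
print "get imported:  ", (defined &get  ? "yes" : "no"), "\n";
print "head imported: ", (defined &head ? "yes" : "no"), "\n";

# LWP's version is still reachable under its full package name.
print "LWP::Simple::head available: ",
    (defined &LWP::Simple::head ? "yes" : "no"), "\n";
```

If a script needs both routines, calling LWP::Simple::head($url) by its full name keeps LWP's HEAD request available alongside CGI's head.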

For anything beyond the basics, you'll want to use LWP::UserAgent [Hack #10], the Swiss Army knife of LWP and of Perl's network libraries.

Sean Burke



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
