Hack 10 More Involved Requests with LWP::UserAgent


Knowing how to download web pages is great, but it doesn't help us when we want to submit forms, fake browser settings, or get more information about our request. Here, we'll jump into the more useful LWP::UserAgent.

LWP::Simple's functions [Hack #9] are handy for simple cases, but they don't support cookies or authorization; they don't support setting header lines in the HTTP request; and, generally, they don't support reading header lines in the HTTP response (most notably, the full HTTP error message, in case of problems). To get at all those features, you'll have to use the full LWP class model.
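As a rough illustration of the request-side features that LWP::Simple lacks, the sketch below configures a browser object and builds a request with a custom header line, without sending anything over the network. The agent string ExampleSpider/0.1, the 15-second timeout, and the www.example.com URL are made-up values chosen for the example; they do not come from the hack itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# A virtual browser with its own User-Agent string and timeout.
# (Both values are illustrative assumptions.)
my $browser = LWP::UserAgent->new;
$browser->agent('ExampleSpider/0.1');
$browser->timeout(15);

# Build the request by hand so that header lines can be set on it.
# The URL is a placeholder.
my $request = HTTP::Request->new( GET => 'http://www.example.com/' );
$request->header( 'Accept-Language' => 'en-US' );

# Inspect what has been set up so far; nothing has been sent yet.
print "Agent:   ", $browser->agent,   "\n";
print "Timeout: ", $browser->timeout, "\n";
print $request->as_string;

# To actually perform the request, one would call:
#   my $response = $browser->request($request);
```

Building the HTTP::Request explicitly is only one way to do this; it is used here because it makes the headers easy to inspect before anything is sent.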

While LWP consists of dozens of classes, the two that you have to understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a class for virtual browsers, which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests.

The basic idiom is $response = $browser->get($url), like so:

    #!/usr/bin/perl -w
    use strict;
    use LWP 5.64; # Loads all important LWP classes, and makes
                  # sure your version is reasonably recent.

    my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

    my $browser = LWP::UserAgent->new;
    my $response = $browser->get( $url );
    die "Can't get $url -- ", $response->status_line
       unless $response->is_success;

    die "Hey, I was expecting HTML, not ", $response->content_type
       unless $response->content_type eq 'text/html';
       # or whatever content-type you're dealing with.

    # Otherwise, process the content somehow:
    if ($response->content =~ m/jazz/i) {
        print "They're talking about jazz today on Fresh Air!\n";
    } else {
        print "Fresh Air is apparently jazzless today.\n";
    }

There are two objects involved: $browser, which holds an object of the class LWP::UserAgent, and the $response object, which is of the class HTTP::Response. You really need only one browser object per program; but every time you make a request, you get back a new HTTP::Response object, which holds some interesting attributes:

  • A status code indicating success/failure (test with $response->is_success).

  • An HTTP status line, which should be informative if there is a failure. A document not found should return a $response->status_line with something like "404 Not Found".

  • A MIME content type, such as text/html, image/gif, image/jpeg, application/xml, and so on, held in $response->content_type.

  • The actual content of the response, $response->content. If the response is HTML, that's where the HTML source will be; if it's a GIF or image of some other flavor, then $response->content will be the binary image data.

  • Dozens of other convenient and more specific methods, explained in the documentation for HTTP::Response and its superclasses, HTTP::Message and HTTP::Headers.
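The attributes listed above can be explored without touching the network by constructing an HTTP::Response by hand. The following is a minimal sketch under that assumption: the 404 status, the headers, and the HTML body are invented for the example, whereas a real response would come back from $browser->get.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;
use HTTP::Headers;

# Hand-built response standing in for what a server might return.
# (Status, headers, and body are all made up for illustration.)
my $headers  = HTTP::Headers->new(
    'Content-Type' => 'text/html; charset=utf-8',
    'Server'       => 'ExampleServer/1.0',
);
my $response = HTTP::Response->new(
    404, 'Not Found', $headers,
    '<html><body>No such page</body></html>',
);

# Success/failure test and the human-readable status line.
print "Success?     ", ( $response->is_success ? "yes" : "no" ), "\n";
print "Status line: ", $response->status_line, "\n";

# content_type returns extra parameters in list context,
# so force scalar context to get just the MIME type.
print "Type:        ", scalar( $response->content_type ), "\n";

# The body, plus one of the more specific header accessors.
print "Length:      ", length( $response->content ), " bytes\n";
print "Server:      ", $response->header('Server'), "\n";
```

Note the scalar() around content_type: in list context the method also returns any parameters such as the charset, which is easy to trip over when printing.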

Sean Burke



Spidering Hacks
ISBN: 0596005776
EAN: 2147483647
Year: 2005
Pages: 157
