Hack 10 More Involved Requests with LWP::UserAgent


Knowing how to download web pages is great, but it doesn't help us when we want to submit forms, fake browser settings, or get more information about our request. Here, we'll jump into the more useful LWP::UserAgent.

LWP::Simple's functions [Hack #9] are handy for simple cases, but they don't support cookies or authorization; they don't support setting header lines in the HTTP request; and, generally, they don't support reading header lines in the HTTP response (most notably, the full HTTP error message, in case of problems). To get at all those features, you'll have to use the full LWP class model.

While LWP consists of dozens of classes, the two that you have to understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a class for virtual browsers, which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests.

The basic idiom is $response = $browser->get($url), like so:

#!/usr/bin/perl -w
use strict;
use LWP 5.64; # Loads all important LWP classes, and makes
              # sure your version is reasonably recent.

my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url );
die "Can't get $url -- ", $response->status_line
   unless $response->is_success;

die "Hey, I was expecting HTML, not ", $response->content_type
   unless $response->content_type eq 'text/html';
   # or whatever content-type you're dealing with.

# Otherwise, process the content somehow:
if ($response->content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
} else {
    print "Fresh Air is apparently jazzless today.\n";
}

There are two objects involved: $browser, which holds an object of the class LWP::UserAgent, and the $response object, which is of the class HTTP::Response. You really need only one browser object per program, but every time you make a request, you get back a new HTTP::Response object, which holds some interesting attributes:

  • A status code indicating success/failure (test with $response->is_success).

  • An HTTP status line, which should be informative if there is a failure. A document not found should return a $response->status_line with something like "404 Not Found".

  • A MIME content type, such as text/html, image/gif, image/jpeg, application/xml, and so on, held in $response->content_type.

  • The actual content of the response, $response->content. If the response is HTML, that's where the HTML source will be; if it's a GIF or image of some other flavor, then $response->content will be the binary image data.

  • Dozens of other convenient and more specific methods, explained in the documentation for HTTP::Response and its superclasses, HTTP::Message and HTTP::Headers.
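The attributes listed above can be inspected without touching the network by constructing an HTTP::Response by hand. The following is only a sketch: the 404 status, the header, and the body are made-up stand-ins for whatever a real server might send back, and the %report hash is a hypothetical helper used purely for display.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Headers;
use HTTP::Response;

# Hand-built response standing in for what $browser->get($url) would return.
# The status, header, and body below are invented for illustration.
my $headers  = HTTP::Headers->new(
    'Content-Type' => 'text/html; charset=iso-8859-1' );
my $response = HTTP::Response->new( 404, 'Not Found', $headers,
    "<html><body>No such page.</body></html>\n" );

# Collect the attributes discussed above into a simple report.
# content_type is forced into scalar context so that only the media
# type (without parameters such as charset) is stored.
my %report = (
    success      => $response->is_success ? 'yes' : 'no',
    status_line  => $response->status_line,
    content_type => scalar $response->content_type,
    length       => length $response->content,
);

print "$_: $report{$_}\n" for sort keys %report;
```

Because the response object is the same class that get returns, the same method calls apply unchanged to a live request.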

Sean Burke

Hack 11 Adding HTTP Headers to Your Request


Add more functionality to your programs, or mimic common browsers, to circumvent server-side filtering of unknown user agents.

The most commonly used syntax for LWP::UserAgent requests is $response = $browser->get($url), but you can also add extra HTTP header lines to the request by appending a list of key/value pairs after the URL, like so:

$response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );

Why is adding HTTP headers sometimes necessary? It really depends on the site that you're pulling data from; some will respond only to actions that appear to come from common end-user browsers, such as Internet Explorer, Netscape, Mozilla, or Safari. Others, in a desperate attempt to minimize bandwidth costs, will send only compressed data [Hack #16], requiring decoding on the client end. All these client necessities can be enabled through the use of HTTP headers. For example, here's how to send more Netscape-like headers:

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    # Concatenated so that no line break ends up inside the header value.
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, '
                       . 'image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);

$response = $browser->get($url, @ns_headers);

Or, alternatively, without the interim array:

$response = $browser->get($url,
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    # Concatenated so that no line break ends up inside the header value.
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, '
                       . 'image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);
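To see what such a header list looks like on the wire without contacting a server, one option is to build only the request object. The sketch below assumes the HTTP::Request::Common module (distributed alongside LWP), whose GET function takes a URL followed by header pairs; the example.com URL is a placeholder, not a site from this hack.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Request::Common qw(GET);

my $url = 'http://www.example.com/';   # placeholder URL (assumption)

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, '
                       . 'image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);

# Build the request object only; no network traffic takes place.
my $request = GET( $url, @ns_headers );

# Dump the request line and headers for inspection.
print $request->as_string;
```

Printing the request this way is a quick check that each key/value pair became a header line before you point the same list at a real site.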

In these headers, you're telling the remote server which types of data you're willing to Accept, listed from most to least preferred: GIFs, bitmaps, JPEGs, PNGs, and then anything else (you'd rather have a GIF first, but an HTML file is fine if the server can't provide the data in your preferred formats). Strictly speaking, HTTP ranks Accept entries by optional q values rather than by position, so a server is free to ignore the order. For servers that cater to international users by offering translated documents, the Accept-Language and Accept-Charset headers give you the ability to choose what sort of native content you get back. For example, if the server offers native French translations of its resources, you can request them with 'Accept-Language' => 'fr' and 'Accept-Charset' => 'iso-8859-1'.

If you were only going to change the User-Agent, you could just modify the $browser object's default line from libwww-perl/5.65 (or the like) to whatever you wish, using LWP::UserAgent's agent method (in this case, for Netscape 4.76):

$browser->agent('Mozilla/4.76 [en] (Win98; U)');
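As a minimal sketch of that method in isolation, the following reads the default identification string, replaces it, and reads it back. No request is made, and the variable names $default_agent and $current_agent are illustrative only.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Called without arguments, agent returns the current User-Agent string,
# which by default starts with "libwww-perl/" followed by a version number.
my $default_agent = $browser->agent;

# Called with an argument, agent sets the string used for later requests.
$browser->agent('Mozilla/4.76 [en] (Win98; U)');
my $current_agent = $browser->agent;

print "Default: $default_agent\n";
print "Now:     $current_agent\n";
```

The new string stays in effect for every later request made through the same $browser object, unless a request passes its own User-Agent header.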

Here's a short list of common User-Agent strings you might wish to mimic; each passes convincingly as a common browser that the site in question may be expecting. The first is IE 5.22/Mac, the second is IE 6/Windows, and the third is an example of Mozilla 1.x:

Mozilla/4.0 (compatible; MSIE 5.22; Mac_PowerPC)
Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.3) Gecko/20030312

Some sites prefer that you arrive only from particular pages on their site or other sites. They do so by requiring a Referer (sic) header, the URL of the page you just came from. Faking a Referer is easy; simply set the header, passing it to get as a key/value pair, like so:

$response = $browser->get($url, 'Referer' => 'http://site.com/url.html');
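To confirm which headers actually leave your program, one approach is to intercept the request just before it is sent. The sketch below assumes an LWP release recent enough to provide add_handler (newer than the 5.64 required earlier); the handler returns a canned response, so nothing goes over the network, and the %seen hash and the page.html URL are hypothetical names chosen for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Response;

my $browser = LWP::UserAgent->new;
my %seen;   # headers observed on the outgoing request

# Intercept the request just before it would be sent, record two headers,
# and return a canned response so that no connection is ever opened.
$browser->add_handler( request_send => sub {
    my ($request, $ua, $handler) = @_;
    $seen{referer} = $request->header('Referer');
    $seen{agent}   = $request->header('User-Agent');
    return HTTP::Response->new( 200, 'OK' );
} );

$browser->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)');
my $response = $browser->get( 'http://site.com/page.html',   # hypothetical URL
    'Referer' => 'http://site.com/url.html' );

print "Status:  ", $response->status_line, "\n";
print "Referer: $seen{referer}\n";
print "Agent:   $seen{agent}\n";
```

Removing the handler turns this back into an ordinary request that carries the same two headers to the real server.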

This just goes to show that relying on a certain Referer or a specific User-Agent provides no security worth considering for your own site and resources.

Sean Burke and Kevin Hemenway