Hack 11 Adding HTTP Headers to Your Request

Add more functionality to your programs, or mimic common browsers, to circumvent server-side filtering of unknown user agents.

The most commonly used syntax for LWP::UserAgent requests is $response = $browser->get($url), but you can also add extra HTTP header lines to the request by appending a list of key/value pairs after the URL, like so:

 $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... ); 
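To see what those key/value pairs turn into, here is a minimal sketch that builds a request without sending it. It assumes HTTP::Request::Common (part of the same libwww-perl family) is installed; the URL and the X-Example header name are placeholders, not anything a real site expects.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

# Placeholder URL and header values, for illustration only.
my $url     = 'http://example.com/';
my $request = GET($url,
    'Accept-Language' => 'en-US',
    'X-Example'       => 'demo',    # hypothetical header name
);

# Each key/value pair becomes one "Key: value" line in the request.
print $request->as_string;
```

Printing the request this way is a handy check before pointing a spider at a live site.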

Why is adding HTTP headers sometimes necessary? It really depends on the site that you're pulling data from; some will respond only to actions that appear to come from common end-user browsers, such as Internet Explorer, Netscape, Mozilla, or Safari. Others, in a desperate attempt to minimize bandwidth costs, will send only compressed data [Hack #16], requiring decoding on the client end. All these client necessities can be enabled through the use of HTTP headers. For example, here's how to send more Netscape-like headers:

    my @ns_headers = (
        'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
        'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
        'Accept-Charset'  => 'iso-8859-1,*',
        'Accept-Language' => 'en-US',
    );
    $response = $browser->get($url, @ns_headers);

Or, alternatively, without the interim array:

    $response = $browser->get($url,
        'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
        'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
        'Accept-Charset'  => 'iso-8859-1,*',
        'Accept-Language' => 'en-US',
    );

In these headers, you're telling the remote server which types of data you're willing to Accept and in what order: GIFs, bitmaps, JPEGs, PNGs, and then anything else (you'd rather have a GIF first, but an HTML file is fine if the server can't provide the data in your preferred formats). For servers that cater to international users by offering translated documents, the Accept-Language and Accept-Charset headers give you the ability to choose what sort of native content you get back. For example, if the server offers native French translations of its resources, you can request them with 'Accept-Language' => 'fr' and 'Accept-Charset' => 'iso-8859-1'.
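If every request in a script should ask for French content, one option is to set the headers once on the browser object rather than on each call. The following is only a sketch, and it assumes an LWP::UserAgent recent enough to provide the default_header method (older releases, such as the 5.65 era, may not have it).

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Set the language and charset once; LWP adds them to each request
# this object sends unless a request supplies its own value.
$browser->default_header('Accept-Language' => 'fr');
$browser->default_header('Accept-Charset'  => 'iso-8859-1');

# Read the defaults back to confirm what will be sent.
print 'Accept-Language: ', $browser->default_header('Accept-Language'), "\n";
print 'Accept-Charset: ',  $browser->default_header('Accept-Charset'),  "\n";
```

Headers passed directly to get still take precedence for that one request, so a single English fetch can override the default.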

If you were only going to change the User-Agent, you could just modify the $browser object's default line from libwww-perl/5.65 (or the like) to whatever you wish, using LWP::UserAgent's agent method (in this case, for Netscape 4.76):

 $browser->agent('Mozilla/4.76 [en] (Win98; U)'); 
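A quick way to confirm the change took effect is to read the string back, since agent called with no arguments returns the current value. This sketch makes no network requests; the exact default version string will depend on the libwww-perl release installed.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# The default identifies the library: "libwww-perl/" plus a version.
print 'Before: ', $browser->agent, "\n";

$browser->agent('Mozilla/4.76 [en] (Win98; U)');

# Called with no arguments, agent returns the current string.
print 'After:  ', $browser->agent, "\n";
```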

Here's a short list of common User-Agents you might wish to mimic; each works well as a stand-in for a common browser that the site in question may be expecting. The first is IE 5.22/Mac, the second is IE 6/Windows, and the third is an example of Mozilla 1.x:

    Mozilla/4.0 (compatible; MSIE 5.22; Mac_PowerPC)
    Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)
    Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.3) Gecko/20030312

Some sites prefer that you arrive only from particular pages on their site or on other sites. They do so by requiring a Referer (sic) header, the URL of the page you just came from. Faking a Referer is easy; simply set the header, passing it to get as a key/value pair, like so:

 $response = $browser->get($url, 'Referer' => 'http://site.com/url.html'); 
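To check that the Referer really goes out with the request, the sketch below intercepts the request before it is sent and answers with a canned response, so nothing touches the network. It assumes an LWP::UserAgent recent enough to support add_handler with the request_send phase; the target URL is a placeholder.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

my $browser = LWP::UserAgent->new;

# Capture the outgoing request and return a stub response, which
# stops LWP from contacting any server.
my $sent;
$browser->add_handler(request_send => sub {
    my ($request) = @_;
    $sent = $request;
    return HTTP::Response->new(200, 'OK');
});

my $url      = 'http://site.com/page.html';    # placeholder target URL
my $response = $browser->get($url, 'Referer' => 'http://site.com/url.html');

print 'Referer sent: ', $sent->header('Referer'), "\n";
```

Remove the handler (or use a fresh browser object) when you want the request to reach the real site.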

It just goes to show that relying on a certain Referer or a specific User-Agent is no security worth considering for your own site and resources.

Sean Burke and Kevin Hemenway



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
