Hack 16 Respecting Your Scrapee's Bandwidth


Be a better Net citizen by reducing load on remote sites, either by ensuring you're downloading only changed content or by supporting compression.

Everybody has bills, and the more services you partake in, the higher those bills become. It's a blatantly obvious concept, but one that is easily forgotten when you're writing a scraper. See, when you're physically sitting at your computer, clicking through a site's navigation with your browser, you're an active user: sites love you and they want your traffic but, more importantly, your eyeballs.

With a spider, there are no eyeballs; you run a command line, then go watch the latest anime fansub. Behind the scenes, your spider could be making hundreds or thousands of requests. Of course, it depends on what your spider is actually trying to accomplish, but the fact remains: it's an automated process, and one that could be causing the remote site additional bandwidth costs.

It doesn't have to be this way. In this hack, we'll demonstrate three different ways you can save some bandwidth, both for the site and for yourself, by not rehandling data you've already seen. The first two methods compare metadata you've saved previously with what the server reports; the last covers compression.

If-Modified-Since

In "Adding HTTP Headers to Your Request" [Hack #11], we learned how to fake our User-Agent or add a Referer to get past certain server-side filters. HTTP headers aren't always used for subversion, though, and If-Modified-Since is a perfect example of one that isn't. The following script downloads a web page and returns the Last-Modified HTTP header, as reported by the server:

#!/usr/bin/perl -w
use strict;
use LWP 5.64;
use HTTP::Date;

my $url = 'http://disobey.com/amphetadesk/';

my $browser  = LWP::UserAgent->new;
my $response = $browser->get( $url );

print "Got: ", $response->status_line, "\n";
print "Epoch: ", $response->last_modified, "\n";
print "English: ", time2str($response->last_modified), "\n";

When run from the command line, it returns the last time the content at that URL was modified, both in seconds since the Epoch and in English:

% perl last_modified.pl
Got: 200 OK
Epoch: 1036026316
English: Thu, 31 Oct 2002 01:05:16 GMT

Not all sites will report back a Last-Modified header, however; sites whose pages are dynamically generated (by PHP, SSIs, Perl, etc.) simply won't have one. For an example, change the $url to http://disobey.com/dnn/, which uses server-side includes to load in sidebars.
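
If you do plan to remember the date, it's worth guarding against that missing header: LWP's last_modified method returns undef when there is no Last-Modified, and time2str quietly falls back to the current time if handed nothing. A small defensive check, assuming the $response and HTTP::Date import from the script above, might look like this:

 # Assumes $response and time2str() from the earlier script.
 if ( my $mod = $response->last_modified ) {
     print "Last modified: ", time2str($mod), "\n";
 }
 else {
     print "No Last-Modified header; fall back to ETags or a full download.\n";
 }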

But for those that do report Last-Modified and provide a date, what now? The first step to saving bandwidth is to remember that date. Save it to disk, database, memory, wherever; just keep track of it. The next time you request the same web page, you now have a way to determine if the page has been modified since that date. If the page hasn't changed (in essence, if the page has the same Last-Modified header), then you'll be downloading duplicate content. Notice the modification date gleaned from the previous script run and fed back to the server:

#!/usr/bin/perl -w
use strict;
use LWP 5.64;
use HTTP::Date;

my $url  = 'http://disobey.com/amphetadesk/';
my $date = "Thu, 31 Oct 2002 01:05:16 GMT";
my %headers = ( 'If-Modified-Since' => $date );

my $browser  = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
print "Got: ", $response->status_line, "\n";

Invoked again, the server returns HTTP code 304, indicating that the content has not been modified since the If-Modified-Since date it was provided:

% perl last_modified.pl
Got: 304 Not Modified

Note that even though we're still using get to request the data from the server, the content was "Not Modified" (represented by the HTTP response code 304), so nothing was actually downloaded. You've saved yourself some processing time, and you've saved the remote site some bandwidth. You're able to check whether you have new data, or whether it's unchanged, like so:

if ($response->is_success) {
    print "process new data";
}
elsif ($response->code == 304) {
    print "data not modified";
}
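
Tying this together, you need somewhere to keep the date between runs. Here's a minimal sketch that caches it in a plain text file; the filename and the one-file-per-URL scheme are assumptions made purely for illustration:

 #!/usr/bin/perl -w
 use strict;
 use LWP 5.64;
 use HTTP::Date;

 my $url        = 'http://disobey.com/amphetadesk/';
 my $cache_file = 'last_modified.txt';   # hypothetical cache location

 # Load the date saved on a previous run, if there was one.
 my %headers;
 if ( open my $fh, '<', $cache_file ) {
     my $date = <$fh>;
     close $fh;
     if ( defined $date ) {
         chomp $date;
         $headers{'If-Modified-Since'} = $date if $date;
     }
 }

 my $browser  = LWP::UserAgent->new;
 my $response = $browser->get( $url, %headers );

 if ( $response->code == 304 ) {
     print "data not modified\n";
 }
 elsif ( $response->is_success ) {
     print "process new data\n";

     # Remember the new date for the next run.
     if ( my $mod = $response->last_modified ) {
         open my $fh, '>', $cache_file or die "can't write $cache_file: $!";
         print $fh time2str($mod), "\n";
         close $fh;
     }
 }

The first run downloads the page and records its date; later runs get back nothing but 304s until the page actually changes.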

ETags

An ETag is another HTTP header with a function similar to Last-Modified and If-Modified-Since. Instead of a date, it returns a unique string based on the content you're downloading. If the string has changed, then you can assume the content is different. The chief benefit of supporting ETags is that they are often returned even when the content is dynamically generated, where the modification date is not particularly clear. Our code, assuming we've already saved an ETag from the last download, is similar to what we saw earlier. Here, we combine the getting and sending of the ETag into one script:

#!/usr/bin/perl -w
use strict;
use LWP 5.64;

my $url  = 'http://www.w3.org/';
my $etag = '"3ef89bc8;3e2eee38"';
my %headers = ( 'If-None-Match' => $etag );

my $browser  = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
print "ETag from server: " . $response->header("ETag") . "\n";
print "Got: " . $response->status_line . "\n";
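
Handling the response is the same dance as before: a 304 means your saved ETag still matches and there's nothing to fetch, while a successful response means new content and a new ETag worth saving. A quick check, assuming the $response and $etag variables from the script above:

 # Assumes $response and $etag from the ETag script above.
 if ( $response->code == 304 ) {
     print "ETag still matches; nothing new to download.\n";
 }
 elsif ( $response->is_success ) {
     $etag = $response->header("ETag");   # save this for the next run
     print "New content; remember this ETag: $etag\n";
 }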

Compressed Data

What if we could save bandwidth by reducing the size of the new data we're receiving? As with the previous HTTP headers, this is entirely dependent on what's supported by the remote server, but it also requires a little more coding to live up to our end of the bargain.

Most web servers have the ability (either natively or with a module) to take textual data (such as an HTML web page) and reduce its size with the popular gzip compression format. Often, this creates a 50-80% smaller file to be sent across the wires. Think of it as analogous to receiving a ZIP archive by mail instead of receiving the full-sized files. However, the User-Agent (i.e., you) receiving this encoded data needs to know how to decompress it and treat it as the HTML it actually is.

The first thing we need to do is tell the remote server that we can accept gzipped documents. Since these documents are encoded, we add an HTTP header (Accept-Encoding) that states we can accept that encoding. If the server, in turn, also supports the gzip-encoding scheme for the document we've requested, it'll say as much, as shown by the following script:

#!/usr/bin/perl -w
use strict;
use LWP 5.64;

my $url = 'http://www.disobey.com/';
my %headers = ( 'Accept-Encoding' => 'gzip, deflate' );

my $browser  = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
my $data = $response->content;
my $enc  = $response->content_encoding || '';

if ($enc eq "gzip" or $enc eq "deflate") {
    print "Server supports $enc, woo!\n";
}

This may look helpful, but it's really not. Simply knowing the server supports gzip doesn't get us very far, as now we have all this compressed junk in $data with no way to actually decode it. Compress::Zlib to the rescue!

#!/usr/bin/perl -w
use strict;
use Compress::Zlib;
use LWP 5.64;

my $url = 'http://www.disobey.com/';
my %headers = ( 'Accept-Encoding' => 'gzip, deflate' );

my $browser  = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
my $data = $response->content;

if (my $encoding = $response->content_encoding) {
    $data = Compress::Zlib::memGunzip($data)  if $encoding =~ /gzip/i;
    $data = Compress::Zlib::uncompress($data) if $encoding =~ /deflate/i;
}
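
If your installed LWP is more recent than the 5.64 these examples target, HTTP::Message offers a decoded_content method that undoes the Content-Encoding for you; whether you have it depends on your version, so treat this shortcut as an assumption rather than a guarantee:

 # Assumes a newer LWP than 5.64; decoded_content was added in later releases.
 my $html = $response->decoded_content;
 print "Decoded length: ", length($html), "\n" if defined $html;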

Any production-quality spider should consider implementing all the suggestions within this hack; they'll not only make remote sites happier with your scraping, but they'll also ensure that your spider operates faster, by ignoring data you've already processed.


