Hack 17 Respecting robots.txt


The robots.txt file is a bastion of fair play, allowing a site to restrict what visiting scrapers are allowed to see and do or, indeed, to keep them out entirely. Play fair by respecting its requests.

If you've ever built your own web site, you may have come across something called a robots.txt file (http://www.robotstxt.org), a magical bit of text that you, as web developer and site owner, can create to control the capabilities of third-party robots, agents, scrapers, spiders, or what have you. Here is an example of a robots.txt file that blocks any robot's access to three specific directories:

 User-agent: *
 Disallow: /cgi-bin/
 Disallow: /tmp/
 Disallow: /private/
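
The User-agent line can also name a single robot, so a site can hold one crawler to stricter rules than everyone else. A hypothetical example (the bot name SuperBot is only an illustration, matching the agent string used later in this hack):

 # Hypothetical: shut out one named bot, restrict everyone else more gently
 User-agent: SuperBot
 Disallow: /

 User-agent: *
 Disallow: /private/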

Applications that understand your robots.txt file will resolutely abstain from indexing those parts of your site, or they'll leave dejectedly if you deny them outright, as per this example:

 User-agent: *
 Disallow: /

If you're planning on releasing your scraper or spider into the wild, it's important that you make every possible attempt to support robots.txt. Its power comes solely from the number of clients that choose to respect it. Thankfully, with LWP, we can rise to the occasion quite simply.

If you want to make sure that your LWP-based program respects robots.txt, you can use the LWP::RobotUA class (http://search.cpan.org/author/GAAS/libwww-perl/lib/LWP/RobotUA.pm) instead of LWP::UserAgent. Doing so also ensures that your script doesn't make requests too many times a second, saturating the site's bandwidth unnecessarily. LWP::RobotUA is just like LWP::UserAgent, and you can use it like so:

 use LWP::RobotUA;

 # Your bot's name and your email address
 my $browser = LWP::RobotUA->new('SuperBot/1.34', 'you@site.com');

 my $response = $browser->get($url);

If the robots.txt file on $url's server forbids you from accessing $url, then the $browser object (assuming it's of the class LWP::RobotUA) won't actually request it, but instead will give you back (in $response) a 403 error with the message "Forbidden by robots.txt." Trap such an eventuality like so:

 die "$url -- ", $response->status_line, "\nAborted"     unless $response->is_success; 

Upon encountering such a resource, your script would die with:

 http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt 
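
If you'd rather log the refusal and carry on instead of dying, one hedged variation (relying on the 403 status and the "Forbidden by robots.txt" status line described above) is to test for that case explicitly:

 # A sketch: complain but don't die when a URL is blocked by robots.txt;
 # any other failure is still fatal.
 unless ($response->is_success) {
     if ($response->code == 403 and $response->status_line =~ /robots\.txt/) {
         warn "Skipping $url -- blocked by robots.txt\n";
     } else {
         die "$url -- ", $response->status_line, "\nAborted";
     }
 }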

If this $browser object sees that the last time it talked to $url's server was too recently, it will pause (via sleep) to avoid making too many requests too often. By default, it will pause for one minute, but you can control the length of the pause with the $browser->delay( minutes ) attribute.

For example, $browser->delay(7/60) means that this browser will pause when it needs to avoid talking to any given server more than once every seven seconds.
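
Putting the pieces together, here's a minimal, self-contained sketch (the bot name, contact address, and URL are placeholders; substitute your own) that fetches one page politely, honoring robots.txt and waiting at least seven seconds between requests to any given server:

 #!/usr/bin/perl -w
 use strict;
 use LWP::RobotUA;

 # Placeholder identity and target URL -- substitute your own.
 my $url     = 'http://www.example.com/index.html';
 my $browser = LWP::RobotUA->new('SuperBot/1.34', 'you@site.com');
 $browser->delay(7/60);   # at most one request every seven seconds per server

 my $response = $browser->get($url);
 die "$url -- ", $response->status_line, "\nAborted"
     unless $response->is_success;

 print $response->content;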

Sean Burke


