Hack 32 Being Adaptive to Site Redesigns


It's a typical story: you work all night long to create the perfect script to solve all your woes, and when you wake in the morning ready to run it "for real," you find the site you're scraping has changed its URLs or HTML.

It's a common fact of programming: the minute you get something perfect, someone comes along and messes with the underlying assumptions. This gets even worse when you're downloading data from the Web, because you can't verify that the information is going to be in the same format from day to day, minute to minute, or even second to second. Expecting a web site to break your script eventually is a good way to code proactively: by being prepared for the inevitable downtime, you'll be better equipped to fix it.

One of the easiest things you can do is break the important bits of your code into variables. Take the following example, which, for all intents and purposes, works just fine:

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;
 my $data = get("http://example.com");

 # /g lets each pass of the loop move on to the next match.
 while ($data =~ /weather: (.*?) horoscope: (.*?)/g) {
    my $weather = $1; my $horoscope = $2;
    if ($weather < 56) { &increase_sweat; }
    if ($horoscope eq "Taurus") { &grow_horns; }
 }

Obviously, this code exists solely as an illustration for rewriting the important bits into variables. Take a look at the following example, which does the same thing, only in twice as many lines:

 #!/usr/bin/perl
 use warnings;
 use strict;
 use LWP::Simple;

 # Everything likely to change lives here, at the top.
 my $url            = "http://example.com";
 my $weather_reg    = qr/weather: (.*?)/;
 my $horoscope_reg  = qr/horoscope: (.*?)/;
 my $weather_limit  = 56;
 my $horoscope_sign = "Taurus";

 my $data = get($url);

 # /g lets each pass of the loop move on to the next match.
 while ($data =~ /$weather_reg $horoscope_reg/g) {
    my $weather = $1; my $horoscope = $2;
    if ($weather < $weather_limit) { &increase_sweat; }
    if ($horoscope eq $horoscope_sign) { &grow_horns; }
 }

Why is code twice as long arguably "better"? Think of it this way: say you have 600 or more lines of code, and it looks similar to our first example, with all the important bits spread out amongst comments, subroutines, loops, and so on. If you want to change your horoscope or your heat tolerance, you have to search for just the right lines in a haystack of supplementary Perl. If it's been a couple of months since the last time you opened your script, there's a good chance it'll take you substantially longer as you try to figure out what to change and what to leave the same.

By placing all your eggs in the variable basket, there's only one place you need to go when you wish to modify the script. When a site changes its design, just modify the regular expression statements at the top, and the rest of the code will work as you intend. When you've been blessed with a new baby daughter, tweak the horoscope value at the beginning of the file, and you're ready to move on. No searching in the script and no spelunking through code you've forgotten the meaning of. Same with the URL: when the remote site makes it obsolete, just change the information in the first dozen lines of code.
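Of course, you first have to notice that the site has changed. One possible approach, sketched below, is to have the script complain loudly when the patterns at the top stop matching, rather than quietly doing nothing. The parse_page subroutine, its error messages, and the tighter (\d+) and (\w+) captures are all assumptions made for this illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical, tighter versions of the patterns; adjust to the real page.
my $weather_reg   = qr/weather: (\d+)/;
my $horoscope_reg = qr/horoscope: (\w+)/;

# Returns (weather, horoscope), or dies with a hint about what to fix.
sub parse_page {
    my ($data) = @_;

    # LWP::Simple's get() returns undef when the fetch fails.
    die "No data returned; has the URL changed?\n" unless defined $data;

    # A list assignment in scalar context counts the captures, so a
    # failed match yields 0 and triggers the die.
    my ($weather, $horoscope) = $data =~ /$weather_reg $horoscope_reg/
        or die "Pattern did not match; has the HTML changed?\n";

    return ($weather, $horoscope);
}

# A literal string stands in for the result of get($url).
my ($weather, $horoscope) = parse_page("weather: 72 horoscope: Taurus");
print "weather=$weather horoscope=$horoscope\n";
```

When the script dies with one of these messages, you know to head straight for the variables at the top of the file.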

Another benefit of segregating your variables from your logic is the ability to easily change where the values come from. Take a look at the following:

 my $url            = "http://example.com";
 my $weather_reg    = qr/weather: (.*?)/;
 my $horoscope_reg  = qr/horoscope: (.*?)/;
 my $weather_limit  = shift @ARGV || 56;
 my $horoscope_sign = "Taurus";

With a simple addition, we now have the ability to get our weather threshold from the command line. Running perl example.pl would keep the value as 56, but perl example.pl 83 would increase it to 83 for that particular run. Likewise, you can add command-line options for all of your variables, and you'll have only one central location to worry about editing: the top of the script.
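If you do add options for every variable, positional arguments get hard to remember. A minimal sketch using the standard Getopt::Long module follows; the option names (--url, --limit, --sign) and the parse_options subroutine are made up for this example, and the defaults are simply the values used earlier.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long qw(GetOptionsFromArray);

# Takes a list of command-line arguments and returns the settings as a
# hash. Defaults live here, in one place; any of them can be overridden.
sub parse_options {
    my @args = @_;
    my %conf = (
        url   => "http://example.com",
        limit => 56,
        sign  => "Taurus",
    );
    GetOptionsFromArray(
        \@args,
        "url=s"   => \$conf{url},     # --url takes a string
        "limit=i" => \$conf{limit},   # --limit takes an integer
        "sign=s"  => \$conf{sign},    # --sign takes a string
    ) or die "Usage: $0 [--url URL] [--limit N] [--sign SIGN]\n";
    return %conf;
}

my %conf = parse_options(@ARGV);
print "url=$conf{url} limit=$conf{limit} sign=$conf{sign}\n";
```

With this in place, perl example.pl --limit 83 would raise the threshold for that run, and perl example.pl with no arguments would keep every default.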

There are a few scraping utilities that have taken this sort of dynamic configuration to a fault; dailystrips [Hack #37], for instance, is an application for downloading hundreds of comic strips from various sites. Adding new comic strips is a simple matter of defining a new configuration block; there's no code to modify, and, if a comic strip changes URL or format, five minutes can correct it.
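To give a feel for the idea, here is a rough sketch of configuration-driven scraping. It does not reproduce the actual dailystrips configuration format; the %sites hash, its url and pattern keys, the example.com paths, and the extract subroutine are all invented for this illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical site definitions: supporting a new site means adding a
# block here, not touching the code below.
my %sites = (
    weather => {
        url     => "http://example.com/weather",
        pattern => qr/weather: (\d+)/,
    },
    horoscope => {
        url     => "http://example.com/horoscope",
        pattern => qr/horoscope: (\w+)/,
    },
);

# Applies the named site's pattern to already-downloaded page content.
# Returns the captured value, or undef if the pattern does not match.
sub extract {
    my ($name, $content) = @_;
    my $site = $sites{$name} or die "No such site: $name\n";
    return $content =~ $site->{pattern} ? $1 : undef;
}

# A literal string stands in for content fetched from each site's url.
my $page = "weather: 72 horoscope: Taurus";
for my $name (sort keys %sites) {
    my $value = extract($name, $page);
    print "$name: ", (defined $value ? $value : "no match"), "\n";
}
```

When a site changes its URL or layout, only its entry in the hash needs editing.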



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
