Hack 31 Being Warned When Things Go Wrong


When you're writing any script that operates on data you don't control, whether it comes from a database, a text file, or a resource on the Internet, it's always a good idea to add a healthy dose of error checking.

The minute you decide to operate on somebody else's data, you've opened up a can of unreliable and constantly changing worms. One day, you may get one bit of content, while the next day you'll get the same data, only in a different format, or with a new line in a place you weren't expecting. While this is one of the problems XML purports to solve, 90% of this book is based on HTML: a scourge of "clean" and semantic markup.

Because HTML can change from day to day and is often arbitrary in its format (compare <strong>Moby Dick</strong> to <book_title>Moby Dick</book_title> , for instance), and because web sites can be down one second and up the next, adding error handling to your scripts is an important stopgap to ensure that you'll get the results you're expecting. Of course, seasoned programmers of any language will yell "whooptidoo" and move on to the next hack, but for those of you who need a quick brushing up, here are some ways you can make sure your scripts are doing what you expect. Most of them are one- or two-line additions that can save a lot of sanity in the long run.

When you're downloading content from the Web:

    # using LWP::Simple
    my $data = get("http://example.com/");

    # using LWP::UserAgent
    my $ua = LWP::UserAgent->new(  );
    my $data = $ua->get("http://example.com/")->content;

check to make sure that your content was downloaded successfully:

    # using LWP::Simple
    my $data = get("http://example.com/")
        or die "No content was downloaded!\n";

    # using LWP::UserAgent
    my $ua = LWP::UserAgent->new(  );
    my $res = $ua->get("http://example.com/");
    die "No content was downloaded\n" unless $res->is_success;
    my $data = $res->content;
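The die messages above say that something failed, but not why. As a minimal sketch, a small helper can report the server's status line along with the failure. The name content_or_die is hypothetical; it assumes only that $res is an HTTP::Response, as returned by LWP::UserAgent's get():

```perl
use strict;
use warnings;

# $res is assumed to be an HTTP::Response, as returned by
# LWP::UserAgent's get(); $url is used only in the messages.
sub content_or_die {
    my ($res, $url) = @_;

    # status_line combines code and message, e.g. "404 Not Found"
    die "Couldn't fetch $url: " . $res->status_line . "\n"
        unless $res->is_success;

    # a successful response can still carry an empty body
    my $data = $res->content;
    die "Fetched $url, but the body was empty\n"
        unless defined $data && length $data;

    return $data;
}
```

You'd call it as my $data = content_or_die($ua->get($url), $url); the status line in the message usually tells you at a glance whether the site is down, the page has moved, or you've been locked out.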

If you're using ETags or If-Modified-Since headers, you can also check whether the content you're requesting is any newer than what you last saw. You can see some examples of checking the response from the server in [Hack #16].
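As a rough sketch of the ETag case (not the code from that hack), a conditional request might look like the following. The name fetch_if_changed is hypothetical, $ua is assumed to be an LWP::UserAgent, and $etag is assumed to be whatever ETag value you saved from the previous successful fetch:

```perl
use strict;
use warnings;

# $ua is assumed to be an LWP::UserAgent; $etag is the ETag saved
# from the last successful fetch (undef on the first run).
sub fetch_if_changed {
    my ($ua, $url, $etag) = @_;

    # extra arguments to get() are sent as request headers
    my @headers = defined $etag ? ('If-None-Match' => $etag) : ();
    my $res = $ua->get($url, @headers);

    # 304 Not Modified: nothing new, so return an empty list
    return if $res->code == 304;

    die "Couldn't fetch $url: " . $res->status_line . "\n"
        unless $res->is_success;

    # hand back the body plus the new ETag to save for next time
    return ($res->content, scalar $res->header('ETag'));
}
```

Called as my ($data, $new_etag) = fetch_if_changed($ua, $url, $old_etag); an undefined $data then means "unchanged since last time" rather than "something broke", since real failures still die.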

Likewise, check to make sure you have the content you expected to get, instead of just blindly processing your data in hopes that what you want is there. Make some effort to check for data you don't want, to rule out extraneous processing:

    # instead of blindly assuming:
    $data =~ /data we want: (.*)/;
    my $victory = $1;

    # rule out the negatives first:
    die if $data =~ /no matches found/;
    die if $data =~ /not authorized to view/;
    die if $data =~ /this page has moved/;

    # now, if we're this far, we can hope
    # that we've ruled out most of the bad
    # results that would waste our time.
    $data =~ /data we want: (.*)/;
    my $victory = $1;
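One further step worth sketching: $1 is only meaningful when the match that sets it actually succeeded. The helper below (the name extract_wanted and its messages are hypothetical) rules out the bad pages first, then reads $1 only inside a successful match, and says which check failed:

```perl
use strict;
use warnings;

# Example failure phrases; substitute whatever the site you're
# scraping actually prints on its error pages.
my @bad_signs = (
    qr/no matches found/i,
    qr/not authorized to view/i,
    qr/this page has moved/i,
);

sub extract_wanted {
    my ($data) = @_;

    # rule out the negatives first
    for my $bad (@bad_signs) {
        die "Unusable page (matched $bad)\n" if $data =~ $bad;
    }

    # after a failed match, $1 still holds whatever an earlier
    # successful match captured, so only read it on success
    if ($data =~ /data we want: (.*)/) {
        return $1;
    }
    die "Page looked fine, but the data we want wasn't in it\n";
}
```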

Similar checks can be made against the type of content you're receiving:

    # if you're expecting a URL to be passed
    # on the command line, make sure
    # a) you got one, and b) it's a URL.
    my $url = shift @ARGV;
    die unless defined($url);
    die unless $url =~ /^http/;

    # if you're expecting a number on the command
    # line, make sure you have a number and, alternatively,
    # check to make sure it's within a certain limit.
    my $number = shift @ARGV;
    die unless defined($number);
    die unless $number =~ /^\d+$/;
    die if $number <= 0;
    die if $number >= 19;

    # if you're using matches in a regular expression,
    # make sure that you got what you expected:
    $data =~ /temp: (\d+) humidity: (\d+) description: (.*)/;
    my ($temp, $humidity, $description) = ($1, $2, $3);
    unless ($temp && $humidity && $description) {
       die "We didn't get the data we expected!\n";
    }
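The same checks can be wrapped into small subroutines that explain what went wrong instead of dying silently. This is only a sketch: the names check_url, check_number, and parse_weather are hypothetical, and the error wording is an assumption.

```perl
use strict;
use warnings;

# Returns the URL unchanged, or dies saying what was wrong with it.
sub check_url {
    my ($url) = @_;
    die "Usage: $0 <url>\n" unless defined $url;
    die "'$url' doesn't look like an HTTP URL\n"
        unless $url =~ m{^https?://}i;
    return $url;
}

# Same idea for a number; the 1-18 range mirrors the limits above.
sub check_number {
    my ($number) = @_;
    die "No number given\n" unless defined $number;
    die "'$number' isn't a whole number\n" unless $number =~ /^\d+$/;
    die "$number is out of range (1-18)\n" if $number < 1 || $number > 18;
    return $number;
}

# In list context a match returns its captures, or an empty list on
# failure, so a failed match can't leak stale values of $1, $2, $3.
sub parse_weather {
    my ($data) = @_;
    my ($temp, $humidity, $description) =
        $data =~ /temp: (\d+) humidity: (\d+) description: (.*)/
        or die "We didn't get the data we expected!\n";
    return ($temp, $humidity, $description);
}
```

Testing the success of the match itself, rather than the truth of each captured value, also means a perfectly legitimate temperature of 0 isn't mistaken for missing data.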

Another way of checking that your script has met your expectations is to verify that it has matched all the data you want. Say you're scraping a "top 25 records" list from a popular music site; it's safe to assume that you want 25 results. By adding a simple counter, you can get warned when things have gone awry:

    # instead of this:
    while (/Ranking: (\d+) Title: (.*?) Artist: (.*?)/gism) {
       my ($ranking, $title, $artist) = ($1, $2, $3);
       next unless ($ranking && $title && $artist);
       print "$ranking, $title, $artist\n";
    }

    # add a counter and check for a total
    # (start at zero so it equals the number of records printed):
    my $counter = 0;
    while (/Ranking: (\d+) Title: (.*?) Artist: (.*?)/gism) {
       my ($ranking, $title, $artist) = ($1, $2, $3);
       next unless ($ranking && $title && $artist);
       print "$ranking, $title, $artist\n";
       $counter++;
    }
    if ($counter < 25) {
       print "Odd, we didn't get 25 records!\n";
    }
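A variation on the counter, sketched here under the assumption that each record sits on its own line in a "Ranking: N Title: ... Artist: ..." layout (a made-up format; a real chart page will need its own pattern), is to collect the records into a list and compare its size with the expected total. The name parse_chart is hypothetical:

```perl
use strict;
use warnings;

# Assumes one record per line, in the made-up format
# "Ranking: N Title: ... Artist: ...".
sub parse_chart {
    my ($html, $expected) = @_;
    my @records;

    # without /s, the final (.*) stops at the end of the line
    while ($html =~ /Ranking: (\d+) Title: (.*?) Artist: (.*)/g) {
        push @records, { ranking => $1, title => $2, artist => $3 };
    }

    # the size of the list is the count, so there is no separate
    # counter to initialize or keep in step with the loop
    warn sprintf("Odd, we expected %d records but got %d!\n",
                 $expected, scalar @records)
        if @records != $expected;

    return @records;
}
```

Using warn rather than print sends the message to STDERR, so it stays visible even when the script's normal output is redirected to a file.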

Another change you may want to implement is to decrease the timeout of your network accesses. By default, anytime you request a resource with LWP::UserAgent, it'll wait for some sort of response for 180 seconds (three minutes). For busy sites, three minutes can seem like an eternity, especially when nothing else is going on. You can reduce the number of seconds in one of two ways:

    # during the creation of the UserAgent object:
    my $ua = LWP::UserAgent->new( timeout => 30 );

    # or, after the object has been created:
    my $ua = LWP::UserAgent->new(  );
    $ua->timeout(30);

The last, and most important, early-warning system we'll talk about is the pair of warnings and strict pragmas. These are the easiest lines to add to your code; simply start your scripts with the following:

    #!/usr/bin/perl -w
    use strict;

With this simple addition, you'll be forced to write cleaner code, because Perl will resolutely balk and whine at hundreds of additional places where your code could be misunderstood. If you've never written under strict before, it may seem like an unworkable and anal beast, but after a while you'll get used to writing code that passes its quality check. Likewise, you'll scratch your head at all the "errors" and "uninitialized values" from warnings, but again, the more you work with them, the better quality of code you'll produce. All the scripts within this book use both pragmas.
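To give a feel for what these checks catch, here is a small illustration; the %record data is invented for the example. It uses the lexically scoped use warnings pragma, which does much the same job as the -w switch, and a $SIG{__WARN__} handler to collect the warnings so they can be inspected:

```perl
use strict;
use warnings;   # lexically scoped counterpart of the -w switch

# Collect warnings instead of letting them scroll past on STDERR.
my @seen;
local $SIG{__WARN__} = sub { push @seen, $_[0] };

my %record = (title => 'Moby Dick');

# 'author' was never set, so interpolating it triggers an
# "uninitialized value" warning.
my $line = "$record{title} by $record{author}";

# Under strict, a misspelled variable such as $recrod{title} would
# stop the script at compile time instead of quietly yielding nothing.

print "Caught: $_" for @seen;
```

In a scraper, an "uninitialized value" warning like this one is often the first sign that a page's layout has changed and a field you were counting on is no longer being captured.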



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
