Recipe 20.7 Finding Stale Links

20.7.1 Problem

You want to check a document for invalid links.

20.7.2 Solution

Use the technique outlined in Recipe 20.3 to extract each link, and then use LWP::Simple's head function to make sure that link exists.

20.7.3 Discussion

Example 20-5 is an applied example of the link-extraction technique. Instead of just printing the name of the link, we call LWP::Simple's head function on it. The HEAD method fetches the remote document's metainformation without downloading the whole document. If it fails, the link is bad, so we print an appropriate message.
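LWP::Simple's head function only tells you whether the request succeeded. If you also want to know why a link is bad, one option is to send the HEAD request through LWP::UserAgent and inspect the HTTP::Response it returns. The following is a minimal sketch of that idea and is not part of churl; the helper names describe_response and check_link, and the 10-second timeout, are assumptions made for illustration.

```perl
#!/usr/bin/perl
# head-status - sketch: report why a link is bad, not just that it is bad
use strict;
use warnings;
use LWP::UserAgent;

# Turn an HTTP::Response into "OK" or "BAD (<status line>)".
# (describe_response is a hypothetical helper, not an LWP function.)
sub describe_response {
    my ($response) = @_;
    return "OK" if $response->is_success;
    return "BAD (" . $response->status_line . ")";
}

# Issue a HEAD request for one URL and describe the outcome.
# (check_link is likewise a hypothetical helper.)
sub check_link {
    my ($ua, $url) = @_;
    return describe_response($ua->head($url));
}

# Check every URL given on the command line.
my $ua = LWP::UserAgent->new(timeout => 10);   # timeout is an arbitrary choice
for my $url (@ARGV) {
    print "$url: ", check_link($ua, $url), "\n";
}
```

Run it with one or more URLs as arguments. Keep in mind that some servers do not implement HEAD, so a failure status may reflect the request method rather than a missing document.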

Because this program fetches its starting document with LWP::Simple's get function, it expects a URL, not a filename. To accept either, pass the argument through the URI::Heuristic module described in Recipe 20.1.
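As a rough illustration of the URI::Heuristic approach, the sketch below converts a command-line argument into a URL string with uf_urlstr, which that module exports on request. The wrapper name to_url is hypothetical, and this is only one way to wire the module in.

```perl
#!/usr/bin/perl
# url-or-file - sketch: accept either a URL or a filename as an argument
use strict;
use warnings;
use URI::Heuristic qw(uf_urlstr);

# Return a URL string for whatever the user typed.
# (to_url is a hypothetical wrapper name used only in this sketch.)
sub to_url {
    my ($arg) = @_;
    # uf_urlstr guesses a scheme: an absolute path such as /etc/passwd
    # gets a file: prefix, and a name starting with "www." gets http://
    return uf_urlstr($arg);
}

# Show the guess for each command-line argument.
for my $arg (@ARGV) {
    print "$arg -> ", to_url($arg), "\n";
}
```

In churl, the same call could wrap the value shifted off @ARGV before it is stored in $base_url.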

Example 20-5. churl
  #!/usr/bin/perl -w
  # churl - check urls

  use HTML::LinkExtor;
  use LWP::Simple;

  $base_url = shift
      or die "usage: $0 <start_url>\n";
  $parser = HTML::LinkExtor->new(undef, $base_url);
  $html = get($base_url);
  die "Can't fetch $base_url" unless defined($html);
  $parser->parse($html);
  @links = $parser->links;
  print "$base_url: \n";
  foreach $linkarray (@links) {
      my @element  = @$linkarray;
      my $elt_type = shift @element;
      while (@element) {
          my ($attr_name , $attr_value) = splice(@element, 0, 2);
          if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
              print "  $attr_value: ", head($attr_value) ? "OK" : "BAD", "\n";
          }
      }
  }

Here's an example of a program run:

  % churl http://www.wizards.com
  http://www.wizards.com:
    FrontPage/FP_Color.gif:  OK
    FrontPage/FP_BW.gif:  BAD
    #FP_Map:  OK
    Games_Library/Welcome.html:  OK

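This program has the same limitation as the HTML::LinkExtor program in Recipe 20.3: if fetching $base_url involves a redirection, the links are resolved against the original URL rather than the URL at the end of the redirection.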
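Assuming the limitation in question is that links are resolved against the URL you typed rather than the URL reached after any redirection, one possible workaround is to fetch the page with LWP::UserAgent and hand the response's base URL to HTML::LinkExtor. This is a sketch under that assumption, not the book's code; links_in is a hypothetical helper name.

```perl
#!/usr/bin/perl
# redirect-base - sketch: resolve links against the post-redirection URL
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;

# Parse $html and return every link as an absolute URL string,
# resolved against $base. (links_in is a hypothetical helper name.)
sub links_in {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;
    my @urls;
    for my $link ($parser->links) {
        my ($tag, @pairs) = @$link;
        # each link is a tag name followed by attribute/URL pairs
        while (my ($attr, $url) = splice(@pairs, 0, 2)) {
            push @urls, "$url";
        }
    }
    return @urls;
}

my $start = shift;
if (defined $start) {
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($start);
    die "Can't fetch $start: ", $response->status_line, "\n"
        unless $response->is_success;
    # base() describes the final response, so it reflects any redirects
    print "$_\n" for links_in($response->decoded_content, $response->base);
}
```

The URLs it prints could then be passed to head exactly as churl does.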

20.7.4 See Also

The documentation for the CPAN modules HTML::LinkExtor, LWP::Simple, LWP::UserAgent, and HTTP::Response; Recipe 20.8



Perl Cookbook
Perl Cookbook, Second Edition
ISBN: 0596003137
Year: 2003
Pages: 501
