Hack 14 Handling Relative and Absolute URLs


Glean the full URL of any relative reference, such as "sample/index.html" or "../../images/flowers.gif", by using the helper functions of URI.

Occasionally, when you're parsing HTML or accepting command-line input, you'll receive a relative URL, something that looks like images/bob.jpg instead of the more specific http://www.example.com/images/bob.jpg. The longer version, called the absolute URL, is more desirable for parsing and display, as it ensures that no confusion can arise over where a resource is located.

The URI class provides all sorts of methods for accessing and modifying parts of URLs (such as asking what sort of URL it is with $url->scheme, asking which host it refers to with $url->host, and so on, as described in the docs for the URI class). However, the methods of most immediate interest are the query_form method [Hack #12] and the new_abs method, which takes a URL string that is most likely relative and returns an absolute URL, as shown here:

    use URI;
    my $abs = URI->new_abs($maybe_relative, $base);
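To make the accessors and new_abs concrete, here is a minimal offline sketch. The URLs under www.example.com and other.example.org are made up for illustration, and the base is assumed to be the address of the page the references were found on.

```perl
use strict;
use warnings;
use URI;

# Hypothetical URL; any http URL with a query string would do.
my $url = URI->new('http://www.example.com/search?q=perl&page=2');

print $url->scheme, "\n";    # http
print $url->host,   "\n";    # www.example.com
print $url->path,   "\n";    # /search

# query_form returns the query string as a flat key/value list.
my %param = $url->query_form;
print "$param{q} $param{page}\n";    # perl 2

# new_abs resolves a possibly relative reference against a base URL.
# The base here is an assumed page address, not one from the hack.
my $base = 'http://www.example.com/docs/guide/page.html';
my @refs = (
    'sample/index.html',
    '../../images/flowers.gif',
    '/images/bob.jpg',
    'http://other.example.org/x.html',    # already absolute
);
my @abs = map { URI->new_abs($_, $base)->as_string } @refs;
print "$_\n" for @abs;
```

The four references should resolve to http://www.example.com/docs/guide/sample/index.html, http://www.example.com/images/flowers.gif, http://www.example.com/images/bob.jpg, and http://other.example.org/x.html. Note that the already absolute URL passes through unchanged, which is why new_abs is safe to call on input that is only "most likely" relative.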

For example, consider the following simple program, which scrapes for URLs in the HTML list of new modules available at your local CPAN mirror:

    #!/usr/bin/perl -w
    use strict;
    use LWP 5.64;

    my $browser = LWP::UserAgent->new;
    my $url = 'http://www.cpan.org/RECENT.html';
    my $response = $browser->get($url);
    die "Can't get $url -- ", $response->status_line
      unless $response->is_success;

    # Print the captured value of every <A HREF="..."> in the page.
    my $html = $response->content;
    while ( $html =~ m/<A HREF=\"(.*?)\"/g ) {
        print "$1\n";
    }

It returns a list of relative URLs for Perl modules and other assorted files:

    % perl get_relative.pl
    MIRRORING.FROM
    RECENT
    RECENT.html
    authors/00whois.html
    authors/01mailrc.txt.gz
    authors/id/A/AA/AASSAD/CHECKSUMS
    ...

However, if you actually want to retrieve those URLs, you'll need to convert them from relative (e.g., authors/00whois.html) to absolute (e.g., http://www.cpan.org/authors/00whois.html). The URI module's new_abs method is just the ticket and requires only that you change that while loop at the end of the script, like so:

    while ( $html =~ m/<A HREF=\"(.*?)\"/g ) {
        print URI->new_abs( $1, $response->base ), "\n";
    }

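The $response->base method, defined in HTTP::Response (a subclass of HTTP::Message), returns the base URL against which relative URLs in the document should be resolved. The base URL is usually the URL you requested (here, http://www.cpan.org/RECENT.html), unless the response itself declares a different base. new_abs combines the base with the relative URL to produce the absolute one.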
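The following offline sketch illustrates what base hands to new_abs. It builds an HTTP::Response by hand instead of fetching anything, which is an assumption made only so the example runs without a network; in the hack's script the response comes from $browser->get($url).

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use URI;

# Hand-built response standing in for the result of $browser->get($url).
my $response = HTTP::Response->new(200, 'OK');
$response->request(
    HTTP::Request->new(GET => 'http://www.cpan.org/RECENT.html')
);

# With no base-setting headers present, base falls back to the
# URL of the request that produced the response.
my $base = $response->base;
print "$base\n";

# Resolution replaces the last path segment of the base (RECENT.html)
# with the relative reference.
my $abs = URI->new_abs('authors/00whois.html', $base);
print "$abs\n";
```

This should print http://www.cpan.org/RECENT.html followed by http://www.cpan.org/authors/00whois.html. Resolving against the full page address rather than just the host matters as soon as the page lives in a subdirectory, since relative references are interpreted relative to the page's own directory.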

That minor adjustment in place, the code now returns absolute URLs:

    http://www.cpan.org/MIRRORING.FROM
    http://www.cpan.org/RECENT
    http://www.cpan.org/RECENT.html
    http://www.cpan.org/authors/00whois.html
    http://www.cpan.org/authors/01mailrc.txt.gz
    http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
    ...

Of course, using a regular expression to match link references is a bit simplistic, and for more robust programs you'll probably want to use an HTML-parsing module like HTML::LinkExtor, HTML::TokeParser [Hack #20], or HTML::TreeBuilder.
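As one possible shape for the parser-based approach, here is a minimal sketch using HTML::LinkExtor. The HTML snippet is invented for illustration, and the base is assumed to be the value $response->base would return for the CPAN page; this is not the code from the hack itself.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Made-up HTML standing in for a fetched page.
my $html = join "\n",
    '<a href="authors/00whois.html">Authors</a>',
    '<A HREF="RECENT">Recent</A>',
    '<img src="images/logo.gif">';

# Assumed base; in a real script this would be $response->base.
my $base = 'http://www.cpan.org/RECENT.html';

my @links;
my $parser = HTML::LinkExtor->new(
    sub {
        my ($tag, %attr) = @_;
        # Tag and attribute names arrive lowercased; keep only <a href>.
        push @links, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    },
    $base,    # with a base supplied, extracted URLs come back absolute
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```

Unlike the case-sensitive regular expression, the parser matches both the lowercase and uppercase anchor tags, skips the image link because of the tag check, and absolutizes each URL itself, so no separate new_abs call is needed.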

Sean Burke



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
