Hack 98 Making Your Resources Scrapable with XML-RPC


If you want to make your site's information accessible to lots of aspiring spider builders, don't worry about regular expressions. Just add a little XML-RPC.

No matter how easy we make scraping, it's still scraping: a poorly supported hack meant to enable automation by working around the limitations of a resource meant for human consumption (whew!).

But what if, as a site owner, we have both the ability and desire to overcome the weaknesses of scraping? Then we can replace scraping as the method to access our web resources with an explicit programming interface. This is where web services come in, and they're easier than you might think. Well, in most cases, they're easier on developers than forcing them to dig through HTML tag soup with regular expressions, anyway.

Enter Web Services

Let's consider XML-RPC. XML-RPC (XML Remote Procedure Call) is a simple way to exploit XML and HTTP to support remote procedure calls across the Web. The process uses the basic HTTP POST method to send an XML document to a web application. This XML document names the procedure to be called in the web application and carries the arguments with which to call it. The web application then returns an XML document that contains the results of the procedure call.
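To make the wire format concrete, here is a minimal sketch of the kind of request document a client would POST. It is built by hand with core Perl purely for illustration: the helpers xml_escape and build_method_call are hypothetical names, every argument is encoded as a <string>, and the method name is the one used later in this hack. In practice, an XML-RPC package generates this document for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build an XML-RPC <methodCall> document. For simplicity, this
# hypothetical helper encodes every argument as a <string>.
sub build_method_call {
    my ($method, @args) = @_;
    my $params = '';
    for my $arg (@args) {
        $params .= '<param><value><string>'
                 . xml_escape($arg)
                 . '</string></value></param>';
    }
    return qq{<?xml version="1.0"?>\n}
         . '<methodCall>'
         . '<methodName>' . xml_escape($method) . '</methodName>'
         . "<params>$params</params>"
         . '</methodCall>';
}

# Print the document a client would send as the body of its POST.
print build_method_call('weather.getCurrentConditions',
                        'my_user', 'my_pass'), "\n";
```

The response comes back in a similar envelope, with <methodResponse> in place of <methodCall>.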

Since XML-RPC uses nothing more exotic for communication than a form submission via the POST method, most web environments can support XML-RPC. Likewise, many web applications can easily be adapted to support a programmable interface. The most complicated part of the process is handling the XML content of the messages, but most programming environments have XML-RPC support packages available. In fact, in practical usage, you rarely even notice the XML. Perl modules are available that provide support for building clients transparently using XML-RPC calls and building servers responding to XML-RPC calls using a variety of convenience methods.

Let's put it all together. XML-RPC sends an XML document that contains procedure arguments to a web application. The web application processes the arguments and returns an XML document, which a script can parse with one of several available XML-RPC packages. Still with me? Let's look at an example.

Building the service

Let's consider providing the current weather conditions from a weather information site. For this, we'll use the XMLRPC::Lite Perl module. Not only is it easy to use, but the module is also a part of the larger SOAP::Lite package (http://search.cpan.org/src/KULCHENKO/SOAP-Lite/), so we can play with the more advanced SOAP protocol later. Let's step through the service one item at a time:

    #!/usr/bin/perl -w
    # Weather via XML-RPC
    # Providing current weather conditions
    # from a weather information site.
    use strict;
    use XMLRPC::Lite;
    use XMLRPC::Transport::HTTP;

    # Set up CGI-based XMLRPC server handler, with all
    # calls dispatched to the 'weather' package.
    my $server = XMLRPC::Transport::HTTP::CGI->new()
      ->dispatch_to('weather')
      ->handle();

Thanks to the handling XMLRPC::Lite does in the background, implementing an XML-RPC method is fairly simple, idiomatic Perl. Most of the XML involved in the process is translated to and from Perl data structures automatically, so no worries there. Our method to support the retrieval of current weather conditions can look something like this:

    package weather;

    sub getCurrentConditions {
      my $pkg = shift;
      my ($user, $pass) = @_;

      # Refuse to go any further unless the login checks out.
      main::auth_user($user, $pass) or die("Invalid login");

      my ($location, $time, $temp_f, $temp_c, $humid) =
        main::get_current_conditions($user, $pass);

      return
        {
         location => $location,
         time     => $time,
         temp_f   => $temp_f,
         temp_c   => $temp_c,
         humid    => $humid
        };
    }

This implements an XML-RPC method that expects two parameters: username and password. Upon successful authentication, a data structure is returned with the current weather condition data items.
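To see roughly what the automatic translation amounts to, here is a hand-rolled sketch of how a returned hash reference could be expressed as an XML-RPC struct. The helper names xml_escape and build_struct_response are hypothetical, and every value is sent as a <string>; a real package such as XMLRPC::Lite does this work for you and may choose other value types.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Encode a hash reference as a <struct> inside a <methodResponse>.
# Hypothetical helper: all values are encoded as <string>, and the
# members are sorted by name only to keep the output predictable.
sub build_struct_response {
    my ($data) = @_;
    my $members = '';
    for my $name (sort keys %$data) {
        $members .= '<member>'
                  . '<name>' . xml_escape($name) . '</name>'
                  . '<value><string>' . xml_escape($data->{$name})
                  . '</string></value>'
                  . '</member>';
    }
    return qq{<?xml version="1.0"?>\n}
         . '<methodResponse><params><param><value><struct>'
         . $members
         . '</struct></value></param></params></methodResponse>';
}

# Two of the five fields are enough to show the shape.
print build_struct_response({ location => 'Ann Arbor', temp_f => '83' }), "\n";
```

On the client side, the package reverses the process and hands your script an ordinary hash reference.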

Making the service useful

You'll need a few more subroutines to make this example work, namely auth_user() and get_current_conditions():

    package main;

    # Authenticate users.
    sub auth_user {
      my ($user, $pass) = @_;
      return ( ($user eq 'my_user') && ($pass eq 'my_pass') );
    }

    # Look up current weather conditions.
    # Use fake values, just for example.
    sub get_current_conditions {
      my ($user, $pass) = @_;
      my ($location, $time) = ("Ann Arbor", "4:53 PM EDT on June 22, 2003");
      my ($temp_f, $temp_c) = ("83", "28");
      my ($humid)           = ("32%");
      return ($location, $time, $temp_f, $temp_c, $humid);
    }

As implemented, these subroutines don't do much: a hardcoded user account named my_user with password my_pass is accepted, and canned weather data is returned. But, after testing, these two subroutines can be replaced with code that does real authentication and looks up real data.
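As one possible next step, here is a sketch of an auth_user() that checks against a table of salted password digests instead of a hardcoded pair. The %accounts table, its salt value, and the choice of SHA-256 are assumptions made for illustration only; a real site would load accounts from its own user database and would likely prefer a dedicated password-hashing scheme.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical account table:
#   username => [ salt, SHA-256 hex digest of salt . password ]
# A real site would read this from its user database instead.
my %accounts = (
    my_user => [ 'x7Qp', sha256_hex('x7Qp' . 'my_pass') ],
);

# Return 1 if the username exists and the password matches, else 0.
sub auth_user {
    my ($user, $pass) = @_;
    return 0 unless defined $user && defined $pass;

    my $entry = $accounts{$user} or return 0;
    my ($salt, $digest) = @$entry;

    return sha256_hex($salt . $pass) eq $digest ? 1 : 0;
}

print auth_user('my_user', 'my_pass') ? "login ok\n" : "login failed\n";
```

Because it keeps the same name and arguments, this version can stand in for the canned auth_user() without touching getCurrentConditions.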

Using the service from the client side

Here's how to use XMLRPC::Lite as a client:

    use XMLRPC::Lite;

    my $data = XMLRPC::Lite
      -> proxy('http://my.weatherexample.com/service.cgi')
      -> call('weather.getCurrentConditions', 'my_user', 'my_pass')
      -> result;

    # Newer Perls require parentheses around the qw() list here.
    for my $name (qw(location time temp_f temp_c humid)) {
        print $name . ": " . $data->{$name} . "\n";
    }

There are other ways to call this service, with one of the other Perl XML-RPC packages or with an XML-RPC package from another programming environment altogether. That's the beauty of providing this as a web service: easy interoperability.

Hacking a scrape together with a service

If you've already written code to successfully scrape the current conditions from the weather site via automated browsing or regular expressions, you can drop your existing code into get_current_conditions:

    use LWP::Simple;

    sub get_current_conditions {
      my ($user, $pass) = @_;

      # Grab the desired page.
      my $content = get("http://my.weatherexample.com/current?".
                        "username=$user&password=$pass");

      # Extract weather
      # conditions, bub.
      my %data = ();
      foreach my $id (qw(location time temp_f temp_c humid)) {
        ( $data{$id} ) = ($content =~ m!<b ID="$id">(.+?)</b>!i);
      }

      # Return the conditions data in a list.
      return map { $data{$_} } qw(location time temp_f temp_c humid);
    }

This takes us back to scraping web resources, but using the results of scraping to feed a web service creates a sort of adapter to the weather site. In this case, you could offer your web service to others so that they don't have to scrape the site themselves.

However, this can be a tricky path to tread. Before building a web service as a gateway on top of someone else's web site, you should have permission to do so. At best, the site's owner might find you rude; at worst, you might find yourself looking for a lawyer.
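If you do have permission, one way to stay polite is to cache what you scrape, so that many calls to your service don't become many hits on the underlying site. The following is only a sketch: the cached_conditions wrapper, the %cache layout, and the five-minute lifetime are all assumptions, not part of the original hack.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical cache settings; tune the lifetime to how often
# the underlying data actually changes.
my $CACHE_SECONDS = 300;
my %cache;   # username => { fetched => epoch seconds, data => [ ... ] }

# Wrap any fetching subroutine so that repeated calls within the
# cache lifetime reuse the stored result instead of fetching again.
sub cached_conditions {
    my ($fetch, $user, $pass) = @_;

    my $entry = $cache{$user};
    if (!$entry || time() - $entry->{fetched} >= $CACHE_SECONDS) {
        $entry = {
            fetched => time(),
            data    => [ $fetch->($user, $pass) ],
        };
        $cache{$user} = $entry;
    }
    return @{ $entry->{data} };
}

# Stand-in fetcher with canned values; a real one would scrape.
my $fetcher = sub { return ("Ann Arbor", "4:53 PM EDT", "83", "28", "32%") };
my @conditions = cached_conditions($fetcher, 'my_user', 'my_pass');
print join(", ", @conditions), "\n";
```

Inside getCurrentConditions, you would pass a reference to the scraping subroutine, for example cached_conditions(\&main::get_current_conditions, $user, $pass).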

Hacking the Hack

If you want to build a web service with SOAP instead of XML-RPC, the SOAP::Lite package makes it easy. Simply replace XMLRPC with SOAP:

    use SOAP::Lite;
    use SOAP::Transport::HTTP;

    # Set up CGI-based SOAP server handler, with
    # all calls dispatched to the 'weather' package.
    my $server = SOAP::Transport::HTTP::CGI->new()
      ->dispatch_to('weather')
      ->handle();

There's much more to be said about SOAP as a more complicated yet more flexible cousin to XML-RPC, but an explanation of that is best left to another book.

l.m.orchard



Spidering Hacks
ISBN: 0596005776
EAN: 2147483647
Year: 2005
Pages: 157
