Hack 64 Super Author Searching


By combining multiple sites into one powerful script, you can get aggregated results that are more complete than any single site could give.

Have you ever obsessively tried to find everything written by a favorite author? Have you ever wanted to, but never found the time? Or have you never really wanted to, but think it would be neat to search across several web sites at once? Well, here's your chance.

To search for authors, let's pick a few book-related sites, such as the Library of Congress (http://www.loc.gov), Project Gutenberg (http://promo.net/pg/), and Amazon.com (http://www.amazon.com). Between these three web sites, we should be able to get a wide range of works by an author. Some will be for sale, some will be available for free download, and others will be available at a library (or the Library of Congress, at least).

Gathering Tools

Before we do anything else, let's get some tools together. We're going to use Perl for this hack, with the following modules: LWP::Simple [Hack #9], WWW::RobotRules, WWW::Mechanize [Hack #21], and HTML::Tree. These modules give us the means to navigate sites, grab content, and extract data from it, all while trying to be a good little robot that follows the rules ([Hack #17] offers guidance on using LWP::RobotUA to accomplish the same thing). It might seem like unnecessary effort, but taking a few extra steps to obey the Robots Exclusion Protocol (http://www.robotstxt.org) can go a long way toward keeping us out of trouble and keeping our access to the resources we want to gather.

Our script starts like so:

#!/usr/bin/perl -w
use strict;
use Data::Dumper qw(Dumper);
use LWP::Simple;
use WWW::RobotRules;
use WWW::Mechanize;
use HTML::Tree;

our $rules = WWW::RobotRules->new('AuthorSearchSpider/1.0');

our $amazon_affilate_id = "   your affiliate ID here   ";
our $amazon_api_key     = "   your key here   ";

my $author = $ARGV[0] || 'dumas, alexandre';

my @book_records = sort {$a->{title} cmp $b->{title}}
  (amazon_search($author), loc_gov_search($author), pg_search($author));

our %item_formats =
  (
   default => \&default_format,
   amazon  => \&amazon_format,
   loc     => \&loc_format,
   pg      => \&pg_format
  );

print html_wrapper($author,
                   join("\n", map { format_item($_) } @book_records));

So, here's the basic structure of our script. We set up a few global resources, such as a way to mind the rules that govern robot spiders and a way to access Amazon.com Web Services. Next, we attempt to get aggregate results of searches on several web sites and sort the records by title. Once we have those, we set up formatting for each type of result and produce an HTML page of the results.

Whew! Now, let's implement all the subroutines that enable all these steps. First, in order to make a few things easier later on, we're going to set up our robot rules handler and write a few convenience functions to use the handler and clean up bits of data we'll be extracting:

# Get web content, obeying robots.txt
sub get_content {
  my $url = shift;
  return ($rules->allowed($url)) ? get($url) : undef;
}

# Get web content via WWW::Mechanize, obeying robots.txt
sub get_mech {
  my $url = shift;
  if ($rules->allowed($url)) {
    my $a = WWW::Mechanize->new(  );
    $a->get($url);
    return $a;
  } else { return undef }
}

# Remove whitespace from both ends of a string
sub trim_space {
  my $val = shift;
  $val=~s/^\s+//;
  $val=~s/\s+$//g;
  return $val;
}

# Clean up a string to be used as a field name
# of alphanumeric characters and underscores.
sub clean_name {
  my $name = shift;
  $name=lc($name);
  $name=trim_space($name);
  $name=~s/[^a-z0-9 ]//g;
  $name=~s/ /_/g;
  return $name;
}
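One detail worth flagging: WWW::RobotRules only enforces rules that have been parsed into it; it doesn't fetch robots.txt on its own. If you want allowed() to have real rules to check against, a small helper along these lines can fetch and cache each host's robots.txt first. This is just a sketch, assuming the standard URI module and the LWP::Simple get() already loaded above; the ensure_robot_rules name is made up for illustration:

use URI;

# Hypothetical helper: parse a host's robots.txt into $rules
# (once per host) before asking $rules->allowed($url).
my %robots_fetched;
sub ensure_robot_rules {
  my $url = shift;
  my $uri = URI->new($url);
  my $robots_url = $uri->scheme . '://' . $uri->host_port . '/robots.txt';
  return if $robots_fetched{$robots_url}++;
  my $txt = get($robots_url);
  $rules->parse($robots_url, $txt) if defined $txt;
}

Calling ensure_robot_rules($url) at the top of get_content() and get_mech() would then give the allowed() checks something concrete to work with.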

Now that we have a start on a toolbox, let's work on searching. The idea is to build a list of results from each of our sources that can be mixed together and presented as a unified whole.

Hacking the Library of Congress

Now, let's visit the library. Right on the front page, we see a link inviting visitors to Search Our Catalogs, which leads us to a choice between a Basic Search and a Guided Search. For simplicity's sake, we'll follow the basic route.

This brings us to a simple-looking form (http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First), with options for the search text, the type of search we want, and the number of records per page. Using WWW::Mechanize, we can start our subroutine to use this form like this:

sub loc_gov_search {
  my $author = shift;

  # Submit search for author's name
  my $url = 'http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First';
  my $a = get_mech($url);
  $a->submit_form
    (
     form_number => 1,
     fields => { Search_Arg=>$author, Search_Code=>'NAME_', CNT=>70 }
    );

The first result of this search is a list of links with which to further refine our author search. So, let's try looking for links that contain the closest match to our author name:

  # Data structure for book data records
  my @hit_links = grep { $_->text() =~ /$author/i } $a->links(  );
  my @book_records = (  );
  for my $hit_link (@hit_links) {
    my $a = get_mech
      ('http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First');
    $a->submit_form
      (
       form_number => 1,
       fields => { Search_Arg=>$author, Search_Code=>'NAME_', CNT=>70 }
      );
    $a->follow_link(text=>$hit_link->text(  ));

This particular bit of code uses the link-extraction feature of WWW::Mechanize to grab link tags from the initial search results page to which we just navigated. Due to some quirk in session management in the Library of Congress search, we need to start over from the search results page, rather than simply use the back function.

Once we have each secondary author page of the search, we can extract links to publications from these pages:

    # Build a tree from the HTML
    my $tree = HTML::TreeBuilder->new(  );
    $tree->parse($a->content(  ));

    # Find the search results table: first, look for a header
    # cell containing "#", then look for the parent table tag.
    my $curr;
    ($curr) = $tree->look_down
      (_tag => 'th', sub { $_[0]->as_text(  ) eq '#' } );
    next if !$curr;
    ($curr) = $curr->look_up(_tag => 'table');
    my ($head, @rows) = $curr->look_down
      (_tag => 'tr', sub { $_[0]->parent(  ) == $curr } );

This code uses the HTML::Tree package to navigate the structure of the HTML content that makes up the search hits page. Looking at this page, we see that the actual table listing the search hits starts with a table header containing the text "#". If we look for this text, then walk back up to the containing parent, we can then extract the table's rows to get the search hits.

Once we have the rows that contain links to details pages, let's process them:

    # Extract and process the search
    # results from the results table.
    while (@rows) {

      # Take the results in row pairs; extract
      # the title and year cells from the first row.
      my ($r1, $r2) = (shift @rows, shift @rows);
      my (undef, undef, undef, undef, $td_title, $td_year, undef) =
        $r1->look_down(_tag => 'td', sub { $_[0]->parent(  ) == $r1 });

      # Get title link from the results; extract the detail URL.
      my ($a_title) = $td_title->look_down(_tag=>'a');
      my $title_url = "http://catalog.loc.gov".$a_title->attr("href");

      # Get the book detail page; follow the link to the Full record.
      $a->follow_link(url => $title_url);
      $a->follow_link(text => "Full");

Looking at this page, we see that each publication is listed as a pair of rows. The first row in each pair lists a few details of the publication, and the second row tells where to find the publication in the library. For our purposes, we're interested only in the title link in the first row, so we extract the cells of the first row of each pair and then extract the URL to the publication detail page from that.

From there, we follow the details link, which brings us to a brief description of the publication. But we're interested in more details than that, so on that details page we follow a link named "Full" to a more detailed list of information on a publication.

Finally, then, we've reached the full details page for a publication by our author. So, let's figure out how to extract the fields that describe this publication. Looking at this page, we see that the details table starts with a header containing the string "LC Control Number". So, we look for that header, then backtrack to the table that contains it:

      # Find table containing book detail data by looking
      # for table containing a header with text "LC Control Number".
      my $t2 = HTML::TreeBuilder->new(  );
      $t2->parse($a->content(  ));
      my ($c1) = $t2->look_down
        (_tag=>'th', sub { $_[0]->as_text(  ) =~ /LC Control Number/ })
          or next;
      $c1 = $c1->look_up(_tag=>"table");

After finding the table that contains the details of our publication, we can walk through the rows of the table and extract name/value pairs. First, we start building a record for this book by noting the type of the search, as well as the URL of the publication details page:

      # Now that we have the table, look
      # for the rows and extract book data.
      my %book_record = (_type => 'loc', url=>$title_url);
      my @trs = $c1->look_down(_tag=>"tr");
      for my $tr (@trs[1..$#trs]) {

        # Grab the item name and value table
        # cells; skip to next if empty.
        my ($th_name)  = $tr->look_down(_tag=>"th");
        my ($td_value) = $tr->look_down(_tag=>"td");
        next if (!$th_name) || (!$td_value);

        # Get and clean up the item name and value
        # table data; skip to next if the name is empty.
        my $name  = clean_name($th_name->as_text(  ));
        my $value = trim_space($td_value->as_text(  ));
        next if ($name eq '');

        $book_record{$name} = $value;
      }

Luckily, the table that contains information about our publication is fairly clean, with every name contained in a header cell and every value contained in a corresponding data cell in the same row. So, we walk through the rows of the details table, collecting data fields by using the convenience methods we wrote earlier.

Now, we can finish up our subroutine, doing a little cleanup on the publication title and adding the finished record to a list that we return when all our wandering through the library is done:

      ($book_record{title}, undef)
        = split(/ \//, $book_record{main_title});
      push @book_records, \%book_record;

      # Back up to the search results page.
      $a->back(); $a->back(  );
    }
  }
  return @book_records;
}

To summarize, this subroutine does the following:

  1. Performs an author search on the Library of Congress web site

  2. Follows links to author search results pages

  3. Follows publication details links on author search results pages

  4. Digs further down to full-detail records on publications

  5. Harvests data fields that describe a publication

In the end, by drilling down through several layers of search hits and details pages, we have collected a slew of records that describe publications by our author. These records are stored as a list of Perl hashes, each containing name/value pairs.

Each record also contains a value that indicates which source it was harvested from (i.e., _type=>'loc'). This will become important shortly, when we mix the results of other searches together.
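To make the shape of these records concrete, a single hash might look roughly like the following. The values here are invented for illustration; the field names follow the clean_name() convention and the fields used by the Library of Congress formatter later on:

# A hypothetical Library of Congress record, as collected above:
my %example_record = (
  _type             => 'loc',
  url               => 'http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?...',
  title             => 'The three musketeers',
  type_of_material  => 'Book',
  lc_control_number => '...',
  isbn              => '...',
);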

Perusing Project Gutenberg

Next, let's take a look at Project Gutenberg (http://promo.net/pg/). In case you've never heard of it, this is an effort to make public-domain books and publications freely available in formats usable on practically any personal computer. In the Project Gutenberg library, you can find an amazing array of materials, so our author search could benefit from a stroll through their stacks.

Wandering around the project's site, we uncover a search form (http://www.ibiblio.org/gutenberg/cgi-bin/sdb/t9.cgi/). One of the fields in this form is Author, just what we need. Our search subroutine for this site begins like this:

# Search Project Gutenberg
# for books by an author
sub pg_search {
  my $author = shift;
  my $pg_base = 'http://www.ibiblio.org/gutenberg/cgi-bin/sdb';
  my @book_records = (  );

  # Submit an author search at Project Gutenberg
  my $a1 = get_mech("$pg_base/t9.cgi/");
  $a1->submit_form
    (
     form_number => 1,
     fields => { author => $author }
    );

As it turns out, this search results page is quite simple, with a link to every result contained in a list bullet tag. So, we can write a quick set of map expressions to find the bullets and the links within them, and extract the link URLs into a list:

  # Extract all the book details
  # pages from the search results
  my $t1 = HTML::TreeBuilder->new(  );
  $t1->parse($a1->content(  ));
  my (@hit_urls) =
    map { "$pg_base/".$_->attr('href') }
      map { $_->look_down(_tag=>'a') }
        $t1->look_down(_tag=>'li');

Now that we have a list of links to publication details pages, let's chase each one down and collect the information for each book:

  # Process each book detail
  # page to extract book info
  for my $url (@hit_urls) {
    my $t2 = HTML::TreeBuilder->new(  );
    $t2->parse(get_content($url));

Luckily, these details pages also have a fairly simple and regular structure. So, we can quickly locate the table that contains the details by finding a table cell containing the word download and backtracking to its parent table:

     # Find the table of book data: look for a table
     # cell containing 'download' and find its parent table.
     my ($curr) = $t2->look_down
       (_tag=>"td",
        sub { $_[0]->as_text(  ) =~ /download/i });
     ($curr) = $curr->look_up(_tag=>"table");

Most rows of this table contain name/value pairs in data cells, with the name of the pair surrounded by <tt> tags. The names also end in a colon, so we can add that to the match for good measure:

     # Find the names of book data items: look for
     # all the <tt> tags in the table that contain ':'
     my (@hdrs) = $curr->look_down
       (_tag=>'tt',
        sub { $_[0]->as_text(  ) =~ /\:/});

After finding all the book details field names, we can visit each of them to dig out the values. For each tag that contains a name, we find its parent table row and grab the row's second column, which contains the value of the pair. So, we can start constructing a record for this book. Again, notice that we start out by identifying which source this search result was harvested from (i.e., _type=>'pg'):

     # Extract name/value data from book details page.
     my %book_record = (_type=>'pg', url=>$url);
     for my $hdr (@hdrs) {

       # Name is text of <tt> tag.
       my $name = clean_name($hdr->as_text(  ));
       next if ($name eq '');

       # Find the field value by finding the parent
       # table row, then the child table data cell.
       my ($c2) = $hdr->look_up(_tag=>'tr');
       (undef, $c2) = $c2->look_down(_tag=>'td');

Most values are simple strings, with the exception of the publication's download links. When we encounter this value, we go a step further and extract the URLs from those links. Otherwise, we just extract the text of the table data cell. Using what we've extracted, we build up the book record:

       # Extract the value. For most fields, simply use the text of the
       # table cell. For the download field, find the URLs of all links.
       my $value;
       if ($name eq 'download') {
         my (@links) = $c2->look_down
           (_tag=>"a",
            sub { $_[0]->as_text(  ) =~ /(txt|zip)/} );
         $value = [ map { $_->attr('href') } @links ];
       } else {
         $value = $c2->as_text(  );
       }

       # Store the field name and value in the record.
       $book_record{$name} = $value;
     }

Finally, we store each book record in a list and return it from our subroutine:

    push @book_records, \%book_record;
  }
  return @book_records;
}

Although simpler, this search is similar to searching the Library of Congress:

  1. Perform an author search on the Project Gutenberg web site.

  2. Follow links in the search results to find publication details pages.

  3. Harvest data fields that describe a publication.

And, like the Library of Congress search, we collect a list of Perl hashes that contain book details. Also, each record is tagged with the source of the search.

Navigating the Amazon

Our final search involves the online catalog at Amazon.com, via its Web Services API (http://www.amazon.com/webservices). This API allows developers and webmasters to integrate a wide range of Amazon.com's features into their own applications and content. But before we can do anything with the Web Services API, we need to sign up for a developer token, which allows Amazon.com to tell one consumer of its services from another. Once we have a token, we can get started using the API. First, we download the software development kit (SDK). In the documentation, we find that, among other services, the API offers simple XML-based author searches. So, we can use this service to build a search subroutine. Based on the SDK's instructions, we can start like this:

# Search for authors via
# the Amazon search API.
sub amazon_search {
  my $author = shift;

  # Construct the base URL for Amazon author searches.
  my $base_url = "http://xml.amazon.com/onca/xml3?t=$amazon_affilate_id&".
    "dev-t=$amazon_api_key&AuthorSearch=$author&".
      "mode=books&type=lite&f=xml";
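One small caveat: the author's name (which contains a comma and a space) is interpolated into the query string as-is. If the raw value ever causes trouble, escaping it first with the standard URI::Escape module is a safe tweak. A sketch (the $author_q variable is introduced here just for illustration):

use URI::Escape qw(uri_escape);

# Escape the author name so spaces and commas are URL-safe
# before it's interpolated into the query string.
my $author_q = uri_escape($author);
my $base_url = "http://xml.amazon.com/onca/xml3?t=$amazon_affilate_id&".
  "dev-t=$amazon_api_key&AuthorSearch=$author_q&".
    "mode=books&type=lite&f=xml";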

The first step is to use the XML service to submit a search query for our author. One quirk in the otherwise simple service is that results are served up only a few at a time, across a number of pages. So, we'll grab the first page and extract the total number of pages that make up our search results:

  # Get the first page of search results.
  my $content = get_content($base_url."&page=1");

  # Find the total number of search results pages to be processed.
  $content =~ m{<totalpages>(.*?)</totalpages>}mgis;
  my ($totalpages) = ($1 || '1');

Note that, in this hack, we're going for a quick-and-dirty regular expression method for extracting information from XML. Normally, we'd want to use a proper XML parser, but this approach will work well enough to get this job done for now.
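If you do want to trade the regular expressions for a real parser, a sketch using XML::Simple might look like the following. The TotalPages and Details element names here are assumptions based on the tags the regular expressions in this hack match; check the SDK documentation for the exact spelling:

use XML::Simple qw(XMLin);

# A rough sketch: parse one page of Amazon results with a real XML
# parser instead of regular expressions (element names assumed).
my $data = XMLin($content, ForceArray => [ 'Details', 'Author' ]);
my $totalpages = $data->{TotalPages} || 1;
for my $details (@{ $data->{Details} || [] }) {
  my %book_record = (
    _type => 'amazon',
    url   => $details->{url},          # the url="..." attribute of <Details>
    title => $details->{ProductName},  # assumed element name; see the SDK
  );
  # ...collect any other fields you care about here...
}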

The next step, after getting the first page of search results and extracting the total number of pages, is to grab the rest of the pages for our search query. We can do this with another quick map expression in Perl to step through all the pages and store the content in a list.

One thing to note, however, is that we wait at least one second between grabbing results pages. The company may or may not enforce this restriction, but the license for using the Amazon.com Web Services API specifies that an application should make only one request per second. So, just as we make an effort to obey the Robots Exclusion Protocol, we should try to honor this as well.

Here's how we do it:

  # Grab all pages of search results.
  my @search_pages = ($content);
  if ($totalpages > 1) {
    push @search_pages,
      map { sleep(1); get_content($base_url."&page=$_") } (2..$totalpages);
  }

Now that we have the content of all the search pages, we can extract records on the publications, just as we have in the previous two search subroutines. The biggest difference in this case, however, is that XML content is so much easier to handle than HTML tag soup. In fact, we can use some relatively simple regular expressions to process this data:

  # Extract data for all the books
  # found in the search results.
  my @book_records;
  for my $content (@search_pages) {

    # Grab the content of all <details> tags.
    while ($content=~ m{<details(?!s) url="(.*?)".*?>(.*?)</details>}mgis) {

      # Extract the URL attribute and tag body content.
      my ($url, $details_content) = ($1 || '', $2 || '');

      # Extract all the tags from the detail record, using
      # tag name as hash key and tag contents as value.
      my %book_record = (_type=>'amazon', url=>$url);
      while ($details_content =~ m{<(.*?)>(.*?)</\1>}mgis) {
        my ($name, $val) = ($1 || '', $2 || '');
        $book_record{clean_name($name)} = $val;
      }

This code uses regular expressions to extract the contents of XML tags, starting with the details tag. The search results pages contain sets of these tags, and each set contains tags that describe a publication. We use a regular expression that matches on opening and closing tags, extracting the tag name and tag data as the name and value for each field. The names of these tags are described in the SDK, but we'll just stuff them away in a book record for now.

Notice that this process is much simpler than walking through a tree built up from parsed HTML, looking for tag patterns. Things like this are usually simpler when an explicit service is provided for our use. So, we can apply a little last-minute processing (extracting lists of author subtags), finish up our book record, and wrap up our Amazon.com search subroutine:

      # Further process the authors list to extract author
      # names, and standardize on product name as title.
      my $authors = $book_record{authors} || '';
      $book_record{authors} =
        [ map { $_ } ( $authors =~ m{<author>(.*?)</author>}mgis ) ];
      $book_record{title} = $book_record{productname};

      push @book_records, \%book_record;
    }
  }
  return @book_records;
}

Compared to the previous two searches, this is the simplest of all. Since the XML provided by the Amazon.com search API is a well-defined and easily processed document, we don't have to do any of the searching and navigation that is needed to extract records from HTML.

And, as with the previous two searches, we collect a list of Perl hashes that contain book details, with each record tagged with the source of the search.

Presenting the Results

We now have three subroutines with which to search for an author's works. Each of them produces a similar set of results: a list of Perl hashes that contain book details in name/value pairs. Although each site's result records contain different sets of data, there are a few fields common to all three subroutines: _type, title, and url.

We can use these common fields to sort by title and format the results differently for each type of record. Now, we can build the parts to make the aggregate search and result formatting that we put together toward the beginning of the script. Let's start with the wrapper HTML template:

sub html_wrapper {
  my ($author, $content) = @_;
  return qq^
    <html>
      <head><title>Search results for $author</title></head>
      <body>
        <h1>Search results for $author</h1>
        <ul>$content</ul>
      </body>
    </html>
    ^;
}

This is a simple subroutine that wraps a given bit of content with the makings of an HTML page. Next, let's check out the basics of item formatting:

sub format_item {
  my $item = shift;
  return "<li>".((defined $item_formats{$item->{_type}})
    ? $item_formats{$item->{_type}}->($item)
    : $item_formats{default}->($item))."</li>";
}

sub default_format {
  my $rec = shift;
  return qq^<a href="$rec->{url}">$rec->{title}</a>^;
}

The first subroutine, format_item , uses the hash table of routines built earlier to apply formatting to items. The second subroutine, default_format , provides a simple implementation of an item format. Before we fill out implementations for the other record types, let's build a quick convenience function:

sub field_layout {
  my ($rec, $fields) = @_;
  my $out = '';
  for (my $i=0; $i<scalar(@$fields); $i+=2) {
    my ($name, $val) = ($fields->[$i+1], $rec->{$fields->[$i]});
    next if !defined $val;
    $out .= qq^<tr><th align="right">$name:</th><td>$val</td></tr>^;
  }
  return $out;
}

This function takes a record and a list of fields and descriptions in order. It returns a string that contains a set of table rows, with descriptions paired with values. We'll use this in the rest of the formatters to build tables quickly.
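For instance, a call like the following (using a couple of the field names from the Library of Congress records) would emit one table row per field that actually exists in the record, pairing each description with its value:

# Hypothetical use of field_layout(): field name, then display label.
my $rows = field_layout($rec,
  [
    'isbn'        => 'ISBN',
    'call_number' => 'Call number',
  ]);
# $rows is now a string of <tr><th>...</th><td>...</td></tr> rows,
# skipping any field the record doesn't have.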

First, we build a formatter for the Library of Congress search records. Basically, this is an incremental improvement over the default formatter. It identifies the source of this result and uses the field-layout function we just built to display a small set of common fields found in Library of Congress publication records:

sub loc_format {
  my $rec = shift;
  my $out = qq^[LoC] <a href="$rec->{url}">$rec->{title}</a><br /><br />^;
  $out .= qq^<table border="1" cellpadding="4" cellspacing="0" width="50%">^;
  $out .= field_layout
    ($rec,
      [
        'publishedcreated'  => 'Published',
        'type_of_material'  => 'Type of material',
        'description'       => 'Description',
        'dewey_class_no'    => 'Dewey class no.',
        'call_number'       => 'Call number',
        'lc_classification' => 'LoC classification',
        'lc_control_number' => 'LoC control number',
        'isbn'              => 'ISBN',
      ]
    );
  $out .= "</table><br />";
  return $out;
}

Next, we build a formatter for the Project Gutenberg records. This implementation doesn't display as many fields, but it has a special treatment of the download field in order to present the URLs as links:

sub pg_format {
  my $rec = shift;
  my $out = qq^[PG] <a href="$rec->{url}">$rec->{title}</a><br /><br />^;
  $out .= qq^<table border="1" cellpadding="4" cellspacing="0" width="50%">^;
  $out .= field_layout($rec, ['language' => 'Language']);
  $out .= qq^
    <tr><th align="right">Download:</th>
      <td>
  ^;
  for my $link (@{$rec->{download}}) {
    $out .= qq^<a href="$link">$link</a><br />^;
  }
  $out .= qq^</td></tr></table><br />^;
  return $out;
}

Finally, we build a formatter for the Amazon.com records, which has much in common with the Library of Congress record formatter. The biggest difference is that we've added the display of the publication's cover image that is available at Amazon.com:

sub amazon_format {
  my $rec = shift;
  my $out = qq^[Amazon] <a href="$rec->{url}">$rec->{title}</a><br /><br />^;
  $out .= qq^
    <table border="1" cellpadding="4" cellspacing="0" width="50%">
      <tr><th align="center" colspan="2">
        <img src="$rec->{imageurlmedium}" />
      </th></tr>
  ^;
  $out .= field_layout
    ($rec,
      [
        'releasedate'  => 'Date',
        'manufacturer' => 'Manufacturer',
        'availability' => 'Availability',
        'listprice'    => 'List price',
        'ourprice'     => "Amazon's price",
        'usedprice'    => 'Used price',
        'asin'         => 'ASIN'
      ]
    );
  $out .= "</table><br />";
  return $out;
}

Running the Hack

Now our script is complete. We have code to search for an author across several sites, we have a means of driving these searches and aggregating the results, and we have a flexible means of presenting the results of our search. The design of this script should easily lend itself to adding further sites to be searched, as well as formatters for those results. Figure 4-6 shows the default format.

Figure 4-6. Search results for "dumas, alexandre"
figs/sphk_0406.gif

This script is best used from the command line, with the results saved to a file for viewing when the process is complete. Since this is a robot that spiders across quite a few pages from several sites, it won't be unusual for it to take quite a bit of time. Also, since it generates quite a bit of traffic on the sites it visits, you'll likely want to refrain from running it very often. In particular, it's not a good idea to adapt this script as a CGI script behind a web search form.
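Assuming you've saved the script as something like authorsearch.pl (the filename here is arbitrary), a typical run redirects the output to a file for later browsing:

% perl authorsearch.pl 'dumas, alexandre' > results.html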

Hacking the Hack

Exercises left for the reader include breaking up the search results into pages to make the results friendlier to browse. Also, without too much effort, this script could be modularized and turned into a fairly flexible search robot. In any case, enjoy your new powers of author searching, and good luck in building new search robots.
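As a rough sketch of what that modularization might look like, adding a new site mostly means writing one more search subroutine that returns the same kind of hashes (with at least _type, title, and url), adding its call to the aggregate search, and registering a formatter for its _type. The other_site_search and other_format names below are hypothetical:

# Hypothetical additions, following the script's existing pattern:
my @book_records = sort { $a->{title} cmp $b->{title} }
  (amazon_search($author), loc_gov_search($author),
   pg_search($author), other_site_search($author));

our %item_formats =
  (
   default => \&default_format,
   amazon  => \&amazon_format,
   loc     => \&loc_format,
   pg      => \&pg_format,
   other   => \&other_format,   # for records tagged _type => 'other'
  );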

l.m.orchard


