Hack 65 Mapping O'Reilly Best Sellers to Library Popularity

If you're using Google to look for books in university libraries, you'll get better results using a Library of Congress call number than a plain old ISBN.

Earlier in the book, we looked at the variety of unique identifiers that can be used on a web site [Hack #7]. A number of these unique identifiers deal with books and other media.

You may one day find yourself with one identifier for a set of data but needing another set of data that uses a different identifier. That's where I found myself when I was wondering exactly how many O'Reilly books were in university libraries, compared to their best-selling status (O'Reilly publishes a weekly list of best sellers at http://www.oreilly.com/catalog/top25.html).

Now, I could just use the ISBN, which O'Reilly supplies, and try to find library holdings that way. The problem, though, is that searching for ISBNs on Google will lead you to lots of false positives: bookstores or just mentions of books, instead of actual library holdings. But we do have an alternative: searching for a book's Library of Congress (LOC) call number will eliminate most of those false positives.

But how do we get the LOC call number for each book? It's not available from O'Reilly. I found a good search interface at the Rochester Institute of Technology's library. I used the ISBNs from O'Reilly's site to look up the LOC call number at RIT's library. After I had the call number, I used Google's API to count how many times the call number appeared in Google's database.

Since the vast majority of LOC call numbers appear in Google search results from university web sites (and specifically library pages), this is a good way to gauge how popular an O'Reilly book is in university libraries versus how it ranks on O'Reilly's overall best-selling list. Are the results perfect? No; most of the search results find acquisitions lists, not catalog search results. But you can get some idea of which books are popular in libraries and which ones barely show up in libraries at all!

There's another issue with this script. LOC call numbers end with the year a book was published; for example, the call number for Mac OS X Hacks is QA76.76.O63 D67 2003, where "2003" is the publication year. In the case of Mac OS X Hacks, this is not a problem, since there's only one edition of the book. But for books like Learning Perl, where there are several editions available, searching for just the call number with the year of publication could miss libraries that simply have older editions of the book on their acquisitions lists.

To that end, this program actually takes two counts in Google using the LOC call number. In the first case, it searches for the entire number. In the second case, it searches for the number without the year at the end, giving two different results.
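The two searches described above can be sketched on their own, without touching the network. This is only an illustration: strip_year and loc_queries are hypothetical helper names that do not appear in the hack, and the sketch assumes the year is always the last four-digit group of the call number (the hack's own pattern is narrower and only recognizes years of the form 200x).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of isbn2loc.pl): drop a trailing
# four-digit year from a call number, if there is one.
sub strip_year {
    my ($callno) = @_;
    (my $base = $callno) =~ s/\s+\d{4}\s*$//;
    return $base;
}

# Build the two phrase queries: the full call number first,
# then the call number without its year (all editions).
sub loc_queries {
    my ($callno) = @_;
    return map { qq{"$_"} } ($callno, strip_year($callno));
}

my ($full, $all_editions) = loc_queries("QA76.73.P33 S34 2001");
print "$full\n";
print "$all_editions\n";
```

Wrapping each call number in double quotes matters: without the quotes, Google would treat the pieces of the call number as separate search terms rather than as one phrase.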

The Code

Save the following code to a file called isbn2loc.pl:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use SOAP::Lite;

    # All the Google information.
    my $google_key  = "your Google API key";
    my $google_wdsl = "GoogleSearch.wsdl";
    my $gsrch       = SOAP::Lite->service("file:$google_wdsl");
    my $bestsellers = get("http://www.oreilly.com/catalog/top25.html");
    die "Couldn't fetch the best-seller list.\n" unless defined $bestsellers;

    # Since we're getting a list of best sellers,
    # we don't have to scrape the rank. Instead
    # we'll just start a counter and increment
    # it every time we move to the next book.
    my $rank = 1;
    while ($bestsellers =~ m!\[<a href="(.*?)">Read it on Safari!mgis) {
       my $bookurl = $1;
       $bookurl =~ m!http://safari.oreilly.com/(\w+)!;

       # If that match failed, $1 still holds the URL captured by the
       # while condition, so a value starting with "http" is no ISBN.
       my $oraisbn = $1; next if $oraisbn =~ /^http/;

       # Here we'll search the RIT library for the book's ISBN. Notice
       # the lovely URL that allows us to get the book information.
       my $ritdata = get("http://albert.rit.edu/search/i?SEARCH=$oraisbn");
       $ritdata =~ m!field C -->&nbsp;<A HREF=.*?>(.*?)</a>!mgs;
       my $ritloc = $1; # now we've got the LOC number.

       # Might as well get the title too, eh?
       $ritdata =~ m!<STRONG>\n(.*?)</STRONG>!ms; my $booktitle = $1;

       # Check and see if the LOC code was found for the book.
       # In a few cases it won't be. If it was, keep on going.
       if ($ritloc =~ /^Q/ or $ritloc =~ /^Z/) {

          # The first search we're doing is for the entire LOC call number.
          my $results = $gsrch->doGoogleSearch($google_key, "\"$ritloc\"",
                                0, 1, "false", "", "false", "", "", "");
          my $firstcount = $results->{estimatedTotalResultsCount};

          # Now, remove the date and check for all editions.
          $ritloc =~ m!(.*?) 200\d{1}!ms; my $ritlocall = $1;
          $results = $gsrch->doGoogleSearch($google_key, "\"$ritlocall\"",
                             0, 1, "false", "", "false", "", "", "");
          my $secondcount = $results->{estimatedTotalResultsCount};

          # Now we print everything out.
          print "The book's title is $booktitle. \n";
          print "The book's O'Reilly bestseller rank is $rank.\n";
          print "The book's LOC number is $ritloc. \n";
          print "Searching for $ritloc on Google gives $firstcount results. \n";
          print "Searching for all editions on Google ($ritlocall) gives ".
                "$secondcount results.\n \n";
       }
       $rank++;
    }

Running the Hack

Unlike many of the hacks in this book, this hack has no command-line switches or options. You just run it from the command line. It visits the top 25 best-seller list, gets the ISBNs, uses the ISBNs to get the LOC call numbers from the library at RIT, and then searches Google for the LOC call numbers with and without the year of publication. Output looks like this:

    % perl isbn2loc.pl
    The book's title is Learning Perl.
    The book's O'Reilly bestseller rank is 8.
    The book's LOC number is QA76.73.P33 S34 2001.
    Searching for QA76.73.P33 S34 2001 on Google gives 0 results.
    Searching for all editions on Google (QA76.73.P33 S34) gives 9 results.

    The book's title is Running Linux.
    The book's O'Reilly bestseller rank is 13.
    The book's LOC number is QA76.76.O63 W465 2002.
    Searching for QA76.76.O63 W465 2002 on Google gives 1 results.
    Searching for all editions on Google (QA76.76.O63 W465) gives 20 results.

    The book's title is Programming Perl.
    The book's O'Reilly bestseller rank is 14.
    The book's LOC number is QA76.73.P22 W348 2000.
    Searching for QA76.73.P22 W348 2000 on Google gives 1 results.
    Searching for all editions on Google (QA76.73.P22 W348) gives 10 results.

Hacking the Hack

This is a very closed hack; its sources are hardcoded and that's that. So, the first thing I think of when I think about modifications is using different sources. O'Reilly doesn't have the only best-seller list out there, you know. You could use Amazon.com, Barnes & Noble, or some other online bookstore or book list. You could also reference your own text file full of ISBNs.
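As a rough sketch of the text-file idea, the following reads one ISBN per line and keeps only the ones whose check digit is valid. Everything here is an assumption made for illustration: the helper names valid_isbn10 and read_isbns, the one-ISBN-per-line layout with # comments, and the restriction to ten-digit ISBNs (the kind used in the ISBN-based catalog lookup above).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $isbn is a well-formed ISBN-10
# (nine digits plus a check character, 0-9 or X).
sub valid_isbn10 {
    my ($isbn) = @_;
    return 0 unless $isbn =~ /^\d{9}[\dX]$/;
    my @chars = split //, $isbn;
    my $sum = 0;
    for my $i (0 .. 9) {
        my $value = $chars[$i] eq 'X' ? 10 : $chars[$i];
        $sum += $value * (10 - $i);    # weights run from 10 down to 1
    }
    return $sum % 11 == 0 ? 1 : 0;
}

# Read ISBNs from an open filehandle, one per line. Blank lines,
# "#" comments, and ISBNs that fail the checksum are skipped.
sub read_isbns {
    my ($fh) = @_;
    my @isbns;
    while (my $line = <$fh>) {
        $line =~ s/#.*//;       # strip comments
        $line =~ s/[\s-]//g;    # strip whitespace and hyphens
        next unless length $line;
        $line = uc $line;
        if (valid_isbn10($line)) { push @isbns, $line }
        else                     { warn "Skipping invalid ISBN: $line\n" }
    }
    return @isbns;
}

# Usage (file name is up to you): perl readisbns.pl isbns.txt
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
    print "$_\n" for read_isbns($fh);
    close $fh;
}
```

Each ISBN that survives could then be dropped into the catalog lookup in place of the ones scraped from the best-seller page.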

You could also use Google's daterange: syntax to check by month and see when the new acquisitions pages are being indexed. (There are too few search results to try to search on a day-by-day basis.) Another idea is to output the results into comma-delimited format, allowing you to put the information into a spreadsheet and lay it out that way.
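Both ideas are easy to prototype before wiring them into the hack. The sketch below is a starting point under stated assumptions, not a drop-in patch: julian_day, month_query, and csv_row are hypothetical helper names, and the column order in the usage lines is my own choice. It relies on the fact that daterange: expects Julian day numbers rather than calendar dates.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Julian day number for a calendar date, which is the form
# Google's daterange: syntax expects.
sub julian_day {
    my ($year, $month, $day) = @_;
    my $epoch = timegm(0, 0, 0, $day, $month - 1, $year);
    return int($epoch / 86400) + 2440588;    # 1970-01-01 is day 2440588
}

# Hypothetical helper: a phrase query limited to one calendar month.
sub month_query {
    my ($callno, $year, $month) = @_;
    my $start = julian_day($year, $month, 1);
    my ($ny, $nm) = $month == 12 ? ($year + 1, 1) : ($year, $month + 1);
    my $end = julian_day($ny, $nm, 1) - 1;   # last day of the month
    return qq{"$callno" daterange:$start-$end};
}

# Hypothetical helper: join fields into one comma-delimited line,
# quoting any field that contains a comma, quote, or newline.
sub csv_row {
    my @fields = @_;
    for (@fields) {
        s/"/""/g;
        $_ = qq{"$_"} if /[",\n]/;
    }
    return join(',', @fields);
}

print month_query("QA76.73.P33 S34", 2003, 10), "\n";
print csv_row("Title", "Rank", "LOC number", "Full count", "All editions"), "\n";
print csv_row("Learning Perl", 8, "QA76.73.P33 S34 2001", 0, 9), "\n";
```

In the hack itself, the string returned by month_query would take the place of the quoted call number passed to doGoogleSearch, and csv_row would replace the block of print statements.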



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
