Hack 51 Spidering, Google, and Multiple Domains


When you want to search a site, you tend to go straight to the site itself and use its native capabilities. But what if you could use Google to search across many similar sites, scraping the pages of most relevance?

If you're searching for the same thing on multiple sites, it's handy to use Google's site: syntax, which allows you to restrict your search to a particular domain (e.g., perl.org) or set of domains (e.g., org). For example, if you want to search several domains for the word perl, you can string several site: restrictions together with Google's | (OR) operator, like this:

 perl ( site:oreilly.com | site:perl.com | site:mit.edu | site:yahoo.com )

You can combine this search with a Perl script to do some specific searching that you can't do with just Google and can't do easily with just Perl.

You might wonder why you'd want to involve Google at all in this search. Why not just search each domain separately via its own search form and LWP::Simple [Hack #9] or LWP::UserAgent [Hack #10]? There are a few reasons. First, each place you want to search might not have its own search engine. Second, Google supports syntaxes, such as title search, URL search, and full-word wildcard search, that the individual sites may not provide. Third, the Google API returns its search results in an array that's easy to manipulate; you don't have to use regular expressions or parsing modules to get what you want. And, of course, you'll have all your results in one nice, regular format, independent of site-specific idiosyncrasies.
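To see what that buys you, here's a minimal sketch of a Google API call (assuming you have your own key and a copy of GoogleSearch.wsdl in the current directory); each hit comes back as a ready-made hash reference in a plain Perl array:

 #!/usr/bin/perl -w
 # A minimal sketch of a Google API query; assumes your own
 # key and GoogleSearch.wsdl in the current directory.
 use strict;
 use SOAP::Lite;

 my $google_key = "your google key here";
 my $gsrch      = SOAP::Lite->service("file:GoogleSearch.wsdl");
 my $result     = $gsrch->doGoogleSearch($google_key,
                    'perl (site:oreilly.com | site:perl.com)',
                    0, 10, "false", "", "false", "lang_en", "", "");

 # Each hit is already a hash reference in a plain Perl array;
 # no HTML scraping or parsing modules required.
 foreach my $hit (@{$result->{'resultElements'}}) {
     print "$hit->{title}\n$hit->{URL}\n\n";
 }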

Example: Top 20 Searching on Google

Say you're a publisher, like O'Reilly, that is interested in finding out which universities are using your books as textbooks. You could do the search at Google itself, experimenting with keywords and limiting your search to the top-level domain edu (like syllabus o'reilly site:edu, or syllabus perl "required reading" site:edu), and you'd have some success. But you'd get far more than the maximum number of results (Google returns only 1,000 matches for a given query), and you'd also get a lot of false positives: pages that mention a book but don't provide specific course information, weblogs discussing a class, or even old news stories. It's difficult to get a list of just class results with keyword searching alone.

So, there are two overall problems to solve: restricting your search to edu still leaves the pool of potential results too broad, and it's extremely difficult to find just the right keywords to narrow it to university course pages.

This hack tries to solve both problems. First, it takes the top 20 computer science grad schools (as ranked by U.S. News & World Report) and puts their domains into an array. Then, it works through the array, searching for pages from those schools five at a time with the site: syntax, OR'd together. Each query also searches for O'Reilly * Associates (to match both O'Reilly & Associates and O'Reilly and Associates) and the word syllabus.

The last tweak goes beyond keyword searching and makes use of Perl's regular expressions. As each search result is returned, both the title and the URL are checked for the presence of a three-digit string. A three-digit string? Yup, a course number! This quick regular expression eliminates a lot of the false positives you'd get from a regular Google search. It is not something you can do through Google's interface.
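Here's that check in isolation; the first URL comes from the sample output below, and the second is an invented false positive:

 # The course-number check in miniature. The first URL matches
 # (378 follows a slash); the second, a made-up weblog page, doesn't.
 my @candidates = (
     'http://www.cs.utexas.edu/users/ygz/378-03S/course.html',
     'http://www.example.edu/weblog/entry.html',
 );
 foreach my $url (@candidates) {
     print "looks like a course page: $url\n"
         if $url =~ /http:.*?\/.*?\d{3}/;
 }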

Search results that make it over all these hurdles are saved to a file.

The Code

This hack makes use of the SOAP-based Google Web Services API. You'll need your own Google search key (http://api.google.com) and a copy of the SOAP::Lite (http://www.soaplite.com) Perl module installed. The script also expects the GoogleSearch.wsdl file in the same directory.

Save the following code to a file called textbooks.pl:

 #!/usr/bin/perl -w
 # textbooks.pl
 # Generates a list of O'Reilly books used
 # as textbooks in the top 20 universities.
 # Usage: perl textbooks.pl
 use strict;
 use SOAP::Lite;

 # all the Google information
 my $google_key  = "your google key here";
 my $google_wdsl = "GoogleSearch.wsdl";
 my $gsrch       = SOAP::Lite->service("file:$google_wdsl");

 my @toptwenty = ("site:cmu.edu", "site:mit.edu", "site:stanford.edu",
        "site:berkeley.edu", "site:uiuc.edu", "site:cornell.edu",
        "site:utexas.edu", "site:washington.edu", "site:caltech.edu",
        "site:princeton.edu", "site:wisc.edu", "site:gatech.edu",
        "site:umd.edu", "site:brown.edu", "site:ucla.edu",
        "site:umich.edu", "site:rice.edu", "site:upenn.edu",
        "site:unc.edu", "site:columbia.edu");

 my $twentycount = 0;
 open (OUT, '>top20.txt') or die "Couldn't open: $!";

 while ($twentycount < 20) {

    # our five universities, OR'd together
    # with Google's | operator.
    my $arrayquery =
       "( $toptwenty[$twentycount] | $toptwenty[$twentycount+1] | ".
       "$toptwenty[$twentycount+2] | $toptwenty[$twentycount+3] | ".
       "$toptwenty[$twentycount+4] )";

    # our search term.
    my $googlequery = "\"o'reilly * associates\" syllabus $arrayquery";
    print "Searching for $googlequery\n";

    # and do it, up to a maximum of 50 results.
    my $counter = 0;
    while ($counter < 50) {
        my $result = $gsrch->doGoogleSearch($google_key, $googlequery,
                             $counter, 10, "false", "", "false",
                             "lang_en", "", "");

        # foreach result.
        foreach my $hit (@{$result->{'resultElements'}}) {
            my $urlcheck   = $hit->{'URL'};
            my $titlecheck = $hit->{'title'};
            my $snip       = $hit->{'snippet'};

            # if the URL or title has a three-digit
            # number in it, we clean up the snippet
            # and print it out to our file.
            if ($urlcheck =~ /http:.*?\/.*?\d{3}/
                  or $titlecheck =~ /\d{3}/) {
               $snip =~ s/<b>/ /g;
               $snip =~ s/<\/b>/ /g;
               $snip =~ s/&#39;/'/g;
               $snip =~ s/&quot;/"/g;
               $snip =~ s/&amp;/&/g;
               $snip =~ s/<br>/ /g;
               print OUT "$hit->{title}\n";
               print OUT "$hit->{URL}\n";
               print OUT "$snip\n\n";
            }
        }

        # go get 10 more
        # search results.
        $counter += 10;
    }

    # our next five schools.
    $twentycount += 5;
 }

Running the Hack

Running the hack requires no switches or arguments:

 % perl textbooks.pl

The output file, top20.txt, looks something like this:

 Programming Languages and Compilers CS 164 - Spring 2002
 http://www-inst.eecs.berkeley.edu/~cs164/home.html
 ... Tentative Syllabus & Schedule of Assignments. ... you might find useful is
 "Unix in a Nutshell (System V Edition)" by Gilly, published by O'Reilly & ...

 CS378 (Spring 03): Linux Kernel Programming
 http://www.cs.utexas.edu/users/ygz/378-03S/course.html
 ... Guide, 2nd Edition By Olaf Kirch & Terry Dawson O'Reilly & Associates,
 ISBN 1-56592 ... Please visit Spring 02 homepage for information on syllabus,
 projects, and ...

 LIS 530: Organizing Information Using the Internet
 http://courses.washington.edu/lis541/syllabus-intro.html
 Efthimis N. Efthimiadis' Site LIS-541 Syllabus Main Page Syllabus - Aims &
 Objectives. ... Jennifer Niederst. O'Reilly and Associates, 1999.

 LIS415B * Spring98 * Class Schedule
 http://alexia.lis.uiuc.edu/course/spring1998/415B/lis415.spring98.schedule.html
 LIS415 (section B): Class Schedule. Spring 98. Syllabus ... In Connecting to
 the Internet: A buyer's guide. Sebastapol, California: O'Reilly & Associates.

 Implementation of Information Storage and Retrieval
 http://alexia.lis.uiuc.edu/~dubin/429/429.pdf
 ... In addition to this syllabus, this course is governed by the rules and ...
 Advanced Perl Programming, first edition (O'Reilly and Associates, Inc.,

 INET 200: HTML, Dynamic HTML, and Scripting
 http://www.outreach.washington.edu/dl/courses/inet200/
 ... such as HTML & XHTML: the Definitive Guide, 4th edition, O'Reilly and
 Associates (which I ... are assigned, and there is one on the course syllabus
 as Appendix B ...

Hacking the Hack

There are plenty of things to change in this hack. Since it uses a very specific array (the top 20 computer science grad schools), tweaking that array to your needs is the first place to start. You can make the array anything you want: different kinds of schools, your favorite or local schools, and so on. You can even break out schools by athletic conference and check them that way. You can also change the keywords to something more befitting your tastes; maybe you don't want to search for textbooks, but you'd rather find everything from chemistry labs to vegetarian groups. Change your keywords appropriately (which will probably require a little experimenting in Google before you get them just right) and go to town, as in the sketch below.
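For instance, here's a minimal sketch of that sort of change; the Big Ten domains and the keywords are just examples to swap for your own:

 # A swapped-in array and query: five Big Ten schools,
 # hunting for vegetarian student groups instead of textbooks.
 my @bigten = ("site:umich.edu", "site:osu.edu", "site:psu.edu",
               "site:wisc.edu", "site:umn.edu");
 my $arrayquery  = "( " . join(" | ", @bigten) . " )";
 my $googlequery = "vegetarian \"student group\" $arrayquery";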

And don't forget, you're also running a regular expression check on each result before you save it to the file. Maybe you don't want to do a three-digit check on the title and URL. Maybe you want to check instead for the string lib, either by itself or as part of the word library:

 ($urlcheck =~ /http:.*?\/.*?lib/) or ($titlecheck =~ /lib/)

This particular search will find materials in a school library's web pages, for the most part, or in web pages that mention the word "library" in the title.

If you've read Google Hacks (http://www.oreilly.com/catalog/googlehks/), you might remember that Google offers wildcards for full-word searches, but not for stemming. In other words, you can search for three * mice and get three blind mice , three blue mice , three green mice , and so on. But you can't plug the query moon* into Google and get moons , moonlight , moonglow , and so on. When you use Perl to perform these checks, you are expanding the kind of searching possible with Google.
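A quick Perl illustration of the difference; $snip here is just a stand-in for a snippet returned by the API, as in the main script:

 # Google's wildcard can't stem moon* into moons, moonlight,
 # moonglow; a Perl check on the returned snippet can.
 my $snip = "By the light of the moonglow ...";  # stand-in snippet
 if ($snip =~ /\bmoon\w*/i) {
     print "stemmed match: $snip\n";
 }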


