Hack 85 Aggregating Multiple Search Engine Results


Even though Google may solve all your searching needs on a daily basis, there may come a time when you need a "super search": something that queries multiple search engines or databases at once.

Google is still the gold standard for search engines and still arguably the most popular search spot on the Web. But after years of stagnation, the search engine wars are firing up again. AlltheWeb.com (http://www.alltheweb.com) in particular is working hard to offer new search syntax, a larger web index (over 3.2 billion URLs at the time of this writing), and additional interface options. If you want to keep up with searching on the Web, it behooves you to try search engines other than Google, if only to get an idea of how the other engines are evolving.

This hack builds a meta-search engine, querying several search engines in turn and displaying the aggregated results. Actually, it can query more than just search engines; it can request data from anything to which you can submit a search request. It does so by using a set of plug-ins, each of which knows the details of a particular search engine or site's search request syntax and the format of its results, to perform the search and return the results. The main script, then, does nothing more than farm out the request to these plug-ins and let them perform their magic. This is an exercise in hacking together a client/server protocol. The protocol I use is simple: each plug-in needs to return URL and text pairs. How do we delimit one from the other? By finding a character that's illegal in URLs, such as the common tab, and using that to separate our data.

The protocol runs as follows (a minimal plug-in demonstrating the exchange appears just after this list):

  1. The server starts up a plug-in as an executable program, with the search terms as command-line parameters.

  2. The client (the plug-in) responds by printing one result per line: a URL, a tab, then the result's text.

  3. The server receives the data, formats it a little before printing, and then moves on to the next available plug-in.
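
To make the exchange concrete, here's a minimal plug-in of my own devising (hypothetical, not part of the hack proper). It ignores its search terms and emits two canned results in the expected URL-tab-text format:

 #!/usr/bin/perl -w
 # plugins/static - a do-nothing demonstration plug-in. It ignores
 # the search terms and emits two hardcoded URL-tab-text lines.
 use strict;

 print "http://www.example.com/one\tFirst canned result\n";
 print "http://www.example.com/two\tSecond canned result\n";

Drop it into the plugins directory, make it executable, and the server treats it exactly like a real search engine.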

Note that because we have a simple call and response pattern, the plug-ins can query anything, including your own local databases with Perl's DBI, Python scripts that grok FTP servers, or PHP concoctions that do reverse lookups on phone numbers. As long as the plug-in returns the data in URL-tab-text format, what it does and how it's programmed don't matter.
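
As a sketch of that flexibility, the following hypothetical plug-in searches a local SQLite database of bookmarks through Perl's DBI; the bookmarks.db file and its links table exist only for illustration:

 #!/usr/bin/perl -w
 # plugins/bookmarks - a hypothetical plug-in that searches a local
 # SQLite database of bookmarks rather than a remote search engine.
 use strict;
 use DBI;

 my $terms = join " ", @ARGV;

 # bookmarks.db and its links(url, title) table are illustrative only.
 my $dbh = DBI->connect("dbi:SQLite:dbname=bookmarks.db", "", "",
                        { RaiseError => 1 });
 my $sth = $dbh->prepare("SELECT url, title FROM links WHERE title LIKE ?");
 $sth->execute("%$terms%");

 # emit each hit in the URL-tab-text format the server expects.
 while (my ($url, $title) = $sth->fetchrow_array) {
     print "$url\t$title\n";
 }
 $dbh->disconnect;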

The Code

The following short piece of code demonstrates the server portion, which looks for a ./plugins directory and runs each executable it finds within:

 #!/usr/bin/perl -w
 # aggsearch - aggregate searching engine
 #
 # This file is distributed under the same licence as Perl itself.
 #
 # by rik - ora@rikrose.net

 ######################
 # support stage      #
 ######################

 use strict;

 # change this, if necessary.
 my $pluginDir = "plugins";

 # if the user didn't enter any search terms, yell at 'em.
 unless (@ARGV) {
     print 'usage: aggsearch "search terms"', "\n";
     exit;
 }

 # this routine actually executes the current
 # plug-in, receives the tabbed data, and sticks
 # it into a result array for future printing.
 sub query {
     my ($plugin, $args, @results) = (shift, shift);
     my $command = $pluginDir . "/" . $plugin . " " . (join " ", @$args);
     open RESULTS, "$command |" or die "Plugin $plugin failed!\n";
     while (<RESULTS>) {
         chomp; # remove newline.
         my ($url, $name) = split /\t/;
         push @results, [$name, $url];
     }
     close RESULTS;
     return @results;
 }

 ######################
 # find plug-ins stage #
 ######################

 opendir PLUGINS, $pluginDir
    or die "Plugin directory \"$pluginDir\" " .
       "not found! Please create, and populate\n";

 # keep executable, non-directory entries that
 # aren't editor backup files.
 my @plugins = grep {
     stat $pluginDir . "/$_";
     -x _ && ! -d _ && ! /~$/;
 } readdir PLUGINS;
 closedir PLUGINS;

 ######################
 # query stage        #
 ######################

 for my $plugin (@plugins) {
     print "$plugin results:\n";
     my @results = query $plugin, \@ARGV;
     for my $listref (@results) {
         print " $listref->[0] : $listref->[1]\n";
     }
     print "\n";
 }

 exit 0;
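
Because each plug-in is nothing more than an executable that writes to standard output, you can sanity-check one by hand before the server ever sees it. Testing the freshmeat plug-in (shown next) directly might look something like this, with a tab separating each URL from its name:

 % chmod +x plugins/freshmeat
 % ./plugins/freshmeat spidering
 http://freshmeat.net/redir/housespider/28546/url_homepage/    HouseSpider
 http://freshmeat.net/redir/phpdig/15340/url_homepage/    PhpDig
 ...etc...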

The plug-ins themselves are even smaller than the server code, since their only purpose is to return a tab-delimited set of results. Our first sample looks through the freshmeat.net (http://freshmeat.net) software site:

 #!/usr/bin/perl -w
 # Example freshmeat searching plug-in
 #
 # This file is distributed under the same licence as Perl itself.
 #
 # by rik - ora@rikrose.net

 use strict;
 use LWP::UserAgent;
 use HTML::TokeParser;

 # create the URL from our incoming query.
 my $url = "http://freshmeat.net/search-xml?q=" . join "+", @ARGV;

 # download the data.
 my $ua = LWP::UserAgent->new( );
 $ua->agent('Mozilla/5.0');
 my $response = $ua->get($url);
 die $response->status_line . "\n"
   unless $response->is_success;

 # parse each match element, printing a URL-tab-name line.
 my $content = $response->content;
 my $stream = HTML::TokeParser->new(\$content)
   or die "Couldn't parse response\n";
 while (my $tag = $stream->get_tag("match")) {
     $tag = $stream->get_tag("projectname_full");
     my $name = $stream->get_trimmed_text("/projectname_full");
     $tag = $stream->get_tag("url_homepage");
     my $url = $stream->get_trimmed_text("/url_homepage");
     print "$url\t$name\n";
 }

Our second sample uses the Google API:

 #!/usr/bin/perl -w
 # Example Google searching plug-in

 use strict;
 use warnings;
 use SOAP::Lite;

 # all the Google information
 my $google_key  = "your API key here";
 my $google_wdsl = "GoogleSearch.wsdl";
 my $gsrch       = SOAP::Lite->service("file:$google_wdsl");
 my $query       = join "+", @ARGV;

 # do the search... (the arguments are: key, query, start,
 # maxResults, filter, restrict, safeSearch, lr, ie, oe.)
 my $result = $gsrch->doGoogleSearch($google_key, $query,
                           1, 10, "false", "",  "false",
                           "lang_en", "", "");

 # and print the results.
 foreach my $hit (@{$result->{'resultElements'}}) {
     print "$hit->{URL}\t$hit->{title}\n";
 }

Our last example covers AlltheWeb.com:

 #!/usr/bin/perl -w
 # Example alltheweb searching plug-in
 #
 # This file is distributed under the same licence as Perl itself.
 #
 # by rik - ora@rikrose.net

 use strict;
 use LWP::UserAgent;
 use HTML::TokeParser;

 # create the URL from our incoming query.
 my $url = "http://www.alltheweb.com/search?cat=web&cs=iso-8859-1" .
           "&q=" . (join "+", @ARGV) . "&_sb_lang=en";

 # download the data.
 my $ua = LWP::UserAgent->new( );
 $ua->agent('Mozilla/5.0');
 my $response = $ua->get($url);
 die $response->status_line . "\n"
   unless $response->is_success;

 # walk the result page, pulling the link from each paragraph.
 my $content = $response->content;
 my $stream = HTML::TokeParser->new(\$content)
   or die "Couldn't parse response\n";
 while (my $tag = $stream->get_tag("p")) {
     $tag = $stream->get_tag("a");
     my $name = $stream->get_trimmed_text("/a");
     last if $name eq "last 10 queries";
     my $url = $tag->[1]{href};
     print "$url\t$name\n";
 }
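
All three plug-ins repeat the same LWP download boilerplate. If you end up writing many more, you might factor that stanza into a tiny shared module; this Plugin::Fetch package is a hypothetical sketch, not part of the hack:

 # Plugin/Fetch.pm - hypothetical shared download helper.
 package Plugin::Fetch;
 use strict;
 use LWP::UserAgent;
 use base 'Exporter';
 our @EXPORT_OK = qw(fetch_content);

 # fetch a URL, dying with the HTTP status line on failure,
 # and return the page body as a string.
 sub fetch_content {
     my ($url) = @_;
     my $ua = LWP::UserAgent->new;
     $ua->agent('Mozilla/5.0');
     my $response = $ua->get($url);
     die $response->status_line . "\n" unless $response->is_success;
     return $response->content;
 }

 1;

A plug-in could then say use Plugin::Fetch qw(fetch_content); and grab its page in a single call.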

Running the Hack

Invoke the script from the command line, like so:

 % perl aggsearch.pl spidering
 alltheweb results:
  Google is now better at spidering dynamic sites. : [long url here]
  Submitting sites to search engines : [long url here]
  WebcamCrawler.com : [long url here]
  ...etc...

 freshmeat results:
  HouseSpider : http://freshmeat.net/redir/housespider/28546/url_homepage/
  PhpDig : http://freshmeat.net/redir/phpdig/15340/url_homepage/
  ...etc...

 google results:
  What is Spidering? : http://www.1afm.com/optimization/spidering.html
  SWISH-Enhanced Manual: Spidering : http://swish-e.org/Manual/spidering.html
  ...etc...

Combining data from many sources gives you more scope for working out trends in the information, a technique commonly known as data mining.
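
As a first taste of that, here's a speculative variation on the server's query loop that ranks URLs by how many plug-ins returned them, so pages vouched for by several engines float to the top:

 #!/usr/bin/perl -w
 # rank-overlap - a speculative twist on aggsearch: run every plug-in
 # for the same query and rank URLs by how many plug-ins returned them.
 use strict;

 my (%count, %title);
 for my $plugin (glob "plugins/*") {
     next unless -x $plugin && ! -d $plugin;
     open my $results, "$plugin @ARGV |" or next;
     while (<$results>) {
         chomp;
         my ($url, $name) = split /\t/;
         $count{$url}++;
         $title{$url} = $name;
     }
     close $results;
 }

 # URLs vouched for by the most engines print first.
 for my $url (sort { $count{$b} <=> $count{$a} } keys %count) {
     print "$count{$url} $title{$url} : $url\n";
 }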

Richard Rose


