Hack 50 Weblog-Free Google Results


With so many weblogs being indexed by Google, you might worry about too much emphasis on the hot topic of the moment. In this hack, we'll show you how to remove the weblog factor from your Google results.

Weblogs (those frequently updated, link-heavy personal pages) are quite the fashionable thing these days. There are at least 400,000 active weblogs across the Internet, covering almost every possible subject and interest. For humans, they're good reading, but for search engines they are heavenly bundles of fresh content and links galore.

But some people think the search engine's delight in weblogs is slanting their search results and giving too much emphasis to too small a group of recent rather than evergreen content. As I write, for example, I am the third most important Ben on the Internet, according to Google. This rank comes solely from my weblog's popularity.

This hack searches Google, discarding any results coming from weblogs. It uses the Google Web Services API (http://api.google.com) and the API of Technorati (http://www.technorati.com/members), an excellent interface to David Sifry's weblog data-tracking tool [Hack #70]. Both APIs require keys, available from the URLs mentioned.

Finally, you'll need a simple HTML page with a form that passes a text query to the parameter q (the query that will run on Google), something like this:

    <form action="googletech.cgi" method="POST">
    Your query: <input type="text" name="q">
    <input type="submit" name="Search!" value="Search!">
    </form>

The Code

You'll need the XML::Simple and SOAP::Lite Perl modules. Save the following code to a file called googletech.cgi:

    #!/usr/bin/perl -w
    # googletech.cgi
    # Getting Google results
    # without getting weblog results.
    use strict;
    use SOAP::Lite;
    use XML::Simple;
    use CGI qw(:standard);
    use HTML::Entities ();
    use LWP::Simple qw(!head);

    my $technoratikey = "your technorati key here";
    my $googlekey     = "your google key here";

    # Set up the query term
    # from the CGI input.
    my $query = param("q");

    # Initialize the SOAP interface and run the Google search.
    # The fourth argument (10) is the number of results requested.
    my $google_wsdl = "http://api.google.com/GoogleSearch.wsdl";
    my $service = SOAP::Lite->service($google_wsdl);
    my $result = $service->doGoogleSearch(
        $googlekey, $query, 0, 10, "false", "", "false",
        "", "latin1", "latin1"
    );

    # Start returning the results page -
    # do this now to prevent timeouts
    my $cgi = new CGI;
    print $cgi->header();
    print $cgi->start_html(-title=>'Blog Free Google Results');
    # Encode the query before echoing it back into the page.
    print $cgi->h1('Blog Free Results for ' . HTML::Entities::encode($query));
    print $cgi->start_ul();

    # Go through each of the results
    foreach my $element (@{$result->{'resultElements'}}) {

        my $url = HTML::Entities::encode($element->{'URL'});

        # Request the Technorati information for each result.
        my $technorati_result = get("http://api.technorati.com/bloginfo?".
                                    "url=$url&key=$technoratikey");

        # Parse this information.
        my $parser = new XML::Simple;
        my $parsed_feed = $parser->XMLin($technorati_result);

        # If Technorati considers this site to be a weblog,
        # go onto the next result. If not, display it, and then go on.
        if ($parsed_feed->{document}{result}{weblog}{name}) { next; }
        else {
            print $cgi->li('<a href="'.$url.'">'.$element->{title}.'</a>');
            print $cgi->li("$element->{snippet}");
        }
    }
    print $cgi->end_ul();
    print $cgi->end_html;

Let's step through the meaningful bits of this code. First comes pulling in the query from the form and running it against Google. Notice the 10 in the doGoogleSearch call; this is the number of search results requested from Google. You should try to set this as high as Google will allow whenever you run the script; otherwise, you might find that a search for terms that are extremely popular in the weblogging world returns no results at all, every one having been rejected as originating from a blog.
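If a single request cannot return enough results, one option is to ask for several pages and stitch them together. The following is a minimal sketch of that idea, not part of the original script. It assumes a per-request cap of 10 results, and the names collect_results and $fake_search are hypothetical; in the real script the code reference would wrap $service->doGoogleSearch and hand back $result->{resultElements}.

```perl
use strict;
use warnings;

# Collect up to $wanted results by requesting pages of $page_size.
# $search is a code reference taking (query, start, max) and returning
# an array reference of results.
sub collect_results {
    my ($search, $query, $wanted, $page_size) = @_;
    $page_size ||= 10;    # assumed per-request cap
    my @all;
    for (my $start = 0; $start < $wanted; $start += $page_size) {
        my $page = $search->($query, $start, $page_size) || [];
        push @all, @$page;
        last if @$page < $page_size;    # a short page means no more results
    }
    splice(@all, $wanted) if @all > $wanted;    # trim any overshoot
    return \@all;
}

# Stand-in search over a fixed list of 25 fake URLs, for illustration only.
my @fake = map { "http://example.com/$_" } 1 .. 25;
my $fake_search = sub {
    my ($query, $start, $max) = @_;
    return [] if $start > $#fake;
    my $end = $start + $max - 1;
    $end = $#fake if $end > $#fake;
    return [ @fake[$start .. $end] ];
};

my $results = collect_results($fake_search, "perl", 30);
print scalar(@$results), " results collected\n";
```

Keep in mind that every extra Google result also costs one Technorati lookup, so larger result sets use up the Technorati allowance faster.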

Since we're about to make a web services call for every one of the returned results, which might take a while, we want to start returning the results page now; this helps prevent connection timeouts. As such, we spit out a header using the CGI module, then jump into our loop.

We then get to the final part of our code: actually looping through the search results returned by Google and passing the HTML-encoded URL to the Technorati API as a get request. Technorati will then return its results as an XML document.
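The script HTML-encodes the URL before placing it in the query string. A result URL that itself contains characters such as & or = could still be misread by the receiving server, so percent-encoding the value is a possible alternative. The sketch below swaps in that technique using only core Perl; it is not the author's method, and the function names uri_escape_value and bloginfo_url are hypothetical. The endpoint and parameter names are the ones used in the script above.

```perl
use strict;
use warnings;

# Percent-encode a string for use as a query-string value.
# Unreserved characters (letters, digits, - . _ ~) pass through unchanged.
sub uri_escape_value {
    my ($text) = @_;
    utf8::encode($text) if utf8::is_utf8($text);    # work on bytes
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Build the bloginfo request URL from a site URL and an API key.
sub bloginfo_url {
    my ($site_url, $key) = @_;
    return "http://api.technorati.com/bloginfo?"
         . "url="  . uri_escape_value($site_url)
         . "&key=" . uri_escape_value($key);
}

print bloginfo_url("http://example.com/?a=1&b=2", "MYKEY"), "\n";
```

If the URI::Escape module is installed, its uri_escape function does the same job as the hand-rolled helper here.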

Be careful you do not run out of Technorati requests. As I write this, Technorati is offering 500 free requests a day, which, with this script, is around 50 searches. If you make this script available to your web site's audience, you will soon run out of Technorati requests. One possible workaround is forcing the user to enter her own Technorati key. You can get the user's key from the same form that accepts the query. See the "Hacking the Hack" section for a means of doing this.


Parsing this result is a matter of passing it through XML::Simple. Since Technorati returns only an XML construct containing name when the site is thought to be a weblog, we can use the presence of this construct as a marker. If the program sees the construct, it skips to the next result. If it doesn't, the site is not thought to be a weblog by Technorati and we display a link to it, along with the title and snippet (when available) returned by Google.
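The marker test can be pulled out into a small function that takes the already-parsed structure. This is a sketch under the assumption that the parsed data has the document/result/weblog/name nesting the script relies on; the function name is_weblog and the sample data are hypothetical. Walking the hashes one level at a time means a missing level neither raises an error nor creates empty hashes in the data as a side effect.

```perl
use strict;
use warnings;

# Return 1 if the parsed Technorati response names a weblog, else 0.
sub is_weblog {
    my ($parsed) = @_;
    my $node = $parsed;
    for my $key (qw(document result weblog)) {
        return 0 unless ref($node) eq 'HASH' && exists $node->{$key};
        $node = $node->{$key};
    }
    return 0 unless ref($node) eq 'HASH' && defined $node->{name};
    return $node->{name} ne '' ? 1 : 0;
}

# Hand-built sample structures standing in for XMLin output.
my $blog     = { document => { result => { weblog => { name => "Example Blog" } } } };
my $not_blog = { document => { result => { url => "http://example.com" } } };

print is_weblog($blog)     ? "weblog\n" : "not a weblog\n";
print is_weblog($not_blog) ? "weblog\n" : "not a weblog\n";
```

In the loop, the test would then read: next if is_weblog($parsed_feed);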

Hacking the Hack

As mentioned previously, this script can burn through your Technorati allowances rather quickly under heavy use. The simplest way of solving this is to force the end user to supply his own Technorati key. First, add a new input to your HTML form for the user's key:

    Your Technorati key: <input type="text" name="key">

Then, suck in the user's key as a replacement for your own:

    # Set up the query term
    # from the CGI input.
    my $query = param("q");
    $technoratikey = param("key");
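If the visitor leaves the key field blank, the script would send Technorati an empty key. A small check before searching avoids that. The sketch below is one way to do it; the function name clean_key is hypothetical, and the rule it applies (non-empty after trimming, no whitespace inside) is an assumption rather than a documented key format.

```perl
use strict;
use warnings;

# Return the trimmed key if it looks usable, or undef otherwise.
sub clean_key {
    my ($raw) = @_;
    return unless defined $raw;
    (my $key = $raw) =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return if $key eq '' || $key =~ /\s/;
    return $key;
}

for my $input ("  abc123 ", "", undef) {
    my $key = clean_key($input);
    print defined $key ? "use key '$key'\n" : "ask the visitor for a key\n";
}
```

In the CGI script, the result of clean_key(param("key")) would be assigned to $technoratikey, and the script would print a short message instead of searching when it comes back undefined.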

Ben Hammersley



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
