Hack 45 Gleaning Buzz from Yahoo!


Stay hip with the latest Yahoo! Buzz search results.

Google has a Zeitgeist page (http://www.google.com/press/zeitgeist.html) that gives you an idea of what people are searching for, but unfortunately it's not updated very often; some parts are updated once a week, while other parts are updated only once a month. Meanwhile, Yahoo! has a Yahoo! Buzz site (http://buzz.yahoo.com/) that contains much more annotated information about what people are searching for.

We thought it would be fun to take a Buzz item from the Yahoo! Buzz site (specifically, http://buzz.yahoo.com/overall/) and then use it to initiate a search on Google. This hack is part scraping (the Yahoo! Buzz side) and part use of a web API (the Google side). As you'll see, the two work very well together.

The Code

You'll need a Google API developer's key (http://api.google.com/) and a lesser-known Perl module (Time::JulianDay) to get this hack to work. Save the following code to a file called ybgoogled.pl:

    #!/usr/bin/perl -w
    # ybgoogled.pl
    # Pull the top item from the Yahoo Buzz Index and query
    # the last three days' worth of Google's index for it.
    # Usage: perl ybgoogled.pl
    use strict;
    use SOAP::Lite;
    use LWP::Simple;
    use Time::JulianDay;

    # Your Google API developer's key.
    my $google_key = 'insert key here';

    # Location of the GoogleSearch WSDL file.
    my $google_wdsl = "./GoogleSearch.wsdl";

    # Number of days back to
    # go in the Google index.
    my $days_back = 3;

    # Grab a copy of http://buzz.yahoo.com.
    my $buzz_content = get("http://buzz.yahoo.com/overall/")
       or die "Couldn't grab the Yahoo Buzz: $!";

    # Find the first item on the Buzz Index list.
    $buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!;
    my $buzziest = $1; # assign our match as our search term.
    die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;

    # Figure out today's Julian date.
    my $today = int local_julian_day(time);

    # Build the Google query and say hi.
    my $query = "\"$buzziest\" daterange:" . ($today - $days_back) . "-$today";
    print "The buzziest item on Yahoo Buzz today is: $buzziest\n",
          "Querying Google for: $query\n", "Results:\n\n";

    # Create a new SOAP::Lite instance, feeding it GoogleSearch.wsdl.
    my $google_search = SOAP::Lite->service("file:$google_wdsl");

    # Query Google.
    my $results = $google_search->doGoogleSearch(
                       $google_key, $query, 0, 10, "false",
                       "", "false", "", "", ""
                  );

    # No results?
    die "No results" unless @{$results->{resultElements}};

    # Loop through the results.
    foreach my $result (@{$results->{'resultElements'}}) {
        my $output = join "\n", $result->{title} || "no title",
                     $result->{URL}, $result->{snippet} || 'none', "\n";
        $output =~ s!<.+?>!!g; # drop all HTML tags sloppily.
        print $output; # woo, we're done!
    }

This code works only as long as Yahoo! formats its Buzz page in the same way; we've had to change it multiple times. If you try this program and it doesn't work, pull out this line:

 $buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!; 

Compare the markup in that pattern (the part that captures $buzziest) with the source of http://buzz.yahoo.com/overall/. If the pattern no longer appears there, Yahoo! has changed the page. View the HTML source, find the first item on the Buzz list, and copy the markup around it into the pattern. You want enough markup to make the match unique, but not so much that the pattern becomes unreadable.
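When adapting the pattern, it can help to test it against a saved fragment before pointing it at the live page. The following is a minimal sketch, not part of the original hack: extract_buzziest is a hypothetical helper name, and $sample is an invented fragment shaped like the markup the pattern expects, not real Yahoo! source.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: return the first Buzz item found in $html,
# or nothing if the pattern fails to match.
sub extract_buzziest {
    my ($html) = @_;
    if ($html =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!) {
        return $1;    # only trust $1 after a successful match
    }
    return;
}

# Invented fragment shaped like the markup the pattern expects.
my $sample = '<td><b>1</b></td><td><a href="http://example.com/?p=x&cs=bz">'
           . '<b>Gregory Hines</b></a>&nbsp;</font></td>';

my $item = extract_buzziest($sample);
print defined $item ? "Matched: $item\n" : "No match; update the pattern\n";
```

Paste a chunk of the current page source into $sample and adjust the pattern until the match succeeds.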

Running the Hack

Run this script from the command line, like so:

    % perl ybgoogled.pl
    The buzziest item on Yahoo Buzz today is: Gregory Hines
    Querying Google for: "Gregory Hines" daterange:2452861-2452864
    Results:

    Celebrities @ Hollywood.com-Featuring Gregory Hines. Celebrities ...
    http://www.hollywood.com/celebs/detail/celeb/191902
    Gregory Hines Vital Stats: Born: February 14, 1946 Birth Place: New York,
    New York

    Gregory Hines
    http://www.rottentomatoes.com/p/GregoryHines-1007016/
    ... Gregory Hines. CELEB QUIK BROWSER &gt; Select A Celebrity. ...

    ...
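The two numbers after daterange: in the query are Julian day numbers. As a rough illustration of the arithmetic, here is a sketch that builds the same kind of query with core Perl only. It is an assumption-laden stand-in for Time::JulianDay, not a replacement: it works from the UTC calendar date rather than local time, and julian_day_utc and build_query are hypothetical names.

```perl
#!/usr/bin/perl -w
use strict;

# Julian Day Number of the Unix epoch date, 1970-01-01.
use constant EPOCH_JDN => 2440588;

# Convert epoch seconds to the Julian Day Number of that UTC calendar date.
# (Assumption: UTC, whereas the hack itself uses local time.)
sub julian_day_utc {
    my ($epoch) = @_;
    return int($epoch / 86400) + EPOCH_JDN;
}

# Build a quoted-phrase query restricted to the last $days_back days.
sub build_query {
    my ($term, $days_back, $epoch) = @_;
    $epoch = time unless defined $epoch;
    my $today = julian_day_utc($epoch);
    return sprintf '"%s" daterange:%d-%d', $term, $today - $days_back, $today;
}

# 1060646400 is 2003-08-12 00:00:00 UTC.
print build_query("Gregory Hines", 3, 1060646400), "\n";
```

With that fixed timestamp, the sketch prints the same query string shown in the sample run.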

Hacking the Hack

As it stands, this hack returns 10 results. If you want to, you can change the code to return only one result and immediately open it instead of returning a list. This version of the program searches the last three days of indexed pages. Because there's a slight lag in indexing news stories, we would search at least the last two days' worth of pages, but you could extend it to seven days or even a month.
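One way to approach the single-result variant is to change the fourth argument of doGoogleSearch (the maximum number of results) from 10 to 1 and then redirect to the first URL rather than looping. The sketch below shows only the second half, using invented data shaped like the structure the script loops over; first_result_url is a hypothetical helper, and the Location header assumes the script runs as a CGI.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: return the URL of the first result,
# or undef when there are no results.
sub first_result_url {
    my ($results) = @_;
    my $elements = $results->{resultElements} || [];
    return @$elements ? $elements->[0]{URL} : undef;
}

# Invented data shaped like the structure the script loops over.
my $results = {
    resultElements => [
        { title => 'Gregory Hines',
          URL   => 'http://www.rottentomatoes.com/p/GregoryHines-1007016/' },
    ],
};

# As a CGI script, a Location header sends the browser to the result.
my $url = first_result_url($results);
print "Location: $url\n\n" if defined $url;
```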

If you want to abandon Google entirely, you can. Instead, you might want to go to Daypop (http://www.daypop.com), which also has a news search. Here's a version of the script that looks up the top Buzz item on Daypop:

    #!/usr/bin/perl -w
    # ybdaypopped
    # Pull the top item from the Yahoo! Buzz Index and query
    # Daypop's News search engine for relevant stories
    use strict;
    use LWP::Simple;

    # Grab a copy of http://buzz.yahoo.com.
    my $buzz_content = get("http://buzz.yahoo.com/")
       or die "Couldn't grab the Yahoo Buzz: $!";

    # Find the first item on the Buzz Index list.
    $buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!;
    my $buzziest = $1; # assign our match as our search term.
    die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;

    # Build a Daypop Query.
    my $dpquery = "http://www.daypop.com/search?q=$buzziest&t=n";
    print "Location: $dpquery\n\n";

This version of the program takes the first Buzz item from Yahoo! and opens a Daypop news search for that item (assuming you run this as a CGI script). But hey, maybe we should use that RSS format [Hack #94] all the kids are talking about. In that case, just put &o=rss at the end of $dpquery:

 my $dpquery = "http://www.daypop.com/search?q=$buzziest&t=n&o=rss"; 

Now you're using Yahoo! Buzz to generate an RSS file with Daypop. From there, you can scrape the RSS file, pass this URL to a routine that puts an RSS file up on a web page [Hack #95], and so on.
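Buzz items often contain spaces or punctuation, which are not safe to drop straight into a URL. A possible refinement, sketched here under the assumption that the URI::Escape module is installed (it ships with the URI distribution that LWP depends on), is to escape the term before building the Daypop query. daypop_url is a hypothetical helper name; the q, t, and o parameters are the ones used in the scripts above.

```perl
#!/usr/bin/perl -w
use strict;
use URI::Escape;    # exports uri_escape

# Hypothetical helper: build a Daypop news-search URL.
# Pass a true $rss to request RSS output instead of HTML.
sub daypop_url {
    my ($term, $rss) = @_;
    my $url = "http://www.daypop.com/search?q=" . uri_escape($term) . "&t=n";
    $url .= "&o=rss" if $rss;
    return $url;
}

print daypop_url("Gregory Hines"), "\n";
print daypop_url("Gregory Hines", 1), "\n";
```

The space in the search term becomes %20, so the resulting URL is safe to use in a Location header or to hand to another routine.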

Tara Calishain and Rael Dornfest



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
