Hack 93 Accumulating Search Results Over Time


Graphing search results over time can lead to interesting discoveries.

If you're doing regular research over time, the quantity of results might become just as interesting as their quality. In other words, you might find it useful to track how popular certain words become on the Internet as events occur and time passes.

Many search engines, Google included, offer some level of date searching. With other engines, we'd have to use scraping techniques to get result counts by date; with Google, we just need the Google API and some code. To use this code, you'll need the SOAP::Lite and Time::JulianDay modules and a Google API key (which can be obtained for free by registering at http://api.google.com/).

Before we continue, there are two things of note:

  • Result counts will tend to rise over time anyway, as Google adds more pages to its index. If you run this search often enough, you'll soon be able to distinguish the regular growth curve from a spike of interest.

  • Even though Google makes a date range syntax available, Google does not guarantee its results. So, don't use this application to make important decisions or draw definitive conclusions about keyword popularity.
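The script in this hack passes dates to Google's daterange: syntax as Julian day numbers rather than calendar dates. As a rough illustration of that conversion, here is a minimal sketch that uses plain integer arithmetic instead of the Time::JulianDay module; the function names gregorian_to_julian and daterange_query are hypothetical and are not part of goocount.pl.

```perl
use strict;
use warnings;

# Convert a Gregorian calendar date to its Julian day number using
# integer arithmetic (a stand-in for Time::JulianDay's julian_day).
sub gregorian_to_julian {
    my ($year, $month, $day) = @_;
    my $adj = int((14 - $month) / 12);    # 1 for January/February, else 0
    my $y   = $year + 4800 - $adj;
    my $m   = $month + 12 * $adj - 3;     # March = 0 ... February = 11
    return $day + int((153 * $m + 2) / 5) + 365 * $y
         + int($y / 4) - int($y / 100) + int($y / 400) - 32045;
}

# Build a single-day query string in the form the script sends to Google.
sub daterange_query {
    my ($query, $date) = @_;              # $date is yyyy-mm-dd
    my ($y, $m, $d) = $date =~ /^(\d{4})-(\d{1,2})-(\d{1,2})$/
        or die "bad date: $date\n";
    my $julian = gregorian_to_julian($y, $m, $d);
    return "$query daterange:$julian-$julian";
}

# Prints: PalmOS daterange:2452640-2452640
print daterange_query("PalmOS", "2002-12-31"), "\n";
```

Using the same Julian number on both sides of the hyphen restricts the search to a single day, which is what lets the script collect one count per date.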

The Code

Save the following code as goocount.pl :

 #!/usr/bin/perl -w
 # goocount.pl
 # Runs the specified query for every day between the specified
 # start and end dates, returning date and count as CSV. From
 # Tara Calishain, Rael Dornfest, and Google Hacks.
 #
 # usage: goocount.pl query="{query}" start={date} end={date}
 # where dates are of the format: yyyy-mm-dd, e.g. 2002-12-31
 #
 use strict;
 use SOAP::Lite;
 use Time::JulianDay;
 use CGI qw/:standard/;

 # Your Google API developer's key.
 my $google_key = 'insert key here';

 # Location of the GoogleSearch WSDL file.
 my $google_wsdl = "./GoogleSearch.wsdl";

 # For checking date validity.
 my $date_regex = '(\d{4})-(\d{1,2})-(\d{1,2})';

 # Make sure all arguments are passed correctly.
 ( param('query')
   and param('start') =~ /^(?:$date_regex)?$/
   and param('end') =~ /^(?:$date_regex)?$/
 ) or die qq{usage: goocount.pl query="{query}" start={date} end={date}\n};

 # Julian date manipulation. A missing start or end date
 # defaults to yesterday.
 my $query = param('query');
 my $yesterday_julian = int local_julian_day(time) - 1;
 my $start_julian = (param('start') =~ /$date_regex/)
    ? julian_day($1,$2,$3) : $yesterday_julian;
 my $end_julian   = (param('end') =~ /$date_regex/)
    ? julian_day($1,$2,$3) : $yesterday_julian;

 # Create a new Google SOAP request.
 my $google_search  = SOAP::Lite->service("file:$google_wsdl");

 # Start our CSV file.
 print qq{"date","count"\n};

 # Iterate over each of the Julian dates for your query.
 foreach my $julian ($start_julian..$end_julian) {
     my $full_query = "$query daterange:$julian-$julian";
     my $results = $google_search->doGoogleSearch(
                       $google_key, $full_query, 0, 10, "false",
                       "",  "false", "", "latin1", "latin1"
                   );

     # Output our CSV record.
     print '"', sprintf("%04d-%02d-%02d", inverse_julian_day($julian)),
                qq{","$results->{estimatedTotalResultsCount}"\n};
 }

Running the Hack

Run the code from the command line, like so:

  % perl goocount.pl query="PalmOS" start=2002-01-01 end=2002-12-31  

This query searches for the keyword "PalmOS" on each day of 2002. (Since each day takes one API query, running the script with these parameters uses 365 queries against your key.)

As output, you'll get a list of dates and counts on the screen in this format:

 "date","count"
 "2002-01-01","200"
 "2002-01-02","210"

And so on. If you want to save the results as a comma-delimited file (for easy import into Excel), redirect the output to a filename, like this:

  % perl goocount.pl query="PalmOS" start=2002-01-01 end=2002-12-31 > data.csv  
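Once the counts are in a file, one simple way to separate a spike of interest from the regular growth curve is to compare each day's count with the average of the days just before it. The following is only a sketch of that idea: the function names, the three-day window, the factor of 2, and the sample numbers are all illustrative assumptions, not part of the original hack.

```perl
use strict;
use warnings;

# Parse lines like "2002-01-01","200" into [date, count] pairs,
# skipping the header row and anything else that does not match.
sub parse_counts {
    my @rows;
    for my $line (@_) {
        push @rows, [$1, $2]
            if $line =~ /^"(\d{4}-\d{2}-\d{2})",\s*"(\d+)"/;
    }
    return @rows;
}

# Return the dates whose count exceeds $factor times the mean of the
# preceding $window days. Both values are assumptions to tune.
sub find_spikes {
    my ($rows, $window, $factor) = @_;
    my @spikes;
    for my $i ($window .. $#$rows) {
        my $sum = 0;
        $sum += $rows->[$_][1] for $i - $window .. $i - 1;
        my $mean = $sum / $window;
        push @spikes, $rows->[$i][0] if $rows->[$i][1] > $factor * $mean;
    }
    return @spikes;
}

# Made-up sample data in the script's output format.
my @sample = (
    qq{"date","count"},
    qq{"2002-01-01","200"}, qq{"2002-01-02","210"}, qq{"2002-01-03","205"},
    qq{"2002-01-04","215"}, qq{"2002-01-05","600"}, qq{"2002-01-06","220"},
);
my @rows = parse_counts(@sample);
print "$_\n" for find_spikes(\@rows, 3, 2);   # prints 2002-01-05
```

To run this against real output, read the lines of data.csv into the list passed to parse_counts instead of using the sample.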

Perhaps you want to run this script under cron to gather information every day. Just run it without start and end dates (it defaults to yesterday's date) and use >> to append the new results to the comma-delimited file:

  % perl goocount.pl query="PalmOS" >>data.csv  
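Because the script prints its "date","count" header on every run, appending with >> repeats that header once per run. One way around this is to have the script open the file itself and write the header only when the file is new or empty. This is a minimal sketch of that approach; needs_header and append_counts are hypothetical helpers, not part of goocount.pl.

```perl
use strict;
use warnings;

# True when $path does not exist yet or is empty, that is, when the
# CSV header still needs to be written.
sub needs_header {
    my ($path) = @_;
    return 1 unless -e $path;
    return (-z $path) ? 1 : 0;
}

# Append rows (array refs of [date, count]) to $path, writing the
# header only the first time.
sub append_counts {
    my ($path, @rows) = @_;
    my $header = needs_header($path);
    open my $fh, '>>', $path or die "cannot open $path: $!\n";
    print {$fh} qq{"date","count"\n} if $header;
    print {$fh} qq{"$_->[0]","$_->[1]"\n} for @rows;
    close $fh or die "cannot close $path: $!\n";
}
```

A call such as append_counts("data.csv", ["2002-01-01", 200]) would then take the place of the shell redirection.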

Hacking the Hack

As written in this hack, the Google count script is a command-line application, but you can turn it into a web-based application with a little tweaking. Just change the program as noted in the following code: the print statements now emit an HTML table instead of CSV. And remember, this application can use a lot of API queries. Don't make it publicly available unless you give users the option of using their own keys. Otherwise, you'll probably burn out your key!

 ...
 print
   header(  ),
   start_html("GooCount: $query"),
   start_table({-border=>undef}, caption("GooCount:$query")),
   Tr([ th(['Date', 'Count']) ]);

 foreach my $julian ($start_julian..$end_julian) {
     my $full_query = "$query daterange:$julian-$julian";
     my $results = $google_search->doGoogleSearch(
                       $google_key, $full_query, 0, 10, "false",
                       "",  "false", "", "latin1", "latin1"
                   );
     print
       Tr([ td([
         sprintf("%04d-%02d-%02d", inverse_julian_day($julian)),
         $results->{estimatedTotalResultsCount}
       ]) ]);
 }

 print
   end_table(  ),
   end_html;
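Since every day in the range costs one query, a web-facing version may want to check the size of the requested range before calling the API at all. The following is a sketch under stated assumptions: the cap of 31 queries per request and the names MAX_QUERIES, queries_needed, and range_allowed are hypothetical, so pick values that suit your own key.

```perl
use strict;
use warnings;

# Hypothetical per-request cap on API queries.
use constant MAX_QUERIES => 31;

# One query is sent per day, so an inclusive Julian range costs
# end - start + 1 queries (or none if the range is reversed).
sub queries_needed {
    my ($start_julian, $end_julian) = @_;
    return 0 if $end_julian < $start_julian;
    return $end_julian - $start_julian + 1;
}

# True when the requested range fits within the cap.
sub range_allowed {
    my ($start_julian, $end_julian) = @_;
    return queries_needed($start_julian, $end_julian) <= MAX_QUERIES ? 1 : 0;
}

# 2002-01-01 through 2002-12-31 as Julian day numbers; prints 365.
print queries_needed(2452276, 2452640), "\n";
```

In the web version, a failed range_allowed check could print an error page instead of entering the foreach loop.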

See Also

  • [Hack #62] for graphing Amazon Sales Ranks over a period of time.

  • [Hack #25] for an example of pulling information from Junglescan.com.

  • [Hack #47] to count how many new items Yahoo! has been adding to its index on a daily basis.

Tara Calishain and Rael Dornfest



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
