Hack 47 Tracking Additions to Yahoo!
Keep track of the number of sites added to your favorite Yahoo! categories.

Every day, a squad of surfers at Yahoo! adds new sites to the Yahoo! index. These changes are reflected in the Yahoo! What's New page (http://dir.yahoo.com/new/), along with the Picks of the Day. If you're a casual surfer, you might not care about the number of new sites added to Yahoo!. But there are several scenarios in which you might have an interest:
This hack scrapes the recent counts of additions to Yahoo! categories and prints them out, providing an at-a-glance view of additions to various categories. The output is a tab-delimited table of how many sites have been added to each category for each day. A tab-delimited file is excellent for importing into Excel, where you can turn the count numbers into a chart.

The Code

Save the following code to a file called hoocount.pl:

    #!/usr/bin/perl -w
    use strict;
    use Date::Manip;
    use LWP::Simple;
    use Getopt::Long;
    $ENV{TZ} = "GMT" if $^O eq "MSWin32";

    # the homepage for Yahoo!'s "What's New".
    my $new_url = "http://dir.yahoo.com/new/";

    # the major categories at Yahoo!. we'll use these
    # names as keys when we hold our counts strings.
    my @categories = ("Arts & Humanities",    "Business & Economy",
                      "Computers & Internet", "Education",
                      "Entertainment",        "Government",
                      "Health",               "News & Media",
                      "Recreation & Sports",  "Reference",
                      "Regional",             "Science",
                      "Social Science",       "Society & Culture");
    my %final_counts; # where we save our final readouts.

    # load in our options from the command line.
    my %opts; GetOptions(\%opts, "c|count=i");
    die "usage: $0 --count <days>\n" unless $opts{c}; # count sites from past $i days.

    # if we've been told to count the number of new sites,
    # then we'll go through each of our main categories
    # for the last $i days and collate a result.

    # begin the header
    # for our import file.
    my $header = "Category";

    # from today, going backwards, get $i days.
    for (my $i=1; $i <= $opts{c}; $i++) {

        # create a Date::Manip time that will
        # be used to construct the last $i days.
        my $day; # query for Yahoo! retrieval.
        if ($i == 1) { $day = "yesterday"; }
        else         { $day = "$i days ago"; }
        my $date = UnixDate($day, "%Y%m%d");

        # add this date to
        # our import file.
        $header .= "\t$date";

        # and download the day.
        my $url = "$new_url$date.html";
        my $data = get($url) or die "couldn't fetch $url\n";

        # and loop through each of our categories. the count is
        # the first number that follows the category name.
        foreach my $category (sort @categories) {
            my $count = ($data =~ /$category.*?(\d+)/) ? $1 : 0;
            $final_counts{$category} .= "\t$count"; # building our string.
        }
    }

    # with all our counts finished,
    # print out our final file.
    print $header . "\n";
    foreach my $category (@categories) {
        print $category, $final_counts{$category}, "\n";
    }

Running the Hack

The only argument you need to provide the script is the number of days back you'd like it to travel in search of new additions. Since Yahoo! doesn't archive their "new pages added" indefinitely, a safe upper limit is around two weeks. Here, we're looking at the past two days:

    % perl hoocount.pl --count 2
    Category                20030807    20030806
    Arts & Humanities       23          23
    Business & Economy      88          141
    Computers & Internet    2           9
    Education               0           4
    Entertainment           43          29
    Government              3           4
    Health                  2           7
    News & Media            1           1
    Recreation & Sports     8           27
    Reference               0           0
    Regional                142         114
    Science                 1           2
    Social Science          3           0
    Society & Culture       7           8

Hacking the Hack

If you're not only a researcher but also a Yahoo! observer, you might be interested in how the number of sites added changes over time. To that end, you could run this script under cron [Hack #90], and output the results to a file. After three months or so, you'd have a pretty interesting set of counts to manipulate with a spreadsheet program like Excel. Alternatively, you could modify the script to run RRDTOOL [Hack #62] and have real-time graphs.
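
If you save each cron run to its own file, you end up with many small tables rather than one long one. The following is a minimal sketch of one way to fold them together before importing into a spreadsheet. It is not part of the original hack: the script name merge.pl and the subroutine merge_tables are invented here for illustration, and the code assumes every saved file holds exactly the tab-delimited format that hoocount.pl prints (a "Category" header row followed by one row per category).

```perl
#!/usr/bin/perl -w
use strict;

# merge_tables (a hypothetical helper, not part of hoocount.pl):
# each argument is the full text of one saved run. returns a single
# tab-delimited table with one column per distinct date.
sub merge_tables {
    my @tables = @_;
    my (%counts, %seen_date, %seen_category, @category_order);

    foreach my $table (@tables) {
        next unless defined $table;
        my @lines = grep { /\S/ } split /\r?\n/, $table;
        next unless @lines;

        # the first row is "Category", then one date per column.
        my (undef, @dates) = split /\t/, shift @lines;
        $seen_date{$_} = 1 for @dates;

        # every other row is a category name, then its counts.
        foreach my $line (@lines) {
            my ($category, @values) = split /\t/, $line;
            push @category_order, $category
                unless $seen_category{$category}++;
            for my $i (0 .. $#dates) {
                $counts{$category}{ $dates[$i] } = $values[$i]
                    if defined $values[$i];
            }
        }
    }

    # YYYYMMDD strings sort chronologically as plain strings.
    my @all_dates = sort keys %seen_date;
    my $out = join("\t", "Category", @all_dates) . "\n";
    foreach my $category (@category_order) {
        # a date missing for this category is left blank.
        my @row = map {
            defined $counts{$category}{$_} ? $counts{$category}{$_} : ""
        } @all_dates;
        $out .= join("\t", $category, @row) . "\n";
    }
    return $out;
}

# when given filenames, merge them and print the result.
if (@ARGV) {
    my @tables;
    foreach my $file (@ARGV) {
        open(my $fh, "<", $file) or die "can't open $file: $!\n";
        my $text = do { local $/; <$fh> }; # slurp the whole file.
        close $fh;
        push @tables, $text;
    }
    print merge_tables(@tables);
}
```

Assuming the daily runs were saved as run1.txt, run2.txt, and so on (names chosen only for this example), perl merge.pl run1.txt run2.txt > merged.txt produces one table with the dates in chronological order, oldest first. If two files cover the same date, the file listed later wins.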