Hack 47 Tracking Additions to Yahoo!

Keep track of the number of sites added to your favorite Yahoo! categories.

Every day, a squad of surfers at Yahoo! adds new sites to the Yahoo! index. These changes are reflected in the Yahoo! What's New page (http://dir.yahoo.com/new/), along with the Picks of the Day.

If you're a casual surfer, you might not care about the number of new sites added to Yahoo!. But there are several scenarios when you might have an interest:

  • You regularly glean information about new sites from Yahoo! Knowing which categories are growing and which categories are stagnant will tell you where to direct your attention.

  • You want to submit sites to Yahoo! Are you going to spend your hard-earned money adding a site to a category where new sites are added constantly (meaning your submitted site might get quickly buried)? Or will you be paying to add to a category that sees few additions (meaning your site might have a better chance of standing out)?

  • You're interested in trend tracking. Which categories are consistently busy? Which are all but dead? By watching how Yahoo! adds sites to categories, over time you'll get a sense of their rhythms and trends and detect when unusual activity occurs in a category.

This hack scrapes the recent counts of additions to Yahoo! categories and prints them out, providing an at-a-glance glimpse of additions to various categories. You'll also get a tab-delimited table of how many sites have been added to each category for each day. A tab-delimited file is excellent for importing into Excel, where you can turn the count numbers into a chart.

The Code

Save the following code to a file called hoocount.pl:

    #!/usr/bin/perl -w
    use strict;
    use Date::Manip;
    use LWP::Simple;
    use Getopt::Long;
    $ENV{TZ} = "GMT" if $^O eq "MSWin32";

    # the homepage for Yahoo!'s "What's New".
    my $new_url = "http://dir.yahoo.com/new/";

    # the major categories at Yahoo!. each name doubles as a
    # key in %final_counts, which holds our counts string.
    my @categories = ("Arts & Humanities",    "Business & Economy",
                      "Computers & Internet", "Education",
                      "Entertainment",        "Government",
                      "Health",               "News & Media",
                      "Recreation & Sports",  "Reference",
                      "Regional",             "Science",
                      "Social Science",       "Society & Culture");
    my %final_counts; # where we save our final readouts.

    # load in our options from the command line.
    # --count (or -c) says how many past days to count.
    my %opts; GetOptions(\%opts, "c|count=i");
    die "Usage: $0 --count <days>\n" unless $opts{c};

    # we'll go through each of our main categories
    # for the last $opts{c} days and collate a result.

    # begin the header
    # for our import file.
    my $header = "Category";

    # from today, going backwards, get $opts{c} days.
    for (my $i=1; $i <= $opts{c}; $i++) {

       # create a Date::Manip time that will
       # be used to construct the last $i days.
       my $day; # query for Yahoo! retrieval.
       if ($i == 1) { $day = "yesterday"; }
       else { $day = "$i days ago"; }
       my $date = UnixDate($day, "%Y%m%d");

       # add this date to
       # our import file.
       $header .= "\t$date";

       # and download the day.
       my $url = "$new_url$date.html";
       my $data = get($url) or die "Couldn't download $url\n";

       # and loop through each of our categories. the count is
       # the first number after the category name, or 0 if the
       # category doesn't appear on that day's page.
       foreach my $category (sort @categories) {
           my $count = ($data =~ /$category.*?(\d+)/) ? $1 : 0;
           $final_counts{$category} .= "\t$count"; # building our string.
       }
    }

    # with all our counts finished,
    # print out our final file.
    print $header . "\n";
    foreach my $category (@categories) {
       print $category, $final_counts{$category}, "\n";
    }
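The pattern match at the heart of the loop can be tried in isolation, without fetching anything from Yahoo!. The following is a minimal sketch, not part of the original hack: count_for is a hypothetical helper name, and $sample is made-up text standing in for a downloaded page, since the real page markup may differ.

```perl
use strict;
use warnings;

# Return the first run of digits that follows $category in $html,
# or 0 when the category (or a number after it) is not found.
sub count_for {
    my ($html, $category) = @_;
    return $html =~ /\Q$category\E.*?(\d+)/ ? $1 : 0;
}

# Hypothetical sample text; the real page markup may differ.
my $sample = join "\n",
    'Business & Economy (88)',
    'Education (0)',
    'Regional (142)';

for my $category ("Business & Economy", "Education", "Health") {
    printf "%s\t%d\n", $category, count_for($sample, $category);
}
```

Testing the result of the match before reading $1 matters here: after a failed match, Perl leaves $1 holding the value from the last successful one, so a missing category could otherwise inherit its neighbor's count.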

Running the Hack

The only argument you need to provide the script is the number of days back you'd like it to travel in search of new additions. Since Yahoo! doesn't archive their "new pages added" indefinitely, a safe upper limit is around two weeks. Here, we're looking at the past two days:

    % perl hoocount.pl --count 2
    Category                20030807        20030806
    Arts & Humanities       23              23
    Business & Economy      88              141
    Computers & Internet    2               9
    Education               0               4
    Entertainment           43              29
    Government              3               4
    Health                  2               7
    News & Media            1               1
    Recreation & Sports     8               27
    Reference               0               0
    Regional                142             114
    Science                 1               2
    Social Science          3               0
    Society & Culture       7               8
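Because the report is tab-delimited, it is also easy to post-process without a spreadsheet. As a rough sketch (the totals helper is a hypothetical name, not part of the hack), this sums each category's daily counts, using three rows from the sample run above:

```perl
use strict;
use warnings;

# Sum each category's daily counts from tab-delimited report lines.
# Returns a hash reference of category => total.
sub totals {
    my @lines = @_;
    my %total;
    for my $line (@lines) {
        chomp(my $copy = $line);
        my ($category, @counts) = split /\t/, $copy;
        next if !defined $category or $category eq "Category";  # skip header
        my $sum = 0;
        $sum += $_ for @counts;
        $total{$category} = $sum;
    }
    return \%total;
}

# Three rows taken from the sample run above.
my @report = (
    "Category\t20030807\t20030806\n",
    "Business & Economy\t88\t141\n",
    "Regional\t142\t114\n",
);

my $t = totals(@report);
print "$_\t$t->{$_}\n" for sort keys %$t;
```

In practice you would feed it the script's real output, for example by reading lines from a saved report file instead of the hardcoded @report list.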

Hacking the Hack

If you're not only a researcher but also a Yahoo! observer, you might be interested in how the number of sites added changes over time. To that end, you could run this script under cron [Hack #90], and output the results to a file. After three months or so, you'd have a pretty interesting set of counts to manipulate with a spreadsheet program like Excel. Alternatively, you could modify the script to run RRDTOOL [Hack #62] and have real-time graphs.
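One way to accumulate those results is to append each run's rows to a single log file, stamped with the date of the run. The sketch below is only an illustration under assumptions: append_report is a hypothetical helper, the run dates are invented for the demo, and the counts are the Regional figures from the sample run. It writes to a temporary file so it can be tried safely; a cron job would use a fixed path instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Append report rows to $file, prefixing each with the date of the run.
# The header row (starting with "Category") is skipped.
sub append_report {
    my ($file, $run_date, @lines) = @_;
    open my $fh, '>>', $file or die "Can't append to $file: $!";
    for my $line (@lines) {
        next if $line =~ /^Category\t/;
        print {$fh} "$run_date\t$line";
    }
    close $fh or die "Can't close $file: $!";
}

# Demo against a temporary file; run dates are assumed for illustration.
my ($tmp_fh, $log) = tempfile(UNLINK => 1);
close $tmp_fh;
append_report($log, "20030807", "Category\t20030806\n", "Regional\t114\n");
append_report($log, "20030808", "Category\t20030807\n", "Regional\t142\n");

# Read the log back to show what has accumulated.
open my $in, '<', $log or die "Can't read $log: $!";
my @rows = <$in>;
close $in;
print @rows;
```

Keeping one row per run, category, and count gives a long, narrow file that a spreadsheet can pivot into a chart.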



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
