Hack 44 Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups

Yahoo! Groups makes it easy to run an email discussion group at no cost. Sadly, there's no simple way to download all the messages, until now.

If you've ever wanted to run an email discussion group, but you didn't want to mess around with getting your own server and administering your own software, you should consider looking into Yahoo! Groups (http://groups.yahoo.com/). The free (ad-supported) service makes it easy to run a mailing list, and if you or any other group moderator has set a list to support archiving of messages, a handy web interface to browse them is provided. Sadly, the service provides no simple way to download all the messages in one fell swoop, and nobody wants to click and Save As . . . on hundreds or thousands of links.

Iain Truskett of Canberra, Australia, wanted to keep an offline archive of his Yahoo! Groups mailing lists, so he created the WWW::Yahoo::Groups module, available on CPAN (http://search.cpan.org/dist/WWW-Yahoo-Groups/). It uses WWW::Mechanize to log into Yahoo! Groups, get a count of the messages, and download any given message by number. It even bypasses the pop-up ads and interstitial interruptions!

The Code

You'll need the WWW::Yahoo::Groups Perl module installed to use this script. The module requires a number of other modules, but installing from the CPAN shell [Hack #8] should take care of the installation of these prerequisites for you.

Save the following code to a file called yahoogroups.pl:

    #!/usr/bin/perl -w

    use constant USERNAME => 'your username';
    use constant PASSWORD => 'your password';

    use strict;
    use File::Path;
    use Getopt::Long;
    use WWW::Yahoo::Groups;
    $SIG{PIPE} = 'IGNORE';

    # define the command-line options, and
    # ensure that a group has been passed.
    # GetOptions needs references to the variables it fills in.
    my ($debug, $group, $last, $first, $stats);
    GetOptions(
        "debug"     => \$debug,
        "group=s"   => \$group,
        "stats"     => \$stats,
        "first=i"   => \$first,
        "last=i"    => \$last,
    );
    (defined $group) or die "Must specify a group!\n";

    # sign into Yahoo! Groups.
    my $w = WWW::Yahoo::Groups->new(  );
    $w->debug( $debug );
    $w->login( USERNAME, PASSWORD );
    $w->list( $group );
    $w->agent->requests_redirectable( [] ); # no redirects now

    # first and last IDs of group.
    my $first_id = $w->first_msg_id(  );
    my $last_id = $w->last_msg_id(  );
    print "Messages in $group: $first_id to $last_id\n";
    exit 0 if $stats; # they just wanted numbers.

    # default our IDs to the first and last
    # of the $group in question, else use the
    # passed command-line options.
    $first = $first_id unless $first;
    $last  = $last_id  unless $last;
    warn "Fetching $first to $last\n";

    # get our specified messages.
    for my $msgnum ($first..$last) {
        fetch_message( $w, $msgnum );
    }

    sub fetch_message {
        my $w = shift;
        my $msgnum = shift;

        # Put messages in directories by 100.
        my $dirname = int($msgnum/100)*100;

        # Create the dir if necessary.
        my $dir = "$group/$dirname";
        mkpath( $dir ) unless -d $dir;

        # Don't pull down the message
        # if we already have it...
        my $filename = "$dir/$msgnum";
        return if -f $filename;

        # pull down the content and check for errors.
        my $content = eval { $w->fetch_message($msgnum) };
        if ( $@ ) {
            if ( $@->isa('X::WWW::Yahoo::Groups') ) {
                warn "Could not handle message $msgnum: ",$@->error,"\n";
            } else { warn "Could not get content for message $msgnum\n"; }
        } else {
            open(FH, ">$filename")
               or return warn "Can't create $filename: $!\n";
            print FH $content; close FH; # data has been saved.
            $w->autosleep( 5 ); # so now sleep to prevent saturation.
        }
    }

Running the Hack

Before you can use the script, you'll need to have a Yahoo! Groups account (http://edit.yahoo.com/config/eval_register) and be subscribed to at least one list that has web archives. Remember that we're merely automating the web transactions, not getting at some secret backdoor into Yahoo! Groups. Also, modify the lines at the top of the script that set the USERNAME and PASSWORD constants. If these aren't set, the script can't log in as you and, consequently, you might not have access to the group's messages.

First, find out how many messages there are. In this case, let's check out milwpm, the discussion list for the Milwaukee Perl Mongers:

    % perl yahoogroups.pl --group=milwpm --stats
    Messages in milwpm: 1 to 721

Now, fetch the last five messages in the archive:

    % perl yahoogroups.pl --group=milwpm --first=717
    Messages in milwpm: 1 to 721
    Fetching 717 to 721

Behind the scenes, the script has created a directory called milwpm and, within that, a directory called 700 for holding all messages between 700 and 799. Each message gets its own file.

    % ls -al milwpm/700
    -rw-r--r--    1 andy     staff        2814 Jul 16 23:04 700
    -rw-r--r--    1 andy     staff        4005 Jul 16 23:05 717
    -rw-r--r--    1 andy     staff        1511 Jul 16 23:05 718
    -rw-r--r--    1 andy     staff        5576 Jul 16 23:05 719
    -rw-r--r--    1 andy     staff        5862 Jul 16 23:05 720
    -rw-r--r--    1 andy     staff        6632 Jul 16 23:05 721
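The directory layout follows from a single integer division in fetch_message. As a minimal sketch, the mapping from message number to file path can be isolated as below; the helper name message_path is hypothetical and is not part of the script or the module.

```perl
use strict;
use warnings;

# Hypothetical helper: mirrors the bucketing logic in fetch_message.
sub message_path {
    my ($group, $msgnum) = @_;

    # Round down to the nearest hundred: 717 -> 700, 5 -> 0.
    my $bucket = int( $msgnum / 100 ) * 100;

    return "$group/$bucket/$msgnum";
}

# Print the path each sample message number would be saved under.
print message_path( 'milwpm', $_ ), "\n" for 5, 700, 717;
```

Note that messages 1 through 99 land in a directory named 0, since their message numbers round down to zero.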

If you want only the first few messages, use the --last parameter to set where the range ends. You can also use the --debug parameter to get running notes of what the script is doing:

    % perl yahoogroups.pl --group=milwpm --last=5 --debug
    Fetching http://groups.yahoo.com/
    Fetching http://login.yahoo.com/config/login?.intl=us&.src=ygrp&....
    Fetching http://groups.yahoo.com/group/milwpm/messages/1
    Messages in milwpm: 1 to 721
    Fetching 1 to 5
    Fetching http://groups.yahoo.com/group/milwpm/message/1?source=1&unwrap=1
    Fetching http://groups.yahoo.com/group/milwpm/message/2?source=1&unwrap=1
    Fetching http://groups.yahoo.com/group/milwpm/message/3?source=1&unwrap=1
    Fetching http://groups.yahoo.com/group/milwpm/message/4?source=1&unwrap=1
    Fetching http://groups.yahoo.com/group/milwpm/interrupt?st=2&m=1&done=%2...
    Fetching /group/milwpm/message/4?source=1&unwrap=1
    Fetching http://groups.yahoo.com/group/milwpm/message/5?source=1&unwrap=1

Hacking the Hack

You can easily extend this hack to manipulate the data before it gets saved to the file. The messages that are returned are in standard Internet mail format, so you can extract just the headers you want, such as To:, From:, and Subject:. The MailTools (http://search.cpan.org/dist/MailTools/MailTools) distribution has a number of modules that will help.

As a quick example, sans MailTools, let's say you want to see the most active threads from the messages you're downloading. This is a rather simple modification to make. Add a hash for our new information before the fetch_message subroutine (the comment and the %subjects declaration are the new lines):

    # Keep track of popular subjects
    my %subjects;

    sub fetch_message {
        my $w = shift;

Then, add the tracking code for each subject line:

            } else { warn "Could not get content for message $msgnum\n"; }
        } else {
            # and add one to our subject line counter.
            $content =~ /Subject: (.*)/ig;
            $subjects{$1}++ if $1;

            open(FH, ">$filename")
               or return warn "Can't create $filename: $!\n";

Finally, at the end of the script, display the stats:

    # now, print our totals.
    my @sorted = sort { $subjects{$b} <=> $subjects{$a} } keys %subjects;
    foreach (@sorted) { print "$subjects{$_}: $_\n"; }
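One caveat: replies usually arrive with a Re: prefix, so counting raw subject lines can split a single thread across several hash keys. The following is a minimal sketch of one way to fold them together. The helper name thread_key and the sample subjects are assumptions made for illustration; list-specific subject tags, if a group uses them, are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a Subject line to a thread key by
# trimming whitespace and stripping any leading "Re:" prefixes.
sub thread_key {
    my ($subject) = @_;
    $subject =~ s/^\s+|\s+$//g;          # trim both ends
    1 while $subject =~ s/^re:\s*//i;     # drop any number of "Re:"
    return lc $subject;                   # ignore case differences
}

# Tally made-up sample subjects by thread key.
my %subjects;
$subjects{ thread_key($_) }++
    for 'Meeting tonight', 'Re: Meeting tonight',
        'RE: Re: meeting tonight', 'New module';

# Most active threads first; ties broken alphabetically.
for my $key ( sort { $subjects{$b} <=> $subjects{$a} || $a cmp $b }
              keys %subjects ) {
    print "$subjects{$key}: $key\n";
}
```

In the script itself, the same idea would amount to passing the captured subject through such a helper before incrementing the counter.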

This code can easily be tweaked to save only messages from certain authors (local copies of your own postings, for instance) or subject lines associated with especially thoughtful or useful threads.
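As a rough sketch of the author filter, the check below looks only at the header block of a message and compares the From: line against a pattern. The helper name from_matches, the sample message, and the addresses are all illustrative assumptions; in the script, a check like this would sit just before the file is opened for writing.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the message's From: header matches $pattern.
sub from_matches {
    my ($content, $pattern) = @_;

    # Headers end at the first blank line; ignore the body entirely.
    my ($headers) = split /\r?\n\r?\n/, $content, 2;

    return 0 unless defined $headers && $headers =~ /^From:\s*(.*)$/im;
    return $1 =~ $pattern ? 1 : 0;
}

# Made-up sample message: one real header block, then a body line
# that merely looks like a From: header.
my $message = join "\n",
    'From: Alice Example <alice@example.com>',
    'Subject: Meeting tonight',
    '',
    'From: this line is body text, not a header.';

print from_matches( $message, qr/alice\@example\.com/i ) ? "keep\n" : "skip\n";
print from_matches( $message, qr/bob\@example\.com/i )   ? "keep\n" : "skip\n";
```

Restricting the match to the header block avoids false positives from quoted text in the body that happens to begin with From:.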

Yahoo! Groups also has search capabilities that you can take advantage of with WWW::Mechanize. See "Downloading Images from Webshots" [Hack #36] for an example of searching web sites with WWW::Mechanize.

Andy Lester



Spidering Hacks
ISBN: 0596005776
EAN: 2147483647
Year: 2005
Pages: 157
