Hack 40 Saving Only POP3 Email Attachments

figs/expert.gif figs/hack40.gif

Get oodles of attachments from mailing lists and friends ? Learn how to save them to your hard drive automatically with a little Perl voodoo .

Remember those carefree days of yore, when all you got in your email were actual ASCII messages? Nah, me neither . Those days are long gone. Nowadays, your friends are sending you MP3 files of their bands, your aunt Ethel is sending you digital camera shots of her bulldog Zack, and occasionally you get the odd PDF file for work.

This hack pulls down the contents of a POP3 mailbox (which you probably have; most email accounts on the Internet are POP3 accounts) and saves the attachments to a directory. This hack is very customizable, but note that you'll need the Perl modules Net::POP3 and MIME::Parser to use it.

The Code

Save the following code as leechpop.pl :

 #!/usr/bin/perl -w  # # LeechPOP - save ONLY attachments from a POP3 mailbox, with filtering. # Part of the Leecharoo suite - for all those hard to leech places. # http://disobey.com/d/code/ or contact morbus@disobey.com. # # This code is free software; you can redistribute it and/or # modify it under the same terms as Perl itself. use strict; $++; my $VERSION = "1.0"; use Getopt::Long; my %opts; # make sure we have the modules we need, else die peacefully. eval("use Net::POP3;"); die "[err] Net::POP3 not installed.\n" if $@; eval("use MIME::Parser;"); die "[err] MIME::Parser not installed.\n" if $@; # define our command line flags (long and short versions). GetOptions(\%opts, 'servers=s',      # the POP3 server to use.                    'usernameu=s',    # the POP3 username to use.                    'passwordp=s',    # the POP3 password to use.                    'beginb=i',       # what msg number to start at. ); # at the very least, we need our login information. die "[err] POP3 server missing, use --server or -s.\n" unless $opts{server}; die "[err] Username missing, use --username or -u.\n"  unless  [RETURN]  $opts{username}; die "[err] Password missing, use --password or -p.\n"  unless  [RETURN]  $opts{password}; # try an initial connection to the server. print "-" x 76, "\n"; # merely a visual seperator. my $conn = Net::POP3->new( $opts{server} )   or die "[err] There was a problem connecting to the server.\n"; print "Connecting to POP3 server at $opts{server}.\n"; # and now the login information. $conn->login( $opts{username}, $opts{password} )   or die "[err] There was a problem logging in (.poplock? credentials?).\n"; print "Connected successfully as $opts{username}.\n"; # purdy stats about our mailbox. my ($msg_total, $mbox_size) = $conn->popstat(  ); if ($msg_total eq 0)  { print "No new emails are available.\n"; exit; } if ($msg_total eq '0E0')  { print "No new emails are available.\n"; exit; } print "You have $msg_total messages totalling ", commify($mbox_size), "k.\n"; # the list of valid file extensions. we do extensions, not # mime-types, because they're easier to understand from # an end-user perspective (no research is required). my $valid_exts = "jpg jpeg png"; my %msg_ids; # used to keep track of seen emails. my $msg_num = $opts{begin}  1; # user specified or 1. # create a subdirectory based on today's date. my ($d,$m,$y) = (localtime)[3,4,5]; $y += 1900; $m++; $d = sprintf "%02.0d", $d; $m = sprintf "%02.0d", $m; print "Using directory '$y-$m-$d' for newly downloaded files.\n"; my $savedir = "$y-$m-$d"; mkdir($savedir, 0777); # begin looping through each msg. print "-" x 76, "\n"; # merely a visual seperator. while ($msg_num <= $msg_total) {     # the size of the individual email.     my $msg_size = $conn->list($msg_num);     # get the header of the message     # so we can check for duplicates.     my $headers = $conn->top($msg_num);     # print/store the good bits.     my ($msg_subj, $msg_id);     foreach my $header (@$headers) {         # print subject line and size.         if ($header =~ /^Subject: (.*)/) {             $msg_subj = substr(, 0, 50); # trim subject down a bit.             print "Msg $msg_num / ",commify($msg_size),"k / $msg_subj...\n";         }         # save Message-ID for duplicate comparison.         elsif ($header =~ /^Message-ID: <(.*)>/i) {             $msg_id = ; $msg_ids{$msg_id}++;         }         # move on to the filtering.         elsif ($msg_subj and $msg_id) { last; }     }     # if the message size is too small, then it     # could be a reply or something of low quality.     if (defined($msg_size) and $msg_size < 40000) {         print "  Skipping - message size is smaller than our threshold.\n";         $msg_num++; next;     }     # check for matching Message-ID. If found,     # skip this message. This will help eliminate     # crossposting and duplicate downloads.     if (defined($msg_id) and $msg_ids{$msg_id} >= 2) {         print "  Skipping - we've already seen this Message-ID.\n";         $msg_num++; next;     }     # get the message to feed to MIME::Parser.     my $msg = $conn->get($msg_num);     # create a MIME::Parser object to     # extract any attachments found within.     my $parser = new MIME::Parser;     $parser->output_dir( $savedir );     my $entity = $parser->parse_data($msg);     # extract our mime parts and go through each one.     my @parts = $entity->parts;     foreach my $part (@parts) {         # determine the path to the file in question.         my $path = ($part->bodyhandle) ? $part->bodyhandle->path : undef;         # move on if it's not defined,         # else figure out the extension.         next unless $path; $path =~ /\w+\.([^.]+)$/;         my $ext = ; next unless $ext;         # we continue only if our extension is correct.         my $continue; $continue++ if $valid_exts =~ /$ext/i;         # delete the blasted thing.         unless ($valid_exts =~ /$ext/) {            print "  Removing unwanted filetype ($ext): $path\n";            unlink $path or print " > Error removing file at $path: $!.";            next; # move on to the next attachment or message.         }         # a valid file type. yummy!         print "  Keeping valid file: $path.\n";     }     # increase our counter.      $msg_num++; } # clean up and close the connection. $conn->quit; # now, jump into our savedir and remove all msg-* # files, which are message bodies saved by MIME::Parser. chdir ($savedir); opendir(SAVE, "./") or die $!; my @dir_files = grep !/^\.\.?$/, readdir(SAVE); closedir(SAVE); foreach (@dir_files) { unlink if $_ =~ /^msg-/; } # cookbook 2.17. sub commify {     my $text = reverse $_[0];     $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/,/g;     return scalar reverse $text; } 

Running the Hack

To use this hack, you'll need a minimum of three pieces of information: your email account's username, your password, and the name of the POP3 server. If, say, your username is fred , your password is fredpw , and the server is pop.example.com , you'll invoke the script on the command line like this:

 %  perl leechpop.pl -u=fred -p=fredpw -s=pop.example.com  

Or, if you prefer the wordier version:

 %  perl leechpop.pl --username=fred --password=fredpw   [RETURN]    --server=pop.example.com  

Assuming you have the proper modules installed and you've provided a correct username, password, and server, your results will look something like this:

 -------------------------------------------------------------------- Connecting to POP3 server at pop.example.com. Connected successfully as fred. You have 1040 messages totalling 229,818,136k. Using directory '2003-06-09' for newly downloaded files. Beginning downloads, starting at message 1. -------------------------------------------------------------------- ...etc... Msg 20 / 432,827k / [Budding_Rose-Art] The Flame Within.jpg...    Removing unwanted filetype (txt): 2003-06-09/msg-13522-1.txt    Keeping valid file: 2003-06-09/The Flame Within-1.jpg.  Msg 21 / 433,111k / [Graphics_TheAttic] The Flame Within.jpg...    Skipping - we've already seen this Message-ID.  Msg 22 / 168,423k / [Budding_Rose-Art] The Sad Clown.jpg...    Removing unwanted filetype (txt): 2003-06-09/msg-13522-2.txt    Keeping valid file: 2003-06-09/The Sad Clown.jpg.  Msg 23 / 558,731k / [Graphics_TheAttic] Sulamith_Wulfing_cal2002_janua...    Removing unwanted filetype (txt): 2003-06-09/msg-13522-3.txt    Keeping valid file: 2003-06-09/ma_Sulamith_02_january_white_lilies.jpg.  Msg 24 / 783,490k / [Graphics_TheAttic] Sulamith_Wulfing_cal2002_febru...    Removing unwanted filetype (txt): 2003-06-09/msg-13522-4.txt ...etc... 

The script will attempt to connect to the mailbox, and will report a successful connection if one is made. It'll then report on the number of messages in the mailbox and the total size. For each message, the program will check for any attachments, a valid extension (as specified in the line $valid_exts = " jpg jpeg png"; ), and total message size.

It is worth noting that my experience with this script has led me to set a threshold of 40,000 bytes for a message to be accepted. Any less than that, and the odds are great that the message is a reply to an earlier message, a brand-new message, or a dual-encoded message that contains both plain text and an HTML equivalent.

If all's well, the file will be saved. If either the extension or message size don't meet the parameters of the program, the file is skipped and we move on.

Hacking the Hack

There are a few ways you can improve upon this hack.

Changing the hardcoded file extensions

One of several ways to hack the script is to change the hardcoded file extensions, which indicate what types of files you want to save. You'll find the list of allowed file extensions in this line:

 my $valid_exts = "jpg jpeg png"; 

You can add whatever extensions you want to this listing; candidates for additions include .bmp and .gif (graphics formats) and .mp3 and .wav (audio formats). Adding more extensions looks like this:

 my $valid_exts = "jpg jpeg png bmp gif mp3 wav"; 

Maybe you want the option to add your own extensions from the command line when you actually run the program. In that case, add an extension option to your GetOptions lines at the top of the script:

 # define our command line flags (long and short versions). GetOptions(\%opts, 'servers=s',      # the POP3 server to use.                    'usernameu=s',    # the POP3 username to use.                    'passwordp=s',    # the POP3 password to use.                    'beginb=i',       # what msg number to start at.                    '  extensionse=s',  # what file extensions to use  . ); 

and then change the line:

 my $valid_exts = "jpg jpeg png"; 

to:

 my $valid_exts = $opts{extensions}  "jpg jpeg png"; 

Remember, the extensions to the far right of the line are the defaults; they are used if you don't specify your own. You can make this as elaborate as needed, with several extensions (as in this example), or as minimal as you like, with just one.

Shortening or eliminating the subject line

You might decide that you're not interested in having a long subject line appear while the program is running; by altering the following lines of code, you can shorten the subject line to just a few letters , or remove it completely:

 if ($header =~ /^Subject: (.*)/) {     $msg_subj = substr(, 0, 50); # trim subject down a bit.     print "Msg $msg_num / ",commify($msg_size),"k / $msg_subj...\n"; 

If you just want to change the length of the subject (defaulting to 50 characters ) then focus on substr($1, 0, 50) . Change 50 to whatever you like. If you don't want a subject line to appear at all, remove the line $msg_subj = substr($1, 0, 50); and the line portion ,"k / $msg_subj...\n ".

Saving attachments to the current directory

The way the hack is set up now, a subdirectory is always created beneath the current one. But if you're in a hurry, or if you want to keep all your attachments in the same place, you can remove the directory completely by changing this line:

 my $savedir = "$y-$m-$d"; mkdir($savedir, 0777); 

to this:

 my $savedir = "."; 

This will save your attachments to the current location. You can also set an option to define the final resting place from the command line, which means going back to those GetOptions lines:

 # define our command-line flags (long and short versions). GetOptions(\%opts, 'servers=s',      # the POP3 server to use.                    'usernameu=s',    # the POP3 username to use.                    'passwordp=s',    # the POP3 password to use.                    'beginb=i',       # what msg number to start at.                    '  savedird=s',     # what directory to save to  .  ); 

and changing the $savedir line to this:

 my $savedir = $opts{savedir}  "$y-$m-$d"; mkdir($savedir, 0777); 
Specifying the size of saved messages

The hack automatically rejects messages smaller than 40 KB, thinking that a message of 40 KB is not going to have much in the way of interesting attachments. But you may want to increase or decrease that number. How? Back to the GetOptions again. Add this line to the end:

 'oksizeo=i'       # minimum file size. 

Then, add these two lines shortly after the GetOptions lines:

 # what size is okay to download?  my $oksize = $opts{oksize}  40000; 

There's one more thing to do. Change this line:

 if (defined($msg_size) and $msg_size < 40000) { 

to:

 if (defined($msg_size) and $msg_size < $oksize) { 

When you run this hack, you might discover that you are saving tons of small image files: icons, banners, or other detritus. In that case, you might also want to check the width and height of the images before you download them. If that sounds dandy, you'll need to use Image::Size , so add the line use Image::Size near the other use lines at the top of the script. Also, insert a few more lines in the code:

 # check image size. too small? delete.  my ($x, $y) = imgsize($path); unless (not defined $x or $x > 100) {     print "  Removing small image size (less than $x pixels tall): $path\n";     unlink $path or print " > Error removing file at $path: $!.";     next; # move on to the next attachment or message. } 

Image::Size spits out the image height and width, which allows you to capture them into variables. After that, it's simple to compare the size of the variables to see if they're too large or too small. Applied here, it's a simple conditional; it checks to make sure that each graphic file is more than 100 pixels high. If it's not, it's deleted from your directory. You can change the minimum height to whatever you like, base the conditional on both $x and $y , or make the minimum a command-line variable using the GetOptions , as we've done with some of our other tweakeries.



Spidering Hacks
Spidering Hacks
ISBN: 0596005776
EAN: 2147483647
Year: 2005
Pages: 157

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net