Hack 81 Keeping Tabs on the Web via Email


If you find yourself checking your email more often than you cruise the Web, you might appreciate a little Perl work to bring the Web to your mailbox.

If you're an info-junkie, you have a growing list of sites that you visit daily, maybe hourly. But sometimes, no matter how many times you refresh the page, some sites just don't update soon enough. It would be better if there were a way to be notified when a site changes, so that you could spend your browsing time better.

Some sites offer a notification service like this, and others offer syndication feeds that programs can monitor, but many sites provide neither. For those, you're going to need your own robot.

Planning for Change

For this hack, we'll choose email as the method of notification, since that seems to be the simplest yet most flexible. We can use some common Perl modules to handle email and download web pages. This just leaves us with figuring out how to determine whether a web page has changed.

Actually, it would be more useful if we could figure out how much a web page has changed. Many web pages change constantly, since some might display the current time, others might show updated comment counts on news stories, and others might include a random quote on the page or feature different headlines for each request. If we're just interested in major differences, such as a brand new front-page story on a news site, we'd like some relative measure.

While there are likely smarter ways of doing this, one quick way is to use the GNU diff utility to compare downloads of a web page over time. Further, it would be useful to compare only the text of pages, not the HTML, since we're more interested in content than in layout or markup changes. For this, we can employ the venerable text-based web browser lynx. lynx is commonly found in many Linux distributions and is easily acquired for most other Unix operating systems. This browser already formats web pages for plain-text display and, with a command-line option, can redirect that text to a file.

So, given lynx and diff, we can boil web pages down to their text content and compare changes in that content. As an added benefit, we can include the text version of web pages in the emails we send as an alternative to HTML.
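The idea can be sketched at the shell prompt before writing any Perl. This is a minimal sketch using two hand-made snapshot files (old.txt and new.txt are hypothetical stand-ins for what lynx -dump would produce, so the example runs even without lynx installed):

```shell
# In the real script, `lynx -dump URL > new.txt` produces the text snapshot;
# here we fake two snapshots by hand.
printf 'headline one\nheadline two\n' > old.txt
printf 'headline one\nbreaking news\nheadline two\n' > new.txt

# The crude change score is just the number of lines diff emits:
# one change marker plus one added line in this case.
change=$(diff old.txt new.txt | wc -l | tr -d ' ')
echo "change score: $change"
```

If the score exceeds a per-site threshold, we consider the page meaningfully changed; that is exactly what the script automates below.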

With all this in mind, let's start our script:

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;
 use HTTP::Status;
 use MIME::Lite;

 # Locate the utility programs needed
 our $lynx = '/usr/bin/lynx';
 our $diff = '/usr/bin/diff';

 # Define a location to store datafiles,
 # and an address for notification
 my $data_path = "$ENV{HOME}/.pagediff";
 my $email = 'your_email@here.com';

So far, we've set up some safety features and loaded up our tool modules. We've also located our utility programs, given the script a place to store data, and chosen an email address for notifications. Next, let's make a list of sites to visit:

 my %sites =
   (
    'slashdot'     => ['http://slashdot.org/index.html', 500],
    'penny_arcade' => ['http://www.penny-arcade.com/view.php3', 20],
   );

This is a hash that consists of nicknames for sites and, for each site, a list that consists of a URL and a change threshold. This number is very fuzzy and will require some tweaking to get the right frequency of notification. Higher numbers require more changes before an email goes out. We'll see how this works in just a minute.

Next, let's handle each of our favorite sites:

 for my $site (keys %sites) {
   my ($url, $threshold) = @{$sites{$site}};

   # Build filenames for storing the HTML content, text
   # content, as well as content from the previous notification.
   my $html_fn = "$data_path/$site.html";
   my $new_fn  = "$data_path/$site.txt";
   my $old_fn  = "$data_path/$site-old.txt";

   # Download a new copy of the HTML.
   getstore($url, $html_fn);

   # Get text content from the new HTML.
   html_to_text($html_fn, $new_fn);

   # Check out by how much the page has changed since last notification.
   my $change = measure_change($new_fn, $old_fn);

   # If the page has changed enough,
   # send off a notification.
   if ($change > $threshold) {
     send_change_notification
       ($email,
        {
         site      => $site,
         url       => $url,
         change    => $change,
         threshold => $threshold,
         html_fn   => $html_fn,
         new_fn    => $new_fn,
         old_fn    => $old_fn
        }
       );

     # Rotate the old text content for the new.
     unlink $old_fn if (-e $old_fn);
     rename $new_fn, $old_fn;
   }
 }

The main loop of our script is quite simple. For each site, it does the following:

  • Downloads a new copy of the web page.

  • Saves a copy of the page's text contents.

  • Measures the amount of change detected between this latest download and the content saved from the last time an email was sent.

  • If the change is greater than the threshold for this site, sends an email summarizing the change and rotates out the previously saved content for the new download.

Calling In Outside Help

Now that we have the backbone of the script started, let's work on the functions that the script uses. In particular, these first functions will make use of our external tools, diff and lynx:

 sub html_to_text {
   my ($html_fn, $txt_fn) = @_;
   open(FOUT, ">$txt_fn");
   print FOUT `$lynx -dump $html_fn`;
   close(FOUT);
 }

This function, by way of lynx, extracts the text content from one HTML file and writes it to another. It simply executes the lynx browser with the -dump command-line option and saves that output.

Next, let's use diff to examine changes between text files:

 sub get_changes {
   my ($fn1, $fn2) = @_;
   return `$diff $fn1 $fn2`;
 }

Again, this simple function executes the diff program on two files and returns the output of that program. Now, let's measure the amount of change between two files using this function:

 sub measure_change {
   my ($fn1, $fn2) = @_;
   return 0 if ( (!-e $fn1) || (!-e $fn2) );
   my @lines = split(/\n/, get_changes($fn1, $fn2));
   return scalar(@lines);
 }

If one of the files to compare doesn't exist, this function reports no change. But if both files exist, the function calls get_changes on them and counts the number of lines of output returned. This is a dirty way to measure change, but it does work: the more two versions of a file differ, the more lines of output diff will produce. This measure says nothing about the nature of the changes themselves, but it can still be effective if you supply a little human judgment and fudging.

Keep this in mind when you adjust the change thresholds defined at the beginning of this script. You might need to adjust things a few times per site to figure out how much change is important for a particular site. Compared with the complexity of more intelligent means of change detection, this method seems best for a quick script.

Send Out the News

Now that all the tools for extracting content and measuring change are working, we need to work out the payoff for all of this: sending out change notification messages. With the MIME::Lite Perl module (http://search.cpan.org/author/YVES/MIME-Lite/), we can send multipart email messages with both HTML and plain text sections. So, let's construct and send an email message that includes the original HTML of the updated web page, the text content, and a summary of changes found since the last update.

First, create the empty email and set up the basic headers:

 sub send_change_notification {
   my ($email, $vars) = @_;

   # Start constructing the email message
   my $msg = MIME::Lite->new
     (
      Subject => "$vars->{site} has changed. ".
        "($vars->{change} > $vars->{threshold})",
      To      => $email,
      Type    => 'multipart/alternative',
     );

   # Create a separator line of '='
   my $sep = ("=" x 75);

Note that we indicate how much the page has changed with respect to the threshold in the subject, and we create a separator line for formatting the text email portion of the message.

Next, let's build the text itself:

   # Start the text part of email
   # by dumping out the page text.
   my $out = '';
   $out .= "The page at $vars->{url} has changed. ";
   $out .= "($vars->{change} > $vars->{threshold})\n\n";
   $out .= "\n$sep\nNew page text follows:\n$sep\n";
   open(FIN, $vars->{new_fn});
   local $/; undef $/;
   $out .= <FIN>;
   close(FIN);

   # Follow with a diff summary of page changes.
   $out .= "$sep\nSummary of changes follows:\n$sep\n\n";
   $out .= get_changes($vars->{new_fn}, $vars->{old_fn})."\n";

Here, we dump the text contents of the changed web page, courtesy of lynx, followed by the output of the diff utility. It's a little bit of Perl obscura, but we do some finessing of Perl's file handling to simplify reading the whole text file into a variable. The variable $/ is Perl's input record separator, normally set to some sort of carriage return or linefeed combination. By using undef to clear this setting, Perl considers the entire contents of the file as one long record without line endings and slurps it all down into the variable in a single read.

Now that we have the text of the email, let's add it to our message:

   # Add the text part to the email.
   my $part1 = MIME::Lite->new
     (
      Type => 'text/plain',
      Data => $out
     );
   $msg->attach($part1);

This bit of code creates a message part containing our text, gives it a header describing its contents as plain text, and adds it to the email message. Having taken care of the text, let's add the HTML part of the email:

   # Create and add the HTML part of the email, making sure to add a
   # header indicating the base URL used for relative URLs.
   my $part2 = MIME::Lite->new
     (
      Type => 'text/html',
      Path => $vars->{html_fn}
     );
   $part2->attr('Content-Location' => $vars->{url});
   $msg->attach($part2);

   # Send off the email
   $msg->send();
 }

This code creates an HTML part for our email, including the HTML content we last downloaded and setting the appropriate header to describe it as HTML. We also define another header that lets mail readers know the base URL for the HTML in order to resolve relative URLs. We set this to the original URL of the page so that images and links resolve properly.

Finally, we send off the message.

Hacking the Hack

You'll probably want to use this script in conjunction with cron [Hack #90] or some other scheduler, to check for changes periodically. Just be polite and don't run it too often; checking every hour or so should be often enough for most sites.
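An hourly cron entry might look like the following (a sketch only; the script path ~/bin/pagediff.pl is an assumption, so substitute wherever you saved the script):

```shell
# Run the page-watcher at the top of every hour.
# min hour day month weekday  command
0 * * * * /usr/bin/perl $HOME/bin/pagediff.pl
```

Add it with crontab -e, and remember that cron runs with a minimal environment, so using full paths (as the script already does for lynx and diff) avoids surprises.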

As for the script itself, we're cheating a little, since external tools do most of the work. But when we're writing hacks, it's best to be lazy and take advantage of other smart people's work as much as possible. In working out the amount of change between notifications, we're pretty inexact and fuzzy, but the method works. An exercise for the reader might be to find better means of measuring change, possibly methods that can also tell what kind of changes happened, to help you make better decisions about when to send notifications.

Also note that, though this hack uses both the diff and lynx programs directly, there are more cross-platform and pure Perl solutions for finding differences between files, such as the Text::Diff (http://search.cpan.org/author/RBS/Text-Diff/) or HTML::Diff (http://search.cpan.org/author/EZRAKILTY/html-diff/) modules on CPAN. And, with a bit of work, use of lynx could be replaced as well.

l.m.orchard



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157