10.1. Apache Logfiles

Like many people, I have a weblog and a considerable amount of other online writing sitting on my own web server. Once I've written something and it's been indexed by the search engines, people tend to find it, and I feel a vague social responsibility to keep it up there. But things being as they are, and me tinkering as I do, documents go missing. Keeping track of the "404 Page Not Found" errors spat out by your server is, therefore, a good thing to do.

This script goes through a standard Apache logfile and produces an RSS 2.0 feed of the pages other people have found to be missing.

10.1.1. Walking Through the Code

Let's start with the usual Perl good form of use strict; and use warnings; and then load up the marvellous Date::Manip module. This is perhaps a little overkill for its use here, but it does allow for some extremely simple and readable code. This is a CGI application, so we need the CGI module, and we're producing RSS, so XML::RSS is naturally required too.

use strict;
use warnings;
use Date::Manip;
use XML::RSS;
use CGI qw(:standard);

First off, let's set up the feed. Because this is the simplest possible form of RSS (just a list, really), it is a perfect fit for RSS 2.0. Then we give it a nice title, link, and description, as per the specification:

my $rss = new XML::RSS( version => '2.0' );
$rss->channel(
    title       => "Missing Files",
    link        => "http://www.example.org",
    description => "Files found to be missing from my server"
);

On my host at least, logfiles are split daily and named after the date. Since I'm fixing the missing files as they appear, I only want to parse yesterday's file. So let's use Date::Manip's functions to return yesterday's date, then create the logfile path, and open a filehandle to it. You will need to change this line to reflect your own setup:

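# Yesterday's date as YYYYMMDD, matching the logfile naming scheme
my $yesterdays_date = &UnixDate( "yesterday", "%Y%m%d" );
my $logfile_file = "/web/logs/ben/benhammersley.com/$yesterdays_date.log";
# Stop with a clear message if the logfile cannot be read
open( LOGFILE, "<", $logfile_file ) or die "Cannot open $logfile_file: $!";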
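Since Date::Manip is admittedly a little overkill here, the same date stamp can be sketched with the core POSIX module instead. This is only an illustration: the helper names yesterday_stamp and logfile_for and the /web/logs/example directory are made up, and subtracting 24 hours is an approximation that can pick the wrong day if the script runs within an hour of midnight around a daylight-saving change.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: the local date 24 hours before $now, as YYYYMMDD.
# $now is an epoch timestamp and defaults to the current time.
sub yesterday_stamp {
    my ($now) = @_;
    $now = time unless defined $now;
    return strftime( "%Y%m%d", localtime( $now - 24 * 60 * 60 ) );
}

# Hypothetical helper: build a logfile path from a directory and a date stamp.
sub logfile_for {
    my ( $log_dir, $stamp ) = @_;
    return "$log_dir/$stamp.log";
}

# /web/logs/example is a placeholder; substitute your own log directory.
print logfile_for( "/web/logs/example", yesterday_stamp() ), "\n";
```

Passing the timestamp in as an argument keeps the helper easy to check against a known date.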

Now, go into a loop, taking the logfile line by line and using the mother of regular expressions to split each line into its constituent parts. This statement is well worth taking note of: by changing what you do with the captured fields, you can convert this script to monitor your Apache logfiles for just about anything.

while (<LOGFILE>) {
    # Split one logfile line into its fields
    my (
        $host,      $ident_user, $auth_user, $date,     $time,
        $time_zone, $method,     $url,       $protocol, $status,
        $bytes,     $referer,    $agent
      )
      = /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/;

But today, we're interested in 404 errors. So, the script uses the XML::RSS module to add items for every error found:

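    # Lines the pattern did not match leave $status undefined, so fall back
    # to a dummy code that can never equal 404
    my $cleaned_status = $status || "111";
    if ( $cleaned_status == "404" ) {
        $rss->add_item(
            title       => "$url",
            link        => "$url",
            description => "$referer"
        );
    }
}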
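To make the pattern in the loop easier to adapt, here is a sketch of the same regular expression written with the /x modifier, so that each capture can carry a comment. The helper names parse_log_line and entries_with_status are hypothetical, and the sample log lines are invented for illustration rather than taken from a real server.

```perl
use strict;
use warnings;

# Same pattern as in the loop above; under /x a literal space is written [ ].
my $log_line_re = qr{
    ^ (\S+)                  # host
    [ ] (\S+)                # ident user
    [ ] (\S+)                # authenticated user
    [ ] \[ ([^:]+)           # date
    : (\d+:\d+:\d+)          # time
    [ ] ([^\]]+) \]          # time zone
    [ ] " (\S+)              # method
    [ ] (.+?)                # URL
    [ ] (\S+) "              # protocol
    [ ] (\S+)                # status
    [ ] (\S+)                # bytes
    [ ] " ([^"]+) "          # referer
    [ ] " ([^"]+) " $        # user agent
}x;

my @field_names = qw(
    host ident_user auth_user date time time_zone
    method url protocol status bytes referer agent
);

# Return a hash reference of named fields, or nothing if the line does not match.
sub parse_log_line {
    my ($line) = @_;
    my @values = $line =~ $log_line_re or return;
    my %entry;
    @entry{@field_names} = @values;
    return \%entry;
}

# Return the parsed entries whose status code equals $wanted_status.
sub entries_with_status {
    my ( $wanted_status, @lines ) = @_;
    my @found;
    for my $line (@lines) {
        my $entry = parse_log_line($line) or next;
        push @found, $entry if $entry->{status} eq $wanted_status;
    }
    return @found;
}

# Invented sample data in the same logfile layout.
my @sample_lines = (
    '192.0.2.10 - - [12/Mar/2005:06:25:24 +0000] "GET /archives/old-post.html HTTP/1.1" 404 208 "http://www.example.com/links.html" "Mozilla/5.0"',
    '192.0.2.11 - - [12/Mar/2005:06:26:02 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    'not a log line at all',
);

for my $entry ( entries_with_status( "404", @sample_lines ) ) {
    print "$entry->{url} (linked from $entry->{referer})\n";
}
```

Passing a different status code, such as "500", is one way to point the same parsing at other kinds of errors.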

Then all that's left to do is to close the filehandle, print out the correct MIME type for RSS, and print out the feed we've built:

close(LOGFILE);
# application/rss+xml is the media type conventionally used for RSS feeds
print header('application/rss+xml');
print $rss->as_string;
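What a CGI script sends back is simply a header block, a blank line, and the body. As a rough sketch, that response can be assembled without CGI.pm, which is no longer bundled with the core distribution in newer Perl releases. The cgi_response helper is hypothetical, and the body here is a hand-written placeholder standing in for the output of $rss->as_string.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend a minimal CGI header block to a response body.
sub cgi_response {
    my ( $media_type, $body ) = @_;
    return "Content-Type: $media_type\r\n\r\n" . $body;
}

# Placeholder feed; the real script would use $rss->as_string here.
my $feed_xml =
      qq{<?xml version="1.0"?>\n}
    . qq{<rss version="2.0"><channel>}
    . qq{<title>Missing Files</title>}
    . qq{<link>http://www.example.org</link>}
    . qq{<description>Files found to be missing from my server</description>}
    . qq{</channel></rss>\n};

print cgi_response( "application/rss+xml", $feed_xml );
```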

10.1.2. The Entire Listing

#!/usr/bin/perl

use strict;
use warnings;
use Date::Manip;
use XML::RSS;
use CGI qw(:standard);

my $rss = new XML::RSS( version => '2.0' );
$rss->channel(
    title       => "Missing Files",
    link        => "http://www.example.org",
    description => "Files found to be missing from my server"
);

# Yesterday's date as YYYYMMDD, matching the logfile naming scheme
my $yesterdays_date = &UnixDate( "yesterday", "%Y%m%d" );
my $logfile_file = "/web/logs/ben/benhammersley.com/$yesterdays_date.log";
open( LOGFILE, "<", $logfile_file ) or die "Cannot open $logfile_file: $!";

while (<LOGFILE>) {
    # Split one logfile line into its fields
    my (
        $host,      $ident_user, $auth_user, $date,     $time,
        $time_zone, $method,     $url,       $protocol, $status,
        $bytes,     $referer,    $agent
      )
      = /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/;

    # Unmatched lines leave $status undefined; fall back to a dummy code
    my $cleaned_status = $status || "111";
    if ( $cleaned_status == "404" ) {
        $rss->add_item(
            title       => "$url",
            link        => "$url",
            description => "$referer"
        );
    }
}

close(LOGFILE);
print header('application/rss+xml');
print $rss->as_string;



    Developing Feeds with RSS and Atom
    ISBN: 0596008813
    Year: 2003
    Pages: 118