11.7. Service Monitoring

This section presents simple scripts that can help you monitor services like mail, DNS, and web content. Earlier we showed how you can use the netcat tool to verify that, for example, your SMTP server is up and responding. That is useful, but sometimes you need finer control. For instance, you may need to know when a service:

  • Has been unreachable X number of times

  • Has been unreachable X number of times in timeframe Y

  • Is not meeting your company's Service Level Agreement (SLA). For example, your SLA may state that your SMTP or POP3 services will take no longer than 500 milliseconds (half a second) to service requests.

In each of these instances, it would be nice to know ahead of time that things may not be working properly in your environment.
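The second case in the list above (X failures within timeframe Y) can be sketched as a sliding window of failure timestamps. This is a minimal illustration only; `make_window_counter` is a hypothetical helper and is not part of the MyStats.pm module presented in this section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of MyStats.pm): remember the time of each
# failure and report when $max_failures of them fall within $window seconds.
sub make_window_counter {
    my ($max_failures, $window) = @_;
    my @failures;    # timestamps of recent failures, oldest first
    return sub {
        my ($now) = @_;
        push @failures, $now;
        # Discard failures that have aged out of the window.
        shift @failures while @failures && $failures[0] <= $now - $window;
        return @failures >= $max_failures ? 1 : 0;
    };
}

# Alert on 3 failures within any 60-second span.
my $record_failure = make_window_counter(3, 60);
for my $t (0, 100, 110, 120) {
    my $alert = $record_failure->($t);
    print "failure at t=$t: ", ($alert ? "ALERT" : "ok"), "\n";
}
```

Timestamps are passed in explicitly so the logic can be exercised without waiting; a real monitor would pass `time()` each time a check fails.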

The examples presented in this section use Perl modules to interact with the services directly. By using Perl, you have a great deal of control over how the services are monitored and how and when they send traps. What follows here is a Perl module that all the service monitors in this section use to track things like SLA information:

    #
    # File: MyStats.pm
    #

    package MyStats;
    use strict;
    use warnings;
    use Class::Struct;
    use Exporter;
    use SNMP_util;
    our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS, $VERSION, $count, $sla,
         %watchers);

    $VERSION = 1.00;
    @ISA = qw(Exporter);

    #
    # There are two scenarios we want to track and alert on:
    # 1. Some resource has been down a certain number of times
    # 2. Service Level Agreements (SLAs). We are concerned with making sure
    # services respond and operate within limits set forth in our SLA.
    #

    struct Count => {
       name         => '$',
       count        => '$',
       currentCount => '$',
       message      => '$',
    };

    struct SLA => {
       name                => '$',
       responseTime        => '$',
       count               => '$',
       currentResponseTime => '$',
       currentCount        => '$',
       message             => '$',
    };

    sub new {
       my $classname  = shift;
       my $self       = {};
       my %arg  = @_;
       bless($self, $classname);
       return $self;
    }

    sub removeWatcher{
       my $classname  = shift;
       my ($name) = @_;
       if(exists($watchers{$name})){
          delete($watchers{$name});
       }
    }

    sub thisExists{
       my $classname  = shift;
       my ($name) = @_;
       return exists($watchers{$name});
    }

    sub setCountWatcher{
       my $classname = shift;
       my ($name,$c,$message) = @_;
       $count = Count->new( );
       $count->name($name);
       $count->count($c);
       $count->currentCount(0);   # start from a defined value
       $count->message($message);
       $watchers{$name} = $count;
    }

    sub incrCountWatcher{
       my $classname = shift;
       my ($name) = @_;
       if(exists($watchers{$name})){
          # Use the Class::Struct accessors rather than reaching into the hash
          my $count = $watchers{$name}->currentCount;
          $count++;
          $watchers{$name}->currentCount($count);
       }
    }

    sub decrCountWatcher{
       my $classname = shift;
       my ($name) = @_;
       if(exists($watchers{$name})){
          my $count = $watchers{$name}->currentCount;
          if($count > 0){
             $count--;
             $watchers{$name}->currentCount($count);
          }
       }
    }

    sub setSLA {
       my $classname = shift;
       my ($name,$count,$responseTime,$message) = @_;
       $sla = SLA->new( );
       $sla->name($name);
       $sla->count($count);
       $sla->responseTime(sprintf("%.3f",$responseTime));
       $sla->currentCount(0);
       $sla->currentResponseTime(0);
       $sla->message($message);
       $watchers{$name} = $sla;
    }

    sub updateSLA {
       my $classname = shift;
       my ($name,$responseTime) = @_;
       if(exists($watchers{$name})){
          my $watcher = $watchers{$name};
          if($responseTime >= $watcher->responseTime){
             # Too slow: record the time and count one more violation
             $watcher->currentResponseTime($responseTime);
             $watcher->currentCount($watcher->currentCount + 1);
          }elsif($watcher->currentCount > 0){
             # Fast enough: forgive one earlier violation
             $watcher->currentCount($watcher->currentCount - 1);
             $watcher->currentResponseTime($responseTime);
          }
       }
    }

    sub sendAlert{
       my $classname = shift;
       my $host = "public\@localhost:162";
       my $agent = "localhost";
       my $eid = ".1.3.6.1.4.1.2789";
       my $trapId = 6;
       my $specificId = 1300;
       my $oid = ".1.3.6.1.4.1.2789.1247.1";
       foreach my $key (sort keys %watchers){
          my $watcher = $watchers{$key};
          if($watcher->isa('Count')){
             if($watcher->currentCount >= $watcher->count){
                my $message = $watcher->message;
                print "Sending Count Trap: $message\n";
                snmptrap($host, $eid, $agent, $trapId, $specificId, $oid,
                         "string", $message);
                $watcher->currentCount(0);
             }
          }
          if($watcher->isa('SLA')){
             if($watcher->currentCount >= $watcher->count &&
                   $watcher->currentResponseTime > $watcher->responseTime){
                my $message = $watcher->message;
                print "Sending SLA Trap: $message\n";
                snmptrap($host, $eid, $agent, $trapId, $specificId, $oid,
                         "string", $message);
                $watcher->currentCount(0);
             }
          }
       }
    }

    1;

The user of this module can create two types of watchers:


Simple counter

You establish a threshold, and each time the item you are monitoring fails (for example, the monitor is unable to connect to your service), the monitor increments the count kept by MyStats.pm; each success decrements it. When the count reaches the threshold, an SNMP trap is sent.


SLA

The SLA object allows the user to set a duration and count. For example, if connecting to your SMTP server takes longer than one second (duration) and this happens 10 times (count), send an SNMP trap.

MyStats.pm is a basic implementation, but it is functional as is. Its use will become clearer when we present actual service monitoring scripts.
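To make the SLA watcher's bookkeeping easier to follow, here is a stripped-down sketch of the same idea: a slow response adds a strike, a fast response removes one, and an alert is due once the strikes reach the configured count. It deliberately uses a plain hash and a return value in place of the Class::Struct objects and SNMP traps in MyStats.pm; the names `update_sla` and `alert_due` are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;

# Stand-in for an SLA watcher: a plain hash instead of the Class::Struct
# object MyStats.pm uses, and a return value instead of an SNMP trap.
my %watcher = (limit => 1.0, count => 3, breaches => 0, last => 0);

sub update_sla {
    my ($w, $response_time) = @_;
    if ($response_time >= $w->{limit}) {
        $w->{breaches}++;                 # slow response: one more strike
        $w->{last} = $response_time;
    } elsif ($w->{breaches} > 0) {
        $w->{breaches}--;                 # fast response: forgive one strike
        $w->{last} = $response_time;
    }
}

sub alert_due {
    my ($w) = @_;
    return 0 unless $w->{breaches} >= $w->{count} && $w->{last} > $w->{limit};
    $w->{breaches} = 0;                   # reset after alerting
    return 1;
}

for my $rt (1.4, 0.2, 1.6, 1.2, 1.9) {
    update_sla(\%watcher, $rt);
    my $breaches = $watcher{breaches};    # copy before alert_due resets it
    my $status   = alert_due(\%watcher) ? "ALERT" : "ok";
    print "response ${rt}s: breaches=$breaches $status\n";
}
```

Because fast responses subtract strikes, an isolated slow request is forgiven, while a sustained slowdown accumulates strikes until the alert fires.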

When monitoring customer-visible services, keep in mind the following:

  • Deploy the monitoring scripts at a point in the network from which traffic to the service follows a path similar to the one your customers' traffic takes. This is rarely possible, but it's worth mentioning.

  • If this sort of placement isn't possible and you have a network that is outside the particular server farm where your services are running, try to at least have the monitoring traffic go through the same router or firewall that the customer would use.

  • If this is still not possible, monitoring your services from the same LAN segment or switch is still better than nothing!

Now let's look at three service monitoring scripts.

11.7.1. Web Content

Many people monitor the hardware their web server runs on without actually monitoring the web content itself. The scripts in this section use the Library for WWW in Perl (LWP) module to interact with a web server's content. LWP is bundled with many Perl distributions; if yours does not include it, you can install it (the libwww-perl package) from http://www.cpan.org. We will present two scripts that perform the following monitoring tasks:

  • Monitor content retrieval from a server

  • Monitor a web site for dead links

The first example is in the same vein as the other service monitors. The second monitor, however, shows how easy it is to validate a web site's links. This can come in handy when you go live with a total redesign of your corporate web site. If the link to investor information is dead, wrong, or just not working, you will want to know about it pronto. Believe it or not, we have seen this happen over and over again.

This script attempts to get the main page from a URL. It detects whether the connection can be made to the web server and whether the request takes an inordinately long time.

    162  #!/usr/bin/perl
    163  #
    164  # File: web-load.pl
    165  #
    166  use LWP::Simple;
    167  use MyStats;
    168  use Time::HiRes qw(time);   # subsecond resolution for the timer
    169  my $URL = "http://www.oreilly.com";
    170  my $count = 3;
    171  my $loadTime = 1;
    172  my $duration = 3;
    173  my $name1 = "URL Watcher1";
    174  my $name2 = "URL Watcher2";
    175  my $message1 = "$URL has been down $count times";
    176  my $message2 = "$URL took greater than $loadTime second(s) to load. The problem persisted for over $duration seconds";
    177
    178  my $stats = MyStats->new( );
    179  $stats->setCountWatcher($name1,$count,$message1);
    180  $stats->setSLA($name2,$duration,$loadTime,$message2);
    181
    182  #
    183  # Example taken from O'Reilly's Perl Cookbook 2nd edition
    184  #
    185  my $start = 0;
    186  my $stop = 0;
    187  my $sleep = 1;
    188  while(1){
    189     $start = time( );
    190     my $content = get($URL);
    191     if(!defined($content)) {
    192        # Couldn't get content at all!
    193        $stats->incrCountWatcher($name1);
    194     }else{
    195        $stats->decrCountWatcher($name1);
    196        $stop = time( );
    197        my $total = sprintf("%.3f",($stop-$start));
    198
    199        $stats->updateSLA($name2,$total);
    200     }
    201     $stats->sendAlert( );
    202     print "Sleeping...\n";
    203     sleep($sleep);
    204  }

Here are some pertinent points about this script. Note that all the scripts in this section follow the same form when it comes to collecting SLA information and sending traps.


Line 169

This is the base URL you wish to monitor.


Line 170

This value is used to set the count used for simple counting.


Line 171

The $loadTime variable is used for the SLA watcher. This value, expressed in seconds, is the response-time threshold: any request that takes $loadTime seconds or longer is counted against the SLA.


Line 172

$duration plays the same role as $count, but for the SLA watcher: it is the number of slow responses tolerated before a trap is sent.


Lines 173 and 174

These two lines are labels used to uniquely identify the two watchers that this monitor will use. You can have any number of watchers in a monitor, as long as each one has a unique name.


Lines 175 and 176

When a trap is sent for a given watcher, these messages will be the guts of the trap. These message strings are meant to be as informative as possible so that someone can begin to resolve the problem.


Lines 178, 179, and 180

Line 178 creates a new MyStats instance. Line 179 creates a count watcher while 180 creates an SLA watcher.


Line 188

We enter a loop and continually monitor the service.


Line 189

We start a timer.


Line 190

Here we do the work of getting the URL content.


Lines 191 and 193

If the content isn't defined, we failed to get the content. Line 193 bumps up the counter for the counter watcher.


Lines 195 through 199

Since we were able to get content from the server, we decrement the counter on Line 195. We stop the timer on Line 196 by setting a variable to the current time. Line 197 calculates how long it took to get the content, to three decimal places. This allows $loadTime to be set to subsecond values (e.g., 0.1 for one-tenth of a second), provided time( ) has subsecond resolution; Perl's built-in time( ) returns whole seconds, so this requires the version exported by Time::HiRes. Line 199 updates the SLA monitor.


Line 201

The sendAlert subroutine handles checking to see if any watchers need to have traps sent on their behalf. See the MyStats.pm code presented at the beginning of this section to see how sendAlert does its thing.


Line 203

The script sleeps, wakes up, and repeats.

That's about it. It really is quite simple but can be very effective.
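The start/stop timing pattern used by this monitor can be factored into a small helper. The sketch below is an assumption about how one might do that, not code from the monitor itself: `timed` is a hypothetical function, and the anonymous sub stands in for the call to get($URL) so the example runs without a network connection.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper: run a code reference and return its result together
# with the elapsed wall-clock time in seconds, rounded to milliseconds.
sub timed {
    my ($work) = @_;
    my $start   = time();
    my $result  = $work->();
    my $elapsed = sprintf("%.3f", time() - $start);
    return ($result, $elapsed);
}

# Stand-in for get($URL): pretend the fetch takes about a quarter second.
my ($content, $elapsed) = timed(sub { sleep(0.25); return "<html></html>" });
print defined $content ? "fetched in ${elapsed}s\n" : "fetch failed\n";
```

Importing time from Time::HiRes is what makes the three decimal places meaningful; with the built-in time( ) the result would always be a whole number of seconds.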

The following script can find bad links. It starts at a given URL and works its way through the href tags:

    #!/usr/bin/perl
    #
    # File: web-badlinks.pl
    #
    use strict;
    use warnings;
    use HTML::LinkExtor;
    use LWP::Simple;
    use MyStats;

    my $URL = "http://www.oreilly.com";
    my $count = 3;
    my $name1 = "URL Watcher1";
    my $message1 = "$URL has been down $count times";
    my $message2 = "This URL is BAD: ";

    my $stats = MyStats->new( );
    $stats->setCountWatcher($name1,$count,$message1);

    #
    # Place links in here that you do not want to check
    #
    my %exemptLinks = (
       # http://www.oreilly.com/partners/index.php will not get processed.
       "$URL/partners/index.php" => 1
    );

    #
    # Parts of this Example taken from O'Reilly's Perl Cookbook,
    # 2nd edition
    #
    my $sleep = 1;
    while(1){
       # Passing $URL as the base makes the parser return absolute URI objects
       my $parser = HTML::LinkExtor->new(undef, $URL);
       my $html = get($URL);
       if(!defined($html)){
          # Couldn't get html. Server may be down
          $stats->incrCountWatcher($name1);
       }else{
          $stats->decrCountWatcher($name1);
          $parser->parse($html);
          $parser->eof;
          my @links = $parser->links;
          foreach my $linkarray (@links) {
             my @element  = @$linkarray;
             my $elt_type = shift @element;
             while (@element) {
                my ($attr_name,$attr_value) = splice(@element, 0, 2);
                # Skip anything listed in %exemptLinks
                next if $exemptLinks{$attr_value};
                my $scheme = $attr_value->scheme;
                if (defined($scheme) && $scheme =~ /\b(ftp|https?|file)\b/) {
                   if(!head($attr_value)){
                      # First failure creates a watcher for this link;
                      # later failures increment it
                      if(!$stats->thisExists($attr_value)){
                         my $m = $message2.$attr_value;
                         $stats->setCountWatcher($attr_value,$count,$m);
                      }else{
                         $stats->incrCountWatcher($attr_value);
                      }
                   }
                }
             }
          }
       }
       $stats->sendAlert( );
       print "Sleeping..\n";
       sleep($sleep);
    }

We're not going to go into detail about how this script works. The watchers are set up in a similar fashion to the previous script.

One thing to note is that this script can actually produce false positives. When it comes across a link that requires login credentials, it may wrongly assume the link is bad when in fact it is not. To remedy this, you can add URLs to the %exemptLinks hash and they will be ignored altogether.
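Another way to cut down on these false positives is to look at the HTTP status code rather than treating every failed request as a dead link. The sketch below shows only the classification step, under the assumption that the status code has been obtained elsewhere (for example, from the response object returned by LWP::UserAgent's head method, which the script above does not use). `classify_status` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical classifier: decide what an HTTP status code says about a link.
sub classify_status {
    my ($code) = @_;
    return 'ok'        if $code >= 200 && $code < 400;    # success or redirect
    return 'protected' if $code == 401 || $code == 403;   # needs credentials
    return 'bad';                                          # everything else
}

for my $code (200, 301, 401, 404, 500) {
    printf "%d => %s\n", $code, classify_status($code);
}
```

A monitor built this way would count only the links classified as bad, leaving password-protected pages alone without having to list each one in %exemptLinks.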

Finally, here is some sample output generated by these monitors:

    $ snmptrapd -f -Lo
    2005-05-05 12:49:34 NET-SNMP version 5.2.1 Started.
    2005-05-05 12:51:39 localhost.localdomain [127.0.0.1] (via UDP: [127.0.0.1]:37243) TRAP, SNMP v1, community public
            enterprises.2789 Enterprise Specific Trap (1300) Uptime: 0:00:08.00
            enterprises.2789.1247.1 = STRING: "http://www.oreilly.com took greater than 1 second(s) to load. The problem persisted for over 3 seconds"
    2005-05-05 13:52:43 localhost.localdomain [127.0.0.1] (via UDP: [127.0.0.1]:37249) TRAP, SNMP v1, community public
            enterprises.2789 Enterprise Specific Trap (1300) Uptime: 0:00:10.00
            enterprises.2789.1247.1 = STRING: "This URL is BAD: http://www.oreilly.com/partners/index.php"

11.7.2. SMTP and POP3

The best way to monitor the health of your email service is to actually use it. This means sending and receiving email. The logic flow for monitoring SMTP follows:

  1. Start timer.

  2. Connect to SMTP server.

  3. Send email message to dummy account.

  4. Stop timer.

  5. Note how long it took to interact with the server.

Subtracting the time recorded in step 1 from the time recorded in step 4 tells you how long it took to interact with the SMTP server. If response times begin to climb, it could be indicative of a problem. Of course, if you aren't able to connect to the server in step 2, this should be noted and a trap should be sent.

Monitoring POP3 has a similar logic flow:

  1. Start timer.

  2. Connect to POP3 server.

  3. Start login timer.

  4. Send login credentials.

  5. Start retrieval timer.

  6. Retrieve email for dummy account.

  7. Start delete timer.

  8. Delete email from account.

  9. Stop all timers.

  10. Note how long it took to connect, log in, retrieve, and delete from the POP3 server.

Here we are concerned with measuring several additional aspects of the POP3 server. Knowing how long it took to provide authentication credentials to the server may be useful, as well as knowing how long it took to delete one or more messages.
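One way to keep separate timers for each phase is to run the phases from a list and record the elapsed time of each under its own name. The following is a sketch under stated assumptions: `time_phases` is a hypothetical helper, and each phase is a stand-in code reference that merely sleeps; in a real monitor the code references would wrap the corresponding Net::POP3 calls.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Each phase is a stand-in code reference; in a real monitor these would wrap
# the POP3 connect, login, retrieve, and delete operations.
my @phases = (
    [ 'connect',  sub { sleep(0.05) } ],
    [ 'login',    sub { sleep(0.02) } ],
    [ 'retrieve', sub { sleep(0.03) } ],
    [ 'delete',   sub { sleep(0.01) } ],
);

# Run each [name, code] pair in order and return a hash reference mapping
# phase name to elapsed seconds, plus a 'total' entry.
sub time_phases {
    my (@todo) = @_;
    my %elapsed;
    my $total = 0;
    for my $phase (@todo) {
        my ($name, $work) = @$phase;
        my $start = time();
        $work->();
        $elapsed{$name} = time() - $start;
        $total += $elapsed{$name};
    }
    $elapsed{total} = $total;
    return \%elapsed;
}

my $t = time_phases(@phases);
printf "%-8s %.3fs\n", $_, $t->{$_} for qw(connect login retrieve delete total);
```

Each per-phase figure could then feed its own SLA watcher, so that a slow login and a slow delete raise different traps.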

The scripts used in this section use the Net::SMTP and Net::POP3 modules that come with recent versions of Perl. If you are using an older version of Perl, you should be able to download these modules from http://www.cpan.org. The SMTP monitor is a separate script from the POP3 monitor so that you can easily run one script on one machine and the other on a different machine.

Now let's look at the actual code for the SMTP monitor:

    #!/usr/bin/perl
    #
    # File: smtp.pl
    #
    use strict;
    use warnings;
    use Net::SMTP;
    use Time::HiRes qw(time);   # subsecond resolution for the timer
    use MyStats;

    my $sleep = 1;
    my $server = "smtp.oreilly.com";
    my $heloServer = "smtp.oreilly.com";
    my $timeout = 30;
    my $debug = 1;
    my $count = 3;
    my $loadTime = 1;
    my $duration = 3;
    my $mailbox = "test1\@oreilly.com";
    my $from = "test1-admin\@oreilly.com";
    my $data = "This is a test email.\n";
    my $name1 = "Mail Server Watcher1";
    my $name2 = "Mail Server Watcher2";
    my $message1 = "$server has been down $count times";
    my $message2 = "Sending email to $mailbox took greater than $loadTime second(s). The problem persisted for over $duration seconds";

    my $stats = MyStats->new( );
    $stats->setCountWatcher($name1,$count,$message1);
    $stats->setSLA($name2,$duration,$loadTime,$message2);

    my $start = 0;
    my $stop = 0;
    while(1){
       $start = time( );
       my $smtp = Net::SMTP->new(
          $server,
          Hello   => $heloServer,
          Timeout => $timeout,
          Debug   => $debug
          );
       if(!$smtp){
          # Couldn't connect to the SMTP server
          $stats->incrCountWatcher($name1);
       }else{
          $stats->decrCountWatcher($name1);
          $smtp->mail($from);      # envelope sender
          $smtp->to($mailbox);     # envelope recipient (the dummy account)
          $smtp->data( );
          $smtp->datasend($data);
          $smtp->dataend( );
          $smtp->quit;
          $stop = time( );
          my $total = sprintf("%.3f",($stop-$start));
          $stats->updateSLA($name2,$total);
       }
       $stats->sendAlert( );
       print "Sleeping...\n";
       sleep($sleep);
    }

Now the POP3 script:

    #!/usr/bin/perl
    #
    # File: pop3.pl
    #
    use strict;
    use warnings;
    use Net::POP3;
    use Time::HiRes qw(time);   # subsecond resolution for the timer
    use MyStats;

    my $sleep = 1;
    my $server = "pop3.oreilly.com";
    my $username = "kschmidt";
    my $password = "pword";
    my $timeout = 30;
    my $count = 3;
    my $loadTime = 1;
    my $duration = 3;
    my $name1 = "POP3 Server Watcher1";
    my $name2 = "POP3 Server Watcher2";
    my $message1 = "$server has been down $count times";
    my $message2 = "Popping email from $server for account $username took greater than $loadTime second(s). The problem persisted for over $duration seconds";

    my $stats = MyStats->new( );
    $stats->setCountWatcher($name1,$count,$message1);
    $stats->setSLA($name2,$duration,$loadTime,$message2);

    my $start = 0;
    my $stop = 0;
    while(1){
       $start = time( );
       my $pop = Net::POP3->new($server, Timeout => $timeout);
       if(!$pop){
          # Couldn't connect to the POP3 server
          $stats->incrCountWatcher($name1);
       }else{
          $stats->decrCountWatcher($name1);
          # login returns undef on failure and the message count on success
          # ("0E0", which is true but numerically zero, for an empty mailbox)
          my $msgCount = $pop->login($username, $password);
          if (defined($msgCount)) {
             my $msgnums = $pop->list || {}; # hashref of msgnum => size
             foreach my $msgnum (keys %$msgnums) {
                # At this point we get the message and delete it. If you want to
                # measure getting and deleting independent of each other, you
                # should probably start a new timer, get the messages, stop the
                # timer, start a new timer, delete the messages and stop the
                # timer. You will also want to create two new SLA trackers.
                my $msg = $pop->get($msgnum);
                $pop->delete($msgnum);
             }
          }else{
             # Login failure. You will want to track this.
          }
          $pop->quit;
          $stop = time( );
          my $total = sprintf("%.3f",($stop-$start));
          $stats->updateSLA($name2,$total);
       }
       $stats->sendAlert( );
       print "Sleeping..\n";
       sleep($sleep);
    }

The POP3 script will run continually. As soon as the SMTP script sends an email, the POP3 monitor will spring into action and do its thing.

11.7.3. DNS

One of the services that people often forget to monitor is DNS. Using techniques similar to those used to monitor web, SMTP, and POP3 services, we can monitor DNS as well. The Net::DNS Perl module is used in the example and is available from http://search.cpan.org/~olaf/Net-DNS-0.49/. Net::DNS does not require the libresolv library to be present on your Unix system, but if the library exists, the package uses it when building the module, which allows for increased performance.

This module is full featured and allows for at least the following:

  • Look up a host's address.

  • Discover nameserver(s) for a domain.

  • Discover Mail Exchange (MX) record(s) for a domain.

  • Obtain a domain's Start of Authority (SOA) record.

For our purposes, we will measure how long it takes to perform a DNS query for a host as well as obtain MX records for a domain.

    #!/usr/bin/perl
    #
    # File: dns.pl
    #
    use strict;
    use warnings;
    use Net::DNS;
    use Time::HiRes qw(time);   # subsecond resolution for the timer
    use MyStats;

    my $sleep = 30;
    my $search = "www.oreilly.com";
    my $mxSearch = "oreilly.com";
    my $loadTime = 1;
    my $duration = 3;
    my $ns = "192.168.0.4";
    my $debug = 0;
    my $name1 = "DNS Server Watcher1";
    my $message1 = "The DNS server $ns took greater than $loadTime second(s) to respond to queries. The problem persisted for over $duration seconds";

    my $stats = MyStats->new( );
    $stats->setSLA($name1,$duration,$loadTime,$message1);

    my $start = 0;
    my $stop = 0;
    while(1){
       $start = time( );
       my $res = Net::DNS::Resolver->new(
          nameservers => [$ns],
          debug       => $debug,
          );
       # look up the host's address (A) records
       my $query = $res->search($search);
       if ($query) {
          foreach my $rr ($query->answer) {
             next unless $rr->type eq "A";
             print $rr->address, "\n";
          }
       } else {
          # You may want to create a new watcher for search errors
          warn "query failed: ", $res->errorstring, "\n";
       }
       # lookup MX records
       my @mx = mx($res, $mxSearch);
       if(@mx){
          foreach my $rr (@mx) {
             print $rr->preference, " ", $rr->exchange, "\n";
          }
       } else {
          # You may want to create a new watcher for MX errors
          warn "Can't find MX records for $mxSearch: ", $res->errorstring, "\n";
       }
       $stop = time( );
       my $total = sprintf("%.3f",($stop-$start));
       $stats->updateSLA($name1,$total);
       $stats->sendAlert( );
       print "Sleeping..\n";
       sleep($sleep);
    }

11.7.4. More Monitoring Suggestions

Here are some suggestions on how you can enhance these monitors:

  • A database such as MySQL can be used to store the response times for every run of a monitor. Over time, a profile of how well a service performs can be developed from the stored information. Additionally, SLA reports can be created that show how often a service was responsive during some time interval.

  • For web monitoring, you might want to create a script that can detect the age of dynamically created content. This would allow an administrator to know if some component on the backend is malfunctioning. We suggest getting a copy of O'Reilly's Perl Cookbook for ways of using LWP and other modules to accomplish this.

  • The bad web link finder can be extended to actually log into pages that require authentication credentials. Again, the Perl Cookbook can help with adding this functionality to the script.
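As a rough illustration of the SLA report mentioned in the first suggestion, the sketch below computes the percentage of monitor runs that met a response-time limit. The sample data and the `sla_compliance` function are hypothetical; a real report would read the stored response times back from the database rather than from an in-memory list.

```perl
use strict;
use warnings;

# Hypothetical sample data: response times in seconds, one per monitor run.
# A real deployment would read these back from the database instead.
my @samples = (0.21, 0.35, 1.40, 0.30, 0.28, 2.10, 0.25, 0.31, 0.29, 0.33);

# Percentage of samples that met the SLA (strictly faster than $limit).
# Returns undef when there are no samples to report on.
sub sla_compliance {
    my ($limit, @times) = @_;
    return undef unless @times;
    my $met = grep { $_ < $limit } @times;
    return 100 * $met / @times;
}

printf "%.1f%% of requests met the 0.5s SLA\n", sla_compliance(0.5, @samples);
```

With the sample data shown, eight of the ten runs are faster than half a second, so the report prints 80.0%.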




Essential SNMP, Second Edition
ISBN: 0596008406
Year: 2003
Pages: 165
