11.7. Service Monitoring

This section presents simple scripts that can help you monitor services like mail, DNS, and web content. Earlier we showed how you can use the netcat tool to verify that, for example, your SMTP server is up and responding. This is all well and good, but there are times when you may need more control over the situation. These times may include when you need to know when a service:
In each of these instances, it would be nice to know ahead of time that things may not be working properly in your environment. The examples presented in this section use Perl modules to interact with the services directly. By using Perl, you have a great deal of control over how the services are monitored and how and when they send traps. What follows here is a Perl module that all the service monitors in this section use to track things like SLA information:

    #
    # File: MyStats.pm
    #

    package MyStats;

    use strict;
    use warnings;
    use Class::Struct;
    use Exporter;
    use SNMP_util;

    our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS, $VERSION, $duration, $count,
         $countAndTime, $sla, %watchers);

    $VERSION = 1.00;
    @ISA = qw(Exporter);

    #
    # There are two scenarios we want to track and alert on:
    # 1. Some resource has been down a certain number of times
    # 2. Service Level Agreements (SLAs). We are concerned with making sure
    #    services respond and operate within limits set forth in our SLA.
    #

    struct Count => {
        name         => '$',
        count        => '$',   # threshold at which an alert is sent
        currentCount => '$',   # failures seen so far
        message      => '$',
    };

    struct SLA => {
        name                => '$',
        responseTime        => '$',   # maximum acceptable response time
        count               => '$',   # threshold at which an alert is sent
        currentResponseTime => '$',   # most recent response time recorded
        currentCount        => '$',   # slow responses seen so far
        message             => '$',
    };

    sub new {
        my $classname = shift;
        my $self = {};
        my %arg = @_;
        bless($self, $classname);
        return $self;
    }

    sub removeWatcher {
        my $classname = shift;
        my ($name) = @_;
        if (exists($watchers{$name})) {
            delete($watchers{$name});
        }
    }

    sub thisExists {
        my $classname = shift;
        my ($name) = @_;
        return exists($watchers{$name});
    }

    sub setCountWatcher {
        my $classname = shift;
        my ($name, $c, $message) = @_;
        $count = Count->new( );
        $count->name($name);
        $count->count($c);
        $count->currentCount(0);   # start from zero rather than undef
        $count->message($message);
        $watchers{$name} = $count;
    }

    sub incrCountWatcher {
        my $classname = shift;
        my ($name) = @_;
        if (exists($watchers{$name})) {
            my $count = $watchers{$name}->currentCount;
            $count++;
            $watchers{$name}->currentCount($count);
        }
    }

    sub decrCountWatcher {
        my $classname = shift;
        my ($name) = @_;
        if (exists($watchers{$name})) {
            my $count = $watchers{$name}->currentCount;
            if ($count > 0) {
                $count--;
                $watchers{$name}->currentCount($count);
            }
        }
    }

    sub setSLA {
        my $classname = shift;
        my ($name, $count, $responseTime, $message) = @_;
        $sla = SLA->new( );
        $sla->name($name);
        $sla->count($count);
        $sla->responseTime(sprintf("%.3f", $responseTime));
        $sla->currentCount(0);
        $sla->currentResponseTime(0);
        $sla->message($message);
        $watchers{$name} = $sla;
    }

    sub updateSLA {
        my $classname = shift;
        my ($name, $responseTime) = @_;
        if (exists($watchers{$name})) {
            my $watcher = $watchers{$name};
            if ($responseTime >= $watcher->responseTime) {
                # Too slow: record the time and count the violation
                $watcher->currentResponseTime($responseTime);
                $watcher->currentCount($watcher->currentCount + 1);
            } elsif ($watcher->currentCount > 0) {
                # Fast enough: forgive one earlier violation
                $watcher->currentCount($watcher->currentCount - 1);
                $watcher->currentResponseTime($responseTime);
            }
        }
    }

    sub sendAlert {
        my $classname = shift;
        my $host = "public\@localhost:162";
        my $agent = "localhost";
        my $eid = ".1.3.6.1.4.1.2789";
        my $trapId = 6;
        my $specificId = 1300;
        my $oid = ".1.3.6.1.4.1.2789.1247.1";
        foreach my $key (sort keys %watchers) {
            my $watcher = $watchers{$key};
            if ($watcher->isa('Count')) {
                if ($watcher->currentCount >= $watcher->count) {
                    my $message = $watcher->message;
                    print "Sending Count Trap: $message\n";
                    snmptrap($host, $eid, $agent, $trapId, $specificId, $oid,
                             "string", $message);
                    $watcher->currentCount(0);
                }
            }
            if ($watcher->isa('SLA')) {
                if ($watcher->currentCount >= $watcher->count &&
                    $watcher->currentResponseTime > $watcher->responseTime) {
                    my $message = $watcher->message;
                    print "Sending SLA Trap: $message\n";
                    snmptrap($host, $eid, $agent, $trapId, $specificId, $oid,
                             "string", $message);
                    $watcher->currentCount(0);
                }
            }
        }
    }

    1;

The user of this module can create two types of watchers:
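The two watcher types that MyStats.pm defines are Count and SLA. As a rough illustration of the bookkeeping each one performs, here is a minimal standalone sketch. It does not load MyStats.pm or send traps; the hash layout and sub names are invented for this example, and the SLA alert test is simplified to the counter alone.

```perl
#!/usr/bin/perl
# Standalone sketch of the two watcher types. The hash layout and
# sub names here are illustrative only and are not part of MyStats.pm.
use strict;
use warnings;

# Count watcher: alert once failures reach a threshold.
sub new_count_watcher {
    my ($threshold) = @_;
    return { threshold => $threshold, current => 0 };
}

# Record one check result; a success forgives one earlier failure.
sub record_check {
    my ($w, $ok) = @_;
    if    (!$ok)              { $w->{current}++ }
    elsif ($w->{current} > 0) { $w->{current}-- }
    return $w->{current};
}

# SLA watcher: alert once enough responses meet or exceed the time limit.
sub new_sla_watcher {
    my ($threshold, $limit) = @_;
    return { threshold => $threshold, limit => $limit, current => 0 };
}

# Record one response time; a fast response forgives one slow one.
sub record_response {
    my ($w, $seconds) = @_;
    if    ($seconds >= $w->{limit}) { $w->{current}++ }
    elsif ($w->{current} > 0)       { $w->{current}-- }
    return $w->{current};
}

# True when the watcher has reached its threshold. The counter is reset,
# as sendAlert does after it sends a trap.
sub should_alert {
    my ($w) = @_;
    return 0 if $w->{current} < $w->{threshold};
    $w->{current} = 0;
    return 1;
}

my $down = new_count_watcher(3);
record_check($down, 0) for 1 .. 2;   # two failures
record_check($down, 1);              # one success: counter drops to 1
record_check($down, 0) for 1 .. 2;   # two more failures: counter is 3
print "count alert\n" if should_alert($down);

my $slow = new_sla_watcher(3, 1);    # 3 slow responses, 1-second limit
record_response($slow, $_) for (1.2, 0.4, 1.5, 1.1, 2.0);
print "sla alert\n" if should_alert($slow);
```

Because successes decrement the counters, both watcher types react to sustained trouble rather than to a single bad sample.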
MyStats.pm is a basic implementation, but it is functional as is. Its use will become clearer when we present actual service monitoring scripts. When monitoring customer-visible services, keep in mind the following:
Now let's look at three service monitoring scripts.

11.7.1. Web Content

Many people monitor the hardware their web server runs on without actually monitoring the web content itself. The scripts in this section use the Library for WWW in Perl (LWP) module to interact with a web server's content. The LWP module is bundled with many Perl distributions; if yours does not include it, you can install it from CPAN. We will present two scripts that perform the following monitoring tasks:
The first example is in the same vein as the other service monitors. The second monitor, however, shows how easy it is to validate a web site's links. This can come in handy when you go live with a total redesign of your corporate web site. If the link to investor information is dead, wrong, or just not working, you will want to know about it pronto. Believe it or not, we have seen this happen over and over again.

This script attempts to get the main page from a URL. It detects whether the connection can be made to the web server and whether the request takes an inordinately long time.

    #!/usr/bin/perl
    #
    # File: web-load.pl
    #
    use strict;
    use warnings;
    use LWP::Simple;
    use Time::HiRes qw(time);   # fractional seconds, so "%.3f" is meaningful
    use MyStats;

    my $URL = "http://www.oreilly.com";
    my $count = 3;
    my $loadTime = 1;
    my $duration = 3;
    my $name1 = "URL Watcher1";
    my $name2 = "URL Watcher2";
    my $message1 = "$URL has been down $count times";
    my $message2 = "$URL took greater than $loadTime second(s) to load. The problem persisted for over $duration seconds";

    my $stats = MyStats->new( );
    $stats->setCountWatcher($name1,$count,$message1);
    $stats->setSLA($name2,$duration,$loadTime,$message2);

    #
    # Example taken from O'Reilly's Perl Cookbook 2nd edition
    #
    my $start = 0;
    my $stop = 0;
    my $sleep = 1;
    while(1){
        $start = time( );
        my $content = get($URL);
        if(!defined($content)) {
            # Couldn't get content at all!
            $stats->incrCountWatcher($name1);
        }else{
            $stats->decrCountWatcher($name1);
            $stop = time( );
            my $total = sprintf("%.3f",($stop-$start));

            $stats->updateSLA($name2,$total);
        }
        $stats->sendAlert( );
        print "Sleeping...\n";
        sleep($sleep);
    }

Here are some pertinent points about this script. Note that all the scripts in this section follow the same form when it comes to collecting SLA information and sending traps.
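The form these scripts share (time an operation, then record either a failure or the elapsed time) can be sketched as one small helper. This is only an illustration: timed_check is a hypothetical name, the code refs stand in for calls such as get($URL), and Time::HiRes is assumed to be available for sub-second timing.

```perl
#!/usr/bin/perl
# Sketch of the measure-then-record pattern.
# timed_check is a hypothetical helper, not part of MyStats.pm.
use strict;
use warnings;
use Time::HiRes qw(time sleep);   # fractional-second time() and sleep()

# Run $check (a code ref) and return (ok, elapsed_seconds).
# A check that dies or returns undef counts as a failure.
sub timed_check {
    my ($check) = @_;
    my $start   = time();
    my $result  = eval { $check->() };
    my $elapsed = sprintf("%.3f", time() - $start);
    return (defined $result ? 1 : 0, $elapsed);
}

# Stand-in for a successful fetch that takes about a quarter of a second.
my ($ok, $elapsed) = timed_check(sub { sleep(0.25); return "<html></html>" });
print "ok=$ok elapsed=${elapsed}s\n";

# Stand-in for a failed fetch.
($ok, $elapsed) = timed_check(sub { return undef });
print "ok=$ok elapsed=${elapsed}s\n";
```

In a monitor built this way, a false $ok would lead to incrCountWatcher, while a true $ok would lead to decrCountWatcher followed by updateSLA with $elapsed.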
That's about it. It really is quite simple but can be very effective.

The following script can find bad links. It starts at a given URL and works its way through the href tags:

    #!/usr/bin/perl
    #
    # File: web-badlinks.pl
    #
    use strict;
    use warnings;
    use HTML::LinkExtor;
    use LWP::Simple;
    use MyStats;

    my $URL = "http://www.oreilly.com";
    my $count = 3;
    my $name1 = "URL Watcher1";
    my $message1 = "$URL has been down $count times";
    my $message2 = "This URL is BAD: ";

    my $stats = MyStats->new( );
    $stats->setCountWatcher($name1,$count,$message1);

    #
    # Place links in here that you do not want to check
    #
    my %exemptLinks = (
        # http://www.oreilly.com/partners/index.php will not get processed.
        "$URL/partners/index.php" => 1
    );

    #
    # Parts of this Example taken from O'Reilly's Perl Cookbook,
    # 2nd edition
    #
    my $sleep = 1;
    while(1){
        # Passing $URL as the base makes every extracted link an absolute URI
        my $parser = HTML::LinkExtor->new(undef, $URL);
        my $html = get($URL);
        if(!defined($html)){
            # Couldn't get html. Server may be down
            $stats->incrCountWatcher($name1);
        }else{
            $stats->decrCountWatcher($name1);
            $parser->parse($html);
            my @links = $parser->links;
            foreach my $linkarray (@links) {
                my @element = @$linkarray;
                my $elt_type = shift @element;
                while (@element) {
                    my ($attr_name,$attr_value) = splice(@element, 0, 2);
                    # Skip links we were told not to check
                    next if $exemptLinks{$attr_value};
                    if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
                        if(!head($attr_value)){
                            if(!$stats->thisExists($attr_value)){
                                my $m = $message2.$attr_value;
                                $stats->setCountWatcher($attr_value,$count,$m);
                            }else{
                                $stats->incrCountWatcher($attr_value);
                            }
                        }
                    }
                }
            }
        }
        $stats->sendAlert( );
        print "Sleeping..\n";
        sleep($sleep);
    }

We're not going to go into detail about how this script works. The watchers are set up in a similar fashion to the previous script. One thing to note is that this script can actually produce false positives.
When it comes across a link that requires login credentials, it may wrongly assume the link is bad when in fact it is not. To remedy this, you can add URLs to the %exemptLinks hash and they will be ignored altogether.

Finally, here is some sample output generated by these monitors:

    $ snmptrapd -f -Lo
    2005-05-05 12:49:34 NET-SNMP version 5.2.1 Started.
    2005-05-05 12:51:39 localhost.localdomain [127.0.0.1] (via UDP: [127.0.0.1]: 37243) TRAP, SNMP v1, community public
        enterprises.2789 Enterprise Specific Trap (1300) Uptime: 0:00:08.00
        enterprises.2789.1247.1 = STRING: "http://www.oreilly.com took greater than 1 second(s) to load. The problem persisted for over 3 seconds"
    2005-05-05 13:52:43 localhost.localdomain [127.0.0.1] (via UDP: [127.0.0.1]: 37249) TRAP, SNMP v1, community public
        enterprises.2789 Enterprise Specific Trap (1300) Uptime: 0:00:10.00
        enterprises.2789.1247.1 = STRING: "This URL is BAD: http://www.oreilly.com/partners/index.php"

11.7.2. SMTP and POP3

The best way to monitor the health of your email service is to actually use it. This means sending and receiving email. The logic flow for monitoring SMTP follows:
Steps 1 and 4 form a calculation for how long it took to interact with the SMTP server. If response times begin to climb, it could be indicative of a problem. Of course, if in step 2 you aren't able to connect to the server, this should be noted and a trap should be sent. Monitoring POP3 has a similar logic flow:
Here we are concerned with measuring several additional aspects of the POP3 server. Knowing how long it took to provide authentication credentials to the server may be useful, as well as knowing how long it took to delete one or more messages.

The scripts in this section use the Net::SMTP and Net::POP3 modules that come with recent versions of Perl. If you are using an older version of Perl, you should be able to download these modules from http://www.cpan.org. The SMTP monitor is a separate script from the POP3 monitor so that you can easily run one script on one machine and the other on a different machine.

Now let's look at the actual code for the SMTP monitor:

    #!/usr/bin/perl
    #
    # File: smtp.pl
    #
    use strict;
    use warnings;
    use Net::SMTP;
    use Time::HiRes qw(time);   # fractional seconds, so "%.3f" is meaningful
    use MyStats;

    my $sleep = 1;
    my $server = "smtp.oreilly.com";
    my $heloServer = "smtp.oreilly.com";
    my $timeout = 30;
    my $debug = 1;
    my $count = 3;
    my $loadTime = 1;
    my $duration = 3;
    my $mailbox = "test1\@oreilly.com";
    my $from = "test1-admin\@oreilly.com";
    my $data = "This is a test email.\n";
    my $name1 = "Mail Server Watcher1";
    my $name2 = "Mail Server Watcher2";
    my $message1 = "$server has been down $count times";
    my $message2 = "Sending email to $mailbox took greater than $loadTime second(s). The problem persisted for over $duration seconds";

    my $stats = MyStats->new( );
    $stats->setCountWatcher($name1,$count,$message1);
    $stats->setSLA($name2,$duration,$loadTime,$message2);

    my $start = 0;
    my $stop = 0;
    while(1){
        $start = time( );
        my $smtp = Net::SMTP->new(
            $server,
            Hello   => $heloServer,
            Timeout => $timeout,
            Debug   => $debug
        );
        if(!$smtp){
            $stats->incrCountWatcher($name1);
        }else{
            $stats->decrCountWatcher($name1);
            $smtp->mail($from);      # envelope sender
            $smtp->to($mailbox);     # recipient, the mailbox the POP3 monitor reads
            $smtp->data( );
            $smtp->datasend($data);
            $smtp->dataend( );
            $smtp->quit;
            $stop = time( );
            my $total = sprintf("%.3f",($stop-$start));
            $stats->updateSLA($name2,$total);
        }
        $stats->sendAlert( );
        print "Sleeping...\n";
        sleep($sleep);
    }

Now the POP3 script:

    #!/usr/bin/perl
    #
    # File: pop3.pl
    #
    use strict;
    use warnings;
    use Net::POP3;
    use Time::HiRes qw(time);   # fractional seconds, so "%.3f" is meaningful
    use MyStats;

    my $sleep = 1;
    my $server = "pop3.oreilly.com";
    my $username = "kschmidt";
    my $password = "pword";
    my $timeout = 30;
    my $count = 3;
    my $loadTime = 1;
    my $duration = 3;
    my $name1 = "POP3 Server Watcher1";
    my $name2 = "POP3 Server Watcher2";
    my $message1 = "$server has been down $count times";
    my $message2 = "Popping email from $server for account $username took greater than $loadTime second(s). The problem persisted for over $duration seconds";

    my $stats = MyStats->new( );
    $stats->setCountWatcher($name1,$count,$message1);
    $stats->setSLA($name2,$duration,$loadTime,$message2);

    my $start = 0;
    my $stop = 0;
    while(1){
        $start = time( );
        my $pop = Net::POP3->new($server, Timeout => $timeout);
        if(!$pop){
            $stats->incrCountWatcher($name1);
        }else{
            $stats->decrCountWatcher($name1);
            # login returns undef on failure and "0E0" for an empty mailbox,
            # so test for definedness rather than for a value greater than 0
            if (defined $pop->login($username, $password)) {
                my $msgnums = $pop->list; # hashref of msgnum => size
                foreach my $msgnum (keys %$msgnums) {
                    # At this point we get the message and delete it. If you want to
                    # measure getting and deleting independent of each other, you
                    # should probably start a new timer, get the messages, stop the
                    # timer, start a new timer, delete the messages and stop the
                    # timer. You will also want to create two new SLA trackers.
                    my $msg = $pop->get($msgnum);
                    $pop->delete($msgnum);
                }
            }else{
                # Login failure. You will want to track this.
            }
            $pop->quit;
            $stop = time( );
            my $total = sprintf("%.3f",($stop-$start));
            $stats->updateSLA($name2,$total);
        }
        $stats->sendAlert( );
        print "Sleeping..\n";
        sleep($sleep);
    }

The POP3 script will run continually. As soon as the SMTP script sends an email, the POP3 monitor will spring into action and do its thing.

11.7.3. DNS

One of the services that people often forget to monitor is DNS. Using techniques similar to those used to monitor web, SMTP, and POP3, we can monitor DNS as well. The Net::DNS Perl module is used in the example and is available from http://search.cpan.org/~olaf/Net-DNS-0.49/. While Net::DNS does not require the presence of the libresolv library on your Unix system to operate, if it does exist, the package uses it to build the module, which allows for increased performance. This module is full featured and allows for at least the following:
For our purposes, we will measure how long it takes to perform a DNS query for a host as well as obtain MX records for a domain.

    #!/usr/bin/perl
    #
    # File: dns.pl
    #
    use strict;
    use warnings;
    use Net::DNS;
    use Time::HiRes qw(time);   # fractional seconds, so "%.3f" is meaningful
    use MyStats;

    my $sleep = 30;
    my $search = "www.oreilly.com";
    my $mxSearch = "oreilly.com";
    my $count = 3;
    my $loadTime = 1;
    my $duration = 3;
    my $ns = "192.168.0.4";
    my $debug = 0;
    my $name1 = "DNS Server Watcher1";
    my $message1 = "The DNS server $ns took greater than $loadTime second(s) to respond to queries. The problem persisted for over $duration seconds";

    my $stats = MyStats->new( );
    $stats->setSLA($name1,$duration,$loadTime,$message1);

    my $start = 0;
    my $stop = 0;
    while(1){
        $start = time( );
        my $res = Net::DNS::Resolver->new(
            nameservers => [$ns],
            debug       => $debug,
        );
        my $query = $res->search($search);
        if ($query) {
            foreach my $rr ($query->answer) {
                next unless $rr->type eq "A";
                print $rr->address, "\n";
            }
        } else {
            # You may want to create a new watcher for search errors
            warn "query failed: ", $res->errorstring, "\n";
        }

        # lookup MX records
        my @mx = mx($res, $mxSearch);
        if(@mx){
            foreach my $rr (@mx) {
                print $rr->preference, " ", $rr->exchange, "\n";
            }
        } else {
            # You may want to create a new watcher for MX errors
            warn "Can't find MX records for $mxSearch: ", $res->errorstring, "\n";
        }
        $stop = time( );
        my $total = sprintf("%.3f",($stop-$start));
        $stats->updateSLA($name1,$total);
        $stats->sendAlert( );
        print "Sleeping..\n";
        sleep($sleep);
    }

11.7.4. More Monitoring Suggestions

Here are some suggestions on how you can enhance these monitors:
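One enhancement, hinted at in the comments of pop3.pl, is to time each phase of an exchange separately so that each can feed its own SLA watcher. The following is a minimal sketch of that idea. The helper time_phases and the phase names are hypothetical, and the code refs are stand-ins for real calls such as the login, get, and delete methods of Net::POP3.

```perl
#!/usr/bin/perl
# Sketch of per-phase timing.
# time_phases and the phase names are hypothetical.
use strict;
use warnings;
use Time::HiRes qw(time);   # fractional-second time()

# Run named phases in order and return { phase_name => elapsed_seconds }.
# @phases is a flat list of name => code-ref pairs, so order is preserved.
sub time_phases {
    my @phases = @_;
    my %elapsed;
    while (my ($name, $code) = splice(@phases, 0, 2)) {
        my $start = time();
        $code->();
        $elapsed{$name} = sprintf("%.3f", time() - $start);
    }
    return \%elapsed;
}

# Stand-ins for logging in, fetching a message, and deleting it.
my $timings = time_phases(
    login  => sub { 1 },
    get    => sub { 1 },
    delete => sub { 1 },
);
printf "%-6s %s\n", $_, $timings->{$_} for sort keys %$timings;
```

Each timing could then be passed to updateSLA under its own watcher name, with one setSLA call per phase made at startup.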