Hack 88. Keep Tabs on Systems and Services

Consolidate home-grown monitoring scripts and mechanisms using Nagios.

Monitoring is a key task for administrators, whether you're in a small environment of 50 to 100 servers or are managing many sites globally with 5,000 servers each. At some point, trying to keep up with the growth in the number of new services and servers deployed, and reflecting changes across many disparate monitoring solutions, becomes a full-time job!

Admins often monitor not only the availability of a system (using simple tools such as ping), but also the health of the services running on the system; the network devices that connect the systems to each other; peripheral devices such as printers, copiers, and uninterruptible power supplies (UPSs); and even air conditioners and other equipment. Often, these tools perform simple connections to services and use SNMP and rstatd data collection, along with specialized environmental monitoring devices, to gain a complete view of the data centers.

While there are plenty of solutions out there for collecting and aggregating this data in some sane way, I've found Nagios to provide the perfect balance between simplicity and power. Nagios is a solution that meets the requirements of our mid-sized organization's computing environment quite well, for reasons such as these:

Dependency checking: If all your printers are on a single switch, and that switch goes down, would you rather get a page about each of 50 unavailable printers or a single page saying "the printer switch is down"? Nagios can be configured (or not) to follow a logical path, so that one unreachable device triggers a check of the devices it depends on. If a printer is down, Nagios first checks to make sure the printer switch is up before notifying anyone. If that printer switch is up and someone is running around unplugging printers, you'll get a lot of pages, but if the switch is down you'll be notified of the larger problem, not its consequences. Further, if that printer switch is unreachable because a router between Nagios and the printer switch is down or unavailable, you'll get that message instead, which can save you some troubleshooting time and makes the pages far more interesting and useful.
Downtime scheduling: When you have a host of different tools monitoring your environment, or a single tool that doesn't allow for downtime scheduling, your pager will go nuts as you bring down your environment and possibly again on the way back up. Mix this with a situation in which there is no dependency checking and you'll soon find a group of administrators walking around while their pagers lie vibrating in their desk drawers. With Nagios, you can schedule downtimes and avoid the hassle.
Recovery notification: Many people use monitoring solutions that do simple "ping" monitoring, which tells you if a machine is unreachable. However, if it was unreachable because of a temporary power glitch that caused a switch to momentarily freak out, and the agent never lets you know that the machine became available again one minute later, you could be wasting gobs of time driving over to the site for a problem that has already corrected itself. Nagios will notify you of recoveries.
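
As a rough sketch of how the dependency path in the printer example can be expressed, a host entry can name its upstream device with a parents line, and Nagios follows that chain when something stops responding. The host names, aliases, and addresses below are hypothetical, and the entries assume a template named generic-host that supplies the check and notification settings (host entries and templates are covered later in this hack).

```cfg
# Hypothetical hosts: a router, the printer switch behind it, and one printer.
# Only the "parents" lines matter here; everything else is illustrative.
define host{
    use         generic-host
    host_name   core-router
    alias       Router Between Nagios and the Printer Switch
    address     192.0.2.1
}

define host{
    use         generic-host
    host_name   printer-switch
    alias       Printer Switch
    address     192.0.2.2
    parents     core-router        ; checked when the switch stops responding
}

define host{
    use         generic-host
    host_name   printer01
    alias       Second Floor Printer
    address     192.0.2.11
    parents     printer-switch     ; checked when the printer stops responding
}
```

With a chain like this, a failed printer-switch should show up as the problem, while printer01 is reported as unreachable rather than down.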

A lot of solutions don't provide these benefits. Throw in solutions that are tough to customize, don't provide service checks for specialized appliances or services, and are tough to integrate with the few tools you might have that do work well, and you have big headaches and a downhill trend in the morale of your administrators.

9.12.1. Enter Nagios

I've found that Nagios provides an extremely simple way of taking many of our disparate scripts, notification modules, ping checkers, and other tools, and putting them all under the Nagios umbrella as "plug-ins" without my having to change much of anything. In fact, the monitoring functionality that comes preconfigured with Nagios is all handled through shell, Perl, or C programs that Nagios calls in the background.

The barrier to entry was actually so low that within a day I had a very basic Nagios configuration up and running, with a web interface, email notifications, and basic service and host checks working. By the end of the week, I had configured Nagios to be more discriminating in its notifications (e.g., notify only the DBA if the database service became unavailable, but only the Sun admins if the database server went down). I had also configured host and service dependencies, and told it about our next two scheduled downtimes. I had even found existing plug-ins for Nagios that allowed for the retirement of a couple of our home-grown scripts for monitoring things like a NetApp filer and a MySQL database. Things were looking up!

What's more, the Nagios web interface keeps enough statistics to help pinpoint when a problem started or to predict your disk needs on a file server over the next year, and it can also easily be integrated with standard tools such as MRTG [Hack #79] or Cacti.

If you want to get really hardcore, you can also use Nagios to collect SNMP traps, or go fully distributed by using Nagios agents, rather than a central polling mechanism, across your machine room.

The only downside to Nagios that I've found so far is that, while configuration is pretty brainless, there is no configuration GUI or automation, so it all has to be done by hand (which can be somewhat cumbersome and very time-consuming). The payoff is there, though, so let's check out some configuration details. I'll cover only the most basic configuration, because documenting a full-blown Nagios deployment could be another book unto itself!

First you'll need to install Nagios, either using your distribution's package management system (for a binary install) or by going to http://www.nagios.org to grab the source and installing according to the plentiful documentation.

9.12.2. Hosts, Services, and Contacts, Oh My!

We'll start simple. Your machine room consists of hosts. These hosts run services. If either a host or a service that it runs becomes unavailable, you'll want Nagios to notify a contact. Thus, the first thing to do is tell Nagios about these entities. To do this, we add entries in the hosts.cfg, services.cfg, and contacts.cfg files. These files may be located under /etc/nagios if your installation was preconfigured (as on a SUSE system or a Red Hat RPM install), or wherever you told it to put configuration files during a source install (/usr/local/etc/nagios, by default).

Here's a simple hosts.cfg entry that tells Nagios some basic information about a host:

 define host{
     use         generic-host
     host_name   newhotness
     alias       Jonesy's Desktop
     address     128.112.9.52
     parents     myswitch
 }

You'll notice that all this information is specific to my desktop machine. There's nothing here about how to check the availability of the host, when to check it, or anything else. This is because Nagios allows you to configure a template host entry to hold all of that information (since it's likely to be identical for large numbers of hosts). The template used in the above entry is called generic-host, and can be found near the top of the hosts.cfg file. The generic-host template entry looks like this:

 define host{
     name                          generic-host
     notifications_enabled         1
     event_handler_enabled         1
     flap_detection_enabled        1
     process_perf_data             1
     notification_interval         360
     notification_period           24x7
     notification_options          d,u,r
     contact_groups                sysstaff
     check_command                 check-host-alive
     max_check_attempts            10
     retain_status_information     1
     retain_nonstatus_information  1
     register                      0
 }

This one entry does all the heavy lifting for the rest of the devices that reference this template. They will all be checked using the check-host-alive check command, which is a scripted ping command. Per the notification_period key's value, they'll be monitored 24 hours a day, 7 days a week. The notification_options line says to send notifications if the status of the machine is either down (d), unreachable (u), or recovered (r). The flap_detection_enabled option is turned on here, as well. This is a feature of Nagios that seeks to save you from getting pages from services or hosts that change state frequently due to temporary aberrations in network connectivity, host response times, or services that are purposely restarted to pick up automated updates. You have to admit, putting all this detail into one entry is better than putting it into every host entry!
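
To make the "scripted ping" idea concrete, here is a minimal sketch of what a check-host-alive command definition might look like. It assumes the standard check_ping plug-in is installed and that the $USER1$ macro points at your plug-in directory; the thresholds are illustrative, and the file that holds command definitions varies between installations.

```cfg
# Sketch of a command definition (thresholds are illustrative assumptions).
# -w / -c take "round-trip time in ms, packet loss %" for warning / critical,
# and -p sets the number of ICMP packets to send.
define command{
    command_name    check-host-alive
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
}
```

Nagios substitutes $HOSTADDRESS$ with the address value from each host entry, which is why a single command definition can serve every host that uses the template.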

Let's move on to services. A typical services.cfg entry looks like this:

 define service{
     use                    generic-service
     host_name              ftpserver
     service_description    FTP
     is_volatile            0
     check_period           24x7
     max_check_attempts     3
     normal_check_interval  5
     retry_check_interval   1
     contact_groups         sysstaff
     notification_interval  120
     notification_period    24x7
     notification_options   w,u,c,r
     check_command          check_ftp
 }

This is the entry for my FTP server. Again, it includes only the information specific to the FTP server; all the rest of the information comes from the template named generic-service, whose settings are applied to every service whose entry contains the use generic-service directive. Notice that I use a service-specific check command called check_ftp. The check_ftp command simply runs a plug-in that attempts to make a connection to the FTP service on ftpserver.
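
The generic-service template itself isn't shown here, so the following is only a sketch of what such a template and the matching check_ftp command definition might contain. The template values are assumptions modeled on the generic-host entry, and the command line assumes the standard check_ftp plug-in lives in the directory that $USER1$ points to.

```cfg
# Assumed service template, mirroring the settings of the generic-host template.
define service{
    name                          generic-service
    active_checks_enabled         1
    notifications_enabled         1
    event_handler_enabled         1
    flap_detection_enabled        1
    process_perf_data             1
    retain_status_information     1
    retain_nonstatus_information  1
    register                      0    ; a template only, not a real service
}

# Assumed command definition that the service entry's check_command refers to.
define command{
    command_name    check_ftp
    command_line    $USER1$/check_ftp -H $HOSTADDRESS$
}
```

Because register is set to 0, Nagios treats the first entry purely as a source of defaults for the services that reference it.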

You've no doubt noticed that both the host and service checks send mail to sysstaff if there's a problem. But what is sysstaff? It's actually not an email alias (although you can use one if you like). Instead, it's configured within Nagios itself, in the contacts.cfg and contactgroups.cfg files. Let's have a look! Here's an entry for a contact from the contacts.cfg file:

 define contact{
     contact_name                   jonesy
     alias                          Jonesy
     service_notification_period    24x7
     host_notification_period       24x7
     service_notification_options   c,r
     host_notification_options      d,r
     service_notification_commands  notify-by-email
     host_notification_commands     host-notify-by-email
     email                          jonesy@linuxlaboratory.org
 }

This is my contact entry. It says that I'm to be notified of any host or service failures 24 hours a day, 7 days a week. However, I've hacked my entry so that instead of being notified of every change in state, I'm only notified when services (service_notification_options) are critical (c) and when they recover (r), and when hosts (host_notification_options) are down (d) and when they recover (r). There's an entry like this for everyone who will receive notifications about service or host status from Nagios.
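
The notify-by-email command named in the contact entry is itself just another command definition. The sketch below shows roughly what one could look like; the paths to printf and mail and the message layout are assumptions, so compare it against the definition shipped with your installation rather than copying it verbatim.

```cfg
# Sketch of a service notification command. The paths to printf and mail
# are assumptions; Nagios fills in the $...$ macros before running the line.
define command{
    command_name    notify-by-email
    command_line    /usr/bin/printf "%b" "Notification: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nService: $SERVICEDESC$\nState: $SERVICESTATE$\n" | /bin/mail -s "$NOTIFICATIONTYPE$: $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$" $CONTACTEMAIL$
}
```

The $CONTACTEMAIL$ macro is taken from the email line of whichever contact is being notified, so one definition covers everyone.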

Once all of the contacts are defined, you can group them together to form Nagios-specific groups in contactgroups.cfg. Here's an example:

 define contactgroup{
     contactgroup_name  sysstaff
     alias              The Systems Guys
     members            jonesy,bill,joe
 }

That wasn't so hard, was it? Just remember that anyone in a contact group must first be defined as a contact in contacts.cfg.
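
Contact groups are also what make more discriminating notifications possible, such as paging only the DBA when the database service fails but only the Sun admins when the database server itself goes down. The following is a sketch under stated assumptions: the group names, the dbserver host, its address, and the check_mysql command are all hypothetical, and the contacts are borrowed from the sysstaff example purely for illustration.

```cfg
# Hypothetical groups: one for the database service, one for the hardware.
define contactgroup{
    contactgroup_name  dbas
    alias              Database Administrators
    members            bill
}

define contactgroup{
    contactgroup_name  sunadmins
    alias              Sun Administrators
    members            jonesy,joe
}

# Host problems go to the Sun admins (overrides the template's sysstaff).
define host{
    use             generic-host
    host_name       dbserver
    alias           Database Server
    address         192.0.2.20
    contact_groups  sunadmins
}

# Service problems go to the DBAs; check_mysql is an assumed command name.
define service{
    use                    generic-service
    host_name              dbserver
    service_description    MySQL
    check_period           24x7
    max_check_attempts     3
    normal_check_interval  5
    retry_check_interval   1
    contact_groups         dbas
    notification_interval  120
    notification_period    24x7
    notification_options   w,u,c,r
    check_command          check_mysql
}
```

A value set directly in a host or service entry takes precedence over the one inherited from its template, which is what lets dbserver use sunadmins instead of sysstaff.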

At this point you have only a very simple configuration, but it's enough to fire up Nagios and have it monitor the hosts and services you defined and notify those who are defined as contacts. Before you do that, though, you should run the following command to do a syntax check:

 $ nagios -v /etc/nagios/nagios.cfg

This runs Nagios in "verify" mode, and we've fed it the main Nagios configuration file, which contains a line for every other configuration file in use. If there's a problem, Nagios will spit out plenty of information for you to find, check out, and fix the problem. In these early stages, the most common issues will probably be related to configuration files defined in nagios.cfg that are not yet being used. For example, since we haven't used the dependency configuration file, you'll want to comment out any references to it in nagios.cfg.
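
For illustration, the relevant part of nagios.cfg might look something like the excerpt below. The paths assume the /etc/nagios layout, the dependencies.cfg file name is an assumption, and a real file will also list other configuration files (commands, time periods, and so on) that are omitted here.

```cfg
# Excerpt from nagios.cfg: one cfg_file line per object configuration file.
cfg_file=/etc/nagios/hosts.cfg
cfg_file=/etc/nagios/services.cfg
cfg_file=/etc/nagios/contacts.cfg
cfg_file=/etc/nagios/contactgroups.cfg

# Not in use yet, so commented out until dependencies are actually defined.
#cfg_file=/etc/nagios/dependencies.cfg
```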

If you received no errors, you're in good shape. You might see "warnings" that point out possible problems to you during config verification, but in many cases these warnings are for things that are intentional, such as contacts that are not assigned to a contact group (which is not required and not always desirable). Once you've verified that the warnings are harmless, or fixed whatever issues existed and reverified things, you can fire up Nagios and begin receiving notifications via email about the hosts and services you've configured.

9.12.3. See Also

http://www.nagios.org