In this chapter, we'll see how the Simple Network Management Protocol (SNMP) can be used in conjunction with the Mon monitoring package to alert you to problems on the cluster nodes. The SNMP daemon, snmpd, runs on each cluster node, and the Mon monitoring daemon runs on the cluster node manager, which sits outside of the cluster. When Mon finds a problem, it alerts you via an email message or a text message sent to your PDA, so that you can take action before the problem affects users running mission-critical applications on the cluster.
The humbly named Mon monitoring package uses a hierarchical relationship between events to decide whether or not to alert the system administrator. For example, if you are watching the http and telnet daemons on a cluster node, and you are also watching to see if the node is alive using the ping utility (and ICMP packets), you don't need three alerts: the first reporting that http is not available, the second complaining that telnet doesn't work, and finally a third telling you the node is not available on the network. A single alert that the node is unreachable is enough, because the other two failures follow from it.
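The idea behind the hierarchy can be sketched in a few lines of Python. This is only an illustrative model, not Mon's own code or configuration syntax: the DEPENDS_ON table and the alerts_to_send function are hypothetical names, and the sketch assumes each service depends on at most one parent.

```python
# Hypothetical model of hierarchical alert suppression; not Mon's own code.

# Each watched service names the service it depends on (None for the root).
DEPENDS_ON = {
    "ping": None,
    "http": "ping",
    "telnet": "ping",
}


def alerts_to_send(failed):
    """Return the failed services that have no failed ancestor."""
    to_alert = []
    for service in sorted(failed):
        parent = DEPENDS_ON.get(service)
        suppressed = False
        # Walk up the hierarchy; a failed ancestor already explains this failure.
        while parent is not None:
            if parent in failed:
                suppressed = True
                break
            parent = DEPENDS_ON.get(parent)
        if not suppressed:
            to_alert.append(service)
    return to_alert


if __name__ == "__main__":
    # The node is down, so only the ping failure is reported.
    print(alerts_to_send({"ping", "http", "telnet"}))
    # The node is up but one daemon has died, so that daemon is reported.
    print(alerts_to_send({"http"}))
```

When the node itself is unreachable, only the ping failure survives the filter; when the node is up and a single daemon fails, that daemon's alert goes out as usual.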
Mon can send alerts using a variety of methods, including:
SNMP traps.
Email messages.
Short Message Service (SMS) text messages to a cell phone or PDA.
A custom script or program.
Mon lets you control how often an alert is repeated while a service stays down. It also lets you control where alerts are sent, and how often, according to the day of the week or the time of day.
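The two controls described above amount to a time-window check and a minimum interval between repeated alerts. The following Python sketch shows that logic under stated assumptions: in_window and interval_elapsed are hypothetical helper names, and Mon expresses these rules in its own configuration file rather than in code like this.

```python
from datetime import datetime, time, timedelta


def in_window(now, days, start, end):
    """True if `now` falls on one of `days` (0 = Monday) between start and end."""
    return now.weekday() in days and start <= now.time() <= end


def interval_elapsed(now, last_alert, min_interval):
    """True if no alert has been sent yet or min_interval has passed since."""
    return last_alert is None or now - last_alert >= min_interval


if __name__ == "__main__":
    weekdays = {0, 1, 2, 3, 4}
    now = datetime.now()
    last_alert = None  # no alert sent so far for this outage
    if in_window(now, weekdays, time(9, 0), time(17, 0)) and interval_elapsed(
        now, last_alert, timedelta(minutes=30)
    ):
        print("send alert to the daytime contact")
```

Keeping the two checks separate makes it possible to route alerts to different recipients per window while sharing one repeat interval, or the other way around.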
To monitor a service, Mon runs a monitoring script and passes arguments to it (an IP address to monitor, for example). Mon looks at three things returned by the script:
The exit or return status of the script.
The first line of output printed by the monitor script (the status summary in Mon-speak).
Any remaining output from the monitor script (the status detail).
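A monitor script that honors this three-part contract might look like the Python sketch below. It is a minimal illustration, not one of the monitors shipped with Mon: the names tcp_probe, run_monitor, and main are hypothetical, the TCP port is an arbitrary default, and putting the names of the failed hosts on the summary line is an assumption made for the example.

```python
import socket
import sys


def tcp_probe(host, port=80, timeout=5.0):
    """Return None if a TCP connection succeeds, else an error string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return str(exc)


def run_monitor(hosts, probe=tcp_probe):
    """Return (exit_status, summary, detail_lines) for the given hosts."""
    failures = []
    for host in hosts:
        error = probe(host)
        if error is not None:
            failures.append((host, error))
    if not failures:
        return 0, "", []
    # Assumption: the summary line lists the hosts that failed.
    summary = " ".join(host for host, _ in failures)
    detail = [f"{host}: {error}" for host, error in failures]
    return 1, summary, detail


def main(argv):
    """Print the summary line, then the detail, and return the exit status."""
    status, summary, detail = run_monitor(argv)
    print(summary)
    for line in detail:
        print(line)
    return status

# To use this as a standalone script, end the file with:
#     sys.exit(main(sys.argv[1:]))
```

Passing the probe in as a parameter keeps the reporting logic testable without touching the network.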
Mon first checks the return code of the monitoring script. A return code of 0 tells Mon that the test passed; any nonzero value tells Mon that the test failed. If a script returns a nonzero status more than once, Mon starts comparing the status summary (the first line of text printed by the script) with the previous one to decide whether the change warrants another alert.
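The decision just described can be modeled as a small state machine. The Python sketch below is a simplification under stated assumptions: alert_needed is a hypothetical function, it ignores any time-based limits on repeated alerts, and it treats an unchanged summary on a repeated failure as a reason to stay quiet.

```python
def alert_needed(previous_summary, status, summary):
    """Decide whether a monitor result should trigger an alert.

    previous_summary is the summary from the last failed run, or None if
    the last run succeeded. Returns (alert, new_previous_summary).
    """
    if status == 0:
        # Success: nothing to report, and the failure history is cleared.
        return False, None
    if previous_summary is None:
        # First failure after a success always alerts.
        return True, summary
    # Repeated failure: alert again only if the summary line changed.
    return summary != previous_summary, summary


if __name__ == "__main__":
    previous = None
    # (exit status, summary line) from four consecutive monitor runs.
    for status, summary in [(1, "node1"), (1, "node1"), (1, "node1 node2"), (0, "")]:
        alert, previous = alert_needed(previous, status, summary)
        print(status, repr(summary), "alert" if alert else "quiet")
```

In this model, a second host joining the failure list changes the summary and triggers a fresh alert, while an identical repeat failure does not.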