Using Event Monitoring Tools | UNIX Fault Management: A Guide for System Administrators


Event monitoring involves generating notification messages as soon as faults and other interesting conditions are detected. This is unlike UNIX commands, which report status information only when asked. You can configure event monitors to generate an event when a change in status occurs or when a predefined threshold condition has been met. Some event monitors generate events when faults or errors occur at the disk device level.

Several monitors discussed in this section are integrated into the Event Monitoring Service (EMS) framework. The EMS framework, available only on HP-UX systems, enables monitors to be provided in a consistent manner for a system. Whereas the EMS framework itself is freely available, the monitors are usually available as separate products or are bundled with the individual products for which they provide monitoring.

EMS provides a consistent graphical user interface (GUI) for the discovery and configuration of resources that can be monitored. EMS monitors can be configured to send events to OpenView IT/O. In addition, EMS events can be sent directly to any SNMP-capable management station, to a network application listening on a TCP or UDP port, to an e-mail address, or locally, to the console, system log file, or user log file. MC/ServiceGuard can allow packages to be dependent on these EMS resources, as well.

HA Monitors is a software package of EMS monitors. It includes monitors for filesystem space, user and job information, network interface status, MC/ServiceGuard cluster status, and disk volume status.

Event Monitoring Service Disk Volume Monitor

The EMS Disk Volume Monitor (DVM) is included in the HA Monitors product. It can be used to monitor disk resources that are configured using LVM, including copies of mirrored logical volumes.

The EMS DVM can be used in conjunction with MirrorDisk/UX. Using MirrorDisk/UX is completely transparent to an application. However, this can sometimes be a problem. A disk could fail and a mirror could take over, but the operator wouldn't know about the failure, so the failed disk wouldn't get repaired. Eventually, another failure would occur, with no backup for the data. The EMS DVM provides help for this situation.

The EMS DVM is responsible for monitoring a variety of attributes:

Physical volume summary
Physical volume and physical volume link status
Logical volume summary
Logical volume status
Logical volume copies

These attributes (or resources) are structured into a resource hierarchy. Figure 5-2 shows the EMS DVM's resource hierarchy. The resource instances, such as the volume group names or logical volume names, vary from system to system, based on the configuration. The EMS DVM requires monitor requests to be initiated on each local node; however, this also allows monitor requests to be customized to the local systems, so that only the critical disk resources need to be monitored.

Figure 5-2. EMS DVM resource hierarchy.

graphics/05fig02.gif

The physical volume summary provides a summary status of all physical volumes in a volume group. In determining the summary status, the EMS DVM notes which physical volumes are not up, and their associated physical volume group. The possible values for the physical volume summary status are shown in Table 5-1.

The physical volume status and physical volume link status are checked when the EMS DVM attempts to communicate with a device over a particular physical path. The volume group does not have to be active. Valid values are UP, DOWN, and BUSY. If you have configured redundant links to your physical volumes, and the primary link fails, the LVM switches transparently to an alternative path to the device. The remaining path then becomes a single point of failure (SPOF). If you are monitoring the physical volume link status, you will be notified and can take action to restore redundancy by re-establishing an alternative path.
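The link-status reasoning above can be sketched in a few lines of Python. This is only an illustration, not the DVM's implementation: the device paths and the result labels (`REDUNDANT`, `SPOF`, `SINGLE_PATH`, `NO_PATH`) are hypothetical, and the sketch assumes that only a link reporting UP counts as a usable path.

```python
from typing import Dict

# Link status values named in the text.
UP, DOWN, BUSY = "UP", "DOWN", "BUSY"


def assess_pv_links(link_status: Dict[str, str]) -> str:
    """Classify path redundancy for one physical volume.

    link_status maps a device path (hypothetical names) to UP, DOWN, or BUSY.
    Assumption: only UP links are treated as usable paths.
    """
    up_paths = [path for path, status in link_status.items() if status == UP]
    if not up_paths:
        return "NO_PATH"          # no usable path to the device
    if len(up_paths) == 1 and len(link_status) > 1:
        return "SPOF"             # redundancy was configured, one path is left
    return "REDUNDANT" if len(up_paths) > 1 else "SINGLE_PATH"


if __name__ == "__main__":
    # Primary link has failed; LVM would have switched to the alternate path.
    links = {"/dev/dsk/c0t5d0": DOWN, "/dev/dsk/c1t5d0": UP}
    print(assess_pv_links(links))
```

A `SPOF` result corresponds to the situation described above, in which I/O continues over the alternate path but an operator should restore the failed link.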

Table 5-1. DVM Status Values for Physical Volume Summary

Status	Physical Volume Summary Interpretation
`UP`	All physical volumes in this volume group are up
`PVG_UP`	At least one PVG exists for which all physical volumes are up
`SUSPECT`	Some PVs are not up; cannot conclude if all data is available
`DOWN`	Some data is not available

Table 5-2. DVM Status Values for Logical Volume Summary

Status	Logical Volume Summary Interpretation
`UP`	Active and all logical volumes are up
`DOWN`	Active and at least one logical volume is down
`INACTIVE`	Inactive; if `/etc/lvmtab` changes, status moves to this from `INACTIVE_DOWN`
`INACTIVE_DOWN`	Inactive, and the last time the volume group was activated, the status was down

The logical volume summary summarizes the LVM status values of all logical volumes in the volume group. The possible values for the logical volume summary status are shown in Table 5-2.

The logical volume status reports the LVM subsystem's status for the selected logical volume. Inactive logical volumes have an INACTIVE status. Active logical volumes have either an UP or DOWN status, depending on whether at least one copy of all the data is available.
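As a rough model of the status values described above and in Table 5-2, the following Python sketch derives a logical volume status and a volume group summary. The function names and parameters are hypothetical and are not part of EMS. The sketch also assumes that an active volume group is reported as DOWN only when at least one logical volume is DOWN.

```python
from typing import Iterable


def logical_volume_status(active: bool, complete_copies: int) -> str:
    """Status of one logical volume: INACTIVE, UP, or DOWN."""
    if not active:
        return "INACTIVE"
    # UP requires at least one complete copy of all the data.
    return "UP" if complete_copies >= 1 else "DOWN"


def logical_volume_summary(
    vg_active: bool,
    lv_statuses: Iterable[str],
    last_active_summary: str = "UP",
    lvmtab_changed: bool = False,
) -> str:
    """Summary status for a volume group, following Table 5-2."""
    if vg_active:
        return "DOWN" if any(s == "DOWN" for s in lv_statuses) else "UP"
    # Inactive volume group: remember a DOWN summary from the last
    # activation unless /etc/lvmtab has changed since then.
    if last_active_summary == "DOWN" and not lvmtab_changed:
        return "INACTIVE_DOWN"
    return "INACTIVE"


if __name__ == "__main__":
    statuses = [logical_volume_status(True, 2), logical_volume_status(True, 0)]
    print(statuses, logical_volume_summary(True, statuses))
```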

The EMS DVM uses logical volume copies to report the number of copies of data available within the logical volume. In this way, the failure of a mirror can be detected. The value returned by the monitor is the number of complete copies of the data that are available. In the screen shown in Figure 5-3, the DVM is being configured to monitor mirrored copies of all logical volumes (All Instances) in the vg00 volume group. Notification is specified to be sent to IT/O (via a proprietary method called opcmsg) when the number of copies drops below three.

Figure 5-3. Configuring the EMS DVM to monitor mirrored copies.

graphics/05fig03.gif
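The notification condition configured in Figure 5-3 (fewer than three complete copies) amounts to a simple threshold test. The Python sketch below shows that comparison only; the logical volume names and the `copies_below_threshold` helper are hypothetical, and the way the real monitor obtains the copy counts is not modeled.

```python
from typing import Dict, List


def copies_below_threshold(copies: Dict[str, int], threshold: int = 3) -> List[str]:
    """Return the logical volumes whose complete-copy count is below threshold."""
    return sorted(lv for lv, count in copies.items() if count < threshold)


if __name__ == "__main__":
    # Hypothetical copy counts for logical volumes in vg00.
    observed = {
        "/dev/vg00/lvol1": 3,
        "/dev/vg00/lvol2": 2,   # one mirror copy has been lost
        "/dev/vg00/lvol3": 3,
    }
    for lv in copies_below_threshold(observed):
        print(f"notify: {lv} has fewer than 3 complete copies")
```

Because the value is the number of complete copies, a count below the configured threshold signals a lost mirror even though the data itself is still available.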

On high-end HP-UX systems, numerous volume groups and logical volumes may be configured. Using the EMS Configuration GUI to configure each to be monitored may be time-consuming. The EMS Configuration GUI supports some wildcarding capabilities to make configuration easier. Also, an IT/O tool called monvols helps in configuring all the logical or physical volumes for a selected system. The monvols tool and other tools and templates for EMS are available for free on the Web at http://www.software.hp.com. Note, however, that monvols is available only for HP-UX 10.20.

EMS Hardware Monitors

EMS Hardware Monitors provide the ability to detect and report problems with system hardware resources, such as disk components. EMS Hardware Monitors help you to monitor for faults, such as device errors and component failures. They not only poll the hardware at regular intervals, but they also notify you of hardware errors in real time. EMS Hardware Monitors are delivered with HP-UX Online Diagnostics, which is freely available for HP-UX. EMS Hardware Monitors provide monitoring for the following Hewlett-Packard products:

AutoRAID disk arrays
High availability disk arrays
High availability Storage Systems
Fast/Wide SCSI disk arrays
Standalone SCSI and fibre channel disks
Fibre channel SCSI multiplexors
Fibre channel adapters
Fibre channel arbitrated loop hubs

These EMS Hardware Monitors are designed to provide a commonality in the configuration interface, event detection, and message formats, which include a detailed message describing the problem and how to fix it.

Table 5-3. Description of Hardware Event Severity

Severity	Description
Critical	An event that will or has already caused data loss, system downtime, or other loss of service. Immediate action is required to correct the problem. System operation will be affected and normal use of the hardware should not continue until the problem is corrected. If configured with MC/ServiceGuard, the package will experience failover.
Serious	An event that may cause data loss, system downtime, or other loss of service if left uncorrected. The problem should be repaired as soon as possible. System operation and normal use of the hardware may be affected. If configured with MC/ServiceGuard, the package will experience failover.
Major Warning	An event that could escalate to a more serious condition if not corrected. The problem should be repaired at a convenient time. System operation should not be affected, and normal use of the hardware can continue. If configured with MC/ServiceGuard, the package will not experience failover.
Minor Warning	An event that will not likely escalate to a more serious condition if left uncorrected. The problem can be repaired at a convenient time. System operation will not be interrupted, and normal use of the hardware can continue. If configured with MC/ServiceGuard, the package will not experience failover.
Information	An event that occurs as part of the normal operation of the hardware. No action is required. If configured with MC/ServiceGuard, the package will not experience failover.

The EMS Hardware Monitors can report low-level device errors encountered during an I/O with a device. They detect and report component and FRU failures, including fan and power supply problems. Protocol errors are also detected.

Hardware monitoring is critical for maintaining a system's high availability. For example, when using an AutoRAID disk array, two controllers may be used for high availability. The AutoRAID Monitor reports events that could indicate the failure of a controller. These events could result in a loss of hardware redundancy. Other events indicate a potential loss of data redundancy, such as the failure of a disk component or the failure to recover data redundancy after a component failure. Many events may indicate a SPOF and the loss of high availability. Notification of these events is critical so that the failure can be fixed to eliminate the risk of downtime.

Hardware events are assigned a severity level by the monitor. These severity levels reflect the potential impact of the event on system operation. Table 5-3 provides a description of each severity level.

EMS Hardware Monitor configuration is done using the Hardware Monitoring Request Manager. Notification conditions can be configured in a consistent way for all of the supported hardware resources on a system. As hardware is added to the system, monitoring can be enabled automatically. Figure 5-4 shows an example of using the Hardware Monitoring Request Manager to send e-mail notification of all critical and serious hardware events. To configure with MC/ServiceGuard, you need to use the MC/ServiceGuard configuration interface.

Figure 5-4. Using the Hardware Monitoring Request Manager to configure hardware monitoring.

graphics/05fig04.gif
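The severity levels in Table 5-3 form an ordered scale, so a request such as "notify on all critical and serious events" can be modeled as a minimum-severity filter. The Python sketch below is an illustration only: the numeric values, the `Severity` enum, and the helper names are assumptions, and the Hardware Monitoring Request Manager itself is configured through its own interface rather than through code like this.

```python
from enum import IntEnum
from typing import Iterable, List, Tuple


class Severity(IntEnum):
    """Severity levels from Table 5-3; the numeric values are an assumption,
    chosen only so that the levels compare in the right order."""
    INFORMATION = 1
    MINOR_WARNING = 2
    MAJOR_WARNING = 3
    SERIOUS = 4
    CRITICAL = 5


def causes_failover(severity: Severity) -> bool:
    """Per Table 5-3, serious and critical events fail over an
    MC/ServiceGuard package that depends on the resource."""
    return severity >= Severity.SERIOUS


def select_for_notification(
    events: Iterable[Tuple[Severity, str]],
    minimum: Severity = Severity.SERIOUS,
) -> List[Tuple[Severity, str]]:
    """Keep only events at or above the requested severity."""
    return [event for event in events if event[0] >= minimum]


if __name__ == "__main__":
    events = [
        (Severity.CRITICAL, "media failure on disk"),
        (Severity.MINOR_WARNING, "recovered read error"),
        (Severity.SERIOUS, "power supply failure"),
    ]
    for severity, text in select_for_notification(events):
        print(f"e-mail: [{severity.name}] {text}")
```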

When an EMS Hardware Monitor detects an event, a notification message is sent to the designated target locations. The message contains a full description, including the system on which the event occurred, the date and time when the event was detected, the hardware device on which the event occurred, a description of the problem, the probable cause, and a recommended action. Figure 5-5 shows the detailed message for a media failure event on a SCSI disk. Although not shown, the message contains more detailed information, including product/device identification information, I/O log event information, raw hardware status, SCSI status, and more.

Figure 5-5. Media failure event reported by an EMS Hardware Monitor.

graphics/05fig05.gif

Most EMS Hardware Monitors are "stateless." In other words, events of a designated severity are forwarded as soon as they are detected; no aspect of history or correlation with other data is provided, except that the monitor limits repeated messages by using a repeat frequency. Therefore, determining the current status of a device is difficult, because messages are sent only when an event occurs.

To monitor for hardware device state changes, you can use the Peripheral Status Monitor (PSM), which maintains the state of monitored hardware devices and reports state changes. The PSM gathers events from other EMS Hardware Monitors, but does not send its own notification unless a state change has occurred. By default, critical or serious events cause the PSM to change a device's status to DOWN.

For example, critical disk events from a disk EMS Hardware Monitor cause the PSM to change the device's status to DOWN. The PSM then sends a single "disk device status = down" event, if a user had requested that event. This may be the only message visible to the user. Additional disk failure messages would not be forwarded, because they did not result in a status change (in other words, the status remained DOWN). This reduces the number of events that need to be processed by the user. The last event received should reflect the current device status.

Most of the Hardware Monitors can't automatically detect when a device has been fixed. When a problem is solved, the set_fixed command must be used to alert the PSM manually to set the device status back to UP.
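The difference between stateless event forwarding and the PSM's state tracking can be illustrated with a small Python model. This is a sketch of the behavior described above, not the PSM's code or interface: the `StatusTracker` class and its methods are hypothetical, and `mark_fixed` merely stands in for the effect of running the set_fixed command.

```python
from typing import Dict, List

# By default, critical or serious events move a device to DOWN.
DOWN_SEVERITIES = {"CRITICAL", "SERIOUS"}


class StatusTracker:
    """Toy model of a state-keeping monitor: it notifies only on state changes."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self.notifications: List[str] = []

    def status(self, device: str) -> str:
        # Assumption: a device with no recorded events is considered UP.
        return self._status.get(device, "UP")

    def handle_event(self, device: str, severity: str) -> None:
        """Consume an event gathered from a stateless hardware monitor."""
        if severity.upper() in DOWN_SEVERITIES:
            self._set(device, "DOWN")

    def mark_fixed(self, device: str) -> None:
        """Manual reset after a repair (the role played by set_fixed)."""
        self._set(device, "UP")

    def _set(self, device: str, new_status: str) -> None:
        if self.status(device) != new_status:
            self._status[device] = new_status
            self.notifications.append(f"{device} status = {new_status}")


if __name__ == "__main__":
    tracker = StatusTracker()
    tracker.handle_event("disk0", "CRITICAL")
    tracker.handle_event("disk0", "CRITICAL")   # no new notification
    tracker.mark_fixed("disk0")
    print(tracker.notifications)
```

Two critical events produce a single DOWN notification, and the device stays DOWN until it is explicitly marked as fixed, which mirrors the behavior described in the preceding paragraphs.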

When using an enterprise management tool such as IT/O, which receives messages from multiple systems, you should use the PSM to reduce information overload. However, make sure that stateless events are also configured to go somewhere (such as the system log file), because they provide valuable diagnostic information when a component has failed.

You can learn more about the Hewlett-Packard diagnostic tools on its Web site at http://docs.hp.com/hpux/systems/.

HARAYMON and ARRAYMOND

The High Availability Disk Array Monitor (HARAYMON), available on HP-UX 9.x, notifies the system console of all failures of disk array FRUs in high availability disk array products. AutoRAID disk arrays have a similar monitor, called ARRAYMOND. Failures are reported to the system console and to a user-identified list of e-mail addresses (configured in an ASCII file). Each error message can optionally include the location of the array(s) connected to the host system, the event history of the array(s) connected to the host system, and the system administrator's phone number or mail stop.

Listings 5-10 and 5-11 show some messages displayed by HARAYMON for various error conditions. HARAYMON can detect disk failures, fan unit failures, controller module failures, power supply unit failures, and battery backup unit failures.

Listing 5-10 shows a disk module failure. Note that in addition to a timestamp and the failure type being displayed, the array and slot number (within the array) containing the failed device are also reported. The array containing the failed disk module is 48.0.1, and the slot number containing the failed disk is B2.

Listing 5-10 Output from HARAYMON showing disk module failure.

    ===============================
    Mon Nov 23 10:58:50 PST 1998
    High Availability Array Monitor
    ===============================
    Drive Failure
    Product ID HP A3232A_RAID_5
    Physical Device 48.0.1
    Disk Position: B2

Listing 5-11 shows a fan unit module failure. Specific product information is shown so that the proper component can be replaced.

Listing 5-11 Output from HARAYMON showing a fan unit module failure.

    ===============================
    Mon Nov 23 11:32:44 PST 1998
    High Availability Array Monitor
    ===============================
    Fan Unit Failure
    Product ID HP A3232A_RAID_5
    Physical Device 48.0.1
    Fan Number B
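If messages like those in Listings 5-10 and 5-11 need to be processed by a script, the fields can be pulled out with a regular expression. The Python sketch below is based only on the two sample messages shown here; the `parse_haraymon` function and its field names are hypothetical, and other failure types may use a layout that this pattern does not match.

```python
import re
from typing import Dict, Optional

# Pattern derived from the two sample listings; whitespace is normalized first,
# so it does not matter whether the message arrives on one line or several.
_MESSAGE = re.compile(
    r"=+\s+(?P<timestamp>.+?)\s+High Availability Array Monitor\s+=+\s+"
    r"(?P<failure>.+?)\s+Product ID\s+(?P<product>.+?)\s+"
    r"Physical Device\s+(?P<device>\S+)\s+"
    r"(?P<location_label>.+?):?\s+(?P<location>\S+)\s*$"
)


def parse_haraymon(message: str) -> Optional[Dict[str, str]]:
    """Return the message fields as a dict, or None if the layout is unknown."""
    normalized = " ".join(message.split())
    match = _MESSAGE.search(normalized)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = """
    ===============================
    Mon Nov 23 10:58:50 PST 1998
    High Availability Array Monitor
    ===============================
    Drive Failure
    Product ID HP A3232A_RAID_5
    Physical Device 48.0.1
    Disk Position: B2
    """
    print(parse_haraymon(sample))
```

The timestamp is kept as a string because the time zone abbreviation in the message cannot be parsed reliably without additional assumptions.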

HARAYMON and ARRAYMOND have been rendered obsolete on newer HP-UX systems (Release 10.20 or greater) by the EMS Hardware Monitors. You may want to disable the HARAYMON and ARRAYMOND daemons if you are using the EMS Hardware Monitors.

OpenView IT/Operations

IT/Operations (IT/O) is an OpenView application that provides central operations and problem management. NNM is included as part of the IT/O product. IT/O uses intelligent agents that run on each managed system to collect management information, messages, and alerts, and to send the information to a centralized console. After receiving events, IT/O can initiate automatic corrective actions. When an operator reads an individual message, guidance is given and actions may be suggested for further problem resolution or recovery.

IT/O has the following four main windows:

Node Bank: Displays iconographically the systems managed by an operator, and allows for them to be organized into node groups.
Message Groups: Displays logical message groups, such as performance, Oracle, or backup. The message groups serve as one way to organize messages in the Message Browser window.
Message Browser: Shows the events that have been received by the management server.
Application Desktop: Provides access to commonly used diagnostic and administrative applications.

IT/O comes with predefined monitors and templates. For monitoring disk resources, it has a monitor for the root filesystem for Hewlett-Packard, Sun, and other platforms. The predefined message template defines the message condition so that when the root filesystem usage exceeds 90 percent, a warning is sent to the IT/O Message Browser. The message condition defining this criterion is shown in Figure 5-6.

Figure 5-6. IT/O message condition template for monitoring root filesystem usage.

graphics/05fig06.gif

IT/O enables you to define and customize your own monitors and templates. The monitor for the root filesystem can be modified for other critical filesystems. Or, you can define your own monitors and templates to monitor MIB variables, such as the filesystem MIB variables mentioned earlier in this chapter. IT/O periodically queries the MIB object to determine whether a message should be generated. You can write a program or script that is periodically invoked by an IT/O agent. Templates and message conditions can be modified so that the operator is paged under certain conditions.
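A custom monitor script of the kind mentioned above might compute filesystem usage and compare it with the 90 percent threshold used by the predefined root filesystem template. The Python sketch below is a minimal, assumed example: the function names are hypothetical, it uses the standard `shutil.disk_usage` call rather than any IT/O interface, and it does not show how the value would be handed to the IT/O agent. The percentage is computed as used space divided by total space, which can differ slightly from the figure that df reports.

```python
import shutil
from typing import Optional


def usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_filesystem(path: str = "/", threshold: float = 90.0) -> Optional[str]:
    """Return a warning string if usage exceeds threshold, else None."""
    percent = usage_percent(path)
    if percent > threshold:
        return f"WARNING: {path} is {percent:.1f}% full (threshold {threshold:.0f}%)"
    return None


if __name__ == "__main__":
    # The same check can be applied to other critical filesystems by
    # adding their mount points here.
    for mount_point in ("/",):
        message = check_filesystem(mount_point)
        print(message or f"{mount_point}: usage within threshold")
```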

Many other tools plug in to IT/O to provide additional monitoring and management capabilities. One key example is the monvols EMS utility mentioned earlier in this chapter. This utility integrates into the IT/O Application Bank. It enables monitoring of all physical or logical volumes for all volume groups on a particular system.

IT/O is useful when an operator needs to manage numerous systems consistently. The disk monitoring template can be modified and then downloaded to a set of systems. In this way, multiple systems can be monitored identically.

Enterprise SyMON

Sun Microsystems has an enterprise management product called Enterprise SyMON, a management solution that includes monitoring tools for Sun platforms. It can be thought of as a scaled-down version of IT/O. SyMON provides an event browser and a detailed display of systems and hardware. A graphical view of the physical system layout can be shown. SyMON can be used to monitor the health of disk resources and to isolate hardware and software faults. It analyzes health information to predict potential disk hardware failures. For diagnostics, SyMON can launch the SunVTS diagnostic system or view the system log. Performance monitoring capabilities are also provided.

SyMON can be configured to send SNMP traps for events. Only hardware events are included, such as disk, memory, or tape failures. However, additional events can be generated by placing rules written in the Tool Command Language (Tcl) scripting language in a special directory. Recovery actions can be associated with an event.
