Event Monitoring Service | UNIX Fault Management: A Guide for System Administrators


The Event Monitoring Service (EMS) is a monitoring framework for HP-UX. It provides a common interface for monitor configuration and event notification. Monitors built with the EMS Developer's Kit are developed in a common way. Although the framework itself is freely available, some EMS monitors are sold separately, shipped with the products they support, or bundled with the system.

EMS monitors provide help primarily with fault and resource management. Performance monitoring generally requires more sophisticated tools. Some system fault and resource monitoring capabilities are provided by the HA Monitors product and EMS Hardware Monitors. Other EMS monitors enable you to detect when you are getting low on system resources, such as file descriptors, shared memory, and system semaphores.

Monitored Components

Multiple HP products include specific EMS resource monitors, so the resources available for monitoring vary depending on the customer's installed software products. The resources available for monitoring also vary from system to system because the hardware configuration is different.

EMS Hardware Monitors are provided with the Online Diagnostics bundle, available on the HP-UX support media, also known as the HP-UX Diagnostic/IPR Media. EMS Hardware Monitors provide the ability to detect and report problems with system hardware resources, such as device errors and component failures. Monitors are available for system hardware components and various HP storage products, including system memory, SCSI disk and tape devices, AutoRAID disk arrays, high availability disk arrays and storage systems, fast/wide SCSI disk arrays, various fibre channel components, tape autoloaders, and digital linear tapes.

The HA Monitors contain several EMS monitors for monitoring filesystem space, network interface status, disk status, and MC/ServiceGuard cluster status. HA Monitors also provide monitoring for CPU load and the number of system users. The monitors include database monitoring as well.

ATM adapters and HyperFabric adapters from HP can also be monitored using EMS. An EMS monitor is included with each of these products. These resources are also integrated with MC/ServiceGuard. HA ATM, for example, first attempts to perform a local recovery in the event of a failure, but if it is unable to provide local recovery, it notifies MC/ServiceGuard to trigger a failover to an alternate node.

All the different resources that can be monitored are contained in a single EMS resource hierarchy. The portion of the hierarchy containing system hardware and kernel resources is shown in Figure 3-3. You see the entire resource hierarchy when configuring monitor requests in the EMS GUI.

Figure 3-3. EMS resource hierarchy.

/system/numUsers
/system/jobQueue1Min
/system/jobQueue5Min
/system/jobQueue15Min
/system/events/memory/<instance>
/system/status/memory/<instance>
/system/filesystem/availMb/<filesystem_name>
/system/kernel_resource/process_management/nproc
/system/kernel_resource/file_system/nflocks
/system/kernel_resource/file_system/nfile
/system/kernel_resource/misc/ncallout
/system/kernel_resource/system_v_ipc/shared_memory/shmmni
/system/kernel_resource/system_v_ipc/semaphore/semmni
/system/kernel_resource/system_v_ipc/semaphore/semmns
/system/kernel_resource/system_v_ipc/message/msgmni
/system/kernel_resource/system_v_ipc/message/msgseg
/system/kernel_resource/system_v_ipc/message/msgtql
/storage/events/disks/default/<path>
/storage/events/tape/SCSI_tape/<path>
/storage/events/disk_arrays/High_Availability/<path>
/storage/events/disk_arrays/AutoRAID/<ID>
/storage/events/disk_arrays/FW_SCSI/<path>
/storage/events/enclosures/ses_enclosure/<path>
/storage/status/disks/default/<path>
/storage/status/tape/SCSI_tape/<path>
/storage/status/disk_arrays/High_Availability/<path>
/storage/status/disk_arrays/AutoRAID/<ID>
/storage/status/disk_arrays/FW_SCSI/<path>
/storage/status/enclosures/ses_enclosure/<path>
/adapters/status/FC_adapter/<path>
/adapters/events/FC_adapter/<path>
/net/interfaces/lan/status/<interface_name>
/net/interfaces/switched/atm/<emulated_lan_name>
/net/interfaces/clic/status/<instance>
/net/subnetwork/osi/x25subnet/status/<x25_instance>
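Because every resource name in Figure 3-3 is a slash-separated path, the hierarchy can be treated as a tree, which is how a browsing tool would present it. The following Python sketch is illustrative only and is not part of EMS; the function names build_hierarchy and children are hypothetical, and the path list is a small subset of the figure.

```python
def build_hierarchy(paths):
    """Fold slash-separated resource paths into a nested dict tree."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            # Create the child level if it does not exist yet, then descend.
            node = node.setdefault(part, {})
    return root


def children(tree, path):
    """Return the sorted child names directly under a path such as '/system'."""
    node = tree
    for part in filter(None, path.split("/")):
        node = node[part]
    return sorted(node)


# A few paths copied from Figure 3-3.
PATHS = [
    "/system/numUsers",
    "/system/jobQueue1Min",
    "/system/kernel_resource/file_system/nfile",
    "/system/kernel_resource/file_system/nflocks",
    "/net/interfaces/lan/status/<interface_name>",
]

tree = build_hierarchy(PATHS)
print(children(tree, "/"))  # ['net', 'system']
print(children(tree, "/system/kernel_resource/file_system"))  # ['nfile', 'nflocks']
```

Descending one level at a time in this way mirrors how an operator drills down from a resource class to a specific resource instance.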

Monitoring Features

EMS is designed for use in high availability environments. The user can select only the critical components to be monitored, so the resource monitor will not be delayed polling for information from noncritical components. This is different from other tools, such as IT/O, which typically gather data from all components. The IT/O agent executes a system-wide command and then parses the output. Thus, if a noncritical component hangs, the IT/O agent could be delayed unnecessarily waiting for it to respond. By comparison, the EMS monitor queries only the critical components.

The EMS event management libraries have been used by a number of system components on HP-UX to provide monitoring. IBM recently announced a similar set of event management routines, called Phoenix, for its AIX environments, which it hopes will be adopted by its software development partners. Sun does not provide event management libraries for Sun Solaris.

EMS is the only method for allowing a resource monitor to send events to MC/ServiceGuard, which again reflects its emphasis on high availability. EMS can also send notifications via SNMP, opcmsg, TCP, UDP, and e-mail, and can write to the console, a specified text file, or another application.

EMS provides a few key functions. It is meant to provide a consistent way for a user to enable monitoring of different system components, which is done through the EMS SAM GUI, the MC/ServiceGuard SAM GUI, the Hardware Request Manager (monconfig), and a set of EMS library routines.

When configuring monitoring conditions, you have the option to request notification at every polling interval, when the value changes, or when some configured threshold has been met. The polling interval can range from 30 seconds to 1 day.
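The three notification conditions and the polling interval limits described above can be sketched as a small decision function. This Python sketch is not the EMS API: the names validate_interval and should_notify, the mode strings, and the "at or above" threshold comparison are all assumptions made for illustration. It also assumes that the first sample, which has no previous value, does not count as a change.

```python
MIN_INTERVAL_S = 30             # shortest polling interval given in the text
MAX_INTERVAL_S = 24 * 60 * 60   # longest polling interval given in the text (1 day)


def validate_interval(seconds):
    """Reject polling intervals outside the 30-second to 1-day range."""
    if not MIN_INTERVAL_S <= seconds <= MAX_INTERVAL_S:
        raise ValueError(f"polling interval {seconds}s is out of range")
    return seconds


def should_notify(mode, current, previous=None, threshold=None):
    """Decide whether one polled sample should produce a notification."""
    if mode == "every_interval":
        return True                       # notify at every polling interval
    if mode == "on_change":
        # Assumption: the very first sample is not treated as a change.
        return previous is not None and current != previous
    if mode == "threshold":
        # Assumption: the threshold is met when the value is at or above it.
        return current >= threshold
    raise ValueError(f"unknown notification mode: {mode}")


print(should_notify("on_change", current=5, previous=4))      # True
print(should_notify("threshold", current=10, threshold=90))   # False
```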

You can configure additional information to be sent along in an event, providing customization for the specific environment in which EMS is used. Also, when events occur, the resource monitor can include additional resource-specific information to aid problem diagnosis.

EMS was created with system performance in mind. An EMS resource monitor runs only if a user has asked to monitor at least one resource instance. Also, multiple resource instances can be monitored concurrently by the same monitor process. EMS APIs are provided to allow the monitor to check a resource at the appropriate time interval. Monitors that receive resource information asynchronously don't need to use these APIs and can thus operate more efficiently. This can allow for event notification to be received in microseconds, without paying the system performance penalty of frequent polling.

EMS alarms are configured separately for each client. The alarms of other products, such as MeasureWare, are system-wide. EMS can detect problems more quickly than other tools that need to wait for summary time intervals to expire. However, EMS doesn't have the concept of durations or compound conditions, which can be associated with alarms.

EMS requests are per client. A target "user" can receive customized event data, and another "user" can receive different data for the same event. EMS monitor data enables the monitor to provide customized data for an event.

To ensure that monitored resources continue to be monitored, EMS provides a "persistence client." The EMS persistence client detects when a monitor fails and automatically restarts it.
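The text does not describe how the persistence client is implemented, so the following Python sketch shows only the general idea of detecting a failed process and restarting it. The Supervisor class is hypothetical, and the child process here is a trivial Python command that exits immediately, standing in for a monitor that has failed.

```python
import subprocess
import sys


class Supervisor:
    """Hypothetical stand-in for a persistence client watching one monitor."""

    def __init__(self, argv):
        self.argv = argv
        self.restarts = 0
        self.proc = subprocess.Popen(argv)

    def check(self):
        """Restart the child if it has exited; return True if it was restarted."""
        if self.proc.poll() is None:
            return False                  # still running, nothing to do
        self.proc = subprocess.Popen(self.argv)
        self.restarts += 1
        return True


# The child exits at once, simulating a monitor that has died.
monitor = Supervisor([sys.executable, "-c", "pass"])
monitor.proc.wait()
restarted = monitor.check()
monitor.proc.wait()                        # reap the restarted child
print(restarted, monitor.restarts)         # True 1
```

A real supervisor would call check() on a timer and would normally limit how often a repeatedly failing monitor is restarted.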

Monitor Discovery and Configuration

EMS provides a consistent GUI for the discovery and configuration of resources that can be monitored. The EMS GUI, available from SAM, can automatically discover the set of resource monitors available on a system. Resource instances can vary from system to system based on hardware and software configuration. Using EMS, you can define conditions indicating when notification events should be sent. Notifications can be sent at periodic intervals, when a component's state changes, or when a threshold condition is met. EMS also allows you to configure where events should be sent.

EMS requires monitor requests to be initiated on each local node; however, this also allows monitor requests to be customized to the local systems so that only the critical resources need to be monitored. EMS should not be considered an enterprise framework because it lacks the ability to enable monitoring across multiple systems easily.

Using the EMS GUI to configure each resource instance to be monitored may be time-consuming. The EMS Configuration GUI supports some wildcarding to make the configuration easier. Also, the EMS Hardware Monitors available with Online Diagnostics come preconfigured. If you are using IT/O, a tool called monvols is available to help you configure all the logical or physical volumes for a selected system. The monvols tool and other tools and templates for EMS are available for free when downloading the EMS Developer's Kit from the Web at http://www.software.hp.com/products/EMS. Note, however, that monvols is available only for HP-UX 10.20.

To configure most EMS monitors, you use SAM. From its Resource Management functional area, you can select the Event Monitoring Service to launch the EMS GUI, and then add a new monitoring request. The initial configuration screen is shown in Figure 3-4.

Figure 3-4. Initial EMS configuration screen.


MC/ServiceGuard provides one GUI, as well as command-line and configuration-file options, to configure packages to be dependent on EMS resources. In this case, event notification is sent directly from a resource monitor to MC/ServiceGuard.

The EMS configuration tools enable a user to configure a resource or set of resources to be monitored. The user chooses the type of notification desired based on applications available in the customer environment. For example, if events should be sent to an OpenView management station, the user can choose SNMP notification. SNMP can also be used to send events to other management stations, such as IBM NetView, Tivoli, or CA Unicenter TNG. If IT/O is being used, opcmsg notification is available on IT/O managed nodes. Although EMS is often used with a management platform such as HP OpenView, HP OpenView is not required. For example, if a customer has written a custom fault-recovery application, event notification could be received directly by the application, using TCP or UDP.
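As a rough illustration of the receiving side of a custom fault-recovery application, the Python sketch below opens a UDP socket and reads one datagram. The format of an EMS event message is not described here, so the payload is treated as opaque bytes. The function names, the loopback address, and the placeholder payload are assumptions, and the sketch sends a datagram to itself only so that it can run without EMS.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Bind a UDP socket; port 0 asks the operating system for a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(5.0)          # avoid blocking forever if nothing arrives
    return sock


def receive_event(sock, max_bytes=65535):
    """Wait for one datagram and return its raw payload and the sender address."""
    data, sender = sock.recvfrom(max_bytes)
    return data, sender


listener = open_listener()
port = listener.getsockname()[1]

# Stand-in for a notification: a placeholder payload sent over loopback.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
    sender.sendto(b"placeholder event payload", ("127.0.0.1", port))

payload, origin = receive_event(listener)
listener.close()
print(payload)
```

In practice the application would listen on the host and port named in the monitoring request, loop over incoming events, and start its own recovery logic for each one.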

The EMS configuration tools also enable you to dynamically browse the list of resources that can be monitored. This allows a resource monitor to monitor different resource instances on different systems, and provides a standard interface for operators to find available monitors. You can then customize the monitoring for each system. Although EMS provides more flexibility, it is also more difficult to configure, because configuration is generally done once per system.

First, the operator browses through the available resources. Then, he or she selects a resource to monitor and specifies the monitoring parameters, such as thresholds and polling intervals. After the monitor request is made, the operator is returned to the main EMS screen, where all active EMS monitor requests are displayed.

EMS hardware monitor configuration is done by using the Hardware Monitoring Request Manager, monconfig. Notification conditions can be configured in a consistent way for all the supported hardware resources on the system. As hardware is added to the system, monitoring can be enabled automatically and consistently with the way similar hardware is being monitored.

If requesting SNMP trap or opcmsg notification, additional configuration is usually required on the management station. EMS provides templates that can be used in OpenView NNM or IT/O to recognize and format traps and opcmsg notifications for viewing in their respective event browsers.

Monitor Developer's Kit

EMS provides a Developer's Kit to make it easy for customers and third parties to integrate or write their own EMS monitor programs. Customers can define their own important resources to be monitored. Using the EMS Developer's Kit also ensures that monitors will behave in a standard way.

The EMS Developer's Kit includes the necessary header files and libraries to write a custom resource monitor. Monitoring APIs are provided for the monitor to receive requests, wait during polling intervals, and send notifications. A sample monitor is also provided. The mechanism to actually check the value of a resource is resource-dependent, but APIs are provided to determine whether notification should be sent based on the value of a resource. The monitor provider does not need to write code to send different types of notification, such as SNMP traps or e-mail; this is handled by the EMS framework.
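The division of labor described above, in which the monitor author supplies the resource check while the framework handles delivery, can be sketched as follows. This is not the Developer's Kit API, whose actual function names are not given here. The ResourceMonitor class, its methods, the sample values, and the "at or above" threshold test are all hypothetical.

```python
class ResourceMonitor:
    """Hypothetical monitor skeleton: the author writes only read_value."""

    def __init__(self, read_value, notify):
        self.read_value = read_value   # resource-specific check, written by the author
        self.notify = notify           # stands in for framework-provided delivery
        self.requests = []

    def add_request(self, resource, threshold):
        """Record a monitoring request for one resource instance."""
        self.requests.append((resource, threshold))

    def poll_once(self):
        """Check every requested resource once and notify where the threshold is met."""
        for resource, threshold in self.requests:
            value = self.read_value(resource)
            if value >= threshold:     # assumption: threshold met at or above
                self.notify(resource, value)


# Made-up readings standing in for a real resource check.
SAMPLE_VALUES = {"/system/numUsers": 120, "/system/jobQueue1Min": 1}
sent = []

monitor = ResourceMonitor(
    read_value=SAMPLE_VALUES.__getitem__,
    notify=lambda resource, value: sent.append((resource, value)),
)
monitor.add_request("/system/numUsers", threshold=100)
monitor.add_request("/system/jobQueue1Min", threshold=5)
monitor.poll_once()
print(sent)  # [('/system/numUsers', 120)]
```

Because notify is passed in rather than written by the monitor author, the same monitor code works regardless of whether the event ends up as an SNMP trap, an e-mail, or something else.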

All the products in this chapter provide some mechanism to allow you to add your own resource monitors. EMS monitors are more difficult to develop because they require you to write a program instead of merely writing some scripts. EMS, however, provides more flexibility, because its monitors do not have to rely on HP-UX commands to gather data.

Information on using the EMS Developer's Kit is included with other software that is downloadable from the HP software Web site. You can find the Developer's Kit along with HP OpenView templates and other tools at http://www.software.hp.com, under the High Availability Software product category. EMS manuals are available at http://docs.hp.com/hpux/ha.

Notification Methods

EMS was designed specifically to support monitors for high availability resources. Consequently, it supports notification by using an HP-proprietary interface to MC/ServiceGuard. However, EMS can be used without MC/ServiceGuard and it supports an unusually large variety of notification options, including:

SNMP traps
IT/O's RPC mechanism (opcmsg)
E-mail
Logging to the system log file, the console, or a specific log file
UDP or TCP messages

Notification is sent on a per-request basis, which provides additional flexibility. Several EMS events can be managed concurrently, as shown in this example:

Client A requests Event 1 via SNMP
Client A requests Event 2 via opcmsg
Client B requests Event 1 via TCP
Client B requests Event 3 via TCP
Client C requests Event 3 via opcmsg

Notification is then sent to the specified clients when the events occur.
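The per-request example above amounts to a lookup table from events to (client, method) pairs. The Python sketch below encodes exactly those five requests. The data layout and the function names build_routes and deliveries_for are illustrative assumptions, not EMS structures.

```python
from collections import defaultdict

# The five requests from the example: (client, event, notification method).
REQUESTS = [
    ("A", 1, "SNMP"),
    ("A", 2, "opcmsg"),
    ("B", 1, "TCP"),
    ("B", 3, "TCP"),
    ("C", 3, "opcmsg"),
]


def build_routes(requests):
    """Group requests by event so each event knows whom to notify and how."""
    routes = defaultdict(list)
    for client, event, method in requests:
        routes[event].append((client, method))
    return routes


def deliveries_for(routes, event):
    """Return the (client, method) pairs to notify when an event occurs."""
    return list(routes.get(event, []))


routes = build_routes(REQUESTS)
print(deliveries_for(routes, 1))  # [('A', 'SNMP'), ('B', 'TCP')]
print(deliveries_for(routes, 3))  # [('B', 'TCP'), ('C', 'opcmsg')]
```

When Event 1 occurs, Client A is notified by SNMP and Client B by TCP; Client C, which did not request Event 1, receives nothing.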

Diagnostic Capabilities

EMS enables the monitor to provide arbitrary data, up to 10,000 bytes, along with an event notification. This information can be unique for each event notification and it can contain vital diagnostic information. For example, a disk failure event could include the serial number of the failed disk device in its monitored data area. Additional text could describe how to fix a failed component. Monitor providers taking advantage of this capability must document how operators should interpret this information.
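A monitor that attaches diagnostic data must keep it within the 10,000-byte limit. The Python sketch below shows one way to assemble and check such a payload. The "key: value" text layout, the UTF-8 encoding, the function name, and the serial number are all assumptions, since the only stated constraint is the size limit on otherwise arbitrary data.

```python
MAX_MONITOR_DATA_BYTES = 10_000   # size limit stated in the text


def build_monitor_data(fields):
    """Encode diagnostic fields as 'key: value' lines and enforce the size limit."""
    text = "\n".join(f"{key}: {value}" for key, value in fields.items())
    data = text.encode("utf-8")
    if len(data) > MAX_MONITOR_DATA_BYTES:
        raise ValueError(
            f"monitor data is {len(data)} bytes; limit is {MAX_MONITOR_DATA_BYTES}"
        )
    return data


# Hypothetical disk failure details; the serial number is made up.
data = build_monitor_data({
    "serial_number": "HYPOTHETICAL-0001",
    "action": "Replace the failed disk",
})
print(len(data))
```

Whatever layout a monitor provider chooses, it is this layout that the provider must document so that operators can interpret the data.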

EMS Hardware Monitors provide detailed information about the cause of an event and give recommended actions.

Unlike other products, such as Unicenter TNG, EMS doesn't provide the ability to take automated actions in response to events. However, you can have events sent to your own customized fault-recovery application by configuring the TCP or UDP notification methods. If EMS is used with IT/O, then you can also configure templates and recovery actions into IT/O.

Additional Information

The EMS framework is freely available from HP with the Online Diagnostics on the support media and the application CD-ROMs.

For the latest information on EMS, check the HP High Availability Web site at http://www.datacentersolutions.hp.com/2_3_index.html. The EMS manuals and release notes are available at http://docs.hp.com, under High Availability.

You can learn more about the Hewlett-Packard diagnostic tools, including the EMS Hardware Monitors, on the Web at http://docs.hp.com/hpux/systems.
