Basic Event Management Functions: Reducing the Noise and Boosting the Signal


Event management must deal with a flood of alerts and select those that actually matter. A dynamic environment with multiple instrumentation points generates different views of the same behaviors. As an example, consider a case where a key database server fails. The database server may have sent an alarm before it crashed, but there is no guarantee that it did so. Active collectors (which were discussed in Chapter 4), or probes, also report a failure after they execute the next virtual transaction against that server.

The active measurements offer an administrator the assurance of independently detecting the problem; however, this approach also generates multiple reports of the same failure. Subsequent measurements will generate another flurry of alarms if the server is still down. Customers wanting that service will trigger more alerts when they cannot connect to the server. Hundreds of alerts can arrive within a small time interval. Effective event management picks the server problem out as a single occurrence for further treatment.

The following sections discuss the various techniques that remove extraneous information, add value to the remainder, and determine the subsequent actions.

Table 5-1 summarizes the event management functions that are discussed in subsequent subsections. It is important to remember that some of these functions might be embedded in instrumentation that you will encounter.

Table 5-1. The Basic Event Management Functions

  • Volume reduction: Prevents data overload by using roll-up, de-duplication, and intelligent monitoring

  • Artifact reduction: Eliminates wasted effort and time by using verification, filtering of single alerts, and correlation of multiple alerts

  • Business impacts: Improves decision making by protecting critical services

  • Prioritization: Focuses on the most important situations

  • Activation: Automates responses, speeds resolution, and improves accuracy

  • Coordination: Integrates alerts and builds automated processes


Although vendor packaging choices blur the lines between instrumentation and event management, the important concept to understand is that the functionality is needed regardless of the specific packaging. The early enterprise management platforms, such as those offered by Hewlett-Packard, Tivoli, and Computer Associates, positioned themselves as a single point for event management; they consequently have a range of functions for processing alerts. Other companies, such as BMC Software and Micromuse, have added similar capabilities, while smaller vendors may offer a limited set of event management functions.

Volume Reduction

Simply reducing the alert volume can be very helpful. A single problem can generate hundreds of alerts reporting the same situation, yet only one alert is necessary to note the database server failure and to start recovery procedures.

There are different methods of reducing the alert volume: roll-up, de-duplication, and intelligent monitoring.

Roll-Up Method

Hierarchical collector structures reduce alert volumes by rolling them up from one level to the next. The aggregators described in Chapter 4 are a natural place for implementing this alert compression. In Figure 5-2, three collectors are using virtual transactions against the same server. If the server is congested, all collectors forward a server slow alert to the aggregator. The aggregator simply passes a single server slow alert downward to the event manager or another level in the instrumentation hierarchy.

Figure 5-2. Using Roll-Up to Reduce Alert Volume
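To make the roll-up idea concrete, the following minimal Python sketch shows an aggregator that forwards the first report of a condition and suppresses the duplicates. The class and method names are invented for illustration and are not drawn from any product.

```python
# Illustrative sketch of alert roll-up at an aggregator (names are hypothetical).

class Aggregator:
    def __init__(self, downstream):
        self.downstream = downstream      # event manager or next aggregation level
        self.open_alerts = set()          # (source, condition) pairs already forwarded

    def receive(self, server, condition):
        """Called for each alert arriving from a collector."""
        key = (server, condition)
        if key in self.open_alerts:
            return                          # duplicate report of a known condition: suppress
        self.open_alerts.add(key)
        self.downstream(server, condition)  # forward a single rolled-up alert

    def clear(self, server, condition):
        """Called when the condition is resolved, so future alerts pass again."""
        self.open_alerts.discard((server, condition))

# Three collectors reporting the same congested server yield one downstream alert.
agg = Aggregator(downstream=lambda s, c: print(f"ALERT {s}: {c}"))
for collector in ("probe-1", "probe-2", "probe-3"):
    agg.receive("db-server-1", "server slow")   # only the first call is forwarded
```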


De-duplication

A failure may generate a multitude of virtually identical alarms and events that can be consolidated into one alarm by de-duplication. For example, a router failure may spawn a large number of alarms about dropped connections. De-duplication keeps a single event and adds information to it, indicating the number of similar alarms it represents.
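A de-duplication pass can be sketched in a few lines of Python; the field names here are hypothetical.

```python
# Hypothetical de-duplication: fold identical alarms into one event with a count.
from collections import defaultdict

def deduplicate(alarms):
    """Collapse alarms that share (source, message) into single events,
    annotated with how many raw alarms each one represents."""
    counts = defaultdict(int)
    for alarm in alarms:
        counts[(alarm["source"], alarm["message"])] += 1
    return [{"source": s, "message": m, "count": n}
            for (s, m), n in counts.items()]

raw = [{"source": "router-7", "message": "connection dropped"}] * 250 + \
      [{"source": "router-7", "message": "interface down"}]
for event in deduplicate(raw):
    print(event)   # two consolidated events, one carrying count=250
```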

Intelligent Monitoring

Adaptive instrumentation (Chapter 4) provides the flexibility for intelligently monitoring situations, and it thereby reduces alert volumes. Consider the example at the beginning of this section. Active collectors have reported a database server failure. It may take some time for the failover procedures to complete and even longer to resolve the problem. If the active collectors continue monitoring, they only add to network and alert loads without adding any new information.

Adaptive instrumentation helps the situation by continuing measurements and not generating any further alerts until the virtual transaction indicates that service is restored. A different alert then informs the management system that the service is again healthy. The elapsed time between the failure and restoration alerts measures the outage.
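The suppress-until-restored behavior might look something like the following Python sketch; the names and structure are illustrative only.

```python
# Sketch of adaptive alert suppression during a known outage (illustrative only).
import time

class ServiceMonitor:
    def __init__(self, name, emit):
        self.name, self.emit = name, emit
        self.down_since = None            # None while the service is healthy

    def record_measurement(self, ok):
        """Called after each virtual transaction; ok=False means it failed."""
        if not ok and self.down_since is None:
            self.down_since = time.time()             # first failure: alert once
            self.emit(f"{self.name} FAILED")
        elif ok and self.down_since is not None:
            outage = time.time() - self.down_since    # elapsed failure-to-restore
            self.down_since = None
            self.emit(f"{self.name} RESTORED after {outage:.0f}s")
        # all other measurements are silent: no new information, no new alerts

mon = ServiceMonitor("db-service", emit=print)
for ok in (True, False, False, False, True):   # repeated failures stay quiet
    mon.record_measurement(ok)
```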

Additional reduction is possible with deeper knowledge of the service topology. Dependencies can indicate that for some failures, downstream monitoring will not be productive. For example, service behavior can be monitored in steps or in smaller parts of the entire transaction. If a step fails, monitoring those that follow does not yield any useful information until the failed step is repaired. The same approach of monitoring but not generating new alerts is used to detect the restoration of the service step.

Artifact Reduction

There are techniques to reduce the raw alert volume by eliminating artifacts, which are measurements that falsely imply an important problem where none actually exists. Response time, for example, could be slowed while network routers are recalculating routing options after a failure or topology change. The next transaction has satisfactory response time after the routing system has stabilized. There is no value in notifying the transaction manager of this artifact. There is nothing to chase and correct because the transient behavior of the routing system has ceased.

A transaction could also be lost or timed out while a server in a tier fails and is replaced. Specific transactions might be lost in the failed server, but after the replacement is operating, operations resume at satisfactory levels.

A similar situation arises when an occasional lost packet triggers an alert because a transaction failed or timed out. Further checking, however, usually finds operations proceeding within the normal range of behaviors.

Large numbers of artifacts can consume large amounts of staff time and divert effort from other tasks. It is difficult for humans to identify all the artifacts and ignore them when appropriate to do so.

There are approaches to help reduce the number of artifacts that slip through:

  • Verification

  • Filtering

  • Correlation

The text discusses each in turn.

Verification

Quickly verifying that an incoming alert is reporting an actual problem is an effective first step in eliminating artifacts. For example, an active collector can be used to run a transaction that has been reported as noncompliant. The active measurement establishes whether the problem persists and is repeatable; if it is, further attention might be warranted. The initial measurement is treated as an artifact if the test doesn't reveal a problem.

Using a repeat failures filter for simple thresholds can help discriminate noise from real failure conditions by requiring that several successive measurements exceed the threshold before an alert is issued. For instance, if the interval between virtual transactions is 5 minutes and the rule requires two repeated failures, it will take 10 minutes to forward an alert. Using criteria for successive measurements frees the system from responding to a single blip that later cannot be found.

A more proactive form of verification uses active measurements after the initial alert is received. Verification with an immediate series of virtual transactions clarifies the situation quickly. Successive failures are detected in less than a minute, rather than waiting 10 minutes to attack the problem.

For example, consider a customer who is verifying the response time of a remotely hosted service. Suppose an active measurement device periodically initiates a virtual transaction, perhaps every 10 minutes. If one of these virtual transactions exceeds the specified response time, the measurement device immediately sends a series of closely spaced virtual transactions. If those complete successfully, no further action is necessary.

If the problem persists, the customer management system notifies the provider and begins tracking the provider response until the problem is resolved and the customer verifies that acceptable service levels are restored.
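One possible shape for this verification burst, sketched in Python with invented function names and canned response times standing in for real virtual transactions:

```python
# Illustrative verification burst: confirm a slow transaction before alerting.
# run_transaction() stands in for whatever virtual-transaction probe is in use.
import time

def verify_with_burst(run_transaction, threshold, probes=3, spacing=5.0):
    """After an over-threshold measurement, send a short series of closely
    spaced virtual transactions. Return True only if they all stay slow,
    so an isolated blip is treated as an artifact."""
    for _ in range(probes):
        time.sleep(spacing)                  # seconds between verification probes
        if run_transaction() <= threshold:   # response time back within target
            return False                     # artifact: no further action
    return True                              # problem persists: raise the alert

# Example with a canned sequence of response times (seconds).
samples = iter([4.1, 4.3, 0.8])              # recovers on the third probe
if verify_with_burst(lambda: next(samples), threshold=2.0, spacing=0):
    print("raise alert and notify the provider")
else:
    print("treated as artifact")
```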

The collector sending the alert should be used for verification whenever possible because the environment will be more consistent. Using a collector in a different location might change the results, depending on the location of the problem. On the other hand, using multiple monitors from multiple locations can provide some diagnostic triangulation; noting that a problem is detected from one side of the network but not the other can aid in problem isolation.

Filtering

Filtering is the application of rules to a single alert source over some time interval. Figure 5-3 illustrates the application of rules concerning measurements exceeding a specified response-time threshold. This is more sophisticated than a check for successive over-threshold measurements; it is an X out of Y process instead. That is, within a set of Y transactions, any X that are slow constitute an alert. The figure illustrates a three out of eight condition: any three over-threshold measurements out of eight will trigger an alert.

Figure 5-3. Simple Filtering to Remove Transient Behaviors
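An X out of Y filter can be sketched with a sliding window, as in the following Python fragment; it is illustrative, not a production implementation.

```python
# Sketch of an X-out-of-Y filter (three slow measurements in any eight trigger).
from collections import deque

class XOutOfYFilter:
    def __init__(self, x=3, y=8, threshold=2.0):
        self.x, self.threshold = x, threshold
        self.window = deque(maxlen=y)        # state kept between measurements

    def add(self, response_time):
        """Record one measurement; return True when the filter fires."""
        self.window.append(response_time > self.threshold)
        return sum(self.window) >= self.x

f = XOutOfYFilter()
times = [1.0, 2.5, 1.1, 1.2, 2.8, 1.0, 1.3, 2.6]   # three slow out of eight
for t in times:
    if f.add(t):
        print(f"alert: {sum(f.window)} of last {len(f.window)} over threshold")
```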


Note that these filtering rules require state to be maintained between measurements. They should be selectively applied to a small number of sources to avoid loading the event manager. Using simple counters places a lighter processing load on the event manager, but at the expense of being less able to exploit more effective filtering rules.

Correlation

Filtering tracks a single source over a period of time and eliminates artifacts as a result. Correlation, in contrast, works with a number of alert sources simultaneously (or within short intervals). As mentioned, some types of failures or disruptions trigger many additional alerts. Correlation works with this flood of alerts and removes the secondary artifacts, which are those alerts caused by another problem. For instance, the database server failure results in many reports of failed transactions. Administrators will waste valuable time looking at each service with a problem rather than addressing the cause of all the secondary artifacts.

Correlation is more powerful than filtering because it identifies the most likely cause of a flurry of alerts. The accuracy speeds problem resolution and reduces staff disruption.

Correlation is also more complicated than filtering because it deals with multiple, independent alert sources. Correlation depends on understanding the relationships among various service elements. Essentially, it is the rule of cause and effect. (If you cannot reach a router, you cannot reach the networks that are connected to it, for example.)

Building the appropriate information for a correlation engine is a challenge. Early correlation engines, such as the Tivoli Enterprise Console and the Veritas NerveCenter, were powerful, but they often became shelfware (software that wasn't used in production) because of their complexity. Increasing dynamism of the managed environment increased the staff burden because the rules required more frequent updating by experts. In the end, organizations simply could not afford to use these powerful tools.

Correlation approaches use techniques such as time correlation, which examines (near) simultaneous alarms and determines whether they are related. This often reveals a large number of problems. For example, in one situation that I've seen, server performance would suddenly degrade without any signs that the server itself had been changed. A simple time correlation revealed that packet losses increased just before the server had performance problems. It turned out that higher levels of lost packets were leading to timeouts, and dropped connections were causing extra processing and resource conflicts in the server.
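A naive version of time correlation simply looks for alarm types that repeatedly appear shortly before the alarm of interest, as in this illustrative Python sketch:

```python
# Hypothetical time correlation: find alarms that repeatedly precede another
# alarm type within a short interval, hinting at a causal relationship.
def time_correlate(alarms, effect, window=60):
    """alarms: list of (timestamp_seconds, alarm_type), sorted by time.
    Returns alarm types seen within `window` seconds before each `effect`."""
    candidates = []
    for i, (t, kind) in enumerate(alarms):
        if kind != effect:
            continue
        for t2, kind2 in alarms[:i]:
            if t - window <= t2 < t and kind2 != effect:
                candidates.append(kind2)
    return candidates

history = [(100, "packet loss high"), (130, "server degraded"),
           (400, "packet loss high"), (430, "server degraded")]
print(time_correlate(history, effect="server degraded"))
# ['packet loss high', 'packet loss high'] -> worth investigating as the cause
```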

Another correlation approach is the matching of a problem signature. Experience indicates that a certain set of alerts appearing at (nearly) the same time points to a specific type of problem. The appropriate responses are activated after the problem signature is matched.

Matching signatures is effective, but it has some drawbacks as well. One drawback is creating the signatures in the first place. Management vendors usually do this by hiring experts to define the signatures and the associated symptoms and cures. The labor and need for detailed expertise mean that most customers will not be able to build their own signatures for other service management situations.

Signature matching is also more complex for the correlation engine. Parts of the signature can arrive in any order, and they can change the tentative diagnosis as more parts arrive. The correlation engine needs to remember more state and use a time interval to build the signature. (Signatures are discussed in Chapter 6, "Real-Time Operations," in the context of sophisticated real-time operations tools.)
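A much-simplified sketch of signature matching, assuming each signature is just a set of symptom alerts that must all appear within the current collection window; the signature contents here are invented.

```python
# Sketch of problem-signature matching: a known problem is declared when all
# of its symptom alerts appear within one collection interval (names invented).
SIGNATURES = {
    "database server down": {"db connect failed", "transaction timeout",
                             "ping unreachable"},
}

def match_signature(recent_alerts):
    """recent_alerts: set of alert types seen in the current time window.
    Symptoms can arrive in any order; only the complete set matters."""
    for problem, symptoms in SIGNATURES.items():
        if symptoms <= recent_alerts:        # all symptoms present
            return problem
    return None

window = {"transaction timeout", "ping unreachable", "db connect failed"}
print(match_signature(window))               # -> "database server down"
```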

Business Impact: Integrating Technology and Services

Service managers are being called upon to make decisions that affect more than the technology they manage; now they are directly affecting their organization's capacity to generate revenues and communicate with partners and suppliers. Better service management decisions must incorporate more information about the business processes and the services that are using the infrastructures.

Detecting a potential or actual service disruption is only the first step. Determining the likely cause rapidly and accurately speeds restoration of service quality. Rapid problem isolation is simpler and faster when the elements associated with a specific service flow are known. They can be quickly probed for any abnormalities that warrant further attention from element management experts.

Service quality can also degrade even when there are no element failures. Rapid changes in loads and activities can introduce problems with resource allocation, temporary congestion, and other instabilities. The same mapping of services to elements enables rapid testing of the associated elements and identifies candidates for detailed analysis.

Conversely, if an element fails, an administrator needs to know which services are affected and what the business ramifications of those services are. A failure affecting a critical business service receives more attention than one that interrupts internal data backup.

Service managers can also use element instrumentation in a different way. Elements associated with key services can send alerts to the service manager. These are informational because the service manager is not usually responsible for responding to element problems. The service manager is informed that changes are occurring even if no disruptions are threatened for key services. Several element failures affecting the service would be another early warning mechanism.

Understanding the business impacts of any alert enables administrators to truly understand what is important to the business and make better decisions.

The subsequent discussion is further divided into the following subsections:

  • Using top-down and bottom-up approaches

  • Modeling a service

  • Care and feeding considerations

Top-Down and Bottom-Up Approaches

A top-down approach begins with the service and works toward the specific elements supporting it. Taking a top-down approach is usually easier when associating elements and services. Starting with the services perspective simplifies associating the supporting infrastructures and the individual elements within them. Some infrastructure elements, such as servers or databases, might be dedicated to specific services, making the associations straightforward. Other elements are shared and thus require further effort to specify.

The complementary bottom-up approach starts with elements and builds associations with the services using them. This is a much more difficult task because elements, such as network devices, do not usually capture and provide any information about the services flowing through them. Servers track the active processes, but building an integrated end-to-end representation is very hard to do.

Modeling a Service

Modeling is the most effective way of associating a service with the elements supporting it. Models use the power inherent in object-based descriptions and tools (see the accompanying sidebar for more information).

A Brief Object Overview

Objects are abstractions that represent real entities. Each instance of an object has attributes, methods, and notifications.

Each instance of an object has a set of attributes that describe the entity they are modeling. A server, for example, may have attributes describing the type of processor(s), the operating system, the memory, and other technical parameters.

Attributes can describe the relationships among the objects. Attributes define logical connectivity among objects and thus define dependencies and membership in groups. Some relationships, such as a set of servers running a specific service, are fixed for relatively long periods of time. Others, such as the actual Internet route, are more dynamic and are calculated when they are actually needed.

Methods are the operations that objects can carry out. These operations include rebooting a server, deactivating a specific process, and accessing specific policy information. Methods are not as important as the attributes for building service models; they are used mainly for element management functions at this time.

Notifications define what an object can communicate to another object. They fill the role of alerts, for either an element or a service.


Building a service model is fairly straightforward if the proper ingredients are available. One of the most important is an object library that has the templates for all the common components, whether physical or logical. Objects representing physical components, such as servers, must be readily available and easily customized to represent specific instances of any physical entity. Objects also represent logical entities, such as an application or an external service. Some of these will be common, and others will be specific for each organization.

After the objects are defined, they must be related by setting the appropriate attributes in each object that define dependencies to other objects. Thus, an application object is related to the server where it executes. The application is also related to other functional objects, such as a database system, a content delivery network, or an external search engine.
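A bare-bones Python sketch of such a model, where dependency attributes link objects and a simple traversal answers which services a failure affects; the class and attribute names are illustrative.

```python
# Minimal sketch of a service model built from objects whose attributes
# record dependencies (class and attribute names are illustrative).
class ManagedObject:
    def __init__(self, name):
        self.name = name
        self.depends_on = []       # relationship attributes to other objects

    def add_dependency(self, other):
        self.depends_on.append(other)

    def affected_by(self, failed):
        """True if this object depends, directly or transitively, on `failed`."""
        return any(d is failed or d.affected_by(failed) for d in self.depends_on)

server = ManagedObject("app-server-1")
database = ManagedObject("orders-db")
service = ManagedObject("online-ordering")    # logical entity
service.add_dependency(server)
service.add_dependency(database)

print(service.affected_by(database))          # True: db failure hits the service
```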

Care and Feeding Considerations

Unfortunately, much of the correlation between services and elements is created only through manual methods. Administrators must specify the associations, with some accompanying burdens.

Building these associations consumes staff time and is an ongoing challenge as environments continue to change and new services are introduced. A large online auction business, for example, introduces several new services each week, and additional time is needed to prepare the proper management views for each.

Maintaining the accuracy of these associations is an ongoing drain on staff in dynamic environments with rapid shifts in resources to match demands. Additional servers might be brought online as demand for a specific service grows. Updating the information stretches staff resources even further.

DIRIG Software PathFinder

DIRIG Software has taken the process of building and maintaining associations between services and elements further with its introduction of PathFinder. PathFinder is focused only on Java-based services, and it exploits that specialization. The first step is discovering the service components, which are the Enterprise JavaBeans, applets, or servlets. PathFinder then determines their relationships within services by scanning directories and other application-building information.

Agents on servers track the activation of the Java components, thus providing the binding of the logical components to the underlying server infrastructures. The logical dependencies between components coupled with the physical association with the servers executing them provide rich information for troubleshooting and isolating service problems.

A component failure is associated with the services it supports and with servers as well. Troubleshooters also know the calling sequence and other high-value information to resolve problems quickly while understanding the business ramifications.

The fluidity of PathFinder is welcome because agents detect the execution of the service components, sparing staff the odious task of trying to track changes and keep information current.


Prioritization

Prioritization is the stage at which an alert has been identified as an event requiring some response from the management system, such as notifying staff, sending reports to appropriate staff, assigning staff to the event, or immediately activating automated procedures.

Not all events are of equal importance, however, and management teams must keep the business-critical services running smoothly. A report that online customers are abandoning their shopping carts in droves will be of immediate concern to a business manager; a report that an occasional catalog lookup is slow need not receive the same attention.

Prioritizing correctly requires collaboration between the management staff and its service customers (direct or indirect). It is the customers who must determine and communicate the relative priority of their services to their providers (internal or external). Only when they have this clear indication of business priorities can providers assign the appropriate priorities to the associated alerts and events.

Providers must also assign other priorities to help them with their operations. The text has already mentioned that customers might pay higher premiums or have stricter noncompliance penalties. These concerns must also be incorporated into priority assessments.

The event manager uses these assigned priorities to organize the event stream and guide responses more effectively. Staff members are directed to address their attention to the most critical events.

The event manager also needs an aging mechanism so that low-priority events are not completely starved out by higher-priority events. The aging mechanism automatically increases an event's priority if it has not received attention within a specified time frame.
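A minimal sketch of a prioritized event queue with aging, assuming three numeric priority levels and an invented promotion interval:

```python
# Sketch of priority queueing with an aging mechanism (numbers are invented).
import heapq, time

class EventQueue:
    AGE_BOOST_AFTER = 600          # seconds before an unhandled event is promoted

    def __init__(self):
        self.heap = []             # (priority, arrival_time, event); lower = sooner

    def push(self, priority, event):
        heapq.heappush(self.heap, (priority, time.time(), event))

    def age(self):
        """Raise the priority of events that have waited too long, so
        low-priority events are not starved by a stream of critical ones."""
        now, aged = time.time(), []
        while self.heap:
            prio, arrived, event = heapq.heappop(self.heap)
            if prio > 1 and now - arrived > self.AGE_BOOST_AFTER:
                prio -= 1          # promote one level
            aged.append((prio, arrived, event))
        for item in aged:
            heapq.heappush(self.heap, item)

q = EventQueue()
q.push(3, "catalog lookup slow")   # warning
q.push(1, "checkout failing")      # severe: always serviced first
q.age()
print(heapq.heappop(q.heap)[2])    # -> "checkout failing"
```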

Prioritized events are usually placed in their appropriate queue, with minimally a severe, moderate, or warning priority level. Some products offer much more granularity with more priority levels to assign. Multiple thresholds can be used to trigger different responses depending on the severity of the alert.

The event monitor interface is also a means for tracking workflow. Details of each event, such as when it was received and its current status, are available. As events are cleared, they are appropriately marked, logged, and removed from the active queues.

Events must be organized in a variety of formats to meet different management needs. An overall display might show all outstanding events by priority class. Other displays are needed to show the affected customers, the specific SLAs, and the penalties that apply. Staff can also modify the events, changing priorities or clearing them from the console.

Activation

Any event activates one or more management tools. The time constraints imposed by SLAs mandate automatic and rapid responses to problems while the management staff is being notified.

Registration is the process of linking events and management tools. Management tools are activated by the event manager when any events for which they have registered are detected. Specific tools register with the event manager for types and classes of events. A Cisco Systems device manager would register to receive any events generated by specific Cisco elements, for example.

Registration is usually accomplished with an application program interface (API) for the event manager. Most products use a publish/subscribe approach, where a management tool subscribes to certain events. The event manager publishes events, which activate the subscribers. Multiple tools can also be activated by a single event.
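The publish/subscribe pattern can be sketched as follows; the API here is hypothetical, not any particular event manager's interface.

```python
# Sketch of event registration via publish/subscribe (API names invented).
from collections import defaultdict

class EventManager:
    def __init__(self):
        self.subscribers = defaultdict(list)   # event class -> registered tools

    def subscribe(self, event_class, tool):
        """A management tool registers for a type or class of events."""
        self.subscribers[event_class].append(tool)

    def publish(self, event_class, event):
        """Activate every tool registered for this class of event."""
        for tool in self.subscribers[event_class]:
            tool(event)

em = EventManager()
em.subscribe("cisco.device", lambda e: print("device manager handles", e))
em.subscribe("cisco.device", lambda e: print("inventory tool logs", e))
em.publish("cisco.device", {"element": "router-7", "alert": "interface down"})
# both subscribers run: multiple tools activated by a single event
```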

The event manager currently uses local server functions to activate the specified management tool. In the future, XML documents will activate remote management tools.

Coordination

Event management can help integrate the services and technology management areas as well as integrate management tools into processes. It is a natural place for integration because element and services instrumentation are already converging there.

One key factor in a bigger role for event management is the use of internally generated alerts. Figure 5-4 offers an example of event management as an integrating factor. A server failure alert is generated (1 in the figure) and leads to the activation (2) of a server management tool. The server manager performs the detailed problem analysis and determines that a hardware failure has occurred and the server is not operational. The server manager then creates an internally generated alert (3), which comes from the management tool rather than the managed environment, and the event manager then sends another alert that activates a tool that determines the impact on services (4).

Figure 5-4. Event Manager Using Alerts to Integrate and Sequence a Management Process


The impact assessment tool determines if the server failure is having an impact on service quality, such as congesting the remaining servers and creating unacceptable response times. If that is the case, it sends another internally generated alert (5 in Figure 5-4) that activates provisioning tools and traffic redirection tools and that notifies the staff of a serious threat to service-level compliance.
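The sequencing in Figure 5-4 can be sketched with the same publish/subscribe idea; every name below is invented for illustration, and the step numbers refer to the figure.

```python
# Sketch of sequencing a management process with internally generated alerts
# (steps numbered as in Figure 5-4; all names are illustrative).
from collections import defaultdict

subscribers = defaultdict(list)
def subscribe(kind, tool): subscribers[kind].append(tool)
def publish(kind, event):
    for tool in subscribers[kind]:
        tool(event)

def server_manager(event):                       # (2) activated by the failure
    print("server manager: hardware failure on", event["server"])
    publish("internal.server_down", event)       # (3) internally generated alert

def impact_assessor(event):                      # (4) assess service impact
    print("impact tool: response times threatened by loss of", event["server"])
    publish("internal.sla_threat", event)        # (5) trigger responses

subscribe("element.server_failure", server_manager)
subscribe("internal.server_down", impact_assessor)
subscribe("internal.sla_threat",
          lambda e: print("provisioning, traffic redirection, staff notified"))

publish("element.server_failure", {"server": "app-server-1"})    # (1)
```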

Incorporating internal alerts from the management system adds more value because a single point receives all alerts and can place them in the proper context. One example of this function would be a performance manager sending an alert if the pool of stand-by servers falls below a defined threshold number. The management staff then has this information and can prioritize it against other events to allocate efforts as effectively as possible.

Event management has a range of functions for sifting through an alert stream and picking out those alerts that need immediate attention from the management system. Some products have a full set of these functions, while others offer a more limited set; still others distribute these functions and couple them more tightly to instrumentation.



