Event and Fault Management Tools | Performance and Fault Management: A Practical Guide to Effectively Managing Cisco Network Devices (Cisco Press Core Series)

The goal of event and fault management is to determine whether there are issues that need attention, to generate faults for those issues, and to present these faults to the correct people so that they get fixed in priority order. Your network should generate as few faults as possible while still covering all issues that need attention. The steps to do this are as follows:

Configuring event generation
Event collection
Event correlation
Fault management

Figure 9-1 shows how these steps interrelate and how they must interoperate with the rest of your NMS tools.

Configuring Event Generation

A tool for configuring event generation assists with setting thresholds and enabling event generation on network devices.

As shown in Figure 9-1, an event configuration tool should be able to configure thresholds and event generation based on information gathered from performance polling and the knowledge base. This will entail configuring devices to check triggers and thresholds and generate events. In some cases, devices may not be able to monitor themselves, so there needs to be a mechanism for generating events based on performance monitoring.

A tool that can configure event generation needs a combination of the following capabilities:

Configure RMON alarms and events against any available SNMP object. The tool should help you apply these alarms to the interfaces you choose. For utilization, the tool should compute the thresholds for you so that you can enter a percentage of utilization. Cisco IOS devices allow the configuration of RMON either through SNMP or via the command line.
Cisco devices have several capabilities for generating events. These include the built-in SNMP traps and syslog messages. An event configuration tool should have knowledge of these events to avoid generating duplicate events.
An event configuration tool could include support to generate events based upon measurements of response time using the CISCO-RTTMON-MIB.
The EXPRESSION-MIB allows a device to monitor a formula made up of individual objects. An RMON alarm and event could be configured to send an event if the expression exceeds a threshold. A tool in this category could make good use of this capability.
It is important that this tool be able to configure a large number of devices and interfaces in a unified fashion. You don't want to be configuring each interface on each device. This tool should be able to take your policies and apply them across your network.
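
The utilization computation mentioned in the list above can be sketched in Python. This is a minimal illustration, assuming an RMON alarm that samples the delta of an octet counter (such as ifInOctets) over a fixed interval. The function name, the interface names, the speeds, and the 80/60 percent policy are all hypothetical and do not come from any Cisco tool.

```python
def utilization_threshold_octets(speed_bps: int, interval_s: int, percent: float) -> int:
    """Octets counted in one sampling interval at `percent` utilization.

    Assumes the alarm watches the delta of an octet counter, so the
    threshold is expressed in octets per interval, not bits per second.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # bits/s * seconds = bits; divide by 8 for octets, by 100 for the percentage.
    return round(speed_bps * interval_s * percent / 800)


# Hypothetical inventory: interface name -> speed in bits per second.
INTERFACES = {"Ethernet0": 10_000_000, "Serial0": 1_544_000}

# Hypothetical policy applied uniformly: alarm at 80%, re-arm at 60%,
# sampled every 60 seconds.
POLICY = {
    name: {
        "rising": utilization_threshold_octets(speed, 60, 80),
        "falling": utilization_threshold_octets(speed, 60, 60),
    }
    for name, speed in INTERFACES.items()
}
```

Keeping the policy as a percentage and deriving the per-interface numbers from each interface's speed is what lets one rule cover interfaces of different speeds.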

Some of the tools that allow you to configure event generation include the following:

Cisco CiscoWorks2000 Resource Manager Essentials 3.x NetConfig
Cisco Internetwork Performance Monitor
Cisco Traffic Director
NetScout Manager

In addition, most or all of the products listed under the performance management and reporting sections of this chapter should be able to set triggers and thresholds against the collected data, and deliver these events to your event collection system.

Event Collection Tools

Events come in many forms and from many sources, including syslog messages, SNMP traps, and log files as outlined in Chapter 5, "Configuring Events." Your event collection tool should be able to handle many types of events, from many devices, delivered over many different protocols. It should be able to normalize these events into a common event format so that your events can be easily processed by your event correlation tool. This relationship is shown in Figure 9-1.

Your event collector needs to understand many different types of events and to know the format in which your event correlation tool expects them, so that it can normalize them.
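
As a rough sketch of what normalization might look like, the following Python code maps a Cisco IOS syslog message and an SNMP notification onto one common record. The NormalizedEvent fields, the function names, and the dictionary shape used for the notification are assumptions made for illustration; only the %FACILITY-SEVERITY-MNEMONIC: text layout of IOS system messages is taken as given.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedEvent:
    """Hypothetical common format handed to the correlation tool."""
    source: str     # device that reported the event
    protocol: str   # "syslog" or "snmp"
    kind: str       # protocol-independent event name
    severity: int   # 0 (most severe) through 7, as in syslog
    detail: str


# Matches the body of an IOS system message: %FACILITY-SEVERITY-MNEMONIC: text
SYSLOG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+): (?P<text>.*)"
)


def from_syslog(source: str, line: str) -> NormalizedEvent:
    match = SYSLOG_RE.search(line)
    if match is None:
        raise ValueError(f"not an IOS system message: {line!r}")
    kind = f"{match['facility']}-{match['mnemonic']}"
    return NormalizedEvent(source, "syslog", kind, int(match["severity"]), match["text"])


def from_notification(source: str, notification: dict) -> NormalizedEvent:
    # The dict keys here are an assumed, already-decoded representation.
    return NormalizedEvent(
        source,
        "snmp",
        notification["name"],
        notification.get("severity", 3),
        notification.get("detail", ""),
    )
```

For example, from_syslog("router1", "%LINK-3-UPDOWN: Interface Serial0, changed state to down") yields an event whose kind is "LINK-UPDOWN" and whose severity is 3.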

Cisco devices use several methods to communicate events. These include SNMP notifications (traps or informs) and syslog messages. Other sources of events come from your availability and performance monitoring tools and may use different protocols for delivery of these events.

As your collection tool receives events, it may receive multiple copies of the same event, or it may be notified of the same occurrence through both syslog and SNMP. The collection stage is a good place to eliminate these duplicate events.

This is also a good point at which to filter out events that are of no interest to you. Doing so can significantly reduce the resources required to process events.
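
The filtering and de-duplication described above might be sketched as follows. This is only one possible approach, assuming events have already been normalized to (timestamp, source, kind) tuples so that a syslog message and a trap describing the same occurrence share a kind. The function name, the 30-second window, and the sample event names are hypothetical.

```python
def collect(events, ignored_kinds, window_s=30.0):
    """Drop uninteresting events and suppress repeats of the same event.

    `events` is an iterable of (timestamp, source, kind) tuples in time order.
    A repeat is any event with the same (source, kind) that arrives less than
    `window_s` seconds after the previous occurrence, whether or not that
    previous occurrence was itself suppressed.
    """
    last_seen = {}
    kept = []
    for timestamp, source, kind in events:
        if kind in ignored_kinds:
            continue  # filtered: nobody is interested in this kind of event
        key = (source, kind)
        previous = last_seen.get(key)
        last_seen[key] = timestamp
        if previous is not None and timestamp - previous < window_s:
            continue  # duplicate or bounce within the window
        kept.append((timestamp, source, kind))
    return kept


SAMPLE = [
    (0.0, "router1", "link-down"),      # first report, e.g. from a trap
    (1.0, "router1", "link-down"),      # same occurrence, e.g. from syslog
    (2.0, "router1", "config-change"),  # filtered out below
    (100.0, "router1", "link-down"),    # a new occurrence later on
]
FORWARDED = collect(SAMPLE, ignored_kinds={"config-change"})
```

Because the key omits the delivery protocol, the second link-down report is dropped regardless of how it arrived, and only two events reach the correlation tool.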

Some of the criteria to look at for a tool in this category include the following:

Flexibility in handling many different formats or sources of events, including your availability and performance monitoring tools
Ease of adding new event sources or protocols
Production of normalized events in the format your event correlation tool needs
The capability to "de-bounce" or de-duplicate events to reduce the load on the event processor
The capability to filter messages

Some of the products that have capabilities to be event collectors include the following:

Aprisma Spectrum
Castle Rock SNMPc Enterprise Edition
Cisco Info Center
Cisco Traffic Director
Computer Associates Unicenter TNG
Hewlett Packard OpenView Network Node Manager
Ipswitch WhatsUp Gold
The Knowledge Group WideAwake
Micromuse Netcool Suite
NetScout Manager
Tivoli NetView

Some of these collectors are specific to one type of event protocol. To build an enterprise-wide event collector, you may find that you need to combine several of these tools.

Event Correlation Tools

Your event correlation tool serves one main purpose: to recognize faults among a series of events. Your NMS receives many events, but many of them represent the symptoms, not the problem. Correlating the events allows the root cause of a problem to be presented as a single fault, rather than every event that was received as a result of that root cause.

As Figure 9-1 shows, this tool is where raw events are processed into fault data. It is central to the process of determining what, if anything, is actually wrong with your network.

An example of correlation is a case in which a router loses power. Your availability monitor detects this and reports it as an event. However, it also finds that it cannot reach any of the devices behind the router and reports each of these as an event as well.

Your event correlation tool needs to have enough knowledge of the topology of the network to determine that the router or the link to the router is the root cause of all these events. It then needs to send this single fault on to the fault manager for processing.
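
The router example can be sketched as a small piece of topology-based correlation. This is a deliberately simplified illustration: it assumes a tree-shaped topology in which every device has exactly one upstream device on the path from the NMS, which real networks with redundant paths do not satisfy. The function name and device names are hypothetical.

```python
def root_causes(upstream, unreachable):
    """Return the unreachable devices that are not explained by another one.

    `upstream` maps each device to the next device toward the NMS, or to None
    if the device is directly attached. A device is a root cause if it is
    unreachable while its upstream device is still reachable.
    """
    return sorted(
        device for device in unreachable
        if upstream.get(device) not in unreachable
    )


# Hypothetical topology: NMS -> router1 -> switch1 -> host1, and NMS -> router2.
UPSTREAM = {
    "router1": None,
    "switch1": "router1",
    "host1": "switch1",
    "router2": None,
}

# Three "unreachable" events arrive after router1 loses power.
FAULTS = root_causes(UPSTREAM, {"router1", "switch1", "host1"})
```

Here the three events collapse into a single fault against router1, which is what gets passed on to the fault manager.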

Your correlation tool needs to have detailed information about the devices in your network. For example, the symptoms of buffer issues in Cisco IOS devices vary according to the IOS version, the specific device hardware, and the switching method configured on the interfaces on that device. To be able to correlate interface errors or slow application performance to issues with buffers requires knowledge of all this information.

Correlation tools come in two types: passive and active. A passive correlation tool relies on the events it receives to determine the state of your network. An active correlation tool actively polls your network devices as it gets events to supplement the information supplied by these events. This can lead to a more complete and timely knowledge of the state of your network at the cost of more network traffic.

Through your knowledge of your network, you may develop new correlations that work in your network. Some tools will allow you to add these correlations to the tool.

The criteria to look for in an event correlation tool include the following:

Intelligence in handling different types of correlations, including Layer 2 and 3 topology issues. The tool should have a good selection of correlations built in.
A detailed knowledge of the devices in your network to be able to do device-level correlations.
The capability to actively poll for supplemental data to help determine the root cause.
The capability to be tailored to accept new correlations that may work in your network.
The capability to work with your event collector and fault manager.

Correlation tools can be divided into two categories: those that do device-level correlation and those that do network topology-based correlation. The following tools do device-level correlation:

Avesta Trinity
Cisco CiscoWorks2000 Resource Manager Essentials Syslog Collector and Reloads Report
Cisco Info Center
Concord Network Health Suite
Micromuse Netcool Suite
NetOps Visionary
Smarts InCharge
Tavve EventWatch
Veritas NerveCenter

The following tools do network topology-based correlations:

Avesta Trinity
The Knowledge Group WideAwake
Smarts InCharge
Tavve EventWatch

Fault Management Tools

Fault managers are responsible for receiving faults from your event correlation tool and making sure they are handled correctly. As shown in Figure 9-1, fault management may include maintaining a fault console as well as delivering faults to other systems such as paging or email systems, case management systems, network health displays and reports, and fault logs.

Your fault management system's main job is delivering faults appropriately. You may want to deliver different priorities of faults differently, as well as delivering different types of faults to different recipients. The tool you select should be able to deliver faults flexibly.
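
Flexible delivery of this kind is often expressed as a table of routing rules. The following Python sketch shows one way this could look. The Fault fields, the rule set, and the destination names are all hypothetical and stand in for whatever paging, email, case management, and logging systems you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    device: str
    kind: str       # hypothetical fault type, e.g. "link" or "security"
    priority: int   # 1 = most urgent


# Each rule pairs a test with the destinations that should receive a match.
# Every matching rule contributes, so one fault can reach several systems.
ROUTES = [
    (lambda fault: fault.priority == 1, ["pager", "case-system"]),
    (lambda fault: fault.kind == "security", ["security-email"]),
    (lambda fault: True, ["fault-log"]),  # everything is logged
]


def destinations(fault):
    """List the destinations for a fault, in rule order, without repeats."""
    result = []
    for matches, targets in ROUTES:
        if matches(fault):
            for target in targets:
                if target not in result:
                    result.append(target)
    return result
```

With these rules, a priority 1 security fault is paged, opened as a case, emailed to the security team, and logged, while a priority 3 link fault is only logged.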

Many network administrators use network health displays to monitor the health of the network. These displays vary from two- and three-dimensional maps of the network with the icon colors reflecting device health, to a simple grid of every device being monitored with color reflecting status, to a list of the top faults. Whatever you choose, be sure that your fault management system can accurately deliver fault status (including when the fault is resolved) to these systems.

A fault console needs to accept faults and display them to interested parties. Some capabilities that you may want in a fault console include the capability to assign faults to a responsible party, log the status of the work being done on the fault, and clear the fault when it is fixed. Many or all of these tasks can be handled by a case management system. In some cases, a full case management system may be overkill. In others, especially in larger companies or where one is already in use in the company, network administrators choose to use a case management system for all faults.

Whatever method you choose to distribute your faults, you'll want to be able to sort and rank faults differently for different users or purposes. Operators want to look at the faults in a different way from network designers or managers. Operators often want the worst set of faults to always be at the top of a continually refreshing display. Network engineers, on the other hand, tend to want to look at faults in chronological order, with the ability to sort and search based on different criteria. Managers tend to want to see the worst faults and their durations over a period of time.
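
The three views described above amount to different filters and sort orders over the same fault records. A minimal Python sketch follows, assuming a hypothetical FaultRecord with a priority, an opening time, and an optional clearing time. The field and function names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultRecord:
    device: str
    priority: int                      # 1 = worst
    opened_at: float                   # seconds since some reference time
    cleared_at: Optional[float] = None


def operator_view(faults):
    """Open faults only, worst first, oldest first within a priority."""
    open_faults = [f for f in faults if f.cleared_at is None]
    return sorted(open_faults, key=lambda f: (f.priority, f.opened_at))


def engineer_view(faults):
    """All faults in chronological order."""
    return sorted(faults, key=lambda f: f.opened_at)


def manager_view(faults, now):
    """(device, priority, duration) rows, worst first, longest first."""
    rows = [
        (f.device, f.priority,
         (f.cleared_at if f.cleared_at is not None else now) - f.opened_at)
        for f in faults
    ]
    return sorted(rows, key=lambda row: (row[1], -row[2]))


RECORDS = [
    FaultRecord("switch3", priority=3, opened_at=100.0),
    FaultRecord("router1", priority=1, opened_at=200.0),
    FaultRecord("router2", priority=1, opened_at=50.0, cleared_at=80.0),
]
```

Keeping one store of fault records and deriving each audience's view from it avoids maintaining separate, possibly inconsistent, lists per audience.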

Some of the criteria to use in choosing a fault management tool include the following:

The tool works with the faults delivered by your event correlation tool.
The tool can flexibly deliver faults to many different systems, including logs, email systems, paging systems, and case management systems.
The tool includes a fault console for displaying, tracking, and clearing fault conditions.

Some of the products that provide fault management include the following:

Avesta Trinity
Smarts InCharge
Cisco Info Center
The Knowledge Group WideAwake
Tavve EventWatch