Event Processing to Determine Faults | Performance and Fault Management: A Practical Guide to Effectively Managing Cisco Network Devices (Cisco Press Core Series)

The goal of event management is to collect all event information and determine what, if any, actions need to be taken as a result of each event. The art of event management is learning which events not to display or react to.

There are several steps that must be taken on receipt of an event:

Events must be collected. Usually, events will come in from a variety of sources through several different methods or protocols, as described in the previous sections of this chapter.
Upon receipt, the events should be normalized to facilitate their processing. Normalizing means to format the events in a consistent way.
Next, you must determine whether this event can be filtered or deleted. Because the volume of events to be processed can be very high, it is important to eliminate undesired events as early as possible. Filtering means to eliminate undesirable events by comparing them to some pattern and eliminating those events that match the pattern.
Next, the management system should correlate events and determine the faults that exist in the network. Correlate in this context means to examine events to determine the root cause of a problem.
Finally, the event and fault management system must take some sort of action on each of these faults.

The following sections go through these steps in more detail.

Event Collection, Normalization, and Filtering

Events need to be collected in a variety of ways. Each protocol will require a different technique and these techniques will be covered in the following sections on each protocol.

You need to normalize the events so that they can be processed in the same way, regardless of the delivery mechanism. The EMS you choose may do much or all of the normalization for you. Or it may require you to supply scripts to do some of the normalization, for example, to deliver events that arrive via protocols the EMS does not support.

Filtering these events is important to do as early as possible to reduce the processing load on your EMS. You may choose to filter events before they are normalized or normalize your events and then filter them all at the same time.
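As a rough illustration of early filtering, the following sketch discards raw text events that match a list of patterns before any further processing. The pattern list and the function names are hypothetical; a real list would come from your own knowledge base and your EMS may offer its own filtering mechanism.

```python
import re

# Hypothetical discard patterns; these two are only examples.
DISCARD_PATTERNS = [
    re.compile(r"%SYS-5-CONFIG_I"),          # routine configuration notices
    re.compile(r"%LINK-3-UPDOWN.*Dialer"),   # example: ignore dialer interfaces
]

def is_unwanted(raw_event: str) -> bool:
    """Return True if the raw event matches any discard pattern."""
    return any(p.search(raw_event) for p in DISCARD_PATTERNS)

def filter_events(raw_events):
    """Yield only the events that survive filtering."""
    for raw in raw_events:
        if not is_unwanted(raw):
            yield raw
```

Because the filter is a generator, it can sit directly in front of the normalizer without buffering the whole event stream.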

The EMS you use determines the format you use to normalize your events. We are proposing a list of information that needs to be included in the normalized events so that we can discuss the steps to normalize events delivered by the different protocols or methods. This information includes the following:

Source
Time
Priority
Type
Variables

You must choose how to represent each of these in the normalized event format.
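One possible representation of the five fields is sketched below. The class name, field types, and the choice of epoch seconds for the time are assumptions made for illustration; your EMS will normally dictate the actual format.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class NormalizedEvent:
    source: str        # hostname or address of the event producer
    time: float        # seconds since the UNIX epoch (an assumed convention)
    priority: int      # priority after any knowledge-base override
    type: str          # event type, for example "linkDown"
    variables: Dict[str, str] = field(default_factory=dict)

# Example instance; the time value is a placeholder, not a real timestamp.
event = NormalizedEvent(
    source="10.29.2.1",
    time=0.0,
    priority=5,
    type="linkDown",
    variables={"interface": "Ethernet0"},
)
```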

First, the source of the event often comes in as an IP address. You may want to convert it to the hostname of the device or a specific IP address such as a loopback address. Also, because some devices have many IP addresses, you may want to control which IP address is used when these devices produce events. This is covered in detail in "Setting Up a Loopback Interface" in Chapter 18, "Best Practices for Device Configuration."

Selecting what to use as the time of the event depends on several things. The time of the event is often best set by the event producer. Some protocols provide the capability to supply the time the event occurred, or at least the time the event was sent from the event producer. If you are using Network Time Protocol (NTP) or another way of setting the time throughout your network, the time from your event producers will be the most accurate time. Otherwise, you are better off using the time that the event was delivered to your EMS. More detail on time and NTP is covered in "Setting Up NTP" in Chapter 18, "Best Practices for Device Configuration."

One other decision is how to represent and store time. In most cases, your EMS will make this decision for you. But you'll still need to be aware of the issues and make sure that your EMS handles time sufficiently well for your environment.

If all of your network is in one time zone, you won't have to deal with time zone issues. But if your network is spread across time zones, you should pick a standard way to represent time, either in universal time or in the time zone of your network operations center. You still need to preserve the zone that the device is in so that you can do thresholding and report performance data on a business-hours basis.

Often, the easiest way is to store time in the native format for the platform your EMS is running on. A standard UNIX-format time is traditionally stored as a four-byte integer that represents the seconds elapsed since midnight UTC on January 1, 1970. You probably also want to store the time zone of the source device.
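The following sketch shows one way to keep a single epoch value per event while still answering business-hours questions for the device's location. It assumes the device's time zone is stored as a fixed UTC offset in hours, which ignores daylight saving time, and the 8-to-18 business-hours window and function names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

def to_epoch(local_time: datetime, utc_offset_hours: float) -> int:
    """Convert a device-local wall-clock time to seconds since the epoch."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return int(local_time.replace(tzinfo=tz).timestamp())

def in_business_hours(epoch: int, utc_offset_hours: float,
                      start: int = 8, end: int = 18) -> bool:
    """Check whether the event fell within business hours at the device."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromtimestamp(epoch, tz)
    return start <= local.hour < end
```

Storing the offset next to the epoch value means the same event can be shown in universal time at the network operations center and still be evaluated against local business hours.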

The priority of the event itself (as distinct from the event producer's priority) is sometimes supplied with the event. Alternatively, you need to get it from your knowledge base. Even if the event supplies a priority, you may choose to override it with information from your knowledge base. In fact, because the device has no concept of the relative importance of different interfaces or ports, you will usually have to adjust or override the priority of the event.

You should always be able to deduce the event type from the event; otherwise, the event will be just so much scrap. Details on how to do this are included in the following sections.

The variables or variable bindings sometimes are supplied by the event. Sometimes, they need to be parsed out of a text field. Even if they are supplied, they often need to be translated in some fashion. For example, SNMP will supply ifIndex for many traps associated with interfaces. However, you will probably want to represent your interfaces with names or functional labels such as T1toBombay or names such as Ethernet0.

After the events are normalized, they need to be saved into a central repository so they can be processed. Depending on the size and topology of your network, you may choose to have several repositories in different locations. The important thing is to collect all events for a given part of your network in the appropriate repository. If an event affects multiple parts of the network, it may need to be delivered to multiple repositories.

A repository can be a database, a formatted file, or a pipe between the event collector and normalizer and the event correlator, whichever works for your EMS.

The following sections go through each type of delivery mechanism and what needs to be done to collect and normalize these events.

Collecting and Normalizing Log Files

Many delivery mechanisms place the information into files, usually text files. Basically, the steps your EMS needs to take in order to extract the events contained in log files are as follows:

Lock the file. If a lock can't be obtained immediately, wait until one is available.
Read a line.
Filter, if required, to extract just the events or desired events.
Send events on for processing.
Mark the location in the file in a stateful way.
Unlock the file.
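The steps above can be sketched as follows for a UNIX-like system. This is only an illustration: it assumes advisory locking with `flock`, which helps only if the process writing the log cooperates, and it assumes the marker is a byte offset kept in a small side file. The function and file names are hypothetical, and this version returns the events to the caller rather than sending them on before the marker is saved.

```python
import fcntl
import os

def read_new_events(log_path, marker_path, wanted=lambda line: True):
    """Return unprocessed lines from log_path and advance the saved offset."""
    # Load the last processed byte offset, if one was saved.
    try:
        with open(marker_path) as m:
            offset = int(m.read().strip() or 0)
    except FileNotFoundError:
        offset = 0

    events = []
    with open(log_path, "rb") as log:
        fcntl.flock(log, fcntl.LOCK_EX)   # blocks until the lock is available
        try:
            log.seek(offset)
            for raw in iter(log.readline, b""):
                if not raw.endswith(b"\n"):
                    break   # partial line still being written; read it next time
                offset += len(raw)
                line = raw.decode("utf-8", errors="replace").rstrip("\n")
                if wanted(line):
                    events.append(line)
            # Save the marker via a temporary file so a crash cannot corrupt it.
            tmp = marker_path + ".tmp"
            with open(tmp, "w") as m:
                m.write(str(offset))
            os.replace(tmp, marker_path)
        finally:
            fcntl.flock(log, fcntl.LOCK_UN)
    return events
```

Keeping the marker outside the log file makes the reader stateful across restarts: a second call returns only lines appended since the first.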

When you manage your log files, whether that means truncating them or archiving them, you'll need to take the following steps:

Lock the file. If a lock can't be obtained immediately, wait until one is available.
Determine the marked location.
Truncate or archive the data before the marked location.
Re-mark the location.
Unlock the file.
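A matching maintenance routine might look like the sketch below. It makes the same assumptions as any advisory-locking scheme (a UNIX-like system, cooperating writers) and assumes the marker is a byte offset stored in a side file. It reads the whole log into memory, so it suits modest file sizes only. All names are hypothetical.

```python
import fcntl
import os

def archive_processed(log_path, marker_path, archive_path):
    """Move already-processed bytes to archive_path and reset the marker."""
    try:
        with open(marker_path) as m:
            offset = int(m.read().strip() or 0)
    except FileNotFoundError:
        offset = 0
    if offset == 0:
        return 0   # nothing processed yet, so nothing to archive

    with open(log_path, "r+b") as log:
        fcntl.flock(log, fcntl.LOCK_EX)   # wait until the lock is available
        try:
            processed = log.read(offset)  # data before the marked location
            remainder = log.read()        # data not yet processed
            with open(archive_path, "ab") as archive:
                archive.write(processed)
            log.seek(0)
            log.write(remainder)
            log.truncate()                # cut the file after the remainder
            with open(marker_path, "w") as m:
                m.write("0")              # re-mark: unprocessed data now starts at 0
        finally:
            fcntl.flock(log, fcntl.LOCK_UN)
    return len(processed)
```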

Many administrators schedule log-file maintenance once a night or once a week. Another choice is to set a threshold on how large the log file is allowed to grow.

To make a robust system, you need to take a few more steps. One important step is to track the rate at which events are added to the file and the rate at which they are processed, to determine whether your EMS can keep up with the rate at which events are generated. You also want to be able to stop your processes cleanly at the right point when the system is shut down or the EMS needs maintenance. Finally, you need a reliable mechanism for starting automatic processes.

Collecting and Normalizing Syslog Messages

Syslog is normally used to collect information about the system it is running on, as well as to collect messages from other systems. Syslog messages are sent to a particular facility and also have a particular priority. The syslog server can be configured to send messages with certain facilities and/or priorities to certain log files. See the "Syslog" section of Chapter 8 for more information on syslog facilities. Cisco routers and switches, by default, send their syslog messages to the facility "local7." Therefore, it is easy to configure the syslog server to log all messages from Cisco devices to one file. Then, you can process this log file following the steps in the previous section.

Normalization of syslog messages can be fairly easy. However, extracting the type of message and the variables embedded in it requires knowledge of each possible syslog message. For example, consider a syslog message that indicates that an interface went down:

 Sep 19 16:51:07 10.29.2.1 79: Sep 19 16:50:24: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to down

The first three strings show the date and time when the message was received at the syslog server. The next field shows the address or name of the event producer. The next field is the process ID of the process on the event producer that sent the message. All the rest is the text of the message. So, here are the standard fields, as shown in the example:

Date/time from the syslog server = Sep 19 16:51:07
Event producer IP address = 10.29.2.1
Process ID on the event producer = 79

Cisco IOS devices format the text field as follows:

The first field or fields is a representation of the date and/or time on the device if you have "service timestamps" configured on the device. This can make it a bit interesting to parse this file if you don't have your devices configured the same way.
The next field is formatted into three subfields: facility, severity, and mnemonic. Note that the facility is not the same as the facility that's part of the syslog protocol. This facility documents what part of the Cisco IOS generated the message.
The final field is the freeform text message. Often important information, such as which interface this message applies to (as in the previous message), is included in this field.

Cisco IOS syslog messages are documented in the "Cisco IOS Software System Error Messages" manual. This document can be used to help normalize syslog messages. The facility, severity, and mnemonic fields can be used to look up messages in this manual.

Once again, here is the Cisco-specific part of the example message:

Date/time from the Cisco event producer = Sep 19 16:50:24
Cisco facility that generated the message = LINEPROTO
Severity = 5
Mnemonic of the error = UPDOWN
Text = Line protocol on Interface Ethernet0, changed state to down

Note that two date/time entries appear in the syslog message, one recorded by the syslog server and one by the device sending the message. The entry you should use depends on how you manage time on your network. The most accurate time for the event will be the one supplied by the device, provided you've used something such as NTP to set the time accurately for your entire network.

The event producer is sent to the syslog server as an IP address. Most syslog servers attempt to resolve the address to a hostname and, if successful, put the name in the log. By default, Cisco IOS devices use the address of the interface that sends the packet out as the source address. You can specify a different address with the logging source-interface command. It is a good idea to specify a loopback interface as the chosen interface for this command. Then, you can rely on all syslog messages having the same address or name from this device.

Interpreting the text part of the example message, you can see that the type of this event is a link-down message and the interface that went down is Ethernet0. As just discussed, you can choose to track interfaces by name or by a functional label.

The event shows the severity or priority of this message as 5. You can choose to override this priority with information from your knowledge base. To summarize, the example message would be normalized to the following:

Source = 10.29.2.1
Time = Sep 19 16:50:24 (may want to convert to UTC)
Priority = 5
Type = linkDown
Variables = Interface Ethernet0
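The normalization of this example message can be sketched with a regular expression. The pattern below is written only against the single sample line shown earlier and assumes that log layout; devices configured with different "service timestamps" options would need other patterns. The rule table stands in for a knowledge base and is hypothetical.

```python
import re

SYSLOG_RE = re.compile(
    r"^(?P<recv_time>\w{3}\s+\d+\s+[\d:]+)\s+"        # time added by the syslog server
    r"(?P<source>\S+)\s+"                             # event producer address or name
    r"(?P<pid>\d+):\s+"                               # numeric field after the source
    r"(?:(?P<dev_time>\w{3}\s+\d+\s+[\d:.]+):\s+)?"   # optional device timestamp
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+):\s+"
    r"(?P<text>.*)$"
)

# Hypothetical knowledge-base entry: how to pull variables out of one message.
MESSAGE_RULES = {
    ("LINEPROTO", "UPDOWN"): re.compile(
        r"Line protocol on Interface (?P<interface>\S+), "
        r"changed state to (?P<state>\w+)"
    ),
}

def normalize_syslog(line):
    """Return a normalized event dict, or None if the line does not parse."""
    m = SYSLOG_RE.match(line.strip())
    if m is None:
        return None
    event = {
        "source": m["source"],
        "time": m["dev_time"] or m["recv_time"],   # prefer the device's time
        "priority": int(m["severity"]),
        "type": f'{m["facility"]}-{m["mnemonic"]}',
        "variables": {},
    }
    rule = MESSAGE_RULES.get((m["facility"], m["mnemonic"]))
    if rule:
        detail = rule.search(m["text"])
        if detail:
            event["variables"]["interface"] = detail["interface"]
            event["type"] = "linkDown" if detail["state"] == "down" else "linkUp"
    return event
```

Messages with no matching rule still yield an event whose type is built from the facility and mnemonic, so nothing is silently dropped while the rule table is being filled in.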

Collecting and Normalizing SNMP Notifications

SNMP notifications are somewhat easier to deal with than the preceding event types because they usually are delivered directly to your EMS from the event producer and they come formatted, not as an unformatted text string.

With the definition of SNMPv2, you can choose to send SNMP notifications as either traps or informs. For the purpose of collection and normalization, it doesn't matter which mechanism delivered the event.

Consider the example event from the previous discussion of syslog messages, but this time delivered via SNMP. We show how it's delivered as an SNMPv1 trap because that form is the most obscure to decode. A linkDown notification that is equivalent to the previous syslog message would have the following interesting fields:

Source = 10.29.2.1
Generic trap ID = linkDown (2)
Specific trap ID = 0 (not applicable because this is a generic trap)
SysUpTime = 203137
Trap OID = snmpTraps (1.3.6.1.6.3.1.1.5)
Varbinds =
- ifIndex = 1
- ifAdminStatus = Up
- ifOperStatus = Down

First, how do you know what trap this is? The generic trap ID tells you this is a linkDown trap, and the trap OID tells you that it is the standard MIB-II version of this generic trap. Note that SNMPv1 numbers its generic traps from 0, so linkDown is 2 in the trap itself, whereas the SNMPv2-style definition numbers the same notification as snmpTraps 3.

Before getting into details of normalizing it, it is helpful to look at the definition of a linkDown trap, from MIB-II as defined in RFC 2233:

 linkDown NOTIFICATION-TYPE
        OBJECTS { ifIndex, ifAdminStatus, ifOperStatus }
        STATUS  current
        DESCRIPTION
                "A linkDown trap signifies that the SNMPv2 entity,
                acting in an agent role, has detected that the
                ifOperStatus object for one of its communication links
                is about to transition into the down state."
        ::= { snmpTraps 3 }

The source address of the notification identifies the event producer. Similar to syslog messages, Cisco IOS devices use the address of the interface that the message goes out on unless told otherwise. Once again, the easiest way to control the address is to choose a loopback interface and select that interface through the snmp-server trap-source global configuration command. See the section "Setting Up a Loopback Interface" in Chapter 18 for more information about using loopback interfaces.

The time for this event must be the time that the event is received at the EMS because SNMP provides only sysUpTime, but doesn't provide a way to correlate sysUpTime to the actual time that an event occurred.

SNMP doesn't provide any hints about the priority of this event, so you have to rely on your knowledge base to supply the priority of a link-down event. For this example, assume that the knowledge base supplies a priority of 5.

Next, the variable bindings for this notification provide some useful information about this event. The first tells you the ifIndex of the interface. Of course, you need to translate this to some sort of meaningful term. You can choose to use the interface name or a description of the function of this interface.

The ifAdminStatus and ifOperStatus may help you determine what happened to shut down the interface, so you will want to preserve these variable bindings. There are times when the interface status changes so quickly on the device that it shows the interface up even though the event is link-down. What is happening in this case is that the event is queued for delivery, but the objects for the variable bindings are not sampled until the packet is being generated. This situation can lead to what looks like inaccuracies in the data.

Here's the event after normalization:

Source = 10.29.2.1
Time = <Sep 19 16:51:07> (from the local EMS system time)
Priority = 5 (from the knowledge base)
Type = linkDown
Variables = Interface Ethernet0

- AdminStatus 'up'
- OperStatus 'down'
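The same normalization can be sketched in code, assuming the trap has already been decoded into a plain dictionary by whatever SNMP library your EMS uses. The dictionary layout, the lookup tables standing in for a knowledge base, and the default priority are all assumptions. The generic trap numbers follow SNMPv1, which counts from 0, so linkDown is 2.

```python
import time

# Generic trap numbers as defined for SNMPv1.
GENERIC_TRAPS = {
    0: "coldStart", 1: "warmStart", 2: "linkDown", 3: "linkUp",
    4: "authenticationFailure", 5: "egpNeighborLoss", 6: "enterpriseSpecific",
}

# Hypothetical knowledge base: interface names and per-type priorities.
IF_NAMES = {("10.29.2.1", 1): "Ethernet0"}
PRIORITIES = {"linkDown": 5}
STATUS = {1: "up", 2: "down", 3: "testing"}

def normalize_trap(trap, received_at=None):
    """Turn an already-decoded SNMPv1 trap (a dict) into a normalized event."""
    event_type = GENERIC_TRAPS.get(trap["generic_trap"], "unknown")
    vb = trap["varbinds"]
    variables = {}
    if_index = vb.get("ifIndex")
    if if_index is not None:
        # Translate ifIndex to a name; fall back to the raw index if unknown.
        variables["interface"] = IF_NAMES.get(
            (trap["source"], if_index), f"ifIndex {if_index}")
    for name in ("ifAdminStatus", "ifOperStatus"):
        if name in vb:
            variables[name] = STATUS.get(vb[name], str(vb[name]))
    return {
        "source": trap["source"],
        # The trap carries only sysUpTime, so use the local receive time.
        "time": received_at if received_at is not None else time.time(),
        "priority": PRIORITIES.get(event_type, 3),   # 3 is an arbitrary default
        "type": event_type,
        "variables": variables,
    }
```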

RMON events are always delivered as SNMP notifications. Syslog messages can be delivered as SNMP notifications. The following two sections discuss briefly how to collect and normalize these events.

Normalizing RMON Alarms/Events Delivered as SNMP Notifications

RMON alarms can trigger RMON events. RMON events can be delivered to your EMS as an SNMP notification. Take a look at the definition of these SNMP notifications:

 risingAlarm TRAP-TYPE
     ENTERPRISE rmon
     VARIABLES { alarmIndex, alarmVariable, alarmSampleType,
                 alarmValue, alarmRisingThreshold }
     DESCRIPTION
         "The SNMP trap that is generated when an alarm
         entry crosses its rising threshold and generates
         an event that is configured for sending SNMP
         traps."
     ::= 1

 fallingAlarm TRAP-TYPE
     ENTERPRISE rmon
     VARIABLES { alarmIndex, alarmVariable, alarmSampleType,
                 alarmValue, alarmFallingThreshold }
     DESCRIPTION
         "The SNMP trap that is generated when an alarm
         entry crosses its falling threshold and generates
         an event that is configured for sending SNMP
         traps."
     ::= 2

RMON alarms are triggered by an object's value crossing a threshold, so most RMON alarms are set on performance-related objects and, therefore, generate performance events. Because these events are delivered using a standard definition, there will be very little information about the specific object. To fill out the variable bindings for this event, or even the specific event type, you'll first have to determine what the alarmVariable is. You may need to query the event producer for additional information.
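As a sketch of that first step, the code below splits an alarmVariable OID into a known object name and an instance, then builds a more specific event type from it. The three interface-table OIDs are standard MIB-II objects; the table contents, the naming scheme for the event type, and the dictionary layout of the varbinds are assumptions for illustration.

```python
# A small, hypothetical lookup table of objects that alarms are set on.
KNOWN_OBJECTS = {
    "1.3.6.1.2.1.2.2.1.10": "ifInOctets",
    "1.3.6.1.2.1.2.2.1.14": "ifInErrors",
    "1.3.6.1.2.1.2.2.1.16": "ifOutOctets",
}

def describe_alarm_variable(oid):
    """Split an alarmVariable OID into (object name, instance) if known."""
    for prefix, name in KNOWN_OBJECTS.items():
        if oid.startswith(prefix + "."):
            return name, oid[len(prefix) + 1:]
    return None, oid

def normalize_rmon(specific_trap, varbinds):
    """Build a specific event type from a risingAlarm (1) or fallingAlarm (2)."""
    direction = {1: "rising", 2: "falling"}.get(specific_trap, "unknown")
    name, instance = describe_alarm_variable(varbinds["alarmVariable"])
    return {
        "type": f"{name or 'unknownObject'}-{direction}Alarm",
        "variables": {
            "object": name,
            "instance": instance,
            "value": varbinds["alarmValue"],
        },
    }
```

When the object is not in the table, the function returns the raw OID as the instance, which is the point at which an EMS might query the event producer for more information.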

Other than these small differences, RMON events can be treated as standard SNMP notifications.

Normalizing Syslog Events Delivered as SNMP Notifications

Syslog messages can be delivered as SNMP notifications by Cisco IOS devices. This relieves the EMS from dealing with log files and can improve the reliability of delivery through the informs mechanism defined in SNMPv2. So, processing these events is a combination of receiving the event through SNMP and processing the event as if it were a syslog message. See the previous sections on syslog messages and SNMP notifications to review how to process these events. Just remember that even if the event is delivered as an SNMP notification, the information is contained in the text portion of the message, just like a syslog message.

You may decide to use this mechanism to deliver all syslog messages from Cisco IOS devices. This will mean that you have one fewer log file to track and manage. Managing SNMP notifications should be much less costly for your EMS.

Collecting and Normalizing NMS Events

Your NMS will usually have a built-in event system. Both your NMS and associated applications use this event system to deliver and consolidate events. Your NMS may receive an event of one type and this may trigger the creation of another event. You may want to deliver some or all of these events to your EMS. For example, your NMS may collect all SNMP notifications. It will be important that your EMS receive these events.

There may be several mechanisms to deliver these events to your EMS. The mechanisms may log the events to a file, which can be processed as outlined previously for log files.

Most NMS systems have a mechanism to take an action upon receipt of an event. The NMS could deliver the event through this mechanism to your EMS. The EMS would then normalize the event and continue to process it in the same way as any other message.

The details on normalizing messages depend on the NMS used and the format of its events.

Correlating Events to Determine Faults

After you have all your events in the same format, you can start to correlate them and determine whether they represent faults in your network. The steps required to process your events are discussed in the following sections. We introduce the concept of correlation and discuss some specific correlations that can be performed. More correlations are covered throughout Part II of this book.

Correlating Duplicate Events

When you receive an event, an obvious correlation is to look for a duplicate event that has been processed with a timestamp close to the timestamp of the first event. Duplicate events may be found because the device delivers the same event through two different protocols. Or duplicate events may indicate that there is a repeating problem. In the first case, one event can be deleted. In the second case, the EMS should store only one event and extend the information about this event to include the timeframe over which it is occurring and how many instances of this event have occurred.

The time frame in which to look for duplicate events should be short, on the order of just a few seconds in most cases. It is important to not conceal instances when, for example, interfaces may be flapping. At the same time, it is important to reduce the many events produced by the flapping down to one fault that reports the flapping interface and the duration of the problem.
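A minimal sketch of duplicate correlation follows. It assumes that two events are duplicates when their source, type, and variables are identical and they arrive within a short window of each other; the five-second default and the class name are illustrative choices, not a recommendation from any particular EMS.

```python
class DuplicateTracker:
    """Collapse identical events that arrive within `window` seconds."""

    def __init__(self, window=5.0):
        self.window = window
        self.records = {}   # event key -> summary of the current run

    def add(self, source, event_type, variables, timestamp):
        """Return (summary, is_new) for the event just received."""
        key = (source, event_type, tuple(sorted(variables.items())))
        summary = self.records.get(key)
        if summary and timestamp - summary["last_seen"] <= self.window:
            # Duplicate: extend the existing record instead of storing a new one.
            summary["count"] += 1
            summary["last_seen"] = timestamp
            return summary, False
        summary = {"first_seen": timestamp, "last_seen": timestamp, "count": 1}
        self.records[key] = summary
        return summary, True
```

The summary keeps the first and last timestamps and the instance count, which is the information needed to report a flapping interface as one fault with a duration.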

The events left after the elimination of duplicate events should be correlated using the techniques in the following sections.

Passive and Active Correlation

Correlation engines can work in two ways, either operating with the knowledge of network state as delivered by events, or using events as a starting point and actively polling network devices in response to events received. The first method is known as passive correlation; the second method is known as active correlation. Both are valid ways to determine faults and have their advantages and disadvantages.

Passive correlation may take more time to determine the root cause of one or a group of events. However, it doesn't put any additional load on your network.

Active correlation can sometimes significantly reduce the time to determine root cause and, therefore, network faults. But it does so by increasing the amount of traffic on your network. In most cases, this traffic will be small, but the application must be designed carefully to ensure that it does not cause more issues than it resolves.

Both techniques can be applied to the correlations in the following sections.

Correlating Network and Segment Events

This section discusses the kind of network and segment correlations that can be performed without prior knowledge of the topology of the network. Many more correlations can be done with knowledge of the Layer 2 or Layer 3 topology of the network; some of them are discussed in the section "Correlating Events on the Basis of Topology."

Certain things can be correlated easily across a network or subnet. For example, Layer 3 broadcast and multicast traffic is seen by all nodes on the same logical network segment or VLAN. If thresholds for the amount of non-unicast traffic are in place around the network, many events may be generated by high amounts of this traffic. Instead of creating new faults for each event, your EMS could update the first fault with the number of instances seen and the timeframe over which it is occurring. The only Layer 3 knowledge required is the IP address and subnet mask to be able to correlate these events. Note that if a switch supports VLANs, you can't assume that all ports can be correlated.

Correlating information on a point-to-point link is also easy, given that both sides share a common IP subnet. Utilization should be the same and should trigger the same events. Networks utilizing IP unnumbered links require Layer 2 topology knowledge to correlate these links. The Cisco Discovery Protocol (CDP) enables you to correlate these links.
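Grouping events by subnet needs only the address and mask of the reporting interface, as the following sketch shows with Python's standard ipaddress module. The event layout (a dictionary with "address" and "mask" keys) is an assumption for illustration.

```python
import ipaddress
from collections import defaultdict

def group_by_subnet(events):
    """Group events by the IP network of the interface that reported them."""
    groups = defaultdict(list)
    for ev in events:
        iface = ipaddress.ip_interface(f'{ev["address"]}/{ev["mask"]}')
        groups[iface.network].append(ev)
    return groups
```

Events that land in the same group are candidates for being folded into one fault, for example several non-unicast threshold events from one subnet or the two ends of a numbered point-to-point link.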

Correlating Events on the Basis of Technology

Often, you can correlate events based upon the technology to verify whether a fault exists. Part II of this book goes into technology correlation in detail. A couple of examples are discussed here.

If you have thresholds on both utilization and collisions on an Ethernet interface and if both thresholds are triggered at nearly the same time, the fault is almost always the utilization. Collisions on Ethernet are a product of the high utilization. A possible way of avoiding the need to do this correlation is to have a more complex threshold for collisions that takes into account the network utilization and only triggers an event if the number of collisions increases without a corresponding increase in network utilization.

Excessive collisions, although indicative of a problem with the network, can also cause many other symptoms on the same interface. For example, you will often see output queue drops on this interface as the packets back up because they can't be sent. If you have thresholds set on queue drops, you may find correlations there also.

Very high utilization on an interface also correlates with output queue drops, so it's important to look at the interface as a whole and determine the root of the problem. So, if you see high collisions and high utilization, as well as output queue drops, you can correlate all of these events to high utilization on this physical Ethernet segment. However, if you see high collisions and high output queue drops, but not high utilization, there is probably a physical problem on that segment.
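The reasoning in this section can be written down as a small rule function. The threshold names and the returned labels are hypothetical; the rules simply restate the text, treating collisions and output queue drops as symptoms whenever utilization is also high.

```python
def diagnose_ethernet(triggered):
    """Map the set of triggered thresholds on one interface to a likely cause."""
    util = "utilization" in triggered
    coll = "collisions" in triggered
    drops = "output_queue_drops" in triggered
    if util:
        # Collisions and queue drops are treated as symptoms of the load.
        return "high utilization"
    if coll and drops:
        return "probable physical problem on the segment"
    if coll:
        return "excessive collisions"
    if drops:
        return "output queue drops"
    return None
```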

Correlating Events on the Basis of Topology

When you know the topology of the network, you can make many more correlations, thus allowing you to eliminate many events and report just the underlying fault. The tradeoff is that maintaining accurate knowledge of the network topology can be quite difficult and applying that knowledge to event correlation can be equally difficult. The next two sections review some of the benefits of applying these types of correlations.

Layer 2 Correlation

With knowledge of where the network management station is in relation to the Layer 2 topology of a network, your EMS can determine whether many events are due to one fault.

For example, suppose you have a switch in your datacenter connected to switches on each floor of your building. You may have multiple switches feeding parts of the fifth floor off the main floor switch. You may also have departmental servers on the fifth floor. Suppose you lose the trunk port going from the datacenter to the fifth floor switch. It may be a cabling issue or a board or port failure on one of the switches. In any case, you lose connectivity to all networked devices on the fifth floor. You will very soon receive many events.

Probably the first event to arrive is the link-down event from the datacenter switch, indicating that the trunk port is down. Your availability monitor will also report many events over the next few minutes, indicating that all the devices on the fifth floor are down. However, your EMS can generate a fault from the link-down event, based on knowledge base information that this port is a critical port. If your EMS incorrectly reported faults from any availability events, these incorrectly reported faults will need to be suppressed or retracted. Any availability events for these devices that arrive after the root cause fault has been determined can be directly suppressed as being symptoms of the same fault.

It is important that your EMS have knowledge of when the fault is resolved because while this fault was being resolved, other faults may have occurred that your EMS didn't have visibility into because they occurred on the other side of this fault. So, after this fault is resolved, your EMS should make certain to check the status of the devices on the other side of the fault and create new faults, if necessary.
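The fifth-floor example can be sketched as follows. The topology is assumed to be a simple tree as seen from the management station, held in two hypothetical tables: one mapping a port to the device behind it and one mapping each device to the devices behind it. Device and port names are invented for the example.

```python
# Hypothetical topology tables, as seen from the management station.
LINKS = {("dc-switch", "trunk-5"): "floor5-main"}
CHILDREN = {
    "floor5-main": ["floor5-east", "floor5-west", "server-5a"],
    "floor5-east": ["pc-17"],
}

def downstream_devices(device, port, links=LINKS, children=CHILDREN):
    """Return every device that is reached through the given port."""
    root = links.get((device, port))
    if root is None:
        return set()
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found

def correlate(events):
    """Split events into reportable faults and suppressed symptoms."""
    hidden = set()
    for ev in events:
        if ev["type"] == "linkDown":
            hidden |= downstream_devices(ev["source"], ev["port"])
    faults, suppressed = [], []
    for ev in events:
        if ev["type"] == "unreachable" and ev["source"] in hidden:
            suppressed.append(ev)   # symptom of the link-down fault
        else:
            faults.append(ev)
    return faults, suppressed
```

The set returned by `downstream_devices` is also the list of devices to recheck once the trunk fault is resolved.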

Another example is correlating CRC and frame-alignment errors with duplex mismatches on twisted-pair Ethernet connections. If autonegotiation fails to negotiate both sides correctly or if the devices on either side of the link are configured incorrectly, there may be a duplex mismatch. This will result in CRC and/or frame-alignment errors on the receiving interface at one or both ends of the link. With knowledge of Layer 2, your EMS can correlate the duplex settings of both sides and determine whether these errors and associated threshold events are due to this duplex mismatch. Your NMS could check the whole network for duplex mismatches or your EMS could check for duplex mismatches only when a CRC or frame-alignment threshold is triggered.
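Once the duplex setting of each port and the Layer 2 links are known, the check itself is simple, as in this sketch. How the settings and links are collected is outside its scope; the data layout and names are assumptions.

```python
def find_duplex_mismatches(links, duplex):
    """Return the links whose two ends report different duplex settings.

    links:  iterable of ((device_a, port_a), (device_b, port_b)) pairs
    duplex: dict mapping (device, port) to "full" or "half"
    """
    mismatches = []
    for a, b in links:
        da, db = duplex.get(a), duplex.get(b)
        # Skip links where either end's setting is unknown.
        if da and db and da != db:
            mismatches.append((a, b, da, db))
    return mismatches
```

The same function serves both approaches in the text: pass every known link for a network-wide audit, or pass only the link behind the interface whose CRC or frame-alignment threshold was triggered.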

Layer 3 Correlation

Layer 3 correlation can do many of the same kinds of things that Layer 2 correlation can, only for things such as IP networks and subnets instead of segments and LANs.

Consider an example of network failure that affects the Layer 3 topology of the network. Say you have a routed topology that has a connection from a router in your home office in Chicago, Illinois to a regional office in Denver, Colorado. You have some servers in the regional office and the router in Denver also provides connectivity to smaller offices around the southwest. The link from Chicago to Denver goes down. So, just as in the previous example, you'll get a linkDown trap from the home office router.

Your knowledge base should indicate that this link is critical to your southwest operations and that a high-priority fault has occurred. Without correlation, this issue would soon be buried in all the availability events that start coming in soon after. With correlation, it becomes obvious that these availability events are due to the link-down fault and should be suppressed until this issue is resolved.

As in the Layer 2 correlation example, it is important for the EMS to notice when repairs are effected and start processing the suppressed events again so that any more minor faults that may have occurred can be noticed and reported.

Another Layer 3 correlation that may be a bit more interesting is a network where OSPF is used as the routing protocol. Availability polling is reporting intermittent connectivity issues in the network. CPU load on the routers in Area 1 is quite high, as noticed by events generated by thresholds on the avgBusy5 object from the OLD-CISCO-CPU-MIB or (preferred) the cpmCPUTotal5min from the CISCO-PROCESS-MIB. And a high rate of ospfOriginateLsa events is coming from one of the same routers. Lots of ospfIfStateChange events are coming from the same router. Link-down events are also coming from the same router and the same interface. The EMS can correlate all of these events and determine that a flapping interface is causing a high rate of change in the OSPF routing table, enough to cause the CPU load on the routers to stay high. There is intermittent connectivity loss due to routes dropping as the routers fail to keep up with the high rate of change.

All these events can be suppressed until the issue with this interface is cleared up. Your EMS should maintain a careful watch to ensure that it suppresses only the events that can reasonably be expected to be the result of this issue. Then, after the fault is resolved your EMS can review the suppressed events to determine whether there are any other faults hiding behind the resolved fault.