Fault Delivery, Notification, Reporting, and Repair | Performance and Fault Management: A Practical Guide to Effectively Managing Cisco Network Devices (Cisco Press Core Series)

Now that faults have occurred, you need to do something with them. You have uncovered a problem with the network, but not all problems are equally important. The next few sections discuss the following:

Methods of fault delivery
Notifying key personnel of critical faults
Reporting faults
Effecting fault repair

Methods of Fault Delivery

The first step in processing faults is delivering them to the correct recipients. Faults need to be delivered to a repository so that they can be processed easily and in a manner that reflects their priority. The lowest-priority faults should be logged to a file for processing when time allows. Higher-priority faults should be delivered to a fault-tracking system.

Logging Faults

The only faults that should be logged are low-priority faults. Issues such as performance faults representing utilization thresholds exceeded on links on your network should be logged and processed for planning increased capacity in those areas of your network.

Low-priority faults do not include those that represent either a loss of connectivity on your network or a reduction of redundancy such that another fault could cause a loss of connectivity.
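The routing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any particular EMS: the Fault fields, the numeric priority scale (1 is most urgent), and the LOW_PRIORITY cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    device: str
    description: str
    priority: int                      # assumption: 1 is most urgent
    affects_connectivity: bool = False
    reduces_redundancy: bool = False


LOW_PRIORITY = 4  # hypothetical cutoff; set this from your own policy


def choose_destination(fault: Fault) -> str:
    """Return "log" for low-priority faults and "tracker" for all others."""
    # Connectivity or redundancy problems are never treated as low priority.
    if fault.affects_connectivity or fault.reduces_redundancy:
        return "tracker"
    if fault.priority >= LOW_PRIORITY:
        return "log"
    return "tracker"
```

A utilization threshold fault at priority 4 would be logged, while a fault that reduces redundancy goes to the tracking application regardless of the priority it was assigned.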

Delivering Faults to a Tracking Application

Faults that require attention should be delivered to a tracking application to ensure that they receive the attention they deserve.

Two possibilities for tracking faults are a fault console (a lightweight mechanism for tracking and distributing faults) and a trouble-ticketing system. Trouble-ticketing may be overkill for smaller networks, but for larger ones it may be very appropriate. Both of these options are discussed in the following two sections.

Fault Consoles

Faults on a fault console should be sorted and viewable by priority and by the time the fault occurred so network engineers can easily determine where to start on all of the issues being tracked. It should be possible to view all faults of a particular type so that an engineer working on capacity issues, for example, can easily view the relevant faults. If the network is large and complex and there are many network operators and engineers, a checkout procedure is desirable so it is clear which faults are being worked on and by whom.
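As a rough sketch of these three requirements (sorting, filtering by type, and checkout), consider the following Python. The class and field names are hypothetical, and the sort order (most urgent first, then oldest first) is one reasonable choice rather than a requirement.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ConsoleFault:
    fault_id: int
    fault_type: str                      # for example "capacity" or "link"
    priority: int                        # assumption: 1 is most urgent
    occurred_at: float                   # seconds since the epoch
    checked_out_by: Optional[str] = None


class FaultConsole:
    def __init__(self) -> None:
        self._faults: Dict[int, ConsoleFault] = {}

    def add(self, fault: ConsoleFault) -> None:
        self._faults[fault.fault_id] = fault

    def view(self, fault_type: Optional[str] = None) -> List[ConsoleFault]:
        """List faults by priority, then by time, optionally for one type."""
        selected = [f for f in self._faults.values()
                    if fault_type is None or f.fault_type == fault_type]
        return sorted(selected, key=lambda f: (f.priority, f.occurred_at))

    def check_out(self, fault_id: int, engineer: str) -> bool:
        """Claim a fault; return False if someone else already holds it."""
        fault = self._faults[fault_id]
        if fault.checked_out_by is not None:
            return False
        fault.checked_out_by = engineer
        return True
```

An engineer working on capacity issues would call view("capacity") and then check_out() the fault he or she intends to work on.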

Trouble-Ticketing Systems

For a trouble-ticketing system to work as a tracking system for your network faults, it must do all that a fault console does and more. For example, your trouble-ticketing system should let you add detailed notes on the status of each event's processing and on its resolution. Trouble-ticketing systems can often track the time to repair a fault. They may also be able to track how a fault was resolved. If the resolution includes the replacement of equipment, the system may be able to communicate this information to your inventory system so your inventory stays current.

Although a fault console may allow a fault or a group of faults to be closed with just a couple of mouse clicks, most trouble-ticketing systems require more time from your operators and engineers. It may be possible to get enough data from the type and nature of faults and their resolution to make this time worthwhile. However, you must convince your operators and engineers of this. Otherwise, much of the information in your expensive trouble-ticketing system may consist of lots of "not-applicables."

Notifying Key Personnel of Critical Faults

Some faults represent critical issues in your network that need immediate attention. After delivering these faults to a tracking application, you may wish to notify key personnel that these faults have occurred. First, you must determine which group to deliver the fault to. Normally, most network support organizations will have two groups to deliver faults to: network operators and network engineers.

Some faults, such as WAN link outages, require an operator to contact your service provider with the circuit ID that appears to be down. Other issues are not so well-defined. Late or excessive collisions on a network segment, for example, require analysis to determine the component at fault and therefore have to be sent to a network engineer.

Your knowledge base should have details on how you want your faults delivered. Issues that are not well-defined or not in your knowledge base should be delivered to your network engineers for further analysis.

How faults of each priority are handled, such as paging network engineers for high-priority faults, should be set as a policy on your network and recorded in your knowledge base.

Because network engineers and technicians need to be mobile to be able to deal with faults, your EMS needs to be able to determine how best to contact them for different priorities of faults. Some of the alternatives are discussed in the following sections.

Popup Fault Notifications

Some faults should be forwarded to network engineers or technicians via online notification such as a pop-up window.

Often, operator notifications are handled through pop-up windows on the operator console. This works only for a low rate of notifications: a screen covered with pop-up windows is annoying at the least, and each window may have to be acknowledged before you can use the system, which can take a lot of time. For a very low rate, some sort of audible or visual alarm should be added, because no one is going to stare at a screen for hours at a time.

Email

It is often useful to email notice of a fault and its associated ID or number to the assigned engineer or technician. Including the priority and nature of the fault allows the engineer to treat the fault with the proper urgency.

Pager and Phone

If your network engineers or technicians carry pagers or portable phones, you'll probably want to notify them about high-priority faults assigned to them. Some organizations choose to have a dispatcher or network operator take the fault and notify the appropriate person by page or phone. Other organizations choose to have the EMS do this directly. Once again, it is important to give as much information as possible, but to keep it very short.
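The notification alternatives above could be tied to fault priority with a small policy table. The sketch below is illustrative only: the priority-to-channel mapping, the channel names, and the 80-character limit are assumptions, and the real values belong in your knowledge base.

```python
from typing import Dict, List

# Hypothetical policy: which channels to use for each fault priority.
NOTIFY_POLICY: Dict[int, List[str]] = {
    1: ["page", "email"],
    2: ["email", "popup"],
    3: ["popup"],
}


def channels_for(priority: int) -> List[str]:
    """Return the channels for a priority; unlisted priorities notify no one."""
    return NOTIFY_POLICY.get(priority, [])


def short_message(fault_id: int, priority: int, summary: str,
                  limit: int = 80) -> str:
    """Build a brief notice carrying the fault ID, priority, and nature."""
    text = f"[P{priority}] fault {fault_id}: {summary}"
    if len(text) <= limit:
        return text
    # Truncate so the notice still fits on a pager or phone display.
    return text[:limit - 3] + "..."
```

Putting the priority and fault ID at the front of the message means they survive truncation, which matters most on a pager.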

Reporting Faults

Faults can be reported in many ways. The next two sections discuss the following methods:

Reporting network status
Reporting network health

If you've done a quality job up until now, the active faults on the network should accurately map the problems with your network. In each of these two reporting methods, you will be using the active faults on your network to determine the state of the network.

Reporting Network Status

If you have a network map as part of your NMS, it should be possible to have faults change the color of the affected devices as a way of tracking your network's status. The highest-priority problem that isn't currently being worked on should generate the reddest color. If you have an escalation policy for faults, for example one in which all Priority 2 faults not resolved within three hours are automatically promoted to the next level or receive additional attention, the color of the affected devices should also reflect this escalation. This can be done as part of your fault console or as an action triggered by your trouble-ticketing system.
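One way to picture the coloring and escalation rules is the sketch below. The color names, the treatment of faults that are already being worked on ("blue"), and the function names are assumptions for illustration; only the three-hour Priority 2 escalation comes from the example policy in the text.

```python
from typing import Dict, Iterable, Tuple

# Example policy from the text: unresolved Priority 2 faults are promoted
# one level after three hours.
ESCALATE_AFTER_SECONDS: Dict[int, float] = {2: 3 * 60 * 60}

# Hypothetical color scale: lower priority number means a redder color.
COLORS: Dict[int, str] = {1: "red", 2: "orange", 3: "yellow"}


def effective_priority(priority: int, age_seconds: float) -> int:
    """Apply the escalation policy to a fault of the given age."""
    limit = ESCALATE_AFTER_SECONDS.get(priority)
    if limit is not None and age_seconds > limit:
        return priority - 1
    return priority


def device_color(faults: Iterable[Tuple[int, float, bool]]) -> str:
    """faults holds (priority, age_seconds, being_worked_on) per active fault."""
    faults = list(faults)
    if not faults:
        return "green"               # no active faults on this device
    unattended = [effective_priority(p, age)
                  for p, age, working in faults if not working]
    if not unattended:
        return "blue"                # assumption: everything is in progress
    return COLORS.get(min(unattended), "yellow")
```

With this policy, a Priority 2 fault that has been open for four hours and is not being worked on turns the device red rather than orange.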

Reporting Network Health

A report of network health is the inverse of reporting the fault status of your network. There are many ways of reporting the health of your network. Network health reporting is discussed at length in the "Network and Device Health Reporting" section of Chapter 4, "Performance Measurement and Reporting."

Effecting Fault Repair

You've come a long way from where you started. Instead of having to wait for users to tell you when things are broken, you have the network intelligently reporting problems. The next step is to have your network attempt to repair itself. This capability isn't going to put you out of business by any stretch of the imagination. In fact, designing and implementing such a system for your network require your imagination and experience. The benefit can be a dramatically shorter mean time to repair and, therefore, happier users.

Levels of Autonomy

In some cases, it might be possible for your EMS to determine the source of the fault and to effect a method of bypassing or repairing it. There are several levels of autonomy that you might give to your EMS in this area:

Recommend action.
Ask, then repair.
Just do it.

The lowest level of autonomy is for your EMS to recommend a course of action to repair or bypass the fault. At times, this is the only course of action available to your EMS because many faults require human intervention, at least until robotics becomes much more advanced and inexpensive.

The next level of autonomy is for your EMS to ask permission to take an action. The small amount of time required for someone to examine the suggested repair and ensure that it makes sense is probably worthwhile, at least the first couple of times. After an engineer approves the repair, the EMS can implement it. As your confidence grows regarding your EMS's reliability in recommending certain types of repairs, you can consider increasing its autonomy.

The highest level of autonomy is for your network to detect a fault and repair it without any human intervention. You already use this level of autonomy if you use spanning tree, HSRP, or a dynamic routing protocol in your network. What we are talking about here is taking actions that are possible only because of the view your NMS has of your network. Now, let's look at some examples of what might be possible in your network.
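Before turning to those examples, the three levels of autonomy can be summarized in a short sketch. The names below (Autonomy, handle_repair, and the approve callback that stands in for an engineer's decision) are hypothetical and do not correspond to any particular EMS product.

```python
from enum import Enum
from typing import Callable


class Autonomy(Enum):
    RECOMMEND = 1          # recommend action
    ASK_THEN_REPAIR = 2    # ask, then repair
    JUST_DO_IT = 3         # repair without human intervention


def handle_repair(level: Autonomy, description: str,
                  repair: Callable[[], None],
                  approve: Callable[[str], bool] = lambda desc: False) -> str:
    """Apply a proposed repair according to the autonomy level granted."""
    if level is Autonomy.RECOMMEND:
        # The EMS only suggests; a person carries out the repair.
        return f"recommended: {description}"
    if level is Autonomy.ASK_THEN_REPAIR and not approve(description):
        return f"declined: {description}"
    repair()
    return f"repaired: {description}"
```

Raising the autonomy for a given type of repair then amounts to changing the level passed in for it, which fits the idea of increasing autonomy as confidence grows.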

Examples of Fault Repair

Your knowledge base can enable your EMS to determine what actions to take to repair the network. One of the simpler things that your EMS can do is to watch for devices that are misbehaving in ways that affect other devices on the network.

For example, consider a device that is connected to a switch and that continually links to the network and then drops that link. The device's behavior could cause the switch to continuously recalculate its spanning tree, possibly often enough that the reliability of the network is affected. The signs of this issue are a port on the switch that is reporting a high rate of link down and up events and a high CPU load on that switch. With the high rate of transitions, it's unlikely that anything useful is being done on that port. Therefore, the EMS could safely recommend that the errant port be shut down. An easy next step would be for the EMS to report the problem for follow-up later.
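The two signs described above (a high rate of link transitions on a port and a high CPU load on the switch) could be combined into a simple check. In this sketch the window length, transition limit, and CPU limit are placeholder values chosen for illustration, and the function names are hypothetical; how the link events and CPU figure are collected is left out.

```python
from typing import Sequence


def count_transitions(event_times: Sequence[float], now: float,
                      window_seconds: float) -> int:
    """Count link up/down events that fall inside the most recent window."""
    return sum(1 for t in event_times if now - window_seconds <= t <= now)


def recommend_shutdown(event_times: Sequence[float], cpu_percent: float,
                       now: float, window_seconds: float = 60.0,
                       max_transitions: int = 10,
                       cpu_limit: float = 80.0) -> bool:
    """Recommend shutting the port only when both symptoms are present."""
    flapping = count_transitions(event_times, now, window_seconds) > max_transitions
    return flapping and cpu_percent > cpu_limit
```

Requiring both symptoms keeps the EMS from recommending a shutdown for a port that merely bounced a few times or for a switch that is busy for unrelated reasons.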

Another example involves solving an issue with IP multicast traffic sourced from a switch using CGMP to limit the traffic to only ports that find the traffic "interesting." As CGMP is only able to limit multicast traffic coming from a router, it is possible for one port to flood other ports on the same VLAN on the same switch with a high rate of multicast traffic. If there are ports of varying speeds in that VLAN, it is especially easy for the higher-speed ports to saturate the lower-speed ports with multicast traffic.

Knowing these things about the nature of multicast traffic, you can easily establish a policy for the maximum amount of multicast traffic that is allowed to be sourced from any port. Setting a threshold on the amount of multicast traffic sent on all ports (ifOutMulticastPkts in the IF-MIB from RFC 2233) to the chosen limit will alert you when this policy is violated.

You probably also want to set a separate policy on the maximum amount of broadcast traffic that can be sourced across all ports on that VLAN. In both cases, you can set thresholds. First, you need to select a port that should always be operational and for which the connected devices do not subscribe to any multicast groups. Setting thresholds on the selected port for the amount of multicast traffic received (ifInMulticastPkts) will generate a fault if traffic on this VLAN exceeds the threshold.
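Because ifInMulticastPkts and ifOutMulticastPkts are running counters, a threshold on the amount of traffic has to be evaluated on the difference between two polls. The sketch below assumes the 32-bit versions of these counters (which wrap at 2^32) and takes the two sampled values as plain integers; how the samples are obtained, and the limit itself, are left to your polling system and policy.

```python
COUNTER32_MODULUS = 2 ** 32  # the 32-bit counters wrap back to zero here


def packets_per_second(previous: int, current: int,
                       interval_seconds: float) -> float:
    """Rate between two samples of a Counter32, allowing for one wrap."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = (current - previous) % COUNTER32_MODULUS
    return delta / interval_seconds


def exceeds_policy(previous: int, current: int, interval_seconds: float,
                   limit_pps: float) -> bool:
    """True when the multicast packet rate is above the chosen limit."""
    return packets_per_second(previous, current, interval_seconds) > limit_pps
```

The polling interval must be short enough that the counter cannot wrap more than once between samples; otherwise the computed rate will be too low.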

Your EMS can report misbehaving ports and shut them down if the problem is serious enough.

Other possibilities for network self-healing include the EMS understanding recent changes in the network, such as configuration changes on devices or software-image changes, and being able to roll back the changes if problems occur on the network that could be attributed to those changes.
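As a rough illustration of the first half of that idea, the sketch below picks out recent changes on the faulted device as rollback candidates. The change-record layout, the one-hour window, and the function name are assumptions; deciding whether a change actually caused the fault, and performing the rollback, are not shown.

```python
from typing import List, Tuple

# Hypothetical change record: (device, change_time_seconds, description)
Change = Tuple[str, float, str]


def rollback_candidates(changes: List[Change], fault_device: str,
                        fault_time: float,
                        window_seconds: float = 3600.0) -> List[Change]:
    """Changes made on the faulted device shortly before the fault, newest first."""
    recent = [c for c in changes
              if c[0] == fault_device
              and 0 <= fault_time - c[1] <= window_seconds]
    return sorted(recent, key=lambda c: c[1], reverse=True)
```

Listing the newest change first matches the usual troubleshooting order: the most recent change is the first one to suspect.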