Fault Management

 

A dependable network requires a fault management system. Potential and existing problems must be detected as early as possible so that you can act immediately to resolve them. A fault management system detects problems with devices and links, ideally before end users notice the outage.

An SNMP-configured router sends traps to the management station when it detects a failure. Because SNMP uses UDP to send traps, however, there is no guarantee that the message describing the fault will reach the management station. A fault management system therefore cannot rely solely on traps; it must also poll the routers for information about the state of lines, interfaces, and router components. In addition to polling routers for component information, the management station polls the router itself, sometimes using ICMP pings, to make sure it is accessible.

To ensure that the IP address of the router is accessible via any active interface, it is a good idea to use a loopback address as the identifying IP address on the router. A management station polling or pinging the loopback address can use any available routed path to reach the router. A management station polling or pinging a nonloopback interface address on the router will declare the router inaccessible if that interface is down, even if the router can be reached via an alternative path. You should configure traps to use the loopback address as the source address of the packet, and configure the management stations to poll the router via the loopback address.
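The difference between polling an interface address and polling the loopback address can be illustrated with a small model. This is only a sketch: the function name, the interface names, and the assumption that any interface that is up provides a routed path to the router are illustrative and do not come from a real management product.

```python
def is_pollable(polled_address, interfaces, loopback="loopback0"):
    """Return True if a poll aimed at polled_address can be answered.

    interfaces maps a physical interface name to True (up) or False (down).
    Simplifying assumption: an interface that is up implies a routed path
    from the management station to the router.
    """
    if polled_address == loopback:
        # The loopback address is reachable through any interface still up.
        return any(interfaces.values())
    # A physical interface address answers only while that interface is up.
    return interfaces.get(polled_address, False)


# Hypothetical router with one failed serial line.
interfaces = {"serial0": False, "serial1": True, "ethernet0": True}
print(is_pollable("serial0", interfaces))    # False
print(is_pollable("loopback0", interfaces))  # True
```

In this model the router is declared inaccessible when it is polled on serial0, even though it is still reachable over the two remaining interfaces.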

As with many protocols, a trade-off exists between how fast a management station can detect an outage and the amount of network traffic generated. If the management station misses a trap, and needs to rely on its polling or pinging to detect an outage, the outage may not be detected for quite a while. If the failed device is a router, and the management station is configured to ping the router every 5 minutes, and declare it dead if it misses three pings, it will take up to 15 minutes to detect the failure. A link or other component failure is detected sooner. The management station does not rely on the absence of a response to detect these outages, but rather asks the router for the state of the component. The router responds with the state information.
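The 15-minute figure follows directly from the polling interval multiplied by the number of missed pings. The sketch below works through that arithmetic and shows one plausible way a station might count consecutive misses; the class and function names are hypothetical, not taken from any particular management product.

```python
def worst_case_detection_seconds(poll_interval_s, missed_polls):
    """Upper bound on detection time when only pings reveal the outage."""
    return poll_interval_s * missed_polls


class PingMonitor:
    """Declares a router down after a run of consecutive missed pings."""

    def __init__(self, missed_polls_allowed=3):
        self.limit = missed_polls_allowed
        self.consecutive_misses = 0

    def record(self, replied):
        """Record one ping result; return True once the router is declared down."""
        if replied:
            # Any reply resets the count of consecutive misses.
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= self.limit


# 5-minute interval, three missed pings: 900 seconds, or 15 minutes.
print(worst_case_detection_seconds(5 * 60, 3))  # 900

monitor = PingMonitor(missed_polls_allowed=3)
print([monitor.record(r) for r in [True, False, False, False]])
# [False, False, False, True]
```

Shortening the interval or lowering the miss count reduces the detection time at the cost of more polling traffic or more false alarms. Component state, by contrast, is read directly from the router rather than inferred from silence.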

For example, Figure 9-3 shows a management station polling a router for the state of its interfaces.

Figure 9-3. Management Station Is Polling the Router for Interface States


The management station polls the router for the state of its interfaces using the ifEntry.ifOperStatus object ID in the MIB. The router responds with the state of each interface: three interfaces are up, and one is down.
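A management station has to turn the integer values returned for ifOperStatus into interface states. The sketch below uses the standard ifOperStatus enumeration (1 is up, 2 is down, 3 is testing, with later values added by the IF-MIB). It performs no SNMP I/O: the poll result is a hand-written dictionary, and the interface names are hypothetical because the figure does not name them.

```python
# ifOperStatus column of the interfaces table (ifTable) in the standard MIB.
IF_OPER_STATUS_OID = "1.3.6.1.2.1.2.2.1.8"

# Standard ifOperStatus enumeration.
IF_OPER_STATUS = {
    1: "up",
    2: "down",
    3: "testing",
    4: "unknown",
    5: "dormant",
    6: "notPresent",
    7: "lowerLayerDown",
}


def down_interfaces(poll_result):
    """Return the sorted names of interfaces whose ifOperStatus is down."""
    return sorted(
        name
        for name, value in poll_result.items()
        if IF_OPER_STATUS.get(value, "unknown") == "down"
    )


# Hypothetical response: three interfaces up, one down.
poll_result = {"Serial0": 1, "Serial1": 2, "Ethernet0": 1, "TokenRing0": 1}
print(down_interfaces(poll_result))  # ['Serial1']
```

Because the router reports the state explicitly, the station learns about the failed interface on the next poll instead of waiting for several polls to go unanswered.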

A fault management system detects failures. The failures are reported to the network operators by visual or audible alerts or are sent by e-mail or pager. The method used for sending alerts is customized to the user's environment. If someone is in front of the management console 7x24, audible and visual alerts suffice. If the console is not staffed all the time, e-mail or pager alerts are sent when no one is at the console. The failures indicate link, router, or router component outages. These alerts are raised after the problem has occurred; the fault management station also attempts to alert operators before failures occur.

Many times, specific events lead to a failed component. For instance, a serial line may report high error counts or carrier transitions before it fails completely. A router may report memory problems before it fails. Fault management stations maintain threshold information. When the threshold has been exceeded, an alarm is sent to the network operator. You can configure thresholds for any number of variables. To configure the values of the thresholds, the network is first baselined. The baseline takes place over a period of time, such as a week, when the network is running normally. The normal values of the variables are obtained. You then can configure thresholds at some level (say 20%) above normal.
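As a rough illustration of deriving a threshold from a baseline, the sketch below averages the samples collected during the baseline period and adds a 20% margin. Using the arithmetic mean as the "normal value" is an assumption made here for simplicity; the function name and the sample values are hypothetical.

```python
from statistics import mean


def threshold_from_baseline(samples, margin=0.20):
    """Return a threshold set `margin` above the mean of the baseline samples."""
    if not samples:
        raise ValueError("baseline needs at least one sample")
    return mean(samples) * (1.0 + margin)


# Hypothetical CPU utilization readings (percent) from a normal week.
baseline_cpu = [40, 50, 60]
print(threshold_from_baseline(baseline_cpu))
```

With a mean of 50%, the alarm threshold lands at about 60% utilization. A variable with a highly variable baseline may need a wider margin to avoid nuisance alarms.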

Some MIB variables that provide useful threshold information include the following:

  • Amount of free memory

  • Average CPU utilization

  • Buffer misses

  • Interface input and output rate

  • Interface input and output errors

  • Interface input and output queue drops

  • Interface packets ignored

  • Interface resets

  • Serial interface CRC, abort, and frame errors

  • Frame Relay FECN/BECN

  • Serial interface carrier transitions

  • Ethernet collisions

  • Ethernet runts, giants, and frame errors

  • Token Ring line and burst errors

  • Token Ring internal errors

  • Token Ring token and soft errors

  • Token Ring signal losses

Some of the items listed occur in a perfectly normal network. When they exceed a threshold, however, performance can be degraded, and a more serious problem may be brewing. The management station polls the routers for the value of these variables periodically. If the change in values between polling periods exceeds the threshold, an alarm is generated.
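The comparison of successive polls can be sketched as follows. The example assumes the polled variable is a 32-bit SNMP counter, such as an interface error count, which wraps back to zero after reaching its maximum; gauge-style variables such as free memory would be compared against the threshold directly instead. The function names and values are illustrative.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps at this value


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Change in a counter between two polls, allowing for a single wrap."""
    return (current - previous) % modulus


def exceeds_threshold(previous, current, threshold):
    """Return True if the change between two polls should raise an alarm."""
    return counter_delta(previous, current) > threshold


# Hypothetical input-error counts from two successive polls.
print(exceeds_threshold(1000, 1250, threshold=200))  # True
print(exceeds_threshold(1000, 1100, threshold=200))  # False

# The counter wrapped between polls; the true change is 15.
print(counter_delta(2 ** 32 - 10, 5))  # 15
```

Because the alarm is based on the change per polling period, the threshold is only meaningful for a fixed polling interval.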

You also can use RMON for thresholding. With RMON, the management station does not have to poll for the variables. The RMON agent on the router polls the variables locally and sends a trap to the management station when the threshold is exceeded. The management station receives the trap and generates the alarm. The trade-off here is network usage versus router processing. Enabling RMON minimizes network traffic but increases the amount of processing done on the router.
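The local evaluation done by an RMON agent can be approximated with a rising threshold paired with a falling threshold, so that one trap is sent when the value crosses the rising threshold and no further trap is sent until the value has dropped back to the falling threshold. The sketch below is a simplified model of that behavior, not an implementation of the RMON alarm group; the class name and the sample values are hypothetical.

```python
class RisingAlarm:
    """Simplified model of a rising alarm with a falling re-arm threshold."""

    def __init__(self, rising, falling):
        if falling > rising:
            raise ValueError("falling threshold must not exceed rising threshold")
        self.rising = rising
        self.falling = falling
        self.armed = True  # ready to send a trap on the next rising crossing

    def sample(self, value):
        """Evaluate one local sample; return True if a trap should be sent."""
        if self.armed and value >= self.rising:
            # Send one trap, then stay quiet until the value recovers.
            self.armed = False
            return True
        if not self.armed and value <= self.falling:
            # The value has recovered, so the alarm is re-armed.
            self.armed = True
        return False


alarm = RisingAlarm(rising=90, falling=50)
print([alarm.sample(v) for v in [10, 95, 97, 40, 96]])
# [False, True, False, False, True]
```

Only two of the five samples produce traffic toward the management station, which is the saving in network usage that the text describes; the cost is that the router evaluates every sample itself.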



Routing TCP[s]IP (Vol. 22001)
Year: 2004
Pages: 182
