Error/Fault Data for Router Hardware

Previous sections looked at the MIB variables that tell what kind of hardware is in the network and how everything is physically allocated in the devices. This section turns to the variables you can use for error and fault management on routers and switches. For fault management in routers, this discussion focuses mainly on the SNMP traps and syslog messages that tell when hardware issues are arising in the network, rather than on actively polling MIB objects. However, a few MIB variables should be polled in conjunction with the reception of other events, such as a syslog message or a defined MIB threshold being exceeded.

MIB Variables for Router Failure

From OLD-CISCO-SYSTEM MIB and OLD-CISCO-CHASSIS MIB cardTable, MIB variables to watch for router failure are as follows:
These two MIB objects indicate the general health of the router. After the router is up and you have verified whyReload, all you need to poll is cardOperStatus to validate the status of the Interface Processor or NIM cards in the chassis. Most of the time, you will not actively poll any of these objects to test for hardware faults. Instead, use SNMP traps and syslog messages to trigger a response to a possible hardware failure in the network. Based on the syslog messages or SNMP traps received on your network management console, you can determine when you need to poll these MIB objects. Polling them after one of those events is where these MIB objects are meaningful and add value.

CLI Commands for Router Failure

The following show commands return the same type of data points as the MIB objects described previously for analyzing router health.

Router Health from show version

The show version command displays the reason why the router last reloaded. From this output, you can get more details on a particular reload error, such as a software-forced crash or exception error. This command output, in conjunction with show stack output, can help determine the cause of a router crash.

TIP If your router reloaded due to a software-forced crash or anything other than "power on" or "reload," take the output from the show stack command and paste it into the following URL on Cisco's Connection Online (CCO) to automatically search for known IOS or hardware defects: http://www.cisco.com/stack/stackdecoder.shtml If no hits are displayed in this search engine, open a case with the Cisco TAC and provide the engineer with show tech support output that includes this command.

Example 10-10 emphasizes output from the show version command that is related to router health.

Example 10-10 Obtaining router health information with show version.
Router>show version
Cisco Internetwork Operating System Software
IOS (tm) 4000 Software (C4000-JS-M), Version 11.2(17), RELEASE SOFTWARE (fc1)
Copyright 1986-1999 by cisco Systems, Inc.
Compiled Mon 04-Jan-99 18:40 by ashah
Image text-base: 0x00012000, data-base: 0x0077EBC0
ROM: System Bootstrap, Version 4.14(7), SOFTWARE
Router uptime is 16 minutes
System restarted by error - Software forced crash, PC 0xF9128 (A)
System image file is "c4000-js-mz.112-17.bin", booted via flash
cisco 4000 (68030) processor (revision 0xB0) with 16384K/4096K bytes of memory.
Processor board ID 5026712
G.703/E1 software, Version 1.0.
Bridging software.
SuperLAT software copyright 1990 by Meridian Technology Corp).
X.25 software, Version 2.0, NET2, BFE and GOSIP compliant.
TN3270 Emulation software.
2 Ethernet/IEEE 802.3 interface(s)
2 Serial network interface(s)
128K bytes of non-volatile configuration memory.
4096K bytes of processor board System flash (Read/Write)
Configuration register is 0x2102

The "System restarted by" line (A) indicates the reason for the router's last reset. In this example, you can tell that the router reloaded due to a software-forced crash of some kind. This data alone is not enough for you or for a Cisco support engineer in the Technical Assistance Center (TAC) to diagnose the failure. You also need to gather output from the show stack command to get an accurate representation of where the failure occurred in the IOS or hardware. You or a TAC engineer can feed the output from the show stack command into the stack decoder on CCO to determine whether a defect exists: http://www.cisco.com/stack/stackdecoder.shtml

You may also need to get a core dump of the memory in the router if a defect is not accurately identified. A core dump helps the Cisco IOS development engineers determine the cause of the crash. Please refer to the following URL on CCO for information on creating a core dump:
CAUTION Please consult with the Cisco TAC prior to producing a core dump for the TAC.

Router Health from show stack

The output from this command gives you more details on why the router reloaded, especially if the reload was caused by an error of some kind. You or a TAC engineer can feed the output from the show stack command into the stack decoder on CCO to determine whether a defect exists: http://www.cisco.com/stack/stackdecoder.shtml

You may also need to get a core dump of the memory in the router if a defect is not accurately identified. A core dump helps the Cisco IOS development engineers determine the cause of the crash. Please refer to the following URL on CCO for information on creating a core dump:
CAUTION Again, please consult with the Cisco TAC prior to producing a core dump for the TAC.

Example 10-11 provides show stack output, with emphasis on information for router health.

Example 10-11 Obtaining router health information from show stack.

Router>sh stack
Minimum process stacks:
Free/Size Name
1408/2000 Router Init
2632/4000 Init
Interrupt level stacks:
Level Called Unused/Size Name
3 7810 2540/3000 Network interfaces
4 0 3000/3000 High IRQ Int Handler
5 1355 2896/3000 Console Uart
System was restarted by error - Software forced crash, PC 0xF9128 (A)
4000 Software (C4000-JS-M), Version 11.2(17), RELEASE SOFTWARE (fc1)
Compiled Mon 04-Jan-99 18:40 by ashah (current version)
Image text-base: 0x00012000, data-base: 0x0077EBC0
Stack trace from system failure:
FP: 0x843978, RA: 0xFFC9A
FP: 0x84399C, RA: 0xE9936
FP: 0x8439B8, RA: 0xFCC46

Starting with the "System was restarted by error" line (A) and continuing to the end of the command output, these lines provide information relating to the cause of a router crash or reload. Again, this data should be provided to a TAC engineer or fed into the stack decoder on CCO, as indicated previously, to search for possible known defects.

Router Health from show diagbus

The show diagbus command reveals cards in the router that are not recognized for one reason or another. Anomalies such as "UNKNOWN," hardware revisions of "255.255," or serial numbers of all zeroes are the key values to pick out of this data. This output can be correlated to syslog messages or SNMP traps relating to failing or incompatible hardware. Boards can show up as "UNKNOWN" if the card is not supported by the IOS release running on the router or if there is a hardware problem with the card. Sometimes, even when show diagbus displays valid output for a card that is having an issue, the output from show controller cbus shows no microcode installed on the card (Sw 0.00).
This typically indicates an IOS compatibility problem with the Interface Processor (IP).

Example 10-12 shows an extreme case, but one that is still feasible. All the highlighted fields, marked (A), indicate a problem with the cards installed. The affected fields contain no valid data: they show "UNKNOWN," all zeroes, or a value "maxed out" to the size of the field, such as "255.255" for hardware revision.

Example 10-12 Obtaining router health information from show diagbus.

Router#sh diagbus
Slot 0:
UNKNOWN port adapter (A)
Port adapter is analyzed
Port adapter insertion time unknown
Hardware revision 255.255 (A)
Board revision UNKNOWN (A)
Serial number 4294967295
Part number 800-11534335-255
Test history 0xFF
RMA number 255-255-255 (A)
EEPROM format version 255
EEPROM contents (hex):
0x20: FF 77 FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0x30: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Slot 1:
UNKNOWN port adapter (A)
Port adapter is analyzed
Port adapter insertion time unknown
Hardware revision 255.255 (A)
Board revision UNKNOWN (A)
Serial number 4294967295
Part number 800-11534335-255
Test history 0xFF
RMA number 255-255-255 (A)
EEPROM format version 255
EEPROM contents (hex):
0x20: FF 77 FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0x30: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF (A)
Slot 2:
Ethernet port adapter, 4 ports
Port adapter is analyzed
Port adapter insertion time unknown
Hardware revision 0.0 (A)
Board revision UNKNOWN (A)
Serial number 0 (A)
Part number 00-0000-00 (A)
Test history 0x0
RMA number 00-00-00
EEPROM format version 0
EEPROM contents (hex):
0x20: 00 42 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Slot 3:
Ethernet port adapter, 4 ports
Port adapter is analyzed
Port adapter insertion time unknown
Hardware revision 1.1
Board revision A0
Serial number 5361301
Part number 800-02027-02
Test history 0x0
RMA number 00-00-00
EEPROM format version 1
EEPROM contents (hex):
0x20: 01 42 01 01 00 51 CE 95 50 07 EB 02 00 00 00 00
0x30: 50 00 00 00 97 05 30 00 FF FF FF FF FF FF FF FF

SNMP Traps for Router Failure [4]

From the CISCO-GENERAL-TRAPS MIB, several SNMP traps are relevant to router failure, as follows:
A reload trap signifies that the sending protocol entity is reinitializing itself so that the agent's configuration or the protocol entity implementation can be altered. This trap carries the values of the MIB objects sysUpTime and whyReload as its varbinds.

A coldStart trap signifies that the sending protocol entity is reinitializing itself so that the agent's configuration or the protocol entity implementation may be altered. This trap carries the values of the MIB objects sysUpTime and whyReload as its varbinds.

A linkDown trap signifies that the sending protocol entity recognizes a failure in one of the communication links represented in the agent's configuration. This trap carries the values of the MIB objects ifIndex, ifDescr, ifType, and locIfReason as its varbinds.

A linkUp trap signifies that the sending protocol entity recognizes that one of the communication links represented in the agent's configuration has come up. This trap carries the values of the MIB objects ifIndex, ifDescr, ifType, and locIfReason as its varbinds. By comparing the linkUp trap timestamp with that of the preceding linkDown trap, you can determine how long a particular interface was down. This is especially useful when calculating network availability.

Syslog Messages for Router Failure

The messages reported in Table 10-4 represent some of the more common syslog messages. This does not mean that other messages are unimportant; they are simply seen less frequently. You can use the basic methodology defined with these messages to correlate them with SNMP MIB objects or show commands. The same methodology can be applied to other messages seen in the syslog that are not reported here. Most of the syslog messages reported here are seen on high-end routers such as the 7x00 series. Syslog messages have different severity levels, as indicated by the number in the message. The lower the number, the more severe the issue in the router.
You should act on messages with severities 0 through 3 and be aware of messages with severities 4 through 7.

TIP Use timestamps in syslog messages to determine when an event occurred. Note that if the device and the syslog server have different clock sources, the times may differ slightly.
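The severity rule above can be sketched in a few lines of Python. This is a minimal illustration, not a complete syslog parser: it assumes the message body uses the common %FACILITY-SEVERITY-MNEMONIC: layout, and the function name classify and the labels "act" and "monitor" are hypothetical names chosen for this example.

```python
import re

# Assumed layout of the message body: %FACILITY-SEVERITY-MNEMONIC: text
# (messages that use a different layout simply will not match).
SYSLOG_PATTERN = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+):")


def classify(message):
    """Return (severity, action) for a syslog line, or None if no match.

    "act" and "monitor" are illustrative labels: severities 0 through 3
    call for action, severities 4 through 7 only need to be watched.
    """
    match = SYSLOG_PATTERN.search(message)
    if match is None:
        return None
    severity = int(match.group(2))
    action = "act" if severity <= 3 else "monitor"
    return severity, action


if __name__ == "__main__":
    # Sample lines are for illustration only.
    for line in (
        "%SYS-5-RESTART: System restarted",
        "%CBUS-3-INITERR: Interface 2, Error (8021)",
    ):
        print(line, "->", classify(line))
```

A filter like this could sit in front of whatever triggers the follow-up MIB polling, so that only messages classified as "act" start a poll.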
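The availability calculation mentioned for linkDown and linkUp traps can be sketched as follows. This is a rough illustration under stated assumptions: each trap has already been reduced to a (timestamp, trap name, ifIndex) tuple, the list is sorted by time, and all timestamps come from a single clock (for example, the trap receiver's). The function names interface_downtime and availability are hypothetical.

```python
from datetime import datetime


def interface_downtime(events):
    """Sum the seconds between each linkDown and the next linkUp, per ifIndex.

    events: iterable of (timestamp, trap_name, if_index), sorted by time.
    A linkUp with no preceding linkDown is ignored, and an interface that
    is still down at the end of the list contributes nothing.
    """
    down_since = {}
    totals = {}
    for timestamp, trap, if_index in events:
        if trap == "linkDown":
            # Keep the earliest linkDown if several arrive in a row.
            down_since.setdefault(if_index, timestamp)
        elif trap == "linkUp" and if_index in down_since:
            elapsed = (timestamp - down_since.pop(if_index)).total_seconds()
            totals[if_index] = totals.get(if_index, 0.0) + elapsed
    return totals


def availability(downtime_seconds, period_seconds):
    """Percentage of the measurement period during which the link was up."""
    return 100.0 * (1.0 - downtime_seconds / period_seconds)


if __name__ == "__main__":
    sample = [
        (datetime(1999, 1, 4, 10, 0, 0), "linkDown", 2),
        (datetime(1999, 1, 4, 10, 5, 0), "linkUp", 2),
    ]
    downtime = interface_downtime(sample)
    print(downtime)
    print(availability(downtime[2], 24 * 60 * 60))
```

Because the result depends entirely on the trap timestamps, the earlier caution about differing clock sources applies here as well.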