Error/Fault Data for Router Hardware

Previous sections looked at the MIB variables that tell what kind of hardware is in the network and how everything is physically allocated in the devices. This section turns to examining the variables you can use for error and fault management for the routers and switches.

For fault management in routers, this discussion focuses mainly on the SNMP traps and syslog messages that tell you when hardware issues are arising in the network, rather than on actively polling MIB objects. However, a few MIB variables should be polled when other events are received, such as a syslog message or a defined MIB threshold being exceeded.

MIB Variables for Router Failure

From the OLD-CISCO-SYSTEM MIB and the cardTable of the OLD-CISCO-CHASSIS MIB, the MIB variables to watch for router failure are as follows:

whyReload: The reason for the router's most recent reboot.
cardOperStatus: The status of the Interface Processor or NIM cards in the chassis.

These two MIB objects indicate the general health of the router. After the router is up and you have verified whyReload, you need to poll only cardOperStatus to validate the status of the Interface Processor or NIM cards in the chassis. Most of the time, you will not actively poll either object to test for hardware faults. Instead, use SNMP traps and syslog messages to trigger a response to a possible hardware failure in the network. Based on the syslog messages or SNMP traps received on your network management console, you can determine when to go out and poll these MIB objects. Polling them after one of those events is where they are meaningful and add value.

CLI Commands for Router Failure

The following show commands return the same kind of data as the MIB objects described previously for analyzing router health.

Router Health from show version

Note that show version does display the reason why the router reloaded the last time. From this output, you can get more details on a particular reload error, such as a software-forced crash or exception error. This command output, in conjunction with a show stack output, can help determine the cause of a router crash.

TIP

If your router reloaded due to a software-forced crash or anything other than "power on" or "reload," take the output from the show stack command and "paste" it into the following URL on Cisco's Connection Online (CCO) to automatically search for known IOS or hardware defects:

http://www.cisco.com/stack/stackdecoder.shtml

If no "hits" are displayed in this search engine, please open a case with the Cisco TAC and provide the engineer with a show tech support output that includes this command.

Example 10-10 emphasizes output from the show version command that is related to router health.

Example 10-10 Obtaining router health information with show version.

    Router>show version
    Cisco Internetwork Operating System Software
    IOS (tm) 4000 Software (C4000-JS-M), Version 11.2(17), RELEASE SOFTWARE (fc1)
    Copyright (c) 1986-1999 by cisco Systems, Inc.
    Compiled Mon 04-Jan-99 18:40 by ashah
    Image text-base: 0x00012000, data-base: 0x0077EBC0
    ROM: System Bootstrap, Version 4.14(7), SOFTWARE
    Router uptime is 16 minutes
    System restarted by error - Software forced crash, PC 0xF9128 ^A
    System image file is "c4000-js-mz.112-17.bin", booted via flash
    cisco 4000 (68030) processor (revision 0xB0) with 16384K/4096K bytes of memory.
    Processor board ID 5026712
    G.703/E1 software, Version 1.0.
    Bridging software.
    SuperLAT software copyright 1990 by Meridian Technology Corp).
    X.25 software, Version 2.0, NET2, BFE and GOSIP compliant.
    TN3270 Emulation software.
    2 Ethernet/IEEE 802.3 interface(s)
    2 Serial network interface(s)
    128K bytes of non-volatile configuration memory.
    4096K bytes of processor board System flash (Read/Write)
    Configuration register is 0x2102

The "System restarted by" line (A) indicates the reason for the router's last reset. In this example, you can tell that the router reloaded due to a software-forced crash of some kind. This data alone does not mean anything to you or to a Cisco Support Engineer in the Technical Assistance Center (TAC). You also need to gather output from the show stack command to get an accurate representation of where the failure occurred in the IOS or hardware. You or a TAC engineer can feed the output from the show stack command into the stack decoder on CCO to determine whether a defect exists: http://www.cisco.com/stack/stackdecoder.shtml
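
The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a Cisco tool: the function names `restart_reason` and `needs_stack_decode` are hypothetical, and the sketch assumes the captured text contains a "System restarted by" (or "System was restarted by") line like the one in Example 10-10.

```python
import re

# Matches both "System restarted by ..." (show version)
# and "System was restarted by ..." (show stack).
RESTART_RE = re.compile(r"System (?:was )?restarted by (.+)")


def restart_reason(show_output):
    """Return the text after 'System restarted by', or None if absent."""
    for line in show_output.splitlines():
        match = RESTART_RE.search(line)
        if match:
            return match.group(1).strip()
    return None


def needs_stack_decode(reason):
    """True when the reason is anything other than 'power on' or 'reload'.

    Assumption: hyphens are normalized so that 'power-on' and 'power on'
    are treated alike.
    """
    if reason is None:
        return False
    normalized = reason.lower().replace("-", " ").strip()
    return normalized not in ("power on", "reload")
```

If `needs_stack_decode` returns True for a router, that is the cue to collect show stack output as well.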

You also may need to get a "core dump" of the memory in the router if a defect is not accurately identified. A core dump is useful to the Cisco IOS development engineers to help determine the cause of the crash. Please refer to the following URL on CCO for information on creating a core dump:

http://www.cisco.com/warp/customer/68/15.html

CAUTION

Please consult with the Cisco TAC prior to producing a core dump for the TAC.

Router Health from show stack

The output from this command gives you more details on why the router reloaded, especially if it was caused by an error of some kind. You or a TAC engineer can feed the output from the show stack command into the stack decoder on CCO to determine whether a defect exists: http://www.cisco.com/stack/stackdecoder.shtml

You may be required to also get a "core dump" of the memory in the router if a defect is not accurately identified. A core dump is useful to the Cisco IOS development engineers to help determine the cause of the crash. Please refer to the following URL on CCO for information on creating a core dump:

http://www.cisco.com/warp/customer/68/15.html

CAUTION

Again, please consult with the Cisco TAC prior to producing a core dump for the TAC.

Example 10-11 provides show stack output, with emphasis on information for router health.

Example 10-11 Obtaining router health information from show stack.

    Router>sh stack
    Minimum process stacks:
     Free/Size   Name
     1408/2000   Router Init
     2632/4000   Init
    Interrupt level stacks:
    Level    Called Unused/Size  Name
      3        7810   2540/3000  Network interfaces
      4           0   3000/3000  High IRQ Int Handler
      5        1355   2896/3000  Console Uart
    System was restarted by error - Software forced crash, PC 0xF9128 ^A
    4000 Software (C4000-JS-M), Version 11.2(17), RELEASE SOFTWARE (fc1)
    Compiled Mon 04-Jan-99 18:40 by ashah (current version)
    Image text-base: 0x00012000, data-base: 0x0077EBC0
    Stack trace from system failure:
    FP: 0x843978, RA: 0xFFC9A
    FP: 0x84399C, RA: 0xE9936
    FP: 0x8439B8, RA: 0xFCC46

Starting with the "System restarted by error…" in line (A) and continuing to the end of the command output, these lines provide information relating to the cause of a router crash or reload. Again, this data should be provided to a TAC engineer or should be fed into the stack decoder on CCO, as indicated previously, to search for possible known defects.
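
When collecting this data from many routers, it can help to pull out just the crash-related portion before handing it to a TAC engineer. The following Python sketch shows one way to do that. The helper names `crash_section` and `stack_frames` are hypothetical, and the sketch assumes the output format shown in Example 10-11.

```python
import re

# One frame per line, for example "FP: 0x843978, RA: 0xFFC9A".
FRAME_RE = re.compile(r"FP:\s*(0x[0-9A-Fa-f]+),\s*RA:\s*(0x[0-9A-Fa-f]+)")


def crash_section(show_stack_output):
    """Return the non-empty lines from the 'restarted by' line to the end."""
    lines = show_stack_output.splitlines()
    for index, line in enumerate(lines):
        if "restarted by" in line:
            return [item.strip() for item in lines[index:] if item.strip()]
    return []


def stack_frames(show_stack_output):
    """Return (frame pointer, return address) pairs as integers."""
    return [(int(fp, 16), int(ra, 16))
            for fp, ra in FRAME_RE.findall(show_stack_output)]
```

The extracted lines are only a convenience for filing or archiving; the stack decoder and the TAC still expect the command output itself.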

Router Health from show diagbus

The show diagbus command displays cards in the router that are not recognized for one reason or another. Nuances such as "UNKNOWN," hardware revisions of "255.255," or serial numbers with all zeroes are the key values to pick out of this data. This output can be correlated to syslog messages or SNMP traps relating to failing hardware or incompatible hardware. Boards can show up "UNKNOWN" if the card is not supported by the IOS release running on the router or if there is a hardware problem with the card. Sometimes, if you see valid output in this command for a particular card when an issue of some kind is seen, the output from show controller cbus will show no microcode installed on the card (Sw 0.00). This typically indicates an IOS compatibility problem with the Interface Processor (IP).

The example given in Example 10-12 is an extreme one, but it is still feasible. All the highlighted fields indicate a problem with the cards installed. There is no valid data in the appropriate fields, either "UNKNOWN," all zeroes, or "maxed out" to the size of the space, such as "255.255" for hardware revision.

Example 10-12 Obtaining router health information from show diagbus.

    Router#sh diagbus
    Slot 0:
            UNKNOWN ^A port adapter
            Port adapter is analyzed
            Port adapter insertion time unknown
            Hardware revision 255.255 ^A      Board revision UNKNOWN ^A
            Serial number     4294967295    Part number    800-11534335-255
            Test history      0xFF          RMA number     255-255-255 ^A
            EEPROM format version 255
            EEPROM contents (hex):
              0x20: FF 77 FF FF FF FF FF FF FF FF FF FF FF FF FF FF
              0x30: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
    Slot 1:
            UNKNOWN ^A port adapter
            Port adapter is analyzed
            Port adapter insertion time unknown
            Hardware revision 255.255 ^A       Board revision UNKNOWN ^A
            Serial number     4294967295    Part number    800-11534335-255
            Test history      0xFF          RMA number     255-255-255 ^A
            EEPROM format version 255
            EEPROM contents (hex):
              0x20: FF 77 FF FF FF FF FF FF FF FF FF FF FF FF FF FF
              0x30: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ^A
    Slot 2:
            Ethernet port adapter, 4 ports
            Port adapter is analyzed
            Port adapter insertion time unknown
            Hardware revision 0.0 ^A         Board revision UNKNOWN ^A
            Serial number     0 ^A          Part number    00-0000-00 ^A
            Test history      0x0           RMA number     00-00-00
            EEPROM format version 0
            EEPROM contents (hex):
              0x20: 00 42 00 00 00 00 00 00 00 00 00 00 00 00 00 00
              0x30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    Slot 3:
            Ethernet port adapter, 4 ports
            Port adapter is analyzed
            Port adapter insertion time unknown
            Hardware revision 1.1           Board revision A0
            Serial number     5361301       Part number    800-02027-02
            Test history      0x0           RMA number     00-00-00
            EEPROM format version 1
            EEPROM contents (hex):
              0x20: 01 42 01 01 00 51 CE 95 50 07 EB 02 00 00 00 00
              0x30: 50 00 00 00 97 05 30 00 FF FF FF FF FF FF FF FF

SNMP Traps for Router Failure [4]

From MIB CISCO-GENERAL-TRAPS, several SNMP traps are relevant to router failure, as follows:

reload
coldStart
linkDown
linkUp

A reload trap signifies that the sending protocol entity is reinitializing itself so that the agent's configuration or the protocol entity implementation can be altered. This trap carries the values of the MIB objects sysUpTime and whyReload as its variable bindings (varbinds).

A coldStart trap signifies that the sending protocol entity is reinitializing itself so that the agent's configuration or the protocol entity implementation may be altered. This trap carries the values of the MIB objects sysUpTime and whyReload as its variable bindings (varbinds).

A linkDown trap signifies that the sending protocol entity recognizes a failure in one of the communication links represented in the agent's configuration. This trap carries the values of the MIB objects ifIndex, ifDescr, ifType, and locIfReason as its variable bindings (varbinds).

A linkUp trap signifies that the sending protocol entity recognizes that one of the communication links represented in the agent's configuration has come up. This trap carries the values of the MIB objects ifIndex, ifDescr, ifType, and locIfReason as its variable bindings (varbinds). By comparing the linkUp trap timestamp with that of the preceding linkDown trap, you can determine how long a particular interface was down. This is especially useful when calculating network availability.
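
As a rough illustration of the availability calculation, the Python sketch below pairs linkDown and linkUp timestamps for a single interface. The function name `interface_availability` and the input format (a list of timestamp and trap-name pairs, in seconds) are assumptions for this example. A real trap receiver would also need to group traps by ifIndex first.

```python
def interface_availability(events, period_start, period_end):
    """Return availability (0.0 to 1.0) for one interface over a period.

    events: (timestamp_in_seconds, trap_name) pairs, where trap_name is
    "linkDown" or "linkUp". Timestamps are assumed to fall in the period.
    """
    down_total = 0.0
    down_since = None
    for timestamp, trap in sorted(events):
        if trap == "linkDown" and down_since is None:
            down_since = timestamp          # outage begins
        elif trap == "linkUp" and down_since is not None:
            down_total += timestamp - down_since  # outage ends
            down_since = None
    if down_since is not None:
        # Still down at the end of the period: count the remainder.
        down_total += period_end - down_since
    return 1.0 - down_total / (period_end - period_start)
```

For example, a single 60-second outage in a one-hour window gives 1 - 60/3600, or roughly 98.3 percent availability. Lost traps skew the result, which is one reason to cross-check against syslog.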

Syslog Messages for Router Failure

The messages reported in Table 10-4 represent some of the more common syslog messages. This does not mean that all other messages are not important, but they are seen less frequently. You can use the basic methodology defined with these messages to do correlations with SNMP MIB objects or show commands. The same methodology can be applied to other messages seen in the syslog, which are not reported here. Most of the syslog messages reported here are seen on the high-end routers such as the 7x00 series routers.

The syslog messages have different severity levels, as indicated by the number in the message. The lower the number, the more severe the issue in the router. You should act on severities between 0 and 3 and be aware of messages with severities of 4 through 7.
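
The severity rule above can be applied mechanically, because the severity is the number between the facility and the mnemonic (for example, the 5 in %SYS-5-RELOAD). The following Python sketch parses that field and applies the rule. The function names `parse_syslog` and `triage` are hypothetical, and the pattern is simplified: it assumes a single-word facility and does not handle facilities that themselves contain hyphens.

```python
import re

# %FACILITY-SEVERITY-MNEMONIC, for example %SYS-5-RELOAD
MSG_RE = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+)")


def parse_syslog(message):
    """Return (facility, severity, mnemonic), or None if no match."""
    match = MSG_RE.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


def triage(message):
    """Apply the rule of thumb: act on 0-3, stay aware of 4-7."""
    parsed = parse_syslog(message)
    if parsed is None:
        return "unparsed"
    severity = parsed[1]
    return "act" if severity <= 3 else "monitor"
```

With this rule, a %RSP-2-QAERROR message is classified as "act" and a %SYS-5-RESTART message as "monitor".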

TIP

It is recommended to use timestamps in syslog to determine when an event occurred. Note that if the device and the syslog server have different clock sources, the times may be slightly different.

Table 10-4. Syslog Messages for Router Health Information
Message	Explanation
`%SYS-5-RELOAD: Reload requested`	A reload or restart was requested, typically issued from the CLI command reload. This message is generated prior to the router resetting.
`%SYS-5-RESTART: System restarted [chars]`	This message is seen after the router comes back online after booting up. Based on this message, you can poll the MIB object whyReload to get the reason for the reload, especially if it is an unscheduled reload.
`%CBUS-3-CMDTIMEOUT: Cmd timed out, CCB [hex], slot [chars], cmd code [chars]-Traceback=[hex]`	A command sent from the system to an interface processor failed to complete successfully. The system recovered by generating an error code to the requester. Copy the error message exactly as it appears on the console or in the system log, call your Cisco technical support representative, and provide the representative with the gathered information. Based on receipt of this syslog message, you can poll the MIB object cardOperStatus for the particular interface processor to find out whether it is still operational. Also, executing the CLI command show diag or show controller cbus can give you an understanding of what is going on with the card. This message is sometimes seen at the same time as other messages, such as RSP-3-oriented messages.
`%CBUS-3-INITERR: Interface [dec], Error ([hex]), idb [hex] [dec] [chars]` `cbus_init()`	The switch processor or ciscoBus controller signaled an error while processing a packet or selecting an interface. This indicates a software problem. Copy the error message exactly as it appears on the console or in the system log. Issue the show tech-support command to gather data that may provide information to determine the nature of the error. If you cannot determine the nature of the error from the error message text or from the show tech-support output, call your Cisco technical support representative and provide the representative with the gathered information. Looking specifically at the output from show controller cbus or show diag within the show tech-support output can assist you in isolating the problem.
`%CBUS-3-OUTHUNG: [chars]: tx[char] output hung ([hex] - [chars]), [chars]`	This message is commonly seen with the hex value 800E. You may see support personnel refer to these messages as "800E messages." 800E means that the transmit queue was full at the time the RP requested a transmit buffer on that particular interface. The 800E error occurs only if the full state is persistent (tql == 0 && output-hold-queue does not equal NULL for several consecutive attempts from IOS). A transmission attempt on an interface failed. The interface might not be attached to a cable or there might be a software problem. Check that the interfaces are all connected to the proper cables. Monitor the output of show controller cbus to help isolate the problem.
`%RSP-2-QAERROR: [chars] error, [chars] at addr [hex] ([chars]) log [hex], data [hex] [hex]`	A software error was detected during packet switching. This means that an interface freed the same queue element twice (reused) or attempted to free a zero queue element. An error was detected in the queueing hardware. Using the command show controller cbus or show diag should help you pinpoint the location of the problem.
`%DBUS-3 (All messages)`	All DBUS errors usually indicate a hardware problem with the processor card or with an interface processor card. The recommended course of action when seeing these errors is to replace the problematic card, usually based on the "slot" reported in the DBUS error message. Interface Processor cards are typically in bad shape if you see one of these messages.
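
The correlations in Table 10-4 can be captured in a small lookup so that a management station knows what to check when a message arrives. The Python sketch below is one way to encode them. The dictionary `FOLLOW_UPS` and the function `follow_ups` are hypothetical names, and the action strings simply paraphrase the table.

```python
# Message tag (without the leading %) mapped to the follow-up steps
# suggested in Table 10-4. "DBUS-3" covers all DBUS-3 messages.
FOLLOW_UPS = {
    "SYS-5-RESTART": ["poll whyReload"],
    "CBUS-3-CMDTIMEOUT": ["poll cardOperStatus", "show diag",
                          "show controller cbus"],
    "CBUS-3-INITERR": ["show tech-support", "show controller cbus",
                       "show diag"],
    "CBUS-3-OUTHUNG": ["check cabling", "show controller cbus"],
    "RSP-2-QAERROR": ["show controller cbus", "show diag"],
    "DBUS-3": ["replace card in reported slot"],
}


def follow_ups(message):
    """Return the suggested follow-up steps for a syslog line, if any."""
    for tag, actions in FOLLOW_UPS.items():
        if "%" + tag in message:
            return actions
    return []
```

Messages that are not in the table return an empty list, which leaves the decision to the operator rather than guessing at an action.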