Error/Fault Data for Switch Processors
Here, we'll look at switch information relating to fault management. We'll identify some key MIBs and show commands that relate to switch health.
MIB Variables for Switch and Module "Health" Status
From CISCO-STACK-MIB, the moduleStatus variable provides the operational status of the module. If the status is not ok, the value of moduleTestResult gives more detailed information about the module's failure condition(s). The possible values seen in this MIB object are as follows:
By polling this MIB object, you can keep watch on the modules installed in the switch rather than keeping track of every port on the switch. Tracking every port can be excessive, except for the trunk and other "critical" ports that you identify.
A related MIB object from CISCO-STACK-MIB is moduleTestResult, which provides the result of the module's self-test. A zero indicates that the module passed all tests. Bits set in the result indicate error conditions.
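The bit-oriented reading of moduleTestResult can be sketched in a few lines of Python. This is a minimal illustration, not part of any Cisco tool: the function names are hypothetical, numbering bits from the least significant bit is an assumption, and the meaning of each bit must still be looked up in the MIB definition for the module type.

```python
def passed_all_tests(module_test_result: int) -> bool:
    """A value of zero means the module passed every self-test."""
    return module_test_result == 0


def set_bit_positions(module_test_result: int) -> list[int]:
    """Return the positions of the bits that are set in the polled value.

    Assumption: bit 0 is the least significant bit. What each bit
    means is defined by the MIB and is not reproduced here.
    """
    if module_test_result < 0:
        raise ValueError("moduleTestResult is expected to be non-negative")
    return [
        bit
        for bit in range(module_test_result.bit_length())
        if (module_test_result >> bit) & 1
    ]
```

For example, a polled value of 10 (binary 1010) has bits 1 and 3 set, so two error conditions would need to be looked up.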
CLI Commands for Analyzing Switch and Module Health
The show module and show test commands are related to the moduleStatus MIB object. For details on the output from the show module command, see Chapter 10.
Using the show test Command
The show test command shows you the status of the self-tests run against the individual modules. The status of the test results assists you in pinpointing the possible cause for minorFault or majorFault, as indicated by the values in the moduleStatus MIB.
Example 11-13 shows sample output for show test.
Example 11-13 Using show test to determine the health of a module.
Switch> sh test 2

Module 2 : 48-port 4 Segment 10BaseT Ethernet Repeater
Port Status:
  Ports 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  -----------------------------------------------------------------------------
        .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
        25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
  ------------------------------------------------------------------------
        .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .

LCP Diag Status for Module 2 (. = Pass, F = Fail, N = N/A)
  CPU      : .  Sprom    : .  Bootcsum : .  Archsum  : N
  RAM      : .  LTL      : .  CBL      : N  DPRAM    : N  SAMBA : N
  Saints   : .  Pkt Bufs : .  Repeater : .  FLASH    : N

SAINT/SAGE Status :
  Saint 1 2 3 4
  -----------------
        . . . .

Packet Buffer Status :
  Saint 1 2 3 4
  -----------------
        . . . .

Loopback Status [Reported by Module 1] :
  Saint 1 2 3 4
  -----------------
        . . . .
The type of card you have installed in each slot determines what kind of output you see in the show test [mod_num] output. If the card is working properly, you should see all "." next to the individual tests. If something failed on the card, you'll see an "F."
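If you capture show test output with a script, scanning it for "F" markers can be automated. The following Python sketch is one possible approach under stated assumptions: the function name is hypothetical, it only looks at the "name : status" pairs of the LCP Diag section, and the exact layout of the output varies by card type, so the pattern may need adjusting.

```python
import re

# Matches pairs such as "CPU : ." or "Pkt Bufs : F". The status must be
# one of ".", "F", or "N" and must be followed by whitespace or end of line.
_DIAG_PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?)[ \t]*:[ \t]*([.FN])(?!\S)")


def failed_diag_tests(show_test_output: str) -> list[str]:
    """Return the names of the diagnostic tests reported as F (Fail)."""
    failed = []
    for line in show_test_output.splitlines():
        for name, status in _DIAG_PAIR.findall(line):
            if status == "F":
                failed.append(name)
    return failed
```

Run against the output in Example 11-13, this returns an empty list, because every test shows either "." or "N".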
Using the show log Command
The show log command shows you the error log of the system, such as reboot histories, module reset counts, exception errors with corresponding hex dumps, and self-test results for the supervisor modules.
This command is very useful for examining the overall health and stability of your switch. If an exception caused the Supervisor card to reset, the details are stored here.
The show log output on a switch is stored in NVRAM, so it is not cleared after a reset of the switch. You have to manually clear the log to take all values back to 0. It is good practice to clear the log every time you upgrade the software on the switch, due to possible exception counters stored under the network management processor. There is no need to store an exception count for a software release other than the current running release.
Output from show log is also good for comparing the last reset time and date of the supervisor with that of the other modules in the switch. Drawing that correlation can help you determine when module cards were inserted or removed through OIR (online insertion and removal) or reset by other methods without the entire switch resetting.
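The comparison of supervisor and module reset times can be sketched as follows. This is an illustration only: the function name is hypothetical, and the five-minute tolerance is an assumption chosen to absorb the delay between the supervisor rebooting and a module coming back online.

```python
from datetime import datetime, timedelta


def independent_module_resets(
    supervisor_reboots: list[datetime],
    module_resets: list[datetime],
    tolerance: timedelta = timedelta(minutes=5),  # assumed window
) -> list[datetime]:
    """Return module resets that do not line up with any supervisor reboot.

    Such resets suggest the module was inserted, removed, or reset on
    its own rather than as part of a whole-switch reset.
    """
    return [
        reset
        for reset in module_resets
        if all(abs(reset - reboot) > tolerance for reboot in supervisor_reboots)
    ]
```

Using the timestamps from Example 11-14, the module 3 reset on Feb 15 1998 at 04:34:12 has no supervisor reboot near it, so it is the one worth investigating; the other three module resets each fall within about a minute of a supervisor reboot.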
Example 11-14 shows sample output from show log.
Example 11-14 Using show log to determine the health of the switch or module.
Switch> show log

Network Management Processor (ACTIVE NMP) Log:
  Reset count:   3                                                   (A)
  Re-boot History:   Feb 18 1998 17:14:18 0, Feb 05 1998 15:16:28 0  (B)
                     Feb 05 1998 14:20:33 0
  Bootrom Checksum Failures:  0                                      (C)
  UART Failures:              0
  Flash Checksum Failures:    0
  Flash Program Failures:     0
  Power Supply 1 Failures:    0
  Power Supply 2 Failures:    0
  Swapped to CLKA:            0
  Swapped to CLKB:            0
  Swapped to Processor 1:     0
  Swapped to Processor 2:     0
  DRAM Failures:              0
  Exceptions:                 9

  Last Exception occurred on Feb 18 1998 17:14:18                    (B)
  Software version = 2.4(2)                                          (D)

NVRAM log:

Network Management Processor (STANDBY NMP) Log:
  Reset count:   3                                                   (A)
  Re-boot History:   Feb 18 1998 17:14:18 0, Feb 05 1998 15:16:28 0  (B)
                     Feb 05 1998 14:20:33 0
  Bootrom Checksum Failures:  0                                      (C)
  UART Failures:              0
  Flash Checksum Failures:    0
  Flash Program Failures:     0
  Power Supply 1 Failures:    0
  Power Supply 2 Failures:    0
  Swapped to CLKA:            0
  Swapped to CLKB:            0
  Swapped to Processor 1:     0
  Swapped to Processor 2:     0
  DRAM Failures:              0
  Exceptions:                 0

NVRAM log:

Module 3 Log:
  Reset Count:   4                                                   (A)
  Reset History: Wed Feb 18 1998, 17:14:18                           (B)
                 Sun Feb 15 1998, 04:34:12
                 Thu Feb 5 1998, 15:17:38
                 Thu Feb 5 1998, 14:21:43
The following items are highlighted in Example 11-14:
SNMP Traps Relating to Switch Health
The moduleUp and moduleDown traps (CISCO-STACK-MIB.traps) indicate that a module in the switch chassis has either just come online or just gone offline. With these traps, you can track when cards are inserted into the chassis by OIR, and when cards are removed or are having problems.
The coldStart and whyreload traps (CISCO-GENERAL-TRAPS) indicate that the switch was powered on and restarted. As with a router, these traps are sent when the switch is coming up or when the switch unexpectedly restarts.
Syslog Messages Relating to Switch Health
A number of syslog messages are useful for analyzing switch health, and apply directly to the MIB objects and CLI commands previously discussed. They are collected in Table 11-5.
CLI Commands for Analyzing Switch System Resources
The key system resources needing evaluation on switches, such as resource errors and low clusters, cannot be gathered from SNMP MIB objects, so CLI commands are used instead.
Here are several show commands relating to the evaluation of system resources on a switch. This section will cover the following:
Using the show inband and show biga Commands
The show inband command applies to the Supervisor III engines and the show biga command applies to Supervisor I and II engines.
The show inband or show biga command shows statistics from the SAGE ASIC chip that front-ends the processor for data traffic. The chip resides on the processor card. The output you need to concern yourself with here is the field RsrcErrors. These commands can be executed only from the enable mode.
Resource errors are important to look at over time when you are experiencing performance problems. If this counter is increasing rapidly over a short amount of time, you are "starving" the resources on the switch processor, and it cannot process frames such as BPDUs, VTP, ISL, and CDP. Incrementing resource errors typically means that the switch cannot allocate memory or buffers (mbufs) for frames received on the processor. When the switch cannot process these frames, especially BPDUs, the switched network can become unstable. For example, if the processor does not see BPDUs, ports in blocking mode can go to forwarding mode, which creates a bridge loop that snowballs and can disable the network.
Example 11-15 shows sample output for show biga and show inband.
Example 11-15 Using show biga and show inband to evaluate available system resources on a switch.
Switch (enable) sh biga

BIGA Registers:
  cstat: 00        upad : FFFF      pctrl : 0000      nist : 0000
  sist : 0098      hica : 0000      hicb : 0000       hicc : 00
  dctrl: F5FF      dstat: 0000      dctrl2: 80        npim : 00F8
  thead: 102FC804  ttail: 102FC804  ttmph : 102FC804
  tptr : 10497E62  tdsc : 00000500  tlen : 0000       tqsel : 05
  rhead: 102FA5D0  rtail: 102FA5B4  rtmph : 102FA5EC
  rptr : 104E5280  rdsc : 80000000  rplen: 102FA5E4
  rtlen : 00000000 rlen : 1572
  fltr : 00FF      fc : 00          Rev : 04          CFG : 02020202

BIGA Driver:
  Initializd: TRUE      SpurusIntr: 00000000  NPIMShadow: 00F8

BIGA Receive:
  RxDone : FALSE
  First RBD : 102FA534  Last RBD : 102FC118
  SoftRHead : 102FA5C0  SoftRTail : 102FA5A4
  FramesRcvd: 00202501  BytesRcvd : 21197580
  QueuedRBDs: 00000256  RsrcErrors: 00006520   (A)

BIGA Transmit:
  First TBD : 102FC134  Last TBD : 102FC818
  SoftTHead : 102FC134  SoftTTail : 102FC134
  Free TBDs : 00000064  No TBDs : 00000000
  AcknowErrs: 00000000  HardErrors: 00000000
  QueuedPkts: 00000000  XmittedPkt: 01604290
  XmittedByt: 136665648 Panic : 00000000
  Frag<=4Byt: 00000000

Switch(enable) sh inband

Inband Driver:
  DriverPtr: A067D300   Initializd: TRUE      SpurusIntr: 00000000
  RxDone: FALSE         TxDMAWorking: FALSE   RxRecovPtr: 00000000(-1)
  FPGACntl: 004F        Characteristics:0000  LastISRCause: 04

Transmit:
  First TBD : A0681B84(0 )  Last TBD : A0682B64(0 )
  TxHead : A0681D44(14 )    TxTail : A0681D44(14 )
  AvailTBDs : 00000128      QueuedPkts: 00000000
  XmittedPkt: 00247610      XmittedByt: 22625836
  PanicEnd : 00000000       PanicNullP: 00000000
  BufLenErrs: 00000000      Len0Errs : 00000000
  Frag<=4Byt: 00000665      SpursTxInt: 00000000
  No TBDs : 00000000        NullMbuf : 00000000

Receive:
  First RBD : A067D384(0 )  Last RBD : A0681B60(511)
  RxHead : A067E320(111)    RxTail : A067E2FC(110)
  AvailRBD : 00000512       RsrcErrors: 00000000   (A)
  PanicNullP: 00000000      PanicFakeI: 00000000
  FramesRcvd: 03173999      BytesRcvd : 246115897
  RuntsRcvd : 00000000      HugeRcvd : 00000000

GT64010 IntMask: F00F0000  IntCause: 0330E083
GT64010 TX DMA (CH 1):
  Count: 0000  Src : 013D5C62  Dst : 4ff10056  NRP : 000000  Cntl : 15C0
GT64010 RX DMA (CH 2):
  Count: 0680  Src : 4FF20000  Dst : 01c84d80  NRP : 0067BC  Cntl : 55C0
PSI (PCI SAGE/PHOENIX Interface) FPGA:
  Control : 004F  TxCount : 0056  RxDMACmd: 35C0  RxBufSiz: 0680
  MaxPkt : 0680   IntCause: 0000  IntMask : 0003
Monitoring RsrcErrors (A) over time is important, especially across a short time frame when switch performance problems are occurring. If the counter is incrementing slowly over a long period of time, it is not as crucial.
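Because what matters is how fast RsrcErrors grows, it helps to turn two readings of the counter into a rate. The Python sketch below assumes you have already extracted the counter as an integer along with the time of each reading in seconds; the function name is hypothetical, and no alarm threshold is implied because the text does not give one.

```python
def rsrc_errors_per_minute(
    prev_count: int,
    prev_time_s: float,
    curr_count: int,
    curr_time_s: float,
) -> float:
    """Return the RsrcErrors growth rate between two readings, per minute."""
    elapsed = curr_time_s - prev_time_s
    if elapsed <= 0:
        raise ValueError("the second reading must be taken after the first")
    if curr_count < prev_count:
        # The counter went backwards, for example after a switch reset.
        raise ValueError("counter decreased between readings")
    return (curr_count - prev_count) / elapsed * 60
```

For instance, readings of 6520 and then 6580 taken two minutes apart work out to 30 resource errors per minute. The same increase spread across a week would be far less of a concern.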
Using the show mbuf Command
The fixed buffers on switches are permanently set and come in two flavors: mbufs and clusters. Each mbuf is 128 bytes (116 data bytes), whereas clusters are used for larger packets and correspond to 1664 bytes (13 mbufs of 128 bytes each, carrying 13 x 116 = 1508 data bytes). The only traffic that affects the mbuf and cluster counters is traffic destined to the supervisor engine, such as BPDUs, VTP, or CDP. The show mbuf all output displays the current number of free mbufs and free clusters, as well as the lowest number of free mbufs and free clusters recorded.
The critical values that need to be looked at with switch buffers are the "free" and "lowest free" mbufs and clusters because they can help identify possible memory leaks or lack of proper memory resources. Free mbufs, lowest free mbufs, clfree, and lowest clfree should be flagged if they go below 100, which is used as an initial baseline threshold.
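The threshold check described above is simple to automate once the four values have been read from show mbuf. The following Python sketch assumes the values are already collected into a dictionary; the function name and the dictionary keys are hypothetical, and 100 is only the initial baseline suggested in the text, to be tuned for your own network.

```python
BASELINE_THRESHOLD = 100  # initial baseline threshold from the text

WATCHED_FIELDS = ("free mbufs", "lowest free mbufs", "clfree", "lowest clfree")


def low_buffer_fields(
    stats: dict[str, int], threshold: int = BASELINE_THRESHOLD
) -> list[str]:
    """Return the watched buffer fields whose value is below the threshold."""
    return [name for name in WATCHED_FIELDS if stats[name] < threshold]
```

With the values from Example 11-16 (9946, 9935, 3675, and 3665), nothing is flagged. A switch reporting a lowest clfree of 42 would be flagged for that field.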
Example 11-16 shows sample output from show mbuf.
Example 11-16 Using show mbuf to determine system resources available on a switch.
Switch(enable) sh mbuf

MBSTATS:
  mbufs              10224       clusters       3932
  free mbufs         9946  (A)   clfree         3675  (B)
  lowest free mbufs  9935  (C)   lowest clfree  3665  (D)

MALLOC STATS :
  Block Size   Free Blocks
          16   1
          48   2
         112   1
         144   1
         208   1
         240   1
         400   1
       > 496   4

  Largest block available : 7510096
  Total Memory available  : 7546400   (E)
  Total Memory used       : 563952
The highlighted information from Example 11-16 is as follows: