Error/Fault Data for Switch Processors

Here, we'll look at switch information relating to fault management. We'll identify some key MIBs and show commands that relate to switch health.

MIB Variables for Switch and Module "Health" Status

From CISCO-STACK-MIB, the moduleStatus variable provides the operational status of the module. If the status is not ok, the value of moduleTestResult gives more detailed information about the module's failure condition(s). The possible values of the moduleStatus object are as follows:

other(1) none of the following
ok(2) status ok
minorFault(3) minor problem
majorFault(4) major problem

By polling this MIB object, you can keep watch on the modules installed in the switch rather than tracking every port on the switch. Tracking every port can be excessive, except for the trunk and other "critical" ports that you identify.

A related MIB object from CISCO-STACK-MIB is moduleTestResult, which provides the result of the module's self-test. A zero indicates that the module passed all tests. Bits set in the result indicate error conditions.
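The two objects can be combined into a single health summary per module. The following is a minimal sketch, assuming the integer values have already been fetched by an SNMP poller (not shown); the function name `summarize_module` is hypothetical, and the meaning of individual self-test bits is not interpreted because it is not given here.

```python
# Enumerated values of moduleStatus, as listed in the text above.
MODULE_STATUS = {1: "other", 2: "ok", 3: "minorFault", 4: "majorFault"}


def summarize_module(status: int, test_result: int) -> str:
    """Turn one module's polled values into a short health summary.

    Both arguments are assumed to be integers already retrieved
    from moduleStatus and moduleTestResult by an SNMP poller.
    """
    name = MODULE_STATUS.get(status, f"unknown({status})")
    if name == "ok" and test_result == 0:
        return "ok"
    # Report which bit positions are set; what each bit means is
    # platform-specific and deliberately not decoded here.
    set_bits = [bit for bit in range(test_result.bit_length())
                if (test_result >> bit) & 1]
    return f"{name}, self-test bits set: {set_bits}"


print(summarize_module(2, 0))      # a healthy module
print(summarize_module(4, 0b101))  # a major fault with two bits set
```

A summary other than "ok" is the cue to look at the module with the CLI commands that follow.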

CLI Commands for Analyzing Switch and Module Health

The show module and show test commands are related to the moduleStatus MIB object. For details on the output from the show module command, see Chapter 10.

Using the show test Command

The show test command shows you the status of the self-tests run against the individual modules. The status of the test results assists you in pinpointing the possible cause for minorFault or majorFault, as indicated by the values in the moduleStatus MIB.

Example 11-13 shows sample output for show test.

Example 11-13 Using show test to determine the health of a module.

 Switch> sh test 2
 Module 2 : 48-port 4 Segment 10BaseT Ethernet Repeater
 Port Status:
   Ports 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
   -----------------------------------------------------------------------------
         .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
         25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
         ------------------------------------------------------------------------
         .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
 LCP Diag Status for Module 2  (. = Pass, F = Fail, N = N/A)
  CPU         : .    Sprom    : .    Bootcsum : .    Archsum  : N
  RAM         : .    LTL      : .    CBL      : N    DPRAM    : N   SAMBA : N
  Saints      : .    Pkt Bufs : .    Repeater : .    FLASH    : N
  SAINT/SAGE Status :
   Saint 1  2  3  4
   -----------------
         .  .  .  .
  Packet Buffer Status :
   Saint 1  2  3  4
   -----------------
         .  .  .  .
  Loopback Status [Reported by Module 1] :
   Saint  1  2  3  4
   -----------------
          .  .  .  .

The type of card installed in each slot determines what you see in the show test [mod_num] output. If the card is working properly, every test shows a "." (pass). If a test failed on the card, you'll see an "F" next to it.
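Scanning for "F" flags by eye becomes tedious across many modules. As a rough sketch, assuming you have captured the show test text into a string, a regular expression can pull out the named diagnostic flags; the function name `failed_tests` and the sample excerpt (with Sprom changed to "F") are hypothetical.

```python
import re

# A test name, a colon, then one status flag (. = Pass, F = Fail, N = N/A).
FLAG = re.compile(r"([A-Za-z][\w /]*?)[ \t]*:[ \t]*([.FN])(?!\S)")


def failed_tests(show_test_output: str) -> list[str]:
    """Return the names of diagnostic tests flagged 'F'."""
    return [name for name, flag in FLAG.findall(show_test_output)
            if flag == "F"]


# Hypothetical excerpt: the Sprom result is altered to 'F' for illustration.
sample = """\
LCP Diag Status for Module 2  (. = Pass, F = Fail, N = N/A)
 CPU         : .    Sprom    : F    Bootcsum : .    Archsum  : N
 Saints      : .    Pkt Bufs : .    Repeater : .    FLASH    : N
"""
print(failed_tests(sample))  # ['Sprom']
```

This sketch reads only the "name : flag" pairs; the per-port and per-Saint rows, which carry no test names, would need separate handling.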

Using the show log Command

The show log command shows you the error log of the system, such as reboot histories, module reset counts, exception errors with corresponding hex dumps, and self-test results for the supervisor modules.

This command is very useful for examining the overall health and stability of your switch. If an exception caused the Supervisor card to reset, the details are stored here.

The show log output on a switch is stored in NVRAM, so it is not cleared after a reset of the switch. You have to manually clear the log to take all values back to 0. It is good practice to clear the log every time you upgrade the software on the switch, because exception counts from the earlier release may still be stored under the network management processor. There is no need to keep an exception count for a software release other than the one currently running.

Output from show log is also good for comparing the last reset time and date of the supervisor with that of the other modules in the switch. Drawing that correlation can help you determine when module cards were inserted or removed online (OIR, online insertion and removal) or reset by other methods without the entire switch resetting.

Example 11-14 shows sample output from show log.

Example 11-14 Using show log to determine the health of the switch or module.

 Switch> show log
 Network Management Processor (ACTIVE NMP) Log:
   Reset count:   3 ^A
   Re-boot History:   Feb 18 1998 17:14:18 0 ^B, Feb 05 1998 15:16:28 0
                      Feb 05 1998 14:20:33 0
   Bootrom Checksum Failures:      0 ^C UART Failures:                   0
   Flash Checksum Failures:        0   Flash Program Failures:          0
   Power Supply 1 Failures:        0   Power Supply 2 Failures:         0
   Swapped to CLKA:                0   Swapped to CLKB:                 0
   Swapped to Processor 1:         0   Swapped to Processor 2:          0
   DRAM Failures:                  0
   Exceptions:     9
     Last Exception occurred on Feb 18 1998 17:14:18 ^B
     Software version = 2.4(2) ^D
 NVRAM log:
 Network Management Processor (STANDBY NMP) Log:
   Reset count:   3 ^A
   Re-boot History:   Feb 18 1998 17:14:18 0 ^B, Feb 05 1998 15:16:28 0
                      Feb 05 1998 14:20:33 0
   Bootrom Checksum Failures:      0 ^C UART Failures:                   0
   Flash Checksum Failures:        0   Flash Program Failures:          0
   Power Supply 1 Failures:        0   Power Supply 2 Failures:         0
   Swapped to CLKA:                0   Swapped to CLKB:                 0
   Swapped to Processor 1:         0   Swapped to Processor 2:          0
   DRAM Failures:                  0
   Exceptions:                     0
 NVRAM log:
 Module 3  Log:
   Reset Count:   4 ^A
   Reset History: Wed Feb 18 1998, 17:14:18 ^B
                  Sun Feb 15 1998, 04:34:12
                  Thu Feb 5 1998, 15:17:38
                  Thu Feb 5 1998, 14:21:43

The following items are highlighted in Example 11-14:

A "Reset count" is the number of times that particular card has reset. Notice the difference between the reset count on the two Network Management Processors (slots 1 and 2) and the slot 3 module. Slot 3 must have been reset one extra time, either manually or by the reset command.

B The "Re-boot History" line indicates the time and date of all the resets the card has experienced, up to 10. You can compare this to the line "Last Exception occurred…" below it.

C The failure counters for the Supervisor cards are highlighted here. These are cumulative counts of failures that occurred on the Network Management Processor or Supervisor card. Typically, you'll see the power supply failure counts increase more than the others because they increment every time the switch resets.

D The "Software version" line, as indicated here, is the software version the exception occurred in. If this is not the current software running, you should clear the log to get an accurate count of the appropriate errors that may occur with the current release of software.
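The comparison of supervisor and module reset times can be automated once the timestamps are extracted. Below is a minimal sketch using the timestamps from Example 11-14; the function name `independent_resets` and the two-minute tolerance are assumptions chosen for illustration (in the example, module 3 comes up about 70 seconds after the supervisor).

```python
from datetime import datetime, timedelta


def independent_resets(module_resets, supervisor_resets,
                       tolerance=timedelta(minutes=2)):
    """Return module resets with no supervisor reset within `tolerance`."""
    return [m for m in module_resets
            if all(abs(m - s) > tolerance for s in supervisor_resets)]


# Timestamps transcribed from Example 11-14.
supervisor = [datetime(1998, 2, 18, 17, 14, 18),
              datetime(1998, 2, 5, 15, 16, 28),
              datetime(1998, 2, 5, 14, 20, 33)]
module_3 = [datetime(1998, 2, 18, 17, 14, 18),
            datetime(1998, 2, 15, 4, 34, 12),
            datetime(1998, 2, 5, 15, 17, 38),
            datetime(1998, 2, 5, 14, 21, 43)]

for when in independent_resets(module_3, supervisor):
    print("Module 3 reset on its own at", when)
```

With these inputs, only the Feb 15 reset is reported, which matches the one extra reset noted for slot 3.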

SNMP Traps Relating to Switch Health

The moduleUp and moduleDown traps (CISCO-STACK-MIB.traps) indicate that a module in the switch chassis has either just come online or just gone offline. With these traps, you can track when cards are inserted into the chassis by OIR, and when cards are removed or are having problems.

The coldStart and whyreload traps (CISCO-GENERAL-TRAPS) indicate that the switch was powered on or restarted. Much as on a router, these traps are sent when the switch is coming up or when it restarts unexpectedly.

Syslog Messages Relating to Switch Health

A number of syslog messages are useful for analyzing switch health, and apply directly to the MIB objects and CLI commands previously discussed. They are collected in Table 11-5.

Table 11-5. Syslog Messages for Switch Health Information
Message	Explanation
`SYS-5-SYS_RESET: System reset from [chars]`	The switch has been reset, either by a failure or by manual intervention, such as from a change management window.
`SYS-3-MOD_MINORFAIL: Minor problem in module [dec]` `SYS-3-MOD_FAILREASON: Module [dec] failed due to CBL0 error` `SYS-3-MOD_FAIL: Module [dec] failed to come online`	These three syslog messages indicate that some type of failure on a particular line card or Supervisor card has occurred. These can be correlated to the moduleDown trap received or to the moduleStatus MIB object. Based on this result, you should actively poll for the moduleStatus for the given module number as indicated by the [dec] placement in the message.
`SYS-5-MOD_INSERT: Module [dec] has been inserted` `SYS-5-MOD_REMOVE: Module [dec] has been removed` `SYS-5-MOD_RESET: Module [dec] reset from [chars]`	These three syslog messages explain when a module is inserted, removed, or reset either by a failure as illustrated above, or by manual intervention.
`SNMP-5-MODULETRAP: Module [dec] [[chars]] Trap` `SNMP-5-COLDSTART: Cold Start Trap` `SNMP-5-WARMSTART: Warm Start Trap`	These three SNMP syslog messages indicate that an SNMP trap of the corresponding type was sent: the moduleUp/moduleDown trap, the coldStart trap, or the warmStart trap. The warmStart trap indicates that the switch has supervisor redundancy and the backup Supervisor card is now active. You can correlate these syslog messages with the trapd daemon running on your management station to see whether the appropriate trap was received.
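The module-failure messages in Table 11-5 can trigger a targeted moduleStatus poll. The sketch below is illustrative only: it assumes that [dec] is rendered as a decimal module number in the delivered message, and the function name `module_to_poll` is hypothetical.

```python
import re

# FACILITY-SEVERITY-MNEMONIC: text, following the layout in Table 11-5.
SYSLOG = re.compile(
    r"(?P<facility>[A-Z]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): (?P<text>.*)")
MODULE = re.compile(r"[Mm]odule (\d+)")

# Mnemonics of the three module-failure messages from the table.
FAILURES = {"MOD_MINORFAIL", "MOD_FAILREASON", "MOD_FAIL"}


def module_to_poll(message: str):
    """Return the module number to poll for moduleStatus, or None."""
    parsed = SYSLOG.search(message)
    if not parsed or parsed["mnemonic"] not in FAILURES:
        return None
    found = MODULE.search(parsed["text"])
    return int(found.group(1)) if found else None


print(module_to_poll("SYS-3-MOD_FAIL: Module 4 failed to come online"))  # 4
print(module_to_poll("SYS-5-MOD_INSERT: Module 3 has been inserted"))    # None
```

Because the pattern uses search rather than a full-line match, any timestamp or host prefix added by the syslog server should not interfere.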

CLI Commands for Analyzing Switch System Resources

The key system resources needing evaluation on switches, such as resource errors and low clusters, cannot be gathered from SNMP MIB objects, so CLI commands are used instead.

Here are several show commands relating to the evaluation of system resources on a switch. This section will cover the following:

show inband
show biga
show mbuf

Using the show inband and show biga Commands

The show inband command applies to the Supervisor III engines and the show biga command applies to Supervisor I and II engines.

The show inband or show biga command shows statistics from the SAGE ASIC chip that front-ends the processor for data traffic. The chip resides on the processor card. The output you need to concern yourself with here is the field RsrcErrors. These commands can be executed only from the enable mode.

Resource errors are important to look at over time when you are experiencing performance problems. If this counter is increasing rapidly over a short amount of time, you are "starving" the resources on the switch processor, so it cannot process frames such as BPDUs, VTP, ISL, and CDP. Incrementing resource errors typically mean that the switch cannot allocate memory or buffers (mbufs) for frames received on the processor. When the switch cannot process these frames, especially BPDUs, the switched network can become unstable. For example, if the processor does not see BPDUs, ports in blocking mode can go to forwarding mode and create a bridge loop, whose snowballing effect can disable the network.

Example 11-15 shows sample output for show biga and show inband.

Example 11-15 Using show biga and show inband to evaluate available system resources on a switch.

 Switch (enable) sh biga
 BIGA Registers:
     cstat:       00  upad :     FFFF   pctrl :     0000   nist :     0000
     sist :     0098  hica :     0000   hicb  :     0000   hicc :       00
     dctrl:     F5FF  dstat:     0000   dctrl2:       80   npim :     00F8
     thead: 102FC804  ttail: 102FC804   ttmph : 102FC804   tptr : 10497E62
     tdsc : 00000500  tlen :     0000   tqsel :       05
     rhead: 102FA5D0  rtail: 102FA5B4   rtmph : 102FA5EC   rptr : 104E5280
     rdsc : 80000000  rplen: 102FA5E4   rtlen : 00000000   rlen :     1572
     fltr :     00FF  fc   :       00   Rev  :        04   CFG  : 02020202
 BIGA Driver:
     Initializd:     TRUE  SpurusIntr: 00000000  NPIMShadow:     00F8
 BIGA Receive:
     RxDone    :    FALSE
     First RBD : 102FA534  Last  RBD : 102FC118
     SoftRHead : 102FA5C0  SoftRTail : 102FA5A4
     FramesRcvd: 00202501  BytesRcvd : 21197580
     QueuedRBDs: 00000256  RsrcErrors: 00006520 ^A
 BIGA Transmit:
     First TBD : 102FC134  Last  TBD : 102FC818
     SoftTHead : 102FC134  SoftTTail : 102FC134
     Free TBDs : 00000064  No TBDs   : 00000000
     AcknowErrs: 00000000  HardErrors: 00000000
     QueuedPkts: 00000000  XmittedPkt: 01604290
     XmittedByt: 136665648  Panic     : 00000000
     Frag<=4Byt: 00000000
 Switch(enable) sh inband
 Inband Driver:
     DriverPtr:  A067D300    Initializd:     TRUE  SpurusIntr: 00000000
     RxDone:        FALSE  TxDMAWorking:  FALSE  RxRecovPtr: 00000000(-1)
     FPGACntl:      004F  Characteristics:0000  LastISRCause:     04
     Transmit:
      First TBD : A0681B84(0  )  Last  TBD : A0682B64(0  )
      TxHead    : A0681D44(14 )  TxTail    : A0681D44(14 )
      AvailTBDs : 00000128       QueuedPkts: 00000000
      XmittedPkt: 00247610       XmittedByt: 22625836
      PanicEnd  : 00000000       PanicNullP: 00000000
      BufLenErrs: 00000000       Len0Errs  : 00000000
      Frag<=4Byt: 00000665       SpursTxInt: 00000000
      No TBDs   : 00000000       NullMbuf  : 00000000
     Receive:
      First RBD : A067D384(0  )  Last  RBD : A0681B60(511)
      RxHead    : A067E320(111)  RxTail    : A067E2FC(110)
      AvailRBD  : 00000512       RsrcErrors: 00000000 ^A
      PanicNullP: 00000000       PanicFakeI: 00000000
      FramesRcvd: 03173999       BytesRcvd : 246115897
      RuntsRcvd : 00000000       HugeRcvd  : 00000000
 GT64010 IntMask: F00F0000  IntCause: 0330E083
 GT64010 TX DMA (CH 1):
     Count: 0000  Src  : 013D5C62   Dst   : 4ff10056   NRP  : 000000
     Cntl :       15C0
 GT64010 RX DMA (CH 2):
     Count: 0680  Src  : 4FF20000   Dst   : 01c84d80   NRP  : 0067BC
     Cntl :       55C0
 PSI (PCI SAGE/PHOENIX Interface) FPGA:
     Control : 004F  TxCount : 0056
     RxDMACmd: 35C0  RxBufSiz: 0680  MaxPkt  : 0680
     IntCause: 0000  IntMask : 0003

Monitoring RsrcErrors (A) over time is important, especially over a short time frame when switch performance problems are occurring. If this counter is incrementing slowly over a long period of time, it is not as crucial.
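Because the rate of increase matters more than the absolute value, it helps to compare two readings taken a known interval apart. The following is a small sketch, assuming each sample is a (time in seconds, RsrcErrors value) pair collected from the CLI by some means not shown; the function name and sample values are hypothetical.

```python
def rsrc_error_rate(earlier, later):
    """Errors per second between two (time_seconds, RsrcErrors) samples."""
    (t0, count0), (t1, count1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in chronological order")
    if count1 < count0:
        # The counter went backwards, e.g. after a reset; no valid rate.
        return None
    return (count1 - count0) / (t1 - t0)


# Hypothetical samples taken 60 seconds apart.
rate = rsrc_error_rate((0, 6520), (60, 6820))
print(rate)  # 5.0
```

What counts as a "rapid" rate depends on the switch and its traffic, so any alert threshold would have to come from your own baseline.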

Using the show mbuf Command

The fixed buffers on switches are permanently set and come in two flavors: mbufs and clusters. Each mbuf is 128 bytes (116 data bytes), whereas clusters are used for larger packets and span 1664 bytes (13 mbufs, or 1508 data bytes). The only traffic that affects the mbuf and cluster counters is traffic destined to the supervisor engine, such as BPDUs, VTP, or CDP. The show mbuf all output displays the current number of free mbufs and free clusters, as well as the lowest number of mbufs and clusters that have been free.

The critical values that need to be looked at with switch buffers are the "free" and "lowest free" mbufs and clusters because they can help identify possible memory leaks or lack of proper memory resources. Free mbufs, lowest free mbufs, clfree, and lowest clfree should be flagged if they go below 100, which is used as an initial baseline threshold.
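The baseline check described above is simple to script. Here is a minimal sketch, assuming the four counters have already been read from the show mbuf output into a dictionary; the function name `low_buffers` and the sample values are hypothetical.

```python
BASELINE_THRESHOLD = 100  # initial baseline from the text; tune per switch


def low_buffers(counters, threshold=BASELINE_THRESHOLD):
    """Return the counters that have dropped below the threshold."""
    return {name: value for name, value in counters.items()
            if value < threshold}


# Hypothetical readings: cluster counters are low, mbuf counters are healthy.
reading = {"free mbufs": 9946, "lowest free mbufs": 9935,
           "clfree": 85, "lowest clfree": 60}
print(low_buffers(reading))  # {'clfree': 85, 'lowest clfree': 60}
```

An empty result means no counter is below the baseline; anything else deserves a closer look for memory leaks or insufficient memory resources.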

Example 11-16 shows sample output from show mbuf.

Example 11-16 Using show mbuf to determine system resources available on a switch.

 Switch(enable) sh mbuf
 MBSTATS:
         mbufs                   10224    clusters        3932
         free mbufs              9946 ^A  clfree          3675 ^B
         lowest free mbufs       9935 ^C  lowest clfree   3665 ^D
 MALLOC STATS :
 Block Size       Free Blocks
   16             1
   48             2
   112            1
   144            1
   208            1
   240            1
   400            1
 > 496            4
 Largest block available : 7510096
 Total Memory available  : 7546400 ^E
 Total Memory used       :  563952

The highlighted information from Example 11-16 is as follows:

A "free mbufs" is the number of mbufs currently free for the processor. The amount of DRAM installed in the switch determines the number of mbufs allocated at boot time, as indicated by the mbufs row.

B "clfree" is the number of clusters currently free for the processor. The amount of DRAM installed in the switch determines the number of clusters allocated at boot time, as indicated by the clusters row.

C "lowest free mbufs" is the field you need to trend and watch for memory resource usage.

D "lowest clfree" needs close attention as well because it also trends memory resource usage.

E "Total Memory available" is the amount of fixed DRAM memory allocated for mbufs.