Scenario 4-10: Troubleshooting Spanning Tree

As with any protocol you configure, knowing how to verify, monitor, and troubleshoot the ongoing operation of the protocol is crucial to maintaining your network effectively. The previous scenarios showed how to verify and monitor spanning-tree configuration and operation. This scenario is dedicated to giving you tips on how to troubleshoot spanning tree.

NOTE

No specific topology is provided for this scenario, because the content of this scenario has a very broad scope and no single topology can adequately cover the content.

Spanning-tree problems generally have a major impact on the network and can involve massive network meltdowns. Almost always, the problem comes down to one simple issue: configuration BPDUs are not being propagated correctly on segments with blocking ports. This issue causes those blocking ports to start forwarding traffic. (Although a blocking port does not forward traffic, it is very important to understand that the port still receives BPDUs, which are processed by the switch; it is the loss of those BPDUs that triggers the transition.) This forwarding introduces loops, which quickly lead to total network chaos. Before troubleshooting, ensure you are familiar with the following:

The full topology of the Layer 2 network
The location of the root bridge
Which ports should be blocked (i.e., the redundant links that loop the bridging topology)

Understanding all of the above ensures that you can easily identify which area of the network is malfunctioning and how the network topology should be configured. Your troubleshooting process should follow these steps:

Identify the loop
Break the loop
Check possible causes

Identifying the Loop

When you run into a spanning-tree problem, you will likely receive a sudden flood of calls saying the network is either down or running very slowly. The most definitive way to prove that a spanning-tree loop is the cause is to capture traffic on a link. However, you will normally be under pressure to provide a fix, and that is why the next sections discuss the quickest ways to identify a potential spanning-tree issue.

Catalyst OS

On CatOS, you can use the show system command to quickly identify the current load on the system backplane, as well as the time when peak load occurred. Example 4-74 demonstrates the use of this command:

Example 4-74. Verifying Current and Historic System Load

 Switch-A (enable) show system
 PS1-Status PS2-Status Fan-Status Temp-Alarm Sys-Status Uptime d,h:m:s Logout
 ---------- ---------- ---------- ---------- ---------- -------------- ---------
 ok         none       ok         off        ok         0,12:16:25     20 min
 PS1-Type   PS2-Type   Modem   Baud  Traffic Peak Peak-Time
 ---------- ---------- ------- ----- ------- ---- -------------------------
 120w AC    none       disable  9600   5%    100% Thu Aug 15 2002, 16:01:20
 System Name              System Location          System Contact
 ------------------------ ------------------------ ------------------------
 Switch-A

The Traffic column indicates the current traffic load, and the Peak and Peak-Time columns indicate the peak traffic load and when it occurred.

To drill down to the cause of the issue, you can check port utilization levels to see if anything appears out of the ordinary. Obviously, if you understand which ports should be blocking, you should check those ports first because the utilization should be very low. The show mac mod/port command displays sent/received frame statistics, as shown in Example 4-75.

Example 4-75. Verifying Port Statistics

 Switch-C (enable) show mac 2/2
 Port     Rcv-Unicast          Rcv-Multicast        Rcv-Broadcast
 -------- -------------------- -------------------- --------------------
  2/2                     4467                56170                 3059
 Port     Xmit-Unicast         Xmit-Multicast       Xmit-Broadcast
 -------- -------------------- -------------------- --------------------
  2/2                        0                  143                    0
 ... (Output truncated) ...

In Example 4-75, port 2/2 is a blocking port. You can see that very few multicast frames have been transmitted, which could represent configuration BPDUs sent during topology changes. You should see normal traffic statistics for received traffic, because another port of the segment is forwarding traffic to the segment.

Example 4-76 shows another useful command, the show top command, which displays the top traffic statistics for a variety of options on a per-port basis (the top 20 are shown by default). When checking for spanning tree loops, you will find the show top bcst command useful because it displays the top ports sorted by broadcast utilization.

Example 4-76. Verifying Top Traffic Statistics

 Switch-D (enable) show top bcst 2/2
 Start Time:     08/16/2002,04:42:42
 End Time:       08/16/2002,04:43:13
 PortType:       all
 Metric:         bcst (Tx + Rx)
 Port  Band- Uti Bytes                Pkts       Bcst       Mcst       Error Over
       width  %  (Tx + Rx)            (Tx + Rx)  (Tx + Rx)  (Tx + Rx)  (Rx)  flow
 ----- ----- --- -------------------- ---------- ---------- ---------- ----- ----
  2/1    100   0                12726        174         60        114     0    0
  2/2    100   0                 1237          8         43         65     0    0
 ... (Output truncated) ...

Cisco IOS

Cisco IOS provides the show interface counters command (see Example 4-77), which displays counters about frames sent and received on an interface. Again, you should check your blocked ports and verify that the transmit traffic utilization is very low.

Example 4-77. Verifying Interface Traffic Statistics

 Switch-C# show interface fa0/2 counters
 Port            InOctets   InUcastPkts   InMcastPkts   InBcastPkts
 Fa0/2              64506          1023          1032           978
 Port           OutOctets  OutUcastPkts  OutMcastPkts  OutBcastPkts
 Fa0/2                912             0            34             0
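The blocked-port check described in this section can be automated once the counters have been collected. The following Python sketch is only an illustration: the PortCounters class, the function name, and the max_out_mcast threshold are all hypothetical, and it assumes the counters were cleared (or sampled as deltas) while the port was already blocking, because interface counters are cumulative.

```python
from dataclasses import dataclass


@dataclass
class PortCounters:
    """Transmit counters for one port (hypothetical container, filled by hand)."""
    port: str
    out_ucast: int
    out_mcast: int
    out_bcast: int


def suspicious_blocked_ports(counters, blocked_ports, max_out_mcast=1000):
    """Return the ports that should be blocking but look as if they forward.

    max_out_mcast is an assumed threshold, not a Cisco-defined value; a small
    multicast count is tolerated because it may simply be BPDUs.
    """
    flagged = []
    for c in counters:
        if c.port not in blocked_ports:
            continue
        # A blocking port should transmit no unicast or broadcast user frames.
        if c.out_ucast > 0 or c.out_bcast > 0 or c.out_mcast > max_out_mcast:
            flagged.append(c.port)
    return flagged


# Fa0/2 uses the transmit values from Example 4-77; Fa0/3 is made-up data.
sample = [
    PortCounters("Fa0/2", out_ucast=0, out_mcast=34, out_bcast=0),
    PortCounters("Fa0/3", out_ucast=1500, out_mcast=40, out_bcast=900),
]
print(suspicious_blocked_ports(sample, {"Fa0/2", "Fa0/3"}))
```

With this sample data only Fa0/3 is flagged, because it has transmitted unicast and broadcast frames even though it is expected to be blocking.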

Breaking the Loop

In most organizations, the network has become a critical component of running an efficient and profitable business operation. Any downtime or poor performance can directly affect the bottom line of the organization, so chances are you need to restore the network as quickly as possible, before determining the cause of the problem. You should also be prepared to catch the problem if it recurs. The following strategies can be taken:

Disabling ports
Turning on event logging

Disabling Ports

An effective way to quickly eliminate loops is to manually disable ports that should be in a Blocking state. Performing this action should remove a loop if it has formed and will not affect the network because these ports are normally blocking. Use the set port disable command (CatOS) or the shutdown interface configuration command (Cisco IOS) to disable a port.

WARNING

Disable ports with caution, as you might accidentally disconnect your Telnet session if you are performing the configuration remotely or disrupt legitimate traffic by shutting down the wrong ports. If your network is in such a state that even your exec sessions (via Telnet or console) are not responding due to the high CPU utilization incurred by looping traffic causing 100 percent bandwidth utilization, you can resort to physically disconnecting the ports that you think are at fault.

Turning on Event Logging

After restoring the network, you should monitor the network closely for a few hours to ensure the problem does not resurface. An easy way to monitor the network is to turn on event logging or debugging for spanning-tree events. Use the set logging level spantree 7 command (CatOS) or the debug spanning-tree events command (Cisco IOS).

Example 4-78 shows how to configure spanning-tree logging on CatOS.

Example 4-78. Logging Spanning Tree Events

 Switch-A> (enable) set logging level spantree 7
 Switch-A> (enable) set logging console
 2002 Jan 16 03:13:52 %SPANTREE-6-PORTBLK: Port 2/1 state in VLAN 1
     changed to blocking
 2002 Jan 16 03:13:52 %SPANTREE-5-PORTLISTEN: Port 2/1 state in VLAN 1
     changed to listening
 2002 Jan 16 03:14:07 %SPANTREE-6-PORTLEARN: Port 2/1 state in VLAN 1
     changed to learning
 2002 Jan 16 03:14:22 %SPANTREE-6-PORTFWD: Port 2/1 state in VLAN 1
     changed to forwarding

The first command in Example 4-78 enables logging of all spanning-tree events from level 7 (the lowest severity) up to the highest severity events. The second command sends the logging output to the console session; note that you can also send the output to a syslog server. The final lines show spanning-tree events as they occur; notice the timings between each state.
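If you collect these messages on a syslog server, the timings between states can be extracted with a short script. The following Python sketch is one possible approach, not a Cisco tool: the function name and regular expression are illustrative, and it assumes each message has been joined onto a single line in the format shown in Example 4-78.

```python
import re
from datetime import datetime

# Matches CatOS spanning-tree state messages as they appear in Example 4-78.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} \d{1,2} \d{2}:\d{2}:\d{2}) "
    r"%SPANTREE-\d-\w+: Port (?P<port>\S+) state in VLAN (?P<vlan>\d+)\s+"
    r"changed to (?P<state>\w+)"
)


def transition_intervals(lines):
    """Return (port, vlan, old_state, new_state, seconds) per state change."""
    last_seen = {}   # (port, vlan) -> (timestamp, state)
    intervals = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not a state-change message
        ts = datetime.strptime(m["ts"], "%Y %b %d %H:%M:%S")
        key = (m["port"], int(m["vlan"]))
        if key in last_seen:
            prev_ts, prev_state = last_seen[key]
            seconds = int((ts - prev_ts).total_seconds())
            intervals.append((key[0], key[1], prev_state, m["state"], seconds))
        last_seen[key] = (ts, m["state"])
    return intervals


log = [
    "2002 Jan 16 03:13:52 %SPANTREE-6-PORTBLK: Port 2/1 state in VLAN 1 changed to blocking",
    "2002 Jan 16 03:13:52 %SPANTREE-5-PORTLISTEN: Port 2/1 state in VLAN 1 changed to listening",
    "2002 Jan 16 03:14:07 %SPANTREE-6-PORTLEARN: Port 2/1 state in VLAN 1 changed to learning",
    "2002 Jan 16 03:14:22 %SPANTREE-6-PORTFWD: Port 2/1 state in VLAN 1 changed to forwarding",
]
for row in transition_intervals(log):
    print(row)
```

For the messages in Example 4-78, the listening-to-learning and learning-to-forwarding intervals are each 15 seconds. A port that repeatedly cycles through these states is worth investigating.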

TIP

The set logging level command as used in the example sets the logging level only for the current session. To set the logging level permanently, add the default keyword to the end of the command (e.g., set logging level spantree 7 default). Be aware that setting a low severity level may generate a lot of useless information.

Cisco IOS offers real-time debugging tools that can provide in-depth, low-level monitoring and troubleshooting. Cisco IOS is a little light when it comes to debugging spanning tree, but does offer a couple of debugging options. It is important to note the distinction between logging and debugging. Logging is normally used on an ongoing basis, whereas debugging is used only for a session, indicating it is more a troubleshooting tool.

You can debug general spanning-tree events (debug spanning-tree events), or you can debug the actual BPDUs as they are received (debug spanning-tree bpdu). Example 4-79 demonstrates the use of the debug spanning-tree events command when an interface is initialized.

Example 4-79. Debugging Spanning Tree Events

 Switch-C# debug spanning-tree events
 12:58:06: set portid: VLAN0001 Fa0/1: new port id 8001
 12:58:06: STP: VLAN0001 Fa0/1 -> listening
 12:58:21: STP: VLAN0001 new root port Fa0/1, cost 19
 12:58:21: STP: VLAN0001 sent Topology Change Notice on Fa0/1
 12:58:21: STP: VLAN0001 Fa0/2 -> blocking
 12:58:21: STP: VLAN0001 Fa0/1 -> learning
 12:58:36: STP: VLAN0001 sent Topology Change Notice on Fa0/1
 12:58:36: STP: VLAN0001 Fa0/1 -> forwarding

Checking Possible Causes

If a blocked port is not receiving configuration BPDUs, it eventually transitions to a Forwarding state to assume the designated bridge role. The following lists some possible reasons why a blocked port would not be receiving BPDUs:

Duplex mismatch
Unidirectional link
Corrupted frames
Lack of resources
Incorrect timer configuration
Incorrect use of PortFast

Duplex Mismatch

A duplex mismatch is a very common problem and is generally caused by one side being configured to full-duplex and the other side being configured to autosense. In this configuration, the autosensing side chooses half-duplex, which can cause collisions because the full-duplex side does not exercise the CSMA/CD algorithm. This will cause the half-duplex side to back off sending and can cause spanning-tree issues if the full-duplex port is a blocked port (the half-duplex side may back off sending configuration BPDUs, which could incorrectly transition the blocked port to a Forwarding state). Use the show port mod/port command (CatOS) or show interface command (Cisco IOS) to verify duplex settings.
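The mismatch logic described above can be modeled in a few lines when auditing a list of link configurations. The following Python sketch is a simplified model, not a switch feature: the function names are hypothetical, and it assumes that two autosensing sides both support and settle on full-duplex.

```python
def resolved_duplex(local, peer):
    """Return the duplex the local side ends up using.

    local and peer are each "full", "half", or "auto" (the configured setting).
    """
    if local != "auto":
        return local  # a hard-coded side uses its configured duplex
    # An autosensing side facing a hard-coded peer falls back to half-duplex;
    # two autosensing sides are assumed to agree on full-duplex.
    return "full" if peer == "auto" else "half"


def has_duplex_mismatch(side_a, side_b):
    """True if the two ends of the link resolve to different duplex modes."""
    return resolved_duplex(side_a, side_b) != resolved_duplex(side_b, side_a)


print(has_duplex_mismatch("full", "auto"))  # the case described in the text
print(has_duplex_mismatch("auto", "auto"))
```

The first call models the common problem case (one side forced to full-duplex, the other autosensing) and reports a mismatch; the second models two autosensing sides and does not.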

Unidirectional Link

A unidirectional link occurs when traffic flows in one direction, but not the other. A unidirectional link is common on links that use fiber and/or transceivers, where a faulty fiber/transceiver may lead to a unidirectional link. If a link contains a blocking port, and BPDUs are not received due to a unidirectional link, then the port transitions to a Forwarding state, causing a loop. On high-end Catalyst switches, the unidirectional link detection (UDLD) protocol allows the switch to detect failures, so enabling this protocol is recommended if possible.

Corrupted Frames

Corrupted frames are a less common problem. In this case, configuration BPDUs are corrupted in transit and are ignored by the bridge with the blocked port. You can use the show port mod/port command (CatOS) or the show interface command (Cisco IOS) to check for corrupted frames.

Lack of Resources

Lack of resources refers to situations where the switch CPU is overloaded and cannot properly operate spanning tree, causing issues. The simple way to ensure that your switch has an acceptable CPU load is to use the show inband command (CatOS) or the show processes cpu command (Cisco IOS). The show inband command maintains a counter that is incremented every time the CPU has been too overloaded to perform a task.

Limitations exist as to how many STP instances a switch can run before CPU resources become an issue. The limitations are measured in a parameter called logical ports, with a logical port basically being a single spanning-tree port in a single VLAN (note that a trunk consists of multiple logical ports). The formula for calculating logical ports is as follows:

(number of non-ATM trunks * number of active VLANs on trunk) + 2 * (number of ATM trunks * number of active VLANs on trunk) + number of non-trunking ports

For example, if you have a switch that contains two trunks that actively trunk for ten VLANs and has 100 non-trunk ports, then the number of logical ports is (2 * 10) + (0) + 100, which is 120. Table 4-6 lists the logical port limitations on the Catalyst 4000/5000/6000 switches.

Table 4-6. Logical Port Limits
Platform	Supervisor Engine	Maximum Logical Ports
Catalyst 4000	Supervisor 1 and 2	1500
Catalyst 5000	Supervisor 1 (8 MB RAM)	200
Catalyst 5000	Supervisor 1 (20 MB RAM)	400
Catalyst 5000	Supervisor 2, 3F	1500
Catalyst 5000	Supervisor 2, 3G	1800
Catalyst 5000	Supervisor 3	4000
Catalyst 6000	All	4000
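The logical port formula given before Table 4-6 is easy to turn into a quick planning check. The following Python sketch is a minimal illustration: the function and parameter names are hypothetical, and it assumes every trunk carries the same number of active VLANs, as in the worked example.

```python
def logical_ports(non_atm_trunks, atm_trunks, active_vlans_per_trunk,
                  non_trunk_ports):
    """Apply the logical port formula from the text.

    Simplification: all trunks are assumed to carry the same number of
    active VLANs; sum per trunk instead if yours differ.
    """
    return (non_atm_trunks * active_vlans_per_trunk
            + 2 * (atm_trunks * active_vlans_per_trunk)
            + non_trunk_ports)


def within_limit(count, max_logical_ports):
    """True if the count does not exceed the platform limit from Table 4-6."""
    return count <= max_logical_ports


# Worked example from the text: two trunks, ten VLANs, 100 non-trunk ports.
count = logical_ports(non_atm_trunks=2, atm_trunks=0,
                      active_vlans_per_trunk=10, non_trunk_ports=100)
print(count)                      # 120
print(within_limit(count, 200))   # Catalyst 5000 Supervisor 1 (8 MB RAM) limit
```

The result of 120 matches the worked example and fits within even the smallest limit listed in Table 4-6.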

In the real world, if you are reaching the limits described in Table 4-6, your design has issues that will cause a lot of other problems as well. You should always limit the number of active VLANs in a Layer 2 network to no more than 50 or so. If you need to support more VLANs than this, you should look at implementing a Layer 3 topology that splits your Layer 2 network into smaller chunks that each have to support a smaller number of VLANs.

TIP

To reduce the number of logical ports, prune your trunks, enabling only the required VLANs on each trunk. This pruning eliminates logical ports for VLANs that are not used on the local switch. It is important to note that although VTP prunes unused VLANs from a trunk, STP ports still exist on the trunk. Therefore you must manually prune VLANs from a trunk by configuring the allowed list of VLANs for each trunk if you also wish to reduce the number of logical ports.