As with any protocol you configure, knowing how to verify, monitor, and troubleshoot its ongoing operation is crucial to maintaining your network effectively. The previous scenarios showed how to verify and monitor spanning-tree configuration and operation. This scenario is dedicated to giving you tips on how to troubleshoot spanning tree.

NOTE No specific topology is provided for this scenario, because its content has a very broad scope and no single topology can adequately cover it.

Spanning-tree problems generally have a major impact on the network and can involve massive network meltdowns. Almost always, the problem comes down to one simple issue: configuration BPDUs are not being propagated correctly on segments with blocking ports. A blocking port does not forward traffic, but it still receives BPDUs, which the switch processes. When those BPDUs stop arriving, the blocking port transitions to forwarding, which introduces loops and quickly leads to total network chaos.

Before troubleshooting, ensure you are familiar with the following:
Understanding all of the above helps you quickly identify which area of the network is malfunctioning and how the network topology should be configured. The following is a guide for your troubleshooting process:
Identifying the Loop

When you run into a spanning-tree problem, you will likely receive a sudden flood of calls saying the network is either down or running very slowly. The most definitive way to prove that a spanning-tree loop is the cause is to capture traffic on a link. However, you will normally be under pressure to provide a fix, so the next sections discuss the quickest ways to identify a potential spanning-tree issue.

Catalyst OS

On CatOS, you can use the show system command to quickly identify the current load on the system backplane, as well as the time when peak load occurred. Example 4-74 demonstrates the use of this command:

Example 4-74. Verifying Current and Historic System Load

    Switch-A (enable) show system
    PS1-Status PS2-Status Fan-Status Temp-Alarm Sys-Status Uptime d,h:m:s Logout
    ---------- ---------- ---------- ---------- ---------- -------------- ---------
    ok         none       ok         off        ok         0,12:16:25     20 min

    PS1-Type   PS2-Type   Modem   Baud  Traffic Peak Peak-Time
    ---------- ---------- ------- ----- ------- ---- -------------------------
    120w AC    none       disable 9600  5%      100% Thu Aug 15 2002, 16:01:20

    System Name              System Location          System Contact
    ------------------------ ------------------------ ------------------------
    Switch-A

The Traffic column indicates the current traffic load, and the Peak and Peak-Time columns indicate the peak traffic load and when it occurred.

To drill down to the cause of the issue, you can check port utilization levels to see if anything appears out of the ordinary. If you understand which ports should be blocking, check those ports first, because their transmit utilization should be very low. The show mac mod/port command displays sent/received frame statistics, as shown in Example 4-75.

Example 4-75. Verifying Port Statistics

    Switch-C (enable) show mac 2/2
    Port     Rcv-Unicast          Rcv-Multicast        Rcv-Broadcast
    -------- -------------------- -------------------- --------------------
     2/2                     4467                56170                 3059

    Port     Xmit-Unicast         Xmit-Multicast       Xmit-Broadcast
    -------- -------------------- -------------------- --------------------
     2/2                        0                  143                    0
    ... (Output truncated) ...

In Example 4-75, port 2/2 is a blocking port. You can see that very few multicast frames have been transmitted, which could represent configuration BPDUs sent during topology changes. You should see normal traffic statistics for received traffic, because another port on the segment is forwarding traffic to the segment.

Example 4-76 shows another useful command, the show top command, which displays the top traffic statistics for a variety of options on a per-port basis (the top 20 are shown by default). When checking for spanning-tree loops, you will find the show top bcst command useful because it displays the top ports sorted by broadcast utilization.

Example 4-76. Verifying Top Traffic Statistics

    Switch-D (enable) show top bcst 2/2
    Start Time: 08/16/2002,04:42:42
    End Time:   08/16/2002,04:43:13
    PortType: all
    Metric: bcst (Tx + Rx)
    Port  Band- Uti Bytes                Pkts       Bcst       Mcst       Error Over
          width %   (Tx + Rx)            (Tx + Rx)  (Tx + Rx)  (Tx + Rx)  (Rx)  flow
    ----- ----- --- -------------------- ---------- ---------- ---------- ----- ----
    2/1   100   0   12726                174        60         114        0     0
    2/2   100   0   1237                 8          43         65         0     0
    ... (Output truncated) ...

Cisco IOS

Cisco IOS provides the show interface counters command (see Example 4-77), which displays counters for frames sent and received on an interface. Again, you should check your blocked ports and verify that the transmit traffic utilization is very low.

Example 4-77. Verifying Interface Traffic Statistics

    Switch-C# show interface fa0/2 counters
    Port      InOctets  InUcastPkts  InMcastPkts  InBcastPkts
    Fa0/2        64506         1023         1032          978

    Port     OutOctets OutUcastPkts OutMcastPkts OutBcastPkts
    Fa0/2          912            0           34            0

Breaking the Loop

In most organizations, the network has become a critical component of running an efficient and profitable business operation. Any downtime or poor performance can directly affect the bottom line of the organization, so chances are you need to restore the network as quickly as possible, before determining the cause of the problem. You should also take steps to ensure that the problem does not recur. The following strategies can be taken:
Disabling Ports

An effective way to quickly eliminate loops is to manually disable ports that should be in a Blocking state. Performing this action should remove a loop if one has formed, and it will not affect the network because these ports are normally blocking. Use the set port disable command (CatOS) or the shutdown interface configuration command (Cisco IOS) to disable a port.

WARNING Disable ports with caution. If you are performing the configuration remotely, you might accidentally disconnect your own Telnet session, and shutting down the wrong ports disrupts legitimate traffic.

If looping traffic has driven bandwidth utilization to 100 percent and CPU utilization so high that even your exec sessions (via Telnet or console) do not respond, you can resort to physically disconnecting the ports that you think are at fault.

Turning on Event Logging

After restoring the network, you should monitor it closely for a few hours to ensure the problem does not resurface. An easy way to do so is to turn on event logging or debugging for spanning-tree events. Use the set logging level spantree 7 command (CatOS) or the debug spanning-tree events command (Cisco IOS). Example 4-78 shows how to configure spanning-tree logging on CatOS.

Example 4-78. Logging Spanning Tree Events

    Switch-A> (enable) set logging level spantree 7
    Switch-A> (enable) set logging console
    2002 Jan 16 03:13:52 %SPANTREE-6-PORTBLK: Port 2/1 state in VLAN 1 changed to blocking
    2002 Jan 16 03:13:52 %SPANTREE-5-PORTLISTEN: Port 2/1 state in VLAN 1 changed to listening
    2002 Jan 16 03:14:07 %SPANTREE-6-PORTLEARN: Port 2/1 state in VLAN 1 changed to learning
    2002 Jan 16 03:14:22 %SPANTREE-6-PORTFWD: Port 2/1 state in VLAN 1 changed to forwarding

The first command in Example 4-78 enables logging of all spanning-tree events from level 7 (the lowest severity) up to the highest severity events.
The second command sends the logging output to the console session; note that you can also send the output to a SYSLOG server. The final lines show spanning-tree events as they occur; notice the timings between each state.

TIP The set logging level command as used in the example sets the logging level only for the current session. To set the logging level permanently, add the default keyword to the end of the command (e.g., set logging level spantree 7 default). Be aware that logging at a low severity level may generate a large volume of messages, most of them of little use.

Cisco IOS offers real-time debugging tools that can provide in-depth, low-level monitoring and troubleshooting. Cisco IOS is a little light when it comes to debugging spanning tree, but it does offer a couple of debugging options. It is important to note the distinction between logging and debugging: logging is normally used on an ongoing basis, whereas debugging is used only for a session, which makes it more of a troubleshooting tool. You can debug general spanning-tree events (debug spanning-tree events), or you can debug the actual BPDUs as they are received (debug spanning-tree bpdu). Example 4-79 demonstrates the use of the debug spanning-tree events command when an interface is initialized.

Example 4-79. Debugging Spanning Tree Events

    Switch-C# debug spanning-tree events
    12:58:06: set portid: VLAN0001 Fa0/1: new port id 8001
    12:58:06: STP: VLAN0001 Fa0/1 -> listening
    12:58:21: STP: VLAN0001 new root port Fa0/1, cost 19
    12:58:21: STP: VLAN0001 sent Topology Change Notice on Fa0/1
    12:58:21: STP: VLAN0001 Fa0/2 -> blocking
    12:58:21: STP: VLAN0001 Fa0/1 -> learning
    12:58:36: STP: VLAN0001 sent Topology Change Notice on Fa0/1
    12:58:36: STP: VLAN0001 Fa0/1 -> forwarding

Checking Possible Causes

If a blocked port is not receiving configuration BPDUs, it eventually transitions to a Forwarding state to assume the designated bridge role. The following lists some possible reasons why a blocked port would not be receiving BPDUs:
Duplex Mismatch

A duplex mismatch is a very common problem and is generally caused by one side being configured for full-duplex and the other side being configured to autosense. In this configuration, the autosensing side chooses half-duplex, which can cause collisions because the full-duplex side does not exercise the CSMA/CD algorithm. The collisions cause the half-duplex side to back off sending, which can cause spanning-tree issues if the full-duplex port is a blocked port: the half-duplex side may back off sending configuration BPDUs, and the blocked port could then incorrectly transition to a Forwarding state. Use the show port mod/port command (CatOS) or the show interface command (Cisco IOS) to verify duplex settings.

Unidirectional Link

A unidirectional link occurs when traffic flows in one direction but not the other. Unidirectional links are most common on links that use fiber and/or transceivers, where a faulty fiber or transceiver can pass traffic one way only. If a link contains a blocking port, and BPDUs are not received because the link is unidirectional, the port transitions to a Forwarding state, causing a loop. On high-end Catalyst switches, the UniDirectional Link Detection (UDLD) protocol allows the switch to detect such failures, so enabling this protocol is recommended if possible.

Corrupted Frames

Corrupted frames are a less common problem, in which configuration BPDUs are corrupted and therefore ignored by the bridge with the blocked port. You can use the show port mod/port command (CatOS) or the show interface command (Cisco IOS) to check for corrupted frames.

Lack of Resources

Lack of resources refers to situations where the switch CPU is overloaded and cannot properly operate spanning tree. A simple way to check that your switch has an acceptable CPU load is to use the show inband command (CatOS) or the show processes cpu command (Cisco IOS).
The show inband command maintains a counter that is incremented every time the CPU has been too overloaded to perform a task. There are limits to how many STP instances a switch can run before CPU resources become an issue. These limits are measured in a parameter called logical ports, with a logical port basically being a single spanning-tree port in a single VLAN (note that a trunk consists of multiple logical ports). The formula for calculating logical ports is as follows:
For example, if a switch has two trunks that each carry ten active VLANs, plus 100 non-trunk ports, then the number of logical ports is (2 * 10) + (0) + 100, which is 120. Table 4-6 lists the logical port limitations on the Catalyst 4000/5000/6000 switches.
In the real world, if you are reaching the limits described in Table 4-6, your design has issues that will cause a lot of other problems as well. You should always limit the number of active VLANs in a Layer 2 network to no more than 50 or so. If you need to support more VLANs than this, you should look at implementing a Layer 3 topology that splits your Layer 2 network into smaller chunks, each of which has to support a smaller number of VLANs.

TIP To reduce the number of logical ports, prune your trunks, enabling only the required VLANs on each trunk. This pruning eliminates logical ports for VLANs that are not used on the local switch. It is important to note that although VTP prunes unused VLANs from a trunk, the STP ports for those VLANs still exist on the trunk. Therefore, if you also wish to reduce the number of logical ports, you must manually prune VLANs from each trunk by configuring its allowed list of VLANs.
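
The logical-port arithmetic, and the effect of manually pruning trunks, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name logical_ports and its parameters are hypothetical, and the middle term (zero in the worked example above) is assumed to be the ATM-trunk term from Cisco's published formula, which counts each active VLAN on an ATM trunk twice. The four-VLAN allowed list is likewise invented for the illustration.

```python
def logical_ports(trunk_vlans, non_trunk_ports, atm_trunk_vlans=()):
    """Estimate the number of spanning-tree logical ports on a switch.

    trunk_vlans:     active VLAN count for each non-ATM trunk.
    non_trunk_ports: number of non-trunk ports.
    atm_trunk_vlans: active VLAN count for each ATM trunk. This term and
                     its weight of 2 are an assumption, not taken from
                     the text above.
    """
    return sum(trunk_vlans) + 2 * sum(atm_trunk_vlans) + non_trunk_ports


# Worked example from the text: two trunks with ten active VLANs each,
# no ATM trunks, and 100 non-trunk ports.
before = logical_ports([10, 10], 100)
print(before)  # 120

# Hypothetical manual pruning: each trunk's allowed list cut to 4 VLANs.
after = logical_ports([4, 4], 100)
print(after)  # 108
```

Because every VLAN allowed on a trunk adds one logical port per trunk, trimming the allowed list has a larger effect the more trunks a switch has, whereas non-trunk ports always contribute exactly one logical port each.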