Case Study: In the Trenches with OSPF
This case study describes a real-world network problem, how it was identified, and how it was corrected. It then outlines some lessons to help prevent the problem from recurring. The troubleshooting model introduced earlier in this chapter is used throughout as the reference process for network troubleshooting.
Recently, a large broadcast storm occurred in an OSPF enterprise network, affecting a region that comprised approximately 50 geographically separate sites, with more than 75 routers serving approximately 3,000 users. This condition brought all user WAN traffic in the impacted area to a standstill.
In the course of troubleshooting, we identified how and why localized broadcasts were erroneously being propagated across the WAN, resulting in a dramatic degradation in network performance. Several Cisco OSPF router configuration problems were also identified and corrected along the way.
When troubleshooting in any type of networking environment, a systematic troubleshooting methodology works best. The seven steps outlined throughout this section will help you to clearly define the specific symptoms associated with network problems, identify potential problems that could be causing the symptoms, and then systematically eliminate each potential problem (from most likely to least likely) until the symptoms disappear.
This process is not a rigid outline for troubleshooting an internetwork. Rather, it is a foundation from which you can build a problem-solving process to suit your particular internetworking environment.
The following troubleshooting steps detail the problem-solving process:
Customer Reports a Network Slowdown
The customer has called the Network Operations Center (NOC) and reported a network slowdown at a number of critical sites. The situation is all the more urgent because the customer is preparing to run the end-of-inventory reconciliation report. The network must be available at this critical time or the customer will lose money.
Step 1: Define the Problem
The first step in any type of troubleshooting and repair scenario is to define the problem. What is actually happening is sometimes very different from what is reported; the goal of this step is therefore to establish the actual problem. You need to do two things: identify the symptoms and perform an impact assessment.
Our customer called us and explained that the network response was extremely slow. This, of course, was a rather vague and broad description from a network operations standpoint. Due to the nature of the problem report (that is, it can sometimes be difficult to define "slow"), a clear understanding of the problem was required before we could proceed with developing an action plan. This was accomplished by gathering facts and asking the users reporting the problem several questions. According to the users, the general symptoms included:
Step 2: Gather Facts
After the problem is defined, the next step is to gather the facts surrounding it. This section presents the facts that were gathered in this case study.
Before starting to troubleshoot any type of networking problem, it is usually helpful to have a network diagram. Figure 8-6 shows the diagram we used.
Following the previously mentioned troubleshooting methodology, we collected as many facts as possible and made some general observations by connecting to the routers in question. We gathered facts from several sources on the routers, including the Cisco log buffer and the output of various Cisco SHOW commands. Our observations revealed the following facts and occurrences within the network.
Router B in Figure 8-6 reported high traffic input to the Ethernet segment at Headquarters. This caused the Ethernet connectivity to become so unstable that the links would become unavailable for brief periods. Consequently, OSPF adjacencies were being re-formed repeatedly. The following is an excerpt from the SYSLOG on Router A:
Mar 1 00:08:17 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to down
Mar 1 00:08:29 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to up
Mar 1 00:08:35 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to down
Mar 1 00:08:39 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to up
As you can see from the SYSLOG entries, Ethernet connectivity was being lost for brief periods of time. The router was definitely showing us a contributing factor to the problems being reported by our customer.
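When a log buffer holds many such entries, it can help to summarize the flapping rather than read it line by line. The following Python sketch is not part of the original investigation; it is one possible way to pair each "down" event with the next "up" event on the same interface and report how long the line protocol stayed down. The function names (parse_events, outages) and the placeholder year are assumptions made for illustration, and the pattern assumes the exact message layout shown in the excerpt.

```python
import re
from datetime import datetime

# Sample lines copied from the SYSLOG excerpt in this case study.
LOG = """\
Mar 1 00:08:17 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to down
Mar 1 00:08:29 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to up
Mar 1 00:08:35 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to down
Mar 1 00:08:39 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to up
"""

# Matches only line-protocol up/down messages in the layout shown above.
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \w+: %LINEPROTO-5-UPDOWN: "
    r"Line protocol on Interface (?P<iface>[^,\s]+), "
    r"changed state to (?P<state>up|down)$"
)

# The log carries no year, so a placeholder is supplied (assumption).
# It only matters for computing differences within one log.
PLACEHOLDER_YEAR = "2000"


def parse_events(text):
    """Return (timestamp, interface, state) tuples for matching lines."""
    events = []
    for line in text.splitlines():
        m = PATTERN.match(line.strip())
        if m:
            when = datetime.strptime(
                PLACEHOLDER_YEAR + " " + m["stamp"], "%Y %b %d %H:%M:%S"
            )
            events.append((when, m["iface"], m["state"]))
    return events


def outages(events):
    """Pair each 'down' with the next 'up' per interface; return seconds down."""
    down_since = {}
    result = []
    for when, iface, state in events:
        if state == "down":
            down_since[iface] = when
        elif iface in down_since:
            seconds = int((when - down_since.pop(iface)).total_seconds())
            result.append((iface, seconds))
    return result


if __name__ == "__main__":
    for iface, seconds in outages(parse_events(LOG)):
        print(f"{iface}: line protocol down for {seconds} s")
```

Run against the four entries above, the sketch reports two outages on Ethernet0, lasting 12 seconds and 4 seconds, which matches the brief losses of connectivity visible in the excerpt.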
Missing OSPF Adjacencies