Problem-Solving Model | Cisco Network Security Troubleshooting Handbook

When troubleshooting a problem in a production network, it is best to use a systematic approach. It is not uncommon to see a network security troubleshooter take an unsystematic approach, which might work sometimes but can introduce other problems into the network. Besides, troubleshooting takes an enormous amount of time unless you know the product and technology and take a systematic approach.

This section explains a systematic troubleshooting approach that you can use to help isolate your problems. The approach consists of the following steps:

Step 1.	Define the problem.
Step 2.	Gather facts about the problem.
Step 3.	Consider possibilities.
Step 4.	Create an action plan.
Step 5.	Implement the action plan.
Step 6.	Observe the results.
Step 7.	Repeat Steps 1 through 6 if necessary.
Step 8.	Document the changes after the issue is resolved.

The sections that follow provide additional information about each of these steps.

Step 1: Define the Problem

This is the first step in the process. Based on the information available, you must define your problem precisely. You should define the problem in terms of a set of symptoms and potential causes. Every problem should have an element (for example, PIX firewall, translation, connection, and so on) and one or two lines outlining the problem statement. Sometimes you have to deal with multiple problems. In that case, it is important to define each problem individually and address each one in the order of priority. This is especially important because without solving the problem of highest priority, you might not be able to address the next issue.

For instance, users might complain that the PIX firewall is not allowing their traffic from the inside to the Internet. There could be multiple reasons for this. It could be that the translation required for the connection is not being built, which in turn prevents a connection from being established across the PIX firewall. Or it could be that the translation is built but the connection is not. Under those circumstances, you might define two problem statements. The first could be this: "The PIX is not building up translation." The second could be this: "The PIX is not building up a connection." Now you need to decide which to address first. Without translation, there are no connections, so you need to work on translation first, because resolving translation may also resolve the connection. This example shows why setting priorities is important. To analyze the problem properly, identify the general symptoms, and then ascertain what kinds of problems (causes) could result in these symptoms. The problem statement should follow this format: What is wrong with what?
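The prioritization in this example can be sketched in a few lines of Python. This is only an illustration, assuming you record for each problem statement which other problems must be solved first; the `depends_on` mapping and the `work_order` helper are hypothetical names, not part of any Cisco tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each problem statement maps to the set of problems that must be solved
# before it. The single dependency below encodes the observation that a
# connection cannot be built without a translation.
depends_on = {
    "The PIX is not building up translation.": set(),
    "The PIX is not building up a connection.": {
        "The PIX is not building up translation."
    },
}


def work_order(dependencies):
    """Return the problem statements with prerequisites listed first."""
    return list(TopologicalSorter(dependencies).static_order())


order = work_order(depends_on)
for rank, statement in enumerate(order, start=1):
    print(rank, statement)
```

With only two problems the ordering is obvious, but the same structure may help when several reported problems depend on one another.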

Here are some of the questions that will help you define a good problem statement:

Identify the concerns end users have by asking these questions:

What is wrong in the network?
What actions do I want to take in the network?
How do I know there is a problem? Is it based on hard data like debugs, traces, screen shots, and show commands, or is it based on users' experiences?
What do I see in the network that indicates that there is a problem?

Once you have answers to these preliminary questions, you might need to prioritize the problem by answering the following questions:

What has the biggest impact on my network operations?
What is the impact on my network or operations?
What is the impact on my business?
What are my deadlines, and what is the timeframe or maintenance window I am working in?
When are my users coming into work?
What is the impact on our business or operations if we do not get this fixed?
What signs do I see that this impact will be changing?

After going through both sets of questions, you should be able to define the problems and prioritize them if needed. The problem statement can now be narrowed to the following: "Translation is not getting built up on the PIX firewall for inside host x."

Step 2: Gather the Facts

Gathering facts is a step that will help to isolate the possible causes. The following actions should be taken to uncover the facts:

1	What are the reported problems? To find out the problems reported, take the following steps:

Ask questions of affected users, network administrators, managers, and other key people.
Collect information from sources such as network management systems, protocol analyzer traces, output from router diagnostic commands, or software release notes.

Chapter 2, "Understanding Troubleshooting Tools," covers the tools required for collecting and verifying facts.

2	Where is the problem reported? Once you identify the symptoms of the problem, you need to find out exactly where the problem is reported. This concerns both the location of the device and the location of the problem within the device. The following questions, along with a topology diagram of the network, will help you find out where the problem is reported:

Which device (VPN concentrator, PIX, and so on) is having a problem?
What country, city, and wiring closet is it in? A topology map will also give you this information.
What interface (for example, on the PIX, is it the DMZ, outside, or inside interface) is experiencing problems?
Which functions in the software are problematic (for example, is it connection, translation, and so on, on the PIX firewall)?
Which perimeter of the network is the source of the problem?

3	When is the problem reported? It is very important to know when the problem is reported. This might help in identifying what changes occurred around the time the problem appeared. The following questions will help you get this information:

When did you first notice the problem? (Refer to timestamps in the syslog.) You can ask the end users when the problem occurred and see whether the timestamps in the log correlate with the time the end users report.
When was the problem first reported and logged?
Is the problem intermittent (appearing and disappearing at different times of the day, for example) or continuous?
At which stage of the protocol negotiation does the failure occur? (For example, in the case of IPSec, is it failing in phase 1 or phase 2?)
What changes were made to the network before, after, or at the time the problem was reported by syslog or an end user?
Which devices are affected? Are there similar or different types of devices that could be affected but are not, and what are the key differences?

4	What is the scope of the problem? Understanding the magnitude of the problem is important. The following questions will assist you in identifying the magnitude of the problem:

How many network/security devices experience the problem?
How big is the issue? How many users are affected?
How many issues are there on any specific device? For example, are you experiencing both high CPU and failover issues on the PIX? One might be responsible for the other, but you need to include all possibilities.
Is there any specific pattern or trend? Is the problem getting worse, improving, or staying the same? For example, if you have a connection problem across the PIX firewall for a specific host, are multiple hosts having connection problems? Does the problem lessen at a specific time of the day?

5	Use baseline information. After collecting as many facts as possible, use the baseline information (configuration, statistics, and so on) to find out what has changed in terms of configurations and statistics. For example, you might have baselined the number of Port Address Translation (PAT) translations at 30K during a busy hour. If you find that the translation count exceeds 30K at any given time, this could indicate a problem.
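The baseline comparison in item 5 can be sketched as follows. This is a minimal illustration that assumes you have already collected translation counts as plain integers; the `Baseline` class, the `exceeds_baseline` helper, the `tolerance` parameter, and the sample values are all hypothetical and are not taken from any device output.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    name: str
    busy_hour_peak: int  # highest value recorded during a normal busy hour


def exceeds_baseline(baseline, observed, tolerance=0.0):
    """Return True when the observed value is above the baseline peak
    by more than the given fractional tolerance (0.0 means no margin)."""
    return observed > baseline.busy_hour_peak * (1 + tolerance)


# Baseline from the example above: 30K PAT translations in a busy hour.
pat = Baseline(name="PAT translations", busy_hour_peak=30_000)

# Made-up samples standing in for counts gathered during fact-finding.
samples = [27_500, 29_900, 31_200]
flagged = [count for count in samples if exceeds_baseline(pat, count)]
print(flagged)
```

A sample that is flagged is not proof of a fault; it only marks a deviation from the baseline that is worth investigating in the next step.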

Step 3: Consider Possible Problems

If you do a good job in fact-finding, know your network well, and have topology and baseline information on hand, this step is easier. In other words, the success or failure of this step depends heavily upon the previous steps.

Using the facts, you can eliminate some of the potential problems from the list you defined in Step 1. Depending on the data, for example, you might be able to identify whether a problem involves software or configuration. At every opportunity, try to decrease the number of potential problems, so that you can create an efficient plan of action, which is discussed next.

Step 4: Create an Action Plan

Based on the remaining potential problems deduced from the previous step, prioritize the issues. Then start making the changes one by one, based on the list you have created, with highest priority first. Working with only one variable at a time enables you to reproduce a given solution to a specific problem. If you alter more than one variable simultaneously, you might be able to solve the problem, but identifying the specific change that eliminated the symptom becomes far more difficult, and will not help you solve the same problem if it occurs in the future.
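The one-variable-at-a-time rule can be sketched as a small loop. This is an illustration only: `run_action_plan`, the simulated `state` dictionary, and the change descriptions are hypothetical, and reverting a change that had no effect is an assumption added here to keep a single variable in play, not something this section prescribes.

```python
def run_action_plan(changes, symptom_present):
    """Apply candidate changes one at a time, highest priority first.

    changes: list of (description, apply_fn, revert_fn), ordered by priority.
    symptom_present: callable that returns True while the symptom persists.
    Returns (fix, log): the description of the change that cleared the
    symptom (or None), and a record of every attempt.
    """
    log = []
    for description, apply_fn, revert_fn in changes:
        apply_fn()
        if not symptom_present():
            log.append((description, "resolved"))
            return description, log
        # Assumption: undo an ineffective change so that only one
        # variable differs from the starting point on each attempt.
        revert_fn()
        log.append((description, "no effect, reverted"))
    return None, log


# Simulated device state standing in for a real configuration.
state = {"nat_rule": False, "acl_permit": True}


def symptom_present():
    # In this simulation the symptom persists until the NAT rule exists.
    return not state["nat_rule"]


changes = [
    ("toggle ACL entry",
     lambda: state.update(acl_permit=False),
     lambda: state.update(acl_permit=True)),
    ("add NAT rule",
     lambda: state.update(nat_rule=True),
     lambda: state.update(nat_rule=False)),
]

fix, log = run_action_plan(changes, symptom_present)
print(fix)
for entry in log:
    print(entry)
```

Because each attempt is logged with its outcome, the log also records which specific change eliminated the symptom.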

Step 5: Implement the Action Plan

Perform each step carefully while testing to see whether the symptom disappears.

Step 6: Observe Results

Whenever you change a variable, be sure to gather results. Generally, you should use the same method of gathering facts that you used in Step 2 (that is, working with the key people affected, in conjunction with using your diagnostic tools).

Step 7: Repeat if Necessary

Analyze the results to determine whether the problem has been resolved. If it has, then the troubleshooting itself is complete. If the problem has not been resolved, you must create an action plan based on the next most likely problem in your list. Return to Step 4, change one variable at a time, and repeat the process until the problem is solved.

Step 8: Document the Changes

Once the problem is resolved, be sure to document the changes you made. This step is very important and is often ignored. Every change you make to the network has the potential to create another problem (although this does not often happen). So, if you have documentation of the changes you made, you can always refer back to it. Besides, this process produces a good knowledge base for others in the department who were not involved in the specific troubleshooting that you performed.