The Process of Troubleshooting | Inside Network Perimeter Security (2nd Edition)

Troubleshooting is a problem-solving process that many find rewarding. In general, it revolves around proving or disproving hypotheses. The following steps are all part of the troubleshooting process:

1.	Collect symptoms.
2.	Review recent changes.
3.	Form a hypothesis.
4.	Test the hypothesis.
5.	Analyze the results.
6.	Repeat if necessary.

You probably already apply some form of this process whenever you sit down to solve a problem. You gather information and formulate a hypothesis, whether consciously or not, simply by thinking of a pertinent question to ask. The answer either strengthens or weakens the hypothesis and helps you move on to the next question.

Let's consider these steps in the context of an example in which Internet users can no longer access their company website. You have learned of this issue via a call to the company helpdesk from an outside user complaining that she can no longer remotely connect to the site. It's time to gather some facts to see if we can get to the root of the problem.

Collecting Symptoms

Collecting symptoms seems like an obvious step in the troubleshooting process, but sometimes it's easy to jump into solving a problem before you really understand what the problem is. You can often save a lot of time by slowing down the process and confirming the symptoms before proceeding further. Try to re-create the problem (unless the results are catastrophic, of course). If you can re-create the problem, you will have an easier time later determining whether your solution actually fixes it. Also, check whether your client or server software includes debug options that might help you collect more specific symptoms.

After talking to the outside user, we know that the main symptom of the website problem is that the browser times out after a while and presents a message saying that the server is unavailable. The helpdesk has received calls from other outside users as well, so the problem isn't confined to one user. To verify the symptom, we try to connect to the troubled server from the outside ourselves, using a dial-up Internet connection. We can't connect to the site either, so this appears to be a problem that affects all outside users. We also try to access the server locally. That works fine, which suggests that the server itself is probably not at fault. Because the symptoms haven't given us a definitive answer yet, let's consider things that might have recently changed in our environment that could shed some light on the situation.
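The outside-versus-local comparison described above can be scripted so that the symptom is recorded the same way on every attempt. The following is a minimal Python sketch, not tooling from the book: the function name `probe_tcp`, the default port, and the five-second timeout are assumptions chosen for illustration. It separates a timeout (often a sign that something along the path is silently dropping packets) from an active refusal.

```python
import socket


def probe_tcp(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open", "timeout", "refused", or "error".
    """
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except ConnectionRefusedError:
        # The host (or a device in front of it) actively rejected the attempt.
        return "refused"
    except OSError:
        # Name resolution failure, unreachable network, and similar errors.
        return "error"
```

Calling `probe_tcp("www.example.com")` once from an outside connection and once from a machine on the server's own segment reproduces the comparison made in the example; `www.example.com` is a placeholder for the affected site.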

Reviewing Recent Changes

Reviewing recent changes is a separate step in the troubleshooting process because so many problems can be traced back to a recent change to the environment. Obviously, recent changes must be considered if they coincide with the problem timeframe. In fact, you should consider these changes even if they don't at first seem to relate to the problem. Often, a change uncovers problems that existed before but hadn't yet manifested themselves. A common example is a server change whose effects don't show up until the first time the server is rebooted. Depending on the circumstances and your environment, you may be able to back out the change immediately and worry about understanding the real cause later.

As far as our sample problem goes, we do some research and determine that a firewall protecting the web server's network segment was replaced with a new firewall about the time the problem started. That makes the new firewall a likely suspect, but we need more information before we can fix anything. To get it, we need to formulate a hypothesis to pursue.
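If your environment keeps any kind of change record, the review can start with a simple filter on time. The sketch below assumes a change log held as a list of (timestamp, description) pairs; the function name `changes_before`, the 14-day window, and every entry in the sample data are invented for illustration. A generous window is deliberate, because a change may not show its effects until some time after it was made.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Change = Tuple[datetime, str]  # (when the change was made, what was changed)


def changes_before(changes: List[Change], problem_start: datetime,
                   window: timedelta = timedelta(days=14)) -> List[Change]:
    """Return changes made in the window leading up to the problem, newest first."""
    earliest = problem_start - window
    recent = [c for c in changes if earliest <= c[0] <= problem_start]
    return sorted(recent, key=lambda c: c[0], reverse=True)


# Invented sample data, for illustration only.
change_log = [
    (datetime(2005, 3, 1, 9, 0), "Replaced firewall in front of web server segment"),
    (datetime(2005, 2, 20, 22, 30), "Patched web server (not yet rebooted)"),
    (datetime(2004, 11, 5, 14, 0), "Added a DNS record"),
]

for when, what in changes_before(change_log, datetime(2005, 3, 1, 10, 0)):
    print(when.isoformat(sep=" "), what)
```

The filter only narrows the list of candidates. Deciding which of the remaining changes could plausibly explain the symptoms is still a judgment call.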

Forming a Hypothesis

If you like puzzles, you might enjoy this part of troubleshooting (that is, unless you are under severe pressure to fix the problem). This is where the troubleshooting gurus really shine. Your mission is to combine all your observations, experience, intuition, and prayers to come up with a fix for the problem. You do this by first hypothesizing the cause of the problem and then working to prove it. If you can't guess a specific cause, try the reverse approach. Form a hypothesis and then work to disprove it. This is a good way to collect additional symptoms or other pertinent information.

Let's continue with the example in which our users can't access the website after a firewall upgrade. An obvious hypothesis is that we somehow configured the firewall rule set incorrectly, and it's blocking inbound HTTP access to our web server. The next step is to test that hypothesis.

Testing the Hypothesis

Ideally, you test your hypothesis by implementing a fix for the problem, which is the most direct way to prove or disprove it. However, you might still be working to narrow the possible causes of the problem, in which case a fix is not yet apparent. In that event, you might design and execute a series of tests to gather information until a specific fix presents itself or until you're forced to try a different hypothesis.

Our firewall problem isn't in the "fix-it" stage yet because we're still investigating whether the firewall rule set is the problem. Perhaps the easiest way to test that hypothesis is to look at the firewall configuration and logs to see whether the traffic is being blocked. A quick check shows that the configuration is correct, and the log shows that HTTP traffic is being allowed through.
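Reading a firewall log by eye works for a quick check, but a short script can make the same check repeatable. Log formats differ from product to product, so the sketch below assumes a hypothetical format made of space-separated `key=value` fields (`action`, `dst`, `dport`, and so on). The field names, the function names, and the sample lines are all assumptions, and the addresses come from the ranges reserved for documentation.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_fields(line: str) -> Dict[str, str]:
    """Split a log line into its key=value fields, ignoring other tokens."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def summarize_actions(lines: Iterable[str], server_ip: str, port: int = 80) -> Counter:
    """Count firewall actions (allow, deny, ...) for traffic to server_ip:port."""
    counts: Counter = Counter()
    for line in lines:
        fields = parse_fields(line)
        if fields.get("dst") == server_ip and fields.get("dport") == str(port):
            counts[fields.get("action", "unknown").lower()] += 1
    return counts


# Invented sample lines in the assumed format.
sample_log = [
    "2005-03-01T10:15:02 action=allow proto=tcp src=198.51.100.7 dst=192.0.2.10 dport=80",
    "2005-03-01T10:15:09 action=deny proto=tcp src=198.51.100.9 dst=192.0.2.10 dport=23",
    "2005-03-01T10:16:41 action=allow proto=tcp src=203.0.113.4 dst=192.0.2.10 dport=80",
]

print(summarize_actions(sample_log, "192.0.2.10"))
```

For the sample lines, the summary shows two allowed connections to port 80 and none denied, which is the same pattern the quick check in our example turned up. Seeing deny entries for port 80 instead would have supported the rule-set hypothesis.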

Analyzing the Results

After you have executed your test, the next step is to analyze the results. If your test involved implementing a fix, such as rearranging the firewall's rule set, then all you need to do is check whether the problem is resolved. This process will be much easier if you are able to reproduce the problem. If the problem isn't fixed yet, you will need to analyze the test results to determine what to do next.

We tested our hypothesis that our firewall rule set was incorrect and found that it was not blocking HTTP traffic to the server. We might not have completely disproved the hypothesis, but we should look at other possibilities. We will have to continue troubleshooting the problem to get to its root cause and fix it.

Repeating If Necessary

If the problem is solved, your work is done. Otherwise, you will have to perform another iteration of the process. If your test disproved your hypothesis, you must continue by forming and testing another hypothesis. Otherwise, you will have to design another test whose results you can analyze to further narrow the possible causes. This is the nature of most problems, which aren't often solved by the first pass through the troubleshooting process.
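The branching just described (stop when a test confirms the cause, move to a new hypothesis when one is disproved, and otherwise run another test) can be sketched as a small loop. This is an illustrative model rather than a tool: the three result strings, the function name `troubleshoot`, and the stand-in tests with fixed results are all assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Each test returns one of: "confirmed", "disproved", or "inconclusive".
Test = Callable[[], str]
Hypothesis = Tuple[str, Sequence[Test]]
History = List[Tuple[str, str, str]]


def troubleshoot(hypotheses: Sequence[Hypothesis]) -> Tuple[Optional[str], History]:
    """Work through hypotheses in order.

    Returns the confirmed hypothesis (or None) and a history of every test run.
    """
    history: History = []
    for name, tests in hypotheses:
        for test in tests:
            result = test()
            history.append((name, test.__name__, result))
            if result == "confirmed":
                return name, history  # cause found; stop iterating
            if result == "disproved":
                break  # form and test the next hypothesis
            # "inconclusive": run another test against the same hypothesis
    return None, history


# Stand-in tests with fixed results, purely to exercise the loop.
def first_check() -> str:
    return "inconclusive"


def second_check() -> str:
    return "disproved"


def third_check() -> str:
    return "confirmed"


cause, history = troubleshoot([
    ("hypothesis A", [first_check, second_check]),
    ("hypothesis B", [third_check]),
])
print(cause)
for entry in history:
    print(entry)
```

Keeping the history is the useful part in practice: a record of what was tested and what each test showed saves you from repeating work on the next pass.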

We have completed one iteration for our sample firewall problem without solving it. Maybe you would have started with another hypothesis, or you already have another test in mind for gathering more information. To learn whether you're right, however, you have to read more of the chapter to finish the diagnosis!