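One of the key elements of troubleshooting a network problem is having a plan of action. Many of the trouble calls you will receive are likely to be user issues involving things like the improper use of software. When you're faced with what appears to be a real problem, you should follow a set troubleshooting procedure, which should consist of a series of steps like the following:
1. Establish the symptoms.
2. Identify the affected area.
3. Establish what has changed.
4. Select the most probable cause.
5. Implement a solution.
6. Test the result.
7. Recognize the potential effects of the solution.
8. Document the solution and explain it to the user.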
The steps you follow can be slightly different, or you can perform the steps in a slightly different order, but the overall process should be similar. The following sections examine each of these steps.
The first step in troubleshooting a network problem is to determine exactly what is going wrong, and to note the effect of the problem on the network so that you can assign a priority to the problem. In a large network environment, there are often many more calls for support than the network support staff can handle at one time. Therefore, it is essential to establish a system of priorities that dictates which calls get addressed first. As in the emergency department of a hospital, the priorities should not necessarily be based on who is first in line. More often, it is the severity of the problem that determines who gets attention first, although it is usually not wise to ignore the political reality that senior management problems get addressed before those of the rank and file.
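A severity-first queue of this kind can be sketched in a few lines of Python. This is only an illustration: the three severity levels, the `Call` class, and the helper names are assumptions made for the example, not part of any particular help desk system. Calls of equal severity are still handled in the order they arrived.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical severity scale (an assumption): lower number = more urgent.
SEVERITY = {"network-wide": 0, "department": 1, "single-user": 2}

_arrival = count()  # tie-breaker so equal-severity calls stay in arrival order


@dataclass(order=True)
class Call:
    severity: int
    arrival: int
    description: str = field(compare=False)


def add_call(queue, scope, description):
    """Queue a support call, ranked by severity and then by arrival order."""
    heapq.heappush(queue, Call(SEVERITY[scope], next(_arrival), description))


def next_call(queue):
    """Return the most urgent outstanding call."""
    return heapq.heappop(queue)


queue = []
add_call(queue, "single-user", "Cannot print from word processor")
add_call(queue, "network-wide", "File server unreachable")
add_call(queue, "department", "Accounting LAN is slow")
order = [next_call(queue).description for _ in range(3)]
```

Here the file server outage is handled first even though it was reported second, because it affects the most users.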
The following rules can help you to establish priorities:
Sometimes it's difficult to determine the exact nature of the problem from the story told by a relatively inexperienced user, but part of the process of narrowing down the cause of a particular problem involves obtaining accurate information about what has occurred. Users are often vague about what they were doing when they experienced the problem, or even what the indications of the problem were. For example, in many cases, users call the help desk because they received an error message, but they neglect to write down the wording of the message. Gentle training of users in the proper procedures for documenting and reporting problems is part of the network technician's job as well. It might not be any help to you now, but it can help you the next time a user receives an error.
For now, you can begin by asking questions like the following:
The next step in assessing the nature of the problem is to see if it can be duplicated. Network problems that you can easily duplicate are far easier to fix, primarily because you can easily test to see if your solution was successful. However, there are many types of network problems that are intermittent, or that might occur for only a short period of time. In these cases, you might have to leave the incident open until the problem occurs again. In some instances, having the user reproduce the problem can lead to the solution. User error is a common cause of problems that can seem to be hardware- or network-related to the inexperienced eye.
When you've determined that the problem can be duplicated, you can set about determining the actual source of the difficulty. If, for example, a user has trouble opening a file in a word processing application, the difficulty might lie in the application, in the user's computer, in the file server where the file is stored, or in any of the networking components in between. The process of isolating the location of the problem consists of eliminating the elements that are not the cause, in a logical and methodical manner.
If it's possible to duplicate the problem, you can begin to isolate the cause by reproducing the conditions under which the problem occurred, using a procedure like the following:
If you determine that the problem lies somewhere in the network and not in the user's computer, you can then begin the process of isolating the area of the network that is the source of the problem. For example, if you are able to reproduce the problem on another nearby computer, you can then begin performing the same task on computers located elsewhere on the network. Again, proceed methodically and document your results. For example, you can try to reproduce the problem on another computer connected to the same hub, and then on a computer connected to a different hub on the same LAN. If the problem occurs throughout the LAN, try a computer on a different LAN. Eventually, you should be able to narrow down the source of the problem to a particular component, such as a server, router, hub, or cable.
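The widening search described above can be recorded and summarized with a small helper. The following is a minimal sketch, assuming you note by hand whether the problem could be reproduced at each location; the scope names and the `widest_scope` function are hypothetical and exist only for this example.

```python
# Test locations ordered from nearest to farthest from the user's computer,
# following the progression described in the text.
SCOPES = ["same hub", "different hub, same LAN", "different LAN"]


def widest_scope(reproduced):
    """Return the farthest location at which the problem still occurs.

    `reproduced` maps a scope name to True if the problem was reproduced
    on a computer at that location. Untested scopes count as False.
    Returns None if the problem was not reproduced at any location.
    """
    widest = None
    for scope in SCOPES:
        if not reproduced.get(scope, False):
            break  # stop at the first location where the problem disappears
        widest = scope
    return widest


results = {
    "same hub": True,
    "different hub, same LAN": True,
    "different LAN": False,
}
affected = widest_scope(results)
```

In this example the problem occurs across hubs but not on another LAN, which suggests looking at components shared by the whole LAN rather than at a single hub or cable. Interpreting the result is still the technician's job; the helper only keeps the test results organized.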
When a computer or other network component that used to work properly now does not, it stands to reason that some change has occurred. When a user reports a problem, one of the most important pieces of information the network troubleshooter can gather is how the computing environment changed immediately prior to the malfunction. Unfortunately, getting this information from the user can often be difficult. The response to the question "Has anything changed on the computer recently?" is nearly always "No," and it's only some time later that the user remembers to mention that a major hardware or software upgrade was performed just prior to the problem occurrence. On a network with properly established maintenance and documentation procedures, it should be possible to determine if any upgrades or modifications to the user's computer have been made recently. Official records are the first place you should look for information like this.
Major changes, such as the installation of new hardware or software, are obvious possible causes of the problem that is occurring, but the network troubleshooter must be conscious of causes evidenced in more subtle changes as well. An increase in network traffic levels, for example, as disclosed by a protocol analyzer, can be a contributing cause of a reduction in network performance. Occasional problems noticed by several users of the same application, cable segment, or LAN can indicate the existence of a fault in a component of the network. Tracking down the source of a networking problem can often be a form of detective work, and learning to "interrogate" your "suspects" properly can be an important part of the troubleshooting process.
For more information about error messages and other indicators used to troubleshoot network problems, see Lesson 2: Logs and Indicators, in Chapter 18, "Network Troubleshooting Tools."
There's an old medical school axiom that says when you hear hoofbeats, think horses, not zebras. In the context of network troubleshooting, this means that when you look for possible causes of a problem, start with the obvious first. For example, if a workstation is unable to communicate with a file server, don't start by checking the routers between the two systems; check the simple things on the workstation first, such as whether the network cable is plugged into the computer. The other important part of the process is to work methodically and document everything you check, so that you don't duplicate your efforts.
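The idea of checking the obvious first while documenting each check can be sketched as an ordered checklist. This is an illustrative outline only: `run_checks` and the two stand-in checks are hypothetical, and real checks would inspect actual equipment rather than return fixed values.

```python
def run_checks(checks):
    """Run checks in the given order, simplest first, and log every result.

    `checks` is a list of (description, function) pairs; each function
    returns True when that item is working. Stops at the first failure.
    """
    log = []
    for description, check in checks:
        ok = check()
        log.append((description, "ok" if ok else "FAILED"))
        if not ok:
            break  # a likely cause was found; no need to check further yet
    return log


# Hypothetical stand-ins for real checks, listed from simplest to most complex.
checks = [
    ("network cable plugged into workstation", lambda: False),
    ("routers between workstation and file server", lambda: True),
]
log = run_checks(checks)
```

Because the unplugged cable is found first, the routers are never examined, and the log shows exactly what was checked.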
After you have isolated the problem to a particular piece of equipment, you can try to determine whether it is caused by hardware or software. If it's a hardware problem, you might then proceed by replacing the unit that is at fault or by using an alternate. Communication problems, for example, might force you to try replacing network cables until you find one that is faulty. If the problem is in a server, you might need to replace components, such as hard drives, until you find the culprit. If you determine that the problem is caused by software, you might want to try running an application or storing data on a different computer, or reinstalling the software on the offending system.
In some cases, the process of isolating the source of a problem includes the resolution of the problem. If, for example, you end up replacing network patch cables until you find the one that is faulty, replacing the bad cable is the resolution of the problem. In other cases, however, the resolution might be more involved, such as having to reinstall a server application or operating system. Because other users might need to access that server, you might have to defer the resolution of the problem until a later time, when the network is not in use and after you've backed up the data stored on the server. In some cases, you might even have to bring in outside help, such as a contractor to pull new cables. This can require careful scheduling to keep the contractor's work from conflicting with your activities and those of your users. Sometimes, you might want to provide an interim solution, such as a substitute workstation or server, until you can definitively resolve the problem.
When you have implemented your resolution to the problem, you should return to the very beginning of the process and repeat the task that originally caused the problem. If the problem no longer occurs, you should test the other functions related to the changes you've made to ensure that in fixing one problem, you haven't created another. It is at this point that the time you spent documenting the troubleshooting process becomes worthwhile. You should repeat the procedures you used to duplicate the problem exactly, to ensure that the problem the user originally experienced has been completely eliminated, and not just temporarily masked. If the problem was intermittent to begin with, it may take some time to ascertain whether your solution has been effective. You might need to check with the user several times to make sure that the problem is not recurring.
It is important, throughout the troubleshooting process, to keep an eye on the big network picture, and not let yourself become too involved in the problems experienced by one user (or application, or LAN). It is sometimes possible, while implementing a solution to one problem, to create another that is more severe or that affects more users. For example, if users on one LAN are experiencing high traffic levels that diminish their workstation performance, you might be able to remedy the problem by connecting some of their computers to a different LAN. However, although this solution might help the users originally experiencing the problem, you might overload another LAN in the process, causing another problem that is more severe than the first one. You might want to consider a more far-reaching solution instead, such as creating an entirely new LAN and moving some of the affected users over to it.
Although it is presented here as a separate step, the process of documenting your actions should begin as soon as the user calls for help. A well-organized network support organization should have a system in place in which each problem call is registered as a trouble ticket that eventually contains a complete record of the problem and the steps taken to isolate and resolve it. In many cases, a technical support organization operates using tiers, which are groups of technicians of different skill levels. Calls come in to the first tier, and if the problem is sufficiently complex or the first-tier technician is unable to resolve it, the call is escalated to the second tier, which is composed of senior technicians. As long as everyone involved in the process documents his or her activities, there should be no problem when one technician hands off the ticket to another. In addition, keeping careful notes prevents people from duplicating each other's efforts.
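A trouble ticket that carries its history from tier to tier might be modeled as follows. This is a minimal sketch under stated assumptions: the `TroubleTicket` class, its field names, and the sample user and technician names are all hypothetical and do not describe any particular ticketing product.

```python
from dataclasses import dataclass, field


@dataclass
class TroubleTicket:
    user: str
    problem: str
    tier: int = 1  # calls come in to the first tier
    notes: list = field(default_factory=list)
    resolved: bool = False

    def log(self, technician, action):
        """Record what a technician did so the next one need not repeat it."""
        self.notes.append((self.tier, technician, action))

    def escalate(self, technician, reason):
        """Hand the ticket to the next tier, keeping the full history."""
        self.log(technician, f"escalated: {reason}")
        self.tier += 1

    def resolve(self, technician, resolution):
        """Close the ticket with a note describing the fix."""
        self.log(technician, f"resolved: {resolution}")
        self.resolved = True


ticket = TroubleTicket("jsmith", "Cannot open file stored on file server")
ticket.log("tier1-tech", "reproduced problem on user's computer")
ticket.escalate("tier1-tech", "problem also occurs on nearby computers")
ticket.log("tier2-tech", "traced fault to a patch cable")
ticket.resolve("tier2-tech", "replaced faulty patch cable")
```

Each note stores the tier at which it was written, so the second-tier technician can see what the first tier already tried before starting work.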
The final phase of the troubleshooting process is to explain to the user what happened and why. Of course, the average network user is probably not interested in hearing all the technical details, but it's a good idea to let users know whether their actions caused the problem, exacerbated it, or made it more difficult to resolve. This gradual education of the network's users can lead to a quicker resolution next time, or even prevent a problem from occurring altogether.
Place the following steps of the problem isolation process in the proper logical order: