This lesson aims to give you the tools you need to become a skilled network problem solver. If you approach a network problem with a plan of action, the cause and resolution will be easier to find. Here, you will learn to apply a structured approach to divide a network into functional units and then identify the problem.
After this lesson, you will be able to:
- Prepare a plan of action to research a network problem.
- Conduct the research needed to isolate the problem.
- Use a structured approach to identify a network problem.
- Identify the severity of a network problem based on its initial symptoms.
Estimated lesson time: 30 minutes
Troubleshooting is perhaps the most difficult task that computer professionals face. On top of the need to find the cause of a network problem comes the pressure to find it as quickly as possible. Computers never seem to fail at a convenient time. Failures occur in the middle of a job or just before a deadline, and the pressure to fix the problem immediately is intense.
After a problem has been diagnosed, locating resources and following the procedures required to correct the problem are straightforward. But before that diagnosis occurs, it is essential to isolate the true cause of the problem from irrelevant factors.
Troubleshooting is more of an art form than an exact science. However, to be efficient and effective as a troubleshooter, you must approach the problem in an organized and methodical manner. Remember that you are looking for the cause of a problem, not its symptoms; problems as originally reported are frequently just symptoms and not the true cause. As a troubleshooter, you need to learn to quickly and confidently eliminate as many alternative causes as possible. This will allow you to focus on the things that might be the cause of the problem. To do this, you must take a systematic approach.
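The process of troubleshooting a computer network problem can be divided into five steps: defining the problem, isolating the cause, planning and making the repair, confirming the results, and documenting the outcome.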
The first phase is the most critical, yet most often ignored. Without a complete understanding of the entire problem, you can spend a great deal of time working on the symptoms, without getting to the cause. The only tools required for this phase are a pad of paper, a pen (or pencil), and good listening skills.
The client or network user is your best source of information, so listen carefully. Remember that while you might know how the network functions and be able to find the technical cause of the failure, those operating the network on a daily basis were there before and after the problem started and probably recall the events that led up to the failure. By drawing on their experience with the problem, you can get a head start on narrowing down the possible causes. To help identify the problem, list the sequence of events, as they occurred, before the failure. You might want to create a form with these questions (and others specific to the situation) to help organize your notes.
Some general questions to ask might include:
Users—even those with little or no technical background—can be helpful in collecting information if they are questioned effectively. Ask users what the network is doing or not doing that makes them think it's not functioning correctly. User observations that can be clues to the underlying cause of a network problem include the following:
As you continue to ask questions, you can begin to narrow your focus, as the following list illustrates:
- If only one user has a problem, the user's workstation is probably the cause.
- Intermittent symptoms are often a sign of failing hardware.
- Any change in operating system software can cause new problems.
- If only one application causes problems, focus on the application.
- If a similar problem occurred in the past, there might be a documented solution.
- Increased traffic can cause logon and processing delays.
- Verify that any new network equipment has been correctly configured.
- Installation and training issues can cause application problems.
- Equipment that has been moved might not be reconnected to the network.
- Some vendors offer telephone, online, or onsite support.
- There might be a documented solution on the vendor's Web site.
- Check for documented repairs and ask coworkers about attempted repairs.
The next step is to isolate the problem. Begin by eliminating the most obvious problems and work toward the more complex and obscure. Your purpose is to narrow your search down to one or two general categories.
Be sure to observe the failure yourself. If possible, have someone demonstrate the failure to you. If it is an operator-induced problem, it is important to observe how it is created, as well as the results.
The most difficult problems to isolate are those that are intermittent and never seem to occur when you are present. The only way to resolve these is to re-create the set of circumstances that cause the failure. Sometimes, eliminating causes that are not the problem is the best you can do. This process takes time and patience. The user also needs to keep detailed records of what is being done before and when the failure occurs. It can help to tell the user to refrain from doing anything with the computer when the problem recurs, except to call you. That way, the "evidence" won't be disturbed.
While the information collected provides the foundation for isolating the problem, the administrator should also compare current network behavior with documented baseline information. In Chapter 12, "Administering Change," we learned how to document a network by creating a baseline. Now it is time to put that knowledge to work. Rerun the tests under the same conditions that prevailed when you created the baseline, and then compare the two sets of results. Any differences between the two can indicate the source of the problem.
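The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names, the sample values, and the 20 percent tolerance are hypothetical and do not come from this lesson; a real baseline would use whatever measurements you recorded when you documented the network.

```python
def compare_to_baseline(baseline, current, tolerance=0.20):
    """Return the metrics whose current value differs from the baseline
    by more than `tolerance` (a fraction of the baseline value)."""
    deviations = {}
    for metric, base_value in baseline.items():
        if metric not in current:
            continue  # no current measurement to compare against
        if base_value == 0:
            # Any nonzero reading is a change when the baseline was zero.
            ratio = float("inf") if current[metric] != 0 else 0.0
        else:
            ratio = abs(current[metric] - base_value) / abs(base_value)
        if ratio > tolerance:
            deviations[metric] = (base_value, current[metric])
    return deviations


# Hypothetical measurements, taken under the same conditions both times.
baseline = {"avg_utilization_pct": 20.0, "collisions_per_sec": 5.0, "logon_time_sec": 4.0}
current = {"avg_utilization_pct": 22.0, "collisions_per_sec": 40.0, "logon_time_sec": 15.0}

flagged = compare_to_baseline(baseline, current)
for metric, (before, now) in flagged.items():
    print(f"{metric}: baseline {before}, now {now}")
```

With these sample numbers, utilization stays within tolerance, while the collision rate and logon time are flagged as areas to investigate first.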
Information gathering involves scanning the network and looking for an obvious cause of the problem. A quick scan should include a review of the documented history of the network to determine if the problem has occurred before and, if so, whether there is a recorded solution.
After you have narrowed your search down to a few categories, the final process of elimination begins.
Create a planned approach to isolating the problem based on your knowledge at this point. Start with the most obvious or easiest possible cause to eliminate, and continue toward the more difficult and complex. It is important to record each step of the process; document every action and its results.
After you have created your plan, it is important to follow it through as designed. Jumping ahead and randomly trying things out of order can often lead to problems. If the first plan is not successful (always a possibility), create a new plan based on what you discovered with the previous plan. Be sure to refer to, reexamine, and reassess any assumptions you might have made in the previous plan.
After you have located the problem, either repair the defect or replace the defective component. If the problem is software-based, be sure to record the "before" and "after" changes.
No repair is complete without confirmation that the job has been successfully concluded. You need to make sure that the problem no longer exists. Ask the user to test the solution and confirm the results. Be sure to confirm not only that the problem is fixed, but also that what you have done has not generated new problems or had a negative impact on any other aspect of the network.
Finally, document the problem and the repair. Recording what you've learned will provide you with invaluable information. There is no substitute for experience in troubleshooting, and each new problem presents you with an opportunity to expand that experience. Keeping a copy of the repair procedure in your technical library can be useful when the problem (or one like it) occurs again. Documenting the troubleshooting process is one way to build, retain, and share experience.
Remember that any changes you have made might have affected the baseline. You might need to update the network baseline in anticipation of future problems and needs.
If the initial review of network statistics and symptoms does not expose an obvious problem, dividing the network into smaller parts to isolate the cause is the next step in the troubleshooting process. The first question to ask is whether the problem stems from the hardware or the software. If the problem appears to be hardware-based, start by looking at only one segment of the network, and then at only one type of hardware.
Check the hardware and network components including:
Often, isolating or removing a portion of the network will help to get the rest of the network up and operational again. If removing a portion solved the problem for the rest of the network, the search for the problem can be focused on the part that was removed.
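The idea of removing one portion of the network at a time can be expressed as a simple loop. The sketch below is hypothetical: the segment names are invented, and `network_healthy` is a stand-in for whatever manual or automated check you use to decide whether the remaining network works.

```python
def find_faulty_segment(segments, network_healthy):
    """Remove one segment at a time and return the first segment whose
    removal lets the rest of the network operate normally."""
    for candidate in segments:
        remaining = [s for s in segments if s != candidate]
        if network_healthy(remaining):
            return candidate  # focus the search on this segment
    return None  # no single segment explains the failure


# Simulated network in which segment-C contains the fault (assumption).
segments = ["segment-A", "segment-B", "segment-C", "segment-D"]

def network_healthy(active_segments):
    # Stand-in for a real check, such as users confirming they can log on.
    return "segment-C" not in active_segments

suspect = find_faulty_segment(segments, network_healthy)
print("Focus the search on:", suspect)
```

If the function returns `None`, the fault probably does not sit in a single segment, and the hardware-versus-software question is worth revisiting.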
Network protocols require special attention because they are designed to work around network faults, which can hide the underlying problem. Most protocols use what's known as "retry logic," in which the software attempts an automatic recovery from a problem. This becomes noticeable as slow network performance while the network makes new and repeated attempts to perform correctly. Failing hardware devices, such as hard drives and controllers, can behave in a similar way, repeatedly interrupting the CPU for more processing time to complete their tasks.
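A minimal sketch of retry logic shows why it can hide a fault. The functions below are hypothetical and do not model any particular protocol; `make_flaky_link` simply simulates a link that drops a fixed number of transmissions before one gets through.

```python
def send_with_retries(send_once, max_attempts=5):
    """Call send_once() until it succeeds or attempts run out.
    Returns (succeeded, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        if send_once():
            return True, attempt
    return False, max_attempts


def make_flaky_link(failures_before_success):
    """Simulate a link that fails a fixed number of times, then works."""
    state = {"remaining": failures_before_success}

    def send_once():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return False  # transmission lost
        return True  # transmission delivered

    return send_once


healthy = send_with_retries(make_flaky_link(0))
degraded = send_with_retries(make_flaky_link(3))
print("healthy link:", healthy)
print("degraded link:", degraded)
```

Both transfers succeed from the user's point of view. Only the attempt count, which the user experiences as delay, reveals that the second link has a problem.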
When you are assessing hardware performance problems, use the information obtained from the hardware baselines to compare against the current symptoms and performance.
After you have gathered the information, rank the list of possible causes in order, beginning with the most likely and moving to the least likely cause of the problem. Then select the most likely candidate from the list, test it, and see whether it is the problem. Start from the most obvious and work to the most difficult. For example, if you suspect that a faulty network interface card (NIC) in one of the computers is the cause of the trouble, replace it with a NIC that is known to be in good working order.
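Ranking and testing candidates in order, while recording each result, might look like the following sketch. The candidate descriptions and likelihood estimates are invented for illustration, and each `test` callable is a placeholder for a real action such as swapping in a known-good NIC.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    description: str
    likelihood: float          # your own estimate, from 0.0 to 1.0
    test: Callable[[], bool]   # returns True if this is confirmed as the cause


def find_cause(candidates):
    """Test candidates from most to least likely and log every result."""
    log = []
    ranked = sorted(candidates, key=lambda c: c.likelihood, reverse=True)
    for candidate in ranked:
        confirmed = candidate.test()
        log.append((candidate.description, confirmed))
        if confirmed:
            return candidate, log
    return None, log


# Hypothetical candidates; the lambdas stand in for real tests.
candidates = [
    Candidate("Faulty NIC", 0.3, lambda: True),
    Candidate("Loose patch cable", 0.6, lambda: False),
    Candidate("Misconfigured driver", 0.1, lambda: False),
]

cause, log = find_cause(candidates)
for description, confirmed in log:
    print(f"{description}: {'confirmed' if confirmed else 'eliminated'}")
```

The log doubles as the record of each action and its result, which is useful if the first plan fails and a new one has to be drawn up.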
A fundamental element in network problem solving is setting priorities. Everyone wants his or her computer fixed first, so setting priorities is not an easy job. While the simplest approach is to prioritize on a "first come, first served" basis, this does not always work, because some failures are more critical to resolve than others. Therefore, the initial step is to assess the problem's impact on the ability to maintain operations. For example, a monitor that has been gradually getting fuzzy over several days has a lower priority than the inability to access the payroll file server prior to a check run.
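One way to express impact-based priorities is as a sort key that falls back to "first come, first served" only when two problems have equal impact. The fields and the sample numbers below are assumptions made for illustration; your own criteria for impact may differ.

```python
from dataclasses import dataclass


@dataclass
class Report:
    summary: str
    users_affected: int
    blocks_critical_operation: bool
    order_received: int  # 1 = reported first


def priority_key(report):
    """Critical operations first, then more users, then arrival order."""
    return (
        not report.blocks_critical_operation,  # False sorts before True
        -report.users_affected,
        report.order_received,
    )


# Hypothetical problem reports.
reports = [
    Report("Monitor gradually getting fuzzy", 1, False, 1),
    Report("Cannot reach payroll file server before check run", 12, True, 2),
    Report("Printer offline in shipping", 4, False, 3),
]

queue = sorted(reports, key=priority_key)
for position, report in enumerate(queue, start=1):
    print(position, report.summary)
```

Here the payroll server problem moves to the front of the queue even though it was reported second, and the fuzzy monitor drops to last.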
Given the following scenario, describe how you would research, identify, prioritize, and resolve this network problem:
The network has been running well at the site of a small manufacturer. However, a user in the quality control division now calls to report that she is unable to get the daily status reports printed by the printer in the department. Meanwhile, the shipping department reports that a rerouted print job did not print in the quality control department. What is your strategy for solving this network problem?
The following points summarize the main elements of this lesson: