22.2. HP Troubleshooting Methodology | HP ProLiant Servers AIS: Official Study Guide and Desk Reference


The high degree of interaction between the server system, options hardware, operating system, and application software can make it difficult to isolate the root cause of a problem. Intermittent problems and problems generated by multiple subsystem malfunctions can be especially difficult to troubleshoot.

HP has developed a six-step troubleshooting methodology, shown in Figure 22-1, to systematically get to the core of a problem, resolve it, and take steps to limit the possibility of it happening again. These six steps are as follows:

1. Collect data.

2. Evaluate the data to determine potential subsystems causing the issue.

3. Develop an optimized action plan.

4. Execute the action plan.

5. Determine whether the problem is solved.

6. Implement preventive measures.

Figure 22-1. HP troubleshooting methodology flowchart.

As part of any change management process, alter only one variable at a time so that the impact of each change is clear. If changing one variable does not resolve the problem or improve performance, revisit the data to determine the next variable to change. Following the HP troubleshooting methodology gives you a standard approach for eliminating possibilities until a solution is found.

This methodology provides a logical framework to troubleshoot system problems and reach problem resolution. A logical framework also provides a consistent and solid foundation for other technicians and system engineers to work from when escalation is necessary.

22.2.1 Troubleshooting Step 1 Collect Data

The first step in troubleshooting a problem is to spend the time and effort needed to gather helpful information. Two skills are involved in this step: (1) asking the right questions, and (2) using the appropriate tools and methods to gather and analyze system data.

22.2.1.1 ASKING THE RIGHT QUESTIONS

As you begin the troubleshooting process, start by asking a series of questions that will help you thoroughly understand the nature of the problem. Here are some of the questions you could ask:

What happens when the system fails?
What specific component is failing and when?
What errors display?
Can you duplicate the failure at will or is it random?
Do you notice anything unusual that you think would be helpful?
What has recently changed?

Combined with experience and logic, the answers to these kinds of questions help you narrow down the possible causes and focus your efforts as you begin collecting additional data on your own.

22.2.1.2 COLLECTING SYSTEM DATA

After you have answered the general questions, begin collecting specific information about the system.

The information you collect should include each of the following items:

The hardware components physically installed in the system
The software installed in the system (including patches, service packs, and other incremental updates)
Specific information about the failure, such as the following:
- Stop/abend/trap messages
- HP Insight Manager error conditions
- Critical error log messages
- Power-On Self-Test (POST) messages

HP provides several tools for viewing system data. Some of these utilities are integrated into the server itself, but all complement each other. These utilities include the following:

Integrated Management Log
Integrated Management Log Viewer
Integrated Management Display
Integrated Management Display Utility
Enhanced Integrated Management Display Service

Each of these utilities is described briefly later in this chapter.

22.2.1.3 SETTING A PERFORMANCE BASELINE FOR WINDOWS SYSTEMS

As part of the information gathering process, you establish a performance baseline.

Important

Establish the baseline before any changes are made to the system.

Windows Server 2003 (and earlier versions) includes a useful utility called Performance. This utility is located under the Administrative Tools icon in the Control Panel. It is used to monitor either local or remote system performance. After the needed counters are chosen, Performance can track and record them. This is useful for real-time monitoring and logging for a baseline. Over time, this data can help identify system bottlenecks.

After the tool is started, you can add various counters to the System Monitor feature to track performance of a local or remote computer. You can choose related counters and specific instances from several performance object categories. Specific instances refer to the ability to choose all or specific processors or page files. (Not all counters have instances to choose from.)

Some counters report averages across read and write requests combined. To gain a more specific view of activity, use the separate counters for read operations and write operations.

Note

To get an explanation of each counter, click Explain from the Add Counters window after clicking the plus sign (+) icon in System Monitor.

Counters can have different scales, so make sure that the scale of one counter does not reduce the readability of the others. To adjust a counter's scale, right-click the counter in the legend area at the bottom of the Performance window, select Properties, and then select Data. On this tab in the System Monitor Properties window, you can adjust the color, scale, width, and style of each counter. Other tabs enable you to adjust the properties of source, graphs, colors, and fonts.

A Web browser provides a convenient way to monitor these counters. After the counters are selected and optimized for readability, they can be saved as an HTML-format file by right-clicking the graph. The file can then be opened in a browser on any computer.
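The value of a baseline comes from comparing later readings against it. The following Python sketch illustrates one way to do that, assuming the counter samples have already been exported from a Performance log into plain lists of numbers. The sample values, the function names, and the three-standard-deviation threshold are illustrative assumptions, not values or tools supplied by HP or Microsoft.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize baseline samples per counter as (mean, standard deviation)."""
    return {name: (mean(values), stdev(values)) for name, values in samples.items()}

def find_deviations(baseline, current, threshold=3.0):
    """Return counters whose current value lies more than `threshold`
    standard deviations away from the baseline mean."""
    flagged = {}
    for name, value in current.items():
        if name not in baseline:
            continue  # no baseline was recorded for this counter
        avg, sd = baseline[name]
        if sd == 0:
            deviates = value != avg
        else:
            deviates = abs(value - avg) / sd > threshold
        if deviates:
            flagged[name] = (value, avg)
    return flagged

# Hypothetical samples recorded before any change was made to the system.
baseline_samples = {
    r"\Processor(_Total)\% Processor Time": [22.0, 25.0, 24.0, 23.0, 26.0],
    r"\Memory\Pages/sec": [10.0, 12.0, 11.0, 9.0, 13.0],
}
baseline = build_baseline(baseline_samples)

# Hypothetical reading taken while the problem is occurring.
current_reading = {
    r"\Processor(_Total)\% Processor Time": 24.5,
    r"\Memory\Pages/sec": 480.0,
}
flagged = find_deviations(baseline, current_reading)
for name, (value, avg) in flagged.items():
    print(f"{name}: current {value}, baseline mean {avg}")
```

With these made-up numbers, only the paging counter is flagged, which would point the investigation toward the memory subsystem rather than the processor.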

22.2.2 Troubleshooting Step 2 Evaluating and Interpreting the Data

After you have gathered the data, the next step is to evaluate and interpret the data to determine which subsystem or subsystems could be causing the problem.

The evaluation and interpretation of the data enables you to do the following:

1. Determine which components could cause what happened.

2. Isolate faults to a hardware or software subsystem.

3. Understand the mode of failure.

After you have determined what is most likely causing the problem to occur, you are ready to move to Step 3.

22.2.3 Troubleshooting Step 3 Develop an Optimized Action Plan

After collecting the facts and isolating the specific mode of failure, your next step in the troubleshooting process is to develop an optimized action plan. The action plan is developed through the following steps:

1. Identify specific root causes for the specified mode of failure.

2. Identify possible solutions for each possible root cause.

3. Rank the possible solutions in a priority order by balancing the time and cost that it will take to implement each solution against the likelihood that the solution will resolve the problem. (It is possible that the initial possible solution will not solve the problem, but it might yield additional helpful information that will help solve the problem.)

4. Identify the steps necessary to implement each solution.

5. Compile all the steps into an optimized action plan by eliminating redundancy and ensuring that only one variable is being manipulated at a time.

6. Incorporate an escalation plan into the master action plan. You should be prepared to escalate the situation for additional technical assistance. The escalation plan should contain a list of whom to contact and the information the escalation recipient would need.
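The ranking described in item 3 can be sketched as a simple scoring exercise. The Python example below is one possible heuristic, not an HP-defined formula: it divides the estimated likelihood of success by the estimated total cost, so that likely, inexpensive fixes are tried first. The `Solution` fields, the hourly rate, and the candidate entries are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    description: str
    likelihood: float   # estimated chance (0..1) that this resolves the problem
    hours: float        # estimated time to implement
    parts_cost: float   # estimated cost of parts, in currency units

def rank_solutions(solutions, hourly_rate=100.0):
    """Order solutions so that likely, low-cost fixes come first."""
    def score(s):
        total_cost = s.hours * hourly_rate + s.parts_cost
        # A free, instant action is always worth trying first.
        return s.likelihood / total_cost if total_cost > 0 else float("inf")
    return sorted(solutions, key=score, reverse=True)

# Hypothetical candidates for a single mode of failure.
candidates = [
    Solution("Replace system board", likelihood=0.5, hours=3.0, parts_cost=900.0),
    Solution("Reseat memory modules", likelihood=0.3, hours=0.5, parts_cost=0.0),
    Solution("Update storage controller firmware", likelihood=0.4, hours=1.0, parts_cost=0.0),
]
plan = rank_solutions(candidates)
for position, solution in enumerate(plan, start=1):
    print(position, solution.description)
```

In this example the system board replacement has the highest likelihood but lands last, because its time and parts cost outweigh that advantage. The estimates themselves still come from the technician's judgment; the sketch only makes the trade-off explicit.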

22.2.4 Troubleshooting Step 4 Execute the Action Plan

In Step 4, you implement the optimized action plan you created in Step 3. It is critical that you carefully observe and record the results of each step. Even if the action plan does not solve the problem, it might provide more clues to solving it.

To execute the action plan, carry out each step in order, implementing only one solution (that is, modifying only one variable) at a time.

As you implement each step, observe and record its results, including any error messages or changes in functionality.
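The discipline of changing one variable at a time and recording every result can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `execute_plan` function, the `StepResult` record, and the simulated system state are hypothetical stand-ins for the real changes and re-tests a technician would perform.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    description: str
    resolved: bool
    notes: str

def execute_plan(steps, problem_resolved):
    """Apply one change at a time and record the outcome of each.

    `steps` is a list of (description, apply_change) pairs and
    `problem_resolved` is a callable that re-tests the system.
    """
    log = []
    for description, apply_change in steps:
        notes = apply_change()         # make exactly one change
        resolved = problem_resolved()  # re-test before changing anything else
        log.append(StepResult(description, resolved, notes))
        if resolved:
            break  # stop so the fix can be attributed to a single change
    return log

# Simulated system: in this made-up case, only the firmware update helps.
state = {"memory_reseated": False, "firmware_updated": False}

def reseat_memory():
    state["memory_reseated"] = True
    return "Reseated modules; same error on next boot"

def update_firmware():
    state["firmware_updated"] = True
    return "Firmware updated; no errors logged"

log = execute_plan(
    [
        ("Reseat memory modules", reseat_memory),
        ("Update controller firmware", update_firmware),
        ("Replace system board", lambda: "not reached"),
    ],
    problem_resolved=lambda: state["firmware_updated"],
)
for entry in log:
    print(entry)
```

The log keeps the unsuccessful first step as well as the successful second one, so the record remains useful if the problem returns or has to be escalated.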

22.2.5 Troubleshooting Step 5 Determine Whether the Problem Is Solved

Step 5 is to evaluate the results of each step until the problem has been isolated and resolved. If the problem is not resolved, you cycle back through the troubleshooting methodology by doing the following:

Collecting more data
Utilizing the information gathered from implementation of the action plan
Evaluating the information
Developing another optimized action plan
Implementing the optimized action plan
Repeating as necessary, escalating when appropriate
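The cycle above can be pictured as a loop with an exit to escalation. The Python sketch below is an illustration only: the four callables stand in for Steps 1 through 4, and the `max_cycles` limit is an assumed stand-in for the judgment of when escalation is appropriate, which the methodology leaves to the technician.

```python
def troubleshoot(collect, evaluate, plan, execute, max_cycles=3):
    """Repeat collect -> evaluate -> plan -> execute until the problem is
    resolved, and signal escalation once `max_cycles` passes have failed."""
    findings = []  # results carried forward from earlier action plans
    for cycle in range(1, max_cycles + 1):
        data = collect(findings)      # new data plus what earlier plans revealed
        suspects = evaluate(data)
        action_plan = plan(suspects)
        resolved, results = execute(action_plan)
        findings.extend(results)
        if resolved:
            return "resolved", cycle, findings
    return "escalate", max_cycles, findings

# Hypothetical stubs: the first action plan fails, the second succeeds.
attempts = iter([
    (False, ["reseat: no change"]),
    (True, ["firmware update: fixed"]),
])
outcome = troubleshoot(
    collect=lambda findings: {"previous": list(findings)},
    evaluate=lambda data: ["storage subsystem"],
    plan=lambda suspects: [f"check {s}" for s in suspects],
    execute=lambda action_plan: next(attempts),
)
print(outcome)
```

Passing the accumulated findings back into the collection step mirrors the second item in the list: information gathered while executing one action plan feeds the next pass.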

22.2.6 Troubleshooting Step 6 Implement Preventive Measures

As soon as the problem is resolved, implement the preventive measures needed to keep it from recurring, to the extent possible. You should also look for opportunities to improve or increase system availability.

To implement preventive measures, follow these steps:

1. Determine the root cause of the problem.

2. Determine proactive steps that can prevent the problem from recurring.

3. Devise a system test to verify changes and procedures before implementing them into production.

4. Implement a new set of procedures, software, and administrative maintenance to attain a higher level of availability.

5. Perform preventive maintenance, including checking for loose cables, reseating boards, and checking for proper airflow.

6. Add fault-tolerant elements to critical subsystems, where applicable.

The HP troubleshooting methodology provides you with a structured approach to solving problems in an efficient manner.
