Troubleshooting Cluster Problems


Troubleshooting cluster-related issues is not for the faint of heart or for the beginner. It requires a lot of fortitude, persistence, experience, and a support contract with Microsoft Technical Support. The problem is that clustering is very complex and involves your node hardware, shared array, hardware drivers, operating system, clustering services, and SQL Server 2005. Any problem you are having could be caused by any one of them, and identifying the exact cause of a problem is often difficult.

Another reason cluster troubleshooting is difficult is that the feedback you get, in the form of messages or logs, is not always accurate or complete, assuming you get any feedback at all. And when you do get feedback, the resources available to help you identify and remedy the problem are minimal.

Because of all of this, if you have a cluster, you should plan on purchasing Microsoft Technical Support for your cluster. This is a very good investment, and one that will pay for itself. We have used Microsoft Technical Support many times, and in most cases, they have been able to help. You don't need to automatically call support as soon as you have a problem; you should always try to identify and resolve problems if you can. But at some point, especially if your cluster is down and you need help getting it back up, you need to be able to recognize when you can't resolve the problem by yourself and when you need help from Microsoft.

In this section we have included some general advice to get you started on how to identify and resolve cluster-related problems.

How to Approach Clustering Troubleshooting

Throughout this chapter's discussion of installing clustering, we have emphasized again and again the importance of performing a task, testing it, and proceeding to the next step only if everything is working. The reason for this approach is to help you identify the cause of a problem as soon as possible after it appears. For example, if things are working correctly, and you then perform a task and test it, and the test fails, you can reasonably assume that what you just did is directly or indirectly responsible, which makes the problem much easier to identify. If you don't test regularly and don't notice a problem until after many tasks have been performed, identifying the cause is much more difficult. In essence, the best way to troubleshoot problems is to test incrementally. It also helps to have a detailed installation plan to follow, so that you can be sure you are performing all the necessary steps (including testing at appropriate places).
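The perform-a-task-then-test approach can be sketched in a few lines of Python. This is only an illustration, not a tool from the chapter: the name run_plan and the step names in the example are hypothetical, and the lambdas stand in for real installation tasks and real tests.

```python
from typing import Callable, List, Optional, Tuple

# Each step pairs an action with a check that verifies the action worked.
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_plan(steps: List[Step]) -> Optional[str]:
    """Run steps in order, testing after each one.

    Returns the name of the first step whose check fails, or None if
    every step passed. Stopping at the first failure keeps the list of
    suspects down to the step that was just performed.
    """
    for name, action, check in steps:
        action()
        if not check():
            return name
    return None


if __name__ == "__main__":
    # Hypothetical stand-ins for real installation tasks and tests.
    state = {"disk": False, "service": False}
    plan: List[Step] = [
        ("configure shared disk", lambda: state.update(disk=True), lambda: state["disk"]),
        ("install cluster service", lambda: None, lambda: state["service"]),
        ("install SQL Server", lambda: None, lambda: True),
    ]
    print("first failing step:", run_plan(plan))
```

Because the run stops at the first failed check, later steps never execute on top of a broken foundation, which is the point of testing incrementally.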

Do It Right the First Time

You can save yourself a lot of troubleshooting by preventing problems in the first place. Here's how.

  • Double-check that all of the hardware for your nodes and shared array is on Microsoft's cluster compatibility list.

  • Be sure that you are using the latest hardware and software drivers and service packs.

  • Create a detailed installation plan that you can use as your guide for the installation and for disaster recovery should the need arise.

  • Learn as much as you can about clustering before you begin your installation. Many cluster problems are user-created because the person responsible guessed instead of knowing for sure what they were doing.

Gathering Information

To help identify the cause of a problem you often need lots of information. Unfortunately, the information you need may not exist, or it may be scattered about in many different locations, or it may be downright misleading. In any event, to troubleshoot problems, you have to find as much information as you can.

Here are some of the resources you can use to gather information to troubleshoot a cluster problem.

  • Know what is supposed to happen. If you expect a specific result, and you are not getting it, be sure that you fully understand what is supposed to happen, and exactly what is happening. In other words, know the difference between the two.

  • Know what happened directly before the problem occurred. This is much easier if you test often, as described earlier.

  • Is the problem repeatable? Not all problems can be easily repeated, but if they can, this is useful information.

  • When some problems occur, error messages appear on the screen. Be sure to take screenshots of any messages for reference. Some DBAs have the habit of clicking OK after an error message without recording its exact content. Often, the exact content of a message is helpful if you need to search the Internet to learn more about it.

  • There are a variety of logs you may be able to view, depending on how far along you are in the cluster setup process. These include the three operating system event logs (Application, Security, and System); the cluster log (located at c:\windows\cluster\cluster.log); the SQL Server Setup log files (located at %ProgramFiles%\Microsoft SQL Server\90\Setup Bootstrap\LOG\Summary.txt); and the SQL Server 2005 log files.

  • If the error messages you identify aren't obvious (are they ever?), search for them on the Internet, including newsgroups.

The more information you can gather about a problem, the better positioned you are to resolve it.
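Because the logs listed above live in different places, it can help to copy them into one folder as soon as a problem appears. The following Python sketch shows one way to do that, using the two file paths quoted in the text. The function names collect_logs and find_lines and the folder name cluster_snapshot are hypothetical, and the keywords you search for are your own choice; nothing here assumes a particular log format.

```python
import os
import shutil
from typing import Dict, Iterable, List

# Locations quoted in the text; adjust them for your own installation.
DEFAULT_LOGS = [
    r"c:\windows\cluster\cluster.log",
    r"%ProgramFiles%\Microsoft SQL Server\90\Setup Bootstrap\LOG\Summary.txt",
]


def collect_logs(paths: Iterable[str], dest_dir: str) -> Dict[str, List[str]]:
    """Copy each log that exists into dest_dir; report copied and missing.

    Files that share a file name would overwrite each other in dest_dir.
    """
    os.makedirs(dest_dir, exist_ok=True)
    result: Dict[str, List[str]] = {"copied": [], "missing": []}
    for raw in paths:
        path = os.path.expandvars(raw)  # expands %ProgramFiles% on Windows
        if os.path.isfile(path):
            shutil.copy2(path, dest_dir)
            result["copied"].append(path)
        else:
            result["missing"].append(path)
    return result


def find_lines(path: str, keywords: Iterable[str]) -> List[str]:
    """Return the lines of a log that contain any keyword, ignoring case."""
    wanted = [k.lower() for k in keywords]
    hits = []
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if any(k in line.lower() for k in wanted):
                hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # "cluster_snapshot" is a hypothetical folder name for the copies.
    print(collect_logs(DEFAULT_LOGS, "cluster_snapshot"))
```

The list of missing files is useful in its own right: a log that does not exist yet tells you how far the setup process got before it failed.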

Resolving Problems

Sometimes a problem is obvious and the solution is obvious. If that's the case, you are lucky.

In many other cases, the problem may or may not be obvious, but the solution is not. In these cases, we have found that instead of wasting a lot of time trying to identify and fix the problem, the easier and quicker solution is to reinstall your cluster from scratch.

Many cluster problems are due to the complexity of the software involved, and we have discovered that it is often much faster to just rebuild the cluster from scratch, including the operating system. This is especially true if you have tried to install cluster services or SQL Server 2005 clustering and the setup process aborted during setup and did not uninstall itself cleanly.

When you are building a new cluster, rebuilding it to resolve problems is usually an option because time is not an issue. But what if you have a SQL Server 2005 cluster in production and it fails so badly that neither node works, and you don't have time to rebuild it? Then what do you do? You bring in Microsoft.

Working with Microsoft

Operating a SQL Server 2005 cluster without having a Microsoft Technical Support contract is like operating a car without insurance. You can do it, but if you have any unexpected problems, you will be sorry you went without.

Generally, there are two main reasons you would need to call Microsoft Technical Support for clustering issues. The first is a non-critical issue that you just can't figure out for yourself. In this case, you will be assigned an engineer, and over a period of several days, you will work with that engineer to resolve your problem. This often involves running an application provided by Microsoft to gather information about your cluster so that the engineer can diagnose the problem.

The second reason to call is that your production cluster is down and there is no obvious way to get it back up quickly. In this case, we recommend that you call Microsoft Technical Support as soon as you can to get the problem ticket started. In addition, you must emphasize to the technical call screener (the first person who answers the phone) and to the engineer you are assigned that you are facing a production-down situation and that you want to declare a critical situation (critsit). This tells Microsoft that your problem is top priority, and you will get special service. When you declare a critsit, the person on the phone may try to dissuade you from doing so, because it sets off a chain of events within Microsoft Technical Support that they like to avoid. But if your production cluster is down, you need to emphasize the nature of your problem and tell them that you want to open the critsit. You may have to repeat this several times so that the nature of your problem is fully understood. Once it is, you will get immediate help with your problem until your cluster is back up and running.



Professional SQL Server 2005 Administration (Wrox Professional Guides)
ISBN: 0470055200
EAN: 2147483647
Year: 2004
Pages: 193
