Troubleshooting SQL Server 2000 in a failover cluster configuration is not always the same as diagnosing problems on a stand-alone server. Because the configuration is a bit more complex, you need to diagnose in a specific order. Before that, however, you must understand the barriers that can exist when implementing failover clustering.
Microsoft PSS has identified these barriers as the leading culprits for failover clustering problems:
Lack of planning. One of the problems is that a missed configuration step can lead to significant problems in production. If your server cluster is unstable, your failover cluster will be, too, and possibly vice versa. For example, is the driver for your SAN the right version? Take the time to plan your installations properly using the tools presented throughout this book.
Failure to comply with the Hardware Compatibility List (HCL) requirements. SQL Server failover clustering is only supported on HCL or Windows Catalog “certified systems as described in Chapter 5.
Not understanding the technology. There might be a lack of understanding about why clustering is being used, what it provides, and what it does not provide. This can lead to dissatisfaction with the platform, even though it is performing as designed. Please take the time to review Chapter 3 to understand the basics of clustering.
Internal politics. If the customer has one group for SQL Server, another for networking, and yet another for system work, the groups must be able to work together, understand the technical issues (such as the need for domain accounts and dedicated IP addresses), and understand how they must interact both on a technical level and on a political level. Internal turf wars are a common problem and can prove to be a significant barrier to availability. Teamwork is crucial.
Failure to troubleshoot high availability technology systematically. Provided nothing has changed in the application environment, you have to troubleshoot the technology in layers ”consistently.
Lack of technology-aware diagnostics. How you diagnose a system depends on the technology involved. This is no different with SQL Server high availability systems. For example, make sure that if you are using a cluster, your diagnostic systems are cluster-aware.
Building cluster awareness and understanding into applications. SQL Server developers need to build in cluster awareness to make the end- user experience seamless. For example, if your database takes n minutes to go through failover and recovery, how does your application respond during those n minutes? This can be controlled by building awareness of the failover process into your application for any high availability technology approach, not just for clustering.
Securing SQL Server access with use of an SSL certificate, which is covered earlier in this chapter. If this is applied wrong, you might need to completely reinstall or recover your failover cluster.
When you are working with PSS to solve a problem, make sure that you fulfill all requests for information ” especially when supporting a cluster. PSS can gather and send subsets of clustering information to different specialists for concurrent analysis, as long as you supply this information. Your level of cooperation and communication in interacting with PSS affects the time it takes to resolve your issue.
Most SQL Server cluster issues (more than 75 percent) have nothing to do with SQL Server, and therefore cannot be solved by sending log information in isolation. Proper resolution of a case requires proactive communication. This involves keeping accurate records of changes and times of changes ( deliberate alterations or observed fluctuations). Track configurations. Keep complete notes of all error messages received or observed rather than recording generic comments, such as Installation failed. Communicate any actions you took and their results, as well as anything you suspect and why, or anything that is happening simultaneously . For example, detail the following:
Was a file in use? If so, identify which file you could not update.
Was a share missing?
Did the user have permission?
Was there a hardware problem detected ? Fix the errors constantly being reported in the Setupapi.log file so PSS can use that as a resource to see if there are system file-protection issues being encountered .
Once you have involved PSS, refrain from making any changes to the system without confirming them with PSS. This is a simple extra step that can facilitate resolution. It can prevent additional problems from arising (for example, if your plan has known issues).
Roughly 70 to 80 percent of all SQL Server virtual server problems are not SQL Server “based. They have to do with base cluster issues at the operating system or a hardware failure, driver issue, and so on. Always troubleshoot in the following bottom-up order, as this uncovers the layers of problems. Never assume that SQL Server is your issue until you discover that it is. If you troubleshoot in a different order than that listed here, you might have success, but experience has proven that following this exact sequence uncovers your problems with greater certainty :
Base Windows issues
Security issues (such as permissions)
Server cluster or other base clustering issues
SQL Server issues
Check the operating system s Event Viewer at the application, system, and security logs. Sometimes the problem is apparent, such as a disk or network card failure, or relevant error messages from the operating system or SQL Server might appear. Next , check the relevant log files. The SQL Server installation log files are located under %windir%, and the cluster logs are located where the system variable %clusterlog% is set ( generally \\Windowsinstalldir\Cluster). These are the files:
Setup.log Log for the local binaries setup portion of a SQL Server installation.
Sqlstp n .log The log for the SQL Server Setup, where n is the number of the setup attempts.
Sqlsp n .log The log for SQL Server Service Pack Setup, where n is the number of the setup attempts.
Sqlclstr.log The log for the clustered instances of SQL Server.
Cluster.log The main cluster log file.
If you call PSS, there is a good chance they will want you to run the Microsoft Product Support (MPS) Reporting Tool. The MPS Reporting Tool gathers detailed information regarding a system s current configuration in a nondestructive way. The reporting tool does not make any registry changes or modifications to the operating system. The data that is collected will assist the Microsoft Support Professionals with fault isolation. When used with a cluster, this tool should be executed on all nodes that are part of the SQL Server failover cluster definition.
In addition to the information the tool gathers, you will be asked to provide information specific to your installation that PSS cannot gather manually. When the tool finishes, a text document appears on your screen with details of the requested information.
|On the CD|| |
A version of the MPS Reporting tool is included with this book in MPSRPT_2002_SQLv6f.zip. Test it on your systems, both clustered and nonclustered, and you can also store this information in your run book. PSS might have a later version that they will ask you to use if you open a support case.
This section presents some of the common issues and solutions when implementing failover clustering.
When the installation process attempts to install the SQL Server binaries on the other nodes, it fails (possibly with a failed logon request error). When installing on the remote nodes, the SQL Server Setup program is actually run on each of the nodes. If you install from a network share, ensure that all nodes have connectivity to the share without having to specify a network password. (For example, you should be able to view \\sharecomputer\sharepath without specifying credentials.) If you are installing from a compact disc on a particular node, make sure that the drive is actually shared and that the other cluster nodes are configured to communicate properly. This includes making sure that the proper accounts exist on each node and that the other nodes are set up for Windows Authentication. Mapping a drive letter, even if it is the same on all nodes, does not work because the installation process accesses the UNC path of the share.
After installing and rebooting the server, the SQL Server install does not seem to complete. Sometimes file renames (for example, MDAC files) are blocked at startup due to locks on files in use during startup. Therefore, if the file remains read-only, the installation process never completes.
Windows Clustering loses connectivity to the SQL Server virtual server. This could be because the process used to perform the IsAlive check is run in the context of the Windows Clustering service account. This account must have sysadmin rights in SQL Server. If you encounter problems with this, check all logins and cross-reference them with the cluster logs to see if there is an IsAlive check.
If you removed Builtin\Administrators from SQL Server, you have to manually add the Cluster Service account to the Sysadmin group of SQL Server so it can run IsAlive.
After installation of a new SQL Server virtual server, clients cannot connect to the server, especially when using graphical user interface (GUI) tools. Sometimes a successful manual connection using a command-line tool (for example, the bcp or osql utilities) must be initiated first. This sometimes occurs when the virtual server is not yet registered in Windows Internet Name Server (WINS). If your installation was a named instance, you have to verify that your clients are using a minimum of MDAC 2.6.
In some cases, Setup might fail because the Full-Text Search resource type does not exist. The resource is called Microsoft Search Service. If this is the case, you might have to manually create the type to get the installation to complete next time. To create the resource, go to a command prompt and execute
cd %windir%\cluster regsvr32 gathercl.dll
It is helpful to register this file on all nodes. You can then rerun Setup.