As you know, failures can and do happen to your cluster, and you have to be prepared to handle them when they arise. Whether the problem is a corrupt disk, a failed disk, or a bad node, the more you know up front, the more effectively you can deal with it. The following sections present some scenarios that you should consider when attempting to restore server clusters that have SQL Server virtual servers on them.
Although some scenarios might apply to other clustered applications such as Microsoft Exchange, each application has its own exact set of steps to restore. The steps outlined here are for SQL Server 2000 only.
If you are using Windows 2000, the Microsoft Windows 2000 Server Resource Kit (Microsoft Press, ISBN 1-57231-8058) includes a few tools you might find useful:
Under \Apps\Clustool, there is a tool named Clusrest.exe. It might assist you in restoring your quorum log to a live quorum. It can also be downloaded from http://www.microsoft.com/windows2000/techinfo/reskit/tools/existing/clusrest-o.asp.
Dumpcfg.exe can be useful for backing up and restoring disk signatures, which are crucial for the health of a server cluster.
Clustool.exe can be useful for backing up and restoring parts of your server cluster configuration. Note that items such as the cluster IP address, the cluster name, and the quorum disk are not restored with this tool.
Read more information on each of these tools before deploying them in your environment to see if they will be useful to you.
Windows Server 2003 Resource Kit tools are available from the Microsoft Windows Server 2003 Deployment Kit: A Microsoft Resource Kit (Microsoft Press, ISBN 0-7356-1486-5) or from http://www.microsoft.com/windowsserver2003/techinfo/reskit/resourcekit.msp.
In this scenario, all of the cluster nodes themselves are apparently fine, but the Cluster Service cannot start on any node because the quorum resource cannot be brought online. This issue will appear in the Event Log. To resolve this issue you have two options:
Option 1 Replace the quorum disk if the drive itself has failed or reformat the quorum disk if the physical drive has not failed. Use an Authoritative Restore, if you have one, to bring up one node.
Option 2 Follow these steps:
Open a Command window on one node.
Enter the following command: cluster -fixquorum
This starts the Cluster Service even though the quorum cannot be brought online. The -fixquorum switch does not fix any data for you; it allows you to choose an alternate quorum resource (assuming you have one available to use; you could always try LocalQuorum). When you set a new quorum, new quorum log files are created on it, but the registry checkpoint files are not restored because the old quorum is not available. To restore the checkpoint files, use the instructions for Scenario 4.
There is a tool in the Windows 2000 Resource Kit, ClusterRecovery, that can help in this scenario.
This scenario applies if a node cannot join the cluster and entries in the cluster log indicate a corrupted hive. You have three possible options (perform only one):
Option 1 Do a Non-Authoritative Restore on this node and have it join the cluster.
Option 2 Copy the latest checkpoint (ChkXXX.tmp) file from the quorum disk and overwrite the file %windir%\Cluster\Clusdb on the affected node and restart the service.
Option 3 Perform the following steps:
Stop the service on a working cluster node. Unload the cluster hive using Regedt32.
Copy the file %windir%\Cluster\Clusdb from the working node to %windir%\Cluster\Clusdb on the affected node, and restart the Cluster Service on all nodes.
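Option 2 depends on picking the most recent checkpoint file on the quorum disk. The following is a minimal Python sketch of that selection only. It assumes that "latest" means the newest modification time, and the function name latest_checkpoint and the directory you pass to it are hypothetical; they are not part of any Microsoft tool. Copying the file over Clusdb and restarting the service remain manual steps.

```python
import fnmatch
from pathlib import Path
from typing import Optional


def latest_checkpoint(quorum_dir: str, pattern: str = "chk*.tmp") -> Optional[Path]:
    """Return the most recently modified checkpoint file, or None if none exist.

    Assumption: "latest" is judged by file modification time.
    quorum_dir is whatever folder on the quorum disk holds the ChkXXX.tmp files.
    """
    # Compare lowercased names so ChkXXX.tmp and CHKXXX.TMP both match.
    candidates = [
        entry
        for entry in Path(quorum_dir).iterdir()
        if entry.is_file() and fnmatch.fnmatch(entry.name.lower(), pattern.lower())
    ]
    if not candidates:
        return None
    # Pick the file with the newest modification timestamp.
    return max(candidates, key=lambda entry: entry.stat().st_mtime)
```

Review the file the function returns before you overwrite anything, and keep a copy of the existing Clusdb file on the affected node first.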
If no nodes can join the cluster, a node cannot start the Cluster Service, and Event Viewer indicates a corrupt quorum log, start the Cluster Service from a command line using the -resetquorumlog switch of Cluster.exe. If all of the resources start successfully and there do not seem to be any lingering issues, you do not need to do anything else: you have reset the quorum log and created new quorum log files on the quorum disk. However, the registry checkpoint files are not restored because the old quorum is not available. You need to follow the steps in Scenario 4 to restore those.
If using -resetquorumlog does not work, perform an Authoritative Restore on one node and restart the Cluster Service to form the cluster. Use a Non-Authoritative Restore on all other nodes.
If a registry checkpoint file cannot be found or cannot be loaded because of corruption, resources might not have the most up-to-date information in the registry when they are brought online. The impact depends on the resource. In some cases, a resource might fail to come online. In other cases, configuration changes that were made might be lost. A missing checkpoint file is not logged to Event Viewer's event log, but it is recorded in the cluster log. If you see that this is an issue, use the ClusterRecovery tool mentioned earlier to re-create the resource checkpoint files. However, do this only for the resources that cannot start; re-creating all checkpoint files will create more problems.
If that does not solve the problem, perform an Authoritative Restore on one cluster node and restart the Cluster Service to form the cluster. Use a Non-Authoritative Restore on other nodes.
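Because a missing checkpoint file shows up only in the cluster log, it can help to filter that log for related entries. The following Python sketch does a simple case-insensitive keyword search. The function name checkpoint_lines and the keyword "checkpoint" are assumptions made for illustration; the exact wording of the cluster log entries is not given here, so adjust the keyword to match what you actually see in your log.

```python
from typing import Iterable, List


def checkpoint_lines(log_lines: Iterable[str], keyword: str = "checkpoint") -> List[str]:
    """Return the log lines that mention the keyword, ignoring case.

    Assumption: relevant entries contain the word "checkpoint"; the real
    message text is not specified, so treat the keyword as a starting point.
    """
    needle = keyword.lower()
    # Strip the trailing newline so results print cleanly.
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]


# Example use (the log path is an assumption; pass the path on your node):
#   with open(log_path, errors="replace") as handle:
#       for entry in checkpoint_lines(handle):
#           print(entry)
```

The sketch only narrows down what to read. Deciding which resources need their checkpoint files re-created is still a judgment you make from the entries themselves.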
In this case, assume that the quorum disk is functioning and the cluster database is intact. In these procedures, you might have to evict cluster nodes from your definition because they might be damaged.
Never evict the node from the cluster in Cluster Administrator before removing it from the SQL Server virtual server definition using SQL Server Setup. If you do, you might encounter issues that require a complete reinstall.
Because the other nodes and the cluster itself are up and running, you can concentrate only on the failed node. Because the quorum is intact, you can perform a Non-Authoritative Restore, which should work with either the system state backup or a local backup. The end result is that the cluster database on the damaged node is not restored; you can then have that node rejoin the cluster. Once it joins the server cluster, it synchronizes the cluster database with the most recent copy available. The exact steps are as follows:
Verify that all functionality that was owned by the failed server has started on another node.
Run SQL Server Setup and evict the failed node from the SQL Server virtual server definition as described in the section Adding or Removing a Cluster Node from the Virtual Server Definition and Adding, Changing, or Updating a TCP/IP Address earlier in this chapter. You might see messages similar to the ones shown in Figures 6-31, 6-32, and 6-33 during the process.
Figure 6-31: Error message 1.
Figure 6-32: Error message 2.
Figure 6-33: Error message 3.
Verify that the node has been removed by issuing the following query:
SELECT * FROM ::fn_virtualservernodes()
Evict the node itself from Cluster Administrator. To do this:
In the left-hand pane of Cluster Administrator, select the node to evict, right-click it, and select Stop Cluster Service if it is still running the Cluster Service. If not, skip this step.
Once that node has indicated that the service is stopped (a red circle with a white X should appear next to it, as shown in Figure 6-34), right-click it and select Evict Node.
Figure 6-34: Stopped cluster node.
Rebuild or restore the failed node (whichever procedure is most appropriate for your environment).
Rejoin the node to the server cluster.
Run SQL Server Setup and rejoin the SQL Server virtual server definition.
Reinstall the SQL Server 2000 service pack on the previously damaged node only, as detailed in Chapter 13.
Test failover to the node. This causes an availability outage, but it is the only way to ensure that everything is working properly.
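The verification step in the procedure above (querying the virtual server nodes after removing the failed node) can be turned into a simple check. The following Python sketch assumes you have already collected the node names returned by the query into a list; the function name node_removed and the sample node names are hypothetical. The comparison ignores case because Windows computer names are not case sensitive.

```python
from typing import Iterable


def node_removed(remaining_nodes: Iterable[str], evicted_node: str) -> bool:
    """Return True if evicted_node no longer appears among the remaining nodes.

    remaining_nodes is assumed to hold the node names returned by the
    verification query; how you fetch them is outside this sketch.
    """
    evicted = evicted_node.strip().lower()
    # Trim padding and ignore case when comparing node names.
    return all(name.strip().lower() != evicted for name in remaining_nodes)


# Example use with made-up node names:
#   node_removed(["NODE1", "NODE3"], "node2")  is True
#   node_removed(["NODE1", "NODE2"], "node2")  is False
```

Evict the node in Cluster Administrator only after this check passes.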
As long as you have one node functioning in the cluster, perform the steps found in the preceding section on each of the failed nodes.
This is the worst-case scenario: your entire cluster is down. At this point, if none of the nodes can start, you are looking at using your backups. Use a Non-Authoritative Restore on one node. If the quorum disk is fine, the node should be able to form the cluster with the current state on the quorum disk. Then run a Non-Authoritative Restore on all other nodes.
If that does not work, restore an Authoritative backup on one node and use a Non-Authoritative Restore for all other nodes.
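The restore order described above can be written down as a small runbook helper. The following Python sketch only produces the ordered list of node and restore type pairs; it does not perform any restore. The function name restore_plan and the node names are hypothetical.

```python
from typing import List, Tuple


def restore_plan(nodes: List[str], first_node: str, authoritative: bool = False) -> List[Tuple[str, str]]:
    """Return (node, restore type) pairs in the order they should be performed.

    The first node gets a Non-Authoritative Restore by default, or an
    Authoritative Restore for the fallback case. All other nodes always
    get a Non-Authoritative Restore.
    """
    if first_node not in nodes:
        raise ValueError(f"{first_node!r} is not one of the cluster nodes")
    first_kind = "Authoritative" if authoritative else "Non-Authoritative"
    plan = [(first_node, first_kind)]
    # Remaining nodes follow in their original order.
    plan.extend((node, "Non-Authoritative") for node in nodes if node != first_node)
    return plan


# Example use with made-up node names:
#   restore_plan(["NODE1", "NODE2"], "NODE1")                      first attempt
#   restore_plan(["NODE1", "NODE2"], "NODE1", authoritative=True)  fallback
```

Try the default plan first, and move to the Authoritative variant only if the first node cannot form the cluster.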
If a disk fails and resources are dependent on it, those resources will not start. This could be due to corruption or some other problem. If the disk itself comes online and is recognized by the operating system, perform a restore from a backup on that disk. If the disk does have corruption, replace the physical disk and perform a Non-Authoritative Restore on one node. Then restore the data to the disk. As an alternative, you could use ClusterRecovery to replace an existing physical disk resource without having to do a system state restore, and then restore the data on the disk.
At this point, if you have walked through the preceding six scenarios and could not recover because you did not have proper backups, you can try the things listed in the following sections, but there is no guarantee that you can avoid a complete reinstall.
If the cluster database is intact and the quorum disk is fine, follow the steps in Scenario 5. If you lose all nodes and do not have backups, you have to do a complete reinstall from the ground up knowing that you might have lost all of your SQL Server data.
Use Option 2 from Scenario 1, which relies on the -fixquorum switch.
You can try to copy the latest checkpoint (ChkXXX.tmp) file from the quorum disk and overwrite the file %windir%\Cluster\Clusdb on the affected node and restart the cluster service. Or you might want to try the following:
Stop the service on another node. Unload the cluster hive using Regedt32.
Load the registry hive on the affected node.
Copy the file %windir%\Cluster\Clusdb from one of the running nodes in the cluster to %windir%\Cluster\Clusdb on the affected node and restart the Cluster Service on all nodes.
Use the steps found in Scenario 3 to use -resetquorumlog.
Follow the instructions in Scenario 4 to use the ClusterRecovery tool.
If the disk has been forcefully dismounted, you might need to run chkdsk to bring the disk online. The Cluster Service runs chkdsk automatically when the disk is brought online. Windows Server 2003 preserves a chkdsk log so that you can see what state the disk is in and what issues were found.
If the application data on the disk is corrupted or deleted and you do not have a backup, there is no way to recover the data.