Maintaining the Cluster | Professional SQL Server 2005 Administration (Wrox Professional Guides)

Once you have your SQL Server 2005 cluster up and running (and tested) you are ready to put it into production. This may involve creating new databases, moving databases from older servers to this one, setting up jobs, and so on. In most cases, managing SQL Server 2005 on a cluster is the same as managing it on a non-cluster. The key thing to keep in mind is that whenever you access your cluster with any of your SQL Server 2005 administrative tools, such as Management Studio, you will be accessing it using its virtual name and IP address, but if you are using any of the operating system tools, such as System Monitor, you will need to point these directly to the node in question (which is usually the active node).

In most cases, as a DBA, you probably will be administering SQL Server 2005 using Management Studio, but sometimes you will need to access the individual nodes of the cluster. If you have easy access to the cluster, you can always log on locally. Or if you prefer remote administration, you can use Terminal Services to access the nodes.

When a DBA first begins to administer their first cluster, they get a little confused where SQL Server 2005 is actually running. Keep in mind that a clustered instance of SQL Server 2005 (even in a multinode cluster) only consists of two nodes: the active and passive nodes. At any one time, an instance of SQL Server 2005 is only running on the active node, not the passive node. So when you need to look at the nodes directly, generally you will want to look at the active node. If you don't know which node is currently active, you can find out by using Cluster Administrator.

When you log into the active node (or connect to it remotely using Terminal Services), and then bring up Windows Explorer (a routine task), you will be able to see the SQL Server data drives and the quorum drive. But if you log onto the passive node, you will not be able to see the SQL Server data drives or the quorum drive. This is because drives can only be accessed from a single node at a time. On the other hand, if you check the Event Logs on either the active or passive nodes, you will see that they are virtually identical, with events from both nodes being displayed. Event logs are replicated between the active and passive nodes.

If you access your cluster through Terminal Services, be aware of a couple of odd behaviors. For example, if you use Terminal Services to connect to the cluster using the virtual cluster name and IP address, you will be connecting to the active node, just as if you used Terminal Services to connect to the active node directly (using its virtual name and IP address). But if a failover should occur and you are using Terminal Services to access the cluster using the virtual cluster name and IP address, Terminal Services will get a little confused, especially if you are using Windows Explorer. For example, you may discover that your data and quorum drives no longer appear to be accessible, even though they really are. To resolve this problem, log out of Terminal Services and reconnect after a failover; then things will appear as they should.

Learn to Love Cluster Administrator

While you won't need to use Cluster Administrator a lot, you still need to become very familiar with it as the DBA of a SQL Server 2005 cluster. You'll mostly use it when you need to perform some maintenance on your cluster and you need to manually fail over the cluster from one node to another, or to troubleshoot and fix problems.

You should keep close tabs on the resource groups so that you can find out which nodes the resource groups are currently running on, along with their current status. Sometimes, failovers occur so quickly that you are not aware that one has even occurred. By checking the current Owner of the resources, you can easily and quickly tell which node the resources are running on. The active node is the node where the resources are currently running. If you don't want to have to be checking the Cluster Administrator all the time to see if a failover has occurred, you always have the option of setting up a SQL Server 2005 alert, so you can keep abreast of what is happening on you cluster.

Once your cluster is up and running, be sure you take the time to familiarize yourself with Cluster Administrator.

Doing the Routine

The day-to-day activities of managing a SQL Server 2005 cluster are not much different than that of a non-clustered instance of SQL Server 2005. In addition to the routine you follow for all of you SQL Servers, here are some steps you can do in addition for you clusters.

Checking the Cluster

Each morning, start up Cluster Administrator to verify that the active node has not changed (due to a failover) and also review the status of each of the resources. It is uncommon to ever find a problem this way, but when does occur that shows here, you want to address it immediately.

Reviewing Logs

Because of the complexity of clusters, you should review all of the operating system Event Logs every day, looking for any potential issues. One of the most common we have found over the years administering SQL Server clusters are shared disk-related errors, which can indicate a configuration problem, software driver problem, or hardware problem, all of which need to be corrected as soon as possible.

Note

We want to remind you one last time that SQL Server clusters do not protect your data on the shared array. You must still back up your data.

Dealing with Cluster Failovers

SQL Sever 2005 instance failovers can occur for many different reasons. When one does occur, you need to identify what happened and why. This way, you can fix whatever caused the problem so it won't happen again.

In some cases after a failover, you can just fail back and everything will be fine. In the other extreme, you may have to replace a failed node. Here's what you must do if you have a failover that requires you to replace a failed node. In this example, we are assuming this is a two-node active/passive cluster and that the failed node won't even turn on.

In this scenario, you can continue to run on the now active node until you can secure some downtime to replace the failed node. Unfortunately, to fix a "broken" node, SQL Server 2005 must be downed for a while as it is fixed.

Removing the Failed Node

Because there is only one node currently active in your cluster, you must perform all of these steps from the single, active node.

Back up all of your data. You will want a copy of all your data somewhere else, like tape, or on another array. In addition, SQL Server 2005 will be down during this process, so be sure that you are doing this doing an approved downtime.
In Control Panel, select Add or Remove Programs, select SQL Server, and then click Change.
On the Component Selection screen which appears, select the components you want to remove from the selected SQL Server instance, and click Next to proceed.
The System Configuration Checker begins and scans your current configuration. When it is done, click Next.
On the Change or Remove Instance screen, click Maintain the Virtual Server.
The Cluster Node Configuration screen appears. To remove the failed node, select the node from the Selected Nodes list, and then click Remove. Then click Next; and on the final screen, click Update. This will begin the removal of the SQL Server 2005 instance from the failed node, but leaving it running on the good node. You perform this step even though the failed node is not accessible.
Now, you must evict the failed node from cluster services. To do this, start Cluster Administrator on the active node.
Right-click the failed node displayed in Cluster Administrator, and then click Evict Node. This will remove the failed node from the cluster.

At this point, the failed node has been removed and you have a one-node cluster that is still fully functioning as a SQL Server 2005 instance. If desired, clients can connect using the virtual SQL Server 2005 cluster name and do work.

Re-Adding the New Node

At some point, you will need to rebuild the hardware and operating system on the failed node so that it is again identical to the good node. Once the hardware and operating system is ready, the next step is re-add the new node to the cluster, then add SQL Server 2005 back to the node.

From the new node, use Cluster Administrator to add the new node to the cluster. Follow the same steps you used to add the second node to your cluster when you first built the cluster.
From the old node, in Control Panel, select Add or Remove Programs, select SQL Server, and then click Change.
On the Component Selection screen that appears, select the components you want to add to the selected SQL Server instance, and click Next to proceed.
The System Configuration Checker begins and scans your current configuration. Assuming there are no problems with the configuration, click Next.
On the Change or Remove Instance screen, click Maintain the Virtual Server.
The Cluster Node Configuration screen appears. To add the new node, select the node from the Available Nodes list, and then click Add. Then click Next, and on the final screen, click Update. This will add the SQL Server 2005 instance to the new node, installing the software on it similarly to how it did when the cluster was first built.
Once the new node has been added, thoroughly test it before putting it back into production.

What Happens if My Disk Array Dies?

As you know by now, a cluster does not protect your data, only the cluster nodes themselves. So what happens if the shared array stops working? How do you recover? In most cases, if your shared array fails, you will have to remove SQL Server clustering and clustering services from the two nodes, fix or replace the shared array, then reinstall clustering services and SQL Server 2005 clustering. And last of all, restore your databases.

Installing Patches and Service Packs

Installing operating system patches and service packs, and SQL Server 2005 service packs and hot fixes, have become part of the routine for DBAs. Even though we have come to hate installing them, we can't forget that each new release helps to make our platform more secure and stable. The following is some advice on installing this despised but important code.

Operating System Patches and Service Packs

Upon release, don't immediately install a new patch or service pack until you have had the opportunity to get feedback from the user community on potential problems. If you have a test cluster, upgrade it first to see how your software "reacts" to the new software. You don't necessarily need to install each monthly patch every month. Instead, save them up and do them once a quarter. This helps to reduce downtime. On the other hand, if your organization's security policies require that monthly patches be put on, then do so.

Patches and service packs are normally installed one node at a time. If this is the case, first install the patch or service pack on the designated passive node of the cluster. If all goes well, then fail over from the active to the passive node and install the patch or service pack on the former active node, then fail back. If you have more than two nodes, then rotate through all of the active nodes one at a time as each one successfully completes. You should do this during a low activity time to minimize user disruptions as the failovers occur.

SQL Server 2005 Service Packs and Hot Fixes

Upon release of a new service pack or hot fixes, be sure you test them thoroughly on a test system (with your current applications), before upgrading your clusters. While a test SQL Server cluster is ideal for testing, testing a service pack or hot fix on a non-clustered SQL Server (with your applications) is also a fairly good test.

SQL Server 2005 service packs and hot fixes are cluster-aware and will automatically update all nodes of a cluster at the same time. This makes installation easy, but will require some scheduled downtime as you may need to reboot the servers and fail them over after the installation.

Test, Test, and Test Regularly

By now, you are probably getting a little tired of us talking about testing, but it is important to the continuing successful administration of a SQL Server 2005 cluster that you test it regularly, just as it was when you first installed the cluster. Essentially, there are two times when you want to test an existing SQL Server 2005 cluster:

Whenever you make any changes to the cluster, including hardware, software, or configuration.
On a periodic basis, such as once a quarter, just to test if everything is still working.

When you do test, you should follow the test plan described earlier in this chapter you used after you built the cluster. This way, you can be sure that all aspects of your cluster work properly.

Like most things, you don't have to be an ogre about testing; your ongoing testing procedure can vary based on the circumstances. For example, if you make simple a configuration change in Cluster Administrator and you know exactly what you did and why, then testing right after the change is probably not mandatory, and testing can take place the next time you perform a bigger change, or wait until your regular, quarterly test. In addition, if you have just tested your cluster two weeks before a scheduled quarterly test, you can probably skip the quarterly this test time. The key thing is to remember to test regularly.