Clustering: The Big Picture | Professional SQL Server 2005 Administration (Wrox Professional Guides)

In this section we take a brief look at how clustering works, along with clustering configuration options. I will also be introducing some terminology that you need to be familiar with.

How Clustering Works

Clustering is a complex technology with lots of messy details. To keep from scaring you too much too soon, we'll take a look at the big picture of how clustering works. In this section, we take a look at active and passive nodes, the shared disk array, the quorum, public and private networks, and the virtual server. Finally, we explain how a failover works.

Active vs. Passive Nodes

Although a SQL Server 2005 cluster can support up to eight nodes, clustering actually only occurs between two nodes at a time. This is because a single SQL Server 2005 instance can only run on a single node at a time, and should a failover occur, the failed instance can only failover to another individual node. This adds up to two nodes. Clusters of three or more nodes are only used where you need to cluster multiple instances of SQL Server 2005. We'll discuss larger clusters later in this chapter.

In a two-node SQL Server 2005 cluster, one of the physical server nodes is referred to as the active node, and the second one is referred to as the passive node. It doesn't matter which of the physical servers in a cluster is designated as the active or the passive, but it is easier from an administrative point of view to go ahead and assign one node as the active and the other as the passive. This way, you won't get confused which physical server is performing which role at the current time.

When we refer to an active node, we mean that this particular node is currently running an active instance of SQL Server 2005 and that it is accessing the instance's databases, which are located on a shared data array.

When we refer to a passive node, we mean that this particular node is not currently in production and is not accessing the instance's databases. When the passive node is not in production, it is in a state of readiness, so that if the active node fails, and a failover occurs, it can automatically go into production and begin accessing the instance's databases located on the shared disk array. In this case, the passive node then becomes the active node, and the formerly active node becomes the passive node (or the failed node, if a failure occurs that prevents it from operating).

Later in this chapter, we will take a look two different types of SQL Server 2005 cluster configurations: one called active/passive and one called active/active.

Shared Disk Array

Unlike nonclustered SQL Server 2005 instances, which usually store their databases on local disk storage, clustered SQL Server 2005 instances store data on a shared disk array. By shared, we mean that both nodes of the cluster are physically connected to the disk array but that only the active node can access the instance's databases. There is never a case where both nodes of a cluster are accessing an instance's databases at the same time. This is to ensure the integrity of the databases.

Generally speaking, a shared disk array is a SCSI- or fiber-connected RAID 5 or RAID 10 disk array housed in a stand-alone unit, or it might be a SAN. This shared array must have at least two logical partitions. One partition is used for storing the clustered instance's SQL Server databases, and the other is used for the quorum.

The Quorum

When both nodes of a cluster are up and running, participating in their relevant roles (active and passive), they communicate with each other over the network. For example, if you change a configuration setting on the active node, this configuration change is automatically sent to the passive node and the same change is made. This generally occurs very quickly and ensures that both nodes are synchronized.

As you might imagine, though, it is possible that you could make a change on the active node, but before the change is sent over the network and the same change made on the passive node (which will become the active node after the failover), the active nodes fails, and the change never gets to the passive node. Depending on the nature of the change, this could cause problems, even causing both nodes of the cluster to fail.

To prevent this from happening, a SQL Server 2005 cluster uses a quorum. A quorum is essentially a log file, similar in concept to database logs. Its purpose is to record any change made on the active node, and should any change recorded here not get to the passive node because the active node has failed and cannot send the change to the passive node over the network, the passive node, when it becomes the active node, can read the quorum file and find out what the change was and then make the change before it becomes the new active node.

In order for this to work, the quorum file must reside on the quorum drive. A quorum drive is a logical drive on the shared array devoted to the function of storing the quorum.

Public and Private Networks

Each node of a cluster must have at least two network cards. One network card will be connected to the public network and the other to a private network.

The public network is the network that the SQL Server 2005 clients are attached to, and this is how they communicate to a clustered SQL Server 2005 instance.

The private network is used solely for communications between the nodes of the cluster. It is used mainly for the heartbeat signal. In a cluster, the active node puts out a heartbeat signal, which tells the other nodes in the cluster that it is working. Should the heartbeat signal stop, a passive node in the cluster becomes aware that the active node has failed and that it should at this time initiate a failover so that it can become the active node and take control over the SQL Server 2005 instance.

The Virtual Server

One of the biggest mysteries of clustering is how do clients know when and how to switch communicating from a failed cluster node to the new active node? And the answer may be a surprise. They don't. That's right; SQL Server 2005 clients don't need to know anything about specific nodes of a cluster (such as the NETBIOS name or IP address of individual cluster nodes). This is because each clustered SQL Server 2005 instance is given a virtual name and IP address, which clients use to connect to the cluster. In other words, clients don't connect to a node's specific name or IP address but instead connect to a virtual name and IP address that stays the same no matter what node in a cluster is active.

When you create a cluster, one of the steps is to create a virtual cluster name and IP address. This name and IP address is used by the active node to communicate with clients. Should a failover occur, the new active node uses this same virtual name and IP address to communicate with clients. This way, clients only need to know the virtual name or IP address of the clustered instance of SQL Server, and a failover between nodes doesn't change this. At worst, when a failover occurs, there may be an interruption of service from the client to the clustered SQL Server 2005 instance, but once the failover has occurred, the client can once again reconnect to the instance using the same virtual name or IP address.

How a Failover Works

Although there can be many different causes of a failover, we'll look at the case where the power stops for the active node of a cluster and the passive node has to take over. This will provide a general overview of how a failover occurs.

Assume that a single SQL Server 2005 instance is running on the active node of a cluster and that a passive node is ready to take over when needed. At this time, the active node is communicating with both the database and the quorum on the shared array. Because only a single node at a time can be communicating with the shared array, the passive node is not communicating with the database or the quorum. In addition, the active node is sending out heartbeat signals over the private network, and the passive node is monitoring them to see if they stop. Clients are also interacting with the active node via the virtual name and IP address, running production transactions.

Now, for whatever reason, the passive node stops working because it no longer is receiving any electricity. The passive node, which is monitoring the heartbeats from the active node, now notices that it is not receiving the heartbeat signals. After a predetermined delay, the passive node assumes that the active node has failed and it initiates a failover. As part of the failover process, the passive node (now the active node) takes over control of the shared array and reads the quorum, looking for any unsynchronized configuration changes. It also takes over control of the virtual server name and IP address. In addition, as the node takes over the databases, it has to do a SQL Server startup, using the databases, just as if it is starting from a shutdown, going through a database recovery. The time this takes depends on many factors, including the speed of the system and the number of transactions that might have to be rolled forward or back during the database recovery process. Once the recovery process is complete, the new active nodes announces itself on the network with the virtual name and IP address, which allows the clients to reconnect and begin using the SQL Server 2005 instance with minimal interruption.

Clustering Options?

Up to this point, we have been talking about simple two-node clusters running a single instance of SQL Server 2005. In fact, this is only one of many options you have when clustering SQL Server 2005. In this section, we take a look at these options.

Active/Passive vs. Active/Active

In the examples so far, we have been describing what is called an active/passive cluster. This is a two-node cluster where there is only one active instance of SQL Server 2005. Should the active node fail, the passive node will take over the single instance of SQL Server 2005.

In order to save hardware costs, some organizations like to configure what is called an active/active cluster. This is also a two-node cluster, but instead of only a single instance of SQL Server 2005 running, there are two instances, one on each physical node of the cluster. In this case, should one instance of SQL Server 2005 fail, the other active node will take over, which means the remaining node must run two instances of SQL Server 2005 instead of one.

The advantage of an active/active cluster is that you make better use of the available hardware. Both nodes of the cluster are in use instead of just one, like in an active/passive cluster. The disadvantage is that when a failover occurs, then both SQL Server 2005 instances are running on a single server, which can hurt the performance of both instances. To help overcome this problem, both of the physical servers can be oversized in order to better be able to meet the needs of both instances should a failover occur. In addition, if you have an active/active cluster running two instances of SQL Server 2005, each instance will need its own logical disk on the shared array. Logical disks cannot be shared among instances of SQL Server 2005.

If you have the need for two SQL Server 2005 clustered instances, you have three choices:

Two active/passive clusters
One active/active cluster
One 3-node cluster (more on this shortly)

Which should you choose? If you want to be conservative, chose two active/passive clusters because they are easier to administer and provide more redundant hardware, and also they won't cause any application slowdown should a failover occur.

If you want to save hardware costs and don't mind potential application slowdowns and the added complexity of this option, use an active/active cluster.

If you think that you will be adding more clustered SQL Server 2005 instances in the future, and don't mind the complexity of large clusters, consider a multinode cluster.

Multinodes Clusters

The number of nodes supported for SQL Server 2005 clustering depends on which version of the software you purchase, along with which version of the operating system you intend on using.

Purchase the Right Software

One of the reasons it is very important to research your clustering needs is because they directly affect what software you need, along with licensing costs. Here are your options:

Windows Server 2003 Enterprise Edition (32-bit or 64-bit) and SQL Server 2005 Standard Edition (32-bit or 64-bit). Supports up to 2-node clustering
Windows Server 2003 Enterprise Edition (32-bit or 64-bit) and SQL Server 2005 Enterprise Edition (32-bit or 64-bit). Supports up to 8-node clustering
Windows Server 2003 Datacenter Edition (32-bit or 64-bit) and SQL Server 2005 Enterprise Edition (32-bit or 64-bit). Supports up to 8-node clustering

If you need only a two-node cluster, you can save a lot of money by purchasing Windows Server 2003 Enterprise Edition and SQL Server 2005 Standard Edition. If you want more than two clustered nodes, your licensing costs will escalate quickly.

What about Windows 2000 Server, can you cluster SQL Server 2005 with it? Yes you can, if you want. But given the increased stability of clustering in Windows 2003 over Windows 2000, we recommend using Windows 2003.

Number of Nodes to Use?

Like most things in clustering, there are no clear-cut answers to how many nodes should your cluster use. Before we try to offer some advice in this area, let's first discuss how multinode clustering works.

As we have already discussed, in a two-node cluster, an instance of SQL Server 2005 runs on the active node, while the passive node is currently not running SQL Server 2005, but is ready to do so when a failover occurs. This same principle applies to multinode clusters.

For example, let's say that you have a three-node cluster. In this case, two of the nodes would be active, running their own individual instances of SQL Server 2005, and the third physical node would act as a passive node for the other two active nodes. If either of the two active nodes failed, the passive node would take over after a failover.

Let's look at the other extreme, an eight-node cluster. In this case, you could have seven active nodes and one passive. Should any of the seven active nodes fail, then the passive node would take over after a failover.

The advantage of larger nodes is that less hardware is used for failover needs. For example, in a two-node cluster, 50 percent of your hardware is used for redundancy. But if you are using an eight-node cluster, only 12.5 percent of your cluster hardware is used for redundancy.

On the downside, using only one passive node in a multiple-node cluster makes the assumption that no more than one of the active nodes will fail at the same time. Or if they do, the failed over nodes will run slowly on the single node. On the other hand, if you want, a multinode cluster can have more than one passive node. For example, you could have six active nodes and two passive nodes. Besides these risks, multinode clusters are more complex to manage than two-node clusters, on top of being very expensive in hardware and software licensing costs.

So how many nodes should your cluster have? It depends on your needs (how many SQL Server 2005 instances do you need to cluster?) along with your budget, your in-house expertise, and your aversion to complexity. Some companies have many different SQL Server instances they need to cluster, but choose to use multiple two-node active/passive clusters instead of a single multinode cluster. In many cases, it is best just to keep things as simple as possible.

Clustering Multiple Instances of SQL Server on the Same Server

Up to this point, we have made the assumption that a single instance of SQL Server 2005 will run on a single physical server, but this is not a requirement. In fact, SQL Server 2005 Enterprise Edition can run up to 50 instances on a single clustered physical server, and SQL Server 2005 Standard Edition up to 16 instances. But is this a good idea? Not in the world we live in.

The purpose of clustering is to boost high availability; it is not to save money. Adding multiple instances to a cluster adds complexity, and complexity reduces high availability. Cluster multiple instances of SQL Server 2005 on the same physical node at your own risk.