Understanding the Microsoft Cluster Models


Microsoft technology offers several cluster models that support SQL Server. They are as follows:

  • Model A   High availability and static load balancing

  • Model B   Hot spare solution with maximum availability

  • Model C   Partial server cluster solution

  • Model D   Virtual server only with no fail-over

  • Model E   Hybrid solutions using the best of the previous models

Model A: High-Availability Solution and Static Load Balancing

Clustering SQL Server provides the ability to recover from resource failure immediately and provides seamless connectivity to your clients and client applications. The resources whose failure can put your data tier in the dark can be physical, such as hardware, or logical, such as a service. When a resource failure is identified, the failed resource and any resources that depend on it are moved from the failed node to another node.

Clustering is not limited to one fail-over server. Windows Server 2003 clusters support up to eight nodes. Clustering configurations range from a two-node cluster, such as one active server and one standby server waiting for something to fail on the active server, to a solution as large as eight nodes each running their own services. As soon as a resource fails on one node, the next preferred node seizes ownership of the resources and takes over from the failed node; this is the so-called fail-over process. This is about as far as we need to go in describing the actual clustering process on Windows Server 2003, as the Cluster Service itself is beyond the scope of this book.
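If you want to see where a clustered instance is actually running at a given moment, SQL Server 2005 exposes the cluster state through server properties and a dynamic management view. The following T-SQL is a minimal sketch; the property and view names are standard SQL Server 2005 metadata, and the values returned will of course depend on your own cluster:

    -- Confirm that the instance is clustered, and see which physical node
    -- currently owns it. After a fail-over, CurrentOwnerNode changes while
    -- the name reported to clients stays the same.
    SELECT
        SERVERPROPERTY('IsClustered')                 AS IsClustered,       -- 1 = clustered instance
        SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentOwnerNode,  -- physical node name
        @@SERVERNAME                                  AS VirtualServerName; -- constant across fail-overs

    -- List every node that can own this virtual server.
    SELECT NodeName FROM sys.dm_os_cluster_nodes;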

Designing the SQL Server Cluster

Good planning is a vital ingredient of a SQL Server cluster (of any cluster, for that matter), and a clear understanding of what your solution needs and of the applications it supports is paramount. Let’s discuss the particulars and some different configurations.

Documenting the Dependencies

You should start with the physical disk, the first resource SQL Server depends on. Without physical disks in the system on which to install SQL Server, you cannot read any data and begin service. When a SQL Server virtual server fails over, you need to make sure that the first resource claimed by the new node is the disk; all clusters come into this world on a hard disk. The next resource in the order of priority for continuing connectivity is the public network; next comes the network name (which depends on the network); and so on.

To make life easier for yourself, you should sketch or chart the resources and the sequence of the fail-overs. This will help you see what depends on what and what each resource needs in order to exist. Failing over the SQL Server virtual server before the network fails over does not make much sense.

Understanding SQL Server Active/Passive Configurations

Active/passive configurations require at least two nodes. One node, the active partner, does all the work: it processes all connections and serves data while the other node waits. The passive node is actually a hot standby server. When a resource fails on the active node, the passive node gains ownership of all dependent resources and becomes the active partner.

An active/passive cluster configuration ensures that there will be minimal or zero performance degradation due to the failure of a node. Because it also doubles the hardware costs (two identical servers), it is not always the most economical option.

Active/Active Configurations and Multiple Instances

An active/active SQL Server cluster configuration means that each participating node in the cluster is serving requests, but not as part of the same SQL Server instance. Active/active clusters may appear as if multiple servers are sharing the load of responding to client requests from multiple sources; that, however, is what load-balanced clusters do. An active/active SQL Server cluster actually means that “multiple instances” of SQL Server are running on the nodes of the cluster, not that two nodes of the cluster are active over a single instance of the master and the other system databases.
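Because each node runs its own instance, each instance also maintains its own identity and its own system databases. As a simple illustration (not part of any setup procedure), you can run the same T-SQL against each virtual server in the cluster and watch every instance report its own name and its own copies of the system databases:

    -- Run against each virtual server. Every instance reports its own
    -- identity, even when a fail-over leaves two instances owned by the
    -- same physical node.
    SELECT
        SERVERPROPERTY('MachineName')  AS VirtualServerName,
        SERVERPROPERTY('InstanceName') AS InstanceName;  -- NULL for a default instance

    -- Each instance returns its own master, tempdb, model, and msdb.
    SELECT name
    FROM sys.databases
    WHERE database_id <= 4;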

N+1 Configurations

N+1 configurations can offer the best of both worlds for clustering requirements. The term N+1 refers to N active nodes with an additional (+1) node running as a standby. In a failure scenario, the (+1) node takes ownership of the failed resource and any other dependent resource(s). This means you can get more productivity out of your hardware and still have a hot standby node dedicated to stepping in when a resource fails. The cost is relative; if the application earns $50K a week, then you not only can afford this configuration, you require it.

In an N+1 configuration of four nodes, three of the nodes actively serve requests and one is a hot standby partner that any of the other nodes can fail over to. An N+1 configuration is more cost-effective than a traditional active/passive configuration because it does not have the one-to-one active-to-passive ratio, that is, the functionally working-to-waiting-for-a-failure ratio. You can support multiple instances of SQL Server, and in the event of a single node failure, performance would not be adversely affected. The standby node still adds expense, but that expense is shared by all the active nodes. Note that a true N+1 configuration requires at least three nodes.

In active/passive configurations, all the instances of SQL Server are owned by the same server. One node of the cluster owns all resources all the time. When a resource failure occurs, all resources are transferred to the other node. In active/active configurations there are at least as many installed SQL Server instances as there are servers participating in the cluster, and so all of the clustered nodes are actively working and using their available resources. What are those resources?

For the default instance you will need

  • A clustered host server (node)

  • A SQL Server network name (this will be the Virtual SQL Server name using the Network Name resource)

  • A SQL Server network IP address (it is not recommended that you use the public network for the heartbeat)

  • Physical disk resources for the data and log files (best practice is to have these on separate disks to maximize performance; see the sketch after this list)

  • The SQL Server Service

  • The SQL Server Agent Service

  • The SQL Full Text Search Service
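The bullet about separate data and log disks deserves a concrete illustration. The following is a sketch only: the database name and the S: and L: drive letters are assumptions, and in a cluster both drives must be clustered physical disk resources in the SQL Server group:

    -- Illustrative only: S: and L: are assumed to be two separate clustered
    -- disks. Keeping sequential log I/O away from random data-file I/O is
    -- what maximizes performance.
    CREATE DATABASE Sales
    ON PRIMARY (NAME = Sales_data, FILENAME = 'S:\SQLData\Sales.mdf')
    LOG ON     (NAME = Sales_log,  FILENAME = 'L:\SQLLogs\Sales.ldf');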

For any named instance you will need

  • A clustered host server

  • A SQL Server network name (this will be an instance name of SQL Server as “Virtualservername\instancename”; see the example after this list)

  • A SQL Server network IP address

  • Physical disk resources for the data and log files (best practice is to have these on separate disks to maximize performance)

  • The SQL Server Service

  • The SQL Server Agent Service

  • The SQL Full Text Search Service
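To make the naming concrete, suppose, purely as an example, a virtual server named VSQL01 hosting a named instance INST1. Clients always connect to the virtual name, never to a physical node name:

    -- Connection targets (hypothetical names):
    --   default instance:  VSQL01
    --   named instance:    VSQL01\INST1
    -- From inside the instance you can confirm the name clients should use;
    -- it does not change when the instance fails over to another node.
    SELECT @@SERVERNAME;  -- e.g., VSQL01\INST1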

With multiple-instance configurations, each instance of SQL Server acts independently of the others. Each instance needs its own distinct set of resources, and instances only “interact” if and when multiple instances of SQL Server are owned by the same node. There are several concepts to understand and take into consideration; the primary one is performance planning. Multiple instances require their own resources, and that takes careful planning and coordination. This involves all dependent resources, which are discussed in the next sections.
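Performance planning for multiple instances largely comes down to ensuring that the instances can coexist on a single node after a fail-over. One common approach is to cap each instance’s memory with sp_configure; the figure below is purely illustrative and would be sized against your own nodes’ RAM:

    -- Run on each instance. Cap memory so that if a fail-over leaves one node
    -- owning several instances, their combined caps (plus the operating system)
    -- still fit in that node's physical memory. 3072 MB is an example figure
    -- for an 8 GB node that may have to host two instances.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 3072;
    RECONFIGURE;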

Standby Services: Advantages and Disadvantages

There are definite advantages and disadvantages to clustering standby services. Some are obvious, and some are not. Some of the items that we will review can be an advantage or a disadvantage, depending on the perspective (IT staff vs. Accounting) from which you are viewing these items.

High availability is the single most important factor for an administrator weighing the advantages and disadvantages of clustering. While most DBAs put performance at the top of the list, performance problems cannot be easily addressed without considering the cluster as a whole. Using standby services, that is, services and resources that are not actively being used and are waiting for a failure to occur, is most often a decision based on the availability of funding. Your business’s high-availability goals can be achieved through a clustering solution that does not have standby services.

Multiple-instance, or active/active, clustering is the predominant form of such a solution. The performance of your application(s) from the database perspective is where the decision to use or not use standby services is made. If standby services are not used and a multiple-instance architecture is in place, there are no unused resources available in the event of failure. The failed node’s burden is dispersed across the rest of the node(s) of the cluster. In that case, all resources for multiple installations of SQL Server are now owned and managed by a single server.

What if the load is too much for a single server to bear? The application may time out, report errors to end users, fail to serve requests, even crash the server completely. That is when my two worst enemies show up: data corruption and unpredictable results. Is your application still truly highly available? In the event of a disaster, can one server truly hold the entire load? The costs for hardware, software licensing, and a paid administrator are not small. Have you ever calculated the cost of one hour of downtime? Twelve hours? Twenty-four hours? In some scenarios, more than a few minutes of downtime and the company is facing a real loss, and the challenge of finding a new administrator.

The advantages of standby services are abundant. Most notable are redundancy and the resilience to recover from resource failure. In the event of the failure of a single node, there will be no performance degradation, because a twin of the failed system is available to serve requests at a moment’s notice. Another advantage of standby services is the ability to continue providing services while staying up to date with Windows and security updates.

Load Balancing

Load balancing is a challenging feat in environments where the difficulties of managing multiple instances threaten the uptime of a server. With standby services, each node of a cluster can own the resource pools while the opposite node(s) are brought up to date and rebooted if necessary. This configuration results in a greater amount of total uptime for connected clients and processes.

This model offers a high-availability solution and an acceptable level of performance when only one node is online. However, you can also attain a high performance level when both nodes are kept online, meaning they are both serving clients. The model has been designed to allow for maximum utilization of hardware resources.

In this model, you can create a virtual server on each node, each of which makes its own set of resources available to the network. A virtual server appears on the network as an ordinary server and can be accessed by clients like any other server. Capacity is configured for each node to allow the resources on each node to run at optimum performance. You would aim, however, to configure the resources in such a way that if one node went south, the other node would be able to temporarily take on the burden of running the resources from the other, potentially catering to a huge surge in connections and access to resources. Usually all the client services remain available during the fail-over; only performance suffers, because one server is now performing the job of two.

This model is useful for the high-availability needs of file-sharing and print-spooling services. For example, two file and print shares are established as separate groups, one on each server. If one goes to hell, the other inherits the estate and takes on the file-sharing and print-spooling jobs of the deceased server. When you configure the fail-over policy, you will usually ensure that the temporarily relocated group is set to prefer its original server. In other words, when the failed server rises from the grave, the displaced group returns to the control of its preferred server. Operations resume at normal performance, and clients will notice only a minor interruption. This model can be summarized as follows:

  • Availability   High

  • Suggested fail-over policies   Assign a preferred server to each group

  • Suggested fail-back parameters   Allow fail-back for all groups to the preferred server

Business Scenarios

Using Model A you can solve two problems that typically occur in a large computing environment:

  • First, a problem occurs when a single server is running multiple large applications, which can cause a degradation in performance. To solve the problem, you would cluster a second server with the first. Applications are split across the servers, and both actively serve clients.

  • Second, the problem of availability arises when the two servers are not connected. By placing them in a cluster, you assure greater availability of both applications for the client.

Consider a corporate intranet that relies on a database server supporting two large database applications. The databases are used by hundreds of users who repeatedly connect to the database from sunrise to sundown. During peak connect times, however, the server cannot keep up with the demand on performance.

A solution would be to install a second server, form a cluster, and balance the load. We now have two servers, each running and supporting one of the database applications. When one server goes down, we are back to our original problem, but only for as long as it takes to bring the server back online. Once the failed server is recovered, we fail back and restore load-balanced operations.

Another scenario might involve a retail business that relies on two separate servers. For example, one of them supports Internet Web services, and the other provides a database for inventory, ordering information, financials, and accounting. Both are critical to the business: without Web access, customers cannot browse the catalog and place orders, and without access to the database and accounting applications, the orders cannot be completed and the staff cannot access inventory or make shipping arrangements.

One solution to ensure the availability of all services would be to join the computers into a cluster. This is similar to the solution just discussed, but with a few differences. First we would create a cluster that contains two groups, one on each node. One group contains all the resources we need to run the Web-service applications, such as IP addresses and pooled business logic. The other group contains all of the resources for the database application, including the database itself.

The fail-over policies of each group would specify that both groups can run on either node, thereby assuring their availability should one of the nodes fail.

Model B: The “Hot Spare”

Under this model we obtain maximum availability and performance, but we will have an investment in hardware and software that is mostly idle. The hot spare is therefore a redundant server. No load balancing is put in place here, and all applications and services serve the client on the active server, called the primary node. The secondary node is a dedicated “hot spare,” which must always be kept ready to be used whenever a fail-over occurs. If the primary node fails, the “hot spare” node will immediately detect the failure and pick up all operations from the primary. To continue to service clients at a rate of performance that is close or equal to that of the primary node, the secondary must be configured almost identically to the primary. For all intents and purposes, the two servers, primary and secondary, are like peas in a pod.

This model is ideal for critical database and Web server applications and resources. We can use this model to provide a “hot spare” database node for all servers dedicated to supporting Web access to our databases, such as those servers running Internet Information Services. The expense of doubling the hardware is justified by the protection of clients’ access to the data. If one of your servers fails, a secondary server takes over and allows clients to continue to obtain access to the data. Such configurations place the databases on shared cluster storage; in other words, the primary and secondary servers allow access to the same databases. This model provides the following benefits:

  • Availability   Very high, redundant service.

  • Suggested fail-over policies   Usually we would configure identical twin servers if the budget allowed it, and we would not need to designate a preferred server to fail back to in the event of a disaster. If money is an issue and you are forced to make one server Arnold Schwarzenegger and the other Danny DeVito, then the Arnold becomes the preferred server for all of the groups. When one node has greater capacity than the other, setting the group fail-over policies to prefer the more powerful server ensures that performance remains as high as possible.

  • Suggested fail-back parameters   As just discussed, if the secondary node has identical capacity to the primary node, you can prevent fail-back for all of the groups. But if the secondary node has less capacity than the primary node, you should set the policy for immediate fail-back or for fail-back at a specified off-peak hour.

Model C: The Partial Cluster

This model caters to applications that cannot fail over, hosted on the same servers from which other resource groups are set to fail over. First we need to configure the applications that will not fail over when the server goes down. These applications can be installed on the server or servers that form part of the cluster, but they cannot use the shared disk array on the shared bus. These applications have the usual availability: if the server goes down, the application goes down too. Such applications would not be considered critical; otherwise, they would have to access the shared disk array and be installed on both servers.

A database application that might not be too critical, and so could be excluded from the fail-over, is an accounting database that gets updated by a front-line system only once a day. Accounting staff might lose access to the application for an hour or two, which might be acceptable if they spend part of their day munching on corn chips.

When a server failure occurs, the applications that are not configured with fail-over policies are unavailable, and they will remain unavailable until the node on which they are installed is restored. They will most likely have to be restarted manually or, as services, set to start automatically when the operating system starts. The applications you configured with fail-over policies fail over as usual, according to the policies you set for them. This model provides the following benefits:

  • Availability   High for applications configured for fail-over; normal for others

  • Suggested fail-over policies   Variable

  • Suggested fail-back parameters   Variable

Model D: Virtual Server Only with No Fail-Over

Here we would use the virtual server concept with applications on a single-node server cluster. In other words, this cluster model makes no use of fail-over. You could also call it the cluster that wasn’t. It is merely a means of organizing the resources on a server for administrative convenience and for the convenience of your clients. So what’s the big deal? Well, the main deal is that both administrators and clients can readily see descriptively named virtual servers on the network rather than navigating a list of actual servers to find the shares they need. There are also other advantages, as follows:

  • The Cluster Service automatically restarts the various groups of applications and their dependent resources when the server is restored after a crash. Applications that do not have mechanisms for automatic restart can benefit from the Cluster Service’s automatic restart features.

  • It is also possible to cluster the node with a second node at a future time, and the resource groups are already in place. All you need to do is configure fail-over policies for the groups and leave the virtual servers ready to operate. Often a full-blown fail-over cluster is reduced to a partial cluster because the primary or secondary node goes down hard and requires a few days or longer to repair.

This model lets you locate all your organization’s file resources on a single server, establishing separate groups for each department. Then when clients from one department need to connect to the appropriate share-point, they can find the share as easily as they would find an actual computer.

  • Availability    Normal

  • Suggested fail-over policies   Not applicable

  • Suggested fail-back parameters   Not applicable

Model E: The Hybrid Solution

As the model name suggests, Model E is a hybrid of the models just discussed. By using the hybrid solution model, you can incorporate the advantages of the previous models and combine them in one cluster. By providing sufficient capacity, you can configure several types of fail-over scenarios to coexist on the same two nodes. All fail-over activity occurs as normal, according to the policies you set up.

For administrative convenience, two file-and-print shares in the cluster (which do not require fail-over ability) are grouped logically by department and configured as virtual servers. An application that cannot fail over resides on one of the cluster nodes and operates as normal, without any fail-over protection.

  • Availability  High or very high for resources set to fail over; normal for resources not configured to fail over

  • Suggested fail-over policies  Variable



