Before addressing SQL Server high availability technologies, it is important to understand what the operating system provides, as some SQL Server functionality depends on its foundation: Windows.
Windows Clustering is the name that represents the umbrella of clustering technologies in the Microsoft Windows 2000 Server and Microsoft Windows Server 2003 families. Windows Clustering is a feature in the following versions of the operating system:
Windows 2000 Advanced Server (32-bit)
Windows 2000 Datacenter Server (32-bit)
Windows Server 2003 Enterprise Edition (32-bit and 64-bit)
Windows Server 2003 Datacenter Edition (32-bit and 64-bit)
Windows Clustering comprises two technologies: server clusters and Network Load Balancing clusters.
More Info
Chapter 5, Designing Highly Available Windows Servers, reviews in depth the planning, configuration, and administration of a highly available Windows server and how these factors relate to SQL Server high availability.
A server cluster is Microsoft's form of clustering for the operating system that is designed for availability. It was also referred to as Microsoft Cluster Service (MSCS) when the component was available as part of Microsoft Windows NT 4, Enterprise Edition. It is a collection of servers that, when configured, provide an environment for hosting highly available applications. A server cluster protects against hardware failures that could take down an individual server, operating system errors that cause system outages (for example, driver problems or conflicts), and application failures. Simply put, it protects the business functionality within certain limitations, but not necessarily the server itself.
Once a problem is detected, a cluster-aware application (which must be installed into the cluster) first tries to restart the process on the current server. If that is not possible, it automatically switches to another server in the cluster. The process of switching from one server to another in a cluster is known as a failover. Failovers can also happen manually, as in the case of planned maintenance on a server.
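The restart-then-failover sequence just described can be sketched as follows. This is an illustrative model only, not the actual cluster service implementation; the Node class and handle_resource_failure function are hypothetical names invented for the sketch.

```python
# Illustrative model of the restart-then-failover decision; the Node
# class and function names are hypothetical, not the real cluster API.

class Node:
    def __init__(self, name, online=True, can_restart=False):
        self.name = name
        self.is_online = online
        self._can_restart = can_restart

    def restart(self, resource):
        # Simulate whether restarting the process in place succeeds.
        return self._can_restart

    def start(self, resource):
        # Simulate bringing the resource online on this node.
        return self.is_online

def handle_resource_failure(resource, current_node, other_nodes,
                            max_local_restarts=3):
    """Try restarting locally first; fail over only if that fails."""
    for _ in range(max_local_restarts):
        if current_node.restart(resource):
            return current_node          # recovered in place
    for node in other_nodes:             # failover: move to another node
        if node.is_online and node.start(resource):
            return node
    raise RuntimeError("no node could host the resource")
```

The key point the sketch captures is the ordering: a failover is the fallback, taken only after in-place recovery on the current server has been exhausted.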
Server clustering requires a minimum of two servers; otherwise you essentially have a stand-alone system. A server cluster by itself does not really do much: applications or specific uses, such as SQL Server or file and print servers, must be configured to use the cluster to provide availability. An instance of Microsoft SQL Server 2000 installed on a node that is part of a server cluster will not fail over to another node if it is not configured to use the cluster mechanics; it will act as a stand-alone SQL Server.
All server cluster configurations must appear as a complete hardware solution under the cluster list on the Microsoft Hardware Compatibility List (HCL) for the operating system you choose. The configuration is viewed as a complete solution, not individual parts that can be put together like a jigsaw puzzle. Microsoft can support only a completely HCL-compliant cluster solution. Please consult http://www.microsoft.com/hcl/ for more information.
The main constraint of a server cluster is the distance between redundant parts within the cluster itself. Conceptually, the limit is latency for synchronous operations, which is really a data consistency issue dependent on the underlying technology and its limitations (for example, in an older cluster, the limitations of Small Computer System Interface [SCSI]). In practice, this means that any intracluster communications would be affected, because a server cluster relies on a single image view of the storage subsystem used, whether it is direct attached storage or a storage area network (SAN).
Another important aspect of any solution that has a server cluster as the basis for a back end is the compatibility of the applications that will be running on the cluster. One of the biggest misconceptions about server clusters is that because of automatic failover, they solve most availability problems. As noted in Chapter 1, Preparing for High Availability, a solution is only as good as its weakest link. For most applications to work properly in a cluster, they should be coded to the Microsoft Clustering application programming interface (API) and become cluster-aware, allowing them to react to cluster-specific events. This means that if a problem is detected in the cluster and the process fails over to another server, the application handles the failover gracefully and has minimal or no impact on end users. More information about the Clustering API and coding cluster-aware applications appears in Chapter 5, Designing Highly Available Windows Servers.
When using any prepackaged application from a vendor, such as backup software, consult the vendor to ensure that the application will run properly on your cluster.
A server cluster comprises the following essential elements:
Virtual Server A virtual server is one of the key concepts behind a server cluster: to a client or an application, a virtual server is the combination of the network name and Internet Protocol (IP) address used for access. This network name and IP address are the same for a clustered service, no matter which node it is running on. The actual name and IP address are abstracted so that end users or applications do not have to worry about which node to access. In the event of a failover, the name and IP address are moved to the new hosting node. This is one of the primary benefits of a server cluster. For example, when SQL Server 2000 failover clustering is installed on a cluster, it acts as a virtual server, as does the base cluster itself.
Cluster Node A node is one of the physical servers in the cluster. With Windows Server 2003, the operating system might support more nodes than the version of SQL Server that you are using. Please review Table 3-1 and consult Chapter 6, Microsoft SQL Server 2000 Failover Clustering, for more details.
Table 3-1 Nodes Supported per Operating System

Operating System                           Number of Nodes
Windows 2000 Advanced Server (32-bit)      2
Windows 2000 Datacenter Server (32-bit)    4
Windows Server 2003, all versions          8
Private Network The private network is also commonly referred to as the heartbeat. It is a dedicated intracluster network used solely to check whether the cluster nodes are up and running. It detects node failure, not process failure. The checks occur at intervals known as heartbeats.
Public Network The public network is used for client or application access. The heartbeat process also occurs over the public network to detect the loss of client connectivity, and can serve as a backup for the private network.
Shared Cluster Disk Array The shared cluster disk array is a disk subsystem (either direct attached storage or a SAN) that contains a collection of disks that are directly accessible by all the nodes of the cluster. A server cluster is based on the concept of a shared nothing disk architecture, which means that only one node can own a given disk at any given moment. All other nodes cannot access the same disk directly. In the event that the node currently owning the disk fails, ownership of the disk transfers to another node. This configuration prevents the same data on the disk from being written to simultaneously by multiple nodes, which would cause contention problems.
Shared versus shared nothing is a topic of debate for some. A completely shared environment would require whatever clustered software is accessing the shared disk from n number of nodes to have some sort of a cluster-wide synchronization method, such as a distributed lock manager, to ensure that everything is working properly. Taking that to the SQL Server 2000 level, any given clustered instance of SQL Server has its own dedicated drive resources that cannot also be used by another SQL Server clustered instance.
Quorum Disk The quorum disk is one of the disks that resides on the shared cluster disk array of the server cluster, and it serves two purposes. First, the quorum contains the master copy of the server cluster's configuration, which ensures that all nodes have the most up-to-date data. Second, it is used as a tie-breaker if all network communication fails between nodes. If the quorum disk fails or becomes corrupt, the server cluster shuts down and is unable to start until the quorum is recovered.
LooksAlive Process LooksAlive is an application-specific health check that is different for each application. For SQL Server, this is a very lightweight check that basically says, Are you there?
IsAlive Process The IsAlive check is another, more thorough, application-specific health check. For example, SQL Server 2000 in a cluster runs the Transact-SQL statement SELECT @@SERVERNAME to determine if the SQL Server can respond to requests. It should be noted that in the case of SQL Server, this check does not guarantee that there is not a problem in one of the databases; it just ensures that if someone wants to connect and run a query in SQL Server, they can.
Neither IsAlive nor LooksAlive can be modified to run another query or check.
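The relationship between the two health checks can be illustrated with a short sketch. The real checks are implemented inside the SQL Server resource DLL and cannot be modified; the StubSqlService class, the interval values, and the function names below are hypothetical stand-ins invented for illustration.

```python
# Conceptual sketch of the cluster's two health checks for SQL Server.
# StubSqlService and both interval constants are hypothetical stand-ins;
# the real checks live in the SQL Server resource DLL.

LOOKS_ALIVE_INTERVAL_SEC = 5   # lightweight check, run often (assumed value)
IS_ALIVE_INTERVAL_SEC = 60     # thorough check, run less often (assumed value)

class StubSqlService:
    def __init__(self, running=True, reachable=True):
        self._running = running
        self._reachable = reachable

    def process_running(self):
        return self._running

    def connect_and_query(self, sql):
        if not self._reachable:
            raise ConnectionError("instance not accepting connections")
        return "MYSERVER"      # pretend result of the query

def looks_alive(service):
    # "Are you there?" -- only checks that the service responds at all.
    return service.process_running()

def is_alive(service):
    # Thorough check: SQL Server's version runs SELECT @@SERVERNAME to
    # prove a client can actually connect and get a result back.
    try:
        return service.connect_and_query("SELECT @@SERVERNAME") is not None
    except ConnectionError:
        return False
```

Note what the sketch makes explicit: a hung instance can pass LooksAlive (the process is there) while failing IsAlive (no client could connect and run a query).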
As shown in Figure 3-1, for SQL Server, the external users or applications would connect to the SQL Server virtual IP address or name. In the example, Node 1 currently owns the SQL resources, so transparent to the client request, all SQL traffic goes to Node 1. The solid line connected to the shared disk array denotes that Node 1 also owns the disk resources needed for SQL Server, and the dashed line from Node 2 to the shared disk array means that Node 2 is physically connected to the array, but has no ownership of disk resources.
Now that the basics of the cluster components have been covered from a high level, it is worth mentioning two important cluster concepts that are exposed when a cluster is installed and used by an administrator:
Cluster Resource The most basic, lowest level unit of management in a server cluster. Some of the standard resource types include Dynamic Host Configuration Protocol (DHCP), File Share, Generic Application, Generic Service, IP, Network Name, Physical Disk, Print Spooler, and Windows Internet Naming Service (WINS).
Cluster Group A collection of server cluster resources that is the unit of failover. A group can only be owned by one node at any given time. A cluster group is made up of closely related resources and resembles putting a few documents about the same subject in a folder on your hard disk. For example, each SQL Server 2000 instance installed on a server cluster gets its own group.
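The group-as-unit-of-failover idea can be sketched as follows. The ClusterGroup class and the resource names in it are purely illustrative, not part of any actual cluster API.

```python
# Purely illustrative model of a cluster group: closely related
# resources that always fail over together as one unit.

class ClusterGroup:
    def __init__(self, name, resources):
        self.name = name
        self.resources = list(resources)
        self.owner = None            # exactly one owning node at a time

    def fail_over(self, new_owner):
        # Every resource in the group moves with it; none stays behind.
        self.owner = new_owner
        return [(resource, new_owner) for resource in self.resources]

# A SQL Server 2000 instance's group bundles its address, name, disk,
# and service resources (resource names here are illustrative).
sql_group = ClusterGroup("SQL Server (INST1)",
                         ["IP Address", "Network Name",
                          "Physical Disk", "SQL Server service"])
```

This mirrors the folder analogy above: moving the folder moves every document inside it, and no document can be in two folders at once.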
A Network Load Balancing cluster is a collection of individual servers configured to distribute Transmission Control Protocol/Internet Protocol (TCP/IP) and User Datagram Protocol (UDP) traffic among them according to a set of rules. Network Load Balancing creates an available, scalable solution for applications such as Internet Information Services (IIS) and Internet Security and Acceleration Server (ISA Server), but not necessarily SQL Server.
Unlike a server cluster, a Network Load Balancing cluster does not need to appear as a complete, specialized solution; however, the hardware must still be on the HCL.
From a SQL Server perspective, Network Load Balancing is not always the best choice to provide either scalability or availability, as it presents some challenges in the way that it operates, which differs from the transactional way in which SQL Server operates. For Network Load Balancing and SQL Server to work in, say, a load-balanced write situation, SQL Server would have to be coded to use the shared disk semantics described earlier with some sort of lock manager. Network Load Balancing can, however, be used to load balance read-only SQL Servers (such as catalog servers for a Web site), which also provides greater availability of the catalog servers in the event one of them fails. Network Load Balancing can also be used in some cases to abstract the log shipping role change, or to switch to another replicated SQL Server if those are used as availability technologies in your environment.
Administrators commonly want to combine the functionality of server clusters with Network Load Balancing to provide both availability and scalability, but that is not how the product is currently designed, as each technology has specific uses.
The following concepts are important to understanding how Network Load Balancing works:
Virtual Server Like a server cluster, a virtual server for Network Load Balancing represents the virtualized network name and IP address used for access. The important difference is in how node membership is tracked: a Network Load Balancing node knows which nodes are currently running, but this state is not persisted. In a server cluster, every node knows about all cluster members, whether they are online or not, because state is persisted.
Cluster Node A node is one of the physical servers in the cluster, just like in a server cluster. However, each node is configured to be identical to all the other nodes in the cluster (with all of the same software and hardware) so that it does not matter which node a client is directed to. There is no concept of shared disks as there is with a server cluster. That said, for example, a Web service could be making requests to one file share, but it is not the same as the shared disk array for a server cluster. There can be up to 32 nodes in a Network Load Balancing cluster.
Heartbeat Like a server cluster, a Network Load Balancing cluster has a process for ensuring that all participating servers are up and running.
Convergence This process is used to reach consensus on what the cluster looks like. If one node joins or leaves the Network Load Balancing cluster, the convergence process occurs again, because all nodes in the cluster must know which servers are currently running. Convergence is what makes services using Network Load Balancing highly available: connections that were going to a now-dead node are automatically redistributed to the surviving nodes without manual intervention.
Figure 3-2 shows how a read-only SQL Server could be used with Network Load Balancing. The external users or applications would connect to the Network Load Balancing virtual IP address or name; however, behind the virtual IP address, things are different. Each node has its own database configuration and disk. When a client request comes in, an algorithm at each node applies the port rules and convergence results to drop or accept the request. Finally, one of the nodes services the request and sends the results back.
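The accept-or-drop decision described above can be modeled with a simple sketch. The actual Network Load Balancing algorithm is more sophisticated (it factors in port rules, affinity settings, and load weights); the hash function used here is only a hypothetical stand-in that preserves the essential property.

```python
# Simplified model of the per-node accept-or-drop decision. Every node
# runs the same deterministic function over the client address and the
# converged member list, so exactly one node answers each request; the
# hash below is a hypothetical stand-in for the real NLB algorithm.

import hashlib

def owning_node(client_ip, members):
    """Map a client deterministically to one converged member."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return sorted(members)[int.from_bytes(digest[:4], "big") % len(members)]

def accepts(node, client_ip, members):
    # Each node evaluates this independently; no per-request coordination
    # is needed because all nodes share the same converged member list.
    return node == owning_node(client_ip, members)
```

Because the decision depends only on the client address and the converged member list, removing a failed node from that list during convergence automatically redistributes its clients to the survivors.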