8.3 Windows cluster technology fundamentals


Most people were first introduced to cluster technology via Digital Equipment Corporation’s (now Compaq/HP) OpenVMS Cluster. Also known as a VAXCluster, this technology was introduced around 1984 and revolutionized the way we view computer systems. What is now called an OpenVMS Cluster allows a collection of systems to be connected to shared devices, with every system able to read from and write to all of those devices. Two key attributes that many look for in a clustered environment are performance and scalability. You expect that, after investing extra money in the hardware and software required to support clusters, you will receive additional performance and scalability benefits such as parallel processing, I/O shipping (mirroring I/O requests across systems), or other techniques that a group of systems operating as one can provide. In some environments, such as UNIX or OpenVMS, this can be a reality. The current truth in the Windows Server environment, however, is that cluster services do not deliver the true performance and scalability benefits you may desire. One benefit that Windows clustering does afford is availability. By giving applications and services fail over and resiliency capabilities, clusters provide increased application availability. For applications like Exchange Server, clustering can provide protection from specific causes of downtime, such as hardware failures.

The last benefit you may expect cluster technology to provide is increased manageability, which can come in many forms. For a clustered Exchange server, the most useful is the ability to move services and resources from one cluster node to another while maintenance operations such as firmware, driver, operating system, service pack, and application updates are performed. Once the maintenance is complete, services and resources can be moved back to their original node. When you connect a group of systems into a unified system and present a single-system virtual server to clients requesting services, you can share the load and the availability of those services across the entire system without the users of those services, or even the administrative staff, needing any knowledge of the cluster configuration. When one node experiences a failure, the other nodes can fail over the services or resources that the node owned. In most cases, this is completely transparent to the clients.

8.3.1 Microsoft’s three-pronged cluster “theology”

When Windows Server (NT) originally shipped, Microsoft’s discussion of clustering technology was fairly straightforward. In recent times, however, things have become a bit more convoluted. Besides the original clustering technology released with Windows NT Enterprise Edition in 1997, Microsoft has added several other technologies to its “clustering” portfolio. Microsoft looks at this space from two viewpoints. First, there is horizontal scalability: you increase the scale of the application or solution through the addition of “nodes,” or processing entities. Microsoft’s other view on cluster solutions looks at availability: you provide service continuance via the ability to fail over applications and services. These two viewpoints have shaped Microsoft’s offering of cluster solutions. Horizontal scalability and availability are provided across a set of three Microsoft technologies: Network Load Balancing (NLB), Component Load Balancing (CLB), and Cluster Service. These technologies constitute Microsoft’s three-pronged strategy for clustering, illustrated in Figure 8.1.

Figure 8.1: Microsoft’s three-pronged approach to clustering.

Network Load Balancing (NLB)

The NLB service load balances incoming Internet Protocol traffic across clusters of up to 32 nodes. NLB enhances both the availability and scalability of Internet server-based programs such as Web servers, streaming media servers, and Terminal Services. By acting as the load-balancing infrastructure and providing control information to management applications built on top of Windows Management Instrumentation (WMI), NLB can integrate seamlessly into existing Web server farm infrastructures. NLB also serves as an ideal load-balancing architecture for use with Microsoft Application Center Server in distributed Web farm environments. From an Exchange perspective, NLB is of little interest to us; the exception is the ISP/ASP community, which may benefit from using NLB clusters to host Exchange front-end servers. Since Exchange front-end servers are essentially IIS running the Exchange messaging protocols (such as IMAP, POP3, SMTP, and HTTP), they benefit in the same manner as other Web farm applications: through the ability to host Exchange protocols in an NLB scheme.

Component Load Balancing (CLB)

Microsoft’s CLB technology provides dynamic load balancing of middle-tier application components that use COM+. With CLB, COM+ components can be load balanced over multiple nodes to dramatically enhance the availability and scalability of software applications. Unlike Cluster Service and NLB, which are built into the Windows operating system, CLB is a feature of Microsoft Application Center. It is designed to provide high availability and scalability for transactional components. CLB is scalable up to eight servers and is ideally suited to building distributed solutions. For Exchange administrators, CLB offers nothing in the form of enhanced availability for Exchange deployments. CLB is focused on line-of-business (LOB) applications that involve the deployment of a middle tier for application/business logic.

Microsoft Cluster Service

With the introduction of Windows NT Enterprise Edition in 1997, Microsoft included several optional components. Microsoft Cluster Service (MSCS) is one of these components. The goal of MSCS is to extend the operating system with high-availability features seamlessly and to support applications without modification. Microsoft specifically excluded two features from MSCS. First, MSCS is not lock-step fault tolerant: it does not provide instantaneous migration of running applications. Applications running on MSCS are very unlikely to achieve levels such as 99.99999% availability (about 3 to 4 seconds of downtime per year), although four nines (99.99%) is a real possibility for Exchange. Second, MSCS is not able to recover shared state between client and server. In other words, work in progress during a failure will most likely have to be repeated.

8.3.2 A closer look at Microsoft Cluster Server

From a foundational architecture point of view, MSCS is based on a shared-nothing model. In the world of clustering technology, there are two basic programming approaches to resource management: shared nothing and shared disk. The architectural approach dictates how servers participating in a cluster manage and use both local and cluster devices and resources. A shared-disk implementation allows all cluster participants (nodes) to own and access cluster resources; applications running on any node can compete for and access the same disks. If two nodes in a shared-disk cluster need the same data, the data is either read separately by each node or copied from one node to another. Applications running in this model must have a method of synchronizing and serializing access to the shared data to prevent conflicts within a system (such as multiple-processor access on an SMP system) or across the cluster. Usually, this is accomplished with a service that provides locking of the shared resource, often called a distributed lock manager, which tracks and grants access to cluster resources for all applications and can detect and resolve conflicting accesses. This operation, however, requires additional system resources on each node in the cluster. In addition, a shared-disk approach typically requires special hardware configurations or support for multiple hosts accessing the same devices. Since MSCS does not implement the shared-disk approach, we will forgo further discussion of it. It is worth noting, however, that both the shared-disk and shared-nothing approaches have advantages and disadvantages that could entail quite a lengthy discussion. This is a bit of a “religious” argument, however, and is not germane to our discussion of Microsoft technology.

In the shared-nothing cluster (illustrated in Figure 8.2), each server owns and manages local devices as specific cluster resources. Devices that are common to the cluster and physically available to all nodes, like SAN logical units (LUNs), are owned and managed (reserved) by only one node at a time. For resources to change ownership, a complex reservation and contention protocol is followed and implemented by cluster services running on each node. MSCS is based on the shared-nothing clustering model.

Figure 8.2: Comparing shared nothing to shared disk cluster architectures.

In an MSCS cluster, a resource is defined as any physical or logical component that can be brought on-line and taken off-line, managed in a cluster, hosted by only one node at a time, and moved between nodes.

Each node has its own memory, system disk, and operating system installation, and each node starts out owning a subset of the cluster’s resources. If a node fails, another node takes ownership of the failed node’s resources (a process known as fail over). Microsoft Cluster Server then registers the network address for the resource on the new node so that client traffic is routed to the system that is available and now owns the resource. When the failed node is later brought back on-line, MSCS can be configured to redistribute resources and client requests appropriately (a process known as failback). Microsoft selected the shared-nothing model because its developers felt it provided for easier management of resources and of the applications using those resources. The shared-nothing model can also be implemented on industry-standard hardware, without the proprietary configurations a shared-disk model would require (again, a point religiously argued by those who prefer shared disk over shared nothing).
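To make the fail over mechanics more concrete, the following fragment is a minimal sketch (in C, against the Win32 Cluster API in clusapi.h, linked with ClusAPI.lib) of how an administrative tool could move a resource group to another node on demand, which is essentially what the cluster does during fail over and failback. The group name "EVS1" and node name "NODE2" are hypothetical, and error handling is kept to a minimum.

/* Minimal sketch: moving a resource group to another cluster node on demand.
   The group name "EVS1" and node name "NODE2" are illustrative only. */
#include <windows.h>
#include <clusapi.h>
#include <stdio.h>

int wmain(void)
{
    /* NULL opens a connection to the cluster this machine is a member of. */
    HCLUSTER hCluster = OpenCluster(NULL);
    if (hCluster == NULL)
    {
        wprintf(L"OpenCluster failed: %lu\n", GetLastError());
        return 1;
    }

    HGROUP hGroup = OpenClusterGroup(hCluster, L"EVS1");   /* resource group */
    HNODE  hNode  = OpenClusterNode(hCluster, L"NODE2");   /* target node    */

    if (hGroup != NULL && hNode != NULL)
    {
        /* Ask the cluster service to take the group off-line on its current
           owner and bring it on-line on NODE2. */
        DWORD status = MoveClusterGroup(hGroup, hNode);
        wprintf(L"MoveClusterGroup returned %lu\n", status);
    }

    if (hNode != NULL)  CloseClusterNode(hNode);
    if (hGroup != NULL) CloseClusterGroup(hGroup);
    CloseCluster(hCluster);
    return 0;
}

MoveClusterGroup can return ERROR_IO_PENDING when the move has been accepted but is still in progress; a production tool would poll GetClusterGroupState until the group settles on the target node.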

Cluster resources and resource groups

The basic unit of management in a Microsoft cluster is the resource. Resources are logical or physical entities within a cluster system that can be changed in state from on-line to off-line, are manageable by the cluster services, and are owned by one cluster node at a time. Cluster resources include physical entities such as hardware devices (network interface cards and disks) and logical entities such as server names, network names, IP addresses, and services. A cluster typically has both physical and logical resources configured. Within the MSCS framework, resources are grouped into logical units of management and dependency called resource groups. A resource group usually comprises both logical and physical resources such as virtual server names, IP addresses, and disk resources. Resource groups can also contain only cluster-specific resources, such as the cluster time service or resources used for managing the cluster itself. The key point in the shared-nothing model is that a resource group can be owned by only one node at a time, and the resources that are part of that resource group must exist on (be owned by) that node.

The shared-nothing model prevents different nodes within the cluster from simultaneously owning resource groups or the resources within a resource group. As mentioned earlier, resource groups also maintain dependency relationships between the resources contained within each group. This is because resources in a cluster very often depend on the existence of other resources in order to start or function. For example, a virtual server or network name must have a valid IP address in order for clients to access that resource; therefore, in order for the network name or virtual server to start (or not fail), the IP address it depends on must be available. This is known as resource dependency. Within a resource group, the dependencies among resources can be quite simple or very complex. Resource dependencies (shown later in Figures 8.6 and 8.8) are maintained in the properties of each resource and allow the cluster service to manage how resources are taken off-line and brought on-line. Resource dependencies cannot extend beyond the resource group to which they belong; for example, a virtual server cannot depend on an IP address that exists in a resource group other than its own. This restriction exists because resource groups can be brought on-line and off-line and moved from node to node independently of one another. Each resource group also maintains a clusterwide policy that specifies its preferred owner node(s) (the nodes in the cluster it prefers to run on) and its possible owner node(s) (the nodes to which it may fail over in the event of a failure). Resource groups are the fundamental unit of management within a Microsoft Windows cluster. As such, it is important that you have a keen understanding of how they function and operate.
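The dependency rules just described map directly onto Cluster API calls. The sketch below (with hypothetical names, no error handling, and without the private properties such as address, subnet mask, and DNS name that real IP Address and Network Name resources require) creates the two resources in the same group and declares that the network name depends on the IP address.

/* Minimal sketch: expressing a resource dependency between two resources in
   the same resource group. Resource and group names are illustrative only. */
#include <windows.h>
#include <clusapi.h>

int wmain(void)
{
    HCLUSTER hCluster = OpenCluster(NULL);                 /* local cluster */
    HGROUP   hGroup   = OpenClusterGroup(hCluster, L"EVS1");

    /* Create both resources inside the same group; dependencies cannot span
       resource groups. "IP Address" and "Network Name" are built-in MSCS
       resource types. */
    HRESOURCE hIp   = CreateClusterResource(hGroup, L"EVS1 IP Address",
                                            L"IP Address", 0);
    HRESOURCE hName = CreateClusterResource(hGroup, L"EVS1 Network Name",
                                            L"Network Name", 0);

    /* The network name cannot come on-line until the IP address it depends
       on is on-line; the cluster service honors this ordering whenever the
       group is brought on-line or taken off-line. */
    AddClusterResourceDependency(hName, hIp);

    CloseClusterResource(hName);
    CloseClusterResource(hIp);
    CloseClusterGroup(hGroup);
    CloseCluster(hCluster);
    return 0;
}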

Figure 8.3: A basic two-node MSCS configuration.

Key cluster terminology

There has been much confusion in the world of cluster technology, not just over basic terminologies, but also about architectural implementations. Table 8.1 highlights some key terminology for MSCS.

Table 8.1: Key Terminology for MSCS

Resource

The smallest unit that can be defined, monitored, and maintained by the cluster. Examples are physical disk, IP address, network name, file share, print spool, generic service, and application. Resources are grouped together into a resource group. The cluster uses the state of each resource to determine whether a failover is needed.

Resource group

A collection of interdependent resources that logically represents a client/server function. The smallest unit that can fail over between nodes.

Resource dependency

A resource may depend on other resources. A resource is brought on-line after any resource on which it depends. A resource is taken off-line before any resources on which it depends. All dependent resources must fail over together.

Quorum resource

Stores the cluster log data and the application registry checkpoint data used to transfer state information between cluster nodes. Used by the cluster service to determine which node can continue running when nodes cannot communicate. Currently, in Windows Server, the only quorum-capable resource is the physical disk.

Active/passive

A mode of configuration and operation for cluster resource groups and virtual servers. An Active/Passive configuration allows a virtual server instance to fail over to a standby node in the cluster that is not running any other virtual servers. Active/Passive clusters always provide an “inactive” node available for failovers. An Active/Passive configuration is forced by the cluster when an application has no MSCS support (absence of a resource DLL or use of the generic cluster resource DLL).

Active/active

A mode of configuration and operation for cluster resource groups and virtual servers. An Active/Active configuration allows for multiple virtual servers per node and all cluster nodes are active and running virtual servers. Active/Active configurations also allow for additional failover and configuration flexibility. Applications must provide a cluster resource DLL in order to support Active/Active operation.

Cluster membership

Term used to describe cluster participation and the orderly addition and removal of active nodes to and from the cluster.

Virtual server

The network entity used by the client for the cluster resource group—a combination or collection of configuration information and resources such as network name and IP address resources.

Cluster node

Physical system that participates in a cluster and is capable of owning and managing cluster resources and resource groups.

Microsoft Cluster Service components

As shown in Figure 8.4, MSCS is implemented as a set of independent and somewhat isolated components (device drivers and services), all within the confines of the Cluster Service. This set of components sits on top of the Windows operating system and runs as a service. Fundamentally, the MSCS architecture comprises three key components: the Cluster Service, resource monitors, and resource DLLs. Additionally, the Cluster Administrator allows Independent Software Vendors (ISVs) to develop extension DLLs for management capability, but I will not address these here. By using this design approach, Microsoft avoided many complexities that might have been encountered in other designs, such as system scheduling and processing dependencies between the Cluster Service and the operating system. When the Cluster Service is layered on Windows Server, the operating system provides some basic functions needed to support clustering. These include dynamic network resource support, support for making and querying reservations on disks, file system support for disk mounting and unmounting, and shared resource support for the I/O subsystem. Table 8.2 provides a brief overview of each of these components.

Figure 8.4: MSCS architecture and components.

Table 8.2: MSCS Components

Node Manager

Maintains resource group ownership across cluster nodes, based on resource group node preferences and the availability of cluster nodes.

Resource Monitor

An interface between the Cluster Service and the cluster resources. The Cluster Service uses resource monitors to communicate with the resource DLLs, using the cluster resource API and RPCs. Each resource monitor runs as a process separate from the Cluster Service.

Failover Manager

Works in conjunction with resource monitors to manage resource functions within the cluster such as failovers and restarts.

Checkpoint Manager

Maintains and updates application states and registry keys on the cluster quorum resource.

Cluster Communications Manager

Manages and maintains communication between cluster nodes.

Cluster Configuration Database Manager

Maintains and ensures coherency of the cluster database on each cluster node that includes important cluster information such as node membership, resources, resource groups, and resource types.

Event Processor

Processes events relating to state changes and requests from cluster resources and applications.

Cluster Membership Manager

Manages cluster node membership and polls cluster nodes to determine state.

Cluster Event Log Manager

Replicates system event log entries across all cluster nodes.

Global Update Manager

Provides updates to the Configuration Database Manager to ensure cluster configuration integrity and consistency.

Cluster Object Manager

Provides management of all cluster service objects and the interface for cluster administration.

Cluster Log Manager

Works with the Checkpoint Manager to ensure that the recovery log on the cluster quorum disk is current and consistent.

The resource monitor

The resource monitor provides an interface between the Cluster Service and the cluster resources, and it runs as a separate process. The Cluster Service uses the resource monitor to communicate with the resource DLLs. The DLL handles all communication with the resource, thus shielding the Cluster Service from resources that misbehave or stop functioning. Multiple copies of the resource monitor can be running on a single node, thereby providing a means by which unpredictable resources can be isolated from other resources. A resource monitor runs in a process separate from the Cluster Service; this protects the Cluster Service from resource failures. For Exchange 2000/2003, the default cluster resource monitor process manages each Exchange virtual server running on the cluster. Via the Exchange resource DLL (EXRES.DLL), this resource monitor can communicate with the Exchange service components of the virtual server and can provide application-specific intelligence and instrumentation.

The resource DLL

Closely related to our discussion of the resource monitor is the resource DLL. The resource monitor and resource DLL communicate using the MSCS cluster resource API, which is a collection of programmatic entry points, callback functions, and related structures and macros used to manage cluster resources. Applications that implement their own resource DLLs to communicate with the Cluster Service and that use the cluster API to request and update cluster information are defined as cluster-aware applications. Applications and services that do not use the cluster or resource APIs and cluster-control code functions are unaware of clustering and have no knowledge that MSCS is running. These cluster-unaware applications are generally managed as generic applications or services. Both cluster-aware and cluster-unaware applications run on a cluster node and can be managed as cluster resources. However, only cluster-aware applications can take advantage of features offered by the Cluster Service through the cluster API. Cluster-aware applications can report status to the resource monitor proactively or reactively, respond gracefully to requests to be brought on-line or taken off-line, and respond accurately to the IsAlive and LooksAlive requests issued by the cluster service (which I will discuss in more detail later in the chapter). Cluster-aware applications should also implement Cluster Administrator extension DLLs, which contain implementations of interfaces from the Cluster Administrator extension API. A Cluster Administrator extension DLL allows an application to be configured into the Cluster Administrator tool (Cluadmin.exe).

Implementing custom resource and Cluster Administrator extension DLLs allows for specialized management of the application and its related resources and enables the system administrator to install and configure the application more easily.
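To illustrate the division of labor between the two health-check entry points, here is a minimal sketch of the LooksAlive and IsAlive callbacks a resource DLL exports. The probe helpers are invented placeholders rather than real Microsoft functions, and the registration plumbing (the Startup entry point and the function table) is omitted; EXRES.DLL implements this pattern for Exchange in its own, far more elaborate, way.

/* Minimal sketch of the health-check entry points of a cluster resource DLL.
   CheckServiceHandleQuickly and VerifyServiceIsResponding are hypothetical
   placeholders, not part of any Microsoft API. */
#include <windows.h>
#include <resapi.h>

static BOOL CheckServiceHandleQuickly(RESID ResourceId)
{
    UNREFERENCED_PARAMETER(ResourceId);
    return TRUE;    /* placeholder: cheap check, e.g., is the process alive? */
}

static BOOL VerifyServiceIsResponding(RESID ResourceId)
{
    UNREFERENCED_PARAMETER(ResourceId);
    return TRUE;    /* placeholder: deeper check, e.g., issue a test request */
}

/* Called frequently by the resource monitor, so it must be cheap. TRUE means
   "the resource looks healthy; no deeper check is needed right now." */
BOOL WINAPI SampleLooksAlive(RESID ResourceId)
{
    return CheckServiceHandleQuickly(ResourceId);
}

/* Called less often, or when LooksAlive raises doubt; may be more thorough.
   Returning FALSE tells the cluster service the resource has failed, which
   can trigger a restart or a fail over of the owning resource group. */
BOOL WINAPI SampleIsAlive(RESID ResourceId)
{
    return VerifyServiceIsResponding(ResourceId);
}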

Cluster fail over modes of operation

With Microsoft Cluster Server, two types of fail over are supported: resource fail over and service fail over. Both allow for increased system availability. The more capable of the two, resource fail over, takes advantage of the cluster APIs that enable applications to be cluster aware. This is provided via a resource DLL that can be configured to allow customizable fail over of the application. Resource DLLs provide a means for Microsoft Cluster Service to manage resources; they define resource abstractions, interfaces, and management. In resource fail over mode, it is assumed that the service is running on both nodes of the MSCS cluster (also known as Active/Active) and that a specific resource such as a database, a virtual server, or an IP address fails over, not the entire service. Many applications from ISVs, as well as some from Microsoft, do not have resource DLLs available to make them cluster aware. To compensate, Microsoft provides a generic service resource DLL that gives basic functionality to these applications running on Microsoft Cluster Service (Windows Notepad clustering is indeed a reality!). The generic resource DLL provides the service fail over mode and limits the application to running on one node only (also known as Active/Passive). In service fail over mode, a service is defined to MSCS as a resource. Once defined, the MSCS Failover Manager ensures that the service is running on only one node of the cluster at any given time. The service is part of a resource group that uses a common name throughout the cluster; as such, all services running in the resource group are available to any network clients using the common name.
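As a sketch of the service fail over mode, the fragment below defines a cluster-unaware service to MSCS using the built-in Generic Service resource type. The group and resource names are hypothetical, and the ServiceName private property the resource needs before it can actually come on-line is not set here (that step could be done in Cluster Administrator or through the resource control APIs).

/* Minimal sketch: defining a cluster-unaware service as a Generic Service
   resource. Group and resource names are illustrative only. */
#include <windows.h>
#include <clusapi.h>

int wmain(void)
{
    HCLUSTER hCluster = OpenCluster(NULL);
    if (hCluster == NULL)
        return 1;

    /* Create a resource group to act as the fail over unit for the service. */
    HGROUP hGroup = CreateClusterGroup(hCluster, L"LOB Service Group");

    /* "Generic Service" is the built-in resource type backed by Microsoft's
       generic service resource DLL; the service it wraps runs on only one
       node at a time. Its ServiceName private property must still be
       configured before the resource can come on-line. */
    HRESOURCE hSvc = CreateClusterResource(hGroup, L"LOB Service",
                                           L"Generic Service", 0);

    /* Once configured, bringing the group on-line hands the service to the
       Failover Manager, which keeps it running on exactly one node. */
    if (hGroup != NULL)
        OnlineClusterGroup(hGroup, NULL);

    if (hSvc != NULL)   CloseClusterResource(hSvc);
    if (hGroup != NULL) CloseClusterGroup(hGroup);
    CloseCluster(hCluster);
    return 0;
}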

Active/Active versus Active/Passive

The discussion of the fail over types (service and resource failovers), as well as the differences between cluster-aware and cluster-unaware applications, has been oversimplified into two basic terms: Active/Active and Active/Passive. When deploying cluster solutions with Windows Server, the level of functionality and flexibility an application can enjoy in a clustered environment relates directly to whether it supports Active/Passive or Active/Active configuration. Active/Active means that an application can run on all nodes in the cluster at the same time, with application services running and servicing users from each node. To do this, an application must support communicating with the cluster services via its own resource DLL, and it must be architected so that specific resource units can be treated independently and failed over to other nodes. Per the previous discussion, specific support from the application vendor (whether Microsoft or a third party) is required for an application to run in an Active/Active cluster configuration. In some cases, Active/Passive configurations are used because the application is either architecturally limited, lacks a specific resource DLL, or both. In an Active/Passive configuration, the application runs on only one cluster node at a time, or there is always at least one node in the cluster reserved for fail over. However, for Exchange 2000/2003, Active/Passive is the preferred mode of operation, and Exchange offers full cluster API support in an Active/Passive configuration. If an application does not have its own resource DLL, it has no awareness of the cluster software; likewise, the cluster software has no application awareness and simply treats a generic service or group of services and resources as a fail over unit.

It is important to understand that many applications support both Active/Active and Active/Passive configurations (Exchange Server 2000/2003, for example). If an application supports both modes of operation, there are often pros and cons to whichever mode you choose for deployment. For example, in our later discussions of Exchange 2000/2003, you will find that although both modes are supported, there are some severe caveats to Active/Active configurations of Exchange Server (2000/2003 only). The choice of which mode to deploy will be highly influenced by deployment trade-offs and by limitations in the applications for which you choose to implement MSCS.
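One simple way to see how virtual servers are distributed across a cluster (and therefore whether it is effectively running Active/Active or Active/Passive) is to list each resource group together with the node that currently owns it. The following sketch does exactly that with the Cluster API; output formatting and error handling are deliberately minimal.

/* Minimal sketch: listing each resource group and its current owner node. */
#include <windows.h>
#include <clusapi.h>
#include <stdio.h>

int wmain(void)
{
    HCLUSTER hCluster = OpenCluster(NULL);       /* local cluster */
    if (hCluster == NULL)
        return 1;

    HCLUSENUM hEnum = ClusterOpenEnum(hCluster, CLUSTER_ENUM_GROUP);
    if (hEnum == NULL)
    {
        CloseCluster(hCluster);
        return 1;
    }

    DWORD index = 0;
    for (;;)
    {
        WCHAR name[256];
        DWORD type, cch = 256;
        DWORD status = ClusterEnum(hEnum, index++, &type, name, &cch);
        if (status == ERROR_NO_MORE_ITEMS)
            break;
        if (status != ERROR_SUCCESS)
            continue;

        HGROUP hGroup = OpenClusterGroup(hCluster, name);
        if (hGroup != NULL)
        {
            WCHAR owner[256];
            DWORD cchOwner = 256;
            /* Returns the group's state and fills in the owning node's name. */
            CLUSTER_GROUP_STATE state = GetClusterGroupState(hGroup, owner, &cchOwner);
            wprintf(L"Group %-30s owner %-15s state %d\n", name, owner, (int)state);
            CloseClusterGroup(hGroup);
        }
    }

    ClusterCloseEnum(hEnum);
    CloseCluster(hCluster);
    return 0;
}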

Obviously, MSCS could warrant a book all by itself (in fact, there are many good books on the subject). While I don’t wish to delve any further into MSCS (in order to keep the focus on Exchange Server), Exchange administrators should appreciate how much a keen understanding of MSCS contributes to their success in deploying Exchange clusters. To that end, Table 8.3 provides a list of cluster references for your enjoyment and bedtime reading.

Table 8.3: MSCS Resources and Information

Resource

Link/Source/Pointer

Introducing Microsoft Cluster Service (MSCS) in the Windows Server 2003 Family

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnnetserv/html/wns-introclustermscs.asp

Microsoft Cluster Service FAQ

http://www.microsoft.com/NTServer/Support/faqs/clustering_faq.asp

Windows Server 2003 Server Cluster Architecture

http://www.microsoft.com/windowsserver2003/techinfo/overview/servercluster.mspx

Technical Overview of Windows Server 2003 Clustering Services

http://www.microsoft.com/windowsserver2003/techinfo/overview/clustering.mspx

Quorums in Microsoft Windows Server 2003 Clusters

http://www.microsoft.com/windowsserver2003/techinfo/overview/clusterquorums.mspx

Microsoft Windows Clustering: Storage Area Networks

http://www.microsoft.com/windowsserver2003/techinfo/overview/san.mspx

Geographically Dispersed Clusters in Windows Server 2003

http://www.microsoft.com/windowsserver2003/techinfo/overview/clustergeo.mspx

Microsoft Cluster Service API Reference

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mscs/mscs/clres_v1_functions.asp



