What's all the buzz about high availability? Why is everyone so intent on achieving the Utopia of server availability, five nines? It really all comes down to one thing: economics. The economics of today's Internet-centric world demand that critical services and servers be available 100% of the time. In the absence of perfection (which no one has delivered yet), the bar for highly available solutions has been set at five nines: 99.999% uptime. What exactly does that equate to, though? Five nines availability allows critical services to be offline for no more than 5.25 minutes per year. That's an unbelievably low number, no matter how you look at it, but that's the goal of highly available solutions. As you might know, 5 minutes per year is barely enough time to apply a hot fix, much less a service pack.

The answer to this problem is highly available server solutions. When discussing highly available solutions, there are two distinctly different ways to approach the problem: one based on hardware and one based on software. Windows Server 2003 provides you with two types of software-based high availability: clustering and Network Load Balancing (NLB). In this chapter, we examine the pertinent details of clustering as it relates to supporting Exchange Server 2003 computers.
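To see where that figure comes from, you can work out the arithmetic directly. Here is a quick Python sketch (purely illustrative; the function and constant names are ours, not part of any Windows API):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

print(round(allowed_downtime_minutes(99.999), 2))  # 5.26 -- roughly the 5.25 minutes cited above
print(round(allowed_downtime_minutes(99.9), 2))    # 525.6 -- nearly nine hours at "three nines"
```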
Of course, having any solution in place, highly available or not, is of little use if disaster strikes and removes it from operation. Environmentally or intentionally caused disasters are a fact of life that you simply cannot afford to ignore. Although you might not be able to prevent your servers from experiencing a disaster condition, you can prevent extended downtime and the temporary unavailability of the network by implementing a well-planned and practiced disaster-recovery plan, as we discuss later in this chapter.

Clustering groups independent servers into one large collective entity that is accessed as if it were a single system. Incoming requests for service can be distributed evenly across multiple cluster members or handled by one specific cluster member. The Microsoft Cluster Service (MSCS) in Windows Server 2003 provides highly available, fault-tolerant systems through failover. When one of the cluster members (nodes) cannot respond to client requests, the remaining cluster members respond by distributing the load among themselves, thus servicing all existing and new connections and requests. In this way, clients see little, if any, disruption in the service being provided by the cluster. Cluster nodes are kept aware of the status of other cluster nodes and their services through the use of heartbeats. A heartbeat is used to keep track of the status of each node and also to send updates in the configuration of the cluster. Clustering is most commonly used to support database, messaging, and file/print servers. Windows Server 2003 supports up to eight nodes in a cluster.

High-Availability Terminology

Although we don't typically take pages within a chapter to define key terms, the terminology associated with clustering is somewhat esoteric, and a good understanding of it is key to successfully implementing and managing any clustered solution. Although the following list of terms is not all-inclusive, it represents some of the more important ones you should understand:

- Node: An individual server that is a member of a cluster.
- Cluster resource: An application, service, or hardware device that is defined and managed by the cluster service.
- Resource group (cluster group): A collection of cluster resources that is brought online, taken offline, and failed over as a single unit.
- Virtual server: The network name and IP address by which clients access a clustered service, independent of the physical node hosting it; Exchange runs as an Exchange virtual server (EVS).
- Heartbeat: The periodic message that cluster nodes exchange to track one another's status and to carry cluster-configuration updates (see the sketch after this list).
- Failover: The process of moving resource groups from a failed node to a surviving node so that service continues.
- Failback: The process of returning resource groups to their preferred node after that node returns to operation.
- Quorum resource: The storage location that holds the definitive cluster-configuration data.
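To make the heartbeat mechanism just defined a bit more concrete, here is a minimal conceptual sketch in Python. This is not how MSCS is implemented; the class, the node names, and the five-second timeout are all illustrative assumptions:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a node is presumed failed (assumed value)

class ClusterMonitor:
    """Tracks the last heartbeat received from each cluster node."""

    def __init__(self, nodes):
        self.last_seen = {node: time.monotonic() for node in nodes}

    def record_heartbeat(self, node):
        # Each heartbeat both confirms the node is alive and can carry
        # updates to the cluster configuration.
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        # Nodes that have missed their heartbeat window are candidates for failover.
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

monitor = ClusterMonitor(["NODE1", "NODE2"])
monitor.record_heartbeat("NODE1")
print(monitor.failed_nodes())  # [] until a node exceeds the timeout
```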
How Does Clustering Work?

Clustering uses a group of between two and eight servers that all share a common storage device. Recall that a cluster resource is an application, service, or hardware device that is defined and managed by the cluster service. The cluster service (MSCS) monitors these cluster resources to ensure that they are operating properly. When a problem occurs with a cluster resource, MSCS attempts to correct the problem on the same cluster node. If the problem cannot be corrected (such as a service that cannot be successfully restarted), the cluster service fails the resource, takes the cluster group offline, moves it to another cluster node, and restarts the cluster group there (a conceptual sketch of this restart-and-failover sequence follows the list below). MSCS clusters also use heartbeats to determine the operational status of other nodes in the cluster. Two clustering modes exist:

- Active/active: Every node in the cluster actively services clients; when a node fails, its load is taken over by the remaining active nodes.
- Active/passive: At least one node sits idle as a standby; when an active node fails, a passive node takes ownership of its resources and continues servicing clients.
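The following Python sketch expresses that restart-then-failover sequence as simple pseudologic. It is a simplification of the behavior, not the actual MSCS algorithm; the restart threshold and node names are assumptions for illustration:

```python
class Resource:
    """A cluster resource: an application, service, or device managed by the cluster service."""

    def __init__(self, name, recoverable=False):
        self.name = name
        self.recoverable = recoverable

    def restart(self):
        # In a real cluster, this would attempt to restart the service on the current node.
        return self.recoverable

def handle_failure(resource, current_node, other_nodes, restart_threshold=3):
    """Sketch of MSCS behavior: retry on the same node, then fail the group over."""
    for _ in range(restart_threshold):
        if resource.restart():
            return current_node          # problem corrected locally; no failover needed
    # The resource could not be restarted: fail it, take the group offline,
    # move the group to a surviving node, and restart it there.
    return other_nodes[0] if other_nodes else None

print(handle_failure(Resource("SMTP service"), "NODE1", ["NODE2"]))  # NODE2
```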
Exchange Clustering Specifics

You might be wondering which mode is better: active/passive or active/active. When using the active/active mode, you can deploy Exchange in a cluster with only two nodes, which is the maximum supported for active/active operation. Each node in that cluster runs an instance of the Exchange virtual server (EVS; recall the definition of the cluster virtual server), and each node must be capable of hosting both instances at once. Should one of the nodes fail, the single remaining node is loaded with all resources from both servers, possibly resulting in an overloaded condition and causing it to fail as well. In addition, to ensure that reliable failover occurs, each node in the active/active cluster can host a maximum of only 1,900 active mailboxes, far fewer than an Exchange server might normally hold.

On the other hand, if you implement an active/passive mode cluster (as Microsoft recommends), you can achieve a much more reliable and robust solution. An active/passive cluster must contain at least one active node and at least one passive node; the cluster cannot exceed eight nodes total. As an example, suppose you had an eight-node active/passive cluster. You might configure six nodes as active and the remaining two as passive. This gives you a multilayer backup plan if more than one active node should fail within a short period of time.

Finally, you must also bear in mind that a single Exchange server is limited to four storage groups. This is typically not a problem when you use an active/passive cluster, but it becomes an acute problem when you use an active/active cluster. If each of the two active nodes hosts three storage groups, a failover leaves one node holding six, two more than the limit; two of the failed node's storage groups will not be mounted, making those mailboxes or public folders unavailable to clients. This is one more reason why active/passive clustering is the best way to cluster your Exchange servers.

Cluster Models

Three distinctly different cluster models exist for configuring your new cluster. You must choose one of the three models at the beginning of your cluster planning because the chosen model dictates the storage requirements of your new cluster. The three models are presented in the following sections in order of increasing complexity and cost.

Single-Node Cluster

The single-node cluster model, shown in Figure 7.1, has only one cluster node. The cluster node can make use of local storage or an external cluster storage device. If local storage is used, the local disk is configured as the cluster storage device. This storage device is known as a local quorum resource. A local quorum resource does not make use of failover and is most commonly used as a way to organize network resources in a single network location for administrative and user convenience. This model is also useful for developing and testing cluster-aware applications.

Figure 7.1. The single-node cluster can be used to increase service reliability and also to prestage cluster resource groups.

Despite its limited capabilities, this model does offer the administrator some advantages at a relatively low entry cost:

- The cluster service monitors the node's resources and can automatically restart applications and services after a failure, increasing service reliability.
- Network resources are organized in a single network location for administrative and user convenience.
- Resource groups can be prestaged on the single node and later moved into a larger multinode cluster as needs grow.
- It provides an inexpensive platform for developing and testing cluster-aware applications.
Single-Quorum Cluster

The single-quorum cluster model, shown in Figure 7.2, has two or more cluster nodes that are configured so that each node is attached to the cluster storage device. All cluster-configuration data is stored on a single cluster storage device. All cluster nodes have access to the quorum data, but only one cluster node runs the quorum disk resource at any given time.

Figure 7.2. The single-quorum cluster shares one cluster storage device among all cluster nodes.

Majority Node Set Cluster

The majority node set cluster model, shown in Figure 7.3, has two or more cluster nodes that are configured so that the nodes might or might not be attached to one or more cluster storage devices. Cluster-configuration data is stored on multiple disks across the entire cluster, and the cluster service is responsible for ensuring that this data is kept consistent across all of the disks. All quorum traffic travels in unencrypted form over the network using server message block (SMB) file shares. This model provides the advantage of being able to locate cluster nodes in two geographically different locations; they do not all need to be physically attached to the shared cluster storage device.

Figure 7.3. The majority node set cluster model is a high-level clustering solution that allows for geographically dispersed cluster nodes.

Even if all cluster nodes are not located in the same physical location, they appear as a single entity to clients. The majority node set cluster model provides the following advantages over the other clustering models:

- Cluster nodes do not all need to be attached to a shared cluster storage device, so nodes can be placed in geographically separate locations.
- Cluster-configuration data is replicated across disks on multiple nodes rather than residing on a single quorum disk that could become a single point of failure.
However, you must abide by several requirements when implementing majority node set clusters to ensure that they are successful.
The primary disadvantage to this clustering model is that if too many nodes fail, the cluster loses its quorum and the cluster itself fails. Table 7.1 shows the maximum number of cluster nodes that can fail before the cluster itself fails.

Table 7.1. Node Failures Tolerated by a Majority Node Set Cluster

Nodes in Cluster    Maximum Node Failures Before the Cluster Fails
2                   0
3                   1
4                   1
5                   2
6                   2
7                   3
8                   3
As shown in Table 7.1, the majority node set cluster remains operational as long as a majority (more than half) of the initial cluster nodes remains available.
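The majority rule behind Table 7.1 reduces to a one-line calculation: a cluster of n nodes needs n // 2 + 1 survivors to keep quorum, so it can tolerate the remainder failing. A quick Python check of the table (the function is ours, for illustration only):

```python
def max_tolerable_failures(total_nodes: int) -> int:
    """Nodes that can fail before a majority node set cluster loses quorum."""
    majority = total_nodes // 2 + 1   # more than half of the initial nodes must survive
    return total_nodes - majority

for n in range(2, 9):
    print(n, "nodes ->", max_tolerable_failures(n), "failures tolerated")
# Output matches Table 7.1: 2->0, 3->1, 4->1, 5->2, 6->2, 7->3, 8->3
```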
Cluster Operation Modes

You can choose from four basic cluster operation modes when using a single-quorum cluster or a majority node set cluster. These operation modes are specified by defining the cluster failover policies accordingly, as discussed in the next section, "Cluster Failover Policies." The four basic cluster operation modes are listed here:

- Failover pair: An application runs on one node of a two-node pair, and the other node stands ready to accept the application if the first node fails.
- Hot standby (N+I): One or more spare nodes remain idle, ready to take over the resources of any active node that fails.
- Failover ring: Every node runs an instance of the application, and when a node fails, its workload moves to the next node in a predefined sequence.
- Random: The cluster service selects a surviving node at random to host the resource groups of a failed node.
Now that you've been introduced to failover, let's examine cluster failover policies.

Cluster Failover Policies

Although the actual configuration of failover and failback policies is discussed later in this chapter, it is important to discuss them briefly here to properly acquaint you with their use and function. Each resource group within the cluster has a prioritized listing of the nodes that are supposed to act as its host (a conceptual sketch of how this list drives failover follows the list below). You can configure failover policies for each resource group to define exactly how each group behaves when a failover occurs. You must configure three settings:

- The preferred owners list: The prioritized list of nodes on which the resource group should run.
- The failover threshold and period: How many times the group can fail over within a specified span of time before the cluster service leaves it offline.
- The failback policy: Whether, and during which hours, the group returns to its preferred node after that node comes back online.
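Conceptually, choosing a failover target is a walk down the group's prioritized node list, skipping the failed node and any node that is not currently online. Here is a minimal Python sketch, with node names assumed for illustration (real failover also honors the threshold, period, and failback settings above):

```python
def failover_target(preferred_owners, online_nodes, failed_node):
    """Return the highest-priority online node other than the one that failed."""
    for node in preferred_owners:        # the list is ordered by priority
        if node != failed_node and node in online_nodes:
            return node
    return None                          # no suitable host: the group stays offline

owners = ["NODE1", "NODE2", "NODE3"]     # prioritized host list for one resource group
print(failover_target(owners, {"NODE2", "NODE3"}, "NODE1"))  # NODE2
```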