Introduction to Clustering | Mastering Microsoft Exchange Server 2007 SP1

Conceptually, clustering is fairly simple to describe. A cluster is a collection of computers that work together to provide a single service. In some implementations of a cluster, multiple nodes all share the workload; the Windows Server 2003 Network Load Balancing feature is an example. In other implementations, one node provides the actual service (this is called the active node) and another node is waiting to take over that service if the first node fails (this is the passive node). An example of this is how Exchange 2007 mailbox servers are clustered.

Clustering is certainly nothing new. Microsoft has supported Exchange clustering since the Exchange 5.5 days, and other vendors, such as Digital Equipment Corporation (now part of HP), Tandem (now part of HP), and IBM, have supported clustering technologies for the last 25 years.

The purpose of clustering varies somewhat, but one of the primary goals is almost always to provide a higher level of availability. Network Load Balancing, for example, not only provides higher availability for the services it offers but can also let you scale further and support more clients.

In the following sections, we introduce the clustering concepts supported by the Exchange 2007 Mailbox server role and describe some of the advantages and disadvantages of these types of clustering. If these technologies sound like "the next step" for your organization, we encourage you to read further through the Microsoft website and white papers for implementation details. Full implementation details of mailbox clustering are beyond the scope of this book.

Is Clustering the Next Step?

Organizations that consider implementing Exchange clusters usually do so because they want higher availability for their e-mail services. The most important question to ask, though, is whether clustering really is the next step toward higher availability.

Carefully look at all of the reasons that you have had unscheduled downtime over the past year or two. Determine the root cause for each of these downtime incidents. For each of these, ask yourself if redundant hardware or redundant copies of the database would have spared you that downtime. You may be surprised at what you find. Many organizations find that clustering would solve almost none of their unscheduled downtime problems.

Next, examine the additional costs involved in implementing clusters. These costs include hardware, software, storage, consulting, and training. Will these additional costs be justified by the amount of downtime you have actually had, weighed against the level of availability you require?
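The review described in the last two paragraphs comes down to simple arithmetic. The Python sketch below is not a tool from this book. It assumes you have already classified each outage by root cause. The incident list, the cost-per-minute figure, the added-cost figure, and the 5-minute failover allowance are all hypothetical placeholders that you would replace with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    cause: str                 # root cause from your post-incident review
    minutes: float             # length of the unscheduled outage
    cluster_would_help: bool   # would redundant hardware or a database copy have avoided it?


def avoidable_minutes(outages, failover_minutes=5.0):
    """Downtime a cluster could plausibly have removed.

    An avoidable outage still costs one failover, so that time is subtracted.
    The 5-minute default is an assumption, not a measured figure.
    """
    return sum(max(o.minutes - failover_minutes, 0.0)
               for o in outages if o.cluster_would_help)


def clustering_justified(outages, cost_per_minute, added_cluster_cost, failover_minutes=5.0):
    """True if avoidable downtime is worth at least the extra clustering cost
    (hardware, software, storage, consulting, training) over the same review period."""
    return avoidable_minutes(outages, failover_minutes) * cost_per_minute >= added_cluster_cost


# Hypothetical incident history covering the review period
history = [
    Outage("DNS misconfiguration", 90, False),
    Outage("failed disk controller", 240, True),
    Outage("building power outage", 120, False),
]
print(avoidable_minutes(history))                 # 235.0
print(clustering_justified(history, 50, 40_000))  # False
```

In this made-up history, only one of the three outages would have been prevented by a cluster. That result is the pattern the previous paragraphs warn many organizations will find.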

Clustering: The Good, the Bad, and the Ugly

We certainly don't want to discourage anyone from implementing Exchange clustering. If the technology is right for your organization and will provide you with value, then you should seriously consider clustering your important Exchange 2007 mailbox servers. In the next sections, we give you our opinion of the good, the bad, and the ugly points of clustering.

The Good

Many people who implement Exchange clustering become raving fans of the technology and would have the word clustering tattooed on their foreheads if their spouses or significant others would allow it. Let's look at a few of the high points and selling points of implementing a cluster.

It increases availability by bringing failed mailbox servers back online sooner.
It gives you the ability to meet service level agreements.
It reduces scheduled downtime windows for mailbox servers.
Clustered continuous replication allows for replicated copies of mailbox data.
It provides automatic failover of Exchange mailbox services.
Failover times are usually under five minutes in duration.
Outlook in cached Exchange mode can make a failover invisible to desktop users.

The Bad

Naturally, there are a few points that are going to count against clustering when you begin planning or implementing a clustered environment:

Exchange 2007 only supports clustering the Mailbox role.
There may be additional software costs (Windows and Exchange server licenses for the passive node; the Enterprise Editions of Windows and Exchange are required).
Single copy clusters require shared storage, such as a storage area network (SAN), whether that storage is attached through iSCSI or Fibre Channel. This can significantly increase costs.
Some third-party software (backup, message hygiene, and so on) is not supported on clustered nodes.
Additional expertise is required to manage clustering technologies and shared storage.
Some typical management functions (such as stopping and starting a service) are different on clusters than they are on typical Windows servers.

The Ugly

Why the "ugly" category? Well, the "bad" category is all technology and cost. These are things that we, as information technology professionals, can address through budgeting and training. However, "the ugly" rears its head at the management layer. Salespeople often sell clustering technologies to senior management, and the technical implementation and requirements get a little fuzzy at this layer.

When clustering is "sold" to management, expectations are often very high for what can be delivered. Management often views this as "magic" or the "answer to all our problems."
Failover times are usually between two and five minutes, but possibly longer. The perception is often that failover is always instantaneous.
Exchange clustering does not protect an organization from power failures, network infrastructure failures, Active Directory problems, DNS problems, and so on.
Additional expertise and training are required to support the additional complexity. Often this is in the form of consultants and classes. This is a factor that management often does not hear about.

Overcoming the Bad and the Ugly

Both the bad and ugly aspects of clustering can be overcome. The most important factor is to separate fact from fiction and hype in all of your management discussions, briefings, presentations, and documentation. Don't oversell the technology or feed the perception that Exchange clustering is "magic e-mail pixie dust." Many of us get so caught up in the "gee, it will be so great to have this new technology" frenzy that we are afraid to present all of the facts for fear of losing our funding.

Temper the enthusiasm of salespeople, vendors, and senior management by presenting a realistic view of clustering features, functions, and benefits.

The Intimidation Factor

One interesting factor of implementing a cluster is the intimidation factor. Junior or inexperienced administrators often look at clustering technology as a black box. Though they may have been trained to manage Windows servers, they are not trained in clustering.

Interestingly enough, this can actually improve availability. Quite a bit of the downtime we have encountered was caused by junior administrators taking initiative (incorrectly) and making a situation worse. With clusters, though, the junior administrator will frequently leave them alone and wait for someone with more clustering experience.

Single Copy Clusters

In Exchange 2000/2003, we simply referred to a clustered mailbox server as an Exchange cluster. In Exchange 2007, however, there are two different approaches to implementing clustering. The first is a single copy cluster (SCC), which is identical to an Exchange 2000/2003 cluster. The "single copy" part of the name refers to the fact that there is only a single copy of each Exchange 2007 mailbox or public folder database. That database resides on some type of shared storage array.

With single copy clusters, there can be between two and eight nodes in a cluster. At any given time, there is always at least one node that is in the passive role and is ready to assume responsibility for a clustered mailbox server.

Note

Terminology update: If you used Exchange 2000/2003 in a clustered environment, then you are used to the term Exchange virtual server. This term has been replaced by clustered mailbox server.

Figure 15.12 shows a simple illustration of a two-node active-passive cluster. In this diagram, both the active node (NODE01) and the passive node (NODE02) are connected to the shared storage system. NODE01 owns the clustered mailbox server's name (EXCHANGE01), the clustered mailbox server's IP address (192.168.254.10), the shared disks (F:\ and G:\), and the other Exchange server resources such as the system attendant and the information store services.

Figure 15.12: Simple two-node clustered mailbox server

Take note of the clustered mailbox server's name, EXCHANGE01. This name is part of a clustered resource and is not permanently assigned to any physical node in the cluster. This name, EXCHANGE01, is the name that Outlook clients use when connecting to the Exchange server.

All of the nodes in a single copy cluster are connected to both a public network and a private network. The private network is used for the cluster heartbeat. The heartbeat is what the nodes in the cluster use to determine if the other nodes are still alive and healthy. One of the two nodes will also own the cluster name and cluster IP address, though these are not shown in Figure 15.12. The same node that owns the generic cluster resources also owns the Q:\ drive resource, or the quorum disk. The quorum disk holds the cluster database and information about which node in the cluster owns each clustered resource. This helps to keep a passive node from taking over a resource that is owned (and in use) by an active node of the cluster.
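To make the heartbeat idea concrete, here is a toy Python model of how a node might decide that a peer has stopped responding. The class name, the explicit `now` timestamps, and the 5-second timeout are assumptions for illustration. They do not reflect the actual Windows cluster service implementation or its default timing.

```python
class HeartbeatMonitor:
    """Toy model of heartbeat tracking over the cluster's private network.

    The timeout is a hypothetical parameter, not the cluster service default.
    """

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_seen: dict[str, float] = {}

    def record(self, node: str, now: float) -> None:
        """Called whenever a heartbeat arrives from `node`."""
        self.last_seen[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        """A node counts as alive only if it was heard from within the timeout."""
        seen = self.last_seen.get(node)
        return seen is not None and now - seen <= self.timeout


monitor = HeartbeatMonitor(timeout_seconds=5)
monitor.record("NODE01", now=0)
print(monitor.is_alive("NODE01", now=3))  # True
print(monitor.is_alive("NODE01", now=6))  # False: heartbeats missed, failover candidate
print(monitor.is_alive("NODE02", now=6))  # False: never heard from
```

Passing the current time in explicitly keeps the sketch deterministic. A real implementation would read a clock and send heartbeats on a timer.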

If physical NODE01 or any of the services running on it fails, a failover is initiated. A failover occurs when the passive node in the cluster determines that something is not working properly on the active node. The passive node then takes over the resources necessary to bring the clustered mailbox server online. So, in the case of Figure 15.12, NODE02 would assume control of the shared disks, the clustered mailbox server name, the clustered mailbox server IP address, and the Exchange services. The failover takes as long as necessary to mount the shared disks, start the Exchange services, and mount the databases found on the shared disks. Depending on how busy the server was at the time of the failure, up to 20MB worth of outstanding transactions for each storage group may have to be committed to the database.
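The failover sequence can be sketched as a change of ownership for a group of resources. This Python model is a simplification under stated assumptions. The `Cluster` class and its quorum-style ownership dictionary are hypothetical. The bring-online order (storage, then IP address and name, then services and databases) is a simplified reading of the description above, not the exact resource dependency tree that Exchange 2007 creates.

```python
from dataclasses import dataclass

# Simplified bring-online order (an assumption for illustration): shared storage
# first, then the clustered mailbox server's network identity, then Exchange services.
ONLINE_ORDER = [
    "Physical Disk F:", "Physical Disk G:",
    "IP Address 192.168.254.10", "Network Name EXCHANGE01",
    "System Attendant", "Information Store", "mailbox databases",
]


@dataclass
class Cluster:
    nodes: list   # physical nodes, e.g. ["NODE01", "NODE02"]
    owner: dict   # quorum-style record: resource group -> owning node
    alive: set    # nodes currently passing heartbeat checks

    def fail_over(self, group: str):
        """Move `group` to a surviving node if its current owner is down.

        Returns (owning node, resources that had to be brought online there).
        """
        current = self.owner[group]
        if current in self.alive:
            return current, []  # owner is healthy; the group stays put
        target = next((n for n in self.nodes if n in self.alive and n != current), None)
        if target is None:
            raise RuntimeError(f"no surviving node can host {group}")
        # Record the new owner before bringing anything online, so that no
        # other node tries to claim the same disks.
        self.owner[group] = target
        return target, list(ONLINE_ORDER)


cluster = Cluster(nodes=["NODE01", "NODE02"], owner={"EXCHANGE01": "NODE01"}, alive={"NODE02"})
node, steps = cluster.fail_over("EXCHANGE01")
print(node)      # NODE02
print(steps[0])  # Physical Disk F:
```

The name EXCHANGE01 and its IP address move with the group rather than staying with a physical node. Outlook clients therefore keep connecting to EXCHANGE01 after the failover.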

Clustered Continuous Replication Clusters

Clustered continuous replication (CCR) is a new technology in Exchange 2007. CCR clusters use many of the same principles as single copy clusters but without shared storage. Instead, as each log file fills up on the active node, the passive node copies that log file and replays it into a backup copy of the database that resides on the passive node. CCR clusters are always two-node clusters: at any given time, one node is the active node and the other is the passive node.

Figure 15.13 shows a simple CCR configuration. In this diagram, both the active and passive nodes have local storage. This storage could conceivably be on a SAN or iSCSI storage system, but it would not be shared between the two nodes. The only thing that is actually shared is a folder on a third server. This file share contains information that is used when the two nodes cannot communicate over the private network, and it helps to ensure that a failover does not occur merely because the private network was lost. This cluster type is called a majority node set. In Figure 15.13, the witness file share is on a file server, but it could just as easily be hosted on a server running the Client Access or Hub Transport server role.
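Counting votes is one way to see why the file share witness matters. The sketch below assumes that each node and the file share witness contribute one vote apiece in a majority node set. It also assumes that a group of nodes may keep running clustered resources only if it holds a strict majority of those votes. The function names are hypothetical, and this is an illustration rather than the cluster service's actual algorithm.

```python
def partition_votes(nodes_in_partition: set, reaches_witness: bool) -> int:
    """Each reachable node contributes one vote; the file share witness adds one more."""
    return len(nodes_in_partition) + (1 if reaches_witness else 0)


def has_quorum(votes: int, total_votes: int) -> bool:
    """A partition may run clustered resources only with a strict majority of votes."""
    return votes > total_votes // 2


TOTAL_VOTES = 3  # two CCR nodes plus the file share witness

# The nodes lose contact with each other; only NODE01 can still reach the witness share.
print(has_quorum(partition_votes({"NODE01"}, reaches_witness=True), TOTAL_VOTES))   # True
print(has_quorum(partition_votes({"NODE02"}, reaches_witness=False), TOTAL_VOTES))  # False
```

With only two nodes and no witness, a loss of communication would leave each node holding exactly half the votes, and neither could safely take ownership. The witness supplies the tie-breaking third vote, so only one node ends up mounting the databases.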

Figure 15.13: Simple clustered continuous replication configuration

As transaction logs are filled and written to disk on the active node, they are copied to the passive node. The Microsoft Exchange Replication service on the passive node then replays those transactions into the backup copy of the databases.

In the event of a failure of the active node, the passive node will mount the databases and roll forward any transaction logs that have not yet been committed. It will then assume the role of the production Exchange server (in this case EXCHANGE01).
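The copy-and-replay cycle can be modeled by tracking transaction log generation numbers. In this hypothetical Python sketch, `StorageGroupCopy` and its method names are invented for illustration. `copy_queue()` counts logs the passive node has not yet received. `replay_queue()` counts logs it has received but not yet replayed into its database copy.

```python
from dataclasses import dataclass


@dataclass
class StorageGroupCopy:
    """Toy model of one CCR storage group, tracked by log generation numbers."""
    active_last: int = 0    # last log closed on the active node
    copied_last: int = 0    # last log copied to the passive node
    replayed_last: int = 0  # last log replayed into the passive database copy

    def close_log(self) -> None:
        self.active_last += 1

    def copy(self, max_logs: int) -> None:
        """Copy up to max_logs closed logs; a small value simulates a slow or broken link."""
        self.copied_last = min(self.active_last, self.copied_last + max_logs)

    def replay(self) -> None:
        """Replay every copied log into the passive database."""
        self.replayed_last = self.copied_last

    def copy_queue(self) -> int:
        return self.active_last - self.copied_last

    def replay_queue(self) -> int:
        return self.copied_last - self.replayed_last

    def fail_over(self) -> int:
        """Roll forward everything already copied, then report logs that never arrived."""
        self.replay()
        return self.copy_queue()


sg = StorageGroupCopy()
for _ in range(10):
    sg.close_log()
sg.copy(max_logs=8)
print(sg.copy_queue(), sg.replay_queue())  # 2 8
print(sg.fail_over())                      # 2
```

Logs still in the copy queue at failover time are the gap that the transport dumpster on the Hub Transport servers helps to cover. A copy queue that keeps growing is a sign that the passive copy is falling out of sync.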

Though we really like CCR clustering, there are a few positive and negative points you should be aware of when deciding to use this technology:

The database on the passive node may not be completely synchronized with the active node; it might be a minute or two behind (depending on activity levels). The transport dumpster on the Hub Transport servers helps to minimize any data loss.
The hardware on the active node and the passive node does not need to be identical, but it should be comparable with respect to CPU, memory, network, and disk capacity.
Databases are replicated on a storage-group-by-storage-group basis, and you can have only one database per storage group in a CCR environment.
CCR clusters reduce the complexity of Exchange mailbox server clustering somewhat because they do not require shared storage.
CCR clusters can increase administrative complexity if the passive node of the cluster becomes out of sync with the active node. This can happen if, for example, log files are not being copied to the passive node, leaving the passive copy of the databases out of sync.
CCR clusters do not scale as well as SCC clusters. For example, a four-node SCC cluster could support three active Exchange nodes. To support three active clustered mailbox servers using CCR, you would need six servers.
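The server-count comparison in the last point is simple arithmetic. This sketch assumes one clustered mailbox server per active node and applies the two-to-eight-node limit given earlier for single copy clusters. It does not count the server hosting the file share witness or the shared storage. The function names are hypothetical.

```python
SCC_MIN_NODES, SCC_MAX_NODES = 2, 8  # per the text: two to eight nodes per single copy cluster


def scc_servers(active_cms: int, passive_nodes: int = 1) -> int:
    """Servers for one single copy cluster hosting `active_cms` clustered mailbox servers."""
    if passive_nodes < 1:
        raise ValueError("a single copy cluster needs at least one passive node")
    nodes = active_cms + passive_nodes
    if not SCC_MIN_NODES <= nodes <= SCC_MAX_NODES:
        raise ValueError(f"{nodes} nodes is outside the supported range")
    return nodes


def ccr_servers(active_cms: int) -> int:
    """Each CCR cluster is two nodes hosting one clustered mailbox server."""
    return 2 * active_cms


print(scc_servers(3), ccr_servers(3))  # 4 6
```

The gap widens as you add mailbox servers: one shared passive node covers several active nodes in an SCC, while every CCR clustered mailbox server brings its own passive node.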