8.2 The problems clustering does and does not solve | Mission-Critical Microsoft Exchange 2003: Designing and Building Reliable Exchange Servers (HP Technologies)

A common mistake when employing cluster technology is to view clustering as the answer to every downtime problem an Exchange deployment faces. Proceeding on this notion can be fatal. Clustering can only address issues within the domain of the cluster service technology. If you have poor disaster-recovery practices or software that is plagued with bugs, no clustering technology is going to save you; there is no magic clustering dust! Clustering technology, especially that in Windows Server, can only help solve the issues it was designed to address. Simply stated, clusters most directly help you reduce single points of failure. A stand-alone server running Exchange has several points of failure. Hardware components such as the system board, processors, nonredundant power supplies, and network cards may fail at some point in the life of a server. For example, industry-standard servers do not offer lock-step, fault-tolerant processor redundancy, in which the system experiences zero downtime when a processor fails. This type of technology is only available in high-end systems such as HP’s (formerly Tandem, then Compaq) Himalaya servers. By deploying a cluster solution, you can eliminate many single points of failure. In some cases, software issues can even be thought of as single points of failure, and clusters may address some of these issues as well.

Another downtime problem that system operators often face is planned outages. Planned outages are often not “charged” against the availability of a deployment, but they are nevertheless downtime during which the system is not available for client access. These outages take the form of routine maintenance, rolling upgrades, configuration management “blockpoints,” configuration changes, and hardware or software upgrades. Deploying clustered Exchange servers can help address planned outages by allowing services running on one cluster node to be failed over to another node or nodes while maintenance activities are performed. As I mentioned earlier in the book, Microsoft OTG saw a huge benefit of clustering for planned downtime: when you spend a lot of time “dogfooding” prerelease product code, you have a great deal of planned downtime as you migrate from build to build. The ability to fail services over to another node in the cluster can be invaluable. In most cases, you can perform comprehensive software or hardware upgrades and routine maintenance without users even knowing about it. Help with planned outages is perhaps the most important benefit clustering offers, yet it is, unfortunately, rarely used. This benefit alone may be enough to justify an organization’s investment in clustering Exchange Server.
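The accounting point about planned outages can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement of any real deployment: the outage hours are hypothetical figures chosen only for illustration, and the function and variable names are assumptions of this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Return the fraction of the period during which the service was reachable."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must lie between 0 and the period length")
    return 1.0 - downtime_hours / period_hours


# Hypothetical yearly figures for a stand-alone server (illustrative only).
unplanned_hours = 4.0   # failures that are "charged" against availability
planned_hours = 24.0    # maintenance windows that often are not

# What gets reported when planned outages are excluded from the accounting.
reported = availability(unplanned_hours)

# What clients actually experience: every hour offline counts.
experienced = availability(unplanned_hours + planned_hours)

print(f"reported availability:    {reported:.4%}")
print(f"experienced availability: {experienced:.4%}")
```

The gap between the two figures is the downtime that failing services over to another node during maintenance is meant to recover.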

Clustering does not solve many other problems, such as poor training, poor procedures, many software issues, or major catastrophes. In addition, clustering cannot help when the infrastructure services that directly support Exchange, such as WINS, DNS, AD, or network services, fail. Clustering is not a replacement for sound disaster-recovery practices.

In the MSCS implementation, every cluster is built on servers with shared storage, such as shared SCSI or Fibre Channel (FC). Shared storage presents an additional single point of failure that a cluster cannot protect you from. When each node in a cluster attaches to the shared storage through a single controller, that controller becomes a single point of failure that the cluster cannot tolerate. Technologies such as switched Fibre Channel and redundant I/O paths can compensate by allowing redundant controllers in each cluster node, each attached to a separate switch fabric in the FC SAN. However, the storage subsystem itself still represents a single point of failure. Beyond reducing single points of failure and minimizing planned outages, clustering technology may not solve other key issues that cause downtime for your Exchange deployment. Clustering and the technologies upon which it is built can also add a significant degree of complexity to your environment. As you look further into clustering Exchange Server, it is important that you understand and evaluate whether clustering can address your key issues.
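One way to see why shared storage remains the limiting factor is a simple series/parallel availability model. This is a rough sketch under strong assumptions: component failures are independent, failover is instantaneous, and the availability figures are hypothetical values picked for illustration, not data for any product. The function names are likewise inventions of this example.

```python
from math import prod


def series(*parts):
    """All parts must be up for the system to be up: availabilities multiply."""
    return prod(parts)


def redundant(*parts):
    """Any one part is enough: the system is down only if every part is down."""
    return 1.0 - prod(1.0 - p for p in parts)


# Hypothetical component availabilities (illustrative assumptions only).
NODE = 0.99       # one server: system board, processors, NICs, and so on
STORAGE = 0.999   # the shared storage subsystem

# A stand-alone server depends on its own hardware and its storage.
standalone = series(NODE, STORAGE)

# A two-node cluster makes the servers redundant, but both nodes
# still depend on the same shared storage.
clustered = series(redundant(NODE, NODE), STORAGE)

print(f"stand-alone:      {standalone:.5f}")
print(f"two-node cluster: {clustered:.5f}")
print(f"storage ceiling:  {STORAGE:.5f}")
```

In this model the cluster improves on the stand-alone server, but its availability can never exceed that of the storage it shares, no matter how many nodes are added.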