1.3 The four pillars of mission-critical


Let us start down the path by looking at the definition of mission critical as applied to Microsoft Exchange Server 2003. In my mind, a mission-critical system running Exchange Server must stand on four key pillars: reliability, scalability, security, and manageability (as shown in Figure 1.1).

1.3.1 Reliability

When one considers the elements of a mission-critical server, the first and most obvious is reliability. Reliability is, quite simply, the ability to survive any interference with normal operations, and that interference need not be catastrophic. For example, in March 1999, organizations worldwide fell victim to the Melissa virus, which shut down messaging systems everywhere. While Melissa did not specifically destroy any data, it prevented many users from accessing their messaging systems: these systems were either shut down by support staff scrambling to remove the virus from information stores or were unavailable because of the mail storms the virus created. In most cases, Exchange Server information stores did not need to be restored from tape, and once the virus had been eradicated, the system returned to normal operation.

The other extreme is the scenario in which your Exchange server or group of servers is completely destroyed, leaving nothing but a memory. Recovering from a disaster of this type may mean starting from scratch with new hardware, software, backup sets, and possibly even a new location. Tolerating a disaster of this magnitude requires the utmost in proactive planning, documentation, testing, and best practices. Reliable Exchange servers must be recoverable in the case of a catastrophic loss, but they must also provide continued service through even the most minor interruption. Mission-critical servers must be resilient against not only data loss but also data corruption. When Exchange Server data becomes corrupted, support staff must have options for returning the private or public information store to its last known good state.

Reliability is normally expressed as the probability that a given system will provide normal service and operation during a given period, usually stated as a percentage over a period of 1 year. For example, you might boast that your e-mail system achieved 99.99% reliability in 1999, although many are not sure what this really means. We also often hear the terms “four nines” or “five nines” of reliability thrown around today. One rule of thumb that I have often heard is that “five nines” (99.999%) translates to about 5 minutes of downtime per year. That is one method of quick reference, but Table 1.3 may explain this a little better.

Table 1.3: Reliability and Downtime—24/7 (365 Days Per Year or 8,760 Hours)

Availability/Reliability (“Nines”)    Downtime Per Year
90% (one nine)                        876 hours (36.5 days)
99% (two nines)                       87.6 hours (3.65 days)
99.9% (three nines)                   8.76 hours
99.99% (four nines)                   3,154 seconds (52.56 minutes)
99.999% (five nines)                  315 seconds (5.26 minutes)
99.9999% (six nines)                  31.5 seconds
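
The downtime figures in Table 1.3 follow directly from simple arithmetic. As a quick sketch (the function name here is my own, not from any tool), they can be reproduced as follows:

```python
# Downtime per year implied by an availability percentage, as in Table 1.3.
# Assumes 24/7 operation: 365 days x 24 hours = 8,760 hours per year.

HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (90.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
    hours = downtime_hours(pct)
    print(f"{pct}% -> {hours:.4f} hours ({hours * 60:.2f} minutes)")
```

Note that “five nines” works out to 5.26 minutes, slightly more than the 5-minute rule of thumb.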

The U.S. government, IEEE, hardware vendors, and many standards organizations have guidelines for calculating reliability for electronics and computer systems as well. Table 1.4 provides some basics for calculation of system availability that are used industrywide.

Table 1.4: Reliability Measurement Formulas

Failure rate (λ)
  λ = 1/MTBF
  where MTBF = mean time between failures

Reliability (R)
  R = e^(-λT)
  where e = base of the natural logarithm, λ = failure rate, and T = time (the period to be measured)

Reliability of a parallel system (Rp)
  Rp = 1 - [(1 - R1) x (1 - R2) x ... x (1 - Rn)]
  where R1 ... Rn = reliability (R) of each system in parallel

Reliability of a series system (Rs)
  Rs = R1 x R2 x ... x Rn
  where R1 ... Rn = reliability (R) of each system in the series

Source: MIL-STD and IEEE specifications.
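
The formulas in Table 1.4 can be sketched in a few lines of Python. The function names and the 50,000-hour MTBF in the example are my own illustrative choices, not part of any standard:

```python
# A sketch of the reliability formulas in Table 1.4. MTBF and T must use
# the same unit (hours here).
import math

def failure_rate(mtbf_hours: float) -> float:
    """lambda = 1 / MTBF"""
    return 1.0 / mtbf_hours

def reliability(rate: float, t_hours: float) -> float:
    """R = e^(-lambda * T): probability of surviving the period T."""
    return math.exp(-rate * t_hours)

def parallel(*r: float) -> float:
    """Rp = 1 - (1-R1)(1-R2)...(1-Rn): redundant systems; any one suffices."""
    prod = 1.0
    for ri in r:
        prod *= (1.0 - ri)
    return 1.0 - prod

def series(*r: float) -> float:
    """Rs = R1 * R2 * ... * Rn: every system in the chain must survive."""
    prod = 1.0
    for ri in r:
        prod *= ri
    return prod

# Example: a server with a 50,000-hour MTBF over one year (8,760 hours).
lam = failure_rate(50_000)
r_one = reliability(lam, 8_760)
print(f"Single server:   R = {r_one:.4f}")
print(f"Two in parallel: R = {parallel(r_one, r_one):.4f}")
print(f"Two in series:   R = {series(r_one, r_one):.4f}")
```

The parallel and series cases show why redundancy raises reliability while chains of dependent components lower it.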

Closely tied to reliability is recoverability. When we discuss recoverability for Exchange Server, it is assumed that the server has already become unavailable and we need to recover it. A reliable Exchange server must have a high degree of recoverability: a system that fails only once a year, but requires 2 weeks to recover, is not very reliable. Recoverability must address not only such subjects as reinstalling software and restoring data from tape, but also whatever is necessary to get users up and running as soon as possible after a failure. Better yet, recoverability should endeavor to restore normal operation before users are even aware of the outage. As long as a user has access to his or her inbox and can send and receive mail, that user could be considered recovered. Consider, for example, a scenario in which an Exchange server’s information store becomes corrupted. One angle of attack could be to immediately move as many mailboxes as possible to a spare backup Exchange server. Once the move is complete, these users are back in operation, in many cases experiencing little or no loss in productivity. While no actual recovery in the form of a restore from tape has occurred, the system achieves a higher degree of availability than it would by simply starting over and restoring the database from tape. Recoverability should be viewed in the same light as fault tolerance versus fault recovery.

Fault-tolerance technologies, such as RAID disk arrays, ECC memory, redundant power supplies, clustering, and so forth, provide a means to tolerate faults without downtime. ECC memory, for example, can correct single-bit memory errors and detect double-bit errors by including extra bits (i.e., 72 versus 64 bits) in the memory subsystem that are used to verify the integrity of data being read from system memory. Thus, if a bit is incorrect, the memory system can correct the error without the system crashing. Another example is redundant network interface cards (NICs). Server vendors like HP provide technologies that allow two NICs in a server to act as failover partners for each other, which is called NIC teaming. If the primary NIC fails or experiences a cable fault, the secondary NIC will transparently pick up where the primary left off, all without interrupting the operation of the server.

Fault tolerance is very important, but you also need a second line of defense: fault recovery. Fault recovery is a method of ensuring that the system will rapidly recover in the event of a critical fault that could not be tolerated. For example, PC-based servers available today do not provide on-line processor redundancy similar to what is available in HP’s Himalaya NonStop systems. The next best thing, however, is the ability to recover quickly in the event that one processor in a multiple-processor system fails. In the past, if a processor failed, the system would simply halt, and an administrator would need to diagnose and repair the failed processor before returning the system to normal operation. Today, however, many server vendors provide off-line processor recovery: when a processor fails, the system BIOS senses the condition and reboots the system with the bad processor disabled. The end result for multiprocessor systems is that, if one CPU fails, the system reboots and returns to normal operation on the remaining good processors. In this case, the fault was not avoided, but the system was quickly recovered.

Recoverability also extends into backup and restore operations. In the event that no other option exists, an Exchange server needs to be recoverable from tape in the most expedient manner. In later chapters, we will discuss methods, tools, and best practices for ensuring expedient recovery of Exchange servers. These methods are not limited to tape-based restoration, however: alternate technologies such as snapshots and data replication can also be employed as additional measures for increasing the recoverability, and ultimately the availability, of Exchange Server deployments.

The reliability and recoverability of a system are the cornerstones of mission-critical servers. An understanding of how to measure reliability will be crucial later on when we discuss service-level definitions and analyze the causes of Exchange server downtime. It is important to understand that reliability for an Exchange server is determined by many factors. A surprising fact is that hardware failures account for relatively few of the root causes of server downtime for Exchange Server, as for many other mission-critical applications. Root causes such as software errors, operational issues, and personnel account for more system downtime than hardware does. According to Strategic Planning Research, less than 20% of computer system downtime is caused by equipment failures. Data from OnTrack International and other sources shows similar trends. The most significant cause, human error, can be attributed to poor planning and procedures, lack of training, and a myriad of other human factors. Throughout this book, I hope to emphasize this point and provide some of the education that we as Exchange Server designers, implementers, operators, and administrators need to reduce this factor and its impact on mission-critical Exchange Server deployments.

1.3.2 Scalability

Scalability is much more than the ability to handle many users per server. Nevertheless, most hardware vendors with an interest in the huge Exchange Server deployment opportunity that has developed over the last several years have invested in the resources and personnel required to produce and publish performance, scalability, and benchmarking information for Microsoft Exchange Server. With the release of Exchange Server 5.5 and subsequently Exchange 2000, the published performance results took a vast leap forward. Figure 1.2 illustrates the substantial performance improvements that the most recent versions of Microsoft Exchange Server have yielded. Microsoft has, for the most part, solved the most relevant performance issues in the most recent versions of Exchange Server. Recent improvements in hardware technology, such as 32-processor systems, large memory access in Windows (up to 64 GB), processor advances such as Intel’s Hyper-Threading, memory architectures, I/O and host bus architectural improvements, and disk and disk-subsystem advances, have drastically improved Exchange Server scalability. However, scalability must also come in the form of operating system and application scalability: regardless of the hardware’s capabilities, a poorly designed operating system or application will hinder the total solution’s scalability. Software vendors must design applications and operating systems to make maximum use of the hardware technologies at their disposal.

It is also important to understand that benchmarks do not tell the whole story. When hardware vendors set out to provide Exchange Server benchmarking information, the focus is on optimally performing systems. As a result, published benchmarks are run on top-of-the-line hardware configurations that represent the latest available server platforms from each vendor. From before Exchange Server 4.0 shipped until just prior to Exchange 2000 shipping, I was involved in Exchange Server benchmarking activities for Compaq Computer Corporation (now HP). Early in our efforts to provide customers with performance information for Microsoft Exchange Server, I sought to provide performance results based only on what I called customer-deployable scenarios, meaning benchmarks based on server configurations that organizations would actually use when deploying Microsoft Exchange Server. For example, most organizations would not deploy their Exchange servers with 4 GB of RAM installed on every box. A less obvious example is configuring a disk subsystem for RAID5 instead of RAID0. To achieve maximum benchmark performance, one would configure the disk subsystem as RAID0, since it provides the best performance in the majority of environments. In my benchmarking activities at Compaq, however, I sought to produce only benchmarks based on RAID5 or RAID0+1 disk arrays, even though RAID5 suffers heavily in a write-intensive environment like that of Exchange Server. My goal was to use RAID5 because most organizations would use RAID5 or RAID0+1 when designing and deploying their Exchange servers. While RAID5 does not yield the highest performance, it does provide data protection and offers a reasonable trade-off between performance and data protection. Thus, my thought process was that benchmarks for Exchange Server based on these real-world server configurations would be much more useful and credible to organizations that were using the information I produced to make key Exchange Server deployment decisions.

My thinking had one small flaw: you should never neglect the marketing implications of your benchmarking activities. Soon my competitors at companies like Digital Equipment Corporation and Hewlett-Packard (both, obviously, prior to their respective mergers with Compaq) were publishing better benchmark numbers than mine, based on server configurations utilizing RAID0 disk subsystems. This was understandable, since RAID0 can provide as much as 40% more I/Os per second (depending on a number of factors) than the same number of disk drives configured as RAID5. The matter of using RAID0 versus RAID5 was discussed with Microsoft and the other hardware vendors, and the decision (with which I reluctantly agreed) was made that RAID0 benchmarks for Exchange Server would be allowed, provided the vendor specified that RAID0 provides no data protection in the event of disk failure. The result was that benchmark results climbed higher and higher as vendors, including Compaq, abandoned my dream of customer-deployable scenarios in favor of publishing the highest possible result.
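
The RAID0-versus-RAID5 gap the vendors were exploiting comes down to the RAID5 write penalty: each RAID5 host write costs roughly four disk I/Os (read old data, read old parity, write new data, write new parity), while RAID0 writes cost one. The sketch below is a back-of-the-envelope model, not a benchmark; the per-disk IOPS figure, the 85/15 read/write mix, and the penalty values are illustrative assumptions, and the resulting advantage swings widely with the workload mix:

```python
# Back-of-the-envelope model of host-visible IOPS for an array, given a
# read/write mix and a per-write penalty in backend disk I/Os.
# All numbers below are illustrative assumptions, not measurements.

def host_iops(disks: int, iops_per_disk: float,
              read_frac: float, write_penalty: float) -> float:
    """Host-visible IOPS an array can sustain for a given read/write mix."""
    write_frac = 1.0 - read_frac
    backend = disks * iops_per_disk
    # Each host I/O consumes 1 backend I/O if a read, `write_penalty` if a write.
    return backend / (read_frac + write_frac * write_penalty)

raid0 = host_iops(disks=10, iops_per_disk=120, read_frac=0.85, write_penalty=1)
raid5 = host_iops(disks=10, iops_per_disk=120, read_frac=0.85, write_penalty=4)
print(f"RAID0: {raid0:.0f} IOPS, RAID5: {raid5:.0f} IOPS "
      f"({raid0 / raid5 - 1:.0%} advantage for RAID0)")
```

With this fairly read-heavy mix the model lands near the “as much as 40% more” figure cited above; a write-heavy Exchange workload widens the gap considerably.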

Servers cannot be mission-critical in nature if they cannot handle the demands of large user loads and the growth trends in system capacity that accompany them. For Exchange Server, scalability has recently taken a back seat to reliability. Although it is not the focus of this book, I would be remiss if I did not discuss Exchange Server performance and scalability in relation to mission-critical systems. In Chapter 10, “Proactive Management for Mission-Critical Exchange Servers,” we will also discuss performance management, which is the key to achieving scalable servers that respond to growth in user capacity. When discussing scalability, it is important to look at it as a two-sided coin. Side one, the most well known, is performance scalability: simply put, as the server’s workload increases, how does the server respond to meet this demand? The other side of the coin is capacity scalability: the degree to which a server can meet the ever-increasing demands for more—more disk space, more information stores, more messages, more attachments, more directory entries, and so forth. Table 1.5 illustrates some different components of performance versus capacity scalability.

Table 1.5: Comparing Performance Versus Capacity Scalability

Performance Scalability      Capacity Scalability
I/Os per second              Information store size
Instructions per second      Directory size
Transactions per second      Mailbox size
Messages per second          Users per server
Megabytes per second         Messages per folder

More traditionally, scalability is viewed in terms of the additional work that can be done by adding more resources. The additional resources are items like more processors, memory, disks, threads, processes, and buffers. The additional work could be in the form of more users, transactions, messages, or megabytes. For Exchange Server, scalability is manifested in many ways. Scalability may be the number of users that a given hardware platform will support. Alternately, scalability may be the degree to which the Exchange Directory (for Exchange 5.5) or the Active Directory (in the case of Exchange 2000 and Exchange Server 2003) is able to grow. This is not a crucial point, but it is one worth mentioning. Depending on your context and frame of reference, scalability means different things to different people. Nonetheless, scalability is an essential element of mission-critical servers.

1.3.3 Security

Another pillar on which mission-critical Exchange servers stand is security. An entire portfolio of reliability and scalability best practices and tools will not be enough if your messaging system is not secure. Secure messaging systems must be bulletproof against viruses, sabotage, denial of service, and other forms of attack. Chapter 9, “Locking Down Mission-Critical Exchange Servers,” will be devoted to this key element. For a mission-critical messaging server, several perimeters must be defended, including gateways, networks, message stores, and clients. Gateways provide an opening for mail storm attacks, viruses, and other forms of attack. The network wire is open to “snooping,” “spoofing,” and “sniffing.” Message stores and mailboxes are potentially open for unauthorized people to access and for viruses to destroy or infect. Client systems can be the point of access or the point of origin for viruses and for unauthorized access to the entire system. Mission-critical servers must provide mechanisms to defend these perimeters, and Exchange Server provides many security mechanisms and tools for preventing these attacks. Table 1.6 highlights some of the most common security breaches and denial-of-service attacks and the built-in tools that Exchange has to protect against them.

Table 1.6: Built-in Security Mechanisms for Exchange Server

Attack                 Perimeters                             Exchange Protection
Virus                  Gateway, message store, and client     VSAPI
Sniffing               Message store, network, and gateway    Message encryption
Spoofing               Message store, network, and gateway    Digital signatures, Windows Rights Management
Mail storm/SPAM/UCE    SMTP virtual server/gateway            SMTP server configuration, real-time block lists, Outlook/OWA features

Unfortunately, security is often overlooked in the design of mission-critical servers. With so much focus on reliability, disaster recovery, and scalability, implementers are often hard-pressed to allocate planning cycles to address security issues. Contributing to the problem, tools and mechanisms to address security issues have only recently become available for Exchange Server. Prior to version 5.5, only minimal mechanisms were available in Exchange to protect against the most common forms of attack. In addition, many organizations do not have security expertise in-house and often must rely on expensive consultants to deploy some of the more advanced security measures required.

Microsoft added an increased focus on security in Exchange Server 2003. The elimination of the Key Management Service (KMS) in Exchange Server 2003 marks a change in direction for Exchange development. Rather than continuing to develop Exchange Server’s own components for security features such as S/MIME, Microsoft chose to eliminate these components (the public key infrastructure features delivered via KMS in previous versions of Exchange Server) and rely instead on Windows Server features such as Certificate Services or on third-party solutions such as VeriSign or Entrust. Microsoft also stepped up its focus on security features such as antispam and antivirus in Exchange Server 2003. Finally, Microsoft sought to bring the same level of security features to both Outlook MAPI clients and OWA clients: in Exchange Server 2003, OWA clients can now participate in secure messaging scenarios through digital signing and encryption features for OWA users. There will be more on this in Chapter 9, when we take a closer look at security for Exchange Server 2003. By combining the increasing number of security features in the Windows operating system and Exchange Server 2003 with ever-improving best practices for securing our Exchange deployments, this release of Exchange Server promises to provide a degree of security unsurpassed by previous versions.

1.3.4 Manageability

Personally, I believe that proactive management techniques are the key component to highly available Exchange servers. Many shortcomings in other areas, such as reliability, scalability, and security, can be overcome or compensated for by applying the right management tools, methodologies, and philosophies to your deployment. The factors that impact manageability can be characterized as follows:

Technical knowledge: A good understanding of the technology that Microsoft Exchange Server uses and that surrounds it, such as Windows Server. Good training and access to an environment for experimentation (such as a lab) are the keys to understanding Exchange Server well enough to perform proactive management. Utilizing industry sources such as magazines, white papers, and support databases can also be invaluable in this area.

A systems approach: This is an understanding that any management solution is a careful balance between things you have some control over (i.e., hardware, OS, network) and those that are somewhat more elusive (i.e., control over users and outside issues that cause problems). This is the ability to look at the big picture when making decisions about your Exchange Server deployment. Again, a proactive approach is called for here.

Planning and design: The Windows Server and Exchange Server environment can be very complex. Effective planning, design, and pilot activities must take place before deployment in order to identify and resolve problems before production. It is a lot easier to resolve issues during the design phase than during the production phase of any information technology deployment.

Configuration management: The Exchange Server environment offers many variables. The key to any successful deployment is the ability to manage and control these variables. The most successful Exchange Server deployments actively manage and monitor these variables. These are items such as registry tuning, drivers, OS and application variations, service packs, hot-fixes, firmware updates, and so forth.

Establishment of SLAs and performance management: It is commonplace for organizations to manage performance in a reactive manner (i.e., when users complain). The Windows and Exchange Server environment offers rich capabilities in this area. Microsoft’s recent foray into this space, Microsoft Operations Manager (MOM), is a welcome addition, and Exchange 2000/2003’s management pack for MOM is a great help with proactive management. In addition, many third parties, such as BMC and NetIQ, offer management tools that aid in the proactive management of Exchange servers. The definition of SLAs is an effective method of measuring performance proactively: SLAs prompt the monitoring of various performance characteristics within the Exchange Server environment in order to measure the degree to which these service levels are being met. We will discuss the definition and establishment of SLAs in the next chapter. The establishment of SLAs is also a key driver for other aspects of manageability, such as problem management and performance management.
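
As a minimal sketch of what measuring an availability SLA might look like in practice (the outage durations and target below are made-up illustrative numbers; real data would come from a monitoring tool such as MOM):

```python
# Measure availability over a year from a log of outage durations and
# compare it against an SLA target. All figures here are hypothetical.

HOURS_PER_YEAR = 8_760

def measured_availability(outage_minutes: list) -> float:
    """Percentage of the year the service was up, given outage durations."""
    downtime_hours = sum(outage_minutes) / 60
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)

outages = [42, 15, 180]          # minutes of downtime logged this year
sla_target = 99.9                # "three nines" service-level target

actual = measured_availability(outages)
verdict = "meets" if actual >= sla_target else "misses"
print(f"Measured availability: {actual:.3f}% ({verdict} the {sla_target}% SLA)")
```

Tracking this number continuously, rather than computing it after users complain, is exactly the proactive posture the SLA is meant to drive.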

Through the use of proactive management practices, I have seen many organizations drastically reduce Exchange Server downtime. Proactive management calls for the preceding items: technical knowledge, a systems approach, planning and design, configuration management, and the establishment of criteria with which to assess your success (SLAs). My point here is a key thesis of this book (more on this in Chapter 10). Only through understanding and proactive planning, design, and implementation can a system such as Exchange achieve mission-critical capabilities. Furthermore, success comes only through a thorough understanding of Exchange itself, as well as of the technologies we can employ, such as disaster recovery, storage, clustering, and management, to increase system reliability. If, on the other hand, the first and only focus of our system planning is on performance, scalability, and server sizing, we can hardly expect a system to achieve a high degree of uptime.




Mission-Critical Microsoft Exchange 2003: Designing and Building Reliable Exchange Servers (HP Technologies)
ISBN: 155558294X
Year: 2003
Pages: 91
Authors: Jerry Cochran
