Chapter 2: Public Enemy 1: Exchange Downtime | Mission-Critical Microsoft Exchange 2003: Designing and Building Reliable Exchange Servers (HP Technologies)

As I charge down the path of building a mission-critical Exchange deployment, it would not be prudent to start without a discussion of downtime and outages for Exchange Server. If your messaging system has become mission critical, the cost of downtime can be devastating to your organization. These costs can come in the form of lost revenue, lost opportunity, failed Service-Level Agreements (SLAs), and noncompliance or performance penalties. This does not even consider the fixed costs that an organization must pay whether its employees are working or not (nor does it account for those who just “hang it up” for the day because their server is down). What can be most damaging are the incalculable losses, such as the loss of customer relationships or satisfaction. When your customers and business partners are affected by an outage in your messaging system, they may perceive a poorly run organization with which they are not excited to do business.

As a result, our discussion of outages and downtime should begin in the business context. In other words, what does downtime of your Exchange deployment really cost the company? The higher the quantifiable costs, the more dollars are justified for deploying high-availability measures and investing in personnel and procedures in order to reduce downtime. The more successful an organization is in delivering the levels of reliability required, the faster a return on investment can be realized.

In this chapter, we will look at the enemy of mission-critical systems: downtime. It is important that we have a consensus on how downtime is to be measured and the effect this will have on how the Exchange deployment is viewed by users and management. I will not treat downtime as a simple concept, but as a complex series of components. My point here is that by understanding the internals of a downtime event, organizations can shorten the overall event and provide a means of continuous process improvement. Finally, understanding downtime and its measurement can help organizations set reasonable goals for achieving mission-critical operation. Understanding, measuring, and determining the cost of downtime are the keys to making business case arguments for further investments in mission-critical technologies and services.

2.1 How do you measure downtime?

The measurement of downtime for computer systems is often taken for granted or misunderstood. We mistakenly assume that everyone measures downtime for Exchange Server in the same fashion. Typically, we conclude that downtime is the total number of hours that the server is unavailable. For example, as I discussed in Chapter 1, if the server was down for 8.76 hours or less in 1 year, we conclude that 99.9% availability was achieved. However, not all organizations measure downtime in the same manner. One common alternative to a simple measure of server inaccessibility is the lost-client-opportunity method. This method tracks the number of times a client or user attempts to perform a function and is unable to. In other words, when your Exchange server is down, you cannot access e-mail. Instead of tracking the number of hours that your Exchange server is unavailable (the most common method of measuring downtime), you track lost client opportunities to do work. This method of measurement is derived from the telephone system and other dial-tone services in which users expect the system to be available whenever they attempt to use it. If your organization views the e-mail system in the same way as the telephone system, this method of downtime measurement may be appropriate for you. Essentially, whether you are measuring actual server downtime or lost client opportunities, the same downtime is being measured. However, when you are looking at downtime statistics, one method may yield different results from another. When you are looking at the amount of time a server is down, there is not necessarily any element of client service factored in. When lost client hours or opportunities are measured, client service is the primary concern.
In order to determine which method of measurement is appropriate for your organization, let’s take a look at both methods of measuring downtime for a hypothetical 1-hour downtime event (during a 1-month period) that impacts an Exchange server supporting 2,000 users.

2.1.1 Method 1: Simple server downtime

This method simply measures how many hours a particular server was down without regard to how many users the server supports or the lost services for those users. The amount of downtime for a server is compared against the total service hours in the period (8,760 hours for yearly 7 × 24 operations or 720 hours for a 30-day month). Dividing the downtime hours by the total service hours gives the fraction of time the server was unavailable; subtracting that fraction from 100% gives the commonly used “nines of availability.” Finally, don’t forget that not all hours are created equal, as it is much worse to be down during month-end reporting than it is to be down on a holiday.

Server availability (1 month)

720 hours (1 month) – 1 hour of downtime = 719 hours of availability

719 hours/720 hours = 99.86% (less than 3 “nines”)
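
The arithmetic above can be expressed as a short Python sketch. This is only an illustration of the simple server downtime method; the function name `server_availability` and the constants are assumptions introduced here, not part of any Exchange tooling.

```python
# Illustrative sketch of Method 1 (simple server downtime).
# All names here are hypothetical helpers, not an Exchange API.

HOURS_PER_MONTH = 30 * 24   # 720 hours in a 30-day month
HOURS_PER_YEAR = 365 * 24   # 8,760 hours in a year of 7 x 24 operation


def server_availability(downtime_hours: float, period_hours: float) -> float:
    """Return availability for one server as a percentage of the service period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return (period_hours - downtime_hours) / period_hours * 100.0


# The 1-hour outage in a 30-day month from the example above.
print(f"{server_availability(1, HOURS_PER_MONTH):.2f}%")    # 99.86%

# The Chapter 1 figure: 8.76 hours of downtime in a year.
print(f"{server_availability(8.76, HOURS_PER_YEAR):.2f}%")  # 99.90%
```

Note that the user count does not appear anywhere in this calculation, which is exactly the limitation the next method addresses.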

2.1.2 Method 2: Lost client opportunity

This method directly considers the client services affected by a server outage. Using this approach, the number of users each server supports becomes more important. In our hypothetical scenario, the server has a user load of 2,000. When calculating availability for the server, the user load is included. The number of users is multiplied by the total lost service time, yielding a total lost service measurement. The total lost service is then divided by the total service opportunity for the system over the service period or some other interval. In my experience, organizations will use a service interval such as 1 million hours. This method will yield results that are based on the lost client opportunities for the service period (720 hours for our hypothetical scenario). The resulting measure is a figure of lost client opportunity (hours) for the total client-service period.

Server availability (1 month) with client lost opportunity

[(720 × 2,000) – (1 × 2,000)]/(720 × 2,000) = 99.86%

As you can see, for a single server and a single outage, either method yields the same result: the user count appears in both the numerator and the denominator and cancels out, so both measurements give 99.86% availability for the server. However, using a deployment-wide measurement such as lost opportunity hours per 1 million hours can yield very different results because individual server metrics are masked as the focus shifts from server outage to client service hours and opportunity. Figure 2.1 shows how this measurement may look for a large deployment scenario measured monthly with a sample period of 1 million workstation hours.

Figure 2.1: Illustrating lost client opportunity downtime measurement.
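
To make the difference between per-server and deployment-wide figures concrete, here is a minimal Python sketch of the lost-client-opportunity calculation. It does not reproduce the data behind Figure 2.1. The server names, the two additional servers, and their user counts are invented for illustration, and the normalization to 1 million client hours is an assumption about how such a figure would be scaled.

```python
# Illustrative sketch of Method 2 (lost client opportunity).
# Server names and the extra servers' user counts are hypothetical.
from dataclasses import dataclass


@dataclass
class ServerOutage:
    name: str
    users: int             # user load supported by the server
    downtime_hours: float  # downtime during the service period


def _totals(servers, period_hours):
    """Return (lost client hours, total client-hour opportunity)."""
    total = sum(s.users for s in servers) * period_hours
    if total <= 0:
        raise ValueError("need at least one user and a positive period")
    lost = sum(s.users * s.downtime_hours for s in servers)
    return lost, total


def client_availability(servers, period_hours):
    """Availability as a percentage of total client-hour opportunity."""
    lost, total = _totals(servers, period_hours)
    return (total - lost) / total * 100.0


def lost_hours_per_million(servers, period_hours):
    """Lost client hours normalized to 1 million client hours of opportunity."""
    lost, total = _totals(servers, period_hours)
    return lost / total * 1_000_000


PERIOD = 720  # 30-day month

# The single 2,000-user server from the example: same 99.86% as Method 1.
single = [ServerOutage("EX01", 2000, 1.0)]
print(f"{client_availability(single, PERIOD):.2f}%")

# A hypothetical deployment where only EX01 had an outage.
deployment = single + [
    ServerOutage("EX02", 500, 0.0),
    ServerOutage("EX03", 1500, 0.0),
]
print(f"{client_availability(deployment, PERIOD):.2f}%")
print(f"{lost_hours_per_million(deployment, PERIOD):.1f} lost hours per million")
```

In this made-up deployment, the outage on one server is diluted by the client hours served successfully elsewhere, so the deployment-wide figure looks better than the figure for the affected server alone. That is the masking effect described above.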

Although the preceding discussion may seem like splitting hairs, the method you choose to track and view downtime data for your deployment is important. This is of particular importance when you are being held accountable for SLAs for availability. As a system manager, you may want to select the measurement method that most closely matches how your SLAs are defined and that fits best into the traditional measurements used historically in your organization. In order to determine what the availability requirements are for your Exchange deployment, you need to understand several key issues. The process of planning and designing a system that delivers the required levels of availability includes these steps:

1. Understand the cost of downtime for your messaging system.
2. Understand the anatomy of downtime for your Exchange deployment.
3. Assess the causes of downtime in your Exchange deployment.
4. Set SLAs.
5. Architect the system with steps 1 to 4 in mind.