It takes time and discipline to plan for high availability deployments. Successful deployments are invariably based on well-informed decision making in which how you go about a deployment is as important as the specific technologies you deploy. Many variables affect the availability of any system, such as hardware, system software, data, application technologies, and the environment. Furthermore, many of the factors critical to a successful deployment are managerial, organizational, and procedural. Each must come together in carefully thought out and correctly applied policies and procedures. This lesson explains how to measure system availability and what types of failures contribute to system downtime. The lesson also tells you how to create a checklist that allows you to monitor your site’s availability.
When you build systems for high availability, you require them to perform necessary functions under stated conditions for specified periods. These systems need to behave predictably and reliably so that customers can build operating plans for the systems that are critical to the functioning of their businesses. To users and managers of information systems, application availability—not system availability—is paramount. Although there’s no industry-wide standard for calculating availability, mean time to failure (MTTF) and mean time to recovery (MTTR) are the metrics most often cited. MTTF is the mean time until a device will fail, and MTTR is the mean time it takes to recover from a failure.
Hardware components are usually modeled as having an exponential failure distribution, which implies a roughly constant failure rate during normal operation. After an initial phase and a period of normal aging, however, the longer a hardware component operates, the more frequently it fails. Therefore, if you know the MTTF for a device, you might be able to predict when it will enter its failure mode. The historical statistical data of mechanical and electrical components fits the so-called bathtub curve, as shown in Figure 1.4.
Figure 1.4 - The bathtub curve
This model shows three distinguishable phases in a component’s life cycle: burn-in, normal aging, and failure mode. Each phase is characterized by some signature or behavior, which varies from domain to domain. Failure rates are characteristically high during burn-in, but drop off rapidly. Devices seldom fail during the normal aging period. As the devices age, however, the failure rates rise dramatically but predictably.
You can use this observation in two ways. First, you can observe the characteristics of devices in the normal aging phase and look for indications that correlate with observed failure rates. Some hardware vendors do this and have devised mechanisms to measure the characteristics of devices that you can use with some success as failure predictors. Second, you can keep track of the actual failure rates of specific devices and replace them before they enter the expected failure mode. You usually use this strategy when the cost of a failed component can be catastrophic. It requires relatively long sample periods in order to capture statistically significant MTTF data.
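As a rough illustration of the second approach, the following Python sketch estimates MTTF from a record of observed failures and flags a device once its time in service approaches that estimate. The function names, the 80 percent replacement threshold, and the sample figures are assumptions made for this example; they do not come from any vendor.

```python
def estimate_mttf(total_operating_hours: float, failures: int) -> float:
    """Estimate MTTF as total observed operating hours divided by failures."""
    if failures <= 0:
        raise ValueError("at least one observed failure is needed to estimate MTTF")
    return total_operating_hours / failures


def due_for_replacement(hours_in_service: float, mttf: float,
                        threshold: float = 0.8) -> bool:
    """Return True once a device has run for `threshold` of the estimated MTTF.

    The 0.8 default is an arbitrary safety margin chosen for illustration.
    """
    return hours_in_service >= threshold * mttf


# Hypothetical sample: 50 devices observed for 6,000 hours each, 10 failures.
mttf = estimate_mttf(total_operating_hours=50 * 6_000, failures=10)
print(f"Estimated MTTF: {mttf:,.0f} hours")
print(due_for_replacement(hours_in_service=25_000, mttf=mttf))
```

An estimate like this is only as good as the sample behind it, which is why the long sample periods mentioned above matter.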
Unfortunately, MTTF statistics are normally useful predictors only for components with an exponential failure distribution. Software and hardware components have different failure characteristics, which makes it difficult to manage or predict software failures. As a result, MTTF applies only to certain categories of software defects. For example, data dependent errors, such as those triggered by an anomalous input stream, require the input stream itself to be predictable in order for previous failure rates to act as predictors of future failure rates.
For systems that are fully recoverable, MTTR is an equally important quantity, since it correlates directly with system downtime. You can estimate downtime, expressed as a fraction of operating time, by using the following formula:
Downtime = MTTR ÷ MTTF
For example, if you have a device with an MTTR of 2 hours and an MTTF of 2,000 hours, downtime will be 0.1 percent.
To calculate availability, you can use the following formula:
Availability = MTTF ÷ (MTTF + MTTR)
Again, if you have an MTTR of 2 hours and an MTTF of 2,000 hours, your availability will be approximately 99.9 percent.
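The two formulas can be checked with a few lines of Python. This is a minimal sketch; the function names are chosen for this example only. The downtime ratio MTTR ÷ MTTF is a close approximation when MTTR is much smaller than MTTF, which is why the two results do not sum to exactly 100 percent.

```python
def downtime_ratio(mttf_hours: float, mttr_hours: float) -> float:
    """Downtime as a fraction of operating time: MTTR / MTTF."""
    return mttr_hours / mttf_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Values from the example in the text: MTTF of 2,000 hours, MTTR of 2 hours.
print(f"Downtime:     {downtime_ratio(2000, 2):.3%}")  # 0.100%
print(f"Availability: {availability(2000, 2):.3%}")    # 99.900%
```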
As you can see, focusing exclusively on increasing MTTF without also taking MTTR into account won’t maximize availability. This is due in part to the costs associated with designing hardware and software systems that never fail. Such systems tax current design methodologies and hardware technologies and are always extremely expensive to deploy and maintain.
Customers with high availability requirements should maximize MTTF by carefully designing and thoroughly testing both hardware and software and reduce MTTR by using failover mechanisms, such as clustering.
Availability depends on whether a service is operating properly. You can think of availability as a continuum ranging from 100 percent (a completely fault-tolerant site that never goes offline) to 0 percent (a site that’s never available). All sites have some degree of availability. Today, many companies target "three 9s" availability (99.9 percent) for their Web sites, which means that there can be only approximately 8 hours and 45 minutes of unplanned downtime a year. Telephone companies in the United States typically target "five 9s," or 99.999 percent uptime (approximately 5 minutes and 15 seconds of unplanned downtime a year).
Table 1.2 shows these common classes of 9s and their associated availability percentages and related annual downtime.
Table 1.2 Rule of 9s
|Availability Class||Availability Measurement||Annual Downtime|
Although any company might strive for additional uptime for its Web site, significant incremental hardware investment is required to get those extra "9s." However, maintaining even 99.9 percent uptime still allows roughly 8.76 hours of downtime per year, which could result in a significant loss of revenue for a company that relies heavily on the revenue generated by its Web site.
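The annual downtime figures quoted for each class of 9s follow directly from the availability percentage. The sketch below assumes a 365-day year (8,760 hours); the function name and the list of percentages are chosen for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year allowed by a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

For 99.9 percent this gives about 8.76 hours, and for 99.999 percent about 5.3 minutes, which matches the approximate figures given in the text.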
The high cost of downtime makes planning essential in environments that require high availability. The simplest model of downtime cost is based on the assumption that employees are made completely idle by outages, whether due to hardware, network, server, or application failure. In such a model, the cost of a service interruption is given by the sum of the labor costs of the idled employees combined with an estimate of the business lost due to the lack of service. Several factors cause system outages, such as software failures, hardware failures, network failures, operational failures, and environmental failures.
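The simple cost model described above, in which idled labor is added to an estimate of lost business, can be sketched as follows. The parameter names and the sample figures are hypothetical and serve only to show the arithmetic.

```python
def outage_cost(idle_employees: int, hourly_labor_cost: float,
                outage_hours: float, lost_business: float) -> float:
    """Cost of an outage: labor cost of idled employees plus lost business.

    Assumes every affected employee is completely idle for the whole outage,
    as in the simplest model of downtime cost.
    """
    labor_cost = idle_employees * hourly_labor_cost * outage_hours
    return labor_cost + lost_business


# Hypothetical figures: 200 employees at 40.00 per hour, idle for 2 hours,
# plus an estimated 25,000 in lost business.
print(outage_cost(200, 40.0, 2.0, 25_000.0))  # 41000.0
```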
Identifying the cause of an outage can be complicated. For example, one company observed that about 20 percent of all the outages reported were attributed to operating system failures and 20 percent were attributed to application failures. However, a separate internal study of calls to Microsoft Product Service and Support found that most calls reporting apparent operating system failure turned out to be due to improper system configuration, defects in third-party device drivers, or system software.
Defects in virus protection software can also cause system outages, because such software is implemented as kernel-level filter drivers. You can avoid many system outages attributed to software problems by using better operational procedures, such as carefully selecting the software that you allow to run on your servers.
Hardware failures occur most frequently in mechanical parts such as fans, disks, or removable storage media. Failure in one component may induce failure in another. For example, defective or insufficient cooling may induce memory failures or shorten the time to failure for a disk drive.
Other moving parts, such as mechanical and electromechanical components in the disk drive, are also among the most critical, although new storage techniques have dramatically enhanced the reliability of disk drives.
Market pressures have driven storage vendors to provide these improvements at commodity prices. In 1988 the MTTF rating for a commodity 100 MB disk was around 30,000 hours. Today the MTTF specification for a commodity 2 GB drive is 300,000 hours, or about 34 years.
In demanding environments, hardware requires sufficient airflow and cooling equipment. System administrators are advised to use platforms capable of monitoring internal temperatures and generating Simple Network Management Protocol (SNMP) alarms when conditions exceed recommended ranges.
Random access memories allow for the use of parity bits to detect errors or the use of error correcting codes (ECC) to detect and correct single-bit errors and to detect two-bit errors. The use of ECC memories for conventional and cache memories is as important to overall system reliability as the use of RAID.
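To show what a parity bit can and cannot do, the following sketch computes an even parity bit for a data word and then checks the word after one and two bits have been flipped. It illustrates parity only, not a full ECC scheme; the function names and the sample word are assumptions made for this example.

```python
def even_parity_bit(data_bits: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(data_bits).count("1") % 2


def parity_ok(data_bits: int, parity_bit: int) -> bool:
    """Return True if the stored parity bit still matches the data."""
    return even_parity_bit(data_bits) == parity_bit


word = 0b10110010                      # four 1 bits, so the parity bit is 0
stored_parity = even_parity_bit(word)

one_bit_error = word ^ (1 << 3)            # flip a single bit
two_bit_error = one_bit_error ^ (1 << 5)   # flip a second bit

print(parity_ok(word, stored_parity))           # True: no error
print(parity_ok(one_bit_error, stored_parity))  # False: error detected
print(parity_ok(two_bit_error, stored_parity))  # True: error goes unnoticed
```

A single parity bit detects an odd number of flipped bits but cannot locate or correct them, and it misses two-bit errors entirely. ECC memory adds more check bits to close both gaps.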
With distributed systems, it’s very important to realize that the underlying network’s performance and reliability contribute significantly to the system’s overall performance and reliability.
Changes in the topology and design of any layer of the protocol stack can affect the whole. To ensure robustness, it’s necessary to evaluate all layers, although this holistic approach is rarely used. Instead, many businesses view the network as a black box with a well-defined interface and service level, totally ignoring the evidence of any coupling between system and network.
Reliable deployments of Windows 2000–based systems require some procedures that may not be obvious to those migrating from more elaborate mainframe or minicomputer-based data centers or migrating their Information Technology (IT) operations from more informal personal computer environments. Customers can minimize or altogether avoid many of the problems identified so far by using disciplined operational procedures, such as regular, complete backups and avoidance of unnecessary changes to configuration and environment.
One disaster-recovery study found that 27 percent of declared data-center disasters were due to power loss. In this study a declared disaster was defined as an incident in which there was actual loss of data in addition to loss of service. This figure includes power outages due to environmental disasters such as snowstorms, tornadoes, and hurricanes.
You can create a checklist to monitor your site’s availability. Table 1.3 provides an overview of the types of information that you should include in your checklist. Many of these issues will be discussed in more detail in later chapters.
Table 1.3 Availability Checklist
|Type of Information||Description|
|Bandwidth usage||Monitor how bandwidth is being used (peak and idle) and how usage increases (whether it increases, when it increases, and how long it increases).|
|Network availability||Monitor network availability by using Internet Control Message Protocol (ICMP) echo requests (pings). ICMP-based monitoring is available in most network monitoring software.|
|System availability||Monitor normal and abnormal shutdowns of the operating system. Monitor normal operation and failover events of SQL Server. Monitor normal and abnormal shutdowns in Internet Information Server (IIS).|
|HTTP availability||Monitor HTTP requests issued internally, issued from ISP networks (such as AOL, Microsoft MSN, MCI, Sprint, and so on), and issued from different geographic locations (New York, San Francisco, London, Paris, Munich, Tokyo, Singapore, and so on).|
|Performance metrics||Use the following performance metrics to ensure availability: number of visits, latency of requests for sets of operations and page groups, CPU utilization, disk storage, disk input/output (I/O), Fibre Channel loop bandwidth and performance, and memory.|
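One way to act on the HTTP availability item is to probe a page at regular intervals and report the share of successful probes. The sketch below uses only the Python standard library. The function names, the five-second timeout, and the sample results are assumptions for illustration, and a real deployment would more likely rely on dedicated monitoring software.

```python
import urllib.error
import urllib.request


def probe_http(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP error statuses, connection failures, timeouts,
        # and malformed URLs.
        return False


def observed_availability(probe_results: list) -> float:
    """Fraction of probes that succeeded."""
    if not probe_results:
        raise ValueError("no probe results recorded")
    return sum(probe_results) / len(probe_results)


# Hypothetical record of 1,000 probes, 2 of which failed.
sample = [True] * 998 + [False] * 2
print(f"Observed availability: {observed_availability(sample):.1%}")  # 99.8%
```

Running `probe_http` from several networks and locations, as the checklist suggests, helps separate site failures from problems on a single network path.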
The three distinguishable phases in a component’s life cycle are burn-in, normal aging, and failure mode. Although there is no industry-wide standard for calculating availability, MTTF and MTTR are the metrics most often cited. You can determine downtime by using the formula MTTR ÷ MTTF. You can determine availability by using the formula MTTF ÷ (MTTF + MTTR).
MTTF statistics are normally useful predictors only for components with an exponential failure distribution. As a result, MTTF applies only to certain categories of software defects. Customers who require high availability should maximize MTTF by carefully designing and thoroughly testing both hardware and software and reduce MTTR by using failover mechanisms such as clustering.
All sites have some degree of availability. A specific number of 9s is sometimes used to describe a system’s availability. For example, three 9s has an availability measurement of 99.9 percent and an annual downtime of 8.8 hours.
Several factors cause system downtime, including software failures, hardware failures, network failures, operational failures, and environmental failures.
You can create a checklist to monitor your site’s availability. The checklist should include information about bandwidth usage, network availability, system availability, HTTP availability, and performance metrics.