The definition of uptime and downtime differs in every computing environment. Technically, all downtime, whether planned or unplanned, should count toward a final availability number, because both are service disruptions. Some environments do not count planned downtime, such as periods of normal maintenance, against an eventual system availability calculation. Although this is a decision each company makes individually, padding availability numbers by excluding planned downtime might send the wrong message to both management and end users. In fact, many enterprise customers pay for an external service to monitor the availability of their systems in comparison to their competitors' systems, to ensure that the entire system's availability meets the guiding principles determined by the team.
Calculating availability is simple:
Availability = (Total Units of Time - Downtime) / Total Units of Time
For example, assume:
Total units of time = 1 year (8760 hours)
Downtime = 3 hours
Plug that into this formula:
Availability = (8760 hours - 3 hours) / 8760 hours
which results in 0.999658 availability for the individual system.
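The arithmetic above can be expressed as a short function. This is a minimal sketch; the function name and the use of hours as the unit of time are illustrative choices, not part of any particular product or API.

```python
def availability(total_hours, downtime_hours):
    """Availability = (total units of time - downtime) / total units of time."""
    return (total_hours - downtime_hours) / total_hours

# One year (8760 hours) with 3 hours of downtime:
result = availability(8760, 3)
print(round(result, 6))  # 0.999658
```

Any consistent unit of time (minutes, hours, days) works, as long as the numerator and denominator use the same one.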
Keep in mind that an individual system rarely exists alone: it is part of a whole solution. You have now calculated the individual system's availability, but what is the entire solution's overall availability? That number is only as good as the weakest critical link, so if one of the essential servers in the solution has only .90497 uptime, that is the uptime of the entire solution. The number for the overall solution also has to factor in such things as network uptime, latency, and throughput. These critical distinctions bring the availability of a solution into focus. Having said that, qualify numbers when they are calculated and explained. This definition might be too simplistic for some environments: if the server with .90497 uptime is not considered mission-critical and does not affect the end user's experience or the overall working of the solution, it might not need to count toward the number for the overall solution. The bottom line is that multiple components in a solution can fail at the same time or at different times, and everything counts toward the end availability number.
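The weakest-link rule described above can be sketched as taking the minimum availability across the critical components. The component values below are illustrative, and the simple minimum ignores the network factors mentioned above; note that some reliability models instead multiply the availabilities of components wired in series, which yields an even lower overall number.

```python
def solution_availability(critical_components):
    """Weakest-link rule: the solution is only as available as its
    least available critical component."""
    return min(critical_components)

# Hypothetical component availabilities, including the .90497 server:
components = [0.999658, 0.90497, 0.9995]
print(solution_availability(components))  # 0.90497
```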
There are also the related concepts of mean time between failures (MTBF) and mean time to recovery (MTTR). MTBF is the average expected time between failures of a specific component. A good example is a disk drive: every disk has an MTBF that is published in its corresponding documentation. If a drive advertises 10,000 hours as its MTBF, you then need to think about not only how long that is, but also how the usage patterns of your application will shorten or lengthen the drive's life (that is, a heavily used drive might have a shorter life than one that is used sparingly). An easy way to think of MTBF is as predicted availability, whereas the calculation detailed earlier provides actual availability.
MTTR is the average time it takes to recover from a failure. Some define the MTTR for only nonfatal failures, but it should apply to all situations. MTTR fits into the disaster recovery phase of high availability.
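MTBF and MTTR are commonly combined into a steady-state predicted-availability estimate, availability = MTBF / (MTBF + MTTR). This formula is standard reliability engineering rather than something stated above, and the 4-hour MTTR below is a hypothetical value.

```python
def predicted_availability(mtbf_hours, mttr_hours):
    """Steady-state predicted availability: fraction of time the
    component is expected to be up, given its failure and recovery rates."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# The 10,000-hour MTBF drive from above, with an assumed 4-hour MTTR:
print(round(predicted_availability(10000, 4), 4))  # 0.9996
```

Note how shrinking MTTR raises availability even when MTBF is fixed, which is why MTTR belongs to the disaster recovery phase of high availability.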
The calculation example yielded a result of 99.9658 percent uptime, which equates to three nines of availability. But what exactly is a nine? A nine is the total number of consecutive nines in the percentage calculation for uptime, starting from the leftmost digit, and is usually measured up to five nines.
The following table shows the calculations from one to five nines.
Percentage of Uptime            Downtime (Per Year)
99.999 percent (five nines)     Less than 5.26 minutes
99.99 percent (four nines)      5.26 minutes up to 52 minutes
99.9 percent (three nines)      52 minutes up to 8 hours, 45 minutes
99.0 percent (two nines)        8 hours, 45 minutes up to 87 hours, 36 minutes
90.0-98.9 percent (one nine)    87 hours, 36 minutes up to 875 hours, 54 minutes
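The downtime figures in the table follow directly from the availability formula. This sketch derives the yearly downtime budget for each level of nines; the function name is illustrative.

```python
MINUTES_PER_YEAR = 8760 * 60  # 525,600 minutes in a 365-day year

def max_downtime_minutes(availability_percent):
    """Maximum downtime per year permitted by a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for pct in (99.999, 99.99, 99.9, 99.0):
    print(pct, round(max_downtime_minutes(pct), 2), "minutes/year")
```

For example, five nines allows 525,600 x 0.00001 = 5.26 minutes, and two nines allows 5256 minutes, or 87 hours, 36 minutes, matching the table.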
Most environments desire the five nines level of availability. However, the gap between wanting and achieving that number is pretty large. How many companies can tolerate only 5.26 minutes or less of downtime, planned and unplanned, in a given calendar year? That number is fairly small. Good-quality, properly configured, reliable hardware should yield approximately two to three nines. Beyond that, achieving a higher level of nines comes from operational and design excellence. The cost of achieving more than three nines is high, not only in terms of money, but also effort. The cost might be exponential to go from three, to four, and ultimately, to five nines. Realistically, striving to achieve an overall goal of three or four nines for an individual system and its dependencies is very reasonable and something that, if sustained, can be considered an achievement.
Consider the following example: Microsoft SQL Server 2000 is installed in your environment. Assume that Microsoft releases a minimum of two service packs per year. Each service pack installation puts SQL Server into single-user mode, which means it is unavailable for end-user requests. Also assume that, because of the speed of your hardware, the service pack takes 15 minutes to install. At the end of the install process, it might or might not require a reboot; for the sake of this example, assume it does. A reboot requires 7 minutes, for a total of 44 minutes of planned server downtime per year. This translates into 99.9916 percent uptime for the system, which is still four nines, but you can see how something as simple as installing a service pack can rule out five nines of availability in short order. One possible way to mitigate this type of situation is to bring one of the standby systems online to take requests, synchronize it with the primary server when the service pack is finished, and then switch back to the primary server. Each switch will result in a small outage, but the total might wind up being less than the 22 minutes it takes to apply one service pack. That approach also assumes the process has been tested and accurately timed on equivalent test hardware.
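The service pack arithmetic above works out as follows; the numbers are taken directly from the example (two service packs per year, 15-minute install, 7-minute reboot).

```python
MINUTES_PER_YEAR = 8760 * 60  # 525,600 minutes

# Two service packs per year, each requiring a 15-minute install
# followed by a 7-minute reboot:
downtime = 2 * (15 + 7)  # 44 minutes of planned downtime per year
uptime_pct = 100 * (MINUTES_PER_YEAR - downtime) / MINUTES_PER_YEAR
print(downtime, round(uptime_pct, 4))  # 44 99.9916
```

At 99.9916 percent the system still clears four nines, but the 5.26-minute five-nines budget is gone after a single service pack.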
If you ask the CEO, CFO, CIO, or CTO, each one will most likely tell you that they require five nines and 24/7 uptime. Down in the trenches, where the day-to-day battles of system administration are fought, the reality of achieving either of those goals is challenging, to say the least.
Keep in mind that the level of nines and the requirement might change during the day. If the company is, say, an 8-to-5 or 9-to-6, Monday-to-Friday type of shop, those are the crucial hours of availability. Although the other hours might require the system to be up, there might not be a 100 percent uptime guarantee for those off hours. When the solution is designed, the goal should be to achieve the highest level of nines required, not to shoot for the moon (five nines), which might be overkill. When the question is asked during the business drivers meeting, such situations need to be taken into account. Five nines of availability from Monday to Friday might mean only 9 hours a day, which translates into 2340 hours per year. This number is well below the 8760 hours required for 24/7 support, 365 days a year. All of a sudden, what once seemed like a large problem might now seem much easier to tackle.
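The business-hours figure above can be checked, and the resulting five-nines downtime budget derived, with a couple of lines. The 52-week year and the downtime-budget calculation are straightforward extensions of the example, not figures stated in the text.

```python
# Five nines measured over business hours only:
# 9 hours/day, 5 days/week, 52 weeks/year.
business_hours = 9 * 5 * 52  # 2340 hours per year
downtime_budget_min = business_hours * 60 * (1 - 0.99999)
print(business_hours, round(downtime_budget_min, 2))
```

Five nines over 2340 business hours allows roughly 1.4 minutes of downtime during those hours per year, while leaving the remaining 6420 hours of the year free for maintenance windows.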
Also be mindful that each person, from members of management right down to each end user, will have different goals, work patterns, and agendas. In good times, the demand for high availability might be someone's priority, but if other factors and goals, such as high profitability, are currently driving the business, availability might not be number one on the list of priorities. This also relates to the nature of different industries and organizations: financial and medical institutions require availability by default, whereas service or retail organizations might see availability only as an aspect of the business that might or might not be important. This could mean that a variable rate of availability is discussed during negotiations, which might confuse people.
For example, if you need your systems to have five nines of availability from 8 A.M. to 8 P.M., that should be your target for availability. Stating different rates of availability for different hours can potentially lead to the design of an inferior system that will not be able to support the highest required availability number. In the end, it is important to understand all of the factors when putting together the business drivers and guiding principles that ultimately will govern the solution. Ignoring any one of the groups touching the systems is a mistake.