Redundancy and Deployment


One of the most important aspects of a server deployment is the set of decisions you make about the level of fault tolerance you need to build into the system. When people think about fault tolerance, they tend to think in terms of the equipment or servers in the room or building. You can build fault tolerance into equipment with duplication and failover, as discussed in Chapter 17, "Server Rooms," but you do so at a cost. The real trick is to determine how much risk you can realistically afford and to build your deployment accordingly. If you don't need a mission-critical system with uptime of 99.999% or more, then buying or building a system that has less than five minutes of downtime per year is a waste of money that would be better spent on other resources.
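
As a rough sanity check, an availability target translates directly into an annual downtime budget. The following sketch is plain arithmetic (assuming a 365-day year, no vendor data involved) and shows why "five nines" works out to roughly five minutes of downtime per year:

  # Convert availability targets ("nines") into the downtime they
  # permit per year. Plain arithmetic; no vendor data involved.

  MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

  for availability in (0.99, 0.999, 0.9999, 0.99999):
      downtime = MINUTES_PER_YEAR * (1 - availability)
      print(f"{availability:.3%} uptime allows {downtime:,.1f} minutes of downtime/year")

  # 99.000% uptime allows 5,256.0 minutes (about 3.7 days)
  # 99.999% uptime allows 5.3 minutes ("five nines")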

Your project specification should include a description of the intended levels of fault tolerance that you expect for the different systems involved. You should spell out the rationale that explains why you chose that level so that it is clear why you are making the investment in the equipment you have selected. These are fundamental assumptions, and they affect overall system costs dramatically, so it's a really good idea to explain the assumptions early on in the process and seek comment from the people involved in the project (managers, users, and other designers) about whether those assumptions are within their expectations.

The following is a list of factors that should play into your selection of system redundancy:

  • Cost of downtime, as a function of time

  • Number of workers or amount of business affected

  • Systems that can be purchased with the budget available to fund the project

  • Additional staffing and operational complexity required to maintain a level of fault tolerance

  • Ability to respond in a timely fashion to a failure

A few years back, a set of Sun servers at eBay's main data center went offline without a sufficient offsite server failover capability, crashing the entire site for several hours. The next day, eBay's stock value fell $5 billion, down 21% from the day before. This loss, albeit a temporary one, is a dramatic example of the issue that many organizations face when their IT systems fail: The amount of money lost due to system failure can be many times the cost of the entire IT system.

Tip

It's a good idea to perform an analysis to determine the level of risk associated with specific levels of downtime for the systems to be deployed. That analysis, as well as the business assumptions involved with accepting some level of risk, should be spelled out in your completed project specification.


Not every system outage leads to huge losses, nor does every system failure put lives at risk, as one might for a critical care patient in a hospital. So your level of risk may be relatively low. Because the level of risk is essentially a business decision, the business managers of the funding sources should play a central role in setting a deployment's level of redundancy. At one large company, it was decided that the email system was critical but did not require mission-critical status. The business decision was made that an occasional outage of a couple of hours was the maximum risk that could be tolerated. Beyond two hours of downtime, division managers would start sending their employees home or suspending various business processes. An analysis of the costs of downtime for that particular system found that after two hours, the cost of lost business climbed exponentially, and within four hours it exceeded the entire cost of all the systems and servers that were part of the deployment.
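
A downtime-cost analysis like the one just described can be sketched as a simple model. Everything in the sketch (the hourly rate, the two-hour knee, and the growth factor) is a hypothetical placeholder; the point is the shape of the curve, not the figures:

  import math

  def downtime_cost(hours, hourly_rate=50_000, knee=2.0, growth=1.5):
      # Below the knee, cost accrues linearly with lost work time;
      # beyond it, cost compounds as managers send staff home and
      # business processes halt. All parameters are illustrative.
      if hours <= knee:
          return hourly_rate * hours
      return hourly_rate * knee * math.exp(growth * (hours - knee))

  for h in (1, 2, 3, 4):
      print(f"{h} hour(s) of downtime: ~${downtime_cost(h):,.0f}")

  # 1 hour:  ~$50,000       2 hours: ~$100,000
  # 3 hours: ~$448,169      4 hours: ~$2,008,554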

When you have decided on an appropriate level of fault tolerance, you must determine how to assess what your equipment's actual fault tolerance will be. That information isn't always easy to come by, but there is one guiding principle: A system's fault tolerance is no better than that of its least fault-tolerant component. This is standard risk reasoning; in any complex system, overall risk is dominated by the single riskiest factor. Therefore, when you want to improve the reliability of any system, the first step is to address the fault tolerance of the weakest part of the system.
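
To see the principle in numbers, consider a series system, one in which every component must be working. Under the standard independence assumption, overall availability is the product of the component availabilities, so it can never exceed that of the weakest part. The component names and figures below are hypothetical:

  # Weakest-link principle: for a series system, overall availability
  # is the product of the parts, so the worst component dominates.

  components = {
      "server hardware": 0.9999,
      "network switch":  0.9995,
      "disk subsystem":  0.999,   # weakest link
      "power feed":      0.9999,
  }

  overall = 1.0
  for availability in components.values():
      overall *= availability

  print(f"overall availability: {overall:.4%}")
  # -> 99.8301%, worse than even the weakest single component (99.9%)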

Generally, in modern computer systems, the items with the shortest mean lifetimes are the mechanical devices (for example, disk drives). That's why RAID is such an important feature in server systems that are expected to have high fault tolerance. Anything that moves (optical drives, tape systems, mechanical switches, and so on) is a candidate for a backup system.

Vendors quote reliability figures, but those numbers don't really provide significant guidance when you're attempting to decide how much fault tolerance to purchase. A vendor's disk drives may have a nominal MTBF (mean time between failures) rating of five years. That number is usually valid when you are measuring thousands of drives scattered randomly among a number of production runs, but failures tend to run in streaks when a problem arises in one particular production run. An MTBF of five years per disk won't be particularly helpful to you if 10% of a 50-drive array fails at about the same time because the drives were all part of the same production run.
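
A back-of-the-envelope calculation shows why a per-drive MTBF figure offers limited comfort at scale, even before correlated production-run failures enter the picture. The sketch assumes independent failures at a constant rate, which is precisely the assumption a bad production run violates:

  drives = 50
  mtbf_years = 5.0

  # With a constant failure rate, the expected number of failures per
  # year across the array is simply drive count divided by per-drive MTBF.
  expected_failures_per_year = drives / mtbf_years
  print(f"expected failures per year: {expected_failures_per_year:.0f}")
  # -> 10 per year on average, even if every drive meets its rating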

There's really no easy answer to the problem of average failure rates versus failures of specific instances. In some cases, it is possible to distribute components in such a way that your risk is spread out. You could, for example, systematically distribute disk drives from any one purchase across a number of separate arrays so that no single array contains too many drives from the same production run. However, when you do that, you may simply be spreading your problems around to more systems and making more work for yourself. Chasing this problem is a fool's errand. When this kind of concern crops up, the more appropriate question is whether to purchase drives with a higher MTBF, moving up in reliability from drives with a five-year average to ones rated at perhaps seven years.



