Mean Time to Failure and Mean Time to Recover

The two most common metrics used to measure fault tolerance and avoidance are the following:

  • Mean time to failure (MTTF): The mean time until the device will fail
  • Mean time to recover (MTTR): The mean time it takes to recover once a failure has occurred

Although a great deal of time and energy is often spent trying to increase the MTTF, it's important to keep in mind that even with a finite failure rate, an MTTR of zero or near zero can make a system that fails indistinguishable from one that hasn't failed. The fraction of time a system is down is MTTR/(MTTF + MTTR), which is approximately MTTR/MTTF when recovery is much faster than failure. Because it can be prohibitively expensive to increase MTTF beyond a certain point, you should also spend time and resources on managing and reducing the MTTR for your most likely and costly points of failure.
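The trade-off between the two metrics can be sketched numerically. The following Python sketch assumes the long-run downtime fraction is MTTR/(MTTF + MTTR), which reduces to the MTTR/MTTF ratio when MTTR is small relative to MTTF. The function name and all of the hour figures are hypothetical, chosen only for illustration.

```python
def downtime_fraction(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time spent down: MTTR / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttr_hours / (mttf_hours + mttr_hours)


HOURS_PER_YEAR = 24 * 365

# Hypothetical (MTTF, MTTR) pairs in hours, not measurements of any real system.
scenarios = {
    "baseline": (10_000, 8),
    "MTTF doubled": (20_000, 8),
    "MTTR halved": (10_000, 4),
    "MTTR near zero": (10_000, 0.01),
}

for name, (mttf, mttr) in scenarios.items():
    fraction = downtime_fraction(mttf, mttr)
    print(f"{name:>15}: {fraction * HOURS_PER_YEAR:6.2f} hours down per year")
```

Under these assumptions, halving the MTTR reduces downtime by the same amount as doubling the MTTF, and it is often the cheaper of the two to achieve.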

Most modern electronic components have a distinctive "bathtub" curve that represents their failure characteristics, as shown in Figure 36-1. During the early life of the component (referred to as the burn-in phase), it's more likely to fail; once this initial phase is over, a component's overall failure rate remains quite low until it reaches the end of its useful life, when the failure rate increases again.

Figure 36-1. The normal statistical failure rates for mechanical and electronic components: a characteristic "bathtub" curve.
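One common way to approximate a bathtub curve is to add three failure rates together: a falling rate for early-life failures, a constant rate for random failures, and a rising rate for wear-out. The Python sketch below does this with Weibull hazard functions. The shape and scale parameters, the constant rate, and the function names are all assumptions picked to produce the general shape, not data for any real component.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at age t."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Sum of early-life, random, and wear-out failure rates (per year)."""
    if t_years <= 0:
        raise ValueError("age must be positive")
    # shape < 1 gives a rate that falls with age (early-life failures).
    early = weibull_hazard(t_years, shape=0.5, scale=50.0)
    # Constant rate: the flat bottom of the tub (assumed value).
    random_failures = 0.02
    # shape > 1 gives a rate that rises with age (wear-out).
    wear_out = weibull_hazard(t_years, shape=5.0, scale=8.0)
    return early + random_failures + wear_out


for age in (0.05, 1, 3, 6, 8):
    print(f"age {age:>4} years: {bathtub_hazard(age):.3f} failures per year")
```

With these illustrative parameters, the rate is high in the first weeks, drops to its lowest level in mid-life, and climbs again as the component approaches the end of its useful life.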

The typical commodity hard disk of 10 years ago had an MTTF on the order of three years. Today, a typical MTTF for a commodity hard disk is more likely to be 35 to 50 years! At least part of that difference is a direct result of counting only the portion of the curve in the normal aging section while taking externally caused failure out of the equation. So a hard disk that fails because of a power spike that wasn't properly filtered doesn't count against the MTTF of the disk, nor does a disk that fails in its first week or two. This may be nice for the disk manufacturer's statistics, but it doesn't do much for the system administrator whose system has crashed because of a disk failure. As you can see, it's important to look at the total picture and carefully evaluate all the factors and failure points on your system. Only by looking at the whole system, including the recovery procedures and methodology, can you build a truly fault-tolerant system.
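To put the quoted MTTF figures in perspective, it can help to convert them into a chance of failure per year. The Python sketch below assumes a constant failure rate, which corresponds only to the flat, normal aging section of the curve and therefore leaves out early-life and externally caused failures. The function name and the 100-disk fleet size are hypothetical.

```python
import math


def annual_failure_probability(mttf_years: float) -> float:
    """Chance that a device fails within one year, assuming a constant
    failure rate of 1/MTTF (exponentially distributed lifetimes)."""
    if mttf_years <= 0:
        raise ValueError("MTTF must be positive")
    return 1.0 - math.exp(-1.0 / mttf_years)


FLEET_SIZE = 100  # hypothetical number of disks in service

for mttf in (3, 35, 50):
    p = annual_failure_probability(mttf)
    print(f"MTTF {mttf:>2} years: {p:.1%} chance of failure per disk per year, "
          f"about {p * FLEET_SIZE:.1f} expected failures in {FLEET_SIZE} disks")
```

Even at an MTTF of 50 years, a system with many disks should expect failures every year, which is why the recovery procedures matter as much as the component ratings.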

