Mean Time to Failure and Mean Time to Recover


Two metrics are most commonly used to measure fault tolerance and avoidance: mean time to failure (MTTF), the average time a device runs before it fails, and mean time to recover (MTTR), the average time it takes to restore service once a failure has occurred. Keep in mind that even if your failure rate is not zero, an MTTR at or near zero can make the system indistinguishable from one that hasn't failed. The fraction of time a system is down is generally estimated as MTTR/MTTF (strictly, MTTR/(MTTF + MTTR), which is nearly the same whenever MTTR is much smaller than MTTF). Since it can be prohibitively expensive to increase MTTF beyond a certain point, you should spend both time and resources on managing and reducing the MTTR for your most likely and costly points of failure.
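To make the trade-off between MTTF and MTTR concrete, here is a minimal Python sketch that computes the long-run downtime fraction as MTTR/(MTTF + MTTR). The function names and the hour figures are hypothetical, chosen only for illustration; they do not come from any real system.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def downtime_fraction(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is down: MTTR / (MTTF + MTTR)."""
    return mttr_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime in one year of operation."""
    return downtime_fraction(mttf_hours, mttr_hours) * HOURS_PER_YEAR


# Hypothetical figures: one failure every 10,000 hours, 8 hours to recover.
baseline = downtime_hours_per_year(10_000, 8)
doubled_mttf = downtime_hours_per_year(20_000, 8)  # more reliable hardware
halved_mttr = downtime_hours_per_year(10_000, 4)   # faster recovery

print(f"baseline:     {baseline:.2f} hours/year")
print(f"doubled MTTF: {doubled_mttf:.2f} hours/year")
print(f"halved MTTR:  {halved_mttr:.2f} hours/year")
```

With these numbers, halving the recovery time reduces downtime exactly as much as doubling the time between failures, which is why reducing MTTR is often the cheaper route.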

Most modern electronic components have a distinctive "bathtub" curve that represents their failure characteristics, as shown in Figure 35-1. During the early life of the component (referred to as the burn-in phase), it's more likely to fail; once this initial phase is over, a component's overall failure rate remains quite low until it reaches the end of its useful life, when the failure rate increases again.


Figure 35-1. The normal statistical failure rates for mechanical and electronic components: a characteristic "bathtub" curve.
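The shape of the curve can be reproduced with a short Python sketch. It models the failure rate as the sum of three terms: a decreasing Weibull hazard for early failures, a constant rate for the normal life of the component, and an increasing Weibull hazard for wear-out. This decomposition is a common textbook modeling choice rather than something taken from the figure, and every parameter value below is a made-up assumption picked only to produce the bathtub shape.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_hours: float) -> float:
    """Illustrative failure rate (failures per hour) at age t_hours > 0."""
    early = weibull_hazard(t_hours, shape=0.5, scale=2_000.0)     # shape < 1: rate falls with age
    steady = 1e-5                                                 # constant rate in normal life
    wear_out = weibull_hazard(t_hours, shape=5.0, scale=60_000.0) # shape > 1: rate rises with age
    return early + steady + wear_out


# Print the failure rate at a few ages to show the high-low-high pattern.
for hours in (100, 1_000, 20_000, 50_000, 80_000):
    print(f"{hours:>6} h: {bathtub_hazard(hours):.2e} failures/hour")
```

The printed rates start high, drop through the middle of the component's life, and climb again toward the end.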

The typical commodity hard disk of 10 years ago had an MTTF on the order of three years. Today, a typical MTTF for a commodity hard disk is more likely to be 35 to 50 years! But at least part of that difference is a direct result of counting only the portion of the curve in the normal aging section, taking externally caused failure out of the equation. So a hard disk that fails because of a power spike that wasn't properly filtered doesn't count against the MTTF of the disk. This may be nice for the disk manufacturer's statistics, but it doesn't do much for the system administrator whose system has crashed because of a disk failure. Consequently, you can understand the importance of looking at the total picture and carefully evaluating all the factors and failure points on your system. Only by looking at the whole system, including the recovery procedures and methodology, can you build a truly fault-tolerant system.
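An MTTF of 35 to 50 years does not mean that an individual disk will last that long. As a rough illustration, the Python sketch below converts an MTTF into the chance that a disk fails within a given year. It assumes a constant failure rate (the flat part of the curve only), which is a simplification, and the function name is hypothetical.

```python
import math


def annual_failure_probability(mttf_years: float) -> float:
    """Chance of failure within one year, assuming a constant failure rate of 1/MTTF."""
    return 1.0 - math.exp(-1.0 / mttf_years)


# MTTF figures quoted in the text: 3 years (older disks), 35 to 50 years (current disks).
for mttf in (3, 35, 50):
    p = annual_failure_probability(mttf)
    print(f"MTTF {mttf:>2} years: {p:.1%} chance of failing in a given year")
```

Under this assumption, even a 50-year MTTF means you should expect a few failures each year across a large population of disks, before counting the externally caused failures that the manufacturer's figure leaves out.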



Microsoft Windows 2000 Server Administrator's Companion, Vol. 1
ISBN: 1572318198
Year: 2000
Pages: 366
