What Is "Reliability"?

Use case-driven development and SRE are a natural match, both being usage-driven styles of product development. What SRE brings to the party is a discipline for focusing time, effort, and resources on use cases in proportion to their estimated frequency of use, or criticality, to maximize reliability while minimizing development costs. But what is reliability? We spent a lot of time in the last chapter talking about the operational profile as a tool to work smart to achieve reliability, but we have never actually said what reliability is. Software reliability is defined as:

The probability of failure-free operation for a specified length of time in a specified environment.

Though short, this definition has a lot packed into it that doesn't necessarily jump out at you on first read. There are two main themes to take note of.

Software Reliability Is User-Centric and Dynamic

First, software reliability is defined from the perspective of a user using a system in operation. It is a user-centric, dynamic definition of reliability, as opposed to, say, faults per lines of code, which is a developer-centric, static measure.

Consider the phrase "specified environment." This includes the hardware, its configuration, and the user profile (e.g., whether the user is an expert, a novice, and so on). User profile is important because a system designed for use by an expert could well be unreliable in the hands of a novice. The whole "specified environment" idea is only pertinent because you are qualifying expectations of reliability in terms of a system being operated by a user.

How about the phrase "failure-free" operation; what does that imply? A failure occurs when a system in operation encounters a defect or fault, causing incorrect results or behavior. A defect or fault is a static concept (it's just there, in the code), but a failure is something that can only happen when the system is in operation. So again, this is a dynamic concept implying a system in operation.

A dynamic, user-centric definition of reliability is more than an academic issue. This part of the definition is at the heart of SRE's ability to deliver high reliability per development and test dollar spent. A use case with lots of defects or faults in its underlying code can seem reliable if the user spends so little time running it that none of the many bugs are found. Conversely, a use case that has few defects or faults in its underlying code can seem unreliable if the user spends so much time running it that they find all those few bugs in operation. This is the concept of perceived reliability; it is the reliability the user experiences as opposed to a reliability measure in terms of, say, defect density.

Software Reliability Is Quantifiable

The second key theme in the definition of reliability is that it is quantifiable; the key phrase here is "…probability of failure-free operation for a specified length of time …"

In the last chapter, we saw an example of calculating the risk exposure that two hypothetical hardware widgets posed to a manufacturing machine of which they were a part: when either failed, production was shut down until it was replaced. As part of the calculation of risk, we said that one widget was of a type expected to fail once in 5,000 hours of operation and the other once in 10,000 hours of operation. These were statements about the expected failure intensity of the widgets. Failure intensity is the number of failures per some unit time and is probably the most common method of specifying and tracking software reliability.
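As a quick worked conversion (the arithmetic here is mine; the widget figures are from the earlier chapter's example): a widget expected to fail once in 5,000 hours of operation has a failure intensity of 1/5,000 = 0.0002 failures per hour, and the 10,000-hour widget has a failure intensity of 1/10,000 = 0.0001 failures per hour.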

But technically speaking, to answer the question "What is the probability of failure-free operation for a specified length of time?" we need the formula shown in Equation 4.1, called the exponential failure law (typically given with Greek letters, which unfortunately makes it look more complicated than it actually is):

Equation 4.1 Reliability is the probability of failure-free operation for a specified length of time, which is given by this formula, called the exponential failure law.

R(t) = e^(−λt)

where t is the specified length of operating time and λ (lambda) is the failure intensity.

Don't worry about committing this formula to memory: You'll see how to use it as part of a simple spreadsheet formula later in the chapter.

Equation 4.1 reads like this: R(t), the reliability for a specified length of operating time t, is the probability of failure-free operation for that length of time, given a constant failure intensity of λ.


Now let's reconsider that question: given that a widget is expected to fail once in 5,000 hours (that's the failure intensity, i.e., λ = 1/5,000), isn't it more likely to fail later in that period than earlier? Equation 4.1 makes this precise: because e^(−λt) = 1/e^(λt), t is in the denominator, so the bigger t is, the smaller the probability R(t) is; and the smaller t is, the bigger the probability the widget will be able to run that long failure-free.
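To see these numbers for yourself, here is a minimal sketch of Equation 4.1 in Python, used as a stand-in for the spreadsheet formula mentioned above; the λ = 1/5,000 value comes from the widget example, while the sample times are my own:

import math

def reliability(t_hours, failure_intensity):
    # Exponential failure law: probability of failure-free operation for t_hours
    return math.exp(-failure_intensity * t_hours)

lam = 1 / 5000  # widget expected to fail once in 5,000 hours of operation

for t in (10, 100, 1000, 5000, 10000):
    print(f"R({t} hours) = {reliability(t, lam):.3f}")

Running this prints a probability that shrinks as t grows: roughly 0.998 at 10 hours, 0.819 at 1,000 hours, 0.368 at 5,000 hours, and 0.135 at 10,000 hours, which is exactly the "bigger t, smaller R(t)" behavior described above.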

So, technically speaking, Equation 4.1 is the "official" definition of reliability per se, as it is what is needed to calculate the probability of failure-free operation for a specified length of time. But the main component of Equation 4.1 is failure intensity, and failure intensity is the bit that is most commonly used in setting and tracking software reliability goals.

Reliability: Software Versus Hardware

Finally, if it's not already obvious, it's important to point out there are distinctions between hardware reliability and software reliability. When we talk about hardware widgets with expected lives of 5,000 or 10,000 hours, the source of failure is assumed to be from some part of the widget wearing out from physical use. But software does not wear out per se, leading some to question whether or not it makes sense to apply statistical models, such as the exponential failure law that originated with hardware reliability, to software (Davis 1993).

But the software reliability engineering community counters that though the source of failures is different for software (it doesn't wear out), statistical models are nevertheless valid for describing what we experience with software: the longer you run software, the higher the probability you'll eventually use it in an untested manner and find a latent defect that results in a failure.

Resolution of these (sometimes theoretical) views notwithstanding, the discipline of software reliability engineering has plenty of ideas I think you will find have practical application to use case development: a user-centric, dynamic, and quantifiable view of reliability. In the next section, you will get a closer look at what is virtually the heart of this view of reliability, failure intensity, and learn how to apply it to your projects.


