7.1 What is Software Reliability? | Parallel and Distributed Programming Using C++

Software reliability is the probability of failure-free operation of a computer program for a specified time in a specified environment. Ideally, that probability should be as close to 100% as the application requires. When failure is not an option, the software must be designed using the techniques of fault-tolerant programming. A fault-tolerant system is one that corrects or survives software faults. A fault is a program defect that can cause a piece of software to fail. We define "software failure" as the execution of some component of software in a way that deviates from system specifications. We rely on Musa, Iannino, and Okumoto in their work Software Reliability for a complete characterization of faults and failures:

A fault is the defect in the program that when executed under particular conditions, causes a failure. There can be different sets of conditions that cause failures, or the conditions can be repeated. Hence a fault can be the source of more than one failure. A fault is a property of the program rather than a property of its execution or behavior. It is what we are really referring to in general when we use the term "bug." A fault is created when a programmer makes an error.
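To make the distinction between a fault and a failure concrete, consider the following minimal sketch. It is our own illustration, not an example taken from Musa, Iannino, and Okumoto. The function names (is_leap_year_faulty, is_leap_year, fails_for) are hypothetical, and we assume the specification is the Gregorian leap-year rule. The fault is present in the program text at all times, but a failure is observed only for the inputs that exercise it.

```cpp
// Assumed specification: the Gregorian leap-year rule.
// A year is a leap year if it is divisible by 4, except for century
// years, which must also be divisible by 400.

// FAULT: the programmer omitted the century rule. The defect is a
// property of the program whether or not it is ever executed.
bool is_leap_year_faulty(int year) {
    return year % 4 == 0;
}

// Reference implementation that follows the assumed specification.
bool is_leap_year(int year) {
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

// A FAILURE occurs only when execution deviates from the specification,
// that is, when the faulty routine disagrees with the reference.
bool fails_for(int year) {
    return is_leap_year_faulty(year) != is_leap_year(year);
}
```

Calling fails_for with 2023, 2024, or 2000 reports no failure, because those inputs never reach the conditions under which the defect matters. Calling it with 1900 or 2100 reports a failure each time. The single fault is therefore the source of more than one failure, and it produces none at all until the right conditions occur.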

The errors that a programmer or software developer makes may stem from a misinterpretation of the software requirements, or from a poor, incorrect, or incomplete translation of those requirements into code. When the programmer makes these kinds of errors, he or she introduces defects, or faults, into the software. When those faults are executed, they can cause software failure. Software failure can occur only during execution. The process of testing and debugging removes faults from the software, thereby reducing the possibility of software failure. Note that we use the terms "defect" and "fault" interchangeably. We use the term "error" to refer to the mistakes a programmer makes that introduce faults (defects) into the software. Fault tolerance is a property that allows a piece of software to survive and recover from the software failures caused by faults introduced into the software as a result of human error. The most robust fault tolerance can even correct these failures.
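One way to picture software that survives a fault is the following sketch, which is loosely modeled on the classic recovery-block idea: run a primary routine, check its result with an acceptance test, and fall back to an alternate routine if the check fails. This is an illustration under our own assumptions, not necessarily the design this chapter goes on to develop. All of the function names and the tolerance value are hypothetical, and the fault in primary_sqrt is deliberate.

```cpp
#include <cmath>
#include <stdexcept>

// Acceptance test (hypothetical tolerance): is r a plausible square root of x?
bool acceptable_sqrt(double x, double r) {
    return r >= 0.0 && std::fabs(r * r - x) <= 1e-9 * (1.0 + x);
}

// Primary routine with a deliberate FAULT: three Newton iterations
// are too few to converge for most inputs.
double primary_sqrt(double x) {
    double r = 1.0;
    for (int i = 0; i < 3; ++i) {
        r = 0.5 * (r + x / r);
    }
    return r;
}

// Alternate routine, assumed here to be independent of the primary one.
double alternate_sqrt(double x) {
    return std::sqrt(x);
}

// Survive a failure of the primary routine by detecting the bad
// result and recovering with the alternate routine.
double tolerant_sqrt(double x) {
    if (x < 0.0) {
        throw std::domain_error("tolerant_sqrt: negative input");
    }
    double r = primary_sqrt(x);
    if (acceptable_sqrt(x, r)) {
        return r;                       // primary result passed the test
    }
    r = alternate_sqrt(x);              // recover from the failure
    if (!acceptable_sqrt(x, r)) {
        throw std::runtime_error("tolerant_sqrt: no acceptable result");
    }
    return r;
}
```

For an input of 4.0, primary_sqrt returns a value that the acceptance test rejects, so tolerant_sqrt recovers with the alternate routine. The fault is still in the program, but the failure it causes does not reach the caller. For an input of 1.0, the primary routine happens to produce a correct result and the alternate is never used.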

Some failures are the result of software faults. Other failures are the result of exceptional conditions (not necessarily due to human error) that can occur in either hardware or software. For instance, a network card damaged by a power surge can cause the software that depends on it to fail. A virus may corrupt a data transmission, causing the software that depends on that transmission to fail. A user may inadvertently remove critical components of a system, thereby causing the software to fail. These kinds of failures are not due to defects in the software, but are created by conditions that we call exceptions. An exception is an abnormal condition, exceptional circumstance, or extraordinary occurrence that the software encounters and that causes all or part of the software to fail. Although both defects and exceptions cause software failure, it is important to distinguish between them. The techniques for dealing with defects and exceptions can be, and usually are, different. While the end result of applying those techniques is reliable software, exception handling and error (defect) handling use different design approaches and coding constructs.
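The following minimal sketch shows an exceptional condition that is not a defect in the code: a file the program depends on has been removed. The function names (read_setting, setting_or_default) are hypothetical and used only for illustration. We assume that falling back to a default value is an acceptable recovery, which will not be true of every system.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the first line of a settings file.
// A missing file is not a programming defect. It is a condition of the
// environment, so it is reported by throwing an exception.
std::string read_setting(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::string line;
    std::getline(in, line);
    return line;
}

// The caller handles the exception and lets the software keep running
// with a fallback value instead of failing.
std::string setting_or_default(const std::string& path,
                               const std::string& fallback) {
    try {
        return read_setting(path);
    } catch (const std::runtime_error&) {
        return fallback;
    }
}
```

Here the code itself is correct with respect to its specification. The try and catch constructs deal with a condition imposed from outside the program, whereas a defect would be dealt with by finding and correcting the code.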