Mistakes as Causes of Defects | Design for Trustworthy Software: Tools, Techniques, and Methodology of Developing Robust Software

Hinckley and Barkan define a mistake as the execution of a prohibited action, or failure to perform a required action, or the misrepresentation of information essential for the correct execution of an action.^[10] Rare mistakes do occur, even in high-capability processes. The fact that mistakes are rare in many processes means that traditional sampling and statistical methods are not useful in estimating their frequency:

Mistakes can only be described effectively in terms of probability, which is the only universal means to describe both mistakes and variability.^[11]
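
To see why sampling says so little about rare mistakes, consider a minimal sketch. It assumes that each inspected unit independently contains a mistake with the same probability p; both that independence assumption and the numbers used are illustrative and do not come from the cited studies. Under that assumption, the chance that a sample of n units shows no mistakes at all is (1 - p)^n. The function name is hypothetical.

```python
def prob_sample_shows_no_mistakes(p: float, n: int) -> float:
    """Probability that n independent units contain zero mistakes,
    given a per-unit mistake probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return (1.0 - p) ** n


if __name__ == "__main__":
    # Illustrative values only: a mistake rate of 1 in 10,000
    # and three candidate sample sizes.
    p = 1e-4
    for n in (100, 1_000, 10_000):
        print(n, round(prob_sample_shows_no_mistakes(p, n), 3))
```

With these illustrative numbers, a sample of 100 units comes back clean about 99% of the time, and even a sample of 10,000 units shows no mistakes about 37% of the time, so an empty sample gives little information about the underlying mistake rate.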

How critical are mistakes in determining quality of complex products such as enterprise software? The simple answer is: to a very large extent. Harris and Chaney have concluded that 80% of the nonconformities in complex systems can be attributed to mistakes.^[12] Hinckley and Barkan have shown that mistakes can be effectively described only in terms of probability:

Mistakes either occur or they do not. Further, as every distribution describing variation can be converted to probabilities, consequently, the only universal method of describing both variation and mistake is probability.^[13]

In a software context, the causes of defects can be broadly classified as follows:

Wrong design: This is the most serious error in software. It is a consequence of the software designer's not understanding the user's functional needs and the user's not understanding the technology and what it can and cannot do. This is a "language translation" problem, because the user as a domain expert speaks one "language," and the programmer as a computer expert speaks another. One of the solutions is to have two project leaders for each application development: one from the programming staff and one from the user staff. The system architect then serves as the conflict resolver between the two. QFD methodology can be extremely useful in capturing the voice of the users and other customers, internal as well as external. The DFTS process supports this methodology (see Chapter 11).
Wrong process/operation: In program development, this can occur when the application is "out of scale" with either the computer intended for it, the programming language chosen to write it in, the operating system used, or the experience of the development staff.
Inadequate process/operation: This is the most common error in programmed business applications. It usually occurs when the programmers do not fully understand the users' (implicit or unstated) requirements. The user expects an operation to do several things at once (which is logical to him or her as an accountant or supply chain expert) but fails to convey this to the programmer.
Skipping a process/operation: This may happen in complex software and may produce mysterious results. It is a result of the programmer's not anticipating unusual or unexpected situations, such as dividing by 0. So-called "bulletproof" programs employ many lines of code to predict and handle situations that may occur very rarely but can cause a great deal of trouble when they do. Protective code, or error handlers, takes a lot of human design and coding time and some program storage. But it costs very little in machine cycles because the handlers, like the situations they handle, almost never run.
Using the wrong software algorithms or components: This happens in business applications when the wrong library routine is used to do a tax calculation, for example. In the case of NASA's Mars Climate Orbiter, one piece of ground software reported thruster impulse in English units (pound-force seconds) while the navigation software expected metric units (newton-seconds), and the spacecraft was lost on arrival at Mars.
Lack of software component synchronization: This happens in software when a library routine reads a file that has not yet been written and gets an error. Or perhaps file reads are not synchronized with writes, so a file is read before it has been rewritten, and the reading program gets old data rather than current data.
Failing to properly initialize a software algorithm or component: This is common in programming. It may occur, for example, as a result of failure to initialize a program algorithm or method so that, rather than beginning a sum with 0, it begins with some unknown leftover value. Modern programming languages try to protect the programmer from such errors, but they still occur, sometimes when the initialized variable is set to the wrong value. This is comparable to a bolt in a mechanical assembly being torqued to the wrong value, leaving it too loose or too tight and thus producing a failure.
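
Several of the causes listed above (unhandled rare conditions, unit mix-ups, and uninitialized accumulators) can be guarded against directly in code. The following is a minimal Python sketch under stated assumptions, not a prescribed DFTS technique: the `Impulse` type, its `from_pound_force_seconds` constructor, and the `average` function are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass

# Exact conversion factor: newton-seconds per pound-force second.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical value type that always stores impulse in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert at the boundary so English-unit values cannot be
        # mistaken for metric ones further down the call chain.
        return cls(value * LBF_S_TO_N_S)


def average(values: list[float]) -> float:
    """Mean of values, with the rare empty-input case handled explicitly."""
    if not values:
        # Error handler for the rare case that would otherwise divide by zero.
        raise ValueError("average() requires at least one value")
    total = 0.0  # explicit initialization: the sum always starts at zero
    for v in values:
        total += v
    return total / len(values)
```

Wrapping the quantity in a type that carries its unit moves the unit decision to one place, and the explicit guard in `average` turns a mysterious failure into a named error. Neither removes the need for review; they only make these particular mistakes harder to commit silently.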

Trustworthy software boils down to freedom from defects (nonconformance). As the process capability improves, mistakes play an increasingly crucial role in determining product quality. Furthermore, the collective impact of these mistakes, which are prevalent in complex systems, can be catastrophic. It is no wonder that avoiding mistakes, rather than SQC-based variation control, has been the key instrument of conformance quality in nuclear power plants and air traffic control systems, where the consequences of failure are most severe. A key strategy in mistake-proofing is controlling complexity, as discussed next.