There have been many attempts made to understand the reliability of software systems through software reliability modeling. These models have become very sophisticated. They are able to predict the past with great accuracy. Unfortunately, these models are
A hardware system is an integrated and functional unit. When a jet engine, for example, is running, all of the pieces of this jet engine are involved in the process of generating fire and great mayhem. Similarly, when a new computer CPU chip is energized, the whole surface of the chip is engaged in directing the flow of electrons. Further, when either of these systems is manufactured, there will be
Software systems are very different. The control software for a cellular telephone is exactly the same for each telephone set. All telephones with the same release of software will have exactly identical software loads on them. There is exactly zero manufacturing variation in these systems. If there is a fault in one software system, then that same fault will be present in all systems.
Just as a CPU chip has thousands of component parts, so will an equally complex software system. In a CPU chip, all of these systems are working at once. In our software system, only one component is operational at any instant. This is a phenomenally important difference. A software system is not the sum of its parts; it is only its parts. Some of these
The failure of a software system as it is dependent only on what the software is currently performing: operations that a
Some of the problems that have arisen in past attempts at software reliability determination all relate to the fact that their perspective has been
Existing approaches to the understanding of software reliability patently assume that software failure events are observable. The truth is that the overwhelming majority of software failures go unnoticed when they occur. Only when these failures
Computer programs do not break. They do not fail monolithically. Programs are designed to perform a set of mutually exclusive
Another problem found in software reliability investigations centers on the failure event itself. Our current view of software reliability is colored by a philosophical approach that
Yet another problem with the hardware adaptive approach to software reliability modeling is that the failure of a computer software system is simply not time dependent. A system can
Each program functionality can be thought of as having an associated reliability estimate. We may choose to think of the reliability of a system in these functional terms. Users of the software system, however, have a very different view of the system. What is important to the user is not that a particular function is fragile or reliable, but rather whether the system will
operate
to perform those actions that the user will want the system to perform correctly. From a user's perspective, it matters not, then, that certain functions are very unreliable. It only matters that the functions associated with the user's actions or operations are reliable. The classical example of this idea was expressed by the authors of the early UNIX utility programs. In the last paragraph of the documentation for each of these utilities there was a list of known bugs for that program. In general, these
The main problem in understanding software reliability from this new perspective is getting the granularity of the observation right. Software systems are designed to implement each of their functionalities in one or more code modules. In some cases there is a direct correspondence between a particular program module and a particular functionality. That is, if the program is
We will see that the key to understanding program failure events is the direct association of these failures to execution events with a given operation. A stochastic process will be used to describe the transition of program modules from one to another as a program transitions among operations. From these observations, it will become
Computer software is very different from the hardware on which it resides. Software is abstract whereas hardware is real and
In the manufacture of embedded software systems, there is zero variance among production software systems. If there is a fault in the control software for a single cell phone, then this defect will be present in every
The manner in which software systems will be used when they are deployed, however, does vary. This means that each user of the system will, in fact, be using a different software system. They will, by their operational profiles, select certain subsets of the code to exercise in ways that do vary from one user to another. Thus, our statistical models must reflect the context of the use of the software and not the software system itself. The variance in the software system is induced by the context and is not an intrinsic property of the software.
Physical systems also change with respect to time.
It is increasingly evident that the reliability of a software system is largely determined during program design. One distinct aspect of the software design process that lends itself to measurement is the decomposition of the functionality of a system into modules and the
The reliability of a software system can be characterized in terms of the individual software modules that make up the system. From the standpoint of the logical description of the system, these functional components of the larger system are, in fact, operating in series. If any one of these components fails, the entire system will fail. The likelihood that a component will fail is directly related to the complexity of that module. If it is very complex, the fault probability is also large. The system will be as reliable as its weakest component.
For each possible design outcome in a software design effort, there will be a set of expected execution profiles, one for each of the anticipated program functionalities. We should evaluate design alternatives in terms of the variance of functional complexity. If there is high variability in functional complexity, the reliability of the product of the design is likely to be low. A reasonable design objective would be to reduce functional complexity.
At the design stage, quantitative measures correlated to module fault-proneness and product failure provide a means to begin to exert statistical process control techniques on the design of a software system as an ongoing quality control activity. Rather than merely documenting the increasing operational complexity of a software product and therefore its
It is now quite clear that the architecture of a program will be a large determinant of its reliability. Some activities that a program will be asked to perform are quite simple. Designers and programmers alike will easily understand them. The resulting code will not likely contain faults. If, on the other hand, the specified functionality of a program is very complex and as a result ambiguous, then there is a good likelihood that this functionality will be quite fragile due to the faults introduced through the very human processes of its creation. A more realistic approach to software reliability modeling will reflect the software reliability in terms of the functionality of the software.
To capture the essence of the concept of software failure, let us suppose that each software system has a virtual failure module (VFM), m n +1 , to which control is automatically passed at the instant the program departs from its specified behavior. We can easily accomplish this action with assertions built into the code. In essence, we cannot perceive a failure event if we are not watching. Some program activities such as a divide by zero will be trapped by the system automatically. The hardware and operating system, however, cannot know our intentions for the successful execution of a module on a more global scale. We must instrument the code to know whether it is operating correctly. Programming assertions into the code is a very good way to measure for failure events.
There are really three levels of granularity of program reliability. At the
Each program module, then, will or will not fail when it is executed. We can know this and we can remember these failure events. A program module will be considered reliable if we have used it frequently and it has failed infrequently. A program module will be
Any program module can transfer control to the VFM. This transfer, of course, can only happen if the current module has departed from program specifications. With our dynamic instrumentation and measurement technique, we can clearly count the number of times that control has been transferred to each module. Those data can be accumulated in a module execution frequency vector (MEFV). We can also count the number of times that control has been transferred from any module to the VFM. These data can be
Let
t
i
represent the
i
th
element of the MEFV. It will contain the number of times that module
i
has received control during the last
epochs. Similarly, let
s
i
represent the number of times that module
i
has transferred control to the VFM during the same time period. It is clear that
s
i
≤
t
i
.
That is, a module could have failed every time it was executed or it could have executed successfully at least once. Our current estimate for the failure probability of module
i
on the
m
th
. This means that of the total
epochs that module
i
has received control, it has worked successfully on
epochs. The reliability of this module is simply one (1) minus its failure probability:
. From these observations we can construct a vector
of failure probabilities for each module on the
m
th
epoch.
It is clear that if module i has executed successfully for a large number of epochs, say 100,000, and failed but once over these epochs, then the module will be fairly reliable. In contrast, consider another module j that has executed only ten times and has not failed at all. At face value, the reliability of module j is 1, whereas the reliability of module i is 0.99999. However, we really lack enough information on module j to make a real accurate assessment of its behavior.
What we are really interested in is the current point estimate for reliability and a lower confidence for that estimate. It is clear that we can derive our estimate for the failure probability
from
. It is also clear that executing module
i
will have exactly two
.
An estimate a for the (l - α ) w upper bound for this failure probability can be derived from the binomial approximation to the normal distribution where:
If, for example, we wished to find the 9 percent confidence interval
for
, we could compute this from:
where 1.65 is the
x
ordinate value from the standard normal (N(0,1)) distribution for
α
= 0.95. This estimate,
, is an
. The problem is that if we have executed a particular module only a very few times and it has not failed, then the number of failures
for this module will be zero and the estimate for the upper bound
will also be zero. This would seem to imply that an untested module is also highly reliable and we are very confident about that estimate. Nothing could be further from the truth.
A more realistic assumption for the reliability of each module is to assume that it is unreliable until it is proven reliable. To this end, we will assume that each module has, at the beginning, been executed exactly once and it has failed. Thus, the number of measured epochs for each module will be increased by one and the number of failures experienced by that module will also be increased by one. Let
and
. Thus, a more suitable estimate for the initial failure probability for a module is
with an upper bound on this estimate of:
We can construct a vector u ( m ) whose elements are the (1 - α ) upper bounds on the estimates for the failure probabilities. Thus, with the vector of failure probabilities z ( m ) and the vector of upper bounds on these estimates u ( m ) , we can characterize the failure potential and our confidence in this estimate of set of modules that comprise the complete software system.
As each functionality
f
i
is exercised, it will induce a module execution profile
on the set of
n
program models. That is, each functionality will exercise the set of program modules differently. Therefore, each functionality will have a different failure potential, depending on the failure probability of the modules that it executes. We can compute an expected value for the functionality failure probability of each functionality on the
m
th
execution epoch as
In matrix form we can represent the vector of functional failure probabilities as q = pz (m) , where p is an m × n matrix of module execution profiles for each functionality. We can also compute an upper bound on the expected value of the functional failure probabilities as q' = pu (m) .
As each operation
ω
i
in the set of
l
operations is exercised by the user, it will induce a functional profile
on the set of
k
functionalities. That is, for every operation, there will be a vector of conditional probabilities for the execution of each functionality given that operation. Let
p'
represent the
l
×
k
matrix of the conditional probabilities of the
k
functionalities for the
l
operations.
It is clear that each operation will exercise the set of program functionalities differently. Therefore, each operation will have a different failure potential, depending on the failure probability of the functionalities that it executes. We can compute an expected value for the operational failure probability of each operation on the m th execution epoch as follows:
In matrix form we can represent the vector of operational failure probabilities as x = p'q , where p' is an l × k matrix of functional profiles. Substituting for q we can define x directly in terms of the measurable module failure probabilities to wit: x = p'pz (m) . We can also establish an upper bound on the operational failure probabilities as x' = p'pu (m) .
The elements of the vector x represent the failure probabilities for each of the l operations. The reliability of operation ω i can be derived from its failure probability as follows: r i = 1 - x i . Thus, we can construct a vector of operational reliability estimates r = I - x , where I is a vector of ones. We can also construct a vector of lower bounds on these reliability estimates: r' = I - x'.
The reliability of a software system is therefore not a static attribute of the system. It is an artifact of how the system will be used. There is a different reliability estimate for each operation, depending on how that operation exercises different modules. Furthermore, there is a lower bound on that reliability estimate, depending on how much we know about the modules that comprise that operation. A user's perception of the reliability of our system will depend on how he or she uses that system. The operational profile o represents the distribution of user activity across the system. Therefore, the user's reliability perception of the system ρ is simply ρ = or and the lower bound for this estimate is ρ ' = or '.
It seems pointless to engage in the academic exercise of software reliability modeling without making at least the slightest attempt to discuss the process of measuring failure events. A failure occurs immediately at the point that the program departs from its specification, not when we first perceive this
The basic
Let us now
The software will be designed to function optimally under known and predetermined operational profiles. We will need a mechanism for tracking the actual behavior of the user of the system. To this end we will require a vector
O
in which the program will count the frequency of each of the operations. That is, an element
o
i
of this vector will be incremented every time the program initiates the
i
th
user operation as per our discussion in Chapter 10. This is, of course,
When we first start using a program we will have little or no information about how its components will interact. The initial test activity is, in essence, the start of a learning process wherein we will come to understand how program components actually work together. To help understand this learning process, let us imagine that we have built a small software system that will perform four simple operations. These operations, in turn, will be implemented by a total of six functionalities. These functionalities will be mapped during the design process to a total of eight program modules. As per the discussion in Chapter 10, we will instrument this program so that we can monitor the transition from one operation to another, from one functionality to another, and from one module to another. Exhibit 1 shows the O × F mapping for this hypothetical system. Exhibit 2 shows the F × M mapping.
Exhibit 1: Mapping of Operations to Functionalities
|
|
|
Functionality |
||||||
|---|---|---|---|---|---|---|
|
Operation |
1 |
2 |
3 |
4 |
5 |
6 |
|
1 |
1 |
1 |
1 |
|||
|
2 |
1 |
1 |
||||
|
3 |
1 |
1 |
||||
|
4 |
1 |
1 |
||||
|
|
Exhibit 2: Mapping of Functionalities to Modules
|
|
|
Module |
||||||||
|---|---|---|---|---|---|---|---|---|
|
Functionality |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
|
1 |
1 |
1 |
||||||
|
2 |
1 |
1 |
||||||
|
3 |
1 |
1 |
1 |
|||||
|
4 |
1 |
1 |
||||||
|
5 |
1 |
1 |
||||||
|
6 |
1 |
1 |
||||||
|
|
Let us now assume that we have
Exhibit 3: Distribution of Functionalities by Operation
|
|
|
Functionality |
||||||
|---|---|---|---|---|---|---|
|
Operation |
1 |
2 |
3 |
4 |
5 |
6 |
|
1 |
201 |
1472 |
1603 |
|||
|
2 |
506 |
1001 |
||||
|
3 |
602 |
549 |
||||
|
4 |
332 |
567 |
||||
|
|
We can now derive the conditional probabilities of each functionality by operation from the data in Exhibit 3. These conditional probabilities are shown in Exhibit 4. They represent the contents of the matrix p' from the previous section.
Exhibit 4: Conditional Probabilities of Functionality Execution by Operation
|
|
|
Functionality |
||||||
|---|---|---|---|---|---|---|
|
Operation |
1 |
2 |
3 |
4 |
5 |
6 |
|
1 |
0.06 |
|
|
|
0.45 |
0.49 |
|
2 |
|
0.34 |
|
|
|
0.66 |
|
3 |
|
0.52 |
|
|
0.48 |
|
|
4 |
|
|
|
0.37 |
0.63 |
|
|
|
Let us now assume that we have tallied the execution of each module within each functionality. That is, we will build a matrix of module execution frequencies for each of the functionalities. These data are shown in Exhibit 5.
Exhibit 5: Distribution of Module Activity by Functionality
|
|
|
Module |
||||||||
|---|---|---|---|---|---|---|---|---|
|
Functionality |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
|
1 |
4651 |
2444 |
||||||
|
2 |
32 |
10 |
||||||
|
3 |
9456 |
987 |
64 |
|||||
|
4 |
761 |
765 |
||||||
|
5 |
5465 |
1226 |
||||||
|
6 |
756 |
796 |
||||||
|
|
From the frequency data in Exhibit 5 we can derive the conditional probabilities for the execution of each module given a functionality. Again, these conditional probabilities are derived by dividing the row marginal sums into the individual row entries. The resulting conditional probabilities are shown in Exhibit 6. They represent the contents of the matrix p from the previous section.
Exhibit 6: Conditional Probabilities of Module Execution by Functionality
|
|
|
Module |
||||||||
|---|---|---|---|---|---|---|---|---|
|
Functionality |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
|
1 |
0.66 |
|
|
|
|
|
0.34 |
|
|
2 |
|
0.76 |
|
|
|
|
|
0.24 |
|
3 |
|
|
0.90 |
|
|
|
0.09 |
0.01 |
|
4 |
|
|
|
0.50 |
|
|
|
0.50 |
|
5 |
|
|
|
|
0.82 |
|
|
0.18 |
|
6 |
|
|
|
|
|
0.49 |
|
0.51 |
|
|
If we multiply the two matrices p and p' , we will get the product matrix pp' shown in Exhibit 7. These are the conditional probabilities of executing a particular module while the user is expressing each operation.
Exhibit 7: Conditional Probabilities of Module Execution by Operation
|
|
|
Module |
||||||||
|---|---|---|---|---|---|---|---|---|
|
Operation |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
|
1 |
0.04 |
0.00 |
0.00 |
0.00 |
0.37 |
0.24 |
0.02 |
0.33 |
|
2 |
0.00 |
0.26 |
0.00 |
0.00 |
0.00 |
0.32 |
0.00 |
0.42 |
|
3 |
0.00 |
0.00 |
0.47 |
0.00 |
0.00 |
0.23 |
0.05 |
0.25 |
|
4 |
0.00 |
0.00 |
0.00 |
0.18 |
0.52 |
0.00 |
0.00 |
0.30 |
|
|
We have instrumented our hypothetical system so that we can keep track of the number of times that each module has failed during the total time we have tested the system. These data are shown in Exhibit 8. In every case, we are going to assert that every module has failed once in epoch 0.
Exhibit 8: Hypothetical Module Executions
|
|
|
Total Epochs |
Failures |
Successes |
|---|---|---|
|
4652 |
6 |
4646 |
|
33 |
1 |
32 |
|
9457 |
5 |
9452 |
|
762 |
4 |
758 |
|
5466 |
2 |
5464 |
|
757 |
3 |
754 |
|
3432 |
1 |
3431 |
|
2862 |
4 |
2858 |
|
|
Using the formulas for failure probability derived in the previous section we can compute the failure probability for each module from the data in Exhibit 8. These failure probabilities are shown in Exhibit 9. We can also compute the 95 percent upper confidence intervals for these failure probabilities. These upper bounds are also shown in Exhibit 9. The data are sorted by failure probability. We can see that the most reliable module is module 7, and the least reliable module is module 2. Not only is module 2 the most likely to fail, the upper bound on this estimate is also very large.
Exhibit 9: Module Failure Probabilities
|
|
|
Module |
Failure Probability |
Upper Bound |
|---|---|---|
|
7 |
0.0003 |
0.0008 |
|
5 |
0.0004 |
0.0008 |
|
3 |
0.0005 |
0.0009 |
|
1 |
0.0013 |
0.0022 |
|
8 |
0.0014 |
0.0025 |
|
6 |
0.0040 |
0.0077 |
|
4 |
0.0052 |
0.0096 |
|
2 |
0.0303 |
0.0795 |
|
|
We now know something about the failure probability (reliability) of each module. We know very little, however, about the reliability of the system. This will depend entirely on how this system is used. If a user were to choose operations that expressed modules 7, 5, and 3, the system would be seen as very reliable. If, on the other hand, the user were to favor operations that heavily used modules 6, 4, and 2, then this same system would very likely fail in the near
Now let us assume that there are four different users of our hypothetical system. Each user will exercise the system in his own particular way, represented by an operational profile. Exhibit 10 shows the operational profiles for each of the four users. We can see that user 1, represented by Operational Profile 1, shows a heavy bias to Operation 3, while user 3 exploits Operation 2 and, to a small extent, Operation 4.
Exhibit 10: Four Sample Operational Profiles
|
|
|
Operational Profile |
Operation |
|||
|---|---|---|---|---|
|
1 |
2 |
3 |
4 |
|
|
1 |
0.00 |
0.04 |
0.86 |
0.10 |
|
2 |
0.05 |
0.43 |
0.05 |
0.48 |
|
3 |
0.05 |
0.69 |
0.01 |
0.26 |
|
4 |
0.49 |
0.01 |
0.01 |
0.49 |
|
|
If we multiply the contents of Exhibit 10 with the contents of Exhibit 7, PP' , we can see how each of the four users' activity will distribute across the modules that comprise the system. This module activity is shown in Exhibit 11. Clearly, there is a distinct pattern of module execution for each of these users (operational profiles).
Exhibit 11: Distribution of Module Activity under Four Operational Profiles
|
|
|
Operational Profile |
Module Activity |
|||||||
|---|---|---|---|---|---|---|---|---|
|
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
|
|
1 |
0.00 |
0.01 |
0.41 |
0.02 |
0.05 |
0.21 |
0.04 |
0.26 |
|
2 |
0.00 |
0.11 |
0.02 |
0.09 |
0.26 |
0.16 |
0.00 |
0.35 |
|
3 |
0.00 |
0.18 |
0.00 |
0.05 |
0.15 |
0.23 |
0.00 |
0.38 |
|
4 |
0.02 |
0.00 |
0.01 |
0.09 |
0.43 |
0.12 |
0.01 |
0.32 |
|
|
Exhibit 11 represents a matrix that shows the distribution of module activity for each operational profile. Exhibit 9 is a matrix containing the failure probability of each module and an upper bound for that estimate. The product of these two matrices is shown in the second and third
Exhibit 12: Distribution of Module Activity under Four Operational Profiles
|
|
|
Operational Profile |
Failure Probability |
Upper Bound |
Reliability |
Lower Bound |
Difference |
|---|---|---|---|---|---|
|
1 |
0.0018 |
0.0037 |
0.9982 |
0.9963 |
0.0019 |
|
2 |
0.0050 |
0.0119 |
0.9950 |
0.9881 |
0.0069 |
|
3 |
0.0071 |
0.0173 |
0.9929 |
0.9827 |
0.0102 |
|
4 |
0.0017 |
0.0033 |
0.9983 |
0.9967 |
0.0016 |
|
|
By inspection of Exhibit 12, we can see that user 4, represented by Operational Profile 4, will use the system in the most reliable manner. That is because this user is exploiting modules with the highest reliability. Similarly, user 3 will have the dubious privilege of discovering problems with the software in that he is running the least reliable system.
The very nature of the milieu in which programs operate dictates that they will modify the behavior of the individuals who are using them. The result of this is that the user's initial use of the system, as characterized by the operational profile, will not
The continuing evaluation of the execution, function, and module profiles over the life of a system can provide substantial information as to the changing nature of the program's execution environment. It is clear that as we execute a particular piece of software, we will continue to collect data on the execution of the system. In all likelihood, the system will run quite well for many thousands of epochs. Thus, our perceptions of the reliability of our system will tend to be enhanced the more we work with the system. Also, our confidence in these estimates will grow over time as well.
[1] Musa, J.D., Software Reliability Engineering , McGraw-Hill, New York, 1998.