11.4 Measurement-Based Testing

Deterministically testing a large software system is virtually impossible. Even small systems, on the order of 20 or 30 modules, often have far too many possible execution paths for complete deterministic testing. This being the case, we must revisit what we hope to accomplish by testing the system. Our goal might be to remove all of the faults within the code. If this is our goal, then we will need to know when we have found all of these faults. Given unlimited time and resources, identification and removal of all faults might be a noble goal, but real-world constraints make this largely unattainable. The problem is that we must provide an adequate level of reliability in light of the fact that we cannot find and remove all of the faults. Through the use of software measurement, we hope to identify which modules contain the most faults and, based on execution profiles of the system, how these potential faults can impact software reliability. The fundamental principle is that a fault that never executes never causes a failure, while a fault that lies along the path of normal execution will cause frequent failures. The majority of the testing effort should be spent finding those faults that are most likely to cause failure. [1]

The first step in this testing paradigm is the identification of those modules that are likely to contain the most faults. We can identify these modules through our static measurement techniques. In the current state of practice, the objectives of the software test process are not clearly specified and sometimes not clearly understood. An implicit objective of a deterministic approach to testing is to design a systematic and deterministic test procedure that will guarantee sufficient test exposure for the random faults distributed throughout a program. If, for example, all possible paths have been executed, then any potential fault on these paths will have had the opportunity to be expressed.

We must, however, come to accept the fact that some faults will always be present in the code. We will not be able to eliminate them all, nor should we try. The objective of the testing process should be to find those faults that will have the greatest impact on the reliability/safety/survivability of the code. Using this view of the software testing process, the act of testing may be thought of as conducting an experiment on the behavior of the code under typical execution conditions. We will determine, a priori, exactly what we wish to learn about the code in the test process and conduct the experiment until this stopping condition has been reached.

11.4.1 Simple Measures for the Test Process

At the completion of each test case, we would like to be able to measure the performance of that test activity. There are many different aspects of the test activity, so there will also be a host of different measurements that we can make. The important thing is that the test activity must be measured and evaluated. We must learn to evaluate every aspect of the test activity. Our objective is not just to drive the code until it fails or fails to fail; we want to know exactly what the test did in terms of the distribution of its activity and where potential faults are likely to be.

The first measure that we will use in this regard will examine how the test activity was distributed across the modules actually exercised by the test. The distribution of the activity will be an important assessment of the ultimate testability of the program. If each of the tests concentrates all of its activity on a small subset of the modules, the majority of program modules will not be executed. If, on the other hand, the test activity is distributed evenly across the modules, then each of the modules will have had equal exposure during the test. A very useful measure of program dynamics is the program entropy measure discussed in Chapter 10; that is:

$$h = -\sum_{i=1}^{n} p_i \log_2 p_i$$

Each test will generate a different test execution profile. Hence, entropy is a good measure of how the test effort was distributed among the various program modules. For the test execution profile $p^{(k)}$ of the kth test, the test entropy will be:

$$h^{(k)} = -\sum_{i=1}^{n} p_i^{(k)} \log_2 p_i^{(k)}$$

where n is the cardinality of the set of all program modules. A low entropy test is one that will spend all of its time in a relatively small number of modules. Maximum entropy is, of course, $\log_2 n$ for a test that exercises all n program modules equally.

Some tests will have a low entropy value and others will have a high entropy value. A large entropy value would indicate that the test tended to distribute its activity fairly evenly across the modules of the system. A small entropy value would indicate that only a small number of modules received the majority of the test activity.
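
To make the entropy calculation concrete, here is a minimal sketch in Python; the function name and the frequency vector are hypothetical and are not drawn from the case studies later in this chapter.

    import math

    def test_entropy(frequencies):
        """Compute test entropy h(k) = -sum(p_i * log2(p_i)) from a
        test execution frequency vector (one count per program module)."""
        total = sum(frequencies)
        entropy = 0.0
        for count in frequencies:
            if count > 0:                     # modules never executed contribute nothing
                p = count / total             # execution profile value p_i
                entropy -= p * math.log2(p)
        return entropy

    # Hypothetical frequency vector for a five-module system.
    tefv = [900, 50, 30, 15, 5]
    print(round(test_entropy(tefv), 3))       # low entropy: activity concentrated in one module
    print(round(math.log2(len(tefv)), 3))     # maximum possible entropy, log2(n)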

The process of measuring the test process for evolving software systems is very complicated. Existing systems are continually being modified as a normal part of the software maintenance activity. Changes will be introduced into the system based on the need for corrections, adaptations to changing requirements, and enhancements to make the system perform faster and better. The precise effects of changes to software modules, in terms of the number of latent faults introduced, are now reasonably well understood. From a statistical testing perspective, test efforts should focus on those modules that are most likely to contain faults. Each program module that has been modified, then, should be tested in proportion to the number of anticipated faults that might have been introduced into it. Thus, the second measure of test activity will relate to the location of potential faults in the system.

In the face of the evolving nature of the software system, the impact of a single test can change from one build to the next. Each program module has a fault index (FI) value ρi. Again, FI is a fault surrogate: the larger the FI value, the greater the fault potential of the module. If a given module has a large fault potential but limited exposure (a small profile value), then the fault exposure of that module is also small. [2] One objective of the test phase is to maximize our exposure to the faults in the system. Another way to say this is that we wish to maximize the fault exposure φ, given by:

$$\varphi^{(k)} = \sum_{j=1}^{n} \rho_j^{(i)} p_j^{(k)}$$

where $\rho_j^{(i)}$ is the FI of the jth module on the ith system build and $p_j^{(k)}$ is the test execution profile of the kth test suite. [3] In this case, $\varphi^{(k)}$ is the expected value of the fault index under the kth test case profile.

We know that the average ρ is 100 for the baseline system; this is so because we scaled the fault index to have a mean of 100. The maximum value for $\varphi^{(k)}$ is simply the maximum value of the $\rho_j$. A unit test of this module j would yield this result; in this case, we would spend 100 percent of the test activity in this single module. By the same reasoning, the minimum value of $\varphi^{(k)}$ is simply the least of the $\rho_j$.
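
A companion sketch for the fault exposure calculation follows; the FI values and the profile are invented for illustration, and the routine simply forms the expected FI under the test profile.

    def fault_exposure(fi_values, profile):
        """Fault exposure phi(k): expected value of the fault index (FI)
        under a test execution profile."""
        assert abs(sum(profile) - 1.0) < 1e-9, "profile values must sum to 1"
        return sum(rho * p for rho, p in zip(fi_values, profile))

    # Hypothetical FI values (scaled to a mean of 100) and a test profile.
    fi = [120.0, 95.0, 80.0, 105.0]
    profile = [0.70, 0.10, 0.15, 0.05]
    print(fault_exposure(fi, profile))   # 110.75 for this example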

We now have two relatively simple measures of test outcomes: the test entropy measure h(k) will tell us about how each test case distributes its activity across modules, and the fault exposure measure φ(k) will tell us how well the test activity was distributed in terms of where the potential faults are likely to be. The main point is that we should learn to measure the test outcomes of individual tests and evaluate them as we perform each test case. Unfortunately, the current measurement in use in most organizations is a binary outcome: the test failed or the test succeeded.

These two measures will serve well to introduce the notion of test measurement. We must remember, however, that the code base is constantly changing. New modules are entering the code base, modules are being modified, and modules are being deleted. A single test case applied over a series of sequential builds may well produce very different values, depending on the nature of the changes that have occurred and the total code churn.

11.4.2 Cumulative Measure of Testing

Each test activity will generate a test execution frequency vector (TEFV). This vector records the frequency with which each module was exercised in a particular test. At the conclusion of each test we can add each element of that TEFV to a cumulative test execution frequency vector (CTEFV). The CTEFV, then, will contain the frequency of execution of each module over the entire set of test cases that have been executed to date. From this CTEFV we can compute a cumulative test execution profile p(c). This will show the distribution of test activity across all program modules to date. We can then easily compute the cumulative test entropy to date and our cumulative fault exposure φ(c). The CTEFV is a vector that will contain only the module frequencies for modules in the current build.
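
The bookkeeping described above can be sketched as follows, assuming the per-test frequency vectors are available from instrumentation; the class and method names are hypothetical.

    class CumulativeTestRecord:
        """Accumulate per-test execution frequency vectors (TEFVs) into a
        cumulative vector (CTEFV) and derive the cumulative profile p(c)."""

        def __init__(self, module_count):
            self.ctefv = [0] * module_count

        def add_test(self, tefv):
            # Element-wise accumulation of this test's frequencies.
            for i, count in enumerate(tefv):
                self.ctefv[i] += count

        def cumulative_profile(self):
            total = sum(self.ctefv)
            return [count / total for count in self.ctefv]

    record = CumulativeTestRecord(module_count=4)
    record.add_test([500, 20, 0, 80])
    record.add_test([10, 300, 40, 50])
    p_c = record.cumulative_profile()
    print([round(p, 3) for p in p_c])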

Let us assume, for the moment, that our test objectives are strictly to find faults. The measure φ is clearly a measure of exposure to potential faults in the program. Therefore, a test that will maximize φ would be an optimal test. It turns out, however, that the maximum value of φ is obtained when the one module with the highest FI is executed to the exclusion of all others. This is also a point of minimum entropy.

A fair test, on the other hand, is one that will spread its activity across all modules, not just the ones that are most likely to contain faults. A single fault in a module with a very low FI will have exactly the same consequences as a fault in a module with a much higher FI when either of these modules is executed by a user. Each module, then, should receive test activity in proportion to the relative likelihood that it will have faults that can be exposed by testing. Remember from Chapter 8 that the proportion of faults in module i on the jth build of the system was determined to be

$$\omega_i^{(j)} = \frac{\rho_i^{(j)}}{\sum_{k} \rho_k^{(j)}}$$

A fair test of a system is one in which $p_i^{(c)} = \omega_i^{(j)}$ for all i. In other words, each module should be tested in proportion to its potential contribution to the total fault count.

We would like to measure the difference between how we have distributed our test activity and where the potential faults might be. Let $d_i = p_i^{(c)} - \omega_i^{(j)}$ represent the difference between the test activity on module i and the relative fault burden for that module. A measure of the overall test effectiveness of our test effort to date will then be:

$$\Gamma^{(j)} = \sum_{i=1}^{n_j} \left| d_i \right| = \sum_{i=1}^{n_j} \left| p_i^{(c)} - \omega_i^{(j)} \right|$$

where nj is the cardinality of the set of modules in build j. The maximum value for Γ(j) will be 2.0 and its minimum will be 0.0. The minimum will be attained in the case where the cumulative test profile exactly matches the projected fault burden of the program modules. The maximum value will be attained when there is a complete mismatch between the cumulative test profile and the projected fault burden. This could happen, for example, if all of the test activity were concentrated in modules whose projected fault burden is zero, while the modules that carry the entire fault burden were never executed at all.
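
A short sketch of the Γ computation under the definitions above; the FI values and the cumulative profile are again invented for illustration.

    def fault_proportions(fi_values):
        """Projected fault burden omega_i: each module's share of the total FI."""
        total = sum(fi_values)
        return [rho / total for rho in fi_values]

    def gamma(cumulative_profile, fi_values):
        """Gamma(j): sum of absolute differences between the cumulative test
        profile and the projected fault burden (0.0 = perfect match, 2.0 = worst)."""
        omega = fault_proportions(fi_values)
        return sum(abs(p - w) for p, w in zip(cumulative_profile, omega))

    fi = [120.0, 95.0, 80.0, 105.0]
    p_c = [0.70, 0.10, 0.15, 0.05]
    print(round(gamma(p_c, fi), 3))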

One thing that is immediately apparent in the computation of these cumulative measures of testing is that they involve an incredible amount of data that must be accumulated over the evolving history of the system. It would be literally impossible for a single human being to begin to manage these data. This is one of the primary reasons that the test effort at most software development organizations is run strictly by gut feeling. Unless there is a suitable infrastructure in place to collect and manage the measurement data, the quantification of test outcomes will be impossible. Chapter 13 discusses a system for the management of these data.

One of our principal concerns in the test process is that the system should work when the customer uses it. It is not particularly relevant just how many faults are in the system when it is placed in the user's hands for the first time. The important thing is that the user does not execute the residual faults that are in the code. Remember that each user will distribute his or her activity on the set of operations according to an operational profile. Each operational profile will, in turn, generate a functional profile, which ultimately will create a distinct module profile for that operational profile. Our total test activity will be a success if we have tested and certified the software in a manner consistent with how it will be used. That is, p(c) = o, where o is the module profile induced by the user's operational profile. In other words, our test activity ultimately reflects the user operational profile. It would be a good idea for us to know something about the user operational profile before we ship our software.

11.4.3 Delta Testing

As the software evolution process progresses, new faults will likely be added to the system in proportion to the changes that have been made to the affected modules. This means that the distribution of faults in the code will change as the software system evolves. Our measurement data will show us this shifting distribution of faults, and thus which modules are most likely to contain faults at any point. The real problem will be to craft test cases that will expose these faults. Constructing these test cases will be very difficult if we do not know what the system is supposed to do or how it does what it is supposed to do. That is the reason we must insist on having good requirements traceability. In Chapter 9 we developed a system for mapping user operations to specific code modules, O × F × M. This means that we can identify either specific functionalities that exercise certain sets of modules for functional testing, or specific user operations that exercise certain modules for operational level testing.

The initial phase of effective functional testing of changed code is to identify the functionalities that will exercise the modules that have changed. Each of the functionalities thus designated will have an associated test suite designed to exercise that functionality. With this information it is now possible to describe the efficiency of a test from a mathematical/statistical perspective. A delta test is one specifically tailored to exercise the functionalities that will cause the changed modules to be executed. A delta test will be effective if it does a good job of exercising changed code. It is worth noting, however, that a delta test that is effective on one build may be ineffective on a subsequent build. Thus, the effectiveness of a delta test between any two builds i and j is given by:

$$\tau_{i,j}^{(k)} = \sum_{a=1}^{m} \chi_a^{i,j} p_a^{(k)}$$

where m represents the cardinality of the set of modules in build j as defined earlier and $\chi_a^{i,j}$ is the code churn of module a between builds i and j. In this case, $\tau_{i,j}^{(k)}$ is simply the expected value of the code churn under the profile p(k) of the kth test between builds i and j.

This concept of test effectiveness permits the numerical evaluation of a test against the actual changes that have been made to the software system. It is simply the expected value of the code churn, and hence of the exposure to newly introduced faults, from one build to another under a particular test. If the value of τ is large for a given test, then the test will have done a good job of exercising the changed modules. If the set of τ values for a given release is uniformly low, then it is reasonable to suppose that the changed modules have not been tested well in relation to the number of probable faults that were introduced during the maintenance changes.
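
The delta test effectiveness calculation is just another expected value, this time of code churn; the sketch below uses hypothetical churn and profile values.

    def delta_test_effectiveness(churn, profile):
        """tau: expected value of code churn between two builds under a
        test execution profile (larger means the test hit the changed code)."""
        return sum(chi * p for chi, p in zip(churn, profile))

    # Hypothetical churn between builds i and j, and one test's profile.
    churn_ij = [0.0, 12.5, 0.0, 3.2]
    profile_k = [0.60, 0.05, 0.30, 0.05]
    print(delta_test_effectiveness(churn_ij, profile_k))   # about 0.785: most activity missed the changes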

Given the nature of the code churn from one build to another, one simple fact emerges with great clarity. That is, there is no such thing as a standard delta test suite. Delta testing must be tailored to the changes made in each build. The functionalities that will be most impacted by the change process are those that use the modules that have been changed the most.

For practical purposes, we need to know something about the upper bound on test effectiveness: if we were to execute the best possible test, what would the value of test effectiveness be? A best delta test is one that will spend the majority of its time in the functionalities that contain the modules that have changed the most from one build to the next. Let

$$\chi^{i,j} = \sum_{a=1}^{m} \chi_a^{i,j}$$

This is the total code churn between builds i and j. To exercise each module in proportion to the change that has occurred in the module during its current revision, we will compute this proportion as follows:

$$b_a^{i,j} = \frac{\chi_a^{i,j}}{\chi^{i,j}}$$

This computation will yield a new hypothetical profile called the best profile. That is, if all modules were exercised in proportion to the amount of change they had received, we would then theoretically have maximized our exposure to software faults that may have been introduced.

Finally, we seek to develop a measure that will relate well to the difference between the actual profile generated by a test and the best profile. To this end, consider the term $\left| b_a^{i,j} - p_a^{(k)} \right|$, the absolute value of the difference between the best profile and the actual profile for test case k on module a. This value has a maximum of 1 and a minimum of 0; the minimum value will be achieved when the best and actual test profiles for the module are identical. A measure of the efficiency of a test (task or program) is:

$$100\left(1 - \frac{1}{2}\sum_{a=1}^{m} \left| b_a^{i,j} - p_a^{(k)} \right|\right)$$

This coverage value has a maximum of 100 percent when the best and the actual profiles are identical, and a value of 0 when there is a complete mismatch of profiles.
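
The sketch below derives the best profile from per-module churn and scores a test against it; the one-half normalization follows the efficiency form reconstructed above and is an assumption consistent with the stated 0 to 100 percent range.

    def best_profile(churn):
        """Best (hypothetical) profile: each module weighted by its share of churn."""
        total = sum(churn)
        return [chi / total for chi in churn]

    def test_efficiency(churn, profile):
        """Percentage agreement between the best profile and the actual test profile."""
        best = best_profile(churn)
        mismatch = sum(abs(b - p) for b, p in zip(best, profile))   # ranges from 0 to 2
        return 100.0 * (1.0 - mismatch / 2.0)

    churn_ij = [0.0, 12.5, 0.0, 3.2]
    profile_k = [0.60, 0.05, 0.30, 0.05]
    print(round(test_efficiency(churn_ij, profile_k), 1))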

In a rapidly evolving software system, there can be no standard suite of delta test cases. These delta test cases must be developed to ensure the operability of each functionality in direct response to the modules that implement that functionality.

11.4.4 Delta Test Results: A Case Study

The following discussion documents the results of executing 36 instrumented tasks on two sequential builds of a large embedded software system, which we will refer to as the RTJ system. The system is written in C++ and runs a large mass storage disk subsystem at a major supplier of disk storage subsystems. In this specific example, a module is a task. The perspective of this discussion is strictly from the standpoint of delta testing. That is, certain program modules have changed across the two sequential builds. The degree of this change is measured by code churn. As has been clearly demonstrated on the Cassini spacecraft project, the greater the change in a program module, the greater the likelihood that faults will have been introduced into the code by the change. [4] Each of the delta tests, then, should attempt to exercise these changed modules in proportion to the degree of change. If a changed module receives little or no activity during the test process, then we must assume that any latent faults in the module will be expressed when the software is placed into service.

All of the tasks in the RTJ system were instrumented with our Clic tool. This tool will permit us to count the frequency of execution of each module in each of the instrumented tasks and thus obtain the execution profiles for these tasks for each of the tests. In the C++ environment, the term module is applied to a method or a function. These are the source code elements that are reflected in the code that actually executes. Each task typically contains from 10 to 30 program modules. In this case, the granularity of measurement has been ratcheted up from the module level to the task level.

The execution profiles show the distribution of activity in each module of the instrumented tasks. For each of the modules, the code churn measure was computed. For the purposes of this investigation, the FI distribution was set to a mean of 50 and a standard deviation of 10. The code churn values for each module reflected the degree of change of the modules during the most recent sequence of builds. The cumulative churn values for all tasks are shown in the second column of Exhibit 1. A churn value of zero indicates that the module in question received no changes during the last build sequence. A large churn value (>30) indicates that the module in question received substantial changes.

Exhibit 1: Test Summary by Task

Task  Churn    Best Profile  Actual Profile  Profile Difference  Max. Test Effectiveness  Actual Test Effectiveness
A     2028.31  5.96E-01      1.94E-02        5.77E-01            1208.26                  39.29
B      487.26  1.43E-01      7.97E-03        1.35E-01              69.73                   3.88
C      154.72  4.54E-02      8.94E-04        4.45E-02               7.03                   0.14
D      150.77  4.43E-02      2.71E-01        2.27E-01               6.67                  40.89
E      150.71  4.43E-02      2.67E-03        4.16E-02               6.67                   0.40
F      126.46  3.71E-02      3.17E-03        3.39E-02               4.69                   0.40
G      121.00  3.55E-02      2.79E-03        3.27E-02               4.29                   0.34
H      117.20  3.44E-02      8.15E-04        3.36E-02               4.03                   0.10
I       14.38  4.23E-03      2.96E-04        3.93E-03               0.06                   0.01
J        9.11  2.68E-03      4.97E-05        2.63E-03               0.02                   0.00
K        6.84  2.01E-03      4.99E-01        4.97E-01               0.01                   3.42
L        6.27  1.84E-03      4.42E-05        1.80E-03               0.01                   0.00
M        5.64  1.66E-03      2.83E-03        1.17E-03               0.01                   0.02
N        5.17  1.52E-03      7.46E-05        1.45E-03               0.01                   0.00
O        3.92  1.15E-03      1.47E-04        1.00E-03               0.01                   0.00
P        3.90  1.15E-03      2.27E-03        1.12E-03               0.00                   0.01
Q        3.20  9.42E-04      1.72E-01        1.71E-01               0.00                   0.55
R        2.28  6.70E-04      2.12E-06        6.68E-04               0.00                   0.00
S        1.85  5.44E-04      7.07E-04        1.63E-04               0.00                   0.00
T        1.84  5.42E-04      6.40E-05        4.78E-04               0.00                   0.00
U        1.19  3.52E-04      4.48E-04        9.60E-05               0.00                   0.00
V        0.84  2.49E-04      8.63E-04        6.14E-04               0.00                   0.00
W        0.68  2.02E-04      1.17E-04        8.50E-05               0.00                   0.00
X        0.54  1.60E-04      3.81E-03        3.65E-03               0.00                   0.00
Y        0.26  7.82E-05      3.50E-03        3.42E-03               0.00                   0.00
Z        0.22  6.75E-05      1.86E-04        1.19E-04               0.00                   0.00
AA       0.09  2.83E-05      6.34E-05        3.51E-05               0.00                   0.00
AB       0.08  2.61E-05      1.55E-05        1.06E-05               0.00                   0.00
AC       0.04  1.30E-05      6.54E-07        1.23E-05               0.00                   0.00
AD       0.00  0.00E+00      1.09E-05        1.09E-05               0.00                   0.00
AE       0.00  0.00E+00      3.75E-06        3.75E-06               0.00                   0.00
AF       0.00  0.00E+00      3.71E-03        3.71E-03               0.00                   0.00
AG       0.00  0.00E+00      3.91E-07        3.91E-07               0.00                   0.00
AH       0.00  0.00E+00      2.15E-04        2.15E-04               0.00                   0.00
AI       0.00  0.00E+00      4.57E-07        4.57E-07               0.00                   0.00
AJ       0.00  0.00E+00      2.85E-06        2.85E-06               0.00                   0.00

Total  3404.77                                                   1311.50                  89.44
Test entropy (actual profile): 1.87
Test efficiency: 9.04%
Test effectiveness: 6.82%

For the subsequent analysis, two profile values for each test will be compared. The actual profile is the actual execution profile for each test. The best profile is the best hypothetical execution profile given that each module would be tested directly in proportion to its churn value. That is, under the best profile, a module whose churn value is zero would receive little or no activity during the regression test process.

From Exhibit 1 we can see that the A and B tasks have received the greatest change activity. The total churn values were 2028.31 and 487.26 respectively. The code churn values were used to establish the Best Profile column. The Profile Difference column represents the difference between the theoretical best profile and the actual test activity reflected by the actual profile. From the profile difference we can derive the test efficiency for this test, which is 9.04 percent. From the Actual Profile column we can see that the test entropy was 1.87. Maximum entropy for the 36 modules in a test that would exercise each of the modules equally would be 5.17.

The last two columns in Exhibit 1 contain the expected value for the code churn of the task under the best profile and also under the actual profile. These columns are labeled Maximum Test Effectiveness and Actual Test Effectiveness. The maximum test effectiveness with the code churn introduced between the builds being measured under the best profile is 1311.50. The actual test effectiveness for all tasks was measured at 89.44. In percentage terms, the actual test effectiveness of this test was 6.82 percent of a theoretical maximum exposure to software faults.
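
To make the arithmetic behind the Exhibit 1 summary concrete, this short sketch recomputes the percentage figures from the published totals; only numbers quoted in the text and the exhibit are used.

    import math

    # Totals reported in Exhibit 1 for the RTJ delta test.
    max_effectiveness = 1311.50     # expected churn under the best profile
    actual_effectiveness = 89.44    # expected churn under the actual profile
    print(round(100 * actual_effectiveness / max_effectiveness, 2))   # ~6.82 percent

    print(round(math.log2(36), 2))  # maximum possible entropy for 36 tasks, ~5.17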

All the data point to the same problem: the tests spent a disproportionate amount of time in modules that had not substantially changed during this build interval. In essence, the bulk of the test resources is being devoted to modules that have not changed and have already been thoroughly tested. We can see this clearly if we plot the differences between the best profile and the actual profile. This difference is shown in Exhibit 2 for tasks A through AJ. We can see that considerable test effort was invested in tasks D, K, and Q, each of which had relatively little change activity; this is shown by the fact that the difference is negative for these tasks.

Exhibit 2: Actual versus Best Test Profile Activity

[Figure: difference between the best and actual test profiles for tasks A through AJ.]

Exhibit 3 summarizes the performance of the best 24 of a total suite of 115 instrumented tests. Only those tests whose test efficiencies exceeded 10 percent of the theoretical maximum are shown. Again, the test efficiencies shown in this exhibit were computed from the difference between the actual profile and the best profile for that test. It is clear that, even in the best circumstances, not all tests will exercise all modules. Therefore, the test efficiency was computed only for those modules whose functionality was included in the test. From a delta test perspective, we now know that we have a testing problem: none of these tests does a really good job of executing the code most likely to contain the newly introduced faults. Furthermore, we now have a clear indication as to how to fix the problem.

Exhibit 3: Individual Test Summaries

Test No.  Test Efficiency    Test No.  Test Efficiency
28        20.6               177       11.7
18        19.0               31        11.6
14        18.2               3         11.5
12        16.9               167       11.5
47        14.8               59a       11.4
49        14.8               2         11.3
169       14.7               159       11.3
156       13.2               1         10.9
20        13.1               38        10.8
39        12.9               180       10.7
9         12.2               33        10.6
158       12.2               137       10.2

In a second investigation at this same organization, two major measurement initiatives were established, this time for a new system we will refer to as the QTB system. This is a real-time embedded system consisting of 120 modules of C++ code. During the test process whose activities are reported here, the QTB system was driven through five sequential test suites on four successive builds of the system, designated builds 10, 11, 12, and 13. For this sequence of builds, build 10 was chosen as the baseline build, and all of the metrics for this test series were baselined on it.

Each of the five test suites was designed to exercise a different set of functionalities. Each of the functionalities, in turn, invoked a particular subset of program modules. Thus, as each of the tests executed, only a limited number of the total program modules received execution control. For Test 1, only 49 of the 120 modules in the QTB task were executed. For Test 2, only 25 of the 120 modules were executed. Of the 120 modules, only 16 were executed in all five tests.

The specific test outcomes for Tests 1 and 2 are shown in Exhibits 4 and 5, respectively. These two tests were chosen for discussion because their characteristics are very different. The functionalities invoked by Test 1 caused 49 modules to be executed; Test 2, on the other hand, executed only 25 program modules. The metrics for each of the program modules in Tests 1 and 2 were baselined on build 10.

Exhibit 4: Test Results for Test 1

Module  ρ10      p(1)   φ(1)   χ10,11  τ10,11  χ11,12  τ11,12  χ12,13  τ12,13
1       44.23    0.023  1.01   0.00    0.00    0.00    0.00    0.00    0.00
2       48.41    0.002  0.09   0.00    0.00    0.00    0.00    0.00    0.00
3       49.05    0.004  0.19   0.00    0.00    0.00    0.00    0.00    0.00
4       56.30    0.059  3.33   0.00    0.00    0.04    0.00    0.00    0.00
7       54.93    0.004  0.21   0.00    0.00    0.00    0.00    0.00    0.00
8       54.25    0.017  0.93   0.00    0.00    0.00    0.00    0.00    0.00
9       51.48    0.002  0.10   0.00    0.00    0.00    0.00    0.00    0.00
10      50.75    0.017  0.87   0.00    0.00    0.00    0.00    0.00    0.00
12      61.25    0.004  0.23   1.99    0.01    0.00    0.00    1.18    0.00
13      53.12    0.017  0.91   0.57    0.01    0.00    0.00    0.00    0.00
14      51.10    0.013  0.68   0.00    0.00    0.00    0.00    0.00    0.00
15      49.08    0.002  0.09   0.00    0.00    0.00    0.00    0.00    0.00
16      49.92    0.055  2.76   0.00    0.00    0.00    0.00    0.00    0.00
17      55.08    0.004  0.21   0.00    0.00    0.00    0.00    0.00    0.00
18      47.32    0.048  2.26   0.00    0.00    0.00    0.00    0.00    0.00
19      54.14    0.006  0.31   0.00    0.00    0.00    0.00    0.00    0.00
21      44.47    0.044  1.95   0.00    0.00    0.00    0.00    1.13    0.05
22      49.12    0.004  0.19   0.00    0.00    0.00    0.00    0.00    0.00
23      56.82    0.004  0.22   0.00    0.00    0.00    0.00    0.00    0.00
24      57.30    0.021  1.20   0.00    0.00    0.00    0.00    0.00    0.00
25      53.98    0.017  0.93   0.00    0.00    3.13    0.05    1.22    0.02
26      36.52    0.004  0.14   0.00    0.00    0.00    0.00    0.00    0.00
31      50.90    0.017  0.87   0.00    0.00    0.00    0.00    0.00    0.00
52      46.66    0.008  0.36   0.00    0.00    0.00    0.00    0.00    0.00
53      46.11    0.010  0.44   0.00    0.00    0.00    0.00    0.00    0.00
54      46.11    0.011  0.53   0.00    0.00    0.00    0.00    0.00    0.00
56      54.87    0.006  0.31   0.00    0.00    0.00    0.00    0.63    0.00
57      46.74    0.004  0.18   2.66    0.01    0.00    0.00    0.00    0.00
60      58.28    0.004  0.22   0.00    0.00    0.07    0.00    0.07    0.00
63      50.43    0.002  0.10   0.00    0.00    0.00    0.00    0.00    0.00
64      70.82    0.053  3.78   0.00    0.00    0.00    0.00    0.00    0.00
71      55.48    0.128  7.09   0.00    0.00    0.00    0.00    0.00    0.00
75      55.29    0.044  2.43   0.00    0.00    0.00    0.00    0.00    0.00
76      50.99    0.002  0.10   0.00    0.00    0.00    0.00    0.00    0.00
77      52.55    0.002  0.10   0.62    0.00    0.00    0.00    0.00    0.00
80      56.82    0.021  1.19   0.61    0.01    0.00    0.00    1.89    0.04
95      54.93    0.004  0.21   0.00    0.00    0.00    0.00    0.00    0.00
96      58.55    0.006  0.34   0.00    0.00    0.00    0.00    0.00    0.00
97      36.38    0.006  0.21   0.00    0.00    0.00    0.00    0.00    0.00
98      56.44    0.013  0.75   0.00    0.00    0.07    0.00    0.07    0.00
100     57.46    0.002  0.11   0.00    0.00    0.00    0.00    0.00    0.00
103     47.54    0.025  1.18   0.00    0.00    0.00    0.00    0.00    0.00
106     75.43    0.002  0.14   0.00    0.00    0.00    0.00    0.34    0.00
107     50.74    0.006  0.29   0.00    0.00    0.00    0.00    0.00    0.00
110     59.45    0.002  0.11   0.00    0.00    0.00    0.00    0.00    0.00
111     68.39    0.004  0.26   0.00    0.00    0.00    0.00    0.00    0.00
114     49.94    0.193  9.63   0.00    0.00    0.02    0.00    0.02    0.00
115     46.32    0.017  0.80   1.75    0.03    0.76    0.01    0.00    0.00
118     52.86    0.040  2.12   0.00    0.00    0.00    0.00    0.00    0.00

Total         2585.10         52.68   8.20    0.07    4.10    0.07    6.54    0.12
System total  6014.15                 67.24           77.52           124.32

Exhibit 5: Test Results for Test 2

Module  ρ10      p(2)   φ(2)   χ10,11  τ10,11  χ11,12  τ11,12  χ12,13  τ12,13
3       49.05    0.010  0.48   0.00    0.00    0.00    0.00    0.00    0.00
4       56.30    0.039  2.19   0.00    0.00    0.04    0.00    0.00    0.00
6       48.48    0.039  1.88   0.00    0.00    0.00    0.00    0.00    0.00
14      51.10    0.019  0.99   0.00    0.00    0.00    0.00    0.00    0.00
18      47.32    0.029  1.38   0.00    0.00    0.00    0.00    0.00    0.00
20      47.68    0.019  0.93   0.00    0.00    0.00    0.00    0.00    0.00
21      44.47    0.049  2.16   0.00    0.00    0.00    0.00    1.13    0.05
23      56.82    0.058  3.31   0.00    0.00    0.00    0.00    0.00    0.00
24      57.30    0.019  1.11   0.00    0.00    0.00    0.00    0.00    0.00
26      36.52    0.010  0.35   0.00    0.00    0.00    0.00    0.00    0.00
37      53.51    0.039  2.08   0.00    0.00    0.07    0.00    0.07    0.00
64      70.82    0.136  9.63   0.00    0.00    0.00    0.00    0.00    0.00
66      61.98    0.097  6.02   0.00    0.00    0.02    0.00    0.02    0.00
69      48.46    0.019  0.94   0.00    0.00    0.13    0.00    0.13    0.00
71      55.48    0.097  5.39   0.00    0.00    0.00    0.00    0.00    0.00
77      52.55    0.029  1.53   0.62    0.02    0.00    0.00    0.00    0.00
79      57.34    0.019  1.11   0.00    0.00    0.07    0.00    0.07    0.00
80      56.82    0.058  3.31   0.61    0.04    0.00    0.00    1.89    0.11
83      73.44    0.010  0.71   0.39    0.00    0.00    0.00    0.00    0.00
95      54.93    0.019  1.07   0.00    0.00    0.00    0.00    0.00    0.00
96      58.55    0.019  1.14   0.00    0.00    0.00    0.00    0.00    0.00
103     47.54    0.019  0.92   0.00    0.00    0.00    0.00    0.00    0.00
109     49.05    0.010  0.48   0.00    0.00    0.61    0.01    0.61    0.01
114     49.94    0.117  5.82   0.00    0.00    0.02    0.00    0.02    0.00
118     52.86    0.019  1.03   0.00    0.00    0.00    0.00    0.00    0.00

Totals        1338.32         55.94   1.62    0.06    0.96    0.01    3.93    0.18

The second column of Exhibits 4 and 5 displays the FI value, relative to build 10, for each of the modules that received at least one execution in each of the tests. The total system FI for all 120 modules is 6014. The total FI for all the modules that executed in Test 1 is 2585, or 43 percent of the total system FI. Another way to look at this value is that Test 1 will have exposure to about 43 percent of the latent faults in the system. Test 2 similarly provided exposure for only about 22 percent of the latent faults.

The third column of Exhibit 4, labeled p(1), contains the actual profile data for each module under Test 1. The fourth column, φ(1), is the product of the FI and the profile value. This represents the fault exposure capability of Test 1 on the baseline system, and the total value for this column is the total fault exposure for the test. It is the expected value of FI under the profile p(1). If all of the modules in the entire system had been executed with equal frequency, this value would have been 50. A value less than 50 means that the test had less than average exposure to latent faults. In this case φ = 52.68, slightly better than average fault exposure. On Test 2, the results of which are shown in Exhibit 5, the expected fault exposure is 55.94, a slightly better fault exposure still.
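
As a check on the scale of these numbers, the expected FI under a uniform profile is simply the mean FI of the system; using the system total from Exhibit 4:

$$\varphi_{\text{uniform}} = \sum_{j=1}^{120} \rho_j \cdot \frac{1}{120} = \frac{6014.15}{120} \approx 50.1$$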

The fifth column of Exhibit 4, χ10,11, contains the code churn value for the changes between build 10 and build 11. For most of the modules in this exhibit, this value is zero: they did not change from build 10 to build 11. The greatest change was in a new program module that was not executed by this test. A good delta test for the changes that were made to this module would cause it to be executed with high probability; this module did not execute at all. A delta test suite will be effective with respect to maximizing our exposure to potentially introduced faults if the product of the profile and the churn value is large. The test effectiveness for Test 1 is shown in the column labeled τ10,11. The upper bound on test effectiveness will be achieved in the case where the module with the greatest change is executed exclusively. The maximum effectiveness for this particular test will be 67.24, the code churn value for the module with the greatest churn value. We can see that the total effectiveness for Test 1 on build 11 is 0.07, or less than 1 percent of the maximum. The efficiency of this test is less than 4 percent. This is neither an effective nor an efficient test of the changes made between builds 10 and 11.

The next four columns of Exhibit 4 contain the code churn values and the test effectiveness values for Test 1 on builds 12 and 13, respectively. We can see that the absolute effectiveness of Test 1 on builds 12 and 13 does, in fact, increase. However, when we look at the effectiveness of each application of the test as a percentage of the maximum, we can see that the test effectiveness really declines rather than increases. The test efficiency of Test 1 was never above 4 percent across the three build deltas.

The last row of Exhibit 4 contains the total system values for system FI and for the total code churn of the three successive build events. We can see, for example, that the total churn for the build 10-11 sequence is 67.24. Test 1 executed only modules whose total churn was 8.2, or 12 percent of that value. We can also see that the greatest change occurred between builds 12 and 13, where the total system churn was 124.32.

Similar data are reported for Test 2 in Exhibit 5. Test 2 is a very different test from Test 1. We can see from the fourth column of Exhibit 5 that the fault exposure φ of the modules actually invoked during Test 2 is somewhat greater than that of Test 1: 55.94 versus 52.68. In that sense, Test 2 will be more likely to expose latent faults in the base code than will Test 1. Test 2 is, however, a much weaker test of the changed code among the set of modules that it exercises; its test effectiveness is very low. What is interesting to note in Exhibit 5 is that, while Test 2 is not effective for the changes made from build 10 to 11 or from build 11 to 12, it is more effective for the changes made to the system between builds 12 and 13.

We are now in a position to use the summary data to evaluate test outcomes for other test applications. In Exhibit 6 the test results for all five tests are summarized. Each of the five test activities represented in this table is summarized by two numbers for each build pair. The first number, in the column labeled χ%, represents the percentage of the total code churn found in the modules that the test actually executed. Based on this value, we would have to conclude that none of the tests did a sterling job as a delta test: these tests simply did not execute the right modules, those that had changed the most.

Exhibit 6: Test Results Summary for Tests 1 through 5

        Build 10-11        Build 11-12        Build 12-13
Test    χ%10,11  τ10,11    χ%11,12  τ11,12    χ%12,13  τ12,13
1       12       0.07      5        0.07      5        0.12
2        2       0.06      1        0.01      3        0.18
3        6       0.07      6        0.08      5        0.10
4        2       0.04      0        0.00      2        0.09
5       11       0.07      4        0.05      4        0.07

The second global test evaluation criterion is τ, or test effectiveness. None of the tests could have been considered to be really effective by this criterion. Even among the modules that were executed that had changed during the successive builds, these tests did not distribute their activity well.

Unless we are able to quantify test outcomes, our impressions of the testing process will likely be wrong. The five tests whose outcomes we evaluated in this study were, in fact, part of a "standard regression test" suite. They are routinely run between builds. We can see that they routinely do not do a good job of identifying and exercising modules that might well have had new faults introduced into them.

When a program is subjected to numerous test suites that exercise differing aspects of its functionality, the test risk of the system will vary greatly across these test suites. Intuitively, and empirically, a program that spends a high proportion of its time executing a set of modules with high FI will be more failure prone than one driven to execute program modules with low FI values. Thus, we need to identify the characteristics of test scenarios that cause our criterion measures χ and τ to be large.

[1]Munson, J.C., Software Faults, Software Failures, and Software Reliability Modeling, Information and Software Technology, December 1996.

[2]Munson, J.C., Dynamic Program Complexity and Software Testing, Proceedings of the IEEE International Test Conference, Washington, D.C., October 1996.

[3]Munson, J.C. and Elbaum, S.G., A Measurement Methodology for the Evaluation of Regression Tests, Proceedings of the 1998 IEEE International Conference on Software Maintenance, IEEE Computer Society Press, pp. 24-33.

[4]Nikora, A.P., Software System Defect Content Prediction from Development Process and Product Characteristics, Ph.D. thesis, University of Southern California, Los Angeles, January 1998.


