One of the main problems encountered in working with raw metric data is that they are just that: data. There is very little information content in raw metric data. Take, for example, the 12 raw metrics obtained from one particular build of the PASS system. A sample of data from 20 program modules is shown in Exhibit 13. It would be difficult, if not impossible, to draw any useful conclusions about, for example, module 6 in relation to module 9, or any other module for that matter. If we were measuring a system of 10,000 modules on 30 metrics, the problem would be even worse.
Exhibit 13: Raw Metric Values for 20 PASS Modules
Module | η1 | η2 | N1 | N2 | Exec | LOC | Nodes | Edges | Paths | Cycles | Maxpath | Avepath |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 45 | 282 | 1335 | 763 | 339 | 841 | 114 | 131 | 14652 | 13 | 168 | 136 |
2 | 14 | 13 | 43 | 18 | 11 | 22 | 8 | 9 | 4 | 0 | 7 | 6 |
3 | 47 | 259 | 2542 | 1154 | 559 | 1027 | 103 | 129 | 5001 | 4 | 143 | 129 |
4 | 3 | 5 | 4 | 5 | 1 | 399 | 14 | 10 | 4 | 0 | 10 | 10 |
5 | 34 | 117 | 867 | 371 | 212 | 248 | 78 | 103 | 23892 | 3 | 87 | 69 |
6 | 35 | 157 | 1040 | 484 | 226 | 475 | 129 | 168 | 13512 | 4 | 112 | 97 |
7 | 12 | 28 | 47 | 34 | 13 | 114 | 14 | 10 | 4 | 0 | 10 | 10 |
8 | 45 | 331 | 3493 | 1760 | 733 | 1451 | 405 | 531 | 3129 | 14 | 441 | 429 |
9 | 42 | 221 | 1365 | 667 | 377 | 740 | 235 | 310 | 50004 | 11 | 134 | 118 |
10 | 26 | 62 | 274 | 109 | 46 | 69 | 24 | 31 | 46 | 0 | 22 | 17 |
11 | 11 | 22 | 26 | 22 | 9 | 283 | 14 | 10 | 4 | 0 | 10 | 10 |
12 | 38 | 154 | 836 | 427 | 203 | 286 | 117 | 156 | 50001 | 7 | 109 | 87 |
13 | 24 | 48 | 289 | 145 | 86 | 92 | 51 | 68 | 16472 | 5 | 64 | 47 |
14 | 37 | 82 | 321 | 177 | 70 | 197 | 34 | 35 | 40 | 1 | 32 | 25 |
15 | 23 | 69 | 361 | 167 | 67 | 81 | 25 | 31 | 377 | 3 | 43 | 32 |
16 | 23 | 56 | 212 | 111 | 64 | 81 | 26 | 33 | 637 | 2 | 45 | 34 |
17 | 20 | 29 | 85 | 47 | 20 | 24 | 16 | 19 | 13 | 1 | 20 | 14 |
18 | 35 | 191 | 1189 | 569 | 311 | 431 | 174 | 244 | 2231 | 6 | 150 | 137 |
19 | 33 | 144 | 774 | 362 | 194 | 597 | 122 | 158 | 210 | 1 | 40 | 27 |
20 | 25 | 51 | 205 | 101 | 60 | 79 | 31 | 39 | 444 | 1 | 36 | 30 |
x̄ | 16 | 50 | 300 | 154 | 70 | 138 | 34 | 44 | 7301 | 1 | 30 | 25 |
s | 13 | 70 | 594 | 290 | 131 | 229 | 55 | 74 | 16668 | 3 | 47 | 43 |
We would like to convert the data shown in Exhibit 13 to information. The first step in this process is to understand each module in the context of the larger system. The last two rows of the table contain the means and standard deviations, respectively, of each metric across the entire software system. We can now look at module 6 and see that it has more than the average number of paths: module 6 has 13,512 paths, whereas the average module in this system has 7,301.
We can increase our understanding of the data represented in Exhibit 13 by converting the raw metric values to z-scores. The corresponding z-scores for each of the program modules are shown in Exhibit 14. Now our resolution on the data begins to improve. From this new perspective, module 6 is rather different from the majority of the other program modules. With the possible exception of path and cycle complexity, module 6 is at least one standard deviation above the mean on every metric. Module 9 is another module that is strikingly greater than average on most attributes. Module 7, on the other hand, has negative z-scores throughout; it is below average on essentially every program attribute.
Exhibit 14: z-Scores for Raw Metric Values
Module | η1 | η2 | N1 | N2 | Exec | LOC | Nodes | Edges | Paths | Cycles | Maxpath | Avepath |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2.21 | 3.31 | 1.74 | 2.10 | 2.06 | 3.06 | 1.46 | 1.17 | 0.44 | 3.94 | 2.94 | 2.61 |
2 | -0.12 | -0.53 | -0.43 | -0.47 | -0.45 | -0.51 | -0.48 | -0.47 | -0.44 | -0.40 | -0.50 | -0.45 |
3 | 2.36 | 2.98 | 3.78 | 3.45 | 3.74 | 3.87 | 1.26 | 1.14 | -0.14 | 0.94 | 2.40 | 2.45 |
4 | -0.94 | -0.65 | -0.50 | -0.52 | -0.53 | 1.14 | -0.37 | -0.46 | -0.44 | -0.40 | -0.43 | -0.36 |
5 | 1.38 | 0.95 | 0.96 | 0.75 | 1.09 | 0.48 | 0.80 | 0.79 | 1.00 | 0.61 | 1.21 | 1.02 |
6 | 1.46 | 1.52 | 1.25 | 1.14 | 1.20 | 1.47 | 1.73 | 1.67 | 0.37 | 0.94 | 1.74 | 1.68 |
7 | -0.27 | -0.32 | -0.43 | -0.42 | -0.43 | -0.11 | -0.37 | -0.46 | -0.44 | -0.40 | -0.43 | -0.36 |
8 | 2.21 | 4.01 | 5.38 | 5.54 | 5.07 | 5.72 | 6.79 | 6.55 | -0.25 | 4.28 | 8.76 | 9.50 |
9 | 1.98 | 2.44 | 1.79 | 1.77 | 2.35 | 2.62 | 3.67 | 3.58 | 2.56 | 3.28 | 2.21 | 2.19 |
10 | 0.78 | 0.17 | -0.04 | -0.16 | -0.18 | -0.30 | -0.19 | -0.17 | -0.44 | -0.40 | -0.18 | -0.18 |
11 | -0.34 | -0.41 | -0.46 | -0.46 | -0.46 | 0.63 | -0.37 | -0.46 | -0.44 | -0.40 | -0.43 | -0.36 |
12 | 1.68 | 1.48 | 0.90 | 0.94 | 1.02 | 0.64 | 1.51 | 1.51 | 2.56 | 1.94 | 1.68 | 1.46 |
13 | 0.63 | -0.03 | -0.02 | -0.03 | 0.12 | -0.20 | 0.30 | 0.32 | 0.55 | 1.27 | 0.72 | 0.51 |
14 | 1.61 | 0.45 | 0.04 | 0.08 | 0.00 | 0.26 | -0.01 | -0.12 | -0.44 | -0.06 | 0.04 | -0.01 |
15 | 0.56 | 0.27 | 0.10 | 0.04 | -0.02 | -0.25 | -0.17 | -0.17 | -0.42 | 0.61 | 0.27 | 0.17 |
16 | 0.56 | 0.08 | -0.15 | -0.15 | -0.04 | -0.25 | -0.15 | -0.15 | -0.40 | 0.27 | 0.31 | 0.19 |
17 | 0.33 | -0.31 | -0.36 | -0.37 | -0.38 | -0.50 | -0.34 | -0.33 | -0.44 | -0.06 | -0.22 | -0.26 |
18 | 1.46 | 2.01 | 1.50 | 1.43 | 1.85 | 1.28 | 2.56 | 2.69 | -0.30 | 1.61 | 2.55 | 2.62 |
19 | 1.31 | 1.34 | 0.80 | 0.72 | 0.95 | 2.00 | 1.60 | 1.53 | -0.43 | -0.06 | 0.21 | 0.04 |
20 | 0.71 | 0.01 | -0.16 | -0.18 | -0.07 | -0.26 | -0.06 | -0.07 | -0.41 | -0.06 | 0.12 | 0.12 |
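The standardization that takes Exhibit 13 into Exhibit 14 can be sketched with NumPy. The snippet below uses only a three-module, three-metric subset of the exhibit for brevity, so the means and standard deviations (and therefore the z-scores) will not match the published table, which standardizes against all 20 modules; the mechanics are the same.

```python
import numpy as np

# Raw metric values for a small subset of the PASS modules
# (columns: Exec, LOC, Paths; rows: modules 6, 7, 9 from Exhibit 13).
raw = np.array([
    [226, 475, 13512],   # module 6
    [ 13, 114,     4],   # module 7
    [377, 740, 50004],   # module 9
], dtype=float)

# z-score each metric (column) against its own sample mean and
# standard deviation, as Exhibit 14 does for the values of Exhibit 13.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

print(z.round(2))
```

Note the `ddof=1`: the exhibit's `s` row is a sample standard deviation, so the same divisor is used here.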
We discovered with the PCA of the 12 metrics listed in Exhibits 13 and 14 that there are only two distinct sources of variation. We would like to transform the 12 raw metric values to their corresponding equivalents in the two new metric domains. Fortunately, the PCA technique produces a set of coefficients that will send the 12 metric z-scores shown in Exhibit 14 into the two new metric domains. The transformation matrix for 12 metrics on the PASS data is shown in Exhibit 15.
Exhibit 15: Transformation Matrix for z-Scores
Metric | Size | Control |
---|---|---|
η1 | 0.14 | -0.03 |
η2 | 0.22 | -0.08 |
N1 | 0.26 | -0.13 |
N2 | 0.26 | -0.13 |
Exec | 0.24 | -0.10 |
LOC | 0.20 | -0.06 |
Nodes | -0.03 | 0.19 |
Edges | -0.04 | 0.20 |
Paths | -0.04 | 0.17 |
Cycles | -0.18 | 0.31 |
Maxpath | -0.10 | 0.26 |
Avepath | -0.11 | 0.27 |
The z-scores for the PASS sample data are shown in Exhibit 14. This is a 20 × 12 matrix. When it is post-multiplied by the 12 × 2 matrix of coefficients shown in Exhibit 15, the result is a 20 × 2 matrix of factor scores, which we will call domain scores, for each of the 20 program modules. This product matrix of domain scores is shown in Exhibit 16. Each domain score has a mean of 0 and a standard deviation of 1, just the same as the raw z-scores. We have now reduced the data of Exhibit 13 to information. The program module that exhibits the largest size attribute is module 3. The most complex module from a control perspective is module 8.
Exhibit 16: Domain Scores for the PASS Data
Module | Size | Control |
---|---|---|
1 | 1.74 | 2.06 |
2 | -0.36 | -0.40 |
3 | 3.78 | 0.17 |
4 | -0.24 | -0.37 |
5 | 0.76 | 0.78 |
6 | 1.08 | 1.24 |
7 | -0.25 | -0.38 |
8 | 3.25 | 6.07 |
9 | 1.42 | 2.92 |
10 | 0.13 | -0.34 |
11 | -0.16 | -0.40 |
12 | 0.53 | 1.91 |
13 | -0.34 | 0.92 |
14 | 0.44 | -0.23 |
15 | -0.01 | 0.12 |
16 | -0.12 | 0.13 |
17 | -0.30 | -0.17 |
18 | 1.10 | 2.00 |
19 | 1.39 | 0.03 |
20 | -0.05 | -0.01 |
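The post-multiplication can be reproduced directly from the exhibits. As a sketch (NumPy assumed), the code below takes the Exhibit 14 z-score rows for modules 1, 8, and 9 and the full Exhibit 15 coefficient matrix; because the published coefficients are rounded to two decimals, the resulting domain scores agree with Exhibit 16 only to within rounding.

```python
import numpy as np

# z-scores on all 12 metrics for three modules (rows of Exhibit 14).
Z = np.array([
    [2.21, 3.31, 1.74, 2.10, 2.06, 3.06, 1.46, 1.17,  0.44, 3.94, 2.94, 2.61],  # module 1
    [2.21, 4.01, 5.38, 5.54, 5.07, 5.72, 6.79, 6.55, -0.25, 4.28, 8.76, 9.50],  # module 8
    [1.98, 2.44, 1.79, 1.77, 2.35, 2.62, 3.67, 3.58,  2.56, 3.28, 2.21, 2.19],  # module 9
])

# Transformation matrix from Exhibit 15: one row per metric,
# one column per domain (Size, Control).
T = np.array([
    [ 0.14, -0.03],   # eta1
    [ 0.22, -0.08],   # eta2
    [ 0.26, -0.13],   # N1
    [ 0.26, -0.13],   # N2
    [ 0.24, -0.10],   # Exec
    [ 0.20, -0.06],   # LOC
    [-0.03,  0.19],   # Nodes
    [-0.04,  0.20],   # Edges
    [-0.04,  0.17],   # Paths
    [-0.18,  0.31],   # Cycles
    [-0.10,  0.26],   # Maxpath
    [-0.11,  0.27],   # Avepath
])

# Post-multiply: (3 x 12) @ (12 x 2) -> (3 x 2) matrix of domain scores.
D = Z @ T
print(D.round(2))
```

Row 1 of `D` reproduces the Exhibit 16 entry for module 1 (Size 1.74, Control 2.06) to within the rounding of the published coefficients.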
The size and control domain scores shown in Exhibit 16 both have a mean of 0 and a standard deviation of 1. We can therefore add them to create a new composite metric. This new metric is shown in the fourth column (Sum) of Exhibit 17; it is essentially a composite score of each program module on size and control complexity. Exhibit 17 has also been sorted by this new sum, and a new picture of the distribution of module complexity clearly emerges. Module 8 is, by far, the most complex of the 20 sample modules. If we have been very careful in our selection of metrics to include only those that are distinctly related to software faults, then the domain scores represented by Exhibit 17 are particularly relevant. A large domain score on the control domain, as is the case with module 8, indicates a real proclivity on the part of those who wrote module 8 to introduce control faults into that module.
Exhibit 17: Sorted Domain Scores for the PASS Data
Module | Size | Control | Sum |
---|---|---|---|
8 | 3.25 | 6.07 | 9.31 |
9 | 1.42 | 2.92 | 4.34 |
3 | 3.78 | 0.17 | 3.95 |
1 | 1.74 | 2.06 | 3.80 |
18 | 1.10 | 2.00 | 3.10 |
12 | 0.53 | 1.91 | 2.44 |
6 | 1.08 | 1.24 | 2.32 |
5 | 0.76 | 0.78 | 1.55 |
19 | 1.39 | 0.03 | 1.42 |
13 | -0.34 | 0.92 | 0.58 |
14 | 0.44 | -0.23 | 0.21 |
15 | -0.01 | 0.12 | 0.12 |
16 | -0.12 | 0.13 | 0.01 |
20 | -0.05 | -0.01 | -0.06 |
10 | 0.13 | -0.34 | -0.21 |
17 | -0.30 | -0.17 | -0.47 |
11 | -0.16 | -0.40 | -0.56 |
4 | -0.24 | -0.37 | -0.61 |
7 | -0.25 | -0.38 | -0.63 |
2 | -0.36 | -0.40 | -0.75 |
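The summing and sorting behind Exhibit 17 can be sketched as follows, using a handful of the Exhibit 16 rows; small differences from the published Sum column come from the rounding of the domain scores themselves.

```python
# Size and control domain scores from Exhibit 16, keyed by module number
# (a six-module subset for illustration).
scores = {
    8: (3.25, 6.07), 9: (1.42, 2.92), 3: (3.78, 0.17),
    1: (1.74, 2.06), 7: (-0.25, -0.38), 2: (-0.36, -0.40),
}

# Because both domains have mean 0 and standard deviation 1, their simple
# sum is a meaningful composite; rank modules by it, most complex first.
ranked = sorted(scores.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)

for module, (size, control) in ranked:
    print(f"module {module:2d}: sum = {size + control:5.2f}")
```

The ranking that comes out (8, 9, 3, 1, 7, 2) matches the order of these modules in Exhibit 17.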
The right-most column of Exhibit 17 is very revealing. Imagine that the program modules represented by this table constitute the entire system. Further imagine that we are going to have to ship this system sometime in the very near future. We would like to invest our test and inspection time wisely so that we can maximize our exposure to latent faults in the system. We would be wise to invest our time in proportion to the likelihood of encountering faults in the code. The distribution of these faults in the code is not uniform. Control faults are more likely to be found in modules whose control domain scores are high. The right-most column of the table (Sum) is our first cut at a fault surrogate: that is, a measure that varies in the same manner as software faults. There are other ways of creating surrogate fault measures, as we will now see.
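One way to act on such a ranking (an illustrative scheme, not one prescribed by this discussion) is to shift the composite scores so they are all positive and then allocate a fixed test budget in proportion to them. The shift by one unit above the minimum is an arbitrary assumption of this sketch, chosen only so the least complex module still receives a small nonzero share.

```python
# Composite (Sum) scores for a subset of the Exhibit 17 modules.
sums = {8: 9.31, 9: 4.34, 3: 3.95, 1: 3.80, 7: -0.63, 2: -0.75}

# Shift every score to be positive (illustrative assumption), then
# allocate a 100-hour test budget proportionally to the shifted weights.
offset = 1.0 - min(sums.values())
weights = {m: s + offset for m, s in sums.items()}
total = sum(weights.values())

budget_hours = 100.0
allocation = {m: budget_hours * w / total for m, w in weights.items()}

for module, hours in sorted(allocation.items(), key=lambda kv: -kv[1]):
    print(f"module {module}: {hours:5.1f} hours")
```

Under this scheme module 8, the most complex module, absorbs the largest share of the budget, which is exactly the investment pattern argued for above.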