8.2 Measuring Evolving Software

Software systems change dramatically as they go through their various stages of development. From the first build of each such system to the last build, the differences may be so great as to obscure the fact that it is still the same system. Developers commonly make this mistake when they talk about the system they are developing. It might be referred to as the "file management system" or whatever name seems to describe the software. This seems to imply that there is but one file management system. The fact that is obscured when we talk about the file management system is that today's build of the file management system is probably vastly different in composition and functionality from the original first-born file management system of the first system build. We would like to be able to quantify the differences in the system from its first build, through all builds, to the current one. Then and only then will it be possible to know how these systems have changed.

8.2.1 Baselining the System

The measurement of an evolving software system through the shifting sands of time is not an easy task. Perhaps the most difficult issue is the establishment of a baseline against which the evolving system can be compared. This problem is very similar to one encountered in the surveying profession. If we were to buy a piece of property, there are certain physical attributes that we would like to know about that property. First, we might wish to know its total area. Next, we might want to establish its physical shape, its elevation, and its topography. We can establish the area and the shape of the property with a transit and a measuring tape at the site. Questions about the location or the elevation of the property, however, cannot be answered from the site alone. We will have to seek out a benchmark. The benchmark is a survey marker that represents a point in a larger standard grid wherein each point is clearly related to every other point in the grid, both in terms of distance and elevation. This benchmark may be some distance from the property. To measure the topography of the property, we must first establish a fixed point or baseline on the property. The distance and the elevation of every other point on the property can then be established in relation to this fixed baseline. Interestingly enough, we can pick any other point on the property, establish a new baseline, and get exactly the same topography for the property. The property does not change; only our perspective changes.

The software measurement process is very much the same as the survey process. We wish to understand the individual elements of the whole system in relation to each other. We also wish to understand just how a system has evolved over time. It is very difficult to use raw complexity metrics for either of these purposes. The dilemma confronting those who wish to measure evolving software systems can be seen in Exhibit 1. In this exhibit there are two program modules, A and B, and two measurements on each of them: lines of code (LOC) and the unique operator count, η1. Measurements have been taken at build 1 and build 2. First, let us look at the two modules A and B at build 1. It is not clear whether module A is more complex than module B. Now look at how the system containing modules A and B has changed from build 1 to build 2. It is very difficult to establish whether or not the system is more complex at build 2 than at build 1. Clearly, the total number of lines of code has dropped by ten from build 1 to build 2. However, the total unique operator count has risen from 35 to 37.

Exhibit 1: Build Comparisons

                Build 1                 Build 2
            Module A   Module B     Module A   Module B
LOC            200        250          210        230
η1              20         15           19         18

The notion of establishing a baseline system will allow us to begin to answer the questions raised by the dilemma in the data of Exhibit 1. The first thing we must do is identify common sources of variation among the metrics. We will use principal components analysis (PCA) to create a set of orthogonal measures for the software modules, all of which will be defined on the same scale. From these common domain metrics, we will then reduce the measurement problem to a single fault surrogate measure for each of the program modules. This will reduce the dimensionality of the complexity problem to one single measure for each program module.

When a number of successive system builds are to be measured, we will choose one of them as a baseline system. All others will be measured in relation to the chosen system. This is exactly analogous to the selection of an arbitrary point on a piece of property to begin a topographic survey. Sometimes it will be useful to select the initial system build for this baseline. If we select this system, then the measurements on all other systems will be taken in relation to the initial system configuration.

8.2.2 System Evolution

A complete software system generally consists of a large number of program modules. Each of these modules is a potential candidate for modification as the system evolves during development and maintenance. As each program module is changed, the total system must be reconfigured to incorporate the changed module. We refer to this reconfiguration as a build. For the effect of any change to be felt, it must physically be incorporated into a build.

As program modules change from one build to another, the attributes of the changed program modules change. This means that there are measurable changes in modules from one build to the next. Each build is numerically and measurably different from its predecessor with respect to a particular set of metrics. Thus, there is no such thing as measuring a software system but once. Many software developers who profess to be deeply committed to measurement are still tempted to represent a system by a set of measurements taken at one point in a system's evolution. The truth is that measurement is a process. Whenever changes are made to a system, those system elements that have changed must be remeasured.

To describe the complexity of a system at each build, it will be necessary to know what the version of each of the modules was in the program that failed. Each of the program modules is a separate entity. Each will evolve at its own rate. Consider a software system composed of n modules as follows: m1,m2,m3,...,mn. Each build of the system will unify a set of these modules. Not all the builds will contain precisely the same modules. Clearly, there will be different versions of some of the modules in successive system builds.

We can represent the build configuration in a nomenclature that will permit us to describe the measurement process more precisely by recording module version numbers as vector elements in the following manner: vn = < vn,1, vn,2, ..., vn,m >. This build index vector will allow us to preserve the precise structure of each build for posterity. Thus, in the vector vn, the element vn,i represents the version number of the ith module that went into the nth build of the system. The cardinality of the set of elements in the vector vn is determined by the number of program modules that have been created up to and including the nth build. In this case, the cardinality of the complete set of modules is represented by the index value m; this is also the number of modules in the set of all modules that have ever entered any build.
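To make this bookkeeping concrete, here is a minimal sketch in Python; the version numbers are those of the hypothetical system discussed below, and a zero marks a module that is not part of a given build.

```python
# Build index vectors: element i holds the version of module i+1 in that build.
# A zero means the module was not part of the build (values from Exhibit 2).
builds = {
    1: [1, 2, 1, 2, 1, 1],
    3: [6, 4, 6, 4, 1, 0, 1],
    5: [7, 0, 8, 5, 1, 2, 4, 3, 3, 1],
}

def modules_in_build(v):
    """Return the module numbers actually present in a build (version != 0)."""
    return [i + 1 for i, version in enumerate(v) if version != 0]

for n, v in builds.items():
    print(f"build {n}: {len(v)} elements, active modules {modules_in_build(v)}")
```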

Program modules are similar to stars in a galaxy. Some of these stars (modules) have a relatively short life span. Other stars burn for a very long time. Thus, there is a constant flux of the stars in the galaxy. In a typical software build environment, there is a constant flux of modules going in and out of the build.

Exhibit 2 shows the evolution of a hypothetical software system consisting of a set of ten program modules. In this example, we can see that build 1 has six modules in it and the build index vector would look like this: v1 = < 1,2,1,2,1,1 >. This build index vector has six elements because that is the number of modules that have been sent to the build to date. On build 2 we can see that the first four modules have gone through a number of revisions and have probably changed quite a bit. Modules 5 and 6 on this build have not changed at all. The entire code churn has happened in the first four modules on this build. On build 3, a new module (module 7) enters the build. Module 6 is pulled from this build. This event is represented by a zero for the version number of module 6 on build 3. The build index vector for this build looks like this: v3 = < 6, 4, 6, 4, 1, 0, 1 >. It now has seven elements in it. In fact, as this system evolves, it will grow in total modules. The build index vector will also grow in size.

Exhibit 2: Hypothetical Build Example

Module    Build 1   Build 2   Build 3   Build 4   Build 5
  1          1         5         6         8         7
  2          2         3         4         0         0
  3          1         3         6         7         8
  4          2         3         4         5         5
  5          1         1         1         1         1
  6          1         1         0         2         2
  7          -         -         1         2         4
  8          -         -         -         2         3
  9          -         -         -         2         3
 10          -         -         -         -         1

(A zero marks a module that has been pulled from a build; a dash marks a module that has not yet been created.)

On build 4, module 6 returns to the build as a new version. Module 2 has vanished from the build and will remain gone; its services are no longer required. Finally, on build 5, the system has reached its maximum size of nine program modules. The build index vector for build 5 looks like this: v5 = < 7, 0, 8, 5, 1, 2, 4, 3, 3, 1 >. It has ten elements. Although module 2 has vanished from this and subsequent builds, it retains a historical presence. That is, if we wish to compare build 5 with build 3, module 2 will be present on build 3.

A natural way to capture the intermediate versions of the software is to have the system development occur under a configuration management system. For a system running under configuration management, all versions of all modules can be reconstructed from the time the program was placed under configuration control.

Management of the configuration of each of the program modules is one aspect of the software management process. Another vital piece is the build index vector; it is the only record of the module version that went to each build. This build index vector must be maintained in some type of build management database. There are many sad stories in the software maintenance community about software systems that have been delivered to a customer without such a record. It is almost impossible to interpret trouble reports from customers if the structure of the build that the customer is using is not known.

A natural way to capture the intermediate measurements for each build would be to incorporate the measurement tools within the configuration management system. Just as code deltas are maintained for each program module, so should deltas for the code attributes also be kept by the configuration management system.

The prime objective of this discussion is to demonstrate the measurement process for measuring successive stages of an evolving software system. Thus, we will be able to assess the precise effect of the change from the build represented by vi to vi+1 or even vi to vi+k or vi-k. These data will serve to structure the regression test activity between builds. Those modules that have the greatest change in complexity from one build to the next should receive the majority of test effort in the regression test activity.

The actual evolution of a large software system can be very complicated. Exhibit 3 presents a chart of the evolution of the many builds of the Space Shuttle PASS from Operational Increment 8 through Operational Increment 22. Each box, or lozenge, on this chart represents a different build of the system. This evolutionary chart is known internally as a spider chart. The leaf nodes of this graph are all rectangular except for the most recent build. These are the software systems that actually flew on space shuttle missions.

Exhibit 3: Spider Chart for PASS


One of the most important observations that can be drawn from Exhibit 3 is that not all of the builds are direct antecedents of the most current build. In fact, only a rather small subset of the total set of system builds can have this property. When we wish to measure a build in relation to its antecedent builds, we can only use those builds that are on the antecedent path to the build in question. The same principle is true for the collection of fault data. The total number of faults across all builds would be a misleading quality metric for understanding the most recent build. Some of these fault data clearly relate to code segments that are not included in the module set on the antecedent path for this most recent build. The spider chart, then, is a most valuable tool in that it permits us to identify the antecedent path for any program module.

On close inspection of the spider chart for PASS we can observe that there is a main sequence of builds that runs from the top left of the chart to the bottom right (with two discontinuities at points A and B). There are exactly 102 builds on this main sequence. At various intervals along this main sequence, a branch was pulled and a short evolutionary tree was built. There are 252 builds in these sub-trees off the main sequence, for a total of 354 builds in the evolutionary sequence represented in this spider chart.

This type of build evolution represents yet another incremental problem in software measurement. Many different modules can be created and added on each of the sub-trees. If these new modules are never integrated into any of the builds on the main sequence, then they have no measurement consequence for the current system. The current system, of course, is always the last node on the main sequence. This is a particular problem for fault reporting systems. In a typical fault reporting system, all fault and failure reports are kept in a common reporting system. If we are interested in the total faults that have been reported for our current system, we must carefully sort all of those faults recorded against program modules that are currently on the main sequence. If a fault has been reported against a module that was on the main sequence but is no longer part of the build, then this fault must not be included in the total count. Similarly, any faults reported against modules that were developed on branches must also be removed from the current count. In essence, we are interested only in the direct antecedents of the modules in the current build.

When evaluating the precise nature of any changes that occur to the system between any two builds i and j, we are interested in three sets of modules. The first set, Mc, is the set of modules present in both builds of the system. These modules may have changed since the earlier version but were not removed. The second set, Md, is the set of modules that were in the early build, i, and were removed prior to the later build, j. The final set, Ma, is the set of modules that have been added to the system since the earlier build.

As an example, let build i consist of the following set of modules:

Mi = {m1, m2, m3, m4, m5}

Between builds i and j, module m3 was removed, giving:

Mj = {m1, m2, m4, m5}

Then between builds j and k, two new modules, m7 and m8, are added and module m2 is deleted, giving:

Mk = {m1, m4, m5, m7, m8}
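These three sets can be computed mechanically from the module lists of any two builds. A minimal sketch in Python, using the sets of the example above:

```python
# Module sets for the three builds of the example above.
M_i = {"m1", "m2", "m3", "m4", "m5"}
M_j = {"m1", "m2", "m4", "m5"}          # m3 removed between builds i and j
M_k = {"m1", "m4", "m5", "m7", "m8"}    # m2 removed, m7 and m8 added

def build_delta_sets(early, late):
    """Return (common, removed, added) module sets between two builds."""
    return early & late, early - late, late - early

common, removed, added = build_delta_sets(M_j, M_k)
print(sorted(common), sorted(removed), sorted(added))
# ['m1', 'm4', 'm5'] ['m2'] ['m7', 'm8']
```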

8.2.3 Establishing a Software Measurement Baseline

For measurement purposes, it will be necessary to standardize all original, or raw, metrics so that they are on the same relative scale. For the ith module on the jth build of the system, there will be a data vector of k raw complexity metrics for that module. We can standardize each of these raw metrics by subtracting the mean of that metric over all modules in the jth build and dividing by its standard deviation, so that

z_i,1 = (x_i,1 - x̄_1) / s_1

represents the standardized value of the first raw metric for the ith module on that build. The problem with this method of standardizing is that it will erase the effect of trends in the data. For example, let us assume that we were taking measurements on LOC and that the system we were measuring grew in this measure over successive builds. If we were to standardize each build of the system by its own mean LOC and its own standard deviation, the mean of this metric would always be zero on every build. Thus, we will standardize the raw metrics using a baseline system, such that the standardized metric vector for the ith module on the jth build would be:

z_i = (x_i - x̄_B) / s_B

where x̄_B is a vector containing the means of the raw metrics for the baseline system B, s_B is a vector of the standard deviations of these raw metrics, and the division is performed element by element. Thus, for each build j, we can construct an m × k data matrix, Wj, that contains the standardized metric values relative to the baseline system on build B.

A simple example will serve to identify some of the problems surrounding the measurement of an evolving software system. The purpose of this exercise is to demonstrate the steps that we have employed to solve the measurement of software evolution. It is a contrived example whose purpose is strictly pedagogical. It is much easier to show all of the data and all of the resulting analysis for a problem of this scale than it is for a real problem. For the real problem, data will be drawn from our investigations of the evolution of the Space Shuttle PASS. The data from PASS would be almost unintelligible without this example in the small.

Let us assume that our hypothetical software system has ten modules and we are collecting measurements on seven program attributes. If we apply our measurement tool to this system, we will get data represented by Exhibit 4. Each row in this exhibit represents the measurements derived from one program module.

Exhibit 4: Hypothetical Software System

Module          Exec      N1      η1      N2      η2    Nodes    Edges
1                100     303      34      92      20      15       12
2                158     352      42     104      18      23       18
3                 99     208      32      88      27      20       15
4                 30      35      21      26       8       5        4
5                 64      45      14      14       5      24       20
6                 85     157      19      20      12      19       14
7                198     268      32     105      15      32       20
8                154     360      15     105      30      15       13
9                 96     185      22      35      11      25       15
10                74      56       9      44      13      13       10
Mean (x̄)       105.8   196.9      24    63.3    15.9    19.1     14.1
Std. dev. (s)   50.2   123.8    10.5    38.6     8.0     7.5      4.8

The means and standard deviations for each of the metrics are shown in the last two rows of this exhibit. These data are part of the baseline transformation data. We will use them to transform the raw data for this build and other builds into z-scores. Observe that each metric has a different mean and standard deviation. Thus, each metric represents a value drawn from a different population. Very simply, this means that we cannot add values of any two of these metrics: we cannot add N1 and N2 and learn anything meaningful from the sum. Maurice Halstead, as you will remember, suggested that we could create a derived metric, program vocabulary, in the following manner: η = η1 + η2. The mean of the distribution of η1 is 24 and its standard deviation is 10.5. The mean of η2 is 15.9 and its standard deviation is 8.0. The sum of these two metrics for any program module is disinformation. We can solve the problem of dissimilar distributions of these metrics by transforming them into z-scores. The transformation of these metrics is shown in Exhibit 5.
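The standardization step is simple enough to sketch directly. The following is a minimal NumPy sketch; the array literal simply reproduces the raw data of Exhibit 4, and the same baseline_mean and baseline_std would be saved and reused for every later build.

```python
import numpy as np

# Raw metrics from Exhibit 4 (columns: Exec, N1, eta1, N2, eta2, Nodes, Edges).
raw = np.array([
    [100, 303, 34,  92, 20, 15, 12],
    [158, 352, 42, 104, 18, 23, 18],
    [ 99, 208, 32,  88, 27, 20, 15],
    [ 30,  35, 21,  26,  8,  5,  4],
    [ 64,  45, 14,  14,  5, 24, 20],
    [ 85, 157, 19,  20, 12, 19, 14],
    [198, 268, 32, 105, 15, 32, 20],
    [154, 360, 15, 105, 30, 15, 13],
    [ 96, 185, 22,  35, 11, 25, 15],
    [ 74,  56,  9,  44, 13, 13, 10],
], dtype=float)

# Baseline statistics for this build; they are saved and reused for
# every subsequent build so that the builds remain comparable.
baseline_mean = raw.mean(axis=0)
baseline_std = raw.std(axis=0, ddof=1)   # sample standard deviation

def z_scores(build_metrics, mean, std):
    """Standardize a build's raw metrics against the baseline statistics."""
    return (build_metrics - mean) / std

Z = z_scores(raw, baseline_mean, baseline_std)
print(np.round(Z, 2))   # compare with Exhibit 5
```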

Exhibit 5: z-Scores for the Hypothetical System

Module     Exec      N1      η1      N2      η2    Nodes    Edges
1         -0.12    0.86    0.95    0.74    0.51   -0.54    -0.43
2          1.04    1.25    1.71    1.05    0.26    0.52     0.81
3         -0.14    0.09    0.76    0.64    1.39    0.12     0.19
4         -1.51   -1.31   -0.29   -0.97   -0.99   -1.87    -2.09
5         -0.83   -1.23   -0.95   -1.28   -1.37    0.65     1.22
6         -0.41   -0.32   -0.48   -1.12   -0.49   -0.01    -0.02
7          1.84    0.57    0.76    1.08   -0.11    1.71     1.22
8          0.96    1.32   -0.86    1.08    1.77   -0.54    -0.23
9         -0.20   -0.10   -0.19   -0.73   -0.61    0.78     0.19
10        -0.63   -1.14   -1.43   -0.50   -0.36   -0.81    -0.85

Whereas Exhibit 4, containing the raw metric values, has data in it, Exhibit 5 has information in it. Let us look, in particular, at modules 2 and 3. For module 2, the values of η1 and η2 are 42 and 18, respectively. For module 3, the corresponding values are 32 and 27. From these two comparisons, it is clear that module 2 has more unique operators than does module 3. Similarly, module 3 has quite a few more unique operands than does module 2. We really do not know what this means. We have data that we really cannot use. It is no wonder that so many people who have "tried metrics and they didn't work" have come to this conclusion.

Now let us look at the same data for modules 2 and 3 from a different perspective. Exhibit 5 contains the z-scores for the same data. All of the data in Exhibit 5 have the same mean (0) and the same standard deviation (1). If we look at η1 and η2 in this table for module 2, we see that the standardized count of unique operators, η1, for this module is 1.71. This is almost two standard deviations above the mean for all program modules; there must be some serious computational complexity in this module. When we look at η2 for this module, we see that the standardized count of unique operands is 0.26, which is very close to the average of all modules. If we look at module 3 from the same perspective, we see that the values for η1 and η2 are 0.76 and 1.39, almost the reverse of module 2. Module 3 is clearly richer in its potential data complexity, as represented by the number of distinct operands, than is module 2. Module 2, in turn, is clearly more computationally complex than module 3 in that its number of distinct operators is much greater.

The raw data have been converted into information that we can analyze and begin to draw some preliminary conclusions about what is going on in each of the modules. Let us delve further into this process of extracting information from the data. To do this we will perform a principal components analysis (PCA, with a varimax rotation) on the raw data matrix. The factor pattern for this PCA is shown in Exhibit 6.

Exhibit 6: The Principal Components of the Hypothetical System

Metric               Domain 1 (Size)   Domain 2 (Control)
Exec                      0.73               0.59
N1                        0.92               0.25
η1                        0.62               0.32
N2                        0.96               0.17
η2                        0.89              -0.17
Nodes                     0.10               0.98
Edges                     0.13               0.95
Eigenvalues               3.5                2.5
Percent variation         60                 25

The PCA technique has revealed two distinct sources of variation in the original data. We can clearly see that the metrics Exec, N1, η1, N2, and η2, are associated with the first domain (principal component). The metrics Nodes and Edges are clearly and unambiguously associated with Domain 2. The metrics associated with Domain 1 are all measuring essentially the same thing, program size. The metrics associated with Domain 2, the Control domain, are both attributes of the control flowgraph structure of a program module. They are clearly identifying a different source of variance from the size metrics. This is not a matter of speculation: PCA has revealed this fact to us.

The eigenvalues for each of the size and control domains are 3.5 and 2.5, respectively. Together, these two principal components then account for a total of 85 percent of the variation in the original problem space.
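For readers who wish to reproduce this kind of factor pattern, the sketch below computes an unrotated principal components solution from the correlation matrix of the metrics. The varimax rotation used for Exhibit 6 is omitted for brevity, so its loadings will differ somewhat from those in the exhibit; it illustrates the technique rather than the exact computation used here.

```python
import numpy as np

def principal_components(metrics, n_domains=2):
    """Unrotated PCA of a modules x metrics data matrix.

    Returns (eigenvalues, loadings) for the n_domains largest components.
    Loadings are the eigenvectors of the correlation matrix scaled by the
    square roots of their eigenvalues, comparable to a factor pattern
    before rotation.
    """
    corr = np.corrcoef(metrics, rowvar=False)     # metrics x metrics
    eigvals, eigvecs = np.linalg.eigh(corr)       # ascending order
    order = np.argsort(eigvals)[::-1][:n_domains]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals, eigvecs * np.sqrt(eigvals)

# eigvals, loadings = principal_components(raw)   # raw is the matrix of Exhibit 4
# print(np.round(eigvals, 1), np.round(loadings, 2))
```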

Now let us return briefly to Halstead's program vocabulary metric, η = η1 + η2. As pointed out earlier, the raw metric values η1 and η2 simply cannot be added together just because we feel like creating a new metric called vocabulary. The values for each metric on each module are drawn from a different distribution. We will learn nothing from their combination. Again, this is not a matter of speculation. Let us augment the original data matrix containing the raw metric values with a column representing the new vocabulary metric. When we perform the PCA, we get the result shown in Exhibit 7.

Exhibit 7: Revised PCA

Metric    Domain 1 (Size)   Domain 2 (Control)
Exec           0.687              0.608
N1             0.904              0.267
η1             0.692              0.302
N2             0.945              0.181
η2             0.872             -0.151
Nodes          0.092              0.981
Edges          0.117              0.955
η              0.944              0.131

Voila! We have transformed two measures of size complexity to yet another measure of size complexity. We have learned nothing new with this new metric. The metric primitives contain all of the juice. We just do not get three oranges if we add two oranges together.

Returning to the problem of measuring software evolution, we have a real problem in the volume of data that we are trying to manage with just ten modules and seven metric values. A much more realistic problem would be a system with 5000 modules and 20 metric primitives. We desperately want some mechanism to reduce the size of the problem with which we are working.

A by-product of the original PCA of the ten program modules and the seven metric primitives is a transformation matrix T that will map the z-scores of the seven raw metrics into the reduced space represented by the two principal components of size and control. Let Z represent the matrix of z-scores shown in Exhibit 5 for the original data of Exhibit 4. We can obtain new domain metrics, D, using the transformation matrix T as follows: D = ZT, where Z is a 10 × 7 matrix of z-scores, T is a 7 × 2 matrix of transformation coefficients, and D is a 10 × 2 matrix of domain scores. The matrix T for this solution is shown in Exhibit 8.

Exhibit 8: Transformation Matrix

Metric      Size     Control
Exec        0.155     0.171
N1          0.268    -0.017
η1          0.159     0.062
N2          0.293    -0.064
η2          0.320    -0.215
Nodes      -0.111     0.453
Edges      -0.098     0.435
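Obtaining the domain scores is then a single matrix product. A minimal sketch, with the coefficients of Exhibit 8 and the z-score matrix Z from the earlier fragment:

```python
import numpy as np

# Transformation matrix T from Exhibit 8; rows follow the metric order
# Exec, N1, eta1, N2, eta2, Nodes, Edges.
T = np.array([
    [ 0.155,  0.171],
    [ 0.268, -0.017],
    [ 0.159,  0.062],
    [ 0.293, -0.064],
    [ 0.320, -0.215],
    [-0.111,  0.453],
    [-0.098,  0.435],
])

def domain_scores(Z, T):
    """Map an m x k matrix of z-scores to m x p uncorrelated domain scores."""
    return Z @ T

# D = domain_scores(Z, T)    # Z is the 10 x 7 z-score matrix of Exhibit 5
# print(np.round(D, 2))      # compare with the factor scores of Exhibit 9
```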

The domain metrics (transformed z-scores or factor scores) formed by the product are shown in Exhibit 9.

Exhibit 9: Factor Scores

Module     Size    Control
1          0.85     -0.57
2          1.03      0.72
3          0.72     -0.18
4         -0.82     -1.73
5         -1.61      1.02
6         -0.71      0.07
7          0.53      1.61
8          1.33     -0.71
9         -0.60      0.57
10        -0.72     -0.80

For each module, there are now two metrics: size and control. These new metrics represent the underlying domains or sources of variation uncovered by principal component analysis. They also have the interesting property that they are uncorrelated. Each of the new metrics represents a distinct source of variation. There is no overlap. We have reduced the dimensionality of the problem from seven metrics to two new metrics that account for approximately 86 percent of the variation seen in the original seven metrics.

By inspection of Exhibit 9 we can see that module 8 has the highest size complexity of all modules. It is clearly not the largest in terms of Exec but it is larger in terms of the cumulative size attributes. We can also see that the greatest control complexity occurs in module 7. We are now in a very good position to understand the differences among the program modules based on the attributes we are measuring. When we have a clear indication of what we are measuring, as revealed in the PCA, we also have a good clear view of what we are not measuring. For example, we know that program modules differ with regard to their coupling complexity (their relationships with other program modules) or their data structures complexity. However, we are not measuring these attributes.

At this stage we might wish to simplify the problem further. We can further reduce the complexity dimensionality by forming a linear combination of the domain metrics. At the outset we could simply add them together for each module: yi = d1,i + d2,i. We can do this because d1 and d2 have the same distribution; these values are both drawn from a population with a mean of 0 and a standard deviation of 1. The new variable, y, created through this synthesis should meet the criterion that it be related to software faults, as we have seen in Chapters 4 and 7. If we were to model this behavior, we would discover that y does a reasonable job. We could, of course, regress our two domain metrics on actual fault data from our historical fault database, and this would yield a new metric, yi = ad1,i + bd2,i. We have found that a fault index (FI) derived from the eigenvalues does a reasonably good job as a fault surrogate. This FI, represented by y, is defined by yi = λ1d1,i + λ2d2,i, where λ1 and λ2 are the eigenvalues associated with each of the two domains or principal components.

The main problem with the FI metric is that its mean value will be 0. This means that half of the metric values will be negative. We would like to adjust FI so that it has a distribution that is more socially acceptable. To do this we will choose to center the distribution about 100 with a standard deviation of 10 as follows: ρi = 10 × (λ1d1,i + λ2d2,i) + 100. There is no magic in this transformation. It is much like the scale used for IQ scores, and people are comfortable with it. The FI values for the system of ten modules currently under investigation are shown in Exhibit 10.
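A short sketch of the FI computation, assuming the domain score matrix D of Exhibit 9 and the eigenvalues 3.5 and 2.5 of Exhibit 6 are available:

```python
import numpy as np

def fault_index(D, eigenvalues, center=100.0, scale=10.0):
    """FI: rho_i = scale * (lambda_1 * d_1,i + lambda_2 * d_2,i + ...) + center.

    D is an m x p matrix of domain scores; eigenvalues is the length-p
    vector of eigenvalues for the corresponding principal components.
    """
    return scale * (D @ np.asarray(eigenvalues)) + center

# rho = fault_index(D, [3.5, 2.5])
# print(np.round(rho))    # compare with Exhibit 10; the values sum to 1000
```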

Exhibit 10: The Fault Index

Module    Fault Index ρ
1              115
2              154
3              121
4               28
5               69
6               77
7              159
8              129
9               93
10              54
Total         1000

The values in Exhibit 10 represent the fault potential of each module. This is not an act of faith; we have found this to be valid in many different studies. If we are very careful in selecting a working set of metrics (see Chapter 7), we can make FI account for more than 95 percent of the total variation observed in historical fault data. In the specific case of the Space Shuttle PASS, the correlation between the FI metric and the DR count from the historical database was 0.92. Very simply stated, the FI metric accounted for more than 80 percent of the variation in the DR, or fault, data. With only the two metric domains of size and control in this contrived example, it is not likely that the FI represented in Exhibit 10 would account for more than 65 percent of the variation in corresponding fault data. From a scientific standpoint, however, 65 percent is a lot better than no information or simple speculation.

The sum of the FIs for all modules is, of course, 1000. This is not surprising: the mean was set to 100 and there are ten modules. We would look for problems in our calculations had this sum not been 1000.

From the standpoint of the data in Exhibit 10, if we were looking for problems in the code, we would certainly focus our efforts on module 7, because it has the highest FI value. If we have a fixed amount of inspection time (and we always do), then it would behoove us to invest this time wisely. Modules 1, 2, 3, 7, and 8 would certainly command our attention. If and only if we had surplus resources would we spend time with modules 4 and 10. This is a case of the classical trade-off between fairness and optimality. The optimal solution would focus all review effort on the most complex modules; these are the ones where the majority of faults will probably be found. It takes just one fault in the right place to bring any system to its knees, however. The fair solution requires that we invest inspection resources in proportion to the FI. The fair solution for apportioning our inspection (test) resources is shown in Exhibit 11 in terms of percentage of effort.

Exhibit 11: Optimal Resource Allocation

Module    Percentage Effort
7               16
2               15
8               13
3               12
1               12
9                9
6                8
5                7
10               5
4                3

The whole purpose of the reduction in dimensionality of the measurement problem is to convert the measurement data to usable information. Exhibit 11 is a good example of the utility achieved in the simplification of the measurement problem. It would have been very difficult to arrive at any meaningful conclusion about our software system based on the data shown in the original raw metric data in Exhibit 4.
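The apportionment of Exhibit 11 is a straightforward proportional allocation of a fixed inspection budget. A brief sketch, using the FI values of Exhibit 10:

```python
# Fault index of each module (Exhibit 10).
fi = {1: 115, 2: 154, 3: 121, 4: 28, 5: 69,
      6: 77, 7: 159, 8: 129, 9: 93, 10: 54}

total = sum(fi.values())                    # 1000 for the baseline build
effort = {m: round(100 * v / total) for m, v in fi.items()}

for module, pct in sorted(effort.items(), key=lambda kv: kv[1], reverse=True):
    print(f"module {module:2d}: {pct:2d}% of inspection effort")
```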

In summary, then, the measurement baseline will consist of three arrays. First is the vector of means, x̄_B, for the baseline metric data. Next is the vector of standard deviations, s_B, for each of the metrics on the baseline system. Finally, there is the transformation matrix T that will map the z-scores of the metrics to orthogonal domain scores.

8.2.4 Measuring Changes to the System

As has been noted, a significant problem in the measurement of evolving software systems is that software modules come and go. This is not a problem when it comes to the computation of the individual module domain metrics and the computation of the module fault index. It is, however, a problem when we are looking at the average system metrics. For example, if the initial build of a system contained m program modules and the next system contains m+1 modules, there is some ambiguity in calculating the average FI of the new system. We can understand this problem a little better if we consider a program module that was simply split into two modules from the first to the second build. This being the case, the FI of each of the two new modules will be less than the FI of the parent module. Thus, if we were to compute the average FI of the new system with the value of m+1 as the normalizing value, then the apparent complexity of the new system will have been reduced. However, because of the coupling complexity introduced between the two new modules, the net system complexity will have increased. To this end, the normalizing value for the computation of all averages will be the cardinality of the set of modules in the baseline system.

By definition, the average FI of the baseline system at build B will be:

ρ̄ = (1/N1) Σi ρi

where N1 is the cardinality of the set of program modules on the first build of the system. As the system progresses through a series of builds, system complexity will tend to rise. Thus, the system FI of the kth version of a system can be represented as a function of module FI as follows:

ρ̄k = (1/N1) Σi ρi(vk,i)

where vk,i represents an element from the configuration vector vk described earlier, so that ρi(vk,i) is the FI of the particular version of module i that is present in the kth build.
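A sketch of this normalization follows; note that the divisor is the baseline cardinality N1, not the number of modules in the later build, so that averages remain comparable as modules come and go. The FI lists are taken from Exhibits 10 and 15.

```python
def average_fi(fi_values, baseline_cardinality):
    """Average FI of a build, normalized by the baseline module count."""
    return sum(fi_values) / baseline_cardinality

fi_build_1 = [115, 154, 121, 28, 69, 77, 159, 129, 93, 54]        # Exhibit 10
fi_build_2 = [115, 154, 121, 35, 69, 77, 148, 129, 93, 55, 44]    # Exhibit 15

N1 = len(fi_build_1)                   # cardinality of the baseline build
print(average_fi(fi_build_1, N1))      # 100.0
print(average_fi(fi_build_2, N1))      # 104.0 with these rounded FI values
```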

Let us now assume that the example system of ten program modules has been modified somewhat and new functionality has been added to the system. There are now 11 modules in the system as represented in Exhibit 12.

Exhibit 12: Build 2 Raw Metric Data

Module          Exec      N1      η1      N2      η2    Nodes    Edges
1                100     303      34      92      20      15       12
2                158     352      41     104      18      23       18
3                 99     208      32      88      25      20       15
4                 45      36      22      24      10       6        5
5                 64      45      14      14       5      24       20
6                 85     157      19      20      12      19       14
7                179     205      32      95      15      32       20
8                154     360      15     105      30      15       13
9                 96     185      22      35      11      25       15
10                74      56       9      44      13      13       10
11                50      32      16      26      15      10        8

In addition to the changes represented by adding a new module to this system, there have been changes to modules 4 and 7 as well. It is not clear exactly how the existing modules 4 and 7 have changed with regard to the whole system. We now want answers to two questions. First, what is the nature of the changes to modules 4 and 7? Second, what is the effect of adding the new module 11 to the system? To take the first step in answering these questions, we will convert the metrics in Exhibit 12 to z-scores using the baseline means and standard deviations. This will yield the results displayed in Exhibit 13.

Exhibit 13: z-Scores for Build 2 Scaled by Build 1 Baseline

Module     Exec      N1      η1      N2      η2    Nodes    Edges
1         -0.12    0.86    0.95    0.74    0.51   -0.54    -0.43
2          1.04    1.25    1.62    1.05    0.26    0.52     0.81
3         -0.14    0.09    0.76    0.64    1.14    0.12     0.19
4         -1.21   -1.30   -0.19   -1.02   -0.74   -1.74    -1.88
5         -0.83   -1.23   -0.95   -1.28   -1.37    0.65     1.22
6         -0.41   -0.32   -0.48   -1.12   -0.49   -0.01    -0.02
7          1.46    0.07    0.76    0.82   -0.11    1.71     1.22
8          0.96    1.32   -0.86    1.08    1.77   -0.54    -0.23
9         -0.20   -0.10   -0.19   -0.73   -0.61    0.78     0.19
10        -0.63   -1.14   -1.43   -0.50   -0.36   -0.81    -0.85
11        -1.11   -1.33   -0.76   -0.97   -0.11   -1.21    -1.26
Mean      -0.11   -0.17   -0.07   -0.12   -0.01   -0.10    -0.10

The general characteristics of the new module 11 can be seen from its row in Exhibit 13. In relation to the modules that were present on the first build, this module is a very simple one. Its most notable characteristic is that the number of unique operands, η2, seems inordinately high relative to the overall magnitude of its other metrics. It is also interesting to note that the mean of these z-scores is no longer zero. This is because a new module has been added to the system that is lower overall in the attributes being measured. In addition, it is clear that modules 4 and 7 have changed as well.

Once the z-scores for build 2 of the hypothetical system have been established, the transformation matrix T, which was also part of the baseline for build 1, can now be applied to these new z-scores. The result of this matrix multiplication is shown in Exhibit 14 for both builds 1 and 2. The two columns for build 2 are the new metric domain scores. As with the baselined z-scores, the mean of the domain scores is no longer zero because of the departures of the new module from the mean and also because of the changes that have occurred in modules 4 and 7.
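The measurement of a later build therefore reuses the baseline triple (the means, the standard deviations, and T) rather than recomputing them. A compact sketch, in which raw_build_2 is a hypothetical array holding the raw metrics of Exhibit 12 and the other names come from the earlier fragments:

```python
import numpy as np

def measure_build(raw_build, baseline_mean, baseline_std, T):
    """Domain scores of a build, measured against the baseline build.

    raw_build is an m x k matrix of raw metrics for the modules of the
    build; the means, standard deviations, and transformation matrix all
    come from the baseline, so successive builds remain comparable.
    """
    z = (raw_build - baseline_mean) / baseline_std
    return z @ T

# D2 = measure_build(raw_build_2, baseline_mean, baseline_std, T)
# For modules present in both builds, the rows of D2 - D1 give the size and
# control deltas reported in the last two columns of Exhibit 14.
```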

Exhibit 14: A Comparison of Build 2 to Build 1

            Build 1            Build 2            Build 2 - Build 1
Module    Size    Control    Size    Control      Size    Control
1         0.85     -0.57     0.85     -0.57        0         0
2         1.03      0.72     1.03      0.72        0         0
3         0.72     -0.18     0.72     -0.15        0         0
4        -0.82     -1.73    -0.72     -1.58        0.09      0.16
5        -1.61      1.02    -1.61      1.02        0         0
6        -0.71      0.07    -0.71      0.07        0         0
7         0.53      1.61     0.26      1.57       -0.27     -0.04
8         1.33     -0.71     1.33     -0.71        0         0
9        -0.60      0.57    -0.60      0.57        0         0
10       -0.72     -0.80    -0.72     -0.80        0         0
11          -         -     -0.71     -1.22       -0.71     -1.22
Mean        0         0     -0.09     -0.10

The final two columns of Exhibit 14 are perhaps the most informative with regard to the precise manner in which the software has changed from build 1 to build 2. Modules 4 and 7 have changed. There has been an increase in both the size and control complexity of module 4; although its domain scores are still negative, both deltas are positive. There has been a decrease in both the size and control complexity of module 7. Further, there is the addition of a new module altogether, module 11.

Ultimately, we are interested in what has happened to the system as a whole between builds 1 and 2. This can best be seen in terms of the FI metric. Exhibit 15 displays the new FI values for build 2. The total system FI has now increased to 1039 from the beginning value of 1000. Modules 4 and 7 have changed in complexity, and the overall net system complexity has increased. The last two columns in Exhibit 15 represent the difference between the FI values on the two incremental builds. The column labeled Build Difference shows the increase (or decrease) in FI for each module from build 1 to build 2. Module 4 has increased in net FI by 7, and module 7 has decreased in net FI by 10. The net system increase is 41.

Exhibit 15: FI Build Deltas

Module    Fault Index ρ    Build Difference    Absolute Build Difference
1              115                 0                      0
2              154                 0                      0
3              121                 0                      0
4               35                 7                      7
5               69                 0                      0
6               77                 0                      0
7              148               -10                     10
8              129                 0                      0
9               93                 0                      0
10              55                 0                      0
11              44                44                     44
Total         1039                41                     61

The last column in Exhibit 15 shows, for each module, the absolute value of the difference in FI between builds 1 and 2. This captures the total amount of change in FI, up or down, between the two builds: a measure of code churn in FI terms. From this perspective, the total change in the system between the two builds is 61.
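Both delta measures are easy to compute if the FI values are keyed by module number, so that added and deleted modules are handled naturally. A sketch using the values of Exhibits 10 and 15 (treating a module missing from one build as having an FI of zero there is an assumption of this sketch, and small differences from the exhibit totals are rounding in the exhibits):

```python
def fi_deltas(fi_old, fi_new):
    """Net and absolute FI change between two builds.

    fi_old and fi_new map module numbers to FI values; a module missing
    from one build is treated as having an FI of zero in that build.
    """
    modules = set(fi_old) | set(fi_new)
    diffs = {m: fi_new.get(m, 0) - fi_old.get(m, 0) for m in modules}
    net = sum(diffs.values())
    churn = sum(abs(d) for d in diffs.values())
    return diffs, net, churn

fi_build_1 = {1: 115, 2: 154, 3: 121, 4: 28, 5: 69,
              6: 77, 7: 159, 8: 129, 9: 93, 10: 54}
fi_build_2 = {1: 115, 2: 154, 3: 121, 4: 35, 5: 69,
              6: 77, 7: 148, 8: 129, 9: 93, 10: 55, 11: 44}

diffs, net, churn = fi_deltas(fi_build_1, fi_build_2)
print(net, churn)    # compare with the totals of 41 and 61 in Exhibit 15
```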

FI is a fault surrogate. The FI on the first build represents the fault potential or the fault burden of each module on this initial build. Changes to the system will possibly introduce new faults to the system. As we will see, faults introduced into the code over time will vary directly with the changes to FI.

8.2.5 Evaluating Changes across Builds: An Example from PASS

We will now examine the application of this methodology to the evolution of the Space Shuttle Primary Avionics Software System (PASS). For the PASS data, we can transform the 20 raw attribute measures for each of the 765 program modules into four orthogonal domain metrics. This transformation from correlated metric z-scores to uncorrelated metric domain scores (factor scores) is achieved through the multiplication of the standardized metrics by a transformation matrix produced by the principal components analysis. The transformation matrix for this specific example is shown in Exhibit 16. This is a 20 × 4 matrix that will transform the standardized metrics on each of the 765 modules to form uncorrelated domain metrics for each of the modules. While this reduction in the number of metrics has simplified the problem somewhat, we really would like to represent each program with a single metric that would serve as a measure of module complexity, simultaneously representing all four orthogonal domains of complexity.

Exhibit 16: The PASS Baseline Transformation Matrix

Metric     D1 Control    D2 Size    D3 Semaphore    D4 Temporal
η1           -0.019       0.103         0.032          -0.055
η2           -0.064       0.183        -0.014          -0.029
N1           -0.101       0.228        -0.051          -0.036
N2           -0.104       0.223        -0.042          -0.009
Stmt         -0.077       0.217        -0.057          -0.040
LOC          -0.050       0.141         0.013           0.062
Comm         -0.071       0.173        -0.019           0.013
Nodes         0.188      -0.027        -0.004          -0.003
Edges         0.198      -0.031        -0.015          -0.009
Paths         0.167      -0.012        -0.054          -0.032
Cycle         0.296      -0.145        -0.033           0.023
MaxP          0.263      -0.095        -0.004           0.020
AveP          0.268      -0.104        -0.000           0.026
DataStr      -0.009       0.142        -0.060          -0.069
Sets         -0.031      -0.041         0.303           0.000
Reset        -0.006      -0.090         0.346          -0.012
Can           0.007      -0.055        -0.063           0.546
SetA         -0.030      -0.034         0.320          -0.093
ResA         -0.044      -0.055         0.364          -0.108
CanA          0.008      -0.056        -0.074           0.554

The change in the overall FI of the PASS system over time is represented pictorially in Exhibit 17. This is an example of the FI of the most recent 20 software builds for the Space Shuttle PASS. For this presentation, the baseline system is represented by the system 0 on the x-axis of this graph. All other systems are measured relative to this one.

Exhibit 17: Average FI across Sequential Builds


One pattern that becomes obvious from Exhibit 17 is that the complexity of a system continues to rise over the life of the software system. This is particularly interesting in a mature system such as PASS. This system has evolved through hundreds of builds (see also Exhibit 3). It is still increasing in complexity. If we were to move the baseline system back another ten builds in time, the general upward trend of the complexity of the system would be sustained. The particular baseline for this exhibit was selected because of the change activity that we had observed before and after this baseline build. There were substantial changes made in PASS prior to the build labeled as build 0 on the x-axis of this figure. This activity is evidenced by the substantial variation in ρ around build 0 as seen in Exhibit 18.

Exhibit 18: Deltas in FI across Sequential Builds


The general upward trends in system complexity as shown in Exhibit 17 can be eliminated by computing the differences, or deltas, in complexity from one build to the next. These deltas, then, show the relative magnitude of the changes that have occurred at each build. They provide an excellent view of the impact on the total system of the incremental changes between builds. The deltas for FI are shown in Exhibit 18 for the same builds represented in Exhibit 17.

The average FI gives a very good indication of the global nature of the evolution of a system. In Exhibit 17, we can see that the baseline system has been labeled as build 0 on the x-axis. By definition, it has an average FI of 50. What is interesting are the fluctuations in the curve about this point. We can see that builds -4, -3, and -2 are relatively stable and have an average FI value of about 49.25, somewhat less than build 0. A very interesting event occurs at build -1. A unilateral decision was made to overhaul the system and simplify the code after build -2. This resulted in the new system at build -1. Indeed, we can see that the goal was achieved in that the average FI on build -1 is 48.5. Between build -1 and build 0, however, it was discovered that the new changes created heroic problems. These problems were resolved on build 0. Now the system complexity has risen much higher than it was before the great leap forward. Eventually, the process oscillation damps out and the curve resumes essentially where it would have been if Nature had been allowed to take its course.

The system evolution event represented by Exhibit 17 is very typical of software development efforts in the absence of a mechanism to measure evolution. Heroic changes to software systems are typically mandated by program managers. The stated purpose of such an effort is always to "clean up the code." By our definition, the code will have been "cleaned up" if there has been a sustainable net decline in the average FI for the system. What very often happens, however, is that the system is no better off (and sometimes much worse off) after the simplification effort. The average FI values shown in Exhibit 17 clearly show this phenomenon.

Exhibit 18 gives a clear indication of exactly what happened during the great leap forward. Between builds -2 and -1, a substantial amount of code was dumped and the average FI delta went down. Immediately after the great leap forward, the consequences of the change were felt and problems were fixed. New code was added between builds -1 and 0 to rectify the problems found. This oscillation continues for a while, and eventually the system reaches its steady-state evolution process again. If we are watching and measuring, we can discover that there was extreme code churn in this process and probably no net benefit from the system overhaul.

As changes are made to individual software modules in an evolving software system, the complexity of the system will tend to grow. This will lead to problems in the maintainability of the code: as the system becomes more complex, it will be more difficult to maintain. The increase in complexity will also result in the introduction of new faults into the system in direct proportion to the increase in the complexity of the code. The measurement methodology introduced in this study permits direct measurement of the complexity attributes that are likely to be related to software faults. From a maintenance perspective, an increase in program complexity from one build to another will create a concomitant increase in the cost of maintaining the program. From a software test perspective, the rate of fault injection by changes from one build to the next will be directly proportional to the change in the system complexity. This, in turn, will create the need for increased test effort directly proportional to the net change in program complexity.

8.2.6 Examining the Specific Changes in the Evolution of PASS

The Space Shuttle PASS is of great interest in the study of software evolution for a number of reasons. Primary among these reasons is that the system was developed by a systems group that was for many years the only viable candidate for a Level 5 development organization on the Software Engineering Institute Capability Maturity Model. A tremendous historical database exists for this software system across its many distinct builds. Furthermore, the development group has maintained accurate measurement data over the life of this system. It represents an optimal opportunity for the study of software evolution.

Now we would like to present an example of specific attribute measurements across software builds. To this end, we will examine the last 20 system builds baselined on the fifth one of this series. There is a very good reason for examining these particular systems. There were some major changes occurring at the fifth build, as can well be seen in Exhibit 19. For each of the 20 system builds represented in this exhibit, the four domain metrics were computed for each program module. System totals were then computed for the domain metrics for all modules of each of the systems. These four total domain values were then plotted for each system.

Exhibit 19: Baselined Domain Metrics Across Builds


The domain metrics of all of the systems were computed with build 0 as the baseline system. In that the domain metrics have a mean of zero and a standard deviation of one for the baseline system, we can see from Exhibit 19 that all four domain metric lines cross the x-axis at this build. In that all other systems are baselined relative to build 0, the domain metrics show changes in the complexity of each of the domains relative to build 0.

Some very interesting patterns emerge from Exhibit 19. We can see that in the series of builds from build 0 onward, there is a general upward trend in the source of complexity represented by Domain 3, semaphore complexity. Alternatively, there has been a substantial decline in the source of complexity represented by Domain 1. Domain 1, it will be remembered from the earlier discussion, is a control complexity domain. We can see that the trend in the control complexity of recent builds has been in the direction of simplification of the program modules.

One clear way to reduce control complexity is to reduce the size of each program module. The best way to reduce the size of program modules is to break big modules into two or more smaller modules. This fragmentation, however, simply shifts complexity from one domain, Control, to another, Semaphore, to cope with the increasing coupling complexity among the modules.

Recent builds of the software have resulted in some substantial functional changes in the real-time control complexity of the modules. The activity in and around build 0 shows material changes in the software in the systems immediately before and immediately after this build. There were substantial changes being made to PASS during this time. This system change activity may also be seen from a more global perspective in Exhibit 3.

Just as was the case for the analysis of the system FI, it is possible to compute the incremental changes, or deltas, in each of the domain metrics for each of the builds. This will effectively eliminate the trends in the metrics and permit a distinct focus on the precise nature of each of the changes to each of the builds. The deltas for the domain metrics are shown in Exhibit 20. From this new focus, some interesting observations can be made. We can see, for example, a relatively large fluctuation in the Control complexity domain on builds -2, -1, 0, and 1. This oscillation in the control domain metrics is typical of attempts to introduce major changes in a system. First, there is the simplification attempt, build -2; then there is the rebound, build -1, indicating that some desirable features have been removed, followed by yet another reduction in control complexity, and finally a gradual dampening in the oscillation. A very similar pattern is revealed in Domain 3 over builds -2 through 2.

Exhibit 20: Deltas in Domain Metrics across Builds



