Chapter 5: Static Software Measurement

5.1 Introduction

We can measure a software system in exactly two ways. We can measure static software attributes of the requirements, the design, and the code; we can also measure the system as it is running. We will learn very different things from each of these types of measurement. Static attributes will tell us much about the structure of a system, how it is built, and where the problem areas in the code base might be. We will use our static software observations to learn about the quality of the system. Dynamic measurement, on the other hand, will allow us to observe a system while it is running and draw useful conclusions about the efficacy of the test process and about the reliability, availability, and robustness of the system. We can also predict how a typical user might employ the system and validate this prediction in practice.

The key to software measurement is in understanding precisely what each of our measurements is telling us about the quality, reliability, security, or availability of the software we are measuring. It is possible to identify a vast number of properties of a program that can be measured. [1], [2] The real science is to determine which attributes are meaningful in terms of software development. We could easily measure the height of all entering freshmen at a university. We could also weigh them. We could count the number of bumps on their skulls. And we could store all these data in a database. But we would find very little or no information in these measures. What we are trying to ascertain from the vitals on the entering freshmen is who will succeed academically and who will choose to tank. Height will not be a good predictor of academic success, nor is the number of skull bumps likely to tell us much about a person's academic future.

We have learned, in the software development arena, to expect miracles. If we were just to find the right CASE tools, the path to complete harmony and fine software would instantly emerge. If we were but to institute the new XYZ process, conveniently packaged and distributed by the XYZ Corporation, we would then produce fine software. In short, we have been told that we can buy the right tools, implement a new (unproven) methodology, and all will be well. Oddly enough, no one has ever had this faith reinforced by an actual miracle, yet the faith persists. There seems to be a persistent expectation that miracles will happen if we buy a few metrics tools. But, again, no miracles are forthcoming.

Doing science is very hard work. At the foundation of science is a core of measurement and observation. We uncover new truths through a rigorous process of experimentation and investigation. In fields such as medicine and mechanical engineering, we invest heavily and happily in measurement in the name of science. There are no miracles in true science, only a continuing process of measurement and careful observation. If we are to further our understanding of the software development and evolution processes, then we, too, must stop looking for miracles and start doing some hard measurement and experimentation. We will begin that process by trying to find out what it is that we should be measuring.

Of the vast panoply of metrics available to the software developer, most lack the content validity needed to be useful in the measurement process; they are of limited utility at best. They purport to disclose certain properties of the software but lack the fundamental experimental research behind them to validate the claims made about them. A very good example of this is the cyclomatic complexity number V(g) of a program module. The metric is calculated from the relationship V(g) = Edges - Nodes + 2, where Nodes and Edges are derived from a flowgraph representation of the program module. [3] Cyclomatic complexity is supposed to be a measure of the control complexity of a program module. As will be demonstrated later, when this metric is studied in conjunction with measures of lines of code and statement count, it is usually highly correlated with these measures of module size. In essence, V(g) is a measure of size. If, on the other hand, we look at the metric primitives Nodes and Edges, we find that they measure something else altogether. [4] These two metric primitives do, in fact, measure control flow complexity.
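To make the relationship concrete, the following sketch computes V(g) from the node and edge counts of a single, connected module flowgraph while retaining Nodes and Edges as separate metric primitives. The function name and the counts are purely illustrative assumptions; a real measurement tool would derive the counts by parsing the module's source code.

#include <stdio.h>

/* Hypothetical record of the flowgraph counts gathered for one module. */
struct flowgraph {
    const char *module_name;  /* name of the C function being measured      */
    int nodes;                /* nodes in the module's flowgraph             */
    int edges;                /* directed control-flow edges in the flowgraph */
};

/* Cyclomatic complexity of a single connected flowgraph:
   V(g) = Edges - Nodes + 2.                                */
static int cyclomatic_complexity(const struct flowgraph *g)
{
    return g->edges - g->nodes + 2;
}

int main(void)
{
    /* Illustrative counts only. */
    struct flowgraph g = { "parse_record", 14, 17 };

    printf("%s: Nodes = %d, Edges = %d, V(g) = %d\n",
           g.module_name, g.nodes, g.edges, cyclomatic_complexity(&g));
    return 0;
}

For this hypothetical module the program reports V(g) = 5, while the primitives Nodes = 14 and Edges = 17 remain available as separate measurements of control flow structure.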

So, our first criterion in selecting metrics is content validity. The metric must measure the attribute that it is supposed to measure. A major objective in the formulation of a software measurement program is to identify a working set of metrics with experimentally (scientifically) derived content validity. [5], [6] A metric will not be useful just because a software engineering expert says it is. It will be useful because it reveals information about the very attributes we wish to understand. Expert opinion is not the same as scientific validity.

The second criterion in the selection of a metric is that it is related to a quality criterion we wish to understand. Again, if we are interested in estimating the fault-proneness of a module, we will carefully select a working set of static metrics that have demonstrated potential in this regard. [7], [8], [9], [10]

The third measurement criterion is that there be a standard for the metric. There are, for example, many published studies in software maintenance that use two apparently simple metrics: lines of code and statement count. For a programming language like C or C++, these metrics can be defined in an astonishing number of different ways. The National Institute of Standards and Technology does not maintain measurement standards for computer software. Thus, when we read a study about a C program consisting of 1500 statements, we really do not have a good idea of just how the value 1500 was obtained. If we are unable to identify a standard for a metric we wish to use, then we must publish our own. The essence of such a standard is reproducibility: another scientist can read our standard, apply it in the measurement of code, and get exactly the same results that we would have gotten had we measured the same code.
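As an illustration of why the published definition matters, the sketch below applies one deliberately crude operational standard: lines of code counted as newline characters and statement count counted as semicolons. Nothing about this definition is an accepted standard; it is offered only to show that a reported value is reproducible precisely because the rule for obtaining it is written down. A production standard would also have to specify how comments, string literals, preprocessor directives, and for(;;) headers are to be treated.

#include <stdio.h>

/* Crude operational definitions, for illustration only:
     "lines of code"   = count of newline characters
     "statement count" = count of ';' characters        */
static void measure(FILE *src, long *loc, long *stmts)
{
    int c;
    *loc = 0;
    *stmts = 0;
    while ((c = getc(src)) != EOF) {
        if (c == '\n') (*loc)++;
        if (c == ';')  (*stmts)++;
    }
}

int main(int argc, char *argv[])
{
    long loc, stmts;
    FILE *src = NULL;

    if (argc != 2 || (src = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: count <file.c>\n");
        return 1;
    }
    measure(src, &loc, &stmts);
    fclose(src);
    printf("lines of code: %ld, statements: %ld\n", loc, stmts);
    return 0;
}

Two researchers who apply this same written rule to the same source file will obtain identical values; two researchers who each apply their own unstated rule very likely will not.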

The fourth measurement criterion is that all measurements be at the same level of granularity. The top level of measurement granularity would be the system level; an example of such a metric is total program module count. At the lowest level of granularity, we might wish to enumerate the number of characters in C statements. Because of our interest in measuring both the static and dynamic properties of software, the most relevant level of granularity for this enterprise is the module level, where a module is a C function or a C++ method.
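To fix ideas about module-level granularity, a measurement tool working at this level would produce one record of metric values for every C function or C++ method in the system. The record below is a hypothetical sketch; the field names are our own and simply suggest the shape of such a record.

#include <stdio.h>

/* Hypothetical per-module measurement record: one such record is
   produced for every C function (or C++ method) in the system.   */
struct module_metrics {
    const char *name;   /* module (function) name                     */
    int loc;            /* lines of code, per our published standard  */
    int statements;     /* statement count, per the same standard     */
    int nodes;          /* flowgraph nodes                            */
    int edges;          /* flowgraph edges                            */
};

int main(void)
{
    /* Illustrative values for a single module. */
    struct module_metrics m = { "parse_record", 42, 23, 14, 17 };

    printf("%s: LOC = %d, statements = %d, V(g) = %d\n",
           m.name, m.loc, m.statements, m.edges - m.nodes + 2);
    return 0;
}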

It is not an objective of this chapter to enumerate all possible static source code metrics. There have been masterful attempts by others to do that. [11] Our main goal is to identify a working set of software metrics that will typically explain more than 90 percent of the variation in the software faults of a system. Each of these metrics has been chosen because it adds unambiguously to our understanding of software faults. The main purpose of this chapter is to show the process by which each software attribute was specified and then to show how it can be unambiguously measured. Once we clearly understand how each attribute is defined and how measures for that attribute can be developed and validated, there are virtually no bounds on the extent to which we can increase our knowledge about the source code, the processes that led to its creation, and its quality when it is placed into service.

[1] Zuse, H., Software Complexity: Measures and Methods, Walter de Gruyter & Co., New York, 1990.

[2] Zuse, H., A Framework of Software Measurement, Walter de Gruyter & Co., Berlin, 1998.

[3] McCabe, T.J., A Complexity Measure, IEEE Transactions on Software Engineering, SE-2, 308-320, 1976.

[4] Munson, J.C., Software Measurement: Problems and Practice, Annals of Software Engineering, 1(1), 255-285, 1995.

[5] Munson, J.C. and Khoshgoftaar, T.M., The Dimensionality of Program Complexity, Proceedings of the 11th Annual International Conference on Software Engineering, IEEE Computer Society Press, Los Alamitos, CA, 1989, pp. 245-253.

[6] Munson, J.C. and Khoshgoftaar, T.M., Regression Modeling of Software Quality: An Empirical Investigation, Journal of Information and Software Technology, 32, 105-114, 1999.

[7] Khoshgoftaar, T.M., Munson, J.C., Bhattacharya, B.B., and Richardson, G.D., Predictive Modeling Techniques of Software Quality from Software Complexity Measures, IEEE Transactions on Software Engineering, SE-18(11), 979-987, November 1992.

[8] Munson, J.C. and Khoshgoftaar, T.M., The Detection of Fault-Prone Programs, IEEE Transactions on Software Engineering, SE-18(5), 423-433, 1992.

[9] Munson, J.C., Software Faults, Software Failures, and Software Reliability Modeling, Information and Software Technology, 687-699, December 1996.

[10] Munson, J.C. and Khoshgoftaar, T.M., The Detection of Fault-Prone Programs, IEEE Transactions on Software Engineering, SE-18(5), 423-433, 1992.

[11] Zuse, H., Software Complexity: Measures and Methods, Walter de Gruyter & Co., New York, 1990.


