Chapter 1: The Goals of Software Engineering Measurement

  • Everybody talks about the weather but nobody does anything about it.

  • Everybody talks about bad software but nobody does anything about it.

1.1 Software Engineering Measurement

In the engineering of any hardware system, the term "best engineering practice" is repeatedly applied to all aspects of the development of the system, whether it be a bridge, an automobile, or a large building. Best engineering practice embodies a long tradition of experimentation, analysis, and measurement. It is the application of scientific principles to the solution of a complex hardware design and development problem. At the core of best engineering practice are measurement and the empirical validation of design outcomes. That is the goal of this book: to develop a measurement methodology for the abstract systems that make up the world of computer science and engineering, and to use these measurements to build a measurement-based best software engineering practice.

1.1.1 The Measurement Process

The field of software engineering has evolved quite recently from the very young discipline of computer science. To distinguish between these two disciplines it will be useful to observe the relationship between the disciplines of physics and mechanical engineering. Physics seeks to develop a theoretical foundation for the laws of Nature as we are capable of observing them. Mechanical engineering is the discipline of applied physics. The basic principles of physics are applied to solve real problems in the physical world. Computer science embodies the theoretical foundations for computers and computation. The field of software engineering is, then, the applied discipline of computer science.

Empirical validation is the basis for all science. All theory is validated through the conduct of experiments. Numerical results are derived from experiments through a rigorous measurement process. For an experiment to be meaningful, it must be reproducible. That is the difference between science and religion. Miracles are not reproducible. You are either privy to the execution of one or you are not. It is kind of a one-shot thing. The standard for science is much more demanding. The numerical outcomes of a scientific experiment are reported as numerical values in units of measure shared by a common scientific community. This is possible because there are standards for the measurement units. So valuable are these standards that they are husbanded by national governments and international standards bodies. If you need to know, for example, exactly what one meter is to ten decimal places, the National Institute of Standards and Technology (NIST) will tell you. NIST will also share this value with other researchers. In this way, the measurements taken by one group of scientists can be very accurately reported to any other. This is the essence of science.

It is interesting to note that NIST does not maintain any standards for software measurement; nor, for that matter, does any other national standards organization. This makes it difficult, if not impossible, to share experimental results among the community of scholars in computer science who wish to engage in the practice of science. We could, for example, measure the kernel of the Linux operating system, which is written in the C programming language. We might choose to enumerate the number of statements in the source code that makes up this system. In the absence of a standard for counting these statements, this turns out to be a very difficult and ambiguous task. Even if we were successful in this counting exercise, there is no way that our results could be shared with our colleagues throughout the world. Each of these colleagues could enumerate the statements in the same Linux kernel and arrive at a very different number. The principal thrust of this book, then, is to lay the basis for measurement standards so that we can begin to share our experimental results in a meaningful way.
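To see just how ambiguous this enumeration can be, consider the small sketch below. It applies two plausible counting rules to the same fragment of C and arrives at two different statement counts. Both rules are hypothetical illustrations invented for this example; neither is drawn from any published standard.

# A minimal sketch of why statement counts diverge without a standard.
# Both counting rules below are hypothetical illustrations, not any
# published or de facto standard for C.

c_fragment = """
int total = 0;                    /* declaration with initializer */
for (i = 0; i < n; i++)
    total += a[i];
if (total > limit) { total = limit; }
"""

def count_semicolons(src: str) -> int:
    # Rule A: every semicolon terminates a "statement"; the for-header
    # alone contributes two semicolons.
    return src.count(";")

def count_logical_lines(src: str) -> int:
    # Rule B: every non-blank line that does not begin with a comment
    # is a "statement".
    lines = [ln.strip() for ln in src.splitlines()]
    return sum(1 for ln in lines if ln and not ln.startswith("/*"))

print("Rule A (semicolons):   ", count_semicolons(c_fragment))
print("Rule B (logical lines):", count_logical_lines(c_fragment))

On this tiny fragment the two rules disagree by one statement; across the millions of lines in the Linux kernel, the divergence between two laboratories using different private counting rules would be enormous.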

The next thing that will be of interest in the measurement process is that the attributes we are measuring should have some validity in terms of a set of real-world software attributes. In the early days of psychology, there emerged a branch of inquiry called phrenology. In this discipline, the structures of human subjects' skulls were carefully measured. Certain values of particular structures were thought to be related to underlying psychological constructs such as intelligence. That is, one could deduce the intelligence of a subject by measuring the structure of his head. These measurements were standardized and were quite reproducible. One experimenter could measure a subject and, with a fair degree of accuracy, obtain the same attribute values as another experimenter. The problem with the science of phrenology is that the measures obtained from the skull did not correlate well with the attributes, such as intelligence, that they were thought to reflect. They lacked validity.

As we learn to measure software for scientific purposes, we will thus be concerned with two fundamental measurement principles:

  1. The measurements must be reproducible. That is, there must be a standard shared by the software community.

  2. The attributes being measured must be valid. They must yield true insights into the fundamental structure of the object of measurement.

1.1.2 We Tried Measurement and It Did Not Work

In our many years of working in software measurement, we have frequently interviewed software development organizations on their various attempts at instituting software metrics programs. Most will say that they have tried measurement and it did not work. Now, this is all very interesting. If you were to meet someone who was holding a ruler and this person said, "I've tried this ruler and it didn't work," you probably would not think to criticize the ruler. Rulers are passive, inanimate objects. If a ruler is not working correctly, it will most certainly not be the fault of the ruler. We would immediately turn our attention to the person holding the ruler who is unable to make it work. The ruler did not work properly because it was not being used properly. The same is true for software measurement techniques. If they are not working (yielding results) in a software development organization, it is probably not the fault of the metrics themselves.

It is possible to measure all aspects of software development. It just takes training. That is what this book is about: to show what there is to measure and how to go about doing this measurement. However, we will not be able to use measurement tools with which we are familiar. A ruler, for example, will not work for most software applications. Because software and software processes are abstract, we will require new tools and new ways of thinking to measure these abstractions. Fortunately, much of this ground has been plowed for us in the social sciences. Practitioners and scientists who function in the social science world also deal with abstractions. Consider the notion of a human intelligence quotient, for example. This is clearly an abstract attribute of all functioning human beings. For a large part of our efforts in measurement, we will draw from the experience of psychologists and sociologists. After all, programmers (as psychological entities) are working in teams (as sociological entities) to produce abstract objects called programs.

1.1.3 Problems in Software Measurement

Whenever we attempt to measure something, we really get more than we bargained for. In addition to the true value of the attribute being measured, Nature provides additional information in the form of noise. Quite simply, we wish to exercise control over the amount of noise that Nature is providing. In a practical sense, we cannot realistically expect to eliminate all noise from this process; it would be too expensive. Instead, we will set tolerances on this noise and operate within the practical limitations of these tolerances. If we wish to measure the size of a book to see whether it will fit in our bookcase, we would probably be satisfied with a ruler that lets us measure the size of the book to the nearest half-centimeter. A very cheap ruler would be most satisfactory for this purpose. If, on the other hand, we wanted to measure the diameter of an engine crankshaft to fit a new crankshaft bearing, we would probably want an instrument that could measure to the nearest 1/100 of a millimeter. Again, we must accept the proposition that we can never know the true value of the attribute we wish to measure. Our focus should be on how close to the true value we really need to come in our measurement. We will pay a price for accuracy. We could buy a scale sufficiently accurate to measure our weight changes at the molecular scale, but the cost of this knowledge would be astonishing: the scale would be very expensive. Realistically, if we merely want to establish whether we are losing or gaining weight on our new diet, a scale that is accurate to ±10 grams would be more than adequate for our purposes. We could also buy this scale with out-of-pocket cash instead of having to mortgage our house and our future to know our weight at the molecular level. Understanding the economics of this measurement noise is the first problem we will have to solve in our software measurement processes.

We can only control noise in the measurement process if we understand the possible sources of this noise. In some cases it will be relatively simple to establish what the possible sources of variation in measurement might be. If we buy a cheap ruler for measuring distances, the printed marks on the ruler will be fat and fuzzy. The width of the printed marks, the variation in the thickness of the marks, and the distance between the marks will all contribute to measurement noise. If our ruler has marks at 1-millimeter intervals, then the best we can hope for is an accuracy of ±0.5 millimeter. If the ruler is made of metal, its length will vary as a function of temperature. If the ruler is made of cloth, then its length will vary directly with the amount of tension applied to both ends during the measurement process.
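The interplay of these noise sources can be made concrete with a short simulation, given below as a minimal sketch. The true length, the reading noise, and the ruler resolution are all invented values for illustration; the point is simply that every observed reading is the unknowable true value plus noise, bounded by the tolerance of the instrument.

# A minimal sketch of measurement noise, assuming a normally distributed
# reading error; every numeric value here is invented for illustration.
import random

TRUE_LENGTH_MM = 212.4        # the value Nature knows but we never will
RULER_RESOLUTION_MM = 1.0     # marks at 1-millimeter intervals
NOISE_SD_MM = 0.3             # reading error from the fat, fuzzy marks

def measure_once() -> float:
    # Add reading noise, then round to the nearest ruler mark:
    # the quantization alone limits us to roughly +/- 0.5 mm.
    reading = random.gauss(TRUE_LENGTH_MM, NOISE_SD_MM)
    return round(reading / RULER_RESOLUTION_MM) * RULER_RESOLUTION_MM

measurements = [measure_once() for _ in range(20)]
errors = [m - TRUE_LENGTH_MM for m in measurements]
print("observed readings:", sorted(set(measurements)))
print("largest error:    ", max(abs(e) for e in errors))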

One of the greatest problems of measurement in software engineering is the abysmal lack of standards for anything we wish to measure. There are no standards for enumerating statements of C code. There are no standards for measuring programmer productivity. We have no clear concept of how to measure operating system overhead. NIST is not motivated to establish standards for software measurement. We are on our own. This is perhaps the greatest obstacle that we will have to overcome in software engineering measurement. We have no basis for communicating the results of our scientific investigations because everyone is measuring the outcomes of those investigations with different standards and measurement tools.

1.1.4 The Logistics of Software Measurement

We have recently witnessed the efforts of a number of large software development organizations to begin measurement programs for their software development. They have all come up against the real problem of measurement. The most unexpected problem in measurement is dealing with the volume of data generated by a typical measurement application. The focus within these organizations is on the measurement tools, as if the tools in and of themselves will provide the solution to measurement efforts. This is rather akin to the idea that the tools make the craftsman; that is, one can become a good carpenter if one simply has the tools of the trade. The truth is that a good carpenter can produce marvelous carpentry even with the most primitive tools. Training to become a carpenter is a long process. Training to become a mechanical engineer is also a long process. One cannot hope to buy some software measurement tools and become an expert in software measurement. The logistical and statistical issues that surround the measurement process should be the central focus, not the tools.

Consider a system of 100 KLOC (thousand lines of code) with approximately 200 lines per program module. This system will have approximately 500 program modules. If we measure each module on 20 distinct attributes, then we will have a 500 × 20 data matrix containing 10,000 data points for one measurement effort. Now, if we consider that most systems in development change materially each week, they will require new measurement every week. The data bandwidth from the measurement process for one software development project is therefore substantial. This flood of new data from the measurement process is one of the major problems encountered by software metricians. It is generally the reason that most measurement efforts fail: they simply produce too much data and not enough information.
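The arithmetic of this data flow is easy to sketch. The figures below are simply the illustrative numbers from the paragraph above, not data from any real project.

# A back-of-the-envelope sketch of the measurement data volume described
# above; the system size, module size, attribute count, and build cadence
# are the illustrative figures from the text, not data from a real project.

SYSTEM_SIZE_LOC = 100_000      # 100 KLOC
LINES_PER_MODULE = 200
ATTRIBUTES_PER_MODULE = 20
BUILDS_PER_YEAR = 52           # one materially changed build per week

modules = SYSTEM_SIZE_LOC // LINES_PER_MODULE
data_points_per_build = modules * ATTRIBUTES_PER_MODULE
data_points_per_year = data_points_per_build * BUILDS_PER_YEAR

print(f"program modules:       {modules}")
print(f"data points per build: {data_points_per_build}")
print(f"data points per year:  {data_points_per_year}")

A year of weekly measurement on this one modest system already yields more than half a million data points, before we have measured people, processes, or faults at all.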

Just having taken measurements on a system does not constitute a measurement process. It merely generates data. These data must somehow be converted into information. This is the role of statistics. The use of statistics permits us to understand just what these data are telling us about our processes. There is simply no way that we can examine hundreds of thousands of raw measurements taken on all aspects of the software development process and reasonably hope to understand or make sense of this mass of data.
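As a small illustration of this data-to-information step, the sketch below reduces a module-by-attribute measurement matrix to a mean and a standard deviation for each attribute. The matrix is filled with randomly generated stand-in values, not real measurements; the point is that 10,000 raw numbers collapse into 20 pairs of summary statistics that a human being can actually read.

# A minimal sketch of the data-to-information step: collapse a raw
# module-by-attribute measurement matrix into descriptive statistics.
# The matrix is randomly generated stand-in data, not real measurements.
import random
import statistics

random.seed(1)
N_MODULES, N_ATTRIBUTES = 500, 20
matrix = [[random.gauss(50, 10) for _ in range(N_ATTRIBUTES)]
          for _ in range(N_MODULES)]

# 10,000 raw numbers reduce to two summary numbers per attribute.
for j in range(N_ATTRIBUTES):
    column = [row[j] for row in matrix]
    print(f"attribute {j:2d}: mean = {statistics.mean(column):6.2f}, "
          f"sd = {statistics.stdev(column):5.2f}")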


