2.4 Measurement Standards

When setting out to measure something, we must have some tool with which to perform the measurement. If we want to measure the length of an object, then we will seek out a ruler or a tape measure. The tool we use for the measurement of length is a copy of a standard measure of length, an instrument maintained by a bureau of standards somewhere. The whole purpose of the standard unit of measure is that all of us who have copies of this standard can measure the length attribute of items in our environment. We can then communicate with each other about our measurements and have these communications be meaningful. If a colleague were to send us the measurements of his desk in centimeters, we would have an excellent idea of the size of this desk. Further, if he were to measure all of the components of the desk and transmit those measurements to us, then we could build a replica of his desk. If, on the other hand, we maintained our own ideas of what a centimeter is according to our own standard, then our replica desk would be an inaccurate representation of the original. It would certainly be correctly proportioned, but not a faithful reproduction. Science depends on this ability to measure and report our observations according to a standard.

Sometimes, an apparent standard must be reevaluated. We may think that we have a standard defined, only to find that it is fraught with ambiguity. Consider the rule for mapping people into categories of sex. It seems so obvious. However, in recent Olympic Games, it became apparent that there were some very masculine-looking women competing in the women's athletic events. Visual appearances are not enough to separate individuals into two mutually exclusive sets. We find, by examining the chromosomes of the broad spectrum of individuals in our society, that there are males, females, super-females, and super-males. Even more confusing is the fact that Nature sometimes assigns male genitalia to chromosomal females. Although it seems quite obvious at the outset, developing a standard for measuring the sex of humans turns out to be a very complex issue.

Another interesting measurement problem is that of eye color. The color of my eyes may map into any one of several categories, depending on the observer. They are reported on my driver's license as hazel. There is, however, no referent standard for hazel eye color. Hazel, it turns out, is a transition color between blue and green. We are left with the inevitable question of just when not-quite-blue eyes are to be classified as hazel. If it were indeed imperative that the color of my eyes be reported exactly, we could report with some degree of accuracy the average wavelength, in angstroms, of white light reflected from the iris of my eyes.

In the world of the physical sciences, if it becomes necessary to obtain an accurate measurement of the length of a meter, a standard meter is maintained by the National Institute of Standards and Technology. There is no such standards body for the measurement of software. Anyone who wishes can construct a metric analyzer and go into the measurement business. The problem with this unstructured approach is that there are just about as many ways to measure a single program as there are tools. Each measurement derived by each tool can be slightly different. Within a fixed application, this circumstance may not be a problem. This environment, however, is not conducive to the practice of science. When research results based on these multiple tools are reported in the literature, it is seldom, if ever, possible to replicate precisely the results observed by one experimenter.

A ready example of this standards issue can be found in the enumeration of operators and operands in a language such as C. Consider the case in which a call is made to a function foobar that will return a value of float. On the one hand, the call to foobar looks like an operand; we could count it as an operand. On the other hand, foobar represents an action on its own operands (arguments); in this sense we could count foobar as an operator. The truth of the matter is that it is both. When we encounter foobar, we should increase the operand count by one and the operator count by one. Because we do not have standard definitions of just how these and other operands and operators should be enumerated in C, the measurements produced by any C metric analyzer must be regarded as imprecise.
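
A minimal sketch of the ambiguity, using a hypothetical function named foobar, is shown below. The counts in the comment reflect one possible enumeration convention, following the argument above; other analyzers could legitimately count the same statement differently.

    #include <stdio.h>

    /* foobar is a hypothetical function used only for illustration. */
    float foobar(float a, float b)
    {
        return a + b;
    }

    int main(void)
    {
        float x = 1.0f, y = 2.0f, z;

        /* One possible enumeration of the statement below, following the
         * argument above: foobar is counted once as an operand (it yields a
         * float value) and once as an operator (it acts on x and y).
         *
         *   operands:  z, foobar, x, y
         *   operators: =, foobar
         *
         * Another tool might also count the call parentheses or the statement
         * terminator as operators, which is precisely the standards problem. */
        z = foobar(x, y);
        printf("%f\n", z);
        return 0;
    }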

Another problem in the enumeration of Halstead's unique operator count, η1, in the analysis of C language programs can be found in the treatment of the "+" operator. [1] Generally, when this token is found for the first time, the unique operator count is increased by one and subsequent occurrences of "+" are ignored. The C language, however, overloads this token. The single operator "+" can be used in several different semantic contexts. It can have integer operands, in which case it represents integer addition. It can have real operands, in which case it is a floating-point addition operation. It can take two operands as a binary operator or one operand as a unary operator. Each semantically different context for the "+" operator must be enumerated separately. The failure to account for operator overloading will obscure dramatic differences between program modules. From a statistical perspective, the variance in η1 among program modules will be artificially small. Modules may appear to be very similar when, in fact, they are very different.
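
The fragment below is a small, hypothetical illustration of the point: the same "+" token appears in three semantically distinct contexts, each of which arguably deserves its own entry in the unique operator count η1.

    #include <stdio.h>

    int main(void)
    {
        int    i = 2, j = 3;
        double u = 2.0, v = 3.0;

        int    k = i + j;   /* "+" as integer addition        */
        double w = u + v;   /* "+" as floating-point addition */
        int    m = +i;      /* "+" as the unary plus operator */

        /* A naive analyzer adds 1 to the unique operator count the first time
         * it sees "+" and ignores the rest.  If each semantic context is
         * enumerated separately, this fragment contributes three distinct
         * operators, not one, and the variance of the unique operator count
         * across modules grows accordingly. */
        printf("%d %f %d\n", k, w, m);
        return 0;
    }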

Yet another measurement problem is created by differences in programming style. Just as operators can be overloaded, operands can also be overloaded. In the good old days, FORTRAN programmers were very concerned about the space that their programs would occupy. There was great incentive to define a minimum of identifiers and use them for everything. A single integer, I, might be used as a loop counter, then as a variable in which a summation is computed, and then again as a loop counter. This single integer has been overloaded; it has served many semantically different functions. On the other hand, we can conceive of a programmer trained in a COBOL environment. In this environment, all program variables tend to be tightly controlled as to function. A single variable will always be used in the same context. In this case, there will be little or no operand overloading. If these two programmers set about to write the identical program in a common language such as C, it is possible that their programs would differ only in their operand counts. Thus, the net effect of operand overloading is that an external source of variation is introduced into the measurement process. This is variation due to differences in the programmers, not to differences intrinsic to the problem being solved. If we do not control for the effect of these differences among programmers in our efforts to model with metrics, then this operand overloading problem will be present in the form of pure noise in the model.
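
As a hypothetical illustration of the two styles, the two functions below do the same work. The first reuses the identifiers i and n for semantically different jobs; the second gives every role its own identifier. The two versions differ chiefly in their unique operand counts, not in the problem being solved.

    #include <stdio.h>

    /* Overloaded style: i serves as a loop counter and then as an accumulator;
     * n serves as an array bound and then as a countdown index. */
    static void overloaded_style(const int *a, int n)
    {
        int i;

        for (i = 0; i < n; i++)      /* i as a loop counter  */
            printf("%d ", a[i]);

        i = 0;                       /* i reused as a sum    */
        while (n-- > 0)              /* n reused as an index */
            i += a[n];
        printf("sum = %d\n", i);
    }

    /* Single-purpose style: each identifier is used in exactly one role. */
    static void single_purpose_style(const int *values, int count)
    {
        int index;
        int sum = 0;

        for (index = 0; index < count; index++)
            printf("%d ", values[index]);

        for (index = 0; index < count; index++)
            sum += values[index];
        printf("sum = %d\n", sum);
    }

    int main(void)
    {
        int data[] = { 1, 2, 3, 4 };

        overloaded_style(data, 4);
        single_purpose_style(data, 4);
        return 0;
    }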

2.4.1 Sources of Noise in the Measurement Data

When we wish to quantify a particular attribute for a member of a population, we, for the most part, never really know the true value of this attribute. Only Nature knows the true value of the attribute. The very best that a metrician can provide is an estimate of this value. Nature will have taken the true value and added in additional information in the form of noise. Suppose, for example, that we wish to know the height of Mabel, a person on our software development staff. The more we study this simple problem, the more improbable the measurement task seems. First of all, Mabel is a living organism. The physical composition of her body is changing continually. This means that her height is really changing continually as well. If and when we actually get a measurement on this attribute, it will represent the height attribute in Mabel only at an instant in time. Unfortunately, Nature is also acting to disguise Mabel's height. Mabel's height will vary as a function of the curvature of her spine, among other things. The curvature of Mabel's spine may well have to do with her attitude. Today, Mabel may be depressed. Her shoulders sag. She stands slumped over. This may cause us to measure her to be at least one centimeter shorter than if she were feeling chipper. Nature knows that Mabel is depressed. Nature does not share this information with us. We can never really know Mabel's real height because of this lack of candor on the part of Nature. The very best we can do is to get a viable estimate of this height. We should be smart enough to ask the right question, which is, "What kind of accuracy do we need for our estimate of Mabel's height?"

The information that Nature withholds or adds to our measurement is called noise. Sometimes, the noise component will be sufficiently large as to obscure the attribute that we really wish to measure. It is not difficult to produce an example of this degree of noise contribution. Suppose we want to measure the time it takes for a developer to implement a given design element in a predetermined programming language. To get this measurement, most development organizations simply start the clock running when the developer is tasked with the assignment and stop the clock when he or she has completed the task. The difference between the two clock values is taken to be the development time. The actual development time will be a small fraction of the time reported by the developer. Let us partition the time that actually elapsed into realistic components.

  • 40 percent surfing the Web

  • 10 percent reading and responding to personal e-mail

  • 20 percent communicating with peers in chat rooms on the Web

  • 5 percent restroom breaks

  • 2 percent discussion of the latest Dilbert comic strip with colleagues

  • 5 percent informal inter- and intracube conversation

  • 5 percent design review meetings

  • 2 percent staff meetings

  • 11 percent coding

In this case, 82 percent of the reported development time for this one person is nonproductive time, or noise; the signal, the project-related effort of coding, design reviews, and staff meetings, accounts for only 18 percent. What is even worse is that the distribution of activity for this person may well vary from this base level. Given a personal crisis in his or her life, the actual coding time may well fall below 11 percent.
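
A back-of-the-envelope calculation recovers the 82 percent figure. Here only coding, design reviews, and staff meetings are treated as project-related signal, which is itself an assumption made purely for illustration.

    #include <stdio.h>

    int main(void)
    {
        /* Percentages from the activity breakdown above. */
        double signal = 11.0 + 5.0 + 2.0;                      /* coding, design reviews, staff meetings */
        double noise  = 40.0 + 10.0 + 20.0 + 5.0 + 2.0 + 5.0;  /* everything else                        */

        printf("signal: %.0f%%  noise: %.0f%%\n", signal, noise);
        return 0;
    }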

In this not-so-extreme example, we think we are measuring programmer productivity. What we are actually measuring is the recreational use of computer and office facilities by the programmer. If we really want a good estimate of programmer productivity, we will first have to devise a good way of measuring it. It should be quite clear that simply looking at the programmer's reported hours is a very unsatisfactory measure of productivity. These data are almost pure noise. Conclusions based on those noisy data will be equally worthless. We must first devise a means of extracting the signal component from the noise component. Alternatively, we might elect to constrain the noise component.

First, consider extracting the actual number of hours that a programmer is on task. The objective here is to extract the signal component from the noise through the use of some type of filter. Perhaps the simplest filter would be to install a closed-circuit television camera and measure — unobtrusively — exactly what the programmer was doing during the time he was actually occupying his office. We could then monitor the activity of this person and, over time, develop a measure of central tendency (such as the average) and an assessment of the variability of this estimate of the time spent in programming activity.
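
A sketch of what such a filter might feed is shown below, under the assumption that the observation yields a daily on-task fraction for the programmer; the data values are invented for illustration, and the mean and sample standard deviation serve as the measure of central tendency and of variability.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Hypothetical fraction of each observed day spent actually
         * programming, as extracted by the observation filter. */
        double on_task[] = { 0.12, 0.09, 0.15, 0.11, 0.08, 0.14, 0.10 };
        int    n = (int)(sizeof on_task / sizeof on_task[0]);
        double sum = 0.0, sumsq = 0.0;

        for (int i = 0; i < n; i++) {
            sum   += on_task[i];
            sumsq += on_task[i] * on_task[i];
        }

        double mean     = sum / n;
        double variance = (sumsq - n * mean * mean) / (n - 1);  /* sample variance */

        printf("mean on-task fraction : %.3f\n", mean);
        printf("standard deviation    : %.3f\n", sqrt(variance));
        return 0;
    }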

The second alternative is to move to constrain nonprogramming activity on the part of the programmer. The first step would be to ban all Internet Web access and severely restrict the use of the company intranet. The second step would be to ban all chat room activity. The third step would be to restrict e-mail to the local intranet and then only for company business. The fourth step would be to provide close supervision to ensure that conversations with colleagues were restricted to programming activity. These are very draconian measures. They would, however, dramatically increase the signal component in the reporting of employee programming effort.

Many companies collect enormous amounts of "metrics" data. These data are very similar in content to the programmer productivity example above. They are essentially of no value because the metrics are an end in and of themselves. There is astonishingly little interest in how the data are collected, what the possible source of noise in the data might be, or how reliable the data collection process might be. The numbers appear to speak for themselves. Unless some real provisions are made prior to the collection of the data for the control of extraneous sources of variation, it will be reasonable to assume that the data are essentially worthless. Management decisions based on these data will be misguided and equally worthless.

2.4.2 Reproducibility of Measurements

The essence of scientific reporting is the principle of reproducibility of experimental results. Results of experiments must be reported precisely and accurately. The defining standard for this level of reporting is that another investigator should be able to repeat the study and get absolutely the same result that was reported. There is a real paucity of such reporting standards in the computer literature. For example, the lines of code (LOC) metric is commonly used in the literature as a measure of program size. We might read an article about an experimental system consisting of 100 KLOC of C code. When we begin to analyze this result, some questions come to mind immediately. First, the prefix "K" could mean 1000 LOC. We very frequently use this same prefix "K" to mean 1024, or 2^10, units. Did the author of the study count 100,000 LOC or did he count 102,400 LOC? Now we turn our attention to exactly how the LOC was enumerated. In the UNIX environment, a logical record (line) is a string delimited by a newline character. To enumerate LOC, did the author simply count newlines in a file containing all of the C code? What about blank lines or comment lines? We could easily opt to report only the actual lines of C code, not including blank lines or comment lines. The main point is that reporting that we examined a C software system consisting of 100 KLOC gives the reader little reliable information on the size of the system to which we are referring.
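
A crude counting sketch makes the point concrete. The small program below reports three different LOC figures for the same C source file, depending on whether blank lines and heuristically detected comment lines are excluded; the heuristic deliberately ignores the interiors of multi-line block comments, which is itself one more way real tools disagree.

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* Crude LOC counter for a C source file.  A "line" is anything terminated
     * by a newline; a line is "blank" if it holds only whitespace, and it is
     * treated as a "comment line" if its first non-blank characters open a
     * comment.  This simple heuristic is enough to show how different counting
     * rules yield different LOC figures for the same file. */
    int main(int argc, char *argv[])
    {
        FILE *fp;
        char line[4096];
        long total = 0, blank = 0, comment = 0;

        if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
            fprintf(stderr, "usage: loc <file.c>\n");
            return 1;
        }
        while (fgets(line, sizeof line, fp) != NULL) {
            const char *p = line;
            total++;
            while (isspace((unsigned char)*p))
                p++;
            if (*p == '\0')
                blank++;
            else if (strncmp(p, "//", 2) == 0 || strncmp(p, "/*", 2) == 0)
                comment++;
        }
        fclose(fp);

        printf("physical lines               : %ld\n", total);
        printf("non-blank lines              : %ld\n", total - blank);
        printf("non-blank, non-comment lines : %ld\n", total - blank - comment);
        return 0;
    }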

Not to belabor the point, but the term "bug" is commonly used in reporting problems encountered in the development of a system. For example, an author might report that 1000 bugs were found and removed in the development of a system. The problem with this statement is that we do not know what a bug means to him. One person's bug is another's lunch. To an Australian aborigine, a grub (bug) is a very tasty treat and an important source of protein. In an abstract sense, bugs found in a program are neither delicious nor nutritious. There is no formal or standard definition of a bug. A bug could be a problem in a requirements specification, a problem in the code, or a very elusive code failure at runtime. We do not know what a bug meant to the author who reported finding 1000 of them. Therefore, we have learned nothing useful from a scientific standpoint in reading about a system that contains 1000 undefined "bugs."

Equally loose standards apply in the personnel area as well. We might read in the literature that a particular system took 10,000 staff hours to develop. We learn even less from this statement. Some basic questions come to mind immediately, to wit:

  • Did the 10,000 hours include only programming staff hours?

  • Was the requirements specification factored into the 10,000 hours?

  • Was the administrative burden factored into the hours, or are we looking at just those staff hours of people actually doing the work?

  • What level of granularity of measurement was involved? Did an employee's time get aggregated at a 40-hour workweek, or did the employee actually track the exact time spent working on the project?

The main point here is that learning that a project took 10,000 staff hours to complete provides very little or no useful data about any similar work that we might be performing. We simply do not know what a staff hour means to the author of the study, nor are we typically informed of this in the literature.

Much of the basic work in software reliability is of no scientific value whatsoever. There are two fundamental reasons that this literature is flawed. First, the basic unit of interest to the software reliability community is the failure of a system. Unfortunately, we are never told just what a failure is. There is no standard for reporting failure events. In most cases, a failure is simply the catastrophic collapse of a functioning software system. This is not necessarily a good basis for the concept of a failure. A system could easily return an erroneous result and keep executing, quite happily, producing consistently bad data. There is simply no standard definition of the software failure event. Therefore, when someone reports a failure in the literature, we do not know what this means to him. Second, much of the literature is based on a measure of time between failure events. There are two problems here. First, if we cannot agree on what a failure is in the first place, then it will be somewhat difficult to recognize when one occurs. Second, in very complex modern software systems, a failure can occur many days before it has destroyed enough of the user's files or transactions to be made manifest and counted. Thus, the measure of "time between failures" really has little or no scientific validity.

[1] Halstead, M.H., Elements of Software Science, Elsevier, New York, 1977.


