Chapter 4: Validation of Software Measures | Software Engineering Measurement

4.1 Understanding What is Being Measured

It is very easy to think of attributes of computer software products or processes that can be measured. It is equally easy to identify properties of people and programming environments that can be measured. Measurement in and of itself is not the real problem. The real problem is identifying meaningful attributes to measure and then finding measurement processes that produce reliable and reproducible assessments of these attributes. One attribute, for example, that we could measure for each of our software developers is height. Clearly, every one of our developers occupies some physical space in this universe, so we should be able to measure the height of these people handily. Indeed, there are plenty of tools at hand that will provide satisfactory measurement data once we have decided what level of accuracy is required. If we only require accuracy to ±1 centimeter, we could easily acquire a suitable measurement tool, such as a tape measure from the local hardware store, to do the job for us.

We have now identified an attribute of our developers, height, that we can measure. We have also identified a tool that will provide suitable measurements of that attribute. It is now possible to collect data on all of our developers and save these data for posterity. Unfortunately, nothing will have been gained by this measurement process. A developer's height is probably not related to that developer's programming skills in any way. Knowledge of a developer's height will not give us insight into any of the developer's skills and abilities with regard to programming. Height is not a valid measure of developer ability, productivity, or skill. We will insist that all measurements we collect about people, processes, products, and environments have validity with regard to one or more criterion measures, that is, the attributes that we actually wish to understand.
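One common way to put a number on this kind of validity, borrowed from psychological testing, is to correlate the candidate measure with a criterion measure. The sketch below is illustrative only: the function name `pearson_r`, the choice of the Pearson coefficient, and the data are all assumptions that do not come from the text. The heights and skill scores are contrived so that the correlation is exactly zero; they are not real measurements.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between a candidate measure and a criterion measure."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of squares and cross-products of deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sample has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical, contrived data: developer heights in centimeters and a
# made-up criterion score for programming skill (1 = poor, 5 = excellent).
heights_cm = [160, 170, 180, 190]
skill_scores = [3, 5, 5, 3]

print(pearson_r(heights_cm, skill_scores))  # 0.0 for this contrived data
```

A coefficient near zero, as in this contrived case, is what we would expect of a measure that carries no information about the criterion. A coefficient near +1 or -1 would suggest that the measure is worth a closer look.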

Yet another dimension to our problem relates to the mechanism of collecting the data. It is possible to identify an attribute that has face validity yet offers no reasonable means of quantifying its values. An example of such an attribute would be programmer aptitude. Some programmers write really good code while others cannot code their way out of a bushel basket. Anyone who has ever worked with programmers can attest to this fact. This, then, is a reasonable attribute for us to want to know. We have only to be able to measure it. Different people have different notions of what good code is, so we will attempt to control for these differences by having multiple judges rate the ability of each programmer. By doing this we hope to reduce the effect of the individual biases of the judges. In essence, we want a reliable assessment of the programming ability of each programmer. It is clear that we can ask each judge to assign a number (e.g., 1 through 5) to each programmer. A good programmer will receive a value of 5. A bad programmer will receive a value of 1. If, for every programmer, all of the judges assign the same value, then the rating scheme is reliable. If, on the other hand, there is substantial variation among the judges as to a given programmer's ability, then the ratings that we receive will be unreliable. ^[1]
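The degree of agreement among the judges can be summarized in a single number. The sketch below uses a one-way analysis-of-variance intraclass correlation, which is one standard estimate of the reliability of a single judge's rating. It is offered as an assumption about how the idea might be made concrete, not as the procedure the text prescribes. The function name `intraclass_correlation` and the rating tables are hypothetical.

```python
def intraclass_correlation(ratings):
    """One-way ANOVA intraclass correlation for a table of ratings.

    ratings[i][j] is the score that judge j gave to programmer i.
    Returns 1.0 when the judges agree exactly on every programmer.
    """
    n = len(ratings)                      # number of programmers
    k = len(ratings[0]) if n else 0       # number of judges
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table: >= 2 programmers, >= 2 judges")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Variation between programmers (the signal we hope to detect).
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    # Variation among judges rating the same programmer (disagreement).
    ss_within = sum((score - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for score in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("all ratings are identical; reliability is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical ratings on the 1-to-5 scale: three programmers, three judges.
judges_agree = [[5, 5, 5], [1, 1, 1], [3, 3, 3]]
judges_disagree = [[5, 1, 3], [1, 3, 5], [3, 5, 1]]

print(intraclass_correlation(judges_agree))     # 1.0
print(intraclass_correlation(judges_disagree))  # -0.5
```

In the first table every judge assigns the same value to each programmer, so the coefficient is 1. In the second table the judges contradict one another so thoroughly that every programmer has the same average rating, and the coefficient falls below zero, signaling that the ratings tell us nothing dependable about the programmers.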

There are two criteria, then, that our measurements must satisfy. First, the attributes being measured must contribute to our understanding of the criterion attributes. The attributes being measured must be valid. Second, the measurement data must be reproducible. Different judges looking at the same attribute must assess it in the same manner. The measurements must be reliable. To assist our understanding of the validity and reliability of measurements, we will borrow heavily from the psychological testing discipline.

^[1] Ebel, R.L., Estimation of the Reliability of Ratings, Psychometrika, 16(4), 407-424, December 1951.