4.6 Reliability

Validity is a necessary but not sufficient condition for a good metric. The numbers we produce when we measure a given attribute should not vary as a function of time, of the observer, or of the context. It is clear that LOC is a valid measure of program size. It is also possible to specify how LOC should be measured in such a manner that any observer of the same program would count it in exactly the same way. We could, for example, specify that our LOC metric, when applied to a file containing C code, simply enumerates the number of carriage return characters in that file. Thus, LOC has good validity and it can be measured reliably.
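Such a counting rule leaves the observer with no judgment to exercise. A minimal sketch of the rule in Python (the file name is hypothetical, and newline characters are taken as the line terminators) might look like this:

    def count_loc(path):
        # Count LOC as the number of line-terminating characters in the file,
        # following the simple counting rule described above.
        with open(path, "rb") as source:
            return source.read().count(b"\n")

    # Any observer applying this rule to the same file obtains the same count.
    print(count_loc("driver.c"))   # "driver.c" is a hypothetical example file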

Many metrics that are collected about people and processes are highly subjective. Each rater who evaluates a particular attribute will arrive at a different numerical value. The function point metric is a very good example of this. Function points are used to assess, among other things, the complexity of a set of software specifications. In theory, they provide an estimate of the relative size of a software system based on the functionality described in the specification of that system. The notion of function points has spawned an entire mystical society of software developers who have built a pseudo-science around these improbable measurements. For our purposes, function points have one very serious drawback: they are intrinsically unreliable. Different people evaluating the same system will arrive at different function point counts for the same specification.

Just because a rater is able to attach a number to a specific specification attribute does not mean that he is likely to reproduce exactly that number at some future time, or that another observer of the same system is likely to produce the same number or one close to it. The missing piece of research on function points relates to the reliability of the technique. We can derive an estimate of the reliability of function point ratings by different judges from a classic paper by Ebel [5] in the psychometric literature. His definition of the reliability of ratings is based on the analysis of variance among raters evaluating the same systems. Consider a hypothetical experiment involving three raters who all evaluate the same four software specifications. Exhibit 1 shows the outcome of this hypothetical experiment. We can see, for example, that Rater 1 found a function point count of 31 for System 1, while Rater 2 found 55 function points in the same system.

Exhibit 1: Hypothetical Function Point Scores

              Rater 1   Rater 2   Rater 3       Sum      Sum²
  System 1         31        55        35       121     14641
  System 2         42        36        37       115     13225
  System 3         14        18        17        49      2401
  System 4         22        21        30        73      5329
  Sum             109       130       119       358     35596
  Sum²          11881     16900     14161     42942


To begin our analysis of variance, we need to compute the sums and the squared sums for each rater across all systems and for each system across all raters. From the work by Ebel, we can compute the reliability of the ratings as:

r = (M_s - M_e) / (M_s + (k - 1)M_e)

where k is the number of raters, M_s is the system mean square, and M_e is the error mean square. The computation of the reliability for our hypothetical function point experiment is shown in Exhibit 2. For this experiment, the reliability of the ratings was 0.655.
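As a quick arithmetic check, substituting the mean squares from Exhibit 2 (with k = 3 raters) into the formula gives:

r = (395.0 - 58.9) / (395.0 + 2 × 58.9) = 336.1 / 512.8 ≈ 0.655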

Exhibit 2: Computation of the Reliability of Ratings

  Total sum of squares           12274.0
  Average sum squared            10680.3
  Sum of squares
    • For raters                    55.2
    • For systems                 1185.0
    • For total                   1593.7
    • For error                    353.5
  Mean square
    • For systems                  395.0
    • For error                     58.9
  Reliability                      0.655
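The figures in Exhibits 1 and 2 can be reproduced directly from the rating matrix. The following sketch, written in Python with illustrative variable names of our own choosing, carries the ratings through the sums of squares, the mean squares, and the reliability formula given above:

    # Hypothetical function point ratings from Exhibit 1:
    # one row per system, one column per rater.
    ratings = [
        [31, 55, 35],   # System 1
        [42, 36, 37],   # System 2
        [14, 18, 17],   # System 3
        [22, 21, 30],   # System 4
    ]

    n = len(ratings)        # number of systems
    k = len(ratings[0])     # number of raters

    grand_total = sum(sum(row) for row in ratings)                         # 358
    correction = grand_total ** 2 / (n * k)                                # 10680.3

    ss_total = sum(x * x for row in ratings for x in row) - correction     # 1593.7
    ss_systems = sum(sum(row) ** 2 for row in ratings) / k - correction    # 1185.0
    ss_raters = sum(sum(col) ** 2 for col in zip(*ratings)) / n - correction  # 55.2
    ss_error = ss_total - ss_systems - ss_raters                           # 353.5

    ms_systems = ss_systems / (n - 1)              # system mean square, 395.0
    ms_error = ss_error / ((n - 1) * (k - 1))      # error mean square, 58.9

    reliability = (ms_systems - ms_error) / (ms_systems + (k - 1) * ms_error)
    print(round(reliability, 3))                   # prints 0.655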

If the reliability of the rating scheme is very low, as it is in our example, this indicates that the judges themselves are a significant source of variation in the ratings. In a highly reliable rating system, the error mean square is very small in relation to the mean square of the systems being measured, and the reliability value will therefore be very high. The source of this error variation is the difference among the judges' ratings of the same systems.

Given that there are essentially no standards for any measurement performed by human raters, the reliability of such data will always be low. We have no good working definition of what constitutes a good developer, for example. If we ask three different managers to evaluate the performance of a single developer, we will likely get three very different perceptions of the same individual. The reason for this is that we simply do not have viable evaluation templates or standards to use in this process. Each of the raters will be using a different construct to evaluate the developer. One rater might simply look at productivity as measured by LOC. Another might look at productivity in terms of lines of clean and tested code, as opposed to raw productivity. Yet another manager might factor elements of collegiality into his evaluation.

It is relatively easy to identify metrics that will be unreliable: they will be derived from unconstrained judgments by the raters. If the evaluation process is controlled by a very well-defined standard, then there will be no opportunity for variation among the raters. The standard is a good one if it converts the human observer into an automaton. The same observer, on different occasions, would evaluate the same event and get exactly the same value for each observation. Different observers applying the same standard would get exactly the same value for the same event.

[5] Ebel, R.L., Estimation of the Reliability of Ratings, Psychometrika, 16(4), 407-424, December 1951.


