Chapter 3: Measuring Software Development

3.1 Measurement Domains

The ultimate focus of software engineering measurement is the software product: the source code itself. The actual program product is, however, very strongly influenced by the software engineering process used to create it; by the people involved in the specification, design, coding, and testing of the product; and by the environment in which those people and processes operate. It will be our task to develop measurement methodologies for each of these measurement domains.

The first and most important realization that we must come to is that we are working with a multivariate problem. There is no single best measure of a program. There is no single best measure of a requirements analysis. There is no single best measure of a software process.

Measurement of software systems is a multivariate problem. To characterize any of the products, people, processes, or environments, we will have to measure a number of attributes simultaneously. For example, consider the case of two programs. Program A has 100 lines of code; Program B has 500 lines of code. On the surface, it would appear that Program B is the more complex of the two because it is exactly five times longer than Program A. Now let us measure the if statement complexity of these two programs. We discover that Program A has an if statement complexity of 20, whereas Program B has an if statement complexity of 8. This means that there are many more distinct paths in Program A than in Program B. In this circumstance, Program A would be considered by most software developers to be far more complex than Program B.

Now suppose that we measure the number of cycles in the control flow graph representation of both programs. This count corresponds directly to the number of loops in each program. Suppose that there are eight such loops in Program B and none in Program A. The numerical relationship between the two programs is now not so clear. It is generally in such circumstances that software developers seek to simplify their world: they seek to choose one of the three measures of the program as the best metric. This is somewhat like trying to find the best metric for a person. People have many attributes, and we must understand them all in order to represent the person from whom the metrics are drawn. People are short and tall, smart and dull, thin and heavy, light and dark; but people are none of these things one at a time. People have all of these attributes at once. There are multiple simultaneous attributes for people, and there are multiple simultaneous attributes for programs. We would not think to characterize a person by a single attribute, nor should we try to characterize a program by a single attribute.
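
To make the point concrete, consider a minimal sketch (in Python; the attribute names and numbers are ours, taken from the example above) of why no single ordering of the two programs exists. Each program is a vector of measurements, and neither vector dominates the other:

    # Measurements for the two hypothetical programs discussed above.
    program_a = {"LOC": 100, "if_complexity": 20, "loop_count": 0}
    program_b = {"LOC": 500, "if_complexity": 8,  "loop_count": 8}

    def dominates(p: dict, q: dict) -> bool:
        """True only if p is at least as large as q on every attribute."""
        return all(p[k] >= q[k] for k in p)

    # Neither program dominates the other, so there is no single
    # defensible answer to "which program is more complex?"
    print(dominates(program_a, program_b))   # False
    print(dominates(program_b, program_a))   # False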

Perhaps one of the most destructive notions ever created was that software development is an art. We see continual reference to the "art of programming" or the "art of software testing." For a real discipline of software engineering to emerge, there must be a science of programming, a science of design, and a science of testing. One of the basic principles of science is that a theory can be validated empirically through the scientific method. A hypothesis is formulated based on theoretical considerations. An experiment is designed to test the hypothesis. Measurements are taken on the phenomenon whose existence (or nonexistence) is being postulated. The measurements are analyzed, and the hypothesis is accepted or rejected. The theory is thus validated or invalidated. The operative term in this process is measurement. For there to be a science, there must be measurement.
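
As a minimal illustration of this measurement-driven loop, the sketch below tests a hypothesis of the kind we will meet throughout this book: that module complexity is positively associated with fault counts. The data are invented purely to show the procedure, and we assume SciPy is available:

    from scipy import stats

    # Hypothetical measurements on eight modules.
    complexity = [5, 12, 9, 30, 22, 41, 16, 27]
    faults     = [1,  3, 2,  9,  6, 12,  4,  8]

    # Test H0: no linear association between complexity and faults.
    r, p_value = stats.pearsonr(complexity, faults)
    print(f"r = {r:.2f}, p = {p_value:.4f}")
    print("reject H0 at the 0.05 level:", p_value < 0.05)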

There are several tangible products of the software development process. The software development process begins with the software requirements specification (SRS). This SRS is the leading indicator of the product that is to be built. There are many measurable attributes of the SRS that can be identified. [1] The next physical product in the evolving software system is the software design. This design may take many different forms. It can be a set of design specifications on printed paper, a set of files as output from a CASE tool, or a high-level source document in a program design language. After the design details of the program have been fleshed out, the source code representing this design is then constructed. Perhaps it is worth noting here that there is not just one set of source code. A source code program will continually evolve over its lifetime. At the point that it stops evolving, it will be retired. Last, but certainly not least among the potentially measurable software products, is the set of software system documentation. These products are tangible objects. Their measurement is fairly well defined.

From the standpoint of the software development process we would like to begin our science by identifying those attributes for which valid and reliable measures can be defined. We will learn to treat with great suspicion those concepts that do not lend themselves to measurement. If we carefully examine the circumstances of software development, we can construct a taxonomy of measurable software attributes that fall into four basic categories as follows: people attributes, process attributes, product attributes, and environment attributes. We can define metrics on specific properties within each of these attribute domains.

In the initial stages of the development of a software measurement program, the emphasis should not be on gathering data from a large number of metrics from each of the four measurement domains. Instead, the focus should be on the measurement processes.

3.1.1 People Metrics

We very often forget or most certainly suppress the fact that computer software is designed and built by people. We seldom, if ever, mention this fact in our undergraduate and graduate computer science curricula. Computer programs have algorithms and data structures. Somehow these algorithms and data structures are scrunched together and computer programs happen. Anyone who has ever taught programming courses realizes that there are an astonishing number of different ways to solve the same problem using different algorithms and different data structures. Some students can find the essence of the problem and produce very elegant, simple programs. Other students seem determined to make a mountain out of a molehill. They produce hundreds of lines of labyrinthine logic whose real functionality may never be known but seems to produce the required result. Very simply put, people design and develop programs. If you want to understand the software development process, it is vital to understand the people performing the work.

Measuring people is one of the most important things we can learn to do in understanding the software development process, and it is also the most difficult. With the exception of a brief excursion into this arena in Chapter 4, we will not devote further attention to measuring people in this book. Measuring people is a tar pit. It will trap the unwary, the novices, and suck them down into certain doom. As much as we would like to know about how our developers function, we must learn to walk before we can run. People metrics require very special knowledge about how people work and what makes them tick. The typical software practitioner simply does not have the educational background in psychology or sociology to attempt this very difficult feat.

A little story is in order here. Sometime in the past we were invited into a software development organization at the XYZ Company to work with its development staff to build a model for mapping from software code complexity measures to software faults. In general, this is a fairly easy task. A very small set of valid software metrics can reasonably account for more than 70 percent of the observed variation in recorded software faults. When we attempted to replicate this result at the XYZ Company, we could only account for about 20 percent of the variation in the fault data with the canonical software metrics, a most unusual result. A careful inspection of the fault data appeared to suggest that there were substantial differences in how faults were reported by different developers. Some seemed to have far fewer faults than others. Also, the fault rates did not seem to correlate with the tenure of the developers. Sometimes, novices have a much greater fault rate than do more experienced developers.

It became clear that there were real differences among the developers on the rate at which they were reporting faults. In this case, the developers themselves were an uncontrolled source of variation in the modeling problem. The noise they created simply obscured the relationship between the fault data and the software metric data. To control for the constant differences among the development staff in their reporting rates, simple dummy variates were introduced into the model for each developer, save one. (It is not important to understand just what a dummy variate is at this point.) A model was then constructed using the dummy variates and, lo and behold, we were then able to account for the traditional 70 percent variation in software fault data. The coefficients of the dummy variates, in essence, represented the constant difference in reporting rate of each developer.
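
For readers who want to see the device, here is a minimal sketch of the construction, using ordinary least squares in NumPy. Every number and developer name is invented; the point is only how the dummy columns enter the design matrix:

    import numpy as np

    # Invented data: a complexity measure, a reported fault count, and
    # the reporting developer for each of eight modules.
    complexity = np.array([10., 25., 40., 12., 30., 45., 8., 28.])
    faults     = np.array([ 2.,  5.,  9.,  0.,  2.,  4., 3.,  8.])
    developer  = ["ann", "ann", "ann", "bob", "bob", "bob", "cal", "cal"]

    # One 0/1 dummy column per developer, save one ("ann" is the baseline),
    # plus an intercept and the complexity measure itself.
    names = sorted(set(developer))[1:]
    dummies = np.column_stack([[1.0 if d == n else 0.0 for d in developer]
                               for n in names])
    X = np.column_stack([np.ones_like(complexity), complexity, dummies])

    # Each dummy coefficient absorbs one developer's constant offset in
    # reporting rate, clearing that noise out of the complexity term.
    coef, *_ = np.linalg.lstsq(X, faults, rcond=None)
    print(dict(zip(["intercept", "complexity"] + names, coef.round(2))))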

The results of the research were presented to a committee of software development managers for the project being studied. The stated objective of the meeting was to achieve a better reporting method for software faults, so that we could get better fault measurement data. During the course of the meeting, several of the managers wanted to know why the model was so complex and what its individual elements were. When the coefficients of the dummy variates were presented, we suggested that these coefficients represented, in essence, the differences in reporting rates among the developers. The managers then understood that there were vast differences in the rates at which the developers were reporting faults. Some, it was clear, seldom reported problems in their code.

The managers then wanted to know the name of the most egregious violator, the person with the highest coefficient. We made the mistake of sharing this information with the group. It was Sally M. They then sent someone to get Sally M so that they could discuss this with her. When Sally came into the room, a kangaroo court of managers was convened to discuss Sally's failure to report software faults. A conviction was soon forthcoming from this impromptu court. Sally flew into a towering rage and allowed that she knew very well that the managers were factoring the fault data into their annual evaluations. Those programmers who reported faults honestly were penalized for their honesty. After some further discussions among the managers, Sally then allowed that she no longer wished to work for the XYZ Company. Sally left the room to clean out her desk, and the managers continued their discussion. Sally was one of their best software engineers. She had been with the company for years and carried in her head an enormous number of implicit software requirements for the projects on which she had worked. Her loss was going to have serious repercussions for the XYZ Company. The managers now looked for the cause of the problem. It was decided then and there that everything had been just fine until we created the problem. We were asked to surrender our security badges and never darken the halls of the XYZ Company again.

Here was a classic case in which the managers learned the wrong thing from the data. The wide variation in reporting rates was not a people metric; it was, in fact, a measure of the failure of a software process. There was a software fault reporting process in place, but no clear reporting standard. Further, there was no audit function to ensure that all faults were reported, and reported in the same manner. Many of the faults were repetitive: divide-by-zero faults were very common, yet there was no process in place to ensure that the ranges of all denominators were checked prior to divide operations. There was no learning from previous mistakes. Developers in that organization are probably still introducing divide-by-zero faults into their code, only to spend agonizing hours finding them again.

Some people will do a very good job at software development activities and some will not. This people dimension is perhaps the least understood of all the relevant components of the software development process. In general, the people involved can be assembled into several major groups. First, there are the analysts who are responsible for the formulation of the requirements specifications. Then come the people responsible for software design. The design products are used by programmers to create source code. The resulting code is tested by a software test group. Finally, the software is placed into service and becomes the responsibility of a software maintenance staff. The individuals comprising these different groups will have very different characteristics. That is, a person who really likes software maintenance work and is very good at it is probably a very different person from someone who excels in software specification or design.

Learning to measure people is a very difficult and complex task. If we are not careful and conscientious in the measurement process, we can learn exactly the wrong thing from our measurements. This is not new ground that needs to be plowed. There is a veritable wealth of information about the measurement of people that can be drawn from the fields of psychology, sociology, and human factors engineering. After this chapter, there will be little or no mention of measurement in the people domain. This is not because the domain is unimportant; it is very important, perhaps the most important domain of all. But measurement in this domain requires far more expertise in psychology and sociology than the typical software engineer will ever be exposed to.

Much of the data that we collect about people will, at first, seem counterintuitive. For example, if we track the rate of fault insertion against years of experience with the company, we will invariably find a strong positive relationship between the two attributes. At face value, it would appear that the best way to control faults and lower the overall rate of fault insertion is to sack the older and more experienced developers. If we look further into the problem, however, we will discover that the overall complexity of the code modules created by the experienced developers is far greater than that of the modules being worked on by the novices. Their tasks are not equal. If the novices were assigned the complex module design and coding tasks, their rate of fault insertion would be very much greater than that of the experienced developers. Similarly, if the experienced developers were working on the same code base as the novices, their rate of fault insertion would be very much lower than that of the novices.
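
A back-of-the-envelope sketch shows how the confound works; the group totals are invented. On raw counts the veterans look worse, but normalizing each group's faults by the complexity of its assigned modules reverses the picture:

    # Invented totals: faults inserted and the summed complexity of the
    # modules assigned to each experience group.
    groups = {
        "novices":  {"faults": 30, "complexity": 200},
        "veterans": {"faults": 50, "complexity": 900},
    }

    for name, g in groups.items():
        rate = g["faults"] / g["complexity"]   # faults per unit complexity
        print(f"{name:8s} raw faults = {g['faults']:3d}, "
              f"normalized rate = {rate:.3f}")

    # Raw counts:       veterans look worse  (50 > 30).
    # Normalized rates: veterans look better (0.056 < 0.150).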

W. Edwards Deming maintained that only managers who had a sufficient background in statistics should be allowed access to raw data. [2] Without sufficient preparation in statistical analysis, it is particularly easy to extract the wrong information from raw data. The above example is but one of a myriad of similar circumstances that arise daily. Many times, the mere perception that they are being monitored will change people's behavior. In some organizations, for example, there are efforts afoot to measure programmer productivity. One such measure is to simply count the lines of code (LOC) that each programmer produces each week. Once programmers know that they will be evaluated on their LOC productivity, the number of comments in the code begins to rise dramatically, and the number of language tokens per line can be expected to drop. In a sense, it is a bit like the measurement problems of quantum mechanics: the very act of measurement alters the attribute being measured.
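
The LOC game is easy to demonstrate. Below is a crude sketch (the tokenizer is deliberately naive, and the two code fragments are ours) showing that padding a one-line computation with comments and extra line breaks raises LOC while the tokens-per-line figure falls:

    import re

    def loc_and_tokens(source: str):
        """Count physical lines and, crudely, language tokens (comments ignored)."""
        lines = source.splitlines()
        code = [re.sub(r"#.*", "", ln) for ln in lines]   # strip comments
        tokens = sum(len(re.findall(r"\w+|[^\w\s]", ln)) for ln in code)
        return len(lines), tokens

    honest = "total = price * quantity + tax"
    padded = ("# compute the total\n"
              "total = price\n"
              "total = total * quantity\n"
              "total = total + tax  # add tax\n")

    for label, src in (("honest", honest), ("padded", padded)):
        loc, tok = loc_and_tokens(src)
        print(f"{label}: LOC = {loc}, tokens per line = {tok / loc:.1f}")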

Like computer programs, people have attribute domains. Some of these domains are relevant to the task of software development and others are not. It is possible to list some relevant people attributes that can be measured. These are as follows:

  1. Age

  2. Gender

  3. Number of computer science courses

  4. Number of courses in the application domain

  5. Date of employment at the company

  6. Experience in the programming languages

  7. Technical training

This is not intended to be an exhaustive list of people attributes but rather to indicate that there are attributes that may be measured for each of the people in the software process. They are also typical of data that can be kept on the software development staff by the human resources organization. These measures can potentially be used in models of software quality or productivity. What is important is that we must be very careful to establish standards for any measurements that we do make on people variables. This is the most difficult part of measuring people attributes.

From the above list of attributes, we will start with the first: age. It is almost always a mistake to store a variable named age. Such data are accurate at the moment they are recorded and become increasingly inaccurate as time elapses. It would be far better to know a person's birth date than to know their age at employment; from the birth date, their age at any future time can always be computed. Now to the birth date field itself. Such dates are normally kept in the form MMDDYYYY, which allows us to compute how old a developer is to the nearest day. For all practical purposes, this resolution is finer than we need. The birth year alone will probably give us the resolution required for any analysis of our developers; it is highly unlikely that we will need to resolve age differences to better than ±6 months. Now that we know our needs, we will record only the birth year. We simply do not need to know more.
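
A minimal sketch of the chosen granularity follows; the field names are ours. We record only the year and derive an approximate age on demand:

    from datetime import date

    def approximate_age(birth_year: int, as_of: date) -> int:
        """Age to the nearest year: all the resolution the analysis needs."""
        return as_of.year - birth_year

    # The personnel record stores the year only; no month, no day.
    developer = {"id": "dev-017", "birth_year": 1965}
    print(approximate_age(developer["birth_year"], date(2003, 6, 1)))   # 38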

Next on the list of people attributes is gender. This attribute takes the value male or female. The real question is why we would need to know the gender of a developer. The human resources organization clearly must know it for EEOC reporting, but it is not at all clear that these data are meaningful in the context of the software development environment. Such data would be useful only if (1) there were known differences between males and females, (2) these differences were quantified by task, and (3) the organization were structured to exploit these differences. Quite simply, there are not sufficient experimental data at our disposal to suggest that we could make effective use of the gender attribute. We will delete it from the list of people attributes that we will measure.

The third item on the list is the number of computer science courses that a developer has completed. There are two major problems with this attribute. As we will learn in the next chapter, it may or may not have predictive validity: exposure to varying levels of the computer science curriculum may simply be unrelated to software development capability. Without scientific evidence that good developers have also completed many computer science (software engineering) courses, we cannot consider this a valid attribute. The second problem is that not all computer science courses are equal. Graduate computer science curriculum content is very different from upper-division undergraduate content, and lower-division content is materially different from both. Quite a number of people would suggest that the curriculum content of computer science courses at the Massachusetts Institute of Technology is very different from that of the computer science courses at Jadavpur University of Calcutta. Simply counting computer science courses will probably do nothing more than introduce an uncontrolled source of variation into any model that uses this attribute.

For application programming, it might well make a difference if the application developers are grounded in the discipline of the application. For example, if the application under development is an accounting package, it might be useful for the developers to have a background in accounting. The fourth people attribute enumerates the number of courses that a developer has had in the discipline of the application. The course content issue raised above is just as relevant here as it was for computer science. Although there is probably less variability in the content of accounting courses from one university to another than there is in computer science, it is still a substantial factor. Further, there are probably no experimental data available to support the conclusion that a person with an accounting background will do better at writing accounting applications than a person without such a background. Without such evidence at hand, there is little reason to maintain data on this attribute. Finally, it is quite possible that developers will be shifted from one project to another during the course of their careers. A person who started work writing COBOL accounting applications may well be writing Java applications for Web infrastructure today. Thus, the fourth people attribute is probably not usable or relevant for any modeling purpose.

Another common practice is to record the date of employment at the company. The same issues of granularity of measurement obtain here as in the case of birth date. In this case we will probably compute the length of service in months. We can make a clear case for being able to resolve this attribute at the month level of granularity. This means that the date of employment will be defined as MMYYYY. Unfortunately, this information is not particularly relevant. If the developer first joined the company as an unskilled employee on the production floor, graduated from that to foreman, and from that to tester and then to programmer, this employee may have been with the company for 20 years or more but will have done development for no more than 1 year. Hence, there is probably little value to the date of employment attribute.

Programmer experience is an interesting attribute. It seems so relevant; however, there are a couple of problems with it. First, there is a host of programming languages in use today, and probably more have been used in the past. This is not a single attribute: we would need to record experience in Java, FORTRAN, COBOL, JOVIAL, NELIAC, LISP, SLIP, HAL/S, ALGOL, Basic, and so on. Perhaps the greatest difficulty with programmer experience is that we will probably learn nothing from it. We have all met developers who are hopeless, regardless of their years of experience in a language. A real geek with 2 weeks of Java is probably to be preferred to a programming hack with 20 years of experience.

The final attribute goes to the heart of company training programs. Many software development organizations offer their employees some type of technical vitality program. The stated purpose of these programs is to provide employees with a broader educational exposure, to update their skills with new technology, and to meet a host of other laudable and lofty goals. Kudos are given to employees for participating. The basic assumption underlying the technical vitality attribute is that the more technical training a person has had, the more valuable he or she will be to the company. Again, there is little evidence to support this conclusion; there is probably an inverse relationship between developer productivity and this attribute. We have taught any number of such training courses. For the typical older developer, they are a very pleasant and viable alternative to real work.

We have now worked our way through the list of people attributes. Except for the first attribute, there is not much in the way of functional information in any of these attributes. Many of them will bring more noise with them than signal. Although the list of people attributes chosen for this discussion is rather limited, the same considerations apply to just about any other list that we might wish to construct. The conclusion is obvious. We know very little about which human attributes are really relevant to the products that they make, to the environments that they are capable of working in, or to the processes that will make them productive. We are going to have to learn to measure people. This will be the biggest challenge in the development of an effective software engineering measurement program. It is important that we learn to do this measurement. People clearly do the design and development, yet we know so little about how they do it. The very best we can do at the present time is to build models that control for the effects of people. We will not be able to build good models that incorporate people attributes because our knowledge of these attributes is so weak.

3.1.2 Process Metrics

Software is developed through a sequence of steps that constitute the underlying software process model. We can measure certain aspects of that process. We might, for example, wish to measure the rate at which software faults are introduced and removed by the development process. We might wish to characterize the process in terms of the number of unresolved problems outstanding in the software under development. Some notable process attributes include:

  • Rate of programmer errors

  • Rate of software fault introduction

  • Rate of software failure events

  • Number of software change requests

  • Number of pending trouble reports

  • Measures of developer productivity

  • Cost information

  • Software process improvement (SPI) costs

  • Return on investment (ROI)

The above list is not intended to be exhaustive, merely indicative. The important thing to note is that the process variables tend to deal primarily with cost and rate data.
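
As a small illustration of how rate data of this kind might be derived, the sketch below reduces an invented log of trouble reports to a monthly reporting rate and a pending count. Real fault-tracking systems vary widely, so treat the record format as an assumption:

    from collections import Counter
    from datetime import date

    # Invented fault-report records: (date reported, date resolved or None).
    reports = [
        (date(2003, 1, 6),  date(2003, 1, 20)),
        (date(2003, 1, 13), None),               # still pending
        (date(2003, 2, 3),  date(2003, 2, 10)),
        (date(2003, 2, 17), None),               # still pending
        (date(2003, 2, 24), date(2003, 3, 3)),
    ]

    reported_per_month = Counter((d.year, d.month) for d, _ in reports)
    pending = sum(1 for _, resolved in reports if resolved is None)

    print("faults reported per month:", dict(reported_per_month))
    print("pending trouble reports:  ", pending)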

3.1.3 Product Metrics

As was stated earlier, there are several tangible products of the software development process. These are, in part:

  • The software requirements specification

  • The high-level design

  • The low-level design

  • The source code

  • The test cases

  • The documentation

A software system is a rapidly evolving and dynamic structure. All of the products of this development process are in a constant state of flux. In fact, when a system finally reaches a point where it no longer changes, it will probably be replaced. The world that we live in is a rapidly changing place. Software that does not adapt becomes obsolete.

There are clearly many more products than the short list given above. Again, it was not our intention that this list be exhaustive, just illustrative. Much of our measurement methodology will be devoted to understanding this domain more completely. We will return to the subject in rather more detail in Chapter 5.

3.1.4 Environment Metrics

Environment metrics quantify the attributes of the physical surroundings of the software development organization, including all pertinent aspects of the environment that relate to development. The operating system environment used for design and development has great bearing on project outcomes. The rate of personnel turnover among programmers and managers is clearly measurable and worth knowing. If a programmer assigned to a complicated task is repeatedly interrupted by meetings, colleagues, or other distractions, he or she will probably make substantial errors in the programming task. Some pertinent environmental attributes that can be measured are as follows:

  • Operating system

  • Development environment

  • Operating environment

  • Administrative stability (staff turnover)

  • Machine (software) stability

  • Office interruptibility

  • Office privacy

  • Library facilities

  • Rendezvous facilities

Again, this is not intended to be an exhaustive list. It merely indicates the types of attributes that might potentially impact the software development process.
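
One way to impose a standard on such observations is a fixed record per project. The sketch below is only a suggestion; the field names and encodings are ours, chosen to mirror the list above:

    from dataclasses import dataclass

    @dataclass
    class EnvironmentRecord:
        """One standardized observation of a project's development environment."""
        operating_system: str
        development_environment: str
        operating_environment: str
        annual_staff_turnover: float     # administrative stability
        toolchain_changes_per_year: int  # machine (software) stability
        interruptions_per_day: float     # office interruptibility
        private_offices: bool            # office privacy
        has_library: bool                # library facilities
        has_meeting_rooms: bool          # rendezvous facilities

    record = EnvironmentRecord("Linux", "Eclipse", "embedded",
                               0.15, 2, 6.5, False, True, True)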

[1]Munson, J.C. and Coulter, N.S., "An Investigation of Program Specification Complexity," Proceedings of the ACM Southeastern Regional Conference, April 1988, pp. 590-595.

[2]Deming, W.E., Out of the Crisis, Massachusetts Institute of Technology Center for Advanced Engineering Study, Cambridge, MA, 1993.


