2.5 Principles of Experimentation

Much of what we know about the software development process is based on invalid or incomplete information. It is not surprising, then, that we build unreliable and insecure software systems. We still know very little of what we will need to know to improve this process. Our entire focus at this stage in the evolution of software development should be on research and experimentation. We must build a scientific basis for software development technology, and the main driver for this discovery process will be measurement methodology. Our experiments will be very simple, but our focus on data collection methodology will be intense.

To understand how quality software might be developed, we must first learn to identify the possible sources of variation in this development. Operational definitions of processes and products must be carefully constructed in a manner that will permit accurate and reliable measurements. Levels of quantification and the quality of measurement processes must be clearly understood. In developing the new field of software measurement, we must learn to emulate the conduct of inquiry of the physical, natural, and social sciences that have matured over many years. Software development as currently practiced very closely resembles alchemy, the precursor of the science of chemistry. One of the guiding objectives of the typical alchemist was the painless conversion of lead to gold. If they could but find the magic formula for this transformation, they would be wealthy people. Current software development practices are uncomfortably close to this same pursuit. If we can but pass our source code and high-level designs through a magic filter, then we can finally realize the "gold" of maintainable and fault-free code. The problem is one of finding the right magic. We should know by now that there is no magic, and no real substitute for hard work. It is very hard work to lose weight; it requires substantial discipline and restraint. It is far easier to search for a pill or magic elixir that will substitute for discipline and willpower.

We would like to pull back from the wild enthusiasm of this unstructured search for gold and begin to learn from the history of the origins of other sciences. The conduct of our inquiry will thus be constrained to two fairly common approaches to scientific investigation: ex post facto and experimental. What little structured inquiry does exist in the fields of computer science and software engineering is generally governed by the ex post facto approach. In this method of inquiry, it is customary to compare two known software development techniques with respect to some characteristic. For example, we might compare an object-oriented (O-O) programming approach with a traditional non-object-oriented approach to a software development scenario. If we were to find a significant difference between these two approaches with regard to the maintainability of the resulting software, this evidence would then be used to support the conclusion that the O-O method of software development is in some sense superior to traditional software development practices.

Much of the research literature in support of the cleanroom software development method is ex post facto research. The salient feature of ex post facto research is that we are obliged to analyze data that already existed before we asked the research question. Little or no effort will have been made to control extraneous and possibly quite relevant sources of variation in the data. The one thing that can certainly be said about the cleanroom procedure is that it is quite labor intensive in the early phases of the life cycle. It is quite possible, for example, that software generated by the cleanroom technique has fewer faults than other code only because of the intensity of the focus on the problem at a very early stage. Perhaps any other arbitrary ad hoc software development technique would do equally well if a similar investment of time were made early in the life cycle. We just do not know. We do not know because the necessary controlled experiments have not been conducted to test this hypothesis.

The paradigm for experimental research is very different. We perform an experiment to collect data that are central to the questions we wish to answer. In formulating an experiment, we attempt to identify all of the sources of variation, dividing them into a set of variables that we will manipulate during the experiment and a set of variables that we will control during the experiment. If we have designed a good experiment, the variations that occur in the manipulated variables will be attributable to the effect of the experiment; extraneous influences on these variables are carefully controlled, as in the random-assignment sketch below. The data we collect for the experiment will be generated only after we have asked the question. Those data are like chewing gum: they can be used once and only once. They have been created for the purposes of a single experiment and have no real value outside the context of that experiment. Once we have completed the experiment and analyzed the data, the data should be discarded; we will have gotten all the good out of them.
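
The notion of controlling extraneous variation can be made concrete with a minimal sketch of random assignment, written here in Python. Everything in it is invented for illustration (the subject names, the group sizes); the point is only that randomization spreads uncontrolled sources of variation evenly across the groups, on average.

    import random

    random.seed(11)  # fixed seed so the illustration is reproducible

    # Hypothetical subject pool; names and group sizes are illustrative.
    subjects = [f"programmer_{i:02d}" for i in range(20)]

    # Random assignment spreads uncontrolled sources of variation
    # (tenure, skill, education) evenly across both groups, on average.
    random.shuffle(subjects)
    control, treatment = subjects[:10], subjects[10:]
    print("control:  ", control)
    print("treatment:", treatment)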

As a simple example of the experimental conduct of inquiry, consider the possibility that there are measurable differences between classes of programmers. Specifically, we wish to know whether there is a difference in productivity between male and female programmers. An ex post facto approach to this question would have us look at our existing database to see what we could find there that might answer the question. We would find, for example, data on the productivity of each of our programmers as measured by the number of lines of code they contributed to each project. We could then split these observations into a set of values for the male programmers and a set for the female programmers. The literature is replete with such studies. It is quite possible, however, that our staff of female programmers was recently hired due to a push from the Equal Employment Opportunity Commission. This being the case, when we examine the differences between female and male programmers, we will certainly learn that the male programmers are more productive. What we are really measuring by analyzing the existing productivity data is a difference in programmer tenure. Most of the females are recent hires; they are novices. The males, as a group, have been around much longer; we would expect them to be more productive. The sketch below makes this confound concrete.
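
Here is a minimal sketch of the tenure confound, using fabricated data and Python's numpy. The productivity model (lines of code growing with tenure, with no gender effect at all) is an assumption of the illustration, not a claim about any real staff. A naive comparison of group means shows a large apparent gender effect, which all but vanishes once we compare programmers of similar tenure.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 200

    # Fabricated tenure distributions: the female programmers are
    # mostly recent hires, the male programmers have longer tenure.
    tenure_m = rng.uniform(2, 10, n)
    tenure_f = rng.uniform(0, 3, n)

    # Productivity (lines of code per project) depends only on tenure.
    loc_m = 500 + 100 * tenure_m + rng.normal(0, 150, n)
    loc_f = 500 + 100 * tenure_f + rng.normal(0, 150, n)

    # Naive ex post facto comparison: males appear far more productive.
    print(f"mean LOC, male:   {loc_m.mean():.0f}")
    print(f"mean LOC, female: {loc_f.mean():.0f}")

    # Restricting both groups to 2-3 years of tenure removes most of
    # the apparent gender effect: the difference was tenure all along.
    m_sub = loc_m[(tenure_m >= 2) & (tenure_m <= 3)]
    f_sub = loc_f[(tenure_f >= 2) & (tenure_f <= 3)]
    print(f"matched on tenure, male:   {m_sub.mean():.0f}")
    print(f"matched on tenure, female: {f_sub.mean():.0f}")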

2.5.1 Hypothesis Formulation

The conduct of scientific inquiry is an iterative process. At the heart of this process is a theory that we think will help us explain our universe. To test the validity of our theory, we will conduct a series of experiments, each of which tests a discrete component of the theory. The experiment begins with the formulation of a hypothesis. Nature already knows how to do good software development. She will disclose these secrets to us only through a guessing game. The way this guessing game works is as follows:

  1. Watch Nature at work.

  2. Identify a discrete process in Nature that we would like to explain.

  3. Make an educated guess as to what Nature is up to.

  4. Formulate a hypothesis about how and why this process acts the way it does.

  5. Conduct an experiment to test the hypothesis.

  6. Accept or reject the hypothesis.

  7. If we accept the hypothesis, it then becomes a fact or a law.

  8. Revise our notion of how Nature works.

  9. Go to Step 1.

Research, then, is an iterative process; there is no end to it. The single most important aspect of the research process is that we learn to listen to Nature. Hypotheses are not sacred cows. The very worst thing we could do is set out to prove our hypotheses correct. Hypotheses are trial balloons. They are conjectures. They may be wrong.

2.5.2 Hypothesis Testing

Each hypothesis that we formulate must be tested for validity. That is what an experiment is designed to do. There are two possible outcomes for each experiment. First, we can find that the data suggest that our hypothesis is correct. In this case, the hypothesis is no longer conjectural; it is a fact or a law, and subsequent hypotheses will have this new law as their foundation. Second, we can find that the data do not support the hypothesis, in which case it must be rejected and our theory revised.

There is a tradition among statisticians that we couch our hypotheses in negative terms. Instead of stating that there will be an observable treatment effect between an experimental group and a control group, a null hypothesis is used instead. The null hypothesis, represented as H0, states that there is no observable difference between the treatment group and the control group in the experiment. If we conduct the experiment and find this to be the case, we will accept the null hypothesis, H0. If, on the other hand, we do find a significant difference between the control group and the treatment group in our experiment, we will reject H0 in favor of the alternate hypothesis, H1, which says that there is a significant difference between the treatment and control groups.
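
As a concrete sketch of this decision rule, the following hypothetical Python fragment tests H0 with a two-sample t-test. The fault-count data are fabricated, and the choice of test, group sizes, and significance level are assumptions of the illustration rather than a prescription.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Fabricated fault counts per module: a control group built with the
    # traditional process, a treatment group built with a new process.
    control = rng.poisson(lam=12, size=30)
    treatment = rng.poisson(lam=9, size=30)

    # H0: no difference in mean fault count between the two groups.
    # H1: the mean fault counts differ.
    t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

    alpha = 0.05  # conventional significance level
    if p_value < alpha:
        print(f"p = {p_value:.4f}: reject H0 in favor of H1")
    else:
        print(f"p = {p_value:.4f}: accept H0")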

Hypothesis formulation is usually not a problem in the conduct of scientific inquiry. Once the research pump is primed by the first experiment, new hypotheses flow quite naturally from it. It is appropriate to think of an experiment not in terms of closure but rather in terms of opening a research area. Generally, a successful experiment will yield far more questions than answers; the more we learn, the more we will understand how little we really know. The real problem in the experimental process is the actual test of the hypothesis. Many times, the effects we are trying to assess are tenuous, and the signal will be well disguised in the noise. Statistics is the tool we will use to make decisions about outcomes in the face of this uncertainty of signal recognition.

2.5.3 Type I and Type II Errors

When conducting an experiment, we are really playing a two-person, zero-sum game with Nature. Nature knows a priori what the outcome of our experiment will be; we do not. From Nature's point of view there are two states for our experiment: either H0 is true or it is false. We, on the other hand, can either accept H0 as a result of our experimental observations or reject it. There are two ways that we can get the right result. If H0 is false and we choose to reject this hypothesis, then we will have a correct result. Likewise, if H0 is true and we choose to accept this hypothesis, then we will again have a correct experimental outcome. There are also two ways that we can arrive at exactly the wrong conclusion. If H0 is known by Nature to be true and we reject H0, then we will have committed a Type I error. If, on the other hand, H0 is false and we accept H0, then we will have made a Type II experimental error. These experimental outcomes are summarized in Exhibit 1.

Exhibit 1: The Experimental Paradigm

                 H0 True             H0 False
    Reject H0    Type I error        Correct decision
    Accept H0    Correct decision    Type II error

Type I and Type II errors are not equivalent. In most circumstances, we would much rather make a Type II error than a Type I error. In making a Type II error, we will simply assume that there is no experimental effect when Nature knows that there really is one. The implications of a Type I error are much greater. Nature knows that there is no effect of our treatment, yet we will conclude from our experiment, quite incorrectly, that there is one. We will base our future actions on this conclusion and cause harm as a result. The consequences of such a decision in a safety-critical application can be great. If, for example, a new drug is being tested to treat a particularly virulent form of cancer, H0 will state that there is no significant drug effect, while H1 represents the alternate hypothesis that there is. If a Type I error is made, H0 will be rejected in favor of the alternate hypothesis. Nature knows that the drug is ineffective, but the experiment has shown that it has a significant effect as a cancer treatment. Based on this experimental evidence, drug companies will market the drug as a successful cancer-treating agent, which it is not, and many cancer patients will be treated with a drug that has no effect. Had a Type II error been made in the same experiment, a drug that had potential as a cancer treatment would simply have been eliminated from consideration. This is a far more conservative outcome.
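
The trade-off between the two error types can be made tangible with a small Monte Carlo sketch in Python. Every ingredient here is an assumption chosen for illustration: normally distributed observations, a two-sample t-test, a true effect size of 0.5, and a significance level of 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    alpha, trials, n = 0.05, 2000, 30

    def reject_h0(effect):
        """Run one simulated experiment; return True if H0 is rejected."""
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(effect, 1.0, n)
        _, p = stats.ttest_ind(control, treated)
        return p < alpha

    # Type I error rate: H0 is actually true (no effect), yet we reject it.
    type_1 = np.mean([reject_h0(effect=0.0) for _ in range(trials)])

    # Type II error rate: H0 is actually false (real effect), yet we accept it.
    type_2 = np.mean([not reject_h0(effect=0.5) for _ in range(trials)])

    print(f"estimated Type I rate:  {type_1:.3f} (near alpha = {alpha})")
    print(f"estimated Type II rate: {type_2:.3f} (depends on effect size and n)")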

2.5.4 Design of Experiments

The design of experiments is a very complicated and involved process. It should only be performed by statisticians who have had substantial training in the area. It would never occur to us to try to remove our own appendix. It takes surgeons a very long time to learn how to do surgery; one cannot simply read a book and then go to work with a knife extracting appendixes or kidneys. It should be equally clear that the design of experiments should be left to someone who has been trained to do this work. What we most need to know about the design of experiments is that only professionals should perform it. If you select a good surgeon for your surgery, your chances of survival are enhanced. If you find a good statistician, your chances of doing good science will improve as well.

We live in a very complex, multivariate world. There is no way that a couple of paragraphs in this or any other text could begin to explain the complexities of the design of software experiments. It is very important to understand our own limitations, and equally important to recognize the need for professional help in our conduct of inquiry. The consequences of making bad decisions in software development are becoming dire. The more safety-critical systems with embedded software we develop, the more important it becomes that we understand exactly what we are doing when we build the software for these systems.

The most important thing to know about the design of experiments is that this activity must be performed before the data are collected, not afterward. Once we have formulated our hypothesis, then and only then will we seek the aid of a statistician to help design our experiment.

2.5.5 Data Browsing: Ex Post Facto Research

The most prevalent form of investigation in software engineering and in computer science is ex post facto analysis. In this case, data are analyzed after they have been collected, the collection generally having been done by someone else. We have no precise knowledge about the circumstances of the data collection, with regard to either validity or reliability. This form of research is tantamount to beating the data with a statistical stick. If you beat them hard enough, you are almost certain to find that for which you are looking. This, of course, was a guiding principle behind the Spanish Inquisition: the unfortunate subjects of the Inquisition were beaten and tortured until they were willing to say whatever was necessary to end their misery. Exactly the same principle drives ex post facto investigations. Modern statistical software packages in the hands of inadequately trained researchers become instruments of data torture. These statistics packages should be treated like a surgeon's instruments: they should only be used by people with the necessary training in statistics. It is far too easy to torture the data with these tools until the data scream to validate our favorite hypothesis.

There is, however, a role for ex post facto investigation. It is not a scientific one; it is an intuitive one. While it would be very bad form for ex post facto data to be analyzed and reported as scientific evidence, these same data will disclose their intrinsic sources of variation. They can be of real value in the formulation of a solid hypothesis for subsequent experimentation. A great deal of science hinges on the emergent event and on intuition. Ex post facto data can be very useful for data exploration. There is a host of statistical techniques, such as factor analysis, that can disclose some very interesting interactions among the sources of variation in the data, and we will use the insights provided by these techniques to guide our scientific intuition, as in the sketch below.
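
As an illustration of this exploratory role, the following hypothetical Python sketch applies scikit-learn's FactorAnalysis to fabricated module metrics. The single latent "size" factor driving all three metrics is an assumption of the example; in practice, the loadings would merely hint at shared sources of variation worth testing in a designed experiment.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(3)

    # Fabricated metrics for 100 modules, all driven by one latent
    # "size" factor plus noise: statement count, cyclomatic complexity,
    # and fault count.
    size = rng.normal(0, 1, 100)
    X = np.column_stack([
        0.9 * size + rng.normal(0, 0.3, 100),  # statements
        0.8 * size + rng.normal(0, 0.4, 100),  # cyclomatic complexity
        0.6 * size + rng.normal(0, 0.6, 100),  # faults
    ])

    # Exploratory only: the loadings suggest a common source of
    # variation, a hypothesis for subsequent experimentation.
    fa = FactorAnalysis(n_components=1).fit(X)
    print("factor loadings:", fa.components_.round(2))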

We must learn to treat experimental data as if they were very fragile. They can be used once and only once and must then be discarded. In the normal course of conducting an experimental inquiry, a hypothesis will be formulated, an experiment will be designed, the experiment will be conducted, and the data will be collected. Finally, the data will be analyzed only in the context of the specified experimental design. This experimental design will be completely specified before the data are collected. The analysis of the data specified by the experimental design will suck all of the juice out of the data. There will be nothing of value left in the data once they are analyzed.

Finally, we would never consider using data for our science that we did not personally collect. We just do not know where these data have been or with whom they have been. It would be very unsafe science to consort with data of uncertain origin.


