7.2 Simple Linear Regression

Linear statistical modeling begins with a linear hypothesis. In the simplest case, there will be two variables. The simplest linear hypothesis is that there is a linear relationship between an independent variable (e.g., LOC) and a dependent variable such as faults. To formulate our hypothesis, we will postulate a linear relationship between faults, represented by the variable y, and lines of code, represented by the variable x. The model that we wish to create will be that of a straight line: y = ax + b. To build the model we will next need to collect some data. Note the order. First we will postulate the model and then we will get the data.

Modeling is a form of experimentation. We begin the experiment with a hypothesis. Next we identify and constrain external sources of variation that might contaminate our experiment. Now, a specific linear model will be chosen such as y = ax+b. Then we measure the system to get value pairs (xi, yi) for our variables. Finally, we attempt to fit the model to the data.

Once we have built the model (we will discuss that subject momentarily), we will then evaluate the quality of the model that we have developed. The data that we have used for this experiment are consumable. You may imagine that they will be totally destroyed by the model building process. They cannot be reused in any other experiment or model. They have validity only for the experiment for which they were collected. Linear modeling is not a tool for beating data into submission. It is part of the process of scientific inquiry.

All too frequently in the computer software literature we see models that have been developed to fit the data, and not the other way around as science would dictate. The problem with this approach is that we wind up building a science around measurement artifacts and other spurious aberrations in the data. If our model or our conjecture is not a good one, the disparity between the model and the actual data will contain information on where we went wrong. If we choose to ignore this very important data source, we will certainly build a science of software on a very weak or unstable foundation.

If we have the fortune of not being able to fit our predetermined model to the data, we will then analyze the outcome to understand what went wrong. This, in turn, will immediately lead to a new hypothesis, a new model, a new experiment, and a new set of problems. That is the process of scientific inquiry. The engine or motive force behind the scientific discovery process is the model that does not work too well. If we learn to listen to the data, they will certainly steer us in the right direction.

7.2.1 Examination of the Data

Above all, common sense should prevail in the modeling process. An understanding of the data that we are working with will reveal a great deal about what we will be able to learn from them. For the purposes of this discussion, we will use two variables: one dependent and one independent. This will simplify the discussion and permit us to demonstrate the modeling techniques in two-dimensional figures. We will call this technique "listening to the data." The data contain a message; they are trying to reveal something about the nature of the problem that we are studying. We must be attentive to what the data are trying to say. Altogether too often, data are used to confirm a pet hypothesis. We will have failed utterly if we use them in this manner. Sometimes, the messages in these data are very subtle. Some degree of statistical sophistication will be required to get the data to reveal their message. At other times, the message in the data will be quite obvious. We must learn to listen.

Let us begin our investigation with some hypothetical data that we have collected on our new Mnamana metric and software faults. We built a measurement tool to obtain metric data for a software system consisting of 24 program modules for the Mnamana metric. From our software quality assurance staff we also obtained fault data for these modules. The results of this measurement exercise are shown in Exhibit 1.

Exhibit 1: The Relationship between Mnamana and Faults


When we look at those data, we are filled with despair. There is no apparent relationship between these variables. When we examine the correlation coefficient for these data, we find that it is very low as well (at -0.004). Further, if we try to find a linear relationship, yi = β0 + β1xi, between the data, we find that there are a very large number of lines that we could generate through these data, all of which would represent the data equally poorly. Our insistence on finding a model to fit these data will cause us to miss the obvious.

Let us find the centroid, (x̄, ȳ), for these data. Now define a circle whose radius is ε around the centroid. This we will call the epsilon neighborhood of the centroid. It turns out that for relatively small values, ε < 2, we can completely encapsulate all the data. This implies that there is a very strong relationship between the Mnamana metric and faults. It is a point relationship and not a linear one. There is not much variation in the software faults and there is not much variation in the Mnamana metric.
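As a rough sketch of this check, the following NumPy fragment computes the centroid and the distance of every observation from it; the (Mnamana, faults) pairs here are hypothetical stand-ins, since the values behind Exhibit 1 are not reproduced in the text.

    import numpy as np

    # Hypothetical (Mnamana, faults) pairs; the actual Exhibit 1 values are not listed in the text.
    mnamana = np.array([10.2, 10.8, 9.9, 10.5, 10.1, 10.6, 10.3, 9.8])
    faults  = np.array([4.0, 5.0, 4.0, 5.0, 4.0, 4.0, 5.0, 4.0])

    # Centroid of the data and the distance of each observation from it.
    centroid = np.array([mnamana.mean(), faults.mean()])
    dist = np.sqrt((mnamana - centroid[0]) ** 2 + (faults - centroid[1]) ** 2)

    eps = 2.0
    print("centroid:", centroid)
    print("largest distance from the centroid:", dist.max())
    print("all points inside the epsilon neighborhood:", bool((dist < eps).all()))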

Now let us look at another example in which there is an apparent linear relationship between our independent measure of LOC and our dependent measure of faults. We have now measured another hypothetical software system with ten modules and obtained values for our independent and dependent variables as shown in Exhibit 2. Those data are also shown in an x-y plot in Exhibit 4. There is an apparent linear relationship between our two variables. The correlation coefficient for them is 0.88, which confirms our suspicion of a linear relationship. There are many straight lines that can be put through these data. The question at hand is what criterion measure to use to establish the best line.

Exhibit 2: Measurements on Hypothetical Systems

Module   Faults   LOC
1        3        82
2        3        87
3        4        94
4        4        98
5        2        105
6        6        108
7        5        120
8        6        125
9        7        130
10       9        153
Mean     4.9      110.2
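The 0.88 correlation quoted above can be reproduced directly from these values; a minimal NumPy sketch:

    import numpy as np

    # Exhibit 2 data: faults (dependent variable) and LOC (independent variable) for ten modules.
    faults = np.array([3, 3, 4, 4, 2, 6, 5, 6, 7, 9], dtype=float)
    loc    = np.array([82, 87, 94, 98, 105, 108, 120, 125, 130, 153], dtype=float)

    # np.corrcoef returns the 2 x 2 correlation matrix; the off-diagonal entry is r.
    r = np.corrcoef(loc, faults)[0, 1]
    print(round(r, 2))   # approximately 0.88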

First, let us address the criterion measure for quality of fit. Let yi represent the observed value of the dependent variable and ŷi the predicted value of the dependent variable from the model yi = β0 + β1xi. The difference between the predicted value and the observed value is called the residual value, or error. Let εi = yi - ŷi. Our first attempt at defining a criterion measure of model quality will be to let

(1) q = Σ|εi| = Σ|yi - ŷi|

Exhibit 3: Epsilon Neighborhood about the Centroid


A good line will be one that minimizes q. This is a linear loss function; that is, q is linear in relation to y.

To return to the problem of fitting a line through the fault data shown in Exhibit 2, one possibility is to set the intercept to zero and pass the line through the centroid of the data. This line is shown superimposed on the data in Exhibit 4. Clearly, the two points that define this line are (0, 0) and (x̄, ȳ). Thus, the line will look like this:

(2) ŷi = b1xi

Exhibit 4: Plot of LOC and Faults for Hypothetical Data


where b1 is our estimate for the parameter β1. The residuals for this linear model are shown in Exhibit 5. Note that the sum of the residuals is zero. The sum of the residuals will always be zero by definition. Therefore, this sum is not a good indicator of the quality of the line that we have fit to the data. If, on the other hand, we take the absolute values of the residuals, or deviations, and add these deviations together, then we have a quality index that is different from zero. We can systematically investigate other models:

(3) ŷi = b0 + b1xi

Exhibit 5: Measurements on Hypothetical Systems

Module   yi    ŷi     yi - ŷi   |yi - ŷi|
1        3     3.65   -0.65     0.65
2        3     3.87   -0.87     0.87
3        4     4.18   -0.18     0.18
4        4     4.36   -0.36     0.36
5        2     4.67   -2.67     2.67
6        6     4.80    1.20     1.20
7        5     5.34   -0.34     0.34
8        6     5.56    0.44     0.44
9        7     5.78    1.22     1.22
10       9     6.80    2.20     2.20
Mean     4.9   4.9     0.00     1.01

where b0 is not zero. In every one of these cases we will find that our quality index

(4) q = Σ|yi - ŷi|

is in every case greater than the quality index for the model where b0 = 0.
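A short sketch of this first model from the Exhibit 2 values: with a zero intercept and the line forced through the centroid, b1 is simply ȳ/x̄, the signed residuals sum to (essentially) zero, and the absolute residuals do not.

    import numpy as np

    faults = np.array([3, 3, 4, 4, 2, 6, 5, 6, 7, 9], dtype=float)
    loc    = np.array([82, 87, 94, 98, 105, 108, 120, 125, 130, 153], dtype=float)

    # Model 1: zero intercept, line through the centroid (x-bar, y-bar).
    b1 = faults.mean() / loc.mean()          # roughly 0.044
    predicted = b1 * loc                     # the y-hat column of Exhibit 5
    residuals = faults - predicted

    print("sum of residuals:", round(residuals.sum(), 6))                # essentially zero
    print("sum of |residuals| (q):", round(np.abs(residuals).sum(), 2))  # roughly 10.1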

Instead of defining a line with the two points (0, 0) and (x̄, ȳ), we could also pick any other two points in the data and define a new line with those points. Using this fitting strategy, we could compute our quality index for the line defined by each such pair of points and pick the line that has the smallest value of q. This would be a very time-consuming process, however, and we run a great risk of choosing a model that is heavily biased due to random artifacts in the data.

In a classical statistical sense, a technique called regression analysis is used to fit a straight line through the data. This technique is based, very classically, on a quadratic loss function, to wit:

(5) q = Σεi² = Σ(yi - β0 - β1xi)²

In this circumstance, we select estimates b0 and b1 for our parameters β0 and β1 that will minimize the squared error term q. To do this we will first differentiate q with respect to β0 and then with respect to β1.

(6) ∂q/∂β0 = -2 Σ(yi - β0 - β1xi)

(7) ∂q/∂β1 = -2 Σ xi(yi - β0 - β1xi)

Now, if we substitute our estimates b0, b1 for our parameters β0, β1 and set the differentials to zero, we have two equations in two unknowns, to wit:

(8) Σyi = nb0 + b1Σxi

(9) Σxiyi = b0Σxi + b1Σxi²

Solving first for b1, we find that

(10) b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

and that

(11) b0 = ȳ - b1x̄

There are still other alternatives for our loss function. We could also consider a host of loss functions in polynomials of ε. The choice of the quadratic loss function is simply one of inertia. There is a wealth of historical effort and energy that has been invested in this area of least squares linear fitting of data. In addition, this is not a text in statistics. Suffice it to say that we will use a quadratic loss function, not because it is necessarily the best for our modeling concerns in software engineering, but strictly out of convenience at this point.

We will now revisit the data shown in Exhibit 2 and recompute a new line through this data using the least squares fit technique. The requisite calculations for our estimator of the slope, b1, of the regression line are shown in Exhibit 6. We will compute b1 as follows:

(12) b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 370.2 / 4315.6 ≈ 0.0858

Exhibit 6: Slope Calculation Data

Module   yi    xi     (xi - x̄)²   (xi - x̄)(yi - ȳ)
1        3     82     795.2       53.6
2        3     87     538.2       44.1
3        4     94     262.4       14.6
4        4     98     148.8       11.0
5        2     105    27.0        15.1
6        6     108    4.8         -2.4
7        5     120    96.0        1.0
8        6     125    219.0       16.3
9        7     130    392.0       41.6
10       9     153    1831.8      175.5
Sum      49    1102   4315.6      370.2

Next we will find the intercept for the line:

(13) b0 = ȳ - b1x̄ = 4.9 - 0.0858 × 110.2 ≈ -4.55

Thus, our final least squares model for predicting faults from lines of code is:

(14) ŷi = -4.55 + 0.0858xi
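A minimal NumPy sketch that reproduces this fit from the Exhibit 2 data, using the deviation sums of Exhibit 6:

    import numpy as np

    faults = np.array([3, 3, 4, 4, 2, 6, 5, 6, 7, 9], dtype=float)
    loc    = np.array([82, 87, 94, 98, 105, 108, 120, 125, 130, 153], dtype=float)

    x_bar, y_bar = loc.mean(), faults.mean()

    # Slope: sum of cross-deviations over sum of squared x deviations (the Exhibit 6 sums).
    sxy = np.sum((loc - x_bar) * (faults - y_bar))   # roughly 370.2
    sxx = np.sum((loc - x_bar) ** 2)                 # roughly 4315.6
    b1 = sxy / sxx                                   # roughly 0.086
    b0 = y_bar - b1 * x_bar                          # roughly -4.55

    predicted = b0 + b1 * loc
    print("b1 =", round(b1, 4), " b0 =", round(b0, 2))
    print("sum of squared residuals:", round(np.sum((faults - predicted) ** 2), 2))  # roughly 9.14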

To compare the new least squares fit with the first model we developed, we will need to compute the residuals for each model. This comparison is shown in Exhibit 7. Using the least squares criterion for model evaluation, Model 2 is clearly better than Model 1 in that the sum of the squared residuals for Model 2 is 9.14 as opposed to 16.51 for Model 1. Both models would have been considerably better off without module 5 in them. This residual value is very large for both models. Module 5 is an outlier. There is a great temptation, at this point, to toss out module 5 altogether. It appears to be corrupting our model. Actually, the contrary is true. It is perhaps our most important data point. There is real information in this datum. We do not know what it is but we should certainly take the time to find out.

Exhibit 7: Residual Analysis of the Two Models

Module   yi    Model 1 yi - ŷi   Model 1 (yi - ŷi)²   Model 2 yi - ŷi   Model 2 (yi - ŷi)²
1        3     -0.65             0.42                 0.52              0.27
2        3     -0.87             0.75                 0.09              0.01
3        4     -0.18             0.03                 0.49              0.24
4        4     -0.36             0.13                 0.15              0.02
5        2     -2.67             7.12                 -2.45             6.02
6        6     1.20              1.43                 1.29              1.66
7        5     -0.34             0.11                 -0.74             0.55
8        6     0.44              0.20                 -0.17             0.03
9        7     1.22              1.49                 0.40              0.16
10       9     2.20              4.83                 0.43              0.18
Sum      49    0.0               16.51                0.0               9.14

We must investigate four very important issues before we take our new model seriously and use it for the evaluation of all future code development projects. First, we need to know how well the line we have chosen fits the data. Second, we need to know whether the line we have found is really better than any other line obtained by chance. Third, and most important, we must develop an estimate for the predictive validity of this model. It is clear that we can plug a value for lines of code into this model and get out a value for the anticipated number of faults in the module, but we really do not know at this point that what we have done is meaningful. Finally, we need to establish that we have found the right model for our data. It is quite possible that we could fit a linear model to data that are intrinsically nonlinear. What is worse is that we could have fit a model to data that were severely corrupted with measurement error. Just because we developed a model does not mean that our modeling task is over. Quite the contrary, it has just now begun.

In Chapter 2 we discussed the conduct of scientific inquiry. It is an iterative process. We will formulate a hypothesis about the nature of the world. We will conduct an experiment to gather data to explore our hypothesis. From this data we will build a model. We will apply the model to the data to see what went wrong. In the case of regression modeling, the residuals will disclose our inadequacies. We must learn to listen to what these residual values are telling us. They will reveal to us our next hypothesis in the conduct of our inquiry. The bottom line is that the model is not the end of the process; it is the beginning of the next cycle of the process of scientific inquiry. No model in the history of science has ever been capable of perfect prediction. That is not our goal. What is really of interest to us is why the model really did not work. All of this information is in the residuals. We must learn how to tease it out. The model is not our objective; the residuals are. That is where the gold is.

7.2.2 The Regression ANOVA

The least squares regression model that we just developed

(15) ŷi = -4.55 + 0.0858xi

is of value to us if it has credibility. We would like to be able to make some assertions about the quality of prediction of this model. That is, is the model that we developed better than one that we would have developed by chance? To answer this question we will use a statistical procedure called analysis of variance. Observe that:

(16) yi - ȳ = (yi - ŷi) + (ŷi - ȳ)

We now square both sides of this equation such that:

(17) (yi - ȳ)² = [(yi - ŷi) + (ŷi - ȳ)]²

Now, if we sum these squared terms over all of the observations, we obtain:

(18) Σ(yi - ȳ)² = Σ[(yi - ŷi) + (ŷi - ȳ)]²

Now we expand the right-hand side. The cross product terms will drop out because they always do (see Appendix 1), and we obtain:

(19) Σ(yi - ȳ)² = Σ(yi - ŷi)² + Σ(ŷi - ȳ)²

which can be rewritten as:

(20) SSTOT = SSE + SSR

The left-hand side of this equation is the sum of squares about the mean of the observed data. This is the variance term for the variance in the ys. The first term on the right-hand side is the sum of squares error. It is the residual or error variance. The second term on the right-hand side of the equation is the sum of squares due to regression. This is the variance term for the predicted data.

Each of the sums of squares has an associated number called degrees of freedom. Let us observe that the term ȳ is a constant. We know its value. If we look at the deviation scores yi - ȳ, we know that their sum must be zero, by definition. Now if we examine these terms one at a time from y1 to yn-1, we simply cannot predict what any of these values might be. However, when we know all of the values for y1 to yn-1, then we will automatically know the final one, yn. There will be no new information in the value of yn. While the first n - 1 values of y are free to vary, the last one, yn, will be determined precisely by the other n - 1 values. Therefore, the sum of the deviation scores squared has n - 1 degrees of freedom.

In that

(21) SSR = Σ(ŷi - ȳ)² = b1²Σ(xi - x̄)²

Exhibit 8: The Regression ANOVA

Source              Sum of Squares   Degrees of Freedom   Mean Square           F-Ratio
Due to regression   SSR              1                    MSR = SSR / 1         F = MSR / MSE
Residual            SSE              n - 2                MSE = SSE / (n - 2)
Total               SSTOT            n - 1

it is a function of the single quantity b1 and therefore has but one degree of freedom. Hence, by subtraction we determine that the sum of squares about regression (error) has n - 2 degrees of freedom.

We can now construct the Analysis of Variance (ANOVA) table for the regression model. This is shown in Exhibit 8. The Mean Square column of this table is derived by dividing the sums of squares residual and regression by their respective degrees of freedom. The F statistic in the final column is the ratio of the mean squares due to regression to the mean square error term. This F statistic has one degree of freedom in the numerator and n - 2 degrees of freedom in the denominator.

Let us now take our sample data from Exhibit 2 and perform the regression analysis of variance for this data. The results of this analysis are shown in Exhibit 9. We would now like to ascertain whether our new model is different from one that Nature would have derived by chance alone. As per Chapter 2, we will have determined long before we collected data for this experiment what risk we were willing to take about this conjecture. Let us assume that it was one chance in 20. Then we would like to know at the 95 percent confidence level whether the observed F statistic is greater than F(0.95;1,8) = 5.32. Our calculated F at 27.80 is certainly much greater than this value. In that we decided a priori to set our Type I error to 5 percent, it would be at once gauche and naive to report that our F statistic was also significant at the 0.5 percent level. We are only concerned that our initial experimental criterion level of 5 percent was met.

Exhibit 9: Regression ANOVA for Sample Data

Source              Sum of Squares   Degrees of Freedom   Mean Square   F-Ratio
Due to regression   31.76            1                    31.76         F = 27.80
Residual            9.14             8                    1.14          p < 0.05
Total               40.90            9
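The figures in Exhibit 9 follow directly from the earlier fit; a small NumPy sketch of the ANOVA arithmetic, with the critical value F(0.95;1,8) = 5.32 taken from the text rather than computed:

    import numpy as np

    faults = np.array([3, 3, 4, 4, 2, 6, 5, 6, 7, 9], dtype=float)
    loc    = np.array([82, 87, 94, 98, 105, 108, 120, 125, 130, 153], dtype=float)

    # Least squares fit, as before.
    b1 = np.sum((loc - loc.mean()) * (faults - faults.mean())) / np.sum((loc - loc.mean()) ** 2)
    b0 = faults.mean() - b1 * loc.mean()
    predicted = b0 + b1 * loc

    n = len(faults)
    ss_tot = np.sum((faults - faults.mean()) ** 2)   # roughly 40.90
    ss_err = np.sum((faults - predicted) ** 2)       # roughly 9.14
    ss_reg = ss_tot - ss_err                         # roughly 31.76

    f_ratio = (ss_reg / 1) / (ss_err / (n - 2))      # roughly 27.8
    r_squared = ss_reg / ss_tot                      # roughly 0.78

    print("F =", round(f_ratio, 2), "versus F(0.95;1,8) = 5.32")
    print("R^2 =", round(r_squared, 2))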

Exhibit 10: Sample Data from Four Experiments

Case 1          Case 2          Case 3          Case 4
x     y         x     y         x     y         x     y
10    8.04      10    9.14      10    7.46      8     6.58
8     6.95      8     8.14      8     6.77      8     5.76
13    7.58      13    8.74      13    12.74     8     7.71
9     8.81      9     8.77      9     7.11      8     8.84
11    8.33      11    9.26      11    7.81      8     8.47
14    9.96      14    8.1       14    8.84      8     7.04
6     7.24      6     6.13      6     6.08      8     5.25
4     4.26      4     3.1       4     5.39      19    12.5
12    10.84     12    9.13      12    8.15      8     5.56
7     4.82      7     7.26      7     6.42      8     7.91
5     5.68      5     4.74      5     5.73      8     6.89

Exhibit 11: Regression ANOVA for Four Experiments

Case   Source       Sum of Squares   Degrees of Freedom   Mean Square   F-Ratio
1      Regression   27.5             1                    27.5          F = 27.5
       Residual     13.8             9                    1.5
       Total        41.3             10
2      Regression   27.5             1                    27.5          F = 27.5
       Residual     13.8             9                    1.5
       Total        41.3             10
3      Regression   27.5             1                    27.5          F = 27.5
       Residual     13.8             9                    1.5
       Total        41.3             10
4      Regression   27.5             1                    27.5          F = 27.5
       Residual     13.8             9                    1.5
       Total        41.3             10

We now know that our regression line was satisfactory. The next question is how well the model performed. We can compute the coefficient of determination as R2 = SSR / SSTOT. This will give us the proportion of variation of y explained by the regression equation. Clearly, if the residual sum of squares, or error sum of squares, is zero, then R2 = 1.0. In the example above, R2 = 31.76 / 40.90 = 0.78. This says that we are able to explain about 78 percent of the variation in y with a concomitant variation in our independent variable x.

We must be very careful about what we think that we have learned from the regression ANOVA. Consider the data presented in Exhibit 10 from Tufte. [1] Here are data from four successive hypothetical experiments. When we compute the least squares regression line for each of these cases, we find that y = 0.5x + 3. The models are identical. Furthermore, when we perform the analysis of variance for each of these cases, we find that the results of these analyses are all identical. They all account for exactly the same amount of variation in the dependent variable. Yet, the data are not identical (see Exhibit 10). They are very different from case to case.
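This can be verified directly from the Exhibit 10 values; a short sketch using np.polyfit, which returns the slope and intercept of the least squares line for each case:

    import numpy as np

    # The four (x, y) data sets of Exhibit 10.
    cases = {
        1: ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
        2: ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
        3: ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
        4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
    }

    for case, (x, y) in cases.items():
        slope, intercept = np.polyfit(np.array(x, float), np.array(y, float), 1)
        # Every case yields essentially the same line, y = 0.5x + 3.
        print(f"Case {case}: y = {slope:.2f}x + {intercept:.2f}")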

There must be something that we have missed in our model analysis. Indeed there is. Perhaps the most important step in modeling is a detailed examination of the residuals. As amazing as it might seem, we should be more interested in the deviations from our predictions than in our successful predictions.

7.2.3 Residual Analysis

In the great scheme of things, if we were to create a model where all the values of the dependent variable had fallen on the regression line, the residuals would all have been zero. Given that there is some error introduced by the observation process alone, there should be no functional relationship between the errors and our independent variable. That is, ε and x will be unrelated. Now look at the residuals for the data on the four hypothetical experiments shown in Exhibit 10.

The residuals for the four cases are shown in Exhibit 12. If the εs are unrelated to the xs, then there should be no particular pattern in the residuals. We can explore this conjecture visually by plotting the residuals against the independent variable x. These plots are shown in Exhibits 13 through 16.

Exhibit 12: Residual Values and Independent Variable

Case 1            Case 2            Case 3            Case 4
x       yi - ŷi   x       yi - ŷi   x       yi - ŷi   x       yi - ŷi
4.00    -0.74     4.00    -1.90     4.00    0.39      19.00   0.00
5.00    0.18      5.00    -0.76     5.00    0.23      8.00    -0.11
6.00    1.24      6.00    0.13      6.00    0.08      8.00    -1.75
7.00    -1.68     7.00    0.76      7.00    -0.08     8.00    0.91
8.00    -0.05     8.00    1.14      8.00    -0.23     8.00    -1.24
9.00    1.31      9.00    1.27      9.00    -0.39     8.00    1.84
10.00   0.04      10.00   1.14      10.00   -0.54     8.00    -0.42
11.00   -0.17     11.00   0.76      11.00   -0.69     8.00    1.47
12.00   1.84      12.00   0.13      12.00   -0.85     8.00    -1.44
13.00   -1.92     13.00   -0.76     13.00   3.24      8.00    0.71
14.00   -0.04     14.00   -1.90     14.00   -1.16     8.00    0.04

The first observation that we can make about the residual data shown in Exhibits 13 through 16 is that they are fantastically different. This is particularly astonishing in that everything that we have done in model construction and ANOVA seems to suggest that these data are very similar. Obviously, there is much more to the modeling process than just building the model and evaluating it with the ANOVA. That is only the first step. The residual analysis is where the model building fun really begins. Let us begin this process with Exhibit 13 for Case 1.

Exhibit 13: Residual Values for Case 1


If the residuals were truly random with respect to the independent variable, there would be no clear pattern in these data. That is certainly not the case for Exhibits 14 through 16. The data in these figures are anything but random. In Exhibit 13, however, there is no clear relationship between the residuals and the independent variable. The correlation coefficient between them is essentially zero. There is an apparent tendency for the residuals to grow larger as the independent variable increases. However, if we compute the correlation coefficient of |ε| with x, it is 0.08, which is not significantly greater than zero. That is, the 95 percent confidence interval for this correlation includes the value zero. This leads us to discard the conjecture that the residuals are becoming larger as x becomes larger.

If, on the other hand, we compute the correlation coefficient between ε and y, we find that there is a significant correlation between these variables at 0.58, indicating a linear relationship between the two. We are deeply concerned about these kinds of relationships because they are leading indicators of a flawed measurement process. If the residuals grow with respect to either the dependent variable or an independent variable, then it is highly likely that errors are creeping in because of our measurement tool. Let us suppose, for example, that the algorithm that we use to measure the executable statement count underreports these values by 5 to 10 percent. Then the measurements taken on larger program modules will have greater error.
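A brief sketch of these correlation checks for Case 1, with the residuals taken from the y = 0.5x + 3 fit above; the values of roughly 0.08 and 0.58 quoted in the text are what we would expect to see, give or take rounding:

    import numpy as np

    x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
    y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)

    # Correlation of |residual| with x: should not differ significantly from zero.
    print("corr(|e|, x):", round(np.corrcoef(np.abs(residuals), x)[0, 1], 2))
    # Correlation of the residual with y: the text reports a significant positive value.
    print("corr(e, y):  ", round(np.corrcoef(residuals, y)[0, 1], 2))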

There are a host of different techniques that we can apply to determine whether there are trends in the residuals, either with respect to the dependent variable or the independent variable. There are nonparametric procedures of sign tests and run tests. Autocorrelation is also a very useful technique for exploring trends in residuals. In every case, we are trying to eliminate the possibility that we have introduced systematic trends in these residuals, either by the formulation of the wrong hypothesis or by measurement error.

The residuals shown in Exhibit 14 for Case 2 show a very clear trend. It is clear that these residuals are nonlinear with respect to x, and that a second-order polynomial model would probably fit these data much better. We may have had the wrong hypothesis in mind when we developed the experiment. What we cannot do is rerun this data through another model. We used these data once and now the juice is out of them. We really need to do some soul searching about the data themselves. There are three points in these data that are particularly relevant: the maximum value and the two minima. We need to extract these points and study very carefully the circumstance that created them. If, after close scrutiny, we can see a distinct reason for why there is a quadratic relationship between the dependent and independent variable, then we will reformulate a new quadratic hypothesis, collect new data from a new experiment, and then fit a new model. The worst thing that we could do is to revisit this data with a new model and then invent an hypothesis to go with the new model. That methodology will fall well outside the realm of valid scientific investigation.

Exhibit 14: Residual Values for Case 2


The real danger of fitting a new model to used data is that the quadratic trend that we observed in the data may be a measurement artifact. We could imagine, for example, that we were using a thermocouple in this experiment that took a while to warm up; hence the rising slope of the first residual data points. Then, after our thermocouple got warm, it underwent some physical change that caused its resistance to drop over time. This is an artifact of the measurement process. The underlying problem that we are investigating might, in fact, be a simple linear relationship. If we fit a nonlinear model to this data, we will be learning about our measurement tools and not about the very thing that we wish to model. We will only learn about the true cause of this trend in the residuals if we do the necessary forensic investigation on the data themselves.

The residuals for Case 3 are shown in Exhibit 15. Here is a very disturbing trend. This pattern was created by the single point that falls way off the line of the other points, the outlier. If it were not for this outlier, we would have a pretty good model. We find that if we remove this point (13,12.74) from the original data set and recompute the regression line, we will have a perfect line. The residuals all become zero. It is very tempting to remove this offending point. After all, what is one point? Then we would have a perfect experimental outcome.

Exhibit 15: Residual Values for Case 3


We simply cannot remove data points because they disturb the aesthetics of our model and our experiment. Sometimes, these outliers are very poignant. We might well be observing a very important but also very tenuous phenomenon in our data. Because of our limited sampling, we only saw one instance of this phenomenon that may contain more information than all of the other points put together. To cast this point out summarily would be a crime against science. Our outlier is perhaps the most important point of the lot. Something either went heroically wrong in our observation/measurement process or something very important happened once and only once. Again, we need to do a thorough forensic analysis on the point (13,12.74) to try to discover why it was an outlier. If in the process of this investigation we were to discover that we had simply made a big mistake in our observation process and had evidence to support that conclusion, then and only then could we remove this point from the set of legitimate observations.

Finally, there is Case 4 shown in Exhibit 16. Except in the case of our one outlier, there is a perfect nonrelationship between x and y. With the exception of the outlier, y varies while x is constant. These residual data suggest that there is almost no predictive capability in the model. It is time to formulate a new hypothesis. Again, it will be well worth our while to see just exactly why x has the value of 19 once when all of the other observations were 8. We can only analyze this if we have preserved the circumstances that surrounded the original measurement during the experiment.

Exhibit 16: Residual Values for Case 4


Let us remember that the regression ANOVA and the linear regression models for each of these cases were identical. Only on the examination of the residuals did we really begin to understand what each of these experimental outcomes really was. Essentially, the residuals were the experimental outcome.

7.2.4 Multiple Linear Regression

Unfortunately, the world we live in is very complicated. Software faults, for example, are not evenly distributed in source code. If they were, we would only need simple univariate linear regression between our independent variable of, say, lines of code and our dependent variable of software faults. People inject these faults into code for a variety of different reasons. Sometimes, developers make errors on predicate clauses. Sometimes, they make errors on the declaration or use of data structures. Errors are made for a variety of reasons. Therefore, we will need to measure a variety of different program attributes to get a reasonable handle on the faults themselves. We will require much more sophisticated models to cope with the simultaneous variation in each of these attributes. To this end we will employ a more general first-order linear model in multiple independent variables as follows:

(22) yi = β0 + β1x1i + β2x2i + ... + βpxpi + εi

The least squares estimates for the parameters of this equation are obtained from:

(23) b = (XTX)-1XTY

where X is an n × (p + 1) matrix of n observations on each of the p independent variables and XT is its transpose. The first column of X is a vector of n ones and Y is a vector of n observations on the dependent variable. The vector, b, has (p + 1) elements and will contain the estimates for the parameters β0, β1, β2, ...,βp.

The sums of squares for the multiple linear regression model are:

(24) SSTOT = YTY - (1/n)YTUUTY

(25) SSR = bTXTY - (1/n)YTUUTY

(26) SSE = (Y - Ŷ)T(Y - Ŷ)

where

(27) Ŷ = Xb

and U is an n × 1 element vector containing all 1s. As was the case for simple linear regression, the SSTOT has n-1 degrees of freedom. There are now p parameters to be estimated in this new model. That means that the sum of squares due to regression, SSR, will have p degrees of freedom. By subtraction, this means that the sum of squares error, SSE, will have n - p - 1 degrees of freedom. The ANOVA for the multiple linear regression is shown in Exhibit 17.

Exhibit 17: ANOVA for Multiple Regression

Source              Sum of Squares   Degrees of Freedom   Mean Square               F-Ratio
Due to regression   SSR              p                    MSR = SSR / p             Fc = MSR / MSE
Residual            SSE              n - p - 1            MSE = SSE / (n - p - 1)
Total               SSTOT            n - 1
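A minimal NumPy sketch of these matrix computations, using a small made-up data set with two independent variables (the values are illustrative only, not the data analyzed later in this section):

    import numpy as np

    # Illustrative data: two independent variables and a fault count for ten modules.
    x1 = np.array([82, 87, 94, 98, 105, 108, 120, 125, 130, 153], dtype=float)
    x2 = np.array([10, 12, 11, 15, 14, 18, 20, 22, 25, 30], dtype=float)
    y  = np.array([3, 3, 4, 4, 2, 6, 5, 6, 7, 9], dtype=float)

    n, p = len(y), 2
    # Design matrix X: a leading column of ones, then one column per independent variable.
    X = np.column_stack([np.ones(n), x1, x2])

    # b = (X^T X)^-1 X^T Y
    b = np.linalg.inv(X.T @ X) @ X.T @ y
    y_hat = X @ b

    ss_tot = np.sum((y - y.mean()) ** 2)
    ss_err = np.sum((y - y_hat) ** 2)
    ss_reg = ss_tot - ss_err

    f_c = (ss_reg / p) / (ss_err / (n - p - 1))
    print("b =", np.round(b, 3))
    print("R^2 =", round(ss_reg / ss_tot, 2), " Fc =", round(f_c, 2))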

In the simplest case of hypothesis testing, either the regression equation explains a significant amount of variation in the dependent variable or it does not. Hence, we will choose between the two alternate hypotheses as follows:

(28) H0: β1 = β2 = ... = βp = 0 versus H1: βj ≠ 0 for at least one j

depending on the computed value of the F statistic, Fc. Given that we have determined α to be 0.05 as before, then we will accept:

(29) H0: β1 = β2 = ... = βp = 0, if Fc ≤ F(0.95; p, n - p - 1)

Otherwise,

(30) H1: βj ≠ 0 for at least one j, if Fc > F(0.95; p, n - p - 1)

Again, just because we have rejected the null hypothesis does not mean that we have produced a useful regression model. Its utility must be measured in terms of its ability to predict the future. However, we can evaluate the quality of the fit of the model with the coefficient of multiple determination that we discussed earlier as R2 = SSR / SSTOT.

Now that we know how, we can easily build very complex regression models with multiple independent variables. For small data sets, we can very rapidly reach a point of diminishing returns. Observe that Fc has p degrees of freedom in the numerator and n - p - 1 degrees of freedom in the denominator. As we add terms to the regression equation, the degrees of freedom numerator rises and the degrees of freedom denominator falls. The critical value of the F statistic begins to fall as the numerator for Fc rises. More is not necessarily better.

We will now begin to explore our new modeling knowledge with two random samples of data from program modules drawn from the Space Shuttle PASS. The first sample will consist of data from 100 program modules. The second sample, the validation sample, will consist of data from 50 program modules. This second data set will be used to validate the models that we will create using the first data set. It is called a hold-back or confirmatory data set. The problem with predictive models is that they predict the past. They have been developed from data we collected in the past. We do not know how well they will predict new data that is not part of the original data set used to develop the model.

Regression models were developed for the original 100 observations. New models were created that systematically added each of the metrics. Thirteen such models were developed. These are shown in Exhibit 18. The first column of this table is the constant term. The first model in this table is the simple linear regression of the dependent variable, DR-Count (the count of the number of Discrepancy Reports that were filed for each program module), on η1. This table shows 13 of the 8191 possible models that can be built from combinations of these 13 metrics. It is quite possible that the best model is not shown here. In the next section we will look at techniques to help us find the best model.

Exhibit 18: Regression Models for the 13 Independent Variables

Cons     η1      η2      N1      N2      Exec    LOC     Nodes    Edges   Paths   Cycles   Max Paths   Ave Paths   DS
-0.841   0.156
-0.346   0.062   0.018
0.450    0.018   -0.020  0.007
0.451    0.017   -0.017  0.008   -0.002
0.342    0.045   -0.043  -0.004  0.014   0.029
-0.178   0.028   -0.040  -0.005  0.014   0.015   0.012
-0.152   0.025   -0.040  -0.004  0.014   0.013   0.012   0.003
-0.020   0.040   -0.038  -0.001  0.011   0.001   0.015   -0.150   0.107
-0.004   0.040   -0.039  -0.001  0.011   0.001   0.015   -0.154   0.110   0.000
0.067    0.078   -0.059  -0.009  0.021   0.021   0.016   -0.346   0.264   0.000   -0.399
0.143    0.076   -0.057  -0.010  0.022   0.023   0.015   -0.287   0.229   0.000   -0.169   -0.029
0.132    0.078   -0.058  -0.009  0.022   0.022   0.015   -0.284   0.226   0.000   -0.171   -0.041   0.014
0.100    0.074   -0.054  -0.008  0.019   0.018   0.014   -0.298   0.239   0.000   -0.194   -0.046   0.017       0.009

The regression ANOVA for each of the 13 regression models is shown in Exhibit 19. Here we can see that the regression degrees of freedom are increasing from 1 to 13. At the same time that the degrees of freedom due to regression (numerator) are increasing, the degrees of freedom residual (denominator) are declining. The calculated F statistic, Fc, rises fairly steadily to the sixth model and then declines. The criterion F statistic for α < 0.05 is shown in the last column. It falls steadily as the degree of freedom numerator increases and degrees of freedom denominator declines.

Exhibit 19: Regression ANOVAs for 13 Models

Model   Source       Sum of Squares   d.f.   Mean Square   Fc      F(0.95;n,d)
1       Regression   314.23           1      314.23        18.74   3.96
        Residual     1643.48          98     16.77
2       Regression   517.74           2      258.87        17.44   3.11
        Residual     1439.97          97     14.85
3       Regression   1015.92          3      338.64        34.52   2.72
        Residual     941.79           96     9.81
4       Regression   1016.39          4      254.10        25.64   2.49
        Residual     941.32           95     9.91
5       Regression   1060.01          5      212.00        22.10   2.33
        Residual     897.70           94     9.55
6       Regression   1492.06          6      248.68        49.67   2.22
        Residual     465.65           93     5.01
7       Regression   1493.32          7      213.33        42.26   2.13
        Residual     464.39           92     5.05
8       Regression   1507.92          8      188.49        38.14   2.06
        Residual     449.79           91     4.94
9       Regression   1508.14          9      167.57        33.55   2.00
        Residual     449.56           90     4.10
10      Regression   1585.34          10     158.53        37.89   1.95
        Residual     372.37           89     4.18
11      Regression   1594.12          11     144.92        35.08   1.91
        Residual     363.59           88     4.13
12      Regression   1594.25          12     132.85        31.80   1.87
        Residual     363.46           87     4.18
13      Regression   1604.87          13     123.451       30.09   1.83
        Residual     352.85           86     4.10

Now let us turn our attention to the performance of the subset of regression models that we have evaluated. First, we will assess the quality of fit of each of the models. This is given to us by the R2 term in Exhibit 20. It would appear that R2 is a monotonically increasing function of the number of terms in the model. It improves to the point that with all 13 metrics in the model we appear to be able to account for about 82 percent of the variation in the program module DR-Count.

Exhibit 20: Regression Model Performance

Model   R2     MSEFit   MSEPred
1       0.16   16.28    39.58
2       0.26   14.28    34.85
3       0.52   9.47     44.66
4       0.52   9.36     44.38
5       0.54   8.95     38.40
6       0.76   4.71     40.51
7       0.76   4.77     41.72
8       0.77   4.55     50.80
9       0.77   4.57     51.23
10      0.81   3.74     47.19
11      0.81   3.63     47.76
12      0.81   3.72     49.48
13      0.82   3.54     52.29

Another measure of the fit of the regression model is the mean square error term calculated as:

(31) MSEFit = (1/n) Σ(yi - ŷi)²

where n in this case is 100. We can clearly see that there is a point of diminishing returns at model 6. That is, MSEFit declines rapidly and then stabilizes at about model 6. Even so, these are relatively small values for MSEFit. The models that we have developed fit the data fairly well.

7.2.5 Model Predictive Validity

The main problem with our predictive models is that they predict past events; that is, they are derived from events that have already taken place. If our objective is to understand functional relationships among independent variables and a dependent variable as they have existed, these models will do very well. The reason that most scientists develop predictive models is that they want to be able to predict future events. Predicting the past and predicting the future are two very different things. If we wish to use a model for future prediction, then we must attempt to validate its ability to do so.

Now let us turn our attention to the last column in Exhibit 20. This last column also contains mean square error terms, but this time they are derived from the hold-back data set. That is, the model developed with the first 100 observations was then applied to the 50 hold-back observations. The mean square error predictive term is calculated as:

(32) MSEPred = (1/m) Σ(y'j - ŷ'j)²

where m = 50, y'j is the DR-Count value from the hold-back data set, and ŷ'j is the value predicted for it by each of the models. Both the MSEFit and the MSEPred values for the 13 models are shown in Exhibit 21. We can see that there is a real disparity in the residual values for these two terms. The predicted values are much larger than those obtained by the model.

Exhibit 21: The Original and Predicted MSE Values


The moral of this story is that a good model fit does not mean that the model is of value for future prediction. With the kind of MSEPred values that we have seen from these 13 models, the derived models were of very limited value. They fit the data well but they failed utterly to give us a quality future predictive ability. It is clear that if the model were to yield the best possible future prediction, then MSEFit = MSEPred. To the extent that MSEPred > MSEFit, then the predictive validity of the model is not good.
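As a rough sketch of this comparison, the following fragment fits a simple model on one (made-up) data set and then scores it both on that set and on a separate (made-up) hold-back set; the data here are synthetic stand-ins, not the PASS measurements.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical stand-ins for the modeling data set (n = 100) and the hold-back set (m = 50).
    x_fit  = rng.uniform(50, 300, 100)
    y_fit  = 0.05 * x_fit + rng.normal(0, 2.0, 100)
    x_hold = rng.uniform(50, 300, 50)
    y_hold = 0.05 * x_hold + rng.normal(0, 2.0, 50)

    # Fit on the first data set only.
    b1, b0 = np.polyfit(x_fit, y_fit, 1)

    mse_fit  = np.mean((y_fit  - (b0 + b1 * x_fit)) ** 2)
    mse_pred = np.mean((y_hold - (b0 + b1 * x_hold)) ** 2)
    print("MSE fit: ", round(mse_fit, 2))
    print("MSE pred:", round(mse_pred, 2))   # typically larger than the fit MSE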

In Exhibit 21, we had the fortune to have enough data that we could generate a hold-back data set to validate each of the models. Sometimes, the data are much too costly and too sparse. We are obliged to use every last datum in constructing the model. This does not preclude our assessment of the predictive validity of the model. In this case we can use the PRESS statistic to give an assessment of the future prediction potential for a model. To compute the PRESS statistic, we will select and remove one data point from an initial set of n data points and construct our predictive model from the remaining n - 1 observations. We then compute the residual value of the one point that was held back from the model. Let y(1) represent the y value of the data point held back from the first model. Then, y(1) - ŷ(1) will represent the residual value of this dependent variable value when the first model is applied to the first hold-back data point.

The PRESS statistic, then, is computed as follows:

(33) PRESS = Σi (y(i) - ŷ(i))²

A good model with good future predictive potential will have a small value of the PRESS statistic.
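A minimal leave-one-out sketch of the PRESS computation for a simple linear model, using the Exhibit 2 data: each observation is held out in turn, the line is refit on the remaining points, and the squared prediction error for the held-out point is accumulated.

    import numpy as np

    faults = np.array([3, 3, 4, 4, 2, 6, 5, 6, 7, 9], dtype=float)
    loc    = np.array([82, 87, 94, 98, 105, 108, 120, 125, 130, 153], dtype=float)

    press = 0.0
    for i in range(len(faults)):
        # Hold back observation i and refit the line on the remaining n - 1 points.
        mask = np.arange(len(faults)) != i
        b1, b0 = np.polyfit(loc[mask], faults[mask], 1)
        press += (faults[i] - (b0 + b1 * loc[i])) ** 2

    print("PRESS:", round(press, 2))   # smaller values suggest better predictive potential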

7.2.6 Selecting the Best Regression Model

As mentioned in passing in the previous section, with 13 independent variables it is possible to construct 8191 models from all possible combinations of variables. Some of these models will be better than others in terms of their quality of fit. We could evaluate each of these models with any one of several criteria. For example, one such criterion would be to select the model with the best R2. Our experience with the Space Shuttle data example should make us a little shy about this sole criterion. Of the 13 models that we built for that example, the one with the highest R2 had the worst MSEPred. Another alternative would be to select the model with the highest calculated Fc statistic.

There are several methodologies available that will short-circuit this grueling process for us. They are iterative processes that build successively better models. They are the step-wise model building techniques. There are procedures for the forward step-wise process wherein we will start with a single independent variable and then add new variables until a maximum value of a criterion measure is met. There are also backward elimination step-wise procedures that start with all of the independent variables in the model and systematically remove those independent variables that do not contribute significantly to the model. These procedures are well beyond the scope of this text. They are commonly available in most statistical software packages.
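As a rough illustration only (the details are left to the statistical packages), a toy forward-selection loop might look like the following; at each step the candidate variable that most improves R2 is added, until no candidate improves it by a chosen threshold. The variable names, data, and threshold here are all made up.

    import numpy as np

    def r_squared(columns, y):
        # R^2 of the least squares fit of y on a constant plus the given columns.
        X = np.column_stack([np.ones(len(y))] + columns)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(7)
    n = 60
    # Made-up candidate metrics; only the first two actually drive the dependent variable.
    candidates = {name: rng.normal(size=n) for name in ["m1", "m2", "m3", "m4"]}
    y = 2.0 * candidates["m1"] + 1.0 * candidates["m2"] + rng.normal(0, 0.5, n)

    selected, best_r2, threshold = [], 0.0, 0.01
    while True:
        scores = {name: r_squared([candidates[s] for s in selected] + [col], y)
                  for name, col in candidates.items() if name not in selected}
        if not scores:
            break
        best = max(scores, key=scores.get)
        if scores[best] - best_r2 < threshold:
            break
        selected.append(best)
        best_r2 = scores[best]

    print("selected:", selected, " R^2 =", round(best_r2, 3))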

7.2.7 Regression with Dummy Variates

Consider the data in Exhibit 2 with a special focus on the sources of variation in the dependent variable, faults. There may very well be sources of variation in these data that are not related to the independent variable, LOC. It turns out that the program modules in this example were written by two different people. The data are reproduced in Exhibit 22, this time with the developers that wrote each module. We can see in this exhibit, for example, that Mary wrote modules 1, 2, 3, 4, and 6. The remainder were written by Betty. Just looking at the data, it would appear that Betty had more faults reported than Mary. This might be due to the fact that Betty was responsible for more complex code. It might be due to the fact that Betty was more diligent in reporting her faults. The main point here is that we do not have enough information to know just why they may be different in regard to the number of faults that have been reported in their code. We would, however, like to control for the constant differences between them, if any.

Exhibit 22: Sample Data by Developer

Module   Faults   LOC   Mary   Betty
1        3        82    1      0
2        3        87    1      0
3        4        94    1      0
4        4        98    1      0
5        2        105   0      1
6        6        108   1      0
7        5        120   0      1
8        6        125   0      1
9        7        130   0      1
10       9        153   0      1

To initiate this control for the constant effect of the developer, we will create a new independent variable that has discrete values. This new variable will have the value of 1 if Mary wrote the code and a value of 0 if Betty wrote the code. This new variable is called a dummy variable. It has the discrete values of 1 or 0. When we build a new regression model with this dummy variate in it, the model coefficient for this independent variable will, in fact, represent the constant difference in faults in the modules that Mary has written.
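A short sketch of this model from the Exhibit 22 values, with the dummy column set to 1 for Mary's modules and 0 for Betty's; the coefficients of Exhibit 23 (roughly 0.14 for LOC and 2.69 for the Mary effect) should be reproduced, give or take rounding.

    import numpy as np

    faults = np.array([3, 3, 4, 4, 2, 6, 5, 6, 7, 9], dtype=float)
    loc    = np.array([82, 87, 94, 98, 105, 108, 120, 125, 130, 153], dtype=float)
    mary   = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0], dtype=float)   # 1 = Mary, 0 = Betty

    # Design matrix: constant, LOC, and the dummy variate for Mary.
    X = np.column_stack([np.ones(len(faults)), loc, mary])
    b, *_ = np.linalg.lstsq(X, faults, rcond=None)

    y_hat = X @ b
    r2 = 1.0 - np.sum((faults - y_hat) ** 2) / np.sum((faults - faults.mean()) ** 2)
    print("constant, LOC, Mary:", np.round(b, 2))
    print("R^2 =", round(r2, 2))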

The regression model for this data is shown in Exhibit 23. The regression coefficient for Mary is 2.69. This is the constant difference between the faults in Mary's code and Betty's code. On average, the modules written by Mary have roughly two and one half more faults in them than Betty's code. That is an interesting fact in and of itself. Looking further into the table we see a notable increase in R2 as well. The second model fits the data much better.

Exhibit 23: Regression Model with Dummy Variate

Constant   LOC    Mary   R2
-4.55      0.09          0.78
-11.53     0.14   2.69   0.94

We now turn our attention to the regression ANOVAs for the two models (Exhibit 24). The calculated F statistic (Fc) for the model with the dummy variate is much higher than the F statistic for the initial model. Clearly, we have identified and controlled an extraneous source of variation in the dependent variable that is affecting our model. By introducing the dummy variate to control for the constant difference between the fault reporting observed between Mary and Betty, we have greatly improved our understanding of the relationship between LOC and faults.

Exhibit 24: Regression ANOVA for Two Models

Model   Source       Sum of Squares   d.f.   Mean Square   Fc      F(0.95;n,d)
1       Regression   31.75            1      31.76         27.79   5.32
        Residual     9.14             8      1.14
2       Regression   38.57            2      19.29         58.04   4.74
        Residual     2.33             7      0.33

We cannot incorporate a dummy variate for both Betty and Mary in the same model because they are linearly dependent. The sum of the variable values for Betty and Mary will always be 1. A basic assumption in linear modeling is that the data vectors representing the independent variables will have no linear dependencies.

In a more general sense, if we wish to represent the constant differences among a group of m members, we will use m-1 dummy variates to do this. That is, if we had a team of six programmers and we wanted to control for constant differences for members of this team, we would create a model with five dummy variates in it. If we suspected that there is a constant difference between male and female developers, for example, we could create a male (or female) dummy variate. The coefficient of this variable in the regression equation would represent the constant difference between males and females. In this manner we are able to identify and control for a source of variance that would otherwise contribute noise to the regression model. The downside of the use of dummy variates is that they tend to represent a large contribution to the numerator degrees of freedom in the regression ANOVA.

[1]Tufte, E.R., The Visual Display of Quantitative Information, Graphics Press, Cheshire, CT, 1983.


