7.4 Problems Associated with Multicollinearity

The basic regression models are based on the assumption that the independent variables of the analysis are not linear compounds of one another and do not share a common element of variance. Two variables sharing a common element of variance are said to be collinear. Unfortunately, we know this assumption does not hold when software metrics serve as independent variables. As we have seen in Chapter 6, the correlation coefficients among our independent variables are very large, indicating a high degree of collinearity. In fact, we saw with our 13 metrics on the PASS data that there were only three distinct sources of variation, not 13.

There are several problems that a high degree of multicollinearity can cause in the modeling process. The main problem is that the regression coefficients produced by the model are not really representative of the relative contribution of a particular independent variable to the variation in the dependent variable. That is, the effect that a particular independent variable, say LOC, has on our dependent variable, Faults, depends on whether Exec is already in the model. We know that LOC and Exec are highly correlated. They measure essentially the same phenomenon of program size. Thus, if we were to build a model with LOC as a predictor of faults and then add Exec to this model, the regression coefficient for Exec would reflect only the marginal or partial effect that Exec contributes to the model.
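
To make this concrete, the short sketch below fits two regressions on simulated data (hypothetical values, not the PASS measurements): one with LOC alone and one with both LOC and a nearly collinear Exec. The coefficient attached to LOC changes markedly once Exec enters the model.

```python
import numpy as np

# Hypothetical illustration only: simulated values, not the PASS measurements.
rng = np.random.default_rng(0)
n = 100
loc = rng.normal(200.0, 50.0, n)                # lines of code
exec_cnt = 0.7 * loc + rng.normal(0.0, 5.0, n)  # executable statements, nearly collinear with LOC
faults = 0.02 * loc + rng.normal(0.0, 1.0, n)   # faults driven by program size

print("corr(LOC, Exec) =", round(np.corrcoef(loc, exec_cnt)[0, 1], 3))

def ols(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# LOC alone: its coefficient absorbs the whole size effect.
print("faults ~ LOC        :", ols(loc.reshape(-1, 1), faults).round(3))
# LOC plus Exec: each coefficient now carries only a marginal, unstable share.
print("faults ~ LOC + Exec :", ols(np.column_stack([loc, exec_cnt]), faults).round(3))
```

With a different random seed, the way the size effect is split between the two coefficients can shift substantially, which is exactly the capriciousness described next.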

Quite simply, multicollinearity among the independent variables undermines our ability to use the regression model coefficients in any meaningful way. With a given set of highly correlated independent variables, a model built from them will be capricious. We will gain little understanding of the particular contribution of each independent variable to the observed variation in the dependent variable. Yet this is the very reason we are modeling in the first place.

As we saw in Chapter 6, principal component analysis (PCA) can be used to detect and eliminate this collinearity in the software complexity metrics. When confronted with a large number of variables known to be highly correlated, it may be desirable to represent the set by some small number of variables that convey all or most of the information in the original set. The principal components are constructed so that they represent transformed domain scores on dimensions that are orthogonal.
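
As a rough illustration of this property, the following sketch (simulated metrics, not the PASS data) standardizes a small set of correlated measures, extracts principal components, and confirms that the resulting component scores are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical sketch: 100 modules measured on a few highly correlated raw metrics.
rng = np.random.default_rng(1)
size = rng.normal(size=100)
raw = np.column_stack([
    size + 0.1 * rng.normal(size=100),  # a size-type metric, e.g., LOC
    size + 0.1 * rng.normal(size=100),  # another size-type metric, e.g., Exec
    rng.normal(size=100),               # a metric from a different source of variation
])

# Standardize the raw metrics, then extract orthogonal principal components.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
scores = PCA(n_components=2).fit_transform(z)

# Rescale so each component (domain) score has mean 0 and standard deviation 1.
scores = scores / scores.std(axis=0)

# The domain scores are uncorrelated: off-diagonal correlations are essentially zero.
print(np.corrcoef(scores, rowvar=False).round(3))
```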

We will now apply PCA to our sample of 100 program modules from the PASS system so that we can eliminate the problem of multicollinearity in these data and build a better model. When we do this we get three factors, or principal components, just as in our analysis of the complete PASS data set in Chapter 6. In this reduced subset of program modules, the new principal components account for approximately 90 percent of the variance in the original set of 13 raw measures. We will use the factor scores, or domain scores, to build our regression model, with DR-Count as the dependent variable. The resulting model is shown in Exhibit 25. In this exhibit, the factors have been labeled Size, Control, and DS, just as they were in Chapter 6 for the complete data set. We can see that these new domain metrics account for approximately 55 percent of the variation in the dependent variable, DR-Count.

Exhibit 25: Regression Equation for Domain Scores

Constant   Size   Control   DS     R2
2.23       2.32   1.81      1.47   0.55
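
A minimal sketch of how such a model can be fit is given below. The domain scores and DR-Counts here are simulated stand-ins, not the PASS data, so the fitted coefficients will not reproduce Exhibit 25; the point is that with orthogonal, unit-variance predictors the coefficients can be read directly as relative contributions.

```python
import numpy as np

# Minimal sketch of fitting a model of the Exhibit 25 form. The domain scores and
# DR-Counts below are simulated stand-ins, not the PASS data, so the coefficients
# will not match the exhibit.
rng = np.random.default_rng(2)
domain = rng.normal(size=(100, 3))                  # orthogonal, unit-variance scores
dr_count = 2.0 + domain @ np.array([2.0, 1.5, 1.0]) + rng.normal(0.0, 3.0, 100)

X = np.column_stack([np.ones(100), domain])         # constant, Size, Control, DS
coef, *_ = np.linalg.lstsq(X, dr_count, rcond=None)

fitted = X @ coef
r2 = 1.0 - ((dr_count - fitted) ** 2).sum() / ((dr_count - dr_count.mean()) ** 2).sum()
print("constant, Size, Control, DS:", coef.round(2), " R2 =", round(r2, 2))
```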

The regression model presented in Exhibit 25 is very different from the models derived from these data and shown in Exhibit 18. This time, we can interpret the model coefficients directly. Remember, first of all, that the domain scores from the Size, Control, and Data Structures domains are z-scores. That is, they all have a mean of zero and a standard deviation of one. Thus, the coefficients shown in Exhibit 25 are on the same scale. They show the relative contribution of each of the new orthogonal independent variables to the variation observed in the DR-Count. The relative contribution of Size at 2.32 exceeds that of Control at 1.81, which in turn exceeds that of Data Structures at 1.47. The relative contribution of Data Structures to this model is somewhat surprising in that there is but one variable in this principal component from the original set of 13.

The regression ANOVA for the orthogonal domain scores model is shown in Exhibit 26. Remember that, in effect, all 13 variables of the original data set are present in this model in transformed form. Instead of 13 degrees of freedom in the numerator, we now have 3. The calculated F statistic is greater than that of all but two of the models from the original analysis shown in Exhibit 18.

Exhibit 26: Regression ANOVA for Domain Scores

Source       Sum of Squares   d.f.   Mean Square   Fc      F(0.95;n,d)
Regression   1071.91          3      357.31        38.72   2.72
Residual     885.80           96     9.23
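
The entries in Exhibit 26 can be checked directly: each mean square is the sum of squares divided by its degrees of freedom, and the calculated F is the ratio of the two mean squares. The snippet below performs this arithmetic and also looks up the 95th-percentile critical value for 3 and 96 degrees of freedom.

```python
from scipy.stats import f

# Arithmetic check of Exhibit 26.
ss_reg, df_reg = 1071.91, 3
ss_res, df_res = 885.80, 96

ms_reg = ss_reg / df_reg              # mean square regression, about 357.3
ms_res = ss_res / df_res              # mean square residual, about 9.23
f_calc = ms_reg / ms_res              # calculated F, about 38.7
f_crit = f.ppf(0.95, df_reg, df_res)  # 95th-percentile critical value, roughly 2.7

print(round(ms_reg, 2), round(ms_res, 2), round(f_calc, 2), round(f_crit, 2))
```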

Now it really gets interesting. The 13 metrics from the hold-back data set of 50 observations that we used earlier will now be converted to z-scores using the means and standard deviations of the metrics from the original data set of 100 observations. We will then multiply this new 50 × 13 matrix of z-scores by the transformation matrix generated by the original PCA, the same matrix that produced the domain scores for the original 100 observations, to create a new 50 × 3 matrix of domain scores for the hold-back data set. These domain scores can then be plugged into the model developed earlier and shown in Exhibit 25, giving a vector of 50 predicted DR-Counts for the hold-back data set. From these predictions we can compute the residuals and, from them, the MSEPred for the 50 hold-back observations. This value is shown in the last row of Exhibit 27.
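
The sketch below walks through these steps. The names train_raw, hold_raw, transform, coef, and hold_dr are placeholders, and the arrays are simulated stand-ins for the PASS data; only the procedure is meant to be illustrative.

```python
import numpy as np

# Sketch of the hold-back validation step, with placeholder names:
#   train_raw : 100 x 13 raw metrics used to build the model
#   hold_raw  :  50 x 13 raw metrics for the hold-back modules
#   transform : 13 x 3 matrix from the original PCA mapping z-scores to domain scores
#   coef      : fitted coefficients [constant, Size, Control, DS] from Exhibit 25
#   hold_dr   : observed DR-Counts for the 50 hold-back modules
rng = np.random.default_rng(3)
train_raw = rng.normal(size=(100, 13))
hold_raw = rng.normal(size=(50, 13))
transform = rng.normal(size=(13, 3))
coef = np.array([2.23, 2.32, 1.81, 1.47])
hold_dr = rng.poisson(5, 50).astype(float)

# z-score the hold-back metrics using the TRAINING means and standard deviations.
z_hold = (hold_raw - train_raw.mean(axis=0)) / train_raw.std(axis=0)

# Project onto the three orthogonal domains, then apply the regression model.
domain_hold = z_hold @ transform              # 50 x 3 matrix of domain scores
predicted = coef[0] + domain_hold @ coef[1:]  # 50 predicted DR-Counts

residuals = hold_dr - predicted
mse_pred = (residuals ** 2).mean()
print("MSE_pred =", round(mse_pred, 2))
```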

The MSEPred value for our new orthogonal model is better than that of any of the 13 models developed using the original raw data. The first five of those models are repeated in Exhibit 27 for comparison because they were the models with the best predictive value based on the MSEPred criterion.

Exhibit 27: Comparison of Domain Score Model with Raw Score Models

Model    R2     MSEFit   MSEPred
1        0.16   16.28    39.58
2        0.26   14.28    34.85
3        0.52   9.47     44.66
4        0.52   9.36     44.38
5        0.54   8.95     38.40
Domain   0.55   8.85     34.03

When we model with the orthogonal metric set, we achieve several objectives. First, we eliminate the problem of multicollinearity. This means that we are able to make a direct interpretation of the effect of each of the orthogonal domains on the dependent variable. Most important of all, we have eliminated sources of noise due to multicollinearity, so our model has better predictive validity. All of this, despite the fact that the new set of three orthogonal variables represents only 90 percent of the variation in the original set of 13 raw metrics.


