GOODNESS OF FIT OF THE MODEL


We have already discussed the importance of assessing how well the regression model actually fits the data. The REGRESSION command prints several statistics that describe this "goodness of fit." Here are some of the most common ones:

  • MULTIPLE R is just the absolute value of the correlation coefficient between the dependent variable and the single independent variable. It is also the correlation coefficient between the values predicted by the regression model and the actual observed values. If the value is close to one, the regression model fits the data well. If the value is close to zero, the regression model does not fit well.

  • Another way of looking at how well the regression model fits is to see what proportion of the total variability (or variance) in the dependent variable can be "explained" by the independent variable. The variability in the dependent variable is divided into two components: variability explained by the regression, and variability not explained by the regression. Because of the way they are calculated, these two components are termed sums of squares. Indeed, they are conceptually very similar to the sums of squares we mentioned at the end of Chapter 7. The sum of squares explained by the regression equation is labeled REGRESSION, while the unexplained variability is labeled RESIDUAL. You can obtain the total variability in the dependent variable by adding these two sums of squares together. To calculate what proportion of the total variability is explained by the regression, divide the regression sum of squares by the total sum of squares.

  • You can calculate this proportion in an easier way: simply square the correlation coefficient. This value is called R SQUARE. From R² we see what proportion of the variability in one variable can be explained, in the sample, by knowing something about the other variable.

  • Yet another way to test the null hypothesis that no linear relationship exists between the two variables is analysis of variance. The actual test is the F test. F is the ratio of the mean square for regression to the mean square for the residual, where each mean square is the corresponding sum of squares divided by its degrees of freedom. You can find these mean squares in the regression output, identified as Mean Square. If no linear relationship exists between the two variables, then each of these mean squares provides an estimate of the variance, or variability, of the dependent variable. If a linear relationship does exist, then the variability estimate based on the regression mean square will be much larger than the estimate based on the residuals. Large F values therefore suggest that a linear relationship exists between the two variables. (The sketch following this list computes each of these quantities from a small sample.)

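To make these quantities concrete, here is a minimal sketch in Python (using numpy) that computes MULTIPLE R, the REGRESSION and RESIDUAL sums of squares, R SQUARE, the mean squares, and the F ratio. The x and y values are made up for illustration; they are not the book's example data.

    # Goodness-of-fit quantities for a simple (one-predictor) regression.
    # The data below are illustrative only.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # independent variable
    y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3, 6.8, 8.2])   # dependent variable
    n = len(y)

    # Least-squares slope and intercept for y = a + b*x
    b, a = np.polyfit(x, y, 1)
    y_hat = a + b * x                        # values predicted by the model

    # MULTIPLE R: correlation between the observed and predicted values
    # (with a single predictor this equals the absolute value of r between x and y)
    multiple_r = np.corrcoef(y, y_hat)[0, 1]

    # Sums of squares: REGRESSION (explained) and RESIDUAL (unexplained)
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_regression = np.sum((y_hat - y.mean()) ** 2)
    ss_residual = np.sum((y - y_hat) ** 2)

    # R SQUARE: proportion of the total variability explained by the regression
    r_square = ss_regression / ss_total      # same as multiple_r ** 2

    # Mean squares and the F ratio from the analysis-of-variance table
    ms_regression = ss_regression / 1        # regression df = number of predictors = 1
    ms_residual = ss_residual / (n - 2)      # residual df = n - 2
    f_value = ms_regression / ms_residual

    print(multiple_r, r_square, f_value)
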
Is this F statistic related to the test that the slope is zero? It seems as though we are testing the same hypothesis in both situations. Yes, the two tests are evaluating exactly the same hypothesis when only one independent variable exists. In fact, a relationship exists between the two statistics. If you square the t value for the test that the slope is zero, you will come up with the F value in the analysis of variance table. (Try it.) For a simple equation such as this, you do not learn anything from the analysis of variance table that you did not already know from the test of the slope.
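
As a quick numerical check of this relationship, the short sketch below uses scipy.stats.linregress on the same made-up data; the squared t statistic for the slope matches the F value recovered from R².

    # Verify that t**2 for the slope equals F (one-predictor regression).
    # Illustrative data, same as the previous sketch.
    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3, 6.8, 8.2])

    res = stats.linregress(x, y)
    t_slope = res.slope / res.stderr                               # t for H0: slope = 0
    f_value = res.rvalue**2 / (1 - res.rvalue**2) * (len(x) - 2)   # F from the ANOVA table
    print(t_slope**2, f_value)                                     # the two values agree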

The STANDARD ERROR is yet another statistic that is available when you run a regression analysis. It is an estimate of the standard deviation of the distribution of the dependent variable at each value of the independent variable. Remember, we assumed that for each value of the independent variable there is a distribution of values of the dependent variable, that all of these distributions are normal, and that they all have the same standard deviation. The standard error is our estimate of that common standard deviation.
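
As a rough sketch, this standard error (often called the standard error of the estimate) is the square root of the residual mean square, i.e. the residual sum of squares divided by its n - 2 degrees of freedom. The data below are the same illustrative values used earlier.

    # Standard error of the estimate for a one-predictor regression.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3, 6.8, 8.2])
    n = len(y)

    b, a = np.polyfit(x, y, 1)
    residuals = y - (a + b * x)
    std_error = np.sqrt(np.sum(residuals**2) / (n - 2))   # residual df = n - 2
    print(std_error)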

Another powerful option is the ADJUSTED R SQUARE, which is most useful when you have a model with several independent variables. This statistic adjusts the value of R² to take into account the fact that a regression model always fits the particular data from which it was developed better than it will fit the population. When only one independent variable is present and the number of cases is reasonably large, the adjusted R² will be very close to the unadjusted value. (This statistic is very useful when we check for measurement error.)
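
A minimal sketch of the adjustment: with n cases and k independent variables, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1). The values passed in below are placeholders, not results from the text.

    # Adjusted R square; r_square, n, and k are placeholders for your own results.
    def adjusted_r_square(r_square, n, k):
        """Adjust R^2 for the number of cases n and independent variables k."""
        return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

    print(adjusted_r_square(0.95, 50, 1))   # one predictor, fairly large n: stays close to 0.95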



