CHECKING ASSUMPTIONS WITH RESIDUALS | Six Sigma and Beyond: Statistics and Probability, Volume III

Residuals are the primary tools for checking whether the assumptions necessary for linear regression appear to be violated. We can draw histograms of the residuals, plot them against the observed and predicted values, recompute them excluding certain cases, and manipulate them in other ways. By examining the resulting plots and statistics, we can learn much about how appropriate the regression model is for a particular data set. In the next few sections, we will consider how to check each of the assumptions in turn.

NORMALITY

If the relationship is linear and the dependent variable is normally distributed for each value of the independent variable (in the population), then the distribution of the residuals should also be approximately normal. A simple histogram can demonstrate this.
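As a minimal sketch of this check, the following Python fragment fits a line to simulated data, computes the residuals, and prints a crude text histogram. The data, the seed, and the helper name `fit_line` are assumptions made for illustration only; they do not come from the text.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(0)  # simulated data, for illustration only
x = [i / 10 for i in range(200)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Bin the residuals into intervals one standard deviation wide, from -3 to +3.
sd = statistics.stdev(residuals)
counts = [0] * 6
for r in residuals:
    k = math.floor(r / sd + 3)          # bin 0 covers -3 to -2 sd, bin 5 covers +2 to +3 sd
    counts[min(max(k, 0), 5)] += 1      # values beyond 3 sd go into the end bins

for k, c in enumerate(counts):
    print(f"{k - 3:+d} to {k - 2:+d} sd | {'#' * c}")
```

If the assumptions hold, the bars should be tallest in the two middle bins and fall off roughly symmetrically on either side.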

On the other hand, when the distribution of residuals does not appear to be normal, you can sometimes transform the data to make it appear more normal. When you "transform" a variable, you change its values by taking square roots, or logarithms, or some other mathematical function of the data. If the distribution of residuals is not symmetric but has a tail in the positive direction, it is sometimes helpful to take logs of the dependent variable. If the tail is in the negative direction and all data values are positive, squaring the data may be helpful, because squaring stretches the larger values more than the smaller ones.
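The effect of taking logs on a positive tail can be sketched numerically. The example below assumes simulated data with multiplicative errors and uses a simple moment-based skewness measure; the data and the function names `fit_line`, `skewness`, and `residual_skewness` are illustrative assumptions.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def skewness(values):
    """Third standardized moment: near 0 when the distribution is symmetric."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m3 = statistics.fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def residual_skewness(x, y):
    """Skewness of the residuals from a straight-line fit of y on x."""
    a, b = fit_line(x, y)
    return skewness([yi - (a + b * xi) for xi, yi in zip(x, y)])

random.seed(1)  # simulated data, for illustration only
x = [random.uniform(0, 2) for _ in range(400)]
y = [math.exp(1.0 + 0.3 * xi + random.gauss(0, 0.6)) for xi in x]

skew_raw = residual_skewness(x, y)                          # dependent variable as measured
skew_log = residual_skewness(x, [math.log(v) for v in y])   # log of the dependent variable

print(f"skewness of residuals, original scale: {skew_raw:.2f}")
print(f"skewness of residuals, log scale:      {skew_log:.2f}")
```

A clearly positive value on the original scale and a value near zero on the log scale indicate that the transformation has removed the tail.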

The distribution of your residuals may appear not to be normal for several reasons besides a population in which the distributions are not normal. If you have a variance that is not constant for different values of the independent variable, or if you simply have a small number of residuals, your histogram may also appear not to be normal. So it is possible that after you have remedied some of these problems, the distribution of residuals may look more normal. To check whether the variance appears to be constant, you can plot the residuals against the predicted values and also against the values of the independent variable.

You may find some common transformations useful when the variance does not appear to be constant. If the variance increases linearly with the values of the independent variable and all values of the dependent variable are positive, take the square root of the dependent variable. If the standard deviation increases linearly with values of the independent variable, try taking logs of the data.
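A rough numeric companion to the residual plot is to compare the spread of the residuals for small and large predicted values. The sketch below assumes simulated data whose scatter grows in proportion to its level, so that taking logs is the appropriate remedy; the data and the names `fit_line` and `spread_ratio` are illustrative assumptions.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def spread_ratio(x, y):
    """Residual SD in the upper half of predicted values divided by that in the lower half."""
    a, b = fit_line(x, y)
    pairs = sorted((a + b * xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)

random.seed(2)  # simulated data, for illustration only
x = [random.uniform(0, 4) for _ in range(400)]
y = [math.exp(1.0 + 0.5 * xi + random.gauss(0, 0.3)) for xi in x]

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, [math.log(v) for v in y])

print(f"spread ratio, original scale: {ratio_raw:.2f}")
print(f"spread ratio, log scale:      {ratio_log:.2f}")
```

A ratio well above 1 suggests that the variance increases with the predicted value; a ratio close to 1 is consistent with constant variance.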

LINEARITY

To see whether it is appropriate to assume a linear relationship, you should always plot the dependent variable against the independent variable. If the points do not seem to cluster around a straight line, you should not fit a linear regression model. Another way to see whether a relationship is linear is to look at the plots of the residuals against the predicted values and the residuals against the values of the independent variable. If you see any type of pattern to the residuals (that is, if they do not fall in a horizontal band), you have reason to suspect that the relationship is not linear.
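To illustrate what a pattern in the residuals looks like, the sketch below fits a straight line to made-up data that actually follow a curve (y equals x squared, with no random error) and prints the sign of each residual in order of x. The data and the helper name `fit_line` are assumptions for illustration.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

x = list(range(1, 21))
y = [xi ** 2 for xi in x]   # a curved relationship, deliberately fitted with a line

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# One character per observation, ordered by x: '+' above the line, '-' below it.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)   # prints ++++------------++++
```

The residuals are positive at both ends and negative in the middle. In a plot they would trace a U shape rather than a horizontal band, which is the kind of pattern that signals a nonlinear relationship.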

Sometimes when the relationship between two variables does not appear to be linear, it is possible to transform the variables and make it linear. Then you can study the relationship between the transformed variables using linear regression.

It may seem that when you transform the data, you are cheating or at least distorting the picture. But this is not the case. All that transforming a variable does is change the scale on which it is measured. Instead of saying that a linear relationship exists between work experience and salary, you say that a linear relationship exists between work experience and the log of salary. It is much easier to build models for relationships that are linear than those that are not. That is why transforming variables is often a convenient tactic.

How do you decide what transformation to use? Sometimes you might know what the mathematical formula is that relates two variables. In that case, you can use mathematics to figure out what transformation you need. This situation happens more often in engineering or the physical or biological sciences than in the social sciences. If the true model is not known, you choose a transformation by looking at the plot of the data. Often, a relationship appears to be nearly linear for part of the data but is curved for the rest. The log transformation is useful for "straightening out" such a relationship. Sometimes taking the square root of the dependent variable may also straighten a curved relationship. These are two of the most common transformations, but others can be used.
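As an example of the case where the formula is known, suppose the dependent variable grows exponentially with the independent variable. Taking logs of the dependent variable then gives an exact straight line. The sketch below uses made-up values (a multiplier of 3 and a growth rate of 0.4) and the illustrative helpers `fit_line` and `r_squared`; none of these come from the text.

```python
import math
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    """Proportion of the variation in y explained by a straight-line fit on x."""
    a, b = fit_line(x, y)
    my = statistics.fmean(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

x = list(range(11))
y = [3.0 * math.exp(0.4 * xi) for xi in x]   # made-up exponential relationship
log_y = [math.log(yi) for yi in y]

r2_raw = r_squared(x, y)
r2_log = r_squared(x, log_y)
intercept, slope = fit_line(x, log_y)

print(f"R-squared, y on x:      {r2_raw:.3f}")
print(f"R-squared, log(y) on x: {r2_log:.3f}")
print(f"recovered growth rate:  {slope:.3f}")
print(f"recovered multiplier:   {math.exp(intercept):.3f}")
```

The slope of the fitted line on the log scale recovers the growth rate, and the exponential of the intercept recovers the multiplier.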

When you try to make a relationship linear, you can transform the independent variable, the dependent variable, or both. If you transform only the independent variable, you are not changing the distribution of the dependent variable. If it was normally distributed with a constant variance for each value of the independent variable, that remains unchanged. However, if you transform the dependent variable, you change its distribution. For example, if you take logs of the dependent variable, then the log of the dependent variable, not the original dependent variable, must be normally distributed with a constant variance. In other words, the regression assumptions must hold for the variables you actually use in the regression equation.
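Written out, the model fitted after taking logs of the dependent variable might look as follows. The symbols are introduced here for illustration and are not taken from the text: y is the dependent variable, x the independent variable, the betas are the regression coefficients, and epsilon is the error term.

```latex
\ln y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2) \ \text{independently}
```

On the original scale this corresponds to $y_i = e^{\beta_0 + \beta_1 x_i}\, e^{\varepsilon_i}$, so the errors multiply rather than add. The normality and constant-variance checks therefore apply to the residuals of $\ln y$, not to those of $y$.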

INDEPENDENCE

Another assumption that we made was that all observations are independent. (The same person is not included in the data twice on separate occasions. One person's values do not influence the others'.) When data are collected in sequence, it is possible to check this assumption. You should plot the residuals against the sequence variable. If you see any kind of pattern, such as long runs of residuals with the same sign, you should be concerned that the observations are not independent.
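A numeric complement to the sequence plot is the lag-1 autocorrelation of the residuals, that is, the correlation between each residual and the one collected just before it. This statistic is not mentioned in the text; it is used here only as a sketch, on simulated data, with the illustrative helpers `fit_line`, `lag1_autocorrelation`, and `residuals_in_order`.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def lag1_autocorrelation(e):
    """Correlation between each residual and the one immediately before it."""
    m = statistics.fmean(e)
    num = sum((a - m) * (b - m) for a, b in zip(e[:-1], e[1:]))
    den = sum((a - m) ** 2 for a in e)
    return num / den

random.seed(3)  # simulated data, for illustration only
n = 300
order = list(range(n))   # the sequence in which observations were collected

# Case 1: independent errors.
y_indep = [10 + 0.05 * t + random.gauss(0, 1) for t in order]

# Case 2: each error carries over 80% of the previous error (dependent observations).
err, y_dep = 0.0, []
for t in order:
    err = 0.8 * err + random.gauss(0, 1)
    y_dep.append(10 + 0.05 * t + err)

def residuals_in_order(y):
    """Residuals from a straight-line fit on the sequence variable, in collection order."""
    a, b = fit_line(order, y)
    return [yi - (a + b * t) for t, yi in zip(order, y)]

r_indep = lag1_autocorrelation(residuals_in_order(y_indep))
r_dep = lag1_autocorrelation(residuals_in_order(y_dep))

print(f"lag-1 autocorrelation, independent errors: {r_indep:.2f}")
print(f"lag-1 autocorrelation, dependent errors:   {r_dep:.2f}")
```

A value near zero is consistent with independence. A value well above zero means that neighboring residuals tend to resemble each other, which would show up in the plot as a drifting or wave-like pattern.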

Finally, it is important to examine the data for violation of the assumptions since significance levels, confidence intervals, and other regression tests are sensitive to certain types of violations and cannot be interpreted in the usual fashion if serious departures exist. If you carefully examine the residuals, you will have an idea of what sorts of problems might exist in your data. Transformations provide you with an opportunity to try to remedy some of the problems. You can then be more confident that the regression model is appropriate for your data.

How can you tell whether the assumptions necessary for a regression analysis appear to be violated? Here are the key points to remember:

A residual is the difference between the observed value of the dependent variable and the value predicted by the regression model.
To check the assumption of normality, make a histogram of the residuals. It should look approximately normal.
To check the assumption of constant variance, plot the residuals against the predicted values and against the values of the independent variable. There should be no relationship between the residuals and either of these two variables. If you note a pattern in the plots, you have reason to suspect that the assumption of constant variance is violated.
To check whether the relationship between the two variables is linear, plot the two variables. If the points do not cluster about a straight line, you have reason to believe that the relationship is not linear.
If any of the assumptions appear to be violated, transforming the data may help. The choice of the transformation depends on which assumption is violated and in what way.
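The definition of a residual in the first key point can be worked through on a very small data set. The five observations below are made up for illustration, and the least-squares formulas are written out directly.

```python
# Five made-up observations of an independent variable x and a dependent variable y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept for a single predictor.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Residual = observed value minus the value predicted by the regression model.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"x={xi}  observed={yi}  predicted={pi:.1f}  residual={ri:+.1f}")
```

Here the fitted line has intercept 2.2 and slope 0.6, and the residuals are -0.8, +0.6, +1.0, -0.6, and -0.2. They sum to zero, as least-squares residuals always do when the model includes an intercept.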