Simple linear regression is a powerful tool for measuring something you cannot see or for predicting the outcome of events that have not happened yet. With some help from our special friend, statistics, you can make a reasonably precise guess about how someone will score on one variable by looking at performance on another.
Many professionals, both in and outside of the social sciences, often need to predict how a person will perform on some task or score on some variable, but they cannot measure the critical variable directly. This is a common need when making college admissions decisions, for example. Admissions officers want to predict college performance (perhaps grade point average or years until completion). However, because the prospective student has not actually gone to college yet, admissions officers must use whatever information they can get now to guess what the future holds.
Schools often use scores on standardized college admissions tests as an indicator of future performance. Let's imagine that a small college decides to use scores on the American College Test (ACT) as a predictor of college grade point average (GPA) at the end of students' first years. The admissions office goes back through a few years of records and gathers the ACT scores and freshman GPAs for a couple hundred students. They discover, to their delight, that there is a moderate relationship between these two variables: a correlation coefficient of .55.
Correlation coefficients are a measure of the strength of linear relationships between two variables [Hack #11], and .55 indicates a fairly large relationship. This is good news because the existence of a relationship between the two makes ACT scores a good candidate as a predictor to guess GPA.
Simple linear regression is the procedure that produces all the values we need to cook up the magic formula that will predict the future. This procedure produces a regression line that we can graph to determine what the future holds [Hack #12], but once we have the formula, we don't actually need to do any graphing to make our guesses.
Cooking Up the Equation
First, examine the recipe for creating the formula (see the "Regression Formula Recipe" sidebar), and then we'll see how to use it with real data. You can clip this recipe out and keep it in the kitchen drawer.
The regression recipe calls for two other ingredients: the means and standard deviations of both variables. Here are those statistics for our example:
The admissions office built a regression equation from this information. Consequently, as each applicant's letter came into the admissions office, an officer could enter the student's ACT score into the regression formula and predict his GPA. Let's figure out the parts of the regression equation in this example:
By placing all this information into the regression equation format, we get this formula for predicting freshman GPA using ACT scores:
In our college admissions example, imagine two letters arrive. One applicant, Melissa, has an ACT score of 26. The other applicant, let's call him Bruce, has an ACT score of 14.
Using the regression equation we have built, there would be two different predictions for these folks' eventual grade point averages:
I hope, for Bruce's sake, there is more than one spot available.
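In code, the regression recipe boils down to a slope (the correlation times the ratio of the two standard deviations) and an intercept. The sketch below uses the example's correlation of .55, but the means and standard deviations are made-up stand-ins, not the book's actual table:

```python
# Regression recipe: slope = r * (SD_y / SD_x), intercept = mean_y - slope * mean_x
r = 0.55                      # correlation from the example
act_mean, act_sd = 21.0, 3.0  # hypothetical values for illustration
gpa_mean, gpa_sd = 2.9, 0.6   # hypothetical values for illustration

slope = r * (gpa_sd / act_sd)           # GPA points gained per ACT point
intercept = gpa_mean - slope * act_mean

def predict_gpa(act_score):
    """Predicted freshman GPA for a given ACT score."""
    return intercept + slope * act_score

print(round(predict_gpa(26), 2))  # Melissa's prediction
print(round(predict_gpa(14), 2))  # Bruce's prediction
```

With any positive slope, Melissa's higher ACT score yields the higher predicted GPA, which is why Bruce should hope for more than one open spot.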
Why It Works
When two variables correlate with each other, there is overlap in the information they provide. It is as if they share information. Statisticians sometimes use correlational information to talk about variables sharing variance.
If some of the variance in one variable is accounted for by the variance in another variable, it makes sense that smart mathematicians can use one correlated variable to estimate the amount of variance from the mean (or distance from the mean) on another variable. They would have to use numbers that represent the variables' means and variability, and a number that represents the amount of overlap in information. Our regression equation uses all that information by including means, standard deviations, and the correlation coefficient.
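Squaring the correlation coefficient gives the proportion of shared variance, which is one way to put a number on that "overlap in information":

```python
r = 0.55              # correlation between ACT and GPA from the example
shared_variance = r ** 2
# About 30 percent of the variability in GPA overlaps with ACT scores
print(round(shared_variance, 4))
```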
Where Else It Works
Regression is helpful in answering research questions beyond making predictions. Sometimes, scientists just want to understand a variable and how it operates or how it is distributed in a population. They can do this by looking at how that variable is related to another variable that they know more about.
Where It Doesn't Work
There will be error in predictions under three circumstances. First, if the correlation is less than perfect between two variables, the prediction will not be perfectly accurate. Since there are almost never really large relationships between predictors and criteria, let alone perfect 1.0 correlations, real-world applications of regression make lots of mistakes. In the presence of any correlation at all, though, the prediction is more accurate than blind guessing. You can determine the size of your errors with the standard error of estimate [Hack #18].
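The standard error of estimate mentioned above has a simple formula: the standard deviation of the criterion times the square root of 1 minus r squared. Here is a sketch using the example's correlation of .55 and a made-up GPA standard deviation:

```python
import math

r = 0.55       # correlation from the example
gpa_sd = 0.6   # hypothetical standard deviation of freshman GPA

# Standard error of estimate: SD_y * sqrt(1 - r^2)
standard_error = gpa_sd * math.sqrt(1 - r ** 2)
# Predictions typically miss the real GPA by roughly this much
print(round(standard_error, 2))
```

Notice that even a respectable correlation of .55 only shrinks the typical error to about 84 percent of what blind guessing at the mean would produce.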
Second, linear regression assumes that the relationship is linear. This is discussed in "Graph Relationships" [Hack #12] in greater detail, but if the strength of the relationship varies at different points along the range of scores, the regression prediction will make large errors in some cases.
Finally, if the data collected to first establish the values used in the regression equation are not representative of future data, results will be in error. For example, in our college admissions example, if an applicant presents with an ACT score of 36, the predicted GPA is 5.52. This is an impossible value that does not even fit on the GPA scale, which maxes out at 4.0. Because the past data that was used to establish the prediction formula included few or no ACT scores of 36, the equation was not equipped to deal with such a high score.
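One defensive habit is to clamp predictions to the scale they live on. The slope and intercept below are hypothetical values chosen for illustration (the book's own equation is the one that predicts 5.52 for an ACT of 36):

```python
# Hypothetical regression coefficients for illustration only
slope, intercept = 0.11, 0.59

act_score = 36
raw_prediction = intercept + slope * act_score   # lands above the 4.0 ceiling
clamped = min(max(raw_prediction, 0.0), 4.0)     # force onto the GPA scale

print(round(raw_prediction, 2), clamped)
```

Clamping hides the symptom but not the cause; the real fix is to be wary of predictions for scores outside the range of the original data.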