If you don't like the score you just got on that important high-stakes test, maybe you should take the test again. Or should you?
We've already discussed how to measure anything precisely by applying concepts of reliability [Hack #6]. Reliability is the consistency with which a test assesses some outcome. In other words, a reliable test produces a stable score, and an unreliable test does not. Because tests that are less than perfectly reliable produce scores at least partly due to random chance, their scores can move around in ways that statisticians can predict. Because your test score when you retake a test will tend to move toward the average score on that test, this effect is called regression toward the mean.
When you take a high-stakes test such as the SAT, ACT, GRE, LSAT, or MCAT, you often have the option of retaking it to try to improve your score. Your decision on whether it is worth the time, hard work, and money to try to improve your test score should be made with an understanding of the test's reliability and how much change is possible simply through regression to the mean.
Regressing to the Mean
First, let's make regression to the mean occur, so you'll believe that scores can change in a predictable direction for no reason other than the characteristics of the normal curve [Hack #23]. Seeing is believing, and I hope to make this invisible magical phenomenon happen before your eyes.
Give the true/false quiz shown in Table 3-8 to 100 of your closest friends. Well, OK, maybe 10 people, counting you. 1,000 would be even better, but I just need enough to prove to you that this regression thing happens. As we proceed, keep in mind that if we had 100 or 1,000 takers of this very difficult (or very easy) test, the results would be even more convincing.
Oh, and for this test, you don't have to see the actual questions themselves. Scores will change on this test without any change in the construct that is being measured [Hack #32]. So, all you can do on this quiz is guess. Because they are true/false questions, you will have a 50 percent chance of getting any question correct, and the average performance for your group of 10 test takers (or 100 if you are really serious about this...can you do at least 30 maybe?...anyone?) should be a score of 5 out of 10.
Administer the Advanced Quantum Physics Quiz to all the people you were able to get. And when you and the others take this quiz, don't cheat by looking at the answer key, even though it is only inches away from your eyes right now (in Table 3-9)!
Collect the completed tests (make sure they put their names on them) and score them up, using the answer key in Table 3-9.
Now, pick your highest scorer (this represents someone like you, perhaps, who scores higher than average on standardized tests such as the SAT) and the lowest scorer (this represents someone not like you, perhaps, who scores lower than average). Give these two people the quiz again (without them seeing the correct answers) and score them again.
Here's where regression to the mean kicks in. I am pretty sure, without knowing you or your friends or what their answers are, of two things:
If it worked, then aha! I told you so. If it didn't work, I told you I was only "pretty sure." With a larger sample, it is much more likely to work.
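If you can't round up enough friends, the experiment can be run in software instead. The following is a minimal sketch, assuming every answer is a fair coin flip; the function names, the number of trials, and the seed are illustrative choices, not part of the original quiz. It repeats the classroom experiment many times and reports how often the highest scorer dropped and the lowest scorer rose on the retest.

```python
import random

def guess_quiz(rng, n_items=10):
    """Score a true/false quiz answered by pure guessing (fair coin per item)."""
    return sum(rng.random() < 0.5 for _ in range(n_items))

def retest_extremes(n_takers=10, n_items=10, n_trials=2000, seed=1):
    """Repeat the classroom experiment and report how often regression shows up.

    Returns (share of trials where the top scorer dropped,
             share of trials where the bottom scorer rose).
    """
    rng = random.Random(seed)
    top_dropped = 0
    bottom_rose = 0
    for _ in range(n_trials):
        first_scores = [guess_quiz(rng, n_items) for _ in range(n_takers)]
        # Only the highest and lowest scorers take the quiz a second time.
        top_retest = guess_quiz(rng, n_items)
        bottom_retest = guess_quiz(rng, n_items)
        top_dropped += top_retest < max(first_scores)
        bottom_rose += bottom_retest > min(first_scores)
    return top_dropped / n_trials, bottom_rose / n_trials

dropped, rose = retest_extremes()
print(f"top scorer dropped on retest: {dropped:.0%}")
print(f"bottom scorer rose on retest: {rose:.0%}")
```

Raising n_takers makes the first-round extremes more extreme, so the two shares should climb even closer to 100 percent.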
Why It Works
What we expect is that a score below 5 (or whatever your test mean was) will move up toward the mean on the retest, and a score above 5 will move down toward the mean. This may or may not have happened with your two scores, but it is the most probable outcome.
Remember this was a test in which knowledge had no effect on scores. Scores were due entirely to chance both times. This effect occurs with real tests, though, even when knowledge does influence your score. That's because no real test is perfectly reliable, and chance plays some role in performance on every test. This demonstration just exaggerated the effect by presenting a test in which chance accounts for 100 percent of the test taker's score.
So, why are scores likely to change and move closer to the mean on second occasions? In the long run, with 100 or 1,000 sets of test scores, we would expect the outcomes to be something like the normal distribution. Just like flipping a coin (which can come up heads or tails, with a 50 percent chance of either), probabilities are associated with particular outcomes on a true/false test (or any test, for that matter). Table 3-10 shows the possible scores and the likelihood of a test taker receiving them for the Advanced Quantum Physics Quiz.
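The likelihoods for a quiz answered by pure guessing follow the binomial distribution. As a rough sketch, assuming 10 items and a 50 percent chance on each (the function name is made up for illustration), the probability of every possible score can be computed like this:

```python
from math import comb

def score_probability(score, n_items=10, p_correct=0.5):
    """Binomial probability of getting exactly `score` items right by guessing."""
    return (comb(n_items, score)
            * p_correct ** score
            * (1 - p_correct) ** (n_items - score))

# Print one row per possible score, from 0 to 10 correct.
for score in range(11):
    print(f"{score:>2}  {score_probability(score):.3f}")
```

The distribution is symmetric around 5, so a score of 2 is exactly as likely as a score of 8.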
Why would more extreme scores become less extreme with repeated testing? Suppose someone scores a 2 the first time. Compare the likelihood of a second extreme score of 2 (probability = .044) with the likelihood of a score of 4 (probability = .205). It's almost five times as likely that this person will score a 4 rather than another 2 on a second administration. It is almost 95 percent certain that he will score higher than 2 (1 - .044 - .010 - .001 = .945).
With this test, in which scores were entirely due to chance, there is a 65.6 percent chance of scoring at or very near the mean (combining probabilities of scores 4, 5, and 6). With most tests, which have a greater number of items and produce normal distributions, you have about a 68 percent chance of scoring within one standard deviation of the mean [Hack #23].
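The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch under the same assumption of 10 pure-guess items; the variable names are invented for the example, and the normal-curve figure is taken as the area within one standard deviation of the mean.

```python
from math import comb, erf, sqrt

def pmf(k, n=10, p=0.5):
    """Binomial probability of exactly k correct guesses out of n items."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# How much more likely is a retest score of 4 than a repeat score of 2?
ratio_4_vs_2 = pmf(4) / pmf(2)

# Chance that a retest beats a first score of 2: everything except 0, 1, or 2.
p_higher_than_2 = 1 - sum(pmf(k) for k in range(3))

# Chance of landing at or next to the mean (scores 4, 5, or 6).
p_near_mean = sum(pmf(k) for k in (4, 5, 6))

# Area of a normal curve within one standard deviation of its mean.
p_within_one_sd = erf(1 / sqrt(2))

print(f"4 versus 2:        {ratio_4_vs_2:.2f} times as likely")
print(f"higher than 2:     {p_higher_than_2:.3f}")
print(f"score of 4 to 6:   {p_near_mean:.3f}")
print(f"within 1 SD:       {p_within_one_sd:.3f}")
```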
Predicting the Likelihood of a Higher Score
This is all very interesting, but how will it help you decide whether it is worth it to take a test a second time? Back to our original dilemma. Taking these important tests (such as college admissions tests) a second time takes more money, time, stress, and, perhaps, preparation, so one needs to be strategic in deciding when to try again.
The likelihood that you will do better on a test by just taking it a second time depends on two things: your score the first time and the reliability of the test.
Statisticians have developed a formula that you can apply to give you a good idea of how much wiggle room you have around your score. If there is plenty of room to grow, you might consider a second shot at it. A useful tool here is the standard error of measurement. Here's the formula for the standard error of measurement [Hack #6]:
Standard error of measurement = standard deviation × √(1 − reliability)
Most standardized tests publish their levels of reliability and the expected standard deviation for the many hundreds of thousands of scores produced by the test during each administration. By plugging values for these tests into the standard error of measurement equation, one can get a general sense of the variation of scores from test to retest that might be possible without any real change in the person being measured.
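Plugging values into the equation takes one line of arithmetic. Here is a minimal sketch; the standard deviation of 100 and the reliability of .90 are hypothetical numbers chosen for illustration, not published figures for any real test, and the function name is likewise made up.

```python
from math import sqrt

def standard_error_of_measurement(std_dev, reliability):
    """Standard deviation times the square root of (1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return std_dev * sqrt(1.0 - reliability)

# Hypothetical test: standard deviation of 100 points, reliability of .90.
sem = standard_error_of_measurement(std_dev=100, reliability=0.90)
print(f"standard error of measurement: {sem:.1f} points")
```

A perfectly reliable test (reliability of 1) gives a standard error of 0, which is the one case in which a retest score would not be expected to wander at all.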
However, even the standard error is misleading for extreme scores. Very low scores and very high scores are likely to move a greater distance by chance alone than the standard error would suggest. The further you are from normal, the harder it is to resist the gravitational forces of normal. Extreme scores cannot resist that pull, unless they are perfectly reliable.
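One way to put a number on that pull is the ordinary linear regression prediction of a retest score from a first score, which is an added illustration here and not a method given in this hack. It assumes the correlation between test and retest equals the test's reliability and that both administrations share the same mean and spread. The mean of 500 and reliability of .90 below are hypothetical.

```python
def expected_retest_score(first_score, test_mean, reliability):
    """Predicted retest score: the first score pulled toward the mean.

    Assumes test-retest correlation equals `reliability` and that both
    administrations have the same mean and standard deviation.
    """
    return test_mean + reliability * (first_score - test_mean)

# Hypothetical test with a mean of 500 and a reliability of .90.
for first in (500, 600, 750):
    predicted = expected_retest_score(first, test_mean=500, reliability=0.90)
    print(f"first score {first}: expected retest {predicted:.0f} "
          f"(shift {predicted - first:+.0f})")
```

Under these assumptions the expected shift is (1 − reliability) times the distance from the mean, so it grows the further the first score sits from average and vanishes only when reliability is perfect.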
In sum, here's some sound advice on how to decide whether to retake a test: