Hack 31. Establish Reliability | Statistics Hacks: Tips & Tools for Measuring the World and Beating the Odds

People who use, make, and take high-stakes tests have a vested interest in establishing the precision of a test score. Fortunately, the field of educational and psychological measurement offers several methods for both verifying that a test score is consistent and precise and indicating just how trustworthy it is.

Anyone who uses tests to make high-stakes decisions needs to be confident that the scores that are produced are precise and that they're not influenced much by random forces, such as whether the job applicant had breakfast that morning or the student was overly anxious during the test. Test designers need to establish reliability to convince their customers that they can rely on the results produced.

Most importantly, perhaps, when you take a test that will affect your admission to a school or determine whether you get that promotion to head beverage chef, you need to know that the score reflects your typical level of performance. This hack presents several procedures for estimating the reliability of a test.

Why Reliability Matters

Some basics, first, about test reliability and why you should seek out reliability evidence for important tests you take. Tests and other measurement instruments are expected to behave consistently, both internally (items that measure the same construct should behave in similar ways) and externally (the test should provide similar results if it is administered again and again over time). These are issues of reliability.

Reliability is measured statistically, and a specific number can be calculated to represent a test's level of consistency. Most indices of reliability are based on correlations [Hack #11] between responses to items within a test or between two sets of scores on a test given or scored twice.

Four commonly reported types of reliability are used to establish whether a test produces scores that do not include much random variance:

Internal reliability: Is performance for each test taker consistent across different items within a single test?
Test-retest reliability: Is performance for each test taker consistent across two administrations of the same test?
Inter-rater reliability: Is performance for each test taker consistent if two different people score the test?
Parallel forms reliability: Is performance for each test taker consistent across different forms of the same test?

Calculating Reliability

If you have produced a test you want to use (whether you are a teacher, a personnel officer, or a therapist), you will want to verify that you are measuring reliably. The methods you use to compute your level of precision depend on the reliability type you are interested in.

Internal reliability

The most commonly reported measure of reliability is a measure of internal consistency referred to as coefficient (or Cronbach's) alpha. Coefficient alpha is a number that almost always ranges from .00 to 1.00. The higher the number, the more consistently a test's items behave.

If you took a test and split it in half (the odd items in one half and the even items in the other, for example), you could calculate the correlation between the two halves. A split-half correlation is computed with the ordinary correlation coefficient formula [Hack #11]. It is a traditional method for estimating reliability, though it is considered a bit old-fashioned these days.
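As a rough illustration of the split-half idea, the sketch below totals the odd-numbered and even-numbered items for each test taker and correlates the two sets of half-test totals. The function names and the response data are hypothetical, made up only for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half(responses):
    """Correlate odd-item totals with even-item totals.

    responses holds one list of item scores per test taker.
    """
    odd = [sum(person[0::2]) for person in responses]   # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in responses]  # items 2, 4, 6, ...
    return pearson(odd, even)

# Hypothetical data: 5 test takers, 6 items scored 0 (wrong) or 1 (right)
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(split_half(scores), 2))
```

A different split (first half versus second half, say) would generally give a somewhat different value, which is one reason a single split-half correlation is only a rough estimate.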

Mathematically, the formula for coefficient alpha produces an average of correlations between all possible halves of a test and has come to replace a split-half correlation as the preferred estimate of internal reliability. Computers are typically used to calculate this value because of the complexity of the equation:

$$\alpha = \frac{n}{n-1}\left(\frac{SD^2 - \sum SD_i^2}{SD^2}\right)$$

where n = the number of items on the test, SD = standard deviation of the total test score, Σ means to sum up, and SD_i = standard deviation of each item.
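For readers who want to see the calculation spelled out, here is a minimal sketch of coefficient alpha in its standard form: n/(n-1) multiplied by the share of total score variance that is not accounted for by the sum of the individual item variances. The function name and the response data are assumptions made up for this example.

```python
from statistics import pvariance

def coefficient_alpha(responses):
    """Coefficient (Cronbach's) alpha.

    responses holds one list of item scores per test taker.
    alpha = n/(n-1) * (SD^2 - sum of SD_i^2) / SD^2
    """
    n = len(responses[0])  # number of items on the test
    # Variance (SD squared) of each item across test takers
    item_vars = [pvariance([person[i] for person in responses]) for i in range(n)]
    # Variance (SD squared) of the total test score
    total_var = pvariance([sum(person) for person in responses])
    return (n / (n - 1)) * (total_var - sum(item_vars)) / total_var

# Hypothetical data: 5 test takers, 6 items scored 0 (wrong) or 1 (right)
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(coefficient_alpha(scores), 2))
```

The sketch uses population variances throughout. Using sample variances for both the items and the total would give the same alpha, because the correction factor cancels in the ratio.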

Test-retest reliability

Internal consistency is usually considered appropriate evidence for the reliability of a test, but in some cases, it is also necessary to demonstrate consistency over time.

If whatever is being measured is something that should not change over time, or if it should change very slowly, then responses from the same group should be pretty much the same if they were administered the same test on two different occasions. A correlation between these two sets of scores would reflect a test's consistency over time.
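A test-retest estimate of this kind can be sketched as a plain correlation between the two administrations. The scores below are hypothetical, and each position in the two lists is assumed to belong to the same person.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same six people on two occasions
first_administration = [82, 75, 91, 68, 77, 85]
second_administration = [80, 78, 93, 65, 74, 88]

test_retest = pearson(first_administration, second_administration)
print(round(test_retest, 2))
```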

Inter-rater reliability

We can also calculate reliability when more than one person scores a test or makes an observation. When different raters are used to produce a score, it is appropriate to demonstrate consistency between them. Even if only one scorer is used (as with a teacher in a classroom), if the scoring is subjective at all, as with most essay questions and performance assessments, this type of reliability has great theoretical importance.

To demonstrate that an individual's score represents typical performance in these cases, it must be shown that it makes no difference which judge, scorer, or rater was used. The level of inter-rater reliability is usually established with correlations between raters' scores for a series of people or with a percentage that indicates how often the raters agreed.
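Both approaches can be sketched in a few lines. The example below assumes two raters who each scored the same ten essays on a 1-to-5 scale. The ratings and function names are invented for illustration, and "agreement" here means an exact match.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def percent_agreement(ratings_a, ratings_b):
    """Percentage of people to whom both raters gave exactly the same score."""
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return 100 * matches / len(ratings_a)

# Hypothetical ratings of the same ten essays by two raters
rater_a = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
rater_b = [4, 3, 4, 2, 4, 3, 5, 2, 4, 2]

print(percent_agreement(rater_a, rater_b))   # exact-match agreement, in percent
print(round(pearson(rater_a, rater_b), 2))   # correlation between the raters
```

The two numbers answer slightly different questions: the correlation rewards raters who rank people in the same order even if one is consistently harsher, while percentage agreement counts only identical scores.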

Parallel forms reliability

Finally, we can demonstrate reliability by arguing that it doesn't matter which form of a test a person takes; she will score about the same. Demonstrating parallel forms reliability is necessary only when the test is constructed from a larger pool of items.

For example, with most standardized college admission tests, such as the SAT and the ACT, different test takers are given different versions of the test, made up of different questions covering the same subjects. The companies behind these tests have developed many hundreds of questions and produce different versions of the same test by using different samples of these questions. This way, when you take the test in Maine on a Saturday morning, you can't call your cousin in California and tell him specific questions to prepare for before he takes the test next week, because your cousin will likely have a different set of questions on his test.

When companies produce different forms of the same test, they must demonstrate that the tests are equally difficult and have other similar statistical properties. Most importantly, they must show that you would score the same on your Maine version as you would if you took the California version.

Interpreting Reliability Evidence

There are a variety of approaches to establishing test reliability, and tests for different purposes should have different types of reliability evidence associated with them. You can rely on the size of the reliability coefficients to decide whether a test you have made needs to be improved. If you are only taking the test or relying on the information it provides, you can use the reliability value to decide whether you trust the test results.

Internal reliability

A test designed to be used alone to make an important decision should have extremely high internal reliability, so the score one receives should be very precise. A coefficient alpha of .70 or higher is most often considered necessary for a claim that a test is internally reliable, though this is just a rule of thumb. You decide what is acceptable for the tests you make or take.

Test-retest reliability

A test used to measure change over time, as in various social science research designs, should display good test-retest reliability, which means any changes between tests are not due to random fluctuations in scores. An appropriate size for a correlation of stability depends on how theoretically stable a construct should be over time. Depending on its characteristics, then, a test should produce scores over time that correlate in the range of .60 to 1.00.

Inter-rater reliability

Inter-rater reliability is interesting only if the scoring is subjective, such as with an essay test. Objective, computer-scored multiple-choice tests should produce perfect inter-rater reliability, so that sort of evidence is typically not produced for objective tests. If an inter-rater correlation is used as the estimate of inter-rater reliability, .80 is a good rule of thumb for minimum reliability.

Sometimes, reliability across raters is estimated by reporting the percentage of the time the two scorers agreed. With a percentage agreement reliability estimate, 85 percent is typically considered good enough.

Parallel forms reliability

Only tests with different forms can be described as having parallel forms reliability. Your college professor probably doesn't need to establish parallel forms reliability when there is only one version of the final, but large-scale test companies probably do.

Parallel forms reliability should be very high, so people can treat scores on any form of the test as equally meaningful. Typically, correlations between two forms of a test should be higher than .90. Test companies conduct studies in which one group of people takes both forms of a test in order to determine this reliability coefficient.
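A study like this could be summarized with a sketch along the following lines, which assumes one group of six people took both forms. It reports the difference in average score (a crude check on equal difficulty) and the correlation between forms, compared against the .90 rule of thumb. The data and names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same six people on two forms of one test
form_a = [52, 61, 47, 70, 58, 65]
form_b = [50, 63, 45, 71, 60, 64]

# Difference in average score: a rough indication of equal difficulty
mean_difference = sum(form_a) / len(form_a) - sum(form_b) / len(form_b)

# Correlation between forms: the parallel forms reliability estimate
parallel_forms = pearson(form_a, form_b)

print(round(mean_difference, 2))
print(round(parallel_forms, 2))
print("meets .90 rule of thumb:", parallel_forms > 0.90)
```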

Before you take a high-stakes test that could determine which roads are open to you, make sure that the test has accepted levels of reliability. The type of reliability you'd like to see evidence of depends on the purpose of the test.

Improving Test Reliability

The easiest way to ensure a high coefficient alpha or any other reliability coefficient is to increase the length of your test. The more items asking about the same concept and the more opportunities respondents have to clarify their attitudes or display knowledge, the more reliable a total score on that test will be. This makes sense theoretically, and lengthening a test also increases reliability mathematically because of the formula used to calculate it.

Look back at the equation for coefficient alpha. As the length of a test increases, the variability for the total test score increases at a greater rate than the total variability across items, as long as the items are positively correlated with one another. In the formula, this means that the value in the parentheses gets larger as a test gets longer. The n/(n-1) portion actually shrinks slightly toward 1 as the number of items increases, but the growth of the value in the parentheses more than makes up for it. Consequently, longer tests tend to produce higher reliability estimates.
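To see the effect of test length in numbers, the sketch below works through a simplified case. It assumes every item has a standard deviation of 1 and every pair of items correlates .20, neither of which is true of a real test. Under those assumptions, the sum of the item variances is n and the total score variance is n + n(n-1)r.

```python
def alpha_equal_items(n, r):
    """Coefficient alpha for n items that each have SD = 1 and that all
    correlate r with one another (an illustrative assumption)."""
    sum_item_var = n * 1.0               # each item variance is 1
    total_var = n + n * (n - 1) * r      # item variances plus all covariances
    in_parens = (total_var - sum_item_var) / total_var
    return (n / (n - 1)) * in_parens, in_parens

# Columns: number of items, value in the parentheses, coefficient alpha
for n in (5, 10, 20, 40):
    alpha, in_parens = alpha_equal_items(n, r=0.2)
    print(n, round(in_parens, 3), round(alpha, 3))
```

With the inter-item correlation held fixed, both the value in the parentheses and alpha itself climb as items are added, though each additional item buys a little less than the one before.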

Why It Works

Correlations compare two sets of scores matched up so that each pair of scores describes one individual. If most people perform consistently (each of their two scores is high, low, or about average when compared to other individuals, or a high score on one test matches consistently with a low score on another), the correlation will be close to 1.00 or -1.00.

An inconsistent relationship between scores produces a correlation close to 0. Consistency of scores, or the correlation of a test with itself, is believed to indicate that a score is reliable under the criteria established within Classical Test Theory [Hack #6]. Classical Test Theory suggests, among other things, that random error is the only reason that scores for a single person will vary if the same test is taken many times.
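The three situations described here (consistent, consistently opposite, and inconsistent) can be illustrated with deliberately idealized, made-up scores. Real data would rarely land exactly on 1, -1, or 0.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for five people on a first test
first = [1, 2, 3, 4, 5]

same_order = [2, 4, 6, 8, 10]      # high goes with high, low with low
reverse_order = [10, 8, 6, 4, 2]   # high consistently goes with low
no_pattern = [2, 1, 4, 1, 2]       # no consistent relationship

r_same = pearson(first, same_order)
r_reverse = pearson(first, reverse_order)
r_none = pearson(first, no_pattern)

print(round(r_same, 2), round(r_reverse, 2), round(r_none, 2))
```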