People who use, make, and take high-stakes tests have a vested interest in establishing the precision of a test score. Fortunately, the field of educational and psychological measurement offers several methods both for verifying that a test score is consistent and precise and for indicating just how trustworthy it is. Anyone who uses tests to make high-stakes decisions needs to be confident that the scores they produce are precise and not much influenced by random forces, such as whether the job applicant had breakfast that morning or the student was overly anxious during the test. Test designers need to establish reliability to convince their customers that they can rely on the results produced. Most importantly, perhaps, when you take a test that will affect your admission to a school or determine whether you get that promotion to head beverage chef, you need to know that the score reflects your typical level of performance. This hack presents several procedures for measuring the reliability of measures.

Why Reliability Matters

Some basics, first, about test reliability and why you should seek out reliability evidence for important tests you take. Tests and other measurement instruments are expected to behave consistently, both internally (items measuring the same construct behave in similar ways) and externally (the test provides similar results if it is administered again and again over time). These are issues of reliability. Reliability is measured statistically, and a specific number can be calculated to represent a test's level of consistency. Most indices of reliability are based on correlations [Hack #11] between responses to items within a test or between two sets of scores on a test given or scored twice. Four commonly reported types of reliability are used to establish whether a test produces scores that do not include much random variance:

- Internal reliability
- Test-retest reliability
- Inter-rater reliability
- Parallel forms reliability
Calculating Reliability

If you have produced a test you want to use (whether you are a teacher, a personnel officer, or a therapist), you will want to verify that you are measuring reliably. The methods you use to compute your level of precision depend on the type of reliability you are interested in.

Internal reliability

The most commonly reported measure of reliability is a measure of internal consistency referred to as coefficient (or Cronbach's) alpha. Coefficient alpha is a number that almost always ranges from .00 to 1.00. The higher the number, the more internally consistent a test's items are. If you took a test and split it in half (the odd items in one half and the even items in the other, for example), you could calculate the correlation between the two halves. This split-half correlation, computed with the standard correlation coefficient formula [Hack #11], is a traditional method for estimating reliability, though it is considered a bit old-fashioned these days. Mathematically, the formula for coefficient alpha produces an average of the correlations between all possible halves of a test, and it has come to replace the split-half correlation as the preferred estimate of internal reliability. Computers are typically used to calculate this value because of the complexity of the equation:

alpha = (n/(n-1)) x (1 - (ΣSDi²)/SD²)

where n = the number of items on the test, SD = the standard deviation of the test, Σ means to sum up, and SDi = the standard deviation of each item.

Test-retest reliability

Internal consistency is usually considered appropriate evidence for the reliability of a test, but in some cases it is also necessary to demonstrate consistency over time. If whatever is being measured is something that should not change over time, or should change only very slowly, then the same group should produce pretty much the same responses if administered the same test on two different occasions. A correlation between these two sets of scores would reflect the test's consistency over time.
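The coefficient alpha calculation can be sketched in a few lines of code. This is a minimal illustration using only made-up item scores and no statistics library; the function names and data are hypothetical, not from the hack itself.

```python
# Coefficient (Cronbach's) alpha for a small item-response matrix.
# Illustration data only: five people answering a three-item test.

def variance(xs):
    """Population variance (divide by N)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding one score per person."""
    n = len(item_scores)                                  # number of items
    totals = [sum(person) for person in zip(*item_scores)]  # total test scores
    item_var_sum = sum(variance(item) for item in item_scores)
    total_var = variance(totals)
    # alpha = (n/(n-1)) * (1 - sum of item variances / total score variance)
    return (n / (n - 1)) * (1 - item_var_sum / total_var)

items = [
    [8, 6, 7, 9, 5],   # item 1 scores for persons 1-5
    [7, 6, 8, 9, 4],   # item 2
    [9, 5, 7, 8, 5],   # item 3
]
print(round(cronbach_alpha(items), 2))  # -> 0.93
```

Because people who score high on one item here tend to score high on the others, the items hang together and alpha comes out high (about .93 for this made-up data).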
Inter-rater reliability

We can also calculate reliability when more than one person scores a test or makes an observation. When different raters are used to produce a score, it is appropriate to demonstrate consistency between them. Even if only one scorer is used (as with a teacher in a classroom), if the scoring is at all subjective, as with most essay questions and performance assessments, this type of reliability has great theoretical importance. To demonstrate that an individual's score represents typical performance in these cases, it must be shown that it makes no difference which judge, scorer, or rater was used. The level of inter-rater reliability is usually established with correlations between raters' scores for a series of people, or with a percentage that indicates how often they agreed.

Parallel forms reliability

Finally, we can demonstrate reliability by arguing that it doesn't matter which form of a test a person takes; she will score about the same. Demonstrating parallel forms reliability is necessary only when the test is constructed from a larger pool of items. For example, with most standardized college admission tests, such as the SAT and the ACT, different test takers are given different versions of the test, made up of different questions covering the same subjects. The companies behind these tests have developed many hundreds of questions and produce different versions of the same test by using different samples of these questions. This way, when you take the test in Maine on a Saturday morning, you can't call your cousin in California and tell him specific questions to prepare for before he takes the test next week, because your cousin will likely have a different set of questions on his test. When companies produce different forms of the same test, they must demonstrate that the forms are equally difficult and have otherwise similar statistical properties.
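The two common ways of quantifying inter-rater reliability described above, a correlation between raters and a percentage of exact agreement, can both be computed directly. A minimal sketch with made-up essay scores from two hypothetical raters on a 1-5 scale:

```python
# Inter-rater reliability two ways: correlation and percent agreement.
# rater_a and rater_b are illustration data, not from the hack.

def pearson_r(xs, ys):
    """Standard correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def percent_agreement(xs, ys):
    """Proportion of cases on which the two raters gave the same score."""
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)

rater_a = [4, 3, 5, 2, 4, 3, 5, 1]   # rater A's scores for eight essays
rater_b = [4, 3, 4, 2, 5, 3, 5, 1]   # rater B's scores for the same essays

print(round(pearson_r(rater_a, rater_b), 2))   # correlation between raters
print(percent_agreement(rater_a, rater_b))     # share of exact agreements
```

Here the raters agree exactly on six of eight essays (75 percent agreement), and their scores correlate highly, so it matters little which rater scored a given essay.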
Most importantly, they must show that you would score about the same on your Maine version as you would on the California version.

Interpreting Reliability Evidence

There are a variety of approaches to establishing test reliability, and tests built for different purposes should have different types of reliability evidence associated with them. If you have made a test, you can rely on the size of its reliability coefficients to decide whether it needs to be improved. If you are only taking the test, or relying on the information it provides, you can use the reliability value to decide whether to trust the results.
Before you take a high-stakes test that could determine which roads are open to you, make sure that the test has accepted levels of reliability. The type of reliability you'd like to see evidence of depends on the purpose of the test.

Improving Test Reliability

The easiest way to ensure a high coefficient alpha, or any other reliability coefficient, is to increase the length of your test. The more items asking about the same concept, and the more opportunities respondents have to clarify their attitudes or display knowledge, the more reliable a total score on that test will be. This makes sense theoretically, but it also increases reliability mathematically because of the formula used to calculate it. Look back at the equation for coefficient alpha. As the length of a test increases, the variability of the total test score increases at a greater rate than the total variability across items. In the formula, this means that the value in the parentheses gets larger as a test gets longer. The n/(n-1) portion also increases as the number of items increases. Consequently, longer tests tend to produce higher reliability estimates.

Why It Works

Correlations compare two sets of scores matched up so that each pair of scores describes one individual. If most people perform consistently (each of their two scores is high, low, or about average when compared to other individuals, or a high score on one test consistently matches a low score on the other), the correlation will be close to 1.00 or -1.00. An inconsistent relationship between scores produces a correlation close to 0. Consistency of scores, or the correlation of a test with itself, is believed to indicate that a score is reliable under the criteria established within Classical Test Theory [Hack #6]. Classical Test Theory suggests, among other things, that random error is the only reason that scores for a single person will vary if the same test is taken many times.
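The effect of test length on reliability can be quantified with the Spearman-Brown prophecy formula, a standard Classical Test Theory result (not named in this hack, but consistent with its point that longer tests are more reliable). A minimal sketch; the starting reliability of .70 is an assumed example value:

```python
# Spearman-Brown prophecy formula: predicted reliability of a test whose
# length is multiplied by a factor k, given its current reliability r.
# The starting value r = 0.70 is an assumed example, not from the hack.

def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by k."""
    return (k * r) / (1 + (k - 1) * r)

r = 0.70
for k in (1, 2, 3):
    print(k, round(spearman_brown(r, k), 2))
# Doubling a .70-reliable test predicts roughly .82; tripling it, about .88.
```

This matches the behavior of the alpha formula described above: adding parallel items pushes the reliability estimate up, with diminishing returns as the test gets longer.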