Hack 32. Establish Validity


The single most important characteristic of a test is that it is useful for its intended purpose. Establishing validity is important if anyone is to trust that a test score means what it is supposed to mean. You can convince yourself and others that your test is valid if you provide certain types of evidence.

A good test measures what it is intended to measure. For example, a survey that is supposed to find out how often high school students wear seatbelts should, obviously, contain questions about seatbelt use. A survey without these items could reasonably be criticized as not having validity. Validity is the extent to which something measures whatever it is expected to measure. Surveys, tests, and experiments all require validity to be acceptable. If you are building a test for psychological or educational measurement, or just want to be sure your test is useful, you should be concerned about establishing validity.

Validity is not something that a test score either has or does not have. Validity is an argument that is made by the test designer, those relying on the test's results, or anyone else who has a stake in the acceptance of the test and its results.

Consider a spelling test that consists of math problems. Clearly, a test with math problems is not a valid spelling test. While it is not a valid spelling test, though, it might well be a valid math test. The validity of a test or survey is not in the instrument itself, but in the interpretation of the results.

A test might be valid for one purpose, but not another. It would not be appropriate to interpret a child's score on a spelling test as an indication of her math ability; the score might be valid as a measure of verbal ability, but not as a measure of numerical fluency. The score itself is neither valid nor invalid; it is the meaning attached to the score that is arguably valid or not valid.

To illustrate how to solve the problem of establishing validity, imagine you have designed a new way of measuring spelling ability. You want to sell the test forms to school districts across the country, but first you must produce convincing evidence that your test measures spelling ability and not something else, such as vocabulary, test anxiety, reading ability, or (in terms of other factors that might affect scores) gender or race.

Strategies for Winning the Validity Argument

Validity might seem like an argument that can never be won, because as an invisible indicator of quality, it can never be completely established. As a test developer, though, you want to be able to convince your test takers and anyone who will be using the results of your test that you are, in substance, measuring whatever it is you are supposed to measure. Fortunately, there are a number of accepted ways in which evidence for the validity of a test can be provided.

The most commonly accepted type of validity evidence is also, interestingly, theoretically the weakest argument one can make for validity. This argument is one of face validity, and it runs as follows: this test is valid because it looks (on its face) like it measures what it is supposed to measure. Those presenting or accepting an argument for face validity believe that the test in question has the sort of items that one would expect to find on such a test. For example, the seatbelt use survey mentioned earlier would be accepted as valid if it has items asking about seatbelt use.

The face validity argument is weak because it relies on human judgment alone, but it can be compelling. Common sense is a strong argument, perhaps even the strongest, for convincing someone to accept any aspect of an assessment. Though face validity seems less scientific than other types of validity evidence (and in a real sense, it is less scientific), few test instruments would be acceptable to those who make and use them if face validity evidence is lacking. If you, as a test developer or user, cannot supply the types of validity evidence discussed in the rest of this hack, you are expected to provide a test that at least has face validity.

For your spelling test, if test takers are asked to spell words, you have established face validity.


Four somewhat more scientific types of validity evidence are generally accepted by those who rely on assessments. They are all part of the range of arguments that can be made for validity.


Content-based arguments

Do the items on the test fairly represent the items that could be on the test? If a test is meant to cover some well-defined domain of knowledge, do the questions fairly sample from that domain?


Criterion-based arguments

Do scores on the test estimate or predict performance on some other measure?


Construct-based arguments

Does the score on the test represent the trait or characteristic you wish to measure?


Consequences-based arguments

Do the people who take the test benefit from the experience? Is the test biased against certain groups? Does taking the test cause so much stress that, no matter how you score, it isn't worth it?

Content-Based Arguments

If you decide to measure a concept, there are many aspects of that concept and many different questions that can be asked on a test. Some demonstration that the items you choose for your test represent all possible items would be a content-based argument for validity.

This sounds like a daunting requirement. Traditionally, this sort of evidence has been considered more important for tests of achievement. In areas of achievement (medicine, law, English, mathematics), there are fairly well-defined domains and content areas from which a valid test should sample. A classroom teacher also, presumably, has defined a set of objectives or content areas that a test should measure. Such concisely defined aspects of a subject are rarely available, however, when testing a range of behaviors, knowledge, or attitudes. Consequently, making a reasonable argument that you have selected questions that are representative of some imaginary pool of all possible questions is difficult.

So, what is necessary for content-based evidence of validity? At a minimum, test construction calls for some organized method of selecting or writing questions. When measuring self-esteem, for example, questions might cover how the test taker feels about himself in different environments (e.g., work, home, or school), while performing different tasks (e.g., sports, academics, or job duties), or how he feels about different aspects of himself (e.g., his appearance, intelligence, or social skills).

For a classroom teacher measuring how much students have learned during the last few weeks, a table of specifications (an organized list of topics covered and weights indicating their importance) is a good method.
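
To make this concrete, here is a minimal sketch in Python of how a table of specifications might drive item allocation. The topics, weights, and test length are hypothetical, invented for illustration rather than taken from any particular classroom.

    # A hypothetical table of specifications: topics covered in a unit,
    # weighted by the share of instructional time each received.
    table_of_specifications = {
        "causes of the Civil War": 0.40,
        "major battles": 0.30,
        "Reconstruction": 0.20,
        "key figures": 0.10,
    }

    TOTAL_ITEMS = 25  # hypothetical test length

    # Allocate items to each topic in proportion to its weight.
    # (Rounding can leave you one item over or under; adjust by hand.)
    for topic, weight in table_of_specifications.items():
        n_items = round(weight * TOTAL_ITEMS)
        print(f"{topic}: {n_items} items")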


The choice of how to organize a concept or how to break it down into components belongs to the test developer. The developer might have been inspired by research or other tests, or she might just be following a common-sense scheme. The key is to convince yourself, and then others, that you are covering the vital aspects of whatever area you are measuring.

For your spelling test, if you can establish that the words students are asked to spell represent a larger pool of words that students should be able to spell, you are providing content-based validity evidence.

Criterion-Based Arguments

Criterion evidence of validity demonstrates that responses on a test predict performance in some other situation. "Performance" can mean success in a job, a test score, ratings by others, and so on.

If responses on the test are related to performance on criteria that can be measured immediately, the validity evidence is referred to as concurrent validity. If responses on the test are related to performance on criteria that cannot be measured until some future time (e.g., eventual college graduation, treatment success, or eventual drug abuse), the validity evidence is called predictive validity.

It might go without saying that the measures you choose to support criterion validity should be relevant; the criteria should be measures of concepts that are somehow theoretically related. This form of validity evidence is most persuasive and important when the express purpose of a test is to estimate or predict performance on some other measure.

Criterion-based evidence is less persuasive, and perhaps irrelevant, for tests that do not claim to predict the future or estimate performance on some other measure. For example, such evidence might not be useful for your spelling test. On the other hand, it is possible that you can demonstrate that high scorers on your test do well in the National Spelling Bee.
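
As a sketch of what that demonstration might look like, suppose you had each student's score on your test and, later, each student's finishing place in a spelling bee. The data below are invented, and the use of SciPy's spearmanr is one reasonable choice, since finishing place is an ordinal criterion.

    from scipy.stats import spearmanr

    # Hypothetical data: scores on your spelling test, and each student's
    # eventual finishing place in a spelling bee (1 = first place).
    test_scores = [92, 85, 78, 74, 66, 60, 55, 51]
    bee_placements = [1, 3, 2, 4, 6, 5, 8, 7]

    # Spearman's rank correlation suits an ordinal criterion like
    # placement. A strongly negative rho (high scores go with low, i.e.,
    # good, placements) would be predictive validity evidence.
    rho, p_value = spearmanr(test_scores, bee_placements)
    print(f"rho = {rho:.2f}, p = {p_value:.3f}")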

Construct-Based Arguments

The third category of validity evidence is construct evidence. A construct (pronounced with an emphasis on the first syllable: con-struct) is the theoretical concept or trait that a test is designed to measure. We know that we can never measure constructs such as intelligence or self-esteem directly. The methods of psychological measurement are indirect. We ask a series of questions that we hope will require the respondent to use the part of her mind we are measuring, reference the portion of her memory that contains information about past behaviors or knowledge, or, at the very least, examine her attitudes and feelings on a particular topic.

We further hope that the test takers respond to test items accurately and honestly. In practice, test results are often treated as a direct measure of a construct, but we shouldn't forget that they are only educated guesses. The success of this whole process depends on another set of assumptions: that we have correctly defined the construct we are trying to measure and that our test mirrors that definition.

Construct evidence, then, often includes both a defense of the defined construct itself and a claim that the instrument used reflects that definition. Evidence presented for construct validity can include a demonstration that responses behave as theory would expect responses to behave. Construct validity evidence continues to accumulate whenever a survey or test is used, and, like all validity arguments, it can never be fully convincing. In a sense, construct validity arguments include both content and criterion validity arguments, because all validity evidence seeks to establish a link between a concept and the activity that claims to measure it.

For your spelling test, there might be research on the nature of spelling ability as a cognitive activity or personality trait or some other well-defined entity. If you can define what you mean by spelling ability and demonstrate that your test's scores behave as your definition would expect, then you can claim construct-based validity evidence. Does theory suggest that better readers are better spellers? Show that relationship, perhaps with a correlation coefficient [Hack #11], and you have presented validity evidence that might convince others.
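
Here is a minimal sketch of that demonstration, again with invented scores. A substantial positive Pearson correlation between spelling and reading scores for the same students is the kind of theory-consistent behavior that supports a construct-based argument.

    from scipy.stats import pearsonr

    # Hypothetical paired scores for the same eight students.
    spelling_scores = [88, 92, 75, 60, 95, 70, 82, 55]
    reading_scores = [85, 90, 70, 65, 92, 72, 80, 58]

    # If your construct definition says better readers should be better
    # spellers, a substantial positive r supports that definition.
    r, p_value = pearsonr(spelling_scores, reading_scores)
    print(f"r = {r:.2f}, p = {p_value:.3f}")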

Consequences-Based Arguments

Until the last decade or two, measurement folks interested in establishing validity were concerned only with demonstrating that the test score reflected the construct. Because of increasing concerns that certain tests might unfairly penalize whole groups of people, plus other concerns about the social consequences of the common use of tests, policy makers and measurement philosophers now look at the consequences experienced by the test taker because of taking a test.

The idea is that we have gotten so used to testing and making high-stakes decisions based on those test scores that we should take a step back occasionally and ask whether society is really better off if we rely on tests to make these decisions. This represents a broadening of the definition of validity from a score representing the construct to a test fulfilling its intended purpose. Presumably, tests are here to help the world, not hurt it, and consequences-based validity evidence helps to demonstrate the societal value of testing.

Like people from the government in all those old jokes, tests are "here to help us."


For your spelling test, the key negative consequences you want to rule out involve test bias. If your theory of spelling ability expects no differences across gender, race, or socio-economic status, then spelling scores should be equal between those groups. Produce evidence of similar scores between groups, perhaps with a t test [Hack #17], and you will be well on your way to establishing that your test is fair and valid.
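
A sketch of that comparison, with invented data: an independent-samples t test on the scores of two groups. A small t statistic and a large p-value would suggest no meaningful difference between the groups on your test.

    from scipy.stats import ttest_ind

    # Hypothetical spelling scores for two groups of test takers.
    group_a = [78, 85, 90, 72, 88, 76, 81]
    group_b = [80, 83, 88, 74, 86, 79, 77]

    # Similar group means are consistent with an unbiased test on this
    # dimension; a large, reliable difference would be a red flag.
    t_stat, p_value = ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")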

Choosing from the Menu of Validity Options

The variety of categories of validity evidence described here represents a strategic menu of options. If you want to demonstrate validity, you can choose from across the range of validity evidence types.

Clearly, not all tests need to provide all types of validity evidence. A small teacher-made history test meant for a group of 25 students might require only some content-based validity evidence to convince the teacher to trust the results. Criterion-based validity evidence is unnecessary, because estimating performance on another test is not an intended purpose of this sort of test.

On the other hand, higher-stakes tests, such as college admissions tests (e.g., the ACT, SAT, and GRE) and intelligence tests used to identify students as eligible for special education funding, should be supported with evidence from all four validity areas. For your spelling test, you can decide which type of evidence, and which type of argument, is most convincing.



