Hack 29. Test Fairly

Classroom teachers frequently create their own tests to measure their students' learning. They often worry whether their tests are too hard or too easy and whether they measure what they are supposed to measure. Item analysis tools provide the solutions to teachers' concerns.

Classroom assessment is perhaps the single most common activity in the modern schoolroom. Teachers are always making and grading tests, students are always studying for and taking tests, and the whole process is meant to support student learning. Tests must not be too hard (or too easy), and they must measure what the teacher wants them to measure. Test scores and grades are the way that teachers communicate with parents, students, and administrators, so the score at the top of the test needs to be fair. It must accurately reflect student learning, and it should be the result of a quality assessment.

Concerned teachers constantly work to improve their tests, but they are often working in the dark without solid data to guide them. What can a smart, caring teacher do to improve his tests or improve the validity of his grading? A family of statistical methods called item analysis can provide direction to teachers as they seek to develop fair assessments and grading.

Item Analysis

Item analysis is the process of examining classroom performance on individual test items. A classroom teacher might want to examine performance on parts of a test she has written, to see what areas are being mastered by her students and what areas need more review. A commercial test developer producing exams for nursing certification might want to know which items on his test are the most valid and which seem to measure something else and should therefore be removed.

In both cases, the developer of the test is interested in item difficulty and item validity. Though one example involves a high school teacher making tests for her own students, and the other example involves a large for-profit corporation, both developers are interested in the same types of data, and both can apply the same tools of item analysis.

Three Types of Classroom Assessment Problems

If you are a classroom teacher worried about your own assessments, there are three different types of questions that you probably need to answer. Fortunately, there are three item-analysis tools that will provide you with the three different types of information you need.

Are my test questions too hard?

The difficulty of any specific test question can be calculated fairly easily using the formula for the difficulty index. You can produce a difficulty index for a test item by calculating the proportion of students taking the test who got that item correct. The larger the proportion, the more test takers know the information measured by the item.

The term difficulty index is counterintuitive, because it actually provides a measure of how easy the item is, not the difficulty of the item. An item with a high difficulty index is an easy item, not a tough one.

How hard is too hard? You get to decide that yourself. Some teachers treat difficulty indices at .50 or below as too hard because half or more of the students missed the item. You might have higher standards. If you believe that most students should have learned the material and your difficulty index for an item suggests that a substantial portion of your class missed it, it might be too hard.

Is each test question measuring what it is supposed to?

Measurement experts say that if a test item measures what it is supposed to, then it is valid [Hack #32]. The discrimination index is a basic measure of an item's validity, as well as its reliability. It measures an item's ability to discriminate between those who scored high on the total test and those who scored low.

Though there are several steps in its calculation, once computed, this index can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skills is related to the response on an item.

A discrimination index is not so named because it suggests test bias. Discrimination is the ability to identify whether one who got an item correct is in a high-scoring group or a low-scoring group.

Why did my students miss a question?

In addition to examining the performance of an entire test item, teachers are often interested in examining the performance of individual distractors (incorrect answer options) on multiple-choice items through analysis of answer options. By calculating the proportion of students who choose each answer option, teachers can see what sorts of errors students are making. Have they mislearned certain concepts? Do they have common confusions about the material?

To improve how well the item works from a measurement perspective, teachers also can identify which distractors are "working" and appear attractive to students who do not know the correct answer, and which distractors are simply taking up space and are not being chosen by many students.

To reduce the chance that a student who guesses lands on the correct answer, teachers and test developers want as many plausible distractors as is feasible. Analyses of response options allow teachers to fine-tune and improve items they might want to use again with future classes.

Conducting Item Analyses and Interpreting Results

Here are the procedures for the calculations involved in item analysis, using data for an example item. For this example, imagine a classroom of 25 students who took a test that included the item in Table 3-6 (keep in mind, though, that even large-scale standardized test developers use the same procedures for tests taken by hundreds of thousands of people).

The asterisk for the answer options in Table 3-6 indicates that B is the correct answer.

Table 3-6. Sample item for item analysis
Answer to question: "Who wrote The Great Gatsby?"	Number of students who chose each answer
A. Faulkner	4
B. Fitzgerald*	16
C. Hemingway	5
D. Steinbeck	0

To calculate the difficulty index:

Count the number of people who got the correct answer.
Divide by the total number of people who took the test.

On the item shown in Table 3-6, 16 out of 25 people got the item right:

16 / 25 = .64

Difficulty indices range from .00 to 1.0. In our example, the item had a difficulty index of .64. This means that 64 percent of students answered the item correctly.
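The two steps above can be sketched in a few lines of Python. This is only an illustration: the function name difficulty_index and its parameter names are made up for this example, and the counts are the ones from Table 3-6.

```python
def difficulty_index(num_correct, num_test_takers):
    """Proportion of test takers who answered the item correctly.

    Hypothetical helper for illustration; higher values mean an easier item.
    """
    if num_test_takers <= 0:
        raise ValueError("num_test_takers must be positive")
    return num_correct / num_test_takers


# Counts from the Great Gatsby item: 16 of 25 students chose Fitzgerald.
p = difficulty_index(16, 25)
print(f"Difficulty index: {p:.2f}")  # Difficulty index: 0.64
```

A teacher could run the same function over every item on a test and compare each result with whatever cutoff she has chosen for "too hard."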

If a teacher believes that .64 is too low, there are a couple of actions she can take. She could decide to change the way she teaches to better meet the objective represented by the item. Another interpretation might be that the item was too difficult or confusing or invalid, in which case the teacher can replace or modify the item, perhaps using information from the item's discrimination index or analysis of response options.

To calculate the discrimination index:

Sort your tests by total score, and create two groupings of tests: the high scores, made up of the top half of tests, and the low scores, made up of the bottom half of tests.
For each group, calculate a difficulty index for the item.
Subtract the difficulty index for the low scores group from the difficulty index for the high scores group.

Imagine that in our example 10 out of 13 students (or tests) in the high group and 6 out of 12 students in the low group got the item correct. The high group difficulty index is .77 (10/13) and the low group difficulty index is .50 (6/12), so we can calculate the discrimination index like so:

.77 - .50 = .27

The discrimination index for the item is .27. Discrimination indices range from -1.0 to 1.0. The greater the positive value (the closer it is to 1.0), the stronger the relationship is between overall test performance and performance on that item.
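As a rough sketch, the three steps can be written as one small Python function. The function name and the record layout (one pair of total score and right-or-wrong per student) are assumptions for this example. The total scores below are invented; only the group counts (10 of 13 and 6 of 12) come from the text. When the class size is odd, this sketch puts the extra student in the high group, which matches the 13/12 split used here.

```python
def discrimination_index(records):
    """records: list of (total_score, item_correct) pairs, one per student.

    Sorts by total score, splits into a high half and a low half, and
    returns high-group difficulty minus low-group difficulty for the item.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    high_size = (len(ranked) + 1) // 2  # extra student goes to the high group
    high, low = ranked[:high_size], ranked[high_size:]
    if not high or not low:
        raise ValueError("need at least two students")
    high_p = sum(1 for _, correct in high if correct) / len(high)
    low_p = sum(1 for _, correct in low if correct) / len(low)
    return high_p - low_p


# Hypothetical class of 25 (scores are invented for illustration).
high_group = [(90 - i, i < 10) for i in range(13)]  # 10 of 13 correct
low_group = [(60 - i, i < 6) for i in range(12)]    # 6 of 12 correct
d = discrimination_index(high_group + low_group)
print(f"Discrimination index: {d:.2f}")  # Discrimination index: 0.27
```

Students tied at the boundary between the two groups are split arbitrarily by this sketch; a teacher with many ties might prefer to decide those cases by hand.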

If the discrimination index is negative, that means that, for some reason, students who scored low on the test were more likely to get the answer correct. This is a strange situation, and it suggests poor validity for an item or that the answer key was incorrect. Teachers usually want each item on the test to tap into the same knowledge or skill as the rest of the test.

The formula for the discrimination index is such that if more students in the high-scoring group chose the correct answer than did students in the low-scoring group, the number is positive. At a minimum, then, a teacher would hope for a positive value, because that would indicate that knowledge resulted in the correct answer.

We can use the information provided in Table 3-6 to look at the popularity of different answer options, as shown in Table 3-7.

Table 3-7. Item analysis of "Who wrote The Great Gatsby?"
Answer	Popularity of options	Proportion choosing option
A. Faulkner	4/25	.16
B. Fitzgerald*	16/25	.64
C. Hemingway	5/25	.20
D. Steinbeck	0/25	.00

The analysis of response options shows that students who missed the item were about equally likely to choose answer A and answer C. No students chose answer D, so answer option D does not act as a distractor. Students are not choosing between four answer options on this item; they are really choosing between only three options, since they are not even considering answer D.

This makes guessing correctly more likely, which hurts the validity of an item. A teacher might interpret this data as evidence that most students make the connection between The Great Gatsby and Fitzgerald, and that the students who don't make this connection can't differentiate between Faulkner and Hemingway very well.
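The analysis of response options can be sketched in Python as follows. The function names and the min_proportion cutoff are hypothetical choices for this example; with the default of 0.0, only options that nobody chose are flagged, as with answer D above.

```python
from collections import Counter


def option_proportions(responses, options):
    """Proportion of students who chose each answer option."""
    counts = Counter(responses)
    total = len(responses)
    return {opt: counts.get(opt, 0) / total for opt in options}


def nonfunctional_distractors(responses, options, correct, min_proportion=0.0):
    """Incorrect options chosen by no more than min_proportion of students."""
    props = option_proportions(responses, options)
    return [opt for opt in options
            if opt != correct and props[opt] <= min_proportion]


# Responses reconstructed from Table 3-6: A=4, B=16, C=5, D=0.
responses = ["A"] * 4 + ["B"] * 16 + ["C"] * 5
options = ["A", "B", "C", "D"]

props = option_proportions(responses, options)
dead = nonfunctional_distractors(responses, options, correct="B")
working_options = len(options) - len(dead)

for opt in options:
    print(f"{opt}: {props[opt]:.2f}")
print("Nonfunctional distractors:", dead)
print(f"Chance of a blind guess being right: 1 in {working_options}")
```

Because D draws no responses, a student guessing blindly among the options actually in play has a one-in-three chance rather than one-in-four.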

Suggestions for Item Analysis and Test Fairness

To improve the quality of tests, item analysis can identify items that are too difficult (or too easy, if a teacher has that concern), don't differentiate between those who have learned the content and those who have not, or have distractors that are not plausible.

If you as a teacher have concerns about test fairness, you can change the way you teach, change the way you test, or change the way you grade the tests:

Change the way you teach: If some items are too hard, you can adjust the way you teach. Emphasize unlearned material or use a different instructional strategy. You might specifically modify instruction to correct a confusing misunderstanding about the content.
Change the way you test: If items have low or negative discrimination values, they can be removed from the current test, and you can remove them from the pool of items for future tests. You can also examine the item, try to identify what was tricky about it, and change the item. When distractors are identified as being nonfunctional (no one picks them), teachers can tinker with the item and create a new distractor. One goal for a valid and reliable test is to decrease the chance that random guessing could result in credit for a correct answer. The greater the number of plausible distractors, the more accurate, valid, and reliable the test typically becomes.
Change the way you grade: You might use item analysis information to decide that the material was not taught and, for the sake of fairness, remove the item from the current test and recalculate scores. The simplest way for real classroom teachers to do this is to count the number of bad items on a test and add that number to everyone's score. This is not technically the same as rescoring the test as if the item never existed, but this way students still get credit if they got a hard or tricky item correct, which seems fairer to most teachers.
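To see how the two grading adjustments differ, here is a small Python sketch. The function names, the 10-item test, and the two students are all invented for illustration; item 7 stands in for a "bad" item.

```python
def add_credit_for_bad_items(raw_score, num_bad_items):
    """Shortcut from the text: give everyone one extra point per bad item."""
    return raw_score + num_bad_items


def rescore_without_items(item_results, bad_items):
    """Rescore as if the bad items never existed.

    item_results maps item number -> True/False (correct or not).
    Returns (points_earned, points_possible).
    """
    kept = [correct for item, correct in item_results.items()
            if item not in bad_items]
    return sum(kept), len(kept)


# Hypothetical 10-item test; both students have a raw score of 8.
student_x = {i: i not in (3, 5) for i in range(1, 11)}  # got item 7 right
student_y = {i: i not in (3, 7) for i in range(1, 11)}  # missed item 7
bad_items = {7}

for name, results in (("X", student_x), ("Y", student_y)):
    raw = sum(results.values())
    shortcut = add_credit_for_bad_items(raw, len(bad_items))
    earned, possible = rescore_without_items(results, bad_items)
    print(f"Student {name}: raw {raw}/10, shortcut {shortcut}/10, "
          f"rescored {earned}/{possible}")
```

Under the shortcut, both students end up with 9 out of 10. Under strict rescoring, student X, who answered the tricky item correctly, drops to 7 out of 9 while student Y gets 8 out of 9, which is the loss of earned credit that the shortcut avoids.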

These concerns that teachers have about the quality of their tests are not much different from the research questions that scientists ask. Just like scientists, teachers can collect data in their classroom, analyze the data, and interpret results. They can then decide, based on their own personal philosophies, how to act on those results.