Hack 29. Test
How hard is too hard? You get to decide that yourself. Some teachers treat difficulty indices at .50 or below as too hard because most people missed the item. You might have higher standards. If you believe that most students should have learned the material and your difficulty index for an item suggests that a substantial portion of your class missed it, it might be too hard.
Measurement experts say that if a test item measures what it is supposed to, then it is valid [Hack #32]. The discrimination index is a basic measure of the validity of an item, in addition to its reliability. It measures an item's ability to discriminate between those who scored high on the total test and those who scored low.
Though there are several steps in its calculation, once computed, this index can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skills is related to answering the item correctly.
In addition to examining the performance of an entire test item, teachers are often interested in the performance of individual distractors (incorrect answer options) on multiple-choice items. This is done through analysis of answer options. By calculating the proportion of students who choose each answer option, teachers can see what sorts of errors students are making. Have they mislearned certain concepts? Do they have common confusions about the material?
To improve how well the item works from a measurement perspective, teachers also can identify which distractors are "working" and appear attractive to students who do not know the correct answer, and which distractors are simply taking up space and are not being chosen.
Here are the procedures for the calculations involved in item analysis, using data for an example item. For this example, imagine a classroom of 25 students who took a test that included the item in Table 3-6 (keep in mind, though, that even large-scale standardized test developers use the same procedures for tests taken by hundreds of thousands of people).
| Answer to question: "Who wrote The Great Gatsby?" | Number of students who chose each answer |
|---|---|
| A. Faulkner | 4 |
| B. Fitzgerald* | 16 |
| C. Hemingway | 5 |
| D. Steinbeck | 0 |
To calculate the difficulty index:
Count the number of people who got the correct answer.
Divide by the total number of people who took the test.
On the item shown in Table 3-6, 16 out of 25 people got the item right:
Difficulty index = 16 / 25 = .64
Difficulty indices range from .00 to 1.0. In our example, the item had a difficulty index of .64. This means that 64 percent of students knew the answer.
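The two steps above can be sketched in a few lines of Python. This is only an illustration: the function name `difficulty_index` and its parameters are made up for this example, and the counts come from Table 3-6.

```python
def difficulty_index(num_correct: int, num_takers: int) -> float:
    """Proportion of test takers who answered the item correctly.

    Hypothetical helper for illustration; not part of any library.
    """
    if num_takers <= 0:
        raise ValueError("num_takers must be positive")
    if not 0 <= num_correct <= num_takers:
        raise ValueError("num_correct must be between 0 and num_takers")
    return num_correct / num_takers


# Example item from Table 3-6: 16 of 25 students chose Fitzgerald.
p = difficulty_index(16, 25)
print(f"{p:.2f}")  # prints 0.64
```

Note that a higher value means an easier item, since the index is the proportion who got it right.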
If a teacher believes that .64 is too low, there are a couple of actions she can take. She could decide to change the way she teaches to better meet the objective represented by the item. Another interpretation might be that the item was too difficult or confusing or invalid, in which case the teacher can replace or modify the item, perhaps using information from the item's discrimination index or analysis of response options.
To calculate the discrimination index:
Sort the tests by total score and divide them into a high-scoring group and a low-scoring group.
For each group, calculate a difficulty index for the item.
Subtract the difficulty index for the low scores group from the difficulty index for the high scores group.
Imagine that in our example 10 out of 13 students (or tests) in the high group and 6 out of 12 students in the low group got the item correct. The high group difficulty index is .77 (10/13) and the low group difficulty index is .50 (6/12), so we can calculate the discrimination index like so:
Discrimination index = .77 - .50 = .27
The discrimination index for the item is .27. Discrimination indices range from -1.0 to 1.0. The greater the positive value (the closer it is to 1.0), the stronger the relationship is between overall test performance and performance on that item.
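Here is a rough Python sketch of the same calculation. The function names are hypothetical. The split rule is also an assumption: tests are ranked by total score and the top half (rounded up) becomes the high group, which happens to give 13 and 12 for a class of 25. Ties at the cut point are not treated specially.

```python
from math import ceil


def split_high_low(records):
    """Split (total_score, item_correct) pairs into high and low groups.

    Assumed rule: rank by total score; the top half, rounded up, is "high".
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    cut = ceil(len(ranked) / 2)
    return ranked[:cut], ranked[cut:]


def discrimination_index(high_correct, high_total, low_correct, low_total):
    """High-group difficulty index minus low-group difficulty index."""
    if high_total <= 0 or low_total <= 0:
        raise ValueError("both groups must contain at least one test")
    return high_correct / high_total - low_correct / low_total


# Counts from the example: 10 of 13 correct (high), 6 of 12 correct (low).
d = discrimination_index(10, 13, 6, 12)
print(f"{d:.2f}")  # prints 0.27
```

Passing the counts in the other order would flip the sign, so it is worth double-checking which group is which before interpreting a negative value.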
If the discrimination index is negative, that means that, for some reason, students who scored low on the test were more likely to get the answer correct.
We can use the information provided in Table 3-6 to look at the popularity of different answer options, as shown in Table 3-7.
| Answer | Popularity of options | Difficulty index |
|---|---|---|
| A. Faulkner | 4/25 | .16 |
| B. Fitzgerald* | 16/25 | .64 |
| C. Hemingway | 5/25 | .20 |
| D. Steinbeck | 0/25 | .00 |
The analysis of response options shows that students who missed the item were about equally likely to choose answer A and answer C. No students chose answer D, so answer option D does not act as a distractor. Students are not choosing between four answer options on this item; they are really choosing between only three options, since they are not even considering answer D.
This makes guessing correctly more likely, which hurts the validity of an item. A teacher might interpret this data as evidence that most students make the connection between The Great Gatsby and Fitzgerald, and that the students who don't make this connection can't differentiate between Faulkner and Hemingway very well.
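The analysis of answer options can also be sketched in Python, using the counts from Table 3-6. The names here are hypothetical, and the rule for flagging a distractor as nonfunctional (a share at or below `min_share`, defaulting to zero) is an assumption for this sketch, not a standard cutoff.

```python
def option_proportions(counts):
    """Share of students choosing each answer option."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses recorded")
    return {option: n / total for option, n in counts.items()}


def nonfunctional_distractors(counts, key, min_share=0.0):
    """Incorrect options whose share is at or below min_share (assumed rule)."""
    shares = option_proportions(counts)
    return [opt for opt, share in shares.items()
            if opt != key and share <= min_share]


counts = {"A": 4, "B": 16, "C": 5, "D": 0}  # Table 3-6; B is the correct answer
for option, share in option_proportions(counts).items():
    print(f"{option}: {share:.2f}")  # A: 0.16, B: 0.64, C: 0.20, D: 0.00

unused = nonfunctional_distractors(counts, key="B")
print(unused)  # ['D']

# With D ignored, a blind guess is among 3 options rather than 4.
effective_options = len(counts) - len(unused)
print(f"{1 / effective_options:.2f}")  # prints 0.33
```

The last line illustrates the guessing point: the chance of a lucky guess rises from .25 to about .33 when one of four options attracts no one.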
To improve the quality of tests, item analysis can identify items that are too difficult (or too easy, if a teacher has that concern), don't differentiate between those who have learned the content and those who have not, or have distractors that are not plausible.
If you as a teacher have concerns about test fairness, you can change the way you teach, change the way you test, or change the way you grade the tests:
If some items are too hard, you can adjust the way you teach. Emphasize unlearned material or use a different instructional strategy.
If items have low or negative discrimination values, you can remove them from the current test and from the pool of items for future tests. You can also examine the item, try to identify what was tricky about it, and change the item. When distractors are identified as being nonfunctional (no one picks them), teachers can tinker with the item and create a new distractor. One goal for a valid and reliable test is to decrease the chance that random guessing could result in credit for a correct answer. The greater the number of plausible distractors, the more accurate, valid, and reliable the test typically becomes.
You might use item analysis information to decide that the material was not taught and, for the sake of fairness, remove the item from the current test and recalculate scores. The simplest way for real classroom teachers to do this is to count the number of bad items on a test and add that number to everyone's score. This is not technically the same as rescoring the test as if the item never existed, but this way students still get credit if they got a hard or tricky item correct, which seems fairer to most teachers.
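To see how the two grading approaches differ, here is a small Python sketch. The function names, item labels, and the two students' responses are all invented for illustration. Whether to cap an adjusted score at the test's maximum is left open here.

```python
def raw_scores(responses):
    """Number of items each student answered correctly."""
    return [sum(r.values()) for r in responses]


def add_back_bad_items(scores, num_bad_items):
    """Simple approach: add the count of bad items to everyone's score."""
    return [s + num_bad_items for s in scores]


def rescore_without_items(responses, bad_items):
    """Rescore as if the bad items had never been on the test."""
    return [sum(ok for item, ok in r.items() if item not in bad_items)
            for r in responses]


# Hypothetical data: two students, four items, with "q3" judged a bad item.
responses = [
    {"q1": True, "q2": True, "q3": True, "q4": False},    # got q3 right
    {"q1": True, "q2": False, "q3": False, "q4": False},  # missed q3
]
bad = {"q3"}

print(raw_scores(responses))                                 # [3, 1]
print(add_back_bad_items(raw_scores(responses), len(bad)))   # [4, 2]
print(rescore_without_items(responses, bad))                 # [2, 1]
```

In this made-up case, the first student keeps the point earned on the tricky item under the add-back approach (4 of 4), but loses it when the item is dropped (2 of 3).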
These concerns that teachers have about the quality of their tests are not much different from the research questions that scientists ask. Just like scientists, teachers can collect data in their classroom, analyze the data, and interpret results. They can then decide, based on their own personal philosophies, how to act on those results.