Classroom teachers frequently create their own tests to measure their students' learning. They often worry whether their tests are too hard or too easy and whether they measure what they are supposed to measure. Item analysis tools provide solutions to these concerns.
Classroom assessment is perhaps the single most common activity in the modern schoolroom. Teachers are always making and grading tests, students are always studying for and taking tests, and the whole process is meant to support student learning. Tests must not be too hard (or too easy), and they must measure what the teacher wants them to measure. Test scores and grades are the way that teachers communicate with parents, students, and administrators, so the score at the top of the test needs to be fair. It must accurately reflect student learning, and it should be the result of a quality assessment.
Concerned teachers constantly work to improve their tests, but they are often working in the dark without solid data to guide them. What can a smart, caring teacher do to improve his tests or improve the validity of his grading? A family of statistical methods called item analysis can provide direction to teachers as they seek to develop fair assessments and grading.
Item analysis is the process of examining classroom performance on individual test items. A classroom teacher might want to examine performance on parts of a test she has written, to see what areas are being mastered by her students and what areas need more review. A commercial test developer producing exams for nursing certification might want to know which items on his test are the most valid and which seem to measure something else and should therefore be removed.
In both cases, the developer of the test is interested in item difficulty and item validity. Though one example involves a high school teacher making tests for her own students, and the other example involves a large for-profit corporation, both developers are interested in the same types of data, and both can apply the same tools of item analysis.
Three Types of Classroom Assessment Problems
If you are a classroom teacher worried about your own assessments, there are three different types of questions that you probably need to answer. Fortunately, there are three item-analysis tools that will provide you with the three different types of information you need.
Are my test questions too hard?
The difficulty of any specific test question can be calculated fairly easily using the formula for the difficulty index. You can produce a difficulty index for a test item by calculating the proportion of students taking the test who got that item correct. The larger the proportion, the greater the number of test takers who know the information measured by the item.
How hard is too hard? You get to decide that yourself. Some teachers treat difficulty indices at .50 or below as too hard because most people missed the item. You might have higher standards. If you believe that most students should have learned the material and your difficulty index for an item suggests that a substantial portion of your class missed it, it might be too hard.
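Because the difficulty index is just a proportion, it takes only a line or two of code. Here is a minimal Python sketch (the function name is my own, and the numbers are illustrative):

```python
def difficulty_index(num_correct, num_takers):
    """Proportion of test takers who answered the item correctly."""
    return num_correct / num_takers

# If 16 of 25 students answered an item correctly:
print(difficulty_index(16, 25))  # 0.64
```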
Is each test question measuring what it is supposed to?
Measurement experts say that if a test item measures what it is supposed to, then it is valid [Hack #32]. The discrimination index is a basic measure of an item's validity (and, indirectly, its reliability). It measures an item's ability to discriminate between those who scored high on the total test and those who scored low.
Though there are several steps in its calculation, once computed, this index can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skills is related to the response on an item.
Why did my students miss a question?
In addition to examining the performance of an entire test item, teachers are often interested in examining the performance of individual distractors (incorrect answer options) on multiple-choice items through analysis of answer options. By calculating the proportion of students who choose each answer option, teachers can see what sorts of errors students are making. Have they mislearned certain concepts? Do they have common confusions about the material?
To improve how well the item works from a measurement perspective, teachers also can identify which distractors are "working" and appear attractive to students who do not know the correct answer, and which distractors are simply taking up space and are not being chosen by many students.
To eliminate educated guesses that result in correct answers purely by chance, teachers and test developers want as many plausible distractors as is feasible. Analyses of response options allow teachers to fine-tune and improve items they might want to use again with future classes.
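The proportions behind an analysis of answer options are simple counts. A minimal Python sketch (the answer letters and the exact split among distractors here are my own illustration, not data from the text):

```python
from collections import Counter

def option_proportions(responses):
    """Proportion of test takers who chose each answer option."""
    n = len(responses)
    return {option: count / n for option, count in Counter(responses).items()}

# Hypothetical answer sheet for one item: 16 students chose the correct
# answer (B here), the other 9 split roughly evenly between A and C,
# and no one chose D.
responses = ["B"] * 16 + ["A"] * 5 + ["C"] * 4
print(option_proportions(responses))
```

An option that draws close to zero responses, like D above, is not functioning as a distractor and is a candidate for replacement.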
Conducting Item Analyses and Interpreting Results
Here are the procedures for the calculations involved in item analysis, using data for an example item. For this example, imagine a classroom of 25 students who took a test that included the item in Table 3-6 (keep in mind, though, that even large-scale standardized test developers use the same procedures for tests taken by hundreds of thousands of people).
To calculate the difficulty index:
Divide the number of students who got the item correct by the number of students who took the test. On the item shown in Table 3-6, 16 out of 25 people got the item right:

Difficulty index = 16/25 = .64
Difficulty indices range from .00 to 1.0. In our example, the item had a difficulty index of .64. This means that 64 percent of students knew the answer.
If a teacher believes that .64 is too low, there are a couple of actions she can take. She could decide to change the way she teaches to better meet the objective represented by the item. Another interpretation might be that the item was too difficult or confusing or invalid, in which case the teacher can replace or modify the item, perhaps using information from the item's discrimination index or analysis of response options.
To calculate the discrimination index:
Imagine that in our example the 25 tests have been sorted by total score and split into a high-scoring group of 13 and a low-scoring group of 12. Suppose 10 out of 13 students in the high group and 6 out of 12 students in the low group got the item correct. The high group difficulty index is .77 (10/13) and the low group difficulty index is .50 (6/12), so we can calculate the discrimination index like so:

Discrimination index = .77 - .50 = .27
The discrimination index for the item is .27. Discrimination indices range from -1.0 to 1.0. The greater the positive value (the closer it is to 1.0), the stronger the relationship is between overall test performance and performance on that item.
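Those steps can be sketched in Python. This sketch assumes the common convention of splitting the class into a top-scoring half and a bottom-scoring half, with the extra student going to the high group when the class size is odd (which matches the 13/12 split in the example); other conventions, such as comparing the top and bottom 27 percent, also exist:

```python
def discrimination_index(item_correct, total_scores):
    """item_correct: 1 or 0 per student for the item of interest.
    total_scores: each student's total test score, in the same order."""
    # Sort students from highest to lowest total score.
    paired = sorted(zip(total_scores, item_correct), reverse=True)
    cut = (len(paired) + 1) // 2  # high group takes the extra student
    high = [correct for _, correct in paired[:cut]]
    low = [correct for _, correct in paired[cut:]]
    # Difficulty index in the high group minus that in the low group.
    return sum(high) / len(high) - sum(low) / len(low)

# 25 students, already listed from highest to lowest total score:
# 10 of the top 13 and 6 of the bottom 12 answered the item correctly.
totals = list(range(25, 0, -1))
item = [1] * 10 + [0] * 3 + [1] * 6 + [0] * 6
print(round(discrimination_index(item, totals), 2))  # 0.27
```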
If the discrimination index is negative, that means that, for some reason, students who scored low on the test were more likely to get the answer correct. This is a strange situation, and it suggests poor validity for an item or that the answer key was incorrect. Teachers usually want each item on the test to tap into the same knowledge or skill as the rest of the test.
We can use the information provided in Table 3-6 to look at the popularity of different answer options, as shown in Table 3-7.
The analysis of response options shows that students who missed the item were about equally likely to choose answer A and answer C. No students chose answer D, so answer option D does not act as a distractor. Students are not choosing between four answer options on this item; they are really choosing between only three options, since they are not even considering answer D.
This makes guessing correctly more likely, which hurts the validity of an item. A teacher might interpret this data as evidence that most students make the connection between The Great Gatsby and Fitzgerald, and that the students who don't make this connection can't differentiate between Faulkner and Hemingway very well.
Suggestions for Item Analysis and Test Fairness
To improve the quality of tests, item analysis can identify items that are too difficult (or too easy, if a teacher has that concern), don't differentiate between those who have learned the content and those who have not, or have distractors that are not plausible.
If you as a teacher have concerns about test fairness, you can change the way you teach, change the way you test, or change the way you grade the tests.
These concerns that teachers have about the quality of their tests are not much different than the research questions that scientists ask. Just like scientists, teachers can collect data in their classroom, analyze the data, and interpret results. They can then decide, based on their own personal philosophies, how to act on those results.