TRANSCRIPT
Revising Multiple Choice Test Items: Now What?
College of Pharmacy December 15, 2014
Objectives
• Consider the first three (of five) keys to quality assessment
• Differentiate criterion-referenced tests from norm-referenced tests
• Revisit and discuss the use of Type K questions
• Put item discrimination and p-values in perspective when analyzing item effectiveness
• Revise test items using guidelines for criterion-referenced and norm-referenced exams
• Consider using an assessment blueprint
• Consider student involvement in revising items and analyzing instructional goals
Anatomy of a Multiple-choice Question
STEM: Patients with congenital adrenal hyperplasia present with excessive circulating levels of ________________

OPTIONS (the key, i.e. the correct answer, plus distracters):
a) ACTH
b) Aldosterone
c) BAM22
d) Cortisol
e) CXCR7
Five Keys to put things into perspective
Key 1: Clear Purpose
• Who will use the information?
• How will they use it?
• What information (and in what detail) is required?

Key 2: Clear Targets
Key 3: Sound Design
Key 4: Effective Communication
Key 5: Student Involvement
What type of test is it?
Criterion-Referenced – Shows how an individual performs on a given task, not how they compare to other test takers
Norm-Referenced – Provides an estimate of the position of a tested individual in a predefined population
Criterion-Referenced
• The goal is for ALL examinees to score as high as possible.
• Not interested in maximizing test score variance.
• The index of discrimination is not useful; other measures, such as sensitivity to instruction, are used to judge item quality.
Criterion-referenced Tests
• The criterion is the subject matter the test is designed to assess
• Often involves a cutscore
– Criterion: "add two single-digit numbers correctly to a maximum sum of 9."
– Cutscore: minimum of 80% to pass.
Learning Target Types
• Knowledge Targets • Reasoning Targets • Skill Targets • Product Targets • Disposition Targets
Assessment Blueprint Based on Criterion
Columns: Target type (K, R, S, P, or D) | Criterion | Item #'s | Points

• Write an addition problem in a horizontal line using the correct operators (Items 3, 4, 5; 3 points)
• Add two single-digit numbers correctly to a maximum sum of 9 (Items 1, 2, 6, 9, 10; 5 points)
• Select the correct operator to complete an addition or subtraction problem with a maximum sum of 9 (Items 7, 8; 2 points)
• (Remaining rows of the blueprint template are blank.)
Student Involvement
Worksheet columns: Question | Criterion/Concept/Skill | Right | Wrong | Simple Mistake | Don't Get It
(Each row below is marked in one of the four result columns.)

1. Add two single-digit numbers correctly to a maximum sum of 9
2. Add two single-digit numbers correctly to a maximum sum of 9
3. Write an addition problem in a horizontal line using the correct operators
4. Write an addition problem in a horizontal line using the correct operators
5. Write an addition problem in a horizontal line using the correct operators
6. Add two single-digit numbers correctly to a maximum sum of 9
7. Select the correct operator to complete an addition or subtraction problem with a maximum sum of 9
8. Select the correct operator to complete an addition or subtraction problem with a maximum sum of 9
9. Add two single-digit numbers correctly to a maximum sum of 9
10. Add two single-digit numbers correctly to a maximum sum of 9
Instructional Sensitivity
“The degree to which students’ performances on a test accurately reflect the quality of instruction specifically provided to promote students’ mastery of what is being assessed.”

Popham, W. J. (2010). Instructional sensitivity. In W. J. Popham (Ed.), Everything school leaders need to know about assessment. Thousand Oaks, CA: Sage.
Norm-Referenced Tests
• Scholastic Aptitude Test (SAT) and Graduate Record Exam (GRE)
• IQ tests • Auditions and job interviews
How does student X compare to student Y (and everybody else)?
Are we getting too many statistics for the…
• Sample size?
• Intent of the test?
• Way individual items were tracked (or not) over time?
Two measures of item effectiveness: Difficulty and Discrimination
• Difficulty (p-value) – The proportion of examinees who answer an item correctly
• Item Discrimination (iD) – A comparison of top scorers with low scorers
Item Difficulty (p-value)

p = (# who got the item correct) / (# of students who answered the item)

Example: 8 got it correct out of 42 students who answered the item:
p = 8 / 42 = .19
Item Difficulty p-value range
The higher the value, the easier the item.
– Above 0.90 -- too easy; review for question’s purpose (Warm up? Fundamental concept?)
– Below 0.20 -- too difficult; review for confusing language, remove from subsequent exams, and/or identify as area for re-instruction.
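The p-value computation and the review thresholds above can be sketched in a few lines of Python (hypothetical helper names, not part of any named item-analysis package), using the 8-of-42 example from the earlier slide:

```python
def item_difficulty(responses):
    """p-value: proportion of examinees who answered the item correctly.
    responses: list of 1 (correct) / 0 (incorrect) for one item."""
    return sum(responses) / len(responses)

def review_flag(p):
    """Apply the review thresholds from the slide above."""
    if p > 0.90:
        return "too easy: review the question's purpose"
    if p < 0.20:
        return "too difficult: check wording or re-teach"
    return "acceptable"

# The slide's example: 8 of 42 students answered correctly.
responses = [1] * 8 + [0] * 34
p = item_difficulty(responses)
print(f"p = {p:.2f} ({review_flag(p)})")
```

Run on the example data, this reproduces the .19 p-value and flags the item as too difficult.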
Item Discrimination
Rank examinees by total score and compare the top 27% with the bottom 27%:

iD = [(# upper group correct) – (# lower group correct)] / (# of students in the upper group)

Example: 5 upper-group students correct, 2 lower-group students correct, 6 students per group:
iD = (5 – 2) / 6 = .50
Item Discrimination ranges

Negative iD: Unacceptable – check for item error
0% – 24%: Usually unacceptable
25% – 39%: Good item
40% – 100%: Excellent item
Adapted from University of Wisconsin Oshkosh: http://www.uwosh.edu/testing/facultyinfo/itemdiscrimone.php
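The iD formula and the interpretation bands above combine into a short Python sketch (hypothetical helper names; the cut points are the Oshkosh-adapted ranges from this slide):

```python
def item_discrimination(upper_correct, lower_correct, group_size):
    """iD = (# upper-group correct - # lower-group correct) / group size,
    where the groups are the top and bottom 27% of total scorers."""
    return (upper_correct - lower_correct) / group_size

def interpret_id(iD):
    """Interpretation bands adapted from the slide above."""
    if iD < 0:
        return "unacceptable: check for item error"
    if iD < 0.25:
        return "usually unacceptable"
    if iD < 0.40:
        return "good item"
    return "excellent item"

# Slide example: 5 of 6 upper-group students correct, 2 of 6 lower-group.
iD = item_discrimination(5, 2, 6)
print(f"iD = {iD:.2f} ({interpret_id(iD)})")  # iD = 0.50 (excellent item)
```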
One last comment on … T-values and Statistical Significance
• The score obtained when you perform a t-test to look at the statistical significance of an item.
• Statistical significance is important in large samples but difficult to achieve in cohort populations the size of most OHSU programs UNLESS items are used over time AND not modified between uses.
Item Analysis Guidelines Rewriting Items For All Assessments
• Establish criterion references. Common error: teach for analysis of data, ability to discover trends, ability to infer meaning, etc., and then construct a test measuring recognition of facts.
• Make distracters distracting! If only one distracter in a five-option MCQ is effective, the item is effectively a two-option item.
• No negative iDs. Items with negative item discrimination must be revised or discarded.
Item Analysis Guidelines For Assessments Designed to Rank Students
• Difficulty (p-value) between .20 and .80, with a target of .40 to .70. Very hard or very easy items usually contribute little to the discriminating power of a test.
• Item discrimination above .25. Items should discriminate between upper and lower groups.
The Dreaded Type K (Complex multiple-choice)

Which of the following behaviors suggests that you’re losing it?
A. You light a match to check a gas leak.
B. You pick apart your relationship with your significant other.
C. You advise your teenage son to use his own best judgment.
D. A and B
E. B and C
F. All of the above

Berk, R. (1996). A consumer’s guide to multiple choice item formats that measure complex cognitive outcomes. Pearson Publishing.
Type K Questions: What Research Shows
• Fewer can be answered in a given time period
• Likely more dependent on test-taking skills and reading than on subject knowledge
• Often have lower item discrimination scores

Haladyna, T. M. (1992). The effectiveness of several multiple-choice formats. Applied Measurement in Education, 5, 73-88.
Credo
Unless your testing goal is to assess students’ ability to perform well on Type K questions, avoid them.
MCQ Test Item Revision
• Identify targets and criteria for each item
• Check for:
– Focus on a single important concept
– A stem posing a clear question
– Distracters that are homogeneous and plausible
• Modify Type K questions
• Review distracter performance for spread among lower-performing students
• Discard or rewrite items with negative iDs
Objectives
• Consider the first three (of five) keys to quality assessment
• Differentiate criterion-referenced tests from norm-referenced tests
• Revisit and discuss the use of Type K questions
• Put item discrimination and p-values in perspective when analyzing item effectiveness
• Revise test items using guidelines for criterion-referenced and norm-referenced exams
• Consider using an assessment blueprint
• Consider student involvement in revising items and analyzing instructional goals
The mid-term, the perfect test question, and the tearful prof

In assessing Mr. Delgado, which behavior is the most reassuring sign that he has been following his treatment plan for his hypertension and diabetes?
A. He has a list of glucose readings for the past 10 days
B. He has a list of medications along with newly refilled meds
C. He has kept a nutritional log for a 3-day period
D. He can verbalize the side effects of all his medications

Keyed answer: B. He has a list of medications along with newly refilled meds.
The consultation …
Goal:
– Learn all the important content
– Learn how to think critically about the subject

Teaching activities:
– Lecture: experts conduct hour-long lectures

Feedback/Assessment: mid-term exam
Result: students could not reason through to the right answer
Discussion: should you assess what you haven’t taught?
Reliability Kuder-Richardson Formula 20 (KR-20)
• A measure of internal consistency computed from a single administration of a test composed of dichotomous (right/wrong) items. (Administering the same test twice and correlating the scores is a different index, test-retest reliability.)
• KR-20 estimates how consistently the items measure the same construct; it is the special case of Cronbach’s alpha for 0/1-scored items.
• Acceptable reliability coefficients?
– 0.60 is an acceptable lower value
Finding Good Dogs and Bad Dogs
• Which items had the best – difficulty scores? – discrimination scores?
• Which items were good foundational questions? • Comparing difficulty AND discrimination, which
items had the best balance of the two? • What is your overall “take” about this exam?
Item Difficulty: Trivia
When guessing is taken into account

Optimal p = (1.0 + g) / 2, where g = chance of guessing correctly = 1 / (number of options)

• True/False (2 options, g = .50): optimal p = .75
• MCQ, 4 options (g = .25): optimal p = .63
• MCQ, 5 options (g = .20): optimal p = .60
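The pattern here (the optimal p-value sits halfway between chance and a perfect score) can be checked with a quick sketch:

```python
def optimal_p(n_options):
    """Optimal p-value when guessing is considered: (1 + g) / 2,
    where g = 1 / n_options is the chance of a blind guess being correct."""
    g = 1.0 / n_options
    return (1.0 + g) / 2.0

for n in (2, 4, 5):
    print(f"{n} options: g = {1 / n:.2f}, optimal p = {optimal_p(n):.3f}")
```

For four options this gives 0.625, which the slide rounds to .63.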
The Cognitive Domain: Bloom’s Taxonomy

Original taxonomy (Bloom, 1956), lowest to highest:
Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation

Revised taxonomy (Anderson & Krathwohl, 2001), lowest to highest:
Remembering, Understanding, Applying, Analyzing, Evaluating, Creating
Bloom, B. S. (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc.
Anderson, L.W. (Ed.), Krathwohl, D.R. (Ed.), Airasian, P.W., Cruikshank, K.A., Mayer, R.E., Pintrich, P.R., Raths, J., & Wittrock, M.C. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s Taxonomy of Educational Objectives (Complete edition). New York: Longman.
Verb use to guide question depth
Taxonomy level – Verbs to trigger thinking at this level

Creating – can the student create a new product or point of view?
Verbs: assemble, construct, create, design, develop, formulate, write

Evaluating – can the student justify a stand or decision?
Verbs: appraise, argue, defend, judge, select, support, value, evaluate

Analyzing – can the student distinguish between the different parts?
Verbs: appraise, compare, contrast, criticize, differentiate, discriminate, distinguish, examine, experiment, question, test

Applying – can the student use the information in a new way?
Verbs: choose, demonstrate, dramatize, employ, illustrate, interpret, operate, schedule, sketch, solve, use, write

Understanding – can the student explain ideas or concepts?
Verbs: classify, describe, discuss, explain, identify, locate, recognize, report, select, translate, paraphrase

Remembering – can the student recall or remember the information?
Verbs: define, duplicate, list, memorize, recall, repeat, reproduce, state
Remembering and Understanding are LOTS (lower-order thinking skills); Analyzing, Evaluating, and Creating are HOTS (higher-order thinking skills).