
Revising Multiple Choice Test Items: Now What?

College of Pharmacy December 15, 2014

Objectives

• Consider the first three (of five) keys to quality assessment
• Differentiate criterion-referenced tests from norm-referenced tests
• Revisit and discuss the use of Type K questions
• Put item discrimination and p-values in perspective when analyzing item effectiveness
• Revise test items using guidelines for criterion-referenced and norm-referenced exams
• Consider using an assessment blueprint
• Consider student involvement in revising items and analyzing instructional goals

Anatomy of a Multiple-choice Question

STEM: Patients with congenital adrenal hyperplasia present with excessive circulating levels of ________________

OPTIONS (the key is the correct answer; the rest are distracters):
a) ACTH
b) Aldosterone
c) BAM22
d) Cortisol
e) CXCR7

What are you trying to measure?

Five Keys to put things into perspective

Key 1: Clear Purpose
• Who will use the information?
• How will they use it?
• What information (and in what detail) is required?

Key 2: Clear Targets
Key 3: Sound Design
Key 4: Effective Communication
Key 5: Student Involvement

What type of test is it?

Criterion-Referenced – shows how an individual performs on a given task, not how he or she compares to other test takers.

Norm-Referenced – provides an estimate of the position of a tested individual in a predefined population.

Criterion-Referenced Tests

• Ideally, ALL examinees score as high as possible.
• Not interested in maximizing test score variance.
• The index of discrimination is not useful; other measures, such as sensitivity to instruction, are used to judge item quality.

Criterion-Referenced Tests

• The criterion is the subject matter that the test is designed to assess.
• Often involves a cutscore:
  – Criterion: “add two single-digit numbers correctly to a maximum sum of 9.”
  – Cutscore: minimum of 80% to pass.

Learning Target Types

• Knowledge Targets
• Reasoning Targets
• Skill Targets
• Product Targets
• Disposition Targets

Assessment Blueprint Based on Criterion

Target    | Criterion                                                                                            | Item #’s       | Points
K R S P D | Write an addition problem in a horizontal line using the correct operators                          | 3, 4, 5        | 3
K R S P D | Add two single-digit numbers correctly to a maximum sum of 9                                         | 1, 2, 6, 9, 10 | 5
K R S P D | Select the correct operator to complete an addition or subtraction problem with a maximum sum of 9  | 7, 8           | 2

(Additional blank rows are left for further criteria. Target codes: K = Knowledge, R = Reasoning, S = Skill, P = Product, D = Disposition.)
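Where items are built and scored programmatically, the same blueprint can live in code. A minimal Python sketch, using the criteria, item numbers, and points from the table above; the target code assigned to each criterion below is a guess for illustration, not taken from the source:

```python
# A blueprint as a simple data structure. Target codes: K = Knowledge,
# R = Reasoning, S = Skill, P = Product, D = Disposition. The "target"
# value on each row is illustrative only.
blueprint = [
    {"target": "P", "criterion": "Write an addition problem in a horizontal "
     "line using the correct operators", "items": [3, 4, 5], "points": 3},
    {"target": "K", "criterion": "Add two single-digit numbers correctly to "
     "a maximum sum of 9", "items": [1, 2, 6, 9, 10], "points": 5},
    {"target": "R", "criterion": "Select the correct operator to complete an "
     "addition or subtraction problem with a maximum sum of 9",
     "items": [7, 8], "points": 2},
]

# Sanity checks: every exam item maps to exactly one criterion, and the
# point total for each criterion matches its number of items.
covered = sorted(i for row in blueprint for i in row["items"])
assert covered == list(range(1, 11)), "items 1-10 should each appear once"
assert all(row["points"] == len(row["items"]) for row in blueprint)
```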

Student Involvement

A worked self-review form: for each question the student records the criterion it assessed and marks one of Right, Wrong: Simple Mistake, or Wrong: Don’t Get It (the x’s below show a completed form).

Question | Criterion/Concept/Skill                                                                             | Mark (Right / Simple Mistake / Don’t Get It)
1        | Add two single-digit numbers correctly to a maximum sum of 9                                        | x
2        | Add two single-digit numbers correctly to a maximum sum of 9                                        | x
3        | Write an addition problem in a horizontal line using the correct operators                          | x
4        | Write an addition problem in a horizontal line using the correct operators                          | x
5        | Write an addition problem in a horizontal line using the correct operators                          | x
6        | Add two single-digit numbers correctly to a maximum sum of 9                                        | x
7        | Select the correct operator to complete an addition or subtraction problem with a maximum sum of 9  | x
8        | Select the correct operator to complete an addition or subtraction problem with a maximum sum of 9  | x
9        | Add two single-digit numbers correctly to a maximum sum of 9                                        | x
10       | Add two single-digit numbers correctly to a maximum sum of 9                                        | x

Instructional Sensitivity

“The degree to which students’ performances on a test accurately reflect the quality of instruction specifically provided to promote students’ mastery of what is being assessed.”

Popham, W. J. (2010). Instructional sensitivity. In W. J. Popham (Ed.), Everything school leaders need to know about assessment. Thousand Oaks, CA: Sage.

Norm-Referenced Tests

• Scholastic Aptitude Test (SAT) and Graduate Record Exam (GRE)
• IQ tests
• Auditions and job interviews

How does student X compare to student Y (and everybody else)?

From 30,000 Feet

Statistical Terms Used by Norm-referenced Tests

Are we getting too many statistics, given the:
• Sample size?
• Intent of the test?
• Way individual items were tracked (or not) over time?

Scantron Analysis

Back to Psychometrics …

Two measures of item effectiveness: Difficulty and Discrimination

• Difficulty (p-value) – the proportion of examinees who answer an item correctly
• Item Discrimination (iD) – a comparison of top scorers with low scorers

Item Difficulty: p-value

p = (# who got the item correct) / (# of students who answered the item)

Example: 8 got it correct; 42 students answered the item.

p = 8 / 42 = .19

Item Difficulty: p-value range

The higher the value, the easier the item.
– Above 0.90: too easy; review the question’s purpose (Warm-up? Fundamental concept?)
– Below 0.20: too difficult; review for confusing language, remove from subsequent exams, and/or identify as an area for re-instruction.
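A minimal sketch of this calculation in Python, using a made-up 0/1 score matrix (rows are students, columns are items, 1 = correct):

```python
# Item difficulty (p-value) for each item, with the review flags above.
# The score matrix is hypothetical, for illustration only.
scores = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
]

n_students = len(scores)
for item in range(len(scores[0])):
    p = sum(row[item] for row in scores) / n_students
    if p > 0.90:
        note = "too easy: review the question's purpose"
    elif p < 0.20:
        note = "too difficult: check wording / flag for re-instruction"
    else:
        note = "ok"
    print(f"item {item + 1}: p = {p:.2f} ({note})")
```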

Item Discrimination

iD = [(# upper group correct) – (# lower group correct)] / (# of students in the upper group)

Upper group = top 27% of scorers; lower group = bottom 27%.

Example: iD = (5 – 2) / 6 = .50

Item Discrimination

iD value    | Interpretation
Negative iD | Unacceptable – check for item error
0% – 24%    | Usually unacceptable
25% – 39%   | Good item
40% – 100%  | Excellent item

Adapted from University of Wisconsin Oshkosh: http://www.uwosh.edu/testing/facultyinfo/itemdiscrimone.php
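A minimal Python sketch of the iD calculation and the cutoffs above; the scored responses are hypothetical:

```python
def item_discrimination(scores, item):
    """iD = (# upper-group correct - # lower-group correct) / upper-group size,
    where the groups are the top and bottom 27% by total score."""
    ranked = sorted(scores, key=sum, reverse=True)   # highest totals first
    k = max(1, round(len(ranked) * 0.27))            # 27% group size
    upper, lower = ranked[:k], ranked[-k:]
    return (sum(r[item] for r in upper) - sum(r[item] for r in lower)) / k

def classify(iD):
    if iD < 0:
        return "unacceptable -- check for item error"
    if iD < 0.25:
        return "usually unacceptable"
    if iD < 0.40:
        return "good item"
    return "excellent item"

# Hypothetical data: 8 students x 3 items, 1 = correct.
scores = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0],
    [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0],
]
for item in range(3):
    iD = item_discrimination(scores, item)
    print(f"item {item + 1}: iD = {iD:.2f} ({classify(iD)})")
```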

One last comment on … T-values and Statistical Significance

• The score obtained when you perform a t-test to examine the statistical significance of an item.
• Statistical significance is important in large samples but difficult to achieve in cohort populations the size of most OHSU programs UNLESS items are used over time AND not modified between uses.

Item Analysis Guidelines: Rewriting Items for All Assessments

• Establish criterion references. Common error: teaching for analysis of data, the ability to discover trends, the ability to infer meaning, etc., and then constructing a test that measures recognition of facts.
• Make distracters distracting! If only one distracter in a five-option MCQ is effective, the item is a two-option item. (A simple option tally, sketched below, makes this visible.)
• No negative iDs. Items with negative item discrimination must be revised or discarded.
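One way to check the “distracting” part is an option-frequency tally. A minimal sketch; the responses and answer key are hypothetical:

```python
# Tally how often each option is chosen. A distracter that (almost) nobody
# picks is dead weight: a five-option item with one working distracter is
# effectively a two-option item.
from collections import Counter

responses = list("ABADDCADBAADDADCCADD")   # one chosen option per student
key = "D"                                  # hypothetical correct answer

counts = Counter(responses)
n = len(responses)
for option in "ABCDE":
    share = counts.get(option, 0) / n
    label = "key" if option == key else "distracter"
    flag = "  <- not distracting anyone" if label == "distracter" and share < 0.05 else ""
    print(f"{option} ({label}): {share:.0%}{flag}")
```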

Item Analysis Guidelines: For Assessments Designed to Rank Students

• Difficulty between .20 and .80, with a target of .40 to .70. Very hard or very easy items usually contribute little to the discriminating power of a test.
• Item discrimination above .25. Items should discriminate between the upper and lower groups.

The Dreaded Type K: Complex Multiple-choice

Which of the following behaviors suggests that you’re losing it?

A. You light a match to check a gas leak.
B. You pick apart your relationship with your significant other.
C. You advise your teenage son to use his own best judgment.
D. A and B
E. B and C
F. All of the above

Berk, R. (1996). A consumer’s guide to multiple choice item formats that measure complex cognitive outcomes. Pearson Publishing.

Type K Questions: Argument For / Argument Against

Type K Questions: What Research Shows

• Fewer can be answered in a given time period
• Likely more dependent on test-taking skills and reading than on subject knowledge
• Often have lower item discrimination scores

Haladyna, T. M. (1992). The effectiveness of several multiple-choice formats. Applied Measurement in Education, 5, 73-88.

Credo

Unless your testing goal is to assess students’ ability to perform well on Type K questions, avoid them.

MCQ Test Item Revision

• Identify targets and criteria for each item
• Check for:
  – focus on a single important concept
  – a stem that poses a clear question
  – distracters that are homogeneous and plausible
• Modify Type K questions
• Review distracter performance for spread among lower-performing students
• Discard or rewrite items with negative iDs

Objectives

• Consider the first three (of five) keys to quality assessment
• Differentiate criterion-referenced tests from norm-referenced tests
• Revisit and discuss the use of Type K questions
• Put item discrimination and p-values in perspective when analyzing item effectiveness
• Revise test items using guidelines for criterion-referenced and norm-referenced exams
• Consider using an assessment blueprint
• Consider student involvement in revising items and analyzing instructional goals

The mid-term, the perfect test question, and the tearful prof

In assessing Mr. Delgado, which behavior is the most reassuring sign that he has been following his treatment plan for his hypertension and diabetes?

A. He has a list of glucose readings for the past 10 days
B. He has a list of medications along with newly refilled meds.
C. He has kept a nutritional log for a 3-day period
D. He can verbalize the side effects of all his medications

Intended answer: B. He has a list of medications along with newly refilled meds.

The consultation …

Goal:
– Learn all the important content
– Learn how to think critically about the subject

Teaching activities?
– Lecture: experts conduct hour-long lectures

Feedback/Assessment: mid-term exam
Result: students could not reason through to the right answer
Discussion: should you assess what you haven’t taught?

A Kinder, Gentler Scantron Report

Reliability: Kuder-Richardson Formula 20 (KR-20)

• An internal-consistency reliability estimate for tests with dichotomously scored (right/wrong) items, computed from a single administration.
• High values indicate that the items hang together and measure the same thing consistently.
• Acceptable reliability coefficients?
  – 0.60 is an acceptable lower value.
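For dichotomous items, KR-20 = k/(k-1) × (1 − Σ pᵢqᵢ / σ²), with k items, pᵢ the proportion answering item i correctly, qᵢ = 1 − pᵢ, and σ² the variance of total scores. A minimal Python sketch, with a made-up score matrix:

```python
# KR-20 for a 0/1 score matrix (rows = students, columns = items).
def kr20(scores):
    n, k = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n   # variance of total scores
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n        # item difficulty
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var)

# Hypothetical data: 5 students x 5 items, 1 = correct.
scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]
print(f"KR-20 = {kr20(scores):.2f}")   # ~0.65 here; 0.60 is a common floor
```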

Finding Good Dogs and Bad Dogs

• Which items had the best
  – difficulty scores?
  – discrimination scores?
• Which items were good foundational questions?
• Comparing difficulty AND discrimination, which items had the best balance of the two?
• What is your overall “take” on this exam?

Item Difficulty: Trivia
When guessing is taken into account

Optimal p = (1.0 + g) / 2, where g = the chance of guessing correctly = 1 / (number of options).

• True/False: 2 options (g = .50), optimal p = .75
• 4-option MCQ: g = .25, optimal p = .63
• 5-option MCQ: g = .20, optimal p = .60
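The rule in code, as a minimal sketch:

```python
# Optimal p-value when guessing is possible: halfway between the chance
# score and 1.0.
def optimal_p(n_options):
    g = 1.0 / n_options            # chance of guessing correctly
    return (1.0 + g) / 2.0

for n in (2, 4, 5):
    print(f"{n} options: g = {1 / n:.2f}, optimal p = {optimal_p(n):.3f}")
# 2 options (True/False): 0.750; 4 options: 0.625 (~.63); 5 options: 0.600
```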

The Cognitive Domain: Bloom’s Taxonomy

Original taxonomy (1956), from highest to lowest level:
Evaluation
Synthesis
Analysis
Application
Comprehension
Knowledge

Revised taxonomy (2001), from highest to lowest level:
Creating
Evaluating
Analyzing
Applying
Understanding
Remembering

Bloom, B. S. (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc.

Anderson, L.W. (Ed.), Krathwohl, D.R. (Ed.), Airasian, P.W., Cruikshank, K.A., Mayer, R.E., Pintrich, P.R., Raths, J., & Wittrock, M.C. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s Taxonomy of Educational Objectives (Complete edition). New York: Longman.

Verb use to guide question depth

Taxonomy levels run from HOTS (higher-order thinking skills) at the top to LOTS (lower-order thinking skills) at the bottom; each level lists verbs that trigger thinking at that depth.

Creating: can the student create a new product or point of view?
  assemble, construct, create, design, develop, formulate, write

Evaluating: can the student justify a stand or decision?
  appraise, argue, defend, judge, select, support, value, evaluate

Analyzing: can the student distinguish between the different parts?
  appraise, compare, contrast, criticize, differentiate, discriminate, distinguish, examine, experiment, question, test

Applying: can the student use the information in a new way?
  choose, demonstrate, dramatize, employ, illustrate, interpret, operate, schedule, sketch, solve, use, write

Understanding: can the student explain ideas or concepts?
  classify, describe, discuss, explain, identify, locate, recognize, report, select, translate, paraphrase

Remembering: can the student recall or remember the information?
  define, duplicate, list, memorize, recall, repeat, reproduce, state

What’s the Bloomin’ Level?