de champlain agm_sunday_2012

The Top 10 Myths on Standard Setting

André De Champlain, PhD

Consulting Chief Research Psychometrician & Interim Director of R&D, MCC

The Need to Make Decisions

• The need to make classifications permeates many aspects of daily life

• Classifications required by law

• E.g.: Passing an examination to obtain a driver’s license requires meeting a certain level of proficiency with regard to knowledge of traffic laws and performance (passing, parallel parking, etc.)

• Keeps unsafe motorists from behind the wheel!

The Need to Make Decisions

• The need to make classifications permeates many aspects of daily life

• Classifications required by law

• E.g.: Jury rendering an (impartial) verdict in a criminal trial classifies a defendant as “guilty” or “not guilty” after weighing the evidence of a case, i.e., analyzing the facts

• Sentence meted out for incapacitation (“protection of the public”), deterrence, denunciation, rehabilitation, etc.


• The need to make classifications permeates many aspects of professional life

• Classifications required within a profession

• Medical licensing/registration examination programs

• LMCC®, USMLE®, PLAB®, AMC®, etc.

• Medical specialty board examination programs

• ABMS®

• Medical membership organizations

• RCP(UK), RCPSC, CFPC, RACP, RACGP, etc.

The Need to Make Decisions

| | Jury | Standard Setting Panel |
| --- | --- | --- |
| Composition | Impartial panel that represents the population | Impartial panel that represents a profession |
| Size | Randomly selected group of citizens (to satisfy social decision rules) | Randomly selected panel that is sufficiently large and representative to define standard |
| Task | Renders a verdict | “Renders” a decision |
| Purpose | Incapacitation, rehabilitation | Protection of the public, remediation |


• Number (“cut-score”) can be used to differentiate between several “states or degrees of performance”

• Pass / Fail

• Grant / Withhold a Credential

• Award / Deny a license

• Grant / Deny membership

• Basic / Proficient / Advanced (Honors)

The Need to Make Decisions

• To make “sensible” decisions, information is needed

• Decision makers need relevant and accurate information

• Standard setting is the process by which these informed decisions are arrived at

• Standard setting can be defined as the proper following of a prescribed, rational system of rules or procedures resulting in the assignment of a number to differentiate between two or more states or degrees of performance (Cizek, 1993)

• Addresses “procedural due process” (legal framework)


• “Procedural due process”

• Was the standard setting exercise well documented?

• Description of the standard setting exercise

• Selection of judges

• Overview of training

• Definition of the borderline candidate

• Judges’ assessment of each phase of the exercise and overall cut-score

• What did the judges think of the exercise?


• Consequential impact of standard setting?

• What are the outcomes of the process?

• Substantive aspect of standard setting

• Did the process lead to a “fair” decision?

• Consequential aspect of test score use (Messick, 1989)

• What are the intended and unintended consequences of implementing a standard?

• Several sources of (empirical) evidence can be presented to support the fairness and appropriateness of the decision


• How popular is standard setting?

• MEDLINE/PUBMED database

• Nearly 600 articles published in this topic area

• ERIC (Educational Resources Information Center)

• Over 750 articles published in this domain

• Despite the immense popularity of standard setting, basic misconceptions still persist

• What are common “myths” surrounding standard setting?

Myth #1: Standard = Cut-score

• Performance standard (Kane, 2001)

• Qualitative description of an acceptable level of performance and knowledge required in practice

• “Conceptual” definition of competence

• Performance standard is a construct

• Example – MCCQE Part I

• The candidate who passes the Medical Council of Canada’s Qualifying Examination Part I (MCCQE Part I) has demonstrated knowledge, clinical skills, and attitudes necessary for entry into supervised clinical practice, as outlined by the Medical Council of Canada’s Objectives

Myth #1: Standard = Cut-score

• Passing score or cut-score (Kane, 2001)

• Selected point on the score scale that corresponds to this performance standard

• “Operational” definition of competence

• Cut-score is a number

• Example – MCCQE Part I

• A candidate who scores at or above 390 has met the performance standard defined for the Medical Council of Canada’s Qualifying Examination Part I (MCCQE Part I)
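
The operational side of this distinction can be illustrated with a minimal Python sketch. The performance standard remains a verbal description, while the cut-score is simply a number compared against a candidate's score. Only the value 390 and the "at or above" rule come from the slide; the function name `classify` and the sample scores are illustrative assumptions.

```python
# Cut-score quoted on the slide for the MCCQE Part I.
CUT_SCORE = 390


def classify(score: float, cut_score: float = CUT_SCORE) -> str:
    """Return 'pass' when the score is at or above the cut-score, else 'fail'."""
    return "pass" if score >= cut_score else "fail"


# Hypothetical candidate scores, for illustration only.
for s in (389, 390, 450):
    print(s, classify(s))
```

Note that the comparison is inclusive: a score exactly equal to the cut-score meets the standard.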

Myth #2: There is a “Gold Standard”

• Standard setting entails eliciting judgments on what cut-score best represents “competence”

• All cut-scores are intrinsically subjective in nature

• Cut-scores can, and will, vary as a function of several factors including, but not limited to, the method selected to set the standard and the panel of participating judges

Myth #2: There is a “Gold Standard”

• Cut-scores do not exist externally

• The aim of standard setting is not to “discover” some true or preexisting cut-score that separates candidates into mutually exclusive categories (e.g.: competent vs. incompetent)

• Standard setting is a process that synthesizes human judgment in a rational and defensible way to facilitate the partitioning of a score scale into 2 or more categories

Myth #2: There is a “Gold Standard”

• Cut-scores do not exist externally

• Standards do not externally exist, i.e., outside of the realm of human opinion

• “a right answer [in standard setting] does not exist, except, perhaps, in the minds of those providing judgment” (Jaeger, 1989)

• Empirical evidence can help standard setting panels translate (policy-based) judgment onto a score scale in a defensible manner

Myth #3: Standard Setting is a Psychometric Exercise

• Standard setting lies at the “intersection” of science and art

• While we can facilitate the standard setting process using psychometric models, a cut-score is ultimately based on human judgment

• Our statistical models can help us to systematize the process, i.e., to translate a policy decision into a cut-score using defensible, well-defined procedures; however, they cannot be used to estimate some “true” cut-score that separates masters from non-masters

Myth #3: Standard Setting is a Psychometric Exercise

• Standards for Educational and Psychological Testing (1999; p.54)

• “Cut-scores embody value judgments as well as technical and empirical considerations”

• Given that human judgment and opinion play such a significant role in this process, a cut-score can be regarded as a composite that incorporates considerations that originate from a number of arenas including medical, statistical, educational, social, political and economic

Myth #4: The Cut-score is Set by a Standard Setting Panel

• A standard setting panel does not “set a cut-score” but rather recommends a cut-score value or standard

• The actual cut-score is set by the governing body that legitimizes the process and the use of the cut-score to make pass/fail decisions

• E.g.: Legislative body, academy, certification specialty board, a college, etc.

Myth #4: The Cut-score is Set by a Standard Setting Panel

• The role of the standard setting panel is to provide guidance & information to those bodies that actually are responsible for implementing a given cut-score value or standard

• The goal of periodic standard setting exercises is to revisit the appropriateness of a cut-score (not necessarily change it) based on replicated exercises and informed expert judgment

Myth #5: Some Standard Setting Methods Are Better Than Others

• Standards for Educational and Psychological Testing (1999; p.53)

• “There can be no single method for determining cut-scores for all tests or for all purposes, nor can there be any single set of procedures for establishing their defensibility.”

• Angoff (1988; p.219)

• “[Regarding] the problem of setting cut-scores, we have observed that the several judgmental methods not only fail to yield results that agree with one another, they even fail to yield the same results on repeated application.”

Myth #5: Some Standard Setting Methods Are Better Than Others

• No standard setting method yields an “optimal” cut-score (standards don’t exist outside of the minds of judges)

• Extent to which a standard setting process is properly followed has the most impact on the cut-score

• Was the purpose of the exam and the standard setting exercise clearly defined?

• Were the judges qualified to perform the task?

• Was adequate training offered to panelists? Etc.

Myth #5: Some Standard Setting Methods Are Better Than Others

• Factors to consider when selecting a standard setting method

• A. What is the purpose of the examination?

• With professional exams, norm-referenced approaches are appropriate in instances where a limited number of candidates can meet the cut-score

• Placement, promotion, awards, etc.

• In most instances, criterion-referenced approaches are more suitable

• Medical licensure/certification decisions, passing a clerkship/internship, etc.


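Myth #5: Some Standard Setting Methods Are Better Than Others

• Factors to consider when selecting a standard setting method

• B. How complex is the examination?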

• For knowledge-based exams (e.g.: dichotomously-scored MCQs), test-centered methods (Angoff, Ebel, Bookmark, etc.) are appropriate given the task required to complete

• For performance assessments (OSCEs, workplace-based assessments, etc.), examinee-centered approaches (borderline groups, contrasting-groups, body of work methods) are better suited given the complex, multidimensional nature of the performance


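Myth #5: Some Standard Setting Methods Are Better Than Others

• Factors to consider when selecting a standard setting method

• C. What is the test format?

• Certain standard setting methods were developed solely for use with MCQs (e.g.: Nedelsky).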

• While other methods can be used with different formats (e.g. Angoff methods), certain assumptions are made that may or may not meet expectations (Angoff assumes a compensatory model)

• Other methods (Hofstee, contrasting-groups) were developed as test format invariant


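Myth #5: Some Standard Setting Methods Are Better Than Others

• Factors to consider when selecting a standard setting method

• D. What resources are available?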

• In very high-stakes settings (e.g. medical licensing exam), a complex standard setting exercise which includes several panels of judges, extensive training, multiple rounds of judgments, etc., might be preferable

• In lower-stakes settings (elective clerkship examination), less intensive models might be appropriate

• What makes the most sense given the intended use of the information?

Myth #5: Some Standard Setting Methods Are Better Than Others

• Why not combine several standard setting procedures?

• Standard setting and the selection of a cut-score are a policy decision

• There’s little empirical evidence to suggest that combining multiple methods will lead to a “better” standard

• There is no “correct” cut-score, so how can policy makers synthesize results from multiple approaches?

• Also requires significantly more resources

Myth #5: Some Standard Setting Methods Are Better Than Others

• Always better to systematically implement 1 standard setting method rather than provide results from several (poorly) implemented approaches

• Properly document all phases of standard setting

• Objective, selection of participants, training, etc.

• Provide empirical evidence to support use of cut-score

• Impact of sources of variability (judges, panels, etc.)

• Consequences of implementing a cut-score

• Surveys, etc.

Myth #6: Expert Clinicians are de facto Expert Standard Setting Judges

• Selection and training of judges are most critical to the success of any standard setting exercise

• However, being a content expert is not synonymous with being an expert standard setting judge

• Participating standard setting judges need to be carefully trained to ensure that they understand the task and to minimize biases

Myth #6: Expert Clinicians are de facto Expert Standard Setting Judges

• Standards for Educational and Psychological Testing (1999; p.54)

• “Care must be taken to assure that judges understand what they are to do. The process must be such that well-qualified judges can apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and intentions.”

Myth #6: Expert Clinicians are de facto Standard Setting Judges

• Training usually includes the following steps:

1. Provision of sample materials (test specifications, blueprint, sample items/stations, etc.)

2. Clear presentation of the purpose of standard setting and what we are asking of participants

3. Discussion and definition of what constitutes a borderline candidate

4. Judgments on a set of exemplars

5. Discussion and clarification of any misconceptions amongst participants

6. Survey participants on all aspects of training

Myth #7: We “Know” Who the Truly Competent Candidates Are

• Classification errors are always present in standard setting

• High-quality exams and well implemented standard setting exercises can significantly minimize the proportion of misclassifications

• False positive misclassification

• Candidate who “truly” lacks the knowledge, skill and/or ability necessary to pass the examination, but actually passes

• False negative misclassification

• Candidate who “truly” possesses the knowledge, skill and/or ability necessary to pass the examination, but actually fails

Myth #7: We “Know” Who the Truly Competent Candidates Are

• Why do classification errors occur?

• Cut-scores represent inferences about the “real” or “true” level of knowledge, skill possessed by candidates

• The quality of those inferences is related to a number of factors:

• The number of items/cases sampled for the standard setting exercise

• The number of judges selected and their degree of representativeness, etc.

• Consequently, pass/fail classifications of candidates will always be somewhat imperfect

Myth #7: We “Know” Who the Truly Competent Candidates Are

• We can’t actually identify false positive and negative misclassifications

• If we knew a candidate was a false negative, we’d do something about it!

• We can estimate misclassification errors using a host of statistical indices (Brennan, 2004)

• In medicine, protection of the public is a prime concern of examinations

• Minimizing false positive misclassifications is generally of greater interest
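
To show what "estimating" misclassification can look like, here is a rough Monte Carlo sketch in Python. It is not one of the indices cited above; it simply assumes a classical model in which an observed score equals a true score plus normally distributed measurement error. The cut-score, the true-score distribution, and the standard error of measurement (`sem`) are all hypothetical values chosen for illustration.

```python
import random


def misclassification_rates(cut, true_mean, true_sd, sem, n=50_000, seed=0):
    """Simulate observed = true + error and count false positives/negatives.

    Normal true scores and normal errors are assumptions of this sketch.
    """
    rng = random.Random(seed)
    false_pos = false_neg = 0
    for _ in range(n):
        true_score = rng.gauss(true_mean, true_sd)
        observed = true_score + rng.gauss(0.0, sem)
        if true_score < cut <= observed:
            false_pos += 1  # truly below the standard, but passes
        elif observed < cut <= true_score:
            false_neg += 1  # truly meets the standard, but fails
    return false_pos / n, false_neg / n


# Hypothetical exam: cut-score 65, true scores ~ N(72, 8), SEM of 4 points.
fp, fn = misclassification_rates(cut=65, true_mean=72, true_sd=8, sem=4)
print(f"false positive rate: {fp:.3f}, false negative rate: {fn:.3f}")
```

In a simulation the true scores are known, which is exactly what is missing in practice; the sketch only shows how both error rates depend on measurement error and on where the cut-score sits relative to the candidate population.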

Myth #8: All Decisions Are Created Equally

• For fairness reasons, failing candidates are generally allowed to retake an examination (sometimes repeatedly)

• Millman (1989) showed that the greater the number of (repeat) attempts to pass an exam, the greater the likelihood that a candidate who does not possess the level of knowledge or skill needed to pass, will indeed pass (false positive)

Myth #8: All Decisions Are Created Equally

• This phenomenon can be attributed to a number of reasons including:

• Possible re-exposure of material (security issue)

• (Compounded) measurement errors associated with each test score

• The more times a candidate repeats, the more likely their score will be sufficiently high (overestimated) to result in a false positive decision

• This could significantly impact safe and effective patient care given the link between medical licensing exam scores and future egregious acts in practice (Tamblyn et al. research)

Myth #8: All Decisions Are Created Equally

• How serious is the effect of repeat attempts on false positive rates?

• Millman example (1989)

• Let’s assume that a cut-score is 70% on an exam

• A candidate with a true ability of 65% (should fail) has a greater than 50/50 chance of passing the exam due to measurement error after 5 attempts (with MCQ exam, i.e., high reliability)
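
The arithmetic behind this kind of result can be sketched in Python. This is not Millman's own calculation: it assumes normally distributed measurement error, independent attempts, and a standard error of measurement (`sem`) of 5 percentage points, none of which are stated on the slide. Only the 70% cut-score and the 65% true ability are taken from the example.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pass_probability(true_score, cut_score, sem):
    """Chance that a single observed score lands at or above the cut-score."""
    return 1.0 - normal_cdf((cut_score - true_score) / sem)


def pass_within(attempts, p_single):
    """Chance of at least one pass in `attempts` independent tries."""
    return 1.0 - (1.0 - p_single) ** attempts


# 70% cut-score and 65% true ability are from the slide; SEM = 5 is assumed.
p1 = pass_probability(true_score=65, cut_score=70, sem=5)
for n in (1, 2, 3, 5):
    print(n, round(pass_within(n, p1), 3))
```

Under these assumptions a single attempt succeeds about 16% of the time, yet the chance of at least one pass in five attempts exceeds one half, which is consistent with the claim on the slide. A smaller `sem` (a more reliable exam) lowers both figures.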

Myth #8: All Decisions Are Created Equally

• How might we control for this effect?

• Increase the size of the item/station bank to reduce the likelihood that previously seen material will appear on repeat test attempts

• Incorporate item/station exposure as a constraint when assembling test forms

• Adjust the cut-score to minimize misclassifications

• A panel sets the standard at 65%

• We can adjust the cut-score so that a candidate with a true ability level of 65% (true master) has a near zero probability of being misclassified

Myth #9: A Cut-score/Standard Does Not Need to Be Evaluated

• A cut-score reflects the (informed) judgments of a small sample of experts, based on a sample of items/stations, at a specific point in time, using one or only a few methods

• Cut-scores can and will vary as a function of these factors, which need to be evaluated

• Evidence to support both the “internal” and “external” validity of your cut-score should be collected and presented to support its intended use

Myth #9: A Cut-score/Standard Does Not Need to Be Evaluated

• Evaluating your standard

• Internal validation

• How reproducible is the cut-score across facets?

• Judges (inter-rater consistency)?

• Sample of stations?

• Panels of judges? Etc.

• Generalizability analysis and rater models (IRT) are useful to help us assess how variable the cut-score is across these facets
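
A full generalizability analysis is beyond a short example, but a first look at variability across one facet (judges) can be sketched in Python as below. The judge-level cut-scores are hypothetical, and the standard error shown treats judges as a simple random sample, which is an assumption of this sketch rather than a substitute for a G-study or an IRT rater model.

```python
import statistics

# Hypothetical cut-scores (percent correct) recommended by each judge.
judge_cut_scores = [62.0, 66.5, 64.0, 68.0, 63.5, 65.0]

panel_cut = statistics.mean(judge_cut_scores)
judge_sd = statistics.stdev(judge_cut_scores)        # spread across judges
std_error = judge_sd / len(judge_cut_scores) ** 0.5  # uncertainty of the panel mean

print(f"panel cut-score: {panel_cut:.2f}")
print(f"SD across judges: {judge_sd:.2f}")
print(f"standard error of the panel cut-score: {std_error:.2f}")
```

A large standard error relative to the score scale suggests the recommended cut-score would shift noticeably with a different set of judges.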

Myth #9: A Cut-score/Standard Does Not Need to Be Evaluated

• Evaluating your standard

• External validation

• How do the decisions relate to other measures?

• If scores on two exams are highly related, but decision consistency is low, perhaps the cut-score on one assessment is not appropriate?

• Impact

• How comparable are P/F rates to historical trends?

• Does the cut-score lead to “acceptable” results?
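
One simple way to quantify decision consistency between two exams is the proportion of candidates who receive the same pass/fail decision on both, optionally corrected for chance agreement with Cohen's kappa. The Python sketch below uses hypothetical decisions for ten candidates; the function name `decision_agreement` is illustrative, and the slides do not prescribe a particular index.

```python
def decision_agreement(decisions_a, decisions_b):
    """Return (proportion agreement, Cohen's kappa) for paired pass/fail decisions."""
    n = len(decisions_a)
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    pass_a = sum(decisions_a) / n
    pass_b = sum(decisions_b) / n
    # Agreement expected by chance if the two sets of decisions were independent.
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical decisions for ten candidates (True = pass).
exam_1 = [True, True, True, False, True, True, False, True, True, True]
exam_2 = [True, False, True, False, True, False, False, True, False, True]
agreement, kappa = decision_agreement(exam_1, exam_2)
print(f"agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Here every disagreement is a candidate who passes the first exam and fails the second, a pattern that would prompt a closer look at one of the two cut-scores. Kappa is undefined when chance agreement equals 1 (for example, when everyone passes both exams), so real use would need to guard against that case.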

Myth #10: The Angoff Method Was Developed by Angoff

• Angoff did not formally develop the (Angoff) standard setting method

• Origin can be traced back to a footnote in a chapter on scales, norms and equivalent scores that Angoff wrote in 1971

• Angoff ascribed the procedure to Tucker

• Method was a “systematic procedure for deciding on the minimum raw scores for passing and honors”

Myth #10: The Angoff Method Was Developed by Angoff

“a slight variation of this procedure is to ask each judge to state the probability that the ‘minimally acceptable person’ would answer each item correctly. In effect, the judges would think of a number of minimally acceptable persons, instead of only one such person, and would estimate the proportion of minimally acceptable persons who would answer each item correctly. The sum of the probabilities, or proportions, would then represent the minimally acceptable score” (p. 515).
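
The computation described in the footnote can be sketched in a few lines of Python. The quoted passage only specifies summing each judge's item probabilities; averaging those sums across judges to obtain a single panel value is a common convention that is assumed here, and the ratings themselves are hypothetical.

```python
def angoff_cut_score(ratings):
    """Compute Angoff-style cut-scores.

    ratings[j][i] is judge j's probability that a minimally acceptable
    person answers item i correctly. Returns (per-judge sums, panel mean).
    """
    judge_scores = [sum(judge) for judge in ratings]
    return judge_scores, sum(judge_scores) / len(judge_scores)


# Hypothetical ratings: 3 judges x 5 items.
ratings = [
    [0.6, 0.7, 0.5, 0.8, 0.9],
    [0.5, 0.6, 0.6, 0.7, 0.8],
    [0.7, 0.8, 0.4, 0.9, 0.9],
]
per_judge, cut = angoff_cut_score(ratings)
print("per-judge minimally acceptable scores:", [round(s, 2) for s in per_judge])
print("panel cut-score (raw score out of 5):", round(cut, 2))
```

The result is on the raw-score scale, so it would still need to be mapped to the reporting scale and reviewed by the governing body before being used as an operational cut-score.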
