de champlain agm_sunday_2012
The Top 10 Myths on Standard Setting
André De Champlain, PhD
Consulting Chief Research Psychometrician & Interim Director of R&D, MCC
The Need to Make Decisions
• The need to make classifications permeates many aspects of daily life
• Classifications required by law
• E.g.: Passing an examination to obtain a driver’s license requires meeting a certain level of proficiency with regard to knowledge of traffic laws and performance (passing, parallel parking, etc.)
• Keeps unsafe motorists from getting behind the wheel!
The Need to Make Decisions
• The need to make classifications permeates many aspects of daily life
• Classifications required by law
• E.g.: A jury rendering an (impartial) verdict in a criminal trial classifies a defendant as “guilty” or “not guilty” after weighing the evidence of a case, i.e., analyzing the facts
• Sentence meted out for incapacitation (“protection of the public”), deterrence, denunciation, rehabilitation, etc.
The Need to Make Decisions
• The need to make classifications permeates many aspects of professional life
• Classifications required within a profession
• Medical licensing/registration examination programs
• LMCC®, USMLE®, PLAB®, AMC®, etc.
• Medical specialty board examination programs
• ABMS®
• Medical membership organizations
• RCP(UK), RCPSC, CFPC, RACP, RACGP, etc.
The Need to Make Decisions
Jury vs. Standard Setting Panel
• Composition
• Jury: Impartial panel that represents the population
• Standard setting panel: Impartial panel that represents a profession
• Size
• Jury: Randomly selected group of citizens (to satisfy social decision rules)
• Standard setting panel: Randomly selected panel that is sufficiently large and representative to define standard
• Task
• Jury: Renders a verdict
• Standard setting panel: “Renders” a decision
• Purpose
• Jury: Incapacitation, rehabilitation
• Standard setting panel: Protection of the public, remediation
The Need to Make Decisions
• A number (“cut-score”) can be used to differentiate between several “states or degrees of performance”
• Pass / Fail
• Grant / Withhold a Credential
• Award / Deny a license
• Grant / Deny membership
• Basic / Proficient / Advanced (Honors)
The Need to Make Decisions
• To make “sensible” decisions, information is needed
• Decision makers need relevant and accurate information
• Standard setting is the process by which these informed decisions are arrived at
• Standard setting can be defined as the proper following of a prescribed, rational system of rules or procedures resulting in the assignment of a number to differentiate between two or more states or degrees of performance (Cizek, 1993)
• Addresses “procedural due process” (legal framework)
The Need to Make Decisions
• “Procedural due process”
• Was the standard setting exercise well documented?
• Description of the standard setting exercise
• Selection of judges
• Overview of training
• Definition of the borderline candidate
• Judges’ assessment of each phase of the exercise and overall cut-score
• What did the judges think of the exercise?
The Need to Make Decisions
• Consequential impact of standard setting?
• What are the outcomes of the process?
• Substantive aspect of standard setting
• Did the process lead to a “fair” decision?
• Consequential aspect of test score use (Messick, 1989)
• What are the intended and unintended consequences of implementing a standard?
• Several sources of (empirical) evidence can be presented to support the fairness and appropriateness of the decision
The Need to Make Decisions
• How popular is standard setting?
• MEDLINE/PUBMED database
• Nearly 600 articles published in this topic area
• ERIC (Educational Resources Information Center)
• Over 750 articles published in this domain
• Despite the immense popularity of standard setting, basic misconceptions still persist
• What are common “myths” surrounding standard setting?
Myth #1: Standard = Cut-score
• Performance standard (Kane, 2001)
• Qualitative description of an acceptable level of performance and knowledge required in practice
• “Conceptual” definition of competence
• Performance standard is a construct
• Example – MCCQE Part I
• The candidate who passes the Medical Council of Canada’s Qualifying Examination Part I (MCCQE Part I) has demonstrated knowledge, clinical skills, and attitudes necessary for entry into supervised clinical practice, as outlined by the Medical Council of Canada’s Objectives
Myth #1: Standard = Cut-score
• Passing score or cut-score (Kane, 2001)
• Selected point on the score scale that corresponds to this performance standard
• “Operational” definition of competence
• Cut-score is a number
• Example – MCCQE Part I
• A candidate who scores at or above 390 has met the performance standard defined for the Medical Council of Canada’s Qualifying Examination Part I (MCCQE Part I)
Myth #2: There is a “Gold Standard”
• Standard setting entails eliciting judgments on what cut-score best represents “competence”
• All cut-scores are intrinsically subjective in nature
• Cut-scores can, and will, vary as a function of several factors including, but not limited to, the method selected to set the standard and the panel of participating judges
Myth #2: There is a “Gold Standard”
• Cut-scores do not exist externally
• The aim of standard setting is not to “discover” some true or preexisting cut-score that separates candidates into mutually exclusive categories (e.g.: competent vs. incompetent)
• Standard setting is a process that synthesizes human judgment in a rational and defensible way to facilitate the partitioning of a score scale into 2 or more categories
Myth #2: There is a “Gold Standard”
• Cut-scores do not exist externally
• Standards do not exist externally, i.e., outside of the realm of human opinion
• “a right answer [in standard setting] does not exist, except, perhaps, in the minds of those providing judgment” (Jaeger, 1989)
• Empirical evidence can help standard setting panels translate (policy-based) judgment onto a score scale in a defensible manner
Myth #3: Standard Setting is a Psychometric Exercise
• Standard setting lies at the “intersection” of science and art
• While we can facilitate the standard setting process using psychometric models, a cut-score is ultimately based on human judgment
• Our statistical models can help us to systematize the process, i.e., to translate a policy decision into a cut-score using defensible, well-defined procedures; however, they cannot be used to estimate some “true” cut-score that separates masters from non-masters
Myth #3: Standard Setting is a Psychometric Exercise
• Standards for Educational and Psychological Testing (1999; p. 54)
• “Cut-scores embody value judgments as well as technical and empirical considerations”
• Given that human judgment and opinion play such a significant role in this process, a cut-score can be regarded as a composite that incorporates considerations that originate from a number of arenas including medical, statistical, educational, social, political and economic
Myth #4: The Cut-score is Set by a Standard Setting Panel
• A standard setting panel does not “set a cut-score” but rather recommends a cut-score value or standard
• The actual cut-score is set by the governing body that legitimizes the process and the use of the cut-score to make pass/fail decisions
• E.g.: Legislative body, academy, certification specialty board, a college, etc.
Myth #4: The Cut-score is Set by a Standard Setting Panel
• The role of the standard setting panel is to provide guidance & information to those bodies that actually are responsible for implementing a given cut-score value or standard
• The goal of periodic standard setting exercises is to revisit the appropriateness of a cut-score (not necessarily change it) based on replicated exercises and informed expert judgment
Myth #5: Some Standard Setting Methods Are Better Than Others
• Standards for Educational and Psychological Testing (1999; p. 53)
• “There can be no single method for determining cut-scores for all tests or for all purposes, nor can there be any single set of procedures for establishing their defensibility.”
• Angoff (1988; p. 219)
• “[Regarding] the problem of setting cut-scores, we have observed that the several judgmental methods not only fail to yield results that agree with one another, they even fail to yield the same results on repeated application.”
Myth #5: Some Standard Setting Methods Are Better Than Others
• No standard setting method yields an “optimal” cut-score (standards don’t exist outside of the minds of judges)
• The extent to which a standard setting process is properly followed has the most impact on the cut-score
• Was the purpose of the exam and the standard setting exercise clearly defined?
• Were the judges qualified to perform the task?
• Was adequate training offered to panelists? Etc.
Myth #5: Some Standard Setting Methods Are Better Than Others
• Factors to consider when selecting a standard setting method
• A. What is the purpose of the examination?
• With professional exams, norm-referenced approaches are appropriate in instances where a limited number of candidates can meet the cut-score
• Placement, promotion, awards, etc.
• In most instances, criterion-referenced approaches are more suitable
• Medical licensure/certification decisions, passing a clerkship/internship, etc.
Myth #5: Some Standard Setting Methods Are Better Than Others
• Factors to consider when selecting a standard setting method
• B. How complex is the examination?
• For knowledge-based exams (e.g.: dichotomously-scored MCQs), test-centered methods (Angoff, Ebel, Bookmark, etc.) are appropriate given the task required to complete
• For performance assessments (OSCEs, workplace-based assessments, etc.), examinee-centered approaches (borderline groups, contrasting-groups, body of work methods) are better suited given the complex, multidimensional nature of the performance
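As an illustration of an examinee-centered approach, the borderline group method can be sketched as follows. This is a minimal sketch, not the procedure used by any particular examination program: it assumes each candidate has a station score and an examiner global rating, and it takes the median score of the candidates rated "borderline" as the recommended cut-score. The function name, the rating labels, and the data are hypothetical.

```python
from statistics import median


def borderline_group_cut_score(scores, ratings, borderline_label="borderline"):
    """Return the median score of the candidates rated as borderline.

    scores  -- one score per candidate (hypothetical station scores, in %)
    ratings -- one examiner global rating per candidate, same order as scores
    """
    borderline_scores = [s for s, r in zip(scores, ratings) if r == borderline_label]
    if not borderline_scores:
        raise ValueError("no candidates were rated borderline")
    return median(borderline_scores)


# Hypothetical data for a single OSCE station.
scores = [48, 55, 61, 63, 66, 72, 80, 85]
ratings = ["fail", "fail", "borderline", "borderline", "borderline",
           "pass", "pass", "pass"]

cut = borderline_group_cut_score(scores, ratings)
print(f"Recommended cut-score: {cut}")
```

Using the median rather than the mean is a design choice made here to limit the influence of a few extreme borderline scores; either summary is a judgment call for the program.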
Myth #5: Some Standard Setting Methods Are Better Than Others
• Factors to consider when selecting a standard setting method
• C. What is the test format?
• Certain standard setting methods were developed solely for use with MCQs (e.g.: Nedelsky).
• While other methods can be used with different formats (e.g. Angoff methods), certain assumptions are made that may or may not meet expectations (Angoff assumes a compensatory model)
• Other methods (Hofstee, contrasting-groups) were developed to be invariant to test format
Myth #5: Some Standard Setting Methods Are Better Than Others
• Factors to consider when selecting a standard setting method
• D. What resources are available?
• In very high-stakes settings (e.g. medical licensing exam), a complex standard setting exercise which includes several panels of judges, extensive training, multiple rounds of judgments, etc., might be preferable
• In lower-stakes settings (elective clerkship examination), less intensive models might be appropriate
• What makes the most sense given the intended use of the information?
Myth #5: Some Standard Setting Methods Are Better Than Others
• Why not combine several standard setting procedures?
• Standard setting and the selection of a cut-score are a policy decision
• There’s little empirical evidence to suggest that combining multiple methods will lead to a “better” standard
• There is no “correct” cut-score, so how can policy makers synthesize results from multiple approaches?
• Also requires significantly more resources
Myth #5: Some Standard Setting Methods Are Better Than Others
• Always better to systematically implement 1 standard setting method rather than provide results from several (poorly) implemented approaches
• Properly document all phases of standard setting
• Objective, selection of participants, training, etc.
• Provide empirical evidence to support use of cut-score
• Impact of sources of variability (judges, panels, etc.)
• Consequences of implementing a cut-score
• Surveys, etc.
Myth #6: Expert Clinicians are de facto Expert Standard Setting Judges
• Selection and training of judges are most critical to the success of any standard setting exercise
• However, being a content expert is not synonymous with being an expert standard setting judge
• Participating standard setting judges need to be carefully trained to ensure that they understand the task and to minimize biases
Myth #6: Expert Clinicians are de facto Expert Standard Setting Judges
• Standards for Educational and Psychological Testing (1999; p. 54)
• “Care must be taken to assure that judges understand what they are to do. The process must be such that well-qualified judges can apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and intentions.”
Myth #6: Expert Clinicians are de facto Expert Standard Setting Judges
• Training usually includes the following steps:
1. Provision of sample materials (test specifications, blueprint, sample items/stations, etc.)
2. Clear presentation of the purpose of standard setting and what we are asking of participants
3. Discussion and definition of what constitutes a borderline candidate
4. Judgments on a set of exemplars
5. Discussion and clarification of any misconceptions amongst participants
6. Survey participants on all aspects of training
Myth #7: We “Know” Who the Truly Competent Candidates Are
• Classification errors are always present in standard setting
• High-quality exams and well implemented standard setting exercises can significantly minimize the proportion of misclassifications
• False positive misclassification
• Candidate who “truly” lacks the knowledge, skill and/or ability necessary to pass the examination, but actually passes
• False negative misclassification
• Candidate who “truly” possesses the knowledge, skill and/or ability necessary to pass the examination, but actually fails
Myth #7: We “Know” Who the Truly Competent Candidates Are
• Why do classification errors occur?
• Cut-scores represent inferences about the “real” or “true” level of knowledge and skill possessed by candidates
• The quality of those inferences is related to a number of factors:
• The number of items/cases sampled for the standard setting exercise
• The number of judges selected and their degree of representativeness, etc.
• Consequently, pass/fail classifications of candidates will always be somewhat imperfect
Myth #7: We “Know” Who the Truly Competent Candidates Are
• We can’t actually identify false positive and false negative misclassifications
• If we knew a candidate was a false negative, we’d do something about it!
• We can estimate misclassification errors using a host of statistical indices (Brennan, 2004)
• In medicine, protection of the public is a prime concern of examinations
• Minimizing false positive misclassifications is generally of greater interest
Myth #8: All Decisions Are Created Equally
• For fairness reasons, failing candidates are generally allowed to retake an examination (sometimes repeatedly)
• Millman (1989) showed that the greater the number of (repeat) attempts to pass an exam, the greater the likelihood that a candidate who does not possess the level of knowledge or skill needed to pass will indeed pass (false positive)
Myth #8: All Decisions Are Created Equally
• This phenomenon can be attributed to a number of reasons including:
• Possible re-exposure of material (security issue)
• (Compounded) measurement errors associated with each test score
• The more times a candidate repeats, the more likely their score will be sufficiently high (overestimated) to result in a false positive decision
• This could significantly impact safe and effective patient care given the link between medical licensing exam scores and future egregious acts in practice (Tamblyn et al. research)
Myth #8: All Decisions Are Created Equally
• How serious is the effect of repeat attempts on false positive rates?
• Millman example (1989)
• Let’s assume that a cut-score is 70% on an exam
• A candidate with a true ability of 65% (should fail) has a greater than 50/50 chance of passing the exam due to measurement error after 5 attempts (with MCQ exam, i.e., high reliability)
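The compounding effect of repeat attempts can be illustrated with a simple calculation. The sketch below is not Millman's model; it assumes a hypothetical 100-item exam, treats each item as an independent trial answered correctly with probability equal to the candidate's true ability, and treats repeat attempts as independent. Under those assumptions, the chance of passing at least once in k attempts is 1 - (1 - p)^k, where p is the chance of passing on a single attempt.

```python
from math import comb


def prob_pass_single_attempt(true_ability, n_items, min_correct):
    """P(number correct >= min_correct) under a binomial model (an assumption)."""
    return sum(
        comb(n_items, k) * true_ability**k * (1 - true_ability) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


def prob_pass_within(attempts, p_single):
    """P(at least one pass in `attempts` independent attempts)."""
    return 1 - (1 - p_single) ** attempts


# Hypothetical exam: 100 items, cut-score of 70% (70 items correct),
# candidate with a true ability of 65% (should fail).
p1 = prob_pass_single_attempt(true_ability=0.65, n_items=100, min_correct=70)
for attempts in range(1, 6):
    print(attempts, round(prob_pass_within(attempts, p1), 3))
```

The exact probabilities depend heavily on test length and on the error model chosen, so the printed values illustrate the trend only and do not reproduce the published figures.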
Myth #8: All Decisions Are Created Equally
• How might we control for this effect?
• Increase the size of the item/station bank to reduce the likelihood that previously seen material will appear on repeat test attempts
• Incorporate item/station exposure as a constraint when assembling test forms
• Adjust the cut-score to minimize misclassifications
• A panel sets the standard at 65%
• We can adjust the cut-score so that a candidate with a true ability level of 65% (true master) has a near zero probability of being misclassified
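One way to picture the cut-score adjustment is to shift the cut-score by a multiple of the standard error of measurement (SEM). The sketch below assumes normally distributed measurement error and a hypothetical SEM of 3 percentage points; the function names and the multiplier z are illustrative, not taken from the presentation.

```python
from statistics import NormalDist


def adjusted_cut_score(panel_cut, sem, z):
    """Lower the panel's cut-score by z standard errors of measurement."""
    return panel_cut - z * sem


def prob_observed_pass(true_score, cut, sem):
    """P(observed score >= cut) assuming normal measurement error (an assumption)."""
    return 1 - NormalDist(mu=true_score, sigma=sem).cdf(cut)


panel_cut = 65.0  # standard recommended by the panel (%)
sem = 3.0         # hypothetical standard error of measurement (%)

# Without adjustment, a candidate whose true ability equals the cut-score
# passes only about half the time.
p_unadjusted = prob_observed_pass(true_score=65.0, cut=panel_cut, sem=sem)

# Lowering the cut-score by 3 SEM makes a false negative very unlikely.
new_cut = adjusted_cut_score(panel_cut, sem, z=3)
p_adjusted = prob_observed_pass(true_score=65.0, cut=new_cut, sem=sem)

print(new_cut, round(p_unadjusted, 3), round(p_adjusted, 4))
```

Lowering the cut-score protects true masters at the price of more false positives; when protection of the public is the main concern, the adjustment would go in the opposite direction.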
Myth #9: A Cut-score/Standard Does Not Need to Be Evaluated
• A cut-score reflects the (informed) judgments of a small sample of experts, based on a sample of items/stations, at a specific point in time, using one or only a few methods
• Cut-scores can and will vary as a function of these factors, which need to be evaluated
• Evidence to support both the “internal” and “external” validity of your cut-score should be collected and presented to support its intended use
Myth #9: A Cut-score/Standard Does Not Need to Be Evaluated
• Evaluating your standard
• Internal validation
• How reproducible is the cut-score across facets?
• Judges (inter-rater consistency)?
• Sample of stations?
• Panels of judges? Etc.
• Generalizability analysis and rater models (IRT) are useful to help us assess how variable the cut-score is across these facets
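A very rough first look at reproducibility across judges is the spread of the judges' individual cut-scores. The sketch below is a simplification, not a generalizability analysis or an IRT rater model: it treats judges as a simple random sample and reports the mean, the standard deviation, and the standard error of the panel's cut-score. The data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def panel_cut_summary(judge_cuts):
    """Return (mean, standard deviation, standard error) of judges' cut-scores."""
    n = len(judge_cuts)
    if n < 2:
        raise ValueError("need at least two judges to estimate variability")
    sd = stdev(judge_cuts)  # sample standard deviation across judges
    return mean(judge_cuts), sd, sd / sqrt(n)


# Hypothetical cut-scores (%) recommended by five judges.
judge_cuts = [62.0, 65.0, 68.0, 64.0, 66.0]
panel_mean, panel_sd, panel_se = panel_cut_summary(judge_cuts)
print(panel_mean, round(panel_sd, 2), round(panel_se, 2))
```

A full generalizability study would also separate the variance due to items or stations and to panels, which this summary does not attempt.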
Myth #9: A Cut-score/Standard Does Not Need to Be Evaluated
• Evaluating your standard
• External validation
• How do the decisions relate to other measures?
• If scores on two exams are highly related, but decision consistency is low, perhaps the cut-score on one assessment is not appropriate?
• Impact
• How comparable are P/F rates to historical trends?
• Does the cut-score lead to “acceptable” results?
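The external checks above can be sketched with two simple quantities: the pass rate implied by a cut-score, and the proportion of candidates who receive the same pass/fail decision on two exams. The sketch assumes the same candidates took both exams, in the same order; the scores and cut-scores are hypothetical.

```python
def pass_rate(scores, cut):
    """Proportion of candidates scoring at or above the cut-score."""
    return sum(s >= cut for s in scores) / len(scores)


def decision_consistency(scores_a, scores_b, cut_a, cut_b):
    """Proportion of candidates with the same pass/fail decision on both exams."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both exams must cover the same candidates")
    agreements = sum((a >= cut_a) == (b >= cut_b) for a, b in zip(scores_a, scores_b))
    return agreements / len(scores_a)


# Hypothetical scores for eight candidates on two related exams.
scores_a = [55, 62, 68, 71, 75, 80, 84, 90]
scores_b = [50, 60, 66, 70, 72, 79, 83, 88]

rate_a = pass_rate(scores_a, cut=70)
consistency = decision_consistency(scores_a, scores_b, cut_a=70, cut_b=65)
print(rate_a, consistency)
```

Raw agreement does not correct for agreement expected by chance; a chance-corrected index could be substituted if that matters for the intended use.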
Myth #10: The Angoff Method Was Developed by Angoff
• Angoff did not formally develop the (Angoff) standard setting method
• Origin can be traced back to a footnote in a chapter on scales, norms and equivalent scores that Angoff wrote in 1971
• Angoff ascribed the procedure to Tucker
• Method was a “systematic procedure for deciding on the minimum raw scores for passing and honors”
Myth #10: The Angoff Method Was Developed by Angoff
“a slight variation of this procedure is to ask each judge to state the probability that the ‘minimally acceptable person’ would answer each item correctly. In effect, the judges would think of a number of minimally acceptable persons, instead of only one such person, and would estimate the proportion of minimally acceptable persons who would answer each item correctly. The sum of the probabilities, or proportions, would then represent the minimally acceptable score” (p. 515).
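The procedure described in the footnote can be sketched in a few lines. Each judge's minimally acceptable score is the sum of that judge's item probabilities, as quoted above; averaging those sums across judges to obtain a panel recommendation is an assumption added here (a common convention, not part of the quoted text). The ratings are hypothetical.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Average, across judges, of each judge's summed item probabilities.

    ratings[j][i] is judge j's estimate of the probability that a
    minimally acceptable candidate answers item i correctly.
    """
    judge_sums = [sum(judge_ratings) for judge_ratings in ratings]
    return mean(judge_sums)


# Hypothetical ratings from three judges on a four-item test.
ratings = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.8, 0.6, 0.7],
    [0.7, 0.6, 0.4, 0.9],
]

raw_cut = angoff_cut_score(ratings)          # in raw-score (items correct) units
percent_cut = 100 * raw_cut / len(ratings[0])
print(round(raw_cut, 2), round(percent_cut, 1))
```

As noted under Myth #4, the value produced this way is a recommendation from the panel; the governing body sets the actual cut-score.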