Packing and Unpacking Sources of Validity Evidence: History Repeats Itself Again
Stephen G. Sireci
University of Massachusetts Amherst
Presentation for the conference
“The Concept of Validity: Revisions, New Directions, & Applications”
October 9, 2008
University of Maryland, College Park
Validity
A concept that has evolved and is still evolving
The most important consideration in educational and psychological testing
Simple, but complex
– Can be misunderstood
– Disagreements regarding what it is, and what is important
Purposes of this presentation
Provide some historical context on the concept of validity in testing
Present current, consensus definitions of validity
Describe the validation framework implied in the Standards
Discuss limitations of current framework
Suggest new directions for validity research and practice
Packing and unpacking: A prelude
Packing | Unpacking
Does the test measure what it purports to measure? | Predictive, status, content, congruent validity
A test is valid for anything with which it correlates. | Clarity, coherence, plausibility of assumptions (validity argument)
Validity is a unitary concept. | 5 sources of validity evidence
What does valid mean?
Truth?
According to Webster's, valid:
1. having legal force; properly executed and binding under the law.
2. sound; well grounded on principles or evidence; able to withstand criticism or rejection.
3. effective, effectual, cogent
4. robust, strong, healthy (rare)
What is validity?
According to Webster's:
Validity:
1. the state or quality of being valid; specifically, (a) strength or force from being supported by fact; justness; soundness; (b) legal strength or force.
2. strength or power in general
3. value (rare)
In the beginning
Modern measurement started at the turn of the 20th century
1905: Binet-Simon scale
– 30-item scale designed to ensure that no child could be denied instruction in the Paris school system without formal examination
• Binet died in 1911 at age 54
What else was happening around the turn of the century?
1896: Karl Pearson, Galton Professor of Eugenics at University College, published the formula for the correlation coefficient
Given the predictive purpose of Binet’s test, interest in heredity and individual differences, and a new statistical formula relating variables to one another, validity was initially defined in terms of correlation.
Earliest definitions of Validity
“Valid scale” (Thorndike, 1913)
A test is valid for anything with which it correlates
– Kelley, 1927; Thurstone, 1932; Bingham, 1937; Guilford, 1946; others
Validity coefficients
– correlations of test scores with grades, supervisor ratings, etc.
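As a minimal sketch of what a validity coefficient means in this early sense, the snippet below computes the Pearson correlation between test scores and a criterion. The data and the variable names (test_scores, supervisor_ratings) are hypothetical and only illustrate the idea; they do not come from any of the studies cited on the slide.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: test scores and later supervisor ratings for six examinees.
test_scores = [52, 61, 47, 70, 66, 58]
supervisor_ratings = [3.1, 3.8, 2.9, 4.5, 4.0, 3.4]

# Under the "validity = correlation" view, this single number is the evidence.
validity_coefficient = pearson_r(test_scores, supervisor_ratings)
print(round(validity_coefficient, 3))
```

Note that the coefficient is only as meaningful as the criterion it is computed against.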
Validation started with group tests
1917: Army Alpha and Army Beta
– (Yerkes)
– Classification of 1.5 million recruits
Borrowed items and ideas from Otis Tests
– Otis was one of Terman’s graduate students
Military Testing
Tests were added or subtracted to batteries based solely on correlational evidence (e.g., increase in R²).
How well does test predict pass/fail criterion several weeks later?
Jenkins (1946) and others emerged in response to problems with the notion that validity = correlation
– See also Pressey (1920)
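The battery-building rule described above (keep a test if it increases R²) can be sketched for the simplest case of one test already in the battery and one candidate test. This is an illustrative sketch under stated assumptions, not a reconstruction of the procedure the military programs actually used: the data are hypothetical, the pass/fail criterion is coded 0/1 and treated with ordinary linear correlation, and the two-predictor R² is computed from the three pairwise correlations.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion on two predictors,
    computed from the three pairwise correlations."""
    if abs(r_12) >= 1:
        raise ValueError("predictors are perfectly collinear")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def incremental_r_squared(criterion, test_in_battery, candidate_test):
    """Gain in R² from adding candidate_test to a one-test battery."""
    r_y1 = pearson_r(criterion, test_in_battery)
    r_y2 = pearson_r(criterion, candidate_test)
    r_12 = pearson_r(test_in_battery, candidate_test)
    # With a single predictor, R² is just the squared correlation.
    return r_squared_two_predictors(r_y1, r_y2, r_12) - r_y1 ** 2


# Hypothetical data: pass (1) / fail (0) several weeks after testing.
criterion = [1, 0, 1, 1, 0, 1, 0, 1]
test_in_battery = [55, 48, 60, 52, 50, 63, 45, 58]
candidate_test = [12, 9, 11, 14, 10, 13, 12, 15]

gain = incremental_r_squared(criterion, test_in_battery, candidate_test)
print(round(gain, 3))
```

A rule like this says nothing about whether the criterion itself is reliable or valid, which is the problem the critics raised.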
Problems with notion that validity = correlation
Finding criterion data
Establishing reliability of criterion
Establishing validity of criterion
If valid, measurable, criteria exist, why do we need the test?
What did critics of correlational evidence of validity suggest for validating tests?
Professional judgment
“...it is proper for the test developer to use his individual judgment in this matter though he should hardly accept it as being on par with, or as worthy of credence as, experimentally established facts showing validity.” – (Kelley, 1927, pp. 30-31)
What did critics of correlational evidence of validity suggest for validating tests?
Appraisal of test content with respect to the purpose of testing (Rulon, 1946)
– rational relationship
Sound familiar?
Early notions of content validity
– (Kelley, Mosier, Rulon, Thorndike, others)
– but notice Kelley’s hesitation in endorsing this evidence, or going against the popular notion
Other precursors to content validity
Guilford (1946): validity by inspection?
Gulliksen (1950): “Intrinsic Validity”
– pre/post instruction test score change
– consensus of expert judgment regarding test content
– examine relationship of test to other tests measuring same objectives
Herring (1918): 6 experts evaluated the “fitness of items”
Development of Validity Theory
By the 1950s, there was consensus that correlational evidence was not enough
and that judgmental data of the adequacy of test content should be gathered
Growing idea of multiple lines of “validity evidence”
Emergence of Professional Standards
Cureton (1951): First “Validity” chapter in first edition of “Educational Measurement” (edited by Lindquist).
Two aspects of validity
– Relevance (what we would call criterion-related)
– Reliability
Cureton (1951)
Validity defined as “the correlation between actual test scores and true criterion scores”
but: “curricular relevance or content validity” may be appropriate in some situations.
Emergence of Professional Standards
1952: APA Committee on Test Standards
– Technical Recommendations for Psychological Tests and Diagnostic Techniques: A Preliminary Proposal
Four “categories of validity”
– predictive, status, content, congruent
Emergence of Professional Standards
1954: APA, AERA, & NCMUE produced
– Technical Recommendations for Psychological Tests and Diagnostic Techniques
Four “types” or “attributes” of validity:
– construct validity (instead of congruent)
– concurrent validity (instead of status)
– predictive
– content
1954 Standards
Chair was Cronbach and guess who else was on the Committee?
– Hint: A philosopher
Promoted idea of:
– different types of validity
– multiple types of evidence preferred
– some types preferable in some situations
Subsequent Developments
1955: Cronbach and Meehl
– Formally defined and elaborated the concept of construct validity.
– Introduced term “criterion-related validity”
1956: Lennon
– Formally defined and elaborated the concept of content validity.
Subsequent Developments
Loevinger (1957): big promoter of construct validity idea.
Ebel (1961…): big antagonist of unified validity theory
– Preferred “meaningfulness”
Evolution of Professional Standards
1966: AERA, APA, NCME
Standards for Educational and Psychological Tests and Manuals
Three “aspects” of validity:
– Criterion-related (concurrent + predictive)
– Construct
– Content
1966: Standards
Introduced notion that test users are also responsible for test validity
Specific testing purposes called for specific types of validity evidence.
– Three “aims of testing”
• present performance
• future performance
• standing on trait of interest
Important developments in content validation
Evolution of Professional Standards
1974: AERA, APA, NCME
Standards for Educational and Psychological Tests
Validity descriptions borrowed heavily from Cronbach (1971)
– Validity chapter in 2nd edition of “Educational Measurement” (edited by R.L. Thorndike)
1974: Standards
Defined content validity in operational, rather than theoretical, terms.
Beginning of notion that construct validity is much cooler than content or criterion-related.
Early consensus of “unitary” conceptualization of validity
Evolution of Professional Standards
1985: AERA, APA, NCME
Standards for Educational and Psychological Testing
note “ing”
Described validity as unitary concept
Notion of validating score-based inferences
Very Messick-influenced
1985 Standards
More responsibility on test users
More standards on applications and equity issues
Separate chapters for
– Validity
– Reliability
– Test development
– Scaling, norming, equating
– Technical manuals
1985 Standards
New chapters on specific testing situations
– Clinical
– Educational
– Counseling
– Employment
– Licensure & Certification
– Program Evaluation
– Linguistic Minorities
– “People who have handicapping conditions”
1985 Standards
New chapters on
– Administration, scoring, reporting
– Protecting the rights of test takers
– General principles of test use
Listed standards as
– primary,
– secondary, or
– conditional.
1999 Standards
New “Fairness in Testing” section
No more “primary,” “secondary,” “conditional.”
3-part organizational structure
1. Test construction, evaluation, & documentation
2. Fairness in testing
3. Testing applications
1999 Standards (2)
Incorporated the “argument-based approach to validity”
Five “Sources of Validity Evidence”
1. Test content
2. Response processes
3. Internal structure
4. Relations to other variables
5. Testing consequences
We’ll return to these sources later.
Comparing the Standards: Packing & Unpacking Validity Evidence
Edition Validity
1954 Construct, concurrent, predictive, content
1966 Criterion-related, construct, content
1974 Criterion-related, construct, content
1985 Unitary (but, content-related evidence, etc.)
1999 Unitary: 5 sources of evidence
What are the current and influential definitions of validity?
Cronbach: Influential, but not current (1971…)
Messick (1989…)
Shepard (1993)
Standards (1999)
Kane (1992, 2006)
Messick (1989): 1st sentence
“Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment.” (p. 13)
This “integrated” judgment led Messick, and others, to conclude
All validity is construct validity.
It is outside my purpose today to debate the unitary conceptualization of validity, but like all theories, it has strengths and limitations.
But two quick points…
Unitary conceptualization of validity
Focuses on inferences derived from test scores
– Assumes measurement of a construct motivates test development and purpose
The focus on analysis of scores may undermine attention to content validity
Removal of term “content validity” may have had negative effect on validation practices.
Consider Ebel (1956)
“The degree of construct validity of a test is the extent to which a system of hypothetical relationships can be verified on the basis of measures of the construct…but this system of relationships always involves measures of observed behaviors which must be defended on the basis of their content validity” (p. 274).
Consider Ebel (1956)
“Statistical validation is not an alternative to subjective evaluation, but an extension of it. All statistical procedures for validating tests are based ultimately upon common sense agreement concerning what is being measured by a particular measurement process” (p. 274).
The 1999 Standards accepted the unitary conceptualization, but also took a practical stance.
The practical stance stems from the use of an argument-based approach to validity.
– Cronbach (1971, 1988)
– Kane (1992, 2006)
The Standards (1999) succinctly defined validity
“Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.” (p. 9)
Why do I say the Standards incorporated the argument-based approach to validation?
“Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use.”
(AERA et al., 1999, p. 9)
Kane (1992)
“it is not possible to verify the interpretive argument in any absolute sense. The best that can be done is to show that the interpretive argument is highly plausible, given all available evidence” (p. 527).
Kane: Argument-based approach
a) Decide on the statements and decisions to be based on the test scores.
b) Specify inferences/assumptions leading from test scores to statements and decisions.
c) Identify competing interpretations.
d) Seek evidence supporting inferences and assumptions and refuting counterarguments.
Philosophy of Validity
Messick (1989)
“if construct validity is considered to be dependent on a singular philosophical base such as logical positivism and that basis is seen to be deficient or faulty, then construct validity might be dismissed out of hand as being fundamentally flawed” (p. 22).
Messick (1989)
“nomological networks are viewed as an illuminating way of speaking systematically about the role of constructs in psychological theory and measurement, but not as the only way” (p. 23).
3 perspectives on the relationship between test and other indicators of construct
1. Test and nontest consistencies are manifestations of real traits.
2. Test and nontest consistencies are defined by relationships among constructs in a theoretical framework.
3. Test and nontest consistencies are attributable to real entities but are understood in terms of constructs.
See Messick (1989) Figures 2.1-2.3
Messick on test validation
“test validation is a process of inquiry” (p. 31)
5 systems of inquiry
– Leibnizian
– Lockean
– Kantian
– Hegelian
– Singerian
Systems of inquiry
Main points
– Validation can seek to confirm
– Validation can seek consensus
(Leibniz, Locke)
– Validation can seek alternative hypotheses
– Validation can seek to disconfirm
(Kant, Hegel, Singer)
Two other important points by Messick (1989)
“the major limitation is shortsightedness with respect to other possibilities” (p. 33)
“The very variety of methodological approaches in the validational armamentarium, in the absence of specific criteria for choosing among them, makes it possible to select evidence opportunistically and to ignore negative findings” (p. 33)
If you look at the seminal papers and textbooks, and the various editions of the Standards, there are several fundamental and consensus tenets about validity theory and test validation.
Fundamental Validity Tenets
Validity is NOT a property of a test.
A test cannot be valid or invalid.
What we seek to validate are (inferences) uses of test scores.
Validity is not all or none.
Test validity must be evaluated with respect to a specific testing purpose. Thus, a test may be appropriate for one purpose, but not for another.
Fundamental Validity Tenets (cont.)
Evaluating the validity of inferences derived from test scores requires multiple lines of evidence (i.e., different types of evidence for validity).
Test validation never ends—it is an ongoing process.
I believe these tenets can be considered “consensus” due to their incorporation in the Standards and predominance in the literature.
But of course, not everyone need agree with consensus, and we will hear important points from detractors over the next two days.
Criticisms of this perspective
Tests are never truly validated (we are never done).
No prescription or guidance regarding specific types of evidence to gather and how to gather it.
Ideal goals with no guidance leads to inaction.
The argument-based approach is a compromise between sophisticated validity theory and the reality that at some point, we must make a judgment about the defensibility and suitability of use of a test for a particular purpose.
What guidance does the Standards give us?
Five “sources of evidence that might be used in evaluating a proposed interpretation of test scores for particular purposes”
“Validation is a matter of making the most reasonable case to guide both current use of the test and current research to advance understanding of what the test scores mean…
To validate an interpretive inference is to ascertain the degree to which multiple lines of evidence are consonant with the inference, while establishing that alternative inferences are less well supported.”
(Messick, 1989, p. 13).
The current Standards
Provide a useful framework for evaluating the use of a test for a particular purpose.
– And for documenting validity evidence
Allow us to use multiple lines of evidence to support use of a test for a particular purpose
But, are not prescriptive and do not provide examples or references to “adequate” validity arguments.
Standards’ Validation Framework
Validity evidence based on
1. Test content
2. Response processes
3. Internal structure
4. Relations to other variables
5. Testing consequences
What is helpful in the Standards framework?
It provides a system for categorizing validity evidence so that a coherent set of evidence can be put forward.
It provides a way of standardizing the reporting of validity evidence.
It focuses on both test construction and test score validation activities.
Emphasizes the importance of evaluating consequences
What are the limitations in the Standards framework?
Not all types of evidence of validity fit into the 5 sources categories.
No examples of good validation studies or of when sufficient evidence is put forth
No statistical guidance
No references
Vagueness in some areas
Suggestions for revising the Standards (1)
Need to refine sources of validity evidence to accommodate
– Analysis of group differences
– Alignment research
– Differential item functioning
– Statistical analysis of test bias
Need more clarity on validity evidence for accountability testing (groups, rather than individuals)
Suggestions for revising the Standards (2)
Need to define “score comparability”
– Across subgroups of examinees taking a single assessment
– Across accommodations to standardized assessments
– Across different language versions of an assessment
– Across different tests in CAT/MST
– Across different modes of assessment
Suggestions for revising the Standards (3)
Include specific examples of laudable test validation analyses and references to studies that exemplify sound validity arguments.
Closing remarks
There are different perspectives on validity theory.
Whether a test is valid for a particular purpose will always be a question of judgment.
A sound validity argument makes the judgment an easy one to make.
Closing remarks (2)
For educational tests, validity evidence based on test content is fundamental. Without confirming that the content tested is consistent with curricular goals, that the test adequately represents the intended domain, and that the test is free of construct-irrelevant material, the utility of the test for making educational decisions will be undermined.
Why are there different perspectives on validity?
It’s philosophy.
It’s okay to disagree, but we need consensus with respect to nomenclature, and that is where differences can hurt us as a profession.
For over 50 years, the Standards have provided consensus definitions.
Remember, threats to validity boil down to
Construct underrepresentation
Construct-irrelevant variance
“Tests are imperfect measures of constructs because they either leave out something that should be included…or else include something that should be left out, or both” (Messick, 1989, p. 34)
Adhering to and Improving the Standards
I don’t agree with everything in the Standards.
But I find it easier to work within the framework, than against it.
Advice for the remainder of the conference, and for your future validity endeavors
If you criticize, have specific improvements to contribute.
– (e.g., evidence-centered design)
Consider different perspectives on validity when evaluating use of a test for a particular purpose
– If one statistical analysis is offered as “validation,” be suspicious
– Look for evidence in test construction
Thank you for your attention
And thanks to Bob Lissitz and UMD for the invitation and for holding this conference.
I look forward to continuing the conversation.
There is certainly a lot more to hear, and to say.