

An Introduction to Validity Arguments for Alternate Assessments

Scott Marion
Center for Assessment

Eighth Annual MARCES Conference
University of Maryland
October 11-12, 2007


Overview

A little validity background

Creating and evaluating a validity argument…or translating Kane (and others) to AA-AAS
– Can we make it practical?

A focus on validity in technical documentation


Validation is “a lengthy, even endless process” (Cronbach, 1989, p. 151).

Good for consultants, but not so great for state folks and contractors.

Are you nervous yet…


Validity Should be Central

We argue that the purpose of the technical documentation is to provide data to support or refute the validity of the inferences from the alternate assessments at both the student and program level.


Unified Conception of Validity

Drawing on the work of Cronbach, Messick, Shepard, and Kane, the proposed evaluation of technical quality is built around a unified conception of validity
– centered on the inferences related to the construct, including significant attention to the social consequences of the assessment


But what is a validity argument and how do we evaluate the validity of our inferences?


A little history

Kane traces the history of validity theory from the criterion model through the content model to the construct model.

It is worth stopping briefly to discuss the content model, because that is where many still appear to operate.


“The content model interprets test scores based on a sample of performances in some area of activity as an estimate of overall level of skill in that activity.” The sample of items/tasks and observed performances must be:
– representative of the domain,
– evaluated appropriately and fairly, and
– part of a large enough sample

So, this sounds good, right?


Concerns with the content model

“Messick (1989) argued that content-based validity evidence does not involve test scores or the performances on which the scores are based and therefore cannot be used to justify conclusions about the interpretation of test scores” (p. 17).
– Huh? More simply…content evidence is a matching exercise and doesn’t really help us get at the interpretations we make from scores

Is it useful? Sure, but with the intense focus on alignment these days, content evidence appears to be privileged compared with trying to create arguments for the meaning of test scores.


The Construct Model

We can trace this evolution from Cronbach and Meehl (1955) through Loevinger (1957) to Cronbach (1971), culminating in Messick (1989).
– Focused attention on the many factors associated with the interpretations and uses of test scores (and not simply with correlations)
– Emphasized the important effect of assumptions in score interpretations and the need to check these assumptions
– Allowed for the possibility of alternative explanations for test scores—in fact, this model even encouraged falsification


Limitations of the Construct Model

Does not provide clear guidance for the validation of a test score interpretation and/or use

Does not help evaluators prioritize validity studies
– If, as Anastasi (1986) noted, “almost any information gathered in the process of developing or using a test is relevant to its validity” (p. 3), where should one start, and how do you know when you’re done (or are you ever done)?


Transitioning to argument…

The call for careful examination of alternative explanations within the construct model is helpful for directing a program of validity research.


Kane’s argument-based framework

“…assumes that the proposed interpretations and uses will be explicitly stated as an argument, or network of inferences and supporting assumptions, leading from observations to the conclusions and decisions. Validation involves an appraisal of the coherence of this argument and of the plausibility of its inferences and assumptions” (Kane, 2006, p. 17).

Sounds easy, right…


Two Types of Arguments

An interpretative argument specifies the proposed interpretations and uses of test results by laying out the network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on those performances.

The validity argument provides an evaluation of the interpretative argument (Kane, 2006).


Kane’s framework offers a more pragmatic approach to validation, “…involving the specification of proposed interpretations and uses, the development of a measurement procedure that is consistent with this proposal, and a critical evaluation of the coherence of the proposal and the plausibility of its inferences and assumptions.”

The challenge is that most assessments do not start with explicit attention to validity in the design phase.


The Interpretative Argument

Essentially a mini-theory—the interpretative argument provides a framework for the interpretation and use of test scores.

Like a theory, the interpretative argument guides the data collection and methods; most importantly, like theories, interpretative arguments are falsifiable as we critically evaluate the evidence and arguments.


Two stages of the interpretative argument

Development stage—focus on development of measurement tools and procedures as well as the corresponding interpretative argument
– An appropriate confirmationist bias in this stage, since the developers (state and contractors) are trying to make the program the best it can be

Appraisal stage—focus on critical evaluation of the interpretative argument
– Should be more neutral and “arms-length” to provide a more convincing evaluation of the proposed interpretations and uses

“Falsification, obviously, is something we prefer to do unto the constructions of others” (Cronbach, 1989, p. 153)


Interpretative argument

“Difficulty in specifying an interpretative argument…may indicate a fundamental problem. If it is not possible to come up with a test plan and plausible rationale for a proposed interpretation and use, it is not likely that this interpretation and use will be considered valid” (Kane, 2006, p. 26).

Think of the interpretative argument as a series of “if-then” statements…
– E.g., if the student performs the task in a certain way, then the observed score should have a certain value


Criteria for Evaluating Interpretative Arguments

Clarity—the argument should be clearly stated as a framework for validation, with inferences and warrants specified in enough detail to make the proposed claims explicit.

Coherence—assuming the individual inferences are plausible, the network of inferences leading from the observations to conclusions and decisions makes sense.

Plausibility—the inferences and, particularly, the assumptions are judged in terms of all the evidence for and against them.


One of the most effective challenges to interpretative arguments (or scientific theories) is to propose and substantiate an alternative argument that is more plausible.
– With AA-AAS we have to seriously consider and challenge ourselves with competing alternative explanations for test scores, for example…

“Higher scores on our state’s AA-AAS reflect greater learning of the content frameworks” OR

“Higher scores on our state’s AA-AAS reflect higher levels of student functioning”


Categories of interpretative arguments (Kane, 2006)

Trait interpretations
Theory-based interpretations
Qualitative interpretations
Decision procedures

Like scientific theories, the specific type of interpretative argument for test-based inferences guides models, data collection, assumptions, analyses, and claims.


Decision Procedures

Evaluating a decision procedure requires an evaluation of values and consequences.

“To evaluate a testing program as an instrument of policy [e.g., AA-AAS under NCLB], it is necessary to evaluate its consequences” (Kane, 2006, p. 53).

Therefore, the values inherent in the testing program must be made explicit, and the consequences of decisions based on test scores must be evaluated!


Prioritizing and Focusing

Shepard (1993) advocated a straightforward means to prioritize validity questions. Using an evaluation framework, she proposed that validity studies be organized in response to the questions:
– What does the testing practice claim to do;
– What are the arguments for and against the intended aims of the test; and
– What does the test do in the system other than what it claims, for good or bad? (Shepard, 1993, p. 429)

The questions are directed to concerns about the construct, relevance, interpretation, and social consequences, respectively.

[Figure: A heuristic to help organize and focus the validity evaluation (Marion, Quenemoen, & Kearns, 2006). The diagram arranges the elements of the assessment triangle—Cognition (student population, academic content, theory of learning), Observation, and Interpretation—alongside the assessment system components (test development, administration, scoring, reporting, alignment, item analysis/DIF/bias, measurement error, scaling and equating, standard setting), all feeding the validity evaluation: empirical evidence, theory and logic (argument), and consequential features.]


Synthesizing and Integrating

Haertel (1999) reminded us that the individual pieces of evidence (typically presented in separate chapters of technical documents) do not, by themselves, make the assessment system valid or invalid; only by synthesizing this evidence to evaluate the interpretative argument can we judge the validity of the assessment program.


NHEAI/NAAC Technical Documentation

The “Nuts and Bolts”
The Validity Evaluation
The Stakeholder Summary
The Transition Document


The Validity Evaluation

Author: Independent contractor with considerable input from state DOE

Audience: State policy makers, state DOE, district assessment and special education directors, state TAC members, special education teachers, and other key stakeholders. This also will contribute to the legal defensibility of the system.

Notes: This will be a dynamic volume where new evidence is collected and evaluated over time.


Table of Contents

I. Overview of the Assessment System
II. Who are the students?
III. What is the content?
IV. Introduction of the Validity Framework and Argument
V. Empirical Evidence
VI. Evaluating the Validity Argument


Chapter VI: The Validity Evaluation

A. Revisiting the interpretative argument
   – Logical/theoretical relationships among the content, students, learning, and assessment—revisiting the assessment triangle
B. The specific validity evaluation questions addressed in this volume
C. Synthesizing and weighing the various sources of evidence
   1. Arguments for the validity of the system
   2. Arguments against the validity of the system
D. An overall judgment about the defensibility of inferences from the scores of the AA-AAS in the context of specific uses and purposes