A District-initiated Appraisal of a State Assessment's Instructional Sensitivity
HOLDING ACCOUNTABILITY TESTS ACCOUNTABLE
Stephen C. Court
Presented in Symposium, American Educational Research Association (AERA) Annual Meeting
May 2, 2010
Denver, Colorado
Accountability
Basic premise:
Teaching → Learning → Proficiency
High proficiency rates = Good schools
Low proficiency rates = Bad schools
Accountability
Basic Assumption
State assessments distinguish well-taught students from not-so-well-taught students with enough accuracy to support accountability decisions.
Accountability
Q: Is the assumption warranted?
A: Only if the tests are instructionally sensitive.
When tests are insensitive, accountability decisions are based on the wrong things – e.g., socioeconomic status (SES).
Kansas: SES
Kansas: Test Scores
Kansas: Exemplary by SES
The Situation in Kansas
Basic Question
Can the instruction in low-poverty districts truly be that much better than the instruction in high-poverty districts?
Or, do instructionally irrelevant factors (such as SES) distort or mask the effects of instruction?
Multi-district Study
• Purpose
  – To compare instructional sensitivity appraisal models and methods
  – To appraise the instructional sensitivity of the Kansas state assessments
• District-initiated because no state-level study had been initiated
  – Indicator-level analysis
  – Loss/gain because no indicator-level cut scores
• Based initially on the empirical approach recommended by Popham (2008)
Tactical Variations
• A variety of practical constraints and preliminary findings raised several conceptual and methodological issues.
• The original design underwent several revisions.
• Several tactical variations involving:
  – data collection
  – data array, analysis, and interpretation
Tactical Variations
See the paper for details…
• discusses the issues and design revisions
• provides an exegesis of the item-selection criteria and test-construction practices that yield instructional insensitivity
• describes, demonstrates, and compares the tactical variations employed in the collection, array, and analysis of the data, as well as in the interpretation of the results
Due to time constraints, let's focus just on the "juiciest jewels"…
Study Participants
575 teachers responded:
– 320 teachers (grades 3–5, reading and math)
– 129 reading teachers (grades 6–8)
– 126 math teachers (grades 6–8)
14,000 students
• Only grade 5 reading is included in this study.
• To be reported in June at CCSSO in Detroit:
– other reading results (grades 3–8)
– all math results (grades 3–8)
A Gold Standard
By recommending that teachers be asked to identify their best-taught indicators, Popham (2008) transformed the instructional sensitivity issue in a fundamental way – both conceptually and operationally:
For the first time since instructional sensitivity inquiries began about 40 years ago, there could now be a gold standard independent of the test itself – a huge breakthrough!
Old and New Model
Old model:
– A = Non-Learning
– B = Learning
– C = Slip
– D = Maintain
New model:
– A = True Fail
– B = False Pass = II-E (instructionally irrelevant easiness)
– C = False Fail = II-D (instructionally irrelevant difficulty)
– D = True Pass
Initial Analysis Scheme
Initial logic:
If best-taught students outperform other students, the indicator is sensitive to instruction.
If mean differences are small or in the wrong direction, the indicator is insensitive to instruction.
Problem
But significant performance differences between best-taught and other students do not necessarily represent instructional sensitivity:
– affluent students provided ineffective instruction typically end up in Cell B (false pass)
– challenged students provided effective instruction typically end up in Cell C (false fail)
Problem
Thus: Means-based and DIF-driven approaches that evaluate between-group differences are not appropriate for appraising instructional sensitivity.
Instead: Focus on the degree to which indicators accurately distinguish effective from ineffective instruction – without confounding from instructionally irrelevant easiness or difficulty.
Conceptually Correct
Rather than comparing group differences in terms of means, let's look instead at the combined proportion of true fails and true passes. That is,
(A + D) / (A + B + C + D)
which can be shortened to
(A + D) / N = Malta Index
Malta Index
(A + D) / N ranges from 0 to 1
(completely insensitive to totally sensitive)
In practice, a value of .50 = chance – equivalent to random guessing.
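The computation itself is trivial. As a minimal sketch (the function name `malta_index` is mine, not from the paper), the index follows directly from the four cell counts, and the three calls below reproduce the totally sensitive, totally insensitive, and chance cases worked through on the next slides:

```python
def malta_index(a, b, c, d):
    """Malta Index: the proportion of true fails (cell A) and
    true passes (cell D) among all classified students."""
    n = a + b + c + d
    return (a + d) / n

print(malta_index(50, 0, 0, 50))    # 1.0  (totally sensitive)
print(malta_index(0, 50, 50, 0))    # 0.0  (totally insensitive)
print(malta_index(25, 25, 25, 25))  # 0.5  (chance; useless)
```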
Totally Sensitive
(A + D) / N =
(50 + 50) / 100 = 1.0
A perfectly sensitive item or indicator would cluster students into Cell A or Cell D.
Totally Insensitive
(A+D) / N = (0+0) / 100 = 0.0
A perfectly insensitive test clusters students into Cell B or Cell C.
Useless
(A+D) / N = (25+25) /100 = 0.50
0.50 = mere chance
An indicator that cannot distinguish true fail or pass from false fail or pass is totally useless – no better than random guessing.
Malta Index Parallels
The Malta Index is similar conceptually to:
– Mann-Whitney U
– Wilcoxon rank-sum statistic
– Area Under the Curve (AUC) in Receiver Operating Characteristic (ROC) curve analysis
But its interpretation is embedded in the context of instructional sensitivity appraisal.
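The parallel can be illustrated with a sketch under my own assumptions (not from the paper): students are scored pass/fail, best-taught students fall in cells C and D, other students in cells A and B, and ties receive half credit in the rank-based (Mann-Whitney) AUC. The two indices then compute side by side:

```python
def malta_index(a, b, c, d):
    # Proportion of true fails (A) and true passes (D) among all students.
    return (a + d) / (a + b + c + d)

def rank_auc(a, b, c, d):
    """Rank-based (Mann-Whitney) AUC for a binary pass/fail score used to
    separate best-taught students (cells C, D) from others (cells A, B).
    Concordant pairs pit a best-taught pass (D) against an 'other' fail (A);
    tied pairs (same score) count half."""
    best, other = c + d, a + b
    concordant = d * a
    ties = d * b + c * a
    return (concordant + 0.5 * ties) / (best * other)

# The indices agree at the extremes...
print(malta_index(50, 0, 0, 50), rank_auc(50, 0, 0, 50))      # 1.0 1.0
print(malta_index(25, 25, 25, 25), rank_auc(25, 25, 25, 25))  # 0.5 0.5
# ...but not in general, which fits the "similar but not identical" pattern.
print(malta_index(30, 10, 30, 30), rank_auc(30, 10, 30, 30))  # 0.6 0.625
```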
Malta Index
Compared to these other approaches, the Malta Index is easier to…
– compute
– understand
– interpret
Thus, it is more accessible conceptually to measurement novices, such as
– teachers
– reporters
– policy-makers
ROC Analysis
Malta Index values can be depicted graphically as ROC curves.
Informal Evaluation
Malta Index values can be evaluated informally via acceptability criteria (Hosmer & Lemeshow, 2000):
– .90–1.0 = excellent (A)
– .80–.90 = good (B)
– .70–.80 = acceptable (C)
– .60–.70 = poor (D)
– .50–.60 = fail (F)
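These bands reduce to a simple lookup; the sketch below is mine, and how values exactly on a boundary (e.g., .80) are assigned is my assumption, since the slide's ranges overlap:

```python
def acceptability(value):
    """Grade a Malta Index / AUC value against the Hosmer & Lemeshow (2000)
    acceptability bands. Boundary values go to the higher band (assumption)."""
    bands = [(0.90, "excellent (A)"),
             (0.80, "good (B)"),
             (0.70, "acceptable (C)"),
             (0.60, "poor (D)"),
             (0.50, "fail (F)")]
    for cutoff, grade in bands:
        if value >= cutoff:
            return grade
    return "below chance"  # worse than random guessing

print(acceptability(0.72))  # acceptable (C)
print(acceptability(0.54))  # fail (F)
```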
| Indicator | Teacher Ratings (Most vs. Less) | | Prior Data (Best vs. Not Best) | | Prior Data (Best vs. Worst) | |
| --- | --- | --- | --- | --- | --- | --- |
| | MI | AUC | MI | AUC | MI | AUC |
| 1 | .51 | .51 | .56 | .56 | .64 | .64 |
| 2 | .50 | .51 | .54 | .63 | .64 | .66 |
| 3 | .50 | .54 | .56 | .56 | .59 | .59 |
| 4 | .57 | .55 | .62 | .62 | .68 | .68 |
| 5 | .53 | .54 | .72 | .72 | .79 | .79 |
| 6 | .52 | .50 | .61 | .61 | .69 | .69 |
| 7 | .53 | .50 | .56 | .56 | .62 | .63 |
| 8 | .55 | .53 | .56 | .56 | .59 | .59 |
| 9 | .52 | .54 | .57 | .57 | .64 | .64 |
| 10 | .52 | .52 | .57 | .57 | .64 | .64 |
| 11 | .51 | .56 | .59 | .60 | .68 | .68 |
| 12 | .52 | .50 | .57 | .57 | .63 | .63 |
| 13 | .66 | .52 | .56 | .56 | .58 | .58 |
| 14 | .64 | .58 | .58 | .58 | .62 | .62 |
| Average | .54 | .53 | .64 | .59 | .64 | .65 |
Summary and Interpretations
• AUC and the Malta Index yield very similar but not identical results
• Identical conclusions overall: grade 5 reading indicators lack instructional sensitivity
  – No indicator was graded better than a "C"
  – Most were in the "poor" to "useless" range
  – Averages ranged from "poor" to "useless"
Summary and Interpretations
Low instructional sensitivity values for grade 5 reading were disappointing, especially given:
– a local contractor (CETE)
– guidance from the TAC (including Popham and Pellegrino)
– concerns from the KAAC (including Court)
If the Kansas assessments lack instructional sensitivity, what about other states' assessments?
Conclusion
Dear U.S. Department of Education:
Please make instructional sensitivity…
– an essential component in reviews of RTTT funding applications
– a critical element in the approval process of state and consortium accountability plans
When the Department revised its Peer Review Guidance (2007) to include alignment as a critical element of technical quality, states were compelled to conduct alignment studies that they otherwise would not have conducted.
Instructional sensitivity deserves similar Federal endorsement.
Presenter's email: [email protected]
Questions, comments, or suggestions are welcome.