The Impact of Selection of Student Achievement Measurement Instrument
on Teacher Value-added Measures
James L. Woodworth, CREDO Hoover Institute, Stanford
Wen-Juo Lo, University of Arkansas
Joshua B. McGee, Laura and John Arnold Foundation
Nathan C. Jensen, Northwest Evaluation Association
Presentation Outline
1. Purpose
2. Statistical Noise
   a. Why it matters
   b. Sources
3. Data
4. Methods
5. Results
Purpose
The purpose of this paper is to present to a statistics lay population the extent to which psychometric properties of student test instruments impact teacher value-added measures.
1.Purpose 2.Statistical Noise 3.Data 4.Methods 5.Results
Question
What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?
Why it matters
[Figure: 5th- and 6th-grade score scales with the performance bands Below Basic, Basic, Proficient, and Advanced]
Primary Sources of Statistical Noise
1. Test Design
2. Vertical Alignment
3. Student Sample Size
Test Design
Proficiency Tests
• Focused around proficiency point
• Designed to differentiate between proficient and not proficient
• Larger variance in Conditional Standard Errors (CSE)
Growth Tests
• Questions measure across entire ability spectrum
• Designed to differentiate between all points on the distribution
• Smaller variance in CSE
Test Design
Paper and Pencil Tests
• Limit item pool to control length
• Focused around proficiency point
• Large variance in CSE
Computer Adaptive Test
• Larger item pool for question selection
• Focused around student ability point
• Smaller variance in CSE
Test Design
CSE Heteroskedasticity Due to Item Focusing: TAKS Reading Grade 5, 2009
CSE Range: 24 - 74
Weighted average CSE = 38.96
Vertical Alignment
• Year-to-year alignment can impact the results of VAM
– Units must be equal across test sessions
• Spring-Spring VAM are most affected
• Fall-Spring VAM using the same test avoid much of the problem
• Item alignment on computer adaptive tests can impact the results of VAM
Student Sample Size
• Central Limit Theorem
– Larger student n provides a more stable estimate of teacher VAM.
– Typical single-year student n’s are 25, 50, and 100 for elementary and middle school teachers.
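As a rough illustration of the sample-size point above, the following Python sketch draws many simulated classes of pure measurement error and shows that the spread of the class average shrinks roughly as 1/√n. The function name `sd_of_class_means`, the number of simulated classes, the seed, and the unit error SD are illustrative assumptions and do not come from the presentation.

```python
import random
import statistics


def sd_of_class_means(n_students, error_sd, n_classes=2000, seed=0):
    """Empirical SD of the class-average error across many simulated classes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_classes):
        errors = [rng.gauss(0.0, error_sd) for _ in range(n_students)]
        means.append(statistics.fmean(errors))
    return statistics.pstdev(means)


# The three class sizes named on the slide.
for n in (25, 50, 100):
    empirical = sd_of_class_means(n, error_sd=1.0)
    expected = 1.0 / n ** 0.5
    print(f"n={n:3d}  empirical SD={empirical:.3f}  1/sqrt(n)={expected:.3f}")
```

Halving the spread of a class average therefore takes about four times as many students, which is why n=25 classrooms are much noisier than n=100.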
Data Sets
TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading, 2009 Population Statistics
– Proficiency test
– Vertically aligned scale scores
– Average yearly gain
• 24 vertical scale points at “Met Expectations”
• 34 vertical scale points at “Commended”
– Standard Errors – Conditional Standard Errors reported by TEA for each vertical scale score
• CSE Range: 24 - 74
• Weighted average CSE = 38.96
– Highly skewed distribution
– High variance
Data Sets
TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading
N: 323,507
μ: 701.49
σ²: 10048.30
σ: 100.24
Data Sets
MAP – Measures of Academic Progress
– Growth measure
– Computer Adaptive Test
– Single scale
– Average yearly gain
• 5.06 RIT points
– Standard Errors – average standard errors range 2.5 - 3.5 RIT
– Slightly skewed distribution
– Small variance
Data Sets
MAP – Measures of Academic Progress
N: 2,663,382
μ: 208.35
σ²: 161.82
σ: 12.72
Simulated Data
As it is impossible to isolate true scores and error with real data, we created simulated data points.
– True scores are known for all data points
– Every data point was given the same growth
• All iterations have the same value-added
• Any deviation from expected is a function of measurement error only
Simulated Data
We simulated 10,000 z-scores ~ N (0,1)
From this we selected nested, random samples of n=100, n=50, n=25.
Statistical Summary, z-Score Samples by n

| Statistic | n=100 | n=50 | n=25 |
|---|---|---|---|
| Mean | -.13 | -.09 | .01 |
| Std. Deviation | .97 | .97 | 1.00 |
| Skewness | -.12 | .18 | .10 |
| Minimum | -2.34 | -1.85 | -1.77 |
| Maximum | 2.09 | 2.09 | 2.09 |
Data Generation
Pre-scores = P1 = (z-score • σ) + x̄
Post-scores = P2 = P1 + controlled growth
Controlled Growth Values:
TAKS = 24 vertical scale points (TAKS at “Commended” = 34)
MAP = 5.06 RIT points
Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))
Random1 and Random2 ~ N(0,1)
CSE = Conditional Standard Errors as reported by TEA and NWEA
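The data-generation steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `simulate_growth` is hypothetical, and it assumes a single constant CSE for every student, whereas the presentation used conditional standard errors reported by TEA and NWEA for each scale score. The example call uses the MAP mean, standard deviation, and controlled growth from the slides; the CSE of 3.0 is an assumed value inside the reported 2.5 - 3.5 RIT range.

```python
import random


def simulate_growth(z_scores, mean, sd, growth, cse, rng):
    """Return one observed (noisy) growth value per simulated student."""
    observed = []
    for z in z_scores:
        p1 = z * sd + mean   # true pre-score rescaled from the z-score
        p2 = p1 + growth     # true post-score: same controlled growth for everyone
        noisy_pre = p1 + rng.gauss(0.0, 1.0) * cse    # P1 + Random1 * CSE
        noisy_post = p2 + rng.gauss(0.0, 1.0) * cse   # P2 + Random2 * CSE
        observed.append(noisy_post - noisy_pre)
    return observed


rng = random.Random(42)  # arbitrary fixed seed for repeatability
z_scores = [rng.gauss(0.0, 1.0) for _ in range(100)]
observed = simulate_growth(
    z_scores, mean=208.35, sd=12.72, growth=5.06, cse=3.0, rng=rng
)
print(f"average simulated growth: {sum(observed) / len(observed):.2f}")
```

Because every student receives the same true growth, any gap between the printed average and 5.06 comes from the two error draws alone.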
Monte Carlo Simulation
We ran 1,000 iterations for each simulation, which is equivalent to the same students taking the test 1,000 times with the same true scores but different levels of error.
Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))
Random1 and Random2 ~ N(0,1)
CSE = Conditional Standard Errors as reported by TEA and NWEA
We aggregated values by subgroup to determine average performance for each iteration.
False Negative: Simulated Growth < 0.5 • Controlled Growth
False Positive: Simulated Growth > 1.5 • Controlled Growth
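The Monte Carlo procedure described above might look roughly like the Python sketch below. It is an illustration under stated assumptions rather than the authors' implementation: `monte_carlo_rates` is a hypothetical name, the seed is arbitrary, and a single constant CSE (here the TAKS weighted average of 38.96) stands in for the conditional standard errors used in the presentation, so the printed rates should not be expected to reproduce the reported tables exactly.

```python
import random


def monte_carlo_rates(growth, cse, n_students, iterations=1000, seed=0):
    """Percent of iterations classed as false negative, false positive, correct."""
    rng = random.Random(seed)
    false_neg = 0
    false_pos = 0
    for _iteration in range(iterations):
        total = 0.0
        for _student in range(n_students):
            # True pre-scores cancel in (post - pre), so only the two
            # independent error draws affect the simulated growth.
            random1 = rng.gauss(0.0, 1.0)
            random2 = rng.gauss(0.0, 1.0)
            total += growth + (random2 - random1) * cse
        class_growth = total / n_students  # average performance for the iteration
        if class_growth < 0.5 * growth:
            false_neg += 1
        elif class_growth > 1.5 * growth:
            false_pos += 1
    pct = 100.0 / iterations
    correct = iterations - false_neg - false_pos
    return false_neg * pct, false_pos * pct, correct * pct


for n in (100, 50, 25):
    fn, fp, ok = monte_carlo_rates(growth=24, cse=38.96, n_students=n)
    print(f"n={n:3d}  false neg={fn:.1f}%  false pos={fp:.1f}%  correct={ok:.1f}%")
```

Each iteration plays the role of one re-test of the same class, and the three percentages always sum to 100.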
Monte Carlo Results, n=100

| ID | % False Negative | % False Positive | % Total Correct |
|---|---|---|---|
| TAKS Actual Distribution | 1.7 | 2.5 | 95.8 |
| TAKS Normal Distribution at “Meets” Level | .9 | 1.8 | 97.3 |
| TAKS Normal Distribution Avg SE | 1.2 | 1.8 | 97.0 |
| TAKS Normal Distribution at “Commended” Level | .8 | .2 | 99.0 |
| TAKS Normal Grade Transition | 1.4 | 2.1 | 96.5 |
| MAP Normal | 0.0 | 0.0 | 100.0 |
| MAP Max CSE | 0.0 | 0.0 | 100.0 |
Monte Carlo Results, n=50

| ID | % False Negative | % False Positive | % Total Correct |
|---|---|---|---|
| TAKS Actual Distribution | 7.4 | 9.6 | 83.0 |
| TAKS Normal Distribution at “Meets” Level | 6.6 | 8.4 | 85.0 |
| TAKS Normal Distribution Avg SE | 5.7 | 7.4 | 86.9 |
| TAKS Normal Distribution at “Commended” Level | 4.4 | 1.7 | 93.9 |
| TAKS Normal Grade Transition | 6.5 | 8.1 | 85.4 |
| MAP Normal | 0.0 | 0.0 | 100.0 |
| MAP Max CSE | .7 | .6 | 98.7 |
Monte Carlo Results, n=25

| ID | % False Negative | % False Positive | % Total Correct |
|---|---|---|---|
| TAKS Actual Distribution | 16.1 | 18.4 | 65.5 |
| TAKS Normal Distribution at “Meets” Level | 16.8 | 18.0 | 65.2 |
| TAKS Normal Distribution Avg SE | 14.5 | 16.0 | 69.5 |
| TAKS Normal Distribution at “Commended” Level | 10.2 | 7.7 | 82.1 |
| TAKS Normal Grade Transition | 18.6 | 18.2 | 63.2 |
| MAP Normal | .5 | .5 | 99.0 |
| MAP Max CSE | 3.0 | 4.2 | 92.8 |
Results
Descriptive Statistics VAM, by Student Sample Size

| Test | Controlled Growth | Average Simulated Growth (n=100) | SD (n=100) | Average Simulated Growth (n=50) | SD (n=50) | Average Simulated Growth (n=25) | SD (n=25) |
|---|---|---|---|---|---|---|---|
| TAKS Actual Distribution | 24 | 24.29 | 6.02 | 24.26 | 8.78 | 24.18 | 12.28 |
| TAKS Normal Distribution at “Meets” | 24 | 24.08 | 5.45 | 24.45 | 8.37 | 24.14 | 12.39 |
| TAKS Normal Distribution Avg SE | 24 | 24.19 | 5.45 | 24.61 | 8.03 | 24.59 | 11.47 |
| TAKS Normal Distribution at “Commended” | 34 | 33.85 | 5.60 | 34.15 | 8.12 | 34.92 | 11.87 |
| TAKS Normal Grade Transition | 24 | 24.08 | 5.59 | 24.24 | 8.59 | 24.15 | 12.85 |
| MAP Normal | 5.06 | 5.07 | .49 | 5.12 | .72 | 5.12 | 1.03 |
| MAP Max CSE | 5.06 | 5.05 | .71 | 5.05 | .99 | 5.08 | 1.37 |

| Test | Percent misidentified at n=100 | Percent misidentified at n=50 | Percent misidentified at n=25 |
|---|---|---|---|
| TAKS Normal Distribution at “Meets” | 2.7 | 15.0 | 34.8 |
| MAP Normal | 0.0 | 0.0 | 1.0 |
Conclusions
The Growth/Error ratio is the critical variable in VAM stability.
Necessary student n to achieve a stable VAM is sensitive to the Growth/Error ratio.
Stable VAMs are possible even with typical classroom n’s; however, careful attention must be paid to the suitability of the student assessment instrument.
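The role of the Growth/Error ratio can be made concrete with a back-of-the-envelope calculation. If one assumes a single constant CSE for all students (a simplification; the presentation used conditional errors), the class-average simulated growth is normally distributed around the controlled growth with standard deviation √2 • CSE / √n, because each student's growth carries two independent error draws. The Python sketch below turns that into an approximate misidentification share using the 0.5 and 1.5 cut-offs from the Monte Carlo slide. The function name `misidentified_share` is hypothetical, and the MAP CSE of 3.5 is an assumed value at the top of the reported average range.

```python
from statistics import NormalDist


def misidentified_share(growth, cse, n_students):
    """Approximate share of classes falling outside 0.5x to 1.5x true growth."""
    # SD of the class-average growth: two independent errors per student.
    sd_mean = (2 ** 0.5) * cse / n_students ** 0.5
    # Probability of landing more than 0.5 * growth below the true value.
    tail = NormalDist().cdf(-0.5 * growth / sd_mean)
    return 2 * tail  # symmetric: false negatives plus false positives


for label, growth, cse in (("TAKS", 24, 38.96), ("MAP", 5.06, 3.5)):
    for n in (100, 50, 25):
        share = misidentified_share(growth, cse, n)
        print(f"{label:4s} n={n:3d}  growth/CSE={growth / cse:.2f}  "
              f"misidentified={100 * share:.1f}%")
```

Under this approximation the share depends only on (growth / CSE) • √n, so a test with a larger Growth/Error ratio reaches a given level of stability with far fewer students.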
Limitations
No Differentiation between Student Effects, Teacher Effects, or School Effects
No Environmental Effects
No Interaction Terms
These are all areas for additional research.