james l. woodworth, credo hoover institute, stanford wen-juo lo, university of arkansas

The Impact of Selection of Student Achievement Measurement Instrument on Teacher Value-added Measures

James L. Woodworth, CREDO Hoover Institute, Stanford

Wen-Juo Lo, University of Arkansas

Joshua B. McGee, Laura and John Arnold Foundation

Nathan C. Jensen, Northwest Evaluation Association

Presentation Outline

1. Purpose

2. Statistical Noise

a. Why it matters

b. Sources

3. Data

4. Methods

5. Results

Purpose

The purpose of this paper is to present to a statistics lay population the extent to which psychometric properties of student test instruments impact teacher value-added measures.


Question

What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?


Why it matters?

[Figure: 5th and 6th grade scores across the proficiency categories Below Basic, Basic, Proficient, and Advanced]


Primary Sources of Statistical Noise

1. Test Design

2. Vertical Alignment

3. Student Sample Size


Test Design

Proficiency Tests

• Focused around proficiency point

• Designed to differentiate between proficient and not proficient

• Larger variance in Conditional Standard Errors (CSE)

Growth Tests

• Questions measure across entire ability spectrum

• Designed to differentiate between all points on the distribution

• Smaller variance in CSE


Test Design

Paper and Pencil Tests

• Limit item pool to control length

• Focused around proficiency point

• Large variance in CSE

Computer Adaptive Test

• Larger item pool for question selection

• Focused around student ability point

• Smaller variance in CSE


Test Design

CSE Heteroskedasticity Due to Item Focusing: TAKS Reading Grade 5, 2009

CSE Range: 24 - 74

Weighted average CSE = 38.96


Vertical Alignment

• Year-to-year alignment can impact the results of VAM

– Units must be equal across test sessions

• Spring-Spring VAMs are most affected

• Fall-Spring VAMs using the same test avoid much of the problem

• Item alignment on computer adaptive tests can impact the results of VAM


Student Sample Size

• Central Limit Theorem

– Larger student n provides a more stable estimate of teacher VAM.

– Typical single-year student n’s are 25, 50, and 100 for elementary and middle school teachers.
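The sample-size point above can be illustrated with a short simulation. This is only a sketch: it assumes every student's measurement error is independent and normal with a single standard deviation equal to the weighted average TAKS CSE (38.96) quoted elsewhere in these slides. The function name `sd_of_class_mean` and the number of simulated classrooms are hypothetical choices for illustration.

```python
import numpy as np

def sd_of_class_mean(n, error_sd=38.96, n_trials=20_000, seed=0):
    """Empirical SD of the class-average measurement error for class size n.

    Assumes independent N(0, error_sd) errors per student (a simplification).
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated classroom of n students.
    errors = rng.normal(0.0, error_sd, size=(n_trials, n))
    return errors.mean(axis=1).std(ddof=1)

for n in (25, 50, 100):
    empirical = sd_of_class_mean(n)
    theoretical = 38.96 / np.sqrt(n)  # sigma / sqrt(n)
    print(f"n={n:3d}  empirical SD={empirical:5.2f}  sigma/sqrt(n)={theoretical:5.2f}")
```

Under these assumptions the spread of the class average shrinks with the square root of n, so quadrupling the class size from 25 to 100 halves the noise in the average.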





Data Sets

TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading, 2009 Population Statistics

– Proficiency test

– Vertically aligned scale scores

– Average yearly gain

• 24 vertical scale points at “Met Expectations”

• 34 vertical scale points at “Commended”

– Standard Errors: Conditional Standard Errors reported by TEA for each vertical scale score

• CSE Range: 24 - 74

• Weighted average CSE = 38.96

– Highly skewed distribution

– High variance


Data Sets

TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading

N: 323,507

μ: 701.49

σ²: 10048.30

σ: 100.24


Data Sets

MAP – Measures of Academic Progress

– Growth measure

– Computer Adaptive Test

– Single scale

– Average yearly gain

• 5.06 RIT points

– Standard Errors: average standard errors range 2.5 - 3.5 RIT

– Slightly skewed distribution

– Small variance


Data Sets

MAP – Measures of Academic Progress

N: 2,663,382

μ: 208.35

σ²: 161.82

σ: 12.72


Simulated Data

As it is impossible to isolate true scores and error with real data, we created simulated data points.

– True scores are known for all data points

– Every data point was given the same growth

• All iterations have the same value-added

• Any deviation from expected is a function of measurement error only


Simulated Data

We simulated 10,000 z-scores ~ N(0,1).

From these we selected nested random samples of n=100, n=50, and n=25.

Statistical Summary, z-Score Samples by n

Statistic          Values
N                   100      50      25
Mean               -.13    -.09     .01
Std. Deviation      .97     .97    1.00
Skewness           -.12     .18     .10
Minimum           -2.34   -1.85   -1.77
Maximum            2.09    2.09    2.09
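One way to draw nested samples like those summarized above is to shuffle the pool once and take prefixes of decreasing length, so that every smaller sample is contained in the larger one. The sketch below assumes that approach; the slides do not say how the nesting was done, and the seed and variable names are hypothetical. Because the seed differs from the authors', the statistics will not match the table.

```python
import numpy as np

rng = np.random.default_rng(42)          # hypothetical seed
z_pool = rng.standard_normal(10_000)     # 10,000 z-scores ~ N(0,1)

# Shuffle once, then take prefixes: the n=25 sample sits inside the
# n=50 sample, which sits inside the n=100 sample.
shuffled = rng.permutation(z_pool)
samples = {n: shuffled[:n] for n in (100, 50, 25)}

def skewness(x):
    """Population skewness: third central moment over SD cubed."""
    centered = x - x.mean()
    return (centered ** 3).mean() / x.std() ** 3

for n, z in samples.items():
    print(f"n={n:3d}  mean={z.mean():6.2f}  sd={z.std(ddof=1):5.2f}  "
          f"skew={skewness(z):6.2f}  min={z.min():6.2f}  max={z.max():5.2f}")
```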


Data Generation

Pre-scores = P1 = (z-score • σ) + x̄

Post-scores = P2 = P1 + controlled growth

Controlled Growth Values:

TAKS = 24 vertical scale points (TAKS at “Commended” = 34)

MAP = 5.06 RIT points

Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))

Random1 and Random2 ~ N(0,1)

CSE = Conditional Standard Errors as reported by TEA and NWEA
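The data-generation formulas on this slide can be sketched in a few lines. The example uses the TAKS mean, standard deviation, and controlled growth reported in the slides. As a loudly stated simplification, it applies one constant CSE (the weighted average, 38.96) to every student rather than the score-specific conditional standard errors from TEA, which are not reproduced in the slides. Variable names and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)           # hypothetical seed

# Values from the slides (TAKS Grade 5 Reading, 2009).
MEAN, SD = 701.49, 100.24
GROWTH = 24.0                            # controlled growth, vertical scale points
CSE = 38.96                              # ASSUMPTION: one constant CSE for all scores

z = rng.standard_normal(100)             # one simulated class of n=100
p1 = z * SD + MEAN                       # pre-scores (true)
p2 = p1 + GROWTH                         # post-scores (true), same growth for everyone

# Independent measurement error on each test administration.
random1 = rng.standard_normal(z.size)
random2 = rng.standard_normal(z.size)
simulated_growth = (p2 + random2 * CSE) - (p1 + random1 * CSE)

print(f"true growth      = {GROWTH:.2f}")
print(f"mean sim. growth = {simulated_growth.mean():.2f}")
print(f"SD of sim. growth across students = {simulated_growth.std(ddof=1):.2f}")
```

Because every student receives the same true growth, any gap between the mean simulated growth and 24 in this sketch comes from the two error terms alone.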





Monte Carlo Simulation

We ran 1,000 iterations for each simulation, which is equivalent to the same students taking the test 1,000 times with the same true scores but different draws of measurement error.

Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))

Random1 and Random2 ~ N(0,1)

CSE = Conditional Standard Errors as reported by TEA and NWEA

Values were aggregated by subgroup to determine average performance for each iteration.

False Negative: Simulated Growth < .5 • Controlled Growth

False Positive: Simulated Growth > 1.5 • Controlled Growth
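The Monte Carlo procedure described above can be sketched as follows. This is not the authors' code: it assumes a single constant CSE per test, in which case the true scores cancel out of P2 - P1 and only the controlled growth and the two error draws remain. The function name `monte_carlo`, the seed, and the MAP CSE of 3.0 (a midpoint of the 2.5 - 3.5 range in the slides) are illustrative assumptions.

```python
import numpy as np

def monte_carlo(n, growth, cse, iterations=1_000, seed=0):
    """Return (% false negative, % false positive, % correct) for class size n.

    ASSUMPTION: constant CSE, so simulated growth = growth + (r2 - r1) * cse.
    """
    rng = np.random.default_rng(seed)
    r1 = rng.standard_normal((iterations, n))   # error on the pre-test
    r2 = rng.standard_normal((iterations, n))   # error on the post-test
    sim_growth = growth + (r2 - r1) * cse
    class_avg = sim_growth.mean(axis=1)         # aggregate per iteration
    false_neg = np.mean(class_avg < 0.5 * growth) * 100
    false_pos = np.mean(class_avg > 1.5 * growth) * 100
    return false_neg, false_pos, 100 - false_neg - false_pos

for label, growth, cse in (("TAKS, avg CSE", 24.0, 38.96),
                           ("MAP, CSE 3.0", 5.06, 3.0)):
    for n in (100, 50, 25):
        fn, fp, ok = monte_carlo(n, growth, cse)
        print(f"{label:14s} n={n:3d}  FN={fn:5.1f}%  FP={fp:5.1f}%  correct={ok:5.1f}%")
```

Exact percentages depend on the seed and on the constant-CSE simplification, so they should not be expected to reproduce the slide tables.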


Monte Carlo Results n=100

ID                                               % False Negative   % False Positive   % Total Correct
TAKS Actual Distribution                                1.7                2.5               95.8
TAKS Normal Distribution at “Meets” Level                .9                1.8               97.3
TAKS Normal Distribution Avg SE                         1.2                1.8               97.0
TAKS Normal Distribution at “Commended” Level            .8                 .2               99.0
TAKS Normal Grade Transition                            1.4                2.1               96.5
MAP Normal                                              0.0                0.0              100.0
MAP Max CSE                                             0.0                0.0              100.0

Monte Carlo Results n=50

ID                                               % False Negative   % False Positive   % Total Correct
TAKS Actual Distribution                                7.4                9.6               83.0
TAKS Normal Distribution at “Meets” Level               6.6                8.4               85.0
TAKS Normal Distribution Avg SE                         5.7                7.4               86.9
TAKS Normal Distribution at “Commended” Level           4.4                1.7               93.9
TAKS Normal Grade Transition                            6.5                8.1               85.4
MAP Normal                                              0.0                0.0              100.0
MAP Max CSE                                              .7                 .6               98.7



Monte Carlo Results n=25

ID                                               % False Negative   % False Positive   % Total Correct
TAKS Actual Distribution                               16.1               18.4               65.5
TAKS Normal Distribution at “Meets” Level              16.8               18.0               65.2
TAKS Normal Distribution Avg SE                        14.5               16.0               69.5
TAKS Normal Distribution at “Commended” Level          10.2                7.7               82.1
TAKS Normal Grade Transition                           18.6               18.2               63.2
MAP Normal                                               .5                 .5               99.0
MAP Max CSE                                             3.0                4.2               92.8


Results

Descriptive Statistics VAM, by Student Sample Size

(Avg = Average Simulated Growth)

                                            Controlled      n=100           n=50            n=25
Test                                        Growth        Avg     SD      Avg     SD      Avg     SD
TAKS Actual Distribution                    24           24.29   6.02    24.26   8.78    24.18  12.28
TAKS Normal Distribution at “Meets”         24           24.08   5.45    24.45   8.37    24.14  12.39
TAKS Normal Distribution Avg SE             24           24.19   5.45    24.61   8.03    24.59  11.47
TAKS Normal Distribution at “Commended”     34           33.85   5.60    34.15   8.12    34.92  11.87
TAKS Normal Grade Transition                24           24.08   5.59    24.24   8.59    24.15  12.85
MAP Normal                                  5.06          5.07    .49     5.12    .72     5.12   1.03
MAP Max CSE                                 5.06          5.05    .71     5.05    .99     5.08   1.37

Percent misidentified, by student sample size

Test                                        n=100    n=50    n=25
TAKS Normal Distribution at “Meets”          2.7     15.0    34.8
MAP Normal                                   0.0      0.0     1.0


Conclusions

The Growth/Error ratio is the critical variable in VAM stability.

Necessary student n to achieve a stable VAM is sensitive to the Growth/Error ratio.

Stable VAMs are possible even with typical classroom n’s; however, careful attention must be paid to the suitability of the student assessment instrument.
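The role of the Growth/Error ratio in these conclusions can be approximated analytically. The sketch below rests on assumptions that go beyond the slides: errors are independent and normal, and a single constant CSE applies to every student on both tests. Under those assumptions the class-average simulated growth has standard deviation sqrt(2) • CSE / sqrt(n), and the false-negative and false-positive cutoffs (.5 and 1.5 times controlled growth) are symmetric about the true growth. The function name and the MAP CSE of 3.0 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def misidentified_pct(growth, cse, n):
    """Approximate % of iterations flagged false negative or false positive.

    ASSUMPTIONS: independent normal errors, one constant CSE on both tests.
    """
    sd_mean = sqrt(2) * cse / sqrt(n)      # SD of the class-average growth
    z = 0.5 * growth / sd_mean             # cutoff distance in SD units
    return 2 * NormalDist().cdf(-z) * 100  # both tails

for label, growth, cse in (("TAKS, avg CSE", 24.0, 38.96),
                           ("MAP, CSE 3.0", 5.06, 3.0)):
    ratio = growth / cse
    for n in (100, 50, 25):
        pct = misidentified_pct(growth, cse, n)
        print(f"{label:14s} growth/CSE={ratio:4.2f}  n={n:3d}  misidentified={pct:5.1f}%")
```

Under this approximation the misidentification rate depends only on the product of the Growth/Error ratio and sqrt(n), which is one way to read the statement that the necessary student n is sensitive to that ratio.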

Limitations

No Differentiation between Student Effects, Teacher Effects, or School Effects

No Environmental Effects

No Interaction Terms

These are all areas for additional research.
