
Page 1:

Maximum Information per Unit Time in Adaptive Testing

Ying (“Alison”) Cheng¹, John Behrens², Qi Diao³

¹ Lab of Educational and Psychological Measurement, Department of Psychology, University of Notre Dame
² Center for Digital Data, Analytics, & Adaptive Learning, Pearson
³ CTB/McGraw-Hill

Page 2:

Test Efficiency

Weiss (1982): CAT can achieve the same measurement precision as a linear test with half as many items when the maximum information (MI) method is used for item selection.

Maximum information method (Lord, 1980): choose the item that yields the largest amount of information at the most recent ability estimate, i.e., maximum information per item.

Page 3:

Test Efficiency

All tests are timed.

Maximum information given a time limit: choose the item that yields the largest ratio of information to the time required, i.e., maximum information per unit time (MIPUT; Fan, Wang, Chang, & Douglas, 2013).

Page 4:

MI vs. MIPUT

MI: select item

$$j_{t+1} = \arg\max_{l \in R_t} I_l(\hat{\theta}_t),$$

where $R_t$ is the eligible set of items after $t$ items have been administered, and $I_l(\hat{\theta}_t)$ is the information of item $l$ evaluated at the current ability estimate $\hat{\theta}_t$.

MIPUT: select item

$$j_{t+1} = \arg\max_{l \in R_t} \frac{I_l(\hat{\theta}_t)}{E(T_l \mid \hat{\tau})},$$

where the denominator is the expected time required to finish item $l$ given the working speed of the examinee, $\hat{\tau}$.
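To make the two selection rules concrete, here is a minimal sketch in Python/NumPy. It assumes 2PL item information and the log-normal response-time model described on the next slide; all function names (info_2pl, expected_time, select_mi, select_miput) are illustrative, not from the original paper.

```python
import numpy as np

def info_2pl(a, b, theta):
    """2PL item information at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def expected_time(beta, alpha, tau):
    """Expected response time under the log-normal model:
    log T ~ N(beta - tau, 1/alpha^2), hence E[T] = exp(beta - tau + 1/(2*alpha^2))."""
    return np.exp(beta - tau + 1.0 / (2.0 * alpha**2))

def select_mi(a, b, theta_hat, eligible):
    """MI: the eligible item with the largest information at the current estimate."""
    return eligible[np.argmax(info_2pl(a[eligible], b[eligible], theta_hat))]

def select_miput(a, b, beta, alpha, theta_hat, tau_hat, eligible):
    """MIPUT: the eligible item with the largest information-to-expected-time ratio."""
    info = info_2pl(a[eligible], b[eligible], theta_hat)
    e_time = expected_time(beta[eligible], alpha[eligible], tau_hat)
    return eligible[np.argmax(info / e_time)]
```

Here `eligible` is an integer index array over the remaining pool, `a` and `b` are the 2PL item parameters, and `beta` and `alpha` are the time-intensity and time-precision parameters of the response-time model.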

Page 5:

Implementation of MIPUT

Log-normal model for response time (van der Linden, 2006):

$$\ln T_{lj} = \beta_l - \tau_j + \varepsilon_{lj}, \qquad \varepsilon_{lj} \sim N\!\left(0, \alpha_l^{-2}\right),$$

so $\alpha_l$, $\beta_l$ (time intensity), and $\tau_j$ (working speed) can be estimated from response-time data. The expected time for item $l$, used in the MIPUT denominator, is $E(T_{lj}) = \exp\!\left(\beta_l - \tau_j + \tfrac{1}{2\alpha_l^{2}}\right)$.
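As a rough illustration of the last point, here is a moment-based sketch of per-item estimates from observed response times, under the simplifying assumption that examinee speed tau averages zero; the estimator names are illustrative, and the actual calibration would estimate tau jointly with the item parameters.

```python
import numpy as np

def estimate_item_time_params(times):
    """Crude per-item estimates under the log-normal RT model, assuming the
    examinees' speed parameters tau average to zero:
      beta_hat  ~ mean of log response times   (time intensity)
      alpha_hat ~ 1 / sd of log response times (time precision)"""
    log_t = np.log(np.asarray(times, dtype=float))
    beta_hat = log_t.mean()
    alpha_hat = 1.0 / log_t.std(ddof=1)
    return beta_hat, alpha_hat
```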

Page 6:

Performance of MIPUT

Fan et al. (2013) showed that, compared to the MI method, MIPUT leads to: (i) shorter testing time; (ii) a small loss of measurement precision; and (iii) visibly worse item pool usage.

Fan et al. (2013) paired a-stratification (Chang & Ying, 1999) with the MIPUT method to balance item pool usage and found it effective.

Page 7:

a-stratification

Item information under the 2PL: $I_l(\theta) = a_l^2 P_l(\theta)\,[1 - P_l(\theta)]$. Items with a high discrimination parameter are over-used under MI.

a-stratification restricts item selection to low-a items early in the test and to high-a items later.

Apparently, high-a items are still over-used under MIPUT; that is why a-stratification also helps balance item usage under MIPUT.
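For contrast with MI and MIPUT, here is a minimal sketch of the a-stratification idea (Chang & Ying, 1999), assuming the pool has been pre-partitioned into strata by ascending a and that, within the active stratum, items are matched on difficulty rather than selected by maximum information; the function and argument names are illustrative.

```python
import numpy as np

def select_a_stratified(b, theta_hat, eligible, item_stratum, stage):
    """a-stratification sketch: at test stage `stage`, restrict selection to items in
    that stratum (low-a strata are used early, high-a strata late) and pick the item
    whose difficulty b is closest to the current ability estimate."""
    in_stage = eligible[item_stratum[eligible] == stage]
    if in_stage.size == 0:            # fall back if the current stratum is exhausted
        in_stage = eligible
    return in_stage[np.argmin(np.abs(b[in_stage] - theta_hat))]
```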

Page 8:

Questions that Remain

Fan et al. (2013) simulated items in which: item difficulty and time intensity are either correlated or uncorrelated; item discrimination and difficulty are uncorrelated; and item discrimination and time intensity are uncorrelated.

In reality, item discrimination and difficulty are positively correlated (roughly .4 to .6; Chang, Qian, & Ying, 2001).

Q1: What about item discrimination and time intensity?

Page 9:

Follow-Up Questions

Q2: If item discrimination and time intensity are indeed related:
Will MIPUT still lead to worse item pool usage than MI?
If so, is that still due to highly discriminating items or to highly time-saving items?

Q3: Under the 1PL model, where the item discrimination parameter is not a factor:
Will MIPUT still lead to worse item pool usage than MI?
If so, is that due to highly time-saving items?
If so, how can we control item exposure?

Page 10:

Q1: Item Discrimination and Time Intensity

Calibration of a large item bank:
- Online math testing data
- 595 items
- Over 2 million entries of testing data
- 3PL and 2PL models; the following analysis focuses on the 2PL
- Time intensity measured by the log-transformed average time on each item

Page 11:

Q1: Item Discrimination and Time Intensity

Correlations among 2PL/3PL item parameter estimates and time intensity (values as reported):

                  2PL_a     2PL_b     3PL_a     3PL_b     3PL_c     Time Intensity
2PL_a             1         .111**    .702**    .009      -.584**   .139
2PL_b             .111**    1         .387**    .935**    -.369**   .562
3PL_a             .387**    .702**    1         .350**    -.363**   .205
3PL_b             .935**    .009      .350**    1         -.226**   .564
3PL_c             -.369**   -.584**   -.363**   -.226**   1         -.0425
Time Intensity    .522**    .080      .153**    .522**    -.355**   1

Page 12:

Q2

So item discrimination and time intensity are indeed related. Then:
Will MIPUT still lead to worse item pool usage than MI?
If so, is that still due to highly discriminating items or to highly time-saving items?

Page 13:

A Simplified Version of MIPUT

Simplified MIPUT: select item

$$j_{t+1} = \arg\max_{l \in R_t} \frac{I_l(\hat{\theta}_t)}{\bar{T}_l},$$

where the denominator $\bar{T}_l$ is the average time required to finish item $l$; it is not individualized to the examinee. This simplified version may be more robust against violations of the response-time model assumptions.

Page 14:

Simulation Details

CAT simulation:
- Test length: 20 or 40
- First item randomly chosen from the pool
- 5,000 test takers, with true abilities drawn from N(0, 1)
- Ability update: EAP with a N(0, 1) prior
- No exposure control or content balancing unless otherwise specified
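A minimal sketch of one simulated examinee under these settings, reusing the hypothetical select_miput helper from the earlier sketch; the grid-based EAP update and all other names are illustrative, not the authors' code.

```python
import numpy as np

def eap_update(theta_grid, prior, a, b, responses, administered):
    """EAP ability estimate on a grid, with an (unnormalized) N(0,1) prior."""
    like = np.ones_like(theta_grid)
    for item, u in zip(administered, responses):
        p = 1.0 / (1.0 + np.exp(-a[item] * (theta_grid - b[item])))
        like *= p**u * (1.0 - p)**(1 - u)
    post = like * prior
    return np.sum(theta_grid * post) / np.sum(post)

def simulate_cat(a, b, beta, alpha, theta_true, test_length=20, rng=None):
    """One examinee's CAT run with MIPUT item selection (see select_miput above)."""
    rng = rng or np.random.default_rng()
    grid = np.linspace(-4.0, 4.0, 81)
    prior = np.exp(-grid**2 / 2.0)                 # N(0,1) prior up to a constant
    eligible = np.arange(len(a))
    administered, responses = [], []
    theta_hat, tau_hat = 0.0, 0.0                  # provisional estimates
    next_item = rng.choice(eligible)               # first item chosen at random
    for step in range(test_length):
        p = 1.0 / (1.0 + np.exp(-a[next_item] * (theta_true - b[next_item])))
        responses.append(int(rng.random() < p))    # simulate the item response
        administered.append(next_item)
        eligible = eligible[eligible != next_item]
        theta_hat = eap_update(grid, prior, a, b, responses, administered)
        if step < test_length - 1:                 # select the next item by MIPUT
            next_item = select_miput(a, b, beta, alpha, theta_hat, tau_hat, eligible)
    return theta_hat, administered
```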

Page 15:

Q2: 2PL results

                           20-Item                    40-Item
                       MI_2PL    MIPUT_2PL        MI_2PL    MIPUT_2PL
Bias                     .002       .003            .001       .002
MSE                      .019       .020            .012       .012
Correlation              .991       .990            .994       .994
Chi-square              95.29      96.41           81.00      84.10
No exposure             73.1%      71.9%           50.9%      52.1%
Underexposed (<.02)     77.8%      78.2%           57.0%      58.2%
Overexposed (>.20)       4.37%      5.04%          11.6%      11.8%
Average time used (mins) 38.596     34.434          79.346     70.112
Min testing time (mins)  17.857     16.374          37.039     34.383
Max testing time (mins)  84.035     79.969          79.346    146.773

Page 16:

Findings

MIPUT leads to shorter tests than MI (on average by about 4 minutes, or 10%, when the test length is 20, and by about 9 minutes, or 11%, when the test length is 40).

MIPUT leads to slightly worse exposure control. When item discrimination and time intensity are positively related, the disadvantage of MIPUT in exposure control becomes less conspicuous.

MI and MIPUT show a negligible difference in measurement precision.

Over-exposure is still largely attributable to highly discriminating items.

Page 17:

Q3

Q3: Under the 1PL model, where the item discrimination parameter is not a factor:
Will MIPUT still lead to worse item pool usage than MI?
If so, is that due to highly time-saving items?
If so, how can we control item exposure?

Page 18:

Test Length = 20 (1PL)

                          MI_1PL    MIPUT_1PL    MIPUTR5_1PL    MIPUTPR_1PL
Bias                       -.001       .001         .007          -.004
MSE                         .074       .078         .081           .075
Correlation                 .963       .962         .960           .963
Chi-square                 23.31     152.67        25.59          90.21
No exposure                26.1%      78.3%        36.1%          43.2%
Underexposed (<.02)        59.2%      82.5%        56.0%          75.8%
Overexposed (>.20)          0          5.38%        0              4.03%
Average time used (mins)   38.909     17.594       26.643         20.023
Min testing time (mins)    17.909     11.547       16.079         12.388
Max testing time (mins)    94.407     53.557       73.127         63.044

Page 19:

Findings if Test Length = 20

MI vs. MIPUT:
- Negligible difference in measurement precision
- MIPUT reduces testing time by 21 minutes for a 20-item test (a 55% reduction)
- But MIPUT leads to much worse exposure control: items that are highly time-saving are favored
  - Correlation between exposure rate and time intensity under MI-1PL: -.240 (an artifact of the item bank)
  - Correlation between exposure rate and time intensity under MIPUT-1PL: -.398

Page 20:

Exposure Control

a-stratification is not going to work under the 1PL (there is no discrimination parameter to stratify on). Two alternatives, sketched in code below:

Randomesque (Kingsbury & Zara, 1989):
- Randomly choose one out of the n best items, e.g., n = 5 (MIPUT-R5)

Progressive Restricted (Revuelta & Ponsoda, 1998):
- A weighted index, with the weight determined by the stage of the test, combining a random number and the time-adjusted item information
- Higher weight is given to the time-adjusted item information later in the test
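A minimal sketch of the two exposure-control ideas applied to a generic selection criterion (e.g., information per unit time). The randomesque rule follows Kingsbury & Zara (1989) directly; the progressive rule only illustrates the weighting idea of Revuelta & Ponsoda (1998) and omits the "restricted" maximum-exposure component. All names are illustrative.

```python
import numpy as np

def select_randomesque(criterion, eligible, n=5, rng=None):
    """Randomesque: randomly pick one of the n eligible items with the best
    selection criterion (larger is better), e.g., MIPUT's information-per-unit-time
    ratio; n = 5 gives MIPUT-R5."""
    rng = rng or np.random.default_rng()
    top = eligible[np.argsort(criterion[eligible])[-n:]]
    return rng.choice(top)

def select_progressive(criterion, eligible, step, test_length, rng=None):
    """Progressive-style selection: a weighted sum of a random component and the
    (time-adjusted) information, with the information weight growing over the test.
    The exact weights and the exposure cap of Revuelta & Ponsoda (1998) are omitted."""
    rng = rng or np.random.default_rng()
    w = step / max(test_length - 1, 1)              # 0 at the start, 1 at the end
    rand = rng.random(eligible.size) * criterion[eligible].max()
    score = (1.0 - w) * rand + w * criterion[eligible]
    return eligible[np.argmax(score)]
```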

Page 21:

Test Length = 20 (1PL): the same table as on Page 18, shown again so the MIPUT_R5 and MIPUT_PR columns can be compared now that the exposure-control methods have been introduced.

Page 22:

Findings if Test Length = 20

MIPUT_R5:
- Maintains measurement precision
- Much better exposure control
- Reduces testing time on average by 12 minutes (more than a 30% reduction)

MIPUT_PR:
- Maintains measurement precision
- Better exposure control, but still not quite as good
- Reduces testing time on average by 18 minutes (a reduction of almost half)

Page 23:

Test Length = 40 (1PL)

                          MI_1PL    MIPUT_1PL    MIPUTR5_1PL    MIPUTPR_1PL
Bias                       -.003      -.001         .003           .004
MSE                         .038       .040         .045           .040
Correlation                 .981       .981         .978           .981
Chi-square                 30.75     135.37        17.73          99.31
No exposure                 3.4%      60.7%         7.7%          15.6%
Underexposed (<.02)        25.7%      66.9%        18.3%          59.2%
Overexposed (>.20)          3.9%      13.4%         0             12.3%
Average time used (mins)   77.660     41.192       67.903         45.223
Min testing time (mins)    39.451     27.300       48.718         29.181
Max testing time (mins)   162.889    132.986      136.585        137.489

Page 24:

Findings if Test Length = 40

The same findings are replicated when the test length doubles: MIPUT leads to much worse item pool usage because of its overreliance on time-saving items.

MIPUT_R5:
- Maintains measurement precision
- Much better exposure control
- Reduces testing time on average by 13%

MIPUT_PR:
- Maintains measurement precision
- Better exposure control, but still not quite as good
- Reduces testing time on average by 41%

Page 25:

Overall Summary

MIPUT's time-saving advantage is more conspicuous under the 1PL.

MIPUT leads to much worse item pool usage than MI and relies heavily on time-saving items.

MIPUT_R5 is a promising method: it maintains measurement precision, balances item pool usage, and still keeps the time-saving advantage.

Page 26:

Future Directions

Develop an exposure control method under MIPUT parallel to a-stratification: stratifying by time

Investigate the performance of the simplified MIPUT and the original MIPUT when the assumptions of the log-normal response-time model are violated

More data analysis to explore the relationship between time intensity and item parameters

Control total testing time (van der Linden & Xiong, 2013)

Page 27:

Thank You!

Acknowledgment: CTB/McGraw-Hill 2014 R&D Grant. For questions or the paper, please visit irtnd.wikispaces.com.