TRANSCRIPT
Empowered by Psychometrics Inside the Black Box of
Computerized Adaptive Testing
Jim Wollack and Sonya Sedivy
University of Wisconsin – Madison
#nctaconf15
Purpose of Session
• Introduce several key psychometric concepts
and gain an appreciation for the theoretical
underpinnings of computerized adaptive
testing.
• Conceptual Overview of Computerized Adaptive
Testing
• Introduction of Item Response Theory
• Score Estimation in CAT
• Item Selection in CAT
• Practical Issues in CAT
Poll Question
Consider the following topics related to computerized
adaptive testing.
I. How items are chosen for administration
II. How a test score is determined
III. Test security issues specific to CAT
Please indicate which of these topics you currently would
be comfortable discussing with an examinee/parent.
SECTION I
• Conceptual Overview of Computerized
Adaptive Testing
Classical Test Theory
• X = T + E
• Person characteristics
• Total test score serves as a proxy for examinee’s
level on the construct
• Item characteristics
• Item difficulty is estimated as the proportion of
examinees who answer an item correctly
• Item discrimination estimated as correlation
between item score (1/0) and total score
Sample Data
     I1  I2  I3  I4  I5  I6  I7  I8  I9 I10 I11 I12 I13 I14 I15    X
S1    0   0   0   0   0   0   0   0   0   0   1   1   0   0   0    2
S2    0   0   1   0   1   0   1   0   0   0   1   1   0   0   0    5
S3    1   1   1   1   1   0   0   1   0   0   0   1   0   1   0    8
S4    1   0   1   1   1   0   1   0   0   0   1   1   1   1   0    9
S5    1   0   1   0   1   0   1   1   1   1   1   1   1   1   0   11
S6    1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   13
⋮
p   .46 .21 .72 .56 .89 .26 .51 .44 .53 .31 .77 .96 .29 .66 .37
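The classical statistics just described can be computed directly. The sketch below (assuming NumPy is available) reproduces the item-difficulty and item-discrimination calculations on the six sample rows shown above; because it uses only these six examinees rather than the full sample, its p values will differ from the slide's bottom row.

```python
import numpy as np

# Rows = examinees S1..S6, columns = items I1..I15 (1 = correct, 0 = incorrect)
responses = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
])

total = responses.sum(axis=1)   # X: total score, the CTT proxy for trait level
p = responses.mean(axis=0)      # item difficulty: proportion answering correctly

def discrimination(item_col, totals):
    # Item discrimination: correlation between item score (1/0) and total score.
    # Undefined for items that everyone answered the same way (zero variance).
    if item_col.std() == 0:
        return float("nan")
    return np.corrcoef(item_col, totals)[0, 1]

disc = np.array([discrimination(responses[:, j], total)
                 for j in range(responses.shape[1])])
```

Note that item I6, which nobody in this subsample answered correctly, has no defined discrimination, a small illustration of the sample-dependency problem raised later in the session.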
Computerized Adaptive Testing (CAT)
• Each examinee receives a customized exam
that is built to maximize the precision of their
score.
• When an examinee answers a question
incorrectly, next item will be easier
• When an examinee answers a question correctly,
next item will be harder.
• Difficulty of assessments varies so that each
examinee gets an exam for which they’ll answer
approximately 50% of the items correct.
Reliability versus Precision
• Reliability is a measure of the amount of error
in a group of test scores.
• Standard error of measurement (SEM) is a function
of reliability, and represents an average amount
that an individual’s score might vary upon retest.
• SEM is the same value for all examinees.
• Precision is a measure of the amount of error in
an individual’s test score.
• Varies based on examinee’s trait level.
• Measure of the quality of a specific test score.
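As a quick illustration of the SEM idea (the formula is standard classical test theory, not shown on the slides): the SEM follows from the score standard deviation and the reliability coefficient, and is one value for the whole group.

```python
import math

def sem(sd_x, reliability):
    # Classical test theory: SEM = SD of scores * sqrt(1 - reliability)
    return sd_x * math.sqrt(1 - reliability)

# A test with score SD = 10 and reliability .91 has an SEM of about 3 points,
# applied identically to every examinee regardless of trait level.
print(round(sem(10, 0.91), 2))   # 3.0
```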
Reliability versus Precision
• For a population of examinees, the test will
not be equally precise for all.
• Precision is maximized when examinees take
items that are challenging, but doable.
Adaptive Algorithm
1. Candidate gets a small set of random items of varying difficulty.
2. Estimate trait level (score).
3. Estimate precision of score.
4. Is score sufficiently precise? If YES, end test.
5. If NO: has maximum test length been reached? If YES, end test.
6. If NO: deliver a new item that will maximize precision, then return to step 2.
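The flowchart above can be sketched as a loop. Everything below is a hypothetical stand-in, not the presenters' implementation: it assumes a Rasch (1PL) item model, a grid-search maximum-likelihood score, and the score's standard error as the precision criterion.

```python
import math
import random

def p_rasch(theta, b):
    # Rasch (1PL) probability of a correct response to an item of difficulty b
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate(responses):
    # Grid-search ML trait estimate; SE = 1 / sqrt(test information)
    grid = [g / 10.0 for g in range(-40, 41)]
    def loglik(theta):
        return sum(math.log(p_rasch(theta, b)) if u
                   else math.log(1 - p_rasch(theta, b))
                   for b, u in responses)
    theta = max(grid, key=loglik)
    info = sum(p_rasch(theta, b) * (1 - p_rasch(theta, b)) for b, _ in responses)
    return theta, 1.0 / math.sqrt(info)

def run_cat(pool, answer, burn_in=3, max_items=20, target_se=0.45):
    remaining = list(pool)
    random.shuffle(remaining)
    # Step 1: a small random set of items of varying difficulty
    responses = [(b, answer(b)) for b in remaining[:burn_in]]
    del remaining[:burn_in]
    theta, se = estimate(responses)
    # Steps 2-6: re-estimate, check precision and length, pick the next item
    while remaining and se > target_se and len(responses) < max_items:
        b = min(remaining, key=lambda d: abs(d - theta))  # max info under Rasch
        remaining.remove(b)
        responses.append((b, answer(b)))
        theta, se = estimate(responses)
    return theta, se, len(responses)

# Simulated examinee with true trait level 1.0, answering probabilistically
random.seed(0)
answer = lambda b: 1 if random.random() < p_rasch(1.0, b) else 0
pool = [i / 4.0 for i in range(-12, 13)]   # item difficulties from -3.0 to 3.0
theta_hat, se, n = run_cat(pool, answer)
```

The loop ends as soon as the standard error drops below the target or the maximum test length is reached, which is why examinees can take tests of different lengths.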
Sample Data
     I1  I2  I3  I4  I5  I6  I7  I8  I9 I10 I11 I12 I13 I14 I15    X
S1    0   0   0   0   0   0   0   0   0   0   1   1   0   0   0    2
S2    0   0   1   0   1   0   1   0   0   0   1   1   0   0   0    5
S3    1   1   1   1   1   0   0   1   0   0   0   1   0   1   0    8
S4    1   0   1   1   1   0   1   0   0   0   1   1   1   1   0    9
S5    1   0   1   0   1   0   1   1   1   1   1   1   1   1   0   11
S6    1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   13
⋮
p   .46 .21 .72 .56 .89 .26 .51 .44 .53 .31 .77 .96 .29 .66 .37
Sample Data—CAT
     Responses (administered items only)    X   N
S1    0 0 0 1 1 0                           2   6
S2    0 1 1 0 1 0 0                         3   7
S3    1 1 0 0 0 1                           3   6
S4    1 1 1 0 0 0                           3   6
S5    1 0 0 1 1 0                           3   6
S6    1 1 0 1 1 1                           5   6
⋮
• Everybody sees different items and takes tests of different lengths
• CAT algorithm forces most students to a score of approximately 50%
• How can we assign scores fairly and accurately?
Sample Data—CAT (items ordered by p)
     Responses (administered items only)    X   N
S1    1 0 1 0 0 0                           2   6
S2    1 1 1 0 0 0 0                         3   7
S3    0 1 1 0 1 0                           3   6
S4    1 0 1 1 0 0                           3   6
S5    1 1 0 1 0 0                           3   6
S6    1 1 1 1 0 1                           5   6
⋮
• Picture is somewhat improved
• Still can’t report accurate scores with any confidence
SECTION II
• Introduction to Item Response Theory
Item Response Theory
• Mathematical modeling approach to test scoring
and analysis
• Less intuitive, but more sophisticated approach
• Solves many problems with CTT
• Sample-dependency of item/exam statistics
• Test-dependency of total scores
• Tough to compare people and items
• Equal item weighting
• No good way to account for guessing
Trait Level vs. Prob. Correct Response
[Figure: probability of correct response (0.0 to 1.0) plotted against θ, the examinee trait level, from −3.0 (LOW score) to 3.0 (HIGH score)]
An Item Characteristic Curve
[Figure: an item characteristic curve; probability of correct response (0.0 to 1.0) vs. θ, the examinee trait level, from −3.0 (LOW score) to 3.0 (HIGH score)]
Sample Independent—Same Curve
[Figure: the same item characteristic curve holds regardless of the examinee sample; probability of correct response (0.0 to 1.0) vs. θ, the examinee trait level]
Item Response Theory
• Directly models the probability of a candidate
getting an item correct based on their overall
level on the construct and item characteristics
• Item characteristics
• Item Difficulty
• Item Discrimination
• Pseudo-Guessing probability
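These three item characteristics combine in the three-parameter logistic (3PL) model. A minimal sketch (omitting the optional 1.7 scaling constant some presentations include):

```python
import math

def p_3pl(theta, a, b, c):
    """3PL: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    a = item discrimination, b = item difficulty, c = pseudo-guessing."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta = b, P = c + (1 - c) / 2: halfway between guessing and certainty.
# With c = 0.2, that is 0.6 -- the curve's inflection point sits above 0.5
# because even low-trait examinees can guess correctly.
```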
Item Difficulty
[Figure: two item characteristic curves; the easier item's curve sits to the left of the harder item's, so at any trait level the easier item has the higher probability of a correct response]
Item Difficulty
[Figure: the same two curves with a vertical line at one trait level; at that level the easier item gives P = .68 and the harder item gives P = .41]
Item Discrimination
[Figure: two item characteristic curves of different steepness. Over the same trait-level interval, the less discriminating item's probability rises from .41 to .59 (a change of .59 − .41 = .18), while the more discriminating item's rises from .32 to .68 (a change of .68 − .32 = .36). Steeper curves separate examinees better.]
Accounting for Guessing
[Figure: an item characteristic curve with a nonzero lower asymptote; even the lowest-trait examinees have some probability of answering correctly by guessing]
Putting it all Together
[Figure: an item characteristic curve combining all three parameters (difficulty, discrimination, pseudo-guessing); probability of correct response vs. θ, the examinee trait level, LOW to HIGH score]
SECTION III
• Score Estimation in CAT
Estimating Examinee Scores
• Requires a large pool of items with known
item parameters.
• Two “approaches” to score estimation
• Visual approach
• Conceptual approach
Visual approach to score estimation
• Test Characteristic Curve (TCC)
• Describes relationship between total test score
and examinee trait level
• TCC is obtained by “adding” item characteristic
curves across all trait levels
• Each test has its own TCC
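The "adding" step can be made concrete. A small sketch using made-up 2PL item parameters (not the items from the slides):

```python
import math

def p_2pl(theta, a, b):
    # 2PL item characteristic curve
    return 1 / (1 + math.exp(-a * (theta - b)))

# Hypothetical (discrimination, difficulty) pairs for a 5-item test
items = [(1.0, -1.5), (1.2, -0.5), (0.9, 0.0), (1.1, 0.5), (1.3, 1.5)]

def tcc(theta):
    # TCC: the item characteristic curves summed at this trait level,
    # giving the expected number-correct score
    return sum(p_2pl(theta, a, b) for a, b in items)
```

tcc(theta) rises from near 0 at low trait levels toward the number of items at high trait levels, tracing out the S-shaped curves on the following slides; each item added to the test shifts the whole curve upward by that item's ICC.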
[Figure: Test Characteristic Curve, 2 items; expected number-correct score (0 to 5) vs. scaled score (200 to 800)]
[Figure: Test Characteristic Curve, 3 items]
[Figure: Test Characteristic Curve, 4 items]
[Figure: Test Characteristic Curve, 5 items; each added item shifts the TCC upward by that item's characteristic curve]
Sample Data—CAT (items ordered by p)
     Responses (administered items only)    X   N
S1    1 0 1 0 0 0                           2   6
S2    1 1 1 0 0 0 0                         3   7
S3    0 1 1 0 1 0                           3   6
S4    1 0 1 1 0 0                           3   6
S5    1 1 0 1 0 0                           3   6
S6    1 1 1 1 0 1                           5   6
⋮
Test Characteristic Curves for These 6 Examinees
[Figure, built up over several slides: one TCC per examinee (Person 1 through Person 6); expected number-correct score (0 to 8) vs. scaled score (200 to 800). Because each examinee took a different set of items, each has a different TCC.]
Test Characteristic Curve
[Figure: a single overall test characteristic curve; expected number-correct score (0 to 8) vs. scaled score (200 to 800)]
Limitation to Visual Approach
• Suggests that any two students who earn the same number-correct score will earn the same scaled score.
• The particular pattern of right and wrong answers is important.
• How a number-correct score is obtained really does matter in score estimation.
Conceptual Approach to Score Estimation
[Figure: item characteristic curves for 6 items; probability of correct response (0.00 to 1.00) vs. score (LOW to HIGH). At the examinee's trait level, the six items' probabilities of a correct response are .93, .83, .75, .71, .59, and .33.]
Conceptual Approach to Score Estimation
[Figure: probability of the observed response for the same 6 items]
P(Corr)   P(answer)
  .93       .93
  .83       .83
  .75       .25
  .71       .71
  .59       .59
  .33       .67
(For the two items answered incorrectly, P(answer) = 1 − P(Corr).)
Conceptual Approach to Score Estimation
• Likelihood of response pattern found by
multiplying probabilities of individual responses
• .93 × .83 × .25 × .71 × .59 × .67 = .0542
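The arithmetic on this slide, written out (values taken from the preceding figure; the third and sixth items were the ones answered incorrectly):

```python
p_corr    = [0.93, 0.83, 0.75, 0.71, 0.59, 0.33]   # P(correct) at this trait level
responses = [1, 1, 0, 1, 1, 0]                     # 1 = right, 0 = wrong

likelihood = 1.0
for p, u in zip(p_corr, responses):
    likelihood *= p if u else (1 - p)   # P(answer): .93 .83 .25 .71 .59 .67

print(round(likelihood, 4))   # 0.0542
```

Repeating this product at every candidate trait level traces out the likelihood function; the score at which it peaks is the examinee's estimate.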
Likelihood Function
[Figure: likelihood (0 to 0.07) vs. scaled score (200 to 800); the curve peaks at 528]
• Score of 528 maximizes the likelihood
SECTION IV
• Item Selection in CAT
Selecting Items in CAT
• Each new item is selected so as to maximize
precision
• What does this mean?
• Item information is a measure of the amount that
we learn about a person’s trait level by
administering a particular item.
Selecting Items in CAT
• General characteristics of item information
• A specific item provides different amounts of
information, depending on the following
• Trait level of examinee
• Characteristics of item
• Information is maximized when an item is
administered whose difficulty is very close to the
examinee’s trait level.
• Information tends to be higher for highly
discriminating items
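For the 2PL model, item information has a simple closed form, I(θ) = a² · P(θ) · (1 − P(θ)), which exhibits both properties listed above: it peaks where the item's difficulty b is closest to θ, and it grows with the discrimination a. A sketch with made-up item parameters:

```python
import math

def p_2pl(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    # 2PL item information: a^2 * P * (1 - P); largest when P is near .5
    # (item difficulty near the trait level) and when a is large
    p = p_2pl(theta, a, b)
    return a * a * p * (1 - p)

# Hypothetical pool of (discrimination, difficulty) pairs
pool = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.4), (1.0, 2.0)]
theta_hat = 0.5

# CAT item selection: administer the item with maximum information
# at the current score estimate
best = max(pool, key=lambda ab: info_2pl(theta_hat, *ab))
```

With the estimate at 0.5, the a = 1.5, b = 0.4 item is selected: it is both the closest in difficulty and the most discriminating.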
Item Information
[Figure: item information curves for several items; information (0 to 0.5) vs. scaled score (200 to 800)]
[Figure sequence: item information curves with the examinee's score estimate updating after each administered item]
• Current estimate: 400; black item answered correctly; new estimate: 475
• Current estimate: 475; green item answered incorrectly; new estimate: 440
• Current estimate: 440; brown item answered correctly; new estimate: 460
• Current estimate: 460; blue item answered incorrectly; new estimate: 448
• Current estimate: 448; black item answered correctly; new estimate: 456
• Final score estimate = 456
SECTION V
• Practical Issues in CAT
Security Issues in CAT
• Security
• Advantages
• Essentially eliminates answer copying
• Shows fewer items to each candidate
• Disadvantages
• Increased reliance on handful of items
• Exposes entire test bank to group of candidates
• Gaming
Other Practical Issues in CAT
• Item Review
• Sparse Data Matrices
Re-Polling Question
Consider the following topics related to computerized
adaptive testing.
I. How items are chosen for administration
II. How a test score is determined
III. Test security issues specific to CAT
After participating in this session, please indicate which of
these topics you would now be more comfortable
discussing with an examinee/parent.
Thank you
• For more information, please contact
Jim Wollack
University of Wisconsin – Madison