TRANSCRIPT
Empowered by Psychometrics Inside the Black Box of
Computerized Adaptive Testing
Jim Wollack and Sonya Sedivy
University of Wisconsin – Madison
#nctaconf15
Purpose of Session
• Introduce several key psychometric concepts
and gain an appreciation for the theoretical
underpinnings of computerized adaptive
testing.
• Conceptual Overview of Computerized Adaptive
Testing
• Introduction of Item Response Theory
• Score Estimation in CAT
• Item Selection in CAT
• Practical Issues in CAT
Poll Question
Consider the following topics related to computerized
adaptive testing.
I. How items are chosen for administration
II. How a test score is determined
III. Test security issues specific to CAT
Please indicate which of these topics you currently would
be comfortable discussing with an examinee/parent.
SECTION I
• Conceptual Overview of Computerized
Adaptive Testing
Classical Test Theory
• X = T + E
• Person characteristics
• Total test score serves as a proxy for examinee’s
level on the construct
• Item characteristics
• Item difficulty is estimated as the proportion of
examinees who answer an item correctly
• Item discrimination estimated as correlation
between item score (1/0) and total score
Sample Data
     I1  I2  I3  I4  I5  I6  I7  I8  I9 I10 I11 I12 I13 I14 I15    X
S1    0   0   0   0   0   0   0   0   0   0   1   1   0   0   0    2
S2    0   0   1   0   1   0   1   0   0   0   1   1   0   0   0    5
S3    1   1   1   1   1   0   0   1   0   0   0   1   0   1   0    8
S4    1   0   1   1   1   0   1   0   0   0   1   1   1   1   0    9
S5    1   0   1   0   1   0   1   1   1   1   1   1   1   1   0   11
S6    1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   13
⋮
p   .46 .21 .72 .56 .89 .26 .51 .44 .53 .31 .77 .96 .29 .66 .37
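The classical statistics just described can be computed directly. The sketch below (assuming NumPy is available) reproduces the item-difficulty and item-discrimination calculations on the six sample rows shown above; because it uses only these six examinees rather than the full sample, its p values will differ from the slide's bottom row.

```python
import numpy as np

# Rows = examinees S1..S6, columns = items I1..I15 (1 = correct, 0 = incorrect)
responses = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
])

total = responses.sum(axis=1)   # X: total score, the CTT proxy for trait level
p = responses.mean(axis=0)      # item difficulty: proportion answering correctly

def discrimination(item_col, totals):
    # Item discrimination: correlation between item score (1/0) and total score.
    # Undefined for items that everyone answered the same way (zero variance).
    if item_col.std() == 0:
        return float("nan")
    return np.corrcoef(item_col, totals)[0, 1]

disc = np.array([discrimination(responses[:, j], total)
                 for j in range(responses.shape[1])])
```

Note that item I6, which nobody in this subsample answered correctly, has no defined discrimination, a small illustration of the sample-dependency problem raised later in the session.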
Computerized Adaptive Testing (CAT)
• Each examinee receives a customized exam
that is built to maximize the precision of their
score.
• When an examinee answers a question
incorrectly, next item will be easier
• When an examinee answers a question correctly,
next item will be harder.
• Difficulty of assessments varies so that each
examinee gets an exam for which they’ll answer
approximately 50% of the items correct.
Reliability versus Precision
• Reliability is a measure of the amount of error
in a group of test scores.
• Standard error of measurement (SEM) is a function
of reliability, and represents an average amount
that an individual’s score might vary upon retest.
• SEM is the same value for all examinees.
• Precision is a measure of the amount of error in
an individual’s test score.
• Varies based on examinee’s trait level.
• Measure of the quality of a specific test score.
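As a quick illustration of the SEM idea (the formula is standard classical test theory, not shown on the slides): the SEM follows from the score standard deviation and the reliability coefficient, and is one value for the whole group.

```python
import math

def sem(sd_x, reliability):
    # Classical test theory: SEM = SD of scores * sqrt(1 - reliability)
    return sd_x * math.sqrt(1 - reliability)

# A test with score SD = 10 and reliability .91 has an SEM of about 3 points,
# applied identically to every examinee regardless of trait level.
print(round(sem(10, 0.91), 2))   # 3.0
```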
Reliability versus Precision
• For a population of examinees, the test will
not be equally precise for all.
• Precision is maximized when examinees take
items that are challenging, but doable.
Adaptive Algorithm
1. Candidate gets a small set of random items of varying difficulty.
2. Estimate trait level (score).
3. Estimate precision of score.
4. Is score sufficiently precise? If YES, end test.
5. If NO: has maximum test length been reached? If YES, end test.
6. If NO: deliver a new item that will maximize precision, then return to step 2.
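The flowchart above can be sketched as a loop. Everything below is a hypothetical stand-in, not the presenters' implementation: it assumes a Rasch (1PL) item model, a grid-search maximum-likelihood score, and the score's standard error as the precision criterion.

```python
import math
import random

def p_rasch(theta, b):
    # Rasch (1PL) probability of a correct response to an item of difficulty b
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate(responses):
    # Grid-search ML trait estimate; SE = 1 / sqrt(test information)
    grid = [g / 10.0 for g in range(-40, 41)]
    def loglik(theta):
        return sum(math.log(p_rasch(theta, b)) if u
                   else math.log(1 - p_rasch(theta, b))
                   for b, u in responses)
    theta = max(grid, key=loglik)
    info = sum(p_rasch(theta, b) * (1 - p_rasch(theta, b)) for b, _ in responses)
    return theta, 1.0 / math.sqrt(info)

def run_cat(pool, answer, burn_in=3, max_items=20, target_se=0.45):
    remaining = list(pool)
    random.shuffle(remaining)
    # Step 1: a small random set of items of varying difficulty
    responses = [(b, answer(b)) for b in remaining[:burn_in]]
    del remaining[:burn_in]
    theta, se = estimate(responses)
    # Steps 2-6: re-estimate, check precision and length, pick the next item
    while remaining and se > target_se and len(responses) < max_items:
        b = min(remaining, key=lambda d: abs(d - theta))  # max info under Rasch
        remaining.remove(b)
        responses.append((b, answer(b)))
        theta, se = estimate(responses)
    return theta, se, len(responses)

# Simulated examinee with true trait level 1.0, answering probabilistically
random.seed(0)
answer = lambda b: 1 if random.random() < p_rasch(1.0, b) else 0
pool = [i / 4.0 for i in range(-12, 13)]   # item difficulties from -3.0 to 3.0
theta_hat, se, n = run_cat(pool, answer)
```

The loop ends as soon as the standard error drops below the target or the maximum test length is reached, which is why examinees can take tests of different lengths.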
Sample Data
     I1  I2  I3  I4  I5  I6  I7  I8  I9 I10 I11 I12 I13 I14 I15    X
S1    0   0   0   0   0   0   0   0   0   0   1   1   0   0   0    2
S2    0   0   1   0   1   0   1   0   0   0   1   1   0   0   0    5
S3    1   1   1   1   1   0   0   1   0   0   0   1   0   1   0    8
S4    1   0   1   1   1   0   1   0   0   0   1   1   1   1   0    9
S5    1   0   1   0   1   0   1   1   1   1   1   1   1   1   0   11
S6    1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   13
⋮
p   .46 .21 .72 .56 .89 .26 .51 .44 .53 .31 .77 .96 .29 .66 .37
Sample Data—CAT
     Responses (administered items only)    X   N
S1    0 0 0 1 1 0                           2   6
S2    0 1 1 0 1 0 0                         3   7
S3    1 1 0 0 0 1                           3   6
S4    1 1 1 0 0 0                           3   6
S5    1 0 0 1 1 0                           3   6
S6    1 1 0 1 1 1                           5   6
⋮
• Everybody sees different items and takes tests of different lengths
• CAT algorithm forces most students to a score of approximately 50%
• How can we assign scores fairly and accurately?
Sample Data—CAT (items ordered by p)
     Responses (administered items only)    X   N
S1    1 0 1 0 0 0                           2   6
S2    1 1 1 0 0 0 0                         3   7
S3    0 1 1 0 1 0                           3   6
S4    1 0 1 1 0 0                           3   6
S5    1 1 0 1 0 0                           3   6
S6    1 1 1 1 0 1                           5   6
⋮
• Picture is somewhat improved
• Still can’t report accurate scores with any confidence
SECTION II
• Introduction to Item Response Theory
Item Response Theory
• Mathematical modeling approach to test scoring
and analysis
• Less intuitive, but more sophisticated approach
• Solves many problems with CTT
• Sample-dependency of item/exam statistics
• Test-dependency of total scores
• Tough to compare people and items
• Equal item weighting
• No good way to account for guessing
Trait Level vs. Prob. Correct Response
[Figure: probability of correct response (0.0 to 1.0) plotted against θ, the examinee trait level, from −3.0 (LOW score) to 3.0 (HIGH score)]
An Item Characteristic Curve
[Figure: an item characteristic curve; probability of correct response (0.0 to 1.0) vs. θ, the examinee trait level, from −3.0 (LOW score) to 3.0 (HIGH score)]
Sample Independent—Same Curve
[Figure: the same item characteristic curve holds regardless of the examinee sample; probability of correct response (0.0 to 1.0) vs. θ, the examinee trait level]
Item Response Theory
• Directly models the probability of a candidate
getting an item correct based on their overall
level on the construct and item characteristics
• Item characteristics
• Item Difficulty
• Item Discrimination
• Pseudo-Guessing probability
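These three item characteristics combine in the three-parameter logistic (3PL) model. A minimal sketch (omitting the optional 1.7 scaling constant some presentations include):

```python
import math

def p_3pl(theta, a, b, c):
    """3PL: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    a = item discrimination, b = item difficulty, c = pseudo-guessing."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta = b, P = c + (1 - c) / 2: halfway between guessing and certainty.
# With c = 0.2, that is 0.6 -- the curve's inflection point sits above 0.5
# because even low-trait examinees can guess correctly.
```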
Item Difficulty
[Figure: two item characteristic curves; the easier item's curve sits to the left of the harder item's, so at any trait level the easier item has the higher probability of a correct response]
Item Difficulty
[Figure: the same two curves with a vertical line at one trait level; at that level the easier item gives P = .68 and the harder item gives P = .41]
Item Discrimination
[Figure: two item characteristic curves of different steepness. Over the same trait-level interval, the less discriminating item's probability rises from .41 to .59 (a change of .59 − .41 = .18), while the more discriminating item's rises from .32 to .68 (a change of .68 − .32 = .36). Steeper curves separate examinees better.]
Accounting for Guessing
[Figure: an item characteristic curve with a nonzero lower asymptote; even the lowest-trait examinees have some probability of answering correctly by guessing]
Putting it all Together
[Figure: an item characteristic curve combining all three parameters (difficulty, discrimination, pseudo-guessing); probability of correct response vs. θ, the examinee trait level, LOW to HIGH score]
SECTION III
• Score Estimation in CAT
Estimating Examinee Scores
• Requires a large pool of items with known
item parameters.
• Two “approaches” to score estimation
• Visual approach
• Conceptual approach
Visual approach to score estimation
• Test Characteristic Curve (TCC)
• Describes relationship between total test score
and examinee trait level
• TCC is obtained by “adding” item characteristic
curves across all trait levels
• Each test has its own TCC
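The "adding" step can be made concrete. A small sketch using made-up 2PL item parameters (not the items from the slides):

```python
import math

def p_2pl(theta, a, b):
    # 2PL item characteristic curve
    return 1 / (1 + math.exp(-a * (theta - b)))

# Hypothetical (discrimination, difficulty) pairs for a 5-item test
items = [(1.0, -1.5), (1.2, -0.5), (0.9, 0.0), (1.1, 0.5), (1.3, 1.5)]

def tcc(theta):
    # TCC: the item characteristic curves summed at this trait level,
    # giving the expected number-correct score
    return sum(p_2pl(theta, a, b) for a, b in items)
```

tcc(theta) rises from near 0 at low trait levels toward the number of items at high trait levels, tracing out the S-shaped curves on the following slides; each item added to the test shifts the whole curve upward by that item's ICC.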
[Figure: Test Characteristic Curve, 2 items; expected number-correct score (0 to 5) vs. scaled score (200 to 800)]
[Figure: Test Characteristic Curve, 3 items]
[Figure: Test Characteristic Curve, 4 items]
[Figure: Test Characteristic Curve, 5 items; each added item shifts the TCC upward by that item's characteristic curve]
Sample Data—CAT (items ordered by p)
     Responses (administered items only)    X   N
S1    1 0 1 0 0 0                           2   6
S2    1 1 1 0 0 0 0                         3   7
S3    0 1 1 0 1 0                           3   6
S4    1 0 1 1 0 0                           3   6
S5    1 1 0 1 0 0                           3   6
S6    1 1 1 1 0 1                           5   6
⋮
Test Characteristic Curves for These 6 Examinees
[Figure, built up over several slides: one TCC per examinee (Person 1 through Person 6); expected number-correct score (0 to 8) vs. scaled score (200 to 800). Because each examinee took a different set of items, each has a different TCC.]
Test Characteristic Curve
[Figure: a single overall test characteristic curve; expected number-correct score (0 to 8) vs. scaled score (200 to 800)]
Limitation to Visual Approach
• Suggests that any two students who earn the same number-correct score will earn the same scaled score.
• The particular pattern of right and wrong answers is important.
• How a number-correct score is obtained really does matter in score estimation.
Conceptual Approach to Score Estimation
[Figure: item characteristic curves for 6 items; probability of correct response (0.00 to 1.00) vs. score (LOW to HIGH). At the examinee's trait level, the six items' probabilities of a correct response are .93, .83, .75, .71, .59, and .33.]
Conceptual Approach to Score Estimation
[Figure: probability of the observed response for the same 6 items]
P(Corr)   P(answer)
  .93       .93
  .83       .83
  .75       .25
  .71       .71
  .59       .59
  .33       .67
(For the two items answered incorrectly, P(answer) = 1 − P(Corr).)
Conceptual Approach to Score Estimation
• Likelihood of response pattern found by
multiplying probabilities of individual responses
• .93 × .83 × .25 × .71 × .59 × .67 = .0542
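The arithmetic on this slide, written out (values taken from the preceding figure; the third and sixth items were the ones answered incorrectly):

```python
p_corr    = [0.93, 0.83, 0.75, 0.71, 0.59, 0.33]   # P(correct) at this trait level
responses = [1, 1, 0, 1, 1, 0]                     # 1 = right, 0 = wrong

likelihood = 1.0
for p, u in zip(p_corr, responses):
    likelihood *= p if u else (1 - p)   # P(answer): .93 .83 .25 .71 .59 .67

print(round(likelihood, 4))   # 0.0542
```

Repeating this product at every candidate trait level traces out the likelihood function; the score at which it peaks is the examinee's estimate.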
Likelihood Function
[Figure: likelihood (0 to 0.07) vs. scaled score (200 to 800); the curve peaks at 528]
• Score of 528 maximizes the likelihood
SECTION IV
• Item Selection in CAT
Selecting Items in CAT
• Each new item is selected so as to maximize
precision
• What does this mean?
• Item information is a measure of the amount that
we learn about a person’s trait level by
administering a particular item.
Selecting Items in CAT
• General characteristics of item information
• A specific item provides different amounts of
information, depending on the following
• Trait level of examinee
• Characteristics of item
• Information is maximized when an item is
administered whose difficulty is very close to the
examinee’s trait level.
• Information tends to be higher for highly
discriminating items
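For the 2PL model, item information has a simple closed form, I(θ) = a² · P(θ) · (1 − P(θ)), which exhibits both properties listed above: it peaks where the item's difficulty b is closest to θ, and it grows with the discrimination a. A sketch with made-up item parameters:

```python
import math

def p_2pl(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    # 2PL item information: a^2 * P * (1 - P); largest when P is near .5
    # (item difficulty near the trait level) and when a is large
    p = p_2pl(theta, a, b)
    return a * a * p * (1 - p)

# Hypothetical pool of (discrimination, difficulty) pairs
pool = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.4), (1.0, 2.0)]
theta_hat = 0.5

# CAT item selection: administer the item with maximum information
# at the current score estimate
best = max(pool, key=lambda ab: info_2pl(theta_hat, *ab))
```

With the estimate at 0.5, the a = 1.5, b = 0.4 item is selected: it is both the closest in difficulty and the most discriminating.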
Item Information
[Figure: item information curves for several items; information (0 to 0.5) vs. scaled score (200 to 800)]
[Figure sequence: item information curves with the examinee's score estimate updating after each administered item]
• Current estimate: 400; black item answered correctly; new estimate: 475
• Current estimate: 475; green item answered incorrectly; new estimate: 440
• Current estimate: 440; brown item answered correctly; new estimate: 460
• Current estimate: 460; blue item answered incorrectly; new estimate: 448
• Current estimate: 448; black item answered correctly; new estimate: 456
• Final score estimate = 456
SECTION V
• Practical Issues in CAT
Security Issues in CAT
• Security
• Advantages
• Essentially eliminates answer copying
• Shows fewer items to each candidate
• Disadvantages
• Increased reliance on handful of items
• Exposes entire test bank to group of candidates
• Gaming
Other Practical Issues in CAT
• Item Review
• Sparse Data Matrices
Re-Polling Question
Consider the following topics related to computerized
adaptive testing.
I. How items are chosen for administration
II. How a test score is determined
III. Test security issues specific to CAT
After participating in this session, please indicate which of
these topics you would now be more comfortable
discussing with an examinee/parent.
Thank you
• For more information, please contact
Jim Wollack
University of Wisconsin – Madison