
Detection of Item Degradation

Yongwei Yang

Abdullah Ferdous

Tzu-Yun Chin

University of Nebraska-Lincoln

In T. L. Hayes (chair), Item degradation: impact, detection, and mitigation, an academic-practitioner collaborative forum conducted at the 22nd annual conference of the Society for Industrial and Organizational Psychology in New York City, NY, April 2007.

Item Degradation

Item degradation
    An item’s favorable psychometric characteristics deteriorate over time
    Psychometric characteristics:
        Content relevance and representativeness
        Technical characteristics (e.g., “difficulty”/“location”, lack of bias)
        Utility (e.g., item-criterion relationship)

Item degradation vs. exposure/compromise
    Item degradation: observed phenomenon
    Item exposure/compromise: items have become known to test takers prior to administration
        Possible reasons for degradation

Detection of Item Degradation

Essentially, detection is about investigating the comparability of an item’s psychometric properties over time
    “temporal stability of the psychometric characteristics” (Chan, Drasgow, & Sawin, 1999)

Can be evaluated under the framework of:
    Measurement invariance (MI; Meredith, 1993)
    Predictive invariance (PI; Millsap, 1995)

Item Degradation as MI or PI

Let x be an observed indicator that measures latent variable w and predicts y, and let v be some population indicator.

Measurement Invariance (MI)
    Same relationship across populations between observed indicators and the latent variables:

    $$F(x \mid w, v) = F(x \mid w)$$

    Degradation: noninvariance in such relationships over time (loading, location)

Predictive Invariance (PI)
    Same relationship across populations between predictors and criterion:

    $$F(y \mid x, v) = F(y \mid x)$$

    Degradation: noninvariance in such relationships over time (indicator-criterion relationship)
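The MI condition can be made concrete with a small numeric sketch. It assumes a two-parameter logistic response function for a binary item, which is an illustrative assumption (the slides do not specify a response model), and the `loading` and `location` values are hypothetical. With v standing for the time point, invariance holds when the endorsement probability at a fixed trait level w is the same at every time point; a drift in location breaks it.

```python
import math


def endorse_prob(w, loading, location):
    """P(x = 1 | w) under an assumed two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-loading * (w - location)))


# Hypothetical (loading, location) pairs at two time points, v = 0 and v = 1.
params_stable = {0: (1.2, 0.0), 1: (1.2, 0.0)}   # same parameters at both times
params_drift = {0: (1.2, 0.0), 1: (1.2, -0.5)}   # location has drifted


def max_gap(params, trait_levels):
    """Largest difference in P(x = 1 | w, v) between the two time points."""
    return max(
        abs(endorse_prob(w, *params[0]) - endorse_prob(w, *params[1]))
        for w in trait_levels
    )


trait_levels = [-2.0, -1.0, 0.0, 1.0, 2.0]
gap_stable = max_gap(params_stable, trait_levels)  # zero: invariance holds
gap_drift = max_gap(params_drift, trait_levels)    # positive: invariance violated
print(gap_stable, gap_drift)
```

The comparison is made at matched trait levels, which is the same idea that motivates conditioning on total score in the CUSUM procedure.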

Item Degradation Detection Methods

Differential item functioning, item parameter drift

Mean & covariance modeling
    Assessing invariance in various aspects pertaining to measurement or predictive properties

Statistical process control
    Cumulative sum (CUSUM) procedure

Models of change

CUSUM for Item Degradation Detection

Our approach: Conditional CUSUM
    Detects whether item parameters have deviated from target
    Makes use of observed scores
    It is important to control for shifts in trait level over time
        “Conditional”: test takers at different time points were matched based on their total test score

Procedures
    Initial item calibration
        Compute the target item parameter (e.g., difficulty) using the first n job applicants from the operational sample
    Define “time group”
        Every m applicants, from applicant n+1 to the last person under investigation
    Define “trait group” (conditioning variable)
        Divide job applicants into groups of reasonable size based on total test scores
    Compute and plot CUSUM statistics for each trait group separately
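The grouping steps of the procedure can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the label 0 for the calibration sample, and the equal-size ranking rule for trait groups are all assumptions, and the scores are made up.

```python
def assign_time_groups(n_applicants, n_calibration, m):
    """Return one time-group label per applicant, in order of administration.

    The first n_calibration applicants get label 0 (calibration sample);
    each following block of m applicants gets label 1, 2, 3, ...
    """
    labels = []
    for position in range(n_applicants):
        if position < n_calibration:
            labels.append(0)
        else:
            labels.append(1 + (position - n_calibration) // m)
    return labels


def assign_trait_groups(total_scores, n_groups):
    """Return one trait-group label per applicant (0 = lowest total scores).

    Applicants are ranked by total test score and cut into n_groups groups
    of near-equal size; ties are broken by order of appearance.
    """
    order = sorted(range(len(total_scores)), key=lambda idx: total_scores[idx])
    labels = [0] * len(total_scores)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_groups // len(total_scores)
    return labels


# Hypothetical example: 10 applicants, first 4 used for calibration, m = 3.
time_labels = assign_time_groups(n_applicants=10, n_calibration=4, m=3)
trait_labels = assign_trait_groups(
    [12, 30, 25, 8, 19, 27, 15, 22, 10, 29], n_groups=2
)
print(time_labels)
print(trait_labels)
```

Ranking into equal-size groups can split applicants with the same total score across two groups; cutting at fixed score boundaries avoids that at the cost of uneven group sizes.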

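Conditional CUSUM: Calculation

Two-sided standardized CUSUM

$$C_i^{+} = \max\left[0,\ \frac{\bar{X}_i - \bar{X}_0}{\sqrt{\dfrac{\sigma_0^2}{n_0} + \dfrac{\sigma_i^2}{n_i}}} - k + C_{i-1}^{+}\right]$$

$$C_i^{-} = \min\left[0,\ \frac{\bar{X}_i - \bar{X}_0}{\sqrt{\dfrac{\sigma_0^2}{n_0} + \dfrac{\sigma_i^2}{n_i}}} + k + C_{i-1}^{-}\right]$$

where
    $\bar{X}_0$: target item mean
    $\bar{X}_i$: time group i item mean
    $\sigma_0^2$: initial status item variance
    $\sigma_i^2$: time group i item variance
    $n_0$, $n_i$: sizes of the initial calibration sample and of time group i

Reference value (k) and control limit (h)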
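A minimal sketch of the two-sided recursion for one item within one trait group is given below. Several details are assumptions rather than facts from the slides: both sums start at zero, an item is flagged when the upper sum exceeds h or the lower sum falls below -h (the usual CUSUM convention), and the values of k, h and the summary statistics are hypothetical.

```python
import math


def conditional_cusum(target_mean, target_var, n_target, group_stats, k, h):
    """Two-sided standardized CUSUM for one item within one trait group.

    group_stats is a list of (mean, variance, n) tuples, one per time group,
    in chronological order. Returns the upper path, the lower path, and the
    index of the first time group at which either limit is crossed (or None).
    """
    c_plus, c_minus = 0.0, 0.0  # assumed starting values
    plus_path, minus_path = [], []
    first_flag = None
    for i, (mean_i, var_i, n_i) in enumerate(group_stats):
        std_err = math.sqrt(target_var / n_target + var_i / n_i)
        z = (mean_i - target_mean) / std_err
        c_plus = max(0.0, z - k + c_plus)
        c_minus = min(0.0, z + k + c_minus)
        plus_path.append(c_plus)
        minus_path.append(c_minus)
        if first_flag is None and (c_plus > h or c_minus < -h):
            first_flag = i
    return plus_path, minus_path, first_flag


# Hypothetical numbers: target mean 3.0, followed by a gradual upward shift.
stats = [(3.0, 1.0, 100), (3.1, 1.0, 100), (3.3, 1.0, 100), (3.4, 1.0, 100)]
plus_path, minus_path, first_flag = conditional_cusum(
    target_mean=3.0, target_var=1.0, n_target=400,
    group_stats=stats, k=0.5, h=4.0,
)
print(plus_path, minus_path, first_flag)
```

In the conditional version the function would be called once per trait group, so that each chart compares test takers with similar total scores.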

Conditional CUSUM: Data Source

A web-based personnel selection assessment for selecting managers
    103 items measuring job-related non-cognitive attributes
    CTT-based test construction and scoring
    Fixed-length, linear test
    Unproctored

Sample: job applicants from Oct. 2002 to Sept. 2005
    Re-takers excluded
    Total N = 7,000

Conditional CUSUM: Results

Among the 103 items:
    36 flagged for upward shift in item means for at least one trait group
    20 flagged for downward shift in item means for at least one trait group
    9 flagged for having both upward and downward shifts for different trait groups
    38 not flagged for any trait group

A couple of examples: it035, it174

Follow-up analysis:
    Were there differences across item types with respect to the likelihood of being flagged by conditional CUSUM?

Conditional CUSUM: Follow-up

Multinomial logistic regression
    DV: conditional CUSUM flag; 3 categories, with “Not Flagged” as the reference category
    IV: ability (6 levels), item type (3 levels, with multiple choice (MC) as the reference group)

Conditional CUSUM flag by item type

Item type          Not flagged   Flagged for downward shift   Flagged for upward shift
Forward (n=210)    78.1%         5.2%                         16.7%
Reverse (n=114)    86.8%         5.3%                         7.9%
MC (n=294)         79.9%         11.9%                        8.2%
Total (n=618)      80.6%         8.4%                         11.0%

Results
    The goodness-of-fit statistic indicates appropriate fit of the main-effects model (χ² = 16.83, df = 20, p = .664)
    The impact of ability levels on the CUSUM flags was not statistically significant (χ² = 13.48, df = 10, p = .198)
    The impact of item type on the CUSUM flags was statistically significant (χ² = 17.83, df = 4, p = .001)
        MC items were more likely to be flagged by conditional CUSUM for negative shifts
        Forward items were more likely to be flagged by conditional CUSUM for positive shifts
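As a rough cross-check of the item-type effect, a plain Pearson chi-square test of independence can be run on the item-type table. This is a swapped-in, simpler technique, not the multinomial logistic regression reported above: it ignores ability level, and the cell counts are reconstructed here by rounding each percentage times its row n, so they are an assumption rather than reported data.

```python
# Counts reconstructed by rounding (percentage x row n) from the table.
# Column order: not flagged, flagged for downward shift, flagged for upward shift.
counts = {
    "Forward": [164, 11, 35],
    "Reverse": [99, 6, 9],
    "MC": [235, 35, 24],
}


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df


chi2, df = pearson_chi_square(counts)
CRITICAL_05_DF4 = 9.488  # chi-square critical value for alpha = .05, df = 4
print(f"chi2 = {chi2:.2f}, df = {df}, significant at .05: {chi2 > CRITICAL_05_DF4}")
```

The statistic will not match the reported likelihood-ratio value exactly, because the regression also adjusts for ability level.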

Model of Change

Perspective 1: Understanding patterns of change using examinee characteristics
    Do the trajectories of item parameter change vary across different types of examinees?
        Applicant location, SES, demographics, etc.

Perspective 2: Understanding patterns of change using item characteristics
    Do the trajectories of item parameter change vary across different types of items?
        Item format, complexity, content area, etc.

Formulating these questions in a longitudinal analysis framework

Perspective 1 Example

Using a 2-level longitudinal model to explore:
    RQ1: On average, was there a shift in item difficulty?
    RQ2: Were there variations in the slope of the shift?
    RQ3 (if yes to RQ2): Could the variations be explained by job applicants’ characteristics (e.g., trait level, region, etc.)?

The model:

Level I:
$$Y_{ti} = \pi_{0i} + \pi_{1i}(\text{time})_{ti} + e_{ti}$$

Level II:
$$\pi_{0i} = \beta_{00} + r_{0i}$$
$$\pi_{1i} = \beta_{10} + r_{1i}$$

Analysis with item 174:
    RQ1: significant positive slope
    RQ2: non-significant variations
    RQ3: not pursued

Perspective 2 Example

Using a 2-level longitudinal model to explore:
    RQ1: Across items, on average was there a change in item difficulty over time?
    RQ2: Were there variations in the slope of the change across items?
    RQ3 (if yes to RQ2): Could the variations be explained by item characteristics?

Model A:

Level I:
$$Y_{ti} = \pi_{0i} + \pi_{1i}(\text{trait})_{ti} + \pi_{2i}(\text{time})_{ti} + e_{ti}$$

Level II:
$$\pi_{0i} = \beta_{00} + r_{0i}$$
$$\pi_{1i} = \beta_{10}$$
$$\pi_{2i} = \beta_{20} + r_{2i}$$

Analysis with this data set:
    RQ1: average slope across items was not different from zero
    RQ2: significant variations in slopes across items

Model B:

Level I:
$$Y_{ti} = \pi_{0i} + \pi_{1i}(\text{trait})_{ti} + \pi_{2i}(\text{time})_{ti} + e_{ti}$$

Level II:
$$\pi_{0i} = \beta_{00} + r_{0i}$$
$$\pi_{1i} = \beta_{10}$$
$$\pi_{2i} = \beta_{20} + \beta_{21}(\text{item\_type})_{i} + r_{2i}$$

Analysis with this data set:
    RQ3: item type did not explain a significant portion of the variations in slopes
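The first two research questions of the Perspective 2 example can be approximated with a two-stage "slopes as outcomes" calculation: fit an ordinary least-squares line to each item's means over time, then summarize the slopes across items. This is a swapped-in simplification, not the two-level model itself: it leaves out the trait covariate, treats all slopes as equally precise, and gives no significance tests. The item names and values are hypothetical.

```python
from statistics import mean, variance


def ols_slope(times, values):
    """Least-squares slope of values regressed on times."""
    t_bar, y_bar = mean(times), mean(values)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    sxx = sum((t - t_bar) ** 2 for t in times)
    return sxy / sxx


# Hypothetical item means for three items over four time groups.
times = [1, 2, 3, 4]
item_means = {
    "item_a": [3.0, 3.1, 3.2, 3.3],  # drifting upward
    "item_b": [2.5, 2.4, 2.3, 2.2],  # drifting downward
    "item_c": [4.0, 4.0, 4.0, 4.0],  # stable
}

slopes = {item: ols_slope(times, ys) for item, ys in item_means.items()}
slope_values = list(slopes.values())
average_slope = mean(slope_values)       # rough analogue of the average slope (RQ1)
slope_variance = variance(slope_values)  # rough analogue of slope variation (RQ2)
print(slopes, average_slope, slope_variance)
```

The made-up data show how an average slope near zero can coexist with clear variation in slopes across items, because upward and downward drifts cancel in the average.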

Summary and Discussions

Two types of methods that serve different purposes:
    Statistical process control (e.g., CUSUM): real-time monitoring of degradation
        We illustrated the conditional CUSUM procedure, but other methods exist (e.g., an IRT-based moving residual approach by Han & Hambleton [2004])
    Explicit modeling of patterns of degradation: understanding the nature of degradation, exploring potential factors that impact degradation, and assisting the development of prevention and mitigation procedures
        We illustrated longitudinal modeling methods, but various methods for studying MI/PI may be applied

These methods can also be used in monitoring and understanding degradation in other parameters (e.g., item variance, discrimination, response time)
    It might be helpful to monitor/model multiple parameters simultaneously to (1) “flag” items more accurately and (2) understand factors behind degradation

Summary and Discussions

Understanding temporal stability of measurement properties is essential to:
    Valid decisions based on test scores
    Valid inferences in substantive research based on assessment outcomes
        Research on the Flynn effect (e.g., Wicherts et al., 2004)

Further research is needed, such as:
    What monitoring approaches would better fit personnel selection assessment programs?
    What would lead to or impact degradation?
    How would item-level degradation impact test-level decisions and inferences?
    Etc.

Some Useful References

MI & PI Concepts
    Mellenbergh (1989)
    Meredith (1993)
    Millsap (1995)

Various IPD and Item Exposure Detection Methods
    Bock, Muraki, & Pfeiffenberger (1988)
    Chan, Drasgow, & Sawin (1999)
    DeMars (2004)
    Donahue & Isham (1998)
    Han & Hambleton (2004)
    Kim, Cohen, & Park (1995)

CUSUM and Psychometric Applications
    Hawkins & Olwell (1998)
    Meijer & van Krimpen-Stoop (2003)
    Montgomery (2005)
    van Krimpen-Stoop & Meijer (2002)
    Veerkamp & Glas (2000)

Contacts

Yongwei Yang: [email protected]
Abdullah Ferdous: [email protected]
Tzu-Yun Chin: [email protected]

THANK YOU

[Figure: Item 35 conditional CUSUM charts]

[Figure: Item 174 conditional CUSUM charts]
