Statistical Techniques to Assess Item Bias

Bruno D. Zumbo, University of British Columbia
Brenda Siokhoon Tay-Lim, United Nations Educational, Scientific and Cultural Organization
Zhen Li, University of British Columbia, and Department of Education, Newfoundland and Labrador

Presented in the symposium "Lessons Learned in Implementing a Literacy Assessment in a Household Survey in the Developing World: UNESCO’s Literacy Assessment and Monitoring Programme", 2009 AERA Conference, San Diego

The University of British Columbia
UNESCO Institute for Statistics




• Introduction to the problem.
– The opening paper of this session set the context for the Literacy Assessment and Monitoring Programme (LAMP).

• At present, many countries do not have any data on the literacy status of their population. For the few that do, the data are often inadequate for policy development. Furthermore, cross-national analysis is not possible because the data are not “harmonized” between countries. Without access to benchmark data, formulating policy and monitoring programmes is difficult.


• The focus of much DIF research has been on large-scale assessment, in which there are many items and many examinees.
– Furthermore, these assessment contexts often involve technical psychometric staff communicating with other technical staff about item analyses.

• The LAMP assessment differs from typical large-scale assessment because of its purpose and complex structure (filter items, locators, Components, etc.), as well as the need to interact and communicate directly with stakeholders in various countries.


• Zumbo (2007) describes what he calls the “third generation” of DIF modeling, in which there are five general uses for DIF analyses:
1. Fairness and equity in testing. <= current LAMP focus
2. Dealing with a possible threat to internal validity.
3. Investigating the comparability of translated and/or adapted measures. <= current LAMP focus
4. Trying to understand item response processes. <= current LAMP focus
5. Investigating lack of invariance for latent variable modeling, e.g., IRT.


• Two broad classes of DIF detection methods:
– Modeling contingency tables or modeling logistic regression models
– IRT methods

• As Zumbo (2007) notes, the essential difference lies in the “what” and the “how” of the matching or conditioning.
– In its essence, the IRT approach is focused on determining the area between the curves (or, equivalently, comparing the IRT parameters) of the two groups.
– Comparing the IRT parameter estimates or IRFs [item response functions] is an unconditional analysis because it implicitly assumes that the ability distribution has been ‘integrated out’. The expression ‘integrated out’ is commonly used in the DIF literature, in the sense that one computes the area between the IRFs across the distribution of the continuum of variation, theta.
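
One way to write the area idea down, offered as an illustration and not as a formula taken from the slides: let P_1(θ) and P_2(θ) be the IRFs of the two groups and f(θ) a density on the ability continuum (all three symbols are notation assumed for this sketch). The unweighted area between the curves and a density-weighted signed version are then

```latex
A = \int \bigl\lvert P_1(\theta) - P_2(\theta) \bigr\rvert \, d\theta ,
\qquad
A_f = \int \bigl[ P_1(\theta) - P_2(\theta) \bigr] \, f(\theta) \, d\theta .
```

Integrating over θ is what removes ability from the comparison and leaves a single number per item.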


We are exploring Ramsay’s (1991, 2000) nonparametric IRT method.

• Three reasons:
– Our interest is in relatively small sample sizes (e.g., in the LAMP field test some items have fewer than 500 respondents) and relatively few items in the scale or measure (items in the Components Record booklet), so we could not use most conventional parametric IRT models.
– We wanted an approach that is data driven and hence flexible in the modeling, because we had no reason to believe that the item response functions would be simple parametric functions, such as a Rasch model, which is sometimes recommended for moderate-to-small-scale testing.
– We wanted an approach that is graphical and hence potentially explainable to policy researchers and stakeholders.


Nonparametric IRT
– IRT includes a family of models which, if appropriate, provide substantial information about item and examinee performance.
– The essential difference between the parametric and nonparametric variants of IRT is the nature of the relationship between the probability of a correct response and the examinee’s ability.
– In parametric IRT, this relationship is assumed to be of a prespecified form, most typically either logistic or normal ogive. If this assumption is violated, one gets poor estimates of the item parameters and of examinees’ ability.


Nonparametric IRT
– In contrast, the nonparametric models do not assume a particular parametric or prespecified form of the item characteristic curve (ICC) but assume that the ICC is nondecreasing in ability.
– As Bolt (2001) and Habing (2001) note, any form of ICC that satisfies this constraint is acceptable in describing the relationship between the probability of a correct response and examinees’ ability,
• thus offering a more flexible framework for applications of IRT modeling.


Purpose:

• In the context of an example from the LAMP, we describe nonparametric item response theory DIF analyses. Along the way we
– … will highlight some of the strengths and limitations of this approach for the LAMP.
– … wish to show how nonparametric IRT, more generally, may be of use for the LAMP.


How the nonparametric regression works
– TestGraf implements a nonparametric regression method that directly estimates a functional relationship between the item score and the test score.
– TestGraf uses kernel smoothing to estimate the probability of choosing option m of item i, P_im. The proficiency value (expected test score) θ is the independent variable, and the dependent variable is the probability of an examinee, a, choosing option m for item i.
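
The smoothing step can be sketched with a plain Nadaraya-Watson estimator. This is a minimal illustration, not TestGraf's own code: the Gaussian kernel, the fixed bandwidth, the function names, and the toy data are all assumptions made for the example, and proficiency values are taken as given instead of being derived from ranked test scores.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_irf(theta_grid, abilities, responses, bandwidth):
    """Kernel-smoothed estimate of P(option chosen | theta) on a grid.

    abilities: one proficiency value per examinee (hypothetical input)
    responses: 1 if the examinee chose the option of interest, else 0
    """
    curve = []
    for theta in theta_grid:
        weights = [gaussian_kernel((theta - a) / bandwidth) for a in abilities]
        total = sum(weights)
        # Weighted average of the 0/1 responses of examinees near theta.
        curve.append(sum(w * y for w, y in zip(weights, responses)) / total)
    return curve

# Toy data: six examinees; higher proficiency goes with a correct answer.
abilities = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
responses = [0, 0, 0, 1, 1, 1]
grid = [-2.0, 0.0, 2.0]
irf = kernel_irf(grid, abilities, responses, bandwidth=0.8)
```

A larger bandwidth gives a smoother but flatter curve; a smaller one follows the data more closely but becomes noisy where few examinees are observed.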


The nonparametric IRT DIF Statistic


• With our knowledge of beta, there are two approaches to testing the “no DIF” hypothesis.
– Perform a formal hypothesis test making use of the purported sampling distribution of beta (not recommended for LAMP because of the complex sampling).
– A less formal hypothesis test: compute beta and compare its value to a criterion (not making use of the sampling distribution of beta):
• For sample sizes of greater than 500 per group, Roussos and Stout (1996) proposed the following cut-off indices: (a) negligible DIF if |β| < .059, (b) moderate DIF if .059 ≤ |β| < .088, and (c) large DIF if |β| ≥ .088.
• Zumbo and Witarsa (2004) provide cut-off values for sample sizes of 500 (or fewer) respondents per group. See Appendix A.
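
The less formal approach can be sketched as follows. The definition of beta used here is an assumption, since the formula on the statistic slide did not survive extraction: beta is taken as the focal-group-weighted average difference between the focal and reference IRFs on a shared grid of theta values. The function names and the toy curves are hypothetical; only the .059 and .088 cut-offs come from the text above.

```python
def dif_beta(focal_irf, reference_irf, focal_weights):
    """Focal-weighted average gap between two IRFs evaluated on the same theta grid.

    focal_weights: proportion of the focal group at each grid point (assumed input)
    """
    total = sum(focal_weights)
    gap = sum(w * (pf - pr)
              for w, pf, pr in zip(focal_weights, focal_irf, reference_irf))
    return gap / total

def classify_dif(beta):
    """Roussos and Stout (1996) labels, proposed for more than 500 per group."""
    size = abs(beta)
    if size < 0.059:
        return "negligible"
    if size < 0.088:
        return "moderate"
    return "large"

# Toy curves on a four-point theta grid.
reference = [0.20, 0.40, 0.60, 0.80]
focal = [0.15, 0.30, 0.55, 0.80]
weights = [0.1, 0.4, 0.4, 0.1]

beta = dif_beta(focal, reference, weights)
label = classify_dif(beta)
```

With these toy values beta is about -0.065, which falls in the moderate band. For smaller groups the fixed bands would be replaced by the sample-size-specific cut-offs from Appendix A.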


• Let’s see the nonparametric IRT in action using LAMP data.
– We investigated the 18 Locator items, all binary scored.
– Morocco (N=321); Palestinian Autonomous Territories (PAT) (N=276); close gender balance in each.
– The 18 items were essentially unidimensional, as determined by a factor analysis of the tetrachoric correlation matrices: for both groups, the first eigenvalue accounted for at least 87.5% of the variance.
– In our graphs, 1 = Morocco and 2 = Palestinian Autonomous Territories (PAT).
– There are 18 items, so let us use a “level of significance” of 0.01 and, given our sample sizes, a cut-off value for beta of 0.042 (0.0415 from Appendix A).


As a place to start, we look at the conditional reliability plots defined on the expected score. As expected, the reliability varies along the expected score. Reliability is greater than 0.85 in the middle 50% of the distribution.


Here are the nonparametric IRFs (1 = Morocco, 2 = Palestinian Autonomous Territories).

We see that beta is less than 0.042, so no DIF.

We also learn that this item performs well: the curve starts at the lower left and moves smoothly to the top right.


1 = Morocco, 2 = Palestinian Autonomous Territories

We also learn that this item performs well: the curve starts at the lower left and moves smoothly to the top right.

We see that beta is less than 0.042, so no DIF, but it is close.


1 = Morocco, 2 = Palestinian Autonomous Territories

We see that beta is less than 0.042, so no DIF, but the item performs equally poorly in both groups. It is a very difficult item and does not discriminate well!

I would not make much of the differences at the upper end, because there is too little information there to have confidence in the curves.


• Most of the items performed quite well, with the IRF tracing from bottom left to top right.

• Three items showed small DIF.

Two items favor Palestinian Autonomous Territories.


Test-level comparative curve (curves labelled PAT and Morocco): how does DIF roll up to the test level?

As much as a 2-point difference favoring Palestinian Autonomous Territories in the middle 50 percent of the distribution.


• Summary
– We can see that the nonparametric IRT approach to DIF gives us some very useful information.
• We get both the IRT item analysis and the DIF analysis at the same time.
• We find three items that may show some sign of DIF; the items also roll up to a difference at the test level.
– What we do with these three items would be decided at a policy level.
– Limitation: Our approach is mostly descriptive.


Cut-off indices for β in identifying TestGraf DIF across sample size combinations and three significance levels, irrespective of the item characteristics

              Level of significance α
N1/N2         .10       .05       .01
-------------------------------------
500/500       .0113     .0161     .0374
200/100       .0249     .0373     .0415
200/50        .0460     .0540     .0568
100/100       .0308     .0421     .0690
100/50        .0421     .0579     .0741
50/50         .0399     .0455     .0626
50/25         .0633     .0869     .1371
25/25         .0770     .0890     .1154

Appendix from Zumbo & Witarsa (2004)
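
For use alongside a computed beta, the table can be held as a small lookup. This is a sketch: the names `CUTOFFS`, `beta_cutoff`, and `flags_dif` are hypothetical, and the lookup only handles the sample-size combinations printed above, because the slides do not state a rule for choosing a row when the actual group sizes fall between them. The 0.0415 value used in the LAMP example is the 200/100 entry at α = .01.

```python
# Cut-off indices for beta, keyed by (N1, N2) and then by significance level.
CUTOFFS = {
    (500, 500): {0.10: 0.0113, 0.05: 0.0161, 0.01: 0.0374},
    (200, 100): {0.10: 0.0249, 0.05: 0.0373, 0.01: 0.0415},
    (200, 50): {0.10: 0.0460, 0.05: 0.0540, 0.01: 0.0568},
    (100, 100): {0.10: 0.0308, 0.05: 0.0421, 0.01: 0.0690},
    (100, 50): {0.10: 0.0421, 0.05: 0.0579, 0.01: 0.0741},
    (50, 50): {0.10: 0.0399, 0.05: 0.0455, 0.01: 0.0626},
    (50, 25): {0.10: 0.0633, 0.05: 0.0869, 0.01: 0.1371},
    (25, 25): {0.10: 0.0770, 0.05: 0.0890, 0.01: 0.1154},
}

def beta_cutoff(n1, n2, alpha):
    """Return the tabled cut-off; raises KeyError for combinations not in the table."""
    return CUTOFFS[(n1, n2)][alpha]

def flags_dif(beta, n1, n2, alpha):
    """True when |beta| reaches the tabled cut-off for this row and level."""
    return abs(beta) >= beta_cutoff(n1, n2, alpha)

cutoff = beta_cutoff(200, 100, 0.01)
flagged = flags_dif(-0.05, 200, 100, 0.01)
```

Raising an error for untabled sample sizes is deliberate here: picking the nearest row is an analyst's judgment call, and the sketch leaves that choice explicit.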


References

Bolt, D. M. (2001). Conditional covariance-based representation of multidimensional test structure. Applied Psychological Measurement, 25, 244-257.

Habing, B. (2001). Nonparametric regression and the parametric bootstrap for local dependence assessment. Applied Psychological Measurement, 25, 221-233.

Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56, 611-630.

Ramsay, J. O. (2000). TESTGRAF: A computer program for nonparametric analysis of testing data. Unpublished manuscript, McGill University.

Zumbo, B. D. (2007). Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223-233.

Zumbo, B. D., & Witarsa, P. M. (2004). Nonparametric IRT methodology for detecting DIF in moderate-to-small scale measurement: Operating characteristics and a comparison with the Mantel Haenszel. Paper presented at the American Educational Research Association Meeting, San Diego, CA.
