
English Teaching, Vol. 65, No. 3, Autumn 2010

Differential Item Functioning Analysis of an EFL Vocabulary Test


Chanho Park

(Korea Institute for Curriculum and Evaluation)

Park, Chanho. (2010). Differential item functioning analysis of an EFL

vocabulary test. English Teaching, 65(3), 23-41.

Differential item functioning (DIF) has been studied as a statistical method of finding item bias, and DIF studies have been conducted on tests of verbal abilities such as reading comprehension and vocabulary knowledge. Recently, researchers of language testing have sought to understand the causes of DIF. The purpose of this study was to find possible sources that explain DIF on an EFL vocabulary test, which can then benefit EFL learners and teachers as well. This study thus applied three DIF detection methods (the likelihood ratio test, SIBTEST, and the Mantel-Haenszel test), followed by a regression analysis using item categories and difficulties as predictors of DIF. As a result, only the "Academic" category was statistically significant, positively contributing to DIF in favor of male examinees. Male examinees were better at academic lexical words than female learners of EFL at the same vocabulary knowledge level. This also means that female learners were relatively strong at nonacademic words in comparison to male EFL learners. This finding may lead to a more effective curriculum for teaching English. More evidence from replicated studies with better item categorization is also needed.

I. INTRODUCTION

Research on differential item functioning (DIF) was triggered by an interest in removing bias from educational or psychological tests (Camilli & Shepard, 1994). The functionality of a test item should be invariant across different groups if the group membership is unrelated to the trait the test is purported to measure. However, a test item shows DIF if the item functions differentially for examinees of the same ability but with different group memberships. Since DIF may severely threaten the validity of a test, it has been studied as a statistical method of finding possible item bias (Camilli & Shepard, 1994; Holland & Wainer, 1993). Detecting and removing DIF items has been considered necessary for valid and bias-free measurement.

Although DIF can be studied for any groups divided using single or multiple criteria, only a limited number of grouping criteria have been applied. Gender DIF is the most common form of DIF studied in many tests for several reasons. First, sizes of gender groups are comparable in most cases. DIF can be hard to detect if one group has too few examinees (e.g., most examinees are male). Second, the group membership should not be too highly correlated with the trait the test is measuring. Last, if gender DIF is found, it can be interpreted as the existence of potential item bias. By contrast, if DIF is studied between first language (L1) and second language (L2) speakers for a language proficiency test, DIF may simply reflect the level of proficiency rather than bias. Therefore, gender DIF is commonly studied in educational measurement including language testing.

In the history of DIF studies in language testing (Abbott, 2007; Chen & Henning, 1985; Mikyung Kim, 2001; Tae-Il Pae, 2004; Ryan & Bachman, 1992; Sasaki, 1991; Takala & Kaftandjieva, 2000), early studies focused on applying diverse DIF detection methods to various language tests in order to detect possible item bias. If some test items were biased against some ethnic or gender group, the bias could lead to DIF, which could then be detected using the statistical methods.

While early DIF studies focused on detection of DIF, recent studies have started to investigate possible causes of DIF in language tests. To understand the causes of DIF, content analysis in combination with other statistical techniques has been applied (Abbott, 2007; Tae-Il Pae, 2004). Understanding the causes of DIF is invaluable in that it will not only help test developers write bias-free items but also help teachers better understand their students' relative strengths or weaknesses. For example, if male students are always favored by specific types of vocabulary assessment questions, the teacher may spend more time with female students when teaching those types of words.

The purpose of this study was to explore possible sources that explain DIF and to

illustrate how the results can be used to facilitate teaching. First, vocabulary test items

were analyzed, and DIF was tested between male and female examinee groups. Gender

groups were studied partly because gender is often of interest in DIF stu


II. LITERATURE REVIEW

1. DIF in Language Testing

Various DIF detection methods have been successfully applied to tests of L2s. Chen and Henning (1985) attempted to detect DIF items across different native language groups (Spanish and Chinese). Item difficulties for each subgroup were estimated under the Rasch model, and were then standardized for comparison. As a result, four vocabulary items were in favor of Spanish-speaking examinees due in part to the fact that the Spanish examinees had target-form-like lexicon in their native language. Sasaki (1991) corroborated the results of Chen and Henning (1985), and found that Chinese students can be favored by some grammar items in addition to similar findings about vocabulary items.

Ryan and Bachman (1992) applied the Mantel-Haenszel test (MH-test; Holland & Thayer, 1988) on two L2 tests across Indo-European (IE) and non-Indo-European (NIE) groups and also across genders. Mikyung Kim (2001) also examined DIF across IE and NIE groups in a speaking test, but applied the likelihood ratio test (LR-test; Thissen, Steinberg, & Wainer, 1988) and the logistic regression method. Mikyung Kim concluded that the LR-test had methodological superiority over the logistic regression method. Takala and Kaftandjieva (2000) performed gender-related DIF studies with an L2 vocabulary test, suggesting that DIF items be excluded from an item bank so that biased item composites do not appear from the item bank.

While most DIF studies in language testing compared gender or language groups, Tae-Il Pae (2004) studied examinees with different backgrounds to investigate DIF on the English subtest of the Korean National Entrance Exam for Colleges and Universities. Tae-Il Pae further tried to link DIF analysis with content analysis and found that English subtest items on science-related topics were easier for students with science backgrounds. It is worthwhile to understand what may cause DIF by relating item contents with the characteristics of the groups divided for DIF analysis. In that study, however, DIF may not unequivocally mean item bias. If science-track students are better at reading comprehension items on science-related topics, we may not call this item bias as we use the term in an ethical or political sense.

In order to better understand why DIF occurs in language testing, Abbott (2007)

adopted multidimensionality-based approaches by Roussos and Stout (1996). According to

Roussos and Stout, DIF occu


(2007) investigated the causes of DIF in reading comprehension using a theoretical

reading strategies framework (i.e., bottom-up vs. top-down strategies). Through an effort

to detect causes of DIF and differential bundle functioning, Abbott could further explore

the secondary dimension.

2. Vocabulary Assessment and Gender

Recently, vocabulary has received considerable attention from language teachers and researchers (Pearson, Hiebert, & Kamil, 2007). It is generally agreed that there is a strong link between vocabulary and reading comprehension. According to Pearson et al., correlation coefficients between vocabulary and comprehension range between .6 and .8, although it is hard to define their causal relationship. Also, vocabulary is considered an important factor in understanding L2 learners' reading problems (RAND Reading Study Group, 2002). There are an increasing number of research studies on L2 vocabulary (Daller, Milton, & Treffers-Daller, 2007; Read, 2000), and development of vocabulary is now considered central to learning a language (Read & Chapelle, 2001).

While the importance of learning vocabulary is emphasized, agreement on gender differences in verbal ability, including vocabulary knowledge, has not been reached yet. Hyde and Linn (1988) conducted an extensive meta-analysis of gender differences in verbal ability. Studies reporting vocabulary knowledge in favor of males or females were mixed. Hyde and Linn thus concluded that there were virtually no gender differences in verbal ability.

Whether gender differences exist or not, DIF items are continually found in vocabulary assessment (Maller, 2001; Takala & Kaftandjieva, 2000). DIF does not indicate a difference in overall ability (e.g., vocabulary knowledge); instead, it may indicate item-level multidimensionality. If DIF does not occur by chance alone (i.e., Type I error), it will be worthwhile to understand the sources that explain DIF.

3. DIF Detection Methods

DIF detection methods can be categorized into parametric and nonparametric methods (Penfield & Lam, 2000). Among the popular DIF detection methods such as the LR-test (Thissen, Steinberg, & Wainer, 1988), SIBTEST (Shealy & Stout, 1993), and the MH-test (Holland & Thayer, 1988), the LR-test is a parametric method because it assumes a particular item response theory (IRT) model is appropriate. SIBTEST and the MH-test do not have such assumptions and are thus called nonparametric methods.


1) SIBTEST

SIBTEST (Shealy & Stout, 1993) compares scores on the studied item for the reference group (group of reference) and the focal group (group of interest) conditional on the valid subtest scores, which are estimated using a regression correction procedure. Since DIF can be defined as differential probability of a correct answer by examinees of the same ability level but with different group membership, matching examinees by ability level is crucial in detecting DIF. SIBTEST uses regression-corrected true scores instead of raw scores. SIBTEST computes $\hat{\beta}_{UNI}$, the effect size measure, which functions as a test statistic when divided by its standard error. The test statistic asymptotically follows the standard normal distribution under the null hypothesis of no DIF.
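The logic of matching examinees on score level and then dividing an effect size by its standard error can be sketched as follows. This is a minimal illustration, not the SIBTEST program: it matches on observed score levels and omits the regression correction described above, and the function names, the pooled-proportion weights, and the toy data are all assumptions made for the example.

```python
import math

def weighted_item_difference(ref, foc):
    """Weighted reference-minus-focal difference on one studied item.

    ref / foc map a matching score level to the list of 0/1 responses
    on the studied item from examinees at that level. Levels are weighted
    by their share of all examinees (an assumption of this sketch).
    Returns (effect size, z statistic).
    """
    # Keep only levels where both groups have at least two examinees.
    levels = [k for k in ref if k in foc and len(ref[k]) > 1 and len(foc[k]) > 1]
    n_total = sum(len(ref[k]) + len(foc[k]) for k in levels)
    beta, variance = 0.0, 0.0
    for k in levels:
        r, f = ref[k], foc[k]
        weight = (len(r) + len(f)) / n_total
        mean_r, mean_f = sum(r) / len(r), sum(f) / len(f)
        var_r = sum((x - mean_r) ** 2 for x in r) / (len(r) - 1)
        var_f = sum((x - mean_f) ** 2 for x in f) / (len(f) - 1)
        beta += weight * (mean_r - mean_f)
        variance += weight ** 2 * (var_r / len(r) + var_f / len(f))
    return beta, beta / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Toy data: two score levels, four examinees per group at each level.
ref = {0: [1, 0, 1, 1], 1: [1, 1, 0, 1]}
foc = {0: [0, 0, 1, 0], 1: [1, 0, 0, 1]}
beta, z = weighted_item_difference(ref, foc)
p = two_sided_p(z)
```

A positive value favors the reference group and a negative value favors the focal group. The `two_sided_p` helper reproduces the kind of z and p pairs reported for SIBTEST: for example, z = 2.61 gives p of about .009.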

2) MH-Test

Holland and Thayer (1988) proposed the MH-test based on a common odds ratio for correct versus incorrect responses in the reference versus focal group conditional on total test score. From the common odds ratio, the Mantel-Haenszel chi-square ($\chi^2_{MH}$) is derived as the test statistic, which follows a chi-square distribution with one degree of freedom (df) under the null hypothesis of no DIF.
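The computation behind the MH-test can be sketched from a set of 2 x 2 tables, one per total-score level. The sketch below assumes the commonly used continuity-corrected form of the chi-square; the function name and the counts are hypothetical and are not taken from the study's data.

```python
def mantel_haenszel(strata):
    """Common odds ratio and MH chi-square for one studied item.

    strata: list of (a, b, c, d) counts, one tuple per total-score level:
      a = reference correct, b = reference incorrect,
      c = focal correct,     d = focal incorrect.
    Returns (common odds ratio, continuity-corrected chi-square, df = 1).
    """
    num = den = observed = expected = variance = 0.0
    for a, b, c, d in strata:
        total = a + b + c + d
        if total < 2:
            continue  # a level with fewer than two examinees carries no information
        n_ref, n_foc = a + b, c + d
        n_right, n_wrong = a + c, b + d
        num += a * d / total
        den += b * c / total
        observed += a
        expected += n_ref * n_right / total
        variance += n_ref * n_foc * n_right * n_wrong / (total ** 2 * (total - 1))
    odds_ratio = num / den
    chi_square = (abs(observed - expected) - 0.5) ** 2 / variance
    return odds_ratio, chi_square

# Hypothetical counts at two score levels.
odds_ratio, chi_square = mantel_haenszel([(30, 20, 20, 30), (40, 10, 30, 20)])
```

An odds ratio above 1 indicates that the item favors the reference group at matched score levels. The chi-square is compared with the df = 1 critical value (3.841 at the .05 level).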

3) LR-Test

When an IRT model fits, the item parameters should be invariant across different subgroups (the invariance principle; Embretson & Reise, 2000). The LR-test (Thissen, Steinberg, & Wainer, 1988) was conceived such that DIF is indicated if the likelihood differs between a model in which the parameters are constrained to be the same for the subgroups (called a "compact model") and a model in which the parameters are allowed to differ (called an "augmented model"). The -2 times log likelihood value of the augmented model subtracted from the -2 times log likelihood value of the compact model produces a test statistic called $G^2$ with df equaling the number of item parameters freed in the augmented model (e.g., three for a three-parameter model). $G^2$ follows a chi-square distribution under the null hypothesis of no DIF. Since the LR-test is a parametric method, it should be determined beforehand which IRT model is used. When the three-parameter logistic (3PL) model is used for the LR-test, a chi-square test can be conducted for all three parameters at once (with df of 3) or for each item parameter (with df of 1). It is worth noting that the LR-test functions properly with a two-stage purification procedure. In the first stage each item is tested assuming all other items are DIF-free. Once DIF candidates are obtained from the first stage, non-candidates are treated as "pure" items, and each candidate item is tested again using "pure" non-candidate items as the anchor set.
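The $G^2$ arithmetic can be sketched in a few lines. The two -2 log likelihood values below are hypothetical placeholders (in practice they would come from fitting the compact and augmented models in IRT software), and the closed-form chi-square tail probabilities are written out only for the df values the text mentions, to keep the example free of external libraries.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df in {1, 2, 3}."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("this sketch only handles df of 1, 2, or 3")

def lr_test(neg2ll_compact, neg2ll_augmented, df):
    """G^2 = (-2 log L, compact) - (-2 log L, augmented), with its p-value."""
    g2 = neg2ll_compact - neg2ll_augmented
    return g2, chi2_sf(g2, df)

# Hypothetical fit values for one item under the 3PL model (df = 3).
g2, p_value = lr_test(51234.9, 51208.5, df=3)
```

With these placeholder values $G^2$ is 26.4 and the p-value is far below .05, so the item would be carried forward as a DIF candidate.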

III. METHOD

1. Instrument

Gap-filling multiple choice vocabulary items were developed. A total of 50 items were pilot-tested, and the items with point-biserial correlation coefficients smaller than .20 (i.e., low discrimination) were excluded. Thus, 40 items, which had acceptable difficulty and discrimination levels, remained. Examinees were asked to find the word or phrase that best fills the gap in one exchange of dialogue (20 items) or in a sentence (20 items). These types of items use context to test knowledge of lexical items (Hughes, 2003, p. 182), and Read and Chapelle (2001) state that vocabulary knowledge "should be defined in relation to particular contexts" (p. 22). The test items used in this study are copyrighted materials and cannot be presented. Instead, samples are given in the Appendix.

Cronbach's alpha, as a measure of internal consistency of the test, was .84, which indicates that the test had acceptable reliability. As a check for the unidimensionality assumption of IRT (and also for the local independence assumption), a principal components analysis was conducted. A dominant principal component was found (Figure 1), and it was thus concluded that the test is "essentially" unidimensional and thus a unidimensional IRT model may fit (Stout, 1990).
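For reference, Cronbach's alpha can be computed directly from a matrix of item scores. The following is a minimal sketch with a made-up four-examinee, three-item data set; it is not the study's data, and the function name is chosen only for the example.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a list of examinee rows (one score per item)."""
    k = len(scores[0])

    def sample_variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    # Variance of each item across examinees, and variance of total scores.
    item_variances = [sample_variance([row[j] for row in scores]) for j in range(k)]
    total_variance = sample_variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Made-up dichotomous responses: rows are examinees, columns are items.
alpha = cronbach_alpha([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as a measure of internal consistency.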

FIGURE 1 Variances of Principal Components

[Plot of the variances of principal components Comp.1 through Comp.10.]


2. Participants

Originally more than 6,000 examinees took the test. The examinees represent all geographical regions of Korea, and their education levels range from primary school students to adult learners. In order to remove possible sample weight effects, 2,500 examinees were randomly selected from each of the male and female groups. Thus, a total of 5,000 examinees were used in this study. The mean age of male examinees was 22.88 (SD = 6.75) with 62 and 11 as maximum and minimum, respectively. Female examinees were slightly younger with a mean of 22.01 (SD = 5.32), and the maximum and minimum were 48 and 10, respectively. Although the age distribution was significantly different for male and female groups (t(4998) = 5.07, p < .001), it was mainly due to the large sample size. Indeed, Cohen's d as an effect size measure was small (.14).
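The reported effect size can be checked from the summary statistics above. The sketch below assumes a pooled-standard-deviation form of Cohen's d and an equal-variance two-sample t statistic; the paper does not state which formulas were used, so this is only a plausibility check, and the function name is hypothetical.

```python
import math

def cohens_d_and_t(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d (pooled SD) and the equal-variance two-sample t statistic."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / pooled_sd
    t = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return d, t

# Age statistics reported in the text: males first, females second.
d, t = cohens_d_and_t(22.88, 6.75, 2500, 22.01, 5.32, 2500)
```

Under these assumptions d rounds to .14, matching the reported value, and t comes out near 5.06, close to the reported 5.07; the small gap is consistent with rounding in the published means and standard deviations.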

Vocabulary ability levels were compared between male and female groups using latent trait scores (θ) estimated under the IRT 3PL model. Since existence of DIF among the items may bias ability levels, θ's were estimated using only "pure" (i.e., non-DIF) items. Group means of men and women were identical up to the second decimal, and the difference was not significant (t(4998) = -.16, p = .87). Therefore, although men and women differed slightly by age, there was no practical difference of vocabulary knowledge between the two groups. See Table 1 for summary statistics of male and female groups.

TABLE 1 Summary Statistics of Participants' Demographic Information

Sample                 Male      Female    t test
Size                   2,500     2,500
Age      Mean          22.88     22.01     t(4998) = 5.07, p < .01
         SD            6.75      5.32
         Maximum       62        48
         Minimum       11        10
Ability (θ)                                t(4998) = -.16, p = .87

3. Data Analysis

First, DIF items were detected using all three detection methods (LR-test, SIBTEST, and MH-test). The LR-test was conducted using the program IRTLRDIF, which provides an efficient way of conducting the LR-test by doing many MULTILOG runs in one run (Thissen, 2001). For the LR-test, the 3PL model, the least restrictive IRT model among the popular dichotomous models, was assumed to fit the data, and the test was conducted on the three item parameters at once (i.e., a chi-square test with df of 3 for each item). The SIBTEST computer program (Stout & Roussos, 1992) was used for both SIBTEST and the MH-test. For all the tests, the nominal Type I error rate was set at .05. Although only the results from SIBTEST were used for further analyses, the other two methods were implemented for cross-validation. Since the three methods show different Type I error rates and power, we can be more confident about DIF results when the three methods are combined. Therefore, items were flagged as DIF only when the three methods agreed. Note that the male group was the reference group while female examinees were the focal group.

The items were then categorized by test type (dialogue vs. sentence), answer key type (phrase vs. word), part of speech (noun vs. verb vs. etc.) and also by how cognitively demanding the lexical words were and whether idiomatic expressions were involved. Finally, a multiple linear regression analysis was conducted by regressing $\hat{\beta}_{UNI}$, the effect size measure of SIBTEST, onto the item categories and IRT b-parameters (item difficulty). Since $\hat{\beta}_{UNI}$ is bidirectional (i.e., can be positive or negative) and shows the degree of DIF (i.e., effect size), it was hoped that the regression analysis could reveal which sources are significant in explaining DIF.

IV. RESULTS

Tables 2 and 3 show the results of DIF analysis for the 20 dialogue items and 20 sentence items, respectively. Results from the three DIF detection methods agreed most of the time, but differed slightly. SIBTEST detected 13 items (four from dialogue and nine from sentence items) while the MH-test found 16 (five and eleven) significant items and the LR-test 13 items (six and seven). Here, the LR-test tested only 13 items because the other 27 items were found to be "pure" in the first stage. It is worth noting that the items SIBTEST detected as DIF were also detected as DIF by at least one of the other two methods. Finally, eleven items were flagged as DIF by all three methods; four items favored males and seven females.

As a means of visual inspection for DIF, we can draw item characteristic curves (ICCs) of the DIF-flagged items using the results from the LR-test for illustration purposes. Figures 2 and 3 show ICCs of DIF items in favor of males and females, respectively. The solid curves are for males, and the dotted curves are for females. As the vocabulary knowledge level becomes higher (i.e., moving to the right-hand side), the probability of a correct answer becomes higher, too, but it never exceeds 1.0. Using ICCs, we can inspect at what levels the effect of DIF is greatest and whether the ICCs cross. For example, for Item 21 in Figure 3, females are favored at high ability levels while it is reversed at low levels. The pattern is not highly discernible, but generally ICCs do not quite cross in Figure 2, while ICCs often cross in Figure 3. We should be cautious when interpreting these crossing DIF items. Although these items generally favor females, male examinees are favored at some levels of θ.
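A crossing pattern of this kind can be reproduced with the standard 3PL item response function. The sketch below is illustrative only: the group-specific parameters are hypothetical, not estimates from this study, and the scaling constant D = 1.7 is an assumption, since the paper does not state which metric was used.

```python
import math

def icc_3pl(theta, a, b, c, scale=1.7):
    """3PL probability of a correct answer at ability level theta.

    a = discrimination, b = difficulty, c = lower asymptote (guessing).
    """
    return c + (1 - c) / (1 + math.exp(-scale * a * (theta - b)))

# Hypothetical group-specific parameters for one DIF item.
male = {"a": 0.8, "b": 0.0, "c": 0.2}
female = {"a": 1.6, "b": 0.0, "c": 0.2}

# Female-minus-male probability at low, middle, and high ability levels.
gaps = {theta: icc_3pl(theta, **female) - icc_3pl(theta, **male)
        for theta in (-1.0, 0.0, 1.0)}
```

Because the two curves share the same difficulty and lower asymptote but differ in discrimination, they cross at θ = 0: the hypothetical female curve is lower below that point and higher above it, which is the kind of reversal described for Item 21.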

There were more DIF items in the sentence section of the vocabulary test (Table 3) than in the dialogue section (Table 2). However, it is hard to generalize this finding because the dialogue items do not always involve lexical words from colloquial English or words of everyday use. Some of the lexical words in dialogue items were cognitively demanding academic words.

FIGURE 2 ICCs of DIF Items Favoring Males

[Four panels: Item 9, Item 23, Item 34, and Item 38. Each panel plots the ICCs for male and female examinees against Vocabulary Knowledge on the horizontal axis.]


FIGURE 3 ICCs of DIF Items Favoring Females

[Panels: Item 13, Item 17, Item 19, Item 21, Item 25, and Item 26. Each panel plots the ICCs for male and female examinees against Vocabulary Knowledge on the horizontal axis.]


Test items were further categorized as stated before. For the test type category, for example, only the "Dialogue" category was used because two complete and mutually exclusive categories will result in multicollinearity in a regression analysis (i.e., both "Dialogue" and "Sentence" types could not be used as item categories). It was the same for the other categories. Therefore, the resulting categories were "Dialogue", "Phrase", "Noun", "Verb", "Academic", and "Idiomatic". The 20 items in the dialogue section were in the "Dialogue" category. "Phrase" meant the answer key is not a single word but a phrase (e.g., phrasal verb). "Noun" and "Verb" were part of speech categories. An item was coded "Academic" if the lexical word of the answer key was something used in academic texts and hardly used in everyday conversation (e.g., a posteriori or omnivorous). When an item required knowledge of an idiomatic expression, it was coded "Idiomatic". Table 4 is the list of item categories for all 40 items. Item categories had only binary values (1/0). One meant that the item possessed the characteristic while zero meant the item did not possess it. For example, the answer key for Item 1 was a verb used in a dialogue.
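The multicollinearity argument can be made concrete with the test type category. The sketch below assumes, as the section layout suggests, that Items 1-20 are the dialogue items and Items 21-40 the sentence items; the variable names are chosen only for the example.

```python
# Binary (1/0) coding of test type for the 40 items.
items = range(1, 41)
dialogue = [1 if item <= 20 else 0 for item in items]
sentence = [1 - flag for flag in dialogue]  # complement of "Dialogue"
intercept = [1] * 40                        # constant column of a regression model

# Every item is either a dialogue item or a sentence item, so the two
# dummy columns always add up to the intercept column.
perfectly_collinear = all(
    const == d + s for const, d, s in zip(intercept, dialogue, sentence)
)
```

Since the intercept equals the sum of the two dummy columns for every item, a model containing all three has no unique solution. Dropping "Sentence" removes the redundancy, and the "Dialogue" coefficient is then read as the contrast between dialogue and sentence items.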

TABLE 3 DIF Results of the Sentence Section (Items 21-40)

        SIBTEST                                MH-Test                 LR-Test          Favored
Item    $\hat{\beta}_{UNI}$   z        p       $\chi^2_{MH}$   p       $G^2$    p       Group
21*     -.091                 -4.6…    .000    18…                     26.4     …       Female
22      .005                  .53      .594
23*     .0…                   5.90     .000                            36.3     …       Male
24      .017                  1.53     .127
25*     -.055                 -3.98    .000                                             Female
26*     -.058                 -4.92    .000                                             Female
27*     -.047                 …        …                                                Female
28      .017
29      .003                  .22      .826    .16             .687
30                            -1.36    .174    .90             .343
31                            2.61     .009    …
32      -.022                 -1.63    .103    1.72            .190
33      .015                  1.15     .249    1.08            .299
34*     .099                  9.30     .000    80.35           .000    111.5    .000    Male
35      -.020                 -1.73    .084    2.11            .147
36                            .80      .424    1.07            .301
37
38*     …                                                                               Male
39
40                            -.17     .866    .00             .975

Note: 1. The LR-test tests only candidates from the first stage. 2. Shading refers to statistical significance (α = .05). 3. * denotes all three DIF tests agree.


These item categories and item difficulties (IRT b-parameters) were used in a multiple linear regression analysis. As briefly explained before, $\hat{\beta}_{UNI}$ functions as the effect size measure for SIBTEST and can have both positive and negative signs. When it is regressed onto the item difficulties and categories, we may conclude that item difficulties or the categories are positively or negatively contributing to DIF if a significant regression coefficient is found.

Table 5 shows regression coefficients. Of the predictors, only the "Academic" category is statistically significant (p = .008). The other predictors are not statistically significant, and all predictors together explain about 30% of the variability of the DIF effect size measures. The sign of the "Academic" estimate is positive. Since male examinees were used as the reference group in this study, positive $\hat{\beta}_{UNI}$ values are associated with DIF in favor of males. Therefore, with caution, we may conclude that cognitively demanding academic words significantly contribute to DIF in favor of males.
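How a dummy-coded category turns into a signed regression coefficient can be sketched with ordinary least squares. Everything in this example is hypothetical: the six items, their codes, and the effect sizes are invented so that the coefficients are known in advance, and the solver is a plain normal-equations routine rather than the software used in the study.

```python
def ols(design, response):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    n, p = len(design), len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(design[i][r] * design[i][c] for i in range(n)) for c in range(p)]
           + [sum(design[i][r] * response[i] for i in range(n))]
           for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][p] / aug[r][r] for r in range(p)]

# Hypothetical items coded as (academic, dialogue).
codes = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 0), (1, 1)]
design = [[1, academic, dialogue] for academic, dialogue in codes]
# Invented effect sizes: academic items shifted by +.04 (toward the reference group).
effect_sizes = [-0.02 + 0.04 * academic + 0.01 * dialogue for academic, dialogue in codes]

intercept, b_academic, b_dialogue = ols(design, effect_sizes)
```

The recovered "academic" coefficient is the shift in effect size associated with that category when the other predictors are held fixed. With males as the reference group, a positive shift means more DIF in favor of males.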

TABLE 4 Item Difficulty and Categorization

[The binary (1/0) codes in the Dialogue, Phrase, Noun, Verb, Academic, and Idiomatic columns are illegible in this copy; only the Item and Difficulty columns are reproduced.]

Item    Difficulty
1       1.84
2       0.93
3       0.76
4       -0.43
5       -0.81
6       -0.30
7       -1.88
8       -1.01
9*      1.82
10      0.74
11      1.35
12      0.84
13*     1.91
14      1.13
15      0.15
16      2.49
17*     2.34
18      1.37
19*     1.84
20      1.28
21*     -0.33
22      -1.29
23*     -0.04
24      -0.41
25*     0.49
26*     -1.38
27*     1.42
28      -0.58
29      0.21
30      1.07
31      1.44
32      0.51
33      0.50
34*     1.70
35      -0.77
36      1.23
37      1.55
38*     1.67
39      0.79
40      1.43

Note: * denotes a DIF item.


TABLE 5 Results of Regression Analysis

Coefficients    Estimate    Standard error    t         p
(Intercept)     -.019       .012              -1.531    .136
Difficulty      .001        .005              .211      .834
Dialogue        .007        .013              .546      .589
Phrase          -.019       .017              -1.147    .260
Noun            .001        .014              .103      .919
Verb            .009        .013              .707      .485
Academic        .038        .013              2.839     .008**
Idiomatic       .001        .014              .064      .949
$R^2$           .3068

Note: ** denotes p < .01

V. DISCUSSION AND CONCLUSION

In this study, three DIF detection methods were applied to an English vocabulary test. The items showing DIF between male and female examinees were similar across the three methods. The items were further investigated by regressing $\hat{\beta}_{UNI}$, the effect size measure for SIBTEST, on the item difficulty and item categories. Only the "Academic" category was found to be statistically significant, in favor of male examinees.

The results showed that use of cognitively demanding academic words may lead to DIF against female examinees, which has implications that may be useful for teaching purposes. That academic words contribute to DIF means that female examinees have weaker academic vocabulary knowledge than male examinees of the same vocabulary level. Thus, teachers of English vocabulary may decide to spend more time with female students when teaching academic words.

Since vocabulary knowledge is strongly connected to reading and listening comprehension (Pearson et al., 2007), the DIF results can be useful in reading and listening classes as well. It is likely that female students have relatively greater difficulty comprehending academic passages in reading or listening. Reading and listening teachers may thus spend more time with female students on vocabulary when the contents are rather academic. Because only one predictor can be allowed for academic and nonacademic categories (due to multicollinearity), significance of the "Academic" category also means that female learners are more proficient at nonacademic words than male learners of the same level. Thus, teachers of English may need to be more careful with male students on nonacademic vocabulary. All in all, different approaches are warranted between male and female L2 learners when teaching vocabulary of specific categories.

The purpose of this study was multi-fold. First, several popular DIF detection methods were compared on vocabulary assessment. Parametric and nonparametric methods are often compared using Monte Carlo simulation techniques (e.g., Bolt, 2002). Since different methods adopt different approaches, their performances are also different. It is of interest to compare the performance of various DIF detection methods on empirical vocabulary assessment data. In this study, the parametric and nonparametric methods produced quite similar results.

Second, this study sought to identify sources that explain DIF. The paradigm of DIF studies is shifting. DIF was first studied as a statistical method of finding item bias. DIF detection methods were developed and compared to find well-performing methods (i.e., powerful detection of DIF with a proper control of Type I error rates) under various conditions. In this framework, DIF studies in language testing had only limited application. DIF detection methods were only applied to find biased items to remove. However, the paradigm is changing. Researchers have started to examine what causes DIF (Abbott, 2007). The current study used a regression analysis to identify significant predictors of DIF. This endeavor involved item categorization. Only one item category was found to be statistically significant in this study; however, better categorization of items may reveal better predictors of DIF. Note that only about 30% of the variability of the DIF effect size measures was explained by the predictors used in this study (Table 5).

In addition to the significant finding that can be useful when teaching English, this study

also serves as a framework that can be applied to other subtests and with better item

categorization. It is easy to conduct the same type of analysis if relevant item

categorization is available. Cumulative evidence from different tests with di


REFERENCES

Abbott, M. L. (2007). A confirmatory approach to differential item functioning on an ESL reading assessment. Language Testing, 24, 7-36.

Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113-141.

Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2, 155-163.

Daller, H., Milton, J., & Treffers-Daller, J. (Eds.) (2007). Modelling and assessing vocabulary knowledge. Cambridge, UK: Cambridge University Press.

Douglas, J., Roussos, L. A., & Stout, W. F. (1996). Item-bundle DIF hypothesis testing: Identifying suspect bundles and assessing their differential functioning. Journal of Educational Measurement, 33, 465-484.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.

Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge, UK: Cambridge University Press.

Hyde, J., & Linn, M. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104, 53-69.

Kim, Mikyung. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18, 89-114.

Maller, S. J. (2001). Differential item functioning in the WISC-III: Item parameters for boys and girls in the national standardization sample. Educational and Psychological Measurement, 61, 793-817.

Pae, Tae-Il. (2004). DIF for examinees with different academic backgrounds. Language Testing, 21, 53-73.

Pearson, P. D., Hiebert, E. H., & Kamil, M. L. (2007). Vocabulary assessment: What we know and what we need to learn. Reading Research Quarterly, 42, 282-296.

Penfield, R. D., & Lam, T. C. M. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19, 5-15.

RAND Reading Study Group. (2002). Reading for understanding: Toward an R&D

program in reading comprehension. Santa Monica, CA: RAND.

Read, J. (2000). Assessing vocabulary. Cambridge, UK: Cambridge University Press.

Read, J., & Chapelle, C. A. (2001). A framework for second language vocabulary assessment. Language Testing, 18, 1-32.

Roussos, L. A., & Stout, W. F. (1996). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20, 355-371.

Ryan, K. E., & Bachman, L. F. (1992). Differential item functioning on two tests of EFL proficiency. Language Testing, 9, 12-29.

Sasaki, M. (1991). A comparison of two methods for detecting differential item functioning in an ESL placement test. Language Testing, 8, 95-115.

Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.

Stout, W. F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika, 55, 293-325.

Stout, W. F., & Roussos, L. A. (1992). SIBTEST user manual. Unpublished manuscript. Available from W. F. Stout, University of Illinois at Urbana-Champaign.

Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323-340.

Thissen, D. (2001). IRTLRDIF (Version 2) [Computer software]. Retrieved May 1, 2009, from http://www.unc.edu/~dthissen/dl.html.

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Lawrence Erlbaum.

APPENDIX

Sample Test Items

1. Gap-filling: Dialogue

A: I am going to the concert tomorrow, too.

B: What a nice ______

a) impression b) happiness

c) coincidence d) luck

KEY: c

2. Gap-filling: Sentence

This VCR player is still under ______ and can be repaired free of charge.

a) payment b) mortgage

c) coverage d) warranty

KEY: d

Note: These are not actual items used in this study.

Applicable levels: primary education, secondary education, tertiary education

Key words: vocabulary assessment, EFL test, differential item functioning (DIF)

Chanho Park
Division of the College Scholastic Ability Test Research and Management
Korea Institute for Curriculum and Evaluation (KICE)
Jeongdong Bldg. 15-5 Jeong-dong, Jung-gu
Seoul 100-784, Korea
Tel: 02-3704-3613
Fax: 02-3704-3690
Email: cpark@kice.re.kr

Received in June, 2010 Reviewed in July, 2010 Revised version received in August, 2010

