English Teaching, Vol. 65, No. 3, Autumn 2010
Differential Item Functioning Analysis of an EFL Vocabulary Test
Chanho Park
(Korea Institute for Curriculum and Evaluation)
Park, Chanho. (2010). Differential item functioning analysis of an EFL
vocabulary test. English Teaching, 65(3), 23-41.
Differential item functioning (DIF) has been studied as a statistical method of finding
item bias, and DIF studies have been conducted on tests of verbal abilities such as
reading comprehension and vocabulary knowledge. Recently, researchers of language
testing have sought to understand the causes of DIF. The purpose of this study was to find
possible sources that explain DIF on an EFL vocabulary test, which can then benefit
EFL learners and teachers as well. This study thus applied three DIF detection methods
(the likelihood ratio test, SIBTEST, and the Mantel-Haenszel test), followed by a
regression analysis using item categories and difficulties as predictors of DIF.
As a result, only the "Academic" category was statistically significant, positively
contributing to DIF in favor of male examinees. Male examinees were better at
academic lexical words than female learners of EFL at the same vocabulary knowledge
level. This also means that female learners were relatively strong at nonacademic words
in comparison to male EFL learners. This finding may lead to a more effective curriculum
for teaching English. More evidence from replicated studies with better
item categorization is also needed.
I. INTRODUCTION
Research on differential item functioning (DIF) was triggered by an interest in removing
bias from educational or psychological tests (Camilli & Shepard, 1994). The functionality
of a test item should be invariant across different groups if the group membership is
unrelated to the trait the test is purported to measure. However, a test item shows DIF if
the item functions differentially for examinees of the same ability but with different group
memberships. Since DIF may severely threaten the validity of a test, it has been studied as
a statistical method of finding possible item bias (Camilli & Shepard, 1994; Holland &
Wainer, 1993). Detecting and removing DIF items have been considered necessary for
valid and bias-free measurement.
Although DIF can be studied for any groups divided using single or multiple criteria,
only a limited number of grouping criteria have been applied. Gender DIF is the most
common form of DIF studied in many tests for several reasons. First, sizes of gender
groups are comparable in most cases. DIF can be hard to detect if one group has too few
examinees (e.g., most examinees are male). Second, the group membership should not be
too highly correlated with the trait the test is measuring. Last, if DIF is found, it can be
interpreted as existence of potential item bias. In contrast, if DIF is studied between
first language (L1) and second language (L2) speakers for a language proficiency test,
DIF may indicate the level of proficiency rather than bias. Therefore, gender DIF
is commonly studied in educational measurement, including language testing.
In the history of DIF studies in language testing (Abbott, 2007; Chen & Henning, 1985;
Mikyung Kim, 2001; Tae-Il Pae, 2004; Ryan & Bachman, 1992; Sasaki, 1991; Takala &
Kaftandjieva, 2000), early studies focused on applying diverse DIF detection methods to
various language tests in order to detect possible item bias. If some test items were biased
against some ethnic or gender group, the bias could lead to DIF, which could then be
detected using the statistical methods.
While early DIF studies focused on detection of DIF, recent studies have started to
investigate possible causes of DIF in language tests. To understand the causes of DIF,
content analysis in combination with other statistical techniques has been applied
(Abbott, 2007; Tae-Il Pae, 2004). Understanding causes of DIF is invaluable in that it will not only
help test developers write bias-free items but also help teachers better understand their
students' relative strengths or weaknesses. For example, if male students are always
favored by specific types of vocabulary assessment questions, the teacher may spend more
time with female students when teaching those types of words.
The purpose of this study was to explore possible sources that explain DIF and to
illustrate how the results can be used to facilitate teaching. First, vocabulary test items
were analyzed, and DIF was tested between male and female examinee groups. Gender
groups were studied partly because gender is often of interest in DIF stu
II. LITERATURE REVIEW
1. DIF in Language Testing
Various DIF detection methods have been successfully applied to tests of L2s. Chen and
Henning (1985) attempted to detect DIF items across different native language groups
(Spanish and Chinese). Item difficulties for each subgroup were estimated under the Rasch
model, and were then standardized for comparison. As a result, four vocabulary items
were in favor of Spanish-speaking examinees due in part to the fact that the Spanish
examinees had target-form-like lexicon in their native language. Sasaki (1991)
corroborated the results of Chen and Henning (1985), and found that Chinese students can
be favored by some grammar items in addition to similar findings about vocabulary items.
Ryan and Bachman (1992) applied the Mantel-Haenszel test (MH-test; Holland &
Thayer, 1988) on two L2 tests across Indo-European (IE) and non-Indo-European (NIE)
groups and also across genders. Mikyung Kim (2001) also examined DIF across IE and
NIE groups in a speaking test, but applied the likelihood ratio test (LR-test; Thissen,
Steinberg, & Wainer, 1988) and the logistic regression method. Mikyung Kim concluded
that the LR-test had methodological superiority over the logistic regression method. Takala
and Kaftandjieva (2000) performed gender-related DIF studies with an L2 vocabulary test,
suggesting that DIF items be excluded from an item bank so that biased item composites
do not appear from the item bank.
While most DIF studies in language testing compared gender or language groups, Tae-Il
Pae (2004) studied examinees with different backgrounds to investigate DIF on the
English subtest of the Korean National Entrance Exam for Colleges and Universities. Tae-Il
Pae further tried to link DIF analysis with content analysis and found that English
subtest items on science-related topics were easier for students with science backgrounds.
It is worthwhile to understand what may cause DIF by relating item contents with the
characteristics of the groups divided for DIF analysis. In that study, however, DIF may not
unequivocally mean item bias. If science-track students are better at reading
comprehension items on science-related topics, we may not call this item bias as we use
the term in an ethical or political sense.
In order to better understand why DIF occurs in language testing, Abbott (2007)
adopted multidimensionality-based approaches by Roussos and Stout (1996). According to
Roussos and Stout, DIF occu
(2007) investigated the causes of DIF in reading comprehension using a theoretical
reading strategies framework (i.e. , bottom-up vs. top-down strategies). Through an effort
to detect causes of DIF and differential bundle functioning, Abbott could further explore
the secondary dimension.
2. Vocabulary Assessment and Gender
Recently, vocabulary has received considerable attention from language teachers and
researchers (Pearson, Hiebert, & Kamil, 2007). It is generally agreed that there is a strong
link between vocabulary and reading comprehension. According to Pearson et al.,
correlation coefficients between vocabulary and comprehension range between .6 and .8
although it is hard to define their causal relationship. Also, vocabulary is considered an
important factor in understanding L2 learners' reading problems (RAND Reading Study
Group, 2002). There are an increasing number of research studies on L2 vocabulary
(Daller, Milton, & Treffers-Daller, 2007; Read, 2000), and development of vocabulary is
now considered central to learning a language (Read & Chapelle, 2001).
While the importance of learning vocabulary is emphasized, agreement on gender
differences in verbal ability, including vocabulary knowledge, has not been reached yet.
Hyde and Linn (1988) conducted an extensive meta-analysis of gender differences in
verbal ability. Studies reporting vocabulary knowledge in favor of males or females were
mixed. Hyde and Linn thus concluded that there were virtually no gender differences in
verbal ability.
Whether gender differences exist or not, DIF items are continuously found in
vocabulary assessment (Maller, 2001; Takala & Kaftandjieva, 2000). DIF does not
indicate difference of overall ability (e.g., vocabulary knowledge); instead, it may indicate
item-level multidimensionality. If DIF does not occur by chance alone (i.e., Type I error), it
will be worthwhile to understand the sources that explain DIF.
3. DIF Detection Methods
DIF detection methods can be categorized into parametric and nonparametric methods
(Penfield & Lam, 2000). Among the popular DIF detection methods such as the LR-test
(Thissen, Steinberg, & Wainer, 1988), SIBTEST (Shealy & Stout, 1993), and the MH-test
(Holland & Thayer, 1988), the LR-test is a parametric method because it assumes a
particular item response theory (IRT) model is appropriate. SIBTEST and the MH-test do
not have such assumptions and are thus called nonparametric methods.
1) SIBTEST
SIBTEST (Shealy & Stout, 1993) compares scores on the studied item for the reference
group (group of reference) and the focal group (group of interest) conditional on the valid
subtest scores, which are estimated using a regression correction procedure. Since DIF can
be defined as differential probability of a correct answer by examinees of the same ability
level but with different group membership, matching examinees by ability level is crucial
in detecting DIF. SIBTEST uses regression-corrected true scores instead of raw scores.
SIBTEST computes $\hat{\beta}_{UNI}$, the effect size measure, which functions as a test statistic
when divided by its standard error. The test statistic asymptotically follows the standard
normal distribution under the null hypothesis of no DIF.
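The matching idea behind the effect size can be illustrated with a deliberately simplified sketch. It computes a weighted difference in studied-item means across strata of equal matching score. Two points are assumptions made here for illustration only: raw matching scores are used in place of the regression-corrected scores that SIBTEST relies on, and strata are weighted by their pooled share of examinees. The function name `simple_beta_uni` and the data are hypothetical.

```python
from collections import defaultdict


def simple_beta_uni(ref, focal):
    """Weighted difference in studied-item means across matched score strata.

    ref, focal: lists of (matching_score, item_score) pairs, item_score in {0, 1}.
    Simplification: raw matching scores are used, without the regression
    correction that SIBTEST applies.
    """
    by_score_ref = defaultdict(list)
    by_score_foc = defaultdict(list)
    for score, y in ref:
        by_score_ref[score].append(y)
    for score, y in focal:
        by_score_foc[score].append(y)

    # Only strata observed in both groups can be compared.
    shared = sorted(set(by_score_ref) & set(by_score_foc))
    n_total = sum(len(by_score_ref[k]) + len(by_score_foc[k]) for k in shared)
    if n_total == 0:
        raise ValueError("no shared strata between the two groups")

    beta = 0.0
    for k in shared:
        r, f = by_score_ref[k], by_score_foc[k]
        weight = (len(r) + len(f)) / n_total  # assumed weighting: pooled share
        beta += weight * (sum(r) / len(r) - sum(f) / len(f))
    return beta


# Hypothetical data: (matching subtest score, studied item score)
reference = [(1, 1), (1, 0), (2, 1), (2, 1)]
focal = [(1, 0), (1, 0), (2, 1), (2, 0)]
print(simple_beta_uni(reference, focal))  # positive values favor the reference group
```

With the reference group coded first, a positive value points to DIF in favor of the reference group and a negative value to DIF in favor of the focal group.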
2) MH-Test
Holland and Thayer (1988) proposed the MH-test based on a common odds ratio for
correct versus incorrect responses in the reference versus focal group conditional on total
test score. From the common odds ratio, the Mantel-Haenszel chi-square ($\chi^2_{MH}$) is derived
as the test statistic, which follows a chi-square distribution with one degree of freedom (df)
under the null hypothesis of no DIF.
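As a minimal sketch of this computation, the following code takes one 2 x 2 table per total-score level and returns the common odds ratio together with a chi-square statistic. It uses the standard Mantel-Haenszel formulas with a continuity correction of 0.5, which is an assumption here since the text does not state which variant was used. The function name and the counts are hypothetical.

```python
def mantel_haenszel(tables):
    """Common odds ratio and MH chi-square over matched score levels.

    tables: list of (A, B, C, D) per total-score level, where
    A/B = reference correct/incorrect and C/D = focal correct/incorrect.
    """
    num = den = 0.0
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_expected += (a + b) * (a + c) / n
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if den == 0 or sum_var == 0:
        raise ValueError("tables do not allow the statistic to be computed")
    odds_ratio = num / den
    # Continuity-corrected statistic, referred to chi-square with df = 1.
    chi_square = (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    return odds_ratio, chi_square


# Hypothetical counts for a single score level.
odds_ratio, chi_square = mantel_haenszel([(30, 10, 20, 20)])
print(odds_ratio, chi_square)
```

An odds ratio above 1 indicates that, at the same total score, reference-group examinees have higher odds of answering the item correctly.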
3) LR-Test
When an IRT model fits, the item parameters should be invariant across different
subgroups (the invariance principle; Embretson & Reise, 2000). The LR-test (Thissen,
Steinberg, & Wainer, 1988) was conceived such that DIF is indicated if the likelihood
differs between a model in which the parameters are constrained to be the same for the subgroups
(called a "compact model") and a model in which the parameters are allowed to differ (called an
"augmented model"). The -2 times log likelihood value of the augmented model
subtracted from the -2 times log likelihood value of the compact model produces a test
statistic called $G^2$ with df equaling the number of item parameters freed in the augmented
model (e.g., three for a three-parameter model). $G^2$ follows a chi-square distribution under
the null hypothesis of no DIF. Since the LR-test is a parametric method, it should be
determined beforehand which IRT model is used. When the three-parameter logistic (3PL)
model is used for the LR-test, a chi-square test can be conducted for all three parameters at
once (with df of 3) or for each item parameter (with df of 1). It is worth noting that the
LR-test functions properly with a two-stage purification procedure. In the first stage each item
is tested assuming all other items are DIF-free. Once DIF candidates are obtained from the
first stage, non-candidates are treated as "pure" items, and each candidate item is tested
again using "pure" non-candidate items as the anchor set.
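The arithmetic of the test statistic can be sketched as follows. The code forms $G^2$ as the difference of the two -2 log likelihood values and refers it to a chi-square distribution with integer df, using a closed-form upper-tail probability so that only the standard library is needed. The function names and the two -2 log likelihood values are hypothetical; fitting the compact and augmented IRT models themselves is outside the scope of this sketch.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) times a finite sum of y^k / k!.
        terms = sum(y ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-y) * terms
    # Odd df: start from df = 1, erfc(sqrt(y)), and step the shape up by one.
    tail = math.erfc(math.sqrt(y))
    for k in range(1, (df - 1) // 2 + 1):
        tail += y ** (k - 0.5) * math.exp(-y) / math.gamma(k + 0.5)
    return tail


def lr_dif_test(neg2ll_compact, neg2ll_augmented, df):
    """G^2 = (-2LL of compact model) - (-2LL of augmented model)."""
    g2 = neg2ll_compact - neg2ll_augmented
    return g2, chi2_sf(g2, df)


# Hypothetical -2 log likelihood values for one studied item under the 3PL
# model, with all three item parameters freed in the augmented model (df = 3).
g2, p = lr_dif_test(10250.0, 10241.0, df=3)
print(g2, p)
```

Testing one parameter at a time would use the same function with `df=1` and the corresponding pair of model fits.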
III. METHOD
1. Instrument
Gap-filling multiple choice vocabulary items were developed. A total of 50 items were
pilot-tested, and the items with point-biserial correlation coefficients smaller than .20
(i.e., low discrimination) were excluded. Thus, 40 items, which had acceptable difficulty and
discrimination levels, remained. Examinees were asked to find the word or phrase that best
fills the gap in one exchange of dialogue (20 items) or in a sentence (20 items). These
types of items use context to test knowledge of lexical items (Hughes, 2003, p. 182), and
Read and Chapelle (2001) state that vocabulary knowledge "should be defined in relation
to particular contexts" (p. 22). The test items used in this study are copyrighted materials
and cannot be presented. Instead, samples are given in the Appendix.
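The screening rule described above can be sketched in a few lines. The code correlates each 0/1 item score with the total score and keeps items whose point-biserial coefficient is at least .20. Whether the total score in the study excluded the studied item is not stated, so the uncorrected total used here is an assumption, and the function names and response matrix are hypothetical.

```python
import math


def point_biserial(item, total):
    """Pearson correlation between a 0/1 item score and the total score."""
    n = len(item)
    mean_i = sum(item) / n
    mean_t = sum(total) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item, total))
    var_i = sum((i - mean_i) ** 2 for i in item)
    var_t = sum((t - mean_t) ** 2 for t in total)
    return cov / math.sqrt(var_i * var_t)  # undefined if an item has no variance


def screen_items(responses, threshold=0.20):
    """responses: one row of 0/1 item scores per examinee.

    Returns the indices of items whose point-biserial meets the threshold.
    """
    totals = [sum(row) for row in responses]
    kept = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        if point_biserial(item, totals) >= threshold:
            kept.append(j)
    return kept


# Hypothetical pilot data: six examinees, three items.
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]
print(screen_items(responses))  # the third item (index 2) falls below .20
```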
Cronbach's alpha, as a measure of internal consistency of the test, was .84, which
indicates that the test had acceptable reliability. As a check for the unidimensionality
assumption of IRT (and also for the local independence assumption), a principal
components analysis was conducted. A dominant principal component was found (Figure
1), and it was thus concluded that the test is "essentially" unidimensional and thus a
unidimensional IRT model may fit (Stout, 1990).
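For reference, Cronbach's alpha can be computed from an examinee-by-item score matrix as sketched below, using the usual formula based on item variances and total-score variance. The function name and the small response matrix are hypothetical and are unrelated to the .84 reported for the actual test.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one row of item scores per examinee (at least two items)."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: perfectly consistent items give alpha = 1.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(cronbach_alpha(consistent))
```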
FIGURE 1 Variances of Principal Components
[Plot of component variances for Comp.1 through Comp.10]
2. Participants
Originally more than 6,000 examinees took the test. The examinees represent all
geographical regions of Korea, and their education levels range from primary school
students to adult learners. In order to remove possible sample weight effects, 2,500
examinees were randomly selected from each of the male and female groups. Thus, a total
of 5,000 examinees were used in this study. The mean age of male examinees was 22.88
(SD = 6.75) with 62 and 11 as maximum and minimum, respectively. Female examinees
were slightly younger with a mean of 22.01 (SD = 5.32), and the maximum and minimum
were 48 and 10, respectively. Although the age distribution was significantly different for
male and female groups (t(4998) = 5.07, p < .001), it was mainly due to the large sample size.
Indeed, Cohen's d as an effect size measure was small (.14).
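The reported effect size can be checked from the summary statistics alone. The sketch below assumes a pooled standard deviation and a pooled-variance t-test, which is consistent with the reported df of 4998 but is not stated explicitly in the text. The function names are hypothetical.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    se = pooled_sd(sd1, n1, sd2, n2) * math.sqrt(1 / n1 + 1 / n2)
    return (m1 - m2) / se


# Age summary statistics reported for the male and female groups.
d = cohens_d(22.88, 6.75, 2500, 22.01, 5.32, 2500)
t = pooled_t(22.88, 6.75, 2500, 22.01, 5.32, 2500)
print(round(d, 2), round(t, 2))
```

Under these assumptions d rounds to .14, and t comes out close to the reported 5.07; a small gap is expected because the published means and standard deviations are themselves rounded.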
Vocabulary ability levels were compared between male and female groups using latent
trait scores (θ) estimated under the IRT 3PL model. Since existence of DIF among the
items may bias ability levels, θ's were estimated using only "pure" (i.e., non-DIF) items.
Group means of men and women were identical up to the second decimal, and the
difference was not significant (t(4998) = -.16, p = .87). Therefore, although men and women
differed slightly by age, there was no practical difference of vocabulary knowledge
between the two groups. See Table 1 for summary statistics of male and female groups.
TABLE 1 Summary Statistics of Participants' Demographic Information
                    Male      Female    t-test
Sample size         2,500     2,500
Age    Mean         22.88     22.01     t(4998) = 5.07, p < .01
       SD            6.75      5.32
       Max             62        48
       Min             11        10
θ      Mean             …         …     t(4998) = -.16, p = .87
       SD               …         …
Note: … denotes a value that is not legible in the source.
3. Data Analysis
First, DIF items were detected using all three detection methods (LR-test, SIBTEST,
and MH-test). The LR-test was conducted using the program IRTLRDIF, which provides
an efficient way of conducting the LR-test by doing many MULTILOG runs in one run
(Thissen, 2001). For the LR-test, the 3PL model, the least restrictive IRT model among the
popular dichotomous models, was assumed to fit the data, and the test was conducted on
the three item parameters at once (i.e., a chi-square test with df of 3 for each item). The
SIBTEST computer program (Stout & Roussos, 1992) was used for both SIBTEST and
the MH-test. For all the tests, the nominal Type I error rate was set at .05. Although only
the results from SIBTEST were used for further analyses, the other two methods were
implemented for cross-validation. Since the three methods show different Type I error rates
and power, we can be more confident about DIF results when the three methods are
combined. Therefore, items were flagged DIF only when the three methods agreed. Note
that the male group was the reference group while female examinees were the focal group.
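The agreement rule can be written down as a simple set intersection. In the sketch below each method is represented by a mapping from item number to p-value, and an item is flagged only if all three p-values fall below the nominal level. The function names, item numbers, and p-values are hypothetical; an item missing from a method's results (e.g., one not tested in the second stage of the LR-test) is treated as not significant for that method.

```python
ALPHA = 0.05  # nominal Type I error rate used for all three tests


def significant(p_values, alpha=ALPHA):
    """Item numbers whose p-value is below alpha."""
    return {item for item, p in p_values.items() if p < alpha}


def flag_dif(sibtest_p, mh_p, lr_p, alpha=ALPHA):
    """Items flagged as DIF only when all three methods agree."""
    agreed = (
        significant(sibtest_p, alpha)
        & significant(mh_p, alpha)
        & significant(lr_p, alpha)
    )
    return sorted(agreed)


# Hypothetical p-values keyed by item number.
sibtest_p = {1: 0.001, 2: 0.300, 3: 0.020, 4: 0.040}
mh_p = {1: 0.004, 2: 0.030, 3: 0.010, 4: 0.200}
lr_p = {1: 0.000, 3: 0.015}  # only first-stage candidates are tested
print(flag_dif(sibtest_p, mh_p, lr_p))
```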
The items were then categorized by test type (dialogue vs. sentence), answer key type
(phrase vs. word), part of speech (noun vs. verb vs. etc.) and also by how cognitively
demanding the lexical words were and whether idiomatic expressions were involved.
Finally, a multiple linear regression analysis was conducted by regressing $\hat{\beta}_{UNI}$, the
effect size measure of SIBTEST, onto the item categories and IRT b-parameters (item
difficulty). Since $\hat{\beta}_{UNI}$ is bidirectional (i.e., can be positive or negative) and shows the
degree of DIF (i.e., effect size), it was hoped that the regression analysis could reveal
the significant sources in explaining DIF.
IV. RESULTS
Tables 2 and 3 show the results of DIF analysis for the 20 dialogue items and 20
sentence items, respectively. Results from the three DIF detection methods agreed most of
the time, but differed slightly. SIBTEST detected 13 items (four from dialogue and nine
from sentence items) while the MH-test found 16 (five and eleven) significant items and
the LR-test 13 items (six and seven). Here, the LR-test tested only 13 items because the
other 27 items were found to be "pure" in the first stage. It is worth noting that the items
SIBTEST detected as DIF were also detected as DIF by at least one of the other two
methods. Finally, eleven items were flagged DIF by all three methods; four items favored
males and seven favored females.
As a means of visual inspection for DIF, we can draw item characteristic curves (ICCs)
of the DIF-flagged items using the results from the LR-test for illustration purposes.
Figures 2 and 3 show ICCs of DIF items in favor of males and females, respectively. The
solid curves are for males, and the dotted curves are for females. As the vocabulary
knowledge level becomes higher (i.e., moving to the right-hand side), the probability of
a correct answer becomes higher, too, but it never exceeds 1.0. Using ICCs, we can inspect
at what levels the effect of DIF is greatest and whether the ICCs cross. For example, for Item
21 in Figure 3, females are favored at high ability levels while it is reversed at low levels.
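A crossing pair of ICCs can be reproduced with the 3PL response function. The sketch below evaluates the curve for two sets of item parameters, one per group, and reports where the male curve lies above or below the female curve. The parameter values are hypothetical and are not the estimates behind Figures 2 and 3; the `scale` argument is also an assumption (1.0 here, although a constant of 1.7 is common in some software).

```python
import math


def icc_3pl(theta, a, b, c, scale=1.0):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


# Hypothetical parameters for one item, estimated separately by group.
male = {"a": 1.0, "b": 0.0, "c": 0.2}
female = {"a": 2.0, "b": 0.0, "c": 0.2}


def icc_gap(theta):
    """Male minus female probability; a sign change signals crossing ICCs."""
    return icc_3pl(theta, **male) - icc_3pl(theta, **female)


for theta in (-3, -2, -1, 0, 1, 2, 3):
    print(
        f"theta={theta:+d}  male={icc_3pl(theta, **male):.3f}  "
        f"female={icc_3pl(theta, **female):.3f}  gap={icc_gap(theta):+.3f}"
    )
```

With these values the gap is positive at low ability and negative at high ability, the same kind of reversal described for Item 21.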
The pattern is not highly discernible, but generally ICCs do not quite cross in Figure 2,
while ICCs often cross in Figure 3. We should be cautious when interpreting these
crossing DIF items. Although these items generally favor females, male examinees are
favored at some levels of θ.
There were more DIF items in the sentence section of the vocabulary test (Table 3) than
in the dialogue section (Table 2). However, it is hard to generalize this finding because the
dialogue items do not always involve lexical words from colloquial English or words of
everyday use. Some of the lexical words in dialogue items were cognitively demanding
academic words.
FIGURE 2 ICCs of DIF Items Favoring Males
[Panels: Item 9, Item 23, Item 34, Item 38. Horizontal axis: Vocabulary Knowledge (-3 to 3). Solid curve: Male; dotted curve: Female.]
FIGURE 3 ICCs of DIF Items Favoring Females
[Panels: Item 13, Item 17, Item 19, Item 21, Item 25, Item 26. Horizontal axis: Vocabulary Knowledge (-3 to 3). Solid curve: Male; dotted curve: Female.]
Test items were further categorized as stated before. For the test type category, for
example, only the "Dialogue" category was used because two complete and mutually
exclusive categories would result in multicollinearity in a regression analysis (i.e., both
"Dialogue" and "Sentence" types could not be used as item categories). It was the same
for the other categories. Therefore, the resulting categories were "Dialogue", "Phrase",
"Noun", "Verb", "Academic", and "Idiomatic". The 20 items in the dialogue section were
in the "Dialogue" category. "Phrase" meant the answer key is not a single word but a
phrase (e.g., phrasal verb). "Noun" and "Verb" were part of speech categories. An item
was coded "Academic" if the lexical word of the answer key was something used in
academic texts and hardly used in everyday conversation (e.g., a posteriori or omnivorous).
When an item required knowledge of an idiomatic expression, it was coded "Idiomatic".
Table 4 is the list of item categories for all 40 items. Item categories had only binary
values (1/0). One meant that the item possessed the characteristic while zero meant the
item did not possess it. For example, the answer key for Item 1 was a verb used in a
dialogue.
TABLE 3 DIF Results of the Sentence Section (Items 21-40)
        SIBTEST                    MH-Test            LR-Test            Favored
Item    β_UNI    z       p        χ²_MH    p         G²       p         Group
21*     -.091    …       .000     …        …         26.4     …         Female
22      .005     .53     .594     …        …
23*     …        5.90    .000     …        …         36.3     …         Male
24      .017     1.53    .127     …        …
25*     -.055    -3.98   .000     …        …         …        …         Female
26*     -.058    -4.92   .000     …        …         …        …         Female
27*     -.047    …       …        …        …         …        …         Female
28      .017     …       …        …        …
29      .003     .22     .826     .16      .687
30      …        -1.36   .174     .90      .343
31      …        2.61    .009     …        …
32      .022     -1.63   .103     1.72     .190
33      .015     1.15    .249     1.08     .299
34*     .099     9.30    .000     80.35    .000      111.5    .000      Male
35      -.020    -1.73   .084     2.11     .147
36      …        .80     .424     1.07     .301
37      …        …       …        …        …
38*     …        …       …        …        …         …        …         Male
39      …        …       …        …        …
40      …        -.17    .866     .00      .975
Note: 1. The LR-test tests only candidates from the first stage. 2. Shading refers to statistical significance (α = .05). 3. * denotes all three DIF tests agree. 4. … denotes a value that is not legible in the source.
These item categories and item difficulties (IRT b-parameters) were used in a multiple
linear regression analysis. As briefly explained before, $\hat{\beta}_{UNI}$ functions as the effect size
measure for SIBTEST and can have both positive and negative signs. When $\hat{\beta}_{UNI}$ is regressed onto
the item difficulties and categories, we may conclude that item difficulties or the
categories are positively or negatively contributing to DIF if a significant regression
coefficient is found.
Table 5 shows regression coefficients. Of the predictors, only the "Academic" category is
statistically significant (p = .008). The other predictors are not statistically significant, and
all predictors together explain about 30% of the variability of the DIF effect size measures. The
sign of the "Academic" estimate is positive. Since male examinees were used as the reference group in
this study, positive $\hat{\beta}_{UNI}$ values are associated with DIF in favor of males. Therefore,
with caution, we may conclude that cognitively demanding academic words significantly
contribute to DIF in favor of males.
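The regression step, together with the reason for keeping only one dummy variable per category pair, can be illustrated with a small ordinary least squares sketch that solves the normal equations directly. Everything in it is hypothetical: the function names, the six items, their difficulties, and the noise-free effect sizes constructed so that only the "Academic" code matters. It is not a reanalysis of the study's data and omits standard errors and p-values.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design (e.g., redundant dummy variables)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """rows: predictor rows without intercept. Returns [intercept, slopes...]."""
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(x, y)) for j in range(p)]
    return solve(xtx, xty)


# Hypothetical items: (difficulty, academic code 1/0).
rows = [(1.0, 0), (-0.5, 0), (0.5, 1), (1.5, 1), (0.0, 0), (2.0, 1)]
# Hypothetical effect sizes built so that only "Academic" contributes.
y = [-0.02 + 0.04 * academic for _, academic in rows]

print(ols(rows, y))  # intercept, difficulty slope, academic slope

# Coding both "Academic" and "Nonacademic" alongside an intercept makes the
# design matrix rank deficient, which is why only one of the pair is used.
try:
    ols([(academic, 1 - academic) for _, academic in rows], y)
except ValueError as err:
    print("cannot fit:", err)
```

A positive slope on the "Academic" code corresponds to larger effect sizes for academic items, that is, DIF in favor of the reference group.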
TABLE 4 Item Difficulty and Categorization
Item    Difficulty    Dialogue
1        1.84         1
2        0.93         1
3        0.76         1
4       -0.43         1
5       -0.81         1
6       -0.30         1
7       -1.88         1
8       -1.01         1
9*       1.82         1
10       0.74         1
11       1.35         1
12       0.84         1
13*      1.91         1
14       1.13         1
15       0.15         1
16       2.49         1
17*      2.34         1
18       1.37         1
19*      1.84         1
20       1.28         1
21*     -0.33         0
22      -1.29         0
23*     -0.04         0
24      -0.41         0
25*      0.49         0
26*     -1.38         0
27*      1.42         0
28      -0.58         0
29       0.21         0
30       1.07         0
31       1.44         0
32       0.51         0
33       0.50         0
34*      1.70         0
35      -0.77         0
36       1.23         0
37       1.55         0
38*      1.67         0
39       0.79         0
40       1.43         0
Note: * denotes a DIF item. The binary codes for the Phrase, Noun, Verb, Academic, and Idiomatic columns are not legible in the source.
TABLE 5 Results of Regression Analysis
Coefficients    Estimate    Standard error    t         p
(Intercept)     -.019       .012              -1.531    .136
Difficulty       .001       .005                .211    .834
Dialogue         .007       .013                .546    .589
Phrase          -.019       .017              -1.147    .260
Noun             .001       .014                .103    .919
Verb             .009       .013                .707    .485
Academic         .038       .013               2.839    .008**
Idiomatic        .001       .014                .064    .949
R² = .3068
Note: ** denotes p < .01
V. DISCUSSION AND CONCLUSION
In this study, three DIF detection methods were applied to an English vocabulary test.
The items showing DIF between male and female examinees were similar across the three
methods. The items were further investigated by regressing $\hat{\beta}_{UNI}$, the effect size measure
for SIBTEST, on the item difficulty and item categories. Only the "Academic" category was
found to be statistically significant in favor of male examinees.
The results showed that use of cognitively demanding academic words may lead to DIF
against female examinees, which has implications that may be useful for teaching purposes.
That academic words are associated with DIF means female examinees have weaker academic
vocabulary knowledge than male examinees of the same vocabulary level. Thus, teachers of English
vocabulary may decide to spend more time with female students when teaching academic
words.
Since vocabulary knowledge is strongly connected to reading and listening
comprehension (Pearson et al., 2007), the DIF results can be useful in reading and
listening classes as well. It is likely that female students have relatively greater difficulty
comprehending academic passages in reading or listening. Reading and listening teachers
may thus spend more time with female students on vocabulary when the contents are
rather academic. Because only one predictor can be allowed for academic and
nonacademic categories (due to multicollinearity), significance of the "Academic"
category also means that female learners are more proficient at nonacademic words than
male learners of the same level. Thus, teachers of English may need to be more careful
with male students on nonacademic vocabulary. All in all, different approaches are
warranted between male and female L2 learners when teaching vocabulary of specific
categories.
The purpose of this study was multi-fold. First, several popular DIF detection methods
were compared on vocabulary assessment. Parametric and nonparametric methods are
often compared using Monte Carlo simulation techniques (e.g., Bolt, 2002). Since
different methods adopt different approaches, their performances are also different. It is of
interest to compare the performance of various DIF detection methods on empirical
vocabulary assessment data. In this study, the parametric and nonparametric methods
produced quite similar results.
Second, this study sought to identify sources that explain DIF. The paradigm of
DIF studies is shifting. DIF was first studied as a statistical method of finding item bias.
DIF detection methods were developed and compared to find well-performing methods
(i.e., powerful detection of DIF with a proper control of Type I error rates) under various
conditions. In this framework, DIF studies in language testing had only limited applicability.
DIF detection methods were only applied to find biased items to remove. However, the
paradigm is changing. Researchers have started to examine what causes DIF (Abbott, 2007). The
current study used a regression analysis to identify significant predictors of DIF. This
endeavor involved item categorization. Only one item category was found to be
statistically significant in this study; however, better categorization of items may reveal
better predictors of DIF. Note that only about 30% of the variability of the DIF
effect size measures was explained by the predictors used in this study (Table 5).
In addition to the significant finding that can be useful when teaching English, this study
also serves as a framework that can be applied to other subtests and with better item
categorization. It is easy to conduct the same type of analysis if relevant item
categorization is available. Cumulative evidence from different tests with di
REFERENCES
Abbott, M. L. (2007). A confirmatory approach to differential item functioning on an ESL
reading assessment. Language Testing, 24, 7-36.
Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric
polytomous DIF detection methods. Applied Measurement in Education, 15, 113-141.
Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Thousand
Oaks, CA: Sage.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests.
Language Testing, 2, 155-163.
Daller, H., Milton, J., & Treffers-Daller, J. (Eds.). (2007). Modelling and assessing
vocabulary knowledge. Cambridge, UK: Cambridge University Press.
Douglas, J., Roussos, L. A., & Stout, W. F. (1996). Item-bundle DIF hypothesis testing:
Identifying suspect bundles and assessing their differential functioning. Journal of
Educational Measurement, 33, 465-484.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,
NJ: Lawrence Erlbaum.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the
Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145).
Hillsdale, NJ: Lawrence Erlbaum.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ:
Lawrence Erlbaum.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge, UK: Cambridge
University Press.
Hyde, J., & Linn, M. (1988). Gender differences in verbal ability: A meta-analysis.
Psychological Bulletin, 104, 53-69.
Kim, Mikyung. (2001). Detecting DIF across the different language groups in a speaking
test. Language Testing, 18, 89-114.
Maller, S. J. (2001). Differential item functioning in the WISC-III: Item parameters for
boys and girls in the national standardization sample. Educational and
Psychological Measurement, 61, 793-817.
Pae, Tae-Il. (2004). DIF for examinees with different academic backgrounds. Language
Testing, 21, 53-73.
Pearson, P. D., Hiebert, E. H., & Kamil, M. L. (2007). Vocabulary assessment: What we
know and what we need to learn. Reading Research Quarterly, 42, 282-296.
Penfield, R. D., & Lam, T. C. M. (2000). Assessing differential item functioning in
performance assessment: Review and recommendations. Educational
Measurement: Issues and Practice, 19, 5-15.
RAND Reading Study Group. (2002). Reading for understanding: Toward an R&D
program in reading comprehension. Santa Monica, CA: RAND.
Read, J. (2000). Assessing vocabulary. Cambridge, UK: Cambridge University Press.
Read, J., & Chapelle, C. A. (2001). A framework for second language vocabulary
assessment. Language Testing, 18, 1-32.
Roussos, L. A., & Stout, W. F. (1996). A multidimensionality-based DIF analysis
paradigm. Applied Psychological Measurement, 20, 355-371.
Ryan, K. E., & Bachman, L. F. (1992). Differential item functioning on two tests of EFL
proficiency. Language Testing, 9, 12-29.
Sasaki, M. (1991). A comparison of two methods for detecting differential item
functioning in an ESL placement test. Language Testing, 8, 95-115.
Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates
true bias/DIF from group ability differences and detects test bias/DTF as well as
item bias/DIF. Psychometrika, 58, 159-194.
Stout, W. F. (1990). A new item response theory modeling approach with applications to
unidimensionality assessment and ability estimation. Psychometrika, 55, 293-325.
Stout, W. F., & Roussos, L. A. (1992). SIBTEST user manual. Unpublished manuscript.
Available from W. F. Stout, University of Illinois at Urbana-Champaign.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary
test. Language Testing, 17, 323-340.
Thissen, D. (2001). IRTLRDIF (Version 2) [Computer software]. Retrieved May 1, 2009,
from http://www.unc.edu/~dthissen/dl.html
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study
of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity
(pp. 147-169). Hillsdale, NJ: Lawrence Erlbaum.
APPENDIX
Sample Test Items
1. Gap-filling: Dialogue
A: I am going to the concert tomorrow, too.
B: What a nice _____
a) impression     b) happiness
c) coincidence    d) luck
KEY: c
2. Gap-filling: Sentence
This VCR player is still under _____ and can be repaired free of charge.
a) payment        b) mortgage
c) coverage       d) warranty
KEY: d
Note: These are not actual items used in this study.
Applicable levels: primary education, secondary education, tertiary education
Key words: vocabulary assessment, EFL test, differential item functioning (DIF)
Chanho Park
Division of the College Scholastic Ability Test Research and Management
Korea Institute for Curriculum and Evaluation (KICE)
Jeongdong Bldg. 15-5 Jeong-dong, Jung-gu
Seoul 100-784, Korea
Tel: 02-3704-3613
Fax: 02-3704-3690
Email: cpark@kice.re.kr
Received in June, 2010
Reviewed in July, 2010
Revised version received in August, 2010