Psychological Methods, 2000, Vol. 5, No. 1, 125-146
Copyright 2000 by the American Psychological Association, Inc. 1082-989X/00/$5.00 DOI: 10.1037//1082-989X.5.1.125
Using IRT to Separate Measurement Bias From True Group
Differences on Homogeneous and Heterogeneous Scales:
An Illustration With the MMPI
Niels G. Waller
Vanderbilt University
Jane S. Thompson and Ernst Wenk
University of California, Davis
The authors present a didactic illustration of how item response theory (IRT) can
be used to separate measurement bias from true group differences on homogeneous
and heterogeneous scales. Several bias detection methods are illustrated with 12
unidimensional Minnesota Multiphasic Personality Inventory (MMPI) factor scales
(Waller, 1999) and the 13 multidimensional MMPI validity and clinical scales. The
article begins with a brief review of MMPI bias research and nontechnical reviews
of the 2-parameter logistic model (2-PLM) and several IRT-based methods for bias
detection. A goal of this article is to demonstrate that homogeneous and heteroge-
neous scales that are composed of biased items do not necessarily yield biased test
scores. To that end, the authors perform differential item- and test-functioning
analyses on the MMPI factor, validity, and clinical scales using data from 511
Blacks and 1,277 Whites from the California Youth Authority.
Few topics in applied psychometrics have gener-
ated as much controversy and confusion as the
complementary issues of measurement invariance and
measurement bias. Although formal definitions of
measurement invariance (Holland & Wainer, 1993;
Meredith, 1993) are widely known by psychometri-
cians, and psychometrically defensible methods for
identifying biased items and tests have been available
Niels G. Waller, Department of Psychology and Human
Development, Vanderbilt University; Jane S. Thompson
and Ernst Wenk, Department of Psychology, University of
California, Davis.
We express our sincere thanks to Chris Fraley, Tim
Gaffney, Lew Goldberg, Caprice Niccoli-Waller, Steve
Reise, Auke Tellegen, Craig Thompson, and three anony-
mous reviewers for helpful comments on an earlier version
of this article. All the tests for differential item and test
functioning described in this article can be calculated with
an S-PLUS function called LINKDIF (Waller, 1998). Re-
searchers interested in applying the item response theory
methods described in this article may download LINKDIF
from the following website: http://peabody.vanderbilt.edu/
depts/psych_and_hd/faculty/wallern/.
Correspondence concerning this article should be ad-
dressed to Niels G. Waller, Department of Psychology and
Human Development, Box 512, Peabody College, Vander-
bilt University, Nashville, Tennessee 37203. Electronic
mail may be sent to [email protected].
for some time (Oshima, Raju, & Flowers, 1997; Raju,
1988, 1990; for nontechnical reviews see Reise,
Widaman, & Pugh, 1993; Widaman & Reise, 1997),
the assessment community has yet to fully assimilate
this work. As a case in point, consider the voluminous
literature about racial bias on the Minnesota Multi-
phasic Personality Inventory (MMPI; Hathaway &
McKinley, 1940) and the MMPI-2 (Butcher, Dahl-
strom, Graham, Tellegen, & Kaemmer, 1989). Much
of this literature concerns Black-White differences on
the MMPI validity and clinical scales (for reviews,
see Dahlstrom & Gynther, 1986; Greene, 1987, 1991,
chap. 8) and the supposed implications of these dif-
ferences for valid test interpretation (Gynther &
Green, 1980; McNulty, Graham, Ben-Porath, & Stein,
1997; Pritchard & Rosenblatt, 1980a; Timbrook &
Graham, 1994). Given that the MMPI and MMPI-2
are the most widely used psychological tests in the
world (Butcher & Rouse, 1996; Lubin, Larsen, Mata-
razzo, & Seever, 1985), any biases that are found in
the inventory would have legal, ethical, and practical
implications (Gottfredson, 1994). Thus, it is surpris-
ing that modern techniques for assessing measure-
ment bias have not been applied to the MMPI inven-
tories.
In this article we review several contemporary item
response theory (IRT; Hambleton, Swaminathan, &
Rogers, 1991; Lord, 1980) models for assessing item
126 WALLER, THOMPSON, AND WENK
and scale bias on unidimensional (Raju, 1988, 1990)
and multidimensional (Oshima et al., 1997) scales.
One of our goals is to encourage the assessment com-
munity to use these psychometric models when con-
ducting group comparisons research (e.g., in racial,
ethnic, cross-cultural, or gender group comparisons).
Toward this end we demonstrate how IRT can be used
to elucidate the psychometric properties of 12 homo-
geneous factor scales that can be scored on the MMPI
(Waller, 1999). We realize that most MMPI and
MMPI-2 scales are not homogeneous, a fact that
likely explains why IRT models for item and scale
bias have heretofore not been applied to the MMPI.
Although scale heterogeneity (i.e., multidimensional-
ity) is a putative impediment in the application of IRT
to the MMPI validity and clinical scales, we demon-
strate how unidimensional IRT models can be used to
assess measurement invariance on these scales.
This article is organized as follows. The first sec-
tion provides a brief review of the MMPI bias litera-
ture with respect to Black-White differences in scale
elevation. The second section provides a relatively
nontechnical introduction to the two-parameter logis-
tic IRT model (2-PLM; Birnbaum, 1968; Lord, 1980).
The third section describes how this model can be
used to separate measurement bias from true group
differences on estimated latent variables. The fourth
section characterizes the samples used in our didactic
example and reports the results of a series of analyses
aimed at detecting differential item and test function-
ing on the MMPI factor, validity, and clinical scales.
Finally, in the last section, we discuss the implications
of our analyses for future research aimed at distin-
guishing measurement bias from true group differ-
ences on homogeneous and heterogeneous scales.
Black-White Differences on the MMPI: A Brief Review of the Literature
The literature on MMPI Black-White differences
has been characterized by a level of passion not often
found in academic writing. During the 1970s and
1980s, for example, articles routinely appeared with
provocative titles such as "Is the MMPI an Appropri-
ate Assessment Device for Blacks?" (Gynther, 1981)
and "White Norms and Black MMPIs: A Prescription
for Discrimination?" (Gynther, 1972). Reviewers of
this literature (Gynther & Green, 1980; Pritchard &
Rosenblatt, 1980a, 1980b) expressed strong opinions,
and they frequently came away with widely opposing
conclusions when reviewing similar bodies of work
(Greene, 1987; Gynther, 1989).
Two points of contention galvanized the contro-
versy during this period. The first was whether Blacks
scored significantly higher than Whites on various
MMPI scales, and the second was whether those dif-
ferences, if they existed, were attributable to biased
measurement. Dozens of articles compared Blacks
and Whites on the MMPI validity and clinical scales
(reviewed in Dahlstrom, Lachar, & Dahlstrom, 1986;
Greene, 1991). Many others attempted to document
Black-White differences at the item level (Bertelson,
Marks, & May, 1982; Costello, 1973, 1977; Gynther
& Witt, 1976; Harrison & Kass, 1967, 1968; Jones,
1978; Miller, Knapp, & Daniels, 1968; Witt &
Gynther, 1975).
Gynther (1972, 1989; Gynther & Green, 1980;
Gynther, Lachar, & Dahlstrom, 1978), in an influen-
tial series of articles, argued that a close examination
of the literature revealed a consistent pattern in which
"blacks, whether normal or institutionalized, gener-
ally obtain higher scores than whites on Scales F [In-
frequency], 8 [Sc; Schizophrenia] and 9 [Ma; Hypo-
mania]" (Gynther, 1972, p. 386). He also suggested
that these differences stemmed from inherent biases in
the test, and consequently he called for race-based
norms for scoring and interpreting the MMPI
(Gynther & Green, 1980; Gynther et al., 1978;
Gynther, 1972). Other researchers (e.g., Pritchard &
Rosenblatt, 1980a, 1980b) were quick to disagree, and
some pointed out that without further information
Black-White score differences could not speak to is-
sues of test bias or test fairness. Pritchard and Rosen-
blatt (1980b), for instance, noted that "scale differ-
ences between racial subgroups imply differential
rates of classification error only when the racial sub-
groups in a sample have equivalent base rates for
psychopathology" (p. 273, emphasis added). These
authors also noted that none of the comparisons cited
by Gynther and others had adequately matched their
samples on psychopathology, and thus the implica-
tions of those studies with respect to differential as-
sessment validity were uninterpretable.
Following Pritchard and Rosenblatt's (1980b) com-
mentary, numerous studies matched Black and White
samples on several moderator variables that were be-
lieved to account for the observed group differences
on the MMPI (Bertelson et al., 1982; Butcher,
Braswell, & Raney, 1983; Newmark, Gentry, Warren,
& Finch, 1981; Patterson, Charles, Woodward, Rob-
erts, & Penk, 1981; Penk et al., 1982). Years later, in
his review of this literature, Greene (1987) concluded
that "moderator variables, such as socioeconomic sta-
DIFFERENTIAL FUNCTIONING OF ITEMS AND TESTS 127
tus, education, and intelligence, as well as profile va-
lidity, are more important determinants of MMPI per-
formance than ethnic status" (p. 509). More recently,
Graham (1993) opined that "differences between Af-
rican Americans and Caucasians are small when
groups are matched on variables such as age and
socioeconomic status" (p. 199).
We concur that matching is an important design
feature of valid measurement bias research. We also
believe, however, that the most important matching
variables in this regard have been conspicuously miss-
ing from this literature. To wit, no studies to our
knowledge have matched groups on the estimated la-
tent variables that hypothetically determine MMPI
item response behavior. The absence of such studies
is unfortunate because, as noted by Meredith and
Millsap (1992), "bias detection procedures which rely
exclusively on manifest variables are not generally
diagnostic of bias, or lack of bias" (p. 310). These
authors suggest that "logical alternatives to manifest
variable procedures are procedures that model the
manifest variables in terms of latent variables"
(p. 310).
There are several reasons why bias researchers
should match groups on the latent variables measured
by a test. (In the present case, this is analogous to
matching groups on psychopathology.) First, bias can
create or accentuate group differences on manifest
variables in situations in which there are no differ-
ences on the latent variables. Second, measurement
bias can mask true differences on latent variables such
that differences at the manifest level fail to emerge.
Third, multiitem inventories, such as the MMPI and
MMPI-2, may contain biased items that, when aggregated,
do not produce biased scales. This can occur
when a subgroup of items is biased against the ma-
jority group and a different subgroup of items is bi-
ased against the minority group. When such items are
combined, the effects of the biased items can be mini-
mized or eliminated at the scale level (Harrison &
Kass, 1967; Raju, van der Linden, & Fleer, 1995).
A central theme of this article is that group differ-
ences at the item or scale level can arise from mea-
surement bias, actual group differences, or a combi-
nation of these influences. Thus, studies that report
group differences on observed scores cannot unam-
biguously resolve the question of whether those
scores are equally precise, or equally valid, for dif-
ferent groups. Although the inability of group com-
parisons to resolve issues of measurement bias has
been recognized by the psychometrics community for
some time (Holland & Thayer, 1988; Jensen, 1980),
this uninformative method continues to be used in
many assessment literatures. In the MMPI literature,
for instance, group comparisons of scale means are
sometimes called the difference of means test (Pritch-
ard & Rosenblatt, 1980a, p. 263; see also Greene,
1991, chap. 8; Whitworth & McBlaine, 1993).
In the next two sections, we review the fundamen-
tals of the 2-PLM IRT model and several methods that
are based on this model for identifying biased items
and scales. These methods for assessing differential
functioning of items and tests do not suffer from the
flaws of the so-called difference of means test de-
scribed above. We focus on IRT methods because of
our strong conviction, and the consensus of the psy-
chometrics community (Camilli & Shepard, 1994;
Holland & Wainer, 1993; Millsap & Everson, 1993;
Thissen, Steinberg, & Gerrard, 1986), that IRT mod-
els provide the most powerful methods for detecting
differential functioning of items and scales in group
comparisons research (e.g., racial bias, cross-cultural
comparison, and questionnaire translation research;
see Ellis, Becker, & Kimmel, 1993; Ellis, Minsel, and
Becker, 1989; Huang & Church, 1997; Hulin &
Mayer, 1986). Readers who are familiar with the
2-PLM for dichotomous items (Hambleton et al.,
1991; Reise & Waller, 1990) may wish to proceed to
the third section.
A Brief Overview of the Two-Parameter Logistic IRT Model
The rubric IRT covers an extended family of psy-
chometric models (van der Linden & Hambleton,
1997), and thus we make no attempt to describe these
models in detail. Rather, we briefly describe the fun-
damentals of the 2-PLM because it is the most appro-
priate IRT model for modeling the (MMPI) data in
our didactic example.
An important component of the 2-PLM is called the
item response function (IRF). The IRF is the nonlinear
regression line that characterizes the probability of
a [0/1] keyed response as a function of an underlying
trait value. These nonlinear response functions are
also called item characteristic curves or item trace
lines. An example IRF for the 2-PLM is illustrated in
Figure 1A. Notice that in this figure trait levels are
represented on the x-axis, and the item response
probability is represented on the y-axis. Trait levels are
customarily scaled to a z-score metric such that the
population of trait values has a mean of 0.00 and a
standard deviation of 1.00, although other scalings are
[Figure 1: four panels (A-D), each plotting endorsement probability against theta (trait level).]
Figure 1. Item response theory-based graphical methods for assessing differential item
functioning (DIF) and differential test functioning. A: Item response function. B: Uniform
DIF. C: Nonuniform DIF. D: Test characteristic curve.
possible and sometimes used in IRT applications. No-
tice also that the trait level is called theta in this plot.
In this article, we often use the Greek letter θ (theta)
to denote latent trait values.
The 2-PLM derives its name because the nonlinear
item-trait regression function, like the IRF in Figure
1, is defined by a two-parameter logistic function.
This function can be mathematically defined as

P_j(θ_i) = 1 / {1 + e^[-1.7α_j(θ_i - β_j)]},  (1)

where P_j(θ_i) denotes the probability that an individual
with θ level i will endorse item j in the keyed direction.
The e in Equation 1 denotes the transcendental
number that is approximated by 2.71828. The 1.7 in
this equation is a scaling factor that makes the logistic
IRF similar to the IRF in a two-parameter Normal
Ogive item response model (Lord, 1980). The two
parameters of the 2-PLM are the item slope (α) and
the item threshold (β) parameters. An important characteristic
of this model is that item thresholds (β) and
participant trait levels (θ) are scaled on a common
metric. For a particular item, the value of the item
threshold equals the θ level that corresponds to a .50
probability of endorsing the item in the keyed direction.
Thus, relatively difficult items—that is, items
with low endorsement frequencies—will have high
threshold values (i.e., large β parameters), and they
will be located at the high end of the θ continuum.
Items with low thresholds will be located at the low
end of the θ continuum. The value of the item slope
(α) is a function of the steepness of the IRF at that
point on the trait continuum where θ = β_j. This
parameter is related to the item factor loading such
that items with steep slopes have large factor loadings
(Takane & De Leeuw, 1987). In other words, items
with large slope parameters (α) have comparatively
high saturations of trait relevant variance. Although
we have left out many important details about IRFs,
such as how the IRF is estimated (see Baker, 1992, for
a review), we have presented the requisite information
for understanding how IRT uses the IRF to distinguish
item and scale bias from true group differences on
latent variables.
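Equation 1 is straightforward to compute directly. The following Python sketch is our own illustration (the article's analyses used the S-PLUS function LINKDIF; the function name and parameter values below are invented):

```python
import math

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF (Equation 1): the probability of a
    keyed response at trait level theta, for an item with slope a and
    threshold b.  The 1.7 factor approximates the normal ogive."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

# When theta equals the threshold, the endorsement probability is .50.
p = irf_2pl(theta=0.5, a=1.2, b=0.5)   # 0.5 exactly
```

Items with larger slopes produce steeper curves around θ = β, and items with higher thresholds shift the curve toward the high end of the trait continuum.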
Using Item Response Theory to Separate
Measurement Bias From True Group
At the item level, measurement equivalence is ob-
tained whenever the IRFs for two groups do not differ
(Reise et al., 1993). In effect, this implies that the
probability of endorsing an item in the keyed direction
is the same for two individuals with equal trait values
(i.e., individuals who are perfectly matched on the
latent trait) regardless of group membership. Notice
that in this definition of measurement equivalence, we
are not assuming that individuals from different
groups will have identical endorsement probabilities.
On the contrary, measurement equivalence requires
that we observe equal endorsement probabilities for
individuals with equal trait values.
Consider the two IRFs in Figure 1B. For purposes
of illustration, we can imagine that these IRFs repre-
sent the item-trait regression functions for Blacks and
Whites on an MMPI item. Let the solid line denote the
IRF for Whites and the dashed line the IRF for
Blacks. Notice that the probability of endorsing this
item in the keyed direction is higher for Whites than
it is for Blacks at virtually all θ levels. When this
occurs we say that the item shows evidence of uniform
differential item functioning¹ (DIF; Camilli &
Shepard, 1994; Holland & Wainer, 1993). Importantly,
the amount of DIF is not constant across trait
levels. Specifically, at very low (-4 to -2) and very
high (+2 to +4) trait levels the IRFs are not dramatically
different, though at more moderate trait levels
the endorsement probabilities differ by as much as .80
on the probability scale.
Figure 1C shows an example of nonuniform DIF.
This plot illustrates the dangers that can arise when-
ever groups are compared on items that have group-
specific IRFs. Notice, for example, that in groups
composed of individuals with low trait levels, Whites
endorse the item more frequently than Blacks. In
groups composed of high trait level individuals, the
opposite pattern occurs. Namely, in high-scoring
groups, Blacks endorse this item more frequently than
do Whites. The possibility of nonuniform DIF should
clearly make one pause before trying to draw conclu-
sions from the literature on group differences in item
endorsement frequencies. For instance, Black-White
differences in MMPI item endorsement frequencies
have been reported in a number of studies (Costello,
1977; Dahlstrom & Gynther, 1986). Greene (1987)
recently noted that "although from 58 to 213 [out of
566] items have been found to differentiate Blacks
from Whites [on the MMPI] in a given study, there
has been limited overlap among these items across
studies" (p. 503). If many MMPI items show evidence
of nonuniform DIF, we would expect such a diversity
of findings in studies with samples that vary in aver-
age trait level.
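The distinction between uniform and nonuniform DIF (Figures 1B and 1C) can be made concrete numerically. In the sketch below the group-specific slopes and thresholds are invented for illustration; only the qualitative pattern matters:

```python
import math

def irf(theta, a, b):
    """2-PLM item response function (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

thetas = [-3, -1, 0, 1, 3]

# Uniform DIF: equal slopes, different thresholds.  The gap between
# the two curves has the same sign at every trait level.
uniform = [irf(t, 1.0, -0.5) - irf(t, 1.0, 0.5) for t in thetas]

# Nonuniform DIF: different slopes.  The IRFs cross, so the sign of
# the gap flips along the trait continuum.
nonuniform = [irf(t, 0.5, 0.0) - irf(t, 1.5, 0.0) for t in thetas]
```

With these values every entry of `uniform` is positive, whereas `nonuniform` is positive at low trait levels and negative at high trait levels, which is why samples that differ in average trait level can disagree about which group endorses an item more often.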
Item Response Theory Tests of Differential Item Functioning
From an IRT perspective, several methods are
available for detecting the magnitude and significance
of different IRFs (see Holland & Wainer, 1993, for a
review). In this study we calculated five IRT mea-
sures of DIF: (a) Lord's χ² test of DIF (Lord, 1980),
(b) Raju's signed area (SA) measure of DIF (Raju,
1988, 1990), (c) Raju's unsigned area (USA) measure
of DIF (Raju, 1988, 1990), (d) Raju et al.'s measure of
compensatory DIF (CDIF; Raju et al., 1995), and
(e) Raju et al.'s measure of noncompensatory DIF
(NCDIF; Raju et al., 1995). We now briefly describe
these measures.
Lord's χ² measure of DIF (Lord, 1980) is computed
as the weighted squared difference between the item
parameter estimates from two groups. This index is
used to test whether the estimated item slopes (α_Wj,
α_Bj) and item thresholds (β_Wj, β_Bj) differ for two
groups, say, for groups of Whites (W) and Blacks (B).
Lord's test statistic is computed by weighting the
squared discrepancy between the estimated item parameters
by the pooled (estimated) parameter information
matrix. The information matrix is the inverse
of the variance-covariance matrix of the estimated
parameters. Specifically, let

v = (α_Wj - α_Bj, β_Wj - β_Bj)  (2)
¹ Technically, the terms DIF and item bias are not synonymous.
DIF is a psychometric property of an item in two
or more groups, whereas item bias is a value judgment
concerning the personal, social, or ethical implications of
DIF. In this article we often use the terms interchangeably.
equal the vector of differences between the estimated
parameters and

Σ  (3)

equal the pooled variance-covariance matrix of the
parameter estimates. Then Lord's χ² is calculated as

χ² = v′Σ⁻¹v.  (4)

In large samples, Lord's χ² has a central chi-square
distribution with two degrees of freedom.
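Equations 2-4 amount to a few lines of arithmetic. In the sketch below the parameter estimates and their covariance matrices are invented (in practice both come from the calibration program), and the pooled matrix is taken as the sum of the two groups' covariance matrices, the usual form for independent samples:

```python
def lord_chi2(est_w, est_b, cov_w, cov_b):
    """Lord's chi-square for one item (Equations 2-4): the vector of
    (slope, threshold) differences, weighted by the inverse of the
    pooled 2x2 covariance matrix of the estimates."""
    v = [est_w[0] - est_b[0], est_w[1] - est_b[1]]            # Eq. 2
    s = [[cov_w[0][0] + cov_b[0][0], cov_w[0][1] + cov_b[0][1]],
         [cov_w[1][0] + cov_b[1][0], cov_w[1][1] + cov_b[1][1]]]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    # Quadratic form v' * inv(s) * v  (Equation 4)
    return (v[0] * (inv[0][0] * v[0] + inv[0][1] * v[1])
            + v[1] * (inv[1][0] * v[0] + inv[1][1] * v[1]))

# Illustrative (slope, threshold) estimates and covariance matrices.
chi2 = lord_chi2((1.2, 0.3), (0.9, 0.6),
                 [[0.010, 0.001], [0.001, 0.008]],
                 [[0.014, 0.002], [0.002, 0.012]])
# In large samples, chi2 is referred to a chi-square with 2 df.
```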
Like many asymptotic test statistics, Lord's χ² suffers
from the fact that the statistic is valid only in large
samples, yet in large samples almost any difference
between the estimated item parameters of two groups
will reach statistical significance. Thus, there has been
a trend in recent years to place more emphasis on IRT
measures of DIF that yield measures of effect size as
well as tests of significance. Two such measures that
were calculated in the present study are Raju's SA and
USA indices (Raju, 1988, 1990). Both of these indices
quantify DIF by integrating the area (i.e., summing up
the distances) between two IRFs. Specifically, if P_W
and P_B denote the estimated IRFs for two groups, say,
Whites and Blacks, then for item j,

SA_j = ∫ (P_W - P_B) dθ  (5)

and

USA_j = ∫ |P_W - P_B| dθ.  (6)
In words, Equation 5 says that the SA index is com-
puted by adding up the area between the two IRFs for
all trait levels (6) from negative infinity to positive
infinity. Note that when the IRFs show evidence of
nonuniform DIF, as in Figure 1C, for some trait levels
the area between the IRFs will be positive and for
other trait levels the area will be negative. Thus, the
total signed area, represented by SA, might be small
(or even 0.00) even when the IRFs differ substan-
tially. The SA can take on positive or negative values.
The USA, on the other hand, equals the sum of the
absolute values of the differences between the IRFs,
and thus the USA can take on only positive values.
The derivations of Equations 5 and 6 are presented in
Raju's (1988) study. Raju (1990) presented formulas
for determining the statistical significance of the es-
timated SA and USA indices.
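Raju (1988) gives closed-form solutions for these areas; a brute-force numerical version shows what is being computed. The integration grid and the item parameters below are our own illustrative choices:

```python
import math

def irf(theta, a, b):
    """2-PLM item response function (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def area_indices(a_w, b_w, a_b, b_b, lo=-6.0, hi=6.0, n=2000):
    """Approximate Raju's SA and USA (Equations 5 and 6) by summing
    the signed and absolute gaps between two IRFs over a fine grid."""
    step = (hi - lo) / n
    sa = usa = 0.0
    for k in range(n):
        theta = lo + (k + 0.5) * step
        diff = irf(theta, a_w, b_w) - irf(theta, a_b, b_b)
        sa += diff * step
        usa += abs(diff) * step
    return sa, usa

# Uniform DIF with equal slopes: the curves never cross, so USA = |SA|,
# and SA equals the threshold difference.
sa, usa = area_indices(1.0, -0.5, 1.0, 0.5)

# Nonuniform DIF with equal thresholds: the curves cross, the positive
# and negative areas cancel, and SA is near zero although USA is not.
sa2, usa2 = area_indices(0.5, 0.0, 1.5, 0.0)
```

The second pair of indices illustrates the point made above about the SA: a signed area near zero does not mean the IRFs are similar.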
A Scale Composed of Biased Items Is Not
Necessarily Biased
Returning to our running example, when compar-
ing Black-White differences on the MMPI, some re-
searchers (e.g., Harrison & Kass, 1967) have sug-
gested that the largest differences occur at the item
level rather than the scale level. For example, accord-
ing to Harrison and Kass (1967), the MMPI validity
and clinical scales "are not very sensitive to race dif-
ferences, whereas the items are remarkably sensitive.
A canceling-out process must be at work in each
scale" (p. 462). Although, as noted previously, nu-
merous researchers have reported Black-White dif-
ferences for MMPI scales (Dahlstrom & Gynther,
1986; Gynther, 1972), the notion that item differ-
ences—or more interestingly, item biases—might
cancel out at the scale level is an interesting idea that
warrants further consideration. Fortunately, this topic
has received increased attention in the psychometrics
community in recent years (Nandakumar, 1993; Raju
et al., 1995; Shealy & Stout, 1993); some psychometricians
use the terms DIF amplification and cancellation
when describing an item's contribution to differential
test functioning (DTF) or test bias.
To better understand the concept of DTF or test
bias from an IRT perspective requires that we intro-
duce another important idea from IRT: the test char-
acteristic curve (TCC; Hambleton et al., 1991). Sim-
ply put, the (estimated) TCC is the nonlinear
regression of observed scores on the (estimated) IRT
latent trait values. By logical extension of Equation 1,
the predicted true score (T_i) for an individual with
estimated latent trait level θ_i is calculated by summing
the predicted item endorsement probabilities across
all items of a scale. This idea can be mathematically
expressed as

T_i = Σ_{j=1}^{J} P_j(θ_i),  (7)

where T_i is the predicted true score for subject i, J is
the number of items on the scale, and the remaining
terms are defined as before.
As noted above, the TCC is the nonlinear regres-
sion of (predicted) true scores on (estimated) latent
trait levels. An example TCC for a hypothetical 25-
item test is depicted in Figure ID. Using this concept,
we can say that a test (such as an MMPI factor scale)
provides equal expected scores for individuals with
the same latent trait level regardless of group membership when the TCCs for those groups are identical.
If the TCCs are not identical, then at some point along
the trait continuum the expected observed scores for
the two groups will differ. The TCC is the sum of the
IRFs for a particular scale, a fact that clearly explains
why DIF can be amplified or canceled when summing
over items.
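Equation 7, and with it the TCC, is just a sum of IRFs. A small sketch with an invented five-item scale:

```python
import math

def irf(theta, a, b):
    """2-PLM item response function (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def true_score(theta, items):
    """Predicted true score (Equation 7): the sum of the predicted
    endorsement probabilities over all J items of the scale."""
    return sum(irf(theta, a, b) for a, b in items)

# Hypothetical 5-item scale given as (slope, threshold) pairs.
items = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5), (1.1, 1.0)]

# The TCC rises monotonically from 0 toward the number of items.
low, high = true_score(-4.0, items), true_score(4.0, items)
```

Because the TCC is a sum, DIF in one item can be offset by DIF in the opposite direction in another item, which is the amplification-or-cancellation phenomenon at the scale level.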
Raju et al. (1995) have introduced a measure of
DTF that is calculated from the TCCs of two groups.
In terms of our running example, Raju et al.'s DTF
index is calculated as

DTF = (1/n_B) Σ_{i=1}^{n_B} (T_Wi - T_Bi)²,  (8)
where T_B and T_W are the true scores that are derived
from the test characteristic curves for the Black and
White examinees. Notice that for the Black participants
only (n_B; in Equation 8 we are averaging over
the trait scores of Black examinees), we are asking the
following question: Would the estimated true scores
(i.e., expected observed scores) differ if the items
were scored using the estimated item parameters cali-
brated on the White group versus the estimated item
parameters calibrated on the Black group? If the an-
swer to this question is yes, that is, if the TCCs for the
two groups differ, then Equation 8 will yield a large
positive number and we can confidently conclude that
the scale provides differential measurement for
Blacks. However, if the TCCs are similar, then Blacks
and Whites with similar trait estimates (θ) will have
similar predicted true scores (within the boundaries
imposed by measurement error) and Equation 8 will
yield a small number. The square root of Equation 8,
the root differential test functioning index (rDTF), expresses
the differences between the TCCs in the metric of the
observed scores. Thus, the rDTF serves as a useful
effect size measure of bias. Raju et al. (1995) provided
equations for determining the statistical significance of the
DTF.
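Equation 8 and the rDTF can be sketched directly from the TCCs. The two calibrations and the focal-group trait estimates below are invented for illustration:

```python
import math

def irf(theta, a, b):
    """2-PLM item response function (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def true_score(theta, items):
    """Equation 7: sum of endorsement probabilities over the scale."""
    return sum(irf(theta, a, b) for a, b in items)

def dtf(thetas_b, items_w, items_b):
    """Equation 8: over the Black examinees' trait estimates, the mean
    squared gap between true scores computed from the White-calibrated
    and the Black-calibrated item parameters."""
    gaps = [(true_score(t, items_w) - true_score(t, items_b)) ** 2
            for t in thetas_b]
    return sum(gaps) / len(gaps)

# Invented calibrations of a 3-item scale in the two groups.
items_w = [(1.0, -0.5), (1.2, 0.0), (0.9, 0.5)]
items_b = [(1.0, -0.3), (1.2, 0.2), (0.9, 0.7)]
thetas_b = [-1.5, -0.5, 0.0, 0.5, 1.5]   # hypothetical trait estimates

d = dtf(thetas_b, items_w, items_b)
rdtf = math.sqrt(d)   # bias expressed in the raw-score metric
```

If the two calibrations were identical, `d` would be exactly zero; identical TCCs built from different item parameters would likewise drive it toward zero, which is the cancellation case.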
Raju (Raju, van der Linden, & Fleer, 1995) has also
introduced an index of so-called compensatory DIF
(CDIF). Raju's CDIF index measures an item's additive
contribution to a scale's DTF:

DTF = Σ_{j=1}^{J} CDIF_j.
Thus, an item may show substantial DIF in terms of
Lord's χ² or Raju's SA and USA measures but show
relatively little, if any, CDIF. This would happen if
the item DIF was in the opposite direction of the DIF
of other items. Raju (Raju, van der Linden, & Fleer,
1995) also introduced a measure of noncompensatory
DIF, which is calculated as:

NCDIF_j = ∫ [P_Wj(θ) - P_Bj(θ)]² f_B(θ) dθ.
In words, NCDIF is simply the average squared dif-
ference between the expected item endorsement prob-
abilities, where the expectations are calculated from
the two sets of item parameters. As before, the aver-
aging is computed over the distribution of estimated
trait levels for Blacks. In other words, for a given
estimated trait level (θ) for an individual from the
Black sample, we (a) calculate the probability that
item j will be endorsed in the keyed direction when
using the estimated item parameters from the White
calibration, (b) calculate the probability that item j
will be endorsed in the keyed direction when using the
estimated item parameters from the Black calibration,
(c) square the difference between these probabilities,
and (d) calculate the weighted average of the squared
differences for all Black participants in our sample
(f_B(θ_i) denotes the relative frequency of θ_i).
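Steps a through d can be sketched as follows. For simplicity the weighted average is taken over a handful of hypothetical trait estimates with equal weights, which stand in for the relative frequencies f_B(θ_i):

```python
import math

def irf(theta, a, b):
    """2-PLM item response function (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def ncdif(thetas_b, a_w, b_w, a_b, b_b):
    """NCDIF for one item: the average, over the focal group's trait
    estimates, of the squared difference between the endorsement
    probabilities implied by the two calibrations (steps a-d)."""
    gaps = [(irf(t, a_w, b_w) - irf(t, a_b, b_b)) ** 2 for t in thetas_b]
    return sum(gaps) / len(gaps)

# Hypothetical focal-group trait estimates and item calibrations.
thetas_b = [-1.0, -0.5, 0.0, 0.5, 1.0]
value = ncdif(thetas_b, 1.0, -0.2, 1.0, 0.3)   # > 0 when the IRFs differ
```

Because the differences are squared before averaging, NCDIF, unlike CDIF, cannot be canceled by DIF in other items.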
Detecting Differential Item and Test
Functioning: An Empirical Example With
the MMPI
Method
Participants. Our total sample included MMPI
item response data from 1,277 Whites and 511
Blacks. At the time of testing, all participants were
young male offenders committed to the California
Youth Authority (CYA) between January 1964 and
December 1965. These 1,788 individuals are a subset
of the 4,164 consecutive CYA intakes from the Re-
ception Guidance Center at the Deuel Vocational In-
stitution in Tracy, California. These data were origi-
nally collected as part of a larger study designed to
investigate the criminal career paths of youth offend-
ers (Wenk, 1990). Only MMPI protocols that satisfied
purposely conservative selection criteria (described
below) were included in the sample. When the data
were collected, the average age of the White male
offenders was 19.01 years (Mdn = 19, SD = 0.98,
range = 17-23), and the average age of the Black
male offenders was 18.97 years (Mdn = 19, SD =
0.94, range = 16-24). Participant race was coded
from official CYA documents (probation records, ar-
rest records, assessment records, etc.). The youth of-
fenders in this sample had committed a variety of
crimes including murder, auto theft, rape, robbery,
burglary, possession of drugs, assault with a deadly
weapon, arson, and kidnapping.
As part of the normal CYA intake, all youth of-
fenders are administered an extensive test battery.
Thus, we had access to an unusually rich collection of
data. For instance, our data set contained numerous IQ
and achievement measures for each participant. Al-
though these other tests are not the focus of this study,
a few summary findings from these data deserve men-
tion. In particular, as noted previously, several re-
searchers (reviewed in Greene, 1987) have claimed
that observed group differences on the MMPI and
MMPI-2 are minimized or eradicated when the groups
are matched on IQ or other moderator variables. An
examination of the IQ and achievement data for our
participants revealed group differences in the range
found in many other studies (Jensen, 1998). For ex-
ample, on the G Factor of the General Aptitude Test
Battery (Science Research Associates, 1947), Whites
achieved an average score of 99.25 (SD = 16.54),
and Blacks achieved an average score of 84.12 (SD =
13.14). On Raven's Progressive Matrices (Raven,
1960), Whites achieved an average score of 45.92 (SD
= 7.18), and Blacks achieved an average score of
41.80 (SD = 8.50). These group differences are in
line with those reported for other samples during the
mid 1960s. In this study, no attempt was made to
match the two groups on the IQ or achievement data.
Selection of MMPI protocols. When conducting
research involving group comparisons, it is particu-
larly important to exclude potentially invalid proto-
cols from the analyses. Unfortunately, not all studies
in the MMPI literature have taken this precaution.
Greene (1987) noted, for instance, that almost one
third of the MMPI racial bias studies included in his
review made no mention of how invalid protocols
were identified—if indeed they were. For the present
study, we decided to use purposely conservative se-
lection criteria that would err on the side of excluding
possibly valid protocols rather than including possibly
invalid protocols. After reviewing the literature on
MMPI profile validity (Graham, 1993, chap. 3;
Greene, 1991, chap. 3), we settled on the following
criteria. Protocols were selected if (a) the number of
"Cannot say" (omitted) responses was <30, (b) the
Gough F-K index was ≤11, (c) Greene's (1978)
Carelessness score was ≤5, (d) the Lie (L) scale score was
≤7, and (e) the raw F scale score was <15. Several
studies have found that Blacks (Gynther et al., 1978),
delinquent adolescents (McKegney, 1965), and young
adults in general (Archer, 1984, 1987) endorse items
on the F scale at higher rates than do individuals from
the Minnesota normative sample, and thus our selec-
tion criteria may have resulted in the exclusion of
several valid MMPI protocols. Fortunately, our rela-
tively large samples allowed us to use stringent selec-
tion criteria for deeming a protocol valid.
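The selection rules above can be sketched as a simple filter. The function name, dictionary keys, and example scores below are hypothetical, not part of the study's code:

```python
# Hypothetical sketch of the protocol-selection rules described above.
# Cutoffs follow the text; the dict keys are our own naming.

def protocol_is_valid(scores):
    """Apply the study's conservative validity criteria to one protocol.

    `scores` maps scale names to raw scores: 'cannot_say' (omitted items),
    'F', 'K' (for the Gough F-K index), 'carelessness', and 'L'.
    """
    return (
        scores["cannot_say"] < 30            # (a) fewer than 30 omissions
        and scores["F"] - scores["K"] <= 11  # (b) Gough F-K index
        and scores["carelessness"] <= 5      # (c) Greene's Carelessness score
        and scores["L"] <= 7                 # (d) Lie scale
        and scores["F"] < 15                 # (e) raw F scale
    )

# A protocol that meets every criterion:
ok = protocol_is_valid(
    {"cannot_say": 2, "F": 8, "K": 10, "carelessness": 1, "L": 4}
)
# A protocol excluded for an elevated F scale:
bad = protocol_is_valid(
    {"cannot_say": 2, "F": 20, "K": 10, "carelessness": 1, "L": 4}
)
```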
The application of the aforementioned selection cri-
teria to the total sample of Blacks and Whites (N =
2,284) resulted in the exclusion of 25% of the avail-
able protocols from Blacks and 20% of the available
protocols from Whites. Because slightly more Blacks
were deleted from the final sample, we wondered
whether the two samples of excluded protocols dif-
fered in important ways. Specifically, we wondered
whether the samples differed in their mean endorse-
ment rates on the F (Infrequency) scale. Among the
scales included in our analyses, the F scale holds a
unique position because (a) previous researchers have
reported Black-White differences on F (e.g., Gynther,
1972), (b) scores on F are positively correlated with
clinical scales (e.g., Sc) on which group differences
have been found, and (c) F was used to select valid
MMPI protocols. We did not wish the selection cri-
teria to minimize group differences on F, and our
analyses revealed that they did not. The median F
score for the 174 excluded Blacks was 16.00 (M =
15.52, SD = 8.89), and the median F score for the
322 excluded Whites was also 16.00 (M = 14.29, SD
= 9.28). A two-tailed Wilcoxon rank sum test re-
vealed that the two groups did not differ significantly
on F (z = -1.37, p = .17).
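The comparison above can be reproduced in outline with a hand-rolled normal-approximation rank-sum test. This is a sketch with made-up data; real analyses would use a statistics package:

```python
# A minimal sketch of the two-tailed Wilcoxon (Mann-Whitney) rank-sum test
# used to compare F scores of the excluded groups; illustrative data only.
import math

def rank_sum_test(x, y):
    """Normal-approximation rank-sum z statistic and two-tailed p value."""
    combined = sorted((v, i) for i, v in enumerate(x + y))
    n1, n2 = len(x), len(y)
    # Assign midranks to tied scores.
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[combined[k][1]] = midrank
        i = j + 1
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2              # its expectation under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p
```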
Results
Our discussion of Black-White differences on the
MMPI is divided into three parts. First, to characterize
the personality profiles of our samples, we compare
the performance of Whites and Blacks on the MMPI
validity and clinical scales. We then tackle the ques-
tion of MMPI measurement equivalence, at both the
item and scale levels, by conducting IRT analyses of
12 MMPI factor scales (Waller, 1999) in the two sub-
groups. Using the item parameter estimates and esti-
mated latent trait values from these analyses, we then
examine differential functioning of items and tests on
the (unidimensional) factor scales and the (multidi-
mensional) MMPI validity and clinical scales.
Table 1 reports raw score summary statistics for 13
commonly scored validity and clinical scales of the
DIFFERENTIAL FUNCTIONING OF ITEMS AND TESTS 133
Table 1
MMPI Scale Scores and Effect Sizes for Whites and Blacks

                   Whites (n = 1,277)          Blacks (n = 511)
MMPI    Total
scale   items     M      SD    Range         M      SD    Range     Effect size^a
L         15     3.85   1.60   0-7          3.80   1.62   0-7           .03
F         64     6.52   3.21   0-15         7.06   3.18   0-15         -.17
K         30    13.27   4.22   1-26        12.92   3.54   3-26          .08
1 Hs      33     4.44   3.10   0-23         4.93   2.80   0-21         -.16
2 D       60    21.04   4.12   6-39        20.49   3.56   10-41         .13
3 Hy      60    19.37   3.87   4-36        18.61   3.55   —             .20
4 Pd      50    24.63   3.85   7-39        23.71   3.36   9-38          .24
5 Mf      60    23.12   4.12   9-41        23.44   3.85   10-38        -.08
6 Pa      40    10.63   2.97   3-24        10.31   3.04   2-21          .11
7 Pt      48    15.39   7.29   —           15.53   6.14   2-39         -.02
8 Sc      78    15.79   7.47   1-50        17.37   6.72   2-39         -.21
9 Ma      46    19.58   3.97   6-33        21.83   3.55   8-34         -.57
0 Si      70    27.47   8.47   6-57        26.14   6.58   11-50         .16

Note. Minnesota Multiphasic Personality Inventory (MMPI) scale names: L = Lie; F = Infrequency;
K = Defensiveness; Hs = Hypochondriasis; D = Depression; Hy = Hysteria; Pd = Psychopathic
Deviate; Mf = Masculinity-Femininity; Pa = Paranoia; Pt = Psychasthenia; Sc = Schizophrenia;
Ma = Hypomania; Si = Social Introversion.
^a Effect size is calculated as (w̄.025 − b̄.025)/σw.025, where w̄.025, b̄.025, and σw.025 equal the 2.5% trimmed
means and standard deviation of the White (w) and Black (b) groups.
MMPI. When computing the score means and stan-
dard deviations, we trimmed 2.5% off the lower and
upper score distributions to minimize the effects of
outliers on the obtained results (Wilcox, 1998). As
evidenced by these findings, the average profiles for
the Blacks and Whites are very similar. A quantitative
measure of this similarity is provided by the effect
size measures (calculated as the difference between
the White and Black trimmed means, divided by the
White trimmed standard deviation) that are reported
in the final column of Table 1. In our opinion, these
effect sizes are more informative than the results of
simple t tests for each scale. Nevertheless, considering
the large samples in this study—and hence the large
statistical power—it is noteworthy that 5 of the 13
scale comparisons do not reach statistical significance
at the .05 alpha level (L, K, Mf, Pa, and Pt; note that
trimmed means and standard deviations were also
used when calculating the t tests for each scale). As
quantified by the effect sizes, most of the differences
are relatively small. Only Scales 4 (Psychopathic De-
viate; Pd) and 9 (Hypomania; Ma) have moderate
effect sizes according to Cohen's (1988) widely
adopted criteria.
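A minimal sketch of the trimmed effect size defined in the note to Table 1 follows; the function names and toy data are our own, but the 2.5% trimming and White-group denominator follow the text:

```python
# Sketch of the effect size in Table 1: difference of 2.5% trimmed means
# divided by the White trimmed standard deviation. Illustrative data only.
import math

def trim(xs, prop=0.025):
    """Drop the lowest and highest `prop` fraction of scores."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    return xs[k:len(xs) - k] if k else xs

def trimmed_effect_size(white, black, prop=0.025):
    w, b = trim(white, prop), trim(black, prop)
    w_mean = sum(w) / len(w)
    b_mean = sum(b) / len(b)
    w_sd = math.sqrt(sum((x - w_mean) ** 2 for x in w) / (len(w) - 1))
    return (w_mean - b_mean) / w_sd
```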
Perhaps an easier way to grasp the similarity of
these profiles is to look at the plotted scores in Figure
2. Note that the profiles in Figure 2 portray the aver-
age T scores of the Whites and Blacks. Our results
would have differed slightly if we had converted the
profiles of average raw scores (that are reported in
Table 1) into T scores because the MMPI does not use
linear T scores (a linear T score equals 10z + 50).
Because our samples include adolescents and young
adults, we have plotted non-K-corrected T scores,
consistent with standard practice for these age groups
(Archer, 1984, 1987). An inspection of these plots
bolsters our initial impression that the average profiles
for Blacks and Whites are remarkably similar. The
small differences that exist are not of sufficient mag-
nitude to warrant different interpretations of the av-
erage profiles. Both profiles show the characteristic
4-9 code type (i.e., highest elevations on scales Pd
[Psychopathic Deviate] and Ma [Hypomania]) that is
so often seen in delinquent and offender populations
(Graham, 1993).
Although the findings in Table 1 and Figure 2 fail
to show large Black-White differences on the stan-
dard MMPI validity and clinical scales, these results
should not be interpreted as indicating measurement
equivalence for the two groups. For reasons already
stated, observed group differences are irrelevant to
questions of measurement bias unless the groups are
perfectly matched on the latent variables that are mea-
sured by the scales. In the present case, the absence of
group differences may stem from a form of bias that
masks true differences on the latent variables. To rule
Figure 2. Average Minnesota Multiphasic Personality Inventory (MMPI) profiles of White
and Black male youth offenders. MMPI scale names: L = Lie; F = Infrequency; K =
Defensiveness; Hs = Hypochondriasis; D = Depression; Hy = Hysteria; Pd = Psycho-
pathic Deviate; Mf = Masculinity-Femininity; Pa = Paranoia; Pt = Psychasthenia; Sc =
Schizophrenia; Ma = Hypomania; Si = Social Introversion.
out this possibility, it is necessary to focus our analy-
ses at the latent variable level.
Item response theory analyses of differential item
and test functioning on MMPI factor scales. To rig-
orously test hypotheses of item and scale bias from a
model-based perspective (Embretson, 1996), we per-
formed IRT analyses on 12 unidimensional factor
scales that can be scored on the MMPI. These scales
are a subset of the 16 factor scales that are described
in Waller (1999). Each MMPI factor scale was de-
signed to measure a single latent trait. Although no
test is strictly unidimensional, the MMPI factor scales
are dominated by large first dimensions, and thus they
can be considered unidimensional for practical pur-
poses. Four MMPI scales were either too short or
otherwise unsuitable for an IRT study and thus are not
discussed in this article (e.g., our IRT analyses sug-
gested that the IRFs were not monotonically increas-
ing for the Stereotypic Masculine Interests factor
scale). The 12 factor scales (with sample items and
keyed responses) that were analyzed are called (a)
General Maladjustment (Gm; "Life is a strain for me
much of the time"; true), (b) Psychotic Ideation (Ps;
"I commonly hear voices without knowing where they
come from"; true), (c) Antisocial Tendencies (At;
"During one period when I was a youngster I engaged
in petty thievery"; true), (d) Stereotypic Feminine In-
terests (Fe; "I enjoy reading love stories"; true), (e)
Extroversion (Ex; "I enjoy the excitement of a
crowd"; true), (f) Family Attachment (Fm; "There is
very little love and companionship in my family as
compared to other homes"; false), (g) Christian Fun-
damentalism (Cf; "I believe in the Second Coming of
Christ"; true), (h) Phobias and Fears (Ph; "I am afraid
of fire"; true), (i) Social Inhibition (So; "I wish I were
not so shy"; true), (j) Cynicism (Cy; "Most people
make friends because friends are likely to be useful to
them"; true), (k) Assertiveness (As; "I like to let
people know where I stand on things"; true), and (1)
Somatic Complaints (Sm; "Much of the time my head
seems to hurt all over"; true). A fuller description of
these scales, such as their item composition, average
Table 2
Item Parameters and Differential Item Functioning (DIF) for MMPI Phobias and Fears Scale

                                            Whites         Blacks
MMPI no.  Abbreviated content               a      β       a      β       χ²      p     SA     z(SA)   USA   z(USA)   CDIF    NCDIF^a
128   Blood does not frighten me. (F)      0.63   1.81    0.62   2.22     8.97   .01    0.41    2.07   0.41    2.07    0.007   0.005
131   Do not worry about catching
        diseases. (F)                      0.43   0.91    0.41   0.70     1.81   .40   -0.21   -1.22   0.22    1.26    0.000   0.001
166   Afraid when looking down from
        heights. (T)                       0.86   1.06    0.70   1.51    14.15  <.01    0.45    3.76   0.46    3.53    0.010   0.011
176   No fear of snakes. (F)               0.75   0.86    0.72   0.69     3.01   .22   -0.17   -1.64   0.17    1.70    0.000   0.002
270   Do not worry whether the door
        is locked and windows
        closed. (F)                        0.23   0.40    0.33  -1.20    45.02  <.01   -1.60   -4.55   1.78    5.37   -0.010   0.039
367   Afraid of fire. (T)                  0.83   1.19    0.88   1.38     5.50   .06    0.19    1.85   0.19    1.97    0.002   0.003
392   Afraid of windstorms. (T)            0.67   2.91    0.75   2.44     3.74   .15   -0.47   -1.68   0.48    1.65   -0.008   0.003
401   No fear of water. (F)                0.78   1.61    0.70   1.89     3.74   .15    0.28    1.86   0.29    1.75    0.006   0.003
480   Afraid of the dark. (T)              1.29   2.22    0.60   3.73    18.91  <.01    1.51    3.74   1.57    3.67    0.021   0.024
492   Afraid of earthquakes. (T)           0.44   0.66    0.52   0.44     3.40   .18   -0.21   -1.44   0.33    1.60   -0.004   0.002
522   No fear of spiders. (F)              0.87   0.31    1.01   0.65    19.05  <.01    0.35    4.22   0.35    4.39   -0.001   0.009

Note. z equals the parameter estimate divided by the estimate's standard error. Direction of
keying follows abbreviated content: T = true; F = false. MMPI = Minnesota Multiphasic
Personality Inventory; SA = signed area index; USA = unsigned area index; CDIF =
compensatory DIF; NCDIF = noncompensatory DIF.
^a All p values for NCDIF are less than .01.
reliabilities in diverse samples, and correlations with
MMPI clinical and validity scales, is reported in
Waller (1999).
As a first step in the IRT analyses, marginal maxi-
mum-likelihood IRT item parameters were estimated
for the items of the 12 factor scales. These analyses
were conducted separately for Whites and Blacks and
for each factor scale using BILOG 3.10 (Mislevy &
Bock, 1990). BILOG 3.10 is a Windows-based pro-
gram for estimating the parameters of the 1 -, 2-, or
3-parameter logistic (unidimensional) IRT models by
marginal maximum likelihood. All BILOG program
defaults were used in these analyses.2 After calibrat-
ing the items, we compared the estimated and the
empirical 2-PLM IRFs for each item in the two
samples. The estimated 2-PLM IRFs are calculated by
substituting the group-specific item parameter esti-
mates in Equation 1. The empirical IRFs are calcu-
lated by grouping the maximum likelihood θ esti-
mates from the IRT analyses into nine nonoverlapping
intervals and then determining the average item en-
dorsement frequency for each θ interval.
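The estimated-versus-empirical IRF check can be sketched as follows. The scaling constant D = 1.7 in the 2-PLM and the simulated item are assumptions for illustration:

```python
# A sketch of the model check described above: the estimated 2-PLM IRF is
# compared with an empirical IRF built from nine nonoverlapping theta
# intervals. The scaling constant D = 1.7 and the simulated data are
# assumptions for illustration.
import math, random

D = 1.7

def irf(theta, a, b):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def empirical_irf(thetas, responses, n_bins=9):
    """Average endorsement frequency within each theta interval."""
    lo, hi = min(thetas), max(thetas)
    width = (hi - lo) / n_bins or 1.0
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for t, u in zip(thetas, responses):
        k = min(int((t - lo) / width), n_bins - 1)
        sums[k] += u
        counts[k] += 1
    midpoints = [lo + (k + 0.5) * width for k in range(n_bins)]
    props = [s / c if c else None for s, c in zip(sums, counts)]
    return midpoints, props

# Simulate responses to one item; the empirical proportions should track
# the model curve when the 2-PLM fits.
random.seed(0)
a, b = 0.9, 1.0
thetas = [random.gauss(0, 1) for _ in range(5000)]
resp = [1 if random.random() < irf(t, a, b) else 0 for t in thetas]
mids, props = empirical_irf(thetas, resp)
```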
When we compared the estimated and empirical
2-PLM IRFs, we found that virtually all of the items
on the 12 factor scales could be successfully cali-
brated with the 2-PLM. Specifically, the vast majority
of points of the empirical IRFs fell within the 95%
tolerance intervals of the estimated 2-PLM IRFs. We
should note that these comparisons were conducted
after linking the two sets of item parameters to a com-
mon metric. To accomplish the item linking, we used
the linking procedure of Stocking and Lord (1983) as
implemented in the software routine LINKDIF
(Waller, 1998).
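The Stocking-Lord step can be illustrated with a crude grid search: choose a slope A and intercept B for the metric transformation so that the two test characteristic curves agree as closely as possible. LINKDIF itself uses a proper minimization; the grid, step sizes, and toy parameters below are ours:

```python
# A rough sketch of Stocking-Lord linking for the 2-PLM: under the
# transformation theta* = A*theta + B, slopes become a/A and thresholds
# become A*b + B. D = 1.7 is an assumed scaling constant.
import math

D = 1.7

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected raw score at theta."""
    return sum(p2pl(theta, a, b) for a, b in items)

def stocking_lord(source_items, target_items, thetas):
    """Grid-search (A, B) minimizing the squared TCC difference."""
    target = [tcc(t, target_items) for t in thetas]
    best_A, best_B, best_loss = 1.0, 0.0, float("inf")
    for i in range(76):                      # A in [0.5, 2.0], step .02
        A = 0.5 + 0.02 * i
        for j in range(101):                 # B in [-1.0, 1.0], step .02
            B = -1.0 + 0.02 * j
            moved = [(a / A, A * b + B) for a, b in source_items]
            loss = sum((target[k] - tcc(t, moved)) ** 2
                       for k, t in enumerate(thetas))
            if loss < best_loss:
                best_A, best_B, best_loss = A, B, loss
    return best_A, best_B

# Recover a known transformation: the target parameters are the source
# parameters rescaled with A = 1.2, B = 0.3.
source = [(1.0, -0.5), (0.8, 0.4), (1.3, 1.1)]
target = [(a / 1.2, 1.2 * b + 0.3) for a, b in source]
thetas = [-3 + 0.25 * k for k in range(25)]
A, B = stocking_lord(source, target, thetas)
```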
Having estimated the item parameters in the two
groups, we were finally in a position to look for DIF
and DTF in the 12 factor scales. LINKDIF (Waller,
1998) was also used to calculate the five DIF and DTF
measures introduced in the previous sections. Table 2
reports our findings for the 11 items of the Phobias
and Fears (Ph) factor scale. A graphical display of
2 BILOG uses a normal prior for the latent trait distribu-
tion in the marginal maximum-likelihood estimation of item
parameters. For some scales, such as the MMPI Ps scale, a
normal prior for θ may be unreasonable. We investigated
the influence of the prior distribution of θ on the final item
parameter estimates by also analyzing the data using em-
pirically generated prior distributions (starting from either
normal or uniform distributions). These empirical priors
were estimated during the item parameter estimation phase
(using the BILOG FREE command on the CALIB line). Our
results suggested that the form of the prior had little effect
on the final item parameter estimates (though it did have a
noticeable effect on the estimated distribution of θ). Thus,
without further information, we believe that a normal prior
can be justified in these moderately sized samples.
these findings is also provided by the 2-PLM IRFs in
Figure 3. Several aspects of these results and plots
warrant discussion. First, notice that a number of
items from the Phobias and Fears factor scale show
significant DIF as measured by Lord's χ² and Raju's
signed (SA) and unsigned (USA) measures. (We have
calculated agreement indices [kappas] for all pairs of
DIF indices for the 383 items that are included on the
12 MMPI factor scales. A summary of these results
can be obtained from Niels G. Waller.) On the basis of
Lord's χ², five items show significant item parameter
differences at the .01 significance level: Items 128,
166, 270, 480, and 522. The other DIF indices re-
ported in Table 2 also identify these items as showing
DIF. Notice, however, that these differences are not
always in the same direction. For instance, at a θ level
of 2.00, Blacks are more likely than Whites to endorse
Item 270, "When I leave home I do not worry about
whether the door is locked and the windows closed,"
in the keyed direction, false. It is not difficult to imag-
ine why Blacks and Whites have different endorse-
ment probabilities on this item when the groups are
matched on the latent Phobias and Fears construct.
Many of the Black youth offenders in our sample
lived in crime-ridden neighborhoods and housing pro-
jects where unlocked doors and windows would be
invitations for robbery. Thus, although Item 270 is a
valid measure of a more general Phobias and Fears
construct, it also taps a specific fear (i.e., the fear of
being a crime victim) that may be a realistic concern
in some environments. For Item 480, on the other
hand, the situation is quite different. At a θ level of
2.00, Whites are more likely than Blacks to answer
"True" to the statement "I am often afraid of the
dark." We do not know why Blacks and Whites re-
spond differently to this item. We do know that our
IRT analyses have elucidated many interesting item
differences that provide hypotheses for further study.
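Raju's area measures reported in Table 2 can be approximated numerically (Raju, 1988, gives closed forms); the quadrature grid and item parameters here are illustrative:

```python
# A numerical sketch of Raju's area measures between two 2-PLM IRFs: the
# signed area (SA) integrates P_ref - P_focal over theta; the unsigned
# area (USA) integrates its absolute value. Simple midpoint quadrature
# over a wide theta range stands in for the closed-form expressions.
import math

D = 1.7  # assumed scaling constant

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def area_indices(ref, focal, lo=-8.0, hi=8.0, n=4000):
    """Signed (SA) and unsigned (USA) area between two item IRFs."""
    step = (hi - lo) / n
    sa = usa = 0.0
    for k in range(n):
        t = lo + (k + 0.5) * step           # midpoint rule
        diff = p2pl(t, *ref) - p2pl(t, *focal)
        sa += diff * step
        usa += abs(diff) * step
    return sa, usa

# With equal slopes, both areas should equal the threshold difference.
sa, usa = area_indices((1.0, 0.5), (1.0, 1.0))
```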
As interesting as these item differences are, we re-
mind the readers that differential item functioning
does not imply differential test functioning. In other
words, although many items on a scale may show
evidence of DIF in two groups, the scale may none-
theless provide valid measurement for both groups.
For instance, although several items in the Phobias
and Fears factor scale show evidence of DIF (see
Table 2 or Figure 3), the scale produces unbiased
scores for Blacks and Whites. This statement is sup-
ported by the fact that Raju's DTF index for the Pho-
bias and Fears factor scale was only 0.02 (p > .05) in
our samples. The square root of the DTF, which
equals 0.15 for this comparison, suggests that for a
given trait level estimate, Blacks and Whites will dif-
fer, on average, by 0.15 of a single point on the ob-
served scores.
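The DTF logic can be sketched as follows, with invented item parameters: with linked parameters, compute each group's expected true score at the same trait values, average the squared difference, and take the square root to place the index on the raw score metric:

```python
# A sketch of the DTF/rDTF computation described above. D = 1.7 and all
# parameter values are assumptions for illustration.
import math

D = 1.7

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Expected raw score: sum of keyed response probabilities."""
    return sum(p2pl(theta, a, b) for a, b in items)

def rdtf(thetas, items_ref, items_focal):
    """Root of the mean squared true-score difference across thetas."""
    dtf = sum((true_score(t, items_ref) - true_score(t, items_focal)) ** 2
              for t in thetas) / len(thetas)
    return math.sqrt(dtf)

items_white = [(0.8, 0.0), (1.0, 0.5), (1.2, -0.3)]
items_black = [(0.8, 0.1), (1.0, 0.5), (1.2, -0.3)]  # one shifted threshold
thetas = [-2 + 0.1 * k for k in range(41)]
value = rdtf(thetas, items_white, items_black)
```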
Our analyses of the 12 MMPI factor scales revealed
numerous instances of DIF. On average, 38% of the
items on each scale produced significant values of
Lord's χ² at the .01 significance level. Of course, with
large sample sizes, and hence large statistical power,
one expects to find numerous significant and uninter-
esting differences when groups are compared on any
set of psychological variables (Meehl, 1967, 1978).
These findings were neither surprising—in that we
would expect similar results on any set of broadband
factor scales—nor troublesome. Our results would
give us cause for worry if the item differences pro-
duced biased scales. However, the plots in Figure 4
(which also report the rDTF index) demonstrate that
the 12 MMPI factor scales are not biased against
Blacks or Whites.
Figure 4 displays the test characteristic curves
(TCCs) for the 12 MMPI factor scales. Each plot con-
tains two TCCs, one produced from the estimated
item parameters of the Black participants and one
produced from the estimated item parameters of the
White participants. Notice that in virtually all cases,
the TCCs are similar and that in many cases they are
not visually distinguishable. Only the Assertiveness
(As) and Extroversion (Ex) factor scales show any
appreciable evidence of biased measurement, and in
both cases the amount of score distortion that is pro-
duced by differential item functioning would not yield
different clinical or personological interpretations at
any score level on these scales. For instance, the
rDTF divided by the total number of scale items is
only 0.04 and 0.05 for Assertiveness and Extrover-
sion, respectively.
The TCC plots in Figure 4 reassure us that the 12
MMPI factor scales can be used to make meaningful
group comparisons for Blacks and Whites. In other
words, in samples where Blacks (Whites) score rela-
tively higher (lower) than Whites (Blacks) on the ob-
served scores from these scales, we can be confident
that Blacks (Whites) also score higher (lower) than
Whites (Blacks) on the latent variables that are mea-
sured by these scales.
The previous analyses provided strong evidence for
Black-White measurement equivalence on our 12
MMPI factor scales. We noted that even if the TCCs
showed that these scales could not be used to make
valid comparisons at the observed score level, we
[Figure 3: estimated 2-parameter logistic model item response functions for the MMPI
Phobias and Fears items, plotted separately from the White and Black item parameter
estimates. Figure 4: test characteristic curves for the 12 MMPI factor scales, one curve per
group, with the rDTF index reported in each panel.]
could still validly compare our groups at the estimated
latent trait level. In other words, because the maxi-
mum-likelihood trait estimates (from the IRT analy-
ses) are calculated from the group-specific (estimated)
item parameters, we can always compare groups on
the estimated θ levels after the item and estimated
trait levels have been linked to a common metric (this
statement is just another way of saying that partial
measurement invariance is often a sufficient condition
for making valid group comparisons; see Byrne,
Shavelson, & Muthén, 1989; Reise et al., 1993).
Are the MMPI and MMPI-2 validity and clinical
scales biased against Blacks? It is well known that
the MMPI clinical scales were primarily constructed
by the method of contrasted groups (Greene, 1991,
chap. 1). Scales that are developed by this method are
notorious for being highly multidimensional, and the
MMPI clinical scales are no exception. Consequently,
the internal structure of the clinical scales and the
MMPI factor scales are notably different. The clinical
scales contain items of diverse content—as measured,
for example, by the Harris and Lingoes (1955) sub-
scales—whereas the factor scales are relatively con-
tent pure. These differences raise an intriguing ques-
tion. Namely, when individuals complete an MMPI
protocol, are their item responses determined by the
constructs underlying the multidimensional clinical
scales or the unidimensional factor scales? Surpris-
ingly, no one has addressed this question, although the
answer to this query has important implications for
much applied and theoretical MMPI work. For in-
stance, as we show below, our answer will help us
determine whether the MMPI validity and clinical
scales are biased against Blacks or Whites.
We tackled the aforementioned question by posing
the following hypothesis: If the unidimensional con-
structs, which are represented by the factor scales, are
of prime importance in determining MMPI item re-
sponses, then scores on the validity and clinical scales
should be recoverable from the factor-scale θ
estimates. For instance, using only the trait estimates
from our previous IRT analyses, we should be able to
accurately reproduce the MMPI profiles that are dis-
played in Figure 2.
To test the aforementioned hypothesis requires IRT
item parameter estimates for the 383 unique items that
are scored on the MMPI validity and clinical scales.
There are also, coincidentally, 383 unique items on
the MMPI factor scales, although the items on the
validity and clinical scales do not overlap completely
with the items on the factor scales. In particular, we
did not have item parameter estimates for 87 items
(hereinafter called the missing items) that are scored
on one or more of the validity and clinical scales.
Thus, it was necessary to estimate item parameters for
as many of these items as possible. These estimates
were obtained as follows.3
First, in both the White and Black samples, we
calculated biserial correlations (using PRELIS 2;
Jöreskog & Sörbom, 1996) between each of the 87
missing items and the estimated θ levels from the
previous IRT analyses of the MMPI factor scales.
These correlations were used to assign a missing item
to one of the factor scales. An item was assigned to a
scale if its correlation with that scale was higher than
its correlation with any other factor scale in both the
White and Black samples. Moreover, the absolute
value of the highest item-factor correlation was re-
quired to exceed .20. Seventy of the 87 missing items
met these liberal criteria and were thus provisionally
assigned to a factor scale. Items were retained on the
scale if the item could be well modeled by the 2-PLM.
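The assignment rule can be sketched as below. Point-biserial correlations stand in for the biserial correlations computed with PRELIS, and all names and data are hypothetical:

```python
# A sketch of the assignment rule described above: a missing item goes to
# the factor scale with which its item-theta correlation is strongest in
# BOTH groups, provided the absolute correlation exceeds .20.
import math

def corr(x, y):
    """Pearson (here point-biserial) correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def assign_item(item_responses, thetas_by_scale, min_r=0.20):
    """Return the winning scale name for an item, or None if criteria fail.

    `item_responses` maps group -> 0/1 responses; `thetas_by_scale` maps
    scale -> group -> theta estimates (same people, same order).
    """
    winners = {}
    for group, resp in item_responses.items():
        rs = {scale: corr(resp, th[group])
              for scale, th in thetas_by_scale.items()}
        best = max(rs, key=lambda s: abs(rs[s]))
        if abs(rs[best]) <= min_r:
            return None                      # fails the |r| > .20 criterion
        winners[group] = best
    picks = set(winners.values())
    return picks.pop() if len(picks) == 1 else None  # must agree across groups

# Toy example: the item correlates strongly with the "Gm" thetas in both
# groups and only weakly with the "Cy" thetas.
gm = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
cy = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]
item = {"white": [0, 0, 0, 1, 1, 1], "black": [0, 0, 0, 1, 1, 1]}
thetas = {"Gm": {"white": gm, "black": gm},
          "Cy": {"white": cy, "black": cy}}
scale = assign_item(item, thetas)
```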
At this point we wanted to estimate item parameters
for the 70 recently assigned items in a manner that
would not bias the trait level estimates or the item
parameter estimates from our original IRT analyses of
the factor scales. To accomplish this goal we used
marginal maximum-likelihood item parameter estima-
tion as implemented in BILOG 3.10 (Mislevy &
Bock, 1990). This program is well suited to our task
because it allows item parameter estimates to be fixed
or freely estimated. Parameters are fixed (i.e., con-
strained) to user-specified values when a tight Bayes-
ian prior (i.e., a prior with a user-specified mean and
a small standard deviation) is placed on the parameter
estimate. Parameters are freely estimated when the
Bayesian priors are loose or when no prior is speci-
fied. Thus, by making judicious use of this option, we
were able to link the parameter estimates of the re-
cently assigned items to the metric of the previously
calibrated items. We simply assigned tight priors to
the slopes and thresholds of the original items—
thereby fixing their values to their previously calcu-
lated estimates—and assigned loose or no priors (we
used the BILOG default values) to the slopes and
thresholds for the recently added items. An example
BILOG file that demonstrates how to fix and free
parameter estimates is reproduced in the Appendix.
3 We would like to thank an anonymous reviewer for
suggesting the following analyses.
Using the aforementioned procedure, we ran 22
BILOG jobs to calculate slope and threshold estimates
for the 70 missing items that are needed to score the
MMPI validity and clinical scales (no additional items
were assigned to the MMPI Christian Fundamental-
ism scale). After completing these runs, we had
group-specific slope and threshold estimates for 96%
of the items needed to score the MMPI validity and
clinical scales. As previously noted, it was not pos-
sible to estimate item parameters for 17 items. Having
no parameter estimates for these items, however,
posed no problems for our ultimate goal because
MMPI protocols are considered potentially valid as
long as no more than 30 items are omitted. Thus,
using the 366 items for which group-specific param-
eter estimates were available, we were able to simu-
late item response vectors for Whites and Blacks with
identical latent trait values. These response vectors
were then used to score the MMPI validity and clini-
cal scales and to look for possible scale biases.
Specifically, 511 MMPI protocols were simulated
for Whites, and 511 protocols were simulated for
Blacks. These protocols (i.e., item response vectors)
were paired such that the same profile of latent trait
values was used to generate a White and a Black item
response vector. One aspect of this analysis deserves
emphasis: Namely, we used the same latent trait es-
timates in both simulations (in both cases we used the
estimated θ levels from the Blacks); thus, our two
samples are perfectly matched on the latent traits that
are measured by the MMPI factor scales. Any differ-
ences found at the observed score level on the validity
and clinical scales can only arise from the use of the
two sets of estimated item parameters (i.e., the esti-
mated parameters from the Blacks and Whites).
Moreover, to minimize the error that is inherent in any
probabilistic item response model, when scoring the
MMPI scales, we summed the simulated item re-
sponse probabilities rather than simulated [0/1] raw
item responses. Thus, our analyses were conducted on
simulated true scores for Whites and Blacks.
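The matched-samples simulation reduces to scoring one set of latent trait values through two sets of item parameters. Everything numeric below is invented for illustration:

```python
# A sketch of the matched-samples logic: the same latent trait values are
# scored through the "White" and "Black" item parameter estimates, and
# scale scores are formed by summing keyed response probabilities
# (model-based true scores). D = 1.7 and all values are assumptions.
import math

D = 1.7

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def true_scores(thetas, item_params):
    """One true score per person: sum of keyed response probabilities."""
    return [sum(p2pl(t, a, b) for a, b in item_params) for t in thetas]

thetas = [-1.5, -0.5, 0.0, 0.5, 1.5]          # shared latent trait values
white_params = [(0.9, 0.0), (1.1, 0.8), (0.7, -0.4)]
black_params = [(0.9, 0.1), (1.1, 0.8), (0.7, -0.4)]
white_scores = true_scores(thetas, white_params)
black_scores = true_scores(thetas, black_params)
# Because the trait values are identical, any score differences arise
# solely from the two sets of estimated item parameters.
```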
Figure 5 displays the average MMPI profiles from
the two samples of reproduced item responses. Two
features of this plot bear directly on the question that
Figure 5. Reproduced Minnesota Multiphasic Personality Inventory (MMPI) profiles of
Black and White youth offenders using model-based item response probabilities. L = Lie; F
= Infrequency; K = Defensiveness; Hs = Hypochondriasis; D = Depression; Hy =
Hysteria; Pd = Psychopathic Deviate; Mf = Masculinity-Femininity; Pa = Paranoia; Pt =
Psychasthenia; Sc = Schizophrenia; Ma = Hypomania; Si = Social Introversion.
motivated our analyses. The first is that the profiles in
Figure 5 show a reassuringly close resemblance to the
profiles in Figure 2. This finding suggests that MMPI
item responses are largely determined by the homo-
geneous constructs that are tapped by the MMPI fac-
tor scales. It does not prove our hypothesis, nor does
it rule out the possibility that latent taxonic variables
(Waller & Meehl, 1998) also influence MMPI item
response behavior. It does suggest, however, that the
latent dimensions that are measured by the factor
scales are useful explanatory constructs that can be
used to predict an individual's item response profile.
The second noteworthy feature of Figure 5 is that the
two profiles—which were reproduced from the White
and Black estimated item parameters—are remark-
ably close (the small differences are not clinically
meaningful). This finding strongly supports the con-
tention that groups of Blacks and Whites can be
meaningfully compared on the MMPI validity and
clinical scales.
The above conclusions are further bolstered by con-
sidering the multivariate extension of Raju's index of
DTF (Raju et al., 1995). Oshima et al. (1997) have
recently demonstrated how this index can be mean-
ingfully applied to multidimensional scales. To do so
one computes the estimated true score by summing
the (keyed) item response probabilities for all items of
a multidimensional scale. These response probabili-
ties are determined by the unidimensional (e.g., Equa-
tion 1, supra) or multidimensional IRFs (Ackerman,
1996) that were used to characterize the items. In our
didactic example, all MMPI items were modeled by
the (unidimensional) 2-PLM. A person's estimated
true score for a multidimensional scale, such as an
MMPI validity or clinical scale, is calculated by sum-
ming the (keyed) response probabilities for all items
on the scale. For example, although MMPI Items 13,
23, and 30 are scored on the (multidimensional) clini-
cal Depression scale, each item is scored on a separate
(unidimensional) factor scale. Specifically, Item 13
(work under tension) is scored on General Maladjust-
ment, Item 23 (troubled by nausea) is scored on So-
matic Complaints, and Item 30 (feel like swearing) is
scored on Antisocial Tendencies. Thus, when calcu-
lating estimated item response probabilities for these
items it is necessary to consider latent trait values on
three dimensions (General Maladjustment, Somatic
Complaints, and Antisocial Tendencies). Once these
response probabilities have been calculated they can
be summed to produce estimated true scores on De-
pression.
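The Depression-scale example can be sketched as follows. The item-to-factor-scale mapping mirrors the text, but the parameter values and trait profile are invented:

```python
# A sketch of the multidimensional true-score computation: each item on a
# clinical scale is modeled by the unidimensional 2-PLM on its own factor
# scale, so a person's score sums probabilities evaluated at different
# latent dimensions. D = 1.7 and the numbers are assumptions.
import math

D = 1.7

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Each item: (factor scale it is calibrated on, a, b). The three items
# echo the Depression-scale example in the text.
depression_items = [
    ("General Maladjustment", 1.0, 0.2),
    ("Somatic Complaints", 0.8, 0.9),
    ("Antisocial Tendencies", 0.6, -0.3),
]

def clinical_true_score(theta_profile, items):
    """Sum keyed response probabilities, each at its own trait value."""
    return sum(p2pl(theta_profile[scale], a, b) for scale, a, b in items)

profile = {"General Maladjustment": 0.5,
           "Somatic Complaints": -0.2,
           "Antisocial Tendencies": 1.0}
score = clinical_true_score(profile, depression_items)
```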
When working with multidimensional scales, such
as the MMPI validity and clinical scales, it is not
possible to portray the TCCs. Thus, we cannot com-
pare group-specific TCCs for the validity and clinical
scales as we did for the unidimensional factor scales
in Figure 4. However, as noted above, it is certainly
possible and desirable to compute the multidimen-
sional version of the DTF or the square root of this
index, the rDTF (recall that the rDTF is placed on the
metric of the original scores). Moreover, Oshima et al.
(1997) provided formulas for computing chi-square
test statistics for the DTF in the multidimensional
case. Table 3 reports the rDTF for the 13 validity and
clinical scales of our analyses. Note that the rDTF
values for all scales except K (Defensiveness) and Si
(Social Introversion) are statistically significant (p <
.05). Note also, however, that the rDTF values for all
scales are small. The two scales with the largest rDTF
are Scales 8 (Sc; Schizophrenia) and 9 (Ma; Hypo-
mania). These findings are interesting because previ-
ous investigators have speculated that MMPI Scales 8
and 9 are biased against Blacks. Although our find-
ings support that contention, they also forcefully sug-
gest that the degree of bias in these scales is minimal.
In particular, the average score differences between
Whites and Blacks with equal latent trait values on the
dimensions that are tapped by the MMPI are only 1.89
and 1.56 raw score points on Sc and Ma, respectively.

Table 3
Root Differential Test Functioning (rDTF) for MMPI
Validity and Clinical Scales for Black and White Item
Parameter Estimates

Validity and
clinical scales    No. items    No. missing    rDTF
L                      12            3         0.32
F                      63            1         0.38
K                      30            0         0.20*
1 Hs                   32            1         0.35
2 D                    57            3         0.56
3 Hy                   60            0         0.47
4 Pd                   46            4         0.63
5 Mf                   54            6         0.41
6 Pa                   38            2         0.63
7 Pt                   48            0         0.82
8 Sc                   76            2         1.89
9 Ma                   44            2         1.56
0 Si                   69            1         0.57*

Note. rDTF is reported in the raw score metric. Minnesota Multiphasic Personality Inventory (MMPI) scale names: L = Lie; F = Infrequency; K = Defensiveness; Hs = Hypochondriasis; D = Depression; Hy = Hysteria; Pd = Psychopathic Deviate; Mf = Masculinity-Femininity; Pa = Paranoia; Pt = Psychasthenia; Sc = Schizophrenia; Ma = Hypomania; Si = Social Introversion.
* p > .05 (i.e., differential test functioning is not significantly different from 0.00).
These small differences would not result in different
clinical interpretations. Nevertheless, researchers
wishing to reduce the amount of measurement bias in
these scales can easily do so by deleting those items
that maximally contribute to the DTF. For instance,
our item analyses indicated that MMPI Item 157 con-
tributes the most to the Sc DTF. This item asks
examinees whether they have been punished without
cause. It is not difficult to understand why Blacks
(especially during the mid 1960s when our data were
collected) endorse this item more frequently than
Whites in our racially tense society, irrespective of
their standing on the constructs tapped by the MMPI.
By removing Item 157 the rDTF for Sc is lowered from
1.89 to 1.71. We could continue to remove biased
items from Sc until the rDTF fell below a specific
threshold, if desired.
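The logic of the rDTF calculation and of deleting a biased item can be sketched as follows. The linked item parameters and focal-group trait values below are hypothetical illustrations (Raju et al., 1995, and Oshima et al., 1997, give the formal estimators and significance tests); the sketch averages the squared difference between group-specific estimated true scores over the focal group.

```python
import math

def p_keyed(theta, a, b, D=1.7):
    """2-PLM probability of a keyed response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, params):
    """Estimated true score: sum of keyed response probabilities."""
    return sum(p_keyed(theta, a, b) for a, b in params)

# Hypothetical linked (a, b) parameters for a short scale, estimated
# separately in the reference and focal groups after linking.
ref = [(1.0, 0.0), (1.2, 0.5), (0.8, -0.3), (1.1, 1.0)]
foc = [(1.0, 0.1), (1.2, 0.9), (0.8, -0.3), (1.1, 1.1)]

# Focal-group latent trait values (in practice, EAP or ML estimates).
thetas = [-1.0, -0.5, 0.0, 0.4, 1.2]

def dtf(ref_params, foc_params, thetas):
    """Raju-style DTF: mean squared difference between the groups'
    estimated true scores, averaged over the focal group."""
    return sum((true_score(t, foc_params) - true_score(t, ref_params)) ** 2
               for t in thetas) / len(thetas)

full = math.sqrt(dtf(ref, foc, thetas))  # rDTF, in the raw score metric
# Delete the item contributing most to DTF (here, the second item,
# whose thresholds differ most across groups) and recompute.
drop = math.sqrt(dtf(ref[:1] + ref[2:], foc[:1] + foc[2:], thetas))
print(round(full, 3), round(drop, 3))
```

As in the Sc example, removing the most biased item lowers, but does not eliminate, the scale-level index.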
Discussion
In this article we have described several methods
for separating true group differences from measurement
bias on unidimensional and multidimensional
scales. These methods are based on IRT and differ
from less formal procedures, such as the difference of
means test (Pritchard & Rosenblatt, 1980a), by equat-
ing the groups on the underlying latent traits that are
being measured. We have described how unidimen-
sional scales can sometimes be used to generate IRT
slope and threshold estimates for items on multidi-
mensional scales, and we have shown how these es-
timated item parameters can be used to elucidate dif-
ferential item and test functioning.
Our didactic example includes MMPI data from
1,277 White and 511 Black young adult criminal of-
fenders. Several findings from our analyses were no-
table. For instance, many MMPI items show evidence
of bias against Whites or Blacks. This finding was not
surprising. Any omnibus inventory, such as the
MMPI, the California Psychological Inventory
(Gough & Bradley, 1996), or the Multidimensional
Personality Questionnaire (Tellegen, 1982; Smith &
Reise, 1998), is likely to contain numerous items that
perform differently across various homogeneous
groups. However, most psychological research is con-
ducted at the scale level and thus the more important
question to ask is whether the items yield biased test
scores after they have been aggregated into scales.
Our analyses of the MMPI factor, validity, and clini-
cal scales suggest that Whites and Blacks can be
meaningfully compared on these scales with little fear
that obtained group differences are due to measure-
ment bias. We note that a small amount of bias was
found for two MMPI clinical scales (Sc and Ma) but
that the magnitude of the bias was insufficient to
affect clinical interpretations. Nevertheless, we showed
how the IRT methods that we used in this article could
also be used to identify items whose removal would
be most effective in reducing scale score bias.
An important thesis of this article is that group
differences at the item or scale level can arise from
measurement bias, actual group differences, or a com-
bination of these influences. A corollary of this thesis
is that bias research must necessarily focus on latent
trait (unobserved) scores rather than manifest (ob-
served) scores (Meredith & Millsap, 1992). To con-
duct bias research at the latent trait level requires one
to fit a formal psychometric model to the observed
item responses (Embretson, 1996; Lord, 1980). In this
article, we demonstrated how the 2-PLM IRT model
could be used to obtain latent trait estimates for the
underlying constructs that are tapped by the MMPI
factor scales. Model-data fit analyses demonstrated
that this psychometric model accurately describes
item response behavior on the MMPI. This last point
is particularly noteworthy because previous research-
ers have paid scant attention to the latent structure of
the MMPI. Consider, for example, the MMPI Depres-
sion scale, also known as Scale 2. There is no con-
sensus in the MMPI community on whether elevated
scores on this scale signify higher levels of depres-
sion—that is, higher scores on an underlying latent
Depression dimension—or whether higher scores im-
ply higher probabilities of belonging to a latent de-
pression taxon (Waller & Meehl, 1998). This problem
is compounded exponentially when one considers that
most of the MMPI clinical scales, including Scale 2,
are highly multidimensional when modeled by factor
analysis (a model that is arguably inappropriate if
Scale 2 measures a latent taxon). For this and other
reasons, groups that are matched on manifest (i.e.,
observed) MMPI clinical scales are almost certainly
not matched on the underlying latent constructs that
the scales implicitly measure. These problems are not
unique to the MMPI but plague numerous personality
and psychopathology scales.
In recent years there have been repeated calls for
model-based personality and psychopathology assess-
ment (Embretson, 1996; Embretson & Hershberger,
1999; Waller, 1999; Waller & Meehl, 1998; Waller,
Tellegen, McDonald, & Lykken, 1996). In this article
we have attempted to demonstrate the benefits of
model-based assessment with the MMPI by demon-
strating two IRT models that can be used to assess
measurement bias at item and scale levels on both
homogeneous and heterogeneous scales. Importantly,
these models show that scales that contain biased
items may nonetheless provide unbiased estimates of
the underlying latent traits that influence scale-score
performance. Stated more formally, the presence of
differential item functioning does not lead inexorably
to differential test functioning. Item bias may become
amplified or canceled when aggregated at the total
score level. These important characteristics of items
and tests will remain hidden, however, until research-
ers adopt a model-based approach to psychological
assessment.
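A small numerical sketch illustrates this cancellation. The two hypothetical items below exhibit nontrivial DIF in opposite directions, yet their group-specific test characteristic curves nearly coincide:

```python
import math

def p(theta, a, b, D=1.7):
    """2-PLM probability of a keyed response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, params):
    """Test characteristic curve: expected raw score at theta."""
    return sum(p(theta, a, b) for a, b in params)

# Hypothetical items with opposing DIF: item 1 is relatively harder
# for the focal group, item 2 relatively easier (thresholds shifted
# in opposite directions by the same amount).
ref = [(1.0, -0.4), (1.0, 0.4)]
foc = [(1.0, -0.2), (1.0, 0.2)]

# Despite the item-level bias, the scale-level difference is tiny:
# the two DIF effects largely cancel when the items are aggregated.
for theta in (-1.0, 0.0, 1.0):
    diff = tcc(theta, foc) - tcc(theta, ref)
    print(theta, round(diff, 4))
```

Had both thresholds been shifted in the same direction, the same item-level DIF would instead have been amplified at the total score level.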
References
Ackerman, T. (1996). Graphical representation of multidi-
mensional item response theory analyses. Applied Psy-
chological Measurement, 20, 311-329.
Archer, R. P. (1984). Use of the MMPI with adolescents: A
review of salient issues. Clinical Psychology Review, 4,
241-251.
Archer, R. P. (1987). Using the MMPI with adolescents.
Hillsdale, NJ: Erlbaum.
Baker, F. B. (1992). Item response theory: Parameter esti-
mation techniques. New York: Marcel Dekker.
Bertelson, A. D., Marks, P. A., & May, G. D. (1982). MMPI
and race: A controlled study. Journal of Consulting and
Clinical Psychology, 50, 316-318.
Birnbaum, A. (1968). Some latent trait models and their use
in inferring an examinee's ability. In F. M. Lord & M. R.
Novick (Eds.), Statistical theories of mental test scores
(pp. 395-479). Reading, MA: Addison-Wesley.
Butcher, J. N., Braswell, L., & Raney, D. (1983). A cross-
cultural comparison of American Indian, Black, and
White inpatients on the MMPI and presenting symptoms.
Journal of Consulting and Clinical Psychology, 51, 587-
594.
Butcher, J. N., Dahlstrom, W. G., Graham, J. R., Tellegen,
A., & Kaemmer, B. (1989). MMPI-2 manual for admin-
istration and scoring. Minneapolis: University of Minne-
sota Press.
Butcher, J. N., & Rouse, S. V. (1996). Personality: Indi-
vidual differences and clinical assessment. Annual Re-
view of Psychology, 47, 87-111.
Byrne, B. M., Shavelson, R. J., & Muthen, B. (1989). Test-
ing for the equivalence of factor covariance and mean
structures: The issue of partial measurement invariance.
Psychological Bulletin, 105, 456-466.
Camilli, G., & Shepard, L. A. (1994). Methods for identi-
fying biased test items. Thousand Oaks, CA: Sage.
Cohen, J. (1988). Statistical power analysis for the behav-
ioral sciences. Hillsdale, NJ: Erlbaum.
Costello, R. M. (1973). Item level racial differences on the
MMPI. Journal of Social Psychology, 91, 161-162.
Costello, R. M. (1977). Construction and cross-validation of
an MMPI Black-White scale. Journal of Personality As-
sessment, 41, 514-519.
Dahlstrom, W. G., & Gynther, M. D. (1986). Previous
MMPI research on Black Americans. In W. G. Dahl-
strom, D. Lachar, & L. E. Dahlstrom (Eds.), MMPI pat-
terns of American minorities. Minneapolis: University of
Minnesota Press.
Dahlstrom, W. G., Lachar, D., & Dahlstrom, L. E. (Eds.).
(1986). MMPI patterns of American minorities. Minne-
apolis: University of Minnesota Press.
Ellis, B. B., Becker, P., & Kimmel, H. D. (1993). An item
response theory evaluation of an English version of the
Trier Personality Inventory (TPI). Journal of Cross-
Cultural Psychology, 24, 133-148.
Ellis, B. B., Minsel, B., & Becker, P. (1989). Evaluation of
attitude survey translations: An investigation using item
response theory. International Journal of Psychology, 24,
665-684.
Embretson, S. E. (1996). The new rules of measurement.
Psychological Assessment, 8, 341-349.
Embretson, S. E., & Hershberger, S. L. (1999). The new rules
of measurement: What every psychologist and educator
should know. Mahwah, NJ: Erlbaum.
Gottfredson, L. S. (1994). The science and politics of race
norming. American Psychologist, 49, 955-963.
Gough, H. G., & Bradley, P. (1996). Manual for the Cali-
fornia Psychological Inventory. Palo Alto, CA: Consult-
ing Psychologists Press.
Graham, J. R. (1993). MMPI-2: Assessing personality and
psychopathology. Oxford, England: Oxford University
Press.
Greene, R. L. (1978). An empirically derived MMPI care-
lessness scale. Journal of Clinical Psychology, 34, 407-
410.
Greene, R. L. (1987). Ethnicity and MMPI performance: A
review. Journal of Consulting and Clinical Psychology,
55, 497-512.
Greene, R. L. (1991). The MMPI-2/MMPI: An interpretive
manual. Needham Heights, MA: Allyn and Bacon.
Gynther, M. D. (1972). White norms and Black MMPIs: A
prescription for discrimination? Psychological Bulletin,
78, 386-402.
Gynther, M. D. (1981). Is the MMPI an appropriate assess-
ment device for Blacks? Journal of Black Psychology, 7,
67-75.
Gynther, M. D. (1989). MMPI comparisons of Blacks and
Whites: A review and commentary. Journal of Clinical
Psychology, 45, 878-883.
Gynther, M. D., & Green, S. B. (1980). Accuracy may make
a difference, but does a difference make for accuracy? A
response to Pritchard and Rosenblatt. Journal of Consulting
and Clinical Psychology, 48, 268-272.
Gynther, M. D., Lachar, D., & Dahlstrom, W. G. (1978).
Are special norms for minorities needed? Development
of an MMPI F scale for Blacks. Journal of Consulting
and Clinical Psychology, 46, 1403-1408.
Gynther, M. D., & Witt, P. H. (1976). Windstorms and im-
portant persons: Personality characteristics of Black edu-
cators. Journal of Clinical Psychology, 32, 613-616.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J.
(1991). Fundamentals of item response theory. Newbury
Park, CA: Sage.
Harris, R., & Lingoes, J. (1955). Subscales for the Minne-
sota Multiphasic Personality Inventory. Ann Arbor, MI:
The Langley Porter Clinic.
Harrison, R. H., & Kass, E. H. (1967). Differences between
Negro and White pregnant women on the MMPI. Journal
of Consulting Psychology, 31, 454-463.
Harrison, R. H., & Kass, E. H. (1968). MMPI correlates of
Negro acculturation in a northern city. Journal of Per-
sonality and Social Psychology, 10, 262-270.
Hathaway, S. R., & McKinley, J. C. (1940). A multiphasic
personality schedule (Minnesota): I. Construction of the
schedule. Journal of Psychology, 10, 249-254.
Holland, P. W., & Thayer, D. T. (1988). Differential item
performance and the Mantel-Haenszel procedure. In H.
Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145).
Hillsdale, NJ: Erlbaum.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential
item functioning. Hillsdale, NJ: Erlbaum.
Huang, C. D., & Church, A. T. (1997). Identifying cultural
differences in items and traits: Differential item function-
ing in the NEO Personality Inventory. Journal of Cross-
Cultural Psychology, 28, 192-218.
Hulin, C. L., & Mayer, L. J. (1986). Psychometric equiva-
lence of a translation of the Job Descriptive Index into
Hebrew. Journal of Applied Psychology, 71, 83-94.
Jensen, A. R. (1980). Bias in mental testing. New York:
Free Press.
Jensen, A. R. (1998). The g factor: The science of mental
ability. Westport, CT: Praeger.
Jones, E. E. (1978). Black-White personality differences:
Another look. Journal of Personality Assessment, 42,
244-252.
Joreskog, K. G., & Sorbom, D. (1996). PRELIS 2 user's
reference guide. Chicago: Scientific Software Interna-
tional.
Lord, F. M. (1980). Applications of item response theory.
Hillsdale, NJ: Erlbaum.
Lubin, B., Larsen, R. M., Matarazzo, J. D., & Seever, M.
(1985). Psychological test usage patterns in five profes-
sional settings. American Psychologist, 40, 857-861.
McKegney, F. P. (1965). An item analysis of the MMPI F
scale in juvenile delinquents. Journal of Clinical Psy-
chology, 21, 201-205.
McNulty, J. L., Graham, J. R., Ben-Porath, Y. S., & Stein,
L. A. R. (1997). Comparative validity of MMPI-2 scores
of African American and Caucasian mental health center
clients. Psychological Assessment, 9, 464-470.
Meehl, P. E. (1967). Theory-testing in psychology and
physics: A methodological paradox. Philosophy of Sci-
ence, 34, 103-115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks:
Sir Karl, Sir Ronald, and the slow progress of soft psy-
chology. Journal of Consulting and Clinical Psychology,
46, 806-834.
Meredith, W. (1993). Measurement invariance, factorial
analysis, and factorial invariance. Psychometrika, 58,
525-543.
Meredith, W., & Millsap, R. E. (1992). On the misuse of
manifest variables in the detection of measurement bias.
Psychometrika, 57, 289-311.
Miller, C., Knapp, S. C., & Daniels, C. W. (1968). MMPI
study of Negro mental hygiene clinic patients. Journal of
Abnormal Psychology, 73, 168-173.
Millsap, R. E., & Everson, H. (1993). Methodology review:
Statistical approaches for assessing measurement bias.
Applied Psychological Measurement, 17, 297-334.
Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analy-
sis and test scoring with binary logistic models. Chicago:
Scientific Software International.
Nandakumar, R. (1993). Simultaneous DIF amplification
and cancellation: Shealy-Stout's test for DIF. Journal of
Educational Measurement, 30, 293-311.
Newmark, C. S., Gentry, L., Warren, N., & Finch, A. J.
(1981). Racial bias in an MMPI index of schizophrenia.
Journal of Clinical Psychology, 20, 215-216.
Oshima, T. C., Raju, N. S., & Flowers, C. P. (1997). Devel-
opment and demonstration of multidimensional IRT-
based internal measures of differential function of items
and tests. Journal of Educational Measurement, 34, 253-
272.
Patterson, E. T., Charles, H. L., Woodward, W. A., Roberts,
W. R., & Penk, W. E. (1981). Differences in measures of
personality and family environment among Black and
White alcoholics. Journal of Consulting and Clinical
Psychology, 49, 1-9.
Penk, W. E., Roberts, W. R., Robinowitz, R., Dolan, M. P.,
Atkins, H. G., & Woodward, W. A. (1982). MMPI dif-
ferences of Black and White male polydrug abusers seek-
ing treatment. Journal of Consulting and Clinical Psy-
chology, 50, 463-465.
Pritchard, D. A., & Rosenblatt, A. (1980a). Racial bias in
the MMPI: A methodological review. Journal of Con-
sulting and Clinical Psychology, 48, 263-267.
Pritchard, D. A., & Rosenblatt, A. (1980b). Reply to
Gynther and Green. Journal of Consulting and Clinical
Psychology, 48, 273-274.
Raju, N. S. (1988). The area between two item characteristic
curves. Psychometrika, 53, 495-502.
Raju, N. S. (1990). Determining the significance of esti-
mated signed and unsigned areas between two item re-
sponse functions. Applied Psychological Measurement,
14, 197-207.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995).
IRT-based internal measures of differential functioning
of items and tests. Applied Psychological Measurement,
19, 353-368.
Raven, J. C. (1960). Guide to the standard progressive ma-
trices. London: H. K. Lewis.
Reise, S. P., & Waller, N. G. (1990). Fitting the two-
parameter model to personality data. Applied Psychologi-
cal Measurement, 14, 45-58.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Con-
firmatory factor analysis and item response theory: Two
approaches for exploring measurement invariance. Psy-
chological Bulletin, 114, 552-566.
Science Research Associates. (1947). Army general classi-
fication test examiners manual. Chicago: Author.
Shealy, R., & Stout, W. (1993). An item response theory
model for test bias. In P. W. Holland & H. Wainer (Eds.),
Differential item functioning (pp. 197-239). Hillsdale,
NJ: Erlbaum.
Smith, L. L., & Reise, S. P. (1998). Gender differences on
negative affectivity: An IRT study of differential item
functioning on the Multidimensional Personality Ques-
tionnaire Stress Reaction scale. Journal of Personality
and Social Psychology, 75, 1350-1362.
Stocking, M. L., & Lord, F. M. (1983). Developing a com-
mon metric in item response theory. Applied Psychologi-
cal Measurement, 7, 201-210.
Takane, Y., & De Leeuw, J. (1987). On the relationship
between item response theory and factor analysis of dis-
cretized variables. Psychometrika, 52, 393-408.
Tellegen, A. (1982). Manual for the Multidimensional Per-
sonality Questionnaire. Minneapolis: University of Min-
nesota, Department of Psychology.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond
group differences: The concept of item bias. Psychologi-
cal Bulletin, 99, 118-128.
Timbrook, R. E., & Graham, J. R. (1994). Ethnic differ-
ences on the MMPI-2. Psychological Assessment, 6, 212-
217.
van der Linden, W. J., & Hambleton, R. K. (1997). Hand-
book of modern item response theory. New York: Springer.
Waller, N. G. (1998). LINKDIF: An S-PLUS routine for
linking item parameters and calculating IRT measures of
differential functioning of items and tests. Applied Psy-
chological Measurement, 22, 392.
Waller, N. G. (1999). Searching for structure in the MMPI.
In S. E. Embretson & S. L. Hershberger (Eds.), The new
rules of measurement: What every psychologist and edu-
cator should know (pp. 185-217). Hillsdale, NJ: Erlbaum.
Waller, N. G., & Meehl, P. E. (1998). Multivariate taxomet-
ric procedures: Distinguishing types from continua.
Thousand Oaks, CA: Sage.
Waller, N. G., Tellegen, A., McDonald, R. P., & Lykken,
D. T. (1996). Exploring nonlinear models in personality
assessment: Development and preliminary validation of a
negative emotionality scale. Journal of Personality, 64,
545-576.
Wenk, E. (1990). Criminal careers: Criminal violence and
substance abuse (final report). Washington, DC: United
States Department of Justice, National Institute of Justice.
Whitworth, R. H., & McBlaine, D. D. (1993). Comparison
of the MMPI and the MMPI-2 administered to Anglo-
and Hispanic-American university students. Journal of
Personality Assessment, 61, 19-27.
Widaman, K. F., & Reise, S. P. (1997). Exploring the mea-
surement invariance of psychological instruments: Appli-
cations in the substance use domain. In K. J. Bryant, M.
Windle, & S. G. West (Eds.), The science of prevention:
Methodological advances from alcohol and substance
abuse research (pp. 281-324). Washington, DC: Ameri-
can Psychological Association.
Wilcox, R. R. (1998). How many discoveries have been lost
by ignoring modern statistical methods? American Psy-
chologist, 53, 300-314.
Witt, P. H., & Gynther, M. D. (1975). Another explanation
for Black-White MMPI differences. Journal of Clinical
Psychology, 31, 69-70.
(Appendix follows)
Appendix
BILOG Program
The following BILOG file demonstrates how to estimate
item response theory (IRT) item parameters for the two-
parameter logistic TRT model with fixed and free parameter
constraints. In this example the parameters for the first 11
items are fixed to their previously estimated values by
specifying tight priors on the PRIOR command line. TMU
and TSIGMA specify the means and standard deviations of
the prior distributions for the item thresholds. Note that the
means for the prior threshold distributions match the esti-
mated threshold values (for Whites) that are reported in
Table 2. Note also that the standard deviations for these
prior distributions are exceedingly small. Specifically, for
the 11 original items of the Phobias and Fears factor scale,
the standard deviations of the threshold prior distributions
are uniformly .005. Hence, by specifying tight priors,
BILOG constrains the estimated thresholds (for these 11
items) to the means of the prior distributions. Means and
standard deviations for prior distributions are not specified
for the two items that were added to the scale (Items 240
and 287). Notice also that the means of the prior distribu-
tions for the slope parameters (SMU) are equal to the natu-
ral log of the slope estimates that are reported in Table 2 for
the original 11 items on the Phobias and Fears factor scale.
To fix these items to their previously estimated slope values,
the standard deviations for the slope prior distributions are
uniformly equal to .001.
Phobias and Fears Whites: 2PLM
Items 240 (-) & 287 (-) added to augment the original factor scale
>COMMENTS: Method = 1 (maximum likelihood scoring of theta)
>COMMENTS:
>GLOBAL NPArm = 2, SAVE, DFName = 'c:\PsyMeth\PhobW.DAT';
>SAVE PARM = 'c:\PsyMeth\PhobW.PRM',
      GRAPH = 'c:\PsyMeth\PhobW.PLT',
      SCORE = 'c:\PsyMeth\PhobW.SCR';
>LENGTH NITems = 13;
>INPUT NTOtal = 13, NIDch = 4, SAMPLE = 1277;
(4A1, 2X, 6A1, 2X, 5A1, 2X, 2A1)
>TEST TNAme = 'Phobias',
      INAMES = (I128, I131, I166, I176, I270, I367, I392, I401, I480, I492,
                I522, R240, R287);
>CALIB TPRIOR, READPRI, NEW = 50, NFULL = 500, PLOT = 1.0, CHISQR = 0.0;
>PRIOR TMU = (1.8063, 0.9110, 1.0659, 0.8611, 0.3967, 1.1900, 2.9153, 1.6149, 2.2257, 0.6577, 0.3088),
       TSIGMA = (.005(0)11),
       SMU = (-0.46777, -0.83264, -0.15106, -0.28157, -1.47447, -0.18971, -0.40557, -0.24974,
              0.25518, -0.82098, -0.14030),
       SSIGMA = (.001(0)11);
>SCORE METHOD = 1, NQPT = 20, IDIST = 3;
Received July 31, 1998
Revision received July 5, 1999
Accepted October 2, 1999 •