RESEARCH REPORT
RR-01-04
March 2001

An Empirical Investigation of Impact Moderation in Test Construction

Martha L. Stocking
Ida Lawrence
Miriam Feigenbaum
Thomas Jirele
Charles Lewis
Thomas Van Essen

Statistics & Research Division
Princeton, NJ 08541
Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from the

Research Publications Office
Mail Stop 07-R
Educational Testing Service
Princeton, NJ 08541
Abstract
This investigation constructed four different kinds of test sections using three methods
of test assembly that incorporate the goals of simultaneous moderation of three different kinds
of impact--gender impact, African American impact, and Hispanic American impact. The
test sections were administered undetectably to random samples from the appropriate
population. The results were evaluated by comparison of the characteristics of moderated
sections with those of parallel operational sections. Almost all methods of test assembly
produced either moderation of impact in the appropriate direction or no change in impact.
Taking impact into account in test assembly tended to lower reliability slightly, to reduce
relative efficiency for test takers in the middle score range while increasing it for those with
more extreme scores, to raise concurrent validity slightly, and to maintain the construct
measured by the parallel operational section.
Key Words: impact, automated test assembly, bias, DIF, expert systems.
Introduction
The professional standards upon which the measurement community rests emphasize
that, for a test to be fair and equitable, group differences in test results must either be relevant
to the construct being measured or be removed (American Educational Research Association,
American Psychological Association, & National Council on Measurement in
Education, 1999). Determining the best approach to the removal of irrelevant differences
has been the topic of much research over a number of years. Some authors assume that any
group differences are irrelevant and that the fairest test is assembled by choosing items that
minimize group mean score differences, regardless of the effect on the construct intended to
be measured by the test (e.g., Rosser, 1989; Weiss, 1987). Other authors argue that the
only irrelevant differences are those remaining after conditioning on test score, which has led
to the study of Differential Item Functioning (DIF; Holland and Thayer, 1988).
Stocking, Jirele, Lewis, and Swanson (1998a) sought to simultaneously moderate the
mean score differences between women and men, and African American test takers and White
test takers, without changing the construct measured by a test of mathematical skills. They
used methods of automated test assembly (ATA; see, for example, Luecht, 1998; van der
Linden, 1998; Wightman, 1998) as a tool in the assembly of moderated impact tests by
adding the goal of moderating mean group differences to all the other goals that are used in
test assembly. This study was limited by three features. First, the item pool was constructed
from 11 intact test forms and was unrealistically small. Second, some aspects of the standard
test assembly process were omitted from their procedures. It is possible that
some of these omitted aspects are important in maintaining features of the construct to be
measured that do not easily lend themselves to incorporation in ATA methods. Finally, the
results for the constructed tests were estimated from item statistics and population
information, as opposed to being evaluated by the administration of the tests to random
samples of test takers from the population of interest.
The current study seeks to overcome these three limitations, at least partially. Real
item pools were used that were typical of the size and structure normally used to assemble the
studied tests. The test assembly procedures employed represent a more complete
incorporation of standard test assembly features, including review by test specialists, the
elimination of overlapping items, and substitution of different items, though with some
restrictions. Unidentified moderated sections, that is, sections for which scores did not count
toward test takers' reported scores, were constructed to simultaneously moderate the mean
score differences between women and men, African American and White test takers, and
Hispanic American and White test takers. These moderated sections were constructed to be
parallel in terms of all properties except impact to operational sections on which scores
counted towards test takers' reported scores. Moderated sections of both verbal and
mathematical skills tests were undetectably administered along with parallel operational
sections to random samples of the test taker population at a regularly scheduled test
administration. The results for the moderated sections were evaluated by comparison to
parallel operational test sections.
In this attempt to moderate group mean test score differences through the alteration of
the test assembly process, we implemented the suggestion of Bond (1987, page 18):
The central question remains, however, 'What factors should be taken into account when more items survive the item analysis than are needed on the final operational form of a test?' I would submit that issues of equity and equal opportunity (i.e., group difference statistics) have a place here and are reasonable candidate criteria for final item selection under some circumstances. This suggestion is admittedly a departure from past practice (some would say a radical departure). But, if applied with reason, issues of equity can be introduced without doing violence to test validity.
This notion is further reinforced by the current Standards for Educational and Psychological
Testing (1999, page 83). According to Standard 7.11,
When a construct can be measured in different ways that are approximately equal in their degrees of construct representation and freedom from construct-irrelevant variance, evidence of mean score differences across relevant subgroups of examinees should be considered in deciding which test to use.
In the next section of this paper we discuss a number of concepts that are central to
the understanding of this study. In the succeeding section we summarize the assemblies of
the moderated test sections. Remaining sections describe the test administration and present
the results. The final sections evaluate the moderated sections compared to their operational
counterparts. What we can conclude from this investigation and suggestions for further
research are discussed.
Central Concepts
Weighted Deviations Model
The weighted deviations model (WDM) and its accompanying heuristic (Swanson and
Stocking, 1993) together constituted the method of automated test assembly (ATA) employed in this study to
construct test sections. The WDM is similar to many models in the decision sciences and is
used to select items from a pool of items in such a way as to minimize the weighted sum of
deviations from constraints reflecting desirable test properties, as established by test specialists.
The constraints on item selection reflect formal test specifications as well as good test
construction practices, indicating specialists' judgment of what construct should be measured
by a test and how it is to be measured. They may include specifications about item content,
type, cognitive demands, statistical properties, impact properties, and any other property that
is of interest in item selection. These constraints on item selection are the closest we can
come to an operational articulation of the construct measured by the test.
The weights in the weighted deviations model are set by test specialists for each
constraint. They reflect two aspects of the constraints--the relative importance of each
constraint when compared to all other constraints, and the structure of the pool from which
the items are being selected when compared to the structure of the constraints specified.
Most optimization algorithms in the decision sciences seek to find the optimal solution
to an optimization problem, if that problem has a feasible solution. The weighted deviations
model incorporates a different philosophy by attempting to find the best possible solution for
problems that may either be mathematically infeasible, or that are so large that finding the
optimal solution is too costly. This philosophy mirrors the activity of expert test specialists
who must produce a test to fit detailed content and statistical constraints from a large pool
that may not perfectly mirror these constraints. This model and heuristic are routinely and
successfully used to assemble test forms for many different testing programs at Educational
Testing Service (Stocking, Swanson, and Pearlman, 1993).
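To make the WDM's selection logic concrete, the sketch below implements a weighted-deviations-style objective with a greedy, one-item-at-a-time selection loop. It is a minimal illustration under our own assumptions about data layout (items as dictionaries of attribute amounts, constraints as bounded weighted counts); the published heuristic is considerably more sophisticated, since it also projects deviations for the portion of the test not yet selected.

```python
# Minimal sketch of a weighted-deviations-style greedy item selector.
# A hypothetical simplification of Swanson and Stocking (1993), not
# the operational implementation.

def deviation(value, lower, upper):
    """How far a constraint's running value falls outside [lower, upper]."""
    if value < lower:
        return lower - value
    if value > upper:
        return value - upper
    return 0.0

def weighted_deviations(totals, constraints):
    """Weighted sum of deviations over all constraints."""
    return sum(c["weight"] * deviation(totals.get(c["name"], 0.0),
                                       c["lower"], c["upper"])
               for c in constraints)

def select_section(pool, constraints, length):
    """Greedily add the item that leaves the smallest weighted deviation."""
    selected, totals = [], {}
    for _ in range(length):
        def objective(item):
            trial = dict(totals)
            for name, amount in item["attributes"].items():
                trial[name] = trial.get(name, 0.0) + amount
            return weighted_deviations(trial, constraints)
        best = min((item for item in pool if item not in selected),
                   key=objective)
        selected.append(best)
        for name, amount in best["attributes"].items():
            totals[name] = totals.get(name, 0.0) + amount
    return selected
```

An impact target such as those in Table 3 fits this framework as one more constraint: the "attribute" is the item's pretest impact value, and the bounds are set at or near the target.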
Impact
Impact at the item level is defined as the difference between the observed proportions
of correct responses to the item for two groups of interest. Item impact will vary, of course,
as a function of the groups to whom the item is administered. In this investigation, we follow
the convention found in the DIF literature (e.g., Holland and Thayer, 1988) of computing
impact as the difference formed by subtracting the proportion correct for the reference group
from the proportion correct for the focal group. The focal group is the group of particular
interest, e.g., African American test takers; the reference group is the group with whom the
focal group is to be compared, e.g., White test takers.
Test level impact is the difference between average test scores for the focal and
reference groups. If a test is scored number correct, test level impact may be computed as a
simple sum over items of item impact. When we speak of moderating the impact of a test,
we mean decreasing the relative disadvantage of test takers who are members of the focal
group with respect to test takers who are members of the reference group. In other words,
impact, which is usually negative, is considered moderated if it becomes less negative, zero,
or even positive. (Typically focal group test takers are not advantaged with respect to
reference group test takers, although there may be some exceptions, such as Asian American
test takers compared to White test takers on some mathematical skills tests.)
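As a minimal sketch (the data layout and names are our assumptions), the two definitions above translate directly into code:

```python
# Sketch: item- and test-level impact as defined in the text.
# p_focal and p_reference are observed proportions correct per item.

def item_impact(p_focal, p_reference):
    """Focal-group proportion correct minus reference-group proportion."""
    return p_focal - p_reference

def test_impact(items):
    """For a number-correct score, test impact is the sum of item impacts."""
    return sum(item_impact(i["p_focal"], i["p_reference"]) for i in items)

items = [
    {"p_focal": 0.61, "p_reference": 0.70},  # item impact = -0.09
    {"p_focal": 0.55, "p_reference": 0.55},  # item impact =  0.00
]
print(test_impact(items))  # -0.09 (up to floating-point rounding)
```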
Another useful statistic for characterizing impact at the test level is the standard mean
difference, or standard difference, D, as in Willingham and Cole (1997, page 21). We will
use this measure of impact throughout this study. The numerator of this index contains the
focal group mean minus the reference group mean; the denominator contains the (unweighted)
average standard deviation. Thus D is independent of the score scale on which test results are
reported and may be compared across different tests and samples. The standard differences
for a recent test taker population for the two tests used in this study and for the various
comparisons of interest are given in Table 1.
Table 1: Standard Mean Differences for Studied Tests

Test                   Women and Men   African American and White   Hispanic American and White   Asian American and White
Verbal Skills               -.06                 -.93                        -.68                          -.25
Mathematical Skills         -.32                -1.04                        -.70                           .28
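Written out from the definition above (the symbols are ours), the standard difference is

\[
D \;=\; \frac{\bar{x}_{\mathrm{focal}} - \bar{x}_{\mathrm{reference}}}
             {\tfrac{1}{2}\,\bigl(s_{\mathrm{focal}} + s_{\mathrm{reference}}\bigr)},
\]

where \(\bar{x}\) and \(s\) denote the group mean and standard deviation. Because the numerator and denominator are in the same raw score units, D is unitless, which is what permits the comparisons across tests and samples noted above.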
Defining the Construct
Many methodologies exist that attempt to provide extended information and meaning
from the analysis of response patterns underlying test scores (see, for example, Embretson,
1984, 1987, 1991; Fischer, 1973; Lazarsfeld and Henry, 1968; Thurstone, 1947). While
useful in understanding deeper meanings of test scores, these approaches may be less useful in
terms of providing direct guidance in the construction of future editions of the same test.
This latter type of guidance is usually provided through the development of detailed
test specifications, which may or may not reflect information obtained from extensive analyses
designed to identify the underlying construct being measured. To the extent that test
specifications used in the construction of different test editions reflect important aspects of the
underlying construct, the construct is controlled across different editions of the same test. To
the extent that the construct is not reflected in test specifications, the maintenance of the
construct cannot be assumed across different editions.
For this and previous studies, we have reasoned along the following lines: the
construct is defined implicitly by test committees setting test specifications, and is implicitly
operationalized by the behavior of expert test specialists who assemble and review tests that
are then administered to test takers. To the extent that we can adequately model this process,
we preserve the intended construct. To conservatively reflect the uncertainty of the
relationship between test specifications and the underlying construct, we will usually refer to
"test specifications" or "constraints on item selection" rather than "the construct" for the
remainder of this paper.
Methods of Moderating Impact Using ATA
Two different approaches to using ATA methods to moderate impact were used in this
investigation. In addition, two variations of one method were tried, giving three methods in
all. The first method, called "test construction" (TC), uses the WDM
directly to simultaneously satisfy all statistical and nonstatistical constraints on item selection,
including the moderation of the three different kinds of impact of interest. This is the same
approach that was used in the previous study by Stocking et al. (1998a). Two versions were
tried, one in which a small moderation in the three kinds of impact was the goal (TC-S), and
a second in which a larger moderation in the three kinds of impact was the goal (TC-L).
The second approach, called "test selection" (TS), used a more elaborate and less
automatic scheme. In this approach, item pools were divided into random subsets, and the
WDM was used to generate a large number of (possibly overlapping) tests without any
consideration of impact. The draft tests were built to meet content and statistical
specifications. Each test was then ranked, from low impact to high impact, separately on each
of the three impact values; that is, each test had a separate ranking based on its location in the lists of gender, African American,
and Hispanic American impact. Then an overall rank was assigned to a test that represented
the highest (that is, the worst) of the three individual ranks. A test was then selected from the
lower ranked (less impact) tests based on a subjective approach that attempted to select a test
such that each of the three impact values was close to its historical low point derived from
frequency distributions of impact values for parallel tests administered in the past. In
addition, special emphasis was given to improving gender impact when this could be
accomplished without causing the two kinds of ethnic impact to become worse. This procedure
has a certain resemblance to standard minimax procedures in that an attempt was made to
minimize the maximum of the three different kinds of impact. The test selection approach
differs from the test construction approach in that group differences are taken into account
after the test construction process.
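The ranking-and-minimax step of the Test Selection approach can be sketched as follows. The candidate structure and the fully automatic final choice are our assumptions, since the actual final selection was subjective and also weighed historical impact ranges and gender impact, as described above.

```python
# Sketch of the Test Selection (TS) ranking step. Each candidate test
# carries its three impact totals; impacts are typically negative, so
# the least negative value means the least impact.

def minimax_selection(candidates):
    """Rank candidates separately on each impact (rank 0 = least impact),
    assign each test its worst (largest) rank, and return the candidate
    whose worst rank is smallest."""
    ranks = {id(t): [] for t in candidates}
    for key in ("gender", "african_american", "hispanic"):
        ordered = sorted(candidates, key=lambda t: t[key], reverse=True)
        for rank, test in enumerate(ordered):
            ranks[id(test)].append(rank)
    return min(candidates, key=lambda t: max(ranks[id(t)]))

tests = [
    {"gender": -0.9, "african_american": -4.1, "hispanic": -2.2},
    {"gender": -0.2, "african_american": -3.8, "hispanic": -1.9},
]
print(minimax_selection(tests))  # the second test wins on all three ranks
```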
Test Structure and Moderated Assemblies
A test of verbal skills and a test of mathematical skills, both administered nationally to
the test taker population, were used for this study. The focus of the study was on two
separately timed sections of the verbal skills test, referred to as VS1 and VS2, and two
separately timed sections of the mathematical skills test, MS1 and MS2. Each test taker
responded to operational sections of VS1, VS2, MS1 and MS2 and also to a section that
contained a moderated impact version of an operational section. The raw score on any test
section was the formula score for that section.
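The formula score mentioned here is presumably the conventional correction-for-guessing score, in which each wrong answer subtracts a fraction of a point:

\[
\mathrm{FS} \;=\; R \;-\; \frac{W}{k-1},
\]

where \(R\) is the number of right answers, \(W\) the number of wrong answers (omitted items excluded), and \(k\) the number of answer choices per item.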
A number of test forms were prepared and administered to random samples of the
population of test takers. For the purposes of this study, ten different forms were required, as
shown in Table 2.
Table 2: Forms Required

Form Number   Contents of Moderated Section   Number of Items   Method of Assembly of Moderated Section
1             VS1                             35                Test Selection
2             VS1                             35                Test Construction, Small
3             VS2                             30                Test Selection
4             VS2                             30                Test Construction, Small
5             MS1                             25                Test Selection
6             MS1                             25                Test Construction, Small
7             MS1                             25                Test Construction, Larger
8             MS2                             25                Test Selection
9             MS2                             25                Test Construction, Small
10            MS2                             25                Test Construction, Larger
Impact Targets
All three methods of assembling test sections with moderated impact depend upon the
specifications of impact targets or goals. This is more informal and less explicit for the Test
Selection procedure than for the two Test Construction procedures. To derive these targets,
historical distributions of the three different types of impact, based on the pretest statistics for
items included in operational sections, were examined. This impact for a section was the
simple sum of item impact values. Targets were generally selected to be within historical
ranges, but toward the lower (less impact) end of the ranges. The targets for the condition
TC-S were within historical ranges; the targets for the condition TC-L were set to values that
should produce a greater moderation in impact and might not be within historical ranges.
For example, pretest gender impact was collected for 38 previous 25-item
mathematical skills operational sections. The median of this distribution was -1.71, and the
first quartile was -1.55 on the pretest impact scale. The target for TC-S for gender impact
was set at -1.2, which represents less impact than the first quartile but is within the historical
range. The target for TC-L was set to -.8, which represents less impact than the least extreme
value of -1.1 in the historical ranges. Targets
for both sections of mathematical skills were the same. The impact targets used in the
assemblies are given in Table 3.
Table 3: Pretest Impact Goals for Moderated Section Assemblies

Moderated Section       Gender Impact   African American/White Impact   Hispanic American/White Impact
VS1, n=35, TC-S              .05                 -5.00                          -2.40
VS2, n=30, TC-S              .10                 -3.70                          -2.00
MS1, MS2, n=25, TC-S       -1.20                 -3.95                          -2.00
MS1, MS2, n=25, TC-L        -.80                 -3.70                          -1.80
Assemblies
All Test Construction (as distinguished from Test Selection) assemblies used the WDM
and the impact targets simultaneously to assemble test sections. The verbal skills sections
were assembled from a pool of 1142 items subject to 64 constraints on test content and
statistical properties other than impact. The 64 constraints were different for each verbal
skills section (VS1 and VS2). For the VS1 sections (n=35) impact targets were given weights
of 3.0; for the VS2 sections (n=30) impact targets were weighted at 1.0. The choice of target
weights is judgmental and, as noted earlier, reflects not only the importance of the targets but
also the relationship between the complex structure of the pool and the constraints.
The mathematical skills sections were assembled from a pool of 5718 items subject to
196 constraints on test content and statistical properties other than impact. As with the verbal
skills sections, the 196 constraints were different for each of the mathematical skills sections
(MS1 and MS2). For both the TC-S assemblies and the TC-L assemblies, and for both MS1
and MS2, impact targets were weighted at 5.0.
Assemblies using the Test Selection paradigm were, of course, substantially more
complex. A total of 150 potential sections were produced for each of the four sections, VS1,
VS2, MS1 and MS2, using the method of random partitioning of the pools as described
earlier. The constraints on content and statistical properties other than impact were identical
to those used in the Test Construction assemblies. The candidates for each section were
ranked on each type of impact, and a final selection was made using a subjective approach
that sought values for each of the three group differences close to the low values in the
historical distributions of impact.
Test Specialist Review
Each of the candidate moderated sections, both those produced using Test Construction
and those produced using Test Selection, was reviewed by test specialists to ensure that each
section "held together" as a cohesive section in the same fashion that is typically seen in
sections constructed without regard to impact. Substitutions were then made for items judged
unacceptable for any of several reasons. A frequent cause was that items overlapped with
other items in ways not captured by the formal constraints on item selection. A second cause
was that some mathematics items were thought to be affected by the use of calculators. A
third was item obsolescence, or simply that the test specialists did not judge the quality of the
item to be sufficiently high.
Item substitutions were accomplished under the constraint that the group impacts of
interest should be changed as little as possible after the substitution when compared to prior
values. To accomplish this, for every item in a moderated section a set of desired properties
was identified for the "ideal" item to be used as a replacement. Lists of possible replacement
items were generated, with the possibilities ranked by the amount of change their substitution
would cause to the current values of impact. Test specialists then chose the best substitute for
a current item based both on their own expertise and on the potential change in impact.
This WDM assembly and review process was identical to the process normally used in
the assembly of operational test sections, excepting, of course, any consideration of
moderating impact and the restriction on item replacements. The item pools were sufficiently
large so that test specialists did not feel unduly constrained by this restriction. Thus, to the
extent that the construct measured by these tests is normally preserved in the assembly of
operational sections, it was also preserved in the assembly of the moderated impact test
sections.
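The replacement step described above can be sketched as a simple ranking of candidate substitutes by how far they would move the section's impact totals. The item representation is our assumption, and in practice the final choice rested with the test specialists.

```python
# Sketch: rank candidate replacement items by the total absolute change
# they would cause in the section's three impact values if swapped in
# for current_item. Smallest change first.

IMPACT_KEYS = ("gender", "african_american", "hispanic")

def ranked_replacements(current_item, pool):
    def impact_shift(candidate):
        return sum(abs(candidate[k] - current_item[k]) for k in IMPACT_KEYS)
    candidates = [item for item in pool if item is not current_item]
    return sorted(candidates, key=impact_shift)
```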
Test Administration
The ten forms indicated in Table 2 were administered at a regularly scheduled test
administration to random samples from the test taker population. The volumes for the ten
forms ranged from about 7,300 test takers to about 8,900 test takers. The N's, means,
standard deviations on the raw formula score metric, and standard mean differences are shown
for each section and for each group of interest in Table 4. Note that each sample took two
versions of the section of interest, one operational and the other moderated. For example,
8,793 took the operational VS1 section (VS1 OP) and also took a moderated section that
contained the test selection (TS) version of VS1. A second sample of 8,074 took the same
operational VS1 section and also took a moderated section that contained the test construction,
small moderation (TC-S) version of VS1. Displaying these data separately by sample, that is,
not combining information on operational sections, allows comparisons of sample means and
variability that aid in the understanding of group differences. The order of sections in Table 4
is the same as that for Table 2.
For each pair of (operational, moderated) sections, information is given for the total
sample of test takers in the first three columns of Table 4. The remaining columns give
information about each group and each group comparison of interest. Columns 4 through 10
give information on the performance of women and men on both the operational and
moderated section. Columns 11 through 17 display the performance of African American test
takers and White test takers on the same operational and moderated sections. Columns 18
through 21 display information on the performance of Hispanic American test takers, not
repeating the information on White test takers in columns 14 through 16. Finally, in columns
22 through 25, information is provided on the performance of Asian American test takers.
These latter data are provided to explore the possibility that moderating impact for some
groups of interest might change, in an unfavorable fashion, the impact for groups not directly
controlled by the test assembly procedure.
Each group comparison in Table 4 ends with a column of D values as its right-most
column (columns 10, 17, 21, and 25). These columns contain values of the Willingham and
Cole standard mean difference, D.
Table 4: N's, Means, Standard Deviations of Raw Formula Scores and Standard Mean Differences by Section and Group

Column key (column numbers in parentheses):
  Total:             N (1),  Mean (2),  SD (3)
  Women:             N (4),  Mean (5),  SD (6)
  Men:               N (7),  Mean (8),  SD (9)
  Women-Men:         D (10)
  African American:  N (11), Mean (12), SD (13)
  White:             N (14), Mean (15), SD (16)
  AA-White:          D (17)
  Hispanic:          N (18), Mean (19), SD (20)
  Hispanic-White:    D (21)
  Asian:             N (22), Mean (23), SD (24)
  Asian-White:       D (25)

(Moderated rows omit the N columns; each moderated section was taken by the same sample as the operational section in the row above it.)
VS1 OP 8793 16.8 8.2 4782 16.6 8.1 4011 17.1 8.4 -.06 567 11.4 7.5 5749 17.9 7.8 -.85 503 13.1 8.1 -.60 583 16.6 9.0 -.16
TS 17.4 7.8 17.3 7.8 17.5 7.8 -.03 12.0 7.4 18.5 7.3 -.89 13.5 7.6 -.69 16.3 8.5 -.29
VS1 OP 8074 16.8 8.2 4330 16.6 8.1 3744 17.0 8.2 -.06 526 10.9 7.6 5234 17.8 7.7 -.89 458 13.6 8.1 -.54 551 16.4 8.9 -.17
TC-S 15.5 7.5 15.7 7.5 15.3 7.6 .04 10.4 6.9 16.5 7.2 -.86 12.9 7.4 -.50 15.1 8.0 -.19
VS2 OP 8959 15.4 7.3 4873 15.2 7.3 4086 15.7 7.3 -.07 603 10.1 7.2 5790 16.5 6.7 -.93 526 11.7 7.2 -.69 675 14.6 7.9 -.26
TS 15.0 6.5 15.0 6.5 15.1 6.6 -.01 10.7 6.1 15.9 6.1 -.85 12.1 6.1 -.62 14.1 7.2 -.27
VS2 OP 8397 15.5 7.5 4481 15.3 7.5 3916 15.6 7.4 -.05 521 9.8 7.0 5480 16.5 7.0 -.96 419 12.1 7.9 -.59 637 14.5 8.0 -.26
TC-S 14.5 6.6 14.5 6.5 14.6 6.8 -.01 10.3 6.2 15.2 6.3 -.78 12.4 6.4 -.44 14.1 7.2 -.16
MS1 OP 8857 12.8 6.4 4785 12.0 6.2 4071 13.7 6.6 -.27 584 7.3 5.4 5762 13.3 6.1 -1.04 506 10.2 5.9 -.53 620 15.1 6.4 .28
TS 12.8 5.7 12.2 5.5 13.5 5.9 -.22 8.2 5.1 13.3 5.5 -.97 10.4 5.4 -.54 14.9 5.9 .28
MS1 OP 8201 12.7 6.4 4347 11.9 6.1 3854 13.7 6.5 -.27 549 7.9 5.4 5342 13.4 6.1 -.95 443 10.1 5.9 -.56 577 14.6 6.6 .18
TC-S 12.6 6.1 12.0 6.0 13.3 6.2 -.21 8.3 5.7 13.3 5.9 -.85 10.3 5.9 -.51 14.4 6.3 .19
MS1 OP 7581 13.1 6.4 4052 12.4 6.2 3529 13.9 6.5 -.23 489 7.8 5.5 4945 13.7 6.1 -1.02 443 10.4 6.1 -.54 553 15.5 6.4 .29
TC-L 13.0 5.7 12.8 5.4 13.3 5.9 -.10 8.5 5.3 13.6 5.4 -.96 10.6 5.6 -.54 15.2 5.5 .31
MS2 OP 8559 12.5 5.9 4547 11.6 5.7 4010 13.6 6.0 -.35 570 7.6 5.8 5574 13.2 5.5 -.99 481 9.8 5.7 -.60 620 14.0 5.8 .14
TS 12.5 5.6 12.1 5.4 12.9 5.7 -.15 8.2 5.6 13.0 5.2 -.89 10.0 5.5 -.56 14.6 5.6 .29
MS2 OP 7812 12.5 5.8 4219 11.6 5.6 3593 13.6 5.9 -.36 528 7.4 5.2 5035 13.3 5.5 -1.10 463 9.9 5.8 -.60 559 13.7 6.0 .07
TC-S 12.6 5.9 12.1 5.8 13.2 6.0 -.19 7.9 5.2 13.2 5.6 -.98 10.2 5.8 -.53 14.7 6.4 .24
MS2 OP 7335 12.7 5.7 3922 11.7 5.6 3413 13.8 5.7 -.37 520 7.7 5.0 4741 13.3 5.4 -1.09 410 10.5 5.9 -.51 488 14.1 6.0 .13
TC-L 12.2 5.7 11.7 5.6 12.7 5.8 -.17 7.4 5.3 12.7 5.4 -.99 10.0 5.6 -.49 14.3 6.1 .28
D is the only information that can be compared across moderated and operational sections,
which themselves might unintentionally vary in difficulty, or across samples of groups of
interest, which might vary in ability. These values of D may be compared to those given in
Table 1.
As described previously, the Test Construction moderated sections were assembled to
meet certain specified goals of simultaneous impact moderation based on pretest information.
These goals were presented in Table 3. Table 5 displays information about how well these
goals were actually met both during assembly, based on item pretest information, and when
moderated sections were administered. For ease of reference, the goals are repeated from
Table 3. For comparison purposes, the same information is also provided for moderated
sections assembled using the Test Selection approach in which explicit goals play no part in
assembly. The impact represented in this table is the simple sum of item impact values for all
of the items in the assembled moderated section.
In terms of the targets originally set for the assembly of the moderated sections, the
two Test Construction approaches offered more precise achievement of pretest impact goals
than the more subjective Test Selection approach. In terms of impact computed from the
administration of sections to random test taker samples, both African American impact and
Hispanic American impact were larger (worse) than might be expected based on pretest
information. This was not unexpected, since the assembly process capitalized on favorable
pretest impact values that were likely to look less favorable when re-estimated from
administration data. That is, this regression-to-the-mean effect was the consequence of
selecting items on the basis of pretest statistics whose sampling errors were more likely to lie
in one direction than the other. In contrast, gender impact was sometimes smaller
(better), particularly for the mathematical skills sections. This may be because gender impact
was estimated with larger sample sizes than ethnic impact and therefore had smaller sampling
error.
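The regression effect can be illustrated with a small simulation (all numbers below are illustrative assumptions, not the study's data): when the least-impact items are chosen on the basis of noisy pretest values, the selection capitalizes on sampling error, and the selected items' true impacts sit closer to the pool average than their pretest values suggested.

```python
# Sketch: regression to the mean when selecting items on pretest impact.
import numpy as np

rng = np.random.default_rng(0)
true_impact = rng.normal(-0.06, 0.03, size=1000)          # true item impacts
pretest = true_impact + rng.normal(0.0, 0.02, size=1000)  # noisy pretest values

chosen = np.argsort(pretest)[-25:]  # 25 items with most favorable pretest impact
print(pretest[chosen].mean())       # flattering: partly selected on the error
print(true_impact[chosen].mean())   # less flattering: regression to the mean
```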
Table 5: Impact: Target, At Assembly, and At Administration

                        Impact at Assembly                    Impact at Administration
                    Gender   AA/White   Hispanic/White    Gender   AA/White   Hispanic/White

VS1, n=35
  TS                 -.01     -5.42        -2.91           -.16     -5.58        -4.19
  TC-S                .09     -5.04        -2.39            .29     -5.12        -3.09
  Target (w = 3.0)    .05     -5.00        -2.40

VS2, n=30
  TS                  .02     -3.97        -2.32           -.07     -4.37        -3.11
  TC-S                .08     -3.71        -2.01           -.06     -4.14        -2.37
  Target (w = 1.0)    .10     -3.70        -2.00

MS1, n=25
  TS                -1.10     -3.96        -1.99          -1.08     -4.31        -2.42
  TC-S              -1.20     -3.95        -2.01          -1.11     -4.21        -2.56
  Target (w = 5.0)  -1.20     -3.95        -2.00
  TC-L               -.80     -3.65        -1.81           -.53     -4.40        -2.46
  Target (w = 5.0)   -.80     -3.70        -1.80

MS2, n=25
  TS                 -.96     -3.81        -1.98           -.71     -4.29        -2.57
  TC-S              -1.18     -4.00        -2.01           -.99     -4.60        -2.58
  Target (w = 5.0)  -1.20     -3.95        -2.00
  TC-L               -.80     -3.71        -1.81           -.81     -4.58        -2.26
  Target (w = 5.0)   -.80     -3.70        -1.80
Evaluation
Impact: Standard Mean Differences
Every pair of rows in Table 4 represents the same sample of test takers and their
performance on two test sections, one operational and one parallel moderated section. For
example, the first two rows in Table 4 contain the results for 8793 test takers on the VS1
operational section, and the moderated VS1 section built with Test Selection. Comparing the
change in the standard mean difference across pairs of rows, and for columns containing the
information for the comparison for women and men indicates that almost any manipulation of
impact moderates these standard mean differences. Typically the moderations are larger for
mathematical skills comparisons than they are for verbal skills comparisons. There does not
seem to be very much difference between the TS and the TC-S approach, although there may
be slightly larger differences for the TC-L approach for the mathematical skills sections.
Examining the same information for the standard mean difference between African
American test takers and White test takers indicates the same general result. With the
exception of the VS1 section, almost any manipulation of impact in test assembly is helpful in
moderating values of D. For the VS1 section, Test Selection actually increases D, while TC-S
decreases D. The same statements can be made about the standard mean difference
between Hispanic American test takers and White test takers.
Test Selection increases the standard mean difference between Asian American test
takers and White test takers for both verbal skills sections. Test Construction with small
impact goals (TC-S) seems to mitigate this phenomenon for the VS1 section and to reverse it
completely for VS2; that is, VS2 impact manipulation with TC-S moderates the standard
mean difference. No method of impact moderation has much effect on the standard mean
differences for MS1 and Asian American test takers compared to White test takers. However,
for MS2, the effect is to increase the relative advantage of Asian American test takers.
The behavior of the standard mean difference across test sections for the group
comparisons of interest is displayed in Figures 1a, 1b, 1c, and 1d. Each subfigure represents
a different group comparison. Within each subfigure, D's are plotted for each of the four
sections in separate panels. Each panel contains the D for the operational test section and the
moderated test section for each method of assembling the moderated test section. Within a
subfigure, all panels share the same vertical scale; across subfigures, the vertical scales differ,
although the size of a single unit, that is, the distance between two marks on the vertical
scale, remains a constant 0.1. This design decision reflects a focus on
within group comparisons of the effect of different methods of assembling moderated sections,
rather than comparisons across groups.
Impact was intentionally controlled for the first three group comparisons. The plots
show that Test Construction methods consistently either moderated impact or didn't change
impact for the moderated section when compared to the operational section. The results for
the Test Selection method are less consistent, and for some sections, especially verbal skills
sections, Test Selection made the moderated section impact worse than the operational section.
The same statement holds true even in the final subfigure, for Asian American impact, which
was not controlled in the test assembly process.
Reliability, SEM, and (Concurrent) Validity
An earlier study by Hackett, Holland, Pearlman, and Thayer (1987) found that test
sections especially designed to have moderated impact also had lower reliability and higher
concurrent validity. Stocking et al. (1998a) estimated the same result for their moderated
impact tests. In addition, a larger average standard error of measurement can be expected for
tests designed to have moderated impact when compared to tests assembled without regard to
impact. The mathematical proofs underlying these assertions are given in Stocking, Jirele,
Lewis, and Swanson (1998b).
Table 6 gives the reliability and the average standard error of measurement for the test
sections constructed for the current study, computed from the raw formula scores. For the
operational versions of the sections, samples taking different moderated sections were
combined in this computation. That is, all test takers who were administered, for example, the
VS1 operational section (8,793 + 8,074 from Table 4) were used in the computations for the
VS1 Operational section. Compared to the operational sections, reliability estimates for the
moderated sections are typically slightly lower. Comparisons between the test construction
approach and the test selection approach are more mixed. For verbal skills sections, Test
Construction sections are less reliable than Test Selection sections; for mathematical skills
sections, in three out of four possible comparisons, the Test Construction sections are more
reliable than Test Selection sections. Average standard error of measurement (SEM) values are typically
slightly higher for most moderated sections compared to the operational sections.
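For reference, the classical relation behind these comparisons is SEM = SD × sqrt(1 − reliability), so a small drop in reliability shows up as a small rise in average SEM. The sketch below computes coefficient alpha and the corresponding average SEM from an item score matrix; the report does not state its exact reliability estimator, so alpha here is our assumption.

```python
# Sketch: coefficient alpha and average SEM from a persons-by-items matrix.
import numpy as np

def alpha_and_sem(scores):
    """scores: 2-D array, rows = test takers, columns = item scores."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)
    sem = np.sqrt(total_var) * np.sqrt(1.0 - alpha)  # SD * sqrt(1 - rel)
    return alpha, sem
```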
Table 6: Reliability, Average SEM, and Concurrent Validity
Section Reliability Average SEM Concurrent Validity
VS1 Operational .86 3.1 .52
VS1 TS .85 3.0 .51
VS1 TC-S .83 3.1 .51
VS2 Operational .86 2.7 .52
VS2 TS .82 2.8 .50
VS2 TC-S .81 2.8 .52
MS1 Operational .85 2.4 .56
MS1 TS .82 2.5 .57
MS1 TC-S .83 2.5 .58
MS1 TC-L .81 2.5 .57
MS2 Operational .83 2.4 .53
MS2 TS .82 2.3 .56
MS2 TC-S .84 2.4 .58
MS2 TC-L .83 2.4 .59
The right-most column of Table 6 displays the concurrent validity, that is, the
correlation of test scores with self-reported academic grade point average. In contrast to
previous predictions, the validity for the two verbal skills sections is slightly lower for
sections produced with impact moderation compared to their parallel operational counterparts.
For the mathematical skills sections, the results confirm previous predictions, particularly for
the MS2 section.
Relative Efficiency
Figure 2 displays the efficiency of each moderated section relative to that of the
corresponding operational section. Relative Efficiency, the ratio of corresponding test score
information functions, is a useful model-based method of making inferences about two tests,
conditional on ability, that is not affected by the choice of scale for measuring that ability
(Lord, 1980, Chapter 6, page 89). Verbal skills results are in the first row; mathematical
skills results are in the second row. The horizontal line plotted at 1.0 indicates that the two
sections are equally efficient. Stocking et al. (1998a) suggested that tests produced by a
process that deliberately sought to moderate impact would likely be less efficient than tests
assembled without regard to impact at middle levels of ability and more efficient at more
extreme (low or high) levels. This suggestion was based on the observation that moderated
impact tests tend to have easier and harder items than tests constructed ignoring impact. For
gender impact, where the sample sizes of the two groups are roughly equal, there is a
mathematical basis for this observation, as demonstrated in Stocking et al. (1998b).
The results shown in Figure 2 suggest that this assertion is substantially upheld,
although it is clearer for the VS2 and MS1 sections and less clear for the VS1 and MS2
sections. For MS2, the moderated impact sections are approximately as efficient as the
operational section at most ability levels.
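As a sketch of the computation behind Figure 2 (the 3PL item model is our assumption; the report does not state the response model used beyond citing Lord, 1980), relative efficiency is the pointwise ratio of the two test information functions:

```python
# Sketch: relative efficiency as a ratio of test information functions.
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c):
    p = p_3pl(theta, a, b, c)
    return (1.7 * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def test_information(theta, items):
    return sum(item_information(theta, *item) for item in items)

def relative_efficiency(theta, moderated, operational):
    """Values above 1.0 favor the moderated section at that ability."""
    return test_information(theta, moderated) / test_information(theta, operational)
```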
[Figure 2: Relative efficiency of each moderated section compared to its operational section, in four panels: VS1, VS2, MS1, and MS2. Each panel plots efficiency relative to the operational section (vertical axis, 0.0 to 2.0) against score (horizontal axis, low to high), with curves for Test Selection, Test Construction-Small, and, for MS1 and MS2, Test Construction-Larger.]
The Construct
There is no unequivocal method of making simple comparisons of constructs measured
by different tests; a large body of literature addresses such issues (see, for example, Carroll,
1976; Haertel and Wiley, 1992; Shealy and Stout, 1993; Snow and Lohman, 1993; Takane
and de Leeuw, 1987). However, in the current context, a rough indication of the similarity
between constructs measured may be obtained by computing the correlation between a
moderated section and its parallel operational section. These correlations were computed from
raw formula scores and then corrected for attenuation; both values are displayed in Table 7.
Corrected correlations close to 1.0 indicate that the test takers are rank ordered similarly on
the moderated and operational sections, and that the two sections measure a statistically
similar construct.
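The correction for attenuation used here is the standard one: divide the observed correlation between the moderated (M) and operational (O) sections by the square root of the product of their reliabilities (notation ours),

\[
r^{*}_{MO} \;=\; \frac{r_{MO}}{\sqrt{r_{MM}\, r_{OO}}}.
\]

Recomputing Table 7 from the two-decimal values in Table 6 reproduces the corrected correlations only approximately, since the reported values were presumably computed from unrounded statistics.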
Table 7: Correlations Between Moderated and Operational Sections
Sections Correlated             Correlation   Corrected Correlation
VS1 Operational, VS1 TS .83 .98
VS1 Operational, VS1 TC-S .83 .98
VS2 Operational, VS2 TS .82 .98
VS2 Operational, VS2 TC-S .82 .98
MS1 Operational, MS1 TS .82 .99
MS1 Operational, MS1 TC-S .84 .99
MS1 Operational, MS1 TC-L .82 .99
MS2 Operational, MS2 TS .81 .98
MS2 Operational, MS2 TC-S .81 .97
MS2 Operational, MS2 TC-L .81 .97
Discussion
A number of factors limit the results of this empirical investigation of the simultaneous
moderation of gender, African American, and Hispanic American impact for tests of verbal skills and
mathematical skills. First, the investigation was limited to the construction and administration
of test sections, not entire tests. To the extent that the content of each section does not mirror
in miniature the content of the total test, the results for test sections offer only imperfect
information about the characteristics of the total test. Two factors mitigate this concern:
First, the use of the standard mean difference, D, as a measure of impact is helpful in making
comparisons between section and total test characteristics. Second, the nearly uniform finding
(with the exception of the Test Selection procedure) that impact moderation was successful at
the section level strongly suggests that the same would be true at the total test level.
In addition, each combination of an assembly method with a test section was tried only
once, giving no information about the variability of the results. That is, we have a single
instance of the consequences of using Test Construction with small moderation (TC-S) to construct the
VS2 section. If we could have repeated this study many times we would have a context in
which to judge whether or not the particular results obtained were typical or atypical. This
kind of repetition was beyond the scope of this study.
The explicit consideration of impact when assembling tests, as suggested by Bond
(1987), is effective in moderating irrelevant impact when tests are administered. Modern
ATA methods assist in this process, although they are not required. The results of this empirical
study tended to confirm the more theoretical predictions from previous studies in terms of the
properties of the resultant tests. Moreover, these results were achieved even for the Asian
American and White impact, which was not explicitly controlled. It is likely that this is due, at
least indirectly, to the gender composition of all groups. Women were 55% of the total
population, 54% of the White test takers, 59% of the African American test takers, 58% of
the Hispanic American test takers, and 52% of the Asian American test takers. Thus it is
likely that any moderation of gender impact may also moderate the other impacts of interest.
The two Test Construction methods, based on the WDM and incorporating impact targets as
explicit goals, produced more consistent results than the Test Selection approach, and did so
more efficiently.
What is not addressed by the current study is the sustainability of moderating impact
in test construction over time; that is, current item production methods may not
be sufficient to sustain the creation of moderated impact tests indefinitely. The extent to
which this is true is not clear, however, because the WDM requires only a balance of impact
(low and high) within a test. That is, the WDM does not rely on the selection of just the
items with very small impact values. Nevertheless, this important practical issue should be
addressed in advance of any implementation, perhaps through a series of carefully designed
simulation studies.
References
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for Educational
and Psychological Testing. Washington, DC: American Educational Research Association.
Bond, L. (1987). The Golden Rule settlement: A minority perspective. Educational
Measurement: Issues and Practice, 6(2), 18-20.
Carroll, J. B. (1976). Psychometric tests as cognitive tasks: A new "structure of intellect." In
L. B. Resnick (Ed.), The nature of intelligence. Hillsdale, NJ: Erlbaum.
Embretson, S. E. (1984). A general latent trait model for response processes. Psychometrika,
49, 175-186.
Embretson, S. E. (1987). Component latent trait models for paragraph comprehension tests.
Applied Psychological Measurement, 11, 175-193.
Embretson, S. E. (1991). A multidimensional latent trait model for measuring learning and
change. Psychometrika, 56, 495-515.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research.
Acta Psychologica, 37, 359-374.
Hackett, R., Holland, P., Pearlman, M., and Thayer, D. (1987). Test construction manipulating
score differences between black and white examinees: Properties of the resulting tests
(Research Report 87-30). Princeton, NJ: Educational Testing Service.
Haertel, E. H., and Wiley, D. E. (1992). Representations of ability structures: Implications for
testing. In N. Frederiksen, R. J. Mislevy, and I. Bejar (Eds.), Test theory for a
new generation of tests (pp. 359-384). Hillsdale, NJ: Erlbaum.
Holland, P. W., and Thayer, D. T. (1988). Differential item functioning and the Mantel-
Haenszel procedure. In H. Wainer and H. I. Braun (Eds.), Test validity. Hillsdale,
NJ: Erlbaum.
Lazarsfeld, P. F., and Henry, N. W. (1968). Latent structure analysis. New York:
Houghton Mifflin.
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems.
Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.
Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied
Psychological Measurement, 22, 224-236.
Rosser, P. (1989). The SAT Gender Gap: Identifying the Causes. Washington, DC: Center
for Women Policy Studies.
Shealy, R., and Stout, W. F. (1993). An item response theory model for test bias. In P. W.
Holland and H. Wainer (Eds.), Differential item functioning (pp. 197-238). Hillsdale,
NJ: Erlbaum.
Snow, R., and Lohman, D. (1993). Implications of cognitive psychology for educational
measurement. In R. Linn (Ed.), Educational measurement (3rd ed.). Phoenix, AZ:
American Council on Education/Oryx Press.
Stocking, M. L., Jirele, T., Lewis, C., and Swanson, L. (1998a). Moderating possibly irrelevant
multiple mean score differences on a test of mathematical reasoning. Journal of
Educational Measurement, 35, 199-221.
Stocking, M. L., Jirele, T., Lewis, C., and Swanson, L. (1998b). An investigation of the
simultaneous moderation of average gender and African American score differences on
a test of mathematical reasoning (Research Report 98-46). Princeton, NJ:
Educational Testing Service.
Stocking, M. L., Swanson, L., and Pearlman, M. (1993). Application of an automated item
selection method to real data. Applied Psychological Measurement, 17, 167-176.
Swanson, L., and Stocking, M. L. (1993). A model and heuristic for solving very large item
selection problems. Applied Psychological Measurement, 17, 151-166.
Takane, Y., and de Leeuw, J. (1987). On the relationship between item response theory and
factor analysis of discretized variables. Psychometrika, 52, 393-408.
Thurstone, L. L. (1947). Multiple Factor Analysis. Chicago, IL: University of Chicago
Press.
van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests.
Applied Psychological Measurement, 22, 195-211.
Weiss, J. (1987). The Golden Rule bias reduction principle: A practical reform. Educational
Measurement: Issues and Practice, 6(2), 23-25.
Wightman, L. F. (1998). Practical issues in computerized test assembly. Applied
Psychological Measurement, 22, 292-302.
Willingham, W., and Cole, N. S. (1997). Gender and Fair Assessment. Mahwah, NJ:
Lawrence Erlbaum Associates.