RESEARCH REPORT
RR-01-04
March 2001

An Empirical Investigation of Impact Moderation in Test Construction

Martha L. Stocking
Ida Lawrence
Miriam Feigenbaum
Thomas Jirele
Charles Lewis
Thomas Van Essen

Statistics & Research Division
Princeton, NJ 08541
Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from the

Research Publications Office
Mail Stop 07-R
Educational Testing Service
Princeton, NJ 08541
Abstract
This investigation constructed four different kinds of test sections using three methods
of test assembly that incorporate the goals of simultaneous moderation of three different kinds
of impact--gender impact, African American impact, and Hispanic American impact. The
test sections were administered undetectably to random samples from the appropriate
population. The results were evaluated by comparison of the characteristics of moderated
sections with those of parallel operational sections. Almost all methods of test assembly
produced either moderation of impact in the appropriate direction or no change in impact.
Taking impact into account in test assembly tended to lower reliability slightly, to reduce
relative efficiency for test takers in the middle score range while increasing it for those with
more extreme scores, to raise concurrent validity slightly, and to maintain the construct
measured by the parallel operational section.
Key Words: impact, automated test assembly, bias, DIF, expert systems.
Introduction
The professional standards upon which the measurement community rests emphasize
that, for a test to be fair and equitable, group differences in test results must either be relevant
to the construct being measured or be removed (American Educational Research Association,
American Psychological Association, & National Council on Measurement in
Education, 1999). Determining the best approach to the removal of irrelevant differences
has been the topic of much research over a number of years. Some authors assume that any
group differences are irrelevant and that the fairest test is assembled by choosing items that
minimize group mean score differences, regardless of the effect on the construct intended to
be measured by the test (e.g., Rosser, 1989; Weiss, 1987). Other authors argue that the
only irrelevant differences are those remaining after conditioning on test score, which has led
to the study of Differential Item Functioning (DIF; Holland and Thayer, 1988).
Stocking, Jirele, Lewis, and Swanson (1998a) sought to simultaneously moderate the
mean score differences between women and men, and African American test takers and White
test takers, without changing the construct measured by a test of mathematical skills. They
used methods of automated test assembly (ATA; see, for example, Luecht, 1998; van der
Linden, 1998; Wightman, 1998) as a tool in the assembly of moderated impact tests by
adding the goal of moderating mean group differences to all the other goals that are used in
test assembly. This study was limited by three features. First, the item pool was constructed
from 11 intact test forms and was unrealistically small. Second, some aspects of the standard
test assembly process were omitted from their procedures. It is possible that
some of these omitted aspects are important in maintaining features of the construct to be
measured that do not easily lend themselves to incorporation in ATA methods. Finally, the
results for the constructed tests were estimated from item statistics and population
information, as opposed to being evaluated by the administration of the tests to random
samples of test takers from the population of interest.
The current study seeks to overcome these three limitations, at least partially. Real
item pools were used that were typical of the size and structure normally used to assemble the
studied tests. The test assembly procedures employed represent a more complete
incorporation of standard test assembly features, including review by test specialists, the
elimination of overlapping items, and substitution of different items, though with some
restrictions. Unidentified moderated sections, that is, sections for which scores did not count
toward test takers' reported scores, were constructed to simultaneously moderate the mean
score differences between women and men, African American and White test takers, and
Hispanic American and White test takers. These moderated sections were constructed to be
parallel in terms of all properties except impact to operational sections on which scores
counted towards test takers' reported scores. Moderated sections of both verbal and
mathematical skills tests were undetectably administered along with parallel operational
sections to random samples of the test taker population at a regularly scheduled test
administration. The results for the moderated sections were evaluated by comparison to
parallel operational test sections.
In this attempt to moderate group mean test score differences through the alteration of
the test assembly process, we implemented the suggestion of Bond (1987, page 18):
The central question remains, however, 'What factors should be taken into account when more items survive the item analysis than are needed on the final operational form of a test?' I would submit that issues of equity and equal opportunity (i.e., group difference statistics) have a place here and are reasonable candidate criteria for final item selection under some circumstances. This suggestion is admittedly a departure from past practice (some would say a radical departure). But, if applied with reason, issues of equity can be introduced without doing violence to test validity.
This notion is further reinforced by the current Standards for Educational and Psychological
Testing (1999, page 83). According to Standard 7.11,
When a construct can be measured in different ways that are approximately equal in their degrees of construct representation and freedom from construct-irrelevant variance, evidence of mean score differences across relevant subgroups of examinees should be considered in deciding which test to use.
In the next section of this paper we discuss a number of concepts that are central to
the understanding of this study. In the succeeding section we summarize the assemblies of
the moderated test sections. Remaining sections describe the test administration and present
the results. The final sections evaluate the moderated sections compared to their operational
counterparts. What we can conclude from this investigation and suggestions for further
research are discussed.
Central Concepts
Weighted Deviations Model
The weighted deviations model (WDM) and its accompanying heuristic (Swanson and
Stocking, 1993) together constituted the method of automated test assembly (ATA) employed in this study to
construct test sections. The WDM is similar to many models in the decision sciences and is
used to select items from a pool of items in such a way as to minimize the weighted sum of
deviations from constraints reflecting desirable test properties, as established by test specialists.
The constraints on item selection reflect formal test specifications as well as good test
construction practices, indicating specialists' judgment of what construct should be measured
by a test and how it is to be measured. They may include specifications about item content,
type, cognitive demands, statistical properties, impact properties, and any other property that
is of interest in item selection. These constraints on item selection are the closest we can
come to an operational articulation of the construct measured by the test.
The weights in the weighted deviations model are set by test specialists for each
constraint. They reflect two aspects of the constraints--the relative importance of each
constraint when compared to all other constraints, and the structure of the pool from which
the items are being selected when compared to the structure of the constraints specified.
Most optimization algorithms in the decision sciences seek to find the optimal solution
to an optimization problem, if that problem has a feasible solution. The weighted deviations
model incorporates a different philosophy by attempting to find the best possible solution for
problems that may either be mathematically infeasible, or that are so large that finding the
optimal solution is too costly. This philosophy mirrors the activity of expert test specialists
who must produce a test to fit detailed content and statistical constraints from a large pool
that may not perfectly mirror these constraints. This model and heuristic are routinely and
successfully used to assemble test forms for many different testing programs at Educational
Testing Service (Stocking, Swanson, and Pearlman, 1993).
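To make the WDM's selection logic concrete, the sketch below implements a weighted-deviations-style objective with a greedy, one-item-at-a-time selection loop. It is a minimal illustration under our own assumptions about data layout (items as dictionaries of attribute amounts, constraints as bounded weighted counts); the published heuristic is considerably more sophisticated, since it also projects deviations for the portion of the test not yet selected.

```python
# Minimal sketch of a weighted-deviations-style greedy item selector.
# A hypothetical simplification of Swanson and Stocking (1993), not
# the operational implementation.

def deviation(value, lower, upper):
    """How far a constraint's running value falls outside [lower, upper]."""
    if value < lower:
        return lower - value
    if value > upper:
        return value - upper
    return 0.0

def weighted_deviations(totals, constraints):
    """Weighted sum of deviations over all constraints."""
    return sum(c["weight"] * deviation(totals.get(c["name"], 0.0),
                                       c["lower"], c["upper"])
               for c in constraints)

def select_section(pool, constraints, length):
    """Greedily add the item that leaves the smallest weighted deviation."""
    selected, totals = [], {}
    for _ in range(length):
        def objective(item):
            trial = dict(totals)
            for name, amount in item["attributes"].items():
                trial[name] = trial.get(name, 0.0) + amount
            return weighted_deviations(trial, constraints)
        best = min((item for item in pool if item not in selected),
                   key=objective)
        selected.append(best)
        for name, amount in best["attributes"].items():
            totals[name] = totals.get(name, 0.0) + amount
    return selected
```

An impact target such as those in Table 3 fits this framework as one more constraint: the "attribute" is the item's pretest impact value, and the bounds are set at or near the target.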
Impact
Impact at the item level is defined as the difference between the observed proportions
of correct responses to the item for two groups of interest. Item impact will vary, of course,
as a function of the groups to whom the item is administered. In this investigation, we follow
the convention found in the DIF literature (e.g., Holland and Thayer, 1988) of computing
impact as the difference formed by subtracting the proportion correct for the reference group
from the proportion correct for the focal group. The focal group is the group of particular
interest, e.g., African American test takers; the reference group is the group with whom the
focal group is to be compared, e.g., White test takers.
Test level impact is the difference between average test scores for the focal and
reference groups. If a test is scored number correct, test level impact may be computed as a
simple sum over items of item impact. When we speak of moderating the impact of a test,
we mean decreasing the relative disadvantage of test takers who are members of the focal
group with respect to test takers who are members of the reference group. In other words,
impact, which is usually negative, is considered moderated if it becomes less negative, zero,
or even positive. (Typically focal group test takers are not advantaged with respect to
reference group test takers, although there may be some exceptions, such as Asian American
test takers compared to White test takers on some mathematical skills tests.)
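As a minimal sketch (the data layout and names are our assumptions), the two definitions above translate directly into code:

```python
# Sketch: item- and test-level impact as defined in the text.
# p_focal and p_reference are observed proportions correct per item.

def item_impact(p_focal, p_reference):
    """Focal-group proportion correct minus reference-group proportion."""
    return p_focal - p_reference

def test_impact(items):
    """For a number-correct score, test impact is the sum of item impacts."""
    return sum(item_impact(i["p_focal"], i["p_reference"]) for i in items)

items = [
    {"p_focal": 0.61, "p_reference": 0.70},  # item impact = -0.09
    {"p_focal": 0.55, "p_reference": 0.55},  # item impact =  0.00
]
print(test_impact(items))  # -0.09 (up to floating-point rounding)
```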
Another useful statistic for characterizing impact at the test level is the standard mean
difference, or standard difference, D, as in Willingham and Cole (1997, page 21). We will
use this measure of impact throughout this study. The numerator of this index contains the
focal group mean minus the reference group mean; the denominator contains the (unweighted)
average standard deviation. Thus D is independent of the score scale on which test results are
reported and may be compared across different tests and samples. The standard differences
for a recent test taker population for the two tests used in this study and for the various
comparisons of interest are given in Table 1.
Table 1: Standard Mean Differences for Studied Tests

Test                   Women and Men   African American and White   Hispanic American and White   Asian American and White
Verbal Skills               -.06                 -.93                        -.68                          -.25
Mathematical Skills         -.32                -1.04                        -.70                           .28
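Written out from the definition above (the symbols are ours), the standard difference is

\[
D \;=\; \frac{\bar{x}_{\mathrm{focal}} - \bar{x}_{\mathrm{reference}}}
             {\tfrac{1}{2}\,\bigl(s_{\mathrm{focal}} + s_{\mathrm{reference}}\bigr)},
\]

where \(\bar{x}\) and \(s\) denote the group mean and standard deviation. Because the numerator and denominator are in the same raw score units, D is unitless, which is what permits the comparisons across tests and samples noted above.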
Defining the Construct
Many methodologies exist that attempt to provide extended information and meaning
from the analysis of response patterns underlying test scores (see, for example, Embretson,
1984, 1987, 1991; Fischer, 1973; Lazarsfeld and Henry, 1968; Thurstone, 1947). While
useful in understanding deeper meanings of test scores, these approaches may be less useful in
terms of providing direct guidance in the construction of future editions of the same test.
This latter type of guidance is usually provided through the development of detailed
test specifications, which may or may not reflect information obtained from extensive analyses
designed to identify the underlying construct being measured. To the extent that test
specifications used in the construction of different test editions reflect important aspects of the
underlying construct, the construct is controlled across different editions of the same test. To
the extent that the construct is not reflected in test specifications, the maintenance of the
construct cannot be assumed across different editions.
For this and previous studies, we have reasoned along the following lines: the
construct is defined implicitly by test committees setting test specifications, and is implicitly
operationalized by the behavior of expert test specialists who assemble and review tests that
are then administered to test takers. To the extent that we can adequately model this process,
we preserve the intended construct. To conservatively reflect the uncertainty of the
relationship between test specifications and the underlying construct, we will usually refer to
"test specifications" or "constraints on item selection" rather than "the construct" for the
remainder of this paper.
Methods of Moderating Impact Using ATA
Two different approaches to using ATA methods to moderate impact were used in this
investigation. In addition, two variations of one method were tried, giving three methods in
all. The first method, called "test construction" (TC), uses the WDM
directly to simultaneously satisfy all statistical and nonstatistical constraints on item selection,
including the moderation of the three different kinds of impact of interest. This is the same
approach that was used in the previous study by Stocking et al. (1998a). Two versions were
tried, one in which a small moderation in the three kinds of impact was the goal (TC-S), and
a second in which a larger moderation in the three kinds of impact was the goal (TC-L).
The second approach, called "test selection" (TS), used a more elaborate and less
automatic scheme. In this approach, item pools were divided into random subsets, and the
WDM was used to generate a large number of (possibly overlapping) tests without any
consideration of impact. The draft tests were built to meet content and statistical
specifications. Each test was then ranked, from low impact to high impact, separately on each
of the three impact values; that is, each test had a separate ranking based on its location in the lists of gender, African American,
and Hispanic American impact. Then an overall rank was assigned to a test that represented
the highest (that is, the worst) of the three individual ranks. A test was then selected from the
lower ranked (less impact) tests based on a subjective approach that attempted to select a test
such that each of the three impact values was close to its historical low point derived from
frequency distributions of impact values for parallel tests administered in the past. In
addition, special emphasis was given to improving gender impact when this could be
accomplished without causing the two kinds of ethnic impact to become worse. This procedure
has a certain resemblance to standard minimax procedures in that an attempt was made to
minimize the maximum of the three different kinds of impact. The test selection approach
differs from the test construction approach in that group differences are taken into account
after the test construction process.
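The ranking-and-minimax step of the Test Selection approach can be sketched as follows. The candidate structure and the fully automatic final choice are our assumptions, since the actual final selection was subjective and also weighed historical impact ranges and gender impact, as described above.

```python
# Sketch of the Test Selection (TS) ranking step. Each candidate test
# carries its three impact totals; impacts are typically negative, so
# the least negative value means the least impact.

def minimax_selection(candidates):
    """Rank candidates separately on each impact (rank 0 = least impact),
    assign each test its worst (largest) rank, and return the candidate
    whose worst rank is smallest."""
    ranks = {id(t): [] for t in candidates}
    for key in ("gender", "african_american", "hispanic"):
        ordered = sorted(candidates, key=lambda t: t[key], reverse=True)
        for rank, test in enumerate(ordered):
            ranks[id(test)].append(rank)
    return min(candidates, key=lambda t: max(ranks[id(t)]))

tests = [
    {"gender": -0.9, "african_american": -4.1, "hispanic": -2.2},
    {"gender": -0.2, "african_american": -3.8, "hispanic": -1.9},
]
print(minimax_selection(tests))  # the second test wins on all three ranks
```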
Test Structure and Moderated Assemblies
A test of verbal skills and a test of mathematical skills, both administered nationally to
the test taker population, were used for this study. The focus of the study was on two
separately timed sections of the verbal skills test, referred to as VS1 and VS2, and two
separately timed sections of the mathematical skills test, MS1 and MS2. Each test taker
responded to operational sections of VS1, VS2, MS1 and MS2 and also to a section that
contained a moderated impact version of an operational section. The raw score on any test
section was the formula score for that section.
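The formula score mentioned here is presumably the conventional correction-for-guessing score, in which each wrong answer subtracts a fraction of a point:

\[
\mathrm{FS} \;=\; R \;-\; \frac{W}{k-1},
\]

where \(R\) is the number of right answers, \(W\) the number of wrong answers (omitted items excluded), and \(k\) the number of answer choices per item.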
A number of test forms were prepared and administered to random samples of the
population of test takers. For the purposes of this study, ten different forms were required, as
shown in Table 2.
Table 2: Forms Required

Form Number   Contents of Moderated Section   Number of Items   Method of Assembly of Moderated Section
1             VS1                             35                Test Selection
2             VS1                             35                Test Construction, Small
3             VS2                             30                Test Selection
4             VS2                             30                Test Construction, Small
5             MS1                             25                Test Selection
6             MS1                             25                Test Construction, Small
7             MS1                             25                Test Construction, Larger
8             MS2                             25                Test Selection
9             MS2                             25                Test Construction, Small
10            MS2                             25                Test Construction, Larger
Impact Targets
All three methods of assembling test sections with moderated impact depend upon the
specifications of impact targets or goals. This is more informal and less explicit for the Test
Selection procedure than for the two Test Construction procedures. To derive these targets,
historical distributions of the three different types of impact, based on the pretest statistics for
items included in operational sections, were examined. This impact for a section was the
simple sum of item impact values. Targets were generally selected to be within historical
ranges, but toward the lower (less impact) end of the ranges. The targets for the condition
TC-S were within historical ranges; the targets for the condition TC-L were set to values that
should produce a greater moderation in impact and might not be within historical ranges.
For example, pretest gender impact was collected for 38 previous 25-item
mathematical skills operational sections. The median of this distribution was -1.71, and the
first quartile was -1.55 on the pretest impact scale. The target for TC-S for gender impact
was set at -1.2, which represents less impact than the first quartile but is within the historical
range. The target for TC-L was set to -.8, which represents less impact than the least extreme
value of -1.1 in the historical ranges. Targets
for both sections of mathematical skills were the same. The impact targets used in the
assemblies are given in Table 3.
Table 3: Pretest Impact Goals for Moderated Section Assemblies

Moderated Section       Gender Impact   African American/White Impact   Hispanic American/White Impact
VS1, n=35, TC-S              .05                 -5.00                          -2.40
VS2, n=30, TC-S              .10                 -3.70                          -2.00
MS1, MS2, n=25, TC-S       -1.20                 -3.95                          -2.00
MS1, MS2, n=25, TC-L        -.80                 -3.70                          -1.80
Assemblies
All Test Construction (as distinguished from Test Selection) assemblies used the WDM
and the impact targets simultaneously to assemble test sections. The verbal skills sections
were assembled from a pool of 1142 items subject to 64 constraints on test content and
statistical properties other than impact. The 64 constraints were different for each verbal
skills section (VS1 and VS2). For the VS1 sections (n=35) impact targets were given weights
of 3.0; for the VS2 sections (n=30) impact targets were weighted at 1.0. The choice of target
weights is judgmental and, as noted earlier, reflects not only the importance of the targets but
also the relationship between the complex structure of the pool and the constraints.
The mathematical skills sections were assembled from a pool of 5718 items subject to
196 constraints on test content and statistical properties other than impact. As with the verbal
skills sections, the 196 constraints were different for each of the mathematical skills sections
(MS1 and MS2). For both the TC-S assemblies and the TC-L assemblies, and for both MS1
and MS2, impact targets were weighted at 5.0.
Assemblies using the Test Selection paradigm were, of course, substantially more
complex. A total of 150 potential sections were produced for each of the four sections, VS1,
VS2, MS1 and MS2, using the method of random partitioning of the pools as described
earlier. The constraints on content and statistical properties other than impact were identical
to those used in the Test Construction assemblies. The candidates for each section were
ranked on each type of impact, and a final selection was made using a subjective approach
that sought values for each of the three group differences close to the low values in the
historical distributions of impact.
Test Specialist Review
Each of the candidate moderated sections, both those produced using Test Construction
and those produced using Test Selection, was reviewed by test specialists to ensure that each
section "held together" as a cohesive section in the same fashion that is typically seen in
sections constructed without regard to impact. Substitutions were then made for items judged
unacceptable for any of several reasons. A frequent cause was that items overlapped with
other items in ways not captured by the formal constraints on item selection. A second cause
was that some mathematics items were thought to be affected by the use of calculators. A
third was item obsolescence, or simply that the test specialists did not judge the quality of the
item to be sufficiently high.
Item substitutions were accomplished under the constraint that the group impacts of
interest should be changed as little as possible after the substitution when compared to prior
values. To accomplish this, for every item in a moderated section a set of desired properties
was identified for the "ideal" item to be used as a replacement. Lists of possible replacement
items were generated, with the possibilities ranked by the amount of change their substitution
would cause to the current values of impact. Test specialists then chose the best substitute for
a current item based both on their own expertise and on the potential change in impact.
This WDM assembly and review process was identical to the process normally used in
the assembly of operational test sections, excepting, of course, any consideration of
moderating impact and the restriction on item replacements. The item pools were sufficiently
large so that test specialists did not feel unduly constrained by this restriction. Thus, to the
extent that the construct measured by these tests is normally preserved in the assembly of
operational sections, it was also preserved in the assembly of the moderated impact test
sections.
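The replacement step described above can be sketched as a simple ranking of candidate substitutes by how far they would move the section's impact totals. The item representation is our assumption, and in practice the final choice rested with the test specialists.

```python
# Sketch: rank candidate replacement items by the total absolute change
# they would cause in the section's three impact values if swapped in
# for current_item. Smallest change first.

IMPACT_KEYS = ("gender", "african_american", "hispanic")

def ranked_replacements(current_item, pool):
    def impact_shift(candidate):
        return sum(abs(candidate[k] - current_item[k]) for k in IMPACT_KEYS)
    candidates = [item for item in pool if item is not current_item]
    return sorted(candidates, key=impact_shift)
```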
Test Administration
The ten forms indicated in Table 2 were administered at a regularly scheduled test
administration to random samples from the test taker population. The volumes for the ten
forms ranged from about 7,300 test takers to about 8,900 test takers. The N's, means,
standard deviations on the raw formula score metric, and standard mean differences are shown
for each section and for each group of interest in Table 4. Note that each sample took two
versions of the section of interest, one operational and the other moderated. For example,
8,793 took the operational VS1 section (VS1 OP) and also took a moderated section that
contained the test selection (TS) version of VS1. A second sample of 8,074 took the same
operational VS1 section and also took a moderated section that contained the test construction,
small moderation (TC-S) version of VS1. Displaying these data separately by sample, that is,
not combining information on operational sections, allows comparisons of sample means and
variability that aid in the understanding of group differences. The order of sections in Table 4
is the same as that for Table 2.
For each pair of (operational, moderated) sections, information is given for the total
sample of test takers in the first three columns of Table 4. The remaining columns give
information about each group and each group comparison of interest. Columns 4 through 10
give information on the performance of women and men on both the operational and
moderated section. Columns 11 through 17 display the performance of African American test
takers and White test takers on the same operational and moderated sections. Columns 18
through 21 display information on the performance of Hispanic American test takers, not
repeating the information on White test takers in columns 14 through 16. Finally, in columns
22 through 25, information is provided on the performance of Asian American test takers.
These latter data are provided to explore the possibility that moderating impact for some
groups of interest might change, in an unfavorable fashion, the impact for groups not directly
controlled by the test assembly procedure.
Each group comparison in Table 4 ends with a column of D values as its right-most
column (columns 10, 17, 21, and 25). These columns contain values of the Willingham and
Cole standard mean difference, D.
Table 4: N's, Means, Standard Deviations of Raw Formula Scores and Standard Mean Differences by Section and Group

Column key (column numbers in parentheses):
  Total:             N (1),  Mean (2),  SD (3)
  Women:             N (4),  Mean (5),  SD (6)
  Men:               N (7),  Mean (8),  SD (9)
  Women-Men:         D (10)
  African American:  N (11), Mean (12), SD (13)
  White:             N (14), Mean (15), SD (16)
  AA-White:          D (17)
  Hispanic:          N (18), Mean (19), SD (20)
  Hispanic-White:    D (21)
  Asian:             N (22), Mean (23), SD (24)
  Asian-White:       D (25)

(Moderated rows omit the N columns; each moderated section was taken by the same sample as the operational section in the row above it.)
VS1 OP 8793 16.8 8.2 4782 16.6 8.1 4011 17.1 8.4 -.06 567 11.4 7.5 5749 17.9 7.8 -.85 503 13.1 8.1 -.60 583 16.6 9.0 -.16
TS 17.4 7.8 17.3 7.8 17.5 7.8 -.03 12.0 7.4 18.5 7.3 -.89 13.5 7.6 -.69 16.3 8.5 -.29
VS1 OP 8074 16.8 8.2 4330 16.6 8.1 3744 17.0 8.2 -.06 526 10.9 7.6 5234 17.8 7.7 -.89 458 13.6 8.1 -.54 551 16.4 8.9 -.17
TC-S 15.5 7.5 15.7 7.5 15.3 7.6 .04 10.4 6.9 16.5 7.2 -.86 12.9 7.4 -.50 15.1 8.0 -.19
VS2 OP 8959 15.4 7.3 4873 15.2 7.3 4086 15.7 7.3 -.07 603 10.1 7.2 5790 16.5 6.7 -.93 526 11.7 7.2 -.69 675 14.6 7.9 -.26
TS 15.0 6.5 15.0 6.5 15.1 6.6 -.01 10.7 6.1 15.9 6.1 -.85 12.1 6.1 -.62 14.1 7.2 -.27
VS2 OP 8397 15.5 7.5 4481 15.3 7.5 3916 15.6 7.4 -.05 521 9.8 7.0 5480 16.5 7.0 -.96 419 12.1 7.9 -.59 637 14.5 8.0 -.26
TC-S 14.5 6.6 14.5 6.5 14.6 6.8 -.01 10.3 6.2 15.2 6.3 -.78 12.4 6.4 -.44 14.1 7.2 -.16
MS1 OP 8857 12.8 6.4 4785 12.0 6.2 4071 13.7 6.6 -.27 584 7.3 5.4 5762 13.3 6.1 -1.04 506 10.2 5.9 -.53 620 15.1 6.4 .28
TS 12.8 5.7 12.2 5.5 13.5 5.9 -.22 8.2 5.1 13.3 5.5 -.97 10.4 5.4 -.54 14.9 5.9 .28
MS1 OP 8201 12.7 6.4 4347 11.9 6.1 3854 13.7 6.5 -.27 549 7.9 5.4 5342 13.4 6.1 -.95 443 10.1 5.9 -.56 577 14.6 6.6 .18
TC-S 12.6 6.1 12.0 6.0 13.3 6.2 -.21 8.3 5.7 13.3 5.9 -.85 10.3 5.9 -.51 14.4 6.3 .19
MS1 OP 7581 13.1 6.4 4052 12.4 6.2 3529 13.9 6.5 -.23 489 7.8 5.5 4945 13.7 6.1 -1.02 443 10.4 6.1 -.54 553 15.5 6.4 .29
TC-L 13.0 5.7 12.8 5.4 13.3 5.9 -.10 8.5 5.3 13.6 5.4 -.96 10.6 5.6 -.54 15.2 5.5 .31
MS2 OP 8559 12.5 5.9 4547 11.6 5.7 4010 13.6 6.0 -.35 570 7.6 5.8 5574 13.2 5.5 -.99 481 9.8 5.7 -.60 620 14.0 5.8 .14
TS 12.5 5.6 12.1 5.4 12.9 5.7 -.15 8.2 5.6 13.0 5.2 -.89 10.0 5.5 -.56 14.6 5.6 .29
MS2 OP 7812 12.5 5.8 4219 11.6 5.6 3593 13.6 5.9 -.36 528 7.4 5.2 5035 13.3 5.5 -1.10 463 9.9 5.8 -.60 559 13.7 6.0 .07
TC-S 12.6 5.9 12.1 5.8 13.2 6.0 -.19 7.9 5.2 13.2 5.6 -.98 10.2 5.8 -.53 14.7 6.4 .24
MS2 OP 7335 12.7 5.7 3922 11.7 5.6 3413 13.8 5.7 -.37 520 7.7 5.0 4741 13.3 5.4 -1.09 410 10.5 5.9 -.51 488 14.1 6.0 .13
TC-L 12.2 5.7 11.7 5.6 12.7 5.8 -.17 7.4 5.3 12.7 5.4 -.99 10.0 5.6 -.49 14.3 6.1 .28
D is the only information that can be compared across moderated and operational sections,
which themselves might unintentionally vary in difficulty, or across samples of groups of
interest, which might vary in ability. These values of D may be compared to those given in
Table 1.
As described previously, the Test Construction moderated sections were assembled to
meet certain specified goals of simultaneous impact moderation based on pretest information.
These goals were presented in Table 3. Table 5 displays information about how well these
goals were actually met both during assembly, based on item pretest information, and when
moderated sections were administered. For ease of reference, the goals are repeated from
Table 3. For comparison purposes, the same information is also provided for moderated
sections assembled using the Test Selection approach in which explicit goals play no part in
assembly. The impact represented in this table is the simple sum of item impact values for all
of the items in the assembled moderated section.
In terms of the targets originally set for the assembly of the moderated sections, the
two Test Construction approaches offered more precise achievement of pretest impact goals
than the more subjective Test Selection approach. In terms of impact computed from the
administration of sections to random test taker samples, both African American impact and
Hispanic American impact were larger (worse) than might be expected based on pretest
information. This was not unexpected, since the assembly process capitalized on favorable
pretest impact values that were likely to look less favorable when re-estimated from
administration data. That is, this regression-to-the-mean effect was the consequence of
selecting items on the basis of pretest statistics whose sampling errors were more likely to lie
in one direction than the other. In contrast, gender impact was sometimes smaller
(better), particularly for the mathematical skills sections. This may be because gender impact
was estimated with larger sample sizes than ethnic impact and therefore had smaller sampling
error.
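The regression effect can be illustrated with a small simulation (all numbers below are illustrative assumptions, not the study's data): when the least-impact items are chosen on the basis of noisy pretest values, the selection capitalizes on sampling error, and the selected items' true impacts sit closer to the pool average than their pretest values suggested.

```python
# Sketch: regression to the mean when selecting items on pretest impact.
import numpy as np

rng = np.random.default_rng(0)
true_impact = rng.normal(-0.06, 0.03, size=1000)          # true item impacts
pretest = true_impact + rng.normal(0.0, 0.02, size=1000)  # noisy pretest values

chosen = np.argsort(pretest)[-25:]  # 25 items with most favorable pretest impact
print(pretest[chosen].mean())       # flattering: partly selected on the error
print(true_impact[chosen].mean())   # less flattering: regression to the mean
```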
Table 5: Impact: Target, At Assembly, and At Administration

                        Impact at Assembly                    Impact at Administration
                    Gender   AA/White   Hispanic/White    Gender   AA/White   Hispanic/White

VS1, n=35
  TS                 -.01     -5.42        -2.91           -.16     -5.58        -4.19
  TC-S                .09     -5.04        -2.39            .29     -5.12        -3.09
  Target (w = 3.0)    .05     -5.00        -2.40

VS2, n=30
  TS                  .02     -3.97        -2.32           -.07     -4.37        -3.11
  TC-S                .08     -3.71        -2.01           -.06     -4.14        -2.37
  Target (w = 1.0)    .10     -3.70        -2.00

MS1, n=25
  TS                -1.10     -3.96        -1.99          -1.08     -4.31        -2.42
  TC-S              -1.20     -3.95        -2.01          -1.11     -4.21        -2.56
  Target (w = 5.0)  -1.20     -3.95        -2.00
  TC-L               -.80     -3.65        -1.81           -.53     -4.40        -2.46
  Target (w = 5.0)   -.80     -3.70        -1.80

MS2, n=25
  TS                 -.96     -3.81        -1.98           -.71     -4.29        -2.57
  TC-S              -1.18     -4.00        -2.01           -.99     -4.60        -2.58
  Target (w = 5.0)  -1.20     -3.95        -2.00
  TC-L               -.80     -3.71        -1.81           -.81     -4.58        -2.26
  Target (w = 5.0)   -.80     -3.70        -1.80
Evaluation
Impact: Standard Mean Differences
Every pair of rows in Table 4 represents the same sample of test takers and their
performance on two test sections, one operational and one parallel moderated section. For
example, the first two rows in Table 4 contain the results for 8793 test takers on the VS1
operational section, and the moderated VS1 section built with Test Selection. Comparing the
change in the standard mean difference across pairs of rows, and for columns containing the
information for the comparison for women and men indicates that almost any manipulation of
impact moderates these standard mean differences. Typically the moderations are larger for
mathematical skills comparisons than they are for verbal skills comparisons. There does not
seem to be very much difference between the TS and the TC-S approach, although there may
be slightly larger differences for the TC-L approach for the mathematical skills sections.
Examining the same information for the standard mean difference between African
American test takers and White test takers indicates the same general result. With the
exception of the VS1 section, almost any manipulation of impact in test assembly is helpful in
moderating values of D. For the VS1 section, Test Selection actually increases D, while TC-S
decreases D. The same statements can be made about the standard mean difference
between Hispanic American test takers and White test takers.
Test Selection increases the standard mean difference between Asian American test
takers and White test takers for both verbal skills sections. Test Construction with small
impact goals (TC-S) seems to mitigate this phenomenon for the VS1 section and to reverse it
completely for VS2; that is, VS2 impact manipulation with TC-S moderates the standard
mean difference. No method of impact moderation has much effect on the standard mean
differences for MS1 and Asian American test takers compared to White test takers. However,
for MS2, the effect is to increase the relative advantage of Asian American test takers.
The behavior of the standard mean difference across test sections for the group
comparisons of interest is displayed in Figures 1a, 1b, 1c, and 1d. Each subfigure represents
a different group comparison. Within each subfigure, D's are plotted for each of the four
sections in separate panels. Each panel contains the D for the operational test section and the
moderated test section for each method of assembling the moderated test section. Within a
subfigure, all panels share the same vertical scale; across subfigures, the vertical scales differ,
although the size of a single unit, that is, the distance between two marks on the vertical
scale, remains a constant 0.1. This design decision reflects a focus on
within group comparisons of the effect of different methods of assembling moderated sections,
rather than comparisons across groups.
Impact was intentionally controlled for the first three group comparisons. The plots
show that Test Construction methods consistently either moderated impact or didn't change
impact for the moderated section when compared to the operational section. The results for
the Test Selection method are less consistent, and for some sections, especially verbal skills
sections, Test Selection made the moderated section impact worse than the operational section.
The same statement holds true even in the final subfigure, for Asian American impact, which
was not controlled in the test assembly process.
Reliability, SEM, and (Concurrent) Validity
An earlier study by Hackett, Holland, Pearlman, and Thayer (1987) found that test
sections especially designed to have moderated impact also had lower reliability and higher
concurrent validity. Stocking et al. (1998a) estimated the same result for their moderated
impact tests. In addition, a larger average standard error of measurement can be expected for
tests designed to have moderated impact when compared to tests assembled without regard to
impact. The mathematical proofs underlying these assertions are given in Stocking, Jirele,
Lewis, and Swanson (1998b).
Table 6 gives the reliability and the average standard error of measurement for the test
sections constructed for the current study, computed from the raw formula scores. For the
operational versions of the sections, samples taking different moderated sections were
combined in this computation. That is, all test takers who were administered, for example, the
VS1 operational section (8,793 + 8,074 from Table 4) were used in the computations for the
VS1 Operational section. Compared to the operational sections, reliability estimates for the
moderated sections are typically slightly lower. Comparisons between the test construction
approach and the test selection approach are more mixed. For verbal skills sections, Test
Construction sections are less reliable than Test Selection sections; for mathematical skills
sections, in three out of four possible comparisons, the Test Construction sections are more
reliable than Test Selection sections. Average standard error of measurement (SEM) values are typically
slightly higher for most moderated sections compared to the operational sections.
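For reference, the classical relation behind these comparisons is SEM = SD × sqrt(1 − reliability), so a small drop in reliability shows up as a small rise in average SEM. The sketch below computes coefficient alpha and the corresponding average SEM from an item score matrix; the report does not state its exact reliability estimator, so alpha here is our assumption.

```python
# Sketch: coefficient alpha and average SEM from a persons-by-items matrix.
import numpy as np

def alpha_and_sem(scores):
    """scores: 2-D array, rows = test takers, columns = item scores."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)
    sem = np.sqrt(total_var) * np.sqrt(1.0 - alpha)  # SD * sqrt(1 - rel)
    return alpha, sem
```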
Table 6: Reliability, Average SEM, and Concurrent Validity
Section Reliability Average SEM Concurrent Validity
VS1 Operational .86 3.1 .52
VS1 TS .85 3.0 .51
VS1 TC-S .83 3.1 .51
VS2 Operational .86 2.7 .52
VS2 TS .82 2.8 .50
VS2 TC-S .81 2.8 .52
MS1 Operational .85 2.4 .56
MS1 TS .82 2.5 .57
MS1 TC-S .83 2.5 .58
MS1 TC-L .81 2.5 .57
MS2 Operational .83 2.4 .53
MS2 TS .82 2.3 .56
MS2 TC-S .84 2.4 .58
MS2 TC-L .83 2.4 .59
The right-most column of Table 6 displays the concurrent validity, that is, the
correlation of test scores with self-reported academic grade point average. In contrast to
previous predictions, the validity for the two verbal skills sections is slightly lower for
sections produced with impact moderation compared to their parallel operational counterparts.
For the mathematical skills sections, the results confirm previous predictions, particularly for
the MS2 section.
Relative Efficiency
Figure 2 displays the efficiency of each moderated section relative to that of the
corresponding operational section. Relative Efficiency, the ratio of corresponding test score
information functions, is a useful model-based method of making inferences about two tests,
conditional on ability, that is not affected by the choice of scale for measuring that ability
(Lord, 1980, Chapter 6, page 89). Verbal skills results are in the first row; mathematical
skills results are in the second row. The horizontal line plotted at 1.0 indicates that the two
sections are equally efficient. Stocking et al. (1998a) suggested that tests produced by a
process that deliberately sought to moderate impact would likely be less efficient than tests
assembled without regard to impact at middle levels of ability and more efficient at more
extreme (low or high) levels. This suggestion was based on the observation that moderated
impact tests tend to have easier and harder items than tests constructed ignoring impact. For
gender impact, where the sample sizes of the two groups are roughly equal, there is a
mathematical basis for this observation, as demonstrated in Stocking et al. (1998b).
The results shown in Figure 2 suggest that this assertion is substantially upheld,
although it is clearer for the VS2 and MS1 sections and less clear for the VS1 and MS2
sections. For MS2, the moderated impact sections are approximately as efficient as the
operational section at most ability levels.
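As a sketch of the computation behind Figure 2 (the 3PL item model is our assumption; the report does not state the response model used beyond citing Lord, 1980), relative efficiency is the pointwise ratio of the two test information functions:

```python
# Sketch: relative efficiency as a ratio of test information functions.
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c):
    p = p_3pl(theta, a, b, c)
    return (1.7 * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def test_information(theta, items):
    return sum(item_information(theta, *item) for item in items)

def relative_efficiency(theta, moderated, operational):
    """Values above 1.0 favor the moderated section at that ability."""
    return test_information(theta, moderated) / test_information(theta, operational)
```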
[Figure 2: Relative efficiency of each moderated section compared to its operational section, in four panels: VS1, VS2, MS1, and MS2. Each panel plots efficiency relative to the operational section (vertical axis, 0.0 to 2.0) against score (horizontal axis, low to high), with curves for Test Selection, Test Construction-Small, and, for MS1 and MS2, Test Construction-Larger.]
The Construct
There is no unequivocal method of making simple comparisons of constructs measured
by different tests; a large body of literature addresses such issues (see, for example, Carroll,
1976; Haertel and Wiley, 1992; Shealy and Stout, 1993; Snow and Lohman, 1993; Takane
and de Leeuw, 1987). However, in the current context, a rough indication of the similarity
between constructs measured may be obtained by computing the correlation between a
moderated section and its parallel operational section. These correlations were computed from
raw formula scores and then corrected for attenuation; both values are displayed in Table 7.
Corrected correlations close to 1.0 indicate that the test takers are rank ordered similarly on
the moderated and operational sections, and that the two sections measure a statistically
similar construct.
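The correction for attenuation used here is the standard one: divide the observed correlation between the moderated (M) and operational (O) sections by the square root of the product of their reliabilities (notation ours),

\[
r^{*}_{MO} \;=\; \frac{r_{MO}}{\sqrt{r_{MM}\, r_{OO}}}.
\]

Recomputing Table 7 from the two-decimal values in Table 6 reproduces the corrected correlations only approximately, since the reported values were presumably computed from unrounded statistics.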
Table 7: Correlations Between Moderated and Operational Sections
Sections Correlated             Correlation   Corrected Correlation
VS1 Operational, VS1 TS .83 .98
VS1 Operational, VS1 TC-S .83 .98
VS2 Operational, VS2 TS .82 .98
VS2 Operational, VS2 TC-S .82 .98
MS1 Operational, MS1 TS .82 .99
MS1 Operational, MS1 TC-S .84 .99
MS1 Operational, MS1 TC-L .82 .99
MS2 Operational, MS2 TS .81 .98
MS2 Operational, MS2 TC-S .81 .97
MS2 Operational, MS2 TC-L .81 .97
Discussion
A number of factors limit the results of this empirical investigation of the simultaneous
moderation of gender, African American, and Hispanic American impact for tests of verbal skills and
mathematical skills. First, the investigation was limited to the construction and administration
of test sections, not entire tests. To the extent that the content of each section does not mirror
in miniature the content of the total test, the results for test sections offer only imperfect
information about the characteristics of the total test. Two factors mitigate this concern:
First, the use of the standard mean difference, D, as a measure of impact is helpful in making
comparisons between section and total test characteristics. Second, the nearly uniform finding
(with the exception of the Test Selection procedure) that impact moderation was successful at
the section level strongly suggests that the same would be true at the total test level.
In addition, each combination of an assembly method with a test section was tried only
once, giving no information about the variability of the results. That is, we have a single
instance of the consequences of using Test Construction with small moderation (TC-S) to construct the
VS2 section. If we could have repeated this study many times we would have a context in
which to judge whether or not the particular results obtained were typical or atypical. This
kind of repetition was beyond the scope of this study.
The explicit consideration of impact when assembling tests, as suggested by Bond
(1987), is effective in moderating irrelevant impact when tests are administered. Modern
ATA methods assist in this process, although they are not required. The results of this empirical
study tended to confirm the more theoretical predictions from previous studies in terms of the
properties of the resultant tests. Moreover, these results were achieved even for the Asian
American and White impact, which was not explicitly controlled. It is likely that this is due, at
least indirectly, to the gender composition of all groups. Women were 55% of the total
population, 54% of the White test takers, 59% of the African American test takers, 58% of
the Hispanic American test takers, and 52% of the Asian American test takers. Thus it is
likely that any moderation of gender impact may also moderate the other impacts of interest.
The two Test Construction methods, based on the WDM and incorporating impact targets as
explicit goals, produced more consistent results than the Test Selection approach, and did so
more efficiently.
What is not addressed by the current study is the sustainability of moderating impact
in test construction over time; that is, current item production methods may not
be sufficient to sustain the creation of moderated impact tests indefinitely. The extent to
which this is true is not clear, however, because the WDM requires only a balance of impact
(low and high) within a test. That is, the WDM does not rely on the selection of just the
items with very small impact values. Nevertheless, this important practical issue should be
addressed in advance of any implementation, perhaps through a series of carefully designed
simulation studies.
References
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for Educational
and Psychological Testing. Washington, DC: American Educational Research Association.
Bond, L. (1987). The Golden Rule settlement: A minority perspective. Educational
Measurement: Issues and Practice, 6(2), 18-20.
Carroll, J. B. (1976). Psychometric tests as cognitive tasks: A new "structure of intellect." In
L. B. Resnick (Ed.), The nature of intelligence. Hillsdale, NJ: Erlbaum.
Embretson, S. E. (1984). A general latent trait model for response processes. Psychometrika,
49, 175-186.
Embretson, S. E. (1987). Component latent trait models for paragraph comprehension tests.
Applied Psychological Measurement, 11, 175-193.
Embretson, S. E. (1991). A multidimensional latent trait model for measuring learning and
change. Psychometrika, 56, 495-515.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research.
Acta Psychologica, 37, 359-374.
Hackett, R., Holland, P., Pearlman, M., and Thayer, D. (1987). Test construction manipulating
score differences between black and white examinees: Properties of the resulting tests
(Research Report 87-30). Princeton, NJ: Educational Testing Service.
Haertel, E. H., and Wiley, D. E. (1992). Representations of ability structures: Implications for
testing. In N. Frederiksen, R. J. Mislevy, and I. Bejar (Eds.), Test theory for a
new generation of tests (pp. 359-384). Hillsdale, NJ: Erlbaum.
Holland, P. W., and Thayer, D. T. (1988). Differential item functioning and the Mantel-
Haenszel procedure. In H. Wainer and H. I. Braun (Eds.), Test validity. Hillsdale,
NJ: Erlbaum.
Lazarsfeld, P. F., and Henry, N. W. (1968). Latent structure analysis. New York:
Houghton Mifflin.
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems.
Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.
Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied
Psychological Measurement, 22, 224-236.
Rosser, P. (1989). The SAT Gender Gap: Identifying the Causes. Washington, DC: Center
for Women Policy Studies.
Shealy, R., and Stout, W. F. (1993). An item response theory model for test bias. In P. W.
Holland and H. Wainer (Eds.), Differential item functioning (pp. 197-238). Hillsdale,
NJ: Erlbaum.
Snow, R., and Lohman, D. (1993). Implications of cognitive psychology for educational
measurement. In R. Linn (Ed.), Educational measurement (3rd ed.). Phoenix, AZ:
American Council on Education/Oryx Press.
Stocking, M. L., Jirele, T., Lewis, C., and Swanson, L. (1998a). Moderating possibly irrelevant
multiple mean score differences on a test of mathematical reasoning. Journal of
Educational Measurement, 35, 199-221.
Stocking, M. L., Jirele, T., Lewis, C., and Swanson, L. (1998b). An investigation of the
simultaneous moderation of average gender and African American score differences on
a test of mathematical reasoning (Research Report 98-46). Princeton, NJ:
Educational Testing Service.
Stocking, M. L., Swanson, L., and Pearlman, M. (1993). Application of an automated item
selection method to real data. Applied Psychological Measurement, 17, 167-176.
Swanson, L., and Stocking, M. L. (1993). A model and heuristic for solving very large item
selection problems. Applied Psychological Measurement, 17, 151-166.
Takane, Y., and de Leeuw, J. (1987). On the relationship between item response theory and
factor analysis of discretized variables. Psychometrika, 52, 393-408.
Thurstone, L. L. (1947). Multiple Factor Analysis. Chicago, IL: University of Chicago
Press.
van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests.
Applied Psychological Measurement, 22, 195-211.
Weiss, J. (1987). The Golden Rule bias reduction principle: A practical reform. Educational
Measurement: Issues and Practice, 6(2), 23-25.
Wightman, L. F. (1998). Practical issues in computerized test assembly. Applied
Psychological Measurement, 22, 292-302.
Willingham, W., and Cole, N. S. (1997). Gender and Fair Assessment. Mahwah, NJ:
Lawrence Erlbaum Associates.