The Effect of Computer-Based Tests on

Racial/Ethnic, Gender, and Language Groups

Ann Gallagher Brent Bridgeman

Cara Cahalan

GRE No. 9621P

June 2000

This report presents the findings of a research project funded by and carried

out under the auspices of the Graduate Record Examinations Board

Educational Testing Service, Princeton, NJ 08541

********************

Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate

Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

********************

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs,

services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.

Copyright © 2000 by Educational Testing Service. All rights reserved.

Abstract

This study examined data from several national testing programs to determine whether the change

from paper-based administration to computer-based tests (CBTs) influences group differences in

performance. Performance by gender, racial/ethnic, and language groups on the Graduate Record

Examination (GRE®) General Test, the Graduate Management Admissions Test (GMAT®), the SAT® I:

Reasoning (SAT) test, the Praxis Professional Assessment for Beginning Teachers (Praxis), and the

Test of English as a Foreign Language (TOEFL) was analyzed to ensure that the change to CBTs does not

pose a disadvantage to any of these subgroups, beyond that already identified for paper-based tests.

Although all differences were quite small, some consistent patterns were found for some racial/ethnic and

gender groups. African American examinees and, to a lesser degree, Hispanic examinees appear to benefit

from the CBT format. However, for some tests, the CBT version negatively impacted female examinees.

Analyses by gender within race/ethnicity revealed a similar pattern, though only for White females.

Analyses for groups based on language showed no consistent patterns, but results indicate that the

computer-based TOEFL has increased impact for some language groups -- especially Chinese and Korean

groups.

Key words: CBT, Gender, Race, Ethnicity, Language, Computer-based testing, Assessment

Table of Contents

Introduction

Method

Sources of Information

Procedures

Results

Conclusion

Future Research

References

Appendix A

Appendix B

List of Tables

Table 1. Descriptive Statistics for Test Sample

Table 2. Descriptive Statistics for Quantitative Tests by Race/Ethnicity, Gender, and Race/Ethnicity Within Gender

Table 3. Differences in Impact for Population Groups by Quantitative Test

Table 4. Descriptive Statistics for Verbal Tests by Race/Ethnicity, Gender, and Race/Ethnicity Within Gender

Table 5. Differences in Impact for Population Groups by Verbal Test

Table 6. Descriptive Statistics for Quantitative Tests by Language Group by Quantitative Test

Table 7. Differences in Impact for Language Groups by Quantitative Test

Table 8. Descriptive Statistics for Quantitative Tests by Language Groups by Verbal Test

Table 9. Differences in Impact for Language Groups by Verbal Test

Table 10. Descriptive Statistics for TOEFL Verbal Subtests by Language, Gender, and Language Within Gender

Table 11. Differences in Impact for Language Groups by TOEFL Subtests

List of Figures

Figure A1. SAT Scores by Gender

Figure A2. GMAT1 Scores by Gender

Figure A3. GMAT2 Scores by Gender

Figure A4. TOEFL Scores by Gender

Figure A4. TOEFL Scores by Gender, continued

Figure B1. Quantitative Tests by Race/Ethnicity

Figure B2. Verbal Tests by Race/Ethnicity

Figure B3. Quantitative Tests by Race/Ethnicity for Males

Figure B4. Quantitative Tests by Race/Ethnicity for Females

Figure B5. Verbal Tests by Race/Ethnicity for Males

Figure B6. Verbal Tests by Race/Ethnicity for Females

Figure B7. Quantitative Tests by Gender

Figure B8. Verbal Tests by Gender

Figure B8. Verbal Tests by Gender, continued

Figure B9. Verbal Tests by Language Group

Figure B10. Verbal Tests by Language Group for Females

Figure B11. Verbal Tests by Language Group for Males

Figure B12. Quantitative Tests Language Group

Figure B13. Quantitative Tests by Gender and Language Group

Introduction

Many large national testing programs are converting from paper-based tests to computer-based

tests (CBTs). Reactions to CBTs have largely been positive, due to recognition of the value of such

features as flexible scheduling and rapid score reporting. Nonetheless, some concerns exist. A particularly

important question is that of the fairness of CBTs relative to paper-based tests. Can all subgroups of

examinees expect to perform as well on a computerized test as they would on a traditional, paper-based

examination? Or will there be differences, perhaps arising from differences in the amount of experience

or comfort in the use of computers that may be associated with subgroup membership?

Over the last decade, a large body of research has been conducted with the purpose of

understanding the nature and sources of group differences in test performance (see, for example,

Willingham & Cole, 1997, or Hoover, Politzer, & Taylor, 1990). Most testing programs shifting from

paper-based tests to CBTs intend to maintain comparability of scores from one testing format to the other,

but because other kinds of format changes have produced inconsistent results in the past (Willingham &

Cole, 1997), it is important to examine whether the shift from paper to computers produces any consistent

patterns.

Two sets of researchers have performed large scale reviews of studies examining differences in

performance on CBTs and paper-based versions of tests (Mazzeo & Harvey, 1988; Mead & Drasgow,

1993). Both reviews focused primarily on average differences across formats without subdividing

samples on the basis of race/ethnicity or gender. Both reviews found conflicting results based on the type

of test (that is, power versus speeded tests, or personality versus placement tests). For example, Mead and

Drasgow focused on correlations across formats and found no effect of medium (computer vs. paper) for

carefully constructed power tests, but they found a substantial effect for speeded tests. The only high-

stakes admissions test included in this work was the early, nonadaptive version of the computerized

Graduate Record Examination (GRE®) General Test. Mazzeo and Harvey’s work notes the diversity of

display and response formats between CBTs and paper-based tests and the interactions of these

differences with performance. However, this study was conducted in the late 1980s, when CBT interfaces

were quite immature and familiarity with computers was limited. Consequently, the findings of this

study are less generalizable today than when it was first conducted.

Several comparability studies examining differences in scores on computer-based and paper-

based versions of tests have been conducted for individual testing programs (Schaeffer, Bridgeman,

Golub-Smith, Lewis, Potenza, & Steffan, 1998; Schaeffer, Steffan, Golub-Smith, Mills, & Durso, 1995;

P.A. Carey, personal communication, May 1998). As part of this work, the studies include a brief analysis

of subgroup performance. Several other studies have been conducted at Educational Testing Service® to

examine performance of subgroups within testing populations for specific tests (Bridgeman & Cooper,

1998; Bridgeman & Schaeffer, 1995; Eignor, Way, & Amoss, 1994; O’Neill & Powers, 1993; Ward &

Bridgeman, 1996). These studies have been done independently rather than as a coordinated effort,

however, and comparisons across studies are difficult due to methodological differences.

This study consolidates data from prior studies in order to compare findings across testing

programs, with the intent of identifying patterns of performance across assessments. To do this, we

analyzed data from studies examining five testing programs to determine whether the difference between

paper-based performance and CBT performance is of equal magnitude across subgroups -- in particular,

subgroups defined by race/ethnicity, gender, and native language.

Method

Sources of Information

Operational testing data and data from experimental research administrations of tests were drawn

from the following ETS testing programs:

• GRE General Test

• SAT® I: Reasoning (SAT)

• Graduate Management Admissions Test (GMAT®)

• Test of English as a Foreign Language (TOEFL®)

• Praxis® Professional Assessments for Beginning Teachers

Only existing data gathered for operational programs, or data gathered for previous studies, were

examined here. All data for GRE, GMAT, and TOEFL testing programs were drawn from comparability

studies conducted by Schaeffer et al. (1998), Bridgeman, Anderson, and Wightman (1998), and Carey

(personal communication, May 1998), respectively. The SAT data were from a previous research study

(Lawrence, Potenza, & Feigenbaum, 1998). Praxis data were assembled from prior computer-based and

paper-based administrations of the test. No new experimental data were gathered for this study. Table 1

displays summary statistics for each testing sample. Note that some administrations were operational and

others were experimental, and all computer-based tests under study were adaptive -- that is, questions for

an individual examinee were selected from a pool of items based, in part, on that individual’s performance

on prior questions.

For three of the testing programs -- SAT, GMAT, and TOEFL -- data are for examinees who took

both CBT and paper-based formats. Data for the remaining two tests -- GRE and Praxis -- are matched

samples in which examinees took either CBT or paper-based tests and were then matched on the

following background variables: best language, race/ethnicity, gender, undergraduate major or program,

undergraduate grade point average (GPA), mother’s education level, and the examinee’s current education

level. In addition, Praxis examinees were also matched on years since attending college, and GRE

General Test examinees were matched on prior computer experience.

SAT. The SAT data consist of juniors who took the paper-based SAT in May 1996, and then took

an experimental computer-based SAT in June 1996. To increase motivation, examinees who took the

experimental administration were informed that, if they desired, scores from the computer-based test

could be added to their score record, which is reported to colleges. Participants were from schools

selected based on their proximity to CBT test centers, the number of juniors who took the SAT in

preceding years, the size of the school, the proportion of female test takers at the school, and score

consistency in previous years. Score consistency refers to the stability of SAT test scores from year to

year within a school. A total of 4,700 juniors were invited to take the computer-based SAT. Of the 1,732

students who took the computer-based test, analyses were conducted only on the 1,401 students who

reported making a “good” or “strong” effort on the computer-based version. This sample had slightly

higher paper-based scores than college bound seniors: Means for participants were 530 on the verbal

portion of the test, and 542 on the math section; means for college bound seniors were 505 and 508,

respectively (Lawrence, Potenza, & Feigenbaum, 1998).

GMAT. The GMAT data consisted of two separate samples. Both samples took an operational

paper-based test and an experimental computer-based test. The first sample (GMAT1) included 3,465

participants who took the paper-based test in October 1996. About half of these participants took the

computer-based test prior to the paper-based test, and half took the computer-based test after the paper-

based test. The higher of the two scores was reported, so test takers were motivated to do well. The

second sample (GMAT2) included 773 test takers who took the paper-based test prior to the computer-

based test. GMAT2 was administered in April 1997 with extended section time limits, because the

GMAT1 field trial revealed that the original computer time limits were too short for examinees to

complete the test. Current operational time limits are the same as those in GMAT2. Due to the relatively

small sample size, GMAT2 was used only for analyses by gender, language, and language by gender.

TOEFL. Forty-eight thousand people who had taken the paper-based TOEFL between November

1997 and February 1998 were invited to take the computer-based TOEFL between January and March

1998. TOEFL was administered in paper-based and computer-based format to 6,556 test takers. This

study examines 3,791 of these test takers who reported that their best language was Chinese, Korean,

Japanese, Thai, Russian, Spanish, or Arabic. Paper-based test fees were reimbursed and computer-based

test scores were sent out in June 1998, which motivated test takers to do well.

GRE General Test. GRE data were evaluated using two separate samples from the 1996-1997

academic year. The first sample (GRE1) consisted of the 190,044 U.S. citizens who took the GRE

General Test in the United States and who spoke English better than any other language. Of these, 78,257

examinees took the computer-based GRE and 111,787 took the paper-based version. The computer-based

and the paper-based samples were matched exactly on several characteristics, including race/ethnicity,

computer usage, gender, undergraduate GPA, undergraduate major, mother’s education level, and

examinee’s current education level. Matching was one to one, in that every individual record in the

computer-based sample was exactly matched to one other record in the paper-based sample on the above

characteristics. This matching was done to reduce some of the irrelevant differences between the samples.

Twenty-eight percent of the computer-based sample was eliminated because there was no exact match for

them in the paper-based sample. Because test takers were free to choose either the computer or paper

versions of the test, the groups cannot be considered to be randomly equivalent.
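
The exact one-to-one matching described above can be sketched as follows. This is a minimal illustration, not the procedure actually run for the study: the field names are hypothetical, and the rule for choosing among several eligible paper-based records is an assumption, since the report does not say how such ties were resolved.

```python
from collections import defaultdict

# Hypothetical field names; the report names the matching variables
# but does not describe a data layout.
MATCH_KEYS = (
    "race_ethnicity", "computer_usage", "gender", "undergrad_gpa",
    "undergrad_major", "mother_education", "education_level",
)


def exact_match(cbt_records, paper_records, keys=MATCH_KEYS):
    """Pair each CBT record with one unused paper record that agrees on every key.

    Returns (pairs, unmatched), where unmatched holds the CBT records
    that had no exact counterpart and would be dropped from the analysis.
    """
    # Index paper-based records by their full background profile.
    pool = defaultdict(list)
    for record in paper_records:
        pool[tuple(record[k] for k in keys)].append(record)

    pairs, unmatched = [], []
    for record in cbt_records:
        candidates = pool.get(tuple(record[k] for k in keys))
        if candidates:
            # Assumption: take any remaining candidate; each paper record is used once.
            pairs.append((record, candidates.pop()))
        else:
            unmatched.append(record)
    return pairs, unmatched
```

Because each paper-based record is removed from the pool once it is used, a computer-based examinee with a common background profile can still go unmatched when the paper-based sample runs out of counterparts.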

The second GRE General Test sample (GRE2), which was used for comparisons of examinees

whose best language was English with examinees whose best language was not English, was also

matched. Of the 313,124 test takers who reported their English proficiency level, 106,367 examinees who

took the computer-based GRE were matched with 206,757 examinees who took the paper-based version.

Using the matching procedure described above, computer-based test takers were matched with paper-

based test takers on eight background variables: computer usage, gender, undergraduate GPA,

undergraduate major, mother’s education level, examinee’s current education level, English fluency, and

race/ethnicity. Examinees who took the test outside of the United States were excluded from analyses

because data from international administrations could not be matched by race/ethnicity or citizenship.

Also, as in sample one, 27% of the resulting computer-based sample was eliminated because there was no

exact match in the paper-based sample.

Praxis. Praxis test data from 1993 through 1997 were evaluated by matching computer and paper

samples for each subtest (reading, mathematics, and writing). Using the same procedures described for the

GRE General Test samples, examinees who took the computer-based test were matched with examinees

who took the paper-based test on eight characteristics: gender, race/ethnicity, undergraduate GPA,

examinee’s highest education level, mother’s highest education level, years since attending college,

participation in a teacher education program, and speaking another language better than English. For each

subtest, over 86% of computer-based test takers were matched with paper-based test takers: 39,027 who

took the reading subtest, 40,257 who took the writing subtest, and 40,325 who took the mathematics

subtest.

Procedures

The purpose of this study was to examine the effects of computer-based testing on group

differences in performance. This paper examines differences between differences. Thus, the difference

between reference and focal group performance (that is, impact) was calculated for each testing format

(paper or computer). The size of the difference for each testing format was then compared within each

group of examinees (for example, impact for the paper-based SAT was compared to impact for the

computer-based SAT). Several different group comparisons, based on race/ethnicity, gender, and

language, were then conducted for each testing sample.

Analysis. For the SAT, GRE General Test, and GMAT samples, CBT and paper-based scores had

been previously equated according to standard operating procedures. Because TOEFL and Praxis scores

for the paper and computer tests were on different scales, z-scores were generated for data on these tests.

These procedures assure that there will be no overall format effects, but do not preclude finding format

effects in any subgroup. Since z-scores were generated for all examinees (including language groups not

included in TOEFL analyses and test takers that did not report their gender, race/ethnicity, or language),

mean z-scores may not equal zero for the sample used in the current analyses.
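
The point about nonzero means can be illustrated with a small sketch. The scores below are invented for illustration and are not data from this study; the use of the population standard deviation is also an assumption, because the report does not state which form was used.

```python
from statistics import mean, pstdev


def to_z_scores(scores):
    """Standardize scores against the mean and SD of everyone in `scores`."""
    center = mean(scores)
    spread = pstdev(scores)  # assumption: population SD
    return [(s - center) / spread for s in scores]


# Made-up scores for all examinees on one test format.
all_examinees = [40, 45, 50, 55, 60, 65]
z_all = to_z_scores(all_examinees)

# Suppose only the last four examinees belong to the groups analyzed.
z_analysis = z_all[2:]

full_mean = mean(z_all)            # zero, up to floating-point rounding
analysis_mean = mean(z_analysis)   # above zero: this subsample scored higher
```

Standardizing on the full pool and then analyzing a subset leaves the subset's mean wherever it happens to fall relative to the pool, which is why the tabled mean z-scores need not be zero.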

The data for all tests were summarized in terms of the standardized mean differences between

paper-based tests and computer-based tests for each population. Standardized mean differences are one

way of measuring effect size. These statistics were calculated for each test, comparing computer and

paper-based performance in the following manner. First, the standardized difference, d, between a

reference group (e.g., White examinees) and a focal group (e.g., African American examinees) was

computed as follows:

$$d = \frac{\bar{x}_r - \bar{x}_f}{\sqrt{\left(SD_r^2 + SD_f^2\right)/2}}$$

where $\bar{x}_r$ is the mean for the reference group, $\bar{x}_f$ is the mean for the focal group, and $SD_r$ and $SD_f$ are

the standard deviations for the reference and focal groups, respectively. This formula, based on

unweighted standard deviations, is not dependent on subgroup sample sizes, unlike the usual formula that

uses weighted standard deviations (Willingham & Cole, 1997, p. 21). Next, the average difference in

impact was defined as the difference between dp and dc (that is, dp - dc), where dc is the impact between reference and

focal groups for the CBT, and dp is the impact between reference and focal groups for the paper-based test.

Thus, positive numbers indicate relatively higher scores for the focal group on the computer-based test.
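
The computation just described can be sketched in a few lines. The sketch assumes that the unweighted denominator is the square root of the simple average of the two group variances, and that sample (n - 1) standard deviations are used; neither detail is confirmed by the surviving text, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_difference(reference, focal):
    """Impact d: reference mean minus focal mean, over an unweighted pooled SD.

    Assumption: pooled SD = sqrt((SD_r^2 + SD_f^2) / 2), so subgroup sample
    sizes do not enter the denominator.
    """
    pooled_sd = sqrt((stdev(reference) ** 2 + stdev(focal) ** 2) / 2)
    return (mean(reference) - mean(focal)) / pooled_sd


def change_in_impact(d_paper, d_cbt):
    """dp - dc: positive when the focal group does relatively better on the CBT."""
    return d_paper - d_cbt


# Made-up scores for illustration only; not data from the report.
d_example = standardized_difference([10, 12, 14], [8, 10, 12])
shift_example = change_in_impact(d_paper=0.50, d_cbt=0.30)
```

With these invented values, a paper-based impact of 0.50 and a computer-based impact of 0.30 give a positive change, the direction reported for groups that appear to benefit from the CBT format.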

Average differences for paper-based tests and CBTs were analyzed for each subgroup by

race/ethnicity, gender, gender within race/ethnicity, and race/ethnicity within gender. Analyses also

examined differences by language group and by gender within language group. The reference group was

defined as White examinees for analyses by race/ethnicity and race/ethnicity within gender. Male

examinees were the reference group for analyses by gender, gender within race/ethnicity, and gender

within language group. Mean differences for each pair (reference group - focal group) were then

compared by testing format (paper-based versus CBT).

Analyses comparing students whose best language is English and students who speak another

language better than English were conducted for the GRE General Test and GMAT samples. Because few

examinees taking the SAT or Praxis exams reported that English is not their best language, analyses by

language group were not conducted for these two tests. The GRE General Test and GMAT did not collect

information on examinees’ best language if it was not English, so analyses for these two tests looked at

performance by groups based only on two English language proficiencies: speak English as well or better

than any other language, or do not speak English as well as another language. These two groups will be

referred to as English language (EL) and other language (OL). For these analyses, EL was considered the

reference group.

Analyses of TOEFL data were conducted on several language groups. Very few TOEFL

examinees reported English as their best language. Consequently, some other criterion was required for

defining a reference group. Because the focus of the current study is on how the shift to computer-based

testing might differentially affect various subgroups of the testing population, computer familiarity

seemed a reasonable variable to use in selecting a reference group. Recent work by Kirsch, Jamieson,

Taylor, and Eignor (1998) examining computer familiarity among various language groups within the

TOEFL population found that examinees reporting Spanish as their best language had the highest scores

on the computer-familiarity scale developed for this purpose. Therefore, for analyses of TOEFL data,

Spanish-speaking examinees were considered the reference group.

For SAT, GMAT, and TOEFL samples, in which the same examinees took the test in both

formats, we tested the statistical significance of dp-dc using the group-by-format interaction from the

repeated measures ANOVA (for example, male vs. female by CBT vs. paper-based test). For the GRE

General Test and Praxis we used a t-test for independent samples. The GRE and Praxis samples are

considered independent because these two groups of examinees self-selected either a paper-and-pencil or

CBT format, and matching based on background variables does not control for self-selection. The

numerator for the t-tests was dp-dc and the denominator was the standard error of the difference. The

standard errors for dp and dc (needed to compute the standard error of the difference) were derived from

the first-order Taylor series expansion of d as described in Willingham and Johnson (1997, section 3, p.

4). Although we focused primarily on mean differences in this study, we also investigated differences

over the full range of scores.
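
For the independent-samples case, the test statistic described above can be sketched as follows. The standard errors of dp and dc are taken as inputs, because the Taylor series formula used to derive them is not reproduced in this report; combining them as the square root of the sum of squares is the standard result for independent estimates. The function name and numeric values are illustrative.

```python
from math import sqrt


def impact_change_t(d_paper, se_paper, d_cbt, se_cbt):
    """t statistic for dp - dc when the paper and CBT samples are independent.

    The standard error of the difference combines the two standard errors
    as the square root of the sum of their squares.
    """
    se_difference = sqrt(se_paper ** 2 + se_cbt ** 2)
    return (d_paper - d_cbt) / se_difference


# Made-up inputs for illustration only; not values from the report.
t_example = impact_change_t(d_paper=0.50, se_paper=0.03, d_cbt=0.40, se_cbt=0.04)
```

With very large samples the standard errors become small, so even a modest change in impact can yield a large t value.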

Results

Means and sample size for each test by group can be found in Tables 2, 4, 6, 8, and 10, while

Tables 3, 5, 7, 9, and 11 display the difference between dp and dc with associated significance values for

t-tests and ANOVAs. Note that there are large variations in sample sizes. As a result, some significant

differences are actually small in magnitude relative to other differences that were not significant. Results

of analyses conducted by race/ethnicity and race/ethnicity within gender are reported in Tables 3 and 5.

Results for analyses of language groups and language group within gender are reported in Tables 7, 9, and

11. All the tables refer to changes in impact from paper-based test to computer-based test.

Appendix A shows the regressions of paper scores on computer scores for the tests in which the

same examinees took both test formats. The regression line is a Lowess line (Chambers, Cleveland,

Kleiner, & Tukey, 1983), which permits observation of nonlinear relationships. However, the regressions

were all reasonably linear and virtually the same for all subgroups.

Appendix B contains plots of means on paper-based tests and CBTs for all analyses by

subgroups. In each plot, the diagonal represents the point at which the two scores are equal. This diagonal

was adjusted for the SAT and GMAT2 samples to approximate the practice effect, because all students in

these samples took the paper-based test first. For the SAT sample, adjustments were made based on actual

score gains for the total sample (5.8 points for verbal and 6.9 points for math). For the GMAT2 sample,

adjustments were made based on the average increase in score for GMAT1, which alternated the order of

paper-based and computer-based administrations; adjustments were .59 of a point on GMAT1 verbal

score and .76 of a point on GMAT1 quantitative score. This type of correction was not necessary for the

TOEFL data, because plots were generated from z-scores. Groups falling below the line performed

relatively better on the computer-based version of the test and groups falling above the line performed

relatively better on the paper version of the test. Bars around each mean represent two standard errors

from the mean.

Race/ethnicity. Tables 3 and 5 display statistics for data on race/ethnicity and gender groups for

all verbal and quantitative tests under study. Analyses by race/ethnicity of both verbal and quantitative

tests indicated that impact generally decreased for African American, and to a lesser extent, for Hispanic

examinees on CBT versions of the tests.

Of the five verbal sections examined (the GRE1, SAT, and GMAT1 verbal components, as well

as the Praxis reading and writing subtests), four sections showed reduced impact for African American

examinees in the CBT version of tests compared with the paper-based version (for GRE1 verbal results, t

[53,415] = 2.83, p < .01; for SAT verbal, F [1, 1,149] = 7.12, p < .01; for the Praxis reading test, t

[37,538] = 5.25, p < .01; and for the Praxis writing test, t [39,043] = 4.33, p < .01). Impact was also

reduced for Hispanic examinees on three CBT versions of the five verbal tests (for SAT verbal, F [1,

1,127] = 4.32, p < .05; for Praxis reading, t [35,534] = 1.97, p < .05; and for Praxis writing, t [37,084] =

2.07, p < .01).

On the four quantitative tests (the GRE1, SAT, and GMAT1 quantitative sections, and the Praxis

mathematics subtest), a similar pattern of reduced impact was evident for both African American and

Hispanic examinees on the CBT versions of the tests. There was significantly reduced impact for African

American examinees on the GRE1 quantitative test (t [53,415] = 3.90, p < .01), the Praxis mathematics

test (t [38,800] = 7.86, p < .01), and the GMAT1 quantitative test (F [1, 1,993] = 10.77, p < .001). For

Hispanic examinees, a reduction in impact on CBT over paper-based versions of quantitative tests was

found on the GRE1 quantitative test (t [51,931] = 2.97, p < .01) and the SAT math test (F [1, 1,127] =

4.26, p < .05).

Race/ethnicity within gender. African American and Hispanic examinees, especially females, also

appeared to benefit from the CBT format in comparisons of race/ethnicity within gender. When compared

to White female examinees on verbal tests, there was significantly reduced impact for African American

females on the CBT versions of four out of five of the tests: GRE1 verbal (t [35,098] = 2.62, p < .01),

SAT verbal (F [1, 637] = 6.53, p < .05), and Praxis reading and writing tests, respectively (t [27,300] =

4.43, p < .01, and t [27,851] = 3.27, p < .01). When compared with White males, there was also reduced

impact for African American males on the CBT version of the two Praxis tests (t [10,234] = 2.89, p < .01,

for reading, and t [11,188] = 3.18, p < .01, for writing). Finally, there was reduced impact for Hispanic

females on the CBT version of the SAT verbal test (F [1, 616] = 18.17, p < .001).

When compared with White males, there was significantly reduced impact for African American

males on the CBT versions of two of the four quantitative tests: GRE1 quantitative (t [18,315] = 2.12, p <

.05), and Praxis mathematics (t [10,138] = 3.16, p < .01). Impact was also reduced on the CBT versions of

two of the tests for Hispanic males: GRE1 quantitative (t [18,184] = 3.25, p < .05) and SAT math (F [1,

509] = 6.35, p < .05).

When compared with White females, there was reduced impact for African American females on

the CBT version of three of the four quantitative tests: GRE1 quantitative (t [35,098] = 3.35, p < .05),

Praxis mathematics (t [28,660] = 7.26, p < .01), and GMAT1 quantitative (F [1, 849] = 10.78, p < .001).

There was also reduced impact for Asian females on the CBT version of the Praxis mathematics test (t

[26,781] = 2.25, p < .05).

Gender. The pattern revealed for analyses by gender appears to be the reverse of the patterns for

race/ethnicity. On five of the six verbal tests examined by gender, there was increased impact for female

examinees on the CBT versions: GRE1 verbal (t [56,653] = 2.86, p < .01), GMAT1 verbal (F [1, 3,463] =

11.95, p < .001), GMAT2 verbal (F [1, 771] = 6.99, p < .01), and Praxis reading and writing (t [39,023] =

3.23, p < .01, and t [40,253] = 3.16, p < .01, respectively). There was also increased impact for females on

two of the three TOEFL subtests: reading (F [1, 3,756] = 4.32, p < .05) and listening (F [1, 3,756] = 7.72,

p < .01). A similar, though weaker, pattern was found on the quantitative tests, where two of the five tests

showed increased impact for females on the CBT versions: GRE1 quantitative (t [56,653] = 2.13, p < .05)

and Praxis mathematics (t [40,321] = 6.48, p < .01). Although these gender differences are significant, the

actual differences are small. Refer to Appendix A for scatter plots of computer score and paper score by

gender for the repeated measures samples (SAT, TOEFL, GMAT1, and GMAT2).

Gender within race/ethnicity. The only group to show a statistically significant gender difference

in impact between paper-based and CBT versions of any of the verbal or quantitative tests was White

examinees. Since White examinees comprise the largest racial/ethnic group in the testing population for

all of the tests discussed here except TOEFL, it is not surprising that the pattern for White males and

females mirrors the pattern for all males and females in the samples examined.

There was increased impact for White females on the CBT version of four of the five verbal tests:

GRE1 verbal (t [50,203] = 2.63, p < .01), GMAT1 verbal (F [1, 1,786] = 11.42, p < .001), and Praxis

reading and writing (t [34,942] = 2.91, p < .01, and t [36,470] = 2.87, p < .01, respectively). There was

also increased impact on two of the quantitative tests: GRE1 quantitative (t [50,203] = 1.98, p < .05) and

Praxis mathematics (t [36,007] = 4.24, p < .01).

Language (English vs. other). Tables 7 and 9 display statistics for analyses of English versus

other language groups for verbal and quantitative tests. Analyses by language group on both verbal and

quantitative tests showed no consistent patterns. Three samples were analyzed for two tests with verbal

and quantitative sections. The three samples were the domestic test takers of the GRE (GRE2) and two

separate GMAT administrations (GMAT1 and GMAT2).

On the verbal tests, there was increased impact in the CBT format for the other language group in

one of the three samples: GMAT1 (F [1, 3,463] = 4.97, p < .05). There was no significant change due to

format for the other two samples (GRE2 and GMAT2). For quantitative tests, there was a significant

decrease in impact on one of the three tests: GMAT2 (F [1, 683] = 5.44, p < .05). Impact in the other two

samples showed no significant differences by format for the quantitative tests. With these contradictory

findings, it is important to keep in mind that the GRE2 test takers represent the largest sample and showed

no significant differences in impact on the quantitative or verbal subtests.

Language (English vs. other) within gender. No definite patterns were found in analyses of

groups for whom English is the primary language versus some other language within gender. For male

examinees on the GRE2 and GMAT2, no significant changes in impact by format were found between the

English language and other language groups on the verbal or quantitative tests. On the CBT version of

the GMAT1 quantitative test, impact decreased for males in the other language group (F [1, 2,047] = 5.28, p

< .05). However, impact increased for these males on the computer-based GMAT1 verbal test (F [1,

2,047] = 4.30, p < .05). In short, males in the English language group did relatively better on the

computer-based GMAT1 verbal test, and relatively worse on the computer-based GMAT1 quantitative

test, than males in the other language group. It is worth noting that the increase in impact found in the

total group analysis on the CBT version of the GMAT1 verbal test was primarily due to these results for

male test takers.

A majority of the analyses for female test takers in the English language and other language

groups revealed no significant differences. On the verbal tests (GMAT1, GMAT2, and GRE2), no

significant differences in impact were found between these two groups of females. The same was true for

two of the three quantitative tests (GRE2 and GMAT1). However, impact decreased for females in the

other language group on the CBT version of the GMAT2 quantitative test (F [1, 292] = 3.90, p < .05). That is

to say, females in the other language group did relatively better on the CBT version of the GMAT2

quantitative test than did females in the English language group.

Language (Spanish vs. other). Table 11 displays data from analyses of language groups, and

gender within language group. As mentioned earlier, Spanish-speaking examinees were identified as the

reference group for these analyses. On the TOEFL computer-based writing test, impact increased for

Korean and Chinese examinees (F [1, 1,130] = 6.87, p < .01, and F [1, 1,801] = 15.55, p < .001,

respectively). The opposite was true for Russian test takers (F [1, 712] = 4.32, p < .05). That is, Russian

examinees improved significantly more than Spanish-speaking examinees with the move from the paper-

based writing test to the computer-based writing test.

On the CBT version of the TOEFL reading test, impact decreased for Japanese and Arabic

examinees (F [1, 793] = 4.23, p < .05, and F [1, 707] = 7.09, p < .01, respectively), while all other groups

showed no significant change due to format. Finally, on the computer-based TOEFL listening test, impact

increased for Chinese examinees (F [1, 1,801] = 27.44, p < .001), Korean examinees (F [1, 1,130] =

13.59, p < .001), Japanese examinees (F [1, 793] = 20.82, p < .001), and Thai examinees (F [1, 761] =

14.71, p < .001). Results of these analyses suggest that Chinese and Korean examinees may find the CBT

tests somewhat more difficult than paper-and-pencil tests. These two language groups showed increased

impact on the CBT versions of two out of three tests. No other patterns were evident for the remaining

language groups.

Language (Spanish vs. other) within gender. On the TOEFL reading and writing tests, no

significant differences in impact were found when comparing Spanish-speaking males to males in the

other six language groups. On the TOEFL listening test, impact for Chinese, Korean, Japanese, and

Arabic males increased significantly on the CBT version when compared with Spanish-speaking males

(Chinese: F [1, 720] = 13.88, p < .001; Korean: F [1, 564] = 13.42, p < .001; Japanese: F [1, 300] = 10.82,

p < .001; and Arabic: F [1, 387] = 4.05, p < .05).

Female examinees showed a slightly different pattern. On the TOEFL writing test, impact for

Chinese females increased significantly on the CBT version (F [1, 1,064] = 13.38, p < .001). Analyses of

the TOEFL reading test indicated that impact decreased on the CBT version for Japanese females when

they were compared to Spanish-speaking females (F [1, 491] = 8.96, p < .001). Finally, results from the

TOEFL listening test indicated that impact for Chinese, Japanese, and Thai females increased

significantly on the CBT version (Chinese: F [1, 1,064] = 11.97, p < .001; Japanese: F [1, 491] = 8.26, p <

.001; and Thai: F [1, 405] = 12.67, p < .001).

It appears that the decrease in impact found for the total group of Japanese examinees on the

TOEFL reading test is primarily due to results found for Japanese females on the computer-based version

of the test. From the paper-based test to the computer-based test, scores for Japanese males and females

increased by z-score units of .04 and .13, respectively, while z-scores for Spanish-speaking males

declined by .05, and Spanish-speaking females lost .03 z-score units.
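The arithmetic behind this comparison can be sketched in a few lines of Python. The sketch uses the TOEFL reading z-score means for Japanese and Spanish-speaking females as listed in Table 10, and it assumes that a change in impact can be read as the focal group's gain minus the reference group's gain, which is only one interpretation of the dp - dc statistic. The names READING_MEANS, format_gain, and change_in_impact are illustrative and do not come from the study.

```python
# TOEFL reading z-score means from Table 10, stored as (CBT mean, paper mean).
READING_MEANS = {
    "Japanese females": (-0.41, -0.54),
    "Spanish females": (0.19, 0.22),
}


def format_gain(cbt_mean, paper_mean):
    """Return a group's change in mean z-score from the paper test to the CBT."""
    return round(cbt_mean - paper_mean, 2)


def change_in_impact(focal, reference, means=READING_MEANS):
    """Return the focal group's gain minus the reference group's gain.

    Assumption: this difference of gains stands in for dp - dc, so a positive
    value means the focal group is relatively higher on the CBT.
    """
    focal_gain = format_gain(*means[focal])
    reference_gain = format_gain(*means[reference])
    return round(focal_gain - reference_gain, 2)


if __name__ == "__main__":
    for group, (cbt, paper) in READING_MEANS.items():
        print(f"{group}: gain of {format_gain(cbt, paper):+.2f} z-score units")
    shift = change_in_impact("Japanese females", "Spanish females")
    print(f"Change in impact (Japanese vs. Spanish females): {shift:+.2f}")
```

Under this assumption, a gain of .13 for Japanese females set against a loss of .03 for Spanish-speaking females amounts to a shift of .16 z-score units in favor of the focal group.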

Gender within language group. The only significant gender difference in impact between paper-

based and computer-based tests for a language group was found among Thai examinees on the TOEFL

reading and listening tests. On both tests, Thai males performed better on the computer-based version of

the test, while Thai females performed better on the paper-based version (F [1, 336] = 4.19, p < .05, and F

[1, 336] = 7.49, p < .01, for the reading and listening tests, respectively).

Conclusion

The study presented here examined differences in the magnitude of impact on gender,

racial/ethnic, and language groups resulting from the change from paper-based tests to computer-based

tests across several testing programs. The primary purpose of the analyses was to determine whether

consistent patterns of impact are evident based on gender, race/ethnicity, and language. If large

differences in any subgroup were found, scores on paper-based tests and CBTs could not be treated as

comparable.

The several samples used in this study were large, and the effects that were found were small.

Consequently, considerable thought must go into determining the practical implications of these findings.

Even though differences are statistically significant, they may have no real effect on the admissions

process for which the test scores are intended.
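The gap between statistical and practical significance can be illustrated with one of the reported comparisons. The sketch below takes the gender result for the GRE1 quantitative test (t = 2.13 with 56,653 degrees of freedom) and converts it to a two-sided p-value and to an effect-size correlation. Both conversions are assumptions made for illustration rather than analyses from the study: the p-value treats t as standard normal, which is reasonable only because the degrees of freedom are so large, and the function names are hypothetical.

```python
import math

# Reported gender comparison on the GRE1 quantitative test: t [56,653] = 2.13.
T_VALUE = 2.13
DEGREES_OF_FREEDOM = 56_653


def two_sided_p(t_value):
    """Two-sided p-value, treating t as standard normal (an approximation for very large df)."""
    return math.erfc(abs(t_value) / math.sqrt(2.0))


def effect_size_r(t_value, df):
    """Effect-size correlation computed as r = t / sqrt(t^2 + df)."""
    return t_value / math.sqrt(t_value ** 2 + df)


if __name__ == "__main__":
    p = two_sided_p(T_VALUE)
    r = effect_size_r(T_VALUE, DEGREES_OF_FREEDOM)
    print(f"p = {p:.3f}")
    print(f"r = {r:.4f}, r squared = {r * r:.6f}")
```

With a sample this large, the p-value falls below .05 even though the corresponding correlation is below .01, which is the pattern the paragraph above describes.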

The computer-based and paper-based tests studied here differed in more ways than just the mode

of administration. First, all of the computer-based tests were also adaptive tests; that is, question difficulty

was tailored to student ability on the computer-based tests but not on the paper-based tests. Second, only

the paper-based tests permitted students to skip questions and return to them later if time allowed. Third,

because the paper-based SAT -- unlike any of the other paper-based tests studied -- contained a correction

for guessing, SAT examinees had to decide whether to guess or leave questions blank when they did not

know the answer; on the computer-based test, they had to answer each question in order to move on to the

next question. Fourth, on computer-based tests, an examinee might have to scroll back to find the

appropriate part of a passage, while on paper-based tests, it could be on the same page as related

questions. On reflection, it is not clear whether the general lack of gender and racial/ethnic differences

noted in the study resulted from minimal differences on all of these characteristics of the testing

experience, or whether negative impacts on one characteristic were counterbalanced by positive impacts

on another.

Although differences were generally quite small, some consistent patterns of impact were found

for racial/ethnic and gender groups. African American examinees and, to a lesser degree, Hispanic

examinees appear to benefit slightly from the CBT format. Where significant differences in impact were

found for these groups, all indicated reduced impact as a result of the change to the CBT format. On all

CBTs analyzed by race/ethnicity, African Americans and Hispanics performed better than or equal to how

they performed on the paper-based tests.

These findings agree with early work examining the School and College Ability Test (Johnson &

Mihal, 1973). On that test, African American students scored significantly higher on the CBT version of

the test, whereas there was a lesser (and not significant) score increase for White students. Pine, Church,

Gialluca, and Weiss (1980) reported similar findings, but attributed differences to an increase in

motivation on the computer-based test on the part of African American examinees. As noted above, these

statistically significant differences in the present study may have no practical significance because of their

small size. For example, in scale-score points, the change in impact from paper-based to computer-based

test performance was only four points for African American examinees who took the GRE1 quantitative

test.
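The four-point figure can be reproduced from the group means. The sketch below uses the GRE1 quantitative scale-score means for White and African American examinees as read from Table 2, and it assumes that impact is the reference-group mean minus the focal-group mean within each format. The dictionary and function names are illustrative only.

```python
# GRE1 quantitative scale-score means as read from Table 2.
GRE1_QUANT_MEANS = {
    "White": {"cbt": 544, "paper": 540},
    "African American": {"cbt": 419, "paper": 411},
}


def impact(means, reference, focal, mode):
    """Reference-group mean minus focal-group mean for one test format."""
    return means[reference][mode] - means[focal][mode]


def change_in_impact(means, reference, focal):
    """Paper-based impact minus CBT impact; positive means the gap narrowed on the CBT."""
    paper_gap = impact(means, reference, focal, "paper")
    cbt_gap = impact(means, reference, focal, "cbt")
    return paper_gap - cbt_gap


if __name__ == "__main__":
    for mode in ("paper", "cbt"):
        gap = impact(GRE1_QUANT_MEANS, "White", "African American", mode)
        print(f"{mode} impact: {gap} scale-score points")
    change = change_in_impact(GRE1_QUANT_MEANS, "White", "African American")
    print(f"change in impact: {change} scale-score points")
```

On these means the gap is 129 points on the paper-based test and 125 points on the CBT, a difference of four points.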

The pattern for female examinees appears to be the reverse of the pattern for African American

and Hispanic test takers; impact generally increased for females on the computer-based test. However,

when gender differences were compared within racial/ethnic groups, statistically significant differences

were found only for the group of White examinees (the largest sample). Impact for White females

increased on the CBT versions of tests; however, all differences in impact were relatively small, ranging

from -.128 to .016 standard deviation units.

Analyses for groups based on English language proficiency versus other language proficiency

showed a fairly consistent pattern; verbal tests showed no difference in impact, while quantitative tests

showed impact. One of three quantitative tests showed increased impact on the CBT, while two of the

three quantitative tests showed reduced impact on the computer-based test. These differences may be due

to factors specific to the individual test or population, since they were found for the GMAT1 and GMAT2

samples, but not for the GRE2 population. It is important to keep in mind that even significant differences

were relatively small. The differences in impact for all three tests by English language versus other

language group, and by English language versus other language group within gender, ranged from -.051

to .177 standard deviation units.

As noted earlier, the Spanish language group was selected as the reference group for analyses of

language groups for the TOEFL test. This was based on earlier findings regarding computer familiarity.

Results of our analysis indicate that some language groups --especially Chinese and Korean speakers --

perform slightly better on the paper version of the test than on the CBT version (relative to the Spanish-

speaking group), while other groups -- especially Russian speakers -- perform slightly better on the CBT

version.

Results of the computer familiarity study (Kirsch, Jamieson, Taylor, & Eignor, 1998) do not

explain these differences. There does not appear to be a direct relationship between the computer

familiarity of specific groups and their performance on computer-based tests, relative to paper-based tests.

These findings are in line with another TOEFL study examining computer familiarity and TOEFL test

performance (Taylor, Jamieson, Eignor, & Kirsch, 1998). The authors of that study concluded that,

although statistically significant differences were found in the performance of groups based on computer

familiarity on a computer-based version of TOEFL, differences were so small that they had no practical

significance. However, it should be noted that at least one of the language groups (Russian) that

outperformed Spanish-speaking students on CBT tests in the present study was not included in either of

the earlier TOEFL studies of computer familiarity.

As testing programs move from paper-based to CBT formats, it is important to understand the

consequences this format change may have on the examinee population. The research discussed here

indicates that for some groups within the examinee population, the move to a CBT format may actually be

beneficial, while for other groups it may not be. Because some differences in impact due to testing format

(paper-based or computer-based) occur consistently across several tests, we can assume that changes in

impact are due to changes in format and not to anomalies associated with changes to specific tests.

Identifying a pattern across a variety of tests is an initial step toward understanding the source of group

differences in performance on paper-based or CBT formats. There is no way of knowing from the present

study whether lower or higher performance on CBTs, relative to paper-based tests, actually describes any

group’s ability or achievement. Future studies will need to focus on understanding the source of these

differences.

Future Research

This study identified a few patterns of changes in impact for population subgroups on high stakes

tests that appear to be associated with computer-based versus paper-based test format. Future work should

explore the underlying reasons for these findings and their practical significance. For example, it is

possible that computer-based testing creates a less threatening environment for some students and a more

threatening environment for others, and that the differences in impact for the two testing conditions are an

indicator of this effect. Stereotype threat (Steele, 1997) has recently been proposed as a contributor to the

lower scores achieved by minority populations on high stakes tests. It is conceivable that African

American and Hispanic students experience less stereotype threat (or test anxiety or some other

phenomenon that may depress test scores) in a computer-based testing environment in which examinees are

physically isolated from one another. Results for females, on the other hand, indicate that they may

experience an increase in stereotype threat (or other phenomena) under the computer-based condition.

Given the proliferation of technology and improved internet access since the introduction of

CBTs, other work is needed to examine the relationship of computer familiarity to test performance on

computer-based tests, relative to paper versions. Future studies examining group differences in

performance should control for both computer access and familiarity. As more paper-based tests are

replaced by computer-based tests, a new body of research should be developed focusing on relationships

between subgroup performance differences and aspects of the CBT environment, such as computer

familiarity, computer interface, and user control issues, such as font size.

References

Bridgeman, B., Anderson, D., & Wightman, L. (1998). GMAT comparability study. Princeton, NJ: Educational Testing Service.

Bridgeman, B., & Cooper, P. (1998, April). Comparability of scores on word-processed and handwritten essays on the Graduate Management Admissions Test. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Bridgeman, B., & Schaeffer, G. A. (1995, April). A comparison of gender differences on paper-and-pencil and computer-adaptive versions of the Graduate Record Examination. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Chambers, J. M., Cleveland, W. S., Kleiner, B., & Tukey, P. A. (1983). Graphical methods for data analysis. Boston: Duxbury Press.

Eignor, D. R., Way, W. D., & Amoss, K. E. (1994, April). Establishing the comparability of the NCLEX using CAT with traditional NCLEX examinations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Hoover, M. R., Politzer, R. L., & Taylor, O. (1990). Bias in reading tests for Black language speakers: A sociolinguistic perspective. In A. G. Hilliard (Ed.), Testing African American students (pp. 81-98). Morristown, NJ: Aaron Press.

Johnson, D. F., & Mihal, W. L. (1973). Performance of Blacks and Whites in computerized versus manual testing environments. American Psychologist, 28, 694-699.

Kirsch, I., Jamieson, J., Taylor, C., & Eignor, D. R. (1998). Computer familiarity among TOEFL examinees (ETS Research Report No. 98-6). Princeton, NJ: Educational Testing Service.

Lawrence, I., Potenza, M. T., & Feigenbaum, M. (1998, April). Examinee reactions to a computer-based administration of the SAT. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Mazzeo, J., & Harvey, A. L. (1988). The equivalence of scores from automated and conventional educational and psychological tests: A review of the literature (College Board Report No. 88-8). Princeton, NJ: Educational Testing Service.

Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449-458.

O'Neill, K., & Powers, D. E. (1993, April). The performance of examinee subgroups on a computer-administered test of basic academic skills. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta, GA.

Pine, S. M., Church, A. T., Gialluca, K. A., & Weiss, D. J. (1980). Effects of computerized adaptive testing on Black and White students (Research Report No. 79-2). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Schaeffer, G. A., Bridgeman, B., Golub-Smith, M. L., Lewis, C., Potenza, M. T., & Steffan, M. (1998). Comparability of paper-and-pencil and computer adaptive test scores on the GRE General Test (GRE Board Professional Report No. 95-08; ETS Research Report No. 98-38). Princeton, NJ: Educational Testing Service.

Schaeffer, G. A., Steffan, M., Golub-Smith, M. L., Mills, C. N., & Durso, R. (1995). The introduction and comparability of the computer-adaptive GRE General Test (GRE Board Professional Report No. 88-08). Princeton, NJ: Educational Testing Service.

Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist, 52, 613-629.

Taylor, C., Jamieson, J., Eignor, D. R., & Kirsch, I. (1998). The relationship between computer familiarity and performance on computer-based TOEFL test tasks (ETS Research Report No. 98-08). Princeton, NJ: Educational Testing Service.

Ward, W. C., & Bridgeman, B. (1996). Subgroup differences and acceptance of computer-based testing (ETS internal report). Educational Testing Service, Princeton, NJ.

Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.

Willingham, W. W., & Johnson, L. (1997). Supplement to gender and fair assessment. Princeton, NJ: Educational Testing Service.

Table 1

Descriptive Statistics for Test Sample

Subjects Administration

Same Experimental

Operational

Matched Operational

Mother’s

Sex Language Race/Ethnicity GPA education Computer Use

% English best or equal % with B.A. or % who use

Test Sample Date N^a % to other language % White % > 3.0 higher sometimes or more

SAT 1996 1,401 55 97 75 87 45 64

GMAT1 1996 3,465 41 68 52 NA NA NA

GMAT2 1997 773 43 70 53 NA NA NA

TOEFL 1997-98 3,791 56 NA NA NA NA NA

GRE1 1996-97 56,657 66 100 89 80 40 89

GRE2 77,233 64 97 85 77 39 8%

40,257 71 100 91 54 30 NA

1993-97 39,027 73 100 90 53 31 NA

PRAXIS-Mathematics 40,325 74 100 89 53 29 NA

Note. NA = data not available.

. . Citizens

% of U.S.

Citizens

100

69

75

NA

100

100

NA

NA

NA

^a For matched samples, the overall N reflects the number of matches, not the number of test takers. ^b Choices were never, rarely, sometimes,

frequently, or daily.

Table 2

Descriptive Statistics for Quantitative Tests by Race/Ethnicity, Gender, and Race/Ethnicity Within Gender

GRE1 quantitative score SAT math score Praxis math z-score GMAT1 quantitative score

White

African American

Asian

Hispanic

CBT Paper

M SD M SD

544 121 540 119

419 119 411 104

618 119 606 121

496 129 481 117

CBT PC 3T Pa )er

N SD

50.207 .93

3,212 488 91 480 89 100 -.88

#

i-I.13 .98 1.06

1.452 581 101 578 99 131 .I2 .96 1 .Ol 1 .oo

1,728 76 78 -.36 .99 1.04

White Males 17,609 99 476 .45 .89 .39 .87

African American Males 708 99 36 -.54 1.03 -.82 1.07

Asian Males 615 584 101 580 93 69 .44 .79

Hispanic Males 577 569 85 539 75 35 -.04 1.03

White Females 515 111 512 110 32,598 537 90 527 89 575 -.05 .93

African American Females 406 109 399 95 2,504 480 89 474 83 64 -.98 .94 -1.21 1.04

Asian Females 579 117 568 119 837 577 101 575 106 62 .04 .97 -.09 1.01

Hispanic Females 464 117 458 107 1,151 538 90 525 78 43 -.46 .97 -.49 1.04

Males 595 123 587 121 19,522 573 103 563 .93 .31 .93

Females 507 116 503 114 37,135 534 95 526 .99 -.11 1.00

Note. GMAT2 quantitative score was not included in the table due to insufficient numbers for racial/ethnic analyses. Gender statistics are as

10,523 35 9 35 9 2,049

29,802 30 9 31 9 1,416

follows: males (M_CBT = 33, SD = 9; M_paper = 33, SD = 9; and n = 438) and females (M_CBT = 29, SD = 9; M_paper = 29, SD = 9; and n = 335).

Table 3

Differences in Impact for Population Groups^a by Quantitative Test

I GREl quantitative I SAT math Praxis math I

GMATl quantitative

~ dp-dcb/ e 1

African American

dp-dcb e

.I0 .Ol B>W -.Ol ns

-.07 ns

.I3 .05 H>W

-02 ns

-.05 ns

.25 .05 H>W

-.Ol ns

-.07 ns

.04 ns

Asian I I / -06 ns White

Hispanic 1 .I1 / .Ol j H>W

African American males I .I3 I I .05 B>W

White males Asian males

Hispanic males

.06 ns

.20 .05 H>W

White females

African American females

Asian females

.I0 .Ol B>W

.07 ns

025 .Ol B>W .I3 ,001 B>W

.I9 -05 A>W -09 ns

.07 ns -.06 ns

-.07 ns .I0 ns

.I4 ns .06 ns

-.I8 ns -.Ol ns

Hispanic females I .05 ns I I

African American males African American females I -03 I ns I

Asian males Asian female I -.05 ns I I

HisPanic males Hispanic females I -.I5 I ns I

Males Females .I _

-.03 .05 M>F --~~ ~_ . .

Note. There were insufficient numbers of GMAT2 quantitative scores for racial/ethnic analyses, but a repeated measures ANOVA by gender

found no significant change in impact (F [1, 771] = 2.37, p = .128, dp - dc = .05). ns = not significant, A = Asian, B = African American, H =

Hispanic, W = Caucasian non-Hispanic, M = male, F = female.

^a Means and Ns by subgroup can be found in Table 2. ^b Positive numbers indicate focal group is relatively higher on computer-based test, while

negative numbers indicate the reference group is relatively higher on computer-based test.

White males White females -.03 .05 M>F -.11 .01 M>F -.12 ns

Table 4

Descriptive Statistics for Verbal Tests by Race/Ethnicity, Gender, and Race/Ethnicity Within Gender

White

African American

Asian

African American Males

Asian Males

CBT

N M SD

1,051 .08 .94

100 -.85 1.22

131 -.09 .97

78 -.27 1.03

476 .I8 .93

36 -.68 1.16

69 -.03 1.04

African American Females

kian Females

\Jote. GMAT2 verbal score wa I not included in the table due to insufficient num

verbal score I SAT verbal score I Praxis

-.45 1.11 153 -.47

.07 .93 25,308 .I1

-1.06 1.24 1,996 -.71

-.I4 1.00 287 .02

-.36 1.04 439 -.23

.08 .96 10,632 -.I0

-.02 1 .OO 28,395 .05

Jers for racial/ethnic

‘raxi

3T

SD

97 L

1 .oe

89 L

99 L

1 .OE

.94

.92

1.02

1 09 A

1.02

88 L

98 L

1 04 L

98 k ana

; writin

I Pa

3

!!d

.08

-.86

-.02

-.40

-.07

-1.06

-.20

-.63

.I4

-.80

.03

-.31

-.I4

.06

1.02

.98

lyses. Gender

follows: males (M_CBT = 29, SD = 8; M_paper = 28, SD = 8; and n = 438) and females (M_CBT = 28, SD = 8; M_paper = 27,

Table 5

Differences in Impact for Population Groups^a by Verbal Test

Reference group          Focal group                GRE verbal         SAT verbal         Praxis reading     GMAT1 verbal       Praxis writing
                                                    dp-dc^b  p         dp-dc^b  p         dp-dc^b  p         dp-dc^b  p         dp-dc^b  p
White                    African American               .07  .01  B>W      .14  .01  B>W      .18  .01  B>W      .00  ns            .14  .01  B>W
                         Asian                         -.04  ns           -.04  ns            .04  ns           -.08  ns            .04  ns
                         Hispanic                       .04  ns            .12  .05  H>W      .12  .05  H>W     -.13  ns            .12  .01  H>W
White males              African American males         .08  ns            .15  ns            .20  .01  B>W     -.06  ns            .20  .01  B>W
                         Asian males                   -.03  ns           -.02  ns           -.02  ns           -.08  ns            .13  ns
                         Hispanic males                 .10  ns           -.14  ns            .23  ns            .07  ns            .13  ns
White females            African American females       .08  .01  B>W      .16  .05  B>W      .17  .01  B>W      .08  ns            .12  .01  B>W
                         Asian females                 -.06  ns           -.07  ns            .07  ns           -.08  ns            .02  ns
                         Hispanic females               .01  ns            .32  .001 H>W      .09  ns            .03  ns            .13  ns
African American males   African American females      -.03  ns           -.07  ns           -.09  ns           -.02  ns           -.11  ns
Asian males              Asian females                 -.06  ns           -.03  ns            .05  ns           -.10  ns           -.15  ns
Hispanic males           Hispanic females              -.13  ns            .42  .001 F>M     -.19  ns           -.11  ns           -.07  ns
White males              White females                 -.04  .01  M>F     -.04  ns           -.05  .01  M>F     -.13  .001 M>F     -.05  .01  M>F
Males                    Females                       -.04  .01  M>F     -.01  ns           -.05  .01  M>F     -.09  .001 M>F     -.05  .01  M>F

Note. There were insufficient numbers of GMAT2 verbal scores for racial/ethnic analyses, but a repeated measures ANOVA by gender found a

significant change in impact (F [1, 771] = 6.99, p < .01, dp - dc = -.11, M > F). ns = not significant, A = Asian, B = African American, H =

Hispanic, W = Caucasian non-Hispanic, M = male, F = female.

^a Means and Ns by subgroup can be found in Table 4. ^b Positive numbers indicate focal group is relatively higher on computer-based test.

Table 6

Descriptive Statistics for Quantitative Tests by Language Group by Quantitative Test

English language

Other language

English males

Other males

English females

Other females

Table 7

587 132 582 132 3,646 34 10

592 126 585 124 25.568 34 9

quantitative score PaTer

M SD - N

33 9 2,355

34 9 1,110

35 9 I I 1,379

35 1 9 1 670

30 1 9 1 976

33 1 10 1 440

GMAT; CBT

I M SD -

30 9

34 9

32 9

36 9

28 9

32 9

auantitative score

30 I 9 I 540

35 10 89

#

28 9 238

30 9 56

Differences in Impact for Language Groups by Quantitative Test

GRE2 quantitative GMAT1 quantitative GMAT2 quantitative

Reference group Focal group dp-dc^b p dp-dc^b p dp-dc^b p

English language Other language -.01 ns .03 ns

English males Other males .01 ns .05 .05 O>E so ns

English females Other females -.02 ns .01 ns .18 .05 O>E

Note. ns = not significant, E = English best language, O = other language best.

^a Means and Ns by subgroup can be found in Table 6. ^b Positive numbers indicate focal group is relatively higher on computer-based test.

Table 8

Descriptive Statistics for Quantitative Tests by Language Groups by Verbal Test

English language 477 103 481 109 73,587 32 7 32 8 2,355 29 7 29 8 540

Other language 437 112 442 119 3,646 27 9 28 9 1,110 24 8 23 8 145 7

English males 496 106 500 113 25,568 32 7 32 7 1,379 30 7 29 7 302

Other males 416 112 419 120 1,434 28 9 28 9 670 24 8 23 8 89

I English Other females females 466 451 100 111 471 458 106 117 45,005 2.212 26 30 9 7 27 31 9 8 440 976 28 24 7 8 24 28 8 8 238 56

Table 9

Differences in Impact for Language Groups by Verbal Test

GRE2 verbal GMAT1 verbal GMAT2 verbal

Reference group Focal group dp-dc^b p dp-dc^b p dp-dc^b p

English language Other language -.03 ns -.05 .05 O<E .01 ns

English males Other males -.04 ns -.05 0.05 O<E 702 ns

English females Other females -.03 ns -.05 ns .06 ns

Note. ns = not significant, E = English best language, O = other language best.

^a Means and Ns by subgroup can be found in Table 8. ^b Positive numbers indicate focal group is relatively higher on computer-based test.

Table 10

Descriptive Statistics for TOEFL Verbal Subtests by Language, Gender, and Language Within Gender

                          TOEFL Writing z-score   TOEFL Reading z-score TOEFL Listening z-score
                                CBT       Paper         CBT       Paper         CBT       Paper
Language group        N  Mean    SD  Mean    SD  Mean    SD  Mean    SD  Mean    SD  Mean    SD
Chinese            1378   .03   .93   .11   .91   .03  1.02   .05  1.01  -.28   .97  -.23   .94
Korean              707   .03   .91   .09   .87   .05   .94   .05   .95  -.24   .95  -.22   .90
Japanese            370  -.28   .97  -.27   .92  -.35  1.00  -.45   .98  -.23   .96  -.15   .92
Thai                338  -.47   .95  -.44   .93  -.46   .99  -.45  1.00  -.33   .89  -.28   .91
Russian             289   .27   .93   .14   .97   .29   .87   .25   .93   .43   .84   .31   .94
Spanish             425  -.13  1.15  -.17  1.03   .25   .88   .24   .89   .18   .98   .06  1.06
Arabic              284  -.07  1.01  -.20  1.06  -.15   .98  -.28  1.06   .14   .90   .08   .96
Chinese males       525   .14   .93   .18   .89   .17  1.01   .17  1.01  -.26   .98  -.24   .93
Korean males        369   .03   .93   .09   .87   .09   .93   .07   .98  -.37   .99  -.34   .93
Japanese males      105  -.01  1.03  -.05   .98  -.22  1.13  -.25  1.07  -.36  1.05  -.27  1.03
Thai males          159  -.47   .97  -.45   .92  -.38   .99  -.45  1.08  -.29   .88  -.33   .86
Russian males       109   .39   .87   .23   .95   .49   .80   .47   .80   .56   .74   .35   .88
Spanish males       197  -.14  1.17  -.18  1.06   .32   .90   .27   .90   .09  1.07  -.08  1.13
Arabic males        192  -.15  1.06  -.28  1.14  -.22  1.04  -.39  1.14   .06   .97   .02  1.02
Chinese females     838  -.04   .93   .07   .92  -.06  1.02  -.03  1.00  -.30   .97  -.23   .95
Korean females      328   .04   .88   .08   .86   .01   .93   .02   .91  -.09   .89  -.09   .85
Japanese females    265  -.39   .92  -.37   .88  -.41   .94  -.54   .94  -.18   .92  -.10   .88
Thai females        179  -.47   .94  -.42   .94  -.53   .98  -.45   .94  -.37   .90  -.24   .95
Russian females     177   .20   .97   .08   .98   .17   .89   .12   .98   .35   .88   .28   .97
Spanish females     228  -.12  1.13  -.16  1.00   .19   .86   .22   .89   .27   .89   .19   .99
Arabic females       88   .08   .89  -.05   .87   .01   .84  -.03   .83   .33   .68   .22   .83
Males^a            1656  -.01  1.00   .00   .98   .03  1.00   .07  1.04  -.18  1.01  -.16   .98
Females^a          2103  -.06   .96  -.09   .94  -.08   .98  -.08   .98  -.08   .95  -.12   .94

Note. ^a Includes all males and females who took the TOEFL; therefore n is greater than the sum of males

and females in the seven language groups.


Table 11

Differences in Impact for Language Groups^a by TOEFL Subtests

Reference group

Spanish
Spanish males
Thai males
Russian males       .11   ns           -.03   ns            .08   ns
Arabic males        .09   ns            .08   ns           -.12   .05    S>A
Chinese females    -.16   .001   S>C    .01   ns           -.18   .001   S>C
Korean females     -.10   ns            .02   ns           -.10   ns
Spanish females
Chinese males
Korean males
Japanese males
Thai males
Russian males
Spanish males
Arabic males
Males

Note. ns = not significant, A = Arabic, C = Chinese, J = Japanese, K = Korean, R = Russian, S = Spanish, and T = Thai, M = male, F = female.

^a Means and Ns by subgroup can be found in Table 10. ^b Positive numbers indicate focal group is relatively higher on computer-based test.


Appendix A

Scatter Plots by Gender for SAT, GMAT, and TOEFL

Figure A1. SAT Scores by Gender

[Scatter plots: SAT Math and SAT Verbal. X-axis: Computer Score (200 to 800). Legend: Female, Male.]

Figure A2. GMAT1 Scores by Gender

[Scatter plots: GMAT1 Quantitative and GMAT1 Verbal. X-axis: Computer Score (0 to 60). Legend: Female, Male.]

Figure A3. GMAT2 Scores by Gender

[Scatter plots: GMAT2 Quantitative and GMAT2 Verbal. X-axis: Computer Score (0 to 60). Legend: Female, Male.]

Figure A4. TOEFL Scores by Gender

[Scatter plots: TOEFL Writing, TOEFL Listening, and TOEFL Reading. X-axis: Computer Z-Score. Legend: Female, Male.]

Appendix B

Graphs of Performance by Race/Ethnicity, Gender, and Language Group

Sample Graph

[Sample graph. X-axis: Computer Score (0 to 100). One region of the plot is labeled "Higher score on Paper Test".]

Figure B1. Quantitative Tests by Race/Ethnicity

[Panels: SAT Math, GRE1 Quantitative, GMAT1 Quantitative, and Praxis Mathematics, each by race/ethnicity. X-axis: Computer Score (SAT, GRE1, GMAT1) or Computer Z-Score (Praxis). SAT panel note: 5 point practice effect on x-axis.]

Figure B2. Verbal Tests by Race/Ethnicity

[Panels: SAT Verbal, GRE1 Verbal, GMAT1 Verbal, Praxis Reading, and Praxis Writing, each by race/ethnicity. X-axis: Computer Score (GRE1, GMAT1) or Computer Z-Score (Praxis). Labeled groups include White, Asian, Hispanic, and African American.]

Figure B3. Quantitative Tests by Race/Ethnicity for Males

[Panels: SAT Math, GRE1 Quantitative, GMAT1 Quantitative, and Praxis Mathematics, each for males by race/ethnicity. X-axis: Computer Score (SAT, GRE1, GMAT1) or Computer Z-Score (Praxis). SAT panel note: 5 point practice effect on x-axis.]

Figure B4. Quantitative Tests by Race/Ethnicity for Females

[Panels: SAT Math, GRE1 Quantitative, GMAT1 Quantitative, and Praxis Mathematics, each for females by race/ethnicity. X-axis: Computer Score (SAT, GRE1, GMAT1) or Computer Z-Score (Praxis). SAT panel note: 5 point practice effect on x-axis.]

Figure B5. Verbal Tests by Race/Ethnicity for Males

[Panels: SAT Verbal, GRE1 Verbal, GMAT1 Verbal, Praxis Reading, and Praxis Writing, each for males by race/ethnicity. X-axis: Computer Score (SAT, GRE1, GMAT1) or Computer Z-Score (Praxis). SAT panel note: 5 point practice effect on x-axis.]

Figure B6. Verbal Tests by Race/Ethnicity for Females

[Panels: SAT Verbal, GRE1 Verbal, GMAT1 Verbal, Praxis Reading, and Praxis Writing, each for females by race/ethnicity. X-axis: Computer Score (SAT, GRE1, GMAT1) or Computer Z-Score (Praxis). SAT panel note: 5 point practice effect on x-axis.]

Figure B7. Quantitative Tests by Gender

[Panels: SAT Math, GRE1 Quantitative, GMAT1 Quantitative, GMAT2 Quantitative, and Praxis Math, each by gender. X-axis: Computer Score (SAT, GRE1, GMAT1) or Computer Z-Score (Praxis). SAT panel note: 5 point practice effect on x-axis.]

Figure B8. Verbal Tests by Gender

[Panels: SAT Verbal, GRE1 Verbal, GMAT1 Verbal, GMAT2 Verbal, Praxis Reading, Praxis Writing, TOEFL Writing, TOEFL Listening, and TOEFL Reading, each by gender. X-axis: Computer Score (SAT, GRE1, GMAT1) or Computer Z-Score (Praxis, TOEFL). SAT panel note: 5 point practice effect on x-axis.]

Figure B9. Verbal Tests by Language Group

[Panels: TOEFL Writing, TOEFL Reading, TOEFL Listening, GRE2 Verbal, GMAT1 Verbal, and GMAT2 Verbal, each by language. X-axis: Computer Z-Score (TOEFL) or Computer Score (GRE2, GMAT1, GMAT2). GMAT2 panel note: 1 point practice effect on x-axis.]

Figure B10. Verbal Tests by Language Group for Females

[Panels: TOEFL Writing, TOEFL Reading, TOEFL Listening, GRE2 Verbal, GMAT1 Verbal, and GMAT2 Verbal, each for females by language. X-axis: Computer Z-Score (TOEFL) or Computer Score (GRE2, GMAT1, GMAT2). GMAT2 panel note: 1 point practice effect on x-axis.]

Figure B11. Verbal Tests by Language Group for Males

[Panels: TOEFL Writing, TOEFL Reading, TOEFL Listening, GRE2 Verbal, GMAT1 Verbal, and GMAT2 Verbal, each for males by language. X-axis: Computer Z-Score (TOEFL) or Computer Score (GRE2, GMAT1, GMAT2). GMAT2 panel note: 1 point practice effect on x-axis.]

Figure B12. Quantitative Tests by Language Group

[Panels: GRE2 Quantitative, GMAT1 Quantitative, and GMAT2 Quantitative, each by language. X-axis: Computer Score.]

Figure B13. Quantitative Tests by Gender and Language Group

[Panels: GRE2 Quantitative, GMAT1 Quantitative, and GMAT2 Quantitative, shown separately for females and for males by language. X-axis: Computer Score. GMAT2 males panel note: 1 point practice effect on x-axis.]
