The Effect of Computer-Based Tests on
Racial/Ethnic, Gender, and Language Groups
Ann Gallagher Brent Bridgeman
Cara Cahalan
GRE No. 9621P
June 2000
This report presents the findings of a research project funded by and carried
out under the auspices of the Graduate Record Examinations Board
Educational Testing Service, Princeton, NJ 08541
********************
Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate
Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.
********************
The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs,
services, and employment policies are guided by that principle.
EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.
Copyright © 2000 by Educational Testing Service. All rights reserved.
Abstract
This study examined data from several national testing programs to determine whether the change
from paper-based administration to computer-based tests (CBTs) influences group differences in
performance. Performance by gender, racial/ethnic, and language groups on the Graduate Record
Examination (GRE®) General Test, the Graduate Management Admissions Test (GMAT®), the SAT® I:
Reasoning (SAT) test, the Praxis Professional Assessments for Beginning Teachers (Praxis), and the
Test of English as a Foreign Language (TOEFL) was analyzed to ensure that the change to CBTs does not
pose a disadvantage to any of these subgroups, beyond that already identified for paper-based tests.
Although all differences were quite small, some consistent patterns were found for some racial/ethnic and
gender groups. African American examinees and, to a lesser degree, Hispanic examinees appear to benefit
from the CBT format. However, for some tests, the CBT version negatively impacted female examinees.
Analyses by gender within race/ethnicity revealed a similar pattern, though only for White females.
Analyses for groups based on language showed no consistent patterns, but results indicate that the
computer-based TOEFL has increased impact for some language groups -- especially Chinese and Korean
groups.
Key words: CBT, Gender, Race, Ethnicity, Language, Computer-based testing, Assessment
Table of Contents
Introduction
Method
Sources of Information
Procedures
Results
Conclusion
Future Research
References
Appendix A
Appendix B
List of Tables
Table 1. Descriptive Statistics for Test Sample
Table 2. Descriptive Statistics for Quantitative Tests by Race/Ethnicity, Gender, and Race/Ethnicity Within Gender
Table 3. Differences in Impact for Population Groups by Quantitative Test
Table 4. Descriptive Statistics for Verbal Tests by Race/Ethnicity, Gender, and Race/Ethnicity Within Gender
Table 5. Differences in Impact for Population Groups by Verbal Test
Table 6. Descriptive Statistics for Quantitative Tests by Language Group by Quantitative Test
Table 7. Differences in Impact for Language Groups by Quantitative Test
Table 8. Descriptive Statistics for Quantitative Tests by Language Groups by Verbal Test
Table 9. Differences in Impact for Language Groups by Verbal Test
Table 10. Descriptive Statistics for TOEFL Verbal Subtests by Language, Gender, and Language Within Gender
Table 11. Differences in Impact for Language Groups by TOEFL Subtests
List of Figures
Figure A1. SAT Scores by Gender
Figure A2. GMAT1 Scores by Gender
Figure A3. GMAT2 Scores by Gender
Figure A4. TOEFL Scores by Gender
Figure B1. Quantitative Tests by Race/Ethnicity
Figure B2. Verbal Tests by Race/Ethnicity
Figure B3. Quantitative Tests by Race/Ethnicity for Males
Figure B4. Quantitative Tests by Race/Ethnicity for Females
Figure B5. Verbal Tests by Race/Ethnicity for Males
Figure B6. Verbal Tests by Race/Ethnicity for Females
Figure B7. Quantitative Tests by Gender
Figure B8. Verbal Tests by Gender
Figure B9. Verbal Tests by Language Group
Figure B10. Verbal Tests by Language Group for Females
Figure B11. Verbal Tests by Language Group for Males
Figure B12. Quantitative Tests Language Group
Figure B13. Quantitative Tests by Gender and Language Group
Introduction
Many large national testing programs are converting from paper-based tests to computer-based
tests (CBTs). Reactions to CBTs have largely been positive, due to recognition of the value of such
features as flexible scheduling and rapid score reporting. Nonetheless, some concerns exist. A particularly
important question is that of the fairness of CBTs relative to paper-based tests. Can all subgroups of
examinees expect to perform as well on a computerized test as they would on a traditional, paper-based
examination? Or will there be differences, perhaps arising from differences in the amount of experience
or comfort in the use of computers that may be associated with subgroup membership?
Over the last decade, a large body of research has been conducted with the purpose of
understanding the nature and sources of group differences in test performance (see, for example,
Willingham & Cole, 1997, or Hoover, Politzer, & Taylor, 1990). Most testing programs shifting from
paper-based tests to CBTs intend to maintain comparability of scores from one testing format to the other,
but because other kinds of format changes have produced inconsistent results in the past (Willingham &
Cole, 1997), it is important to examine whether the shift from paper to computers produces any consistent
patterns.
Two sets of researchers have performed large scale reviews of studies examining differences in
performance on CBTs and paper-based versions of tests (Mazzeo & Harvey, 1988; Mead & Drasgow,
1993). Both reviews focused primarily on average differences across formats without subdividing
samples on the basis of race/ethnicity or gender. Both reviews found conflicting results based on the type
of test (that is, power versus speeded tests, or personality versus placement tests). For example, Mead and
Drasgow focused on correlations across formats and found no medium effect (computer vs. paper) for
carefully constructed power tests, but they found a substantial effect for speeded tests. The only
high-stakes admissions test included in this work was the early, nonadaptive version of the computerized
Graduate Record Examination (GRE®) General Test. Mazzeo and Harvey’s work notes the diversity of
display and response formats between CBTs and paper-based tests and the interactions of these
differences with performance. However, that review was conducted in the late 1980s, when CBT interfaces
were quite immature and familiarity with computers was limited. Consequently, its findings are less
generalizable today than when the review was first conducted.
Several comparability studies examining differences in scores on computer-based and paper-
based versions of tests have been conducted for individual testing programs (Schaeffer, Bridgeman,
Golub-Smith, Lewis, Potenza, & Steffan, 1998; Schaeffer, Steffan, Golub-Smith, Mills, & Durso, 1995;
P.A. Carey, personal communication, May 1998). As part of this work, the studies include a brief analysis
of subgroup performance. Several other studies have been conducted at Educational Testing Service® to
examine performance of subgroups within testing populations for specific tests (Bridgeman & Cooper,
1998; Bridgeman & Schaeffer, 1995; Eignor, Way, & Amoss, 1994; O’Neill & Powers, 1993; Ward &
Bridgeman, 1996). These studies have been done independently rather than as a coordinated effort,
however, and comparisons across studies are difficult due to methodological differences.
This study consolidates data from prior studies in order to compare findings across testing
programs, with the intent of identifying patterns of performance across assessments. To do this, we
analyzed data from studies examining five testing programs to determine whether the difference between
paper-based performance and CBT performance is of equal magnitude across subgroups -- in particular,
subgroups defined by race/ethnicity, gender, and native language.
Method
Sources of Information
Operational testing data and data from experimental research administrations of tests were drawn
from the following ETS testing programs:
• GRE General Test
• SAT® I: Reasoning (SAT)
• Graduate Management Admissions Test (GMAT®)
• Test of English as a Foreign Language (TOEFL®)
• Praxis® Professional Assessments for Beginning Teachers
Only existing data gathered for operational programs, or data gathered for previous studies, were
examined here. All data for GRE, GMAT, and TOEFL testing programs were drawn from comparability
studies conducted by Schaeffer et al. (1998), Bridgeman, Anderson, and Wightman (1998), and Carey
(personal communication, May 1998), respectively. The SAT data was from a previous research study
(Lawrence, Potenza, & Feigenbaum, 1998). Praxis data was assembled from prior computer-based and
paper-based administrations of the test. No new experimental data was gathered for this study. Table 1
displays summary statistics for each testing sample. Note that some administrations were operational and
others were experimental, and all computer-based tests under study were adaptive -- that is, questions for
an individual examinee were selected from a pool of items based, in part, on that individual’s performance
on prior questions.
For three of the testing programs -- SAT, GMAT, and TOEFL -- data are for examinees who took
both CBT and paper-based formats. Data for the remaining two tests -- GRE and Praxis -- are matched
samples in which examinees took either CBT or paper-based tests and were then matched on the
following background variables: best language, race/ethnicity, gender, undergraduate major or program,
undergraduate grade point average (GPA), mother’s education level, and the examinee’s current education
level. In addition, Praxis examinees were also matched on years since attending college, and GRE
General Test examinees were matched on prior computer experience.
SAT. The SAT data consist of juniors who took the paper-based SAT in May 1996, and then took
an experimental computer-based SAT in June 1996. To increase motivation, examinees who took the
experimental administration were informed that, if they desired, scores from the computer-based test
could be added to their score record, which is reported to colleges. Participants were from schools
selected based on their proximity to CBT test centers, the number of juniors who took the SAT in
preceding years, the size of the school, the proportion of female test takers at the school, and score
consistency in previous years. Score consistency refers to the stability of SAT test scores from year to
year within a school. A total of 4,700 juniors were invited to take the computer-based SAT. Of the 1,732
students who took the computer-based test, analyses were conducted only on the 1,401 students who
reported making a “good” or “strong” effort on the computer-based version. This sample had slightly
higher paper-based scores than college bound seniors: Means for participants were 530 on the verbal
portion of the test, and 542 on the math section; means for college bound seniors were 505 and 508,
respectively (Lawrence, Potenza, & Feigenbaum, 1998).
GMAT. The GMAT data consisted of two separate samples. Both samples took an operational
paper-based test and an experimental computer-based test. The first sample (GMAT1) included 3,465
participants who took the paper-based test in October 1996. About half of these participants took the
computer-based test prior to the paper-based test, and half took the computer-based test after the
paper-based test. The higher of the two scores was reported, so test takers were motivated to do well. The
second sample (GMAT2) included 773 test takers who took the paper-based test prior to the computer-
based test. GMAT2 was administered in April 1997 with extended section time limits, because the
GMAT1 field trial revealed that the original computer time limits were too short for examinees to
complete the test. Current operational time limits are the same as those in GMAT2. Due to the relatively
small sample size, GMAT2 was used only for analyses by gender, language, and language by gender.
TOEFL. Forty-eight thousand people who had taken the paper-based TOEFL between November
1997 and February 1998 were invited to take the computer-based TOEFL between January and March
1998. TOEFL was administered in paper-based and computer-based format to 6,556 test takers. This
study examines 3,791 of these test takers who reported that their best language was Chinese, Korean,
Japanese, Thai, Russian, Spanish, or Arabic. Paper-based test fees were reimbursed and computer-based
test scores were sent out in June 1998, which motivated test takers to do well.
GRE General Test. GRE data were evaluated using two separate samples from the 1996-1997
academic year. The first sample (GRE1) consisted of the 190,044 U.S. citizens who took the GRE
General Test in the United States and who spoke English better than any other language. Of these, 78,257
examinees took the computer-based GRE and 111,787 took the paper-based version. The computer-based
and the paper-based samples were matched exactly on several characteristics, including race/ethnicity,
computer usage, gender, undergraduate GPA, undergraduate major, mother’s education level, and
examinee’s current education level. Matching was one to one, in that every individual record in the
computer-based sample was exactly matched to one other record in the paper-based sample on the above
characteristics. This matching was done to reduce some of the irrelevant differences between the samples.
Twenty-eight percent of the computer-based sample was eliminated because there was no exact match for
them in the paper-based sample. Because test takers were free to choose either the computer or paper
versions of the test, the groups cannot be considered to be randomly equivalent.
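
The exact one-to-one matching described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the GRE samples: the function name, the record layout (dictionaries), and the rule for choosing among several eligible paper-based records (take any unused one) are all assumptions, since the report does not specify them.

```python
from collections import defaultdict


def exact_one_to_one_match(cbt_records, paper_records, keys):
    """Pair each CBT record with one unused paper record that has
    identical values on every background variable in `keys`.

    Hypothetical helper: the tie-breaking rule (use any remaining
    candidate) is an assumption, not taken from the report.
    """
    # Index paper-based records by their tuple of background values.
    pool = defaultdict(list)
    for rec in paper_records:
        pool[tuple(rec[k] for k in keys)].append(rec)

    matched, unmatched = [], []
    for rec in cbt_records:
        candidates = pool[tuple(rec[k] for k in keys)]
        if candidates:
            # Each paper-based record is consumed at most once.
            matched.append((rec, candidates.pop()))
        else:
            # No exact match: the CBT record is dropped from the analysis.
            unmatched.append(rec)
    return matched, unmatched
```

Under this sketch, the share of the computer-based sample that is eliminated is simply `len(unmatched) / len(cbt_records)`.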
The second GRE General Test sample (GRE2), which was used for comparisons of examinees
whose best language was English with examinees whose best language was not English, was also
matched. Of the 313,124 test takers who reported their English proficiency level, 106,367 examinees who
took the computer-based GRE were matched with 206,757 examinees who took the paper-based version.
Using the matching procedure described above, computer-based test takers were matched with paper-
based test takers on eight background variables: computer usage, gender, undergraduate GPA,
undergraduate major, mother’s education level, examinee’s current education level, English fluency, and
race/ethnicity. Examinees who took the test outside of the United States were excluded from analyses
because data from international administrations could not be matched by race/ethnicity or citizenship.
Also, as in sample one, 27% of the resulting computer-based sample was eliminated because there was no
exact match in the paper-based sample.
Praxis. Praxis test data from 1993 through 1997 were evaluated by matching computer and paper
samples for each subtest (reading, mathematics, and writing). Using the same procedures described for the
GRE General Test samples, examinees who took the computer-based test were matched with examinees
who took the paper-based test on eight characteristics: gender, race/ethnicity, undergraduate GPA,
examinee’s highest education level, mother’s highest education level, years since attending college,
participation in a teacher education program, and speaking another language better than English. For each
subtest, over 86% of computer-based test takers were matched with paper-based test takers: 39,027 who
took the reading subtest, 40,257 who took the writing subtest, and 40,325 who took the mathematics
subtest.
Procedures
The purpose of this study was to examine the effects of computer-based testing on group
differences in performance. This paper examines differences between differences. Thus, the difference
between reference and focal group performance (that is, impact) was calculated for each testing format
(paper or computer). The size of the difference for each testing format was then compared within each
group of examinees (for example, impact for the paper-based SAT was compared to impact for the
computer-based SAT). Several different group comparisons, based on race/ethnicity, gender, and
language, were then conducted for each testing sample.
Analysis. For the SAT, GRE General Test, and GMAT samples, CBT and paper-based scores had
been previously equated according to standard operating procedures. Because TOEFL and Praxis scores
for the paper and computer tests were on different scales, z-scores were generated for data on these tests.
These procedures assure that there will be no overall format effects, but do not preclude finding format
effects in any subgroup. Since z-scores were generated for all examinees (including language groups not
included in TOEFL analyses and test takers that did not report their gender, race/ethnicity, or language),
mean z-scores may not equal zero for the sample used in the current analyses.
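
The point about subgroup z-score means can be illustrated with a short sketch. The function name and the toy scores are hypothetical, and the use of the population standard deviation is an assumption (the report does not say which form was used).

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Standardize scores against the mean and SD of the full list.

    Assumption: population SD (pstdev); the report does not specify.
    """
    m = mean(scores)
    s = pstdev(scores)
    return [(x - m) / s for x in scores]


# Toy example: z-scores are computed over ALL examinees ...
all_scores = [400, 450, 500, 550, 600]
z_all = z_scores(all_scores)

# ... so the mean z-score is zero for the full sample, but generally
# not for the subset of examinees retained in a given analysis.
analysis_subset = z_all[:3]
print(round(mean(z_all), 6), round(mean(analysis_subset), 3))
```

Because standardization is done within each format over everyone tested, any overall format effect is removed by construction, while a subgroup's mean z-score can still differ between formats.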
The data for all tests were summarized in terms of the standardized mean differences between
paper-based tests and computer-based tests for each population. Standardized mean differences are one
way of measuring effect size. These statistics were calculated for each test, comparing computer and
paper-based performance in the following manner. First, the standardized difference, d, between a
reference group (e.g., White examinees) and a focal group (e.g., African American examinees) was
computed as follows:
$$d = \frac{\bar{X}_r - \bar{X}_f}{\sqrt{(SD_r^2 + SD_f^2)/2}}$$
where $\bar{X}_r$ is the mean for the reference group, $\bar{X}_f$ is the mean for the focal group, and $SD_r$ and $SD_f$ are
the standard deviations for the reference and focal groups, respectively. This formula, based on
unweighted standard deviations, is not dependent on subgroup sample sizes, unlike the usual formula that
uses weighted standard deviations (Willingham & Cole, 1997, p. 21). Next, the average difference in
impact was defined as dp - dc, where dc is the impact between reference and
focal groups for the CBT, and dp is the impact between reference and focal groups for the paper-based test.
Thus, positive numbers indicate relatively higher scores for the focal group on the computer-based test.
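
As a worked illustration of these definitions, the sketch below computes d for each format and then the change in impact. It assumes the denominator of d is the root mean square of the two group standard deviations (the unweighted form the text describes) and that the change in impact is taken as dp - dc, as in the significance tests reported later; the function names and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_difference(reference, focal):
    """d = (mean_ref - mean_focal) / sqrt((SD_ref^2 + SD_focal^2) / 2).

    The SDs are combined without weighting by group size (assumed form).
    """
    sd_r, sd_f = stdev(reference), stdev(focal)
    return (mean(reference) - mean(focal)) / sqrt((sd_r ** 2 + sd_f ** 2) / 2)


def impact_change(ref_paper, focal_paper, ref_cbt, focal_cbt):
    """Return dp - dc; positive values mean the focal group scored
    relatively higher on the computer-based test."""
    d_p = standardized_difference(ref_paper, focal_paper)
    d_c = standardized_difference(ref_cbt, focal_cbt)
    return d_p - d_c


# Toy data: impact is 1.0 SD on paper and 0.5 SD on computer.
print(impact_change([10, 12, 14], [8, 10, 12], [10, 12, 14], [9, 11, 13]))
```

In the toy data the gap between groups shrinks from 1.0 to 0.5 standard deviation units, so the change in impact is 0.5 in favor of the focal group on the CBT.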
Average differences for paper-based tests and CBTs were analyzed for each subgroup by
race/ethnicity, gender, gender within race/ethnicity, and race/ethnicity within gender. Analyses also
examined differences by language group and by gender within language group. The reference group was
defined as White examinees for analyses by race/ethnicity and race/ethnicity within gender. Male
examinees were the reference group for analyses by gender, gender within race/ethnicity, and gender
within language group. Mean differences for each pair (reference group - focal group) were then
compared by testing format (paper-based versus CBT).
Analyses comparing students whose best language is English and students who speak another
language better than English were conducted for the GRE General Test and GMAT samples. Because few
examinees taking the SAT or Praxis exams reported that English is not their best language, analyses by
language group were not conducted for these two tests. The GRE General Test and GMAT did not collect
information on examinees’ best language if it was not English, so analyses for these two tests looked at
performance by groups based only on two English language proficiencies: speak English as well or better
than any other language, or do not speak English as well as another language. These two groups will be
referred to as English language (EL) and other language (OL). For these analyses, EL was considered the
reference group.
Analyses of TOEFL data were conducted on several language groups. Very few TOEFL
examinees reported English as their best language. Consequently, some other criterion was required for
defining a reference group. Because the focus of the current study is on how the shift to computer-based
testing might differentially affect various subgroups of the testing population, computer familiarity
seemed a reasonable variable to use in selecting a reference group. Recent work by Kirsch, Jamieson,
Taylor, and Eignor (1998) examining computer familiarity among various language groups within the
TOEFL population found that examinees reporting Spanish as their best language had the highest scores
on the computer-familiarity scale developed for this purpose. Therefore, for analyses of TOEFL data,
Spanish-speaking examinees were considered the reference group.
For SAT, GMAT, and TOEFL samples, in which the same examinees took the test in both
formats, we tested the statistical significance of dp-dc using the group-by-format interaction from the
repeated measures ANOVA (for example, male vs. female by CBT vs. paper-based test). For the GRE
General Test and Praxis we used a t-test for independent samples. The GRE and Praxis samples are
considered independent because these two groups of examinees self-selected either a paper-and-pencil or
CBT format, and matching based on background variables does not control for self-selection. The
numerator for the t-tests was dp-dc and the denominator was the standard error of the difference. The
standard errors for dp and dc (needed to compute the standard error of the difference) were derived from
the first-order Taylor series expansion of d as described in Willingham and Johnson (1997, section 3, p.
4). Although we focused primarily on mean differences in this study, we also investigated differences
over the full range of scores.
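
The independent-samples t-test described above can be sketched as follows. The standard error of d used here is a common large-sample approximation and is an assumption: it is not necessarily the expression derived in Willingham and Johnson (1997), which is not reproduced in this report. The function names and argument layout are likewise hypothetical.

```python
from math import sqrt


def se_of_d(d, n_ref, n_focal):
    """Approximate standard error of a standardized mean difference.

    Assumption: the familiar large-sample approximation, used here only
    as a stand-in for the Taylor-series expression cited in the text.
    """
    n_total = n_ref + n_focal
    return sqrt(n_total / (n_ref * n_focal) + d ** 2 / (2 * n_total))


def t_for_impact_change(d_p, n_ref_p, n_focal_p, d_c, n_ref_c, n_focal_c):
    """t = (dp - dc) / SE(dp - dc) for independent paper and CBT samples."""
    # For independent samples the variances of dp and dc add.
    se_diff = sqrt(se_of_d(d_p, n_ref_p, n_focal_p) ** 2
                   + se_of_d(d_c, n_ref_c, n_focal_c) ** 2)
    return (d_p - d_c) / se_diff
```

With very large samples the standard errors become small, which is one reason differences that are small in magnitude can still reach statistical significance.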
Results
Means and sample size for each test by group can be found in Tables 2, 4, 6, 8, and 10, while
Tables 3, 5, 7, 9, and 11 display the difference between dp and dc with associated significance values for
t-tests and ANOVAs. Note that there are large variations in sample sizes. As a result, some significant
differences are actually small in magnitude relative to other differences that were not significant. Results
of analyses conducted by race/ethnicity and race/ethnicity within gender are reported in Tables 3 and 5.
Results for analyses of language groups and language group within gender are reported in Tables 7, 9, and
11. All the tables refer to changes in impact from paper-based test to computer-based test.
Appendix A shows the regressions of paper scores on computer scores for the tests in which the
same examinees took both test formats. The regression line is a Lowess line (Chambers, Cleveland,
Kleiner, & Tukey, 1983), which permits observation of nonlinear relationships. However, the regressions
were all reasonably linear and virtually the same for all subgroups.
Appendix B contains plots of means on paper-based tests and CBTs for all analyses by
subgroups. In each plot, the diagonal represents the point at which the two scores are equal. This diagonal
was adjusted for the SAT and GMAT2 samples to approximate the practice effect, because all students in
these samples took the paper-based test first. For the SAT sample, adjustments were made based on actual
score gains for the total sample (5.8 points for verbal and 6.9 points for math). For the GMAT2 sample,
adjustments were made based on the average increase in score for GMAT1, which alternated the order of
paper-based and computer-based administrations; adjustments were .59 of a point on GMAT1 verbal
score and .76 of a point on GMAT1 quantitative score. This type of correction was not necessary for the
TOEFL data, because plots were generated from z-scores. Groups falling below the line performed
relatively better on the computer-based version of the test and groups falling above the line performed
relatively better on the paper version of the test. Bars around each mean represent two standard errors
from the mean.
Race/ethnicity. Tables 3 and 5 display statistics for data on race/ethnicity and gender groups for
all verbal and quantitative tests under study. Analyses by race/ethnicity of both verbal and quantitative
tests indicated that impact generally decreased for African American, and to a lesser extent, for Hispanic
examinees on CBT versions of the tests.
Of the five verbal sections examined (the GRE1, SAT, and GMAT1 verbal components, as well
as the Praxis reading and writing subtests), four sections showed reduced impact for African American
examinees in the CBT version of tests compared with the paper-based version (for GRE1 verbal results, t
[53,415] = 2.83, p < .01; for SAT verbal, F [1, 1,149] = 7.12, p < .01; for the Praxis reading test, t
[37,538] = 5.25, p < .01; and for the Praxis writing test, t [39,043] = 4.33, p < .01). Impact was also
reduced for Hispanic examinees on three CBT versions of the five verbal tests (for SAT verbal, F [1,
1,127] = 4.32, p < .05; for Praxis reading, t [35,534] = 1.97, p < .05; and for Praxis writing, t [37,084] =
2.07, p < .01).
On the four quantitative tests (the GRE1, SAT, and GMAT1 quantitative sections, and the Praxis
mathematics subtest), a similar pattern of reduced impact was evident for both African American and
Hispanic examinees on the CBT versions of the tests. There was significantly reduced impact for African
American examinees on the GRE1 quantitative test (t [53,415] = 3.90, p < .01), the Praxis mathematics
test (t [38,800] = 7.86, p < .01), and the GMAT1 quantitative test (F [1, 1,993] = 10.77, p < .001). For
Hispanic examinees, a reduction in impact on CBT over paper-based versions of quantitative tests was
found on the GRE1 quantitative test (t [51,931] = 2.97, p < .01) and the SAT math test (F [1, 1,127] =
4.26, p < .05).
Race/ethnicity within gender. African American and Hispanic examinees, especially females, also
appeared to benefit from the CBT format in comparisons of race/ethnicity within gender. When compared
to White female examinees on verbal tests, there was significantly reduced impact for African American
females on the CBT versions of four out of five of the tests: GRE1 verbal (t [35,098] = 2.62, p < .01),
SAT verbal (F [1, 637] = 6.53, p < .05), and Praxis reading and writing tests, respectively (t [27,300] =
4.43, p < .01, and t [27,851] = 3.27, p < .01). When compared with White males, there was also reduced
impact for African American males on the CBT version of the two Praxis tests (t [10,234] = 2.89, p < .01,
for reading, and t [11,188] = 3.18, p < .01, for writing). Finally, there was reduced impact for Hispanic
females on the CBT version of the SAT verbal test (F [1, 616] = 18.17, p < .001).
When compared with White males, there was significantly reduced impact for African American
males on the CBT versions of two of the four quantitative tests: GRE1 quantitative (t [18,315] = 2.12, p <
.05), and Praxis mathematics (t [10,138] = 3.16, p < .01). Impact was also reduced on the CBT versions of
two of the tests for Hispanic males: GRE1 quantitative (t [18,184] = 3.25, p < .05) and SAT math (F [1,
509] = 6.35, p < .05).
When compared with White females, there was reduced impact for African American females on
the CBT version of three of the four quantitative tests: GRE1 quantitative (t [35,098] = 3.35, p < .05),
Praxis mathematics (t [28,660] = 7.26, p < .01), and GMAT1 quantitative (F [1, 849] = 10.78, p < .001).
There was also reduced impact for Asian females on the CBT version of the Praxis mathematics test (t
[26,781] = 2.25, p < .05).
Gender. The pattern revealed for analyses by gender appears to be the reverse of the patterns for
race/ethnicity. On five of the six verbal tests examined by gender, there was increased impact for female
examinees on the CBT versions: GRE1 verbal (t [56,653] = 2.86, p < .01), GMAT1 verbal (F [1, 3,463] =
11.95, p < .001), GMAT2 verbal (F [1, 771] = 6.99, p < .01), and Praxis reading and writing (t [39,023] =
3.23, p < .01, and t [40,253] = 3.16, p < .01, respectively). There was also increased impact for females on
two of the three TOEFL subtests: reading (F [1, 3,756] = 4.32, p < .05) and listening (F [1, 3,756] = 7.72,
p < .01). A similar, though weaker, pattern was found on the quantitative tests, where two of the five tests
showed increased impact for females on the CBT versions: GRE1 quantitative (t [56,653] = 2.13, p < .05)
and Praxis mathematics (t [40,321] = 6.48, p < .01). Although these gender differences are statistically
significant, the actual differences are small. Refer to Appendix A for scatter plots of computer score and paper score by
gender for the repeated measures samples (SAT, TOEFL, GMAT1, and GMAT2).
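With degrees of freedom in the tens of thousands, even a modest t statistic clears conventional significance thresholds. The sketch below is a rough check on two of the gender results, not the procedure used in the study: it approximates a two-sided p value with the standard normal distribution, which is very close to the t distribution at these sample sizes. The function name and the two-sided convention are assumptions made for illustration.

```python
import math


def approx_two_sided_p(t_stat: float) -> float:
    """Normal approximation to a two-sided p value; reasonable only for very large df."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


# t statistics reported for the gender analyses (df = 56,653 and df = 40,321)
for label, t_stat in [("GRE1 quantitative", 2.13), ("Praxis mathematics", 6.48)]:
    p_value = approx_two_sided_p(t_stat)
    print(f"{label}: t = {t_stat:.2f}, approximate two-sided p = {p_value:.2g}")
```

Under this approximation the GRE1 quantitative result falls between .01 and .05 and the Praxis mathematics result falls far below .01, in line with the thresholds reported in the text.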
Gender within race/ethnicity. The only group to show a statistically significant gender difference
in impact between paper-based and CBT versions of any of the verbal or quantitative tests was White
examinees. Since White examinees comprise the largest racial/ethnic group in the testing population for
all of the tests discussed here except TOEFL, it is not surprising that the pattern for White males and
females mirrors the pattern for all males and females in the samples examined.
There was increased impact for White females on the CBT version of four of the five verbal tests:
GRE1 verbal (t [50,203] = 2.63, p < .01), GMAT1 verbal (F [1, 1,786] = 11.42, p < .001), and Praxis
reading and writing (t [34,942] = 2.91, p < .01, and t [36,470] = 2.87, p < .01, respectively). There was
also increased impact on two of the quantitative tests: GRE1 quantitative (t [50,203] = 1.98, p < .05) and
Praxis mathematics (t [36,007] = 4.24, p < .01).
Language (English vs. other). Tables 7 and 9 display statistics for analyses of English versus
other language groups for verbal and quantitative tests. Analyses by language group on both verbal and
quantitative tests showed no consistent patterns. Three samples were analyzed for two tests with verbal
and quantitative sections. The three samples were the domestic test takers of the GRE (GRE2) and two
separate GMAT administrations (GMAT1 and GMAT2).
On the verbal tests, there was increased impact in the CBT format for the other language group in
one of the three samples: GMAT1 (F [1, 3,463] = 4.97, p < .05). There was no significant change due to
format for the other two samples (GRE2 and GMAT2). For quantitative tests, there was a significant
decrease in impact on one of the three tests: GMAT2 (F [1, 683] = 5.44, p < .05). Impact in the other two
samples showed no significant differences by format for the quantitative tests. With these contradictory
findings, it is important to keep in mind that the GRE2 test takers represent the largest sample and showed
no significant differences in impact on the quantitative or verbal subtests.
Language (English vs. other) within gender. No definite patterns were found in analyses of
groups for whom English is the primary language versus some other language within gender. For male
examinees on the GRE2 and GMAT2, no significant changes in impact by format were found between the
English language and other language groups on the verbal or quantitative tests. On the CBT version of
the GMAT1 quantitative test, impact decreased for males in the other language group (F [1, 2,047] = 5.28, p
< .05). However, impact increased for these males on the computer-based GMAT1 verbal test (F [1,
2,047] = 4.30, p < .05). In short, males in the English language group did relatively better on the
computer-based GMAT1 verbal test, and relatively worse on the computer-based GMAT1 quantitative
test, than males in the other language group. It is worth noting that the increase in impact found in the
total group analysis on the CBT version of the GMAT1 verbal test was primarily due to these results for
male test takers.
A majority of the analyses for female test takers in the English language and other language
groups revealed no significant differences. On the verbal tests (GMAT1, GMAT2, and GRE2), no
significant differences in impact were found between these two groups of females. The same was true for
two of the three quantitative tests (GRE2 and GMAT1). However, impact decreased for females in the
other language group on the CBT version of the GMAT2 quantitative test (F [1, 292] = 3.90, p < .05). That is
to say, females in the other language group did relatively better on the CBT version of the GMAT2
quantitative test than did females in the English language group.
Language (Spanish vs. other). Table 11 displays data from analyses of language groups, and
gender within language group. As mentioned earlier, Spanish-speaking examinees were identified as the
reference group for these analyses. On the TOEFL computer-based writing test, impact increased for
Korean and Chinese examinees (F [1, 1,130] = 6.87, p < .01, and F [1, 1,801] = 15.55, p < .001,
respectively). The opposite was true for Russian test takers (F [1, 712] = 4.32, p < .05). That is, Russian
examinees improved significantly more than Spanish-speaking examinees with the move from the paper-
based writing test to the computer-based writing test.
On the CBT version of the TOEFL reading test, impact decreased for Japanese and Arabic
examinees (F [1, 793] = 4.23, p < .05, and F [1, 707] = 7.09, p < .01, respectively), while all other groups
showed no significant change due to format. Finally, on the computer-based TOEFL listening test, impact
increased for Chinese examinees (F [1, 1,801] = 27.44, p < .001), Korean examinees (F [1, 1,130] =
13.59, p < .001), Japanese examinees (F [1, 793] = 20.82, p < .001), and Thai examinees (F [1, 761] =
14.71, p < .001). Results of these analyses suggest that Chinese and Korean examinees may find the CBT
tests somewhat more difficult than paper-and-pencil tests. These two language groups showed increased
impact on the CBT versions of two out of three tests. No other patterns were evident for the remaining
language groups.
Language (Spanish vs. other) within gender. On the TOEFL reading and writing tests, no
significant differences in impact were found when comparing Spanish-speaking males to males in the
other six language groups. On the TOEFL listening test, impact for Chinese, Korean, Japanese, and
Arabic males increased significantly on the CBT version when compared with Spanish-speaking males
(Chinese: F [1, 720] = 13.88, p < .001; Korean: F [1, 564] = 13.42, p < .001; Japanese: F [1, 300] = 10.82,
p < .001; and Arabic: F [1, 387] = 4.05, p < .05).
Female examinees showed a slightly different pattern. On the TOEFL writing test, impact for
Chinese females increased significantly on the CBT version (F [1, 1,064] = 13.38, p < .001). Analyses of
the TOEFL reading test indicated that impact decreased on the CBT version for Japanese females when
they were compared to Spanish-speaking females (F [1, 491] = 8.96, p < .001). Finally, results from the
TOEFL listening test indicated that impact for Chinese, Japanese, and Thai females increased
significantly on the CBT version (Chinese: F [1, 1,064] = 11.97, p < .001; Japanese: F [1, 491] = 8.26, p <
.001; and Thai: F [1, 405] = 12.67, p < .001).
It appears that the decrease in impact found for the total group of Japanese examinees on the
TOEFL reading test is primarily due to results found for Japanese females on the computer-based version
of the test. From the paper-based test to the computer-based test, scores for Japanese males and females
increased by z-score units of .04 and .13, respectively, while z-scores for Spanish-speaking males
declined by .05, and Spanish-speaking females lost .03 z-score units.
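The arithmetic behind this comparison can be sketched as follows, using only the z-score changes quoted in the paragraph above. The helper name is hypothetical, and expressing the comparison as the focal group's paper-to-CBT change minus the reference group's change is an assumption about how the figures relate, not a restatement of the repeated measures ANOVA used in the study.

```python
def relative_gain(focal_change: float, reference_change: float) -> float:
    """Focal group's paper-to-CBT change minus the reference group's change (z-score units)."""
    return focal_change - reference_change


# Paper-to-CBT changes quoted in the text, in z-score units
japanese = {"males": 0.04, "females": 0.13}
spanish = {"males": -0.05, "females": -0.03}

for sex in ("males", "females"):
    gain = relative_gain(japanese[sex], spanish[sex])
    print(f"Japanese {sex} relative to Spanish-speaking {sex}: {gain:+.2f}")
```

On these figures the relative gain is larger for Japanese females (about .16) than for Japanese males (about .09), which fits the observation that the total-group result is driven mainly by females.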
Gender within language group. The only significant gender difference in impact between
paper-based and computer-based tests for a language group was found among Thai examinees on the TOEFL
reading and listening tests. On both tests, Thai males performed better on the computer-based version of
the test, while Thai females performed better on the paper-based version (F [1, 336] = 4.19, p < .05, and F
[1, 336] = 7.49, p < .01, for the reading and listening tests, respectively).
Conclusion
The study presented here examined differences in the magnitude of impact on gender,
racial/ethnic, and language groups resulting from the change from paper-based tests to computer-based
tests across several testing programs. The primary purpose of the analyses was to determine whether
consistent patterns of impact are evident based on gender, race/ethnicity, and language. If large
differences in any subgroup were found, scores on paper-based tests and CBTs could not be treated as
comparable.
The several samples used in this study were large, and the effects that were found were small.
Consequently, considerable thought must go into determining the practical implications of these findings.
Even though differences are statistically significant, they may have no real effect on the admissions
process for which the test scores are intended.
The computer-based and paper-based tests studied here differed in more ways than just the mode
of administration. First, all of the computer-based tests were also adaptive tests; that is, question difficulty
was tailored to student ability on the computer-based tests but not on the paper-based tests. Second, only
the paper-based tests permitted students to skip questions and return to them later if time allowed. Third,
because the paper-based SAT -- unlike any of the other paper-based tests studied -- contained a correction
for guessing, SAT examinees had to decide whether to guess or leave questions blank when they did not
know the answer; on the computer-based test, they had to answer each question in order to move on to the
next question. Fourth, on computer-based tests, an examinee might have to scroll back to find the
appropriate part of a passage, while on paper-based tests, it could be on the same page as related
questions. On reflection, it is not clear whether the general lack of gender and racial/ethnic differences
noted in the study resulted from minimal differences on all of these characteristics of the testing
experience, or whether negative impacts on one characteristic were counterbalanced by positive impacts
on another.
Although differences were generally quite small, some consistent patterns of impact were found
for racial/ethnic and gender groups. African American examinees and, to a lesser degree, Hispanic
examinees appear to benefit slightly from the CBT format. Where significant differences in impact were
found for these groups, all indicated reduced impact as a result of the change to the CBT format. On all
CBTs analyzed by race/ethnicity, African American and Hispanic examinees performed as well as or better
than they did on the paper-based tests.
These findings agree with early work examining the School and College Ability Test (Johnson &
Mihal, 1973). On that test, African American students scored significantly higher on the CBT version of
the test, whereas there was a lesser (and not significant) score increase for White students. Pine, Church,
Gialluca, and Weiss (1980) reported similar findings, but attributed differences to an increase in
motivation on the computer-based test on the part of African American examinees. As noted above, these
statistically significant differences in the present study may have no practical significance because of their
small size. For example, in scale-score points, the change in impact from paper-based to computer-based
test performance was only four points for African American examinees who took the GRE1 quantitative
test.
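The four-point figure can be reproduced from the group means. The sketch below assumes the GRE1 quantitative means as they read in Table 2 (White: 544 on CBT and 540 on paper; African American: 419 on CBT and 411 on paper) and takes impact, in scale-score terms, to be the reference-group mean minus the focal-group mean. The function and variable names are illustrative only.

```python
def impact(reference_mean: float, focal_mean: float) -> float:
    """Raw score gap between the reference-group and focal-group means."""
    return reference_mean - focal_mean


# GRE1 quantitative means as read from Table 2 (assumed transcription)
white = {"cbt": 544, "paper": 540}
african_american = {"cbt": 419, "paper": 411}

impact_paper = impact(white["paper"], african_american["paper"])
impact_cbt = impact(white["cbt"], african_american["cbt"])
change_in_impact = impact_paper - impact_cbt

print(f"Impact on paper: {impact_paper} points")
print(f"Impact on CBT: {impact_cbt} points")
print(f"Reduction in impact on CBT: {change_in_impact} points")
```

The gap narrows from 129 points on paper to 125 points on the CBT, a reduction of four scale-score points.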
The pattern for female examinees appears to be the reverse of the pattern for African American
and Hispanic test takers; impact generally increased for females on the computer-based test. However,
when gender differences were compared within racial/ethnic groups, statistically significant differences
were found only for the group of White examinees (the largest sample). Impact for White females
increased on the CBT versions of tests; however, all differences in impact were relatively small, ranging
from -.128 to .016 standard deviation units.
Analyses for groups based on English language proficiency versus other language proficiency
showed a fairly consistent pattern; verbal tests showed no difference in impact, while quantitative tests
showed impact. One of three quantitative tests showed increased impact on the CBT, while two of the
three quantitative tests showed reduced impact on the computer-based test. These differences may be due
to factors specific to the individual test or population, since they were found for the GMATl and GMAT2
samples, but not for the GRE2 population. It is important to keep in mind that even significant differences
were relatively small. The differences in impact for all three tests by English language versus other
language group, and by English language versus other language group within gender, ranged from -.051
to .177 standard deviation units.
As noted earlier, the Spanish language group was selected as the reference group for analyses of
language groups for the TOEFL test. This was based on earlier findings regarding computer familiarity.
Results of our analysis indicate that some language groups -- especially Chinese and Korean speakers --
perform slightly better on the paper version of the test than on the CBT version (relative to the Spanish-
speaking group), while other groups -- especially Russian speakers -- perform slightly better on the CBT
version.
Results of the computer familiarity study (Kirsch, Jamieson, Taylor, & Eignor, 1998) do not
explain these differences. There does not appear to be a direct relationship between the computer
familiarity of specific groups and their performance on computer-based tests, relative to paper-based tests.
These findings are in line with another TOEFL study examining computer familiarity and TOEFL test
performance (Taylor, Jamieson, Eignor, & Kirsch, 1998). The authors of that study concluded that,
although statistically significant differences were found in the performance of groups based on computer
familiarity on a computer-based version of TOEFL, differences were so small that they had no practical
significance. However, it should be noted that at least one of the language groups (Russian) that
outperformed Spanish-speaking students on CBT tests in the present study was not included in either of
the earlier TOEFL studies of computer familiarity.
As testing programs move from paper-based to CBT formats, it is important to understand the
consequences this format change may have on the examinee population. The research discussed here
indicates that for some groups within the examinee population, the move to a CBT format may actually be
beneficial, while for other groups it may not be. Because some differences in impact due to testing format
(paper-based or computer-based) occur consistently across several tests, we can assume that changes in
impact are due to changes in format and not to anomalies associated with changes to specific tests.
Identifying a pattern across a variety of tests is an initial step toward understanding the source of group
differences in performance on paper-based or CBT formats. There is no way of knowing from the present
study whether lower or higher performance on CBTs, relative to paper-based tests, actually describes any
group’s ability or achievement. Future studies will need to focus on understanding the source of these
differences.
Future Research
This study identified a few patterns of changes in impact for population subgroups on high stakes
tests that appear to be associated with computer-based versus paper-based test format. Future work should
explore the underlying reasons for these findings and their practical significance. For example, it is
possible that computer-based testing creates a less threatening environment for some students and a more
threatening environment for others, and that the differences in impact for the two testing conditions are an
indicator of this effect. Stereotype threat (Steele, 1997) has recently been proposed as a contributor to the
lower scores achieved by minority populations on high stakes tests. It is conceivable that African
American and Hispanic students experience less stereotype threat (or test anxiety or some other
phenomenon that may depress test scores) in a computer-based testing environment in which examinees are
physically isolated from one another. Results for females, on the other hand, indicate that they may
experience an increase in stereotype threat (or other phenomena) under the computer-based condition.
Given the proliferation of technology and improved internet access since the introduction of
CBTs, other work is needed to examine the relationship of computer familiarity to test performance on
computer-based tests, relative to paper versions. Future studies examining group differences in
performance should control for both computer access and familiarity. As more paper-based tests are
replaced by computer-based tests, a new body of research should be developed focusing on relationships
between subgroup performance differences and aspects of the CBT environment, such as computer
familiarity, computer interface, and user control issues, such as font size.
References
Bridgeman, B., Anderson, D., & Wightman, L. (1998). GMAT comparability study. Princeton, NJ: Educational Testing Service.
Bridgeman, B., & Cooper, P. (1998, April). Comparability of scores on word-processed and handwritten essays on the Graduate Management Admissions Test. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Bridgeman, B., & Schaeffer, G. A. (1995, April). A comparison of gender differences on paper-and-pencil and computer-adaptive versions of the Graduate Record Examination. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Chambers, J. M., Cleveland, W. S., Kleiner, B., & Tukey, P. A. (1983). Graphical methods for data analysis. Boston: Duxbury Press.
Eignor, D. R., Way, W. D., & Amoss, K. E. (1994, April). Establishing the comparability of the NCLEX using CAT with traditional NCLEX examinations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Hoover, M. R., Politzer, R. L., & Taylor, O. (1990). Bias in reading tests for Black language speakers: A sociolinguistic perspective. In A. G. Hilliard (Ed.), Testing African American students (pp. 81-98). Morristown, NJ: Aaron Press.
Johnson, D. F., & Mihal, W. L. (1973). Performance of Blacks and Whites in computerized versus manual testing environments. American Psychologist, 28, 694-699.
Kirsch, I., Jamieson, J., Taylor, C., & Eignor, D. R. (1998). Computer familiarity among TOEFL examinees (ETS Research Report No. 98-6). Princeton, NJ: Educational Testing Service.
Lawrence, I., Potenza, M. T., & Feigenbaum, M. (1998, April). Examinee reactions to a computer-based administration of the SAT. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Mazzeo, J., & Harvey, A. L. (1988). The equivalence of scores from automated and conventional educational and psychological tests: A review of the literature (College Board Report No. 88-8). Princeton, NJ: Educational Testing Service.
Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449-458.
O'Neill, K., & Powers, D. E. (1993, April). The performance of examinee subgroups on a computer-administered test of basic academic skills. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta, GA.
Pine, S. M., Church, A. T., Gialluca, K. A., & Weiss, D. J. (1980). Effects of computerized adaptive testing on Black and White students (Research Report No. 79-2). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.
Schaeffer, G. A., Bridgeman, B., Golub-Smith, M. L., Lewis, C., Potenza, M. T., & Steffan, M. (1998). Comparability of paper-and-pencil and computer adaptive test scores on the GRE General Test (GRE Board Professional Report No. 95-08; ETS Research Report No. 98-38). Princeton, NJ: Educational Testing Service.
Schaeffer, G. A., Steffan, M., Golub-Smith, M. L., Mills, C. N., & Durso, R. (1995). The introduction and comparability of the computer-adaptive GRE General Test (GRE Board Professional Report No. 88-08). Princeton, NJ: Educational Testing Service.
Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist, 52, 613-629.
Taylor, C., Jamieson, J., Eignor, D. R., & Kirsch, I. (1998). The relationship between computer familiarity and performance on computer-based TOEFL test tasks (ETS Research Report No. 98-08). Princeton, NJ: Educational Testing Service.
Ward, W. C., & Bridgeman, B. (1996). Subgroup differences and acceptance of computer-based testing (ETS internal report). Educational Testing Service, Princeton, NJ.
Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.
Willingham, W. W., & Johnson, L. (1997). Supplement to gender and fair assessment. Princeton, NJ: Educational Testing Service.
Table 1
Descriptive Statistics for Test Sample
Subjects  Administration
Same      Experimental
          Operational
Matched   Operational

Test sample         Date     N(a)     Sex   Language          Race/ethnicity  GPA      Mother's education  Computer use(b)       Citizens
                                      %     % English best or % White         % > 3.0  % with B.A. or      % who use sometimes   % of U.S.
                                            equal to other                             higher              or more               citizens
                                            language
SAT                 1996     1,401    55    97                75              87       45                  64                    100
GMAT1               1996     3,465    41    68                52              NA       NA                  NA                    69
GMAT2               1997     773      43    70                53              NA       NA                  NA                    75
TOEFL               1997-98  3,791    56    NA                NA              NA       NA                  NA                    NA
GRE1                1996-97  56,657   66    100               89              80       40                  89                    100
GRE2                         77,233   64    97                85              77       39                  8%                    100
                             40,257   71    100               91              54       30                  NA                    NA
                    1993-97  39,027   73    100               90              53       31                  NA                    NA
PRAXIS-Mathematics           40,325   74    100               89              53       29                  NA                    NA
Note. NA = data not available.
(a) For matched samples, the overall N reflects the number of matches, not the number of test takers. (b) Choices were never, rarely, sometimes,
frequently, or daily.
Table 2
Descriptive Statistics for Quantitative Tests by Race/Ethnicity, Gender, and Race/Ethnicity Within Gender
GRE1 quantitative score SAT math score Praxis math z-score GMAT1 quantitative score
White
African American
Asian
Hispanic
CBT Paper
M SD M SD
544 121 540 119
419 119 411 104
618 119 606 121
496 129 481 117
CBT PC 3T Pa )er
N SD
50.207 .93
3,212 488 91 480 89 100 -.88
#
i-I.13 .98 1.06
1.452 581 101 578 99 131 .I2 .96 1 .Ol 1 .oo
1,728 76 78 -.36 .99 1.04
White Males 17,609 99 476 .45 .89 .39 .87
African American Males 708 99 36 -.54 1.03 -.82 1.07
Asian Males 615 584 101 580 93 69 .44 .79
Hispanic Males 577 569 85 539 75 35 -.04 1.03
White Females 515 111 512 110 32,598 537 90 527 89 575 -.05 .93
African American Females 406 109 399 95 2,504 480 89 474 83 64 -.98 .94 -1.21 1.04
Asian Females 579 117 568 119 837 577 101 575 106 62 .04 .97 -.09 1.01
Hispanic Females 464 117 458 107 1,151 538 90 525 78 43 -.46 .97 -.49 1.04
Males 595 123 587 121 19,522 573 103 563 .93 .31 .93 10,523 35 9 35 9 2,049
Females 507 116 503 114 37,135 534 95 526 .99 -.11 1.00 29,802 30 9 31 9 1,416
Note. GMAT2 quantitative score was not included in the table due to insufficient numbers for racial/ethnic analyses. Gender statistics are as
follows: males (MCBT = 33, SD = 9; Mpaper = 33, SD = 9; and n = 438) and females (MCBT = 29, SD = 9; Mpaper = 29, SD = 9; and n = 335).
Table 3
Differences in Impact for Population Groups(a) by Quantitative Test
I GREl quantitative I SAT math Praxis math I
GMATl quantitative
~ dp-dcb/ e 1
African American
dp-dcb e
.I0 .Ol B>W -.Ol ns
-.07 ns
.I3 .05 H>W
-02 ns
-.05 ns
.25 .05 H>W
-.Ol ns
-.07 ns
.04 ns
Asian I I / -06 ns White
Hispanic 1 .I1 / .Ol j H>W
African American males I .I3 I I .05 B>W
White males Asian males
Hispanic males
.06 ns
.20 .05 H>W
White females
African American females
Asian females
.I0 .Ol B>W
.07 ns
025 .Ol B>W .I3 ,001 B>W
.I9 -05 A>W -09 ns
.07 ns -.06 ns
-.07 ns .I0 ns
.I4 ns .06 ns
-.I8 ns -.Ol ns
Hispanic females I .05 ns I I
African American males African American females I -03 I ns I
Asian males Asian female I -.05 ns I I
HisPanic males Hispanic females I -.I5 I ns I
Males Females .I _
-.03 .05 M>F --~~ ~_ . .
White males White females -.03 .05 M>F -.11 .01 M>F -.12 ns
Note. There were insufficient numbers of GMAT2 quantitative scores for racial/ethnic analyses, but a repeated measures ANOVA by gender
found no significant change in impact (F [1, 771] = 2.37, p = .128, dp - dc = .05). ns = not significant, A = Asian, B = African American, H =
Hispanic, W = Caucasian non-Hispanic, M = male, F = female.
(a) Means and Ns by subgroup can be found in Table 2. (b) Positive numbers indicate focal group is relatively higher on computer-based test, while
negative numbers indicate the reference group is relatively higher on computer-based test.
Table 4
Descriptive Statistics for Verbal Tests by Race/Ethnicity, Gender, and Race/Ethnicity Within Gender
White
African American
Asian
African American Males
Asian Males
CBT
N M SD
1,051 .08 .94
100 -.85 1.22
131 -.09 .97
78 -.27 1.03
476 .I8 .93
36 -.68 1.16
69 -.03 1.04
African American Females
Asian Females
Note. GMAT2 verbal score was not included in the table due to insufficient num
verbal score I SAT verbal score I Praxis
-.45 1.11 153 -.47
.07 .93 25,308 .I1
-1.06 1.24 1,996 -.71
-.I4 1.00 287 .02
-.36 1.04 439 -.23
.08 .96 10,632 -.I0
-.02 1 .OO 28,395 .05
Jers for racial/ethnic
‘raxi
3T
SD
97 L
1 .oe
89 L
99 L
1 .OE
.94
.92
1.02
1 09 A
1.02
88 L
98 L
1 04 L
98 k ana
; writin
I Pa
3
!!d
.08
-.86
-.02
-.40
-.07
-1.06
-.20
-.63
.I4
-.80
.03
-.31
-.I4
.06
1.02
.98
lyses. Gender
follows: males (MCBT = 29, SD = 8; Mpaper = 28, SD = 8; and n = 438) and females (MCBT = 28, SD = 8; Mpaper = 27,
Table 5
Differences in Impact for Population Groups(a) by Verbal Test
                                                    GRE verbal score      SAT verbal score      Praxis reading score  GMAT1 verbal score    Praxis writing score
Reference group          Focal group                dp-dc(b) p            dp-dc(b) p            dp-dc(b) p            dp-dc(b) p            dp-dc(b) p
White                    African American           .07 .01 B>W           .14 .01 B>W           .18 .01 B>W           .00 ns                .14 .01 B>W
White                    Asian                      -.04 ns               -.04 ns               .04 ns                -.08 ns               .04 ns
White                    Hispanic                   .04 ns                .12 .05 H>W           .12 .05 H>W           -.13 ns               .12 .01 H>W
White males              African American males     .08 ns                .15 ns                .20 .01 B>W           -.06 ns               .20 .01 B>W
White males              Asian males                -.03 ns               -.02 ns               -.02 ns               -.08 ns               .13 ns
White males              Hispanic males             .10 ns                -.14 ns               .23 ns                .07 ns                .13 ns
White females            African American females   .08 .01 B>W           .16 .05 B>W           .17 .01 B>W           .08 ns                .12 .01 B>W
White females            Asian females              -.06 ns               -.07 ns               .07 ns                -.08 ns               .02 ns
White females            Hispanic females           .01 ns                .32 .001 H>W          .09 ns                .03 ns                .13 ns
African American males   African American females   -.03 ns               -.07 ns               -.09 ns               -.02 ns               -.11 ns
Asian males              Asian females              -.06 ns               -.03 ns               .05 ns                -.10 ns               -.15 ns
Hispanic males           Hispanic females           -.13 ns               .42 .001 F>M          -.19 ns               -.11 ns               -.07 ns
White males              White females              -.04 .01 M>F          -.04 ns               -.05 .01 M>F          -.13 .001 M>F         -.05 .01 M>F
Males                    Females                    -.04 .01 M>F          -.01 ns               -.05 .01 M>F          -.09 .001 M>F         -.05 .01 M>F
Note. There were insufficient numbers of GMAT2 verbal scores for racial/ethnic analyses, but a repeated measures ANOVA by gender found a
significant change in impact (F [1, 771] = 6.99, p < .01, dp - dc = -.11, M > F). ns = not significant, A = Asian, B = African American, H =
Hispanic, W = Caucasian non-Hispanic, M = male, F = female.
(a) Means and Ns by subgroup can be found in Table 4. (b) Positive numbers indicate focal group is relatively higher on computer-based test.
Table 6
Descriptive Statistics for Quantitative Tests by Language Group by Quantitative Test
English language
Other language
English males
Other males
English females
Other females
587 132 582 132 3,646 34 10
592 126 585 124 25,568 34 9
quantitative score Paper
M SD N
33 9 2,355
34 9 1,110
35 9 1,379
35 9 670
30 9 976
33 10 440
GMAT2 CBT
M SD
30 9
34 9
32 9
36 9
28 9
32 9
quantitative score
30 9 540
35 10 89
28 9 238
30 9 56
Table 7
Differences in Impact for Language Groups by Quantitative Test
                                      GRE2 quantitative  GMAT1 quantitative  GMAT2 quantitative
Reference group    Focal group        dp-dc(b) p         dp-dc(b) p          dp-dc(b) p
English language   Other language     -.01 ns            .03 ns
English males      Other males        .01 ns             .05 .05 O>E         so ns
English females    Other females      -.02 ns            .01 ns              .18 .05 O>E
Note. ns = not significant, E = English best language, O = other language best.
(a) Means and Ns by subgroup can be found in Table 6. (b) Positive numbers indicate focal group is relatively higher on computer-based test.
Table 8
Descriptive Statistics for Quantitative Tests by Language Groups by Verbal Test
                    GRE2 verbal score               GMAT1 verbal score           GMAT2 verbal score
                    CBT         Paper               CBT       Paper              CBT       Paper
                    M     SD    M     SD    N       M    SD   M    SD   N        M    SD   M    SD   N
English language    477   103   481   109   73,587  32   7    32   8    2,355    29   7    29   8    540
Other language      437   112   442   119   3,646   27   9    28   9    1,110    24   8    23   8    145
English males       496   106   500   113   25,568  32   7    32   7    1,379    30   7    29   7    302
Other males         416   112   419   120   1,434   28   9    28   9    670      24   8    23   8    89
English females     466   100   471   106   45,005  30   7    31   8    976      28   7    28   8    238
Other females       451   111   458   117   2,212   26   9    27   9    440      24   8    24   8    56
Table 9
Differences in Impact for Language Groups by Verbal Test
                                      GRE2 verbal   GMAT1 verbal     GMAT2 verbal
Reference group    Focal group        dp-dc(b) p    dp-dc(b) p       dp-dc(b) p
English language   Other language     -.03 ns       -.05 .05 O<E     .01 ns
English males      Other males        -.04 ns       -.05 .05 O<E     -.02 ns
English females    Other females      -.03 ns       -.05 ns          .06 ns
Note. ns = not significant, E = English best language, O = other language best.
(a) Means and Ns by subgroup can be found in Table 8. (b) Positive numbers indicate focal group is relatively higher on computer-based test.
Table 10
Descriptive Statistics for TOEFL Verbal Subtests by Language, Gender, and Language Within Gender
                         TOEFL Writing z-score   TOEFL Reading z-score   TOEFL Listening z-score
                                 CBT       Paper         CBT       Paper         CBT       Paper
Language group        N    Mean    SD  Mean    SD  Mean    SD  Mean    SD  Mean    SD  Mean    SD
Chinese            1378     .03   .93   .11   .91   .03  1.02   .05  1.01  -.28   .97  -.23   .94
Korean              707     .03   .91   .09   .87   .05   .94   .05   .95  -.24   .95  -.22   .90
Japanese            370    -.28   .97  -.27   .92  -.35  1.00  -.45   .98  -.23   .96  -.15   .92
Thai                338    -.47   .95  -.44   .93  -.46   .99  -.45  1.00  -.33   .89  -.28   .91
Russian             289     .27   .93   .14   .97   .29   .87   .25   .93   .43   .84   .31   .94
Spanish             425    -.13  1.15  -.17  1.03   .25   .88   .24   .89   .18   .98   .06  1.06
Arabic              284    -.07  1.01  -.20  1.06  -.15   .98  -.28  1.06   .14   .90   .08   .96
Chinese males       525     .14   .93   .18   .89   .17  1.01   .17  1.01  -.26   .98  -.24   .93
Korean males        369     .03   .93   .09   .87   .09   .93   .07   .98  -.37   .99  -.34   .93
Japanese males      105    -.01  1.03  -.05   .98  -.22  1.13  -.25  1.07  -.36  1.05  -.27  1.03
Thai males          159    -.47   .97  -.45   .92  -.38   .99  -.45  1.08  -.29   .88  -.33   .86
Russian males       109     .39   .87   .23   .95   .49   .80   .47   .80   .56   .74   .35   .88
Spanish males       197    -.14  1.17  -.18  1.06   .32   .90   .27   .90   .09  1.07  -.08  1.13
Arabic males        192    -.15  1.06  -.28  1.14  -.22  1.04  -.39  1.14   .06   .97   .02  1.02
Chinese females     838    -.04   .93   .07   .92  -.06  1.02  -.03  1.00  -.30   .97  -.23   .95
Korean females      328     .04   .88   .08   .86   .01   .93   .02   .91  -.09   .89  -.09   .85
Japanese females    265    -.39   .92  -.37   .88  -.41   .94  -.54   .94  -.18   .92  -.10   .88
Thai females        179    -.47   .94  -.42   .94  -.53   .98  -.45   .94  -.37   .90  -.24   .95
Russian females     177     .20   .97   .08   .98   .17   .89   .12   .98   .35   .88   .28   .97
Spanish females     228    -.12  1.13  -.16  1.00   .19   .86   .22   .89   .27   .89   .19   .99
Arabic females       88     .08   .89  -.05   .87   .01   .84  -.03   .83   .33   .68   .22   .83
Males              1656    -.01  1.00   .00   .98   .03  1.00   .07  1.04  -.18  1.01  -.16   .98
Females            2103    -.06   .96  -.09   .94  -.08   .98  -.08   .98  -.08   .95  -.12   .94
Note. The Ns for Males and Females include all males and females who took the TOEFL; therefore n is greater than the sum of males
and females in the seven language groups.
Table 11
Differences in Impact for Language Groupsa by TOEFL Subtests
Reference group
Spanish
Spanish males
Thai males
Russian males      .11  ns          -.03  ns           .08  ns
Arabic males       .09  ns           .08  ns          -.12  .05   S>A
Chinese females   -.16  .001  S>C    .01  ns          -.18  .001  S>C
Korean females    -.10  ns           .02  ns          -.10  ns
Spanish females
Chinese males
Korean males
Japanese males
Thai males
Russian males
Spanish males
Arabic males
Males
Note. ns = not significant, A = Arabic, C = Chinese, J = Japanese, K = Korean, R = Russian, S = Spanish,
and T = Thai, M = male, F = female.
aMeans and Ns by subgroup can be found in Table 10. bPositive numbers indicate focal group is
relatively higher on computer-based test.
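As an illustration of how the entries in Table 11 may relate to the means in Table 10, the following Python sketch computes a difference in impact for one focal/reference pair. It assumes that impact is the focal-group mean minus the reference-group mean within a test format, and that the tabled value is the CBT impact minus the paper impact. That reading is consistent with footnote b but is not spelled out in the tables themselves. The class and function names are hypothetical, and the inputs are the rounded Writing means for Chinese females (focal) and Spanish females (reference) from Table 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtestMeans:
    """Mean z-scores for one group on one subtest (hypothetical container)."""
    cbt: float    # mean on the computer-based test
    paper: float  # mean on the paper-based test


def impact_difference(focal: SubtestMeans, reference: SubtestMeans) -> float:
    """Assumed definition: (focal - reference) on CBT minus (focal - reference) on paper.

    A positive result means the focal group is relatively higher on the CBT.
    """
    cbt_impact = focal.cbt - reference.cbt
    paper_impact = focal.paper - reference.paper
    return cbt_impact - paper_impact


# Rounded TOEFL Writing means taken from Table 10.
chinese_females_writing = SubtestMeans(cbt=-0.04, paper=0.07)
spanish_females_writing = SubtestMeans(cbt=-0.12, paper=-0.16)

diff = impact_difference(chinese_females_writing, spanish_females_writing)
print(round(diff, 2))  # -0.15
```

With these rounded means the sketch gives -.15, close to the first entry (-.16) in the Chinese females row of Table 11, which suggests that entry refers to Writing; the small gap is presumably due to rounding of the published means. The negative sign indicates that, under this reading, Chinese females were relatively lower than Spanish females on the computer-based Writing test.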
Appendix A
Scatter Plots by Gender for SAT, GMAT, and TOEFL
[Scatter plots in two panels: SAT Math and SAT Verbal. Horizontal axis: Computer Score (200 to 800). Legend: Female, Male.]
Figure A1. SAT Scores by Gender
[Scatter plots in two panels: GMAT1 Quantitative and GMAT1 Verbal. Horizontal axis: Computer Score (0 to 60). Legend: Female, Male.]
Figure A2. GMAT1 Scores by Gender
[Scatter plots in two panels: GMAT2 Quantitative and GMAT2 Verbal. Horizontal axis: Computer Score (0 to 60). Legend: Female, Male.]
Figure A3. GMAT2 Scores by Gender
[Scatter plots in two panels: TOEFL Writing and TOEFL Listening. Horizontal axis: Computer Z-Score. Legend: Female, Male.]
Figure A4. TOEFL Scores by Gender
[Scatter plot: TOEFL Reading. Horizontal axis: Computer Z-Score (-5 to 3). Legend: Female, Male.]
Figure A4. TOEFL Scores by Gender, continued
Appendix B
Graphs of Performance by Race/Ethnicity, Gender, and Language Group
Sample Graph
[Plot with Computer Score (0 to 100) on the horizontal axis; one region of the plot is labeled "Higher score on Paper Test."]
[Four panels: SAT Math: Race/Ethnicity; GRE1 Quantitative: Race/Ethnicity; GMAT1 Quantitative: Race/Ethnicity; Praxis Mathematics: Race/Ethnicity. Horizontal axis: Computer Score (Computer Z-Score for Praxis). The SAT panel notes a 5 point practice effect on the x-axis. Legible point labels include Hispanic and African American.]
Figure B1. Quantitative Tests by Race/Ethnicity
[Five panels: SAT Verbal: Race/Ethnicity; GRE1 Verbal: Race/Ethnicity; GMAT1 Verbal: Race/Ethnicity; Praxis Reading: Race/Ethnicity; Praxis Writing: Race/Ethnicity. Horizontal axis: Computer Score (Computer Z-Score for Praxis). The SAT panel notes a 5 point practice effect on the x-axis. Legible point labels include White, Asian, Hispanic, and African American.]
Figure B2. Verbal Tests by Race/Ethnicity
[Four panels: SAT Math: Males by Race/Ethnicity; GRE1 Quantitative: Males by Race/Ethnicity; GMAT1 Quantitative: Males by Race/Ethnicity; Praxis Mathematics: Males by Race/Ethnicity. Horizontal axis: Computer Score (Computer Z-Score for Praxis). The SAT panel notes a 5 point practice effect on the x-axis. Legible point labels include Hispanic and African American.]
Figure B3. Quantitative Tests by Race/Ethnicity for Males
[Four panels: SAT Math: Females by Race/Ethnicity; GRE1 Quantitative: Females by Race/Ethnicity; GMAT1 Quantitative: Females by Race/Ethnicity; Praxis Mathematics: Females by Race/Ethnicity. Horizontal axis: Computer Score (Computer Z-Score for Praxis). The SAT panel notes a 5 point practice effect on the x-axis. Legible point labels include African American.]
Figure B4. Quantitative Tests by Race/Ethnicity for Females
[Five panels: SAT Verbal: Males by Race/Ethnicity; GRE1 Verbal: Males by Race/Ethnicity; GMAT1 Verbal: Males by Race/Ethnicity; Praxis Reading: Males by Race/Ethnicity; Praxis Writing: Males by Race/Ethnicity. Horizontal axis: Computer Score (Computer Z-Score for Praxis). The SAT panel notes a 5 point practice effect on the x-axis. Legible point labels include African American.]
Figure B5. Verbal Tests by Race/Ethnicity for Males
[Five panels: SAT Verbal: Females by Race/Ethnicity; GRE1 Verbal: Females by Race/Ethnicity; GMAT1 Verbal: Females by Race/Ethnicity; Praxis Reading: Females by Race/Ethnicity; Praxis Writing: Females by Race/Ethnicity. Horizontal axis: Computer Score (Computer Z-Score for Praxis). The SAT panel notes a 5 point practice effect on the x-axis. Legible point labels include African American.]
Figure B6. Verbal Tests by Race/Ethnicity for Females
[Five panels: SAT Math: Gender; GRE1 Quantitative: Gender; GMAT1 Quantitative: Gender; GMAT2 Quantitative: Gender; Praxis: Math by Gender. Horizontal axis: Computer Score (Computer Z-Score for Praxis). The SAT panel notes a 5 point practice effect on the x-axis and the GMAT2 panel a 1 point practice effect.]
Figure B7. Quantitative Tests by Gender
[Six panels: SAT Verbal: Gender; GRE1 Verbal: Gender; GMAT1 Verbal: Gender; GMAT2 Verbal: Gender; Praxis Reading: Gender; Praxis Writing: Gender. Horizontal axis: Computer Score (Computer Z-Score for Praxis). The SAT panel notes a 5 point practice effect on the x-axis and the GMAT2 panel a 1 point practice effect.]
Figure B8. Verbal Tests by Gender
[Three panels: TOEFL Writing: Gender; TOEFL Listening: Gender; TOEFL Reading: Gender. Horizontal axis: Computer Z-Score (-.70 to .70).]
Figure B8. Verbal Tests by Gender, continued
[Six panels: TOEFL Writing: Language; TOEFL Reading: Language; TOEFL Listening: Language; GRE2 Verbal: Language; GMAT1 Verbal: Language; GMAT2 Verbal: Language. Horizontal axis: Computer Z-Score for TOEFL, Computer Score for GRE2 and GMAT. The GMAT2 panel notes a 1 point practice effect on the x-axis. Legible point labels include Russian.]
Figure B9. Verbal Tests by Language Group
[Six panels: TOEFL Writing: Females by Language; TOEFL Reading: Females by Language; TOEFL Listening: Females by Language; GRE2 Verbal: Females by Language; GMAT1 Verbal: Females by Language; GMAT2 Verbal: Females by Language. Horizontal axis: Computer Z-Score for TOEFL, Computer Score for GRE2 and GMAT. The GMAT2 panel notes a 1 point practice effect on the x-axis. Legible point labels include Spanish.]
Figure B10. Verbal Tests by Language Group for Females
[Six panels: TOEFL Writing: Males by Language; TOEFL Reading: Males by Language; TOEFL Listening: Males by Language; GRE2 Verbal: Males by Language; GMAT1 Verbal: Males by Language; GMAT2 Verbal: Males by Language. Horizontal axis: Computer Z-Score for TOEFL, Computer Score for GRE2 and GMAT. The GMAT2 panel notes a 1 point practice effect on the x-axis.]
Figure B11. Verbal Tests by Language Group for Males
[Three panels: GRE2 Quantitative: Language; GMAT1 Quantitative: Language; GMAT2 Quantitative: Language. Horizontal axis: Computer Score. The GMAT2 panel notes a 1 point practice effect on the x-axis.]
Figure B12. Quantitative Tests by Language Group
[Six panels: GRE2 Quantitative: Females by Language; GMAT1 Quantitative: Females by Language; GMAT2 Quantitative: Females by Language; GRE2 Quantitative: Males by Language; GMAT1 Quantitative; GMAT2 Quantitative: Males by Language. Horizontal axis: Computer Score. The GMAT2 panels note a 1 point practice effect on the x-axis. Legible point labels include English.]
Figure B13. Quantitative Tests by Gender and Language Group