
Detecting Construct-Irrelevant Variance in an Open-Ended, Computerized Mathematics Task

Ann Gallagher
Randy Elliot Bennett
Cara Cahalan

GRE Board Report No. 95-13P

October 2000

This report presents the findings of a research project funded by and carried

out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541


Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate

Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

********************

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs,

services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, the modernized ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are

registered trademarks of Educational Testing Service.

Educational Testing Service Princeton, NJ 08541

Copyright © 2000 by Educational Testing Service. All rights reserved.


Abstract

The purpose of this study was to evaluate whether variance due to computer-based presentation

was associated with performance on a new constructed-response type -- Mathematical Expression -- that

requires examinees to build mathematical expressions using a mouse and an on-screen tool palette.

Participants took parallel computer-based and paper-based tests consisting of Mathematical Expression

items, plus a test of their skill in entering and editing data using the computer interface. Comparisons of

mean performance, reliability, speededness, and relations with external indicators were conducted across

the paper-based and computer-based tests; also, computer-based math score was regressed on edit/entry

score after controlling for paper-and-pencil math score and background information. Although no

statistical evidence of construct-irrelevant variance was detected, some examinees reported mechanical

difficulties in responding and indicated a preference for the paper-and-pencil test.

Keywords: Computer-based testing, Item sets, Mathematics, Speededness


Table of Contents

Introduction
Method
    Participants
    Instruments
    Procedure
    Data Analysis
Results
Conclusion
Tables and Figures
References
Author Note
Appendix

List of Tables

Table 1. A Mathematical Expression Key and Example Responses
Table 2. Means, Standard Deviations, and Coefficient Alpha Reliabilities for Mathematical Expression and Edit/Entry Tests
Table 3. Correlations Between the Mathematical Expression Test, Edit/Entry Test, and Other Variables
Table 4. Hierarchical Multiple Regression of Computer-Based Mathematical Expression Scores on Paper-and-Pencil Scores, Background Variables, and Edit/Entry Test
Table 5. An Example Paper-and-Pencil Response That Would Not Have Fit in the Computer-Based Mathematical Expression Answer Box

List of Figures

Figure 1. The Mathematical Expression interface with an example item and a correct response.


Introduction

One of the promises of computer-based testing is the ability to present examinees with open-

ended tasks that are more like the ones they encounter in academic and work settings (Bennett, 1993).

Mathematical Expression (ME) is one such response type. ME was created as part of an experimental test

for admission to quantitatively oriented graduate programs. This response type can be used with any

question for which the answer is a rational expression, including questions that ask the examinee to

mathematically model a problem situation. ME is particularly exciting because it permits the developers

of computer-based mathematics tests to use automatically scorable, open-ended items, the correct

answers to which may take many different surface forms (see Table 1 for an example key and a few

equivalent responses). Because these responses can be scored in real time using symbol manipulation

techniques, ME items can be included in computer-adaptive tests.
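
To make this concrete, the sketch below scores a response by checking algebraic equivalence with the key, using the open-source SymPy library. It is only an illustration of symbolic-manipulation scoring under stated assumptions (the report does not name the software actually used); the key and responses are the Table 1 examples.

    # Illustrative only: symbolic-equivalence scoring with SymPy (an assumption;
    # the report does not name the scoring software it used). The key and the
    # responses are the Table 1 examples.
    from sympy import simplify, sympify

    key = sympify("(m - 2*p)*(n - 2*p)/4")

    responses = [
        "(n - 2*p)*(m - 2*p)/4",         # factors reordered
        "0.25*(-2*p + m)*(-2*p + n)",    # decimal coefficient
        "p**2 - p*n/2 - p*m/2 + m*n/4",  # fully expanded
    ]

    for text in responses:
        resp = sympify(text, rational=True)  # read 0.25 as the exact fraction 1/4
        correct = simplify(resp - key) == 0  # equivalent if the difference is zero
        print(text, "->", "correct" if correct else "incorrect")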

In delivering a test on computer, one key concern is finding a way for examinees to respond that

is insensitive to individual differences in computer familiarity. For open-ended items, the challenge is

particularly complex. By definition, these items require examinees to enter more information and, thus,

could potentially require greater computer skill.

In developing the ME interface, considerable care was taken to keep computer-skill requirements

to a minimum. For example, the interface is completely mouse driven: Examinees build their expressions

by clicking symbols in an on-screen palette (see Figure 1). This strategy circumvents the need for

keyboard facility as well as the problem that some mathematical symbols have no keyboard equivalents.

On the palette, digits and arithmetic operators appear in the standard calculator configuration, which

makes them easy to find.

In addition, the interface provides for exponent and subscript modes, so that users do not have to

enter syntactic markers, such as carets, to denote these positions. The user simply clicks on the Exponent

or Subscript button to make the next number he or she selects appear in the intended position. The

interface also provides graphical displays of complex expressions involving division that use a horizontal

division bar rather than the less visually meaningful slash. The natural, graphical representation of

exponents, subscripts, and division makes it easier for users to parse expressions they have just entered

and minimizes the chances of a mismatch between the system’s interpretation of an expression and the

user’s intention.


To limit construct-irrelevant errors (such as typos), and to facilitate interpretation and scoring, the

ME interface imposes certain minimal constraints on the entry of expressions. For example, the interface

disables certain buttons on the tool palette based on the entry mode selected. If the user has selected

exponent-mode, for instance, the interface disables the entry of certain mathematical operators, like

multiplication and division, as well as alphabetic characters. Also, when users submit their final answers,

the interface checks these expressions for syntactic correctness and flags those that display

inappropriately juxtaposed operators (e.g., a multiplication symbol followed immediately by a division

symbol), malformed numbers (e.g., a number containing two decimal points), or unbalanced parentheses.
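
A rough, hypothetical sketch of such checks on a linearized expression string is shown below; the actual ME interface operates on palette input, and its exact rules are not specified beyond the examples above.

    # A rough, hypothetical approximation of the syntax checks described above;
    # the real ME interface works from palette input rather than raw strings.
    import re

    OPERATORS = set("+-*/")

    def syntax_errors(expr):
        errors = []
        # Inappropriately juxtaposed operators (e.g., "*" immediately followed
        # by "/"); a "-" after an operator is allowed as a unary minus.
        for a, b in zip(expr, expr[1:]):
            if a in OPERATORS and b in OPERATORS and b != "-":
                errors.append("juxtaposed operators: " + a + b)
        # Malformed numbers: a number containing two decimal points.
        if re.search(r"\d*\.\d*\.", expr):
            errors.append("number with two decimal points")
        # Unbalanced parentheses.
        depth = 0
        for ch in expr:
            depth += (ch == "(") - (ch == ")")
            if depth < 0:
                break
        if depth != 0:
            errors.append("unbalanced parentheses")
        return errors

    print(syntax_errors("(m - 2*p)*(n - 2*p)/4"))  # -> []
    print(syntax_errors("3*/x + 1..5 + (2"))       # -> all three error types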

The ME interface obviously requires some orientation. To accomplish this, a brief tutorial is used

to familiarize examinees with the response type prior to taking the test. The tutorial introduces the symbol

palette and demonstrates how examinees can formulate expressions using the Subscript and Exponent

buttons, the variable and constants menu (accessed by pressing the a-z key shown in Figure 1), and other

features.

Although every effort was made to design an ME interface that required minimal computer skill,

building an expression with the interface is still a more complex task than writing one with paper and

pencil. For this reason, facility with the ME interface could well produce an unwanted performance

effect. Preliminary evidence provided by Bennett, Steffen, Singley, Morley, and Jacquemin (1997) seems

to indicate that ME tasks do not introduce any more construct-irrelevant variance than do other task

types. These investigators compared the functioning of ME items to other computer-delivered item types,

including standard multiple-choice questions, questions requiring entry of numeric values, and questions

asking the examinee to shade portions of a coordinate system. Their results showed that ME items have

roughly the same distribution of difficulty as these other response types. In addition, ME questions had

item-total correlations similar to those for the other items. Third, ME items took no longer to answer than

other constructed-response problems written to measure mathematical modeling skills (though both types

took longer than multiple-choice modeling questions). Finally, ME showed gender differences

comparable to those for the other quantitative questions.

Whereas the data provided by Bennett et al. (1997) are encouraging, they provide only an indirect

evaluation of whether the ME interface introduces irrelevant variance. In the current study, our goal was

to test more directly the hypothesis that individual differences in facility with the ME interface affect

performance on computer-based mathematical tests.


Method

Participants

We recruited 226 volunteers from 10 colleges and universities located in different regions of the

United States to participate in this study. Of these individuals, 48 were eliminated because they either

were not enrolled in quantitatively oriented undergraduate majors or they were not close to making the

transition to graduate school. Of the 178 remaining participants, 57% were college seniors and 43% were

first-year graduate students. Thirty-six percent of the participants were women and 79% were U.S.

citizens. The racial/ethnic distribution of the sample was 58% White, 15% Asian American, 13%

Hispanic, 7% other, and 5% Black. Most participants (53%) reported an undergraduate major in

engineering, with the remainder distributed among mathematics (23%), physical science (16%), and

computer science (8%). The largest group (47%) indicated an intention to pursue a master’s degree, while many (35%) said they would be pursuing doctorates.

Of the 178 students in the sample, 75 (42%) reported a score from the quantitative section of the

Graduate Record Examinations (GRE®) General Test. Of these 75 participants, most (71%) were first-

year graduate students, and very likely a more select group than the sample as a whole. The mean score

of those participants reporting GRE scores was 759 (SD = 41), which is substantially above the average

scores for all of their undergraduate fields. For example, in our sample, engineering majors had a mean

GRE quantitative score of 760, whereas in the 1995-96 academic year, students intending graduate study

in engineering scored a mean of 687 (Graduate Record Examinations Board, 1997).

All but one of our participants reported an undergraduate grade-point average (UGPA). UGPA

data were reported in six categories ranging from “Below 1.5” to “3.5-4.0,” with the latter marking the

high end of the scale. Most participants reported a UGPA of either 3.5-4.0 (41%) or 3.0-3.49 (33%).

Instruments

Mathematical Expression test. Two 16-item ME tests were created for the study. These tests were

designed to contain equal proportions of easy and difficult items, based on both mathematics content and

the procedural complexity of entering the response.


Edit/entry test. This computer-based test was designed to measure participants’ skill in using the

ME interface. The test consisted of five editing items and five entry items. Editing items required the

examinee to modify a given mathematical expression to match a given example. Entry items asked the

examinee to enter a given expression. Editing and entry items were designed to cover a range of

difficulty, with emphasis on mathematical expressions that were somewhat more complex than those that

would normally appear on an operational mathematical reasoning test.

Questionnaire and interview. Participants also completed a questionnaire about their personal

background, computer experience, perception of the ME tasks, and plans for graduate study. A debriefing

interview was conducted to ensure that important information about the interface was not overlooked and

to respond to any questions or concerns subjects may have had.

Procedure

Each examinee took part in a three-hour session, for which they received $45. All individuals

took both ME tests, one on paper and the other on computer, with one hour allotted for each test. Students

were assigned randomly to one of four order conditions:

• ME test 1 on computer, ME test 2 on paper, edit/entry test
• ME test 2 on computer, ME test 1 on paper, edit/entry test
• ME test 1 on paper, ME test 2 on computer, edit/entry test
• ME test 2 on paper, ME test 1 on computer, edit/entry test

The edit/entry test was administered after the ME tests to avoid providing additional practice to

students before taking the computer-based ME test. The session concluded with the questionnaire and

debriefing interview.

Data Analysis

To locate evidence of irrelevant variance due to the ME interface, we conducted several analyses.

The first set of analyses was targeted at determining the extent to which the paper-and-pencil ME test

forms were approximately equivalent to their computer-delivered counterparts. To the extent that they

were equivalent, we presumed the case for irrelevant variance would be considerably weakened.


To assess equivalence, we first compared coefficient alpha reliabilities across test modes --

computer versus paper-and-pencil -- within each ME test form. Second, we compared mean scores

resulting from different test modes within test forms, and vice versa. For the former comparison, we used

a between-subjects one-way analysis of variance for each test form, with ME scores as the dependent

variable and test mode as the independent variable. For the latter comparison, we used a between-subjects

one-way analysis of variance for each test mode, with ME score as the dependent variable and test form

as the independent variable.
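
A minimal sketch of one such comparison, assuming SciPy and hypothetical score arrays (the analysis software used in the study is not named), is shown below; with two groups the one-way ANOVA is equivalent to an independent-samples t test, and eta-squared, the effect size cited later in the Results, follows directly from the sums of squares.

    # Hypothetical data and a between-subjects one-way ANOVA comparing ME scores
    # across test modes for a single form (SciPy); illustrative only.
    import numpy as np
    from scipy import stats

    scores_computer = np.array([10, 12, 9, 14, 8, 11, 13, 7])
    scores_paper    = np.array([11, 12, 10, 13, 9, 12, 14, 8])

    f_stat, p_value = stats.f_oneway(scores_computer, scores_paper)
    print("F = %.2f, p = %.3f" % (f_stat, p_value))

    # Eta-squared = SS_between / SS_total, the effect size used in the Results.
    grand = np.concatenate([scores_computer, scores_paper])
    ss_between = sum(len(g) * (g.mean() - grand.mean()) ** 2
                     for g in (scores_computer, scores_paper))
    ss_total = ((grand - grand.mean()) ** 2).sum()
    print("eta^2 = %.2f" % (ss_between / ss_total))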

Third, we looked at speededness across test modes within each test form, computing the

proportion of students completing the test and the proportion reaching all but the last item. These

measures are, at best, a very loose approximation of speededness and one that is not precisely comparable

across test modes, because in computer mode, we required examinees to respond to an item before they

could be presented with another item -- something we could not control on the paper version. As a result,

participants’ skipping behavior is readily detected on paper as a blank response; on computer, omits are

less obvious as test takers could skip questions simply by making any response.
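
The sketch below illustrates one way such a comparison can be made: compute the completion proportion in each mode and test the difference with a pooled two-proportion z test. The counts are hypothetical, and the report does not name its exact procedure, although the z statistics reported in the Results are consistent with a test of this general form.

    # Hypothetical counts and a pooled two-proportion z test for the completion
    # rates in the two modes; the report does not state the exact procedure used.
    from math import sqrt

    def two_proportion_z(x1, n1, x2, n2):
        """z statistic for H0: p1 == p2, using the pooled proportion."""
        p1, p2 = x1 / n1, x2 / n2
        pooled = (x1 + x2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # Hypothetical example: 87 of 89 paper examinees vs. 76 of 89 computer
    # examinees completing one form (89 per cell follows from Table 2's note).
    print("z = %.2f" % two_proportion_z(x1=87, n1=89, x2=76, n2=89))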

Fourth, we compared the pattern of relations of the paper-and-pencil and computer test modes

with other variables, including edit/entry scores, GRE quantitative scores, undergraduate major (coded as

engineering vs. other), gender, and level of education (college senior vs. first-year graduate student).¹ For

this and subsequent analyses, we combined ME scores across test forms within computer and paper-and-

pencil test modes to increase statistical power. To achieve this combination, we first standardized

participants’ ME scores for each 16-item test form within each mode, and then we collapsed them across

the order conditions.
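
A minimal sketch of this combination step, assuming a pandas DataFrame with hypothetical column names (mode, form, me_score), is given below: scores are z-standardized within each form-by-mode cell and then pooled.

    # Hypothetical illustration of the score-combination step: standardize each
    # 16-item form within a mode, then pool across forms and order conditions.
    import pandas as pd

    def pooled_standardized_scores(df):
        # df has one row per examinee per mode, with columns
        # "mode" ("computer" or "paper"), "form" (1 or 2), and "me_score".
        return df.groupby(["mode", "form"])["me_score"].transform(
            lambda s: (s - s.mean()) / s.std(ddof=1)
        )

    scores = pd.DataFrame({
        "mode": ["computer", "computer", "paper", "paper"],
        "form": [1, 1, 1, 1],
        "me_score": [10, 12, 11, 13],
    })
    print(pooled_standardized_scores(scores))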

For our second set of analyses, we used hierarchical multiple regression to examine the extent to

which skill in using the ME interface was directly related to performance on the computer-based ME test.

For this analysis, we used ME score on the computer-delivered test as the dependent variable. We first

entered paper-and-pencil ME score into the equation, followed by background information -- major

(coded as engineering vs. other), level of education (college senior vs. first-year graduate student), and

gender -- to control for any group differences in computer-based ME performance. Finally, we entered

edit/entry score -- our measure of mechanical skill in responding to the computer-based ME test. Here, we presumed that any significant effect for edit/entry score, after controlling for paper-and-pencil ME score and background information, would suggest construct-irrelevant variance due to lack of facility with the ME interface.

¹ We used “engineering versus other” for undergraduate major because just over half of our sample indicated an engineering major.
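
The sketch below illustrates the block structure of this regression using statsmodels and synthetic data (the study data are not reproduced here, and the variable names are hypothetical stand-ins); the increment in R² is examined as each block is added.

    # Hierarchical regression sketch with synthetic data (statsmodels); variable
    # names are hypothetical stand-ins for the measures described above.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 178
    df = pd.DataFrame({
        "me_paper": rng.normal(size=n),
        "gender": rng.integers(0, 2, size=n),     # female (0) vs. male (1)
        "major": rng.integers(0, 2, size=n),      # engineering (0) vs. other (1)
        "education": rng.integers(0, 2, size=n),  # senior (0) vs. graduate (1)
        "edit_entry": rng.normal(size=n),
    })
    df["me_computer"] = 0.8 * df["me_paper"] + rng.normal(scale=0.6, size=n)

    blocks = [
        "me_computer ~ me_paper",
        "me_computer ~ me_paper + gender + major + education",
        "me_computer ~ me_paper + gender + major + education + edit_entry",
    ]
    prev_r2 = 0.0
    for formula in blocks:
        r2 = smf.ols(formula, data=df).fit().rsquared
        print("%-65s R^2 = %.3f  increment = %.3f" % (formula, r2, r2 - prev_r2))
        prev_r2 = r2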

Results

Table 2 shows mean performance and coefficient alpha reliabilities for both test forms for both

the computer-based and paper-based ME tests, and for the edit/entry measure. The reliabilities for the ME

tests ranged from .79 to .85, with no indication of differences between the computer-delivered and paper-

and-pencil versions. The reliability of the 10-item edit/entry task was .72.

Analyses of the mean scores showed no performance differences between the computer and paper test versions (F(1, 176) = .55, p > .05 for the first paper-and-pencil form vs. the first computer-based form; F(1, 176) = .29, p > .05 for the second paper-and-pencil form vs. the second computer-based form). There were mean differences, however, between the two ME paper-and-pencil forms (F(1, 176) = 25.99, p < .001) and between the two forms delivered on computer (F(1, 176) = 22.48, p < .001), suggesting that one form was harder than the other. Eta-squared was computed for each within-mode comparison and revealed effect sizes of .13 and .11 for the paper-based and computer-based tests, respectively. According to Cohen (1988), these eta-squares are characterized as medium effect sizes.

With respect to timing, 98% of those taking the paper version of ME test 1 finished the test, compared with 85% of those taking that test on computer, a statistically significant difference (z = 3.06, p < .01). For ME test 2, 85% completed the paper version and 90% finished the computer version, which was not a significant difference (z = -.91, p > .05). Regarding the percentages of participants who reached all but the last item on each test, the differences were significant for both ME forms, but in opposite directions. For ME test 1, 100% of those taking the paper version reached the next-to-last item, while 93% of those taking the computerized test went that far (z = 2.53, p < .05). For ME test 2, 87% of examinees taking the paper-and-pencil test completed the penultimate question, while 96% of those taking the computer-based test did so (z = -2.12, p < .05).

Table 3 shows correlations found among ME score, edit/entry test score, and various external

criteria after combining the standardized scores on the two ME forms. The observed correlation between

the ME paper-based and computer-based scores was .78; corrected for attenuation, that value was .97,


suggesting that the two modes were measuring the same construct.² Consistent with this suggestion is that the ME computer-based and paper-based scores also showed the same pattern of relations with external criteria; no statistically significant differences were found between the correlation of the ME computer-delivered test with any given external variable and the correlation of the ME paper-and-pencil test with the same external variable (t range = -.40 to 1.69, df range = 72 to 175). Both ME versions were significantly related to UGPA, GRE quantitative score, gender, and level of education. Similarly, both ME tests were unrelated to the edit/entry test or to undergraduate major. Finally, the edit/entry test was unrelated to any measure of accomplishment -- GRE quantitative score, UGPA, or level of education -- suggesting that, although reliable, the construct it measured was generally irrelevant to academic study.

Table 4 presents the results of regressing computer-based ME score on paper-based ME score,

background variables, and the edit/entry test. The paper-based ME score accounted for 61% of the variance in computer-delivered ME score (F(1, 176) = 272.92, p < .001). Adding the background information accounted for another 3% of the variance. Finally, and most importantly, no significant variance was attributable to the edit/entry measure.³

Compiled responses to the ME interface questionnaire can be found in the Appendix. With

respect to computer familiarity, all participants indicated using a computer almost daily, and all but one

indicated almost always using a mouse. Regarding the computer-based format, 57% found it easy to use

the computer to take the ME test, 42% found it somewhat difficult, and 2% thought it was very difficult.

Of those who found it somewhat or very difficult, the difficulty cited by the largest portion of participants

(29%) was that the on-screen palette was hard to use. When asked if they had difficulty entering

fractions, exponents/subscripts, or expressions involving square roots, 48% said that they had no

difficulty with any of these functions, but 30% cited problems with entering fractions.

² The correction for attenuation requires a reliability for each measure and the correlation between the two measures. Because there were two paper-based ME forms and two computer-based ME forms, we estimated a reliability for the two paper-and-pencil measures by taking the (geometric) mean of their coefficient alpha reliabilities, and then estimated a reliability for the two computer-delivered measures in the same way. To estimate the relationship between the computer-delivered and paper-and-pencil measures, we computed the paper-computer correlation for each of the four administration orders and then took the mean of these four values using the r-to-z transformation.

³ We reran this regression including participants who had been eliminated because they either did not have quantitatively oriented undergraduate majors or they were not close to making the transition to graduate school. Even with this larger and more diverse sample (n = 219), the results were substantively identical to those presented here.
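
The attenuation correction described in footnote 2 can be sketched as follows; only the coefficient alphas come from Table 2, and the four per-order correlations are hypothetical placeholders, so the result only approximates the reported value.

    # Attenuation correction as described in footnote 2. The alphas are from
    # Table 2; the per-order correlations are hypothetical placeholders.
    import numpy as np

    alpha_computer = np.sqrt(0.85 * 0.79)  # geometric mean of the two computer forms
    alpha_paper = np.sqrt(0.83 * 0.80)     # geometric mean of the two paper forms

    order_rs = np.array([0.78, 0.80, 0.79, 0.81])  # hypothetical per-order values
    mean_r = np.tanh(np.arctanh(order_rs).mean())  # average via r-to-z transformation

    corrected = mean_r / np.sqrt(alpha_computer * alpha_paper)
    print("observed r = %.2f, corrected for attenuation = %.2f" % (mean_r, corrected))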


Polled as to whether they would prefer to take an ME test on computer or paper, 77% opted for

paper-and-pencil and only 7% chose computer. Consistent with this preference, 44% of

participants felt that taking the test on computer was more tiring than taking it on paper, compared with

15% who found it more tiring on paper and 41% who believed the two modes were equivalent. Finally,

48% thought that, had the test been real, they would have been more anxious about taking the computer-

delivered test than they would the paper-and-pencil form; 43% would have felt about as anxious

either way, and only 8% would have felt less anxious with the computer version.

Conclusion

This study found no strong evidence to support the hypothesis that individual differences in

facility with the ME computer interface would affect performance on open-ended, computerized

mathematics tasks. Mean performance, reliability, and relations with other variables were closely similar

for both paper-and-pencil and computerized test modes. Although one computer-based test form appeared

speeded relative to its paper-and-pencil counterpart, the reverse was true for the second test form,

weakening any claim that speededness might be a result of lack of interface familiarity. Regression

results also showed no signs of irrelevant variance connected with the ME interface. Our edit/entry test

added nothing to the prediction of computer-based mathematical performance and, indeed, had about the

same level of zero-order relationship to the computer-based ME test as it did to the paper-and-pencil one.

These results complement the indirect evidence, reported by Bennett et al. (1997), that ME items function

similarly to other computer-based response types (including multiple-choice) written to test advanced

mathematical content.

Whereas the statistical evidence does not support the presence of an interface competency effect,

examinee perceptions did suggest that the interface was not always easy to use. This perception came

through most clearly with respect to the use of the on-screen palette, the method by which examinees

create mathematical expressions. Using this palette is clearly more time-consuming and cumbersome than

writing an expression by hand, especially if the expression is a complex one.

To better understand this phenomenon, we retrospectively sampled examinee paper-and-pencil

responses and then tried to enter them on computer, finding that some paper responses were, in fact, too

long for the on-screen answer box (see Table 5). We suppose that some examinees tried to enter such

expressions on the computer-based ME test, but were forced to reformulate them to make them fit the


required frame. If this is so, these individuals were able to complete this reformulation quickly enough to

avoid a negative impact on their scores (which we otherwise should have detected in our statistical

analyses). With more stringent time limits than those imposed here, however, an effect might well have

appeared.

The fact that some students had difficulty with the interface suggests that we should continue our

efforts to improve it, or at least that we should make sure time limits are generous enough to allow for the

mechanics of responding using the interface. In the end, however, it is hard to envision a mouse-driven

interface that is as natural for entering mathematical expressions as paper and pencil. Given that, the ideal

solution may be handwriting the expression on some digital surface that recognizes free-form symbolic

input and that is connected to the computing device on which the testing software resides. This concept is

evident in today’s personal digital assistants, which recognize a form of textual entry.

While the current findings provide some insights, this study had several limitations. First, the

sample size was relatively small, so marginal effects could not easily be detected. Second, for those who

did report GRE quantitative scores, the mean was unusually high. Thus, our findings may not be

generalizable to students with lower mathematical ability levels; such students might experience greater

difficulty with the ME interface. Third, our failure to find support for the irrelevant-variance hypothesis does not confirm that such contamination is absent, as the null hypothesis cannot be proven.

Finally, this study needs to be viewed as one part of a larger validation program. The study is

meaningful only in the context of theoretical rationales and empirical results that converge to support a

larger validity argument (Messick, 1989). As a response type, ME is characteristic of a growing class of

open-ended computer-based tasks. The larger validity argument for these tasks begins with the contention

that, by their open-ended nature, they replicate some of the complexity inherent in the problems

encountered in academic and work settings. At the same time, however, our renditions of these tasks can

add irrelevant complexity in, among other things, the way we structure the human-computer interaction.

This research highlights the need to approach with care how we render those tasks and illustrates one

method of monitoring the success of our development efforts.


Tables and Figures

Table 1. A Mathematical Expression Key and Example Responses

Mathematical Expression key

    (m - 2p)(n - 2p) / 4

Some example correct responses

    (n - 2p)(m - 2p) / 4
    .25(-2p + m)(-2p + n)
    p^2 - pn/2 - pm/2 + mn/4

Table 2. Means, Standard Deviations, and Coefficient Alpha Reliabilities for Mathematical Expression and Edit/Entry Tests

Test                   Mean    Standard deviation    Coefficient alpha
ME test 1
  Computer-based       10.07   4.09                  .85
  Paper-based          10.51   3.83                  .83
ME test 2
  Computer-based        7.34   3.58                  .79
  Paper-based           7.63   3.70                  .80
Edit/entry test         6.29   2.56                  .72

Note. Each Mathematical Expression (ME) test contained 16 items. The edit/entry test included 10 items. Eighty-nine participants took each ME test, while all 178 participants took the edit/entry test.


Table 3. Correlations Between the Mathematical Expression Test, the Edit/Entry Test, and Other Variables

                          ME paper   Edit/entry            GRE quant.   Undergrad.             Level of
                          version    test         UGPA     score        major        Gender    education
ME -- computer version    .78**      .08          .46**    .55**        -.11         .25**     .41**
ME -- paper version                  .10          .43**    .52**        -.13         .27**     .36**
Edit/entry test                                   .09      .18          -.08         .09       .12
UGPA
GRE quantitative score                                                  -.03         .15       .26*
Undergraduate major                                                                  -.14      .13
Gender                                                                                         .21**

Note. All correlations are based on a sample size of 177-178, except for those with GRE quantitative score, which are based on 75 participants. Undergraduate major was coded as engineering (0) versus other (1). Gender was coded as female (0) versus male (1). Level of education was coded as college senior (0) versus first-year graduate student (1).
* p < .05. ** p < .01.

Table 4. Hierarchical Multiple Regression of Computer-Based Mathematical Expression Scores on Paper-and-Pencil Scores, Background Variables, and Edit/Entry Test*

Block and independent variable       R²        Increment in R²
1. ME -- paper version               .61***    .61***
2. Background data                   .64***    .03**
     Gender
     Undergraduate major
     Level of education
3. Edit/entry test                   .64***    .00

* n = 178. ** p < .01. *** p < .001.


Table 5. An Example Paper-and-Pencil Response That Would Not Have Fit in the Computer-Based Mathematical Expression Answer Box

Example paper response

    (c1 - (c1 + c2 + c3 + c4)/4)^2 + (c2 - (c1 + c2 + c3 + c4)/4)^2 + (c3 - (c1 + c2 + c3 + c4)/4)^2 + (c4 - (c1 + c2 + c3 + c4)/4)^2

Figure 1. The Mathematical Expression interface with an example item and a correct response. Copyright © Educational Testing Service, 1996.


References

Bennett, R. E. (1993). On the meanings of constructed response. In R. E. Bennett & W. C. Ward (Eds.), Construction vs. choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 1-27). Hillsdale, NJ: Lawrence Erlbaum Associates.

Bennett, R. E., Steffen, M., Singley, M. K., Morley, M., & Jacquemin, D. (1997). Evaluating an automatically scorable, open-ended response type for measuring mathematical reasoning in computer-adaptive tests. Journal of Educational Measurement, 34, 163-177.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Graduate Record Examinations Board. (1997). Sex, race, ethnicity, and performance on the GRE General Test: A technical report. Princeton, NJ: Educational Testing Service.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: Macmillan.


Author Note

Correspondence concerning this article should be addressed to Ann Gallagher, MS 17R, Educational Testing Service, Princeton, NJ 08541; or [email protected].


Appendix

Mathematical Expression Interface Questionnaire


The following set of questions asks about your reaction to the computer administration of the Mathematical Expression items.

1. In answering the questions on this test, how easy was it to use the computer?
   1. Easy (go to question 3): 101 (56.7%)
   2. Somewhat difficult: 74 (41.6%)
   3. Very difficult: 3 (1.7%)

2. If you found it "Somewhat difficult" or "Very difficult" to use the computer: (Circle all that apply.)
   1. The computer screens were confusing (difficult to interpret): 13 (7.3%)
   2. The on-screen keyboard made it difficult to change or enter my answer: 52 (29.2%)
   3. The mouse was hard to use: 14 (7.9%)
   4. Any other problems:
      a. Tool was slow: 16 (9.0%)
      b. Rather skip problems than review: 41 (23.0%)
      c. Wanted hints to solve: 1 (0.6%)
      d. Usually write directly on problem: 8 (4.5%)
      e. Required extra time to copy from paper to computer: 11 (6.2%)
      f. Hand-eye coordination problem: 1 (0.6%)
      g. Restrictions on format: 14 (7.9%)
      h. Rather use real keyboard: 22 (12.4%)
      i. Tiring on eyes: 1 (0.6%)
      j. Difficult reading from screen and paper: 2 (1.1%)
      k. Wished for scrap paper: 1 (0.6%)

3. Did you have difficulties entering any of the following? (Circle all that apply.)
   1. I had no difficulties entering any of the following: 86 (48.3%)
   2. Fractions: 54 (30.3%)
   3. Exponents and/or subscripts: 31 (17.4%)
   4. Expressions involving square roots: 20 (11.2%)


4. Did you have any trouble seeing the words and/or symbols on the screen?
   1. No (go to question 6): 168 (94.4%)
   2. Yes: 9 (5.1%)
   3. Did not answer: 1 (0.6%)

5. If you answered "Yes," which of the following were problems for you? (Circle all that apply.)
   1. The size of the type: 2 (1.1%)
   2. Too many words on each screen: 3 (1.7%)
   3. The contrast (the brightness of the letters against a dark background): 5 (2.8%)
   4. The lighting in the room causing glare on the screen: 4 (2.2%)

6. Which statement(s) reflect your reaction(s) to the explanation of how to use the mouse? (Circle all that apply.)
   1. Adequate explanation; I wouldn't change it: 40 (22.5%)
   2. I already knew the information presented from past experience: 137 (77.0%)
   3. Too long, too much information: 22 (12.4%)
   4. Too little opportunity to practice: 4 (2.2%)
   5. Information is not clear: 4 (2.2%)

7. If you could take a computer test or a paper-and-pencil test that covered the same material as this test, which would you prefer to take?
   1. Computer test: 13 (7.3%)
   2. No preference; either is fine: 28 (15.7%)
   3. Paper-and-pencil test: 137 (77.0%)

8. Compared to answering paper-and-pencil questions of the same length, these computerized questions were:
   1. Less tiring than answering paper-and-pencil questions: 27 (15.2%)
   2. About as tiring as answering paper-and-pencil questions: 73 (41.0%)
   3. More tiring than answering paper-and-pencil questions: 78 (43.8%)


9. If this computer-based test had been a real test (one that counted), how anxious would you have been compared with taking a real paper-and-pencil test?
   1. Less anxious than taking a paper-and-pencil test: 15 (8.4%)
   2. About as anxious as taking a paper-and-pencil test: 77 (43.3%)
   3. More anxious than taking a paper-and-pencil test: 86 (48.3%)

The next set of questions asks about your computer experience.

9. Have you used a computer before?
   1. Yes: 177 (99.4%)
   2. No (go to page 4, the "Background Questions" section.): 1 (0.6%)

10. For what kinds of activities do you use a computer? (Circle all that apply.)
   1. Graphics: 122 (68.5%)
   2. Games: 135 (78.8%)
   3. Statistical Analysis: 87 (48.9%)
   4. Spreadsheets: 126 (70.8%)
   5. Database Management: 49 (27.5%)
   6. Word Processing: 165 (92.7%)
   7. Other (programming): 134 (75.3%)

11. For which of the following have you used a computer? (Circle all that apply.)
   1. School: 173 (97.2%)
   2. Work: 143 (80.3%)
   3. Personal: 157 (88.2%)
   4. Hobbies: 101 (56.7%)

12. How often do you use a computer?
   1. Routinely (almost daily use): 178 (100.0%)
   2. Regularly (some time each week): 0 (0.0%)
   3. Rarely (only a few times in the last five years): 0 (0.0%)


13. When you use a computer, how often do you use a mouse?
   1. Routinely (almost always uses the mouse to perform functions): 177 (99.4%)
   2. Regularly (sometimes uses the mouse to perform functions): 1 (0.6%)
   3. Rarely (usually uses the keyboard to perform functions): 0 (0.0%)
   4. Never: 0 (0.0%)

14. Do you own a personal computer?
   1. Yes, IBM/IBM Compatible: 93 (52.2%)
   2. Yes, Mac/Apple: 13 (7.3%)
   3. Yes, Other: 3 (1.7%)
   4. No: 69 (38.8%)

15. If you answered "No" to the previous question, do you have a personal computer available for your use?
   1. Yes: 62 (34.8%)
   2. No: 7 (3.9%)
   3. Did not answer: 109 (61.2%)


Background Questions

1. Gender
   1. Male: 64 (36.0%)
   2. Female: 114 (64.0%)

2. Do you understand English as well as or better than any other language?
   1. Yes: 160 (89.9%)
   2. No: 18 (10.1%)

3. How do you describe yourself?
   1. African-American/Afro-American/Black (non-Hispanic): 9 (5.1%)
   2. American Indian/Native American/Alaska Native: 0 (0.0%)
   3. Asian American/Pacific American/Pacific Islander American: 27 (15.2%)
   4. Caucasian/White (non-Hispanic): 103 (57.9%)
   5. Hispanic/Latino/Chicano/Mexican American/Puerto Rican: 23 (12.9%)
   6. Other: 13 (7.3%)
   7. Did not answer: 3 (1.7%)

4. Are you a U.S. citizen or resident alien?
   1. Yes: 140 (78.7%)
   2. No: 38 (21.3%)

5. Please indicate any permanent disabilities you have. (Circle all that apply.)
   1. None: 130 (73.0%)
   2. Physical disability: 1 (0.6%)
   3. Learning disability: 1 (0.6%)
   4. Deafness or other hearing impairment: 0 (0.0%)
   5. Visual impairment (other than blindness), including glasses or contact lenses: 43 (24.2%)
   6. Blindness: 0 (0.0%)
   7. Did not answer: 3 (1.7%)


6. What is your current educational status?
   1. Senior: 101 (56.7%)
   2. First-year graduate student: 69 (38.8%)
   3. Summer after senior year: 8 (4.5%)

7. Undergraduate Major
   1. Electrical Engineering: 27 (15.2%)
   2. Chemical Engineering: 13 (7.3%)
   3. Mechanical Engineering: 23 (12.9%)
   4. Civil Engineering: 14 (7.9%)
   5. Industrial Engineering: 4 (2.2%)
   6. Other Engineering: 13 (7.3%)
   7. Computer Science: 15 (8.4%)
   8. Mathematics: 40 (22.5%)
   9. Physical Sciences: 29 (16.3%)

8. What is your overall undergraduate grade point average to date? (based on a system where 4.0 = A)
   1. 3.5 - 4.0: 73 (41.0%)
   2. 3.0 - 3.49: 59 (33.1%)
   3. 2.5 - 2.99: 35 (19.7%)
   4. 2.0 - 2.49: 9 (5.1%)
   5. 1.5 - 1.99: 1 (0.6%)
   6. Below 1.5: 0 (0.0%)
   7. Did not answer: 1 (0.6%)

9. Are you a graduate student or do you plan to apply to graduate school?
   1. Yes: 150 (84.3%)
   2. No (go to question 11): 28 (15.7%)


10. If YES, in which of the following major fields?
   1. Electrical Engineering: 22 (12.4%)
   2. Chemical Engineering: 7 (3.9%)
   3. Mechanical Engineering: 13 (7.3%)
   4. Civil Engineering: 9 (5.1%)
   5. Industrial Engineering: 8 (4.5%)
   6. Other Engineering: 10 (5.6%)
   7. Computer Science: 13 (7.3%)
   8. Mathematics: 25 (14.0%)
   9. Physical Sciences: 24 (13.5%)
   10. Biological Sciences: 4 (2.2%)
   11. Economics: 1 (0.6%)
   12. Other: 14 (7.9%)
   13. Did not answer: 28 (15.7%)

11. If you plan to apply to graduate school, which graduate degree will you seek?
   1. Master's degree: 83 (46.6%)
   2. Doctoral degree: 63 (35.4%)
   3. Did not answer: 32 (18.0%)