

A Comparison of Open-Ended and Multiple-Choice Question Formats for the Quantitative Section

of the Graduate Record Examinations General Test

Brent Bridgeman

GRE Board Report No. 88-13P

April 1993

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, N.J. 08541


The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.

Copyright © 1993 by Educational Testing Service. All rights reserved.


Abstract

Free-response counterparts to a set of items from the quantitative section of the Graduate Record Examinations General Test were developed. Examinees responded to these items by gridding numerical answers on a machine-readable answer sheet in essentially the same way social security numbers are now gridded. In addition, the answer grid included special symbols for a negative sign (-), a decimal point (.), a division sign (/), and a variable (k).

The test section with the special answer sheet was administered at the end of a regular GRE administration. Test forms were spiraled so that random groups received either the grid-in questions or the same questions in a multiple-choice format. Because the amount of extra time required to grid in responses was unknown, both a long and a short form were developed, with students randomly receiving one or the other. Sample size for these comparisons was 12,465. In a separate data collection effort, 364 paid volunteers who had recently taken the General Test used a computer keyboard to enter answers to the same set of questions.

Correct and incorrect answers chosen in the multiple-choice format were compared with the answers generated in the free-response format. Three-parameter item characteristic curves were compared, as were correlational patterns with other test scores and undergraduate grade point average. Format by gender and format by ethnicity (Asian American, Black, Hispanic, and White) interactions were explored with analyses of covariance.

Despite the format differences noted for individual items, total scores for the multiple-choice and free-response tests demonstrated remarkably similar correlational patterns, and there were no significant interactions of test format with either gender or ethnicity.


Acknowledgments

The author wishes to thank Fred Fischer of GRE test development for his help developing the open-ended items and creating scoring rules for them. Thanks to Caryn Ashare, Joyce Gant, Candus Hedberg, and staff at the ETS regional offices in Austin, Evanston, Princeton, and Washington for their assistance in the data collection for computer-administered tests, and thanks to Susan Vitella and Ruth Spitz for developing the scannable answer sheets and coordinating the distribution of the experimental forms to the GRE testing centers. The computer program for the delivery of the computer items was ably written by Jeffrey Jenkins. Thanks to several people who made important contributions to the data analyses: study files were matched with GRE files by Patricia Lynn, Min hwei Wang performed the LOGIST analyses, and Ming mei Wang assisted in the interpretation of the results of those analyses; most other analyses were performed by Inga Novatkoski. Thanks to the GRE Research Committee for their financial and moral support. Finally, special thanks to the thousands of GRE candidates who worked conscientiously on test items that they knew would not affect their scores.


Introduction

Quantitative items presented in an open-ended response format offer at least three major advantages over their multiple-choice counterparts. First, they reduce measurement error by eliminating random guessing. This is particularly valuable in an adaptive testing situation where branching decisions might be made on the basis of responses to one or two items. Second, they eliminate the unintended corrective feedback that is inherent in multiple-choice items. If the answer computed by the examinee is not among the answer choices in a multiple-choice format, the examinee knows that an error was made and may try a different strategy to compute the correct answer. This type of feedback is not available with open-ended questions. A third advantage of open-ended quantitative items is that problems cannot be solved by working backward from the answer choices. For example, an algebra problem such as 2(x + 4) = 38 - x becomes a much simpler arithmetic problem if the examinee can just substitute the possible values of x given in the answer choices until the correct value is found. Because this last advantage makes test items more like the kinds of problems students must solve in their academic work, it enhances the face validity and should also improve the construct validity of the test.

The inability to work problems backward in an open-ended format is not a trivial difference; multiple-choice and open-ended versions of an item may require very different sets of skills. For example, consider the following problem from an actual General Test (GRE 83-1, section 4, item 30):

A car is traveling at an average speed of 80 kilometers per hour. On the average, how many seconds does it take the car to travel K kilometers? (A) K/80 (B) K/45 (C) 4K/3 (D) 45K (E) 288,000K

The student who has no idea how to solve this problem numerically can still get the right answer merely by knowing (or assuming) that 80 kilometers per hour is a reasonable highway speed for a car and substituting 1 for K. The question then is simply to find a reasonable amount of time for the car to go one kilometer. Would a car go one kilometer in 1/80 of a second, or in 1/45 of a second, or in 288,000 seconds? The only answer that is even close to being reasonable is that the car could go one kilometer in 45 seconds. This strategy would be worthless in an open-ended format where no answer choices were given and a precise numerical answer was required. In fairness, it must be noted that apparently very few examinees actually use the strategy outlined above. Only 22% of the examinees got this item right when it was administered in December of 1982, and on a five-choice test with no penalty for guessing, 20% could be expected to get it right by chance alone.
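For comparison, the direct solution is a single unit conversion:

\[ t = \frac{K\ \text{km}}{80\ \text{km/h}} \times 3600\ \frac{\text{s}}{\text{h}} = 45K\ \text{seconds}, \]

which is choice (D).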

Although there are sound logical grounds for supposing that the cognitive demands of open-ended and multiple-choice quantitative items could be quite different, empirical evidence that the two item types assess distinctive traits is lacking. A recent review by Traub and MacRury (1990) suggests that there is evidence that some free-response essay tests measure different abilities from those measured by multiple-choice tests but that when


the free response is a number or a few words, format differences may be inconsequential. The one study they cited that focused on mathematical reasoning (Traub & Fisher, 1977) found that there was no evidence that multiple-choice and open-ended mathematics tests measured different traits in eighth-grade students.

Open-ended items in the quantitative area that require a numerical response (or a very simple formula) can be administered either by computer or by use of a scannable answer sheet that can be incorporated into regular paper-and-pencil test administrations. Numerical answers can be accommodated on the scannable answer sheet in the same way social security numbers are entered. Experimental forms of the mathematical portion of the Scholastic Aptitude Test in which examinees grid in four-digit answers (including decimal points and the division sign "/") have already been tested (Braswell, 1990).

The current study investigated open-ended items in both a scannable format and a computer-administered mode. Except for the response mode, all items were closely linked to the existing discrete quantitative and data interpretation item types. The main question to be addressed was the extent to which the open-ended versions of the items paralleled the multiple-choice versions in terms of difficulty, discrimination, and correlational structure. The extent to which the new format might increase or reduce ethnic and gender differences was also investigated. Attitudes of examinees toward the computer format were determined, and practical problems in the translation of multiple-choice items to a free-response mode were identified.

Development of Scannable Items

A scannable answer sheet was developed that allows almost any question from the current GRE quantitative test to be answered in an open-ended response format. The candidate grids in a number or a simple formula rather than selecting from among answers A-E. The answer sheet accommodates decimals, fractions, negative numbers, and equations with one variable (see Appendix A). Seven experimental test forms were developed with items adapted from the discrete quantitative and data interpretation sections of two old GRE forms (83-1, section 4, and 83-3, section 3). Although the answer sheet does not include a symbol for pi, some problems using pi were adapted by asking the candidate to use a value of 3.1 for pi and compute a numerical answer. Problems that asked for interpretations of graphs were sometimes modified to allow approximate answers (e.g., "to the nearest $100, ..."). Of the 30 items in these sections, only 2 were found that could not be easily modified for the free-response format.¹ (See Appendix B for the multiple-choice and free-response versions of the items from Form 83-1 and from Form 83-3. The quantitative comparison items from these two old forms were retained, in their original multiple-choice format, on the experimental test forms.)

¹Both items that could not be adapted were from the data interpretation section and were of similar formats. One of these questions was: Which of the following trends, relative to the average costs for a new one-family home from 1949 to 1977, can be inferred from the graph?

I. The cost of labor and materials as a percent of the total cost decreased.
II. Builder's profit as a percent of the total increased 2.5 percent.
III. The cost of financing as a percent of the total cost more than doubled.

(A) III only (B) I and II only (C) I and III only .....

Development of Computer-Administered Items

The 14 items for computer delivery and scoring were identical to the items in the scannable portion of the study. For all of the responses the candidate used the computer keyboard to enter numerical answers. Other response options (e.g., moving the cursor to a point on a graph) were considered but were rejected because they unnecessarily complicated the interface; the same general skill (actually a slightly more complex skill) could be assessed by asking the candidate to input the numerical value associated with a particular feature on the graph. Although the scannable version contained sample items, there could be no check that the instructions were understood. In contrast, the computer-delivered version provided corrective feedback if input errors were made on the sample problems. For example, the correct answer to one of the sample problems was 6 1/2, and 6.5 was also accepted as a correct response. But if 61/2 were input as the answer, the computer reminded the candidate that there must be a space between the whole number and the fraction.
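A minimal sketch of this kind of answer-entry check (the space rule for mixed numbers is from the text; the parsing details and the message wording are illustrative assumptions, not the actual delivery program):

```python
from fractions import Fraction
import re

def parse_entry(raw: str):
    """Parse a keyed-in answer: integer or decimal ('6.5'), simple fraction
    ('13/2'), or mixed number with a space ('6 1/2'). Returns a Fraction,
    or None if the entry is unrecognizable."""
    s = raw.strip()
    m = re.fullmatch(r"(-?\d+) (\d+)/(\d+)", s)          # mixed number
    if m:
        whole, num, den = int(m.group(1)), int(m.group(2)), int(m.group(3))
        frac = Fraction(num, den)
        return Fraction(whole) + (frac if whole >= 0 else -frac)
    try:
        return Fraction(s)                                # '6.5', '13/2', '7'
    except (ValueError, ZeroDivisionError):
        return None

def sample_problem_feedback(raw: str, correct: Fraction) -> str:
    """Corrective feedback of the kind the tutorial gave (wording illustrative)."""
    if parse_entry(raw) == correct:
        return "correct"
    if re.fullmatch(r"-?\d\d+/\d+", raw.strip()):
        # e.g. '61/2': probably a mixed number typed without the space
        return "leave a space between the whole number and the fraction"
    return "incorrect"

correct_answer = Fraction(13, 2)                          # the sample answer 6 1/2
print(sample_problem_feedback("6.5", correct_answer))     # correct
print(sample_problem_feedback("6 1/2", correct_answer))   # correct
print(sample_problem_feedback("61/2", correct_answer))    # space reminder
```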

The sample problems were untimed. Fifteen minutes were allowed for the 14 test questions. A clock in the corner of the screen displayed the amount of time remaining. A student who found the clock display distracting could turn it off; the clock kept running even when it was not displayed.

An item could be skipped by pressing the Enter key. At the end of the test a list of all the items was displayed with the skipped items marked. The examinee could then return to any marked or unmarked item and add or change an answer as long as the time limit had not expired.

The questions were designed for delivery on networked PCs with one file server for five to nine PCs. The system used a standard keyboard and a Genius monitor; a mouse was not used.

Experimental Design for Scannable Items

The scannable items (or their multiple-choice counterparts) were administered at all of the test centers in six states as the last section of a regular GRE General Test administration. Because this last section required a special answer sheet and obviously contained experimental items, all students receiving the special answer sheet were informed at the beginning of the last section that their scores on that section would not count toward their regular GRE scores. The regular answer sheets were collected and the special answer sheets were distributed. Some test forms required examinees to use only the multiple-choice section of the answer sheet; other forms also used the grid-in section. Forms were spiraled within centers to assure random assignment to multiple-choice or grid-in versions of the test.

Experience with grid-in items from the special SAT mathematics studies suggested that grid-in items take considerably longer to answer than standard five-choice items. The physical act of gridding up to five ovals rather than a single oval for each item requires a little extra time, but the major time loss appears to be in reading the special gridding instructions and checking numerical calculations. In a regular multiple-choice format, if the answer calculated is one of the answer choices, examinees may assume that they did not make a minor computational error. But in a grid-in format there is no check for such minor computational errors, and examinees may feel a greater need to recheck all computations. Instead of the usual 15 four-choice quantitative comparison and 15 five-choice items in 30 minutes, the experimental form contained 10 four-choice quantitative comparison items and 14 grid-in items in 30 minutes (i.e., 6 fewer items than usual). As a check on whether even this shorter test was unduly speeded, a short form was developed that contained the same 14 grid-in items plus 5 four-choice quantitative comparison items, for a total of 19 items.

As a check on how knowing that a section was experimental affected candidate effort and performance, a parallel experiment was conducted in a set of centers in three different states. In these centers, the multiple-choice versions of the items from one of the experimental forms were administered as the last section of the regular General Test. For a random half of the candidates at these centers, a special instruction printed at the beginning of this last section in the test booklet informed them that the section was experimental and would not affect their scores. The other half of the candidates were not given this special instruction; they had only the standard instruction that one of the seven sections in their test contained trial items that would not affect their scores, but they had no way of knowing which of the seven sections was the experimental one.

The seven forms of the test that were administered may be summarized as follows:

Form 1 -- 30 multiple-choice items from Form 83-1, section 4; administered as part of a regular administration with no indication that the section was experimental.

Form 2 -- Same as Form 1, but with a printed instruction that the items were experimental.

Form 3 -- Same as Form 1, but with the special answer sheet. Clearly an experimental form because the regular answer sheet was collected and an announcement made before this answer sheet was distributed.

Form 4 -- 10 multiple-choice quantitative comparison items identical to 10 of the 15 items on Form 1; 14 grid-in items that had multiple-choice counterparts in Forms 1-3.

Form 5 -- Short form. Same as Form 4, but with five fewer quantitative comparison items.

Form 6 -- 30 multiple-choice items from Form 83-3, section 3.

Form 7 -- 10 quantitative comparison items from Form 6 plus 14 grid-in items with multiple-choice counterparts in Form 6.


Forms 1 and 2 were spiraled and administered in test centers in one set of states. Forms 3-7 were spiraled and administered at centers in a different set of states. Forms 6 and 7 provided a replication for Forms 3 and 4 to make certain that findings were not unique to that one set of items.

Experimental Design for Computer Items

Students who had taken the October 1989 GRE General Test, who had completed the biographical information questionnaire (BIQ), and who lived near one of the four Educational Testing Service (ETS) offices where the computer test was to be administered (Austin TX, Evanston IL, Princeton NJ, and Washington DC) were sent letters inviting them to participate in a study "designed to evaluate some new computer-delivered test items that have been developed for the GRE General Test." They were told that they would be paid $40 for the two-hour testing session that was to take place at the local ETS regional office. (The mathematics test described here was only a small portion of the total computer test, which included a number of experimental items for the analytical portion of the General Test.) Invitation letters were sent to 3,277 candidates. They were told that a limited number of testing appointments were available and would be filled on a first-called, first-scheduled basis. The available places in the four centers filled within a few weeks of the mailing. A total sample of 364 candidates was eventually tested. Although this sample was geographically diverse and represented a range of skills and background characteristics, it should not be considered a random sample because of its volunteer nature. Thus, for example, candidates who were particularly computer phobic may be underrepresented.

Results and Discussion

Data Screening for Multiple-Choice and Grid-in Items

Because examinees knew that their performance on the experimental items would not be reported to anyone, their only motivation was whatever intrinsic satisfaction they received from knowing that they had tried. In order to minimize contamination from students who did not take the test at all seriously, a number of steps were undertaken to remove totally unmotivated examinees. First, examinees who marked their answer sheets in obvious patterns were dropped (e.g., examinees who answered "A" to all items or who answered "A" to item 1, "B" to item 2, "C" to item 3, etc.). Next, examinees who attempted fewer than half of the items were dropped. (Because the GRE has no penalty for guessing, it is always in the candidate's best interest to respond to every item.) For the remaining examinees, a regression equation was created predicting scores on the experimental test from scores on the operational quantitative sections. Examinees who scored more than four standard deviations below their predicted scores were dropped. A summary of the cases deleted is presented in Table 1. These exclusion rules had to be conservative (i.e., exclude only the most obviously unmotivated cases) or subsequent format comparisons would be meaningless. Thus, for example, if regression outliers only one or two standard deviations below predicted scores were excluded from the sample, then the remaining sample would include only those students whose performance on the operational multiple-choice items was highly correlated with their performance on the grid-in items. As shown in Table 1, a maximum of 2.1% of the cases was deleted from any test form for all reasons combined.
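A compact sketch of the three screening rules described above (an illustrative reconstruction, not the original program; the report does not say which standard deviation the 4-SD cutoff used, so the residual SD is assumed):

```python
import numpy as np

def screen_examinees(responses, experimental_scores, operational_q_scores):
    """Flag obviously unmotivated examinees using the three rules in the text.

    responses: list of per-examinee answer lists, "" marking an omitted item
    experimental_scores, operational_q_scores: equal-length numeric sequences
    """
    x = np.asarray(operational_q_scores, dtype=float)
    y = np.asarray(experimental_scores, dtype=float)
    n_items = len(responses[0])
    keep = np.ones(len(y), dtype=bool)

    for i, r in enumerate(responses):
        answered = [c for c in r if c != ""]
        # Rule 1: obvious pattern marking ("A" to every item, or A, B, C, ...)
        if len(answered) == n_items and len(set(answered)) == 1:
            keep[i] = False
        if r == [chr(ord("A") + j % 5) for j in range(n_items)]:
            keep[i] = False
        # Rule 2: attempted fewer than half of the items
        if len(answered) < n_items / 2:
            keep[i] = False

    # Rule 3: regression outliers. Predict the experimental score from the
    # operational quantitative score; drop cases more than 4 SD (of the
    # residuals -- an assumption) below their predicted score.
    slope, intercept = np.polyfit(x[keep], y[keep], 1)
    residuals = y - (intercept + slope * x)
    keep &= residuals >= -4 * residuals[keep].std()
    return keep
```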

Table 1

Cases Deleted from Sample

                                          Test Form
Reason dropped                  1      2      3      4      5      6      7
Pattern marking                 1      3      2      12     13     6      8
Attempted fewer than
  half of the items             1      2      24     41     34     21     19
Regression outliers             0      2      6      2      1      4      3
Total dropped                   2      7      32     55     48     31     30
% dropped                       .1     .3     1.2    2.1    1.9    1.2    1.2
Final sample size               2,150  2,119  2,579  2,534  2,494  2,462  2,396

Comparison of Scores for Motivated and Unmotivated Students

Even after screening out the scores of the obviously unmotivated students, substantial differences might still exist between motivated and unmotivated students that could threaten the validity of the analyses of the new question formats. The comparison between Forms 1 and 2 was intended to provide evidence of the extent of such differences. Score means on Forms 1 and 2 were virtually identical, but observations by test center supervisors suggested that this portion of the experiment did not work as intended. For Form 2, the special direction that Section 7 of the test contained experimental items that would not affect the reported score was printed at the top of the first page of Section 7. The remainder of the page contained the standard directions for answering mathematics questions of the quantitative comparison type. Because most examinees were familiar with these directions from previous test preparation or from having already encountered them on the two quantitative sections they had just taken, they apparently started work on the questions without reading any of the directions. The supervisor at one test center interviewed students after they had completed the test and discovered that not one of them had noticed the special instruction.


The comparison of performance on Form 1 with performance on Form 3 provides a more valid test of the impact of motivation. Although this comparison is slightly weakened because the two forms were not administered to random samples of students within the same test centers, it has the advantage of clearly separating students who thought the test counted toward their scores from students who knew that the test was experimental. (In centers where Form 3 was administered, regular answer sheets were collected and then special answer sheets were distributed with an announcement made that the last section would contain experimental questions that would not affect the reported scores.) On average, students in the centers where Forms 1 and 2 were administered had slightly higher regular GRE quantitative scores than did students in the centers where Forms 3-7 were administered (see Table 2).

Means and standard deviations on the experimental items for males and females on Forms 1-3 are presented in Table 2; the means are based only on the 14 items used in subsequent analyses of test format (i.e., multiple-choice vs. grid-in) effects. Despite the slightly lower ability of the Form 3 students, score means for the unmotivated Form 3 students were less than one question lower than the means for the Form 1 students. After adjustment using the operational GRE quantitative and verbal scores as covariates, the intercepts differed by less than .2 of a question.

Table 2

Means and Standard Deviations, by Sex, for Forms 1-3

                     14 Common Items       GRE Quantitative
Group      n         M        SD           M        SD
Male
  Form 1   902       10.10    2.80         605      129
  Form 2   928       9.90     2.80         601      129
  Form 3   1,080     9.23     3.16         588      136
Female
  Form 1   1,245     8.68     2.76         522      117
  Form 2   1,186     8.56     2.76         522      115
  Form 3   1,494     7.90     2.94         509      119

An analysis of covariance (with GRE quantitative and GRE verbal as covariates) was conducted comparing the Form 1 scores on the 14 common items with the Form 3 scores on the same items for the 3,810 students who reported gender and ethnicity information and who indicated that English was their best language. Four ethnic categories were used: Asian American, Black, Hispanic, and White. A regression approach was used in which each main effect and interaction was corrected for every other term in the model. Despite the large sample size, not a single main effect or interaction was significant at the .05 level. For the critical interactions of form with gender and ethnicity, the p values were greater than .5. Across and within gender and ethnic groups in this study, the effect of knowing that scores did not count appeared to be minimal.
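The report predates today's statistical software, but the model it describes corresponds to an ANCOVA with fully adjusted (Type III) tests. A sketch of how such a model might be specified with statsmodels, using hypothetical column names:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data file: one row per examinee, with the 14-item score,
# test form, gender, ethnicity, and the two operational-score covariates.
df = pd.read_csv("form1_vs_form3.csv")

# Sum-to-zero contrasts so that Type III tests (each term adjusted for
# every other term, as described in the text) are meaningful.
model = ols(
    "common14 ~ C(form, Sum) * C(gender, Sum) * C(ethnicity, Sum)"
    " + gre_q + gre_v",
    data=df,
).fit()

print(sm.stats.anova_lm(model, typ=3))
```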

Three-parameter item characteristic curves (Birnbaum, 1968) for Forms 1 and 3 were computed using the LOGIST program (Wingersky, Patrick, & Lord, 1976). Theta (ability) estimates for students taking both forms were derived from the operational GRE quantitative items. For all items, the shapes of the curves were essentially identical in the respective forms. The curves are presented in Appendix C.
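Under the Birnbaum three-parameter logistic model that LOGIST fits, the probability that an examinee of ability \(\theta\) answers item i correctly is

\[ P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-1.7\,a_i(\theta - b_i)}}, \]

where \(a_i\) is the item's discrimination, \(b_i\) its difficulty, and \(c_i\) its lower asymptote; for free-response items, where random guessing is largely eliminated, one would expect \(c_i\) to approach zero.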

The finding of near equivalence between the scores of motivated and unmotivated candidates should not be generalized to other testing situations. Examinees in this study did not know that the last section would not count until immediately before they took it. Perhaps they could not quickly shut off their motivation. If they had been told well in advance of the testing day that the last section would not count they might have relaxed more quickly. Similarly, if the experimental test had been administered on a different day, or even following a lengthy break, the results might have been different.

Scoring Grid-in Responses

Although scoring the numerical answer to a quantitative problem where there is a single correct answer might appear to be very simple, for most problems there were at least four or five correct answers. For example, the answer to the first grid-in problem in Forms 4 and 5 was 145 (see Appendix B). Although examinees were instructed to right justify answers, both right and left justification were counted as correct, so bb145 (b = blank) was one correct answer and 145bb was also correct. Other correct answers included 145.0, b145b, and 00145. Twenty-four examinees answered 14145, apparently because they started to left justify the answer and then decided to right justify it without completely erasing the first attempt. This answer was counted as correct, as was 1b145 (marked by 28 examinees). The scanner was set to record all marks; problems of this type could be reduced, but not totally eliminated, if it were set to discriminate light (partially erased) marks from dark marks.

Question 4 asked for an answer "in terms of k." The intended correct answer was 3k/20, and the equivalent .15k was also acceptable, as were 0.15k, k3/20, and .3k/2. The phrase "in terms of k" is ambiguous, so answers that omitted reference to k (e.g., 3/20 and .15) were also scored as correct. The response 3/20k was a problem because the general instructions indicated that "all numbers to the right of the '/' will be treated as the denominator"; because k is not a number, it could be argued that the intended response was 3/20 k. Although 3/20k was not accepted as a correct response, this type of ambiguity should be avoided on operational tests. The symbol "k" could be removed from the answer sheet and questions rewritten in a way that did not require its use.
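A sketch of the kind of normalization such scoring implies (illustrative only; special cases noted above, such as partially erased marks like 14145 and the 3/20k ambiguity, required human judgment and are not handled here):

```python
from fractions import Fraction

def normalize_grid(raw: str):
    """Reduce a gridded response to a canonical value so that equivalent
    forms (145, 145bb, bb145, 00145, 145.0) compare equal. 'b' stands for
    a blank grid column, following the notation in the text."""
    s = raw.replace("b", "").strip()      # ignore blank columns (justification)
    if not s:
        return None
    try:
        if "/" in s:
            num, den = s.split("/")
            return Fraction(int(num), int(den))
        return Fraction(s)                # handles '145', '145.0', '.5'
    except (ValueError, ZeroDivisionError):
        return None                       # unscoreable entry

def is_correct(raw: str, key: str) -> bool:
    value = normalize_grid(raw)
    return value is not None and value == normalize_grid(key)

# the equivalent correct forms cited for the first grid-in problem
for form in ["145", "145bb", "bb145", "00145", "145.0"]:
    assert is_correct(form, "145")
```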


Question 6 was a data interpretation question based on reading a graph. In the multiple-choice version of this item all answer choices were rounded to the nearest hundred dollars. For the grid-in version, examinees were asked to answer "to the nearest hundred dollars." The intended correct answer was 5,800, but 296 examinees gave the exact answer of 5,832. Because the primary focus of the item was graph reading and simple computation, not rounding ability, the exact answer was counted as correct, although an answer incorrectly rounded to 5,900 was considered to be incorrect. A similar problem was encountered on Question 13, where calculations required the use of pi and examinees were instructed to use 3.1 for pi. Despite this instruction, some examinees apparently used 3.14 for pi. Although their answers were counted as correct, it could be argued that they should be penalized for failure to follow instructions. In physics or chemistry problems where the number of significant digits in an answer is an important issue, answers that are carried out to too many places should probably be scored as incorrect. Whenever an item allows (or requires) rounding, the instructions to the student should explicitly state that exact answers will (or will not) be counted as correct. A scoring rule must also be developed for answers such as 5,830 (Question 6) that are neither exactly correct nor rounded properly but are nevertheless close to the intended answer. For some graph reading and estimation problems a range of answers may be counted as correct.
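A scoring rule of the kind discussed for Question 6 might be sketched as follows (function and parameter names are illustrative; the report specifies only the individual scoring decisions, not a general algorithm):

```python
def score_rounding_item(value: float, intended: float, exact: float,
                        tolerance: float = 0.0) -> bool:
    """Accept the intended rounded answer, the exact unrounded answer, and
    optionally any answer within a stated range of the intended answer."""
    if value == intended or value == exact:
        return True
    return abs(value - intended) <= tolerance

# As scored operationally: 5,800 (rounded) and 5,832 (exact) both correct,
# but the incorrectly rounded 5,900 is not.
assert score_rounding_item(5800, intended=5800, exact=5832)
assert score_rounding_item(5832, intended=5800, exact=5832)
assert not score_rounding_item(5900, intended=5800, exact=5832)

# A range rule (score any answer within a specified band as correct)
# would instead accept 5,900:
assert score_rounding_item(5900, 5800, 5832, tolerance=200)
```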

Scoring Responses on the Computer Terminal

Many of the scoring issues for grid-in responses also apply to the responses entered at the computer. Thus, for example, correct answers to Question 5 included .5, 0.5, .50, and 1/2. An advantage of the computer answers over the grid-in answers is that partial erasures do not cause a problem. In addition, problems with right and left justification do not occur with computer administration. Because the computer administration included a brief tutorial on entering answers and because the examinees were given practice entering answers, entry errors could be immediately explained so that they would not recur on the actual test items.

Summary of Grid-in and Computer Responses

The seven most frequent answers for each grid-in item in Form 4 and their multiple-choice counterparts from Form 3 are summarized in Table 3. The left side of the table shows the five response options from the multiple-choice version of the test with the number and percentage of the group selecting each option. The right side of the table indicates the seven most frequent answers for each question in the grid-in format. Equivalent answers are listed together. For example, for Question 4, the answer .15k is included under 3k/20. Note that in the grid-in version of Question 7 the examinee is asked to supply only the denominator for the fraction. The number of students leaving the question blank is indicated by "--" in the table. Answers for the 364 students in the computer format are summarized in Appendix D.


Table 3

Summary of Item Responses for Multiple-Choice and Grid-in Formats

Question 1
  Multiple-choice (Form 3): 145* (2,363; 92%), 102 (65; 3%), 90 (52; 2%), 72 (44; 2%), 30 (45; 2%)
  Grid-in (Form 4): 145* (1,794; 71%), 129 (71; 3%), -- (60; 2%), 43 (45; 2%), 82 (41; 2%), 97 (34; 1%), 337 (26; 1%)

Question 2
  Multiple-choice: 65 (175; 7%), 70* (1,948; 76%), 75 (195; 8%), 93 (88; 3%), 98 (123; 5%)
  Grid-in: 70* (1,599; 63%), -- (229; 9%), 60 (131; 5%), 14 (102; 4%), 98 (58; 2%), 840 (16; 1%), 70.56 (11; <1%)

Question 3
  Multiple-choice: 54* (1,906; 74%), 48 (132; 5%), 45 (173; 7%), 27 (184; 7%), 12 (160; 6%)
  Grid-in: 54* (1,571; 62%), -- (224; 9%), 36 (114; 4%), 6 (94; 4%), 9 (28; 1%), 45 (27; 1%), 27 (27; 1%)

Question 4
  Multiple-choice: 15K (190; 7%), ? (299; 12%), 3K/20* (1,735; 68%), ? (180; 7%), ? (80; 3%)
  Grid-in: 3K/20* (725; 29%), -- (788; 31%), 3/20K (120; 5%), 5 (79; 3%), ? (64; 3%), 15K (35; 2%), ? (29; 1%)

Question 5
  Multiple-choice: .25 (178; 7%), .5* (1,836; 71%), .7 (403; 16%), 5 (123; 5%), 7 (20; 1%)
  Grid-in: .5* (1,329; 52%), -- (263; 10%), .7 (204; 8%), .05 (200; 8%), .25 (124; 5%), .07 (99; 4%), .43 (59; 2%)

Question 6
  Multiple-choice: 3,000 (50; 2%), 3,800 (110; 4%), 5,800* (2,243; 87%), 9,700 (77; 3%), 13,500 (68; 3%)
  Grid-in: 5,800* (823; 32%), 54,000 (268; 11%), -- (228; 9%), 5,400 (193; 8%), 5,900 (151; 6%), 600 (101; 4%), 5,500 (78; 3%)

Question 7 (grid-in version asks for the denominator only)
  Multiple-choice: 1/2 (138; 5%), 1/5 (206; 8%), 1/7 (151; 6%), 1/4* (1,890; 73%), 3/8 (149; 6%)
  Grid-in: 4* (1,372; 54%), -- (467; 18%), 8 (119; 5%), 5 (74; 3%), 2 (59; 2%), ? (50; 2%), ? (29; 1%)

Question 8
  Multiple-choice: 7,000 (192; 7%), 10,000 (246; 10%), 14,000 (306; 12%), 16,000 (224; 9%), 18,000* (1,505; 58%)
  Grid-in: 18,000* (687; 27%), -- (521; 21%), 20,000 (207; 8%), 19,000 (106; 4%), 17,000 (61; 2%), 8,000 (61; 2%), 2,000 (54; 2%)

Question 9
  Multiple-choice: 16* (658; 26%), 18 (261; 10%), 23 (1,302; 50%), 31 (120; 5%), 33 (118; 5%)
  Grid-in: -- (713; 28%), 23 (514; 20%), 16* (302; 12%), 15 (124; 5%), 24 (68; 3%), 25 (55; 2%), 12 (35; 1%)

Question 10
  Multiple-choice: ? (82; 3%), ? (355; 14%), ? (326; 13%), ? (195; 8%), ?* (1,533; 59%)
  Grid-in: ?* (1,456; 57%), -- (562; 22%), 2 (74; 3%), 10 (38; 1%), 5 (34; 1%), ? (28; 1%), ? (16; <1%)

Question 11
  Multiple-choice: ? (117; 5%), ? (510; 20%), ? (349; 14%), ? (198; 8%), 51/10* (1,241; 48%)
  Grid-in: 51/10* (805; 32%), -- (708; 28%), 48/5 (127; 5%), ? (73; 3%), 28/5 (37; 1%), ? (30; 1%), ? (21; <1%)

Question 12
  Multiple-choice: 60 (147; 6%), 56 (216; 8%), 53* (1,400; 54%), 47 (382; 15%), 28 (241; 9%)
  Grid-in: 53* (889; 35%), -- (876; 35%), 47 (108; 4%), 28 (46; 2%), 5 (39; 2%), 45 (32; 1%), 3 (26; 1%)

Question 13 (examinees were instructed to use 3.1 for pi)
  Multiple-choice: 2π* (931; 36%), ? (537; 21%), ? (454; 18%), ? (413; 16%), ? (67; 3%)
  Grid-in: -- (1,045; 41%), 6.2* (572; 23%), 12.4 (240; 9%), 24.8 (132; 5%), 3.1 (70; 3%), 49.6 (31; 1%), 9.3 (16; <1%)

Question 14
  Multiple-choice: K/80 (550; 21%), K/45 (347; 13%), 4K/3 (394; 15%), 45K* (597; 23%), 288,000K (467; 18%)
  Grid-in: -- (1,166; 46%), 45K* (313; 12%), 3,600K (79; 3%), 4,800 (50; 2%), K/45 (47; 2%), 28,800 (29; 1%), K/80 (28; 1%)

Note. Entries are answer (n; % of group). * indicates the correct answer; "--" indicates no response; "?" marks an answer whose printed form is illegible in the source. Multiple-choice answers are listed in option order (A-E); grid-in answers are the seven most frequent responses.


Although only the seven most frequent responses are given in Table 3, for each grid-in question there were at least 100 unique incorrect responses. In the much smaller sample of students who took the test on the computer, the number of different wrong answers ranged from 30 (item 10) to 95 (item 11).

An analysis of individual items highlights the potential impact of response format differences. Although 71% of the examinees answered the easiest item correctly in the grid-in format, this was still not nearly as high as the 92% who got it correct in the multiple-choice format. The higher percentage of examinees answering the item correctly in the multiple-choice format is caused not only by the opportunity to guess but also by the implicit corrective feedback that is part of the multiple-choice format. As shown in Table 3, not one of the five most common incorrect grid-in answers was included among the distracters for Question 1. Students who got the answer 129 in the multiple-choice format would know that they had made an error and could try again to compute the answer; without this implicit corrective feedback in the grid-in format, the students would get the item wrong. The most common incorrect answer in the grid-in format was included among the multiple-choice answer choices for only 5 of the 14 items (the comparable figure for Forms 6 and 7 was 8 out of 14). However, the corrective feedback inherent in multiple-choice questions is not necessarily a negative feature of that item type. In some cases a minor computational error can lead to an incorrect answer. When the computed answer is not included among the multiple-choice answer choices, the examinee has an opportunity to fix the minor error. In the grid-in format, minor computational errors that were intended to be only a small part of a complex conceptual problem would cause the entire problem to be scored as incorrect. Whether the cueing provided by answer choices increases or decreases the validity of a particular question, the question "Can you solve this problem?" is clearly different from the question "Can you solve this question given these distracters?"

The lack of correspondence between the incorrect answers spontaneously produced in a grid-in format and the answer choices provided in a multiple- choice format was also noted by Braswell (1990) for a set of items from the Scholastic Aptitude Test. It appears that even experienced item writers are not good at predicting the errors that students will actually make. Even post hoc explanations may be difficult. For example, on Question 1, it is easy to see how the second most popular incorrect answer (43) was calculated (by ignoring the parentheses), but it is not immediately apparent why 129 was the most popular incorrect answer. For tests that retain a multiple-choice format, it may be useful to use a free-response format on a small sample during item development in order to obtain an optimal set of incorrect answer choices for each item.

Table 4 presents the percent correct for each of the 14 questions that were common across the three formats (computer, grid-in, and multiple-choice). Because of the impact of guessing, percent correct scores were always highest for the multiple-choice format. To adjust for the impact of random guessing, an expected percent-correct score was created that estimated the percentage of students who would get the item right in the multiple-choice format, assuming that the grid-in result reflected the percentage of students who knew the correct answer and that all other students guessed randomly among the five alternatives.
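In symbols, if p is the proportion answering the grid-in version correctly, the expected multiple-choice percent correct under this assumption is

\[ E = p + \frac{1 - p}{5}. \]

For Question 1, for example, E = .71 + .29/5 ≈ .77, the value shown in Table 4.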


Table 4

Comparison of Percent Correct and Expected Percent Correct for Three Item Formats

               % Correct                              Expected      Actual minus
Item    Computer   Grid-in   Multiple-choice        % correct      expected
1       63         71        92                     77             15
2       61         63        76                     70             6
3       62         62        74                     70             4
4       34         29        68                     43             25
5       59         52        71                     62             9
6       24         32        87                     46             41
7       62         54        73                     63             10
8       28         27        58                     42             16
9       14         12        26                     30             -4
10      52         57        59                     66             -7
11      27         32        48                     46             2
12      31         35        54                     48             6
13      14         23        36                     38             -2
14      4          12        23                     30             -7

Note. "Actual minus expected" is the multiple-choice percent correct minus the expected percent correct.

As indicated in the table, the actual percent correct sometimes differed markedly from the expected value. Although students were randomly assigned to either the grid-in or multiple-choice format, the scores for the computer format came from a separate sample of paid volunteers. The mean GRE quantitative score in this group was 577 (SD = 129) compared to the GRE quantitative mean of 543 (SD = 131) in the Form 4 sample. Nevertheless, with a few exceptions, the percents correct in the grid-in and computer groups were remarkably similar. The largest difference in percent correct between the computer and grid-in groups was 8 points; for the grid-in and multiple-choice samples, a difference of at least 8 points between the actual and expected percent correct was found for 6 of the 14 questions. Thus, for individual question-level analyses, the difference between free-response and multiple-choice formats appears to be much more important than the difference between computer and grid-in administration modes, even after adjusting for the impact of random guessing.

Question 1 demonstrates how distracters can lead students to the correct answer, but Question 14 demonstrates that this does not always happen. The grid-in results suggest that about 12% of the examinees can correctly compute the answer. Assuming that 12% could also select the correct answer in the multiple-choice format, if the rest of the students guessed randomly, a total of about 30% of the group should select the right answer. But, in fact, only 23% of the group got the problem right in its multiple-choice format, suggesting that the guesses were significantly worse than random. If some of the examinees got the right answer by using a backward strategy similar to the one described in the introduction of this paper, the guesses were worse yet. A possible explanation of the relatively poor performance on Question 14 in the multiple-choice format is that candidates ran out of time so that even students who could have calculated the correct answer were forced by time pressure to guess. The very poor performance on Question 14 in the computer sample almost certainly indicates that the students taking the test in that format ran out of time.

Question 6 showed the largest format differences. It was a very difficult item in the computer and grid-in formats (24 and 32% correct, respectively) and an easy item in the multiple-choice format (87% correct). The most common incorrect answer in both the computer and grid-in formats was 54,000. This answer may reflect an ambiguity in the question. The question asked for "the average cost of financing...." By one interpretation, financing could refer to total cost rather than the intended interpretation, which was the cost of a bank loan. Because 54,000 was not one of the multiple-choice answers, any student making this interpretation would know that it was incorrect and could then apply the more limited definition of financing. But even counting 54,000 as a correct answer would bring the total percent correct in the grid-in format only to 43.

In the multiple-choice format, Question 6 could be answered correctly by a rough approximation from the graph provided. Although the grid-in version of this item did not require a precise answer, it did demand that the answer be rounded to the nearest $100. But accurate rounding is not equivalent to approximation. If instead of rounding the final answer, the 10.8% financing charge were rounded to 10% or 11%, the incorrect answers 5,400 and 5,900 would result. Although these answers appear to uncover a real mathematical misunderstanding of the consequences of rounding errors, students making this type of error would still get the item right in the multiple-choice format. Nevertheless, counting all grid-in answers between 5,000 and 6,000 as correct, as well as counting 54,000 as correct, resulted in a total percent correct of 65, which is still well short of the 87% correct in the multiple-choice format. Rather than requiring a rounded answer in the grid-in format, a better approach for problems of this type might be to ask for an approximate answer and score as correct any answer within a specified range. This approach still requires that the examinee be given some guidance as to the range of acceptable answers; an approximation that was correct only to the nearest $10,000 would not receive credit as a correct answer.


Question 10 was easier than expected in the grid-in format, possibly reflecting unintended feedback implicit in the grid-in version. Note that 30% of the students in the multiple-choice format selected an incorrect answer that included the square root of five as part of the answer. Because the grid-in directions for this item did not ask students to round answers and did not give any special instructions for dealing with square roots, students who calculated an answer that included the square root of five might have concluded (correctly) that their initial answer was incorrect.

Item Response Theory Results

Item response theory provides a method for graphically comparing the performance of items in the multiple-choice (Form 3) and grid-in (Form 4) formats. The three-parameter logistic model was used to compare items on these two forms. Ability estimates (thetas) were derived from the quantitative items in the operational portion of the General Test. The item plots are presented in Figure 1. They show the probability of answering the item correctly for students in 17 ability levels ranging from three standard deviations below the mean to three standard deviations above the mean. The item characteristic curve for the multiple-choice form is represented by a solid line and the curve for the grid-in form is a dashed line. The actual group means on the smoothed logistic curves are represented by squares for the multiple-choice groups and hexagons for the grid-in groups; the sizes of the squares and hexagons show the relative number of people in each ability category.

As suggested by the discussion of individual items in the section above, some of the curves were quite similar for the multiple-choice and grid-in groups (e.g., Questions 3 and 10) while others differed markedly (e.g., Questions 4 and 6). In some cases, the curves were about equally separated over the entire ability range (e.g., Questions 3 and 11). Other curves were relatively close in the upper ability ranges but significantly separated for lower ability students (e.g., Questions 5 and 12). It is quite clear that the change in format does not produce a uniform impact on all items or at all ability levels within the same item. Item characteristic curves were not run for Forms 6 and 7, but conventional item statistics (item difficulty and biserial correlation) revealed essentially the same pattern (or, more precisely, lack of a pattern). These indices are presented in Appendix E. Some items had nearly equivalent difficulty and discrimination in the two forms, but others displayed substantial differences. Data from both form comparisons (3 vs. 4 and 6 vs. 7) suggest that item statistics derived from a multiple-choice test administration may provide only a very crude estimate of how the item will perform in an open-ended format.

Analysis of Format Effects on Total Scores

Means and standard deviations, by gender, for the 14 items that were administered in both the multiple-choice and grid-in formats for Forms 3-7 are presented in Table 5. Forms 6 and 7 shared the same item stems, but they were different from the item stems shared by Forms 1-5. The within-form GRE quantitative means for the Form 3-7 sample ranged from 542 to 544, with standard deviations ranging from 131 to 133. Thus, the spiral randomization design was successful in assigning students of equal mathematical ability (as defined by GRE quantitative) to the various experimental forms.

[Figure 1. Item characteristic curve (ICC) plots for the 14 experimental items: Form 3 (multiple-choice) and Form 4 (grid-in).]

The gender difference, in standard deviation units (male mean minus female mean divided by pooled standard deviation), was nearly identical in the multiple-choice Form 3 (.44) and the grid-in Form 4 (.39). For the Form 6 (multiple-choice) versus Form 7 (grid-in) comparison, the gender difference was again about the same for both forms (.42 and .45 for Forms 6 and 7, respectively).

Table 5

Means and Standard Deviations of 14 Common Items, by Gender, for Forms 3-7

Group                           n        M       SD
Male
  Form 3 (multiple-choice)      1,080    9.23    3.16
  Form 4 (grid-in)              1,103    6.43    3.45
  Form 5 (short grid-in)        1,050    7.01    3.51
  Form 6 (multiple-choice)      1,063    8.62    2.68
  Form 7 (grid-in)              1,079    6.55    2.87
Female
  Form 3 (multiple-choice)      1,494    7.90    2.94
  Form 4 (grid-in)              1,425    4.98    3.96
  Form 5 (short grid-in)        1,438    5.12    3.14
  Form 6 (multiple-choice)      1,393    7.57    2.40
  Form 7 (grid-in)              1,314    5.33    2.53
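The reported gender differences can be reproduced from the table; the sketch below assumes the pooled standard deviation is the usual n-weighted pooled estimate, which the text does not state explicitly:

```python
import math

def gender_difference(m_male, sd_male, n_male, m_female, sd_female, n_female):
    """Male mean minus female mean, divided by the pooled SD."""
    pooled = math.sqrt(((n_male - 1) * sd_male**2 + (n_female - 1) * sd_female**2)
                       / (n_male + n_female - 2))
    return (m_male - m_female) / pooled

# Values from Table 5 reproduce the differences reported in the text
print(round(gender_difference(9.23, 3.16, 1080, 7.90, 2.94, 1494), 2))  # Form 3: 0.44
print(round(gender_difference(6.43, 3.45, 1103, 4.98, 3.96, 1425), 2))  # Form 4: 0.39
```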

An analysis of covariance using the regression approach (similar to the Form 1 vs. Form 3 analysis described above) comparing Form 3 (multiple-choice) with Form 4 (grid-in) indicated the expected form effect (F(1, 4044) = 213.5, p < .0001) and no significant two- or three-way interactions (all ps > .14). The replication comparison of Form 6 with Form 7 again indicated a form effect (F(1, 3862) = 24.9, p < .0001) and no significant two-way interactions. Although the three-way form x gender x ethnicity interaction was statistically significant at the .05 level in this large sample (F(3, 3862) = 2.74, p = .04), it was of little practical significance, contributing only .0007 to the squared multiple correlation (i.e., accounting for .07% of the total score variance). Means, standard deviations, and cell sizes for the gender within ethnic group analyses are in Appendix F. Given the absence of any significant interactions for the Form 3 versus Form 4 comparison and the very small size of the three-way interaction for the Form 6 versus Form 7 comparison, it would appear that differential format effects by gender or ethnicity are nonexistent or trivial.

Speededness Analysis

Grid-in test. Because gridding a number of up to five digits was presumed to take more time than gridding a single letter, Form 4 contained only 24 questions (14 in grid-in format) to be answered in 30 minutes rather than the usual 30 questions. As a check on whether even this reduced question load created an unduly speeded test, Form 5 was developed by dropping an additional five multiple-choice questions from the beginning of the test (for a total of 5 multiple-choice quantitative comparison items plus the same 14 grid-in items as in Form 4).

As can be seen in Table 5, the additional time was of minimal benefit, with average scores improving by less than one point. An analysis of covariance indicated that form effects did not significantly interact with gender, ethnicity, or gender x ethnicity (all ps > .3). On the last two items in each form, where the effects of speededness should be greatest, the percent correct in Form 5 was only slightly higher than in Form 4 (on grid-in item 13, the percent correct for Form 5 was 25 while for Form 4 it was 23; on grid-in item 14, the percent correct scores were 15 and 13 for Forms 5 and 4, respectively). The 30-minute Form 4 test consisting of 10 multiple-choice quantitative comparison questions and 14 grid-in questions appears to have been a reasonable length.

Computer test. Timing for the computer test differed from the timing for the grid-in sections because in the computer format all instructions and sample items were presented before timing of the test began. In addition, the time it takes to type an answer on a keyboard could not be directly related to the time it takes to grid a response. Experience with a few pilot subjects suggested that 15 minutes was a reasonable time limit for the 14 open-ended questions. Posttest questionnaires generally confirmed this initial judgment, although there was a suggestion that the time limit was too strict; of the 348 questionnaire respondents, 58% indicated that the time limit was "about right (just enough time to finish)," while 4% indicated that the time limit was "too long (lots of time left over)," and 38% indicated that the time limit was "too short (couldn't finish)." As noted in the section above describing performance on individual items, the very poor performance on the last item (compared to performance on this same item in the other formats) probably reflects the number of students who ran out of time. Adding two or three minutes should be sufficient to take care of the undue speededness of this section.

In a computer administration it might be possible to allow extremely generous time limits, but the possibility that the untimed items are measuring a different construct must then be addressed. Some items on a typical General Test quantitative section that can be answered through laborious calculations can be answered much more quickly by estimation or insight. If time limits are sufficiently generous, the laborious calculator and the insightful person will receive the same score.

Correlational Analyses

If quantitative scores from the grid-in and multiple-choice formats are measuring the same underlying construct, they should have similar correlations with other variables.

Table 6

Correlations, by Gender, of 14 Common Items with GRE Scores and Undergraduate Grade Point Average

Form   Group   n       GRE-Q   GRE-A   GRE-V   UGPA
1      M       815     .80     .61     .33     .25
       F       1,173   .79     .61     .40     .27
       Total   1,991   .81     .62     .38     .24
2      M       836     .83     .59     .34     .24
       F       1,111   .79     .57     .39     .24
       Total   1,952   .82     .59     .38     .23
3      M       965     .79     .54     .39     .25
       F       1,364   .77     .61     .41     .28
       Total   2,334   .79     .58     .40     .25
4      M       982     .82     .65     .40     .24
       F       1,313   .79     .65     .44     .21
       Total   2,301   .82     .65     .42     .22
5      M       944     .80     .64     .38     .26
       F       1,334   .82     .64     .48     .29
       Total   2,284   .83     .65     .44     .26
6      M       950     .77     .67     .44     .22
       F       1,282   .67     .59     .42     .19
       Total   2,235   .73     .63     .44     .19
7      M       981     .78     .67     .47     .33
       F       1,223   .77     .65     .47     .21
       Total   2,207   .79     .66     .47     .24

Note. GRE-Q: GRE quantitative; GRE-A: GRE analytical; GRE-V: GRE verbal.


As indicated in Table 6, the 14 common items from the experimental forms correlate about equally with regular GRE scores and undergraduate grade point average (UGPA, as reported on the BIQ) regardless of question format. Note particularly the correlation with GRE quantitative for Form 3 versus Form 4 and for Form 6 versus Form 7. Although the multiple-choice Forms 3 and 6 share method variance with the GRE quantitative score (they are all based on multiple-choice items), the correlations with GRE quantitative are actually slightly higher for the grid-in forms. The differences in the correlations, though small, are statistically significant in these large samples (z = 2.9 for the Form 3 vs. Form 4 comparison and z = 4.7 for the Form 6 vs. Form 7 comparison).
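The report does not name its significance test, but the standard Fisher r-to-z comparison for correlations from independent samples reproduces the reported values from the Table 6 totals:

```python
import math

def fisher_z_diff(r1, n1, r2, n2):
    """z statistic for the difference between two correlations from
    independent samples, via the Fisher r-to-z transformation."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (math.atanh(r2) - math.atanh(r1)) / se

# Total-group values from Table 6
print(fisher_z_diff(.79, 2334, .82, 2301))  # 2.90 -- reported as z = 2.9
print(fisher_z_diff(.73, 2235, .79, 2207))  # 4.75 -- reported as z = 4.7
```

(The second value agrees with the reported 4.7 within the rounding of the tabled correlations.)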

To determine whether format effects might be related to the extent to which students routinely used mathematics, the sample was divided into quantitative and nonquantitative groups. Self-reported undergraduate major field as marked on the BIQ was used to divide the sample. The categorization of the major fields is listed at the bottom of Table 7; note that some fields, such as business, that might be highly quantitative for some students but relatively nonquantitative for others were not included in either category. The test means reported in Table 7 suggest that the major field groupings functioned as intended. In each of the seven subsamples, the mean GRE quantitative score for the quantitative majors was substantially above the mean score for the nonquantitative majors, while the GRE verbal scores were uniformly higher for the nonquantitative majors. Across forms, correlations with GRE quantitative were slightly higher for the quantitative majors, probably because of the greater heterogeneity in this group, whose GRE quantitative standard deviations were about 20 points higher than those of the nonquantitative groups.

Within both the quantitative and nonquantitative groups, the overall findings noted in Table 6 were replicated; despite the method differences, the grid-in items correlated more highly with the GRE quantitative scores than did their multiple-choice counterparts. The higher correlations for the grid-in items were particularly striking in the Form 6 versus Form 7 comparison.

How could scores on a multiple-choice test (GRE quantitative) be more highly correlated with grid-in scores than with multiple-choice versions of the same questions? A possible answer to this question is that both question formats assess the same underlying construct, but that the grid-in format is not negatively influenced by the random errors introduced by guessing. This suggests that the grid-in forms might be more reliable than their multiple- choice counterparts. The coefficient alpha reliability for the multiple- choice Form 3 was .77 and for the grid-in Forms 4 and 5 the reliabilities were .78 and .81, respectively (all based on the 14 common items). A much more striking contrast was noted in the comparison of the reliabilities for the multiple-choice Form 6 and its grid-in counterpart Form 7; the alpha reliability for Form 6 was .64 and for Form 7 it was .73. Note that in Tables 6 and 7 correlations were especially low for Form 6.
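Coefficient alpha is a simple function of the item variances and the total-score variance, so reliabilities like those above are easy to reproduce from raw item data. A minimal sketch; the five-examinee response matrix is hypothetical, for illustration only:

    import numpy as np

    def coefficient_alpha(item_scores):
        # Cronbach's alpha for an (examinees x items) matrix of 0/1 scores
        x = np.asarray(item_scores, dtype=float)
        k = x.shape[1]
        item_variances = x.var(axis=0, ddof=1).sum()
        total_variance = x.sum(axis=1).var(ddof=1)
        return (k / (k - 1.0)) * (1.0 - item_variances / total_variance)

    demo = [[1, 1, 1, 0],
            [1, 0, 1, 1],
            [0, 0, 1, 0],
            [1, 1, 1, 1],
            [0, 0, 0, 0]]
    print(coefficient_alpha(demo))   # about .79 for this toy matrix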



Table 7

Means, Standard Deviations, and Correlations for Quantitative and Nonquantitative Major Fields

                              Means (SDs)                                        Correlations with 14 Common Items
Form  Group     N     GRE-Q      GRE-A      GRE-V      UGPA(c)       GRE-Q   GRE-A   GRE-V   UGPA
1     Q(a)   1,139   582 (131)  571 (124)  505 (110)   . . .         . . .   . . .   . . .   . . .
      N(b)     600   519 (117)  559 (115)  538 (110)   5.3 (1.0)     .79     .66     .52     .22
2     Q      1,125   577 (130)  569 (121)  506 (110)   5.2 (1.1)     .83     .59     .37     .28
      N        591   524 (119)  555 (124)  539 (117)   5.3 (1.0)     .79     .60     .46     .21
3     Q      1,418   563 (137)  540 (132)  490 (114)   5.0 (1.2)     .81     .59     .41     .28
      N        640   500 (119)  531 (121)  513 (119)   5.1 (1.1)     .73     .58     .45     .28
4     Q      1,411   564 (137)  542 (130)  490 (116)   5.0 (1.2)     .83     .66     .45     .26
      N        662   504 (117)  528 (123)  516 (126)   5.2 (1.1)     .77     .64     .48     .20
5     Q      1,382   565 (139)  544 (134)  491 (114)   5.0 (1.1)     .84     .67     .48     .29
      N        612   500 (120)  532 (123)  522 (120)   5.2 (1.1)     .79     .63     .50     .29
6     Q      1,357   570 (137)  550 (132)  499 (116)   5.1 (1.2)     .74     .64     .44     .22
      N        623   501 (115)  528 (121)  519 (116)   5.1 (1.1)     .67     .60     .47     .15
7     Q      1,355   569 (139)  549 (132)  498 (117)   5.0 (1.2)     .80     .69     .52     .29
      N        601   502 (115)  535 (118)  522 (112)   5.1 (1.1)     .73     .61     .45     .21

Note. GRE-Q: GRE quantitative; GRE-A: GRE analytical; GRE-V: GRE verbal.

(a) Q (quantitative) includes chemistry, computer science, mathematical science, physics, astronomy, engineering, economics, experimental psychology, industrial and organizational psychology, psychometrics, and quantitative psychology.

(b) N (nonquantitative) includes art history, art theory, art criticism, performing arts, studio arts, English language and literature, foreign language and literature, history, humanities, education administration, education curriculum and instruction, early childhood education, elementary education, communications, library science, religion, and social work.

(c) Undergraduate grade point average on a 7-point scale: D or lower = 1, C- = 2, . . . , A- = 6, A = 7.


Questionnaire Responses for Computer Format Students

In addition to the question on test timing discussed above, three questions from the questionnaires administered at the end of the computer test are relevant to the quantitative items (other questions related to the analytical items on the computer test). Question 7 asked:

Think back to when you worked on quantitative items on the regular GRE in October. Did you ever try to work backwards from the answer choices or use any other strategy that you could not use on the open- ended quantitative questions on the computer test that you just completed? (circle your choice) Yes No

Of the 321 students who answered this question, 88% answered "yes" and 12% answered "no." Question 8 asked:

Which kind of quantitative test would you rather take, the multiple-choice type on the regular GRE or the open-ended type on the computer test? multiple-choice open-ended no difference

Most students (81%) indicated that they preferred multiple-choice tests, 11% preferred the open-ended test, and 8% indicated no difference. Question 9 asked:

Which kind of quantitative test do you think is a fairer indicator of your quantitative ability, the multiple-choice test on the regular GRE or the open-ended type on the computer test? (same choices as above)

In contrast to the strong preference for the multiple-choice test expressed in the responses to question 8, students were about equally divided on the fairness issue; 43% thought multiple-choice was fairer, 41% thought open-ended was fairer, and the remaining 16% saw no difference. Braswell (1990) found a similar distinction between preference and perceived fairness in his sample of high school students.

Conclusions

At the level of the individual item, there were striking differences between the open-ended (grid-in or computer type-in) and multiple-choice formats. Some items that were relatively easy in the multiple-choice format were relatively difficult in the open-ended format. Item characteristic curves for questions in the grid-in and multiple-choice formats were nearly overlapping for some items and highly discrepant for others.
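(The item characteristic curves compared here are three-parameter logistic curves of the kind estimated by LOGIST (Wingersky, Patrick, & Lord, 1988),

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-1.7\, a_i (\theta - b_i)}},$$

where $a_i$ is the item's discrimination, $b_i$ its difficulty, and $c_i$ the lower asymptote reflecting guessing; with random guessing eliminated, the grid-in items would be expected to show $c_i$ near zero.)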



Format effects appeared to be particularly large when the multiple-choice options did not accurately reflect the errors actually made by students. In addition, it became clear that asking for a rounded answer in the grid-in format was not equivalent to asking for an approximation in a multiple-choice format. If the intent of the test is to describe the specific skills that students possess, the open-ended format seems clearly superior. For example, if a teacher or education policy maker wanted to determine how many students could answer the question "If x = 2 and y = 3, then (2x)^3 + (3y)^2 = ?", a very different answer might be obtained if the question were in a multiple-choice format, especially if the most common errors were not included among the answer choices. Such format questions would be critical for a program such as the National Assessment of Educational Progress that attempts to provide a national profile of student skills. Similarly, a testing program that sought to characterize the kinds of errors made by students could probably profit from an open-ended format, although for such purposes a format that allowed students to show the solution process would be superior to one that focused only on the final answer obtained.
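For the record, the example works out to $(2x)^3 + (3y)^2 = (2 \cdot 2)^3 + (3 \cdot 3)^2 = 4^3 + 9^2 = 64 + 81 = 145$, the keyed answer, and 145 was in fact the modal response among the computer-format examinees (Appendix D).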

Despite the impact of format differences at the item level, total test scores in the open-ended and multiple-choice formats appeared to be comparable. Both formats ranked the relative abilities of students in the same order; gender and ethnic differences were neither lessened nor exaggerated; and correlations with other test scores and college grades were about the same. For a fixed number of items, the open-ended test may be slightly more reliable because of the elimination of random guessing. But for a fixed length of time, the reliabilities would be virtually identical because the open-ended test contains fewer items.
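The fixed-time argument can be made concrete with the Spearman-Brown formula, which projects reliability under a change in test length. In the sketch below, the assumption that only 11 grid-in items fit in the time needed for 14 multiple-choice items is hypothetical; the .81 is the Form 5 alpha reported above:

    def spearman_brown(rho, m):
        # projected reliability when test length is multiplied by m
        return m * rho / (1.0 + (m - 1.0) * rho)

    # A 14-item grid-in test with alpha .81, hypothetically shortened
    # to 11 items, projects to about .77, the Form 3 multiple-choice value.
    print(spearman_brown(0.81, 11.0 / 14.0))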

More accurate descriptions of performance at the individual item level might be intrinsically interesting, and eliminating random guessing might be useful in a computer adaptive testing mode in which branching decisions are based on performance on a single item. In such situations the open-ended format has clear advantages. If only the rank ordering of students by total test scores is of interest, the two formats may be more nearly equivalent. But tests do more than assign numbers to people. They also help to determine what students and teachers perceive as important. If outsmarting a multiple-choice test is seen as an important goal, students will become poorer (and coaching schools richer) as they learn techniques for beating multiple-choice tests, techniques that are of very dubious value in real-world problem solving. The engineer in the field or the chemist in the lab is seldom confronted with five numerical answers of which one and only one will be the correct solution to the problem at hand. Test preparation for an examination with an open-ended answer format would have to emphasize techniques for computing the correct answer, not methods for selecting among five answer choices. Thus, with the grid-in format, coaching and test preparation should become more nearly synonymous with sound instructional strategies designed to foster understanding of basic mathematical concepts. Ultimately, the decision to accept or reject open-ended answer formats may rest as much on these nonpsychometric considerations as on any small differences in test reliability or validity.



References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Braswell, J. (1990, April). A comparison of item characteristics of multiple-choice and grid-in type questions. Paper presented at the annual meeting of the American Educational Research Association, Boston.

Traub, R. E., & Fisher, C. W. (1977). On the equivalence of constructed-response and multiple-choice tests. Applied Psychological Measurement, 1, 355-369.

Traub, R. E., & MacRury, K. (1990). Multiple-choice vs. free-response in the testing of scholastic achievement. In K. Ingenkamp (Ed.), Yearbook on educational measurement. Weinheim, Germany: Beltz Publishing Company.

Wingersky, M. S., Patrick, R., & Lord, F. M. (1988). A user's guide to LOGIST 6.0. Princeton, NJ: Educational Testing Service.


Appendix A

Answer Sheet for Grid-in Questions


[Side 1 of the grid-in answer sheet, with fields for test center number, registration number, and Section 7, Part A.]

Part B, Questions 1-14

Each of the questions in Part B requires you to solve the problem, write your answer in the boxes on the answer sheet, and then grid your answer.

Answers should be gridded as far to the right as possible. An answer such as 1/2 can be gridded as either .5 or 1/2, but an exact fraction is preferable to an inexact decimal. Thus, for example, 1/3 should be gridded as 1/3, not .3333. Do not use mixed numbers; 1 1/2 should be gridded as 3/2 or as 1.5. All numbers to the left of the "/" will be treated as the numerator and all numbers to the right of the "/" will be treated as the denominator. Dollar signs ($) and percent signs (%) may be omitted.

The samples below show how to grid various types of numbers.

[Sample grids illustrate an answer of 5 1/2 changed to 11/2 or to 5.5, as well as the answers 62.5, 52%, $12.30, -7, and 3K-2. Special marks are provided for the negative or minus sign (-), the division sign (/), the variable K, and the decimal point (.).]
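The parsing rule stated above (everything left of the "/" is the numerator, everything right of it is the denominator) is mechanical enough to state in a few lines of code. The sketch below is illustrative only and is not the scoring program actually used; answers involving the variable K, such as 3K-2, would have to be matched as strings and are not handled here:

    from fractions import Fraction

    def score_grid(response):
        # Parse a gridded response into an exact numeric value.
        if "/" in response:
            numerator, denominator = response.split("/")
            return Fraction(int(numerator), int(denominator))
        return Fraction(response)   # handles "1.5", "-7", "52", etc.

    # 1 1/2 gridded as 3/2 or as 1.5 receives the same score:
    assert score_grid("3/2") == score_grid("1.5")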


[Side 2 of the grid-in answer sheet: gridding columns for Section 7, Part B, with fields for the examinee's name and Social Security number.]


Appendix B

Multiple-Choice and Grid-in Versions of Experimental Items for Forms 1-7


Form 1, Form 2, and Form 3

Directions: Each of the Questions 16-30 has five answer choices. For each of these questions, select the best of the answer choices given.

16. If x = 2 and y = 3, then (2x)^3 + (3y)^2 =

(A) 145  (B) 102  (C) 90  (D) 72  (E) 30

17. A couple sold their home 5 months before the annual coverage on their homeowner's policy expired. If the company's policy is to refund 1/12 of the yearly premium for each month the policy is not used and their premium was $168 a year, how much of their payment would the insurance company return to them?

(A) $65  (B) $70  (C) $75  (D) $93  (E) $98

18. A cake is baked in a pan with a square bottom 9 inches by 9 inches. If the cake is cut into bars 1 1/2 inches long and 1 inch wide, how many bars will there be?

[answer choices illegible in source]

19. If 10xy = 3k, then [expression illegible in source] =

(A) 15k  [remaining choices illegible in source]

20. √(0.09 + 0.16) =

(A) 0.25  (B) 0.5  (C) 0.7  (D) 5  (E) 7

GO ON TO THE NEXT PAGE.


Questions 21-25 refer to the following graph.

AVERAGE COSTS FOR A NEW ONE-FAMILY HOME
[Stacked bar graph, by year from 1949 to 1977, of the average costs of a new one-family home, divided into builder's profit, financing, land, and labor and materials.]

21. In 1977, approximately what was the average cost of financing a new one-family home?

(A) $3,000  (B) $3,800  (C) $5,800  (D) $9,700  (E) $13,500

22. In 1969, the ratio of the average cost of financing a new one-family home to the average cost of land was approximately

[answer choices (fractions) illegible in source]

23. Approximately what was the increase from 1949 to 1977 in the average cost of labor and materials for a new one-family home?

(A) $7,000  (B) $10,000  (C) $14,000  (D) $16,000  (E) $19,000

24. In 1977, if the average cost of labor was 1/2 the average cost of materials for a new one-family home, then labor cost was approximately what percent of the total cost of such a home?

[answer choices illegible in source]

25. Which of the following trends, relative to the average costs for a new one-family home from 1949 to 1977, can be inferred from the graph?

I. The cost of labor and materials as a percent of the total cost decreased.
II. Builder's profit as a percent of the total cost increased 2.5 percent.
III. The cost of financing as a percent of the total cost more than doubled.

(A) III only  (B) I and II only  (C) I and III only  (D) II and III only  (E) I, II, and III

GO ON TO THE NEXT PAGE.


7 7 0 7 0 7 a 7 l 7 7

26. What Is the area of AABC above?

(A) 6+$3 (B) 3+a (C) 3 +fi 2

(D) 2 (El 1

7.

b 0

6 6 3

-+4= 5 7

9 (A)$ (B)z (0 ;

51

s1 014 Wn

5

29.

0 I3 In the figure above, if RS is the diameter of the circle and ST = 2, what is the area of the circle?

(A) 2n (B) 2&z (C) rnd?

(D) 8n (E) 8d

A car is traveling at afi average speed of 80 kilo- meters per hour. On the average, how many seconds does it take the car to travel K kilometr

(D) 45K (E) 288,OOOK

28. If 15 ; x 2 25 and y - x = 3, what is the greatest possible value of x + y 1

(A) 60 (B) 56 (0 53 (0 47 (E) 28

S T 0 P IF YOU FINISH BEFORE TIME IS CALLED, YOU MAY CHECK YOUR WORK ON THIS SECTION ONLY.

Do NUT WORK ON ANY UT’HER SECTION IN THE TEST.


Form 4 and Form 5

PART B

Directions: Each of the Questions 1-14 requires you to solve the problem, write your answer in the appropriate box on the answer sheet, and then grid your answer. See your answer sheet for gridding instructions.

1. If x = 2 and y = 3, then (2x)^3 + (3y)^2 =

2. A couple sold their home 5 months before the annual coverage on their homeowner's policy expired. If the company's policy is to refund 1/12 of the yearly premium for each month the policy is not used and their premium was $168 a year, how much of their payment, in dollars, would the insurance company return to them?

3. A cake is baked in a pan with a square bottom 9 inches by 9 inches. If the cake is cut into bars 1 1/2 inches long and 1 inch wide, how many bars will there be?

4. If 10xy = 3k, then, in terms of k, [expression illegible in source] =

5. Expressed as a decimal, √(0.09 + 0.16) =

GO ON TO THE NEXT PAGE.


Questions 6-9 refer to the following graph.

AVERAGE COSTS FOR A NEW ONE-FAMILY HOME
[Stacked bar graph, by year from 1949 to 1977, of the average costs of a new one-family home, divided into builder's profit, financing, land, and labor and materials.]

6. To the nearest $100, what was the average cost of financing a new one-family home in 1977 ?

7. In 1969, the ratio of the average cost of financing a new one-family home to the average cost of land was approximately 1 to n, where n is what whole number?

8. To the nearest $1,000, what was the increase from 1949 to 1977 in the average cost of labor and materials for a new one-family home?

9. In 1977, if the average cost of labor was 1/2 the average cost of materials for a new one-family home, then labor cost was what percent, to the nearest whole percent, of the total cost of such a home?

GO ON TO THE NEXT PAGE.


10. [Figure: triangle ABC; details illegible in source]

What is the area of ΔABC above?

11. [A sum of three fractions with numerators 6, 6, and 3; denominators illegible in source] =

12. If 15 ≤ x ≤ 25 and y - x = 3, what is the greatest possible value of x + y ?

13. [Figure: circle with points R, S, T; details illegible in source]

In the figure above, if RS is a diameter of the circle and ST = 2, what is the area of the circle? (Use 3.1 for π.)

14. A car is traveling at an average speed of 80 kilometers per hour. In terms of K, how many seconds, on the average, does it take the car to travel K kilometers?

STOP

IF YOU FINISH BEFORE TIME IS CALLED, YOU MAY CHECK YOUR WORK ON THIS SECTION ONLY. DO NOT TURN TO ANY OTHER SECTION IN THE TEST.


Form 6

Directions: Each of the Questions 16-30 has five answer choices. For each of these questions, select the best of the answer choices given.

[Questions 16-18 are illegible in the source; only scattered answer choices survive.]

19. On the average, 1/10 of one percent of the watches produced in a factory are defective. Out of a production lot of 2,000 watches, how many can be expected not to be defective?

(A) 1,800  (B) 1,980  (C) 1,988  (D) 1,996  (E) 1,998

20. A locksmith has n keys that open a total of n locks. Each key opens exactly one lock. It takes 15 seconds to determine whether a given key opens a given lock. If, after a total of 3 minutes spent exclusively on determining whether or not the first k of the n keys fit a given lock, 40 percent of the keys have been tried, then n =

(A) 12  (B) 16  (C) 18  (D) 20  (E) 30

GO ON TO THE NEXT PAGE.


Questions 21-25 refer to the following data.

THE HUTTON FAMILY'S EXPENSES FOR 1978 ON A NET INCOME OF $21,600
(The Hutton family has 4 members.)
[Horizontal bar graph of expenses, to the nearest $100, for savings, transportation, professional services, medical services, furniture, clothing, food, and rent and utilities.]

21. What was the amount of the Hutton family's expenses for rent and utilities during 1978 ?

(A) $3,800  (B) $3,900  (C) $4,000  (D) $4,100  (E) $4,200

22. If, for 8 years, the Hutton family saved the same amount of money per year as it saved in 1978, what would be the average amount saved per family member during the 8-year period?

(A) $16,200  (B) $3,600  (C) [illegible]  (D) $1,800  (E) $900

23. Approximately what percent of the Hutton family's net income was not required for rent and utilities, food, and clothing?

(A) 52%  (B) 48%  (C) 47%  (D) 46%  (E) 44%

24. Which of the following can be inferred from the graph about the Hutton family's expenses for 1978 ?

I. Monthly expenses for rent and utilities never exceeded $360.
II. No family member's clothing expenses exceeded $2,800.
III. Expenses for medical services accounted for at least 10 percent of the family's net income.

(A) I only  (B) II only  (C) III only  (D) II and III only  (E) I, II, and III

25. In 1978, if the Hutton family had also spent all their transportation money to purchase furniture, by approximately what percent would the expenses for furniture have been increased?

(A) 58%  (B) 64%  (C) 128%  (D) 158%  (E) 258%

GO ON TO THE NEXT PAGE.


26. [Figure: angles marked x°, 60°, and 135°; details illegible in source]

In the figure above, x + y =

(A) 60  (B) 135  (C) 165  (D) 195  (E) 220

27. The total number of integers between 100 and 200 that equal the cube of some integer is

(A) one  (B) two  (C) three  (D) four  (E) five

28. Of the 28 handkerchiefs in a drawer, 6 are red, 5 are blue, and the rest are white. If Bob selects handkerchiefs at random while packing, what is the least number he must remove from the drawer to be sure that he has 3 handkerchiefs of the same color?

(A) 4  (B) 7  (C) 9  (D) 13  (E) 19

29. If, for all x, 2x^2 + 7x + 6 = (2x + p)(x + q), then p + q could be

(A) 3  [remaining choices illegible in source]

30. [Figure: a large cube assembled from smaller cubes; details illegible in source]

If 7 white cubes and 20 red cubes, all of equal size, are fastened together to form one large cube, as shown above, what is the smallest fractional part of the surface area of the large cube that could be white?

(A) 4/27  [remaining choices illegible in source]

STOP

IF YOU FINISH BEFORE TIME IS CALLED, YOU MAY CHECK YOUR WORK ON THIS SECTION ONLY. DO NOT TURN TO ANY OTHER SECTION IN THE TEST.


Form 7

PART B

Directions: Each of the Questions 1-14 requires you to solve the problem, write your answer in the appropriate box on the answer sheet, and then grid your answer. See your answer sheet for gridding instructions.

[Questions 1-3, one of which refers to a coordinate-plane figure, are illegible in the source.]

4. On the average, 1/10 of one percent of the watches produced in a factory are defective. Out of a production lot of 2,000 watches, how many can be expected not to be defective?

5. A locksmith has n keys that open a total of n locks. Each key opens exactly one lock. It takes 15 seconds to determine whether a given key opens a given lock. If, after a total of 3 minutes spent exclusively on determining whether or not the first k of the n keys fit a given lock, 40 percent of the keys have been tried, then n =

GO ON TO THE NEXT PAGE.


Questions 6-9 refer to the following data.

THE HUTTON FAMILY'S EXPENSES FOR 1978 ON A NET INCOME OF $21,600
(The Hutton family has 4 members.)
[Horizontal bar graph of expenses, to the nearest $100, for savings, transportation, professional services, medical services, furniture, clothing, food, and rent and utilities.]

6. To the nearest $100, what was the amount of the Hutton family's expenses for rent and utilities during 1978 ?

7. If, for 8 years, the Hutton family saved the same amount of money per year as it saved in 1978, what would be the average amount saved per family member during the 8-year period?

8. What percent of the Hutton family's net income was not required for rent and utilities, food, and clothing? (Give your answer to the nearest whole percent.)

9. In 1978, if the Hutton family had also spent all their transportation money to purchase furniture, by what percent would the expenses for furniture have been increased? (Give your answer to the nearest whole percent.)

GO ON TO THE NEXT PAGE.


10. [Figure: angles marked x°, 60°, and 135°; details illegible in source]

In the figure above, x + y =

11. The total number of integers between 100 and 200 that equal the cube of some integer is

12. Of the 28 handkerchiefs in a drawer, 6 are red, 5 are blue, and the rest are white. If Bob selects handkerchiefs at random while packing, what is the least number he must remove from the drawer to be sure that he has 3 handkerchiefs of the same color?

13. If, for all x, 2x^2 + 7x + 6 = (2x + p)(x + q), then p + q could be what whole number value?

14. [Figure: a large cube assembled from smaller cubes; details illegible in source]

If 7 white cubes and 20 red cubes, all of equal size, are fastened together to form one large cube, as shown above, what is the smallest fractional part of the surface area of the large cube that could be white?

STOP

IF YOU FINISH BEFORE TIME IS CALLED, YOU MAY CHECK YOUR WORK ON THIS SECTION ONLY. DO NOT TURN TO ANY OTHER SECTION IN THE TEST.


Appendix C

Item Characteristic Curves for Form 1 and Form 3


ICC Plots for Experimental Items: Form 1 and Form 3

[Panels of item characteristic curves, one per item, each plotting the probability of a correct response against theta over the range -3 to 3. The size of the square (Form 1) or hexagon (Form 3) shows the relative size of the group at each theta level. One panel, labeled "omitted item," is a multiple-choice item that had no grid-in counterpart.]


Appendix D

Responses to Computer Items


Summary of Five Most Common Responses for the Computer Items

Question   Five most common responses (n, %)
 1   145 (228, 63%); -- (17, 5%); 129 (16, 4%); 82 (10, 3%); 30 (9, 2%)
 2   70 (222, 61%); -- (36, 10%); 60 (16, 4%); 14 (16, 4%); 98 (7, 2%)
 3   54 (225, 62%); -- (27, 7%); 36 (13, 4%); 6 (12, 3%); 56 (6, 2%)
 4   3K/20 (123, 34%); -- (104, 29%); 5 (17, 5%); 3/20K (14, 4%); 3/5 (11, 3%)
 5   .5 (216, 59%); .05 (29, 8%); .25 (23, 6%); .7 (18, 5%); -- (18, 5%)
 6   5800 (87, 24%); 4000 (44, 12%); 5400 (37, 10%); 5900 (30, 8%); 600 (21, 6%)
 7   3 (227, 62%); -- (37, 10%); 2 (17, 5%); 5 (16, 4%); 7 (8, 2%)
 8   18000 (101, 28%); -- (46, 13%); 20000 (40, 11%); 19000 (16, 5%); 17000 (13, 4%)
 9   -- (86, 24%); 23 (69, 19%); 16 (48, 13%); 15 (34, 9%); 18 (13, 4%)
10   1 (191, 52%); -- (97, 27%); 2 (18, 5%); 5 (10, 3%); 16 (10, 3%)
11   -- (124, 34%); 51/10 (99, 27%); 48/50 (19, 5%); 1 (13, 4%); 2 (4, 1%)
12   -- (156, 43%); 53 (114, 31%); 51 (21, 6%); 47 (10, 3%); 5 (7, 2%)
13   -- (223, 61%); 6.2 (52, 14%); 12.4 (25, 7%); 3.1 (8, 2%); 24.8 (7, 2%)
14   -- (258, 71%); 45K (15, 4%); K/80 (7, 2%); 3600K (4, 1%); 80/K (3, 1%)


Appendix E

Percent Correct and Biserial Correlations for Forms 6 and 7


Percent Correct and Biserial Correlations for the 14 Common Items on Forms 6 and 7

              % Correct              Biserial r(a)
Question   Form 6   Form 7        Form 6   Form 7
 1           95       85            .49      .42
 2           85       74            .47      .44
 3           75       64            .39      .41
 4           61       44            .38      .53
 5           52       46            .56      .61
 6           86       85            .36      .47
 7           81       56            .44      .42
 8           51       18            .29      .43
 9           37        5            .26      .56
10           48       33            .31      .34
11           36       30            .42      .61
12           31       16            .30      .55
13           39       23            .28      .50
14           25       10            .22      .47

(a) Computed with item score removed from criterion score.
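These biserial correlations can be reproduced from raw scores with the standard formula, which rescales the point-biserial under the assumption of a normally distributed latent proficiency. A minimal sketch; as in the table's footnote, the criterion is the total score with the item in question removed:

    from statistics import NormalDist
    import numpy as np

    def biserial(item, total):
        # item: 0/1 scores; total: total test scores for the same examinees
        item = np.asarray(item, dtype=float)
        criterion = np.asarray(total, dtype=float) - item   # item removed
        p = item.mean()
        q = 1.0 - p
        y = NormalDist().pdf(NormalDist().inv_cdf(p))       # ordinate at the p split
        m_pass = criterion[item == 1].mean()
        m_fail = criterion[item == 0].mean()
        return (m_pass - m_fail) * p * q / (criterion.std() * y)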


Appendix F

Means, Standard Deviations, and Numbers for Gender Within Ethnic Group Analyses


Table F1

Means and Standard Deviations for the 14 Common Items on Forms 3 and 4, by Gender within Ethnic Groups

                          Form 3                   Form 4
Group             N     Mean    SD         N     Mean    SD
White      M    678      9.4    2.9      723      6.5    3.5
           F  1,037      8.2    2.7      984      5.2    2.8
Black      M     66      6.5    3.6       53      2.4    2.9
           F    103      5.4    2.6       89      2.7    2.3
Asian      M     35     10.3    3.1       35      7.5    3.2
           F     36      8.8    2.8       25      5.1    2.9
Hispanic   M     49      8.0    3.2       52      6.1    3.1
           F     64      6.7    2.7       63      3.7    2.9


Table F2

Means and Standard Deviations for the 14 Common Items on Forms 6 and 7, by Gender within Ethnic Groups

                          Form 6                   Form 7
Group             N     Mean    SD         N     Mean    SD
White      M    682      9.0    2.5      738      6.9    2.8
           F    966      7.8    2.4      927      5.6    2.5
Black      M     54      5.8    2.1       50      3.8    2.6
           F     75      5.9    2.3       94      3.3    2.2
Asian      M     39      9.3    2.7       30      7.1    3.0
           F     26      7.2    2.2       21      5.8    2.6
Hispanic   M     35      6.8    3.1       50      5.2    2.3
           F     68      6.3    2.5       55      4.9    2.5