
Language Testing in the Modern Language Journal
Author(s): Bernard Spolsky
Source: The Modern Language Journal, Vol. 84, No. 4, Special Issue: A Century of Language Teaching and Research: Looking Back and Looking Ahead, Part 1 (Winter, 2000), pp. 536-552
Published by: Wiley on behalf of the National Federation of Modern Language Teachers Associations
Stable URL: http://www.jstor.org/stable/330305

Accessed: 21/05/2014 09:04

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp.

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].

Wiley and the National Federation of Modern Language Teachers Associations are collaborating with JSTOR to digitize, preserve and extend access to The Modern Language Journal.

http://www.jstor.org

This content downloaded from 37.191.10.188 on Wed, 21 May 2014 09:04:09 AM. All use subject to JSTOR Terms and Conditions.


Language Testing in The Modern Language Journal

BERNARD SPOLSKY
Department of English
Bar-Ilan University
Ramat-Gan 52900
Israel
Email: [email protected]

The Modern Language Journal published on average two articles a year dealing with language tests in its first 80 years. This probably reflects the actual (if not the desirable) level of interest in language testing of the language teaching profession. For the early years, before more specialized journals appeared, it gives an excellent picture of the history of the field in the United States. Later, the coverage became spottier, but a number of important articles continued to appear, especially on topics like prognosis and aptitude tests, the cloze test, oral testing, and the controversy over the ACTFL Proficiency Guidelines. As a whole, the articles show a valuable concern with the use rather than the form of language tests.

WHY, IT IS REASONABLE TO ASK, WOULD anyone want to sit and read through all past issues of The Modern Language Journal? And why would the editors invite a number of scholars to do this? Part of the answer must be the normal pride of association with a continuing tradition and part, perhaps, an attempt to arrive at some kind of evaluation of the 80 years of collective and individual effort that might help to shape future policy. My task is even more difficult than that of other writers in this special issue, for testing was never a central topic of The Modern Language Journal; it is not specified in the statement of focus. Nonetheless, the MLJ has over the years published a respectable body of articles on testing, an average of just under two articles a volume that probably reflects the actual (if not the desirable) level of interest of the profession. Reading straight through the first 80 volumes with the paper whitening and crisping as one approaches the present, the language tester is pleasantly surprised to come across old favorites--classic papers on language testing--and is even from time to time rewarded by gems that have been ignored over the years by those who concentrate their reading on the pages of Language Learning,


TESOL Quarterly, Applied Linguistics, and Language Testing. It is, in fact, the articles published in the MLJ before those applied linguistics journals started to appear that give a most useful view of the growth of the field of language testing. When I was studying the history of language testing for my book (Spolsky, 1995a), I first appreciated how well the early volumes of the MLJ recorded the main trends of development and the growth of interest in and changing attitudes toward language testing among American foreign language scholars and teachers.

What I intend to do in this article is to review all the articles that have a primary focus on testing (there are many others that report the use of tests as a research method, but these I generally exclude), with a first purpose of helping other readers be more efficient in deciding what they may need to read. At the same time, I cannot avoid trying to answer the question, Was it all worth the effort? Do these papers record serious contributions to the field of language testing, or as one sometimes suspects, little exercises intended to fill up personal publication lists? We presumably will never know how many readers benefited from the articles, or how many writers received tenure or promotion as a result of their publication.

How influential can a journal like The Modern Language Journal be in advancing a field or an idea? In language testing and teaching, it seems that ideas sometimes spread in spite of lack of publication. I take an example offered in a conversation by Richard D. Lambert a few years ago. He proposed that one of the most influential federal government activities in the language teaching field was not a big, expensive program like those funded by the National Defense Education Act (NDEA) but the fairly cheap in-house activity of a handful of professionals at the Foreign Service Institute (FSI) of the U.S. Department of State who developed an internal testing system that had a revolutionary, if controversial, impact on the teaching field many years later. The Guidelines movement (see below) derived from work that was not just not expensive, but little publicized. Apart from a brief announcement in the Linguistic Reporter (Rice, 1959), the first published description was by Wilds (1975) and the first critical discussion was by Jones (1979). By that time, the FSI oral interview was well enough known by word of mouth for it to become the basis for the guidelines proposal that emerged in 1980. Even if there were no other journals, then, we need not be surprised to find that a history of language teaching or testing that draws only on the MLJ would be limited and leave gaps.

This article, then, is not a history of language testing over the past 80 years, but an attempt to appreciate the role of The Modern Language Journal in this history. It is only fair to mention that I have a bias: I believe (and have made clear in print; see, e.g., Spolsky, 1995a) that testing is an important but potentially dangerous component of language teaching. It deserves better understanding than most language teachers have of it, and it demands more careful use than most testing experts seem ready to acknowledge.

THE EARLY YEARS: THE FIRST MLJ ARTICLE ON TESTING

The earliest article on testing, by the Committee on Resolutions and Investigations of the Association of Modern Language Teachers of the Middle States and Maryland (1917), appeared in the first volume and deals with a theme that is still regularly treated but widely supposed to be an innovation of the Armed Services Training Program (ASTP) 30 years later and of the audiolingual method after that--the testing of proficiency in the spoken language. In 1913, the Association of Modern Language Teachers of the Middle States and Maryland proposed an aural and oral test in French, German, and Spanish, asking for comments on the idea from 1,000 schools. The majority of respondents agreed that a test was the only fair way to recognize and encourage the widespread direct method of teaching, which emphasized speaking the language. The still small but already influential College Entrance Examination Board approved the idea too, reported the Committee (1917), but the technical problem of achieving absolute uniformity prevented implementation. This article has the distinction of being, according to Carroll (1954), in a regrettably unpublished history of the field, one of the earliest calls in the United States for objective psychological testing. The questionnaire that had been sent out to the 1,000 schools in 1914 included three specimen tests, consisting of a 10-minute dictation, the written reproduction in the foreign language of a short prose passage read by the examiner, and written answers in the foreign language to questions read by the examiner. The Committee recognized "that no actual oral test is included in this examination" but they were confident "that no candidate could pass it who had not received abundant oral, as well as aural training" (p. 252). An earlier plan had included an individual test in pronunciation and speaking, but as this idea had not proved feasible, schools were free to administer an individual test if they wished.

Three points are worth making about this article. The first is the motivation of those proposing the test, who wanted to win back curricular control from a written examination that was working against teaching techniques that concentrated on the spoken language. The article demonstrated, thus, a struggle for power over the curriculum and recognized clearly the curricular impact (washback) of a test. This interest in impact has turned out to be a dominant theme for MLJ articles on testing, which have emphasized the use or effect of a test rather than its design. Second, the pattern revealed, in this case of feasibility overcoming desirability, was also a regular motif. Even though everyone agreed that it was beneficial to test the use of the spoken language, it turned out to be more practical to do it indirectly by using a pen-and-paper test than by having students actually speak in the test. In the interests of efficiency or feasibility, then, a quite different set of skills was tested. In testing practice, we continue regularly to find ease triumphing over principle. The third theme, appearing as early as the first year of the Journal, was the desire for an objective test, for uniformity. This regard for objectivity was to be strengthened greatly as a result of the activities of psychometrists during and immediately after the First World War. It was the successful selling of the myth of the usefulness of the Army Alpha tests (Yerkes, 1921; Yoakum & Yerkes, 1920) that established in the American public its faith not just in intelligence testing but also in the possibility of objective testing of other human abilities. It was also the continuation of a campaign, begun in 19th-century England, to establish a meritocracy based on examinations.

LANGUAGE TESTING IN THE INTERWAR YEARS

Aural and Oral Testing

Attempts to test oral and aural skills continued after WWI. Decker (1925) proposed adding a test of spoken language to the New York Regents' examinations. A number of MLJ articles took up the issue of aural and oral tests. Lundeberg (1929) listed half a dozen projects under way, including his own. Lundeberg himself, in collaboration with Tharp (1930), wrote a widely used audition test, the results of which were compared with general semester grades in classes at the Universities of Illinois and Wisconsin, and at The Ohio State University. Although forms of the Lundeberg-Tharp test correlated highly with each other (about .80), the Lundeberg-Tharp test correlated more weakly (about .50) with subjective judgments of listening ability. Scores for subjective and objective parts of the audition test correlated poorly (about .35) with final semester grades in language courses, which seemed to be affected mainly by grammar.

The most significant of the aural tests during the interwar period was the Bryn Mawr Test of Ability to Understand Spoken French (Rogers & Clarke, 1933). Originally created in 1926 by a committee of French teachers in the Philadelphia area (F. M. Clarke, 1931), the test was expanded during the fall of 1928 under the auspices of the Modern Foreign Language Study (Henmon et al., 1929). A question in French (e.g., Avec quoi écrit-on?) was read to the pupil, who marked one of five words that answered the question. The written words were in English because this was to be a test of aural ability and the pupils might not know how to read French. The test was later standardized with the assistance of French teachers in six different New York high schools. The standardized form of the Bryn Mawr test consisted of 80 sentences, with the vocabulary selected from a French word book prepared by the Modern Language Study and arranged in order of difficulty. The English answers too were restricted to the first 1,000 words of The Teacher's Word Book by E. L. Thorndike (1921). Initial administrations of the test showed a correlation of about .80 between the two forms and seemed to agree with teachers' grades. Rogers (1929), in an early reference to authenticity, concluded that "The test can be criticized on the score that it departs from the real situation, but nevertheless it is useful in enabling the teacher to discover stumbling blocks in the way of achievement" (p. 248). By 1931, the audition test had been administered to 1,300 students in 40 classes in six New York high schools (F. M. Clarke, 1931).

Seeking Objectivity

The concern for objectivity was expressed in early articles in the late 1920s. Broom and Kaulfers (1927) reported on objective tests in Spanish and French. The following year, Cheydleur (1928) reported using "new-type" objective test items for language testing. Brinsmade (1928) complained about the strong weight still given to translation in the College Board Examination and argued for using new-type tests.

How to achieve objectivity remained a problem for a number of years. Seibert and Goddard (1935), both at Goucher College, were also worried about the effect of translation tests, especially unsuitable in earlier stages of learning, and agreed with Viëtor's remark 50 years earlier that translation was to be considered an art that had no place in the schoolroom. Free composition was preferable, but scoring was difficult. Even the simplest of compositions involved a great deal besides language ability (such as imagination, clarity, organization) that compromised the purity of measurement that a tester should strive to achieve. Furthermore, there was so much variation in free composition that comparison would be almost impossible. Their placement test included four objective parts (vocabulary, grammar, reading, and aural comprehension) and a composition that required students to retell a story read in class. The most reliable kind of marking they found was to score by the density of errors, using a scale for seriousness of error of ½, 1, and 2. They found that this density of error score correlated satisfactorily (between .70 and .80) with subjective marking of the same papers by a competent teacher. Using the retelling question and scoring for density of error, they achieved correlations between the composition and the objective section of the examination of about .80.


In order to relieve fears about multiple-choice testing, Smith and Campbell (1942) compared recall items with recognition items by giving the American Council on Education Cooperative French test and a parallel recall test to 168 second-year students. The two tests, correlating above .80, seemed interchangeable to the authors of the study.

Test Development

One of the easiest things to do, it has been suggested, is to develop a new kind of test--what is hard to know is what an existing test really measures. In the 1930s, emphasis was on writing tests rather than studying their properties. Cheydleur (1931) described a placement test from the University of Wisconsin. A second article by Cheydleur (1932b) talked about grammar tests. In the same year that he completed his Ph.D. dissertation on prognosis, Kaulfers (1933) proposed a technique for testing reading comprehension.

Alongside the slow changes in the use of objective modern language tests, there continued to be proposals for improvement in the tests themselves. At the University of Chicago, there was research on the testing of vocabulary and on the value of contextualization. A 4-hour comprehensive test was developed there that included vocabulary in context, translation from French to English and vice versa, and knowledge of grammar and syntax (Haden & Stalnaker, 1934). Although the translation sections were the least reliable, the overall reliability was very high, and there was a high correlation (.83) with instructors' ratings. To study the hypothesis that objective tests for vocabulary should be contextualized, two Chicago scholars (Kurath & Stalnaker, 1936; Stalnaker & Kurath, 1935) developed two tests, one with and one without context. Very little difference was noted. The two tests correlated equally well with teachers' estimates, with IQ scores, with a comprehensive German language test, and with each other (about .95). A small majority of students preferred the contextual test. It is interesting to note that test writers still regularly ignore this finding about the irrelevance of context in vocabulary items; test-wise students have learned to ignore the surplus context and focus on the words being tested.

In a comprehension maturity test developed by Feder and Cochran (1936), each of the four responses to a multiple-choice test was differently valued. One was false, a second picked an outstanding detail of the passage, a third was more complete, and the fourth was the correct summarizing statement.

In 1937, the Committee on Foreign Languages of the American Council on Education (ACE) announced the publication of a series of Reading Tests in French and German that had been prepared by the Cooperative Test Service of the Council (Fife, 1937). In the same year, Kaulfers (1937) proposed a technical improvement in the pen-and-paper testing of French pronunciation. Students were presented with sets of five words, four of which were said to include the same vowel sound; their task was to write the word that did not have the vowel and also to write the phonetic symbol for the common sound.

During the interwar years, then, the MLJ recorded a number of the attempts being made to develop more and better objective measures of language abilities.

Prognosis or Aptitude Testing

Another early and major theme in MLJ articles on testing was language learning aptitude. Kaulfers (1930) was the first to raise the prognosis question, as it was called at that time, in the MLJ. Effective prognosis was seen as a screening device needed to keep poor students out of language classes. Kaulfers was already starting to write his Ph.D. dissertation on the topic. Cheydleur (1932a) continued the burst of activity in prognosis with complaints about "mortality" in language classes. Richardson (1933) administered the Symonds' Foreign Language Prognosis Tests (Symonds, 1930a, 1930b) to 242 first-year high school students who were planning to take foreign language courses and found a correlation of about .60 with their first semester grades. The fact that IQ scores added little to the prediction, he believed, supported the claim for the existence of special linguistic abilities. He also found that the prognosis tests gave better predictions than did intelligence tests with two cohorts of 120 high school students.

Reviewing the work of the Modern Foreign Language Study for the MLJ, Vincent Henmon (1934) argued that it had contributed usefully to answering the fundamental questions facing teachers of the modern foreign languages, had provided useful leads on prognosis and the question of who should study foreign languages, and had shown that adults can learn foreign languages better than or as well as children. This is another ignored finding; later investigators expressed surprise when they found older learners to be more successful in formal language learning than younger learners.

Continuing the study of prognosis, Wagner and Strabel (1935), among other studies of the superior student at the University of Buffalo, found that neither the American Council on Education Intelligence Tests nor the Iowa High School Content Examinations were of much predictive value, but the New York Regents' examinations in languages and high school grades in languages were good predictors of overall achievement in all subjects in the first 2 years at college and equally good predictors of grades in languages.

In two articles based on her master's thesis at the University of Minnesota, Sister Virgil Michel (1934, 1936), a teacher at St. Joseph's Academy, reported an elaborate study using both the Symonds' Foreign Language Prognosis Tests and the Iowa Foreign Language Aptitude Test. She acknowledged (1936) her inspiration by the statement by Symonds that "prognostic testing is the romantic chapter in the history of educational measurement," agreed also with the platitude that failing students should have been guided into easier classes, but judged that educational prognosis was "still in its infancy" (p. 275). From a review of earlier research, she learned that IQ scores alone were not good enough predictors of foreign language aptitude, that previous school achievement was a little better, that some special aptitude tests were still better, but that not enough multiple correlation studies were available to show the best combination of tests to predict success in foreign language learning. She therefore administered the Symonds' Foreign Language Prognosis Aptitude Test to a group of high school students and the Iowa Foreign Language Test to a smaller group of beginning college German students, and to both groups, she gave a newly devised German prognosis test that she had constructed with the assistance of Oscar Burkhard, professor of German at the University of Minnesota. The new test consisted of a memory test of short German sentences with their English translations, an analogies test of words that were cognate in German and English, and a series of German grammar rules and exercises. It proved to be reliable (.92 Spearman-Brown split half). For the high school students, none of these tests gave useful correlations with the Columbia Research Bureau German Test or with teachers' marks at the end of the first semester. Multiple correlations combining the tests did not help much to improve correlations. Sister Michel noted that the Symonds' test using Esperanto seemed not to have done as well with German as with French and Spanish. For the university students, combinations of the Iowa test (which also used Esperanto) and the German prognosis tests did achieve correlations with the end-of-semester marks, but the correlations were no higher than those using instead the high school average. The College Aptitude Test was not a predictor of German achievement. Her thesis concluded somewhat pessimistically:

In general, the experiment corroborates the findings of the majority of investigators in foreign language prognosis in so far as the correlations are rather low, in so far as predicting success in any one subject is much more difficult than prognosis of success in all subjects in high school or university, and in so far as it points with increasing insistence to the need for further research in order to secure more efficient predictive measures than those that exist at present. (as cited in Coleman & King, 1938, p. 435)

Research in prognosis did, in fact, continue to appear in the MLJ. In a study conducted at the Virginia State College (now, Virginia State University), Matheus (1937) reported on 100 students of three different languages. The George Washington University Series Language Aptitude Test (prepared by Hunt, Wallace, Doran, Buynitzky, & Scharz, 1929) correlated .66 with the ACE Psychological Examination; the two tests correlated .41 with semester grades in French, German, and Spanish. An article by Tallent (1938) was recorded by Kaulfers (1939) as the 60th article published since 1901 showing that prognostic testing cannot be relied on to solve foreign language problems. Tallent, whose article was based on her 1937 master's thesis, found that IQ scores had only a low correlation (.21) with modern language grades; English placement test grades were a better predictor (.49) of modern language grades. She surmised that the low correlations could come from the lack of validity of both the IQ tests and the modern language grades.

One article that appeared in the MLJ in 1939 looked ahead to much of the work that was to come. In it, Spoerl (1939) asked what, in fact, constituted language learning ability. Was it intelligence, or courage, or form-color preference, or memory? She tested 38 advanced German students at the American International College in Springfield, MA, using the Henmon-Nelson test of mental ability, the Allport Ascendance-Submission Reaction Study (to test attitude and openness to suggestions in the new foreign language situation), and the Revised Minnesota Paper Form Board Test (to see if form recognition was relevant). She also gave them the Cooperative German Test. Major differences emerged between men and women. The correlation between class grade and the Cooperative test score was .35 for men and .73 for women. Similarly, the correlation between the intelligence measure and the grade was .63 for women and .12 for men. Neither the test of forms nor the ascendance-submission test had a significant relation to the German scores. Her conclusion was that whereas intelligence was significant for women, it was not for men. She did not speculate on the reason.

A study of 170 high school pupils in Massachusetts by Maronpot (1939) found that the Symonds' Foreign Language Prognosis test correlated .70 with teachers' final yearly grades, a result that was better than the correlation between the grades and the general scholastic average (.51) or IQ scores (.27). In this case, the test results were not used to exclude students identified as modern language risks, but to salvage them by grouping. For what were called democratic reasons, low-aptitude students were permitted to study French but in a special class that made lower linguistic demands.

At San Francisco Junior College, Gabbert (1941) developed a prognosis test, based on the Buchanan and the Keniston word and idiom lists, that, given at the beginning of a semester, predicted (with a correlation of .69) student grades on a final examination (which included vocabulary and idioms). He proposed thus to be able to warn students at the beginning of a semester if they were likely to fail the course.

Goal of Instruction

In an essay in the MLJ, James Conant (1934), elected the year before as President of Harvard, raised the issue of goals for foreign language instruction for college students. He urged that the only language requirement that should be kept was for a reading knowledge of French or German and that the provision for an elementary knowledge of these languages or a reading knowledge of a classical language should be dropped. Moreover, he argued that the reading knowledge to be required for a specific subject, such as chemistry, might best be examined for each field. This very modest view of goals for language learning on the part of a scholar who was later to be so influential in shaping U.S. scientific and educational policy helps explain the weakness that was noticed 30 years later and led to the efforts to correct it through the National Defense Education Act.


The uncertainty about goals continued to plague language testers. Frantz (1939), a teacher of German at Bucknell University, was disturbed that many universities and colleges exempted students from further language study if they could pass a test of reading knowledge. The lack of standardization, he said, produced a situation at his institution that was confusing and that discouraged both students and instructors. The standard of attainment was loosely defined, departmental examinations were subjective, and externally produced examinations were too difficult for students. He therefore set out to study the experiences of other universities. He summarized the 39 responses he had received to a questionnaire that he had sent to 50 institutions. A bare majority (not including some major universities such as the Universities of California, Minnesota, and Pennsylvania and Princeton and Stanford Universities) did allow exemption to students who passed a special reading knowledge test. The tests that were employed were, he wrote, almost as diverse as the institutions replying. Two universities allowed students to use dictionaries during the test. Five universities emphasized reading passages. One required translation to and from German. One included sight-reading and prose composition. Another included oral and written exercises. Another required paraphrases. Four claimed their tests were standardized. Twelve prepared their own tests; four included externally prepared tests in the battery. The tests took between 1½ hours and 3 hours to administer. The pass mark and rate varied by institution, and within institutions, by field. The pass rate varied from about 50% to about 90%. Twelve institutions were satisfied with what they were doing; four were dissatisfied. Thirty institutions responded to a questionnaire asking for their definition of reading knowledge of a language, with widely disparate results. The main answers were: "ability to read or translate with understanding or give the accurate rendering of a relatively difficult text or a reasonably correct translation of a typical text without the excessive use of a dictionary" (7 responses); "ability to read and understand without using a dictionary a given passage ... of normal difficulty" (3 responses); "ability to get the sense of a moderately difficult passage ... to read a text of average difficulty at sight ... to get the main ideas of a paragraph with its essential connotations ... to read with understanding texts of both narratives and of content" (8 responses); "ability to use a language as a tool" (3 responses); "ability to understand a text ... in one's field" (3 responses; pp. 443-444). Four institutions defined reading knowledge as determined by the amount of time of study; one response said it was a "most elastic term" (p. 444), and one respondent confessed that he had no definition. Twenty of the respondents considered that the required standard could be reached in 2 years of work, ranging from 3 to 5 hours a week. Nine thought that it took longer. Frantz concluded that more satisfactory tests could only be developed when there was agreement on what was being measured.

Frantz's article deserves regular rereading because it makes a key point, namely, that the development of valid tests depends on the clear definition of what is being measured. Whereas many universities required their students to have some skills in a foreign language, few had thought through the precise specification of these abilities, leaving the decision to language departments that had little knowledge of modern language testing. His article provides further evidence of the limited respect that U.S. educators had for knowledge of a language other than English.

The First (Proficiency) Scale?

In a first sighting of a topic that was to be very controversial later, Peter Sammartino (1938) published the Language Achievement Scale that his department at Columbia University used. Quite different from the absolute scale that his colleague from across campus, Professor Edward Thorndike, was talking about at a conference in Dinard, France, in the same year (Monroe, 1939), Sammartino's scale was rather a listing of the abilities in each of seven areas that made up the curriculum: silent reading power, aural comprehension, civilization (but not literature), speaking, grammar, translation from French, and free composition. It was stated simply so that the student too could understand it, with each point labeled as a goal for general, minor, and major students. Teachers were expected to use comprehensive examinations (given each semester) and other tests to determine when the individual goals were achieved. The scale for speaking was as follows.

1. Ability to pronounce all elementary sounds as found in simple words and read sentences with correct inflection, intonation, enunciation, and pronunciation.

2. Ability to answer simple questions when there is no vocabulary or thought difficulty.

3. Ability to read extended prose and poetry in an intelligent and clear fashion.


4. Ability to give a five- or ten-minute talk (prepared) on a simple topic or about something heard or seen.

5. Ability to engage in ordinary, everyday conversation. [Up to here, the standards are for general students and minors; the next two standards are for majors.]

6. Ability to give in almost faultless French a prepared talk on a special topic in an advanced field such as the style of Victor Hugo, the painting of Renoir, etc., and also be prepared to answer questions on the topic.

7. Ability to participate in a discussion with native Frenchmen on some definite topic such as literature, economics, politics, etc. (Sammartino, 1938, pp. 430-431)

It is obvious that no attempt was made in this scale to arrive at the equal-unit intervals that Thorndike advocated as necessary for measurement. Nor does it have the kind of occupational precision that later made the FSI scale so useful. Nonetheless, it must be recognized as a clear forerunner of the scales that were to prove so attractive to the language teaching profession when the ACTFL Guidelines (1982, 1986) appeared.

THE WORLD WAR II EXPERIENCE

The most interesting discussion of oral language testing arising out of the wartime experience of the Army Specialized Training Program (ASTP; see Spolsky, 1995b) was a programmatic article in the MLJ by Walter Kaulfers (1944). Kaulfers believed that if the 1925 survey of foreign language teachers (which led to the Coleman report that was published in 1929) were to be repeated in 1944, the majority of teachers would not again put reading as the most important objective. Planes and radio having made the world smaller, there was a developing consensus that aural-oral abilities were more important than ever and that these abilities must become the primary goals of a curriculum driven by requirements outside the school. This belief was strongest, Kaulfers believed, in the ASTP Language and Area schools for whose programs he had been asked to plan progress tests. The large number of trainees meant that any test specifications must be practical; must be machine scorable as far as possible to keep down costs; must not require an elementary- and intermediate-level student to read and write the foreign language (for this would only be taught at higher levels); should demonstrate the examinee's ability to perform in a real-life situation where lack of such ability could be militarily damaging; should have items graded or scaled in difficulty; must provide scores that could be interpreted in terms of performance norms, which would make clear what a test-taker who reaches a certain point on the scale can be expected to do with the foreign language in real-life situations; should avoid the "correlation fallacy," the false notion invalidating existing standardized tests that a pen-and-pencil task correlates with the corresponding ability to speak and understand; and finally must permit uniform, standardized administration, using, for example, recorded discs or tone-control talking machines. Kaulfers suggested some examples of techniques that might be suitable. In one example, the examiner would read a question twice and the candidate would be required to mark answers to the question. In a second example, the examiner would make a statement correctly and incorrectly; the candidate would mark which was the correct version.

Kaulfers (1944) proposed a performance scale for measuring aural comprehension, as follows:

0 Cannot understand the spoken language.

1-5 Can catch a word here and there and occasionally guess the general meaning through inference.

6-10 Can understand the ordinary questions and answers relating to the routine transactions involved in independent travel abroad.

11-15 Can understand ordinary conversation on common nontechnical topics, with the aid of occasional repetition or periphrastic restatements.

16-20 Can understand popular radio talks, talking pictures, ordinary telephone conversations, and minor dialectal variations without difficulty. (p. 139)

The functionality of this scale, which is similar to Sammartino's (1938), is obvious as it reflects the probable needs of the soldier just as Sammartino's scale reflected the assumed needs of a university student. Some of the phrases that Kaulfers used in this scale turn up in the later FSI scale, thereby showing the seminal importance of his work; the techniques and approaches are very close to those that were later adopted.

Measuring oral fluency, Kaulfers (1944) believed, was unusually difficult. He argued against using pen-and-pencil tests: "It simply cannot be taken for granted that ability to express oneself in writing is correlated with a like ability to speak the language extemporaneously" (p. 140). The notion of performance testing that Kaulfers clearly appreciated is still of interest over half a century later. He proposed using the notion of plateaus (groups of three items at each level) for maximum efficiency. If an examinee failed three items in succession, the test stopped because his level had been reached. A modification of this technique is now used in computer-controlled tests. The fields tested must be functional and distinguish various areas of language fluency, that is, the ability to ask questions, to answer questions, to give directions. He proposed that norms should be established based on the performance of bilingual subjects, an early recognition of the unsuitability of using native speakers as the norm. Suitable examiners would be educated people who had lived abroad, whose standards would be based on normal and not on normative usage, and whose training would include the trial rating of phonographic recordings. The use of recordings to train oral examiners was being tried in England at about the same time (see Roach, 1945). The test would be given privately in quiet but pleasant surroundings. The examiner would be cordial but businesslike and informal. The test would begin with informal conversations to relax the examinee, followed by some practice items. These suggestions are very similar to those given to FSI oral interviewers. Kaulfers also mentioned the usefulness of recording oral tests on the paper discs used before plastic disks and tapes became common.

Kaulfers's proposals, novel though they seem in this context, were derived from his deep knowledge of prewar modern language testing and may be considered the logical continuation of it. The test that he proposed and the techniques that he recommended are remarkably close to those later adopted in the FSI oral interviews. In 1944, however, his proposals seem to have fallen on deaf ears (the ASTP did not last long enough for systematic evaluation to begin) and to have been left in the archives for future historians of the field to rediscover.

THE EARLY POSTWAR PERIOD

The level of interest in testing in the MLJ dropped for a while after WWII. Bottke and Milligan (1945) suggested some new aptitude tests but gave no results of their use. Bovee and Froelich (1946) also worked on aptitude, but added little new. The following year, Clapp (1947) meditated on placement and the time to start learning a foreign language, issues that continue to need empirical study. A year later, MLJ readers were reminded of testing by Coutant (1948), and Kaulfers (1948) touched on testing in his review of postwar teaching in California. In the same year, Miller (1948) described a test of Spanish-American culture. For the next few years, nothing of interest to the field of testing appeared.

SOME OCCASIONAL INTERESTS OF THE 1950s

Raymond (1951), in an article that foreshadowed but was ignored by proponents of the later fad for cloze tests, described a completion-type test, based on the original proposal of Ebbinghaus and derived from the Minkus Completion Test in the Revised Stanford-Binet Test. He used this test with high school and college students, recognized its similarity to an intelligence test, and considered it a better teaching than testing technique.

Interest in the spoken language continued.

Furness (1955), in a short article based on her University of Colorado doctoral dissertation, argued the need for better testing of listening comprehension. Stabb (1955) described the Colgate University oral language examination. The test called on candidates to retell written narrative, to describe pictures and photos, and to answer written questions orally. These items were scored for pronunciation as well.

Echoing an ongoing debate about including essays in native language testing, Eley (1956) noted the cost of objectivity. Objective testing was less rich than subjective testing. Machines and objective tests can avoid error, but they do not provide the critical insight of human markers. He proposed the training of judges for greater reliability.

In Borg and Goodman's (1956) article, two testers at the Lackland Base of the U.S. Air Force (where much of its English as a Foreign Language teaching was conducted) described a psychometrically designed placement test for foreign trainees. Buechel (1957), also at USAF Lackland, proposed some rating scales that would be useful for tracking student progress. Harding (1958), another scholar with the USAF, proposed tests for students for the Air Force Russian program and reported on the value of the Carroll and Sapon Psi Lambda test battery (1955) as a prognosis test.

T. Mueller (1959) proposed an "auding" test that involved focused listening ("Did I use the past tense?") and that he believed to be reliable.

Valley (1959; then at Educational Testing Service) reported that some colleges used the results of advanced credit courses and others did not.


Ayer (1960) reported that a "same-different" test seemed to correlate with "pronunciation skill." Sadnavitch and Popham (1961) reported that the Spanish Comprehension Test using pictures and oral stimuli agreed with teacher judgments.

APTITUDE EMERGES AGAIN INTO THE 1960s

Interest in aptitude reemerged in the postwar period and took on a new urgency because of the expense of government-supported intensive courses in the lesser taught languages as Americans tried to make up for the country's low level of language capacity. Salomon (1954) wrote a historical survey of prognostic testing from its beginning in 1917 up to Bottke and Milligan's (1945) work and judged that the field was "still in its infancy (or, perhaps, as much has been found out as is to be discovered)" (p. 303).

Eterno (1961), in a pilot study, found that there were relationships between pronunciation ability and musical ability. Pimsleur was starting work on his predictive battery. In an MLJ article, Pimsleur, Mosberg, and Morrison (1962a) reviewed the experimental literature to identify which personal factors might help or hinder language learning. Von Wittich (1962) used IQ scores and grade point averages to predict foreign language grades and did not think much more was needed. Blickenstaff (1963) summarized empirical studies of the correlation of musical ability and foreign language ability and found them inconclusive. Leutenegger, Mueller, and Wershow (1965) found no strong correlations between auditory skills and language learning achievement.

None of the basic work by John Carroll and Stanley Sapon that culminated in the 1957 publication of the Modern Language Aptitude Test (MLAT) appears in the MLJ, but, a decade later, Payne and Vaughn (1967) used the MLAT and found that it predicted the language learning of students who studied abroad in Florence. Pimsleur (1969) described the value of a competing test that he had developed. The preparatory research had been reported in Pimsleur, Mosberg, and Morrison (1962b) and the battery had been published in Pimsleur (1966).

A number of articles in the 1960s offered innovative testing techniques. Mendeloff (1963) claimed success for an aural comprehension test delivered via video. K. A. Mueller and Wiersma (1963) developed a speaking test that correlated with other measures. Guerra, Abrahamson, and Newmark (1964), following Kaulfers (1944), developed an oral scale for teachers to use. Di Giulio (1967) had students write down sentences they heard. McArthur (1965) analyzed a test for elementary school foreign language students to find good and bad items.

In studies of test use, Shane (1971) reported a validation study of the usefulness of the MLA Cooperative Russian Test for placement. To make it work, he noted, local norms needed to be developed. Aleamoni and Spencer (1968) used the MLA tests and had problems getting the system working.

MLJ issues 3 and 4 of Volume 53 contained articles from the seminal Bilingualism in the Barrio study conducted by Fishman and his colleagues (later published as Fishman, Cooper, & Ma, 1971), several of which showed new and sophisticated uses of tests in sociolinguistic studies. R. L. Cooper (1969), R. L. Cooper and Greenfield (1969), and R. L. Cooper, Fowles, and Givner (1969) were all concerned with aspects of testing and with exploring the complexity of the notion of bilingualism. They followed the principles sketched earlier in R. L. Cooper (1968), an article that had effectively added the sociolinguistic dimension to the testing field.

TESTING ISSUES IN THE 1970s

Selling the Cloze

When the cloze test burst upon the foreign language teaching scene at the end of the 1960s, the initial articles about it ignored Raymond (1951), who might have hoped to be remembered as the first scholar to propose the technique in the MLJ, and the doubts expressed by Carroll, Carton, and Wilds (1959), who saw that it had serious problems as a test of language proficiency. In the first of a series of MLJ articles on the cloze, Oller (1972) discussed scoring methods for English as a Second Language cloze tests and concluded in favor of marking any acceptable response correct. Stubbs and Tucker (1974) reported that the cloze test, as used at the American University in Beirut over several years, correlated with other measures and was powerful and economical. Briere, Clausing, Senko, and Purcell (1978) reported on the use of cloze tests with students of German, Chinese, Japanese, Russian, and French at the end of each of the first three semesters of study. The average scores were higher each semester for all languages other than German. Brown (1980) tried out cloze tests and found that any method of scoring (acceptable or exact response, multiple-choice or open-ended) worked. He was, however, hesitant about the cloze, not seeing any satisfactory explanation of why it worked. Thus, he seems to accept the argument by Spolsky (1971) and Oller (1976b) that it does tap overall second language proficiency, but not their explanations of why, nor is he concerned by Carroll's (1993) conclusion that a cloze test is really testing a different special ability.

Caulfield and Smith (1981) found that the cloze test, as developed by Oller, and the reduced redundancy test developed by Spolsky, Sigurd, Sato, Walker, and Aterburn (1968) and further developed by Gaies, Gradman, and Spolsky (1977), correlated about .80 with the MLA test, the latter correlating about .80 with an interview. Both the cloze and the reduced redundancy test were, of course, cheaper to administer than an interview. Lange and Clausing (1981) found that leaving blanks every nth word and scoring correct only the response from the original text produced the best German cloze tests.

Diversifying Exploration

The other articles on testing published in the MLJ in the 1970s covered a wide range of topics. Discussing the issue of test use, Taylor (1972) found problems with the proposal of K. A. Mueller (1972) to judge teachers' competence by how well their pupils perform on an achievement test (a proposal that would have restored the practice of the medieval Italian town of Treviso of paying the schoolmaster by his pupils' examination results, as Madaus, 1990, describes it).

In measuring the Japanese ability of 600 Protestant missionaries in Japan, Jacobsen and Imhoof (1974) found differences between men and women.

Politzer (1974) reported similarities in the order of learning in first and second language acquisition, encouraging the kind of test developed in the Bilingual Syntax Measure by Burt, Dulay, and Hernandez-Chavez. In a brief essay, Oller (1976a) liked the idea behind the Bilingual Syntax Measure but regretted the absence of data on its reliability.

In a pioneering study on computerized testing, T. A. Boyle, Smith, and Eckert (1976) described a multiple-choice test of vocabulary. In a review article, Bell, Doyle, and Talbott (1977) described the ongoing development of an oral test for children that used repetition tasks and structured responses and that included listening comprehension. Meredith (1978) suggested that an enforced latency period (making students think before answering) might lead to better performance in a test.


Madsen (1979) used printed versions of aural comprehension tests (like the TOEFL listening items) in Egypt and found that they correlated about .80 with the Michigan aural test. Natalicio (1979) provided an uncritical review of repetition and dictation as testing techniques.

TOWARDS A COMMON YARDSTICK IN THE 1980s

In an article forming part of the Report of the President's Commission on Foreign Language and International Studies, Woodford (1980) argued the need for a "common yardstick" on the basis of which national listening and reading tests could be developed. In this argument, he echoed the goal of the Modern Language Project in Europe as described in van Ek (1975), but the U.S. initiative took a different track. On page 179 of Volume 66 (1982), it was reported that work on the ACTFL language proficiency guideline projects had started.

In 1982, the Journal began the policy of blind reviewing, but I could not sense any difference in the quality or range of articles in testing. Both established scholars (like Clark) and future stars (like Wesche and Shohamy) appeared after the policy change.

J. L. D. Clark (1983), in a paper read earlier at the ACTFL annual meeting, updated the brief history of language testing that had appeared in Spolsky (1977) and focused on direct proficiency testing, communicative skills in classroom tests, and the promise of computerized testing. In the first item to be reprinted in the MLJ as the best item from 1981 in the Canadian Modern Language Review, Wesche (1983) argued for communicative tests and described three such tests.

The debate over prognostic testing continued, as Curtin, Avner, and Smith (1983) concluded that there was no warrant to use the Pimsleur Aptitude Battery alone as a predictor of foreign language grades.

More articles on the cloze continued to appear. Shohamy (1982), whose article in the MLJ was her first publication on language testing, showed that pupils preferred oral interviews to cloze tests. Their performance on a cloze test was correlated to their liking for the technique. Heilenman (1983) thought that a cloze test could be used cautiously, with room allowed for later negotiation, for placement. Bensoussan and Ramraz (1984) used multiple-choice fill-ins and said that it was not quite clear what such a test measures. Larson (1983) found that there were good correlations (from .50 to .70) among the skills tested in the various parts of the final examinations of second-year courses in French, German, and Spanish at Brigham Young University, but certainly not high enough to justify dropping any part of the examination. Fischer (1984) developed a scale to help scorers but found that it was only moderately reliable. Hale, Stansfield, and Duran (1984), a group of researchers working at the Educational Testing Service, listed all the items that had so far been published about the Test of English as a Foreign Language (TOEFL).

The ACTFL Proficiency Guidelines

With the publication in 1982 of the ACTFL Proficiency Guidelines, the attacks on this attempt to provide a common yardstick began. In the first two articles in the MLJ on the topic, Savignon (1985) and Lantolf and Frawley (1985) were particularly distressed by the maintenance of the educated native speaker comparison from the FSI guidelines and by the analytic nature of the levels and descriptors. Following up, Bachman (starting to emerge as a powerful figure in language testing) and Savignon (the prime proponent of communicative teaching) united in a more detailed attack (1986) on the proficiency guidelines, arguing that it was wrong to use the FSI model, developed for diplomats, as the basis for assessing college performance. Recognizing the potential impact on teaching of the new guidelines, Schulz (1986) expressed doubt that the change from the current grammar-based achievement testing to the communicative-based and oral proficiency goal was realistic. Kramsch (1986) joined the attack on the guidelines, essentially but elegantly restating the century-old argument (Latham, 1877) that their specificity narrowed the range of teaching, by encouraging people to teach only what could be tested. Lowe (1986), with long experience in government oral proficiency testing, answered the critics, defending the Guidelines, but agreeing that some further study and research was needed.

The debate over the change to proficiency remained a major concern of testing articles in the MLJ for the next few years. Hieke (1985) argued that oral fluency needed assessing, and suggested some approaches. In a defense of the ACTFL Proficiency Guidelines, Byrnes (1987) outlined some of the valuable research that could be carried out by using the framework and by analyzing samples collected during the testing.


Other Issues in the 1980s

The concern about prognosis and the nature of language learning aptitude continued. Currall and Kirk (1986) argued that a student's grade point average, plus past French grades, plus an interview, was a good way to predict success. Pursuing a theme started earlier by Oller (1979) with his arguments that intelligence and language learning ability are one and the same thing, J. P. Boyle (1987) reported a study in which a battery of standard English as a Foreign Language and intelligence tests was administered to 200 first-year Chinese university students. Factor analysis, he claimed, showed that reasoning is a separate factor from language abilities (general proficiency and vocabulary) and from memory.

T. C. Cooper (1987), also concerned with intelligence and language, found an argument in favor of keeping foreign language teaching in the curriculum in evidence he produced that showed that students who had studied foreign languages in high school obtained better scores on the verbal portion of the Scholastic Aptitude Test (SAT). He was a little surprised to find that students of German did best, followed by students of French and of Spanish, for he had assumed a direct relation between learning a Romance language and handling Latinate vocabulary rather than suspecting an indirect social causation (such as asking which students choose which language, especially a less common language, as German in the United States must now be characterized).

Picking up on the emphasis in the MLJ on the use of language tests that we have noted already, Spurling (1987) followed the progress of 200 limited-English students admitted (because of an open admission policy) to Marin Community College in spite of low scores on the English admission examination. Although the correlation between the English test and the students' grades at the end of their first year was statistically significant, it accounted for only a small portion of the variance. The correlation was higher for Hispanic students than for Asians and others. This result convinced him that it would have been unwise to use a simple cutoff score to decide admission.

In 1987, in an MLJ editorial announcing the establishment of the National Foreign Language Center, Lambert (1987) set as one of the goals of the Center to help develop a "common metric" which would comprise "objective, efficient measures of the overall level of the language competence of individuals" (p. 6). This notion, interestingly enough, seemed at first to be more appreciated in Europe (for the related work of the Council of Europe and the European Community in this area, see North, 1992) than in the United States, at least until the recent emergence of interest in national standards.

RECENT ISSUES FROM THE 1990s

We enter in 1988 a dry period for language testing articles in the MLJ, with the exception of a very important challenge to the hierarchies of the ACTFL Proficiency Guidelines by Lee and Musumeci (1988). Perhaps this paucity of testing articles reflects the fact that there were now sufficient other outlets for article writers, notably with the establishment in 1984 of Language Testing, a journal dedicated to the topic.

The paper by Lee and Musumeci is a polished, skillful, and significant piece of testing research. Making use of the Guidelines for foreign language reading, they constructed a test that measured the three principal aspects on which the hierarchy was based: text type, reading skill, and task-based performance. They then gave the test to some 200 students in Italian classes at four different levels of instruction (the first through the fourth semester). They found effects for text type and skills, but not for level. Thus, they showed that the hierarchy of levels hypothesized in the Guidelines did not correspond with the differences associated with the number of semesters of language learning. Put another way, the construct of reading ability in the Guidelines proved to be neither robust nor sensitive enough to show differences in development in actual cases.

The other articles over the last decade have been less striking. Heining-Boynton (1990) proposed an inventory for evaluating foreign language programs in elementary schools. Meredith (1990) described using the Oral Proficiency Interview (OPI) as part of a battery of tests and fine-tuned it by adding extra points to the scale. Anderson (1991) found that there was no single set of processing strategies that significantly contributed to success on two reading measures. Dunkel (1991) described a possible way of using a computer to test listening comprehension.

Returning to the theme of cloze tests, Abraham and Chapelle (1992) reported an empirical study of three kinds of cloze tests: the classic fixed-interval cloze, the multiple-choice cloze derived from it, and the "rational" cloze in which the tester chooses which words to omit. By studying item difficulty, they found that the three measures seemed to be testing different things.

Revisiting a topic that was studied in the 1960s in the debate over the reliability of scores on essay tests, Shohamy, Gordon, and Kraemer (1992) provided evidence from experience of the usefulness of writing scales and intensive training for raters in order to improve agreement among them.

To reduce the expense of training and paying examiners for the OPI that was becoming popular, Stansfield and Kenyon (1992), working at the Center for Applied Linguistics, developed a tape-recorded version of the interview and reported on its qualities. The same investigators, together with a colleague, described in Stansfield, Scott, and Kenyon (1992) the development of a test for translators.

Following up on the proposal in Lambert's editorial, and taking advantage of a program for small team residencies at the National Foreign Language Center, three language testing scholars presented in Dunkel, Henning, and Chaudron (1993) their programmatic answer to a call for a theory of second language listening comprehension that they attributed to John Carroll.

In another exploration of the complexity of language testing, Wolf (1993) found that different measures of reading comprehension produced different results. In another useful review, asking the kind of basic questions about tests that are often taken for granted, Halleck (1995) analyzed a number of OPIs and discovered increasing syntactic maturity in interviews scored higher on the proficiency rating scale. In a similar vein, Edwards (1996) provided empirical support for the text typology underlying the reading model of the ACTFL Proficiency Guidelines (1986). However, Thompson (1996), working with a small sample, did not find that the results of tests based on the Guidelines confirmed development over time, thus supporting Lee and Musumeci.

SUMMING UP AND THE FUTURE

As this review has shown, The Modern Language Journal is not a primary source for understanding the development of language testing in the United States. It has, however, published a significant number of important articles that attest to the historical growth of language testing and to the reluctant recognition by the profession of the fact that language tests both drive and reflect language teaching. A major concern about testing oral proficiency has been an ongoing theme from the earliest volumes through to the ACTFL Proficiency Guidelines. For those to whom the spoken language is a key part of the curriculum for modern languages, finding an efficient way of assessing ability in this area has been a continuing challenge. Other important themes recur throughout the 80 volumes of the MLJ: cloze tests, proficiency guidelines, prognosis and aptitude testing, and the use of tests.

In terms of the quality of the research reported, however, the level is very uneven; many of the articles had little impact at the time of their publication, and few can be said to have led to important advances in the field. Most are now of little more than archival interest. A bare handful are likely to be cited in current writing in language testing.

This judgment leads one to question the role of articles on such a specialized topic as language testing in a journal of generalist concern like the MLJ, especially now that there is a journal devoted to it (Language Testing) and a goodly number of other journals in applied linguistics offer an outlet for testing articles. The answer to this quandary is, I believe, for the MLJ to concentrate on articles about testing that are directed to the general public of language teachers and not to testers. It would be valuable, then, to encourage or commission papers from time to time that review the changing state of the art in language testing, to publish articles dealing with the impact of testing on teaching, and otherwise to limit coverage of testing issues to major controversial developments such as the ACTFL Proficiency Guidelines.

ACKNOWLEDGMENTS

I am grateful to James P. Lantolf and to Sally Sieloff Magnan for the invitation to explore again the pages of the MLJ that I found so useful in earlier excursions into the history of language testing, to Suzanne Moore and Dave Benseler for A Comprehensive Index to The Modern Language Journal 1916-1996 (1999), which helped speed up the process, and to the anonymous readers who saw what was missing in an earlier draft.

REFERENCES

Abraham, R. G., & Chapelle, C. A. (1992). The meaning of cloze test scores: An item difficulty perspective. MLJ, 76, 468-479.
Aleamoni, L. M., & Spencer, R. E. (1968). Development of the University of Illinois foreign language placement and proficiency system and its results for fall, 1966 and 1967. MLJ, 52, 355-359.
American Council on the Teaching of Foreign Languages. (1982). ACTFL provisional proficiency guidelines. New York: American Council on the Teaching of Foreign Languages.
American Council on the Teaching of Foreign Languages. (1986). ACTFL proficiency guidelines. New York: American Council on the Teaching of Foreign Languages.
Anderson, N. J. (1991). Individual differences in strategy use in second language reading and testing. MLJ, 75, 460-472.
Ayer, G. W. (1960). An auditory discrimination test based on Spanish. MLJ, 44, 227-230.
Bachman, L., & Savignon, S. J. (1986). The evaluation of communicative language proficiency: A critique of the ACTFL oral interview. MLJ, 70, 380-390.
Bell, T. O., Doyle, V., & Talbott, B. L. (1977). Review essay: The Mat-Sea-Cal oral proficiency tests: A review of. MLJ, 61, 136-138.
Bensoussan, M., & Ramraz, R. (1984). Testing EFL reading comprehension using a multiple-choice rational cloze. MLJ, 68, 230-239.
Blickenstaff, C. B. (1963). Musical talents and foreign language learning ability. MLJ, 47, 359-363.
Borg, W. R., & Goodman, J. S. (1956). Development of an individual test of English for foreign students. MLJ, 40, 240-244.
Bottke, K. G., & Milligan, E. E. (1945). Test of aural and oral aptitude for foreign language study. MLJ, 29, 705-709.
Bovee, A. G., & Froehlich, G. J. (1946). Some observations on the relationship between mental ability and achievement in French. MLJ, 30, 333-336.
Boyle, J. P. (1987). Intelligence, reasoning, and language proficiency. MLJ, 71, 277-288.
Boyle, T. A., Smith, W. F., & Eckert, R. G. (1976). Computer mediated testing: A branched program achievement test. MLJ, 60, 428-440.
Briere, E. J., Clausing, G., Senko, D., & Purcell, E. (1978). A look at cloze testing across languages and levels. MLJ, 62, 23-26.
Brinsmade, C. (1928). Concerning the College Board examinations in modern languages. MLJ, 8, 87-100, 212-227.
Broom, E., & Kaulfers, W. V. (1927). Two types of objective tests in Spanish. MLJ, 11, 517-521.
Brown, J. D. (1980). Relative merits of four methods for scoring cloze tests. MLJ, 64, 311-317.
Buechel, E. H. (1957). Grades and ratings in language proficiency evaluations. MLJ, 41, 41-47.
Burt, M. K., Dulay, H. C., & Hernandez-Chavez, E. (1975). Bilingual syntax measure. San Antonio, TX: Psychological Corporation.
Byrnes, H. (1987). Proficiency as a framework for research in second language acquisition. MLJ, 71, 44-49.
Carroll, J. B. (1954). Notes on the measurement of achievement in foreign languages. Unpublished manuscript.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge: Cambridge University Press.
Carroll, J. B., Carton, A. S., & Wilds, C. (1959). An investigation of "cloze" items in the measurement of achievement in foreign languages. Cambridge, MA: Laboratory for Research in Instruction, Graduate School of Education, Harvard University.
Carroll, J. B., & Sapon, S. M. (1955). Psi Lambda Foreign Language Aptitude Battery. Boston, MA: Laboratory for Research in Instruction, Graduate School of Education, Harvard University.
Carroll, J. B., & Sapon, S. M. (1957). Modern Language Aptitude Test. New York: Psychological Corporation.
Caulfield, J., & Smith, W. C. (1981). The reduced redundancy test and the cloze procedure as measures of global language proficiency. MLJ, 65, 54-58.

Cheydleur, F. D. (1928). Results and significance of the new type of modern language tests. MLJ, 12, 513-531.
Cheydleur, F. D. (1931). The use of placement tests in modern languages at the University of Wisconsin. MLJ, 15, 262-280.
Cheydleur, F. D. (1932a). Mortality of modern language students: Its causes and prevention. MLJ, 17, 104-136.
Cheydleur, F. D. (1932b). The relationship between functional and theoretical grammar. MLJ, 16, 310-334.
Clapp, H. L. (1947). Meditations on a placement program or when should a foreign language be studied? MLJ, 31, 203-207.
Clark, J. L. D. (1983). Language testing: Past and current status-Directions for the future. MLJ, 67, 431-443.
Clarke, F. M. (1931). Results of the Bryn Mawr Test in French administered in New York City high schools. Bulletin of High Points, 13, 4-13.
Coleman, A. (1929). The teaching of modern foreign languages in the United States. New York: Macmillan.
Coleman, A., & King, C. B. (Eds.). (1938). An analytical bibliography of modern language teaching, Vol. 2, 1932-1937. Chicago: University of Chicago Press.
Committee on Resolutions and Investigations. (1917). Report of Committee on Resolutions and Investigations appointed by the Association of Modern Language Teachers of the Middle States and Maryland. MLJ, 1, 250-261.
Conant, J. B. (1934). Notes and news: President Conant speaks again. MLJ, 19, 465-466.
Cooper, R. L. (1968). An elaborated language testing model. Language Learning, 18, 57-72.
Cooper, R. L. (1969). Two contextualized measures of degree of bilingualism. MLJ, 53, 172-178.
Cooper, R. L., Fowles, B. R., & Givner, A. (1969). Listening comprehension in a bilingual community. MLJ, 53, 235-241.
Cooper, R. L., & Greenfield, L. (1969). Word frequency estimation as a measure of degree of bilingualism. MLJ, 53, 163-166.
Cooper, T. C. (1987). Foreign language study and SAT-verbal scores. MLJ, 71, 381-387.
Coutant, V. (1948). Evaluation in foreign language teaching. MLJ, 32, 596-599.
Currall, S. C., & Kirk, R. E. (1986). Predicting success in intensive foreign language courses. MLJ, 70, 107-113.
Curtin, C., Avner, A., & Smith, L. A. (1983). The Pimsleur Battery as a predictor of student performance. MLJ, 67, 33-40.
Decker, W. C. (1925). Oral and aural tests as integral parts of Regents' examinations. MLJ, 9, 369-371.
Di Giulio, E. (1967). Testing in the language laboratory. MLJ, 51, 103-104.
Dunkel, P. A. (1991). Computerized testing of non-participatory L2 listening comprehension proficiency: An ESL prototype development effort. MLJ, 75, 64-73.
Dunkel, P. A., Henning, G., & Chaudron, C. (1993). The assessment of an L2 listening comprehension construct: A tentative model for test specification and development. MLJ, 77, 180-191.
Edwards, A. L. (1996). Reading proficiency assessment and the ILR/ACTFL text typology: A reevaluation. MLJ, 80, 350-361.
Eley, E. G. (1956). Testing the language arts. MLJ, 40, 310-315.
Eterno, J. A. (1961). Foreign language pronunciation and musical aptitude. MLJ, 45, 168-170.
Feder, D. D., & Cochran, G. (1936). Comprehension maturity tests: A new departure in measuring reading ability. MLJ, 20, 201-208.
Fife, R. H. (1937). Reading tests in French and German. MLJ, 22, 56-57.
Fischer, R. A. (1984). Testing written communicative competence in French. MLJ, 68, 13-20.
Fishman, J. A., Cooper, R. L., & Ma, R. (1971). Bilingualism in the barrio. Bloomington, IN: Research Center for the Language Sciences, Indiana University.
Frantz, A. I. (1939). The reading knowledge test in the foreign languages: A survey. MLJ, 23, 440-446.
Furness, E. L. (1955). A plea for a broader program of testing. MLJ, 39, 255-259.
Gabbert, T. A. (1941). Predicting success in an intermediate junior college reading course in Spanish. MLJ, 8, 637-641.
Gaies, S., Gradman, H., & Spolsky, B. (1977). Towards the measurement of functional proficiency: Contextualization of the noise test. TESOL Quarterly, 11, 51-57.
Guerra, E. L., Abrahamson, D. A., & Newmark, M. (1964). The New York City foreign language oral ability rating scale. MLJ, 48, 486-489.
Haden, E., & Stalnaker, J. M. (1934). A new type of comprehensive foreign language test. MLJ, 19, 81-92.
Hale, G. A., Stansfield, C. W., & Duran, R. P. (1984). A comprehensive TOEFL bibliography, 1963-1982. MLJ, 68, 45-51.
Halleck, G. B. (1995). Assessing oral proficiency: A comparison of holistic and objective measures. MLJ, 79, 223-234.
Harding, F. D. J. (1958). Tests in the selection of language students. MLJ, 42, 120-122.
Heilenman, L. K. (1983). The use of a cloze procedure in foreign language placement. MLJ, 67, 121-126.

Heining-Boynton, A. L. (1990). The development and testing of the FLES program evaluation inventory. MLJ, 74, 432-439.
Henmon, V. A. C. (1934). Recent developments in the study of foreign language problems. MLJ, 19, 187-201.
Henmon, V. A. C., Bohan, J. E., Brigham, C. C., Hopkins, L. T., Rice, G. A., Symonds, P. M., Todd, J. W., & Van Tassel, R. J. (Eds.). (1929). Prognosis tests in the modern foreign languages: Reports prepared for the Modern Foreign Language Study and the Canadian Committee on Modern Languages (Vol. 16). New York: Macmillan Company.
Hieke, A. E. (1985). A componential approach to oral fluency evaluation. MLJ, 69, 135-142.
Hunt, T., Wallace, F. C., Doran, S., Buynitzky, K. C., & Scharz, R. E. (1929). Language aptitude test: George Washington University. Washington, DC: Center for Psychological Service, George Washington University.
Jacobsen, M., & Imhoof, M. (1974). Predicting success in learning a second language. MLJ, 58, 329-336.
Jones, R. L. (1979). The oral interview of the Foreign Service Institute. In B. Spolsky (Ed.), Some major tests (pp. 104-115). Washington, DC: Center for Applied Linguistics.
Kaulfers, W. V. (1930). Why prognose in the foreign languages? MLJ, 14, 296-301.
Kaulfers, W. V. (1933). Practical techniques for testing comprehension in extensive reading. MLJ, 17, 321-327.
Kaulfers, W. V. (1937). Objective tests and exercises in French pronunciation. MLJ, 22, 186-188.
Kaulfers, W. V. (1939). Prognosis and its alternatives in relation to the guidance of students. German Quarterly, 12, 81-84.
Kaulfers, W. V. (1944). Wartime development in modern-language achievement testing. MLJ, 28, 136-150.
Kaulfers, W. V. (1948). Post-war language teaching in California colleges. MLJ, 32, 368-372.
Kramsch, C. J. (1986). From language proficiency to interactional competence. MLJ, 70, 366-372.
Kurath, W., & Stalnaker, J. M. (1936). Two German vocabulary tests. MLJ, 21, 95-102.
Lambert, R. D. (1987). The case for a national foreign language center: An editorial. MLJ, 71, 1-11.
Lange, D. L., & Clausing, G. (1981). An examination of two methods of generating and scoring CLOZE tests with students of German on three levels. MLJ, 65, 254-261.
Lantolf, J. P., & Frawley, W. (1985). Oral-proficiency testing: A critical analysis. MLJ, 69, 228-234.
Larson, J. W. (1983). Skills correlations: A study of three final examinations. MLJ, 67, 228-234.
Latham, H. (1877). On the action of examinations considered as a means of selection. Cambridge, England: Deighton, Bell.
Leutenegger, R. R., Mueller, T. H., & Wershow, I. R. (1965). Auditory factors in foreign language acquisition. MLJ, 49, 22-31.
Lowe, J. P. (1986). Proficiency: Panacea, framework, process? A reply to Kramsch, Schulz, and, particularly, to Bachman and Savignon. MLJ, 70, 391-397.
Lundeberg, O. K. (1929). Recent developments in audition-speech tests. MLJ, 14, 193-202.
Madaus, G. P. (1990). Testing as a social technology. Boston: Boston College.
Madsen, H. S. (1979). An indirect measure of listening comprehension. MLJ, 63, 429-435.
Maronpot, R. P. (1939). Discovering and salvaging modern language risks. MLJ, 23, 595-598.
Matheus, J. F. (1937). Correlation between psychological test scores, language aptitude test scores, and semester grades. MLJ, 22, 104-106.
McArthur, J. (1965). Measurement, interpretation and evaluation-TV FLES: Item analysis in test instruments. MLJ, 49, 217-219.
Mendeloff, H. (1963). Aural placement by television. MLJ, 47, 110-113.
Meredith, R. A. (1978). Improved oral test scores through delayed response. MLJ, 62, 321-327.
Meredith, R. A. (1990). The oral proficiency interview in real life: Sharpening the scale. MLJ, 74, 288-296.
Michel, S. V. (1934). Prognosis in the modern foreign languages. Unpublished master's thesis, University of Minnesota, Minneapolis.
Michel, S. V. (1936). Prognosis in German. MLJ, 20, 275-287.
Miller, M. M. (1948). Test on Spanish and Spanish-American life and culture. MLJ, 32, 140-144.
Monroe, P. (Ed.). (1939). Conference on examinations, Dinard, France, 1938. New York: Teachers College, Columbia University.
Moore, S. S., & Benseler, D. P. (Eds.). (1999). A comprehensive index to The Modern Language Journal 1916-1996. Boston, MA: Blackwell.
Mueller, K. A. (1972). Judging the competency of teachers on the performance of their students. MLJ, 56, 10-12.
Mueller, K. A., & Wiersma, W. (1963). Correlation of foreign language speaking competency and grades in ten midwestern liberal arts colleges. MLJ, 47, 535-555.
Mueller, T. (1959). Auding tests. MLJ, 43, 185-187.
Natalicio, D. S. (1979). Repetition and dictation as language testing techniques. MLJ, 63, 165-176.
North, B. (1992). Options for scales of proficiency for a European language framework (Occasional Paper). Washington, DC: National Foreign Language Center.
Oller, J. W. (1972). Scoring methods and difficulty levels for cloze tests of proficiency in English as a second language. MLJ, 56, 151-158.
Oller, J. W. (1976a). Review essay: The measurement of bilingualism: A review of. Modern Language Review, 60, 399-400.
Oller, J. W. (1976b). Evidence of a general language proficiency factor: An expectancy grammar. Die Neuen Sprachen, 76, 165-174.
Oller, J. W. (1979). Language tests at school: A pragmatic approach. London: Longman.

Payne, D. A., & Vaughn, H. A. (1967). Forecasting Italian language proficiency of culturally immersed students. MLJ, 51, 3-6.
Pimsleur, P. (1966). Pimsleur language aptitude battery. New York: Harcourt Brace Jovanovich.
Pimsleur, P. (1969). Knowing your students in advance. MLJ, 53, 85-87.
Pimsleur, P., Mosberg, L., & Morrison, A. L. (1962). Student factors in foreign language learning. MLJ, 46, 160-170.
Politzer, R. L. (1974). Developmental sentence scoring as a method of measuring second language acquisition. MLJ, 58, 245-250.
Raymond, J. (1951). A controlled association exercise in Spanish. MLJ, 35, 280-289.
Rice, F. (1959). The Foreign Service Institute tests language proficiency. Linguistic Reporter, 1, 2, 4.
Richardson, H. D. (1933). Discovering aptitude for the foreign languages. MLJ, 18, 160-170.
Roach, J. O. (1945). Some problems of oral examinations in modern languages: An experimental approach based on the Cambridge examinations in English for foreign students, being a report circulated to oral examiners and local examiners for those examinations. Cambridge, England: Local Examinations Syndicate.
Rogers, A. L. (1929). French aural comprehension test. In V. A. C. Henmon (Ed.), Achievement tests in the modern foreign languages (pp. 311-321). New York: Macmillan.
Rogers, A. L., & Clarke, F. M. (1933). Report on the Bryn Mawr Test of Ability to Understand Spoken French. MLJ, 17, 241-248.
Sadnavitch, J. M., & Popham, W. J. (1961). Measurement of Spanish achievement in the elementary schools. MLJ, 45, 297-299.
Salomon, E. (1954). A generation of prognosis testing. MLJ, 38, 299-303.
Sammartino, P. (1938). A language achievement scale. MLJ, 22, 429-432.
Savignon, S. J. (1985). Evaluation of communicative competence: The ACTFL Provisional Proficiency Guidelines. MLJ, 69, 129-134.
Schulz, R. A. (1986). From achievement to proficiency through classroom instruction: Some caveats. MLJ, 70, 373-379.
Seibert, L. C., & Goddard, E. R. (1935). A more objective method of scoring composition. MLJ, 20, 143-150.
Shane, A. M. (1971). An evaluation of the existing college norms for the MLA-Cooperative Russian Test and its efficacy as a placement examination. MLJ, 55, 93-99.
Shohamy, E. (1982). Affective considerations in language testing. MLJ, 66, 13-17.
Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters' background and training on the reliability of direct writing tests. MLJ, 76, 27-33.
Smith, F. P., & Campbell, H. (1942). Objective achievement testing in French recognition versus recall tests. MLJ, 26, 192-198.
Spoerl, D. T. (1939). A study of some of the possible factors involved in foreign language learning. MLJ, 23, 428-431.
Spolsky, B. (1971). Reduced redundancy as a language testing tool. In G. E. Perren & J. L. M. Trim (Eds.), Applications of linguistics: Selected papers of the Second International Congress of Applied Linguistics, Cambridge, September 1969 (pp. 383-390). Cambridge: Cambridge University Press.
Spolsky, B. (1977). Language testing: Art or science. In G. Nickel (Ed.), Proceedings of the Fourth International Congress of Applied Linguistics (Vol. 3, pp. 7-28). Stuttgart, Germany: Hochschulverlag.
Spolsky, B. (1995a). Measured words: The development of objective language testing. Oxford: Oxford University Press.
Spolsky, B. (1995b). The impact of the Army Specialized Training Program: A reconsideration. In G. Cook & B. Seidlhofer (Eds.), For H. G. Widdowson: Principles and practice in the study of language: A festschrift on the occasion of his sixtieth birthday (pp. 323-334). Oxford: Oxford University Press.
Spolsky, B., Sigurd, B., Sato, M., Walker, E., & Aterburn, C. (1968). Preliminary studies in the development of techniques for testing overall second language proficiency. Language Learning, 18, 79-101.
Spurling, S. (1987). The fair use of an English language admissions test. MLJ, 71, 410-421.
Stabb, M. S. (1955). An experiment in oral testing. MLJ, 39, 232-236.
Stalnaker, J. M., & Kurath, W. A. (1935). A comparison of two types of foreign language vocabulary test. Journal of Educational Psychology, 26, 435-442.
Stansfield, C. W., & Kenyon, D. M. (1992). The development and validation of a simulated oral proficiency interview. MLJ, 76, 129-141.
Stansfield, C. W., Scott, M. L., & Kenyon, D. M. (1992). The measurement of translation ability. MLJ, 76, 455-467.
Stubbs, J. B., & Tucker, G. R. (1974). The cloze test as a measure of English proficiency. MLJ, 58, 239-241.
Symonds, P. M. (1930a). A foreign language prognosis test. Teachers College Record, 31, 540-546.
Symonds, P. M. (1930b). Foreign language prognosis test. New York: Teachers College, Columbia University.
Tallent, E. R. E. (1938). Three coefficients of correlation that concern modern foreign languages. MLJ, 22, 591-594.
Taylor, J. S. (1972). Would achievement testing of students solve the problem of incompetent language teachers? MLJ, 56, 360-364.
Tharp, J. B. (1930). Effect of oral-aural ability on scholastic ability. MLJ, 15, 10-26.
Thompson, I. (1996). Assessing foreign language skills: Data from Russian. MLJ, 80, 47-65.
Thorndike, E. L. (Ed.). (1921). The teacher's word book. New York: Teachers College, Columbia University.
Valley, J. R. (1959). College actions on CEEB advanced placement language examination candidates. MLJ, 43, 261-263.
van Ek, J. A. (1975). The threshold level. Strasbourg, France: Council of Europe.
von Wittich, B. (1962). Prediction of success in foreign language study. MLJ, 46, 208-212.
Wagner, M. E., & Strabel, E. (1935). Predicting success and failure in college ancient and modern foreign languages. MLJ, 19, 285-293.
Wesche, M. B. (1983). Communicative testing in a second language. MLJ, 67, 41-55.
Wilds, C. (1975). The oral interview test. In B. Spolsky & R. L. Jones (Eds.), Testing language proficiency (pp. 29-37). Washington, DC: Center for Applied Linguistics.
Wolf, D. F. (1993). A comparison of assessment tasks used to measure FL reading comprehension. MLJ, 77, 473-489.
Woodford, P. E. (1980). Foreign language testing. MLJ, 64, 97-102.
Yerkes, R. M. (Ed.). (1921). Psychological examining in the United States Army. Washington, DC: Government Printing Office.
Yoakum, C. S., & Yerkes, R. M. (Eds.). (1920). Army mental tests. New York: H. Holt.
