
Spaan Fellow Working Papers in Second or Foreign Language Assessment

Volume 3

2005

Edited by

Jeff S. Johnson

Published by

English Language Institute University of Michigan

401 E. Liberty, Suite 350 Ann Arbor, MI 48104-2298

[email protected] http://www.lsa.umich.edu/eli


First Printing, June, 2005 © 2005 by the English Language Institute, University of Michigan. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. The Regents of the University of Michigan: David A. Brandon, Laurence B. Deitch, Olivia P. Maynard, Rebecca McGowan, Andrea Fischer Newman, Andrew C. Richner, S. Martin Taylor, Katherine E. White, Mary Sue Coleman (ex officio).


Table of Contents

Introduction
Spaan Fellowship Information
Spaan Fellow Working Paper Contents, Vols. 1 and 2

Xiaomei Song
    Language Learner Strategy Use and English Proficiency on the Michigan English Language Assessment Battery

Aek Phakiti
    An Empirical Investigation into the Nature of and Factors Affecting Test Takers’ Calibration within the Context of an English Placement Test (EPT)

Yang Lu
    A Validation Study of the ECCE NNS and NS Examiners’ Conversational Styles from a Discourse Analytic Perspective

Noriko Iwashita
    An Investigation of Lexical Profiles in Performance on EAP Speaking Tasks

Young-Ju Lee
    A Summary of Construct Validation of an English for Academic Purposes Placement Test


Introduction

This volume of the Spaan Fellow Working Papers in Second or Foreign Language Assessment contains five research reports written by Spaan Fellows from our second and third cohorts.

Xiaomei Song’s research looks into the effect of learner strategies on ESL proficiency, as measured by the MELAB. She adapted a validated learner strategy questionnaire to fit the context of her participants, and used factor analysis to identify nine reliable learning strategies. She then measured the predictive power of these strategies on English proficiency, for writing, listening, reading (including grammar and vocabulary knowledge), and also a composite score combining each of these. Xiaomei’s results suggest significant relationships between language learner strategy use and proficiency. Among the relationships, she found a positive effect on all her proficiency variables (writing, listening, reading, and composite) for what she calls a “linking with prior knowledge” strategy, used by learners who connect what they learn with what they know, try to organize language material in their minds, and apply what they learn to new situations. This finding is important and helps contribute to our understanding of second language acquisition and can have strong implications for second language teaching. Xiaomei also found a negative effect, again for each proficiency variable, for the strategy she calls “repeating/confirming information.” This strategy is used by learners who like to repeat words and sentences they learn, write summaries of new English they hear or read, and feel the need to confirm that others understand them. In effect, the more deliberate a learner is, the lower his or her proficiency scores are. It seems to me this result could be more a reflection of the nature of the proficiency measure — a timed test — than of ability in general, and it highlights the need for research in this area.

Aek Phakiti reports in depth on his study of the ability of Thai university student test takers to predict their success on the English Placement Test (EPT) test items, what he calls calibration. He found that most of the students overestimated the number of items they were able to answer correctly, except for the advanced-level students, who underestimated their results. Correlations between test results and confidence results were very weak for all groups except for the intermediate-level proficiency group, who were consistent with their estimates on the listening, vocabulary, and reading items, and the advanced-level group, who were very consistent in their estimates of grammar item success. Aek also looks at gender and item difficulty effects on calibration. He found no significant differences for gender, and, interestingly, that examinees overestimated their success for difficult test items and underestimated their success for easy items. Aek calls for more research in the calibration of test results and confidence, and he sees “calibrative competence” as an important and new facet of communicative language ability.

Yang Lu uses a discourse analytic approach to look at differences between native and nonnative English speaker oral interview administrators for the Examination for the Certificate of Competency in English (ECCE). She transcribed 20 speaking test sessions and counted the number of occurrences of non-eliciting moves; that is, moves that did not assist the examinee in providing further input and thus made the rating session less efficient. Yang found that there was a difference between the two sets of speaker discourse, as the nonnative English speaker interlocutors performed more non-eliciting behavior than the native English speakers. She in part attributes this result to possible cultural differences dealing with goal orientation and control. The findings deserve the attention of the ECCE speaking test developers and rater trainers, but, as Yang points out, her findings are based on a small sample and further study is needed to confirm or contradict the overall results with the behavior of the rater population as a whole.

Noriko Iwashita’s paper highlights a problem concerning the measurement of English vocabulary production for academic purposes: defining the construct itself. She uses speaking test data to see how different tasks, those independent of other language test sections and those integrated with the other test sections, elicit academic English words, but she also defines academic English vocabulary in two ways: words on the Academic Word List, and words used in actual academic settings found in MICASE data. Noriko’s conclusions, while useful and important, are perhaps not as interesting as the questions the study raises about the definition of academic vocabulary as it is used in university settings; that is, the real-world requirements faced by international students. Much more research is needed in this important area, and corpora such as MICASE should be taken advantage of to help make measurement more meaningful.

Young-Ju Lee uses the Messick framework for test validation to provide various types of evidence, both quantitative and qualitative, to support a writing placement test used at a university. Her paper is a summary of her dissertation research. Except for a few odd findings, such as negative correlations between test scores and GPA for some programs at the university, Young-Ju’s results support the use of the placement test. Her multifaceted methodology is quite impressive, and this study is a good example of some of the steps institutions should consider when validating their in-house tests. Validation is an ongoing process—and not in the least a simple endeavor, as this study certainly shows—but it would be wonderful if the processes begun by Young-Ju are continued by future assessment students once she leaves the university.

Our next Spaan Fellow Working Papers will include studies concerning DIF and rater first-language bias in composition tasks, the factor structure of assessment batteries, rating-scale validation, a comparison of classical test theory and IRT, a comparison of test equating methods, and a cognitive processing model for reading test performance.

I thank the five Spaan Fellows who have worked hard to produce these useful and interesting reports. I also thank Maria Huntley for her generous assistance with the Song research project, Sarah Briggs for her help with the Iwashita data, Mary Spaan for her guidance with the Lu report, and Dawne Adam and Eric Lagergren for their terrific help in editing the papers.

Jeff S. Johnson, Editor


The University of Michigan

SPAAN FELLOWSHIP FOR STUDIES IN SECOND OR FOREIGN LANGUAGE ASSESSMENT

In recognition of Mary Spaan’s contributions to the field of language assessment for more than three decades at the University of Michigan, the English Language Institute has initiated the Spaan Fellowship Fund to provide financial support for those wishing to carry out research projects related to second or foreign language assessment and evaluation.

The Spaan Fellowship has been created to fund up to six annual awards, ranging from $3,000 to $4,000 each. These fellowships are offered to cover the cost of data collection and analyses, or to defray living and/or travel expenses for those who would like to make use of the English Language Institute’s resources to carry out a research project in second or foreign language assessment or evaluation. These resources include the ELI Testing and Certification Division’s extensive archival test data (ECCE, ECPE, and MELAB) and the Michigan Corpus of Academic Spoken English (MICASE). Research projects can be completed either in Ann Arbor or off-site.

Applications are welcome from anyone with a research and development background related to second or foreign language assessment and evaluation, especially those interested in analyzing some aspect of the English Language Institute’s suite of tests (MELAB, ECPE, ECCE, or other ELI test publications). Spaan Fellows are likely to be international second or foreign language assessment specialists or teachers who carry out test development or prepare testing-related research articles and dissertations; doctoral graduate students from one of Michigan’s universities who are studying linguistics, education, psychology, or related fields; and doctoral graduate students in foreign language assessment or psychometrics from elsewhere in the United States or abroad. For more information about the Spaan Fellowship, please visit our Web site:

http://www.lsa.umich.edu/eli/spaanfellowship.htm


Spaan Fellow Working Papers in Second or Foreign Language Assessment, Volume 1

Development of a Standardized Test for Young EFL Learners
    Fleurquin, Fernando
A Construct Validation Study of Emphasis Type Questions in the Michigan English Language Assessment Battery
    Shin, Sang-Keun
Investigating the Construct Validity of the Cloze Section in the Examination for the Certificate of Proficiency in English
    Saito, Yoko
An Investigation into Answer-Changing Practices on Multiple-Choice Questions with Gulf Arab Learners in an EFL Context
    Al-Hamly, Mashael, & Coombe, Christine

Spaan Fellow Working Papers in Second or Foreign Language Assessment, Volume 2

A Construct Validation Study of the Extended Listening Sections of the ECPE and MELAB
    Wagner, Elvis
Evaluating the Dimensionality of the Michigan English Language Assessment Battery
    Jiao, Hong
Effects of Language Errors and Importance Attributed to Language on Language and Rhetorical-Level Essay Scoring
    Weltig, Matthew S.
Investigating Language Performance on the Graph Description Task in a Semi-Direct Oral Test
    Xi, Xiaoming
Switching Constructs: On the Selection of an Appropriate Blueprint for Academic Literacy Assessment
    Van Dyk, Tobie


Language Learner Strategy Use and English Proficiency on the Michigan English Language Assessment Battery

Xiaomei Song

Queen’s University

Using a 43-item strategy-use questionnaire, this study examines the nature of language strategies reported by test takers of the Michigan English Language Assessment Battery (MELAB). It further investigates the relationships between test takers’ reported strategy use and language test performance on the MELAB in the context of English as a second language (ESL). The results show that MELAB test takers’ perceptions of cognitive strategy use primarily fall into six dimensions: repeating/confirming information strategies, writing strategies, practicing strategies, generating strategies, applying rules strategies, and linking with prior knowledge strategies. MELAB test takers’ perceptions of metacognitive strategy use fall into three dimensions: evaluating, monitoring, and assessing. The results also reveal that some strategies had a significant, positive effect on language performance and some had a significant, negative effect on language performance, whereas others seemed to have no effect with this group of participants.

Language testing researchers have been concerned with the identification of individual characteristics that influence variation in performance on language tests since the 1970s. One important variable that may account for the differences on language performance, according to Dreyer and Oxford (1996), is the use of language strategies, which are thought to be used by students at all instructional levels with various outcomes. The present study examines the nature of learner strategies reported by test-takers of the Michigan English Language Assessment Battery (MELAB). This study also investigates the relationships between reported learner strategy use and language test performance on the MELAB in the context of English as a second language (ESL).

Factors Affecting Second Language Performance

Language researchers have long held an interest in factors that may affect performance and scores on language tests. Bachman (1990) proposed a model to investigate the effects of three types of systematic sources of variability on test scores: communicative language ability, the personal characteristics of test takers, and the characteristics of the test method or test tasks. Among the three types of systematic sources of variability, communicative language ability was considered the central factor accounting for the variation of test scores in second language learning. It consists of three components: language competence, strategic competence, and psycho-physiological mechanisms. Bachman also argued that the second factor that influences test performance—test-taker characteristics—includes a variety of personal attributes such as age, gender, native language, educational background, attitudes, motivation, anxiety, learning strategies, and cognitive style. Bachman’s third factor—test method—refers to the characteristics of the test instruments used to elicit test performance and the effects that they may have on test score variation. The current study examines the second factor—cognitive and metacognitive strategy use as a part of test-taker characteristics. The study also examines the relationships between English proficiency test scores on the MELAB and cognitive and metacognitive strategy use.

Language Strategy Use

Research has investigated the individual learner’s learning behaviors in relation to second language acquisition (SLA) since the 1970s. Since Rubin (1975) and Stern (1975) first explained their tentative conceptions of strategies used by “good” language learners, advances made in cognitive psychology have led to an ever-growing interest in language strategies. Learning strategies are broadly defined as operations and procedures employed by learners to facilitate the acquisition, storage, retrieval, and use of information in their learning (Rigney, 1978). Oxford (1990) expanded this definition by saying that learning strategies are “specific actions taken by learners to make learning easier, faster, more enjoyable, more self-directed, more effective, and more transferable to new situations” (p. 8). Studies that examine how strategies play a role in language learning and development have been conducted not only in the first language area but also in the second language area (e.g., Baker & Brown, 1984; O’Malley & Chamot, 1990; Oxford, 1990; Paris, Cross, & Lipson, 1984).

The earliest concerns were with identification of characteristics of the “good language learner.” Researchers expected to identify strategies used by successful learners with the idea that they might be transferred to less successful learners. Based on videotaped classroom observations, Rubin (1975) first identified seven strategies that seemed to characterize “good” learning behaviors. Stern (1975) summarized ten strategies of “good learners”: planning, active, empathic, formal, experimental, semantic, practice, communication, monitoring, and internalization strategies. In 1978, Naiman, Frohlich, Stern, and Todesco used semistructured interviews with 34 “successful” students to explore learning strategies that were commonly used among these “good” learners. However, they found that their initial expectation of isolating specific learning strategies of successful learners was not met, and they concluded that “this approach [had] not been successful” (p. 65). The researchers explained that systematic patterns of learning behaviors were rarely evidenced in classrooms. Though there is an absence of firm theoretical frameworks and successful results, these studies have aroused much interest in examining the behaviors that distinguish between successful and unsuccessful learners in SLA.

Advances made in second language acquisition, cognitive psychology, and information processing systems have allowed studies to be conducted employing a wide range of methods of data collection and criteria to categorize learning strategies used by EFL/ESL language learners when they are performing different language tasks, including reading, listening, writing, and speaking. The methods of data collection can be direct, such as observation (e.g., Stern, 1975), interview (e.g., Naiman et al., 1978), think-aloud (e.g., Anderson & Vandergrift, 1996), and diary (e.g., Oxford, Lavine, Felkins, Hollaway, & Saleh, 1996). The methods can also be indirect, such as written questionnaires (e.g., Bialystok, 1978). However, as some researchers have indicated (Cohen, 1998; McDonough, 1995; O’Malley & Chamot, 1990), each kind of data collection method has its own limitation, and one method alone does not enable learners to demonstrate all of their strategies in language learning.


Therefore, most successful research has employed multiple data collection procedures for gathering and validating learning strategies data (Ellis, 1994). For instance, O’Malley and Chamot and their colleagues asked students to retrospectively report strategy use through group interviews in the descriptive phase, and then the researchers used the think-aloud method when students were engaged in language tasks in the longitudinal phase of their study (1990). Nevertheless, multiple data collection procedures may lead to another problem. O’Malley and Chamot (1990) pointed out that results from different data collection procedures varied considerably, and thus there was no consensus on the classification of language strategies. Although there is a lack of agreement on data collection procedures for strategy use, a large number of studies conducted in this field have developed from simple collections of strategies by classroom observation to more sophisticated investigations, which increase generalizability and explanatory power.

Besides using different data collection methods to categorize language strategies, researchers also classify strategies on the basis of contrasting criteria. For example, early research was mainly based on the criterion of “good language learners.” Afterwards, Rubin (1981) proposed a direct/indirect dichotomy, whereas Bialystok (1981) defined four learning strategies: formal practicing, functional practicing, monitoring, and inferencing. Wenden (1991) suggested cognitive strategies and self-management strategies, whereas Ridley (1997) defined lexical problem-solving, monitoring, and deliberate study strategies. Even though O’Malley and Chamot’s (1990) strategy system and Oxford’s (1990) classification, which are considered the two most influential classifications of language strategies, show a considerable degree of overlap, some disagreement exists concerning strategy classification. O’Malley and Chamot (1990) distinguished three broad types of learning strategies: cognitive, metacognitive, and socio-affective strategies, whereas Oxford (1990) categorized strategies as memory, cognitive, compensation, metacognitive, affective, and social. However, O’Malley and Chamot did not provide reliability or construct validity for their taxonomy of strategy use (Oxford & Burry-Stock, 1995). Although Ellis (1994) deemed Oxford’s classification as the most comprehensive classification of learning strategies, Hsiao and Oxford (2002) conducted confirmatory factor analysis (CFA) and found that the six-factor model did not provide a fully adequate fit to the data.

Based on Hunt’s (1982) and Gagne, Yekovich, and Yekovich’s (1993) information processing theories and using a series of statistical methods, Purpura (1997, 1998a, 1998b, 1999) classified three processing variables of cognitive strategies and one process type variable of metacognitive strategies. In a study conducted with 1,382 EFL test takers, using statistical analyses including exploratory factor analysis, confirmatory factor analysis, and structural equation modeling, Purpura (1999) eventually defined a three-factor model of cognitive strategy use that involves the comprehending, storing/memory, and using/retrieval processes, and a one-factor model of metacognitive strategy use that involves assessment. The process-type variable of the comprehending processes is represented by strategy-type variables called analyzing inductively and clarifying/verifying; the storing/memory process is represented by associating, transferring, repeating/rehearsing, applying rules, and summarizing; and the using/retrieval process is represented by analyzing inductively, inferencing, applying rules, linking with prior knowledge, and practicing naturalistically. Metacognitive strategy use consists of only one underlying factor represented by a general assessment process, which is represented by four strategy-type variables called assessing the situation, monitoring, self-evaluation, and self-testing.


Based on the review of the major classifications of strategy use in this area, this study adopted and revised Purpura’s strategy use questionnaire to elicit information about test takers’ strategy use. From the perspective of this current study, his classification, which focuses on characteristics of test takers, is the most appropriate for studying MELAB test takers. Therefore, Purpura’s (1999) cognitive and metacognitive strategy-use questionnaire was employed as a basis for collecting information on language learner strategy use in this study.

Michigan English Language Assessment Battery

The Michigan English Language Assessment Battery (MELAB) is used as a measure of communicative language ability within the framework of Bachman’s model in this study. The test is developed by the English Language Institute at the University of Michigan. The test is given on scheduled dates and times, at several locations. It is normally held once, twice, or three times a month. The MELAB evaluates advanced-level English language competence of adult nonnative speakers of English. Potential examinees include:

1. Students applying to United States, Canadian, British, and other educational institutions where the language of instruction is English;

2. Professionals who need English for work or training purposes;

3. Anyone interested in obtaining a general assessment of their English language proficiency for educational or employment opportunities.

The MELAB consists of three parts: a composition, a listening test, and a written test containing grammar, cloze, vocabulary, and reading comprehension problems (GCVR). An optional speaking test is also available. Many educational institutions in the United States, Canada, the United Kingdom, and some other countries accept the MELAB as an alternative to the Test of English as a Foreign Language (TOEFL). The entire test takes from 2-1/2 to 3-1/2 hours, including check-in procedures. A description of the test can be seen in Table 1 (see English Language Institute, 2003).

The first section, writing, is a 30-minute impromptu essay response to one of two topics. Test takers may be asked to give an opinion of something and explain why they believe this, to describe something from their experience, or to explain a problem and offer possible solutions (e.g., “What are the characteristics of a good teacher? Explain and give examples”). Most MELAB compositions are one or two pages long (about 200–300 words). Each essay is scored by at least two trained raters based on a clearly developed ten-step holistic scale. The scale descriptors concentrate on topic development, organization, and range, accuracy, and appropriateness of grammar and vocabulary. The ten-point writing scale is set at nearly equal intervals between 53 and 97 to conform to the equated listening and GCVR scales so that the three sections are on the same scale and can therefore be averaged to the final score.
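A minimal sketch of this averaging step is shown below; the exact rounding rule is an assumption, since it is not described here, and the section scores are illustrative values only.

```python
# Minimal sketch of combining the three equated MELAB section scores into a
# final score; the rounding rule is an assumption, not taken from the manual.
def melab_final_score(writing: int, listening: int, gcvr: int) -> int:
    return round((writing + listening + gcvr) / 3)

# Averaging the section minima (53, 30, 15) and maxima (97, 100, 100) gives
# roughly 33 and 99, matching the final-score range reported in Table 1.
print(melab_final_score(76, 76, 73))   # -> 75
```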

The listening section of the test is a tape-recorded segment containing 50 questions. In the short sentence problems, test takers might be asked a question, hear a statement, or listen to a sentence spoken with special emphasis. In the last half of the listening section, test takers listen to a lecture and a conversation, each followed by several questions. All listening items are multiple choice with three options.

Section 3 of the MELAB usually contains 100 questions: 30 grammar, 20 cloze, 30 vocabulary, and 20 reading. Test takers have 75 minutes to complete the GCVR multiple-choice questions. Sometimes a longer version containing experimental items is given. If a longer test is given, the time limit is extended proportionally. The reported score is scaled from 15 to 100.

The optional speaking section requires test takers to have a 10–15 minute conversation with local examiners, who rate the overall communicative language proficiency. Local examiners consider fluency and intelligibility, grammar and vocabulary, and interactional skills. Functional language use or sociolinguistic proficiency is also considered. Examiners ask test takers questions about their background, future plans, and opinions on certain issues. Local examiners might also ask test takers to explain or describe in detail something about their field of specialization.

Table 1. Description of the MELAB

Section         Task description                                            Total N   Time (minutes)   Scoring
Writing         200–300 word composition                                    1         30               10-pt. holistic scale, 53–97
Listening       Discrete items based on questions and extended discourse    50        30               30–100
GCVR                                                                        100       75               15–100
  (Grammar)     Discrete items based on a two-turn conversational format    (30)
  (Cloze)       Discrete items based on one passage                         (20)
  (Vocabulary)  Discrete items based on a single-sentence format            (30)
  (Reading)     Discrete items based on four passages                       (20)
Speaking                                                                              10–15            holistic scale, 1–4
Final Score                                                                                            33–99

The MELAB has been shown to be reliable and fair, and the test benefits schools and test takers. The listening and GCVR sections of the test are highly reliable, with reliability coefficients (K-R21 and Cronbach’s alpha) ranging from 0.82 to 0.95 (English Language Institute, 1996). Also, MELAB test questions and forms are extensively pretested for optimum reliability. The MELAB Technical Manual provides content-related evidence of validity and describes the process of test development, the nature of the skills that the test is designed to measure, and a description of the prompts and item types. The technical manual also presents comparative statistics for test takers grouped by reason for testing, sex, age, and native language groups. It shows that the test minimizes the risk that some test takers would be disadvantaged or advantaged by unequal content knowledge. Tight control of current and retired test forms ensures accurate scores that are undistorted by cram classes or prior knowledge of test questions. As a result, the MELAB helps schools become more effective recruiters by offering test takers more choices and increasing flexibility. Test takers can also benefit from the MELAB because its score report contains not only the scores of each section and the total scores, but also a brief description of each section, along with score ranges, means, and standard deviation for each section and for the final score (Weigle, 2000).

In conclusion, the MELAB is a thoughtfully constructed, reliable, fair, and well-documented test. Potential test users are given ample information with which to assess strengths and weaknesses in language learning and use.

Empirical Studies about Strategy Use and Language Performance

Many studies employ quantitative and/or qualitative methods to investigate the relationships between strategy use and language performance (e.g., Bedell & Oxford, 1996; Bialystok, 1981; Mangubhai, 1991; O’Malley & Chamot, 1990). Based on diverse definitions and classifications of language strategies and using different analysis methods, these studies shed light on the relationships between strategy use and language performance from different perspectives. Some studies explore whether students who were better in language performance reported higher levels and frequencies of strategy use (e.g., Green & Oxford, 1995), whereas other studies examine whether higher level and frequency of strategy use contributed to better language performance (e.g., Park, 1997). Some researchers concluded that a causal, reciprocal relationship exists between strategy use and language performance, which indicates strategy use and language performance are both causes and outcomes of each other (e.g., Bremner, 1999).

As a result of the different perspectives that these studies produced, researchers have adopted various methods to measure strategy use and language performance. As stated earlier, methods used to assess strategy use include interview, think-aloud, observation, questionnaire, diary, and other methodologies. Methods used to gauge language performance are also various, such as professional language career status (e.g., Ehrman & Oxford, 1989), entrance and placement examinations (e.g., Sheorey, 1999), self-rating of language proficiency (e.g., Glenn, 2000), and language achievement and proficiency tests (e.g., Phakiti, 2003). In the last case, studies using language achievement and proficiency tests employ different language tasks. Some focus on oral tasks (e.g., Bruen, 2001) and some on reading tasks (e.g., Phakiti, 2003), whereas some use reading, writing, listening, and speaking tasks to measure language performance (e.g., Bremner, 1999).

Early studies in the 1980s reported differentiating results about the relationships between strategy use and language performance. Bialystok (1981) found that three strategies (functional practice, formal practice, and monitoring) were linked to language performance in Grade 12 students, whereas only functional practice was significantly related to language performance in Grade 10 students in the context of French as a second language. In contrast, in a study conducted with Chinese EFL university students, Huang and Van Naerssen (1985) found only functional practice strategies were linked to oral proficiency. Another important study with ESL learners by Politzer and McGroarty (1985) found few statistically significant correlations between strategy use as a whole and language performance, although certain individual strategy items showed significant correlations with language performance.


Since Oxford developed the Strategy Inventory for Language Learning (SILL) in 1990, a majority of subsequent studies have used the SILL or adapted the SILL as an instrument to investigate strategy use and the relationships between strategy use and language performance. Generally speaking, in a large number of these SILL studies, conducted in various geographical and cultural settings, a positive relationship between strategy use and language performance was reported (e.g., Bruen, 2001; Glenn, 2000; Park, 1997; Sheorey, 1999). “In most but not all instances, the relationship is linear, showing that more advanced or more proficient students use strategies more frequently” (Oxford & Burry-Stock, 1995, p. 10). “Students who were better in their language performance generally reported higher levels of overall strategy use and frequent use of a greater number of strategy categories” (Green & Oxford, 1995, p. 265).

Unlike previous researchers, Purpura (1997, 1999) conducted studies investigating the psychometric characteristics of a strategy use questionnaire and a language proficiency test. Then, he employed a series of statistical methods to investigate the relationships between strategy use and language performance. As stated before, a three-factor model of cognitive strategy use that involves the comprehending, storing/memory, and using/retrieval processes, and a one-factor model of metacognitive strategy use that involves assessment were defined. Two underlying factors of the language test were found: reading ability and lexico-grammatical ability. Results showed that metacognitive strategy use did not directly impact on language performance, but did have a significant, positive, direct effect on cognitive strategy use. Specifically, metacognitive strategy use had “a moderate, direct influence on the comprehending processes and a strong, direct impact on both the memory and retrieval processes” (Purpura, 1999, p. 172). Cognitive strategy use had no significant, direct influence on reading ability but had an impact on reading indirectly through lexico-grammatical ability. The test takers’ lexico-grammatical ability was closely related to the reading ability. However, the relationships between cognitive strategy use and lexico-grammatical ability were complex. In the three-factor model of cognitive strategy use, the comprehending processes had an insignificant effect on lexico-grammatical ability and the retrieval processes had a significant, positive impact on lexico-grammatical ability, while the memory processes produced a significant, negative effect on lexico-grammatical ability. Purpura concluded that the “greater degree to which a strategy was used did not necessarily correspond to the better performance” (1999, p. 180). Using a 35-item questionnaire derived from Purpura’s (1999) study, Phakiti (2003) explored the relationships between strategy use and reading performance with Thai EFL university students. He found a positive relationship of cognitive strategy use and metacognitive strategy use on the reading performance, but the relationship was weak (r = 0.391 and 0.469, respectively).

In summary, there seems to be neither consensus regarding strategy use in language learning nor agreement about the relationships between strategy use and language performance. This may be partially due to the fact that different strategy definitions, classifications, and measurement techniques have been utilized, as well as the existence of different interpretations of what it means to be proficient in language performance. Another important reason that contributes to these differences is that these studies were conducted in different cultural surroundings, some dealing with second language learning and some with foreign language learning. Participants also varied in terms of education levels and background. Thus, this current study aims to contribute to this field with information about MELAB ESL test takers’ reported strategy use and the relationships between their reported strategies and language performance.

Method

Participants

The participants in this study were MELAB test takers from a major MELAB test center in North America. A total of 179 test takers, who took the MELAB between July and November 2004, were recruited to participate in this study. Among the 161 respondents with valid questionnaires, 146 were females and 15 were males. The age of the test takers ranged from 16 to 52, with a mean of 34.14. Through conversations with the test takers, it was found most of them took the MELAB to become recognized professionals in North America, such as nurses. Others intended to apply to higher educational institutions in North America. The participants had various English-learning experiences. Some had studied English since primary school, while a few had studied English for only several months. The mean period of English study was 12.16 years. These participants’ first languages include 30 different languages across five major language sectors (Afro-Asian, Austronesian, Eurasian, Sino-Indian, and Indo-European). The most frequently reported first language is Tagalog/Filipino/Ilokano (24.2%), followed by Russian (9.6%), Hindi (6.6%), Malayalam (5.9%), Romanian (5.1%), Spanish (5.1%), Farsi/Persian (4.4%), Punjabi (4.4%), Tamil (4.4%), Chinese/Mandarin (3.8%), Arabic (2.9%), Urdu (2.9%), Japanese (2.2%), Korean (2.2%), Polish (2.2%), English (2.2%), Portuguese (2.2%), Slovak (2.2%), Gujarati (2.2%), Amharic (2.2%), Somali (0.7%), Tigrinya (0.7%), Thai (0.7%), Telugu (0.7%), Bulgarian (0.7%), Dutch (0.7%), French (0.7%), and Bengali (0.7%). Interestingly, some test takers claimed English was their first language because they had learned English and primarily used English in their daily lives since they were young. Still, these test takers came to take the MELAB, which is designed for nonnative speakers.

Instruments

Purpura’s cognitive and metacognitive strategy use questionnaire (1999) was revised for this study to elicit information on strategy use. The MELAB scores were adopted as a measure of language performance. Both the survey questionnaire and the MELAB were used to understand the relationships between strategy use and language performance.

The questionnaire of strategy use for this study consisted of two parts. Demographic information including student ID, gender, years of English study, age, and first language was requested in the first part. The second part contained 27 items of cognitive strategy use and 16 items of metacognitive strategy use, which were adapted from Purpura’s (1999) cognitive and metacognitive strategy use questionnaire. The questionnaire used in this study was expected to measure ten scales of cognitive strategy use and four scales of metacognitive strategy use. The questionnaire used a 6-point Likert scale: 0 (never), 1 (rarely), 2 (sometimes), 3 (often), 4 (usually), and 5 (always), which is the same as in Purpura’s (1999) study. Table 2 presents the composite scales of the questionnaire (the complete questionnaire is given in Appendix A).

The MELAB is a standardized English proficiency test whose stated purpose is to “evaluate the advanced level English competence of adult non-native speakers of English” (English Language Institute, 1996). As explained, the MELAB consists of three required sections (writing, listening, and GCVR) and one optional section (speaking). This study used the scores of each of the three required sections (writing, listening, and GCVR) and the total scores to measure communicative language ability.

Data Collection and Analysis

The questionnaires were collected at a major MELAB test center in North America. With the assistance of this MELAB test center, the researcher had the opportunity to distribute the questionnaires and consent forms either on the day that MELAB test takers registered for the exam or on the test date before the MELAB administration started. Among the 179 questionnaires collected, there were 18 copies with missing values exceeding 10% of the total number of variables; that is, more than four questions were not answered. Those cases were removed from the database, reducing the total number of valid questionnaires to 161. Twenty-one questionnaires with missing values totaling less than 10% were included in the database. The missing data were spread across the questionnaire and did not cluster to particular, hypothesized scales. After obtaining consent from these test takers, their test scores were collected from the English Language Institute at the University of Michigan. Two test takers’ scores were not available, which reduced the number of total participants to 159 for the second research question. Then, participants’ scores on the MELAB were matched with their responses on the questionnaires. Finally, test takers’ scores and responses were coded and entered into an SPSS file with 100% verification to ensure that there were no incorrect data. Some inconsistencies were identified and corrected upon verifying the original data. SPSS Version 11.0 was employed for analyzing the data in this study.

Table 2. Composites for the Strategy Use Questionnaire

Strategy Use                  Scales                          Items used
Cognitive Strategy Use        Analyzing                       23, 26, 27
                              Clarifying                      13, 25
                              Repeating                       3, 16, 17
                              Summarizing                     4, 20
                              Applying rules                  5, 11, 18
                              Associating                     6, 7, 8
                              Transferring                    9, 10, 12
                              Inferencing                     21, 24
                              Linking with prior knowledge    1, 2, 14
                              Practicing                      15, 19, 22
Metacognitive Strategy Use    Assessing the situation         28, 30, 31
                              Monitoring                      32, 33, 34
                              Self-evaluating                 29, 35, 36, 39, 40, 43
                              Self-testing                    37, 38, 41, 42

Descriptive Statistics

To have an understanding of strategy use at the item level and to enhance factor analysis and regression analysis, descriptive statistics for each questionnaire item were calculated. Distributions were also examined to check the assumptions regarding normality. A normal distribution of each item by all participants should be represented by a graph that approximates a bell-shaped curve (Creswell, 2002). To check normality, I examined the range, mean, standard deviation, skewness, and kurtosis of each questionnaire item. Because the statistical analyses in this study assumed a normal distribution, items with extreme skewness or kurtosis were considered for deletion from further data analyses. A kurtosis and skewness value between +1 and –1 is considered to be excellent, and a value between +2 and –2 is acceptable (Creswell, 2002). Items with an absolute skewness value of more than 4 and an absolute kurtosis value of more than 8 are suggested to be excluded (Kline, 1998).
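A brief sketch of this item-screening step is shown below, assuming the 161 questionnaire responses sit in a CSV file with one column per item; the file name and column labels are hypothetical, and the study itself carried out these checks in SPSS.

```python
# Hypothetical sketch of the normality screening described above.
# Assumes a CSV of questionnaire responses (one row per respondent,
# columns q1..q43 on the 0-5 Likert scale); names are illustrative only.
import pandas as pd
from scipy.stats import skew, kurtosis

responses = pd.read_csv("strategy_questionnaire.csv")   # hypothetical file

for item in responses.columns:
    values = responses[item].dropna()
    sk = skew(values)
    ku = kurtosis(values)          # excess kurtosis, as SPSS reports it
    # Creswell (2002): |value| <= 1 excellent, <= 2 acceptable;
    # Kline (1998): consider dropping |skewness| > 4 or |kurtosis| > 8.
    flag = "consider deleting" if abs(sk) > 4 or abs(ku) > 8 else ""
    print(f"{item}: mean={values.mean():.2f} sd={values.std():.2f} "
          f"skew={sk:.2f} kurtosis={ku:.2f} {flag}")
```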

Internal Consistency Reliability Estimates

Internal consistency reliability estimates were computed to provide an estimate of how the questionnaire items correlated with each other. An instrument that is used to measure samples is reliable to the extent that “it measures whatever it is measuring consistently” (Best & Kahn, 1998, p. 283). Cronbach’s alpha is considered to be an appropriate measure of internal consistency with which to estimate the level of reliability of items within an instrument (Pedhazur & Schmelkin, 1991). Instrument items should be related to other items if they measure a single construct. Therefore, reliability estimates using Cronbach’s alpha were examined to provide an estimate of whether the questionnaire and each scale had a high level of internal consistency.
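The sketch below shows how such an estimate can be computed for a single scale using the standard Cronbach’s alpha formula; the data file and item column names are hypothetical, and the reported estimates were obtained in SPSS.

```python
# Sketch: Cronbach's alpha for one scale of the questionnaire.
# The formula is the standard
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

responses = pd.read_csv("strategy_questionnaire.csv")     # hypothetical file
transferring = responses[["q9", "q10", "q12"]]            # e.g., the transferring scale
print(f"alpha = {cronbach_alpha(transferring):.2f}")      # reported as 0.89 in Table 3
```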

Factor Analysis

The aim of exploratory factor analysis is to explore how many main constructs are necessary to explain the relations among a set of indicators. Although Purpura summarized the traits of strategy use, the constructs extracted from his study might be different from this study because “indicators may have different meanings in different places, cultures, subcultures and the like” (Pedhazur & Schmelkin, 1991, p. 53). Therefore, exploratory factor analysis was used to identify how the 43 items clustered together in this study within the ESL context.

This study computed exploratory factor analysis with the reported cognitive strategy use and metacognitive strategy use separately. As was pointed out by Pedhazur and Schmelkin, “exploratory factor analysis is not, or should not be, a blind process in which all manner of variables or items are thrown into a factor-analytic ‘grinder’ in the expectation that something meaningful will emerge” (1991, p. 591). Since John Flavell and his colleagues introduced the terminology “metacognition” in the 1970s (Flavell, 1971, 1979; Flavell & Wellman, 1977), metacognition has become a widely accepted and distinctive construct in psychological research. In the early 1970s, attracted by the lure of this new-sounding concept “metacognition,” psychologists engaged in demonstration studies to see how the new idea would work. Later, Ann Brown and her colleagues stated that the initial stage to see how the new idea of metacognition worked was over (Brown, Bransford, Ferrara, & Campione, 1983). The new stage should be “devoted to the task of developing workable theories and procedures for separate parts of the problem space” (Brown et al., 1983, p. 125). Cognitive processes that include cognition and metacognition are operationalized by a variety of strategy types. As found in the literature over the past 30 years, metacognitive strategies are generally considered to be different from cognitive strategies in that they can be applied to a variety of language learning tasks, whereas cognitive strategies are limited to specific types of language tasks (O’Malley & Chamot, 1990; Purpura, 1999). For instance, “reading English books” as one type of cognitive strategy use applies only to the task of reading, whereas “before I begin an English assignment, I make sure I have a dictionary or other resources” as one type of metacognitive strategy use applies to all language learning situations. Based on the existing literature in this area, cognitive strategy use and metacognitive strategy use were factor analyzed independently in this study.

Various methods of factor analysis and rotation techniques were employed to obtain the most meaningful interpretation. Normally, factor loadings are considered to be high when they are greater than 0.6 and moderately high if they are above 0.3 (Kline, 1994). To ensure a meaningful interpretability of the solution, various factor solutions were tested to compare the results. The solution with the most meaningful interpretation was adopted in this study.
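For illustration only, the sketch below walks through the general sequence described here (eigenvalue inspection, extraction, varimax rotation, and a 0.4 loading cutoff) using the third-party factor_analyzer Python package rather than SPSS; the data file and item names are hypothetical, the use of the package's "principal" extraction option as a stand-in for principal axis factoring is an assumption, and the resulting solution would of course differ from the one reported below.

```python
# Sketch of the exploratory factor analysis steps described above, using the
# third-party factor_analyzer package instead of SPSS (names are illustrative).
import pandas as pd
from factor_analyzer import FactorAnalyzer

responses = pd.read_csv("strategy_questionnaire.csv")            # hypothetical file
cognitive = responses[[f"q{i}" for i in range(1, 28)]].dropna()  # the 27 cognitive items

# How many factors have eigenvalues greater than 1.0?
fa0 = FactorAnalyzer(rotation=None)
fa0.fit(cognitive)
eigenvalues, _ = fa0.get_eigenvalues()
n_factors = int((eigenvalues > 1.0).sum())

# Extraction with a varimax rotation; "principal" is assumed here to stand in
# for the principal axis factoring used in the study.
fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="principal")
fa.fit(cognitive)

loadings = pd.DataFrame(fa.loadings_, index=cognitive.columns)
# Keep only loadings above the 0.4 cutoff; items loading on more than one
# factor would be candidates for deletion, as described in the text.
print(loadings.where(loadings.abs() > 0.4).round(2))
```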

Regression Analysis

To address the second research question about the relationships between language strategy use and MELAB performance, regression analysis was performed to examine whether these learner strategies had an effect on the MELAB scores. The stepwise regression method was used in this study because this method is “a model-building rather than model-testing procedure” (Tabachnick & Fidell, 2001, p. 138). It finds an equation that predicts the maximum variance for the specific data set under consideration. To be specific, this study used stepwise regression analysis to examine the relationships of strategy use with the MELAB writing scores, listening scores, GCVR scores, and total scores.

To determine significance throughout the study, I used the standard of p < 0.05. This means that the relationships between strategy use and MELAB scores were considered statistically significant if they could have occurred by chance fewer than 5 times out of 100. R square, which indicates the proportion of variance in the dependent variable accounted for by the independent variables, was employed to show how well a dependent variable (MELAB) was explained by the independent variables (strategy use). The beta weight was also reported to examine the magnitude of the prediction of reported strategy use in this study.
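As a rough sketch of how such a stepwise procedure could be expressed outside SPSS, the forward-selection loop below uses statsmodels with the p < 0.05 entry criterion; the variable names and data file are hypothetical, and SPSS's stepwise algorithm additionally re-checks entered predictors for removal, which this sketch omits.

```python
# Illustrative forward-selection sketch of a stepwise procedure
# (the study used SPSS; predictor/score column names are hypothetical).
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(y: pd.Series, X: pd.DataFrame, alpha: float = 0.05):
    """Add, one at a time, the remaining predictor with the smallest p-value below alpha."""
    selected = []
    while True:
        remaining = [c for c in X.columns if c not in selected]
        pvals = {}
        for c in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = model.pvalues[c]
        if not pvals or min(pvals.values()) >= alpha:
            break
        selected.append(min(pvals, key=pvals.get))
    final = sm.OLS(y, sm.add_constant(X[selected])).fit()
    return selected, final

# Hypothetical usage with a merged file of scale scores and MELAB scores:
# data = pd.read_csv("scale_scores_and_melab.csv")
# predictors = data[["repeating_confirming", "writing_strategies", "practicing",
#                    "generating", "applying_rules", "linking_prior",
#                    "evaluating", "monitoring", "assessing"]]
# order, model = forward_stepwise(data["melab_writing"], predictors)
# print(order, model.rsquared, model.params)
```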

Results

Descriptive Statistics

The descriptive statistics for the item-level data of the strategy use questionnaire were analyzed based on the 161 participants. The distributions for 27 items of cognitive strategy use and 16 items of metacognitive strategy use are presented in Appendix B. The means of these items ranged from 2.41 (I try to improve my English by looking for words in my own language that are similar to words in English in spelling, pronunciation, or meaning) to 4.43 (I try to improve my English by looking for opportunities to speak English as much as possible). A large number of strategies (72.1%) was reported to be often/always used. The standard deviations ranged from 0.85 to 1.71. A majority of skewness and kurtosis values ranged between +1 and –1. Item 15 and Item 31 had extreme skewness and kurtosis (1.84, 4.49; 1.89, 3.56). Subsequent analyses were computed with an awareness that the two items might be problematic because of the threat to a normal distribution that they posed.

Regarding the instrument of language performance, the MELAB total scores ranged from 53 to 97 with a mean of 75.29 and a standard deviation (SD) of 9.49 (see Appendix B). Writing ranged from 65 to 95 with a mean of 76.42 and a SD of 6.27, listening from 49 to 100 with a mean of 76.38 and a SD of 11.71, and GCVR from 36 to 100 with a mean of 73.14 and a SD of 13.99. All the scores were normally distributed within ±1 for skewness and kurtosis.


Internal Consistency Reliability Estimates

Internal consistency reliability estimates were calculated with the 43-item strategy use questionnaire (α = 0.94, see Table 3). The reliability estimate for the 27 cognitive strategy use items is 0.91, and the reliability estimate for the 16 metacognitive strategy use items is 0.89. These estimates are comparatively high. The reliability estimates of the ten scales of cognitive strategy use and four scales of metacognitive strategy use range from 0.49 to 0.89. Clarifying and inferencing, both consisting of two items, have reliability estimates lower than 0.60.

Table 3. Internal Consistency Reliability Estimates for the Strategy Use Questionnaire

Strategy Use                  Scales                          Items used                Reliability estimates
Cognitive Strategy Use        Analyzing                       23, 26, 27                0.78
                              Clarifying                      13, 25                    0.49
                              Repeating                       3, 16, 17                 0.72
                              Summarizing                     4, 20                     0.62
                              Applying rules                  5, 11, 18                 0.70
                              Associating                     6, 7, 8                   0.66
                              Transferring                    9, 10, 12                 0.89
                              Inferencing                     21, 24                    0.54
                              Linking with prior knowledge    1, 2, 14                  0.69
                              Practicing                      15, 19, 22                0.76
                              Subtotal                                                  0.91
Metacognitive Strategy Use    Assessing the situation         28, 30, 31                0.60
                              Monitoring                      32, 33, 34                0.79
                              Self-evaluating                 29, 35, 36, 39, 40, 43    0.82
                              Self-testing                    37, 38, 41, 42            0.80
                              Subtotal                                                  0.89
Total                                                                                   0.94

Factor Analysis

Cognitive Strategy Use

Exploratory factor analysis was performed with the 27 cognitive strategy use items. Principal axis factoring and a varimax solution were used because they seemed to maximize interpretation after comparing with the results from various other methods of factor analysis and factor solutions. Although factor loadings larger than 0.3 were expected to be considered, it was found that factor loadings greater than 0.4 were more acceptable because they maximized parsimony and interpretability. Six factors had eigenvalues greater than 1.0. It was then decided that items loading on more than one factor would be considered for deletion from further factor analyses because these items might not measure the intended factors. Therefore, Items 12, 17, and 20 were deleted after examining the factor loadings and the wording of the items. Item 15 was kept in the factor analysis because this item showed a clear factor loading. As a result, principal axis factoring with a varimax solution yielded six factors with eigenvalues greater than 1.0, accounting for 61.59% of the total variance. A display of the inferential statistics of factor analysis is presented in Table 4.

As shown in Table 4, four items loaded on Factor 1, which accounted for 29.39% of the variance. After reading the individual items scrupulously, I found that these items either repeated or further asked for confirmation of information already received or produced. Factor 1, therefore, was named repeating/confirming information strategies. Factor 2 was represented by Items 23, 25, 26, and 27. These items especially dealt with strategies that were employed when test takers engaged in writing tasks. This factor was, therefore, labeled writing strategies, and it explained 9.09% of the total variance.

Table 4. Pattern Matrix for Cognitive Strategy Use (loadings greater than 0.4 on Factors F1–F6)

Item   Loading     Item   Loading     Item   Loading
Q1     .748        Q9     .524        Q19    .670
Q2     .548        Q10    .533        Q21    .514
Q3     .512        Q11    .516        Q22    .664
Q4     .564        Q13    .667        Q23    .490
Q5     .590        Q14    .462        Q24    .545
Q6     .419        Q15    .526        Q25    .568
Q7     .455        Q16    .642        Q26    .600
Q8     .502        Q18    .532        Q27    .628

Principal Axis Factoring with Varimax rotation and Kaiser Normalization, converged in eight iterations.

Factor 3, accounting for 7.03% of the total variance, measured to what extent test takers improved their English by actual practicing. It was labeled practicing strategies as originally designed. Factor 4, explaining 6.49% of the variance, was represented by Items 6, 7, 8, 9, 10, 21, and 24. These seven items can be defined as the strategies with which learners transform the unfamiliar into the familiar by generating their own connections among the phonetic, semantic, and syntactic information. Thus, Factor 4 was named generating strategies, which represents strategies used to make connections among different parts of information. Factor 5, labeled applying rules as originally designed, measured to what extent test takers applied rules to their language learning. This factor explained 4.9% of the total variance. Factor 6 measured strategies used to make connections from that which is already understood to that which is to be learned. Factor 6, accounting for 4.73% of the variance, was labeled linking with prior knowledge as hypothesized.

To summarize, based on the method of principal axis factoring with the 6-factor varimax solution, MELAB test takers’ perceptions of cognitive strategy use primarily fell into six dimensions: repeating/confirming information strategies, writing strategies, practicing strategies, generating strategies, applying rules strategies, and linking with prior knowledge strategies.

Metacognitive Strategy Use

Exploratory factor analysis was performed with 16 items of metacognitive strategy use. Principal axis factoring with a quartimax solution was adopted because it maximized parsimony and interpretability. Factor loadings greater than 0.4 were accepted because this provided a meaningful interpretation. An examination of initial eigenvalues indicated that three factors had eigenvalues greater than 1.0. Items 31 and 38 were deleted because these items loaded on more than one factor. The final factor analysis extracted three factors with eigenvalues greater than 1.0, accounting for 61.99% of the total variance. Table 5 presents the pattern matrix of metacognitive strategy use.

Table 5. Pattern Matrix for Metacognitive Strategy Use (loadings greater than 0.4 on Factors F1–F3)

Item   Loading     Item   Loading
Q28    .804        Q36    .650
Q29    .665        Q37    .780
Q30    .541        Q39    .750
Q32    .787        Q40    .711
Q33    .783        Q41    .802
Q34    .836        Q42    .710
Q35    .724        Q43    .779

Principal Component Analysis, Quartimax Rotation with Kaiser Normalization, converged in 5 iterations.

Factor 1 was represented by Items 35, 36, 37, 39, 40, 41, 42, and 43. Because these items are all concerned with evaluating the effectiveness of test takers’ performance, Factor 1 was labeled evaluating. This factor accounted for 39.37% of the total variance. Factor 2 was named monitoring as originally designed because the items that represented Factor 2 measured how test takers monitored their own or another’s performance of a task. Factor 2 explained 14.33% of the variance. Items 28, 29, and 30 represented Factor 3. These three items examined how test takers generated an overall plan of action before engaging in a task. Factor 3, thus, was labeled assessing, and explained 8.29% of the total variance. In short, metacognitive strategy use had three underlying factors: evaluating, monitoring, and assessing.

Regression Analysis

Relationship between Strategy Use and MELAB Writing

Stepwise regression analysis was performed to examine whether these learner strategies had an effect on the MELAB writing scores. Tables 6 and 7 present the inferential statistics of the regression analysis. As can be seen, repeating/confirming information, linking with prior knowledge, writing strategies, and generating strategies had a significant effect on the prediction of the MELAB writing score. The final regression model is able to explain 21.4% of the total variance in the MELAB writing scores. Among these indicators, repeating/confirming information and generating strategies showed a negative impact on the MELAB writing score, whereas linking with prior knowledge and writing strategies showed a positive impact on the MELAB writing score. In descending order, repeating/confirming information, linking with prior knowledge, writing strategies, and generating strategies contributed significantly to the MELAB writing score.

Table 6. Model Summary for Writing

Model   R         R Square   Adjusted R Square
1       .275(a)   .076       .070
2       .398(b)   .158       .147
3       .429(c)   .184       .168
4       .462(d)   .214       .193

(a) predictors: (constant), repeating/confirming information; (b) predictors: (constant), repeating/confirming information, linking with prior knowledge; (c) predictors: (constant), repeating/confirming information, linking with prior knowledge, writing strategies; (d) predictors: (constant), repeating/confirming information, linking with prior knowledge, writing strategies, generating strategies.

Table 7. Regression Analysis for Variables Predicting Writing

                                   B        Beta    t        Sig.
(constant)                         75.371           33.745
Repeating/confirming information   −2.181   −.363   −4.108   .000
Linking with prior knowledge       2.060    .287    3.376    .001
Writing strategies                 1.371    .211    2.462    .015
Generating strategies              −1.421   −.211   −2.406   .017
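For readers who want to see how results like those in Tables 6 and 7 are typically produced, here is a hedged sketch of a forward, p-value-based stepwise OLS using statsmodels. It is an illustration only: the DataFrame, file name, and factor-score column names are hypothetical, the entry criterion of p < .05 is an assumption (the paper does not state its entry/removal thresholds), and a full stepwise procedure may also remove previously entered predictors.

```python
# Sketch only: forward, p-value-based stepwise OLS with statsmodels.
# The CSV file and all column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("melab_strategies.csv")
predictors = ["repeat_confirm", "linking_prior", "writing_strat", "generating",
              "practicing", "applying_rules", "monitoring", "evaluating", "assessing"]
y = data["melab_writing"]

selected, remaining = [], list(predictors)
while remaining:
    # p-value of each candidate when it is added to the current model
    pvals = {}
    for cand in remaining:
        X = sm.add_constant(data[selected + [cand]])
        pvals[cand] = sm.OLS(y, X).fit().pvalues[cand]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:      # assumed entry criterion
        break
    selected.append(best)
    remaining.remove(best)

final_model = sm.OLS(y, sm.add_constant(data[selected])).fit()
print(final_model.summary())     # unstandardized B, t, p, and R-squared as in Tables 6 and 7
```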


Relationship between Strategy Use and MELAB Listening

Tables 8 and 9 show the inferential statistics of the multiple regression on the MELAB listening scores. As shown, the significant predictors of the MELAB listening score were repeating/confirming information, linking with prior knowledge, and generating strategies. The linear regression model accounts for 17.2% of the total variance in the MELAB listening scores. Among these predictors, repeating/confirming information and generating strategies showed a negative impact on the MELAB listening score, whereas linking with prior knowledge showed a positive impact on the MELAB listening score. The significant contributors to the MELAB listening score, in descending order, were repeating/confirming information, linking with prior knowledge, and generating strategies.

Table 8. Model Summary for Listening

Model   R        R Square   Adjusted R Square
1       .233(a)  .054       .048
2       .376(b)  .141       .130
3       .415(c)  .172       .156

(a) predictors: (constant), repeating/confirming information; (b) predictors: (constant), repeating/confirming information, linking with prior knowledge; (c) predictors: (constant), repeating/confirming information, linking with prior knowledge, generating strategies.

Table 9. Regression Analysis for Variables Predicting Listening

                                   B        Beta    t        Sig.
(constant)                         75.274           18.176   .000
Repeating/confirming information   −2.909   −.259   −2.965   .004
Linking with prior knowledge       4.991    .373    4.544    .000
Generating strategies              −2.689   −.214   −2.392   .018

Relationship between Strategy Use and MELAB GCVR

Stepwise regression analysis was also performed to examine whether these learner strategies had an effect on the MELAB GCVR scores. Tables 10 and 11 present the results of the regression analysis. As can be seen in the tables, monitoring and linking with prior knowledge made a significant, positive contribution to the prediction of the MELAB GCVR scores, whereas repeating/confirming information showed a significant, negative impact on the MELAB GCVR scores. The regression model is able to explain 12.5% of the total variance. The significant contributors to the MELAB GCVR scores, in descending order, were monitoring, repeating/confirming information, and linking with prior knowledge.


Table 10. Model Summary for GCVR

Model   R        R Square   Adjusted R Square
1       .216(a)  .047       .041
2       .296(b)  .088       .076
3       .353(c)  .125       .108

(a) predictors: (constant), monitoring; (b) predictors: (constant), monitoring, repeating/confirming information; (c) predictors: (constant), monitoring, repeating/confirming information, linking with prior knowledge.

Table 11. Regression Analysis for Variables Predicting GCVR

                                   B        Beta    t        Sig.
(Constant)                         64.002           10.823
Monitoring                         2.227    .148    1.855    .066
Repeating/confirming information   −3.806   −.284   −3.484   .001
Linking with prior knowledge       3.520    .220    2.565    .011

Relationship between Strategy Use and MELAB Total Scores

In order to understand how strategy use predicts the MELAB total scores, stepwise multiple regression was performed with strategy use as independent variables and the MELAB total as the dependent variable. Tables 12 and 13 present the inferential statistics of the regression analysis.

Table 12. Model Summary for Total Scores

Model   R        R Square   Adjusted R Square
1       .257(a)  .066       .060
2       .405(b)  .164       .153
3       .435(c)  .189       .173
4       .457(d)  .209       .188

(a) predictors: (constant), repeating/confirming information; (b) predictors: (constant), repeating/confirming information, linking with prior knowledge; (c) predictors: (constant), repeating/confirming information, linking with prior knowledge, generating strategies; (d) predictors: (constant), repeating/confirming information, linking with prior knowledge, generating strategies, monitoring.

Table 13. Regression Analysis for Variables Predicting Total Scores

                                   B        Beta    t        Sig.
(constant)                         70.345           18.135
Repeating/confirming information   −2.513   −.276   −3.192   .002
Linking with prior knowledge       3.629    .334    3.945    .000
Generating strategies              −2.051   −.201   −2.291   .023
Monitoring                         1.538    .150    1.976    .050


As indicated, repeating/confirming information and generating strategies made a significant, negative contribution to the prediction of the MELAB total scores, whereas linking with prior knowledge and monitoring showed a significant, positive impact on MELAB total scores. The regression model accounts for 18.9% of the total variance. The significant contributors to the MELAB total scores, in descending order, were repeating/confirming information, linking with prior knowledge, generating strategies, and monitoring.

Summary and Discussion

The study examines the nature of learner strategies reported by MELAB test takers and how their reported strategy use affected their MELAB performance in the ESL context. Using a 43-item strategy use questionnaire, it was found that cognitive strategy use had six underlying factors and metacognitive strategy use had three underlying factors. Specifically, MELAB test takers' perceptions of cognitive strategy use primarily fell into six dimensions: repeating/confirming information strategies, writing strategies, practicing strategies, generating strategies, applying rules strategies, and linking with prior knowledge strategies. MELAB test takers' perceptions of metacognitive strategy use had three dimensions: evaluating, monitoring, and assessing. The exploratory factor analysis results in this study were partially consistent with what was originally hypothesized and with Purpura's framework. Practicing, applying rules, linking with prior knowledge, and monitoring fit the originally designed framework. Writing strategies consisted of the originally designed factor "analyzing" plus Item 25, and assessing strategies consisted of "assessing" plus Item 29. Generating strategies combined "associating," "transferring," and "inferencing," and evaluating strategies combined "self-evaluating" and "self-testing." Additionally, with this group of MELAB test takers, this study found a new construct: repeating/confirming information. There are several reasons why this study extracted different constructs from those in Purpura (1999). First, because the study was conducted in an ESL context, this group of participants generally had a great amount of exposure to English. A majority of the participants might have had to comprehend and produce the language for survival reasons. As a result, their strategy use might differ from that of the participants in Purpura's study, who were mainly EFL learners. Second, due to the small number of participants, this study had difficulties distinguishing "associating" and "transferring" from "inferencing," and "self-evaluating" from "self-testing." A larger number of participants and more questionnaire items are needed for further analysis. Last, some items need to be designed and worded more carefully. For example, Item 29 ("before I begin an English assignment, I make sure I have a dictionary or other resources") can be interpreted as an assessing strategy because test takers learn English by assessing their available internal and external resources.

This study also addresses how cognitive and metacognitive strategy use affected MELAB scores. As for predicting MELAB writing scores, repeating/confirming information and generating strategies showed a significant, negative impact, whereas linking with prior knowledge and writing strategies showed a significant, positive impact. The regression model accounted for 21.4% of the variance in MELAB writing scores. Regarding the prediction of MELAB listening scores, repeating/confirming information and generating strategies showed a negative impact, whereas linking with prior knowledge showed a positive impact. The linear regression model explained 17.2% of the total variance.


Regarding the prediction of MELAB GCVR scores, monitoring and linking with prior knowledge made a significant, positive contribution, whereas repeating/confirming information showed a significant, negative impact. The regression model explained 12.5% of the variance in MELAB GCVR scores. As for predicting MELAB total scores, repeating/confirming information and generating strategies made a significant, negative contribution, whereas linking with prior knowledge and monitoring showed a significant, positive impact. The regression model accounted for 18.9% of the variance in MELAB total scores. In summary, repeating/confirming information consistently made a significant, negative contribution, whereas linking with prior knowledge consistently showed a significant, positive effect. The results suggest that the more the test takers mechanically repeated information, the worse they performed; the more the test takers synthesized what was learned and applied it to practice, the better they performed. While generating strategies played a negative, significant role in the MELAB writing, listening, and total scores, it produced no significant impact on the GCVR. This might be because the better-performing test takers made fewer connections among the phonetic, semantic, and syntactic language input in the writing and listening sections than the low scorers, but they made the same effort as other test takers in the GCVR section because these multiple-choice tasks require literal information. It is understandable that writing strategies only had a significant, positive effect on the MELAB writing score, and not on the listening and GCVR sections. Monitoring, a strong positive predictor of the MELAB GCVR, was also a positive predictor of the MELAB total score. This indicates that the more the test takers observed the effectiveness of their own or others' performance, the better they scored on the GCVR and the total. However, it is hard to interpret why monitoring only predicted the MELAB GCVR score, not writing or listening scores. Applying rules, practicing, assessing, and evaluating had no significant effect on any section of the MELAB. The test takers showed no distinctive difference in using these strategies. This study concludes that not every type of strategy use enhances language performance. Some strategies have a significant, positive effect on language performance, some make a significant, negative contribution to language performance, and others seem to have no effect. These results corroborate the results of other studies in this area. For example, Gu and Johnson (1996) found some positive and some negative predictors of vocabulary strategies on a language proficiency test. Using a survey questionnaire, Wen and Johnson (1997) concluded that vocabulary strategy, mother-tongue-avoidance strategy, and management strategy had positive effects on English achievement, form-focused strategy and meaning-focused strategy had little effect, and tolerating-ambiguity strategy had a negative effect. Therefore, strategy use can be seen as a set of complex behaviors, dependent on the nature of different tasks and contributing differently to language performance.

The study provides evidence of a linear relationship between strategy use and the MELAB; however, the effect of strategy use on language test performance was weak, explaining about 12.5% to 21.4% of the score variance. This result is consistent with results from some other studies. Park's study (1997) revealed that cognitive strategies and social strategies together contributed to 13% of TOEFL score variance. Phakiti (2003) also found a weak relationship of cognitive and metacognitive strategies to reading test performance in his study (explaining about 15%−22% of the test score variance). In a Chinese EFL context, cognitive and metacognitive strategy use accounted for 8.6% of the score variance on the College English Test Band 4 (Song, 2004). In this study, it is not difficult to explain why strategy use predicted a small proportion of the MELAB scores. Bachman (1990) proposed that the factors affecting performance on language tests are communicative language ability, the personal characteristics of test takers, and the characteristics of the test method or test tasks. Strategy use is only one part of the personal characteristics of test takers, and, therefore, would explain only a small proportion of the MELAB performance.

Limitations

Although this study revealed some interesting findings, these findings are certainly not conclusive and comprehensive in nature. There are several limitations that may affect internal and external validity of this study. First, data analyses were based on the assumption that cognitive and metacognitive strategies are two different dimensions. Although researchers have found empirical evidence that they are different constructs, the factor loading structures were not apparent when all cognitive and metacognitive strategy items were factor analyzed together in this study. A number of possible interactions among these strategies exist in the operational setting. Therefore, issues with regard to the nature of strategy use are limitations in this study that may affect internal validity. Also, the analytic procedure of regression analysis has its weakness because of the interrelatedness of cognitive and metacognitive strategy use. Another concern is the question of whether mental processes can be validly elicited by merely using a self-reported questionnaire. It is also difficult to include a comprehensive list of strategies used by test takers. Moreover, because this study focuses on test takers’ cognitive characteristics, communication, social, and affective strategies are not discussed in the study. Other potentially influential variables, such as attitudes, anxiety, motivation, and effort, which have been considered to influence language performance, are also not included in this study. Further research is needed to obtain a more comprehensive picture of strategy use and its relationships with language performance.

Acknowledgments

I would like to extend my gratitude to the English Language Institute of the University of Michigan for funding this project. My sincere appreciation also goes to Maria Huntley, Dr. Jeff Johnson, Dr. Georgia Wilder, and Terry Yao for their assistance and suggestions on the project.

References

Anderson, N. J., & Vandergrift, L. (1996). Increasing metacognitive awareness in the L2 classroom by using think-aloud protocols and other verbal report formats. In R. L. Oxford (Ed.), Language learning strategies around the world: Cross-cultural perspectives (pp. 3–18). Honolulu: University of Hawaii, Second Language Teaching & Curriculum Center.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.

Baker, L., & Brown, A. L. (1984). Metacognitive skills of reading. In D. Pearson, M. Kamil, R. Barr, & P. Mosenthal (Eds.), Handbook of reading research (pp. 353–394). New York: Longman.

Bedell, D. A., & Oxford, R. L. (1996). Cross-cultural comparisons of language learning strategies in the People’s Republic of China and other countries. In R. L. Oxford (Ed.), Language learning strategies around the world: Cross-cultural perspectives (pp. 47–60). Honolulu: University of Hawaii, Second Language Teaching & Curriculum Center.

Best, J. W., & Kahn, J. V. (1998). Research in education (8th ed.). Boston, MA: Heinle & Heinle.

Bialystok, E. (1978). A theoretical model of second language learning. Modern Language Journal, 28, 69–83.

Bialystok, E. (1981). The role of conscious strategies in second language proficiency. Modern Language Journal, 65, 24–35.

Bremner, S. (1999). Language learning strategies and language proficiency: Investigating the relationship in Hong Kong. Canadian Modern Language Review, 55, 490–514.

Brown, A. L., Bransford, J. D., Ferrara, R. A., & Campione, J. (1983). Learning, understanding, and remembering. In P. H. Mussen (Series Ed.) & J. H. Flavell & E. M. Markman (Vol. Eds.), Handbook of child psychology: Vol. 3. Cognitive development (4th ed., pp. 77–167). New York: Wiley.

Bruen, J. (2001). Strategies for success: Profiling the effective learner of German. Foreign Language Annals, 34, 216–225.

Cohen, A. D. (1998). Strategies in learning and using a second language. New York: Longman.

Creswell, J. W. (2002). Education research: Planning, conducting, and evaluation of quantitative and qualitative research. Upper Saddle River, NJ: Prentice Hall.

Dreyer, C., & Oxford, R. L. (1996). Learning strategies and other predictors of ESL proficiency among Afrikaans in South Africa. In R. L. Oxford (Ed.), Language learning strategies around the world: Cross-cultural perspectives (pp. 61–74). Honolulu: University of Hawaii, Second Language Teaching & Curriculum Center.

Ehrman, M., & Oxford, R. L. (1989). Effects of sex differences, career choice, and psychological type on adult language learning strategies. Modern Language Journal, 73, 1–13.

Ellis, R. (1994). The study of second language acquisition. Oxford, UK: Oxford University Press.

English Language Institute. University of Michigan. (2003). MELAB information bulletin and registration forms 2003–2004. Ann Arbor: English Language Institute, University of Michigan.

English Language Institute. University of Michigan. (1996). MELAB technical manual. Ann Arbor: English Language Institute, University of Michigan.

Flavell, J. H. (1971). First discussant’s comments: What is memory development the development of? Human Development, 14, 272–278.

Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of psychological inquiry. American Psychologist, 34, 272–278.

Flavell, J. H., & Wellman, H. M. (1977). Metamemory. In R. V. Kail, Jr. & J. W. Hagen (Eds.), Perspectives on the development of memory and cognition (pp. 3–25). Hillsdale, NJ: Erlbaum.

Gagne, E. D., Yekovich C. W., & Yekovich F. R. (1993). The cognitive psychology of school learning (2nd ed.). New York: HarperCollins College Publishers.

Glenn, W. (2000). Language learning strategy use of bilingual foreign language learners in Singapore. Language Learning, 50, 203–244.


Green, N. M., & Oxford, R. (1995). A closer look at learning strategies, L2 proficiency, and gender. TESOL Quarterly, 29, 261–297.

Gu, Y., & Johnson, R. K. (1996). Vocabulary strategies and language learning outcomes. Language Testing, 46, 643–679.

Hsiao T., & Oxford, R. L. (2002). Comparing theories of language learning strategies: A confirmatory factor analysis. Modern Language Journal, 86, 368–383.

Huang, X., & Van Naerssen, M. (1985). Learning strategies for oral communication. Applied Linguistics, 8, 287–307.

Hunt, M. M. (1982). The universe within: A new science explores the human mind. New York: Simon and Schuster.

Kline, P. (1994). An easy guide to factor analysis. New York: Routledge.

Kline, R. (1998). Principles and practices of structural equation modeling. New York: Guilford Press.

Mangubhai, F. (1991). The processing behaviors of adult second language learners and their relationship to second language proficiency. Applied Linguistics, 12, 268–298.

McDonough, S. H. (1995). Strategy and skill in learning a foreign language. New York: St. Martin’s Press.

Naiman, N., Frohlich, M., Stern, H. H., & Todesco, A. (1978). The good language learner. Toronto, Ontario, Canada: Ontario Institute for Studies in Education.

O’Malley, J. M., & Chamot, A. U. (1990). Learning strategies in second language acquisition. New York: Cambridge University Press.

Oxford, R. L. (1990). Language learning strategies: What every teacher should know. Boston: Heinle & Heinle.

Oxford, R. L., & Burry-Stock, J. (1995). Assessing the use of language strategies worldwide with the ESL/EFL version of the Strategy Inventory for Language Learning (SILL). System, 23, 1–23.

Oxford, R. L., Lavine, R. Z., Felkins, G., Hollaway, M. E., & Saleh, A. (1996). Telling their stories: Language students use diaries and recollection. In R. L. Oxford (Ed.), Language learning strategies around the world: Cross-cultural perspectives (pp. 19–34). Honolulu: University of Hawaii, Second Language Teaching & Curriculum Center.

Paris, S. G., Cross, D., & Lipson, M. Y. (1984). Informed strategies for learning: A program to improve children’s reading awareness and comprehension. Journal of Educational Psychology, 76, 1239–1252.

Park, G. (1997). Language learning strategies and English proficiency in Korean university students. Foreign Language Annals, 30, 211–221.

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.

Phakiti, A. (2003). A closer look at the relationship of cognitive and metacognitive strategy use to EFL reading achievement test performance. Language Testing, 20, 26–56.

Politzer, R., & McGoarty, M. (1985). An exploratory study of learning behaviors and their relationship to gains in linguistic and communicative competence. TESOL Quarterly, 19, 103–124.

Purpura, J. M. (1997). An analysis of the relationships between test takers’ cognitive and metacognitive strategy use and second language test performance. Language Learning, 47, 289–325.


Purpura, J. M. (1998a). The development and construct validation of an instrument designed to investigate selected cognitive background characteristics of test-takers. In A. J. Kunnan (Ed.), Validation in language assessment (pp. 111–140). Mahwah, NJ: Erlbaum.

Purpura, J. M. (1998b). Investigating the effects of strategy use and second language test performance with high- and low-ability test takers: A structural equation modeling approach. Language Testing, 15, 333–379.

Purpura, J. M. (1999). Learner strategy use and performance on language tests: A structural equation modeling approach. Cambridge, UK: Cambridge University Press.

Ridley, J. (1997). Reflection and strategies in foreign language learning: A study of four university-level ab initio learners of German. Frankfurt am Main, Germany: Peter Lang.

Rigney, J. W. (1978). Learning strategies: A theoretical perspective. In H. F. O’Neil (Ed.), Learning strategies (pp. 165–205). New York: Academic Press.

Rubin, J. (1975). What the “good language learner” can teach us. TESOL Quarterly, 9, 41–51.

Rubin, J. (1981). Study of cognitive processes in second language learning. Applied Linguistics, 2, 117–131.

Sheorey, R. (1999). An examination of language learning strategy use in the setting of an indigenized variety of English. System, 27, 173–190.

Song, X. (2004). Language learning strategy use and language performance for Chinese learners of English. Unpublished master’s thesis, Queen’s University, Kingston, Ontario, Canada.

Stern, H. H. (1975). What can we learn from the good language learner? Canadian Modern Language Review, 31, 304–318.

Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics. Boston: Allyn & Bacon.

Weigle, S. (2000). Test review: The Michigan English Language Assessment Battery (MELAB). Language Testing, 17, 449–455.

Wen, Q., & Johnson, R. (1997). L2 learner variables and English achievement: A study of tertiary-level English majors in China. Applied Linguistics, 18, 27–48.

Wenden, A. (1991). Learner strategies for learner autonomy. Englewood Cliffs, NJ: Prentice Hall.


Appendix A

Dear friends:

My name is Xiaomei Song. Today I invite you to do a survey about English strategy use. It will take you about 20 minutes to complete. Please indicate the degree to which you agree with each of the statements by circling the following scale. 5 indicates that the statement is true of you almost always and 0 indicates that the statement is very rarely true of you. Do not answer how you think you should be, or what other people do. There are no right or wrong answers to these statements.

Part One. Some information about you:

MELAB Testing ID:
Years of English studying:
Gender:
Age:
First Language:

Part Two. Cognitive Strategies for Language Learning

0          1          2          3          4          5
Never      Rarely     Sometimes  Often      Usually    Always

When I am learning new material in English...
1. I try to connect what I am learning with what I already know. 0 1 2 3 4 5
2. I try to somehow organize the material in my mind. 0 1 2 3 4 5
3. I repeat words to make sure that I have understood them correctly. 0 1 2 3 4 5
4. I make written summaries of information that I hear or read in English. 0 1 2 3 4 5
5. I learn best when I am taught the rules. 0 1 2 3 4 5

I learn new words in English by...
6. relating the sound of the new word to the sound of a familiar word. 0 1 2 3 4 5
7. remembering where the new word was located on the page, or where I first saw or heard it. 0 1 2 3 4 5
8. thinking of words I know that sound like the new word. 0 1 2 3 4 5

I learn grammar in English by...
9. using the grammar of my own language to help me learn the rules. 0 1 2 3 4 5
10. comparing grammar rules in my own language with grammar rules in English. 0 1 2 3 4 5
11. memorizing the rules and applying them to new situations. 0 1 2 3 4 5

I try to improve my English by...
12. looking for words in my own language that are similar to words in English in spelling, pronunciation, or meaning. 0 1 2 3 4 5
13. asking other people to tell me if I have understood or said something correctly. 0 1 2 3 4 5
14. applying what I have learned to new situations. 0 1 2 3 4 5
15. looking for opportunities to speak English as much as possible. 0 1 2 3 4 5

I try to improve my oral communication in English by...
16. repeating sentences in English until I can say them easily. 0 1 2 3 4 5
17. repeating what I hear native speakers say. 0 1 2 3 4 5
18. using my knowledge of grammar rules to help me form new sentences. 0 1 2 3 4 5


19. watching TV or listening to the radio. 0 1 2 3 4 5

I try to improve my reading in English by...
20. summarizing new information to remember it. 0 1 2 3 4 5
21. trying to understand without looking up every new word. 0 1 2 3 4 5
22. reading English books, newspapers, and magazines. 0 1 2 3 4 5
23. looking for the ways that writers show relationships between ideas. 0 1 2 3 4 5
24. guessing the meaning of new words from context. 0 1 2 3 4 5

I try to improve my writing in English by...
25. showing my writing to another person. 0 1 2 3 4 5
26. analyzing how other writers organize their paragraphs. 0 1 2 3 4 5
27. analyzing the ways that other writers show relationships between ideas. 0 1 2 3 4 5

Part Three. Metacognitive Strategies for Language Learning

28. Before I talk to someone in English, I think about how much the person knows about what I’m going to say. 0 1 2 3 4 5
29. Before I begin an English assignment, I make sure I have a dictionary or other resources. 0 1 2 3 4 5
30. Before I begin an English test, I think about which parts of the test are the most important. 0 1 2 3 4 5
31. Before I begin an English test, I decide how important it is for me to get a good grade on the test. 0 1 2 3 4 5
32. When I listen to English, I recognize other people’s grammar mistakes. 0 1 2 3 4 5
33. When I am speaking English, I know when I have pronounced something correctly or incorrectly. 0 1 2 3 4 5
34. When I speak English, I know when I make grammar mistakes. 0 1 2 3 4 5
35. When someone is speaking English, I try to concentrate on what the person is saying. 0 1 2 3 4 5
36. When someone does not understand my English, I try to understand what I said wrong. 0 1 2 3 4 5
37. When I have learned a new English grammar rule, I test myself to make sure I know how to use it. 0 1 2 3 4 5
38. When I have learned a new word or phrase in English, I test myself to make sure I have memorized it. 0 1 2 3 4 5
39. After I finish a conversation in English, I think about how I could say things better. 0 1 2 3 4 5
40. After I say something in English, I check whether the person I am talking to has really understood what I meant. 0 1 2 3 4 5
41. After I have taken a test in English, I think about how I can do better the next time. 0 1 2 3 4 5
42. I test my knowledge of new English words by using them in new situations. 0 1 2 3 4 5
43. I try to learn from the mistakes I make in English. 0 1 2 3 4 5

Thanks for your participation.


Appendix B

Item       N    Min  Max  Mean   SD      Skewness  Kurtosis
Q1         161  0    5    3.75   1.216   −.935     .445
Q2         161  0    5    3.76   1.083   −.820     .433
Q3         159  0    5    3.87   1.241   −.944     .057
Q4         160  0    5    2.78   1.401   −.287     −.787
Q5         160  0    5    3.75   1.293   −1.077    .531
Q6         160  0    5    3.06   1.390   −.596     −.241
Q7         161  0    5    3.22   1.400   −.589     −.423
Q8         159  0    5    3.36   1.255   −.525     −.214
Q9         160  0    5    2.53   1.663   −.147     −1.224
Q10        161  0    5    2.66   1.669   −.247     −1.191
Q11        161  0    5    3.55   1.245   −.797     .176
Q12        160  0    5    2.41   1.706   −.056     −1.354
Q13        161  0    5    3.02   1.410   −.397     −.614
Q14        158  0    5    3.84   1.046   −.883     .775
Q15        161  0    5    4.43   .850    −1.835    4.489
Q16        160  0    5    3.49   1.423   −.789     −.213
Q17        160  0    5    3.54   1.213   −.764     .269
Q18        161  0    5    3.79   1.169   −.886     .322
Q19        161  0    5    4.27   1.019   −1.685    2.963
Q20        161  0    5    3.12   1.247   −.461     −.198
Q21        160  0    5    3.39   1.239   −.767     .008
Q22        158  0    5    4.13   1.023   −1.175    1.304
Q23        159  0    5    3.72   1.096   −.774     .485
Q24        160  0    5    3.27   1.316   −.660     −.079
Q25        161  0    5    3.01   1.487   −.241     −1.037
Q26        159  0    5    3.65   1.120   −.855     .331
Q27        160  0    5    3.53   1.110   −.583     −.137
Q28        161  0    5    2.62   1.491   −.289     −.876
Q29        161  0    5    2.99   1.553   −.506     −.717
Q30        160  0    5    3.52   1.441   −1.016    .295
Q31        161  0    5    4.14   1.212   −1.887    3.578
Q32        159  0    5    3.44   1.230   −.707     .315
Q33        161  1    5    3.93   .997    −.845     .376
Q34        161  1    5    3.81   1.081   −.720     −.073
Q35        158  0    5    4.23   1.157   −1.742    2.611
Q36        161  0    5    3.81   1.175   −.835     .250
Q37        161  0    5    3.65   1.242   −.884     .443
Q38        159  0    5    3.47   1.262   −.783     .117
Q39        160  0    5    3.86   1.184   −1.111    1.065
Q40        161  0    5    3.83   1.233   −1.325    1.739
Q41        161  0    5    4.11   1.090   −1.604    2.909
Q42        161  0    5    3.88   1.133   −1.239    1.896
Q43        161  0    5    4.26   .984    −1.580    2.848
Writing    159  65   95   76.42  6.265   1.038     1.021
Listening  159  49   100  76.38  11.709  −.537     −.343
GCVR       159  36   100  73.14  13.985  −.210     −.513
Total      159  53   97   75.29  9.499   .126      −.535


An Empirical Investigation into the Nature of and Factors Affecting Test Takers’ Calibration within the

Context of an English Placement Test (EPT)

Aek Phakiti
Maejo University

This paper reports on an empirical study of the nature of and factors affecting test takers’ calibration within the context of an English placement test designed by the English Language Institute, University of Michigan. Calibration is a term used in psychology to denote the perfect relationship between a confidence judgment of performance and actual performance. Test takers are said to be calibrated when their confidence matches their actual performance perfectly. This study looks at both single-case confidence (i.e., confidence for each test item) and relative-frequency confidence (i.e., confidence for the overall test). The study employs Rasch Item Response Theory (IRT) to analyze the test so as to identify easy and difficult test items and learners’ ability levels. The study was carried out at a Thai university in which 295 learners participated. The results suggest that: (1) the test takers were generally miscalibrated, with a tendency to be overconfident; (2) test takers at different proficiency levels exhibited differences in calibration and confidence; and (3) advanced-level test takers were underconfident in all test sections, whereas others were overconfident. Two factors believed to affect calibration (gender and hard-easy effects) were empirically investigated. It was found that females obtained a better calibration score than their male counterparts, and that test takers tended to be overconfident on difficult questions and underconfident on easy questions. A discussion of the findings, implications, and an agenda for further research is provided.

Factors affecting language test performance are varied and complex. In a unified model, Bachman (1990; Bachman & Palmer, 1996) sought to explain language test performance and set out four major influences on language test scores: communicative language ability (CLA), test-method facets, individual characteristics, and random measurement error. With reference to his unified model, Bachman (2000) grouped research involving factors affecting language test performance into three intertwined areas: characteristics of the testing procedures (as relevant to test-method facets), characteristics of the test takers themselves (as relevant to CLA and test-taker characteristics), and the processes and strategies used by test takers in response to test tasks (as an interaction among CLA, test-method facets, and test-taker characteristics). Of these, the present study focuses on the processes involved during test completion. A review of the literature suggests that language testing (LT), second language acquisition (SLA), and L2 learning research has devoted much attention to understanding L2 individuals’ strategic processing, such as cognitive and metacognitive strategy use (e.g., Anderson, 2005; Cohen, 1998; Oxford, 2003; Phakiti, 2003; Purpura, 1999). These studies found that strategic processing, such as assessing the situation, goal setting, self-monitoring, self-assessment, as well as other cognitive strategies, plays a crucial role in determining success in language test performance. These studies have shed light on how L2 individuals go about taking a language test or learning the target language, and how successful L2 individuals differ from less successful ones in terms of strategic processing.

Today’s educational system forces learners to make high-stakes decisions and judgments more and more frequently, and often to take greater and greater risks in their decision making. Because decisions and judgments often have high stakes, learners need to be able to accurately approximate the likely success of their performance. This ability is vital if they are to maintain or develop themselves to a high educational standard. As well, this ability is a necessary component of lifelong learning, in which individuals must work independently and assess their developing language abilities after leaving school or university. Furthermore, people must be well adapted to handling real-world environments that involve redundant and unreliable data. This paper argues that confidence in current performance, and the extent to which it is calibrated, is also part of strategic processing. Calibration1 is a psychological term used to denote the perfect relationship between a confidence judgment on performance and actual performance. At present, calibration research has been neglected not only in LT research but also in SLA research. In the current theory of communicative language ability, little is known about how individuals go about assessing and perceiving confidence in their current language use performance, the extent to which they are realistic, and why they are not realistic. The aim of this study is therefore to examine the nature of test takers’ confidence in their performance achievement and their calibration, and to identify factors positively and negatively affecting their calibration. To achieve this aim, I first discuss how confidence in performance can be generated through a modified theoretical framework of human information processing proposed by Gigerenzer, Hoffrage, and Kleinbölting (1991). Second, I discuss the relevant literature on calibration studies as it relates to the aims of the study.

Modeling Calibration and Miscalibration

In order to model the nature of calibration, two variables are commonly used: confidence in the correctness of performance, and actual performance judged against the truth or external standards independent of the individual providing the confidence. Confidence in performance can be expressed either in percentages (e.g., Björkman, 1994; Yates, Lee, & Shinotsuka, 1996) or quantified as high, medium, or low degrees (e.g., Glenberg & Epstein, 1987). This confidence is in turn treated as a probability. The derived probability is considered subjective in the sense that different individuals are allowed to have different probabilities about the degree of success for the same language event.

The Local and Probabilistic Mental Models

To further describe the nature of calibration discussed above, we need a theoretical framework that explains how confidence may be generated, in order to observe the evolution of confidence judgments over time. This framework is essential to guide research to corroborate the nature of calibration as reflected not only through intraindividual and interindividual differences, but also through cross-sectional and longitudinal observations. This framework can also be helpful in elucidating the reasons for poor calibration. In the present study, the theory of probabilistic mental models (PMM) proposed by Gigerenzer et al. (1991), which is supported by several empirical studies (see Juslin, 1994; Kleitman & Stankov, 2001; Schneider, 1995), is applied to postulate how an individual may generate confidence in a multiple-choice test. Figure 1 illustrates a flow chart of the local and probabilistic mental models in a typical four-alternative multiple-choice test. Based on this framework, confidence in performance is estimated using percentages. As a rule of thumb, the starting point (the lowest) on a confidence rating scale depends on the number of alternatives (k) given to a question (i.e., 100/k). However, as reflected by the model in Figure 1, we need to distinguish the chance of getting the answer correct (i.e., 25%, 50%, 75%, or 100%) from actual confidence in performance (0% to 100%). That is, when individuals know that they have a 25% chance of getting the answer correct, it does not necessarily imply that their lowest confidence will be 25%. An awareness of this distinction is important because in the case of failure (when their performance is 0%), their confidence needs to be 0% in order to be well calibrated. There are two features of the model presented in Figure 1 that differ from Gigerenzer et al.’s (1991): the first is that the present model includes a 0% confidence, and the second is that within the local MM and PMM, internal and external feedback is incorporated as part of information processing (to be discussed below).

1 It is important to note that calibration in this paper differs from that generally used in the LT literature to talk about the calibration of test items in terms of difficulty and the implications for parallel test forms, etc.

Local Mental Model (MM)

When L2 learners are presented with a task (e.g., listening text, reading text) and/or a question with four alternatives, they will initially attempt to construct a local mental model (local MM) of the task (see Figure 1, if “yes”). This attempt can be characterized as memory searching and rudimentary logical operations. When the individuals can generate certain knowledge by constructing a local MM, they will have sufficient evidence for the answer and a confidence of 100%. Generally, a local MM can be successfully constructed in at least one of three conditions: (1) if the knowledge can be retrieved from memory for all the four alternatives; (2) if intervals that do not overlap can be retrieved; or (3) elementary logical operations such as the method of exclusion can compensate for missing knowledge. This part of cognitive operations corresponds to the theory of human-information processing (e.g., Gagné, Yekovich, & Yekovich, 1993; Kintsch, 1998) in that confidence judgments within this mental model can be highly automatic (little conscious attention is involved) because information processing is fast and hence individuals are not necessarily aware of their confidence. This explains why sometimes we do not think about our confidence in our performance. In the context of L2 use or learning, when learners have mastered the target language necessary for use and are performing the task in familiar environments, after extensive practice, and in the areas of expertise, confidence can be tacit. In order to further understand confidence in this model, a local MM must be defined as follows. First, the MM is local because only the four given alternatives are taken into account. Second, it is direct because it contains only the target variable and hence no probability cues are used. Third, no inferences, besides fundamental operations of deductive logic, occur. During this operation, it may occasionally take some time to retrieve information from memory. In this occasion, use of internal and external feedback is often invoked tacitly. Finally, if the search is successful, the confidence in the knowledge produced is certain. Gigerenzer et al. (1991) noted that within the local MM, memory can fail and thus certain knowledge retrieved can be inaccurate.


Inaccurate retrieval of information can be seen as a source of miscalibration (i.e., overconfidence) within a local MM. A prediction that we can make from this local MM is that if L2 learners have the adequate linguistic, sociolinguistic, pragmatic, discourse and/or strategic competence required by the given task, they will construct their mental model at the local cognitive level and their confidence will be generated locally.

Figure 1. Cognitive processing and confidence generation in solving a multiple-choice test task (adapted from Gigerenzer et al., 1991).

Probabilistic Mental Model (PMM)

Unlike the local MM, confidence within the PMM is different because it is generated based on probability. If the attempt to construct a local MM fails (see Figure 1, if “no”), a PMM is then constructed. This construction goes beyond the structure of the task by using probabilistic information gained from the environment. The PMM theory uses the following terms to explain the confidence phenomenon: a reference class of objects that are mentioned in the test items, and a target variable that represents a category of interest. To complete a task, individuals use a PMM as the basis for a process of inductive inferences by employing a network of other variables. An example of inductive inference is using all possible contextual information that is mentioned in the test items as logical evidence to interpret meanings as a means to answer the question. Other inductive inferencing strategies are guessing word meaning, predicting outcomes, supplying missing information, determining the author’s tone, and so on. Such a network represents the probability cue that is used to discriminate between alternatives.

Within this operation, each of the probability cues has different levels of validity for the target variable (i.e., the desired answer/performance). In other words, individuals define the probability representation associated with the more likely answer based on their own cognitive and contextual resources. Unlike individuals’ generated cue validities, ecological validity corresponds to the probability associated with the more likely outcome, which is defined as a truth independent of the individuals’ cue validities. When a cue validity matches the ecological validity perfectly, mental operations will lead to the correct answer. If confidence is high here, individuals are likely to be calibrated. Note that if they know that their probability cue is not ecologically valid, and if their confidence is also low, they are also likely to be calibrated. Accordingly, task completion based on invalid cues and invalid confidence can be a source of miscalibration within the PMM. Gigerenzer et al. (1991) note that the assumption that confidence equals cue validity is not arbitrary; instead it is rational and simple in the sense that good calibration is to be expected if a cue validity corresponds to an ecological validity.

Gigerenzer et al. (1991) postulate that within the PMM, the order of probability cues is not randomly generated. Rather, the cue order reflects the hierarchy of cue validities. Once a probability cue with the highest hierarchy is generated, it is tested to see whether it can be activated to solve that problem (see Figure 1). This logic suggests that before the best probability cue is found for the problem, several other cues may have been generated and tested. According to Gigerenzer et al. (1991), if the number of problems or test questions is large and other kinds of time pressure apply, and if the activation rate of cues is rather small, then it can be assumed that the cue generation and testing cycle ends once the first cue that can be activated has been found. For realistic learners, if no cue can be activated within this attempt, it can be assumed that the answer or the target variable is decided randomly and 0% (or 25%) confidence will be chosen. Based on the discussion of the two models here, it can be argued that in the same task, the level of attention paid to confidence judgments can differ significantly from individual to individual. That is, some individuals can have a high degree of confidence in the task performance without an awareness of it (i.e., information processing within the local MM), whereas others can be highly aware of their high confidence (i.e., information processing within the PMM).
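To make the flow in Figure 1 concrete, the following toy sketch traces one four-alternative item through the two mental models. It is a schematic illustration of the account given above, not the authors’ implementation; the data structures and values are hypothetical.

```python
# Toy sketch of local MM / PMM confidence generation for one multiple-choice item.
import random
from dataclasses import dataclass, field

@dataclass
class Cue:
    validity: int      # perceived cue validity, in percent
    predicted: str     # the alternative this cue points to
    activates: bool    # whether the cue can actually be applied to this item

@dataclass
class Item:
    alternatives: list            # the four options
    known_answer: str = None      # set when direct retrieval succeeds (local MM)
    cues: list = field(default_factory=list)

def judge(item):
    """Return (chosen answer, confidence in %) for one four-alternative item."""
    # 1. Local MM: direct memory retrieval or elementary logic over the four
    #    alternatives -> certain knowledge and 100% confidence (which can still be wrong).
    if item.known_answer is not None:
        return item.known_answer, 100
    # 2. PMM: test probability cues in descending order of perceived validity;
    #    the cycle ends with the first cue that can be activated, and confidence
    #    is set to that cue's perceived validity.
    for cue in sorted(item.cues, key=lambda c: c.validity, reverse=True):
        if cue.activates:
            return cue.predicted, cue.validity
    # 3. No cue activates: guess at random; a realistic test taker reports 0%
    #    confidence even though the objective chance of success is 25%.
    return random.choice(item.alternatives), 0

# Hypothetical illustration.
item = Item(alternatives=["a", "b", "c", "d"],
            cues=[Cue(60, "c", True), Cue(80, "b", False)])
print(judge(item))   # -> ('c', 60)
```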

Internal and External Feedback

In both the local MM and PMM, feedback plays a significant role not only in assisting desirable performance but also in assessing confidence in performance. The use of feedback within the local MM and the PMM can, however, differ significantly; that is, feedback in the local MM can be faster and more automatic than that within the PMM because individuals do not need to test cues and cue validities in regard to decisions for the correct answer. According to Butler and Winne (1995), as well as Stone (2000), who discuss feedback extensively, feedback can dramatically influence confidence judgments. A primary role of feedback in relation to calibration is to improve the quality of performance and the realism of confidence. On the one hand, internal feedback (i.e., internally self-generated feedback within an individual during task engagement) includes judgments of success in the task in regard to the desired goals, judgments of the relative productivity of various cognitive processes such as strategies along with expected rates of progress, and positive or negative feelings associated with productivity. On the other hand, external feedback includes outcome feedback (such as indication of right or wrong answers) and cognitive feedback (such as valid reasons for good or bad performance). Cognitive feedback can be expected to have a significant effect on self-assessment during cognitive engagement, whereas outcome feedback tends to impact confidence in overall achievement. When external feedback enhances internal feedback, individuals engage in better self-monitoring, self-testing, and metacognitive judgments. Both kinds of feedback can thus function to confirm, add to, or conflict with each other. Without feedback, individuals can fail to adjust their information processing as task difficulty arises, and they may then be overconfident in their performance. In high-stakes situations, high validity of confidence is important because confidence is in turn feedback per se. If confidence is low, self-regulated individuals will know that some strategic actions may be needed to improve the performance and to accomplish the task more successfully.

Two Types of Confidence

Confidence judgments are of two categories: single-case confidence, confidence in the correctness of each test item, and relative-frequency confidence, confidence in overall test performance. According to the PMM theory, these two categories are based on different cognitive processes (Gigerenzer et al., 1991). Single-case confidence is confidence in a specific performance determined by the perceived knowledge to answer the question and the available choices. Relative-frequency confidence is confidence in overall test performance based on the number of questions thought to have been answered correctly. Relative-frequency confidence is similar to postdiction (Glenberg, Sanocki, Epstein, & Morris, 1987) or an evaluation score (Kleitman & Stankov, 2001). Kleitman and Stankov argue that relative-frequency confidence also reflects contextual factors pertaining to the entire test, such as test instructions, the characteristics of test item questions, and time constraints. The functions of single-case confidence and relative-frequency confidence differ in human information processing. The former is related to internal feedback for specific cognitive processing to deal with the given tasks at hand, whereas the latter is related to internal feedback to self-reflect on how well one has performed on the test task (post self-evaluation). To accommodate the model in Figure 1, confidence assessment involves two cognitive stages. The first stage involves searching one’s knowledge and engaging in task completion, and this stage ends when an answer is selected. In the second stage the evidence is reviewed and confidence in the chosen choice is assessed. Here, individuals retrieve from memory a subset of the available cues (e.g., the frequencies with which a given combination of cues predicts the right and wrong answers), they aggregate the cue validity according to their perceptions, which eventually results in an internal feeling of confidence, and then they express this internal feeling in terms of numerical probability.
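A brief sketch of how the two confidence types could be scored for a single test taker is given below. The numbers are hypothetical, and comparing mean confidence with percentage correct is one common way of operationalizing calibration rather than necessarily the exact computation used in this study.

```python
# Hypothetical data for one test taker on an 8-item section.
item_confidence   = [100, 75, 50, 25, 0, 100, 75, 50]  # single-case confidence (%) per item
correct           = [1,   1,  0,  0, 0,  1,  0,  1]    # 1 = answered correctly
estimated_correct = 6                                   # relative-frequency judgment (items thought correct)

n = len(correct)
accuracy = 100 * sum(correct) / n                       # actual performance (%)

# Single-case calibration: mean item-level confidence versus actual accuracy.
mean_confidence = sum(item_confidence) / n
single_case_bias = mean_confidence - accuracy           # positive values suggest overconfidence

# Relative-frequency calibration: estimated versus actual proportion correct.
relative_frequency_bias = 100 * estimated_correct / n - accuracy

for label, bias in [("single-case", single_case_bias),
                    ("relative-frequency", relative_frequency_bias)]:
    tendency = "overconfident" if bias > 0 else "underconfident" if bias < 0 else "calibrated"
    print(f"{label}: bias = {bias:+.1f} percentage points ({tendency})")
```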

Plotting Calibration

Based on the discussion above, learners are considered to be well calibrated when they report a 100% level of confidence in their test performance and their actual test performance is also 100%. They are not well calibrated when their confidence is higher or lower than their performance. Calibration can be simply computed by taking the difference between confidence and actual performance. If confidence in performance is greater than the actual performance, test takers are overconfident, whereas they are underconfident if the actual performance exceeds their confidence. Both forms of miscalibration suggest that such individuals are not realistic, that they are cognitively biased in their information processing. They fail to adjust their internalized response criteria to changes in the task demands during information processing. To further illustrate what it means to be calibrated, Figure 2 presents a calibration diagram that provides information about a tendency in confidence rating by an L2 learner in three different tasks. To explain the diagram, the 45° line (called a unity line) represents perfect correspondence between confidence and performance. If the confidence indicators are on the 45° line, the test taker is perfectly realistic or well calibrated. If a confidence indicator is above the 45° line, the test taker is overconfident, and if it is below the 45° line, the test taker is underconfident.

Figure 2. Calibration diagram: confidence in performance (vertical axis) plotted against accuracy in performance (horizontal axis); points above the 45° unity line indicate overconfidence, points on the line indicate calibration, and points below it indicate underconfidence.

Implications for the Present Study

A number of psychological studies that look at human calibration show that people are generally miscalibrated (see Stone, 2000, for a comprehensive review). Most of these studies found that people are overconfident in their knowledge or performance. As pointed out earlier, previous LT and SLA research has not examined the extent to which L2 test takers/learners are calibrated. Much previous strategic research has been devoted to understanding strategy use (see e.g., Oxford, 2003; Purpura, 1999). Another area related to calibration research is research into L2 learners’ self-assessment, where learners are asked to judge their own ability (e.g., perceived proficiency or achievement) in the target language (see e.g., Oscarson, 1997; Ross, 1998). The strength of the relationships between self-assessment and language test scores or abilities is mixed. Some studies found moderate to strong relationships (Bachman & Palmer, 1989; Clark, 1981; Coombe, 1992; LeBlanc & Painchaud, 1985; Oscarson, 1978; Wilson, 1996). Some studies, however, found a weak relationship between self-assessment and language performance (Blanche & Merino, 1989; Moritz, 1995; Peirce, Swain, & Hart, 1993). Ross’s (1998) meta-analysis of the validity of self-assessment as a means of predicting future language performance, as opposed to traditional language testing, suggests that the relationships between learners’ self-assessment (e.g., based on can-do statements, self-reports) and actual test performance were only weak or moderate. Ross’s findings suggest that the learners in those studies were poorly calibrated or unrealistic in their perceived L2 abilities, knowledge, or performance. However, self-assessment research has not been tied to the theory of calibration, or to the extent to which learners are realistic in their assessment of their L2 abilities, knowledge, or performance based on their confidence. This review of the literature suggests a void that requires empirical investigation of other areas of strategic processing. Research into the validity of self-assessment can provide a key to an understanding of L2 learners’ calibrative ability. Research on calibration can also help unlock important underlying language competence, as it explains why some learners cannot self-assess validly and what kinds of factors influence their calibration. Such research may enable us to find ways to help them develop this ability.

It is fortunate that we do not have to start from scratch, theory-wise, because there are some theories and insights from psychological research that are useful for this research agenda. For the purpose of the present research, calibration in contexts of language testing is examined. In a typical educational setting, language tests are used for various purposes such as to determine the extent to which learners achieve the desired learning goals (e.g., achievement tests), to place learners in a suitable language program (e.g., placement, screening, or diagnostic tests) or to predict the likely ability to use the target language in the future (e.g., proficiency tests). The constructs of language ability/skills have been comprehensively discussed by various scholars such as Bachman and Palmer (1996). Based on such postulations, various approaches and methods to assessing L2 ability have been developed and used in the past decades (e.g., Bachman & Palmer, 1996; Brown, 1996; Davidson & Lynch, 2001; Lynch, 2003).

Given that each study can add to our understanding of the specific conditions and variables that can influence L2 individuals’ calibration of language performance, the present study investigates L2 learners’ calibration using an English placement test (EPT) designed by the English Language Institute, University of Michigan. The use of this test to understand calibration is significant in a number of respects. First, given that the EPT assesses various English language skills such as listening (Buck, 2001), grammar (Purpura, 2004), vocabulary (Read, 2000), and reading (Alderson, 2000), the study will show whether calibration is similar or different among these language skills. Understanding skill differences in regard to calibration will rigorously demonstrate not only individual differences in calibration but also the influences of language modes on calibration. Hence, this study will not only contribute to the ongoing validation processes of an EPT test, but also contribute to development of a theory of calibration that may be a missing puzzle piece in the theory of communicative competence in the fields of L2 testing, learning, and acquisition. In the present study, the nature of calibration among different proficiency levels will be examined across the language domains (listening, grammar, vocabulary, and reading). This study also attempts to unlock key factors that influence test takers’ calibration. The following are three fundamental research questions:

1. Are test takers calibrated?
2. How do test takers at different proficiency levels differ in terms of calibration and confidence judgments?
3. What are the factors affecting the nature of test takers' calibration?


Method

Background and Participants

The study was carried out at one of the major universities in the north of Thailand, where English as a foreign language (EFL) is compulsory for the completion of a bachelor's degree. The university has recently launched a new policy to raise the standard of English teaching and learning by separating weak students of English from strong ones before they are placed into a suitable language program. If their English is poor, they are placed in a remedial English subject (a noncredit subject) before they can enroll in other required fundamental subjects. Given this policy, there is a need for an estimate of the proportion of students who need the remedial English subject. Initially, about 400 learners voluntarily took the EPT. They were informed of the research procedures prior to the data-gathering period. They were also informed that they were free to withdraw any unprocessed data previously supplied. However, only 295 learners were included in the study (95 males, 32%; 200 females, 68%). Some participants were excluded from the data analyses because they did not rate their single-case confidence. Furthermore, based on a Rasch Item Response Theory (IRT) analysis, I excluded misfitting and overfitting test takers, as the test did not measure their English ability properly (see McNamara, 1996); including these participants could undermine the findings of the study. The participants were between the ages of 16 and 23 (mean = 19.36). Their English proficiency levels, as defined by the preliminary cut scores provided by the English Language Institute, ranged from beginner to advanced. There were 108 beginner learners (scores 0–29), 156 beginner high learners (30–47), 17 intermediate low learners (48–60), 4 intermediate learners (61–74), 7 advanced low learners (75–84), and 3 advanced high learners (85–100). Based on the cut scores, the majority of the learners would need to take the remedial English subject.

Research Instruments

The research instruments included the EPT (Form A) and an answer sheet specially designed to assess both single-case and relative-frequency confidence.2

2 In fact, qualitative data were also gathered by means of retrospective interviews with 6 successful test takers (3 males, 3 females) and 6 unsuccessful test takers (3 males, 3 females) after the test had been analyzed. Based on the person-ability and item-difficulty map, these test takers were asked how they had arrived at their confidence ratings for the five most difficult and the five least difficult items in each test section. For the purpose of this paper, however, results based on the qualitative data are not reported, because the complexity of that analysis would complicate the scope of the present paper.

EPT Test

The English Placement Test (EPT) was developed by the English Language Institute, University of Michigan, for the quick placement of English language students into homogeneous ability levels. The number of ability levels and the associated cutoff scores depend on the English language program in which the test is used. The test contains 100 multiple-choice questions covering four areas of the English language: listening comprehension, grammar, vocabulary, and reading. To relate this test to calibration in the present study, I now describe each section together with its directions and a sample item.

Listening section. Twenty questions (items 1–20) are devoted to listening comprehension. There are two types of items: one in which the speaker asks a question and the test taker is to respond by selecting an appropriate reply; and the other in which the speaker makes a statement and the test taker is to choose the phrase or sentence that summarizes or means about the same thing as the statement. In each question, the test taker is given three answer choices. In the present study, the EPT audiocassette recording was used. In this section, each item is read with a pause of 12 seconds before the number of the next item is read. None of the questions or statements are repeated. The following are the directions and test samples.

Directions: This is a test of how well you understand spoken English. The examiner will either ask a question or make a statement. To show that you have understood what was said, you should choose the ONE answer choice that is a reasonable response or answer.

Example I: [Listening text (pause 1 second) When are you going? (pause 12 seconds)]

a. I am. b. Tomorrow. c. To Detroit.

The correct response is choice b, “Tomorrow.” Choice b has been marked on your answer sheet to show that it is the correct answer for Example I. Now here is an example of the statement type of problem. Listen to the statement and then choose the ONE phrase or sentence that corresponds to it.

Example II: [Listening text (pause 1 second) John and Mary went to the store. (pause 12 seconds)]

a. Only John went. b. Only Mary went. c. They both went.

“John and Mary went to the store,” means that they both went. On your answer sheet, for Example II, mark the space after choice c to show that “They both went,” is the correct answer.

Grammar section. There are 30 grammar questions (items 21–50). Each grammar question represents a hypothetical conversation between two people. Part of the response in each conversation has been omitted, and the student is to choose the one word or phrase that correctly completes the conversation from the four answer choices printed in the test booklet.

Directions: In each grammar problem there is a short conversation between two people. The conversation is not complete. You should look at the answer choices which follow the conversation, and then choose the ONE answer that correctly completes the conversation.

Example III: “What’s your name?” “_____ name is John.”

a. I b. Me c. My d. Mine

The correct answer is choice c, “My.” On your answer sheet, for Example III, mark choice c. Answer all the grammar problems this way.

Vocabulary section. There are 30 vocabulary questions (items 51–80) in this section of the test. In each question, an incomplete sentence is given and the student must choose the one word that correctly completes the sentence from the four answer choices printed in the test booklet.


Directions: In each vocabulary problem there is a sentence with a word missing. From the answer choices following the sentence, you should choose the ONE word that best fits into the sentence and makes it meaningful.

Example IV: I can’t _____ you his name, because I don’t know it. a. talk b. say c. speak d. tell

The correct answer is choice d, “tell.” On your answer sheet, for Example IV, mark choice d. Answer all the vocabulary problems this way.

Reading section. There are 20 reading questions (items 81–100). Each question consists of one sentence followed by a question concerning its meaning. The test taker is offered four answer choices, only one of which is correct.

Directions: In each reading comprehension problem you will read a sentence and then answer a question about it. Choose the ONE best answer to the question, using the information in the sentence you have just read.

Example V: John drove me to Eleanor’s house. Who drove?

a. I did. b. John did. c. John and I did. d. Eleanor did.

The correct answer is b, “John did.” On your answer sheet, mark choice b, for Example V. Answer all the reading problems this way.

Test Administration

Based on the administration procedures provided by the English Language Institute, the administration time for the test is approximately 75 minutes. First, test takers complete section one (listening) of the test by listening to the EPT cassette recording. Following the listening test, the test administrator reads the instructions and examples on page 3 of the test booklet (as seen above) to the test takers and answers any questions they may have. The test takers then have 50 minutes to answer items 21 through 100. For the purpose of this study, however, there were some adjustments to the administration procedures proposed by the English Language Institute. First, prior to the beginning of the test, the test takers were told how they would need to rate their confidence in the correctness of each test item immediately after they chose the answer (discussed further under the answer sheet below). Second, instead of completing the 80 written test items and rating their confidence in each item within 50 minutes, I gave them 80 minutes to complete the test. This amount of time was appropriate and reasonable because the test takers needed enough time both to complete the test and to assess their confidence. It was, nevertheless, found that some test takers failed to complete the entire test within this amount of time.

The EPT Answer Sheet

In order to assess single-case confidence, I designed an EPT answer sheet that allows test takers to indicate their confidence in each test item. The answer sheet is given in Appendix A. For the listening test section, which has three-option multiple-choice questions, the confidence rating scale is 0, 33, 66, and 100. For the four-choice questions, the confidence rating scale is 0, 25, 50, 75, and 100. In regard to the assessment of relative-frequency confidence in the entire test, at the end of the answer sheet test takers are requested to complete the following statement: “I think the number of the correct answers is (out of 100): ________.” It is fortunate that the EPT has 100 questions, which allows easy conversion into percentages.

Data Analyses

Computer Programs

Quest (Adams & Khoo, 1996) was used for the Rasch IRT analyses of the EPT. The Statistical Package for the Social Sciences (SPSS) version 10.1 for PC was used to compute descriptive statistics, reliability analyses, Pearson product-moment correlations, and multivariate analysis of variance (MANOVA). The Microsoft Excel software program was used to produce the calibration graphs.

EPT Analyses

The test data were analyzed for content and reliability by means of Rasch IRT using Quest. The Rasch IRT model proposes a simple mathematical relationship between ability and difficulty and expresses this relationship as the probability of a certain response. A number of IRT analyses were conducted to eliminate misfitting and overfitting test takers. A misfit statistic implies that a test taker's performance was not assessed appropriately by the test; an overfit statistic implies that a test taker's performance was redundant and the test score did not say much about his or her ability. It was found that the person separation reliability (equivalent to K-R20) was 0.87 (the item estimates reliability was 0.96). Given the intended purpose of the test as a quick placement measure, the reliability coefficient was acceptable. The English Language Institute reported reliability estimates of the test ranging from 0.89 to 0.96 (N = 30 to 58). The reliability estimate found in the present study was lower. However, this was expected based on the case estimates, where the majority of the test takers were homogeneous in their ability (i.e., low English ability). Conditions affecting the reliability of measures, such as the range of ability and the proportion of difficult test items, are discussed in Hatch and Lazaraton (1991). Note also that Section 1 (listening) affected the internal consistency of the entire test because there were only three alternatives. The item discrimination analysis indicated that a few items in this section did not function very well (point-biserial < 0.25). Table 1 presents the descriptive statistics for the four sections of the test.

Table 1. Descriptive Statistics of the EPT Score Variables (N = 295)
Section      k     Minimum   Maximum   Mean    SD
Listening    20     2.00     18.00      6.47    2.96
Grammar      30     2.00     29.00     11.38    4.46
Vocabulary   30     3.00     29.00     10.34    4.43
Reading      20     2.00     20.00      6.57    3.01
Total       100    16.00     96.00     34.75   12.64
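For reference, the dichotomous Rasch model that Quest implements is standard; a minimal sketch of its probability function (not quoted from the paper) is:

P(correct response | θ, b) = exp(θ − b) / (1 + exp(θ − b))

where θ is a test taker's ability and b is an item's difficulty, both expressed in logits, the units shown on the item-person map below.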

Figure 3 presents the item difficulty and person ability map. The mapping of item difficulty and person ability on the same scale is one of the most useful properties of a Rasch IRT analysis (McNamara, 1996).


-------------------------------------------------------------------------------------
Item Estimates (Thresholds) all on all (N = 295  L = 100  Probability Level = .50)
-------------------------------------------------------------------------------------
 4.0                         |
                             |
                             |
                             |
                             |
                           X |
                             |
                             |
 3.0                         |
                             |
                             |
                             |
                           X |
                             |
                             |
                           X |
 2.0                         |
                             |
                           X |
                           X |
                           X |
                           X |  13  47  60
                           X |  46  62
                             |  49  71
 1.0                       X |  14  20
                           X |  25  67  78
                           X |  38  80  90  96
                             |  28  54  56  66  72  85  92  100
                             |  9  10  11  52  64
                          XX |  6  7  97  99
                          XX |  3  26  33  37  43  59  69  89
  .0                      XX |  2  5  8  21  29  61  68  94
                       XXXXX |  16  17  18  35  41  74  76  77
                      XXXXXX |  4  15  36  86  87  88  95
                        XXXX |  23  30  44  58  73  75  98
                    XXXXXXXX |  19  24  27  34  57  65
              XXXXXXXXXXXXXX |  32  79
       XXXXXXXXXXXXXXXXXXXXX |  31  53  70  81  83
              XXXXXXXXXXXXXX |  45  55
-1.0    XXXXXXXXXXXXXXXXXXXX |  40  42  48
      XXXXXXXXXXXXXXXXXXXXXX |  1  12  22  39  50
                  XXXXXXXXXX |
                      XXXXXX |  63
                        XXXX |
                          XX |
                           X |
                           X |
-2.0                         |
                             |
                             |
                             |  51
                             |
                             |
                             |
-3.0                         |
===================================================================================
  Each X represents 2 test takers
  Some thresholds could not be fitted to the display
===================================================================================

Figure 3. Item Difficulty and Person Ability Map based on the EPT.

The figure shows a continuum of item difficulty and person ability. On the left are the units of measurement on the scale (called logits), extending in this case from −3 to +4 (a 7-unit range). The average item difficulty is set at 0 by convention. The ability of each individual test taker is plotted on the scale (represented as Xs; each X represents two test takers). On the right are the item numbers (test questions): the higher on the scale, the greater the item difficulty and candidate ability (see McNamara, 1996, for further discussion of the map). For the purpose of the study, I use the information from this map to understand the nature of calibration; for example, the nature of confidence regarding the difficult items and the easy ones.

Analyzing Confidence and Calibration

For the purpose of the quantitative data analyses, single-case confidence and item-level test performance were aggregated and averaged across test sections. Test performance was transformed into percentages for the calibration analyses. It must be noted that calibration in this study is interpreted at the level of a group of learners rather than at the level of an individual (i.e., it does not trace the evolution of an individual's calibration throughout the test). Apart from correlational analyses between confidence in performance and actual test performance, calibration (C) is computed based on the following simple linear model:

C = c − p

where c is a confidence estimate (in percent) and p is the corresponding relative test performance (in percent correct).

This formula provides a simple linear model for calibration (note that, by contrast, a perfect correlation between confidence and performance would be 1). If C is 0, test takers are well calibrated; if C is above 0, they are overconfident; and if C is below 0, they are underconfident. This calculation is also used to produce the calibration graphs (as explained earlier).
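As a minimal illustration only (the study itself used SPSS and Excel; the function names below are hypothetical and the data are made up), the calibration index can be computed as follows:

# Sketch: computing the group calibration index C = c - p,
# where c and p are mean confidence and mean performance in percent.

def calibration_score(confidence, performance):
    """Return C (in percentage points) for a group of test takers."""
    mean_confidence = sum(confidence) / len(confidence)
    mean_performance = sum(performance) / len(performance)
    return mean_confidence - mean_performance

def interpret(c, tolerance=1.0):
    """Label a score: near 0 = calibrated, above 0 = overconfident, below 0 = underconfident."""
    if abs(c) <= tolerance:
        return "well calibrated"
    return "overconfident" if c > 0 else "underconfident"

# Made-up section-level values (percent) for three test takers:
confidence = [41.3, 55.0, 30.0]
performance = [32.4, 60.0, 25.0]

c = calibration_score(confidence, performance)
print(round(c, 2), interpret(c))   # prints: 2.97 overconfident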

Results

Research Question #1: Are Test Takers Calibrated?

In order to answer this research question, I first examined the descriptive statistics of the test performance and confidence in percentages. As can be seen in Table 2, a comparison between test performance and confidence in performance suggests that confidence was higher than performance across test sections. Based on this, the test takers were likely to be overconfident in their test performance.

Table 2. Descriptive Statistics of EPT Performance and Confidence in Percentage (N = 295)
Section                           Minimum   Maximum   Mean    SD
Listening performance              10.00     90.00    32.36   14.82
Confidence in listening             0.00     91.55    41.30   17.72
Grammar performance                 6.67     96.67    37.94   14.85
Confidence in grammar               0.00     97.50    46.94   17.94
Vocabulary performance             10.00     96.67    34.45   14.77
Confidence in vocabulary            0.00     99.17    40.86   17.96
Reading performance                10.00    100.00    32.83   15.06
Confidence in reading               0.00     97.50    35.51   18.66
Overall test performance           16.00     96.00    34.75   12.64
Overall single-case confidence      2.06     96.43    41.15   16.23
Relative-frequency confidence       5.00     97.00    34.64   13.31


Before further investigating the extent to which the test takers were miscalibrated, an examination of calibration graphs yields a general idea of how the 295 test takers exhibited their confidence in the correctness of their test performance. Figures 4 through 9 show the calibration graphs of the test takers in each test section and for the whole test. The test takers can be classified into three groups: realistic test takers (i.e., those falling on the unity line); underconfident test takers (i.e., those under the unity line); and overconfident test takers (i.e., those above the unity line). These graphs also suggest that most test takers had a tendency to be overconfident in their test performance, as shown in Table 2 above.
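For readers who wish to reproduce this kind of diagram, the following is a minimal sketch of how a calibration scatter plot with a unity line could be generated (the study itself used Microsoft Excel; the arrays below are placeholder data, not the study's):

# Sketch: plotting confidence against accuracy with a unity (perfect-calibration) line.
import matplotlib.pyplot as plt

# Placeholder data: one (accuracy, confidence) pair per test taker, in percent.
accuracy = [32, 45, 60, 25, 80]
confidence = [41, 50, 55, 40, 70]

plt.scatter(accuracy, confidence, label="Test takers")
plt.plot([0, 100], [0, 100], linestyle="--", label="Perfect calibration")  # unity line
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.xlabel("Accuracy in performance (%)")
plt.ylabel("Confidence in performance (%)")
plt.legend()
plt.show()

Points above the dashed unity line indicate overconfidence; points below it indicate underconfidence.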

[Scatter plot: confidence in performance (y-axis, 0–100) against accuracy in listening performance (x-axis, 0–100); each point represents one test taker.]

Figure 4. Calibration Diagram for the Listening Section (N = 295).

[Scatter plot: confidence in performance (y-axis, 0–100) against accuracy in grammar performance (x-axis, 0–100); each point represents one test taker.]

Figure 5. Calibration Diagram for the Grammar Section (N = 295).


[Scatter plot: confidence in performance (y-axis, 0–100) against accuracy in vocabulary performance (x-axis, 0–100); each point represents one test taker.]

Figure 6. Calibration Diagram for the Vocabulary Section (N = 295).

[Scatter plot: confidence in performance (y-axis, 0–100) against accuracy in reading performance (x-axis, 0–100); each point represents one test taker.]

Figure 7. Calibration Diagram for the Reading Section (N = 295).

[Scatter plot: single-case confidence in performance (y-axis, 0–100) against accuracy in overall performance (x-axis, 0–100); each point represents one test taker.]

Figure 8. Calibration Diagram (Single-Case Confidence) for the EPT (N = 295).


[Scatter plot: relative-frequency confidence in performance (y-axis, 0–100) against accuracy in overall performance (x-axis, 0–100); each point represents one test taker.]

Figure 9. Calibration Diagram (Relative-Frequency Confidence) for the EPT (N = 295).

To find out the relationship between test performance and single-case confidence in performance, Pearson correlations were computed. Table 3 presents the Pearson correlation coefficients and the corresponding disattenuated coefficients (i.e., corrected for attenuation) between confidence and performance.
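The paper reports the corrected coefficients but not the formula used; assuming the standard Spearman correction for attenuation was applied, the disattenuated coefficient would be computed as:

r_disattenuated = r_cp / √(r_cc × r_pp)

where r_cp is the observed correlation between confidence and performance, and r_cc and r_pp are the reliability estimates of the confidence ratings and of the test scores, respectively.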

Table 3. Relationships between Confidence and Performance
EPT section                             Correlation coefficients   Disattenuated coefficients
Listening                               0.383* (R2 = 0.15)         0.550 (R2 = 0.30)
Grammar                                 0.407* (R2 = 0.17)         0.502 (R2 = 0.25)
Vocabulary                              0.502* (R2 = 0.25)         0.612 (R2 = 0.37)
Reading                                 0.453* (R2 = 0.21)         0.629 (R2 = 0.40)
Overall test (single-case confidence)   0.540* (R2 = 0.29)         0.581 (R2 = 0.34)

* p < 0.01

As can be seen in Table 3, all of these correlation coefficients indicate that the test takers could not accurately estimate the accuracy of their test performance. In other words, their perceived confidence in their test performance could not be trusted, as their confidence explained only about 25% to 40% of their test score variance. In order to find out the extent to which they were miscalibrated, the simple linear model for calibration presented above was used. Table 4 presents the calibration scores for each test section and the overall test scores based on the aggregated single-case confidence (see Table 2). As can be seen in Table 4, the test takers in general were found to be overconfident in their test performance. They tended to be highly overconfident in the Grammar section but approached being realistic in the Reading section. As can be seen from Tables 3 and 4, the statistical findings and the calibration scores were consistent.


Table 4. Test Section Calibration Scores
EPT section                             Calibration score
Listening                               +8.94%
Grammar                                 +9.00%
Vocabulary                              +5.84%
Reading                                 +2.68%
Overall test (single-case confidence)   +6.40%

The calibration score based on their relative-frequency confidence suggests that these test takers approached being perfectly calibrated (C = −0.08%). The relative-frequency confidence indicates that they could generally say how well they had performed on the test. Based on the two sources of confidence, single-case and relative-frequency, two different pictures of calibration were obtained (i.e., aggregated single-case confidence suggests overconfidence, whereas relative-frequency confidence suggests good calibration). This finding is intriguing given the differences between single-case and relative-frequency confidence discussed earlier. The value for relative-frequency confidence could reflect the fact that test takers who had time at the end of the test could revisit the items they felt they had trouble with (i.e., an overall post self-evaluation). In this way, relative-frequency confidence would be related to future self-improvement motivation. However, for the sake of actual itemized test performance, should single-case confidence be more important than relative-frequency confidence, or are both equally important? This issue will be further discussed in the next section.

In line with the findings on calibration based on the averaged single-case confidence and the relative-frequency confidence, it would be interesting to see whether a relationship between single-case and relative-frequency confidence exists. Logically, if test takers perceive that they have not been doing well on the test items, it is likely that their relative-frequency confidence will also be low. That is, if individuals can constantly and accurately assess their single-case confidence throughout the entire test, the accumulated single-case confidence will play a role in determining the relative-frequency confidence. However, Gigerenzer et al.'s (1991) theory does not lend support to this possibility. They argue that these two types of confidence are not correlated because they are derived from different cognitive classes. They postulate that relative-frequency confidence is evaluated based on the number of questions individuals think they have answered correctly, whereas single-case confidence is evaluated based on the kinds of specific questions and choices, and so forth. To test Gigerenzer et al.'s (1991) theory and explore the possible relationship pointed out above, a correlation coefficient between the averaged single-case confidence and relative-frequency confidence was computed, and a statistically significant relationship was found (r = 0.55, p < 0.01). The strength of this relationship was moderate, suggesting that to some extent single-case confidence was related to relative-frequency confidence (31% of the variance was shared). This finding implies that the quality of single-case confidence is important, as it would help inform post self-evaluation, contradicting Gigerenzer et al.'s (1991) assertion (to be further discussed in the Discussion section).

In summary, the answer to Research Question 1 is that the test takers were not calibrated during test task completion. They exhibited a tendency to be overconfident in their test performance. However, a conclusion based on these findings cannot yet be drawn, as we still need to look further, at least at the level of language proficiency and at possible factors that influence inaccuracy in single-case confidence judgments.

Research Question #2: How Do Test Takers at Different Proficiency Levels Differ in Terms of Calibration and Confidence Judgments?

As presented in the Method section, the test takers were classified as beginner, beginner high, intermediate low, intermediate, advanced low, and advanced high learners. Due to the limited number of learners in the higher groups, for the purpose of statistical analysis the intermediate low and intermediate learners were grouped together as intermediate learners (n = 21), and the advanced low and advanced high learners were grouped together as advanced learners (n = 10). Table 5 presents the descriptive statistics of test performance and confidence in the EPT, in percentages, by proficiency level.

Table 6 presents only the significant correlation coefficients between single-case confidence and test performance by section. As can be seen in the table, the single-case confidence provided by the beginner group was the least valid among the four groups of learners. Even for the beginner high group, the relationships were weak (less than 8% shared variance). This means that the learners at the beginner level exhibited the poorest calibration, and the beginner high learners did not differ much from them. The intermediate learners stood out as the most interesting group in regard to the correlation coefficients (their percentages of shared variance were considerably higher than those of the other groups). In particular, their overall single-case confidence accounted for 53% of the variance in their test performance. The advanced learners could predict their grammar performance well (their confidence accounted for 64% of the grammar score variance). It should be noted that the nonsignificant coefficients might be explained not only by poor calibration, but also by the small numbers of subjects and the assessment methods, which, particularly for the intermediate and advanced learner groups, could have produced a method artifact. We hence need to consider these findings cautiously. To further explore which group exhibited better calibration, calibration scores were computed (see Table 7).

The patterns of the calibration scores across different proficiency levels were truly intriguing. First, it was found that the advanced learners exhibited underconfidence across all the test sections. Second, the beginner and the beginner high learners consistently exhibited overconfidence across the test tasks. Third, although generally overconfident, the intermediate learners were the most realistic test taker group, particularly in the vocabulary section where they were perfectly calibrated. Fourth, the beginner high learners were most realistic in their relative-frequency confidence. Both the intermediate and advanced learners were underconfident in their overall relative-frequency confidence, whereas both groups of beginners were overconfident. Unlike other proficiency levels, the advanced level learners exhibited underconfidence throughout the test sections. Figures 10 through 13 show the calibration graphs of these learner groups.


Table 5. Descriptive Statistics of Performance and Confidence by Proficiency Level (N = 295)
                                  Proficiency level   Minimum   Maximum   Mean    SD
Listening performance             Beginner             10.00     55.00    24.81   10.09
                                  Beginner high        15.00     60.00    32.53   10.14
                                  Intermediate         30.00     70.00    46.42   11.95
                                  Advanced             65.00     90.00    81.50    7.09
Confidence in listening           Beginner              0.00     79.65    36.96   15.48
                                  Beginner high         0.00     83.00    40.77   16.91
                                  Intermediate         18.15     83.10    55.34   17.11
                                  Advanced             33.00     91.55    66.91   20.25
Grammar performance               Beginner              6.67     50.00    27.93    7.78
                                  Beginner high        16.67     66.67    38.95    8.49
                                  Intermediate         43.33     83.33    58.89   11.47
                                  Advanced             63.33     96.67    86.33   10.24
Confidence in grammar             Beginner              0.00     80.00    41.71   16.09
                                  Beginner high         6.67     90.00    46.30   16.23
                                  Intermediate         39.17     95.83    64.13   15.80
                                  Advanced             45.83     97.50    77.25   17.17
Vocabulary performance            Beginner             10.00     43.33    23.70    6.76
                                  Beginner high        20.00     60.00    35.68    8.25
                                  Intermediate         43.33     73.33    58.57    8.20
                                  Advanced             50.00     99.17    74.33   15.83
Confidence in vocabulary          Beginner              0.00     65.83    34.99   15.44
                                  Beginner high         2.50     85.00    40.38   16.12
                                  Intermediate         38.33     95.00    58.67   16.04
                                  Advanced             50.00     99.17    74.33   15.83
Reading performance               Beginner             10.00     45.00    24.44    8.01
                                  Beginner high        10.00     65.00    33.21   10.58
                                  Intermediate         25.00     75.00    49.76   13.73
                                  Advanced             65.00    100.00    82.00    9.78
Confidence in reading             Beginner              0.00     72.50    31.71   15.25
                                  Beginner high         0.00     82.00    33.45   16.42
                                  Intermediate         23.75     92.50    50.65   21.23
                                  Advanced             50.00     97.50    77.00   16.30
Overall test performance          Beginner             16.00     29.00    25.33    3.26
                                  Beginner high        30.00     47.00    35.53    4.70
                                  Intermediate         48.00     69.00    54.38    7.24
                                  Advanced             76.00     96.00    82.90    6.92
Overall single-case confidence    Beginner              2.06     64.18    36.34   13.77
                                  Beginner high         5.59     77.94    40.22   14.25
                                  Intermediate         36.52     89.10    57.20   15.23
                                  Advanced             44.71     96.43    73.87   15.85
Relative-frequency confidence     Beginner              5.00     60.00    31.64   11.37
                                  Beginner high        10.00     69.00    33.45   10.58
                                  Intermediate         10.00     70.00    43.29   16.02
                                  Advanced             40.00     97.00    67.50   15.55


Table 6. Significant Correlation Coefficients between Single-Case Confidence and Test Performance by Test Section
Proficiency level   Correlation coefficients
Beginner            ns for all test sections
Beginner high       0.169* (listening); 0.247* (vocabulary); 0.279* (overall test)
Intermediate        0.577* (listening); 0.602* (vocabulary); 0.488* (reading); 0.731* (overall test)
Advanced            0.800* (grammar)

* p < 0.01

Table 7. Calibration Scores by Proficiency Level
Test section                  Beginner   Beginner high   Intermediate   Advanced
Listening                     +12.15     +8.24           +8.91          −14.59
Grammar                       +13.78     +7.35           +5.24          −9.08
Vocabulary                    +11.29     +4.70            0.00          −6.33
Reading                       +7.27      +0.24           +1.89          −5.00
Overall (single-case)         +11.01     +4.69           +2.81          −9.03
Overall (relative-frequency)  +6.32      +1.92           −11.09         −15.40

[Scatter plot: confidence in performance against accuracy in performance (both 0–100), with separate point series for the listening, grammar, vocabulary, and reading sections.]

Figure 10. Calibration Diagram for the Beginner Learners (N = 108).


[Scatter plot: confidence in performance against accuracy in performance (both 0–100), with separate point series for the listening, grammar, vocabulary, and reading sections.]

Figure 11. Calibration Diagram for the Beginner High Learners (N = 156).

[Scatter plot: confidence in performance against accuracy in performance (both 0–100), with separate point series for the listening, grammar, vocabulary, and reading sections.]

Figure 12. Calibration Diagram for the Intermediate Learners (N = 21).

[Scatter plot: confidence in performance against accuracy in performance (both 0–100), with separate point series for the listening, grammar, vocabulary, and reading sections.]

Figure 13. Calibration Diagram for the Advanced Learners (N = 10).


As seen in Table 7, however, these test takers were somewhat realistic in the sense that they did not extremely overestimate or underestimate themselves based on their perceived ability. That is, the averaged confidence differed greatly between proficiency groups (i.e., advanced learners rated their confidence higher than those with lower proficiency). A multivariate analysis of variance (MANOVA), in which proficiency level and gender were treated as independent variables, was performed to see if there were statistically significant differences in confidence ratings among the four groups of learners. The test of between-subjects effects showed that there was no interaction effect between proficiency level and gender. Hence, the statistically significant differences in the confidence ratings were due to the main effect (proficiency level) only. Table 8 shows the factorial MANOVA results for proficiency level.

Table 8. Factorial MANOVA Results for Proficiency Levels
Dependent variable         df   F        p      η2     Observed power
Confidence in listening    7     7.760   0.000  0.159  1.000
Confidence in grammar      7    11.834   0.000  0.224  1.000
Confidence in vocabulary   7    14.124   0.000  0.256  1.000
Confidence in reading      7    14.738   0.000  0.264  1.000

There were statistically significant differences in confidence among the groups of learners (large effect sizes, i.e., η2 > 0.14; Cohen, 1977). Scheffé post hoc tests were conducted to identify which contrasts differed. It was found that the higher the proficiency level, the higher the confidence-in-performance rating (see Table 5 for descriptive statistics). That is, advanced learners reported significantly higher single-case confidence than intermediate learners, who in turn reported higher confidence than the two groups of beginner learners. However, the differences in confidence ratings between the beginner and beginner high learners were not statistically significant. Based on this analysis, although the test takers were not well calibrated (i.e., they were unrealistic) in their test performance, they were somewhat realistic in the sense that, when compared across groups, their confidence was dependent upon their actual performance.

In summary, it was found that the nature of calibration and confidence could be highly complex, particularly in relation to language proficiency. It seems likely that as proficiency increases, language skills and the associated cognitive processes increase in complexity. The criteria used to judge calibration become more complex, and therefore its assessment is likely to become more complex as well. Hence, it is possible that language proficiency is one of the significant factors affecting the nature of learners' calibration; this factor warrants future calibration research. The next section further explores factors that might affect the test takers' calibration (i.e., gender and test item difficulty).

Research Question #3: What Are the Factors Affecting the Test Takers' Calibration?

For the purpose of this paper, two factors were examined to understand why the test takers were miscalibrated: the first was gender, and the second was the hard-easy item effect.


Gender Factor on Calibration

It has been well accepted that gender as a personal characteristic plays a crucial role in language learning and use (e.g., Bachman & Palmer, 1996; Chavez, 2001; Phakiti, 2003). Perhaps it also plays a role in confidence rating in the present context. If we know that gender differences in confidence and calibration exist, we then need to be aware of how gender can affect calibrative development and achievement in L2 learning or use. This knowledge of the differences can be used to accommodate individual students' needs, given that males and females deserve an equal chance of learning success. For the purpose of the present study, it is important to know which gender (males or females) approached better calibration. Table 9 presents descriptive statistics of test performance and confidence in the EPT, in percentages, by gender.

Table 9. Descriptive Statistics of Performance and Confidence by Gender (N = 295)
Test performance and confidence   Gender   Minimum   Maximum   Mean    SD
Listening performance             male      10.00     90.00    35.05   16.78
                                  female    10.00     90.00    31.07   13.65
Confidence in listening           male       3.30     86.50    44.54   16.99
                                  female     0.00     91.55    39.77   17.91
Grammar performance               male       6.67     90.00    38.73   15.68
                                  female    10.00     96.67    37.57   14.46
Confidence in grammar             male      13.33     92.50    48.19   16.82
                                  female     0.00     97.50    45.68   18.11
Vocabulary performance            male      10.00     93.33    34.84   16.52
                                  female    10.00     96.67    34.27   13.91
Confidence in vocabulary          male       3.33     95.00    43.82   17.99
                                  female     0.00     96.17    39.45   17.77
Reading performance               male      10.00     95.00    33.32   16.41
                                  female    10.00    100.00    32.60   14.41
Confidence in reading             male       0.00     92.50    38.59   18.88
                                  female     0.00     97.50    33.81   18.25
Overall test performance          male      16.00     93.00    35.74   14.22
                                  female    17.00     96.00    34.27   11.82
Overall single-case confidence    male       7.78     88.26    43.78   15.70
                                  female     2.06     95.68    39.67   16.24
Relative-frequency confidence     male      10.00     69.00    35.63   13.16
                                  female     5.00     97.00    34.18   13.38

To see the differences between male and female learners, the Pearson correlations were first computed and compared. Table 10 presents the correlation coefficients between confidence and test performance among male and female learners.


Table 10. Correlation Coefficients between Confidence and Test Performance among Male and Female Learners (95 males; 200 females)
Test performance and confidence                 Males     Females
Listening                                       0.271**   0.403**
Grammar                                         0.373**   0.433**
Vocabulary                                      0.454**   0.512**
Reading                                         0.286**   0.449**
Overall test performance (single-case)          0.413**   0.566**
Overall test performance (relative-frequency)   0.530**   0.548**

** p < 0.01

Although caution is needed in drawing conclusions because of the unequal numbers of males and females, from these correlation coefficients it can be observed that female learners predicted their test performance better than males did. To further explore their differences in calibration, the calibration scores were calculated. Table 11 presents the calibration scores by gender. As can be seen, both males and females were overconfident in their test performance. Females' calibration scores were better than males', although the differences were not large. For the relative-frequency confidence, males exhibited underconfidence, whereas females exhibited overconfidence; however, males' scores suggested slightly better calibration. Figures 14 and 15 further illustrate the calibration of the male and female learner groups. As seen in these figures, both male and female learners exhibited good calibration in overall test performance based on their relative-frequency confidence. They were, however, overconfident across the test sections.

Table 11. Calibration Scores by Gender (95 Males, 200 Females)
Section                       Males    Females
Listening                     +9.49    +8.88
Grammar                       +9.49    +9.03
Vocabulary                    +8.98    +6.13
Reading                       +5.27    +2.50
Overall (single-case)         +8.04    +6.28
Overall (relative-frequency)  −0.11    +0.18

To understand gender differences, as Phakiti (2003) pointed out, it may not be sufficient to consider learners as representative of one or the other of a pair of dichotomous types. Hence, further analysis that acknowledges proficiency level as well as gender is needed. In order to find out whether males and females in this study differed in terms of test performance and confidence judgments, the results from the MANOVA were used. The test of between-subjects effects showed that, except for the grammar section, there were no interaction effects between proficiency level and gender on test performance; for the confidence ratings, there were no interaction effects between proficiency level and gender at all. Hence, the statistically significant differences in test performance (except for grammar performance) and in confidence ratings were due to the main effect (proficiency level) only. Table 12 reports the factorial MANOVA results for gender.

[Scatter plot: confidence in performance against accuracy in performance (both 0–100), with separate point series for the listening, grammar, vocabulary, and reading sections and for overall single-case and relative-frequency confidence.]

Figure 14. Calibration Diagram for the Male Learners (n = 95).

[Scatter plot: confidence in performance against accuracy in performance (both 0–100), with separate point series for the listening, grammar, vocabulary, and reading sections and for overall single-case and relative-frequency confidence.]

Figure 15. Calibration Diagram for the Female Learners (n = 200).

Table 12. Factorial MANOVA Results by Gender
Dependent variable         df   F      p      η2     Observed power
Performance in listening   1    0.984  0.322  0.003  0.167
Performance in grammar     1    2.129  0.146  0.007  0.307
Performance in vocabulary  1    1.181  0.278  0.004  0.191
Performance in reading     1    0.049  0.825  0.000  0.056
Confidence in listening    1    0.048  0.827  0.000  0.055
Confidence in grammar      1    0.845  0.359  0.003  0.150
Confidence in vocabulary   1    0.102  0.750  0.000  0.062
Confidence in reading      1    0.229  0.633  0.001  0.076


Table 12 shows that there were no statistically significant differences in test performance or confidence judgments between male and female learners. These statistical results imply that, because test performance and confidence did not differ between males and females, there should be no difference in their calibration either. Analyses of multiple comparisons further supported this inference: males and females at the same proficiency levels did not differ statistically in terms of test performance or confidence judgments. In summary, based on the correlational analysis and the computation of the calibration scores, although both genders exhibited miscalibration, females were somewhat better calibrated than males. However, no statistically significant differences in test performance or confidence judgments were found, either between males and females overall or between males and females within proficiency levels. What these findings suggest is that gender could interact with the nature of calibration in a complex way, just as language proficiency does.

Hard-Easy Effects on Calibration

The final factor to be explored in this quantitative study is the effect of easy and difficult items on confidence, which might negatively affect test takers' calibration. In the study of confidence and calibration, some calibration researchers have empirically investigated a phenomenon known as the hard-easy effect, or discriminability effect (e.g., Baranski & Petrusic, 1994). This phenomenon occurs when individuals demonstrate overconfidence in their performance on difficult tasks but underconfidence in their performance on easy tasks. If the hard-easy effect exists, some logical explanations are needed to inform exactly why a failure to adjust internalized response criteria to changes in task demands occurs during information processing. Understanding the hard-easy effect on confidence may also contribute to the understanding of the processes undertaken to complete cognitive tasks and to generate confidence. Recall that, as discussed earlier in regard to the local MM, the first attempt to answer a test question is characterized by memory searching and rudimentary logical operations. When individuals can generate certain knowledge by constructing a local MM, they will have sufficient evidence for the answer and a confidence of 100%. That is, if the search is successful, the confidence in the knowledge produced is certain. Hence, it is likely that on easy test items (i.e., when L2 learners have the adequate linguistic, sociolinguistic, pragmatic, discourse, and/or strategic competence required by the given task), they will construct their mental model at the local cognitive level and their confidence will be generated locally; that is, we can expect the relationship between confidence and performance on easy test items to be high. Unlike in the local MM, confidence within the PMM is different because it is generated based on probability. If the attempt to construct a local MM fails, a PMM is then constructed. This construction goes beyond the structure of the task by using probabilistic information gained from the environment. Here, it is likely that for difficult test items it may be hard to accurately assess how well one has performed, thereby limiting the ability to make a confidence judgment. Hence, the relationship between confidence and performance on difficult test items will be low. By investigating the hard-easy effect, the cognitive operations modeled in this paper can be understood empirically.

Investigating the hard-easy effect can be achieved by means of the Rasch IRT analysis and the item difficulty and person ability map (Figure 3). For the purpose of this study, items 13, 14, 20, 25, 46, 47, 49, 60, 62, 67, 71, and 78 were selected as representing difficult test items, and items 1, 12, 22, 39, 40, 42, 45, 48, 50, 51, 55, and 63 were selected as representing easy test items. The infit and outfit mean square statistics indicated that these items fit well in the test. To analyze these items so that the general tendency in confidence can be seen, their associated single-case confidence results were averaged. Table 13 presents the descriptive statistics for performance on the difficult and easy test items and their associated single-case confidence. Pearson correlations were first computed to see the relationships between confidence and performance in these two difficulty categories. It was found that the correlation coefficient between confidence and performance for the difficult test items was 0.257 (p < 0.01; disattenuated coefficient = 0.392), and for the easy test items it was 0.454 (p < 0.01; disattenuated coefficient = 0.795). These results support the postulation that the strength of the relationship between confidence and performance would be higher for easy test items than for difficult ones. However, this finding did not yield information on the nature of the test takers' calibration in regard to the hard-easy effect.

Table 13. Descriptive Statistics for Performance and Confidence for Difficult and Easy Test Items
Test items            Mean performance        Mean confidence
Difficult test items  16.073 (SD = 16.649)    41.944 (SD = 14.527)
Easy test items       59.407 (SD = 19.408)    49.601 (SD = 17.300)

Table 14 presents the calibration scores of the test takers for the difficult and easy test items, and Figure 16 illustrates their calibration on these items. As can be seen in the table and figure, the confidence the test takers reported was surprising. Logically, we would assume that test takers would be underconfident in their performance on difficult test items and overconfident on easy ones, but the results are reversed. This finding is consistent with those reported by psychological researchers and will be taken up again in the Discussion section.

Table 14. Calibration Scores for Difficult and Easy Test Items
Test items            Calibration score
Difficult test items  +25.871
Easy test items       −9.806

To further examine this effect on discrete test items (to eliminate a possible effect of aggregation across test items), items 13 (as the most difficult) and 51 (as the easiest) were selected. Item 13 is in the listening section. The test takers hear “He left the car running when he went into the store.” The test takers then read: (a.) He forgot to put the brake on; (b.) He didn’t turn the engine off; and (c.) He ran to the store. The majority of the test takers chose choice (c) (256 test takers got this question wrong!). The averaged confidence was 41.49% (SD = 30.01), while the averaged performance was 13.22% (SD = 33.92). The relationship between confidence and performance for this item was nonsignificant. This finding suggests that the test takers severely failed to provide valid confidence in this item (to be further discussed below).


[Scatter plot: confidence in performance against accuracy in performance (both 0–100), with separate point series for the difficult and easy test items.]

Figure 16. Calibration Diagram for the Difficult and Easy Test Items (N = 295).

Item 51 was in the vocabulary section. The test takers read the following sentence: “All the students like Miss Kincaid; she must be the most _____ teacher at school.” They then chose: (a.) popular; (b.) central; (c.) ordinary; or (d.) sufficient. The majority of the test takers chose choice (a) (243 test takers got this question correct). Their averaged confidence was 55.76% (SD = 30.39), while their averaged performance was 82.37% (SD = 38.17). A statistically significant relationship between confidence and performance was found for this item (r = 0.300, p < 0.01). This finding suggests that, although the relationship was not strong, confidence in this item was more realistic than that in item 13. Figure 17 shows the calibration graph for the test takers on items 13 and 51.

[Scatter plot: confidence in performance against accuracy in performance (both 0–100), with separate point series for item 13 and item 51.]

Figure 17. Calibration Diagram for Items 13 and 51 (N = 295).

As can be seen, these learners exhibited overconfidence in a difficult item and underconfidence in an easy item. Although no firm conclusion can be drawn at this stage because more studies in this area are needed, the findings here are in line with calibration research investigating the phenomenon known as the hard-easy effect (Baranski & Petrusic, 1994; Lichtenstein & Fischhoff, 1977). This phenomenon occurs when individuals demonstrate overconfidence in their performance on difficult tasks but underconfidence in their performance on easy tasks. Researchers in this area suggest that this miscalibration is mainly due to the fact that test takers may have a bias in their memory search, which in turn leads to insensitivity to task difficulty (Griffin & Tversky, 1992). Such test takers, thus, might have had insufficient information concerning task difficulty to sufficiently alter their judgments when task difficulty changed. This insufficient adjustment is likely to occur when inadequate cues are present during test completion (Suantak, Bolger, & Ferrell, 1996). Much work is needed before findings concerning the hard-easy effect on calibration can be generalized.

Discussion

It has long been known that L2 proficiency or ability is a highly complex, dynamic, and multidimensional construct in which various internal and external factors interact (see Bachman & Palmer, 1996). Multiple interactions of these factors in L2 use, learning, or acquisition often result in performance inconsistency. Given this, LT, SLA, and other language skill-based research has been conducted to describe and explain variability in L2 use, learning, or acquisition. Regardless of specific influential variables, it has been well accepted that the two major systematic sources of language performance variability are: (1) variation due to individual differences in L2 proficiency, processing, and personal characteristics; and (2) variation due to characteristics of language tasks or contexts. The present study has taken the position that variability and differences in L2 test performance were partly due to differences in L2 processing (i.e., accuracy in confidence assessment) during test task engagement. Guided by the theoretical framework of how the human cognitive system may generate confidence during information processing, the present study investigated the nature of, and factors affecting, L2 learners' calibration within the context of an English placement test designed by the English Language Institute, University of Michigan. The study was carried out at a Thai university with 295 Thai EFL learners. It examined both single-case confidence and relative-frequency confidence and employed Rasch IRT to analyze test quality and to make use of the person-ability and item-difficulty map in the data analyses. The study found that the participants were poorly calibrated and exhibited a tendency to be overconfident in their test performance. Calibration was inferred from the relationship between accuracy in and confidence about test performance. This major finding may be used to explain why individual learners at different language proficiency levels could not achieve the desired test performance and demonstrate their actual L2 ability. Note that realistic test takers were not necessarily highly proficient ones; regardless of the level of language ability or performance, one needs to be realistic. Understanding this construct may further develop the theory of communicative language ability in that calibrative competence needs to be accounted for in successful communication.

This study shows that calibration could be significant for L2 test performance in that it can provide clues about what to do and what not to do. For example, a realistic test taker would have no problem knowing what was and was not required to succeed on the test. Overconfident test takers could stop engaging in a cognitive task prematurely; overconfidence results in an erroneous sense of competence and contaminates calibration. Underconfident test takers, on the contrary, could spend too much time on a task that had already been successfully completed. One explanation for underconfidence is that test takers fail to detect declining task difficulty: their performance accuracy increases, but their confidence does not rise accordingly. Underconfidence can negatively impact performance in that test takers are influenced by a feeling that they cannot achieve the task goal and hence disengage from task completion when in fact they are likely to be able to complete it successfully. To overcome this, learners need the ability to consciously shift between actual task performance and the monitoring of that performance. Researchers (e.g., Kleitman & Stankov, 2001) argue that those who are overconfident in one task are likely to exhibit overconfidence in other tasks as well; the same applies to underconfident test takers. Ross (1998) also argues that learners who are inclined to under- or overestimate their own proficiency or performance in one skill are likely to do the same in another skill. Hence, poor calibration implies not only overconfidence coupled with an absence of knowledge or ability, but also underconfidence when knowledge or ability is in hand. Although underconfidence may be preferable to overconfidence in terms of language performance, learners at all levels need to be realistic about their performance.

In the next section, I first discuss the findings in relation to the relevant literature. The interpretations of the findings will be supported by a discussion of the limitations of the study. I then discuss the implications of the study for L2 learning, acquisition, and use, and propose a methodology for L2 instruction that provides L2 learners with an opportunity to improve their calibration. Finally, I point out further research directions.

Calibration and L2 Self-Assessment Research

As pointed out earlier, a number of psychological studies of human calibration showed that people were generally miscalibrated, and in most cases overconfidence was found. Research on L2 learners' self-assessment has also found that self-assessment and language performance were only weakly related, suggesting that L2 learners lack the ability to self-assess validly. The present study's findings that L2 learners were miscalibrated and overconfident in their test performance are consistent with this literature. The findings may lend some practical explanations for why L2 learners in previous self-assessment research failed to evaluate their L2 abilities, knowledge, or performance realistically. As Oscarson (1997) pointed out, self-assessment tends to be more accurate when based on task content closely tied to a specific situation than when based on a broad, general, or abstract context. In the present study, even though the participants self-assessed their confidence in the correctness of their performance at the item level, they failed to be calibrated or to self-assess validly. Hence, the findings suggest that serious caution is needed in the use of self-assessment scores in decision making, as L2 learners may lack the capacity to sufficiently determine their own language ability.

In regard to test section, the present study found moderate relationships between confidence and performance in the four test sections. Oscarson (1997) and Ross (1998) argue that learners find it easier to assess their performance in decoding/receptive skills (reading and listening) than in encoding/productive skills (speaking and writing). Although writing and speaking skills were not investigated in this study, the analysis shows that the correlation between confidence and reading test performance was the highest of all test sections (disattenuated r = 0.629). The calibration score for the reading section was also the best of the four sections. Perhaps in the reading test section, the external feedback available from the test items and the given texts supports test takers' confidence better than discrete test items such as those in the grammar and vocabulary sections do. Moreover, in the listening test section, failure to retrieve previously spoken text could result not only from incomprehensible input but also from high working memory load and anxiety. Because the items are not repeated, this section of the test offers little external feedback to assist listening performance and, accordingly, confidence. Unlike in the listening section, test takers could revisit the reading items as often as they needed to. In line with this explanation, Ross (1998) points out that the reason self-assessment in reading is strong may be that this skill is the first to be taught in the foreign language context, and learners are therefore most experienced in using reading skills. Ross further argues that more extensive opportunities for reading, compared with speaking and listening, may influence to some degree the relative accuracy of self-assessment. The nature of confidence in listening comprehension in real-life language use, nonetheless, can differ greatly from listening in a language test, because in real life listeners have time to gather information through discourse and interaction with their interlocutors. Hence, in such contexts, confidence and calibration will be different. It would also be instructive to further explore differences in the nature of calibration between receptive and productive skills, both in test and nontest contexts, given that the nature of cognitive processing differs. Findings from such research would be useful for understanding the factors affecting calibration. Because the criteria used to judge speaking and writing performance (e.g., holistic scales) differ from those used to judge reading and listening (e.g., discrete right or wrong answers), and because confidence in speaking and writing is difficult to express as a percentage, calibration can differ across skills. Findings from such research could be used to make L2 learners aware of such factors.

Regarding calibration at different proficiency levels, the present study found that the beginner learners exhibited the poorest calibration scores across all test sections. Worst of all, this group showed a tendency to be highly overconfident in their test performance, and their confidence in each test section was not found to correlate with test performance (see Table 5). Although the advanced learners also exhibited poor calibration, they were highly underconfident in their test performance. This finding is surprising because it is reasonable to expect that test takers with high English proficiency should exhibit good calibration, since they would know whether they could successfully complete a given task. One plausible explanation is that high-ability test takers may be more likely to encounter uncertainty in a test because of their deeper processing of test tasks. They may experience a 50%–50% confidence phenomenon in which they can eliminate all but the best two final choices. Blanche and Merino (1989) suggested that low-proficiency learners may overestimate their skills and high-proficiency learners may underestimate theirs, as seen in the present study. Note that the correlation coefficients among the various groups of test takers might arise partly because some test takers who generally did well on the test knew that they did well and used high confidence ratings, whereas some who generally did poorly knew this and used lower confidence ratings. This type of calibration is of little interest to calibration research because it does not indicate that the learners can discriminate between what they know and what they do not know.

The most calibrated group of test takers was the intermediate group, because their confidence correlated with their test performance in all test sections. Except for the grammar section, the correlation coefficients in this group were among the strongest, and their calibration scores were better than those of the other groups. Note that, of all test sections, each group of test takers showed the poorest calibration scores in the listening section. Reading, by contrast, was the area in which all groups of test takers exhibited good calibration (see Figures 10–13). As Ross (1998) noted, because reading and listening are receptive skills that may not require preplanning and executing specific production strategies, learners should be more aware of their performance or proficiency in the productive skills of speaking and writing. However, the present finding does not lend support to this postulation. It seems likely that as proficiency increases, language skills and the associated cognitive processes increase in complexity; the criteria used to judge calibration become more complex, and its assessment is therefore likely to become more complex as well. Hence, it is possible that language proficiency is a significant factor affecting the nature of learners' calibration. This warrants future research, which should aim to determine how calibration leads to good performance by successful or highly proficient language learners and how they employ such processing across a broad range of domains.

In an attempt to understand why the test takers were miscalibrated, the possible effect of gender on calibration was examined. Although females obtained better calibration scores than males, the difference was not significant. Both gender groups exhibited a tendency to be overconfident in their performance (consistent with the findings of the prior analyses). Males and females were also similar in that, in the case of relative-frequency confidence, both approached good calibration (see Table 11). Further MANOVA results suggest that, at the same proficiency levels, neither test performance nor confidence differed significantly between genders. However, as pointed out in the results section, gender might interact with confidence judgment in a complex way, and such complexity may influence calibration in a complex manner as well. The findings of the present study suggest, as in some previous studies (e.g., Coombe, 1992; Shrauger & Osberg, 1981; Smith & Baldauf, 1982; Strong-Krause, 1997), that we do not yet have enough evidence for any clear-cut gender effects on self-assessment, confidence, and calibration. Further research in this area is thus needed.

As suggested in the literature, because the nature of test tasks is a possible source of miscalibration, the present study examined the hard-easy effect. This effect has been argued to reflect a cognitive bias that leads to insensitivity to task difficulty (Suantak, Bolger, & Ferrell, 1996): individuals show overconfidence in difficult tasks but underconfidence in easy tasks. Based on a substantial selection of items, the present study found evidence for the hard-easy effect, implying that these test takers failed to adjust their internalized response criteria to changes in task demands during information processing. This finding contributes to the understanding of the processes undertaken to complete cognitive tasks and to generate confidence. However, the reasons for overconfidence in difficult questions and underconfidence in easy questions are not straightforward, as pointed out in regard to differences in information processing within the local MM and the PMM. Explaining the hard-easy effect is not simple because the relationship between confidence and performance for easy items (operating within the local MM) was much stronger than that for the difficult items. Further analysis is needed to explain why the two indications of calibration (calibration scores and correlation coefficients) contradicted each other. Needless to say, both analyses suggest that explicit training is needed by these groups of learners (discussed below).
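To illustrate how a hard-easy analysis of this kind can be carried out, the short Python sketch below splits items into hard and easy groups by item facility and computes the over/underconfidence bias (mean confidence minus proportion correct) within each group. The data and names are hypothetical; this is only an illustrative sketch, not the analysis procedure used in the study.

    import numpy as np

    def hard_easy_bias(confidence, correct, facility_split=0.5):
        """Split items into 'hard' and 'easy' by item facility (proportion of test
        takers answering correctly) and report the confidence bias in each group."""
        confidence = np.asarray(confidence, dtype=float)  # test takers x items, 0-1 scale
        correct = np.asarray(correct, dtype=float)        # test takers x items, 1 = right, 0 = wrong
        facility = correct.mean(axis=0)                   # low facility = hard item
        hard = facility < facility_split

        def bias(mask):
            # positive value = overconfidence, negative value = underconfidence
            return confidence[:, mask].mean() - correct[:, mask].mean()

        return {"hard_items_bias": bias(hard), "easy_items_bias": bias(~hard)}

    # Hypothetical data: 4 test takers x 6 items
    conf = np.array([[1.00, 0.75, 0.50, 0.75, 1.00, 0.25],
                     [0.75, 0.50, 0.75, 1.00, 0.75, 0.50],
                     [1.00, 1.00, 0.25, 0.50, 0.75, 0.75],
                     [0.50, 0.75, 0.50, 0.75, 1.00, 0.50]])
    corr = np.array([[1, 0, 0, 1, 1, 0],
                     [1, 0, 1, 1, 1, 0],
                     [1, 1, 0, 0, 1, 1],
                     [0, 1, 0, 1, 1, 0]])
    print(hard_easy_bias(conf, corr))

A positive bias on the hard items together with a smaller or negative bias on the easy items would be the signature of the hard-easy effect described above.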

In sum, it was found through the single-case confidence ratings that these Thai EFL learners (in general, by proficiency level, and by gender) were poorly calibrated, exhibiting a tendency to be overconfident in their test performance. However, in the case of relative-frequency confidence, they were generally somewhat realistic about how much they believed they had achieved in the test. These two types of confidence have different functions in human information processing: the former serves specific problem solving at hand, whereas the latter serves post hoc self-evaluation of overall performance.

Justifications for Calibration and Miscalibration Phenomena

Given the present findings, calibration and miscalibration must be explained and interpreted as the result of (1) individual differences in L2 proficiency, calibrative ability, and personal characteristics, and (2) characteristics of language tasks or contexts. It is thus realistic to argue that the realism of confidence is determined not only by the individual (i.e., intraindividual-driven factors) but also by the context (i.e., context-driven factors). Both intraindividual and contextual factors must be attended to in explaining research findings on calibration, because both sources have a profound effect on confidence judgment error. In this section, I discuss intraindividual-driven and context-driven factors that might affect human calibration and research findings on miscalibration. Limitations of the present study are also pointed out.

Intraindividual-Driven Factors

In psychological research, several intraindividual factors affecting miscalibration have been investigated: for example, lack of knowledge (Juslin, 1994), incentives and extrinsic motivation for accurate judgments (Ashton, 1992; Phillips, 1987), feedback (Ferrell, 1994; Glenberg, Sanocki, Epstein, & Morris, 1987), and gender differences (in this study). These factors need empirical investigation in the contexts of L2 learning and use. Within the intraindividual approach, metacognition has been found to be the most important factor contributing to the degree of realism of confidence (see Schraw, 1994, for a detailed discussion of metacognition). Researchers who investigate the nature of calibration (e.g., Björkman, 1992, 1994; Juslin, Winman, & Persson, 1995; Keren, 1991; Kleitman & Stankov, 2001; Liberman & Tversky, 1993; Stankov & Crawford, 1996) have argued that confidence judgments reflect the self-monitoring trait of metacognition. Metacognition refers to a higher-order cognitive trait that involves monitoring and the consequent regulation and orchestration of cognitive and affective processes to achieve cognitive goals in an L2 learning/use environment; it regulates the interaction between the individual and a specific context (Bachman & Palmer, 1996). It can be argued that the act of self-monitoring provides individuals with some basis for knowing that the desired performance is occurring. This knowing results in self-generated confidence that reflects certainty about the degree of performance success. A typical situation in which assessing confidence is likely to occur is when (1) the language use situation requires us to be aware of our actions or performance and their potential consequences; (2) the language task is somewhere between totally unfamiliar and totally familiar; and (3) it is important to make correct language responses or to achieve desired outcomes. Relating metacognition to the cognitive model in Figure 1, assessing confidence reflects self-awareness of goal achievement that leads to strategic reactions to ongoing changes in task demands. Concurrent confidence can then act as internal feedback for efficient L2 processing, prompting strategic shifts in planning, monitoring, and evaluating. At the same time, confidence belongs to the affective domain of human information processing, because communicators must know whether they are satisfied or happy with their decisions, judgments, and performance. Given this function of realistic confidence, the lack of this ability may explain why L2 learners' performance is not as effective or satisfactory as it could be. It can be argued that the ability to calibrate confidence with performance, which in turn informs strategy use as a feedback loop (see Butler & Winne, 1995), is crucial for any language use, performance, learning, and acquisition. Zimmerman (1994) argues that metacognitive learners are aware of what they know and what they do not know. Hence, when they are aware that their performance is satisfactory, their confidence is likely to be high; likewise, if their performance is unsatisfactory, their confidence is likely to be low. It is therefore important that future research explore whether individuals high in metacognition or strategic competence are more calibrated than those low in this competence. Furthermore, if a connection between strategy research and calibration research is made, a lack of calibration may help explain why the relationship between strategy use and L2 performance or learning is weak or moderate (Anderson, 2005; Purpura, 1999). Perhaps if L2 learners employ appropriate strategies for use or learning and are also calibrated, strategies will correlate more strongly with L2 language use or learning performance. This, too, is an area for future strategy research.

Besides metacognition, beliefs about one's expertise have been found to affect learners' confidence judgments (Johnson & Bruce, 2001). Perceived expertise or self-esteem often causes overconfidence (Yates, Lee, & Shinotsuka, 1996), a factor related to the self-classification hypothesis (Glenberg & Epstein, 1985). Individuals who classify themselves as good or poor L2 learners are likely to rate their confidence accordingly, regardless of their actual performance. When they classify themselves as proficient in English, they perceive themselves as capable of correctly answering English test questions (i.e., suggesting a confidence trait). This means that such learners tend to report higher confidence in their performance than those who believe their English ability is poor. In other words, confidence as an individual trait may interfere with an individual's self-assessment in a specific context. This type of calibration is of little interest to calibration research because it does not indicate that the learners can discriminate between what they know and what they do not know. In summary, factors such as proficiency level, gender, motivation, metacognition, and beliefs about one's expertise, proficiency, or self-esteem are further areas to be considered in future calibration research.

Contextual-Driven Factors

A number of calibration studies deal with context-driven factors, such as the context, task characteristics, and measurement instruments, as influences on calibration. Test method characteristics and the associated measurement error need to be taken into account when investigating calibration. The reliability and validity of language measures are important if we are to be confident about findings concerning the nature of L2 learners' calibration; unreliable language performance measures can in part be responsible for apparent miscalibration. It is common knowledge that measurement error can be high in research instruments used to measure individuals' behaviors, attitudes, feelings, or motivation. In the present study, although the reliability of the placement test was acceptable (0.87), we must keep in mind that there was measurement error in the test scores. Hence, the reliability coefficient of the test is regarded as a limitation of the present study, although a correction for attenuation of the correlation coefficients was attempted. In addition to the language, knowledge, or ability measures, confidence measures are another significant source of miscalibration or illusion. Although people can use ratio scales to estimate things (e.g., most people can differentiate 95% from 99%), they find the difference between 10% and 11% too subtle to distinguish (Edwards, 1967). To make the measurement of confidence more complex still, individuals who say they are 85% confident in their performance do not necessarily expect to actually attain 85% success in the test; in fact, 85% confidence can be merely an expression of high confidence. Other factors influencing the expression of confidence include the qualitative aspects of the instructions, the tariffs of the confidence scales, and individuals' motivation to express confidence. Therefore, because of the possible effects of both the L2 measures and the confidence measures, researchers must be careful when making inferences from research findings.
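For readers unfamiliar with the correction for attenuation mentioned above, its standard (Spearman) form estimates what the confidence-performance correlation would be if both measures were perfectly reliable:

    r'_cp = r_cp / sqrt(r_cc × r_pp)

where r_cp is the observed correlation between confidence and performance, and r_cc and r_pp are the reliability estimates of the confidence and performance measures, respectively. This is offered only as a clarification of the general procedure being referred to, under the assumption that reliability estimates are available for both measures; it is not a report of the study's own computation.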

Another factor is data analysis. Calibration is often investigated by aggregating individual confidence judgments in order to understand a general tendency of an individual to be realistic, overconfident, or underconfident. An advantage of aggregation is that it lessens measurement error; however, complex combinations of overestimation and underestimation require care when aggregating. Furthermore, in the present study, calibration was measured only at the level of interindividual differences, rather than intraindividual differences: correlation coefficients measure the calibration of learners as a whole or as a specific group, and the MANOVA results suggest group differences, not differences between individual learners. Intra-calibration of an individual learner across tasks and occasions warrants future quantitative and qualitative research. Moreover, in regard to calibration formulas, the present study employed only a simple linear model. Because each measure contains measurement error, researchers such as Allwood (1994), Björkman (1994), and Juslin, Winman, and Persson (1995) have developed more substantive formulas. Björkman (1994), for example, proposed the following calculation for calibration:

C = D² + R² + L

where D = (c̄ − p̄), R = (s_c − s_p), and L = 2 s_c s_p (1 − r_cp). Here c̄ refers to mean confidence, p̄ to mean test performance, s_c to the standard deviation of the confidence scores, s_p to the standard deviation of the performance scores, and r_cp to the correlation between confidence and performance. C lies between 0 and 1. The first component, D² (bias), is the square of the standard measure of over/underconfidence: a positive c̄ − p̄ indicates overconfidence, a negative c̄ − p̄ indicates underconfidence, and perfect calibration requires that c̄ − p̄ = 0. The second component, R², measures accuracy or resolution as set by the criterion of calibration. According to Björkman (1994), perfect calibration requires that the variance in confidence is matched by an appropriate variance in correct performance (i.e., R = 0). In addition, an estimate of the error variance of the confidence assessments (Murphy resolution, s_e²) can be computed as s_e² = s_c² − s_p² (Björkman, 1994). Finally, L (linearity) measures the degree to which systematic or nonsystematic deviations from linearity in the calibration curve contribute to poor calibration; perfect calibration requires a linear calibration curve (L = 0). In summary, we cannot be 100% confident that findings of perfect calibration or miscalibration are not in part an artifact of the methods employed. Because no measure is precise, it is our responsibility to control error arising from our measures and to be critical in data analysis and interpretation.
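The decomposition can be computed directly from item-level data. The following Python sketch is offered only as an illustration of the formula above, with hypothetical data and illustrative names; it is not the computation script used in the study.

    import numpy as np

    def bjorkman_calibration(confidence, performance):
        """Compute the components of Bjorkman's (1994) linear calibration model.

        confidence:  per-item confidence ratings on a 0-1 scale (e.g., 0.75 for 75%)
        performance: per-item scores, 1 for a correct answer and 0 for an incorrect one
        """
        c = np.asarray(confidence, dtype=float)
        p = np.asarray(performance, dtype=float)
        D = c.mean() - p.mean()                  # bias: positive = overconfidence
        R = c.std() - p.std()                    # resolution: mismatch in spread
        r_cp = np.corrcoef(c, p)[0, 1]           # confidence-performance correlation
        L = 2 * c.std() * p.std() * (1 - r_cp)   # departure from a linear calibration curve
        return {"D": D, "R": R, "L": L, "C": D**2 + R**2 + L}

    # Hypothetical record of ten items answered by one test taker
    print(bjorkman_calibration(
        confidence=[1.00, 0.75, 0.75, 0.50, 1.00, 0.25, 0.75, 0.50, 1.00, 0.75],
        performance=[1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
    ))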

Other factors that may be of interest are the hard-easy effect (discussed previously) and a possible influence of culture on confidence expression. Understanding the effects of culture is not easy. Research by Yates, Lee, and Shinotsuka (1996) and Yates, Lee, Shinotsuka, Patalano, and Sieck (1998) clearly shows that Asian participants (e.g., Korean, Japanese, and Chinese) tended to be more overconfident than Westerners (e.g., Americans). The nature of confidence held by Asians contradicts the assumption in cross-cultural stereotype research (e.g., Bond & Cheung, 1983) that personal modesty is common in Asian countries. The overconfidence findings in the present study add to this literature. A study that gathers calibration data from contrasting cultural groups would be of wide interest to LT, SLA, and skill-based researchers. Perhaps the matter of under- or overconfidence is not always explained by culture. Essential issues regarding culture that need special attention are the extent to which confidence expression is free from cultural influences and how error derived from cultural factors should be treated (i.e., as systematic or random error).

Further Research Directions

Apart from the suggestions for further research pointed out above, the following are specific relevant research areas that are of importance for calibration research.

Language Testing Research

Because language tests are used for various purposes, it is important to further investigate calibration in various test contexts. Language test contexts often involve at least one of the following problems: ill-structured problems; uncertain or dynamic environments; shifting, ill-defined, or competing goals; time stress; and high-stakes decisions. Unfamiliarity with certain test formats can have a tremendous effect on confidence in performance. In some high-stakes test situations, in which test results have serious impacts on test takers' futures, good calibration is essential, and in such contexts test takers will react differently than they would in low-stakes situations. To understand calibration comprehensively, this line of research must be expanded to other test formats that assess various language skills, because each language skill is unique and its assessment criteria are unique as well. Regardless of test format, future research needs to determine whether it is the nature of the test or the use of the test to make decisions that affects calibration. There is evidence in the literature that individuals may experience more difficulty in framing rational responses to a task within a tense or highly charged setting (see Bruce & Johnson, 1992), thereby inhibiting calibration, and Tversky and Kahneman (1992) provide evidence that people tend to be underconfident in complex cognitive tasks. Confidence in a high-stakes test situation may thus differ greatly from that in a nontest, low-stakes setting. Therefore, it is instructive to explore in greater detail how particular characteristics of the testing environment, which create the context of decision processes, influence the accuracy of confidence (see Ronis & Yates, 1987). In addition, future research should focus on the extent to which calibration in a language test is the same as that in nontest language use. This area is significant for substantive inferences or claims about actual language ability inferred from test scores.

Note that in a typical official test setting it is essential to consider the consequences of asking test takers to rate single-case confidence, as in the present study. Requiring test takers to perform this secondary task during the test may interfere with the primary task of completing the test. The nature of the high-risk, test-taking situation raises ethical concerns, because providing answers to research questions may impede performance (see Shohamy, 2001, for further discussion). Relative-frequency confidence, however, can avoid this problem. Given that single-case confidence is central to calibration study, a test simulation (as test validation research) may be conducted instead of a real official test. Another area of LT research to look at is calibration within a computer adaptive testing (CAT) context (see, e.g., Sawaki, 2001). The inability to preview whole texts and tasks and to revisit previous work will affect the nature of confidence and calibration in CAT. For example, confidence might suffer with a well-designed CAT because the test displays items at a difficulty level at which the test taker has roughly a 50% chance of success, so test takers are likely to find the questions quite hard. Hence, they may feel they are doing badly in the test when in fact they may be doing very well, as the test simply gives them items at their ability level. In particular, construct-irrelevant features generated by CAT that negatively affect confidence must be studied and eliminated for efficient CAT. The comparability of test takers' calibration across various test formats warrants future research.

L2 Classroom Research

As mentioned at the beginning of this paper, in today’s educational system learners are unavoidably forced to make high-stakes decisions. They thus need to have an ability to accurately approximate their likely success in decisions that affect their performance. Generally, the findings which imply that learners lack a capacity to sufficiently determine their own language performance have raised an important consideration in the use of L2 learners’ self-assessment to represent their L2 ability. Although we do not have enough empirical evidence that this ability develops automatically with language proficiency and age—and further research is certainly needed—some practical advice for L2 classroom language teaching/training from the present findings can be offered. In an L2 learning context, for example, overconfident learners would believe that their knowledge in a specific language domain is already very good, and they would be unmotivated to attempt to improve it. Their overconfidence may derive in part from the tendency to neglect contradictory evidence. Hence, their calibration may be improved by making such evidence more explicit. Underconfident language learners would likely spend too much time on language features that they should not have difficulty acquiring, or already had acquired, and therefore fail to move forward to learn new language features. Their underconfidence may be explained by the lack of the ability to access, generate, or use performance feedback to assist their decisions to move on.

Given all this, it is important that educational programs equip learners with the ability for lifelong learning. It is well known that explicit L2 learning instruction assists the acquisition of the target language (Ellis, 2005). However, there is little evidence that current practice in L2 instruction helps learners acquire realistic confidence judgment skills. If L2 learners do develop their language ability along a continuum of conscious incompetence, to conscious competence, to unconscious competence, realistic judgment skills need to be acquired early in language learning. Therefore, integration of metacognitive training or instruction of monitoring and assessing performance accuracy will be of value to L2 learners. Although the model of cognitive processing and confidence generation as presented in Figure 1 is directed at a multiple-choice task, it can be accommodated in or adapted to suit instruction. To develop such instruction, we need to consider the characteristics of the language syllabus design, teaching methodology, and materials and assessment methods. Confidence is not necessarily divided as 0%, 25%, 50%, 75%, and 100%, but must be adjusted depending on the nature of language tasks. It may be effective for each learner to have a record of calibration graphs as reminders of their calibrative development.


Furthermore, learners at different proficiency levels who vary in terms of personal characteristics, such as gender, learning styles, cultural beliefs, and motivation, would need different kinds of metacognitive guidance. Metacognitive instruction of confidence assessment may vary according to language skills because the nature of cognitive processing, feedback, and task demands can be different.

Some specific features of metacognitive training can be prespecified. Metacognitive training must be accompanied by language tasks, motivation for performance accuracy, and at least three kinds of feedback. Performance feedback involves providing information about the accuracy of one's judgments in general, such as whether the learner is overconfident, underconfident, or calibrated. Outcome feedback involves providing information about whether a particular performance is correct. Environmental feedback involves providing information about the sorts of tasks or the nature of the specific language features to learn or accomplish (e.g., lessons or learning objectives; note that the positive and negative effects of feedback are extensively discussed in Butler & Winne, 1995, and Hom & Ciaramitaro, 2001). A primary role of feedback in a language classroom should be to improve the quality of learners' confidence. The calibration diagram used throughout this paper can be used to accommodate such training. Because it is important for learners to know why they are calibrated or miscalibrated, and for us to understand the reasons for their calibration or miscalibration, it is essential that learners be prompted to reflect on their thinking about their confidence. Examples of such lesson instructions are:

• Give all the possible reasons that you can find favoring and/or opposing each of the answers. Such reasons may include facts that you know, things that you vaguely remember, assumptions that make you believe that one answer is likely to be correct or incorrect, gut feelings, associations, and the like.

• Write down in the space provided one reason that supports your decision. Please write the best reason you can think of that either speaks for or provides evidence for the alternative/content you have chosen, or speaks against or points against the alternative/content you rejected.

With this kind of practice, learners may be able to learn to accurately assess how well they perform or learn a language task. Note that individual learners must work hard not only at recruiting and weighing evidence but also at developing contradicting reasons; such practice would help reduce cognitive biases. Metacognitive training in confidence monitoring can provide learners with extensive experience that assists the accuracy of their confidence. Because cue validity closely approximates ecological validity when learners have repeated experiences with an environment, such metacognitive training would enhance their calibration. Classroom research on the effects of metacognitive training, or of instruction in monitoring and assessing confidence judgments as part of formal language instruction, is essential. Evidence of such effects can be drawn from observations of the evolution of confidence realism from the beginning to the end of the instruction (e.g., based on calibration diagrams); evidence of calibrative consistency over time is also vital. In addition, an experimental design that compares language performance between classes with and without a focused attention on confidence rating is a way to convince educators, policy makers, and stakeholders of the effectiveness of such instruction. If L2 learners' judgment processes can be improved via training or explicit instruction in formal education, this would have obvious applied benefits.
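As an illustration of the kind of calibration record referred to above, the short Python sketch below groups a learner's item-level confidence ratings by confidence level and reports the observed proportion correct at each level, i.e., the data points behind a calibration diagram. The data and names are hypothetical; this is a sketch of the general idea, not a tool from the study.

    import numpy as np

    def calibration_points(confidence, correct, levels=(0.0, 0.25, 0.50, 0.75, 1.00)):
        """For each confidence level used on the answer sheet, report how many items
        were rated at that level and what proportion of them were actually correct."""
        confidence = np.asarray(confidence, dtype=float)
        correct = np.asarray(correct, dtype=float)
        points = []
        for level in levels:
            mask = confidence == level
            if mask.any():
                points.append((level, int(mask.sum()), correct[mask].mean()))
        return points  # (confidence level, number of items, proportion correct)

    # Hypothetical learner record from one practice test
    for level, n, prop in calibration_points(
            confidence=[1.00, 0.75, 0.50, 0.75, 1.00, 0.25, 0.50, 1.00, 0.75, 0.50],
            correct=[1, 1, 0, 0, 1, 0, 1, 0, 1, 0]):
        print(f"confidence {level:.0%}: {n} items, {prop:.0%} correct")

Plotting proportion correct against confidence level, with the identity line as the reference, gives the calibration diagram: points below the line indicate overconfidence at that level, and points above it indicate underconfidence.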


Conclusion

It has been argued throughout this paper that in many real-life situations we need to be as realistic as possible about our performance. Realism about performance signifies calibration, which refers to the perfect match between confidence and the accuracy of actual performance. Being realistic, or calibrated, is a fundamental part of our information processing mechanism because we need to be able to approximate the likelihood of our performance success (that is, whether our performance has satisfied the task demands). Of particular importance here is probabilistic confidence judgment. Probabilistic confidence is a form of self-assessment that is arguably an essential part of human strategic processing. The function of this mental operation is related to conditional knowledge of why certain strategic actions are needed, thereby facilitating upcoming performance. In this processing, if perceived confidence in a given task is low, task difficulty is encountered; to resolve it, the intention or awareness to gather all possible resources together results in strategy use which, if suited to the task, yields better performance. I have presented a psychological human information processing model of the local and probabilistic mental models that explains the relationships between performance and confidence in performance. This framework provides an opportunity to observe the evolution of confidence judgment over time, which can then serve to corroborate both cross-sectional and longitudinal findings and both intraindividual and interindividual differences. It is important to note that the model (see Figure 1) captures only the salient features of how confidence may be generated and how it corresponds to the external world. The proposed model should be subjected to further empirical validation, and until then it can only be regarded as preliminary.

In the present study, calibration was investigated in a language test context because, as in many other real-life tasks, a test context entails conditions that call for good calibration. The present study has shown that the nature of calibration is complex and needs to be investigated from various angles (e.g., proficiency levels, gender, and task characteristics). Although the study was limited because some factors that might influence confidence ratings were excluded, such as affect (e.g., motivation, volition, and test anxiety) and cognitive style, it is evident from the various analyses that test takers at different proficiency levels and of different genders generally exhibited poor calibration. Generally, these participants were overconfident in their test performance. Overconfidence about the correctness of performance is a robust phenomenon that signals a failure to adjust response criteria to changes in task difficulty. In this regard, the fact that the success of a task event was predicted did not guarantee actual success, owing to the large number of factors involved. Overconfidence indicates an inability to link information relevant to the assigned tasks; that is, these learners did not sample the task environment and, as a result, failed to use it as ecological evidence to enhance their performance and to make realistic confidence judgments. This might explain why some test takers tended to overestimate the success of their test performance. Therefore, calibration, or realistic confidence in performance, must be considered significant and critical to language learning, use, and success.

To conclude, in terms of the causal sources of poor calibration, two broad accounts have been discerned and combined: a cognitive bias account (i.e., an inappropriate manner in which individuals process and evaluate information) and a methodological account (i.e., how confidence is assessed, how language performance is assessed, measurement error, and the analyses of calibration). This study has added to our understanding of the specific conditions and variables that can influence L2 individuals' calibration of language performance. The findings of the present study, though based on a cross-sectional data-gathering method only, supported the proposed theoretical framework in terms of how confidence might be generated depending upon the kind of knowledge or ability needed for the given language task. Much remains to be done to empirically support, adjust, or modify the present theoretical and methodological framework in a way that allows the study of human calibration to extend to other L2 contexts, such as language learning situations, real-life (natural) language use, and other kinds of language assessment. Overconfidence and underconfidence may be common among human beings, but they may not be universal. Finally, although calibration alone is insufficient for successful L2 performance, learning, or acquisition, because it does not necessarily ensure informativeness, it is an essential and desirable property, especially for the purpose of communication. It may well be that calibrative competence has been a missing link in the theory of communicative language ability or proficiency; individual differences in calibration as a subarea of SLA have not been acknowledged and blended into the theory. It is hoped that more attention will be paid to this area of study so that we can begin to compare findings and realistically generalize about and evaluate the nature of calibration in order to fine-tune the theory of communicative language ability.

References

Adams, R. J., & Khoo, S-K. (1996). Quest (Version 2.1) [Computer software]. Victoria, Australia: Australian Council for Educational Research.
Alderson, J. C. (2000). Assessing reading. Cambridge, UK: Cambridge University Press.
Allwood, C. M. (1994). Confidence in own and other's knowledge. Scandinavian Journal of Psychology, 35, 198–211.
Anderson, N. (2005). L2 learning strategies. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 757–771). Mahwah, NJ: Lawrence Erlbaum.
Ashton, R. H. (1992). Effects of justification and a mechanical aid on judgment performance. Organizational Behavior and Human Decision Processes, 52, 292–306.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17, 1–42.
Bachman, L. F., & Palmer, A. S. (1989). The construct validation of self-ratings of communicative language ability. Language Testing, 6, 14–20.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford, UK: Oxford University Press.
Baranski, J. V., & Petrusic, W. M. (1994). The calibration and resolution of confidence in perceptual judgments. Perception and Psychophysics, 55, 412–428.
Björkman, M. (1992). Knowledge, calibration, and resolution: A linear model. Organizational Behavior and Human Decision Processes, 51, 1–21.
Björkman, M. (1994). Internal cue theory: Calibration and resolution of confidence in general knowledge. Organizational Behavior and Human Decision Processes, 58, 386–405.
Blanche, P., & Merino, B. (1989). Self-assessment of foreign language skills: Implications for teachers and researchers. Language Learning, 39, 313–340.
Bond, M. H., & Cheung, T. S. (1983). The spontaneous self-concept of college students in Hong Kong, Japan, and the United States. Journal of Cross-Cultural Psychology, 14, 153–171.
Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall.
Buck, G. (2001). Assessing listening. Cambridge, UK: Cambridge University Press.
Butler, D. L., & Winne, P. H. (1995). Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65, 245–281.
Chavez, M. (2001). Gender in the language classroom. Boston: McGraw Hill.
Clark, J. L. D. (1981). Language. In T. S. Barrows, S. F. Klein, & J. L. D. Clark (Eds.), College students' knowledge and beliefs: A survey of global understanding (pp. 25–35). New Rochelle, NY: Change Magazine Press.
Cohen, A. D. (1998). Strategies in learning and using a second language. London: Longman.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Coombe, C. (1992). The relationship between self-assessment ratings of functional skills and basic English skills results in adult refugee ESL learners. Unpublished doctoral dissertation, Ohio State University, Columbus.
Davidson, F., & Lynch, B. K. (2001). Testcraft. New Haven, CT: Yale University Press.
Edwards, W. (1967). Statistical methods (2nd ed.). New York: Holt, Rinehart and Winston.
Ellis, R. (2005). Instructed language learning and task-based teaching. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 713–728). Mahwah, NJ: Lawrence Erlbaum.
Ferrell, W. R. (1994). Discrete subjective probabilities and decision analysis: Elicitation, calibration and combination. In G. Wright & P. Ayton (Eds.), Subjective probability (pp. 410–451). Chichester, UK: Wiley.
Gagné, E. D., Yekovich, C. W., & Yekovich, F. R. (1993). The cognitive psychology of school learning. New York: HarperCollins College Publishers.
Gigerenzer, G., Hoffrage, U., & Kleinbölting, H. (1991). Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review, 98, 506–528.
Glenberg, A. M., & Epstein, W. (1985). Calibration of comprehension. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11, 702–718.
Glenberg, A. M., & Epstein, W. (1987). Inexpert calibration of comprehension. Memory and Cognition, 15, 84–93.
Glenberg, A. M., Sanocki, T., Epstein, W., & Morris, C. (1987). Enhancing calibration of comprehension. Journal of Experimental Psychology: General, 116, 119–136.
Griffin, D., & Tversky, A. (1992). The weighting of evidence and the determinants of confidence. Cognitive Psychology, 24, 411–435.
Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. Boston, MA: Heinle & Heinle.
Hom, H. L., & Ciaramitaro, M. (2001). GTIDHNIHS: I knew-it-all-along. Applied Cognitive Psychology, 15, 493–507.
Johnson, J. E., & Bruce, A. C. (2001). Calibration of subjective probability judgments in a naturalistic setting. Organizational Behavior and Human Decision Processes, 85, 265–290.
Juslin, P. (1994). The overconfidence phenomenon as a consequence of informal experimenter-guided selection of almanac items. Organizational Behavior and Human Decision Processes, 57, 226–246.
Juslin, P., Winman, A., & Persson, T. (1995). Can overconfidence be used as an indicator of reconstructive rather than retrieval processes? Cognition, 54, 99–130.
Keren, G. (1991). Calibration and probability judgments: Conceptual and methodological issues. Acta Psychologica, 77, 217–273.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge, UK: Cambridge University Press.
Kleitman, S., & Stankov, L. (2001). Ecological and person-oriented aspects of metacognitive processing in test taking. Applied Cognitive Psychology, 15, 321–341.
LeBlanc, R., & Painchaud, G. (1985). Self-assessment as a second language placement instrument. TESOL Quarterly, 19, 673–687.
Liberman, V., & Tversky, A. (1993). On the evaluation of probability judgments: Calibration, resolution, and monotonicity. Psychological Bulletin, 114, 162–173.
Lichtenstein, S., & Fischhoff, B. (1977). Do those who know also know more about how much they know? Organizational Behavior and Human Performance, 20, 159–183.
Lynch, B. K. (2003). Language assessment and program evaluation. Edinburgh, Scotland: Edinburgh University Press.
McNamara, T. (1996). Measuring second language performance. London: Longman.
Moritz, C. (1995). Self-assessment of foreign language proficiency: A critical analysis of issues and a study of cognitive orientations of French learners. Unpublished doctoral dissertation, Cornell University, Ithaca, NY.
Oscarson, M. (1978). Approaches to self-assessment in foreign language learning. Strasbourg, France: Council for Cultural Co-operation.
Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education, Volume 7: Language testing and assessment (pp. 175–187). Dordrecht, Netherlands: Kluwer Academic Publishers.
Oxford, R. L. (2003). Language learning styles and strategies: Concepts and relationships. IRAL, 41, 271–278.
Peirce, B. N., Swain, M., & Hart, D. (1993). Self-assessment, French immersion, and locus of control. Applied Linguistics, 14, 25–42.
Phakiti, A. (2003). A closer look at gender and strategy use in L2 reading. Language Learning, 53, 649–702.
Phillips, L. D. (1987). On the adequacy of judgmental forecasts. In G. Wright & P. Ayton (Eds.), Judgmental forecasting (pp. 11–30). Chichester, UK: Wiley.
Purpura, J. E. (1999). Learner strategy use and performance on language tests: A structural equation modeling approach. Cambridge, UK: Cambridge University Press.
Purpura, J. E. (2004). Assessing grammar. Cambridge, UK: Cambridge University Press.
Read, J. (2000). Assessing vocabulary. Cambridge, UK: Cambridge University Press.
Ronis, D. L., & Yates, J. F. (1987). Components of probability judgment accuracy: Individual consistency and effects of subject matter and assessment method. Organizational Behavior and Human Decision Processes, 40, 193–218.
Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of experiential factors. Language Testing, 15(1), 1–20.
Sawaki, Y. (2001). Comparability of conventional and computerized tests of reading in a second language. Language Learning & Technology, 5, 38–59.
Schneider, S. L. (1995). Item difficulty, discrimination, and the confidence-frequency effect in a categorical judgment task. Organizational Behavior and Human Decision Processes, 61, 148–167.
Schraw, G. (1994). The effect of metacognitive knowledge on local and global monitoring. Contemporary Educational Psychology, 19, 143–154.
Shohamy, E. (2001). The power of tests: A critical perspective on the uses of language tests. Harlow, UK: Longman.
Shrauger, J. S., & Osberg, T. M. (1981). The relative accuracy of self-predictions and judgments by others in psychological assessment. Psychological Bulletin, 90, 322–351.
Smith, K., & Baldauf, R. B. (1982). The concurrent validity of self-rating with interviewer rating on the Australian Second Language Proficiency Scale. Educational and Psychological Measurement, 42, 1117–1124.
Stankov, L., & Crawford, J. D. (1996). Confidence judgments in studies of individual differences. Personality and Individual Differences, 21, 971–986.
Stone, N. J. (2000). Exploring the relationship between calibration and self-regulated learning. Educational Psychology Review, 12, 437–475.
Strong-Krause, D. (1997, March). How effective is self-assessment for ESL placement? Paper presented at the annual meeting of Teachers of English to Speakers of Other Languages, Orlando, FL.
Suantak, L., Bolger, F., & Ferrell, W. R. (1996). The hard-easy effect in subjective probability calibration. Organizational Behavior and Human Decision Processes, 67, 201–221.
Tversky, A., & Kahneman, D. (1992). Judgment under uncertainty: Heuristics and biases. In T. O. Nelson (Ed.), Metacognition: Core readings (pp. 379–392). Boston: Allyn and Bacon.
Wilson, K. M., & Landsay, R. (1996). Validity of global self-ratings of ESL speaking proficiency based on an FSI/ILR-referenced scale: An empirical assessment (ETS Research Report RR-99-13). Princeton, NJ: Educational Testing Service.
Yates, J. F., Lee, J., & Shinotsuka, H. (1996). Beliefs about overconfidence, including its cross-national variation. Organizational Behavior and Human Decision Processes, 65, 138–147.
Yates, J. F., Lee, J., Shinotsuka, H., Patalano, A. L., & Sieck, W. R. (1998). Cross-cultural variations in probability judgment accuracy: Beyond general knowledge overconfidence? Organizational Behavior and Human Decision Processes, 74, 89–117.
Zimmerman, B. J. (1994). Dimensions of academic self-regulation: A conceptual framework for education. In D. H. Schunk & B. J. Zimmerman (Eds.), Self-regulation of learning and performance: Issues and educational applications (pp. 3–21). Hillsdale, NJ: Lawrence Erlbaum Associates.


Appendix A

Sample Answer Sheet

A. Directions: Answer the test questions and immediately after each of your answers, provide your confidence in the correctness of your answer.

Part 1
No.      Answer: a  b  c        Confidence (%): 0  33  66  100
Ex. I    o  o  o                o  o  o  o
Ex. II   o  o  o  o
1.       o  o  o                o  o  o  o
2.       o  o  o                o  o  o  o
3.       o  o  o                o  o  o  o
4.       o  o  o                o  o  o  o
5.       o  o  o                o  o  o  o
6.       o  o  o                o  o  o  o
7.       o  o  o                o  o  o  o
8.       o  o  o                o  o  o  o
9.       o  o  o                o  o  o  o
10.      o  o  o                o  o  o  o

Parts 2–4
No.      Answer: a  b  c  d     Confidence (%): 0  25  50  75  100
Ex. III  o  o  o  o             o  o  o  o  o
Ex. IV   o  o  o  o             o  o  o  o  o
Ex. V    o  o  o  o             o  o  o  o  o
21.      o  o  o  o             o  o  o  o  o
22.      o  o  o  o             o  o  o  o  o
23.      o  o  o  o             o  o  o  o  o
24.      o  o  o  o             o  o  o  o  o
25.      o  o  o  o             o  o  o  o  o
26.      o  o  o  o             o  o  o  o  o
27.      o  o  o  o             o  o  o  o  o
28.      o  o  o  o             o  o  o  o  o
29.      o  o  o  o             o  o  o  o  o
30.      o  o  o  o             o  o  o  o  o

B. Directions: It is common practice for people to evaluate their achievement after a test. Having completed this test, indicate your confidence in your overall test achievement as a percentage. Your confidence can range from 0% to 100%.

I think the number of the correct answers is (out of 100): __________________


A Validation Study of the ECCE NNS and NS Examiners’ Conversational Styles from a Discourse Analytic Perspective

Yang Lu

University of Reading

This study explores the conversational styles of the Examination for the Certificate of Competency in English (ECCE) native and nonnative speaker (NS and NNS) examiners when responding to candidates' replies and when eliciting questions and justifications, and the effect of these styles on the assessment of test takers' oral proficiency. A discourse analytic approach following the systematic and functional tradition was implemented to analyze twenty audiotaped ECCE speaking test events. The findings show that certain conversational styles, such as informing, commenting, back-channeling, and interrupting, act as non-eliciting discourse elements that do not facilitate the elicitation of sufficient oral samples from candidates. Differences in the amount and types of discourse features produced by the NS and NNS examiners suggest variability caused by examiners' linguistic and cultural backgrounds.

Face-to-face and multitask oral proficiency tests have been widely implemented in EFL oral assessment. The interviewers, as they have been called in the traditional and still prevalent form of the oral proficiency interview (OPI), now play not only the role of examiner, who conducts the test and asks questions, but also the role of interlocutor, who interacts with the examinees to assist them in completing the tasks. Consequently, the spoken discourse resulting from these exchanges has also changed from examiner-dominated discourse to discourse of a co-constructive nature, jointly shaped and developed by both examiners and test takers.

One possible source of construct-irrelevant variance in speaking tests is examiners' misrepresentation of the test developer's intended constructs due to personal, and sometimes unconscious, discoursal styles during spoken interaction with test takers. Some speaking-test developers have used structured face-to-face oral tests and interlocutor frameworks to minimize this negative effect on the validity of their speaking tests. Nevertheless, it has been shown that even preformulated and scripted speaking tests have to accommodate examiners' deviations from the interlocutor framework through conscious or subconscious individual styles that change the way in which test takers are examined (Lazaraton, 1996a; O'Loughlin, 1997; O'Sullivan & Lu, 2004). This issue becomes particularly crucial when a speaking test recruits both NS and NNS examiners, who may bring their own linguistic and cultural backgrounds to their double role of assessor and interlocutor.

The ECCE Speaking Test is a structured but not scripted face-to-face oral examination targeted at the Independent User level on the Common European Framework Scale. The three tasks in the test are designed to represent different interaction patterns and discourse styles in order to assess candidates' competence to convey and elicit information and to support their decisions. The examiners have to take not only a dominant role as interviewer in Task 1 but also a passive role as information provider in Task 2 to enable the test taker to make a decision. Though Task 3 returns the role of initiator to the examiner, the role is not as overbearing as it is in Task 1, as the goal in Task 3 is to encourage candidates to elaborate on the reasons for their decisions.

Ideally, if the examiners faithfully carry out the three tasks and ensure that their own conversational styles and discourse behaviors do not alter the conditions under which the candidates perform and are examined, the test can successfully assess what it is designed to assess, thereby functioning as a valid and reliable test. However, is that the case? Apart from factors brought by the test takers themselves that may affect performance, have the ECCE examiners, native or nonnative, strictly followed the instructions in the ECCE Oral Examiner's Manual and conducted the test consistently, so that the candidates can be assessed equally? Furthermore, have they ensured that the test takers have supplied them with sufficient samples of spoken language to allow them to make fair judgments? If they have not succeeded in doing so, is it their divergent discourse features or conversational styles that have prevented the test takers from performing to the best of their ability? Have the NS and NNS examiners varied in this respect? These are the concerns and inquiries investigated in this study.

Background

The reliability and construct validity of oral assessment have been thorny issues for the language testing community. On the one hand, as Luoma (2004) summarizes, quantitative approaches such as correlation coefficients and the standard error of measurement (SEM) have been widely applied by testing boards to improve the estimates of test scores, so that stakeholders' confidence in the test can be maintained. On the other hand, as Lazaraton (2002) observes, only in the last decade have process-based or discourse-based studies of oral language assessment been attempted to examine the nature of the speech event and its quality in relation to the validity and reliability of oral assessment.

This qualitative and empirical approach was first called for by Van Lier (1989), who urged investigation of the "turn-by-turn sequential interaction" (p. 497) so that the practice of designing OPI procedures and rating scales could be evaluated. The means to meet this need, as Fulcher (1987) had remarked earlier, could be found in discourse analysis, a then-new approach to construct validation by which the construct can be empirically tested. Notably, systematically transcribed speaking tests have since been used to scrutinize spoken discourse using, according to He and Young (1998), mainly three approaches: Conversation Analysis (CA), the Ethnography of Speaking and Speech Acts, and Gricean Pragmatics.

Extensive discourse studies have investigated the interviewer's and the interviewee's behavior in OPIs with regard to test validity, task effects, and the effect of the interlocutor on candidates' ratings. The first question researchers asked was whether interviewers conversed in OPIs in ways similar to natural conversation. Young and Milanovic's (1992) study was one of the first to investigate this question. It analyzed features of dominance, contingency, and goal orientation, as well as contextual factors in the data, and suggested that the discourse was highly asymmetrical, which constrained both the interviewers and the test takers in terms of what they could contribute to the oral interaction. These styles remained stable over time even in structured and scripted oral proficiency tests (Brown & Lumley, 1997; Lazaraton, 1992, 1996a; Reed & Halleck, 1997).


On the subject of the impact of interviewers' discourse styles on examinees' ratings, Ross and Berwick (1992) investigated whether the interviewer's control and accommodation in OPIs affected ratings and the degree of such effects. Their finding was that test takers' ratings could be predicted from the amount and types of accommodation that interviewers had to make. Subsequent studies expanded the scope of investigation from the interviewer's discourse to that of the interviewee as well. Brown and Hill (1998) analyzed the co-constructed discourse in the IELTS Speaking Test, based on the results of a FACETS analysis, in terms of the "easy" or "difficult" interviewer. They found that the easiest interlocutor shifted topics more frequently, asked simpler questions, and engaged in more question-answer exchanges, while the most difficult interlocutor challenged candidates more and acted more like a natural conversation partner. In a subsequent study, Brown (2003) applied CA to examine the impact of two different interviewers' discoursal styles—"teacherly" and "casual" (p. 17)—on the same candidate's performance. Raters were employed to comment on the candidate's oral production resulting from the two interviews. The results show that the test taker was judged an effective communicator when taking the test with the teacherly interlocutor, who, among other things, developed and extended topics skillfully. With the casual interviewer, who used more nontest conversational eliciting strategies, the candidate was judged unforthcoming and uncooperative in communication.

As a result, to minimize stable but unpredictable individual interviewer styles, test developers became interested in the application of an interlocutor frame to guide examiners and constrain them from changing the ways in which test takers are assessed. A series of studies by Cambridge ESOL (Lazaraton, 1996a, 1996b; Lazaraton & Saville, 1994) on the effect of interlocutor frames (or test scripts) has shown that deviation from interlocutor frames is frequent. The results of FACETS analyses in these studies indicate that this problem affects the reliability of the ratings. Since oral examiners who also have to act as interlocutors cannot be considered a neutral factor, a choice between the face validity and the reliability of the OPI procedure has to be made.

Along the same lines, O'Sullivan and Lu (2004) analyzed 30 seconds of pre- and post-deviation oral production by examinees in 62 audiotaped IELTS Speaking Test events. They identified the four most frequent deviations from the interlocutor frame: paraphrasing questions, interrupting with questions, asking improvised questions, and commenting after test takers' replies. They found a task factor in the frequency of deviation. However, because deviations were not frequent in the data, there were no systematic changes between the pre- and post-deviation spoken samples in accuracy, complexity, or fluency, except for expanding in the discourse, one of the three specific features of prolonging. Therefore, O'Sullivan and Lu suggested that the interlocutor frame can be flexible with deviations, such as paraphrasing questions, if the nature of the question is abstract or cognitively challenging.

With regard to the differences between NNS and NS examiner discoursal performance and their impact on ratings and on test takers' discoursal performance, there have been few studies. The difference in rater harshness between the two groups seems to have been the most interesting area for previous research (Brown, 1995; Fayer & Krasinski, 1987; Sheorey, 1986; Van Meale, 1994, as quoted in Reed & Cohen, 2001). To date, the present study can only draw valuable insights from the study by Berwick and Ross (1996), which analyzed the discourse of six Japanese as a second language (JSL) interviews and six English as a second language (ESL) interviews conducted by two trained male examiners. The Japanese JSL examiner and the American ESL examiner varied systematically in their approaches to the spoken discourse with the test takers. Statistical analyses based on the 12 interviews revealed that the JSL examiner offered significantly more accommodation, such as display questions, overarticulation, and lexical simplification, and exercised more control in terms of topic shift. In contrast, the ESL examiner responded more to the content and gave the test takers more chances to elaborate on the topics. Berwick and Ross therefore suggested that there is "a degree of cultural/pragmatic relativity" (p. 48) in the OPI procedure and called for further research with larger amounts of data to verify this phenomenon.

Though discourse-based study is a recently established direction for research in oral assessment, discoursal performance has been a subject for a substantial amount of investigation. Carroll claimed in 1980 that the expert speaker can initiate, expand, and develop a theme, while modest speakers lack flexibility and initiative, and marginal speakers rarely take initiative and maintain dialogue in a rather passive manner. How and to what extent examinees of different levels of oral proficiency perform distinctively has also been especially important to testing organizations that are concerned with the inclusiveness and efficiency of their rating scales (see Hughes, 1989; Weir & Bygate, 1990). Later studies have suggested that when performing the same language elicitation task, higher level test takers may be more likely to produce more complicated discourse features such as initiating, elaborating, supporting, challenging, speculating, and developing topics than low- or lower-level candidates (Hasselgren, 1997; Lazaraton, 2002; Lazaraton & Wagner, 1996; Shohamy, 1994; Young, 1995), which parallels the findings of spoken discourse analysis of learner’s speech by the systematic and functional approach (Hoey, 1991; McCarthy & Carter, 1994).

To conclude, studies of how examiners acting as both assessor and interlocutor initiate and manage the discourse in face-to-face oral proficiency tests have led to the implementation of structured or scripted face-to-face oral tests, sometimes with an interlocutor frame. It seems that these frames have constrained examiners but have not totally succeeded in preventing them from using their individual discourse styles, culture-specific or not. Further research has suggested that interlocutor frames can be made flexible to allow space for examiners to adjust to examinees of different ability levels and cognitive maturity. Elaborating, taking initiative, and so forth in spoken discourse have been recognized as salient features or indicators of high-level oral communicative language ability, and some tests have been trying to differentiate these features in order to assess them accurately and fairly. However, how interlocutor examiners respond after test takers' replies, and the questions they use to elicit demonstrations of discourse competence, have not been fully investigated. Furthermore, to validate Berwick and Ross's (1996) proposal of a "cultural/pragmatic relativity" in oral assessment, further research is needed with more data from live EFL speaking tests with several types of tasks, conducted by more than one NS or NNS examiner, to investigate this validity issue.

Aims of the Study

In a structured but unscripted direct oral exam—the ECCE Speaking Test—discourse analysis was carried out to examine the oral interaction between the examiners and examinees

1. to see if there are overall differences between the amount and types of the eliciting and non-eliciting moves in discourse produced by the NNS and NS examiners;

2. to identify the non-eliciting specific discourse features in the examiners’ follow-up moves that do not encourage the examinees to elaborate or prolong their replies, decisions, or choices;

3. to identify the non-eliciting specific discourse features that do not encourage initiation from the examinees to seek information; and

4. to find out if there are differences between the NNS and NS examiners in the amount and types of non-eliciting discourse features that do not encourage the examinees’ elaboration and initiative.

Methodology

ECCE Speaking Test

The speaking section is an integral part of the Examination for the Certificate of Competency in English (ECCE), produced by the English Language Institute, University of Michigan (ELI-UM). Its purpose is to assess the candidates’ basic operational competence in giving and asking for information, and justifying decisions and choices, and so forth. According to the ECCE Oral Examiner’s Manual (English Language Institute, 2004), the ability to elaborate and to take initiative are salient features of discoursal performance.

In Task 1, the competence of conveying nonsensitive personal information is assessed, as examiners are required to use a variety of questions (open and closed) to elicit speech from candidates. Task 2 is for eliciting initiations in order to assess the ability to ask for information to make a decision or give a suggestion based on a prompt that presents a situation and the candidate’s task. Pictures or photographs are provided to illustrate the task. Task 3 continues the topic in Task 2, and examiners are instructed to encourage examinees to elaborate the reasons for their decision or suggestion. To obtain more oral samples so that ratings can be as accurate as possible, examiners are also provided elaboration questions to prolong the spoken interaction.

Though the examiners are not provided with scripts, the ELI-UM gives fairly detailed guidelines for conducting the three tasks and specific Dos and Don'ts for how to behave and speak in the oral interaction (English Language Institute, 2004). These guidelines and instructions will be presented in the Discourse Analysis (DA) section because of their importance to the DA approach of this study. The examiner's manual also offers "Descriptors of Salient Features" and a section that explains the indicators for the salient features. A checklist for decision-making on the five criteria (fluidity of delivery, elaboration and initiative, vocabulary, grammar, and intelligibility) is also given to guide the examiners. The manual explains that fluidity of delivery, elaboration and initiative, and vocabulary have proved to be the best indicators to distinguish levels on the test. Overall ratings are Competent Speaker, Moderately Competent Speaker, Marginal Speaker, and Limited Speaker. A candidate rated Limited Speaker fails the speaking test.

Data

Twenty ECCE live audiotaped speaking tests administered in May and June 2004 were provided by the ELI-UM. Nine were administered by NNS examiners and 11 by NS examiners. Because one examiner from each group failed to tape Task 1, the data consist of 18 recorded Task 1s for analysis. The analysis of Tasks 2 and 3 by one of the NS examiners is
not included because the examiner's repeated efforts to enable the candidate to understand the tasks failed. As a result, the data comprise 19 recorded Task 2s and 3s for analysis. The NNS examiners are numbered NNS1 to NNS9, and the NS examiners NS01 to NS11. The ELI-UM provided information about the examiners regarding nationality, age, native language, other languages, length of time as an ECCE oral examiner, and the training they had received. The examinees' ratings were also provided, which show that 13 of the 21 were given the rating of Competent Speaker, one was judged a Marginal Speaker, and the rest were Moderately Competent Speakers.

Because Tasks 2 and 3 of the test are based on one of the eight prompts provided by the ELI-UM, I requested tests that used the same prompt so that content would not be a source of irrelevant variation in the examiners' and test takers' spoken discourse. As a result, tests that used the same prompt, which required the candidates to make a decision, support that decision, and elaborate their reasons, were provided for the study.

Transcripts

The live tests were transcribed orthographically, as long as the transcription could reflect the discourse sequence and consequences of the spoken interaction between the examiner and candidate. Therefore, length of pausing, stressed syllables, loudness of speech, and overlapping were not transcribed. The following are the speech features depicted in the transcripts and the conventions employed when necessary:

1. Filled pauses are transcribed as "er" or "um."
2. A question mark is put at the end of a statement with a rising tone that functions as a question in the discourse.
3. A comma after a word or phrase shows an unfilled pause with either rising or falling tone.
4. A full stop is put at the end of a completed sentence with a pause.
5. A circumflex ^ after "okay," "yeah," or "yes" shows a rising tone, while a backslash \ denotes a falling tone.
6. One x expresses one syllable of an untranscribable word.
7. Xs are substituted for language other than English, depending on the number of syllables heard.
8. Nonverbal discourse features such as laughing are put in parentheses.

Discourse Analysis

A task-specific model of the systematic and functional approach, developed for investigating the discourse of the Oral Proficiency Test (OPT), which usually consists of several tasks and assigns the examiner the role of interlocutor (Lu, 2003), was employed for the research. The underlying principles for developing this DA model differ from those of CA, as stated by Lazaraton (2002), in two respects: (1) CA insists on unmotivated looking rather than prestated research questions, while the task-oriented approach is prescriptive and has a specified framework for tasks that elicit different discourse patterns; and (2) CA insists on employing the "turn" as the unit of analysis, while the task-oriented approach takes the "exchange" as the unit of analysis to reflect the chaining together of functions.

In contrast, the DA approach is prescriptive and selective by nature because it targets the initiating and sustaining discourse features used by the test takers as indicators for
communicative language ability. The unit of analysis in this approach is the Topic Exchange and its subordinate levels, Move and Act, as proposed by Sinclair and Coulthard (1975) and modified by Burton (1981), Coulthard and Montgomery (1981), Francis and Hunston (1992), and Hoey (1991). The approach also borrows categories of moves from Eggins and Slade (1997) and refers to studies by Hoey (1991) and McCarthy and Carter (1994) for differences between native-speaker discourse and learner discourse. Specifically, the DA for Task 1 and part of Task 3 examines the extended structure of a topic exchange, [I (R/I) R (Fⁿ)], which shows that a topic exchange can be longer than the basic Initiation-Response-Feedback structure: an Initiation, possibly a Response treated as an Initiation, then a Response, then possibly one or more Follow-up moves. But, because the approach is task-specific, the chaining of adjacency pairs with the second move treated as an Initiation is also examined for Tasks 2 and 3.
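
To make the unit of analysis concrete, the following minimal sketch shows one way the exchange structure described above could be represented for tagging purposes. It is an illustration only, not the instrument used in this study, and the class and label names (Move, Exchange, and so on) are hypothetical. Representing each turn as a Move with an analysis slot mirrors the tagging described later, where eliciting and non-eliciting functions are assigned only after the candidate's next turn has been inspected.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Hypothetical labels for the move types in the [I (R/I) R (F^n)] structure.
    INITIATION = "Initiation"
    RESPONSE_AS_INITIATION = "Response/Initiation"
    RESPONSE = "Response"
    FOLLOW_UP = "Follow-up"

    @dataclass
    class Move:
        speaker: str                     # "Examiner" or "Candidate"
        move_type: str                   # one of the labels above
        text: str                        # the transcribed utterance
        analysis: Optional[str] = None   # e.g., "eliciting: engaging" or "non-eliciting: commenting"

    @dataclass
    class Exchange:
        task: int                        # 1, 2, or 3
        moves: List[Move] = field(default_factory=list)

    # A Task 1 topic exchange longer than the basic I-R-F pattern (based on Example 1 below):
    exchange = Exchange(task=1, moves=[
        Move("Examiner", INITIATION, "So you put information about the cartoon on the website?"),
        Move("Candidate", RESPONSE, "Yes, I have the whole series on our computer."),
        Move("Examiner", FOLLOW_UP, "Uh-huh", analysis="eliciting: engaging"),
        Move("Candidate", RESPONSE, "like er, 80 cds.", analysis="prolonging: elaborating"),
        Move("Examiner", FOLLOW_UP, "Wow. It's a lot of cds.", analysis="non-eliciting: commenting"),
    ])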

As is shown in Figure 1, the task-specific DA approach integrates three fundamental factors in oral assessment: task-specific discourse features, expected discourse features, and targeted features indicating high or higher-level proficiency. Models based on the basic principle for analyzing interactive and monologic discourse are established and applied to live video- or audiotaped speaking test events. Because ECCE speaking tests do not have a monologue task, the interactive model by Lu (2003) is presented in Figure 2.

Figure 1. Elements Integrated in DA Models for Analyzing OPT Discourse.

As Lu (2005) explains, the model in Figure 2 is an overall model for analyzing an exchange in the interactive discourse of OPTs. Specific models are formulated based on the overall model, depending on task type, discourse construct, and the oral output expected from the test takers. Therefore, a specific model for analyzing a particular task is sometimes based on only part of the overall model. For example, a specific model for analyzing test-taker responses in an interview discourse, as in Task 1 of the ECCE Speaking Test, will only adopt Prolonging
under Sustaining to see if examinees have responded to show their oral proficiency level is high.

Figure 2. Overall Model for Analyzing Interactive Discourse in OPT. [The model divides an exchange in the test taker's interactive spoken discourse into Initiating moves (Informing, Asking Questions) and Sustaining moves (Prolonging, Appending, Supporting, Confronting).]

Applied to the present study, two essential modifications were made for developing specific models to suit the research objectives: (1) Although test takers’ discourse is not the focus of DA, it is the indicator for the effect of the examiner’s discourse and regarded as the starting point for analyzing the examiner’s previous and subsequent turn. Therefore, if an examiner is expected to elicit prolonged speech in a follow-up move after a candidate’s answer to her/his question in Task 1, the candidate’s turn after the follow-up move will be looked at, rather than the examiner’s, before a decision is made as to the examiner’s discourse function in terms of being eliciting or non-eliciting. (2) The general guidelines and the Dos and Don’ts that represent the testing organization’s requirements and expectations of the examiners in order to implement valid tests are integrated in the models for analyzing the three tasks. Therefore, discourse behavior such as interrupting, correcting mistakes, and so forth are regarded as non-eliciting discourse features, since they divert from the expectations of the ELI-UM.

Based on these two modifications, the task-specific DA models for analyzing the spoken discourse in the ECCE were developed, with the general guidelines and the Dos and Don'ts from the examiner's manual related to each task incorporated. Examples from the data and explanations are given when necessary.

Task 1

For this task, ECCE Speaking Test examiners are instructed to elicit talk and longer responses from the candidates. Specific Dos and Don’ts related to this task are:

a. elicit longer responses by asking questions that establish context followed by requests for more specific information;
b. follow up on the examinees' replies;
c. foster coherence and continuity by using content provided by the candidates; and
d. respond naturally to what the examinees say by using utterances such as "Uh-huh," "Yes," and "Oh, I see."

As a result, DA for this task is to first find the divisions of the topic exchanges, then to locate the follow-up moves made by examiners after candidates' replies that elicit or do not elicit elaboration of the topic by candidates. The locating process will carry on until a new topic is raised by the examiners. The specific DA model for Task 1 is shown below in Table 1.

Obviously, the question-answer adjacency pair in spoken discourse is excluded for this study, as there is no third turn to initiate more talk or longer responses from the candidates. Furthermore, the topic exchange may terminate at Turn 4, and the analysis will conclude accordingly. Therefore, Turns 5 and 6 are not compulsory parts of an exchange for analysis. However, there may be more than six turns in an exchange dealing with the same topic, and these are consequently included in the analysis. Example 1 illustrates how DA is conducted for analyzing oral interaction in Task 1.

Table 1. DA Model for Analyzing Examiner's Eliciting or Non-Eliciting Move in Task 1

Turn | Speaker | Discourse Feature by Examiner and Test Taker | Discourse Analysis
1 | Examiner | Opening: eliciting | Initiating a topic exchange.
2 | Candidate | Responding: replying | Identified.
3 | Examiner | Follow-up move | Eliciting or non-eliciting candidate's prolonging move? Specific discourse features are analyzed.
4 | Candidate | Responding | Prolonging or not prolonging? (providing evidence for Turn 3)
5 | Examiner | Same as in Turn 3 | Same as for Turn 3.
6 | Candidate | Same as in Turn 4 | Same as for Turn 4, but providing evidence for Turn 5.

Example 1:

Examiner: Uh-huh, very interesting, so you, you put like information about the cartoon on the website?
Candidate: I, yes, I I I have got er, the series the whole series on our computer, download it, and put it in the cds. [reply]
Examiner: Uh-huh [eliciting follow-up move: engaging]
Candidate: like er, 80 cds. [prolonging: elaborating]
Examiner: wow. It's a lot of cds. (Laugh) [non-eliciting follow-up move: commenting]
Candidate: yeah [no prolonging]

From NS03 & TT03

Task 2

For Task 2, the test developer requires the examiner to change roles, from the interviewer to a comparatively passive information supplier about the pictures or situation in the prompt. They are reminded that the examinees should take an active role in this task. The model for analyzing Task 2 is presented in Table 2.

Table 2. DA Model for Examiner's Eliciting or Non-Eliciting Responding Move in Task 2

Turn | Speaker | Discourse Feature by Examiner and Test Taker | Discourse Analysis
1 | Candidate | Opening: eliciting | Identifying a topic exchange.
2 | Examiner | Responding: answering | Treated as initiation? Eliciting or non-eliciting the next question? Specific discourse features are analyzed.
3 | Candidate | Opening: eliciting | Identified.
4 | Examiner | Responding: answering | Same as in Turn 2.
5 | Candidate | Opening: eliciting | Identified.
6 | Examiner | Responding: answering | Same as for Turn 4.

Although the table demonstrates the chaining of only three question-answer adjacency pairs, the analysis should end when all the question cues provided to the candidates have been asked, with, if necessary, extra questions. Example 2 illustrates how DA is conducted for analyzing oral interaction in Task 2.

Example 2:

Candidate: where they live? [eliciting]
Examiner: well, the leopards are living in Africa. And the pandas, they're in china [eliciting responding move: unelaborated answer]
Candidate: how many are left? [eliciting]
Examiner: there's about 20000 leopards left today, and the pandas, there's only like 1000 pandas left. So there're not very many [eliciting responding move: unelaborated answer]
Candidate: how many can we save this year? [eliciting]

From NS03 & TT03

Task 3

The ECCE Oral Examiner's Manual states that the expected language functions from the candidates for Task 3 are to express a choice, preference, or opinion and support it. As a result, the examiners are instructed to encourage the candidate to elaborate the reasons for the decision, choice, and so forth, and also to encourage the candidate to discuss why something was not chosen (termed non-choice in this study).

The discourse pattern by which examiners complete this task is shown in Table 3. The eliciting or non-eliciting moves in the framework for elaboration of the candidate's choice or non-choice are put together in one turn because of their similarity in nature and function. In reality, they occur independently in separate exchanges, as shown in Example 3. It is evident that the follow-up move by the examiner in this task is ideally treated as an initiation to trigger the next elaboration on choice or non-choice.

Table 3. DA Model for Analyzing Examiner's Eliciting or Non-Eliciting Moves for Elaboration of Choice or Non-Choice in Task 3

Turn | Speaker | Discourse Feature by Examiner and Test Taker | Discourse Analysis
1 | Examiner | Opening | Identifying a topic exchange.
2 | Candidate | Responding | Elaborating or not elaborating choice or non-choice? Providing evidence for Turn 1.
3 | Examiner | Follow-up move | Treated as Initiation? Eliciting or non-eliciting elaboration of choice or non-choice? Specific discourse features are analyzed.

Example 3:

Candidate: …… but I this case I chose pandas because its more difficult to save.
Examiner: yeah [eliciting move for elaboration of choice]
Candidate: and you have to do er something um quickly [elaboration of choice]
Examiner: uh-huh, uh-huh [eliciting move for elaboration of choice]
Candidate: because it's more, its, its um low, er high, low, low and its more low to er to improve their lives, so er [elaboration of choice]
Examiner: okay. That's okay so let's help pandas. Okay? [non-eliciting elaboration of non-choice]
Candidate: uh-huh [no elaboration]
Examiner: thank you very much [non-eliciting elaboration of non-choice]

From NNS8 & TT8

Because there are also elaboration questions in the prompts provided by the ELI-UM to elicit more spoken samples from candidates in order to assist rating, part of Task 3 will have the same discourse pattern found in Task 1, where a question-answer sequence exists. This is to say that the eliciting and non-eliciting discourse features used by the examiners in Task 1 should also be present here for analysis. As a result, the specific model for examining this part of the task is the same as for Task 1.

Tagging the Transcripts

Auto Text in Word was employed for the DA of the data, and the tagging process was repeated twice to ensure consistency of the analysis. In the initial tagging, the eliciting or non-eliciting discourse features were identified, and specific moves under the two broad categories were roughly tagged as they appeared in the transcripts. In the second tagging, the focus was on the specific moves of the non-eliciting discourse features; the classification was completed and an exact analysis was given. In the final tagging, the previous analysis was checked and some specific moves with very low occurrences were abandoned.

Counting the Occurrences

An Excel file was used for the frequency counts. First, the analyzed discourse features were tallied by type, in each task, for each examiner involved in the study. Then, totals and averages of each type of eliciting and non-eliciting feature were calculated for the NNS Group and for the NS Group. Finally, comparisons of the totals and averages were made for the two groups.
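
The tallying just described amounts to a simple grouped frequency count. The sketch below illustrates the arithmetic with invented tag data; the function and variable names are hypothetical, and the counts reported in this study were produced in Excel rather than in code.

    from collections import Counter
    from typing import Dict, List, Tuple

    # Each tagged move: (examiner_id, group, task, feature), e.g.
    # ("NNS1", "NNS", 1, "non-eliciting: back-channeling").
    Tag = Tuple[str, str, int, str]

    def tally(tags: List[Tag]) -> None:
        """Print per-group totals and per-examiner averages for each feature in each task."""
        counts: Dict[Tuple[str, int, str], Counter] = {}
        examiners: Dict[str, set] = {"NNS": set(), "NS": set()}
        for examiner, group, task, feature in tags:
            examiners[group].add(examiner)
            counts.setdefault((group, task, feature), Counter())[examiner] += 1
        for (group, task, feature), per_examiner in sorted(counts.items()):
            total = sum(per_examiner.values())
            # Average over the examiners of that group present in the data.
            average = total / len(examiners[group])
            print(f"Task {task} | {group} | {feature}: total={total}, average={average:.2f}")

    # Illustrative (invented) tags for two examiners:
    tally([
        ("NNS1", "NNS", 1, "non-eliciting: back-channeling"),
        ("NNS1", "NNS", 1, "non-eliciting: marker"),
        ("NS03", "NS", 1, "eliciting: engaging"),
    ])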

Results

Non-Eliciting Discourse Features Employed by Both NS and NNS Examiners

DA resulted in the following list in answer to two of the objectives of the research: (1) to identify the specific discourse features in the examiners’ follow-up moves that do not elicit the examinees’ elaboration or prolonging of replies and decisions, choices, and so forth; and (2) to identify the specific discourse features that do not elicit initiation from the examinees to seek information.

A brief explanation for each of the listed features is given in reference to the specific requirements and conditions of the tasks. Examples from the transcripts are also provided when necessary to illustrate. The underlined utterances or turns are the examples that illustrate the discourse features in discussion.

Agreeing: follow-up moves that show the examiner sharing the same opinion, idea, feeling, and so forth as that given by the candidate. They differ from utterances such as "Yes," "True," "That's right," etc., in that they express the agreement explicitly without functioning to engage the candidate in the spoken conversation (see Example 4 below).

Example 4:

Examiner: are, are, are there any pandas in Brazil?
Candidate: I don't think so.
Examiner: I don't think there are, either
Candidate: uh.

From NS03 & TT03

Answering questions: this does not refer to what the examiners are required to do in Task 2 (answering the candidates' questions in order to impart the information requested by the examinee). Such moves are included only in the analysis of Task 3, where the examiners are usually in the role of asking questions and eliciting expanded speech from the test takers.

Asking questions: employed by examiners while administering Task 2 when they are supposed to answer questions raised by the candidates. They can be seen as diverting from the assigned role of the examiners.

Back-channeling: follow-up moves which are repetitions of whole or part of candidates' turns with a falling intonation, as shown by Example 5. This discoursal behavior is an echoing of what the candidate has just said in reply to the examiner's initiation of a new topic.

Example 5:

Candidate: I'm a marketing researcher
Examiner: marketing researcher
Candidate: yes

From NS10 & TT010

Challenging: as Burton (1981) defines it, "challenging moves function to hold up the progress of that topic or topic-introduction in some way" (p. 71). It occurs only in Task 3, when the examiners are responding to candidates' justification of their choice. It can be realized by a statement or question (see Example 6).

Example 6:

Candidate: because er, they seem to be very cute, but I think that in this moment the leopards needs more help than pandas, because 20000 are left
Examiner: yeah.
Candidate: er
Examiner: 20000 left, pandas only 1000
Candidate: yeah.

From NNS1 & TT1

Changing topic: a task-specific discourse feature which is an opening move that initiates another topic exchange in spoken discourse. It is realized by a statement or a question. Because in Task 3 the examiners are instructed to encourage the test takers to tell why something was not chosen, this discourse feature is singled out and analyzed specifically for this study. It can also be regarded as misrepresenting the construct of the task in that examiners should not start asking the elaboration questions before they have tackled the reason or justification for not choosing something. Example 7 illustrates how the examiners employed the discourse feature and consequently terminated a necessary phase in administering Task 3.

Example 7:

Candidate: so I think we're going to help the pandas.
Examiner: wow, that's some good decision, um although both are in need but your organization can only help one kind, right? Um, tell me about this, have you ever seen leopards or pandas in real life?

From NNS2 & TT2

Clarification requests: follow-up moves that indicate non-understanding or lack of comprehension after candidates' replies or answers. They are usually realized by either questions or repetitions, with a rising tone, of the part not understood or of the whole of the previous turn made by a candidate (see Example 8).

Example 8:

Candidate: …… I think er, people should er, think about it and er, er make er lots of er effort to help them, or so on to an organization.
Examiner: you mean people, er common people ^
Candidate: yes, of course.

From NNS3 & TT3

Commenting: follow-up moves, statements, or tag questions made by examiners to elaborate, expand, justify, evaluate, and so on, in responding to candidates' replies to their initiations. Since the nature of the spoken interaction in a speaking test is different from that in a natural conversation, the candidates tend to be more sensitive to what the examiners comment on regarding what they say. The discourse analysis in this study includes words and phrases such as "good," "interesting," "nice," and so on. Examples 9 and 10 exemplify such cases.

Example 9:

Candidate: its, its, my my father can, can pay, it's a very, very good school, so I try to, to use all I can use there you know, because I don't feel that in school, to see my father paying what he's paying, I just go there to, go there go
Examiner: good, er, I'm sure your father is very happy to hear that.
Candidate: (laugh)

From NS04 & TT04

Example 10:

Examiner: are you working?
Candidate: er, no, I am a mother, I'm married and a mother of two children.
Examiner: that's nice.
Candidate: yeah, I know.

From NNS5 & TT5

Concluding: follow-up moves of statements or questions marked with "so" or "then" at the beginning that function to summarize what has been talked about between the examiner and test taker on a topic. Example 11 shows this discourse feature, which usually demands a response from the candidate.

Example 11:

Examiner: but how exciting you got to visit London, where else did you say?
Candidate: France, just one week, Italy, one week, and last year I was a au pair for a year in New York, it was wonderful.
Examiner: so you had good experiences.
Candidate: yea, I think.

From NS06 & TT06

Confirmation requests: different from clarification requests in that, as a follow-up move, they demand affirmation of what the examiner has understood but is not certain of. They are usually realized by "yes," "no," or a word or phrase with a high-key rising intonation (see Example 12).

Example 12:

Examiner: okay, have you ever seen pandas before?
Candidate: no, no
Examiner: yes?
Candidate: never, no.

From NNS4 & TT4

Correcting mistakes: follow-up moves in which the examiners correct grammatical or lexical mistakes instead of carrying on with the normal flow of the spoken interaction, which is one of the Don'ts given by the ELI-UM.

Engaging: after candidates’ replies to the initial elicitation and in the position of follow-up move, they are realized by “Uh-huh,” “yeah,” or “okay” with a mid-key rising tone to acknowledge or show attention to what has been said by the candidates without interrupting or stopping their utterances.

Exclamations: another discourse feature to indicate acceptance and interest in the test taker’s talk, but realized by utterances such as “wow,” “ah,” “oh,” “really?,” or laughter, which show surprise, amazement, disbelief, amusement, etc.

Informing: follow-up move of a statement made solely to provide information new to candidates (see Example 13).

Example 13:

Examiner: good, um, there just some general questions about animals, um, do you enjoy going to the st. pauval zoo? Have you been there?
Candidate: yes, I have already been there er, twice with my daughters of course.
Examiner: I just went last week.
Candidate: yes.

From NS01 & TT01

Interrupting: this discourse behavior terminates candidates’ responses and replies to the examiners’ initiation. It is one of the diverting discourse behaviors that the ELI-UM advises the examiners not to do.

Marker: follow-up move that is realized by "okay," "right," "alright," etc., at the beginning of a turn with a falling intonation. The consequence of employing such a discourse feature, intentionally or unintentionally, is usually the termination of a topic exchange, as illustrated in Example 14.

Example 14:

Examiner: yes, OK, and how do you believe that English er, has changed your life and work?
Candidate: yeah, it's very important, I think the English is very important, you know, because er, you know businesses needs English, it's the er business language
Examiner: okay\
Candidate: and er, I don't know

From NNS5 & TT5

Supplying elaborated answers: another task-specific discourse feature that acts as a responding move in Task 2 after the candidate's initiation for information needed to make a decision. An answer is regarded as elaborated if more than the required information is provided, which results in a turn that does not elicit the examinee's next initiation when one is still needed (see Example 15).

Example 15:

Candidate: um, which is more easy to protect, protect?
Examiner: well, okay, the, the leopards there are 20000 in Africa, and the hunters are killing them for their furs. The pandas are living in forests their habitat is being invaded because people are cutting down the trees, right, they eat bamboo.
Candidate: yes.

From NS07 & TT07

Supplying unelaborated answers: in contrast to supplying elaborated answers, examiners are considered to be complying with the guidelines and representing the construct designed in Task 2 when they provide the needed information only (see Example 16).

Example 16:

Examiner: um, hun, we would, er which is cheaper to help? To protect, to help to protect.
Candidate: okay, you have enough money to protect 200 leopards or 10 pandas.

From NS07 & TT07

Supplying vocabulary: similar by nature to the diverting discourse behavior of correcting mistakes, this unit of discourse analysis interrupts candidates' turns and causes disruption in the spoken discourse. As a follow-up move it is usually a word or phrase unknown to the test taker, as shown by Example 17.

Example 17:

Examiner: er, apart from these leopards and pandas, have you ever heard about other animals that are recently in danger? Think about other
Candidate: yes,
Examiner: species.
Candidate: yes, the wise people want to kill, wise, wise
Examiner: the whales, yeah.
Candidate: the whales, yes.

From NS07 & TT07

Comparing Overall Results between the NS and NNS Examiner Groups

This section reports the results regarding: (1) whether the eliciting and non-eliciting discoursal features by the ECCE Speaking Test examiners are the same or different for the NNS and NS examiners, and (2) whether the amount and types of discourse features that do not elicit the examinees’ elaboration and initiative by the NNS and NS examiners are the same or different. First, a comparison of the amount of eliciting and non-eliciting discourse features by both the NNS and NS examiners is presented in Table 4 to show the examiners’ overall discoursal performance in conducting the speaking test.

Table 4 shows that in general the ECCE Speaking Test examiners elicited significantly more in the discourse for Tasks 1 and 2 as compared to Task 3, when they were supposed to encourage the examinees to elaborate the reason for their choice and non-choice. As a whole,
the examiners produced considerably more eliciting and non-eliciting features in their follow-up moves with respect to promoting elaborated replies in Tasks 1 and 3.

The table also reveals that the NS examiners produced more eliciting moves, particularly in the cases of initiating elaboration on replies in Tasks 1 and 3 and getting examinees to ask questions in Task 2, while the NNS examiners produced more non-eliciting moves. However, the NNS examiners’ discoursal performance in Task 3, when trying to encourage elaboration on reasons for choice or non-choice, showed no significant difference from those of the NS group. Table 4. Eliciting and Non-Eliciting Discourse Features in Tasks 1, 2, & 3 by NNS and NS Examiners

Task | Discourse Feature | Total (n) | NNS (n) | NS (n)
1 | Eliciting Follow-up Move | 166 | 79 | 87
1 | Non-Eliciting Follow-up Move | 88 | 60 | 28
2 | Eliciting Responding Move | 88 | 35 | 53
2 | Non-Eliciting Responding Move | 38 | 22 | 16
3 | Eliciting Moves for Elaboration of Choice | 26 | 12 | 14
3 | Non-Eliciting Moves for Elaboration of Choice | 13 | 6 | 7
3 | Eliciting Moves for Elaboration of Non-Choice | 7 | 3 | 4
3 | Non-Eliciting Moves for Elaboration of Non-Choice | 23 | 11 | 12
3 | Eliciting Follow-up Move | 106 | 44 | 62
3 | Non-Eliciting Follow-up Move | 50 | 32 | 18

Table 5 provides the top three types of eliciting moves in each task and the number of occurrences produced by the NS and NNS examiners. As shown in the table, there is not much difference between the NNS and NS examiners for the amount and types of specific discourse features by which they elicited elaboration and initiative from the candidates. Differences such as the NS examiners' tendency to use commenting in their follow-up moves after test takers' replies and confirming to elicit questions in Task 2 seem to suggest some characteristics in the NS examiners' discourse. The results also give evidence that engaging and exclamation as follow-up moves are most effective in eliciting elaboration.

Table 5. Eliciting Follow-up Moves by NNS and NS Examiners in Task 1

Task | Type of Discourse Feature | NNS (n) | NS (n)
1 | Eliciting Follow-up Moves: | |
 | • Engaging | 50 | 55
 | • Exclamation | 13 | 15
 | • Commenting | 3 | 9
 | • Back-channeling | 3 | 2
 | • Acknowledging | 3 | 0
 | • Informing | 3 | 0
2 | Eliciting Responding Moves: | |
 | • Unelaborated answer | 22 | 26
 | • Confirming | 6 | 13
 | • Elaborated answer | 4 | 1
 | • Non-informing answer | 0 | 3
3 | Eliciting Moves for Elaboration of Choice: | |
 | • Engaging | 8 | 4
 | • Asking question | 4 | 5
 | • Confirmation request | 0 | 2
3 | Eliciting Moves for Elaboration of Non-Choice: | |
 | • Challenging | 1 | 2
 | • Asking question | 1 | 1
 | • Marker | 0 | 1
 | • Prompt | 1 | 0
3 | Eliciting Follow-up Moves: | |
 | • Engaging | 30 | 37
 | • Exclamation | 5 | 6
 | • Agreeing | 2 | 0
 | • Commenting | 1 | 4

Comparing the Non-Eliciting Discourse Features by NNS and NS Examiners

Differences between the amount and types of the non-eliciting discourse features produced in each task by NNS and NS examiners are presented in this section. These differences are shown by the numbers and percentages of all occurrences of each type.

Table 6 shows that the NNS examiners' non-eliciting follow-up moves are more than double those produced by the NS examiners. The two groups also differed in the ways they discouraged the test takers from prolonging their replies and elaborating on the topic being dealt with. The NNS examiners are more likely to do so by back-channeling, requesting confirmation, informing, and marking boundaries in the discourse, while the NS examiners tend to do so by commenting on and concluding the candidates' replies.

Table 6. Non-Eliciting Follow-up Moves by NNS and NS Examiners in Task 1

Types of Non-Eliciting Follow-up Moves: Task 1 | NNS (no.) | NS (no.) | NNS (%) | NS (%)
Agreeing | 1 | 0 | 1.67 | --
Back-channeling | 8 | 1 | 13.33 | 3.57
Challenging | 3 | 1 | 5.00 | 3.57
Clarification request | 1 | 2 | 1.67 | 7.14
Commenting | 6 | 4 | 10.00 | 14.29
Concluding | 6 | 4 | 10.00 | 14.29
Confirmation request | 6 | 1 | 10.00 | 3.57
Correcting mistake | 0 | 1 | -- | 3.57
Engaging | 1 | 2 | 1.67 | 7.14
Exclamation | 6 | 4 | 10.00 | 14.29
Informing | 9 | 3 | 15.00 | 10.71
Interrupting | 4 | 2 | 6.67 | 7.14
Marker | 7 | 3 | 11.67 | 10.71
Supplying vocabulary | 2 | 0 | 3.33 | --
Totals/Average per examiner | 60/7.5 | 28/2.8 | |
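
As a reading aid for Tables 6 through 10, the short sketch below recomputes two cells of Table 6 to show how the percentage and average-per-examiner columns are derived; the variable names are mine, and the denominator of 8 reflects the nine NNS examiners minus the one whose Task 1 was not recorded.

    # Recompute two cells of Table 6 (Task 1 non-eliciting follow-up moves, NNS group).
    nns_back_channeling = 8   # occurrences of back-channeling by NNS examiners
    nns_total = 60            # all non-eliciting follow-up moves by the NNS group in Task 1
    nns_examiners = 8         # NNS examiners with a recorded Task 1

    print(round(nns_back_channeling / nns_total * 100, 2))  # 13.33, the "% of all occurrences" cell
    print(round(nns_total / nns_examiners, 2))              # 7.5, the "average per examiner" cell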

Table 7 shows that, though the examiners were engaged in a discourse context similar to that of Task 1, especially when they asked the elaboration questions, some of the non-eliciting follow-up moves they made in Task 1, such as back-channeling, challenging, confirmation requests, engaging, and interrupting, did not take place in Task 3. Also, the NNS examiners were eliciting less in this part of Task 3, and they also employed more types of discourse features in doing so. They shared discoursal behaviors such as concluding, informing, and commenting with the native speakers, but they again came up with more managing discourse features, such as markers, confirmation requests, and supplying vocabulary, while the NS examiners still tended to comment, conclude, and agree with the examinees.

Table 7. Non-Eliciting Follow-up Moves by NS and NNS Examiners in Task 3

Types of Non-Eliciting Follow-up Moves: Task 3 | NNS (no.) | NS (no.) | NNS (%) | NS (%)
Agreeing | 0 | 3 | -- | 16.67
Answering question | 3 | 0 | 9.38 | --
Clarification request | 6 | 0 | 18.75 | --
Commenting | 2 | 2 | 6.25 | 11.11
Concluding | 4 | 5 | 12.50 | 27.78
Correcting mistake | 1 | 0 | 3.13 | --
Elaborated answer | 2 | 0 | 6.25 | --
Exclamation | 1 | 1 | 3.13 | 5.56
Informing | 5 | 5 | 15.63 | 27.78
Marker | 2 | 0 | 6.25 | --
Supplying vocabulary | 4 | 1 | 12.50 | 5.56
Totals/Average per examiner | 31/3.44 | 18/1.8 | |

It is obvious that, while conducting Task 2 (see Table 8), if the examiners asked questions, gave information without having been asked, or supplied answers with elaboration, the examinees would not have as many opportunities as they could have had to encourage the examiners to provide the information needed to make a decision. The NNS examiners' commenting and informing and the NS examiners' concluding seem to have stopped the test takers from initiating in the discourse. Furthermore, the average number of non-eliciting moves per examiner for the NNS group decreased.

Table 8. Non-Eliciting Responding Moves by NS and NNS Examiners in Task 2

Types of Non-Eliciting Responding Moves: Task 2 | NNS (no.) | NS (no.) | NNS (%) | NS (%)
Asking question | 4 | 4 | 18.18 | 25.00
Clarification request | 2 | 0 | 9.10 | --
Commenting | 4 | 1 | 18.18 | 6.25
Concluding | 0 | 4 | -- | 25.00
Confirmation request | 1 | 0 | 4.55 | --
Engaging | 1 | 1 | 4.55 | 6.25
Elaborated answer | 4 | 5 | 22.73 | 31.25
Informing | 5 | 1 | 21.74 | 6.25
Marker | 1 | 0 | 4.55 | --
Totals/Average per examiner | 22/2.76 | 16/1.6 | |

From the totals and averages per examiner for Task 3 (Table 9), we can see that, for the first time, the NNS examiners were not producing more non-eliciting moves than their NS counterparts. For both groups, concluding seems to have been the move that most effectively stopped candidates from elaborating their reasons for choosing something. Noticeably, it is still the NS group that used commenting to this effect.

Table 9. Non-Eliciting Elaboration of Choice Moves by NS and NNS Examiners in Task 3

Types of Non-Eliciting Moves for Elaboration of Choice: Task 3 | NNS (no.) | NS (no.) | NNS (%) | NS (%)
Changing topic | 1 | 1 | 16.67 | 16.67
Commenting | 0 | 2 | -- | 33.33
Concluding | 3 | 2 | 50.00 | 33.33
Confirmation request | 0 | 1 | -- | 16.67
Engaging | 1 | 0 | 16.67 | --
Informing | 1 | 0 | 16.67 | --
Marker | 0 | 1 | -- | 16.67
Totals/Averages per examiner | 6/0.67 | 7/0.7 | |

Interestingly, the number of non-eliciting moves for elaboration of non-choice increased compared with those for elaboration of choice. Table 10 shows that changing the topic by both groups, and agreeing with the examinees by the NS examiners, are the main causes. The task-specific discourse feature Changing Topic means that the examiner terminated the discourse for eliciting the candidate's justification for not choosing something, limiting the effectiveness of the task. The total number of occurrences of this discourse feature is eight, which implies that eight out of nineteen examinees were not assessed on this test feature.

Table 10. Non-Eliciting Elaboration of Non-Choice Moves by NS and NNS Examiners in Task 3

Types of Non-Eliciting Moves for Elaboration of Non-Choice: Task 3 | NNS (no.) | NS (no.) | NNS (%) | NS (%)
Agreeing | 0 | 6 | -- | 50.00
Back-channeling | 0 | 1 | -- | 8.33
Changing topic | 5 | 3 | 45.45 | 25.00
Challenging | 3 | 0 | 27.27 | --
Concluding | 1 | 0 | 9.10 | --
Commenting | 0 | 1 | -- | 8.33
Interrupting | 2 | 1 | 18.18 | 8.33
Totals/Average per examiner | 12/1.22 | 11/1.1 | |

Discussion

The Effect of Discourse Variation on Oral Examiners' Discoursal Performance

In general, the NNS examiners were less facilitative of the examinees' elaboration when replying to the examiners' initiations and of their initiative in seeking information. The number of eliciting discourse features they used is sometimes half of that used by the NS examiners, while the number of non-eliciting features produced by the NNS examiners is often twice that of their NS counterparts. Nevertheless, this situation changed to different degrees when they were engaged in the different discourse contexts generated by Tasks 2 and 3. There was a decrease in the number of non-eliciting moves by both NNS and NS examiners, especially in Task 3. As a result, there was not a substantial difference between the NNS and NS figures. This may indicate an effect of discourse variation on the examiners' discoursal performance. Examiners tend to discourage the candidates from prolonging their replies and expanding on the topics the examiners initiate, but they did not do this as much when they were eliciting initiative and elaboration on the choice and non-choice.

If the effect of discourse variation is justifiable, the possibility that the NNS examiners are less eliciting in discourse needs to be reconsidered. They may only be unable to elicit elaborated replies successfully. In other discourse patterns where their roles are different, they may be more ready to facilitate elaboration and initiative. Therefore, NNS examiners of oral tests seem to be in need of training or standardization for the question-answer or interview discourse pattern. However, it might also be the consequence of the NNS examiners’ strong sense of goal-orientation in the discourse. They were more focused on the completion of the
tasks, and therefore, tended to shift topics more frequently and prevent examinees from necessary elaboration on the topics.

The Effect of the Diverting Discourse Features by NNS and NS Examiners

A number of the task-specific discourse features included in the specific DA models of this research are considered deviations from the general guidelines and task design by the ELI-UM. They are interrupting, correcting mistakes, supplying vocabulary, asking questions, and supplying elaborated answers in Task 2, answering questions in Task 3, and changing topic to elicit elaboration of non-choice in Task 3. These occurrences again show that training and standardizing examiners is a challenging undertaking to test developers of speaking tests.

The fact that all these diverting features, except elaborated answers, did not appear among the eliciting discourse features shows that they are genuine deviations that have negative effects. The deviations can be seen as frequent (nine elaborated answers in Task 2, nine interrupting moves in Tasks 1 and 3, and six supplying vocabulary moves by the examiners involved), since this research involves 19 examiners. Most importantly, these incidents of deviation are not idiosyncratic. They are scattered across the examiners in the above-mentioned parts of the exam, which may again suggest the unpredictability of examiners' behavior in conducting oral tests. Training and coaching the examiners to follow the test structure, task procedure, and guidelines for discoursal performance are therefore vital before administration, in order to prevent them from misrepresenting the construct and affecting the validity of the test.

The Effect of the NNS Examiners' Management Agenda and the Discoursally More Involved NS Examiners

The results show that, apart from the diverting discourse behaviors, both groups also tended to use certain discourse features that stopped the test takers from elaborating or initiating. However, the most frequent discourse features produced by the two groups vary. In Figure 3, comparisons of the average numbers per examiner of the most frequent discourse features in each group are presented to reveal the tendencies.

The NNS group produced significantly more back-channeling, clarification and confirmation requests (C. requests in Figure 3), informing, and markers. Although these discourse elements, apart from back-channeling, are all originally follow-up moves in an exchange, once spoken they become initiations and demand a response from the other party in the conversation. According to Hoey (1991), they can be regarded as follow-up moves that are treated as initiations and that represent disruption in spoken discourse. Obviously, they terminate the previous topic, which the examinee may have intended to elaborate. For example, to the clarification request "Phone a friend. Do you mean I can phone a friend to help?" the other party has to reply "Yes" or "No" to comply with the rules of natural conversation, which cannot be considered a sufficient language sample for an oral examiner to use to make a rating.

[Figure 3 is a bar chart comparing the average number of occurrences per examiner (scale 0–2.5) of the most frequent non-eliciting moves (agreeing, back-channeling, commenting, concluding, challenging, C. requests, exclamation, informing, and marker) for the NNS and NS groups; diverting features are not included.]

Figure 3. Comparison of the Most Frequent Non-Eliciting Moves by NNS and NS Examiners.

In contrast, the NS group’s most frequent discourse features that function as non-eliciting moves do not necessarily demand a response but may require a follow-up move to acknowledge or accept. This is self-evident because, to someone who has just expressed agreement, we are not obliged to respond. We can express appreciation verbally or use paralinguistic features to complete the episode of the oral communication. The same is also true with commenting and concluding, which the NS group produced comparatively more often than they did the other features. These features seem to have acted as the genuine follow-up moves in response to the examinees’ replies which show interest in what the examinee had said. They suggest a higher degree of involvement in the discourse by the NS examiners, though they operated as non-eliciting to the candidates’ expected discoursal performance. In fact, the non-eliciting moves as shown in Figure 3 all appeared as eliciting moves. However, only the NS examiners’ commenting was one of the top three that effectively initiated test takers’ prolonged speech.

The specific non-eliciting discourse features produced more by the NNS examiners, namely challenging, C. requests, informing, and markers, function not only as initiating moves as discussed earlier, but also operate as what Bygate (1987) describes as “agenda management” (p. 36) speaking skill features. They deal with starting, maintaining, directing, or ending a topic, which correspond to the NNS examiners’ marker (starting or ending), informing (starting or maintaining), challenging (directing or maintaining), and C. requests (directing or maintaining). As a result, it seems that the NNS examiners attempted to take
more control in the oral interaction than the NS examiners. These efforts seem to have partly operated negatively against the expected test taker’s discoursal performance. This finding matches the results by Berwick and Ross (1996).

Conclusion and Implications

This research has applied a discourse analytic approach to investigate the non-eliciting effect of the ECCE oral examiners' discourse behavior on the construct of the test and on the expected examinee discoursal performance. It was found that, on the one hand, the ECCE Speaking Test examiners, regardless of their linguistic and cultural backgrounds, have in general followed the test developer's guidelines and presented the instrument to the test takers in order to assess the targeted discoursal performance: elaboration and initiative in spoken interaction. On the other hand, there were deviations by both NNS and NS examiners from the ELI-UM's requirements and from the task requirements. As a result, Tasks 2 and 3 could have been conducted more effectively to assess the ability to take initiative and support decisions. In comparison, the NNS examiners performed less eliciting behavior and more non-eliciting behavior than the NS examiners. Their non-eliciting discoursal performance varies with the discourse variation borne by the task types. Therefore, their discoursal performance and adherence to the examiners' guidelines are similar to those of their NS counterparts in initiating elaboration of choices, but not in other discourse contexts, such as acting as an information provider or making the follow-up move after examinees' replies to their initiations. Furthermore, there did seem to be a cultural/pragmatic relativity caused by the NNS and NS examiners' preferences for specific discourse features, the effect of which may be non-eliciting. It was noted that the NNS examiners in this study tend to take control and be goal-oriented in the follow-up move, thus depriving the test takers of chances to elaborate, while the NS examiners seemed to be more involved in the oral interaction, paying attention to the content of what was being said by the candidates.

The findings of this research may imply that the institutional nature of face-to-face OPTs with role-based activities cannot be neglected. Though examiners sometimes subconsciously have the tendency of treating the oral interaction as natural conversation, the goals of OPTs determine that the spoken discourse involved is limited in terms of naturalness, interactiveness, and range of discourse behavior that can be appropriate for the intended constructs. This seems to indicate that interlocutor frames could be necessary for standardization of oral examiners’ task and discoursal performance. It is possible that the NNS examiners’ OPT conversational styles are influenced by their first languages and cultures. However, the results indicate that it is the NNS examiners’ inclination to control discourse that had the effect of not initiating the expected examinee discoursal performance.

In summary, this study is a small-scale investigation with randomly selected data that might not be representative of the examiners or the candidature. Therefore, any conclusions and implications drawn from the study have to be considered cautiously, and further study with more data and a wider range of participants will be needed to confirm the results of the present research before generalizations can be made about the effect of examiners' non-eliciting discourse behavior on the reliability and validity of oral assessment.

Acknowledgments

I would like to express my deeply felt gratitude to the Spaan Fellowship Committee for accepting my research proposal and making this study a reality. I would also like to thank Ms. Mary Spaan for her practical guidance on the project, and all the staff involved at the ELI-UM for their efficiency.


An Investigation of Lexical Profiles in Performance on EAP Speaking Tasks

Noriko Iwashita University of Queensland

The present study investigates lexical competence in performance on speaking tests. It examines the extent to which learners preparing for tertiary study in English-speaking countries are able to demonstrate their ability to use a wide range of vocabulary in carrying out academic speaking tasks. Ninety-six task performances, spanning four different tasks of two task types, were drawn from three different proficiency levels. The performances were transcribed and analyzed using the WordSmith Tools program (Scott, 2004). The results showed that while the number of words specific to academic speech differed little with proficiency level, test takers' vocabulary varied according to task and task type. The results of the study have implications for task design in academic speaking tests and for teaching and learning vocabulary in EAP courses.

Each year, many students from various parts of the world apply to begin tertiary study in English-speaking countries. With an increasing number of nonnative speakers of English, the examination of English proficiency has become important for providing tertiary institutions with precise information on students' competence in handling the academic English needed for understanding lectures, participating in class discussions, and writing essays. The ability to use a wide range of vocabulary in an academic setting is extremely important for satisfying the requirements of academic study.

The present study investigates how ESL learners preparing for tertiary study in an English-speaking country demonstrate lexical knowledge in their performance on an English for Academic Purposes (EAP) speaking test. The specific aims of the study are (1) to examine the lexical competence of learners at different proficiency levels using MICASE (the Michigan Corpus of Academic Spoken English), and (2) to examine the impact of EAP task types on learners' demonstration of lexical knowledge. Drawing on quantitative linguistic analysis of MICASE, the study undertook a comprehensive linguistic description of the range of spoken registers used by ESL learners in their academic speaking test performance on two different types of tasks.

Background

The ability to use a wide range of words has been regarded as important for ESL learners pursuing tertiary study in English-speaking countries. Investigation into the use of vocabulary in the performance of academic tasks has been conducted mainly in the contexts of writing and reading. For example, Engber (1995) examined the lexical component as one factor in holistic scoring. Sixty-six essays by nonnative speakers of English from various language backgrounds were holistically scored, and the scores were measured against four lexical richness measures (lexical variation, error-free variation, percentage of lexical error,
and lexical density). The results showed high, significant correlations for lexical variation. Santos (1988) investigated the reactions of academics to essays by 96 students who were native speakers of Korean or Chinese and found that vocabulary errors were regarded as the most serious. Leki and Carson (1994) conducted a survey asking nonnative English-speaking students what they would like to learn in EAP courses and found that vocabulary was identified as the first priority. Although a considerable amount of research has been devoted to the role of vocabulary knowledge in academic writing, little is known about how important it is for learners to possess a wide range of vocabulary in academic speaking.

Academic speech is defined as speech that occurs in academic settings and includes both rehearsed and spontaneous speech (e.g., Lindemann & Mauranen, 2001). In general, research on academic speech has lagged far behind research on academic writing, with most studies of the former devoted to characteristics of academic discourse such as the rhetorical organization of classroom discourse and lectures, or to examining registers frequently observed in academic speech in corpus-based studies (e.g., Lindemann & Mauranen, 2001). Few, however, have investigated the lexical competence demonstrated in test performance, or the extent to which a wide range of vocabulary is required to carry out academic tasks such as discussions and presentations.

Proficiency tests such as the TOEFL and IELTS provide university administrators with information about whether test takers are able to cope with tertiary study in English-speaking countries. In the speaking component of such tests, tasks are designed to simulate situations that test takers are likely to encounter in an academic context. Brown, Iwashita, and McNamara (2005) examined comments on academic speaking performance by expert EAP teachers in the context of scale development and found that EAP teachers made general assessments of test takers' vocabulary skills and commented frequently on the adequacy of their vocabulary for the particular task. Analysis of test-taker discourse has shown lexical knowledge to be one of the most important features distinguishing proficiency levels of examinees (e.g., Iwashita, Brown, McNamara, & O'Hagan, 2003), but most research has examined general vocabulary (e.g., Douglas & Selinker, 1993) rather than the specific lexis of academic spoken English. Brown, Iwashita, and McNamara (2005) investigated the use of academic vocabulary in speaking test performances using the Academic Word List (Coxhead, 1998) and found little difference across proficiency levels and task types. The Academic Word List, however, was compiled from a corpus of 3.5 million headwords of written academic text; that is, not from spoken corpora. Some recent studies identify typical features of academic speech (e.g., Camiciottoli, 2004) that are not observed in academic written English. For this reason, it is important to investigate how lexical profiles in academic spoken English may differ according to proficiency level and task type, based on an academic spoken corpus.

Task Characteristics

Speaking tasks used in EAP tests increasingly seek to replicate the roles of and demands on students in academic contexts. An important but rather underresearched development in EAP test-task design is that of integrated tasks (see Lewkowicz, 1997), in which test takers are required to process and transform cognitively complex stimuli (written texts, lectures, etc.) and integrate the information into their speaking performance. Such performances are more complex and demanding than traditional independent tasks where test takers draw on their own knowledge or ideas in response to a question or prompt, and where
the absence of input means that the tasks are often restricted to fairly bland topics drawing on test takers’ general knowledge.

In general, the greater complexity of integrated tasks in terms of content and organization led the investigators to expect differences in the quality of the content and organization of performances across the two task types in the study. However, it was not known whether there would be differences between performances on the two task types in terms of more specific features. Because integrated tasks provide learners with language input, it was expected that the better responses to these tasks would involve more complex or sophisticated language, in terms of vocabulary at least; and, given the greater potential for complex ideas to be communicated, in terms of grammatical complexity as well. However, the greater cognitive demands of integrated tasks could have the opposite effect on markers of linguistic processing by making it more difficult for speakers to manage the linguistic control needed to yield higher scores on measures of sophistication and complexity. According to Skehan (1998), performance on integrated tasks is generally less accurate and fluent than independent task performance. Producing speech using the information presented in the prompt may, however, enhance the lexical quality of performance, but so far this aspect has not been researched. The literature on information-processing approaches to tasks (e.g., Robinson, 1995, 1996, 2002; Skehan, 1998) is similarly ambiguous. For example, Skehan (1998) argues that the assumed higher cognitive load of these tasks should mean that fewer cognitive resources are available to manage aspects of linguistic processing, resulting in lower scores on these measures. In contrast, Robinson (2002) claimed that the greater cognitive challenge may lead to heightened concentration, yielding generally better performances and resulting in higher scores on at least some of the features measured.

Research Questions

The present study addresses the following research questions:

1. How does learners' lexical competence vary according to their proficiency levels?
2. To what extent do task and task type impact on learners' vocabulary use in academic speaking tasks?

Methodology

Data

The data used for the study were initially collected in the United States as part of the piloting of materials in the development of the next generation of TOEFL. Performances on four pilot oral test tasks had been double-rated by Educational Testing Service (ETS) staff using a draft global scale with five levels (1 to 5). For the purposes of this project, ten samples of each task at each of the upper three levels (levels 3–5) were initially selected from a larger pool of pilot test data, a total of 24 performances per task and 96 in total. The ESL learners who took the trial test varied in age, L1, length of residence in an English-speaking country and prior time spent studying English, but all were studying English to prepare for tertiary study in the United States at the time of data collection.

Tasks

The four test tasks used in the present study were of two types, independent and integrated, based on whether performance involved prior comprehension of extended stimulus materials. In the independent tasks, participants were asked to express their opinion on a certain topic, which was presented with no accompanying material to read or hear. In the integrated tasks, participants first listened to or read information presented in the prompt and then were asked to explain, describe, or recount the information. The amount of preparation and speaking time varied for each task, but longer preparation and speaking times were given for the integrated tasks than for the independent ones (see Table 1).

Table 1. The Academic Speaking Tasks
Task | Type | Targeted functions and discourse features | Preparation time (secs) | Speaking time (secs)
1 | Independent | Opinion; Impersonal focus; Factual/conceptual information | 30 | 60
2 | Independent | Value/significance; Impersonal focus; Factual/conceptual information | 30 | 60
3 | Integrated; Monologic lecture | Explain/describe/recount; Example/event; Cause/effect | 60 | 90
4 | Integrated; Reading | Explain/describe/recount; Process/procedure; Purpose/results | 90 | 90

Data Analysis

Data were analyzed in two stages: general vocabulary use and the use of vocabulary specific to context and academic speech. Prior to analysis, the transcribed speech was pruned to exclude features of repair and imported into the application VocabProfile (Cobb, 2002). Frequency counts were then developed. The word-token measure was used because it was assumed that, for weaker test takers or performances on tasks that were more cognitively demanding, not all of the allowed time would be taken up with speech, and even if it were, speech was likely to be slower and thus yield fewer tokens. The word-type measure was chosen as a gauge of the range of vocabulary used; it was hypothesized that more proficient speakers and speakers addressing more cognitively complex tasks would use a wider range of word-types. In order to enable comparisons across tasks with different times allowed for completion, instances of word-token and word-type were counted per 60 seconds of speech. It should be noted that because a limited length of speaking time is allowed for each task, candidates who can speak fast produce more word-tokens than test takers who speak slowly. For that reason, the number of word-tokens could be affected by the speed of test-taker speech.
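To make the token and type measures concrete, the following is a minimal Python sketch of how the counts and their per-60-second rates could be computed from a pruned transcript. The crude regular-expression tokenizer, the function name, and the assumption that the full allotted speaking time was used are simplifications of my own and are not how VocabProfile itself operates.

import re

def vocab_measures(transcript, speaking_time_secs):
    """Count word tokens and word types and normalize both counts to a
    per-60-seconds-of-speech rate."""
    # Crude tokenizer: lowercase alphabetic strings (apostrophes kept).
    tokens = re.findall(r"[a-z']+", transcript.lower())
    types = set(tokens)
    scale = 60.0 / speaking_time_secs
    return {
        "tokens": len(tokens),
        "types": len(types),
        "tokens_per_60s": round(len(tokens) * scale, 2),
        "types_per_60s": round(len(types) * scale, 2),
    }

# Invented fragment of a 90-second integrated-task response.
sample = "the valley used groundwater for irrigation and the groundwater levels sank"
print(vocab_measures(sample, speaking_time_secs=90))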

To examine learners' ability to use vocabulary specific to the task context and to academic speech, the data were further analyzed using WordSmith Tools and MICASE. WordSmith Tools is an integrated suite of programs for looking at how words behave in text. The KeyWords program in WordSmith Tools was used to identify the key words in the text. Key words are those whose frequency is unusually high in comparison with some norm but are not the most frequent words (Scott, 2004). The key words were calculated by comparing the frequency of each word in the word list of the transcripts of the test-taker performances with the frequency of the same word in the reference word list. In the present study, MICASE was used as the reference list. It was expected that if learners used words typically used in academic settings, the KeyWords program would identify both context-dependent and context-independent words (note: context-dependent words such as proper nouns were excluded from the analysis). The results of the analysis are reported as the number of key word tokens and types and also as the percentages of key word tokens and types in the total numbers of word tokens and types. The effects of proficiency, task, and task type (i.e., independent and integrated) on these measures were examined using inferential statistics (MANOVA and t-tests).
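At its core, the KeyWords procedure is a frequency comparison between the study texts and a reference corpus. As an illustration only, the Python sketch below ranks words by the log-likelihood keyness statistic commonly used for this purpose; the function name, the minimum-frequency cutoff, and the toy counts are my own assumptions and do not reproduce the original analysis, which was carried out in WordSmith Tools with MICASE as the reference list.

import math
from collections import Counter

def keyness_loglikelihood(study_counts, ref_counts, min_freq=3):
    """Rank words by log-likelihood keyness: how unexpectedly frequent a
    word is in the study texts relative to a reference word list."""
    n_study = sum(study_counts.values())
    n_ref = sum(ref_counts.values())
    scores = {}
    for word, a in study_counts.items():
        if a < min_freq:
            continue
        b = ref_counts.get(word, 0)
        # Expected frequencies under equal relative frequency in both corpora.
        e1 = n_study * (a + b) / (n_study + n_ref)
        e2 = n_ref * (a + b) / (n_study + n_ref)
        ll = a * math.log(a / e1)
        if b > 0:
            ll += b * math.log(b / e2)
        scores[word] = 2 * ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy word lists with invented counts; in practice these would come from
# the pruned test-taker transcripts and the MICASE frequency list.
study = Counter({"groundwater": 12, "subsidence": 8, "the": 300})
reference = Counter({"groundwater": 2, "subsidence": 1, "the": 60000})
print(keyness_loglikelihood(study, reference))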

Results

Differences Across Proficiency Levels

Descriptive statistics of general vocabulary use are summarized in Table 2, and descriptive statistics of the key words identified by the KeyWords program in Table 3. As expected, both the raw and the frequency data show that more word tokens and types were produced by higher proficiency learners than by lower proficiency learners, but the analysis of key word use revealed that proficiency level had little effect on the production of key words. Key word tokens and types increased slightly with proficiency, but there was little difference in the percentage of key words in the total number of words across proficiency levels. It should be noted that variation among learners in the use of academic vocabulary (on all four measures) is notably large, as is shown by the SDs.

MANOVA was performed to examine statistical differences in the measures of both general and academic vocabulary use, and no significant difference across levels was observed (Table 4).

Table 2. Descriptive Statistics of General Vocabulary Use per Level
Level | Token M (SD) | Type M (SD) | Token/60 secs M (SD) | Type/60 secs M (SD)
3 | 101.44 (36.68) | 56.97 (15.549) | 82.98 (22.74) | 48.08 (14.73)
4 | 121.22 (33.65) | 67.66 (13.045) | 98.85 (25.55) | 56.47 (15.21)
5 | 134.59 (35.01) | 72.13 (14.701) | 116.32 (24.27) | 66.87 (15.71)

Table 3. Descriptive Statistics of the Key Words and Percentages
Level | Keyword token M (SD) | Keyword type M (SD) | Token (%) M (SD) | Type (%) M (SD)
3 | 11.61 (7.48) | 3.46 (1.95) | 0.11 (0.06) | 0.06 (0.03)
4 | 13.52 (10.80) | 3.41 (2.10) | 0.10 (0.07) | 0.05 (0.03)
5 | 13.68 (8.43) | 3.80 (2.35) | 0.10 (0.05) | 0.05 (0.03)

Table 4. Results of Multivariate Analyses (Hotelling's Test)
Effect | Value | F | Hypothesis df | Error df | Sig. | Partial Eta Squared
Intercept | 6.465 | 108.288 | 4 | 67 | 0.001 | 0.866
Level | 0.099 | 0.816 | 8 | 132 | 0.590 | 0.047
Task | 0.837 | 4.583 | 12 | 197 | 0.001 | 0.218
Level * Task | 0.183 | 0.499 | 24 | 262 | 0.978 | 0.044
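For readers who want to reproduce this kind of analysis outside SPSS, the sketch below shows how a two-way MANOVA of the four vocabulary measures could be set up in Python with statsmodels. The simulated data frame, the column names, and the data-generating rules are purely illustrative assumptions and are not drawn from the study's data.

import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Simulated stand-in for the 96 performances (8 per level-by-task cell).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "level": np.repeat([3, 4, 5], 32),
    "task": np.tile([1, 2, 3, 4], 24),
})
df["token"] = 60 + 15 * (df["level"] - 3) + rng.normal(0, 20, 96)
df["wtype"] = 40 + 8 * (df["level"] - 3) + rng.normal(0, 10, 96)
df["token60"] = 80 - 10 * (df["task"] > 2) + rng.normal(0, 15, 96)
df["type60"] = 55 - 8 * (df["task"] > 2) + rng.normal(0, 8, 96)

# Two-way MANOVA of the four vocabulary measures on level, task, and
# their interaction; mv_test() reports the Hotelling-Lawley trace along
# with Wilks' lambda, Pillai's trace, and Roy's greatest root.
fit = MANOVA.from_formula(
    "token + wtype + token60 + type60 ~ C(level) * C(task)", data=df)
print(fit.mv_test())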

Impact of Task and Task Type on Lexical Profile

Tables 5 and 6 present descriptive statistics of general vocabulary and key words according to four different tasks and include both raw and frequency data. It was assumed that learners would produce more words in Tasks 3 and 4, as the required speaking time was longer than in Tasks 1 and 2; but, as shown in the frequency data, the independent tasks produced more words per 60 seconds than did the integrated tasks (Table 5). However, instances of key words were observed far more frequently in integrated task performances than in independent task performances for all four measures.

The results of the multivariate analysis show that the effect of task on the vocabulary measures was significant, with a small effect size (Table 4). Task type comparisons are summarized in Table 7. All measures except keyword token were found to be significantly different between the independent and integrated tasks.

Table 5. Descriptive Statistics of Vocabulary Use (per Task)
Task | Token M (SD) | Type M (SD) | Token/60 secs M (SD) | Type/60 secs M (SD)
1 | 105.63 (27.76) | 62.71 (12.59) | 116.09 (28.21) | 69.58 (15.67)
2 | 97.08 (25.69) | 60.00 (12.12) | 104.59 (29.40) | 64.81 (15.61)
3 | 156.00 (36.45) | 79.83 (18.17) | 90.18 (19.42) | 46.35 (10.81)
4 | 117.62 (29.81) | 64.79 (14.67) | 86.67 (23.03) | 47.82 (11.92)

Table 6. Descriptive Statistics of Key Words and Percentages (per Task)
Task | Keyword token M (SD) | Keyword type M (SD) | Token (%) M (SD) | Type (%) M (SD)
1 | 9.42 (6.636) | 2.16 (1.385) | 0.08 (0.04) | 0.03 (0.02)
2 | 8.07 (6.364) | 2.40 (1.454) | 0.08 (0.05) | 0.04 (0.03)
3 | 13.17 (10.499) | 3.79 (1.744) | 0.12 (0.05) | 0.07 (0.03)
4 | 18.46 (7.616) | 5.13 (2.173) | 0.11 (0.08) | 0.06 (0.03)

Table 7. Comparison of Task Types
Measure | Task type 1 M (SD) | Task type 2 M (SD) | t | df | Sig.
General vocabulary:
Token | 104.03 (26.08) | 156.00 (36.45) | −7.21 | 70 | 0.001
Type | 61.85 (12.28) | 79.83 (18.17) | −5.101 | 70 | 0.001
Token/60 secs | 111.88 (28.49) | 90.18 (19.42) | 3.065 | 70 | 0.003
Type/60 secs | 66.93 (15.92) | 46.35 (10.81) | 5.852 | 70 | 0.001
Keyword analysis:
KW Token | 8.82 (6.46) | 13.17 (10.50) | −1.95 | 56 | 0.056
KW Type | 2.26 (1.40) | 3.79 (1.74) | −3.695 | 56 | 0.001
KW Token (%) | 0.08 (0.05) | 0.12 (0.05) | −3.021 | 56 | 0.004
KW Type (%) | 0.04 (0.02) | 0.07 (0.03) | −4.22 | 56 | 0.001

Task type 1 = independent, 2 = integrated task; KW Token = keyword token; KW Type = keyword type.
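As a small illustration of the kind of task-type comparison reported in Table 7, the sketch below runs an independent-samples t-test with an accompanying Cohen's d on invented per-performance word-type rates. Whether the original comparisons were paired or independent is not stated in the paper, so this should be read only as a generic template, not a reconstruction of the reported tests.

import numpy as np
from scipy import stats

# Invented per-performance word-type rates (types per 60 s) for the two
# task types; real values would come from the transcript counts.
independent_tasks = np.array([68, 72, 60, 75, 64, 70, 58, 66, 71, 63])
integrated_tasks = np.array([47, 50, 42, 49, 45, 52, 40, 44, 48, 46])

t_stat, p_value = stats.ttest_ind(independent_tasks, integrated_tasks)

# Cohen's d as a simple effect-size complement to the t-test.
pooled_sd = np.sqrt((independent_tasks.var(ddof=1)
                     + integrated_tasks.var(ddof=1)) / 2)
d = (independent_tasks.mean() - integrated_tasks.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {d:.2f}")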

Discussion and Conclusions

The present study investigated how the lexical profile of academic test-task performances varies according to learner proficiency and to task and task type. The results show little variation in the number of key words in the performances of the two task types (independent and integrated) according to proficiency level, but the number of key words identified varied according to task and task type.

It was expected that, as the proficiency of learners increased, the use of key words would increase, as with general vocabulary use (as shown by the number of word tokens and types), but this was not the case. This was partly because the percentage of key words used in each task performance was relatively low and the difference might not be captured clearly in the statistical analysis. Similar results were obtained by analyzing the data in the present study using the Academic Word List (Brown et al., 2005). Brown et al. investigated the percentage of words in each of four categories: the most frequent 1,000 English words, the second most frequent 1,000 English words, words in the Academic Word List, and any remaining words. This was done using VocabProfile, which is based on the Vocabulary Profile (see Laufer & Nation, 1995) and the Academic Word List. The results from our study are summarized in Tables 8 and 9 below. As in the results of the present study, higher proficiency test takers produced more academic vocabulary, and the percentage of academic vocabulary was higher than for other categories. A significant difference was found across levels, but effect size was small. In the case of tasks, the number of academic word tokens was larger in the integrated task performances (Tasks 3 and 4), but this was explained by the longer speaking time required for those tasks. The percentage of academic vocabulary was not large in the integrated tasks, which conflicts with the findings of the present study.

Table 8. Descriptive Statistics of Academic Words (per Level)
Level | Token M (SD) | % M (SD)
3 | 3.09 (2.52) | 3.02 (2.57)
4 | 4.56 (2.56) | 3.78 (2.04)
5 | 4.38 (3.16) | 4.15 (2.46)

Table 9. Results of Multivariate Analysis (Academic Words) (Hotelling's Test)
Effect | Value | F | Hypothesis df | Error df | Sig. | Partial Eta Squared
Intercept | 3.178 | 131.903 | 2.000 | 83.000 | .000 | .761
Level | .210 | 4.298 | 4.000 | 164.000 | .002 | .095
Task | .456 | 6.228 | 6.000 | 164.000 | .001 | .186
Level * Task | .149 | 1.016 | 12.000 | 164.000 | .437 | .069

Table 10. Descriptive Statistics of Academic Words (per Task)
Task | Token M (SD) | % M (SD)
1 | 3.92 (1.79) | 3.87 (2.19)
2 | 3.58 (3.27) | 3.51 (3.01)
3 | 5.50 (2.74) | 3.55 (1.51)
4 | 4.29 (3.13) | 3.67 (2.71)

Table 11. Comparison of the Use of Academic Words Between Tasks
Measure | Task type 1 M (SD) | Task type 2 M (SD) | t | df | Sig. (2-tailed)
Token | 3.44 (1.85) | 5.50 (2.74) | −2.638 | 70 | 0.010
% | 3.30 (1.82) | 3.55 (1.51) | 0.251 | 70 | 0.802

Task type: 1 = independent, 2 = integrated.

The differences between the results of the two analyses, which drew on different corpora, can be explained by examining which words were identified as key words by the KeyWords program using MICASE and which were identified as academic words by the Academic Word List. The following transcripts of test-taker performances show the words identified in both the KeyWords and the Academic Word List analyses. (Note: underlined words in bold were identified by both the Academic Word List and KeyWords; italicized words were identified by KeyWords; words in bold were identified by the Academic Word List.)

Example 1 (Independent Task Level 3)

I think that music and art could be encourage because this course can active the children creativity and for nature children are creative and when are children let the people if people develop their creativity when they are children they will be able to perform better when well be adults so in this sense they can be more productive they can create more they can be more helpful to the society and to the companies

Example 2 (Integrated Task Level 5)

The San Joaquin Valley presented as a place where land subsidence occurred. The San Joaquin Valley located in California was using groundwater from the late eighteen eighties. Now there was heavy pumping of water for both irrigation and other purposes in this valley. By the twenties and thirties land subsidence had already occurred and by the early nineteen seventies because of the unabated use of groundwater groundwater levels had sunk by hundred and twenty metres while the land had dropped by a level of eight metres. Now this might seem like a large amount but it occurred over a long period of time. So in order to mitigate this problem in the nineteen seventies. San Juaquin Valley reduced pumping of water and increased the use of surface water however the problem of land subsidence reappeared in the nineteen nineties because of the drought in California. And this made people start using groundwater again. And it was even a huger problem now because groundwater levels sunk by much greater than the seventies and the land level sunk greatly too.

As shown in the examples above, the academic words identified in the Academic Word List analysis and the words identified by the KeyWords program using MICASE as a reference list are somewhat different. Words such as occurred, subsidence, and creativity were identified in both analyses. That means they are listed in the Academic Word List and are also frequently observed according to the KeyWords analysis. Words such as groundwater and pumping were identified in the KeyWords analysis, but are not listed in the Academic Word List. Many words listed in the Academic Word List and registers specific to academic speech were not identified in the KeyWords analysis. This does not mean that learners did not use academic words (they did, as is shown in the results of the Academic Word List analysis) or academic registers specific to speaking, but that the words might not occur frequently enough to be captured in the KeyWords analysis. In order to examine whether learners use registers specific to academic speech, we need first to identify the types of registers used in academic speech using MICASE as in other studies, so that, based on the analyses, such features of learner speech can be identified.

Nevertheless, the present study shows a lexical profile of learners in their academic speaking test performances, and a comparison of the results from the KeyWords and Academic Word List analyses provides useful information on the use of integrated tasks in academic speaking tests. As was discussed earlier, it was assumed that cognitively demanding integrated tasks would produce more sophisticated speech in terms of grammatical and lexical complexity. However, as shown in the results of the Academic Word List analysis, academic words from the academic written corpus were used more frequently in independent tasks than in integrated tasks, but, according to the KeyWords analysis, words specific to the context were produced more frequently in the integrated tasks than in the independent tasks. Because an input text was given as a prompt, it was assumed that learners would use content words from the text far more frequently in the integrated tasks. To examine academic vocabulary use it would be better to employ independent rather than integrated tasks, but it is still not clear how learners use (or whether they actually use) academic registers specific to academic speaking. Further investigation will be required to produce a more detailed lexical profile of academic speaking task performances.

Acknowledgments

The study reported in this paper is an extension of a large research project funded by the Educational Testing Service (ETS) for the work of the TOEFL Speaking Team (Brown, McNamara, & Iwashita, 2005). I would like to express my gratitude to my co-researchers on that project, Annie Brown, Tim McNamara, and Sally O'Hagan. Without their hard work and dedication to the project, I would not have been able to conduct the present study. I am also thankful to Gavin Melles for his suggestions on data analysis. Lastly, I would like to sincerely thank the English Language Institute of the University of Michigan for funding this research project and Jeff Johnson for proofreading and editing the paper.

References

Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English for academic purposes speaking tasks. (TOEFL Monograph Series #MS29). Princeton, NJ: Educational Testing Service.

Camiciottoli, B. C. (2004). Interactive discourse structuring in L2 guest lectures: Some insights from a comparative corpus-based study. Journal of English for Academic Purposes, 3, 39–54.

Cobb, T. (2002). The Web vocabulary profiler. [Computer Software]. Retrieved from http://www.er.uqam.ca/nobel/r21270/textools/web_vp.html.

Coxhead, A. (1998). An Academic Word List. (English Language Institute Occasional Publication No. 18). Wellington, New Zealand: Victoria University of Wellington, School of Linguistics and Applied Language Studies.

Douglas, D., & Selinker, L. (1993). Performance on a general versus a field-specific test of speaking proficiency by international teaching assistants. In D. Douglas & C. Chapelle (Eds.), A new decade of language testing research (pp. 235−256). Alexandria, VA: TESOL Publications.

Page 119: Volume 3, 2005

111

Engber, C. A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4(2), 139–155.

Iwashita, N., Brown, A., McNamara, T., & O'Hagan, S. (2003, March). Analysis of test-taker discourse in the development of a speaking scale. Paper presented at the conference of the American Association for Applied Linguistics, Arlington, VA.

Laufer, B., & Nation, I. S. P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16(3), 307–322.

Leki, I., & Carson, J. (1994). Students’ perceptions of EAP writing instruction and writing needs across the disciplines. TESOL Quarterly, 28(1), 81–101.

Lewkowicz, J. (1997). The integrated testing of a second language. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education: Vol. 7. Language testing and assessment (pp. 121–130). Dordrecht, The Netherlands: Kluwer.

Lindemann, S., & Mauranen, A. (2001). It’s just real messy: The occurrence and function of just in a corpus of academic speech. English for Specific Purposes, 20, 459–475.

Robinson, P. (1995). Task complexity and second language narrative discourse. Language Learning, 45, 141–175.

Robinson, P. (1996). Introduction: Connecting tasks, cognition and syllabus design. The University of Queensland Working Papers in Language and Linguistics, 1(1), 1–15.

Robinson, P. (2002). Task complexity, cognitive resources, and second language syllabus design. In P. Robinson (Ed.), Cognition and second language instruction (pp. 287–318). New York: Cambridge University Press.

Santos, T. (1988). Professors’ reactions to the academic writing of nonnative-speaking students. TESOL Quarterly, 22(1), 69–91.

Scott, M. (2004). WordSmith Tools (Version 4) [computer software]. Oxford, UK: Oxford University Press.

Skehan, P. (1998). A cognitive approach to language learning. Oxford, UK: Oxford University Press.


A Summary of Construct Validation of an English for Academic Purposes Placement Test

Young-Ju Lee

University of Illinois at Urbana-Champaign

The Computerized Enhanced ESL Placement Test (CEEPT) at the University of Illinois at Urbana-Champaign is a daylong, process-oriented writing assessment in which test takers are given sufficient time to plan, produce, and revise an essay. This is a comprehensive validation study of the CEEPT, firmly rooted in the modern paradigm of test validation. I employ Messick's (1989) validity framework to build my arguments and I use multiple types and sources of evidence. A validity table is presented to summarize the positive and negative evidence relating to the six research questions investigated in this study: predictive validity, convergent and discriminant validity, improved essay quality, consequential validity, test fairness, and authenticity. Arguments in favor of the CEEPT come from six types of validity evidence. The findings will have important practical implications for implementing a multiple-draft essay test and a computer delivery mode in assessment contexts, even at the cost of some logistical constraints.

Each institution of higher education in the United States has its own English placement test to identify incoming international students who will benefit from English as a Second Language (ESL) instruction. Any institution of higher education that employs a locally developed English proficiency test needs to conduct a sound validation study of it. Doing so reduces the likelihood that international students will have subsequent language-related difficulties in coursework and maximizes the chance that they will get sufficient language support by taking the necessary ESL courses. The Computerized Enhanced ESL Placement Test (CEEPT) at the University of Illinois at Urbana-Champaign (UIUC) is in need of a comprehensive validation study because it was developed three years ago and is administered in a new test-taking mode. I base the theoretical framework for my study on Messick's (1989) unified validity framework and Chapelle's (1994) validity table. Messick expanded our scholarly conception of what validity is and how it can be investigated. His definition of evidence included meaning and value as well as fact. He argued that validity should be seen as a unitary concept because construct-related evidence undergirds not only construct-based inferences but also content- and criterion-based inferences.

He emphasized fundamental points related to test use such as misuse of the test, social consequences, and test fairness. The consideration of the consequences of test use becomes crucial under Messick's framework. He argued that adverse social consequences that are unrelated to invalid test interpretation raise social and political issues as opposed to validity issues; if, however, the adverse social consequences are attributed to test invalidity, then the validity of the test use becomes questionable. We need to include the effect of tests on students, institutions, and society as one type of validity evidence.

Chapelle's (1994) validity table is one viable method for efficiently translating Messick's progressive matrix. Chapelle investigated whether C-tests are a valid test method for L2 vocabulary by weighing validity justifications. She constructed a validity table with three columns (evidence, argues in favor, and argues against) and four rows corresponding to four types of justification (construct validity, relevance/utility, value implications, and social consequences). The main layout of her validity table is provided in Figure 1.

Justifications | Evidence | Argues in favor | Argues against
Construct validity | | |
Relevance/utility | | |
Value implications | | |
Social consequences | | |

Figure 1. Analysis of justifications for the use of the C-test for measuring L2 vocabulary ability (from Chapelle, 1994, p. 177).

By weighing negative as well as positive attributes, test users can decide if the use of a test is valid in a specific setting. The virtue of this validity table is that it presents negative evidence that has not been reported explicitly and allows us to refute it. Each test that we want to validate will have a different validity table design, with a different number of rows and columns and different types of evidence in the rows. The design of a validity table and the types of evidence will vary because each test is used for a specific purpose in a particular setting, which results in a different focus of validity inquiry. Although the validity table is the best tool to date in test validation, it is somewhat too simplistic for capturing rich and complex arguments and discourse.

Purpose of the Study

This is a comprehensive validation study of the CEEPT, firmly rooted in the modern paradigm of test validation. I employ Messick's (1989) validity framework to build my arguments and I use multiple types and sources of evidence. This study is based on my pragmatic attitude toward validation, the focus on consequential validity, and a combination of quantitative and qualitative approaches.

The CEEPT (Appendix B) is a performance assessment that requires students to summarize and integrate content from articles and lectures. The CEEPT facilitates prewriting tasks and multiple drafting with feedback between drafts. Three notable features of the CEEPT are extended time for writing, refined facilitative activities, and access to a word processor for essay writing.

The present study aims to: (a) support a process-oriented writing assessment as an institutional ESL placement test; (b) build validity arguments that the CEEPT is a valid measure of integrated academic writing ability, using multiple types and sources of evidence; and (c) advocate computer delivery of the test.

Research Questions

Research Question 1. Evidence of Predictive Validity: The Relationship between CEEPT and Academic Performance

To what extent does the CEEPT predict international graduate students' academic performance as well as language difficulties? Stated differently, what are the practical implications of CEEPT scores for graduate students' academic success at UIUC? The first research question is investigated using both qualitative and quantitative data on three measures of academic performance: GPA, faculty evaluations, and students' self-assessments.

CEEPT and Academic Performance: Grade Point Average as a Criterion

To what extent do CEEPT scores predict subsequent academic performance?

The first predictive criterion is international students' GPA during the first semester of graduate study at UIUC. Because the correlation between English proficiency scores and GPA tends to become weak for second and subsequent semesters, the present study focuses on first semester GPA only.

CEEPT and Academic Performance: Faculty Evaluations as a Criterion

How do faculty in academic courses evaluate the test taker’s English ability? In other words, how do faculty perceive the relationship between a student’s English language proficiency and academic performance? To what extent do the faculty evaluations correspond to correlations between CEEPT and first semester GPA?

I investigated content-field faculty perceptions of students' English language proficiency and academic performance. Qualitative data were collected from faculty in two ways. One was to administer surveys with closed and open-ended items to participants' content course faculty. The other was to conduct individual interviews with selected faculty.

CEEPT and Academic Performance: Students' Self-Assessments as a Criterion

What are students' perceptions of the relationship between their own English language proficiency and academic performance? I investigated students' own assessments of their academic performance in two ways. I administered surveys with closed and open-ended items to students, and conducted individual interviews with selected students. The qualitative data would also indicate the difficulties international students perceived in coping with the academic requirements of their studies.

Research Question 2. Evidence of Convergent and Discriminant Validity

To what extent do test scores on the CEEPT provide evidence of convergent and discriminant validity? How well does the CEEPT discriminate quantitative ability from language ability? The second research question was addressed by comparing patterns of correlations among the various measures. Each subsection of the TOEFL and the GRE was employed as a concurrent criterion.

Research Question 3. Evidence of Improved Essay Quality: The Effect of Revision on the Quality of Second Drafts

To what extent and in what ways does the quality of written products differ between first and second drafts?

The third research question investigated the effect of the revision session facilitated by computer writing tools on the quality of second drafts. Two kinds of evidence were used to compare the quality of the two drafts: scores and textual analysis. The primary interest is whether first draft essays and second draft essays differ from each other in terms of textual quality, although actual placement decisions are made solely on the basis of second draft essays. For detailed text analysis, quantity of text (i.e., the number of words, T-units, and T-unit length) and textual features (i.e., modals, adjective clauses, logical connectors, and exemplification) were examined.

Research Question 4. Evidence of Consequential Validity: Consequences of the Decisions Made on the Basis of CEEPT Scores

What is the direct impact of the CEEPT on international students at UIUC? Stated differently, do they perceive they were misclassified as masters or nonmasters?

The decisions made about test takers on the basis of CEEPT scores will directly affect them. Test takers who score above the cutoff score are exempt from ESL courses and are expected to have the ability to cope with the language demands of their first semester of study. However, two types of misclassifications can occur. First, some test takers might be misclassified as masters and will not be required to take ESL courses. They may have subsequent language-related difficulties in coursework. Second, some test takers might be misclassified as nonmasters (i.e., those who got scores below the cutoff score) and will be required to take ESL courses. Accordingly, they will not benefit from a full registration of content courses.

The Case of Typical CEEPT Takers

I chose students who were happy with their CEEPT results. Students who are exempt from ESL courses are masters and they are expected to have almost no difficulty with the English required in their content courses. Although this may seem self-evident, it gives valuable insights into the claim of construct validity of the CEEPT. If masters report that they did not have language-related difficulties in coursework, the intended consequences of the CEEPT are met.

The Case of Special CEEPT Takers: Malcontents

I deliberately chose a few malcontents. The operational definition of malcontents is as follows: people who are unhappy with a particular test-taking experience, such as the test's results and posttest instructional sequences. Malcontents are students who are not convinced that their test results are accurate, regardless of whether or not they liked the overall content and format of the test. In relation to malcontents, I use two terms to describe test takers' reactions to the test-taking experience: malcontented and malcontentedness. Malcontented is a descriptive term that can be used interchangeably with "unhappy" or "dissatisfied." Malcontentedness refers to a general phenomenon that reflects test takers' displeasure with their test results and posttest instructions. Malcontentedness from a validation point of view
can correspond to consequential validity, a test’s effects on test takers. Little empirical research has been conducted on this topic in language testing.

If malcontents reported that they benefited from ESL courses and that they did not have major language-related difficulties in coursework, it would constitute an important part of validation. This would also contribute to improving current CEEPT practices and updating curricula and teaching methodology in ESL courses. In contrast, if the ESL courses that students are required to take in accordance with CEEPT results are not perceived to be beneficial by those students, it casts doubt on both CEEPT results and ESL course objectives. Therefore, the fourth research question indirectly validates ESL courses at UIUC.

Research Question 5. Evidence of Test Fairness: Lack of Bias in Prediction

To what extent does the regression of the first semester GPA on CEEPT differ for gender, discipline, and language background subgroups?

A test is biased if consistent nonzero errors of prediction are made for members of the subgroup in the prediction of the criterion for which the test was designed (Pedhazur, 1997). Test bias in this study is operationally defined as a systematic error in the predictive validity of a test associated with group membership (Zeidner, 1987). The regression equations were analyzed by gender, discipline, and language background. If students from a certain group membership tend to earn lower grades than their CEEPT scores would suggest, subgroup differences in prediction become a threat to the construct validity of the CEEPT.

Research Question 6. Evidence of Authenticity: Test Takers' Perceptions of the CEEPT

How do test takers perceive the integrated, process-oriented, and computerized writing assessment?

The sixth research question addresses test takers’ perceptions of the CEEPT. Their responses to closed and open-ended items on the CEEPT survey will be presented.

Methods

Participants

International graduate students voluntarily selected the CEEPT as their operational test in August 2004, and the test results were officially used in placing them into appropriate ESL classes. Participants across disciplines took the CEEPT. They represent the subgroup of accepted international graduate students with lower TOEFL scores than the campus-wide or departmental cutoff scores. A total of 121 students took the CEEPT in Fall 2004. Among them, 11 (9.1%) were undergraduates and 110 (90.9%) were graduate students. Among the 110 graduate students, 100 students provided consent forms and voluntarily agreed to participate in this study.

Instruments

The CEEPT Survey

A CEEPT survey was administered to participants on the same day they took the test. The CEEPT survey elicited test takers' perceptions of (a) format and content of the test, (b) computer mode, (c) revision process, and (d) suggestions for the future CEEPT.

The Self-Assessment Survey

A self-assessment survey was administered to elicit students' own assessments of their academic progress and performance at midsemester. Responses to the survey also gave me information about appropriate faculty members to contact for the faculty evaluation survey. A total of 100 surveys were sent out and 55 were returned, which yielded a return rate of 55%.

The Faculty Evaluation Survey

Faculty were asked to fill out a survey at the end of the semester so that they would be able to better reflect on their perceptions of students' English language proficiency and academic performance. The faculty were asked about a student's English proficiency, academic performance in the course, and the extent to which students' level of proficiency hindered their performance in the academic course (Cotton & Conrow, 1998). A total of 50 surveys were sent out and 34 were returned, which yielded a return rate of 68%.

Interviews with Selected Students

Interviews with 20 students were conducted individually in December 2004. Interviews lasted about 60 minutes, with follow-up interviews as needed. All interviews were audiotaped and transcribed verbatim. Most of the same questions in the Self-Assessment Survey were asked, but students were asked to elaborate on their answers. I also elicited participants' perceptions of their course of study, writing assignments, and ESL courses.

Interviews with Selected Faculty

I visited faculty members during their office hours in February 2005. Interviews with ten faculty members were conducted for about 30 minutes each, with follow-up emails as needed. The information gathered from the Faculty Evaluation Survey served as the basis for the interview protocol questions. They were also asked about the goals of their writing assignments, the importance of writing in their program, and specific perceptions of participants' performance in their courses.

Interviews with Case Study Participants

I interviewed case study participants individually three times from the fall semester of 2004 to the spring semester of 2005. I took extensive notes during interviews and recorded them as well for later transcription. Each interview lasted approximately one and a half hours. The first interview at mid-fall semester covered participants' general academic backgrounds, content courses in their programs, and ESL courses. The second interview at the end of the fall semester focused on academic performance in their content courses, English-related problems, and the usefulness of ESL courses. The third interview at mid-spring semester elicited their opinions about their first semester grades and their performance in content courses.

Data Collection and Procedures

The research design is mixed in terms of sources and types of data. Data were gathered not only from test takers but also from faculty, and quantitative and qualitative forms of data gathering were employed (Table 1). The timeline was:

A. Test administration date: August 6, 13, and 20, 2004
   1. The CEEPT administration
   2. The CEEPT survey
B. The fall semester of 2004: November to December
   1. The Self-Assessment Survey
   2. The Interview with Selected Students
C. The early spring semester of 2005: December to February
   1. The Faculty Evaluation Survey
   2. The Interview with Selected Faculty
   3. Receipt of test and GPA data from the Office of Admissions and Records

Table 1. Types of Data Related to Each Research Question

RQ 1: Predictive validity
  Quantitative data: (1) first semester GPA; (2) CEEPT scores; (3) responses to closed items on the Self-Assessment Survey; (4) responses to closed items on the Faculty Evaluation Survey
  Qualitative data: (1) responses to open-ended items on the Self-Assessment Survey; (2) responses to open-ended items on the Faculty Evaluation Survey; (3) responses to interviews with students; (4) responses to interviews with faculty

RQ 2: Convergent and discriminant validity
  Quantitative data: (1) scores on each subsection of the TOEFL and the GRE

RQ 3: Improved essay quality
  Quantitative data: (1) holistic and analytic scores on the two drafts; (2) frequency of words, T-units, and T-unit length on the two drafts; (3) textual features on the two drafts; (4) responses to closed items about revision on the CEEPT survey
  Qualitative data: (1) responses to an open-ended item about revision on the CEEPT survey; (2) my impressionistic notes about topic, voice, grammaticality, and any noticeable aspects; (3) written feedback on mechanics from raters

RQ 4: Consequential validity
  Quantitative data: (1) CEEPT scores and the placement decisions
  Qualitative data: (1) interviews with case study participants

RQ 5: Test fairness
  Quantitative data: (1) first semester GPA; (2) CEEPT scores; (3) group membership

RQ 6: Authenticity
  Quantitative data: (1) responses to closed items on the CEEPT survey
  Qualitative data: (1) responses to open-ended items on the CEEPT survey

Data Analysis

Data were collected from the instruments and procedures described above. All statistical analyses were performed using SPSS version 11.5 for Windows.

For the first research question, CEEPT scores were correlated with first semester GPA using Pearson product-moment correlation coefficients. To obtain the correlation between CEEPT scores and GPA, four levels of CEEPT scores were employed. Essay raters holistically evaluated CEEPT essays in terms of four different categories: (1) Too low; (2) ESL 500; (3) ESL 501; and (4) Exempt. (ESL 500, English for Oral and Written Communication for International Graduate Students, and ESL 501, Introduction to Academic Writing for International Graduate Students, are a sequence of two ESL courses.) Therefore, numbers 1 through 4 were assigned to the four levels: Too low = 1; ESL 500 = 2; ESL 501 = 3; and Exempt = 4. Second, to investigate content-field faculty perceptions of students' English language proficiency and academic performance in their courses, survey and interview responses were analyzed. Third, students' own assessments of their academic performance were analyzed.

To answer the second research question, correlations between the CEEPT and each section of the standardized admissions tests (i.e., TOEFL and GRE) were calculated using Pearson product-moment correlation coefficients. A correlation table indicating convergent and discriminant validity is presented. Because this correlation coefficient is not a precise estimate with a sample size of 100, a confidence interval for significant correlation coefficients was calculated and interpreted. That is, both point estimation and a confidence interval were used to estimate the value of a population correlation coefficient.

For the third research question, scores and text were examined to compare the quality of the two drafts. Two analyses were conducted for the score comparison. First, dependent t-tests were used to examine differences in holistic scores between the two drafts. Second, to examine the five analytic scores, I used repeated measures MANOVA (multivariate analysis of variance). MANOVA evaluates differences among centroids for a set of dependent variables when there are more than two levels of a group (Tabachnick & Fidell, 2001). The advantage of MANOVA over multiple t-tests is that it guards against inflated Type 1 error due to multiple tests of the correlated analytic scores. For the text analysis of quantity of text, two analyses were done. First, a dependent t-test was employed to compare the quantity of text between the two drafts: the number of words, T-units, and T-unit length. Second, in order to determine which measure of text length was the best predictor of holistic scores on second drafts, a stepwise regression analysis was conducted. In the regression model, the independent variables were the frequency of words and T-units in the two drafts, and the dependent variable was holistic scores on second drafts. Based on Hinkel's (2002) criteria, I chose four textual features that would reflect change and improvement across revisions in a testing context. I employed normalized frequency counts rather than raw numbers; the frequency counts of features are normalized to text length. This normalization is essential for a comparison of frequency counts across texts because text length can vary widely (Biber, 1988). To examine mean differences in the normalized frequency of textual features between the two drafts, I used repeated measures MANOVA.
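As a small, self-contained illustration of the point-estimate-plus-confidence-interval approach described above, the sketch below computes a Pearson correlation and its 95% confidence interval via the Fisher z transformation in Python. The CEEPT levels and GPA values are invented for illustration, and the original analyses were run in SPSS.

import numpy as np
from scipy import stats

# Invented CEEPT placement levels (1 = Too low ... 4 = Exempt) and
# first-semester GPAs; the real data came from university records.
ceept = np.array([2, 3, 4, 3, 2, 4, 1, 3, 4, 2, 3, 4, 2, 1, 3])
gpa = np.array([3.2, 3.6, 3.9, 3.4, 3.1, 3.8, 3.0, 3.5, 3.7,
                3.3, 3.6, 3.9, 3.2, 2.9, 3.5])

r, p = stats.pearsonr(ceept, gpa)

# 95% confidence interval for the population correlation via Fisher's z.
n = len(ceept)
z = np.arctanh(r)
se = 1.0 / np.sqrt(n - 3)
lower, upper = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"r = {r:.3f}, p = {p:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")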

To investigate the consequences of decisions made on the basis of CEEPT scores, qualitative data obtained during interviews with the case study participants are reported. The participants' verbal protocols were transcribed, analyzed, and summarized. Qualitative data collection strategies included prolonged engagement with participants; that is, the study was nine months in duration.

The fifth research question addresses the differential effect of the CEEPT on first semester GPA for subgroups defined by gender, discipline, and language background. I used the Attribute-Treatment Interaction (ATI) design to compare regression equations across levels of the categorical variable, group membership. In the regression model, the dependent variable is first semester GPA and the independent variable is the CEEPT score; effect-coded vectors were created for group membership. A comparison of regression equations involves three steps: testing the overall F, testing a slope difference, and testing an intercept difference (Pedhazur, 1997). Testing the overall F examines whether the proportion of variance accounted for is meaningful. The second step tests whether the regression slopes differ across groups; only after it is concluded that the slopes do not differ significantly from each other can we move to the final step, which asks whether the intercepts are equal. Testing the difference between intercepts amounts to testing the difference between the effects of the categorical variable.

The sixth research question addresses test takers' perceptions of the CEEPT. Their responses to the closed and open-ended items on the CEEPT survey were analyzed and summarized. Following Brown's (2001) suggestions for data transcription and analysis, I transcribed responses to the open-ended items into a computer file. Responses were typed exactly as they appeared on the survey, including misspellings, grammatical errors, and typos.
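To make the three-step comparison concrete, the sketch below shows one way such an analysis could be set up with effect coding and an interaction term, using the pandas and statsmodels libraries. This is a hypothetical illustration rather than the study's SPSS procedure: the data frame, group labels, and column names are invented for the example.

# Hypothetical illustration of comparing regression equations across groups.
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: first-semester GPA, CEEPT level (1-4), and discipline group.
df = pd.DataFrame({
    "gpa":   [3.6, 3.9, 3.2, 3.8, 3.4, 3.7, 3.9, 3.1, 3.5, 3.8, 3.3, 3.6],
    "ceept": [2,   4,   1,   3,   2,   3,   4,   1,   2,   4,   2,   3],
    "group": ["Humanities"] * 4 + ["Business"] * 4 + ["Technology"] * 4,
})

# Effect (Sum) coding for group membership; the interaction term carries
# any slope differences between groups.
full   = smf.ols("gpa ~ ceept * C(group, Sum)", data=df).fit()
no_int = smf.ols("gpa ~ ceept + C(group, Sum)", data=df).fit()
common = smf.ols("gpa ~ ceept", data=df).fit()

# Step 1: overall F -- is the proportion of variance accounted for meaningful?
print(full.fvalue, full.f_pvalue)

# Step 2: do the slopes differ? (F test of the interaction terms)
print(full.compare_f_test(no_int))

# Step 3: if the slopes are parallel, do the intercepts differ?
print(no_int.compare_f_test(common))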

Summary of Results

For the first research question, I obtained a correlation coefficient of 0.052 for the overall sample between CEEPT scores and first semester GPA at UIUC. The direction and magnitude of the correlation coefficients varied by discipline. For language-oriented disciplines such as Business (r = 0.275) and Humanities (r = 0.35), there was a positive relationship between CEEPT scores and GPA. In contrast, there was a negative relationship for nonlanguage-oriented disciplines such as Life Sciences (r = −0.548) and Technology (r = −0.213).

The qualitative data complemented the correlational results at the practical level. From the faculty interviews, I learned more about faculty members' perceptions of students' English proficiency and its effect on performance in content courses. This type of in-depth information could only be gained through interviews with faculty, which allowed me as a researcher to understand the objectives of the courses and the expectations of the students in their own contexts. Although many students were generally positive about the adequacy of their English, some of them reported that higher English proficiency would have been beneficial for managing their content courses during their first semester of study at UIUC.

The relationship in the nomological net (i.e., conceptual network, validity framework) between CEEPT scores and academic success in content courses is central to the meaning of the construct. With a few exceptions, students who scored well on the CEEPT, especially those who were exempted, possessed most of the English skills needed to manage their content courses. Clearly, there were exceptions to the validity of the score-based inferences in the case of students in the Technology and Life Sciences disciplines, but valid score-based inferences were generally made from the performance of students in the Humanities discipline. In summary, the qualitative and quantitative analyses offered complementary findings on the multiple dimensions of academic performance.

The results of the second research question showed evidence of the convergent and discriminant validity of the CEEPT. The CEEPT is convergent with the TOEFL listening, the TOEFL essay writing, and the GRE analytical writing sections in that the correlation coefficients are high and significant. The correlations between the CEEPT and the TOEFL writing section or the GRE analytical writing assessment illustrate one instance of convergent validity: a monotrait (i.e., academic writing ability) and monomethod (i.e., composition) value.

The CEEPT measures different abilities than do the quantitative and analytical sections of the GRE, as indicated by the nonsignificant and low correlation coefficients. The correlation coefficients between the CEEPT and the quantitative or analytical section of the GRE correspond to a heterotrait (i.e., academic writing ability versus quantitative ability versus analytical reasoning) and heteromethod (i.e., composition versus multiple-choice questions) value. The nonsignificant and relatively low correlation coefficient supports the discriminant validity of the CEEPT with respect to the quantitative section of the GRE. Table 2 presents the correlation coefficients of the CEEPT with the TOEFL and GRE scores.

Table 2. Correlation Coefficients of the CEEPT with the TOEFL and GRE Scores

Test                                    n     Pearson r     p          95% confidence interval a
TOEFL Listening                         82     0.255        0.021*     0.04, 0.45
TOEFL Structure & Written Expression    82     0.212        0.055      —
TOEFL Reading                           82     0.088        0.431      —
TOEFL Essay Rating                      73     0.310        0.003**    0.088, 0.371
TOEFL Total Scores                      88     0.277        0.009**    0.089, 0.448
GRE Verbal                              49     0.273        0.058      —
GRE Analytical Writing                  39     0.446        0.004**    0.165, 0.678
GRE Quantitative                        49    −0.060        0.683      —
GRE Analytical                          10     0.211        0.558      —

Note. Due to technical difficulties, not all of the 100 participants' scores could be found in the old database in OAR.
a A 95% confidence interval was calculated for significant correlation results.
* p < 0.05. ** p < 0.01.

The results of the third research question showed that the quality of the essays improved from the first to the second drafts as a result of the revision session facilitated by peer feedback and computer writing tools. There were significant score differences between the two drafts composed during the CEEPT: holistic scores were, on average, 0.263 points higher on the second drafts than on the first, a statistically significant difference. In other words, students edited their first drafts effectively and produced stronger final essays. The results of the Repeated Measures MANOVA showed that mean differences on the analytic scale—organization, content, grammar, use of sources, avoidance of plagiarism, and mechanics—were significantly associated with the draft effect. The univariate test results confirmed that all six analytic scores were higher on the second drafts than on the first, indicating that the second drafts were linguistically better than the first drafts.
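For concreteness, the following sketch shows the form of the dependent (paired-samples) t-test used in the draft comparison; the holistic scores here are hypothetical and are not the study's data.

# Hypothetical holistic scores (1-4 scale) for the same writers' two drafts.
import numpy as np
from scipy import stats

first_draft  = np.array([2, 3, 2, 3, 1, 2, 3, 2, 4, 3])
second_draft = np.array([2, 3, 3, 3, 2, 2, 4, 3, 4, 3])

# Dependent (paired-samples) t-test on the draft-to-draft score change.
t, p = stats.ttest_rel(second_draft, first_draft)
print(f"mean gain = {np.mean(second_draft - first_draft):.3f}, t = {t:.2f}, p = {p:.4f}")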

Regarding the quantity of text, the results of the dependent t-tests indicated that differences of 147.74 words, 7.17 T-units, and 1.14 in mean T-unit length between the two drafts were significant. That is, students produced significantly more words, more T-units, and longer, more complex T-units on the second drafts than on the first drafts. In the stepwise regression analysis, the frequency of words in the second drafts was the best predictor of the holistic score on the second drafts among the four measures of text length. Normalized frequencies of four textual features (i.e., modals, adjective clauses, logical connectors, and exemplification) were used to investigate specific features that may affect essay quality. The results of the Repeated Measures MANOVA showed that draft had a significant effect on the normalized frequencies of the four textual features, and the univariate test results showed that the second drafts contained more modals than the first drafts. Overall, the second drafts were closer to developed academic discourse than were the first drafts.

The results of the fourth research question showed that the CEEPT had the intended consequences for students. The three contented students showed no classification errors. The three malcontents reported that they benefited from taking the ESL courses and appreciated that benefit, which suggests that adverse consequences were minimized at UIUC. The inclusion of the malcontents' points of view contributes to the understanding of construct validity in several respects. First, the effect of the test on students addresses consequential validity, the most important type of validity evidence. Second, it focuses on test use, especially negative attributes, which have not been reported explicitly in previous validation studies. Third, the effect of the test on students concerns experience at the level of the individual test taker. I believe that the perspectives of students, the ultimate stakeholders in testing, need to be reflected in test validation. Tables 3 and 4 summarize the findings of the fourth research question.

The results of the fifth research question showed that differences in group membership did not explain differences in academic performance in content courses. The absence of evidence of bias supports comparable construct and predictive validity for subgroups of language background (Chinese, Korean, and European), gender, and discipline (Humanities, Business, and Technology).

For the sixth research question, responses to the open-ended and closed items on the CEEPT survey provided positive evidence in support of the CEEPT. Students perceived a close match between their academic tasks and the CEEPT tasks, and this high authenticity contributed to eliciting their true writing abilities. Students generally expressed a preference for the CEEPT and for the computer delivery of the test.

The aim of the present study was to seek a more balanced view, examining both the strengths and weaknesses of the CEEPT's intended interpretations and uses and moving beyond a confirmationist bias. The validity table in Appendix A summarizes the positive and negative attributes for all the research questions investigated in this study. The current research does not prescribe how to summarize the overall findings in the validity table; we language testers do not yet know how to vote on the various validity arguments. For example, does modest evidence of predictive validity undermine the overall argument? I do not believe so, because there is strong evidence of authenticity, convergent and discriminant validity, test fairness, consequential validity, and improved essay quality. However, I leave the validity vote (i.e., the summary of the validity table) to readers, who must decide whether or not to employ this kind of assessment in their own contexts.

Table 3. Intended Consequences of CEEPT Placement Decisions: Case Studies of Three Typical Students

Tom a — Placement decision: ESL 500
  Perception of the placement decision: accepted the test result because of the tuition benefit
  Perception of the ESL courses: very useful
  Perception of the content courses: difficulty with listening

Lisa a — Placement decision: ESL 501
  Perception of the placement decision: accepted the test result because of the tuition benefit
  Perception of the ESL courses: somewhat useful
  Perception of the content courses: no specific problems

Cathy — Placement decision: Exempt
  Perception of the placement decision: accepted the test result
  Perception of the ESL courses: N/A
  Perception of the content courses: no problems at all

a Tom and Lisa were participants from the pilot study, not the main study.

Table 4. Unintended Consequences of CEEPT Placement Decisions: Case Studies of Three Malcontents

Amy a — Placement decision: ESL 500
  Perception of the placement decision: did not accept (TOEFL writing score = 5)
  Perception of the ESL courses: time-consuming in spring; very useful in summer
  Perception of the content courses: no specific problems, but lower grades than expected
  Perception change: accepted the placement decision

Anna — Placement decision: ESL 501
  Perception of the placement decision: did not accept (TOEFL writing score = 6)
  Perception of the ESL courses: time-consuming but somewhat useful; appreciated the benefit
  Perception of the content courses: no specific problems except a misunderstanding of questions in the textbook
  Perception change: did not accept

Lily (a double malcontent) — Placement decision: ESL 500
  Perception of the placement decision: did not accept either the CEEPT or the diagnostic test result
  Perception of the ESL courses: useful in fall and spring; appreciated the benefit; no immediate effect on helping her manage her content courses
  Perception of the content courses: some problems with readings because of many unfamiliar expressions
  Perception change: did not accept

a Amy was a malcontent from the pilot study, not the main study.

Implications

There has been a mismatch between process-oriented writing instruction and product-oriented assessment. Although the methodology for teaching ESL writing has shifted toward a process-centered approach, the assessment of ESL writing skills on standardized and institutional ESL placement tests has focused on written products (Hinkel, 2002).

The CEEPT is a process-oriented writing assessment in a large-scale ESL testing context. The results of the present study will help language testing specialists evaluate the validity of process-oriented writing assessment designed for placement purposes. Based on the negative as well as the positive attributes presented in the validity table, the findings have important practical implications for implementing a multiple-draft essay test and a computer delivery mode in assessment contexts, notwithstanding the logistical constraints such implementation entails. The findings of this study will advance our understanding of ESL writing assessment and will be applicable to other academic contexts. Specifically, they can serve as a basis for institutional placement test design in both ESL and English composition programs.

References

Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University Press.

Brown, J. D. (2001). Using surveys in language programs. Cambridge, UK: Cambridge University Press.

Chapelle, C. (1994). Are C-tests valid measures for L2 vocabulary research? Second Language Research, 10, 157–187.

Cotton, F., & Conrow, F. (1998). An investigation of the predictive validity of IELTS amongst a group of international students studying at the University of Tasmania. In S. Wood (Ed.), IELTS Research Reports, Vol. 1 (pp. 72–115). Canberra, Australia: IELTS Australia.

Hinkel, E. (2002). Second language writers’ text: Linguistic and rhetorical features. Mahwah, NJ: Erlbaum.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–103). New York: Macmillan.

Pedhazur, E. (1997). Multiple regression in behavioral research (3rd ed.). New York: Harcourt Brace.

Tabachnick, B., & Fidell, L. (2001). Using multivariate statistics (4th ed.). Boston: Allyn and Bacon.

Zeidner, M. (1987). A comparison of ethnic, sex, and age bias in the predictive validity of English language aptitude tests: Some Israeli data. Language Testing, 4, 55–71.

Appendix A
Validity Table for the CEEPT

Evidence of authenticity
  Argues in favor: There is a relatively high degree of correspondence between the characteristics of the CEEPT tasks and the features of target language use (TLU) tasks.
  Argues against: There is a gap between the CEEPT tasks and TLU tasks.
  Refutation: Students also perceived a close match between the TLU tasks and the CEEPT tasks, as indicated by responses to item #1 and comments on "authentic writing" on the CEEPT survey.
  Extent to which the positive attribute is satisfied: Highly satisfied

Evidence of convergent and discriminant validity
  Argues in favor: High correlations between the CEEPT and measures of similar abilities (e.g., each section of the TOEFL, the GRE writing assessment, and the verbal section of the GRE) indicate convergent validity. In contrast, relatively low correlations between the CEEPT and measures of different abilities (e.g., the quantitative section and the analytical section of the GRE) indicate discriminant validity.
  Argues against: There are no noticeable differences in the obtained correlation coefficients between convergent and discriminant validity; that is, the same degree of high correlation was obtained between the CEEPT and measures of different abilities (e.g., the quantitative section and the analytical section of the GRE).
  Refutation: The CEEPT is convergent with the TOEFL listening (r = 0.255), the TOEFL essay writing (r = 0.310), and the GRE analytical writing (r = 0.446) sections in that the correlation coefficients are high and significant. The nonsignificant and relatively low correlation coefficient supports the discriminant validity of the CEEPT with the quantitative section of the GRE (r = −0.06).
  Extent to which the positive attribute is satisfied: Highly satisfied

Evidence of predictive validity
  Argues in favor: (1) There is a statistically significant correlation between CEEPT scores and first semester GPA. (2) Faculty perceptions of test takers' English ability in content courses correspond to scores on the CEEPT.
  Argues against: CEEPT scores do not predict corresponding future academic performance.
  Refutation: (1) The direction and magnitude of the correlation coefficients varied depending on the discipline. For language-oriented disciplines such as Business (r = 0.275) and Humanities (r = 0.35), there was a positive relationship between CEEPT scores and GPA. In contrast, there was a negative relationship for nonlanguage-oriented disciplines such as Life Sciences (r = −0.548) and Technology (r = −0.213). (2) Qualitative data complemented the correlation between the CEEPT and first semester GPA at the practical level.
  Extent to which the positive attribute is satisfied: Moderately satisfied

Evidence of test fairness: Lack of bias in prediction
  Argues in favor: The CEEPT does not have serious biasing effects for test takers from subgroups of language background, gender, and discipline.
  Argues against: The CEEPT is not fair for test takers from subgroups.
  Refutation: The absence of bias supports comparable construct and predictive validity for subgroups of language background, gender, and discipline.
  Extent to which the positive attribute is satisfied: Highly satisfied

Evidence of consequential validity: Impact on test takers
  Argues in favor: The decisions made about test takers on the basis of CEEPT scores directly affected them. (1) Typical CEEPT takers showed no classification errors. Test takers who were exempted reported that they were able to manage the language demands in content courses; test takers who were required to take ESL courses reported some English difficulties in content courses. (2) Malcontents (i.e., test takers unhappy with the test results) reported that they benefited from ESL courses and that they did not have major language-related difficulties in their coursework.
  Argues against: (1) Typical CEEPT takers showed serious classification errors. (2) Malcontents reported that they did not benefit from ESL courses and that they had major language-related difficulties in their coursework.
  Refutation: (1) Participant 4, Cathy, who was exempted, reported no English problems at all. Participant 1, Tom, who was required to take two ESL courses, had a major difficulty with listening. Participant 2, Lisa, who was required to take one ESL course, did not have specific problems. (2) The trajectory of the first malcontent, Amy, showed that she benefited from the ESL courses and was happy with her test result after her second required ESL class. The second and third malcontents, Anna and Lily, found the ESL course useful during the first instructional sequence; however, they were not happy with their test results even at the completion of the ESL instruction.
  Extent to which the positive attribute is satisfied: Highly satisfied

Evidence of improved essay quality: The effect of the revision process facilitated by computer writing tools on the quality of second drafts
  Argues in favor: The construct label is based on the recursive writing model; Hayes's (1996) model views revision as an ongoing process. The computer mode makes revision efficient, as indicated by the textual differences between the two drafts. (1) A statistically significant difference in holistic and analytic scores exists between the two drafts. (2) There are statistically significant differences in the number of words, T-units, T-unit length, and four textual features between the two drafts.
  Argues against: (1) No statistically significant difference in holistic and analytic scores exists between the two drafts. (2) There is no statistically significant difference in the number of words, T-units, T-unit length, and four textual features between the two drafts.
  Refutation: (1) The mean difference of 0.263 in holistic scores between the two drafts was significant (p < 0.001). The draft effect on the six analytic scores was significant (p < 0.001), and the univariate test results showed significant differences in the six analytic scores between the two drafts (p < 0.0083). (2) Mean differences of 147.74 words, 7.17 T-units, and 1.14 in T-unit length between the two drafts were significant (p < 0.001). Draft had a significant effect on the normalized frequencies of the four textual features (p < 0.022). The univariate test results showed that second drafts contained more modals than first drafts (p < 0.01).
  Extent to which the positive attribute is satisfied: Highly satisfied, as indicated by p-values of less than 0.05 from the various statistical tests

Appendix B
CEEPT Procedures

PART ONE

8:20  CHECK-IN (10 MIN)

8:30  ORAL INTERVIEW PHASE 1 (30 MIN)
      Students are divided into two groups, and each student is asked some questions for assessment of pronunciation skills.

9:00  EXPLANATION OF THE CEEPT PROCEDURES (5 MIN)
      The teacher explains how the test proceeds and tells students to read the directions on the screen.

9:05  TOPIC INTRODUCTION (5 MIN)
      The teacher defines the topic.

9:10  GROUP BRAINSTORMING (10 MIN)
      In groups of 3 or 4, students brainstorm answers to general questions about the topic proposed by the teacher. The teacher distributes the discussion question sheets.

9:20  WHOLE-CLASS DISCUSSION (10 MIN)
      Students share their answers with the class.

9:30  BREAK (5 MIN)

9:35  VIDEO WATCHING (10 MIN)
      The teacher distributes scratch paper.

9:45  ARTICLE READING (20 MIN)
      The teacher distributes the assigned article, which is relevant to the topic of the video.

10:05 GROUP DISCUSSION (20 MIN)
      In new groups, students discuss the video and the article using guided questions about the topic provided by the teacher.

10:25 BREAK (5 MIN)

10:30 EXPLANATION OF THE SCORING CRITERIA (10 MIN)
      Students are given a holistic scoring criteria sheet for graduate students (for undergraduates, the undergraduate criteria are given). The teacher briefly explains that students will be assigned one of four scores. Students read the scoring criteria sheet carefully and ask questions, if necessary.

10:40 TIME FOR ORGANIZING AN ESSAY (10 MIN) / ROUGH DRAFT WRITING (40 MIN)
      Students organize their essay and then start to write their first draft.

11:30 LUNCH BREAK / INTENSIVE ORAL INTERVIEW
      The teacher collects students' essays and any handouts provided earlier. Students identified in Oral Interview Phase 1 as needing an intensive oral interview are interviewed individually for 20 minutes.

PART TWO

1:30  PEER REVIEW FAMILIARIZATION TASKS (20 MIN)
      The teacher defines peer review with students and explains the purpose and benefit of the peer feedback session. The teacher returns the first draft essays and two blank peer review sheets to students. Students read a blank peer review sheet carefully and ask questions, if necessary. The teacher distributes a text written by someone unknown to the students. As a whole-class activity, the teacher asks students to respond to a paragraph based on the peer review sheet. The teacher explains how to fill in the peer review sheet and shows a completed peer review sheet.

1:50  PEER REVIEW (60 MIN)
      The teacher forms groups of 3. Students take turns reading each essay and writing comments on the peer review sheet.

      Read essay #1 and write comments (15 min): Each student hands his or her own essay to the person to the left or right. (The essay should be circulated in one direction around the whole group throughout the peer review session.)

      Read essay #2 and write comments (15 min): Essay #1 is handed to the next person and becomes essay #2 for that person.

      Group discussion of person A's essay (10 min): Students determine who is person A, B, and C. The two students who have written comments about person A's essay give oral comments and suggestions about how the essay could be improved. After the oral feedback, student A collects the two peer review sheets from the other students.

      Group discussion of person B's essay (10 min): The two students give oral comments and suggestions to person B. Student B collects the two peer review sheets from the other students.

      Group discussion of person C's essay (10 min): The two students give oral comments and suggestions to person C. Student C collects the two peer review sheets from the other students.

2:50  BREAK (10 MIN)

3:00  ESSAY REVISION (50 MIN)
      Students write their second and final essays based on their peers' comments.

3:50  END OF THE WHOLE TEST
      The teacher collects all the materials and the two drafts, the first written in the morning and the second in the afternoon.
