ESL essay raters’ cognitive processes

Paula Winke and Hyojung Lim
Michigan State University

This is a study of rater behavior

How does a rater make scoring decisions? What does a rater pay attention to when rating?


Language testers need to know if construct-irrelevant variation in scores stems from how raters approach and think about a rubric.


Empirical studies on raters' cognitive processes are scarce (especially with analytic scoring), and findings are not consistent.

Previous findings

Raters focus on different features in essays when scoring and weight the scoring categories differently (Cumming et al., 2002; Eckes, 2008; Orr, 2002).

Previous findings

Sometimes they consider external features that are not even described in a rubric (Barkaoui, 2010; Lumley, 2005; Vaughan, 1991).

Previous findings

Raters may have different attentional foci when scoring, and their foci may depend on the scale type (holistic vs. analytic), the rater's experience (expert vs. novice), and the rater's L1 and even L2 background.

The current study

We'd like to know:

How raters cognitively process (i.e., use) an analytic rubric while rating ESL essays
Whether variability in processing (differences in rubric usage) is associated with lower inter-rater reliability

Research Questions

1. To which parts of an analytic rubric do raters pay the most attention (measured as total fixation duration and visit count)?
2. Are inter-rater reliability statistics on the subcomponents of an analytic rubric related to the amount of attention paid to those subcomponents?

Method

Raters: 9 raters, all ESL instructors in the same English-language program at a large Midwestern university and native speakers of English. Each rated 40 essays (4 prompts × 10 essays).
Analytic rating scale: currently used in the language program; a modified version of the Jacobs et al. (1981) scale, with five categories: content, organization, vocabulary, language use, and mechanics.
Tobii TX300 eye-tracker: the rubric was installed in the Tobii Studio program.

[Rubric columns, left to right: Content, Organization, Vocabulary, Language Use, Mechanics]

Procedure

Session 1 (conference room)
Two-hour rater training session
The raters worked through 7 benchmark essays with Paula.
Hyojung explained the procedure.

Session 2 (lab)
Background questionnaire
Eye calibration
Practice rating (norming session)
Block 1: 10 essays
Block 2: 10 essays

Session 3 (lab)
Eye calibration
Practice rating (norming session)
Block 3: 10 essays
Block 4: 10 essays

The data

Data Analysis

To quantify attention: total fixation duration (divided by the number of words in each category) and visit count
To observe the rating process: time to first fixation, gaze plots, and heat maps (Bax & Weir, 2012)
Inter-rater reliability: the intraclass correlation coefficient (ICC) and reliability adjusted by the Spearman-Brown prophecy formula
Statistics: the Kruskal-Wallis test and Mann-Whitney (post hoc) tests

Results

In general, raters read the rubric from left to right, starting with content and moving through organization, vocabulary, and language use to mechanics. Mechanics was often overlooked (71 times, to be specific).
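The Spearman-Brown adjustment named under Data Analysis can be sketched as follows. This is only a minimal illustration of the general prophecy formula. The slides do not say which number of raters the reliabilities were adjusted to, so the function name, the `length_factor` parameter, and the input values below are assumptions for illustration, not the study's figures.

```python
def spearman_brown(reliability, length_factor):
    """Predict reliability when the number of raters is multiplied by length_factor.

    General Spearman-Brown prophecy formula:
        predicted = k * r / (1 + (k - 1) * r)
    where r is the observed reliability and k is length_factor.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical inputs, not values from the study.
doubled = spearman_brown(0.50, 2)      # reliability if twice as many raters were used
halved = spearman_brown(doubled, 0.5)  # stepping back down recovers the starting value
print(round(doubled, 3), round(halved, 3))
```

A factor above 1 projects reliability for more raters, and a factor below 1 projects it for fewer.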

Results

Organization received the most attention (in terms of fixation duration and visit count) and showed the highest inter-rater reliability; raters attended least to and agreed least on mechanics.

[Callouts on slide: r = .90; r = .75]

| Category | Fixation duration (mean, in seconds, with number of words controlled) | Visit count | Intraclass correlation coefficient | Spearman-Brown prophecy formula |
|---|---|---|---|---|
| Content | .071 | 4.03 | .89 | .82 |
| Organization | .081 | 4.14 | .92 | .90 |
| Vocabulary | .056 | 4.40 | .88 | .78 |
| Language Use | .053 | 4.15 | .90 | .82 |
| Mechanics | .041 | 2.57 | .85 | .75 |
| Statistical results | Organization, Content >> Vocab, Lang >> Mechanics | Vocab, Organization, Lang, Content >> Mechanics | | |

Results

From a qualitative review of the videos and heat maps in comparison with each rater's inter-rater reliability estimate, we believe that raters who agreed the most had common attentional foci, whereas those who agreed the least did not.

Incongruous Raters

Raters 1 and 7 were the most incongruous pair: they had the lowest inter-rater reliability for the total score (.45) and the second-lowest reliability for content (.36) and for mechanics (.28). Because the scores for Essay 2 had the largest standard deviation, we looked at the heat maps of Essay 2 for Raters 1 and 7.

[Heat map: Essay 2, Rater 1]

[Heat map: Essay 2, Rater 7]

Agreeing Raters

Raters 6 and 8 had the highest correlation coefficient in total scores (r = .79) as well as on the sub-scores for content (r = .75) and mechanics (r = .67). Because the scores for Essay 8 had the smallest standard deviation, we compared the heat maps of Essay 8 for Raters 6 and 8.

[Heat map: Essay 8, Rater 6]

[Heat map: Essay 8, Rater 8]
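The pairwise comparisons reported for the incongruous and agreeing raters could be reproduced along the following lines. This is a minimal sketch that assumes a Pearson correlation between two raters' scores on the same essays; the slides report r values but do not name the coefficient. The helper names and all scores are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_agreement(scores_by_rater):
    """Return {(rater_a, rater_b): r} for every pair of raters."""
    return {
        (a, b): pearson_r(scores_by_rater[a], scores_by_rater[b])
        for a, b in combinations(sorted(scores_by_rater), 2)
    }


# Invented total scores on four essays; NOT data from the study.
scores = {
    "rater_A": [62, 70, 75, 81],
    "rater_B": [60, 71, 74, 83],
    "rater_C": [78, 64, 80, 66],
}
agreement = pairwise_agreement(scores)
most_similar = max(agreement, key=agreement.get)
least_similar = min(agreement, key=agreement.get)
print("highest r:", most_similar, round(agreement[most_similar], 2))
print("lowest r:", least_similar, round(agreement[least_similar], 2))
```

Ranking the pairs this way is one plausible route to picking the rater pairs whose heat maps are worth comparing side by side.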

Discussion

Raters' attention and inter-rater reliability
More attention leads to higher inter-rater reliability with analytic scoring. (Cf. Wolfe, 1997: greater care and attention decrease reliability with holistic scoring.)
Those who showed higher inter-rater reliability showed similar reading patterns: they read a relatively large area of the rubric and had common patterns of attentional foci.

Discussion

The effect of the layout
With an analytic scale, raters' decision-making behaviors tend to operate within the scope of the given guidelines (Smith, 2000). Part of the guidelines is the order of the categories.
We think that raters gave their most attention to content and organization and their least attention to mechanics because of a primacy effect. It has to do with rubric real estate.

Discussion

In Lumley's (2005) study, the conventions of presentation (spelling, punctuation, script layout) received the second most attention after content, more attention than organization and grammar. In his rubric, the conventions of presentation also came second, after content. This may also be evidence of the primacy effect.

Discussion

Raters may use the rubric mainly to justify or adjust the scores for an essay on which they have already made decisions.
On finishing an essay, raters seemed to know where the quality of the essay would fall in the grid of the analytic rubric.
Those who showed higher inter-rater agreement appeared to look through more descriptors for various levels; those who didn't seemed to stick to their initial judgment.

Limitations & Future Directions

The eye-movement data don't fully explain why raters paid more attention to certain categories or whether raters considered non-criterion features. Analysis of our stimulated-recall interview data is needed.
We don't know if there was any halo effect across essays in the rating process.
Information is lacking on how raters read the essays and how they went back and forth between the essays and the rating scale.
We have collected data for a second study in which both the rubric and essay are on screen, and data for a third study to investigate potential halo effects.

Questions or comments?

Paula Winke [email protected]
Hyojung Lim [email protected]

Notes on Essays

We assembled a stratified sample of 40 essays from prior ESL placement tests at a large Midwestern university. We culled four sets of 10 essays, each set from one of four scoring bands (64 and below, 65-69, 70-74, and 75 and above: see supplemental material that accompanies the online version of this manuscript). We balanced the selection of the 40 essays equally across four prompts as follows, with two to three essays at each score band being a response to one of these prompts:

1. Do you think it is better for people to make their purchases online or to go shopping in stores and malls? Use specific details and examples to explain your answer.
2. Some people say that all international students who are studying English should have an American roommate for at least one year. What is your opinion on this topic?
3. Some employees have bosses that they really like working for, while others have bosses that they absolutely hate. What are the most important qualities of a good boss at work, and why?
4. If you had the choice, would you rather take a college course online or have the same class face to face with an instructor and classmates in a classroom? Use specific details and examples to explain your answer.

The length of student essays was limited to one page so that raters did not need to flip over pages while rating. The order of the 10 essays within each prompt set was randomized, and the order of the four prompt sets was counterbalanced across raters. A packet of 40 copied essays was ready for each rater, and raters were allowed to write on the essays while rating. Additionally, we selected two more essays for norming; these came from the middle two score bands (65-74).
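The ordering described above (essays randomized within each prompt set, prompt sets counterbalanced across raters) can be sketched as follows. The notes do not specify the counterbalancing scheme, so the cyclic rotation by rater index, the seeding, and all names here (`rating_order`, `prompt_1`, `p1_e1`, and so on) are assumptions made for illustration.

```python
import random


def rating_order(rater_index, essays_by_prompt, seed=0):
    """Return a list of (prompt, shuffled essays) blocks for one rater.

    Prompt sets are rotated by rater index (an assumed cyclic
    counterbalancing scheme), and essays are shuffled within each set.
    """
    prompts = list(essays_by_prompt)
    shift = rater_index % len(prompts)
    rotated = prompts[shift:] + prompts[:shift]
    rng = random.Random(seed + rater_index)  # reproducible order per rater
    blocks = []
    for prompt in rotated:
        essays = list(essays_by_prompt[prompt])  # copy so the input is untouched
        rng.shuffle(essays)
        blocks.append((prompt, essays))
    return blocks


# Placeholder identifiers: 4 prompt sets of 10 essays each.
essays_by_prompt = {
    f"prompt_{p}": [f"p{p}_e{e}" for e in range(1, 11)] for p in range(1, 5)
}
for rater in range(9):
    blocks = rating_order(rater, essays_by_prompt)
    print(f"rater {rater + 1}:", [prompt for prompt, _ in blocks])
```

With nine raters and four prompt sets, a rotation like this cannot be perfectly balanced: one starting position is used three times and the others twice.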

Notes on Time to 1st Fixation

| Category | N | Mean time | Std. deviation | Mean rank |
|---|---|---|---|---|
| Content | 351 | 101.66 | 33.31 | 567.65 |
| Organization | 351 | 108.16 | 33.18 | 649.64 |
| Vocabulary | 351 | 123.39 | 38.44 | 838.28 |
| Language Use | 350 | 142.41 | 44.98 | 1030.29 |
| Mechanics | 280 | 163.64 | 55.87 | 1196.35 |

Note. The mean ranks are the result of the Kruskal-Wallis test.
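To show where a "mean rank" comes from, here is a minimal pure-Python sketch of the Kruskal-Wallis computation: all observations are pooled and ranked, and each group's ranks are averaged. The data are invented, the helper names are hypothetical, and the H statistic below omits the tie correction; in practice a library routine such as `scipy.stats.kruskal` would normally be used instead.

```python
from itertools import chain


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (mean rank per group, H statistic without tie correction)."""
    names = list(groups)
    pooled = list(chain.from_iterable(groups[name] for name in names))
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    mean_ranks, weighted_sum, start = {}, 0.0, 0
    for name in names:
        n = len(groups[name])
        rank_sum = sum(ranks[start:start + n])
        mean_ranks[name] = rank_sum / n
        weighted_sum += rank_sum ** 2 / n
        start += n
    h = 12 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)
    return mean_ranks, h


# Invented times to first fixation; NOT data from the study.
times = {
    "Content": [90, 100, 110],
    "Vocabulary": [115, 120, 130],
    "Mechanics": [150, 160, 170],
}
mean_ranks, h = kruskal_wallis(times)
print(mean_ranks, round(h, 2))
```

A category that raters reach later gets larger pooled ranks, which is why the mean ranks in the table rise from Content to Mechanics.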

Eye fixation duration with number of words controlled

| Category | N | Total fixation duration (mean) | Number of words | Fixation duration (mean) with number of words controlled | SD | Mean rank |
|---|---|---|---|---|---|---|
| Content | 351 | 10.720 | 151 | 13.766/151 = .071 | .047 | 1050.45 |
| Organization | 351 | 7.576 | 94 | 9.597/94 = .081 | .062 | 1089.23 |
| Vocabulary | 351 | 8.216 | 146 | 10.397/146 = .056 | .037 | 888.95 |
| Language Use | 350 | 9.689 | 184 | 12.576/184 = .053 | .034 | 843.29 |
| Mechanics | 280 | 3.690 | 89 | 4.133/89 = .041 | .050 | 518.07 |

Note. Measurement units are seconds (e.g., 10.720 seconds). Mean ranks are the result of the Kruskal-Wallis test.
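The word-count control can be illustrated with the figures in the table. As a minimal sketch, dividing each category's mean total fixation duration by its number of words gives, to three decimals, the per-word values reported in the last duration column. The dictionary and function names are hypothetical; the numbers are copied from the table.

```python
# Mean total fixation duration in seconds, copied from the table above.
TOTAL_FIXATION_S = {
    "Content": 10.720,
    "Organization": 7.576,
    "Vocabulary": 8.216,
    "Language Use": 9.689,
    "Mechanics": 3.690,
}
# Number of words in each rubric category, copied from the table above.
WORD_COUNT = {
    "Content": 151,
    "Organization": 94,
    "Vocabulary": 146,
    "Language Use": 184,
    "Mechanics": 89,
}


def per_word_fixation(total_seconds, n_words):
    """Fixation duration per rubric word (hypothetical helper)."""
    return total_seconds / n_words


per_word = {
    category: round(per_word_fixation(TOTAL_FIXATION_S[category], WORD_COUNT[category]), 3)
    for category in TOTAL_FIXATION_S
}
for category, value in per_word.items():
    print(f"{category:<13} {value:.3f}")
```

Without this control, Content and Language Use would look most attended simply because they contain the most text to read.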
