Work Smart – Reducing Human Effort in Short-Answer Grading

Margot Mieskes, Hochschule Darmstadt
Ulrike Pado, Hochschule für Technik Stuttgart

Introduction

- Testing is an integral part of (language) teaching

- Specifically in focus: Tests with Short-Answer Questions (SAQs) for language or content assessment

Short-Answer Questions

Example from CREE corpus, Meurers et al. (2011b): Reading Comprehension

- Read text “Television and Children”

- Question: How is violence portrayed in cartoons according to the article?

- Student Answer: There are underlying themes of justice and punishment, that is, the “bad guys” do not usually win.

- Grader 1: correct, grader 2: correct

Our Goal

- Grading SAQs takes time, especially for large cohorts or frequently repeated tests

- Reduce human grading effort!

- Specifically: Machines do some of the work and humans step in where machines fail

- Determine likely machine failure through measures like Accuracy and Fleiss’ κ

- This means not every student answer will be human-graded

- Appropriate for placement testing etc., where the overall grade is reported

Outline of Talk

- Machine Graders: Data, features and evaluation

- Study 1: Human performance

- Study 2: Combining machine graders

Data Sets

Language  Corpus                              #Questions / #Answers
EN        ASAP (www.kaggle.com/c/asap-sas)    5 / 8182
EN        SEB (Dzikovska et al., 2013)        135 / 4969
EN        Beetle (Dzikovska et al., 2013)     47 / 3941
EN        Mohler (Mohler et al., 2011)        81 / 2273
EN        CREE (Meurers et al., 2011a)        61 / 566
GER       CREG (Meurers et al., 2011b)        85 / 543
GER       CSSAG (Pado and Kiefer, 2015)       31 / 1926

Models and Features

- Three learning algorithms: Random Forest, Support Vector Machine, Decision Tree

- Features designed to cover the feature types used in the literature: N-grams, text similarity measures, dependency parses and deep semantic representations, textual entailment (Pado, 2016)

Evaluation Measures

- Comparing predictions and gold annotation
  - Accuracy: What percentage of the answers has been labelled correctly?

- Comparing parallel annotations: Fleiss’ κ
  - Do the annotators agree more than they would by chance?
  - Compare multiple annotators, for multiple target grades, down to the individual answer
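
The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names and the toy labels are assumptions, and Fleiss' κ is computed from its standard definition (observed pairwise agreement against chance agreement from the overall label proportions). The helper `per_answer_agreement` shows what agreement "down to the individual answer" could look like.

```python
from collections import Counter


def accuracy(predicted, gold):
    """Share of answers whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def per_answer_agreement(labels):
    """Fraction of annotator pairs that agree on a single answer."""
    n = len(labels)
    counts = Counter(labels)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def fleiss_kappa(annotations):
    """Fleiss' kappa; every answer must carry the same number of labels."""
    n_items = len(annotations)
    n_raters = len(annotations[0])
    # Observed agreement: mean of the per-answer agreement values.
    p_observed = sum(per_answer_agreement(a) for a in annotations) / n_items
    # Chance agreement: sum of squared overall label proportions.
    totals = Counter(label for a in annotations for label in a)
    p_chance = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_chance == 1:
        raise ValueError("kappa is undefined when only one label is used")
    return (p_observed - p_chance) / (1 - p_chance)


# Toy data (hypothetical): four answers, three annotators each.
toy = [
    ["correct", "correct", "correct"],
    ["correct", "incorrect", "correct"],
    ["incorrect", "incorrect", "incorrect"],
    ["incorrect", "correct", "incorrect"],
]
kappa = fleiss_kappa(toy)
acc = accuracy(["correct", "incorrect", "correct", "correct"],
               ["correct", "incorrect", "incorrect", "correct"])
```

On the toy data, two of the four answers are labelled unanimously and both labels are equally frequent overall, so κ stays well below 1 even though the annotators never split three ways.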

Human Performance

Measure    ASAP  CREG  CSSAG  Mohler
Human Acc  93.7  85.8  89.9   83.5
Human κ    0.82  0.64  0.54   0.41

- Easiest case: Correct-incorrect decision

- Doubly-annotated corpora show large variation between high-volume and ad-hoc testing

- Accuracies around 85% have been accepted: ~15% error

Machine Ensembles

Our Goal

- Ideally, at the same error levels as before

- Idea: Machines do some of the work and humans step in where machines fail

Machine Ensembles

Strategy

- Train several classifiers and collect their predictions: Ensemble learning

- As long as each ensemble learner is better than chance and the learners' errors are not strongly correlated, the ensemble can be expected to improve over the individual learners

- Also, now there are multiple annotations! Use κ to determine ensemble agreement

- Assumption: The better ensemble agreement is on a prediction, the more reliable it is

- Human checks of the machine labels are only needed for unreliable predictions
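
The strategy above can be sketched as a vote over the learners' predictions for one answer. This is an illustration under stated assumptions, not the authors' implementation: the function name `ensemble_vote` and the three agreement labels ("full", "partial", "none") are hypothetical, and a tie between the most frequent labels is treated here as "no agreement".

```python
from collections import Counter


def ensemble_vote(predictions):
    """Return (label, agreement) for the predictions on one answer.

    agreement is "full" when all learners agree, "partial" when one label
    has strictly more votes than any other, and "none" when the most
    frequent labels tie. In the "none" case the returned label is None.
    """
    ranked = Counter(predictions).most_common()
    top_label, top_count = ranked[0]
    if top_count == len(predictions):
        return top_label, "full"
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, "none"
    return top_label, "partial"


# Three learners, binary grading: a 2-to-1 split is partial agreement.
binary_result = ensemble_vote(["correct", "incorrect", "correct"])
# Three learners, multiclass grading: three different grades cannot be resolved.
multi_result = ensemble_vote(["0", "1", "2"])
```

With three learners and two classes a tie cannot occur, so under this sketch only multiclass grading can leave an answer without a label.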

Machine Ensembles

Verifying the Assumption

Classes  ASAP  CREE  CREG  CSSAG  Mohler  Beetle  SEB
Binary   10%   15%   12%   24%    9%      17%     25%
Multi    18%   –     –     38%    30%     –       –

- Percentage of incorrect predictions made in full agreement

- For most corpora, decisions made in full agreement are as reliable as human annotators

- The task is noticeably harder for more than two grade levels

Machine Ensembles

Identifying Unreliable Predictions

- Clearly: Any answers the ensemble couldn’t label (no agreement; multiclass case only)

- Next: Any answers the ensemble didn’t label in full agreement

Machine Ensembles

Effort and Remaining Error: Binary Case

Strategy   Measure  ASAP  CREE  CREG  CSSAG  Mohler  Beetle  SEB
NA only    Effort   0     0     0     0      0       0       0
NA only    Error    16%   15%   16%   29%    11%     23%     30%
all PartA  Effort   20%   19%   12%   27%    7%      24%     28%
all PartA  Error    8%    9%    9%    17%    8%      13%     18%

- Binary case: Pass-fail decision

- First: Any answers the ensemble couldn’t label (none here!)

- Next: Any answers the ensemble didn’t label in full agreement: Remaining error below human levels at 20-30% of answers graded
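
How effort and remaining error trade off can be sketched as follows. The sketch rests on assumptions that the slides do not spell out: both values are taken as fractions of all answers, a human-reviewed answer is assumed to end up with the gold label, and the function name, the agreement labels and the toy data are hypothetical.

```python
def effort_and_error(items, review):
    """Compute (effort, remaining_error) for one review strategy.

    items:  list of (ensemble_label, agreement, gold_label) triples
    review: set of agreement levels that are passed to a human grader

    Assumes a reviewed answer always receives the gold label, so only
    unreviewed machine labels can contribute to the remaining error.
    """
    total = len(items)
    reviewed = sum(1 for _, agreement, _ in items if agreement in review)
    wrong = sum(
        1
        for label, agreement, gold in items
        if agreement not in review and label != gold
    )
    return reviewed / total, wrong / total


# Toy data (hypothetical): ten answers with their ensemble outcome.
items = (
    [("correct", "full", "correct")] * 6
    + [("correct", "full", "incorrect")]
    + [("incorrect", "partial", "incorrect")]
    + [("incorrect", "partial", "correct")]
    + [(None, "none", "correct")]
)

# Review only the answers the ensemble could not label.
na_only = effort_and_error(items, {"none"})
# Additionally review every answer labelled in partial agreement.
all_partial = effort_and_error(items, {"none", "partial"})
```

On the toy data, reviewing the partial-agreement answers as well raises the effort from 10% to 30% of the answers and lowers the remaining error from 20% to 10%, the same direction of trade-off as in the table.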

Machine Ensembles

Effort and Remaining Error: Multiclass Case

Strategy   Measure  ASAP  CSSAG  Mohler
NA only    Effort   7%    4%     9%
NA only    Error    28%   44%    41%
all PartA  Effort   39%   50%    59%
all PartA  Error    11%   19%    15%

- Multiclass case: 5 to 10-way decision

- First: Review answers the ensemble couldn’t label – more work clearly needed

- Second: Review cases of partial agreement

- Acceptable error levels, but more manual work than in the binary case

Did we reach our goal?

- Human effort can be reduced while error levels remain stable

- Our approach works better for the learner corpora: Binary decisions, reliable machine learners

- For the multiclass case, efficiency is similar to that reported in Horbach et al. (2014) (60% effort saved, 15% remaining error); much better for pass-fail

What else to consider?

- When planning ensemble-supported grading:
  - Know your requirements

- When creating corpora:
  - Few classes
  - Well-trained annotators
  - Size matters (somewhat)

- Future work
  - Run a user study: Get feedback on usefulness and usability
