Work Smart – Reducing Human Effort in Short-Answer Grading
Margot Mieskes, Hochschule Darmstadt
Ulrike Pado, Hochschule für Technik Stuttgart
Introduction Machine Grading Experiments Discussion
Introduction
- Testing is an integral part of (language) teaching
- Specifically in focus: Tests with Short-Answer Questions (SAQs) for language or content assessment
Short-Answer Questions
Example from the CREE corpus, Meurers et al. (2011b): Reading Comprehension
- Read text “Television and Children”
- Question: How is violence portrayed in cartoons according to the article?
- Student Answer: There are underlying themes of justice and punishment, that is, the “bad guys” do not usually win.
- Grader 1: correct, grader 2: correct
Our Goal
- Grading SAQs takes time, especially for large cohorts or frequently repeated tests
- Reduce human grading effort!
- Specifically: Machines do some of the work and humans step in where machines fail
- Determine likely machine failure through measures like Accuracy and Fleiss’ κ
- This means not every student answer will be human-graded
- Appropriate for placement testing etc., where the overall grade is reported
Outline of Talk
- Machine Graders: Data, features and evaluation
- Study 1: Human performance
- Study 2: Combining machine graders
Data Sets
Language  Corpus                             #Questions/#Answers
EN        ASAP (www.kaggle.com/c/asap-sas)   5/8182
EN        SEB (Dzikovska et al., 2013)       135/4969
EN        Beetle (Dzikovska et al., 2013)    47/3941
EN        Mohler (Mohler et al., 2011)       81/2273
EN        CREE (Meurers et al., 2011a)       61/566
GER       CREG (Meurers et al., 2011b)       85/543
GER       CSSAG (Pado and Kiefer, 2015)      31/1926
Models and Features
- Three learning algorithms: Random Forest, Support Vector Machine, Decision Tree
- Features designed to cover the feature types used in the literature: N-Grams, text similarity measures, dependency parses and deep semantic representations, textual entailment (Pado, 2016)
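To make the N-gram feature type concrete, here is a minimal sketch of one possible feature: the share of a reference answer's word n-grams that also occur in the student answer. The function names, the whitespace tokenisation and the reference answer in the usage note are illustrative assumptions; the slides do not specify how the features of Pado (2016) are computed.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(student_answer, reference_answer, n=1):
    """Fraction of reference n-grams that also occur in the student answer.

    Hypothetical feature for illustration: lower-cases both answers and
    splits them on whitespace before counting n-grams.
    """
    student = Counter(ngrams(student_answer.lower().split(), n))
    reference = Counter(ngrams(reference_answer.lower().split(), n))
    total = sum(reference.values())
    if total == 0:
        return 0.0  # reference too short to contain any n-gram
    # Each reference n-gram is matched at most as often as the student uses it.
    matched = sum(min(count, student[gram]) for gram, count in reference.items())
    return matched / total
```

For example, `ngram_overlap("the bad guys do not win", "bad guys do not usually win", n=2)` compares the bigrams of a student answer with those of a made-up reference answer. A classifier would typically receive several such values (for different n) alongside the other feature types.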
Evaluation Measures
- Comparing predictions and gold annotation
  - Accuracy: Which percentage of the answers has been labelled correctly?
- Comparing parallel annotations: Fleiss’ κ
  - Do the annotators agree more than they would by chance?
  - Compare multiple annotators, for multiple target grades, down to the individual answer
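The two measures can be sketched in a few lines of plain Python. This follows the standard definitions of accuracy and Fleiss' κ; the function names and the input layout (one list of labels per answer) are assumptions made for this example, not the authors' code.

```python
from collections import Counter


def accuracy(predictions, gold):
    """Share of predictions that match the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds the labels all annotators gave to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Observed agreement: per item, the share of annotator pairs that agree.
    per_item = [
        (sum(c * c for c in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement, from the overall distribution of labels.
    p_expected = sum(
        (sum(item[cat] for item in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    if p_expected == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)
```

The `per_item` values are the agreement "down to the individual answer": 1.0 when all annotators give the same grade, lower when they split.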
Human Performance
Measure        ASAP  CREG  CSSAG  Mohler
Human Acc (%)  93.7  85.8  89.9   83.5
Human κ        0.82  0.64  0.54   0.41
- Easiest case: Correct-incorrect decision
- Doubly-annotated corpora show large variation between high-volume and ad-hoc testing
- Accuracies around 85% have been accepted, i.e. about 15% error
Machine Ensembles
Our Goal
- Reduce human grading effort!
- Ideally, at the same error levels as before
- Idea: Machines do some of the work and humans step in where machines fail
Machine Ensembles
Strategy
- Train several classifiers and collect their predictions: Ensemble learning
- As long as each ensemble learner is better than chance and the learners' errors are sufficiently independent, the ensemble is expected to improve over the individual learners
- Also, now there are multiple annotations! Use κ to determine ensemble agreement
- Assumption: The better ensemble agreement is on a prediction, the more reliable it is
- Human checks of the machine labels are only needed for unreliable predictions
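The strategy can be sketched as a voting step that returns both a label and an agreement level. This is a simplification, not the authors' implementation: it uses raw vote counts as the agreement signal instead of κ, and the function names and the level names "full", "partial" and "none" are assumptions made for this example.

```python
from collections import Counter


def vote(predictions):
    """Combine one answer's classifier predictions into (label, agreement).

    agreement is "full" if all classifiers agree, "partial" if a strict
    majority agrees, and "none" otherwise. The label is None when there
    is no majority, i.e. the ensemble could not label the answer.
    """
    label, top = Counter(predictions).most_common(1)[0]
    if top == len(predictions):
        return label, "full"
    if top > len(predictions) / 2:
        return label, "partial"
    return None, "none"


def needs_human_check(predictions):
    """Flag an answer for manual grading unless the ensemble fully agrees."""
    return vote(predictions)[1] != "full"
```

With three classifiers and two grades, "none" cannot occur, because at least two classifiers must agree. With more than two grades, all three can disagree and the answer is left unlabelled.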
Machine Ensembles
Verifying the Assumption
Classes  ASAP  CREE  CREG  CSSAG  Mohler  Beetle  SEB
Binary   10%   15%   12%   24%    9%      17%     25%
Multi    18%   –     –     38%    30%     –       –
- Percentage of incorrect predictions made in full agreement
- For most corpora, decisions made in full agreement are as reliable as human annotators
- The task is noticeably harder for more than two grade levels
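One way to read the table is as the error rate among the answers that all classifiers labelled identically. The sketch below computes that figure; the reading itself, the function name and the input layout are assumptions made for this example.

```python
def full_agreement_error(ensemble_predictions, gold):
    """Share of fully agreed predictions that disagree with the gold label.

    ensemble_predictions: one list of classifier labels per answer.
    gold: the gold label of each answer, in the same order.
    Returns None if no answer was labelled in full agreement.
    """
    agreed = [
        (preds[0], gold_label)
        for preds, gold_label in zip(ensemble_predictions, gold)
        if len(set(preds)) == 1  # every classifier gave the same label
    ]
    if not agreed:
        return None
    wrong = sum(label != gold_label for label, gold_label in agreed)
    return wrong / len(agreed)
```

Comparing this value with the human error rate on the same corpus (100% minus human accuracy) is what the second bullet does informally.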
Machine Ensembles
Identifying Unreliable Predictions
- Clearly: Any answers the ensemble couldn’t label (no agreement; multiclass case only)
- Next: Any answers the ensemble didn’t label in full agreement
Machine Ensembles
Effort and Remaining Error: Binary Case
Strategy   Measure  ASAP  CREE  CREG  CSSAG  Mohler  Beetle  SEB
NA only    Effort   0     0     0     0      0       0       0
NA only    Error    16%   15%   16%   29%    11%     23%     30%
all PartA  Effort   20%   19%   12%   27%    7%      24%     28%
all PartA  Error    8%    9%    9%    17%    8%      13%     18%
- Binary case: Pass-fail decision
- First: Any answers the ensemble couldn’t label (none here!)
- Next: Any answers the ensemble didn’t label in full agreement: remaining error below human levels at 20-30% of answers graded
Machine Ensembles
Effort and Remaining Error: Multiclass Case
Strategy   Measure  ASAP  CSSAG  Mohler
NA only    Effort   7%    4%     9%
NA only    Error    28%   44%    41%
all PartA  Effort   39%   50%    59%
all PartA  Error    11%   19%    15%
- Multiclass case: 5 to 10-way decision
- First: Revise answers the ensemble couldn’t label – more work clearly needed
- Second: Revise cases of partial agreement
- Acceptable error levels, but more manual work than in the binary case
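The effort and error figures for the two revision strategies can be sketched as follows. The sketch rests on assumptions not stated on the slides: a human revision always yields the gold label, both measures are taken over all answers, and the strategy names are read as "NA only" (revise unlabelled answers) and "all PartA" (also revise partially agreed answers). The function name and input layout are hypothetical.

```python
def effort_and_error(graded, strategy):
    """Return (effort, remaining_error) for one revision strategy.

    graded: list of (machine_label, agreement, gold_label) triples, where
        agreement is "full", "partial" or "none".
    strategy: "NA only" revises answers the ensemble could not label;
        "all PartA" additionally revises partially agreed answers.
    """
    revised_levels = {
        "NA only": {"none"},
        "all PartA": {"none", "partial"},
    }[strategy]

    revised = 0
    errors = 0
    for machine_label, agreement, gold_label in graded:
        if agreement in revised_levels:
            revised += 1  # a human grades this answer; assumed to be correct
        elif machine_label != gold_label:
            errors += 1  # the machine label is kept, but it is wrong
    return revised / len(graded), errors / len(graded)
```

Calling the function once per strategy on the same graded answers gives one Effort/Error pair per strategy, which is the layout of the table above.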
Did we reach our goal?
- Human effort can be reduced while error levels remain stable
- Our approach works better for the learner corpora: Binary decisions, reliable machine learners
- For the multiclass case, similar efficiency as reported in Horbach et al. (2014) (60% effort saved, 15% remaining error); much better for pass-fail
What else to consider?
- When planning ensemble-supported grading:
  - Know your requirements
- When creating corpora:
  - Few classes
  - Well-trained annotators
  - Size matters (somewhat)
- Future work
  - Run a user study: Get feedback on usefulness and usability