esr7 carolina scarton - expert summer school - malaga 2015
Finding Ways to Assess Machine Translated Documents for Document-level Quality Prediction
Carolina Scarton [email protected]
Supervisor: Dr Lucia Specia
EXPERT – Scientific and Technological Workshop – Malaga, Spain – 27/06/2015
Agenda
Introduction
Quality Estimation Framework
Related Work
Document-level Quality Estimation
Quality Label problem
Two-stage post-edition experiment
Large-scale experiments
Conclusion
Introduction
Quality estimation (QE) of machine translations
– quality predictions for new, unseen machine translated texts
– use of machine learning techniques – only a few labelled data points
– different from BLEU-style metrics – QE does not rely on reference translations
Introduction
Open problems:
– Granularity level?
• Word-level
• Sentence-level
• Document-level
– Which are the best features?
• Linguistic features have been explored, but not much on discourse features!
– Which are the best quality labels?
• Likert
• HTER
• BLEU-style
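To make the HTER label concrete: HTER is the number of edits needed to turn the MT output into its human post-edition, divided by the length of the post-edition. The following is a minimal sketch, assuming whitespace tokenisation and plain word-level edit distance. Real TER also counts block shifts, which this sketch omits. The names `word_edit_distance` and `approx_hter` are illustrative and do not come from any tool mentioned in the slides.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_hter(mt_output, post_edit):
    """Edits needed to reach the post-edition, per post-edited word."""
    hyp, ref = mt_output.split(), post_edit.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(hyp, ref) / len(ref)


# Invented example: one substitution (in -> on) and one insertion (the).
mt = "the cat sat in mat"
pe = "the cat sat on the mat"
score = approx_hter(mt, pe)  # 2 edits over 6 post-edited words
```

A lower score means less post-editing effort. An unchanged MT output scores 0.0.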
Introduction
[Diagram: QE pipeline. Source documents and target documents → Feature extractor → Features for QE. Features for QE and quality labels (Likert, HTER, BLEU, ...) → QE model training → QE model → Predictions.]
Defining the ideal quality label for document-level prediction is a challenge
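The pipeline in the diagram can be sketched in a few lines. This is a toy illustration under stated assumptions, not the QuEst implementation: the three length-based features, the `NearestNeighbourQE` class (a 1-nearest-neighbour regressor standing in for the QE model), and the training data are all invented for the example.

```python
import math


def extract_features(source, target):
    """Toy document-level features: lengths and their ratio."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    ratio = tgt_len / src_len if src_len else 0.0
    return [float(src_len), float(tgt_len), ratio]


class NearestNeighbourQE:
    """1-nearest-neighbour regressor standing in for the QE model."""

    def fit(self, features, labels):
        self.features = list(features)
        self.labels = list(labels)
        return self

    def predict(self, features):
        predictions = []
        for x in features:
            distances = [math.dist(x, seen) for seen in self.features]
            nearest = distances.index(min(distances))
            predictions.append(self.labels[nearest])
        return predictions


# Made-up training data: (source document, MT output, quality label).
training = [
    ("a short source text", "ein kurzer Ausgangstext", 0.20),
    ("a much longer source text with many more words", "ein Text", 0.80),
]
X = [extract_features(src, tgt) for src, tgt, _ in training]
y = [label for _, _, label in training]
model = NearestNeighbourQE().fit(X, y)

# Prediction for a new, unseen pair: no reference translation is needed.
new_pair = extract_features("another short source text", "noch ein kurzer Text")
prediction = model.predict([new_pair])[0]
```

Whatever learner is used, the model can only be as informative as the quality labels it is trained on, which is why the choice of label matters.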
Quality Estimation Framework
QuEst (www.quest.dcs.shef.ac.uk)
– Framework for sentence-level QE
– QuEst++ → recent extension for word and document levels
• https://github.com/ghpaetzold/questplusplus
– Feature Extraction module (Java)
– Machine Learning module (Python)
Related Work
Soricut and Echihabi (2010) → TrustRank
– Ranking documents according to BLEU scores
Scarton and Specia (2014)
– Document-level QE prediction using discourse features – also predicted BLEU scores
Carpuat and Simard (2012)
– Lexical consistency study of MT outputs → MT is overall consistent!
Meyer and Webber (2013)
– Implicit discourse connectives in MT
Li et al. (2014)
– Discourse connectives → improve MT → correlations between discourse connectives and HTER
Guzmán et al. (2014)
– Document-level evaluation metric using RST
Quality Label problem
Quality labels are a challenge:
– Which is the ideal quality label for document-level QE?
– How can we assess documents?
• Sentence-level scores aggregation?
• New assessment score of the document as a whole?
Document-level QE
Quality Label problem
Quality labels are a challenge:
– BLEU-style metrics as quality labels
• LIG corpus (FR-EN) → 119 documents
Document-level QE
Quality Label problem
Quality labels are a challenge:
– BLEU-style metrics as quality labels
• WMT corpus (EN-DE) → 52 documents (1215 paragraphs)
– Low STDEV → documents have similar quality
• Is it really true?
Document-level QE
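A quick way to see the low-STDEV problem is to compute the spread of the document-level labels directly. The sketch below uses invented BLEU scores (not the LIG or WMT figures) and Python's standard `statistics` module. It also reports the error of a baseline that always predicts the mean, since with tightly clustered labels that baseline is already hard to beat.

```python
import statistics

# Made-up document-level BLEU scores, for illustration only.
doc_bleu = [0.21, 0.23, 0.22, 0.24, 0.20]

mean = statistics.mean(doc_bleu)
stdev = statistics.stdev(doc_bleu)  # sample standard deviation

# Coefficient of variation: spread relative to the mean.
relative_spread = stdev / mean

# Mean absolute error of a "predictor" that always outputs the mean label.
baseline_mae = statistics.mean([abs(score - mean) for score in doc_bleu])
```

When `baseline_mae` is this small, a low prediction error says little about whether the model can tell good documents from bad ones.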
Two-stage post-edition method
PE1:
– Post-edition of sentences without context
• Wir brauchen das kulturelle Fundament, aber wir haben jetzt mehr Schriftsteller als Leser.
PE2:
– Post-edition of sentences with context
• - St. Petersburg bietet nicht viel kulturelles Angebot, Moskau hat viel mehr Kultur, es hat eine Grundlage. Es ist schwer für die Kunst, sich in unserem Umfeld durchzusetzen. Wir brauchen das kulturelle Fundament, aber wir haben jetzt mehr Schriftsteller als Leser. Das ist falsch. In Europa gibt es viele neugierige Menschen, die auf Kunstausstellungen, Konzerte gehen. Hier ist diese Schicht ist dünn.
Document-level QE
Two-stage post-edition method
Hypothesis:
– There are problems in MT outputs that can only be solved in context
– Measuring the difference from PE1 to PE2
• Isolating document-level problems
• Using the difference to create a better quality label
Document-level QE
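The slides do not say how the difference from PE1 to PE2 is measured, so the following is only one possible sketch. It assumes whitespace tokenisation and uses `difflib.SequenceMatcher` from the Python standard library to list the word-level edits made in the second stage. The function names and the example sentences are invented.

```python
import difflib


def pe_changes(pe1, pe2):
    """Word-level edits that turn the PE1 text into the PE2 text."""
    a, b = pe1.split(), pe2.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


def change_rate(pe1, pe2):
    """Share of PE2 words that were inserted or replaced in stage two."""
    changed = sum(len(new) for _, _, new in pe_changes(pe1, pe2))
    total = len(pe2.split())
    return changed / total if total else 0.0


# Invented example: the pronoun can only be resolved with paragraph context.
pe1 = "She bought the car because it was cheap"
pe2 = "He bought the car because it was cheap"
changes = pe_changes(pe1, pe2)
rate = change_rate(pe1, pe2)
```

Listing the edits, rather than only counting them, makes it possible to inspect which changes actually depend on context.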
Two-stage post-edition method
Experiments:
– Data: 1215 paragraphs → WMT EN-DE corpus
• Filter 1: only paragraphs with more than 3 and fewer than 8 sentences
• Filter 2: Paragraphs ordered by number of discourse phenomena (discourse connectives and pronouns)
• Final data: 200 paragraphs
– Annotators → students of “translation studies” at Saarland University, Saarbrücken, Germany
– 16 sets → evaluate agreement
Document-level QE
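The two filters can be sketched as follows. Several points are assumptions made for illustration: the sentence bounds are read as strict (more than 3, fewer than 8), the ordering is taken to be descending with the top 200 kept, the sentence splitter is naive, and the English word lists are tiny stand-ins for a real inventory of connectives and pronouns.

```python
import re

# Tiny illustrative word lists; a real study would use fuller inventories.
CONNECTIVES = {"but", "however", "because", "although", "therefore"}
PRONOUNS = {"he", "she", "it", "they", "this", "that"}


def count_sentences(paragraph):
    """Naive splitter: counts runs of text ending in '.', '!' or '?'."""
    return len(re.findall(r"[^.!?]+[.!?]", paragraph))


def count_discourse_phenomena(paragraph):
    """Counts tokens that appear in the connective or pronoun lists."""
    tokens = re.findall(r"[a-z]+", paragraph.lower())
    return sum(tok in CONNECTIVES or tok in PRONOUNS for tok in tokens)


def select_paragraphs(paragraphs, limit=200):
    # Filter 1: more than 3 and fewer than 8 sentences.
    kept = [p for p in paragraphs if 3 < count_sentences(p) < 8]
    # Filter 2: most discourse phenomena first, then keep the top `limit`.
    kept.sort(key=count_discourse_phenomena, reverse=True)
    return kept[:limit]


short = "One. Two. Three."
plain = "One. Two. Three. Four."
rich = "He left. But she stayed. It rained. They slept."
selected = select_paragraphs([short, plain, rich])
```

Ranking by discourse phenomena favours paragraphs where context is most likely to matter for the translation.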
Two-stage post-edition method
Annotators' agreement:
Document-level QE
Two-stage post-edition method
Changes from PE1 to PE2 – paragraphs perspective:
Document-level QE
All paragraphs were changed
Two-stage post-edition method
Paragraph changes example:
Document-level QE
Better word choices
Two-stage post-edition method
Paragraph changes → manual analysis
– Discourse/context changes
– Stylistic changes
– Other changes
– Low agreement
• Annotators should not make lots of stylistic changes!
Document-level QE
Two-stage post-edition method
Final results:
– 116 paragraphs analysed
– Some changes → only with paragraph context
– However
• How to combine the results into a quality label?
Document-level QE
Large-scale experiments
Extending the research:
– Data: ~ 1000 data points
• Different language pairs
• Entire documents (?)
– Annotators: expert annotators (familiar with post-editing)
• Improving guidelines and training
– Evaluation: combining PE2 – PE1 with other metrics (HTER, BLEU, …)
– Alternative approach:
• Post-editions in context → available
• Apply PE1 (post-editing the sentences again → without context)
• PE2 – PE1 as usual
Conclusion
Two-stage post-edition method → promising!
– Problems that can only be solved in context
How to compute a quality label?
– Combine PE2-PE1 with other metrics?
– Use PE2-PE1?
Acknowledgement
Saarland University: Marcos Zampieri, Mihaela Vela, Heike Przybyl and Josef Van Genabith
Reviewers from EXPERT Workshop
Thank you!
Carolina Scarton [email protected]
Supervisor: Dr Lucia Specia