Challenges in Predicting Machine Translation Utility for Human Post-Editors Michael Denkowski and Alon Lavie Language Technologies Institute Carnegie Mellon University October 29, 2012


Source Text → MT System → Fast Translation

Source Text → Translators → Good Translation

Good fast translation?

MT with Human Post-Editing

Source Text → MT System → Fast Translation → Translators → Good Fast Translation

Source Text → MT System → Fast Translation → Translators → Very Slow Re-Translation


Introduction

Utility prediction: We need to reliably predict the usability of automatic translations.

“Referenceless” utility prediction:

• Corresponds to confidence estimation task

• Confidence Estimation for post-editing (Specia 2011)

• WMT 2012 Shared Quality (for post-editing) Estimation Task (Callison-Burch et al., 2012)

Reference-aided utility prediction

• Corresponds to MT evaluation task

• This work


This Work

Machine translation as a starting point for human translators

• Goal is utility for post-editing

• Compare post-editing to traditional adequacy-driven tasks

Examine results of a post-editing experiment

• Simulate a real-world localization scenario

• Examine challenges in predicting translation usefulness for human translators


Adequacy Tasks

Adequacy: semantic similarity to reference translations

Significant research efforts on improving end quality of machine translation:

• ACL Workshops on Statistical Machine Translation (Callison-Burch et al., 2011)

• NIST Open Machine Translation Evaluations (Przybocki et al., 2009)

Measured by absolute scores or rankings

Motivation: MT for user consumption, input for other NLP tasks


Post-Editing

Human-targeted translation edit rate (HTER, Snover et al., 2006)

1. Human translators correct MT output

2. Automatically calculate number of edits using TER

$$\mathrm{TER} = \frac{\text{\# of edits}}{\text{\# of reference words}}$$

Edits: insertion, deletion, substitution, block shift
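The ratio above can be illustrated with a minimal Python sketch. It is a simplification, not the TER tool itself: it counts only word-level insertions, deletions, and substitutions and does not implement block shifts, so it can overestimate TER when reordering is involved. The function name `edit_rate` and the whitespace tokenization are assumptions made for this example.

```python
def edit_rate(hypothesis, reference):
    """Word-level edit distance divided by reference length (no block shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the hypothesis words seen so far
    # into the first j reference words
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete hypothesis word h
                            curr[j - 1] + 1,      # insert reference word r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)


# Toy strings (not from the slides): one missing word against a 4-word reference.
print(edit_rate("the cat sat", "the cat sat down"))  # 0.25
```

For HTER, the reference passed in would be the human post-edited version of the same MT output rather than an independent reference translation.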


Translation Example: WMT 2011 Czech–English Track

Ref: He was supposed to pay half a million to Lubos G.

1: He had for Lubosi G. to pay half a million crowns. (0.27)

2: He had to pay lubosi G. half a million kronor. (0.09)


Translation Example: WMT 2011 Czech–English Track (post-edits shown as [-deleted-] {+inserted+})

Ref: He was supposed to pay half a million to Lubos G.

1: He had [-for-] {+to pay+} [-Lubosi-] {+Lubos+} G. [-to pay-] half a million crowns. (0.27)

2: He had to pay [-lubosi-] {+Lubos+} G. half a million kronor. (0.09)


Translation Example: WMT 2011 Czech–English Track

Ref: The problem is that life of the lines is two to four years.

1: The problem is that life is two lines, up to four years. (0.49, 0.29)

2: The problem is that the durability of lines is two or four years. (0.34, 0.14)


Translation Example: WMT 2011 Czech–English Track (post-edits shown as [-deleted-] {+inserted+})

Ref: The problem is that life of the lines is two to four years.

1: The problem is that life [-is two-] {+of the+} lines [-, up to-] {+is two to+} four years. (0.49, 0.29)

2: The problem is that the [-durability-] {+life+} of lines is two [-or-] {+to+} four years. (0.34, 0.14)


MT Post-Editing Experiment

90 sentences from Google Docs documentation

Translated from English to Spanish by two systems:

• Microsoft Translator

• Moses system (Europarl)

180 MT outputs total

Sent to human translators at Kent State Institute for Applied Linguistics for post-editing

Translators never saw the reference translations


MT Post-Editing Experiment

Data collected from professional translators (in training):

Post-edited translations

Expert post-editing ratings:

1: No editing required

2: Minor editing, meaning preserved

3: Major editing, meaning lost

4: Re-translate

From parallel data:

Independent reference translations


MT Post-Editing Experiment

Evaluate post-edited results using standard MT evaluation metrics:

BLEU (Papineni et al., 2002):

• n-gram precision with a brevity penalty

TER (Snover et al., 2006):

• Minimum edit distance

Meteor (Denkowski and Lavie, 2011):

• Tunable alignment-based metric

Task: Reference-assisted utility prediction
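To make "n-gram precision with a brevity penalty" concrete, here is a minimal sentence-level sketch in Python. It is a simplification under stated assumptions: a single reference, whitespace tokenization, no smoothing, and a per-sentence score, whereas BLEU as defined by Papineni et al. (2002) is aggregated over a corpus. The names `ngrams` and `simple_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed, single-reference, sentence-level BLEU-style score in [0, 1]."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Because the score is unsmoothed, short sentences with no matching 4-gram score zero, which is one reason sentence-level BLEU is usually computed with smoothing in practice.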


MT Post-Editing Results

Average rating: 1.69

Average HTER: 12.4

Automatic metric scores:

              BLEU   TER    Meteor
Post-edited   79.2   12.4   90.0
MT vs Ref     31.7   49.5   58.2
Post vs Ref   34.1   48.3   59.2


MT Post-Editing Results

r         4-pt   BLEU   TER    Meteor
4-point   –      0.32   0.28   0.33
HTER      0.49   0.26   0.24   0.27

Metric correlation with post-editing scores
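As a rough illustration of how such a correlation could be computed, the sketch below implements Pearson's correlation coefficient in plain Python. It assumes that "r" in the table denotes Pearson's r and that each sentence contributes one metric score and one post-editing score; the function name `pearson_r` and the sample values are hypothetical, not data from the experiment.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-sentence values: metric scores and expert ratings (1 = best).
metric_scores = [0.9, 0.7, 0.4, 0.2]
expert_ratings = [1, 2, 3, 4]
r = pearson_r(metric_scores, expert_ratings)  # negative: higher score, lower (better) rating
```

Note that for quality metrics such as BLEU or Meteor the raw correlation with a 1-to-4 rating scale (1 = best) would be negative, so reported values may be absolute values or computed after flipping one scale.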


MT Post-Editing Experiment

Oracle experiment: tune Meteor to maximize correlation

How well can we (over)fit expert post-editing ratings?


The Meteor Metric

Flexible alignment:

Scoring features:

• Precision/Recall contribution (insertions, deletions)

• Fragmentation penalty (reordering)

• Content/function word contribution

• Flexible match weights
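For orientation, the scoring features listed above combine roughly as follows in the Meteor formulation of Denkowski and Lavie (2011). This is a simplified sketch: it omits the content/function word weighting and the per-matcher weights that enter precision $P$ and recall $R$. Here $m$ is the number of matched words, $ch$ the number of contiguous matched chunks, and $\alpha$, $\beta$, $\gamma$ are the tunable parameters.

$$F_{mean} = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R}, \qquad Pen = \gamma \cdot \left(\frac{ch}{m}\right)^{\beta}, \qquad Score = (1 - Pen) \cdot F_{mean}$$

Tuning Meteor toward a target such as post-editing ratings amounts to searching over these parameters and the match weights.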


MT Post-Editing Results

r         4-pt   BLEU   TER    Meteor   Meteor (oracle)
4-point   –      0.32   0.28   0.33     0.35
HTER      0.49   0.26   0.24   0.27     0.34

Metric correlation with post-editing scores


MT Post-Editing Experiment

Additional experiment: translation usability

Divide translations into two groups:

• Suitable for post-editing (1-2)

• Not suitable for post-editing (3-4)

Examine metric score distribution of each group

Assess metric ability to distinguish between usable and non-usable translations

Unfair advantage: reference translations
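The grouping step described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes ratings on the 4-point scale (1-2 usable, 3-4 non-usable), metric scores normalized to the range 0 to 1, and ten equal-width bins; the function name `usability_histograms` is hypothetical.

```python
from collections import Counter


def usability_histograms(ratings, scores, bins=10):
    """Bin metric scores in [0, 1] separately for usable and non-usable outputs."""
    hist = {"usable": Counter(), "non-usable": Counter()}
    for rating, score in zip(ratings, scores):
        group = "usable" if rating <= 2 else "non-usable"
        # A score of exactly 1.0 is placed in the last bin.
        bin_index = min(int(score * bins), bins - 1)
        hist[group][bin_index] += 1
    return hist
```

If a metric separated the two groups well, the two histograms would occupy mostly different bins; heavy overlap means no score threshold cleanly divides usable from non-usable translations.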


Usability Experiment Results

[Figure: two histograms of sentence counts for usable vs. non-usable translations; left panel by BLEU score (0.0 to 1.0), right panel by oracle Meteor score (0.0 to 1.0)]


Larger Data Set

Are our results skewed by the small size of the data (180 sentences)?

WMT12 Quality Estimation Task:

1832 English-to-Spanish MT outputs

HTER scores and 5-point multiple-expert ratings

Run usability experiment with this data


WMT 2012 Quality Estimation Task Data

[Figure: two histograms of sentence counts for usable vs. non-usable translations; left panel by BLEU score (0.0 to 1.0), right panel by oracle Meteor score (0.0 to 1.0)]


Usability vs HTER

How well do experts and HTER agree?

[Figure: two histograms of sentence counts by HTER (0.0 to 1.0) for usable vs. non-usable translations; left panel: Kent State, right panel: WMT 2012]


Usability vs HTER (WMT12)

[Figure: expert rating (1 to 5) and HTER (0 to 100) for the WMT12 data]


Conclusions

MT for post-editing utility is a significantly different task from MT for adequacy

Current MT tools under-perform on predicting post-editing usability

Even metrics that use post-editing information (HTER) don’t match expert assessments

To improve post-editing usability, we need better data, better metrics, better MT systems


www.transcenter.info
