
Challenges in Predicting Machine Translation Utility for Human Post-Editors

Michael Denkowski and Alon Lavie

Language Technologies Institute, Carnegie Mellon University

October 29, 2012

[Diagram: Source Text → MT System → Fast Translation]

Good fast translation?

[Diagram: Source Text → Translators → Good Translation]

MT with Human Post-Editing

[Diagram: Source Text → MT System → Fast Translation → Translators → Good Fast Translation]

[Diagram: Source Text → MT System → Fast Translation → Translators → Very Slow Re-Translation]

Introduction

Utility prediction: We need to reliably predict the usability of automatic translations.

“Referenceless” utility prediction:

• Corresponds to confidence estimation task

• Confidence Estimation for post-editing (Specia 2011)

• WMT 2012 Shared Quality (for post-editing) Estimation Task (Callison-Burch et al., 2012)

Reference-aided utility prediction

• Corresponds to MT evaluation task

• This work

This Work

Machine translation as a starting point for human translators

• Goal is utility for post-editing

• Compare post-editing to traditional adequacy-driven tasks

Examine results of a post-editing experiment

• Simulate a real-world localization scenario

• Examine challenges in predicting translation usefulness for human translators

Adequacy Tasks

Adequacy: semantic similarity to reference translations

Significant research efforts on improving end quality of machine translation:

• ACL Workshops on Statistical Machine Translation (Callison-Burch et al., 2011)

• NIST Open Machine Translation Evaluations (Przybocki et al., 2009)

Measured by absolute scores or rankings

Motivation: MT for user consumption, input for other NLP tasks

Post-Editing

Human-targeted translation edit rate (HTER, Snover et al., 2006)

1. Human translators correct MT output

2. Automatically calculate number of edits using TER

$$\mathrm{TER} = \frac{\#\ \text{of edits}}{\#\ \text{of reference words}}$$

Edits: insertion, deletion, substitution, block shift
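As a rough illustration of the ratio above, the following Python sketch computes a simplified edit rate from word-level insertions, deletions, and substitutions only. It omits block shifts, so it is not full TER (Snover et al., 2006) and may overcount edits when phrases are reordered. The function names and the whitespace tokenization are assumptions made for this example. For HTER, the reference passed in would be the human post-edited version of the MT output.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,                        # delete hyp_word
                            curr[j - 1] + 1,                    # insert ref_word
                            prev[j - 1] + substitution_cost))   # match or substitute
        prev = curr
    return prev[-1]


def simple_edit_rate(hypothesis, reference):
    """Edits divided by reference length, using whitespace tokens (assumption)."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis.split(), ref_tokens) / len(ref_tokens)


# Toy example (not from the slides): one substitution and one insertion
# against a four-word reference.
print(simple_edit_rate("the cat sat", "the dog sat down"))
```

Because shifts are missing, a moved phrase costs one deletion plus one insertion per word here, whereas TER would count the whole move as a single edit.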

Translation Example

WMT 2011 Czech–English Track

Ref: He was supposed to pay half a million to Lubos G.

1: He had for Lubosi G. to pay half a million crowns.

0.27

2: He had to pay lubosi G. half a million kronor.

0.09


Ref: He was supposed to pay half a million to Lubos G.

1 (post-edited): He had to pay Lubos G. half a million crowns.

0.27

2 (post-edited): He had to pay Lubos G. half a million kronor.

0.09


Ref: The problem is that life of the lines is two to four years.

1: The problem is that life is two lines, up to four years.

0.49 0.29

2: The problem is that the durability of lines is two or four years.

0.34 0.14






1 (post-edited): The problem is that life of the lines is two to four years.

0.49 0.29

2 (post-edited): The problem is that the life of lines is two to four years.

0.34 0.14




MT Post-Editing Experiment

90 sentences from Google Docs documentation

Translated from English to Spanish by two systems:

• Microsoft Translator

• Moses system (Europarl)

180 MT outputs total

Sent to human translators at Kent State Institute for Applied Linguistics for post-editing

Translators never saw the reference translations


Data collected from professional translators (in training):

Post-edited translations

Expert post-editing ratings:

1: No editing required

2: Minor editing, meaning preserved

3: Major editing, meaning lost

4: Re-translate

From parallel data:

Independent reference translations


Evaluate post-edited results using standard MT evaluation metrics:

BLEU (Papineni et al., 2002):

• n-gram precision with a brevity penalty

TER (Snover et al., 2006):

• Minimum edit distance

Meteor (Denkowski and Lavie, 2011):

• Tunable alignment-based metric

Task: Reference-assisted utility prediction
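The BLEU description above (n-gram precision with a brevity penalty) can be sketched for a single sentence pair as follows. This is a minimal, unsmoothed, single-reference version written for illustration; BLEU (Papineni et al., 2002) is defined at the corpus level, and the function names here are assumptions, not the scoring tool used in the experiment.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hyp, ref, n):
    """Return (clipped matches, total hypothesis n-grams) for order n."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matches, sum(hyp_counts.values())


def sentence_bleu(hyp, ref, max_n=4):
    """Unsmoothed BLEU for one tokenized hypothesis and one tokenized reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        matches, total = modified_precision(hyp, ref, n)
        if matches == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: hypotheses no longer than the reference are penalized.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(sentence_bleu("the cat sat on".split(), reference))
```

In the usage line, every n-gram of the short hypothesis appears in the reference, so the score is determined entirely by the brevity penalty.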

MT Post-Editing Results

Average rating: 1.69

Average HTER: 12.4

Automatic metric scores:

               BLEU    TER     Meteor
Post-edited    79.2    12.4    90.0
MT vs Ref      31.7    49.5    58.2
Post vs Ref    34.1    48.3    59.2


r          4-pt    BLEU    TER     Meteor
4-point    –       0.32    0.28    0.33
HTER       0.49    0.26    0.24    0.27

Metric correlation with post-editing scores
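The slides report r without naming the coefficient or its sign convention. Assuming it is Pearson's correlation computed over per-sentence scores, a minimal sketch looks like this. The function name and the toy numbers are illustrative and are not data from the experiment.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Toy data (hypothetical): expert ratings and metric scores for five sentences.
ratings = [1, 2, 2, 3, 4]
metric_scores = [0.10, 0.25, 0.20, 0.45, 0.50]
print(pearson_r(ratings, metric_scores))
```

Note that a higher rating means a worse translation on the 4-point scale, so a quality metric such as BLEU would correlate negatively with it unless one of the two is inverted first.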


Oracle experiment: tune Meteor to maximize correlation

How well can we (over)fit expert post-editing ratings?

The Meteor Metric

Flexible alignment:

Scoring features:

• Precision/Recall contribution (insertions, deletions)

• Fragmentation penalty (reordering)

• Content/function word contribution

• Flexible match weights
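To make the first two scoring features concrete, here is a simplified sketch of how a Meteor-style score combines precision, recall, and a fragmentation penalty. It follows the general form described by Denkowski and Lavie (2011): a weighted harmonic mean of precision and recall, discounted by a penalty that grows with the number of contiguous matched chunks. It takes alignment statistics as inputs rather than computing an alignment, and it leaves out the content/function word weighting and per-matcher weights. The function name is an assumption, and the parameter values in the usage line are illustrative, not tuned values.

```python
def meteor_like_score(matches, hyp_len, ref_len, chunks, alpha, beta, gamma):
    """Simplified Meteor-style score from alignment statistics.

    matches: number of aligned words
    chunks:  number of contiguous, identically ordered runs of aligned words
    alpha:   balance between precision and recall in the harmonic mean
    beta, gamma: shape and weight of the fragmentation penalty
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # More chunks per matched word means more reordering, hence a larger penalty.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# Four of four words aligned in a single chunk (illustrative parameters).
print(meteor_like_score(matches=4, hyp_len=4, ref_len=4, chunks=1,
                        alpha=0.5, beta=1.0, gamma=0.5))
```

Parameters such as alpha, beta, and gamma are the kind of values a tuning procedure would adjust when fitting the metric to a set of human judgments.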


r          4-pt    BLEU    TER     Meteor    Meteor (oracle)
4-point    –       0.32    0.28    0.33      0.35
HTER       0.49    0.26    0.24    0.27      0.34

Metric correlation with post-editing scores


Additional experiment: translation usability

Divide translations into two groups:

• Suitable for post-editing (1-2)

• Not suitable for post-editing (3-4)

Examine metric score distribution of each group

Assess metric ability to distinguish between usable and non-usable translations

Unfair advantage: reference translations
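The grouping step described above can be sketched as follows: split per-sentence metric scores by expert rating, then count how many sentences of each group fall into each score bin. The function names, the ten equal-width bins over [0, 1], and the toy data are assumptions made for this example.

```python
def split_by_usability(ratings, scores, max_usable_rating=2):
    """Separate metric scores into usable (rating 1-2) and non-usable (3-4) groups."""
    usable, non_usable = [], []
    for rating, score in zip(ratings, scores):
        if rating <= max_usable_rating:
            usable.append(score)
        else:
            non_usable.append(score)
    return usable, non_usable


def histogram(scores, num_bins=10):
    """Count scores in equal-width bins over [0, 1]; a score of 1.0 goes in the last bin."""
    counts = [0] * num_bins
    for score in scores:
        index = min(int(score * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# Toy data (hypothetical): four sentences with expert ratings and metric scores.
ratings = [1, 2, 3, 4]
scores = [0.95, 0.25, 0.05, 1.0]
usable, non_usable = split_by_usability(ratings, scores)
print(histogram(usable))
print(histogram(non_usable))
```

If a metric separated the groups well, the two histograms would occupy mostly different bins; heavy overlap means the score alone cannot tell usable from non-usable output.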

Usability Experiment Results

[Figure: two histograms of sentence counts (y-axis: Sentences) for usable vs. non-usable translations. One panel plots BLEU Score (0.0 to 1.0) on the x-axis; the other plots Oracle Meteor Score (0.0 to 1.0).]

Larger Data Set

Are our results skewed by the small size of the data (180 sentences)?

WMT12 Quality Estimation Task:

1832 English-to-Spanish MT outputs

HTER scores and 5-point multiple-expert ratings

Run usability experiment with this data

WMT 2012 Quality Estimation Task Data

[Figure: two histograms of sentence counts (y-axis: Sentences) for usable vs. non-usable translations in the WMT 2012 data. One panel plots BLEU Score (0.0 to 1.0) on the x-axis; the other plots Oracle Meteor Score (0.0 to 1.0).]

Usability vs HTER

How well do experts and HTER agree?

[Figure: two histograms of sentence counts (y-axis: Sentences) by HTER (x-axis, 0.0 to 1.0) for usable vs. non-usable translations. Panels: Kent State and WMT 2012.]

Usability vs HTER (WMT12)

[Figure: Expert Rating (1 to 5) plotted against HTER (0 to 100) for the WMT12 data.]

Conclusions

MT for post-editing utility is a significantly different task from MT for adequacy

Current MT tools under-perform on predicting post-editing usability

Even metrics that use post-editing information (HTER) don’t match expert assessments

To improve post-editing usability, we need better data, better metrics, better MT systems


www.transcenter.info
