
Page 1: Improving MT quality with evaluation-oriented methods

Machine Translation evaluation for MT development

Bogdan Babych
Centre for Translation Studies
University of Leeds
b.babych@leeds.ac.uk

QTLaunchPad Workshop @ EAMT 2014

Page 2: Overview

• MT evaluation & MT development: science vs. engineering perspectives
  o Re-engineering models of automated evaluation for MT improvement
  o Limitations of automated metrics
• Usability of MT for usage scenarios
  o Engineers’ favorite toy vs. translators’ tool
  o Future: Realistic scenarios & evaluation-guided MT

Page 3: MT evaluation & MT development

• MT evaluation: the field in its own right
  o ‘bigger’ than MT development?
  o ‘science’ methodology: quantifying & understanding natural phenomena behind engineering advances (understanding, well-formedness, mechanisms of language, communication, cognition; how to move forward)
  o Engineering context: test for theories & models
• Human evaluation: ultimate benchmark?
• Absolute vs. purpose-related quality
  o Skopos of human translation
  o Technical definitions of quality & MT performance

Page 4: Automated MT evaluation: source of models for MT

• What works for automated evaluation also improves MT
  o Evaluation uses some computable text features that characterize MT quality
  o Features are calibrated by human judgments or performance (a calibration sketch follows below)
  o New models based on the same features can improve MT quality
• Examples: Named Entities, Information Extraction, MWE & Terminology Translation
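
The following is a minimal sketch of the calibration step mentioned above, not code from the talk: a single computable feature (here, plain token overlap with a reference) is fitted against human adequacy judgments. The overlap feature and the adequacy scores are illustrative assumptions; the sentences reuse the Pan Am example from a later slide.

```python
# Minimal sketch (not from the talk): calibrating a computable text feature
# against human adequacy judgments with a least-squares fit. The feature
# (token overlap with a reference) and the toy scores are illustrative.
import numpy as np

def overlap_feature(candidate: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the candidate."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    return sum(tok in cand for tok in ref) / len(ref) if ref else 0.0

# Hypothetical calibration set: (MT output, reference, human adequacy on a 1-5 scale).
data = [
    ("the chief of the egyptian diplomacy",
     "the chief of the egyptian diplomatic corps", 3.0),
    ("the agreement was reached by a coalition of four of five unions of pan am",
     "the agreement was reached by a coalition of four of pan am's five unions", 4.5),
    ("coalition of four of saucepan five unions of am",
     "the agreement was reached by a coalition of four of pan am's five unions", 2.0),
]

x = np.array([overlap_feature(mt, ref) for mt, ref, _ in data])
y = np.array([score for _, _, score in data])

# Fit adequacy ≈ a * feature + b; the fitted parameters are what the slides
# call calibration by human judgments.
a, b = np.polyfit(x, y, 1)
print(f"adequacy ≈ {a:.2f} * overlap + {b:.2f}")
```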

Page 5: NER: evaluation and improvement

• MT errors more frequently destroy relevant contexts for NEs than create spurious ones
• Difficulties for automatic tools in finding NEs are roughly proportional to relative “quality” (the amount of MT degradation)
• NER system (ANNIE, www.gate.ac.uk):
  o the number of extracted Organisation Names gives an indication of Adequacy (a sketch comparing entity counts follows the example below)

ORI: … le chef de la diplomatie égyptienne
HT: the <Title>Chief</Title> of the <Organization>Egyptian Diplomatic Corps</Organization>
MT-Systran: the <JobTitle>chief</JobTitle> of the Egyptian diplomacy
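
A minimal sketch of this evaluation idea, not the GATE/ANNIE pipeline used in the work: a crude capitalisation-based stand-in recogniser checks how many named entities from the human translation survive in the MT output. The toy recogniser and the retention score are my own illustrations.

```python
# Minimal sketch, not the GATE/ANNIE pipeline from the slides: a crude
# capitalisation-based stand-in for a real NE recogniser, used only to show
# the idea of comparing NEs extracted from MT output and human translation.
import re

def toy_ner(text: str) -> set:
    """Very crude stand-in extractor: maximal runs of capitalised words."""
    return set(re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text))

def ne_retention(mt: str, ht: str) -> float:
    """Share of NEs found in the human translation that also survive in the MT output."""
    ht_nes, mt_nes = toy_ner(ht), toy_ner(mt)
    return len(ht_nes & mt_nes) / len(ht_nes) if ht_nes else 1.0

ht = "the Chief of the Egyptian Diplomatic Corps"
mt = "the chief of the Egyptian diplomacy"
print(f"NE retention (an indicator of Adequacy): {ne_retention(mt, ht):.2f}")
```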

Page 6: MT Evaluation with Named Entity recognition

[Figure: bar chart of the number of NEs extracted per type (Organization, Title, JobTitle, {Job}Title, FirstPerson, Person, Date, Location, Money, Percent), on a scale of 0 to 700, for the Reference and Expert human translations and for the Candide, Globalink, Metal, Reverso and Systran MT systems.]

Page 7: P&R of Organization Names vs. Human Adequacy & Fluency

[Figure: Precision and Recall of Organization Names (on a 0-0.7 scale) for the human translations (HT-exp., HT-ref) and for the candide, globalink, ms, reverso and systran MT systems, compared with human Adequacy and Fluency.]

Correlation (Pearson r) with human judgments:

           BLEU/ade   BLEU/flu   R(Org)/ade   R(Org)/flu
r correl   0.9535     0.995      0.8682       0.9806
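
The system-level correlations above can be reproduced with a few lines of code; the sketch below uses illustrative numbers rather than the scores behind the slide.

```python
# Minimal sketch (illustrative numbers, not the data behind the slide): the
# system-level Pearson r between an automatic metric and human judgments.
import numpy as np

recall_org = [0.21, 0.35, 0.42, 0.55, 0.61]   # hypothetical R(Org) per MT system
adequacy   = [2.1, 2.9, 3.2, 3.9, 4.3]        # hypothetical mean human adequacy

r = np.corrcoef(recall_org, adequacy)[0, 1]
print(f"Pearson r between R(Org) and Adequacy: {r:.4f}")
```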

Page 8: NER for MT Improvement: RBMT (later generalized to SMT & Hybrid MT)

              ProMT 1998 E-R     ProMT 2001 E-F     Systran 2000 E-F
Mark          N      Score       N      Score       N      Score
+1*           28     +28.0       23     +23.0       18     +18.0
+0.5*          2      +1.0        5      +2.5       24     +12.0
0*             4       0          7       0          8       0
–0.5*          3      –1.5        1      –0.5        1      –0.5
–1*           13     –13.0       14     –14.0       10     –10.0
SUM           50     +14.5       50     +11.0       61     +19.5
Gain                 +29%               +22%               +32%

(Gain = total score / number of scored items, e.g. +14.5 / 50 = +29%.)

Page 9: NER improvement example

Original:
The agreement was reached by a coalition of four of Pan Am's five unions.

Baseline translation (ProMT E-F):
L'accord a été atteint par une coalition de quatre de casserole cinq unions d'Am.
(‘The agreement was reached by a coalition of four of saucepan five unions of Am.’)

DNT-processed translation:
L'accord a été atteint par une coalition de quatre de cinq unions de Pan Am.
(‘The agreement was reached by a coalition of four of five unions of Pan Am.’)
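
The sketch below illustrates the do-not-translate (DNT) processing shown above: named entities are masked with placeholders before translation and restored afterwards. It is my own minimal illustration; the `translate` argument is a hypothetical stand-in for a real MT engine, and the entity list would normally come from a recogniser such as ANNIE.

```python
# Minimal sketch of the DNT idea behind the example above: mask NEs with
# placeholders before MT and restore them afterwards, so the engine cannot
# mistranslate "Pan Am" as "casserole ... Am". `translate` is a hypothetical
# stand-in for a real MT engine.
def dnt_translate(source, named_entities, translate):
    masked, placeholders = source, {}
    for i, ne in enumerate(named_entities):
        tag = f"__DNT{i}__"              # placeholder the MT engine passes through
        placeholders[tag] = ne
        masked = masked.replace(ne, tag)
    translated = translate(masked)       # call the underlying MT system
    for tag, ne in placeholders.items():
        translated = translated.replace(tag, ne)
    return translated

# Toy "MT engine" that only shows the placeholder surviving translation.
fake_mt = lambda s: (s.replace("a coalition of four of", "une coalition de quatre de")
                      .replace("five unions", "cinq unions"))
print(dnt_translate("a coalition of four of Pan Am's five unions", ["Pan Am"], fake_mt))
```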

Page 10: MWEs: MT and automated evaluation of concordances

• Sent-N (de-ori): Source text
• Sent-N (en-mt): Machine Translation output
• Sent-N (en-ht): Human Translation
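
One way to operationalise this, sketched below with my own toy data rather than the original experiment, is to score the MT output of every concordance line containing a given MWE against the human translation and average per MWE; consistently low averages point to systematically mistranslated MWEs like those plotted on the next slide. The German source and its mistranslation are invented for illustration, and NLTK is assumed for the sentence-level score.

```python
# Minimal sketch (toy data, not the original experiment): average a sentence-
# level score over all concordance lines containing a given MWE, so that
# systematically mistranslated MWEs can be ranked. Assumes NLTK is installed.
from collections import defaultdict
from statistics import mean
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

# Hypothetical concordance triples: (source, MT output, human translation).
concordance = [
    ("die maul- und klauenseuche breitet sich aus",
     "the mouth and claw disease is spreading",
     "foot and mouth disease is spreading"),
    ("die maul- und klauenseuche ist zurueck",
     "the mouth and claw disease is back",
     "foot and mouth disease is back"),
]

scores = defaultdict(list)
for src, mt, ht in concordance:
    if "maul- und klauenseuche" in src:      # the MWE being tracked
        scores["maul- und klauenseuche"].append(
            sentence_bleu([ht.split()], mt.split(), smoothing_function=smooth))

for mwe, vals in scores.items():
    print(f"{mwe}: mean score over its concordance = {mean(vals):.3f}, freq = {len(vals)}")
```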

Page 11: Systematically mistranslated MWEs: x=log(freq); y=exp(BLEU)

[Figure: scatter plot of MWEs by frequency (x = log(freq), roughly 0.5 to 2) and translation quality (y = exp(BLEU), roughly 1.02 to 1.46). Plotted MWEs include: depleted uranium, san suu kyi, hazardous substances, vis à vis, electrical and electronic, nobel prize winner, death penalty, arms exports, animal feed, drink driving, renewable energy sources, foot and mouth, raw materials, mad cow disease, sized enterprises, transportation of radioactive, atomic energy agency, intellectual property, animal welfare, constitutional legality, bosnia herzegovina, interception of telecommunications, irish referendum, nuclear materials, lawful interception.]

Page 12: Evaluation feature potential vs. useful MT models

• Evaluation works as feature selection
  o E.g., performance of Information Extraction (template filling)
  o Slot-filling performance from MT indicative of MT quality
• New models for MT require more work
  o Perpetrator, Target, Instrument… slots filled from multiple sentences
  o “terrorist killing of Attorney General”
  o != killing of a terrorist (by analogy to “tourist killing” or “farmer killing”); = killing by terrorists
  o ? “…just pretending to be a terrorist killing war machine…”
  o ? “… who is working for the police on a terrorist killing mission…”
  o ? “…merged into the "TKA" (Terrorist Killing Agency), they would … proceed to wherever terrorists operate and kill them…”

Page 13:

o “X’s defeat” == X’s loss
o “X’s defeat of Y” == X’s victory
• ORI: Swedish playmaker scored a hat-trick in the 4-2 defeat of Heusden-Zolder
  … its defeat of last night
  … their FA Cup defeat of last season
  … their defeat of last season’s Cup winners
  … last season’s defeat of Durham
• ‘Non-monotonic’ filling from text could resolve ambiguity
  o Semantic role labeling beyond sentence level
• Unclear how to use text-level disambiguation in MT models

Page 14: Limitations of metrics

• ‘Used’ or ‘ideologically related’ evaluation metrics overestimate MT performance
  o E.g., BLEU (Callison-Burch et al., 2006)
  o A ‘surprise’ metric is needed
• Metrics require re-calibration to determine usability levels (text types, target languages)
  o Regression parameters for predicting Ade/Flu (a calibration sketch follows below)
• Metric sensitivity depends on the quality level
  o BLEU has lower correlation for closely-related MT & non-native HT
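
A minimal sketch of such re-calibration, with toy numbers rather than the data behind the later slides: a separate BLEU-to-Adequacy regression line is fitted per target language and text type, since the fitted parameters (and any usability threshold derived from them) differ across groups.

```python
# Minimal sketch (toy numbers): fitting separate BLEU -> Adequacy regression
# lines per target language / text type, since one set of parameters cannot
# simply be reused across them.
import numpy as np

# Hypothetical calibration data: (target language, text type) -> (BLEU, adequacy).
calibration = {
    ("fr", "news"): ([0.20, 0.30, 0.40, 0.50], [2.4, 3.0, 3.5, 4.1]),
    ("de", "tech"): ([0.20, 0.30, 0.40, 0.50], [2.0, 2.5, 3.1, 3.6]),
}

params = {}
for group, (bleu, adequacy) in calibration.items():
    slope, intercept = np.polyfit(bleu, adequacy, 1)
    params[group] = (slope, intercept)
    print(f"{group}: adequacy ≈ {slope:.2f} * BLEU + {intercept:.2f}")

# A BLEU score is only interpretable through the parameters of the matching group.
slope, intercept = params[("fr", "news")]
print("BLEU 0.35 on fr/news implies adequacy ≈", round(slope * 0.35 + intercept, 2))
```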

Page 15: Non-native HT vs. MT

Page 16: Sensitivity of BLEU vs. NER-based metrics: MT for distant & closely-related languages

Page 17: Quality parameters for MT usage scenarios

• “There are no absolute standards of translation quality but only more or less appropriate translations for the purpose for which they are intended.” (Sager 1989: 91)
• Purpose-oriented MT quality definition
  o QTLaunchPad’s flagship examples: Post-Editing…
  o Usage scenarios and usability parameters
  o Moving beyond Ade, Flu, Inf … & BLEU
  o Evaluating & Improving MT for different target scenarios

Page 18: MT usage scenarios

• Place in the workflow
  o Pre-/Post-editing
  o Controlled authoring, Sublanguage
  o Fully automatic & unrestricted authoring
• Purpose & deadlines (continuous range)
  o High-quality publication
  o Internal communication (one-off translation)
  o Comprehension, getting information (on-line reading)
  o Performing tasks (following technical instructions, etc.)
  o Multilingual research (legal, medical advice, events)
  o Automatic processing (Information Extraction, Text Classification)

Page 19: Evaluation metrics for usability levels and thresholds?

• If cheaper etc. than task-based evaluation, such metrics would be of real interest for the translation industry
• Establishing a project-specific usability threshold for MT system performance
  o The metric should match the usage scenario
  o Minimal value required to benefit from MT (a threshold sketch follows below)
• Predicting productivity gains from metrics
  o Different thresholds for different scenarios
  o Comparative usability of MT systems for scenarios may not coincide
• Usability thresholds with BLEU? No, unless we calibrate for a specific TL and text type
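
As a minimal illustration of a project-specific threshold (toy numbers, my own sketch rather than anything from the talk), one can take the lowest metric score at which post-editing the MT output was at least as fast as translating from scratch:

```python
# Minimal sketch (toy data): a project-specific usability threshold taken as
# the lowest metric score at which post-editing beat translating from scratch.
observations = [  # (metric score of the MT output, words/hour when post-editing it)
    (0.15, 420), (0.25, 510), (0.35, 640), (0.45, 760), (0.55, 900),
]
from_scratch_rate = 560.0  # assumed baseline throughput without MT

useful = [score for score, rate in observations if rate >= from_scratch_rate]
print("post-editing paid off from a metric score of", min(useful) if useful else None)
```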

Page 20: z-scores for the Intercept of regression lines, BLEU vs. Ade, by TL & text type (line = 99.9% significance)

Page 21: TAUS Dynamic Quality Framework: PE productivity test

Page 22: TAUS DQF reporting

Page 23: Post-editing: productivity

• Post-editing increases translators’ productivity (tested on IT documentation; no results for other combinations of translation directions, subject domains and text types)
  o Improvements of 74% on average; the figure varies widely between different translators (between 20% and 131%) depending on their attitude and experience with post-editing MT-generated texts
  o Plitt & Masselot (2010)
• Importance of training in acquiring post-editing skills

Page 24: Productivity = Processing speed & cognitive effort

• Time to post-edit single segments
• Fixation intervals detected by an eye-tracking system (O’Brien, 2011)
• Integration of MT into TM settings: translating unmatched segments (Federico et al., 2012; effort = number of changes, as in the sketch below)
  http://amta2012.amtaweb.org/AMTA2012Files/papers/123.pdf
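
A minimal sketch of “effort = number of changes” (my own illustration, not the exact measure used in the paper): counting word-level edits between the raw MT segment and its post-edited version.

```python
# Minimal sketch of "effort = number of changes": counting word-level edits
# between the raw MT segment and its post-edited version.
import difflib

def pe_effort(mt: str, post_edited: str) -> int:
    """Number of word insertions/deletions/replacements made by the post-editor."""
    sm = difflib.SequenceMatcher(a=mt.split(), b=post_edited.split())
    return sum(max(i2 - i1, j2 - j1)
               for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal")

mt = "The agreement was reached by a coalition of four of saucepan five unions of Am"
pe = "The agreement was reached by a coalition of four of Pan Am's five unions"
print("post-editing effort (word changes):", pe_effort(mt, pe))
```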

Page 25: Federico et al., 2012: PE Effort

Page 26: Federico et al., 2012: PE Speed

Page 27: Lessons for translators

• The individual translator’s performance is the major factor in the variation in the data
  o Training matters
• Impact of subject domain & translation direction
• Behind the variation: the importance of understanding the purpose (publication, in-house use?)
  o Semantically correct & poor in style?
  o Semantically, stylistically & terminologically correct?
• Attitude issues
  o Spending time on a good-enough vs. a perfect translation
  o Quality standards influenced by received suggestions?
• Best approach: purpose-based MT evaluation(?)
  o ‘Chess clock’ strategy: acceptable quality for the purpose stated in the brief & within the time allowed
  o Treating MT-translated texts in their own right: minimal changes for acceptable quality

Page 28: Automated metric for usability levels

• Unresolved problem: usability-oriented automated MT evaluation
  o Task-based evaluation metrics (not proximity to the reference) = success in using MT
  o Calibrated for realistic usage scenarios
  o A foundation for new MT models
• Candidates
  o Performance on automated annotation tasks (parsing, terminology, data mining)
  o Selective or weighted lexical overlap (a weighting sketch follows below)
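
A minimal sketch of “selective or weighted lexical overlap” (my own illustration of the idea, not a metric from the talk): overlap with the reference in which terminology and content words count more than function words, so the score can be tuned towards what matters in a given usage scenario.

```python
# Minimal sketch of "selective or weighted lexical overlap": reference-token
# overlap in which terminology and content words outweigh function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "by"}

def weighted_overlap(mt, reference, terms=frozenset(),
                     term_w=3.0, content_w=1.0, function_w=0.2):
    def weight(tok):
        if tok in terms:
            return term_w
        return function_w if tok in FUNCTION_WORDS else content_w

    ref_toks = reference.lower().split()
    mt_toks = set(mt.lower().split())
    total = sum(weight(t) for t in ref_toks)
    matched = sum(weight(t) for t in ref_toks if t in mt_toks)
    return matched / total if total else 0.0

ref = "the chief of the egyptian diplomatic corps"
mt = "the chief of the egyptian diplomacy"
print(round(weighted_overlap(mt, ref, terms=frozenset({"diplomatic", "corps"})), 3))
```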

Page 29: Evaluation of MT vs. for MT

• MT resource evaluation
  o Understanding composition of the training corpora
• Performance benchmarking on text types
  o Understanding which training resources perform best for which project
  o Project-specific model creation, selection and combination optimized for a specific usage scenario

Page 30: Evaluation-guided MT

• Evaluation has been part of the development workflow
  o Nowadays: new ways of integration
  o Systematic prioritizing of models, features, workflow: what translators need (interface, functionality, quality)
  o Accounting for different interests of stakeholders
  o Provides understanding of useful linguistic features that can be integrated into new models

“The prima facie case against operational machine translation from the linguistic point of view will be to the effect that there is unlikely to be adequate engineering where we know there is no adequate science.” (Martin Kay, 1980)

Page 31: FT2MT project idea: fast track and fine-tune MT (Enabling Better Translation, “FT2MT”)

1. Models: Language, translation and reordering models instead of ‘raw’ translation data
2. Framework: Allowing users to try and optimize the combination of models
3. Data: Built on the TAUS Data repository of (now) 54 billion words in 2,200 language pairs
4. Linguistic annotation: Rich linguistic annotation of language data helps to match texts with models for training
5. Evaluation: Automatic task-oriented evaluation (correlated with human evaluation) to support selection of the model combination
6. Translator UI & API: Interface for translators/users to test and fine-tune model selection + for developers to integrate new interfaces
7. Education & Outreach: Showcases, tutorials, and global open access to all developers, users and researchers

CTS/Leeds, TAUS, U Edinburgh, FBK, Translated, RACAI, Lingenio

Page 32: Thanks!

• Questions…