
Page 1: Improving MT quality with evaluation-oriented methods

Machine Translation evaluation for MT development

Bogdan Babych
Centre for Translation Studies
University of Leeds
b.babych@leeds.ac.uk

QTLaunchPad Workshop @ EAMT 2014

Page 2: Overview

• MT evaluation & MT development: science vs. engineering perspectives
  o Re-engineering models of automated evaluation for MT improvement
  o Limitations of automated metrics
• Usability of MT for usage scenarios
  o Engineers’ favorite toy vs. translators’ tool
  o Future: Realistic scenarios & evaluation-guided MT

Page 3: MT evaluation & MT development

• MT evaluation: the field in its own right
  o ‘bigger’ than MT development?
  o ‘science’ methodology: quantifying & understanding natural phenomena behind engineering advances (understanding, well-formedness, mechanisms of language, communication, cognition; how to move forward)
  o Engineering context: test for theories & models
• Human evaluation: ultimate benchmark?
• Absolute vs. purpose-related quality
  o Skopos of human translation
  o Technical definitions of quality & MT performance

Page 4: Automated MT evaluation: source of models for MT

• What works for automated evaluation also improves MT
  o Evaluation uses some computable text features that characterize MT quality
  o Features are calibrated by human judgments or performance (a calibration sketch follows below)
  o New models based on the same features can improve MT quality
• Examples: Named Entities, Information Extraction, MWE & Terminology Translation
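
The following is a minimal sketch of the calibration step mentioned above, not code from the talk: a single computable feature (here, plain token overlap with a reference) is fitted against human adequacy judgments. The overlap feature and the adequacy scores are illustrative assumptions; the sentences reuse the Pan Am example from a later slide.

```python
# Minimal sketch (not from the talk): calibrating a computable text feature
# against human adequacy judgments with a least-squares fit. The feature
# (token overlap with a reference) and the toy scores are illustrative.
import numpy as np

def overlap_feature(candidate: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the candidate."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    return sum(tok in cand for tok in ref) / len(ref) if ref else 0.0

# Hypothetical calibration set: (MT output, reference, human adequacy on a 1-5 scale).
data = [
    ("the chief of the egyptian diplomacy",
     "the chief of the egyptian diplomatic corps", 3.0),
    ("the agreement was reached by a coalition of four of five unions of pan am",
     "the agreement was reached by a coalition of four of pan am's five unions", 4.5),
    ("coalition of four of saucepan five unions of am",
     "the agreement was reached by a coalition of four of pan am's five unions", 2.0),
]

x = np.array([overlap_feature(mt, ref) for mt, ref, _ in data])
y = np.array([score for _, _, score in data])

# Fit adequacy ≈ a * feature + b; the fitted parameters are what the slides
# call calibration by human judgments.
a, b = np.polyfit(x, y, 1)
print(f"adequacy ≈ {a:.2f} * overlap + {b:.2f}")
```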

Page 5: NER: evaluation and improvement

• MT errors more frequently destroy relevant contexts for NEs than create spurious ones
• Difficulties for automatic tools in finding NEs are roughly proportional to relative “quality” (the amount of MT degradation)
• NER system (ANNIE, www.gate.ac.uk):
  o the number of extracted Organisation Names gives an indication of Adequacy (a sketch comparing entity counts follows the example below)

ORI: … le chef de la diplomatie égyptienne
HT: the <Title>Chief</Title> of the <Organization>Egyptian Diplomatic Corps</Organization>
MT-Systran: the <JobTitle>chief</JobTitle> of the Egyptian diplomacy
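
A minimal sketch of this evaluation idea, not the GATE/ANNIE pipeline used in the work: a crude capitalisation-based stand-in recogniser checks how many named entities from the human translation survive in the MT output. The toy recogniser and the retention score are my own illustrations.

```python
# Minimal sketch, not the GATE/ANNIE pipeline from the slides: a crude
# capitalisation-based stand-in for a real NE recogniser, used only to show
# the idea of comparing NEs extracted from MT output and human translation.
import re

def toy_ner(text: str) -> set:
    """Very crude stand-in extractor: maximal runs of capitalised words."""
    return set(re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text))

def ne_retention(mt: str, ht: str) -> float:
    """Share of NEs found in the human translation that also survive in the MT output."""
    ht_nes, mt_nes = toy_ner(ht), toy_ner(mt)
    return len(ht_nes & mt_nes) / len(ht_nes) if ht_nes else 1.0

ht = "the Chief of the Egyptian Diplomatic Corps"
mt = "the chief of the Egyptian diplomacy"
print(f"NE retention (an indicator of Adequacy): {ne_retention(mt, ht):.2f}")
```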

Page 6: MT Evaluation with Named Entity recognition

[Figure: bar chart of the number of NEs extracted per type (Organization, Title, JobTitle, {Job}Title, FirstPerson, Person, Date, Location, Money, Percent), on a scale of 0 to 700, for the Reference and Expert human translations and for the Candide, Globalink, Metal, Reverso and Systran MT systems.]

Page 7: P&R of Organization Names vs. Human Adequacy & Fluency

[Figure: Precision and Recall of Organization Names (on a 0-0.7 scale) for the human translations (HT-exp., HT-ref) and for the candide, globalink, ms, reverso and systran MT systems, compared with human Adequacy and Fluency.]

Correlation (Pearson r) with human judgments:

           BLEU/ade   BLEU/flu   R(Org)/ade   R(Org)/flu
r correl   0.9535     0.995      0.8682       0.9806
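
The system-level correlations above can be reproduced with a few lines of code; the sketch below uses illustrative numbers rather than the scores behind the slide.

```python
# Minimal sketch (illustrative numbers, not the data behind the slide): the
# system-level Pearson r between an automatic metric and human judgments.
import numpy as np

recall_org = [0.21, 0.35, 0.42, 0.55, 0.61]   # hypothetical R(Org) per MT system
adequacy   = [2.1, 2.9, 3.2, 3.9, 4.3]        # hypothetical mean human adequacy

r = np.corrcoef(recall_org, adequacy)[0, 1]
print(f"Pearson r between R(Org) and Adequacy: {r:.4f}")
```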

Page 8: NER for MT Improvement: RBMT (later generalized to SMT & Hybrid MT)

              ProMT 1998 E-R     ProMT 2001 E-F     Systran 2000 E-F
Mark          N      Score       N      Score       N      Score
+1*           28     +28.0       23     +23.0       18     +18.0
+0.5*          2      +1.0        5      +2.5       24     +12.0
0*             4       0          7       0          8       0
–0.5*          3      –1.5        1      –0.5        1      –0.5
–1*           13     –13.0       14     –14.0       10     –10.0
SUM           50     +14.5       50     +11.0       61     +19.5
Gain                 +29%               +22%               +32%

(Gain = total score / number of scored items, e.g. +14.5 / 50 = +29%.)

Page 9: NER improvement example

Original:
The agreement was reached by a coalition of four of Pan Am's five unions.

Baseline translation (ProMT E-F):
L'accord a été atteint par une coalition de quatre de casserole cinq unions d'Am.
(‘The agreement was reached by a coalition of four of saucepan five unions of Am.’)

DNT-processed translation:
L'accord a été atteint par une coalition de quatre de cinq unions de Pan Am.
(‘The agreement was reached by a coalition of four of five unions of Pan Am.’)
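
The sketch below illustrates the do-not-translate (DNT) processing shown above: named entities are masked with placeholders before translation and restored afterwards. It is my own minimal illustration; the `translate` argument is a hypothetical stand-in for a real MT engine, and the entity list would normally come from a recogniser such as ANNIE.

```python
# Minimal sketch of the DNT idea behind the example above: mask NEs with
# placeholders before MT and restore them afterwards, so the engine cannot
# mistranslate "Pan Am" as "casserole ... Am". `translate` is a hypothetical
# stand-in for a real MT engine.
def dnt_translate(source, named_entities, translate):
    masked, placeholders = source, {}
    for i, ne in enumerate(named_entities):
        tag = f"__DNT{i}__"              # placeholder the MT engine passes through
        placeholders[tag] = ne
        masked = masked.replace(ne, tag)
    translated = translate(masked)       # call the underlying MT system
    for tag, ne in placeholders.items():
        translated = translated.replace(tag, ne)
    return translated

# Toy "MT engine" that only shows the placeholder surviving translation.
fake_mt = lambda s: (s.replace("a coalition of four of", "une coalition de quatre de")
                      .replace("five unions", "cinq unions"))
print(dnt_translate("a coalition of four of Pan Am's five unions", ["Pan Am"], fake_mt))
```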

Page 10: MWEs: MT and automated evaluation of concordances

• Sent-N (de-ori): Source text
• Sent-N (en-mt): Machine Translation output
• Sent-N (en-ht): Human Translation
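
One way to operationalise this, sketched below with my own toy data rather than the original experiment, is to score the MT output of every concordance line containing a given MWE against the human translation and average per MWE; consistently low averages point to systematically mistranslated MWEs like those plotted on the next slide. The German source and its mistranslation are invented for illustration, and NLTK is assumed for the sentence-level score.

```python
# Minimal sketch (toy data, not the original experiment): average a sentence-
# level score over all concordance lines containing a given MWE, so that
# systematically mistranslated MWEs can be ranked. Assumes NLTK is installed.
from collections import defaultdict
from statistics import mean
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

# Hypothetical concordance triples: (source, MT output, human translation).
concordance = [
    ("die maul- und klauenseuche breitet sich aus",
     "the mouth and claw disease is spreading",
     "foot and mouth disease is spreading"),
    ("die maul- und klauenseuche ist zurueck",
     "the mouth and claw disease is back",
     "foot and mouth disease is back"),
]

scores = defaultdict(list)
for src, mt, ht in concordance:
    if "maul- und klauenseuche" in src:      # the MWE being tracked
        scores["maul- und klauenseuche"].append(
            sentence_bleu([ht.split()], mt.split(), smoothing_function=smooth))

for mwe, vals in scores.items():
    print(f"{mwe}: mean score over its concordance = {mean(vals):.3f}, freq = {len(vals)}")
```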

Page 11: Systematically mistranslated MWEs: x=log(freq); y=exp(BLEU)

[Figure: scatter plot of MWEs by frequency (x = log(freq), roughly 0.5 to 2) and translation quality (y = exp(BLEU), roughly 1.02 to 1.46). Plotted MWEs include: depleted uranium, san suu kyi, hazardous substances, vis à vis, electrical and electronic, nobel prize winner, death penalty, arms exports, animal feed, drink driving, renewable energy sources, foot and mouth, raw materials, mad cow disease, sized enterprises, transportation of radioactive, atomic energy agency, intellectual property, animal welfare, constitutional legality, bosnia herzegovina, interception of telecommunications, irish referendum, nuclear materials, lawful interception.]

Page 12: Evaluation feature potential vs. useful MT models

• Evaluation works as feature selection
  o E.g., performance of Information Extraction (template filling)
  o Slot-filling performance from MT indicative of MT quality
• New models for MT require more work
  o Perpetrator, Target, Instrument… slots filled from multiple sentences
  o “terrorist killing of Attorney General”
  o != killing of a terrorist (by analogy to “tourist killing” or “farmer killing”); = killing by terrorists
  o ? “…just pretending to be a terrorist killing war machine…”
  o ? “… who is working for the police on a terrorist killing mission…”
  o ? “…merged into the "TKA" (Terrorist Killing Agency), they would … proceed to wherever terrorists operate and kill them…”

Page 13:

o “X’s defeat” == X’s loss
o “X’s defeat of Y” == X’s victory
• ORI: Swedish playmaker scored a hat-trick in the 4-2 defeat of Heusden-Zolder
  … its defeat of last night
  … their FA Cup defeat of last season
  … their defeat of last season’s Cup winners
  … last season’s defeat of Durham
• ‘Non-monotonic’ filling from text could resolve ambiguity
  o Semantic role labeling beyond sentence level
• Unclear how to use text-level disambiguation in MT models

Page 14: Limitations of metrics

• ‘Used’ or ‘ideologically related’ evaluation metrics overestimate MT performance
  o E.g., BLEU (Callison-Burch et al., 2006)
  o A ‘surprise’ metric is needed
• Metrics require re-calibration to determine usability levels (text types, target languages)
  o Regression parameters for predicting Ade/Flu (a calibration sketch follows below)
• Metric sensitivity depends on the quality level
  o BLEU has lower correlation for closely-related MT & non-native HT
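
A minimal sketch of such re-calibration, with toy numbers rather than the data behind the later slides: a separate BLEU-to-Adequacy regression line is fitted per target language and text type, since the fitted parameters (and any usability threshold derived from them) differ across groups.

```python
# Minimal sketch (toy numbers): fitting separate BLEU -> Adequacy regression
# lines per target language / text type, since one set of parameters cannot
# simply be reused across them.
import numpy as np

# Hypothetical calibration data: (target language, text type) -> (BLEU, adequacy).
calibration = {
    ("fr", "news"): ([0.20, 0.30, 0.40, 0.50], [2.4, 3.0, 3.5, 4.1]),
    ("de", "tech"): ([0.20, 0.30, 0.40, 0.50], [2.0, 2.5, 3.1, 3.6]),
}

params = {}
for group, (bleu, adequacy) in calibration.items():
    slope, intercept = np.polyfit(bleu, adequacy, 1)
    params[group] = (slope, intercept)
    print(f"{group}: adequacy ≈ {slope:.2f} * BLEU + {intercept:.2f}")

# A BLEU score is only interpretable through the parameters of the matching group.
slope, intercept = params[("fr", "news")]
print("BLEU 0.35 on fr/news implies adequacy ≈", round(slope * 0.35 + intercept, 2))
```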

Page 15: Non-native HT vs. MT

Page 16: Sensitivity of BLEU vs. NER-based metrics: MT for distant & closely-related languages

Page 17: Quality parameters for MT usage scenarios

• “There are no absolute standards of translation quality but only more or less appropriate translations for the purpose for which they are intended.” (Sager 1989: 91)
• Purpose-oriented MT quality definition
  o QTLaunchPad’s flagship examples: Post-Editing…
  o Usage scenarios and usability parameters
  o Moving beyond Ade, Flu, Inf … & BLEU
  o Evaluating & Improving MT for different target scenarios

Page 18: MT usage scenarios

• Place in the workflow
  o Pre-/Post-editing
  o Controlled authoring, Sublanguage
  o Fully automatic & unrestricted authoring
• Purpose & deadlines (continuous range)
  o High-quality publication
  o Internal communication (one-off translation)
  o Comprehension, getting information (on-line reading)
  o Performing tasks (following technical instructions, etc.)
  o Multilingual research (legal, medical advice, events)
  o Automatic processing (Information Extraction, Text Classification)

Page 19: Evaluation metrics for usability levels and thresholds?

• If cheaper etc. than task-based evaluation, such metrics would be of real interest for the translation industry
• Establishing a project-specific usability threshold for MT system performance
  o The metric should match the usage scenario
  o Minimal value required to benefit from MT (a threshold sketch follows below)
• Predicting productivity gains from metrics
  o Different thresholds for different scenarios
  o Comparative usability of MT systems for scenarios may not coincide
• Usability thresholds with BLEU? No, unless we calibrate for a specific TL and text type
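
As a minimal illustration of a project-specific threshold (toy numbers, my own sketch rather than anything from the talk), one can take the lowest metric score at which post-editing the MT output was at least as fast as translating from scratch:

```python
# Minimal sketch (toy data): a project-specific usability threshold taken as
# the lowest metric score at which post-editing beat translating from scratch.
observations = [  # (metric score of the MT output, words/hour when post-editing it)
    (0.15, 420), (0.25, 510), (0.35, 640), (0.45, 760), (0.55, 900),
]
from_scratch_rate = 560.0  # assumed baseline throughput without MT

useful = [score for score, rate in observations if rate >= from_scratch_rate]
print("post-editing paid off from a metric score of", min(useful) if useful else None)
```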

Page 20: z-scores for the Intercept of regression lines, BLEU vs. Ade, by TL & text type (line = 99.9% significance)

Page 21: TAUS Dynamic Quality Framework: PE productivity test

Page 22: TAUS DQF reporting

Page 23: Post-editing: productivity

• Post-editing increases translators’ productivity (tested on IT documentation; no results for other combinations of translation directions, subject domains and text types)
  o Improvements of 74% on average; the figure varies widely between different translators (between 20% and 131%) depending on their attitude and experience with post-editing MT-generated texts
  o Plitt & Masselot (2010)
• Importance of training in acquiring post-editing skills

Page 24: Productivity = Processing speed & cognitive effort

• Time to post-edit single segments
• Fixation intervals detected by an eye-tracking system (O’Brien, 2011)
• Integration of MT into TM settings: translating unmatched segments (Federico et al., 2012; effort = number of changes, as in the sketch below)
  http://amta2012.amtaweb.org/AMTA2012Files/papers/123.pdf
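
A minimal sketch of “effort = number of changes” (my own illustration, not the exact measure used in the paper): counting word-level edits between the raw MT segment and its post-edited version.

```python
# Minimal sketch of "effort = number of changes": counting word-level edits
# between the raw MT segment and its post-edited version.
import difflib

def pe_effort(mt: str, post_edited: str) -> int:
    """Number of word insertions/deletions/replacements made by the post-editor."""
    sm = difflib.SequenceMatcher(a=mt.split(), b=post_edited.split())
    return sum(max(i2 - i1, j2 - j1)
               for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal")

mt = "The agreement was reached by a coalition of four of saucepan five unions of Am"
pe = "The agreement was reached by a coalition of four of Pan Am's five unions"
print("post-editing effort (word changes):", pe_effort(mt, pe))
```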

Page 25: Federico et al., 2012: PE Effort

Page 26: Federico et al., 2012: PE Speed

Page 27: Lessons for translators

• The individual translator’s performance is the major factor in the variation in the data
  o Training matters
• Impact of subject domain & translation direction
• Behind the variation: the importance of understanding the purpose (publication, in-house use?)
  o Semantically correct & poor in style?
  o Semantically, stylistically & terminologically correct?
• Attitude issues
  o Spending time on a good-enough vs. a perfect translation
  o Quality standards influenced by received suggestions?
• Best approach: purpose-based MT evaluation(?)
  o ‘Chess clock’ strategy: acceptable quality for the purpose stated in the brief & within the time allowed
  o Treating MT-translated texts in their own right: minimal changes for acceptable quality

Page 28: Automated metric for usability levels

• Unresolved problem: usability-oriented automated MT evaluation
  o Task-based evaluation metrics (not proximity to the reference) = success in using MT
  o Calibrated for realistic usage scenarios
  o A foundation for new MT models
• Candidates
  o Performance on automated annotation tasks (parsing, terminology, data mining)
  o Selective or weighted lexical overlap (a weighting sketch follows below)
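
A minimal sketch of “selective or weighted lexical overlap” (my own illustration of the idea, not a metric from the talk): overlap with the reference in which terminology and content words count more than function words, so the score can be tuned towards what matters in a given usage scenario.

```python
# Minimal sketch of "selective or weighted lexical overlap": reference-token
# overlap in which terminology and content words outweigh function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "by"}

def weighted_overlap(mt, reference, terms=frozenset(),
                     term_w=3.0, content_w=1.0, function_w=0.2):
    def weight(tok):
        if tok in terms:
            return term_w
        return function_w if tok in FUNCTION_WORDS else content_w

    ref_toks = reference.lower().split()
    mt_toks = set(mt.lower().split())
    total = sum(weight(t) for t in ref_toks)
    matched = sum(weight(t) for t in ref_toks if t in mt_toks)
    return matched / total if total else 0.0

ref = "the chief of the egyptian diplomatic corps"
mt = "the chief of the egyptian diplomacy"
print(round(weighted_overlap(mt, ref, terms=frozenset({"diplomatic", "corps"})), 3))
```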

Page 29: Evaluation of MT vs. for MT

• MT resource evaluation
  o Understanding composition of the training corpora
• Performance benchmarking on text types
  o Understanding which training resources perform best for which project
  o Project-specific model creation, selection and combination optimized for a specific usage scenario

Page 30: Evaluation-guided MT

• Evaluation has been part of the development workflow
  o Nowadays: new ways of integration
  o Systematic prioritizing of models, features, workflow: what translators need (interface, functionality, quality)
  o Accounting for different interests of stakeholders
  o Provides understanding of useful linguistic features that can be integrated into new models

“The prima facie case against operational machine translation from the linguistic point of view will be to the effect that there is unlikely to be adequate engineering where we know there is no adequate science.” (Martin Kay, 1980)

Page 31: FT2MT project idea: fast track and fine-tune MT (Enabling Better Translation, “FT2MT”)

1. Models: Language, translation and reordering models instead of ‘raw’ translation data
2. Framework: Allowing users to try and optimize the combination of models
3. Data: Built on the TAUS Data repository of (now) 54 billion words in 2,200 language pairs
4. Linguistic annotation: Rich linguistic annotation of language data helps to match texts with models for training
5. Evaluation: Automatic task-oriented evaluation (correlated with human evaluation) to support selection of the model combination
6. Translator UI & API: Interface for translators/users to test and fine-tune model selection + for developers to integrate new interfaces
7. Education & Outreach: Showcases, tutorials, and global open access to all developers, users and researchers

CTS/Leeds, TAUS, U Edinburgh, FBK, Translated, RACAI, Lingenio

Page 32: Thanks!

• Questions…