
Evaluation of Machine Translation Quality
Marco Turchi, FBK, Trento, Italy
turchi@fbk.eu

Slides from the presentation by Matteo Negri… and myself

Disclaimer

“More has been written about MT evaluation over the past 50 years than about MT itself”
Hovy et al.: Principles of Context-Based Machine Translation Evaluation. Machine Translation, 16, pp. 1–33, 2002 (attributed to Yorick Wilks)

“It is impossible to write a comprehensive overview of the MT evaluation literature”
Adam Lopez: Statistical Machine Translation. ACM Computing Surveys 40(3), pp. 1–49, August 2008.

MT Evaluation, Trento, Doctoral School - April 2016

Outline

•  Importance of MT Evaluation
•  Difficulty of MT Evaluation
•  Human evaluation: fluency/adequacy
•  Automatic evaluation:
   – Reference-based: BLEU, TER, HTER (chosen among MANY others)
   – Reference-free: quality estimation (estimating post-editing effort)

The importance of MT evaluation

•  Answering “How good is an MT system?” as a way to:
   – Choose which system to use for a given task
   – Assess and compare systems’ performance
   – Define the state of the art
   – Drive system development and measure improvements
   – Decide whether to apply MT at all

•  …Necessary (yes, not sufficient) conditions for progress in any research field

•  Difficult task!



Difficulty of MT evaluation

•  No formal definition of “translation” → no definition of “good translation”
•  The notion of quality is inherently subjective
•  Exact quantification is difficult (especially for long sentences)
•  MT errors are very varied in nature
•  Perfect or very poor translations are easy to score, but what happens in between?

Difficulty of MT evaluation

•  Many different acceptable translations for the same sentence
   – I am [experiencing | suffering from | feeling] a throbbing pain.
   – I [feel | can feel | have] a [throbbing pain | painful throbbing].
   – [It is a | It’s in | I’ve got a] throbbing pain.
   – It’s throbbing [and it really hurts | with pain].
   – [It’s painful and | It hurts so much] it’s throbbing.

Difficulty of MT evaluation

•  How would you translate:
   – It’s raining cats and dogs
   – Ace in the hole
   – Beat around the bush
   – Chew the fat
   – Wild goose chase
   – Tie one on
   – Sunny smile

•  Literally, its meaning or the corresponding idiom (if any)?

Difficulty of MT evaluation

•  Classification of errors: a quite rich taxonomy

Note: error types are not mutually exclusive and often co-occur (Vilar et al. 2006)

MT Evaluation, Trento, ISIT School - November 2013

Human vs. automatic evaluation

•  Human MT evaluation:
   – criteria: adequacy (fidelity) and fluency (intelligibility)
   – pros: very accurate, high quality
   – cons: expensive, slow, subjective

•  Automatic MT evaluation:
   – criteria: “similarity” to professional human translation
   – pros: inexpensive, quick, objective
   – cons: quality is “slightly” lower than human check


Human evaluation

•  Given:
   – MT output, source and/or reference translation

•  Task: assess the quality of the MT output

•  Metrics
   – Adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?
     …requires bilingual judges or a reference translation
   – Fluency: is the output good fluent English? This involves both grammatical correctness and idiomatic word choices.
     …monolingual judges are sufficient, no reference needed


Human evaluation: adequacy and fluency

•  Source sentence: Le chat entre dans la chambre.

(a) Adequate fluent translation: The cat enters the room.
(b) Adequate disfluent translation: The cat enter in the room.
(c) Fluent inadequate translation: The cats enter the bedroom.
(d) Disfluent inadequate translation: Bedroom the dogs enters the

Human evaluation: Likert scales

Score   Adequacy         Fluency
5       all meaning      flawless English
4       most meaning     good English
3       much meaning     non-native English
2       little meaning   disfluent English
1       none             incomprehensible

Human evaluation: subjectivity

[Figure: three panels (JUDGE 1, JUDGE 2, JUDGE 3), each placing translations a, b, c and d on fluency/adequacy axes]

•  Perfect or very poor translations are easy to score… but what happens in between?

(a) Adequate fluent translation: The cat enters the room.
(b) Adequate disfluent translation: The cat enter in the room.
(c) Fluent inadequate translation: The cats enter the bedroom.
(d) Disfluent inadequate translation: Bedroom the dogs enters the

Human evaluation: subjectivity

Evaluators disagree!
•  …look at this histogram of adequacy judgments by different human evaluators

[Figure: histogram of adequacy judgments]

Human evaluation: measuring agreement

•  Kappa coefficient

$$K = \frac{p(A) - p(E)}{1 - p(E)}$$

   – p(A): proportion of times that the evaluators agree
   – p(E): proportion of times that they would agree by chance (5-point scale → p(E) = 1/5)
   – Complete agreement: K = 1
   – No agreement higher than chance: K = 0

•  Example: inter-evaluator agreement in WMT 2007

            p(A)   p(E)   K
Fluency     .400   .2     .250
Adequacy    .380   .2     .226
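The kappa values in the table can be recomputed from p(A) and p(E). The following is a minimal Python sketch; the function name `kappa` is illustrative and not part of the slides. Recomputing the adequacy row from the rounded p(A) = .380 gives 0.225, so the tabulated .226 presumably comes from an unrounded p(A).

```python
def kappa(p_agree, p_chance):
    """Kappa coefficient: agreement above chance, scaled by the maximum possible."""
    return (p_agree - p_chance) / (1.0 - p_chance)

# Values from the WMT 2007 example (5-point scale, so p(E) = 1/5).
fluency_k = kappa(0.400, 0.2)
adequacy_k = kappa(0.380, 0.2)

print(f"fluency  K = {fluency_k:.3f}")
print(f"adequacy K = {adequacy_k:.3f}")
```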

Human evaluation: alternatives

•  Ranking translations: is translation X better than translation Y?
   – Evaluators are more consistent

•  Informativeness: answer comprehension questions using the translation (who? where? when? names, numbers, dates etc.)
   – Very hard to devise questions

                   p(A)   p(E)   K
Fluency            .400   .2     .250
Adequacy           .380   .2     .226
Sentence ranking   .582   .333   .373


Human evaluation: alternatives

•  Reading time
   – people read a well-formed text more quickly

•  Post-editing effort (time / HTER)
   – Time required to turn MT into a good translation
   – HTER (Human-Targeted Translation Error Rate): number of editing operations required to turn MT output into an acceptable translation


Automatic metrics for MT evaluation

Requirements for automatic metrics

•  Low cost (compared with human evaluation)
•  Objective (unbiased)
•  Meaningful: score should give intuitive interpretation of translation quality
•  Efficient: to be computed quickly and often
•  Consistent: repeated use of metric should give same results
•  Correct: metric must rank better systems higher

Reference-based metrics

•  Idea: compute a similarity score between a candidate translation and one or more high-quality reference translations
   – References are created by human experts (e.g. professional translators)
   – Several references allow us to account for variability of good translations

•  Criterion for validating automatic metrics: automatic scores must correlate with human ones on test data

Reference-based metrics

•  Typically:

$$\frac{1}{k} \sum_{i=1}^{k} \mathrm{sim}(\mathrm{ref}_i, \mathrm{cand}), \qquad 1 \le k \le 4$$

   – sim is a similarity metric between sentences
   – sim can use a variety of properties: string distance, word precision/recall, syntactic similarity, semantic distance, etc.

WER: ratio of smallest edit distance and output length
BLEU: weighted combination of n-gram precisions
TER: normalized number of edits to match the closest reference
METEOR: harmonic mean of unigram precision/recall
NIST, PER, GTM, HTER, TERP, CDER, BLANC, ULC, MT-NCD, ATEC, TESLA, SEPIA, IQTM, BEWT-E, MEANT, etc.

“candidate”, “reference”, “n-grams”

Candidate (or “target” or “hypothesis”): the gunman was shot dead by police .

Reference translation: the gunman was shot to death by the police .

N-grams (those the candidate shares with the reference):
   1-grams: the, gunman, was, shot, by, police, .
   2-grams: the gunman, gunman was, was shot, police .
   3-grams: the gunman was, gunman was shot
   4-grams: the gunman was shot
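The n-grams listed for this example are the ones that occur in both the candidate and the reference. A minimal Python sketch of how they can be extracted follows; the helper name `ngrams` and the whitespace tokenization (with the final period as its own token) are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return every contiguous n-gram of a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

candidate = "the gunman was shot dead by police .".split()
reference = "the gunman was shot to death by the police .".split()

# For each order n, keep the candidate n-grams that also occur in the reference,
# in candidate order and without repeats.
shared = {}
for n in range(1, 5):
    ref_set = set(ngrams(reference, n))
    kept = []
    for gram in ngrams(candidate, n):
        if gram in ref_set and gram not in kept:
            kept.append(gram)
    shared[n] = [" ".join(gram) for gram in kept]

for n, grams in shared.items():
    print(f"{n}-grams:", ", ".join(grams))
```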

The BLEU metric (BiLingual Evaluation Understudy)

•  Proposed by IBM [Papineni et al., 2001] (name from IBM’s color)
•  A numerical measure of closeness between texts
•  Rationale: the closer MT is to human translation, the better
•  Idea: check matches of words (unigrams) and phrases (n-grams) between:
   – one hypothesis (the translation produced by MT)
   – a set of references (professional human translations)
•  Criterion: the more the matches, the better the hypothesis
•  Needs good quality references to cover linguistic variety

Important: only the target language is taken into account!


The BLEU metric (BiLingual Evaluation Understudy)

[Figure: a reference (REF) and three hypotheses (HYP1, HYP2, HYP3), labelled VERY GOOD, BAD, VERY BAD]


The BLEU metric: modified n-gram precision

•  n-gram precision: percentage of n-grams in the hypothesis that occur also in (any of the) references (0 ≤ p ≤ 1)
   – matches of shorter n-grams (n = 1, 2) capture adequacy
   – matches of longer n-grams (n = 3, 4, ...) capture fluency

•  Modified: a reference word is considered exhausted after a matching word is identified in the hypothesis.
   – Example:
     Hyp: the the the the the the the
     Ref: the cat is on the mat

$$p_1^{\text{standard}} = \frac{7}{7} \qquad p_1^{\text{modified}} = \frac{2}{7}$$
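The difference between the standard and the modified unigram precision in this example can be checked with a few lines of Python. This is a minimal sketch for a single reference; the function name `unigram_precisions` is illustrative, and clipping is done by capping each hypothesis word count at its count in the reference.

```python
from collections import Counter

def unigram_precisions(hyp_tokens, ref_tokens):
    """Return (standard, modified) unigram precision against one reference."""
    hyp_counts = Counter(hyp_tokens)
    ref_counts = Counter(ref_tokens)
    total = len(hyp_tokens)
    # Standard: every hypothesis word that appears anywhere in the reference counts.
    standard = sum(c for w, c in hyp_counts.items() if w in ref_counts)
    # Modified: a reference word is "exhausted" once matched, so counts are clipped.
    clipped = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return standard / total, clipped / total

hyp = "the the the the the the the".split()
ref = "the cat is on the mat".split()
standard, modified = unigram_precisions(hyp, ref)
print(standard, modified)
```

Here "the" occurs twice in the reference, so only two of the seven hypothesis tokens are credited in the modified version.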

The BLEU metric: brevity penalty

•  Brevity penalty (BP): to penalize too short hypotheses
   – Example:
     Hyp: the
     Ref: the cat is on the mat
     …Can’t just type out single word “the” (precision 1.0!)
   – c = length of MT hypothesis, r = length of the closest reference
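With c and r as defined above, the standard BLEU definition sets the brevity penalty to 1 when c > r and to exp(1 − r/c) otherwise, which agrees with the exp(1 − 9/8) = 0.8825 used in the computation example of these slides. A minimal Python sketch, with an illustrative function name:

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if the hypothesis is longer than the closest reference, else exp(1 - r/c)."""
    if c > r:
        return 1.0
    return math.exp(1.0 - r / c)

# One-word hypothesis "the" against the six-word reference "the cat is on the mat".
bp_short = brevity_penalty(1, 6)
# Hypothesis of 8 tokens against a closest reference of 9 tokens.
bp_example = brevity_penalty(8, 9)
print(round(bp_short, 4), round(bp_example, 4))
```

The one-word hypothesis keeps its perfect precision but is scaled down to well under 1% by the penalty.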


The BLEU metric: computation

BLEU = Brevity Penalty × geometric mean of p1, p2, …, pn
(where pn is the modified n-gram precision for 1 ≤ n ≤ 4)

Hypothesis: The gunman was shot dead by police.
   – Ref 1: The gunman was shot to death by the police.
   – Ref 2: The hitman was killed by the police forces.
   – Ref 3: Police killed the gunman.
   – Ref 4: The gunman was shot dead by the police.

•  Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
•  Brevity Penalty: c = 8, r = 9, BP = exp(1 − 9/8) = 0.8825
•  Final Score:

$$\sqrt[4]{1 \times 0.86 \times 0.67 \times 0.6} \times 0.8825 = 0.68$$


NOTE: this is a product! If one of the factors is 0 (e.g. no 4-gram matches), the final score will be 0! For this reason the final score is usually calculated on the entire evaluation corpus, not on single sentences!
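The worked example can be reproduced end to end. The sketch below is a sentence-level illustration only, under stated assumptions: lowercased, whitespace-split tokens with the period as its own token, the four references of the example (Ref 2 being "The hitman was killed by the police forces."), and helper names that are not from the slides. It returns 0 as soon as one precision is 0, which is exactly the behaviour the note warns about.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision: each n-gram is credited at most as often
    as it appears in the single reference where it is most frequent."""
    hyp_counts = Counter(ngrams(hyp, n))
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def bleu(hyp, refs, max_n=4):
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # a single zero factor zeroes the whole product
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c = len(hyp)
    # Closest reference length (ties resolved towards the shorter reference).
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * geo_mean

hyp = "the gunman was shot dead by police .".split()
refs = [
    "the gunman was shot to death by the police .".split(),
    "the hitman was killed by the police forces .".split(),
    "police killed the gunman .".split(),
    "the gunman was shot dead by the police .".split(),
]
score = bleu(hyp, refs)
print(round(score, 2))
```

With unrounded precisions the score is about 0.675, which rounds to the 0.68 of the example.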

The BLEU metric: correlation with training set size

[Figure: BLEU score against the number of sentence pairs used in training. Experiments by Philipp Koehn]

The BLEU metric: correlation with human judgments

[Figure: from George Doddington, NIST, 2002]

The BLEU metric limitations: examples

•  Reference: a b c d e f g h i j k l m n o p q r s
•  Hyp1:      a b c d f e g i h j l k m o n p r q s
•  Hyp2:      a b c d e f g x x x x x x x x x x x x

             Hyp1     Hyp2
1-gram       1.0000   0.3684
2-gram       0.1666   0.3333
3-gram       0.1176   0.2941
4-gram       0.0625   0.2500
BLEU score   0.1871   0.3083

Longer n-grams dominate shorter n-grams!


The BLEU metric limitations: examples

•  Reference: George Bush will often take a holiday in Crawford Texas

HYPOTHESES                                                     BLEU
George Bush will often take a holiday in Crawford Texas       1.0000
Bush will often holiday in Texas                              0.4611
Bush will often holiday in Crawford Texas                     0.6363
George Bush will often holiday in Crawford Texas              0.7490
George Bush will not often vacation in Texas                  0.4491
George Bush will not often take a holiday in Crawford Texas   0.9129!

Small changes in the text may determine big meaning changes!

The BLEU metric limitations: examples

•  Reference: The President frequently makes his vacation in Crawford Texas

HYPOTHESES                                            BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas   0.2627
holiday often Bush a takes George in Crawford Texas   0.2627

WHY? …The “invisible region” [Hovy & Ravichandran 2003]


The BLEU metric limitations: improvements

•  Reference: The President frequently makes his vacation in Crawford Texas
              DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #1: matches at POS level [Hovy & Ravichandran 2003]

HYPOTHESES                                            BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas   0.2627
holiday often Bush a takes George in Crawford Texas   0.2627

HYPOTHESES                                            BLEU (4-gram)
NNP NNP RB VBZ DT NN IN NNP NNP                       0.5411
NN RB NNP DT VBZ NNP IN NNP NNP                       0.3117

The BLEU metric limitations: improvements

•  Reference: The President frequently makes his vacation in Crawford Texas
              DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #2: (Words + POS) / 2 [Hovy & Ravichandran 2003]

HYPOTHESES                                            BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas   0.2627
holiday often Bush a takes George in Crawford Texas   0.2627

HYPOTHESES                                            BLEU (4-gram)
NNP NNP RB VBZ DT NN IN NNP NNP                       0.4020
NN RB NNP DT VBZ NNP IN NNP NNP                       0.2966

The BLEU metric: pros and cons

•  BLEU ranges from 0 to 1 (translation quality as “percentage”)
•  The more the references, the higher the score
•  High correlation with human assigned scores, especially on fluency
•  Ranking of “similar” MT systems equivalent to human ranking
•  Collecting references has a high cost
•  Longer n-grams dominate shorter n-grams
•  Small changes in the text (e.g. “not”) may determine big meaning changes
•  Scores are not straightforward to interpret (BLEU = 30… so what?)
•  Syntax poorly modeled
•  Ignores word relevance and semantic equivalence (string level comparisons)
•  Can fail in ranking systems based on different approaches

The TER metric (Translation Edit Rate)

•  Idea: simulate post-editing [Snover et al. 2006]
   – Given a translation hypothesis (H) AND a reference translation (R)
   – Calculate the minimal number of edits to transform H into R (normalized by the average length of the references)
   – Possible edits: insertion/deletion/substitution of single words, shifts of word sequences

•  Criterion: the fewer the edits, the better the hypothesis


The TER metric: example

REF: Saudi Arabia denied this week information published in the American NYT
HYP: this week the Saudis denied information published in the NYT

•  HYP: fluent, same meaning as the reference (except “American”)
•  but not exact match:
   – this week is shifted
   – Saudi Arabia in the REF appears as the Saudis in the HYP
   – American appears only in the REF
•  Number of edits = 4 (1 shift, 2 substitutions, and 1 deletion):

   TER% = 4/11 × 100 = 36.36%
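The count of 4 edits can be verified by applying the single shift by hand and then measuring a word-level edit distance on what remains. The Python sketch below does only that: the shift of "this week" is hard-coded for this example, and no search over possible shifts is performed, so it is not a TER implementation. Function and variable names are illustrative.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # drop a hypothesis word
                           cur[j - 1] + 1,           # add a reference word
                           prev[j - 1] + (h != r)))  # match or substitute
        prev = cur
    return prev[-1]

ref = "Saudi Arabia denied this week information published in the American NYT".split()
hyp = "this week the Saudis denied information published in the NYT".split()

# One shift: move "this week" behind "the Saudis denied".
shifted = hyp[2:5] + hyp[0:2] + hyp[5:]

edits = 1 + word_edit_distance(shifted, ref)  # 1 shift + remaining word edits
ter = edits / len(ref)
print(edits, f"{100 * ter:.2f}%")
```

After the shift, two substitutions ("the Saudis" against "Saudi Arabia") and one edit for "American" remain, giving 4 edits over 11 reference words.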

The TER metric: discussion

•  Evaluation close to a real task (post-editing)
•  Results are more interpretable than for other metrics
•  Can be computed only for a single sentence
•  Insensitive to semantic closeness (e.g. synonyms, paraphrases)
•  Complexity of computation (optimal calculation of edit distance with move operations: NP-complete)
   – approximate search via dynamic programming (decomposition in sub-problems)

The HTER metric (Human-targeted TER)

•  TER ignores semantic equivalence and heavily depends on the reference translation

•  Idea: references as human post-editions
   – Perform human post-editing to transform the hypothesis into the closest acceptable translation
   – HTER measures TER between the hypothesis and the resulting reference translation

•  Criterion: the fewer the edits, the better the hypothesis (same as TER)

TER/HTER: pros/cons

•  TER
   – intuitive measure of MT quality
   – adequate for fast development
   – reasonably correlates with human judgments (> BLEU, < than others e.g. METEOR)
   – ignores semantic equivalence

•  HTER
   – intuitive measure of MT quality
   – highest correlation with human judgments
   – possible substitute for human evaluations because less subjective
   – expensive: 3 to 7 minutes per sentence for a human to annotate
   – not suitable for using in the development cycle of an MT system

Application-oriented MT evaluation

Quality Estimation (QE)

•  From controlled lab tests and evaluation campaigns…
•  …to MT evaluation in real-life conditions (e.g. the CAT framework)
   – As a support to human translators
   – At run-time
   – Without reference translations

(One) scenario: the CAT framework

The CAT tool
1. Segments the input document
2. Provides, for each segment:
   •  Suggestions from a translation memory (TM)
   •  Suggestions from an MT engine

The translator, for each segment
1. Selects the best suggestion
2. Post-edits it (if necessary) to reach publication quality

(One) scenario: the CAT framework

•  Questions:
   – Is this suggestion good enough to be published?
   – Can I trust it?
   – Can a reader get the gist?
   – Is it publishable “as is”?
   – If not, what is better: post-editing or rewriting?

•  Huge market interest
   – Increased translators’ productivity
   – No manual intervention on reliable MT suggestions


Predicting MT output quality

•  Task: automatically estimate MT output quality at run-time and without reference translations
•  Approach: supervised learning. First (training step), a model is learned from human-labelled data. Then (prediction step), the model is used to label new, unseen data.

Positive/Negative examples

Possible features: hasWings, hasFeathers, sound, moves, hasPalmateFeet, etc.
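The two steps of the supervised approach (training on labelled data, then prediction on unseen data) can be illustrated with a deliberately tiny regression. Everything in this Python sketch is a made-up toy: the single feature (a hypothetical count of unknown words per sentence), the HTER-like labels and the function names are assumptions for illustration, not data or methods from the slides. Real QE systems use many features and more capable learners.

```python
def fit_line(xs, ys):
    """Training step: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Prediction step: apply the learned line to an unseen feature value."""
    slope, intercept = model
    return slope * x + intercept

# Toy training data (hypothetical): feature value -> human-derived quality label.
train_features = [0, 1, 2, 3]
train_labels = [0.1, 0.2, 0.3, 0.4]

model = fit_line(train_features, train_labels)
estimate = predict(model, 4)  # unseen sentence, no reference translation needed
print(round(estimate, 2))
```

The point of the sketch is the workflow: labels are only needed at training time, while prediction uses features computed from the source and the MT output alone.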

Predicting MT output quality

•  What is a good indicator of translation quality?

•  It should take into account:
   – Correctness and usefulness of the translation
   – Cognitive effort needed by a human for the correction

•  All these aspects can be summarized in the:
   – Post-editing effort


Predicting MT output quality
• What is post-editing?
– A process of modification rather than revision (Loffler-Laurian 1985)
– The “term used for the correction of machine translation output by human linguists/editors” (Veale and Way 1997)
– Repairing texts (Krings, 2001)
– “…the process of improving a machine-generated translation with a minimum of manual labor” (TAUS report, 2010)


Predicting MT output quality
• What is post-editing effort?
– The effort made by a post-editor to manually improve a machine-generated translation
• Measures of post-editing effort:
– Quality score (as estimated by humans on a 1-5 Likert scale)
– Number of edit operations (HTER)
– Post-editing time (total seconds or seconds per word)
– Number of keystrokes
– …


Quality scores
• Arbitrary choice of the levels of quality:
1 = requires complete retranslation;
2 = requires some retranslation;
3 = very little post-editing needed;
4 = fit for purpose
• Labeling requires human intervention
• A precise measure
• Subjective/expensive/time-consuming task


Quality scores
• Workshop on SMT scoring schema:
1. The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited, needs to be translated from scratch.
2. About 50%-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
3. About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
4. About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.
5. The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.


Post-editing time
• Seconds needed to post-edit a sentence
• Normalized version in seconds per word
– little time = good translation
– large time = bad translation
• Usually includes:
– reading time
– searching for information on external resources
– typing time
– extra time for secondary activity (e.g. correction)
• High variability across sentences and translators


HTER (again!)
• Human-targeted TER is the standard edit distance between the original machine translation and its minimally post-edited version
– edits: insertion, deletion, substitution, shift
• Lower variability (wrt time) across sentences/translators


$$\mathrm{HTER} = \frac{\#\text{edits}}{\#\text{words in the post-edited version}}$$
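The ratio above can be approximated with a few lines of code. The sketch below is a simplification: it counts only word-level insertions, deletions and substitutions (plain Levenshtein distance) and does not model the shift operation, so it may report more edits than a real TER implementation would. The function names and whitespace tokenization are assumptions made for the example.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def approximate_hter(mt_output, post_edited):
    """#edits divided by the number of words in the post-edited version.

    Shifts are NOT modelled here, unlike in full TER/HTER.
    """
    mt_words = mt_output.split()
    pe_words = post_edited.split()
    if not pe_words:
        raise ValueError("the post-edited version must contain at least one word")
    return word_edit_distance(mt_words, pe_words) / len(pe_words)


# One substitution (enter -> enters) and one deletion (in): 2 edits over 5 words.
print(approximate_hter("The cat enter in the room", "The cat enters the room"))
```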

Post-editing time vs. HTER
• Time: pros/cons
– Accounts for different efforts in translating different words
– Variability among post-editors
• HTER: pros/cons
– Objective, easy-to-compute measure
– Less variance across post-editors (bad = bad for all)
– Ignores different efforts in translating different words


Predicting MT output quality
• Tasks:
– Automatic labeling
  • real values = regression
  • integers = classification
– Automatic ranking
• Granularity
– Word level (e.g. “The cat enter in the room”)
– Sentence level (e.g. “The cat enter in the room”: 2.27)
– Document level


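Evaluation Metrics - Regression
• Regression (predictions as real values):
– Mean Absolute Error (MAE)
– Root Mean Squared Error (RMSE)
• Given a set of predicted scores H and a set of human scores V:

$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} \left| H(s_i) - V(s_i) \right|}{N}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left( H(s_i) - V(s_i) \right)^2}{N}}$$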
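As a minimal sketch, both errors can be computed directly from two equally long lists of scores. The names `H` and `V` follow the slide; the example values are made up for illustration.

```python
import math


def mean_absolute_error(H, V):
    """MAE: average absolute difference between predicted and human scores."""
    if len(H) != len(V) or not H:
        raise ValueError("H and V must be non-empty and of equal length")
    return sum(abs(h - v) for h, v in zip(H, V)) / len(H)


def root_mean_squared_error(H, V):
    """RMSE: square root of the average squared difference."""
    if len(H) != len(V) or not H:
        raise ValueError("H and V must be non-empty and of equal length")
    return math.sqrt(sum((h - v) ** 2 for h, v in zip(H, V)) / len(H))


# Hypothetical predicted (H) and human (V) quality scores for four sentences.
H = [3.0, 1.0, 4.0, 2.0]
V = [2.0, 1.0, 2.0, 2.0]
print(mean_absolute_error(H, V))      # (1 + 0 + 2 + 0) / 4 = 0.75
print(root_mean_squared_error(H, V))  # sqrt((1 + 0 + 4 + 0) / 4) = sqrt(1.25)
```

RMSE squares the differences before averaging, so a few large errors weigh more than in MAE.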


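Evaluation Metrics - Classification
• Classification (predictions as integers):
– Precision (Pr)
– Recall (Re)
– F-score (F1)
• Given a set of predicted scores H and a set of human scores V
• An example for binary classification

            V = 1             V = -1
H = 1       True Positive     False Positive
H = -1      False Negative    True Negative

$$\mathrm{Pr} = \frac{tp}{tp + fp} \qquad \mathrm{Re} = \frac{tp}{tp + fn} \qquad F_1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}$$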
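The three scores follow directly from the counts in the table. The sketch below assumes labels 1 and -1 as in the slide; returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def precision_recall_f1(H, V, positive=1):
    """Precision, recall and F1 of predictions H against human labels V."""
    if len(H) != len(V):
        raise ValueError("H and V must have the same length")
    tp = sum(1 for h, v in zip(H, V) if h == positive and v == positive)
    fp = sum(1 for h, v in zip(H, V) if h == positive and v != positive)
    fn = sum(1 for h, v in zip(H, V) if h != positive and v == positive)
    # Assumed convention: a score with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: tp = 2, fp = 1, fn = 1, tn = 2.
H = [1, 1, -1, 1, -1, -1]
V = [1, -1, -1, 1, 1, -1]
print(precision_recall_f1(H, V))  # all three scores are 2/3 here
```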


Evaluation Metrics - Ranking
• Spearman’s Rank Coefficient
• Delta Average (introduced at WMT 2012)

        System               Human
        Score    Ranking     Judgment    Ranking
s1      3.2      3           5           1
s2      1        5           1           5
s3      5        1           4           2
s4      2.7      4           2           4
s5      4        2           3           3

Rank Similarity Metric (compares the system ranking with the human ranking)
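For the five sentences in the table, Spearman's coefficient can be worked out with the usual formula for rankings without ties, $\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$, where $d_i$ is the rank difference for sentence $s_i$. The sketch below assumes there are no tied scores and that higher scores rank first; the helper names are illustrative. Delta Average is not covered here.

```python
def ranks_from_scores(scores):
    """Rank 1 goes to the highest score (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank coefficient for two rankings without ties."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of the same length (at least 2 items)")
    squared_differences = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * squared_differences / (n * (n ** 2 - 1))


system_scores = [3.2, 1, 5, 2.7, 4]   # s1..s5, from the table
human_judgments = [5, 1, 4, 2, 3]     # s1..s5, from the table

system_ranks = ranks_from_scores(system_scores)    # [3, 5, 1, 4, 2]
human_ranks = ranks_from_scores(human_judgments)   # [1, 5, 2, 4, 3]
# Rank differences: 2, 0, -1, 0, -1; sum of squares = 6; rho = 1 - 36/120, about 0.7
print(spearman_rho(system_ranks, human_ranks))
```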


Quality indicators
• Features can be extracted from
– The source sentence (“Complexity” indicators)
– The translated sentence (“Fluency” indicators)
– Source and target sentences (“Adequacy” and other indicators)
– MT system during the translation process (“Confidence” indicators)


[Figure: diagram with the labels Source sentence, MT system, Translated sentence]

Quality indicators - Complexity
• Capture the difficulty of translating the source sentence
• Complex sentences are harder to translate
– source sentence length
– n-gram language model probability
– number of punctuation marks
– source sentence type/token ratio (e.g. #nouns/#tokens)
– avg. # of translations per word (as given by probabilistic dictionaries)
– % of content/non-content words
– …
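Three of the simpler complexity indicators can be sketched without any external resource. The code below assumes whitespace tokenization, ASCII punctuation only, and computes the type/token ratio as distinct words over total words, which is one common reading and may differ from the variant intended in the slide. The function name and the returned keys are illustrative.

```python
import string


def complexity_features(source_sentence):
    """Extract a few surface 'complexity' indicators from a source sentence."""
    tokens = source_sentence.split()
    # Lowercase the words and strip surrounding ASCII punctuation.
    words = [token.strip(string.punctuation).lower() for token in tokens]
    words = [word for word in words if word]
    punctuation_marks = sum(1 for ch in source_sentence if ch in string.punctuation)
    type_token_ratio = len(set(words)) / len(words) if words else 0.0
    return {
        "length": len(words),
        "punctuation_marks": punctuation_marks,
        "type_token_ratio": type_token_ratio,
    }


# 8 words, 2 punctuation marks, 6 distinct words out of 8.
print(complexity_features("The cat, the dog and the bird sleep."))
```

Indicators such as language model probability or translations per word would need a trained model or a probabilistic dictionary and are left out of this sketch.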


Quality indicators - Fluency
• Capture the level of naturalness of the translation in the target language
• The translation should conform to the target language in terms of grammar, with lexical choices appropriate to the genre of the source text
– n-gram language model probability
– POS-tag target language model
– …
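The n-gram language model probability can be illustrated with a toy bigram model. Everything below is an assumption made for the example: a three-sentence target-language corpus, add-one smoothing, and the average log-probability per bigram as the fluency feature. Real systems rely on large corpora and dedicated language modelling toolkits.

```python
import math
from collections import Counter

# Hypothetical target-language corpus used to estimate the bigram counts.
TOY_CORPUS = [
    "the cat enters the room",
    "the dog enters the house",
    "the cat sleeps",
]


def train_bigram_counts(corpus):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocabulary = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocabulary.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocabulary)


def avg_log_probability(sentence, model):
    """Average add-one-smoothed bigram log-probability of a sentence."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for previous, word in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(previous, word)] + 1
        denominator = context_counts[previous] + vocab_size
        log_prob += math.log(numerator / denominator)
    return log_prob / (len(tokens) - 1)


model = train_bigram_counts(TOY_CORPUS)
fluent = avg_log_probability("the cat enters the house", model)
scrambled = avg_log_probability("house the enters cat the", model)
print(fluent > scrambled)  # the natural word order gets the higher score
```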


Quality indicators - Adequacy
• Capture the level of semantic equivalence between source and translation
• Source and target sentences should convey the same meaning. Meaning drifts/losses from source to target sentence indicate a bad translation
– % of aligned words in source and target
– % of alignments between words with the same part of speech
– % of aligned nouns/verbs/adjectives
– aligned IDF mass (IDF as indicator of term relevance)
– …


Quality indicators - Confidence
• Capture the level of confidence of the SMT system
• Sentences for which the translation process is complex are more likely to be bad translations
– length N of the N-best list
– number of pruned hypotheses
– log-likelihood score
– avg. edit distance of the 1-best from the first k-bests
– …


Open Issues
• Lack of an objective quality score able to catch cognitive efforts
– A new score that contains the main features of HTER and correlates well with PE time
• Lack of a technique able to threshold the quality score (bad vs. good translations)
– Is HTER = 0.3/0.5/0.7 a bad or good translation?
– Useful in the CAT tool scenario, where it is necessary to discard bad translations


Open Issues
• More than 1,000 quality indicators have been developed in recent years.
– Do we need all of them in a real application?
– Which are the most reliable in each group?
– Which is the best combination?
• Subjectivity in the post-editor's work and in the task
– A single quality estimator for very different post-editor behaviors and tasks
– Adaptability/personalization


MT Evaluation Dilemma

Summary
• MT evaluation: a hot topic…
– Shared evaluation methods/routines are a key asset in any field
• …but a difficult task
– We talked about error variability, costs, speed, replicability, subjectivity, correlation with human judgments, etc.


Summary
• Human evaluation
– Accurate, high quality, meaningful, expensive, slow, subjective
– Fluency, adequacy
• Automatic evaluation
– Cheap, quick, repeatable, objective, approximate, less accurate
– Reference-based: BLEU, TER, HTER (pros and cons)
– Reference-free: quality estimation (goal, methods, open issues)

Summary
• Key concepts: Adequacy, Reference translation, Agreement, Correlation, Post-editing effort, CAT tool, Feature, Cognitive effort, HTER, Mean Absolute Error

Evaluation of Machine Translation Quality

Marco Turchi
FBK Trento, Italy
turchi@fbk.eu