a tool for automatic rule-based post-editing of...

1
a Tool for Automatic Rule-based Post-editing of SMT This research was supported by the grants FP7-ICT-2011-7-288487 (MosesCore), FP7-ICT-2013-10-610516 (QTLeap), GAUK 1572314, and SVV 260 104. This work has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013). Depx Rudolf Rosa [email protected] Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics Input Treex Analysis tokenizer (Treex) lemmatizing tagger (MorphoDiTa) word aligner (GIZA++) dependency parser (MST) dependency relations labeller (Treex) named entity recognizer (Stanford) Noun – Adjective Agreement increase NN half-hearted JJ nárůstu NNIS2 nárůst F: gender = feminine polovičaté AAFS2 polovičatý nárůstu NNIS2 nárůst polovičatého AAIS2 polovičatý I: gender = masculine inanimate morphological generator word form: polovičatého lemma: polovičatý tag: AAIS2 morphological tag word form lemma set adjective gender to noun gender Preposition – Noun Agreement o PP--6 o about IN Mandela NNP o PP--6 o set noun case to preposition case Translation of ‘of’ schodek NNIS1 schodek nance NNFP4 nance decit NN of IN set noun case to genitive nances NNS schodek NNIS1 schodek nancí NNFP2 nance Translation of Possessive Nouns je VBS3P být ryba NNFS1 ryba sh NN David NNP generate corresponding possessive adjective ’s POS David NNMS1 David ryba NNFS1 ryba Davidova AUFS1M Davidův ‘David is sh’ Subject Morphological Case vyzvali VpMPXR vyzvat called VBD voters NNs set subject case to nominative voliče NNMP4 volič ‘voters were called’ vyzvali VpMPXR vyzvat voliči NNMP1 volič Verb Tense Translation defending VBG are VBP set verb tense to equivalent of English tense ‘were defending’ generals NNS generálové NNMP1 generál brání VBP3P bránit generálové NNMP1 generál Lost Negation Recovery deserve VB not RB negate the verb ‘deserves’ does VBZ Translation of ‘by’ zmař eno VsNSXX zmařit foiled JJ by IN set noun case to instrumentative voluntarism NN zmař eno VsNSXX zmařit Subject Person Projection are VBP we PRP ‘they are’ jsou VBP3P být jsme VBP1P být set verb person to English subject person Depx Correction Rules current version: 28 rules Input The source is available to all components to minimize error propagation. CU Chimera English Czech MT System http://ufal.m.cuni.cz/depx Get Depx Today! • implemented in Perl in Treex NLP framework • needs Linux + 4GB RAM • 3 seconds per sentence • released under GNU/GPL • source is commented Depx Evaluation (Δ BLEU) CU Bojar CU TectoMT UEDIN JHU Eurotran Microsoft Bing Google Translate +0.07 0.02 +0.23 +0.32 +0.15 +0.37 0.00 WMT 2012 +0.47 0.10 +0.64 +0.42 +0.21 ? +0.23 WMT 2011 Evaluation of CU Chimera CU Chimera UEDIN CU Bojar Google Translate CU TectoMT WMT 2013 0.578 0.525 0.580 0.562 0.476 Manual 20.0 19.1 20.1 ? 14.7 BLEU 19.2 18.6 18.9 ? 14.2 cBLEU WMT 2014 0.371 0.356 0.333 0.169 0.175 Manual 22.0 22.1 22.1 ? 15.8 BLEU 21.1 21.6 20.9 20.2 15.4 cBLEU Moses (Bojar) - phrase-based SMT - large-scale data - morphological tags as factors for a better grammatical coherence Depx (Rosa) - automatic post-editing - generates the nal output TectoMT (Popel) - hybrid (rule-based/statistical) MT system - transfer at tectogrammatical (deep syntactical) layer - our combination: get an extra phrase table for Moses from TectoMT output 7: instru- mentative voluntarismem NNIS7 voluntarismums 2: genitive voluntarismu NNIS2 voluntarismus 4: accusative 1: nominative 1: nominative Mandela NNMS1 Mandela 6: locative Mandelovi NNMS6 Mandela ‘we are’ ‘does not deserve’ N: negated A: armative nezaslouží VBS3P-N zasloužit zaslouží VBS3P-A zasloužit R: past P: present bránili VpMPXR bránit ‘are defending’ 4: accusative 2: genitive Depx improves most MT systems! CU Chimera is the state-of-the-art for English Czech MT! conversion to tectogrammar (Treex) source English sentence Czech sentence from SMT English Czech Machine Translation Moses/Google Translate/... a-tree zone=en I Sb PRP do AuxV VBP n't Neg RB mean Pred VB to AuxV TO imply Adv VB that Atr DT garbage Sb NN is AuxV VBZ n't Neg RB sorted Adv VBN in AuxP IN Brno Adv NNP . AuxK . a-tree zone=cs Nechci Pred VB-S1PA naznačovat Obj Vf , AuxX Z: že AuxC J, odpadky Sb NNIP1 se AuxR P7-X4 třídí Adv VB-P3PA v AuxP RR6 Brně Adv NNNS6 . AuxK Z: Analytical (surface syntax) dependency trees Brno g_ n-tree zone=en Brno g_ n-tree zone=cs Named entities t-tree zone=en #PersPron ACTn:subj I mean.enunc PREDv:n do n't mean #Cor ACTn:elided imply PAT v:to+inf to imply that RSTRn:attr that garbage ACTn:subj garbage sort PAT v:n is n't sorted Brno LOCn:in+X in Brno t-tree zone=cs #PersPron ACTdrop naznačovat.enunc PREDv:n Nechci naznačovat odpadky ACTn:1 odpadky třídit PAT v:že+n , že se třídí Brno LOCn:v+6 v Brně Tectogrammatical (deep syntax) trees 1. Analyzed input English source Czech MT output 4. Correct output 2. Correction on tag 3. Generation of according word form

Upload: others

Post on 25-Jul-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: a Tool for Automatic Rule-based Post-editing of SMTufal.mff.cuni.cz/biblio/attachments/2014-rosa-m4787378167403770… · a Tool for Automatic Rule-based Post-editing of SMT This research

a Tool for Automatic Rule-based Post-editing of SMT

This research was supported by the grants FP7-ICT-2011-7-288487 (MosesCore), FP7-ICT-2013-10-610516 (QTLeap), GAUK 1572314, and SVV 260 104. This work has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).

DepfixRudolf Rosa

[email protected] University in Prague

Faculty of Mathematics and PhysicsInstitute of Formal and Applied Linguistics

Input Treex Analysistokenizer (Treex)

lemmatizing tagger (MorphoDiTa)

word aligner (GIZA++)

dependency parser (MST)

dependency relations labeller (Treex)

named entity recognizer (Stanford)

Noun – Adjective Agreement

increaseNN

half-heartedJJ

nárůstuNNIS2nárůst

F: gender = feminine

polovičatéAAFS2polovičatý

nárůstuNNIS2nárůst

polovičatéhoAAIS2polovičatý

I: gender = masculine inanimate

morphologicalgenerator

word form:polovičatéholemma: polovičatý

tag: AAIS2

morphological tagword form

lemma

set adjective genderto noun gender

Preposition – Noun Agreement

oPP--6o

aboutIN

MandelaNNP

oPP--6o

set noun caseto preposition case

Translation of ‘of’

schodekNNIS1schodek

financeNNFP4finance

deficitNN

ofIN

set noun caseto genitive

financesNNS

schodekNNIS1schodek

financíNNFP2finance

Translation of Possessive Nouns

jeVBS3Pbýt

rybaNNFS1ryba

fishNN

DavidNNP

generate correspondingpossessive adjective

’sPOS

DavidNNMS1David

rybaNNFS1ryba

DavidovaAUFS1MDavidův

‘David is fish’

Subject Morphological Case

vyzvaliVpMPXRvyzvat

calledVBD

votersNNs

set subject caseto nominative

voličeNNMP4volič

‘voters were called’

vyzvaliVpMPXRvyzvat

voličiNNMP1volič

Verb Tense Translation

defendingVBG

areVBP

set verb tenseto equivalent of English tense

‘were defending’

generalsNNS

generálovéNNMP1generál

bráníVBP3Pbránit

generálovéNNMP1generál

Lost Negation Recovery

deserveVB

notRB negate the verb

‘deserves’

doesVBZ

Translation of ‘by’

zmařenoVsNSXXzmařit

foiledJJ

byIN

set noun caseto instrumentative

voluntarismNN

zmařenoVsNSXXzmařit

Subject Person Projection

areVBP

wePRP

‘they are’

jsouVBP3Pbýt

jsmeVBP1Pbýt

set verb person toEnglish subject person

Depfix Correction Rules current version: 28 rules

Input The source is available to all components to minimize error propagation.

CU Chimera English → Czech MT System

http://ufal.mff.cuni.cz/depfix

Get Depfix Today!

• implemented in Perl in Treex NLP framework• needs Linux + 4GB RAM• 3 seconds per sentence• released under GNU/GPL• source is commented

Depfix Evaluation (Δ BLEU)

CU BojarCU TectoMTUEDINJHUEurotranMicrosoft BingGoogle Translate

+0.07−0.02+0.23+0.32+0.15+0.37

0.00

WMT 2012+0.47−0.10+0.64+0.42+0.21

?+0.23

WMT 2011

Evaluation of CU Chimera

CU Chimera UEDINCU BojarGoogle TranslateCU TectoMT

WMT 2013

0.5780.5250.5800.5620.476

Manual20.019.120.1

?14.7

BLEU19.218.618.9

?14.2

cBLEUWMT 2014

0.3710.3560.3330.169

−0.175

Manual22.022.122.1

?15.8

BLEU21.121.620.920.215.4

cBLEU

Moses (Bojar)- phrase-based SMT- large-scale data- morphological tags as factors for a better grammatical coherence

Depfix (Rosa)- automatic post-editing- generates the final output

TectoMT (Popel)- hybrid (rule-based/statistical) MT system- transfer at tectogrammatical (deep syntactical) layer- our combination: get an extra phrase table for Moses from TectoMT output

7: instru-mentative

voluntarismemNNIS7voluntarismums

2: genitive

voluntarismuNNIS2voluntarismus

4: accusative

1: nominative

1: nominative

MandelaNNMS1Mandela

6: locative

MandeloviNNMS6Mandela

‘we are’ ‘does not deserve’

N: negatedA: affirmative

nezasloužíVBS3P-Nzasloužit

zasloužíVBS3P-Azasloužit

R: past P: present

brániliVpMPXRbránit

‘are defending’

4: accusative

2: genitive

Depfix improves most MT systems!

CU Chimera is the state-of-the-art for English → Czech MT!

conversion to tectogrammar (Treex)

source English sentence

Czech sentence from SMT

English → CzechMachine TranslationMoses/Google Translate/...

a-treezone=en

ISbPRP

doAuxVVBP

n'tNegRB

meanPredVB

toAuxVTO

implyAdvVB

thatAtrDT

garbageSbNN

isAuxVVBZ

n'tNegRB

sortedAdvVBN

inAuxPIN

BrnoAdvNNP

.AuxK.

a-treezone=cs

NechciPredVB-S1PA

naznačovatObjVf

,AuxXZ:

žeAuxCJ,

odpadkySbNNIP1

seAuxRP7-X4

třídíAdvVB-P3PA

vAuxPRR6

BrněAdvNNNS6

.AuxKZ:

Analytical (surface syntax)dependency trees

Brnog_

n-treezone=en

Brnog_

n-treezone=cs

Namedentities

t-treezone=en

#PersPronACTn:subjI

mean.enuncPREDv:findo n't mean

#CorACTn:elided

implyPAT v:to+infto imply

thatRSTRn:attrthat

garbageACTn:subjgarbage

sortPAT v:finis n't sorted

BrnoLOCn:in+Xin Brno

t-treezone=cs

#PersPronACTdrop

naznačovat.enuncPREDv:finNechci naznačovat

odpadkyACTn:1odpadky

tříditPAT v:že+fin, že se třídí

BrnoLOCn:v+6v Brně

Tectogrammatical(deep syntax) trees

1. Analyzed inputEnglish source Czech MT output

4. Correct output

2. Correction on tag

3. Generation ofaccording word form