a tool for automatic rule-based post-editing of...
TRANSCRIPT
a Tool for Automatic Rule-based Post-editing of SMT
This research was supported by the grants FP7-ICT-2011-7-288487 (MosesCore), FP7-ICT-2013-10-610516 (QTLeap), GAUK 1572314, and SVV 260 104. This work has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).
DepfixRudolf Rosa
[email protected] University in Prague
Faculty of Mathematics and PhysicsInstitute of Formal and Applied Linguistics
Input Treex Analysistokenizer (Treex)
lemmatizing tagger (MorphoDiTa)
word aligner (GIZA++)
dependency parser (MST)
dependency relations labeller (Treex)
named entity recognizer (Stanford)
Noun – Adjective Agreement
increaseNN
half-heartedJJ
nárůstuNNIS2nárůst
F: gender = feminine
polovičatéAAFS2polovičatý
nárůstuNNIS2nárůst
polovičatéhoAAIS2polovičatý
I: gender = masculine inanimate
morphologicalgenerator
word form:polovičatéholemma: polovičatý
tag: AAIS2
morphological tagword form
lemma
set adjective genderto noun gender
Preposition – Noun Agreement
oPP--6o
aboutIN
MandelaNNP
oPP--6o
set noun caseto preposition case
Translation of ‘of’
schodekNNIS1schodek
financeNNFP4finance
deficitNN
ofIN
set noun caseto genitive
financesNNS
schodekNNIS1schodek
financíNNFP2finance
Translation of Possessive Nouns
jeVBS3Pbýt
rybaNNFS1ryba
fishNN
DavidNNP
generate correspondingpossessive adjective
’sPOS
DavidNNMS1David
rybaNNFS1ryba
DavidovaAUFS1MDavidův
‘David is fish’
Subject Morphological Case
vyzvaliVpMPXRvyzvat
calledVBD
votersNNs
set subject caseto nominative
voličeNNMP4volič
‘voters were called’
vyzvaliVpMPXRvyzvat
voličiNNMP1volič
Verb Tense Translation
defendingVBG
areVBP
set verb tenseto equivalent of English tense
‘were defending’
generalsNNS
generálovéNNMP1generál
bráníVBP3Pbránit
generálovéNNMP1generál
Lost Negation Recovery
deserveVB
notRB negate the verb
‘deserves’
doesVBZ
Translation of ‘by’
zmařenoVsNSXXzmařit
foiledJJ
byIN
set noun caseto instrumentative
voluntarismNN
zmařenoVsNSXXzmařit
Subject Person Projection
areVBP
wePRP
‘they are’
jsouVBP3Pbýt
jsmeVBP1Pbýt
set verb person toEnglish subject person
Depfix Correction Rules current version: 28 rules
Input The source is available to all components to minimize error propagation.
CU Chimera English → Czech MT System
http://ufal.mff.cuni.cz/depfix
Get Depfix Today!
• implemented in Perl in Treex NLP framework• needs Linux + 4GB RAM• 3 seconds per sentence• released under GNU/GPL• source is commented
Depfix Evaluation (Δ BLEU)
CU BojarCU TectoMTUEDINJHUEurotranMicrosoft BingGoogle Translate
+0.07−0.02+0.23+0.32+0.15+0.37
0.00
WMT 2012+0.47−0.10+0.64+0.42+0.21
?+0.23
WMT 2011
Evaluation of CU Chimera
CU Chimera UEDINCU BojarGoogle TranslateCU TectoMT
WMT 2013
0.5780.5250.5800.5620.476
Manual20.019.120.1
?14.7
BLEU19.218.618.9
?14.2
cBLEUWMT 2014
0.3710.3560.3330.169
−0.175
Manual22.022.122.1
?15.8
BLEU21.121.620.920.215.4
cBLEU
Moses (Bojar)- phrase-based SMT- large-scale data- morphological tags as factors for a better grammatical coherence
Depfix (Rosa)- automatic post-editing- generates the final output
TectoMT (Popel)- hybrid (rule-based/statistical) MT system- transfer at tectogrammatical (deep syntactical) layer- our combination: get an extra phrase table for Moses from TectoMT output
7: instru-mentative
voluntarismemNNIS7voluntarismums
2: genitive
voluntarismuNNIS2voluntarismus
4: accusative
1: nominative
1: nominative
MandelaNNMS1Mandela
6: locative
MandeloviNNMS6Mandela
‘we are’ ‘does not deserve’
N: negatedA: affirmative
nezasloužíVBS3P-Nzasloužit
zasloužíVBS3P-Azasloužit
R: past P: present
brániliVpMPXRbránit
‘are defending’
4: accusative
2: genitive
Depfix improves most MT systems!
CU Chimera is the state-of-the-art for English → Czech MT!
conversion to tectogrammar (Treex)
source English sentence
Czech sentence from SMT
English → CzechMachine TranslationMoses/Google Translate/...
a-treezone=en
ISbPRP
doAuxVVBP
n'tNegRB
meanPredVB
toAuxVTO
implyAdvVB
thatAtrDT
garbageSbNN
isAuxVVBZ
n'tNegRB
sortedAdvVBN
inAuxPIN
BrnoAdvNNP
.AuxK.
a-treezone=cs
NechciPredVB-S1PA
naznačovatObjVf
,AuxXZ:
žeAuxCJ,
odpadkySbNNIP1
seAuxRP7-X4
třídíAdvVB-P3PA
vAuxPRR6
BrněAdvNNNS6
.AuxKZ:
Analytical (surface syntax)dependency trees
Brnog_
n-treezone=en
Brnog_
n-treezone=cs
Namedentities
t-treezone=en
#PersPronACTn:subjI
mean.enuncPREDv:findo n't mean
#CorACTn:elided
implyPAT v:to+infto imply
thatRSTRn:attrthat
garbageACTn:subjgarbage
sortPAT v:finis n't sorted
BrnoLOCn:in+Xin Brno
t-treezone=cs
#PersPronACTdrop
naznačovat.enuncPREDv:finNechci naznačovat
odpadkyACTn:1odpadky
tříditPAT v:že+fin, že se třídí
BrnoLOCn:v+6v Brně
Tectogrammatical(deep syntax) trees
1. Analyzed inputEnglish source Czech MT output
4. Correct output
2. Correction on tag
3. Generation ofaccording word form