depfix: automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfrudolf rosa –...

100
Rudolf Rosa [email protected] Depfix: Automatic post-editing of phrase-based machine translation outputs Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics Monday seminar, 14th October 2013

Upload: others

Post on 24-Jun-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf [email protected]

Depfix:

Automatic post-editingof phrase-based machine translation

outputs

Charles University in PragueFaculty of Mathematics and Physics

Institute of Formal and Applied Linguistics

Monday seminar, 14th October 2013

Page 2: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 2/100

Outline

translation of negation (and its correction) motivation the fixing pipeline

analysis and corrections m-layer (lemmas, tags, word-alignment) a-layer (dependency trees, analytical functions) t-layer (“tecto-trees”, formemes, grammatemes)

evaluation parsing of SMT outputs (MSTperl parser)

Page 3: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 3/100

Motivation: Translation of negation

http://thedrivenclass.com/blog-detail/32876

Page 4: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 4/100

Motivation: Errors in negation

These are not actually errors.

Page 5: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 5/100

Motivation: Errors in negation

These are not actually errors. Moses: Jsou to vlastně chyby. Gloss: These are actually errors.

Page 6: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 6/100

Motivation: Errors in negation

These are not actually errors. Moses: Jsou to vlastně chyby. Gloss: These are actually errors. Ref.: Nejsou to vlastně chyby.

Page 7: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 7/100

Motivation: Errors in negation

These are not actually errors. Moses: Jsou to vlastně chyby. Gloss: These are actually errors. Ref.: Nejsou to vlastně chyby.

I would not cheat on you.

Page 8: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 8/100

Motivation: Errors in negation

These are not actually errors. Moses: Jsou to vlastně chyby. Gloss: These are actually errors. Ref.: Nejsou to vlastně chyby.

I would not cheat on you. Moses: Já bych tě podváděl. Gloss: I would cheat on you. Ref.: Já bych tě nepodváděl.

Page 9: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 9/100

Motivation: Errors in negation

These are not actually errors. Moses: Jsou to vlastně chyby. Gloss: These are actually errors. Ref.: Nejsou to vlastně chyby.

I would not cheat on you. Moses: Já bych tě podváděl. Gloss: I would cheat on you. Ref.: Já bych tě nepodváděl.

some phenomena hard to get right with PBSMT

are are not

jsou nejsou

are not

jsou NULL

Page 10: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 10/100

Simple to fix?

there is a negation in the source These are not actually errors

Page 11: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 11/100

Simple to fix?

there is a negation in the source These are not actually errors

there is no negation in the target Jsou to vlastně chyby

Page 12: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 12/100

Simple to fix?

there is a negation in the source These are not actually errors

there is no negation in the target Jsou to vlastně chyby

Page 13: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 13/100

Simple to fix?

there is a negation in the source These are not actually errors

there is no negation in the target Jsou to vlastně chyby

add the negative prefix (ne-) into the target Nejsou to vlastně chyby

Page 14: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Simple to fix?

there is a negation in the source These are not actually errors

there is no negation in the target Jsou to vlastně chyby

add the negative prefix (ne-) into the target Nejsou to vlastně chyby

such a simple approach might be sufficient but usually useful to use some more NLP tools

Page 15: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 15/100

Part-of-speech tagger

run a POS tagger on the target sentence Jsou to vlastně chyby verb pronoun adverb noun

a good heuristic: negate the (finite) verb! Nejsou to vlastně chyby verb pronoun adverb noun

fine-grained tags may even mark the negation jsou VB-P---3P-AA--- nejsou VB-P---3P-NA---

Page 16: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 16/100

Dependency parser

parse both source and target project negation through word alignment

Page 17: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 17/100

Dependency parser

parse both source and target project negation through word alignment

Page 18: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 18/100

Deep syntactic analysis

auxiliary nodes collapsed into values of attributes on parent nodes

abstract from various ways of expressing negation (not, no, un-, in-,...) all marked by neg=1 on the lexical node

are

not

be, neg=1

Page 19: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 19/100

Morphological generator

form = generate(word, morphological features)

Page 20: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 20/100

Morphological generator

form = generate(lemma, tag)

Page 21: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 21/100

Morphological generator

form = generate(lemma, tag)

instead of: new_form = 'ne' + form 'nejsou' = 'ne' + 'jsou'

Page 22: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 22/100

Morphological generator

form = generate(lemma, tag)

instead of: new_form = 'ne' + form 'nejsou' = 'ne' + 'jsou'

use the more sophisticated: new_form = generate(lemma(form), negate(tag)) 'nejsou' = generate(lemma('jsou'), negate('VB-P---3P-AA---'))

Page 23: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 23/100

Morphological generator

form = generate(lemma, tag)

instead of: new_form = 'ne' + form 'nejsou' = 'ne' + 'jsou'

use the more sophisticated: new_form = generate(lemma(form), negate(tag)) 'nejsou' = generate(lemma('jsou'), negate('VB-P---3P-AA---'))

'nejsou' = generate( 'být' , 'VB-P---3P-NA---' )

Page 24: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 24/100

Depfix pipeline

input English sentence its Czech translation, provided by an SMT system

m-layer analysis and fixes (tagger, word-aligner) a-layer analysis and fixes (dependency parser) t-layer analysis and fixes (t-tree analyzer) output (morphological generator)

the Czech translation, corrected

Page 25: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 25/100

Outline

✔ translation of negation (and its correction)✔ motivation✔ the fixing pipeline

➔ analysis and corrections m-layer (lemmas, tags, word-alignment) a-layer (dependency trees, analytical functions) t-layer (“tecto-trees”, formemes, grammatemes)

evaluation parsing of SMT outputs (MSTperl parser)

Page 26: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 26/100

Outline

✔ translation of negation (and its correction)✔ motivation✔ the fixing pipeline

➔ analysis and corrections➔ m-layer (lemmas, tags, word-alignment) a-layer (dependency trees, analytical functions) t-layer (“tecto-trees”, formemes, grammatemes)

evaluation parsing of SMT outputs (MSTperl parser)

Page 27: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 27/100

M-layer

analysis: lemmas, tags, word-alignment +adding missing alignment links by string similarity

corrections: tokenization projection morphological number projection source-aware truecasing vocalisation of prepositions

Page 28: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 28/100

Tokenization projection

Source Michèle Alliot-Marie had sent a communication

Moses Michèle Alliot - Marie poslal sdělení

Gloss Michèle Alliot - Marie sentmasc

a communication

Depfix Michèle Alliot-Marie poslala sdělení

Gloss Michèle Alliot-Marie sentfem

a communication

Page 29: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 29/100

Source-aware truecasing

Source the director of the best hotel in Pec, Karel Rada.

Moses ředitel nejlepší hotel v peci, Karel rada.

Gloss the director of the best hotel in the oven, Karel advice.

Depfix ředitel nejlepšího hotelu v Peci, Karel Rada.

Glossthe director of the best hotel in Pec

town, Karel

Radasurname

.

Page 30: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 30/100

Vocalisation of prepositions

Source The work being done by experts from three institutions

Moses Práce odborníků z tří institucí

Gloss Work by experts from three institutions

Depfix Práce odborníků ze tří institucí

Gloss Work by experts from three institutions

Page 31: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 31/100

Outline

✔ translation of negation (and its correction)✔ motivation✔ the fixing pipeline

➔ analysis and corrections✔ m-layer (lemmas, tags, word-alignment)➔ a-layer (dependency trees, analytical functions) t-layer (“tecto-trees”, formemes, grammatemes)

evaluation parsing of SMT outputs (MSTperl parser)

Page 32: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 32/100

A-layer

analysis: dependency trees, analytical functions for Czech: MSTperl (adapted MST parser) + fixing prepositions without children, auxiliary verbs

with children, AuxT/AuxR...

morphological agreement fixes: preposition-noun, noun-adjective, subject-predicate...

fixes of transfer of meaning to morphology possessives, passives, subject...

Page 33: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 33/100

A-layer motivation

Source: All the winners received a diploma.

Moses: Všem výhercům obdržel diplom. To all the winners he received a diploma.

Depfix: Všichni výherci obdrželi diplom. All the winners received a diploma.

Page 34: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 34/100

Všem výhercům obdržel diplom.

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšemAtrPLXP3

výhercům

NNMP3

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Obj

Page 35: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 35/100

Transfer of meaning: subject

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšemAtrPLXP3

výhercůmObjNNMP3

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Page 36: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 36/100

Transfer of meaning: subject

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšemAtrPLXP3

výhercůmObjNNMP3

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Page 37: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 37/100

Subject → nominative

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšemAtrPLXP3

výherciSbNNMP1

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Page 38: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 38/100

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšemAtrPLXP3

výherciSbNNMP1

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Všem výherci obdržel diplom.

Page 39: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 39/100

Noun-adjective agreement

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšemAtrPLXP3

výherciSbNNMP1

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Page 40: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 40/100

Agreement: gender, case (number)

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšichniAtrPLMP1

výherciSbNNMP1

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Page 41: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 41/100

Všichni výherci obdržel diplom.

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšichniAtrPLMP1

výherciSbNNMP1

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Page 42: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 42/100

Subject-predicate agreement

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšichniAtrPLMP1

výherciSbNNMP1

obdrželPredVpYSXRA

diplomObjNNIS1

.AuxKZ:

Page 43: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 43/100

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšichniAtrPLMP1

výherciSbNNMP1

obdrželiPredVpMPXRA

diplomObjNNIS1

.AuxKZ:

Agreement: gender, num (person)

Page 44: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 44/100

Všichni výherci obdrželi diplom.

a-treezone=en

AllAtrPDT

theAuxADT

winnersSbNNS

receivedPredVBD

aAuxADT

diplomaObjNN

.AuxK.

a-treezone=cs

VšichniAtrPLMP1

výherciSbNNMP1

obdrželiPredVpMPXRA

diplomObjNNIS1

.AuxKZ:

Page 45: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 45/100

Preposition-noun agreement

Source It is a story about sport, race relations, and Nelson Mandela.

Moses Je to příběh o sportu, rasových vztahů, a Nelson Mandela.

GlossIt is a story about

loc sport

loc, race relations

gen,

and Nelsonnom

Mandelanom

.

Depfix Je to příběh o sportu, rasových vztazích, a Nelson Mandelovi.

GlossIt is a story about

loc sport

loc, race relations

loc,

and Nelsonnom

Mandelaloc

.

Page 46: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 46/100

Noun-adjective agreement

Source this half-hearted increase will bear the same fruit

Moses tato polovičatá nárůst bude nést stejné ovoce

Glossthis

fem half-hearted

fem increase

masc will bear the

same fruit

Depfix tento polovičatý nárůst bude nést stejné ovoce

Glossthis

masc half-hearted

masc increase

masc will bear

the same fruit

Page 47: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 47/100

Translation of 'of'

Source unsustainable deficit level of public finances.

Moses neudržitelná úroveň schodku veřejné finance.

Gloss unsustainable deficit level public finances.

Depfix neudržitelná úroveň schodku veřejných financí.

Gloss unsustainable deficit level of public finances.

Page 48: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 48/100

Translation of 's

Source Janota's possible continuation in office will be the topic of Friday's meeting.

Moses Janota je možné pokračování ve funkci bude tématem páteční schůze.

Gloss Janota is possible continuation in office will be the topic of Friday's meeting.

Depfix možné pokračování Janoty ve funkci bude tématem páteční schůze.

Glosspossible continuation of Janota in office will be the topic of Friday's meeting.

Page 49: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 49/100

Translation of Subject

SourceAt a time when Swiss voters

subj have called for

a ban on the construction of minarets

MosesV době, kdy švýcarské voliče

acc vyzvali k

zákazu výstavby minaretů

Gloss At a time, when Swiss voters were called for a ban on the construction of minarets

DepfixV době, kdy švýcarští voliči

nom vyzvali k zákazu

výstavby minaretů

GlossAt a time, when Swiss voters called for a ban on the construction of minarets

Page 50: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 50/100

Outline

✔ translation of negation (and its correction)✔ motivation✔ the fixing pipeline

➔ analysis and corrections✔ m-layer (lemmas, tags, word-alignment)✔ a-layer (dependency trees, analytical functions)➔ t-layer (“tecto-trees”, formemes, grammatemes)

evaluation parsing of SMT outputs (MSTperl parser)

Page 51: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 51/100

T-layer

analysis: t-trees, formemes, grammatemes + systematic analysis of English verb tenses

rule-based fixes:✔ negation verb tenses subject pronoun dropping

statistical fixes: valency

Page 52: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 52/100

Translation of verb tenses

Source This will bring problems for whoever is in office

Moses To přináší problémy pro každého, kdo je v kanceláři

Gloss This brings problems for anyone who is in office

Depfix To bude přinášet problémy pro každého, kdo je v kanceláři

GlossThis will bring problems for anyone who is in office

Page 53: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 53/100

Subject pronoun dropping

Source I don't blame them.

Moses Já se jim nedivím.

Gloss I myself them don't-blame1st person sg

.

Depfix Nedivím se jim.

Gloss Don't-blame1st person sg

myself them.

Page 54: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 54/100

Correction of valency errors

Source text in English:

EU criticizes not only the Greek government Google Translate to Czech (6th Aug 2013):

EU kritizuje nejen řecká vládanominative (subject)

Not only the Greek government criticizes EU

Post-editation by Deepfix:

EU kritizuje nejen řeckou vláduaccusative (object)

EU criticizes not only the Greek government

Page 55: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 55/100

Valency of criticize (kritizovat)

EUsubject

criticizes the Greek governmentobject

EUnominative

kritizuje řeckou vláduaccusative

a valency frame of a verb subject criticize object (position) nominative kritizovat accusative (cases)

decomposition into head-argument pairs (to criticize, government) ~ (kritizovat, vládu) (to criticize, Object) ~ (kritizovat, accusative)

Page 56: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 56/100

Deep syntactic dependency trees

EU

criticize

Greek

government

EU criticizesthe Greek government

EU kritizujeřecká vláda

EU

kritizovat

řecká

vláda

Page 57: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 57/100

Deep syntactic dependency trees

EU

criticize

Greek

government

EU criticizesthe Greek government

EU kritizujeřecká vláda

EU

kritizovat

řecká

vláda

Page 58: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 58/100

Deep syntactic dependency trees

EUN, subject

criticizeV, predicate

GreekA, attribute

governmentN, object

EU criticizesthe Greek government

EU kritizujeřecká vláda

EUN, nominative

kritizovatV, predicate

řeckáA, attribute

vládaN, nominative

Page 59: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 59/100

(head, arg) pair identification

EUN, subject

criticizeV, predicate

GreekA, attribute

governmentN, object

EU criticizesthe Greek government

EU kritizujeřecká vláda

EUN, nominative

kritizovatV, predicate

řeckáA, attribute

vládaN, nominative

Page 60: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 60/100

Valency models (FIX)

P(argcase

| headlemma

, English_argfunction

)

P(argcase

| headlemma

, English_argfunction

, arglemma

)

estimated from CzEng 1.0 (15M parallel stcs)

criticize

governmentobject

kritizovat

vládanominative

Page 61: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 61/100

Argument case probabilities

P(nominative | kritizovat, object) = 0.03 P(accusative | kritizovat, object) = 0.80

criticize

governmentobject

kritizovat

vládanominative

Page 62: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 62/100

Argument case probabilities

P(nominative | kritizovat, object) = 0.03 P(accusative | kritizovat, object) = 0.80 threshold: 0.55

criticize

governmentobject

kritizovat

vládanominative

Page 63: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 63/100

Argument case correction

P(nominative | kritizovat, object) = 0.03 P(accusative | kritizovat, object) = 0.80 threshold: 0.55

criticize

governmentobject

kritizovat

vládaaccusative

Page 64: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 64/100

Sentence correction

Statitical machine translation output:

EU kritizuje nejen řeckánominative

vládanominative

Not only the Greek government criticizes EU

Valency model correction:

EU kritizuje nejen řeckánominative

vláduaccusative

Agreement enforcement:

EU kritizuje nejen řeckouaccusative

vláduaccusative

EU criticizes not only the Greek government

Page 65: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 65/100

Some interesting details

the model actually works on formemes functions (EN), cases (CS), prepositions (EN, CS) in: The government spends on the middle schools. SMT: Vládá utrácí střední školy.

(spend, on+X) → (utrácet, 4) P = 0.07 The government destroys the middle schools.

out: Vládá utrácí za střední školy. (spend, on+X) → (utrácet, za+4) P = 0.89 The government spends on the middle schools.

Page 66: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 66/100

Some interesting details

the model actually works on formemes functions (EN), cases (CS), prepositions (EN, CS) in: The government spends on the middle schools. SMT: Vládá utrácí střední školy.

(spend, on+X) → (utrácet, 4) P = 0.07 The government destroys the middle schools.

out: Vládá utrácí za střední školy. (spend, on+X) → (utrácet, za+4) P = 0.89 The government spends on the middle schools.

we model both verb valency and noun valency

Page 67: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 67/100

Outline

✔ translation of negation (and its correction)✔ motivation✔ the fixing pipeline

✔ analysis and corrections✔ m-layer (lemmas, tags, word-alignment)✔ a-layer (dependency trees, analytical functions)✔ t-layer (“tecto-trees”, formemes, grammatemes)

➔ evaluation parsing of SMT outputs (MSTperl parser)

Page 68: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 68/100

Automatic evaluation (BLEU)

WMT10 (devel) WMT11 WMT12 WMT130

5

10

15

20

25

15.66 16.39

13.81

20.08

16.08 16.61

13.85

20.02SMT outputafter Depfix

Page 69: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 69/100

Manual evaluation

ImprovementImprovement43043058%58%

DegradationDegradation15215221%21%

IndefiniteIndefinite15715721%21%

changed sentences all sentences

Impr.Impr.43043032%32%

Degr.Degr.15215211%11%

Ind.Ind.15715712%12%

No changeNo change61161145%45%

Page 70: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 70/100

Manual evaluation (WMT13)

TectoMTTectoMT+Moses

TectoMT+Moses+DepfixGoogle Translate

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.490.56 0.54 0.53

0.46

0.64 0.660.62

Mechanical TurkersScientists

Page 71: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 71/100

iHNed.cz evaluation (M. Kalina)

Jak překladač z Matfyzu porazil Google(...)

Pomohl odstraňovač chyb(...)

Aby byl výsledek ještě lepší, prošel výsledný text ještě automatickou korekturou pomocí českého systému Depfix, odstraňovače chyb, jenž opravil například špatně přeložené negativní věty a pády.http://tech.ihned.cz/hnfuture/c1-60978500-prekladac-google-maffyz-system

Page 72: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 72/100

Precision of rules

Negation

AuxV Children

Verb-En Subj Agr.

AuxTPresent Continuous

Of Passive-Aux Be Agr.

Subj-Past Participle Agr.

Drop Subj Pers Pron

Subj-Predicate Agr.

Casing

POSBy Tense

Retokenizing

Subject Case

Noun-Adjective Agr.

Subject-Pnom Agr.

Valency

Project Noun Number

Preposition-Noun Agr.

Prepositional Case

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100% precision=∣improved∣

∣improved∣∣worsened∣

Page 73: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 73/100

Impact of rules

Negation

AuxV Children

Verb-En Subj Agr.

AuxTPresent Continuous

Of Passive-Aux Be Agr.

Subj-Past Participle Agr.

Drop Subj Pers Pron

Subj-Predicate Agr.

Casing

POSBy Tense

Retokenizing

Subject Case

Noun-Adjective Agr.

Subject-Pnom Agr.

Valency

Project Noun Number

Preposition-Noun Agr.

Prepositional Case

0

20

40

60

80

100

120

140

160

180

|mo

difi

ed

|

impact=∣modified∣∣evaluated∣

50% precision75% precision 60% precision

5% impact

Page 74: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 74/100

Outline

✔ translation of negation (and its correction)✔ motivation✔ the fixing pipeline

✔ analysis and corrections✔ m-layer (lemmas, tags, word-alignment)✔ a-layer (dependency trees, analytical functions)✔ t-layer (“tecto-trees”, formemes, grammatemes)

✔ evaluation➔ parsing of SMT outputs (MSTperl parser)

Page 75: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 75/100

Parsing of SMT Outputs

can be useful in many applications automatic classification of translation errors automatic correction of translation errors (Depfix) multilingual question answering...

✔ we have the source sentence available Can we use it to help parsing?

✗ SMT outputs noisy (errors in fluency, grammar...) parsers trained on gold standard treebanks Can we adapt parser to noisy sentences?

Page 76: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 76/100

MST parser

Maximum Spanning Tree parser McDonald, Crammer, Pereira (2005)

Online large-margin trainingof dependency parsers

McDonald, Pereira, Ribarov, Hajič (2006) Non-projective dependency parsing

using spanning tree algorithms

Page 77: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 77/100

(1) Words and Tags

RudolphNNP

relaxesVBZ

abroadRB

#rootwords = nodes

Page 78: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 78/100

(2) (Nearly) Complete Graph

RudolphNNP

relaxesVBZ

abroadRB

#rootall possible edges =

directed edges

Page 79: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 79/100

(3) Assign Edge Weights

RudolphNNP

relaxesVBZ

abroadRB

#root

-7 -1659325

1490 1154

-263 24

185

368

Margin InfusedRelaxed Algorithm

(MIRA)

edge weight =

sum ofedge featuresweights

Page 80: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 80/100

(4) Maximum Spanning Tree

RudolphNNP

relaxesVBZ

abroadRB

#root

-7 -1659325

1490 1154

-263 24

185

368

non-projective trees:Chu-Liu-Edmonds algorithm

(projective trees:

Eisner algorithm)

Page 81: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 81/100

(5) Unlabeled Dependency Tree

RudolphNNP

relaxesVBZ

abroadRB

#rootdependency tree =

maximumspanning tree

Page 82: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 82/100

(6) Labeled Dependency Tree

RudolphNNP

relaxesVBZ

abroadRB

#root

Predicate

Subject Adverbial

labels asignedby a second stagelabeler

Page 83: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 83/100

Tool::Parser::MSTperl

reimplementation of MST Parser in Perl Treex::Block::W2A::CS::ParseMSTperl (so far only) first-order, non-projective

adapted for SMT outputs parsing worsening the training data (David Mareček) adding parallel information manually boosting feature weights exploiting large-scale data

Page 84: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 84/100

Parser Training Data

Prague Czech-English Dependency Treebank parallel treebank 50k sentences, 1.2M words morphological tags, surface syntax, deep syntax word alignment

Page 85: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 85/100

Worsening the Treebank

treebank used for training contains correct sentences

SMT output is noisy grammatical errors incorrect word order missing/superfluous words …

let's introduce similar errors into the treebank! so far, we have only tried inflection errors

Page 86: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 86/100

Worsen (1): Apply SMT

translate English side of PCEDT to Czech by an SMT system (we used Moses)

now we have e.g.: Gold English

Rudolph's car is black. Gold Czech

RudolfovoNEUT

autoNEUT

je černéNEUT

.

SMT Czech Rudolfova

FEM auto

NEUT je černý

MASC.

Page 87: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 87/100

Worsen (2): Align SMT to Gold

align SMT Czech to Gold Czech Monolingual Greedy Aligner (Martin Popel)

alignment link score = linear combination of: similarity of word forms (or lemmas) similarity of morphological tags (fine-grained) similarity of positions in the sentence indication whether preceding/following words aligned

repeat: align best scoring pair until below threshold no training: weights and threshold set manually

Page 88: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 88/100

Worsen (3): Create Error Model

for each tag: estimate probabilities of SMT system using an

incorrect tag instead of the correct tag(Maximum Likelihood Estimate)

Czech tagset: fine-grained morphological tags part-of-speech, gender, number, case, person,

tense, voice... 1500 different tags in training data

Page 89: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 89/100

Worsen (3): Error Model

Adjective, Masculine, Plural, Instrumental case(AAMP7), e.g. lingvistickými (linguistic)➔ 0.2 Adjective, Masculine, Singular, Nominative case

➔ e.g. lingvistický➔ 0.1 Adjective, Masculine, Plural, Nominative case

➔ e.g. lingvističtí➔ 0.1 Adjective, Neuter, Singular, Accusative case

➔ e.g. lingvistické

… altogether 2000 such change rules

Page 90: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 90/100

Worsen (4): Apply Error Model

take Gold Czech for each word:

assign a new tag randomly sampled according to Tag Error Model

generate a new word form rule-based generator, generates even unseen forms new_form = generate_form(lemma, tag) || old_form

→ get Worsened Czech use resulting Gold English-Worsened Czech

parallel treebank to train the parser

Page 91: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 91/100

Parallel Features

word alignment (using GIZA++) additional features (if aligned node exists):

aligned tag (NNS, VBD...) aligned dependency label (Subject, Attribute...) aligned edge existence (0/1)

Page 92: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 92/100

Parallel Features Example

RudolfNN M S 1

relaxujeVB S 3

vRR 6

zahraničíNN N S 6

RudolphNNP

relaxesVBZ

abroadRB

#root

#root

Pred

AuxP

AdvSubj

Subj

Pred

Adv

Page 93: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 93/100

Manually boosting feature weights

aligned edge existence is the key feature here observation: since the worsening is probably

too mild, its weight is too low edge exists: -0.57 edge does not exist: -0.83 missing aligned node(s): -0.67

Page 94: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 94/100

Manually boosting feature weights

aligned edge existence is the key feature here observation: since the worsening is probably

too mild, its weight is too low edge exists: -0.57 edge does not exist: -0.83 missing aligned node(s): -0.67

experiment: manually increase its weight edge exists: -0.25

Page 95: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 95/100

Manually boosting feature weights

aligned edge existence is the key feature here observation: since the worsening is probably

too mild, its weight is too low edge exists: -0.57 edge does not exist: -0.83 missing aligned node(s): -0.67

experiment: manually increase its weight edge exists: -0.25

success – manual changing of weights feasible

Page 96: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 96/100

Exploiting large-scale data

exploiting large-scale parsed data (CzEng)to provide additional lexical features

lexical features are important for the parser CzEng has 10 times more word types (lemmas)

than PCEDT (400k vs. 40k) training the parser on whole CzEng infeasible new feature: pointwise mutual information

PMI ' parent , child =logcount [ parent , child ]

count [ parent ,*] ⋅count [* , child ]

Page 97: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 97/100

Direct Evaluation: by Inspection

manual inspection of several parse trees comparing baseline and adapted parser ouputs

examples of improvements: subject identification even if not in nominative case adjective-noun dependence identification even if

agreement violated (gender, number, case)

hard to do reliably trying to find a correct parse tree for an (often)

incorrect sentence – not well defined

Page 98: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 98/100

Indirect Evaluation: in Depfix

improvements and deteriorations in comparison to Depfix employing a baseline parser: original McDonald's

MST Parser our baseline setup,

without adaptations

227227121121

7474

235235 114114

7373

Page 99: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 99/100

Outline – all done!

✔ translation of negation (and its correction)✔ motivation✔ the fixing pipeline

✔ analysis and corrections✔ m-layer (lemmas, tags, word-alignment)✔ a-layer (dependency trees, analytical functions)✔ t-layer (“tecto-trees”, formemes, grammatemes)

✔ evaluation✔ parsing of SMT outputs (MSTperl parser)

Page 100: Depfix: Automatic post-editing of phrase-based machine ...pondelni-seminar:depfix.pdfRudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 14/100

Rudolf Rosa – Depfix: Automatic post-editing of phrase-based machine translation outputs 100/100

Thank you for your attention

For this presentation and other information, please visit:

http://ufal.mff.cuni.cz/~rosa/

Charles University in PragueFaculty of Mathematics and Physics

Institute of Formal and Applied Linguistics

Depfix:

Automatic post-editingof phrase-based machine translation outputs

Rudolf [email protected]