Morpho-Syntax in Statistical Machine Translation © 2006 IBM Corporation Young-Suk Lee IBM T. J. Watson Research Center OpenLab 2006 March 30 - April 1, 2006


Morpho-Syntax in Statistical Machine Translation

© 2006 IBM Corporation

Young-Suk Lee
IBM T. J. Watson Research Center

OpenLab 2006
March 30 − April 1, 2006


Reordering Rules: Motivations


IN NNS RB JJS
de transportes especialmente peligrosos
of extremely dangerous transport

DTS NNS JJS JJS
los procedimientos administrativos complejos
ø complex administrative procedures


Outline

• Baseline Phrase Translation System

o Block Acquisition & Decoding

• Acquisition of Reordering Rules

o Base Reordering Rules

o Lexicalized Reordering Rules

• Experimental Results

• Related and Ongoing Work


Baseline Block Acquisition


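[Figure: word alignment grid with source words f1 f2 f3 on the f axis and target words e1 e2 e3 e4 e5 e6 on the e axis]

Block (b): a phrase translation pair consisting of a source phrase and a target phrase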

Tillmann 2003, EMNLP Proceedings


Extended Block Acquisition Algorithm


o Expansion word list: A list of target words typically aligned to null source words (e.g. I, we, are)

o Extend the target phrase to include an expansion word if it occurs in the neighborhood of a block

I think that we are getting a package

creo que realizamos un paquete
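
The extension step can be sketched as follows. This is a minimal illustration, assuming that "neighborhood" means the words directly adjacent to the block's target span (the slide does not define it). The function name `extend_target_span`, the half-open span convention, and the starting alignment of "realizamos" to "getting" are assumptions made for the example, not details of the system.

```python
def extend_target_span(target_words, span, expansion_words):
    """Grow a block's target span over adjacent expansion words.

    span is a half-open (start, end) index pair into target_words.
    Assumption: "neighborhood" means directly adjacent positions.
    """
    start, end = span
    # Extend leftwards while the preceding word is an expansion word.
    while start > 0 and target_words[start - 1].lower() in expansion_words:
        start -= 1
    # Extend rightwards while the following word is an expansion word.
    while end < len(target_words) and target_words[end].lower() in expansion_words:
        end += 1
    return start, end


expansion_words = {"i", "we", "are"}
target = "I think that we are getting a package".split()
# Hypothetical starting block: "realizamos" aligned only to "getting" (index 5).
start, end = extend_target_span(target, (5, 6), expansion_words)
extended_phrase = " ".join(target[start:end])
```

Under these assumptions the target side of the block grows from "getting" to "we are getting", so the unaligned pronoun and auxiliary travel with the verb.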


Decoding

• Phrase translation models

• Direct model: $p(e \mid f) = \dfrac{\mathrm{count}(e, f)}{\sum_{e'} \mathrm{count}(e', f)}$

• Source channel model: $p(f \mid e) = \dfrac{\mathrm{count}(f, e)}{\sum_{f'} \mathrm{count}(f', e)}$

• Block unigram model: $p(b) = \dfrac{\mathrm{count}(b)}{\sum_{b'} \mathrm{count}(b')}, \quad b = (e, f)$
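
The three phrase translation models are relative-frequency estimates over block counts. A minimal sketch, assuming blocks are stored as (e, f) string pairs in a `Counter`; the function name `phrase_models` and the toy counts are invented for illustration and do not come from the talk.

```python
from collections import Counter


def phrase_models(block_counts):
    """Relative-frequency estimates from counts of blocks b = (e, f)."""
    f_totals = Counter()  # sum over e' of count(e', f)
    e_totals = Counter()  # sum over f' of count(f', e)
    for (e, f), c in block_counts.items():
        f_totals[f] += c
        e_totals[e] += c
    total = sum(block_counts.values())  # sum over b' of count(b')

    direct = {(e, f): c / f_totals[f] for (e, f), c in block_counts.items()}
    channel = {(f, e): c / e_totals[e] for (e, f), c in block_counts.items()}
    unigram = {b: c / total for b, c in block_counts.items()}
    return direct, channel, unigram


# Toy counts, invented for illustration only.
counts = Counter({
    ("a package", "un paquete"): 3,
    ("one package", "un paquete"): 1,
    ("a package", "un bulto"): 1,
})
direct, channel, unigram = phrase_models(counts)
```

Here `direct` is keyed by (e, f) and holds p(e | f), `channel` is keyed by (f, e) and holds p(f | e), and `unigram` holds p(b).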


Decoding Cont’d ...

• IBM Model 1 cost per phrase in both directions

$$\sum_{j=1}^{m} -\log_{10} \max_{i} p(f_j \mid e_i), \quad 1 \le i \le n$$

• Word & part-of-speech tag trigram language models

• Word-level distortion models applied to blocks

o Al-Onaizan 2004, DARPA MT Evaluation Workshop

• Word & block count penalty

o Zens and Ney 2004, HLT Proceedings
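
The IBM Model 1 phrase cost listed on this slide can be sketched as below for one direction. This is an illustration under stated assumptions: the lexical table is a plain dictionary keyed by (f, e), unseen pairs fall back to a small `floor` probability (the talk does not say how they are handled), and the toy probabilities are invented.

```python
import math


def model1_cost(src_words, tgt_words, lex_prob, floor=1e-10):
    """Sum over source words f_j of -log10 max_i p(f_j | e_i).

    lex_prob maps (f, e) to p(f | e); unseen pairs fall back to `floor`,
    which is an assumption of this sketch.
    """
    cost = 0.0
    for f in src_words:
        best = max(lex_prob.get((f, e), floor) for e in tgt_words)
        cost += -math.log10(best)
    return cost


# Toy lexical probabilities, invented for illustration.
lex = {("un", "a"): 0.1, ("paquete", "package"): 0.01, ("paquete", "a"): 0.001}
cost = model1_cost(["un", "paquete"], ["a", "package"], lex)
```

The cost in the opposite direction would use the same function with the two phrases swapped and a table of p(e | f) values.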


Acquisition of Base Reordering Rules


• Viterbi-align

o Part-of-speech tagged source language corpus

o Un-tagged target language corpus

• Identify source language part-of-speech tag sequences (monotone increasing) whose corresponding target word sequences are not monotone increasing

• Compute the reordering probabilities of each part-of-speech tag sequence
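
The pattern identification step can be sketched as follows. This is a simplified illustration, not the system's implementation: it assumes each source word carries exactly one alignment link, uses fixed-length windows of `n` tags, and counts every pattern (including monotone ones, which the probability estimate also needs). The function names and the alignment given for the example are assumptions.

```python
from collections import Counter


def reorder_pattern(target_positions):
    """Return the 1-based source indices listed in target-side order."""
    order = sorted(range(len(target_positions)), key=lambda j: target_positions[j])
    return tuple(j + 1 for j in order)


def count_patterns(tagged_corpus, n=4):
    """Count (tag sequence, reorder pattern) pairs over n-word windows.

    tagged_corpus yields (tags, alignment) per sentence, where alignment[j]
    is the target position of source word j (one link per source word is
    an assumption of this sketch).
    """
    counts = Counter()
    for tags, alignment in tagged_corpus:
        for start in range(len(tags) - n + 1):
            window_tags = tuple(tags[start:start + n])
            pattern = reorder_pattern(alignment[start:start + n])
            counts[window_tags, pattern] += 1
    return counts


# "de transportes especialmente peligrosos" -> "of extremely dangerous transport"
tags = ["IN", "NNS", "RB", "JJS"]
alignment = [0, 3, 1, 2]  # assumed word alignment for the example
counts = count_patterns([(tags, alignment)])
```

For this sentence pair the window IN NNS RB JJS is counted once with pattern 1 3 4 2, that is, IN1 RB3 JJS4 NNS2.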


Reordering Probability Computation

$$p(\mathit{reorder}_i \mid \mathit{tag}_k) = \frac{\mathrm{count}(\mathit{reorder}_i, \mathit{tag}_k)}{\sum_{\mathit{reorder}'} \mathrm{count}(\mathit{reorder}', \mathit{tag}_k)}$$


p(reorder_i | tag_k) for two tag sequences:

reorder'    IN1 NNS2 RB3 JJS4    DTS1 NNS2 JJS3 JJS4
1 2 3 4     0.179                0.107
1 2 4 3     0.047                0.059
1 3 2 4     0.094                0.239
1 3 4 2     0.560                0.109
1 4 2 3     0.072                0.111
1 4 3 2     0.048                0.375
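
The reordering probability is a count normalized over all patterns observed with the same tag sequence. A minimal sketch, assuming counts are keyed by (tag sequence, pattern) tuples; the function name and the counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter, defaultdict


def reorder_probabilities(counts):
    """Normalize count(reorder, tag) over all patterns seen with each tag sequence."""
    totals = Counter()
    for (tags, _pattern), c in counts.items():
        totals[tags] += c
    probs = defaultdict(dict)
    for (tags, pattern), c in counts.items():
        probs[tags][pattern] = c / totals[tags]
    return probs


# Hypothetical counts chosen for illustration; they are not from the talk.
tags = ("IN", "NNS", "RB", "JJS")
counts = Counter({
    (tags, (1, 3, 4, 2)): 6,
    (tags, (1, 2, 3, 4)): 3,
    (tags, (1, 4, 3, 2)): 1,
})
probs = reorder_probabilities(counts)
```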


One Best Reordering Rules

$$p(\mathit{reorder}_f \mid \mathit{tag}) > p(\mathit{reorder}_s \mid \mathit{tag}) + \alpha$$


IN1 NNS2 RB3 JJS4 → IN1 RB3 JJS4 NNS2

DTS1 NNS2 JJS3 JJS4 → DTS1 JJS4 JJS3 NNS2

DT1 NN2 JJ3 IN4 → DT1 JJ3 NN2 IN4

NNS1 JJS2 CC3 JJS4 → JJS2 CC3 JJS4 NNS1

DT1 NN2 CC3 NN4 JJ5 → DT1 JJ5 NN2 CC3 NN4
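
The selection criterion can be sketched as a margin test. This is one reading of the slide, not a confirmed definition: reorder_f and reorder_s are taken to be the first- and second-ranked patterns, and a tag sequence whose best pattern is monotone is assumed to yield no rule. The probabilities are those given for IN1 NNS2 RB3 JJS4 under Reordering Probability Computation; the value of alpha is illustrative, since the talk does not state it.

```python
def one_best_rule(pattern_probs, alpha):
    """Return the top pattern if it beats the runner-up by more than alpha.

    Reading reorder_f and reorder_s as the first- and second-ranked
    patterns is an assumption of this sketch.
    """
    ranked = sorted(pattern_probs.items(), key=lambda kv: kv[1], reverse=True)
    (best, p_best), (_, p_second) = ranked[0], ranked[1]
    monotone = tuple(range(1, len(best) + 1))
    if best != monotone and p_best > p_second + alpha:
        return best
    return None


# Probabilities for IN1 NNS2 RB3 JJS4 from the reordering probability table.
in_nns_rb_jjs = {
    (1, 2, 3, 4): 0.179, (1, 2, 4, 3): 0.047, (1, 3, 2, 4): 0.094,
    (1, 3, 4, 2): 0.560, (1, 4, 2, 3): 0.072, (1, 4, 3, 2): 0.048,
}
rule = one_best_rule(in_nns_rb_jjs, alpha=0.1)  # alpha value is illustrative
```

With this margin the sketch selects 1 3 4 2, which agrees with the rule IN1 NNS2 RB3 JJS4 → IN1 RB3 JJS4 NNS2 listed on the slide.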


Lexicalization of Exceptions

The Fund must of course continue to serve its purpose and pursue research into1 varieties2 more3 suited4 to demand and causing as little harm as possible.

El Fondo, por supuesto, debe continuar cumpliendo con su misión de investigación sobre la búsqueda de1/IN variedades2/NNS más3/RB adaptadas4/JJS a la demanda y lo menos nocivas posible,

IN1 NNS2 RB3 JJS4[~adaptadas] → IN1 RB3 JJS4 NNS2

the operational support of the1 Secretary2 General3 of4 the Council

el apoyo operativo de la1/DT Secretaría2/NN General3/JJ del4/IN Consejo

DT1 NN2 JJ3[~General] IN4 → DT1 JJ3 NN2 IN4


Lexicalized Reordering Rules

• Identify the key part-of-speech tag in the base reordering rules

• Replace the key part-of-speech tag with the corresponding word

o DT NN JJ IN → DT NN General IN

• Compute reordering probabilities of lexicalized part-of-speech tag sequences

• Exception word list

o If the reordering pattern with the highest probability is monotone increasing, select the word in the pattern as an exception
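
The lexicalization step replaces one tag in a tag sequence with the word found at that position. A minimal sketch, assuming the key tag is already known and passed in as an index; how the key tag is identified is not specified on the slide, and the function name `lexicalize` is hypothetical.

```python
def lexicalize(tags, words, key_index):
    """Replace the tag at key_index with the word found at that position."""
    lexicalized = list(tags)
    lexicalized[key_index] = words[key_index]
    return tuple(lexicalized)


tags = ("DT", "NN", "JJ", "IN")
words = ("la", "Secretaría", "General", "del")
# Treating JJ (index 2) as the key tag follows the slide's example;
# how the key tag is identified in general is not specified here.
lexicalized = lexicalize(tags, words, key_index=2)
```

The lexicalized sequence DT NN General IN can then be counted and normalized in the same way as a plain tag sequence.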


Lexicalized Reordering Probabilities

Tag sequence: DT1 NN2 General3 IN4

reorder'    p(reorder_i | tag_k)
1 2 3 4     0.454
1 2 4 3     0.021
1 3 2 4     0.201
1 3 4 2     0.012
1 4 2 3     0.194
1 4 3 2     0.098
4 1 2 3     0.007
4 1 3 2     0.013
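
Applied to these probabilities, the exception test from the Lexicalized Reordering Rules slide can be sketched as below. The function name `is_exception` and the dictionary layout are assumptions; the probabilities are the ones listed for DT1 NN2 General3 IN4.

```python
def is_exception(pattern_probs):
    """True if the most probable pattern is monotone increasing (1 2 ... n)."""
    best = max(pattern_probs, key=pattern_probs.get)
    return best == tuple(range(1, len(best) + 1))


# p(reorder | DT1 NN2 General3 IN4) as listed on the slide.
general_probs = {
    (1, 2, 3, 4): 0.454, (1, 2, 4, 3): 0.021, (1, 3, 2, 4): 0.201,
    (1, 3, 4, 2): 0.012, (1, 4, 2, 3): 0.194, (1, 4, 3, 2): 0.098,
    (4, 1, 2, 3): 0.007, (4, 1, 3, 2): 0.013,
}
exception_words = {"General"} if is_exception(general_probs) else set()
```

Because the monotone pattern 1 2 3 4 has the highest probability (0.454), "General" is selected as an exception word, which is what blocks the base rule DT1 NN2 JJ3 IN4 → DT1 JJ3 NN2 IN4 for this word.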


Performance Evaluations

• Translation model training corpus

o ~1.3 M sentence pairs from EPPS distributed by RWTH

• Language model training corpus

o EPPS English corpus: ~35 M words

o UN parallel corpus English (LDC94T4A): ~45 M words

o English Gigaword Second Edition (LDC2005T12): ~2.5 B words


Evaluation Corpus Statistics


Data Sets           # of Segments    Avg. Segment Length
EPPS Dev06 FTE      699              35 words/segment
EPPS Dev06 VHT      792              31 words/segment
CORTES Dev06 FTE    753              37 words/segment
CORTES Dev06 VHT    920              31 words/segment


Lexicalized Reordering Rules: Impact


[Figure: bar chart of BLEUr2n4c scores (y-axis 0.4 to 0.56) on EPPS Dev06 FTE, EPPS Dev06 VHT, CORTES Dev06 FTE, and CORTES Dev06 VHT. Data labels: 0.5322, 0.5123, 0.4439, 0.4186; 0.5434, 0.5204, 0.4507, 0.4242]


Base vs. Lexicalized Reordering Rules


[Figure: bar chart of BLEUr2n4c scores (y-axis 0.4 to 0.55) on EPPS Dev06 FTE, EPPS Dev06 VHT, CORTES Dev06 FTE, and CORTES Dev06 VHT. Data labels: 0.5123, 0.4439, 0.4186, 0.5413, 0.5205, 0.4452, 0.4196, 0.5434, 0.4507, 0.4242, 0.5322, 0.5204]


Related Work

• N-best Reordering in Arabic-to-English Translation

o Statistically significant performance improvement from applying local reordering to noun-phrase-parsed Arabic

o IBM Site Report: DARPA MT Evaluation Workshop 2004

• Morphological Analysis for Statistical Machine Translation

o Identify one-to-one word correspondences between Arabic and English to improve word-to-word translation quality

o Companion Volume of HLT-NAACL 2004, pages 57−60

• Local Reordering for Spanish-English Translations

o Presentation at TC-STAR 2005 Evaluation Workshop

o April 21-22, 2005, Trento, Italy


Ongoing Work

• Non-local reordering models

• [Se ha puesto a prueba]VP [su voluntad]NP →
[Its will]NP [has been put to the test]VP

• Todas sus Señorías firmaron [con los electores]PP [un contrato]NP →
All your ladies and gentlemen signed [a contract]NP [with the electors]PP

• Integration of reordering models into the decoder
