Dr. Preslav Nakov: Combining, Adapting and Reusing Bi-texts between Related Languages


DESCRIPTION

Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, and sentiment analysis; they are also a critical resource for real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the 6,500+ world languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, which offers opportunities for reusing their bi-texts. We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii). We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder. Finally, we observe that for closely related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.

TRANSCRIPT

Combining, Adapting and Reusing Bi-texts between Related Languages:

Application to Statistical Machine Translation

Preslav Nakov, Qatar Computing Research Institute (collaborators: Jörg Tiedemann, Pidong Wang, Hwee Tou Ng)

Yandex seminar, August 13, 2014, Moscow, Russia

2

Plan

• Part I: Introduction to Statistical Machine Translation

• Part II: Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation

• Part III: Further Discussion on SMT

3

Statistical Machine Translation

4

Statistical Machine Translation (SMT)

Reach Out to Asia (ROTA) has announced its fifth Wheels ‘n’ Heels, Qatar’s largest annual community event, which will promote ROTA’s partnership with the Qatar Japan 2012 Committee. Held at the Museum of Islamic Art Park on 10 February, the event will celebrate 40 years of cordial relations between the two countries. Essa Al Mannai, ROTA Director, said: “A group of 40 Japanese students are traveling to Doha especially to take part in our event.

English

SMT systems:
- learn from human-generated translations
- extract useful knowledge and build models
- use the models to translate new sentences

5

SMT: The Noisy Channel Model

6

Translation as Decoding

• 1947, Warren Weaver, Rockefeller Foundation:

One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’

Example:
- Это действительно написано по-английски .
- This is really written in English .

7

The Basic Components of an SMT System

Look for the best English translation that both conveys the French meaning and is grammatical.

8

Components of an SMT System

• Language Model: English text e → P(e)
  o good English → high probability
  o bad English → low probability

• Translation Model: pair <f,e> → P(f|e)
  o <f,e> are translations → high probability
  o <f,e> are not translations → low probability

• Decoder: given P(e), P(f|e), and f, we look for the e that maximizes P(e) · P(f|e)

9

Combining P(e) and P(f|e)

How do we translate into English the Russian phrase “красный цветок” (“red flower”)?

Candidate e    | P(e) | P(f|e) | P(e)·P(f|e)
a flower red   | low  | high   | low
red flower a   | low  | high   | low
flower red a   | low  | high   | low
a red dog      | high | low    | low
dog cat mouse  | low  | low    | low
a red flower   | high | high   | high
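The interplay of P(e) and P(f|e) in this table can be sketched in a few lines of Python. All probabilities below are made-up toy values for illustration, not the output of a real system:

```python
# Toy noisy-channel scoring: the best translation maximizes P(e) * P(f|e).
lm = {  # P(e): how fluent the English candidate is (toy values)
    "a flower red": 0.001, "red flower a": 0.001, "flower red a": 0.001,
    "a red dog": 0.2, "dog cat mouse": 0.0001, "a red flower": 0.2,
}
tm = {  # P(f|e): how well the candidate preserves the Russian meaning (toy values)
    "a flower red": 0.3, "red flower a": 0.3, "flower red a": 0.3,
    "a red dog": 0.001, "dog cat mouse": 0.0001, "a red flower": 0.3,
}
best = max(lm, key=lambda e: lm[e] * tm[e])
print(best)  # "a red flower": the only candidate that is both fluent and adequate
```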

10

SMT: The Language Model P(e)

11

Language Model

• Goal: prefer “good” to “bad” English
  - “good” ≠ grammatical
  - “bad” ≈ unlikely

• Examples (grammaticality):
  - I do not like strong tea. → good
  - I do not like powerful tea. → bad
  - I like strong tea not. → bad
  - Like not tea strong do I. → bad

12

Example: Grammatical but Low-probability Text

Eye halve a spelling checker
It came with my pea sea
It plainly marks four my revue
Miss steaks eye kin knot sea.

Eye strike a key and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me a strait a weigh.

As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
Its rare lea ever wrong.

Eye have run this poem threw it
I am shore your pleased two no
Its letter perfect awl the weigh
My checker tolled me sew.

Торопыжка был голодный - проглотил утюг холодный. (Russian: “Toropyzhka was hungry, so he swallowed a cold flat-iron.”)

13

Language Model:Learned from Monolingual Text

14

Bigram Language Model

Chain rule:

P(w1 w2 w3 w4 …) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) · P(w4 | w1 w2 w3) · …

First-order Markov model (approximation):

P(w1 w2 w3 w4 …) ≈ P(w1) · P(w2 | w1) · P(w3 | w2) · P(w4 | w3) · …

(Andrei Markov)

15

Bigram Language Model

The conditional probabilities are estimated from counts in the training text:

P(wi | wi−1) = C(wi−1 wi) / C(wi−1)

and the probability of a word sequence is approximated as:

P(w1 w2 … wn) ≈ ∏i P(wi | wi−1)

P(“I eat an apple …”) = P(I | <S>) . P(eat | I) . P(an | eat) . P(apple | an) …
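A maximum-likelihood bigram model is just two tables of counts; a minimal sketch (no smoothing, so unseen histories are not handled):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram model: P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<S>"] + s.split()
        unigrams.update(words[:-1])            # history counts C(w_{i-1})
        bigrams.update(zip(words, words[1:]))  # pair counts C(w_{i-1} w_i)
    return lambda w, h: bigrams[(h, w)] / unigrams[h]

p = train_bigram_lm(["I eat an apple", "I eat an orange"])
print(p("eat", "I"))     # C(I eat) / C(I) = 2/2 = 1.0
print(p("apple", "an"))  # C(an apple) / C(an) = 1/2 = 0.5
```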

16

SMT: The Translation Model P(f|e)

17

Modeling P(f|e) – Sentence Level

Batman did not fight any cat woman .

Бэтмен не вел бой с никакой женщиной кошкой .

•Cannot be estimated directly

18

Modeling P(f|e)

Batman did not fight any cat woman .

Бэтмен не вел бой с никакой женщиной кошкой .

•Broken into smaller steps

19

IBM Model 4: Generation (Brown et al., CL 1993)

Batman did not fight any cat woman .

Batman not fight fight any cat woman .

Batman not fight fight NULL any cat woman .

Бэтмен не вел бой с никакой кошкой женщиной .

Бэтмен не вел бой с никакой женщиной кошкой .

n(3|fight): fertility — “fight” generates 3 target words

P-NULL: probability of inserting the NULL word

t(не|not): lexical translation probability

d(8|7): distortion (reordering) probability

(Brown et al., CL 1993)

20

IBM Model 4: Generation (Brown et al., CL 1993)

Batman did not fight any cat woman .

Batman not fight fight any cat woman .

Batman not fight fight NULL any cat woman .

Бэтмен не вел бой с никакой кошкой женщиной .

Бэтмен не вел бой с никакой женщиной кошкой .

n(3|fight): fertility — “fight” generates 3 target words

P-NULL: probability of inserting the NULL word

t(не|not): lexical translation probability

d(8|7): distortion (reordering) probability

• All these probabilities could be learned if word alignments were available.

• We can learn word alignments using EM.

(Brown et al., CL 1993)

21

Translation Model: Learned from a Bi-Text

Reach Out to Asia (ROTA) has announced its fifth Wheels ‘n’ Heels, Qatar’s largest annual community event, which will promote ROTA’s partnership with the Qatar Japan 2012 Committee. Held at the Museum of Islamic Art Park on 10 February, the event will celebrate 40 years of cordial relations between the two countries. Essa Al Mannai, ROTA Director, said: “A group of 40 Japanese students are traveling to Doha especially to take part in our event.

22

100 Sentence Pairs

23

1000 Sentence Pairs

24

10,000 Sentences = 1 Book

25

100,000 Sentences = Stack of Books

26

1,000,000 Sentences = Shelf of Books

27

10 Million Sentences = Large Shelf of Books

28

The Large Data Trend Continues

29

Alignment Levels

- Document

- Paragraph

- Sentence
  o Gale & Church algorithm

- Word
  o IBM models
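As a rough illustration of length-based sentence alignment, here is a heavily simplified sketch in the spirit of Gale & Church: dynamic programming over 1-1, 1-0, and 0-1 moves with an ad-hoc character-length cost. The real algorithm models length ratios probabilistically and also allows 2-1 and 1-2 beads; the cost values here are arbitrary:

```python
# Simplified length-based sentence aligner (toy version of Gale & Church).
def align_by_length(src, tgt, skip_cost=50):
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # 1-1 match: pay the character-length mismatch
                c = cost[i-1][j-1] + abs(len(src[i-1]) - len(tgt[j-1]))
                if c < cost[i][j]: cost[i][j], back[i][j] = c, (i-1, j-1)
            if i:        # 1-0: source sentence left unaligned
                c = cost[i-1][j] + skip_cost
                if c < cost[i][j]: cost[i][j], back[i][j] = c, (i-1, j)
            if j:        # 0-1: target sentence left unaligned
                c = cost[i][j-1] + skip_cost
                if c < cost[i][j]: cost[i][j], back[i][j] = c, (i, j-1)
    pairs, ij = [], (n, m)  # backtrace to recover the 1-1 pairs
    while back[ij[0]][ij[1]] is not None:
        pi, pj = back[ij[0]][ij[1]]
        if (pi, pj) == (ij[0] - 1, ij[1] - 1):
            pairs.append((pi, pj))
        ij = (pi, pj)
    return list(reversed(pairs))

src = ["Short.", "A much longer sentence here."]
tgt = ["Kurz.", "Ein viel längerer Satz hier."]
print(align_by_length(src, tgt))  # [(0, 0), (1, 1)]
```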

30

Learning Word Alignments Using Expectation Maximization (EM)

… красивые цветы … красивые красные цветы … красивые девушки …

… beautiful flowers … beautiful red flowers … beautiful girls …
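The intuition on this slide is what IBM Model 1, the simplest of the IBM models, formalizes. Below is a toy sketch of its EM loop (uniform initialization, no NULL word), not the full Model 4 described earlier:

```python
from collections import defaultdict

def ibm1(bitext, iterations=10):
    """Toy IBM Model 1: EM estimation of t(f|e) from sentence pairs."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # t[(f, e)], uniform start
    for _ in range(iterations):
        count = defaultdict(float); total = defaultdict(float)
        for fs, es in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # E-step normalizer
                for e in es:
                    c = t[(f, e)] / z           # expected (fractional) count
                    count[(f, e)] += c; total[e] += c
        for (f, e) in count:                    # M-step: renormalize
            t[(f, e)] = count[(f, e)] / total[e]
    return t

bitext = [
    ("красивые цветы".split(), "beautiful flowers".split()),
    ("красивые красные цветы".split(), "beautiful red flowers".split()),
    ("красивые девушки".split(), "beautiful girls".split()),
]
t = ibm1(bitext)
# After EM, "красные" concentrates its probability mass on "red",
# because "beautiful" must explain "красивые" and "flowers" must explain "цветы".
best = max(("beautiful", "red", "flowers"), key=lambda e: t[("красные", e)])
print(best)
```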


34

Phrase-based SMT

35

Phrase-Based SMT

• Sentence is broken into phrases
  - Contiguous token sequences
  - Not linguistic units

• Each phrase is translated in isolation

• Translated phrases are reordered

Batman has not fought a cat woman yet . → Бэтмен пока не сражался с женщиной кошкой .

(Koehn et al., HLT-NAACL 2003)
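The phrase pairs used by such a system are extracted from word-aligned sentence pairs. Here is a minimal sketch of consistent phrase-pair extraction (scores omitted); the function and variable names are illustrative, not from any particular toolkit:

```python
# Extract phrase pairs that are "consistent" with a word alignment:
# no word inside the pair may be aligned to a word outside it.
def extract_phrases(src, tgt, alignment, max_len=4):
    """alignment: set of (i, j) pairs linking src[i] to tgt[j]."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions aligned to the source span [i1, i2]
            js = [j for (i, j) in alignment if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            # consistency check: nothing in [j1, j2] aligns outside [i1, i2]
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((" ".join(src[i1:i2+1]), " ".join(tgt[j1:j2+1])))
    return pairs

src = "red flower".split()
tgt = "красный цветок".split()
pairs = extract_phrases(src, tgt, {(0, 0), (1, 1)})
print(pairs)  # [('red', 'красный'), ('red flower', 'красный цветок'), ('flower', 'цветок')]
```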

36

Phrase-Based Translation

• Multiple words → multiple words

• Models context

• Handles non-compositional phrases

• More data → longer phrases

37

Phrase-Based SMT: Sample Bulgarian-English Phrases

38

Sample Phrases: главен

главни прокурори chief prosecutors

главни счетоводители chief accountants

главни архитекти chief architects

главни щабове main staffs

главни улици main streets

главни методисти senior instructors

главно предизвикателство major challenge

39

Sample Phrases: както

• както физическа , так и психическа ||| both physical and psychological
• както целият регион ||| like the whole region
• както те са определени ||| as defined
• както и размера ||| as well as the size
• както и предишните редовни доклади ||| in line with previous regular reports
• както и по други ||| and in other

40

Phrase-Based SMT: Sample Russian-Bulgarian Phrases

41

Sample Phrases: заявление

• заявление ||| молба ||| 0.25 0.166667 1 1 2.718
• заявление об ||| молба за ||| 1 0.00524692 1 0.53125 2.718
• заявление об образовании ||| молба за образуването ||| 1 0.005 ...
• заявления ||| заявление ||| 1 1 0.5 0.666667 2.718
• заявления ||| заявление от ||| 1 0.500677 0.5 0.222222 2.718
• заявляю ||| заявявам ||| 0.333333 0.6 1 1 2.718
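These lines follow the classic Moses-style phrase-table format: source ||| target ||| scores. Assuming the five scores are the usual inverse phrase probability, inverse lexical weighting, direct phrase probability, direct lexical weighting, and constant phrase penalty exp(1) ≈ 2.718, a minimal parser looks like:

```python
# Parse one Moses-style phrase-table line into (source, target, scores).
def parse_phrase_line(line):
    src, tgt, scores = (field.strip() for field in line.split("|||"))
    return src, tgt, [float(s) for s in scores.split()]

src, tgt, scores = parse_phrase_line(
    "заявление ||| молба ||| 0.25 0.166667 1 1 2.718")
print(src, "->", tgt, scores)  # заявление -> молба [0.25, 0.166667, 1.0, 1.0, 2.718]
```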

42

Sample Phrases: звонок, звук

• звонка ||| звънец ||| 1 1 0.4 0.5 2.718
• звонка ||| звънеца ||| 0.25 0.2 0.4 0.5 2.718
• звонка ||| на звънеца ||| 1 0.2 0.2 0.128199 2.718
• звонки ||| звънци ||| 0.4 0.4 1 1 2.718
• звонко ||| звънко ||| 0.333333 0.428571 1 1 2.718
• звонков ||| звънци ||| 0.4 0.4 1 1 2.718
• звонку ||| звънеца ||| 0.25 0.2 1 1 2.718
• звонок ||| звънеца ||| 0.375 0.3 0.375 0.3 2.718
• звонок ||| звънецът ||| 1 1 0.125 0.1 2.718
• звонок ||| иззвъня ||| 0.6 0.625 0.375 0.5 2.718

• звук ||| звук ||| 0.666667 0.666667 1 1 2.718
• звука ||| звук ||| 0.333333 0.333333 0.666667 0.4 2.718
• звука ||| звука ||| 1 0.666667 0.333333 0.4 2.718
• звуки ||| звуци ||| 1 1 1 1 2.718

43

Sample Phrases: здание

• здание ||| здание ||| 1 1 0.4 0.4 2.718
• здание ||| зданието ||| 0.75 0.5 0.6 0.6 2.718
• здания ||| зданието ||| 0.25 0.5 0.2 0.375 2.718
• здания ||| зданието на ||| 1 0.250861 0.4 0.140625 2.718
• здания ||| сградите ||| 1 1 0.2 0.25 2.718
• здания ||| сградите на ||| 1 0.500861 0.2 0.09375 2.718

44

Sample Phrases: здравствуй

• здравствуй ||| добро утро ||| 1 0.75 0.333 0.0625 2.718
• здравствуй ||| здравей ||| 1 1 0.666667 0.5 2.718

•здравствуйте ||| здравейте ||| 1 1 1 1 2.718

•здравствует ||| живее ||| 0.4 0.333333 1 1 2.718

45

Sample Phrases: необычайное

• необычайное ||| необикновено ||| 0.176471 0.142857 0.75 0.75 2.718
• необычайное ||| необикновеното ||| 0.333333 0.333333 0.25 0.25 2.718
• необычайно ||| извънредно ||| 1 0.4 0.125 0.117647 2.718
• необычайно ||| необикновена ||| 0.222222 0.166667 0.125 0.117647 2.718
• необычайно ||| необикновено ||| 0.588235 0.476191 0.625 0.588235 2.718
• необычайно ||| необичайно ||| 1 1 0.0625 0.117647 2.718
• необычайной ||| необикновена ||| 0.333333 0.416667 0.5 0.625 2.718
• необычайной ||| необикновено ||| 0.0588235 0.047619 0.166667 0.125 2.718
• необычайной ||| с необикновена ||| 1 0.209808 0.333333 0.15625 2.718
• необычайные ||| необикновени ||| 0.5 0.5 1 1 2.718
• необычайный ||| необикновен ||| 0.222222 0.222222 0.5 0.5 2.718
• необычайный ||| необикновеният ||| 0.5 0.5 0.25 0.25 2.718
• необычайный ||| необичайни ||| 0.333333 0.25 0.25 0.25 2.718

• необычное ||| необикновеното ||| 0.666667 0.666667 1 1 2.718
• необычные ||| необичайни ||| 0.666667 0.5 1 1 2.718

• неожиданной ||| неочакваната ||| 0.333333 0.333333 0.25 0.25 2.718
• неожиданной ||| неочаквана ||| 0.666667 0.6 0.75 0.75 2.718

46

SMT: Evaluation

47

How MT Evaluation is NOT Done…

• Backtranslation

  - A “mythical” example (Hutchins, 1995):
    o En: The spirit is willing, but the flesh is weak.
    o Ru: Дух бодр, но плоть слаба.
    o En: The vodka is good, but the meat is rotten.

  - Not used; can be gamed easily (a system that simply copies its input scores perfectly):
    o En: The spirit is willing, but the flesh is weak.
    o Ru: The spirit is willing, but the flesh is weak.
    o En: The spirit is willing, but the flesh is weak.

48

The BLEU Evaluation Metric (Papineni et al., ACL 2002)

Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

• BLEU4 formula (counts n-grams up to length 4):

BLEU4 = exp(1.0 · log p1 + 0.5 · log p2 + 0.25 · log p3 + 0.125 · log p4 − max(words-in-reference / words-in-machine − 1, 0))

where p1 = 1-gram precision, p2 = 2-gram precision, p3 = 3-gram precision, p4 = 4-gram precision.

• Correlates well with human judgments
• Very hard to “game”
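The formula above can be turned into a small single-reference sketch. Note it uses the slide's weights (1.0, 0.5, 0.25, 0.125) rather than the uniform 0.25 weights of standard BLEU, and handles only one reference:

```python
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    h = Counter(zip(*[hyp[i:] for i in range(n)]))
    r = Counter(zip(*[ref[i:] for i in range(n)]))
    overlap = sum(min(c, r[g]) for g, c in h.items())
    return overlap / max(sum(h.values()), 1)

def bleu4(hyp, ref):
    """BLEU4 as on the slide: weighted log-precisions minus a brevity penalty."""
    hyp, ref = hyp.split(), ref.split()
    ps = [ngram_precision(hyp, ref, n) for n in (1, 2, 3, 4)]
    if min(ps) == 0:          # any zero precision zeroes the score
        return 0.0
    weights = (1.0, 0.5, 0.25, 0.125)
    bp = max(len(ref) / len(hyp) - 1, 0)  # penalizes hypotheses shorter than the reference
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, ps)) - bp)

print(bleu4("the cat sat on the mat", "the cat sat on the mat"))  # 1.0 for an exact match
```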

(Papineni et al., ACL 2002)

49

BLEU: Multiple Reference Translations

Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .

Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .

Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.


(Papineni et al., ACL 2002)

50

Phrase-Based SMT: Parameter Tuning

51

The Basic Model, Revisited

argmax_e P(e | f)
= argmax_e P(e) × P(f | e) / P(f)
= argmax_e P(e) × P(f | e)
→ argmax_e P(e)^2.4 × P(f | e)   (works better)
→ argmax_e P(e)^2.4 × P(f | e) × #words(e)^1.1   (rewards longer hypotheses, since they are unfairly penalized by P(e))
→ … × P(e | f)^1.1 × P_lex(f | e)^1.3 × P_lex(e | f)^0.9 × #phrases(e, f)^0.5 × …

(Och, ACL 2003)
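The tuned model is log-linear: each feature gets an exponent (weight), and maximizing the product of weighted features is the same as maximizing the weighted sum of log-features. A sketch with hypothetical feature values; the weights are the illustrative exponents from the slide, not values from a real tuned system:

```python
import math

# Log-linear model score: sum of weight * log(feature value).
def loglinear_score(features, weights):
    return sum(weights[k] * math.log(v) for k, v in features.items())

hypothesis = {  # hypothetical feature values for one candidate translation
    "P(e)": 0.01, "P(f|e)": 0.2, "num_words": 7,
}
weights = {"P(e)": 2.4, "P(f|e)": 1.0, "num_words": 1.1}
score = loglinear_score(hypothesis, weights)
print(score)
```

Tuning (e.g., with MERT) amounts to searching for the weight vector that maximizes BLEU on a development set.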

52

Maximum BLEU Training (Och, ACL 2003)

MERT: Minimum Error Rate Training (optimizes BLEU directly). A trainable translation system (combining language models, a translation model, a length model, and other features) translates French input into English MT output. An automatic evaluator compares the output against English reference translations (sample “right answers”) and computes a BLEU score, which is fed back to tune the feature weights.

(Och, ACL 2003)

53

Statistical Phrase-Based Translation

1. Training:
   - P(e): train an n-gram language model
   - P(f|e): generate word alignments, then build a phrase table

2. Tuning:
   - Use MERT to tune the parameters

3. Evaluation:
   - Run the system on test data
   - Calculate BLEU
