Machine Translation Domain Adaptation Day 19 1

Upload: jordan-earles

Post on 31-Mar-2015


Page 1: Machine Translation Domain Adaptation Day 19 1. PROJECT #2 2

Machine Translation
Domain Adaptation

Day 19


PROJECT #2

MEMM tools

• The online description of project #2 has been updated with more information

Quick walk through

1. You write code to convert the tagged training data to features!
   “featurize.pl training.txt training.feats”

   training.txt
       I/PRP left/VBD ./.
       John/NNP arrived/VBD ./.

   training.feats
       PRP w0=I:1 w-1=<s>:1
       VBD w0=left:1 w-1=I:1
       . w0=.:1 w-1=left:1

       NNP w0=John:1 w-1=<s>:1
       VBD w0=arrived:1 w-1=John:1
       . w0=.:1 w-1=arrived:1

2. Run memm_train to train this model
   “memm_train --input training.feats --classifier trigram.model --markovOrder 2”

   trigram.model
       <binary gobbledygook>

3. Get some unseen test data…

   test.txt
       he/PRP arrived/VBD ./.
       John/NNP left/VBD ./.

4. Use the same featurization code on test data
   “featurize.pl test.txt test.feats”

   test.feats
       PRP w0=he:1 w-1=<s>:1
       VBD w0=arrived:1 w-1=he:1
       . w0=.:1 w-1=arrived:1

       NNP w0=John:1 w-1=<s>:1
       VBD w0=left:1 w-1=John:1
       . w0=.:1 w-1=left:1

5. memm_test predicts tags (memm_test ignores the first column, so it can include the true tags)
   “memm_test --input test.feats --classifier trigram.model --markovOrder 2 --output test.tags”

   test.tags
       PRP
       VBD
       .

       NNP
       VBD
       .
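The featurization step in the walk-through can be sketched in a few lines. The slides name a Perl script, featurize.pl, whose contents are not shown; the Python function below is a hypothetical stand-in that produces the same two features (current word w0 and previous word w-1). The blank line between sentences is an assumption about the expected file format.

```python
def featurize(tagged_lines):
    """Turn 'word/TAG word/TAG ...' lines into one feature line per token.

    Hypothetical stand-in for featurize.pl; emits only the w0 and w-1
    features shown in the walk-through.
    """
    out = []
    for line in tagged_lines:
        tokens = line.split()
        if not tokens:
            continue
        prev_word = "<s>"  # sentence-start marker, as in training.feats
        for token in tokens:
            # rsplit on the last '/' so that './.' yields word '.' and tag '.'
            word, tag = token.rsplit("/", 1)
            out.append(f"{tag} w0={word}:1 w-1={prev_word}:1")
            prev_word = word
        out.append("")  # blank line between sentences (assumed separator)
    return out


training = ["I/PRP left/VBD ./.", "John/NNP arrived/VBD ./."]
feats = featurize(training)
print("\n".join(feats))
```

Running the same function on test.txt keeps the training and test features consistent, which is the point of step 4.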

MEMM features

training.txt
    I/PRP left/VBD ./.
    John/NNP arrived/VBD ./.

You provide these features, and add the argument “--markovOrder 2”:

training.feats
    PRP w0=I:1 w-1=<s>:1
    VBD w0=left:1 w-1=I:1
    . w0=.:1 w-1=left:1

    NNP w0=John:1 w-1=<s>:1
    VBD w0=arrived:1 w-1=John:1
    . w0=.:1 w-1=arrived:1

The MEMM adds in features about tag context at training and test time:

Actual features used by MEMM
    PRP w0=I:1 w-1=<s>:1 t[-1]=<s>:1 t[-1]=<s>,t[-2]=<s>:1
    VBD w0=left:1 w-1=I:1 t[-1]=PRP:1 t[-1]=PRP,t[-2]=<s>:1
    . w0=.:1 w-1=left:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=PRP:1
    <s> t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
    NNP w0=John:1 w-1=<s>:1 t[-1]=<s>:1 t[-1]=<s>,t[-2]=<s>:1
    VBD w0=arrived:1 w-1=John:1 t[-1]=NNP:1 t[-1]=NNP,t[-2]=<s>:1
    . w0=.:1 w-1=arrived:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=NNP:1
    <s> t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
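To make the tag-context features concrete, here is a minimal Python sketch of how second-order tag history could be appended to each feature line. It is not the memm_train implementation, which the slides do not show; the function name and data layout are hypothetical. It uses the true previous tags, as at training time; at test time the previous tags would presumably come from the decoder's own hypotheses.

```python
def add_tag_context(sentence):
    """sentence: list of (tag, [features]) pairs for one sentence.

    Returns feature lines extended with t[-1] and t[-1],t[-2] features,
    plus a final <s> line that closes the sentence.
    """
    lines = []
    t1, t2 = "<s>", "<s>"  # t[-1] and t[-2] both start as the boundary tag
    for tag, feats in sentence + [("<s>", [])]:
        history = [f"t[-1]={t1}:1", f"t[-1]={t1},t[-2]={t2}:1"]
        lines.append(" ".join([tag] + feats + history))
        t1, t2 = tag, t1  # shift the tag history by one position
    return lines


sentence = [
    ("PRP", ["w0=I:1", "w-1=<s>:1"]),
    ("VBD", ["w0=left:1", "w-1=I:1"]),
    (".", ["w0=.:1", "w-1=left:1"]),
]
context_lines = add_tag_context(sentence)
print("\n".join(context_lines))
```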

MACHINE TRANSLATION

Acknowledgments

• Many thanks to (for helpful content and input on content):
  – Chris Callison-Burch, Matt Post, & Adam Lopez (JHU)
  – Philipp Koehn & Barry Haddow (U Edinburgh)
  – Kevin Knight (ISI)

Translation: global problem and interesting research problem

Internet users by language, 2007: English 32%, Chinese 13%, Spanish 9%, Japanese 7%, French 5%, German 4%, Arabic 4%, Portuguese 4%, Other 21%

• Non-English Internet content and user communities are increasing explosively

• Human translation costs are excessive: major languages range from 10-50 cents per word

Result: the vast majority of published material remains untranslated!

Prevalence of MT on the Web

[Bar chart: Proportion of MT’d Content by language. Languages shown: Estonian, Hungarian, Slovenian, Slovak, Romanian, Latvian, Lithuanian. Values shown: 12.13%, 12.93%, 25.47%, 46.40%, 47.40%, 50.07%, 51.53%.]

From Rarrick et al, 2010

The Goal: (sentence) translation

• Translate source sentences into target sentences
  – For now, ignore discourse structure, co-reference, and phenomena across sentence boundaries

滴水之恩當以涌泉相報

A drop of water shall be returned with a burst of spring.

Types of MT systems

• Source of information
  – Rule based: People write rules to specify translations of words, phrases
  – Data-driven: Use learning techniques to derive translation “rules” from data sources (e.g., parallel corpora)

• Level of representation (modified Vauquois pyramid, top to bottom)
  – Interlingua
  – Semantic forms
  – Syntax trees
  – Phrases
  – Words

Advantages of data-driven translation

• We can model the genres of documents that we would like to model
  – Learn contextually appropriate translations for technical data, chat data, etc.
• Very flexible system
  – Given corpus C = ({x1,y1}, {x2,y2}, …) of sentence pairs
  – Translate(C, x) = y is a function of the training data and the input sentence
  – To build a new system (or optimize our old one) we just change the data
  – But…we need oodles of data to get “good” models

Statistical MT

• Learn word and phrase alignments from “parallel” data
  – Start with parallel documents
    • Need parallel sentences
    • Sentence break and sentence align
  – Word align and produce word and phrase translation tables (our translation models)

Some Hmong

a house                 ib lub tsev
a new house             ib lub tsev tshiab
my new house            kuv lub tsev tshiab
eight new houses        yim lub tsev tshiab
my eight new houses     kuv yim lub tsev tshiab

Some More Hmong

the house               lub tsev

Even More Hmong

kuv pluag heev          I'm very poor
ib pluag mov            a meal
ib taig mov             a bowl of rice
ib taig zaub            a bowl of vegetables

Statistical MT

• Learn word and phrase alignments from “parallel” data
  – Start with parallel documents
    • Need parallel sentences
    • Sentence break and sentence align
  – Word align and produce word and phrase translation tables (our translation models)
• Use monolingual data to
  – Build language models
    • Inform ordering
    • Choose best translation from n-best list

Statistical MT Recipe

Start With
• Parallel sentences
  – Align words & phrases, & generate counts
• Monolingual data
• Decoding Algorithm

Build These Components
• Translation Model
  – Probs associated with aligned words & phrases: P(F|E)
• Language Model: P(E)
• Decoder
  – Maximizes P(F|E)*P(E)

Statistical Machine Translation

• Given foreign f, find best English translation e*
    $e^* = \arg\max_e P(e \mid f)$

• Use Bayes’ rule to get “noisy channel” model
    $P(e \mid f) = P(f \mid e) \cdot P(e) / P(f)$
    $\arg\max_e P(e \mid f) = \arg\max_e P(f \mid e) \cdot P(e)$

• P(f | e) is the channel or translation model
• P(e) is the language model
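As a toy illustration of the noisy channel argmax, the Python sketch below scores three candidate translations by P(f | e) times P(e). The candidate strings and all probabilities are made up for illustration; a real decoder searches a far larger hypothesis space and works in log space. Note that P(f) is dropped because it is the same for every candidate.

```python
# Made-up numbers for illustration only: each candidate English sentence e
# gets a translation-model score P(f | e) and a language-model score P(e).
candidates = {
    "the house is small": {"p_f_given_e": 0.20, "p_e": 0.010},
    "small the is house": {"p_f_given_e": 0.20, "p_e": 0.00001},
    "the home is little": {"p_f_given_e": 0.05, "p_e": 0.004},
}


def noisy_channel_score(entry):
    """P(f | e) * P(e); P(f) is constant across candidates and omitted."""
    return entry["p_f_given_e"] * entry["p_e"]


best = max(candidates, key=lambda e: noisy_channel_score(candidates[e]))
print(best)
```

The scrambled candidate has the same translation-model score as the fluent one, so only the language model separates them.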

Centauri/Arcturan [Knight, 1997]

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

1a. ok-voon ororok sprok .
1b. at-voon bichat dat .

2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .

3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .

4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .

5a. wiwok farok izok stok .
5b. totat jjat quat cat .

6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .

7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .

8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .

9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .

10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .

11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .

12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .

Clues used while solving:
• process of elimination
• cognate?
• zero fertility

Slides 38-74 adapted from Kevin Knight and CCB’s JHU crew
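One way to mechanize the first pass of this puzzle is to intersect the Arcturan sides of every sentence pair that contains a given Centauri word. The Python sketch below does exactly that on the 12 pairs from the slide (final periods dropped). It is a simple co-occurrence heuristic, not a trained word alignment model, and the function name is hypothetical.

```python
# The 12 Centauri/Arcturan sentence pairs from the slide, without the final '.'
corpus = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def translation_candidates(word, pairs):
    """Arcturan words present in every pair whose Centauri side has `word`."""
    sides = [set(arc.split()) for cen, arc in pairs if word in cen.split()]
    return set.intersection(*sides) if sides else set()


for w in "farok crrrok hihok yorok clok kantok ok-yurp".split():
    print(w, sorted(translation_candidates(w, corpus)))
```

Words seen in several pairs (farok, hihok) narrow to a single candidate, while words seen only once (crrrok, kantok) keep the whole target sentence as candidates, which is where process of elimination comes in.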

It’s Really Spanish/English

Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

1a. Garcia and associates .
1b. Garcia y asociados .

2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .

3a. his associates are not strong .
3b. sus asociados no son fuertes .

4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .

5a. its clients are angry .
5b. sus clientes estan enfadados .

6a. the associates are also angry .
6b. los asociados tambien estan enfadados .

7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .

8a. the company has three groups .
8b. la empresa tiene tres grupos .

9a. its groups are in Europe .
9b. sus grupos estan en Europa .

10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .

11a. the groups do not sell zanzanine .
11b. los grupos no venden zanzanina .

12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .

Centauri/Arcturan [Knight, 1997]

Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }

zero fertility

Reorder

5040 Possible Orderings!!

Language Model

• Use a standard n-gram language model for P(E).
• Trained on large monolingual corpus
  – 4- or 5-gram is typical
  – Often uses target side of parallel data + monolingual data
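To show how a language model can inform ordering, here is a minimal Python sketch: a bigram model with add-one smoothing, trained on a three-sentence toy corpus, scores every permutation of a bag of words. The corpus, the bigram order, and the smoothing choice are all assumptions made to keep the example small; as noted above, real systems typically use 4- or 5-gram models trained on large corpora.

```python
from collections import Counter
from itertools import permutations
import math

# Toy monolingual corpus (made up for illustration).
corpus = [
    "the house is new",
    "the new house is small",
    "my house is new",
]


def train_bigram(sentences):
    """Count histories and bigrams, with <s> and </s> boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])  # every token that serves as a history
        bigrams.update(zip(words, words[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, len(vocab)


def log_prob(words, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a word sequence."""
    seq = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(seq, seq[1:])
    )


model = train_bigram(corpus)
bag = ["new", "is", "house", "the"]
orderings = list(permutations(bag))
best = max(orderings, key=lambda o: log_prob(o, *model))
print(len(orderings), " ".join(best))
```

Enumerating all permutations only works for tiny bags, since the count grows factorially; a decoder has to prune the search instead.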

Translation Model

• “Phrase table”
  – N-gram pairs and probabilities

Statistical Machine Translation

EVALUATING MT

MT Evaluation

Source : ズキズキ 痛み ます 。

16 human translations:
• I have a throbbing pain.
• I am experiencing a throbbing pain.
• I am suffering from a throbbing pain.
• I am feeling a throbbing pain.
• It is a throbbing pain.
• It's throbbing and it really hurts.
• It's painful and it's throbbing.
• It's throbbing with pain.
• It's in throbbing pain.
• It hurts so much it's throbbing.
• I've got a throbbing pain.
• I can feel a throbbing pain.
• I am suffering from a throbbing pain.
• I am experiencing a throbbing pain.
• I have a painful throbbing.
• I feel a painful throbbing.

Data from International Workshop on Spoken Language Translation

MT Evaluation

• No “right answer”!
• What can we test instead?
  – Human adequacy / fluency ratings
  – Human efficacy in an application (e.g. question answering from translated foreign documents vs. native documents)
  – Very accurate, but slow & expensive
• Agreement with reference translations
  – BLEU (BiLingual Evaluation Understudy: IBM)
  – Fast system development

BLEU (Papineni, ACL 2002)

• MT output:
  1: It is a guide to action which ensures that the military always obeys the commands of the party.
  2: It is to insure the troops forever hearing the activity guidebook that party direct.

• Human (reference) translations:
  1: It is a guide to action that ensures that the military will forever heed Party commands.
  2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
  3: It is the practical guide for the army always to heed the directions of the party.

BLEU: observations

1: It is a guide to action which ensures that the military always obeys the commands of the party.

2: It is to insure the troops forever hearing the activity guidebook that party direct.

• Observations
  – Word overlap is indicative
  – n-gram (word sequence) overlap is even more distinct
  – Drawing from multiple reference translations helps

BLEU metric

• Compute n-gram precisions:
    $P_n = c(\text{matched n-grams}) / c(\text{n-grams in candidate})$

• Compute a brevity penalty (prevent candidates from deleting difficult words)
    $BP = \exp(\min(1 - r/c,\ 0))$, r = reference length, c = candidate length

• Combine using geometric mean
    $BLEU = BP \cdot \left(\prod_{i=1}^{n} P_i\right)^{1/n}$

• Produces score on a 0-1 scale, often expressed as a “percentage” (e.g., * 100)
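The three steps above can be sketched as a small sentence-level scorer in Python. Two details go beyond the slide and follow the usual description of BLEU: matched n-gram counts are clipped by the maximum count in any single reference, and r is taken as the reference length closest to the candidate length. BLEU is normally computed over a whole test corpus; this per-sentence version, with no smoothing and whitespace tokenization, is only meant to make the formula concrete.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: BP * geometric mean of P_1..P_max_n."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each n-gram by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # geometric mean is zero if any precision is zero
    c = len(cand)
    # r: reference length closest to c (ties broken toward the shorter one)
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    bp = math.exp(min(1 - r / c, 0))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
score1 = bleu("It is a guide to action which ensures that the military always obeys the commands of the party.", references)
score2 = bleu("It is to insure the troops forever hearing the activity guidebook that party direct.", references)
print(score1, score2)
```

The second MT output shares no trigram with any reference, so its unsmoothed sentence-level score collapses to zero, while the first output scores above zero.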

BLEU results circa 2002

[Plots from Papineni et al., ACL 2002, and from G. Doddington, NIST]

Distinguishes humans from machines… …correlates well with human judgments

However nowadays we’re starting to see problems:
- Some systems score better than human translations
- In competitions, some “gaming of BLEU”
- Rule based systems are at a disadvantage after tuning

Next Time

• MT & Word Alignment
• Application of EM