using comparable corpora to adapt a translation model to domains hiroyuki kaji, takashi tsunakawa,...

26
1 Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka University The 7th International Conference on Language Resources and Evaluation, Malta, May 2010.

Upload: lorraine-hopkins

Post on 18-Dec-2015

222 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

1

Using Comparable Corpora to Adapt

a Translation Model to Domains

Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada

Department of Computer Science, Shizuoka University

The 7th International Conference on Language Resources and Evaluation, Malta, May 2010.

Page 2: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

2

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

Page 3: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

3

Statistical machine translation– Able to learn a translation model from a parallel corpus– Suffer from the limited availability of large parallel corpora

Use comparable corpora for SMT– Estimate translation pseudo-probabilities from a bilingual

dictionary and comparable corpora– Use the pseudo-probabilities estimated from in-domain

comparable corpora to• Adapt a translation model learned from an out-of-domain

parallel corpus, or• Augment a translation model learned from a small in-

domain parallel corpus

Motivation and goal

Page 4: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

4

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

Page 5: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

5

Word associations suggest particular senses or translations of a polysemous word (Yarowsky 1993)

– (tank, soldier) the “military vehicle” sense or translation “ 戦車 [SENSHA]” of “tank”

– (tank, gasoline) the “container for liquid or gas” sense or translation “ タンク [TANKU]” of “tank”

Comparable corpora allow us to determine which word associations suggest which translations of a polysemous word (Kaji & Morimoto 2002)

Assume that the more word associations that suggest a translation, the higher the probability of the translation word would be

Basic idea for estimating word translation pseudo-probabilities from comparable corpora

Page 6: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

6

Naive method for estimating word translation pseudo-probabilities

English corpus Japanese corpus

English-Japanese dictionary

Align word associationsAlign word associations

Calculate the percentage of associated words suggesting each translationCalculate the percentage of associated words suggesting each translation

Extract word associationsExtract word associationsExtract word associationsExtract word associations

“Missile”, “soldier”, and others suggest “ 戦車 [SENSHA]”“Fuel”, “gasoline”, and others suggest “ タンク [TANKU]”

(tank, soldier)

(tank, gasoline)

(tank, missile)

(tank, fuel)

( 戦車 [SENSHA], 兵士 [HEISHI])

( タンク [TANKU], ガソリン[GASORIN])( 戦車 [SENSHA], ミサイル [MISAIRU])

( タンク [TANKU], 燃料[NENRYOU])

Pps( 戦車 [SENSHA]|tank)=|{missile, soldier, …}| / (|{fuel, gasoline, …}|+|{missile, soldier, …}|)

Pps( タンク [TANKU]|tank)=|{fuel, gasoline, …}| / (|{fuel, gasoline, …}|+|{missile, soldier, …}|)

Page 7: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

7

Failure in word-association alignment

(tank, Chechen) ?– due to the disparity in topical coverage between two language

corpora

(tank, Chechen) ? ( 戦車 [SENSHA], チェチェン[CHECHEN])

– due to the incomplete coverage of the intermediary bilingual dictionary

Incorrect word-association alignment

(tank, troop) ( 水槽 [SUISOU], 群れ [MURE])– due to incidental word-for-word correspondence between

word associations that do not really correspond to each other

Difficulties the naive method suffers from

Page 8: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

8

Two words associated with a third word are likely to suggest the same sense or translation of the third word when they are also associated with each other

“Soldier” and “troop”, both of which are associated with “tank”, are associated with each other

“Soldier” and “troop” suggest the same translation “ 戦車 [SENSHA]” Define a correlation between an associated word and a translation

using the correlations between other associated words and the translation

– C(troop, 戦車 [SENSHA]) MI(troop, tank) {MI(troop, soldier) C(soldier, 戦車 [SENSHA])

+ MI(troop, missile) C(missile, 戦車 [SENSHA]) + …}

– C(troop, タンク [TANKU]) MI(troop, tank) {MI(troop, soldier) C(soldier, タンク [TANKU])

+ MI(troop, missile) C(missile, タンク[TANKU]) + …}

How to overcome the difficulties

Page 9: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

9

Calculate the correlations iteratively starting with the initial values determined according to the results of word-association alignment via a bilingual dictionary

tank 戦車[SENSHA]

タンク[TANKU]

Chechen

fuel gasoline missile soldier troop

• Alignment

tank 戦車[SENSHA]

タンク[TANKU]

Chechen 0.0 0.0

fuel 0.0 1.0

gasoline 0.0 1.0

missile 1.0 0.0

soldier 1.0 0.0

troop 0.5 0.5

• C0(associated_word, translation)

Page 10: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

10

Overview of our method for estimating noun translation pseudo-probabilities

English corpus Japanese corpus

English-Japanese dictionary

AlignAlign

Calculate pairwise correlation between associated words and translations iterativelyCalculate pairwise correlation between associated words and translations iteratively

Extract pairs of words co-occurring in a window*Extract pairs of words co-occurring in a window*

English word associations

Calculate point-wise mutual information

Calculate point-wise mutual information

Extract pairs of words co-occurring in a window*Extract pairs of words co-occurring in a window*

Calculate point-wise mutual information

Calculate point-wise mutual information

Japanese word associations

Initial value of correlation matrix of English associated words vs. Japanese translations for an English noun

Assign each associated word to the translation with which it has the highest correla-tion and calculate the percentage of associated words assigned to each

translation

Assign each associated word to the translation with which it has the highest correla-tion and calculate the percentage of associated words assigned to each

translation

Correlation matrix of associated words vs. translations

Noun translation pseudo-probabilities

* Window size = 10 content words

Page 11: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

11

Example correlation matrix and estimated noun translation pseudo-probabilities

plant 装置[SOUCHI]

設備[SETSUBI

]

植物[SHOKU-BUTSU]

工場[KOUJOU]

プラント

[PURANTO]

苗[NAEI]

植木[UEKI]

activity 0.02 0.03 2.10 0.20 0.03 0.01 0.02bacteria 0.02 0.03 1.98 0.01 0.02 0.27 0.02boiler 0.05 2.70 0.05 0.03 2.73 0.03 0.04coal 0.87 2.35 1.70 0.68 2.06 0.65 0.99computer 0.55 0.71 0.02 0.49 0.73 0.01 0.01control 0.47 0.51 0.17 0.15 0.62 0.06 0.01culture 0.03 0.05 3.26 0.23 0.12 0.77 0.88environment 0.76 1.25 1.32 0.03 0.05 0.23 0.03failure 0.93 1.22 0.03 0.53 1.43 0.01 0.01flower 0.04 0.06 4.02 0.04 0.04 1.23 1.70

: : : : : : : :Translation pseudo-probabilities

.047 .241 .423 .022 .223 .022 .022

Page 12: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

12

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

Page 13: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

13

Our method for estimating noun-sequence translation pseudo-probabilities

English corpus Japanese corpusEnglish-Japanese dictionary

Generate all compositional translations

Generate all compositional translations

E(1)=e1(1)e2

(1)… em(1),

E(2)= e1(2)e2

(2)… em(2),

…,

E(n)=e1(n)e2

(n)… em(n) F=f1f2…fm

Estimate according to occurrence frequenciesEstimate according to occurrence frequencies

Combine two estimatesCombine two estimates

Retrieve compo-sitional transla-tions and count their frequencies

Retrieve compo-sitional transla-tions and count their frequencies

Extract a noun sequence with its frequency

Extract a noun sequence with its frequency

Estimate according to constituent-word translation pseudo-probabilitiesEstimate according to constituent-word translation pseudo-probabilities

n

k

m

ii

kips

m

ii

jips

j fePfePFEP1 1

)(

1

)()(2 )|()|()|(

n

k

kjj EgEgFEP1

)()()(1 )()()|(

)|()|()|( )(2

)(1

)( FEPFEPFEP jjjps

Page 14: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

14

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

Page 15: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

15

Phrase-based SMT using translation pseudo-probabilities

In-domain phrase table (pseudo-probabilities)

Bilingual dictionary

MergeMerge

Estimate translation pseudo-probabilities Estimate translation pseudo-probabilities

Giza++ & heuristics Giza++ & heuristics

SRILMSRILM

Out-of-domain (or in-domain) parallel corpus

In-domain source- language corpus

In-domain target- language corpus

Moses decoder Moses decoder

Basic phrase table

Adapted (or augmented) phrase table In-domain language model

Source language text Target language text

Page 16: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

16

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

Page 17: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

17

Experiment AAdapt a phrase table learned from an out-of-domain parallel corpus by using in-domain comparable corpora

Experiment BAugment a phrase table learned from an in-domain small parallel corpus by using in-domain larger comparable corpora

Experimental setting

Experiment A Experiment B

Training parallel corpus

20,000 pairs of Japanese and English patent abstracts in the physics

20,000 pairs of Japanese and English sentences having high similarity ―ones extracted from scientific-paper abstracts in the chemistry

Training comparable corpora

Scientific-paper abstracts in the chemistry ・ Japanese: 151,958 abstracts (90.8 Mbytes) ・ English: 102,730 abstracts (64.9 Mbytes)

Test corpus 1,000 Japanese sentences, each having one reference English translation, from scientific paper abstracts in the chemistry

Bilingual dictionary

333,656 pairs of translation equivalents between 163,247 Japanese and 93,727 English nouns from EDR, EIJIRO, and EDICT dictionaries

Page 18: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

18

Our method in four cases using a different volume of comparable corpora1. Japanese: all, English: all2. Japanese: half, English: all3. Japanese: all, English: half4. Japanese: half, English: half

Two baseline methods using the phrase table learned from the parallel corpus1. Baseline without dictionary2. Baseline with dictionary: Phrase table were augmented with the bilingual

dictionary[Note] The TL language model learned from the whole TL monolingual

corpus was used commonly in all cases involving our method and the baseline methods

Evaluation metric: BLEU-4

Page 19: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

19

BLEU-4 score

Our method rather slightly improved the BLEU score The effect of the difference in volume of comparable corpora remains

unclear Simply adding a bilingual dictionary improved the out-of-domain phrase

table, but did not improve the in-domain phrase table

Method Experiment A Experiment B

Our method

J:all, E:all 13.30 16.82

J:half, E:all 13.19 16.70

J:all, E:half 13.21 16.78

J:half, E:half 13.27 16.71

Baseline w/o dictionary 11.42 16.37

Baseline w/ dictionary 12.94 16.32

Experimental results

Page 20: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

20

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

Page 21: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

21

1. Optimization of the parameters – Parameters, including the window size and thresholds for word

occurrence frequency, co-occurrence frequency, and pointwise mutual information, affect the correlation matrix of associated words vs. translations

– How to optimize the values for the parameters remains unsolved

2. Alternatives for word-association measure – Pointwise mutual information, which tends to overestimate low-

frequency words, is not the most suitable for acquiring word associations

– Need to compare with alternatives such as log-likelihood ratio and the Dice coefficient

Discussions

Page 22: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

22

3. Refinement of the definition of translation pseudo-probability

– Need to consider the frequencies of associated words as well as the dependence among associated words

– Need to reconsider the strategy assigning an associated word to only one translation

4. Estimate of verb translation pseudo-probabilities

– Need to use syntactic co-occurrence, instead of co-occurrence in a widow, to extract verb-noun associations from corpora

– Need to define pariwise correlation between associated nouns and translations recursively based on heuristics where two nouns associated with a verb are likely to suggest the same sense of the verb when they belong to the same semantic class

Page 23: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

23

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

Page 24: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

24

Related work

Many studies on bilingual lexicon acquisition from bilingual comparable corpora have been reported since the mid 90s, but few studies on word translation probability estimate from bilingual comparable corpora

Estimate of word translation probabilities from comparable corpora using an EM algorithm (Koehn & Knight 2000) could be greatly affected by the occurrence frequencies of translation candidates in the TL corpus

In contrast, our method produces translation pseudo-probabilities that reflect the distribution of the senses of the SL word in the SL corpus

Methods for extracting parallel sentence pairs from bilingual comparable corpora (Zhao & Vogel, 2002; Utiyama & Isahara 2003; Fung & Cheung, 2004; Munteanu & Marcu, 2005); extracted parallel sentences could be used to learn a translation model with a conventional method based on word-for-word alignment. This approach is applicable only to closely comparable corpora.

In contrast, our method is applicable even to a pair of unrelated monolingual corpora.

Page 25: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

25

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

Page 26: Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka

26

Summary

A method for estimating translation pseudo-probabilities from a bilingual dictionary and bilingual comparable corpora was created– Assumption: The more associated words a translation is correlated with,

the higher its translation probability

– Essence of the method: Calculate pairwise correlations between associated words of an SL word and its TL translations

A phrase-based SMT framework using out-of-domain parallel corpus and in-domain comparable corpora was proposed– An experiment showed promising results; the BLEU score was improved

by using the translation pseudo-probabilities estimated from in-domain comparable corpora.

Future work includes optimizing the parameters and extending the method to estimate translation pseudo-probabilities for verbs.