using comparable corpora to adapt a translation model to domains hiroyuki kaji, takashi tsunakawa,...

1

Using Comparable Corpora to Adapt

a Translation Model to Domains

Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada

Department of Computer Science, Shizuoka University

The 7th International Conference on Language Resources and Evaluation, Malta, May 2010.

2

1. Motivation and goal

2. Proposed method

a. Estimating noun translation pseudo-probabilities

b. Estimating noun-sequence translation pseudo-probabilities

c. Phrase-based SMT using translation pseudo-probabilities

3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

3

Statistical machine translation– Able to learn a translation model from a parallel corpus– Suffer from the limited availability of large parallel corpora

Use comparable corpora for SMT– Estimate translation pseudo-probabilities from a bilingual

dictionary and comparable corpora– Use the pseudo-probabilities estimated from in-domain

comparable corpora to• Adapt a translation model learned from an out-of-domain

parallel corpus, or• Augment a translation model learned from a small in-

domain parallel corpus

Motivation and goal

4


2. Proposed method




3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

5

Word associations suggest particular senses or translations of a polysemous word (Yarowsky 1993)

– (tank, soldier) the “military vehicle” sense or translation “ 戦車 [SENSHA]” of “tank”

– (tank, gasoline) the “container for liquid or gas” sense or translation “ タンク [TANKU]” of “tank”

Comparable corpora allow us to determine which word associations suggest which translations of a polysemous word (Kaji & Morimoto 2002)

Assume that the more word associations that suggest a translation, the higher the probability of the translation word would be

Basic idea for estimating word translation pseudo-probabilities from comparable corpora

6

Naive method for estimating word translation pseudo-probabilities

English corpus Japanese corpus

English-Japanese dictionary

Align word associationsAlign word associations

Calculate the percentage of associated words suggesting each translationCalculate the percentage of associated words suggesting each translation

Extract word associationsExtract word associationsExtract word associationsExtract word associations

“Missile”, “soldier”, and others suggest “ 戦車 [SENSHA]”“Fuel”, “gasoline”, and others suggest “ タンク [TANKU]”

(tank, soldier)

(tank, gasoline)

(tank, missile)

(tank, fuel)

( 戦車 [SENSHA], 兵士 [HEISHI])

( タンク [TANKU], ガソリン[GASORIN])( 戦車 [SENSHA], ミサイル [MISAIRU])

( タンク [TANKU], 燃料[NENRYOU])

Pps( 戦車 [SENSHA]|tank)=|{missile, soldier, …}| / (|{fuel, gasoline, …}|+|{missile, soldier, …}|)

Pps( タンク [TANKU]|tank)=|{fuel, gasoline, …}| / (|{fuel, gasoline, …}|+|{missile, soldier, …}|)

7

Failure in word-association alignment

(tank, Chechen) ?– due to the disparity in topical coverage between two language

corpora

(tank, Chechen) ? ( 戦車 [SENSHA], チェチェン[CHECHEN])

– due to the incomplete coverage of the intermediary bilingual dictionary

Incorrect word-association alignment

(tank, troop) ( 水槽 [SUISOU], 群れ [MURE])– due to incidental word-for-word correspondence between

word associations that do not really correspond to each other

Difficulties the naive method suffers from

8

Two words associated with a third word are likely to suggest the same sense or translation of the third word when they are also associated with each other

“Soldier” and “troop”, both of which are associated with “tank”, are associated with each other

“Soldier” and “troop” suggest the same translation “ 戦車 [SENSHA]” Define a correlation between an associated word and a translation

using the correlations between other associated words and the translation

– C(troop, 戦車 [SENSHA]) MI(troop, tank) {MI(troop, soldier) C(soldier, 戦車 [SENSHA])

+ MI(troop, missile) C(missile, 戦車 [SENSHA]) + …}

– C(troop, タンク [TANKU]) MI(troop, tank) {MI(troop, soldier) C(soldier, タンク [TANKU])

+ MI(troop, missile) C(missile, タンク[TANKU]) + …}

How to overcome the difficulties

9

Calculate the correlations iteratively starting with the initial values determined according to the results of word-association alignment via a bilingual dictionary

tank 戦車[SENSHA]

タンク[TANKU]

Chechen

fuel gasoline missile soldier troop

• Alignment

tank 戦車[SENSHA]

タンク[TANKU]

Chechen 0.0 0.0

fuel 0.0 1.0

gasoline 0.0 1.0

missile 1.0 0.0

soldier 1.0 0.0

troop 0.5 0.5

• C0(associated_word, translation)

10

Overview of our method for estimating noun translation pseudo-probabilities

English corpus Japanese corpus

English-Japanese dictionary

AlignAlign

Calculate pairwise correlation between associated words and translations iterativelyCalculate pairwise correlation between associated words and translations iteratively

Extract pairs of words co-occurring in a window*Extract pairs of words co-occurring in a window*

English word associations

Calculate point-wise mutual information


Extract pairs of words co-occurring in a window*Extract pairs of words co-occurring in a window*



Japanese word associations

Initial value of correlation matrix of English associated words vs. Japanese translations for an English noun

Assign each associated word to the translation with which it has the highest correla-tion and calculate the percentage of associated words assigned to each

translation

Assign each associated word to the translation with which it has the highest correla-tion and calculate the percentage of associated words assigned to each

translation

Correlation matrix of associated words vs. translations

Noun translation pseudo-probabilities

* Window size = 10 content words

11

Example correlation matrix and estimated noun translation pseudo-probabilities

plant 装置[SOUCHI]

設備[SETSUBI

]

植物[SHOKU-BUTSU]

工場[KOUJOU]

プラント

[PURANTO]

苗[NAEI]

植木[UEKI]

activity 0.02 0.03 2.10 0.20 0.03 0.01 0.02bacteria 0.02 0.03 1.98 0.01 0.02 0.27 0.02boiler 0.05 2.70 0.05 0.03 2.73 0.03 0.04coal 0.87 2.35 1.70 0.68 2.06 0.65 0.99computer 0.55 0.71 0.02 0.49 0.73 0.01 0.01control 0.47 0.51 0.17 0.15 0.62 0.06 0.01culture 0.03 0.05 3.26 0.23 0.12 0.77 0.88environment 0.76 1.25 1.32 0.03 0.05 0.23 0.03failure 0.93 1.22 0.03 0.53 1.43 0.01 0.01flower 0.04 0.06 4.02 0.04 0.04 1.23 1.70

: : : : : : : :Translation pseudo-probabilities

.047 .241 .423 .022 .223 .022 .022

12


2. Proposed method




3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

13

Our method for estimating noun-sequence translation pseudo-probabilities

English corpus Japanese corpusEnglish-Japanese dictionary

Generate all compositional translations

Generate all compositional translations

E(1)=e1(1)e2

(1)… em(1),

E(2)= e1(2)e2

(2)… em(2),

…,

E(n)=e1(n)e2

(n)… em(n) F=f1f2…fm

Estimate according to occurrence frequenciesEstimate according to occurrence frequencies

Combine two estimatesCombine two estimates

Retrieve compo-sitional transla-tions and count their frequencies

Retrieve compo-sitional transla-tions and count their frequencies

Extract a noun sequence with its frequency

Extract a noun sequence with its frequency

Estimate according to constituent-word translation pseudo-probabilitiesEstimate according to constituent-word translation pseudo-probabilities

n

k

m

ii

kips

m

ii

jips

j fePfePFEP1 1

)(

1

)()(2 )|()|()|(

n

k

kjj EgEgFEP1

)()()(1 )()()|(

)|()|()|( )(2

)(1

)( FEPFEPFEP jjjps

14


2. Proposed method




3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

15

Phrase-based SMT using translation pseudo-probabilities

In-domain phrase table (pseudo-probabilities)

Bilingual dictionary

MergeMerge

Estimate translation pseudo-probabilities Estimate translation pseudo-probabilities

Giza++ & heuristics Giza++ & heuristics

SRILMSRILM

Out-of-domain (or in-domain) parallel corpus

In-domain source- language corpus

In-domain target- language corpus

Moses decoder Moses decoder

Basic phrase table

Adapted (or augmented) phrase table In-domain language model

Source language text Target language text

16


2. Proposed method




3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

17

Experiment AAdapt a phrase table learned from an out-of-domain parallel corpus by using in-domain comparable corpora

Experiment BAugment a phrase table learned from an in-domain small parallel corpus by using in-domain larger comparable corpora

Experimental setting

Experiment A Experiment B

Training parallel corpus

20,000 pairs of Japanese and English patent abstracts in the physics

20,000 pairs of Japanese and English sentences having high similarity ―ones extracted from scientific-paper abstracts in the chemistry

Training comparable corpora

Scientific-paper abstracts in the chemistry ・ Japanese: 151,958 abstracts (90.8 Mbytes) ・ English: 102,730 abstracts (64.9 Mbytes)

Test corpus 1,000 Japanese sentences, each having one reference English translation, from scientific paper abstracts in the chemistry

Bilingual dictionary

333,656 pairs of translation equivalents between 163,247 Japanese and 93,727 English nouns from EDR, EIJIRO, and EDICT dictionaries

18

Our method in four cases using a different volume of comparable corpora1. Japanese: all, English: all2. Japanese: half, English: all3. Japanese: all, English: half4. Japanese: half, English: half

Two baseline methods using the phrase table learned from the parallel corpus1. Baseline without dictionary2. Baseline with dictionary: Phrase table were augmented with the bilingual

dictionary[Note] The TL language model learned from the whole TL monolingual

corpus was used commonly in all cases involving our method and the baseline methods

Evaluation metric: BLEU-4

19

BLEU-4 score

Our method rather slightly improved the BLEU score The effect of the difference in volume of comparable corpora remains

unclear Simply adding a bilingual dictionary improved the out-of-domain phrase

table, but did not improve the in-domain phrase table

Method Experiment A Experiment B

Our method

J:all, E:all 13.30 16.82

J:half, E:all 13.19 16.70

J:all, E:half 13.21 16.78

J:half, E:half 13.27 16.71

Baseline w/o dictionary 11.42 16.37

Baseline w/ dictionary 12.94 16.32

Experimental results

20


2. Proposed method




3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

21

1. Optimization of the parameters – Parameters, including the window size and thresholds for word

occurrence frequency, co-occurrence frequency, and pointwise mutual information, affect the correlation matrix of associated words vs. translations

– How to optimize the values for the parameters remains unsolved

2. Alternatives for word-association measure – Pointwise mutual information, which tends to overestimate low-

frequency words, is not the most suitable for acquiring word associations

– Need to compare with alternatives such as log-likelihood ratio and the Dice coefficient

Discussions

22

3. Refinement of the definition of translation pseudo-probability

– Need to consider the frequencies of associated words as well as the dependence among associated words

– Need to reconsider the strategy assigning an associated word to only one translation

4. Estimate of verb translation pseudo-probabilities

– Need to use syntactic co-occurrence, instead of co-occurrence in a widow, to extract verb-noun associations from corpora

– Need to define pariwise correlation between associated nouns and translations recursively based on heuristics where two nouns associated with a verb are likely to suggest the same sense of the verb when they belong to the same semantic class

23


2. Proposed method




3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

24

Related work

Many studies on bilingual lexicon acquisition from bilingual comparable corpora have been reported since the mid 90s, but few studies on word translation probability estimate from bilingual comparable corpora

Estimate of word translation probabilities from comparable corpora using an EM algorithm (Koehn & Knight 2000) could be greatly affected by the occurrence frequencies of translation candidates in the TL corpus

In contrast, our method produces translation pseudo-probabilities that reflect the distribution of the senses of the SL word in the SL corpus

Methods for extracting parallel sentence pairs from bilingual comparable corpora (Zhao & Vogel, 2002; Utiyama & Isahara 2003; Fung & Cheung, 2004; Munteanu & Marcu, 2005); extracted parallel sentences could be used to learn a translation model with a conventional method based on word-for-word alignment. This approach is applicable only to closely comparable corpora.

In contrast, our method is applicable even to a pair of unrelated monolingual corpora.

25


2. Proposed method




3. Experiments

4. Discussion

5. Related work

6. Summary

Overview

26

Summary

A method for estimating translation pseudo-probabilities from a bilingual dictionary and bilingual comparable corpora was created– Assumption: The more associated words a translation is correlated with,

the higher its translation probability

– Essence of the method: Calculate pairwise correlations between associated words of an SL word and its TL translations

A phrase-based SMT framework using out-of-domain parallel corpus and in-domain comparable corpora was proposed– An experiment showed promising results; the BLEU score was improved

by using the translation pseudo-probabilities estimated from in-domain comparable corpora.

Future work includes optimizing the parameters and extending the method to estimate translation pseudo-probabilities for verbs.

using comparable corpora to adapt a translation model to domains hiroyuki kaji, takashi tsunakawa,...

Documents