
Page 1: Comparing Example-Based & Statistical Machine Translation

Andy Way*†, Nano Gough*, Declan Groves†

National Centre for Language Technology, School of Computing, Dublin City University

{away,ngough,dgroves}@computing.dcu.ie

[* To appear in the Journal of Natural Language Engineering, June 2005]
[† To appear in the Workshop on Building and Using Parallel Texts: Data-Driven MT and Beyond, ACL-05, June 2005]

Page 2: Plan of the Talk

1. Basic Situation in MT today:
• Statistical MT (SMT)
• Example-Based MT (EBMT)

2. Differences between Phrase-based SMT & EBMT.

3. Our ‘Marker-based’ EBMT system.
4. Testing EBMT vs. word- & phrase-based SMT.
5. Results & Observations.
6. Concluding Remarks.
7. Future Research Avenues.

Page 3: What is the Situation today in MT?

Most MT research undertaken today is corpus-based (as opposed to rule-based methods).

Two main data-driven approaches:
1. Example-Based MT (EBMT)
2. Statistical MT (SMT)

SMT is by far the more dominant paradigm.

Page 4: How does EBMT work?

[Diagram: a new English input sentence EX is matched (search) against the English side of a sententially aligned English–French example base (E1–E4 : F1–F4); the French fragments corresponding to the matched examples (here F2 and F4, plus residual material FX) are recombined to produce the output.]

Page 5: A (much simplified) Example

Given in corpus

John went to school : Jean est allé à l’école.
The butcher’s is next to the baker’s : La boucherie est à côté de la boulangerie.

Isolate useful fragments

John went to : Jean est allé à
the baker’s : la boulangerie

We can now translate

John went to the baker’s : Jean est allé à la boulangerie.
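This lookup-and-recombine step is easy to sketch. The toy Python fragment store below holds exactly the two fragments isolated above; the greedy longest-match strategy is an illustrative stand-in, not any particular EBMT system's matcher.

```python
# Toy EBMT recombination: cover the input with the longest remembered
# fragments and concatenate their stored translations.
fragment_store = {
    "john went to": "jean est allé à",
    "the baker's": "la boulangerie",
}

def translate(sentence: str) -> str:
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in fragment_store:
                output.append(fragment_store[chunk])
                i = j
                break
        else:
            output.append(words[i])  # no fragment matches: pass the word through
            i += 1
    return " ".join(output)

print(translate("John went to the baker's"))
# -> jean est allé à la boulangerie
```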

Page 6: How does SMT work?

SMT deduces language & translation models from huge quantities of monolingual and bilingual data using a range of theoretical approaches to probability distribution and estimation.

• Translation model establishes the set of target language words (and more recently, phrases) which are most likely to be useful in translating the source string.

– takes into account source and target word (and phrase) co-occurrence frequencies, sentence lengths and the relative sentence positions of source and target words.

• Language model tries to assemble these words (and phrases) in the best order possible.

– trained by determining all bigram and/or trigram frequency distributions occurring in the training data
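The division of labour between the two models can be seen in a toy noisy-channel sketch. All probabilities below are invented for illustration, the translation model is a crude IBM Model 1 flavour, and the "search" is just a choice between two hand-written candidates:

```python
import math

# Noisy-channel sketch: pick the target sentence e maximising
# log P(f|e) + log P(e). All numbers are invented for illustration.
t_prob = {("la", "the"): 0.7, ("maison", "house"): 0.8, ("bleue", "blue"): 0.9}
bigram = {("<s>", "the"): 0.5, ("the", "blue"): 0.3, ("blue", "house"): 0.4,
          ("the", "house"): 0.2, ("house", "blue"): 0.05}
FLOOR = 1e-6  # probability floor for unseen events

def lm_score(words):
    """Bigram language model: sum of log P(w_i | w_{i-1})."""
    logp, prev = 0.0, "<s>"
    for w in words:
        logp += math.log(bigram.get((prev, w), FLOOR))
        prev = w
    return logp

def tm_score(src, tgt):
    """IBM Model 1 flavour: each source word may align to its best target word."""
    return sum(math.log(max(t_prob.get((f, e), FLOOR) for e in tgt)) for f in src)

src = ["la", "maison", "bleue"]
candidates = [["the", "blue", "house"], ["the", "house", "blue"]]
best = max(candidates, key=lambda e: tm_score(src, e) + lm_score(e))
print(best)  # ['the', 'blue', 'house'] -- the LM supplies the English word order
```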

Page 7: The Paradigms are Converging

It is now harder than it has ever been to describe the differences between the two methods.

This used to be easy:
• From the beginning, EBMT has sought to translate new texts by means of a range of sub-sentential data, both lexical and phrasal, stored in the system's memory.
• Until quite recently, SMT models of translation were based on the simple IBM word alignment models of [Brown et al., 1990].

Page 8: From word- to phrase-based SMT

• SMT systems now learn phrasal as well as lexical alignments [e.g. Koehn, Och, Marcu 2003; Och, 2003].

• Unsurprisingly, the quality of today's phrase-based SMT systems is considerably better than that of the earlier word-based models.

• Although EBMT systems have been modelling lexical and phrasal correspondences for 20 years, no papers on SMT acknowledge this debt to EBMT, nor describe their approach as ‘example-based’ …

Page 9: Differences between EBMT and Phrase-Based SMT?

1. EBMT alignments remain available for reuse in the system, whereas (similar) SMT alignments ‘disappear’ in the probability models.

2. SMT systems never ‘learn’ from previously encountered data: when SMT sees a string it has seen before, it processes it in the same way as ‘unseen’ data, whereas EBMT simply looks such strings up in its databases and outputs the stored translation.

3. Depending on the model, EBMT builds in (some) syntax at its core—most SMT systems only use models of syntax in a post hoc reranking process, and even here, [Koehn et al., JHU Workshop 2003] demonstrated that ‘bolting on’ syntax in this manner did not help improve translation quality;

4. Given (3), phrase-based SMT systems are likely to ‘learn’ (some) chunks that EBMT systems would not.

Page 10: SMT chunks are different from EBMT chunks

En: Mary did not slap the green witch
Sp: Maria no dió una bofetada a la bruja verde.
(Lit.: ‘Mary not gave a slap to the witch green’)

From this aligned example, an SMT system would potentially learn the following ‘phrases’ (along with many others):

• slap the : dió una bofetada a
• slap the : dió una bofetada a la
• the green witch : a la bruja verde

NB, SMT essentially learns n-gram sequences, rather than phrases per se.

[Koehn & Knight, AMTA-04 SMT Tutorial Notes]
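These n-gram ‘phrases’ can be reproduced with the textbook consistency criterion used in phrase-based SMT: a source span and a target span form a phrase pair if no alignment link crosses the boundary of the pair. A sketch, with a hand-made word alignment for the example above (so the exact set extracted differs in detail from the slide's list):

```python
# Sketch of textbook phrase-pair extraction: a source/target span pair is
# extracted if its alignment links all stay inside the box.
en = "mary did not slap the green witch".split()
sp = "maria no dió una bofetada a la bruja verde".split()
# Hand-made word alignment (en index, sp index) for this sentence pair.
links = {(0, 0), (1, 1), (2, 1), (3, 2), (3, 3), (3, 4), (3, 5),
         (4, 6), (5, 8), (6, 7)}

def consistent(i1, i2, j1, j2):
    """True if the box [i1..i2] x [j1..j2] contains at least one link and
    no link connects the inside of the box to the outside."""
    if not any(i1 <= i <= i2 and j1 <= j <= j2 for (i, j) in links):
        return False
    return all((i1 <= i <= i2) == (j1 <= j <= j2) for (i, j) in links)

phrases = set()
for i1 in range(len(en)):
    for i2 in range(i1, len(en)):
        for j1 in range(len(sp)):
            for j2 in range(j1, len(sp)):
                if consistent(i1, i2, j1, j2):
                    phrases.add((" ".join(en[i1:i2 + 1]),
                                 " ".join(sp[j1:j2 + 1])))

print(("the green witch", "la bruja verde") in phrases)  # True
```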

Page 11: Our Marker-Based EBMT System

“The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.”

[Green, 1979]

Markers for English (and French):

Determiners <DET>

Quantifiers <QUANT>

Prepositions <PREP>

Conjunctions <CONJ>

Wh-Adverbs <WRB>

Possessive Pronouns <POSS>

Personal Pronouns <PRON>

Page 12: An Example

En: you click apply to view the effect of the selection
Fr: vous cliquez sur appliquer pour visualiser l'effet de la sélection

Source—target aligned sentences are traversed word by word and automatically tagged with their marker categories:

<PRON>you click apply <PREP>to view <DET>the effect <PREP>of <DET>the selection

<PRON>vous cliquez <PREP>sur appliquer <PREP>pour visualiser <DET>l'effet <PREP>de <DET>la sélection
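A sketch of this tagging step, assuming a toy closed-class marker lexicon for English only (a real system holds full marker lexicons for both languages):

```python
# Sketch of marker tagging: label each word found in a closed-class
# marker lexicon with its category (toy English lexicon only).
MARKER_LEXICON = {
    "the": "DET", "a": "DET",
    "to": "PREP", "of": "PREP", "in": "PREP",
    "and": "CONJ", "or": "CONJ",
    "you": "PRON", "he": "PRON",
    "your": "POSS", "his": "POSS",
    "all": "QUANT", "some": "QUANT",
    "where": "WRB", "when": "WRB",
}

def tag_markers(sentence: str) -> str:
    out = []
    for word in sentence.split():
        tag = MARKER_LEXICON.get(word.lower())
        out.append(f"<{tag}>{word}" if tag else word)
    return " ".join(out)

print(tag_markers("you click apply to view the effect of the selection"))
# <PRON>you click apply <PREP>to view <DET>the effect <PREP>of <DET>the selection
```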

Page 13: Deriving Sub-Sentential Source–Target Chunks

From these tagged strings, we generate the following aligned marker chunks:

• <PRON> you click apply : vous cliquez sur appliquer
• <PREP> to view : pour visualiser
• <DET> the effect : l'effet
• <PREP> of the selection : de la sélection

New source and target fragments (not necessarily aligned source–target pairs!) begin where a marker word is met and end at the next marker word [supplemented by cognates, mutual information (MI), etc. for source–target sub-sentential alignment].

One further constraint: each chunk must contain at least one non-marker word (cf. 4th marker chunk).
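The segmentation and the non-marker-word constraint can be sketched as follows; the marker word set is a toy stand-in for the full closed-class lexicon above:

```python
# Sketch of marker-based segmentation: a chunk opens at each marker word
# and closes at the next one; a chunk made entirely of marker words is
# folded into what follows, enforcing the "at least one non-marker word"
# constraint.
MARKER_WORDS = {"you", "to", "of", "the", "a", "and", "your", "all"}

def marker_chunks(sentence: str) -> list[str]:
    chunks: list[list[str]] = []
    current: list[str] = []
    for word in sentence.split():
        if word.lower() in MARKER_WORDS and current:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    # Fold a chunk containing only marker words into the chunk after it.
    merged: list[list[str]] = []
    for chunk in chunks:
        if merged and all(w.lower() in MARKER_WORDS for w in merged[-1]):
            merged[-1] += chunk
        else:
            merged.append(chunk)
    return [" ".join(c) for c in merged]

print(marker_chunks("you click apply to view the effect of the selection"))
# ['you click apply', 'to view', 'the effect', 'of the selection']
```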

Page 14: Deriving Lexical Mappings

Where chunks contain just one non-marker word in both source and target, we assume they are translations.

Thus we can extract the following ‘word-level’ translations:

• <PREP> to : pour
• <LEX> view : visualiser
• <LEX> effect : effet
• <PRON> you : vous
• <DET> the : l’
• <PREP> of : de
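A sketch of this single-content-word heuristic. The marker sets are toys, the French side is assumed pre-tokenised so that l' is a separate token, and this simple version derives only the content-word pairs, not marker-word pairs such as to : pour:

```python
# Sketch of lexical-mapping extraction: when an aligned chunk pair has
# exactly one non-marker word on each side, assume those two words
# translate each other.
EN_MARKERS = {"you", "to", "of", "the"}
FR_MARKERS = {"vous", "pour", "de", "la", "le", "l'"}

def word_pairs(chunk_pairs):
    lexicon = {}
    for en, fr in chunk_pairs:
        en_content = [w for w in en.split() if w not in EN_MARKERS]
        fr_content = [w for w in fr.split() if w not in FR_MARKERS]
        if len(en_content) == 1 and len(fr_content) == 1:
            lexicon[en_content[0]] = fr_content[0]
    return lexicon

pairs = [("you click apply", "vous cliquez sur appliquer"),  # >1 content word: skipped
         ("to view", "pour visualiser"),
         ("the effect", "l' effet"),
         ("of the selection", "de la sélection")]
print(word_pairs(pairs))
# {'view': 'visualiser', 'effect': 'effet', 'selection': 'sélection'}
```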

Page 15: Deriving Generalised Templates

In a final pre-processing stage, we produce a set of generalised marker templates by replacing marker words with their tags:

• <PRON> click apply : <PRON> cliquez sur appliquer
• <PREP> view : <PREP> visualiser
• <DET> effect : <DET> effet
• <PREP> the selection : <PREP> la sélection

Any marker tag pair can now be inserted at the appropriate tag location.

More general examples add flexibility to the matching process and improve coverage (and quality).
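A sketch of the generalisation step, again with toy marker lexicons; the determiner pair this : cet used to fill the slot at the end is hypothetical, standing in for an entry from the word-level lexicon:

```python
# Sketch of template generalisation: the leading marker word of each
# aligned chunk is replaced by its bare tag, leaving a slot that any
# same-tag pair can later fill. French side assumed pre-tokenised.
EN_MARKERS = {"you": "PRON", "to": "PREP", "of": "PREP", "the": "DET"}
FR_MARKERS = {"vous": "PRON", "pour": "PREP", "de": "PREP",
              "l'": "DET", "la": "DET"}

def generalise(en_chunk: str, fr_chunk: str) -> tuple[str, str]:
    en_first, *en_rest = en_chunk.split()
    fr_first, *fr_rest = fr_chunk.split()
    en_tag, fr_tag = EN_MARKERS.get(en_first), FR_MARKERS.get(fr_first)
    if en_tag and fr_tag:  # only generalise when both sides start with a marker
        return (f"<{en_tag}> " + " ".join(en_rest),
                f"<{fr_tag}> " + " ".join(fr_rest))
    return en_chunk, fr_chunk

en_t, fr_t = generalise("the effect", "l' effet")
print(en_t, ":", fr_t)  # <DET> effect : <DET> effet
# A hypothetical <DET> pair from the word-level lexicon fills the slot:
print(en_t.replace("<DET>", "this"), ":", fr_t.replace("<DET>", "cet"))
# this effect : cet effet
```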

Page 16: Summary of Knowledge Sources

1. the original sententially-aligned source–target pairs;
2. the marker-aligned chunks;
3. the generalised marker chunks;
4. the word-level lexicon.

New strings are segmented into all possible n-grams that might be retrieved from the system's memories.

Resources are searched in the order given here, from maximal context (specific source–target sentence pairs) to minimal context (word-for-word translation).
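The cascade amounts to a fall-through lookup over four stores, most specific first. A sketch with tiny illustrative stand-ins for the four memories:

```python
# Sketch of the cascaded search order: try the most specific memory first
# and fall back toward word-for-word translation (all entries illustrative).
sentence_memory = {
    "you click apply to view the effect of the selection":
        "vous cliquez sur appliquer pour visualiser l'effet de la sélection",
}
chunk_memory = {"to view": "pour visualiser", "the effect": "l'effet"}
template_memory = {"<DET> effect": "<DET> effet"}
lexicon = {"view": "visualiser", "effect": "effet"}

def lookup(fragment: str):
    """Return a translation from the most specific memory that knows the fragment."""
    for store in (sentence_memory, chunk_memory, template_memory, lexicon):
        if fragment in store:
            return store[fragment]
    return None  # fragment unknown to all four knowledge sources

for query in ("to view", "effect", "selection"):
    print(query, "->", lookup(query))
# to view -> pour visualiser
# effect -> effet
# selection -> None
```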

Page 17: Application Areas for our EBMT System

1. Seeding System Memories with Penn-II Treebank phrases and translations [AMTA-02].

2. Controlled Language & EBMT [MT Summit-03, EAMT-04, MT Journal-05].

3. Integration with web-based MT Systems [CL Journal-03].
4. Using the Web for Translation Validation (and Correction, if required).
5. Scalable EBMT [TMI-04, NLE Journal-05, ACL-05].

• Largest English–French EBMT system.
• Robust, wide-coverage, good quality.
• Outperforms good on-line MT systems.

Page 18: What are we interested in finding out?

1. Whether our marker-based EBMT system can outperform (a) word-based and (b) phrase-based SMT systems built from generally available tools;

2. Whether such SMT systems outperform our EBMT system when given ‘enough’ training text.

3. Whether seeding SMT (and EBMT) systems with SMT and/or EBMT data improves translation quality.

NB (astonishingly), there is no previously published research comparing EBMT and SMT …

Page 19: What have we done vs. what are we doing?

1. WBSMT vs. EBMT
2. PBSMT seeded with:
– SMT chunks;
– EBMT chunks;
– both knowledge sources (‘Hybrid Example-Based SMT’).
3. PBSMT vs. EBMT

Ongoing work
1. EBMT seeded with:
– SMT chunks;
– EBMT chunks;
– merged knowledge sources (‘Hybrid Statistical EBMT’).

Page 20: Word-Based SMT vs. EBMT

1. Marker-Based EBMT system [Gough & Way, TMI-04]

2. To develop language and translation models for the WBSMT system, we used:
– Giza++ (for word alignment);
– the CMU-Cambridge statistical toolkit (for computing the language and translation models);
– the ISI ReWrite Decoder (for deriving translations).

Page 21: Experiment 1 Set-Up

• 207K-sentence English–French Sun translation memory (TM).
• Randomly extracted a 4K-sentence test set.
• Split the remaining sentences into three training sets of roughly 50K (1.1M words), 100K, and 203K (4.8M words) sentence pairs, to test the impact of training set size.

• Translation was performed at each stage in both directions, English→French and French→English.

• Resulting translations evaluated using a range of automatic metrics.

Page 22: WBSMT vs. EBMT: English–French

     System  BLEU  Prec.  Rec.   WER   SER
TS1  SMT     .297  .674   .591   .549  .908
     EBMT    .332  .653   .618   .543  .892
TS2  SMT     .338  .682   .596   .511  .899
     EBMT    .453  .736   .698   .448  .775
TS3  SMT     .322  .651   .570   .535  .891
     EBMT    .441  .673   .688   .524  .656

• All metrics bar one suggest that EBMT can outperform WBSMT for English–French;
• The only exception is TS1, where WBSMT outperforms EBMT in terms of Precision (.674 compared to .653).

Page 23: WBSMT vs. EBMT: English–French

In general, scores incrementally improve as training data increases.

But apart from SER, the metrics suggest that training on just over 100K sentence pairs yields better results than training on just over 200K.

Why? Maybe due to overfitting or odd data …

Surprising: it is generally assumed that increasing the training data in machine-learning approaches will improve the quality of the output translations (variance analysis: bootstrap resampling on the test set [Koehn, EMNLP-04]; different test sets).

• Note especially the similarity of the WER scores, and the difference in SER values. Much more significant improvement for EBMT (20.6%) than for WBSMT (0.1%).
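The divergence follows directly from the definitions: WER averages word-level edit operations over the whole test set, whereas SER only asks whether each output sentence is perfect, so many nearly-correct sentences yield a low WER but a high SER. A sketch with toy data (not the paper's evaluation code):

```python
# Sketch of the two metrics under discussion: WER is word-level edit
# distance over the reference length; SER is the fraction of sentences
# that are not an exact match.

def edit_distance(hyp: list[str], ref: list[str]) -> int:
    """Word-level Levenshtein distance by dynamic programming."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1,                               # insertion
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))  # substitution
    return d[-1][-1]

def wer_and_ser(hypotheses, references):
    errors = words = exact = 0
    for hyp, ref in zip(hypotheses, references):
        errors += edit_distance(hyp.split(), ref.split())
        words += len(ref.split())
        exact += hyp == ref
    return errors / words, 1 - exact / len(references)

hyps = ["vous cliquez sur appliquer", "la effet de la sélection"]
refs = ["vous cliquez sur appliquer", "l'effet de la sélection"]
print(wer_and_ser(hyps, refs))  # (0.25, 0.5): modest WER, but SER of 50%
```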

Page 24: WBSMT vs. EBMT: French–English

     System  BLEU  Prec.  Rec.   WER   SER
TS1  SMT     .379  .709   .736   .525  .865
     EBMT    .257  .542   .631   .697  .892
TS2  SMT     .392  .721   .743   .462  .813
     EBMT    .426  .673   .796   .552  .662
TS3  SMT     .446  .704   .724   .468  .808
     EBMT    .461  .678   .744   .508  .512

• All WBSMT scores are higher than for English–French;
• For EBMT, translations are better from French–English for BLEU, Recall and SER, but worse for WER (FR–EN: .508 vs. EN–FR: .448) and Precision (FR–EN: .678 vs. EN–FR: .736).

Page 25: WBSMT vs. EBMT: French–English

• For TS1, EBMT does not outperform WBSMT for French–English on any of the five metrics.

• For TS2, EBMT beats WBSMT in terms of BLEU, Recall and SER (66.2% compared to 81.3% for WBSMT), while WBSMT does better on Precision and WER (46.2% compared to 55.2%).

• For TS3, WBSMT again beats EBMT in terms of Precision (2.5%) and WER (4%; both less significant differences than for TS1 and TS2), but EBMT wins out according to the other three metrics, notably by a huge 29.6% for SER.

• BLEU: WBSMT obtains significantly higher scores for French–English than for English–French: 8% higher for TS1, 6% higher for TS2, and 12% higher for TS3. Apart from TS1, the EBMT scores for the two language directions are much more in line, indicating perhaps that EBMT is more consistent across directions for the same language pair.

Page 26: Summary of Results

1. Both EBMT & WBSMT achieve better translation quality from French–English than from English–French. For French–English, of the five automatic evaluation metrics over the three training sets, WBSMT wins out over our EBMT system in 9/15 cases.

2. For English—French, in 14/15 cases EBMT beats WBSMT.

3. Summing these results together, EBMT outperforms WBSMT in 20 tests, while WBSMT does better in 10 experiments.

4. Assuming all of these tests to be of equal importance, EBMT appears to outperform WBSMT by a factor of two to one.

5. While the results are a little mixed, it is clear that EBMT tends to outperform WBSMT on this sublanguage and on these training sets.

Page 27: Experiment 2: Phrase-Based SMT vs. EBMT

• Same EBMT system as for the WBSMT experiment.
• To develop language and translation models for the PBSMT system, we:
– used Giza++ to extract word alignments;
– refined these to extract Giza++ phrase alignments;
– constructed probability tables;
– passed these to the CMU-SRI statistical toolkit & Pharaoh decoder to derive translations.

• Same Translation Pairs, Training Sets, Test Sets

• Resulting translations evaluated using a range of automatic metrics

Page 28: PBSMT vs. EBMT: English–French

TS3             BLEU  Prec.  Rec.   WER   SER
PBSMT (Giza++)  .375  .659   .587   .585  .868
PBSMT (EBMT)    .364  .666   .576   .613  .879
WBSMT           .322  .651   .570   .535  .891
EBMT            .441  .673   .688   .524  .656

• PBSMT with Giza++ sub-sentential alignments wins out over PBSMT with EBMT data, but cf. the size of the data sets:
– EBMT: 403,317 alignments
– PBSMT: 1.73M alignments
• PBSMT beats WBSMT, notably for BLEU, but is 5% worse for WER; SER is still (disappointingly) high.
• EBMT beats PBSMT, especially for BLEU, Recall, WER & SER.

Page 29: PBSMT vs. EBMT: French–English

TS3             BLEU  Prec.  Rec.   WER   SER
PBSMT (Giza++)  .420  .653   .710   .629  .828
PBSMT (EBMT)    .395  .615   .664   .748  .862
WBSMT           .446  .704   .724   .468  .808
EBMT            .461  .678   .744   .508  .512

• PBSMT with Giza++ sub-sentential alignments wins out over PBSMT with EBMT data (with the same caveat);
• PBSMT with both knowledge sources does better for F–E than for E–F;
• PBSMT doesn’t beat WBSMT – ??
• EBMT beats PBSMT.

Page 30: Experiment 3a: Seeding Pharaoh with Giza++ Words and EBMT Phrases: English–French

TS3                          BLEU  Prec.  Rec.   WER   SER
Giza++ words + EBMT phrases  .396  .677   .591   .593  .854
Giza++ data only             .375  .659   .587   .585  .868

• The hybrid PBSMT system beats the ‘baseline’ PBSMT for BLEU, P&R, and SER; slightly worse WER.
• Data size: 430K (cf. PBSMT 1.73M, EBMT 403K).
• Still worse than the EBMT scores.

Page 31: Experiment 3b: Seeding Pharaoh with Giza++ Words and EBMT Phrases: French–English

TS3                          BLEU  Prec.  Rec.   WER   SER
Giza++ words + EBMT phrases  .427  .642   .692   .681  .834
Giza++ data only             .420  .653   .710   .629  .828

• The hybrid PBSMT system beats the ‘baseline’ PBSMT for BLEU; slightly worse for P&R and SER; quite a bit worse for WER.
• Still shy of the results for EBMT.

Page 32: Experiment 4a: Seeding Pharaoh with All Data, English–French

TS3          BLEU  Prec.  Rec.   WER   SER
Hybrid       .426  .703   .610   .543  .836
Semi-Hybrid  .396  .677   .591   .593  .854
EBMT         .441  .673   .688   .524  .656

• The fully hybrid system beats the ‘semi-hybrid’ system on all metrics;
• It loses out to the EBMT system, except for Precision;
• The data set is now >2M items.

Page 33: Experiment 4b: Seeding Pharaoh with All Data, French–English

TS3          BLEU  Prec.  Rec.   WER   SER
Hybrid       .489  .693   .717   .564  .784
Semi-Hybrid  .427  .642   .692   .681  .834
EBMT         .461  .678   .744   .508  .512

• The fully hybrid system beats the ‘semi-hybrid’ system on all metrics;
• The hybrid system beats EBMT on BLEU & Precision; EBMT is ahead for Recall & WER, and still well ahead for SER.

Page 34: Summary of Results: WBSMT vs. EBMT

• None of these are ‘bad’ systems: for TS3, worst BLEU score is for WBSMT, EF, .322;

• WBSMT loses out to EBMT 2:1 (but better overall for FE);

• For TS3, WBSMT BLEU score of .446 and EBMT score of .461 are high scores;

• For WBSMT vs. EBMT experiments, odd finding: higher scores for 100K training set: investigate in future work.

Page 35: Summary of Results: PBSMT vs. EBMT

• PBSMT scores are better than WBSMT’s, but there is an odd result for F–E … ?!

• Best PBSMT BLEU scores (with Giza++ data only): .375 (EF), .420 (FE);

• Seeding PBSMT with EBMT data gets good scores: for BLEU, .364 (EF), .395 (FE); note differences in data size (1.73M vs. 403K);

• PBSMT loses out to EBMT;
• PBSMT SER is still very high (83–87%).

Page 36: Summary of Results: Semi-Hybrid Systems

• Seeding Pharaoh with SMT words and EBMT phrases improves over baseline Giza++ seeded system;

• Data size diminishes considerably (430K vs. 1.73M);

• Still worse result for ‘semi-hybrid’ system for FE than for WBSMT… ?!

• Still worse results than for EBMT.

Page 37: Summary of Results: Fully Hybrid Systems

• Better results than for ‘semi-hybrid’ systems: EF .426 (.396), FE .489 (.427);

• Data size increases;
• For F–E, the hybrid system beats EBMT on BLEU (.489 vs. .461) & Precision; EBMT is ahead for Recall & WER, and still well ahead (27%) for SER.

Page 38: Concluding Remarks

• Despite the convergence between EBMT and SMT, there are further gains to be made;

• Merging Giza++ and EBMT-induced data leads to an improved Hybrid Example-Based SMT system;

Lesson for SMT community: don’t disregard the large body of work on EBMT!

• We expect in further work that adding SMT sub-sentential data to our EBMT system will also lead to improvements;

Lesson for EBMT-ers: SMT data can help you too!

Page 39: Future Work

• Carry out significance tests on these results.
• Investigate what’s going on in the second 100K training set.
• Develop the ‘Statistical EBMT’ system as described.
• Other issues in hybridity:

– Use a target LM in EBMT;
– Replace the EBMT recombination process with an SMT decoder;
– Try different decoders, LMs and TMs;
– Factor Marker Tags into SMT probability tables.

• Experiment with other training data in other sublanguage domains, especially those where larger corpora are available (e.g. Canadian Hansards, European Parliament …);

• Try other language pairs.