phrase based machine translation professor sivaji bandyopadhyay computer science & engineering...

39
Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA.

Upload: antony-willis

Post on 22-Dec-2015

227 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Phrase Based Machine Translation

Professor Sivaji BandyopadhyayComputer Science & Engineering Department

Jadavpur UniversityKolkata, INDIA.

Page 2: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Objectives

• What is Phrase based Machine Translation (PBMT)

• Development of a PBMT System

• Research directions in PBMT

Page 3: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Machine Translation

• Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another.

• Speech – to – Speech machine translation– Eg., ATR, CMU, DFKI, NEC, Matsushita,Hitachi

• High quality translation is achieved by– Carefully domain targeted systems– Control language based systems

Page 4: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Machine Translation approaches

• Translation Memory

• Machine Translation– Rule based MT– Corpus based MT / Data Driven MT

• Example Based MT (EBMT)• Statistical MT (SMT)

– Data Driven MT is nowadays the most prevalent trend in MT

Page 5: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Processes in a MT System

• The core process converts elements (entities, structures, words, etc.) of the input (SL) text into equivalent elements for the output (TL) text, where the output text means the same (or is functionally equivalent to) the input text.

• The analysis process precedes this core conversion (or ‘transfer’).

• The synthesis (or generation) process succedes conversion.

Page 6: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

RBMT

• The core process is mediated by bilingual dictionaries and rules for converting SL structures into TL structures, and/or by dictionaries and rules for deriving intermediary representations’ from which output can be generated.

• The preceding stage of analysis interprets (surface) input SL strings into appropriate ‘translation units’ (e.g. canonical noun and verb forms) and relations (e.g. dependencies and syntactic units).

• The succeeding stage of synthesis (or generation) derives TL texts from the TL structures or representations produced by the core ‘transfer’ (or ‘interlingual’) process.

Page 7: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

SMT• The core process involves a ‘translation model’ which takes as input

SL words or word sequences (‘phrases’) and produces as output TL words or word sequences.

• The following stage involves a ‘language model’ which synthesises the sets of TL words in ‘meaningful’ strings which are intended to be equivalent to the input sentences.

• The preceding ‘analysis’ stage is represented by the (trivial) process of matching individual words or word sequences of input SL text against entries in the translation model.

• More important is the essential preparatory stage of aligning SL and TL texts from a corpus and deriving the statistical frequency data for the ‘translation model’ (or adding statistical data from a corpus to a pre-existing ‘translation model’.)

• The monolingual ‘language model’ may or may not be derived from the same corpus as the ‘translation model’.

Page 8: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

EBMT• The core process is the selection and extraction of TL fragments

corresponding to SL fragments. • It is preceded by an ‘analysis’ stage for the decomposition of input

sentences into appropriate fragments (or templates with variables) and their matching against SL fragments (in a database). – Whether the ‘matching’ involves precompiled fragments (templates

derived from the corpus), whether the fragments are derived at ‘runtime’, and whether the fragments (chunks) contain variables or not, are all secondary factors.

• The succeeding stage of synthesis (or ‘recombination’ adapts the extracted TL fragments and combines them into TL (output) sentences.

• As in SMT, there are essential preparatory stages which align SL and TL sentences in the bilingual database and which derive any templates or patterns used in the processes of matching and extracting.

Page 9: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Mixing Approaches to Machine Translation

• The boundaries between the three principal approaches to MT are becoming narrower– Phrase based SMT models are incorporating

morphology, syntax and semantics into their systems– Rule based systems are using parallel corpora to

enrich their lexicon and grammars, to create new methods for disambiguation

– Probabilistic model for EBMT searches the translation example combination which has the highest probability

Page 10: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Web-Enabled Indian MT systems

• MAchiNe assisted TRAnslation tool (MANTRA)– English to Hindi– MANTRA uses the Lexicalized Tree Adjoining

Grammar (LTAG) formalism to represent the English as well as the Hindi grammar.

– http://www.cdac.in/html/aai/mantra.asp• AnglaHindi

– English to Hindi– pattern directed rule based system with CFG

like structure for English. – http://anglahindi.iitk.ac.in

Page 12: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

SMT involving Indian Languages

• English-Hindi SMT – IBM India Research Laboratory, IJCNLP 2004– CDAC Mumbai & IIT Bombay, IJCNLP 2008

Page 13: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

SMT at CDAC, Mumbai• The parallel corpus is word-aligned bidirectionally, and using

various heuristics (see (Koehn et al., 2003) for details) phrase correspondences are established.

• Given the set of collected phrase pairs, the phrase translation probability is calculated by relative frequency.

• Lexical weighting, which measures how well words within phrase pairs translate to each other, validates the phrase translation, and addresses the problem of data sparsity.

• The language model p(e) used in our baseline system is a trigram model with modified Kneser-Ney smoothing.

• Reordering is penalized according to a simple exponential distortion model.

• The weights for the various components of the model (phrase translation model, language model, distortion model etc.) are set by minimum error rate training (Och, 2003).

Page 14: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

MT Consortiums in India

• Indian Language to Indian Languages Machine Translation Systems (IL-ILMT)

• English to Indian Languages Machine Translation Systems (EILMT) – EILMT (1)– EILMT (2)

Page 15: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

MT approaches

• Anal-Gen based with MT

• TAG based MT

• EBMT

• SMT

Page 16: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Background• Statistical Machine Translation as a research area started in the late 1980s

with the Candide project at IBM. IBM's original approach maps individual words to words and allows for deletion and insertion of words.

• Lately, various researchers have shown better translation quality with the use of phrase translation. Phrase-based MT can be traced back to Och's alignment template model, which can be re-framed as a phrase translation system.

• Other researchers augmented their systems with phrase translation, such as Yamada, who use phrase translation in a syntax-based model.

• Marcu introduced a joint-probability model for phrase translation. • At this point, most competitive statistical machine translation systems use

phrase translation, such as the CMU, IBM, ISI, and Google systems, to name just a few. Phrase-based systems came out ahead at a recent international machine translation competition (DARPA TIDES Machine Translation Evaluation 2003-2006 on Chinese-English and Arabic-English).

Page 17: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Background

• Of course, there are other ways to do machine translation. Most commercial systems use transfer rules and a rich translation lexicon.

• There are also other ways to do statistical machine translation. There is some effort in building syntax-based models that either use real syntax trees generated by syntactic parsers, or tree transfer methods motivated by syntactic reordering patterns.

• The phrase-based statistical machine translation model we present here was defined by [Koehn/Och/Marcu, 2003]. See also the description by [Zens, 2002]. The alternative phrase-based methods differ in the way the phrase translation table is created, which we disuss in detail below.

Page 18: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

IBM Word Based Models

• Fundamental Equation of Machine Translation: e^ = argmax Pr(e) Pr(f le ).

• In Model 1 we assume all connections for each French position to be equally likely.

• Therefore, the order of the words in e and f does not affect Pr(f | e).

Page 19: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

• In Model 2 we make the more realistic assumption that the probability of a connection depends on the positions it connects and on the lengths of the two strings. Therefore, for Model 2, Pr(f |e) does depend on the order of the words in e and f.

Page 20: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Each phrase is translated into an English phrase, and English phrases in the output may be reordered.

                                                            

Page 21: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Phrase based Translation Model

• The phrase translation model is based on the noisy channel model. • We use Bayes rule to reformulate the translation probability for

translating a foreign sentence f into English e as – argmaxe p(e|f) = argmaxe p(f|e) p(e)

• This allows for a language model e and a separate translation model p(f|e).

• During decoding, the foreign input sentence f is segmented into a sequence of I phrases f1

I. We assume a uniform probability distribution over all possible segmentations.

• Each foreign phrase fi in f1I is translated into an English phrase ei.

The English phrases may be reordered. Phrase translation is modeled by a probability distribution φ(fi|ei).

Page 22: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Phrase based Translation Model

• Reordering of the English output phrases is modeled by a relative distortion probability distribution d(start i,endi-1), where starti denotes the start position of the foreign phrase that was translated into the i th English phrase, and endi-1 denotes the end position of the foreign phrase that was translated into the (i-1)th English phrase.

• We use a simple distortion model d(start i,endi-1) = α|starti-endi-1-1| with an appropriate value for the parameter α.

• In order to calibrate the output length, we introduce a factor ω (called word cost) for each generated English word in addition to the trigram language model pLM. This is a simple means to optimize performance. Usually, this factor is larger than 1, biasing toward longer output.

• In summary, the best English output sentence ebest given a foreign input sentence f according to our model is – ebest = argmax_e p(e|f) = argmaxe p(f|e) p_LM(e) ωlength(e)

where p(f|e) is decomposed into p(f1I|e1

I) = Φi=1I φ(fi|ei) d(starti,endi-1)

Page 23: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Decoder

• Phrase based Decoder– Beam search algorithm– A sequence of untranslated foreign words and

a possible English phrase translation for them is selected

– The English phrase is attached to the existing English output sequence

Page 24: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA
Page 25: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Hierarchical Phrase based SMT Model

• A Hierarchical Phrase-Based Model for Statistical Machine Translation, David Chiang, ACL 05

Page 26: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Phrase-based translationMorgen fliege ich nach Kanada zur KonferenzTomorrow I will fly to the conference in Canada• Foreign input is segmented in phrases– any sequence of words, not necessarily linguistically motivated• Each phrase is translated into English• Phrases are reordered

Page 27: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

• Phrase-based translation model• • Major components of phrase-based model• – phrase translation model (f |e)• – reordering model !length(e)• – language model plm(e)• • Bayes rule• argmaxep(e|f ) = argmaxep(f |e)p(e)• = argmaxe(f |e)plm(e)!length(e)• • Sentence f is decomposed into I phrases ¯ fI• 1 = ¯ f1, ..., ¯ fI• • Decomposition of (f |e)• ( ¯ fI• 1 |¯eI• 1) =• I Yi=1• ( ¯ fi|¯ei)d(ai − bi−1)

Page 28: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Syntax based Statistical Translation Model

• Yamada and Knight (2001)

Page 29: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Syntax based SMT

• One criticism of the PBMT is that it does not model structural or syntactic aspects of the language.

• Syntax based SMT transforms a source-language parse tree into a target-language string by applying stochastic operations at each node.

• These operations capture linguistic differences such as word order and case marking.

Page 30: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Syntax based SMT

• The input sentence is preprocessed by a syntactic parser.

• The channel performs operations on each node of the parse tree. The operations are reordering child nodes, inserting extra words at each node, and translating leaf words.

• Parsing is only needed on the channel input side.

Page 31: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Syntax based SMT

• The reorder operation is intended to model translation between languages with different word orders, such as SVO-languages (English or Chinese) and SOV-languages (Japanese or Turkish).

• The word-insertion operation is intended to capture linguistic differences in specifying syntactic cases. E.g., English and French use structural position to specify case, while Japanese and Korean use case-marker particles.

Page 32: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

Reorder operation

• Child nodes on each internal node are stochastically reordered. A node with N children has N! possible reorderings.

• The probability of taking a specific reordering is given by the model’s r-table.

Page 33: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

R-table

Page 34: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

N-table

• An extra word is stochastically inserted at each node

• A word can be inserted either to the left of the node, to the right of the node, or nowhere.

• The insertion probability is determined by the n-table.

Page 35: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

N-table

• The n-table is split into two– a table for insert positions

• node’s label and its parent’s label are used to index the table for insert positions.

– Probabilities of no insertion, left insertion and right insertion for each index condition

– a table for words to be inserted• Probability of insertion dependent only on the word

Page 36: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

T-table

• We apply the translate operation to each leaf. We assume that this operation is dependent only on the word itself and that no context is consulted.

• The model’s t-table specifies the probability for all cases.

Page 37: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

• Syntax-light alignment models such as the five IBM models (Brown et al., 1993) and their relatives have proved to be very successful and robust at producing word-level alignments, especially for closely related languages with similar word order and mostly local reorderings, which can be captured via simple models of relative word distortion.

Page 38: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

• However, these models have been less successful at modeling syntactic distortions with longer distance movement.

• In contrast, more syntactically informed approaches have been constrained by the often weak syntactic correspondences typical of real-world parallel texts, and by the difficulty of finding or inducing syntactic parsers for any but a few of the world’s most studied languages.

Page 39: Phrase Based Machine Translation Professor Sivaji Bandyopadhyay Computer Science & Engineering Department Jadavpur University Kolkata, INDIA

• Elliott Franco Drábek, David Yarowsky; ACL 2004 Proceedings of the interactive poster and demonstration sessions, 21-26 July 2004, Barcelona, Spain. ¤ An improved method for automated word alignment of parallel texts which takes advantage of knowledge of syntactic divergences, while avoiding the need for syntactic analysis of the less resource rich language, and retaining the robustness of syntactically agnostic approaches such as the IBM word alignment models.