Machine Translation II.2 (February 2006): Postgraduate Diploma in Translation. Example Based Machine Translation and Statistical Machine Translation.


  • Postgraduate Diploma in Translation

    Example Based Machine Translation
    Statistical Machine Translation

    Machine Translation II.2

  • Three ways to lighten the load
    - Restrict coverage to specialised domains
    - Exploit existing sources of knowledge (convert machine-readable dictionaries)
    - Try to manage without explicit representations: Example Based MT (EBMT), Statistical MT (SMT)


  • Today's Lecture
    - Example Based MT
    - Statistical MT


  • Part I: Example Based Machine Translation


  • EBMT
    The basic idea is that instead of being based on rules and abstract representations, translation should be based on a database of examples.
    Each example is a pairing of a source fragment with a target fragment.
    The original intuition came from Nagao, a well-known pioneer in the field of English/Japanese translation.


  • EBMT (Nagao 1984)
    Man does translation by:
    - properly decomposing an input sentence into certain fragmental phrases, then
    - translating these phrases into other language phrases, and finally
    - properly composing these fragmental translations into one long sentence.


  • Three Step Process
    - Match: identify relevant source language examples in the database.
    - Align: find the corresponding fragments in the target language.
    - Recombine: put the target language fragments together to form sentences.


  • EBMT

    Example Based Machine Translation as used in the Pangloss system at Carnegie Mellon University

    Based on notes by Dave Inman


  • EBMT Corpus & Index

    Corpus:
    S1: The cat eats a fish. / Le chat mange un poisson.
    S2: A dog eats a cat. / Un chien mange un chat.
    ...
    S99,999,999: ...

    Index:
    the: S1
    cat: S1, S2
    eats: S1, S2
    fish: S1
    dog: S2
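As a concrete illustration, here is a minimal Python sketch of such a corpus and inverted index (toy data and names, not the actual Pangloss implementation):

```python
# A minimal sketch (not the actual Pangloss code) of the corpus and
# index on this slide: each source-language word points back to the
# sentence pairs that contain it.
from collections import defaultdict

corpus = [
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
]

index = defaultdict(set)
for sent_id, (source, target) in enumerate(corpus, start=1):
    for word in source.lower().rstrip(".").split():
        index[word].add(f"S{sent_id}")

print(sorted(index["cat"]))   # ['S1', 'S2']
print(sorted(index["fish"]))  # ['S1']
```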


  • EBMT: find chunks
    A source language sentence is input: The cat eats a dog.
    Chunks of this sentence are matched against the corpus:
    - The cat: S1
    - The cat eats: S1
    - The cat eats a: S1
    - a dog: S2
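A hedged sketch of the chunk lookup, reusing the toy corpus and index from the previous sketch; checking adjacency by substring matching is a simplification, not Pangloss's actual matcher:

```python
# A simplified chunk finder: every span of two or more adjacent input
# words is checked against the corpus sentences suggested by the index.
def find_chunks(sentence, corpus, index):
    words = sentence.lower().rstrip(".").split()
    chunks = []
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):      # spans of >= 2 words
            span = " ".join(words[i:j])
            # candidate sentences: those containing the span's first word
            for sid in sorted(index.get(words[i], ())):
                source = corpus[int(sid[1:]) - 1][0].lower()
                if span in source:                  # crude adjacency check
                    chunks.append((span, sid))
    return chunks

print(find_chunks("The cat eats a dog.", corpus, index))
# finds e.g. ('the cat eats a', 'S1') and ('a dog', 'S2'), as on the slide
```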


  • Match and Align Chunks
    For each chunk, retrieve the target sentence:
    - the cat eats: S1 (The cat eats a fish. / Le chat mange un poisson.)
    - a dog: S2 (A dog eats a cat. / Un chien mange un chat.)
    The chunks are then aligned with fragments of the target sentences:
    - The cat eats -> Le chat mange
    Alignment is difficult.


  • Recombination
    Chunks are scored to find the good matches:
    - The cat eats / Le chat mange: score 78%
    - The cat eats / Le chat dorme: score 43%
    - a dog / un chien: score 67%
    - a dog / le chien: score 56%
    - a dog / un arbre: score 22%

    The best translated chunks are put together to make the final translation:
    The cat eats / Le chat mange + a dog / un chien
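The selection step can be sketched directly from the slide's numbers (the data structure and scores are illustrative only):

```python
# A toy sketch of recombination: keep the best-scoring candidate for
# each source chunk and concatenate in source order. The scores are
# the illustrative percentages from the slide.
candidates = {
    "the cat eats": [("le chat mange", 0.78), ("le chat dorme", 0.43)],
    "a dog": [("un chien", 0.67), ("le chien", 0.56), ("un arbre", 0.22)],
}

translation = " ".join(
    max(options, key=lambda pair: pair[1])[0]
    for options in candidates.values()
)
print(translation)  # le chat mange un chien
```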


  • What Data Are Needed?
    - A bilingual dictionary (but we can induce this from the corpus).
    - A target language root/synonym list, so we can see the similarity between words and their inflected forms (e.g. verbs).
    - Classes of words easily translated, such as numbers, towns, weekdays.
    - A large corpus of parallel sentences, if possible in the same domain as the translations.


  • How to create a bilingual lexicon
    - Take each sentence pair in the corpus.
    - For each word in the source sentence, add each word in the target sentence and increment the frequency count.
    - Repeat for as many sentences as possible.
    - Use a threshold to get the possible alternative translations.
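This recipe is plain co-occurrence counting; a minimal sketch, with the threshold handling and function names as assumptions:

```python
# Co-occurrence counting as described on this slide.
from collections import Counter, defaultdict

def induce_lexicon(pairs):
    cooc = defaultdict(Counter)
    for source, target in pairs:
        for s in source.lower().rstrip(".").split():
            for t in target.lower().rstrip(".").split():
                cooc[s][t] += 1          # count every source/target pairing
    return cooc

def translations(cooc, word, threshold):
    # keep only target words whose count clears the threshold
    return [(t, n) for t, n in cooc[word].most_common() if n >= threshold]

cooc = induce_lexicon([
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
])
print(translations(cooc, "cat", threshold=2))
# [('un', 3), ('chat', 2), ('mange', 2)]
```

Note how, with only two sentences, a frequent function word like "un" outscores the real translation "chat"; this is why many sentences and a threshold are needed, as the next slides show.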


  • How to create a lexicon
    The cat eats a fish. / Le chat mange un poisson.

    the:  le,1  chat,1  mange,1  un,1  poisson,1
    cat:  le,1  chat,1  mange,1  un,1  poisson,1
    eats: le,1  chat,1  mange,1  un,1  poisson,1
    a:    le,1  chat,1  mange,1  un,1  poisson,1
    fish: le,1  chat,1  mange,1  un,1  poisson,1


  • After many sentences: the
    le, 956
    la, 925
    un, 235
    ------ Threshold ------
    chat, 47
    mange, 33
    poisson, 28
    ...
    arbre, 18


  • After many sentences: cat
    chat, 963
    ------ Threshold ------
    le, 604
    la, 485
    un, 305
    mange, 33
    poisson, 28
    ...
    arbre, 47


  • Indexing the Corpus
    For speed, the corpus is indexed on the source language sentences.

    Each word in each source language sentence is stored with information about the target sentence.

    Words can be added to the corpus and the index easily updated.

    Tokens are used for common classes of words (e.g. numbers). This makes matching more effective.
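The class-token idea might look like this in a minimal sketch (the <NUM> placeholder name is an assumption for illustration):

```python
# Replacing an easily-translated word class (numbers here) with a
# placeholder token before indexing, so that "chapter 7" and
# "chapter 12" match the same corpus entry.
import re

def tokenize_with_classes(sentence):
    words = sentence.lower().rstrip(".").split()
    return ["<NUM>" if re.fullmatch(r"\d+", w) else w for w in words]

print(tokenize_with_classes("Chapter 7 has 12 pages."))
# ['chapter', '<NUM>', 'has', '<NUM>', 'pages']
```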


  • Finding Chunks to Translate
    Look up each word in the source sentence in the index.

    Look for chunks in the source sentence (at least 2 adjacent words) which match the corpus.

    Select the last few matches against the corpus (translation memory).

    Pangloss uses the last 5 matches for any chunk.


  • Matching a chunk against the target
    For each source chunk found previously, retrieve the target sentences from the corpus (using the index).

    Try to find the translation for the source chunk from these sentences.

    This is the hard bit!

    Look for the minimum and maximum segments in the target sentences which could correspond to the source chunk. Score each of these segments.


  • Scoring a segment
    - Unmatched words: higher priority is given to sentences containing all the words in an input chunk.
    - Noise: higher priority is given to corpus sentences which have fewer extra words.
    - Order: higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
    - Morphology: higher priority is given to sentences in which words match exactly rather than as morphological variants.
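A hedged sketch of how these priorities might combine into a single score; the weights are invented, and the morphology test is omitted for brevity:

```python
# Toy combination of three of the slide's heuristics (morphology omitted).
def score_segment(chunk_words, segment_words):
    chunk, seg = list(chunk_words), list(segment_words)
    matched = [w for w in chunk if w in seg]
    unmatched = len(chunk) - len(matched)   # chunk words missing from segment
    noise = len(seg) - len(matched)         # extra words in the segment
    # order: fraction of adjacent matched pairs keeping their relative order
    positions = [seg.index(w) for w in matched]
    in_order = sum(a < b for a, b in zip(positions, positions[1:]))
    order = in_order / max(len(positions) - 1, 1)
    return order - 0.5 * unmatched - 0.25 * noise

print(score_segment(["le", "chat", "mange"],
                    ["le", "chat", "mange", "un", "poisson"]))  # 0.5
```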


  • Whole Sentence Match
    If we are lucky, the whole sentence will be found in the corpus!

    In that case the target sentence is used directly, without the alignment step.

    Useful if a translation memory is available (sentences recently translated are added to the corpus).


  • Quality of Translation
    Pangloss was tested against source sentences in a different domain to the examples in the corpus.
    Pangloss covered about 70% of the sentences input. This means a match was found against the corpus, but not necessarily a good match.
    Others report that around 60% of the translation can be understood by a native speaker. Systran manages about 70%.


  • Speed of Translation
    Translations are much faster than for Systran; simple sentences are translated in seconds.
    The corpus (translation memory) can be added to at about 6 MB per minute (Sun SPARCstation).
    A 270 MB corpus takes 45 minutes to index.


  • Positive Points
    - Fast
    - Easy to add a new language pair
    - No need to analyse languages (much)
    - Can induce a dictionary from the corpus
    - Allows easy implementation of translation memory


  • Negative Points
    - Quality is second best at present
    - Depends on a large corpus of parallel, well translated sentences
    - 30% of the source has no coverage (no translation)
    - Matching of words is brittle: we can see a match where Pangloss cannot
    - The domain of the corpus should match the domain to be translated, so that chunks match


  • Conclusions
    - An alternative to Systran
    - Faster
    - Lower quality
    - Quick to develop for a new language pair, if a corpus exists!
    - Needs no linguistics
    - Might improve as bigger corpora become available?


  • Part II: Statistical Translation


  • Statistical Translation
    - Robust
    - Domain independent
    - Extensible
    - Does not require language specialists
    - Uses the noisy channel model of translation


  • Noisy Channel Model
    Sentence translation (Brown et al. 1990)
    [Diagram: a source sentence passes through a noisy channel and emerges as the observed target sentence; translation recovers the source from the target.]


  • Basic Principle
    John loves Mary (S)
    Jean aime Marie (T)
    Given T, we have to find the S such that Ptrans x Ps is greater than for any other S, where:
    - Ptrans = probability that T is a translation of S
    - Ps = probability of S
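In the standard noisy channel notation, this is the Bayes decision rule (a routine restatement of the slide's condition, not new material):

```latex
\hat{S} \;=\; \arg\max_{S}\; P(S)\, P(T \mid S)
        \;=\; \arg\max_{S}\; P_s \times P_{\mathrm{trans}}
```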


  • A Statistical MT System
    [Diagram: a source language model supplies Ps and a translation model supplies Ptrans; a decoder takes the target sentence T and outputs the source sentence S.]


  • The Three Components of a Statistical MT Model
    - A method for computing language model (Ps) probabilities
    - A method for computing translation (Ptrans) probabilities
    - A method for searching amongst source sentences for the one that maximises Ptrans x Ps


  • Simplest Language Model
    The probability Ps of any sentence is the product of the probabilities of the words in it.
    For example: P(John loves Mary) = P(John) x P(loves) x P(Mary)
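A minimal sketch of such a unigram model (the floor for unseen words is an assumption; logs avoid underflow on long sentences):

```python
# Unigram language model: word probabilities are relative frequencies,
# and a sentence's probability is the product over its words.
from collections import Counter
import math

def train_unigram(text):
    words = text.lower().split()
    total = len(words)
    return {w: n / total for w, n in Counter(words).items()}

def sentence_logprob(sentence, model, floor=1e-9):
    # unseen words get a tiny floor probability (an assumption here)
    return sum(math.log(model.get(w, floor)) for w in sentence.lower().split())

model = train_unigram("john loves mary and mary loves john")
print(sentence_logprob("john loves mary", model))
```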


  • Simplest Translation Model (1)
    Assumption: the target sentence is generated from the source sentence word by word.
    S: John loves Mary
    T: Jean aime Marie


  • Simplest Translation Model (2)
    Ptrans is just the product of the translation probabilities of each of the words:
    Ptrans = P(Jean|John) x P(aime|loves) x P(Marie|Mary)
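Transcribing the product directly, with invented toy probabilities:

```python
# Direct transcription of the slide's product; the values are
# illustrative, not estimated from data.
t = {("Jean", "John"): 0.8, ("aime", "loves"): 0.7, ("Marie", "Mary"): 0.9}
p_trans = t[("Jean", "John")] * t[("aime", "loves")] * t[("Marie", "Mary")]
print(round(p_trans, 3))  # 0.504
```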


  • More Realistic Example
    The proposal will not now be implemented
    Les propositions ne seront pas mises en application maintenant


  • More Realistic Translation Models
    Better translation models include other features, such as:
    - Fertility: the number of words in the target that are paired with each source word (0 to N)
    - Distortion: the difference in sentence position between the source word and the target word


  • Searching
    Maintain a list of hypotheses. Initial hypothesis: (Jean aime Marie | *)
    Search proceeds iteratively. At each iteration we extend the most promising hypotheses with additional words:
    - (Jean aime Marie | John(1) *)
    - (Jean aime Marie | * loves(2) *)
    - (Jean aime Marie | * Mary(3) *)
    - (Jean aime Marie | Jean(1) *)
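The hypothesis list can be sketched as a tiny beam search; the monotone word-by-word extension, the toy lexicon, and the beam width are all simplifications of Brown et al.'s actual stack decoder:

```python
# Toy beam search over source hypotheses for a given target sentence.
import heapq, math

lexicon = {  # toy P(target word | source word), invented values
    "John": {"Jean": 0.9}, "loves": {"aime": 0.9}, "Mary": {"Marie": 0.9},
}

def beam_search(target_words, source_vocab, beam=3):
    hyps = [(0.0, [])]                    # (negative log prob, source so far)
    for t in target_words:
        extended = []
        for score, words in hyps:
            for s in source_vocab:
                p = lexicon.get(s, {}).get(t, 1e-6)   # floor for unseen pairs
                heapq.heappush(extended, (score - math.log(p), words + [s]))
        hyps = [heapq.heappop(extended) for _ in range(min(beam, len(extended)))]
    return hyps[0][1]

print(beam_search(["Jean", "aime", "Marie"], ["John", "loves", "Mary"]))
# ['John', 'loves', 'Mary']
```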


  • Building Models
    In general, large quantities of data are needed.
    - For the language model, we need only source language text.
    - For the translation model, we need pairs of sentences that are translations of each other.
    Use the EM algorithm (Baum 1972) to optimise the model parameters.
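For the translation model, the EM idea can be sketched as the classic IBM Model 1 estimator, a simplification of the full Brown et al. model (which adds fertility and distortion); the data and names are toys:

```python
# Compact EM for word-translation probabilities (IBM Model 1 style).
from collections import defaultdict

def em_model1(pairs, iterations=10):
    t = defaultdict(lambda: 1.0)          # uniform-ish initialisation
    for _ in range(iterations):
        counts = defaultdict(float)       # expected co-occurrence counts
        totals = defaultdict(float)
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    frac = t[(sw, tw)] / norm     # E-step: soft alignment
                    counts[(sw, tw)] += frac
                    totals[sw] += frac
        t = defaultdict(float,                    # M-step: renormalise
                        {(sw, tw): c / totals[sw]
                         for (sw, tw), c in counts.items()})
    return t

pairs = [(["the", "cat"], ["le", "chat"]), (["the", "dog"], ["le", "chien"])]
t = em_model1(pairs)
print(round(t[("the", "le")], 2))   # rises towards 1.0 over the iterations
```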


  • Experiment 1 (Brown et al. 1990)
    Hansard: 40,000 pairs of sentences, approx. 800,000 words in each language.
    Considered the 9,000 most common words in each language.
    Assumptions (initial parameter values):
    - each of the 9,000 target words equally likely as a translation of each of the source words
    - each of the fertilities from 0 to 25 equally likely for each of the 9,000 source words
    - each target position equally likely given each source position and target length


  • English: the

    French    Probability
    le        .610
    la        .178
    l'        .083
    les       .023
    ce        .013
    il        .012
    de        .009
    à         .007
    que       .007

    Fertility   Probability
    1           .871
    0           .124
    2           .004


  • English: not

    French        Probability
    pas           .469
    ne            .460
    non           .024
    pas du tout   .003
    faux          .003
    plus          .002
    ce            .002
    que           .002
    jamais        .002

    Fertility   Probability
    2           .758
    0           .133
    1           .106


  • English: hear

    French      Probability
    bravo       .992
    entendre    .005
    entendu     .002
    entends     .001

    Fertility   Probability
    0           .584
    1           .416


  • Experiment 2
    Translation was performed using the 1,000 most frequent words in the English corpus, and the 1,700 French words most frequently used in translations of sentences completely covered by the 1,000-word English vocabulary.
    117,000 pairs of sentences were completely covered by both vocabularies.
    The parameters of the English language model were estimated from 570,000 sentences in the English part.


  • Experiment 2 (contd)
    73 French sentences from elsewhere in the corpus were tested. The results were classified as:
    - Exact: same as the actual translation
    - Alternate: same meaning
    - Different: a legitimate translation, but with a different meaning
    - Wrong: could not be interpreted as a translation
    - Ungrammatical: grammatically deficient
    Corrections to the last three categories were made, and the keystrokes were counted.


  • Results
    [Results table not reproduced in the transcript.]


  • Results: Discussion
    According to Brown et al., the system performed successfully 48% of the time (the first three categories).
    776 keystrokes were needed to repair the output, against 1,916 keystrokes to generate all 73 translations from scratch.
    According to the authors, the system therefore reduces work by about 60%.


  • Bibliography
    Statistical MT: Brown et al., "A Statistical Approach to Machine Translation", Computational Linguistics 16(2), 1990, pp. 79-85 (search the ACL Anthology).
