Machine Translation - 3


  • Machine Translation - 3

    Autumn 2008, Lecture 18

    8 Sep 2008

  • Translation Steps


  • IBM Models 1-5

    Model 1: bag of words
    - Unique local maximum
    - Efficient EM algorithm (Models 1-2)
    Model 2: general alignment
    Model 3: fertility n(k | e)
    - No full EM, count only neighbors (Models 3-5)
    - Deficient (Models 3-4)
    Model 4: relative distortion, word classes
    Model 5: extra variables to avoid deficiency

  • IBM Model 1

    Model parameters:
    T(fj | eaj) = translation probability of the foreign word fj given the English word eaj that generated it

  • IBM Model 1

    Generative story. Given e:
    1. Pick m = |f|, where all lengths m are equally probable
    2. Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m
    3. Pick f1...fm with probability P(f | A, e) = ∏ j=1..m T(fj | eaj), where T(fj | eaj) is the translation probability of fj given the English word it is aligned to
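    Multiplying the three steps together gives the standard Model 1 probability of a foreign sentence and alignment (Brown et al., 1993), writing ε for the constant length probability used on the later slides:

    P(f, A | e) = ε / (l+1)^m × ∏ j=1..m T(fj | eaj)

    Summing over alignments gives P(f | e) = Σ_A P(f, A | e).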

  • IBM Model 1 Example

    e: blue witch

  • IBM Model 1 Example

    e: blue witch
    f: f1 f2

    Pick m = |f| = 2

  • IBM Model 1 Example

    e: blue witch
    f: f1 f2

    Pick A = {2,1} with probability 1/(l+1)^m (here 1/(2+1)^2 = 1/9)

  • IBM Model 1 Example

    e: blue witch
    f: bruja f2

    Pick f1 = bruja with probability T(bruja | witch)

  • IBM Model 1 Example

    e: blue witch
    f: bruja azul

    Pick f2 = azul with probability T(azul | blue)

  • IBM Model 1: Parameter Estimation

    How does this generative story help us estimate P(f|e) from the data?
    Since the model for P(f|e) contains the parameter T(fj | eaj), we first need to estimate T(fj | eaj)

  • IBM Model 1: Parameter Estimation

    How do we estimate T(fj | eaj) from the data?
    If we had the data and the alignments A, along with P(A|f,e), then we could estimate T(fj | eaj) using expected counts:

    T(f | e) = tc(f | e) / Σ_f' tc(f' | e)

    where tc(f | e) is the expected count of e aligning to f, accumulated from P(A|f,e)

  • IBM Model 1: Parameter Estimation

    How do we estimate P(A|f,e)?
    P(A|f,e) = P(A,f|e) / P(f|e)
    But P(f|e) = Σ_A P(A,f|e)
    So we need to compute P(A,f|e)
    This is given by the Model 1 generative story:

    P(A,f|e) = ε / (l+1)^m × ∏ j=1..m T(fj | eaj)

  • IBM Model 1 Example

    e: the blue witch
    f: la bruja azul

    P(A|f,e) = P(f,A|e) / P(f|e)
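    As a worked instance: here l = 3 and m = 3, so there are (3+1)^3 = 64 possible alignments. For the alignment that maps la←the, bruja←witch, azul←blue:

    P(f, A | e) = ε / 4^3 × T(la | the) × T(bruja | witch) × T(azul | blue)

    and P(A | f, e) divides this by the sum of P(f, A' | e) over all 64 alignments A'.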

  • IBM Model 1: Parameter Estimation

    So, in order to estimate P(f|e), we first need to estimate the model parameter T(fj | eaj)
    In order to compute T(fj | eaj), we need to estimate P(A|f,e)
    And in order to compute P(A|f,e), we need to estimate T(fj | eaj)

  • IBM Model 1: Parameter Estimation

    Training data is a set of sentence pairs <ei, fi>
    The log likelihood of the training data given the model parameters is:

    Σ_i log P(fi | ei)

    To maximize the log likelihood of the training data given the model parameters, use EM:
    - hidden variable = alignments A
    - model parameters = translation probabilities T

  • EM

    1. Initialize model parameters T(f|e)
    2. Calculate alignment probabilities P(A|f,e) under the current values of T(f|e)
    3. Calculate expected counts from the alignment probabilities
    4. Re-estimate T(f|e) from these expected counts
    5. Repeat from step 2 until the log likelihood of the training data converges to a maximum

  • IBM Model 1 Example

    Parallel corpus:
    the dog :: le chien
    the cat :: le chat

    Step 1+2 (collect candidates and initialize uniformly):
    P(le | the) = P(chien | the) = P(chat | the) = 1/3
    P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3
    P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3
    P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3

  • IBM Model 1 Example

    Step 3: Iterate
    NULL the dog :: le chien

    j=1: total = P(le | NULL) + P(le | the) + P(le | dog) = 1
    tc(le | NULL) += P(le | NULL)/total = 0 + .333/1 = 0.333
    tc(le | the)  += P(le | the)/total  = 0 + .333/1 = 0.333
    tc(le | dog)  += P(le | dog)/total  = 0 + .333/1 = 0.333

    j=2: total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 1
    tc(chien | NULL) += P(chien | NULL)/total = 0 + .333/1 = 0.333
    tc(chien | the)  += P(chien | the)/total  = 0 + .333/1 = 0.333
    tc(chien | dog)  += P(chien | dog)/total  = 0 + .333/1 = 0.333

  • IBM Model 1 Example

    NULL the cat :: le chat

    j=1: total = P(le | NULL) + P(le | the) + P(le | cat) = 1
    tc(le | NULL) += P(le | NULL)/total = 0.333 + .333/1 = 0.666
    tc(le | the)  += P(le | the)/total  = 0.333 + .333/1 = 0.666
    tc(le | cat)  += P(le | cat)/total  = 0 + .333/1 = 0.333

    j=2: total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 1
    tc(chat | NULL) += P(chat | NULL)/total = 0 + .333/1 = 0.333
    tc(chat | the)  += P(chat | the)/total  = 0 + .333/1 = 0.333
    tc(chat | cat)  += P(chat | cat)/total  = 0 + .333/1 = 0.333

  • IBM Model 1 Example

    Re-compute translation probabilities:
    total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.333 + 0.333 = 1.333
    P(le | the)    = tc(le | the)/total(the)    = 0.666/1.333 = 0.5
    P(chien | the) = tc(chien | the)/total(the) = 0.333/1.333 = 0.25
    P(chat | the)  = tc(chat | the)/total(the)  = 0.333/1.333 = 0.25

    total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.333 = 0.666
    P(le | dog)    = tc(le | dog)/total(dog)    = 0.333/0.666 = 0.5
    P(chien | dog) = tc(chien | dog)/total(dog) = 0.333/0.666 = 0.5

  • IBM Model 1 Example

    Iteration 2:
    NULL the dog :: le chien

    j=1: total = P(le | NULL) + P(le | the) + P(le | dog) = 0.5 + 0.5 + 0.5 = 1.5
    tc(le | NULL) += P(le | NULL)/total = 0 + .5/1.5 = 0.333
    tc(le | the)  += P(le | the)/total  = 0 + .5/1.5 = 0.333
    tc(le | dog)  += P(le | dog)/total  = 0 + .5/1.5 = 0.333

    j=2: total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 0.25 + 0.25 + 0.5 = 1
    tc(chien | NULL) += P(chien | NULL)/total = 0 + .25/1 = 0.25
    tc(chien | the)  += P(chien | the)/total  = 0 + .25/1 = 0.25
    tc(chien | dog)  += P(chien | dog)/total  = 0 + .5/1 = 0.5

  • IBM Model 1 Example

    NULL the cat :: le chat

    j=1: total = P(le | NULL) + P(le | the) + P(le | cat) = 0.5 + 0.5 + 0.5 = 1.5
    tc(le | NULL) += P(le | NULL)/total = 0.333 + .5/1.5 = 0.666
    tc(le | the)  += P(le | the)/total  = 0.333 + .5/1.5 = 0.666
    tc(le | cat)  += P(le | cat)/total  = 0 + .5/1.5 = 0.333

    j=2: total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 0.25 + 0.25 + 0.5 = 1
    tc(chat | NULL) += P(chat | NULL)/total = 0 + .25/1 = 0.25
    tc(chat | the)  += P(chat | the)/total  = 0 + .25/1 = 0.25
    tc(chat | cat)  += P(chat | cat)/total  = 0 + .5/1 = 0.5

  • IBM Model 1 Example

    Re-compute translation probabilities (iteration 2):
    total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.25 + 0.25 = 1.166
    P(le | the)    = tc(le | the)/total(the)    = 0.666/1.166 = 0.571
    P(chien | the) = tc(chien | the)/total(the) = 0.25/1.166 = 0.214
    P(chat | the)  = tc(chat | the)/total(the)  = 0.25/1.166 = 0.214

    total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.5 = 0.833
    P(le | dog)    = tc(le | dog)/total(dog)    = 0.333/0.833 = 0.4
    P(chien | dog) = tc(chien | dog)/total(dog) = 0.5/0.833 = 0.6

  • IBM Model 1 Example

    After 5 iterations:
    P(le | NULL)    = 0.755608028335301
    P(chien | NULL) = 0.122195985832349
    P(chat | NULL)  = 0.122195985832349
    P(le | the)     = 0.755608028335301
    P(chien | the)  = 0.122195985832349
    P(chat | the)   = 0.122195985832349
    P(le | dog)     = 0.161943319838057
    P(chien | dog)  = 0.838056680161943
    P(le | cat)     = 0.161943319838057
    P(chat | cat)   = 0.838056680161943
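    These numbers can be reproduced in a few lines of code. Below is a minimal sketch of Model 1 EM in Python (the variable names are ours, not from the slides); running it prints the same probabilities after 5 iterations:

    from collections import defaultdict

    # Toy parallel corpus from the slides, with the NULL word prepended.
    corpus = [
        ("NULL the dog".split(), "le chien".split()),
        ("NULL the cat".split(), "le chat".split()),
    ]

    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # T(f | e), initialized uniformly

    for _ in range(5):
        tc = defaultdict(float)       # expected counts tc(f | e)
        total = defaultdict(float)    # normalizer per English word
        for es, fs in corpus:
            for f in fs:                                 # E-step
                denom = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / denom
                    tc[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in tc.items():                     # M-step
            t[(f, e)] = c / total[e]

    for f, e in sorted(t):
        print(f"P({f} | {e}) = {t[(f, e)]:.15f}")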

  • IBM Model 1 Recap

    IBM Model 1 allows for an efficient computation of translation probabilities
    No notion of fertility: it is possible that the same English word is the best translation for all foreign words
    No positional information: depending on the language pair, there might be a tendency for words at the beginning of the English sentence to align to words at the beginning of the foreign sentence

  • IBM Model 2

    Model parameters:
    T(fj | eaj) = translation probability of the foreign word fj given the English word eaj that generated it
    d(i | j, l, m) = distortion probability: the probability that fj is aligned to ei, given l and m

  • IBM Model 3

    Model parameters:
    T(fj | eaj) = translation probability of the foreign word fj given the English word eaj that generated it
    r(j | i, l, m) = reverse distortion probability: the probability of position j for fj, given its alignment to ei, l, and m
    n(k | ei) = fertility of word ei: the probability that k foreign words are aligned to ei
    p1 = probability of generating a foreign word by alignment with the NULL English word

  • IBM Model 3

    IBM Model 3 offers two additional features compared to IBM Model 1:
    - Fertility: how likely is an English word e to align to k foreign words?
    - Distortion (positional information): how likely is a word in position i to align to a word in position j?

  • IBM Model 3: Fertility

    The best Model 1 alignment could be one in which a single English word aligns to all foreign words
    This is clearly not desirable, and we want to constrain the number of words an English word can align to
    Fertility models a probability distribution over the number of words k that e aligns to: n(k | e)
    Consequence: translation probabilities can no longer be computed independently of each other
    IBM Model 3 has to work with full alignments; note there are up to (l+1)^m different alignments

  • IBM Model 3: Generative Story

    1. Choose fertilities for each English word
    2. Insert spurious words according to the probability of being aligned to the NULL English word
    3. Translate English words into foreign words
    4. Reorder words according to the reverse distortion probabilities

  • IBM Model 3 Example

    Consider the following example from [Knight 1999]:
    Maria did not slap the green witch

  • IBM Model 3 Example

    Maria did not slap the green witch
    Maria not slap slap slap the green witch

    Choose fertilities: phi(Maria) = 1, phi(did) = 0, phi(slap) = 3, and so on

  • IBM Model 3 Example

    Maria did not slap the green witch
    Maria not slap slap slap the green witch
    Maria not slap slap slap NULL the green witch

    Insert spurious words: p(NULL)

  • IBM Model 3 Example

    Maria did not slap the green witch
    Maria not slap slap slap the green witch
    Maria not slap slap slap NULL the green witch
    Maria no dio una bofetada a la verde bruja

    Translate words: t(verde | green)

  • IBM Model 3 Example

    Maria no dio una bofetada a la verde bruja
    Maria no dio una bofetada a la bruja verde

    Reorder words
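    Schematically, the probability Model 3 assigns to this derivation is the product of the probabilities of the four steps (fertility, NULL insertion, translation, distortion); a partial expansion for this example, using the parameters defined earlier:

    P(f, A | e) = n(1 | Maria) × n(0 | did) × n(3 | slap) × ...
                × t(Maria | Maria) × t(no | not) × t(dio | slap) × t(una | slap) × t(bofetada | slap) × t(a | NULL) × ...
                × the reverse-distortion terms r(j | i, l, m) and the NULL-insertion terms involving p1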

  • IBM Model 3

    For Models 1 and 2:
    - We can compute exact EM updates
    For Models 3 and 4:
    - Exact EM updates cannot be efficiently computed
    - Use the best alignments from previous iterations to initialize each successive model
    - Explore only the subspace of potential alignments that lies within the same neighborhood as the initial alignments

  • IBM Model 4

    Model parameters:
    Same as Model 3, except it uses a more complicated model of reordering (for details, see Brown et al., 1993)


  • IBM Model 1 + Model 3

    Iterating over all possible alignments is computationally infeasible
    Solution: compute the best alignment with Model 1, then change some of the alignments to generate a set of likely alignments (pegging)
    Model 3 takes this restricted set of alignments as input

  • Pegging

    Given an alignment a, we can derive additional alignments from it by making small changes:
    - Changing a link (j, i) to (j, i')
    - Swapping a pair of links (j1, i1) and (j2, i2) to (j1, i2) and (j2, i1)
    The resulting set of alignments is called the neighborhood of a
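    A minimal Python sketch of neighborhood generation (the representation is assumed, not from the slides: a list a where a[j] = i means fj is aligned to ei, with i = 0 for NULL):

    def neighbors(a, l):
        """Return all alignments differing from a by one changed or one swapped link.

        a: alignment list; a[j] = index of the English word generating f_j (0 = NULL)
        l: length of the English sentence, so links may point at positions 0..l
        """
        out = []
        # Change a single link (j, a[j]) to (j, i')
        for j in range(len(a)):
            for i in range(l + 1):
                if i != a[j]:
                    b = list(a)
                    b[j] = i
                    out.append(b)
        # Swap a pair of links (j1, a[j1]) and (j2, a[j2])
        for j1 in range(len(a)):
            for j2 in range(j1 + 1, len(a)):
                if a[j1] != a[j2]:
                    b = list(a)
                    b[j1], b[j2] = b[j2], b[j1]
                    out.append(b)
        return out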

  • IBM Model 3: Distortion

    The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences: d(j | i, l, m)
    Note: positions are absolute positions

  • Deficiency

    Problem with IBM Model 3: it assigns probability mass to impossible strings
    - Well-formed string: This is possible
    - Ill-formed but possible string: This possible is
    - Impossible string: one in which distortion values generate different words at the same position
    Impossible strings can still be filtered out in later stages of the translation process

  • Limitations of IBM Models

    - Only 1-to-N word mapping
    - Handling fertility-zero words (difficult for decoding)
    - Almost no syntactic information
      - Word classes
      - Relative distortion
    - Long-distance word movement
    - Fluency of the output depends entirely on the English language model

  • Decoding

    How do we translate new sentences?
    A decoder uses the parameters learned on a parallel corpus:
    - Translation probabilities
    - Fertilities
    - Distortions
    In combination with a language model, the decoder generates the most likely translation
    Standard algorithms can be used to explore the search space (A*, greedy search, ...)
    The problem is similar to the traveling salesman problem

  • Three Problems for Statistical MT

    Language model:
    Given an English string e, assigns P(e) by formula
    - good English string -> high P(e)
    - random word sequence -> low P(e)

    Translation model:
    Given a pair of strings <f, e>, assigns P(f | e) by formula
    - <f, e> that look like translations -> high P(f | e)
    - <f, e> that don't look like translations -> low P(f | e)

    Decoding algorithm:
    Given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) * P(f | e)

    Slide from Kevin Knight
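    Written as an equation, the decoding objective is the noisy-channel argmax:

    ê = argmax_e P(e) × P(f | e)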

  • The Classic Language Model: Word N-Grams

    Goal of the language model: choose among candidate strings, e.g.:

    He is on the soccer field
    He is in the soccer field

    Is table the on cup the
    The cup is on the table

    Rice shrine
    American shrine
    Rice company
    American company

    Slide from Kevin Knight

  • Intuition of phrase-based translation (Koehn et al. 2003)

    Generative story has three steps:
    1. Group words into phrases
    2. Translate each phrase
    3. Move the phrases around

  • Generative story again

    1. Group English source words into phrases e1, e2, ..., en
    2. Translate each English phrase ei into a Spanish phrase fj; the probability of doing this is φ(fj | ei)
    3. Then (optionally) reorder each Spanish phrase
    - We do this with a distortion probability
    - A measure of distance between the positions of a corresponding phrase in the two languages
    - What is the probability that a phrase in position X in the English sentence moves to position Y in the Spanish sentence?

  • Distortion probability

    The distortion probability is parameterized by ai - b(i-1), where:
    - ai is the start position of the foreign (Spanish) phrase generated by the ith English phrase ei
    - b(i-1) is the end position of the foreign (Spanish) phrase generated by the (i-1)th English phrase e(i-1)
    We'll call the distortion probability d(ai - b(i-1))
    And we'll have a really stupid model: d(ai - b(i-1)) = α^|ai - b(i-1) - 1|, where α is some small constant
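    A quick worked instance of this model (the positions are invented for illustration): suppose the Spanish phrase generated by e1 ends at position b1 = 4. If the Spanish phrase for e2 starts right after it, at a2 = 5 (monotone order), then d = α^|5 - 4 - 1| = α^0 = 1, i.e. no penalty. If it instead starts at a2 = 8, then d = α^|8 - 4 - 1| = α^3, so the long jump is penalized.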

  • Final translation model for phrase-based MT

    Let's look at a simple example with no distortion
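    The formula this slide builds up to, in the Koehn et al. (2003) formulation: the translation model is a product, over the I English phrases, of a phrase-translation term and a distortion term:

    P(F | E) = ∏ i=1..I φ(f̄i | ēi) × d(ai - b(i-1))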

  • Phrase-based MT

    - Language model P(E)
    - Translation model P(F|E)
      - The model
      - How to train the model
    - Decoder: finding the sentence E that is most probable

  • Training P(F|E)

    What we mainly need to train is φ(fj | ei)
    Suppose we had a large bilingual training corpus (a bitext), in which each English sentence is paired with a Spanish sentence
    And suppose we knew exactly which phrase in Spanish was the translation of which phrase in the English
    We call this a phrase alignment
    If we had this, we could just count and divide:
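    The count-and-divide estimate is just a relative frequency over the aligned phrase pairs:

    φ(f̄ | ē) = count(ē, f̄) / Σ_f̄' count(ē, f̄')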

  • But we don't have phrase alignments

    What we have instead are word alignments:

  • Getting phrase alignments

    To get phrase alignments:
    1. We first get word alignments
    2. Then we symmetrize the word alignments into phrase alignments

  • How to get Word Alignments

    Word alignment: a mapping between the source words and the target words in a set of parallel sentences
    Restriction: each foreign word comes from exactly 1 English word
    Advantage: we can represent an alignment by the index of the English word that the French word comes from
    The alignment above is thus 2,3,4,5,6,6,6

  • One addition: spurious words

    A word in the foreign sentence that doesn't align with any word in the English sentence is called a spurious word
    We model these by pretending they are generated by an English word e0 (NULL):

  • More sophisticated models of alignment


  • Computing word alignments: IBM Model 1

    For phrase-based machine translation:
    - We want a word alignment to extract a set of phrases
    - A word-alignment algorithm gives us P(F, E)
    - We want this to train our phrase probabilities φ(fj | ei) as part of P(F|E)
    But a word-alignment algorithm can also be part of a mini-translation model itself

  • IBM Model 1

  • How does the generative story assign P(F|E) for a Spanish sentence F?

    Terminology:
    Suppose we had done steps 1 and 2, i.e. we already knew the Spanish length J and the alignment A (and the English source E):

    P(F | A, E) = ∏ j=1..J t(fj | eaj)

  • Let's formalize steps 1 and 2

    We want P(A|E): the probability of an alignment A (of length J) given an English sentence E
    IBM Model 1 makes the (very) simplifying assumption that each alignment is equally likely
    How many possible alignments are there between an English sentence of length I and a Spanish sentence of length J?
    Hint: each Spanish word must come from one of the English source words (or the NULL word)
    Answer: (I+1)^J
    Let's assume the probability of choosing length J is a small constant epsilon

  • Model 1 continued

    Probability of choosing a length and then one of the possible alignments:

    P(A | E) = ε / (I+1)^J

    Combining with step 3:

    P(F, A | E) = ε / (I+1)^J × ∏ j=1..J t(fj | eaj)

    The total probability of a given foreign sentence F:

    P(F | E) = Σ_A P(F, A | E)

  • Decoding

    How do we find the best A?
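    For Model 1 the product factors over j, so the best alignment can be read off word by word; a one-line sketch of the reasoning:

    Â = argmax_A ∏ j=1..J t(fj | eaj)  ⇒  âj = argmax_i t(fj | ei), independently for each j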

  • Training alignment probabilities

    Step 1: Get a parallel corpus
    - Hansards: Canadian parliamentary proceedings, in French and English
    - Hong Kong Hansards: English and Chinese
    Step 2: Sentence alignment
    Step 3: Use EM (Expectation Maximization) to train word alignments

  • Step 1: Parallel corpora

    Example from DE-News (8/1/1996):

    English: Diverging opinions about planned tax reform
    German: Unterschiedliche Meinungen zur geplanten Steuerreform

    English: The discussion around the envisaged major tax reform continues.
    German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an.

    English: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
    German: Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

    Slide from Christof Monz

  • Step 2: Sentence Alignment

    The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.

    El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.

    Intuition:
    - use length in words or chars
    - together with dynamic programming
    - or use a simpler MT model

    Slide from Kevin Knight


  • Sentence Alignment

    The old man is happy. He has fished many times. His wife talks to him. The sharks await.

    El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.

    Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).

    Slide from Kevin Knight

  • Step 3: word alignments

    It turns out we can bootstrap alignments from a sentence-aligned bilingual corpus
    We use the Expectation-Maximization (EM) algorithm

  • EM for training alignment probs

    la maison ... la maison bleue ... la fleur
    the house ... the blue house ... the flower

    All word alignments equally likely
    All P(french-word | english-word) equally likely

    Slide from Kevin Knight

  • EM for training alignment probs

    la maison ... la maison bleue ... la fleur
    the house ... the blue house ... the flower

    "la" and "the" are observed to co-occur frequently, so P(la | the) is increased.

    Slide from Kevin Knight

  • EM for training alignment probs

    la maison ... la maison bleue ... la fleur
    the house ... the blue house ... the flower

    "house" co-occurs with both "la" and "maison", but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of "the" (pigeonhole principle)

    Slide from Kevin Knight

  • EM for training alignment probs

    la maison ... la maison bleue ... la fleur
    the house ... the blue house ... the flower

    settling down after another iteration

    Slide from Kevin Knight

  • EM for training alignment probs

    la maison ... la maison bleue ... la fleur
    the house ... the blue house ... the flower

    Inherent hidden structure revealed by EM training!

    For details, see:
    - Section 24.6.1 in the chapter
    - "A Statistical MT Tutorial Workbook" (Knight, 1999)
    - "The Mathematics of Statistical Machine Translation" (Brown et al., 1993)
    - Software: GIZA++

    Slide from Kevin Knight

  • Statistical Machine Translation

    la maison ... la maison bleue ... la fleur
    the house ... the blue house ... the flower

    P(juste | fair) = 0.411
    P(juste | correct) = 0.027
    P(juste | right) = 0.020
    ...

    Given a new French sentence, the model proposes possible English translations, to be rescored by the language model

    Slide from Kevin Knight

  • A more complex model: IBM Model 3 (Brown et al., 1993)

    Mary did not slap the green witch
    Mary not slap slap slap the green witch         n(3 | slap)
    Mary not slap slap slap NULL the green witch    P-Null
    Maria no dio una bofetada a la verde bruja      t(la | the)
    Maria no dio una bofetada a la bruja verde      d(j | i)

    Generative approach: probabilities can be learned from raw bilingual text.

  • How do we evaluate MT? Human tests for fluency

    Rating tests: give the raters a scale (1 to 5) and ask them to rate
    - or distinct scales for clarity, naturalness, style
    - or check for specific problems
      - cohesion (lexical chains, anaphora, ellipsis): hand-checking for cohesion
      - well-formedness: 5-point scale of syntactic correctness
    Comprehensibility tests
    - noise test
    - multiple-choice questionnaire
    Readability tests
    - cloze

  • How do we evaluate MT? Human tests for fidelity

    Adequacy
    - Does it convey the information in the original?
    - Ask raters to rate on a scale
      - Bilingual raters: give them the source and target sentences; ask how much information is preserved
      - Monolingual raters: give them the target plus a good human translation
    Informativeness
    - Task-based: is there enough information to do some task?
    - Give raters multiple-choice questions about content

  • Evaluating MT: Problems

    Asking humans to judge sentences on a 5-point scale for 10 factors takes time and $$$ (weeks or months!)
    We can't build language engineering systems if we can only evaluate them once every quarter!!!!
    We need a metric that we can run every time we change our algorithm
    It would be OK if it wasn't perfect, but it should tend to correlate with the expensive human metrics, which we could still run quarterly

    Bonnie Dorr

  • Automatic evaluation

    Miller and Beebe-Center (1958)
    Assume we have one or more human translations of the source passage
    Compare the automatic translation to these human translations:
    - BLEU
    - NIST
    - Meteor
    - Precision/Recall

  • BiLingual Evaluation Understudy (BLEU; Papineni, 2001)

    Automatic technique, but...
    Requires the pre-existence of human (reference) translations
    Approach:
    - Produce a corpus of high-quality human translations
    - Judge closeness numerically (word-error rate)
    - Compare n-gram matches between the candidate translation and 1 or more reference translations

    http://www.research.ibm.com/people/k/kishore/RC22176.pdf

    Slide from Bonnie Dorr

  • BLEU Evaluation Metric (Papineni et al., ACL-2002)

    Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.

    Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.

    N-gram precision (score is between 0 and 1):
    - What percentage of machine n-grams can be found in the reference translation? (An n-gram is a sequence of n words)
    - Not allowed to use the same portion of the reference translation twice (can't cheat by typing out "the the the the the")

    Brevity penalty:
    - Can't just type out a single word, "the" (precision 1.0!)

    *** Amazingly hard to game the system (i.e., find a way to change machine output so that BLEU goes up but quality doesn't)

    Slide from Bonnie Dorr

  • BLEU Evaluation Metric (Papineni et al., ACL-2002)

    (Same reference translation and machine translation as on the previous slide.)

    BLEU4 formula (counts n-grams up to length 4):

    BLEU4 = exp(1.0 * log p1 + 0.5 * log p2 + 0.25 * log p3 + 0.125 * log p4
                - max(words-in-reference / words-in-machine - 1, 0))

    p1 = 1-gram precision
    p2 = 2-gram precision
    p3 = 3-gram precision
    p4 = 4-gram precision

    Slide from Bonnie Dorr

  • Multiple Reference Translations

    Slide from Bonnie Dorr

  • BLEU in Action

    (Foreign Original)
    (Reference Translation) the gunman was shot to death by the police .

    #1  the gunman was police kill .
    #2  wounded police jaya of
    #3  the gunman was shot dead by the police .
    #4  the gunman arrested by police kill .
    #5  the gunmen were killed .
    #6  the gunman was shot to death by the police .
    #7  gunmen were killed by police [?] [?]
    #8  al by the police .
    #9  the ringer is killed by the police .
    #10 police killed the gunman .

    Slide from Bonnie Dorr

  • BLEU in Action

    (Same reference and candidate translations as on the previous slide.)
    green = 4-gram match (good!)
    red = word not matched (bad!)

    Slide from Bonnie Dorr

  • BLEU Comparison

    Chinese-English Translation Example:

    Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
    Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

    Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
    Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
    Reference 3: It is the practical guide for the army always to heed the directions of the party.

    Slide from Bonnie Dorr

  • How Do We Compute BLEU Scores?

    Intuition: what percentage of words in the candidate occurred in some human translation?
    Proposal: count the number of candidate translation words (unigrams) that occur in any reference translation, and divide by the total number of words in the candidate translation
    But we can't just count the total number of overlapping n-grams!
    - Candidate: the the the the the the the
    - Reference 1: The cat is on the mat
    Solution: a reference word should be considered exhausted after a matching candidate word is identified

    Slide from Bonnie Dorr

  • Modified n-gram precision

    For each word, compute:
    (1) the total number of times it occurs in any single reference translation
    (2) the number of times it occurs in the candidate translation
    Instead of using count (2) directly, use the minimum of (2) and (1), i.e. clip the counts at the maximum for the reference translation
    Now use that modified count, and divide by the number of candidate words

    Slide from Bonnie Dorr
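    A minimal Python sketch of this clipped ("modified") n-gram precision, following the definition above (the function name is ours):

    from collections import Counter

    def modified_precision(candidate, references, n):
        """Clipped n-gram precision: each candidate n-gram counts at most
        as often as it occurs in the single reference where it is most frequent."""
        def ngrams(words):
            return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        cand = ngrams(candidate)
        clipped = sum(min(c, max(ngrams(ref)[g] for ref in references))
                      for g, c in cand.items())
        return clipped / max(sum(cand.values()), 1)

    On the cheating example above, modified_precision("the the the the the the the".split(), ["the cat is on the mat".split()], 1) returns 2/7.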

  • Modified Unigram Precision: Candidate #1

    Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
    Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
    Reference 3: It is the practical guide for the army always to heed the directions of the party.

    It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1)

    What's the answer? 17/18

    Slide from Bonnie Dorr

  • Modified Unigram Precision: Candidate #2

    Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
    Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
    Reference 3: It is the practical guide for the army always to heed the directions of the party.

    It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0)

    What's the answer? 8/14

    Slide from Bonnie Dorr

  • Modified Bigram Precision: Candidate #1

    Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
    Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
    Reference 3: It is the practical guide for the army always to heed the directions of the party.

    It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1)

    What's the answer? 10/17

    Slide from Bonnie Dorr

  • Modified Bigram Precision: Candidate #2

    Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
    Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
    Reference 3: It is the practical guide for the army always to heed the directions of the party.

    It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0)

    What's the answer? 1/13

    Slide from Bonnie Dorr

  • Catching Cheaters

    Reference 1: The cat is on the mat
    Reference 2: There is a cat on the mat

    Candidate: the the the the the the the
    the(2) the(0) the(0) the(0) the(0) the(0) the(0)

    What's the unigram answer? 2/7
    What's the bigram answer? 0/6

    Slide from Bonnie Dorr

  • BLEU distinguishes human from machine translations

    Slide from Bonnie Dorr

  • BLEU problems with sentence length

    Candidate: of the

    Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
    Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
    Reference 3: It is the practical guide for the army always to heed the directions of the party.

    Problem: modified unigram precision is 2/2, bigram precision 1/1!
    Solution: brevity penalty; prefer candidate translations which are the same length as one of the references

    Slide from Bonnie Dorr
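    The brevity penalty as defined in Papineni et al. (2002), with c the candidate length and r the reference length, multiplies the n-gram precision score:

    BP = 1 if c > r, else e^(1 - r/c)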

  • BLEU Tends to Predict Human Judgments

    Slide from G. Doddington (NIST); uses a variant of BLEU

    [Chart residue from the slide's embedded spreadsheet: scatter plots of NIST scores against human judgments (Adequacy and Fluency) for Arabic-English and Chinese-English systems. In the underlying data, the automatic score's correlation with Adequacy/Fluency ranges from about 91% to 97% for the Arabic systems and from 39% to 95% for the Chinese systems, depending on the score variant.]

  • Summary

    - Intro and a little history
    - Language similarities and divergences
    - Four main MT approaches:
      - Transfer
      - Interlingua
      - Direct
      - Statistical
    - Evaluation

  • Classes

    LINGUIST 139M/239M. Human and Machine Translation. (Martin Kay)
    CS 224N. Natural Language Processing. (Chris Manning)