Lecture 8: Statistical Alignment & Machine Translation (Chapter 13 of Manning & Schütze)


  • *Lecture 8: Statistical Alignment & Machine Translation (Chapter 13 of Manning & Schütze)
    Wen-Hsiang Lu (盧文祥)
    Department of Computer Science and Information Engineering, National Cheng Kung University
    2014/05/12
    (Slides from Berlin Chen, National Taiwan Normal University, http://140.122.185.120/PastCourses/2004S-NaturalLanguageProcessing/NLP_main_2004S.htm)
    References:
    1. Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 1993.
    2. Knight, K. Automating Knowledge Acquisition for Machine Translation, AI Magazine, 1997.
    3. Gale, W. A. and Church, K. W. A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 1993.

  • *Machine Translation (MT)
    Definition: the automatic translation of text or speech from one language to another

    Goal: produce close-to-error-free output that reads fluently in the target language (current systems are still far from this goal)

    Current status: existing systems are still limited in quality, e.g., Babel Fish Translation (before 2004) and Google Translation

    A mix of probabilistic and non-probabilistic components

  • *Issues
    Build high-quality, semantics-based MT systems in circumscribed domains

    Abandon fully automatic MT and build software to assist human translators instead, e.g., by post-editing the output of an imperfect translation system

    Develop automatic knowledge-acquisition techniques for improving general-purpose MT, via supervised or unsupervised learning

  • *Different Strategies for MT

  • *Word-for-Word MT
    Translate words one by one from one language to another

    Problems
    1. There is no one-to-one correspondence between words in different languages (lexical ambiguity); we need to look at context larger than the individual word (phrase or clause). E.g., English "suit" can mean a lawsuit or a set of garments, each rendered by a different French word.
    2. Languages have different word orders (e.g., English vs. French)

  • *Syntactic Transfer MT
    Parse the source text, transfer the parse tree of the source text into a syntactic tree in the target language, and then generate the translation from this syntactic tree
    This solves the word-ordering problem

    Problems: syntactic ambiguity, and the target syntax will likely mirror that of the source text

    E.g., German "Ich esse gern" (I like to eat) transfers literally into English as "I eat readily/gladly" (N V Adv)

  • *Text Alignment: Definition
    Definition: align paragraphs, sentences, or words in one language to paragraphs, sentences, or words in another language
    From such alignments we can learn which words tend to be translated by which other words in the other language

    Text alignment is not part of the MT process per se, but it is the obligatory first step for making use of multilingual text corpora

    Applications
    - Bilingual lexicography (bilingual dictionaries)
    - Machine translation
    - Multilingual information retrieval
    - Parallel grammars

  • *Text Alignment: Sources and Granularities
    Sources of parallel texts or bitexts
    - Parliamentary proceedings (Hansards)
    - Newspapers and magazines
    - Religious and literary works

    Two levels of alignment
    Gross large-scale alignment: learn which paragraphs or sentences correspond to which paragraphs or sentences in the other language

    Word alignment: learn which words tend to be translated by which words in the other language; the necessary step for acquiring a bilingual dictionary

    With less literal translations, the order of words or sentences might not be preserved

  • *Text Alignment: Example 1
    (Figure: a 2:2 alignment)

  • *Text Alignment: Example 2
    (Figure: 2:2, 1:1, 1:1, and 2:1 alignments; each aligned group of sentences is a bead)
    Studies show that around 90% of alignments are 1:1 sentence alignments

  • *Sentence Alignment
    Crossing dependencies are not allowed here: word ordering is preserved!

    Related work

  • *Sentence Alignment
    - Length-based
    - Lexically guided
    - Offset-based

  • *Sentence Alignment: Length-based Method
    Rationale: short sentences will be translated as short sentences, and long sentences as long sentences
    Length is defined as the number of words or the number of characters

    Approach 1 (Gale & Church 1993)
    Assumptions: the paragraph structure is clearly marked in the corpus; confusions are checked by hand

    Lengths of sentences are measured in characters
    Crossing dependencies are not handled here: the order of sentences is not changed in the translation
    (Figure: source sentences s1 s2 s3 s4 ... sI aligned to target sentences t1 t2 t3 t4 ... tJ)

    The method ignores the rich information available in the text itself
    Evaluated on the Union Bank of Switzerland (UBS) corpus: English, French, and German

  • *Sentence Alignment: Length-based Method
    (Figure: source sentences s1 s2 s3 s4 ... sI and target sentences t1 t2 t3 t4 ... tJ grouped into beads B1, B2, B3, ..., Bk)
    Possible alignments (bead types): {1:1, 1:0, 0:1, 2:1, 1:2, 2:2}
    Probabilistic independence is assumed between beads
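    Under the independence assumption, the probability of a candidate alignment decomposes into a product over beads. A reconstruction of the stripped formula, in the spirit of Gale & Church (1993):

    P(A \mid S, T) \approx \prod_{k=1}^{K} P(B_k), \qquad \hat{A} = \arg\max_{A} \prod_{k=1}^{K} P(B_k)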

  • *Sentence Alignment: Length-based Method
    Dynamic programming with a cost function (distance measure)
    The sentence is the unit of alignment
    Character lengths are modeled statistically: the normalized difference of the lengths of the two halves of a bead, based on the ratio of text lengths in the two languages, is assumed to follow a standard normal distribution; Bayes' law then turns this into the probability used by the distance measure
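    The stripped distance measure can be reconstructed from Gale & Church (1993). With l1 and l2 the character lengths of the two halves of a bead, c the expected number of target characters per source character, and s^2 the variance of that ratio (their published estimates are c ≈ 1 and s² ≈ 6.8):

    \delta = \frac{l_2 - l_1 c}{\sqrt{l_1 s^2}}, \qquad
    P(\delta \mid \text{align}) = 2\big(1 - \Phi(|\delta|)\big), \qquad
    \text{cost} = -\log\big(P(\text{align})\, P(\delta \mid \text{align})\big)

    where \Phi is the standard normal cumulative distribution function.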

  • *Sentence Alignment: Length-based Method
    The prior probability P(align) of each bead type
    (Figure: source sentences si-2, si-1, si and target sentences tj-2, tj-1, tj)

  • *Sentence Alignment: Length-based Method
    A simple example (see the sketch below)
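    A minimal runnable sketch of the length-based DP, serving as the simple example (not from the slides; the bead priors and the estimates c ≈ 1, s² ≈ 6.8 are Gale & Church's published values, used here as assumptions):

    import math

    # Bead priors and length-ratio parameters from Gale & Church (1993).
    PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
              (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
    C, S2 = 1.0, 6.8

    def norm_cdf(z):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def bead_cost(l1, l2, bead):
        # cost = -log P(bead type) - log P(delta | align)
        delta = (l2 - l1 * C) / math.sqrt(max(l1, 1) * S2)
        p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-12)
        return -math.log(PRIORS[bead]) - math.log(p_delta)

    def align(src_lens, tgt_lens):
        # DP over sentence character lengths; returns the minimum-cost bead sequence.
        I, J = len(src_lens), len(tgt_lens)
        INF = float("inf")
        D = [[INF] * (J + 1) for _ in range(I + 1)]
        back = [[None] * (J + 1) for _ in range(I + 1)]
        D[0][0] = 0.0
        for i in range(I + 1):
            for j in range(J + 1):
                for (di, dj) in PRIORS:
                    if i >= di and j >= dj and D[i - di][j - dj] < INF:
                        cost = D[i - di][j - dj] + bead_cost(
                            sum(src_lens[i - di:i]), sum(tgt_lens[j - dj:j]), (di, dj))
                        if cost < D[i][j]:
                            D[i][j], back[i][j] = cost, (di, dj)
        beads, i, j = [], I, J
        while (i, j) != (0, 0):
            di, dj = back[i][j]
            beads.append((di, dj))
            i, j = i - di, j - dj
        return beads[::-1]

    # Toy usage with per-sentence character lengths.
    print(align([20, 51, 14], [23, 50, 16]))  # -> [(1, 1), (1, 1), (1, 1)]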

  • *Sentence Alignment: Length-based Method
    A 4% error rate was achieved
    Problems: cannot handle noisy and imperfect input, e.g., OCR output or files containing unknown markup conventions; finding paragraph or sentence boundaries is difficult
    Solution: just align text (position) offsets in the two parallel texts (Church 1993)
    Questionable for languages with few cognates or with different writing systems, e.g., English vs. Chinese, or Eastern European vs. Asian languages

  • *Sentence Alignment: Lexical Method
    Approach 1 (Kay & Röscheisen 1993)
    - First assume the first and last sentences of the texts are aligned; these are the initial anchors
    - Form an envelope of possible alignments; alignments are excluded when sentences cross anchors or when their respective distances from an anchor differ greatly
    - Choose word pairs whose distributions are similar in most of the sentences
    - Find pairs of source and target sentences which contain many possible lexical correspondences
    - The most reliable pairs are used to induce a set of partial alignments (added to the list of anchors)
    - Iterate (a sketch of the word-pair scoring step follows)
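    A minimal sketch of the word-pair scoring step, assuming the candidate sentence pairs from the current envelope are given; the Dice coefficient is used as the distributional-similarity measure, a stand-in rather than Kay & Röscheisen's exact statistic:

    from collections import Counter

    def score_word_pairs(sent_pairs, min_dice=0.5):
        # sent_pairs: list of (source_sentence, target_sentence) string pairs
        # drawn from inside the alignment envelope.
        src_df, tgt_df, pair_df = Counter(), Counter(), Counter()
        for s, t in sent_pairs:
            s_words, t_words = set(s.split()), set(t.split())
            src_df.update(s_words)
            tgt_df.update(t_words)
            pair_df.update((ws, wt) for ws in s_words for wt in t_words)
        # Dice coefficient: high when the two words co-occur in most sentence pairs.
        return {pair: 2 * n / (src_df[pair[0]] + tgt_df[pair[1]])
                for pair, n in pair_df.items()
                if 2 * n / (src_df[pair[0]] + tgt_df[pair[1]]) >= min_dice}

    pairs = [("the house is small", "das Haus ist klein"),
             ("the house is old", "das Haus ist alt")]
    print(score_word_pairs(pairs)[("house", "Haus")])  # -> 1.0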

  • *Word Alignment
    The sentence/offset alignment can be extended to a word alignment
    Some criteria are then used to select aligned word pairs for inclusion in the bilingual dictionary, e.g., frequency of word correspondences and association measures

  • *Statistical Machine Translation
    The noisy channel model
    Assumptions: an English word can be aligned with multiple French words, while each French word is aligned with at most one English word; the individual word-to-word translations are independent
    Components: language model, translation model, and decoder
    Notation: e is the English sentence, f the French sentence, with |e| = l and |f| = m
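    The stripped noisy-channel equation, reconstructed:

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)} = \arg\max_{e} P(e)\, P(f \mid e)

    P(e) is supplied by the language model, P(f|e) by the translation model, and the search for the maximizing e is performed by the decoder.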

  • *Statistical Machine Translation
    Three important components are involved: the decoder, the language model, and the translation model
    Language model: gives the probability P(e)
    Translation model: sums the word translation probabilities t(fj | e_aj) over all possible alignments, where e_aj is the English word that the French word fj is aligned with, times a normalization constant

  • *An Introduction to Statistical Machine Translation
    Dept. of CSIE, NCKU
    Yao-Sheng Chang
    Date: 2011.04.12

  • *Introduction
    Translations involve many cultural aspects; we consider only the translation of individual sentences, and merely acceptable ones at that
    Every sentence in one language is a possible translation of any sentence in the other
    Assign to each pair (S, T) a probability Pr(T|S): the probability that a translator will produce T in the target language when presented with S in the source language

  • *Statistical Machine Translation (SMT)
    The noisy channel problem

  • *Fundamentals of SMT
    Given a French string f, the job of our translation system is to find the English string e that the speaker had in mind when producing f. By Bayes' theorem,
    \Pr(e \mid f) = \frac{\Pr(e)\,\Pr(f \mid e)}{\Pr(f)}
    Since the denominator Pr(f) is constant, the best e is the one with the greatest value of Pr(e) Pr(f|e).

  • *Practical Challenges
    - Computation of the translation model Pr(f|e)
    - Computation of the language model Pr(e)
    - Decoding (i.e., searching for the e that maximizes Pr(f|e) Pr(e))

  • *Alignment of case 1 (one-to-many)

  • *Alignment of case 2 (many-to-one)

  • *Alignment of case 3 (many-to-many)

  • *Formulation of Alignment (1)
    Let e = e_1^l = e1 e2 ... el and f = f_1^m = f1 f2 ... fm. An alignment between a pair of strings e and f is a mapping that connects every word fj to some word ei
    In other words, an alignment a between e and f tells us that the word fj, 1 <= j <= m, is generated by the word e_aj, with aj in {0, 1, ..., l} (0 denoting the null word, i.e., no mapping)
    There are (l+1)^m different alignments between e and f; e.g., for l = 2 and m = 3 there are (2+1)^3 = 27

  • *Translation Model
    The alignment a can be represented by a series a_1^m = a1 a2 ... am of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string, then aj = i, and if it is not connected to any English word, then aj = 0 (null).

  • *IBM Model I
    The alignment is determined by specifying the values of aj for j from 1 to m, each of which can take any value from 0 to l
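    The Model 1 equations were stripped from the slides; reconstructed from Brown et al. (1993), with \epsilon a small fixed probability reserved for the French length m:

    \Pr(f, a \mid e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j})

    \Pr(f \mid e) = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j})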

  • *Constrained Maximization
    We wish to adjust the translation probabilities so as to maximize Pr(f|e) subject to the constraint that, for each e,
    \sum_{f} t(f \mid e) = 1

  • *Lagrange Multipliers (1)
    Method of Lagrange multipliers with one constraint:
    If there is a maximum or minimum subject to the constraint g(x, y) = 0, then it will occur at one of the critical numbers of the function F defined by
    F(x, y, \lambda) = f(x, y) - \lambda\, g(x, y)
    f(x, y) is called the objective function, g(x, y) = 0 the constraint equation, F(x, y, \lambda) the Lagrange function, and \lambda the Lagrange multiplier.

  • *Lagrange Multipliers (2)
    Following standard practice for constrained maximization, we introduce Lagrange multipliers \lambda_e and seek an unconstrained extremum of the auxiliary function h(t, \lambda)
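    The auxiliary function itself was stripped; reconstructed from Brown et al. (1993):

    h(t, \lambda) = \Pr(f \mid e) - \sum_{e} \lambda_e \Big( \sum_{f} t(f \mid e) - 1 \Big)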

  • *Derivation (1)
    The partial derivative of h with respect to t(f|e) is
    \frac{\partial h}{\partial t(f \mid e)} = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j}) \prod_{k \neq j} t(f_k \mid e_{a_k}) - \lambda_e
    where \delta is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise

  • *Derivation (2)
    Setting this partial derivative to zero gives a re-estimation formula for t(f|e)
    We call the expected number of times that e connects to f in the translation (f|e) the count of f given e for (f|e), and denote it by c(f|e; f, e)
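    Both stripped equations can be reconstructed from Brown et al. (1993): setting the derivative to zero and multiplying through by t(f|e) gives Equation (11), and the count definition is Equation (12):

    t(f \mid e) = \lambda_e^{-1} \sum_{a} \Pr(f, a \mid e) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j})

    c(f \mid e; f, e) = \sum_{a} \Pr(a \mid f, e) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j})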

  • *Derivation (3)
    Replacing \lambda_e by \lambda_e \Pr(f \mid e), Equation (11) can be written very compactly as
    t(f \mid e) = \lambda_e^{-1}\, c(f \mid e; f, e)
    In practice, our training data consists of a set of translations (f^{(1)} \mid e^{(1)}), (f^{(2)} \mid e^{(2)}), \ldots, (f^{(S)} \mid e^{(S)}), so this equation becomes
    t(f \mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f \mid e; f^{(s)}, e^{(s)})

  • *Derivation (4)
    The sum over alignments can be rearranged into an expression that can be evaluated efficiently:
    \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) = \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)

  • *Derivation (5)
    With this rearrangement, the count becomes
    c(f \mid e; f, e) = \frac{t(f \mid e)}{t(f \mid e_0) + \cdots + t(f \mid e_l)} \sum_{j=1}^{m} \delta(f, f_j) \sum_{i=0}^{l} \delta(e, e_i)
    Thus, the number of operations necessary to calculate a count is proportional to l + m rather than to (l+1)^m, as Equation (12) would suggest
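    A minimal runnable sketch of this EM training loop for Model 1 (not from the slides; the toy corpus, uniform initialization, and iteration count are assumptions):

    from collections import defaultdict

    # Toy parallel corpus: (English, French) sentence pairs; 'NULL' is the empty cept e0.
    corpus = [("NULL the house".split(), "la maison".split()),
              ("NULL the blue house".split(), "la maison bleue".split()),
              ("NULL the flower".split(), "la fleur".split())]

    # Uniform initialization of t(f|e).
    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(20):  # EM iterations
        count = defaultdict(float)   # c(f|e) accumulated over the corpus
        total = defaultdict(float)   # normalizer for each e
        for es, fs in corpus:
            for f in fs:
                # Efficient form: denominator t(f|e0) + ... + t(f|el)
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z  # expected count of the link (e, f)
                    count[(f, e)] += c
                    total[e] += c
        # M-step: t(f|e) = count(f, e) / sum over f of count(f, e)
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]

    print(round(t[("maison", "house")], 3))  # should approach 1.0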

  • Other References about MT


  • *Chinese-English Sentence Alignment (ROCLING XVI, 2004)
    - Two-stage iterative DP
    - Stop list
    - Partial match

  • *Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING 03)

  • *Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING 03)
    Preprocessing steps to calculate the following information:
    - Lists of preferred POS patterns of collocations in both languages
    - Collocation candidates matching the preferred POS patterns
    - N-gram statistics for both languages, N = 1, 2
    - Log-likelihood ratio statistics for two consecutive words in both languages
    - Log-likelihood ratio statistics for a pair of collocation candidates across the two languages
    - Content-word alignment based on the Competitive Linking Algorithm (Melamed 1997)
    (A sketch of the log-likelihood ratio computation follows.)
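    A minimal sketch of Dunning's (1993) log-likelihood ratio for two consecutive words; the counts in the usage line are hypothetical:

    import math

    def _ll(k, n, x):
        # Log-likelihood of k successes in n Bernoulli(x) trials.
        # x is clamped away from 0 and 1 to avoid log(0) on degenerate counts.
        x = min(max(x, 1e-12), 1 - 1e-12)
        return k * math.log(x) + (n - k) * math.log(1 - x)

    def llr(c12, c1, c2, n):
        # Dunning (1993) log-likelihood ratio for the bigram (w1, w2):
        # c12 = count(w1 w2), c1 = count(w1), c2 = count(w2), n = total bigrams.
        p = c2 / n                   # P(w2) under independence
        p1 = c12 / c1                # P(w2 | w1)
        p2 = (c2 - c12) / (n - c1)   # P(w2 | not w1)
        return 2 * (_ll(c12, c1, p1) + _ll(c2 - c12, n - c1, p2)
                    - _ll(c12, c1, p) - _ll(c2 - c12, n - c1, p))

    # Hypothetical counts: "strong tea" 10 times, "strong" 30, "tea" 20, 10,000 bigrams.
    print(round(llr(10, 30, 20, 10000), 1))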


  • A Probability Model to Improve Word Alignment

    Colin Cherry & Dekang Lin, University of Alberta

  • Outline
    - Introduction
    - Probabilistic Word-Alignment Model
    - Word-Alignment Algorithm
    - Constraints
    - Features

  • Introduction
    Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation, e.g., translation lexicons and transfer rules
    The word-alignment problem: conventional approaches usually use co-occurrence models, e.g., χ² (Gale & Church 1991) and the log-likelihood ratio (Dunning 1993)
    Indirect association problem: Melamed (2000) proposed competitive linking, along with an explicit noise model, to solve it
    Goal: propose a probabilistic word-alignment model that allows easy integration of context-specific features

  • (Figure: the explicit noise model)

  • Probabilistic Word-Alignment Model
    Given E = e1, e2, ..., em and F = f1, f2, ..., fn:
    - If ei and fj are a translation pair, then the link l(ei, fj) exists
    - If ei has no corresponding translation, then the null link l(ei, f0) exists
    - If fj has no corresponding translation, then the null link l(e0, fj) exists
    An alignment A is a set of links such that every word in E and F participates in at least one link
    The alignment problem is to find the alignment A that maximizes P(A|E, F)
    (Compare IBM's translation model, which maximizes P(A, F|E))

  • Probabilistic Word-Alignment Model (Cont.)
    Given A = {l1, l2, ..., lt}, where lk = l(e_ik, f_jk), consecutive subsets of A are written l_i^j = {li, li+1, ..., lj}
    Let Ck = {E, F, l_1^(k-1)} represent the context of lk

  • Probabilistic Word-Alignment Model (Cont.)
    Ck = {E, F, l_1^(k-1)} is too complex to estimate directly
    FTk is a set of context-related features such that P(lk | Ck) can be approximated by P(lk | e_ik, f_jk, FTk)
    Let C'k = {e_ik, f_jk} ∪ FTk
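    The decomposition this slide builds toward, reconstructed in LaTeX (the factorization follows Cherry & Lin; the exact notation is an assumption):

    P(A \mid E, F) = \prod_{k=1}^{t} P(l_k \mid C_k) \approx \prod_{k=1}^{t} P\big(l_k \mid e_{i_k}, f_{j_k}, FT_k\big)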

  • An Illustrative Example

  • Word-Alignment Algorithm
    Input: E, F, TE, where TE is E's dependency tree, which enables us to use features and constraints based on linguistic intuitions
    Constraints:
    - One-to-one constraint: every word participates in exactly one link
    - Cohesion constraint: use TE to induce TF with no crossing dependencies

  • Word-Alignment Algorithm (Cont.)
    Features:
    - Adjacency features fta: for any word pair (ei, fj), if a link l(ei', fj') exists where -2 <= i'-i <= 2 and -2 <= j'-j <= 2, then fta(i'-i, j'-j, ei') is active for this context
    - Dependency features ftd: for any word pair (ei, fj), let ei' be the governor of ei, and let rel be the grammatical relationship between them; if a link l(ei', fj') exists, then ftd(j'-j, rel) is active for this context

  • Experimental Results
    Test bed: Hansard corpus
    Training: 50K aligned sentence pairs (Och & Ney 2000)
    Testing: 500 pairs

  • Future Work
    The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignments will be pursued
    The proposed model is capable of creating many-to-one alignments, with null probabilities assigned to the extra words added on the "many" side
