![Page 1: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/1.jpg)
111
CS 388: Natural Language Processing
Machine Translation
Raymond J. MooneyUniversity of Texas at Austin
![Page 2: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/2.jpg)
Machine Translation
• Automatically translate one natural language into another.
2
Mary didn’t slap the green witch.
Maria no dió una bofetada a la bruja verde.
![Page 3: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/3.jpg)
3
Ambiguity Resolution is Required for Translation
• Syntactic and semantic ambiguities must be properly resolved for correct translation:– “John plays the guitar.” → “John toca la guitarra.”
– “John plays soccer.” → “John juega el fútbol.”
• An apocryphal story is that an early MT system gave the following results when translating from English to Russian and then back to English:– “The spirit is willing but the flesh is weak.”
“The liquor is good but the meat is spoiled.”
– “Out of sight, out of mind.” “Invisible idiot.”
![Page 4: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/4.jpg)
Word Alignment
• Shows mapping between words in one language and the other.
4
Mary didn’t slap the green witch.
Maria no dió una bofetada a la bruja verde.
![Page 5: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/5.jpg)
Translation Quality
• Achieving literary quality translation is very difficult.• Existing MT systems can generate rough translations that
frequently at least convey the gist of a document.• High quality translations possible when specialized to
narrow domains, e.g. weather forcasts.• Some MT systems used in computer-aided translation in
which a bilingual human post-edits the output to produce more readable accurate translations.
• Frequently used to aid localization of software interfaces and documentation to adapt them to other languages.
5
![Page 6: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/6.jpg)
Linguistic Issues Making MT Difficult
• Morphological issues with agglutinative, fusion and polysynthetic languages with complex word structure.
• Syntactic variation between SVO (e.g. English), SOV (e.g. Hindi), and VSO (e.g. Arabic) languages.– SVO languages use prepositions– SOV languages use postpositions
• Pro-drop languages regularly omit subjects that must be inferred.
6
![Page 7: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/7.jpg)
Lexical Gaps
• Some words in one language do not have a corresponding term in the other.
• Rivière (river that flows into ocean) and fleuve (river that does not flow into ocean) in French
• Schedenfraude (feeling good about another’s pain) in German.
• Oyakoko (filial piety) in Japanese
7
![Page 8: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/8.jpg)
Vauquois Triangle
8
Interlingua
Semanticstructure
Semanticstructure
Syntacticstructure
Syntacticstructure
Words Words
Source Language Target Language
SRL &WSD
parsing
SemanticParsing
TacticalGeneration
Direct translation
Syntactic Transfer
Semantic Transfer
![Page 9: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/9.jpg)
Direct Transfer
• Morphological Analysis– Mary didn’t slap the green witch. →
Mary DO:PAST not slap the green witch.
• Lexical Transfer– Mary DO:PAST not slap the green witch.
– Maria no dar:PAST una bofetada a la verde bruja.
• Lexical Reordering– Maria no dar:PAST una bofetada a la bruja verde.
• Morphological generation– Maria no dió una bofetada a la bruja verde.
9
![Page 10: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/10.jpg)
Syntactic Transfer
• Simple lexical reordering does not adequately handle more dramatic reordering such as that required to translate from an SVO to an SOV language.
• Need syntactic transfer rules that map parse tree for one language into one for another.– English to Spanish:
• NP → Adj Nom NP → Nom ADJ
– English to Japanese:• VP → V NP VP → NP V
• PP → P NP PP → NP P
10
![Page 11: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/11.jpg)
Semantic Transfer
• Some transfer requires semantic information.• Semantic roles can determine how to properly
express information in another language.• In Chinese, PPs that express a goal, destination, or
benefactor occur before the verb but those expressing a recipient occur after the verb.
• Transfer Rule– English to Chinese
• VP → V PP[+benefactor] VP → PP[+benefactor] V
11
![Page 12: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/12.jpg)
Statistical MT
• Manually encoding comprehensive bilingual lexicons and transfer rules is difficult.
• SMT acquires knowledge needed for translation from a parallel corpus or bitext that contains the same set of documents in two languages.
• The Canadian Hansards (parliamentary proceedings in French and English) is a well-known parallel corpus.
• First align the sentences in the corpus based on simple methods that use coarse cues like sentence length to give bilingual sentence pairs.
12
![Page 13: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/13.jpg)
Picking a Good Translation
• A good translation should be faithful and correctly convey the information and tone of the original source sentence.
• A good translation should also be fluent, grammatically well structured and readable in the target language.
• Final objective:
13
)(fluency ),(ssfaithfulne argmaxTargetT
TSTTbest
![Page 14: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/14.jpg)
Noisy Channel Model
• Based on analogy to information-theoretic model used to decode messages transmitted via a communication channel that adds errors.
• Assume that source sentence was generated by a “noisy” transformation of some target language sentence and then use Bayesian analysis to recover the most likely target sentence that generated it.
14
Translate foreign language sentence F=f1, f2, …fm to anEnglish sentence Ȇ = e1, e2, …eI that maximizes P(E | F)
![Page 15: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/15.jpg)
Bayesian Analysis of Noisy Channel
15
)|(argmaxˆ FEPEEnglishE
)()|(argmax)(
)()|(argmax
EPEFPFP
EPEFP
EnglishE
EnglishE
Translation Model Language Model
A decoder determines the most probable translation Ȇ given F
![Page 16: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/16.jpg)
Language Model
• Use a standard n-gram language model for P(E).
• Can be trained on a large, unsupervised mono-lingual corpus for the target language E.
• Could use a more sophisticated PCFG language model to capture long-distance dependencies.
• Terabytes of web data have been used to build a large 5-gram model of English.
16
![Page 17: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/17.jpg)
Phrase-Based Translation Model
• Base P(F | E) on translating phrases in E to phrases in F.
• First segment E into a sequence of phrases ē1, ē1,…,ēI
• Then translate each phrase ēi, into fi, based on translation probability (fi | ēi)
• Then reorder translated phrases based on distortion probability d(i) for the ith phrase.
17
)(),()|(1
idefEFP i
I
ii
![Page 18: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/18.jpg)
Translation Probabilities
• Assuming a phrase aligned parallel corpus is available or constructed that shows matching between phrases in E and F.
• Then compute (MLE) estimate of based on simple frequency counts.
18
f
ef
efef
),(count
),(count),(
![Page 19: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/19.jpg)
Distortion Probability
• Measure distortion of phrase i as the distance between the start of the f phrase generated by ēi, (ai) and the end of the end of the f phrase generated by the previous phrase ēi-1, (bi-1).
• Typically assume the probability of a distortion decreases exponentially with the distance of the movement.
19
|| 1)( ii bacid
Set 0<<1 based on fit to phrase-aligned training dataThen set c to normalize d(i) so it sums to 1.
![Page 20: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/20.jpg)
Sample Translation Model
20
Position 1 2 3 4 5 6
English Mary did not slap the green witch
Spanish Maria no dió una bofetada a la bruja verde
ai−bi−1 1 1 1 1 2 1
121
111
)witchbruja,()greenverde,()thela,(
)a bofetada una dio slap,()not did no,()MaryMaria,()|(
ccc
cccEFp
![Page 21: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/21.jpg)
Word Alignment
• Directly constructing phrase alignments is difficult, so rely on first constructing word alignments.
• Can learn to align from supervised word alignments, but human-aligned bitexts are rare and expensive to construct.
• Typically use an unsupervised EM-based approach to compute a word alignment from unannotated parallel corpus.
21
![Page 22: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/22.jpg)
One to Many Alignment
• To simplify the problem, typically assume each word in F aligns to 1 word in E (but assume each word in E may generate more than one word in F).
• Some words in F may be generated by the NULL element of E.
• Therefore, alignment can be specified by a vector A giving, for each word in F, the index of the word in E which generated it.
22
NULL Mary didn’t slap the green witch.
Maria no dió una bofetada a la bruja verde.
0 1 2 3 4 5 6
1 2 3 3 3 0 4 6 5
![Page 23: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/23.jpg)
IBM Model 1
• First model proposed in seminal paper by Brown et al. in 1993 as part of CANDIDE, the first complete SMT system.
• Assumes following simple generative model of producing F from E=e1, e2, …eI
– Choose length, J, of F sentence: F=f1, f2, …fJ
– Choose a 1 to many alignment A=a1, a2, …aJ
– For each position in F, generate a word fj from the aligned word in E: eaj
23
![Page 24: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/24.jpg)
verde.1 2 3 3 3 0 4 6 5
Sample IBM Model 1 Generation
24
NULL Mary didn’t slap the green witch.0 1 2 3 4 5 6
Maria no dióuna bofetada a la bruja
![Page 25: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/25.jpg)
Computing P(F | E) in IBM Model 1
• Assume some length distribution P(J | E) • Assume all alignments are equally likely. Since
there are (I + 1)J possible alignments:
25
JI
EJPEJPJEAPEAP
)1(
)|()|(),|()|(
• Assume t(fx,ey) is the probability of translating ey
as fx, therefore:),(),|(
1ja
J
jj eftAEFP
• Determine P(F | E) by summing over all alignments:
),()1(
)|()|(),|()|(
1ja
J
jj
AJ
A
eftI
EJPEAPAEFPEFP
![Page 26: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/26.jpg)
Decoding for IBM Model 1
• Goal is to find the most probable alignment given a parameterized model.
26
)|,(argmaxˆA
EAFPA
),()1(
)|(argmax
1ja
J
jjJ
Aeft
I
EJP
),(argmax1
ja
J
jj
Aeft
Since translation choice for each position j is independent, the product is maximized by maximizing each term:
Jjefta ijIi
j
1 ),( argmax0
![Page 27: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/27.jpg)
HMM-Based Word Alignment• IBM Model 1 assumes all alignments are equally
likely and does not take into account locality:– If two words appear together in one language, then
their translations are likely to appear together in the result in the other language.
• An alternative model of word alignment based on an HMM model does account for locality by making longer jumps in switching from translating one word to another less likely.
![Page 28: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/28.jpg)
HMM Model
• Assumes the hidden state is the specific word occurrence ei in E currently being translated (i.e. there are I states, one for each word in E).
• Assumes the observations from these hidden states are the possible translations fj of ei.
• Generation of F from E then consists of moving to the initial E word to be translated, generating a translation, moving to the next word to be translated, and so on.
![Page 29: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/29.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
Maria
1 2 3 4 5 6
![Page 30: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/30.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
Maria no
1 2 3 4 5 6
![Page 31: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/31.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
Maria no dió
1 2 3 4 5 6
![Page 32: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/32.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
Maria no dió una
1 2 3 4 5 6
![Page 33: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/33.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
Maria no dió una bofetada
1 2 3 4 5 6
![Page 34: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/34.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
Maria no dió una bofetada a
1 2 3 4 5 6
![Page 35: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/35.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
Maria no dió una bofetada a la
1 2 3 4 5 6
![Page 36: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/36.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
Maria no dió una bofetada a la bruja
1 2 3 4 5 6
![Page 37: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/37.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
verde.Maria no dió una bofetada a la bruja
1 2 3 4 5 6
![Page 38: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/38.jpg)
Sample HMM Generation
Mary didn’t slap the green witch.
verde.Maria no dió una bofetada a la bruja
1 2 3 4 5 6
![Page 39: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/39.jpg)
HMM Parameters
• Transition and observation parameters of states for HMMs for all possible source sentences are “tied” to reduce the number of free parameters that have to be estimated.
• Observation probabilities: bj(fi)=P(fi | ej) the same for all states representing an occurrence of the same English word.
• State transition probabilities: aij = s(ji) the same for all transitions that involve the same jump width (and direction).
![Page 40: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/40.jpg)
Computing P(F | E) in the HMM Model
• Given the observation and state-transition probabilities, P(F | E) (observation likelihood) can be computed using the standard forward algorithm for HMMs.
40
![Page 41: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/41.jpg)
Decoding for the HMM Model
• Use the standard Viterbi algorithm to efficiently compute the most likely alignment (i.e. most likely state sequence).
41
![Page 42: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/42.jpg)
Training Word Alignment Models
• Both the IBM model 1 and HMM model can be trained on a parallel corpus to set the required parameters.
• For supervised (hand-aligned) training data, parameters can be estimated directly using frequency counts.
• For unsupervised training data, EM can be used to estimate parameters, e.g. Baum-Welch for the HMM model.
![Page 43: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/43.jpg)
43
Sketch of EM Algorithm forWord Alignment
Randomly set model parameters. (making sure they represent legal distributions)Until converge (i.e. parameters no longer change) do: E Step: Compute the probability of all possible alignments of the training data using the current model. M Step: Use these alignment probability estimates to re-estimate values for all of the parameters.
Note: Use dynamic programming (as in Baum-Welch)to avoid explicitly enumerating all possible alignments
![Page 44: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/44.jpg)
Sample EM Trace for Alignment(IBM Model 1 with no NULL Generation)
green housecasa verde
the housela casa
TrainingCorpus
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
green
house
the
verde casa la
TranslationProbabilities
Assume uniforminitial probabilities
green housecasa verde
green housecasa verde
the housela casa
the housela casa
ComputeAlignmentProbabilitiesP(A, F | E) 1/3 X 1/3 = 1/9 1/3 X 1/3 = 1/9 1/3 X 1/3 = 1/9 1/3 X 1/3 = 1/9
Normalize to getP(A | F, E) 2
1
9/2
9/1
2
1
9/2
9/1 2
1
9/2
9/1
2
1
9/2
9/1
![Page 45: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/45.jpg)
Example cont.
green housecasa verde
green housecasa verde
the housela casa
the housela casa
1/2 1/2 1/2 1/2
Compute weighted translation counts
1/2 1/2 0
1/2 1/2 + 1/2 1/2
0 1/2 1/2
green
house
the
verde casa la
Normalizerows to sum to one to estimate P(f | e)
1/2 1/2 0
1/4 1/2 1/4
0 1/2 1/2
green
house
the
verde casa la
![Page 46: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/46.jpg)
Example cont.
green housecasa verde
green housecasa verde
the housela casa
the housela casa
1/2 X 1/4=1/8
1/2 1/2 0
1/4 1/2 1/4
0 1/2 1/2
green
house
the
verde casa la
RecomputeAlignmentProbabilitiesP(A, F | E) 1/2 X 1/2=1/4 1/2 X 1/2=1/4 1/2 X 1/4=1/8
Normalize to getP(A | F, E) 3
1
8/3
8/1
3
2
8/3
4/1
3
2
8/3
4/1
3
1
8/3
8/1
Continue EM iterations until translation parameters converge
TranslationProbabilities
![Page 47: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/47.jpg)
Phrase Alignments fromWord Alignments
• Phrase-based approaches to MT have been shown to be better than word-based models.
• However, alignment algorithms produce one to many word translations rather than many to many phrase translations.
• Combine E→F and F →E word alignments to produce a phrase alignment.
47
![Page 48: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/48.jpg)
Phrase Alignment Example
48
Maria no dio una bofetada a la bruja verde
Mary XXXX
did XX
not XX
slap XXXXXX
the XX
green XXXX
witch XXXXX
Spanish to English
![Page 49: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/49.jpg)
Phrase Alignment Example
49
Maria no dio una bofetada a la bruja verde
Mary XXXX
did XX
not XX
slap XXX XXX XXXXXX
the XX
green XXXX
witch XXXXX
English to Spanish
![Page 50: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/50.jpg)
Phrase Alignment Example
50
Maria no dio una bofetada a la bruja verde
Mary XXXX
did
not XX
slap XXXXXX
the XX
green XXXX
witch XXXXX
Intersection
![Page 51: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/51.jpg)
Phrase Alignment Example
51
Maria no dio una bofetada a la bruja verde
Mary XXXX
did XX
not XX
slap XXX XXX XXXXXX
the XX XX
green XXXX
witch XXXXX
Add alignments from union to intersectionto produce a consistent phrase alignment
![Page 52: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/52.jpg)
Decoding
• Goal is to find a translation that maximizes the product of the translation and language models.
52
)()|(argmax EPEFPEnglishE
• Cannot explicitly enumerate and test the combinatorial space of all possible translations.
• Must efficiently (heuristically) search the space of translations that approximates the solution to this difficult optimization problem.
• The optimal decoding problem for all reasonable model’s (e.g. IBM model 1) is NP-complete.
![Page 53: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/53.jpg)
Space of Translations
• The phrase translation table from phrase alignments defines a space of all possible translations.
53
Maria no dio una bofetada a la bruja verde
Mary not give a slap to the witch green
did not a slap green witch
no slap
to the
to
did not give the
slap the witch
![Page 54: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/54.jpg)
Pharaoh
• We describe a phrase-based decoder based on that of Koehn’s (2004) Pharaoh system.
• Code for Pharaoh is freely available for research purposes.
54
![Page 55: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/55.jpg)
Stack Decoding
• Use a version of heuristic A* search to explore the space of phrase translations to find the best scoring subset that covers the source sentence.
55
Initialize priority queue Q (stack) to empty translation.Loop: s = pop(Q) If h is a complete translation, exit loop and return it. For each refinement s´ of s created by adding a phrase translation Compute score f(s´) Add s´ to Q Sort Q by score f
![Page 56: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/56.jpg)
Search Heuristic
• A* is best-first search using the function f to sort the search queue:– f(s) = g(s) + h(s)– g(s): Cost of existing partial solution – h(s): Estimated cost of completion of solution
• If h(s) is an underestimate of the true remaining cost (admissible heuristic) then A* is guaranteed to return an optimal solution.
56
![Page 57: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/57.jpg)
Current Cost: g(s)
• Known quality of partial translation, E, composed of a set of chosen phrase translations S based on phrase translation and language models.
57
)()(),(
1log)(
EPidef
sg
Siii
![Page 58: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/58.jpg)
Estimated Future Cost: h(s)
• True future cost requires knowing the way of translating the remainder of the sentence in a way that maximizes the probability of the final translation.
• However, this is not computationally tractable.• Therefore under-estimate the cost of remaining
translation by ignoring the distortion component and computing the most probable remaining translation ignoring distortion (which is efficiently computable using the Viterbi algorithm)
58
![Page 59: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/59.jpg)
Beam Search
• However, Q grows too large to be efficient and guarantee an optimal result with full A* search.
• Therefore, always cut Q back to only the best (lowest cost) K items to approximate the best translation
59
Initialize priority queue Q (stack) to empty translation.Loop: If top item on Q is a complete translation, exit loop and return it. For each element s of Q do For each refinement s´ of s created by adding a phrase translation Compute score f(s´) Add s´ to Q Sort Q by score f Prune Q back to only the first (lowest cost) K items
![Page 60: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/60.jpg)
Multistack Decoding
• It is difficult to compare translations that cover different fractions of the foreign sentence, so maintain multiple priority queues (stacks), one for each number of foreign words currently translated.
• Finally, return best scoring translation in the queue of translations that cover all of the words in F.
60
![Page 61: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/61.jpg)
Evaluating MT
• Human subjective evaluation is the best but is time-consuming and expensive.
• Automated evaluation comparing the output to multiple human reference translations is cheaper and correlates with human judgements.
61
![Page 62: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/62.jpg)
Human Evaluation of MT
• Ask humans to estimate MT output on several dimensions.– Fluency: Is the result grammatical, understandable, and
readable in the target language.
– Fidelity: Does the result correctly convey the information in the original source language.
• Adequacy: Human judgment on a fixed scale. – Bilingual judges given source and target language.
– Monolingual judges given reference translation and MT result.
• Informativeness: Monolingual judges must answer questions about the source sentence given only the MT translation (task-based evaluation).
62
![Page 63: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/63.jpg)
Computer-Aided Translation Evaluation
• Edit cost: Measure the number of changes that a human translator must make to correct the MT output.– Number of words changed– Amount of time taken to edit– Number of keystrokes needed to edit
63
![Page 64: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/64.jpg)
Automatic Evaluation of MT
• Collect one or more human reference translations of the source.
• Compare MT output to these reference translations.
• Score result based on similarity to the reference translations.– BLEU– NIST– TER– METEOR
64
![Page 65: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/65.jpg)
BLEU
• Determine number of n-grams of various sizes that the MT output shares with the reference translations.
• Compute a modified precision measure of the n-grams in MT result.
65
![Page 66: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/66.jpg)
BLEU Example
66
Cand 1: Mary no slap the witch greenCand 2: Mary did not give a smack to a green witch.
Ref 1: Mary did not slap the green witch.Ref 2: Mary did not smack the green witch.Ref 3: Mary did not hit a green sorceress.
Cand 1 Unigram Precision: 5/6
![Page 67: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/67.jpg)
BLEU Example
67
Cand 1 Bigram Precision: 1/5
Cand 1: Mary no slap the witch green.Cand 2: Mary did not give a smack to a green witch.
Ref 1: Mary did not slap the green witch.Ref 2: Mary did not smack the green witch.Ref 3: Mary did not hit a green sorceress.
![Page 68: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/68.jpg)
BLEU Example
68
Clip match count of each n-gram to maximumcount of the n-gram in any single referencetranslation
Ref 1: Mary did not slap the green witch.Ref 2: Mary did not smack the green witch.Ref 3: Mary did not hit a green sorceress.
Cand 1: Mary no slap the witch green.Cand 2: Mary did not give a smack to a green witch.
Cand 2 Unigram Precision: 7/10
![Page 69: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/69.jpg)
BLEU Example
69
Ref 1: Mary did not slap the green witch.Ref 2: Mary did not smack the green witch.Ref 3: Mary did not hit a green sorceress.
Cand 2 Bigram Precision: 4/9
Cand 1: Mary no slap the witch green.Cand 2: Mary did not give a smack to a green witch.
![Page 70: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/70.jpg)
Modified N-Gram Precision
• Average n-gram precision over all n-grams up to size N (typically 4) using geometric mean.
70
corpusC C
corpusC Cnp
gramn
gramnclip
gram)n(count
gram)n(countN
N
nnpp
1
408.05
1
6
52 pCand 1:
Cand 2: 558.09
4
10
72 p
![Page 71: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/71.jpg)
Brevity Penalty
• Not easy to compute recall to complement precision since there are multiple alternative gold-standard references and don’t need to match all of them.
• Instead, use a penalty for translations that are shorter than the reference translations.
• Define effective reference length, r, for each sentence as the length of the reference sentence with the largest number of n-gram matches. Let c be the candidate sentence length.
71
rce
rcBP cr if
if 1)/1(
![Page 72: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/72.jpg)
BLEU Score
• Final BLEU Score: BLEU = BP p Cand 1: Mary no slap the witch green.
Best Ref: Mary did not slap the green witch.
Cand 2: Mary did not give a smack to a green witch.
Best Ref: Mary did not smack the green witch.
72
846.0 ,7 ,6 )6/71( eBPrc345.0408.0846.0 BLEU
1 ,7 ,10 BPrc
558.0558.01 BLEU
![Page 73: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/73.jpg)
BLEU Score Issues
• BLEU has been shown to correlate with human evaluation when comparing outputs from different SMT systems.
• However, it is does not correlate with human judgments when comparing SMT systems with manually developed MT (Systran) or MT with human translations.
• Other MT evaluation metrics have been proposed that claim to overcome some of the limitations of BLEU.
73
![Page 74: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/74.jpg)
Syntax-Based Statistical Machine Translation
• Recent SMT methods have adopted a syntactic transfer approach.
• Improved results demonstrated for translating between more distant language pairs, e.g. Chinese/English.
74
![Page 75: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/75.jpg)
75
Synchronous Grammar
• Multiple parse trees in a single derivation.• Used by (Chiang, 2005; Galley et al., 2006).
• Describes the hierarchical structures of a sentence and its translation, and also the correspondence between their sub-parts.
![Page 76: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/76.jpg)
•76
X X 是甚麼 / What is X
Chinese: English:
Synchronous Productions
• Has two RHSs, one for each language.
![Page 77: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/77.jpg)
•77
Syntax-Based MT Example
Input: 俄亥俄州的首府是甚麼?
![Page 78: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/78.jpg)
•78
Syntax-Based MT Example
XX
Input: 俄亥俄州的首府是甚麼?
![Page 79: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/79.jpg)
•79
Syntax-Based MT Example
What is X
XX
X 是甚麼
Input: 俄亥俄州的首府是甚麼? X X 是甚麼 / What is X
![Page 80: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/80.jpg)
•80
Syntax-Based MT Example
X 首府
What is X
the capital X
XX
X 是甚麼
Input: 俄亥俄州的首府是甚麼? X X 首府 / the capital X
![Page 81: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/81.jpg)
•81
Syntax-Based MT Example
X 首府
What is X
the capital X
of X
XX
X 是甚麼
X 的
Input: 俄亥俄州的首府是甚麼? X X 的 / of X
![Page 82: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/82.jpg)
•82
Syntax-Based MT Example
X 首府
What is X
the capital X
of X
Ohio
X
俄亥俄州
X
X 是甚麼
X 的
Input: 俄亥俄州的首府是甚麼? X 俄亥俄州 / Ohio
![Page 83: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/83.jpg)
•83
Syntax-Based MT Example
X 首府
What is X
the capital X
of X
Ohio
X
俄亥俄州
X
X 是甚麼
X 的
Input: 俄亥俄州的首府是甚麼? Output: What is the capital of Ohio?
![Page 84: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/84.jpg)
Synchronous Derivationsand Translation Model
• Need to make a probabilistic version of synchronous grammars to create a translation model for P(F | E).
• Each synchronous production rule is given a weight λi that is used in a maximum-entropy (log linear) model.
• Parameters are learned to maximize the conditional log-likelihood of the training data.
84
![Page 85: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/85.jpg)
Minimum Error Rate Training
• Noisy channel model is not trained to directly minimize the final MT evaluation metric, e.g. BLEU.
• A max-ent (log-linear) model can be trained to directly minimize the final evaluation metric on the training corpus by using various features of a translation.– Language model: P(E)– Translation mode: P(F | E)– Reverse translation model: P(E | F)
85
![Page 86: 111 CS 388: Natural Language Processing Machine Transla tion Raymond J. Mooney University of Texas at Austin](https://reader030.vdocuments.mx/reader030/viewer/2022032705/56649d965503460f94a7f0a0/html5/thumbnails/86.jpg)
Conclusions
• MT methods can usefully exploit various amounts of syntactic and semantic processing along the Vauquois triangle.
• Statistical MT methods can automatically learn a translation system from a parallel corpus.
• Typically use a noisy-channel model to exploit both a bilingual translation model and a monolingual language model.
• Automatic word alignment methods can learn a translation lexicon from a parallel corpus.
• Phrase-based and syntax based SMT methods are currently the state-of-the-art.
86