DESCRIPTION

Statistical NLP Spring 2011. Lecture 7: Phrase-Based MT. Dan Klein – UC Berkeley. Transcript of the lecture slides. Topics: Machine Translation examples; levels of transfer; word-level MT; phrase-based MT; word alignment; decoding.
Statistical NLP Spring 2011
Lecture 7: Phrase-Based MT
Dan Klein – UC Berkeley
Machine Translation: Examples
Levels of Transfer
Word-Level MT: Examples
- la politique de la haine . (Foreign Original)
  politics of hate . (Reference Translation)
  the policy of the hatred . (IBM4+N-grams+Stack)
- nous avons signé le protocole . (Foreign Original)
  we did sign the memorandum of agreement . (Reference Translation)
  we have signed the protocol . (IBM4+N-grams+Stack)
- où était le plan solide ? (Foreign Original)
  but where was the solid plan ? (Reference Translation)
  where was the economic base ? (IBM4+N-grams+Stack)
Phrasal / Syntactic MT: Examples
MT: Evaluation

- Human evaluations: subjective measures, fluency/adequacy
- Automatic measures: n-gram match to references
  - NIST measure: n-gram recall (worked poorly)
  - BLEU: n-gram precision (no one really likes it, but everyone uses it)
- BLEU:
  - P1 = unigram precision
  - P2, P3, P4 = bi-, tri-, 4-gram precision
  - Weighted geometric mean of P1-4
  - Brevity penalty (why?)
  - Somewhat hard to game…
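The BLEU recipe above can be sketched in a few lines. This is a simplified sentence-level version (uniform weights, a single reference, ad-hoc smoothing of zero n-gram counts), not the exact NIST implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Clipped n-gram precisions P1..P4, geometric mean, brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipping
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: precision alone rewards dropping words, so
    # candidates shorter than the reference are penalized.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean
```

A perfect match scores 1.0; a short candidate keeps high low-order precision but is crushed by the brevity penalty, which is exactly why the penalty exists.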
Automatic Metrics Work (?)
Corpus-Based MT
Modeling correspondences between languages
Sentence-aligned parallel corpus:
- Yo lo haré mañana → I will do it tomorrow
- Hasta pronto → See you soon
- Hasta pronto → See you around

Novel sentence: Yo lo haré pronto

Machine translation system (model of translation) proposes:

- I will do it soon
- I will do it around
- See you tomorrow
Phrase-Based Systems
Sentence-aligned corpus
Word alignments

Phrase table (translation model):

cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
Many slides and examples from Philipp Koehn or John DeNero
Phrase-Based Decoding
这 7 人 中包括 来自 法国 和 俄罗斯 的 宇航 员 . (The 7 people include astronauts coming from France and Russia.)
Decoder design is important: [Koehn et al. 03]
The Pharaoh “Model”
[Koehn et al, 2003]
Segmentation Translation Distortion
The Pharaoh “Model”
Where do we get these counts?
Phrase Weights
Phrase-Based Decoding
Monotonic Word Translation
- Cost is LM * TM; it's an HMM?
  - Transitions (LM): P(e | e-1, e-2)
  - Emissions (TM): P(f | e)
- State includes:
  - Exposed English
  - Position in foreign
- Dynamic program loop?
(Figure: translation options such as a → to (0.8), a → by (0.1); phrase scores "a slap to" 0.02, "a slap by" 0.01; and hypothesis scores [… a slap, 5] 0.00001, [… slap to, 6] 0.00000016, [… slap by, 6] 0.00000001.)
```
for (fPosition in 1 … |f|)
  for (eContext in allEContexts)
    for (eOption in translations[fPosition])
      score = scores[fPosition-1][eContext]
              * LM(eContext + eOption)
              * TM(eOption, fWord[fPosition])
      scores[fPosition][eContext[2] + eOption] =max score
```
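The loop above in runnable form, simplified to a bigram LM so the state is just the last English word rather than two. The toy tm (f → {e: prob}) and lm ((prev, next) → prob) tables are invented for illustration:

```python
from collections import defaultdict

def monotone_decode(f_words, tm, lm, start="<s>"):
    """Viterbi over (foreign position, last English word): extend every
    surviving context with every translation option, scoring
    LM(option | context) * TM(option | f_word)."""
    # scores[context] = (best probability, English prefix achieving it)
    scores = {start: (1.0, [])}
    for f in f_words:
        nxt = defaultdict(lambda: (0.0, None))
        for ctx, (p, prefix) in scores.items():
            for e, t_prob in tm[f].items():
                s = p * lm.get((ctx, e), 1e-6) * t_prob  # floor unseen bigrams
                if s > nxt[e][0]:
                    nxt[e] = (s, prefix + [e])           # the "=max" update
        scores = dict(nxt)
    best_ctx = max(scores, key=lambda c: scores[c][0])
    return scores[best_ctx][1]
```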
Beam Decoding

- For real MT models, this kind of dynamic program is a disaster (why?)
- Standard solution is beam search: for each position, keep track of only the best k hypotheses
- Still pretty slow… why?
- Useful trick: cube pruning (Chiang 2005)
```
for (fPosition in 1 … |f|)
  for (eContext in bestEContexts[fPosition])
    for (eOption in translations[fPosition])
      score = scores[fPosition-1][eContext]
              * LM(eContext + eOption)
              * TM(eOption, fWord[fPosition])
      bestEContexts.maybeAdd(eContext[2] + eOption, score)
```
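A beam version of the same idea: identical scoring, but only the k best hypotheses survive each foreign position. This sketch skips hypothesis recombination, which a real decoder would add; the tm/lm table formats are the same invented toy ones as before:

```python
import heapq

def beam_decode(f_words, tm, lm, k=2, start="<s>"):
    """Monotone beam search with a bigram LM; keeps the k highest-scoring
    hypotheses per position instead of all contexts."""
    beam = [(1.0, start, [])]  # (prob, last English word, English prefix)
    for f in f_words:
        candidates = []
        for p, ctx, prefix in beam:
            for e, t_prob in tm[f].items():
                s = p * lm.get((ctx, e), 1e-6) * t_prob
                candidates.append((s, e, prefix + [e]))
        beam = heapq.nlargest(k, candidates, key=lambda h: h[0])  # prune to k
    return max(beam, key=lambda h: h[0])[2]
```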
Example from David Chiang
Phrase Translation
If monotonic, almost an HMM; technically a semi-HMM
If distortion… now what?
```
for (fPosition in 1 … |f|)
  for (lastPosition < fPosition)
    for (eContext in eContexts)
      for (eOption in translations[fPosition])
        … combine hypothesis for (lastPosition ending in eContext) with eOption
```
Non-Monotonic Phrasal MT
Pruning: Beams + Forward Costs
- Problem: easy partial analyses are cheaper
- Solution 1: use beams per foreign subset
- Solution 2: estimate forward costs (A*-like)
The Pharaoh Decoder
Hypothesis Lattices
Word Alignment
Word Alignment
What is the anticipated cost of collecting fees under the new proposal?
En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
(Figure: word alignment grid between the English sentence and the tokenized French "En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?")
Unsupervised Word Alignment

- Input: a bitext: pairs of translated sentences
- Output: alignments: pairs of translated words
- When words have unique sources, can represent as a (forward) alignment function a from French to English positions

Example:
nous acceptons votre opinion .
we accept your view .
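For the example pair above, that forward alignment function is just a list of English positions, one per French word. This small helper is hypothetical (not from the slides) and uses 0 for NULL:

```python
def aligned_pairs(f_words, e_words, a):
    """a[j] is the English position (1-indexed; 0 = NULL) that is the
    source of French word j+1; return the aligned word pairs."""
    return [(f, e_words[a_j - 1]) for f, a_j in zip(f_words, a) if a_j > 0]
```

Here the sentences align word for word, so a = [1, 2, 3, 4, 5] recovers (nous, we), (acceptons, accept), and so on.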
1-to-Many Alignments
Many-to-Many Alignments
IBM Model 1 (Brown 93)

Alignments: a hidden vector called an alignment specifies which English source is responsible for each French target word.
IBM Models 1/2

E: Thank(1) you(2) ,(3) I(4) shall(5) do(6) so(7) gladly(8) .(9)
A: 1 3 7 6 8 8 8 8 9
F: Gracias , lo haré de muy buen grado .

Model parameters:
Transitions: P( A2 = 3 )
Emissions: P( F1 = Gracias | E_A1 = Thank )
Evaluating TMs
How do we measure quality of a word-to-word model?
- Method 1: use in an end-to-end translation system
  - Hard to measure translation quality
  - Option: human judges
  - Option: reference translations (NIST, BLEU)
  - Option: combinations (HTER)
  - Actually, no one uses word-to-word models alone as TMs
- Method 2: measure quality of the alignments produced
  - Easy to measure
  - Hard to know what the gold alignments should be
  - Often does not correlate well with translation quality (like perplexity in LMs)
Alignment Error Rate

- Sure alignments: S
- Possible alignments: P (with S ⊆ P)
- Predicted alignments: A

AER(A; S, P) = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
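The AER definition as code, with each alignment represented as a set of (French position, English position) pairs; this assumes the usual convention that sure links are a subset of possible links:

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate of Och and Ney: 0 is perfect.
    S = sure links, P = possible links (S ⊆ P), A = predicted links."""
    a, s, p = set(predicted), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

Predicting exactly the sure links gives AER 0; predicting a possible-but-not-sure link costs less than predicting a link outside P entirely.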
Problems with Model 1
There’s a reason they designed Models 2-5!

- Problems: alignments jump around, align everything to rare words
- Experimental setup:
  - Training data: 1.1M sentences of French-English text, Canadian Hansards
  - Evaluation metric: Alignment Error Rate (AER)
  - Evaluation data: 447 hand-aligned sentences
Intersected Model 1

- Post-intersection: standard practice to train models in each direction, then intersect their predictions [Och and Ney, 03]
- Second model is basically a filter on the first
  - Precision jumps, recall drops
  - End up not guessing hard alignments

| Model       | P/R   | AER  |
|-------------|-------|------|
| Model 1 EF  | 82/58 | 30.6 |
| Model 1 FE  | 85/58 | 28.7 |
| Model 1 AND | 96/46 | 34.8 |
Joint Training?
- Overall:
  - Similar high precision to post-intersection
  - But recall is much higher
  - More confident about positing non-null alignments

| Model       | P/R   | AER  |
|-------------|-------|------|
| Model 1 EF  | 82/58 | 30.6 |
| Model 1 FE  | 85/58 | 28.7 |
| Model 1 AND | 96/46 | 34.8 |
| Model 1 INT | 93/69 | 19.5 |
Monotonic Translation
Le Japon secoué par deux nouveaux séismes
Japan shaken by two new quakes
Local Order Change
Le Japon est au confluent de quatre plaques tectoniques
Japan is at the junction of four tectonic plates
IBM Model 2

- Alignments tend to the diagonal (broadly at least)
- Other schemes for biasing alignments towards the diagonal:
  - Relative vs. absolute alignment
  - Asymmetric distances
  - Learning a full multinomial over distances
EM for Models 1/2
Model parameters:
- Translation probabilities (Models 1 and 2)
- Distortion parameters (Model 2 only)

Procedure:
- Start with uniform parameters
- For each sentence:
  - For each French position j:
    - Calculate the posterior over English positions (or just use the best single alignment)
    - Increment the count of word fj with word ei by these amounts
  - Also re-estimate distortion probabilities for Model 2
- Iterate until convergence
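The procedure above for Model 1 only (no distortion), as runnable code. The NULL token and the tiny bitext in the usage note are illustrative:

```python
from collections import defaultdict

def model1_em(bitext, iterations=10):
    """EM for IBM Model 1. t maps (f, e) to P(f | e); starts uniform.
    E-step: posterior over English sources (including NULL) per French word.
    M-step: renormalize expected counts. bitext = [(f_words, e_words), ...]."""
    t = defaultdict(lambda: 1.0)  # any constant works as a uniform start
    for _ in range(iterations):
        counts = defaultdict(float)   # expected pair counts c(f, e)
        totals = defaultdict(float)   # expected source counts c(e)
        for f_sent, e_sent in bitext:
            e_opts = ["<NULL>"] + e_sent
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_opts)
                for e in e_opts:
                    post = t[(f, e)] / z  # posterior that e generated f
                    counts[(f, e)] += post
                    totals[e] += post
        t = defaultdict(float,
                        {(f, e): c / totals[e] for (f, e), c in counts.items()})
    return t
```

On a toy bitext like [("la maison", "the house"), ("la fleur", "the flower"), ("maison", "house")], the co-occurrence pattern alone pulls maison toward house and la toward the, which is exactly the point of the E/M loop.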
Example
Phrase Movement
Des tremblements de terre ont à nouveau touché le Japon jeudi 4 novembre.
On Tuesday Nov. 4, earthquakes rocked Japan once again
The HMM Model

E: Thank(1) you(2) ,(3) I(4) shall(5) do(6) so(7) gladly(8) .(9)
A: 1 3 7 6 8 8 8 8 9
F: Gracias , lo haré de muy buen grado .

Model parameters:
Transitions: P( A2 = 3 | A1 = 1 )
Emissions: P( F1 = Gracias | E_A1 = Thank )
The HMM Model
- Model 2 preferred global monotonicity
- We want local monotonicity: most jumps are small
- HMM model (Vogel 96): transitions condition on the jump size (…, -2, -1, 0, 1, 2, 3, …)
- Re-estimate using the forward-backward algorithm
- Handling nulls requires some care
- What are we still missing?
HMM Examples
AER for HMMs
Model AER
Model 1 INT 19.5
HMM EF 11.4
HMM FE 10.8
HMM AND 7.1
HMM INT 4.7
GIZA M4 AND 6.9
IBM Models 3/4/5

Generative story [from Al-Onaizan and Knight, 1998]:

Mary did not slap the green witch
→ fertility, n(3 | slap):
Mary not slap slap slap the green witch
→ NULL insertion, P(NULL):
Mary not slap slap slap NULL the green witch
→ translation, t(la | the):
Mary no daba una bofetada a la verde bruja
→ distortion, d(j | i):
Mary no daba una bofetada a la bruja verde
Examples: Translation and Fertility
Example: Idioms
il hoche la tête
he is nodding
Example: Morphology
Some Results [Och and Ney 03]
Decoding
- In these word-to-word models:
  - Finding best alignments is easy
  - Finding translations is hard (why?)
Bag “Generation” (Decoding)
Bag Generation as a TSP

- Imagine bag generation with a bigram LM
- Words are nodes (e.g. it, is, not, clear, .)
- Edge weights are P(w | w')
- Valid sentences are Hamiltonian paths
- Not the best news for word-based MT!
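The TSP view in miniature: brute-force bag generation under a bigram LM over the slide's example words. Exact but O(n!), which is exactly why this is bad news. The probabilities are made up; unseen bigrams get a small floor:

```python
import math
from itertools import permutations

def bag_generate(bag, lm, start="<s>", end="</s>"):
    """Score every ordering of the bag (every Hamiltonian path through
    the word graph) under a bigram LM and return the best one."""
    def score(order):
        seq = [start] + list(order) + [end]
        return sum(math.log(lm.get((a, b), 1e-9))  # floor unseen bigrams
                   for a, b in zip(seq, seq[1:]))
    return max(permutations(bag), key=score)
```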
IBM Decoding as a TSP
Phrase Weights
Phrase Scoring
(Figure: word-aligned sentence pair "les chats aiment le poisson frais ." / "cats like fresh fish .")

- Learning weights has been tried, several times: [Marcu and Wong, 02], [DeNero et al, 06], and others
- Seems not to work well, for a variety of partially understood reasons
- Main issue: big chunks get all the weight; obvious priors don't help
- Though, see [DeNero et al, 08]
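Since learned weights work poorly, standard phrase-based systems instead score phrase pairs by relative frequency over the pairs extracted from the word-aligned corpus. A minimal sketch, with phrase extraction itself assumed already done:

```python
from collections import Counter

def phrase_scores(extracted):
    """Relative-frequency phrase scores: phi(f | e) = count(e, f) / count(e),
    where `extracted` is a list of (english_phrase, foreign_phrase) pairs."""
    pair_counts = Counter(extracted)
    e_counts = Counter(e for e, _ in extracted)
    return {(e, f): c / e_counts[e] for (e, f), c in pair_counts.items()}
```

For instance, if "the cat" is extracted three times, twice aligned to "le chat" and once to "chat", then phi(le chat | the cat) = 2/3.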
![Page 57: Statistical NLP Spring 2011](https://reader035.vdocuments.mx/reader035/viewer/2022081506/56813e14550346895da7f600/html5/thumbnails/57.jpg)
Phrase Size
- Phrases do help
- But they don't need to be long
- Why should this be?
Lexical Weighting