Lexical Translation Models II
January 29, 2013
Tuesday, February 19, 13
Last Time ...

$$p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Translation}, \text{Alignment})$$

$$= \sum_{\text{Alignment}} p(\text{Translation} \mid \text{Alignment}) \times p(\text{Alignment})$$

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \underbrace{p(a \mid f, m)}_{\text{alignment}} \times \underbrace{\prod_{i=1}^{m} p(e_i \mid f_{a_i})}_{\text{translation}}$$
$$p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

Some alternative factorizations of the translation component:

$$\prod_{i=1}^{m} p(e_i \mid f_{a_i}, f_{a_{i-1}}) \qquad \prod_{i=1}^{m} p(e_i \mid f_{a_i}, e_{i-1}) \qquad \prod_{i=1}^{m} p(e_i, e_{i+1} \mid f_{a_i})$$

What is the problem here?
![Page 8: Lexical Translation Models 1Idemo.clab.cs.cmu.edu/sp2013-11731/slides/05.lexical2.pdf · •Model alignment with an absolute position distribution • Probability of translating a](https://reader033.vdocuments.mx/reader033/viewer/2022050109/5f46aef61ea42e193b6ecf10/html5/thumbnails/8.jpg)
p(e | f,m) =X
a2[0,n]m
p(a | f,m)⇥mY
i=1
p(ei | fai)
=X
a2[0,n]m
mY
i=1
1
1 + n| {z }p(a|f,m)
⇥mY
i=1
p(ei | fai)
Tuesday, February 19, 13
![Page 9: Lexical Translation Models 1Idemo.clab.cs.cmu.edu/sp2013-11731/slides/05.lexical2.pdf · •Model alignment with an absolute position distribution • Probability of translating a](https://reader033.vdocuments.mx/reader033/viewer/2022050109/5f46aef61ea42e193b6ecf10/html5/thumbnails/9.jpg)
p(e | f,m) =X
a2[0,n]m
p(a | f,m)⇥mY
i=1
p(ei | fai)
=X
a2[0,n]m
mY
i=1
1
1 + n| {z }p(a|f,m)
⇥mY
i=1
p(ei | fai)
=X
a2[0,n]m
mY
i=1
1
1 + np(ei | fai)
Tuesday, February 19, 13
![Page 10: Lexical Translation Models 1Idemo.clab.cs.cmu.edu/sp2013-11731/slides/05.lexical2.pdf · •Model alignment with an absolute position distribution • Probability of translating a](https://reader033.vdocuments.mx/reader033/viewer/2022050109/5f46aef61ea42e193b6ecf10/html5/thumbnails/10.jpg)
p(e | f,m) =X
a2[0,n]m
p(a | f,m)⇥mY
i=1
p(ei | fai)
=X
a2[0,n]m
mY
i=1
1
1 + n| {z }p(a|f,m)
⇥mY
i=1
p(ei | fai)
=X
a2[0,n]m
mY
i=1
1
1 + np(ei | fai)
=X
a2[0,n]m
mY
i=1
p(ai)⇥ p(ei | fai)
Tuesday, February 19, 13
![Page 11: Lexical Translation Models 1Idemo.clab.cs.cmu.edu/sp2013-11731/slides/05.lexical2.pdf · •Model alignment with an absolute position distribution • Probability of translating a](https://reader033.vdocuments.mx/reader033/viewer/2022050109/5f46aef61ea42e193b6ecf10/html5/thumbnails/11.jpg)
p(e | f,m) =X
a2[0,n]m
p(a | f,m)⇥mY
i=1
p(ei | fai)
=X
a2[0,n]m
mY
i=1
1
1 + n| {z }p(a|f,m)
⇥mY
i=1
p(ei | fai)
=X
a2[0,n]m
mY
i=1
1
1 + np(ei | fai)
=X
a2[0,n]m
mY
i=1
p(ai)⇥ p(ei | fai)
Can we do something better here?
Tuesday, February 19, 13
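Because each target position chooses its alignment independently and uniformly, the sum over all $(1+n)^m$ alignments factorizes into a product of $m$ small sums, which is why Model 1 inference is cheap. A minimal sketch, assuming a hypothetical translation table `t` mapping `(target_word, source_word)` pairs to probabilities:

```python
def model1_likelihood(e, f, t):
    """p(e | f, m) under Model 1 with uniform alignment probability 1/(1+n).

    Exploits the factorization:
      sum_a prod_i 1/(1+n) t(e_i | f_{a_i}) = prod_i sum_j 1/(1+n) t(e_i | f_j)
    """
    src = ["<null>"] + f          # position 0 is the NULL word
    n = len(f)
    p = 1.0
    for e_i in e:
        p *= sum(t.get((e_i, f_j), 0.0) for f_j in src) / (1 + n)
    return p
```

Summing explicitly over all $(1+n)^m$ alignment vectors gives the same value, but in exponential rather than $O(mn)$ time.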
$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i) \times p(e_i \mid f_{a_i})$$

Model 2:

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \times p(e_i \mid f_{a_i})$$
Model 2

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \times p(e_i \mid f_{a_i})$$

• Model alignment with an absolute position distribution $p(a_i \mid i, m, n)$
• Probability of translating the foreign word at position $a_i$ to generate the word at position $i$ (with target length $m$ and source length $n$)
• EM training of this model is almost the same as with Model 1 (the same conditional independencies hold)
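Because the alignment posterior still factorizes across target positions under Model 2, the E-step only needs, for each target position $i$, the normalized product of the position probability and the translation probability. A sketch under that assumption, where `t` (translation table) and `a_prob` (position distribution) are hypothetical parameter stores; NULL alignment is omitted for brevity:

```python
def model2_posteriors(e, f, t, a_prob):
    """post[i-1][j-1] = p(a_i = j | e, f) for j = 1..n (NULL omitted)."""
    m, n = len(e), len(f)
    post = []
    for i, e_i in enumerate(e, start=1):
        # unnormalized: p(a_i = j | i, m, n) * p(e_i | f_j)
        scores = [a_prob(j, i, m, n) * t.get((e_i, f[j - 1]), 0.0)
                  for j in range(1, n + 1)]
        z = sum(scores)
        post.append([s / z for s in scores])
    return post
```

These per-position posteriors are exactly the expected counts accumulated during EM training.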
Model 2

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \times p(e_i \mid f_{a_i})$$

Example: "natürlich ist das haus klein" → "of course the house is small" (note that "natürlich" generates both "of" and "course").
Model 2

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \times p(e_i \mid f_{a_i})$$

• Pros
• Non-uniform alignment model
• Fast EM training / marginal inference
• Cons
• Absolute position is very naive
• How many parameters are needed to model $p(a_i \mid i, m, n)$?
Model 2

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \times p(e_i \mid f_{a_i})$$

How much do we know when we only know the source & target lengths and the current position?
Model 2

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \times p(e_i \mid f_{a_i})$$

[Figure: alignment grid with source positions $j' = 1, \dots, 5$ plus null ($n = 5$) and target positions $i = 1, \dots, 6$ ($m = 6$)]

How much do we know when we only know the source & target lengths and the current position? How many parameters do we need to model this?

One answer: score each source position by how far it is, proportionally, from the current target position,

$$h(j, i, m, n) = -\left| \frac{i}{m} - \frac{j}{n} \right|$$

where $i$ is the position in the target, $j$ the position in the source, $m$ the target length, and $n$ the source length. Normalizing gives

$$b(j \mid i, m, n) = \frac{\exp h(j, i, m, n)}{\sum_{j'} \exp h(j', i, m, n)}$$

$$p(a_i \mid i, m, n) = \begin{cases} p_0 & \text{if } a_i = 0 \\ (1 - p_0)\, b(a_i \mid i, m, n) & \text{otherwise} \end{cases}$$
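This parameterization can be sketched directly; `p0`, the NULL probability, is an arbitrary placeholder value here:

```python
import math

def b(j, i, m, n):
    """b(j | i, m, n): softmax of h(j, i, m, n) = -|i/m - j/n| over j' = 1..n."""
    def h(jp):
        return -abs(i / m - jp / n)
    z = sum(math.exp(h(jp)) for jp in range(1, n + 1))
    return math.exp(h(j)) / z

def p_align(a_i, i, m, n, p0=0.2):
    """p(a_i | i, m, n): NULL with probability p0, else (1 - p0) b(a_i | i, m, n)."""
    return p0 if a_i == 0 else (1 - p0) * b(a_i, i, m, n)
```

For the slide's example ($m = 6$, $n = 5$), the mass concentrates near the diagonal $j \approx i \cdot n / m$, and the whole grid is governed by a single hyperparameter rather than one parameter per $(j, i, m, n)$ cell.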
Words reorder in groups. Model this!
$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i) \times p(e_i \mid f_{a_i}) \quad \text{(Model 1)}$$

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid i, m, n) \times p(e_i \mid f_{a_i}) \quad \text{(Model 2)}$$

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \times p(e_i \mid f_{a_i}) \quad \text{(HMM)}$$
HMM

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \times p(e_i \mid f_{a_i})$$

• Insight: words translate in groups
• Condition on the previous alignment position: $p(a_i \mid a_{i-1})$
• Probability of translating the foreign word at position $a_i$ given that the previous position translated was $a_{i-1}$
• EM training of this model uses the forward-backward algorithm (dynamic programming)
HMM

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \times p(e_i \mid f_{a_i})$$

• Improvement: model "jumps" through the source sentence
• Relative position model rather than absolute position model

$$p(a_i \mid a_{i-1}) = j(a_i - a_{i-1})$$

Example jump distribution:

 Δ    j(Δ)
-4    0.0008
-3    0.0015
-2    0.08
-1    0.18
 0    0.0881
 1    0.4
 2    0.16
 3    0.064
 4    0.0256
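Marginal inference under the HMM uses the forward algorithm. A minimal sketch, ignoring NULL alignment; `t` is a hypothetical translation table, `jump` the jump distribution $j(\Delta)$, and `init` a hypothetical distribution over the first alignment position:

```python
def hmm_likelihood(e, f, t, jump, init):
    """p(e|f,m) = sum_a init(a_1) prod_{i>1} jump(a_i - a_{i-1}) prod_i t(e_i|f_{a_i})."""
    n = len(f)
    # alpha[j-1] = p(e_1 .. e_i, a_i = j), initialized for i = 1
    alpha = [init(j) * t.get((e[0], f[j - 1]), 0.0) for j in range(1, n + 1)]
    for e_i in e[1:]:
        alpha = [sum(alpha[jp - 1] * jump(j - jp) for jp in range(1, n + 1))
                 * t.get((e_i, f[j - 1]), 0.0)
                 for j in range(1, n + 1)]
    return sum(alpha)
```

The dynamic program costs $O(mn^2)$, versus $O((1+n)^m)$ for enumerating alignments; a brute-force sum over alignments gives the same value on small inputs.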
• Be careful! NULLs must be handled carefully. Here is one option (due to Och):

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \times p(e_i \mid f_{a_i})$$

$$p(a_i \mid a_{i-n_i}) = \begin{cases} p_0 & \text{if } a_i = 0 \\ (1 - p_0)\, j(a_i - a_{i-n_i}) & \text{otherwise} \end{cases}$$

where $i - n_i$ is the index of the first non-NULL-aligned word to the left of $i$.
• Other extensions: certain word types are more likely to be reordered

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \prod_{i=1}^{m} p(a_i \mid a_{i-1}) \times p(e_i \mid f_{a_i})$$

• $j(\Delta \mid f)$: condition the jump probability on the previous word translated
• $j(\Delta \mid f, e)$: condition the jump probability on the previous word translated, and how it was translated
• $j(\Delta \mid C(f))$ and $j(\Delta \mid A(f), B(e))$: condition on word classes rather than the words themselves
Fertility Models
• The models we have considered so far have been efficient
• This efficiency has come at a modeling cost:
• What is to stop the model from “translating” a word 0, 1, 2, or 100 times?
• We introduce fertility models to deal with this
IBM Model 3
Fertility

• Fertility: the number of English words generated by a foreign word
• Modeled by a categorical distribution $n(\phi \mid f)$
• Examples:

 φ    Unabhaengigkeitserklaerung    zum = (zu + dem)    Haus
 0    0.00008                       0.01                0.01
 1    0.1                           0                   0.92
 2    0.0002                        0.9                 0.07
 3    0.8                           0.0009              0
 4    0.009                         0.0001              0
 5    0                             0                   0
Fertility

• Fertility models mean that we can no longer exploit conditional independencies to write $p(a \mid f, m)$ as a series of local alignment decisions.
• How do we compute the statistics required for EM training?

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$
EM Recipe reminder

• If alignment points were visible, training fertility models would be easy
• We would _______ and ________
• But, alignments are not visible

$$n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\text{count}(3, \text{Unabhaengigkeitserklaerung})}{\text{count}(\text{Unabhaengigkeitserklaerung})}$$

Since alignments are hidden, we use expected counts instead:

$$n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\mathbb{E}[\text{count}(3, \text{Unabhaengigkeitserklaerung})]}{\mathbb{E}[\text{count}(\text{Unabhaengigkeitserklaerung})]}$$
Expectation & Fertility
• We need to compute expected counts under p(a | f,e,m)
• Unfortunately p(a | f,e,m) doesn’t factorize nicely. :(
• Can we sum exhaustively? How many different a’s are there?
• What to do?
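For the exhaustive-sum question: each of the $m$ target words picks one of the $n$ source words or NULL, so there are $(1+n)^m$ alignment vectors, which rules out enumeration for real sentence lengths. A quick check:

```python
def num_alignments(n, m):
    # each of the m target positions chooses among n source positions or NULL
    return (1 + n) ** m
```

Even the toy sentence pair above (n = 5, m = 6) has 46656 alignments, and the count grows exponentially in the target length.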
Sample Alignments

• Monte Carlo methods
• Gibbs sampling
• Importance sampling
• Particle filtering
• For historical reasons
• Use the Model 2 alignment to start (easy!)
• Weighted sum over all alignment configurations that are "close" to this alignment configuration
• Is this correct? No! Does it work? Sort of.
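Given a set of alignment configurations with approximately normalized posterior weights, as produced by the neighborhood heuristic above, the expected fertility counts are just weighted counts. A sketch with hypothetical inputs (the caller supplies the weighted alignments):

```python
from collections import Counter

def expected_fertility_counts(f, weighted_alignments):
    """weighted_alignments: iterable of (a, w), where a[i] in 0..n is the
    source position aligned to target word i+1 and w approximates
    p(a | e, f, m); the weights are assumed to sum to 1."""
    num = Counter()   # E[count(phi, f_j)]
    den = Counter()   # E[count(f_j)]
    for a, w in weighted_alignments:
        fert = Counter(a)                      # fertility of each source position
        for j, f_j in enumerate(f, start=1):
            num[(fert[j], f_j)] += w
            den[f_j] += w
    return num, den
```

The M-step then sets $n(\phi \mid f) \approx$ `num[(phi, f)] / den[f]`, mirroring the expected-count ratio on the previous slide.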
Pitfalls of Conditional Models

IBM Model 4 alignment / Our model's alignment

Figure 2: Example English-Urdu alignment under IBM Model 4 (left) and our discriminative model (right). Model 4 displays two characteristic errors: garbage collection and an overly strong monotonicity bias. Our model does not exhibit these problems and, in fact, makes no mistakes in the alignment.

…supervised setting. The contrastive estimation technique proposed by Smith and Eisner (2005) is globally normalized (and thus capable of dealing with arbitrary features), and closely related to the model we developed; however, they do not discuss the problem of word alignment. Berg-Kirkpatrick et al. (2010) learn locally normalized log-linear models in a generative setting. Globally normalized discriminative models with latent variables (Quattoni et al., 2004) have been used for a number of language processing problems, including MT (Dyer and Resnik, 2010; Blunsom et al., 2008a). However, this previous work relied on translation grammars constructed using standard generative word alignment processes.

7 Future Work

While we have demonstrated that this model can be substantially useful, it is limited in some important ways, which are being addressed in ongoing work. First, training is expensive, and we are exploring alternatives to the conditional likelihood objective that is currently used, such as the contrastive neighborhoods advocated by Smith and Eisner (2005). Additionally, there is much evidence that non-local features like source word fertility (cf. IBM Model 3) are useful for translation and alignment modeling. To be truly general, it must be possible to utilize such features. Unfortunately, features like this that depend on global properties of the alignment vector, a, make the inference problem NP-hard, and approximations are necessary. Fortunately, there is much recent work on approximate inference techniques for incorporating nonlocal features (Blunsom et al., 2008b; Gimpel and Smith, 2009; Cromieres and Kurohashi, 2009; Weiss and Taskar, 2010), suggesting that this problem too can be solved using established techniques.

8 Conclusion

We have introduced a globally normalized, log-linear lexical translation model that can be trained discriminatively using only parallel sentences, which we apply to the problem of word alignment. Our approach addresses two important shortcomings of previous work: (1) that local normalization of generative models constrains the features that can be used, and (2) that previous discriminatively trained word alignment models required supervised alignments. According to a variety of measures in a variety of translation tasks, this model produces superior alignments to generative approaches. Furthermore, the features learned by our model reveal interesting characteristics of the language pairs being modeled.

Acknowledgments

This work was supported in part by the DARPA GALE program; the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant num-
Lexical Translation

• IBM Models 1-5 [Brown et al., 1993]
• Model 1: lexical translation, uniform alignment
• Model 2: absolute position model
• Model 3: fertility
• Model 4: relative position model (jumps in target string)
• Model 5: non-deficient model
• HMM translation model [Vogel et al., 1996]
• Relative position model (jumps in source string)
• The latent variables (alignments) are more useful these days than the translations
• Widely used: the Giza++ toolkit
A few tricks...

p(f | e)        p(e | f)
Announcements

• Upcoming language-in-10
• Thursday: Weston - 官话 (Mandarin)
• Tuesday: Jon/Austin - Русский (Russian)
• Leaderboard is functional