A Neural Reordering Model for Phrase-based Translation
Peng Li, Tsinghua University
joint work with Yang Liu, Maosong Sun, Tatsuya Izuha, and Dakun Zhang
Phrase-based Translation
2
(Koehn et al., 2003; Och and Ney, 2004)
布什 与 沙龙 举行 了 会谈
Bush held a talk with Sharon
Three steps: segmentation, reordering, translation
Reordering is Hard
3
Q: Can you figure out a sentence using these words?
Chinese / President / Xi Jinping / and / his / US / counterpart / Barack Obama / open / two days / of / talks / in / California / on / a number / of / high-stakes issues
Reordering is Hard
4
A: Chinese President Xi Jinping and his US counterpart Barack Obama open two days of talks in California on a number of high-stakes issues
Reordering is Hard
5
• An NP-complete problem (Knight, 1999; Zaslavskiy et al., 2009)
• Reordering modeling has attracted intensive attention, e.g.:
  • Distance-based model (Koehn et al., 2003)
  • Word-based lexicalized model (Koehn et al., 2007)
  • Phrase-based lexicalized model (Tillmann, 2004)
  • Hierarchical phrase-based lexicalized model (Galley and Manning, 2008)
Distance-based Model
6
(Koehn et al., 2003)
布什 与 沙龙 举行 了 会谈
Bush held a talk with Sharon
d = 0 (Bush), d = +2 (held a talk, jumping over 与 沙龙), d = −5 (with Sharon, jumping back)
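The distortion values on this slide can be reproduced with a small sketch, assuming the standard convention d = start(current) − end(previous) − 1 from distance-based reordering (Koehn et al., 2003); the function name and span encoding are illustrative:

```python
def distortions(source_spans):
    """Distance-based distortion d for each phrase, in target order.

    source_spans: (start, end) source positions of each phrase,
    1-indexed and inclusive, listed in the order they are translated.
    Convention: d = start(current) - end(previous) - 1.
    """
    ds, prev_end = [], 0
    for start, end in source_spans:
        ds.append(start - prev_end - 1)
        prev_end = end
    return ds

# "Bush" covers (1,1), "held a talk" covers (4,6), "with Sharon" covers (2,3)
print(distortions([(1, 1), (4, 6), (2, 3)]))  # [0, 2, -5]
```

The monotone first phrase gets d = 0; the forward jump over 与 沙龙 costs +2; the backward jump to translate it afterwards costs −5, matching the slide.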
Lexicalized Models
7
(Koehn et al., 2007; Tillmann, 2004; Galley and Manning, 2008)
布什 与 沙龙 举行 了 会谈
Bush held a talk with Sharon
Orientations: Bush → M (monotone), held a talk → D (discontinuous), with Sharon → S (swap)
Challenge #1: Sparsity
10
Source Phrase     Target Phrase    M    S    D
布什              Bush             0.7  0.2  0.1
举行 了 会谈      held a talk      0.1  0.1  0.8
与 沙龙           with Sharon      0.7  0.1  0.2
举行 了           held a           0.6  0.1  0.3
会谈              talk             0.4  0.3  0.3
Challenge #1: Sparsity
11
• Probability distributions are estimated by MLE:
P(D | 举行 了 会谈, held a talk) = count(举行 了 会谈, held a talk, D) / count(举行 了 会谈, held a talk, ●)
where ● sums the counts over all orientations.
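The MLE estimate on this slide is just relative frequency over extracted orientation examples; a minimal sketch (data structures and names are illustrative):

```python
from collections import Counter, defaultdict

def mle_orientation_probs(examples):
    """Relative-frequency estimate of P(orientation | phrase pair).

    examples: iterable of ((src_phrase, tgt_phrase), orientation),
    orientation in {"M", "S", "D"}. Phrase pairs seen only a few times
    get unreliable estimates -- the sparsity problem on this slide.
    """
    counts = defaultdict(Counter)
    for pair, orient in examples:
        counts[pair][orient] += 1
    probs = {}
    for pair, c in counts.items():
        total = sum(c.values())  # the "●" in the denominator
        probs[pair] = {o: c[o] / total for o in ("M", "S", "D")}
    return probs

data = [(("举行 了 会谈", "held a talk"), "D"),
        (("举行 了 会谈", "held a talk"), "D"),
        (("举行 了 会谈", "held a talk"), "M")]
p = mle_orientation_probs(data)
# P(D | 举行 了 会谈, held a talk) = 2/3 on this toy data
```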
Challenge #2: Ambiguity
12
沙龙 与 布什 ("Sharon and Bush")
Challenge #3: Context Insensitivity
13
沙龙 与 布什 举行 了 会谈
Bush held a talk with Sharon
The orientation S is predicted from "举行" alone, ignoring its context "了 会谈"
How to resolve the three challenges?
Including More Contexts
14
与 沙龙 举行 了 会谈
with Sharon | held a talk (S)
Sparsity: ?
Ambiguity: ✔
Context Insensitivity: ✔
Sparsity
15
• Including more contexts leads to even more severe sparsity:
P(D | 举行 了 会谈, held a talk, 与 沙龙, with Sharon) = count(举行 了 会谈, held a talk, 与 沙龙, with Sharon, D) / count(举行 了 会谈, held a talk, 与 沙龙, with Sharon, ●)
Reordering as Classification
Neural Reordering Model
16
• A neural classifier for predicting reordering orientations
• Conditioned on both the current and previous phrase pairs
  • Improves context sensitivity
  • Reduces reordering ambiguity
• A single classifier for all phrase pairs
  • Uses vector space representations
  • Alleviates the data sparsity problem
Recursive Autoencoder (RAE)
17
(Pollack, 1990; Socher et al., 2011)
"held a talk": word vectors x1, x2, x3
Compose: y1 = f(1)(W(1)[x1; x2] + b(1))
Reconstruct: [x1'; x2'] = f(2)(W(2)y1 + b(2))
Reconstruction errors: ||x1 − x1'||², ||x2 − x2'||²
Recursive Autoencoder (RAE)
18
Compose: y2 = f(1)(W(1)[y1; x3] + b(1))
Reconstruct: [y1'; x3'] = f(2)(W(2)y2 + b(2))
Reconstruction errors: ||y1 − y1'||², ||x3 − x3'||²
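One RAE composition step can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: tanh as f(1)/f(2), the embedding size, and the random initialization are all assumptions.

```python
import numpy as np

def rae_step(x_left, x_right, W1, b1, W2, b2):
    """One RAE node: encode two children, then try to reconstruct them.

    y = tanh(W1 [x_left; x_right] + b1)    encoder  (f(1) assumed tanh)
    [x_left'; x_right'] = tanh(W2 y + b2)  decoder  (f(2) assumed tanh)
    Returns the parent vector y and the reconstruction error.
    """
    c = np.concatenate([x_left, x_right])
    y = np.tanh(W1 @ c + b1)
    c_rec = np.tanh(W2 @ y + b2)
    err = 0.5 * np.sum((c - c_rec) ** 2)
    return y, err

rng = np.random.default_rng(0)
n = 4  # toy embedding size
W1, b1 = 0.1 * rng.normal(size=(n, 2 * n)), np.zeros(n)
W2, b2 = 0.1 * rng.normal(size=(2 * n, n)), np.zeros(2 * n)

x_held, x_a, x_talk = (rng.normal(size=n) for _ in range(3))
y1, e1 = rae_step(x_held, x_a, W1, b1, W2, b2)   # "held" + "a"  -> y1
y2, e2 = rae_step(y1, x_talk, W1, b1, W2, b2)    # y1 + "talk"   -> y2
```

Because the same (W1, b1, W2, b2) are reused at every node, a phrase of any length is folded into a single fixed-size vector y2.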
Neural Classifier
19
current phrase pair + previous phrase pair → RAE representations → softmax → orientations
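The classifier on this slide can be sketched as a softmax over the concatenated RAE vectors of the two phrase pairs. The single linear layer and all names here are illustrative; the paper's exact architecture may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_orientation(cur_src, cur_tgt, prev_src, prev_tgt, W, b):
    """P(orientation | current and previous phrase pairs).

    The four inputs are the RAE vectors of the source/target sides of
    the current and previous phrase pairs; W, b form one softmax layer.
    """
    h = np.concatenate([cur_src, cur_tgt, prev_src, prev_tgt])
    return softmax(W @ h + b)

n = 4  # toy RAE vector size
rng = np.random.default_rng(1)
vecs = [rng.normal(size=n) for _ in range(4)]
W, b = rng.normal(size=(3, 4 * n)), np.zeros(3)  # 3 orientations: M, S, D
p = predict_orientation(*vecs, W, b)             # a distribution over M/S/D
```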
Training
20
• Reordering error on predicting orientations
• Reconstruction error on recovering training examples
• Combined by linear interpolation
Reconstruction Error
21
• Reconstruction error of a single node:
E_rec([c1; c2]; θ) = (1/2) ||[c1; c2] − [c1'; c2']||²
• Source-side average reconstruction error (p ranges over the nodes of the RAE trees built over the source phrases of each training example t_i):
E_rec,s(S; θ) = (1/N_s) Σ_i Σ_p E_rec([p.c1; p.c2]; θ)
• Total reconstruction error:
E_rec(S; θ) = E_rec,s(S; θ) + E_rec,t(S; θ)
Reordering Error
22
• Average cross-entropy error:
E_reo(S; θ) = (1/|S|) Σ_i −Σ_o d_{t_i}(o) · log(P_θ(o | t_i))
• Joint training objective:
J = α E_rec(S; θ) + (1 − α) E_reo(S; θ) + R(θ)
R(θ) = (λ_L/2) ||θ_L − θ_L0||² + (λ_rec/2) ||θ_rec||² + (λ_reo/2) ||θ_reo||²
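The objective above is a straightforward interpolation plus regularizer; a minimal sketch with toy parameter vectors (E_rec and E_reo are taken as given scalars here):

```python
import numpy as np

def regularizer(theta_L, theta_L0, theta_rec, theta_reo,
                lam_L, lam_rec, lam_reo):
    """R(theta): keep word vectors theta_L close to their initialization
    theta_L0, and apply plain L2 penalties to the RAE and classifier
    parameters theta_rec and theta_reo."""
    return (lam_L / 2 * np.sum((theta_L - theta_L0) ** 2)
            + lam_rec / 2 * np.sum(theta_rec ** 2)
            + lam_reo / 2 * np.sum(theta_reo ** 2))

def joint_objective(e_rec, e_reo, alpha, reg):
    """J = alpha * E_rec + (1 - alpha) * E_reo + R(theta)."""
    return alpha * e_rec + (1 - alpha) * e_reo + reg

# toy values just to exercise the formulas
r = regularizer(np.ones(3), np.zeros(3), np.ones(2), np.ones(2),
                0.1, 0.1, 0.1)           # 0.15 + 0.1 + 0.1 = 0.35
J = joint_objective(2.0, 4.0, 0.25, r)   # 0.5 + 3.0 + 0.35
```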
Optimization
23
• Hyper-parameter optimization
  • Hyper-parameters: α, λ_L, λ_rec, λ_reo
  • Optimized by random search (Bergstra and Bengio, 2012)
• Training objective optimization: L-BFGS
  • Gradients computed by backpropagation through structure (Goller and Küchler, 1996)
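Random search over the four hyper-parameters is a simple loop; the search ranges and trial count below are illustrative assumptions, not the paper's settings.

```python
import random

def random_search(evaluate, n_trials=50, seed=0):
    """Random search over hyper-parameters (Bergstra and Bengio, 2012).

    `evaluate` maps a config dict to a dev-set score (higher is better).
    Each trial draws alpha uniformly and the lambdas log-uniformly.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = {
            "alpha": rng.uniform(0.0, 1.0),
            "lam_L": 10 ** rng.uniform(-6, -2),
            "lam_rec": 10 ** rng.uniform(-6, -2),
            "lam_reo": 10 ** rng.uniform(-6, -2),
        }
        score = evaluate(cfg)
        if best is None or score > best[0]:
            best = (score, cfg)
    return best

# toy stand-in for "decode the dev set and return BLEU", peaked at alpha=0.3
score, cfg = random_search(lambda c: -(c["alpha"] - 0.3) ** 2)
```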
Experiments
24
• Chinese-English translation
  • Training: 1.2M sentence pairs
  • LM: 4-gram, 397.6M words
  • Dev. set: NIST 06
  • Test sets: NIST 02-05, 08
  • Case-insensitive BLEU
• Baselines
  • Distance-based model
  • Lexicalized model
• Settings: {M/S/D, left/right} orientations × {word-based, phrase-based, hier. phrase-based} model types
M/S/D Orientations
25
• Care about both relative position and adjacency
布什 → Bush, 举行 了 会谈 → held a talk, 与 沙龙 → with Sharon
  • source phrases adjacent: S (swap)
  • source phrases separated by a gap: D (discontinuous)
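The M/S/D decision on this slide can be written as a three-way test on source spans; a minimal sketch assuming 1-indexed inclusive spans:

```python
def msd_orientation(prev_span, cur_span):
    """M/S/D orientation of the current phrase pair w.r.t. the previous.

    Spans are (start, end) source positions, inclusive. M and S both
    require adjacency on the source side; anything with a gap is D.
    """
    (p_start, p_end), (c_start, c_end) = prev_span, cur_span
    if p_end + 1 == c_start:
        return "M"  # monotone: previous phrase ends right before current
    if c_end + 1 == p_start:
        return "S"  # swap: previous phrase starts right after current
    return "D"      # discontinuous: a gap on the source side

# 布什 (1,1) then 举行 了 会谈 (4,6): gap over 与 沙龙 -> D
print(msd_orientation((1, 1), (4, 6)))  # D
# 举行 了 会谈 (4,6) then 与 沙龙 (2,3): adjacent, reversed order -> S
print(msd_orientation((4, 6), (2, 3)))  # S
```

This makes the adjacency sensitivity explicit: moving one unaligned word in or out of a phrase shifts a span boundary and can flip M to D, which is exactly the non-separability issue discussed later.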
Left/Right Orientations
26
• Only care about relative position
布什 → Bush, 举行 了 会谈 → held a talk, 与 沙龙 → with Sharon
  • current source phrase to the right of the previous one: right
  • current source phrase to the left of the previous one: left
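Dropping the adjacency test leaves only a sign check; a minimal sketch with the same span convention as above:

```python
def left_right_orientation(prev_span, cur_span):
    """Left/right orientation: only relative source position matters.

    Spans are (start, end) source positions, inclusive, and phrases do
    not overlap. Because adjacency is ignored, unaligned words at a
    phrase boundary cannot flip the label.
    """
    return "right" if cur_span[0] > prev_span[1] else "left"

print(left_right_orientation((1, 1), (4, 6)))  # right
print(left_right_orientation((4, 6), (2, 3)))  # left
```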
Translation
27
[Figure: BLEU scores (24-33) on MT06 (dev), MT02, MT03, MT04, MT05, and MT08 for the baseline, neural M/S/D, and neural left/right models]
Non-Separability
28
• The unaligned Chinese word "de" (的) makes a big difference in determining M/S/D orientations
金门 有 六 万 的 常住 人口
jinmen you liu wan de changzhu renkou
Kinmen has 60000 resident population
Orientation: M when "的" is attached to the preceding phrase, D when it is not
Non-Separability
30
(六 万 的 → 60000) + (常住 人口 → resident population): M
(六 万 → 60000) + (常住 人口 → resident population): D
Non-Separability
31
• Left/right orientations are not so sensitive to unaligned words
金门 有 六 万 的 常住 人口
jinmen you liu wan de changzhu renkou
Kinmen has 60000 resident population
Orientation: right in both cases
Non-Separability
33
(六 万 的 → 60000) + (常住 人口 → resident population): right
(六 万 → 60000) + (常住 人口 → resident population): right
Non-Separability
34
left/right orientations vs. M/S/D orientations
Distortion Limit
35
[Figure: BLEU (22-25) vs. distortion limit (2, 4, 6, 8) for the neural and lexicalized models]
Word Vectors
36
[Figure: BLEU (20-34) on MT06 (dev), MT02-MT05, and MT08 for task-oriented vectors vs. word2vec vectors]
Vector Space Representations
37
[Figure: 2-D visualization of learned phrase representations; phrases with similar behavior cluster together, e.g. date expressions such as "june 18, 2001", "late 2011", "by june 1", and "complete by end 1998"]
Conclusion
38
• We propose a neural reordering model for phrase-based translation
  • It improves context sensitivity, reduces ambiguity, and alleviates the data sparsity problem
• Future work
  • Train the MT system and the neural classifier jointly
  • Develop more efficient models to leverage larger contexts
  • Extend our work to syntax-based and n-gram-based models
Thanks!
39