A Neural Reordering Model for Phrase-based Translation
Peng Li, Tsinghua University
joint work with Yang Liu, Maosong Sun, Tatsuya Izuha, and Dakun Zhang
Phrase-based Translation
2
(Koehn et al., 2003; Och and Ney, 2004)
布什 与 沙龙 举行 了 会谈
Bush held a talk with Sharon
Three steps: segmentation, reordering, translation
Reordering is Hard
3
Q: Can you figure out a sentence using these words?
Chinese / President / Xi Jinping / and / his / US / counterpart / Barack Obama / open / two days / of / talks / in / California / on / a number / of / high-stakes issues
Reordering is Hard
4
A: Chinese President Xi Jinping and his US counterpart Barack Obama open two days of talks in California on a number of high-stakes issues
Reordering is Hard
5
• An NP-complete problem (Knight, 1999; Zaslavskiy et al., 2009)
• Reordering modeling has attracted intensive attention, e.g.:
  • Distance-based model (Koehn et al., 2003)
  • Word-based lexicalized model (Koehn et al., 2007)
  • Phrase-based lexicalized model (Tillmann, 2004)
  • Hierarchical phrase-based lexicalized model (Galley and Manning, 2008)
Distance-based Model
6
(Koehn et al., 2003)
布什 与 沙龙 举行 了 会谈
Bush held a talk with Sharon
d = 0 (Bush), d = +2 (held a talk, jumping over 与 沙龙), d = −5 (with Sharon, jumping back)
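The distortion values on this slide can be reproduced with a small sketch, assuming the standard convention d = start(current) − end(previous) − 1 from distance-based reordering (Koehn et al., 2003); the function name and span encoding are illustrative:

```python
def distortions(source_spans):
    """Distance-based distortion d for each phrase, in target order.

    source_spans: (start, end) source positions of each phrase,
    1-indexed and inclusive, listed in the order they are translated.
    Convention: d = start(current) - end(previous) - 1.
    """
    ds, prev_end = [], 0
    for start, end in source_spans:
        ds.append(start - prev_end - 1)
        prev_end = end
    return ds

# "Bush" covers (1,1), "held a talk" covers (4,6), "with Sharon" covers (2,3)
print(distortions([(1, 1), (4, 6), (2, 3)]))  # [0, 2, -5]
```

The monotone first phrase gets d = 0; the forward jump over 与 沙龙 costs +2; the backward jump to translate it afterwards costs −5, matching the slide.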
Lexicalized Models
7
(Koehn et al., 2007; Tillmann, 2004; Galley and Manning, 2008)
布什 与 沙龙 举行 了 会谈
Bush held a talk with Sharon
Orientations: Bush → M (monotone), held a talk → D (discontinuous), with Sharon → S (swap)
Challenge #1: Sparsity
10
Source Phrase     Target Phrase    M    S    D
布什              Bush             0.7  0.2  0.1
举行 了 会谈      held a talk      0.1  0.1  0.8
与 沙龙           with Sharon      0.7  0.1  0.2
举行 了           held a           0.6  0.1  0.3
会谈              talk             0.4  0.3  0.3
Challenge #1: Sparsity
11
• Probability distributions are estimated by MLE:
P(D | 举行 了 会谈, held a talk) = count(举行 了 会谈, held a talk, D) / count(举行 了 会谈, held a talk, ●)
where ● sums the counts over all orientations.
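The MLE estimate on this slide is just relative frequency over extracted orientation examples; a minimal sketch (data structures and names are illustrative):

```python
from collections import Counter, defaultdict

def mle_orientation_probs(examples):
    """Relative-frequency estimate of P(orientation | phrase pair).

    examples: iterable of ((src_phrase, tgt_phrase), orientation),
    orientation in {"M", "S", "D"}. Phrase pairs seen only a few times
    get unreliable estimates -- the sparsity problem on this slide.
    """
    counts = defaultdict(Counter)
    for pair, orient in examples:
        counts[pair][orient] += 1
    probs = {}
    for pair, c in counts.items():
        total = sum(c.values())  # the "●" in the denominator
        probs[pair] = {o: c[o] / total for o in ("M", "S", "D")}
    return probs

data = [(("举行 了 会谈", "held a talk"), "D"),
        (("举行 了 会谈", "held a talk"), "D"),
        (("举行 了 会谈", "held a talk"), "M")]
p = mle_orientation_probs(data)
# P(D | 举行 了 会谈, held a talk) = 2/3 on this toy data
```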
Challenge #2: Ambiguity
12
沙龙 与 布什 ("Sharon and Bush")
Challenge #3: Context Insensitivity
13
沙龙 与 布什 举行 了 会谈
Bush held a talk with Sharon
The orientation S is predicted from "举行" alone, ignoring its context "了 会谈"
How to resolve the three challenges?
Including More Contexts
14
与 沙龙 举行 了 会谈
with Sharon | held a talk (S)
Sparsity: ?
Ambiguity: ✔
Context Insensitivity: ✔
Sparsity
15
• Including more contexts leads to even more severe sparsity:
P(D | 举行 了 会谈, held a talk, 与 沙龙, with Sharon) = count(举行 了 会谈, held a talk, 与 沙龙, with Sharon, D) / count(举行 了 会谈, held a talk, 与 沙龙, with Sharon, ●)
Reordering as Classification
Neural Reordering Model
16
• A neural classifier for predicting reordering orientations
• Conditioned on both the current and previous phrase pairs
  • Improves context sensitivity
  • Reduces reordering ambiguity
• A single classifier for all phrase pairs
  • Uses vector space representations
  • Alleviates the data sparsity problem
Recursive Autoencoder (RAE)
17
(Pollack, 1990; Socher et al., 2011)
"held a talk": word vectors x1, x2, x3
Compose: y1 = f(1)(W(1)[x1; x2] + b(1))
Reconstruct: [x1'; x2'] = f(2)(W(2)y1 + b(2))
Reconstruction errors: ||x1 − x1'||², ||x2 − x2'||²
Recursive Autoencoder (RAE)
18
Compose: y2 = f(1)(W(1)[y1; x3] + b(1))
Reconstruct: [y1'; x3'] = f(2)(W(2)y2 + b(2))
Reconstruction errors: ||y1 − y1'||², ||x3 − x3'||²
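One RAE composition step can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: tanh as f(1)/f(2), the embedding size, and the random initialization are all assumptions.

```python
import numpy as np

def rae_step(x_left, x_right, W1, b1, W2, b2):
    """One RAE node: encode two children, then try to reconstruct them.

    y = tanh(W1 [x_left; x_right] + b1)    encoder  (f(1) assumed tanh)
    [x_left'; x_right'] = tanh(W2 y + b2)  decoder  (f(2) assumed tanh)
    Returns the parent vector y and the reconstruction error.
    """
    c = np.concatenate([x_left, x_right])
    y = np.tanh(W1 @ c + b1)
    c_rec = np.tanh(W2 @ y + b2)
    err = 0.5 * np.sum((c - c_rec) ** 2)
    return y, err

rng = np.random.default_rng(0)
n = 4  # toy embedding size
W1, b1 = 0.1 * rng.normal(size=(n, 2 * n)), np.zeros(n)
W2, b2 = 0.1 * rng.normal(size=(2 * n, n)), np.zeros(2 * n)

x_held, x_a, x_talk = (rng.normal(size=n) for _ in range(3))
y1, e1 = rae_step(x_held, x_a, W1, b1, W2, b2)   # "held" + "a"  -> y1
y2, e2 = rae_step(y1, x_talk, W1, b1, W2, b2)    # y1 + "talk"   -> y2
```

Because the same (W1, b1, W2, b2) are reused at every node, a phrase of any length is folded into a single fixed-size vector y2.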
Neural Classifier
19
current phrase pair + previous phrase pair → RAE representations → softmax → orientations
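The classifier on this slide can be sketched as a softmax over the concatenated RAE vectors of the two phrase pairs. The single linear layer and all names here are illustrative; the paper's exact architecture may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_orientation(cur_src, cur_tgt, prev_src, prev_tgt, W, b):
    """P(orientation | current and previous phrase pairs).

    The four inputs are the RAE vectors of the source/target sides of
    the current and previous phrase pairs; W, b form one softmax layer.
    """
    h = np.concatenate([cur_src, cur_tgt, prev_src, prev_tgt])
    return softmax(W @ h + b)

n = 4  # toy RAE vector size
rng = np.random.default_rng(1)
vecs = [rng.normal(size=n) for _ in range(4)]
W, b = rng.normal(size=(3, 4 * n)), np.zeros(3)  # 3 orientations: M, S, D
p = predict_orientation(*vecs, W, b)             # a distribution over M/S/D
```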
Training
20
• Reordering error on predicting orientations
• Reconstruction error on recovering training examples
• Combined by linear interpolation
Reconstruction Error
21
• Reconstruction error of a single node:
E_rec([c1; c2]; θ) = (1/2) ||[c1; c2] − [c1'; c2']||²
• Source-side average reconstruction error (p ranges over the nodes of the RAE trees built over the source phrases of each training example t_i):
E_rec,s(S; θ) = (1/N_s) Σ_i Σ_p E_rec([p.c1; p.c2]; θ)
• Total reconstruction error:
E_rec(S; θ) = E_rec,s(S; θ) + E_rec,t(S; θ)
Reordering Error
22
• Average cross-entropy error:
E_reo(S; θ) = (1/|S|) Σ_i −Σ_o d_{t_i}(o) · log(P_θ(o | t_i))
• Joint training objective:
J = α E_rec(S; θ) + (1 − α) E_reo(S; θ) + R(θ)
R(θ) = (λ_L/2) ||θ_L − θ_L0||² + (λ_rec/2) ||θ_rec||² + (λ_reo/2) ||θ_reo||²
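The objective above is a straightforward interpolation plus regularizer; a minimal sketch with toy parameter vectors (E_rec and E_reo are taken as given scalars here):

```python
import numpy as np

def regularizer(theta_L, theta_L0, theta_rec, theta_reo,
                lam_L, lam_rec, lam_reo):
    """R(theta): keep word vectors theta_L close to their initialization
    theta_L0, and apply plain L2 penalties to the RAE and classifier
    parameters theta_rec and theta_reo."""
    return (lam_L / 2 * np.sum((theta_L - theta_L0) ** 2)
            + lam_rec / 2 * np.sum(theta_rec ** 2)
            + lam_reo / 2 * np.sum(theta_reo ** 2))

def joint_objective(e_rec, e_reo, alpha, reg):
    """J = alpha * E_rec + (1 - alpha) * E_reo + R(theta)."""
    return alpha * e_rec + (1 - alpha) * e_reo + reg

# toy values just to exercise the formulas
r = regularizer(np.ones(3), np.zeros(3), np.ones(2), np.ones(2),
                0.1, 0.1, 0.1)           # 0.15 + 0.1 + 0.1 = 0.35
J = joint_objective(2.0, 4.0, 0.25, r)   # 0.5 + 3.0 + 0.35
```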
Optimization
23
• Hyper-parameter optimization
  • Hyper-parameters: α, λ_L, λ_rec, λ_reo
  • Optimized by random search (Bergstra and Bengio, 2012)
• Training objective optimization: L-BFGS
  • Gradients computed by backpropagation through structure (Goller and Küchler, 1996)
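Random search over the four hyper-parameters is a simple loop; the search ranges and trial count below are illustrative assumptions, not the paper's settings.

```python
import random

def random_search(evaluate, n_trials=50, seed=0):
    """Random search over hyper-parameters (Bergstra and Bengio, 2012).

    `evaluate` maps a config dict to a dev-set score (higher is better).
    Each trial draws alpha uniformly and the lambdas log-uniformly.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = {
            "alpha": rng.uniform(0.0, 1.0),
            "lam_L": 10 ** rng.uniform(-6, -2),
            "lam_rec": 10 ** rng.uniform(-6, -2),
            "lam_reo": 10 ** rng.uniform(-6, -2),
        }
        score = evaluate(cfg)
        if best is None or score > best[0]:
            best = (score, cfg)
    return best

# toy stand-in for "decode the dev set and return BLEU", peaked at alpha=0.3
score, cfg = random_search(lambda c: -(c["alpha"] - 0.3) ** 2)
```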
Experiments
24
• Chinese-English translation
  • Training: 1.2M sentence pairs
  • LM: 4-gram, 397.6M words
  • Dev. set: NIST 06
  • Test sets: NIST 02-05, 08
  • Case-insensitive BLEU
• Baselines
  • Distance-based model
  • Lexicalized model
• Settings: {M/S/D, left/right} orientations × {word-based, phrase-based, hier. phrase-based} model types
M/S/D Orientations
25
• Care about both relative position and adjacency
布什 → Bush, 举行 了 会谈 → held a talk, 与 沙龙 → with Sharon
  • source phrases adjacent: S (swap)
  • source phrases separated by a gap: D (discontinuous)
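The M/S/D decision on this slide can be written as a three-way test on source spans; a minimal sketch assuming 1-indexed inclusive spans:

```python
def msd_orientation(prev_span, cur_span):
    """M/S/D orientation of the current phrase pair w.r.t. the previous.

    Spans are (start, end) source positions, inclusive. M and S both
    require adjacency on the source side; anything with a gap is D.
    """
    (p_start, p_end), (c_start, c_end) = prev_span, cur_span
    if p_end + 1 == c_start:
        return "M"  # monotone: previous phrase ends right before current
    if c_end + 1 == p_start:
        return "S"  # swap: previous phrase starts right after current
    return "D"      # discontinuous: a gap on the source side

# 布什 (1,1) then 举行 了 会谈 (4,6): gap over 与 沙龙 -> D
print(msd_orientation((1, 1), (4, 6)))  # D
# 举行 了 会谈 (4,6) then 与 沙龙 (2,3): adjacent, reversed order -> S
print(msd_orientation((4, 6), (2, 3)))  # S
```

This makes the adjacency sensitivity explicit: moving one unaligned word in or out of a phrase shifts a span boundary and can flip M to D, which is exactly the non-separability issue discussed later.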
Left/Right Orientations
26
• Only care about relative position
布什 → Bush, 举行 了 会谈 → held a talk, 与 沙龙 → with Sharon
  • current source phrase to the right of the previous one: right
  • current source phrase to the left of the previous one: left
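Dropping the adjacency test leaves only a sign check; a minimal sketch with the same span convention as above:

```python
def left_right_orientation(prev_span, cur_span):
    """Left/right orientation: only relative source position matters.

    Spans are (start, end) source positions, inclusive, and phrases do
    not overlap. Because adjacency is ignored, unaligned words at a
    phrase boundary cannot flip the label.
    """
    return "right" if cur_span[0] > prev_span[1] else "left"

print(left_right_orientation((1, 1), (4, 6)))  # right
print(left_right_orientation((4, 6), (2, 3)))  # left
```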
Translation
27
[Figure: BLEU scores (24-33) on MT06 (dev), MT02, MT03, MT04, MT05, and MT08 for the baseline, neural M/S/D, and neural left/right models]
Non-Separability
28
• The unaligned Chinese word "de" (的) makes a big difference in determining M/S/D orientations
金门 有 六 万 的 常住 人口
jinmen you liu wan de changzhu renkou
Kinmen has 60000 resident population
Orientation: M when "的" is attached to the preceding phrase, D when it is not
Non-Separability
30
(六 万 的 → 60000) + (常住 人口 → resident population): M
(六 万 → 60000) + (常住 人口 → resident population): D
Non-Separability
31
• Left/right orientations are not so sensitive to unaligned words
金门 有 六 万 的 常住 人口
jinmen you liu wan de changzhu renkou
Kinmen has 60000 resident population
Orientation: right in both cases
Non-Separability
33
(六 万 的 → 60000) + (常住 人口 → resident population): right
(六 万 → 60000) + (常住 人口 → resident population): right
Non-Separability
34
left/right orientations vs. M/S/D orientations
Distortion Limit
35
[Figure: BLEU (22-25) vs. distortion limit (2, 4, 6, 8) for the neural and lexicalized models]
Word Vectors
36
[Figure: BLEU (20-34) on MT06 (dev), MT02-MT05, and MT08 for task-oriented vectors vs. word2vec vectors]
Vector Space Representations
37
[Figure: 2-D visualization of learned phrase representations; phrases with similar behavior cluster together, e.g. date expressions such as "june 18, 2001", "late 2011", "by june 1", and "complete by end 1998"]
Conclusion
38
• We propose a neural reordering model for phrase-based translation
  • It improves context sensitivity, reduces ambiguity, and alleviates the data sparsity problem
• Future work
  • Train the MT system and the neural classifier jointly
  • Develop more efficient models to leverage larger contexts
  • Extend our work to syntax-based and n-gram-based models
Thanks!
39