Exact Decoding of Phrase-based Translation Models through Lagrangian Relaxation #emnlpreading

Exact Decoding of Phrase-based Translation Models through Lagrangian Relaxation
Yin-Wen Chang (MIT), Michael Collins (Columbia University)
EMNLP 2011 reading

Upload: yoh-okuno

Post on 11-Jun-2015


TRANSCRIPT

Page 1: Exact Decoding of Phrase-based Translation Models through Lagrangian Relaxation #emnlpreading

Exact Decoding of Phrase-based Translation Models through Lagrangian Relaxation

Yin-Wen Chang (MIT), Michael Collins (Columbia University)

EMNLP 2011 reading

Page 2

About the presenter

• Name: Yoh Okuno

• Software Engineer at a Web company

• Interests: NLP, Machine Learning, Data Mining

• Skills: C/C++, Python, Hadoop, etc.

• Weblog: http://d.hatena.ne.jp/nokuno/

Page 3

Decoding in Phrase-based SMT

• Decoding in SMT is NP-hard

– Approximate search: beam search

– Exact search: ILP (Integer Linear Programming)

• This paper proposes Lagrangian relaxation combined with efficient dynamic programming

Page 4

Phrase-based SMT Model

• Reordering makes the problem complicated

• Use 3-gram language model

f(y) = h(e(y)) + Σ_{k=1}^{L} g(p_k) + Σ_{k=1}^{L−1} η δ(t(p_k), s(p_{k+1}))

The three terms are the language model (LM), translation, and distortion scores.

• output: y = <p_1 p_2 ... p_L>

• phrase: p_k = (s, t, e)

• distortion: δ(t, s) = |t + 1 − s|

• η: negative constant; x: input sentence
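The scoring function above can be sketched directly from its three terms. This is a toy sketch, not the paper's implementation: the language-model scorer `h` and phrase scorer `g` are illustrative stand-ins passed in as functions.

```python
def distortion(t, s):
    # delta(t, s) = |t + 1 - s|: the jump between the end of one
    # phrase and the start of the next in the input sentence
    return abs(t + 1 - s)

def score(y, h, g, eta=-1.0):
    # y: derivation as a list of phrases p_k = (s, t, e), where
    #    s..t is the input span and e is the output string
    # h: trigram LM score of the full output e(y) (stand-in)
    # g: per-phrase translation score (stand-in)
    # f(y) = h(e(y)) + sum_k g(p_k) + sum_k eta * delta(...)
    output = " ".join(e for (_, _, e) in y)
    total = h(output)
    total += sum(g(p) for p in y)
    for pk, pk1 in zip(y, y[1:]):
        total += eta * distortion(pk[1], pk1[0])
    return total
```

With monotone phrases (no reordering) the distortion term is zero, so the score reduces to the LM plus the phrase scores.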

Page 5

Decoding  with  constraints

• Our purpose: solve

argmax_{y ∈ Y} f(y)

• Define y(i) = number of times input word x_i is translated in y

1. Each word in the input is translated exactly once: y(i) = 1 for all i

2. Distortion limit: δ(t(p_k), s(p_{k+1})) < d
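The two constraints above are easy to check for a candidate derivation. A minimal sketch, using the same illustrative phrase tuples (s, t, e) as before and the slide's strict distortion bound:

```python
def counts(y, n_words):
    # y(i): number of times input word i is translated in derivation y
    c = [0] * n_words
    for (s, t, e) in y:
        for i in range(s, t + 1):
            c[i - 1] += 1
    return c

def is_valid(y, n_words, d):
    # Constraint 1: each input word translated exactly once
    if any(c != 1 for c in counts(y, n_words)):
        return False
    # Constraint 2: distortion limit between consecutive phrases
    return all(abs(t + 1 - s2) < d
               for (_, t, _), (s2, _, _) in zip(y, y[1:]))
```

Relaxing constraint 1 is exactly what the Lagrangian-relaxation approach on the following slides does.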

Page 6

Exact  dynamic  programming

• Use states: (w1, w2, b, r)

• w1, w2: trigram context words

• b: bit string recording which input words have been translated

• r: end position of the previous phrase
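The exact-DP state can be sketched with the bit string b as an integer bitmask. This is a toy sketch of one transition (applying a phrase to a state); the phrase tuple format is the same illustrative (s, t, e) used above:

```python
def apply_phrase(state, phrase):
    # state = (w1, w2, b, r): trigram context words, coverage
    # bitmask, and end position of the previous phrase
    w1, w2, b, r = state
    s, t, e = phrase                 # phrase covers input words s..t
    words = e.split()
    # bits for input positions s..t (1-indexed)
    mask = ((1 << (t - s + 1)) - 1) << (s - 1)
    if b & mask:                     # some word already translated
        return None
    # new trigram context = last two output words so far
    ctx = ([w1, w2] + words)[-2:]
    return (ctx[0], ctx[1], b | mask, t)
```

The bitmask is what makes this DP intractable: b ranges over all 2^N subsets of input positions.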

Page 7

Exact dynamic programming

• Yet it is intractable: the number of possible bit strings b is exponential in the sentence length

Page 8

Decoding based on Lagrangian Relaxation

• Consider a broader set Y′ ⊇ Y and solve

argmax_{y ∈ Y′} f(y)

• Y′ uses a looser constraint:

Σ_{i=1}^{N} y(i) = N

• That is, N words are translated in total (some may be translated more than once, others not at all)

Page 9

Efficient Dynamic Programming

• Use states: (w1, w2, n, r) or (w1, w2, n, l, m, r)

• n: number of translated words

• (l, m): range of previously translated words

• Transition corresponds to one phrase translation p_k = (s, t, e)

Page 10

Applying Lagrangian Relaxation

• Solve the relaxed problem with the original constraints reintroduced:

argmax_{y ∈ Y′} f(y) such that ∀i, y(i) = 1

• Apply the Lagrangian method:

L(u, y) = f(y) + Σ_i u(i)(y(i) − 1)

• Dual objective and dual problem:

min_u L(u) = min_u max_{y ∈ Y′} L(u, y)

Page 11

Decoding  by  subgradient  method

Page 12

Intuitive interpretation

• The Lagrange multiplier u(i) penalizes or rewards input word i so that it is translated exactly once

• Update: u_t(i) = u_{t−1}(i) − α_t (y_t(i) − 1)

– Decrease u(i) if y(i) > 1

– Increase u(i) if y(i) = 0

– Do nothing if y(i) = 1
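This update rule drops into a standard subgradient loop. A minimal sketch, assuming a relaxed decoder `solve_relaxed(u)` (a hypothetical stand-in for the efficient DP) that returns the counts y(i) of the best derivation in Y′ under multipliers u:

```python
def subgradient_decode(solve_relaxed, n_words, max_iter=120, rate=1.0):
    # u(i): Lagrange multiplier penalizing/rewarding word i
    u = [0.0] * n_words
    for t in range(1, max_iter + 1):
        y = solve_relaxed(u)          # counts y(i) under current u
        if all(c == 1 for c in y):    # all constraints satisfied:
            return y, u               # the relaxed solution is exact
        alpha = rate / t              # decreasing step size (one choice)
        # u_t(i) = u_{t-1}(i) - alpha_t * (y_t(i) - 1)
        u = [ui - alpha * (yi - 1) for ui, yi in zip(u, y)]
    return None, u                    # no certificate within max_iter
```

When the loop exits early, the relaxed argmax already satisfies y(i) = 1 for all i, so it is provably the exact solution to the original problem.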

Page 13

Input: dadurch können die qualität und die regelmäßige postzustellung auch weiterhin sichergestellt werden .

Intermediate solutions across iterations (constraints still violated): the quality and also the / and the quality and also the regular / will continue to be continue to be continue to ... / in that way, and can thus quality / in that way, the qualität and ... / can the regular distribution should also ensure distribution ... / the regular and regular / and regular the quality and the ... / in that way, the quality of the quality of the distribution ...

Output: in that way, the quality and the regular distribution should continue to be guaranteed.

Page 14

Experimental  summary

• Language pair: German to English translation

• Corpus: Europarl data (1,824 sentences)

• The proposed method finds exact solutions for 99% of sentences

• Average run time is 120 seconds

• Moses makes search errors on 4 to 18% of sentences

Page 15

Table 1: iterations and convergence

• 97% of the examples converge within 120 iterations

Page 16

Table  4:  ILP/LP  are  too  slow

Page 17

Table  5:  Moses  search  errors

Page 18

Table 7: BLEU doesn't improve

Page 19

Conclusion

• Described an exact decoding algorithm for SMT using Lagrangian relaxation

• The proposed method finds exact solutions for 99% of samples, within 120 seconds on average

• Future work: apply Lagrangian relaxation to training algorithms for SMT

Page 20

Any Questions?

Page 21

Transition  for  DP

• Define a transition as one phrase translation p_k = (s, t, e):

(w1, w2, n, l, m, r) → (w′1, w′2, n′, l′, m′, r′)

(w′1, w′2) = (e_{M−1}, e_M) if M > 1, (w2, e_1) if M = 1

n′ = n + t − s + 1
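The transition can be sketched as a function on the six-tuple state. Only the context and count updates are given on this slide; the (l′, m′, r′) bookkeeping below is an assumption for illustration (the new phrase's span (s, t) becomes the new range, with r′ = t), not the paper's exact rule.

```python
def transition(state, phrase):
    # state = (w1, w2, n, l, m, r); phrase p_k = (s, t, e)
    w1, w2, n, l, m, r = state
    s, t, e = phrase
    words = e.split()
    M = len(words)
    # new trigram context: the last two output words
    if M > 1:
        w1p, w2p = words[-2], words[-1]   # (e_{M-1}, e_M)
    else:
        w1p, w2p = w2, words[0]           # (w2, e_1)
    n_new = n + t - s + 1                 # n' = n + t - s + 1
    # ASSUMPTION: (l', m') = (s, t) and r' = t (illustrative only)
    return (w1p, w2p, n_new, s, t, t)
```

Because the state tracks only a count n rather than a full coverage bitmask, the number of states is polynomial, which is what makes the relaxed problem tractable.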