Novel Reordering Approaches in Phrase-Based Statistical Machine
Translation
S. Kanthak, D. Vilar, E. Matusov, R. Zens & H. Ney
ACL Workshop on Building and Using Parallel Text 2005
[Cartoon: a British guy and a French guy, from Goscinny/Uderzo: Astérix chez les Bretons]
Problem: Reordering
Potentially many long-distance reorderings
Probably the largest single source of errors in current MT systems:
Would you like to go to the cinema with me on Saturday?
Möchtest du mit mir am Samstag ins Kino gehen?
MT output with reordering errors:
two weeks ago in the south of france 17 years old and the sixteen years old robert friend romain with a gun and a baseball bat killed in unfounded, without motive, as a Sunday evening their johan on television
Reference translation:
one Sunday evening a fortnight ago in the south of france, johan, aged 17, and robert, aged 16, murdered their childhood friend, romain, with a firearm and a baseball bat, for no reason or motive, just like on tv
Basic Translation Approach: WFSTs
Focus on translation of spoken language
Translation needs to be integrated with speech recognition
ASR systems use strict left-to-right finite-state decoding (HMMs)
FST-based translation makes integration easy
Notation
$e_1^I$: target language sentence
$f_1^J$: source language sentence
$\tilde{e}_1^J$: segmentation of $e_1^I$ into $J$ phrases
$(f_j, \tilde{e}_j)$: aligned tuple of source word and target phrase
$A$: alignment
Instead of using a conditional model $P(e_1^I \mid f_1^J)$, use a joint-probability model:

$$
\begin{aligned}
\hat{e}_1^I &= \arg\max_{e_1^I} P(e_1^I, f_1^J) \\
            &= \arg\max_{e_1^I} \max_A P(e_1^I, f_1^J, A) \\
            &= \arg\max_{e_1^I} \max_A P(A)\, P(e_1^I, f_1^J \mid A) \\
            &= \arg\max_{e_1^I} \max_A \prod_{j=1,\dots,J} P(f_j, \tilde{e}_j \mid f_1^{j-1}, \tilde{e}_1^{j-1}, A) \\
            &= \arg\max_{e_1^I} \max_A \prod_{j=1,\dots,J} P(f_j, \tilde{e}_j \mid f_{j-m}^{j-1}, \tilde{e}_{j-m}^{j-1}, A)
\end{aligned}
$$
Each source word is aligned with a target phrase (which can be the empty phrase)
For a uniform probability distribution over all alignments A, the translation model becomes an m-gram model over pairs of source words and target phrases
Can be expressed as a WFST T:

$$
\hat{e}_1^I = \text{project-output}(\text{best}(f_1^J \circ T))
$$
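To make the decision rule concrete, here is a minimal, hypothetical sketch of scoring one tuple sequence with a bigram (m = 2) joint model; all probabilities and names below are made up for illustration. In the actual system this model is compiled into the transducer T and searched over, rather than evaluated sentence by sentence:

```python
import math

# Hypothetical bigram probabilities over (source word, target phrase)
# tuples, as they might be estimated from monotonized training data.
BIGRAM = {
    (("<s>", "<s>"), ("möchtest", "would you like")): 0.5,
    (("möchtest", "would you like"), ("du", "")): 0.6,
    (("du", ""), ("gehen", "to go")): 0.4,
}
UNIFORM_BACKOFF = 1e-4  # crude stand-in for real smoothing

def joint_log_prob(tuples):
    """Score a tuple sequence with the joint bigram model:
    sum_j log P(f_j, e~_j | f_{j-1}, e~_{j-1})."""
    total, prev = 0.0, ("<s>", "<s>")
    for t in tuples:
        total += math.log(BIGRAM.get((prev, t), UNIFORM_BACKOFF))
        prev = t
    return total

# Example: score three tuples of a (reordered) German source sentence.
print(joint_log_prob([("möchtest", "would you like"),
                      ("du", ""),
                      ("gehen", "to go")]))
```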
Reordering problem
FST model does not work well when the alignment is non-monotonic (does not satisfy $a_i \le a_{i'}$ for $i < i'$)
Bad for languages with very different word order
Apply reordering during training and search to either source or target language sentences
Here: reorder source words prior to training such that alignments become monotonic for all sentences
Reordering during training
Perform bidirectional word alignment
Estimate a cost matrix C for each sentence pair; $c_{ij}$ indicates the local cost of aligning source word $f_j$ to target word $e_i$
Cost is derived from state occupation probabilities:

$$
c_{ij} := -\log p_j(i \mid f_1^J, e_1^I)
$$

where $p_j(i \mid f_1^J, e_1^I)$ is the probability of $e_i$ occurring at target sentence position $i$ as the translation of word $f_j$, obtained by normalizing the state occupation probability over all target sentence positions:

$$
p_j(i \mid f_1^J, e_1^I) = \frac{p_j(i, f_1^J \mid e_1^I)}{\sum_{i'=1}^{I} p_j(i', f_1^J \mid e_1^I)},
\qquad
p_j(i, f_1^J \mid e_1^I) = \sum_{a_1^J :\, a_j = i} P(f_1^J, a_1^J \mid e_1^I)
$$
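A minimal sketch of the cost matrix computation, assuming the joint state occupation probabilities come from a trained alignment model (the `posterior` array below is a made-up stand-in for those quantities):

```python
import numpy as np

def cost_matrix(posterior):
    """Build C from state occupation probabilities.

    posterior[i, j] stands for p_j(i, f_1^J | e_1^I): the unnormalized
    probability mass for source word f_j aligning to target position i.
    Returns C with c_ij = -log p_j(i | f_1^J, e_1^I), normalized over
    the target positions i for each source position j.
    """
    norm = posterior / posterior.sum(axis=0, keepdims=True)
    return -np.log(norm + 1e-12)  # epsilon guards against log(0)

# Toy example: 3 target words (rows) x 4 source words (columns).
post = np.array([[0.8, 0.1, 0.1, 0.6],
                 [0.1, 0.7, 0.2, 0.3],
                 [0.1, 0.2, 0.7, 0.1]])
C = cost_matrix(post)
```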
Reordering during training
Reordering is a function of the source words
All source words must be aligned; this yields a new source sentence $\hat{f}_1^J$

First alignment, as a function of the source words:

$$
A_1 : \{1,\dots,J\} \to \{1,\dots,I\}, \qquad A_1(j) = \arg\min_i \, c_{ij}
$$

Create a second alignment as a function of the target words, based on a new cost matrix $\hat{C}$ obtained by reordering $C$ or by re-estimating it:

$$
A_2 : \{1,\dots,I\} \to \{1,\dots,J\}, \qquad A_2(i) = \arg\min_j \, \hat{c}_{ij}
$$
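Both alignment functions are plain minimizations over the columns or rows of the cost matrix. A short sketch (the toy matrix is illustrative); note that stably sorting the source positions by $A_1(j)$ is what yields the monotonized source sentence:

```python
import numpy as np

def source_to_target(C):
    """A1: {1,...,J} -> {1,...,I}; A1(j) = argmin_i c_ij, so every
    source word j is assigned exactly one target position."""
    return C.argmin(axis=0)

def target_to_source(C_hat):
    """A2: {1,...,I} -> {1,...,J}; A2(i) = argmin_j c_ij, computed on
    the reordered or re-estimated cost matrix."""
    return C_hat.argmin(axis=1)

# Toy cost matrix (3 target words x 4 source words):
C = np.array([[0.2, 2.3, 2.3, 0.5],
              [2.3, 0.4, 1.6, 1.2],
              [2.3, 1.6, 0.4, 2.3]])
# Stable sort keeps tied source words in their original order.
order = np.argsort(source_to_target(C), kind="stable")
```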
Reordering during training
If the cost matrix is re-estimated, a monotonic alignment cannot be guaranteed
Find a minimum-cost monotonic alignment path through the cost matrix using dynamic programming:

$$
C = \begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1J} \\
c_{21} & c_{22} & \cdots &        \\
\vdots &        & \ddots &        \\
c_{I1} &        & \cdots & c_{IJ}
\end{pmatrix}
$$
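A minimal sketch of that dynamic program (my own reading of the slide, not the authors' code): the path starts at $c_{11}$, ends at $c_{IJ}$, and may only advance the source position, the target position, or both, so it is monotonic by construction.

```python
import numpy as np

def min_monotonic_path_cost(C):
    """Minimum cost of a monotonic alignment path through the I x J
    cost matrix C (an edit-distance-style recurrence)."""
    I, J = C.shape
    D = np.full((I + 1, J + 1), np.inf)  # 1-based DP table, D[0,*] = inf
    D[1, 1] = C[0, 0]
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if (i, j) == (1, 1):
                continue
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j],      # advance target
                                            D[i, j - 1],      # advance source
                                            D[i - 1, j - 1])  # advance both
    return D[I, J]  # a backtrace over D would recover the path itself
```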
Reordering during search
During search, the source sentence needs to be permuted in all possible ways (J! permutations)
Represented as an FST with $2^J$ states
Expensive, therefore computed on demand
Beam pruning applied to eliminate unlikely permutations
Each state in the automaton represents a permutation of a subset of the words
Represented as a bit vector: each bit stands for an arc in the input FSA, set to 1 if the arc has been used on the path from the initial to the final state

$$
\hat{e}_1^I = \text{project-output}(\text{best}(\text{permute}(f_1^J) \circ T))
$$
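A sketch of the on-demand state expansion under this bit-vector representation (illustrative code; the function name and layout are my own):

```python
def expand_state(coverage, J):
    """Expand one state of the permutation automaton on demand.

    coverage is an integer bit vector: bit j is 1 iff arc j of the
    input FSA (source position j) has already been used on the path
    from the initial state. Returns the outgoing arcs as pairs
    (source position read, successor state)."""
    return [(j, coverage | (1 << j))
            for j in range(J)
            if not coverage & (1 << j)]

# The initial state is 0 (nothing covered); the single final state is
# (1 << J) - 1 (everything covered). The automaton has 2^J states and
# accepts exactly the J! permutations of the source positions.
```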
Reordering Constraints
Representation makes it easy to minimize/determinize the permutation automaton
For long sentences, still too complex
Need additional constraints on the permutation:
IBM constraints, inverse IBM constraints, local constraints, ITG constraints
Reordering Constraints
IBM constraints: at each state, one can translate any of the first l word positions that are still uncovered
Inverse IBM constraints: choose any uncovered position for translation, unless l-1 words at positions beyond the first uncovered position j have been translated (in that case, translate j)
Local constraints: choose the next word to translate from a window of size l around the first uncovered position (words in the window may be covered or uncovered)
ITG constraints: the input is a sequence of segments; initially each word is a segment, then segments are recursively combined into larger segments. At each combination step, the two segments may be swapped. Stop when the only segment left is the entire sentence.
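Apart from ITG, these constraints all amount to restricting which uncovered positions the permutation automaton may offer next. Hedged sketches of the IBM, inverse IBM, and local variants on top of the bit-vector coverage above (function names and tie-breaking details are my own):

```python
def first_uncovered(coverage, J):
    """Index of the leftmost source position not yet translated."""
    for j in range(J):
        if not coverage & (1 << j):
            return j
    return J

def ibm_successors(coverage, J, l):
    """IBM constraints: the first l still-uncovered positions."""
    succ = [j for j in range(J) if not coverage & (1 << j)]
    return succ[:l]

def inverse_ibm_successors(coverage, J, l):
    """Inverse IBM constraints: any uncovered position, unless l-1
    words beyond the first uncovered position j have already been
    translated; in that case j must be translated next."""
    j0 = first_uncovered(coverage, J)
    covered_beyond = sum(1 for j in range(j0 + 1, J)
                         if coverage & (1 << j))
    if covered_beyond >= l - 1:
        return [j0]
    return [j for j in range(J) if not coverage & (1 << j)]

def local_successors(coverage, J, l):
    """Local constraints: uncovered positions inside a window of size l
    starting at the first uncovered position (the window itself may
    span covered and uncovered words)."""
    start = first_uncovered(coverage, J)
    return [j for j in range(start, min(start + l, J))
            if not coverage & (1 << j)]
```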
Permutation probabilities
Monotonic orderings are given higher probability than non-monotonic ones
At each state, assign probability α to the outgoing arc that maintains monotonicity
Distribute the remaining probability mass 1 - α uniformly over all other arcs
Computed on demand at each state
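A sketch of how the arc weights at one state could be assigned; α = 0.9 is merely an illustrative value, not the paper's tuned setting:

```python
def arc_probabilities(successors, monotone_next, alpha=0.9):
    """Weight the outgoing arcs of one permutation-automaton state.

    successors: candidate next source positions at this state.
    monotone_next: the first uncovered position, i.e. the choice that
    keeps the ordering monotonic. That arc receives alpha; the
    remaining arcs share 1 - alpha uniformly."""
    k = len(successors)
    if k == 1:
        return {successors[0]: 1.0}
    share = (1.0 - alpha) / (k - 1)
    return {j: (alpha if j == monotone_next else share)
            for j in successors}

# Example: 4 uncovered positions, position 2 is the monotone choice.
print(arc_probabilities([2, 3, 4, 5], monotone_next=2))
```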
Data
Basic Travel Expressions Corpus (BTEC), part of IWSLT
Chinese-to-English (20K train, ~500 dev/test)
Japanese-to-English (20K train, 500 dev/test)
Italian-to-English (66K train, ~500 dev/test)
Evaluated using BLEU, WER, PER, NIST
Multiple reference translations in the first two cases
Experiments
4-gram language model over tuples
Moderate beam pruning for l > 3
Window size and type of reordering constraint optimized on the dev set
Rescoring of n-best lists
Japanese-English: highly non-monotonic; best performance with a 9-word window and inverse IBM constraints

Experiments
Chinese-English: moderately non-monotonic; a window size of 7 gave the best performance, but window size 4 is quite suitable for most sentences
Italian-English: almost monotonic; IBM or local reordering constraints with window size 3 or 4; improvement due to reordering not as large as for the other language pairs