Novel Reordering Approaches in Phrase-Based Statistical Machine
Translation
S. Kanthak, D. Vilar, E. Matusov, R. Zens & H. Ney
ACL Workshop on Building and Using Parallel Text 2005
[Cartoon: a British guy and a French guy, from Goscinny/Uderzo: Astérix chez les Bretons]
Problem: Reordering
Potentially many long-distance reorderings
Probably the largest single source of errors in current MT systems:
Would you like to go to the cinema with me on Saturday?
Möchtest du mit mir am Samstag ins Kino gehen?
MT output with reordering errors:
two weeks ago in the south of france 17 years old and the sixteen years old robert friend romain with a gun and a baseball bat killed in unfounded, without motive, as a Sunday evening their johan on television
Reference translation:
one Sunday evening a fortnight ago in the south of france, johan, aged 17, and robert, aged 16, murdered their childhood friend, romain, with a firearm and a baseball bat, for no reason or motive, just like on tv
Basic Translation Approach: WFSTs
Focus on translation of spoken language
Translation needs to be integrated with speech recognition
ASR systems use strict left-to-right finite-state decoding (HMMs)
FST-based translation makes integration easy
Notation
$e_1^I$: target language sentence
$f_1^J$: source language sentence
$\tilde{e}_1^J$: segmentation of $e_1^I$ into $J$ phrases
$(f_j, \tilde{e}_j)$: aligned tuple of source word and target phrase
$A$: alignment
Instead of using a conditional model $P(e_1^I \mid f_1^J)$, use a joint-probability model:

$$
\begin{aligned}
\hat{e}_1^I &= \arg\max_{e_1^I} P(e_1^I, f_1^J) \\
            &= \arg\max_{e_1^I} \max_A P(e_1^I, f_1^J, A) \\
            &= \arg\max_{e_1^I} \max_A P(A)\, P(e_1^I, f_1^J \mid A) \\
            &= \arg\max_{e_1^I} \max_A \prod_{j=1,\dots,J} P(f_j, \tilde{e}_j \mid f_1^{j-1}, \tilde{e}_1^{j-1}, A) \\
            &= \arg\max_{e_1^I} \max_A \prod_{j=1,\dots,J} P(f_j, \tilde{e}_j \mid f_{j-m}^{j-1}, \tilde{e}_{j-m}^{j-1}, A)
\end{aligned}
$$
Each source word is aligned with a target phrase (which can be the empty phrase)
For a uniform probability distribution over all alignments A, the translation model becomes an m-gram model over pairs of source words and target phrases
Can be expressed as a WFST T:

$$
\hat{e}_1^I = \text{project-output}(\text{best}(f_1^J \circ T))
$$
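To make the decision rule concrete, here is a minimal, hypothetical sketch of scoring one tuple sequence with a bigram (m = 2) joint model; all probabilities and names below are made up for illustration. In the actual system this model is compiled into the transducer T and searched over, rather than evaluated sentence by sentence:

```python
import math

# Hypothetical bigram probabilities over (source word, target phrase)
# tuples, as they might be estimated from monotonized training data.
BIGRAM = {
    (("<s>", "<s>"), ("möchtest", "would you like")): 0.5,
    (("möchtest", "would you like"), ("du", "")): 0.6,
    (("du", ""), ("gehen", "to go")): 0.4,
}
UNIFORM_BACKOFF = 1e-4  # crude stand-in for real smoothing

def joint_log_prob(tuples):
    """Score a tuple sequence with the joint bigram model:
    sum_j log P(f_j, e~_j | f_{j-1}, e~_{j-1})."""
    total, prev = 0.0, ("<s>", "<s>")
    for t in tuples:
        total += math.log(BIGRAM.get((prev, t), UNIFORM_BACKOFF))
        prev = t
    return total

# Example: score three tuples of a (reordered) German source sentence.
print(joint_log_prob([("möchtest", "would you like"),
                      ("du", ""),
                      ("gehen", "to go")]))
```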
Reordering problem
FST model does not work well when the alignment is non-monotonic (does not satisfy $a_i \le a_{i'}$ for $i < i'$)
Bad for languages with very different word order
Apply reordering during training and search to either source or target language sentences
Here: reorder source words prior to training such that alignments become monotonic for all sentences
Reordering during training
Perform bidirectional word alignment
Estimate a cost matrix C for each sentence pair; $c_{ij}$ indicates the local cost of aligning source word $f_j$ to target word $e_i$
Cost is derived from state occupation probabilities:

$$
c_{ij} := -\log p_j(i \mid f_1^J, e_1^I)
$$

where $p_j(i \mid f_1^J, e_1^I)$ is the probability of $e_i$ occurring at target sentence position $i$ as the translation of word $f_j$, obtained by normalizing the state occupation probability over all target sentence positions:

$$
p_j(i \mid f_1^J, e_1^I) = \frac{p_j(i, f_1^J \mid e_1^I)}{\sum_{i'=1}^{I} p_j(i', f_1^J \mid e_1^I)},
\qquad
p_j(i, f_1^J \mid e_1^I) = \sum_{a_1^J :\, a_j = i} P(f_1^J, a_1^J \mid e_1^I)
$$
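A minimal sketch of the cost matrix computation, assuming the joint state occupation probabilities come from a trained alignment model (the `posterior` array below is a made-up stand-in for those quantities):

```python
import numpy as np

def cost_matrix(posterior):
    """Build C from state occupation probabilities.

    posterior[i, j] stands for p_j(i, f_1^J | e_1^I): the unnormalized
    probability mass for source word f_j aligning to target position i.
    Returns C with c_ij = -log p_j(i | f_1^J, e_1^I), normalized over
    the target positions i for each source position j.
    """
    norm = posterior / posterior.sum(axis=0, keepdims=True)
    return -np.log(norm + 1e-12)  # epsilon guards against log(0)

# Toy example: 3 target words (rows) x 4 source words (columns).
post = np.array([[0.8, 0.1, 0.1, 0.6],
                 [0.1, 0.7, 0.2, 0.3],
                 [0.1, 0.2, 0.7, 0.1]])
C = cost_matrix(post)
```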
Reordering during training
Reordering is a function of the source words
All source words must be aligned; this yields a new source sentence $\hat{f}_1^J$

First alignment, as a function of the source words:

$$
A_1 : \{1,\dots,J\} \to \{1,\dots,I\}, \qquad A_1(j) = \arg\min_i \, c_{ij}
$$

Create a second alignment as a function of the target words, based on a new cost matrix $\hat{C}$ obtained by reordering $C$ or by re-estimating it:

$$
A_2 : \{1,\dots,I\} \to \{1,\dots,J\}, \qquad A_2(i) = \arg\min_j \, \hat{c}_{ij}
$$
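Both alignment functions are plain minimizations over the columns or rows of the cost matrix. A short sketch (the toy matrix is illustrative); note that stably sorting the source positions by $A_1(j)$ is what yields the monotonized source sentence:

```python
import numpy as np

def source_to_target(C):
    """A1: {1,...,J} -> {1,...,I}; A1(j) = argmin_i c_ij, so every
    source word j is assigned exactly one target position."""
    return C.argmin(axis=0)

def target_to_source(C_hat):
    """A2: {1,...,I} -> {1,...,J}; A2(i) = argmin_j c_ij, computed on
    the reordered or re-estimated cost matrix."""
    return C_hat.argmin(axis=1)

# Toy cost matrix (3 target words x 4 source words):
C = np.array([[0.2, 2.3, 2.3, 0.5],
              [2.3, 0.4, 1.6, 1.2],
              [2.3, 1.6, 0.4, 2.3]])
# Stable sort keeps tied source words in their original order.
order = np.argsort(source_to_target(C), kind="stable")
```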
Reordering during training
If the cost matrix is re-estimated, a monotonic alignment cannot be guaranteed
Find a minimum-cost monotonic alignment path through the cost matrix using dynamic programming:

$$
C = \begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1J} \\
c_{21} & c_{22} & \cdots &        \\
\vdots &        & \ddots &        \\
c_{I1} &        & \cdots & c_{IJ}
\end{pmatrix}
$$
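A minimal sketch of that dynamic program (my own reading of the slide, not the authors' code): the path starts at $c_{11}$, ends at $c_{IJ}$, and may only advance the source position, the target position, or both, so it is monotonic by construction.

```python
import numpy as np

def min_monotonic_path_cost(C):
    """Minimum cost of a monotonic alignment path through the I x J
    cost matrix C (an edit-distance-style recurrence)."""
    I, J = C.shape
    D = np.full((I + 1, J + 1), np.inf)  # 1-based DP table, D[0,*] = inf
    D[1, 1] = C[0, 0]
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if (i, j) == (1, 1):
                continue
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j],      # advance target
                                            D[i, j - 1],      # advance source
                                            D[i - 1, j - 1])  # advance both
    return D[I, J]  # a backtrace over D would recover the path itself
```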
Reordering during search
During search, the source sentence needs to be permuted in all possible ways (J! permutations)
Represented as an FST with $2^J$ states
Expensive, therefore computed on demand
Beam pruning applied to eliminate unlikely permutations
Each state in the automaton represents a permutation of a subset of the words
Represented as a bit vector: each bit stands for an arc in the input FSA, set to 1 if the arc has been used on the path from the initial to the final state

$$
\hat{e}_1^I = \text{project-output}(\text{best}(\text{permute}(f_1^J) \circ T))
$$
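A sketch of the on-demand state expansion under this bit-vector representation (illustrative code; the function name and layout are my own):

```python
def expand_state(coverage, J):
    """Expand one state of the permutation automaton on demand.

    coverage is an integer bit vector: bit j is 1 iff arc j of the
    input FSA (source position j) has already been used on the path
    from the initial state. Returns the outgoing arcs as pairs
    (source position read, successor state)."""
    return [(j, coverage | (1 << j))
            for j in range(J)
            if not coverage & (1 << j)]

# The initial state is 0 (nothing covered); the single final state is
# (1 << J) - 1 (everything covered). The automaton has 2^J states and
# accepts exactly the J! permutations of the source positions.
```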
Reordering Constraints
Representation makes it easy to minimize/determinize the permutation automaton
For long sentences, still too complex
Need additional constraints on the permutation:
IBM constraints, inverse IBM constraints, local constraints, ITG constraints
Reordering Constraints
IBM constraints: at each state, one can translate any of the first l word positions that are still uncovered
Inverse IBM constraints: choose any uncovered position for translation, unless l-1 words at positions beyond the first uncovered position j have been translated (in that case, translate j)
Local constraints: choose the next word to translate from a window of size l around the first uncovered position (words in the window may be covered or uncovered)
ITG constraints: the input is a sequence of segments; initially each word is a segment, then segments are recursively combined into larger segments. At each combination step, the two segments may be swapped. Stop when the only segment left is the entire sentence.
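Apart from ITG, these constraints all amount to restricting which uncovered positions the permutation automaton may offer next. Hedged sketches of the IBM, inverse IBM, and local variants on top of the bit-vector coverage above (function names and tie-breaking details are my own):

```python
def first_uncovered(coverage, J):
    """Index of the leftmost source position not yet translated."""
    for j in range(J):
        if not coverage & (1 << j):
            return j
    return J

def ibm_successors(coverage, J, l):
    """IBM constraints: the first l still-uncovered positions."""
    succ = [j for j in range(J) if not coverage & (1 << j)]
    return succ[:l]

def inverse_ibm_successors(coverage, J, l):
    """Inverse IBM constraints: any uncovered position, unless l-1
    words beyond the first uncovered position j have already been
    translated; in that case j must be translated next."""
    j0 = first_uncovered(coverage, J)
    covered_beyond = sum(1 for j in range(j0 + 1, J)
                         if coverage & (1 << j))
    if covered_beyond >= l - 1:
        return [j0]
    return [j for j in range(J) if not coverage & (1 << j)]

def local_successors(coverage, J, l):
    """Local constraints: uncovered positions inside a window of size l
    starting at the first uncovered position (the window itself may
    span covered and uncovered words)."""
    start = first_uncovered(coverage, J)
    return [j for j in range(start, min(start + l, J))
            if not coverage & (1 << j)]
```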
Permutation probabilities
Monotonic orderings are given higher probability than non-monotonic ones
At each state, assign probability α to the outgoing arc that maintains monotonicity
Distribute the remaining probability mass 1 - α uniformly over all other arcs
Computed on demand at each state
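A sketch of how the arc weights at one state could be assigned; α = 0.9 is merely an illustrative value, not the paper's tuned setting:

```python
def arc_probabilities(successors, monotone_next, alpha=0.9):
    """Weight the outgoing arcs of one permutation-automaton state.

    successors: candidate next source positions at this state.
    monotone_next: the first uncovered position, i.e. the choice that
    keeps the ordering monotonic. That arc receives alpha; the
    remaining arcs share 1 - alpha uniformly."""
    k = len(successors)
    if k == 1:
        return {successors[0]: 1.0}
    share = (1.0 - alpha) / (k - 1)
    return {j: (alpha if j == monotone_next else share)
            for j in successors}

# Example: 4 uncovered positions, position 2 is the monotone choice.
print(arc_probabilities([2, 3, 4, 5], monotone_next=2))
```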
Data
Basic Travel Expressions Corpus (BTEC), part of IWSLT
Chinese-to-English (20K train, ~500 dev/test)
Japanese-to-English (20K train, 500 dev/test)
Italian-to-English (66K train, ~500 dev/test)
Evaluated using BLEU, WER, PER, NIST
Multiple reference translations in the first two cases
Experiments
4-gram language model over tuples
Moderate beam pruning for l > 3
Window size and type of reordering constraint optimized on the dev set
Rescoring of n-best lists
Japanese-English: highly non-monotonic; best performance with a 9-word window and inverse IBM constraints

Experiments
Chinese-English: moderately non-monotonic; a window size of 7 gave the best performance, but window size 4 is quite suitable for most sentences
Italian-English: almost monotonic; IBM or local reordering constraints with window size 3 or 4; improvement due to reordering not as large as for the other language pairs