10.11.19 09:43
Neural Machine Translation2 CMPT 413/825, Fall 2019
Jetic GūSchool of Computing Science
12 Nov. 2019
Overview• Focus: Neural Machine Translation
• Architecture: Encoder-Decoder Neural Network
• Main Story:
• Extension to Seq2Seq: Copy Mechanism
• Extension to Seq2Seq: Ensemble
• Extension to Seq2Seq: BeamSearch
• [Extra] Beyond Seq2Seq: Attention is all you need
• [Extra] Beyond NMT
2
DecoderEncoder
Sequence-to-Sequence
Review
Pr(E |F) = ΠtPr(e′ = et |F, e<t)CLM
?
RNN
<EOS>
what
RNN
did
did
RNN
prof.
prof.
RNN
sarkar
sarkar
RNN
eat
eat
RNN
?
<SOS>
RNN
what
RNNRNN RNN RNN RNN RNN
?
RNN
was
RNN
hat
RNN
prof.
RNN
sarkar
RNN
gegessen
RNN
P0 Review
Attention
Review
!
RNN RNN RNN
<SOS> prof likes
prof likes cheese
RNN
cheese
<EOS>! ! !
RNN
Encoder
RNNRNN RNN RNN RNN RNN
?
RNN
was
RNN
hat
RNN
prof.
RNN
sarkar
RNN
gegessen
RNN
Decoder
P0 Review
Attention
Review
!
Frau
in
Gelb
mit
einem
Kinderwagen
Mann
mit
blauem
Hemd
und
ein
Sour
ce (G
erm
an)
wom
an in
shirt
with 1a
stro
ller
1and 2a
man in 3a
blue
shirt
Prediction (English)
gold
P0 Review
Attention• Why does Attention work?
• What is in the context vector?
• What is in ?
• It’s alignment1 !
• learns to refer to useful information in src2
• similar to human attention: we pay attention to whatever is needed
α
Review
P0 XXXXXXX !
contextt = ∑i
αt,ihenci
scoret,i = f(henci , hdec
t )αt = softmax(scoreti)
1. CL2015008 [Bahdanau et al.] Neural Machine Translation by Jointly Learning to Align and Translate 2. CL2017342 [Ghader et Monz] What does Attention in Neural Machine Translation Pay Attention to?
P0 Review
Common Problems of NMT
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Frequent word are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
Review
P0 XXXXXXX
P0 Review
OOV
Review
P0 XXXXXXX
1. https://en.wikipedia.org/wiki/Zipf%27s_law
P0 Review
Zipf’s Law
Occ
urre
nce
Cou
nt
1
10
100
1000
10000
100000
1000000
Frequency Rank1 25000 50000
• rare word are exponentially less frequent than frequent words
• e.g. in HW4, 45%|65% (src|tgt) of the unique words occur once
Common Problems of NMT
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Frequent word are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
• Under translation
• Crucial information are left untranslated; premature generation
Review
P0 XXXXXXX
P0 Review
<EOS>
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Frequent word are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
• Under translation
• Crucial information are left untranslated; premature generation
Under-trans<EOS>
Review
P0 XXXXXXX
P0 Review
!
RNN
<EOS>
nobody
RNN
expects
expects
RNN
the
the
RNN
spanish
spanish
RNN
inquisition
inquisition
RNN
!
<SOS>
RNN
nobody
<EOS>
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Frequent word are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
• Under translation
• Crucial information are left untranslated; premature generation
Under-trans<EOS>
Review
P0 XXXXXXX
P0 Review
!
RNN
<EOS>
nobody
RNN
expects
expects
RNN
the
the
RNN
spanish
spanish
RNN
inquisition
inquisition
RNN
!
<SOS>
RNN
nobody
'inquisition': 0.49 <EOS>: 0.51
<EOS>
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Frequent word are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
• Under translation
• Crucial information are left untranslated; premature generation<EOS>
Under-trans<EOS>
Review
P0 XXXXXXX
P0 Review
!
RNN
<EOS>
nobody
RNN
expects
expects
RNN
the
the
RNN
spanish
spanish
RNN
inquisition
inquisition
RNN
!
<SOS>
RNN
nobody <EOS>
'inquisition': 0.49 <EOS>: 0.51
Common Problems of NMT
Review
P0 XXXXXXX
P0 Review
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Frequent word are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
• Under translation
• Crucial information are left untranslated; premature generation<EOS>
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Under translation
Common Problems of NMT
Review
P0 XXXXXXX
P0 Review
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Copy Mechanisms
• Char-level Encoder (oops for logogram, e.g. Chinese)
• Under translation
• Ensemble
• Beam search
• Coverage models
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Under translation
Copy Mechanism
Concep
t
P0 XXXXXXX
P1 Copy
!
RNN RNN RNN
<SOS> prof loves
prof loves cheese
RNN
cheese
<EOS>
! ! !
RNN
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
Copy Mechanism
Concep
t
P0 XXXXXXX
P1 Copy
!
RNN RNN RNN
<SOS> prof loves
prof loves cheese
RNN
cheese
<EOS>
! ! !
RNN
<UNK>
prof
liebt
kebab
<UNK>
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
Copy Mechanism
Concep
t
P0 XXXXXXX
P1 Copy
!
RNN RNN RNN
<SOS> prof loves
prof loves cheese
RNN
cheese
<EOS>
! ! !
RNN
<UNK>
prof
liebt
kebab
<UNK>
αt,1 = 0.1
αt,2 = 0.2
αt,3 = 0.7
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
Copy Mechanism
Concep
t
P0 XXXXXXX
P1 Copy
!
RNN RNN RNN
<SOS> prof loves
prof loves cheese
RNN
cheese
<EOS>
! ! !
RNN
<UNK>
prof
liebt
kebab
<UNK>
αt,1 = 0.1
αt,2 = 0.2
αt,3 = 0.7
kebab
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
Copy Mechanism
Concep
t
P0 XXXXXXX
P1 Copy
!
RNN RNN RNN
<SOS> prof loves
prof loves cheese
! !
RNN
prof
liebt
kebab
<UNK>
αt,1 = 0.1
αt,2 = 0.2
αt,3 = 0.7
kebab
• Sees <UNK> at step
• looks at attention weight
• replace <UNK> with the source word
t
αt
fargmaxiαt,i
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
*Pointer-Generator Copy Mechanism
Concep
t
P0 XXXXXXX
P1 Copy
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
!
RNN RNN RNN
<SOS> prof loves
prof loves cheese
RNN
cheese
<EOS>
! ! !
RNN
<UNK>
prof
liebt
kebab
<UNK>
αt,1 = 0.1
αt,2 = 0.2
αt,3 = 0.7
kebab
*Pointer-Generator Copy Mechanism
Concep
t
P0 XXXXXXX
P1 Copy
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
RNN
<UNK>
!
pg
!
RNN RNN RNN
<SOS> prof loves
prof loves cheese
RNN
cheese
<EOS>
! ! !
RNN
<UNK>
prof
liebt
kebab
<UNK>
αt,1 = 0.1
αt,2 = 0.2
αt,3 = 0.7
kebab
• binary classifier switch • use decoder output • use copy mechanism
*Pointer-Generator Copy Mechanism
Concep
t
P0 XXXXXXX
P1 Copy
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
RNN
<UNK>
!
pg
• Sees <UNK> at step , or
• looks at attention weight
• replace <UNK> with the source word
t pg([hdect ; contextt]) ≤ 0.5
αt
fargmaxiαt,i
*Pointer-Based Dictionary Fusion
Concep
t
P0 XXXXXXX
P1 Copy
1. CL2019331 [Gū et al.] Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation
RNN
<UNK>
!
pg
• Sees <UNK> at step , or
• looks at attention weight
• replace <UNK> with translation of the source word
t pg([hdect ; contextt]) ≤ 0.5
αt
dict( fargmaxiαt,i)
Ensemble
• Similar to voting mechanism, but with probabilities
• multiple models of different parameters (usually from different checkpoints of the same training instance)
• use the output with the highest probability across all models
Concep
t
P0 XXXXXXX
P2 Ensemble
Ensemble
Demo
P0 XXXXXXX
P2 Ensemble
!
RNN RNN RNN
<SOS> prof loves
prof loves cheese
RNN
cheese
<EOS>
! ! !
RNN
Ensemble
Demo
P0 XXXXXXX
P2 Ensemble
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNNRNN RNN RNN
prof loves cheese
RNN
<EOS>
RNNRNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
Ensemble
Demo
P0 XXXXXXX
P2 Ensemble
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNNRNN RNN RNN
prof loves cheese
RNN
<EOS>
RNNRNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
seq2seq_E049.pt
Ensemble
Demo
P0 XXXXXXX
P2 Ensemble
RNN cp2 cp2
prof loves cheese
cp2
<EOS>
cp2
RNN cp3 cp3
prof loves pizza
cp3
<EOS>
cp3
RNN cp1 cp1
prof loves kebab
cp1
<EOS>
cp1
seq2seq_E049.pt
seq2seq_E048.pt
seq2seq_E047.pt
Ensemble
Demo
P0 XXXXXXX
P2 Ensemble
RNN cp2 cp2
prof loves cheese
cp2
<EOS>
cp2
RNN cp3 cp3
prof loves pizza
cp3
<EOS>
cp3
RNN cp1 cp1
prof loves kebab
cp1
<EOS>
cp1
seq2seq_E049.pt
seq2seq_E048.pt
seq2seq_E047.pt
'kebab': 0.9 'cheese': 0.05 'pizza': 0.05
……
'kebab': 0.3 'cheese': 0.55 'pizza': 0.05
……
'kebab': 0.3 'cheese': 0.05 'pizza': 0.55
……
Ensemble
Demo
P0 XXXXXXX
P2 Ensemble
RNN cp2 cp2
prof loves cheese
cp2
<EOS>
cp2
RNN cp3 cp3
prof loves pizza
cp3
<EOS>
cp3
RNN cp1 cp1
prof loves kebab
cp1
<EOS>
cp1
seq2seq_E049.pt
seq2seq_E048.pt
seq2seq_E047.pt
'kebab': 0.9 'cheese': 0.05 'pizza': 0.05
……
'kebab': 0.3 'cheese': 0.55 'pizza': 0.05
……
'kebab': 0.3 'cheese': 0.05 'pizza': 0.55
……
'kebab': 0.3 'cheese': 0.55 'pizza': 0.05
……
'kebab': 0.3 'cheese': 0.05 'pizza': 0.55
……
Ensemble
Demo
P0 XXXXXXX
P2 Ensemble
RNN cp2 cp2
prof loves cheese
cp2
<EOS>
cp2
RNN cp3 cp3
prof loves pizza
cp3
<EOS>
cp3
RNN cp1 cp1
prof loves kebab
cp1
<EOS>
cp1
seq2seq_E049.pt
seq2seq_E048.pt
seq2seq_E047.pt
'kebab': 0.9 'cheese': 0.05 'pizza': 0.05
……
kebab
kebab
Ensemble
Demo
P0 XXXXXXX
P2 Ensemble
RNN cp2 cp2
prof loves cheese
cp2
<EOS>
cp2
RNN cp3 cp3
prof loves pizza
cp3
<EOS>
cp3
RNN cp1 cp1
prof loves kebab
cp1
<EOS>
cp1
seq2seq_E049.pt
seq2seq_E048.pt
seq2seq_E047.pt
kebab
kebab
BeamSearch
• Consider multiple hypothesis at each step , formulate decoding as a search programme on a tree
t
Concep
t
P0 XXXXXXX
P3 BeamSearch
BeamSearch
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
johnterryeric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
john
eric
terry
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
P = 0.5
P = 0.4
P = 0.09
P = 0.01
john
eric
terry
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
P = 0.5
P = 0.4
P = 0.09
P = 0.01
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
P = 0.5
P = 0.4
P = 0.09
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof cheese
RNN
<EOS>
RNN
RNN
RNN
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
wants
loves
meets
kicks
is
has
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
wants
loves
meets
kicks
is
has
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
wants
loves
meets
kicks
is
has
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
loves
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
loves
RNN
RNN
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
loves
RNN
RNN
pizza
kebab
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
loves
RNN
RNN
pizza
kebab
RNN
RNN
<EOS>
<EOS>
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
loves
RNN
RNN
pizza
kebab
RNN
RNN
<EOS>
<EOS>
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN RNN RNN
prof loves cheese
RNN
<EOS>
RNN
RNN
RNN
likes
loves
RNN
RNN
pizza
kebab
RNN
RNN
<EOS>
<EOS>
Output: "prof loves cheese"
john
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN
prof loves cheese <EOS>
<SOS>
likes
loves
pizza
kebab
<EOS>
<EOS>
Output: "prof loves cheese"
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN
prof loves cheese <EOS>
<SOS>
loves
pizza
kebab
<EOS>
<EOS>
Output: "prof loves cheese"
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN
prof loves cheese <EOS>
<SOS>
loves
pizza
kebab
<EOS>
<EOS>
Output: "prof loves cheese"
P(E |F)
P(E |F)
P(E |F)
eric
BeamSearch (Size=3)
Demo
P0 XXXXXXX
P3 BeamSearch
RNN
prof loves cheese <EOS>
<SOS>
loves
pizza
kebab
<EOS>
<EOS>
Output: "prof loves cheese"
P('prof loves cheese <EOS>' |F)
P('eric loves kebab <EOS>' |F)
P('eric loves pizza <EOS>' |F)
BeamSearch
• Consider multiple hypotheses at each step , formulate decoding as a search programme on a tree
• Consider multiple complete translation hypotheses, but reduces the total search space from to
t
|VE |m
Concep
t
P0 XXXXXXX
P3 BeamSearch
BeamSearch
• Consider multiple hypotheses at each step , formulate decoding as a search programme on a tree
• Consider multiple complete translation hypotheses, but reduces the total search space from to
t
|VE |m
Concep
t
P0 XXXXXXX
P3 BeamSearch
batchSize × m
Self-Attentive NMT
Review
P0 XXXXXXX
Extra1 Transformer
RNN
Encoder Representations
I like cheese
score calculation
!
Encoder Representations
I like cheese
x x x+ + =
DecoderRNNState
α1 α2 α3;[ ]ht
• Aggregating src information , with as query
henc1:|F|
hdect
Self-Attentive NMT
Review
P0 XXXXXXX
Extra1 Transformer
• Aggregating src information , with as query
• : aggregating information , with as query
• : aggregating information , with as query
• Can we replace the RNN with attention?
henc1:|F| hdec
t
h enc f<i fi
h enc f>i fi
Self-Attentive NMT
• Transformer1
• Encoder-Decoder
• No RNN -> all RNNs in Seq2Seq replaced by identical self-attention blocks
• Multi-headed attention blocks*
• read the paper for more information
Concep
t
P0 XXXXXXX
Extra1 Transformer
1. CL2017244 [Vaswani et al.] Attention Is All You Need
Self-Attentive NMT
• Does everything Seq2Seq does
• BERT: Bidirectional Encoder Representation of Transformer
Concep
t
P0 XXXXXXX
Extra1 Transformer
1. CL2017244 [Vaswani et al.] Attention Is All You Need 2. CL2018460 [Devlin et al.] BERT Pre-training of Deep Bidirectional Transformers for Language Understanding
Beyond NMT
• NMT weaknesses
• Longer src sentences -> SMT does it better, augment attention?
• Growing lexicon/Terminologies -> Pointer-based Dictionary Fusion
• Sensitive domain -> offline computing, human-involved system to ensure accuracy
• Mobile Platform deployment -> model compression
Concep
t
P0 XXXXXXX
Extra2 Beyond NMT
Beyond NMT
• NMT weaknesses
• Longer src sentences -> SMT does it better, augment attention?
• Growing lexicon/Terminologies -> Pointer-based Dictionary Fusion
• Sensitive domain -> offline computing, human-involved system to ensure accuracy
• Mobile Platform deployment -> model compression
Concep
t
P0 XXXXXXX
Extra2 Beyond NMT
HYBRID
PURISM FOREVER!
MAKE NMT GREAT AGAIN
ATTENTION IS ALL YOU
NEED
HYBRIDS ARE DUMB
Beyond NMT
Concep
t
P0 XXXXXXX
Extra2 Beyond NMT
HYBRID
My work here is done, thank you.