Slide 1: Softmax Alternatives in Neural MT
Graham Neubig, 5/24/2017
Slide 2: Neural MT Models
[Figure: an encoder-decoder reads the source "kouen wo okonaimasu </s>" and generates the target one word at a time, taking the argmax of P(e1|F), P(e2|F, e1), P(e3|F, e1..2), P(e4|F, e1..3) to produce "give a talk </s>".]
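As a minimal sketch of this one-word-at-a-time argmax decoding, assuming a hypothetical decoder_step(h, prev_word) that stands in for the encoder-decoder and returns the distribution p(e_i | F, e_1..i-1) plus the next hidden state:

```python
import numpy as np

def greedy_decode(decoder_step, h0, eos_id, max_len=50):
    # Generate the target one word at a time, taking the argmax of
    # p(e_i | F, e_1..i-1) at each step until </s> is produced.
    h, prev, output = h0, None, []
    for _ in range(max_len):
        probs, h = decoder_step(h, prev)  # hypothetical model step
        prev = int(np.argmax(probs))      # choose the highest-probability word
        output.append(prev)
        if prev == eos_id:                # stop at </s>
            break
    return output
```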
Slide 3: How we Calculate Probabilities
p(e_i | h_i) = softmax(W h_i + b)
(next-word probability = softmax of weights × hidden context + bias)
W = [w_{*,1}; w_{*,2}; w_{*,3}; ...] (one output embedding per word), b = [b_1, b_2, b_3, ...]
In other words, the score of word k is:
s(e_i = k | c_i) = w_{*,k} · c_i + b_k
i.e. the closeness of the output embedding to the context, plus a bias. Choose the word with the highest score.
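A minimal numpy sketch of this full softmax layer (variable names are mine, not from the slides):

```python
import numpy as np

def softmax_probs(W, b, h):
    # Each row w_k of W is the output embedding of word k, so
    # s_k = w_k . h + b_k scores how close the output embedding is
    # to the hidden context, plus a per-word bias.
    s = W @ h + b          # scores over the whole vocabulary, shape (|V|,)
    s = s - s.max()        # shift for numerical stability
    e = np.exp(s)
    return e / e.sum()     # p(e_i | h_i) = softmax(W h_i + b)
```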
Slide 4: A Visual Example
[Figure: the matrix-vector picture of p = softmax(W h + b).]
Slide 5: Problems w/ Softmax
- Computationally inefficient at training time
- Computationally inefficient at test time
- Many parameters
- Sub-optimal accuracy
Slide 6: Calculation/Parameter-Efficient Softmax Variants
Slide 7: Negative Sampling / Noise Contrastive Estimation
- Calculate the denominator over a subset
[Figure: scores W c + b are computed only for the true word and for negative samples drawn according to a distribution q (weights W', biases b').]
Slide 8: Lots of Alternatives!
- Noise contrastive estimation: train a model to discriminate between true and false examples
- Negative sampling: e.g. word2vec
- BlackOut
Ref: Dyer 2014. Notes on Noise Contrastive Estimation and Negative Sampling
Used in MT: Eriguchi et al. 2016. Tree-to-Sequence Attentional Neural Machine Translation
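A rough sketch of the "denominator over a subset" idea in its sampled-softmax flavor (NCE's exact objective differs slightly; see the Dyer notes above). The log K·q(w) correction term and all names are my assumptions:

```python
import numpy as np

def sampled_softmax_nll(W, b, h, true_id, q, K=100, rng=np.random):
    # Approximate -log p(true_id | h) by normalizing only over the
    # true word plus K negative samples drawn from noise distribution q.
    neg_ids = rng.choice(len(q), size=K, p=q)
    ids = np.concatenate(([true_id], neg_ids))
    s = W[ids] @ h + b[ids]             # scores for the subset only
    s -= np.log(K * q[ids] + 1e-10)     # correct for the sampling bias
    s -= s.max()                        # stable log-sum-exp
    return -(s[0] - np.log(np.exp(s).sum()))
```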
Slide 9: GPUifying Noise Contrastive Estimation
- Creating the negative samples and arranging memory is expensive on GPU
- Simple solution: sample the negative samples once for each mini-batch
Ref: Zoph et al. 2016. Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies
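A sketch of the mini-batch trick (my formulation): one shared negative set per batch turns the sampled output layer into a single dense matmul, which is what GPUs are good at:

```python
import numpy as np

def batch_shared_negative_scores(W, b, H, true_ids, q, K=512, rng=np.random):
    # H: (batch, dim) hidden states; true_ids: (batch,) gold word ids.
    # Draw ONE set of K negatives for the whole mini-batch, so scoring
    # is a single (batch, dim) x (dim, K) matmul -- GPU-friendly.
    neg_ids = rng.choice(len(q), size=K, p=q)
    neg_scores = H @ W[neg_ids].T + b[neg_ids]                  # (batch, K)
    true_scores = (H * W[true_ids]).sum(axis=1) + b[true_ids]   # (batch,)
    return true_scores, neg_scores
```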
Slide 10: Summary of Negative Sampling Approaches
- Train time efficiency: much faster!
- Test time efficiency: same
- Number of parameters: same
- Test time accuracy: a little worse?
- Code complexity: moderate
Slide 11: Vocabulary Selection
- Select the vocabulary on a per-sentence basis
Refs: L'Hostis et al. 2016. Vocabulary Selection Strategies for Neural Machine Translation; Mi et al. 2016. Vocabulary Manipulation for Neural Machine Translation
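A sketch of one common recipe (the lexicon source and cutoffs are my assumptions): build a per-sentence candidate set from a translation lexicon plus the most frequent words, then softmax only over those rows:

```python
import numpy as np

def select_vocab(src_words, lexicon, common_ids, per_word=10):
    # Candidate outputs: frequent words plus the top translation
    # candidates of each source word from a precomputed lexicon.
    cands = set(common_ids)
    for w in src_words:
        cands.update(lexicon.get(w, ())[:per_word])
    return np.array(sorted(cands))

def restricted_softmax(W, b, h, cand_ids):
    # Softmax over only the selected candidate rows of W.
    s = W[cand_ids] @ h + b[cand_ids]
    s -= s.max()
    e = np.exp(s)
    return e / e.sum()   # distribution over cand_ids only
```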
Slide 12: Summary of Vocabulary Selection
- Train time efficiency: a little faster
- Test time efficiency: much faster!
- Number of parameters: same
- Test time accuracy: better or a little worse
- Code complexity: moderate
Slide 13: Class-based Softmax
- Predict P(class | hidden), then P(word | class, hidden)
- Because P(w | c, h) is 0 for all but one class, computation is efficient
Ref: Goodman 2001. Classes for Fast Maximum Entropy Training
[Figure: two softmaxes, softmax(W_c h + b_c) over classes and softmax(W_w h + b_w) over the words within a class.]
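A sketch of the two-step factorization, assuming words are pre-grouped into classes with per-class weight matrices (the data layout is mine):

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def class_factored_prob(Wc, bc, Ww, bw, h, word_id, word2class):
    # p(word | h) = p(class | h) * p(word | class, h).
    # Ww[c], bw[c] hold the output embeddings for words of class c only,
    # so each softmax is over a small set instead of the full vocabulary.
    c, idx = word2class[word_id]       # class id and index within the class
    p_class = softmax(Wc @ h + bc)     # softmax over classes
    p_word = softmax(Ww[c] @ h + bw[c])
    return p_class[c] * p_word[idx]
```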
Slide 14: Hierarchical Softmax
- Tree-structured prediction of the word ID
- Usually modeled as a sequence of binary decisions
Ref: Morin and Bengio 2005. Hierarchical Probabilistic NNLM
[Figure: the binary decision path 0 1 1 1 0 leads to word 14.]
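A sketch of the binary-decision product, with my own path representation (the internal nodes on the root-to-leaf path plus the branch taken at each):

```python
import numpy as np

def hier_softmax_prob(node_vecs, node_bias, h, path_nodes, path_bits):
    # p(word | h) is a product of sigmoid binary decisions down the tree,
    # e.g. bits 0 1 1 1 0 lead to word 14 as on the slide.
    p = 1.0
    for n, bit in zip(path_nodes, path_bits):
        p_branch = 1.0 / (1.0 + np.exp(-(node_vecs[n] @ h + node_bias[n])))
        p *= p_branch if bit == 1 else (1.0 - p_branch)
    return p
```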
Slide 15: Summary of Class-based Softmaxes
- Train time efficiency: faster on CPU, a pain to GPU
- Test time efficiency: worse
- Number of parameters: more
- Test time accuracy: slightly worse to slightly better
- Code complexity: high
Slide 16: Binary Code Prediction
- Just directly predict the binary code of the word ID
- Like hierarchical softmax, but with shared weights at every layer → fewer parameters, easy to GPU
Ref: Oda et al. 2017. Neural Machine Translation via Binary Code Prediction
[Figure: σ(W h + b) = 01110 → word 14.]
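A minimal sketch of direct code prediction: σ(W h + b) gives one probability per bit, and thresholding the bits yields the word ID (names are mine):

```python
import numpy as np

def predict_word_by_code(W, b, h):
    # One sigmoid per bit of the word ID; W has ~log2(|V|) rows
    # instead of |V|, and its weights are shared across all words.
    probs = 1.0 / (1.0 + np.exp(-(W @ h + b)))
    bits = (probs > 0.5).astype(int)
    word_id = int("".join(map(str, bits)), 2)   # e.g. 01110 -> 14
    return bits, word_id
```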
Slide 17: Two Improvements
- Hybrid model: a regular softmax over the most frequent words, binary codes for the rest
- Error-correcting codes: redundant bits that let the decoder recover from single-bit prediction errors
Slide 18: Summary of Binary Code Prediction
- Train time efficiency: faster
- Test time efficiency: faster (12x on CPU!)
- Number of parameters: fewer
- Test time accuracy: slightly worse
- Code complexity: moderate
Slide 19: Parameter Sharing
Slide 20: Parameter Sharing
- We have two |V| × |h| matrices in the decoder:
  - input word embeddings, which we look up and feed into the RNN
  - output word embeddings, which form the weight matrix W in the softmax
- Simple idea: tie their weights together
Refs: Press and Wolf 2016. Using the Output Embedding to Improve Language Models; Inan et al. 2016. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
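A minimal sketch of weight tying: one |V| × d matrix serves as both the input embedding table and the softmax weight matrix W (the class layout is mine):

```python
import numpy as np

class TiedEmbeddingSoftmax:
    # One shared |V| x d matrix E: rows are looked up as input
    # embeddings AND reused as the output projection W.
    def __init__(self, vocab_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.01, (vocab_size, dim))
        self.b = np.zeros(vocab_size)

    def embed(self, word_id):
        return self.E[word_id]            # input embedding lookup

    def output_probs(self, h):
        s = self.E @ h + self.b           # reuse E as the softmax weights
        s -= s.max()
        e = np.exp(s)
        return e / e.sum()
```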
Slide 21: Summary of Parameter Sharing
- Train time efficiency: same
- Test time efficiency: same
- Number of parameters: fewer
- Test time accuracy: better
- Code complexity: low
Slide 22: Incorporating External Information
Slide 23: Problems w/ Lexical Choice in Neural MT
Ref: Arthur et al. 2016. Incorporating Discrete Translation Lexicons into Neural Machine Translation
Slide 24: When Does Translation Succeed? (in Output Embedding Space)
[Figure: for "I come from Tunisia", the context vector h1 sits in output embedding space near w_{*,tunisia}, w_{*,norway}, w_{*,sweden}, w_{*,nigeria} (a cluster of countries) and far from w_{*,eat}, w_{*,consume}; translation succeeds when h1 is closest to w_{*,tunisia}.]
Softmax Alternatives in Neural MT
When Does Translation Fail?Embeddings Version
I come from Tunisia
w*,tunisia
w*,norway
w*,sweden
w*,nigeria
h1
w*,eat
w*,consume
Slide 26: When Does Translation Fail? Bias Version
[Figure: the same space with w_{*,china} added; even with h1 near w_{*,tunisia}, the biases b_tunisia = -0.5 and b_china = 4.5 make "china" win.]
Slide 27: What about Traditional Symbolic Models?
his father likes Tunisia ↔ kare no chichi wa chunijia ga suki da
P(kare|his) = 0.5, P(no|his) = 0.5, P(chichi|father) = 1.0, P(chunijia|Tunisia) = 1.0, P(suki|likes) = 0.5, P(da|likes) = 0.5
1-to-1 alignment
Slide 28: Even if We Make a Mistake...
his father likes Tunisia ↔ kare no chichi wa chunijia ga suki da
P(kare|his) = 0.5, P(no|his) = 0.5, P(chichi|Tunisia) = 1.0 ☓, P(chunijia|father) = 1.0 ☓, P(suki|likes) = 0.5, P(da|likes) = 0.5
→ Different mistakes than neural MT; soft alignment is possible
Slide 29: Calculating Lexicon Probabilities
For source "I come from Tunisia" with attention weights 0.05, 0.01, 0.02, 0.93, each source word has a word-by-word lexicon probability column over target candidates; the attention-weighted sum of the columns gives the conditional lexicon probability:

target      I     come  from  Tunisia | conditional
watashi     0.6   0.03  0.01  0.0     | 0.03
ore         0.2   0.01  0.02  0.0     | 0.01
...
kuru        0.01  0.3   0.01  0.0     | 0.00
kara        0.02  0.1   0.5   0.01    | 0.02
...
chunijia    0.0   0.0   0.0   0.96    | 0.89
oranda      0.0   0.0   0.0   0.0     | 0.00
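The combination is just a matrix-vector product of the lexicon columns with the attention weights; a sketch using the slide's own numbers:

```python
import numpy as np

def conditional_lexicon_prob(lex_table, attention):
    # lex_table: (|V_target|, src_len), one lexicon column per source word.
    # attention: (src_len,) attention weights over the source.
    return lex_table @ attention   # attention-weighted lexicon distribution

# Rows: watashi, ore, kuru, kara, chunijia, oranda; columns: I, come, from, Tunisia.
lex = np.array([[0.60, 0.03, 0.01, 0.00],
                [0.20, 0.01, 0.02, 0.00],
                [0.01, 0.30, 0.01, 0.00],
                [0.02, 0.10, 0.50, 0.01],
                [0.00, 0.00, 0.00, 0.96],
                [0.00, 0.00, 0.00, 0.00]])
att = np.array([0.05, 0.01, 0.02, 0.93])
print(conditional_lexicon_prob(lex, att))   # "chunijia" gets ~0.89
```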
Slide 30: Incorporating w/ Neural MT
- Softmax bias: p(e_i | h_i) = softmax(W h_i + b + log(lex_i + ε))
- Linear interpolation: p(e_i | h_i) = γ softmax(W h_i + b) + (1 - γ) lex_i
(the ε prevents -∞ scores for words with zero lexicon probability)
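Both combination methods in a few lines of numpy (the γ and ε values are illustrative defaults, not from the slides):

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def bias_method(W, b, h, lex, eps=1e-3):
    # Add log(lex + eps) to the scores; eps keeps words with zero
    # lexicon probability from getting a -inf score.
    return softmax(W @ h + b + np.log(lex + eps))

def interpolation_method(W, b, h, lex, gamma=0.5):
    # Linearly interpolate the NMT softmax with the lexicon distribution.
    return gamma * softmax(W @ h + b) + (1.0 - gamma) * lex
```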
Slide 31: Summary of External Lexicons
- Train time efficiency: worse
- Test time efficiency: worse
- Number of parameters: same
- Test time accuracy: better to much better
- Code complexity: high
Slide 32: Other Varieties of Biases
- Copying source words as-is
- Remembering and copying target words: these were called cache models, and are now called pointer sentinel models :)
Refs: Gu et al. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning; Gulcehre et al. 2016. Pointing the Unknown Words; Merity et al. 2016. Pointer Sentinel Mixture Models
Slide 33: Use of External Phrase Tables
Ref: Tang et al. 2016. Neural Machine Translation with External Phrase Memory
Slide 34: Conclusion
Slide 35: Conclusion
- There are lots of softmax alternatives for neural MT → consider them in your systems!
- But there is no method that is fast at train time, fast at test time, accurate, small, and simple → consider making one yourself!