
Page 1: Softmax Alternatives in Neural MT

Graham Neubig, 5/24/2017

Page 2: Neural MT Models

[Figure: an encoder-decoder NMT model reads the source "kouen wo okonaimasu </s>" and generates "give a talk </s>" one word at a time, choosing each word by argmax over P(e1|F), P(e2|F, e1), P(e3|F, e1^2), P(e4|F, e1^3).]
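To make the generation loop concrete, here is a minimal Python sketch of the greedy (argmax) decoding implied by the figure. The `next_word_distribution` function and its canned probabilities are hypothetical stand-ins for the real encoder-decoder, not anything from the slides.

```python
import numpy as np

vocab = ["</s>", "give", "a", "talk"]

def next_word_distribution(source, prefix):
    """Hypothetical stand-in for the NMT model: returns P(e_i | F, e_1^{i-1}) over `vocab`."""
    canned = {0: [0.0, 0.9, 0.05, 0.05],   # first step: "give" is most likely
              1: [0.0, 0.0, 0.9, 0.1],     # then "a"
              2: [0.05, 0.0, 0.05, 0.9],   # then "talk"
              3: [0.95, 0.0, 0.0, 0.05]}   # then "</s>"
    return np.array(canned[len(prefix)])

source = "kouen wo okonaimasu </s>"
prefix = []
while True:
    # argmax decoding: pick the single highest-probability next word
    e_i = vocab[int(np.argmax(next_word_distribution(source, prefix)))]
    if e_i == "</s>":
        break
    prefix.append(e_i)
print(prefix)   # ['give', 'a', 'talk']
```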

Page 3: How we Calculate Probabilities

p(e_i | h_i) = softmax( W * h_i + b )

Next word prob. = softmax( Weights * Hidden context + Bias )

W = [w_{*,1}; w_{*,2}; w_{*,3}; ...]   (one output word embedding per row)
b = [b_1, b_2, b_3, ...]               (one bias per word)

In other words, the score is:

s(e_i = k | c_i) = w_{*,k} · c_i + b_k

i.e. the closeness of the output embedding and the context, plus a bias. Choose the word with the highest score.
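As a concrete illustration, here is a minimal numpy sketch of this scoring step; the sizes and the random W, b, h are illustrative, not values from the slides.

```python
import numpy as np

vocab_size, hidden_dim = 10000, 512
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, hidden_dim)) * 0.01  # one output embedding w_{*,k} per row
b = np.zeros(vocab_size)                               # one bias b_k per word
h = rng.normal(size=hidden_dim)                        # hidden context for this time step

scores = W @ h + b                     # s(e_i = k | h_i) = w_{*,k} . h_i + b_k
probs = np.exp(scores - scores.max())  # subtract the max for numerical stability
probs /= probs.sum()                   # normalize over the full vocabulary (the expensive part)
next_word = int(np.argmax(probs))      # choose the word with the highest score
```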

Page 4: A Visual Example

p = softmax( W h + b )   (shown on the slide as a matrix-vector picture)

Page 5: Problems w/ Softmax

● Computationally inefficient at training time
● Computationally inefficient at test time
● Many parameters
● Sub-optimal accuracy

Page 6: Calculation/Parameter Efficient Softmax Variants

Page 7: Negative Sampling/Noise Contrastive Estimation

● Calculate the denominator over a subset

[Figure: the full-vocabulary computation W c + b is replaced by W' c + b', where W' and b' contain only the rows for the true word and for negative samples drawn according to a distribution q.]
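A minimal numpy sketch of "calculate the denominator over a subset": score only the true word plus K sampled negatives rather than the whole vocabulary. The uniform proposal q and all sizes are illustrative; a full NCE objective would also correct each score by -log q(w) and avoid drawing the true word as a negative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, K = 10000, 512, 50

W = rng.normal(size=(vocab_size, hidden_dim)) * 0.01
b = np.zeros(vocab_size)
c = rng.normal(size=hidden_dim)                              # hidden context
true_word = 42

negatives = rng.choice(vocab_size, size=K, replace=False)   # negative samples drawn from q
subset = np.concatenate(([true_word], negatives))            # true word + negatives

scores = W[subset] @ c + b[subset]                           # only |subset| dot products, not |V|
log_denominator = np.logaddexp.reduce(scores)                # denominator computed over the subset
loss = -(scores[0] - log_denominator)                        # approximate negative log-likelihood
```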

Page 8: Lots of Alternatives!

● Noise contrastive estimation: train a model to discriminate between true and false examples

● Negative sampling: e.g. word2vec

● BlackOut

Ref: Chris Dyer, 2014. Notes on Noise Contrastive Estimation and Negative Sampling

Used in MT: Eriguchi et al. 2016: Tree-to-sequence attentional neural machine translation

Page 9: GPUifying Noise Contrastive Estimation

● Creating the negative samples and arranging memory is expensive on GPU

● Simple solution: sample the negative samples once for each mini-batch

Zoph et al. 2016. Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies
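A minimal sketch of that trick, with illustrative shapes: one set of negatives is drawn per mini-batch, so the whole batch shares a single dense slice of W and the scoring becomes one matrix multiplication, which is GPU-friendly.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, K, batch = 10000, 512, 100, 32

W = rng.normal(size=(vocab_size, hidden_dim)) * 0.01
C = rng.normal(size=(batch, hidden_dim))              # hidden contexts for the whole mini-batch
targets = rng.integers(vocab_size, size=batch)        # true next words

shared_negatives = rng.choice(vocab_size, size=K, replace=False)  # sampled once per mini-batch
subset = np.unique(np.concatenate((targets, shared_negatives)))   # one shared sub-vocabulary

scores = C @ W[subset].T   # (batch, |subset|): a single dense matmul instead of per-example gathers
```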

Page 10: Summary of Negative Sampling Approaches

● Train time efficiency: Much faster!
● Test time efficiency: Same
● Number of parameters: Same
● Test time accuracy: A little worse?
● Code complexity: Moderate

Page 11: Vocabulary Selection

● Select the vocabulary on a per-sentence basis

L'Hostis et al. 2016. Vocabulary Selection Strategies for NMT
Mi et al. 2016. Vocabulary Manipulation for NMT
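A minimal sketch of per-sentence vocabulary selection, with made-up candidate lists: the output vocabulary for one sentence is restricted to frequent words plus words suggested by the source (e.g. from an alignment lexicon), and the softmax is computed only over that subset.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10000, 512
W = rng.normal(size=(vocab_size, hidden_dim)) * 0.01
b = np.zeros(vocab_size)

frequent_words = np.arange(1000)                      # always-allowed frequent target words
lexicon_candidates = np.array([4203, 7781, 9052])     # words suggested by this source sentence
candidates = np.union1d(frequent_words, lexicon_candidates)

h = rng.normal(size=hidden_dim)
scores = W[candidates] @ h + b[candidates]            # softmax only over the candidate set
probs = np.exp(scores - scores.max())
probs /= probs.sum()
best = int(candidates[np.argmax(probs)])              # chosen word, as a full-vocabulary ID
```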

Page 12: Summary of Vocabulary Selection

● Train time efficiency: A little faster
● Test time efficiency: Much faster!
● Number of parameters: Same
● Test time accuracy: Better or a little worse
● Code complexity: Moderate

Page 13: Class-based Softmax

● Predict P(class|hidden), then P(word|class, hidden)
● Because P(w|c,h) is 0 for all but one class, computation is efficient

Goodman 2001. Classes for Fast Maximum Entropy Training

P(class | h) = softmax( W_c h + b_c )
P(word | class, h) = softmax( W_w h + b_w )
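A minimal numpy sketch of the two-step prediction, with an illustrative random class assignment: the class softmax is over ~100 classes and the word softmax only over the words in the chosen class, so neither normalization touches the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_classes, hidden_dim = 10000, 100, 512
word_to_class = rng.integers(num_classes, size=vocab_size)   # each word belongs to exactly one class

W_c = rng.normal(size=(num_classes, hidden_dim)) * 0.01; b_c = np.zeros(num_classes)
W_w = rng.normal(size=(vocab_size, hidden_dim)) * 0.01;  b_w = np.zeros(vocab_size)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = rng.normal(size=hidden_dim)
p_class = softmax(W_c @ h + b_c)                               # P(class | h)
c = int(np.argmax(p_class))
members = np.where(word_to_class == c)[0]                      # only the ~100 words in class c
p_word_given_class = softmax(W_w[members] @ h + b_w[members])  # P(word | class, h)
word = int(members[np.argmax(p_word_given_class)])
p_word = p_class[c] * p_word_given_class.max()                 # P(word, class | h)
```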

Page 14: Hierarchical Softmax

● Tree-structured prediction of word ID
● Usually modeled as a sequence of binary decisions

Morin and Bengio 2005: Hierarchical Probabilistic NNLM

0 1 1 1 0 → word 14
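A minimal sketch of hierarchical softmax over a toy balanced binary tree, where every internal node has its own weight vector (in contrast to the shared weights of binary code prediction on the next slides); the tree, sizes, and random parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, depth = 512, 5                        # 2^5 = 32 leaf words in this toy tree
node_W = rng.normal(size=(2 ** depth - 1, hidden_dim)) * 0.01   # one weight vector per internal node
node_b = np.zeros(2 ** depth - 1)

def p_word(word_id, h):
    """P(word) = product over the tree path of P(bit | h), bits read from the word ID."""
    p, node = 1.0, 0
    for d in reversed(range(depth)):
        bit = (word_id >> d) & 1                  # binary decision at this level
        q = 1.0 / (1.0 + np.exp(-(node_W[node] @ h + node_b[node])))
        p *= q if bit == 1 else (1.0 - q)
        node = 2 * node + 1 + bit                 # descend to the chosen child
    return p

h = rng.normal(size=hidden_dim)
print(p_word(14, h))                              # word 14 = path 0 1 1 1 0, as on the slide
```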

Page 15: Summary of Class-based Softmaxes

● Train time efficiency: Faster on CPU, a pain to GPU
● Test time efficiency: Worse
● Number of parameters: More
● Test time accuracy: Slightly worse to slightly better
● Code complexity: High

Page 16: Binary Code Prediction

● Just directly predict the binary code of the word ID

● Like hierarchical softmax, but with shared weights at every layer → fewer parameters, easy to GPU

Oda et al. 2017: NMT Via Binary Code Prediction

σ( W h + b ) = 01110 → word 14
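A minimal sketch of the idea with illustrative sizes: a single sigmoid layer with only log2(|V|) outputs predicts each bit of the word ID directly, so the output matrix has n_bits rows shared by all words instead of |V| rows. Predicting each bit independently means one wrong bit changes the word entirely, which is the kind of brittleness the improvements on the next slide target.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 32, 512
n_bits = int(np.ceil(np.log2(vocab_size)))        # 5 bits are enough for a 32-word vocabulary

W = rng.normal(size=(n_bits, hidden_dim)) * 0.01  # n_bits x hidden, not |V| x hidden
b = np.zeros(n_bits)

h = rng.normal(size=hidden_dim)
q = 1.0 / (1.0 + np.exp(-(W @ h + b)))            # q[d] = P(bit d of the word ID is 1)
bits = (q > 0.5).astype(int)                      # e.g. [0, 1, 1, 1, 0]
word_id = int("".join(str(bit) for bit in bits), 2)   # 01110 -> word 14
```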

Page 17: Two Improvements

● Hybrid model
● Error correcting codes

Page 18: Summary of Binary Code Prediction

● Train time efficiency: Faster
● Test time efficiency: Faster (12x on CPU!)
● Number of parameters: Fewer
● Test time accuracy: Slightly worse
● Code complexity: Moderate

Page 19: Parameter Sharing

Page 20: Parameter Sharing

● We have two |V| x |h| matrices in the decoder:

● Input word embeddings, which we look up and feed into the RNN

● Output word embeddings, which are the weight matrix W in the softmax

● Simple idea: tie their weights together

Press et al. 2016: Using the Output Embedding to Improve Language Models
Inan et al. 2016: Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
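A minimal sketch of weight tying with illustrative sizes: a single |V| x |h| matrix E serves both as the input embedding table and as the softmax weight matrix W, which assumes the embedding size equals the decoder's hidden size.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10000, 512
E = rng.normal(size=(vocab_size, hidden_dim)) * 0.01   # the one shared embedding matrix
b = np.zeros(vocab_size)

prev_word = 7
x = E[prev_word]                    # input side: embedding lookup fed into the RNN
h = np.tanh(x)                      # stand-in for the decoder RNN state
scores = E @ h + b                  # output side: the same matrix E plays the role of W
probs = np.exp(scores - scores.max())
probs /= probs.sum()
```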

Page 21: Summary of Parameter Sharing

● Train time efficiency: Same
● Test time efficiency: Same
● Number of parameters: Fewer
● Test time accuracy: Better
● Code complexity: Low

Page 22: Incorporating External Information

Page 23: Problems w/ Lexical Choice in Neural MT

Arthur et al. 2016: Incorporating Discrete Translation Lexicons into NMT

Page 24: When Does Translation Succeed? (in Output Embedding Space)

Example: "I come from Tunisia"

[Figure: output embedding space with w_{*,tunisia}, w_{*,norway}, w_{*,sweden}, w_{*,nigeria}, w_{*,eat}, w_{*,consume}; the context vector h1 lands closest to w_{*,tunisia}, so the right word wins.]

Page 25: When Does Translation Fail? (Embeddings Version)

Example: "I come from Tunisia"

[Figure: the same output embedding space, but this time h1 lands nearer to one of the other country embeddings (w_{*,norway}, w_{*,sweden}, w_{*,nigeria}) than to w_{*,tunisia}, so a similar but wrong word wins.]

Page 26: When Does Translation Fail? (Bias Version)

Example: "I come from Tunisia"

[Figure: the same output embedding space with w_{*,china} added; even with h1 close to w_{*,tunisia}, the large bias difference (b_tunisia = -0.5 vs. b_china = 4.5) lets "china" outscore "tunisia".]

Page 27: What about Traditional Symbolic Models?

his father likes Tunisia

kare no chichi wa chunijia ga suki da

P(kare|his) = 0.5   P(no|his) = 0.5
P(chichi|father) = 1.0   P(chunijia|Tunisia) = 1.0
P(suki|likes) = 0.5   P(da|likes) = 0.5

1-to-1 alignment

Page 28: Even if We Make a Mistake...

his father likes Tunisia

kare no chichi wa chunijia ga suki da

P(kare|his) = 0.5   P(no|his) = 0.5
P(chichi|Tunisia) = 1.0   P(chunijia|father) = 1.0
P(suki|likes) = 0.5   P(da|likes) = 0.5

Different mistakes than neural MT
Soft alignment possible

Page 29: Calculating Lexicon Probabilities

Attention over the source "I come from Tunisia": 0.05, 0.01, 0.02, 0.93

Word-by-word lexicon probabilities p(e|f) and the resulting attention-weighted conditional lexicon probability:

              I      come   from   Tunisia   Conditional
  watashi     0.6    0.03   0.01   0.0       0.03
  ore         0.2    0.01   0.02   0.0       0.01
  …           …      …      …      …         …
  kuru        0.01   0.3    0.01   0.0       0.00
  kara        0.02   0.1    0.5    0.01      0.02
  …           …      …      …      …         …
  chunijia    0.0    0.0    0.0    0.96      0.89
  oranda      0.0    0.0    0.0    0.0       0.00
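In code, the rightmost column is just a matrix-vector product: a minimal numpy sketch using the numbers from the slide (with the elided "…" rows dropped).

```python
import numpy as np

attention = np.array([0.05, 0.01, 0.02, 0.93])    # over the source "I come from Tunisia"

# rows: target words (watashi, ore, kuru, kara, chunijia, oranda); columns: source words
lex_table = np.array([
    [0.6,  0.03, 0.01, 0.0 ],   # watashi
    [0.2,  0.01, 0.02, 0.0 ],   # ore
    [0.01, 0.3,  0.01, 0.0 ],   # kuru
    [0.02, 0.1,  0.5,  0.01],   # kara
    [0.0,  0.0,  0.0,  0.96],   # chunijia
    [0.0,  0.0,  0.0,  0.0 ],   # oranda
])

lex_i = lex_table @ attention   # conditional lexicon probability for this time step
# ≈ [0.03, 0.01, 0.00, 0.02, 0.89, 0.00], matching the rightmost column above
```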

Page 30: Incorporating w/ Neural MT

● Softmax bias:

  p(e_i | h_i) = softmax( W * h_i + b + log(lex_i + ε) )

  (the ε is there to prevent -∞ scores when a lexicon probability is 0)

● Linear interpolation:

  p(e_i | h_i) = γ * softmax( W * h_i + b ) + (1 - γ) * lex_i
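A minimal numpy sketch of both options over a toy 6-word vocabulary, reusing the lex_i values from the previous slide; epsilon and gamma are assumed values here (gamma is tuned or learned in practice), and lex_i is assumed to be a proper distribution over the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 6, 16
W = rng.normal(size=(vocab_size, hidden_dim)) * 0.1
b = np.zeros(vocab_size)
h = rng.normal(size=hidden_dim)
lex_i = np.array([0.03, 0.01, 0.00, 0.02, 0.89, 0.00])   # conditional lexicon probability

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

epsilon, gamma = 1e-6, 0.5

# 1) softmax bias: add log(lex_i + epsilon) to the scores before the softmax
p_bias = softmax(W @ h + b + np.log(lex_i + epsilon))

# 2) linear interpolation: mix the softmax distribution with the lexicon distribution
p_interp = gamma * softmax(W @ h + b) + (1 - gamma) * lex_i
```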

Page 31: Summary of External Lexicons

● Train time efficiency: Worse
● Test time efficiency: Worse
● Number of parameters: Same
● Test time accuracy: Better to much better
● Code complexity: High

Page 32: Other Varieties of Biases

● Copying source words as-is

● Remembering and copying target words: these were called cache models, and are now called ★pointer sentinel models★ :)

Gu et al. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning

Gulcehre et al. 2016. Pointing the unknown words

Merity et al. 2016. Pointer Sentinel Mixture Models

Page 33: Use of External Phrase Tables

Tang et al. 2016. NMT with External Phrase Memory

Page 34: Conclusion

Page 35: Conclusion

● Lots of softmax alternatives for neural MT
  → Consider them in your systems!
● But there is no method that is fast at training time, fast at test time, accurate, small, and simple
  → Consider making one yourself!