
Page 1: Softmax Alternatives in Neural MT

Graham Neubig, 5/24/2017

Page 2: Neural MT Models

[Figure: an encoder-decoder NMT model reads the source "kouen wo okonaimasu </s>" and generates "give a talk </s>" one word at a time, choosing each word by argmax over P(e1|F), P(e2|F, e1), P(e3|F, e1^2), P(e4|F, e1^3).]
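To make the generation loop concrete, here is a minimal Python sketch of the greedy (argmax) decoding implied by the figure. The `next_word_distribution` function and its canned probabilities are hypothetical stand-ins for the real encoder-decoder, not anything from the slides.

```python
import numpy as np

vocab = ["</s>", "give", "a", "talk"]

def next_word_distribution(source, prefix):
    """Hypothetical stand-in for the NMT model: returns P(e_i | F, e_1^{i-1}) over `vocab`."""
    canned = {0: [0.0, 0.9, 0.05, 0.05],   # first step: "give" is most likely
              1: [0.0, 0.0, 0.9, 0.1],     # then "a"
              2: [0.05, 0.0, 0.05, 0.9],   # then "talk"
              3: [0.95, 0.0, 0.0, 0.05]}   # then "</s>"
    return np.array(canned[len(prefix)])

source = "kouen wo okonaimasu </s>"
prefix = []
while True:
    # argmax decoding: pick the single highest-probability next word
    e_i = vocab[int(np.argmax(next_word_distribution(source, prefix)))]
    if e_i == "</s>":
        break
    prefix.append(e_i)
print(prefix)   # ['give', 'a', 'talk']
```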

Page 3: How we Calculate Probabilities

p(e_i | h_i) = softmax( W * h_i + b )

Next word prob. = softmax( Weights * Hidden context + Bias )

W = [w_{*,1}; w_{*,2}; w_{*,3}; ...]   (one output word embedding per row)
b = [b_1, b_2, b_3, ...]               (one bias per word)

In other words, the score is:

s(e_i = k | c_i) = w_{*,k} · c_i + b_k

i.e. the closeness of the output embedding and the context, plus a bias. Choose the word with the highest score.
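As a concrete illustration, here is a minimal numpy sketch of this scoring step; the sizes and the random W, b, h are illustrative, not values from the slides.

```python
import numpy as np

vocab_size, hidden_dim = 10000, 512
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, hidden_dim)) * 0.01  # one output embedding w_{*,k} per row
b = np.zeros(vocab_size)                               # one bias b_k per word
h = rng.normal(size=hidden_dim)                        # hidden context for this time step

scores = W @ h + b                     # s(e_i = k | h_i) = w_{*,k} . h_i + b_k
probs = np.exp(scores - scores.max())  # subtract the max for numerical stability
probs /= probs.sum()                   # normalize over the full vocabulary (the expensive part)
next_word = int(np.argmax(probs))      # choose the word with the highest score
```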

Page 4: A Visual Example

p = softmax( W h + b )   (shown on the slide as a matrix-vector picture)

Page 5: Problems w/ Softmax

● Computationally inefficient at training time
● Computationally inefficient at test time
● Many parameters
● Sub-optimal accuracy

Page 6: Calculation/Parameter Efficient Softmax Variants

Page 7: Negative Sampling/Noise Contrastive Estimation

● Calculate the denominator over a subset

[Figure: the full-vocabulary computation W c + b is replaced by W' c + b', where W' and b' contain only the rows for the true word and for negative samples drawn according to a distribution q.]
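A minimal numpy sketch of "calculate the denominator over a subset": score only the true word plus K sampled negatives rather than the whole vocabulary. The uniform proposal q and all sizes are illustrative; a full NCE objective would also correct each score by -log q(w) and avoid drawing the true word as a negative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, K = 10000, 512, 50

W = rng.normal(size=(vocab_size, hidden_dim)) * 0.01
b = np.zeros(vocab_size)
c = rng.normal(size=hidden_dim)                              # hidden context
true_word = 42

negatives = rng.choice(vocab_size, size=K, replace=False)   # negative samples drawn from q
subset = np.concatenate(([true_word], negatives))            # true word + negatives

scores = W[subset] @ c + b[subset]                           # only |subset| dot products, not |V|
log_denominator = np.logaddexp.reduce(scores)                # denominator computed over the subset
loss = -(scores[0] - log_denominator)                        # approximate negative log-likelihood
```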

Page 8: Lots of Alternatives!

● Noise contrastive estimation: train a model to discriminate between true and false examples

● Negative sampling: e.g. word2vec

● BlackOut

Ref: Chris Dyer, 2014. Notes on Noise Contrastive Estimation and Negative Sampling

Used in MT: Eriguchi et al. 2016: Tree-to-sequence attentional neural machine translation

Page 9: GPUifying Noise Contrastive Estimation

● Creating the negative samples and arranging memory is expensive on GPU

● Simple solution: sample the negative samples once for each mini-batch

Zoph et al. 2016. Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies
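A minimal sketch of that trick, with illustrative shapes: one set of negatives is drawn per mini-batch, so the whole batch shares a single dense slice of W and the scoring becomes one matrix multiplication, which is GPU-friendly.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, K, batch = 10000, 512, 100, 32

W = rng.normal(size=(vocab_size, hidden_dim)) * 0.01
C = rng.normal(size=(batch, hidden_dim))              # hidden contexts for the whole mini-batch
targets = rng.integers(vocab_size, size=batch)        # true next words

shared_negatives = rng.choice(vocab_size, size=K, replace=False)  # sampled once per mini-batch
subset = np.unique(np.concatenate((targets, shared_negatives)))   # one shared sub-vocabulary

scores = C @ W[subset].T   # (batch, |subset|): a single dense matmul instead of per-example gathers
```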

Page 10: Summary of Negative Sampling Approaches

● Train time efficiency: Much faster!
● Test time efficiency: Same
● Number of parameters: Same
● Test time accuracy: A little worse?
● Code complexity: Moderate

Page 11: Vocabulary Selection

● Select the vocabulary on a per-sentence basis

L'Hostis et al. 2016. Vocabulary Selection Strategies for NMT
Mi et al. 2016. Vocabulary Manipulation for NMT
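A minimal sketch of per-sentence vocabulary selection, with made-up candidate lists: the output vocabulary for one sentence is restricted to frequent words plus words suggested by the source (e.g. from an alignment lexicon), and the softmax is computed only over that subset.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10000, 512
W = rng.normal(size=(vocab_size, hidden_dim)) * 0.01
b = np.zeros(vocab_size)

frequent_words = np.arange(1000)                      # always-allowed frequent target words
lexicon_candidates = np.array([4203, 7781, 9052])     # words suggested by this source sentence
candidates = np.union1d(frequent_words, lexicon_candidates)

h = rng.normal(size=hidden_dim)
scores = W[candidates] @ h + b[candidates]            # softmax only over the candidate set
probs = np.exp(scores - scores.max())
probs /= probs.sum()
best = int(candidates[np.argmax(probs)])              # chosen word, as a full-vocabulary ID
```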

Page 12: Summary of Vocabulary Selection

● Train time efficiency: A little faster
● Test time efficiency: Much faster!
● Number of parameters: Same
● Test time accuracy: Better or a little worse
● Code complexity: Moderate

Page 13: Class-based Softmax

● Predict P(class|hidden), then P(word|class, hidden)
● Because P(w|c,h) is 0 for all but one class, computation is efficient

Goodman 2001. Classes for Fast Maximum Entropy Training

P(class | h) = softmax( W_c h + b_c )
P(word | class, h) = softmax( W_w h + b_w )
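A minimal numpy sketch of the two-step prediction, with an illustrative random class assignment: the class softmax is over ~100 classes and the word softmax only over the words in the chosen class, so neither normalization touches the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_classes, hidden_dim = 10000, 100, 512
word_to_class = rng.integers(num_classes, size=vocab_size)   # each word belongs to exactly one class

W_c = rng.normal(size=(num_classes, hidden_dim)) * 0.01; b_c = np.zeros(num_classes)
W_w = rng.normal(size=(vocab_size, hidden_dim)) * 0.01;  b_w = np.zeros(vocab_size)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = rng.normal(size=hidden_dim)
p_class = softmax(W_c @ h + b_c)                               # P(class | h)
c = int(np.argmax(p_class))
members = np.where(word_to_class == c)[0]                      # only the ~100 words in class c
p_word_given_class = softmax(W_w[members] @ h + b_w[members])  # P(word | class, h)
word = int(members[np.argmax(p_word_given_class)])
p_word = p_class[c] * p_word_given_class.max()                 # P(word, class | h)
```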

Page 14: Hierarchical Softmax

● Tree-structured prediction of word ID
● Usually modeled as a sequence of binary decisions

Morin and Bengio 2005: Hierarchical Probabilistic NNLM

0 1 1 1 0 → word 14
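A minimal sketch of hierarchical softmax over a toy balanced binary tree, where every internal node has its own weight vector (in contrast to the shared weights of binary code prediction on the next slides); the tree, sizes, and random parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, depth = 512, 5                        # 2^5 = 32 leaf words in this toy tree
node_W = rng.normal(size=(2 ** depth - 1, hidden_dim)) * 0.01   # one weight vector per internal node
node_b = np.zeros(2 ** depth - 1)

def p_word(word_id, h):
    """P(word) = product over the tree path of P(bit | h), bits read from the word ID."""
    p, node = 1.0, 0
    for d in reversed(range(depth)):
        bit = (word_id >> d) & 1                  # binary decision at this level
        q = 1.0 / (1.0 + np.exp(-(node_W[node] @ h + node_b[node])))
        p *= q if bit == 1 else (1.0 - q)
        node = 2 * node + 1 + bit                 # descend to the chosen child
    return p

h = rng.normal(size=hidden_dim)
print(p_word(14, h))                              # word 14 = path 0 1 1 1 0, as on the slide
```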

Page 15: Summary of Class-based Softmaxes

● Train time efficiency: Faster on CPU, a pain to GPU
● Test time efficiency: Worse
● Number of parameters: More
● Test time accuracy: Slightly worse to slightly better
● Code complexity: High

Page 16: Binary Code Prediction

● Just directly predict the binary code of the word ID

● Like hierarchical softmax, but with shared weights at every layer → fewer parameters, easy to GPU

Oda et al. 2017: NMT Via Binary Code Prediction

σ( W h + b ) = 01110 → word 14
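A minimal sketch of the idea with illustrative sizes: a single sigmoid layer with only log2(|V|) outputs predicts each bit of the word ID directly, so the output matrix has n_bits rows shared by all words instead of |V| rows. Predicting each bit independently means one wrong bit changes the word entirely, which is the kind of brittleness the improvements on the next slide target.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 32, 512
n_bits = int(np.ceil(np.log2(vocab_size)))        # 5 bits are enough for a 32-word vocabulary

W = rng.normal(size=(n_bits, hidden_dim)) * 0.01  # n_bits x hidden, not |V| x hidden
b = np.zeros(n_bits)

h = rng.normal(size=hidden_dim)
q = 1.0 / (1.0 + np.exp(-(W @ h + b)))            # q[d] = P(bit d of the word ID is 1)
bits = (q > 0.5).astype(int)                      # e.g. [0, 1, 1, 1, 0]
word_id = int("".join(str(bit) for bit in bits), 2)   # 01110 -> word 14
```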

Page 17: Two Improvements

● Hybrid model
● Error correcting codes

Page 18: Summary of Binary Code Prediction

● Train time efficiency: Faster
● Test time efficiency: Faster (12x on CPU!)
● Number of parameters: Fewer
● Test time accuracy: Slightly worse
● Code complexity: Moderate

Page 19: Parameter Sharing

Page 20: Parameter Sharing

● We have two |V| x |h| matrices in the decoder:

● Input word embeddings, which we look up and feed into the RNN

● Output word embeddings, which are the weight matrix W in the softmax

● Simple idea: tie their weights together

Press et al. 2016: Using the Output Embedding to Improve Language Models
Inan et al. 2016: Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
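A minimal sketch of weight tying with illustrative sizes: a single |V| x |h| matrix E serves both as the input embedding table and as the softmax weight matrix W, which assumes the embedding size equals the decoder's hidden size.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10000, 512
E = rng.normal(size=(vocab_size, hidden_dim)) * 0.01   # the one shared embedding matrix
b = np.zeros(vocab_size)

prev_word = 7
x = E[prev_word]                    # input side: embedding lookup fed into the RNN
h = np.tanh(x)                      # stand-in for the decoder RNN state
scores = E @ h + b                  # output side: the same matrix E plays the role of W
probs = np.exp(scores - scores.max())
probs /= probs.sum()
```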

Page 21: Summary of Parameter Sharing

● Train time efficiency: Same
● Test time efficiency: Same
● Number of parameters: Fewer
● Test time accuracy: Better
● Code complexity: Low

Page 22: Incorporating External Information

Page 23: Problems w/ Lexical Choice in Neural MT

Arthur et al. 2016: Incorporating Discrete Translation Lexicons into NMT

Page 24: When Does Translation Succeed? (in Output Embedding Space)

Example: "I come from Tunisia"

[Figure: output embedding space with w_{*,tunisia}, w_{*,norway}, w_{*,sweden}, w_{*,nigeria}, w_{*,eat}, w_{*,consume}; the context vector h1 lands closest to w_{*,tunisia}, so the right word wins.]

Page 25: When Does Translation Fail? (Embeddings Version)

Example: "I come from Tunisia"

[Figure: the same output embedding space, but this time h1 lands nearer to one of the other country embeddings (w_{*,norway}, w_{*,sweden}, w_{*,nigeria}) than to w_{*,tunisia}, so a similar but wrong word wins.]

Page 26: When Does Translation Fail? (Bias Version)

Example: "I come from Tunisia"

[Figure: the same output embedding space with w_{*,china} added; even with h1 close to w_{*,tunisia}, the large bias difference (b_tunisia = -0.5 vs. b_china = 4.5) lets "china" outscore "tunisia".]

Page 27: What about Traditional Symbolic Models?

his father likes Tunisia

kare no chichi wa chunijia ga suki da

P(kare|his) = 0.5   P(no|his) = 0.5
P(chichi|father) = 1.0   P(chunijia|Tunisia) = 1.0
P(suki|likes) = 0.5   P(da|likes) = 0.5

1-to-1 alignment

Page 28: Even if We Make a Mistake...

his father likes Tunisia

kare no chichi wa chunijia ga suki da

P(kare|his) = 0.5   P(no|his) = 0.5
P(chichi|Tunisia) = 1.0   P(chunijia|father) = 1.0
P(suki|likes) = 0.5   P(da|likes) = 0.5

Different mistakes than neural MT
Soft alignment possible

Page 29: Calculating Lexicon Probabilities

Attention over the source "I come from Tunisia": 0.05, 0.01, 0.02, 0.93

Word-by-word lexicon probabilities p(e|f) and the resulting attention-weighted conditional lexicon probability:

              I      come   from   Tunisia   Conditional
  watashi     0.6    0.03   0.01   0.0       0.03
  ore         0.2    0.01   0.02   0.0       0.01
  …           …      …      …      …         …
  kuru        0.01   0.3    0.01   0.0       0.00
  kara        0.02   0.1    0.5    0.01      0.02
  …           …      …      …      …         …
  chunijia    0.0    0.0    0.0    0.96      0.89
  oranda      0.0    0.0    0.0    0.0       0.00
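In code, the rightmost column is just a matrix-vector product: a minimal numpy sketch using the numbers from the slide (with the elided "…" rows dropped).

```python
import numpy as np

attention = np.array([0.05, 0.01, 0.02, 0.93])    # over the source "I come from Tunisia"

# rows: target words (watashi, ore, kuru, kara, chunijia, oranda); columns: source words
lex_table = np.array([
    [0.6,  0.03, 0.01, 0.0 ],   # watashi
    [0.2,  0.01, 0.02, 0.0 ],   # ore
    [0.01, 0.3,  0.01, 0.0 ],   # kuru
    [0.02, 0.1,  0.5,  0.01],   # kara
    [0.0,  0.0,  0.0,  0.96],   # chunijia
    [0.0,  0.0,  0.0,  0.0 ],   # oranda
])

lex_i = lex_table @ attention   # conditional lexicon probability for this time step
# ≈ [0.03, 0.01, 0.00, 0.02, 0.89, 0.00], matching the rightmost column above
```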

Page 30: Incorporating w/ Neural MT

● Softmax bias:

  p(e_i | h_i) = softmax( W * h_i + b + log(lex_i + ε) )

  (the ε is there to prevent -∞ scores when a lexicon probability is 0)

● Linear interpolation:

  p(e_i | h_i) = γ * softmax( W * h_i + b ) + (1 - γ) * lex_i
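A minimal numpy sketch of both options over a toy 6-word vocabulary, reusing the lex_i values from the previous slide; epsilon and gamma are assumed values here (gamma is tuned or learned in practice), and lex_i is assumed to be a proper distribution over the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 6, 16
W = rng.normal(size=(vocab_size, hidden_dim)) * 0.1
b = np.zeros(vocab_size)
h = rng.normal(size=hidden_dim)
lex_i = np.array([0.03, 0.01, 0.00, 0.02, 0.89, 0.00])   # conditional lexicon probability

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

epsilon, gamma = 1e-6, 0.5

# 1) softmax bias: add log(lex_i + epsilon) to the scores before the softmax
p_bias = softmax(W @ h + b + np.log(lex_i + epsilon))

# 2) linear interpolation: mix the softmax distribution with the lexicon distribution
p_interp = gamma * softmax(W @ h + b) + (1 - gamma) * lex_i
```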

Page 31: Summary of External Lexicons

● Train time efficiency: Worse
● Test time efficiency: Worse
● Number of parameters: Same
● Test time accuracy: Better to much better
● Code complexity: High

Page 32: Other Varieties of Biases

● Copying source words as-is

● Remembering and copying target words: these were called cache models, and are now called ★pointer sentinel models★ :)

Gu et al. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning

Gulcehre et al. 2016. Pointing the unknown words

Merity et al. 2016. Pointer Sentinel Mixture Models

Page 33: Use of External Phrase Tables

Tang et al. 2016. NMT with External Phrase Memory

Page 34: Conclusion

Page 35: Conclusion

● Lots of softmax alternatives for neural MT
  → Consider them in your systems!
● But there is no method that is fast at training time, fast at test time, accurate, small, and simple
  → Consider making one yourself!