
1

Softmax Alternatives in Neural MT

Graham Neubig, 5/24/2017

2

Neural MT Models

[Figure: an encoder-decoder translating "kouen wo okonaimasu </s>" into "give a talk </s>", choosing each output word by argmax over P(e1|F), P(e2|F, e1), P(e3|F, e1, e2), P(e4|F, e1, e2, e3)]

3

How we Calculate Probabilities

p(ei | hi) = softmax( W hi + b )

Next word prob. = softmax( Weights × Hidden context + Bias )

W = [w*,1, w*,2, w*,3, ...]        b = [b1, b2, b3, ...]

In other words, the score is:

s(ei | ci) = w*,k ・ ci + bk

Closeness of output embedding and context, + bias. Choose word with highest score.
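A minimal NumPy sketch of this full-vocabulary calculation (the sizes and variable names below are illustrative, not from the slides):

```python
import numpy as np

def full_softmax(W, b, h):
    """p(e_i | h_i) = softmax(W h_i + b): one score per vocabulary word, normalized."""
    scores = W @ h + b              # s(e_i = k | h_i) = w_{*,k} . h_i + b_k for every k
    scores -= scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Illustrative sizes: |V| = 50,000 words, hidden size 512
rng = np.random.default_rng(0)
V, H = 50_000, 512
W, b = rng.normal(size=(V, H)) * 0.01, np.zeros(V)
h = rng.normal(size=H)

p = full_softmax(W, b, h)
print(p.shape, p.sum(), p.argmax())  # (50000,), ~1.0, index of the highest-scoring word
```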

4

A Visual Example

[Figure: p = softmax( W h + b ), drawn as a matrix-vector product plus a bias vector]

5

Problems w/ Softmax

● Computationally inefficient at training time
● Computationally inefficient at test time
● Many parameters
● Sub-optimal accuracy

6


Calculation/Parameter Efficient Softmax Variants

7


Negative Sampling/Noise Contrastive Estimation

● Calculate the denominator over a subset

[Figure: the full softmax computes scores W c + b over the whole vocabulary; here the denominator uses only W' c + b', a reduced matrix over negative samples drawn according to a distribution q]
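A minimal sketch of normalizing over a subset: score only the true word plus negatives drawn from a noise distribution q. This is simplified sampled-softmax-style code, not the exact loss of any one paper; a full NCE loss would also correct the scores by log q, and num_neg and the uniform q are illustrative choices.

```python
import numpy as np

def sampled_softmax_loss(W, b, h, target, q, num_neg=100, rng=None):
    """Approximate -log p(target | h) using the true word plus negatives drawn from q."""
    rng = rng or np.random.default_rng()
    negatives = rng.choice(len(q), size=num_neg, replace=False, p=q)  # noise samples
    subset = np.concatenate(([target], negatives))   # true word first (target not excluded from q, for simplicity)
    scores = W[subset] @ h + b[subset]                # only |subset| dot products, not |V|
    scores -= scores.max()
    log_denom = np.log(np.exp(scores).sum())
    return -(scores[0] - log_denom)                   # negative log-softmax over the subset

# Illustrative usage: uniform noise distribution q over the vocabulary
V, H = 50_000, 512
rng = np.random.default_rng(1)
W, b, h = rng.normal(size=(V, H)) * 0.01, np.zeros(V), rng.normal(size=H)
q = np.full(V, 1.0 / V)
print(sampled_softmax_loss(W, b, h, target=42, q=q, rng=rng))
```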

8

Lots of Alternatives!

● Noise contrastive estimation: train a model to discriminate between true and false examples
● Negative sampling: e.g. word2vec
● BlackOut

Ref: Chris Dyer, 2014. Notes on Noise Contrastive Estimation and Negative Sampling

Used in MT: Eriguchi et al. 2016: Tree-to-sequence attentional neural machine translation

9


GPUifying Noise Contrastive Estimation

● Creating the negative samples and arranging memory is expensive on GPU

● Simple solution: sample the negative samples once for each mini-batch

Zoph et al. 2016. Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies
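A sketch of the mini-batch trick, assuming one shared negative set drawn once per batch so that only a single slice of W has to be gathered; shapes and names are illustrative, not Zoph et al.'s actual implementation.

```python
import numpy as np

def batch_shared_negative_scores(W, b, H_batch, targets, num_neg=500, rng=None):
    """Score a whole mini-batch against one shared set of negatives (one slice of W)."""
    rng = rng or np.random.default_rng()
    negatives = rng.choice(W.shape[0], size=num_neg, replace=False)   # sampled once per mini-batch
    subset = np.concatenate((targets, negatives))                     # shared by every example in the batch
    W_sub, b_sub = W[subset], b[subset]                               # a single gather, GPU-friendly
    return H_batch @ W_sub.T + b_sub                                  # (batch, |subset|) scores

# Illustrative usage
rng = np.random.default_rng(2)
V, H, B = 50_000, 512, 64
W, b = rng.normal(size=(V, H)) * 0.01, np.zeros(V)
H_batch = rng.normal(size=(B, H))
targets = rng.integers(0, V, size=B)
print(batch_shared_negative_scores(W, b, H_batch, targets, rng=rng).shape)  # (64, 564)
```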

10

Summary of Negative Sampling Approaches

● Train time efficiency: Much faster!
● Test time efficiency: Same
● Number of parameters: Same
● Test time accuracy: A little worse?
● Code complexity: Moderate

11

Vocabulary Selection

● Select the vocabulary on a per-sentence basis

L'Hostis et al. 2016. Vocabulary Selection Strategies for NMT
Mi et al. 2016. Vocabulary Manipulation for NMT
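A sketch of one possible per-sentence selection strategy (most frequent words plus lexicon translations of the source words); the toy lexicon and candidate construction are illustrative stand-ins for the strategies in the papers above.

```python
import numpy as np

def select_vocabulary(source_words, lexicon, num_frequent=1000):
    """Candidate target words for one sentence: frequent words + lexicon translations."""
    candidates = set(range(num_frequent))            # assume ids 0..999 are the most frequent words
    for f in source_words:
        candidates.update(lexicon.get(f, []))        # add likely translations of each source word
    return np.array(sorted(candidates))

def restricted_softmax(W, b, h, candidates):
    """Softmax over the selected candidates only."""
    scores = W[candidates] @ h + b[candidates]
    scores -= scores.max()
    p = np.exp(scores)
    return candidates, p / p.sum()

# Illustrative usage with a toy lexicon mapping source words to target word ids
rng = np.random.default_rng(3)
V, H = 50_000, 512
W, b, h = rng.normal(size=(V, H)) * 0.01, np.zeros(V), rng.normal(size=H)
lexicon = {"kouen": [4120, 18833], "okonaimasu": [2077]}
cands, p = restricted_softmax(W, b, h, select_vocabulary(["kouen", "wo", "okonaimasu"], lexicon))
print(len(cands), p.sum())
```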

12

Summary of Vocabulary Selection

● Train time efficiency: A little faster
● Test time efficiency: Much faster!
● Number of parameters: Same
● Test time accuracy: Better or a little worse
● Code complexity: Moderate

13

Class-based Softmax

● Predict P(class|hidden), then P(word|class,hidden)
● Because P(word|class,hidden) is 0 for all but one class, computation is efficient

Goodman 2001. Classes for Fast Maximum Entropy Training

[Figure: two softmaxes, softmax( Wc h + bc ) for the class and softmax( Ww h + bw ) for the word within the class]
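A minimal sketch of the two-step computation, assuming each word belongs to exactly one class (the toy word2class/class2words assignment is illustrative): p(word|h) = p(class|h) * p(word|class,h), with the second softmax restricted to the words of that class.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def class_based_word_prob(word, h, Wc, bc, Ww, bw, word2class, class2words):
    """p(word | h) = p(class | h) * p(word | class, h); only one small within-class softmax."""
    c = word2class[word]
    p_class = softmax(Wc @ h + bc)[c]                 # softmax over classes (small)
    members = class2words[c]                          # words belonging to this class
    scores = Ww[members] @ h + bw[members]            # softmax only within the class
    p_word_in_class = softmax(scores)[members.index(word)]
    return p_class * p_word_in_class

# Illustrative toy setup: 6 words in 2 classes
rng = np.random.default_rng(4)
H, num_classes, V = 8, 2, 6
Wc, bc = rng.normal(size=(num_classes, H)), np.zeros(num_classes)
Ww, bw = rng.normal(size=(V, H)), np.zeros(V)
word2class = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
class2words = {0: [0, 1, 2], 1: [3, 4, 5]}
h = rng.normal(size=H)
print(sum(class_based_word_prob(w, h, Wc, bc, Ww, bw, word2class, class2words) for w in range(V)))  # ~1.0
```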

14

Hierarchical Softmax

● Tree-structured prediction of word ID
● Usually modeled as a sequence of binary decisions

Morin and Bengio 2005: Hierarchical Probabilistic NNLM

0 1 1 1 0 → word 14
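A sketch of the sequence-of-binary-decisions view, assuming a complete balanced binary tree with one sigmoid per internal node; the tree construction and sizes here are illustrative, not the clustering used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_word_prob(word_id, h, node_W, node_b, depth):
    """p(word | h) as a product of binary decisions along the path from the root to the leaf."""
    p, node = 1.0, 0                                   # start at the root (heap index 0)
    for d in reversed(range(depth)):
        bit = (word_id >> d) & 1                       # next binary decision for this word ID
        p_right = sigmoid(node_W[node] @ h + node_b[node])
        p *= p_right if bit == 1 else (1.0 - p_right)
        node = 2 * node + 1 + bit                      # descend to the left/right child
    return p

# Illustrative toy setup: 2^4 = 16 words, so 15 internal nodes
rng = np.random.default_rng(5)
H, depth = 8, 4
node_W, node_b = rng.normal(size=(2**depth - 1, H)), np.zeros(2**depth - 1)
h = rng.normal(size=H)
print(sum(hierarchical_word_prob(w, h, node_W, node_b, depth) for w in range(2**depth)))  # ~1.0
```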

15

Summary of Class-based Softmaxes

● Train time efficiency: Faster on CPU, pain to GPU
● Test time efficiency: Worse
● Number of parameters: More
● Test time accuracy: Slightly worse to slightly better
● Code complexity: High

16

Binary Code Prediction

● Just directly predict the binary code of the word ID
● Like hierarchical softmax, but with shared weights at every layer → fewer parameters, easy to GPU

Oda et al. 2017: NMT Via Binary Code Prediction

σ( W h + b ) = 01110 → word 14
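A sketch of predicting the bits of the word ID directly with a single shared weight matrix; the 5-bit setup mirrors the 01110 → word 14 example, and the hybrid model and error-correcting codes from the next slide are not included.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_word_by_binary_code(W, b, h):
    """Predict each bit of the word ID independently, then decode the bits to an integer."""
    bit_probs = sigmoid(W @ h + b)                  # one sigmoid per bit, weights shared across all words
    bits = (bit_probs > 0.5).astype(int)            # hard decisions
    word_id = int("".join(map(str, bits)), 2)       # e.g. [0,1,1,1,0] -> 14
    return bits, word_id

# Illustrative: 5 bits cover a 32-word vocabulary; W is only 5 x |h| instead of |V| x |h|
rng = np.random.default_rng(6)
num_bits, H = 5, 8
W, b = rng.normal(size=(num_bits, H)), np.zeros(num_bits)
h = rng.normal(size=H)
print(predict_word_by_binary_code(W, b, h))
```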

17

Two Improvements

● Hybrid model
● Error-correcting codes

18

Summary of Binary Code Prediction

● Train time efficiency: Faster
● Test time efficiency: Faster (12x on CPU!)
● Number of parameters: Fewer
● Test time accuracy: Slightly worse
● Code complexity: Moderate

19


Parameter Sharing

20

Parameter Sharing

● We have two |V| x |h| matrices in the decoder:
  ● Input word embeddings, which we look up and feed into the RNN
  ● Output word embeddings, which are the weight matrix W in the softmax
● Simple idea: tie their weights together

Press et al. 2016: Using the output embedding to improve language models
Inan et al. 2016: Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
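A minimal sketch of the tying itself: a single |V| x |h| matrix E serves both as the input embedding table and as the softmax weight matrix W. The tanh update below is a stand-in for the real decoder, and tying assumes the embedding size equals the hidden size (otherwise an extra projection is needed).

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(7)
V, H = 50_000, 512
E = rng.normal(size=(V, H)) * 0.01      # one shared |V| x |h| matrix instead of two
b = np.zeros(V)

prev_word = 1234
x = E[prev_word]                        # input side: embedding lookup for the previous word
h = np.tanh(x)                          # stand-in for the RNN/decoder state update
p = softmax(E @ h + b)                  # output side: the same matrix plays the role of W
print(p.argmax(), p.sum())
```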

21

Summary of Parameter Sharing

● Train time efficiency: Same
● Test time efficiency: Same
● Number of parameters: Fewer
● Test time accuracy: Better
● Code complexity: Low

22


Incorporating External Information

23

Problems w/ Lexical Choice in Neural MT

Arthur et al. 2016: Incorporating Discrete Translation Lexicons in NMT

24

When Does Translation Succeed? (in Output Embedding Space)

"I come from Tunisia"

[Figure: output embedding space with w*,tunisia, w*,norway, w*,sweden, w*,nigeria, w*,eat, w*,consume; the context vector h1 lands closest to w*,tunisia, so the right word is chosen]

25

When Does Translation Fail? (Embeddings Version)

"I come from Tunisia"

[Figure: the same output embedding space, but the context vector h1 lands among the neighboring country embeddings (w*,norway, w*,sweden, w*,nigeria) rather than clearly at w*,tunisia, so a similar but wrong word can win]

26

When Does Translation Fail? (Bias Version)

"I come from Tunisia"

[Figure: the context vector h1 is near w*,tunisia in the output embedding space, but the biases (btunisia = -0.5, bchina = 4.5) favor "china", so the wrong word can still win]

27

What about Traditional Symbolic Models?

his father likes Tunisia
kare no chichi wa chunijia ga suki da

P(kare|his) = 0.5
P(no|his) = 0.5
P(chichi|father) = 1.0
P(chunijia|Tunisia) = 1.0
P(suki|likes) = 0.5
P(da|likes) = 0.5

1-to-1 alignment

28

Even if We Make a Mistake...

his father likes Tunisia
kare no chichi wa chunijia ga suki da

P(kare|his) = 0.5
P(no|his) = 0.5
P(chichi|Tunisia) = 1.0
P(chunijia|father) = 1.0
P(suki|likes) = 0.5
P(da|likes) = 0.5

Different mistakes than neural MT

Soft alignment possible

29

Calculating Lexicon Probabilities

Attention over "I come from Tunisia": 0.05, 0.01, 0.02, 0.93

Word-by-word lexicon prob.                      Conditional lexicon prob.
             I      come   from   Tunisia
watashi      0.6    0.03   0.01   0.0           0.03
ore          0.2    0.01   0.02   0.0           0.01
...          ...    ...    ...    ...           ...
kuru         0.01   0.3    0.01   0.0           0.00
kara         0.02   0.1    0.5    0.01          0.02
...          ...    ...    ...    ...           ...
chunijia     0.0    0.0    0.0    0.96          0.89
oranda       0.0    0.0    0.0    0.0           0.00

(The conditional lexicon probability is the word-by-word lexicon probabilities weighted by the attention.)
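A sketch of the combination shown in the table: the conditional lexicon probability is the attention-weighted mixture of the per-source-word lexicon columns, lexi = Σj aj · plex(e | fj). The numbers reproduce the example above.

```python
import numpy as np

# Word-by-word lexicon probabilities p_lex(e | f_j), one column per source word
# (rows: watashi, ore, kuru, kara, chunijia, oranda; columns: I, come, from, Tunisia)
lex_columns = np.array([
    [0.60, 0.03, 0.01, 0.00],   # watashi
    [0.20, 0.01, 0.02, 0.00],   # ore
    [0.01, 0.30, 0.01, 0.00],   # kuru
    [0.02, 0.10, 0.50, 0.01],   # kara
    [0.00, 0.00, 0.00, 0.96],   # chunijia
    [0.00, 0.00, 0.00, 0.00],   # oranda
])
attention = np.array([0.05, 0.01, 0.02, 0.93])   # attention over "I come from Tunisia"

lex_i = lex_columns @ attention                  # conditional lexicon probability for this step
print(np.round(lex_i, 2))                        # ~[0.03, 0.01, 0.00, 0.02, 0.89, 0.00]
```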

30

Incorporating w/ Neural MT

● Softmax bias:

  p(ei | hi) = softmax( W hi + b + log( lexi + ε ) )

  (the ε prevents -∞ scores)

● Linear interpolation:

  p(ei | hi) = γ * softmax( W hi + b ) + (1-γ) * lexi
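A sketch of both ways of plugging lexi into the output layer; γ, ε, and the toy sizes are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def bias_method(W, b, h, lex, eps=1e-6):
    """Softmax bias: p = softmax(W h + b + log(lex + eps)); eps avoids -inf scores."""
    return softmax(W @ h + b + np.log(lex + eps))

def interpolation_method(W, b, h, lex, gamma=0.5):
    """Linear interpolation: p = gamma * softmax(W h + b) + (1 - gamma) * lex."""
    return gamma * softmax(W @ h + b) + (1.0 - gamma) * lex

# Illustrative usage with a toy 6-word vocabulary and the lexicon probabilities from above
rng = np.random.default_rng(8)
V, H = 6, 8
W, b, h = rng.normal(size=(V, H)), np.zeros(V), rng.normal(size=H)
lex = np.array([0.03, 0.01, 0.00, 0.02, 0.89, 0.00])
lex = lex / lex.sum()                         # make sure the lexicon distribution sums to 1
print(bias_method(W, b, h, lex).sum(), interpolation_method(W, b, h, lex).sum())  # both ~1.0
```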

31

Summary of External Lexicons

● Train time efficiency: Worse
● Test time efficiency: Worse
● Number of parameters: Same
● Test time accuracy: Better to Much Better
● Code complexity: High

32

Other Varieties of Biases

● Copying source words as-is
● Remembering and copying target words: were called cache models, now called pointer sentinel models :)

Gu et al. 2016. Incorporating copying mechanism in sequence-to-sequence learning

Gulcehre et al. 2016. Pointing the unknown words

Merity et al. 2016. Pointer Sentinel Mixture Models

33


Use of External Phrase Tables

Tang et al. 2016. NMT with External Phrase Memory

34


Conclusion

35

Conclusion

● Lots of softmax alternatives for neural MT
  → Consider them in your systems!
● But there is no method that is fast at training time, fast at test time, accurate, small, and simple
  → Consider making one yourself!
