TRANSCRIPT
Long Short-Term Memory (LSTM): A brief introduction
Daniel Renshaw
24th November 2014
Context and notation
- Just to give the LSTM something to do: neural network language modelling
- Vocabulary, size V
- x_t ∈ R^V: true word in position t (one-hot)
- y_t ∈ R^V: predicted word in position t (distribution)
- Assume all sentences zero-padded to length L
Context and notation
- Model: y_{t+1} = p(x_{t+1} | x_t, x_{t-1}, ..., x_1) for 1 ≤ t < L
- Minimize the cross-entropy objective:
  J = Σ_{t=1}^{L-1} H(y_{t+1}, x_{t+1}) = −Σ_{t=1}^{L-1} Σ_{i=1}^{V} x_{t+1,i} log(y_{t+1,i})
- σ(•) is some sigmoid-like function (e.g. logistic or tanh)
- b^• is a bias vector, W^{••} is a weight matrix
Multi-Layer Perceptron (MLP)
[Figure: feed-forward network for one prediction: the three previous one-hot words x_{t-3}, x_{t-2}, x_{t-1} are each embedded (e), fed jointly into the hidden layer h, which produces the predicted word y.]
y_{t+1} = softmax(W^{yh} h_t)
h_t = σ(W^{he} [e_{t-1}; e_{t-2}; e_{t-3}] + b^h)
e_t = W^{ex} x_t
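As a concrete reading of these three equations (not from the slides), here is a minimal NumPy sketch of one forward step of the MLP language model; the sizes V, E, H, the choice of tanh for σ, and the random initialisation are placeholder assumptions.

    import numpy as np

    V, E, H = 1000, 50, 100                # vocabulary, embedding, hidden sizes (assumed)
    rng = np.random.RandomState(0)
    W_ex = rng.randn(E, V) * 0.01          # embedding matrix
    W_he = rng.randn(H, 3 * E) * 0.01      # hidden weights over 3 concatenated embeddings
    W_yh = rng.randn(V, H) * 0.01          # output weights
    b_h = np.zeros(H)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def mlp_step(x_tm1, x_tm2, x_tm3):
        """Predict the next word from the previous three one-hot words."""
        e = [W_ex @ x for x in (x_tm1, x_tm2, x_tm3)]   # e_t = W^{ex} x_t
        h = np.tanh(W_he @ np.concatenate(e) + b_h)     # h_t = sigma(W^{he}[e; e; e] + b^h)
        return softmax(W_yh @ h)                        # y_{t+1} = softmax(W^{yh} h_t)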
Recurrent Neural Network (RNN)
[Figure: RNN unrolled over three time steps: each x_t is embedded to e_t, the hidden state h_t depends on e_t and on h_{t-1}, and y_{t+1} is predicted from h_t.]
y_{t+1} = softmax(W^{yh} h_t)
h_t = σ(W^{he} e_t + W^{hh} h_{t-1} + b^h)
e_t = W^{ex} x_t
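The corresponding step for the RNN, in the same sketch style (sizes and initialisation again assumed), now carrying a hidden state forward from one word to the next:

    import numpy as np

    V, E, H = 1000, 50, 100
    rng = np.random.RandomState(0)
    W_ex = rng.randn(E, V) * 0.01
    W_he = rng.randn(H, E) * 0.01
    W_hh = rng.randn(H, H) * 0.01
    W_yh = rng.randn(V, H) * 0.01
    b_h = np.zeros(H)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_tm1):
        e_t = W_ex @ x_t                                 # e_t = W^{ex} x_t
        h_t = np.tanh(W_he @ e_t + W_hh @ h_tm1 + b_h)   # h_t = sigma(W^{he} e_t + W^{hh} h_{t-1} + b^h)
        y_tp1 = softmax(W_yh @ h_t)                      # y_{t+1} = softmax(W^{yh} h_t)
        return h_t, y_tp1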
Vanishing gradients
- Error gradients pass through a nonlinearity at every step
Image from https://theclevermachine.wordpress.com
- Unless the weights are large, the error signal will degrade (see the numeric sketch below):
  δ^h = σ′(•) W^{(h+1)h} δ^{h+1}
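A small numeric illustration of this effect (my own, not from the slides): repeatedly scaling an error vector by σ′(•) and a modestly sized weight matrix drives its norm towards zero; with large weights it explodes instead.

    import numpy as np

    rng = np.random.RandomState(0)
    H = 100
    W = rng.randn(H, H) * 0.1          # modest recurrent weights (assumed scale)
    delta = rng.randn(H)               # error signal at the last layer / time step

    for step in range(1, 51):
        sigma_prime = 0.25             # the logistic's derivative is at most 0.25
        delta = sigma_prime * (W @ delta)   # delta^h = sigma'(.) W^{(h+1)h} delta^{h+1}
        if step % 10 == 0:
            print(step, np.linalg.norm(delta))
    # The norm collapses by many orders of magnitude within a few dozen steps.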
Vanishing gradients
- Gradients may vanish or explode
- Can affect any 'deep' network
- e.g. fine-tuning a non-recurrent deep neural network
Image from Alex Graves' textbook
Constant Error Carousel
- Allow the network to propagate errors without modification
- No nonlinearity in the recursion
[Figure: CEC unrolled over time: each x_t is embedded to e_t, written into an internal memory h_t that is carried forward additively, and read out through m_t to predict y_{t+1}. A second, zoomed-in view shows a single step, with • marking dense matrix multiplications and b^h the bias.]
y_{t+1} = softmax(W^{ym} m_t)
m_t = σ(h_t)
h_t = h_{t-1} + σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
e_t = W^{ex} x_t
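In the same assumed NumPy setup, one CEC step. The important detail is that the recursion on h_t is purely additive: the error flowing back along the memory passes through no nonlinearity and no weight matrix, so it is carried unchanged.

    import numpy as np

    V, E, H = 1000, 50, 100                # assumed sizes
    rng = np.random.RandomState(0)
    W_ex = rng.randn(E, V) * 0.01
    W_he = rng.randn(H, E) * 0.01
    W_hm = rng.randn(H, H) * 0.01
    W_ym = rng.randn(V, H) * 0.01
    b_h = np.zeros(H)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def cec_step(x_t, m_tm1, h_tm1):
        e_t = W_ex @ x_t                                        # e_t = W^{ex} x_t
        h_t = h_tm1 + np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)  # additive recursion: h_{t-1} is not squashed
        m_t = np.tanh(h_t)                                      # m_t = sigma(h_t)
        y_tp1 = softmax(W_ym @ m_t)                             # y_{t+1} = softmax(W^{ym} m_t)
        return m_t, h_t, y_tp1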
LSTM v1: input and output gates
- Attenuate input and output signals
[Figure: single LSTM cell: the write into the memory h_t is scaled by an input gate i_t and the read-out to m_t by an output gate o_t; both gates are logistic functions of e_t and m_{t-1}, with biases b^i, b^o (and b^h for the cell input).]
y_{t+1} = softmax(W^{ym} m_t)
m_t = o_t ⊙ σ(h_t)
o_t = logistic(W^{oe} e_t + W^{om} m_{t-1} + b^o)
h_t = h_{t-1} + i_t ⊙ σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
i_t = logistic(W^{ie} e_t + W^{im} m_{t-1} + b^i)
e_t = W^{ex} x_t
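A sketch of the v1 step in the same style, reusing the CEC weights above; the gate parameters (W_ie, W_im, b_i, W_oe, W_om, b_o below) are new assumed names, shaped like their h counterparts.

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    W_ie, W_im, b_i = rng.randn(H, E) * 0.01, rng.randn(H, H) * 0.01, np.zeros(H)   # input gate
    W_oe, W_om, b_o = rng.randn(H, E) * 0.01, rng.randn(H, H) * 0.01, np.zeros(H)   # output gate

    def lstm_v1_step(x_t, m_tm1, h_tm1):
        e_t = W_ex @ x_t
        i_t = logistic(W_ie @ e_t + W_im @ m_tm1 + b_i)                 # how much to write
        h_t = h_tm1 + i_t * np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)    # gated write into the memory
        o_t = logistic(W_oe @ e_t + W_om @ m_tm1 + b_o)                 # how much to read out
        m_t = o_t * np.tanh(h_t)                                        # gated read-out
        return m_t, h_t, softmax(W_ym @ m_t)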
LSTM v2: forget (remember) gate
- Model controls when the memory, h_t, is reduced
- The forget gate should really be called a remember gate
[Figure: as before, plus a forget gate f_t (logistic, bias b^f) that scales the recurrent memory h_{t-1} before the new input is added.]
y_{t+1} = softmax(W^{ym} m_t)
m_t = o_t ⊙ σ(h_t)
o_t = logistic(W^{oe} e_t + W^{om} m_{t-1} + b^o)
h_t = f_t ⊙ h_{t-1} + i_t ⊙ σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
i_t = logistic(W^{ie} e_t + W^{im} m_{t-1} + b^i)
f_t = logistic(W^{fe} e_t + W^{fm} m_{t-1} + b^f)
e_t = W^{ex} x_t
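Relative to the v1 sketch above, the only change is the forget gate; W_fe, W_fm, b_f are new assumed parameters shaped like the input gate's.

    W_fe, W_fm, b_f = rng.randn(H, E) * 0.01, rng.randn(H, H) * 0.01, np.zeros(H)   # forget ("remember") gate

    def lstm_v2_step(x_t, m_tm1, h_tm1):
        e_t = W_ex @ x_t
        i_t = logistic(W_ie @ e_t + W_im @ m_tm1 + b_i)
        f_t = logistic(W_fe @ e_t + W_fm @ m_tm1 + b_f)                     # scales the old memory
        h_t = f_t * h_tm1 + i_t * np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)
        o_t = logistic(W_oe @ e_t + W_om @ m_tm1 + b_o)
        m_t = o_t * np.tanh(h_t)
        return m_t, h_t, softmax(W_ym @ m_t)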
LSTM v3: peepholes
- Allow the gates to additionally see the internal memory state
- Diagonal matrices only (all others dense)
[Figure: as before, plus peephole connections from the memory into the three gates; ◦ marks diagonal matrix multiplications, • dense ones.]
y_{t+1} = softmax(W^{ym} m_t)
m_t = o_t ⊙ σ(h_t)
o_t = logistic(W^{oe} e_t + W^{om} m_{t-1} + W^{oh} h_t + b^o)
h_t = f_t ⊙ h_{t-1} + i_t ⊙ σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
i_t = logistic(W^{ie} e_t + W^{im} m_{t-1} + W^{ih} h_{t-1} + b^i)
f_t = logistic(W^{fe} e_t + W^{fm} m_{t-1} + W^{fh} h_{t-1} + b^f)
e_t = W^{ex} x_t
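In code the peepholes just add the memory state to each gate's pre-activation. Because the peephole matrices are diagonal, they can be stored as vectors (p_i, p_f, p_o are assumed names) and applied elementwise; note that, as in the equations above, the output gate sees the freshly updated h_t while the other two see h_{t-1}.

    p_i, p_f, p_o = (rng.randn(H) * 0.01 for _ in range(3))     # diagonal peepholes stored as vectors

    def lstm_v3_step(x_t, m_tm1, h_tm1):
        e_t = W_ex @ x_t
        i_t = logistic(W_ie @ e_t + W_im @ m_tm1 + p_i * h_tm1 + b_i)
        f_t = logistic(W_fe @ e_t + W_fm @ m_tm1 + p_f * h_tm1 + b_f)
        h_t = f_t * h_tm1 + i_t * np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)
        o_t = logistic(W_oe @ e_t + W_om @ m_tm1 + p_o * h_t + b_o)        # peeks at the new memory
        m_t = o_t * np.tanh(h_t)
        return m_t, h_t, softmax(W_ym @ m_t)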
LSTM v4: output projection layer
- Reduces dimensionality of recursive messages
- Can speed up training without affecting results quality
[Figure: as before, with an extra dense projection applied to the gated output before it is passed on as m_t.]
y_{t+1} = softmax(W^{ym} m_t)
m_t = W^{mm} (o_t ⊙ σ(h_t))
o_t = logistic(W^{oe} e_t + W^{om} m_{t-1} + W^{oh} h_t + b^o)
h_t = f_t ⊙ h_{t-1} + i_t ⊙ σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
i_t = logistic(W^{ie} e_t + W^{im} m_{t-1} + W^{ih} h_{t-1} + b^i)
f_t = logistic(W^{fe} e_t + W^{fm} m_{t-1} + W^{fh} h_{t-1} + b^f)
e_t = W^{ex} x_t
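The projection is one extra matrix. In this sketch it is an assumed parameter W_mm of shape (M, H) with M < H; every matrix that consumes m_{t-1} or reads from m_t then has its m-facing dimension shrunk to M, which is where the speed-up comes from.

    M = 30                                            # assumed projected size, smaller than H
    W_mm = rng.randn(M, H) * 0.01                     # output projection
    # the m-facing matrices shrink from H to M columns (and W_ym likewise):
    W_im, W_fm, W_om, W_hm = (rng.randn(H, M) * 0.01 for _ in range(4))
    W_ym = rng.randn(V, M) * 0.01

    def lstm_v4_step(x_t, m_tm1, h_tm1):             # m_tm1 now has size M
        e_t = W_ex @ x_t
        i_t = logistic(W_ie @ e_t + W_im @ m_tm1 + p_i * h_tm1 + b_i)
        f_t = logistic(W_fe @ e_t + W_fm @ m_tm1 + p_f * h_tm1 + b_f)
        h_t = f_t * h_tm1 + i_t * np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)
        o_t = logistic(W_oe @ e_t + W_om @ m_tm1 + p_o * h_t + b_o)
        m_t = W_mm @ (o_t * np.tanh(h_t))             # project the gated output down to M dimensions
        return m_t, h_t, softmax(W_ym @ m_t)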
Gradients no longer vanish
Image from Alex Graves' textbook
LSTM implementations
- RNNLIB (Alex Graves): http://sourceforge.net/p/rnnl/
- PyLearn2 (experimental code, in sandbox/rnn/models/rnn.py)
- Theano, e.g.:
def lstm_step(x_t, m_tm1, h_tm1, w_xe, ..., b_o):
    # dot, sigmoid, tanh, softmax stand for the corresponding theano.tensor ops;
    # w_ci, w_cf, w_co are the diagonal peephole weights, stored as vectors.
    e_t = dot(x_t, w_xe)
    i_t = sigmoid(dot(e_t, w_ei) + dot(m_tm1, w_mi) + h_tm1 * w_ci + b_i)
    f_t = sigmoid(dot(e_t, w_ef) + dot(m_tm1, w_mf) + h_tm1 * w_cf + b_f)
    h_t = f_t * h_tm1 + i_t * tanh(dot(e_t, w_eh) + dot(m_tm1, w_mh) + b_h)
    o_t = sigmoid(dot(e_t, w_eo) + dot(m_tm1, w_mo) + h_t * w_co + b_o)
    m_t = dot(o_t * tanh(h_t), w_mm)
    y_t = softmax(dot(m_t, w_my))
    return m_t, h_t, y_t
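To run this over a sentence one would typically wrap the step in theano.scan; a rough sketch, assuming the elided weight arguments of lstm_step are spelled out and that m0, h0 and the list params (initial states and shared weight variables) are defined elsewhere:

    import theano
    import theano.tensor as T

    x = T.matrix('x')                      # one sentence: L rows of one-hot word vectors
    (m_seq, h_seq, y_seq), updates = theano.scan(
        fn=lstm_step,
        sequences=x,                       # x_t for t = 1 .. L
        outputs_info=[m0, h0, None],       # m and h are recurrent; y is not fed back
        non_sequences=params)              # w_xe, ..., b_o in the order lstm_step expects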
Further thoughts
- Sequences vs. hierarchies vs. plain 'deep'
- Other solutions to vanishing gradients:
  - Clockwork RNN
  - Different training algorithms (e.g. Hessian-Free optimization)
  - Rectified linear units (ReLU)? σ(x) = max(0, x); constant gradient when active