TRANSCRIPT
Long Short-Term Memory (LSTM): A brief introduction
Daniel Renshaw
24th November 2014
Context and notation
- Just to give the LSTM something to do: neural network language modelling
- Vocabulary, size V
- x_t ∈ R^V: true word in position t (one-hot)
- y_t ∈ R^V: predicted word in position t (distribution)
- Assume all sentences zero-padded to length L
Context and notation
- Model: y_{t+1} = p(x_{t+1} | x_t, x_{t-1}, ..., x_1) for 1 ≤ t < L
- Minimize the cross-entropy objective:
  J = Σ_{t=1}^{L-1} H(y_{t+1}, x_{t+1}) = −Σ_{t=1}^{L-1} Σ_{i=1}^{V} x_{t+1,i} log(y_{t+1,i})
- σ(•) is some sigmoid-like function (e.g. logistic or tanh)
- b^• is a bias vector, W^{••} is a weight matrix
Multi-Layer Perceptron (MLP)
[Figure: feed-forward network for one prediction: the three previous one-hot words x_{t-3}, x_{t-2}, x_{t-1} are each embedded (e), fed jointly into the hidden layer h, which produces the predicted word y.]
y_{t+1} = softmax(W^{yh} h_t)
h_t = σ(W^{he} [e_{t-1}; e_{t-2}; e_{t-3}] + b^h)
e_t = W^{ex} x_t
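As a concrete reading of these three equations (not from the slides), here is a minimal NumPy sketch of one forward step of the MLP language model; the sizes V, E, H, the choice of tanh for σ, and the random initialisation are placeholder assumptions.

    import numpy as np

    V, E, H = 1000, 50, 100                # vocabulary, embedding, hidden sizes (assumed)
    rng = np.random.RandomState(0)
    W_ex = rng.randn(E, V) * 0.01          # embedding matrix
    W_he = rng.randn(H, 3 * E) * 0.01      # hidden weights over 3 concatenated embeddings
    W_yh = rng.randn(V, H) * 0.01          # output weights
    b_h = np.zeros(H)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def mlp_step(x_tm1, x_tm2, x_tm3):
        """Predict the next word from the previous three one-hot words."""
        e = [W_ex @ x for x in (x_tm1, x_tm2, x_tm3)]   # e_t = W^{ex} x_t
        h = np.tanh(W_he @ np.concatenate(e) + b_h)     # h_t = sigma(W^{he}[e; e; e] + b^h)
        return softmax(W_yh @ h)                        # y_{t+1} = softmax(W^{yh} h_t)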
Recurrent Neural Network (RNN)
[Figure: RNN unrolled over three time steps: each x_t is embedded to e_t, the hidden state h_t depends on e_t and on h_{t-1}, and y_{t+1} is predicted from h_t.]
y_{t+1} = softmax(W^{yh} h_t)
h_t = σ(W^{he} e_t + W^{hh} h_{t-1} + b^h)
e_t = W^{ex} x_t
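The corresponding step for the RNN, in the same sketch style (sizes and initialisation again assumed), now carrying a hidden state forward from one word to the next:

    import numpy as np

    V, E, H = 1000, 50, 100
    rng = np.random.RandomState(0)
    W_ex = rng.randn(E, V) * 0.01
    W_he = rng.randn(H, E) * 0.01
    W_hh = rng.randn(H, H) * 0.01
    W_yh = rng.randn(V, H) * 0.01
    b_h = np.zeros(H)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_tm1):
        e_t = W_ex @ x_t                                 # e_t = W^{ex} x_t
        h_t = np.tanh(W_he @ e_t + W_hh @ h_tm1 + b_h)   # h_t = sigma(W^{he} e_t + W^{hh} h_{t-1} + b^h)
        y_tp1 = softmax(W_yh @ h_t)                      # y_{t+1} = softmax(W^{yh} h_t)
        return h_t, y_tp1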
Vanishing gradients
- Error gradients pass through a nonlinearity at every step
Image from https://theclevermachine.wordpress.com
- Unless the weights are large, the error signal will degrade (see the numeric sketch below):
  δ^h = σ′(•) W^{(h+1)h} δ^{h+1}
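A small numeric illustration of this effect (my own, not from the slides): repeatedly scaling an error vector by σ′(•) and a modestly sized weight matrix drives its norm towards zero; with large weights it explodes instead.

    import numpy as np

    rng = np.random.RandomState(0)
    H = 100
    W = rng.randn(H, H) * 0.1          # modest recurrent weights (assumed scale)
    delta = rng.randn(H)               # error signal at the last layer / time step

    for step in range(1, 51):
        sigma_prime = 0.25             # the logistic's derivative is at most 0.25
        delta = sigma_prime * (W @ delta)   # delta^h = sigma'(.) W^{(h+1)h} delta^{h+1}
        if step % 10 == 0:
            print(step, np.linalg.norm(delta))
    # The norm collapses by many orders of magnitude within a few dozen steps.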
Vanishing gradients
- Gradients may vanish or explode
- Can affect any 'deep' network
- e.g. fine-tuning a non-recurrent deep neural network
Image from Alex Graves' textbook
Constant Error Carousel
- Allow the network to propagate errors without modification
- No nonlinearity in the recursion
[Figure: CEC unrolled over time: each x_t is embedded to e_t, written into an internal memory h_t that is carried forward additively, and read out through m_t to predict y_{t+1}. A second, zoomed-in view shows a single step, with • marking dense matrix multiplications and b^h the bias.]
y_{t+1} = softmax(W^{ym} m_t)
m_t = σ(h_t)
h_t = h_{t-1} + σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
e_t = W^{ex} x_t
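In the same assumed NumPy setup, one CEC step. The important detail is that the recursion on h_t is purely additive: the error flowing back along the memory passes through no nonlinearity and no weight matrix, so it is carried unchanged.

    import numpy as np

    V, E, H = 1000, 50, 100                # assumed sizes
    rng = np.random.RandomState(0)
    W_ex = rng.randn(E, V) * 0.01
    W_he = rng.randn(H, E) * 0.01
    W_hm = rng.randn(H, H) * 0.01
    W_ym = rng.randn(V, H) * 0.01
    b_h = np.zeros(H)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def cec_step(x_t, m_tm1, h_tm1):
        e_t = W_ex @ x_t                                        # e_t = W^{ex} x_t
        h_t = h_tm1 + np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)  # additive recursion: h_{t-1} is not squashed
        m_t = np.tanh(h_t)                                      # m_t = sigma(h_t)
        y_tp1 = softmax(W_ym @ m_t)                             # y_{t+1} = softmax(W^{ym} m_t)
        return m_t, h_t, y_tp1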
LSTM v1: input and output gates
- Attenuate input and output signals
[Figure: single LSTM cell: the write into the memory h_t is scaled by an input gate i_t and the read-out to m_t by an output gate o_t; both gates are logistic functions of e_t and m_{t-1}, with biases b^i, b^o (and b^h for the cell input).]
y_{t+1} = softmax(W^{ym} m_t)
m_t = o_t ⊙ σ(h_t)
o_t = logistic(W^{oe} e_t + W^{om} m_{t-1} + b^o)
h_t = h_{t-1} + i_t ⊙ σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
i_t = logistic(W^{ie} e_t + W^{im} m_{t-1} + b^i)
e_t = W^{ex} x_t
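A sketch of the v1 step in the same style, reusing the CEC weights above; the gate parameters (W_ie, W_im, b_i, W_oe, W_om, b_o below) are new assumed names, shaped like their h counterparts.

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    W_ie, W_im, b_i = rng.randn(H, E) * 0.01, rng.randn(H, H) * 0.01, np.zeros(H)   # input gate
    W_oe, W_om, b_o = rng.randn(H, E) * 0.01, rng.randn(H, H) * 0.01, np.zeros(H)   # output gate

    def lstm_v1_step(x_t, m_tm1, h_tm1):
        e_t = W_ex @ x_t
        i_t = logistic(W_ie @ e_t + W_im @ m_tm1 + b_i)                 # how much to write
        h_t = h_tm1 + i_t * np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)    # gated write into the memory
        o_t = logistic(W_oe @ e_t + W_om @ m_tm1 + b_o)                 # how much to read out
        m_t = o_t * np.tanh(h_t)                                        # gated read-out
        return m_t, h_t, softmax(W_ym @ m_t)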
LSTM v2: forget (remember) gate
- Model controls when the memory, h_t, is reduced
- The forget gate should really be called a remember gate
[Figure: as before, plus a forget gate f_t (logistic, bias b^f) that scales the recurrent memory h_{t-1} before the new input is added.]
y_{t+1} = softmax(W^{ym} m_t)
m_t = o_t ⊙ σ(h_t)
o_t = logistic(W^{oe} e_t + W^{om} m_{t-1} + b^o)
h_t = f_t ⊙ h_{t-1} + i_t ⊙ σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
i_t = logistic(W^{ie} e_t + W^{im} m_{t-1} + b^i)
f_t = logistic(W^{fe} e_t + W^{fm} m_{t-1} + b^f)
e_t = W^{ex} x_t
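Relative to the v1 sketch above, the only change is the forget gate; W_fe, W_fm, b_f are new assumed parameters shaped like the input gate's.

    W_fe, W_fm, b_f = rng.randn(H, E) * 0.01, rng.randn(H, H) * 0.01, np.zeros(H)   # forget ("remember") gate

    def lstm_v2_step(x_t, m_tm1, h_tm1):
        e_t = W_ex @ x_t
        i_t = logistic(W_ie @ e_t + W_im @ m_tm1 + b_i)
        f_t = logistic(W_fe @ e_t + W_fm @ m_tm1 + b_f)                     # scales the old memory
        h_t = f_t * h_tm1 + i_t * np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)
        o_t = logistic(W_oe @ e_t + W_om @ m_tm1 + b_o)
        m_t = o_t * np.tanh(h_t)
        return m_t, h_t, softmax(W_ym @ m_t)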
LSTM v3: peepholes
- Allow the gates to additionally see the internal memory state
- Diagonal matrices only (all others dense)
[Figure: as before, plus peephole connections from the memory into the three gates; ◦ marks diagonal matrix multiplications, • dense ones.]
y_{t+1} = softmax(W^{ym} m_t)
m_t = o_t ⊙ σ(h_t)
o_t = logistic(W^{oe} e_t + W^{om} m_{t-1} + W^{oh} h_t + b^o)
h_t = f_t ⊙ h_{t-1} + i_t ⊙ σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
i_t = logistic(W^{ie} e_t + W^{im} m_{t-1} + W^{ih} h_{t-1} + b^i)
f_t = logistic(W^{fe} e_t + W^{fm} m_{t-1} + W^{fh} h_{t-1} + b^f)
e_t = W^{ex} x_t
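In code the peepholes just add the memory state to each gate's pre-activation. Because the peephole matrices are diagonal, they can be stored as vectors (p_i, p_f, p_o are assumed names) and applied elementwise; note that, as in the equations above, the output gate sees the freshly updated h_t while the other two see h_{t-1}.

    p_i, p_f, p_o = (rng.randn(H) * 0.01 for _ in range(3))     # diagonal peepholes stored as vectors

    def lstm_v3_step(x_t, m_tm1, h_tm1):
        e_t = W_ex @ x_t
        i_t = logistic(W_ie @ e_t + W_im @ m_tm1 + p_i * h_tm1 + b_i)
        f_t = logistic(W_fe @ e_t + W_fm @ m_tm1 + p_f * h_tm1 + b_f)
        h_t = f_t * h_tm1 + i_t * np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)
        o_t = logistic(W_oe @ e_t + W_om @ m_tm1 + p_o * h_t + b_o)        # peeks at the new memory
        m_t = o_t * np.tanh(h_t)
        return m_t, h_t, softmax(W_ym @ m_t)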
LSTM v4: output projection layer
- Reduces dimensionality of recursive messages
- Can speed up training without affecting results quality
[Figure: as before, with an extra dense projection applied to the gated output before it is passed on as m_t.]
y_{t+1} = softmax(W^{ym} m_t)
m_t = W^{mm} (o_t ⊙ σ(h_t))
o_t = logistic(W^{oe} e_t + W^{om} m_{t-1} + W^{oh} h_t + b^o)
h_t = f_t ⊙ h_{t-1} + i_t ⊙ σ(W^{he} e_t + W^{hm} m_{t-1} + b^h)
i_t = logistic(W^{ie} e_t + W^{im} m_{t-1} + W^{ih} h_{t-1} + b^i)
f_t = logistic(W^{fe} e_t + W^{fm} m_{t-1} + W^{fh} h_{t-1} + b^f)
e_t = W^{ex} x_t
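The projection is one extra matrix. In this sketch it is an assumed parameter W_mm of shape (M, H) with M < H; every matrix that consumes m_{t-1} or reads from m_t then has its m-facing dimension shrunk to M, which is where the speed-up comes from.

    M = 30                                            # assumed projected size, smaller than H
    W_mm = rng.randn(M, H) * 0.01                     # output projection
    # the m-facing matrices shrink from H to M columns (and W_ym likewise):
    W_im, W_fm, W_om, W_hm = (rng.randn(H, M) * 0.01 for _ in range(4))
    W_ym = rng.randn(V, M) * 0.01

    def lstm_v4_step(x_t, m_tm1, h_tm1):             # m_tm1 now has size M
        e_t = W_ex @ x_t
        i_t = logistic(W_ie @ e_t + W_im @ m_tm1 + p_i * h_tm1 + b_i)
        f_t = logistic(W_fe @ e_t + W_fm @ m_tm1 + p_f * h_tm1 + b_f)
        h_t = f_t * h_tm1 + i_t * np.tanh(W_he @ e_t + W_hm @ m_tm1 + b_h)
        o_t = logistic(W_oe @ e_t + W_om @ m_tm1 + p_o * h_t + b_o)
        m_t = W_mm @ (o_t * np.tanh(h_t))             # project the gated output down to M dimensions
        return m_t, h_t, softmax(W_ym @ m_t)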
Gradients no longer vanish
Image from Alex Graves' textbook
LSTM implementations
- RNNLIB (Alex Graves): http://sourceforge.net/p/rnnl/
- PyLearn2 (experimental code, in sandbox/rnn/models/rnn.py)
- Theano, e.g.:
def lstm_step(x_t, m_tm1, h_tm1, w_xe, ..., b_o):
    # dot, sigmoid, tanh, softmax stand for the corresponding theano.tensor ops;
    # w_ci, w_cf, w_co are the diagonal peephole weights, stored as vectors.
    e_t = dot(x_t, w_xe)
    i_t = sigmoid(dot(e_t, w_ei) + dot(m_tm1, w_mi) + h_tm1 * w_ci + b_i)
    f_t = sigmoid(dot(e_t, w_ef) + dot(m_tm1, w_mf) + h_tm1 * w_cf + b_f)
    h_t = f_t * h_tm1 + i_t * tanh(dot(e_t, w_eh) + dot(m_tm1, w_mh) + b_h)
    o_t = sigmoid(dot(e_t, w_eo) + dot(m_tm1, w_mo) + h_t * w_co + b_o)
    m_t = dot(o_t * tanh(h_t), w_mm)
    y_t = softmax(dot(m_t, w_my))
    return m_t, h_t, y_t
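To run this over a sentence one would typically wrap the step in theano.scan; a rough sketch, assuming the elided weight arguments of lstm_step are spelled out and that m0, h0 and the list params (initial states and shared weight variables) are defined elsewhere:

    import theano
    import theano.tensor as T

    x = T.matrix('x')                      # one sentence: L rows of one-hot word vectors
    (m_seq, h_seq, y_seq), updates = theano.scan(
        fn=lstm_step,
        sequences=x,                       # x_t for t = 1 .. L
        outputs_info=[m0, h0, None],       # m and h are recurrent; y is not fed back
        non_sequences=params)              # w_xe, ..., b_o in the order lstm_step expects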
Further thoughts
- Sequences vs. hierarchies vs. plain 'deep'
- Other solutions to vanishing gradients:
  - Clockwork RNN
  - Different training algorithms (e.g. Hessian-Free optimization)
  - Rectified linear units (ReLU)? σ(x) = max(0, x); constant gradient when active