
Page 1

Long Short-Term Memory in Recurrent Neural Networks
Shixiang Gu, Andrey Malinin

[email protected],[email protected]

November 20, 2014

Page 2

Outline

NN: Review

Early RNN

LSTM

Modern RNN

Page 3

Machine Learning Review

Discriminative Model learns P(Y|X)
Generative Model learns P(X,Y), or P(X)
Supervised Learning uses pairs of input X and output Y, and aims to learn P(Y|X) or f : X → Y
Unsupervised Learning uses input X only, and aims to learn P(X) or f : X → H, where H is a "better" representation of X
Models can be stochastic or deterministic

In this presentation, we focus on deterministic, supervised, discriminative models based on Recurrent Neural Networks (RNN) / Long Short-Term Memory (LSTM)

Applications: Speech Recognition, Machine Translation, Online Handwriting Recognition, Language Modeling, Music Composition, Reinforcement Learning, etc.

Page 4

NN (1940s-, or early 1800s)

input: X ∈ R^Q
target output: Y ∈ R^D
predicted output: Ŷ ∈ R^D
hidden: H, Z ∈ R^P
weights: W_X ∈ R^{P×Q}, W_Y ∈ R^{D×P}
activations: f : R^P → R^P, g : R^D → R^D
loss function: L : R^D × R^D → R
loss: E ∈ R
objective: minimize total loss E over training examples (X^(i), Y^(i)), i = 1...N

Forward Propagation:
Z = W_X · X
H = f(Z)
Ŷ = g(W_Y · H)
E = L(Y, Ŷ)
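For concreteness, a minimal NumPy sketch of this forward pass, assuming f is a sigmoid, g is the identity, and a squared-error loss (sizes and names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: Q inputs, P hidden units, D outputs
Q, P, D = 4, 8, 3
rng = np.random.default_rng(0)
W_X = rng.normal(scale=0.1, size=(P, Q))   # W_X ∈ R^{P×Q}
W_Y = rng.normal(scale=0.1, size=(D, P))   # W_Y ∈ R^{D×P}

def forward(x, y):
    z = W_X @ x                            # Z = W_X · X
    h = sigmoid(z)                         # H = f(Z), here f = sigmoid
    y_hat = W_Y @ h                        # Ŷ = g(W_Y · H), here g = identity
    e = 0.5 * np.sum((y - y_hat) ** 2)     # squared-error loss E = L(Y, Ŷ)
    return z, h, y_hat, e
```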

Page 5

Activation functions

Common activation functions f : X → Y on X, Y ∈ R^D:

Linear: f(X) = X
Sigmoid/Logistic: f(X) = 1 / (1 + e^{-X})
Rectified Linear (ReLU): f(X) = max(0, X)
Tanh: f(X) = tanh(X) = (e^X - e^{-X}) / (e^X + e^{-X})
Softmax: f : Y_i = e^{X_i} / Σ_{j=1}^{D} e^{X_j}

Most activations are element-wise and non-linear
Derivatives are easy to compute
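These map directly to a few lines of NumPy; a minimal sketch (the softmax here subtracts max(X) for numerical stability, which does not change the result):

```python
import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max(X) for numerical stability
    return e / np.sum(e)
```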

Page 6

NN: Backpropagation (1960s-1981)

Backpropagation (i.e. chain rule): assume g(X) = X, L(Y, Ŷ) = ½ Σ_{i=1}^{D} (Y_i - Ŷ_i)²:

∂E/∂Ŷ = Ŷ - Y,  ∂Ŷ/∂H = W_Y,  ∂H/∂Z = f'(Z)
∂E/∂Z = ∂E/∂Ŷ · ∂Ŷ/∂H · ∂H/∂Z = (W_Y^T · (Ŷ - Y)) ⊙ f'(Z)
∂E/∂W_Y = ∂E/∂Ŷ · ∂Ŷ/∂W_Y = (Ŷ - Y) ⊗ H
∂E/∂W_X = ∂E/∂Z · ∂Z/∂W_X = ∂E/∂Z ⊗ X

*abusing matrix calculus notations

Stochastic Gradient Descent
learning rate: ε > 0
W = W - ε · ∂E/∂W,  ∀ W ∈ {W_Y, W_X}
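A minimal NumPy sketch of these gradients and one SGD step, continuing the earlier forward pass (sigmoid hidden layer, identity output, squared-error loss; for the sigmoid, f'(Z) = H ⊙ (1 - H)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Q, P, D = 4, 8, 3
rng = np.random.default_rng(0)
W_X = rng.normal(scale=0.1, size=(P, Q))
W_Y = rng.normal(scale=0.1, size=(D, P))
eps = 0.01                                       # learning rate ε

def sgd_step(x, y):
    global W_X, W_Y
    # forward pass
    z = W_X @ x
    h = sigmoid(z)
    y_hat = W_Y @ h
    # backward pass (chain rule)
    dE_dyhat = y_hat - y                         # ∂E/∂Ŷ = Ŷ - Y
    dE_dz = (W_Y.T @ dE_dyhat) * (h * (1.0 - h)) # (W_Y^T·(Ŷ-Y)) ⊙ f'(Z)
    dE_dWY = np.outer(dE_dyhat, h)               # (Ŷ-Y) ⊗ H
    dE_dWX = np.outer(dE_dz, x)                  # ∂E/∂Z ⊗ X
    # SGD update
    W_Y -= eps * dE_dWY
    W_X -= eps * dE_dWX
```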

Page 7

RNN (1980s-)

Why Recurrent?
- Human brain is a recurrent neural network, i.e. has feedback connections
- A lot of data is sequential and dynamic (can grow or shrink)
- RNNs are Turing-complete [Siegelmann and Sontag, 1995, Graves et al., 2014]

Page 8

Unrolling RNNs

Figure: RNN

Figure: Unrolled RNN

Page 9

RNN (1980s-)

Forward Propagation:
Given H_0:
  Z_t = W_X · X_t + W_H · H_{t-1}
  H_t = f(Z_t)
  Ŷ_t = g(W_Y · H_t)
  E_t = L(Y_t, Ŷ_t)
  E = Σ_{t=1}^{T} E_t

RNN = very deep NN with tied weights

Sequence learning: unknown input length → unknown output length
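A minimal NumPy sketch of this recurrence, assuming f = tanh and g = identity (illustrative sizes; note the same W_X, W_H, W_Y are reused at every time step, i.e. tied weights):

```python
import numpy as np

Q, P, D = 4, 8, 3
rng = np.random.default_rng(0)
W_X = rng.normal(scale=0.1, size=(P, Q))
W_H = rng.normal(scale=0.1, size=(P, P))
W_Y = rng.normal(scale=0.1, size=(D, P))

def rnn_forward(xs, h0):
    """xs: input sequence of shape (T, Q); returns hidden states and outputs."""
    h = h0
    hs, ys = [], []
    for x_t in xs:
        z_t = W_X @ x_t + W_H @ h     # Z_t = W_X·X_t + W_H·H_{t-1}
        h = np.tanh(z_t)              # H_t = f(Z_t), here f = tanh
        y_t = W_Y @ h                 # Ŷ_t = g(W_Y·H_t), here g = identity
        hs.append(h)
        ys.append(y_t)
    return np.array(hs), np.array(ys)

hs, ys = rnn_forward(rng.normal(size=(5, Q)), np.zeros(P))
```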

Page 10

Vanishing/exploding gradients (1991)

Fundamental Problem in Deep Learning [Hochreiter, 1991]:

Let v_{j,t} = ∂E/∂Z_{j,t}

v_{i,t-1} = f'(Z_{i,t-1}) · Σ_{j=1}^{P} w_{ji} · v_{j,t}

∂v_{i,t-q}/∂v_{j,t} = Σ_{l_1=1}^{P} ... Σ_{l_{q-1}=1}^{P} Π_{m=1}^{q} f'_{l_m}(Z_{l_m,t-m}) · w_{l_m,l_{m-1}}

|f'_{l_m}(Z_{l_m,t-m}) · w_{l_m,l_{m-1}}| > 1.0 ∀m → explodes
|f'_{l_m}(Z_{l_m,t-m}) · w_{l_m,l_{m-1}}| < 1.0 ∀m → vanishes

The gradient vanishes or grows exponentially in the number of time steps T, e.g. a "butterfly effect"
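A tiny NumPy illustration of the effect (toy numbers, not from the slides): with a linear activation, so f' = 1, backpropagating through T steps multiplies the error signal by the recurrent weight matrix T times, and its norm shrinks or blows up exponentially depending on whether that per-step factor is below or above 1:

```python
import numpy as np

rng = np.random.default_rng(0)
P, T = 8, 50
grad = rng.normal(size=P)                # error signal at the last time step

for scale in (0.5, 1.5):                 # contracting vs. expanding recurrent weights
    W_H = scale * np.eye(P)              # toy recurrent weight matrix
    v = grad.copy()
    for _ in range(T):
        v = W_H.T @ v                    # one step of backpropagation through time
    print(scale, np.linalg.norm(v))      # roughly 1e-15 vs. 1e+9: vanishing vs. exploding
```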

Page 11

Early work on RNN (1980s-90s)

Exact gradient (TOO HARD! (back then)):
- Backpropagation Through Time (BPTT) [Werbos, 1990]

Approximate gradient:
- Real Time Recurrent Learning (RTRL) [Robinson and Fallside, 1987, Williams and Zipser, 1989]
- Truncated BPTT [Williams and Peng, 1990]

Non-gradient methods:
- Random Guessing [Hochreiter and Schmidhuber, 1996]

Unsupervised "pretraining":
- History Compressor [Schmidhuber, 1992]

Others:
- Time-delay Neural Networks [Lang et al., 1990], Hierarchical RNNs [El Hihi and Bengio, 1995], NARX [Lin et al., 1996] and more...

sources: [Graves, 2006, Sutskever, 2013, Schmidhuber, 2014]

Page 12

Pretraining RNN (1992)

History compressor (HC) [Schmidhuber, 1992] = unsupervised, greedy, layer-wise pretraining of an RNN
(Recall in standard DNN: unsupervised pretraining (AE, RBM) [Hinton and Salakhutdinov, 2006] → supervised [Krizhevsky et al., 2012])

Forward Propagation:
Given H_0, X_0; target Y_t = X_{t+1}:
  Z_t = W_XH · X_{t-1} + W_H · H_{t-1}
  H_t = f(Z_t)
  Ŷ_t = g(W_Y · H_t + W_XY · X_t)
  E_t = L(Y_t, Ŷ_t)
  E = Σ_{t=0}^{T-1} E_t

X' ← {X_0, H_0, (t, X_t) ∀ t s.t. X_t ≠ Ŷ_{t-1}} → train a new HC on X'

Figure: HC

Greedily pretrain the RNN, using "surprises" from the previous step (see the sketch below)
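A rough NumPy sketch of the "surprise" collection step, under assumptions not in the slides: `predict_next` is a hypothetical callable standing in for the trained lower-level predictor, and disagreement is tested with a numerical tolerance:

```python
import numpy as np

def collect_surprises(xs, predict_next):
    """Keep the time steps whose input the lower-level RNN failed to predict.

    xs:           input sequence, shape (T, Q)
    predict_next: hypothetical callable mapping xs[:t] to a prediction of xs[t]
    """
    surprises = [(0, xs[0])]                 # X_0 is always kept
    for t in range(1, len(xs)):
        y_hat = predict_next(xs[:t])         # lower level predicts the next input
        if not np.allclose(y_hat, xs[t]):    # prediction failed: a "surprise"
            surprises.append((t, xs[t]))
    return surprises                         # train the next HC level on these
```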

Page 13

Basic “LSTM” (1991-95)

Idea: Have separate linear units that simply deliver errors [Hochreiter and Schmidhuber, 1997]

Forward Propagation*:
Let C_t ∈ R^P = "cell state"
Given H_0, C_0:
  C_t = C_{t-1} + f_1(W_X · X_t + W_H · H_{t-1})
  H_t = f_2(C_t)
BPTT: truncate gradients outside the cell
*approximately
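A minimal NumPy sketch of this additive cell update, assuming f1 = f2 = tanh (illustrative weights). The point of the design is that C_t accumulates contributions instead of being repeatedly squashed and re-multiplied by weights, so the error carried along C is not forced to vanish:

```python
import numpy as np

P, Q = 8, 4
rng = np.random.default_rng(0)
W_X = rng.normal(scale=0.1, size=(P, Q))
W_H = rng.normal(scale=0.1, size=(P, P))

def basic_lstm_step(x_t, h_prev, c_prev):
    # C_t = C_{t-1} + f1(W_X·X_t + W_H·H_{t-1}): the cell state is purely additive
    c_t = c_prev + np.tanh(W_X @ x_t + W_H @ h_prev)
    # H_t = f2(C_t)
    h_t = np.tanh(c_t)
    return h_t, c_t
```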

Page 14

Original LSTM (1995-97)

Idea: Have additional controls for R/W access (input/output gates) [Hochreiter and Schmidhuber, 1997]

Forward Propagation*:
Given H_0, C_0:
  C_t = C_{t-1} + i_t ⊙ f_1(W_X · X_t + W_H · H_{t-1})
  H_t = o_t ⊙ f_2(C_t)
  i_t = σ(W_{i,X} · X_t + W_{i,C} · C_{t-1} + W_{i,H} · H_{t-1})
  o_t = σ(W_{o,X} · X_t + W_{o,C} · C_t + W_{o,H} · H_{t-1})
BPTT: truncate gradients outside the cell
*approximately

Page 15

Modern LSTM (full version) (2000s-)

Idea: Have a control for forgetting the cell state (forget gate) [Gers, 2001, Graves, 2006]

Forward Propagation:
Given H_0, C_0:
  C_t = f_t ⊙ C_{t-1} + i_t ⊙ f_1(W_X · X_t + W_H · H_{t-1})
  H_t = o_t ⊙ f_2(C_t)
  i_t = σ(W_{i,X} · X_t + W_{i,C} · C_{t-1} + W_{i,H} · H_{t-1})
  o_t = σ(W_{o,X} · X_t + W_{o,C} · C_t + W_{o,H} · H_{t-1})
  f_t = σ(W_{f,X} · X_t + W_{f,C} · C_{t-1} + W_{f,H} · H_{t-1})
BPTT: Use exact gradients
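A minimal NumPy sketch of one step of this full LSTM, following the equations above with f1 = f2 = tanh (illustrative sizes and weight names; the W_{·,C} terms are the peephole connections shown on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

P, Q = 8, 4
rng = np.random.default_rng(0)
init = lambda *shape: rng.normal(scale=0.1, size=shape)
W_X, W_H = init(P, Q), init(P, P)
W_iX, W_iC, W_iH = init(P, Q), init(P, P), init(P, P)
W_oX, W_oC, W_oH = init(P, Q), init(P, P), init(P, P)
W_fX, W_fC, W_fH = init(P, Q), init(P, P), init(P, P)

def lstm_step(x_t, h_prev, c_prev):
    # gates (with peephole connections to the cell state)
    i_t = sigmoid(W_iX @ x_t + W_iC @ c_prev + W_iH @ h_prev)
    f_t = sigmoid(W_fX @ x_t + W_fC @ c_prev + W_fH @ h_prev)
    # cell update: forget part of the old state, write the gated new candidate
    c_t = f_t * c_prev + i_t * np.tanh(W_X @ x_t + W_H @ h_prev)
    # output gate sees the updated cell state C_t
    o_t = sigmoid(W_oX @ x_t + W_oC @ c_t + W_oH @ h_prev)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```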

Page 16

Recent Progress in RNN

Echo-State Network (ESN): [Jaeger and Haas, 2004]
- only train the hidden-output weights
- other weights are drawn from a carefully-chosen distribution and fixed

Hessian-Free (HF) optimization: [Martens and Sutskever, 2011]
- approximate second-order method
- outperformed LSTM on small-scale tasks

Momentum methods: [Sutskever, 2013]
- Could we get some good results without HF?
- Yes! With "good" initialization & aggressive momentum scheduling
- Nesterov's accelerated gradient? (see the sketch below)
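For reference, a minimal sketch of the classical and Nesterov momentum updates mentioned above, in generic gradient-descent form (not code from the talk; `grad` is an assumed callable returning ∂E/∂w):

```python
import numpy as np

def sgd_momentum(w, grad, eps=0.01, mu=0.9, steps=100, nesterov=True):
    """Gradient descent with (optionally Nesterov) momentum.

    w:    parameter vector
    grad: assumed callable returning dE/dw at a given w
    """
    v = np.zeros_like(w)
    for _ in range(steps):
        if nesterov:
            # Nesterov: evaluate the gradient at the look-ahead point w + mu*v
            v = mu * v - eps * grad(w + mu * v)
        else:
            # classical momentum: gradient at the current point
            v = mu * v - eps * grad(w)
        w = w + v
    return w
```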

Page 17

LSTMs on Theano

DEMO!
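The demo code itself is not part of the transcript; below is only a minimal sketch of how an LSTM step might be wired up with theano.scan (illustrative names and sizes; the peephole W_{·,C} terms from the previous slides are omitted for brevity):

```python
import numpy as np
import theano
import theano.tensor as T

n_in, n_hid = 4, 8
rng = np.random.RandomState(0)
shared = lambda *shape: theano.shared(
    (0.1 * rng.randn(*shape)).astype(theano.config.floatX))
W_x, W_h = shared(n_hid, n_in), shared(n_hid, n_hid)       # candidate weights
W_ix, W_ih = shared(n_hid, n_in), shared(n_hid, n_hid)     # input gate
W_ox, W_oh = shared(n_hid, n_in), shared(n_hid, n_hid)     # output gate
W_fx, W_fh = shared(n_hid, n_in), shared(n_hid, n_hid)     # forget gate

x_seq = T.matrix('x_seq')                   # (T, n_in) input sequence

def step(x_t, h_prev, c_prev):
    i_t = T.nnet.sigmoid(T.dot(W_ix, x_t) + T.dot(W_ih, h_prev))
    f_t = T.nnet.sigmoid(T.dot(W_fx, x_t) + T.dot(W_fh, h_prev))
    o_t = T.nnet.sigmoid(T.dot(W_ox, x_t) + T.dot(W_oh, h_prev))
    c_t = f_t * c_prev + i_t * T.tanh(T.dot(W_x, x_t) + T.dot(W_h, h_prev))
    h_t = o_t * T.tanh(c_t)
    return h_t, c_t

h0 = T.zeros((n_hid,), dtype=theano.config.floatX)
c0 = T.zeros((n_hid,), dtype=theano.config.floatX)
(h_seq, c_seq), _ = theano.scan(step, sequences=x_seq, outputs_info=[h0, c0])
run = theano.function([x_seq], h_seq)       # maps an input sequence to hidden states
print(run(np.zeros((5, n_in), dtype=theano.config.floatX)).shape)   # (5, n_hid)
```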

Page 18

El Hihi, S. and Bengio, Y. (1995). Hierarchical recurrent neural networks for long-term dependencies. In NIPS, pages 493–499.

Gers, F. (2001). Long Short-Term Memory in Recurrent Neural Networks. PhD thesis, École Polytechnique Fédérale de Lausanne.

Graves, A. (2006). Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis, Technische Universität München.

Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.


Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München.

Hochreiter, S. and Schmidhuber, J. (1996). Bridging long time lags by weight guessing and "long short-term memory". Spatiotemporal Models in Biological and Artificial Systems, 37:65–72.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Lang, A., Waibel, A., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23–43.

Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338.

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040.


Robinson, A. and Fallside, F. (1987). The utility driven dynamic error propagation network. University of Cambridge Department of Engineering.

Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242.

Schmidhuber, J. (2014). Deep learning in neural networks: An overview. arXiv preprint arXiv:1404.7828.

Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150.

Sutskever, I. (2013). Training Recurrent Neural Networks. PhD thesis, University of Toronto.


Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560.

Williams, R. J. and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490–501.

Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.