
Page 1

Recurrent Neural Network

Minlie Huang

aihuang@tsinghua.edu.cn
Department of Computer Science

Tsinghua University
Some slides are from Dr. Xipeng Qiu

Summer, 2017

Page 2

Recurrent Neural Network (RNN)

An RNN has recurrent hidden states whose output at each time step depends on that of the previous time step. Given a sequence x(1:n) = (x1, x2, ..., xt, ..., xn), where each xi is a vector, the RNN updates its recurrent hidden state ht by

ht = 0,               if t = 0
ht = f(ht−1, xt),     otherwise    (1)

Page 3

Simple Recurrent Network

- The recurrence function:

  ht = f(Uht−1 + Wxt + b)    (2)

  where f is a non-linear function (for instance, sigmoid or tanh).

- A typical RNN structure can be illustrated by unrolling this recurrence over time.
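Below is a minimal NumPy sketch of this recurrence; the function name rnn_forward and the toy dimensions are illustrative choices, not something given in the slides.

```python
import numpy as np

def rnn_forward(X, U, W, b, f=np.tanh):
    """Run the simple RNN recurrence ht = f(U ht-1 + W xt + b), Eq. (2),
    over a sequence X of shape (n, input_dim); returns all hidden states."""
    h = np.zeros(U.shape[0])          # h0 = 0, as in Eq. (1)
    states = []
    for x_t in X:                     # one step per time index t = 1..n
        h = f(U @ h + W @ x_t + b)
        states.append(h)
    return np.stack(states)           # shape (n, hidden_dim)

# Toy usage with random parameters (all dimensions are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))           # sequence of 5 input vectors, dim 3
U = rng.normal(size=(4, 4))
W = rng.normal(size=(4, 3))
H = rnn_forward(X, U, W, np.zeros(4))
print(H.shape)                        # (5, 4)
```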

Page 4

Backpropagation Through Time (BPTT)

Suppose the loss at time t is Jt; then the total loss is J = Σ_{t=1}^{T} Jt.

The gradient of J with respect to U is

∂J/∂U = Σ_{t=1}^{T} ∂Jt/∂U = Σ_{t=1}^{T} (∂ht/∂U)(∂Jt/∂ht)    (3)

Page 5

Matrix Calculus

Figure: y = f(x1, x2, ..., xn)

Figure: Y is a matrix.

Page 6

Matrix Calculus (cont’d)

Figure: Y, X are column vectors.

Figure: F, X are matrices.

Page 7

Gradient of RNN

∂J/∂U = Σ_{t=1}^{T} Σ_{k=1}^{t} (∂hk/∂U)(∂ht/∂hk)(∂yt/∂ht)(∂Jt/∂yt)    (4)

∂ht/∂hk = Π_{i=k+1}^{t} ∂hi/∂hi−1 = Π_{i=k+1}^{t} U^T diag[f′(hi−1)]    (5)

∂J/∂U = Σ_{t=1}^{T} Σ_{k=1}^{t} (∂hk/∂U) (Π_{i=k+1}^{t} U^T diag[f′(hi−1)]) (∂yt/∂ht)(∂Jt/∂yt)    (6)
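As a rough numerical illustration of Eq. (5) (the dimensions, weight scale, and tanh nonlinearity are arbitrary assumptions for this demo), one can accumulate the per-step factors U^T diag[f′(hi−1)] and watch the norm of the running product:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
U = rng.normal(scale=0.3, size=(dim, dim))   # small recurrent weights
W = rng.normal(size=(dim, dim))

h = np.zeros(dim)
jac = np.eye(dim)                            # running product of Eq. (5) factors
for i in range(1, 21):
    jac = U.T @ np.diag(1.0 - h ** 2) @ jac  # f = tanh, so f'(h) = 1 - h^2
    h = np.tanh(U @ h + W @ rng.normal(size=dim))
    print(i, np.linalg.norm(jac))            # norm of ∂ht/∂hk as t - k grows
```

With small recurrent weights the norm decays geometrically (vanishing gradients); rerunning with, say, scale=3.0 makes it grow instead (exploding gradients), which is the behaviour analysed on the next slide.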

Page 8

Long-Term Dependencies

Define γ = ||U^T diag[f′(hi−1)]||.

- Exploding gradient problem: when γ > 1 and t − k → ∞, γ^(t−k) → ∞.

- Vanishing gradient problem: when γ < 1 and t − k → ∞, γ^(t−k) → 0.

There are various ways to address the long-term dependency problem.

Page 9

Gradient Explosion & Vanishing

- Gradient norm clipping (Mikolov thesis, 2012; Pascanu, Mikolov, Bengio, ICML 2013):

  g ← ∂J/∂θ
  IF ||g|| > threshold THEN
      g ← (threshold / ||g||) · g

- Gradient propagation regularizer (avoids vanishing gradients)

- LSTM self-loops (avoid vanishing gradients)
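A minimal NumPy sketch of the clipping rule above (the function name and threshold are illustrative); frameworks offer equivalents such as torch.nn.utils.clip_grad_norm_ in PyTorch.

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale gradient g so that its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0])                 # ||g|| = 5
print(clip_gradient(g, threshold=1.0))   # rescaled to norm 1: [0.6 0.8]
```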

Page 10

Long Short-Term Memory (LSTM)

Key difference from the vanilla RNN: a memory cell C, controlled by three gates:

- input gate i:

  it = σ(Wi xt + Ui ht−1 + bi)    (7)

- output gate o:

  ot = σ(Wo xt + Uo ht−1 + bo)    (8)

- forget gate f:

  ft = σ(Wf xt + Uf ht−1 + bf)    (9)

- update of the memory cell and hidden state:

  C̃t = tanh(Wc xt + Uc ht−1 + bC)    (10)

  Ct = ft ⊗ Ct−1 + it ⊗ C̃t    (11)

  ht = ot ⊗ tanh(Ct)    (12)
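A minimal NumPy sketch of a single LSTM step implementing Eqs. (7)-(12); the parameter dictionary and its key names are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step; p holds W*, U*, b* for the gates and the cell."""
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])      # Eq. (7)
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])      # Eq. (8)
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])      # Eq. (9)
    C_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bC"])  # Eq. (10)
    C_t = f_t * C_prev + i_t * C_tilde                             # Eq. (11)
    h_t = o_t * np.tanh(C_t)                                       # Eq. (12)
    return h_t, C_t
```

In practice one would normally use a library cell such as torch.nn.LSTMCell rather than hand-rolling the update, but the underlying arithmetic is the same.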

Page 11

LSTM Architecture

C̃t = tanh(Wc xt + Uc ht−1 + bC)

Ct = ft ⊗ Ct−1 + it ⊗ C̃t

ht = ot ⊗ tanh(Ct)

This picture and the following four pictures are from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

Page 12

LSTM: Input and Forget Gates

it = σ(Wi xt + Ui ht−1 + bi)

C̃t = tanh(Wc xt + Uc ht−1 + bC)

ft = σ(Wf xt + Uf ht−1 + bf)

The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

Page 13

LSTM: Output Gate and State Update

ot = σ(Wo xt + Uo ht−1 + bo)

ht = ot ⊗ tanh(Ct)

Ct = ft ⊗ Ct−1 + it ⊗ C̃t

The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

Page 14

Gated Recurrent Unit (GRU)

Instead of three gates, only two gates are employed:

- update gate z:

  zt = σ(Wz xt + Uz ht−1)    (13)

- reset gate r:

  rt = σ(Wr xt + Ur ht−1)    (14)

- state update:

  h̃t = tanh(Wc xt + U(rt ⊗ ht−1))    (15)

  ht = (1 − zt) ⊗ ht−1 + zt ⊗ h̃t    (16)
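For comparison with the LSTM step, here is a minimal NumPy sketch of one GRU step following Eqs. (13)-(16); the parameter dictionary and key names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p holds the weight matrices of Eqs. (13)-(16)."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)             # Eq. (13) update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)             # Eq. (14) reset gate
    h_tilde = np.tanh(p["Wc"] @ x_t + p["U"] @ (r_t * h_prev))  # Eq. (15) candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # Eq. (16)
```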

Page 15

GRU Architecture

The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

Page 16

More Exploration of LSTM...

- How does the LSTM deal with the exploding and vanishing gradient problems?

- Refer to: Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.

Page 17

Stacked RNN/LSTM

- Stack multiple RNN/LSTM layers in the vertical direction.

- It is also common for the raw input (xi) to be connected to all layers as input.
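A quick sketch of a two-layer stacked LSTM using PyTorch's nn.LSTM (sizes are arbitrary). Note that the num_layers argument only stacks layers in the standard way; feeding the raw input xi to every layer, as mentioned above, would require a custom module.

```python
import torch
import torch.nn as nn

# Two LSTM layers stacked vertically: the first layer's hidden states are the
# second layer's inputs at every time step.
stacked = nn.LSTM(input_size=50, hidden_size=100, num_layers=2)

x = torch.randn(20, 8, 50)     # (seq_len, batch, input_size)
output, (h_n, c_n) = stacked(x)
print(output.shape)            # (20, 8, 100): top-layer hidden state per step
print(h_n.shape)               # (2, 8, 100): final hidden state of each layer
```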

Page 18

Bidirectional RNN/LSTM

- Model the prefix context and suffix context at the same time.

  ht(1) = f(U(1) ht−1(1) + W(1) xt + b(1))    (17)

  ht(2) = f(U(2) ht+1(2) + W(2) xt + b(2))    (18)

  ht = ht(1) ⊕ ht(2)    (19)
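A minimal NumPy sketch of Eqs. (17)-(19), treating ⊕ as vector concatenation (a common reading of the operator); the parameter dictionary is an assumption of this example.

```python
import numpy as np

def birnn_forward(X, p, f=np.tanh):
    """Bidirectional simple RNN: a left-to-right pass (Eq. 17), a right-to-left
    pass (Eq. 18), and concatenation of the two states at each step (Eq. 19)."""
    n, dim = len(X), p["U1"].shape[0]
    h_fwd, h_bwd = np.zeros((n, dim)), np.zeros((n, dim))
    h = np.zeros(dim)
    for t in range(n):                      # prefix (left) context
        h = f(p["U1"] @ h + p["W1"] @ X[t] + p["b1"])
        h_fwd[t] = h
    h = np.zeros(dim)
    for t in reversed(range(n)):            # suffix (right) context
        h = f(p["U2"] @ h + p["W2"] @ X[t] + p["b2"])
        h_bwd[t] = h
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (n, 2 * dim)
```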

Page 19

Application of RNN: Classification

Applicable to many classification problems.

- Text classification

- Sentiment classification

These models can be enhanced with attention mechanisms.
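A minimal sketch of a classification head on top of RNN hidden states (using the final state as the sequence summary; V and c are illustrative output-layer parameters, and mean pooling or attention are common alternatives).

```python
import numpy as np

def classify_sequence(H, V, c):
    """Map the final hidden state of an RNN run (H has shape (n, hidden_dim),
    e.g. from the rnn_forward sketch earlier) to class probabilities."""
    logits = V @ H[-1] + c               # V: (num_classes, hidden_dim)
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()
```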

Page 20

Application of RNN: Sequence Labeling

- Sequence labeling tasks, such as Chinese word segmentation, part-of-speech tagging, and semantic role labeling.
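A minimal sketch of a sequence-labeling head, which differs from classification by emitting a label distribution at every time step (V and c are illustrative output-layer parameters; a bidirectional RNN is a typical choice for producing H).

```python
import numpy as np

def label_sequence(H, V, c):
    """Predict one tag per time step from hidden states H of shape (n, hidden_dim)."""
    logits = H @ V.T + c                                    # (n, num_tags)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)            # per-step softmax
    return probs.argmax(axis=1)                             # one tag per token
```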

Page 21

Application of RNN: Sequence to Sequence Learning

- Machine translation
- Sequence-to-sequence generation (for dialogue, question answering, etc.)
- Image captioning: from image to text

A minimal encoder-decoder sketch follows the reference below.

Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. NIPS 2014: 3104-3112.
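A minimal PyTorch encoder-decoder sketch in the spirit of this setup; the sizes, single-layer LSTMs, and teacher-forced decoding are simplifications assumed for the example, not a reproduction of the cited model.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim)
        self.decoder = nn.LSTM(dim, dim)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        # Encode the source; its final state initialises the decoder.
        _, state = self.encoder(self.src_emb(src))
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)              # per-step scores over target vocab

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (12, 4))          # (src_len, batch)
tgt = torch.randint(0, 1200, (9, 4))           # (tgt_len, batch), teacher forcing
print(model(src, tgt).shape)                   # (9, 4, 1200)
```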

Page 22

Hierarchical LSTM

Hierarchical LSTM (low-level LSTMs are taken as input to the high-level LSTM)

Ghosh S, Vinyals O, Strope B, et al. Contextual LSTM (CLSTM) models for large-scale NLP tasks. arXiv preprint arXiv:1602.06291, 2016.
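A rough PyTorch sketch of the hierarchical idea (sizes and the word/sentence split are illustrative): a low-level LSTM encodes each sentence, and its final states form the input sequence of the high-level LSTM.

```python
import torch
import torch.nn as nn

word_lstm = nn.LSTM(input_size=50, hidden_size=100)    # low-level (within sentence)
sent_lstm = nn.LSTM(input_size=100, hidden_size=200)   # high-level (across sentences)

doc = [torch.randn(n, 1, 50) for n in (7, 12, 5)]      # 3 sentences, batch of 1
sent_vecs = []
for sent in doc:
    _, (h_n, _) = word_lstm(sent)      # final word-level state summarises the sentence
    sent_vecs.append(h_n[-1])          # shape (1, 100)
doc_seq = torch.stack(sent_vecs)       # (3, 1, 100): one vector per sentence
doc_repr, _ = sent_lstm(doc_seq)       # document-level states, one per sentence
print(doc_repr.shape)                  # (3, 1, 200)
```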

Page 23

Summary

- RNN is very good at modeling text/language

- Basic RNN models: RNN, GRU, LSTM
  - Stacked RNN
  - Bi-directional RNN
  - Hierarchical RNN

- Typical learning problems with RNN
  - Classification
  - Sequential labeling
  - Sequence-to-sequence translation: MT, dialogue generation