Recurrent Neural Network
Minlie Huang
[email protected]
Department of Computer Science, Tsinghua University
Some slides are from Dr. Xipeng Qiu.
Summer, 2017
Recurrent Neural Network (RNN)
An RNN has recurrent hidden states whose output at each time step depends on that of the previous step. Given a sequence x(1:n) = (x_1, x_2, ..., x_t, ..., x_n), where each x_i is a vector, the RNN updates its recurrent hidden state h_t by

h_t = 0 if t = 0, and h_t = f(h_{t-1}, x_t) otherwise    (1)
Simple Recurrent Network
• The recurrence function:

  h_t = f(U h_{t-1} + W x_t + b)    (2)

  where f is a non-linear function (for instance, sigmoid or tanh).
• A typical RNN structure can be illustrated as follows:
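Since the original figure is not reproduced here, a minimal NumPy sketch of the recurrence in Eqs. (1)-(2) may help; f = tanh, and all shapes and random parameters below are illustrative assumptions, not from the slides.

```python
import numpy as np

def rnn_step(h_prev, x_t, U, W, b):
    """One application of Eq. (2): h_t = f(U h_{t-1} + W x_t + b), with f = tanh."""
    return np.tanh(U @ h_prev + W @ x_t + b)

def rnn_forward(xs, U, W, b):
    """Unroll the recurrence over a sequence, starting from h_0 = 0 (Eq. (1))."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, U, W, b)
        states.append(h)
    return states

# Toy usage: hidden size 4, input size 3, sequence length 5.
rng = np.random.default_rng(0)
U, W, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
states = rnn_forward(rng.normal(size=(5, 3)), U, W, b)
```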
Backpropagation Through Time (BPTT)
Suppose the loss at time t is J_t; then the total loss is J = Σ_{t=1}^{T} J_t.
The gradient of J is

∂J/∂U = Σ_{t=1}^{T} ∂J_t/∂U = Σ_{t=1}^{T} (∂h_t/∂U)(∂J_t/∂h_t)    (3)
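To ground Eq. (3), here is a minimal NumPy sketch of BPTT for the simple RNN of Eq. (2). The linear readout y_t = V h_t + c and the squared loss are illustrative assumptions (the slides only define J = Σ_t J_t); only the gradient with respect to U is accumulated.

```python
import numpy as np

def bptt_dU(xs, ys, U, W, b, V, c):
    """BPTT sketch: accumulate dJ/dU = sum_t dJ_t/dU (Eq. (3)) for the RNN
    h_t = tanh(U h_{t-1} + W x_t + b), with an assumed linear readout
    y_t = V h_t + c and squared loss J_t = 0.5 * ||y_t - ys[t]||^2.
    xs, ys: lists of T input and T target vectors."""
    T, H = len(xs), U.shape[0]
    hs = [np.zeros(H)]
    for x_t in xs:                                  # forward pass
        hs.append(np.tanh(U @ hs[-1] + W @ x_t + b))
    dU = np.zeros_like(U)
    dh_next = np.zeros(H)                           # gradient flowing in from step t+1
    for t in range(T, 0, -1):                       # backward pass through time
        y_t = V @ hs[t] + c
        dh = V.T @ (y_t - ys[t - 1]) + dh_next      # dJ/dh_t
        da = (1.0 - hs[t] ** 2) * dh                # through tanh: the diag[f'(.)] factor
        dU += np.outer(da, hs[t - 1])               # accumulate the sum over t in Eq. (3)
        dh_next = U.T @ da                          # the U^T diag[f'(.)] term of Eq. (5)
    return dU
```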
Matrix Calculus
Figure: y = f(x_1, x_2, ..., x_n) (figure omitted).
Figure: Y is a matrix (figure omitted).

Matrix Calculus (cont'd)
Figure: Y, X are column vectors (figure omitted).
Figure: F, X are matrices (figure omitted).
Gradient of RNN

∂J/∂U = Σ_{t=1}^{T} Σ_{k=1}^{t} (∂h_k/∂U)(∂h_t/∂h_k)(∂y_t/∂h_t)(∂J_t/∂y_t)    (4)

∂h_t/∂h_k = Π_{i=k+1}^{t} ∂h_i/∂h_{i-1} = Π_{i=k+1}^{t} U^T diag[f'(h_{i-1})]    (5)

∂J/∂U = Σ_{t=1}^{T} Σ_{k=1}^{t} (∂h_k/∂U) (Π_{i=k+1}^{t} U^T diag[f'(h_{i-1})]) (∂y_t/∂h_t)(∂J_t/∂y_t)    (6)
Long-Term Dependencies
Define γ = ||U^T diag[f'(h_{i-1})]||.
• Exploding gradient problem: when γ > 1 and t − k → ∞, γ^{t−k} → ∞.
• Vanishing gradient problem: when γ < 1 and t − k → ∞, γ^{t−k} → 0.
There are various ways to address the long-term dependency problem; a small numeric illustration follows.
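A quick way to see why γ^{t−k} matters; the two values of γ below are arbitrary.

```python
# gamma = ||U^T diag[f'(h_{i-1})]|| acts like a per-step gain; the product in
# Eq. (5) then behaves like gamma**(t-k) over a gap of t-k time steps.
for gamma in (1.1, 0.9):
    print(gamma, [gamma ** n for n in (10, 50, 100)])
# 1.1 -> ~2.6, ~117, ~13781       (explodes)
# 0.9 -> ~0.35, ~0.0052, ~2.7e-5  (vanishes)
```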
Gradient Explosion & Vanishing
• Gradient norm clipping (Mikolov thesis 2012; Pascanu, Mikolov, Bengio, ICML 2013); see the sketch after this list:

  g ← ∂J/∂θ
  if ||g|| > threshold then
      g ← (threshold / ||g||) · g

• Gradient propagation regularizer (avoids the vanishing gradient)
• LSTM self-loops (avoid the vanishing gradient)
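A minimal NumPy sketch of the gradient-norm-clipping pseudocode above; the threshold in the usage line is arbitrary.

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g to norm `threshold` when ||g|| exceeds it, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

g = clip_gradient(np.array([3.0, 4.0]), threshold=1.0)  # -> [0.6, 0.8], norm 1
```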
Long Short-Term Memory (LSTM)
Key difference from the RNN: a memory cell c which is controlled by three gates:
• input gate i:

  i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (7)

• output gate o:

  o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (8)

• forget gate f:

  f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (9)

• update of the memory cell and hidden state:

  C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)    (10)
  C_t = f_t ⊗ C_{t-1} + i_t ⊗ C̃_t    (11)
  h_t = o_t ⊗ tanh(C_t)    (12)
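A minimal NumPy sketch of one LSTM step implementing Eqs. (7)-(12); the parameter-dictionary keys are illustrative names, not from the slides. Note that ⊗ in Eqs. (11)-(12) is the element-wise product, which is what `*` does on NumPy arrays.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names like "Wi"/"Ui"/"bi" to the weights of Eqs. (7)-(12)."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])        # input gate,   Eq. (7)
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])        # output gate,  Eq. (8)
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])        # forget gate,  Eq. (9)
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate,    Eq. (10)
    c = f * c_prev + i * c_tilde                                   # memory cell,  Eq. (11)
    h = o * np.tanh(c)                                             # hidden state, Eq. (12)
    return h, c
```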
LSTM Architecture

C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
C_t = f_t ⊗ C_{t-1} + i_t ⊗ C̃_t
h_t = o_t ⊗ tanh(C_t)

This picture and the following four pictures are from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
LSTM: Input and Forget Gates

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)

The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
LSTM: Output Gate and State Update

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t ⊗ tanh(C_t)
C_t = f_t ⊗ C_{t-1} + i_t ⊗ C̃_t

The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
Gated Recurrent Unit (GRU)
Instead of three gates, only two gates are employed:
• update gate z:

  z_t = σ(W_z x_t + U_z h_{t-1})    (13)

• reset gate r:

  r_t = σ(W_r x_t + U_r h_{t-1})    (14)

• state update:

  h̃_t = tanh(W_c x_t + U(r_t ⊗ h_{t-1}))    (15)
  h_t = (1 − z_t) ⊗ h_{t-1} + z_t ⊗ h̃_t    (16)
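Likewise, a minimal NumPy sketch of one GRU step implementing Eqs. (13)-(16); the key names in the parameter dictionary are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; `*` is the element-wise product written ⊗ in the slides."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)             # update gate, Eq. (13)
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)             # reset gate,  Eq. (14)
    h_tilde = np.tanh(p["Wc"] @ x_t + p["U"] @ (r * h_prev))  # candidate,   Eq. (15)
    return (1.0 - z) * h_prev + z * h_tilde                   # interpolate, Eq. (16)
```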
GRU Architecture
The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
More Exploration of LSTM...
• How does the LSTM deal with the exploding and vanishing gradient problems?
• Refer to: Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
Stacked RNN/LSTM
• Stack multiple RNNs/LSTMs in the vertical direction; see the sketch after this list.
• It is also common for the raw input (x_i) to be connected to all layers as input.
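A minimal sketch of the plain vertical stacking in the first bullet (without the extra raw-input connections of the second); shapes and parameter layout are assumptions.

```python
import numpy as np

def rnn_step(h_prev, x_t, U, W, b):
    return np.tanh(U @ h_prev + W @ x_t + b)

def stacked_rnn_forward(xs, layers):
    """layers: list of (U, W, b) tuples, one per vertical layer. At each time
    step, layer l consumes the hidden state of layer l-1 as its input."""
    hs = [np.zeros(U.shape[0]) for (U, W, b) in layers]
    outputs = []
    for x_t in xs:
        inp = x_t
        for l, (U, W, b) in enumerate(layers):
            hs[l] = rnn_step(hs[l], inp, U, W, b)
            inp = hs[l]            # this layer's state feeds the layer above
        outputs.append(hs[-1])     # topmost hidden state at each time step
    return outputs
```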
Bidirectional RNN/LSTM
• Models the prefix context and the suffix context at the same time; see the sketch after the equations.

h^(1)_t = f(U^(1) h^(1)_{t-1} + W^(1) x_t + b^(1))    (17)
h^(2)_t = f(U^(2) h^(2)_{t+1} + W^(2) x_t + b^(2))    (18)
h_t = h^(1)_t ⊕ h^(2)_t    (19)
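A minimal NumPy sketch of Eqs. (17)-(19), with ⊕ realized as vector concatenation; `xs` is assumed to be a list of input vectors.

```python
import numpy as np

def rnn_step(h_prev, x_t, U, W, b):
    return np.tanh(U @ h_prev + W @ x_t + b)

def birnn_forward(xs, fwd, bwd):
    """fwd/bwd are (U, W, b) tuples for the left-to-right pass (Eq. (17)) and
    the right-to-left pass (Eq. (18)); outputs are concatenated per Eq. (19)."""
    h1, h2 = np.zeros(fwd[0].shape[0]), np.zeros(bwd[0].shape[0])
    h1s, h2s = [], []
    for x_t in xs:                          # prefix context, Eq. (17)
        h1 = rnn_step(h1, x_t, *fwd)
        h1s.append(h1)
    for x_t in reversed(xs):                # suffix context, Eq. (18)
        h2 = rnn_step(h2, x_t, *bwd)
        h2s.append(h2)
    h2s.reverse()                           # align backward states with time
    return [np.concatenate([a, b]) for a, b in zip(h1s, h2s)]  # Eq. (19)
```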
Application of RNN: Classification
Applicable to many classification problems:
• Text classification
• Sentiment classification
Such classifiers can be enhanced with attention mechanisms.
Application of RNN: Sequence Labeling
• Sequence labeling tasks, such as Chinese word segmentation, part-of-speech tagging, and semantic role labeling.
Application of RNN: Sequence to Sequence Learning
• Machine translation
• Sequence to sequence generation (for dialogue, question answering, etc.)
• Image captioning: from image to text

Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. NIPS 2014: 3104-3112.
Hierarchical LSTM
Hierarchical LSTM: the outputs of low-level LSTMs are taken as input to the high-level LSTM.

Ghosh S, Vinyals O, Strope B, et al. Contextual LSTM (CLSTM) models for large scale NLP tasks. arXiv preprint arXiv:1602.06291, 2016.
Summary
• RNNs are very good at modeling text/language.
  • Basic RNN models: RNN, GRU, LSTM
  • Stacked RNN
  • Bi-directional RNN
  • Hierarchical RNN
• Typical learning problems with RNNs:
  • Classification
  • Sequence labeling
  • Sequence to sequence learning: MT, dialogue generation