Recurrent Neural Network
Minlie Huang
[email protected]
Department of Computer Science, Tsinghua University
Some slides are from Dr. Xipeng Qiu.
Summer, 2017
Recurrent Neural Network (RNN)
An RNN has recurrent hidden states whose output at each time step depends on that of the previous step. Given a sequence x(1:n) = (x_1, x_2, ..., x_t, ..., x_n), where each x_i is a vector, the RNN updates its recurrent hidden state h_t by

h_t = 0 if t = 0, and h_t = f(h_{t-1}, x_t) otherwise    (1)
Simple Recurrent Network
• The recurrence function:

  h_t = f(U h_{t-1} + W x_t + b)    (2)

  where f is a non-linear function (for instance, sigmoid or tanh).
• A typical RNN structure can be illustrated as follows:
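Since the original figure is not reproduced here, a minimal NumPy sketch of the recurrence in Eqs. (1)-(2) may help; f = tanh, and all shapes and random parameters below are illustrative assumptions, not from the slides.

```python
import numpy as np

def rnn_step(h_prev, x_t, U, W, b):
    """One application of Eq. (2): h_t = f(U h_{t-1} + W x_t + b), with f = tanh."""
    return np.tanh(U @ h_prev + W @ x_t + b)

def rnn_forward(xs, U, W, b):
    """Unroll the recurrence over a sequence, starting from h_0 = 0 (Eq. (1))."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, U, W, b)
        states.append(h)
    return states

# Toy usage: hidden size 4, input size 3, sequence length 5.
rng = np.random.default_rng(0)
U, W, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
states = rnn_forward(rng.normal(size=(5, 3)), U, W, b)
```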
Backpropagation Through Time (BPTT)
Suppose the loss at time t is J_t; then the total loss is J = Σ_{t=1}^{T} J_t.
The gradient of J is

∂J/∂U = Σ_{t=1}^{T} ∂J_t/∂U = Σ_{t=1}^{T} (∂h_t/∂U)(∂J_t/∂h_t)    (3)
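To ground Eq. (3), here is a minimal NumPy sketch of BPTT for the simple RNN of Eq. (2). The linear readout y_t = V h_t + c and the squared loss are illustrative assumptions (the slides only define J = Σ_t J_t); only the gradient with respect to U is accumulated.

```python
import numpy as np

def bptt_dU(xs, ys, U, W, b, V, c):
    """BPTT sketch: accumulate dJ/dU = sum_t dJ_t/dU (Eq. (3)) for the RNN
    h_t = tanh(U h_{t-1} + W x_t + b), with an assumed linear readout
    y_t = V h_t + c and squared loss J_t = 0.5 * ||y_t - ys[t]||^2.
    xs, ys: lists of T input and T target vectors."""
    T, H = len(xs), U.shape[0]
    hs = [np.zeros(H)]
    for x_t in xs:                                  # forward pass
        hs.append(np.tanh(U @ hs[-1] + W @ x_t + b))
    dU = np.zeros_like(U)
    dh_next = np.zeros(H)                           # gradient flowing in from step t+1
    for t in range(T, 0, -1):                       # backward pass through time
        y_t = V @ hs[t] + c
        dh = V.T @ (y_t - ys[t - 1]) + dh_next      # dJ/dh_t
        da = (1.0 - hs[t] ** 2) * dh                # through tanh: the diag[f'(.)] factor
        dU += np.outer(da, hs[t - 1])               # accumulate the sum over t in Eq. (3)
        dh_next = U.T @ da                          # the U^T diag[f'(.)] term of Eq. (5)
    return dU
```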
Matrix Calculus
Figure: y = f(x_1, x_2, ..., x_n) (figure omitted).
Figure: Y is a matrix (figure omitted).

Matrix Calculus (cont'd)
Figure: Y, X are column vectors (figure omitted).
Figure: F, X are matrices (figure omitted).
Gradient of RNN

∂J/∂U = Σ_{t=1}^{T} Σ_{k=1}^{t} (∂h_k/∂U)(∂h_t/∂h_k)(∂y_t/∂h_t)(∂J_t/∂y_t)    (4)

∂h_t/∂h_k = Π_{i=k+1}^{t} ∂h_i/∂h_{i-1} = Π_{i=k+1}^{t} U^T diag[f'(h_{i-1})]    (5)

∂J/∂U = Σ_{t=1}^{T} Σ_{k=1}^{t} (∂h_k/∂U) (Π_{i=k+1}^{t} U^T diag[f'(h_{i-1})]) (∂y_t/∂h_t)(∂J_t/∂y_t)    (6)
Long-Term Dependencies
Define γ = ||U^T diag[f'(h_{i-1})]||.
• Exploding gradient problem: when γ > 1 and t − k → ∞, γ^{t−k} → ∞.
• Vanishing gradient problem: when γ < 1 and t − k → ∞, γ^{t−k} → 0.
There are various ways to address the long-term dependency problem; a small numeric illustration follows.
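A quick way to see why γ^{t−k} matters; the two values of γ below are arbitrary.

```python
# gamma = ||U^T diag[f'(h_{i-1})]|| acts like a per-step gain; the product in
# Eq. (5) then behaves like gamma**(t-k) over a gap of t-k time steps.
for gamma in (1.1, 0.9):
    print(gamma, [gamma ** n for n in (10, 50, 100)])
# 1.1 -> ~2.6, ~117, ~13781       (explodes)
# 0.9 -> ~0.35, ~0.0052, ~2.7e-5  (vanishes)
```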
Gradient Explosion & Vanishing
• Gradient norm clipping (Mikolov thesis 2012; Pascanu, Mikolov, Bengio, ICML 2013); see the sketch after this list:

  g ← ∂J/∂θ
  if ||g|| > threshold then
      g ← (threshold / ||g||) · g

• Gradient propagation regularizer (avoids the vanishing gradient)
• LSTM self-loops (avoid the vanishing gradient)
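A minimal NumPy sketch of the gradient-norm-clipping pseudocode above; the threshold in the usage line is arbitrary.

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g to norm `threshold` when ||g|| exceeds it, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

g = clip_gradient(np.array([3.0, 4.0]), threshold=1.0)  # -> [0.6, 0.8], norm 1
```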
Long Short-Term Memory (LSTM)
Key difference from the RNN: a memory cell c which is controlled by three gates:
• input gate i:

  i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (7)

• output gate o:

  o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (8)

• forget gate f:

  f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (9)

• update of the memory cell and hidden state:

  C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)    (10)
  C_t = f_t ⊗ C_{t-1} + i_t ⊗ C̃_t    (11)
  h_t = o_t ⊗ tanh(C_t)    (12)
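A minimal NumPy sketch of one LSTM step implementing Eqs. (7)-(12); the parameter-dictionary keys are illustrative names, not from the slides. Note that ⊗ in Eqs. (11)-(12) is the element-wise product, which is what `*` does on NumPy arrays.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names like "Wi"/"Ui"/"bi" to the weights of Eqs. (7)-(12)."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])        # input gate,   Eq. (7)
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])        # output gate,  Eq. (8)
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])        # forget gate,  Eq. (9)
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate,    Eq. (10)
    c = f * c_prev + i * c_tilde                                   # memory cell,  Eq. (11)
    h = o * np.tanh(c)                                             # hidden state, Eq. (12)
    return h, c
```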
LSTM Architecture

C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
C_t = f_t ⊗ C_{t-1} + i_t ⊗ C̃_t
h_t = o_t ⊗ tanh(C_t)

This picture and the following four pictures are from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
LSTM: Input and Forget Gates

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)

The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
LSTM: Output Gate and State Update

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t ⊗ tanh(C_t)
C_t = f_t ⊗ C_{t-1} + i_t ⊗ C̃_t

The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
Gated Recurrent Unit (GRU)
Instead of three gates, only two gates are employed:
• update gate z:

  z_t = σ(W_z x_t + U_z h_{t-1})    (13)

• reset gate r:

  r_t = σ(W_r x_t + U_r h_{t-1})    (14)

• state update:

  h̃_t = tanh(W_c x_t + U(r_t ⊗ h_{t-1}))    (15)
  h_t = (1 − z_t) ⊗ h_{t-1} + z_t ⊗ h̃_t    (16)
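Likewise, a minimal NumPy sketch of one GRU step implementing Eqs. (13)-(16); the key names in the parameter dictionary are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; `*` is the element-wise product written ⊗ in the slides."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)             # update gate, Eq. (13)
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)             # reset gate,  Eq. (14)
    h_tilde = np.tanh(p["Wc"] @ x_t + p["U"] @ (r * h_prev))  # candidate,   Eq. (15)
    return (1.0 - z) * h_prev + z * h_tilde                   # interpolate, Eq. (16)
```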
GRU Architecture
The picture is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
More Exploration of LSTM...
• How does the LSTM deal with the exploding and vanishing gradient problems?
• Refer to: Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
Stacked RNN/LSTM
• Stack multiple RNNs/LSTMs in the vertical direction; see the sketch after this list.
• It is also common for the raw input (x_i) to be connected to all layers as input.
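A minimal sketch of the plain vertical stacking in the first bullet (without the extra raw-input connections of the second); shapes and parameter layout are assumptions.

```python
import numpy as np

def rnn_step(h_prev, x_t, U, W, b):
    return np.tanh(U @ h_prev + W @ x_t + b)

def stacked_rnn_forward(xs, layers):
    """layers: list of (U, W, b) tuples, one per vertical layer. At each time
    step, layer l consumes the hidden state of layer l-1 as its input."""
    hs = [np.zeros(U.shape[0]) for (U, W, b) in layers]
    outputs = []
    for x_t in xs:
        inp = x_t
        for l, (U, W, b) in enumerate(layers):
            hs[l] = rnn_step(hs[l], inp, U, W, b)
            inp = hs[l]            # this layer's state feeds the layer above
        outputs.append(hs[-1])     # topmost hidden state at each time step
    return outputs
```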
Bidirectional RNN/LSTM
• Models the prefix context and the suffix context at the same time; see the sketch after the equations.

h^(1)_t = f(U^(1) h^(1)_{t-1} + W^(1) x_t + b^(1))    (17)
h^(2)_t = f(U^(2) h^(2)_{t+1} + W^(2) x_t + b^(2))    (18)
h_t = h^(1)_t ⊕ h^(2)_t    (19)
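A minimal NumPy sketch of Eqs. (17)-(19), with ⊕ realized as vector concatenation; `xs` is assumed to be a list of input vectors.

```python
import numpy as np

def rnn_step(h_prev, x_t, U, W, b):
    return np.tanh(U @ h_prev + W @ x_t + b)

def birnn_forward(xs, fwd, bwd):
    """fwd/bwd are (U, W, b) tuples for the left-to-right pass (Eq. (17)) and
    the right-to-left pass (Eq. (18)); outputs are concatenated per Eq. (19)."""
    h1, h2 = np.zeros(fwd[0].shape[0]), np.zeros(bwd[0].shape[0])
    h1s, h2s = [], []
    for x_t in xs:                          # prefix context, Eq. (17)
        h1 = rnn_step(h1, x_t, *fwd)
        h1s.append(h1)
    for x_t in reversed(xs):                # suffix context, Eq. (18)
        h2 = rnn_step(h2, x_t, *bwd)
        h2s.append(h2)
    h2s.reverse()                           # align backward states with time
    return [np.concatenate([a, b]) for a, b in zip(h1s, h2s)]  # Eq. (19)
```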
Application of RNN: Classification
Applicable to many classification problems:
• Text classification
• Sentiment classification
Such classifiers can be enhanced with attention mechanisms.
Application of RNN: Sequence Labeling
• Sequence labeling tasks, such as Chinese word segmentation, part-of-speech tagging, and semantic role labeling.
Application of RNN: Sequence to Sequence Learning
• Machine translation
• Sequence to sequence generation (for dialogue, question answering, etc.)
• Image captioning: from image to text

Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. NIPS 2014: 3104-3112.
Hierarchical LSTM
Hierarchical LSTM: the outputs of low-level LSTMs are taken as input to the high-level LSTM.

Ghosh S, Vinyals O, Strope B, et al. Contextual LSTM (CLSTM) models for large scale NLP tasks. arXiv preprint arXiv:1602.06291, 2016.
Summary
• RNNs are very good at modeling text/language.
  • Basic RNN models: RNN, GRU, LSTM
  • Stacked RNN
  • Bi-directional RNN
  • Hierarchical RNN
• Typical learning problems with RNNs:
  • Classification
  • Sequence labeling
  • Sequence to sequence learning: MT, dialogue generation