Recurrent Neural Network
Seoul National University Deep Learning September-December, 2018 1 / 24
Recurrent Neural Network (RNN)
Sequence Modeling: Recurrent neural networks, or RNNs (Rumelhart et al., 1986a), are a family of neural networks for processing sequential data.
h_t = tanh(b + w h_{t−1} + U x_t)
p_t = softmax(c + V h_t)
The hidden unit at time t is a function of the hidden unit at t − 1 and the input at time t.
Unknown parameters do not depend on time.
Loss function: L({x_1, · · · , x_T}, {y_1, · · · , y_T})
Building units of RNN
Input: x_t
Hidden unit: h_t = tanh(b + w h_{t−1} + U x_t)
Output unit: o_t = c + V h_t
Predicted probability: p_t = softmax(o_t)
Unknown parameters: (w, U, b, c, V)
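The slides give no code, but the building units above map directly onto a few lines of numpy. Below is a minimal sketch of one forward step; the shapes, random weights, and the function name rnn_step are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, w, V, b, c):
    """One forward step of the simple RNN: hidden unit, output unit, softmax."""
    h_t = np.tanh(b + w @ h_prev + U @ x_t)   # hidden unit
    o_t = c + V @ h_t                         # output unit
    p_t = np.exp(o_t - o_t.max())
    p_t /= p_t.sum()                          # predicted probability (softmax)
    return h_t, p_t

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 4                    # illustrative sizes
U = rng.normal(size=(d_h, d_in))
w = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(d_out, d_h))
b, c = np.zeros(d_h), np.zeros(d_out)
h, p = rnn_step(np.eye(d_in)[0], np.zeros(d_h), U, w, V, b, c)
```

Unrolling this step over t = 1, …, T, with the same (w, U, b, c, V) at every step, is exactly the statement that the unknown parameters do not depend on time.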
Applications of RNN
source: Andrej Karpathy blog
one to one: typical feed-forward network (fixed-size input to fixed-size output)
one to many: image captioning (image to sequence of words)
many to one: sentiment analysis (sequence of words to sentiment)
many to many: with lag, machine translation (sequence of words to sequence of words); without lag, frame-level video classification
Applications of RNN: Character-level language models
source: Andrej Karpathy blog
x_1 = (1, 0, 0, 0), y_1 = (0, 1, 0, 0), h_1 = (0.3, −0.1, 0.9), o_1 = (1.0, 2.2, −3.0, 4.2).
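The slide's output vector can be checked directly: applying softmax to o_1 = (1.0, 2.2, −3.0, 4.2) concentrates most of the probability on the fourth character.

```python
import numpy as np

o1 = np.array([1.0, 2.2, -3.0, 4.2])
p1 = np.exp(o1) / np.exp(o1).sum()  # softmax turns scores into probabilities
# the fourth entry receives roughly 85% of the mass,
# so the model would predict the fourth character
```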
Image captioning
Figure by A. Karpathy
Back propagation in simple RNN
a_t = b + w h_{t−1} + U x_t
h_t = tanh(a_t)
o_t = c + V h_t
p_t = softmax(o_t)
Loss function: L = −Σ_{t=1}^T y_t^T log(p_t) = Σ_{t=1}^T L_t.

∂L_t/∂p_t = −y_t / (y_t^T p_t)

∂L_t/∂o_t = (∂L_t/∂p_t)(∂p_t/∂o_t) = −(y_t^T / (y_t^T p_t)) (diag(p_t) − p_t p_t^T)

∂L/∂V = Σ_{t=1}^T (∂L_t/∂o_t)(∂o_t/∂V) = Σ_{t=1}^T (∂L_t/∂o_t) h_t^T
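For a one-hot y_t, the product ∂L_t/∂p_t · ∂p_t/∂o_t simplifies to the familiar p_t − y_t. A quick numeric check with illustrative values:

```python
import numpy as np

p = np.array([0.1, 0.6, 0.3])   # p_t, a softmax output
y = np.array([0.0, 1.0, 0.0])   # one-hot y_t
# chain rule as on the slide: (-y / y^T p)^T (diag(p) - p p^T)
grad_chain = (-y / (y @ p)) @ (np.diag(p) - np.outer(p, p))
grad_direct = p - y             # the simplified form
```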
∂L/∂w = Σ_{t=1}^T ∂L_t/∂w

The contribution from each time point is cumulative in time:

∂L_t/∂w = Σ_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂w)
∂L_t/∂w = Σ_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂w)

∂L_t/∂o_t = (∂L_t/∂p_t)(∂p_t/∂o_t) = −(y_t^T / (y_t^T p_t)) (diag(p_t) − p_t p_t^T)

∂o_t/∂h_t = V

∂h_k/∂w = (∂h_k/∂a_k)(∂a_k/∂w) = diag(1 − tanh²(a_k)) h_{k−1}
∂L_t/∂w = Σ_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂w)

Problematic part:

∂h_t/∂h_k = ∏_{j=k}^{t−1} ∂h_{j+1}/∂h_j, with ∂h_t/∂h_{t−1} = (∂h_t/∂a_t)(∂a_t/∂h_{t−1})
Denoting diag(1 − tanh²(a_t)) by D_t:

∂h_t/∂a_t = diag(1 − tanh²(a_t)) = D_t

∂h_t/∂h_{t−1} = (∂h_t/∂a_t)(∂a_t/∂h_{t−1}) = D_t w

∂h_t/∂h_k = ∏_{j=k}^{t−1} ∂h_{j+1}/∂h_j = ∏_{j=k}^{t−1} (D_{j+1} w)
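A scalar illustration of why ∏_{j=k}^{t−1} (D_{j+1} w) vanishes: each factor has magnitude at most |w| (since 1 − tanh² ≤ 1), so with |w| < 1 the product decays geometrically in t − k. The pre-activation values below are arbitrary.

```python
import numpy as np

w = 0.9                          # scalar recurrent weight, |w| < 1
a = np.linspace(-1.0, 1.0, 50)   # pre-activations a_j over 50 time steps
D = 1.0 - np.tanh(a) ** 2        # the diagonal factors D_{j+1}
grads = np.cumprod(D * w)        # |dh_t/dh_k| as the gap t - k grows
# grads[0] is moderate, grads[-1] is numerically negligible
```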
The contribution of each time point is cumulative up to time t, but ∂h_t/∂h_k is very small if k is far from t. That is, long-range dependence cannot be incorporated in updating the weights.

∂L_t/∂w = Σ_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂w) = Σ_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t) [∏_{j=k}^{t−1} (D_{j+1} w)] (∂h_k/∂w)
Two sources of exploding or vanishing gradients in RNN
Two sources of problems: Nonlinearity and explosion
∂h_t/∂h_k = ∏_{j=k}^{t−1} ∂h_{j+1}/∂h_j = ∏_{j=k}^{t−1} (D_{j+1} w)

Nonlinearity: the factor ∏_{j=k}^{t−1} D_{j+1} appears due to the tanh(·) relationship.

Explosion: w^{t−k} could vanish if |w| < 1 or explode if |w| > 1.
Attempts have been made to replace tanh(·) with ReLU or to apply gradient clipping to alleviate these problems.
Long Short-term Memory (LSTM) and Gated Recurrent Unit (GRU) are designed to solve the nonlinearity and explosion problems.
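Gradient clipping, mentioned above, is usually implemented by rescaling the gradient whenever its norm exceeds a threshold. A common norm-clipping sketch (the threshold 5.0 and function name are illustrative):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = clip_gradient(np.array([30.0, 40.0]))  # norm 50, rescaled down to norm 5
```

Clipping bounds the size of an update when gradients explode, but it does nothing for vanishing gradients, which is one motivation for the gated architectures below.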
Long Short-term Memory (LSTM)
Hochreiter and Schmidhuber, 1997
LSTM allows long-term dependence in time.
LSTM is designed to avoid the problem due to nonlinear relationshipand the problem of gradient explosion.
Successful in unconstrained handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014), image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015), and parsing (Vinyals et al., 2014a).
input gate: i_t = σ(b + U x_t + w h_{t−1})
forget gate: f_t = σ(b^f + U^f x_t + w^f h_{t−1})
new memory cell: c̃_t = tanh(b^g + U^g x_t + w^g h_{t−1})
updated memory: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
output gate: o_t = σ(b^o + U^o x_t + w^o h_{t−1})
h_t = tanh(c_t) ⊙ o_t
p_t = softmax(c + V h_t)
Existing and new key elements
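As a concrete reading of the gate equations, here is one LSTM forward step in numpy. The dictionary layout, shapes, and random weights are illustrative assumptions; the superscripts f, g, o become key suffixes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM forward step; P holds a (b, U, w) triple per gate."""
    i_t = sigmoid(P['b']  + P['U']  @ x_t + P['w']  @ h_prev)  # input gate
    f_t = sigmoid(P['bf'] + P['Uf'] @ x_t + P['wf'] @ h_prev)  # forget gate
    g_t = np.tanh(P['bg'] + P['Ug'] @ x_t + P['wg'] @ h_prev)  # new memory cell
    o_t = sigmoid(P['bo'] + P['Uo'] @ x_t + P['wo'] @ h_prev)  # output gate
    c_t = f_t * c_prev + i_t * g_t                             # updated memory
    h_t = np.tanh(c_t) * o_t
    return h_t, c_t

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
P = {}
for k in ['', 'f', 'g', 'o']:
    P['b' + k] = np.zeros(d_h)
    P['U' + k] = rng.normal(size=(d_h, d_in))
    P['w' + k] = rng.normal(size=(d_h, d_h))
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), P)
```

Note that c_t is updated additively (f_t ⊙ c_{t−1} + i_t ⊙ c̃_t) rather than through a squashing nonlinearity, which is the structural change relative to the simple RNN.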
LSTM introduces a memory cell state c_t.
There is no direct recurrence from h_{t−1} to h_t of the kind that caused the vanishing or exploding gradients and the lack of long-term dependence in the simple RNN.
The update of c_t controls the time dependence and the information flow.
RNN vs. LSTM
Figure: Simple RNN
Figure: LSTM
Consider the parameter groups (c, V), (b^o, U^o, w^o), and (b^g, U^g, w^g, b^f, U^f, w^f, b, U, w).

For (c, V): ∂L/∂V = Σ_{t=1}^T (∂L_t/∂p_t)(∂p_t/∂V)

For (b^o, U^o, w^o): ∂L/∂w^o = Σ_{t=1}^T (∂L_t/∂o_t)(∂o_t/∂w^o)

For (b^g, U^g, w^g, b^f, U^f, w^f, b, U, w), the derivatives ∂L/∂w, ∂L/∂w^g, ∂L/∂w^f involve ∂L/∂c_t, since time dependence is mediated by c_t.

For w:

∂L/∂w = Σ_{t=1}^T ∂L_t/∂w

Contribution from time t:

∂L_t/∂w = Σ_{k=0}^t (∂L_t/∂c_t)(∂c_t/∂c_k)(∂c_k/∂w)
Contribution from time t:

∂L_t/∂w = Σ_{k=0}^t (∂L_t/∂c_t)(∂c_t/∂c_k)(∂c_k/∂w)

Problematic part:

∂c_t/∂c_k = ∏_{j=k}^{t−1} ∂c_{j+1}/∂c_j

But ∂c_t/∂c_{t−1} = f_t, so

∂L_t/∂w = Σ_{k=0}^t (∂L_t/∂c_t)(∏_{j=k}^{t−1} f_{j+1})(∂c_k/∂w)
The additive relationship and the forget gate help alleviate the vanishing or exploding gradient problem.
Multi-layered RNN and LSTM
In an RNN or LSTM, the output h_t becomes an input at time t + 1. In a multi-layered RNN or LSTM, h_t becomes an input at time t + 1, and h_t of the l-th layer becomes the input (x_t) of the (l + 1)-th layer. For example, the LSTM input gates for the 1st and m-th layers are

i_t^{(1)} = σ(b + U x_t + w h_{t−1}^{(1)})

i_t^{(m)} = σ(b + U h_t^{(m−1)} + w h_{t−1}^{(m)})
Gated Recurrent Unit (GRU)
Proposed by Cho et al. (2014). Further studied by Chung et al. (2014, 2015a), Jozefowicz et al. (2015), and Chrupala et al. (2015).
GRU addresses the two sources of problems of RNN with a simplerstructure than LSTM.
LSTM and GRU are the two popular architectures of RNN.
GRUs are reportedly easier to train.
LSTM in principle can carry a longer memory.
update gate: u_t = σ(b^u + U^u x_t + w^u h_{t−1})
reset gate: r_t = σ(b^r + U^r x_t + w^r h_{t−1})
candidate: h̃_t = tanh(U^h x_t + r_t ⊙ (w^h h_{t−1}) + b^h)
h_t = u_{t−1} ⊙ h_{t−1} + (1 − u_{t−1}) ⊙ h̃_t
p_t = softmax(c + V h_t)
When u_{t−1} = 0 and r_t = 1, the GRU reduces to the simple RNN.
Handles nonlinearity and explosion.
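The GRU equations can likewise be sketched as one numpy step. Shapes and weights are illustrative, and the slide's convention of interpolating with the previous update gate u_{t−1} is kept as written.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, u_prev, P):
    """One GRU forward step; P holds per-gate (b, U, w) parameters."""
    u_t = sigmoid(P['bu'] + P['Uu'] @ x_t + P['wu'] @ h_prev)  # update gate
    r_t = sigmoid(P['br'] + P['Ur'] @ x_t + P['wr'] @ h_prev)  # reset gate
    h_cand = np.tanh(P['Uh'] @ x_t + r_t * (P['wh'] @ h_prev) + P['bh'])
    h_t = u_prev * h_prev + (1.0 - u_prev) * h_cand            # interpolation
    return h_t, u_t

rng = np.random.default_rng(2)
d_in, d_h = 4, 3
P = {}
for k in ['u', 'r', 'h']:
    P['b' + k] = np.zeros(d_h)
    P['U' + k] = rng.normal(size=(d_h, d_in))
    P['w' + k] = rng.normal(size=(d_h, d_h))
h, u = gru_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), P)
# with u_prev = 0 and r_t = 1, the step reduces to the simple RNN update
```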
Summary
RNNs can model sequence data.
The simple version of the RNN has the problem of vanishing or exploding gradients.
LSTM and GRU are designed to alleviate this problem.
When to use LSTM or GRU is not well understood.