Recurrent Neural Networks 2
CS 287
(Based on Yoav Goldberg’s notes)
Review: Representation of Sequence
I Many tasks in NLP involve sequences
w1, . . . ,wn
I Represented as a matrix of dense vectors X
(Following YG, slight abuse of notation)
x_1 = x^0_1 W^0, \ldots, x_n = x^0_n W^0
I Would like a fixed-dimensional representation.
Review: Sequence Recurrence
I Can map from dense sequence to dense representation.
I x1, . . . , xn ↦ s1, . . . , sn
I For all i ∈ {1, . . . , n}
si = R(si−1, xi ; θ)
I θ is shared by all R
Example:
s4 = R(s3, x4)
= R(R(s2, x3), x4)
= R(R(R(R(s0, x1), x2), x3), x4)
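To make the recurrence concrete, here is a minimal numpy sketch, assuming an Elman-style R (a tanh of a linear map); the dimensions, initialization, and data are illustrative and not from the slides.

```python
import numpy as np

d_in, d_hid = 4, 8
rng = np.random.default_rng(0)

# Shared parameters theta = (W_x, W_s, b), reused at every position i.
W_x = rng.normal(scale=0.1, size=(d_in, d_hid))
W_s = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)

def R(s_prev, x):
    # Elman cell: s_i = tanh(x W^x + s_{i-1} W^s + b)
    return np.tanh(x @ W_x + s_prev @ W_s + b)

xs = [rng.normal(size=d_in) for _ in range(5)]   # x_1, ..., x_5
s = np.zeros(d_hid)                              # s_0
for x in xs:                                     # s_i = R(s_{i-1}, x_i)
    s = R(s, x)
# s (= s_n) is a fixed-dimensional representation of the whole sequence.
```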
Review: BPTT (Acceptor)
I Run forward propagation.
I Run backward propagation.
I Update all weights (shared)
[Figure: acceptor RNN unrolled as s0 → s1 → · · · → sn with inputs x1, . . . , xn and prediction y read off sn]
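A minimal numpy sketch of these three steps, assuming a tanh Elman cell, a linear readout of sn, and a squared-error loss (all illustrative choices, not from the slides); note that the gradients of the shared weights are accumulated over every time step.

```python
import numpy as np

d_in, d_hid = 3, 5
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(d_in, d_hid))
W_s = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)
W_y = rng.normal(scale=0.1, size=(d_hid, 1))

xs = rng.normal(size=(6, d_in))      # x_1, ..., x_n
target = np.array([1.0])

# Forward propagation: unroll and store every state for the backward pass.
states = [np.zeros(d_hid)]           # s_0
for x in xs:
    states.append(np.tanh(x @ W_x + states[-1] @ W_s + b))
y = states[-1] @ W_y
loss = 0.5 * np.sum((y - target) ** 2)

# Backward propagation (BPTT): the gradient flows from y back through every s_i.
dW_x, dW_s, db = np.zeros_like(W_x), np.zeros_like(W_s), np.zeros_like(b)
dW_y = np.outer(states[-1], y - target)
ds = (y - target) @ W_y.T                    # dL/ds_n
for i in range(len(xs), 0, -1):
    dpre = ds * (1 - states[i] ** 2)         # tanh'(pre_i) = 1 - s_i^2
    dW_x += np.outer(xs[i - 1], dpre)        # shared weights: accumulate over steps
    dW_s += np.outer(states[i - 1], dpre)
    db += dpre
    ds = dpre @ W_s.T                        # dL/ds_{i-1}

# Update all (shared) weights with one SGD step.
lr = 0.1
W_x -= lr * dW_x; W_s -= lr * dW_s; b -= lr * db; W_y -= lr * dW_y
```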
Review: Issues
I Can be inefficient, but batch/GPUs help.
I Model is much deeper than previous approaches.
I This matters a lot, focus of next class.
I Variable-size model for each sentence.
I Have to be a bit more clever in Torch.
Quiz
Consider a ReLU version of the Elman RNN with function R defined as
NN(x, s) = ReLU(sWs + xWx + b).
We use this RNN with an acceptor architecture over the sequence
x1, . . . , x5. Assume we have computed the gradient for the final layer
∂L/∂s_5.
What is the symbolic gradient of the previous state, ∂L/∂s_4?
What is the symbolic gradient of the first state, ∂L/∂s_1?
Answer
\frac{\partial L}{\partial s_{4,i}} = \sum_j \frac{\partial s_{5,j}}{\partial s_{4,i}} \frac{\partial L}{\partial s_{5,j}}
= \sum_j \begin{cases} W^s_{i,j} \frac{\partial L}{\partial s_{5,j}} & s_{5,j} > 0 \\ 0 & \text{o.w.} \end{cases}
= \sum_j \mathbf{1}(s_{5,j} > 0) \, W^s_{i,j} \, \frac{\partial L}{\partial s_{5,j}}
I Chain-Rule
I ReLU Gradient rule
I Indicator notation
Answer
\frac{\partial L}{\partial s_{1,j_1}} = \sum_{j_2} \frac{\partial s_{2,j_2}}{\partial s_{1,j_1}} \cdots \sum_{j_5} \frac{\partial s_{5,j_5}}{\partial s_{4,j_4}} \frac{\partial L}{\partial s_{5,j_5}}
= \sum_{j_2} \cdots \sum_{j_5} \mathbf{1}(s_{2,j_2} > 0) \cdots \mathbf{1}(s_{5,j_5} > 0) \, W^s_{j_1,j_2} \cdots W^s_{j_4,j_5} \, \frac{\partial L}{\partial s_{5,j_5}}
= \sum_{j_2 \ldots j_5} \prod_{k=2}^{5} \mathbf{1}(s_{k,j_k} > 0) \, W^s_{j_{k-1},j_k} \, \frac{\partial L}{\partial s_{5,j_5}}
I Multiple applications of Chain rule
I Combine and multiply.
I Product of weights.
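A quick numeric sanity check of this answer, using a ReLU Elman cell with illustrative random parameters; the backward loop applies exactly the summation from the answer, ∂L/∂s_{i−1,m} = Σ_j 1(s_{i,j} > 0) W^s_{m,j} ∂L/∂s_{i,j}, and a finite-difference probe checks one coordinate of ∂L/∂s_1.

```python
import numpy as np

d_in, d_hid = 3, 4
rng = np.random.default_rng(1)
W_x = rng.normal(size=(d_in, d_hid))
W_s = rng.normal(size=(d_hid, d_hid))
b = rng.normal(size=d_hid)
xs = rng.normal(size=(5, d_in))                   # x_1, ..., x_5

states = [np.zeros(d_hid)]
for x in xs:                                      # s_i = ReLU(s_{i-1} W^s + x W^x + b)
    states.append(np.maximum(0.0, states[-1] @ W_s + x @ W_x + b))

dL_ds5 = rng.normal(size=d_hid)                   # pretend gradient at the final state

# Backward recursion from the quiz answer:
# dL/ds_{i-1,m} = sum_j 1(s_{i,j} > 0) W^s_{m,j} dL/ds_{i,j}
ds = dL_ds5
for i in range(5, 1, -1):
    ds = ((states[i] > 0) * ds) @ W_s.T
print("symbolic dL/ds_1[0]:", ds[0])

# Finite-difference check, treating L locally as dL_ds5 . s_5.
def loss_from_s1(s1_val):
    s = s1_val
    for x in xs[1:]:
        s = np.maximum(0.0, s @ W_s + x @ W_x + b)
    return float(dL_ds5 @ s)

eps = 1e-6
bump = np.zeros(d_hid); bump[0] = eps
s1 = states[1]
print("numeric  dL/ds_1[0]:", (loss_from_s1(s1 + bump) - loss_from_s1(s1 - bump)) / (2 * eps))
```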
The Promise of RNNs
I Hope: Learn the long-range interactions of language from data.
I For acceptors this means maintaining early state:
I Example: How can you not see this movie?
You should not see this movie.
I Memory interaction here is at s1, but gradient signal is at sn
Long-Term Gradients
I Gradients go through multiplicative layers.
I Fine at end layers, but issues with early layers.
I For instance, consider the quiz with hardtanh:
\sum_{j_2 \ldots j_5} \prod_{k=2}^{5} \mathbf{1}(1 > s_{k,j_k} > 0) \, W^s_{j_{k-1},j_k} \, \frac{\partial L}{\partial s_{5,j_5}}
I The multiplicative effect tends to produce very large (exploding) or very small (vanishing) gradients.
Exploding Gradients
The easier case; it can be handled heuristically.
I Occurs if there is no saturation but exponential blowup.
\sum_{j_2 \ldots j_n} \prod_{k=2}^{n} W^s_{j_{k-1},j_k}
I In these cases, there are reasonable short-term gradients, but bad
long-term gradients.
I Two practical heuristics:
I gradient clipping, i.e. bounding any gradient by a maximum value
I gradient normalization, i.e. renormalizing the RNN gradients if they are above a fixed norm value (both sketched below).
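Minimal sketches of the two heuristics, written over an arbitrary list of gradient arrays; the threshold values are illustrative.

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    # Gradient clipping: bound each gradient entry by a maximum absolute value.
    return [np.clip(g, -max_value, max_value) for g in grads]

def normalize_gradients(grads, max_norm=5.0):
    # Gradient normalization: rescale all RNN gradients together
    # if their global norm exceeds a fixed value.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads
```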
Vanishing Gradients
I Occurs when combining small weights (or from saturation)
\sum_{j_2 \ldots j_n} \prod_{k=2}^{n} W^s_{j_{k-1},j_k}
I Again, affects mainly long-term gradients.
I However, not as simple as boosting gradients.
LSTM (Hochreiter and Schmidhuber, 1997)
LSTM Formally
R(si−1, xi ) = [ci ,hi ]
ci = j ⊙ i + f ⊙ ci−1
hi = tanh(ci ) ⊙ o
i = tanh(xWxi + hi−1Whi + bi )
j = σ(xWxj + hi−1Whj + bj )
f = σ(xWxf + hi−1Whf + bf )
o = tanh(xWxo + hi−1Who + bo)
I (eeks.)
Unreasonable Effectiveness of RNNs
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain’d into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
DUKE VINCENTIO:
Well, your wit is in the care of side and that.
Music Composition
http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/
Today’s Lecture
I Build up to LSTMs
I Note: The development is ahistorical; we cover later papers first.
I Main idea: Getting around the vanishing gradient issue.
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
Review: Deep ReLU Networks
NNlayer (x) = ReLU(xW1 + b1)
I Given an input x we can build arbitrarily deep fully-connected networks.
I Deep Neural Network (DNN) :
NNlayer (NNlayer (NNlayer (. . .NNlayer (x))))
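A minimal sketch of such a stack (depth, width, and initialization are illustrative; each layer has its own parameters):

```python
import numpy as np

def nn_layer(x, W, b):
    # One fully-connected ReLU layer: NN_layer(x) = ReLU(xW + b)
    return np.maximum(0.0, x @ W + b)

d_hid, depth = 16, 8
rng = np.random.default_rng(0)
params = [(rng.normal(scale=0.1, size=(d_hid, d_hid)), np.zeros(d_hid))
          for _ in range(depth)]

h = rng.normal(size=d_hid)          # input x
for W, b in params:                 # NN_layer(NN_layer(... NN_layer(x)))
    h = nn_layer(h, W, b)
```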
Deep Networks
NNlayer (x) = ReLU(xW1 + b1)
[Figure: deep stack of layers x → h1 → h2 → · · · → hn−1 → hn]
I Can have similar issues with vanishing gradients.
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \sum_{j_n} \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}
Review: NLM Skip Connections
Thought Experiment: Additive Skip-Connections
NNsl1(x) = 1/2 ReLU(xW1 + b1) + 1/2 x
[Figure: stack of skip-connection layers x → h1 → h2 → · · · → hn]
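A sketch of this layer in numpy (parameter names are illustrative); the output is simply an average of the transformed input and the input itself.

```python
import numpy as np

def skip_layer(x, W1, b1):
    # NN_sl1(x) = 1/2 ReLU(xW1 + b1) + 1/2 x
    return 0.5 * np.maximum(0.0, x @ W1 + b1) + 0.5 * x

d_hid = 8
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(d_hid, d_hid)), np.zeros(d_hid)

h = rng.normal(size=d_hid)
for _ in range(10):   # stack layers (one shared W1 here, only to keep the sketch short)
    h = skip_layer(h, W1, b1)
```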
Exercise: Gradients
Exercise: What is the gradient ∂L/∂hn−1 with skip-connections?
Inductively, what happens?
[Figure: stack of skip-connection layers x → h1 → h2 → · · · → hn]
Exercise
We now have the average of two terms. One with no saturation
condition or multiplicative term.
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \frac{1}{2}\left(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}\right) + \frac{1}{2} \, \frac{\partial L}{\partial h_{n,j_{n-1}}}
Thought Experiment 2: Dynamic Skip-Connections
NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx
t = σ(xWt + bt)
W1 ∈ Rdhid×dhid
Wt ∈ Rdhid×1
[Figure: stack of dynamic skip-connection layers x → h1 → h2 → · · · → hn]
Dynamic Skip-Connections: intuition
NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx
t = σ(xWt + bt)
I No longer directly computing the next layer hi .
I Instead computing ∆h and dynamic update proportion t.
I Seems like a small change from DNN. Why does this matter?
Dynamic Skip-Connections Gradient
NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx
t = σ(xWt + bt)
I The t values are saved on the forward pass.
I t allows for a direct flow of gradients to earlier layers.
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = (1-t)\left(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}\right) + t \, \frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots
I (What is the . . .?)
Dynamic Skip-Connections
NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx
t = σ(xWt + bt)
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = (1-t)\left(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}\right) + t \, \frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots
I Note: h and Wt are also receiving gradients through the sigmoid!
I (Backprop is fun.)
Highway Network (Srivastava et al., 2015)
I Now add a combination at each dimension.
I ⊙ is point-wise multiplication.
NNhighway (x) = (1− t) ⊙ h + t ⊙ x
h = ReLU(xW1 + b1)
t = σ(xWt + bt)
Wt ,W1 ∈ Rdhid×dhid
bt ,b1 ∈ R1×dhid
I h; transform (e.g. standard MLP layer)
I t; carry (dimension-specific dynamic skipping)
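A sketch of a single highway layer in numpy; the shapes follow the slide, while the initialization and example input are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W1, b1, Wt, bt):
    h = np.maximum(0.0, x @ W1 + b1)       # h = ReLU(xW1 + b1), the transform
    t = sigmoid(x @ Wt + bt)               # t = sigma(xWt + bt), the per-dimension carry gate
    return (1.0 - t) * h + t * x           # (1 - t) ⊙ h + t ⊙ x

d_hid = 8
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d_hid, d_hid))
Wt = rng.normal(scale=0.1, size=(d_hid, d_hid))
b1, bt = np.zeros(d_hid), np.zeros(d_hid)

x = rng.normal(size=d_hid)
y = highway_layer(x, W1, b1, Wt, bt)
```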
Highway Gradients
I The t values are saved on the forward pass.
I tj determines the update of dimension j .
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \left(\sum_{j_n} (1 - t_{j_n}) \, \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}\right) + t_{j_{n-1}} \, \frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots
Intuition: Gating
I This is known as the gating operation
t ⊙ x
I Allows vector t to mask or gate x.
I True gating would have t ∈ {0, 1}dhid
I Approximate with the sigmoid,
t = σ(Wtx+ b)
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
Back To Acceptor RNNs
I Acceptor RNNs are deep networks with shared weights.
I Elman tanh layers
NN(x, s) = tanh(sWs + xWx + b).
[Figure: unrolled acceptor RNN s0 → s1 → · · · → sn with inputs x1, . . . , xn]
RNN with Skip-Connections
I Can replace layer with modified highway layer.
R(si−1, xi ) = (1− t) ⊙ h + t ⊙ si−1
h = tanh(xWx + si−1Ws + b)
t = σ(xWxt + si−1Wst + bt)
Wxt ,Wx ∈ Rdin×dhid
Wst ,Ws ∈ Rdhid×dhid
bt ,b ∈ R1×dhid
RNN with Skip-Connections
[Figure: unrolled RNN with skip-connections, s0 → s1 → · · · → sn with inputs x1, . . . , xn]
Final Idea: Stopping flow
I For many tasks, it is useful to halt propagation.
I Can do this by applying a reset gate r.
h = tanh(xWx + (r ⊙ si−1)Ws + b)
r = σ(xWxr + si−1Wsr + br )
Gated Recurrent Unit (GRU) (Cho et al 2014)
R(si−1, xi ) = (1− t) ⊙ h + t ⊙ si−1
h = tanh(xWx + (r ⊙ si−1)Ws + b)
r = σ(xWxr + si−1Wsr + br )
t = σ(xWxt + si−1Wst + bt)
Wxt ,Wxr ,Wx ∈ Rdin×dhid
Wst ,Wsr ,Ws ∈ Rdhid×dhid
bt , br , b ∈ R1×dhid
I t; dynamic skip-connections
I r; reset gating
I s; hidden state
I h; gated MLP
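A sketch of this cell in numpy, following the slide's parameterization (t plays the role of the usual GRU update gate and r the reset gate); parameter names mirror the slide, while the initialization and data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(s_prev, x, p):
    # R(s_{i-1}, x_i) = (1 - t) ⊙ h + t ⊙ s_{i-1}
    r = sigmoid(x @ p["Wxr"] + s_prev @ p["Wsr"] + p["br"])      # reset gate
    t = sigmoid(x @ p["Wxt"] + s_prev @ p["Wst"] + p["bt"])      # carry / skip gate
    h = np.tanh(x @ p["Wx"] + (r * s_prev) @ p["Ws"] + p["b"])   # gated MLP
    return (1.0 - t) * h + t * s_prev

d_in, d_hid = 4, 6
rng = np.random.default_rng(0)
p = {name: rng.normal(scale=0.1, size=(d_in, d_hid)) for name in ["Wx", "Wxr", "Wxt"]}
p.update({name: rng.normal(scale=0.1, size=(d_hid, d_hid)) for name in ["Ws", "Wsr", "Wst"]})
p.update({name: np.zeros(d_hid) for name in ["b", "br", "bt"]})

s = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):    # run the acceptor over x_1, ..., x_5
    s = gru_cell(s, x, p)
```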
Example: Language Modeling
I In non-Markovian language modeling, treat corpus as a sequence
I r allows resetting the state after a sentence.
consumers may want to move their telephones a little closer to
the tv set </s> <unk> <unk> watching abc ’s monday
night football can now vote during <unk> for the greatest
play in N years from among four or five <unk> <unk>
</s> two weeks ago viewers of several nbc <unk> consumer
segments started calling a N number for advice on various
<unk> issues </s> and the new syndicated reality show
hard copy records viewers ’ opinions for possible airing on the
next day ’s show </s>
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
LSTM
I Long Short-Term Memory network uses the same idea.
I The model has three gates: input, output, and forget.
I Developed first, but a bit harder to follow.
I Seems to be better at LM, comparable to GRU on other tasks.
LSTMs State
The state si is made of two components:
I ci ; cell
I hi ; hidden
R(si−1, xi ) = [ci ,hi ]
LSTM Development: Input and Forget Gates
The cell is updated with ∆c, not a convex combination (two gates).
R(si−1, xi ) = [ci ,hi ]
ci = j ⊙ h + f ⊙ ci−1
h = tanh(xWxi + hi−1Whi + bi )
j = σ(xWxj + hi−1Whj + bj )
f = σ(xWxf + hi−1Whf + bf )
I ci ; cell
I j; input gate
I f; forget gate
LSTM Development: Hidden
Hidden hi squashes ci
R(si−1, xi ) = [ci ,hi ]
hi = tanh(ci )
ci = j ⊙ h + f ⊙ ci−1
h = tanh(xWxi + hi−1Whi + bi )
j = σ(xWxj + hi−1Whj + bj )
f = σ(xWxf + hi−1Whf + bf )
I ci ; cell
I hi ; hidden
I j; input gate
I f; forget gate
Long Short-Term Memory
Finally, an output gate o is applied when computing hi.
R(si−1, xi ) = [ci ,hi ]
ci = j ⊙ i + f ⊙ ci−1
hi = tanh(ci ) ⊙ o
i = tanh(xWxi + hi−1Whi + bi )
j = σ(xWxj + hi−1Whj + bj )
f = σ(xWxf + hi−1Whf + bf )
o = tanh(xWxo + hi−1Who + bo)
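A sketch of this cell in numpy; the output gate uses tanh exactly as written above, parameter names mirror the slide, and the initialization and data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(c_prev, h_prev, x, p):
    # R(s_{i-1}, x_i) = [c_i, h_i]
    i = np.tanh(x @ p["Wxi"] + h_prev @ p["Whi"] + p["bi"])    # candidate update
    j = sigmoid(x @ p["Wxj"] + h_prev @ p["Whj"] + p["bj"])    # input gate
    f = sigmoid(x @ p["Wxf"] + h_prev @ p["Whf"] + p["bf"])    # forget gate
    o = np.tanh(x @ p["Wxo"] + h_prev @ p["Who"] + p["bo"])    # output gate (tanh, per slide)
    c = j * i + f * c_prev                                     # c_i = j ⊙ i + f ⊙ c_{i-1}
    h = np.tanh(c) * o                                         # h_i = tanh(c_i) ⊙ o
    return c, h

d_in, d_hid = 4, 6
rng = np.random.default_rng(0)
p = {f"Wx{g}": rng.normal(scale=0.1, size=(d_in, d_hid)) for g in "ijfo"}
p.update({f"Wh{g}": rng.normal(scale=0.1, size=(d_hid, d_hid)) for g in "ijfo"})
p.update({f"b{g}": np.zeros(d_hid) for g in "ijfo"})

c, h = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):    # run the acceptor over x_1, ..., x_5
    c, h = lstm_cell(c, h, x, p)
```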
Output Gate?
I The output gate feels the most ad-hoc (why tanh?)
I Luckily, Jozefowicz et al. (2015) find the output gate is not too important.
Accuracy Results (Jozefowicz et al, 2015)
Accuracy of RNN models on three different tasks. LSTM models
include versions without the forget, input, and output gates.
English LM Results (Jozefowicz et al, 2015)
[Table: NLL (perplexity) results]