Recurrent Neural Networks 2
CS 287
(Based on Yoav Goldberg’s notes)
Review: Representation of Sequence
I Many tasks in NLP involve sequences
w1, . . . ,wn
I Represented as a matrix of dense vectors X
(Following YG, slight abuse of notation)
x_1 = x^0_1 W^0, \ldots, x_n = x^0_n W^0
I Would like a fixed-dimensional representation.
Review: Sequence Recurrence
I Can map from dense sequence to dense representation.
I x1, . . . , xn ↦ s1, . . . , sn
I For all i ∈ {1, . . . , n}
si = R(si−1, xi ; θ)
I θ is shared by all R
Example:
s4 = R(s3, x4)
= R(R(s2, x3), x4)
= R(R(R(R(s0, x1), x2), x3), x4)
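To make the recurrence concrete, here is a minimal numpy sketch, assuming an Elman-style R (a tanh of a linear map); the dimensions, initialization, and data are illustrative and not from the slides.

```python
import numpy as np

d_in, d_hid = 4, 8
rng = np.random.default_rng(0)

# Shared parameters theta = (W_x, W_s, b), reused at every position i.
W_x = rng.normal(scale=0.1, size=(d_in, d_hid))
W_s = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)

def R(s_prev, x):
    # Elman cell: s_i = tanh(x W^x + s_{i-1} W^s + b)
    return np.tanh(x @ W_x + s_prev @ W_s + b)

xs = [rng.normal(size=d_in) for _ in range(5)]   # x_1, ..., x_5
s = np.zeros(d_hid)                              # s_0
for x in xs:                                     # s_i = R(s_{i-1}, x_i)
    s = R(s, x)
# s (= s_n) is a fixed-dimensional representation of the whole sequence.
```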
Review: BPTT (Acceptor)
I Run forward propagation.
I Run backward propagation.
I Update all weights (shared)
[Figure: acceptor RNN unrolled as s0 → s1 → · · · → sn with inputs x1, . . . , xn and prediction y read off sn]
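A minimal numpy sketch of these three steps, assuming a tanh Elman cell, a linear readout of sn, and a squared-error loss (all illustrative choices, not from the slides); note that the gradients of the shared weights are accumulated over every time step.

```python
import numpy as np

d_in, d_hid = 3, 5
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(d_in, d_hid))
W_s = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)
W_y = rng.normal(scale=0.1, size=(d_hid, 1))

xs = rng.normal(size=(6, d_in))      # x_1, ..., x_n
target = np.array([1.0])

# Forward propagation: unroll and store every state for the backward pass.
states = [np.zeros(d_hid)]           # s_0
for x in xs:
    states.append(np.tanh(x @ W_x + states[-1] @ W_s + b))
y = states[-1] @ W_y
loss = 0.5 * np.sum((y - target) ** 2)

# Backward propagation (BPTT): the gradient flows from y back through every s_i.
dW_x, dW_s, db = np.zeros_like(W_x), np.zeros_like(W_s), np.zeros_like(b)
dW_y = np.outer(states[-1], y - target)
ds = (y - target) @ W_y.T                    # dL/ds_n
for i in range(len(xs), 0, -1):
    dpre = ds * (1 - states[i] ** 2)         # tanh'(pre_i) = 1 - s_i^2
    dW_x += np.outer(xs[i - 1], dpre)        # shared weights: accumulate over steps
    dW_s += np.outer(states[i - 1], dpre)
    db += dpre
    ds = dpre @ W_s.T                        # dL/ds_{i-1}

# Update all (shared) weights with one SGD step.
lr = 0.1
W_x -= lr * dW_x; W_s -= lr * dW_s; b -= lr * db; W_y -= lr * dW_y
```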
Review: Issues
I Can be inefficient, but batch/GPUs help.
I Model is much deeper than previous approaches.
I This matters a lot, focus of next class.
I Variable-size model for each sentence.
I Have to be a bit more clever in Torch.
Quiz
Consider a ReLU version of the Elman RNN with function R defined as
NN(x, s) = ReLU(sWs + xWx + b).
We use this RNN with an acceptor architecture over the sequence
x1, . . . , x5. Assume we have computed the gradient for the final layer
∂L/∂s_5.
What is the symbolic gradient of the previous state, ∂L/∂s_4?
What is the symbolic gradient of the first state, ∂L/∂s_1?
Answer
\frac{\partial L}{\partial s_{4,i}} = \sum_j \frac{\partial s_{5,j}}{\partial s_{4,i}} \frac{\partial L}{\partial s_{5,j}}
= \sum_j \begin{cases} W^s_{i,j} \frac{\partial L}{\partial s_{5,j}} & s_{5,j} > 0 \\ 0 & \text{o.w.} \end{cases}
= \sum_j \mathbf{1}(s_{5,j} > 0) \, W^s_{i,j} \, \frac{\partial L}{\partial s_{5,j}}
I Chain-Rule
I ReLU Gradient rule
I Indicator notation
Answer
\frac{\partial L}{\partial s_{1,j_1}} = \sum_{j_2} \frac{\partial s_{2,j_2}}{\partial s_{1,j_1}} \cdots \sum_{j_5} \frac{\partial s_{5,j_5}}{\partial s_{4,j_4}} \frac{\partial L}{\partial s_{5,j_5}}
= \sum_{j_2} \cdots \sum_{j_5} \mathbf{1}(s_{2,j_2} > 0) \cdots \mathbf{1}(s_{5,j_5} > 0) \, W^s_{j_1,j_2} \cdots W^s_{j_4,j_5} \, \frac{\partial L}{\partial s_{5,j_5}}
= \sum_{j_2 \ldots j_5} \prod_{k=2}^{5} \mathbf{1}(s_{k,j_k} > 0) \, W^s_{j_{k-1},j_k} \, \frac{\partial L}{\partial s_{5,j_5}}
I Multiple applications of Chain rule
I Combine and multiply.
I Product of weights.
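A quick numeric sanity check of this answer, using a ReLU Elman cell with illustrative random parameters; the backward loop applies exactly the summation from the answer, ∂L/∂s_{i−1,m} = Σ_j 1(s_{i,j} > 0) W^s_{m,j} ∂L/∂s_{i,j}, and a finite-difference probe checks one coordinate of ∂L/∂s_1.

```python
import numpy as np

d_in, d_hid = 3, 4
rng = np.random.default_rng(1)
W_x = rng.normal(size=(d_in, d_hid))
W_s = rng.normal(size=(d_hid, d_hid))
b = rng.normal(size=d_hid)
xs = rng.normal(size=(5, d_in))                   # x_1, ..., x_5

states = [np.zeros(d_hid)]
for x in xs:                                      # s_i = ReLU(s_{i-1} W^s + x W^x + b)
    states.append(np.maximum(0.0, states[-1] @ W_s + x @ W_x + b))

dL_ds5 = rng.normal(size=d_hid)                   # pretend gradient at the final state

# Backward recursion from the quiz answer:
# dL/ds_{i-1,m} = sum_j 1(s_{i,j} > 0) W^s_{m,j} dL/ds_{i,j}
ds = dL_ds5
for i in range(5, 1, -1):
    ds = ((states[i] > 0) * ds) @ W_s.T
print("symbolic dL/ds_1[0]:", ds[0])

# Finite-difference check, treating L locally as dL_ds5 . s_5.
def loss_from_s1(s1_val):
    s = s1_val
    for x in xs[1:]:
        s = np.maximum(0.0, s @ W_s + x @ W_x + b)
    return float(dL_ds5 @ s)

eps = 1e-6
bump = np.zeros(d_hid); bump[0] = eps
s1 = states[1]
print("numeric  dL/ds_1[0]:", (loss_from_s1(s1 + bump) - loss_from_s1(s1 - bump)) / (2 * eps))
```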
The Promise of RNNs
I Hope: Learn the long-range interactions of language from data.
I For acceptors this means maintaining early state:
I Example: How can you not see this movie?
You should not see this movie.
I Memory interaction here is at s1, but gradient signal is at sn
Long-Term Gradients
I Gradients go through multiplicative layers.
I Fine at end layers, but issues with early layers.
I For instance, consider the quiz with hardtanh:
\sum_{j_2 \ldots j_5} \prod_{k=2}^{5} \mathbf{1}(1 > s_{k,j_k} > 0) \, W^s_{j_{k-1},j_k} \, \frac{\partial L}{\partial s_{5,j_5}}
I The multiplicative effect tends to produce very large (exploding) or very small (vanishing) gradients.
Exploding Gradients
The easier case; it can be handled heuristically.
I Occurs if there is no saturation but exponential blowup.
\sum_{j_2 \ldots j_n} \prod_{k=2}^{n} W^s_{j_{k-1},j_k}
I In these cases, there are reasonable short-term gradients, but bad
long-term gradients.
I Two practical heuristics:
I gradient clipping, i.e. bounding any gradient by a maximum value
I gradient normalization, i.e. renormalizing the RNN gradients if they are above a fixed norm value (both sketched below).
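Minimal sketches of the two heuristics, written over an arbitrary list of gradient arrays; the threshold values are illustrative.

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    # Gradient clipping: bound each gradient entry by a maximum absolute value.
    return [np.clip(g, -max_value, max_value) for g in grads]

def normalize_gradients(grads, max_norm=5.0):
    # Gradient normalization: rescale all RNN gradients together
    # if their global norm exceeds a fixed value.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads
```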
Vanishing Gradients
I Occurs when combining small weights (or from saturation)
\sum_{j_2 \ldots j_n} \prod_{k=2}^{n} W^s_{j_{k-1},j_k}
I Again, affects mainly long-term gradients.
I However, not as simple as boosting gradients.
LSTM (Hochreiter and Schmidhuber, 1997)
LSTM Formally
R(si−1, xi ) = [ci ,hi ]
ci = j ⊙ i + f ⊙ ci−1
hi = tanh(ci ) ⊙ o
i = tanh(xWxi + hi−1Whi + bi )
j = σ(xWxj + hi−1Whj + bj )
f = σ(xWxf + hi−1Whf + bf )
o = tanh(xWxo + hi−1Who + bo)
I (eeks.)
Unreasonable Effectiveness of RNNs
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain’d into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
DUKE VINCENTIO:
Well, your wit is in the care of side and that.
Music Composition
http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/
Today’s Lecture
I Build up to LSTMs
I Note: The development is ahistorical; we cover later papers first.
I Main idea: Getting around the vanishing gradient issue.
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
Review: Deep ReLU Networks
NNlayer (x) = ReLU(xW1 + b1)
I Given an input x we can build arbitrarily deep fully-connected networks.
I Deep Neural Network (DNN) :
NNlayer (NNlayer (NNlayer (. . .NNlayer (x))))
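A minimal sketch of such a stack (depth, width, and initialization are illustrative; each layer has its own parameters):

```python
import numpy as np

def nn_layer(x, W, b):
    # One fully-connected ReLU layer: NN_layer(x) = ReLU(xW + b)
    return np.maximum(0.0, x @ W + b)

d_hid, depth = 16, 8
rng = np.random.default_rng(0)
params = [(rng.normal(scale=0.1, size=(d_hid, d_hid)), np.zeros(d_hid))
          for _ in range(depth)]

h = rng.normal(size=d_hid)          # input x
for W, b in params:                 # NN_layer(NN_layer(... NN_layer(x)))
    h = nn_layer(h, W, b)
```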
Deep Networks
NNlayer (x) = ReLU(xW1 + b1)
[Figure: deep stack of layers x → h1 → h2 → · · · → hn−1 → hn]
I Can have similar issues with vanishing gradients.
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \sum_{j_n} \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}
Review: NLM Skip Connections
Thought Experiment: Additive Skip-Connections
NNsl1(x) = 1/2 ReLU(xW1 + b1) + 1/2 x
[Figure: stack of skip-connection layers x → h1 → h2 → · · · → hn]
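A sketch of this layer in numpy (parameter names are illustrative); the output is simply an average of the transformed input and the input itself.

```python
import numpy as np

def skip_layer(x, W1, b1):
    # NN_sl1(x) = 1/2 ReLU(xW1 + b1) + 1/2 x
    return 0.5 * np.maximum(0.0, x @ W1 + b1) + 0.5 * x

d_hid = 8
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(d_hid, d_hid)), np.zeros(d_hid)

h = rng.normal(size=d_hid)
for _ in range(10):   # stack layers (one shared W1 here, only to keep the sketch short)
    h = skip_layer(h, W1, b1)
```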
Exercise: Gradients
Exercise: What is the gradient ∂L/∂hn−1 with skip-connections?
Inductively, what happens?
[Figure: stack of skip-connection layers x → h1 → h2 → · · · → hn]
Exercise
We now have the average of two terms. One with no saturation
condition or multiplicative term.
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \frac{1}{2}\left(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}\right) + \frac{1}{2} \, \frac{\partial L}{\partial h_{n,j_{n-1}}}
Thought Experiment 2: Dynamic Skip-Connections
NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx
t = σ(xWt + bt)
W1 ∈ Rdhid×dhid
Wt ∈ Rdhid×1
[Figure: stack of dynamic skip-connection layers x → h1 → h2 → · · · → hn]
Dynamic Skip-Connections: intuition
NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx
t = σ(xWt + bt)
I No longer directly computing the next layer hi .
I Instead computing ∆h and dynamic update proportion t.
I Seems like a small change from DNN. Why does this matter?
Dynamic Skip-Connections Gradient
NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx
t = σ(xWt + bt)
I The t values are saved on the forward pass.
I t allows for a direct flow of gradients to earlier layers.
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = (1-t)\left(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}\right) + t \, \frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots
I (What is the . . .?)
Dynamic Skip-Connections
NNsl2(x) = (1− t)ReLU(xW1 + b1) + tx
t = σ(xWt + bt)
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = (1-t)\left(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}\right) + t \, \frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots
I Note: h and Wt are also receiving gradients through the sigmoid!
I (Backprop is fun.)
Highway Network (Srivastava et al., 2015)
I Now add a combination at each dimension.
I ⊙ is point-wise multiplication.
NNhighway (x) = (1− t) ⊙ h + t ⊙ x
h = ReLU(xW1 + b1)
t = σ(xWt + bt)
Wt ,W1 ∈ Rdhid×dhid
bt ,b1 ∈ R1×dhid
I h; transform (e.g. standard MLP layer)
I t; carry (dimension-specific dynamic skipping)
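A sketch of a single highway layer in numpy; the shapes follow the slide, while the initialization and example input are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W1, b1, Wt, bt):
    h = np.maximum(0.0, x @ W1 + b1)       # h = ReLU(xW1 + b1), the transform
    t = sigmoid(x @ Wt + bt)               # t = sigma(xWt + bt), the per-dimension carry gate
    return (1.0 - t) * h + t * x           # (1 - t) ⊙ h + t ⊙ x

d_hid = 8
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d_hid, d_hid))
Wt = rng.normal(scale=0.1, size=(d_hid, d_hid))
b1, bt = np.zeros(d_hid), np.zeros(d_hid)

x = rng.normal(size=d_hid)
y = highway_layer(x, W1, b1, Wt, bt)
```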
Highway Gradients
I The t values are saved on the forward pass.
I tj determines the update of dimension j .
\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \left(\sum_{j_n} (1 - t_{j_n}) \, \mathbf{1}(h_{n,j_n} > 0) \, W_{j_{n-1},j_n} \, \frac{\partial L}{\partial h_{n,j_n}}\right) + t_{j_{n-1}} \, \frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots
Intuition: Gating
I This is known as the gating operation
t ⊙ x
I Allows vector t to mask or gate x.
I True gating would have t ∈ {0, 1}dhid
I Approximate with the sigmoid,
t = σ(Wtx+ b)
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
Back To Acceptor RNNs
I Acceptor RNNs are deep networks with shared weights.
I Elman tanh layers
NN(x, s) = tanh(sWs + xWx + b).
[Figure: unrolled acceptor RNN s0 → s1 → · · · → sn with inputs x1, . . . , xn]
RNN with Skip-Connections
I Can replace layer with modified highway layer.
R(si−1, xi ) = (1− t) ⊙ h + t ⊙ si−1
h = tanh(xWx + si−1Ws + b)
t = σ(xWxt + si−1Wst + bt)
Wxt ,Wx ∈ Rdin×dhid
Wst ,Ws ∈ Rdhid×dhid
bt ,b ∈ R1×dhid
RNN with Skip-Connections
[Figure: unrolled RNN with skip-connections, s0 → s1 → · · · → sn with inputs x1, . . . , xn]
Final Idea: Stopping flow
I For many tasks, it is useful to halt propagation.
I Can do this by applying a reset gate r.
h = tanh(xWx + (r ⊙ si−1)Ws + b)
r = σ(xWxr + si−1Wsr + br )
Gated Recurrent Unit (GRU) (Cho et al 2014)
R(si−1, xi ) = (1− t) ⊙ h + t ⊙ si−1
h = tanh(xWx + (r ⊙ si−1)Ws + b)
r = σ(xWxr + si−1Wsr + br )
t = σ(xWxt + si−1Wst + bt)
Wxt ,Wxr ,Wx ∈ Rdin×dhid
Wst ,Wsr ,Ws ∈ Rdhid×dhid
bt , br , b ∈ R1×dhid
I t; dynamic skip-connections
I r; reset gating
I s; hidden state
I h; gated MLP
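A sketch of this cell in numpy, following the slide's parameterization (t plays the role of the usual GRU update gate and r the reset gate); parameter names mirror the slide, while the initialization and data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(s_prev, x, p):
    # R(s_{i-1}, x_i) = (1 - t) ⊙ h + t ⊙ s_{i-1}
    r = sigmoid(x @ p["Wxr"] + s_prev @ p["Wsr"] + p["br"])      # reset gate
    t = sigmoid(x @ p["Wxt"] + s_prev @ p["Wst"] + p["bt"])      # carry / skip gate
    h = np.tanh(x @ p["Wx"] + (r * s_prev) @ p["Ws"] + p["b"])   # gated MLP
    return (1.0 - t) * h + t * s_prev

d_in, d_hid = 4, 6
rng = np.random.default_rng(0)
p = {name: rng.normal(scale=0.1, size=(d_in, d_hid)) for name in ["Wx", "Wxr", "Wxt"]}
p.update({name: rng.normal(scale=0.1, size=(d_hid, d_hid)) for name in ["Ws", "Wsr", "Wst"]})
p.update({name: np.zeros(d_hid) for name in ["b", "br", "bt"]})

s = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):    # run the acceptor over x_1, ..., x_5
    s = gru_cell(s, x, p)
```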
Example: Language Modeling
I In non-Markovian language modeling, treat corpus as a sequence
I r allows resetting the state after a sentence.
consumers may want to move their telephones a little closer to
the tv set </s> <unk> <unk> watching abc ’s monday
night football can now vote during <unk> for the greatest
play in N years from among four or five <unk> <unk>
</s> two weeks ago viewers of several nbc <unk> consumer
segments started calling a N number for advice on various
<unk> issues </s> and the new syndicated reality show
hard copy records viewers ’ opinions for possible airing on the
next day ’s show </s>
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
LSTM
I Long Short-Term Memory network uses the same idea.
I The model has three gates: input, output, and forget.
I Developed first, but a bit harder to follow.
I Seems to be better at LM, comparable to GRU on other tasks.
LSTMs State
The state si is made of two components:
I ci ; cell
I hi ; hidden
R(si−1, xi ) = [ci ,hi ]
LSTM Development: Input and Forget Gates
The cell is updated with ∆c, not a convex combination (two gates).
R(si−1, xi ) = [ci ,hi ]
ci = j ⊙ h + f ⊙ ci−1
h = tanh(xWxi + hi−1Whi + bi )
j = σ(xWxj + hi−1Whj + bj )
f = σ(xWxf + hi−1Whf + bf )
I ci ; cell
I j; input gate
I f; forget gate
LSTM Development: Hidden
Hidden hi squashes ci
R(si−1, xi ) = [ci ,hi ]
hi = tanh(ci )
ci = j ⊙ h + f ⊙ ci−1
h = tanh(xWxi + hi−1Whi + bi )
j = σ(xWxj + hi−1Whj + bj )
f = σ(xWxf + hi−1Whf + bf )
I ci ; cell
I hi ; hidden
I j; input gate
I f; forget gate
Long Short-Term Memory
Finally, an output gate o is applied when computing hi.
R(si−1, xi ) = [ci ,hi ]
ci = j ⊙ i + f ⊙ ci−1
hi = tanh(ci ) ⊙ o
i = tanh(xWxi + hi−1Whi + bi )
j = σ(xWxj + hi−1Whj + bj )
f = σ(xWxf + hi−1Whf + bf )
o = tanh(xWxo + hi−1Who + bo)
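A sketch of this cell in numpy; the output gate uses tanh exactly as written above, parameter names mirror the slide, and the initialization and data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(c_prev, h_prev, x, p):
    # R(s_{i-1}, x_i) = [c_i, h_i]
    i = np.tanh(x @ p["Wxi"] + h_prev @ p["Whi"] + p["bi"])    # candidate update
    j = sigmoid(x @ p["Wxj"] + h_prev @ p["Whj"] + p["bj"])    # input gate
    f = sigmoid(x @ p["Wxf"] + h_prev @ p["Whf"] + p["bf"])    # forget gate
    o = np.tanh(x @ p["Wxo"] + h_prev @ p["Who"] + p["bo"])    # output gate (tanh, per slide)
    c = j * i + f * c_prev                                     # c_i = j ⊙ i + f ⊙ c_{i-1}
    h = np.tanh(c) * o                                         # h_i = tanh(c_i) ⊙ o
    return c, h

d_in, d_hid = 4, 6
rng = np.random.default_rng(0)
p = {f"Wx{g}": rng.normal(scale=0.1, size=(d_in, d_hid)) for g in "ijfo"}
p.update({f"Wh{g}": rng.normal(scale=0.1, size=(d_hid, d_hid)) for g in "ijfo"})
p.update({f"b{g}": np.zeros(d_hid) for g in "ijfo"})

c, h = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):    # run the acceptor over x_1, ..., x_5
    c, h = lstm_cell(c, h, x, p)
```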
Output Gate?
I The output gate feels the most ad-hoc (why tanh?)
I Luckily, Jozefowicz et al. (2015) find the output gate is not too important.
Accuracy Results (Jozefowicz et al, 2015)
Accuracy of RNN models on three different tasks. LSTM models
include versions without the forget, input, and output gates.
English LM Results (Jozefowicz et al, 2015)
[Table: NLL (perplexity) results]