
Recurrent Neural Networks (RNNs)

Dr. Benjamin Roth, Nina Poerner

CIS LMU München


Today

9:15 - 10:00: RNN Basics

10:15 - 11:45: Exercises: PyTorch, Word2Vec

Instead of an exercise sheet, work through the following by next week:
- http://www.deeplearningbook.org/contents/rnn.html (sections 10.0 - 10.2.1 inclusive, 10.7, 10.10)
- LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- GRU: https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be/

Next week:
- 9:15 - 10:15: "Journal Club" on LSTM and GRU
- 10:30 - 11:45: Intro to Keras


Recurrent Neural Networks (RNNs)

Family of neural networks for processing sequential data x^(1) ... x^(T).

Sequences of words, characters, video frames, audio frames, ...

Basic idea: For time step t, compute a representation h^(t) from the current input x^(t) and the previous representation h^(t−1):

h^(t) = f(h^(t−1), x^(t); θ)

x^(t) can be embeddings, one-hot vectors, the output of some previous layer, ...

Question: By recursion, what does h^(t) depend on?
- all previous inputs x^(1) ... x^(t)
- the initial state h^(0) (typically all-zero, but not necessarily, cf. encoder-decoder)
- the parameters θ

Question: How does this compare to an n-gram based model?
- N-gram based models have limited memory; the RNN has theoretically (!) unlimited memory
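A minimal sketch of this recurrence in NumPy; the choice f = tanh(Ux + Wh + b) and all sizes are illustrative assumptions (the concrete vanilla RNN appears later in these slides):

import numpy as np

d_in, d_h, T = 4, 3, 6
rng = np.random.default_rng(0)
U = rng.normal(size=(d_h, d_in))    # input-to-hidden weights
W = rng.normal(size=(d_h, d_h))     # hidden-to-hidden weights
b = np.zeros(d_h)                   # bias
theta = (U, W, b)                   # the SAME theta is used at every time step

def f(h_prev, x_t, theta):
    U, W, b = theta
    return np.tanh(U @ x_t + W @ h_prev + b)

X = rng.normal(size=(T, d_in))      # input sequence x^(1) ... x^(T)
h = np.zeros(d_h)                   # h^(0): all-zero initial state
for t in range(T):
    h = f(h, X[t], theta)           # h^(t) depends on x^(1..t), h^(0) and theta
print(h)                            # final representation h^(T)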



Parameter Sharing

Going from time step t − 1 to t is parameterized by the same parameters θ for all t!

h^(t) = f(h^(t−1), x^(t); θ)

Question: Why is parameter sharing a good idea?
- Fewer parameters
- Can learn to detect features regardless of their position
  - "i went to nepal in 2009" vs. "in 2009 i went to nepal"
- Can generalize to longer sequences than were seen in training



Graphical Notation: Unrolling

Compact notation (left):
- All time steps conflated.
- A black square indicates a "delay" of one time unit.

Unrolled notation (right):
- Like a very deep feed-forward NN with parameter sharing across layers

Source: Goodfellow et al.: Deep Learning.


Any questions so far?


RNN: Output

The output at time t is typically computed from the hidden representation at time t:

o^(t) = f(h^(t); θ_o)

Typically a linear transformation: o^(t) = θ_o^T h^(t)

Some RNNs compute o^(t) at every time step, others only at the last time step o^(T).


RNN: Output

[Figure: three output configurations, drawn as unrolled graphs over hidden states h^(0), h^(1), ... with inputs x^(t) and outputs o^(t).]
- n-to-1: a single output o^(T) is computed from the last hidden state. Uses: sentiment polarity, topic classification, grammaticality, ...
- Tagging: one output o^(t) per time step. Uses: POS tagging, NER tagging, language modelling, ...
- Tagging (selective) / Sequence2Sequence: an encoder RNN reads the input sequence, a decoder RNN produces the outputs. Uses: machine translation, summarization, image captioning (with a CNN encoder), ...
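In Keras (used later in these slides), the n-to-1 vs. tagging distinction corresponds to the return_sequences flag; the batch, sequence length and feature sizes below are made-up for illustration:

import numpy as np
from tensorflow.keras.layers import SimpleRNN

rnn_last = SimpleRNN(units=10)                         # n-to-1: only the last hidden state
rnn_all  = SimpleRNN(units=10, return_sequences=True)  # tagging: one hidden state per step

x = np.random.randn(2, 20, 30).astype("float32")       # (batch, time steps, features) - assumed
print(rnn_last(x).shape)    # (2, 10)     -> a single o^(T) per sequence
print(rnn_all(x).shape)     # (2, 20, 10) -> one o^(t) per time step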



Any questions so far?


RNN: Loss Function

Loss function:
- Several time steps: L(y^(1), ..., y^(T); o^(1), ..., o^(T))
- Last time step only: L(y; o^(T))

Example: POS tagging
- The output o^(t) is the predicted distribution over POS tags:
  - o^(t) = P(tag = ? | h^(t))
  - Typically: o^(t) = softmax(θ_o^T h^(t))
- Loss at time t: the negative log-likelihood (NLL) of the true label y^(t):

  L^(t) = − log P(tag = y^(t) | h^(t); θ_o)

- Overall loss for all time steps:

  L = Σ_{t=1}^{T} L^(t)
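A small numeric sketch of this loss; the hidden states, tags and sizes below are random placeholders, not from the slides:

import numpy as np

T, K, d = 5, 10, 8                       # time steps, number of tags, hidden size (assumed)
rng = np.random.default_rng(0)
H = rng.normal(size=(T, d))              # hidden states h^(1) ... h^(T)
theta_o = rng.normal(size=(d, K))        # output parameters
y = rng.integers(0, K, size=T)           # true tags y^(1) ... y^(T)

logits = H @ theta_o                                 # theta_o^T h^(t) for every t
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)            # softmax -> P(tag | h^(t))

L_t = -np.log(probs[np.arange(T), y])                # per-time-step NLL L^(t)
L = L_t.sum()                                        # overall loss L = sum_t L^(t)
print(L_t, L)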



Graphical Notation: Including Output and Loss Function

Source: Goodfellow et al.: Deep Learning.


Any questions so far?


Backpropagation through time

We have calculated the loss L^(T) at time step T and want to update the parameters θ_i with gradient descent.

For now, imagine that we have time-step-specific "dummy" parameters θ_i^(t), which are identical copies of θ_i.

→ the unrolled RNN looks like a feed-forward neural network!

→ we can calculate ∂L^(T)/∂θ_i^(t) using standard backpropagation

To calculate ∂L^(T)/∂θ_i, add up the "dummy" gradients:

∂L^(T)/∂θ_i = Σ_{t=1}^{T} ∂L^(T)/∂θ_i^(t)
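A toy check of this summation on a scalar linear RNN (an illustrative example, not from the slides): with h^(t) = w·h^(t−1) + x^(t) and L^(T) = h^(T), the "dummy" gradient for the copy of w used at step t is h^(t−1)·w^(T−t), and summing the copies matches a numerical gradient.

import numpy as np

w = 0.9
x = np.array([0.5, -1.0, 2.0, 0.3])
T = len(x)

def forward(w):
    h, hs = 0.0, [0.0]                   # h^(0) = 0
    for t in range(T):
        h = w * hs[-1] + x[t]            # h^(t) = w * h^(t-1) + x^(t)
        hs.append(h)
    return h, hs                         # L^(T) = h^(T), plus all states

L, hs = forward(w)
# dummy gradient at (0-based) step t: hs[t] * w^(T-1-t); sum over all copies
grad_bptt = sum(hs[t] * w ** (T - 1 - t) for t in range(T))

eps = 1e-6                               # finite-difference check
grad_num = (forward(w + eps)[0] - forward(w - eps)[0]) / (2 * eps)
print(grad_bptt, grad_num)               # the two values agree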



Truncated backpropagation through time

Simple idea: Stop backpropagation through time after k time steps

∂L^(T)/∂θ_i = Σ_{t=T−k}^{T} ∂L^(T)/∂θ_i^(t)

Question: What are advantages and disadvantages?
- Advantage: faster and parallelizable
- Disadvantage: if k is too small, long-range dependencies are hard to learn
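One way to get this behaviour in practice, sketched here with tf.keras 2.x-style stateful layers and made-up sizes (other frameworks expose it differently, e.g. by detaching the hidden state): split a long sequence into chunks of length k and carry the hidden state across chunks, while gradients only flow within a chunk.

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

batch_size, k, n_features = 8, 20, 30          # chunk length k and sizes are assumptions
model = Sequential([
    SimpleRNN(10, stateful=True,               # hidden state is carried across batches ...
              batch_input_shape=(batch_size, k, n_features)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

long_x = np.random.randn(batch_size, 100, n_features).astype("float32")
y = np.random.randn(batch_size, 1).astype("float32")

for start in range(0, long_x.shape[1], k):     # ... but gradients stop at chunk borders
    model.train_on_batch(long_x[:, start:start + k, :], y)
model.reset_states()                           # reset h^(0) before the next sequence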



Any questions so far?


Vanilla RNN

h^(t) = f(h^(t−1), x^(t); θ) = tanh(U x^(t) + W h^(t−1) + b)

θ = {W, U, b}

W: Hidden-to-hidden

U: Input-to-hidden

b: Bias term

Vanilla RNN in Keras:

from tensorflow.keras.layers import SimpleRNN

vanilla = SimpleRNN(units=10, use_bias=True)
vanilla.build(input_shape=(None, None, 30))   # (batch, time steps, input features)
print([weight.shape for weight in vanilla.get_weights()])   # returned order: [kernel, recurrent_kernel, bias]

[(30, 10), (10, 10), (10,)]

Question: Which shape belongs to which weight?



Bidirectional RNNs

Conceptually: two RNNs that run in opposite directions over the same input

Typically, each RNN has its own set of parameters

Results in two sequences of hidden vectors: →h^(1) ... →h^(T) and ←h^(1) ... ←h^(T)


Before being passed to downstream layers, →h and ←h are typically concatenated into one representation h, such that dim(h) = dim(→h) + dim(←h).

Question: Which hidden vectors should we concatenate if we want to compute a single output (e.g., to predict the sentiment of a sentence)?
- h = →h^(T) || ←h^(1)
- Because these are the vectors that have "read" the entire sequence.

Question: Which hidden vectors should we concatenate in sequence tagging (e.g., to predict the POS tag of word w_t)?
- h^(t) = →h^(t) || ←h^(t)
- Left context, right context (including t).

Question: Which hidden vectors should we concatenate to predict symbol w_t (i.e., in a bidirectional language model)?
- →h^(t−1) || ←h^(t+1)
- Left context, right context (excluding t).
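In Keras, this concatenation corresponds to the Bidirectional wrapper with merge_mode="concat"; the sizes below are illustrative assumptions:

import numpy as np
from tensorflow.keras.layers import SimpleRNN, Bidirectional

# Two SimpleRNNs (each with its own parameters) run in opposite directions;
# their hidden vectors are concatenated per time step.
bi_tagger = Bidirectional(SimpleRNN(units=10, return_sequences=True),
                          merge_mode="concat")

x = np.random.randn(2, 5, 30).astype("float32")   # (batch, time, features) - assumed
print(bi_tagger(x).shape)   # (2, 5, 20): dim(h) = dim(forward h) + dim(backward h)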



Bidirectional RNN for sequence tagging. Source: Goodfellow et al.: Deep Learning.


Multi-Layer RNNs

Conceptually: a stack of L RNNs, such that x_l^(t) = h_{l−1}^(t).

[Figure: a two-layer RNN unrolled over five time steps. The inputs x^(1) ... x^(5) feed the first layer's states h_1^(1) ... h_1^(5), which in turn are the inputs of the second layer's states h_2^(1) ... h_2^(5); the initial states are h_1^(0) and h_2^(0).]
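A Keras sketch of such a stack with two layers and made-up sizes: layer 1 must return its full hidden sequence so that it can serve as the input sequence of layer 2.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import SimpleRNN

stacked = Sequential([
    SimpleRNN(10, return_sequences=True,      # layer 1: h_1^(1) ... h_1^(T)
              input_shape=(None, 30)),        # (time steps, features) - assumed
    SimpleRNN(10),                            # layer 2: reads x_2^(t) = h_1^(t),
])                                            #          returns only h_2^(T)
stacked.summary()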


Feeding outputs back

What do we do if the input sequence x^(1) ... x^(T) is only given at training time, but not at test time?

Examples: Machine Translation decoder, (generative) language model
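A minimal sketch of the test-time loop (all weights are random placeholders; the start symbol index and sizes are assumptions): the symbol predicted at step t is embedded and fed back as the input of step t+1.

import numpy as np

V, d_emb, d_h = 50, 8, 16            # vocabulary, embedding and hidden sizes (assumed)
rng = np.random.default_rng(1)
E = rng.normal(size=(V, d_emb))      # embedding table
U = rng.normal(size=(d_h, d_emb))
W = rng.normal(size=(d_h, d_h))
theta_o = rng.normal(size=(d_h, V))

h = rng.normal(size=d_h)             # e.g. the last encoder state in MT
token = 0                            # assumed start-of-sequence symbol
generated = []
for _ in range(10):                  # generate at most 10 symbols
    x = E[token]                     # embed the previously predicted symbol
    h = np.tanh(U @ x + W @ h)       # recurrence step
    o = theta_o.T @ h                # scores over the vocabulary
    token = int(np.argmax(o))        # greedy prediction, fed back at the next step
    generated.append(token)
print(generated)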


Example: Machine Translation

[Figure: an MT decoder unrolled over three time steps. The initial state h^(0) is the last encoder state (it has "read" the source). Each step t computes h^(t) and an output o^(t); the predicted word w^(t) is scored against the true word y^(t), giving the loss L^(t). The next input x^(t+1) is the embedding of either the true word y^(t) or the predicted word w^(t).]



Oracle

Give the neural network a signal that it will not have at test time.

Can be useful during training (e.g., mix the oracle and the predicted signal when training a generative language model).

Can be used to establish upper bounds for modules:
- Example: How much better do neural MT systems become when they take the translation of the previous sentence into account?
- If we don't see improvements, this could be because
  - the previous sentence contains no useful information in general, or
  - the translation of the previous sentence was not good enough to have a positive effect.
- → Provide the gold translation of the previous sentence as an oracle to find the upper bound.


Gated RNNs: Teaser

Vanilla RNNs are not frequently used, because
- vanishing/exploding gradients make them difficult to train
- they tend to forget past information quickly

Instead: LSTM, GRU, ... (next week!)
