Lecture 9: Recurrent Neural Networks
Deep Learning @ UvA (UvA Deep Learning Course, Efstratios Gavves)


Page 1: Lecture 9: Recurrent Neural Networks (Deep Learning @ UvA)

Page 2: Previous Lecture
o Word and Language Representations
o From n-grams to Neural Networks
o Word2vec
o Skip-gram

Page 3: Lecture Overview
o Recurrent Neural Networks (RNN) for sequences
o Backpropagation Through Time
o RNNs using Long Short-Term Memory (LSTM)
o Applications of Recurrent Neural Networks

Page 4: Recurrent Neural Networks
[Figure: a neural network cell with input, output, an internal NN cell state, and recurrent connections feeding the state back into the cell]

Pages 5-6: Sequential vs. static data
o Abusing the terminology:
◦ static data come in a mini-batch of size D
◦ dynamic data come in a mini-batch of size 1
o What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Page 7: Modelling sequences is ...
o ... roughly equivalent to predicting what is going to happen next
  Pr(x) = ∏_i Pr(x_i | x_1, ..., x_{i-1})
o Easy to generalize to sequences of arbitrary length
o Considering small chunks x_i means fewer parameters and easier modelling
o Often it is convenient to pick a "frame" T
  Pr(x) = ∏_i Pr(x_i | x_{i-T}, ..., x_{i-1})

Page 8: Word representations
o One-hot vector
◦ After the one-hot vector add an embedding
o Instead of a one-hot vector, directly use a word representation
◦ Word2vec
◦ GloVe
[Figure: a toy vocabulary ("I", "am", "Bond", ",", "James", "McGuire", "tired", "!") and the corresponding one-hot vectors, e.g. "tired" maps to a vector with a single 1 at its vocabulary index and 0 everywhere else]
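As a concrete illustration, here is a minimal Python/NumPy sketch of one-hot vectors and an embedding lookup for the toy vocabulary on this slide; the 50-dimensional embedding matrix is random and purely hypothetical, standing in for a learned Word2vec/GloVe table:

```python
import numpy as np

vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """All zeros except a single 1 at the word's vocabulary index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# A learned embedding would follow the one-hot; a random matrix stands in here.
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 50))          # 50-dimensional word embeddings
emb_tired = one_hot("tired") @ E               # same as the row lookup E[word_to_id["tired"]]
```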

Pages 9-12: What a sequence really is
o Data inside a sequence are non-i.i.d.
◦ i.i.d. = identically, independently distributed
o The next "word" depends on the previous "words"
◦ Ideally on all of them
o We need context, and we need memory!
o How do we model context and memory?
[Figure: given the context "I am Bond , James", candidate next words are "Bond", "McGuire", "tired", "am", "!"; "Bond" is the likely continuation]

Page 13: Modelling-wise, what is memory?
o A representation (variable) of the past
o How to adapt a simple neural network to include a memory variable?
[Figure: a sequence of steps with inputs x_t, ..., x_{t+3}, outputs y_t, ..., y_{t+3}, and memory variables c_t, ..., c_{t+3} linking consecutive steps; U are the input parameters, V the output parameters, W the memory parameters]

Page 14: Recurrent Neural Network (RNN)
[Figure: left, the unrolled/unfolded network: inputs x_t, x_{t+1}, x_{t+2}, outputs y_t, y_{t+1}, y_{t+2}, and memories c_t, c_{t+1}, c_{t+2}, with the parameters U (input), V (output), W (memory) shared across steps; right, the equivalent recurrent neural network: a single cell with input x_t, output y_t and memory c_t, whose recurrent connection carries c_{t-1}]

Pages 15-20: Why unrolled?
o Imagine we care only about 3-grams
o Are the two networks that different?
◦ Steps instead of layers
◦ Step parameters are the same (shared parameters in a NN)
o Sometimes the intermediate outputs are not even needed
o Removing them, we almost end up with a standard neural network
[Figure: left, a 3-gram unrolled recurrent network with "Layer/Step" 1-3, outputs y_1, y_2, y_3 and shared parameters U, V, W; right, a 3-layer neural network with per-layer parameters W_1, W_2, W_3 and a single final output]

Page 21: RNN equations
o At time step t:
  c_t = tanh(U x_t + W c_{t-1})
  y_t = softmax(V c_t)
o The input x_t is a one-hot vector
o Example
◦ Vocabulary of 500 words
◦ An input projection of 50 dimensions (U: [50 × 500])
◦ A memory of 128 units (c_t: [128 × 1], W: [128 × 128])
◦ An output projection of 500 dimensions (V: [500 × 128])
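A minimal NumPy sketch of these two equations, assuming for simplicity that U maps the one-hot input straight into the 128-dimensional memory (the slide's example additionally uses a 50-dimensional input projection); all sizes and random parameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
K, H = 500, 128                                # vocabulary size and memory size from the example
U = rng.normal(scale=0.01, size=(H, K))        # input parameters
W = rng.normal(scale=0.01, size=(H, H))        # memory parameters
V = rng.normal(scale=0.01, size=(K, H))        # output parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, c_prev):
    """c_t = tanh(U x_t + W c_{t-1}); y_t = softmax(V c_t)."""
    c_t = np.tanh(U @ x_t + W @ c_prev)
    y_t = softmax(V @ c_t)
    return c_t, y_t

c = np.zeros(H)
for word_id in (3, 41, 7):                     # a toy sequence of word indices
    x = np.eye(K)[word_id]                     # one-hot input
    c, y = rnn_step(x, c)                      # y is the distribution over the next word
```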

Page 22: Loss function
o Cross-entropy loss
  P = ∏_t ∏_k y_{tk}^{l_{tk}}  ⇒  L = -log P = -(1/T) ∑_t l_t log y_t
◦ For non-target words l_t = 0 ⇒ the loss is computed only for the target words
◦ What loss does a random (uniform) prediction give? ⇒ L = log K
o Usually the loss is reported as perplexity
  J = 2^L
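A small sketch of the loss and perplexity computation on toy predictions; it uses base-2 logarithms so that perplexity is exactly 2^L, matching the slide's J = 2^L (the choice of logarithm base is an assumption here):

```python
import numpy as np

# Toy next-word distributions y_t (rows) and one-hot targets l_t for T = 3 steps
y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
l = np.eye(3)[[0, 1, 2]]                  # the target word at each step

T = len(y)
L = -(l * np.log2(y)).sum() / T           # cross-entropy, computed only on the target words
perplexity = 2 ** L
print(L, perplexity)                      # a uniform predictor over K=3 words gives L = log2(3), perplexity 3
```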

Page 23: Training an RNN
o Backpropagation Through Time (BPTT)
[Figure: the unrolled network over three steps, with outputs y_1, y_2, y_3 and shared parameters U, V, W]

Page 24: Backpropagation Through Time
o We need the gradients ∂L/∂V, ∂L/∂W, ∂L/∂U
o To make it simpler, let's focus on step 3: ∂L_3/∂V, ∂L_3/∂W, ∂L_3/∂U
o Recall:
  c_t = tanh(U x_t + W c_{t-1})
  y_t = softmax(V c_t)
  L = -∑_t l_t log y_t = ∑_t L_t
[Figure: the unrolled network with inputs x_1, x_2, x_3, per-step outputs and losses (y_1, L_1), (y_2, L_2), (y_3, L_3), and shared parameters U, V, W]

Page 25: Backpropagation Through Time
o Gradient with respect to V:
  ∂L_3/∂V = (∂L_3/∂y_3)(∂y_3/∂V) = (y_3 - l_3) × c_3
[Figure and equations as on Page 24]

Page 26: Backpropagation Through Time
o ∂L_3/∂W = (∂L_3/∂y_3)(∂y_3/∂c_3)(∂c_3/∂W)
o What is the relation between c_3 and W?
◦ Two-fold: c_t = tanh(U x_t + W c_{t-1}), so W enters both directly and through c_{t-1}
o Total derivative rule: ∂f(φ(x), ψ(x))/∂x = (∂f/∂φ)(∂φ/∂x) + (∂f/∂ψ)(∂ψ/∂x)
o Hence ∂c_3/∂W ∝ c_2 + ∂c_2/∂W   (using ∂W/∂W = 1)
[Figure and equations as on Page 24]

Page 27: Recursively
o ∂c_3/∂W = c_2 + ∂c_2/∂W
o ∂c_2/∂W = c_1 + ∂c_1/∂W
o ∂c_1/∂W = c_0 + ∂c_0/∂W
o Putting it together:
  ∂c_3/∂W = ∑_{t=1}^{3} (∂c_3/∂c_t)(∂c_t/∂W)
  ⇒ ∂L_3/∂W = ∑_{t=1}^{3} (∂L_3/∂y_3)(∂y_3/∂c_3)(∂c_3/∂c_t)(∂c_t/∂W)
[Figure and equations as on Page 24]
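A minimal NumPy sketch of this recursion: the backward loop below accumulates exactly the sum ∑_t (∂L_3/∂c_3)(∂c_3/∂c_t)(∂c_t/∂W) for a 3-step toy RNN. All sizes and random parameters are illustrative, not the course's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
K, H = 5, 4                                   # toy vocabulary and memory sizes
U = rng.normal(scale=0.1, size=(H, K))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(K, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Forward pass over 3 steps: c_t = tanh(U x_t + W c_{t-1})
xs = [np.eye(K)[i] for i in (0, 2, 1)]        # toy one-hot inputs x_1, x_2, x_3
cs = [np.zeros(H)]                            # c_0 = 0
for x in xs:
    cs.append(np.tanh(U @ x + W @ cs[-1]))
y3 = softmax(V @ cs[3])
l3 = np.eye(K)[3]                             # one-hot target at step 3
L3 = -np.log(y3 @ l3)

# Backward pass: dL3/dW = sum_t (dL3/dc3)(dc3/dc_t)(dc_t/dW)
dc = V.T @ (y3 - l3)                          # dL3/dc3 (softmax + cross-entropy)
dW = np.zeros_like(W)
for t in range(3, 0, -1):                     # walk back through time
    dpre = dc * (1 - cs[t] ** 2)              # through the tanh at step t
    dW += np.outer(dpre, cs[t - 1])           # the dc_t/dW contribution at step t
    dc = W.T @ dpre                           # propagate dL3/dc_{t-1}
```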

Page 28: Are RNNs perfect?
o No (although in theory, yes!)
o Vanishing gradients
◦ After a few time steps the gradients become almost 0
o Exploding gradients
◦ After a few time steps the gradients become huge
o In practice they cannot really capture long-term dependencies

Page 29: Exploding gradients
o ∂L_r/∂W = ∑_{t=1}^{r} (∂L_r/∂y_r)(∂y_r/∂c_r)(∂c_r/∂c_t)(∂c_t/∂W)
o ∂c_r/∂c_t can be decomposed further with the chain rule:
  ∂c_r/∂c_t = (∂c_r/∂c_{r-1}) · (∂c_{r-1}/∂c_{r-2}) · ... · (∂c_{t+1}/∂c_t)
◦ Factors with index close to r are short-term factors; the rest are long-term factors
o When there are many long-term factors for which ∂c_i/∂c_{i-1} ≫ 1
◦ then ∂c_r/∂c_t ≫ 1
◦ then ∂L_r/∂W ≫ 1 ⇒ the gradient explodes

Page 30: Remedy for exploding gradients
o Scale the gradients down to a threshold θ_0 (gradient clipping)
o Step 1. g ← ∂L/∂θ
o Step 2. Is ||g|| > θ_0?
◦ Step 3a. If yes, g ← (θ_0 / ||g||) g
◦ Step 3b. If no, then do nothing
o Simple, but it works!
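A minimal sketch of this clipping rule (the norm-based variant shown above; the function name and NumPy usage are illustrative, not a specific library API):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g to norm `threshold` whenever its norm exceeds the threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0])        # ||g|| = 5
print(clip_gradient(g, 1.0))    # rescaled to unit norm: [0.6, 0.8]
```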

Page 31: Vanishing gradients
o ∂L_r/∂W = ∑_{t=1}^{r} (∂L_r/∂y_r)(∂y_r/∂c_r)(∂c_r/∂c_t)(∂c_t/∂W)
o ∂c_r/∂c_t can be decomposed further with the chain rule:
  ∂c_r/∂c_t = (∂c_r/∂c_{r-1}) · (∂c_{r-1}/∂c_{r-2}) · ... · (∂c_{t+1}/∂c_t)
o When many ∂c_i/∂c_{i-1} → 0 (e.g. with sigmoids)
◦ then ∂c_r/∂c_t → 0
◦ then ∂L_r/∂W → 0 ⇒ the gradient vanishes

Page 32: Advanced RNNs
[Figure: an LSTM cell: the input x_t and the previous state m_{t-1} feed three sigmoid gates (f_t, i_t, o_t) and a tanh candidate memory; the cell state flows from c_{t-1} to c_t, and the output m_t leaves through the output gate]

Pages 33-34: A more realistic memory unit needs what?
o Input
o Previous memory
o Previous NN state
o New memory and NN state

Page 35: Long Short-Term Memory (LSTM): a beefed-up RNN
  i_t = σ(x_t U^(i) + m_{t-1} W^(i))
  f_t = σ(x_t U^(f) + m_{t-1} W^(f))
  o_t = σ(x_t U^(o) + m_{t-1} W^(o))
  c̃_t = tanh(x_t U^(g) + m_{t-1} W^(g))
  c_t = c_{t-1} ⊙ f_t + c̃_t ⊙ i_t
  m_t = tanh(c_t) ⊙ o_t
[Figure: the LSTM cell with gates f_t, i_t, o_t, candidate memory c̃_t, cell state c_{t-1} → c_t and output m_{t-1} → m_t]
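A minimal NumPy sketch of one LSTM step following these equations (row-vector convention x_t U, as on the slide); the sizes and random parameters are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16                                    # toy input and memory sizes
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# One (U, W) pair per gate plus the candidate memory, as in the slide
U = {k: rng.normal(scale=0.1, size=(D, H)) for k in ("i", "f", "o", "g")}
W = {k: rng.normal(scale=0.1, size=(H, H)) for k in ("i", "f", "o", "g")}

def lstm_step(x_t, m_prev, c_prev):
    i = sigma(x_t @ U["i"] + m_prev @ W["i"])          # input gate
    f = sigma(x_t @ U["f"] + m_prev @ W["f"])          # forget gate
    o = sigma(x_t @ U["o"] + m_prev @ W["o"])          # output gate
    c_tilde = np.tanh(x_t @ U["g"] + m_prev @ W["g"])  # candidate memory
    c = c_prev * f + c_tilde * i                       # new cell state
    m = np.tanh(c) * o                                 # new output / NN state
    return m, c

m, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):             # run over a toy sequence of 5 inputs
    m, c = lstm_step(x_t, m, c)
```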

Page 36: LSTM Step-by-Step: Step (1)
o E.g., model the sentence "Yesterday she slapped me. Today she loves me."
o Decide what to forget and what to remember for the new memory
◦ Sigmoid ≈ 1 → remember everything
◦ Sigmoid ≈ 0 → forget everything
(Equations and cell diagram as on Page 35.)

Page 37: LSTM Step-by-Step: Step (2)
o Decide what new information to add to the new memory
◦ Modulate the input with i_t
◦ Generate candidate memories c̃_t
(Equations and cell diagram as on Page 35.)

Page 38: LSTM Step-by-Step: Step (3)
o Compute and update the current cell state c_t, which depends on
◦ the previous cell state
◦ what we decided to forget
◦ what inputs we allowed in
◦ the candidate memories
(Equations and cell diagram as on Page 35.)

Page 39: LSTM Step-by-Step: Step (4)
o Modulate the output
◦ Does the cell state contain something relevant? → Sigmoid ≈ 1
o Generate the new memory m_t
(Equations and cell diagram as on Page 35.)

Page 40: LSTM Unrolled Network
o Macroscopically very similar to standard RNNs
o The engine is a bit different (more complicated)
◦ Because of their gates, LSTMs capture both long- and short-term dependencies
[Figure: three LSTM cells chained in time, each with the σ/tanh gate structure]

Page 41: Even more beef to the RNNs
o LSTM with peephole connections
◦ Gates also have access to the previous cell state c_{t-1} (not only the memories)
◦ Coupled forget and input gates: c_t = f_t ⊙ c_{t-1} + (1 - f_t) ⊙ c̃_t
◦ Bi-directional recurrent networks
o GRU
◦ A bit simpler than the LSTM
◦ Performance similar to the LSTM
o Deep LSTM
[Figure: a deep (two-layer) LSTM, stacking LSTM (1) and LSTM (2) at each time step]

Page 42: Applications of Recurrent Networks
[Links in the original slide: a YouTube video and a demo website]

Page 43: Text generation
o Generate text like Nietzsche, Shakespeare or Wikipedia
o Generate Linux kernel-like C++ code
o Or even generate a new website
o Pr(x) = ∏_i Pr(x_i | x_past; θ), where θ are the LSTM parameters
[Figure: training vs. inference. During training the LSTM reads the Shakespeare text "To be or ..." and predicts the next word ("Two are or ..." as an early, imperfect prediction). At inference each predicted word ("O", "Romeo", ...) is fed back as the next input]
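A sketch of the inference loop pictured above: each predicted token is fed back as the next input. The `next_token_probs` stand-in below is hypothetical (a uniform distribution instead of a trained LSTM's softmax output), just to show the sampling mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["To", "be", "or", "not", "O", "Romeo", "."]      # toy vocabulary

def next_token_probs(prefix):
    """Stand-in for the trained LSTM's softmax over the next token."""
    return np.full(len(vocab), 1.0 / len(vocab))

def generate(prefix, steps=10, temperature=1.0):
    out = list(prefix)
    for _ in range(steps):
        p = next_token_probs(out)
        p = p ** (1.0 / temperature)            # temperature < 1 sharpens, > 1 flattens
        p /= p.sum()
        out.append(vocab[rng.choice(len(vocab), p=p)])    # feed the sample back in
    return " ".join(out)

print(generate(["O"]))
```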

Page 44: Handwriting generation
o Problem 1: model real coordinates instead of one-hot vectors
o Recurrent Mixture Density Networks
o Have the outputs follow a Gaussian distribution
◦ The output needs to be suitably squashed
o We don't just fit a Gaussian to the data
◦ We also condition on the previous outputs
  Pr(o_t) = ∑_i w_i(x_{1:T}) N(o_t | μ_i(x_{1:T}), Σ_i(x_{1:T}))

Page 45: Machine Translation
o The phrase in the source language is one sequence
◦ "Today the weather is very good"
o The phrase in the target language is also a sequence
◦ "Vandaag de weer is heel goed"
o Problems
◦ No perfect word alignment; sentence lengths might differ
o Solution
◦ Encoder-decoder scheme
[Figure: an encoder LSTM reads "Today the weather is good <EOS>"; a decoder LSTM then generates "Vandaag de weer is heel goed <EOS>", feeding each predicted word back as the next decoder input]

Page 46: Better Machine Translation
o It might even pay off to reverse the source sentence
◦ The first target words will then be closer to their respective source words
o The encoder and decoder parts can be modelled with different LSTMs
o Deep LSTMs
[Figure: the same encoder-decoder, with the source sentence fed in reverse order: "good is weather the Today <EOS>"]

Page 47: Image captioning
o An image is a thousand words, literally!
o Pretty much the same as machine translation
o Replace the encoder part with the output of a ConvNet
◦ E.g. use an AlexNet or a VGG16 network
o Keep the decoder part operating as a translator
[Figure: a ConvNet encodes the image, and an LSTM decoder generates the caption "Today the weather is good <EOS>"]

Page 48: Question answering
o Bleeding-edge research, no real consensus
◦ Very interesting open research problems
o Again, pretty much like machine translation
o Encoder-decoder scheme
◦ Feed the question to the encoder part
◦ Model the answer at the decoder part
o You can also do question answering with images
◦ Again, bleeding-edge research
◦ How/where to add the image?
◦ What has worked so far is to add the image only at the beginning
o Examples
◦ Q: What are the people playing? A: They play beach football.
◦ Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter? A: John entered the living room.

Page 49: Summary
o Recurrent Neural Networks (RNN) for sequences
o Backpropagation Through Time
o RNNs using Long Short-Term Memory (LSTM)
o Applications of Recurrent Neural Networks

Page 50: Next lecture
o Memory networks
o Recursive networks