recurrent neural networksmausam/courses/col772/spring... · recurrent neural networks ... (long...

57
Recurrent Neural Networks Yoav Goldberg

Upload: others

Post on 21-May-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural

NetworksYoav Goldberg

Page 2: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Dealing with Sequences

• For an input sequence x1,...,xn, we can:

• If n is fixed: concatenate and feed into an MLP.

• sum the vectors (CBOW) and feed into an MLP.

• Break the sequence into windows. Find n-gram

embedding, sum into an MLP.

• Find good ngrams using ConvNet, using pooling

(either sum/avg or max) to combine to a single

vector.

Page 3: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Dealing with Sequences

• For an input sequence x1,...,xn, we can:

• If n is fixed: concatenate and feed into an MLP.

• sum the vectors (CBOW) and feed into an MLP.

• Break the sequence into windows (i.e., for tagging).

Each window is fixed size, concatenate into an MLP.

• Find good ngrams using ConvNet, using pooling

(either sum/avg or max) to combine to a single

vector.

Some of these approaches consider local word

order (which ones?).

How can we consider global word order?

Page 4: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

• Very strong models of sequential data.

• Trainable function from n vectors to a single vector.

v(what) v(is) v(your) v(name) enc(what is your name)

Page 5: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

• There are different variants (implementations).

• So far, we focused on the interface level.

Page 6: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

• Very strong models of sequential data.

• Trainable function from n vectors to a single* vector.

Page 7: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

*this one is internal. we only care about the y

• Very strong models of sequential data.

• Trainable function from n vectors to a single* vector.

Page 8: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

• Recursively defined.

• There's a vector for every prefix

Page 9: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

• Recursively defined.

• There's a vector for every prefix

Page 10: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

• Recursively defined.

• There's a vector for every prefix

for every finite input sequence,

can unroll the recursion.

Page 11: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

for every finite input sequence,

can unroll the recursion.

An unrolled RNN is just a very deep Feed Forward Network

with shared parameters across the layers,

and a new input at each layer.

Page 12: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

Page 13: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

trained parameters.

• But we can train them.define function form

define loss

Page 14: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Recurrent Neural Networks

for Text Classification

Page 15: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

(what are the parameters?)

CBOW as an RNN

Page 16: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

(what are the parameters?)

CBOW as an RNN

Page 17: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

CBOW as an RNN

Is this a good parameterization?

Page 18: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

CBOW as an RNN

how about this modification?

Page 19: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Simple RNN (Elman RNN)

Page 20: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Simple RNN (Elman RNN)

• Looks very simple.

• Theoretically very powerful.

• In practice not so much (hard to train).

• Why? Vanishing gradients.

Page 21: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Simple RNN (Elman RNN)

• RNN as a "computer":

input xi arrives, memory s is updated.

• In the Elman RNN, entire memory is written at

each time-step.

• entire memory = output!

Another view on behavior:

Page 22: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

LSTM RNN

better controlled memory access

Page 23: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

continuous gates

Page 24: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Differentiable "Gates"

• The main idea behind the LSTM is that you want to

somehow control the "memory access".

• In a SimpleRNN:

• All the memory gets overwritten

read previous state memory write new input

Page 25: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Vector "Gates"

• We'd like to:

* Selectively read from some memory "cells".

* Selectively write to some memory "cells".

• A gate function:

vector of valuesgate controls access

(element-wise multiplication)

Page 26: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Vector "Gates"

• We'd like to:

* Selectively read from some memory "cells".

* Selectively write to some memory "cells".

• A gate function:

vector of valuesgate controls access

(element-wise multiplication)

Page 27: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Vector "Gates"

• We'd like to:

* Selectively read from some memory "cells".

* Selectively write to some memory "cells".

• A gate function:

vector of values gate controls access

Page 28: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Vector "Gates"

• Using the gate function to control access:

which cells to read which cells to write

Page 29: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Vector "Gates"

• Using the gate function to control access:

• (can also tie them: )

which cells to read which cells to write

Page 30: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Vector "Gates"

Page 31: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• Problem with the gates:

* they are fixed.

* they don't depend on the input or the output.

• Solution: make them smooth, input dependent, and

trainable.

Differentiable "Gates"

"almost 0"

or

"almost 1"

function of input and state

Page 32: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• Problem with the gates:

* they are fixed.

* they don't depend on the input or the output.

• Solution: make them smooth, input dependent, and

trainable.

Differentiable "Gates"

"almost 0"

or

"almost 1"

function of input and state

Page 33: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is a specific combination of gates.

LSTM (Long short-term Memory)

Page 34: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is a specific combination of gates.

LSTM (Long short-term Memory)

Page 35: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is a specific combination of gates.

LSTM (Long short-term Memory)

Page 36: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is a specific combination of gates.

LSTM (Long short-term Memory)

Page 37: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

or

g

Page 38: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The GRU is a different combination of gates.

GRU(Gated Recurrent Unit)

Page 39: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The GRU and the LSTM are very similar ideas.

• Invented independently of the LSTM, almost two

decades later.

GRU vs LSTM

Page 40: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

GRU(Gated Recurrent Unit)

• The GRU formulation:

Page 41: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

GRU(Gated Recurrent Unit)

• The GRU formulation:

Page 42: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

GRU(Gated Recurrent Unit)

Page 43: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

GRU(Gated Recurrent Unit)

Page 44: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The GRU formulation.

GRU(Gated Recurrent Unit)

Page 45: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• Many other variants exist.

• Mostly perform similarly to each other.

• Different tasks may work better with different

variants.

• The important idea is the differentiable gates.

Other Variants

Page 46: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is formulation:

LSTM (Long short-term Memory)

Page 47: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is formulation:

LSTM (Long short-term Memory)

Page 48: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is formulation:

LSTM (Long short-term Memory)

Page 49: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is formulation:

Recurrent Additive Networks

Page 50: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• Systematic search over LSTM choices

• Find that (1) forget gate is most important

• (2) non-linearity in output important since cell state

can be unbounded

• GRU effective since it doesn’t let cell state be

unbounded

Page 51: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The LSTM is formulation:

LSTM (Long short-term Memory)

Page 52: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Bidirectional LSTMs

Page 53: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated
Page 54: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Infinite window around the word

Page 55: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Deep LSTMs

Page 56: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

Deep Bi-LSTMs

Page 57: Recurrent Neural Networksmausam/courses/col772/spring... · Recurrent Neural Networks ... (Long short-term Memory) or g • The GRU is a different combination of gates. GRU (Gated

• The gated architecture also helps the vanishing

gradients problems.

• For a good explanation, see Kyunghyun Cho's

notes:

http://arxiv.org/abs/1511.07916 sections 4.2, 4.3

• Chris Olah's blog post

Read More