# tutorial on recurrent neural networks - jordi .utotrial on recurrent neural networks jordi pons...

Post on 07-Mar-2019

219 views

Embed Size (px)

TRANSCRIPT

1/37

Tutorial on recurrent neural networks

Jordi Pons

Music Technology Group

DTIC - Universitat Pompeu Fabra

January 16th, 2017

2/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode problem

Gated units: LSTM & GRU

Other solutions

Jordi Pons Tutorial on recurrent neural networks

3/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Recurrent Neural Networks?

Figure : Unfolded recurrent neural network

h(t) = (b+Wh(t1) +Ux(t))

o(t) = c+ Vh(t)

Throughout this presentation we assume single layer RNNs!

Jordi Pons Tutorial on recurrent neural networks

4/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Recurrent Neural Networks? Annotation

h(t) = (b+Wh(t1) +Ux(t)) o(t) = c+ Vh(t)

I x IRdinx1 y IRdoutx1

I b IRdhx1 h IRdhx1

I U IRdhxdin W IRdhxdhI V IRdoutxdh : element-wise non-linearity.

Where din, dh and dout correspond to the dimensions of the inputlayer, hidden layer and output layer, respectively.

Jordi Pons Tutorial on recurrent neural networks

5/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Recurrent Neural Networks?

Figure : Unfolded recurrent neural network

Jordi Pons Tutorial on recurrent neural networks

6/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: forward backward path

Figure : Forward and backward path for time-step t + 1 (yellow).

Jordi Pons Tutorial on recurrent neural networks

7/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: forward backward path

Figure : Problematic path (red).

Jordi Pons Tutorial on recurrent neural networks

8/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: intuition from the forward linear case

I Study case: forward problematic path for the linear case.h(t) = (b+Wh(t1) +Ux(t)) h(t) = Wh(t1)

I After t steps, this is equivalent to multiply Wt .

I Suppose that W has eigendecomposition W = Qdiag()QT .

I Then: Wt = (Qdiag()QT )t = Qdiag()tQT .

Jordi Pons Tutorial on recurrent neural networks

9/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: intuition from the forward linear case

h(t) = W h(t1) h(t) = Wt h(0)

Remember: Wt = Qdiag()tQT ,

then: h(t) = Qdiag()t QT h(0).

I Any eighenvalues i < 1 (and consequently h(t)) will vanish.

I Any eighenvalues i > 1 (and consequently h(t)) will explode.

Jordi Pons Tutorial on recurrent neural networks

10/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: intuition from the forward linear case

Observation 1:

I power method algorithm: nd the largest eigenvalue of amatrix W. Method very similar to Wt !

I Therefore, for: h(t) = Qdiag()t QT h(0)

Any component of h(0) that is not aligned with thelargest eigenvector will eventually be discarded.

Observation 2:

I Recurrent networks use the same matrix W at each time step.

I Feedforward networks do not use the same matrix W: Very deep feedforward networks can avoid thevanishing and exploding gradient problem. if an appropriate scaling for W's variance is chosen.

Jordi Pons Tutorial on recurrent neural networks

11/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: back-propagation with non-linearities

Consider the backward pass when non-linearities are present. This will allow to explicitly describe the gradients issue.

L

W=L(o(t+1), y(t+1)

)W

+L(o(t), y(t)

)W

+L(o(t1), y(t1)

)W

=

=tmax

k=tmin

L(o(tk), y(tk)

)W

where S [tmin, tmax ] - S being the horizon of the BPTT alghoritm.Jordi Pons Tutorial on recurrent neural networks

12/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: back-propagation with non-linearities

Let's now focus on time-step t + 1:

L(o(t1), y(t1)

)W

Jordi Pons Tutorial on recurrent neural networks

13/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: back-propagation with non-linearities

L(o(t+1), y(t+1)

)W

=L(o(t+1), y(t+1)

)o(t+1)

o(t+1)

h(t+1)h(t+1)

W+

+L(o(t+1), y(t+1)

)o(t+1)

o(t+1)

h(t+1)h(t+1)

h(t)h(t)

W+

+L(o(t+1), y(t+1)

)o(t+1)

o(t+1)

h(t+1)h(t+1)

h(t)h(t)

h(t1)h(t1)

W+...

Jordi Pons Tutorial on recurrent neural networks

14/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: back-propagation with non-linearities

Previous equation can be summarized as follows:

L(o(t+1), y(t+1)

)W

=

=1

k=tmin

L(o(t+1), y(t+1)

)o(t+1)

o(t+1)

h(t+1)h(t+1)

h(t+k)h(t+k)

W

where:

h(t+1)

h(t+k)=

1s=k+1

h(t+s)

h(t+s1)=

1s=k+1

WT diag [(b+W h(t+s1)+Ux(t+s))]

h(t+s) = (b+Wh(t+s1) +Ux(t+s))

f (x) = h(g(x)) f (x) = h(g(x)) g (x)

Jordi Pons Tutorial on recurrent neural networks

15/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: back-propagation with non-linearities

h(t+1)

h(t+k)=

1s=k+1

h(t+s)

h(t+s1)=

1s=k+1

WT diag [(b+W h(t+s1)+Ux(t+s))]

the L2 norm denes an upper bound for the jacobians: h(t+s)h(t+s1)2

WT

2

diag [()]2 w

w L2 norm of a matrix. [0, 1]. highest eigenvalue. spectral radius.

and therefore: h(t+1)h(t+k)2

(w)|k1|

Jordi Pons Tutorial on recurrent neural networks

16/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: back-propagation with non-linearities

then, consider: h(t+1)h(t+k)

2 (w)|k1|

w 1: h(1)(t)h(1)

(k)

2

L(o(t+1), y(t+1))W

LW explodes!

w 1: h(1)(t)h(1)

(k)

2

L(o(t+1), y(t+1))W

LW vanishes!

w 1: gradients should propagate well until the past

Jordi Pons Tutorial on recurrent neural networks

17/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode: take away message

Vanishing gradients make it dicult to know which directionthe parameters should move to improve the cost function.

While exploding gradients can make learning unstable.

The vanishing and exploding gradient problem refers to thefact that gradients are scaled according to:h(t+1)h(t+k)

2

(w)|k1| with, tipically : [0, 1]

Eigenvalues are specially relevant for this analysis:

w WT

2 largest eigenvalue W t diag()t .

Jordi Pons Tutorial on recurrent neural networks

18/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gradient vanish/explode problem

Gated units: LSTM & GRU

Other solutions

Jordi Pons Tutorial on recurrent neural networks

19/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

Gated units: intuition

1. Accumulating is just a bad memory?

W = I and linear = w = 1 1 = 1RNNs can accumulate but it might be useful to forget.

sequence is made of sub-sequences text is made of sentences

Forget is a good reason for willing the gradient to vanish?

2. Creating paths through time where derivatives can ow.

3. Learn when to forget!

Gates allow learning how to read, write and forget!

Two dierent gated units will be presented:

Long Short-Term Memory (LSTM)Gated Recurrent Unit (GRU)

Jordi Pons Tutorial on recurrent neural networks

20/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

LSTM - block diagram 1

Figure : Traditional LSTM diagram.

Jordi Pons Tutorial on recurrent neural networks

21/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

LSTM - block diagram 2

Figure : LSTM diagram as in the Deep Learning Book.

Jordi Pons Tutorial on recurrent neural networks

22/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

LSTM - block diagram 3

Figure : LSTM diagram as in colah.github.io

Jordi Pons Tutorial on recurrent neural networks

23/37

Gradient vanish/explode problem Gated units: LSTM & GRU Other solutions

LSTM key notes

I Two recurrences! Two past informations, through: W and direct.

I 1 unit 1cell.Formulation for a single cell:

State : s(t) = g(t)f s

(t1) + g(t)i i

(t)

Output : h(t) = (s(t))g(t)o

Input : i (t) = (b +Ux(t)

Recommended