# on the difficulty of training recurrent neural networks .on the di culty of training recurrent...

Post on 03-Jul-2018

212 views

Embed Size (px)

TRANSCRIPT

On the difficulty of training Recurrent Neural Networks

Razvan Pascanu pascanur@iro.umontreal.ca

Universite de Montreal

Tomas Mikolov t.mikolov@gmail.com

Brno University

Yoshua Bengio yoshua.bengio@umontreal.ca

Universite de Montreal

Abstract

There are two widely known issues with prop-erly training Recurrent Neural Networks, thevanishing and the exploding gradient prob-lems detailed in Bengio et al. (1994). Inthis paper we attempt to improve the under-standing of the underlying issues by explor-ing these problems from an analytical, a geo-metric and a dynamical systems perspective.Our analysis is used to justify a simple yet ef-fective solution. We propose a gradient normclipping strategy to deal with exploding gra-dients and a soft constraint for the vanishinggradients problem. We validate empiricallyour hypothesis and proposed solutions in theexperimental section.

1. Introduction

A recurrent neural network (RNN), e.g. Fig. 1, is aneural network model proposed in the 80s (Rumelhartet al., 1986; Elman, 1990; Werbos, 1988) for modelingtime series. The structure of the network is similar tothat of a standard multilayer perceptron, with the dis-tinction that we allow connections among hidden unitsassociated with a time delay. Through these connec-tions the model can retain information about the pastinputs, enabling it to discover temporal correlationsbetween events that are possibly far away from eachother in the data (a crucial property for proper learn-ing of time series).

While in principle the recurrent network is a simpleand powerful model, in practice, it is unfortunatelyhard to train properly. Among the main reasons whythis model is so unwieldy are the vanishing gradient

ut xtEt

Figure 1. Schematic of a recurrent neural network. Therecurrent connections in the hidden layer allow informationto persist from one input to another.

and exploding gradient problems described in Bengioet al. (1994).

1.1. Training recurrent networks

A generic recurrent neural network, with input ut andstate xt for time step t, is given by equation (1). Inthe theoretical section of this paper we will sometimesmake use of the specific parametrization given by equa-tion (11) 1 in order to provide more precise conditionsand intuitions about the everyday use-case.

xt = F (xt1,ut, ) (1)

xt = Wrec(xt1) + Winut + b (2)

The parameters of the model are given by the recurrentweight matrix Wrec, the biases b and input weightmatrix Win, collected in for the general case. x0 isprovided by the user, set to zero or learned, and is anelement-wise function (usually the tanh or sigmoid).A cost E measures the performance of the networkon some given task and it can be broken apart intoindividual costs for each step E = 1tT Et, whereEt = L(xt).One approach that can be used to compute the nec-essary gradients is Backpropagation Through Time(BPTT), where the recurrent model is represented as

1 This formulation is equivalent to the more widelyknown equation xt = (Wrecxt1 + Winut + b), and itwas chosen for convenience.

arX

iv:1

211.

5063

v2 [

cs.L

G]

16

Feb

2013

On the difficulty of training Recurrent Neural Networks

a deep multi-layer one (with an unbounded number oflayers) and backpropagation is applied on the unrolledmodel (see Fig. 2).

Et+1xt+1

Et+1EtEt1

xt+1xtxt1

ut1 ut ut+1

Etxt

Et1xt1

xt+2xt+1

xt+1xt

xtxt1

xt1xt2

Figure 2. Unrolling recurrent neural networks in time bycreating a copy of the model for each time step. We denoteby xt the hidden state of the network at time t, by ut theinput of the network at time t and by Et the error obtainedfrom the output at time t.

We will diverge from the classical BPTT equations atthis point and re-write the gradients (see equations (3),(4) and (5)) in order to better highlight the explodinggradients problem. These equations were obtained bywriting the gradients in a sum-of-products form.

E

=

1tT

Et

(3)

Et

=

1kt

(Etxt

xtxk

+xk

)(4)

xtxk

=ti>k

xixi1

=ti>k

WTrecdiag((xi1)) (5)

+xk refers to the immediate partial derivative of

the state xk with respect to , i.e., where xk1 istaken as a constant with respect to . Specifically,considering equation 2, the value of any row i of the

matrix ( +xk

Wrec) is just (xk1). Equation (5) also

provides the form of Jacobian matrix xixi1 for the

specific parametrization given in equation (11), wherediag converts a vector into a diagonal matrix, and

computes the derivative of in an element-wise fash-ion.

Note that each term Et from equation (3) has thesame form and the behaviour of these individual termsdetermine the behaviour of the sum. Henceforth wewill focus on one such generic term, calling it simplythe gradient when there is no confusion.

Any gradient component Et is also a sum (see equa-tion (4)), whose terms we refer to as temporal contribu-tions or temporal components. One can see that each

such temporal contribution Etxtxtxk

+xk measures how

at step k affects the cost at step t > k. The factors

xtxk

(equation (5)) transport the error in time fromstep t back to step k. We would further loosely distin-guish between long term and short term contributions,where long term refers to components for which k tand short term to everything else.

2. Exploding and Vanishing Gradients

As introduced in Bengio et al. (1994), the explodinggradients problem refers to the large increase in thenorm of the gradient during training. Such events arecaused by the explosion of the long term components,which can grow exponentially more then short termones. The vanishing gradients problem refers to theopposite behaviour, when long term components goexponentially fast to norm 0, making it impossible forthe model to learn correlation between temporally dis-tant events.

2.1. The mechanics

To understand this phenomenon we need to look at theform of each temporal component, and in particular atthe matrix factors xtxk (see equation (5)) that take theform of a product of t k Jacobian matrices. In thesame way a product of t k real numbers can shrinkto zero or explode to infinity, so does this product ofmatrices (along some direction v).

In what follows we will try to formalize these intu-itions (extending a similar derivation done in Bengioet al. (1994) where only a single hidden unit case wasconsidered).

If we consider a linear version of the model (i.e. set tothe identity function in equation (11)) we can use thepower iteration method to formally analyze this prod-uct of Jacobian matrices and obtain tight conditionsfor when the gradients explode or vanish (see the sup-plementary materials for a detailed derivation of theseconditions). It is sufficient for the largest eigenvalue1 of the recurrent weight matrix to be smaller than1 for long term components to vanish (as t) andnecessary for it to be larger than 1 for gradients toexplode.

We can generalize these results for nonlinear functions where the absolute values of (x) is bounded (sayby a value R) and therefore diag((xk)) .We first prove that it is sufficient for 1 1. For the linear case ( is theidentity function), v is the eigenvector correspondingto the largest eigenvalue of Wrec. If this bound istight, we hypothesize that in general when gradients

On the difficulty of training Recurrent Neural Networks

Figure 6. We plot the error surface of a single hidden unitrecurrent network, highlighting the existence of high cur-vature walls. The solid lines depicts standard trajectoriesthat gradient descent might follow. Using dashed arrowthe diagram shows what would happen if the gradients isrescaled to a fixed size when its norm is above a threshold.

explode so does the curvature along v, leading to awall in the error surface, like the one seen in Fig. 6.

If this holds, then it gives us a simple solution to theexploding gradients problem depicted in Fig. 6.

If both the gradient and the leading eigenvector of thecurvature are aligned with the exploding direction v, itfollows that the error surface has a steep wall perpen-dicular to v (and consequently to the gradient). Thismeans that when stochastic gradient descent (SGD)reaches the wall and does a gradient descent step, itwill be forced to jump across the valley moving perpen-dicular to the steep walls, possibly leaving the valleyand disrupting the learning process.

The dashed arrows in Fig. 6 correspond to ignoringthe norm of this large step, ensuring that the modelstays close to the wall. The key insight is that all thesteps taken when the gradient explodes are alignedwith v and ignore other descent direction (i.e. themodel moves perpendicular to the wall). At the wall, asmall-norm step in the direction of the gradient there-fore merely pushes us back inside the smoother low-curvature region besides the wall, whereas a regulargradient step would bring us very far, thus slowing orpreventing further training. Instead, with a boundedstep, we get back in that smooth region near the wallwhere SGD is free to explore other descent directions.

The important addition in this scenario to the classicalhigh curvature valley, is that we assume that the val-ley is wide, as we have a large region around the wallwhere if we land we can rely on first order methodsto move towards the local minima. This is why justclipping the gradient might be sufficient, not requiringthe use a second order method. Note that this algo-

rithm should work even when the rate of growth of thegradient is not the same as the one of the curvature(a case for which

Recommended