
Recurrent neural networks

Feedforward n/ws have fixed size input, fixed size output, fixed number of layers. RNNs can have variable input/output.

Figure: RNN schematic. Input: red, Output: blue, State: green. 1-1: Std. feedforward, 1-M: image captioning, M-1: sentiment identification, M-M: translation, M-Ms: video frame labelling. (Source: Karpathy blog)


RNN applications

RNNs (and their variants) have been used to model a wide range of tasks in NLP:

- Machine translation. Here we need a sentence-aligned data corpus between the two languages.

- Various sequence labelling tasks, e.g. PoS tagging, NER, etc.

- Language models. A language model predicts the next word in a sequence given the earlier words.

- Text generation.

- Speech recognition.

- Image annotation.


Recurrent neural network - basic

- Recurrent neural n/ws have feedback links.

Figure: An RNN node. (Source: WildML blog)


Recurrent neural network schematic

Figure: RNN unfolded. (Source: WildML blog)


RNN forward computation

- An RNN computation happens over an input sequence and produces an output sequence, using the tth element in the input and the neuron state at the (t − 1)th step, ψt−1. The assumption is that ψt−1 encodes the earlier history.

- Let ψt−1, ψt be the states of the neuron at steps t − 1 and t respectively, wt the tth input element and yt the tth output element (notation differs from the figure to be consistent with our earlier notation). Assuming U, V, W are known, ψt and yt are calculated by:

ψt = fa(U wt + W ψt−1)

yt = fo(V ψt)

Here fa is the activation function, typically tanh, sigmoid or relu, which introduces non-linearity into the network. Similarly, fo is the output function, for example softmax, which gives a probability distribution over the vocabulary.


- Often the inputs, or more commonly the outputs, at some time steps may not be useful. This depends on the nature of the task.

- Note that the weights represented by U, V, W are the same for all elements in the sequence.

The above is a standard RNN. There are many variants in use that are better suited to specific NLP tasks, but in principle they are similar to the standard RNN.
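As a concrete illustration, here is a minimal NumPy sketch of this forward recursion. It is a sketch under the notation above, not code from the slides; the function name rnn_forward, the softmax output function and the dimension conventions are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(inputs, U, W, V, psi0):
    """Run a plain RNN over a sequence.

    inputs: list of input vectors w_t, each of shape (|V|,)
    U: (h, |V|), W: (h, h), V: (|V|, h) -- shared across all time steps
    psi0: initial state of shape (h,)
    Returns the states psi_t and outputs y_t.
    """
    psi = psi0
    states, outputs = [], []
    for w_t in inputs:
        psi = np.tanh(U @ w_t + W @ psi)  # psi_t = f_a(U w_t + W psi_{t-1})
        y_t = softmax(V @ psi)            # y_t = f_o(V psi_t)
        states.append(psi)
        outputs.append(y_t)
    return states, outputs
```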


Training RNNs

- RNNs use a variant of the standard back propagation algorithm called back propagation through time or BPTT. To calculate the gradient at step t we need the gradients at earlier time steps.

- The need for gradients at earlier time steps leads to two complications:

a) Computations are expensive (see later).
b) The vanishing/exploding gradients (or unstable gradients) problem seen in deep neural nets.


Coding of input, output, dimensions of vectors, matrices

It is instructive to see how the input and output are represented and to calculate the dimensions of the vectors and matrices. Assume |V| is the vocabulary size and h is the hidden layer size. Assume we are training a recurrent network language model. Then:

- Let m be the number of words in the current sentence. The input sentence will look like: [START-SENTENCE, w1, . . . , wi, wi+1, . . . , wm, END-SENTENCE].

- Each word wi in the sentence will be input as a 1-hot vector of size |V| with 1 at the index of wi in V and the rest 0.

- Each desired output yi is a 1-hot vector of size |V| with 1 at the index of wi+1 in V and 0 elsewhere. The actual output ŷ is a softmax output vector of size |V|.

- The dimensions are (see the sketch below):
Sentence: m × |V|, U: h × |V|, V: |V| × h, W: h × h
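A small sketch of this encoding (the toy vocabulary, the sentence markers and all sizes are illustrative assumptions):

```python
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "</s>"]   # toy vocabulary, |V| = 5
V_size, h = len(vocab), 3
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(V_size)
    v[word_to_idx[word]] = 1.0
    return v

sentence = ["<s>", "the", "cat", "sat", "</s>"]
X = np.stack([one_hot(w) for w in sentence[:-1]])  # inputs w_t
Y = np.stack([one_hot(w) for w in sentence[1:]])   # desired outputs: next word

U = np.zeros((h, V_size))     # h x |V|
W = np.zeros((h, h))          # h x h
Vmat = np.zeros((V_size, h))  # |V| x h
print(X.shape, Y.shape, U.shape, W.shape, Vmat.shape)
```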


Update equations

At step t let yt be the true output value and ŷt the actual output of the RNN for input wt. Then, using the cross-entropy error (often used for such networks instead of squared error), we get:

Et(yt, ŷt) = −yt log(ŷt)

For the sequence from 0 to t, which is one training sequence, we get:

E(y, ŷ) = −Σt yt log(ŷt)
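A minimal sketch of this loss (assuming one-hot targets and softmax outputs, as above); with one-hot yt each term reduces to −log of the probability assigned to the correct next word:

```python
import numpy as np

def sequence_cross_entropy(targets, predictions):
    """E(y, yhat) = -sum_t y_t . log(yhat_t) for one sequence.

    targets:     list of one-hot vectors y_t
    predictions: list of softmax output vectors yhat_t
    """
    return -sum(float(y_t @ np.log(yhat_t)) for y_t, yhat_t in zip(targets, predictions))
```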

BPTT will use backprop from 0 to t and use (stochastic) gradient descent to learn appropriate values for U, V, W. Assume t = 3; then the gradient for V is ∂E3/∂V.


Using the chain rule:

∂E3/∂V = ∂E3/∂ŷ3 · ∂ŷ3/∂V = ∂E3/∂ŷ3 · ∂ŷ3/∂z3 · ∂z3/∂V = (ŷ3 − y3) ⊗ ψ3

where z3 = Vψ3 and u ⊗ v = uv^T (outer/tensor product).

Observation: ∂E3/∂V depends only on values at t = 3.
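This can be checked numerically. The sketch below (random illustrative values; a hypothetical setup, not from the slides) compares (ŷ3 − y3) ⊗ ψ3 with a finite-difference estimate of ∂E3/∂V:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, h = 5, 3
Vmat = rng.normal(size=(V_size, h))
psi3 = rng.normal(size=h)
y3 = np.zeros(V_size); y3[2] = 1.0        # one-hot true output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(M):                              # E3 as a function of V only
    return -float(y3 @ np.log(softmax(M @ psi3)))

yhat3 = softmax(Vmat @ psi3)
analytic = np.outer(yhat3 - y3, psi3)     # (yhat3 - y3) outer psi3

numeric = np.zeros_like(Vmat)
eps = 1e-6
for i in range(V_size):
    for j in range(h):
        Vp = Vmat.copy(); Vp[i, j] += eps
        Vm = Vmat.copy(); Vm[i, j] -= eps
        numeric[i, j] = (loss(Vp) - loss(Vm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very close to zero
```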


For W (and similarly for U)

∂E3/∂W = ∂E3/∂ŷ3 · ∂ŷ3/∂ψ3 · ∂ψ3/∂W

ψ3 = tanh(U w3 + W ψ2), and ψ2 in turn depends on W (and on ψ1, and so on). So we get:

∂E3/∂W = Σ_{t=0..3} ∂E3/∂ŷ3 · ∂ŷ3/∂ψ3 · ∂ψ3/∂ψt · ∂ψt/∂W


Figure: Backprop for W. (Source: WildML blog)
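A minimal NumPy sketch of how this sum is accumulated in practice, for the loss at the final step only (as in the E3 example above). The function name and setup are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt_grad_W(ws, ys, U, W, V):
    """Gradient of the final-step cross-entropy loss w.r.t. W for a tanh
    RNN with softmax outputs, summing contributions from every earlier step.

    ws: list of input vectors, ys: list of one-hot target vectors.
    """
    # Forward pass, keeping all states (the initial state is the zero vector).
    psis = [np.zeros(W.shape[0])]
    for w_t in ws:
        psis.append(np.tanh(U @ w_t + W @ psis[-1]))

    T = len(ws) - 1
    yhat_T = softmax(V @ psis[-1])
    delta = V.T @ (yhat_T - ys[T])                # dE_T / dpsi_T
    dW = np.zeros_like(W)
    for t in range(T, -1, -1):
        dtanh = delta * (1.0 - psis[t + 1] ** 2)  # back through tanh at step t
        dW += np.outer(dtanh, psis[t])            # the dpsi_t/dW term of the sum
        delta = W.T @ dtanh                       # chain dpsi_t/dpsi_{t-1}
    return dW
```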


Vanishing gradients

- Neural networks with a large number of layers using sigmoid or tanh activation functions face the unstable gradients problem (typically vanishing gradients, since explosions can be blocked by a threshold).

- ∂ψ3/∂ψt in the equation for ∂E3/∂W is itself a product of gradients due to the chain rule:

∂ψ3/∂ψt = Π_{i=t+1..3} ∂ψi/∂ψi−1

Each factor ∂ψi/∂ψi−1 is a Jacobian matrix, so we are doing O(m) matrix multiplications (m is the sentence length).
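The shrinkage can be seen directly. The sketch below uses arbitrary illustrative sizes and scales and assumes a state update ψi = tanh(Wψi−1) with no input, so each Jacobian is diag(1 − ψi²)W; it multiplies in one Jacobian per step and prints the norm of the product:

```python
import numpy as np

rng = np.random.default_rng(1)
h = 50
W = rng.normal(scale=0.5 / np.sqrt(h), size=(h, h))  # modest spectral norm
psi = rng.uniform(-1.0, 1.0, size=h)

J = np.eye(h)                 # running product d(psi_k)/d(psi_0)
for k in range(1, 21):
    psi = np.tanh(W @ psi)                   # advance the state one step
    J = np.diag(1.0 - psi ** 2) @ W @ J      # multiply in one more Jacobian
    if k % 5 == 0:
        print(k, np.linalg.norm(J, 2))       # norm typically decays with k
```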


How to address unstable gradients?

- Do not use gradient-based weight updates.

- Use ReLU, which has a gradient of 0 or 1 (relu(x) = max(0, x)).

- Use a different architecture: LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit). We go forward with this option, the most popular in the current literature.


Schematic of LSTM cell

Figure: A basic LSTM memory cell. (From deeplearning4j.org).


LSTM: step by step

- The key data entity in an LSTM memory cell is the hidden internal state Ct at time step t, which is a vector. This state is not visible outside the cell.

- The main computational step calculates the internal state Ct, which depends on the input xt and the previous visible state ht−1. But this dependence is not direct: it is controlled or modulated via two gates, the forget gate gf and the input gate gi. The output state ht that is visible outside the cell is obtained by modulating the hidden internal state Ct by the output gate go.


The gates

- Each gate gets as input the input vector xt and the previous visible state ht−1, each linearly transformed by the corresponding weight matrix.

- Intuitively, the gates control the following:
gf: what part of the previous cell state Ct−1 will be carried forward to step t.
gi: what part of xt and the visible state ht−1 will be used for calculating Ct.
go: what part of the current internal state Ct will be made visible outside, that is ht.

- Ct does depend on the earlier state Ct−1 and the current input xt, but not directly as in an RNN. The gates gf and gi modulate the two parts (see later).


Difference between RNN and LSTM

The difference with respect to a standard RNN is:

1. There is no hidden internal state in an RNN. The state ht is directly computed from the current input xt and the previous state ht−1, linearly transformed via the weight matrices U and W respectively.

2. There is no modulation or control. There are no gates.


The gate equations

Each gate has as input the current input vector xt and the previous visible state vector ht−1.

gf = σ(Wf xt + Whf ht−1 + bf)

gi = σ(Wi xt + Whi ht−1 + bi)

go = σ(Wo xt + Who ht−1 + bo)

The gate non-linearity is always a sigmoid. This gives a value between 0 and 1 for each gate output. The intuition is: a 0 value completely forgets the past and a 1 value fully remembers the past. An in-between value partially remembers the past. This controls long-term dependency.


Calculation of Ct, ht

- There are two vectors that must be calculated: the hidden internal state Ct and the visible cell state ht that will be propagated to step (t + 1).

- The Ct calculation is done in two steps. First an update C̃t is calculated based on the two inputs to the memory cell, xt and ht−1.

C̃t = tanh(Wc xt + Whc ht−1 + bc)

Ct is calculated by modulating the previous state with gf and adding to it the update C̃t modulated by gi, giving:

Ct = (gf ⊗ Ct−1) ⊕ (gi ⊗ C̃t)

(Here ⊗ denotes element-wise multiplication and ⊕ element-wise addition.)

- Similarly, the new visible state or output ht is calculated by:

ht = go ⊗ tanh(Ct)
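A minimal sketch of one LSTM step following these equations (the parameter-dictionary keys such as "Wf", "Whf" are illustrative names, not a fixed API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step; p holds the weight matrices and biases."""
    g_f = sigmoid(p["Wf"] @ x_t + p["Whf"] @ h_prev + p["bf"])      # forget gate
    g_i = sigmoid(p["Wi"] @ x_t + p["Whi"] @ h_prev + p["bi"])      # input gate
    g_o = sigmoid(p["Wo"] @ x_t + p["Who"] @ h_prev + p["bo"])      # output gate
    C_tilde = np.tanh(p["Wc"] @ x_t + p["Whc"] @ h_prev + p["bc"])  # candidate update
    C_t = g_f * C_prev + g_i * C_tilde      # new internal state
    h_t = g_o * np.tanh(C_t)                # new visible state
    return h_t, C_t
```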


Detailed schematic of LSTM cell

Figure: Details of LSTM memory cell. (From deeplearning4j.org).


Peepholes

- A variant of the standard LSTM has peepholes. This means that the gates have access to the earlier and current hidden states Ct−1 and Ct. So the gate equations become:

gf = σ(Wf xt + Whf ht−1 + Wcf Ct−1 + bf)

gi = σ(Wi xt + Whi ht−1 + Wci Ct−1 + bi)

go = σ(Wo xt + Who ht−1 + Wco Ct + bo)

This is shown by the blue dashed lines and the non-dashed line in the earlier diagram.
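A sketch of the same step with peepholes (again with illustrative parameter names; full peephole weight matrices are used here, though diagonal peephole weights are also common):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, h_prev, C_prev, p):
    """LSTM step with peephole connections, following the equations above."""
    g_f = sigmoid(p["Wf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ C_prev + p["bf"])
    g_i = sigmoid(p["Wi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ C_prev + p["bi"])
    C_tilde = np.tanh(p["Wc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    C_t = g_f * C_prev + g_i * C_tilde
    # The output gate peeks at the *current* cell state C_t, so compute it after C_t.
    g_o = sigmoid(p["Wo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ C_t + p["bo"])
    h_t = g_o * np.tanh(C_t)
    return h_t, C_t
```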


Gated Recurrent Units (GRU)

- A GRU is a simpler version of an LSTM. The differences from an LSTM are:

- A GRU does not have an internal hidden cell state (like Ct). Instead it has only the state ht.

- A GRU has only two gates, gr (reset gate) and gu (update gate). These together perform like the earlier gf and gi LSTM gates. There is no output gate.

- Since there is no hidden cell state or output gate, the output is ht without a tanh non-linearity.


GRU equations

- The gate equations are:

gr = σ(Wri xt + Wrh ht−1 + br)

gu = σ(Wui xt + Wuh ht−1 + bu)

- The state update is:

h̃t = tanh(Whi xt + Whh(gr ⊗ ht−1) + bh)

The gate gr is used to control what part of the previous state will be used in the update.

- The final state is:

ht = ((1 − gu) ⊗ h̃t) ⊕ (gu ⊗ ht−1)

The update gate gu controls the relative contribution of the update (h̃t) and the previous state ht−1.
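A minimal sketch of one GRU step following these equations (parameter names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p holds the weight matrices and biases."""
    g_r = sigmoid(p["Wri"] @ x_t + p["Wrh"] @ h_prev + p["br"])             # reset gate
    g_u = sigmoid(p["Wui"] @ x_t + p["Wuh"] @ h_prev + p["bu"])             # update gate
    h_tilde = np.tanh(p["Whi"] @ x_t + p["Whh"] @ (g_r * h_prev) + p["bh"]) # candidate
    h_t = (1.0 - g_u) * h_tilde + g_u * h_prev                              # final state
    return h_t
```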


GRU schematic

Figure: Details of GRU memory cell. (From wikimedia.org).