
Recurrent neural networks

Feedforward n/ws have fixed size input, fixed size output, fixed number of layers. RNNs can have variable input/output.

Figure: RNN schematic. Input: red, Output: blue, State: green. 1-1: Std. feedforward, 1-M: image captioning, M-1: sentiment identification, M-M: translation, M-Ms: video frame labelling. (Source: Karpathy blog)


RNN applications

RNNs (and their variants) have been used to model a wide range of tasks in NLP:

- Machine translation. Here we need a sentence-aligned data corpus between the two languages.

- Various sequence labelling tasks, e.g. PoS tagging, NER, etc.

- Language models. A language model predicts the next word in a sequence given the earlier words.

- Text generation.

- Speech recognition.

- Image annotation.


Recurrent neural network - basic

- Recurrent neural n/ws have feedback links.

Figure: An RNN node. (Source: WildML blog)


Recurrent neural network schematic

Figure: RNN unfolded. (Source: WildML blog)


RNN forward computation

- An RNN computation happens over an input sequence and produces an output sequence, using the tth element in the input and the neuron state at the (t − 1)th step, ψt−1. The assumption is that ψt−1 encodes the earlier history.

- Let ψt−1, ψt be the states of the neuron at steps t − 1 and t respectively, wt the tth input element and yt the tth output element (notation differs from the figure to be consistent with our earlier notation). Assuming U, V, W are known, ψt and yt are calculated by:

ψt = fa(U wt + W ψt−1)

yt = fo(V ψt)

Here fa is the activation function, typically tanh, sigmoid or relu, which introduces non-linearity into the network. Similarly, fo is the output function, for example softmax, which gives a probability distribution over the vocabulary.


- Often the inputs, or more commonly the outputs, at some time steps may not be useful. This depends on the nature of the task.

- Note that the weights represented by U, V, W are the same for all elements in the sequence.

The above is a standard RNN. There are many variants in use that are better suited to specific NLP tasks, but in principle they are similar to the standard RNN.
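As a concrete illustration, here is a minimal NumPy sketch of this forward recursion. It is a sketch under the notation above, not code from the slides; the function name rnn_forward, the softmax output function and the dimension conventions are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(inputs, U, W, V, psi0):
    """Run a plain RNN over a sequence.

    inputs: list of input vectors w_t, each of shape (|V|,)
    U: (h, |V|), W: (h, h), V: (|V|, h) -- shared across all time steps
    psi0: initial state of shape (h,)
    Returns the states psi_t and outputs y_t.
    """
    psi = psi0
    states, outputs = [], []
    for w_t in inputs:
        psi = np.tanh(U @ w_t + W @ psi)  # psi_t = f_a(U w_t + W psi_{t-1})
        y_t = softmax(V @ psi)            # y_t = f_o(V psi_t)
        states.append(psi)
        outputs.append(y_t)
    return states, outputs
```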


Training RNNs

- RNNs use a variant of the standard back propagation algorithm called back propagation through time or BPTT. To calculate the gradient at step t we need the gradients at earlier time steps.

- The need for gradients at earlier time steps leads to two complications:

a) Computations are expensive (see later).
b) The vanishing/exploding gradients (or unstable gradients) problem seen in deep neural nets.


Coding of input, output, dimensions of vectors, matrices

It is instructive to see how the input and output are represented and to calculate the dimensions of the vectors and matrices. Assume |V| is the vocabulary size and h is the hidden layer size. Assume we are training a recurrent network language model. Then:

- Let m be the number of words in the current sentence. The input sentence will look like: [START-SENTENCE, w1, . . . , wi, wi+1, . . . , wm, END-SENTENCE].

- Each word wi in the sentence will be input as a 1-hot vector of size |V| with 1 at the index of wi in V and the rest 0.

- Each desired output yi is a 1-hot vector of size |V| with 1 at the index of wi+1 in V and 0 elsewhere. The actual output ŷ is a softmax output vector of size |V|.

- The dimensions are (see the sketch below):
Sentence: m × |V|, U: h × |V|, V: |V| × h, W: h × h
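A small sketch of this encoding (the toy vocabulary, the sentence markers and all sizes are illustrative assumptions):

```python
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "</s>"]   # toy vocabulary, |V| = 5
V_size, h = len(vocab), 3
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(V_size)
    v[word_to_idx[word]] = 1.0
    return v

sentence = ["<s>", "the", "cat", "sat", "</s>"]
X = np.stack([one_hot(w) for w in sentence[:-1]])  # inputs w_t
Y = np.stack([one_hot(w) for w in sentence[1:]])   # desired outputs: next word

U = np.zeros((h, V_size))     # h x |V|
W = np.zeros((h, h))          # h x h
Vmat = np.zeros((V_size, h))  # |V| x h
print(X.shape, Y.shape, U.shape, W.shape, Vmat.shape)
```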


Update equations

At step t let yt be the true output value and ŷt the actual output of the RNN for input wt. Then, using the cross-entropy error (often used for such networks instead of squared error), we get:

Et(yt, ŷt) = −yt log(ŷt)

For the sequence from 0 to t, which is one training sequence, we get:

E(y, ŷ) = −Σt yt log(ŷt)
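A minimal sketch of this loss (assuming one-hot targets and softmax outputs, as above); with one-hot yt each term reduces to −log of the probability assigned to the correct next word:

```python
import numpy as np

def sequence_cross_entropy(targets, predictions):
    """E(y, yhat) = -sum_t y_t . log(yhat_t) for one sequence.

    targets:     list of one-hot vectors y_t
    predictions: list of softmax output vectors yhat_t
    """
    return -sum(float(y_t @ np.log(yhat_t)) for y_t, yhat_t in zip(targets, predictions))
```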

BPTT will use backprop from 0 to t and use (stochastic) gradient descent to learn appropriate values for U, V, W. Assume t = 3; then the gradient for V is ∂E3/∂V.


Using the chain rule:

∂E3/∂V = ∂E3/∂ŷ3 · ∂ŷ3/∂V = ∂E3/∂ŷ3 · ∂ŷ3/∂z3 · ∂z3/∂V = (ŷ3 − y3) ⊗ ψ3

where z3 = Vψ3 and u ⊗ v = uv^T (outer/tensor product).

Observation: ∂E3/∂V depends only on values at t = 3.
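This can be checked numerically. The sketch below (random illustrative values; a hypothetical setup, not from the slides) compares (ŷ3 − y3) ⊗ ψ3 with a finite-difference estimate of ∂E3/∂V:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, h = 5, 3
Vmat = rng.normal(size=(V_size, h))
psi3 = rng.normal(size=h)
y3 = np.zeros(V_size); y3[2] = 1.0        # one-hot true output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(M):                              # E3 as a function of V only
    return -float(y3 @ np.log(softmax(M @ psi3)))

yhat3 = softmax(Vmat @ psi3)
analytic = np.outer(yhat3 - y3, psi3)     # (yhat3 - y3) outer psi3

numeric = np.zeros_like(Vmat)
eps = 1e-6
for i in range(V_size):
    for j in range(h):
        Vp = Vmat.copy(); Vp[i, j] += eps
        Vm = Vmat.copy(); Vm[i, j] -= eps
        numeric[i, j] = (loss(Vp) - loss(Vm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very close to zero
```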


For W (and similarly for U)

∂E3/∂W = ∂E3/∂ŷ3 · ∂ŷ3/∂ψ3 · ∂ψ3/∂W

ψ3 = tanh(U w3 + W ψ2), and ψ2 in turn depends on W (and on ψ1, and so on). So we get:

∂E3/∂W = Σ_{t=0..3} ∂E3/∂ŷ3 · ∂ŷ3/∂ψ3 · ∂ψ3/∂ψt · ∂ψt/∂W


Figure: Backprop for W. (Source: WildML blog)
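A minimal NumPy sketch of how this sum is accumulated in practice, for the loss at the final step only (as in the E3 example above). The function name and setup are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt_grad_W(ws, ys, U, W, V):
    """Gradient of the final-step cross-entropy loss w.r.t. W for a tanh
    RNN with softmax outputs, summing contributions from every earlier step.

    ws: list of input vectors, ys: list of one-hot target vectors.
    """
    # Forward pass, keeping all states (the initial state is the zero vector).
    psis = [np.zeros(W.shape[0])]
    for w_t in ws:
        psis.append(np.tanh(U @ w_t + W @ psis[-1]))

    T = len(ws) - 1
    yhat_T = softmax(V @ psis[-1])
    delta = V.T @ (yhat_T - ys[T])                # dE_T / dpsi_T
    dW = np.zeros_like(W)
    for t in range(T, -1, -1):
        dtanh = delta * (1.0 - psis[t + 1] ** 2)  # back through tanh at step t
        dW += np.outer(dtanh, psis[t])            # the dpsi_t/dW term of the sum
        delta = W.T @ dtanh                       # chain dpsi_t/dpsi_{t-1}
    return dW
```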


Vanishing gradients

- Neural networks with a large number of layers using sigmoid or tanh activation functions face the unstable gradients problem (typically vanishing gradients, since explosions can be blocked by a threshold).

- ∂ψ3/∂ψt in the equation for ∂E3/∂W is itself a product of gradients due to the chain rule:

∂ψ3/∂ψt = Π_{i=t+1..3} ∂ψi/∂ψi−1

Each factor ∂ψi/∂ψi−1 is a Jacobian matrix, so we are doing O(m) matrix multiplications (m is the sentence length).
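The shrinkage can be seen directly. The sketch below uses arbitrary illustrative sizes and scales and assumes a state update ψi = tanh(Wψi−1) with no input, so each Jacobian is diag(1 − ψi²)W; it multiplies in one Jacobian per step and prints the norm of the product:

```python
import numpy as np

rng = np.random.default_rng(1)
h = 50
W = rng.normal(scale=0.5 / np.sqrt(h), size=(h, h))  # modest spectral norm
psi = rng.uniform(-1.0, 1.0, size=h)

J = np.eye(h)                 # running product d(psi_k)/d(psi_0)
for k in range(1, 21):
    psi = np.tanh(W @ psi)                   # advance the state one step
    J = np.diag(1.0 - psi ** 2) @ W @ J      # multiply in one more Jacobian
    if k % 5 == 0:
        print(k, np.linalg.norm(J, 2))       # norm typically decays with k
```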


How to address unstable gradients?

- Do not use gradient-based weight updates.

- Use ReLU, which has a gradient of 0 or 1 (relu(x) = max(0, x)).

- Use a different architecture: LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit). We go forward with this option, the most popular in the current literature.


Schematic of LSTM cell

Figure: A basic LSTM memory cell. (From deeplearning4j.org).


LSTM: step by step

- The key data entity in an LSTM memory cell is the hidden internal state Ct at time step t, which is a vector. This state is not visible outside the cell.

- The main computational step calculates the internal state Ct, which depends on the input xt and the previous visible state ht−1. But this dependence is not direct: it is controlled or modulated via two gates, the forget gate gf and the input gate gi. The output state ht that is visible outside the cell is obtained by modulating the hidden internal state Ct by the output gate go.


The gates

- Each gate gets as input the input vector xt and the previous visible state ht−1, each linearly transformed by the corresponding weight matrix.

- Intuitively, the gates control the following:
gf: what part of the previous cell state Ct−1 will be carried forward to step t.
gi: what part of xt and the visible state ht−1 will be used for calculating Ct.
go: what part of the current internal state Ct will be made visible outside, that is ht.

- Ct does depend on the earlier state Ct−1 and the current input xt, but not directly as in an RNN. The gates gf and gi modulate the two parts (see later).


Difference between RNN and LSTM

The difference with respect to a standard RNN is:

1. There is no hidden internal state in an RNN. The state ht is directly computed from the current input xt and the previous state ht−1, linearly transformed via the weight matrices U and W respectively.

2. There is no modulation or control. There are no gates.


The gate equations

Each gate has as input the current input vector xt and the previous visible state vector ht−1.

gf = σ(Wf xt + Whf ht−1 + bf)

gi = σ(Wi xt + Whi ht−1 + bi)

go = σ(Wo xt + Who ht−1 + bo)

The gate non-linearity is always a sigmoid. This gives a value between 0 and 1 for each gate output. The intuition is: a 0 value completely forgets the past and a 1 value fully remembers the past. An in-between value partially remembers the past. This controls long-term dependency.


Calculation of Ct, ht

- There are two vectors that must be calculated: the hidden internal state Ct and the visible cell state ht that will be propagated to step (t + 1).

- The Ct calculation is done in two steps. First an update C̃t is calculated based on the two inputs to the memory cell, xt and ht−1.

C̃t = tanh(Wc xt + Whc ht−1 + bc)

Ct is calculated by modulating the previous state with gf and adding to it the update C̃t modulated by gi, giving:

Ct = (gf ⊗ Ct−1) ⊕ (gi ⊗ C̃t)

(Here ⊗ denotes element-wise multiplication and ⊕ element-wise addition.)

- Similarly, the new visible state or output ht is calculated by:

ht = go ⊗ tanh(Ct)
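A minimal sketch of one LSTM step following these equations (the parameter-dictionary keys such as "Wf", "Whf" are illustrative names, not a fixed API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step; p holds the weight matrices and biases."""
    g_f = sigmoid(p["Wf"] @ x_t + p["Whf"] @ h_prev + p["bf"])      # forget gate
    g_i = sigmoid(p["Wi"] @ x_t + p["Whi"] @ h_prev + p["bi"])      # input gate
    g_o = sigmoid(p["Wo"] @ x_t + p["Who"] @ h_prev + p["bo"])      # output gate
    C_tilde = np.tanh(p["Wc"] @ x_t + p["Whc"] @ h_prev + p["bc"])  # candidate update
    C_t = g_f * C_prev + g_i * C_tilde      # new internal state
    h_t = g_o * np.tanh(C_t)                # new visible state
    return h_t, C_t
```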


Detailed schematic of LSTM cell

Figure: Details of LSTM memory cell. (From deeplearning4j.org).


Peepholes

- A variant of the standard LSTM has peepholes. This means that the gates have access to the earlier and current hidden states Ct−1 and Ct. So the gate equations become:

gf = σ(Wf xt + Whf ht−1 + Wcf Ct−1 + bf)

gi = σ(Wi xt + Whi ht−1 + Wci Ct−1 + bi)

go = σ(Wo xt + Who ht−1 + Wco Ct + bo)

This is shown by the blue dashed lines and the non-dashed line in the earlier diagram.
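A sketch of the same step with peepholes (again with illustrative parameter names; full peephole weight matrices are used here, though diagonal peephole weights are also common):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, h_prev, C_prev, p):
    """LSTM step with peephole connections, following the equations above."""
    g_f = sigmoid(p["Wf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ C_prev + p["bf"])
    g_i = sigmoid(p["Wi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ C_prev + p["bi"])
    C_tilde = np.tanh(p["Wc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    C_t = g_f * C_prev + g_i * C_tilde
    # The output gate peeks at the *current* cell state C_t, so compute it after C_t.
    g_o = sigmoid(p["Wo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ C_t + p["bo"])
    h_t = g_o * np.tanh(C_t)
    return h_t, C_t
```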


Gated Recurrent Units (GRU)

- A GRU is a simpler version of an LSTM. The differences from an LSTM are:

- A GRU does not have an internal hidden cell state (like Ct). Instead it has only the state ht.

- A GRU has only two gates, gr (reset gate) and gu (update gate). These together perform like the earlier gf and gi LSTM gates. There is no output gate.

- Since there is no hidden cell state or output gate, the output is ht without a tanh non-linearity.


GRU equations

- The gate equations are:

gr = σ(Wri xt + Wrh ht−1 + br)

gu = σ(Wui xt + Wuh ht−1 + bu)

- The state update is:

h̃t = tanh(Whi xt + Whh(gr ⊗ ht−1) + bh)

The gate gr is used to control what part of the previous state will be used in the update.

- The final state is:

ht = ((1 − gu) ⊗ h̃t) ⊕ (gu ⊗ ht−1)

The update gate gu controls the relative contribution of the update (h̃t) and the previous state ht−1.
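A minimal sketch of one GRU step following these equations (parameter names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p holds the weight matrices and biases."""
    g_r = sigmoid(p["Wri"] @ x_t + p["Wrh"] @ h_prev + p["br"])             # reset gate
    g_u = sigmoid(p["Wui"] @ x_t + p["Wuh"] @ h_prev + p["bu"])             # update gate
    h_tilde = np.tanh(p["Whi"] @ x_t + p["Whh"] @ (g_r * h_prev) + p["bh"]) # candidate
    h_t = (1.0 - g_u) * h_tilde + g_u * h_prev                              # final state
    return h_t
```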


GRU schematic

Figure: Details of GRU memory cell. (From wikimedia.org).