
Page 1

Recurrent Neural Networks, Language, Semantics, and Disentangling Factors

Yoshua Bengio

August 29th, 2017 @ DS3

Data Science Summer School

Page 2

Recurrent Neural Networks

• Selectively summarize an input sequence in a fixed-size state vector via a recursive update

[Figure: the recurrence s_t = F_θ(s_{t−1}, x_t), shown in rolled form and unfolded over time; the same F_θ is shared over time.]

Generalizes naturally to new lengths not seen during training
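For concreteness, a minimal NumPy sketch (my own illustration, not from the slides) of the shared recursive update s_t = F_θ(s_{t−1}, x_t), with F_θ taken to be a simple tanh layer:

```python
import numpy as np

def rnn_states(X, U, W, b, s0=None):
    """Run the recurrence s_t = tanh(U x_t + W s_{t-1} + b) over a sequence.

    X: (T, d_in) inputs; U: (d_hid, d_in); W: (d_hid, d_hid); b: (d_hid,).
    Returns the (T, d_hid) sequence of state vectors."""
    s = np.zeros(W.shape[0]) if s0 is None else s0
    states = []
    for x_t in X:                         # the same U, W, b are reused at every step
        s = np.tanh(U @ x_t + W @ s + b)
        states.append(s)
    return np.stack(states)               # works for any sequence length T
```

Because the parameters are shared over time, the same function runs for 10 or 10,000 steps, which is what lets the model handle lengths not seen during training.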

Page 3

Recurrent Neural Networks

• Can produce an output at each time step: unfolding the graph tells us how to back-prop through time.

[Figure: unfolded RNN with shared weights U (input-to-hidden), W (hidden-to-hidden) and V (hidden-to-output), producing an output o_t from the state s_t at every time step.]

Page 4

Generative RNNs

[Figure: generative RNN unfolded in time; at each step the output o_t defines a distribution over the next element, with a per-step loss L_t, and the generated x_{t+1} is fed back as the next input.]

• An RNN can represent a fully-connected directed generative model: every variable is predicted from all previous ones.
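As a hedged illustration (my own toy code, with parameter names U, W, V, b, c matching the unfolded figure above), ancestral sampling draws each symbol from the output softmax and feeds it back in as the next input:

```python
import numpy as np

def sample_sequence(U, W, V, b, c, T, start_idx=0, seed=0):
    """Sample T symbols from a toy generative RNN over a discrete vocabulary."""
    rng = np.random.default_rng(seed)
    vocab = V.shape[0]
    s = np.zeros(W.shape[0])
    x = np.eye(vocab)[start_idx]           # one-hot start symbol
    seq = []
    for _ in range(T):
        s = np.tanh(U @ x + W @ s + b)     # state update
        logits = V @ s + c
        p = np.exp(logits - logits.max())
        p /= p.sum()                       # softmax over the next symbol
        idx = rng.choice(vocab, p=p)
        x = np.eye(vocab)[idx]             # the sample becomes the next input
        seq.append(idx)
    return seq
```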

Page 5

Neural word embeddings - visualization

[Figure: visualization of learned neural word embeddings.]

Page 6


Conditional Distributions

• Sequence to vector

• Sequence to sequence of the same length, aligned

• Vector to sequence

• Sequence to sequence

[Figure: the same unfolded recurrence s_t = F_θ(s_{t−1}, x_t) shown earlier.]

Page 7

Maximum Likelihood = Teacher Forcing

• During training, the past y fed as input comes from the training data

• At generation time, the past y fed as input is generated by the model

• The mismatch can cause "compounding error"

[Figure: training-time path (conditioning on ground truth) vs. test-time path (conditioning on the model's own samples).]

Page 8

Ideas to reduce the train/generate mismatch in teacher forcing

• Scheduled sampling (S. Bengio et al, NIPS 2015)

• Backprop through the open-loop sampling recurrence & minimize a long-term cost (but which one? A GAN-style objective would be most natural: Professor Forcing)

Related to SEARN (Daumé et al 2009) and DAGGER (Ross et al 2010)

Scheduled sampling: gradually increase the probability of using the model's samples vs. the ground truth as input (see the sketch below).
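A hedged sketch of one scheduled-sampling training step (the `model.step(prev, state) -> (probs, state)` interface is hypothetical, used only for illustration):

```python
import numpy as np

def scheduled_sampling_loss(model, y_seq, p_model, rng):
    """Cross-entropy over one sequence; each step conditions on either the
    ground-truth previous token (teacher forcing) or the model's own sample,
    the latter with probability p_model (annealed upward during training)."""
    state = model.initial_state()
    prev = y_seq[0]
    loss = 0.0
    for t in range(1, len(y_seq)):
        probs, state = model.step(prev, state)
        loss += -np.log(probs[y_seq[t]] + 1e-12)     # usual teacher-forcing loss term
        if rng.random() < p_model:
            prev = rng.choice(len(probs), p=probs)   # feed back the model's sample
        else:
            prev = y_seq[t]                          # feed back the ground truth
    return loss
```

With p_model = 0 this reduces to plain teacher forcing; annealing it toward the test-time regime is what reduces the train/generate mismatch.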

Page 9

Increasing the Expressive Power of RNNs with more Depth

• ICLR 2014, How to construct deep recurrent neural networks

[Figure: ordinary RNN compared with deeper variants: deep input-to-hidden, deep hidden-to-hidden, and deep hidden-to-output transformations; skip connections for creating shorter paths; and stacked recurrent layers.]

Page 10

Bidirectional RNNs, Recursive Nets, Multidimensional RNNs, etc.

• The unfolded architecture need not be a straight chain

• Bidirectional RNNs (Schuster and Paliwal, 1997); see Alex Graves's work, e.g., 2012

• Multidimensional RNNs (Graves et al 2007)

• Recursive (tree-structured) neural nets: Frasconi et al 1997, Socher et al 2011

[Figure: a tree-structured (recursive) network over inputs x1..x4 with shared weights V, producing output o and loss L from target y.]

Page 11

Multiplicative Interactions

• Multiplicative Integration RNNs (Wu et al, 2016, arXiv:1606.06630):

• Replace the usual additive combination of the input and recurrent terms

• By a multiplicative (elementwise product) combination

• Or a more general form mixing both (see the sketch below)
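A minimal sketch of the pre-activations involved, based on my reading of the cited paper (treat the exact general form as an assumption; the nonlinearity, e.g. tanh, is applied on top):

```python
import numpy as np

def additive_preact(W, x, U, h, b):
    # standard RNN integration: phi(Wx + Uh + b)
    return W @ x + U @ h + b

def mi_preact(W, x, U, h, b):
    # multiplicative integration: phi((Wx) * (Uh) + b), elementwise product
    return (W @ x) * (U @ h) + b

def mi_general_preact(W, x, U, h, alpha, beta1, beta2, b):
    # more general form: phi(alpha*(Wx)*(Uh) + beta1*(Uh) + beta2*(Wx) + b)
    wx, uh = W @ x, U @ h
    return alpha * wx * uh + beta1 * uh + beta2 * wx + b
```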

Page 12

Learning Long-Term Dependencies with Gradient Descent is Difficult

Y. Bengio, P. Simard & P. Frasconi, IEEE Trans. Neural Nets, 1994

Page 13

Simple Experiments from 1991 while I was at MIT

• 2 categories of sequences

• Can the single tanh unit learn to store for T time steps 1 bit of information given by the sign of initial input?

[Figure: probability of successful training as a function of the sequence length T.]

Page 14

How to store 1 bit? Dynamics with multiple basins of attraction in some dimensions

• Some subspace of the state can store 1 or more bits of information if the dynamical system has multiple basins of attraction in some dimensions

[Figure: two basins of attraction, Bit=0 and Bit=1, separated by the basin boundary.]

Note: gradients MUST be high near the boundary

Page 15

Robustly storing 1 bit in the presence of bounded noise

• With spectral radius > 1, noise can kick state out of attractor

• Not so with radius < 1

[Figure: with spectral radius > 1 the dynamics are UNSTABLE; with a contractive map (radius < 1) they are STABLE.]

Page 16

Storing Reliably ⇒ Vanishing Gradients

• Reliably storing bits of information requires spectral radius < 1

• The product of T matrices whose spectral radius is < 1 is a matrix whose spectral radius converges to 0 at exponential rate in T (see the numerical illustration below)

• If the spectral radius of the Jacobian is < 1, propagated gradients vanish
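A quick numerical illustration of the second point (my own sketch, not from the slides): repeated multiplication by a matrix whose spectral radius is 0.9 shrinks exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
A = rng.standard_normal((d, d))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))   # rescale so the spectral radius is 0.9

P = np.eye(d)
for T in range(1, 201):
    P = A @ P                               # product of T Jacobian-like factors
    if T % 50 == 0:
        print(T, np.linalg.norm(P, 2))      # largest singular value decays roughly like 0.9**T
```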

Page 17

Vanishing or Exploding Gradients

• Hochreiter’s 1991 MSc thesis (in German) had independently discovered that backpropagated gradients in RNNs tend to either vanish or explode as sequence length increases


Page 18

Why it hurts gradient-based learning

• Long-term dependencies get a weight that is exponentially smaller (in T) compared to short-term dependencies

Becomes exponentially smaller for longer time differences, when spectral radius < 1

Page 19

Vanishing Gradients in Deep Nets are Different from the Case in RNNs

• If it were just a case of vanishing gradients in deep nets, we could just rescale the per-layer learning rate, but that does not really fix the training difficulties.

• Can’t do that with RNNs because the weights are shared, & total true gradient = sum over different “depths”

[Figure: the total gradient sums contributions propagated through paths of different depths (1, 2, 3, 4, ...).]

Page 20

To store information robustly the dynamics must be contractive

• The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].

• Problems:

• e-values of Jacobians > 1 ⇒ gradients explode (addressed by gradient clipping)

• or e-values < 1 ⇒ gradients shrink & vanish (yet storing bits robustly requires e-values < 1)

• or random ⇒ variance grows exponentially

Page 21

RNN Tricks (Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013)

• Clipping gradients (avoid exploding gradients)

• Leaky integration (propagate long-term dependencies)

• Momentum (cheap 2nd order)

• Initialization (start in right ballpark avoids exploding/vanishing)

• Sparse Gradients (symmetry breaking)

• Gradient propagation regularizer (avoid vanishing gradient)

• Gated self-loops (LSTM & GRU, reduces vanishing gradient)


Page 22

Dealing with Gradient Explosion by Gradient Norm Clipping


(Mikolov thesis 2012;Pascanu, Mikolov, Bengio, ICML 2013)
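A minimal sketch of the norm-clipping rule described in those references (the threshold value here is only illustrative):

```python
import numpy as np

def clip_gradient_norm(grads, threshold=1.0):
    """Rescale the whole gradient (a list of arrays) so its global L2 norm
    does not exceed `threshold`; the direction is preserved."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads
```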

Page 23

Conference version (1993) of the 1994 paper by the same authors had a predecessor of GRU and targetprop

• Flip-flop unit to store 1 bit, with gating signal to control when to write

• Pseudo-backprop through it by a form of targetprop


(The problem of learning long-term dependencies in recurrent networks, Bengio, Frasconi & Simard ICNN’1993)

Page 24

Delays & Hierarchies to Reach Farther

[Figure: unfolded RNN with both one-step (W1) and longer-delay (W3) recurrent connections, creating shorter paths through time.]

• Delays and multiple time scales, Elhihi & Bengio NIPS 1995, Koutnik et al ICML 2014

• How to do this right?

• How to automatically and adaptively do it?

Hierarchical RNNs (words / sentences): Sordoni et al CIKM 2015, Serban et al AAAI 2016

Page 25

Fighting the vanishing gradient: LSTM & GRU

• Create a path where gradients can flow for longer, with a self-loop

• Corresponds to an eigenvalue of the Jacobian slightly less than 1

• LSTM is now heavily used (Hochreiter & Schmidhuber 1997)

• GRU: light-weight version (Cho et al 2014), sketched below

LSTM (Hochreiter & Schmidhuber 1997); (Hochreiter 1991): first version of the LSTM, called Neural Long-Term Storage, with self-loop
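For reference, a sketch of a single GRU step following Cho et al (2014), with biases omitted for brevity (treat the exact conventions as an assumption):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update; the gated self-loop lets the state be copied almost
    unchanged, keeping the relevant Jacobian eigenvalues close to 1."""
    z = sigmoid(Wz @ x + Uz @ h)                # update gate
    r = sigmoid(Wr @ x + Ur @ h)                # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))    # candidate state
    return (1.0 - z) * h + z * h_tilde          # convex mix of old and new state
```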

Page 26

Fast Forward 20 years: Attention Mechanisms for Memory Access

• Neural Turing Machines (Graves et al 2014)

• and Memory Networks (Weston et al 2014)

• Use a content-based attention mechanism (Bahdanau et al 2014) to control the read and write access into a memory

• The attention mechanism outputs a softmax over memory locations

[Figure: a controller network reads from and writes to an external memory through the attention mechanism.]

Page 27

Large Memory Networks: Sparse Access Memory for Long-Term Dependencies

• Memory = part of the state

• Memory-based networks are special RNNs

• A mental state stored in an external memory can stay for arbitrarily long durations, until it is overwritten (partially or not)

• Forgetting = vanishing gradient.

• Memory = higher-dimensional state, avoiding or reducing the need for forgetting/vanishing

[Figure: memory contents are passively copied across time steps between read/write accesses.]

Page 28

Attention Mechanism for Deep Learning

• Consider an input (or intermediate) sequence or image

• Consider an upper level representation, which can choose « where to look », by assigning a weight or probability to each input position, as produced by an MLP, applied at each position

[Figure: a higher-level layer attends over lower-level positions; a softmax over lower locations is conditioned on context at both the lower and higher locations.]

• Soft attention (backprop, sketched below) vs. stochastic hard attention (RL)

(Bahdanau, Cho & Bengio, ICLR 2015; Jean et al ACL 2015; Jean et al WMT 2015; Xu et al ICML 2015; Chorowski et al NIPS 2015; Firat, Cho & Bengio 2016)
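A minimal sketch of content-based soft attention over lower-level positions (my own simplified parameterization, not the exact one from the cited papers):

```python
import numpy as np

def soft_attention(query, keys, values):
    """query: (d,); keys: (n, d); values: (n, d_v).
    Scores every location, normalizes with a softmax, and returns the
    expected value: a differentiable 'soft' read."""
    scores = keys @ query                # content-based relevance of each location
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax over locations
    return w @ values, w                 # context vector and attention weights
```

Because the read is a weighted average, gradients flow through the weights (soft attention); sampling a single location instead gives hard attention, trained with RL.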

Page 29

End-to-End Machine Translation with Recurrent Nets and Attention Mechanism

• Reached the state-of-the-art in one year, from scratch


(Bahdanau et al ICLR 2015, Jean et al ACL 2015, Gulcehre et al 2015, Firat et al 2016)

Page 30

Google-Scale NMT Success

• Distributed training, very large model ensemble

• Not only does NMT work in terms of BLEU but it makes a killing in terms of human evaluation on Google Translate data

(Wu et al & Dean, 2016)

[Figure: human evaluation scores on Google Translate data, comparing human translation, phrase-based (n-gram) translation, and the current neural net translation.]

Page 31

Pointing the Unknown Words

The next word generated can either come from the vocabulary or be copied from the input sequence (see the sketch below).

(Gulcehre, Ahn, Nallapati, Zhou & Bengio, ACL 2016; based on "Pointer Networks", Vinyals et al 2015)

[Figure: results with the pointing mechanism on text summarization and machine translation.]
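A hedged sketch of the idea (not the exact architecture of the paper): a scalar switch decides how much probability mass goes to the shortlist vocabulary versus to copying (pointing at) positions of the input sequence.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def pointer_softmax(vocab_logits, copy_scores, switch_logit):
    """Returns two distributions that together sum to 1: one over the shortlist
    vocabulary and one over source positions (copying)."""
    p_copy = 1.0 / (1.0 + np.exp(-switch_logit))       # probability of copying
    p_vocab = (1.0 - p_copy) * softmax(vocab_logits)   # generate from the vocabulary
    p_point = p_copy * softmax(copy_scores)            # point into the input sequence
    return p_vocab, p_point
```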

Page 32

Designing the RNN Architecture

• Recurrent depth: max path length divided by sequence length

• Feedforward depth: max length from input to nearest output

• Skip coefficient: shortest path length divided by sequence length

(Architectural Complexity Measures of Recurrent Neural Networks, Zhang et al 2016, arXiv:1602.08210)

Page 33

It makes a difference

• Impact of change in recurrent depth

• Impact of change in skip coefficient


Page 34

Near-Orthogonality to Help Information Propagation

• Initialization to orthogonal recurrent W (Saxe et al 2013, ICLR 2014)

• Unitary matrices: all e-values of the matrix have modulus 1 (Arjovsky, Shah & Bengio, ICML 2016)

• Zoneout: randomly choose to simply copy the state unchanged (Krueger et al 2016, submitted)

Page 35

Variational Generative RNNs

• (Chung et al, NIPS’2015)

• Regular RNNs have noise injected only in input space

• VRNNs also allow noise (a latent variable) injected in the top hidden layer; more « high-level » variability


Injecting higher-level variations / latent variables in RNNs

Page 36

Variational Hierarchical RNNs for Dialogue Generation (Serban et al 2016)

• Lower level = words of an utterance (turn of speech)

• Upper level = state of the dialogue

• Inject high-level choices


Page 37

VHRNN Results – Twitter Dialogues

Page 38

Other Fully-Observed Neural Directed Graphical Models

Page 39

Neural Auto-Regressive Models

• Decomposes the joint of a fully observed directed model in terms of conditionals

• Logistic auto-regressive: (Frey 1997)

• First neural version: (Bengio & Bengio, NIPS'99)

[Figure: fully-observed directed model and its neural version, factorizing P(x1, x2, x3, x4) = P(x1) P(x2|x1) P(x3|x1, x2) P(x4|x1, x2, x3), with hidden layers h1, h2, h3 computing the conditionals.]

Page 40

NADE: Neural AutoRegressive Density Estimator

• Introduces smart sharing between some weights, so that the different hidden groups use the same weights for the same input but look at more and more of the inputs (see the sketch below).

[Figure: NADE computation graph; hidden groups h1, h2, h3 reuse the shared input weight columns W1, W2, W3 to compute P(x1), P(x2|x1), P(x3|x1,x2), P(x4|x1,x2,x3).]

(Larochelle & Murray AISTATS 2011)
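A hedged sketch of that sharing for binary inputs (my own simplification of the NADE computation): a running pre-activation reuses the same weight columns, so each conditional sees exactly one more input than the previous one, and the full likelihood costs O(H·D) rather than O(H·D²).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nade_log_likelihood(x, W, V, b, c):
    """x: binary vector (D,); W: (H, D) shared input-to-hidden weights;
    V: (D, H) hidden-to-output weights; b: (H,); c: (D,)."""
    a = b.astype(float).copy()          # running hidden pre-activation, shared across conditionals
    ll = 0.0
    for d in range(len(x)):
        h = sigmoid(a)                  # hidden units depend only on x_<d
        p = sigmoid(V[d] @ h + c[d])    # P(x_d = 1 | x_<d)
        ll += np.log(p if x[d] == 1 else 1.0 - p)
        a += W[:, d] * x[d]             # reveal x_d for the next conditional
    return ll
```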

Page 41

Pixel RNNs

• Similar to NADE and RNNs but for 2-D images

• Surprisingly sharp and realistic generation

• Gets texture right but not necessarily global structure

(van den Oord et al ICML 2016, best paper)

[The slide reproduces excerpts of the Pixel Recurrent Neural Networks paper: Figure 2 (pixel x_i is conditioned on all previously generated pixels above and to its left; the Row LSTM has a roughly triangular receptive field, while the Diagonal BiLSTM covers the full available context), the Row LSTM equations, and Figure 8 (completions of occluded 32x32 ImageNet images, showing diverse samples with textures such as water, wood and shrubbery inpainted well).]

Row LSTM state update (Eq. 3 of the paper), where K^{ss} and K^{is} are the state-to-state and input-to-state convolution kernels, ⊛ denotes convolution and ⊙ elementwise multiplication; σ is the logistic sigmoid for the output, forget and input gates o_i, f_i, i_i, and tanh for the content gate g_i:

[o_i, f_i, i_i, g_i] = σ(K^{ss} ⊛ h_{i−1} + K^{is} ⊛ x_i)
c_i = f_i ⊙ c_{i−1} + i_i ⊙ g_i
h_i = o_i ⊙ tanh(c_i)

Page 42

Forward Computation of the Gradient

• BPTT does not seem biologically plausible and is memory-expensive

• RTRL (Real-Time Recurrent Learning, Williams & Zipser 1989, Neural Comp.)

• Practically useful: online learning, no need to store all the past states and revisit history backwards (which is biologically weird)

• Compute the gradients forward in time, rather than backwards

• Think about multiplying many matrices left-to-right vs right-to-left (see the sketch below)

• BUT exact computation is O(nhidden × nweights) instead of O(nweights), to recursively compute dh(t)/dW for all params

• Recently proposed: *approximate* the forward gradient using an efficient stochastic estimator (rank-1 estimator of the dh/dW tensor) (Training recurrent networks online without backtracking, Ollivier et al, arXiv:1507.07680)
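A small illustration of the matrix-ordering point (my own sketch): the backward (BPTT) order only ever needs vector-matrix products but must store the history, while the forward order carries a full matrix updated online (and exact RTRL carries an even larger n_hidden x n_weights object).

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 64, 20
Js = [0.1 * rng.standard_normal((n, n)) for _ in range(T)]   # per-step Jacobians dh_t/dh_{t-1}
v = rng.standard_normal(n)                                   # dL/dh_T

g_back = v                        # backward order: T vector-matrix products, O(T n^2)
for J in reversed(Js):
    g_back = g_back @ J

P = np.eye(n)                     # forward order: matrix-matrix products, O(T n^3) here,
for J in Js:                      # but each step only needs the current Jacobian
    P = J @ P
g_fwd = v @ P

assert np.allclose(g_back, g_fwd)     # same gradient, very different cost/memory trade-offs
```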


Page 43

Rare & Dangerous States

• Example: autonomous vehicles in near-accident situations

• Current supervised learning may not handle these cases well because they are too rare (not enough data)

• It would be even worse with current RL (statistical inefficiency)

• Long-term objective: develop better predictive models of the world able to generalize in completely unseen scenarios

• Example of a similar human ability: figuring out intuitive physics, with no need to die a thousand deaths

Page 44

How to Discover Good Representations

• How to discover abstractions? What is a good representation?

• Need clues to disentangle the underlying factors

• Spatial & temporal scales

• Marginal independence

• Controllable factors


Page 45

Jointly Learn to Detect and Control an Attribute of the Scene

Page 46

In order to quantify the change in f_k when actions are taken according to π_k, we define the selectivity of a feature as:

sel(s, a, k) = \mathbb{E}_{s' \sim P^a_{ss'}} \left[ \frac{|f_k(s') - f_k(s)|}{\sum_{k'} |f_{k'}(s') - f_{k'}(s)|} \right]    (1)

where s, s' are successive raw state representations (e.g. pixels), a is the action, and P^a_{ss'} is the environment's transition distribution from s to s' under action a. The normalization by the change in all features means that the selectivity of f_k is maximal when only that single feature changes as a result of some action.

By having an objective that maximizes selectivity and minimizes the autoencoder objective, we can ensure that the features learned can both reconstruct the data and recover independently controllable factors. Hence, we define the following objective, which can be minimized via stochastic gradient descent:

\underbrace{\mathbb{E}_s\left[\tfrac{1}{2}\,\|s - g(f(s))\|_2^2\right]}_{\text{reconstruction error}} \;-\; \lambda \sum_k \underbrace{\mathbb{E}_s\left[\sum_a \pi_k(a|s)\,\log \mathrm{sel}(s, a, k)\right]}_{\text{disentanglement objective}}    (2)

Here one can think of log sel(s, a, k) as the reward signal R_k(s, a) of a control problem, and the expected reward \mathbb{E}_{a \sim \pi_k}[R_k] is maximized by finding the optimal set of policies π_k.

Note that it is also possible to have directed selectivity: by not taking the absolute value of the numerator of (1) (and replacing log sel with log(1 + sel) in (2)), the policies must learn to increase the learned latent feature rather than simply change it. This may be useful if the policy to gradually increase a feature is distinct from the policy that decreases it.

2.3 A first toy problem

Consider the simple environment described in Figure 1(a): the agent sees a 2×2 square of adjacent cells in the environment, and has 4 actions that move it up, down, left or right. An autoencoder with directed selectivity (see Figure 1(c,d)) learns latent features that map to the (x, y) position of the square in the input space, without ever having explicit access to these values, and while reconstructing the input properly. An autoencoder without selectivity also reconstructs the input properly, but without learning these two latent (x, y) features explicitly.

In this setting f, g and π share some of their parameters. We use the following architecture: f has two 16×3×3 ReLU convolutional layers, followed by a fully connected ReLU layer of 32 units, and a tanh layer of n = 4 features; g is the transpose architecture of f; π_k is a softmax policy over 4 actions, computed from the output of the ReLU fully connected layer.

Figure 1: (a) & (b) A simple environment with 4 actions that push a square left, right, up or down. (a) is an example ground truth, (b) is the reconstruction of the model trained with selectivity. (c) The slope of a linear regression of the true features (the real x and y position of the agent) as a function of each latent feature. White is no correlation; blue and red indicate strong negative or positive slopes respectively. We can see that features 0 and 1 recover y and features 2 and 3 recover x. (d) Representation of the learned policies. Each row is a policy π_k, each column corresponds to an action (left/right/up/down). Each cell (k, i) represents the probability of action i in policy π_k. We can see that features 0 and 1 correspond to going down and up (−y/+y) and features 2 and 3 correspond to going right and left (+x/−x).

2.4 A slightly harder toy problem

In the next experiments, we aim to generalize the model above in a slightly more complex environment. Instead of having f and π_k parametrized by the same parameters, we now introduce a different set of parameters for each policy and for the encoder, so that each policy can be learned separately.

Independently Controllable Factors

(Bengio et al & Bengio, 2017)

• Jointly train for each aspect (factor)

• A policy (which tries to selectively change just that factor)

• A representation (which maps the state to the value of the factor)

• Optimize both the policy and the representation to minimize objective (2) above; a single-sample sketch of the selectivity term follows
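As a pointer back to Eq. (1), a single-transition estimate of the selectivity of feature k (my own minimal sketch):

```python
import numpy as np

def selectivity_sample(f, s, s_next, k, eps=1e-8):
    """Fraction of the total change in the learned features f (from state s to
    its successor s_next) that is due to feature k alone."""
    delta = np.abs(f(s_next) - f(s))
    return delta[k] / (delta.sum() + eps)
```

Averaging the log of this quantity over actions drawn from π_k gives the disentanglement term of objective (2).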


Page 47

Discrete Set of Attributes: Experiments

True underlying attributes are the positions (x1, y1), (x2, y2) of the digits.

Learner identifies them!


Page 48

Ongoing Work

NIPS Submission: Independently Controllable Features

Emmanuel Bengio, Valentin Thomas, Jules Pondard, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Doina Precup, Joelle Pineau, Yoshua Bengio

Page 49

Continuous Set of Attributes: Attribute Embeddings

Page 50

Continuous Attributes Model: Computational Graph

Total objective = autoencoder loss + selectivity loss

Page 51

Continuous Attributes: Results


Page 52

Predict the effect of actions in attribute space

Given initial state and set of actions, predict new attribute values and the corresponding reconstructed images


Page 53

Given two states, recover the causal actions leading from one to the other

Page 54

Representing Agents, Objects, Policies

The controllable factors idea works on a small scale: ongoing work to expand this to more complex environments

• Expand to sequences of actions and use them to define options

• Notion of objects and attributes naturally falls out

• Extension to non-static set of objects: types & instances

• Objects are groups of controllable features: by whom? Agents

• Factors controlled by other agents; mirror neurons

• Because the set of objects may be unbounded, we need to learn to represent policies themselves, and the definition of an object is bound to the policies associated with it (for using it and changing its attributes)

Page 55

Research Program Towards Higher-Level Deep Learning

• Build a series of gradually more complex environments

• Design training frameworks for learning agents to discover the notion of objects, attributes, types, other agents, relations, the combinatorial space of policies, etc…

• Associate all these concepts to linguistic constructions: learn semantically grounded linguistic expressions.

• Words are clues of higher-level abstractions

• Language is an input/output modality


Page 56

Montreal Institute for Learning Algorithms