Deep Belief Networks
The New Generation of Neural Networks¹
José Miguel Hernández Lobato and Daniel Hernández Lobato
Universidad Autónoma de Madrid, Computer Science Department
May 5, 2008
¹ This presentation is mainly based on the work by Geoffrey E. Hinton.
Outline
1 Boltzmann Machines
2 Restricted Boltzmann Machines
3 Deep Belief Networks
4 Applications of Deep Belief Networks
5 Deep Belief Networks and the Human Brain
Boltzmann Machines
Networks of stochastic binary units with associated energy
E(\mathbf{x}) = -\frac{1}{2}\,\mathbf{x}^{t} W \mathbf{x}   (1)

and associated probability distribution

P(\mathbf{x}|W) = \frac{1}{Z(W)} \exp\left[\frac{1}{2}\,\mathbf{x}^{t} W \mathbf{x}\right].   (2)
The activity rule of the net implements Gibbs sampling from P(\mathbf{x}|W):

P(x_i = 1|W) = \frac{1}{1 + e^{-a_i}},   (3)

where a_i = \sum_j w_{ij} x_j.
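To make the activity rule concrete, here is a minimal NumPy sketch of the energy (1) and of one Gibbs sweep (3). The function names, and the assumption of a symmetric weight matrix W with zero diagonal, are ours rather than the slides':

import numpy as np

def energy(x, W):
    # Energy of a binary configuration x, eq. (1): E(x) = -1/2 x^t W x.
    return -0.5 * x @ W @ x

def gibbs_step(x, W, rng):
    # One sweep of the activity rule, eq. (3): unit i turns on with
    # probability 1 / (1 + exp(-a_i)), where a_i = sum_j w_ij x_j.
    for i in range(len(x)):
        a_i = W[i] @ x
        x[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-a_i)) else 0.0
    return x

With rng = np.random.default_rng(), repeated sweeps after a burn-in period yield approximate samples from P(x|W).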
Learning in Boltzmann Machines
Given a set of examples \{\mathbf{x}^{(n)}\}_{n=1}^{N} we want to adjust W so that P(\mathbf{x}|W) is a good generative model. For this, we maximize
\log\left[\prod_{n=1}^{N} P(\mathbf{x}^{(n)}|W)\right] = \sum_{n=1}^{N} \left[\frac{1}{2}\left[\mathbf{x}^{(n)}\right]^{t} W \mathbf{x}^{(n)} - \log Z(W)\right].   (4)
The gradient ascent learning rule is

\Delta w_{ij} = \eta N \left(\mathbb{E}_{\text{Data}}[x_i x_j] - \mathbb{E}_{P(\mathbf{x}|W)}[x_i x_j]\right),   (5)

where η is the learning rate. The rule has a wake and a sleep step.
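A rough sketch of one such update, reusing gibbs_step from the previous slide; estimating the sleep statistics from a single near-equilibrium sample is a crude simplification made here for brevity:

def boltzmann_update(data, W, eta, rng, n_burn=500):
    # Wake step: correlations <x_i x_j> with the units clamped to the data.
    wake = sum(np.outer(x, x) for x in data) / len(data)
    # Sleep step: correlations under the model, estimated by Gibbs sampling.
    x = rng.integers(0, 2, size=W.shape[0]).astype(float)
    for _ in range(n_burn):
        x = gibbs_step(x, W, rng)
    sleep = np.outer(x, x)
    # Eq. (5): Delta w_ij = eta * N * (E_Data - E_Model).
    return W + eta * len(data) * (wake - sleep)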
Learning in Boltzmann Machines with hidden units
x denotes the visible units.
h denotes the hidden units.
y_i denotes an arbitrary neuron, with \mathbf{y} = (\mathbf{x}, \mathbf{h}).
The likelihood of W given a single data example \mathbf{x}^{(n)} is:

\sum_{\mathbf{h}} P(\mathbf{x}^{(n)}, \mathbf{h}|W) = \sum_{\mathbf{h}} \frac{1}{Z(W)} \exp\left[\frac{1}{2}\left[\mathbf{y}^{(n)}\right]^{t} W \mathbf{y}^{(n)}\right].   (6)
The learning rule given a sample \{\mathbf{x}^{(n)}\}_{n=1}^{N} is

\Delta w_{ij} = \eta \sum_{n=1}^{N} \left(\mathbb{E}_{P(\mathbf{h}|\mathbf{x}^{(n)},W)}[y_i y_j] - \mathbb{E}_{P(\mathbf{x},\mathbf{h}|W)}[y_i y_j]\right)   (7)

and again has a wake and a sleep step.
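The two phases can be sketched as before; now the wake phase clamps the visible units and Gibbs-samples only the hidden ones. The layout y = (x, h) with visible units first, and the burn-in length, are our own illustrative choices:

def bm_hidden_update(data, W, n_visible, eta, rng, n_burn=500):
    n_units = W.shape[0]  # y = (x, h), visible units first
    wake = np.zeros_like(W)
    for x in data:
        # Wake phase: clamp x, sample h, approximating E_{P(h|x,W)}[y_i y_j].
        h = rng.integers(0, 2, size=n_units - n_visible).astype(float)
        y = np.concatenate([x, h])
        for _ in range(n_burn):
            for i in range(n_visible, n_units):  # hidden units only
                p = 1.0 / (1.0 + np.exp(-(W[i] @ y)))
                y[i] = 1.0 if rng.random() < p else 0.0
        wake += np.outer(y, y)
    wake /= len(data)
    # Sleep phase: all units run free, approximating E_{P(x,h|W)}[y_i y_j].
    y = rng.integers(0, 2, size=n_units).astype(float)
    for _ in range(n_burn):
        y = gibbs_step(y, W, rng)
    sleep = np.outer(y, y)
    return W + eta * len(data) * (wake - sleep)  # eq. (7)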
Why are Boltzmann Machines not in widespread use?
Training depends on computing the gradient by Monte Carlo methods (Gibbs sampling).
A Boltzmann machine with many units requires a huge number of samples to approximate the equilibrium distribution.
The origin of the problem is that the conditional distributions of the hidden units and the visible units do not factorize, due to the visible-to-visible and hidden-to-hidden connections.
Restricted Boltzmann Machines
They are Boltzmann Machines where learning is feasible.
No visible-to-visible or hidden-to-hidden connections.
The distributions P(\mathbf{h}|\mathbf{x},W) and P(\mathbf{x}|\mathbf{h},W) now factorize:

P(\mathbf{h}|\mathbf{x},W) = \prod_i P(h_i|\mathbf{x},W)   (8)

P(\mathbf{x}|\mathbf{h},W) = \prod_i P(x_i|\mathbf{h},W).   (9)
The learning rule given a sample \{\mathbf{x}^{(n)}\}_{n=1}^{N} is still the same:

\Delta w_{ij} = \eta \sum_{n=1}^{N} \left(\mathbb{E}_{P(\mathbf{h}|\mathbf{x}^{(n)},W)}[x_i h_j] - \mathbb{E}_{P(\mathbf{x},\mathbf{h}|W)}[x_i h_j]\right).   (10)
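The factorized conditionals reduce to independent sigmoid units, as this short sketch shows. Here, unlike the joint matrix above, W denotes the n_visible × n_hidden block of visible-to-hidden weights (our convention; biases, absent from the slides' energy, are omitted):

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W):
    # Eq. (8): without hidden-to-hidden connections, each hidden unit is
    # conditionally independent given x.
    return sigmoid(x @ W)

def p_x_given_h(h, W):
    # Eq. (9): likewise for the visible units given h.
    return sigmoid(W @ h)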
Learning in RBMs
In an RBM we can compute \mathbb{E}_{P(\mathbf{h}|\mathbf{x}^{(n)},W)}[x_i h_j] exactly.
Contrastive divergence (an approximation to Gibbs sampling) is used to estimate \mathbb{E}_{P(\mathbf{x},\mathbf{h}|W)}[x_i h_j] in the sleep step:
For each data point \mathbf{x}^{(n)} we
1 Sample \mathbf{h}^{(1)} from P(\mathbf{h}|\mathbf{x}^{(n)}).
2 Sample \mathbf{x}'^{(n)} from P(\mathbf{x}|\mathbf{h}^{(1)}).
3 Sample \mathbf{h}^{(2)} from P(\mathbf{h}|\mathbf{x}'^{(n)}).
Then \frac{1}{N}\sum_{n=1}^{N} x'^{(n)}_i h^{(2)}_j approximates \mathbb{E}_{P(\mathbf{x},\mathbf{h}|W)}[x_i h_j].
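Putting the three steps together, one CD-1 weight update over a batch might be sketched as follows, reusing the conditionals from the previous slide (the batch handling is our own assumption):

def cd1_update(data, W, eta, rng):
    pos = np.zeros_like(W)  # wake statistics, computed exactly
    neg = np.zeros_like(W)  # sleep statistics, from one reconstruction
    for x in data:
        ph1 = p_h_given_x(x, W)
        pos += np.outer(x, ph1)  # E_{P(h|x,W)}[x_i h_j], exact given x
        h1 = (rng.random(ph1.shape) < ph1).astype(float)  # step 1
        px = p_x_given_h(h1, W)
        x1 = (rng.random(px.shape) < px).astype(float)    # step 2
        ph2 = p_h_given_x(x1, W)
        h2 = (rng.random(ph2.shape) < ph2).astype(float)  # step 3
        neg += np.outer(x1, h2)  # contributes x'_i h^(2)_j
    return W + eta * (pos - neg)  # eq. (10)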
Deep Belief Network
A deep generative model can be obtained by stacking RBMs. Each layer of RBMs models more and more abstract features.
Each time an RBM is added to the stack, a lower bound on the likelihood increases.
[Diagram: RBMs stacked incrementally. First an RBM on (x, h1); then an RBM on (h1, h2) is added on top; finally an RBM on (h2, h3), giving the stack x–h1–h2–h3.]
Greedy algorithm for stacking RBMs
The deep architecture is initially empty.
1 Learn an RBM and put it on the top.
2 Filter the data through the current deep architecture.
3 Learn an RBM using the filtered data and put it on the top.
4 Filter the data through the current deep architecture.
5 Repeat 3 and 4 until n RBMs have been stacked.
To filter the data through the deep architecture we just propagate expectations up through the RBMs, conditioning on the original data.
We sample from the deep architecture by sampling from the top RBM and then sampling down through the remaining RBMs.
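A compact sketch of the greedy procedure, reusing cd1_update and p_h_given_x from the earlier slides; the epoch loop and the small random initialization are illustrative assumptions:

def train_dbn(data, layer_sizes, eta, rng, n_epochs=10):
    # Greedily stack RBMs: train one with CD-1, filter the data through it
    # by propagating expectations up, then train the next RBM on top.
    weights = []
    filtered = data  # 2-D array, one example per row
    for n_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((filtered.shape[1], n_hidden))
        for _ in range(n_epochs):
            W = cd1_update(filtered, W, eta, rng)
        weights.append(W)
        filtered = p_h_given_x(filtered, W)  # expectations, not samples
    return weights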
Fine-tuning DBNs
After the greedy algorithm, the recognition and generative weights can be fine-tuned by means of a Wake-Sleep method or a back-propagation technique.
Wake-Sleep
The original data is propagated up, the top RBM iterates a few times, and then a sample is propagated down. The weights are updated so that the sample of the DBN matches the original data.
Back-Propagation
The network is unfolded to produce encoder and decoder networks. Stochastic activities are replaced by deterministic probabilities and the weights are updated by back-propagation for optimal reconstruction.
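As an illustration of the unfolding idea, a deterministic encoder-decoder pass with tied (transposed) weights could look like this; back-propagating the reconstruction error through this function is the fine-tuning step, which we leave out:

def reconstruct(x, weights):
    # Encoder: stochastic activities replaced by deterministic probabilities.
    act = x
    for W in weights:
        act = sigmoid(act @ W)
    # Decoder: the network unfolded with tied, transposed weights.
    for W in reversed(weights):
        act = sigmoid(act @ W.T)
    return act  # fine-tuning minimizes the error between act and x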
Applications of DBNs
High-level feature extraction
Non-linear dimensionality reduction: MNIST and Olivetti faces.
Digit recognition
MNIST example from Hinton's web page.
Feature Extraction: Olivetti faces
Description
400 faces in 24 × 24 bitmap images.
Grayscale images.
Pixel intensities (0-255) are normalized to lie in the [0, 1] interval.
5 transformations increase the set size to 1600 images.
Feature Extraction: Olivetti faces
[Figure only; the slide consists of a single image.]
Feature Extraction: Olivetti faces.
Comparison of original images, DBN reconstructions, and PCA reconstructions.
Feature Extraction: MNIST.
Description
60,000 handwritten digit images.
28 × 28 grayscale pixels.
Normalized to lie in [0, 1].
DBN architecture: 1000, 500, 250, 30, 2.
Feature Extraction: MNIST. PCA results.
[Scatter plot: MNIST digits projected onto the first two principal components, PC1 vs. PC2.]
Feature Extraction: MNIST. DBN results.
[Scatter plot: MNIST digits in the two-dimensional code space of the DBN, h1 vs. h2.]
MNIST: digit recognition.
See web application.
DBNs and the Human Brain
The human brain could be a huge DBN with a temporal dimension.
The memory-prediction framework is a theory of brain functioning which has many elements in common with DBNs:
A hierarchy of recognition with higher levels representing more and more abstract and invariant features.
The predictions propagated down in the memory-prediction framework are similar to the sleep phase in DBNs.
DBNs have a top-level associative memory. The memory-prediction framework places the hippocampus at the top of its hierarchy. The hippocampus is essential for the formation of long-term memory.
References
MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
Hinton, G. E., Osindero, S. and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation 18, pp. 1527-1554, 2006.
Hinton, G. E. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
Hawkins, J. and Blakeslee, S. On Intelligence. Times Books, 2004.