Deep Belief Networks
The New Generation of Neural Networks¹
José Miguel Hernández Lobato and Daniel Hernández Lobato
Universidad Autónoma de Madrid, Computer Science Department
May 5, 2008
¹ This presentation is mainly based on the work by Geoffrey E. Hinton.
Outline
1 Boltzmann Machines
2 Restricted Boltzmann Machines
3 Deep Belief Networks
4 Applications of Deep Belief Networks
5 Deep Belief Networks and the Human Brain
Boltzmann Machines
Networks of stochastic binary units with associated energy
E(\mathbf{x}) = -\frac{1}{2}\,\mathbf{x}^{t} W \mathbf{x}   (1)

and associated probability distribution

P(\mathbf{x}|W) = \frac{1}{Z(W)} \exp\left[\frac{1}{2}\,\mathbf{x}^{t} W \mathbf{x}\right].   (2)
The activity rule of the net implements Gibbs sampling from P(\mathbf{x}|W):

P(x_i = 1|W) = \frac{1}{1 + e^{-a_i}},   (3)

where a_i = \sum_j w_{ij} x_j.
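To make the activity rule concrete, here is a minimal NumPy sketch of the energy (1) and of one Gibbs sweep (3). The function names, and the assumption of a symmetric weight matrix W with zero diagonal, are ours rather than the slides':

import numpy as np

def energy(x, W):
    # Energy of a binary configuration x, eq. (1): E(x) = -1/2 x^t W x.
    return -0.5 * x @ W @ x

def gibbs_step(x, W, rng):
    # One sweep of the activity rule, eq. (3): unit i turns on with
    # probability 1 / (1 + exp(-a_i)), where a_i = sum_j w_ij x_j.
    for i in range(len(x)):
        a_i = W[i] @ x
        x[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-a_i)) else 0.0
    return x

With rng = np.random.default_rng(), repeated sweeps after a burn-in period yield approximate samples from P(x|W).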
Learning in Boltzmann Machines
Given a set of examples \{\mathbf{x}^{(n)}\}_{n=1}^{N} we want to adjust W so that P(\mathbf{x}|W) is a good generative model. For this, we maximize
\log\left[\prod_{n=1}^{N} P(\mathbf{x}^{(n)}|W)\right] = \sum_{n=1}^{N} \left[\frac{1}{2}\left[\mathbf{x}^{(n)}\right]^{t} W \mathbf{x}^{(n)} - \log Z(W)\right].   (4)
The gradient ascent learning rule is

\Delta w_{ij} = \eta N \left(\mathbb{E}_{\text{Data}}[x_i x_j] - \mathbb{E}_{P(\mathbf{x}|W)}[x_i x_j]\right),   (5)

where η is the learning rate. The rule has a wake and a sleep step.
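A rough sketch of one such update, reusing gibbs_step from the previous slide; estimating the sleep statistics from a single near-equilibrium sample is a crude simplification made here for brevity:

def boltzmann_update(data, W, eta, rng, n_burn=500):
    # Wake step: correlations <x_i x_j> with the units clamped to the data.
    wake = sum(np.outer(x, x) for x in data) / len(data)
    # Sleep step: correlations under the model, estimated by Gibbs sampling.
    x = rng.integers(0, 2, size=W.shape[0]).astype(float)
    for _ in range(n_burn):
        x = gibbs_step(x, W, rng)
    sleep = np.outer(x, x)
    # Eq. (5): Delta w_ij = eta * N * (E_Data - E_Model).
    return W + eta * len(data) * (wake - sleep)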
Learning in Boltzmann Machines with hidden units
x denotes the visible units.
h denotes the hidden units.
y_i denotes an arbitrary neuron, with \mathbf{y} = (\mathbf{x}, \mathbf{h}).
The likelihood of W given a single data example \mathbf{x}^{(n)} is:

\sum_{\mathbf{h}} P(\mathbf{x}^{(n)}, \mathbf{h}|W) = \sum_{\mathbf{h}} \frac{1}{Z(W)} \exp\left[\frac{1}{2}\left[\mathbf{y}^{(n)}\right]^{t} W \mathbf{y}^{(n)}\right].   (6)
The learning rule given a sample \{\mathbf{x}^{(n)}\}_{n=1}^{N} is

\Delta w_{ij} = \eta \sum_{n=1}^{N} \left(\mathbb{E}_{P(\mathbf{h}|\mathbf{x}^{(n)},W)}[y_i y_j] - \mathbb{E}_{P(\mathbf{x},\mathbf{h}|W)}[y_i y_j]\right)   (7)

and again has a wake and a sleep step.
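The two phases can be sketched as before; now the wake phase clamps the visible units and Gibbs-samples only the hidden ones. The layout y = (x, h) with visible units first, and the burn-in length, are our own illustrative choices:

def bm_hidden_update(data, W, n_visible, eta, rng, n_burn=500):
    n_units = W.shape[0]  # y = (x, h), visible units first
    wake = np.zeros_like(W)
    for x in data:
        # Wake phase: clamp x, sample h, approximating E_{P(h|x,W)}[y_i y_j].
        h = rng.integers(0, 2, size=n_units - n_visible).astype(float)
        y = np.concatenate([x, h])
        for _ in range(n_burn):
            for i in range(n_visible, n_units):  # hidden units only
                p = 1.0 / (1.0 + np.exp(-(W[i] @ y)))
                y[i] = 1.0 if rng.random() < p else 0.0
        wake += np.outer(y, y)
    wake /= len(data)
    # Sleep phase: all units run free, approximating E_{P(x,h|W)}[y_i y_j].
    y = rng.integers(0, 2, size=n_units).astype(float)
    for _ in range(n_burn):
        y = gibbs_step(y, W, rng)
    sleep = np.outer(y, y)
    return W + eta * len(data) * (wake - sleep)  # eq. (7)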
Why are Boltzmann Machines not in widespread use?
Training depends on computing the gradient by Monte Carlo methods (Gibbs sampling).
A Boltzmann machine with many units requires a huge number of samples to approximate the equilibrium distribution.
The origin of the problem is that the conditional distributions of the hidden units and the visible units do not factorize, due to the visible-to-visible and hidden-to-hidden connections.
Restricted Boltzmann Machines
They are Boltzmann Machines where learning is feasible.
No visible-to-visible or hidden-to-hidden connections.
The distributions P(\mathbf{h}|\mathbf{x},W) and P(\mathbf{x}|\mathbf{h},W) now factorize:

P(\mathbf{h}|\mathbf{x},W) = \prod_i P(h_i|\mathbf{x},W)   (8)

P(\mathbf{x}|\mathbf{h},W) = \prod_i P(x_i|\mathbf{h},W).   (9)
The learning rule given a sample \{\mathbf{x}^{(n)}\}_{n=1}^{N} is still the same:

\Delta w_{ij} = \eta \sum_{n=1}^{N} \left(\mathbb{E}_{P(\mathbf{h}|\mathbf{x}^{(n)},W)}[x_i h_j] - \mathbb{E}_{P(\mathbf{x},\mathbf{h}|W)}[x_i h_j]\right).   (10)
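The factorized conditionals reduce to independent sigmoid units, as this short sketch shows. Here, unlike the joint matrix above, W denotes the n_visible × n_hidden block of visible-to-hidden weights (our convention; biases, absent from the slides' energy, are omitted):

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W):
    # Eq. (8): without hidden-to-hidden connections, each hidden unit is
    # conditionally independent given x.
    return sigmoid(x @ W)

def p_x_given_h(h, W):
    # Eq. (9): likewise for the visible units given h.
    return sigmoid(W @ h)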
Learning in RBMs
In an RBM we can compute \mathbb{E}_{P(\mathbf{h}|\mathbf{x}^{(n)},W)}[x_i h_j] exactly.
Contrastive divergence (an approximation to Gibbs sampling) is used to estimate \mathbb{E}_{P(\mathbf{x},\mathbf{h}|W)}[x_i h_j] in the sleep step:
For each data point \mathbf{x}^{(n)} we
1 Sample \mathbf{h}^{(1)} from P(\mathbf{h}|\mathbf{x}^{(n)}).
2 Sample \mathbf{x}'^{(n)} from P(\mathbf{x}|\mathbf{h}^{(1)}).
3 Sample \mathbf{h}^{(2)} from P(\mathbf{h}|\mathbf{x}'^{(n)}).
Then \frac{1}{N}\sum_{n=1}^{N} x'^{(n)}_i h^{(2)}_j approximates \mathbb{E}_{P(\mathbf{x},\mathbf{h}|W)}[x_i h_j].
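Putting the three steps together, one CD-1 weight update over a batch might be sketched as follows, reusing the conditionals from the previous slide (the batch handling is our own assumption):

def cd1_update(data, W, eta, rng):
    pos = np.zeros_like(W)  # wake statistics, computed exactly
    neg = np.zeros_like(W)  # sleep statistics, from one reconstruction
    for x in data:
        ph1 = p_h_given_x(x, W)
        pos += np.outer(x, ph1)  # E_{P(h|x,W)}[x_i h_j], exact given x
        h1 = (rng.random(ph1.shape) < ph1).astype(float)  # step 1
        px = p_x_given_h(h1, W)
        x1 = (rng.random(px.shape) < px).astype(float)    # step 2
        ph2 = p_h_given_x(x1, W)
        h2 = (rng.random(ph2.shape) < ph2).astype(float)  # step 3
        neg += np.outer(x1, h2)  # contributes x'_i h^(2)_j
    return W + eta * (pos - neg)  # eq. (10)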
Deep Belief Network
A deep generative model can be obtained by stacking RBMs. Each layer of RBMs models more and more abstract features.
Each time an RBM is added to the stack, a lower bound on the likelihood increases.
[Diagram: RBMs stacked incrementally. First an RBM on (x, h1); then an RBM on (h1, h2) is added on top; finally an RBM on (h2, h3), giving the stack x–h1–h2–h3.]
Greedy algorithm for stacking RBMs
The deep architecture is initially empty.
1 Learn an RBM and put it on the top.
2 Filter the data through the current deep architecture.
3 Learn an RBM using the filtered data and put it on the top.
4 Filter the data through the current deep architecture.
5 Repeat 3 and 4 until n RBMs have been stacked.
To filter the data through the deep architecture we just propagate expectations up through the RBMs, conditioning on the original data.
We sample from the deep architecture by sampling from the top RBM and then sampling down through the remaining RBMs.
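A compact sketch of the greedy procedure, reusing cd1_update and p_h_given_x from the earlier slides; the epoch loop and the small random initialization are illustrative assumptions:

def train_dbn(data, layer_sizes, eta, rng, n_epochs=10):
    # Greedily stack RBMs: train one with CD-1, filter the data through it
    # by propagating expectations up, then train the next RBM on top.
    weights = []
    filtered = data  # 2-D array, one example per row
    for n_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((filtered.shape[1], n_hidden))
        for _ in range(n_epochs):
            W = cd1_update(filtered, W, eta, rng)
        weights.append(W)
        filtered = p_h_given_x(filtered, W)  # expectations, not samples
    return weights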
Fine-tuning DBNs
After the greedy algorithm, the recognition and generative weights can be fine-tuned by means of a Wake-Sleep method or a back-propagation technique.
Wake-Sleep
The original data is propagated up, the top RBM iterates a few times, and then a sample is propagated down. The weights are updated so that the sample of the DBN matches the original data.
Back-Propagation
The network is unfolded to produce encoder and decoder networks. Stochastic activities are replaced by deterministic probabilities and the weights are updated by back-propagation for optimal reconstruction.
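As an illustration of the unfolding idea, a deterministic encoder-decoder pass with tied (transposed) weights could look like this; back-propagating the reconstruction error through this function is the fine-tuning step, which we leave out:

def reconstruct(x, weights):
    # Encoder: stochastic activities replaced by deterministic probabilities.
    act = x
    for W in weights:
        act = sigmoid(act @ W)
    # Decoder: the network unfolded with tied, transposed weights.
    for W in reversed(weights):
        act = sigmoid(act @ W.T)
    return act  # fine-tuning minimizes the error between act and x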
Applications of DBNs
High-level feature extraction
Non-linear dimensionality reduction: MNIST and Olivetti faces.
Digit recognition
MNIST example from Hinton's web page.
Feature Extraction: Olivetti faces
Description
400 faces in 24 × 24 bitmap images.
Grayscale images.
Pixel intensities (0-255) are normalized to lie in the [0, 1] interval.
5 transformations increase the set size to 1600 images.
Feature Extraction: Olivetti faces
[Figure only; the slide consists of a single image.]
Feature Extraction: Olivetti faces.
Comparison of original images, DBN reconstructions, and PCA reconstructions.
Feature Extraction: MNIST.
Description
60,000 handwritten digit images.
28 × 28 grayscale pixels.
Normalized to lie in [0, 1].
DBN architecture: 1000, 500, 250, 30, 2.
Feature Extraction: MNIST. PCA results.
[Scatter plot: MNIST digits projected onto the first two principal components, PC1 vs. PC2.]
Feature Extraction: MNIST. DBN results.
[Scatter plot: MNIST digits in the two-dimensional code space of the DBN, h1 vs. h2.]
MNIST: digit recognition.
See web application.
DBNs and the Human Brain
The human brain could be a huge DBN with a temporal dimension.
The memory-prediction framework is a theory of brain functioning which has many elements in common with DBNs:
A hierarchy of recognition with higher levels representing more and more abstract and invariant features.
The predictions propagated down in the memory-prediction framework are similar to the sleep phase in DBNs.
DBNs have a top-level associative memory. The memory-prediction framework places the hippocampus at the top of its hierarchy. The hippocampus is essential for the formation of long-term memory.
References
MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
Hinton, G. E., Osindero, S. and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation 18, pp. 1527-1554, 2006.
Hinton, G. E. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
Hawkins, J. and Blakeslee, S. On Intelligence. Times Books, 2004.