![Page 1: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/1.jpg)
Deep learning with
multiplicative interactions
Geoffrey Hinton
Canadian Institute for Advanced Research
&
Department of Computer Science
University of Toronto
![Page 2: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/2.jpg)
Overview
• Background: How to learn a multilayer generative model of unlabeled data using a Restricted Boltzmann Machine
– How to fine-tune for better discrimination
– A speech recognition example (Dahl & Mohamed)
• The new idea: RBM’s with factored, 3-way interactions
– Why generative models need 3-way interactions
– Factorizing 3-way interactions to save on parameters
– Inference and learning in the factored 3-way model
• Memisevic: Learning how images transform over time
• Taylor: Transforming a model of human motion
• Ranzato: Creating a pixel covariance matrix on the fly
– Applied to object recognition in tiny color images.
![Page 3: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/3.jpg)
Restricted Boltzmann Machines
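• We restrict the connectivity to make learning easier.
– Only one layer of stochastic binary hidden units.
– No connections between hidden units.
• In an RBM, the hidden units are conditionally independent given the visible states.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
[Figure: a layer of hidden units (index j) above a layer of visible units (index i).]
$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(\sum_i v_i w_{ij}\Big), \qquad p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(\sum_j h_j w_{ij}\Big)$$
where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function (bias terms left out to simplify the math).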
![Page 4: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/4.jpg)
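The Energy of a joint configuration (ignoring terms to do with biases)
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_i h_j w_{ij}$$
where $E(\mathbf{v}, \mathbf{h})$ is the energy with binary vectors v on the visible units and h on the hidden units, $v_i$ is the binary state of visible unit i, $h_j$ is the binary state of hidden unit j, and $w_{ij}$ is the weight between units i and j.
$$-\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} = v_i h_j$$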
![Page 5: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/5.jpg)
Using energies to define probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
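$$p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{\sum_{\mathbf{u}, \mathbf{g}} e^{-E(\mathbf{u}, \mathbf{g})}}, \qquad p(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}}{\sum_{\mathbf{u}, \mathbf{g}} e^{-E(\mathbf{u}, \mathbf{g})}}$$
The denominator $\sum_{\mathbf{u}, \mathbf{g}} e^{-E(\mathbf{u}, \mathbf{g})}$ is the partition function.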
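For a very small RBM the two definitions above can be checked by brute force. The following is a minimal sketch, assuming the bias-free energy E(v, h) = -sum_ij v_i h_j w_ij; the weight values and the helper names `energy` and `joint_and_marginal` are made up for illustration.

```python
import itertools
import math

def energy(v, h, w):
    """E(v, h) = -sum_ij v_i h_j w_ij (biases ignored)."""
    return -sum(v[i] * h[j] * w[i][j]
                for i in range(len(v)) for j in range(len(h)))

def joint_and_marginal(w):
    """Enumerate every joint configuration of a tiny RBM."""
    n_vis, n_hid = len(w), len(w[0])
    vs = list(itertools.product([0, 1], repeat=n_vis))
    hs = list(itertools.product([0, 1], repeat=n_hid))
    # Partition function: sum over all joint configurations.
    z = sum(math.exp(-energy(v, h, w)) for v in vs for h in hs)
    p_joint = {(v, h): math.exp(-energy(v, h, w)) / z
               for v in vs for h in hs}
    # p(v): sum the joint probabilities of all configurations containing v.
    p_vis = {v: sum(p_joint[(v, h)] for h in hs) for v in vs}
    return p_joint, p_vis

# Hypothetical weights for a 3-visible, 2-hidden RBM.
W = [[1.0, -0.5], [0.5, 2.0], [-1.0, 0.3]]
p_joint, p_vis = joint_and_marginal(W)
```

The enumeration costs 2^(visible + hidden) terms, which is why the partition function is intractable for realistic sizes and the sampling-based learning rules are used instead.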
![Page 6: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/6.jpg)
A picture of the maximum likelihood learning algorithm for an RBM
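[Figure: an alternating Gibbs chain over hidden units j and visible units i, shown at t = 0, t = 1, t = 2 and t = infinity. The statistics $\langle v_i h_j \rangle^0$ are measured at t = 0 and $\langle v_i h_j \rangle^\infty$ at t = infinity, where the visible state is a fantasy.]
$$\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$
Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.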
![Page 7: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/7.jpg)
A quick way to learn an RBM
[Figure: the same chain stopped after one full step. $\langle v_i h_j \rangle^0$ is measured on the data at t = 0 and $\langle v_i h_j \rangle^1$ on the reconstruction at t = 1.]
$$\Delta w_{ij} = \varepsilon \big( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \big)$$
Start with a training vector on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to get a “reconstruction”.
Update the hidden units again.
This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function called contrastive divergence (Hinton, 2002).
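The four steps of the quick learning procedure can be sketched as follows. This is an illustrative, bias-free version for a single training vector; the learning rate, the toy data, and the choice to use probabilities rather than sampled states in the final hidden update are assumptions, not details taken from the slides.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, w, rng):
    """Update all hidden units in parallel: p(h_j=1|v) = sigmoid(sum_i v_i w_ij)."""
    probs = [sigmoid(sum(v[i] * w[i][j] for i in range(len(v))))
             for j in range(len(w[0]))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def sample_visible(h, w, rng):
    """Update all visible units in parallel: p(v_i=1|h) = sigmoid(sum_j h_j w_ij)."""
    probs = [sigmoid(sum(h[j] * w[i][j] for j in range(len(h))))
             for i in range(len(w))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def cd1_update(v0, w, lr, rng):
    """One contrastive divergence step; returns the updated weights."""
    h0, _ = sample_hidden(v0, w, rng)        # t = 0: hidden states driven by data
    v1, _ = sample_visible(h0, w, rng)       # t = 1: the "reconstruction"
    _, h1_probs = sample_hidden(v1, w, rng)  # update the hidden units again
    # delta w_ij = lr * (<v_i h_j>^0 - <v_i h_j>^1)
    return [[w[i][j] + lr * (v0[i] * h0[j] - v1[i] * h1_probs[j])
             for j in range(len(w[0]))] for i in range(len(w))]

rng = random.Random(0)
w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # hypothetical 3x2 weight matrix
data = [[1, 1, 0], [0, 0, 1]]             # toy training vectors
for _ in range(100):
    for v in data:
        w = cd1_update(v, w, lr=0.1, rng=rng)
```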
![Page 8: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/8.jpg)
Training a deep network (the main reason RBM’s are interesting)
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
– This creates a multi-layer generative model.
• It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data.
– The proof is complicated.
![Page 9: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/9.jpg)
Fine-tuning for discrimination
• First learn one layer of features at a time without using label information.
• Then add a final layer of label units.
• Then use backpropagation from the label units to fine-tune the features that were learned in the unsupervised “pre-training” phase.
• This overcomes many of the limitations of standard backpropagation.
– The label information is used to adjust decision boundaries, not to discover features.
– It finds much deeper minima that generalize much better (Bengio lab).
![Page 10: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/10.jpg)
Why unsupervised pre-training makes sense
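[Figure: two ways image-label pairs could be generated from the stuff in the world. In the first, the label is computed from the image. In the second, the stuff produces the image through a high bandwidth pathway and the label through a low bandwidth pathway.]
If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?
If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high bandwidth pathway.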
![Page 11: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/11.jpg)
A neat application of deep learning
• A very deep belief net beats the record at phone recognition on the very well-studied TIMIT database.
• The task:
– Predict the probabilities of 183 context-dependent phone labels for the central frame of a short window of speech
• The training procedure:
– Train lots of big layers, one at a time, without using the labels.
– Add a 183-way softmax of context-specific phone labels
– Fine-tune with backprop on a big GPU board for several days
• The performance:
– After the standard post-processing using a bi-phone model this gets 23.0% phone error rate.
– Our speech experts believe that this beats all previous recognition methods that use the standard decoder.
– For TIMIT, the classification task is a bit easier than the recognition task. Deep networks are the best at classification too (Honglak Lee)
![Page 12: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/12.jpg)
One very deep belief net for phone recognition
11 frames of 39 MFCC’s
2000 binary hidden units
2000 binary hidden units
2000 binary hidden units
2000 binary hidden units
128 units
183 labels
Mohamed, Dahl & Hinton
poster in the NIPS speech workshop on Saturday
not pre-trained
The Mel Cepstrum Coefficients are a standard representation for speech
![Page 13: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/13.jpg)
A simple real-valued visible unit
• We model MFCC coefficients as Gaussian variables that are independent given the hidden states. Alternating Gibbs sampling is still easy, but learning needs to be much slower.
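$$E(\mathbf{v}, \mathbf{h}) = \sum_{i \in \text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hid}} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}$$
The first term is the parabolic containment; the last term contributes the energy-gradient due to top-down input to unit i.
[Figure: plot of the energy E against $v_i$, with $b_i$ marked on the $v_i$ axis.]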
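A worked step showing why alternating Gibbs sampling is still easy, assuming the energy has the standard Gaussian-binary form $E = \sum_i (v_i - b_i)^2 / (2\sigma_i^2) - \sum_j b_j h_j - \sum_{i,j} (v_i / \sigma_i) h_j w_{ij}$. Collecting the terms that involve a single $v_i$ and completing the square gives a Gaussian conditional. The symbols $a_i$ and $E_i$ are introduced here only for the derivation.

```latex
\begin{aligned}
a_i &= \sum_j h_j w_{ij} \\
E_i(v_i) &= \frac{(v_i - b_i)^2}{2\sigma_i^2} - \frac{v_i}{\sigma_i}\, a_i
          = \frac{(v_i - b_i - \sigma_i a_i)^2}{2\sigma_i^2} + \text{const} \\
p(v_i \mid \mathbf{h}) &= \mathcal{N}\!\left(b_i + \sigma_i a_i,\; \sigma_i^2\right)
\end{aligned}
```

So the top-down input shifts the mean of each visible unit away from $b_i$ while the variance stays fixed at $\sigma_i^2$.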
![Page 14: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/14.jpg)
The new idea
• The basic RBM module is flawed.
• It is no good at dealing with multiplicative interactions.
• Multiplicative interactions are ubiquitous
– Style and content (Freeman and Tenenbaum)
– Image transformations (Tensor faces)
– Heavy-tailed distributions caused by multiplying together two Gaussian distributed variables.
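The heavy-tail claim is easy to check numerically. The sketch below compares the kurtosis (fourth standardized moment, 3 for a Gaussian) of Gaussian samples with that of products of two independent standard Gaussians, for which the theoretical value is 9. The sample size and seed are arbitrary choices.

```python
import random

def kurtosis(xs):
    """Fourth standardized moment; equals 3 for a Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 4 for x in xs) / (n * var ** 2)

rng = random.Random(0)
n = 200_000
gauss = [rng.gauss(0.0, 1.0) for _ in range(n)]
product = [rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(n)]

k_gauss = kurtosis(gauss)      # should be close to 3
k_product = kurtosis(product)  # should be well above 3 (heavier tails)
```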
![Page 15: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/15.jpg)
Generating the parts of an object: why multiplicative interactions are useful
• One way to maintain the constraints between the parts is for the level above to specify the location of each part very accurately
– But this would require a lot of communication bandwidth.
• Sloppy top-down specification of the parts is less demanding
– but it messes up relationships between parts
– so use redundant features and specify lateral interactions to sharpen up the mess.
• Each part helps to locate the others
– This allows a noisy top-down channel
![Page 16: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/16.jpg)
Generating the parts of an object
[Figure: the layer above (“square” + pose parameters) provides sloppy top-down activation of parts; the parts with top-down support are then cleaned up using lateral interactions specified by the layer above.]
It’s like soldiers on a parade ground.
![Page 17: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/17.jpg)
Towards a more powerful, multi-linear stackable learning module
• We want the states of the units in one layer to modulate the pair-wise interactions in the layer below (not just the biases)
– Can we do this without losing the nice property that the hidden units are conditionally independent given the visible states?
• To modulate pair-wise interactions we need higher-order Boltzmann machines.
– These have far too many parameters, but we have a trick for fixing that.
![Page 18: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/18.jpg)
Higher order Boltzmann machines (Sejnowski, ~1986)
• The usual energy function is quadratic in the states:
$$E = \text{bias terms} - \sum_{i<j} s_i s_j w_{ij}$$
• But we could use higher order interactions:
$$E = \text{bias terms} - \sum_{i<j<k} s_i s_j s_k w_{ijk}$$
• Unit k acts as a switch. When unit k is on, it switches in the pairwise interaction between unit i and unit j.
– Units i and j can also be viewed as switches that control the pairwise interactions between the other two units.
![Page 19: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/19.jpg)
Using higher-order Boltzmann machines to model image transformations
(the unfactored version, Memisevic & Hinton, CVPR 2007)
• A global transformation specifies which pixel goes to which other pixel.
• Conversely, each pair of similar intensity pixels, one in each image, votes for a particular global transformation.
[Figure: image(t) and image(t+1), linked through units that represent the image transformation.]
![Page 20: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/20.jpg)
Factoring three-way multiplicative interactions
$$E = -\sum_{i,j,h} s_i s_j s_h w_{ijh} \qquad \text{(unfactored, with cubically many parameters)}$$
$$E = -\sum_f \sum_{i,j,h} s_i s_j s_h w_{if} w_{jf} w_{hf} \qquad \text{(factored, with linearly many parameters per factor)}$$
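A small numerical check of the factorization, assuming the unfactored energy E = -sum_{i,j,h} s_i s_j s_h w_ijh and the factored form in which w_ijh = sum_f w_if w_jf w_hf. The layer sizes, random weights and function names are hypothetical. The sketch also counts parameters in both forms.

```python
import itertools
import random

def unfactored_energy(si, sj, sh, w3):
    """E = -sum_{i,j,h} s_i s_j s_h w_ijh."""
    return -sum(si[i] * sj[j] * sh[h] * w3[i][j][h]
                for i, j, h in itertools.product(
                    range(len(si)), range(len(sj)), range(len(sh))))

def factored_energy(si, sj, sh, wi, wj, wh):
    """E = -sum_f (sum_i s_i w_if)(sum_j s_j w_jf)(sum_h s_h w_hf)."""
    total = 0.0
    for f in range(len(wi[0])):
        a = sum(si[i] * wi[i][f] for i in range(len(si)))
        b = sum(sj[j] * wj[j][f] for j in range(len(sj)))
        c = sum(sh[h] * wh[h][f] for h in range(len(sh)))
        total += a * b * c
    return -total

rng = random.Random(0)
I, J, H, F = 4, 4, 3, 2  # hypothetical sizes
wi = [[rng.gauss(0, 1) for _ in range(F)] for _ in range(I)]
wj = [[rng.gauss(0, 1) for _ in range(F)] for _ in range(J)]
wh = [[rng.gauss(0, 1) for _ in range(F)] for _ in range(H)]
# The full tensor implied by the factors: w_ijh = sum_f w_if w_jf w_hf.
w3 = [[[sum(wi[i][f] * wj[j][f] * wh[h][f] for f in range(F))
        for h in range(H)] for j in range(J)] for i in range(I)]

si, sj, sh = [1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0]
e_full = unfactored_energy(si, sj, sh, w3)
e_fact = factored_energy(si, sj, sh, wi, wj, wh)
n_unfactored = I * J * H      # one weight per (i, j, h) triple
n_factored = F * (I + J + H)  # I + J + H weights per factor
```

With these toy sizes the saving is modest, but the unfactored count grows as the product of the layer sizes while the factored count grows as their sum times the number of factors.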
![Page 21: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/21.jpg)
A picture of the rank 1 tensor contributed by factor f
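[Figure: the three weight vectors $w_{if}$, $w_{jf}$ and $w_{hf}$ of factor f and the rank 1 tensor they produce.]
It’s a 3-way outer product.
Each layer is a scaled version of the same rank 1 matrix.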
![Page 22: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/22.jpg)
Inference with factored three-way multiplicative interactions
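Energy contributed by factor f:
$$E_f = -\sum_{i,j,h} s_i s_j s_h w_{if} w_{jf} w_{hf} = -\Big(\sum_i s_i w_{if}\Big)\Big(\sum_j s_j w_{jf}\Big)\Big(\sum_h s_h w_{hf}\Big)$$
How changing the binary state of unit h changes the energy contributed by factor f:
$$E_f(s_h = 0) - E_f(s_h = 1) = w_{hf}\Big(\sum_i s_i w_{if}\Big)\Big(\sum_j s_j w_{jf}\Big)$$
The right-hand side is what unit h needs to know in order to do Gibbs sampling.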
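The energy gap for unit h translates directly into a Gibbs sampling rule: summing the gap over factors gives the total input to h, and the logistic function of that input is the probability that h turns on. The sketch below assumes bias-free units and uses made-up weights; `factor_messages`, `hidden_probs` and `gibbs_sample_hidden` are illustrative names.

```python
import math
import random

def factor_messages(si, sj, wi, wj):
    """For each factor f: (sum_i s_i w_if) * (sum_j s_j w_jf)."""
    return [sum(si[i] * wi[i][f] for i in range(len(si))) *
            sum(sj[j] * wj[j][f] for j in range(len(sj)))
            for f in range(len(wi[0]))]

def hidden_probs(si, sj, wi, wj, wh):
    """p(s_h = 1 | s_i, s_j) = sigmoid(sum_f w_hf * message_f)."""
    m = factor_messages(si, sj, wi, wj)
    return [1.0 / (1.0 + math.exp(-sum(wh[h][f] * m[f]
                                       for f in range(len(m)))))
            for h in range(len(wh))]

def gibbs_sample_hidden(si, sj, wi, wj, wh, rng):
    """Sample every hidden unit independently given both sets of inputs."""
    return [1 if rng.random() < p else 0
            for p in hidden_probs(si, sj, wi, wj, wh)]

# Hypothetical weights: 2 units per group, 2 factors.
wi = [[1.0, 0.5], [-1.0, 2.0]]
wj = [[0.5, 1.0], [1.5, -0.5]]
wh = [[1.0, -1.0], [0.2, 0.3]]
si, sj = [1, 1], [1, 0]
probs = hidden_probs(si, sj, wi, wj, wh)
sample = gibbs_sample_hidden(si, sj, wi, wj, wh, random.Random(0))
```

Because the energy contains no products of two hidden states, each hidden unit can be sampled independently once the factor messages are known.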
![Page 23: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/23.jpg)
Belief propagation
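[Figure: factor f with three vertices, connected to units i, j and h through the weights $w_{if}$, $w_{jf}$ and $w_{hf}$.]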
The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.
![Page 24: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/24.jpg)
Learning with factored three-way multiplicative interactions
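$$\Delta w_{hf} \propto \Big\langle \frac{\partial E_f}{\partial w_{hf}} \Big\rangle_{\text{model}} - \Big\langle \frac{\partial E_f}{\partial w_{hf}} \Big\rangle_{\text{data}} = \langle s_h m_f^h \rangle_{\text{data}} - \langle s_h m_f^h \rangle_{\text{model}}$$
$$m_f^h = \Big(\sum_i s_i w_{if}\Big)\Big(\sum_j s_j w_{jf}\Big)$$
where $m_f^h$ is the message from factor f to unit h.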
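A minimal sketch of this learning rule for the hidden-to-factor weights, assuming the update is the learning rate times the difference between the data and model averages of s_h times the message from factor f to unit h. Where the "model" cases come from (for example a short Gibbs chain started at the data) is left open here; the toy cases and weights are hypothetical.

```python
def message_to_hidden(si, sj, wi, wj, f):
    """Message from factor f: (sum_i s_i w_if) * (sum_j s_j w_jf)."""
    return (sum(s * wi[i][f] for i, s in enumerate(si)) *
            sum(s * wj[j][f] for j, s in enumerate(sj)))

def hidden_weight_update(data_cases, model_cases, wi, wj,
                         n_hidden, n_factors, lr):
    """delta w_hf = lr * (<s_h m>_data - <s_h m>_model)."""
    def stat(cases, h, f):
        # Average of s_h * message over a list of (s_i, s_j, s_h) triples.
        return sum(sh[h] * message_to_hidden(si, sj, wi, wj, f)
                   for si, sj, sh in cases) / len(cases)
    return [[lr * (stat(data_cases, h, f) - stat(model_cases, h, f))
             for f in range(n_factors)] for h in range(n_hidden)]

# Hypothetical setup: 2 + 2 input units, 1 hidden unit, 1 factor.
wi = [[1.0], [2.0]]
wj = [[1.0], [-1.0]]
data_cases = [([1, 0], [1, 0], [1])]   # message = 1 * 1 = 1
model_cases = [([0, 1], [1, 1], [1])]  # message = 2 * 0 = 0
delta = hidden_weight_update(data_cases, model_cases, wi, wj,
                             n_hidden=1, n_factors=1, lr=0.1)
```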
![Page 25: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/25.jpg)
Showing what a factor learns by alternating between its pre- and post- fields
[Figure: a factor’s receptive field in the pre-image alternating with its receptive field in the post-image.]
![Page 26: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/26.jpg)
The factor receptive fields
The network is trained on translated random dot patterns.
![Page 27: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/27.jpg)
The factor receptive fields
The network is trained on translated random dot patterns.
![Page 28: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/28.jpg)
The network is trained on rotated random dot patterns.
![Page 29: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/29.jpg)
The network is trained on rotated random dot patterns.
![Page 30: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/30.jpg)
How does it perceive two overlaid sparse dot patterns moving in different directions?
• First we train a second hidden layer. Each of these units prefers motion in a different direction.
• Then we compute the perceived motion by adding up the preferences of the active units in the second hidden layer.
• If the two motions are within about 30 degrees it sees a single average motion.
• If they are further apart it sees two separate motions.
– The separate motions are slightly further apart than the real ones.
– This is just like human perception and it was not trained on transparent motion.
– The training is entirely unsupervised.
![Page 31: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/31.jpg)
Time series models
• Inference is difficult in directed models of time series if we use non-linear, distributed representations in the hidden units.
– It is hard to fit directed graphical models to high-dimensional sequences (e.g. motion capture data).
• So people tend to use methods with much less representational power
– HMM’s give up on distributed representations
– Linear Dynamical Systems give up on non-linearity.
![Page 32: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/32.jpg)
The conditional RBM model (a partially observed bipartite CRF)
• Start with a generic RBM.
• Add two types of conditioning connections.
• Given the data, the hidden units at time t are conditionally independent.
• The autoregressive weights can model most short-term temporal structure very well, leaving the hidden units to model nonlinear irregularities.
[Figure: visible units v (index i) at times t-2, t-1 and t, with hidden units h (index j) at time t.]
![Page 33: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/33.jpg)
Causal generation from a learned model
• Keep the previous visible states fixed.
– They provide a time-dependent bias for the hidden units.
• Perform alternating Gibbs sampling for a few iterations between the hidden units and the most recent visible units.
– This picks new hidden and visible states that are compatible with each other and with the recent history.
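The generation procedure can be sketched as below. For brevity this version assumes binary visible units (the motion capture model uses real-valued ones), random untrained weights, and hypothetical names `w`, `a` and `b` for the visible-to-hidden, history-to-visible and history-to-hidden weights.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def project(x, m):
    """Return y_k = sum_r x_r * m[r][k]."""
    return [sum(x[r] * m[r][k] for r in range(len(x)))
            for k in range(len(m[0]))]

def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def generate_frame(history, w, a, b, n_gibbs, rng):
    """Sample the next visible frame while the history stays fixed."""
    vis_bias = project(history, a)  # time-dependent bias for the visibles
    hid_bias = project(history, b)  # time-dependent bias for the hiddens
    n_vis, n_hid = len(w), len(w[0])
    v = history[-n_vis:]            # initialise at the most recent frame
    for _ in range(n_gibbs):        # a few alternating Gibbs iterations
        h = bernoulli([sigmoid(hid_bias[j] +
                               sum(v[i] * w[i][j] for i in range(n_vis)))
                       for j in range(n_hid)], rng)
        v = bernoulli([sigmoid(vis_bias[i] +
                               sum(h[j] * w[i][j] for j in range(n_hid)))
                       for i in range(n_vis)], rng)
    return v

rng = random.Random(0)
n_vis, n_hid, n_past = 3, 2, 2
w = [[rng.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_vis)]
a = [[rng.gauss(0, 0.1) for _ in range(n_vis)] for _ in range(n_vis * n_past)]
b = [[rng.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_vis * n_past)]
frames = [[1, 0, 1], [0, 1, 1]]     # two seed frames
for _ in range(5):
    history = frames[-2] + frames[-1]
    frames.append(generate_frame(history, w, a, b, n_gibbs=3, rng=rng))
```

Each generated frame is appended to the sequence and becomes part of the fixed history for the next step.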
![Page 34: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/34.jpg)
Higher level models
• Once we have trained the model, we can add more layers.
• Treat the hidden activities of the first CRBM as data for training the next CRBM.
– Add “autoregressive” connections to a layer when it becomes the visible layer.
• Adding a second layer makes it generate more realistic sequences.
[Figure: a two-layer model with units i, j and k shown at times t-2, t-1 and t.]
![Page 35: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/35.jpg)
An application to modeling motion capture data
• Human motion can be captured by placing reflective markers on the joints
– Use lots of infrared cameras to track the 3-D positions of the markers
• Given a skeletal model, the 3-D positions of the markers can be converted into
– The joint angles
– The 3-D translation of the pelvis
– The roll, pitch and delta yaw of the pelvis
![Page 36: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/36.jpg)
Using a style variable to modulate the interactions (there is additional weight sharing: Taylor & Hinton, ICML 2009)
[Figure: 6 earlier visible frames, the current visible frame and 600 hidden units interact through 200 factors; a 1-of-N style variable drives 100 style features.]
![Page 37: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/37.jpg)
Show demos of multiple styles of walking
These can be found at www.cs.toronto.edu/~gwtaylor/
![Page 38: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/38.jpg)
Modeling the covariance structure of a static image by using two copies of the image
[Figure: factor f connects pixel i in Copy 1 and pixel j in Copy 2 of the image, through weights $w_{if}$ and $w_{jf}$, to hidden unit h through weight $w_{hf}$.]
Each factor sends the squared output of a linear filter to the hidden units.
It is exactly the standard model of simple and complex cells. It allows complex cells to extract oriented energy.
The standard model drops out of doing belief propagation for a factored third-order energy function.
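Why the message becomes a squared filter output: a factor sends the hidden units the product of the weighted sums at its other two vertices, and when both vertices see the same image through the same filter that product is a square. The sketch assumes the two copies share one set of filter weights; the image and filter values are made up.

```python
def factor_output(copy1, copy2, wi, wj, f):
    """Product of the two weighted sums arriving at factor f."""
    return (sum(p * wi[i][f] for i, p in enumerate(copy1)) *
            sum(p * wj[j][f] for j, p in enumerate(copy2)))

image = [0.2, -1.0, 0.5]          # hypothetical 3-pixel image
filt = [[1.0], [0.5], [-2.0]]     # one factor; same filter for both copies

filter_response = sum(p * filt[i][0] for i, p in enumerate(image))
message = factor_output(image, image, filt, filt, 0)
```

Here `message` equals `filter_response ** 2`, so it is never negative regardless of the sign of the filter response.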
![Page 39: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/39.jpg)
An advantage of modeling covariances between pixels rather than pixels
• During generation, a hidden “vertical edge” unit can turn off the horizontal interpolation in a region without worrying about exactly where the intensity discontinuity will be.
– This gives some translational invariance
– It also gives a lot of invariance to brightness and contrast.
– The “vertical edge” unit acts like a complex cell.
• By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.
![Page 40: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/40.jpg)
Using linear filters to model the inverse covariance matrix of two pixel intensities
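[Figure: the joint distribution of 2 pixels, with two linear filters a and b whose outputs are $y_a$ and $y_b$; one filter has a small weight and the other a big weight.]
$$E = w_a y_a^2 + w_b y_b^2, \qquad p \propto e^{-E}$$
Each factor creates a parabolic energy trough.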
![Page 41: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/41.jpg)
Modulating the precision matrix by using additive contributions that can be switched off
• Use the squared outputs of a set of linear filters to create an energy function.
– The energy function represents the negative log probability of the data under a full covariance Gaussian.
• Adapt the precision matrix to each datapoint by switching off the energy contributions from some of the linear filters.
– This is good for modeling smoothness constraints that almost always apply, but sometimes fail catastrophically (e.g. at edges).
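A toy illustration of switching off a smoothness constraint, assuming an energy of the form E = sum_f h_f w_f (c_f . v)^2, which equals v^T P v with P = sum_f h_f w_f c_f c_f^T (the precision matrix up to the conventional factor of 2). The three-pixel image, the two difference filters and all names are hypothetical.

```python
def filter_outputs(v, filters):
    """Linear filter responses y_f = c_f . v."""
    return [sum(c * x for c, x in zip(cf, v)) for cf in filters]

def energy(v, filters, weights, h):
    """E = sum_f h_f * w_f * y_f^2 over the filters that are switched on."""
    ys = filter_outputs(v, filters)
    return sum(hf * wf * y * y for hf, wf, y in zip(h, weights, ys))

def precision_like_matrix(filters, weights, h):
    """P = sum_f h_f * w_f * c_f c_f^T, so that E = v^T P v."""
    n = len(filters[0])
    return [[sum(hf * wf * cf[r] * cf[c]
                 for hf, wf, cf in zip(h, weights, filters))
             for c in range(n)] for r in range(n)]

def quadratic_form(v, m):
    return sum(v[r] * m[r][c] * v[c]
               for r in range(len(v)) for c in range(len(v)))

# Two smoothness constraints on a row of three pixels.
filters = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
weights = [1.0, 1.0]
v = [1.0, 1.0, 5.0]  # an edge between the second and third pixels

e_all_on = energy(v, filters, weights, [1, 1])    # 16.0: the edge is penalised
e_edge_off = energy(v, filters, weights, [1, 0])  # 0.0: violated constraint removed
p_all_on = precision_like_matrix(filters, weights, [1, 1])
```

Turning off the second hidden unit removes the coupling between the second and third pixels while keeping the first two pixels tied together.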
![Page 42: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/42.jpg)
Using binary hidden units to remove violated smoothness constraints
[Figure: hidden units a and b receive the squared filter terms $w_a y_a^2$ and $w_b y_b^2$. Plot: free energy against filter output, y.]
When the negative input from the squared filter exceeds the positive bias, the hidden unit turns off.
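A sketch of the turn-off behaviour for a single hidden unit, assuming its total input is the positive bias b minus the weighted squared filter output w*y^2, so that its contribution to the energy is h*(w*y^2 - b). Summing out the binary state gives the free energy. The particular values of w and b are arbitrary.

```python
import math

def p_on(y, w, b):
    """Probability the hidden unit is on: sigmoid(b - w * y^2)."""
    return 1.0 / (1.0 + math.exp(-(b - w * y * y)))

def free_energy(y, w, b):
    """-log(1 + exp(b - w * y^2)): the binary unit has been summed out."""
    return -math.log1p(math.exp(b - w * y * y))

w, b = 1.0, 2.0
ys = [0.0, 1.0, 3.0]
on_probs = [p_on(y, w, b) for y in ys]
energies = [free_energy(y, w, b) for y in ys]
```

Near y = 0 the free energy rises roughly like a parabola, but for large filter outputs it levels off towards zero: once the constraint is badly violated the unit turns off and stops adding to the penalty.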
![Page 43: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/43.jpg)
Inference with hidden units that represent active smoothness constraints
• The hidden units are all independent given the pixel intensities
– The factors do not create dependencies between hidden units.
• Given the states of the hidden units, the pixel intensity distribution is a full covariance Gaussian that is adapted for that particular image.
– The hidden states do create dependencies between the pixels.
![Page 44: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/44.jpg)
Learning with an adaptive precision matrix
• Since the pixel intensities are no longer independent given the hidden states, it is much harder to produce reconstructions.
– We could invert the precision matrix for each training example, but this is slow.
• Instead, we produce reconstructions using Hybrid Monte Carlo, starting at the data.
– The rest of the learning algorithm is the same as before.
![Page 45: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/45.jpg)
Hybrid Monte Carlo
• Given the pixel intensities, we can integrate out the hidden states to get a free energy that is a deterministic function of the image.
– Backpropagation can then be used to get the derivatives of the free energy with respect to the pixel intensities.
• Hybrid Monte Carlo simulates a particle that starts at the datapoint with a random initial momentum and then moves over the free energy surface.
– 20 leapfrog steps work well for our networks.
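A generic Hybrid Monte Carlo step with 20 leapfrog steps, as a sketch only. A simple quadratic stands in for the free energy and its gradient is written by hand rather than obtained by backpropagation; the step size and the final accept/reject test are standard HMC choices assumed here, not details from the slides.

```python
import math
import random

def leapfrog(x, p, grad, step, n_steps):
    """Move a particle over an energy surface with the leapfrog integrator."""
    x, p = list(x), list(p)
    g = grad(x)
    for _ in range(n_steps):
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half step, momentum
        x = [xi + step * pi for xi, pi in zip(x, p)]        # full step, position
        g = grad(x)
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half step, momentum
    return x, p

def hmc_step(x0, energy, grad, rng, step=0.05, n_steps=20):
    """Start at x0 with a random momentum and propose a new point."""
    p0 = [rng.gauss(0.0, 1.0) for _ in x0]
    x1, p1 = leapfrog(x0, p0, grad, step, n_steps)
    h0 = energy(x0) + 0.5 * sum(p * p for p in p0)
    h1 = energy(x1) + 0.5 * sum(p * p for p in p1)
    # Metropolis test corrects for the discretisation error.
    if rng.random() < math.exp(min(0.0, h0 - h1)):
        return x1
    return x0

def toy_free_energy(x):
    """Stand-in for the free energy: a quadratic bowl."""
    return 0.5 * sum(xi * xi for xi in x)

def toy_gradient(x):
    return list(x)

rng = random.Random(0)
x = [1.0, -2.0]  # stand-in for a datapoint
for _ in range(10):
    x = hmc_step(x, toy_free_energy, toy_gradient, rng)
```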
![Page 46: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/46.jpg)
mcRBM (mean and covariance RBM)
• Use one set of binary hidden units to model the means of the real-valued pixels.
– These hidden units learn blurry patterns for coloring in regions
• Use a separate set of binary hidden units to model the image-specific precision matrix.
– These hidden units get their input from factors.
– The factors learn sharp edge filters for representing breakdowns in smoothness.
![Page 47: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/47.jpg)
Receptive fields of the hidden units that represent the means
Trained on 16x16 patches of natural images.
![Page 48: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/48.jpg)
Receptive fields of the factors that are used to represent precisions
Notice the color blob with low frequency red-green and yellow-blue filters
![Page 49: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/49.jpg)
Why is the map topographic?
• We laid out the factors in a 2-D grid and then connected each hidden unit to a small set of nearby factors.
• If two factors get activated at the same time, it pays to connect them to the same hidden unit.
– You only lose once by turning off that hidden unit.
![Page 50: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/50.jpg)
Multiple reconstructions from the same hidden state of a mcRBM
The mcRBM hidden states are the same for each row. The hidden states should reflect human similarity judgements much better than squared difference of pixel intensities.
![Page 51: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/51.jpg)
Test examples from the CIFAR-10 dataset
[Figure: test images from the classes plane, car, bird, cat, deer, dog, frog, horse, ship and truck.]
![Page 52: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/52.jpg)
Application to the CIFAR-10 labeled subset of the TINY images dataset (Marc’Aurelio Ranzato)
• There are 5000 32x32 training images and 1000 32x32 testing images for each of 10 different classes.
– In addition, there are 80 million unlabeled images.
• Train the mcRBM model on a very large number of 8x8 color patches
– 81 hiddens for the mean
– 144 hiddens and 900 factors for the precision
• Replicate the patches across the 32x32 color images
– 49 patches with a stride of 4
– This gives 49 x 225 = 11025 hidden units.
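The patch and hidden-unit counts follow from the image size, patch size and stride, as this short check shows (variable names are illustrative):

```python
image_size, patch_size, stride = 32, 8, 4

# Number of patch positions along one side of the image.
patches_per_side = (image_size - patch_size) // stride + 1
n_patches = patches_per_side ** 2

# Mean hiddens plus precision hiddens per patch.
hiddens_per_patch = 81 + 144
total_hiddens = n_patches * hiddens_per_patch
```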
![Page 53: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/53.jpg)
How well does it discriminate?
• Compare with Gaussian-Binary RBM model that has the same number of hidden units, but only models the means of the pixel intensities.
• Use multinomial logistic regression directly on the hidden units representing the means and the hidden units representing the precisions.
– We can probably do better, but the aim is to evaluate the mcRBM idea.
• Also try unsupervised learning of extra hidden layers with a standard RBM to see if this gives even better features for discrimination.
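As a rough illustration of the classifier described above, here is a minimal NumPy sketch of multinomial logistic regression applied to the concatenation of "mean" and "precision" hidden activations. The toy features, the function names, the learning rate and the number of epochs are all assumptions made for illustration; none of them come from the talk, and the features are random stand-ins rather than real mcRBM outputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(features, labels, num_classes, lr=0.5, epochs=200):
    """Batch gradient descent on the multinomial logistic regression loss."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    targets = np.eye(num_classes)[labels]  # one-hot targets
    for _ in range(epochs):
        probs = softmax(features @ W + b)
        grad = (probs - targets) / n       # gradient of the mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)

# Toy stand-ins for the two groups of hidden units (hypothetical data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
prototypes = np.array([[1, 0, 0, 0, 0],
                       [0, 1, 0, 0, 0],
                       [0, 0, 1, 0, 0]], dtype=float)
mean_hiddens = prototypes[labels] + 0.1 * rng.random((300, 5))
precision_hiddens = rng.random((300, 4))

# The classifier sees both groups of hidden units side by side.
features = np.concatenate([mean_hiddens, precision_hiddens], axis=1)
W, b = train_softmax_regression(features, labels, num_classes=3)
accuracy = (predict(features, W, b) == labels).mean()
```

The only point of the sketch is that the two sets of hidden units are simply concatenated and fed to a linear softmax classifier, with no further feature engineering.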
![Page 54: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/54.jpg)
Percent correct on CIFAR-10 test data
• Gaussian RBM (only models the means); 49x225 = 11025 hiddens: 59.7%
• 3-way RBM (only models the covariances); 49x225 = 11025 hiddens, 225 filters per patch: 62.3%
• 3-way RBM (only models the covariances); 49x225 = 11025 hiddens, 900 filters per patch (extra factors allow pooling of similar filters): 67.8%
• mcRBM (models means & covariances); 49x(81+144) = 11025 hiddens, 900 filters per patch: 69.1%
• mcRBM then extra hidden layer of 8096 units; 49x(81+144) = 11025 hiddens, 900 filters per patch: 72.1%
![Page 55: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/55.jpg)
Summary
• It is easy to learn deep generative models of unlabeled data by stacking RBM’s.
• RBM’s can be modified to allow factored multiplicative interactions. Inference is still easy.
– Learning is still easy if we condition on one set of inputs (the pre-image for learning image transformations; the style for learning mocap).
• Multiplicative interactions allow an RBM to model pixel covariances in an image-specific way.
– This gives good hidden representations for object recognition.
![Page 56: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/56.jpg)
THE END
A reading list on Deep Belief nets www.cs.toronto.edu/~hinton/deeprefs.html
![Page 57: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/57.jpg)
Learning a deep directed network
• First learn with all the weights tied
– This is exactly equivalent to learning an RBM
– Contrastive divergence learning is equivalent to ignoring the small derivatives contributed by the tied weights between deeper layers.
[Figure: stack of layers v0, h0, v1, h1, v2, h2, etc., joined by tied weights W and W^T; beside it, an RBM with layers v0 and h0 and weights W.]
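The contrastive divergence step mentioned on this slide can be sketched as follows for a binary RBM. This is a simplified illustration, not the speaker's code: biases are omitted, probabilities rather than samples are used for the reconstruction, and the learning rate and toy data are arbitrary assumptions. Only the statistics from (v0, h0) and (v1, h1) enter the update, which corresponds to dropping the contributions of the deeper tied layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, rng, lr=0.1):
    """One CD-1 weight update for a binary RBM without biases."""
    p_h0 = sigmoid(v0 @ W)                              # infer h0 from the data v0
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample binary hidden states
    p_v1 = sigmoid(h0 @ W.T)                            # reconstruct v1 through W^T
    p_h1 = sigmoid(p_v1 @ W)                            # infer h1 from the reconstruction
    positive = v0.T @ p_h0                              # <v0 h0> statistics
    negative = p_v1.T @ p_h1                            # <v1 h1> statistics
    return W + lr * (positive - negative) / v0.shape[0]

rng = np.random.default_rng(0)
v0 = (rng.random((20, 6)) < 0.5).astype(float)          # toy binary data, not from the talk
W = 0.01 * rng.standard_normal((6, 4))
W_new = cd1_update(v0, W, rng)
```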
![Page 58: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/58.jpg)
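• Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
– This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.
[Figure: the same stack of layers v0, h0, v1, h1, v2, h2, etc., with the first layer of weights frozen (W_frozen and W_frozen^T) and tied weights W and W^T above; beside it, an RBM with layers h0 and v1 and weights W.]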
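A minimal sketch of how the data for the next RBM could be produced, assuming binary units and no biases (both simplifications that the slides do not spell out). Sampling h0 once for every training vector yields samples from the aggregated posterior; the frozen weights are only read, never updated. All names and sizes here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h0(v0, W_frozen, rng):
    """Sample binary h0 states for each training vector using the frozen weights."""
    p_h0 = sigmoid(v0 @ W_frozen)
    return (rng.random(p_h0.shape) < p_h0).astype(float)

rng = np.random.default_rng(0)
v0 = (rng.random((20, 6)) < 0.5).astype(float)   # toy binary data, not from the talk
W_frozen = 0.1 * rng.standard_normal((6, 4))     # stands in for the learned first layer
h0_data = sample_h0(v0, W_frozen, rng)           # training set for the next RBM
W_next = 0.01 * rng.standard_normal((4, 3))      # weights of the next RBM, still to be learned
```

The next RBM would then be trained on `h0_data` exactly as the first one was trained on `v0`.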
![Page 59: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/59.jpg)
The hybrid generative model after learning 3 layers
• To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.
So the lower level bottom-up connections are not part of the generative model. They are just used for inference.
[Figure: stack of layers data, h1, h2, h3 with weights W1, W2, W3.]
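The two generation steps above can be sketched in NumPy as below. This is an illustration under several assumptions that are not in the slides: all units are binary, biases are omitted, the top-level RBM connects h2 and h3 through W3, the number of Gibbs steps is arbitrary, and the weights are random placeholders rather than learned values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def generate(W1, W2, W3, rng, gibbs_steps=1000):
    """W1: data x h1, W2: h1 x h2, W3: h2 x h3 (top-level RBM between h2 and h3)."""
    # Step 1: alternating Gibbs sampling in the top-level RBM.
    h2 = sample(np.full(W3.shape[0], 0.5), rng)
    for _ in range(gibbs_steps):
        h3 = sample(sigmoid(h2 @ W3), rng)
        h2 = sample(sigmoid(W3 @ h3), rng)
    # Step 2: a single top-down pass through the directed layers.
    h1 = sample(sigmoid(W2 @ h2), rng)
    return sigmoid(W1 @ h1)                # pixel probabilities for the data layer

rng = np.random.default_rng(0)
W1 = rng.standard_normal((12, 8))          # placeholder weights (hypothetical sizes)
W2 = rng.standard_normal((8, 6))
W3 = rng.standard_normal((6, 5))
image = generate(W1, W2, W3, rng, gibbs_steps=50)
```

Note that only the top-down use of W1 and W2 appears in `generate`; the bottom-up direction is not needed for generation.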
![Page 60: Deep learning with multiplicative interactions Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of](https://reader035.vdocuments.mx/reader035/viewer/2022070308/551b935d550346d6338b6017/html5/thumbnails/60.jpg)
Learning with unreliable labels
• We can infer the true hidden label by combining evidence.
• This allows us to get surprisingly low error rates with very bad labels:
– Perfect labels: 1%
– Labels 50% wrong: 2%
– Labels 80% wrong: 5%
• It’s the mutual information that matters.
[Figure labels: 28 x 28 pixel image; 500 neurons; 500 neurons; 2000 top-level neurons; 10 hidden labels; confusion matrix; 10 noisy labels.]
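To make the remark about mutual information concrete, the sketch below computes the mutual information between a true label and its noisy copy. It assumes 10 equally likely classes and that a wrong label is drawn uniformly from the other 9 classes; the slide does not state the noise model, so this is only one plausible reading.

```python
import math

def label_mutual_information(error_rate: float, num_classes: int = 10) -> float:
    """Mutual information in bits between a uniform true label and its noisy copy."""
    correct = 1.0 - error_rate
    wrong_each = error_rate / (num_classes - 1)   # each wrong class is equally likely
    conditional_entropy = 0.0
    for p, count in ((correct, 1), (wrong_each, num_classes - 1)):
        if p > 0.0:
            conditional_entropy -= count * p * math.log2(p)
    # With symmetric noise and uniform true labels, the noisy label is also uniform.
    return math.log2(num_classes) - conditional_entropy

for rate in (0.0, 0.5, 0.8, 0.9):
    print(rate, round(label_mutual_information(rate), 3))
```

Under these assumptions, labels that are 50% or 80% wrong still carry roughly 0.74 and 0.06 bits per example, and the information only vanishes at 90% wrong, where the noisy label is independent of the true one.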