deep learning
Algorithms and Applications
Bernardete Ribeiro
University of Coimbra, Portugal
INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015
III - Deep Learning Algorithms
elements 3: deep neural networks
outline
∙ Learning in Deep Neural Networks
∙ Deep Learning: Evolution Timeline
∙ Deep Architectures
∙ Restricted Boltzmann Machines (RBMs)
∙ Deep Belief Networks (DBNs)
∙ Deep Models Overall Characteristics
learning in deep neural networks
1. No general learning algorithm (no-free-lunch theorem, Wolpert 1996)
2. Learning algorithms for specific tasks: perception, control, prediction, planning, reasoning, language understanding
3. Limitations of backpropagation (BP): local minima, optimization challenges for non-convex objective functions
4. Hinton's deep belief networks (DBNs) as a stack of RBMs
5. LeCun's energy-based learning for DBNs
deep learning: evolution timeline
1. Perceptron [Frank Rosenblatt, 1959]
2. Neocognitron [K. Fukushima, 1980]
3. Convolutional Neural Network (CNN) [LeCun, 1989]
4. Multi-level Hierarchy Networks [Jürgen Schmidhuber, 1992]
5. Deep Belief Networks (DBNs) as a stack of RBMs [Geoffrey Hinton, 2006]
deep architectures
from brain-like computing to deep learning
∙ New empirical and theoretical results have brought deep architectures into the focus of Machine Learning (ML) researchers [Larochelle et al., 2007].
∙ Theoretical results suggest that deep architectures are fundamental for learning the kind of complicated, brain-like functions that can represent high-level abstractions (e.g. vision, speech, language) [Bengio, 2009].
deep concepts main idea
deep neural networks
∙ Convolutional Neural Networks (CNNs) [LeCun et al., 1989]
∙ Deep Belief Networks (DBNs) [Hinton et al., 2006]
∙ AutoEncoders (AEs) [Bengio et al., NIPS 2006]
∙ Sparse Autoencoders [Ranzato et al., NIPS 2006]
convolutional neural networks (cnns)
∙ A Convolutional Neural Network consists of two basic operations
  ∙ convolution
  ∙ pooling
∙ Convolutional and pooling layers are arranged alternately until high-level features are obtained
∙ Several feature maps in each convolutional layer
∙ Weights in the same map are shared
[Figure: CNN layer sequence input, C1, S2, C3, S4, NN. Source: I. Arel, D. Rose and T. Karnowski, Deep Machine Learning - A New Frontier in Artificial Intelligence Research, IEEE CIM, 2010]
convolutional neural networks (cnns)
∙ Convolution: suppose the previous layer has size d × d and the receptive fields have size r × r; γ and x denote the values of the convolutional layer and of the previous layer, respectively:

$$\gamma_{ij} = g\Big(\sum_{m=1}^{r}\sum_{n=1}^{r} x_{i+m-1,\,j+n-1}\, w_{m,n} + b\Big), \qquad i, j = 1, \dots, (d - r + 1)$$

where g is a nonlinear function.
∙ Pooling follows convolution to reduce the dimensionality of the features and to introduce translational invariance into the CNN.
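The two operations above can be sketched in NumPy as follows. This is only an illustrative sketch: the function names `conv_layer` and `max_pool`, the choice of `tanh` for g, the single feature map, and the non-overlapping 2 × 2 max pooling are assumptions, not details given on the slide.

```python
import numpy as np

def conv_layer(x, w, b, g=np.tanh):
    """Apply the slide's formula to a d x d input with an r x r kernel.

    Indices are 0-based here, so x[i + m, j + n] corresponds to
    x_{i+m-1, j+n-1} in the 1-based formula. The kernel is not flipped.
    """
    d, r = x.shape[0], w.shape[0]
    out = d - r + 1
    gamma = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            gamma[i, j] = np.sum(x[i:i + r, j:j + r] * w) + b
    return g(gamma)

def max_pool(gamma, p=2):
    """Non-overlapping p x p max pooling; trailing rows/columns are dropped."""
    n = gamma.shape[0] // p
    trimmed = gamma[:n * p, :n * p]
    return trimmed.reshape(n, p, n, p).max(axis=(1, 3))

# Toy example: d = 6, r = 3, so the feature map is 4 x 4 and the pooled map 2 x 2.
x = np.arange(36, dtype=float).reshape(6, 6) / 36.0
w = np.ones((3, 3)) / 9.0          # shared weights of one feature map
features = conv_layer(x, w, b=0.0)
pooled = max_pool(features)
```

Because the same `w` is applied at every position (i, j), the weights within one map are shared, as stated on the previous slide.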
deep belief networks (dbns)
∙ Probabilistic generative models, in contrast with the discriminative nature of other NNs
∙ Generative models provide a joint probability distribution of data and labels
∙ Unsupervised greedy layer-wise pre-training followed by final tuning
[Figure: DBN for an image of 28 x 28 pixels: three stacked RBM layers (each with visible and hidden units), top-level units, label units and a detection layer. Based on I. Arel, D. Rose and T. Karnowski, Deep Machine Learning - A New Frontier in Artificial Intelligence Research, IEEE CIM, 2010]
autoencoders (aes)
∙ The auto-encoder has two components:
  ∙ the encoder f (mapping x to h) and
  ∙ the decoder g (mapping h to r)
∙ An auto-encoder is a neural network that tries to reconstruct its input at its output
[Figure: auto-encoder with input x, encoder f, code h, decoder g and reconstruction r. Based on Y. Bengio, I. Goodfellow and A. Courville, Deep Learning, an MIT Press book (in preparation), www.iro.umontreal.ca/~bengioy/dbook]
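A minimal sketch of the encoder/decoder structure, under assumptions that are not on the slide: single-layer sigmoid maps for f and g, untied random weights, an 8-dimensional input with a 3-dimensional code, and mean squared error as the reconstruction loss. No training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_code = 8, 3                                   # assumed sizes
W_enc = rng.normal(scale=0.1, size=(n_code, n_in))    # encoder weights
b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_in, n_code))    # decoder weights
b_dec = np.zeros(n_in)

def f(x):
    """Encoder: maps the input x to the code h."""
    return sigmoid(W_enc @ x + b_enc)

def g(h):
    """Decoder: maps the code h to the reconstruction r."""
    return sigmoid(W_dec @ h + b_dec)

x = rng.random(n_in)
h = f(x)
r = g(h)
loss = np.mean((x - r) ** 2)   # reconstruction error that training would minimise
```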
deep architectures versus shallow architectures
∙ Deep architectures can be exponentially more efficient than shallow architectures [Roux and Bengio, 2010].
∙ Functions that can be compactly represented with a Neural Network (NN) of depth d may require an exponential number of computational elements for a network of depth d − 1 [Bengio, 2009].
∙ Since the number of computational elements depends on the number of training samples available, using shallow architectures may result in models that generalize poorly [Bengio, 2009].
∙ As a result, deep architecture models tend to outperform shallow models such as Support Vector Machines (SVMs) [Larochelle et al., 2007].
Restricted Boltzmann Machines
Deep Belief Networks
restricted boltzmann machines
restricted boltzmann machines (rbms)
[Figure: RBM drawn as a bipartite graph with visible units v1, ..., vI and hidden units h1, ..., hJ, each layer with a bias unit; the encoder and decoder directions are marked]
restricted boltzmann machines (rbms)
∙ Unsupervised
  ∙ Find complex regularities in training data
∙ Bipartite graph
  ∙ Visible and hidden layers
∙ Binary stochastic units
  ∙ On/off with a probability
∙ One iteration
  ∙ Update hidden units
  ∙ Reconstruct visible units
∙ Maximum likelihood of training data
restricted boltzmann machines (rbms)
∙ Training goal: best probable reproduction
  ∙ unsupervised data
  ∙ find latent factors of the dataset
∙ Adjust weights to get maximum probability of the input data
restricted boltzmann machines (rbms)
Given an observed state, the energy of the joint configuration of the visible units and hidden units (v, h) is given by:

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i=1}^{I} c_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J}\sum_{i=1}^{I} W_{ji}\, v_i h_j , \tag{1}$$

where W is the matrix of weights, and b and c are the biases of the hidden and visible layers, respectively.
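As a small worked example of equation (1), the sketch below evaluates the energy of one configuration. The function name `energy`, the storage of W as a J × I array (so that `W[j, i]` is W_ji), and the toy parameter values are assumptions made for illustration.

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -c.v - b.h - sum_j sum_i W[j, i] * v[i] * h[j]."""
    return -(c @ v) - (b @ h) - (h @ W @ v)

# Toy RBM with I = 3 visible units and J = 2 hidden units.
W = np.array([[1.0, -1.0, 0.5],
              [0.0,  2.0, -0.5]])
b = np.array([0.1, -0.2])        # hidden biases
c = np.array([0.0, 0.3, -0.1])   # visible biases
v = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])

E = energy(v, h, W, b, c)        # 0.1 + 0.1 - 1.0 = -0.8
```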
restricted boltzmann machines (rbms)
The Restricted Boltzmann Machine (RBM) assigns a probability to each configuration (v, h), using:

$$p(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z} , \tag{2}$$

where Z is a normalization constant called the partition function, obtained by summing e^{−E(v,h)} over all possible (v, h) configurations [Bengio, 2009, Hinton, 2010, Carreira-Perpiñán and Hinton, 2005]:

$$Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} . \tag{3}$$
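For a very small RBM, equations (2) and (3) can be checked by brute-force enumeration. The sketch below is illustrative only: the function names and the J × I layout of W are assumptions. The sum has 2^(I+J) terms, so this is not feasible at realistic sizes.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """Energy of equation (1), with W stored as a J x I array."""
    return -(c @ v) - (b @ h) - (h @ W @ v)

def partition_function(W, b, c):
    """Z: sum of exp(-E(v, h)) over all binary configurations (v, h)."""
    J, I = W.shape
    Z = 0.0
    for v in itertools.product([0.0, 1.0], repeat=I):
        for h in itertools.product([0.0, 1.0], repeat=J):
            Z += np.exp(-energy(np.array(v), np.array(h), W, b, c))
    return Z

def joint_probability(v, h, W, b, c):
    """p(v, h) = exp(-E(v, h)) / Z."""
    return np.exp(-energy(v, h, W, b, c)) / partition_function(W, b, c)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 3))
b = rng.normal(size=2)
c = rng.normal(size=3)

# The joint probabilities of all 2**5 configurations should sum to one.
total = sum(
    joint_probability(np.array(v), np.array(h), W, b, c)
    for v in itertools.product([0.0, 1.0], repeat=3)
    for h in itertools.product([0.0, 1.0], repeat=2)
)
```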
restricted boltzmann machines (rbms)
Since there are no connections between any two units within the same layer, given a particular input configuration, v, all the hidden units are conditionally independent of each other and the probability of h given v becomes:

$$p(\mathbf{h} \mid \mathbf{v}) = \prod_{j} p(h_j \mid \mathbf{v}) , \tag{4}$$

where

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_{i=1}^{I} v_i W_{ji}\Big) . \tag{5}$$
restricted boltzmann machines (rbms)
Similarly, given a specific hidden state, h, the probability of v given h is obtained by (6):

$$p(\mathbf{v} \mid \mathbf{h}) = \prod_{i} p(v_i \mid \mathbf{h}) , \tag{6}$$

where:

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(c_i + \sum_{j=1}^{J} h_j W_{ji}\Big) . \tag{7}$$
restricted boltzmann machines (rbms)
Given a random training vector v, the state of a given hidden unit j is set to 1 with probability:

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_{i} v_i W_{ji}\Big)$$

Similarly:

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(c_i + \sum_{j} h_j W_{ji}\Big)$$

where σ(x) is the sigmoid squashing function 1/(1 + e^{−x}).
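The two conditional distributions can be sketched as sampling routines. The function names, the J × I storage of W (so that `W[j, i]` is the weight between visible unit i and hidden unit j) and the toy values are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """Return p(h_j = 1 | v) for all j and one binary sample of h."""
    p_h = sigmoid(b + W @ v)                 # b_j + sum_i v_i W[j, i]
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_visible(h, W, c, rng):
    """Return p(v_i = 1 | h) for all i and one binary sample of v."""
    p_v = sigmoid(c + W.T @ h)               # c_i + sum_j h_j W[j, i]
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)

rng = np.random.default_rng(0)
W = np.array([[1.0, -1.0, 0.5],
              [0.0,  2.0, -0.5]])
b = np.array([0.1, -0.2])
c = np.array([0.0, 0.3, -0.1])
v = np.array([1.0, 0.0, 1.0])

p_h, h = sample_hidden(v, W, b, rng)     # p_h = sigmoid([1.6, -0.7])
p_v, v1 = sample_visible(h, W, c, rng)   # one "reconstruction" of v
```

Each unit is switched on independently with its own probability, which is what the factorisation within a layer allows.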
restricted boltzmann machines (rbms)
The marginal probability assigned to a visible vector, v, is given by (8):

$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v},\mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} . \tag{8}$$

Hence, given a specific training vector v, its probability can be raised by adjusting the weights and the biases in order to lower the energy of that particular vector while raising the energy of all the others.
restricted boltzmann machines (rbms)
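To this end, we can perform stochastic gradient ascent on the log-likelihood of the training data vectors, using (9):

$$\frac{\partial \log p(\mathbf{v})}{\partial \theta} = \underbrace{-\sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}) \frac{\partial E(\mathbf{v},\mathbf{h})}{\partial \theta}}_{\text{positive phase}} + \underbrace{\sum_{\mathbf{v},\mathbf{h}} p(\mathbf{v},\mathbf{h}) \frac{\partial E(\mathbf{v},\mathbf{h})}{\partial \theta}}_{\text{negative phase}} \tag{9}$$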
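As a worked step, take θ to be a single weight W_ji and use the energy function defined earlier for the RBM. The derivative of the energy is then a simple product of unit states, and the two phases become expectations of that product. In the block below, v′ is an assumed name for the summation variable of the negative phase, to distinguish it from the clamped training vector v.

```latex
\frac{\partial E(\mathbf{v},\mathbf{h})}{\partial W_{ji}} = -\,v_i h_j
\quad\Longrightarrow\quad
\frac{\partial \log p(\mathbf{v})}{\partial W_{ji}}
  = \sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v})\, v_i h_j
  \;-\; \sum_{\mathbf{v}',\mathbf{h}} p(\mathbf{v}',\mathbf{h})\, v'_i h_j
```

The first sum is an expectation with the visible units fixed to the data, and the second is an expectation under the model's own joint distribution. The same argument with ∂E/∂b_j = −h_j and ∂E/∂c_i = −v_i gives the corresponding expressions for the biases.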
training an rbm
The learning rule for performing stochastic steepest ascent in the log probability of the training data, written for a weight W_ji, is:

$$\frac{\partial \log p(\mathbf{v})}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty \tag{10}$$

where ⟨·⟩_0 denotes expectations for the data distribution (p_0 = p(h | v)) and ⟨·⟩_∞ denotes expectations under the model distribution p_∞(v, h) = p(v, h) [Roux and Bengio, 2008].
mcmc using alternating gibbs sampling
[Figure: Markov chain built up step by step by alternating Gibbs sampling]
v(0) = x → h(0) → v(1) → h(1) → v(2) → h(2) → · · · → v(∞) → h(∞)
∙ Hidden units are updated from the visible units: $p(h_j = 1 \mid \mathbf{v}) = \sigma(b_j + \sum_{i=1}^{I} v_i W_{ji})$
∙ Visible units are updated from the hidden units: $p(v_i = 1 \mid \mathbf{h}) = \sigma(c_i + \sum_{j=1}^{J} h_j W_{ji})$
∙ ⟨v_i h_j⟩_0 is measured on (v(0), h(0)); ⟨v_i h_j⟩_∞ is measured on (v(∞), h(∞))
contrastive divergence algorithm
contrastive divergence (cd–k)
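∙ Estimating ⟨v_i h_j⟩_∞ requires running the Gibbs chain until it reaches equilibrium, which is too slow in practice. To solve this problem, Hinton proposed the Contrastive Divergence (CD) algorithm.
∙ CD–k replaces ⟨·⟩_∞ by ⟨·⟩_k for small values of k:

$$\Delta W_{ji} = \eta\,(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k) \tag{11}$$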
contrastive divergence (cd–k)
∙ v(0) ← x
∙ Compute the binary (feature) states of the hidden units, h(0), using v(0)
∙ for n ← 1 to k
  ∙ Compute the "reconstruction" states of the visible units, v(n), using h(n−1)
  ∙ Compute the "reconstruction" states of the hidden units, h(n), using v(n)
∙ end for
∙ Update the weights and biases according to:

$$\Delta W_{ji} = \eta\,(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k) \tag{12}$$
$$\Delta b_j = \eta\,(\langle h_j \rangle_0 - \langle h_j \rangle_k) \tag{13}$$
$$\Delta c_i = \eta\,(\langle v_i \rangle_0 - \langle v_i \rangle_k) \tag{14}$$
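The CD–k steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `cd_k_update`, the mini-batch layout (one training vector per row), the J × I storage of W, and the use of hidden probabilities rather than sampled states when averaging the statistics are all choices made here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(x, W, b, c, k=1, eta=0.1, rng=None):
    """One CD-k parameter update on a batch x of shape (N, I); W is (J, I)."""
    rng = np.random.default_rng() if rng is None else rng
    v0 = x                                              # v(0) <- x
    ph0 = sigmoid(b + v0 @ W.T)                         # p(h = 1 | v(0)), shape (N, J)
    h = (rng.random(ph0.shape) < ph0).astype(float)     # binary states h(0)
    vn, phn = v0, ph0
    for _ in range(k):
        pv = sigmoid(c + h @ W)                         # p(v = 1 | h(n-1))
        vn = (rng.random(pv.shape) < pv).astype(float)  # reconstruction v(n)
        phn = sigmoid(b + vn @ W.T)                     # p(h = 1 | v(n))
        h = (rng.random(phn.shape) < phn).astype(float) # reconstruction h(n)
    n = x.shape[0]
    # Batch averages stand in for <.>_0 and <.>_k (hidden probabilities used).
    dW = eta * (ph0.T @ v0 - phn.T @ vn) / n
    db = eta * (ph0 - phn).mean(axis=0)
    dc = eta * (v0 - vn).mean(axis=0)
    return W + dW, b + db, c + dc

# Toy usage: 16 random binary vectors, I = 6 visible and J = 4 hidden units.
rng = np.random.default_rng(0)
x = (rng.random((16, 6)) < 0.5).astype(float)
W = rng.normal(0.0, 0.01, size=(4, 6))
b = np.zeros(4)
c = np.zeros(6)
for _ in range(10):
    W, b, c = cd_k_update(x, W, b, c, k=1, eta=0.1, rng=rng)
```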
deep belief networks (dbns)
[Figure: a DBN built by stacking RBMs one layer at a time. First x and h1 are linked by p(x|h1) and p(h1|x); then h2 is added with p(h1|h2) and p(h2|h1); then h3 is added with p(h2|h3) and p(h3|h2)]
deep belief networks (dbns)
∙ Start with a training vector on the visible units
∙ Update all the hidden units in parallel
∙ Update all the visible units in parallel to get a "reconstruction"
∙ Update the hidden units again
pre-training and fine tuning
[Figure: pre-training and fine-tuning of a DBN. Pre-training: RBMs are trained one after another (data to 500 hidden units, 500 to 300, 300 to 100, 100 to 10). Fine-tuning: the resulting DBN model (data, 500, 300, 100 and 10 hidden units) has its weights updated with BP until error < 0.001]
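The pre-training stage can be sketched as a loop that trains one RBM per layer and feeds its hidden activations to the next one. The layer sizes 500, 300, 100 and 10 come from the figure; everything else is an assumption made for this sketch: the function names, full-batch CD-1 with mean-field reconstructions, the hyperparameters, the 784-dimensional random binary data, and the omission of the BP fine-tuning stage.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, eta=0.1, rng=None):
    """Train one RBM with CD-1 on data of shape (N, n_visible)."""
    rng = np.random.default_rng() if rng is None else rng
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_hidden, n_visible))
    b = np.zeros(n_hidden)     # hidden biases
    c = np.zeros(n_visible)    # visible biases
    for _ in range(epochs):
        ph0 = sigmoid(b + data @ W.T)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(c + h0 @ W)          # mean-field reconstruction
        ph1 = sigmoid(b + pv1 @ W.T)
        W += eta * (ph0.T @ data - ph1.T @ pv1) / len(data)
        b += eta * (ph0 - ph1).mean(axis=0)
        c += eta * (data - pv1).mean(axis=0)
    return W, b, c

def pretrain_dbn(data, layer_sizes, **rbm_kwargs):
    """Greedy layer-wise pre-training: each RBM is trained on the layer below."""
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden, **rbm_kwargs)
        layers.append((W, b, c))
        layer_input = sigmoid(b + layer_input @ W.T)   # input for the next RBM
    return layers

rng = np.random.default_rng(0)
data = (rng.random((20, 784)) < 0.5).astype(float)     # stand-in for 28 x 28 images
layers = pretrain_dbn(data, [500, 300, 100, 10], epochs=2, rng=rng)
```

The returned weights would then initialise a feed-forward network for the fine-tuning stage.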
practical considerations
weights initialization
deep belief networks (dbns) - adaptive learning rate size
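$$\eta_{ji} = \begin{cases}
u\,\eta_{ji}^{(old)} & \text{if } (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)\,(\langle v_i h_j \rangle_0^{(old)} - \langle v_i h_j \rangle_k^{(old)}) > 0 \\
d\,\eta_{ji}^{(old)} & \text{if } (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)\,(\langle v_i h_j \rangle_0^{(old)} - \langle v_i h_j \rangle_k^{(old)}) < 0
\end{cases}$$

Source: Lopes et al., Towards Adaptive learning with improved convergence of DBNs on GPUs, Pattern Recognition, 2014.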
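The rule keeps one step size per weight and compares the sign of the current CD update direction with the previous one. A minimal sketch follows; the function name and the factor values u = 1.2 and d = 0.8 are assumptions chosen for illustration (the slide does not give them), as is leaving the step size unchanged when the product is exactly zero.

```python
import numpy as np

def adapt_learning_rates(eta_old, grad, grad_old, u=1.2, d=0.8):
    """Per-weight step sizes from the sign agreement of two CD directions.

    grad and grad_old hold <v_i h_j>_0 - <v_i h_j>_k for the current and
    the previous update; all arrays have the shape of W.
    """
    product = grad * grad_old
    eta = eta_old.copy()
    eta[product > 0] *= u    # same sign as before: take larger steps
    eta[product < 0] *= d    # sign flipped: take smaller steps
    return eta

eta_old = np.full((2, 2), 0.1)
grad = np.array([[1.0, -1.0], [0.0, 2.0]])
grad_old = np.array([[1.0, 1.0], [3.0, -2.0]])
eta = adapt_learning_rates(eta_old, grad, grad_old)
```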
adaptive step size
[Figure: average reconstruction error (RMSE) versus epoch (0 to 1000), in three panels for α = 0.1, α = 0.4 and α = 0.7; each panel compares the adaptive step size with fixed γ = 0.1, γ = 0.4 and γ = 0.7]
convergence results (α = 0.1)
[Figure: training images and their reconstructions after 50, 100, 250, 500, 750 and 1000 epochs, for the adaptive step size and for a fixed (optimized) learning rate η = 0.4]
deep models characteristics
∙ Biological plausibility
∙ DBNs are effective in a wide range of ML problems.
∙ Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs), which requires considerable effort.
∙ The adaptive step-size procedure for tuning the learning rate has been incorporated into the learning model with excellent results.
∙ Graphics Processing Units (GPUs) can significantly reduce the convergence time of the data-intensive tasks in DBNs.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), pages 33–40.
Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical report, Department of Computer Science, University of Toronto.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 473–480. ACM.
Roux, N. L. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649.
Roux, N. L. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192–2207.
Questions?
Bernardete Ribeiro, University of Coimbra, Portugal, 24 June 2015