
deep learning

Algorithms and Applications

Bernardete Ribeiro, [email protected]

University of Coimbra, Portugal

INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015

III - Deep Learning Algorithms


elements 3: deep neural networks

outline

∙ Learning in Deep Neural Networks
∙ Deep Learning: Evolution Timeline
∙ Deep Architectures
∙ Restricted Boltzmann Machines (RBMs)
∙ Deep Belief Networks (DBNs)
∙ Deep Models Overall Characteristics

learning in deep neural networks

learning in deep neural networks

1. No general-purpose learning algorithm exists (no-free-lunch theorem, Wolpert 1996)

2. Learning algorithms are designed for specific tasks: perception, control, prediction, planning, reasoning, language understanding

3. Limitations of backpropagation (BP): local minima and optimization challenges for non-convex objective functions

4. Hinton’s deep belief networks (DBNs) as a stack of RBMs

5. LeCun’s energy-based learning for DBNs

deep learning: evolution timeline

1. Perceptron [Frank Rosenblatt, 1959]
2. Neocognitron [K. Fukushima, 1980]
3. Convolutional Neural Network (CNN) [LeCun, 1989]
4. Multi-level Hierarchy Networks [Jürgen Schmidhuber, 1992]
5. Deep Belief Networks (DBNs) as a stack of RBMs [Geoffrey Hinton, 2006]

deep architectures

from brain-like computing to deep learning

∙ New empirical and theoretical results have brought deep architectures into the focus of Machine Learning (ML) researchers [Larochelle et al., 2007].

∙ Theoretical results suggest that deep architectures are fundamental to learning the kind of brain-like complicated functions that can represent high-level abstractions (e.g. vision, speech, language) [Bengio, 2009].

deep concepts main idea


deep neural networks

∙ Convolutional Neural Networks (CNNs) [LeCun et al., 1989]
∙ Deep Belief Networks (DBNs) [Hinton et al., 2006]
∙ AutoEncoders (AEs) [Bengio et al., NIPS 2006]
∙ Sparse Autoencoders [Ranzato et al., NIPS 2006]

convolutional neural networks (cnns)

∙ A Convolutional Neural Network consists of two basic operations
  ∙ convolution
  ∙ pooling

∙ Convolutional and pooling layers are arranged alternately until high-level features are obtained

∙ Several feature maps in each convolutional layer

∙ Weights in the same map are shared

[Figure: CNN architecture, input → C1 → S2 → C3 → S4 → NN] ¹

¹ I. Arel, D. Rose & T. Karnowski, Deep Machine Learning - A New Frontier in Artificial Intelligence Research, IEEE Computational Intelligence Magazine, 2010

convolutional neural networks (cnns)

∙ Convolution: suppose the previous layer has size $d \times d$ and the receptive fields have size $r \times r$; let $\gamma$ and $x$ denote the values of the convolutional layer and of the previous layer, respectively:

$$\gamma_{ij} = g\left( \sum_{m=1}^{r} \sum_{n=1}^{r} x_{i+m-1,\,j+n-1}\, w_{m,n} + b \right), \qquad i, j = 1, \dots, (d - r + 1)$$

where $g$ is a nonlinear function.

∙ Pooling follows convolution to reduce the dimensionality of the features and to introduce translational invariance into the CNN.
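The convolution and pooling operations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation behind the slides: the choice of tanh for g, the 2 x 2 max pooling, and the toy input and filter values are all assumptions made for the example.

```python
import numpy as np

def conv_layer(x, w, b, g=np.tanh):
    """gamma[i, j] = g(sum over the r x r window of x * w, plus b).

    x is the d x d previous layer, w the r x r receptive-field weights.
    Indices are 0-based here, so the output has (d - r + 1) rows and columns.
    """
    d, r = x.shape[0], w.shape[0]
    out = d - r + 1
    gamma = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            gamma[i, j] = g(np.sum(x[i:i + r, j:j + r] * w) + b)
    return gamma

def max_pool(gamma, p=2):
    """Non-overlapping p x p max pooling (one common pooling choice)."""
    n = gamma.shape[0] // p
    trimmed = gamma[:n * p, :n * p]
    return trimmed.reshape(n, p, n, p).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6) / 36.0  # toy input, d = 6
w = np.ones((3, 3)) / 9.0                            # toy shared weights, r = 3
gamma = conv_layer(x, w, b=0.0)
pooled = max_pool(gamma)
print(gamma.shape, pooled.shape)
```

The same weights `w` are reused at every position (i, j), which is the weight sharing within a feature map mentioned earlier.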


deep belief networks (dbns)

∙ Probabilistic generative models, in contrast with the discriminative nature of other NNs

∙ Generative models provide a joint probability distribution of data and labels

∙ Unsupervised greedy layer-wise pre-training followed by final tuning

[Figure: DBN for images of 28 x 28 pixels. Three stacked RBM layers, each a visible/hidden pair, are topped by a detection layer that connects the top-level units with the label units and hidden units] ²

² Based on I. Arel, D. Rose & T. Karnowski, Deep Machine Learning - A New Frontier in Artificial Intelligence Research, IEEE Computational Intelligence Magazine, 2010

autoencoders (aes)

∙ The auto-encoder has two components:
  ∙ the encoder f (mapping x to h) and
  ∙ the decoder g (mapping h to r)

∙ An auto-encoder is a neural network that is trained to reconstruct its input at its output
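As a rough illustration of the encoder/decoder split, the sketch below runs one forward pass through an untrained single-layer auto-encoder. The layer sizes, the sigmoid activations, the random initial weights and the squared-error loss are assumptions chosen for the example; the slides do not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_code = 8, 3                         # hypothetical input and code sizes
W_enc = rng.normal(0.0, 0.1, (n_code, n_in))
b_enc = np.zeros(n_code)
W_dec = rng.normal(0.0, 0.1, (n_in, n_code))
b_dec = np.zeros(n_in)

def f(x):
    """Encoder: maps the input x to the code h."""
    return sigmoid(W_enc @ x + b_enc)

def g(h):
    """Decoder: maps the code h to the reconstruction r."""
    return sigmoid(W_dec @ h + b_dec)

x = rng.random(n_in)
h = f(x)
r = g(h)
loss = np.mean((x - r) ** 2)  # reconstruction error that training would minimise
print(h.shape, r.shape, loss)
```

Training would adjust the four parameter arrays so that `r` stays close to `x` over the training set.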

[Figure: auto-encoder. The encoder f maps the input x to the code h, and the decoder g maps h to the reconstruction r] ³

³ Based on Y. Bengio, I. Goodfellow and A. Courville, Deep Learning, an MIT Press book (in preparation), www.iro.umontreal.ca_~bengioy_dbook

deep architectures versus shallow architectures

∙ Deep architectures can be exponentially more efficient than shallow architectures [Roux and Bengio, 2010].

∙ Functions that can be compactly represented with a Neural Network (NN) of depth d may require an exponential number of computational elements for a network of depth d − 1 [Bengio, 2009].

∙ Since the number of computational elements that can be tuned depends on the number of training samples available, using shallow architectures may result in models that generalize poorly [Bengio, 2009].

∙ As a result, deep architecture models tend to outperform shallow models such as Support Vector Machines (SVMs) [Larochelle et al., 2007].


Restricted Boltzmann Machines

Deep Belief Networks

restricted boltzmann machines

restricted boltzmann machines (rbms)

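[Figure: RBM drawn as a bipartite graph. Hidden units h1, h2, h3, ..., hj, ..., hJ plus a bias unit fixed at 1; visible units v1, v2, ..., vi, ..., vI plus a bias unit fixed at 1. The two directions between the layers are labelled encoder and decoder.]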


∙ Unsupervised
  ∙ Find complex regularities in training data

∙ Bipartite graph
  ∙ Visible and hidden layers

∙ Binary stochastic units
  ∙ On/off with a probability

∙ One iteration
  ∙ Update hidden units
  ∙ Reconstruct visible units

∙ Maximum likelihood of training data


∙ Training goal: best probable reproduction
  ∙ Unsupervised data
  ∙ Find latent factors of the dataset

∙ Adjust weights to get maximum probability of input data


Given an observed state, the energy of the joint configuration (v, h) of the visible and hidden units is given by:

$$E(v,h) = -\sum_{i=1}^{I} c_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J} \sum_{i=1}^{I} W_{ji} v_i h_j , \tag{1}$$

where W is the matrix of weights, and b and c are the biases of the hidden and visible units, respectively.
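To make the energy function concrete, here is a small sketch that evaluates equation (1) for a toy RBM with I = 3 visible and J = 2 hidden units. The parameter values are arbitrary and only serve the example; W is stored with shape (J, I) so that `W[j, i]` plays the role of W_ji.

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -sum_i c_i v_i - sum_j b_j h_j - sum_ji W_ji v_i h_j."""
    return -(c @ v) - (b @ h) - (h @ W @ v)

# Toy parameters (assumed values, not from the slides).
W = np.array([[0.5, -1.0, 0.0],
              [1.0,  0.0, 2.0]])
b = np.array([0.1, -0.2])        # hidden biases
c = np.array([0.0, 0.3, -0.1])   # visible biases

v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
print(energy(v, h, W, b, c))
```

For this configuration only the second hidden unit is on, so the interaction term reduces to W_21 + W_23 = 3, and the two bias terms contribute 0.1 and 0.2 after the sign change.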



The Restricted Boltzmann Machine (RBM) assigns a probability to each configuration (v, h), using:

$$p(v,h) = \frac{e^{-E(v,h)}}{Z} , \tag{2}$$

where Z is a normalization constant called the partition function, obtained by summing $e^{-E(v,h)}$ over all possible (v, h) configurations [Bengio, 2009, Hinton, 2010, Carreira-Perpiñán and Hinton, 2005]:

$$Z = \sum_{v,h} e^{-E(v,h)} . \tag{3}$$
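For a very small RBM the partition function can be computed by brute force, which makes equations (2) and (3) easy to check numerically. The sketch below enumerates all binary configurations; the toy parameters are assumptions, and the approach is only feasible for a handful of units because the number of configurations grows as 2^(I+J).

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    return -(c @ v) - (b @ h) - (h @ W @ v)

def all_states(n):
    """All 2**n binary vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def partition_function(W, b, c):
    """Z = sum over every (v, h) of exp(-E(v, h)); W has shape (J, I)."""
    J, I = W.shape
    return sum(np.exp(-energy(v, h, W, b, c))
               for v in all_states(I) for h in all_states(J))

def joint_prob(v, h, W, b, c):
    return np.exp(-energy(v, h, W, b, c)) / partition_function(W, b, c)

# Toy parameters (assumed values).
W = np.array([[0.5, -1.0, 0.0],
              [1.0,  0.0, 2.0]])
b = np.array([0.1, -0.2])
c = np.array([0.0, 0.3, -0.1])

Z = partition_function(W, b, c)
total = sum(joint_prob(v, h, W, b, c)
            for v in all_states(3) for h in all_states(2))
print(Z, total)
```

The probabilities of all 32 configurations add up to one, which is exactly what dividing by Z guarantees.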



Since there are no connections between any two units within the same layer, given a particular random input configuration v, all the hidden units are independent of each other, and the probability of h given v becomes:

$$p(h \mid v) = \prod_{j} p(h_j \mid v) , \tag{4}$$

where

$$p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{I} v_i W_{ji}\right) . \tag{5}$$
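A short derivation shows where the sigmoid comes from. It uses only the energy (1) and the joint distribution (2); the shorthand $a_j = b_j + \sum_{i=1}^{I} v_i W_{ji}$ is introduced here for convenience and does not appear in the slides. The visible bias term $\sum_i c_i v_i$ does not depend on h and cancels between numerator and denominator:

$$p(h \mid v) = \frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}} = \frac{\prod_{j=1}^{J} e^{h_j a_j}}{\prod_{j=1}^{J} \left(1 + e^{a_j}\right)} = \prod_{j=1}^{J} \frac{e^{h_j a_j}}{1 + e^{a_j}}$$

Each factor depends on a single hidden unit, which gives the factorization, and setting $h_j = 1$ in one factor yields

$$p(h_j = 1 \mid v) = \frac{e^{a_j}}{1 + e^{a_j}} = \frac{1}{1 + e^{-a_j}} = \sigma(a_j) .$$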



Similarly, given a specific hidden state h, the probability of v given h is obtained by (6):

$$p(v \mid h) = \prod_{i} p(v_i \mid h) , \tag{6}$$

where:

$$p(v_i = 1 \mid h) = \sigma\left(c_i + \sum_{j=1}^{J} h_j W_{ji}\right) . \tag{7}$$


Given a random training vector v, the state of a given hidden unit j is set to 1 with probability:

$$p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i} v_i W_{ji}\right)$$

Similarly:

$$p(v_i = 1 \mid h) = \sigma\left(c_i + \sum_{j} h_j W_{ji}\right)$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid squashing function.
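The two conditionals translate directly into sampling code. The following is a minimal sketch under the assumption that W is stored with shape (J, I); the toy parameter values and the fixed random seed are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b):
    """Return p(h_j = 1 | v) for all j and one binary sample of h."""
    p = sigmoid(b + W @ v)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W, c):
    """Return p(v_i = 1 | h) for all i and one binary sample of v."""
    p = sigmoid(c + W.T @ h)
    return p, (rng.random(p.shape) < p).astype(float)

# Toy parameters (assumed values).
W = np.array([[0.5, -1.0, 0.0],
              [1.0,  0.0, 2.0]])
b = np.array([0.1, -0.2])
c = np.array([0.0, 0.3, -0.1])

v = np.array([1.0, 0.0, 1.0])
p_h, h = sample_hidden(v, W, b)
p_v, v_recon = sample_visible(h, W, c)
print(p_h, h, p_v, v_recon)
```

Each unit is switched on by comparing a uniform random number with its activation probability, which is what "binary stochastic unit" means in practice.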



The marginal probability assigned to a visible vector, v, is given by (8):

$$p(v) = \sum_{h} p(v,h) = \frac{1}{Z} \sum_{h} e^{-E(v,h)} . \tag{8}$$

Hence, given a specific training vector v, its probability can be raised by adjusting the weights and the biases in order to lower the energy of that particular vector while raising the energy of all the others.


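To this end, we can perform stochastic gradient ascent on the log-likelihood of the training data vectors, using (9):

$$\frac{\partial \log p(v)}{\partial \theta} = \underbrace{-\sum_{h} p(h \mid v) \frac{\partial E(v,h)}{\partial \theta}}_{\text{positive phase}} + \underbrace{\sum_{v,h} p(v,h) \frac{\partial E(v,h)}{\partial \theta}}_{\text{negative phase}} \tag{9}$$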

training an rbm

training an rbm

The learning rule for performing stochastic steepest ascent in the log probability of the training data:

$$\frac{\partial \log p(v)}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty \tag{10}$$

where $\langle \cdot \rangle_0$ denotes expectations for the data distribution ($p_0 = p(h \mid v)$) and $\langle \cdot \rangle_\infty$ denotes expectations under the model distribution $p_\infty(v,h) = p(v,h)$ [Roux and Bengio, 2008].
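For a toy RBM both expectations in the learning rule can be computed exactly by enumeration, which allows a numerical check against a finite-difference derivative of log p(v). The sketch below is only an illustration: the parameter values are assumed, W has shape (J, I), and brute-force enumeration is not how RBMs are trained in practice.

```python
import itertools
import numpy as np

def states(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=n)))

def neg_energy(V, H, W, b, c):
    """-E(v, h) for every pair: rows index visible states, columns hidden states."""
    return (V @ c)[:, None] + (H @ b)[None, :] + V @ W.T @ H.T

def log_prob_v(v, W, b, c):
    """log p(v) = log sum_h exp(-E(v, h)) - log Z, by enumeration."""
    V, H = states(c.size), states(b.size)
    log_Z = np.log(np.exp(neg_energy(V, H, W, b, c)).sum())
    return np.log(np.exp(neg_energy(v[None, :], H, W, b, c)).sum()) - log_Z

def exact_gradient_W(v, W, b, c):
    V, H = states(c.size), states(b.size)
    # Positive phase <v_i h_j>_0: h follows p(h | v), whose mean is a sigmoid.
    p_h = 1.0 / (1.0 + np.exp(-(b + W @ v)))
    positive = np.outer(p_h, v)                     # shape (J, I)
    # Negative phase <v_i h_j>_inf: expectation under the joint p(v, h).
    P = np.exp(neg_energy(V, H, W, b, c))
    P /= P.sum()
    negative = np.einsum('vh,vi,hj->ji', P, V, H)   # shape (J, I)
    return positive - negative

# Toy parameters (assumed values).
W = np.array([[0.5, -1.0, 0.0],
              [1.0,  0.0, 2.0]])
b = np.array([0.1, -0.2])
c = np.array([0.0, 0.3, -0.1])
v = np.array([1.0, 0.0, 1.0])

grad = exact_gradient_W(v, W, b, c)

# Central finite difference for one weight, W[1, 2], as a sanity check.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (log_prob_v(v, W_plus, b, c) - log_prob_v(v, W_minus, b, c)) / (2 * eps)
print(grad[1, 2], numeric)
```

The negative phase is the expensive part: it sums over every joint configuration, which is why sampling-based approximations are needed for realistic model sizes.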


mcmc using alternating gibbs sampling

[Figure, built up over five slides: alternating Gibbs chain. Starting from $v^{(0)} = x$, the chain samples $h^{(0)}$, then $v^{(1)}$, $h^{(1)}$, $v^{(2)}$, $h^{(2)}$, and so on up to $v^{(\infty)}$, $h^{(\infty)}$. The statistics $\langle v_i h_j \rangle_0$ are collected at the start of the chain and $\langle v_i h_j \rangle_\infty$ at its end.]

Hidden units are sampled using $p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{I} v_i W_{ji}\right)$ and visible units using $p(v_i = 1 \mid h) = \sigma\left(c_i + \sum_{j=1}^{J} h_j W_{ji}\right)$.

contrastive divergence algorithm

contrastive divergence (cd–k)

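∙ Computing $\langle \cdot \rangle_\infty$ requires running the Gibbs chain until it reaches equilibrium, which is too slow in practice. To solve this problem, Hinton proposed the Contrastive Divergence (CD) algorithm.

∙ CD–k replaces $\langle \cdot \rangle_\infty$ by $\langle \cdot \rangle_k$ for small values of k.

$$\Delta W_{ji} = \eta \left( \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k \right) \tag{11}$$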

contrastive divergence (cd–k)

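∙ $v^{(0)} \leftarrow x$
∙ Compute the binary (feature) states of the hidden units, $h^{(0)}$, using $v^{(0)}$
∙ for $n \leftarrow 1$ to $k$
  ∙ Compute the “reconstruction” states for the visible units, $v^{(n)}$, using $h^{(n-1)}$
  ∙ Compute the “reconstruction” states for the hidden units, $h^{(n)}$, using $v^{(n)}$
∙ end for
∙ Update the weights and biases, according to:

$$\Delta W_{ji} = \eta \left( \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k \right) \tag{12}$$

$$\Delta b_j = \eta \left( \langle h_j \rangle_0 - \langle h_j \rangle_k \right) \tag{13}$$

$$\Delta c_i = \eta \left( \langle v_i \rangle_0 - \langle v_i \rangle_k \right) \tag{14}$$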
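The CD-k procedure above can be sketched as a single update function for one training vector. This is a minimal sketch under stated assumptions: W has shape (J, I), the hidden statistics use activation probabilities instead of sampled states (a common variance-reduction choice, not something the slides prescribe), and the toy sizes, seed and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(x, W, b, c, k=1, eta=0.1):
    """One CD-k step for a single training vector x; returns updated (W, b, c)."""
    v0 = x
    p_h0 = sigmoid(b + W @ v0)
    h = (rng.random(p_h0.shape) < p_h0).astype(float)    # h(0)
    for _ in range(k):
        p_v = sigmoid(c + W.T @ h)
        v = (rng.random(p_v.shape) < p_v).astype(float)  # reconstruction v(n)
        p_h = sigmoid(b + W @ v)
        h = (rng.random(p_h.shape) < p_h).astype(float)  # reconstruction h(n)
    # Statistics at step 0 minus statistics at step k, as in equations (12)-(14).
    dW = eta * (np.outer(p_h0, v0) - np.outer(p_h, v))
    db = eta * (p_h0 - p_h)
    dc = eta * (v0 - v)
    return W + dW, b + db, c + dc

# Toy setup (assumed sizes and values): I = 3 visible, J = 2 hidden units.
W = rng.normal(0.0, 0.01, (2, 3))
b = np.zeros(2)
c = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])

W_new, b_new, c_new = cd_k_update(x, W, b, c, k=1, eta=0.1)
print(W_new.shape, b_new.shape, c_new.shape)
```

In a full training loop this update would be applied repeatedly over the training set, usually averaged over mini-batches.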


[Figure, built up in three steps: a stack of layers x, h1, h2, h3. Adjacent layers are linked by the conditionals p(h1|x) and p(x|h1), then p(h2|h1) and p(h1|h2), then p(h3|h2) and p(h2|h3).]


∙ Start with a training vector on the visible units

∙ Update all the hidden units in parallel

∙ Update all the visible units in parallel to get a “reconstruction”

∙ Update the hidden units again


pre-training and fine tuning

[Figure: RBM pre-training followed by fine-tuning with BP. Pre-training: four RBMs are trained in sequence, data → 500 hidden units, 500 → 300 hidden units, 300 → 100 hidden units, and 100 → 10 hidden units. Fine-tuning: the resulting DBN model (data, 500, 300, 100 and 10 hidden units) is trained with BP, updating the weights until error < 0.001.]
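The pre-training stage can be sketched as a loop that trains one RBM at a time and feeds its hidden activations to the next. Everything below is an assumption made for illustration: the helper names `train_rbm` and `pretrain_dbn` are hypothetical, each RBM is trained with a simple per-sample CD-1 variant that uses probabilities for the reconstruction, the data are random binary vectors standing in for 28 x 28 images, and the BP fine-tuning stage is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, eta=0.1):
    """Train one RBM with CD-1 on the rows of `data`; returns (W, b, c)."""
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_hidden, n_visible))
    b = np.zeros(n_hidden)
    c = np.zeros(n_visible)
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(b + W @ v0)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            p_v1 = sigmoid(c + W.T @ h0)        # reconstruction (probabilities)
            p_h1 = sigmoid(b + W @ p_v1)
            W += eta * (np.outer(p_h0, v0) - np.outer(p_h1, p_v1))
            b += eta * (p_h0 - p_h1)
            c += eta * (v0 - p_v1)
    return W, b, c

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM is trained on the layer below."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden)
        layers.append((W, b, c))
        x = sigmoid(x @ W.T + b)   # hidden probabilities become the next input
    return layers

# Random stand-in for a small set of binarised 28 x 28 images.
data = (rng.random((20, 784)) < 0.5).astype(float)
layers = pretrain_dbn(data, [500, 300, 100, 10])
print([W.shape for W, _, _ in layers])
```

The weights collected in `layers` would then initialise a feed-forward network of the same shape, which is what BP fine-tunes in the second stage.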



practical considerations

weights initialization


deep belief networks (dbns) - adaptive learning rate size

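$$\eta_{ji} = \begin{cases} u\,\eta_{ji}^{(old)} & \text{if } \left(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\right)\left(\langle v_i h_j \rangle_0^{(old)} - \langle v_i h_j \rangle_k^{(old)}\right) > 0 \\ d\,\eta_{ji}^{(old)} & \text{if } \left(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\right)\left(\langle v_i h_j \rangle_0^{(old)} - \langle v_i h_j \rangle_k^{(old)}\right) < 0 \end{cases}$$ ⁴

⁴ Lopes et al., Towards Adaptive learning with improved convergence of DBNs on GPUs, Pattern Recognition, 2014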
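The rule keeps one learning rate per weight and compares the sign of the current CD update with the previous one. The sketch below assumes that u > 1 acts as an increase factor and d < 1 as a decrease factor, which the slide does not state explicitly; the values 1.2 and 0.8, the function name `adapt_learning_rates` and the toy arrays are hypothetical.

```python
import numpy as np

def adapt_learning_rates(eta_old, grad, grad_old, u=1.2, d=0.8):
    """Per-weight step sizes: grow by u when the update keeps its sign,
    shrink by d when the sign flips, leave unchanged when the product is zero.

    grad and grad_old hold <v_i h_j>_0 - <v_i h_j>_k for the current and
    previous update, with the same shape as the weight matrix.
    """
    product = grad * grad_old
    eta = eta_old.copy()
    eta[product > 0] *= u
    eta[product < 0] *= d
    return eta

eta_old = np.full((2, 3), 0.1)
grad = np.array([[0.2, -0.1, 0.0],
                 [0.3,  0.4, -0.5]])
grad_old = np.array([[0.1, 0.2, 0.3],
                     [-0.3, 0.1, -0.2]])
eta = adapt_learning_rates(eta_old, grad, grad_old)
print(eta)
```

Weights whose updates keep pointing the same way take larger steps, while weights whose updates oscillate are slowed down.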


adaptive step size

[Figure: average reconstruction error (RMSE, axis range 0.10 to 0.45) versus epoch (0 to 1000), in three panels for α = 0.1, α = 0.4 and α = 0.7. Each panel compares the curves labelled adaptive, γ = 0.1, γ = 0.4 and γ = 0.7.]

convergence results (α = 0.1)

[Figure: training images and their reconstructions after 50, 100, 250, 500, 750 and 1000 epochs, comparing the adaptive step size with a fixed (optimized) learning rate η = 0.4.]

deep models characteristics


∙ Biological plausibility

∙ DBNs are effective in a wide range of ML problems.

∙ Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs), which demands considerable effort.

∙ The adaptive step-size procedure for tuning the learning rate has been incorporated in the learning model with excellent results.

∙ Graphics Processing Units (GPUs) can significantly reduce the convergence time of the data-intensive tasks in DBNs.



Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), pages 33–40.

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical report, Department of Computer Science, University of Toronto.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 473–480. ACM.

Roux, N. L. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649.

Roux, N. L. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192–2207.

Questions?


deep learning

Algorithms and Applications

Bernardete Ribeiro, [email protected] 24, 2015

University of Coimbra, Portugal

INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015
