learning deep architectures for aimath.gmu.edu/seminars/nda/ndaapr14.pdf · deep vs shallow pascanu...
TRANSCRIPT
![Page 1: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/1.jpg)
Learning Deep Architectures for AI
- Yoshua Bengio
Part II - Vijay Chakilam
![Page 2: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/2.jpg)
Limitations of Perceptron
� There is no value for W and b such that the model results in right target for every example
x1
x2
y
W, b
0,1
0,0 1,0
1,1
weight plane output =1 output =0
A graphical view of the XOR problem
![Page 3: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/3.jpg)
Learning representations
0,0 1,0
2,1
x2
x1 u1
u2 h2
h1
y
W, b φV, c
W =1111⎡
⎣⎢
⎤
⎦⎥, b =
0−1⎡
⎣⎢
⎤
⎦⎥, V =
1−2⎡
⎣⎢
⎤
⎦⎥ and c = 0
![Page 4: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/4.jpg)
Backpropagation
∂E∂z j
=dyjdz j
∂E∂yj
= yj (1− yj )∂E∂yj
yjj
yii
z j
∂E∂yi
=dzjdyi
∂E∂z jj
∑ = wij∂E∂z jj
∑
∂E∂wij
=∂z j∂wij
∂E∂z j
= yi∂E∂z j
E = 12
(t jj∈output∑ − yj )
2
∂E∂yj
= −(t j − yj )
![Page 5: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/5.jpg)
Deep vs Shallow � Pascanu et al. (2013) compared deep rectifier
networks with their shallow counterparts. � For a deep model with inputs and hidden layers
of width , the maximal number of response regions per parameter behaves as
� For a shallow model with inputs and hidden units, the maximal number of response regions per parameter behaves as
n0 kn
Ωnn0
⎢
⎣⎢
⎥
⎦⎥
k−1nn0−2
k
⎛
⎝⎜⎜
⎞
⎠⎟⎟
O kn0−1nn0−1( )
n0 nk
![Page 6: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/6.jpg)
Deep vs Shallow � Pascanu et al. (2013) showed that the deep model
can generate exponentially more regions per parameter in terms of the number of hidden layers, and at least order polynomially more regions per parameter in terms of layer width n.
� Montufar et al. (2014) came up with significantly improved lower bound on the maximal linear regions.
![Page 7: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/7.jpg)
Unsupervised Pre-training � Scarcity of labeled examples and availability of
many unlabeled examples.
� Unknown future tasks.
� Once a good high-level representation is learned, other learning tasks could be much easier. � For example, kernel machines can be very powerful if
using an appropriate kernel, i.e. an appropriate feature space.
![Page 8: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/8.jpg)
X
X
Z1 Z2 Z3
Y
W 1 W 2 W 3 W 4
Z1
W 1
Z2 Z1
W 2
Z2 Z3 W 3
DNN with unsupervised pre-training using RBMs
![Page 9: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/9.jpg)
DNN with unsupervised pre-training using
Autoencoders X
Z1
X
W W T
W T
W T
W
W
Z1 Z2 Z
Z2 Z3 Z
X
Z1
2
1
2 2
3 3
Z2 Z3
Y
W 1
1 1
W 2 W 3 W 4
![Page 10: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/10.jpg)
Restricted Boltzmann Machines
![Page 11: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/11.jpg)
Restricted Boltzmann Machine � Energy function:
� Distribution:
E(x,h) = −hTWx − cT x − bTh
= − Wj,khj xkk∑
j∑ − ckxk
k∑ − bjhj
j∑
p(x,h) = exp(−E(x,h)) / Z
X h
W
![Page 12: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/12.jpg)
h1 h2
+2 +1
x1 x2
An example of how weights define a
distribution
39.70
x h - E e-E P(x, h ) P(x) 1 1 1 1 2 7.39 .186 1 1 1 0 2 7.39 .186 1 1 0 1 1 2.72 .069 0.466 1 1 0 0 0 1 .025 1 0 1 1 1 2.72 .069 1 0 1 0 2 7.39 .186 1 0 0 1 0 1 .025 0.305 1 0 0 0 0 1 .025 0 1 1 1 0 1 .025 0 1 1 0 0 1 .025 0 1 0 1 1 2.72 .069 0.144 0 1 0 0 0 1 .025 0 0 1 1 -1 0.37 .009 0 0 1 0 0 1 .025 0.084 0 0 0 1 0 1 .025 0 0 0 0 0 1 .025
-1
-1
![Page 13: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/13.jpg)
Log likelihood � To train an RBM, we’d like to minimize the average
negative log-likelihood
1T
l( f (x(t ) ))t∑ =
1T
− log p(x(t ) )t∑
∂− log p(x(t ) )∂θ
= Eh∂E(x(t ),h)
∂θ| x(t )
⎡
⎣⎢
⎤
⎦⎥−Ex,h
∂E(x,h)∂θ
⎡
⎣⎢⎤
⎦⎥
![Page 14: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/14.jpg)
Contrastive Divergence � Idea:
� Replace the expectation by a point estimate at x˜
� Obtain the point x˜ by Gibbs sampling � Start sampling chain at x(t)
~ p(h|x) ~ p(x|h)
......
x(t ) x1 xk = x~
![Page 15: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/15.jpg)
Conditional probabilities
p(h | x) = p(hj | x)j∏
p(hj =1| x) =1
1+ exp(−(bj +Wj.x))
= sigm(bj +Wj.x)
p(x | h) = p(xk | h)k∏
p(xk =1| h) =1
1+ exp(−(ck + hTW.k ))
= sigm(ck + hTW.k )
![Page 16: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/16.jpg)
Conditional probabilities
![Page 17: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/17.jpg)
Contrastive Divergence – Parameter Updates
� Derivation of for ∂E(x,h)∂θ
θ =Wjk
∂E(x,h)∂Wjk
=∂
∂Wjk
− Wjkhjxkjk∑ − ckxk
k∑ − bjhj
j∑
⎛
⎝⎜⎜
⎞
⎠⎟⎟
= −∂
∂Wjk
Wjkhjxkjk∑
= −hjxk
∇WE(x,h) = −hxT
![Page 18: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/18.jpg)
Contrastive Divergence – Parameter Updates
� Derivation of for Εh∂E(x,h)∂θ
| x⎡
⎣⎢⎤
⎦⎥
Εh∂E(x,h)∂Wjk
| x⎡
⎣⎢⎢
⎤
⎦⎥⎥= Εh −hjxk | x⎡⎣ ⎤⎦= −hjxk p(hj | x)
hj∈{0,1}∑
= −xk p(hj =1| x)
Εh ∇WE(x,h) | x[ ] = −h(x)xT
θ =Wjk
![Page 19: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/19.jpg)
Contrastive Divergence – Parameter Updates
� Given and the learning rule for becomes
W ⇐W −α ∇W − log p(x(t ) )( )
⇐W −α Εh ∇WE(x(t ),h) | x(t )⎡⎣ ⎤⎦− Εx,h ∇WE(x,h)[ ]( )
⇐W −α Εh ∇WE(x(t ),h) | x(t )⎡⎣ ⎤⎦− Εh ∇WE(x
~,h) | x
~⎡⎣⎢
⎤⎦⎥
⎛
⎝⎜
⎞
⎠⎟
x(t ) x~
θ =Wjk
⇐W +α h(x(t ) )x(t )T − h(x~)x~ T⎛
⎝⎜
⎞
⎠⎟
![Page 20: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/20.jpg)
Contrastive Divergence - Algorithm
� For each training example � Generate a negative sample using k steps of Gibbs
sampling, starting at � Update parameters
� Repeat until stopping criterion
W ⇐W +α h(x(t ) )x(t )T − h(x~)x~ T⎛
⎝⎜
⎞
⎠⎟
b⇐ b+α h(x(t ) )− h(x~)⎛
⎝⎜
⎞⎠⎟
c⇐ c+α x(t ) − x~⎛
⎝⎜
⎞⎠⎟
x(t )
x(t )x~
![Page 21: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/21.jpg)
Autoencoders
![Page 22: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/22.jpg)
3
bj
ck x
W h(x ) Encoder
= g(a(x)) = sigm(b + Wx )
Decoderx = o(a(x))
= sigm(c + W* h(x)) for binary inputs
W * = W T
(tied weights)
h(x )
Autoencoder
![Page 23: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/23.jpg)
Autoencoder: Loss Function � Loss function:
• Use as the loss function
• Example: We get the squared error loss
• By choosing a Gaussian distribution with mean and identity covariance
and choosing
� For binary inputs:
l( f (x)) = − log p(x |µ)
p(x |µ) = 1(2π )D/2
exp(− 12
(xk −µk )2 )
k∑µ
µ = c+W *h(x)
l( f (x)) = − (xk log(xk )+ (1− xk )log(1− xk ))k∑
l( f (x)) = 12
(xk − xk )2
k∑
![Page 24: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/24.jpg)
Autoencoder: Gradient � For both binary and real valued inputs, the gradient
has a very simple form
� Parameter gradients are obtained by backpropagating the gradient like in a regular network
∇a(x( t ) )
l( f (x(t ) )) = x(t ) − x(t )
![Page 25: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/25.jpg)
3
bj
ck X^
W
W * = W T
(tied weights)
h(x~ )
x
0 0 0 noise process
p(x~|x)
Denoising Autoencoder
� Representation should be robust to introduction of noise � Random assignment of
subset of inputs to 0, with probability v
� Gaussian additive noise
� Reconstruction computed from the corrupted input
� Loss function compares reconstruction with original input
X~
![Page 26: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/26.jpg)
Dataset � MNIST ("Modified National Institute of Standards
and Technology") is the de facto “hello world” dataset of computer vision.
� Contains gray-scale images of hand-drawn digits, from zero through nine.
� Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.
![Page 27: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/27.jpg)
![Page 28: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/28.jpg)
![Page 29: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/29.jpg)
∂y i ∂zi
i i = y (1 - y )
The output units in a softmax group use a non-local non-linearity:
softmax group
z i
yi
yi =ezi
e z j ∑j∈group
Softmax
![Page 30: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/30.jpg)
Theano � Theano is a Python library that allows you to define,
optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
� Tight integration with NumPy
� Transparent use of a GPU. � Perform data-intensive computations much faster than on a
CPU.
� Efficient symbolic differentiation. � Theano does your derivatives for functions with one or
many inputs.
� Speed and stability optimizations.
![Page 31: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/31.jpg)
Theano
� Installation and more information: � http://deeplearning.net/software/theano/
� Latest reference: � Theano: A Python framework for fast computation of
mathematical expressions
![Page 32: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/32.jpg)
Theano.Tensor � Tensor
� A Theano module for symbolic NumPy types and operations
� There are many types of symbolic expressions. Basic Tensor functionality includes creation of tensor variables such as scalar, vector, matrix, etc.
![Page 33: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/33.jpg)
� Create algebraic expressions on symbolic variables:
� Theano Functions
![Page 34: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/34.jpg)
� Create a shared variable so we can do gradient descent
� Define a cost function that has a minimum value
� Compute the gradient
![Page 35: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/35.jpg)
• Get data and initialize network
![Page 36: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/36.jpg)
• Define Theano variables and expressions
![Page 37: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/37.jpg)
• Define training expressions and functions
![Page 38: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/38.jpg)
• Loop through the data in batches
![Page 39: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/39.jpg)
DNN without unsupervised pre-training
![Page 40: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/40.jpg)
DNN with unsupervised pre-training using RBMs
X
X
Z1 Z2 Z3
Y
W 1 W 2 W 3 W 4
Z1
W 1
Z2 Z1
W 2
Z2 Z3 W 3
![Page 41: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/41.jpg)
DNN unsupervised pre-training with RBMs
![Page 42: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/42.jpg)
DNN with unsupervised pre-training using
Autoencoders X
Z1
X
W W T
W T
W T
W
W
Z1 Z2 Z
Z2 Z3 Z
X
Z1
2
1
2 2
3 3
Z2 Z3
Y
W 1
1 1
W 2 W 3 W 4
![Page 43: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/43.jpg)
DNN unsupervised pre-training with AutoEncoders
![Page 44: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/44.jpg)
Unsupervised Pre-training � Layer-wise unsupervised learning:
� The author believes that greedy layer-wise unsupervised pre-training overcomes the challenges of deep learning by introducing a useful prior to the supervised fine-tuning training procedure.
� Gradients of a criterion defined at the output layer could become less useful as they are propagated backwards to lower layers.
� The author states it is reasonable to believe unsupervised learning criterion defined at the level of a single layer could be used to move its parameters in a favorable direction.
![Page 45: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/45.jpg)
Unsupervised Pre-training � The author claims unsupervised pre-training acts as
a regularizer by establishing an initialization point of the fine-tuning procedure inside a region of parameter space in which the parameters are henceforth restricted.
� Another motivation is that greedy layer-wise unsupervised pre-training could be a way to naturally decompose the problem into sub-problems associated with different levels of abstraction and extract salient information about the input distribution and capture in a distributed representation.
![Page 46: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/46.jpg)
References � Yoshua Bengio, Learning deep architecures for AI:
http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
� Hugo Larochelle, Neural networks:http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
� Geoff Hinton, Neural networks for machine learning: https://www.coursera.org/learn/neural-networks
� Goodfellow et al., Deep Learning: http://www.deeplearningbook.org/
� Montufar et al., On the number of linear regions of Deep Networks: https://arxiv.org/pdf/1402.1869.pdf
![Page 47: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/47.jpg)
References � Pascanu et al., On the number of response regions of deep
feed forward neural networks with piece-wise linear activations: https://arxiv.org/pdf/1312.6098.pdf
� Perceptron weight update graphics:https://www.cs.utah.edu/~piyush/teaching/8-9-print.pdf
� Proof of Theorem (Block, 1962, and Novikoff, 1962).http://cs229.stanford.edu/notes/cs229-notes6.pdf
� Python and Theano code reference:https://deeplearningcourses.com/
![Page 48: Learning Deep Architectures for AImath.gmu.edu/seminars/NDA/NDAapr14.pdf · Deep vs Shallow Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts](https://reader033.vdocuments.mx/reader033/viewer/2022041717/5e4c1c6b3699c37e3e0271be/html5/thumbnails/48.jpg)
References � Erhan et al., Why does unsupervised pre-training
help deep learning:http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
� Theano installation and more information: http://deeplearning.net/software/theano/
� Theano latest reference: Theano: A Python framework for fast computation of mathematical expressions