TRANSCRIPT
Basics of deep learning
By
Pierre-Marc Jodoin and Christian Desrosiers
Let's start with a simple example

Training dataset D (inputs x, targets t):

            x = (temp, freq)   t = diagnostic
Patient 1   (37.5, 72)         Healthy
Patient 2   (39.1, 103)        Sick
Patient 3   (38.3, 100)        Sick
(…)         (…)                …
Patient N   (36.7, 88)         Healthy

[Figure: scatter plot of freq vs. temp showing the Sick and Healthy patients]
Let's start with a simple example

[Figure: the same freq vs. temp scatter plot with Sick and Healthy patients]

A new patient comes to the hospital. How can we determine if he is sick or not?
Let's start with a simple example

Solution: divide the feature space into 2 regions, sick and healthy.

[Figure: the freq vs. temp plane split into a Sick region and a Healthy region]

More formally:

y(x) = Healthy   if x is in the blue region
       Sick      otherwise
How to split the feature space?

Definition … a line!

A line in the (x, y) plane: y = m x + b, where m is the slope and b is the bias.

Moving everything to one side: m x - y + b = 0. The three coefficients (m, -1, b) will become the weights w_1, w_2, w_0.
Rename variables: x -> x_1, y -> x_2, so that

w_0 + w_1 x + w_2 y = 0   becomes   w_0 + w_1 x_1 + w_2 x_2 = 0

Classification function:

y_w(x) = w_1 x_1 + w_2 x_2 + w_0

with y_w(x) > 0 on one side of the line, y_w(x) < 0 on the other side, and y_w(x) = 0 on the line itself.
Classification function

y_w(x) = w_1 x_1 + w_2 x_2 + w_0,   with decision boundary   w_1 x_1 + w_2 x_2 + w_0 = 0

Example: for the weights (w_0, w_1, w_2) shown on the slide, a test point on one side of the line (e.g. (2, 3)) gives w_1 x_1 + w_2 x_2 + w_0 = 0.7 > 0, while a point on the other side (e.g. (4, 1)) gives w_1 x_1 + w_2 x_2 + w_0 = -0.6 < 0.
Classification function

y_w(x) = w_1 x_1 + w_2 x_2 + w_0 = w^T x'

with w = (w_0, w_1, w_2) and x' = (1, x_1, x_2): a DOT product.

To simplify notation, we absorb the bias into the weight vector and simply write

y_w(x) = w^T x

Linear classification = dot product with bias included.
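A minimal sketch of this idea in numpy; the weight and patient values here are made up for illustration, not taken from the slides.

import numpy as np

w = np.array([0.4, 0.2, 0.1])       # hypothetical (w0, w1, w2)
x = np.array([38.3, 100.0])         # a patient: (temp, freq)

x_aug = np.concatenate(([1.0], x))  # x' = (1, x1, x2) so the bias is included
score = w @ x_aug                   # y_w(x) = w^T x'

label = "Sick" if score > 0 else "Healthy"
print(score, label)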
Learning

[Figure: the freq vs. temp plane with a candidate separating line y_w(x) = 0 between the Sick and Healthy points]

With the training dataset D, the GOAL is to find the parameters (w_0, w_1, w_2) that best separate the two classes.
How do we know if a model is good?

Loss function

[Figure: three candidate boundaries on the same data, labelled Good!, Medium and BAD!]

The loss L(y_w(x), D) is close to 0 for the good boundary, larger for the medium one, and large for the bad one.
So far…

1. Training dataset: D
2. Classification function (a line in 2D): y_w(x) = w_1 x_1 + w_2 x_2 + w_0
3. Loss function: L(y_w(x), D)
4. Training: find (w_0, w_1, w_2) that minimize L(y_w(x), D)
Today: Perceptron, Logistic regression, Multi-layer perceptron, Conv Nets

Perceptron
Rosenblatt, Frank (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Psychological Review, v65, No. 6, pp. 386–408
Perceptron (2D and 2 classes)

[Diagram: inputs 1, x_1, x_2 with weights w_0, w_1, w_2 feeding a single neuron]

y_w(x) = w_0 + w_1 x_1 + w_2 x_2 = w^T x

Activation function: sign, so the output of the neuron is

y_w(x) = sign(w^T x) ∈ {-1, +1}

Neuron = dot product + activation function.

[Figure: the (x_1, x_2) plane with the normal vector w; y_w(x) > 0 on one side of the boundary, y_w(x) < 0 on the other, y_w(x) = 0 on the boundary itself]
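A tiny sketch of that neuron: the dot product pushed through a sign activation. The function name and weights are mine, not from the slides.

import numpy as np

def perceptron_predict(w, x_aug):
    """y_w(x) = sign(w^T x) in {-1, +1}; sign(0) is mapped to +1."""
    return 1 if w @ x_aug >= 0 else -1

w = np.array([0.4, 0.2, 0.1])
print(perceptron_predict(w, np.array([1.0, 2.0, 3.0])))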
So far…

1. Training dataset: D
2. Classification function: y_w(x) = sign(w^T x) ∈ {-1, +1}, with w^T x = w_1 x_1 + w_2 x_2 + w_0
3. Loss function: L(y_w(x), D)
4. Training: find w that minimizes L(y_w(x), D)

[Diagram: the perceptron neuron with inputs 1, x_1, x_2 and weights w_0, w_1, w_2]
Linear classifiers have limits
Non-linearly separable training data
Linear classifier = large error rate
Three classical solutions
1. Acquire more observations
2. Use a non-linear classifier
3. Transform the data
Non-linearly separable training data

Solution 1: acquire more observations.

With only two measurements x = (temp, freq) the classes may not be linearly separable. Adding a third measurement (headache level) can make them separable:

            x = (temp, freq, headache)   t = diagnostic
Patient 1   (37.5, 72, 2)                healthy
Patient 2   (39.1, 103, 8)               sick
Patient 3   (38.3, 100, 6)               sick
(…)         (…)                          …
Patient N   (36.7, 88, 0)                healthy

In 2D the decision boundary is a line:   y_w(x) = w_1 x_1 + w_2 x_2 + w_0            (line)
In 3D it becomes a plane:                y_w(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0   (plane)
Perceptron (2D and 2 classes): y_w(x) = sign(w^T x) ∈ {-1, +1}, with w^T x = w_1 x_1 + w_2 x_2 + w_0   (a line).

Perceptron (3D and 2 classes): y_w(x) = sign(w^T x), with w^T x = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0   (a plane); y_w(x) = +1 on one side of the plane, y_w(x) = -1 on the other.

Perceptron (K-D and 2 classes): y_w(x) = sign(w^T x), with w^T x = w_1 x_1 + w_2 x_2 + w_3 x_3 + … + w_K x_K + w_0   (a hyperplane).

[Diagram: the same neuron with inputs 1, x_1, x_2, …, x_K and weights w_0, w_1, …, w_K]
Learning a machine

The goal: with a set of training data D = {(x_1, t_1), (x_2, t_2), …, (x_N, t_N)}, estimate w so that

y_w(x_n) ≈ t_n   ∀ n

In other words, minimize the training loss

L(y_w(x), D) = Σ_{n=1}^N l(y_w(x_n), t_n)

This is an optimization problem.

[Figure: three candidate boundaries (good / medium / bad) and the loss surface L(y_w(x), D) plotted over (w_1, w_2)]
Perceptron

Question: how to find the best solution?

[Figure: the loss surface L(y_w(x), D) over (w_1, w_2)]

Random initialization:

import numpy as np
w1, w2 = np.random.randn(2)
Gradient descent

Question: how to find the best solution?

w^[k+1] = w^[k] - η ∇_w L(y_{w^[k]}(x), D)

where ∇_w L is the gradient of the loss function and η is the learning rate.
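A minimal sketch of that update rule as a loop, assuming we already have a function grad_loss(w, X, T) that returns the gradient (that helper is hypothetical).

import numpy as np

def gradient_descent(grad_loss, w0, X, T, eta=0.1, n_iters=100):
    """Plain (full-batch) gradient descent on the training loss."""
    w = w0.copy()
    for _ in range(n_iters):
        w = w - eta * grad_loss(w, X, T)   # w[k+1] = w[k] - eta * grad L
    return w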
Perceptron Criterion (loss)

Observation: a wrongly classified sample is one for which

w^T x_n > 0 and t_n = -1,   or   w^T x_n < 0 and t_n = +1.

Consequently, -w^T x_n t_n is ALWAYS positive for wrongly classified samples.

Perceptron criterion:

L(y_w(x), D) = - Σ_{x_n ∈ V} w^T x_n t_n,   where V is the set of wrongly classified samples.

[Figure: an example boundary with L(y_w(x), D) = 15.464]
So far…

1. Training dataset: D
2. Linear classification function: y_w(x) = sign(w^T x), with w^T x = w_1 x_1 + w_2 x_2 + … + w_M x_M + w_0
3. Loss function: L(y_w(x), D) = - Σ_{x_n ∈ V} w^T x_n t_n, where V is the set of wrongly classified samples
4. Training: find w* = argmin_w L(y_w(x), D)

[Diagram: the perceptron neuron with inputs 1, x_1, x_2, x_3, …, x_K and weights w_0, w_1, …, w_K]
Optimisation

w^[k+1] = w^[k] - η ∇L,   where ∇L is the gradient of the loss function and η the learning rate.

Stochastic gradient descent (SGD):

    Init w, k = 0
    DO
       k = k + 1
       FOR n = 1 to N
          w <- w - η ∇L(x_n)
    UNTIL every data point is well classified or k == MAX_ITER

Perceptron gradient descent: the gradient of the perceptron criterion is

∇L(y_w(x), D) = - Σ_{x_n ∈ V} t_n x_n

NOTE on the learning rate η:
• Too low => slow convergence
• Too large => might not converge (may even diverge)
• Can be decreased at each iteration (e.g. η^[k] = cst / k)
Stochastic gradient descent (SGD) for the perceptron criterion L(y_w(x), D) = - Σ_{x_n ∈ V} w^T x_n t_n:

    Init w, k = 0
    DO
       k = k + 1
       FOR n = 1 to N
          IF w^T x_n t_n ≤ 0 THEN      /* wrongly classified */
             w <- w + η t_n x_n
    UNTIL every data point is well classified OR k == k_MAX
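A small sketch of that loop in numpy, assuming the rows of X already contain the leading 1 for the bias and the targets t are in {-1, +1}.

import numpy as np

def train_perceptron(X, t, eta=1.0, k_max=1000):
    w = np.zeros(X.shape[1])
    for _ in range(k_max):
        n_errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:      # wrongly classified
                w = w + eta * t_n * x_n   # perceptron update
                n_errors += 1
        if n_errors == 0:                 # every sample well classified
            break
    return w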
Similar loss functions (2 classes)

Perceptron loss:   L(y_w(x), D) = - Σ_{x_n ∈ V} t_n w^T x_n,   where V is the set of wrongly classified samples,

which can also be written   L(y_w(x), D) = Σ_{n=1}^N max(0, - t_n w^T x_n).

"Hinge Loss" or "SVM" Loss:   L(y_w(x), D) = Σ_{n=1}^N max(0, 1 - t_n w^T x_n).
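A sketch comparing the two losses, again assuming targets in {-1, +1} and bias-augmented inputs.

import numpy as np

def perceptron_loss(w, X, t):
    margins = t * (X @ w)                     # t_n * w^T x_n for every sample
    return np.sum(np.maximum(0.0, -margins))  # only wrongly classified samples contribute

def hinge_loss(w, X, t):
    margins = t * (X @ w)
    return np.sum(np.maximum(0.0, 1.0 - margins))  # also penalizes small margins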
Multiclass Perceptron (2D and 3 classes)

One linear function per class i = 0, 1, 2:

y_{W,0}(x) = w_0^T x = w_{0,0} + w_{0,1} x_1 + w_{0,2} x_2
y_{W,1}(x) = w_1^T x = w_{1,0} + w_{1,1} x_1 + w_{1,2} x_2
y_{W,2}(x) = w_2^T x = w_{2,0} + w_{2,1} x_1 + w_{2,2} x_2

Predicted class: argmax_i y_{W,i}(x)

In matrix form, y_W(x) = W^T x is the vector of the three scores (w_0^T x, w_1^T x, w_2^T x), where the 3x3 matrix stacks the w_{i,j} and x = (1, x_1, x_2).

[Figure: the (x_1, x_2) plane divided into three regions, each where one of y_{W,0}(x), y_{W,1}(x), y_{W,2}(x) is maximal]
Multiclass Perceptron (2D and 3 classes)

[Diagram: the three output neurons w_0^T x, w_1^T x, w_2^T x and the corresponding three decision regions]

Example: for the input point (1.1, -2.0), the three linear functions produce three scores (one per class); the class with the maximum score is predicted.
Multiclass Perceptron loss function

L(y_W(x), D) = Σ_{x_n ∈ V} ( w_j^T x_n - w_{t_n}^T x_n )

i.e. a sum over all wrongly classified samples of (score of the wrong class j) minus (score of the true class t_n). As for the 2-class perceptron, its gradient only involves the wrongly classified samples.
Multiclass Perceptron: stochastic gradient descent (SGD)

    Init W, k = 0
    DO
       k = k + 1
       FOR n = 1 to N
          j = argmax_i w_i^T x_n
          IF j ≠ t_n THEN                 /* wrongly classified sample */
             w_j     <- w_j     - η x_n
             w_{t_n} <- w_{t_n} + η x_n
    UNTIL every data point is well classified or k > K_MAX
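A sketch of that update in numpy, assuming X has a leading 1 per row (bias) and t holds integer class labels 0..K-1.

import numpy as np

def train_multiclass_perceptron(X, t, n_classes, eta=1.0, k_max=1000):
    W = np.zeros((n_classes, X.shape[1]))       # one weight vector per class (rows)
    for _ in range(k_max):
        n_errors = 0
        for x_n, t_n in zip(X, t):
            j = np.argmax(W @ x_n)              # predicted class
            if j != t_n:                        # wrongly classified sample
                W[j]   -= eta * x_n             # lower the wrong class score
                W[t_n] += eta * x_n             # raise the true class score
                n_errors += 1
        if n_errors == 0:
            break
    return W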
Perceptron

Advantages:
• Very simple
• Does NOT assume the data follows a Gaussian distribution
• If the data is linearly separable, convergence is guaranteed

Limitations:
• Zero gradient for many solutions => several "perfect" solutions
• Data must be linearly separable

[Figures: non-separable data (will never converge), a suboptimal solution, and many "optimal" solutions]
Two famous ways of improving the Perceptron:

1. New activation function + new loss  =>  Logistic regression
2. New network architecture            =>  Multilayer Perceptron / CNN
Logistic regression (2D, 2 classes)

New activation function: the sigmoid

σ(t) = 1 / (1 + e^{-t})

y_w(x) = σ(w^T x) = 1 / (1 + e^{-w^T x}) ∈ [0, 1]

Neuron: [Diagram: inputs 1, x_1, x_2 with weights w_0, w_1, w_2, the dot product w^T x, then the sigmoid]
[Figure: the (x_1, x_2) plane; y_w(x) = 0.5 on the boundary, y_w(x) -> 1 on one side and y_w(x) -> 0 on the other]

Example: for the input x_n and weights w shown on the slide, w^T x_n = -1.94, so

y_w(x_n) = 1 / (1 + e^{1.94}) ≈ 0.125

Since 0.125 is lower than 0.5, x_n is behind the plane (it is assigned to the other class).
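A sketch of that forward computation; the weight and input values below are made up, not the ones from the slide.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([0.5, 3.6, -2.0])        # hypothetical (w0, w1, w2)
x = np.array([1.0, 0.4, 1.0])         # input with the leading 1 for the bias

p = sigmoid(w @ x)                    # y_w(x) = sigma(w^T x), in [0, 1]
print(p, "class 1" if p > 0.5 else "class 0")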
Logistic regression (K-D, 2 classes)

Like the Perceptron, logistic regression accommodates K-D input vectors:

y_w(x) = σ(w^T x)

[Diagram: inputs 1, x_1, x_2, …, x_K with weights w_0, w_1, …, w_K, then w^T x followed by the sigmoid]

What is the loss function?
With a sigmoid, the output can be interpreted as a conditional probability of class c_1 GIVEN x:

y_w(x) = σ(w^T x) ≈ P(c_1 | x, w)
P(c_0 | x, w) = 1 - y_w(x)

The cost function is minus the ln of the likelihood, i.e. the 2-class cross entropy:

L(y_w(x), D) = - Σ_{n=1}^N [ t_n ln y_w(x_n) + (1 - t_n) ln(1 - y_w(x_n)) ]

We can also show that

dL(y_w(x), D) / dw = Σ_{n=1}^N ( y_w(x_n) - t_n ) x_n

As opposed to the Perceptron, the gradient does not depend only on the wrongly classified samples: every sample contributes.
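A sketch of one cross-entropy gradient step for logistic regression, assuming targets t in {0, 1} and bias-augmented rows in X.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, t):
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, X, t):
    y = sigmoid(X @ w)
    return X.T @ (y - t)            # sum_n (y_w(x_n) - t_n) x_n

def gradient_step(w, X, t, eta=0.1):
    return w - eta * gradient(w, X, t)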
Logistic network

Advantages:
• More stable than the Perceptron
• More effective when the data is not separable

[Figure: decision boundaries found by the Perceptron vs. the logistic net on the same data]
And for K > 2 classes? New activation function: Softmax

y_{w,i}(x) = e^{w_i^T x} / Σ_c e^{w_c^T x}

Each output can be read as a conditional probability:

y_{w,0}(x) ≈ P(c_0 | x, W),   y_{w,1}(x) ≈ P(c_1 | x, W),   y_{w,2}(x) ≈ P(c_2 | x, W)

[Diagram: three output neurons computing w_0^T x, w_1^T x, w_2^T x, each followed by exp and then normalized]
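A minimal softmax sketch: turn the K class scores w_i^T x into probabilities. The weight and input values are hypothetical.

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

W = np.array([[0.4, 0.2, 0.1],            # hypothetical per-class weight vectors (rows)
              [0.1, 0.5, 0.3],
              [0.2, 0.1, 0.6]])
x = np.array([1.0, 2.0, 3.0])             # input with the leading 1 for the bias

p = softmax(W @ x)                        # p[i] ~ P(c_i | x, W), sums to 1
print(p, "predicted class:", np.argmax(p))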
And for K > 2 classes? Class labels become one-hot vectors (e.g. Cifar10):

'airplane'    t = [1 0 0 0 0 0 0 0 0 0]
'automobile'  t = [0 1 0 0 0 0 0 0 0 0]
'bird'        t = [0 0 1 0 0 0 0 0 0 0]
'cat'         t = [0 0 0 1 0 0 0 0 0 0]
'deer'        t = [0 0 0 0 1 0 0 0 0 0]
'dog'         t = [0 0 0 0 0 1 0 0 0 0]
'frog'        t = [0 0 0 0 0 0 1 0 0 0]
'horse'       t = [0 0 0 0 0 0 0 1 0 0]
'ship'        t = [0 0 0 0 0 0 0 0 1 0]
'truck'       t = [0 0 0 0 0 0 0 0 0 1]

Cross entropy loss for K > 2 classes:
L(y_W(x), D) = - Σ_{n=1}^N Σ_{k=1}^K t_{kn} ln y_{W,k}(x_n)

This is the K-class cross-entropy loss. As in the 2-class case, its gradient for each sample has the form ( y_{W,k}(x_n) - t_{kn} ) x_n.
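A sketch of that loss with one-hot targets T (N x K) and softmax outputs Y (N x K); the toy values are made up.

import numpy as np

def k_class_cross_entropy(Y, T, eps=1e-12):
    """L = - sum_n sum_k t_kn * ln y_k(x_n)."""
    return -np.sum(T * np.log(Y + eps))   # eps avoids log(0)

# toy check: near-perfect predictions give a near-zero loss
T = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
Y = np.array([[0.98, 0.01, 0.01], [0.05, 0.05, 0.90]])
print(k_class_cross_entropy(Y, T))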
Regularization

Different weights may give the same score. For example, with x = (1.0, 1.0, 1.0):

w_1^T = (1, 0, 0)          =>  w_1^T x = 1
w_2^T = (1/3, 1/3, 1/3)    =>  w_2^T x = 1

Solution: Maximum a posteriori

W* = argmin_W [ L(y_W(x), D) + λ R(W) ]

where L is the loss function, λ a constant and R(W) the regularization term, in general L1 or L2: R(W) = ||W||_1 or ||W||_2.
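A sketch of a regularized loss in that maximum-a-posteriori spirit; here the L2 penalty is one common choice, and data_loss stands for any of the losses above.

import numpy as np

def regularized_loss(w, X, t, data_loss, lam=0.01):
    """data_loss(w, X, t) is any data-fitting loss; lam is the constant."""
    return data_loss(w, X, t) + lam * np.sum(w ** 2)   # R(w) = squared L2 norm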
Wow! Loooots of information! Let's recap…

Neural networks, 2 classes:
• Perceptron: sign activation, y_w(x) = sign(w^T x) ∈ {-1, +1}
• Logistic regression: sigmoid activation, y_w(x) = σ(w^T x) ≈ P(c_1 | x)

[Diagram: a single neuron with inputs 1, x_1, x_2, …, x_K and weights w_0, w_1, …, w_K]

K-class neural networks:
• Multiclass Perceptron: K linear outputs w_i^T x, prediction = argmax_i y_{W,i}(x)
• Softmax activation: y_{w,i}(x) = e^{w_i^T x} / Σ_c e^{w_c^T x} ≈ P(c_i | x)

[Diagram: K output neurons, with an argmax for the multiclass Perceptron and exp + normalization (softmax) for the probabilistic version]
Loss functions (2 classes):

• Perceptron loss:  L(y_w(x), D) = - Σ_{x_n ∈ V} t_n w^T x_n,  where V is the set of wrongly classified samples;
  equivalently      L(y_w(x), D) = Σ_{n=1}^N max(0, - t_n w^T x_n)
• "Hinge Loss" or "SVM" Loss:  L(y_w(x), D) = Σ_{n=1}^N max(0, 1 - t_n w^T x_n)
• Cross-entropy loss:  L(y_w(x), D) = - Σ_{n=1}^N [ t_n ln y_w(x_n) + (1 - t_n) ln(1 - y_w(x_n)) ]

Loss functions (K classes):

• Multiclass Perceptron loss:  L(y_W(x), D) = Σ_{x_n ∈ V} ( w_j^T x_n - w_{t_n}^T x_n ),  where V is the set of wrongly classified samples;
  equivalently                 L(y_W(x), D) = Σ_{n=1}^N Σ_{j ≠ t_n} max(0, w_j^T x_n - w_{t_n}^T x_n)
• "Hinge Loss" or "SVM" Loss:  L(y_W(x), D) = Σ_{n=1}^N Σ_{j ≠ t_n} max(0, 1 + w_j^T x_n - w_{t_n}^T x_n)
• Cross-entropy loss with a Softmax:  L(y_W(x), D) = - Σ_{n=1}^N Σ_{k=1}^K t_{kn} ln y_{W,k}(x_n)

Maximum a posteriori:

W* = argmin_W [ Σ_{n=1}^N l(y_W(x_n), t_n) + λ R(W) ],   with R(W) = ||W||_1 or ||W||_2 and λ a constant.
Optimisation

w^[k+1] = w^[k] - η ∇L,   where ∇L is the gradient of the loss function and η the learning rate.

Stochastic gradient descent (SGD):

    Init w, k = 0
    DO
       k = k + 1
       FOR n = 1 to N
          w <- w - η ∇L(x_n)
    UNTIL every data point is well classified or k == MAX_ITER
Now, let's go DEEPER

Non-linearly separable training data: three classical solutions
1. Acquire more data
2. Use a non-linear classifier
3. Transform the data

This time we take solution 2: use a non-linear classifier.
2D, 2 classes, linear logistic regression

[Diagram: input layer (3 "neurons": 1, x_1, x_2) -> output layer (1 neuron with sigmoid)]

y_w(x) = σ(w^T x) ∈ ℝ

Let's add 3 neurons:

[Diagram: input layer (3 "neurons") -> first layer (3 neurons), with weights w_{i,j}]

σ(w_0^T x) ∈ ℝ,   σ(w_1^T x) ∈ ℝ,   σ(w_2^T x) ∈ ℝ

NOTE: the output of the first layer is a vector of 3 real values.
In matrix form:

σ( W^[0] x ) ∈ ℝ^3,   with W^[0] the 3x3 matrix of first-layer weights w_{i,j} and x = (1, x_1, x_2).

NOTE: the output of the first layer is a vector of 3 real values.
If we want a 2-class classification via a logistic regression (a cross-entropy loss), we must add an output neuron.

2-D, 2-class, 1 hidden layer:

Input layer (3 "neurons") -> hidden layer (3 neurons, weights W^[0]) -> output layer (1 neuron, weights w^[1])

y_W(x) = σ( w^[1]T σ( W^[0] x ) ) ∈ ℝ
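A sketch of that 1-hidden-layer forward pass, with a bias appended to the hidden activations so the parameter count matches the 13 mentioned below; the random weights are placeholders.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
W0 = rng.normal(size=(3, 3))         # hidden layer: 3 neurons, inputs (1, x1, x2)
w1 = rng.normal(size=4)              # output neuron: bias + 3 hidden values

x = np.array([1.0, 0.4, 1.0])        # input with the leading 1 for the bias
h = sigmoid(W0 @ x)                  # hidden activations (3 real values)
h_aug = np.concatenate(([1.0], h))   # append the hidden-layer bias neuron
y = sigmoid(w1 @ h_aug)              # y_W(x) = sigma(w1^T sigma(W0 x))
print(y)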
Visual simplification

[Diagram: the same network drawn with the weight matrices W^[0] and w^[1] instead of individual arrows; a bias neuron (constant 1) is appended to the input and to the hidden layer]

2-D, 2-class, 1 hidden layer:   y_W(x) = σ( w^[1]T σ( W^[0] x ) )

This network contains a total of 13 parameters:
• input layer -> hidden layer:   3x3
• hidden layer -> output layer:  1x4
2-D, 2-class, 1 hidden layer

Increasing the number of neurons = increasing the capacity of the model.
This network has 5x3 + 1x6 = 21 parameters.

Nb neurons VS capacity (demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html):

• No hidden neuron:  linear classification, underfitting (low capacity)
• 12 hidden neurons: non-linear classification, good result (good capacity)
• 60 hidden neurons: non-linear classification, overfitting (too large capacity)
k-D, 2 classes, 1 hidden layer

Increasing the dimensionality of the data = more columns in W^[0].
This network has 5x(k+1) + 1x6 parameters.

y_W(x) = σ( w^[1]T σ( W^[0] x ) )
k-D, 2 classes, 2 hidden layers

Adding a hidden layer = adding a matrix multiplication.
This network has 5x(k+1) + 6x3 + 1x4 parameters:

W^[0] ∈ ℝ^{5x(k+1)},   W^[1] ∈ ℝ^{3x6},   w^[2] ∈ ℝ^4

y_W(x) = σ( w^[2]T σ( W^[1] σ( W^[0] x ) ) )
k-D, 2 classes, 4 hidden layer network

This network has 5x(k+1) + 6x3 + 4x4 + 7x5 + 1x8 parameters:

W^[0] ∈ ℝ^{5x(k+1)},   W^[1] ∈ ℝ^{3x6},   W^[2] ∈ ℝ^{4x4},   W^[3] ∈ ℝ^{7x5},   w^[4] ∈ ℝ^8

y_W(x) = σ( w^[4]T σ( W^[3] σ( W^[2] σ( W^[1] σ( W^[0] x ) ) ) ) )

[Diagram: input layer -> hidden layers 1 to 4 -> output layer, with weight matrices W^[0], …, W^[3] and output weights w^[4]]

NOTE: more hidden layers = deeper network = more capacity.
Multilayer Perceptron

[Diagram: input x -> several hidden layers computing f(x) -> output y_W(x)]

Example: the input data x goes in, the last layer produces the output of the network y_W(x).

A K-class neural network has K output neurons.
k-D, 4 classes, 4 hidden layer network

The output layer now has 4 neurons, y_{W,0}(x), y_{W,1}(x), y_{W,2}(x), y_{W,3}(x), with output weights W^[4] (one row of weights per class):

y_W(x) = W^[4] σ( W^[3] σ( W^[2] σ( W^[1] σ( W^[0] x ) ) ) )

These four scores can be trained, for example, with a hinge loss.
Softmax output for the K-class network:

y_W(x) = softmax( W^[4] σ( W^[3] σ( W^[2] σ( W^[1] σ( W^[0] x ) ) ) ) )

i.e. each of the four scores is exponentiated and then normalized, giving y_{W,0}(x), …, y_{W,3}(x).

These outputs are trained with the cross entropy loss.
How to make a prediction? The forward pass.

For the small 2-class network above, the input x is propagated layer by layer:

1. σ( W^[0] x )                                   (first-layer activations)
2. σ( W^[1] σ( W^[0] x ) )                        (second-layer activations)
3. w^T σ( W^[1] σ( W^[0] x ) )                    (output score)
4. y_W(x) = σ( w^T σ( W^[1] σ( W^[0] x ) ) )      (final prediction)

and the loss is then evaluated on the prediction:

l( y_W(x), t ) = - [ t ln y_W(x) + (1 - t) ln( 1 - y_W(x) ) ]
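A sketch of this full forward pass plus the 2-class cross-entropy loss; all sizes and values are made up.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, W0, W1, w2):
    h0 = sigmoid(W0 @ x)        # first hidden layer
    h1 = sigmoid(W1 @ h0)       # second hidden layer
    return sigmoid(w2 @ h1)     # y_W(x) in [0, 1]

def loss(y, t):
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(0)
W0, W1, w2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3)), rng.normal(size=3)
x, t = np.array([1.0, 0.4, 1.0]), 1.0
y = forward(x, W0, W1, w2)
print(y, loss(y, t))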
How to optimize the network?

0 - Choose a regularization function:

W* = argmin_W [ Σ_{n=1}^N l( y_W(x_n), t_n ) + λ R(W) ],   R(W) = ||W||_1 or ||W||_2

1 - Choose a loss l( y_W(x_n), t_n ), for example the hinge loss or the cross entropy.
Do not forget to adjust the output layer to the loss you have chosen (cross entropy => Softmax).

2 - Compute the gradient of the loss with respect to each parameter W_{a,b,c},

∂/∂W_{a,b,c} [ Σ_{n=1}^N l( y_W(x_n), t_n ) + λ R(W) ],

and launch a gradient descent algorithm to update the parameters:

W_{a,b,c} <- W_{a,b,c} - η ∂/∂W_{a,b,c} [ Σ_{n=1}^N l( y_W(x_n), t_n ) + λ R(W) ]

These gradients are obtained by backpropagation.
Backpropagation

Consider the small network

y_W(x) = σ( w^T σ( W^[1] σ( W^[0] x ) ) )

trained with the 2-class cross entropy

l( y_W(x), t ) = - [ t ln y_W(x) + (1 - t) ln( 1 - y_W(x) ) ]
Break the forward pass into intermediate variables:

A = W^[0] x
B = σ(A)
C = W^[1] B
D = σ(C)
E = w^T D
y_W(x) = σ(E)
l( y_W(x), t )

The gradient of the loss with respect to the parameters,

∂/∂W [ Σ_{n=1}^N l( y_W(x_n), t_n ) ],

is obtained with the chain rule.
Chain rule recap

f(u) = u^2,   u(v) = 2v,   v(x) = 1/x

∂f/∂x = ?

∂f/∂x = (∂f/∂u)(∂u/∂v)(∂v/∂x) = 2u · 2 · (-1/x^2) = -8/x^3
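A quick numeric check of that chain-rule result at an arbitrary point x, comparing the analytic derivative with a finite-difference estimate.

def f_of_x(x):
    v = 1.0 / x
    u = 2.0 * v
    return u ** 2            # f = u^2 = 4 / x^2

def df_dx(x):
    u = 2.0 / x
    return 2 * u * 2 * (-1.0 / x ** 2)   # (df/du)(du/dv)(dv/dx) = -8 / x^3

x, h = 1.5, 1e-6
numeric = (f_of_x(x + h) - f_of_x(x - h)) / (2 * h)   # finite-difference estimate
print(df_dx(x), numeric)     # the two values should agree closely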
Applying the chain rule along the path A -> B -> C -> D -> E -> y_W(x) -> l, the gradient with respect to the first-layer weights W^[0] factorizes as

∂ l(y_W(x), t) / ∂ W^[0] = (∂l/∂y_W(x)) · (∂y_W(x)/∂E) · (∂E/∂D) · (∂D/∂C) · (∂C/∂B) · (∂B/∂A) · (∂A/∂W^[0])

Computing these factors from the output back toward the input is back propagation.
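A sketch of backpropagation for a smaller network with a single hidden layer, y = σ(w1·σ(W0 x)), trained with the 2-class cross entropy; sizes and random weights are placeholders.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop(x, t, W0, w1):
    # forward pass, keeping the intermediate values
    a = W0 @ x            # A = W0 x
    h = sigmoid(a)        # B = sigma(A)
    z = w1 @ h            # E = w1^T B
    y = sigmoid(z)        # y_W(x) = sigma(E)

    # backward pass (chain rule, from the loss back to the parameters)
    dz = y - t                    # dl/dE for the cross-entropy + sigmoid pair
    grad_w1 = dz * h              # dl/dw1
    dh = dz * w1                  # dl/dB
    da = dh * h * (1.0 - h)       # dl/dA (derivative of the sigmoid)
    grad_W0 = np.outer(da, x)     # dl/dW0
    return grad_W0, grad_w1

rng = np.random.default_rng(0)
W0, w1 = rng.normal(size=(3, 3)), rng.normal(size=3)
gW0, gw1 = backprop(np.array([1.0, 0.4, 1.0]), 1.0, W0, w1)
print(gW0.shape, gw1.shape)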
Activation functions

Sigmoid: σ(x) = 1 / (1 + e^{-x})
3 problems:
• Gradient saturates when the input is large
• Not zero centered
• exp() is an expensive operation

Tanh(x)  [LeCun et al., 1991]
• Output is zero-centered
• Small gradient when the input is large

ReLU(x) = max(0, x)  (Rectified Linear Unit)  [Krizhevsky et al., 2012]
• Large gradient for x > 0
• Super fast
• Output not centered at zero
• No gradient when x < 0

Leaky ReLU(x) = max(0.01x, x)  [Maas et al., 2013]
• No gradient saturation
• Super fast
• 0.01 is a hyperparameter

Parametric ReLU(x) = max(αx, x)  [He et al., 2015]
• No gradient saturation
• Super fast
• α is learned with backprop

In practice:
• By default, people use ReLU.
• Try Leaky ReLU / PReLU / ELU.
• Try tanh, but it might be sub-optimal.
• Do not use the sigmoid except at the output of a 2-class net.
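The activations above written as small numpy one-liners; a sketch only (in a real parametric ReLU, alpha would be a learned parameter rather than a plain argument).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def prelu(x, alpha):
    return np.maximum(alpha * x, x)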
How to classify an image?

[Figure (https://ml4a.github.io/ml4a/neural_networks/): a small image flattened into pixel values (Pixel 1, Pixel 2, …) plus a bias, fully connected to the first layer]
How to classify an image?

[Same figure]: many parameters (7,850 in Layer 1).
[Same figure with a 256x256 image, i.e. 65,536 pixels]: too many parameters (655,370 in Layer 1).
[Same figure with a 256x256x256 input, i.e. roughly 16M values]: waaay too many parameters (160M in Layer 1).
Full connections are too many:
[Figure: a 150-D input vector fully connected to 150 neurons in Layer 1 => 22,500 parameters!]

No full connection (each neuron only sees a small window of the input):
[Figure: the same 150-D input, each of the 150 neurons connected to 3 inputs => 450 parameters!]

Share weights (every neuron uses the same 3 weights):
[Figure: the same locally connected layer, all neurons sharing one 3-weight filter => 3 parameters!]

1 - We are learning convolution filters!
2 - Small number of parameters = we can make the network deep!
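A sketch of weight sharing as a 1-D convolution: one 3-tap filter slides over the whole 150-D input, so the layer only has 3 parameters (filter values below are placeholders).

import numpy as np

def conv1d_valid(x, w):
    """Slide the filter w over x (no padding) and return the feature map."""
    k = len(w)
    return np.array([w @ x[i:i + k] for i in range(len(x) - k + 1)])

x = np.random.randn(150)          # 150-D input vector
w = np.array([0.1, 0.2, 0.3])     # the 3 shared weights (one filter)
feature_map = conv1d_valid(x, w)  # 148 outputs, still only 3 parameters
print(feature_map.shape)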
[Figure: a small 3-tap filter sliding along the input values; each position produces one value of the output]

Convolution!

In 2D, each neuron of layer 1 is connected to a 3x3 patch of pixels, so layer 1 has only 9 parameters!!

The output of the convolution operation F(x, W^[0]) is called a feature map. Using several filters W_0^[0], W_1^[0], W_2^[0], W_3^[0], W_4^[0] gives a 5-feature-map convolution layer; in general, K filters give a K-feature-map convolution layer.
Conv layer 1 -> POOLING LAYER

Pooling layer goals:
• Reduce the spatial resolution of the feature maps
• Lower memory and computation requirements
• Provide partial invariance to position, scale and rotation

[Figures: pooling examples (stride = 1). Images: https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/]
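A sketch of one common pooling choice, 2x2 max pooling with stride 2 (not necessarily the exact variant shown in the slide figures).

import numpy as np

def max_pool_2x2(fmap):
    h, w = fmap.shape
    h2, w2 = h // 2, w // 2
    return fmap[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # 2x2 output: spatial resolution divided by 2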
2-class CNN:

Conv layer 1 -> Pool layer 1 -> Conv layer 2 -> Pool layer 2 -> (…) -> Fully connected layers -> y_W(x)

trained with the 2-class cross entropy:   l( y_W(x), t ) = - [ t ln y_W(x) + (1 - t) ln( 1 - y_W(x) ) ]

K-class CNN:

Conv layer 1 -> Pool layer 1 -> Conv layer 2 -> Pool layer 2 -> (…) -> Fully connected layers -> SOFTMAX -> y_{W,0}(x), y_{W,1}(x), y_{W,2}(x), …

trained with the cross entropy loss on the softmax outputs.
Nice example from the literature: S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks, arXiv:1806.07589, 2018.

ConvNets learn image-based characteristics: http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf
Batch processing

Processing one input at a time: for the logistic neuron from before, one input x_a gives y_w(x_a) = σ(w^T x_a) ≈ 0.125, and another input x_b gives y_w(x_b) ≈ 0.99.

Mini-batch processing

Instead of processing the inputs one by one, stack them into a matrix: a single matrix product then computes all the dot products at once, and the activation function is applied elementwise, giving e.g. the two predictions (0.125, 0.99) above in one shot, or four predictions for a mini-batch of four inputs.

[Figure: a mini-batch of 4 images pushed through the network at once, producing 4 predictions (Horse, Dog, Truck, …)]
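A sketch of mini-batch processing for the logistic neuron: stacking inputs as rows of a matrix X lets one matrix product compute every prediction at once (all values below are made up).

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([0.5, 3.6, -2.0])            # hypothetical logistic-neuron weights
X = np.array([[1.0, 0.4, 1.0],            # mini-batch: one augmented input per row
              [1.0, 2.1, 3.0],
              [1.0, 1.1, 0.2],
              [1.0, 0.5, 2.0]])

y = sigmoid(X @ w)                        # 4 predictions in a single product
print(y)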
Classical applications of ConvNets

• Classification
  S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks, arXiv:1806.07589, 2018

• Image segmentation
  Tran, P. V., 2016. A fully convolutional neural network for cardiac segmentation in short-axis MRI. arXiv:1604.00494
  Fang Liu, Zhaoye Zhou, et al., Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magnetic Resonance in Medicine, 2018. DOI:10.1002/mrm.26841
  Havaei M., Davy A., Warde-Farley D., Biard A., Courville A., Bengio Y., Pal C., Jodoin P-M., Larochelle H. (2017). Brain Tumor Segmentation with Deep Neural Networks, Medical Image Analysis, Vol. 35, 18-31

• Localization
  S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks, arXiv:1806.07589, 2018
Conclusion
• Linear classification (1 neuron network)
• Logistic regression
• Multilayer perceptron
• Conv Nets
• Many buzz words
– Softmax
– Loss
– Batch
– Gradient descent
– Etc.