Rapid Introduction to Machine Learning/Deep Learning
Hyeong In Choi
Seoul National University
Lecture 4b: Convolutional Network
October 30, 2015
Table of contents
1. Objectives of Lecture 4b
2. Convolution kernel
   2.1. Convolution
3. Convolutional network
   3.1. 2D convolution
   3.2. Analysis of LeCun's example
   3.3. Another example
   3.4. Classification
   3.5. Training convolutional network
1. Objectives of Lecture 4b
Objective 1
Learn the basic formalism of convolutional networks
Objective 2
Go through LeCun’s examples
Objective 3
Learn about the training of convolutional networks
2. Convolution kernel
2.1. Convolution
f(x) : function
K(x) : convolution kernel (filter)

(f ∗ K)(x) = ∫ f(y) K(x − y) dy = ∫ f(x − y) K(y) dy
Discrete convolution
x(n) : data
K(n) : convolution kernel (filter)

(x ∗ K)(n) = Σ_m x(m) K(n − m) = Σ_m x(n − m) K(m)
Example (1D Convolution)
(x ∗K)(5) = x(5 − 1)K(1) + x(5 − 0)K(0) + x(5 + 1)K(−1)
= x(4)K(1) + x(5)K(0) + x(6)K(−1)
= x(4) + 2x(5) − x(6)
(x ∗K)(5) = x(4) + 2x(5) − x(6)
(x ∗K)(6) = x(5) + 2x(6) − x(7)
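As a sanity check, here is a minimal Python/NumPy sketch of the 1D formula. The kernel values K(−1) = −1, K(0) = 2, K(1) = 1 are inferred from the coefficients in the example above; they are not stated explicitly in the transcript.

```python
import numpy as np

# Sketch of (x * K)(n) = sum_m x(n - m) K(m) with a kernel supported on {-1, 0, 1}.
def conv1d_at(x, K, n):
    return sum(x[n - m] * K[m] for m in K)

x = np.arange(10, dtype=float)          # toy data x(0), ..., x(9)
K = {-1: -1.0, 0: 2.0, 1: 1.0}          # convolution kernel (filter), values inferred above

print(conv1d_at(x, K, 5), x[4] + 2*x[5] - x[6])   # both give the same value
print(conv1d_at(x, K, 6), x[5] + 2*x[6] - x[7])   # likewise for n = 6
```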
Example (2D Convolution)
x(m, n) : data
K(p, q) : convolution kernel

(x ∗ K)(m, n) = Σ_{p,q} x(m − p, n − q) K(p, q)
(x ∗K)(3,4) = 2x(2,3) + 4x(2,4) − 2x(2,5)
+ 3x(3,3) + 6x(3,4) − 3x(3,5)
+ x(4,3) + 2x(4,4) − x(4,5)
(x ∗K)(3,5) = 2x(2,4) + 4x(2,5) − 2x(2,6)
+ 3x(3,4) + 6x(3,5) − 3x(3,6)
+ x(4,4) + 2x(4,5) − x(4,6)
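The same check for the 2D formula, as a minimal NumPy sketch; the 3 × 3 kernel values below are inferred from the coefficients of the two examples above and are not part of the original slides.

```python
import numpy as np

# Sketch of (x * K)(m, n) = sum_{p,q} x(m - p, n - q) K(p, q).
def conv2d_at(x, K, m, n):
    return sum(x[m - p, n - q] * K[(p, q)] for (p, q) in K)

x = np.arange(64, dtype=float).reshape(8, 8)    # toy 8 x 8 "image"
K = {( 1, 1): 2, ( 1, 0): 4, ( 1, -1): -2,      # kernel values inferred from the example
     ( 0, 1): 3, ( 0, 0): 6, ( 0, -1): -3,
     (-1, 1): 1, (-1, 0): 2, (-1, -1): -1}

lhs = conv2d_at(x, K, 3, 4)
rhs = (2*x[2, 3] + 4*x[2, 4] - 2*x[2, 5]
       + 3*x[3, 3] + 6*x[3, 4] - 3*x[3, 5]
       + x[4, 3] + 2*x[4, 4] - x[4, 5])
print(lhs, rhs)                                  # identical
```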
Boundary effect
Example
At the boundary
There is no x(−1), so (x ∗K)(1) is not defined
One may pad 0’s around boundaries
But the “valid” part of x ∗K is shorter than x itself
In the above example, the valid part of x ∗ K is an array of size 3
In general, if K is a (2p + 1) × (2q + 1) matrix and x is an M × N matrix, then the valid part of x ∗ K is an (M − 2p) × (N − 2q) matrix
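A quick way to see this size rule is with scipy.signal.convolve2d, which offers both a "valid" output and a zero-padded "same" output; the sizes below are made-up numbers for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.rand(10, 12)     # M x N image, M = 10, N = 12
K = np.ones((3, 5))            # (2p+1) x (2q+1) kernel, p = 1, q = 2

print(convolve2d(x, K, mode='valid').shape)   # (8, 8) = (M - 2p, N - 2q)
print(convolve2d(x, K, mode='same').shape)    # (10, 12): zero-padded at the boundary
```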
3. Convolutional network
3.1. 2D convolution
The same convolution kernel K is applied at every position
3.2. Analysis of LeCun’s example
3.2.1.
Pooling
Moving a 10 × 10 window over a 75 × 75 image results in a 66 × 66 matrix
Pooling is taken as one of the following:
Maximum (max pooling)
L^P sum (P = 1, 2, ⋯)
Average
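For concreteness, a small NumPy sketch of the three pooling choices applied to one 10 × 10 window; P = 2 is an arbitrary choice for the L^P sum.

```python
import numpy as np

w = np.random.rand(10, 10)     # one pooling window of a feature map
P = 2                          # arbitrary choice for the L^P sum

max_pool = w.max()                              # maximum (max pooling)
lp_pool  = (np.abs(w) ** P).sum() ** (1 / P)    # L^P sum
avg_pool = w.mean()                             # average
print(max_pool, lp_pool, avg_pool)
```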
Subsampling
Example: 5 × 5 subsampling (i.e., column stride = 5, row stride = 5)

14 = (66 − 1)/5 + 1

Sampling at (1,1), (1,6), ⋯, (1,66), (6,1), (6,6), ⋯, (6,66), ⋯, (66,1), (66,6), ⋯, (66,66)
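Combining the two steps, a minimal NumPy sketch: a 10 × 10 max-pooling window over a 75 × 75 map gives a 66 × 66 map, and 5 × 5 subsampling (stride 5) then gives a 14 × 14 map.

```python
import numpy as np

x = np.random.rand(75, 75)                 # a 75 x 75 feature map

pooled = np.empty((66, 66))
for i in range(66):
    for j in range(66):
        pooled[i, j] = x[i:i + 10, j:j + 10].max()   # 10 x 10 max pooling

subsampled = pooled[::5, ::5]              # keep rows/cols 1, 6, ..., 66 (1-based)
print(pooled.shape, subsampled.shape)      # (66, 66) (14, 14)
```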
Layer 3
There are 256 feature maps in Layer 3. Each of these 256 feature maps is obtained as follows:
Randomly select 16 feature maps out of the 64 feature maps in Layer 2
Convolution is done with a 16 × 9 × 9 3D pipe in the 16 × 14 × 14 volume. For each of the 16 selected feature maps, this defines a 2D convolution kernel; thus there are 16 kernels per output feature map, and 256 × 16 = 4096 2D kernels in total.
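A sketch of how one Layer-3 feature map could be computed under this description: each of the 16 selected 14 × 14 maps is convolved with its own 9 × 9 kernel and the valid parts are summed. The 6 × 6 output size is not stated on the slide; it follows from the valid-convolution rule above (14 − 2·4 = 6).

```python
import numpy as np
from scipy.signal import convolve2d

inputs  = np.random.rand(16, 14, 14)   # 16 selected Layer-2 feature maps
kernels = np.random.rand(16, 9, 9)     # one 9 x 9 kernel per selected map (the "3D pipe")

# Sum the 16 valid 2D convolutions to get one output feature map.
out = sum(convolve2d(inputs[i], kernels[i], mode='valid') for i in range(16))
print(out.shape)                       # (6, 6)
```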
Augmentation
The step from Convolution to Pooling and Subsampling can be augmented with rectification and Local Contrast Normalization (LCN)

x_i : i-th feature map
x_{ijk} : (j, k)-th pixel value of x_i

Rectification (R_abs): x_{ijk} → |x_{ijk}|
Subtractive normalization
x_{ijk} → v_{ijk} = x_{ijk} − Σ_{i,p,q} ω_{pq} x_{i,j+p,k+q},
where ω_{pq} is a Gaussian-like filter such that Σ_{i,p,q} ω_{pq} = 1
Divisive normalization
v_{ijk} → y_{ijk} = v_{ijk} / max(c, σ_{jk}),
where σ_{jk} = (Σ_{i,p,q} ω_{pq} v²_{i,j+p,k+q})^{1/2}
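Putting the two normalizations together, a minimal NumPy sketch of LCN on a stack of feature maps. The 9 × 9 window size and the Gaussian width are assumptions; the weights are normalized so that the sum over i, p, q equals 1, as on the slide.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_window(size=9, sigma=2.0):
    """Gaussian-like 2D window normalized to sum to 1 (assumed shape)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def lcn(x, c=1e-4, size=9):
    """x: (n_maps, H, W) stack of feature maps."""
    n_maps = x.shape[0]
    w = gaussian_window(size) / n_maps               # sum over i, p, q equals 1
    # subtractive normalization: v = x - weighted local mean over maps and window
    mean = sum(convolve2d(x[i], w, mode='same') for i in range(n_maps))
    v = x - mean
    # divisive normalization: y = v / max(c, sigma_jk)
    var = sum(convolve2d(v[i] ** 2, w, mode='same') for i in range(n_maps))
    sigma = np.sqrt(var)
    return v / np.maximum(c, sigma)

y = lcn(np.random.rand(16, 14, 14))
print(y.shape)                                        # (16, 14, 14)
```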
Summary: Model architecture
The number of n2 × n3 images (input feature maps) is n1
x_i : i-th image (input feature map)
k_{ij} : convolution kernel of size ℓ1 × ℓ2 operating on x_i to produce y_j, j = 1, ⋯, m1, where m1 is the number of output feature maps
y_j : j-th output feature map

y_j = g_j tanh(Σ_{i=1}^{n1} k_{ij} ∗ x_i)   or   y_j = g_j sigm(Σ_{i=1}^{n1} k_{ij} ∗ x_i),   for j = 1, ⋯, m1

[Hence g_j is called the gain coefficient]
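A minimal sketch of one F_CSG output map under these definitions (tanh variant), with made-up sizes: n1 = 3 input maps of size 12 × 12 and ℓ1 = ℓ2 = 5.

```python
import numpy as np
from scipy.signal import convolve2d

def fcsg_map(x, k_j, g_j):
    """One output map: y_j = g_j * tanh(sum_i k_ij * x_i)."""
    s = sum(convolve2d(x[i], k_j[i], mode='valid') for i in range(x.shape[0]))
    return g_j * np.tanh(s)

x   = np.random.rand(3, 12, 12)      # n1 = 3 input feature maps (assumed size)
k_j = np.random.rand(3, 5, 5)        # kernels k_ij for a fixed j, size l1 x l2 = 5 x 5
print(fcsg_map(x, k_j, g_j=0.5).shape)   # (8, 8) valid-part output map
```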
Notations
(a) C = convolution, S = sigm/tanh, G = gain ⇒ F_CSG
In LeCun's example above, Layer 1 is denoted by 64F^{9×9}_{CSG}
[64 = number of kernels, 9 × 9 = convolution kernel size]
(b) R_abs : rectification (= taking the absolute value)
(c) N : local contrast normalization (LCN)
(d) P_A : average pooling and subsampling; P_M : max pooling and subsampling
3.3. Another example
The above processes are denoted by

64F^{9×9}_{CSG} → R/N/P^{5×5}

The whole process is denoted by

64F^{9×9}_{CSG} → R/N/P^{5×5} → 256F^{9×9}_{CSG} → R/N/P^{4×4}
3.4. Classification
The final layer is fed into a classification layer, such as a softmax layer
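A minimal sketch of such a softmax classification layer on top of the flattened final feature maps; the feature size (256 maps of 4 × 4) and the 10 classes are made-up numbers for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

features = np.random.rand(256 * 4 * 4)     # flattened final feature maps (assumed size)
W = np.random.rand(10, features.size)      # fully connected weights, 10 classes (assumed)
b = np.zeros(10)

probs = softmax(W @ features + b)
print(probs.argmax(), probs.sum())         # predicted class, probabilities sum to 1
```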
These two layers are fully connected
Train the entire network in a supervised manner
Only the filters (kernels) are trained
The error-derivative backpropagation has to be worked out across the R/N/P layers
3.5. Training convolutional network
Weight training (learning)
Convolution weights
Training is done just like for the usual neural network. To enforce convolution, one needs to maintain an equality constraint on the weights (weight sharing).
Example
Suppose the weights satisfy ω1 = ω2 = ⋯ = ωN due to the convolution constraint.
During training we get ω̃1(new), ω̃2(new), ⋯, ω̃N(new).
To enforce the equality constraint, define

ω_i(new) = (1/N) Σ_{j=1}^{N} ω̃_j(new),   for i = 1, ⋯, N
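In code, enforcing the constraint amounts to averaging the unconstrained updates, as in this small NumPy sketch.

```python
import numpy as np

w_tilde = np.random.rand(8)                      # candidate updates for N = 8 tied weights
w_new = np.full_like(w_tilde, w_tilde.mean())    # w_i(new) = (1/N) * sum_j w_tilde_j(new)
print(w_new)                                     # all entries equal: constraint restored
```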
R/N/P
The computations in the R/N steps do not involve weights, so there is no need to be concerned with these steps during training
For the pooling step, a 1D example: pooling by 3, subsampling by 2 (stride 2)
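A sketch of that 1D example under one natural reading: windows of 3 neurons are pooled, and every second window is kept (stride 2).

```python
import numpy as np

x = np.arange(9, dtype=float)                            # toy 1D layer activations
windows = [x[i:i + 3] for i in range(0, len(x) - 2, 2)]  # windows of 3, stride 2
pooled = np.array([w.max() for w in windows])            # max pooling per window
print(pooled)
```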
Combine the weights affecting the subsampling neurons to come up with an effective network
Derivative of max function
max(x1, x2) = (1/2){|x1 − x2| + x1 + x2}

∂/∂x1 max(x1, x2) = 1 if x1 > x2, 0 else

Similarly,

∂/∂x1 max(x1, x2, x3) = 1 if x1 > x2 and x1 > x3, 0 else
If the pooling is average or another L^p norm, the derivatives can be easily computed
Once the derivatives of the pooling layers are computed, the backpropagation algorithm can be applied
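Concretely, backpropagation through a max-pooling window just routes the incoming gradient to the argmax, reflecting the derivative formula above; a minimal sketch:

```python
import numpy as np

def max_pool_grad(window, upstream_grad):
    """Gradient of max pooling: upstream_grad at the (unique) maximum, 0 elsewhere."""
    grad = np.zeros_like(window)
    grad.flat[np.argmax(window)] = upstream_grad
    return grad

w = np.array([[0.2, 1.5],
              [0.7, 0.1]])
print(max_pool_grad(w, upstream_grad=1.0))   # gradient flows only to the 1.5 entry
```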