Rapid Introduction to Machine Learning/Deep Learning
Hyeong In Choi
Seoul National University
Lecture 4b: Convolutional Network
October 30, 2015
Table of contents
1. Objectives of Lecture 4b
2. Convolution kernel
   2.1. Convolution
3. Convolutional network
   3.1. 2D convolution
   3.2. Analysis of LeCun's example
   3.3. Another example
   3.4. Classification
   3.5. Training convolutional network
1. Objectives of Lecture 4b
Objective 1
Learn the basic formalism of convolutional networks
Objective 2
Go through LeCun’s examples
Objective 3
Learn about the training of convolutional networks
2. Convolution kernel
2.1. Convolution
f(x) : function
K(x) : convolution kernel (filter)

(f ∗ K)(x) = ∫ f(y) K(x − y) dy = ∫ f(x − y) K(y) dy
Discrete convolution
x(n) : data
K(n) : convolution kernel (filter)

(x ∗ K)(n) = Σ_m x(m) K(n − m) = Σ_m x(n − m) K(m)
Example (1D Convolution)
(x ∗K)(5) = x(5 − 1)K(1) + x(5 − 0)K(0) + x(5 + 1)K(−1)
= x(4)K(1) + x(5)K(0) + x(6)K(−1)
= x(4) + 2x(5) − x(6)
(x ∗K)(5) = x(4) + 2x(5) − x(6)
(x ∗K)(6) = x(5) + 2x(6) − x(7)
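As a sanity check, here is a minimal Python/NumPy sketch of the 1D formula. The kernel values K(−1) = −1, K(0) = 2, K(1) = 1 are inferred from the coefficients in the example above; they are not stated explicitly in the transcript.

```python
import numpy as np

# Sketch of (x * K)(n) = sum_m x(n - m) K(m) with a kernel supported on {-1, 0, 1}.
def conv1d_at(x, K, n):
    return sum(x[n - m] * K[m] for m in K)

x = np.arange(10, dtype=float)          # toy data x(0), ..., x(9)
K = {-1: -1.0, 0: 2.0, 1: 1.0}          # convolution kernel (filter), values inferred above

print(conv1d_at(x, K, 5), x[4] + 2*x[5] - x[6])   # both give the same value
print(conv1d_at(x, K, 6), x[5] + 2*x[6] - x[7])   # likewise for n = 6
```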
Example (2D Convolution)
x(m, n) : data
K(p, q) : convolution kernel

(x ∗ K)(m, n) = Σ_{p,q} x(m − p, n − q) K(p, q)
(x ∗K)(3,4) = 2x(2,3) + 4x(2,4) − 2x(2,5)
+ 3x(3,3) + 6x(3,4) − 3x(3,5)
+ x(4,3) + 2x(4,4) − x(4,5)
(x ∗K)(3,5) = 2x(2,4) + 4x(2,5) − 2x(2,6)
+ 3x(3,4) + 6x(3,5) − 3x(3,6)
+ x(4,4) + 2x(4,5) − x(4,6)
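The same check for the 2D formula, as a minimal NumPy sketch; the 3 × 3 kernel values below are inferred from the coefficients of the two examples above and are not part of the original slides.

```python
import numpy as np

# Sketch of (x * K)(m, n) = sum_{p,q} x(m - p, n - q) K(p, q).
def conv2d_at(x, K, m, n):
    return sum(x[m - p, n - q] * K[(p, q)] for (p, q) in K)

x = np.arange(64, dtype=float).reshape(8, 8)    # toy 8 x 8 "image"
K = {( 1, 1): 2, ( 1, 0): 4, ( 1, -1): -2,      # kernel values inferred from the example
     ( 0, 1): 3, ( 0, 0): 6, ( 0, -1): -3,
     (-1, 1): 1, (-1, 0): 2, (-1, -1): -1}

lhs = conv2d_at(x, K, 3, 4)
rhs = (2*x[2, 3] + 4*x[2, 4] - 2*x[2, 5]
       + 3*x[3, 3] + 6*x[3, 4] - 3*x[3, 5]
       + x[4, 3] + 2*x[4, 4] - x[4, 5])
print(lhs, rhs)                                  # identical
```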
Boundary effect
Example
At the boundary
There is no x(−1), so (x ∗K)(1) is not defined
One may pad 0’s around boundaries
But the “valid” part of x ∗K is shorter than x itself
In the above example, the valid part of x ∗ K is an array of size 3
In general, if K is a (2p + 1) × (2q + 1) matrix and x is an M × N matrix, then the valid part of x ∗ K is an (M − 2p) × (N − 2q) matrix
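A quick way to see this size rule is with scipy.signal.convolve2d, which offers both a "valid" output and a zero-padded "same" output; the sizes below are made-up numbers for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.rand(10, 12)     # M x N image, M = 10, N = 12
K = np.ones((3, 5))            # (2p+1) x (2q+1) kernel, p = 1, q = 2

print(convolve2d(x, K, mode='valid').shape)   # (8, 8) = (M - 2p, N - 2q)
print(convolve2d(x, K, mode='same').shape)    # (10, 12): zero-padded at the boundary
```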
3. Convolutional network
3.1. 2D convolution
The same convolution kernel K is applied at every position
3.2. Analysis of LeCun’s example
3.2.1.
Pooling
Moving a 10 × 10 window over a 75 × 75 image results in a 66 × 66 matrix
Pooling is taken as one of the following:
Maximum (max pooling)
L^P sum (P = 1, 2, ⋯)
Average
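For concreteness, a small NumPy sketch of the three pooling choices applied to one 10 × 10 window; P = 2 is an arbitrary choice for the L^P sum.

```python
import numpy as np

w = np.random.rand(10, 10)     # one pooling window of a feature map
P = 2                          # arbitrary choice for the L^P sum

max_pool = w.max()                              # maximum (max pooling)
lp_pool  = (np.abs(w) ** P).sum() ** (1 / P)    # L^P sum
avg_pool = w.mean()                             # average
print(max_pool, lp_pool, avg_pool)
```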
Subsampling
Example: 5 × 5 subsampling (i.e., column stride = 5, row stride = 5)

14 = (66 − 1)/5 + 1

Sampling at (1,1), (1,6), ⋯, (1,66), (6,1), (6,6), ⋯, (6,66), ⋯, (66,1), (66,6), ⋯, (66,66)
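Combining the two steps, a minimal NumPy sketch: a 10 × 10 max-pooling window over a 75 × 75 map gives a 66 × 66 map, and 5 × 5 subsampling (stride 5) then gives a 14 × 14 map.

```python
import numpy as np

x = np.random.rand(75, 75)                 # a 75 x 75 feature map

pooled = np.empty((66, 66))
for i in range(66):
    for j in range(66):
        pooled[i, j] = x[i:i + 10, j:j + 10].max()   # 10 x 10 max pooling

subsampled = pooled[::5, ::5]              # keep rows/cols 1, 6, ..., 66 (1-based)
print(pooled.shape, subsampled.shape)      # (66, 66) (14, 14)
```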
Layer 3
There are 256 feature maps in Layer 3. Each of these 256 feature maps is obtained as follows:
Randomly select 16 feature maps out of the 64 feature maps in Layer 2
Convolution is done with a 16 × 9 × 9 3D pipe in the 16 × 14 × 14 volume. For each of the 16 selected feature maps, this defines a 2D convolution kernel; thus there are 16 kernels per output feature map, and 256 × 16 = 4096 2D kernels in total.
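A sketch of how one Layer-3 feature map could be computed under this description: each of the 16 selected 14 × 14 maps is convolved with its own 9 × 9 kernel and the valid parts are summed. The 6 × 6 output size is not stated on the slide; it follows from the valid-convolution rule above (14 − 2·4 = 6).

```python
import numpy as np
from scipy.signal import convolve2d

inputs  = np.random.rand(16, 14, 14)   # 16 selected Layer-2 feature maps
kernels = np.random.rand(16, 9, 9)     # one 9 x 9 kernel per selected map (the "3D pipe")

# Sum the 16 valid 2D convolutions to get one output feature map.
out = sum(convolve2d(inputs[i], kernels[i], mode='valid') for i in range(16))
print(out.shape)                       # (6, 6)
```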
Augmentation
The step from Convolution to Pooling and Subsampling can be augmented with rectification and Local Contrast Normalization (LCN)

x_i : i-th feature map
x_{ijk} : (j, k)-th pixel value of x_i

Rectification (R_abs): x_{ijk} → |x_{ijk}|
Subtractive normalization
x_{ijk} → v_{ijk} = x_{ijk} − Σ_{i,p,q} ω_{pq} x_{i,j+p,k+q},
where ω_{pq} is a Gaussian-like filter such that Σ_{i,p,q} ω_{pq} = 1
Divisive normalization
v_{ijk} → y_{ijk} = v_{ijk} / max(c, σ_{jk}),
where σ_{jk} = (Σ_{i,p,q} ω_{pq} v²_{i,j+p,k+q})^{1/2}
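Putting the two normalizations together, a minimal NumPy sketch of LCN on a stack of feature maps. The 9 × 9 window size and the Gaussian width are assumptions; the weights are normalized so that the sum over i, p, q equals 1, as on the slide.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_window(size=9, sigma=2.0):
    """Gaussian-like 2D window normalized to sum to 1 (assumed shape)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def lcn(x, c=1e-4, size=9):
    """x: (n_maps, H, W) stack of feature maps."""
    n_maps = x.shape[0]
    w = gaussian_window(size) / n_maps               # sum over i, p, q equals 1
    # subtractive normalization: v = x - weighted local mean over maps and window
    mean = sum(convolve2d(x[i], w, mode='same') for i in range(n_maps))
    v = x - mean
    # divisive normalization: y = v / max(c, sigma_jk)
    var = sum(convolve2d(v[i] ** 2, w, mode='same') for i in range(n_maps))
    sigma = np.sqrt(var)
    return v / np.maximum(c, sigma)

y = lcn(np.random.rand(16, 14, 14))
print(y.shape)                                        # (16, 14, 14)
```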
Summary: Model architecture
The number of n2 × n3 images (input feature maps) is n1
x_i : i-th image (input feature map)
k_{ij} : convolution kernel of size ℓ1 × ℓ2 operating on x_i to produce y_j, j = 1, ⋯, m1, where m1 is the number of output feature maps
y_j : j-th output feature map

y_j = g_j tanh(Σ_{i=1}^{n1} k_{ij} ∗ x_i)   or   y_j = g_j sigm(Σ_{i=1}^{n1} k_{ij} ∗ x_i),   for j = 1, ⋯, m1

[Hence g_j is called the gain coefficient]
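A minimal sketch of one F_CSG output map under these definitions (tanh variant), with made-up sizes: n1 = 3 input maps of size 12 × 12 and ℓ1 = ℓ2 = 5.

```python
import numpy as np
from scipy.signal import convolve2d

def fcsg_map(x, k_j, g_j):
    """One output map: y_j = g_j * tanh(sum_i k_ij * x_i)."""
    s = sum(convolve2d(x[i], k_j[i], mode='valid') for i in range(x.shape[0]))
    return g_j * np.tanh(s)

x   = np.random.rand(3, 12, 12)      # n1 = 3 input feature maps (assumed size)
k_j = np.random.rand(3, 5, 5)        # kernels k_ij for a fixed j, size l1 x l2 = 5 x 5
print(fcsg_map(x, k_j, g_j=0.5).shape)   # (8, 8) valid-part output map
```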
Notations
(a) C = convolution, S = sigm/tanh, G = gain ⇒ F_CSG
In LeCun's example above, Layer 1 is denoted by 64F^{9×9}_{CSG}
[64 = number of kernels, 9 × 9 = convolution kernel size]
(b) R_abs : rectification (= taking the absolute value)
(c) N : local contrast normalization (LCN)
(d) P_A : average pooling and subsampling; P_M : max pooling and subsampling
3.3. Another example
The above processes are denoted by

64F^{9×9}_{CSG} → R/N/P^{5×5}

The whole process is denoted by

64F^{9×9}_{CSG} → R/N/P^{5×5} → 256F^{9×9}_{CSG} → R/N/P^{4×4}
3.4. Classification
The final layer is fed into a classification layer, such as a softmax layer
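A minimal sketch of such a softmax classification layer on top of the flattened final feature maps; the feature size (256 maps of 4 × 4) and the 10 classes are made-up numbers for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

features = np.random.rand(256 * 4 * 4)     # flattened final feature maps (assumed size)
W = np.random.rand(10, features.size)      # fully connected weights, 10 classes (assumed)
b = np.zeros(10)

probs = softmax(W @ features + b)
print(probs.argmax(), probs.sum())         # predicted class, probabilities sum to 1
```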
These two layers are fully connected
Train the entire network in a supervised manner
Only the filters (kernels) are trained
The error-derivative backpropagation has to be worked out across the R/N/P layers
3.5. Training convolutional network
Weight training (learning)
Convolution weights
Training is done just like for the usual neural network. To enforce convolution, one needs to maintain an equality constraint on the weights (weight sharing).
Example
Suppose the weights satisfy ω1 = ω2 = ⋯ = ωN due to the convolution constraint.
During training we get ω̃1(new), ω̃2(new), ⋯, ω̃N(new).
To enforce the equality constraint, define

ω_i(new) = (1/N) Σ_{j=1}^{N} ω̃_j(new),   for i = 1, ⋯, N
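In code, enforcing the constraint amounts to averaging the unconstrained updates, as in this small NumPy sketch.

```python
import numpy as np

w_tilde = np.random.rand(8)                      # candidate updates for N = 8 tied weights
w_new = np.full_like(w_tilde, w_tilde.mean())    # w_i(new) = (1/N) * sum_j w_tilde_j(new)
print(w_new)                                     # all entries equal: constraint restored
```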
R/N/P
The computations in the R/N steps do not involve weights, so there is no need to be concerned with these steps during training
For the pooling step, a 1D example: pooling by 3, subsampling by 2 (stride 2)
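A sketch of that 1D example under one natural reading: windows of 3 neurons are pooled, and every second window is kept (stride 2).

```python
import numpy as np

x = np.arange(9, dtype=float)                            # toy 1D layer activations
windows = [x[i:i + 3] for i in range(0, len(x) - 2, 2)]  # windows of 3, stride 2
pooled = np.array([w.max() for w in windows])            # max pooling per window
print(pooled)
```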
Combine the weights affecting the subsampling neurons to come up with an effective network
Derivative of max function
max(x1, x2) = (1/2){|x1 − x2| + x1 + x2}

∂/∂x1 max(x1, x2) = 1 if x1 > x2, 0 else

Similarly,

∂/∂x1 max(x1, x2, x3) = 1 if x1 > x2 and x1 > x3, 0 else
If the pooling is average or another L^p norm, the derivatives can be easily computed
Once the derivatives of the pooling layers are computed, the backpropagation algorithm can be applied
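Concretely, backpropagation through a max-pooling window just routes the incoming gradient to the argmax, reflecting the derivative formula above; a minimal sketch:

```python
import numpy as np

def max_pool_grad(window, upstream_grad):
    """Gradient of max pooling: upstream_grad at the (unique) maximum, 0 elsewhere."""
    grad = np.zeros_like(window)
    grad.flat[np.argmax(window)] = upstream_grad
    return grad

w = np.array([[0.2, 1.5],
              [0.7, 0.1]])
print(max_pool_grad(w, upstream_grad=1.0))   # gradient flows only to the 1.5 entry
```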