TRANSCRIPT
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) - Neural Networks and Deep Learning
Convolutional Neural Nets (ConvNets)
Nicolas [email protected]
http://cedric.cnam.fr/vertigo/Cours/ml2/
Département Informatique, Conservatoire National des Arts et Métiers (Cnam)
Motivation Convolution Pooling ConvNets
Outline
1 Limitations of Fully Connected Networks
2 Convolution
3 Pooling
4 Deep Convolutional Neural Nets
[email protected] RCP209 / Deep Learning 2/ 64
Limitations of Fully Connected Networks
Credit: M.A. Ranzato
• Scalability issue with fully connected networks: parameter explosion, even for a single hidden layer!
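The parameter explosion can be made concrete with a back-of-the-envelope count; the image and layer sizes below are illustrative choices, not from the slides:

```python
# Illustrative sizes: a single fully connected hidden layer on a
# 1000x1000 gray-scale image vs one shared 3x3 convolution filter.
input_dim = 1000 * 1000                  # flattened image
hidden_units = 1000                      # one modest hidden layer
fc_params = input_dim * hidden_units + hidden_units  # weights + biases

conv_params = 3 * 3 + 1                  # one 3x3 filter + bias, shared
                                         # across all image positions
print(fc_params)    # 1000001000: about one billion parameters
print(conv_params)  # 10
```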
[email protected] RCP209 / Deep Learning 3/ 64
Motivation Convolution Pooling ConvNets
Limitations of Fully Connected Networks
• Signal data: importance of local structure
1D signals: local temporal structure
2D signals: local spatial structure
[email protected] RCP209 / Deep Learning 4/ 64
Motivation Convolution Pooling ConvNets
Limitations of Fully Connected Networks
• BUT: with a vectorial representation of the inputs, the ordering of the dimensions is arbitrary!
[email protected] RCP209 / Deep Learning 5/ 64
Motivation Convolution Pooling ConvNets
Limitations of Fully Connected Networks
• MNIST example: same performance with original and pixel-permuted images! However, local information is obviously useful
[email protected] RCP209 / Deep Learning 6/ 64
Motivation Convolution Pooling ConvNets
Limitations of Fully Connected Networks
• Prior knowledge on the data structure ⇒ useful
• Example: MLP training for shape recognition (rectangle, triangle, diamond, star) from color images
[email protected] RCP209 / Deep Learning 7/ 64
Motivation Convolution Pooling ConvNets
Limitations of Fully Connected Networks
• Invariance & stability
• Expectations:
Small deformation ⇒ similar representations
Large deformation ⇒ dissimilar representations
• Translation invariance difficult with fully connected networks; likewise for local scale, rotation, deformations, etc.
[email protected] RCP209 / Deep Learning 8/ 64
Motivation Convolution Pooling ConvNets
Convolutional Neural Networks

Overcome most of the aforementioned limitations:
• Significantly limit the number of free parameters
• Explicitly focus on the local structure of the signal
• Able to gain invariance to local deformations
• All parameters remain trainable with error back-propagation
[email protected] RCP209 / Deep Learning 9/ 64
Motivation Convolution Pooling ConvNets
Outline
1 Limitations of Fully Connected Networks
2 Convolution
3 Pooling
4 Deep Convolutionnal Neural Nets
[email protected] RCP209 / Deep Learning 10/ 64
Motivation Convolution Pooling ConvNets
Convolution in 1D (Signal)
• Discrete 1D convolution with a Finite Impulse Response (FIR) filter h, of size d (odd)
• Input signal f(i), i ∈ {1;N}; output signal f′(i), i ∈ {1;N}
• Convolution: operator T ∶ f → f′ = T[f] = f ⋆ h

f′(i) = (f ⋆ h)(i) = ∑_{n=−(d−1)/2}^{(d−1)/2} f(i − n) h(n)
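A minimal sketch of this formula in Python; zero padding at the borders is an assumption, since the slides do not specify the boundary handling:

```python
import numpy as np

def conv1d(f, h):
    """Discrete 1D convolution with an odd-sized FIR filter h,
    zero-padded so the output has the same length as the input."""
    d = len(h)
    r = (d - 1) // 2
    fp = np.pad(f, r)                       # zero padding at the borders
    out = np.zeros(len(f))
    for i in range(len(f)):
        # f'(i) = sum_n f(i - n) h(n), n in [-r, r]
        out[i] = sum(fp[i + r - n] * h[n + r] for n in range(-r, r + 1))
    return out

f = np.array([0., 0., 1., 0., 0.])          # unit impulse
h = np.array([1., 2., 3.])                  # h(-1)=1, h(0)=2, h(1)=3
print(conv1d(f, h))                         # [0. 1. 2. 3. 0.]: the
                                            # impulse reproduces the filter
```

Convolving with an impulse reproduces the filter, which is a quick sanity check of the indexing.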
[email protected] RCP209 / Deep Learning 11/ 64
Motivation Convolution Pooling ConvNets
Convolution in 2D (Images)
• Discrete 2D convolution with FIR filter h (size d odd), T ∶ f → f′ = T[f] = f ⋆ h:

f′(i, j) = (f ⋆ h)(i, j) = ∑_{n=−(d−1)/2}^{(d−1)/2} ∑_{m=−(d−1)/2}^{(d−1)/2} f(i − n, j − m) h(n, m)

• Ex with a 3 × 3 filter:
f′(i, j) = w1 f(i − 1, j − 1) + w2 f(i − 1, j) + w3 f(i − 1, j + 1)
         + w4 f(i, j − 1) + w5 f(i, j) + w6 f(i, j + 1)
         + w7 f(i + 1, j − 1) + w8 f(i + 1, j) + w9 f(i + 1, j + 1)

• Convolution processing:
1. Apply central symmetry to the filter: h(n, m) ⇒ h(−n, −m) = g(n, m)
2. ∀(i, j), compute the weighted sum between the image values around f(i, j) and the filter coefficients g(n, m)

h = [w9 w8 w7; w6 w5 w4; w3 w2 w1]    g = [w1 w2 w3; w4 w5 w6; w7 w8 w9]
[email protected] RCP209 / Deep Learning 12/ 64
Motivation Convolution Pooling ConvNets
2D Convolution vs Cross-Correlation
• 2D Convolution: f′(i, j) = (f ⋆ h)(i, j) = ∑_n ∑_m f(i − n, j − m) h(n, m)
• Cross-Correlation: f′(i, j) = (f ⊗ h)(i, j) = ∑_n ∑_m f(i + n, j + m) h(n, m)

Cross-correlation ∼ convolution without symmetrizing the mask!

h = [−4 0 0; 0 0 0; 0 0 4] ⇒ g = [4 0 0; 0 0 0; 0 0 −4]
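The relationship between the two operators can be checked numerically. The sketch below assumes 'valid' output positions only (no padding), and compares a direct implementation of the convolution sum with cross-correlation against the centrally symmetric mask:

```python
import numpy as np

def xcorr2d(f, h):
    """2D cross-correlation (no filter flip), 'valid' positions only."""
    d = h.shape[0]
    out = np.zeros((f.shape[0] - d + 1, f.shape[1] - d + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + d, j:j + d] * h)
    return out

def conv2d(f, h):
    """2D convolution, computed directly from
    f'(i, j) = sum_n sum_m f(i - n, j - m) h(n, m)."""
    d = h.shape[0]
    r = (d - 1) // 2
    out = np.zeros((f.shape[0] - d + 1, f.shape[1] - d + 1))
    for oi in range(out.shape[0]):
        for oj in range(out.shape[1]):
            i, j = oi + r, oj + r           # center of the filter in f
            out[oi, oj] = sum(f[i - n, j - m] * h[n + r, m + r]
                              for n in range(-r, r + 1)
                              for m in range(-r, r + 1))
    return out

rng = np.random.default_rng(0)
f = rng.normal(size=(5, 5))
h = rng.normal(size=(3, 3))
# Convolution == cross-correlation with the centrally symmetric mask g
assert np.allclose(conv2d(f, h), xcorr2d(f, np.flip(h)))
```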
[email protected] RCP209 / Deep Learning 13/ 64
Motivation Convolution Pooling ConvNets
2D Convolution / Cross-Correlation: Interpretation
• Cross-correlation: ∀(i, j), dot product between an image region and the filter h
Large f′(i, j) ⇒ filter and region are aligned
• Input: 2D image ⇒ output: 2D map
[email protected] RCP209 / Deep Learning 14/ 64
Motivation Convolution Pooling ConvNets
2D Convolution / Cross-Correlation: Example
• Cross-correlation: the output map highlights the locations in the input image that are similar to the mask
[email protected] RCP209 / Deep Learning 15/ 64
Motivation Convolution Pooling ConvNets
2D Convolution / Cross-Correlation: Real Image Example
Credit: K. Matsui
[email protected] RCP209 / Deep Learning 16/ 64
Motivation Convolution Pooling ConvNets
Strided Convolution
f′(i, j) = (f ⋆ h)(i, j) = ∑_{n=−(d−1)/2}^{(d−1)/2} ∑_{m=−(d−1)/2}^{(d−1)/2} f(i − n, j − m) h(n, m)

• Standard convolution: stride 1 ⇒ compute f′(i, j) for (i, j) ∈ {1;N} × {1;M}
• Strided convolution: compute f′(i, j) only for i ∈ {1, 1 + s, 1 + 2s, ..., N} (idem for j)
• Ex: s = 2, N = M = 5, d = 3 ⇒ reduced map size (3 × 3)
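A sketch of strided cross-correlation reproducing the 5 × 5 → 3 × 3 example; zero padding is assumed so that the evaluation positions cover the full input grid {1, 1+s, ...}:

```python
import numpy as np

def strided_xcorr(f, h, s):
    """Cross-correlation evaluated every s pixels, with zero padding
    so positions run over the whole input grid."""
    d = h.shape[0]
    r = (d - 1) // 2
    fp = np.pad(f, r)                       # zero padding at the borders
    rows = range(0, f.shape[0], s)
    cols = range(0, f.shape[1], s)
    out = np.zeros((len(rows), len(cols)))
    for oi, i in enumerate(rows):
        for oj, j in enumerate(cols):
            out[oi, oj] = np.sum(fp[i:i + d, j:j + d] * h)
    return out

f = np.ones((5, 5))
h = np.ones((3, 3))
print(strided_xcorr(f, h, s=2).shape)       # (3, 3): reduced map size
```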
[email protected] RCP209 / Deep Learning 17/ 64
Motivation Convolution Pooling ConvNets
Convolution: Example for Gradient Computation
Ix ≈ I ⋆ Mx
• Gradient: G⃗(x, y) = (∂I/∂x  ∂I/∂y)ᵀ = (Ix  Iy)ᵀ
• Convolution approximation: Ix ≈ I ⋆ Mx, Iy ≈ I ⋆ My

Mx = (1/4) · [−1 0 1; −2 0 2; −1 0 1]    My = (1/4) · [−1 −2 −1; 0 0 0; 1 2 1]
[email protected] RCP209 / Deep Learning 18/ 64
Motivation Convolution Pooling ConvNets
Convolution with Multiple Filters: Edge Detection
Ix ∼ filter 1 | Iy ∼ filter 2 | Ie = Ix² + Iy² | Ie,t
Ie,t: thresholded Ie ⇒ edge detector!
[email protected] RCP209 / Deep Learning 19/ 64
Motivation Convolution Pooling ConvNets
Convolution: Linear Filtering
• Convolution can be viewed as multiplication by a matrix
• 1D case: Toeplitz matrix
• 2D case: doubly block circulant matrix
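The 1D case can be verified by building the matrix explicitly. This is a sketch: `conv_matrix` is an illustrative helper, with 'same'-size output and zero padding assumed, so the matrix is Toeplitz but truncated at the borders:

```python
import numpy as np

def conv_matrix(h, N):
    """Matrix T such that T @ f equals the 1D convolution of f with h
    ('same' output size, zero padding), for an odd-sized filter h.
    Each diagonal of T is constant: a (truncated) Toeplitz matrix."""
    d = len(h)
    r = (d - 1) // 2
    T = np.zeros((N, N))
    for i in range(N):
        for n in range(-r, r + 1):          # f'(i) = sum_n f(i-n) h(n)
            if 0 <= i - n < N:
                T[i, i - n] = h[n + r]
    return T

h = np.array([1., 2., 3.])
f = np.array([4., 0., 1., 2., 5.])
T = conv_matrix(h, 5)
assert np.allclose(T @ f, np.convolve(f, h, mode='same'))
```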
[email protected] RCP209 / Deep Learning 20/ 64
Motivation Convolution Pooling ConvNets
Convolution vs Fully Connected Layers
Convolution: overcome fully connected network limitations
1. Local connection ⇒ drastic reduction in the number of parameters
a) Sparse connectivity: each hidden unit is only connected to a local patch
Credit: M.A. Ranzato
[email protected] RCP209 / Deep Learning 21/ 64
Motivation Convolution Pooling ConvNets
Convolution vs Fully Connected Layers
Convolution: overcome fully connected network limitations
1. Local connection
b) Weight sharing: same feature detected across all image positions
Credit: M.A. Ranzato
• Convolution: the number of parameters is independent of the input image size, unlike fully connected layers!
[email protected] RCP209 / Deep Learning 22/ 64
Motivation Convolution Pooling ConvNets
Translation-Invariant Feature Detection
• Convolution with weight sharing: the same feature is detected across all image positions
• Very relevant prior for object classification / scene recognition
[email protected] RCP209 / Deep Learning 23/ 64
Motivation Convolution Pooling ConvNets
Convolution vs Fully Connected Layers
Convolution: overcome fully connected network limitations
2. Convolution exploits local spatial structure
Analyses shape/appearance in a local neighborhood
Permuting the input images ⇒ very different local info ⇒ different classification performance
[email protected] RCP209 / Deep Learning 24/ 64
Motivation Convolution Pooling ConvNets
Convolution vs Fully Connected Layers
Convolution: overcome fully connected network limitations
3. Convolution: equivariance property
Equivariance: f equivariant to g ⇔ f[g(x)] = g[f(x)]
Convolution equivariant to translation:

T[x(t − τ)] = x(t − τ) ⋆ h(t) = (x ⋆ h)(t − τ) = y(t − τ)
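Translation equivariance can be checked numerically. Circular (periodic) boundaries are assumed here so that the property holds exactly, without border effects:

```python
import numpy as np

def circ_conv(x, h):
    """Circular 1D convolution (periodic boundary):
    y(i) = sum_n x((i - n) mod N) h(n)."""
    N = len(x)
    return np.array([sum(x[(i - n) % N] * h[n] for n in range(len(h)))
                     for i in range(N)])

x = np.array([0., 1., 3., 2., 0., 0.])
h = np.array([0.5, 0.25, 0.25])
tau = 2
# T[x(t - tau)]: translate, then convolve
shifted_then_conv = circ_conv(np.roll(x, tau), h)
# y(t - tau): convolve, then translate
conv_then_shifted = np.roll(circ_conv(x, h), tau)
assert np.allclose(shifted_then_conv, conv_then_shifted)
```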
[email protected] RCP209 / Deep Learning 25/ 64
Motivation Convolution Pooling ConvNets
Convolution vs Fully Connected Layers
Convolution: overcome fully connected network limitations
3. Convolution: translation equivariance
Ensures that a deformation, i.e. a translation, is encoded in the maps
Local translation invariance: local pooling ⇒ next!
Credit: G. Hinton
[email protected] RCP209 / Deep Learning 26/ 64
Motivation Convolution Pooling ConvNets
Convolution and Non-Linearity
source image I | I ⋆ Mx | ∣I ⋆ Mx∣ | (I ⋆ Mx)²

• Convolution: a linear operation for each feature map
Gradient Ix ≈ I ⋆ Mx, with Mx = (1/4) · [−1 0 1; −2 0 2; −1 0 1]
• Convolution + point-wise non-linearity: feature detection
Ex: σ(z) = z², σ(z) = ∣z∣ ⇒ activate for large positive and negative Ix values
[email protected] RCP209 / Deep Learning 27/ 64
Motivation Convolution Pooling ConvNets
Convolution and Non-Linearity
source image I | I ⋆ Mx | Sigmoid | ReLU

• Other non-linearities: only activate for Ix > 0
Sigmoid (with bias): σ(z) = (1 + e^(−a(z−b)))^(−1), a = 8·10⁻², b = 50
ReLU (see later): σ(z) = max(0, z)
[email protected] RCP209 / Deep Learning 28/ 64
Motivation Convolution Pooling ConvNets
Outline
1 Limitations of Fully Connected Networks
2 Convolution
3 Pooling
4 Deep Convolutionnal Neural Nets
[email protected] RCP209 / Deep Learning 29/ 64
Motivation Convolution Pooling ConvNets
Pooling
• Pooling: statistical aggregation of a set of values, e.g. x = {x1, ..., xN}
• Output: a single scalar value. Possible pooling functions:
Max pooling: pool(x) = max_{i∈{1;N}} xi
Average pooling: pool(x) = (1/N) ∑_{i=1}^{N} xi
• Ex: max = 8, avg = 4.8
• Goal: capture the statistics of the responses
Invariance wrt the position of the values
Permuting the values ⇒ same features
[email protected] RCP209 / Deep Learning 30/ 64
Motivation Convolution Pooling ConvNets
ℓp Pooling
• ℓp pooling: pool(x) = ((1/N) ∑_{i=1}^{N} xiᵖ)^(1/p)
• Smooth transition from average (p = 1) to max (p → ∞)
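The three pooling functions can be sketched as follows; the sample values are illustrative, chosen to match the slide's max = 8 and avg = 4.8:

```python
import numpy as np

def max_pool(x):
    return np.max(x)

def avg_pool(x):
    return np.mean(x)

def lp_pool(x, p):
    """l_p pooling: ((1/N) * sum_i x_i^p)^(1/p), for non-negative x."""
    return np.mean(np.asarray(x, dtype=float) ** p) ** (1.0 / p)

x = np.array([8., 2., 6., 4., 4.])              # max = 8, avg = 4.8
assert max_pool(x) == 8.0
assert np.isclose(avg_pool(x), 4.8)
assert np.isclose(lp_pool(x, 1), avg_pool(x))   # p = 1: average pooling
# p large: the largest value dominates the sum => ~max pooling
assert np.isclose(lp_pool(x, 100), max_pool(x), atol=0.2)
```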
[email protected] RCP209 / Deep Learning 31/ 64
Motivation Convolution Pooling ConvNets
Pooling in Convolution Feature Maps
• Spatial pooling: aggregation over image (map) regions
• Pooling input: a map (image); output: a map
• Local aggregation ⇒ local pooling receptive field
• Key pooling parameters:
Pooling function
Local pooling size
Stride between two pooling areas
[email protected] RCP209 / Deep Learning 32/ 64
Motivation Convolution Pooling ConvNets
Spatial Max Pooling
Credit: K. Matsui
• Ex: max pooling with a 5 × 5 pooling area
• Binary input: pooling ⇒ presence / absence of the feature in the local pooling area
• (Partial) translation invariance ⇒ later
[email protected] RCP209 / Deep Learning 33/ 64
Motivation Convolution Pooling ConvNets
Spatial Average Pooling
Credit: K. Matsui
• Ex: average pooling with a 5 × 5 pooling area
• Binary input: pooling ∼ counts the number of present features in the local pooling area
[email protected] RCP209 / Deep Learning 34/ 64
Motivation Convolution Pooling ConvNets
Spatial Pooling: Stride
• Step s at which the pooling areas are centered
• s > 1: decreases the spatial resolution ⇒ fewer parameters in deep models ∼ downsampling
Credit: M. Antony
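A minimal spatial max pooling with stride; 'valid' (non-padded) pooling windows are assumed:

```python
import numpy as np

def spatial_max_pool(f, size, stride):
    """Spatial max pooling over a 2D map: take the max of each
    size x size window, moving by `stride` pixels."""
    H, W = f.shape
    rows = range(0, H - size + 1, stride)
    cols = range(0, W - size + 1, stride)
    out = np.zeros((len(rows), len(cols)))
    for oi, i in enumerate(rows):
        for oj, j in enumerate(cols):
            out[oi, oj] = f[i:i + size, j:j + size].max()
    return out

f = np.arange(16, dtype=float).reshape(4, 4)
print(spatial_max_pool(f, size=2, stride=2))
# [[ 5.  7.]
#  [13. 15.]]
```

With stride 2 the 4 × 4 map is downsampled to 2 × 2, illustrating the resolution reduction above.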
[email protected] RCP209 / Deep Learning 35/ 64
Motivation Convolution Pooling ConvNets
Spatial Pooling: from Equivariance to Invariance
• Recap: convolution is equivariant to translation:

f[g(x)] = g[f(x)]

with f the convolution and g the translation
Credit: G. Hinton
[email protected] RCP209 / Deep Learning 36/ 64
Motivation Convolution Pooling ConvNets
Max Pooling & Translation Invariance
• Under some conditions, max pooling ⇒ translation invariance:

f′[g(x)] = f′(x)

with f′ = p ○ f, f the convolution, p the pooling (pooling applied after convolution)
[email protected] RCP209 / Deep Learning 37/ 64
Motivation Convolution Pooling ConvNets
Max Pooling & Translation Invariance
• Translation invariance wrt a vector T⃗ = (tx, ty)ᵀ if:
T⃗ does not bring a new largest element to the pooling region edge
T⃗ does not remove the max from the pooling region
• Ex: 5 × 5 conv map, 3 × 3 max pooling centered at 15: max = 15
• Invariance OK: ∀ translation (tx, ty) ∈ ±1 px ⇒ max = 15

C = [11 −5 1 −2 0; 1 3 0 0 5; 8 4 15 −10 4; 8 6 5 3 7; 3 0 −2 9 3]
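The invariance claim can be checked on the matrix C itself. `np.roll` is used here as a stand-in for translation; its wrap-around at the borders does not affect the central 3 × 3 pooling region in this example:

```python
import numpy as np

# The 5x5 map from the slide; 3x3 max pooling centered on the 15.
C = np.array([[11, -5,  1,  -2, 0],
              [ 1,  3,  0,   0, 5],
              [ 8,  4, 15, -10, 4],
              [ 8,  6,  5,   3, 7],
              [ 3,  0, -2,   9, 3]], dtype=float)

def pooled_max(M, ci, cj):
    """Max over the 3x3 region centered at (ci, cj) (0-indexed)."""
    return M[ci - 1:ci + 2, cj - 1:cj + 2].max()

assert pooled_max(C, 2, 2) == 15.0
# Any +-1 px translation keeps the 15 inside the 3x3 region,
# and no larger value enters it => the pooled output is unchanged.
for tx in (-1, 0, 1):
    for ty in (-1, 0, 1):
        shifted = np.roll(np.roll(C, tx, axis=0), ty, axis=1)
        assert pooled_max(shifted, 2, 2) == 15.0
```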
[email protected] RCP209 / Deep Learning 38/ 64
Motivation Convolution Pooling ConvNets
Max Pooling & Translation Invariance
• Translation invariance is lost when the conditions are violated:
here, T⃗ removes the max from the pooling region
• Ex: 5 × 5 conv map, 3 × 3 max pooling centered at 15: max = 15
• Invariance KO: right translation tx = +1 px ⇒ max = 7

C = [11 −5 1 −2 0; 1 3 0 0 5; 8 15 4 −10 4; 8 6 5 3 7; 3 0 −2 9 3]
[email protected] RCP209 / Deep Learning 39/ 64
Motivation Convolution Pooling ConvNets
Max Pooling & Translation Invariance
• Max pooling: partial translation invariance (under some conditions)
At least local stability: every value in the bottom row changed, but only half of the values in the top row changed ⇒ the distance after pooling decreases
From [Goodfellow et al., 2016]
[email protected] RCP209 / Deep Learning 40/ 64
Motivation Convolution Pooling ConvNets
Pooling: Conclusion
• Reduces the spatial feature map size (stride)
• Partial translation invariance and stability
• Convolution on tensors (color images / hierarchies)? ⇒ following!
[email protected] RCP209 / Deep Learning 41/ 64
Motivation Convolution Pooling ConvNets
Outline
1 Limitations of Fully Connected Networks
2 Convolution
3 Pooling
4 Deep Convolutionnal Neural Nets
[email protected] RCP209 / Deep Learning 42/ 64
Motivation Convolution Pooling ConvNets
Convolution Layer
• 2D convolution: each filter ⇒ a 2D map (image)
• Convolution layer: stacking the maps from multiple filters
⇒ Tensor: a multi-dimensional array
[email protected] RCP209 / Deep Learning 43/ 64
Motivation Convolution Pooling ConvNets
Convolution Layer
• Tensor: stacking several filter outputs
Depth ⇔ number of filters
Each spatial position: outputs of the different filters
• Ex: 2D convolution with gray-scale images: input tensor depth = 1
• Convolution on color images / hierarchies: convolution on tensors!
Input tensor ⇒ output tensor
[email protected] RCP209 / Deep Learning 44/ 64
Motivation Convolution Pooling ConvNets
Convolution Layer for Tensors
f′(i, j) = (f ⋆ h)(i, j) = ∑_{k=1}^{K} ∑_{n=−(d−1)/2}^{(d−1)/2} ∑_{m=−(d−1)/2}^{(d−1)/2} f(i − n, j − m, k) h(n, m, k) + b

• Convolution: linear; bias b ⇒ affine
• Filtering over depth: correlation between feature maps
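A sketch of this tensor convolution as a cross-correlation summed over depth, bias included; 'valid' output positions and no filter flip are assumptions of the sketch:

```python
import numpy as np

def tensor_xcorr(f, h, b=0.0):
    """Tensor convolution layer (single filter): input tensor f of
    shape (H, W, K), filter h of shape (d, d, K) -> one 2D output map.
    Each output value sums over the spatial window AND the depth K,
    then adds the bias b (affine operation)."""
    H, W, K = f.shape
    d = h.shape[0]
    out = np.zeros((H - d + 1, W - d + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + d, j:j + d, :] * h) + b
    return out

f = np.ones((5, 5, 3))               # e.g. a tiny color image (K = 3)
h = np.ones((3, 3, 3))
out = tensor_xcorr(f, h, b=1.0)
print(out.shape)                     # (3, 3)
print(out[0, 0])                     # 3*3*3 * 1 + 1 = 28.0
```

Using K such filters would stack K of these maps into the output tensor.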
[email protected] RCP209 / Deep Learning 45/ 64
Motivation Convolution Pooling ConvNets
Convolution Layer for Tensors
Ex: input color image
[email protected] RCP209 / Deep Learning 46/ 64
Motivation Convolution Pooling ConvNets
Convolution Layer for Tensors
Natural extension for multiple filters
[email protected] RCP209 / Deep Learning 47/ 64
Motivation Convolution Pooling ConvNets
Convolution Layer for Tensors
Ex: input color image
[email protected] RCP209 / Deep Learning 48/ 64
Motivation Convolution Pooling ConvNets
Specific Tensor Convolution Filters
• Input tensor size W × H × D
• Filter size = W × H × D = tensor size, no padding ⇒ no possible displacement for the filter
Output: a single scalar value
Using K filters ⇒ output: K-dim vector
• Convolution ∼ fully connected layer on the flattened tensor
[email protected] RCP209 / Deep Learning 49/ 64
Motivation Convolution Pooling ConvNets
Convolution Layer: Non-Linearity
• Convolutional layer: input tensor → output tensor
1. Convolution: linear / affine filtering
2. Followed by a point-wise non-linearity
∼ non-linearity applied on the spatial maps
[email protected] RCP209 / Deep Learning 50/ 64
Motivation Convolution Pooling ConvNets
Convolution Layer: Non-Linearity
• Each activation in the tensor map ⇔ a formal neuron
• Ex: sigmoid activation:

σ(z) = (1 + e^(−az))^(−1)
[email protected] RCP209 / Deep Learning 51/ 64
Motivation Convolution Pooling ConvNets
Convolution Hierarchies
• Convolution layer: affine filtering + non-linear activation
• Convolution hierarchies: stacking convolution layers
• Motivation: depth increases modeling capacity
Non-linearity crucial: hierarchical model ≠ flat model! (∼ fully connected networks)
[email protected] RCP209 / Deep Learning 52/ 64
Motivation Convolution Pooling ConvNets
Convolution Hierarchies: Receptive Field
• Cascading two 3 × 3 convolutions: same receptive field as a 5 × 5 convolution in the input image
• Convolution hierarchies:
Feature combination
Gradual increase of the spatial receptive field ⇒ indirect global connectivity
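The receptive-field growth can be computed directly; stride-1 convolutions are assumed:

```python
def receptive_field(kernel_sizes):
    """Receptive field (in input pixels) of one unit after a stack of
    stride-1 convolutions with the given odd kernel sizes: each layer
    extends the field by k - 1 pixels."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

assert receptive_field([3, 3]) == 5       # two 3x3 convs ~ one 5x5
assert receptive_field([5]) == 5
assert receptive_field([3, 3, 3]) == 7    # depth keeps widening the field
```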
[email protected] RCP209 / Deep Learning 53/ 64
Motivation Convolution Pooling ConvNets
Convolution Hierarchies: Example
Edge detection with a convolution hierarchy (pyramid) ⇒ two layers:
• Input: gray-scale image I ⇒ W × H × 1 tensor

Ix² ∼ filter 1 | Iy² ∼ filter 2

1. 1st layer: convolution with two filters

Mx = (1/4) · [−1 0 1; −2 0 2; −1 0 1]    My = (1/4) · [−1 −2 −1; 0 0 0; 1 2 1]

followed by the non-linearity σ(z) = z²
⇒ Output: W × H × 2 tensor, H1 ∼ (Ix)², H2 ∼ (Iy)²
[email protected] RCP209 / Deep Learning 54/ 64
Motivation Convolution Pooling ConvNets
Convolution Hierarchies: Example
Edge detection with a convolution hierarchy (pyramid): two layers

Ix² | Iy² | output

2. 2nd layer: convolution with one 1 × 1 filter [1 1]
For each pixel: (Ix)² + (Iy)² = ∣∣G⃗(x, y)∣∣² = ∣∣∇I∣∣²
σ(z) = Step(z − T), with T a threshold on ∣∣G⃗(x, y)∣∣²
⇒ Output: W × H × 1 tensor ⇒ edge detector
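The full two-layer pipeline can be sketched end-to-end; the toy image and the threshold T = 0.5 are illustrative choices:

```python
import numpy as np

def xcorr2d(f, h):
    """2D cross-correlation, 'valid' positions only."""
    d = h.shape[0]
    out = np.zeros((f.shape[0] - d + 1, f.shape[1] - d + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + d, j:j + d] * h)
    return out

Mx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 4.0
My = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]]) / 4.0

# Toy image: dark left half, bright right half => one vertical edge
I = np.zeros((7, 7))
I[:, 4:] = 1.0

# Layer 1: two filters + sigma(z) = z^2  ->  W x H x 2 tensor
H1 = xcorr2d(I, Mx) ** 2                 # ~ (Ix)^2
H2 = xcorr2d(I, My) ** 2                 # ~ (Iy)^2

# Layer 2: 1x1 filter [1 1] over depth + step non-linearity
G2 = H1 + H2                             # ||grad I||^2 at each pixel
edges = (G2 > 0.5).astype(int)           # T = 0.5, illustrative threshold
# The edge columns (around the intensity jump) are marked 1
assert (edges[:, 2] == 1).all() and (edges[:, 0] == 0).all()
```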
[email protected] RCP209 / Deep Learning 55/ 64
Motivation Convolution Pooling ConvNets
Pooling in Convolution Layer
Where to pool in a convolution tensor?
• Most common choice: pool in each feature map independently
⇒ spatial pooling on top of the convolution layer
Input / output: a tensor of depth D
Output has a smaller spatial size (pooling stride)
[email protected] RCP209 / Deep Learning 56/ 64
Motivation Convolution Pooling ConvNets
Convolution / Pooling Layer
• Pooling on top of convolution Layer• An elementary block: Convolution Layer + Pooling [Conv-Pool]
[email protected] RCP209 / Deep Learning 57/ 64
Motivation Convolution Pooling ConvNets
Convolution / Pooling Layer
• An elementary block: Convolution + Non-linearity (= Convolution Layer), followed by Pooling
[email protected] RCP209 / Deep Learning 58/ 64
Motivation Convolution Pooling ConvNets
Convolutional Neural Networks (ConvNets)
• Stack several Convolution / Pooling blocks ⇒ Convolutional Neural Network (ConvNet)
• Ex: 7 × 7 convolution, 2 × 2 pooling area, stride s = 2
• Input image 46 × 46, 1st [Conv-Pool] layer:
Conv output = 40
Pooling output = 20
Receptive field size for each pooled unit?
⇒ Pooling increases the receptive field
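The layer sizes follow from simple arithmetic; 'valid' (unpadded) convolution is assumed:

```python
def conv_out(n, k):
    """Output size of a valid, stride-1 convolution with kernel k."""
    return n - k + 1

def pool_out(n, size, stride):
    """Output size after pooling with the given window size and stride."""
    return (n - size) // stride + 1

n = 46                                   # input image 46 x 46
n = conv_out(n, 7)                       # 7x7 convolution
assert n == 40
n = pool_out(n, 2, 2)                    # 2x2 pooling, stride 2
assert n == 20
```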
[email protected] RCP209 / Deep Learning 59/ 64
Motivation Convolution Pooling ConvNets
Convolutional Neural Networks (ConvNets)
• ConvNets: hierarchical tensor transformations
• At some (depth) point, the tensor is often flattened into a vector
[email protected] RCP209 / Deep Learning 60/ 64
Motivation Convolution Pooling ConvNets
Convolutional Neural Networks (ConvNets)
• ConvNet prediction: a 2-stage process:
1. Representation learning with the [Conv-Pool] hierarchy:
Conv detects relevant features
Pooling gains spatial invariance for classification
[email protected] RCP209 / Deep Learning 61/ 64
Motivation Convolution Pooling ConvNets
Convolutional Neural Networks (ConvNets)
• ConvNet prediction: a 2-stage process:
2. Classification: the tensor is flattened into a vector
Flattening fixes each neuron's position in the initial tensor ⇒ breaking translation invariance
Followed by a hierarchy of fully connected layers
[email protected] RCP209 / Deep Learning 62/ 64
Motivation Convolution Pooling ConvNets
Conclusion
• ConvNet: hierarchical [Conv-Pool] blocks + fully connected layers
Architecture of famous historical nets, e.g. LeNet, and of more recent ones, e.g. AlexNet (2012)
• Deep Learning history? ⇒ next course!
[email protected] RCP209 / Deep Learning 63/ 64
References I
[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org