
Backpropagation In Convolutional Neural Networks

Jefkine, 5 September 2016

Introduction

Convolutional neural networks (CNNs) are a biologically-inspired variation of the multilayer perceptrons (MLPs). CNNs emulate the basic mechanics of the animal visual cortex. Neurons in CNNs share weights, unlike in MLPs where each neuron has a separate weight vector. This sharing of weights ends up reducing the overall number of trainable weights, hence introducing sparsity.

Utilizing the weight sharing strategy, neurons are able to perform convolutions on the data, with the convolution filter being formed by the weights. This is then followed by a pooling operation, which is a form of non-linear down-sampling that progressively reduces the spatial size of the representation, hence reducing the amount of parameters and computation in the network. An illustration can be seen in the diagram above.

After several convolutional and max pooling layers, the image size (feature map size) is reduced and more complex features are extracted. Eventually, with a small enough feature map, the contents are squashed into a one-dimensional vector and fed into a fully-connected MLP for processing.

Existing between the convolution and the pooling layer is a ReLU layer, in which a non-saturating activation function $f(x) = \max(0, x)$ is applied element-wise, i.e. thresholding at zero.

The last layer of the fully-connected MLP, seen as the output, is a loss layer which is used to specify how the network training penalizes the deviation between the predicted and true labels.

Cross-correlation

Given an input image $I$ and a kernel or filter $F$ of dimensions $k \times k$, a cross-correlation operation leading to an output image $C$ is given by:

$$C(x, y) = I \otimes F = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} I(x + a, y + b)\, F(a, b) \tag{1}$$

Convolution

Given an input image $I$ and a kernel or filter $F$ of dimensions $k \times k$, a convolution operation leading to an output image $C$ is given by:

$$C(x, y) = I * F = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} I(x - a, y - b)\, F(a, b) \tag{2}$$

Convolution is the same as cross-correlation, except that the kernel is "flipped" (horizontally and vertically).

In the two operations above, the region of support for $C(x, y)$ is given by the ranges $0 \le x \le k - 1$ and $0 \le y \le k - 1$. These are the ranges for which the pixels are defined. In the case of undefined pixels, the input image could be zero padded to result in an output of a size similar to the input image.

Notation

1. $l$ is the $l^{th}$ layer, where $l = 1$ is the first layer and $l = L$ is the last layer.
2. $w_{x,y}^{l}$ is the weight vector connecting neurons of layer $l$ with neurons of layer $l + 1$.
3. $o_{x,y}^{l}$ is the output vector at layer $l$:
   $$o_{x,y}^{l} = w_{x,y}^{l} * a_{x,y}^{l} = \sum_{x'} \sum_{y'} w_{x',y'}^{l}\, a_{x-x',\,y-y'}^{l}$$
4. $a_{x,y}^{l}$ is the activated output vector for a hidden layer $l$:
   $$a_{x,y}^{l} = f\!\left(o_{x,y}^{l-1}\right)$$
5. $f(x)$ is the activation function. A ReLU function $f(x) = \max(0, x)$ is used as the activation function.
6. $A^{L}$ is the matrix representing all entries of the last output layer, given by $a_{x,y}^{L} = f\!\left(o_{x,y}^{L-1}\right)$.
7. $P$ is the matrix representing all the training patterns for network training.
8. $T$ is the matrix of all the targets.
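To make the difference between Eqns (1) and (2) concrete, here is a minimal NumPy sketch (the helper names are my own, not from the article): both operations are computed over the usual "valid" output region rather than the kernel's region of support, and SciPy, assumed to be available, is used only as an independent check that convolution is just cross-correlation with the kernel flipped $180^\circ$.

```python
import numpy as np
from scipy.signal import convolve2d as sp_convolve2d, correlate2d as sp_correlate2d

def cross_correlate2d(I, F):
    """Eqn (1): C(x, y) = sum_a sum_b I(x + a, y + b) F(a, b), 'valid' region only."""
    k = F.shape[0]
    H, W = I.shape
    C = np.zeros((H - k + 1, W - k + 1))
    for x in range(C.shape[0]):
        for y in range(C.shape[1]):
            C[x, y] = np.sum(I[x:x + k, y:y + k] * F)
    return C

def convolve2d(I, F):
    """Eqn (2): convolution is cross-correlation with the kernel flipped 180 degrees."""
    return cross_correlate2d(I, np.rot90(F, 2))

rng = np.random.default_rng(0)
I = rng.standard_normal((5, 5))   # input image I
F = rng.standard_normal((3, 3))   # k x k kernel F

# Independent check against SciPy's reference implementations.
assert np.allclose(cross_correlate2d(I, F), sp_correlate2d(I, F, mode="valid"))
assert np.allclose(convolve2d(I, F), sp_convolve2d(I, F, mode="valid"))
```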

Forward Propagation

To perform a convolution operation, the kernel is flipped $180^\circ$ and slid across the input feature map in equal and finite strides. At each location, the product between each element of the kernel and the input element it overlaps is computed and the results summed up to obtain the output at that current location.

This procedure is repeated using different kernels to form as many output feature maps as desired. The concept of weight sharing is used as demonstrated in the diagram below:

Units in the convolution layer have receptive fields of width 3 in the input feature map and are thus only connected to 3 adjacent neurons in the input layer. This is the idea of sparse connectivity in CNNs, where there exists a local connectivity pattern between neurons in adjacent layers.

The color codes of the weights joining the input layer to the convolution layer show how the kernel weights are distributed (shared) amongst neurons in the adjacent layers. Weights of the same color are constrained to be identical.
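The sparse connectivity and weight sharing described above can also be seen numerically. The sketch below (my own illustration, with kernel flipping omitted so the connectivity pattern stays visible) builds the dense weight matrix equivalent to a 1-D convolution with a width-3 kernel: each row touches only 3 adjacent inputs, and the same three weights repeat in every row.

```python
import numpy as np

# A width-3 kernel shared by every unit in the convolution layer.
w = np.array([0.2, -0.5, 0.3])
n_in = 7                       # number of input neurons
n_out = n_in - len(w) + 1      # 'valid' outputs: each sees 3 adjacent inputs

# Dense-matrix view of the layer: the zeros encode sparse connectivity,
# the identical (shifted) rows encode weight sharing.
W_dense = np.zeros((n_out, n_in))
for i in range(n_out):
    W_dense[i, i:i + len(w)] = w

x = np.arange(n_in, dtype=float)
conv_out = np.array([np.dot(w, x[i:i + len(w)]) for i in range(n_out)])
assert np.allclose(W_dense @ x, conv_out)   # same result: 3 shared weights vs. 35 free entries
print(W_dense)
```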

For the convolution operation to be performed, the kernel is first flipped as shown in the diagram below:

The convolution equation is given by:

$$O(x, y) = w_{x,y}^{l} * a_{x,y}^{l} + b_{x,y}^{l} \tag{3}$$

$$= \sum_{x'} \sum_{y'} w_{x',y'}^{l}\, a_{x-x',\,y-y'}^{l} + b_{x,y}^{l} \tag{4}$$

$$= \sum_{x'} \sum_{y'} w_{x',y'}^{l}\, f\!\left(o_{x-x',\,y-y'}^{l-1}\right) + b_{x,y}^{l} \tag{5}$$

This is illustrated below:
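As a concrete companion to Eqns (3)-(5), the following NumPy sketch (function and variable names are my own, not from the article) computes one output feature map: the kernel is flipped $180^\circ$, slid over the previous layer's activations with unit stride, a bias is added, and the ReLU $f(x) = \max(0, x)$ produces the activations of the current layer. A single scalar bias per feature map is assumed.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def conv_layer_forward(a_prev, w, b):
    """One output feature map: o = w * a_prev + b (Eqns (3)-(5)), then a = f(o)."""
    k = w.shape[0]
    w_flipped = np.rot90(w, 2)                 # flip the kernel 180 degrees
    H, W = a_prev.shape
    o = np.zeros((H - k + 1, W - k + 1))
    for x in range(o.shape[0]):
        for y in range(o.shape[1]):
            # product of the (flipped) kernel with the patch it overlaps, summed up
            o[x, y] = np.sum(w_flipped * a_prev[x:x + k, y:y + k]) + b
    return o, relu(o)

rng = np.random.default_rng(1)
a_prev = relu(rng.standard_normal((6, 6)))   # activations f(o^{l-1}) from the previous layer
w = rng.standard_normal((3, 3))              # shared kernel weights w^l
b = 0.1                                      # bias b^l (one shared scalar assumed here)
o, a = conv_layer_forward(a_prev, w, b)
print(o.shape)                               # (4, 4) output feature map
```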

Error

For a total of $P$ predictions, the predicted network outputs $a_{p}^{L}$ and their corresponding targeted values $t_{p}$, the mean squared error is given by:

$$E = \frac{1}{2} \sum_{p=1}^{P} \left(t_{p} - a_{p}^{L}\right)^{2} \tag{6}$$

Learning will be achieved by adjusting the weights such that $A^{L}$ is as close as possible, or equal, to $T$. In the classical back-propagation algorithm, the weights are changed according to the gradient descent direction of an error surface $E$.
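A minimal numerical illustration of Eqn (6), with example values of my own: the loss for a batch of $P = 3$ predictions, together with the output error signal $-(t_p - a_p^L)$ from which the back-propagated deltas originate.

```python
import numpy as np

def mse_loss(a_L, t):
    """Eqn (6): E = 1/2 * sum_p (t_p - a_L_p)^2."""
    return 0.5 * np.sum((t - a_L) ** 2)

def mse_grad(a_L, t):
    # dE/da_L = -(t - a_L); this seeds the deltas that flow backwards
    return -(t - a_L)

a_L = np.array([0.8, 0.1, 0.3])   # predicted network outputs
t   = np.array([1.0, 0.0, 0.0])   # corresponding targets
print(mse_loss(a_L, t))           # 0.5 * (0.04 + 0.01 + 0.09) = 0.07
print(mse_grad(a_L, t))           # [-0.2  0.1  0.3]
```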

Backpropagation

For back-propagation there are two updates performed, for the weights and the deltas. Let us begin with the weight update. The gradient component for each weight can be obtained by applying the chain rule:

$$\frac{\partial E}{\partial w_{x,y}^{l}} = \sum_{x'} \sum_{y'} \frac{\partial E}{\partial o_{x',y'}^{l}} \frac{\partial o_{x',y'}^{l}}{\partial w_{x,y}^{l}} = \sum_{x'} \sum_{y'} \delta_{x',y'}^{l} \frac{\partial o_{x',y'}^{l}}{\partial w_{x,y}^{l}} \tag{7}$$

In Eqn (7), $o_{x',y'}^{l}$ is equivalent to $w_{x',y'}^{l} * a_{x',y'}^{l} + b^{l}$, and so applying the convolution operation here gives us an equation of the form:

$$\frac{\partial o_{x',y'}^{l}}{\partial w_{x,y}^{l}} = \frac{\partial}{\partial w_{x,y}^{l}} \left( \sum_{x''} \sum_{y''} w_{x'',y''}^{l}\, a_{x'-x'',\,y'-y''}^{l} + b^{l} \right) = \frac{\partial}{\partial w_{x,y}^{l}} \left( \sum_{x''} \sum_{y''} w_{x'',y''}^{l}\, f\!\left(o_{x'-x'',\,y'-y''}^{l-1}\right) + b^{l} \right) \tag{8}$$

Expanding the summation in Eqn (8) and taking the partial derivatives for all the components results in zero values for all except the components where $x = x''$ and $y = y''$ in $w_{x'',y''}^{l}$, which implies $x' - x'' \mapsto x' - x$ and $y' - y'' \mapsto y' - y$ in $f\!\left(o_{x'-x'',\,y'-y''}^{l-1}\right)$, as follows:

$$\frac{\partial o_{x',y'}^{l}}{\partial w_{x,y}^{l}} = \frac{\partial}{\partial w_{x,y}^{l}} \left( w_{0,0}^{l}\, f\!\left(o_{x'-0,\,y'-0}^{l-1}\right) + \dots + w_{x,y}^{l}\, f\!\left(o_{x'-x,\,y'-y}^{l-1}\right) + \dots + b^{l} \right) = \frac{\partial}{\partial w_{x,y}^{l}} \left( w_{x,y}^{l}\, f\!\left(o_{x'-x,\,y'-y}^{l-1}\right) \right) = f\!\left(o_{x'-x,\,y'-y}^{l-1}\right) \tag{9}$$

Substituting Eqn (9) in Eqn (7) gives us the following results:

$$\frac{\partial E}{\partial w_{x,y}^{l}} = \sum_{x'} \sum_{y'} \delta_{x',y'}^{l}\, f\!\left(o_{x'-x,\,y'-y}^{l-1}\right) \tag{10}$$

$$= \delta_{x,y}^{l} * f\!\left(o_{-x,-y}^{l-1}\right) \tag{11}$$

$$= \delta_{x,y}^{l} * f\!\left(\operatorname{rot}_{180^\circ}\!\left(o_{x,y}^{l-1}\right)\right) \tag{12}$$

In Eqn (12) above, rotation of the kernel causes the shift from $f\!\left(o_{-x,-y}^{l-1}\right)$ to $f\!\left(\operatorname{rot}_{180^\circ}\!\left(o_{x,y}^{l-1}\right)\right)$. The diagram below shows the gradients $\left(\delta_{11}, \delta_{12}, \delta_{21}, \delta_{22}\right)$ generated during back-propagation:
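The following sketch implements the weight gradient of Eqns (10)-(12) in array terms (helper names are my own; as a simplifying assumption the toy loss is applied directly to the pre-activation outputs, so $\delta^{l} = \partial E / \partial o^{l}$ is immediate). The gradient of each kernel weight is accumulated by cross-correlating the previous layer's activations with the deltas and reading the result out in the rotated orientation, and a finite-difference check confirms the analytic gradient.

```python
import numpy as np

def conv_forward(a_prev, w, b):
    """Forward pass as in Eqns (3)-(5): flip the kernel, slide, sum products, add bias."""
    k = w.shape[0]
    wf = np.rot90(w, 2)
    H, W = a_prev.shape
    o = np.zeros((H - k + 1, W - k + 1))
    for x in range(o.shape[0]):
        for y in range(o.shape[1]):
            o[x, y] = np.sum(wf * a_prev[x:x + k, y:y + k]) + b
    return o

def weight_grad(a_prev, delta, k):
    """Eqn (10): deltas cross-correlated with the previous layer's activations,
    read out through the 180-degree rotation of Eqns (11)-(12)."""
    d_wf = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            d_wf[i, j] = np.sum(delta * a_prev[i:i + delta.shape[0], j:j + delta.shape[1]])
    return np.rot90(d_wf, 2)   # back to the un-flipped kernel orientation

rng = np.random.default_rng(2)
a_prev = rng.standard_normal((4, 4))       # activations f(o^{l-1})
w, b = rng.standard_normal((2, 2)), 0.0
t = rng.standard_normal((3, 3))            # arbitrary targets for a toy loss

loss = lambda w_: 0.5 * np.sum((t - conv_forward(a_prev, w_, b)) ** 2)
delta = -(t - conv_forward(a_prev, w, b))  # dE/do at this layer

# Analytic gradient vs. numerical central differences.
analytic = weight_grad(a_prev, delta, k=2)
numeric = np.zeros_like(w)
eps = 1e-6
for p in range(2):
    for q in range(2):
        w_pos, w_neg = w.copy(), w.copy()
        w_pos[p, q] += eps
        w_neg[p, q] -= eps
        numeric[p, q] = (loss(w_pos) - loss(w_neg)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```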

For the back-propagation routine, the kernel is flipped $180^\circ$ yet again before the convolution operation is done on the gradients to reconstruct the input feature map.

The convolution operation used to reconstruct the input feature map is shown below:

During the reconstruction process, the deltas $\left(\delta_{11}, \delta_{12}, \delta_{21}, \delta_{22}\right)$ are used. The deltas are provided by an equation of the form:

$$\delta_{x,y}^{l} = \frac{\partial E}{\partial o_{x,y}^{l}} \tag{13}$$

Using the chain rule and introducing sums gives us the following results:

$$\frac{\partial E}{\partial o_{x,y}^{l}} = \sum_{x'} \sum_{y'} \frac{\partial E}{\partial o_{x',y'}^{l+1}} \frac{\partial o_{x',y'}^{l+1}}{\partial o_{x,y}^{l}} = \sum_{x'} \sum_{y'} \delta_{x',y'}^{l+1} \frac{\partial o_{x',y'}^{l+1}}{\partial o_{x,y}^{l}} \tag{14}$$

In Eqn (14), $o_{x',y'}^{l+1}$ is equivalent to $w_{x',y'}^{l+1} * a_{x',y'}^{l+1} + b^{l+1}$, and so applying the convolution operation here gives us an equation of the form:

$$\frac{\partial o_{x',y'}^{l+1}}{\partial o_{x,y}^{l}} = \frac{\partial}{\partial o_{x,y}^{l}} \left( \sum_{x''} \sum_{y''} w_{x'',y''}^{l+1}\, a_{x'-x'',\,y'-y''}^{l+1} + b^{l+1} \right) = \frac{\partial}{\partial o_{x,y}^{l}} \left( \sum_{x''} \sum_{y''} w_{x'',y''}^{l+1}\, f\!\left(o_{x'-x'',\,y'-y''}^{l}\right) + b^{l+1} \right) \tag{15}$$

Expanding the summation in Eqn (15) and taking the partial derivatives for all the components results in zero values for all except the component where $x = x' - x''$ and $y = y' - y''$ in $f\!\left(o_{x'-x'',\,y'-y''}^{l}\right)$, which implies $x'' = x' - x$ and $y'' = y' - y$ in $w_{x'',y''}^{l+1}$, as follows:

$$\frac{\partial o_{x',y'}^{l+1}}{\partial o_{x,y}^{l}} = \frac{\partial}{\partial o_{x,y}^{l}} \left( w_{0,0}^{l+1}\, f\!\left(o_{x'-0,\,y'-0}^{l}\right) + \dots + w_{x'-x,\,y'-y}^{l+1}\, f\!\left(o_{x,y}^{l}\right) + \dots + b^{l+1} \right) = \frac{\partial}{\partial o_{x,y}^{l}} \left( w_{x'-x,\,y'-y}^{l+1}\, f\!\left(o_{x,y}^{l}\right) \right) = w_{x'-x,\,y'-y}^{l+1}\, \frac{\partial}{\partial o_{x,y}^{l}} f\!\left(o_{x,y}^{l}\right) = w_{x'-x,\,y'-y}^{l+1}\, f'\!\left(o_{x,y}^{l}\right) \tag{16}$$

Substituting Eqn (16) in Eqn (14) gives us the following results:

$$\frac{\partial E}{\partial o_{x,y}^{l}} = \sum_{x'} \sum_{y'} \delta_{x',y'}^{l+1}\, w_{x'-x,\,y'-y}^{l+1}\, f'\!\left(o_{x,y}^{l}\right) \tag{17}$$

In Eqn (17), we now have a cross-correlation which is transformed to a convolution by "flipping" the kernel (horizontally and vertically) as follows:

$$\frac{\partial E}{\partial o_{x,y}^{l}} = \sum_{x'} \sum_{y'} \delta_{x',y'}^{l+1}\, \operatorname{rot}_{180^\circ}\!\left(w^{l+1}\right)_{x-x',\,y-y'}\, f'\!\left(o_{x,y}^{l}\right) = \delta_{x,y}^{l+1} * \operatorname{rot}_{180^\circ}\!\left(w_{x,y}^{l+1}\right) f'\!\left(o_{x,y}^{l}\right) \tag{18}$$
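A minimal sketch of Eqn (18), again with my own helper names: the deltas of layer $l+1$ are "fully" convolved with the $180^\circ$-rotated kernel to recover a map the size of the input feature map, and the result is gated element-wise by the ReLU derivative $f'(o^{l})$.

```python
import numpy as np

def relu_grad(o):
    # f'(o) for the ReLU f(x) = max(0, x): 1 where o > 0, else 0
    return (o > 0).astype(float)

def full_convolve2d(A, B):
    """'Full' 2-D convolution: out[p, q] = sum_{x, y} A[x, y] * B[p - x, q - y]."""
    k = B.shape[0]
    padded = np.pad(A, k - 1)        # zero-pad so every overlap is covered
    B_rot = np.rot90(B, 2)           # sliding the rotated kernel realises true convolution
    out = np.zeros((A.shape[0] + k - 1, A.shape[1] + k - 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            out[p, q] = np.sum(padded[p:p + k, q:q + k] * B_rot)
    return out

def backprop_deltas(delta_next, w_next, o_curr):
    """Eqn (18): delta^l = (delta^{l+1} * rot180(w^{l+1})) multiplied element-wise by f'(o^l)."""
    return full_convolve2d(delta_next, np.rot90(w_next, 2)) * relu_grad(o_curr)

rng = np.random.default_rng(3)
o_curr = rng.standard_normal((4, 4))       # pre-activations o^l of the current layer
w_next = rng.standard_normal((2, 2))       # kernel w^{l+1} of the layer above
delta_next = rng.standard_normal((3, 3))   # deltas delta^{l+1} flowing back from layer l+1
print(backprop_deltas(delta_next, w_next, o_curr).shape)   # (4, 4): input feature map size recovered
```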

Pooling Layer

The function of the pooling layer is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. No learning takes place on the pooling layers [2].

Pooling units are obtained using functions like max-pooling, average pooling and even L2-norm pooling. At the pooling layer, forward propagation results in an $N \times N$ pooling block being reduced to a single value, the value of the "winning unit". Back-propagation of the pooling layer then computes the error which is acquired by this single value "winning unit".

To keep track of the "winning unit", its index is noted during the forward pass and used for gradient routing during back-propagation. Gradient routing is done in the following ways (a short sketch follows the list):

Max-pooling - the error is just assigned to where it comes from, the "winning unit", because the other units in the previous layer's pooling blocks did not contribute to it; hence all the other units are assigned a value of zero.

Average pooling - the error is multiplied by $\frac{1}{N \times N}$ and assigned to the whole pooling block (all units get this same value).
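Here is the short sketch referred to above (my own minimal version, for a single $2 \times 2$ pooling block, with hypothetical function names): max-pooling remembers the index of the winning unit from the forward pass and routes the entire error there, while average pooling spreads the error uniformly, scaled by $\frac{1}{N \times N}$.

```python
import numpy as np

def max_pool_forward(block):
    """Reduce an N x N pooling block to the value of its 'winning unit', remembering its index."""
    idx = np.unravel_index(np.argmax(block), block.shape)
    return block[idx], idx

def max_pool_backward(error, idx, shape):
    """Route the whole error to the winning unit; every other unit gets zero."""
    grad = np.zeros(shape)
    grad[idx] = error
    return grad

def avg_pool_backward(error, shape):
    """Spread the error uniformly: each of the N x N units receives error / (N * N)."""
    return np.full(shape, error / (shape[0] * shape[1]))

block = np.array([[1.0, 3.0],
                  [2.0, 0.5]])
value, idx = max_pool_forward(block)              # value = 3.0, idx = (0, 1)
print(max_pool_backward(0.8, idx, block.shape))   # [[0.  0.8] [0.  0. ]]
print(avg_pool_backward(0.8, block.shape))        # [[0.2 0.2] [0.2 0.2]]
```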

Conclusion

Convolutional neural networks employ a weight sharing strategy that leads to a significant reduction in the number of parameters that have to be learned. The presence of larger receptive field sizes of neurons in successive convolution layers, coupled with the presence of pooling layers, also leads to translation invariance. As we have observed, the derivations of forward and backward propagation will differ depending on what layer we are propagating through.

References

1. Dumoulin, Vincent, and Francesco Visin. "A guide to convolution arithmetic for deep learning." stat 1050 (2016): 23. [pdf]
2. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), 541-551 (1989)
3. Wikipedia page on Convolutional neural network
4. Convolutional Neural Networks (LeNet), deeplearning.net
