
Training Convolutional Neural Networks

R. Q. FEITOSA

Overview

1. Activation Functions

2. Data Preprocessing

3. Weight Initialization

4. Batch Normalization


Activation Functions


Activation Layer

A neuron computes f(Σᵢ wᵢxᵢ + b), where f is the activation function applied to the weighted sum of the inputs plus the bias b.

[Figure: neuron diagram with inputs x₀ … xₙ, weights w₀ … wₙ, bias b, and the activation function f applied to Σᵢ wᵢxᵢ + b.]
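As a concrete illustration (not part of the original slides), a minimal NumPy sketch of this computation, assuming a vector of inputs x, a weight vector w, and a scalar bias b:

```python
import numpy as np

def neuron_forward(x, w, b, f):
    """Compute f(sum_i w_i * x_i + b) for a single neuron."""
    return f(np.dot(w, x) + b)

# Example usage with a sigmoid activation (names here are illustrative).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
out = neuron_forward(x, w, b=0.3, f=sigmoid)
```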

Activation Functions

Common choices:

• Sigmoid: σ(x) = 1 / (1 + e^(−x))
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.01x, x)
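A minimal NumPy sketch of the four activations listed above (added here for illustration, not from the original slides):

```python
import numpy as np

def sigmoid(x):
    # Squashes values to the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes values to the range (-1, 1), zero-centered.
    return np.tanh(x)

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Small negative slope instead of a hard zero.
    return np.maximum(slope * x, x)
```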

Sigmoid

• Squashes numbers to range [0,1]

• Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron.

Three shortcomings:

1) Saturated neurons "kill" the gradients.

2) Sigmoid outputs are not zero-centered.

3) exp() is somewhat expensive to compute.

Sigmoid kills gradient

By the chain rule at the sigmoid gate,

∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

where ∂L/∂σ is the upstream gradient arriving from the layers above and ∂σ/∂x is the local gradient of σ(x) = 1 / (1 + e^(−x)).

In the saturated regions of the sigmoid (x large in magnitude), the local gradient is nearly zero, so the sigmoid "kills" the gradient flowing back through it.

[Figure: backprop through a sigmoid gate (upstream gradient ∂L/∂σ, local gradient ∂σ/∂x), and the sigmoid curve with both saturated tails labeled "kills gradient".]
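A short NumPy sketch (added for illustration) of the forward and backward pass through a sigmoid gate, showing how the local gradient vanishes for saturated inputs:

```python
import numpy as np

def sigmoid_forward(x):
    out = 1.0 / (1.0 + np.exp(-x))
    return out

def sigmoid_backward(upstream_grad, out):
    # Local gradient of the sigmoid: dsigma/dx = sigma(x) * (1 - sigma(x)).
    local_grad = out * (1.0 - out)
    return upstream_grad * local_grad

x = np.array([-10.0, 0.0, 10.0])
out = sigmoid_forward(x)
dx = sigmoid_backward(np.ones_like(x), out)
# dx is ~0 at x = -10 and x = +10: the saturated sigmoid "kills" the gradient.
```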

Sigmoid output not zero-centered

What happens when the input to a neuron (x) is always positive?

For f(Σᵢ wᵢxᵢ + b), the local gradient of the weighted sum with respect to each wᵢ is xᵢ. When every xᵢ is positive, the gradients on w are all positive or all negative (the sign is set by the upstream gradient), so the possible gradient update directions are restricted and the weights follow zig-zag updates towards a hypothetical optimal w vector.

Therefore, zero-mean data is highly desirable!

[Figure: in the (w₁, w₂) plane, the possible gradient update directions cover only two quadrants, producing zig-zag updates towards the hypothetical optimal w vector.]
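A tiny NumPy illustration (added here, not on the slide) of why all-positive inputs force the gradients on w to share a sign:

```python
import numpy as np

# Pre-activation s = w . x + b; by the chain rule dL/dw_i = (dL/ds) * x_i.
x = np.array([0.2, 1.5, 3.0])     # all-positive inputs (e.g. sigmoid outputs)
upstream = -0.7                   # dL/ds, a single scalar for this neuron

grad_w = upstream * x
# grad_w = [-0.14, -1.05, -2.1]: every component has the same sign as dL/ds,
# so the update on w can only point into two of the quadrants.
```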

Tanh

f(x) = tanh(x)

• Squashes numbers to range [-1, 1].
• Zero-centered.

Shortcoming:

Still "kills" the gradients when saturated.

[Figure: tanh curve with both saturated tails labeled "kills gradient".]

Rectified Linear Unit (ReLU)

f(x) = max(0, x)

• Does not saturate in the positive region.
• Very computationally efficient.
• Converges much faster than sigmoid/tanh (about 6×).

Shortcoming:

Not zero-centered output.

[Figure: ReLU curve, max(0, x).]

ReLU kills gradient for negative input

At the ReLU gate, z = ReLU(x) = max(0, x), and by the chain rule

∂L/∂x = (∂z/∂x) · (∂L/∂z)

where ∂L/∂z is the upstream gradient and ∂z/∂x is the local gradient, which is 1 for x > 0 and 0 for x < 0. For negative input the ReLU therefore "kills" the gradient.

[Figure: backprop through a ReLU gate, and a data cloud in the (x₁, x₂) plane.]
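A minimal NumPy sketch (illustrative, not from the slides) of the ReLU forward and backward pass:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0.0, x)

def relu_backward(upstream_grad, x):
    # Local gradient is 1 where x > 0 and 0 elsewhere,
    # so the gradient is "killed" for negative inputs.
    return upstream_grad * (x > 0)

x = np.array([-2.0, -0.5, 1.0, 3.0])
dx = relu_backward(np.ones_like(x), x)
# dx = [0., 0., 1., 1.]
```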

Dead ReLU

Let y = Wx + d and z = ReLU(y) = max(0, Wx + d). By the chain rule,

∂L/∂y = (∂z/∂y) · (∂L/∂z)

where ∂L/∂z is the upstream gradient and the local gradient is

∂z/∂y = 1 for Wx + d > 0, and 0 otherwise.

Only data on the positive side of the hyperplane Wx + d = 0 will have a non-zero derivative. If the unit never activates for any input in the data cloud, it receives no gradient and never updates: a "dead" ReLU.

[Figure: backprop through y = Wx + d followed by a ReLU, and a data cloud lying entirely on one side of the hyperplane.]

Leaky ReLU

f(x) = max(0.01x, x)

• Never saturates.
• Computationally efficient.
• Converges much faster than sigmoid/tanh (about 6×).
• Will not "die".

Shortcoming:

Not zero-centered output.

Parametric Rectifier (PReLU): f(x) = max(αx, x), where the slope α is learned from the data.
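A small sketch (added for illustration) of Leaky ReLU and a PReLU-style activation in which the negative slope α is a parameter; the gradient with respect to α is included so it can be learned:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def prelu_forward(x, alpha):
    # f(x) = x for x > 0, alpha * x otherwise.
    return np.where(x > 0, x, alpha * x)

def prelu_backward(upstream_grad, x, alpha):
    # Gradients for the input and for the learnable slope alpha.
    dx = upstream_grad * np.where(x > 0, 1.0, alpha)
    dalpha = np.sum(upstream_grad * np.where(x > 0, 0.0, x))
    return dx, dalpha
```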

Data Preprocessing


Preprocessing Operations

In practice, for images:

• No normalization (pixel values are already in the same range).
• Subtract the mean image (e.g. AlexNet).
• Subtract the per-channel mean (e.g. VGGNet).
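A minimal NumPy sketch (added here) of the two mean-subtraction variants, assuming image arrays of shape (N, H, W, C):

```python
import numpy as np

def subtract_mean_image(train, test):
    # AlexNet-style: subtract the full mean image (H, W, C) computed on training data.
    mean_image = train.mean(axis=0)
    return train - mean_image, test - mean_image

def subtract_channel_mean(train, test):
    # VGGNet-style: subtract one scalar mean per color channel.
    channel_mean = train.mean(axis=(0, 1, 2))   # shape (C,)
    return train - channel_mean, test - channel_mean
```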

Weight Initialization


Small Initial Weights

If the weights are initialized with very small values, then as the data flows forward, the input x at each layer tends to concentrate around zero.

Recall that for a layer computing Wx, the local gradient is ∂(Wx)/∂W = x.

As the gradient back-propagates, it therefore tends to vanish, and there is (almost) no update.

[Figure: histograms of the input data at each convolutional layer.]

Large Initial Weights

Assume a tanh or a sigmoid activation function. If the weights are initialized with large values, the inputs to the activation functions fall in the saturated region, where the gradient is small.

Recall again the local gradient ∂(Wx)/∂W = x.

As the gradients propagate back, they tend to vanish, and there is (almost) no update.

Optimal Weight Initialization

Still an active research topic:

• Understanding the difficulty of training deep feedforward neural networks, by Glorot and Bengio, 2010
• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, by Saxe et al., 2013
• Random walk initialization for training very deep feedforward networks, by Sussillo and Abbott, 2014
• Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, by He et al., 2015
• Data-dependent Initializations of Convolutional Neural Networks, by Krähenbühl et al., 2015
• All you need is a good init, by Mishkin and Matas, 2015
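Since the summary later recommends Xavier initialization, here is a minimal NumPy sketch of the Xavier (Glorot and Bengio, 2010) and He et al. (2015) scaling rules for a fully connected layer; the function names and layer sizes are illustrative:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Glorot & Bengio (2010): keeps the variance of activations roughly
    # constant across layers; suited to tanh/sigmoid units.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * std

def he_init(fan_in, fan_out):
    # He et al. (2015): accounts for ReLU zeroing half of the activations.
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std

W1 = xavier_init(784, 256)   # e.g. a tanh layer
W2 = he_init(256, 128)       # e.g. a ReLU layer
```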

Batch Normalization


Making the Activations Gaussian

Take a batch of activations at some layer. To make each dimension k look like a unit Gaussian, apply

x̂(k) = (x(k) − E[x(k)]) / √(Var[x(k)])

Notice that this function has a simple derivative.

In practice, we take the empirical mean and variance of x over a mini-batch ℬ = {x₁…ₘ}:

μ_ℬ = (1/m) Σᵢ₌₁..ₘ xᵢ

σ²_ℬ = (1/m) Σᵢ₌₁..ₘ (xᵢ − μ_ℬ)²

x̂ᵢ = (xᵢ − μ_ℬ) / √(σ²_ℬ + ε)

where ε is a small constant added for numerical stability, and μ_ℬ and σ²_ℬ are computed for each dimension independently.

[Figure: the mini-batch as a matrix of samples × features; the statistics are taken over the samples, one feature (dimension) at a time.]
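A minimal NumPy sketch (added for illustration) of this normalization step for a mini-batch of shape (m, features), with the statistics computed per dimension:

```python
import numpy as np

def batchnorm_normalize(x, eps=1e-5):
    # x has shape (m, features); statistics are per feature (dimension).
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return x_hat, mu, var
```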

Recovering the Identity Mapping

Normalize:

x̂(k) = (x(k) − E[x(k)]) / √(Var[x(k)])

and then allow the network to undo it, if "wished", through the learnable parameters γ(k) and β(k):

y(k) = γ(k) x̂(k) + β(k)

Notice that the network can learn

γ(k) = √(Var[x(k)]),  β(k) = E[x(k)]

to recover the identity mapping.

Obs.: recall that k refers to a dimension.

ConvNet Layer with BN

Batch normalization is inserted after the convolution and prior to the nonlinearity:

convolution → batch normalization (y(k) = γ(k) x̂(k) + β(k)) → activation function → pooling
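As an illustration (not from the slides), a minimal PyTorch-style sketch of such a block for 2D feature maps:

```python
import torch.nn as nn

def conv_bn_block(in_channels, out_channels):
    # Convolution -> batch norm -> nonlinearity -> pooling.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),   # normalizes each channel over the batch
        nn.ReLU(inplace=True),          # nonlinearity applied after BN
        nn.MaxPool2d(kernel_size=2),    # spatial pooling
    )
```

The convolution bias is dropped here because the β shift of batch normalization makes it redundant; this is a common design choice, not something stated on the slide.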

Batch Normalizing Transform

Input: values of x over a mini-batch ℬ = {x₁…ₘ}; parameters to be learned: γ, β
Output: {yᵢ = BN_γ,β(xᵢ)}

μ_ℬ = (1/m) Σᵢ₌₁..ₘ xᵢ              // mini-batch mean
σ²_ℬ = (1/m) Σᵢ₌₁..ₘ (xᵢ − μ_ℬ)²     // mini-batch variance
x̂ᵢ = (xᵢ − μ_ℬ) / √(σ²_ℬ + ε)        // normalize
yᵢ = γ x̂ᵢ + β ≡ BN_γ,β(xᵢ)           // scale and shift

Ioffe, S. and Szegedy, C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, arXiv. Available online at https://arxiv.org/abs/1502.03167

Batch normalization:

• Improves gradient flow through the network.
• Allows higher learning rates.
• Reduces the dependence on the initialization.


At test time BN functions differently: the mean and std are not computed based on the batch. Instead, fixed empirical values computed during training are used.
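A minimal NumPy sketch (added here) of the full BN_γ,β transform with a train/test switch; the running averages are one common way to obtain the fixed empirical values used at test time (an implementation assumption, not spelled out on the slide):

```python
import numpy as np

def batchnorm(x, gamma, beta, running_mean, running_var,
              train=True, momentum=0.9, eps=1e-5):
    """x: (m, features). Returns y and the updated running statistics."""
    if train:
        mu = x.mean(axis=0)                       # mini-batch mean
        var = x.var(axis=0)                       # mini-batch variance
        # Track empirical statistics for use at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var       # fixed values from training

    x_hat = (x - mu) / np.sqrt(var + eps)         # normalize
    y = gamma * x_hat + beta                      # scale and shift
    return y, running_mean, running_var
```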

Summary

• Activation Functions (use ReLU or Leaky ReLU)

• Data Preprocessing (subtract the mean)

• Weight Initialization (use Xavier init)

• Batch Normalization (use it)


Training Convolutional Neural Networks

Thank you!

R. Q. FEITOSA
