
Page 1:

Training Convolutional Neural Networks

R. Q. FEITOSA

Page 2:

Overview

1. Activation Functions

2. Data Preprocessing

3. Weight Initialization

4. Batch Normalization

2

Page 3:

Activation Functions

3

Page 4:

Activation Layer

4

[Figure: a single neuron. Inputs $x_0, x_1, \dots, x_n$ are scaled by weights $w_0, w_1, \dots, w_n$ (the term $w_0 x_0$ acts as the bias $b$), summed to $\sum_i w_i x_i + b$, and passed through the activation function $f$, producing $f\!\left(\sum_i w_i x_i + b\right)$.]
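As a quick illustration of the diagram above, here is a minimal sketch (assuming NumPy; the input values, weights, and the choice of a sigmoid activation are invented for the example) of a single neuron computing $f\!\left(\sum_i w_i x_i + b\right)$:

```python
import numpy as np

def sigmoid(s):
    """Squash the pre-activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical inputs, weights, and bias for one neuron.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.3

pre_activation = np.dot(w, x) + b   # sum_i w_i x_i + b
output = sigmoid(pre_activation)    # f(sum_i w_i x_i + b)
print(pre_activation, output)
```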

Page 5:

Activation Functions

5

[Plots of four common activation functions:]

• Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

• tanh: $\tanh(x)$

• ReLU: $\max(0, x)$

• Leaky ReLU: $\max(0.01x, x)$
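A minimal NumPy sketch of the four activations listed above (the 0.01 slope for Leaky ReLU follows the slide; everything else is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # range (0, 1)

def tanh(x):
    return np.tanh(x)                 # range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # zero for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)   # small slope for negative inputs

x = np.linspace(-5, 5, 11)
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(x), 3))
```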

Page 6:

Sigmoid

• Squashes numbers to range [0,1]

• Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron.

Three shortcomings:

1) Saturated neurons “kill” the gradients,

2) Sigmoid outputs are not zero-centered,

3) exp() is somewhat computationally expensive.

6

[Plot: sigmoid, $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.]

Page 7:

Sigmoid kills gradient

7

In the backward pass through a sigmoid gate, the upstream gradient $\frac{\partial L}{\partial \sigma}$ is multiplied by the local gradient $\frac{\partial \sigma}{\partial x}$:

$$\frac{\partial L}{\partial x} = \frac{\partial \sigma}{\partial x}\,\frac{\partial L}{\partial \sigma}, \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

In both saturated tails of the sigmoid the local gradient is nearly zero, so the gate “kills” the gradient.
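A small sketch (NumPy assumed) of this backward pass; note how the local gradient $\sigma(x)(1-\sigma(x))$ collapses for large $|x|$, killing whatever upstream gradient arrives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(x, upstream_grad):
    s = sigmoid(x)
    local_grad = s * (1.0 - s)          # dsigma/dx, at most 0.25
    return local_grad * upstream_grad   # chain rule: dL/dx = dsigma/dx * dL/dsigma

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid_backward(x, upstream_grad=1.0))
# Near x = +-10 the returned gradient is ~0: the gate "kills" the gradient.
```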

Page 8:

Sigmoid output not zero-centered

8

What happens when the input to a neuron ($x$) is always positive? For a neuron computing $f\!\left(\sum_i w_i x_i + b\right)$, the gradient on each $w_i$ is proportional to $x_i$, so the gradients on $\boldsymbol{w}$ are either all positive or all negative (they all share the sign of the upstream gradient). Updates can then only move along two of the possible gradient update directions, producing zig-zag updates toward a hypothetical optimal $w$ vector. Therefore, zero-mean data is highly desirable!

[Figure: in the $(w_1, w_2)$ plane, only two quadrants of update directions are possible, so the path toward the optimal $w$ vector zig-zags.]
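A quick numerical check of this claim (NumPy assumed; the numbers are made up): with all-positive inputs, the gradient on every $w_i$ has the same sign as the upstream gradient, since $\partial L/\partial w_i = x_i \cdot \partial L/\partial s$ for the pre-activation $s = \sum_i w_i x_i + b$:

```python
import numpy as np

x = np.array([0.7, 2.1, 0.3])    # all-positive inputs (e.g. sigmoid outputs)
upstream = -0.8                  # dL/ds arriving from further up the network

grad_w = x * upstream            # dL/dw_i = x_i * dL/ds
print(grad_w)                    # every entry shares the sign of `upstream`
```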

Page 9:

Tanh

9


• Squashes numbers to range [-1,1]

• Zero-centered.

Shortcoming:

Still “kills” the gradients when saturated

[Plot: $f(x) = \tanh(x)$; the gradient is “killed” in both saturated tails.]

Page 10:

Rectified Linear Unit

10

• Does not saturate in the positive region.

• Very computationally efficient

• Converges much faster than sigmoid/tanh (about 6×)

Shortcoming:

1) not zero-centered output

$f(x) = \max(0, x)$

Page 11:

ReLU kills gradient for negative input

11

In the backward pass through a ReLU gate, the upstream gradient $\frac{\partial L}{\partial z}$ is multiplied by the local gradient $\frac{\partial z}{\partial x}$:

$$\frac{\partial L}{\partial x} = \frac{\partial z}{\partial x}\,\frac{\partial L}{\partial z}, \qquad z = \mathrm{ReLU}(x) = \max(0, x)$$

For $x < 0$ the local gradient is zero, so the gate “kills” the gradient.

Page 12:

Dead ReLU

12

Consider a neuron $y = Wx + d$ followed by a ReLU, $z = \max(0, Wx + d)$. The upstream gradient $\frac{\partial L}{\partial z}$ is multiplied by the local gradient:

$$\frac{\partial L}{\partial y} = \frac{\partial z}{\partial y}\,\frac{\partial L}{\partial z}, \qquad \frac{\partial z}{\partial y} = \begin{cases} 1, & Wx + d > 0 \\ 0, & \text{otherwise} \end{cases}$$

Only data on the positive side of the hyperplane $Wx + d = 0$ produces a non-zero derivative. If the hyperplane lies entirely outside the data cloud (shown in the $(x_1, x_2)$ plane of the figure), the unit never activates, never receives a gradient, and never updates: a “dead” ReLU.
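A minimal sketch (NumPy assumed; the data cloud and parameters are invented) of a dead ReLU: when $Wx + d < 0$ for every point in the data cloud, the local gradient is zero everywhere, so $W$ and $d$ never receive an update:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=(100, 2))  # data cloud in (x1, x2)

W = np.array([-1.0, -1.0])   # hyperplane oriented away from the data
d = -1.0

y = X @ W + d                        # pre-activation, negative for every sample
z = np.maximum(0.0, y)               # ReLU output: all zeros
local_grad = (y > 0).astype(float)   # dz/dy: 1 on the positive side, else 0

print(z.max(), local_grad.sum())     # 0.0 and 0.0 -> no gradient ever flows
```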

Page 13:

Leaky ReLU

13

• Never saturates.

• Computationally efficient

• Converges much faster than sigmoid/tanh (about 6×)

• Will not “die”.

Shortcoming:

not zero-centered output

$f(x) = \max(0.01x, x)$

Parametric Rectifier (PReLU): $f(x) = \max(\alpha x, x)$, with $\alpha$ a learnable parameter.

Page 14:

Data Preprocessing

14

Page 15:

Preprocessing Operations

15

In practice, for images:

• No scaling/normalization is needed (pixel values are already in the same range)

• Subtract the mean image (e.g. AlexNet)

• Subtract the per-channel mean (e.g. VGGNet)
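A short sketch of the two mean-subtraction options above (NumPy assumed; the N×H×W×C image tensor shape is an assumption made for the example):

```python
import numpy as np

# Hypothetical batch of images: N x H x W x C, values in [0, 255].
images = np.random.default_rng(0).uniform(0, 255, size=(16, 32, 32, 3))

# Option 1: subtract the mean image (AlexNet-style), one mean per pixel position.
mean_image = images.mean(axis=0)               # H x W x C
centered_a = images - mean_image

# Option 2: subtract the per-channel mean (VGGNet-style), one scalar per channel.
channel_mean = images.mean(axis=(0, 1, 2))     # shape (3,)
centered_b = images - channel_mean

print(centered_a.mean(), centered_b.mean())    # both approximately 0
```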

Page 16:

Weight Initialization

16

Page 17:

Small Initial Weights

With small initial weights, as the data flows forward the input ($x$) at each layer tends to concentrate around zero. Recall that for $Wx$ the local gradient with respect to the weights is $\frac{\partial (Wx)}{\partial W} = x$. As the gradient propagates backward it therefore tends to vanish, so there is effectively no update.

17

[Figure: histograms of the input data at each convolutional layer; the distributions concentrate ever more tightly around zero in deeper layers.]

Page 18:

Large Initial Weights

Assume a tanh or a sigmoid activation function. With large initial weights, the inputs to the activation functions fall in the saturated regions, where the local gradient is small (recall again that $\frac{\partial (Wx)}{\partial W} = x$). As the gradients propagate back, they tend to vanish, so there is effectively no update.

18
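The effect described on this and the previous slide can be reproduced in a few lines (NumPy assumed; the 10-layer, 500-unit tanh network is an arbitrary choice for the experiment): tiny initial weights make the layer inputs collapse toward zero, while large ones push them into the saturated tails of tanh:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_stats(weight_scale, num_layers=10, width=500):
    h = rng.normal(size=(1000, width))                 # input batch
    stds = []
    for _ in range(num_layers):
        W = weight_scale * rng.normal(size=(width, width))
        h = np.tanh(h @ W)                             # forward through one layer
        stds.append(h.std())
    return stds

print("small init (0.01):", np.round(layer_stats(0.01), 3))  # stds shrink toward 0
print("large init (1.0): ", np.round(layer_stats(1.0), 3))   # activations saturate near +-1
```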

Page 19:

Optimal Weight Initialization

Still an active research topic:

• Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010

• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013

• Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014

• Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015

• Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015

• All you need is a good init, Mishkin and Matas, 2015

19
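For reference, a sketch of the two initializations most commonly used in practice, in the spirit of Glorot and Bengio (the fan-in variant of "Xavier") and He et al. (usually preferred with ReLU); the layer sizes are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256   # placeholder layer sizes

# Xavier-style initialization (fan-in variant): scale by sqrt(1/fan_in) so the
# activation variance stays roughly constant across tanh-like layers.
W_xavier = rng.normal(size=(fan_in, fan_out)) * np.sqrt(1.0 / fan_in)

# He initialization: the extra factor of 2 compensates for ReLU zeroing
# half of the activations.
W_he = rng.normal(size=(fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

print(W_xavier.std(), W_he.std())
```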

Page 20:

Batch Normalization

20

Page 21:

Making the Activations Gaussian

Take a batch of activations at some layer. To make each dimension $k$ look like a unit Gaussian, apply

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

Notice that this function has a simple derivative.

21

Page 22:

Making the Activations Gaussian

In practice, we take the empirical mean and variance of $x$ over a mini-batch $\mathcal{B} = \{x_{1 \dots m}\}$ (the rows of the batch are the samples $x_1, x_2, \dots, x_m$; the columns are the features):

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i$$

$$\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$$

$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \varepsilon}}$$

where $\varepsilon$ is a constant added for numerical stability, and $\mu_\mathcal{B}$ and $\sigma_\mathcal{B}^2$ are computed for each dimension independently.

22
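A direct NumPy transcription of these three formulas (the batch is random data used only for illustration); each column, i.e. feature dimension, is normalized independently:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(64, 10))  # m=64 samples, 10 features
eps = 1e-5

mu = x.mean(axis=0)                      # mini-batch mean, one value per dimension
var = x.var(axis=0)                      # mini-batch variance, one value per dimension
x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each dimension independently

print(x_hat.mean(axis=0).round(6))       # ~0 per dimension
print(x_hat.std(axis=0).round(3))        # ~1 per dimension
```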

Page 23:

Recovering the Identity Mapping

Normalize:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

and then allow the network to undo it, if “wished”, through the learnable parameters $\gamma^{(k)}$ and $\beta^{(k)}$:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

Notice that the network can learn

$$\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}, \qquad \beta^{(k)} = \mathbb{E}[x^{(k)}]$$

to recover the identity mapping. (Recall that $k$ refers to a dimension.)

23
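Continuing the NumPy sketch from the previous slide, setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathbb{E}[x^{(k)}]$ indeed undoes the normalization (up to the $\varepsilon$ term):

```python
import numpy as np

x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=(64, 10))
eps = 1e-5

mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

gamma = np.sqrt(var)   # learnable in practice; set here to recover the identity
beta = mu
y = gamma * x_hat + beta

print(np.abs(y - x).max())   # tiny, limited only by eps and floating point
```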

Page 24:

ConvNet Layer with BN

Batch normalization is inserted prior to the nonlinearity.

24

[Figure: one ConvNet layer, bottom to top: convolution → batch normalization ($y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$) → activation function → pooling.]
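As a sketch of where BN sits in a ConvNet layer (assuming PyTorch is available; the channel counts and kernel sizes are arbitrary), the block below follows the order convolution → batch normalization → activation → pooling shown above:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=16),   # BN inserted prior to the nonlinearity
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

x = torch.randn(8, 3, 32, 32)          # hypothetical batch of 32x32 RGB images
print(block(x).shape)                  # torch.Size([8, 16, 16, 16])
```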

Page 25:

Batch Normalizing Transform

Input: values of $x$ over a mini-batch $\mathcal{B} = \{x_{1 \dots m}\}$; parameters to be learned: $\gamma, \beta$

Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \text{// mini-batch mean}$$

$$\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2 \qquad \text{// mini-batch variance}$$

$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \varepsilon}} \qquad \text{// normalize}$$

$$y_i = \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i) \qquad \text{// scale and shift}$$

25

Ioffe, S. and Szegedy, C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, arXiv. Available online: https://arxiv.org/abs/1502.03167

• Improves gradient flow through the network

• Allows higher learning rates

• Reduces the dependence on the initialization

Page 26:

Batch Normalizing Transform

(The Batch Normalizing Transform is repeated from the previous slide.)

26

At test time, BN functions differently: the mean and std are not computed from the batch. Instead, fixed empirical values computed during training (typically running averages) are used.
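A compact sketch of a full BN layer (NumPy assumed; the interface and momentum value are illustrative, not a reference implementation): during training it normalizes with batch statistics and updates running averages, while at test time it uses those fixed running values instead:

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, eps=1e-5, momentum=0.9):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)   # learnable scale/shift
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.eps, self.momentum = eps, momentum

    def forward(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)           # batch statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var     # fixed values from training
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(dim=10)
x = np.random.default_rng(2).normal(size=(64, 10))
out_train = bn.forward(x, training=True)
out_test = bn.forward(x, training=False)
print(out_train.std(axis=0).round(2), out_test.std(axis=0).round(2))
```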

Page 27:

Summary

• Activation Functions (use ReLU or Leaky ReLU)

• Data Preprocessing (subtract the mean)

• Weight Initialization (use Xavier init)

• Batch Normalization (use it)

27

Page 28:

Training Convolutional Neural Networks

Thank you!

R. Q. FEITOSA