A Hands-On Experience on Visual Object Recognition
Project

Visual Recognition, Module 5

Ramon Baldrich (Coordination)
Week 5: Joost van de Weijer, Marc Masana, German Ros


Contents

Tips and tricks to make it work

• Stochastic gradient descent with momentum

• Initialization

• Vanishing gradient problem

• Overfitting - Dropout

• Batch normalization


Training

[Figure: three training images labeled 'antelope', 'ballet', and 'boat', their target vectors y_t, and network outputs f(x_t; θ) such as (0, 0.01, 0.99), (0, 1, 0), and (0.98, 0.01, 0.01).]

Given training pairs (x_t, y_t), t = 1, ..., T:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{T} \sum_{t} l\big(f(x_t;\theta),\, y_t\big) \qquad \text{(Empirical Risk)}$$

With one-hot targets and the negative log-likelihood loss:

$$\theta^{*} = \arg\min_{\theta} \; -\frac{1}{T} \sum_{t} \sum_{j} y_t^{\,j} \,\log f_j(x_t;\theta)$$
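As an illustration of the cross-entropy empirical risk above, here is a minimal NumPy sketch; the probability and target arrays are made-up examples, not values from the slide.

```python
import numpy as np

# Cross-entropy empirical risk over T examples with one-hot targets.
f = np.array([[0.00, 0.01, 0.99],   # network outputs f(x_t; theta) (illustrative values)
              [0.98, 0.01, 0.01]])
y = np.array([[0.0, 0.0, 1.0],      # one-hot targets y_t
              [1.0, 0.0, 0.0]])

T = f.shape[0]
risk = -np.sum(y * np.log(f + 1e-12)) / T   # small epsilon avoids log(0)
print(risk)
```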


Training

$$\theta^{*} = \arg\min_{\theta} E(x, y; \theta) = \arg\min_{\theta} \sum_{t} L(x_t, y_t; \theta)$$

Gradient descent: $\theta \leftarrow \theta - \eta \, \nabla_{\theta} E$

• The gradient $\nabla_{\theta} E$ can be computed with the backpropagation algorithm (chain rule).
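A minimal sketch of the resulting gradient-descent loop; `backprop_gradient` is a hypothetical stand-in for whatever routine computes the gradient of E by backpropagation.

```python
def gradient_descent(theta, X, Y, backprop_gradient, lr=0.01, n_steps=100):
    """Full-batch gradient descent: theta <- theta - lr * grad E."""
    for _ in range(n_steps):
        grad_E = backprop_gradient(theta, X, Y)  # dE/dtheta via the chain rule (hypothetical helper)
        theta = theta - lr * grad_E
    return theta
```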

[Pages 5-10: figure-only slides ("Training").]

Training

$$\theta^{*} = \arg\min_{\theta} E(x, y; \theta) = \arg\min_{\theta} \sum_{t} L(x_t, y_t; \theta), \qquad \theta \leftarrow \theta - \eta \, \nabla_{\theta} E$$

• The gradient $\nabla_{\theta} E$ can be computed with the backpropagation algorithm.

$$\nabla_{\theta} E(x, y; \theta) = \nabla_{\theta} \sum_{t} L(x_t, y_t; \theta) = \sum_{t} \nabla_{\theta} L(x_t, y_t; \theta)$$


Training

$$\theta^{*} = \arg\min_{\theta} E(x, y; \theta) = \arg\min_{\theta} \sum_{t} L(x_t, y_t; \theta)$$

• Stochastic gradient descent: if the data set is highly redundant, the gradient over a part of it already represents the gradient over the whole data set, so we can update with the gradient of a single example:

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(x_t, y_t; \theta)$$

• Minibatch gradient descent: use groups of images (called minibatches) to update the parameters.

• Less computation is used (very efficient on GPUs).
• Balance the classes in each minibatch.
• One pass through all the data is called an epoch.

A sketch of the minibatch SGD loop follows below.
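A minimal NumPy sketch of minibatch SGD; `grad_loss` is a hypothetical function returning the gradient of the loss on one minibatch, and the class balancing mentioned above is omitted for brevity.

```python
import numpy as np

def sgd(theta, X, Y, grad_loss, lr=0.01, batch_size=64, n_epochs=10):
    n = X.shape[0]
    for _ in range(n_epochs):                 # one pass over all data = one epoch
        perm = np.random.permutation(n)       # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * grad_loss(theta, X[idx], Y[idx])  # minibatch update
    return theta
```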


Training

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(x_t, y_t; \theta)$$

General recipe for stochastic gradient descent (Hinton):
• Guess an initial learning rate.
• If the error keeps getting worse or oscillates wildly, reduce the learning rate.
• If the error is falling fairly consistently but slowly, increase the learning rate.
• Towards the end of minibatch learning it nearly always helps to turn down the learning rate.

[Figure: error plot (credit: Hinton).]
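A rough sketch of this recipe as code; the thresholds and factors are arbitrary illustrative choices, not values from the slide.

```python
def adjust_learning_rate(lr, prev_error, error, epoch, n_epochs):
    """Heuristic learning-rate adjustment (illustrative constants only)."""
    if error > prev_error:                 # error got worse or oscillates: reduce
        lr *= 0.5
    elif prev_error - error < 1e-4:        # falling consistently but slowly: increase
        lr *= 1.1
    if epoch > 0.9 * n_epochs:             # towards the end: turn the rate down
        lr *= 0.95
    return lr
```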


Training

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(x_t, y_t; \theta)$$

• There exists a lot of literature on more elaborate gradient descent methods.

• Stochastic gradient descent with momentum:

$$v \leftarrow 0.9\, v + 0.1\, \nabla_{\theta} L(x_t, y_t; \theta), \qquad \theta \leftarrow \theta - \eta\, v$$

[Figure: error plot (credit: Hinton).]
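A minimal sketch of SGD with momentum using the 0.9/0.1 coefficients from the update above; as before, `grad_loss` is a hypothetical minibatch-gradient routine.

```python
import numpy as np

def sgd_momentum(theta, X, Y, grad_loss, lr=0.01, batch_size=64, n_epochs=10):
    v = np.zeros_like(theta)                  # velocity: running average of past gradients
    n = X.shape[0]
    for _ in range(n_epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            v = 0.9 * v + 0.1 * grad_loss(theta, X[idx], Y[idx])
            theta = theta - lr * v
    return theta
```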


Training + Regularization

[Figure: training images 'antelope', 'ballet', 'boat' with their target vectors y_t.]

Cross-entropy loss with weight decay (an L2 penalty on all weight matrices W^k of the network):

$$\theta^{*} = \arg\min_{\theta} \; -\frac{1}{T} \sum_{t} \sum_{j} y_t^{\,j} \log f_j(x_t;\theta) \;+\; \lambda \sum_{k} \sum_{i,j} \big(W_{ij}^{k}\big)^{2}$$
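A minimal sketch of adding the weight-decay term to a loss and to its gradients; the `lambda_wd` coefficient and the list-of-matrices representation are assumptions for illustration.

```python
import numpy as np

def weight_decay_penalty(weights, lambda_wd=1e-4):
    """L2 penalty: lambda * sum over all layers k and entries (i, j) of W_ij^2."""
    return lambda_wd * sum(np.sum(W ** 2) for W in weights)

def add_weight_decay_to_grads(weights, grads, lambda_wd=1e-4):
    """Gradient of the penalty w.r.t. each weight matrix is 2 * lambda * W."""
    return [g + 2.0 * lambda_wd * W for W, g in zip(weights, grads)]
```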


Initialization

The initialization of the network is very important for (fast) convergence of the network.

INPUT: networks converge faster if their inputs are whitened (linearly transformed to zero mean and unit variance, and decorrelated).

It is important to observe that the weights cannot all be initialized with the same value: you need to break the symmetry.

BIASES: initialized to zero.

WEIGHTS:
• Glorot & Bengio [2010] aim to keep the outputs of each layer white (zero mean, unit variance):

$$W \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}},\; \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right]$$

In Caffe: weight_filler { type: "xavier" }

• An improvement was proposed by He et al. [arXiv 2015] for ReLU activations.
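A minimal NumPy sketch of the two initializers; the fan-in-only Gaussian form of the He et al. initializer is the commonly used variant and is an assumption here.

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    """Glorot & Bengio (2010): W ~ U[-sqrt(6)/sqrt(n_in+n_out), +sqrt(6)/sqrt(n_in+n_out)]."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    """He et al. (2015): Gaussian with variance 2/n_in, suited to ReLU layers."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

W = xavier_uniform(256, 128)
b = np.zeros(128)   # biases initialized to zero, as on the slide
```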


Training Deep Networks

• Neural networks with one hidden layer can approximate arbitrarily closely any network with multiple layers.

• However, as the number of layers increases, the number of nodes needed to express the same function can decrease exponentially.

Problems:
• Vanishing gradient problem (illustrated in the sketch below)
• Overfitting (the number of parameters is often larger than the number of training examples)
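A small illustrative experiment (not from the slides) showing the vanishing gradient: with sigmoid activations, whose derivative is at most 0.25, the backpropagated gradient magnitude typically shrinks layer by layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
n_layers, width = 10, 100
weights = [np.random.randn(width, width) * 0.1 for _ in range(n_layers)]

# forward pass through a stack of sigmoid layers, keeping the pre-activations
a = np.random.randn(width)
zs = []
for W in weights:
    z = W @ a
    zs.append(z)
    a = sigmoid(z)

# backward pass: propagate a unit gradient and watch its magnitude shrink
grad = np.ones(width)
for W, z in zip(reversed(weights), reversed(zs)):
    grad = W.T @ (grad * sigmoid(z) * (1 - sigmoid(z)))   # chain rule through one layer
    print(np.abs(grad).mean())
```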

[Pages 18-22: figure-only slides; slide credit: P. Poupart.]


Dropout

slide credit: R. Fergus

• You have to play with the best positioning of the dropout layer (often it is placed between the fully connected layers). A minimal sketch of the dropout operation follows below.
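A minimal sketch of dropout as it is often implemented ('inverted' dropout, which rescales during training so that nothing changes at test time); the drop probability of 0.5 is just a common default, not a value from the slides.

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    """Randomly zero out units during training; identity at test time."""
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= p_drop      # keep with prob. 1 - p_drop
    return activations * mask / (1.0 - p_drop)               # rescale kept units
```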


Batch Normalization

• Batch normalization is recent work by Sergey Ioffe and Christian Szegedy [ICML 2015].

• We apply whitening to the input and choose our initial weights in such a way that the activations in between are close to whitened ('Xavier' initialization). It would be nice to also ensure whitened activations during training, for all layers.

• Training of networks is complicated because the distribution of layer inputs changes during training (internal covariate shift). Making normalization at all layers part of the training prevents this internal covariate shift.
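A minimal NumPy sketch of the batch-normalizing transform of Ioffe & Szegedy (2015) for one minibatch (rows are examples, columns are features); `gamma` and `beta` are the learned scale and shift parameters.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                        # per-feature minibatch mean
    var = x.var(axis=0)                        # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize to zero mean, unit variance
    y = gamma * x_hat + beta                   # learned scale and shift
    cache = (x_hat, var, eps, gamma)           # saved for the backward pass
    return y, cache
```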


Batch Normalization

Backpropagation with Batch Normalization
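A sketch of backpropagation through the batch-norm transform, following the gradients given in the Ioffe & Szegedy paper; `cache` refers to the tuple returned by the forward sketch above.

```python
import numpy as np

def batch_norm_backward(dy, cache):
    x_hat, var, eps, gamma = cache
    m = dy.shape[0]                                        # minibatch size
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    dx_hat = dy * gamma
    inv_std = 1.0 / np.sqrt(var + eps)
    dx = (inv_std / m) * (m * dx_hat
                          - np.sum(dx_hat, axis=0)
                          - x_hat * np.sum(dx_hat * x_hat, axis=0))
    return dx, dgamma, dbeta
```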


Batch Normalization

Results on MNIST [Ioffe & Szegedy, 2015]


Batch Normalization

Results on ImageNet [Ioffe & Szegedy, 2015]

Learning rate multiplied by 5 and by 30; 14× faster to reach the same results.


Conclusion

We discussed a set of tools which you can use to improve the training of CNNs:

• Change the learning rate / weight decay.
• SGD with momentum can speed up convergence.
• Correct initialization is crucial for the successful application of DNNs.
• Dropout is an effective method to prevent overfitting.
• Batch normalization prevents the internal covariate shift and therefore allows higher learning rates (it seems that dropout is not necessary when using batch normalization).



Assignment

As an assignment for next week, Monday 11/4 (deadline at 9:00):

• Submit a short presentation in which you show your results for the different exercises (exercises 1-4).
• Include a copy of your best network (mynet_train.m).

Exam Material

All material, including the hands-on slides (except the batch-normalization derivation of backpropagation, and Generative Adversarial Networks).
