A Hands-On Experience on Visual Object Recognition
Project

Visual Recognition, Module 5

Ramon Baldrich (Coordination)
Week 5: Joost van de Weijer, Marc Masana, German Ros


Contents

Tips and tricks to make it work

• Stochastic gradient descent with momentum

• Initialization

• Vanishing gradient problem

• Overfitting - Dropout

• Batch normalization


Training

[Figure: three training images labeled 'antelope', 'ballet', and 'boat', their target vectors y_t, and network outputs f(x_t; θ) such as (0, 0.01, 0.99), (0, 1, 0), and (0.98, 0.01, 0.01).]

Given training pairs (x_t, y_t), t = 1, ..., T:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{T} \sum_{t} l\big(f(x_t;\theta),\, y_t\big) \qquad \text{(Empirical Risk)}$$

With one-hot targets and the negative log-likelihood loss:

$$\theta^{*} = \arg\min_{\theta} \; -\frac{1}{T} \sum_{t} \sum_{j} y_t^{\,j} \,\log f_j(x_t;\theta)$$
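As an illustration of the cross-entropy empirical risk above, here is a minimal NumPy sketch; the probability and target arrays are made-up examples, not values from the slide.

```python
import numpy as np

# Cross-entropy empirical risk over T examples with one-hot targets.
f = np.array([[0.00, 0.01, 0.99],   # network outputs f(x_t; theta) (illustrative values)
              [0.98, 0.01, 0.01]])
y = np.array([[0.0, 0.0, 1.0],      # one-hot targets y_t
              [1.0, 0.0, 0.0]])

T = f.shape[0]
risk = -np.sum(y * np.log(f + 1e-12)) / T   # small epsilon avoids log(0)
print(risk)
```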


Training

$$\theta^{*} = \arg\min_{\theta} E(x, y; \theta) = \arg\min_{\theta} \sum_{t} L(x_t, y_t; \theta)$$

Gradient descent: $\theta \leftarrow \theta - \eta \, \nabla_{\theta} E$

• The gradient $\nabla_{\theta} E$ can be computed with the backpropagation algorithm (chain rule).
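A minimal sketch of the resulting gradient-descent loop; `backprop_gradient` is a hypothetical stand-in for whatever routine computes the gradient of E by backpropagation.

```python
def gradient_descent(theta, X, Y, backprop_gradient, lr=0.01, n_steps=100):
    """Full-batch gradient descent: theta <- theta - lr * grad E."""
    for _ in range(n_steps):
        grad_E = backprop_gradient(theta, X, Y)  # dE/dtheta via the chain rule (hypothetical helper)
        theta = theta - lr * grad_E
    return theta
```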

[Pages 5-10: figure-only slides ("Training").]

Training

$$\theta^{*} = \arg\min_{\theta} E(x, y; \theta) = \arg\min_{\theta} \sum_{t} L(x_t, y_t; \theta), \qquad \theta \leftarrow \theta - \eta \, \nabla_{\theta} E$$

• The gradient $\nabla_{\theta} E$ can be computed with the backpropagation algorithm.

$$\nabla_{\theta} E(x, y; \theta) = \nabla_{\theta} \sum_{t} L(x_t, y_t; \theta) = \sum_{t} \nabla_{\theta} L(x_t, y_t; \theta)$$


Training

$$\theta^{*} = \arg\min_{\theta} E(x, y; \theta) = \arg\min_{\theta} \sum_{t} L(x_t, y_t; \theta)$$

• Stochastic gradient descent: if the data set is highly redundant, the gradient over a part of it already represents the gradient over the whole data set, so we can update with the gradient of a single example:

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(x_t, y_t; \theta)$$

• Minibatch gradient descent: use groups of images (called minibatches) to update the parameters.

• Less computation is used (very efficient on GPUs).
• Balance the classes in each minibatch.
• One pass through all the data is called an epoch.

A sketch of the minibatch SGD loop follows below.
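A minimal NumPy sketch of minibatch SGD; `grad_loss` is a hypothetical function returning the gradient of the loss on one minibatch, and the class balancing mentioned above is omitted for brevity.

```python
import numpy as np

def sgd(theta, X, Y, grad_loss, lr=0.01, batch_size=64, n_epochs=10):
    n = X.shape[0]
    for _ in range(n_epochs):                 # one pass over all data = one epoch
        perm = np.random.permutation(n)       # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * grad_loss(theta, X[idx], Y[idx])  # minibatch update
    return theta
```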


Training

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(x_t, y_t; \theta)$$

General recipe for stochastic gradient descent (Hinton):
• Guess an initial learning rate.
• If the error keeps getting worse or oscillates wildly, reduce the learning rate.
• If the error is falling fairly consistently but slowly, increase the learning rate.
• Towards the end of minibatch learning it nearly always helps to turn down the learning rate.

[Figure: error plot (credit: Hinton).]
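A rough sketch of this recipe as code; the thresholds and factors are arbitrary illustrative choices, not values from the slide.

```python
def adjust_learning_rate(lr, prev_error, error, epoch, n_epochs):
    """Heuristic learning-rate adjustment (illustrative constants only)."""
    if error > prev_error:                 # error got worse or oscillates: reduce
        lr *= 0.5
    elif prev_error - error < 1e-4:        # falling consistently but slowly: increase
        lr *= 1.1
    if epoch > 0.9 * n_epochs:             # towards the end: turn the rate down
        lr *= 0.95
    return lr
```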


Training

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(x_t, y_t; \theta)$$

• There exists a lot of literature on more elaborate gradient descent methods.

• Stochastic gradient descent with momentum:

$$v \leftarrow 0.9\, v + 0.1\, \nabla_{\theta} L(x_t, y_t; \theta), \qquad \theta \leftarrow \theta - \eta\, v$$

[Figure: error plot (credit: Hinton).]
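A minimal sketch of SGD with momentum using the 0.9/0.1 coefficients from the update above; as before, `grad_loss` is a hypothetical minibatch-gradient routine.

```python
import numpy as np

def sgd_momentum(theta, X, Y, grad_loss, lr=0.01, batch_size=64, n_epochs=10):
    v = np.zeros_like(theta)                  # velocity: running average of past gradients
    n = X.shape[0]
    for _ in range(n_epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            v = 0.9 * v + 0.1 * grad_loss(theta, X[idx], Y[idx])
            theta = theta - lr * v
    return theta
```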


Training + Regularization

[Figure: training images 'antelope', 'ballet', 'boat' with their target vectors y_t.]

Cross-entropy loss with weight decay (an L2 penalty on all weight matrices W^k of the network):

$$\theta^{*} = \arg\min_{\theta} \; -\frac{1}{T} \sum_{t} \sum_{j} y_t^{\,j} \log f_j(x_t;\theta) \;+\; \lambda \sum_{k} \sum_{i,j} \big(W_{ij}^{k}\big)^{2}$$
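A minimal sketch of adding the weight-decay term to a loss and to its gradients; the `lambda_wd` coefficient and the list-of-matrices representation are assumptions for illustration.

```python
import numpy as np

def weight_decay_penalty(weights, lambda_wd=1e-4):
    """L2 penalty: lambda * sum over all layers k and entries (i, j) of W_ij^2."""
    return lambda_wd * sum(np.sum(W ** 2) for W in weights)

def add_weight_decay_to_grads(weights, grads, lambda_wd=1e-4):
    """Gradient of the penalty w.r.t. each weight matrix is 2 * lambda * W."""
    return [g + 2.0 * lambda_wd * W for W, g in zip(weights, grads)]
```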


Initialization

The initialization of the network is very important for (fast) convergence of the network.

INPUT: networks converge faster if their inputs are whitened (linearly transformed to zero mean and unit variance, and decorrelated).

It is important to observe that the weights cannot all be initialized with the same value: you need to break the symmetry.

BIASES: initialized to zero.

WEIGHTS:
• Glorot & Bengio [2010] aim to keep the outputs of each layer white (zero mean, unit variance):

$$W \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}},\; \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right]$$

In Caffe: weight_filler { type: "xavier" }

• An improvement was proposed by He et al. [arXiv 2015] for ReLU activations.
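A minimal NumPy sketch of the two initializers; the fan-in-only Gaussian form of the He et al. initializer is the commonly used variant and is an assumption here.

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    """Glorot & Bengio (2010): W ~ U[-sqrt(6)/sqrt(n_in+n_out), +sqrt(6)/sqrt(n_in+n_out)]."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    """He et al. (2015): Gaussian with variance 2/n_in, suited to ReLU layers."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

W = xavier_uniform(256, 128)
b = np.zeros(128)   # biases initialized to zero, as on the slide
```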


Training Deep Networks

• Neural networks with one hidden layer can approximate arbitrarily closely any network with multiple layers.

• However, as the number of layers increases, the number of nodes needed to express the same function can decrease exponentially.

Problems:
• Vanishing gradient problem (illustrated in the sketch below)
• Overfitting (the number of parameters is often larger than the number of training examples)
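A small illustrative experiment (not from the slides) showing the vanishing gradient: with sigmoid activations, whose derivative is at most 0.25, the backpropagated gradient magnitude typically shrinks layer by layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
n_layers, width = 10, 100
weights = [np.random.randn(width, width) * 0.1 for _ in range(n_layers)]

# forward pass through a stack of sigmoid layers, keeping the pre-activations
a = np.random.randn(width)
zs = []
for W in weights:
    z = W @ a
    zs.append(z)
    a = sigmoid(z)

# backward pass: propagate a unit gradient and watch its magnitude shrink
grad = np.ones(width)
for W, z in zip(reversed(weights), reversed(zs)):
    grad = W.T @ (grad * sigmoid(z) * (1 - sigmoid(z)))   # chain rule through one layer
    print(np.abs(grad).mean())
```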

[Pages 18-22: figure-only slides; slide credit: P. Poupart.]


Dropout

slide credit: R. Fergus

• You have to play with the best positioning of the dropout layer (often it is placed between the fully connected layers). A minimal sketch of the dropout operation follows below.
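A minimal sketch of dropout as it is often implemented ('inverted' dropout, which rescales during training so that nothing changes at test time); the drop probability of 0.5 is just a common default, not a value from the slides.

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    """Randomly zero out units during training; identity at test time."""
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= p_drop      # keep with prob. 1 - p_drop
    return activations * mask / (1.0 - p_drop)               # rescale kept units
```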


Batch Normalization

• Batch normalization is recent work by Sergey Ioffe and Christian Szegedy [ICML 2015].

• We apply whitening to the input and choose our initial weights in such a way that the activations in between are close to whitened ('Xavier' initialization). It would be nice to also ensure whitened activations during training, for all layers.

• Training of networks is complicated because the distribution of layer inputs changes during training (internal covariate shift). Making normalization at all layers part of the training prevents this internal covariate shift.
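A minimal NumPy sketch of the batch-normalizing transform of Ioffe & Szegedy (2015) for one minibatch (rows are examples, columns are features); `gamma` and `beta` are the learned scale and shift parameters.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                        # per-feature minibatch mean
    var = x.var(axis=0)                        # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize to zero mean, unit variance
    y = gamma * x_hat + beta                   # learned scale and shift
    cache = (x_hat, var, eps, gamma)           # saved for the backward pass
    return y, cache
```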


Batch Normalization

Backpropagation with Batch Normalization
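A sketch of backpropagation through the batch-norm transform, following the gradients given in the Ioffe & Szegedy paper; `cache` refers to the tuple returned by the forward sketch above.

```python
import numpy as np

def batch_norm_backward(dy, cache):
    x_hat, var, eps, gamma = cache
    m = dy.shape[0]                                        # minibatch size
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    dx_hat = dy * gamma
    inv_std = 1.0 / np.sqrt(var + eps)
    dx = (inv_std / m) * (m * dx_hat
                          - np.sum(dx_hat, axis=0)
                          - x_hat * np.sum(dx_hat * x_hat, axis=0))
    return dx, dgamma, dbeta
```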


Batch Normalization

Results on MNIST [Ioffe & Szegedy, 2015]


Batch Normalization

Results on ImageNet [Ioffe & Szegedy, 2015]

Learning rate multiplied by 5 and by 30; 14× faster to reach the same results.


Conclusion

We discussed a set of tools which you can use to improve the training of CNNs:

• Change the learning rate / weight decay.
• SGD with momentum can speed up convergence.
• Correct initialization is crucial for the successful application of DNNs.
• Dropout is an effective method to prevent overfitting.
• Batch normalization prevents the internal covariate shift and therefore allows higher learning rates (it seems that dropout is not necessary when using batch normalization).



Assignment

As an assignment for next week, Monday 11/4 (deadline at 9:00):

• Submit a short presentation in which you show your results for the different exercises (exercises 1-4).
• Include a copy of your best network (mynet_train.m).

Exam Material

All material, including the hands-on slides (except the batch-normalization derivation of backpropagation, and Generative Adversarial Networks).
