TRANSCRIPT
Data-dependent Initializations of Convolutional Neural Networks
Philipp Krähenbühl, Carl Doersch, Jeff Donahue, Trevor Darrell
Presented by: Torsten Koller
Outline
• Introduction
– Why is training CNNs hard?
– Vanishing gradients
– Common strategies to improve training
• Data-Dependent Initialization of CNNs
– Within-layer normalization
– Between-layer normalization
– Unsupervised pre-initialization
• Experiments
• Summary
Introduction: Why is training CNNs hard?
• Large number of parameters
– High computational demand
• Non-convex objective
– (Poor) local optima
• Hierarchical structure
– Vanishing / exploding gradients
• Requires large amounts of data
http://www.cs.toronto.edu/~ranzato/publications/ranzato_cvpr13.pdf
Introduction: Vanishing (Exploding) Gradients
• Influence of lower-level weights can decline (or grow) exponentially with depth.
• Initialization matters.
[Figure: chain of layers, Layer 1 … Layer k … Layer N, feeding into the loss]
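To see where the exponential behaviour comes from, consider the standard back-propagation chain rule (a generic identity, not an equation taken from the original slide):

\[
\frac{\partial \mathcal{L}}{\partial x_1} \;=\; \frac{\partial \mathcal{L}}{\partial x_N} \prod_{k=2}^{N} \frac{\partial x_k}{\partial x_{k-1}},
\]

where \(x_k\) denotes the activations of layer k. If the Jacobian norms in the product are consistently below (above) 1, the gradient reaching the early layers shrinks (grows) roughly exponentially with the depth N, which is why the initial scale of the weights matters so much.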
Introduction: Change Rate of Weights
• Recall: stochastic gradient descent
• Change rate determined by:
– Global learning rate
– Architecture of the CNN
– Initialization (current weights)
– Data distribution
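Written out for completeness, the standard SGD step presumably being recalled here (a reconstruction under that assumption):

\[
W \;\leftarrow\; W - \eta \,\frac{\partial \ell(x)}{\partial W},
\]

so the per-step change of a weight is \(\eta\,\partial \ell(x)/\partial W\): the global learning rate \(\eta\), the architecture (which shapes the gradient), the current weights, and the data distribution over \(x\) jointly determine the change rate.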
Introduction: Common Approaches
• Improvements based on
– Global learning rate: learning rate schedules [1]
– Architecture of the CNN: activation functions [2], depth of architecture, etc.
– Initialization: (un-)supervised pre-training [3], (proper) random initialization [5-7], structured initialization [4], etc.
– Data distribution: batch normalization [8], etc.
Data-dependent Initializations of CNNs
• Goal: speeding up CNN training via proper random initialization
• Desirable properties:
– Applicability to arbitrary feed-forward networks
– Unsupervised
– Data-efficient
Data-dependent Initializations of CNNs
• Change rate is a random variable
• Expected change rate per weight:
• Idea: initialize & normalize such that the same learning rate is enforced for each weight
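A plausible way to state the condition sketched on this slide, assuming the change rate of a weight is measured by its expected squared gradient over the data:

\[
C_{ij} \;=\; \mathbb{E}_{x}\!\left[\left(\frac{\partial \ell(x)}{\partial W_{ij}}\right)^{\!2}\right] \;\approx\; C \quad \text{for all weights } W_{ij},
\]

i.e. the network is initialized and rescaled so that a single global learning rate drives every weight at (approximately) the same rate.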
Model assumptions
• Idea: simplify & approximate
[Figure: layer k inside the network; the dependency of layer k's gradient on everything between it and the loss is hard to control]
Model assumptions
• Idea: simplify & approximate
[Figure: the network downstream of layer k is first simplified, then the remaining dependency on the loss is "averaged out" as an approximation]
Model assumptions
• Making the procedure unsupervised:
– Assumption:
[Figure: layer k inside the network, with the loss at the end]
Model assumptions
• Change in objective:
• Advantage: (approximate) independence
• The expected change rate per weight factors into a column-wise constant times a layer-wise constant, so enforcing a constant change rate per column yields a constant change rate per weight within each layer.
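One way to make the factorization concrete, under the independence assumption introduced above (this is my reading of the slide, not a verbatim reconstruction): for an affine layer \(y = Wx + b\), back-propagation gives \(\partial \ell / \partial W_{ij} = \delta_i x_j\) with \(\delta = \partial \ell / \partial y\), so

\[
\mathbb{E}\!\left[\left(\frac{\partial \ell}{\partial W_{ij}}\right)^{\!2}\right] \;\approx\; \mathbb{E}\!\left[\delta_i^{2}\right]\,\mathbb{E}\!\left[x_j^{2}\right],
\]

where \(\mathbb{E}[x_j^{2}]\) plays the role of the column-wise constant and \(\mathbb{E}[\delta_i^{2}]\) that of the (approximately) layer-wise constant. Within-layer normalization equalizes the first factor, between-layer normalization the second.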
Method
• Two-step normalization, applied after the (pre-)initialization:
1. Within-layer normalization: enforce a constant learning rate within each layer k
2. Between-layer normalization: rescale the per-layer learning rate to obtain global consistency
Method: Within-Layer Normalization
• Estimating and normalizing, one affine layer at a time:
1. Affine layer 1: sample a batch of its activations, estimate the statistic for all channels i, and rescale the layer.
2. Affine layer 2: repeat the same sample / estimate / rescale steps on top of the already normalized layer 1.
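A minimal NumPy sketch of one such sample / estimate / rescale pass, assuming the per-channel statistic being estimated is the standard deviation of the layer's outputs and that the goal is unit scale per channel (names and shapes are illustrative, not the authors' released code):

import numpy as np

def within_layer_normalize(W, b, X, eps=1e-8):
    """One within-layer pass for an affine layer y = X @ W.T + b.

    W: (out_channels, in_features), b: (out_channels,), X: (n_samples, in_features).
    Rescales each output channel i so its activations have roughly unit std on X."""
    Y = X @ W.T + b                 # sample the layer's activations on a small batch
    scale = Y.std(axis=0) + eps     # estimate the statistic for all channels i
    W = W / scale[:, None]          # rescale the weights ...
    b = b / scale                   # ... and the biases of channel i by the same factor
    return W, b, X @ W.T + b        # normalized layer and its (now unit-scale) activations

# Layers are processed front to back, each one seeing the already normalized
# activations of its predecessor, e.g. (hypothetical variable names):
#   W1, b1, H1 = within_layer_normalize(W1, b1, X_batch)
#   W2, b2, H2 = within_layer_normalize(W2, b2, np.maximum(H1, 0))  # ReLU in between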
Method: How about non-affine Layers?
• Q: How do we rescale non-linear layers?
• Assumptions:
– Same operation on each channel
– Operations on different channels are independent
– Not parametrized
[Figure: non-affine layer]
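A useful observation for why such layers need no rescaling of their own (my addition, phrased as an assumption rather than a quote from the slides): the common non-affine layers in this architecture, ReLU and max-pooling, act channel-wise and are positively homogeneous,

\[
f(\alpha x) = \alpha f(x) \quad \text{for all } \alpha > 0,
\]

so any per-channel rescaling applied to the preceding affine layer simply passes through them unchanged; only the affine layers ever need to be touched.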
Method
• Two-step normalization, applied after the (pre-)initialization:
1. Within-layer normalization: enforce a constant learning rate within each layer k
2. Between-layer normalization: rescale the per-layer learning rate to obtain global consistency
Method: Between-Layer Normalization
• Idea:
1. Rescale the change rate of layer k
2. Undo the rescaling in layer k+1
Same network output, different gradients ("homework")
• Problem: changes the learning rate of the other layers
• Solution: iterative method (see the sketch below)
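A small sketch of the rescale-and-undo step for a pair of consecutive affine layers, assuming a positively homogeneous non-linearity (e.g. ReLU) between them; this illustrates the idea and is not the authors' implementation:

def rescale_pair(W_k, b_k, W_next, alpha):
    """Scale layer k by alpha > 0 and undo it in layer k+1.

    With a ReLU-like layer in between, relu(alpha * z) = alpha * relu(z),
    so the network output is unchanged while the gradient magnitudes of
    the two layers move in opposite directions."""
    return alpha * W_k, alpha * b_k, W_next / alpha

Iterating such pairwise rescalings over all layers, with alpha chosen from the estimated per-layer change rates, is one way to realize the iterative method mentioned above until the rates roughly agree across the network.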
Method: Pre-Initialization
• Until now:
– Pre-initialize zero-mean Gaussian weights
– Within-layer normalization
– Between-layer normalization
• Variants of pre-initialization:
1. PCA on convolutional layers: output feature maps are white & decorrelated
2. Spherical k-means [11] on convolutional layers: outputs correspond to the centroids of spherical k-means (see the sketch after this list)
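A compact NumPy sketch of variant 2, assuming the filters of a convolutional layer are initialized from L2-normalized image patches clustered with spherical k-means (patch extraction, shapes, and function names are illustrative; see Coates & Ng [11] for the original procedure):

import numpy as np

def spherical_kmeans_filters(patches, num_filters, iters=10, seed=0):
    """patches: (n, k*k*c) flattened image patches.
    Returns (num_filters, k*k*c) unit-norm centroids to use as initial conv weights."""
    rng = np.random.default_rng(seed)
    X = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    D = X[rng.choice(len(X), num_filters, replace=False)]   # start from random data points
    for _ in range(iters):
        assign = (X @ D.T).argmax(axis=1)                    # assign by cosine similarity
        for j in range(num_filters):
            members = X[assign == j]
            if len(members):
                centroid = members.sum(axis=0)
                D[j] = centroid / (np.linalg.norm(centroid) + 1e-8)  # project back to the sphere
    return D  # reshape to (num_filters, c, k, k) before copying into the conv layer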
Experiments
• Architecture: CaffeNet [10]
[Figure: CaffeNet architecture built from Input, Conv/FC, ReLU, and Max-pooling blocks; figure from Babenko et al., Neural Codes for Image Retrieval]
Experiments
• Dataset: PASCAL VOC 2007
– 5011 training images
– 4952 test images
– 4 top-level classes (Person, Animal, Vehicle, Indoor)
– 20 subclasses
http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
Experiments
• Tasks
1. Image classification: presence or absence of classes
• SGD for 80,000 iterations, momentum 0.9, batch size 10
• Learning rate: 0.001 (times 0.5 every 10,000 iterations)
2. Object detection: classification and localization (bounding box)
• Fast R-CNN [9] for 150,000 iterations
• Learning rate: 0.01 / 0.002 / 0.001 (times 0.1 every 50,000 iterations)
http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
Experiments
• Experiment 1: Do we get a constant change rate?
– Evaluation on CaffeNet
– 160 images for data-dependent initialization
– 100 "test" images to approximate change-rate statistics after initialization
Krähenbühl et al., Data-dependent Initializations of CNNs
Experiments
• Experiment 2: Classification
– Scaling vs. no scaling
– Within-layer vs. between-layer vs. both
– Gaussian vs. k-means pre-initialization
– Four types of optimization methods
Krähenbühl et al., Data-dependent Initializations of CNNs
Experiments
• Experiment 3: Classification + detection
– Five different types of pre-training:
1. Egomotion: motion between two images
2. Motion: relative motion of objects
3. Unsupervised: relative arrangement of image patches
4. k-means
5. 1000 class labels: model pre-trained on ImageNet
Krähenbühl et al., Data-dependent Initializations of CNNs
Experiments
• Experiment 4: ImageNet nearest neighbors
– Find nearest neighbors in layer fc7
Krähenbühl et al., Data-dependent Initializations of CNNs
Experiments
Krähenbühl et al., Data-dependent Initializations of CNNs
Summary
• Two-step normalization, applied after the (pre-)initialization:
1. Within-layer normalization: enforce a constant learning rate within each layer k
2. Between-layer normalization: rescale the per-layer learning rate to obtain global consistency
Summary
• Presented approach:
– Widely applicable to various feed-forward architectures
– Unsupervised
– Data-efficient and fast
– Can be used on top of other pre-training methods
• Limitations:
– Relies on several assumptions
– Not generally applicable to non-feed-forward networks
Thank you!
Q&A
References
[1] Andrew Senior, Georg Heigold, Marc'Aurelio Ranzato, and Ke Yang. An empirical study of learning rates in deep neural networks for speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[2] Yann LeCun, Léon Bottou, Genevieve Orr, and Klaus-Robert Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade. Springer, 1998.
[3] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, and Pascal Vincent. Why does unsupervised pre-training help deep learning? JMLR, 2010.
[4] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27, pages 3320-3328, 2014.
[5] Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent Initializations of Convolutional Neural Networks. ICLR, 2016.
[6] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 2015.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
[9] Ross Girshick. Fast R-CNN. ICCV, 2015.
[10] http://caffe.berkeleyvision.org/model_zoo.html
[11] Adam Coates and Andrew Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade, pages 561-580. Springer, 2012.