TRANSCRIPT
Data-dependent Initializations of Convolutional Neural Networks
Philipp Krähenbühl, Carl Doersch, Jeff Donahue, Trevor Darrell
Presented by: Torsten Koller
Outline
• Introduction
– Why is training CNNs hard?
– Vanishing gradients
– Common strategies to improve training
• Data-Dependent Initialization of CNNs
– Within-layer normalization
– Between-layer normalization
– Unsupervised pre-initialization
• Experiments
• Summary
Introduction: Why is training CNNs hard?
• Large number of parameters
– High computational demand
• Non-convex objective
– (Poor) local optima
• Hierarchical structure
– Vanishing / exploding gradients
• Requires large amounts of data
http://www.cs.toronto.edu/~ranzato/publications/ranzato_cvpr13.pdf
Introduction: Vanishing (Exploding) Gradients
• Influence of lower-level weights can decline (or grow) exponentially with depth.
• Initialization matters.
[Figure: chain of layers, Layer 1 … Layer k … Layer N, feeding into the loss]
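To see where the exponential behaviour comes from, consider the standard back-propagation chain rule (a generic identity, not an equation taken from the original slide):

\[
\frac{\partial \mathcal{L}}{\partial x_1} \;=\; \frac{\partial \mathcal{L}}{\partial x_N} \prod_{k=2}^{N} \frac{\partial x_k}{\partial x_{k-1}},
\]

where \(x_k\) denotes the activations of layer k. If the Jacobian norms in the product are consistently below (above) 1, the gradient reaching the early layers shrinks (grows) roughly exponentially with the depth N, which is why the initial scale of the weights matters so much.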
Introduction: Change Rate of Weights
• Recall: stochastic gradient descent
• Change rate determined by:
– Global learning rate
– Architecture of the CNN
– Initialization (current weights)
– Data distribution
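Written out for completeness, the standard SGD step presumably being recalled here (a reconstruction under that assumption):

\[
W \;\leftarrow\; W - \eta \,\frac{\partial \ell(x)}{\partial W},
\]

so the per-step change of a weight is \(\eta\,\partial \ell(x)/\partial W\): the global learning rate \(\eta\), the architecture (which shapes the gradient), the current weights, and the data distribution over \(x\) jointly determine the change rate.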
Introduction: Common Approaches
• Improvements based on
– Global learning rate: learning rate schedules [1]
– Architecture of the CNN: activation functions [2], depth of architecture, etc.
– Initialization: (un-)supervised pre-training [3], (proper) random initialization [5-7], structured initialization [4], etc.
– Data distribution: batch normalization [8], etc.
Data-dependent Initializations of CNNs
• Goal: speeding up CNN training via proper random initialization
• Desirable properties:
– Applicability to arbitrary feed-forward networks
– Unsupervised
– Data-efficient
Data-dependent Initializations of CNNs
• Change rate is a random variable
• Expected change rate per weight:
• Idea: initialize & normalize such that the same learning rate is enforced for each weight
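A plausible way to state the condition sketched on this slide, assuming the change rate of a weight is measured by its expected squared gradient over the data:

\[
C_{ij} \;=\; \mathbb{E}_{x}\!\left[\left(\frac{\partial \ell(x)}{\partial W_{ij}}\right)^{\!2}\right] \;\approx\; C \quad \text{for all weights } W_{ij},
\]

i.e. the network is initialized and rescaled so that a single global learning rate drives every weight at (approximately) the same rate.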
Model assumptions
• Idea: simplify & approximate
[Figure: layer k inside the network; the dependency of layer k's gradient on everything between it and the loss is hard to control]
Model assumptions
• Idea: simplify & approximate
[Figure: the network downstream of layer k is first simplified, then the remaining dependency on the loss is "averaged out" as an approximation]
Model assumptions
• Making the procedure unsupervised:
– Assumption:
[Figure: layer k inside the network, with the loss at the end]
Model assumptions
• Change in objective:
• Advantage: (approximate) independence
• The expected change rate per weight factors into a column-wise constant times a layer-wise constant, so enforcing a constant change rate per column yields a constant change rate per weight within each layer.
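One way to make the factorization concrete, under the independence assumption introduced above (this is my reading of the slide, not a verbatim reconstruction): for an affine layer \(y = Wx + b\), back-propagation gives \(\partial \ell / \partial W_{ij} = \delta_i x_j\) with \(\delta = \partial \ell / \partial y\), so

\[
\mathbb{E}\!\left[\left(\frac{\partial \ell}{\partial W_{ij}}\right)^{\!2}\right] \;\approx\; \mathbb{E}\!\left[\delta_i^{2}\right]\,\mathbb{E}\!\left[x_j^{2}\right],
\]

where \(\mathbb{E}[x_j^{2}]\) plays the role of the column-wise constant and \(\mathbb{E}[\delta_i^{2}]\) that of the (approximately) layer-wise constant. Within-layer normalization equalizes the first factor, between-layer normalization the second.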
Method
• Two-step normalization, applied after the (pre-)initialization:
1. Within-layer normalization: enforce a constant learning rate within each layer k
2. Between-layer normalization: rescale the per-layer learning rate to obtain global consistency
Method: Within-Layer Normalization
• Estimating and normalizing, one affine layer at a time:
1. Affine layer 1: sample a batch of its activations, estimate the statistic for all channels i, and rescale the layer.
2. Affine layer 2: repeat the same sample / estimate / rescale steps on top of the already normalized layer 1.
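A minimal NumPy sketch of one such sample / estimate / rescale pass, assuming the per-channel statistic being estimated is the standard deviation of the layer's outputs and that the goal is unit scale per channel (names and shapes are illustrative, not the authors' released code):

import numpy as np

def within_layer_normalize(W, b, X, eps=1e-8):
    """One within-layer pass for an affine layer y = X @ W.T + b.

    W: (out_channels, in_features), b: (out_channels,), X: (n_samples, in_features).
    Rescales each output channel i so its activations have roughly unit std on X."""
    Y = X @ W.T + b                 # sample the layer's activations on a small batch
    scale = Y.std(axis=0) + eps     # estimate the statistic for all channels i
    W = W / scale[:, None]          # rescale the weights ...
    b = b / scale                   # ... and the biases of channel i by the same factor
    return W, b, X @ W.T + b        # normalized layer and its (now unit-scale) activations

# Layers are processed front to back, each one seeing the already normalized
# activations of its predecessor, e.g. (hypothetical variable names):
#   W1, b1, H1 = within_layer_normalize(W1, b1, X_batch)
#   W2, b2, H2 = within_layer_normalize(W2, b2, np.maximum(H1, 0))  # ReLU in between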
Method: How about non-affine Layers?
• Q: How do we rescale non-linear layers?
• Assumptions:
– Same operation on each channel
– Operations on different channels are independent
– Not parametrized
[Figure: non-affine layer]
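A useful observation for why such layers need no rescaling of their own (my addition, phrased as an assumption rather than a quote from the slides): the common non-affine layers in this architecture, ReLU and max-pooling, act channel-wise and are positively homogeneous,

\[
f(\alpha x) = \alpha f(x) \quad \text{for all } \alpha > 0,
\]

so any per-channel rescaling applied to the preceding affine layer simply passes through them unchanged; only the affine layers ever need to be touched.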
Method
• Two-step normalization, applied after the (pre-)initialization:
1. Within-layer normalization: enforce a constant learning rate within each layer k
2. Between-layer normalization: rescale the per-layer learning rate to obtain global consistency
Method: Between-Layer Normalization
• Idea:
1. Rescale the change rate of layer k
2. Undo the rescaling in layer k+1
Same network output, different gradients ("homework")
• Problem: changes the learning rate of the other layers
• Solution: iterative method (see the sketch below)
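A small sketch of the rescale-and-undo step for a pair of consecutive affine layers, assuming a positively homogeneous non-linearity (e.g. ReLU) between them; this illustrates the idea and is not the authors' implementation:

def rescale_pair(W_k, b_k, W_next, alpha):
    """Scale layer k by alpha > 0 and undo it in layer k+1.

    With a ReLU-like layer in between, relu(alpha * z) = alpha * relu(z),
    so the network output is unchanged while the gradient magnitudes of
    the two layers move in opposite directions."""
    return alpha * W_k, alpha * b_k, W_next / alpha

Iterating such pairwise rescalings over all layers, with alpha chosen from the estimated per-layer change rates, is one way to realize the iterative method mentioned above until the rates roughly agree across the network.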
Method: Pre-Initialization
• Until now:
– Pre-initialize zero-mean Gaussian weights
– Within-layer normalization
– Between-layer normalization
• Variants of pre-initialization:
1. PCA on convolutional layers: output feature maps are white & decorrelated
2. Spherical k-means [11] on convolutional layers: outputs correspond to the centroids of spherical k-means (see the sketch after this list)
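A compact NumPy sketch of variant 2, assuming the filters of a convolutional layer are initialized from L2-normalized image patches clustered with spherical k-means (patch extraction, shapes, and function names are illustrative; see Coates & Ng [11] for the original procedure):

import numpy as np

def spherical_kmeans_filters(patches, num_filters, iters=10, seed=0):
    """patches: (n, k*k*c) flattened image patches.
    Returns (num_filters, k*k*c) unit-norm centroids to use as initial conv weights."""
    rng = np.random.default_rng(seed)
    X = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    D = X[rng.choice(len(X), num_filters, replace=False)]   # start from random data points
    for _ in range(iters):
        assign = (X @ D.T).argmax(axis=1)                    # assign by cosine similarity
        for j in range(num_filters):
            members = X[assign == j]
            if len(members):
                centroid = members.sum(axis=0)
                D[j] = centroid / (np.linalg.norm(centroid) + 1e-8)  # project back to the sphere
    return D  # reshape to (num_filters, c, k, k) before copying into the conv layer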
Experiments
• Architecture: CaffeNet [10]
[Figure: CaffeNet architecture built from Input, Conv/FC, ReLU, and Max-pooling blocks; figure from Babenko et al., Neural Codes for Image Retrieval]
Experiments
• Dataset: PASCAL VOC 2007
– 5011 training images
– 4952 test images
– 4 top-level classes (Person, Animal, Vehicle, Indoor)
– 20 subclasses
http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
Experiments
• Tasks
1. Image classification: presence or absence of classes
• SGD for 80,000 iterations, momentum 0.9, batch size 10
• Learning rate: 0.001 (times 0.5 every 10,000 iterations)
2. Object detection: classification and localization (bounding box)
• Fast R-CNN [9] for 150,000 iterations
• Learning rate: 0.01 / 0.002 / 0.001 (times 0.1 every 50,000 iterations)
http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
Experiments
• Experiment 1: Do we get a constant change rate?
– Evaluation on CaffeNet
– 160 images for data-dependent initialization
– 100 "test" images to approximate change-rate statistics after initialization
Krähenbühl et al., Data-dependent Initializations of CNNs
Experiments
• Experiment 2: Classification
– Scaling vs. no scaling
– Within-layer vs. between-layer vs. both
– Gaussian vs. k-means pre-initialization
– Four types of optimization methods
Krähenbühl et al., Data-dependent Initializations of CNNs
Experiments
• Experiment 3: Classification + detection
– Five different types of pre-training:
1. Egomotion: motion between two images
2. Motion: relative motion of objects
3. Unsupervised: relative arrangement of image patches
4. k-means
5. 1000 class labels: model pre-trained on ImageNet
Krähenbühl et al., Data-dependent Initializations of CNNs
Experiments
• Experiment 4: ImageNet nearest neighbors
– Find nearest neighbors in layer fc7
Krähenbühl et al., Data-dependent Initializations of CNNs
Experiments
Krähenbühl et al., Data-dependent Initializations of CNNs
Summary
• Two-step normalization, applied after the (pre-)initialization:
1. Within-layer normalization: enforce a constant learning rate within each layer k
2. Between-layer normalization: rescale the per-layer learning rate to obtain global consistency
Summary
• Presented approach:
– Widely applicable to various feed-forward architectures
– Unsupervised
– Data-efficient and fast
– Can be used on top of other pre-training methods
• Limitations:
– Relies on several assumptions
– Not generally applicable to non-feed-forward networks
Thank you!
Q&A
References
[1] Andrew Senior, Georg Heigold, Marc'Aurelio Ranzato, and Ke Yang. An empirical study of learning rates in deep neural networks for speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[2] Yann LeCun, Léon Bottou, Genevieve Orr, and Klaus-Robert Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade. Springer, 1998.
[3] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, and Pascal Vincent. Why does unsupervised pre-training help deep learning? JMLR, 2010.
[4] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27, pages 3320-3328, 2014.
[5] Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent Initializations of Convolutional Neural Networks. ICLR, 2016.
[6] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 2015.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
[9] Ross Girshick. Fast R-CNN. ICCV, 2015.
[10] http://caffe.berkeleyvision.org/model_zoo.html
[11] Adam Coates and Andrew Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade, pages 561-580. Springer, 2012.