Convolutional Neural Networks (Part II)
lvelho.impa.br/ip17/proj/slides/1110GoingDeeperConv.pdf


Page 1

Convolutional Neural Networks (Part II)

08, 10 & 17 Nov, 2016

J. Ezequiel Soto S.
Image Processing 2016

Prof. Luiz Velho

Page 2

Summary & References

08/11: ImageNet Classification with Deep Convolutional Neural Networks
2012, Krizhevsky et al. [source]

10/11: Going Deeper with Convolutions
2015, Szegedy et al. [source]

17/11: Painting Style Transfer for Head Portraits using Convolutional Neural Networks
2016, Selim & Elgharib [source]

+ An Analysis of Deep Neural Network Models for Practical Applications
2016, Canziani & Culurciello [source]

+ Provable Bounds for Learning Some Deep Representations
2013, Arora et al. [source]

Page 3

Going Deeper with Convolutions

Szegedy et al., 2015

Page 4

Outline
● Introduction
● Related Work
● Motivation
● Architecture Detail
● GoogLeNet
● Training
● ILSVRC 2014
● Conclusions

Page 5

Introduction
● GoogLeNet: submission to ILSVRC 2014
● Accuracy + low computational cost (1.5B ops at inference) → real-world applicability
● An efficient CNN architecture: Inception
● Depth in two senses: more network layers, and a new level of organization, the Inception module
● Results: a new state of the art

Page 6

Related Work
● Standard CNN layer: convolution + normalization + max-pooling
● Good results on MNIST, CIFAR and ImageNet (with dropout against overfitting)
● Concerns that max-pooling loses accurate spatial information
● Neuroscience model of primate vision, a stack of filters → inspiration for the Inception module
● Network in Network model (NiN)
● 1×1 convolutions (see the sketch below):
– increase depth
– dimension reduction (reduce computational cost)
● Regions with Convolutional Neural Networks: R-CNN
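Not from the slides: a minimal PyTorch sketch of the dimension-reduction role of 1×1 convolutions (the channel counts are illustrative). A 1×1 convolution is a per-pixel linear map across channels, so it can shrink the channel dimension cheaply before a more expensive larger filter:

```python
import torch
import torch.nn as nn

# Reduce 256 input channels to 64 with a 1x1 convolution: a per-pixel
# linear map across channels, with no spatial mixing at all.
reduce = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)

x = torch.randn(1, 256, 28, 28)  # (batch, channels, height, width)
y = reduce(x)
print(y.shape)                   # torch.Size([1, 64, 28, 28])
```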

Pages 7–8: (figures)

Page 9

Motivation
● Improve CNNs by growing them deeper and wider, but…
– too many parameters → overfitting
– computational cost: with two chained layers, uniformly doubling their filters quadruples the computation (see the arithmetic below)
– weights near zero waste computation → theory suggests sparse structures*
– hardware prefers uniform structure, many filters and large batches → efficient use of dense computation

* Theoretical results: 2013, Arora et al., “Provable Bounds for Learning Some Deep Representations”, 54 p.
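The quadratic blow-up can be checked with back-of-envelope arithmetic (sizes are illustrative, not from the slides). The multiply count of a stride-1 convolution is H·W·k²·C_in·C_out, so uniformly doubling the filter banks entering and leaving a layer quadruples its cost:

```python
def conv_mults(h, w, k, c_in, c_out):
    # multiplications of one k x k convolution at stride 1 ('same' padding)
    return h * w * k * k * c_in * c_out

# A 3x3 layer between two banks of 128 filters on a 28x28 feature map...
base = conv_mults(28, 28, 3, 128, 128)
# ...after uniformly doubling both banks to 256 filters:
doubled = conv_mults(28, 28, 3, 256, 256)
print(doubled / base)  # 4.0: doubling the filters quadruples the cost
```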

Page 10

Motivation
“This raises the question whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices.”

● The Inception idea…
– a case study trying to approximate Arora’s sparse structure with dense, readily available components (convolutions)
– highly speculative, but with immediate good results

CAUTION: “although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have lead to its construction”

Page 11

“Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?”

Video 1 / Video 2

Page 12

Architecture Detail
“finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components”

● Translation invariance → convolutional building blocks
● A local construction that is repeated spatially
● Theory suggests analyzing the correlations of the last layer and clustering highly correlated units
● Lower layers: correlations concentrate in local regions → spatial localization
● Avoid “aligned” correlations… use different-sized filters

Page 13: (figure)

Page 14

Architecture Detail
● Higher levels → higher abstraction:
– spatial correlation decreases → increased use of bigger filters (3×3, 5×5)
● Stacking large filters blows up the number of outputs → reduce dimension
● Avoid compressing the information too much while maintaining sparsity → 1×1 convolutions before the larger ones (see the arithmetic below)
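Rough arithmetic for the 1×1 bottleneck (channel counts are illustrative, not from the slides): a direct 5×5 convolution from 256 to 64 channels costs several times more than first projecting 256 → 32 channels with a 1×1 convolution and running the 5×5 on the reduced tensor:

```python
H, W = 28, 28
direct = H * W * 5 * 5 * 256 * 64         # 5x5 conv, 256 -> 64 channels
bottleneck = (H * W * 1 * 1 * 256 * 32    # 1x1 reduction, 256 -> 32
              + H * W * 5 * 5 * 32 * 64)  # 5x5 conv, 32 -> 64
print(direct / bottleneck)                # ~6.9x fewer multiplications
```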

Page 15: (figure)

Page 16

Inception module

Video
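A minimal PyTorch sketch of the published Inception module (a 1×1 branch; 1×1 reduce then 3×3; 1×1 reduce then 5×5; 3×3 max-pool then 1×1 projection; outputs concatenated along the channel axis). The class and argument names here are mine, not the paper's:

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, c_in, n1x1, n3x3red, n3x3, n5x5red, n5x5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(c_in, n1x1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(                       # 1x1 reduce, then 3x3
            nn.Conv2d(c_in, n3x3red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n3x3red, n3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(                       # 1x1 reduce, then 5x5
            nn.Conv2d(c_in, n5x5red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n5x5red, n5x5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(                       # max-pool, then 1x1 proj
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(c_in, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # every branch preserves the spatial size, so the four outputs
        # can be concatenated along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```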

Page 17

Architecture Detail
● Lower levels: classic convolutions
● Higher levels: Inception modules*

* The authors do not consider this split strictly necessary; it compensates for inefficiencies in their current training infrastructure.

● Intuition: scale invariance of visual information, handled before abstraction
● The increased computational efficiency of the reductions allows growing both depth and width

● Efficiency: 2–3× faster than similarly performing networks without Inception modules, though the design has to be careful

Page 18

GoogLeNet
● The specific Inception-based design used in the ILSVRC 2014 competition
● The same design was used for 6 of the 7 ensemble models
● 22 layers deep
● Details (see the example below):
– all convolutions include ReLU
– input: 224×224 RGB with zero mean
– “#3×3 reduce” = number of 1×1 filters before the 3×3 convolutions
– “#5×5 reduce” = number of 1×1 filters before the 5×5 convolutions
– “pool proj” = number of 1×1 filters after max-pooling
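This table notation maps directly onto a module like the one sketched on page 16; for instance, inception (3a) from the paper's table takes a 28×28×192 input and emits 64 + 128 + 32 + 32 = 256 channels:

```python
# inception (3a), using the Inception class sketched earlier
inc3a = Inception(192, n1x1=64, n3x3red=96, n3x3=128,
                  n5x5red=16, n5x5=32, pool_proj=32)
x = torch.randn(1, 192, 28, 28)
print(inc3a(x).shape)  # torch.Size([1, 256, 28, 28])
```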

Page 19: (figure)

Page 20

GoogLeNet
● 22 layers (27 counting max-pooling)
● About 100 independent building blocks
● Average pooling before classifying, as in NiN
+ a linear layer: convenience, easy to adapt to other label sets
● Average pooling instead of a fully connected layer gives +0.6% top-1 accuracy
● Dropout remained essential
● Propagate the gradient effectively → middle layers should already discriminate correctly
● Intermediate classifiers: small convolutional networks on top of the Inception modules (4a) and (4d); their losses are added with weight 0.3
● Auxiliary classifiers are ignored at inference; their effect is marginal

Page 21

GoogLeNet
● Auxiliary network (removed at inference; sketched below):
– average pooling: 5×5 filter, stride 3
(4a) → 4×4×512
(4d) → 4×4×528
– 1×1 convolution with 128 filters + ReLU
– FC layer with 1024 units + ReLU
– dropout layer (70%)
– linear + softmax over the 1000 classes
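Under those specifications, the auxiliary head attached at (4a) can be sketched in PyTorch as below (the 14×14×512 input follows from the 4×4×512 pooled output given on the slide). The softmax is left to the loss function; during training the head's loss is added to the total with weight 0.3:

```python
import torch.nn as nn

aux_head = nn.Sequential(
    nn.AvgPool2d(5, stride=3),                     # 14x14x512 -> 4x4x512
    nn.Conv2d(512, 128, kernel_size=1),            # 1x1 conv, 128 filters
    nn.ReLU(inplace=True),
    nn.Flatten(),                                  # 4 * 4 * 128 = 2048 features
    nn.Linear(2048, 1024), nn.ReLU(inplace=True),  # FC layer, 1024 units
    nn.Dropout(p=0.7),                             # 70% dropout
    nn.Linear(1024, 1000),                         # logits for 1000 classes
)
```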

Page 23

Training Methodology
● DistBelief (Google): modest model and data parallelism
– CPU-only implementation; a rough estimate: convergence within a week on a few GPUs, the main limitation being memory
● Stochastic gradient descent:
– momentum 0.9
– fixed learning-rate schedule: decrease by 4% every 8 epochs (see the sketch after this list)
– Polyak-Ruppert averaging of the SGD iterates for the final model
● Many different methods for sampling and training over the images…
– crops of different sizes
– patches covering 8% to 100% of the image area
– aspect ratio in [3/4, 4/3]
– photometric distortions
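A sketch of the optimizer and learning-rate schedule from this slide (the base learning rate is illustrative; the slides do not give one):

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model parameters
opt = SGD(params, lr=0.01, momentum=0.9)       # momentum 0.9, as on the slide
sched = StepLR(opt, step_size=8, gamma=0.96)   # -4% every 8 epochs

# call sched.step() once per epoch; a Polyak-Ruppert average of the SGD
# iterates would then be kept separately for the final model
```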

Page 24

ILSVRC 2014: Classification
● No external data for training
● 7 versions of the GoogLeNet model (one wider)
– same initialization (even the same weights, due to an oversight)
– same learning-rate policies
– different sampling
→ ensemble prediction
● Testing (more aggressive cropping than AlexNet):
– 4 scales (shorter side at 256, 288, 320, 352)
– left, center and right squares (top, center, bottom for portrait images)
– per square: the full square plus 4 corner and 1 center crops (224×224)
– each crop also mirrored
→ 4×3×6×2 = 144 crops per image
(not strictly necessary: the marginal benefit decreases)
● Softmax probabilities averaged over all crops and all models, 7×144 = 1008 predictions per image (sketched below)
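Not from the slides, but the averaging step is easy to make concrete: the softmax probabilities from every model and every crop are averaged before taking the arg-max (dummy scores below):

```python
import torch

# dummy logits for 7 models x 144 crops x 1000 classes
logits = torch.randn(7, 144, 1000)
probs = torch.softmax(logits, dim=-1)  # per-crop class probabilities
final = probs.mean(dim=(0, 1))         # average over all 1008 predictions
print(final.argmax())                  # predicted class index
```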

Page 25: (figure)

Page 26: (figure)

Source: 2016, Canziani & Culurciello

Page 27

ILSVRC 2014: Detection
● Task: produce bounding boxes around objects from 200 classes
– a detection is correct if its box overlaps the ground truth by at least 50% (see the check below)
– extraneous detections (false positives) are penalized
● Submission:
– R-CNN approach with the Inception model as region classifier
– selective search (superpixel size doubled) combined with MultiBox proposals
– region classification: ensemble of 6 GoogLeNet models
– no bounding-box regression (unlike R-CNN)
– results reported as mean average precision (mAP)
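The 50% overlap criterion is the usual intersection-over-union (Jaccard) test; a self-contained check, assuming boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# a detection counts as correct when IoU with the ground truth is >= 0.5
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... -> not a match
```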

Page 28: (figure)

Page 29: (figure)

Source: 2016, Canziani & Culurciello

Page 30

Conclusions
“...approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.”

● Large quality gain for a small increase in computation
● The detection result is very competitive despite using neither context nor bounding-box regression
● Moving toward sparser architectures is feasible and useful
● Importance of the theoretical analysis! (2013, Arora et al.)

DeepDream (a side result): the examples are creepy… but they show the network run in reverse!
An input image is modified to force it closer to, say, the animal categories.

Pages 31–32: (figures)

Page 33

Will continue, again...