Convolutional Neural Networks and Supervised Learning
Eilif Solberg
August 30, 2018
Outline
Convolutional Architectures: Convolutional neural networks
Training: Loss, Optimization, Regularization, Hyperparameter search
Architecture search: NAS1, NAS2
Bibliography
Convolutional Architectures
Template matching
Figure: Illustration from http://pixuate.com/technology/template-matching/
1. Try to match the template at each location with a "sliding window"
2. Threshold for detection
For 2D objects this is somewhat possible, but difficult
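To make the two steps concrete, here is a minimal NumPy sketch of sliding-window template matching followed by thresholding; the function name match_template and the normalized-correlation scoring are illustrative choices, not from the slides.

    import numpy as np

    def match_template(image, template):
        # Score each location by normalized cross-correlation
        ih, iw = image.shape
        th, tw = template.shape
        t = (template - template.mean()) / (template.std() + 1e-8)
        scores = np.zeros((ih - th + 1, iw - tw + 1))
        for y in range(scores.shape[0]):
            for x in range(scores.shape[1]):
                patch = image[y:y + th, x:x + tw]
                p = (patch - patch.mean()) / (patch.std() + 1e-8)
                scores[y, x] = (p * t).mean()
        return scores

    # Step 2: threshold the score map for detections, e.g.
    # detections = np.argwhere(match_template(img, tmpl) > 0.8)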
Convolution
Which filter produces the activation map on the right?
Convolutional layer
-> Glorified template matching
- Many templates (aka output filters)
- We learn the templates; the weights are the templates
- Intermediate detection results are only a means to an end
  - treat them as features, which we in turn match new templates against
- Starting from the second layer we have "nonlinear filters"
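The "glorified template matching" view can be written out directly. The following NumPy toy (a hypothetical helper; valid padding, stride 1, no bias) computes one matching score per template at every spatial location:

    import numpy as np

    def conv2d(x, w):
        # x: input of shape (H, W, C_in); w: templates of shape
        # (KH, KW, C_in, C_out); valid padding, stride 1, no bias
        H, W, _ = x.shape
        KH, KW, _, C_out = w.shape
        out = np.zeros((H - KH + 1, W - KW + 1, C_out))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i:i + KH, j:j + KW, :]
                # one matching score per template
                out[i, j, :] = np.tensordot(patch, w,
                                            axes=([0, 1, 2], [0, 1, 2]))
        return out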
Hyperparameters of convolutional layer
1. Kernel height and width - template sizes
2. Stride - skips between template matches
3. Dilation rate - "holes" in the template where we don't care; larger field-of-view without more weights
4. Number of output filters - number of templates
5. Padding - expand the image, typically with zeros
Figure: Image from http://neuralnetworksanddeeplearning.com/
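All five hyperparameters map directly onto arguments of a standard convolution API. A minimal PyTorch sketch (the specific values are arbitrary examples, not recommendations):

    import torch
    import torch.nn as nn

    # kernel_size = template size, stride = skip between matches,
    # dilation = "holes" in the template, out_channels = number of
    # templates, padding = zero-expansion of the image border
    conv = nn.Conv2d(in_channels=3, out_channels=64,
                     kernel_size=3, stride=2, dilation=1, padding=1)

    x = torch.randn(1, 3, 224, 224)     # [batch, depth, height, width]
    print(conv(x).shape)                # torch.Size([1, 64, 112, 112])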
Detector / activation function
- Non-saturating activation functions such as ReLU and leaky ReLU dominate
Figure: Sigmoid function
Figure: Tanh function
Figure: ReLU function
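For reference, NumPy definitions of these activation functions (the 0.01 leak factor is one common default, not prescribed by the slides):

    import numpy as np

    def sigmoid(x):                     # saturates for large |x|
        return 1 / (1 + np.exp(-x))

    def tanh(x):                        # saturates, but zero-centered
        return np.tanh(x)

    def relu(x):                        # non-saturating for x > 0
        return np.maximum(0, x)

    def leaky_relu(x, alpha=0.01):      # small slope instead of a hard zero
        return np.where(x > 0, x, alpha * x)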
Basic CNN architecture for image classification
Image -> [Conv -> ReLU] x N -> Fully Connected -> Softmax
- Increase filter depth when using stride
Improve with:
- Batch normalization
- Skip connections à la ResNet or DenseNet
- No fully connected layer; average-pool the predictions instead
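A toy PyTorch version of such an architecture, with two of the suggested improvements in place (batch normalization, and global average pooling instead of a fully connected head); layer sizes are arbitrary and skip connections are omitted for brevity:

    import torch.nn as nn

    def block(c_in, c_out, stride=1):
        # Conv -> BatchNorm -> ReLU
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))

    model = nn.Sequential(
        block(3, 32),
        block(32, 64, stride=2),       # increase depth when striding
        block(64, 64),
        block(64, 128, stride=2),
        nn.AdaptiveAvgPool2d(1),       # average pool instead of FC layers
        nn.Flatten(),
        nn.Linear(128, 10))            # 10-class logits for the softmax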
Training
How do we fit the model?
How do we find the parameters θ for our network?
Supervised learning
- Training data comes as (X, Y) pairs, where Y is the target
- We want to learn f(x) ∼ p(y|x), the conditional distribution of Y given X
- Define a differentiable surrogate loss function, e.g. for a single sample:

l(f(X), Y) = (f(X) − Y)²   (regression)   (1)
l(f(X), Y) = −∑_c Y_c log(f(X)_c)   (classification)   (2)
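The two surrogate losses of equations (1) and (2) as NumPy one-liners (hypothetical function names; f_x stands for the network output f(X)):

    import numpy as np

    def mse_loss(f_x, y):                    # regression, eq. (1)
        return np.mean((f_x - y) ** 2)

    def cross_entropy(f_x, y_onehot):        # classification, eq. (2)
        # f_x: predicted class probabilities, e.g. softmax outputs;
        # the small constant guards against log(0)
        return -np.sum(y_onehot * np.log(f_x + 1e-12))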
Gradient
- The direction in which the function increases the most
Figure: Gradient of the function f(x, y) = x / e^(x² + y²). [By Vivekj78, CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0), from Wikimedia Commons]
Backpropagation
- Efficient bookkeeping scheme for applying the chain rule of differentiation
- Biologically implausible?
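A tiny worked example of the bookkeeping: a forward pass through one ReLU layer, then gradients computed by the chain rule while reusing the stored forward values (shapes and loss are arbitrary illustrations):

    import numpy as np

    # Forward pass: z = W x, a = ReLU(z), L = 0.5 * ||a - y||^2
    x, y = np.random.randn(4), np.random.randn(3)
    W = np.random.randn(3, 4)
    z = W @ x
    a = np.maximum(0, z)
    L = 0.5 * np.sum((a - y) ** 2)

    # Backward pass: apply the chain rule, reusing stored forward values
    dL_da = a - y                      # dL/da
    dL_dz = dL_da * (z > 0)            # through the ReLU
    dL_dW = np.outer(dL_dz, x)         # dL/dW, same shape as W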
(Stochastic) gradient descent
Taking steps in the opposite direction of the gradient
Figure: [By Vivekj78, CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0), from Wikimedia Commons]
- Full gradient too expensive / not necessary:

∑_{i=1}^{N} ∇_θ l(f(X_i), Y_i) ≈ (N/n) ∑_{i=1}^{n} ∇_θ l(f(X_{P(i)}), Y_{P(i)})   (3)

for a random permutation P.
Many different extensions to standard SGD: SGD with momentum, RMSprop, Adam.
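A minimal sketch of the minibatch loop implied by equation (3); grad_fn is a hypothetical function assumed to return the average gradient over a minibatch, and X, Y are assumed to be NumPy arrays:

    import numpy as np

    def sgd(theta, grad_fn, X, Y, lr=0.1, batch_size=32, epochs=10):
        # Each epoch visits the data in a fresh random permutation P
        N = len(X)
        for _ in range(epochs):
            P = np.random.permutation(N)
            for start in range(0, N, batch_size):
                idx = P[start:start + batch_size]
                # step in the opposite direction of the minibatch gradient
                theta = theta - lr * grad_fn(theta, X[idx], Y[idx])
        return theta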
Network, loss, optimization
- Weight penalty added to the loss term, usually the squared L2 norm applied uniformly to all parameters:

l(θ) + λ‖θ‖₂²   (4)

- Dropout
- Batch normalization
  - Intersection of optimization and generalization
  - Your best friend and your worst enemy
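One way to add the penalty of equation (4) explicitly in PyTorch (a toy model and a hypothetical λ value; in practice the optimizer's weight_decay argument has a similar effect):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)                       # toy stand-in network
    criterion = nn.CrossEntropyLoss()
    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))

    lam = 1e-4                                     # hypothetical λ
    data_loss = criterion(model(x), y)
    l2 = sum(p.pow(2).sum() for p in model.parameters())   # squared L2 norm
    loss = data_loss + lam * l2                    # eq. (4)
    loss.backward()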
More on batch normalization
For a tensor [batch_size × height × width × depth], normalize the "template matching scores" for each template d by
μ_d ← (1 / (N·H·W)) ∑_{i=1}^{N} ∑_{h=1}^{H} ∑_{w=1}^{W} x_{i,h,w,d}   (5)

σ²_d ← (1 / (N·H·W)) ∑_{i=1}^{N} ∑_{h=1}^{H} ∑_{w=1}^{W} (x_{i,h,w,d} − μ_d)²   (6)

x̂_{i,h,w,d} ← (x_{i,h,w,d} − μ_d) / √(σ²_d + ε)   (7)

y_{i,h,w,d} ← γ·x̂_{i,h,w,d} + β   (8)
where N, H, and W represent batch size, height, and width.
- "Is the template/feature more present than usual or not?"
- During inference we use stored values for μ_d and σ_d.
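Equations (5)-(8) in NumPy, training mode (γ and β are the learned per-template parameters; shapes follow the slide's [N, H, W, D] layout):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x has shape [N, H, W, D]; statistics are per template d
        mu = x.mean(axis=(0, 1, 2))                # eq. (5)
        var = x.var(axis=(0, 1, 2))                # eq. (6)
        x_hat = (x - mu) / np.sqrt(var + eps)      # eq. (7)
        return gamma * x_hat + beta                # eq. (8)

    # At inference time, mu and var are replaced by stored running averages.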
Data augmentation
Idea: apply a random transformation to X that does not alter Y.
- Normally you would like the result X′ to be plausible, i.e. it could have been a sample from the distribution of interest
- Which transformations you may use is application-dependent.
Image data
- Horizontal mirroring (an issue for objects that are not left/right symmetric)
- Random crop
- Scale
- Aspect ratio
- Lighting, etc.
Text data
- Synonym insertion
- Back-translation: translate and translate back with e.g. Google Translate!
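For image data, a typical augmentation pipeline using torchvision covering the transformations listed above (parameter values are one plausible choice, not a prescription):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(),             # mirroring
        transforms.RandomResizedCrop(224,              # crop + scale
                                     scale=(0.8, 1.0),
                                     ratio=(0.75, 1.33)),  # aspect ratio
        transforms.ColorJitter(brightness=0.2),        # lighting
        transforms.ToTensor(),
    ])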
Hyperparameters to search
- Learning rate (and learning rate schedule)
- Regularization parameters: L2, (dropout)
Search strategies
- Random search rather than grid search
- Use a log scale when appropriate
- Be careful when the best values lie on the border of the search range
- You may refine the search around the best values
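A sketch combining these strategies: random search with the learning rate and L2 strength sampled on a log scale (ranges are illustrative; train_and_evaluate is a hypothetical stand-in for a full training run):

    import numpy as np

    for trial in range(20):
        lr = 10 ** np.random.uniform(-5, -1)    # log scale: 1e-5 .. 1e-1
        l2 = 10 ** np.random.uniform(-6, -2)    # log scale: 1e-6 .. 1e-2
        # score = train_and_evaluate(lr, l2)    # hypothetical training run
    # If the best values land on the border of a range, widen the range;
    # a second, narrower search around the best region can then refine it.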
Architecture search
Architecture search
1. Define the search space.
2. Decide upon the optimization algorithm: random search, reinforcement learning, genetic algorithms
Neural architecture search
Figure: An overview of Neural Architecture Search. Figure and caption from [?].
NAS1 - search space
Fixed structure:
- Architecture is a series of layers of the form

conv2D(FH, FW, N) −→ batch-norm −→ ReLU

Degrees of freedom:
- Parameters of the conv layer: filter height, filter width, and number of output filters
- Input layers to each conv layer
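A hypothetical sampler for this kind of search space, just to illustrate the degrees of freedom (the operation choices and layer count are made-up values, not those used in the paper):

    import random

    # For each layer pick filter height, filter width, number of
    # filters, and which earlier layers feed into it
    # (index -1 stands for the network input).
    def sample_architecture(num_layers=6):
        arch = []
        for layer in range(num_layers):
            inputs = [i for i in range(layer) if random.random() < 0.5]
            arch.append({
                "FH": random.choice([1, 3, 5, 7]),
                "FW": random.choice([1, 3, 5, 7]),
                "N": random.choice([24, 36, 48, 64]),
                "inputs": inputs or [layer - 1],
            })
        return arch

    print(sample_architecture())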
NAS1 - discovered architecture
Figure: FH is filter height, FW is filter width, and N is the number of filters. If one layer has many input layers, then all input layers are concatenated in the depth dimension. Figure from [?].
NAS2 - search space
Fixed structure:
Figure: Architecture for CIFAR-10 and ImageNet. Figure from [?].
Degrees of freedom:
- Some freedom in the normal cell and the reduction cell, as we shall see shortly
NAS2 - discovered convolutional cells
Figure: The NASNet-A normal cell and reduction cell identified with CIFAR-10; each cell combines separable convolutions (sep 3x3/5x5/7x7), average and max pooling (avg/max 3x3), and identity operations, joined by add nodes and a final concat. Figure and caption from [?].
NAS2 - Performance vs. computational cost
Figure: Performance on ILSVRC12 (accuracy, precision @ 1) as a function of the number of floating-point multiply-add operations (millions) needed to process an image, comparing NASNet-A variants with SENet, DPN-131, PolyNet, Inception, ResNet-152, ResNeXt-101, Xception, VGG-16, MobileNet, and ShuffleNet. Figure from [?].
NAS2 - Performance vs. number of parameters
Figure: Performance on ILSVRC12 (accuracy, precision @ 1) as a function of the number of parameters (millions), for the same set of models. Figure from [?].
Bibliography
Bibliography I