All You Need is a Good init
Dmytro Mishkin, Jiri Matas, Presented by Cole Pospisil
University of Kentucky
April 1st, 2019
Dmytro Mishkin, Jiri Matas, Presented by Cole Pospisil | All You Need is a Good init | April 1st, 2019 | 1 / 13
Overview
1 Problem
2 History
3 Method
4 Performance
5 Conclusion
Problem
Thin, deep networks are accurate and efficient at inference time
There is no general, reliable, efficient procedure for end-to-end training of such networks
They are hard to train by backpropagation if there are more than five layers
Batch Normalization can facilitate training of deeper networks, but adds roughly 30% computational overhead to each iteration
History
Since 2012, the popular initialization method has been Gaussian noise with zero mean and standard deviation 0.01, with biases equal to one.
This cannot train very deep networks from scratch
If layers are not properly initialized, then each layer can scale its input by a factor k, giving a final scaling of k^L, where L is the number of layers.
  If k > 1, the output layers produce extremely large values
  If k < 1, signals and gradients progressively diminish
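As a toy numeric illustration of this compounding effect (not taken from the slides), consider a stack of L = 20 layers that each scale their input by k:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of compounded scaling: each "layer" multiplies its
# input by k, so after L layers the signal magnitude is roughly k**L.
def signal_std(k, layers=20):
    x = rng.standard_normal(1000)  # unit-variance input signal
    for _ in range(layers):
        x = k * x  # stand-in for a layer whose weights scale inputs by k
    return float(np.std(x))

print(signal_std(1.1))  # explodes: roughly 1.1**20, about 6.7
print(signal_std(0.9))  # vanishes: roughly 0.9**20, about 0.12
```

Even a modest per-layer mismatch of 10% compounds into an order-of-magnitude change across 20 layers, which is why the problem worsens with depth.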
History
Glorot initialization estimates the standard deviation on the basis of the number of inputs and outputs, but assumes no non-linearity between layers
He et al. adjusted Glorot's method to take the ReLU non-linearity into account
Saxe et al. showed that orthonormal matrix initialization works better than Gaussian noise for both linear and non-linear networks
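For reference, the standard deviations the first two schemes prescribe can be written out directly. This is a sketch; exact variance conventions differ slightly between papers and frameworks:

```python
import numpy as np

def glorot_std(fan_in, fan_out):
    # Glorot/Xavier: var = 2 / (fan_in + fan_out), derived assuming
    # (roughly) linear activations between layers
    return np.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He et al.: var = 2 / fan_in, correcting for ReLU zeroing out
    # half of the activations (which halves the variance)
    return np.sqrt(2.0 / fan_in)

print(glorot_std(256, 256))  # 0.0625
print(he_std(256))           # about 0.0884
```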
Method - LSUV
Mishkin and Matas extended Saxe’s method to an iterative process
1 Fill weights with Gaussian noise with unit variance
2 Decompose the weights into an orthonormal basis with either the QR or SVD method, and replace the weights with one of the orthonormal components
3 Estimate the output variance of each convolution and fully connected layer, and scale the weights to make that variance equal to one
This is essentially orthonormal initialization combined with batch normalization applied only to the first minibatch
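The three steps can be sketched for a single fully connected layer in NumPy. This is an illustrative sketch, not the authors' code: the function name, the tolerance, and the iteration cap are choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsuv_init(shape, batch, tol=0.01, max_iter=10):
    """Sketch of LSUV for one fully connected layer.

    1. Fill the weights with Gaussian noise.
    2. Replace them with an orthonormal basis (QR decomposition).
    3. Rescale so the layer's output variance on one minibatch is 1.
    """
    fan_in, fan_out = shape
    W = rng.standard_normal((fan_in, fan_out))  # step 1: Gaussian noise
    Q, _ = np.linalg.qr(W)                      # step 2: orthonormal columns
    W = Q
    for _ in range(max_iter):                   # step 3: unit output variance
        var = np.var(batch @ W)
        if abs(var - 1.0) < tol:
            break
        W /= np.sqrt(var)
    return W

batch = rng.standard_normal((64, 256))  # one minibatch of layer inputs
W = lsuv_init((256, 128), batch)
print(round(float(np.var(batch @ W)), 2))  # close to 1.0
```

For convolutional layers the paper reshapes the filter tensor into a 2-D matrix before the orthonormal step; the iteration matters mainly once non-linearities sit between layers, since a single rescaling no longer lands exactly on unit variance.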
Performance
When trained on the MNIST and CIFAR datasets, FitNets with LSUV initialization outperformed FitNets with other common types of initialization
When trained on CIFAR-10, orthonormal initialization outperformed scaled Gaussian noise for all tested activation functions except tanh
LSUV outperformed orthonormal initialization to a smaller but consistent extent
None of the tested methods converged for sigmoid-based networks
When using a residual network, LSUV was the only tested method that converged for every non-linearity type except sigmoid
Performance
Compared with batch normalization on top of Xavier initialization, the Xavier-initialized nets trained in fewer iterations, but took approximately the same amount of real time as LSUV-initialized nets
LSUV initialization requires an SVD decomposition of the weight matrices, which makes the initialization step take longer, but this cost is negligible compared to total training time
Conclusion
LSUV initialization performs as well as more complicated initialization schemes
Allows learning of very deep nets quickly and accurately with SGD
Works well with multiple activation functions