All You Need is a Good init
Dmytro Mishkin, Jiri Matas, Presented by Cole Pospisil
University of Kentucky
April 1st, 2019
Dmytro Mishkin, Jiri Matas, Presented by Cole Pospisil | All You Need is a Good init | April 1st, 2019 | 1 / 13
Overview
1 Problem
2 History
3 Method
4 Performance
5 Conclusion
Problem
Thin, deep networks are accurate and efficient at inference time
There is no general, reliable, efficient procedure for end-to-end training of such networks
They are hard to train by backpropagation if there are more than five layers
Batch Normalization can facilitate training of deeper networks, but adds roughly 30% computational overhead to each iteration
History
Since 2012, the popular initialization method has been Gaussian noise with zero mean and standard deviation 0.01, with biases equal to one.
This cannot train very deep networks from scratch
If layers are not properly initialized, then each layer can scale its input by a factor k, giving a final scaling of k^L, where L is the number of layers.
  If k > 1, the output layers produce extremely large values
  If k < 1, signals and gradients progressively diminish
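As a toy numeric illustration of this compounding effect (not taken from the slides), consider a stack of L = 20 layers that each scale their input by k:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of compounded scaling: each "layer" multiplies its
# input by k, so after L layers the signal magnitude is roughly k**L.
def signal_std(k, layers=20):
    x = rng.standard_normal(1000)  # unit-variance input signal
    for _ in range(layers):
        x = k * x  # stand-in for a layer whose weights scale inputs by k
    return float(np.std(x))

print(signal_std(1.1))  # explodes: roughly 1.1**20, about 6.7
print(signal_std(0.9))  # vanishes: roughly 0.9**20, about 0.12
```

Even a modest per-layer mismatch of 10% compounds into an order-of-magnitude change across 20 layers, which is why the problem worsens with depth.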
History
Glorot initialization estimates the standard deviation on the basis of the number of inputs and outputs, but assumes no non-linearity between layers
He et al. adjusted Glorot's method to take the ReLU non-linearity into account
Saxe et al. showed that orthonormal matrix initialization works better than Gaussian noise for both linear and non-linear networks
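For reference, the standard deviations the first two schemes prescribe can be written out directly. This is a sketch; exact variance conventions differ slightly between papers and frameworks:

```python
import numpy as np

def glorot_std(fan_in, fan_out):
    # Glorot/Xavier: var = 2 / (fan_in + fan_out), derived assuming
    # (roughly) linear activations between layers
    return np.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He et al.: var = 2 / fan_in, correcting for ReLU zeroing out
    # half of the activations (which halves the variance)
    return np.sqrt(2.0 / fan_in)

print(glorot_std(256, 256))  # 0.0625
print(he_std(256))           # about 0.0884
```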
Method - LSUV
Mishkin and Matas extended Saxe’s method to an iterative process
1 Fill weights with Gaussian noise with unit variance
2 Decompose the weights into an orthonormal basis with either the QR or SVD method, and replace the weights with one of the orthonormal components
3 Estimate the output variance of each convolution and fully connected layer, and scale the weights to make that variance equal to one
This is essentially orthonormal initialization combined with batch normalization applied only to the first minibatch
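The three steps can be sketched for a single fully connected layer in NumPy. This is an illustrative sketch, not the authors' code: the function name, the tolerance, and the iteration cap are choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsuv_init(shape, batch, tol=0.01, max_iter=10):
    """Sketch of LSUV for one fully connected layer.

    1. Fill the weights with Gaussian noise.
    2. Replace them with an orthonormal basis (QR decomposition).
    3. Rescale so the layer's output variance on one minibatch is 1.
    """
    fan_in, fan_out = shape
    W = rng.standard_normal((fan_in, fan_out))  # step 1: Gaussian noise
    Q, _ = np.linalg.qr(W)                      # step 2: orthonormal columns
    W = Q
    for _ in range(max_iter):                   # step 3: unit output variance
        var = np.var(batch @ W)
        if abs(var - 1.0) < tol:
            break
        W /= np.sqrt(var)
    return W

batch = rng.standard_normal((64, 256))  # one minibatch of layer inputs
W = lsuv_init((256, 128), batch)
print(round(float(np.var(batch @ W)), 2))  # close to 1.0
```

For convolutional layers the paper reshapes the filter tensor into a 2-D matrix before the orthonormal step; the iteration matters mainly once non-linearities sit between layers, since a single rescaling no longer lands exactly on unit variance.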
Performance
When trained on the MNIST and CIFAR datasets, FitNets with LSUV initialization outperformed FitNets with other common types of initialization
When trained on CIFAR-10, orthonormal initialization outperformed scaled Gaussian noise for all tested activation functions except tanh
LSUV outperformed orthonormal initialization to a smaller but consistent extent
None of the tested methods converged for sigmoid-based networks
When using a residual network, LSUV was the only tested method that converged for every non-linearity type except sigmoid
Performance
Compared with batch normalization on top of Xavier initialization, the Xavier-initialized nets trained in fewer iterations, but took approximately the same amount of real time as LSUV-initialized nets
LSUV initialization requires an SVD decomposition of the weight matrices, which makes the initialization step take longer, but this cost is negligible compared to total training time
Conclusion
LSUV initialization performs as well as more complicated initialization schemes
Allows learning of very deep nets quickly and accurately with SGD
Works well with multiple activation functions