
All You Need is a Good init

Dmytro Mishkin, Jiri Matas, Presented by Cole Pospisil

University of Kentucky

April 1st, 2019


Overview

1 Problem

2 History

3 Method

4 Performance

5 Conclusion


Problem

Thin, deep nets provide good accuracy and are efficient at inference time

There is no general, reliable, efficient procedure for end-to-end training of such networks

They are hard to train by backpropagation when there are more than about five layers

Batch Normalization can facilitate training of deeper networks, but adds roughly 30% computational overhead to each iteration


History

Since 2012, the popular initialization method has been Gaussian noise with zero mean and standard deviation 0.01, with biases set to one.

This cannot train very deep networks from scratch

If layers are not properly initialized, each one can scale its input by a factor k, giving a final scaling of k^L, where L is the number of layers.

If k > 1, then the output layers give extremely large values

If k < 1, then there are issues with diminishing signals and gradients
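A minimal numpy sketch of this compounding effect (an illustration on my part, not from the slides): each layer applies an orthogonal weight matrix scaled by a gain k, so after L layers the signal norm has been multiplied by roughly k^L. The depth and width values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n = 20, 256  # number of layers and layer width (illustrative values)

for k in (0.9, 1.0, 1.1):
    x = rng.standard_normal(n)
    norm0 = np.linalg.norm(x)
    for _ in range(L):
        Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal matrix
        x = (k * Q) @ x                                    # layer scales ||x|| by exactly k
    print(f"k={k}: ||x|| changed by {np.linalg.norm(x) / norm0:.3g} (k**L = {k**L:.3g})")
```

With k = 1.1 the norm grows by a factor of about 6.7 after only 20 layers, and with k = 0.9 it shrinks by about the same factor, which is the vanishing/exploding behaviour described above.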


History

Glorot initialization estimates the standard deviation on the basis of the number of inputs and outputs, but assumes no non-linearity between layers

He et al. adjusted Glorot's method to take into account the ReLU non-linearity

Saxe et al. showed that orthonormal matrix initialization works better than Gaussian noise for both linear and non-linear networks
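For reference, a small numpy sketch of the three initializations just mentioned; the layer shapes and the square-matrix simplification in the orthonormal case are my assumptions for illustration, not details from the slides.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    # Glorot/Xavier: std based on both the number of inputs and outputs.
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / (fan_in + fan_out))

def he_normal(fan_in, fan_out, rng):
    # He et al.: larger std to compensate for ReLU zeroing half the activations.
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

def orthonormal(n, rng):
    # Saxe et al.: orthonormal matrix from the QR decomposition of Gaussian noise
    # (square case shown for simplicity).
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
W_glorot = glorot_normal(256, 256, rng)
W_he = he_normal(256, 256, rng)
W_orth = orthonormal(256, rng)
```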


Method - LSUV

Mishkin and Matas extended Saxe’s method to an iterative process

1 Fill weights with Gaussian noise with unit variance

2 Decompose into an orthonormal basis with either the QR or SVD method, and replace the weights with one of the components

3 Estimate the output variance of each convolution layer and scale the weights to make the variance equal to one.

This is essentially orthonormal initialization combined with batch normalization performed only on the first minibatch
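A minimal numpy sketch of these three steps for a toy ReLU MLP; the square layer shapes, tolerance, iteration cap, and the random stand-in for the first minibatch are illustrative assumptions, not values from the paper.

```python
import numpy as np

def lsuv_init(num_layers, width, minibatch, tol=0.1, max_iter=10, seed=0):
    """Layer-Sequential Unit-Variance init for a toy ReLU MLP with square layers."""
    rng = np.random.default_rng(seed)
    weights = []
    x = minibatch                                    # shape (batch, width)
    for _ in range(num_layers):
        # Steps 1-2: Gaussian noise, replaced by an orthonormal basis via QR
        # (an SVD would work equally well).
        W, _ = np.linalg.qr(rng.standard_normal((width, width)))
        # Step 3: rescale until this layer's output variance on the first
        # minibatch is close to one.
        for _ in range(max_iter):
            var = (x @ W).var()
            if abs(var - 1.0) < tol:
                break
            W /= np.sqrt(var)
        weights.append(W)
        x = np.maximum(x @ W, 0.0)                   # ReLU output feeds the next layer
    return weights

# Hypothetical usage with random data standing in for the first training batch:
batch = np.random.default_rng(1).standard_normal((128, 64))
ws = lsuv_init(num_layers=6, width=64, minibatch=batch)
```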



Performance

When trained on the MNIST and CIFAR datasets, FitNets with LSUV initialization outperformed FitNets with other common types of initialization

When trained on CIFAR-10, orthonormal initialization outperformed scaled Gaussian noise for all tested activation functions except tanh

LSUV outperformed orthonormal initialization to a smaller but consistent extent

None of the tested methods converged for sigmoid-based networks

When using a residual network, LSUV was the only tested method that converged for every non-linearity type, except for sigmoid




Performance

Compared to batch normalization with Xavier initialization, the Xavier-initialized nets trained in fewer iterations, but took approximately the same amount of real time as the LSUV-initialized nets

LSUV initialization requires an SVD decomposition of the weight matrices, which makes the initialization step take longer, but this cost is negligible compared to total training time



Conclusion

LSUV initialization is just as good as more complicated schemes

Allows learning of very deep nets quickly and accurately with SGD

Works well with multiple activation functions

