



All You Need is a Good init

Dmytro Mishkin, Jiri Matas, Presented by Cole Pospisil

University of Kentucky

April 1st, 2019


Overview

1 Problem

2 History

3 Method

4 Performance

5 Conclusion


Problem

Thin, deep nets provide accuracy and are efficient at inference time

There is no general, reliable, and efficient procedure for end-to-end training

They are hard to train by back-propagation if there are more than five layers

Batch Normalization can facilitate training of deeper networks, but adds about 30% computational overhead to each iteration


History

Since 2012, the popular initialization method has been Gaussian noise with zero mean and standard deviation 0.01, with biases set to one.

This method cannot train very deep networks from scratch

If layers are not properly initialized, then each can scale its input by some factor k, giving a final scaling of k^L, where L is the number of layers (see the sketch after this list).

- If k > 1, the output layers produce extremely large values
- If k < 1, the signal and gradients diminish as they propagate
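A minimal sketch (not from the slides; sizes and depth are illustrative) of this compounding effect in NumPy: each layer is an orthonormal matrix scaled by k, so the signal norm changes by exactly a factor k per layer.

```python
# Toy demonstration of how a per-layer gain k compounds to k^L over L layers.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)

for k in (1.1, 0.9):
    h = x.copy()
    for _ in range(50):  # L = 50 layers
        # Orthonormal matrix (norm-preserving) scaled by k => exact gain k.
        q, _ = np.linalg.qr(rng.standard_normal((512, 512)))
        h = (k * q) @ h
    print(f"k={k}: ||x||={np.linalg.norm(x):.1f}, ||h||={np.linalg.norm(h):.3e}")
```

With k = 1.1 the norm grows by roughly 1.1^50 (about 117x) over 50 layers; with k = 0.9 it shrinks by roughly 0.9^50 (about 0.005x), i.e. the signal all but vanishes.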


History

Glorot initialization estimates the standard deviation from the number of inputs and outputs, but assumes no non-linearity between layers

He et al. adjusted Glorot's method to take the ReLU non-linearity into account (both formulas are sketched below)

Saxe et al. showed that orthonormal matrix initialization works better than Gaussian noise for both linear and non-linear networks
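For reference, a sketch of the two standard-deviation formulas mentioned above; the fan values are illustrative.

```python
# Glorot/Xavier and He standard deviations; fan_in/fan_out are illustrative.
import math

def glorot_std(fan_in: int, fan_out: int) -> float:
    # Glorot & Bengio: Var(W) = 2 / (fan_in + fan_out), derived assuming
    # linear activations between layers.
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in: int) -> float:
    # He et al.: Var(W) = 2 / fan_in, compensating for ReLU zeroing
    # half of the activations.
    return math.sqrt(2.0 / fan_in)

print(glorot_std(256, 256))  # 0.0625
print(he_std(256))           # ~0.0884
```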


Method - LSUV (Layer-Sequential Unit-Variance)

Mishkin and Matas extended Saxe’s method to an iterative process

1 Fill the weights with Gaussian noise with unit variance

2 Decompose the weights into an orthonormal basis with either the QR or SVD method, and replace the weights with the orthonormal component

3 Estimate the output variance of each convolution and fully connected layer, and scale the weights to make that variance equal to one

This is essentially orthonormal initialization combined with batch normalization performed only on the first minibatch; a code sketch follows
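A minimal PyTorch sketch of the procedure, under stated assumptions: `lsuv_init`, its hook helper, and the tolerance values are illustrative, not the authors' reference implementation, and it assumes `model(x)` runs the visited layers on the minibatch `x`.

```python
# Hedged sketch of LSUV (not the authors' reference code): orthonormal init
# per layer, then iterative rescaling until the layer's output variance is ~1.
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model: nn.Module, x: torch.Tensor,
              tol: float = 0.1, max_iters: int = 10) -> None:
    def layer_output(layer: nn.Module) -> torch.Tensor:
        # Run the minibatch and capture this layer's output via a forward hook.
        captured = {}
        handle = layer.register_forward_hook(
            lambda m, inp, out: captured.setdefault("out", out))
        model(x)
        handle.remove()
        return captured["out"]

    for module in model.modules():
        if not isinstance(module, (nn.Conv2d, nn.Linear)):
            continue
        # Steps 1-2: Gaussian noise, then keep the orthonormal component
        # (here via SVD: replace W with U @ V^T, dropping singular values).
        w = torch.randn_like(module.weight)
        flat = w.flatten(1)                              # (out_units, fan_in)
        u, _, vh = torch.linalg.svd(flat, full_matrices=False)
        module.weight.copy_((u @ vh).reshape_as(w))
        # Step 3: rescale until the output variance on this minibatch is ~1.
        for _ in range(max_iters):
            var = layer_output(module).var().item()
            if abs(var - 1.0) < tol:
                break
            module.weight.div_(var ** 0.5)
```

For example, `lsuv_init(net, next(iter(train_loader))[0])` would initialize `net` from a single training minibatch (names hypothetical).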



Performance

When trained on the MNIST and CIFAR datasets, FitNets with LSUV initialization outperformed FitNets with other common types of initialization

When trained on CIFAR-10, orthonormal initialization outperformed scaled Gaussian noise for all tested activation functions except tanh

LSUV outperformed orthonormal initialization to a smaller but consistent extent

None of the tested methods converged for sigmoid-based networks

When using a residual network, LSUV was the only tested method that converged for every non-linearity type other than sigmoid


Performance

Compared with batch normalization on top of Xavier initialization, the Xavier-initialized nets trained in fewer iterations but took approximately the same amount of wall-clock time as LSUV-initialized nets

LSUV initialization requires an SVD decomposition of the weight matrices, which lengthens the initialization step, but this cost is negligible compared to total training time


Conclusion

LSUV initialization is just as good as more complicated schemes

Allows learning of very deep nets quickly and accurately with SGD

Works well with multiple activation functions
