
Page 1:

Strawberry Goblet Throne

CS 4670/5670 Sean Bell

Lecture 39: Training Neural Networks (Cont’d)

Page 2:

(Side Note for PA5) AlexNet: 1 vs 2 parts

Caffe represents the network as in the top image (one part), but computes it as if it were the bottom image (two parts), using 2 “groups”
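The slides show this with a Caffe figure that is not reproduced in the transcript. As a rough sketch of the same idea in a different framework (PyTorch, not the framework used in the slides), a grouped convolution splits the input and output channels into independent parts; the sizes below match AlexNet's second convolution but are otherwise illustrative:

import torch
import torch.nn as nn

# With groups=2, the first half of the input channels only feeds the first half
# of the output channels, and likewise for the second half: one layer that
# computes as if the network were split into two parts.
conv_grouped = nn.Conv2d(in_channels=96, out_channels=256,
                         kernel_size=5, padding=2, groups=2)

x = torch.randn(1, 96, 27, 27)   # dummy activation map
y = conv_grouped(x)              # shape: (1, 256, 27, 27)
print(y.shape)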

Page 3:

(Recall) Each iteration of training has three steps:

(1) Forward propagation: the input x is passed through a chain of functions, producing hidden activations h, ..., scores s, and finally the loss L.

(2) Backward propagation: gradients flow back through the same chain: ∂L/∂s, ..., ∂L/∂h, ..., ∂L/∂x (and ∂L/∂θ for every parameter θ along the way).

(3) Weight update: θ ← θ − λ ∂L/∂θ
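As a concrete illustration of these three steps (not code from the slides), here is a minimal sketch of one training iteration for a tiny linear model with a squared-error loss; all sizes and names are made up for the example:

import numpy as np

np.random.seed(0)
x = np.random.randn(4)            # input
y = np.random.randn(3)            # target
W = np.random.randn(3, 4) * 0.1   # parameters (theta)
lam = 0.01                        # learning rate (lambda in the slide)

# (1) Forward propagation: x -> s -> L
s = W @ x                         # scores
L = 0.5 * np.sum((s - y) ** 2)    # loss

# (2) Backward propagation: dL/ds, then dL/dW by the chain rule
dL_ds = s - y
dL_dW = np.outer(dL_ds, x)

# (3) Weight update: theta <- theta - lambda * dL/dtheta
W = W - lam * dL_dW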

Page 4:

(Recall) Babysitting the training process

Slide from Karpathy 2016

Typical loss curve (figure)

Page 5:

(Recall) Babysitting the training process

Figure: Andrej Karpathy

Page 6:

Weight Initialization

[He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, arXiv 2015]

For deep nets, initialization is subtle and important:

Initialize weights to be smaller if there are more input connections: with n input connections (fan-in), draw each weight from a zero-mean Gaussian with variance 2/n (He initialization).

For neural nets with ReLU, this will ensure all activations have (approximately) the same variance
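A minimal sketch of this initialization scheme (not the code from the slides), assuming a fully connected layer with fan_in inputs and fan_out outputs:

import numpy as np

def he_init(fan_in, fan_out):
    # He initialization for ReLU layers: zero-mean Gaussian with variance 2 / fan_in,
    # so more input connections means smaller weights, and the activation variance
    # stays roughly constant from layer to layer.
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

W1 = he_init(fan_in=4096, fan_out=1000)   # layer sizes chosen arbitrarily
print(W1.std())                           # roughly sqrt(2/4096), about 0.022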

Page 7:

Initialization matters

[He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, arXiv 2015]

Training can take much longer if the network is not carefully initialized:

[Figure: training curves for a 22 layer model and a 30 layer model]

Page 8:

Proper initialization is an active area of research

Slide: Andrej Karpathy

Page 9:

(Recall) Regularization reduces overfitting:

L = L_data + L_reg,   L_reg = (λ/2) ||W||_2²

[Andrej Karpathy http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html]

Page 10:

Example Regularizers

L1 regularization (encourages sparse weights: weights are pushed toward exactly zero):
  L_reg = λ ||W||_1 = λ Σ_ij |W_ij|

L2 regularization (encourages small weights):
  L_reg = (λ/2) ||W||_2²

“Elastic net” (combines L1 and L2 regularization):
  L_reg = λ1 ||W||_1 + λ2 ||W||_2²

Max norm: clamp the weights to some maximum norm,  ||W||_2² ≤ c
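As an illustration (not from the slides), a minimal sketch computing each of these penalties for a weight matrix W; the λ values are arbitrary:

import numpy as np

W = np.random.randn(10, 20)
lam, lam1, lam2 = 1e-3, 1e-4, 1e-3   # regularization strengths (arbitrary)

L1_reg      = lam * np.sum(np.abs(W))          # lambda * ||W||_1
L2_reg      = lam * 0.5 * np.sum(W ** 2)       # (lambda/2) * ||W||_2^2
elastic_net = lam1 * np.sum(np.abs(W)) + lam2 * np.sum(W ** 2)

# Max norm is not a loss term: after each update, clamp W so that ||W||_2^2 <= c
c = 9.0
norm_sq = np.sum(W ** 2)
if norm_sq > c:
    W *= np.sqrt(c / norm_sq)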

Page 11:

“Weight decay”

L_reg = (λ/2) ||W||_2²

[Andrej Karpathy http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html]

L2 regularization is also called “weight decay” because the weights “decay” a little each iteration:

∂L_reg/∂W = λW

Gradient descent step:   W ← W − αλW − α ∂L_data/∂W

Weight decay: the weights always shrink by the amount αλW, on top of the data-driven update.

Note: biases are sometimes excluded from regularization
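A minimal sketch of the update above (not from the slides), showing that adding the L2 penalty to the loss is the same as shrinking the weights by αλW each step; names and values are illustrative:

import numpy as np

alpha, lam = 0.1, 0.01               # learning rate and regularization strength
W = np.random.randn(5, 5)
dLdata_dW = np.random.randn(5, 5)    # stand-in for the data-loss gradient

# Gradient of L = L_data + (lam/2)*||W||^2 is dL_data/dW + lam*W, so one
# gradient descent step decays the weights by alpha*lam*W:
W_new = W - alpha * lam * W - alpha * dLdata_dW

# Equivalent "decay, then take the usual step" form:
W_alt = W * (1 - alpha * lam) - alpha * dLdata_dW
assert np.allclose(W_new, W_alt)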

Page 12:

Dropout

[Srivastava et al, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR 2014]

Simple but powerful technique to reduce overfitting:

Page 13:

Dropout

[Srivastava et al, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR 2014]

Simple but powerful technique to reduce overfitting:

Page 14:

Dropout

[Srivastava et al, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR 2014]

Simple but powerful technique to reduce overfitting:

Note: Dropout can be interpreted as an approximation to taking the geometric mean of an ensemble of exponentially many models

Page 15:

Dropout

[Srivastava et al, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR 2014]

How much dropout? Around p = 0.5

Page 16:

Dropout

[Krizhevsky et al, “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012]

Case study: [Krizhevsky 2012]. Dropout is applied here (in the first two fully connected layers),

but not here (in the convolutional layers). Why?

“Without dropout, our network exhibits substantial overfitting.”

Page 17:

Dropout

(note, here X is a single input)

Figure: Andrej Karpathy

Page 18:

Dropout

Figure: Andrej Karpathy

Test time: scale the activations

We want to keep the same expected value

Expected value of a neuron h with dropout (p = probability of keeping the unit): E[h] = p·h + (1 − p)·0 = p·h
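A minimal sketch of this train/test behavior (in the spirit of the Karpathy slides that follow, but not copied from them), where p is the probability of keeping a unit:

import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_forward_train(h):
    # Zero out each activation independently with probability (1 - p).
    mask = (np.random.rand(*h.shape) < p)
    return h * mask

def dropout_forward_test(h):
    # Keep every unit at test time, but scale by p so the expected value
    # matches training: E[h] = p*h + (1 - p)*0 = p*h.
    return h * p

h = np.random.randn(1000)
print(dropout_forward_train(h).mean(), dropout_forward_test(h).mean())

In practice the scaling is often moved into the training pass instead (“inverted dropout”), so the test-time code needs no change.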

Page 19:

Slide: Andrej Karpathy

Page 20:

Slide: Andrej Karpathy

Page 21:

Slide: Andrej Karpathy

Page 22:

Place after an FC or convolutional layer, and before the nonlinearity

Slide: Andrej Karpathy
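The summary slide later in the lecture mentions Batch Normalization, which is the standard technique placed in exactly this position (after the linear layer, before the nonlinearity). A minimal sketch of the training-time batch-norm computation, not taken from the slides; gamma and beta are the learned scale and shift:

import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) activations from an FC layer, before the nonlinearity.
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

x = np.random.randn(32, 100) * 3.0 + 5.0
out = batchnorm_train(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())                 # roughly 0 and 1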

Page 23:

Transfer Learning (“fine-tuning”)

Slide: Andrej Karpathy

Page 24:

Slide: Andrej Karpathy

This is not just a special trick; this is “the” method used by most papers

Transfer Learning (“fine-tuning”)

Page 25:

Transfer Learning (“fine-tuning”)

Slide: Andrej Karpathy
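A minimal sketch of the fine-tuning recipe (not from the slides), using PyTorch/torchvision rather than the framework used in class: start from ImageNet-pretrained weights, replace the final classifier layer for the new task, and train either just that layer or the whole network with a small learning rate.

import torch.nn as nn
import torchvision.models as models

num_classes = 10                                    # size of the new task, arbitrary here

model = models.resnet18(weights="IMAGENET1K_V1")    # ImageNet-pretrained backbone (torchvision >= 0.13)

# Option 1: freeze all pretrained layers and train only a new final layer.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new layer, trained from scratch

# Option 2 (more data available): also fine-tune the earlier layers, typically with
# a learning rate about 10x smaller than what was used for training from scratch.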

Page 26:

Summary

• Preprocess the data (subtract mean, sub-crops)

• Initialize weights carefully

• Use Dropout and/or Batch Normalization

• Use SGD + Momentum

• Fine-tune from ImageNet

• Babysit the network as it trains

Page 27:

Slide: Andrej Karpathy

Page 28:

Slide: Andrej Karpathy

Page 29:

Slide: Andrej Karpathy