Lecture 37: ConvNets (Cont’d) and Training
CS 4670/5670, Sean Bell
[http://bbabenko.tumblr.com/post/83319141207/convolutional-learnings-things-i-learned-by]


Page 1:

Lecture 37: ConvNets (Cont’d) and Training
CS 4670/5670, Sean Bell

[http://bbabenko.tumblr.com/post/83319141207/convolutional-learnings-things-i-learned-by]

Page 2:

(Unrelated) Dog vs Food

[Karen Zack, @teenybiscuit]

Pages 3–4: (further Dog vs Food image slides, same credit)

Page 5:

(Recap) Backprop

From Geoff Hinton’s seminar at Stanford yesterday

Page 6:

(Recap) Backprop

Parameters: all of the weights and biases in the network, stacked together:

\[ \theta = \begin{bmatrix} \theta_1 & \theta_2 & \cdots \end{bmatrix} \]

Gradient:

\[ \frac{\partial L}{\partial \theta} = \begin{bmatrix} \dfrac{\partial L}{\partial \theta_1} & \dfrac{\partial L}{\partial \theta_2} & \cdots \end{bmatrix} \]

Intuition: “How fast would the error change if I change myself by a little bit?”

Page 7:

(Recap) Backprop

Forward propagation: compute the activations and loss:

x → [Function; θ^(1)] → h^(1) → … → [Function; θ^(n)] → s → L

Backward propagation: compute the gradient (“error signal”), flowing from L back through each function:

∂L/∂s, …, ∂L/∂h^(1), ∂L/∂x, along with ∂L/∂θ^(n), …, ∂L/∂θ^(1) for the parameters.
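As a concrete (if tiny) instance of this recap, here is a minimal numerical sketch of one forward and backward pass through a two-function chain; the particular layers (an affine map and a ReLU, with a squared-error loss) and the names x, y, W1 are illustrative choices, not from the slides:

```python
import numpy as np

# Tiny chain: x -> h = W1 @ x (affine) -> s = max(0, h) (ReLU)
#             -> L = 0.5 * ||s - y||^2 (squared-error loss)
np.random.seed(0)
x = np.random.randn(4)           # input
y = np.random.randn(3)           # target
W1 = np.random.randn(3, 4)       # parameters theta^(1)

# Forward propagation: compute the activations and loss
h = W1 @ x
s = np.maximum(0.0, h)
L = 0.5 * np.sum((s - y) ** 2)

# Backward propagation: compute the gradient ("error signal")
dL_ds = s - y                    # dL/ds
dL_dh = dL_ds * (h > 0)          # ReLU passes gradient only where h > 0
dL_dW1 = np.outer(dL_dh, x)      # dL/dW1, same shape as W1
dL_dx = W1.T @ dL_dh             # dL/dx, the signal sent to earlier layers
```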

Page 8:

(Recap) A ConvNet is a sequence of convolutional layers, interspersed with activation functions (and possibly other layer types).

Pages 9–15: (Recap) [figure-only slides; images not included in the transcript]

Page 16:

Web demo 1: Convolution

http://cs231n.github.io/convolutional-networks/

[Karpathy 2016]

Page 17:

Web demo 2: ConvNet in a Browser

http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

[Karpathy 2014]

Page 18:

Convolution: Stride

[Figure: input, weights, and output grids]

During convolution, the weights “slide” along the input to generate each output.

Pages 19–23: (animation frames of the same slide)

Page 24:

Convolution: Stride

Input

During convolution, the weights “slide” along the input to generate each output

Recall that at each position, we are doing a 3D sum:

\[ h_r = \sum_{i,j,k} x_{rijk} \, W_{ijk} + b \]

where (i, j, k) ranges over (channel, row, column) and x_r denotes the input window at output position r.
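As a sketch of this sum in code (the function and variable names here are illustrative, not from the lecture): at a single output position, the filter covers all input channels and a k x k spatial window, producing one number:

```python
import numpy as np

def conv_at(x, W, b, row, col):
    """x: (channels, height, width); W: (channels, k, k); b: scalar bias."""
    k = W.shape[1]
    patch = x[:, row:row + k, col:col + k]  # 3D window over (channel, row, column)
    return np.sum(patch * W) + b            # h_r = sum_ijk x_rijk * W_ijk + b

x = np.random.randn(3, 7, 7)  # e.g. a 3-channel 7x7 input
W = np.random.randn(3, 3, 3)  # one 3x3 filter spanning all 3 channels
h = conv_at(x, W, b=0.1, row=2, col=4)
```

With a stride s, you would evaluate conv_at only at positions row, col = 0, s, 2s, …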

Page 25:

Convolution: Stride

[Figure: input and output grids]

But we can also convolve with a stride, e.g. stride = 2.

Pages 26–27: (animation frames of the same slide)

Page 28:

Convolution: Stride

But we can also convolve with a stride, e.g. stride = 2:
- Notice that with certain strides, we may not be able to cover all of the input
- The output is also half the size of the input

Page 29:

Convolution: Padding

We can also pad the input with zeros. Here, pad = 1, stride = 2.

[Figure: input grid surrounded by a one-pixel border of zeros, with the corresponding output]

Pages 30–32: (animation frames of the same slide)

Page 33:

Convolution: How big is the output?

[Figure: zero-padded input of width w_in, with padding p on each side, kernel width k, and stride s]

In general, the output has size:

\[ w_{\text{out}} = \left\lfloor \frac{w_{\text{in}} + 2p - k}{s} \right\rfloor + 1 \]
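The formula is easy to check in code; a minimal helper (the name conv_output_size is my own):

```python
def conv_output_size(w_in, k, s, p):
    """w_out = floor((w_in + 2p - k) / s) + 1"""
    return (w_in + 2 * p - k) // s + 1

print(conv_output_size(7, k=3, s=2, p=1))    # 4
print(conv_output_size(224, k=3, s=1, p=1))  # 224: the "same size" case on the next slide
```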

Page 34:

Convolution: How big is the output?

Example: k = 3, s = 1, p = 1:

\[ w_{\text{out}} = \left\lfloor \frac{w_{\text{in}} + 2p - k}{s} \right\rfloor + 1 = \left\lfloor \frac{w_{\text{in}} + 2 - 3}{1} \right\rfloor + 1 = w_{\text{in}} \]

VGGNet [Simonyan 2014] uses filters of this shape.

Page 35:

Pooling

Figure: Andrej Karpathy

For most ConvNets, convolution is often followed by pooling:
- Creates a smaller representation while retaining the most important information
- The “max” operation is the most common
- Why might “avg” be a poor choice?

Page 36:

Pooling

Page 37:

Max Pooling

Figure: Andrej Karpathy

What’s the backprop rule for max pooling?
- In the forward pass, store the index that took the max
- In the backward pass, route the output gradient to the input at that stored index (all other inputs receive zero gradient)
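A minimal sketch of both passes for 2x2 max pooling with stride 2 (the names and the even-sized, single-channel input are simplifying assumptions, not from the slides):

```python
import numpy as np

def maxpool_forward(x):
    """x: (H, W) with H, W even. Returns the pooled output and argmax indices."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    idx = np.zeros((H // 2, W // 2), dtype=int)    # flat index of the max
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2*i:2*i+2, 2*j:2*j+2]
            idx[i, j] = np.argmax(window)          # store the index that took the max
            out[i, j] = window.flat[idx[i, j]]
    return out, idx

def maxpool_backward(dout, idx, shape):
    """Route each output gradient to the input that was the max; zeros elsewhere."""
    dx = np.zeros(shape)
    H2, W2 = dout.shape
    for i in range(H2):
        for j in range(W2):
            di, dj = divmod(idx[i, j], 2)          # unflatten the stored index
            dx[2*i + di, 2*j + dj] += dout[i, j]
    return dx
```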

Page 38:

Example ConvNet

Figure: Andrej Karpathy

Pages 39–40: (animation frames of the same slide)

Page 41:

Example ConvNet

Figure: Andrej Karpathy

10 3x3 conv filters, stride 1, pad 1
2x2 pool filters, stride 2

Page 42:

Example: AlexNet [Krizhevsky 2012]

Figure: [Karnowski 2015] (with corrections)

[Figure: a 3x227x227 input image passing through conv, “max”, “norm”, and “full” (FC) layers, ending in 1000 class scores]

“max”: max pooling; “norm”: local response normalization; “full”: fully connected

Page 43:

Example: AlexNet [Krizhevsky 2012]

Figure: [Karnowski 2015]

zoom in

Page 44:

Questions?

Page 45:

… why so many layers?

[Szegedy et al, 2014]

How do you actually train these things?

Page 46:

How do you actually train these things? Roughly speaking:
- Gather labeled data
- Find a ConvNet architecture
- Minimize the loss

Page 47:

Training a convolutional neural network

• Split and preprocess your data

• Choose your network architecture

• Initialize the weights

• Find a learning rate and regularization strength

• Minimize the loss and monitor progress

• Fiddle with knobs

Page 48:

Mini-batch Gradient Descent

Loop:
1. Sample a batch of training data (~100 images)
2. Forward pass: compute the loss (averaged over the batch)
3. Backward pass: compute the gradient
4. Update all parameters

Note: this is usually called “stochastic gradient descent” (SGD), even though SGD strictly refers to a batch size of 1.
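A minimal sketch of this loop, assuming model is a dict of parameter arrays and loss_fn runs the forward and backward passes, returning the batch loss and a dict of gradients (all of these names are placeholders, not an actual framework API):

```python
import numpy as np

def train(model, loss_fn, X, Y, learning_rate=1e-2, batch_size=100, steps=1000):
    n = X.shape[0]
    for step in range(steps):
        # 1. Sample a batch of training data (~100 images)
        batch = np.random.choice(n, batch_size, replace=False)
        # 2-3. Forward pass (batch-averaged loss) and backward pass (gradients)
        loss, grads = loss_fn(model, X[batch], Y[batch])
        # 4. Update all parameters: take a small step downhill
        for name in grads:
            model[name] -= learning_rate * grads[name]
    return model
```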

Page 49:

Regularization

Regularization reduces overfitting:

\[ L = L_{\text{data}} + L_{\text{reg}}, \qquad L_{\text{reg}} = \lambda \, \tfrac{1}{2} \lVert W \rVert_2^2 \]

[Andrej Karpathy http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html]
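In code, the L2 term contributes 0.5 * lambda * ||W||^2 to the loss and lambda * W to the gradient; a sketch (lam stands for lambda):

```python
import numpy as np

def l2_regularization(W, lam):
    """Return (L_reg, dL_reg/dW) for L_reg = lam * 0.5 * ||W||^2."""
    loss = 0.5 * lam * np.sum(W * W)
    grad = lam * W  # added to the data-loss gradient during backprop
    return loss, grad
```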

Page 50:

Overfitting

[Image: https://en.wikipedia.org/wiki/File:Overfitted_Data.png]

Overfitting: modeling noise in the training set instead of the “true” underlying relationship

Underfitting: insufficiently modeling the relationship in the training set

General rule: models that are “bigger” or have more capacity are more likely to overfit

Page 51:

(0) Dataset split

Split your data into “train”, “validation”, and “test”:

Dataset

Train

Validation

Test

Page 52:

(0) Dataset split

Train | Validation | Test

Train: gradient descent and fine-tuning of parameters

Validation: determining hyper-parameters (learning rate, regularization strength, etc.) and picking an architecture

Test: estimate real-world performance (e.g. accuracy = fraction correctly classified)

Page 53:

(0) Dataset split

Train | Validation | Test

Be careful with false discovery: once we have used the test set, we should not use it again (but nobody follows this rule, since it’s expensive to collect datasets). Instead, try to avoid looking at the test score until the end.

Page 54:

(0) Dataset split

Cross-validation: cycle which data is used as validation:

Split 1: Val   | Train | Train | Test
Split 2: Train | Val   | Train | Test
Split 3: Train | Train | Val   | Test

Average scores across validation splits.
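A sketch of this scheme (here X, Y are the non-test data, and evaluate is an assumed helper that trains on the given training folds and returns a validation score):

```python
import numpy as np

def cross_validate(X, Y, evaluate, n_folds=3):
    folds_X = np.array_split(X, n_folds)
    folds_Y = np.array_split(Y, n_folds)
    scores = []
    for i in range(n_folds):  # cycle which fold is used as validation
        X_train = np.concatenate([f for j, f in enumerate(folds_X) if j != i])
        Y_train = np.concatenate([f for j, f in enumerate(folds_Y) if j != i])
        scores.append(evaluate(X_train, Y_train, folds_X[i], folds_Y[i]))
    return np.mean(scores)  # average scores across validation splits
```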

Page 55:

(1) Data preprocessing

Figure: Andrej Karpathy

Preprocess the data so that learning is better conditioned:

Page 56:

(1) Data preprocessing

Slide: Andrej Karpathy

In practice, you may also see PCA and Whitening of the data:

Page 57:

(1) Data preprocessing

Figure: Alex Krizhevsky

For ConvNets, typically only the mean is subtracted.

A per-channel mean also works (one value per R,G,B).
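Both options in numpy (a sketch; the random batch below just stands in for a real training set of shape (N, H, W, 3)):

```python
import numpy as np

train_images = np.random.rand(100, 32, 32, 3)          # dummy stand-in data

mean_image = train_images.mean(axis=0)                 # full (H, W, 3) mean image
per_channel_mean = train_images.mean(axis=(0, 1, 2))   # one value per R, G, B

preprocessed = train_images - per_channel_mean         # broadcasts over all pixels
```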

Page 58:

(1) Data preprocessing

Augment the data: extract random crops from the input, with slightly jittered offsets. Without this, typical ConvNets (e.g. [Krizhevsky 2012]) overfit the data.

E.g. 224x224 patches extracted from 256x256 images. Randomly reflect horizontally. Perform the augmentation live during training.

Figure: Alex Krizhevsky
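A sketch of this augmentation (the function name and the 50% flip probability are my own choices; the crop size matches the slide):

```python
import numpy as np

def augment(image, crop=224):
    """Random crop + random horizontal reflection; image: (H, W, C), e.g. 256x256x3."""
    H, W, _ = image.shape
    top = np.random.randint(0, H - crop + 1)    # jittered crop offset
    left = np.random.randint(0, W - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:                  # randomly reflect horizontally
        patch = patch[:, ::-1]
    return patch  # apply live, per training example, each epoch
```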