TRANSCRIPT
Convolutional Neural Networks III
October 2nd, 2019
Yong Jae Lee, UC Davis
Many slides from Rob Fergus, Svetlana Lazebnik, Jia-Bin Huang, Derek Hoiem, Adriana Kovashka, Andrej Karpathy
Announcements
• Sign up for paper presentations
• First paper review due Thurs 11:59 PM
Gradient descent
• We'll update weights iteratively
• Move in direction opposite to gradient: w ← w − η · ∂L/∂w, where η is the learning rate
[Figures: training loss decreasing over time; loss function landscape over weights (W_1, W_2), showing the original W and the negative gradient direction]
Figure from Andrej Karpathy
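A minimal NumPy sketch of this update rule. The quadratic loss, data, and learning rate below are illustrative stand-ins, not the lecture's code; only the update `w ← w − η · ∂L/∂w` is the point.

```python
import numpy as np

# Hypothetical quadratic loss L(w) = ||X w - y||^2, used only to illustrate the update rule.
def loss_and_grad(w, X, y):
    residual = X @ w - y
    loss = np.sum(residual ** 2)
    grad = 2 * X.T @ residual          # dL/dw
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = np.zeros(5)
eta = 0.01                             # learning rate

for step in range(100):
    loss, grad = loss_and_grad(w, X, y)
    w = w - eta * grad                 # move in direction opposite to the gradient
```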
Gradient descent in multi-layer nets
• We'll update weights
• Move in direction opposite to the gradient
• How to update the weights at all layers?
• Answer: backpropagation of loss from higher layers to lower layers
Backpropagation: Graphic example
• First calculate error of output units and use this to change the top layer of weights.
[Figure: three-layer network with output, hidden, and input layers (units labeled k, j, i) and weight layers w(2), w(1); caption: "Update weights into j"]
Adapted from Ray Mooney
Backpropagation: Graphic example
• Next calculate error for hidden units based on errors on the output units it feeds into.
[Figure: same network (output, hidden, input; units k, j, i); errors propagated from output units back to hidden units]
Adapted from Ray Mooney
Backpropagation: Graphic example
• Finally update bottom layer of weights based on errors calculated for hidden units.
[Figure: same network; caption: "Update weights into i"]
Adapted from Ray Mooney
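To make the three steps above concrete, here is a minimal NumPy sketch of backpropagation through a one-hidden-layer network. The sigmoid hidden layer, linear output, squared-error loss, and layer sizes are illustrative assumptions, not the lecture's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer network: input -> hidden (sigmoid) -> output (linear)
x = rng.normal(size=(4,))        # input activations
t = rng.normal(size=(2,))        # target output
W1 = rng.normal(size=(3, 4))     # bottom layer of weights (into hidden units)
W2 = rng.normal(size=(2, 3))     # top layer of weights (into output units)
eta = 0.1                        # learning rate

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Forward pass
h = sigmoid(W1 @ x)              # hidden activations
y = W2 @ h                       # output activations
loss = 0.5 * np.sum((y - t) ** 2)

# 1. Error of output units -> gradient for the top layer of weights
delta_out = y - t                                 # dL/dy
grad_W2 = np.outer(delta_out, h)

# 2. Error of hidden units, from the errors of the output units they feed into
delta_hidden = (W2.T @ delta_out) * h * (1 - h)   # chain rule through the sigmoid

# 3. Gradient for the bottom layer of weights
grad_W1 = np.outer(delta_hidden, x)

# Gradient descent step on both layers
W2 -= eta * grad_W2
W1 -= eta * grad_W1
```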
Backpropagation
• Easier if we use computational graphs, especially when we have complicated functions typical in deep neural networks
Figure from Andrej Karpathy
Backpropagation: a simple example
• f(x, y, z) = (x + y) · z, e.g. x = -2, y = 5, z = -4
• Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
• Forward pass: q = x + y = 3, then f = q · z = -12
• Backward pass with the chain rule, working from the output toward the inputs:
  ∂f/∂f = 1; ∂f/∂z = q = 3; ∂f/∂q = z = -4
  ∂f/∂x = ∂f/∂q · ∂q/∂x = -4; ∂f/∂y = ∂f/∂q · ∂q/∂y = -4
• At each node, the gradient flowing back is the upstream gradient times the local gradient.
Andrej Karpathy (slides from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 4, 13 Jan 2016)
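A short plain-Python sketch of the forward and backward pass through this graph with the example values; it simply carries out the chain-rule arithmetic above.

```python
# Forward pass through the graph f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: multiply each upstream gradient by the local gradient
df_df = 1.0
df_dz = q * df_df              # local gradient of f w.r.t. z is q   -> 3
df_dq = z * df_df              # local gradient of f w.r.t. q is z   -> -4
df_dx = 1.0 * df_dq            # dq/dx = 1                           -> -4
df_dy = 1.0 * df_dq            # dq/dy = 1                           -> -4

print(df_dx, df_dy, df_dz)     # -4.0 -4.0 3.0
```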
Gradients flow through each node in the graph
• In the forward pass, a node f receives input activations and produces an output activation.
• In the backward pass, the node receives the upstream gradient on its output, multiplies it by its "local gradient" (the derivative of its output with respect to each input), and passes the resulting gradients back along each input.
[Figure: a single node f, with activations flowing forward and gradients flowing backward]
Andrej Karpathy (slides from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 4, 13 Jan 2016)
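One common way to implement this pattern is a small "gate" object with forward and backward methods. This is a hedged plain-Python sketch of the idea, not code from the lecture.

```python
class MultiplyGate:
    """A single node f(a, b) = a * b in a computational graph."""

    def forward(self, a, b):
        # Cache the input activations; each input's local gradient is the other input.
        self.a, self.b = a, b
        return a * b

    def backward(self, upstream):
        # Downstream gradient = upstream gradient * local gradient.
        grad_a = upstream * self.b
        grad_b = upstream * self.a
        return grad_a, grad_b

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)          # forward activation: -12.0
grad_a, grad_b = gate.backward(1.0)    # gradients on the inputs: (-4.0, 3.0)
```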
Backpropagation: another example
Andrej Karpathy
Convolutional Neural Networks (CNN)
• Neural network with specialized connectivity structure
• Stack multiple stages of feature extractors
• Higher stages compute more global, more invariant, more abstract features
• Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
Adapted from Rob Fergus
• Feed-forward feature extraction:
  1. Convolve input with learned filters
  2. Apply non-linearity
  3. Spatial pooling (downsample)
• Supervised training of convolutional filters by back-propagating classification error
Adapted from Lana Lazebnik
Convolutional Neural Networks (CNN)
[Pipeline: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling → … → Output (class probs)]
Convolutions: More detail
[Figure: a 32x32x3 image, with width 32, height 32, depth 3]
Andrej Karpathy
Convolutions: More detail
[Figure: a 32x32x3 image and a 5x5x3 filter]
• Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"
Andrej Karpathy
Convolution Layer
[Figure: 32x32x3 image, 5x5x3 filter]
• 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolutions: More detail
Andrej Karpathy
Convolution Layer
[Figure: 32x32x3 image convolved with a 5x5x3 filter, producing a 28x28x1 activation map]
• Convolve (slide) over all spatial locations
Convolutions: More detail
Andrej Karpathy
Convolution Layer
[Figure: 32x32x3 image, 5x5x3 filters, two 28x28x1 activation maps]
• Convolve (slide) over all spatial locations
• Consider a second, green filter
Convolutions: More detail
Andrej Karpathy
Convolution Layer
• For example, if we had 6 5x5 filters, we'll get 6 separate activation maps:
[Figure: 32x32x3 image → six 28x28 activation maps]
• We stack these up to get a "new image" of size 28x28x6!
Convolutions: More detail
Andrej Karpathy
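A minimal NumPy sketch of this convolution layer (stride 1, no padding): six 5x5x3 filters slid over a 32x32x3 image to produce a 28x28x6 output. The explicit loops are for clarity rather than speed, and the random image and filters are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))        # 32x32x3 input image
filters = rng.normal(size=(6, 5, 5, 3))     # 6 filters, each 5x5x3
biases = np.zeros(6)

H, W, _ = image.shape
num_filters, fh, fw, _ = filters.shape
out_h, out_w = H - fh + 1, W - fw + 1        # 28 x 28 with stride 1, no padding
activation_maps = np.zeros((out_h, out_w, num_filters))

# Slide each filter over all spatial locations, taking 75-dimensional dot products + bias
for k in range(num_filters):
    for i in range(out_h):
        for j in range(out_w):
            chunk = image[i:i + fh, j:j + fw, :]          # 5x5x3 chunk of the image
            activation_maps[i, j, k] = np.sum(chunk * filters[k]) + biases[k]

print(activation_maps.shape)   # (28, 28, 6): the "new image" of size 28x28x6
```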
Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions
[Figure: 32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6]
Convolutions: More detail
Andrej Karpathy
Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions
[Figure: 32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ….]
Convolutions: More detail
Andrej Karpathy
Preview:
[Figure: example ConvNet architecture]
Convolutions: More detail
Andrej Karpathy
Figure from http://www.mdpi.com/2072-4292/7/11/14680/htm
A Common Architecture: AlexNet
Case Study: VGGNet [Simonyan and Zisserman, 2014]
• Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2
• 11.2% top-5 error in ILSVRC 2013 → 7.3% top-5 error (best model)
Andrej Karpathy
Case Study: GoogLeNet [Szegedy et al., 2014]
• Inception module
• ILSVRC 2014 winner (6.7% top-5 error)
Andrej Karpathy
Case Study: ResNet [He et al., 2015]
• ILSVRC 2015 winner (3.6% top-5 error)
• 2-3 weeks of training on an 8-GPU machine
• At runtime: faster than a VGGNet! (even though it has 8x more layers)
Slides from Kaiming He's presentation: https://www.youtube.com/watch?v=1PGLj-uKT1w
Andrej Karpathy
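ResNet's central idea is the residual block, where a stack of layers learns a residual F(x) that is added back to its input: y = F(x) + x. Below is a hedged PyTorch sketch of a basic residual block; the channel count and layer details are illustrative choices, not He et al.'s exact configuration.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, where F is two 3x3 convolutions with batch norm."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + x)   # skip connection adds the input back

block = BasicResidualBlock(channels=64)
out = block(torch.randn(1, 64, 56, 56))  # output shape matches the input: (1, 64, 56, 56)
```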
Practical matters
Comments on training algorithm
• Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.
• However, in practice, does converge to low error for many large networks on real data.
• Thousands of epochs (epoch = network sees all training data once) may be required, hours or days to train.
• To avoid local-minima problems, run several trials starting with different random weights (random restarts), and take results of the trial with the lowest training set error.
• May be hard to set the learning rate and to select the number of hidden units and layers.
• Neural networks had fallen out of fashion in the '90s and early 2000s; back with a new name and significantly improved performance (deep networks trained with dropout and lots of data).
Ray Mooney, Carlos Guestrin, Dhruv Batra
Over-training prevention
• Running too many epochs can result in over-fitting.
• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
[Figure: error vs. # training epochs: error on training data keeps decreasing, while error on test data eventually rises]
Adapted from Ray Mooney
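A minimal sketch of this early-stopping rule in plain Python. Here `train_one_epoch`, `validation_error`, and `model.copy()` are hypothetical stand-ins for your own training, evaluation, and snapshotting code.

```python
def train_with_early_stopping(model, train_data, val_data,
                              train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    """Stop when validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_weights = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)          # one pass over all training data
        err = validation_error(model, val_data)     # error on the hold-out validation set

        if err < best_error:
            best_error = err
            best_weights = model.copy()             # assumes the model can snapshot itself
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                               # additional epochs are increasing validation error

    return best_weights, best_error
```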
Training: Best practices
• Use mini-batch
• Use regularization
• Use cross-validation for your parameters
• Use ReLU or leaky ReLU, don't use sigmoid
• Center (subtract mean from) your data
• Learning rate: too high? too low?
• Use BatchNorm
Data Augmentation (Jittering)
• Create virtual training samples
  – Horizontal flip
  – Random crop
  – Color casting
  – Geometric distortion
Jia-bin Huang, Image: https://github.com/aleju/imgaug
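A minimal NumPy sketch of two of these jittering operations (horizontal flip and random crop) applied to an image array; the image and crop sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(image):
    # Reverse the width axis of an H x W x C image.
    return image[:, ::-1, :]

def random_crop(image, crop_h, crop_w):
    # Take a random crop_h x crop_w window from an H x W x C image.
    H, W, _ = image.shape
    top = rng.integers(0, H - crop_h + 1)
    left = rng.integers(0, W - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

image = rng.random(size=(32, 32, 3))
virtual_samples = [horizontal_flip(image), random_crop(image, 28, 28)]
```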
Regularization: Dropout
Dropout: A simple way to prevent neural networks from overfitting [Srivastava, JMLR 2014]
• Randomly turn off some neurons
• Allows individual neurons to independently be responsible for performance
Adapted from Jia-bin Huang
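A minimal NumPy sketch of (inverted) dropout applied to a layer's activations at training time; the keep probability and layer size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.5, train=True):
    """Randomly turn off neurons; scale at train time so nothing changes at test time."""
    if not train:
        return activations
    mask = rng.random(activations.shape) < keep_prob   # which neurons stay on
    return activations * mask / keep_prob              # "inverted" dropout scaling

h = rng.normal(size=(128,))        # hidden-layer activations
h_train = dropout(h, keep_prob=0.5, train=True)
h_test = dropout(h, train=False)   # identity at test time
```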
Transfer Learning
"You need a lot of data if you want to train/use CNNs"
Andrej Karpathy
Transfer Learning with CNNs
• The more weights you need to learn, the more data you need
• That's why with a deeper network, you need more data for training than for a shallower network
• One possible solution: set the lower layers to the already learned weights from another network, and learn the remaining layers on your own task
[Figure: 1. Train on ImageNet. 2. Small dataset: freeze the pretrained layers, train only the final classifier. 3. Medium dataset: finetuning; more data = retrain more of the network (or all of it), freezing the earliest layers and training the rest.]
Adapted from Andrej Karpathy
Transfer Learning with CNNs
Source: classification on ImageNet. Target: some other task/data.
[Figure: features in lower layers are more generic, features in higher layers are more specific]
• Very little data:
  – very similar dataset → use linear classifier on top layer
  – very different dataset → you're in trouble… try linear classifier from different stages
• Quite a lot of data:
  – very similar dataset → finetune a few layers
  – very different dataset → finetune a larger number of layers
Andrej Karpathy
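A hedged PyTorch sketch of the small-dataset case: start from a network pretrained on ImageNet, freeze its weights, and train only a new linear classifier on top. The specific model (resnet18), class count, and optimizer settings are illustrative choices (assuming a recent torchvision), not the lecture's.

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10                                   # classes in your own (small) target dataset

# 1. Start from weights already learned on ImageNet
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# 2. Freeze the pretrained layers
for param in model.parameters():
    param.requires_grad = False

# 3. Replace the final layer and train only it (a linear classifier on the top-layer features)
model.fc = nn.Linear(model.fc.in_features, num_classes)
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)

# For a medium-sized dataset, unfreeze some of the later layers as well and finetune them too.
```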
Summary
• We use deep neural networks because of their strong performance in practice
• Convolutional neural networks (CNN)
  – Convolution, non-linearity, max pooling
• Training deep neural nets
  – We need an objective function that measures and guides us towards good performance
  – We need a way to minimize the loss function: stochastic gradient descent
  – We need backpropagation to propagate error through all layers and change their weights
• Practices for preventing overfitting
  – Dropout; BatchNorm; data augmentation; transfer learning
Questions?
See you Friday!