Introduction to Deep Learning for Biomedical Engineering
After a presentation made by: Evan Shelhamer, Jeff Donahue, Jon Long
caffe.berkeleyvision.org
github.com/BVLC/caffe
Prof. Bart ter Haar Romeny
What is Deep Learning?
A typical Deep Convolutional Neural Network
ImageNet – Fei Fei Li
ImageNet Large Scale Visual Recognition Competition (ILSVRC)
AlexNet
Litjens, Geert, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken, and Clara I. Sánchez. "A survey on deep learning in medical image analysis." arXiv preprint arXiv:1702.05747 (Feb 2017).
Power of heatmaps – Train on image level, visualize on pixel level.
Samaneh Abbasi, Bart Romeny et al., TU/e: Recurrent Convolutional Neural Networks, MICCAI 2017, Quebec City
For Diabetic Retinopathy the best detection performance is by Quellec et al.: Az = 0.954 in Kaggle’s dataset and Az = 0.949 in e-Ophtha.
Deep Learning for Vision
Dive into Deep Learning
What is DL? Why Now?
Why Deep Learning?
Applications
The Challenge of Recognition
Learning & Optimization
Network Tour
Transfer Learning
Caffe First Sip
Why Deep Learning? End-to-End Learning for Many Tasks
vision, speech, text, control
Some examples
Demo: Google translate on smartphone (speech + images)
Demo: https://www.imageidentify.com/
How does this work?
Биомедицинская инженерия ("Biomedical engineering")
Today you can read this Russian text with your smartphone
Kaggle: Diabetic Retinopathy Challenge (blog)
Google Photos
Other examples:
Robot vision and recognition: Harvest robot for peppers.
Wageningen University, the Netherlands
Vision for self-driving cars
Aalsmeer, Netherlands, largest flower auction in the world
Quick facts and figures about the Dutch Horticulture industry
The Dutch horticulture sector is a global trendsetter and the undisputed international market leader in flowers, plants, bulbs and propagation material.
Did you know?
• Holland has a 44% share of the worldwide trade in floricultural products, making it the dominant global supplier of flowers and flower products. Some 77% of all flower bulbs traded worldwide come from the Netherlands, the majority of which are tulips. 40% of the trade in 2015 was cut flowers and flower buds.
• The sector is the number 1 exporter to the world for live trees, plants, bulbs, roots and cut flowers.
• The sector is the number 3 exporter in nutritional horticulture products.
• Of the approximately 1,800 new plant varieties that enter the European market each year, 65% originate in the Netherlands. In addition, Dutch breeders account for more than 35% of all applications for community plant variety rights.
• The Dutch are one of the world’s largest exporters of seeds: the exports of seeds amounted to €3.1 billion in 2014.
• In 2014 the Netherlands was the world’s second largest exporter (in value) of fresh vegetables. The Netherlands exported vegetables with a market value of €7 billion.
Deep Learning
From Wikipedia:
Deep learning is a class of machine learning algorithms that
• use a deep cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. The algorithms may be supervised or unsupervised. Applications include pattern analysis (unsupervised) and classification (supervised).
• are based on the (unsupervised) learning of multiple levels of features or representations of the data. Higher-level features are derived from lower-level features to form a hierarchical representation.
So we have to learn:
1. Overview in depth → Introduction, Caffe example
2. What are filters? → Convolution and convolution networks
3. What is learned? → Invariant geometric features
4. How can kernels be learned? → Principal Component Analysis
5. How does the visual system do this? → Front-end vision, visual cortex
6. How can we use this? → Software developments in Deep Learning
7. Questions → and answers
Deep Learning is a very hot area of Machine Learning Research, with many remarkable recent successes, such as 97.5% accuracy on face recognition, nearly perfect German traffic sign recognition, or even Dogs vs Cats image recognition with 98.9% accuracy.
Many winning entries in recent Kaggle Data Science competitions have used Deep Learning.
The term "deep learning" refers to the method of training multi-layered neural networks, and became popular after papers by Geoffrey Hinton and his co-workers which showed a fast way to train such networks.
http://www.kdnuggets.com/2014/05/learn-deep-learning-courses-tutorials-overviews.html
Yann LeCun, a former postdoc of Geoff Hinton, also developed a very effective algorithm for deep learning, called ConvNet, which was successfully used in the late 1980s and early 1990s for automatic reading of amounts on bank checks.
In May 2014, Baidu, the Chinese search giant, hired Andrew Ng, a leading Machine Learning and Deep Learning expert (and co-founder of Coursera), to head their new AI Lab in Silicon Valley, setting up an AI & Deep Learning race with Google (which hired Geoffrey Hinton) and Facebook (which hired Yann LeCun to head Facebook AI Lab).
Human vision and convolutional neural networks:
A cascade of increasing complexity
• Hierarchical network
• Use of context
Wikipedia: Gestalt psychology or gestaltism (German: Gestalt "shape, form") is a philosophy of mind of the Berlin School of experimental psychology. Gestalt psychology is an attempt to understand the laws behind the ability to acquire and maintain meaningful perceptions in an apparently chaotic world. The central principle of gestalt psychology is that the mind forms a global whole with self-organizing tendencies. The assumed physiological mechanisms on which Gestalt theory rests are poorly defined and support for their existence is lacking. It is known as ‘perceptual grouping’.
AlexNET - pdf
Vision: the highest bandwidth input channel
Machines are useful mainly to the extent that they interact with the physical world
Visual information is the richest source of information about the real world
Vision is the highest-bandwidth mode for machines to obtain real-world info
Embedded vision enables our things to be:
- More responsive
- More personal and secure
- Safer, more autonomous
- Easier to use
subaru.com
Top papers on arXiv (https://arxiv.org/):
http://www.kdnuggets.com/2017/02/top-arxiv-papers-january-convnets-wide-adversarial.html
Performance evaluation: http://www.robots.ox.ac.uk/~vgg/research/deep_eval/
VOC: Visual Object Classes
Why Now?
1. Data: ImageNet et al.: millions of labeled (crowdsourced) images
2. Compute: GPUs: terabytes/s memory bandwidth, teraflops compute
3. Technique: new optimization know-how, new variants on old architectures, new tools for rapid experimentation
Why Now? Data
For example:
- >14 million labeled images
- >1 million with bounding boxes
- >300,000 images with labeled and segmented objects
Why Now? GPUs
Parallel processors for parallel models:
- Inherent Parallelism: same op, different data
- Bandwidth: lots of data in and out
- Tuned Primitives: cuDNN for deep nets and cuBLAS for matrices
GPU – Graphics Processing Unit
- Thousands of parallel cores
- Fully programmable in e.g. CUDA
- Very affordable
- Shared large memory (e.g. 12 GB)
- In large server banks
- Can be rented from Amazon, Baidu, Alibaba etc.
Titan Xp GPU
Why Now? Technique
Non-convex and high-dimensional learning is okay with the right design choices, e.g. non-saturating non-linearities
Learning by Stochastic Gradient Descent (SGD) with momentum and other variants (more later!)
Examples from NVIDIA: https://developer.nvidia.com/deep-learning
DeepBreak
What is Deep Learning?
Compositional Models Learned End-to-End
Hierarchy of Representations (from concrete to abstract)
- vision: pixel, motif, part, object
- text: character, word, clause, sentence
- speech: audio, band, phone, word
[figure: network diagram with input, layer1, layer2, output, truth, loss, and parameters θ1, θ2, θ3]
Back-propagation jointly learns all of the model parameters to optimize the output for the task (more on this later!)
Shallow Learning
[slide credit K. Cho]
Separation of hand engineering and machine learning
= a conclusion reached on the basis of evidence and reasoning
Hand-Engineered Features
Features from years of vision expertise by the whole community are now surpassed by learned representations, and these transfer across tasks
[figure credit R. Fergus]
Deep Learning
[slide credit K. Cho]
End-to-End Learning Representations
The visual world is too vast and varied to fully describe by hand
Learn the representation from data: local appearance, parts and texture, objects and semantics
[figure credit H. Lee]
Hierarchical growth of complexity
End-to-End Learning Tasks
The visual world is too vast and varied to fully describe by hand
Learn the task from data
Types of Learning
Vast space of models!
[figure credit Marc’Aurelio Ranzato, CVPR 2014 tutorial]
Deep Network
Recurrent Network
Convolutional Network
Example: TensorFlow
The Neural Network Zoo: http://www.asimovinstitute.org/neural-network-zoo/
Neural Network Graphs: http://www.asimovinstitute.org/neural-network-zoo-prequel-cells-layers/
History
Is deep learning 4, 20, or 50 years old? What’s changed?
2000s: Sparse, Probabilistic, and Layer-wise models (Hinton, Bengio, Ng)
2012: DL popularized in vision by contest victory (Krizhevsky et al. 2012)
Rosenblatt’s Perceptron
Radial Basis Function
Convolutional Networks: 1989
LeNet: a layered model composed of convolution and subsampling layers followed by a holistic representation and ultimately a classifier for handwritten digits [LeNet]
Note: channel dimension goes up as spatial dimension goes down... still a common pattern today
Convolutional Networks: 2012
AlexNet: a layered model composed of convolution, subsampling, and further operations followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12 [AlexNet]
+ data
+ GPU
+ non-saturating non-linearity
+ regularization
FC 1000
FC 4096 / ReLU
FC 4096 / ReLU
Max Pool 3x3s2
Conv 3x3s1, 256 / ReLU
Conv 3x3s1, 384 / ReLU
Conv 3x3s1, 384 / ReLU
Max Pool 3x3s2
Local Response Norm
Conv 5x5s1, 256 / ReLU
Max Pool 3x3s2
Local Response Norm
Conv 11x11s4, 96 / ReLU
Convnet Design Patterns
FC-ReLU: stack at end of the net to learn output; holds the majority of the learned parameters
Conv-Pool: 1+ conv are followed by pooling to subsample; spatial size shrinks, receptive field grows
Conv-ReLU: all conv are followed by non-linearity, in this case ReLU
Convnet Computation: 2012 & 2014
AlexNet inference for a single image (3x227x227 input):
- 725M FLOPs
- 60M parameters (60,965,224 to be exact)
- 408 MB GPU memory in Caffe; <12 GB for batch size of 1,500
- <1 ms / image on Titan X with cuDNN v4 for batch size >= 256
Compare GoogLeNet (ILSVRC14 winner):
- 2x FLOPs
- 0.1x the parameters
- 14% more accurate
Architecture matters! But the computational primitives are the same.
AlexNet: params and FLOPs per layer
Layer                       params   FLOPs
FC 1000                     4M       4M
FC 4096 / ReLU              16M      16M
FC 4096 / ReLU              37M      37M
Max Pool 3x3s2
Conv 3x3s1, 256 / ReLU      442K     74M
Conv 3x3s1, 384 / ReLU      1.3M     112M
Conv 3x3s1, 384 / ReLU      884K     149M
Max Pool 3x3s2
Local Response Norm
Conv 5x5s1, 256 / ReLU      307K     223M
Max Pool 3x3s2
Local Response Norm
Conv 11x11s4, 96 / ReLU     35K      105M
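As a rough check on the 60,965,224 parameters quoted for AlexNet, the count can be rebuilt layer by layer. This is only a sketch under assumptions that are not stated in the slides: the two-group convolutions of the Caffe reference AlexNet (for the 5x5 layer and the last two 3x3 layers) and a 256x6x6 input to the first FC layer. The helper names are made up for illustration.

```python
def conv_params(filters, in_channels, kernel, groups=1):
    """Weights plus one bias per filter; each filter sees in_channels/groups channels."""
    return filters * (in_channels // groups) * kernel * kernel + filters


def fc_params(inputs, outputs):
    """Weight matrix plus one bias per output."""
    return inputs * outputs + outputs


# Assumed layer shapes (Caffe reference AlexNet); groups=2 is an assumption.
layers = {
    "conv1": conv_params(96, 3, 11),
    "conv2": conv_params(256, 96, 5, groups=2),
    "conv3": conv_params(384, 256, 3),
    "conv4": conv_params(384, 384, 3, groups=2),
    "conv5": conv_params(256, 384, 3, groups=2),
    "fc6": fc_params(256 * 6 * 6, 4096),
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(4096, 1000),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Under these assumptions the second 384-filter layer has about 664K parameters; the 1.3M listed for it matches the same layer without grouping (384*384*3*3 + 384). The FC layers hold most of the total, in line with the FC-ReLU design pattern.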
Convolutional Nets: 2014
GoogLeNet ILSVRC14 Winner: ~6.6% Top-5 error
- composition of multi-scale dimension-reduced “Inception” modules
- no FC layers and only 5 million parameters
+ depth
+ auxiliary classifiers
+ dimensionality reduction
[Szegedy15]
1x1 Convolution
- reduce channel dimension to control 1. parameter count 2. computation
- stack with non-linearity for deeper net
- found in many of the latest nets
1x1 conv with 32 filters: each filter has size 64x1x1 and does a 64-dim dot product
[figure credit A. Karpathy]
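A 1x1 convolution can be sketched as the same dot product over channels applied at every pixel. The example below is an illustration only: the 64 input channels and 32 filters follow the slide, while the 4x4 spatial size, the random values and the function name are assumptions.

```python
import random


def conv1x1(x, weights, biases):
    """x: [C][H][W] nested lists; weights: [F][C]; biases: [F]. Returns [F][H][W]."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    out = []
    for f, w in enumerate(weights):
        # Each output pixel is a C-dimensional dot product plus a bias.
        fmap = [
            [biases[f] + sum(w[c] * x[c][i][j] for c in range(channels))
             for j in range(width)]
            for i in range(height)
        ]
        out.append(fmap)
    return out


rng = random.Random(0)
C, H, W, F = 64, 4, 4, 32  # H and W are arbitrary illustration values
x = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
weights = [[rng.gauss(0.0, 0.1) for _ in range(C)] for _ in range(F)]
biases = [0.0] * F

y = conv1x1(x, weights, biases)
out_shape = (len(y), len(y[0]), len(y[0][0]))
n_params = F * C + F  # 32 filters of 64 weights each, plus 32 biases
print(out_shape, n_params)
```

The spatial size is unchanged while the channel dimension drops from 64 to 32, which is how the layer controls parameter count and computation for whatever follows it.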
Convolutional Nets: 2014
VGG16 ILSVRC14 Runner-up: ~7.3% Top-5 error
- 13 layers of 3x3 convolution interleaved with max pooling + 3 fully-connected layers
- simple architecture, good for transfer learning
- ~138 million params and more expensive to compute
+ depth
+ fine-tuning deeper and deeper
+ stacking small filters
VGG16 layers (output at top, input at bottom):
FC 1000
FC 4096 / ReLU
FC 4096 / ReLU
Max Pool 2x2s2
Conv 3x3s1, 512 / ReLU
Conv 3x3s1, 512 / ReLU
Conv 3x3s1, 512 / ReLU
Max Pool 2x2s2
Conv 3x3s1, 512 / ReLU
Conv 3x3s1, 512 / ReLU
Conv 3x3s1, 512 / ReLU
Max Pool 2x2s2
Conv 3x3s1, 256 / ReLU
Conv 3x3s1, 256 / ReLU
Conv 3x3s1, 256 / ReLU
Max Pool 2x2s2
Conv 3x3s1, 128 / ReLU
Conv 3x3s1, 128 / ReLU
Max Pool 2x2s2
Conv 3x3s1, 64 / ReLU
Conv 3x3s1, 64 / ReLU
stack 2 3x3 conv for a 5x5 receptive field
[figure credit A. Karpathy]
[Simonyan15]
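The claim that two stacked 3x3 convolutions see a 5x5 window can be checked with a small receptive-field calculation. This is a sketch; the function name and the (kernel, stride) list format are assumptions made for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs from input to output.

    Returns the side length, in input pixels, of the window one output unit sees.
    """
    size, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size


two_3x3 = receptive_field([(3, 1), (3, 1)])
three_3x3 = receptive_field([(3, 1), (3, 1), (3, 1)])

# Weights per input/output channel pair, ignoring biases.
weights_two_3x3 = 2 * 3 * 3
weights_one_5x5 = 5 * 5
print(two_3x3, three_3x3, weights_two_3x3, weights_one_5x5)
```

Two 3x3 layers cover the same 5x5 window with 18 weights per channel pair instead of 25, and they add an extra non-linearity in between.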
Convolutional Nets: 2015
ILSVRC15 and COCO15 Winner: MSRA ResNet
- classification
- detection
- segmentation
Learn residual mapping w.r.t. identity
- very deep 100+ layer nets
- skip connections across layers
- batch normalization
Kaiming He, et al. Deep Residual Learning for Image Recognition. arXiv 1512.03385. Dec. 2015. [He15]
Convolutional Nets: 2015
MSRA ResNet
(~5x the layers shown here)
ILSVRC15 Winner with 3.5% Top-5 error, and COCO15 Winner with >10% lead for detection and segmentation
- MSRA Residual Net (ResNet): 101 and 152 layer networks
- skip and sum layers to form residuals
- batch normalization (optimization trick)
[He15]
Mathematica demo MNIST
MNIST Visualizations
Why Now? Deep Learning Frameworks
Parts of a framework:
- frontend: a language for any network, any task
- network: internal representation
- layer library: fast implementations of common functions and gradients
- backend: dispatch compute for learning and inference
- tools: visualization, profiling, debugging, etc.
Deep Learning Frameworks
- Caffe: Berkeley / BVLC; C++ / CUDA, Python, MATLAB
- Torch: Facebook + NYU; Lua (C++)
- Theano: U. Montreal; Python
- TensorFlow: Google; Python (C++)
all open source
we like to brew our networks with Caffe
Not So “Neural”
These models are not how the brain works
We don’t know how the brain works!
- This isn’t a problem (except for neuroscientists)
- Be wary of neural realism hype or “it just works because it’s like the brain”
- network, not neural network; unit, not neuron
Visual Recognition Tasks
Classification
- what kind of image?
- which kind(s) of objects?
Challenges
- appearance varies by lighting, pose, context, ...
- clutter
- fine-grained categorization (horse or exact species)
❏ dog ❏ car ❏ horse ❏ bike ❏ cat ❏ bottle ❏ person
Image Classification: ILSVRC 2010-2015
[graph: top-5 error per year; graph credit K. He]
ImageNet Large Scale Visual Recognition Competition
Visual Recognition Tasks
Detection
- what objects are there?
- where are the objects?
Challenges
- localization
- multiple instances
- small objects
[figure labels: car, person, horse]
Detection: PASCAL VOC
[graph: detection accuracy; graph credit R. Girshick]
R-CNN: regions + convnets; state-of-the-art, in Caffe
VOC: Visual Object Classes
Visual Recognition Tasks
Semantic Segmentation
- what kind of thing is each pixel part of?
- what kind of stuff is each pixel?
Challenges
- tension between recognition and localization
- amount of computation
[figure labels: horse, car]
Some examples:
• NVIDIA news:
https://news.developer.nvidia.com/google-releases-tensorflow-1-0/
http://nvidianews.nvidia.com/news?q=neural+nets&year=&month=&c=&from=&to=
http://nvidianews.nvidia.com/news?q=deep+learning&year=&month=&c=&from=&to=
• Free book: http://neuralnetworksanddeeplearning.com/
• Other books: MIT: https://pdfs.semanticscholar.org/751f/aab15cbb955b07537fc38901bc96d4e70f57.pdf
• New companies: http://aidence.com/
• Papers:
Classical paper: http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html
ImageNet: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks (cited 11342 times)
CAD: https://www.nature.com/articles/srep24454
• Google TensorFlow: https://www.tensorflow.org/get_started/
• Kaggle Diabetic Retinopathy Challenge: https://www.kaggle.com/c/diabetic-retinopathy-detection (see also our BMIE project: www.retinacheck.org/zh/index.html).
• Google Diabetic Retinopathy paper: https://research.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html?m=1
Some Basics of Deep Learning
Embedded Vision Alliance Tutorial – © Shelhamer, Donahue, Long
First Dive Into Deep Learning
Deep Learning is Stacking Layers and Learning End-to-End
Stacking Layers
Deep networks are layered models made by stacking different types of transformation
A layer is a transformation: x’ = layer(x)
How do layers stack?
x2 = layer1(x1)
x3 = layer2(x2)
...
Layered Networks
Networks run layer-by-layer, composing the input-output transformation of each layer
x1 = layer1(input)
out = layer2(x1)
During learning, the error is passed back layer-by-layer to tune the transformations
[figure: input, layer1, layer2, output + error]
What kind of layers should we stack?
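Running a network layer-by-layer is just function composition. A minimal sketch, in which the two toy layers and the helper name compose are invented for illustration:

```python
def compose(*layers):
    """Build a network that feeds the output of each layer into the next."""
    def network(x):
        for layer in layers:
            x = layer(x)
        return x
    return network


def layer1(x):
    # Toy transformation: scale every element.
    return [2.0 * v for v in x]


def layer2(x):
    # Toy transformation: shift every element.
    return [v + 1.0 for v in x]


net = compose(layer1, layer2)
x1 = layer1([1.0, 2.0])
out = layer2(x1)
assert net([1.0, 2.0]) == out
print(out)
```

Real layers carry learned parameters, but the forward pass has this same shape: each layer only needs the output of the one before it.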
The simplest layers (for example)
- Matrix Multiplication
- Non-linearity
Matrix Multiplication
Multiply input x by weights W and add bias b
Learns linear transformations
[figure labels: x has K inputs; W is K x O dimensional; the result has O outputs]
Matrix Multiplication == Fully Connected Layer
Output is a function of every input, or the input and output are “fully connected”
Abbreviated as FC
[figure credit BDTI]
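A fully connected layer can be written out directly from the description above: K inputs, a K x O weight matrix and O outputs. The sketch below uses plain Python lists; the function name and the example numbers are illustrative assumptions.

```python
def fully_connected(x, W, b):
    """x: K inputs; W: K rows of O weights each; b: O biases. Returns O outputs."""
    K, O = len(W), len(b)
    if len(x) != K:
        raise ValueError("input size must match the number of weight rows")
    # Every output is a weighted sum over every input, plus its bias.
    return [sum(x[k] * W[k][o] for k in range(K)) + b[o] for o in range(O)]


x = [1.0, 2.0, 3.0]                       # K = 3 inputs
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K x O = 3 x 2
b = [0.5, -0.5]                           # O = 2 biases
y = fully_connected(x, W, b)
print(y)
```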
Linear Classification
- Suppose our data points (x) are 2D and each comes with a label y, where y = -1 or y = 1
- Learn a weight vector w = [w1; w2]
- Predict the class of a given x by sign(wᵀx) = sign(w1x1 + w2x2)
To classify we need to separate the data into red vs. blue
[figure: 2D scatter plot with axes x1 and x2; classes y = -1 and y = 1]
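The prediction rule sign(w1x1 + w2x2) is short enough to write out. In this sketch the weight vector and test points are invented, and a score of exactly zero is mapped to +1, which is an arbitrary choice.

```python
def predict(w, x):
    """Linear classifier for 2D points: returns +1 or -1 from the sign of the score."""
    score = w[0] * x[0] + w[1] * x[1]
    return 1 if score >= 0 else -1  # a score of exactly 0 goes to +1 by convention here


w = [1.0, -1.0]  # hypothetical weights: the boundary is the line x1 = x2
points = [[2.0, 1.0], [1.0, 2.0], [3.0, -1.0], [-1.0, 0.5]]
labels = [predict(w, p) for p in points]
print(labels)
```

The decision boundary is the straight line where the score is zero, so this model can only separate classes that a line can separate.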
Linearity is Not Enough
To classify we need to separate the data into red vs. blue
[figure: 2D scatter plot with axes x1 and x2; classes y = -1 and y = 1; candidate separations marked NO, NO, NO, then YES]
Non-linearity!
The Limits of Linearity
Linear steps collapse and stay linear
Linear layers alone do not meaningfully stack
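That linear steps collapse can be checked numerically: applying W1 and then W2 gives the same result as applying the single matrix W2·W1. The matrices below are arbitrary illustration values; the last lines show that inserting max(0, x) between the layers breaks the collapse.

```python
def matvec(A, x):
    """Matrix-vector product for nested lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(x):
    return [max(0.0, v) for v in x]


W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[3.0, 0.0], [1.0, 1.0]]
x = [1.0, -1.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)        # one equivalent linear layer
with_relu = matvec(W2, relu(matvec(W1, x)))  # non-linearity in between
print(stacked, collapsed, with_relu)
```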
The Shallowest Deep Net
Deep nets are made by stacking learned linear layers and simple pointwise non-linear layers
Linear, then add ReLU: Non-linear, Deep
Due to the Rectified Linear Unit (ReLU) non-linearity max(0, x), x3 cannot be computed as a linear function of x1
Non-linearity is needed to deepen the representation
Many non-linearities or activations to choose from
Non-linearity
ReLU, Sigmoid
Yet more non-linearities
ReLU, Sigmoid, TanH, Leaky ReLU, ELU
When in doubt, ReLU
Worth trying Leaky ReLU, ELU
Avoid Sigmoid
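The non-linearities named above are all simple pointwise functions. A sketch with their usual definitions follows; the default slope of 0.01 for Leaky ReLU and alpha of 1.0 for ELU are common choices assumed here, not values from the slides.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def leaky_relu(x, slope=0.01):
    # Small non-zero slope for negative inputs (slope value is an assumption).
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    # Smoothly saturates toward -alpha for negative inputs (alpha is an assumption).
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for name, fn in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh),
                 ("leaky_relu", leaky_relu), ("elu", elu)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```

Sigmoid and TanH flatten out for large inputs, which is the saturation that ReLU and its variants avoid on the positive side.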
Define Your First Net
Let’s go non-linear on a classification problem
Try It Out: Deep Learning in your browser demos
Designing for Sight
Convolutional Networks or convnets are nets for vision
- functional fit for the visual world by compositionality and feature sharing
- learned end-to-end to handle visual detail for more accuracy and less engineering
Convnets are the dominant architectures for visual tasks
Visual Structure
Local Processing: pixels close together go together; receptive fields capture local detail
Across Space: the same what, no matter where; recognize the same input in different places
Can rely on spatial coherence
[figure captions: “This is not a cat”; “All of these are cats”]
Vision Layers
Convolution/Filtering: linear layer for vision [figure: Learned Filter]
Pooling: spatial summarization [figure: max pool 2x2 with stride 2]
[figure credit A. Karpathy, cs231n course]
Convolution: A Linear Layer for Vision
Images have translation invariant semantics: these are all equally squirrels
So use the same weights between nodes with the same spatial relationship
This is convolution (or correlation; the two terms are used interchangeably in vision)
Convolution means fewer parameters for more efficient learning
A Filter
input is 3x32x32 data: a color image (3 RGB channels) and square (32x32)
A filter is a spatially local and cross-channel template
Convnet filters are learned
filter is 3x5x5 weights
- spatially local: kernel size is 5x5
- cross-channel: connected across all input channels
total parameters: 3*5^2 = 75 filter weights + 1 bias
One filter evaluation is a dot product between the input window and weights + bias
[figure adapted from A. Karpathy]
Convolution
input: 3x32x32; filter: 3x5x5; bias: 1; output: 1x28x28 feature map
Convolving the filter with the input gives a feature map.
Filter parameters: 3*5^2 = 75
FC parameters: 3*32^2 = 3,072
[figure adapted from A. Karpathy]
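The step from a 3x32x32 input and one 3x5x5 filter to a 1x28x28 feature map can be sketched with explicit loops. The assumptions here are stride 1 and no padding (which is what gives 32 - 5 + 1 = 28) and an all-ones input and filter, chosen so the result is easy to check. As noted on the slides, this is the correlation form with no kernel flip.

```python
def convolve(x, filt, bias):
    """Slide one filter over the input with stride 1 and no padding.

    x: [C][H][W] nested lists; filt: [C][k][k] weights; bias: float.
    Returns a single feature map of size [H-k+1][W-k+1].
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(filt[0])
    out = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            # One filter evaluation: dot product of window and weights, plus bias.
            total = bias
            for c in range(channels):
                for di in range(k):
                    for dj in range(k):
                        total += filt[c][di][dj] * x[c][i + di][j + dj]
            row.append(total)
        out.append(row)
    return out


x = [[[1.0] * 32 for _ in range(32)] for _ in range(3)]   # 3x32x32 input of ones
filt = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]  # 3x5x5 filter of ones
feature_map = convolve(x, filt, bias=1.0)

shape = (len(feature_map), len(feature_map[0]))
n_filter_weights = 3 * 5 * 5
print(shape, n_filter_weights, feature_map[0][0])
```

The same 75 weights and 1 bias are reused at all 28x28 positions, which is where the saving over a fully connected layer comes from.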
Convolution Layer (conv)
input: 3x32x32; filters: 6x3x5x5; bias: 6; output: 6x28x28 feature maps
Convolution layers have multiple filters for more modeling capacity
Learned Filters from AlexNet conv1: conv1 has 96 filters for edge, color, and frequency; richer than 3D RGB
[figure adapted from A. Karpathy]
Pooling (pool)
Spatial summary by computing operation over window with stride
- max pooling or average pooling [figure: 2x2 pooling, stride 2]
- overlapping or non-overlapping
- separate across channels
- Current fashion: 3x3 max pooling with stride 2
[figure credit BDTI]
[figure credit A. Karpathy]
Pooling
- reduce resolution
- increase receptive field size for later layers
- save computation
- add invariance to translation/noise within pooling window
Example: 64x224x224 becomes 64x112x112
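Max pooling with a 2x2 window and stride 2 can be sketched on a single channel as follows. The function name and the 4x4 example values are chosen for illustration; in a real layer the same operation runs separately on every channel.

```python
def max_pool(fmap, size=2, stride=2):
    """Max pooling over one feature map given as a list of rows."""
    height, width = len(fmap), len(fmap[0])
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    return [
        [max(fmap[i * stride + di][j * stride + dj]
             for di in range(size) for dj in range(size))
         for j in range(out_w)]
        for i in range(out_h)
    ]


fmap = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool(fmap)
print(pooled)
```

Each spatial side is halved, which is the same reduction as going from 224x224 to 112x112.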
Fully Connected Layers (FC)
Learn a global feature from the full feature maps
Often found at the end of convnets
Note: this could likewise be done by a large convolution kernel
[figure: feature maps 2x2x2, unrolled to input 1 x 8; weights 8 x 3; bias 1 x 3; outputs (or units) 1 x 3]
Normalization Layers (Deprecated)
Local response normalization was popular for a time but is now deprecated; more recent networks do not include these layers
[figure credit BDTI]
Layer Review
- layers compute differentiable transformations
- types of layers: conv, ReLU, pool, FC
- parameters (conv, FC) or not (pool, ReLU)
- arguments like kernel size, stride, etc. (conv, pool)
Convnet Architecture
[figure: Input Image, stacked layers, Scores]
Legend: “Conv 3x3s1, 10 / ReLU” means Type: Conv, Kernel Size: 3x3, Stride: 1, Channels: 10, Activation: ReLU
FC 10
Conv 3x3s1, 10 / ReLU
Max Pool 3x3s1
Conv 3x3s1, 10 / ReLU
Conv 3x3s1, 10 / ReLU
Conv 3x3s1, 10 / ReLU
Max Pool 3x3s1
Conv 3x3s1, 10 / ReLU
Max Pool 3x3s1
Conv 3x3s1, 10 / ReLU
Stack convolution, non-linearity, and pooling until global FC layer classifier
[figure credit A. Karpathy]
Data augmentation: making much more data
transform the training data, without changing its truth
- horizontal flips: cat still a cat
- random crops/scales: views of cat
- relighting: darker cat
… and anything else you can come up with!
[figure adapted from A. Karpathy]
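The three augmentations can be sketched on a tiny grayscale image stored as a list of rows. Function names, the 0 to 255 pixel range and the example values are assumptions for illustration; the label of the image is left untouched by every transform.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def random_crop(image, size, rng):
    """Cut a random size x size window out of the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)   # randint includes both endpoints
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def relight(image, factor):
    """Scale brightness, clamping to the assumed 0..255 pixel range."""
    return [[max(0, min(255, int(p * factor))) for p in row] for row in image]


rng = random.Random(0)
image = [[10 * r + c for c in range(4)] for r in range(4)]  # toy 4x4 image
label = "cat"

augmented = [
    (horizontal_flip(image), label),
    (random_crop(image, 3, rng), label),
    (relight(image, 0.5), label),
]
for img, lab in augmented:
    print(lab, img)
```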
See a Net Learn to See
Let’s watch a convnet as it learns how to recognize objects in images
MNIST demo: Try It Out
CIFAR-10 demo: Try It Out
Internal functionality
Supervised Learning
Given labeled data: (x1, y1), (x2, y2), …, (xN, yN), where xn is the data and yn is the label
Goal: find a function f such that yn = f(xn) for all n, “as well as possible”
Supervised Loss
What does “as well as possible” mean?
Pick a loss function ℓ(y, ŷ): how wrong is it to predict ŷ when the true label is y?
Minimize the total loss over all data: L(f) = Σn ℓ(yn, f(xn))
E.g. ℓ(y, ŷ) = ‖y - ŷ‖^2: “Euclidean Loss” or everyday linear regression
Parametric Learning
How do we find the label-prediction function f?
Parametric answer: pick it from a family determined by a set of parameters θ: f(x) = f(x; θ)
E.g. f(x; θ) = θ x, “linear prediction” (θ is a matrix, x is a vector)
For us: f is a network, θ is a set of weights
Parametric Supervised Learning
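Altogether: our goal is to find θ in order to minimize Σn ℓ(yn, f(xn; θ))
where the sum runs over the data, ℓ is the loss, yn is the true label, f is the model (network), θ are the parameters (weights), and f(xn; θ) is the predicted label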
Underfitting and Overfitting
underfitting: not enough parameters to model the data
overfitting: enough parameters to memorize the training set without generalizing
[figure axis: from fewer parameters to more parameters]
[figure credit A. Karpathy]
Regularization
How can we prevent overfitting without reducing the number of parameters?
Add a regularization penalty to our loss: “complicated” solutions are worse
[figure credit A. Karpathy]
Regularization: Weight Decay and Dropout
Weight Decay: minimize L(θ) + λ‖θ‖^2 to pull weights toward zero
- λ (scalar) is an optimization setting… pick it empirically
- aka “L2 regularization”
Dropout: during training, randomly set a fraction p of activations to zero
- p is an optimization setting (often 0.5)
- forces model to be robust to noise
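Both regularizers fit in a few lines. The sketch below adds one assumption the slides do not mention: surviving activations are rescaled by 1/(1 - p) during training ("inverted dropout"), so that nothing has to change at test time. Function names are illustrative.

```python
import random


def l2_penalty(theta, lam):
    """Weight decay term: lambda times the squared L2 norm of the weights."""
    return lam * sum(t * t for t in theta)


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training.

    Survivors are scaled by 1/(1-p) (inverted dropout, an assumption here)
    so the expected activation is unchanged. Requires p < 1.
    """
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
theta = [3.0, 4.0]
penalty = l2_penalty(theta, lam=0.1)          # 0.1 * (9 + 16)

activations = [1.0] * 1000
dropped = dropout(activations, p=0.5, rng=rng)
fraction_zero = dropped.count(0.0) / len(dropped)
print(penalty, fraction_zero)
```

The penalty is added to the data loss L(θ) before taking gradients; dropout is switched off (training=False) when the model is evaluated.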
Gradient Descent: Intuition
Want to minimize “loss” function L(x; θ)
Move in the direction of the negative gradient, from old θ to new θ
[figure: L(x; θ) plotted against the θ axis]
θ (vector): parameter to update
x (vector): input data (fixed on this slide)
The gradient tells you, for each element of the network parameters, how the loss changes in response to a change in that parameter.
Stochastic Gradient Descent (SGD)
Want to minimize “loss” function L(x; θ)
1. Pick input datum x
2. Compute parameter gradient
3. Multiply by learning rate
4. Update parameters θ
(The alternative is to average the gradient over all available data, “batch gradient descent”. That’s too slow for big data!)
Why “Stochastic”?
The gradient depends on the choice of input datum x
Choose x randomly (or just cycle through all data in a fixed order)
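The four SGD steps can be sketched on the smallest possible model: a scalar linear prediction f(x; θ) = θx with Euclidean loss. The data, learning rate and step count are invented for illustration, and the gradient is written out by hand for this particular loss.

```python
import random


def gradient(theta, x, y):
    """d/dtheta of (y - theta * x)^2 for a single datum."""
    return -2.0 * x * (y - theta * x)


def sgd(data, theta=0.0, learning_rate=0.05, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)          # 1. pick an input datum at random
        g = gradient(theta, x, y)        # 2. compute the parameter gradient
        step = learning_rate * g         # 3. multiply by the learning rate
        theta = theta - step             # 4. update: move against the gradient
    return theta


# Toy data generated by y = 2x, so the best theta is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta = sgd(data)
print(theta)
```

Each update uses one datum only, so a step costs the same no matter how large the dataset is. Batch gradient descent would average the gradient over all of data before every update.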
SGD with Weight Decay and Momentum
Terms of the update: weight decay (regularization); momentum (p is a number less than 1)
There are many other variants: Adam, RMSprop, AdaDelta, AdaGrad, Nesterov, ...
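The update equations themselves are not preserved in this transcript, so the following is one common formulation rather than necessarily the one on the slides. A velocity accumulates past updates with momentum p, and weight decay adds λθ to the gradient (the gradient of (λ/2)‖θ‖^2; some conventions use 2λθ). All names and default values are assumptions.

```python
def sgd_momentum_step(theta, velocity, grad,
                      learning_rate=0.1, weight_decay=0.0, p=0.9):
    """One update of SGD with momentum and weight decay (assumed formulation).

    velocity <- p * velocity - learning_rate * (grad + weight_decay * theta)
    theta    <- theta + velocity
    """
    new_velocity = [
        p * v - learning_rate * (g + weight_decay * t)
        for t, v, g in zip(theta, velocity, grad)
    ]
    new_theta = [t + v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


# Toy problem: minimize L(theta) = theta^2, whose gradient is 2 * theta.
theta, velocity = [1.0], [0.0]
for _ in range(300):
    grad = [2.0 * t for t in theta]
    theta, velocity = sgd_momentum_step(theta, velocity, grad)
final_theta = theta[0]
print(final_theta)
```

With p = 0 this reduces to plain SGD; a p close to 1 lets consistent gradient directions build up speed.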
Layer Gradients
Matrix Multiply Gradients
ReLU
Sigmoid
Back-propagation: The Chain Rule
[figure: layer1 with parameters θ feeding the loss (ℓ)]
A net is a composition of layer functions
The gradient of a net is the product of layer gradients
Back-propagation in a Bigger Net
[figure: network diagram with input x, layer1, layer2, output ŷ, truth y, loss, and parameters θ1, θ2, θ3; forward pass and backward pass]
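The chain rule behind back-propagation can be traced by hand on a tiny two-layer net with scalar weights: a linear layer, a ReLU, a second linear layer and a Euclidean loss. Everything here (the architecture, the numbers, the function names) is an illustrative assumption; the last lines compare the result with a finite-difference estimate.

```python
def forward(x, y, theta1, theta2):
    """Forward pass: returns every intermediate value and the loss."""
    a = theta1 * x            # layer1: linear
    h = max(0.0, a)           # ReLU
    y_hat = theta2 * h        # layer2: linear
    loss = (y - y_hat) ** 2   # Euclidean loss
    return a, h, y_hat, loss


def backward(x, y, theta1, theta2):
    """Backward pass: multiply layer gradients from the loss back to each parameter."""
    a, h, y_hat, _ = forward(x, y, theta1, theta2)
    d_yhat = -2.0 * (y - y_hat)              # d loss / d y_hat
    d_theta2 = d_yhat * h                    # through layer2 to its parameter
    d_h = d_yhat * theta2                    # through layer2 to its input
    d_a = d_h * (1.0 if a > 0 else 0.0)      # through the ReLU
    d_theta1 = d_a * x                       # through layer1 to its parameter
    return d_theta1, d_theta2


x, y, theta1, theta2 = 1.5, 2.0, 0.8, 0.5
d_theta1, d_theta2 = backward(x, y, theta1, theta2)

# Finite-difference check of the same gradients.
eps = 1e-6
num_theta1 = (forward(x, y, theta1 + eps, theta2)[3]
              - forward(x, y, theta1 - eps, theta2)[3]) / (2 * eps)
num_theta2 = (forward(x, y, theta1, theta2 + eps)[3]
              - forward(x, y, theta1, theta2 - eps)[3]) / (2 * eps)
print(d_theta1, num_theta1, d_theta2, num_theta2)
```

The backward pass reuses the values computed in the forward pass, which is why frameworks store intermediate activations during training.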