Classification and Semantic Segmentation
Most slides are from Fei-Fei Li, Justin Johnson, Andrej Karpathy, Serena Yeung, Jia-Bin Huang, Bharath Hariharan, Jeremy Jordan
Supervised Machine Learning
▪ Training data $x_1, \dots, x_N$ with true labels (targets) $y_1, \dots, y_N$
▪ Choose hypothesis class $h(x, W)$
▪ Define loss function for $x$ when the true label is $y$
  ▪ e.g. $L(h(x, W), y) = (y - h(x, W))^2$
▪ Training stage
  ▪ minimize total loss on training set using gradient descent
    $\min_W \sum_{i=1}^{N} L(h(x_i, W), y_i)$
▪ Test stage
  ▪ compute accuracy on test data, unseen during training
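The training and test stages can be sketched in NumPy. This is only an illustration under stated assumptions: a linear hypothesis h(x, W) = x·W with scalar outputs, the squared loss from the slide, and synthetic noiseless data. The names `total_loss` and `train_linear`, the learning rate, and the step count are hypothetical choices, not part of the lecture.

```python
import numpy as np

def total_loss(W, X, y):
    """Sum over the data set of (y_i - h(x_i, W))^2 with h(x, W) = x @ W."""
    return float(np.sum((y - X @ W) ** 2))

def train_linear(X, y, lr=0.1, steps=500):
    """Plain gradient descent on the total squared loss (assumed settings)."""
    W = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = -2.0 * X.T @ (y - X @ W)   # gradient of the total loss w.r.t. W
        W -= lr * grad / len(X)           # scale by 1/N so the step size is stable
    return W

# Synthetic training set (assumption: targets come from a known linear rule).
rng = np.random.default_rng(0)
W_true = np.array([1.5, -2.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = X_train @ W_true

W_fit = train_linear(X_train, y_train)

# Test stage: evaluate on data that was not used during training.
X_test = rng.normal(size=(50, 3))
test_error = total_loss(W_fit, X_test, X_test @ W_true) / len(X_test)
```

For a classifier the test stage would report accuracy; the mean squared error plays the same role in this regression toy.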
Single Layer Neural Network on Images
▪ 2 classes (cat vs dog)
▪ $h(x, W) = \sigma(Wx)$
  ▪ range in (0,1)
[Figure: W followed by sigmoid, output p(dog)]
▪ Also called Linear Classifier
▪ Works well only for linearly separable classes
  ▪ not expressive enough
Single Layer NN for multiple classes
▪ Several classes (dog, cat, horse)
[Figure: softmax outputs p(horse), p(dog), p(cat)]
▪ One-hot encoding for labels $y$: horse = (1, 0, 0), dog = (0, 1, 0), cat = (0, 0, 1)
▪ Cross-entropy loss: $-\sum_{classes} y_{true} \log(y_{pred})$
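A small NumPy sketch of softmax followed by the cross-entropy loss for one training example. The score values are made up for illustration, and `softmax` and `cross_entropy` are hypothetical helper names; the class order (horse, dog, cat) follows the one-hot encoding on the slide.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(y_true, y_pred):
    """-sum over classes of y_true * log(y_pred)."""
    return float(-np.sum(y_true * np.log(y_pred)))

# Class order: horse, dog, cat.
y_dog = np.array([0.0, 1.0, 0.0])      # one-hot label for "dog"
scores = np.array([1.0, 3.0, 0.5])     # made-up outputs of the linear layer W x
probs = softmax(scores)
loss = cross_entropy(y_dog, probs)
```

Because the label is one-hot, the loss reduces to the negative log of the probability assigned to the true class.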
Multilayer Neural Network on images
▪ cat vs dog
[Figure: 256x256 input → Linear + ReLU → Linear + ReLU → Linear + sigmoid → p(dog); layer sizes shown in the figure: 2048, 1024, 32]
▪ Layers are called fully-connected
▪ Expressive enough, but huge number of parameters
  ▪ expensive, requires lots of data to train well
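To make "huge number of parameters" concrete, here is a short count for a single fully-connected layer. It assumes the first layer maps the 65,536 pixels of a 256x256 single-channel image to 2048 units; that pairing is read off the figure labels and is an assumption, and `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

n_pixels = 256 * 256                         # every pixel is one input
first_layer = dense_params(n_pixels, 2048)   # assumed width of the first layer
print(f"{n_pixels:,} inputs -> 2048 units: {first_layer:,} parameters")
```

Under these assumptions the first layer alone has more than 134 million parameters.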
Reducing Number of Parameters
[Figure label: 65,536 (the number of pixels in a 256x256 image)]
Idea 1: local connectivity
Pixels only related to nearby pixels
Idea 2: Translation invariance
Pixels only related to nearby pixels
Weights should not depend on the location of the neighborhood
Linear function + translation invariance = convolution
▪ Local connectivity determines kernel size
5.4 0.1 3.6
1.8 2.3 4.5
1.1 3.4 7.2
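A minimal sketch of a 2D convolution with the 3x3 kernel shown above. Assumptions: no padding, stride 1, a made-up 6x6 input, and the sliding-window form without kernel flipping (cross-correlation), which is what CNN libraries usually compute. `conv2d` is a hypothetical helper name.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local connectivity: only a kh x kw neighborhood contributes.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[5.4, 0.1, 3.6],
                   [1.8, 2.3, 4.5],
                   [1.1, 3.4, 7.2]])
image = np.arange(36, dtype=float).reshape(6, 6)   # made-up input
features = conv2d(image, kernel)

# The same weights are used at every location, so shifting the input by one
# column shifts the output by one column (away from the wrapped border).
shifted = np.roll(image, shift=1, axis=1)
features_shifted = conv2d(shifted, kernel)
```

The layer has only 9 weights regardless of the image size, which is where the parameter savings come from.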
Convolution over multiple channels
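[Figure: each input channel is convolved (*) with its own kernel slice and the results are summed (+) to give one output value]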
CNN: Convolutional layer
[Figure: Convolution maps a w × h × c input to a w × h × c’ output, using c’ filters that each span all c input channels]
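A sketch of a convolutional layer over multiple channels, mapping c input channels to c' output channels. Assumptions: odd kernel size, zero padding so height and width are preserved, no bias, and random made-up inputs. `conv_layer` is a hypothetical name, and the array layout (height, width, channels) is a choice made here.

```python
import numpy as np

def conv_layer(x, filters):
    """x: (h, w, c) input; filters: (k, k, c, c_out). Returns (h, w, c_out)."""
    k = filters.shape[0]
    pad = k // 2                                   # assumes odd k
    h, w, _ = x.shape
    c_out = filters.shape[3]
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, c_out))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + k, j:j + k, :]        # (k, k, c) neighborhood
            # Multiply per channel, then sum over the window and all channels,
            # once for each of the c_out filters.
            out[i, j, :] = np.tensordot(patch, filters, axes=3)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 10, 3))            # h=8, w=10, c=3
filters = rng.normal(size=(3, 3, 3, 5))    # 3x3 kernels, c=3, c'=5
y = conv_layer(x, filters)
```

The number of weights is k·k·c·c', independent of h and w.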
CNN: Convolution Subsampling Convolution
▪ Subsampling can be implemented by applying convolution in strides
  ▪ every 2 (or 3, or 4, …) pixels
▪ Number of features is usually increased after subsampling, to maintain expressiveness
▪ Convolution in earlier steps detects more local patterns, which are less resilient to distortion
▪ Convolution in later steps detects more global patterns, which are more resilient to distortion
▪ Subsampling allows capture of larger, more invariant patterns
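Subsampling by strided convolution can be sketched as follows. Assumptions: no padding, a single channel, and a made-up 9x9 input with an averaging kernel. `strided_conv2d` is a hypothetical helper.

```python
import numpy as np

def strided_conv2d(image, kernel, stride=2):
    """Evaluate the convolution only at every `stride`-th position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(81, dtype=float).reshape(9, 9)   # made-up input
kernel = np.ones((3, 3)) / 9.0                     # made-up averaging kernel
sub = strided_conv2d(image, kernel, stride=2)      # 4 x 4 output
full = strided_conv2d(image, kernel, stride=1)     # 7 x 7 output
```

A stride-2 convolution gives the same values as the dense convolution sampled at every second row and column, at roughly a quarter of the cost.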
Invariance to distortions: Pooling
▪ Each window reduced to one value
  ▪ with max or average
…
4 7 6 9 3 11
8 3 21 4 0 0
1 2 1 3 5 6
7 9 4 3 1 8
5 2 1 5 5 0
0 1 6 4 5 6
Invariance to distortions: Max Pooling
Output of max pooling over 2x2 windows of the matrix above:
8 21 11
9 4 8
5 6 6
Invariance to distortions: Average Pooling
Output of average pooling over 2x2 windows of the same matrix:
5.5 10 3.5
4.75 2.75 5
2 4 4
CNN: Pooling Layer
▪ Each pooling layer takes a collection of feature maps as input and produces a collection of feature maps as output
▪ Output feature maps are usually smaller in height and width
▪ Parameters: None
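The max and average pooling results from the slides can be reproduced with a few lines of NumPy. This sketch assumes non-overlapping 2x2 windows (stride 2) and an input whose height and width are divisible by the window size; `pool2d` is a hypothetical helper.

```python
import numpy as np

def pool2d(x, size=2, reduce=np.max):
    """Reduce each non-overlapping size x size window to one value."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size)
    return reduce(blocks, axis=(1, 3))

x = np.array([[4, 7, 6, 9, 3, 11],
              [8, 3, 21, 4, 0, 0],
              [1, 2, 1, 3, 5, 6],
              [7, 9, 4, 3, 1, 8],
              [5, 2, 1, 5, 5, 0],
              [0, 1, 6, 4, 5, 6]])

max_pooled = pool2d(x, 2, np.max)
avg_pooled = pool2d(x, 2, np.mean)
```

Note that `pool2d` has no trainable parameters, matching "Parameters: None".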
Convolutional networks
[Figure: image → convolutional and pooling layers → fully connected layers → "Horse"]
First Successful Classification CNN
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86.11 (1998): 2278-2324.
AlexNet - 2012
▪ Won ImageNet competition by a large margin
VGGNet - 2014
▪ First simple widely used net
▪ Smaller filters and deeper network
ResNet - 2015
▪ Many more layers
▪ Special skip connections for better training
Semantic Segmentation
[Figure: image labeled per pixel with the classes person, grass, trees, motorbike, road]
Semantic Segmentation: One-hot encoding
Semantic Segmentation: Cross-Entropy Loss Function
$-\sum_{classes} y_{true} \log(y_{pred})$
▪ Pixelwise loss
▪ Added over all pixels
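A sketch of the pixelwise cross-entropy loss, summed over all pixels. Assumptions: labels are given as an integer map and converted to one-hot, predictions are per-pixel probabilities, and a small `eps` guards against log(0). `one_hot` and `pixelwise_cross_entropy` are hypothetical names.

```python
import numpy as np

def one_hot(labels, n_classes):
    """(h, w) integer label map -> (h, w, n_classes) one-hot encoding."""
    return np.eye(n_classes)[labels]

def pixelwise_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy at each pixel, added over all pixels."""
    per_pixel = -np.sum(y_true * np.log(y_pred + eps), axis=-1)   # (h, w)
    return float(np.sum(per_pixel))

labels = np.array([[0, 1],
                   [2, 1]])                 # made-up 2x2 label map, 3 classes
y_true = one_hot(labels, 3)
uniform = np.full((2, 2, 3), 1.0 / 3.0)     # a prediction that knows nothing

loss_uniform = pixelwise_cross_entropy(y_true, uniform)
loss_perfect = pixelwise_cross_entropy(y_true, y_true)
```

With a uniform prediction every pixel contributes log(3), so the total here is 4·log(3); a perfect prediction gives a loss of (nearly) zero.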
Semantic Segmentation with CNNs
[Figure: input image of size h × w × 3]
[Figure: after convolution and subsampling, a feature map of size h/4 × w/4 × d]
▪ The $d$ values at each location are good features for classifying that ‘pixel’, e.g. the top left one
▪ Convolve with $c$ filters of size 1x1 to get an h/4 × w/4 × c map
▪ Finally pass the $c$ features of each pixel through softmax
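The 1x1 convolution followed by a per-pixel softmax can be sketched as below. Assumptions: a (height, width, channels) layout, random made-up features and weights, and d = 16, c = 5 chosen only for illustration. `conv1x1` and `softmax_per_pixel` are hypothetical names.

```python
import numpy as np

def conv1x1(features, weights, bias):
    """features: (h, w, d); weights: (d, c); bias: (c,). Returns (h, w, c) scores.

    A 1x1 convolution applies the same linear classifier at every location."""
    return features @ weights + bias

def softmax_per_pixel(scores):
    """Softmax over the class axis, independently at each pixel."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8, 16))    # stands in for an h/4 x w/4 x d map
weights = rng.normal(size=(16, 5))        # c = 5 classes
bias = np.zeros(5)

probs = softmax_per_pixel(conv1x1(features, weights, bias))
```

Each location ends up with its own probability distribution over the c classes.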
Semantic Segmentation with CNNs
▪ Pass image through convolution and subsampling layers
▪ Final convolution with #classes outputs
▪ Get scores for subsampled image
▪ Upsample back to original size
[Figure: example segmentation with the classes person and bicycle]
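The last step, upsampling the coarse scores back to the input size, can be illustrated with nearest-neighbor upsampling. This is an assumed choice made for simplicity; the slides do not say which upsampling method is used at this point. `upsample_nearest` is a hypothetical helper and the scores are made up.

```python
import numpy as np

def upsample_nearest(scores, factor):
    """Repeat every coarse location factor x factor times. scores: (h, w, c)."""
    return np.repeat(np.repeat(scores, factor, axis=0), factor, axis=1)

# Made-up class scores on a 2x2 grid for 2 classes (0 = person, 1 = bicycle).
coarse = np.array([[[2.0, 0.5], [0.1, 1.5]],
                   [[0.3, 0.9], [1.2, 0.2]]])

full = upsample_nearest(coarse, 4)       # back to an 8 x 8 grid
labels = full.argmax(axis=-1)            # predicted class at each pixel
```

Every 4x4 block of the output shares one prediction, which is why fine details are lost.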
The Resolution Issue
▪ Problem: Need fine details!
▪ Shallower network/earlier layers?
  ▪ not very semantic
Visualizations from : M. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In ECCV 2014
▪ Remove subsampling?
  ▪ Need many features per pixel
    ▪ expensive without subsampling
  ▪ Need large field of view for final features
    ▪ very deep network, expensive without subsampling
Solution 1: Image pyramids
Learning Hierarchical Features for Scene Labeling. Clement Farabet, Camille Couprie, Laurent Najman, Yann LeCun. In TPAMI, 2013
[Figure: image pyramid; axis labels: Higher resolution, Less context]
▪ Does not scale well to deep architectures
Solution 2: CNN + Conditional Random Fields
▪ Combine with CRF as post-processing
  ▪ “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs”, Chen et al., ICLR 2015
[Figure: input → CNN → class probabilities → Full CRF → final output]
▪ Combine with CRF in end-to-end trainable system
  ▪ mean field inference implemented as RNN
  ▪ Zheng et al., “Conditional Random Fields as Recurrent Neural Networks”, ICCV 2015
Solution 3: Learn to Upsample
▪ Encoder/decoder structure
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al., “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Badrinarayanan et al., “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”, TPAMI 2017
Methods for Upsampling
Decoding using only Upsampling
▪ Struggles to produce fine-grained segmentations
From Long et al.: “Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where... Combining fine layers and coarse layers lets the model make local predictions that respect global structure.”
Solution 4: Skip connections
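[Figure: coarse predictions are upsampled and combined with earlier, finer layers through skip connections]
Fully Convolutional Networks for Semantic Segmentation. Evan Shelhamer, Jon Long, Trevor Darrell. In CVPR 2015
[Figure: segmentation results without skip connections vs. with skip connections]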
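One way to picture a skip connection is to upsample the coarse class scores and add the scores predicted from a finer, earlier layer. The sketch below is a simplification under stated assumptions: nearest-neighbor 2x upsampling stands in for whatever upsampling the network actually uses, fusion is a plain sum, and the score maps are random. `upsample2x` and `fuse_with_skip` are hypothetical names.

```python
import numpy as np

def upsample2x(scores):
    """Nearest-neighbor 2x upsampling of an (h, w, c) score map."""
    return np.repeat(np.repeat(scores, 2, axis=0), 2, axis=1)

def fuse_with_skip(coarse_scores, fine_scores):
    """Combine coarse (semantic) scores with fine (well-localized) scores."""
    return upsample2x(coarse_scores) + fine_scores

rng = np.random.default_rng(0)
coarse = rng.normal(size=(2, 2, 3))    # scores from a deep, subsampled layer
fine = rng.normal(size=(4, 4, 3))      # scores from an earlier layer via the skip

fused = fuse_with_skip(coarse, fine)
```

The fused map has the resolution of the finer layer while still carrying the coarse layer's global evidence.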
Solution 5: Dilation
▪ Need subsampling to allow convolutional layers to capture large regions with small filters
  ▪ can we do this without subsampling?
▪ Instead of subsampling by factor of 2: dilate by factor of 2
  ▪ allows for exponential increase in field of view without decrease of spatial dimensions
▪ Not a panacea: without subsampling, feature maps are much larger
  ▪ memory issues
Fully Convolutional Networks for Semantic Segmentation. Evan Shelhamer, Jon Long, Trevor Darrell. In CVPR 2015
Multi-Scale Context Aggregation by Dilated Convolutions. Yu et al., ICLR 2016
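The growth of the field of view under dilation can be checked with a 1D toy example. Assumptions: 3-tap filters with all-ones weights, zero padding so the length is unchanged, and dilations 1, 2, 4 in successive layers. `dilated_conv1d` is a hypothetical helper; the field of view is measured as the spread of a single impulse.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """1D convolution whose taps are `dilation` samples apart (odd kernel length)."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)                       # zero padding keeps the length
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * xp[i + j * dilation]
    return out

impulse = np.zeros(31)
impulse[15] = 1.0
kernel = np.ones(3)

# Three layers with dilations 1, 2, 4.
y = impulse
for d in (1, 2, 4):
    y = dilated_conv1d(y, kernel, d)
field_dilated = int(np.count_nonzero(y))

# Three ordinary layers for comparison.
z = impulse
for _ in range(3):
    z = dilated_conv1d(z, kernel, 1)
field_plain = int(np.count_nonzero(z))
```

In this toy setup three dilated layers cover 15 samples while three ordinary layers cover 7, and the output length stays at 31 in both cases.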
Putting it all together
[Figure: bar chart of mean IoU on PASCAL VOC (axis ticks 55 to 70) for Basic, +Skip, +Dilation, +CRF]
▪ Best non-CNN approach: ~46.4%
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan Yuille. In ICLR, 2015.
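For reference, mean IoU (intersection over union, averaged over classes) can be computed as in the sketch below. This is a per-image simplification on made-up label maps; benchmark evaluations typically accumulate counts over the whole data set, and the handling of absent classes here is an assumption. `mean_iou` is a hypothetical helper.

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Average over classes of |pred ∩ target| / |pred ∪ target|."""
    ious = []
    for c in range(n_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue                      # class absent in both maps: skip it
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

target = np.array([[0, 0],
                   [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])

score = mean_iou(pred, target, n_classes=2)   # class 0: 1/2, class 1: 2/3
```

Unlike pixel accuracy, this metric weights every class equally regardless of how many pixels it covers.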
More Architectures: U-net
▪ expanding the decoder with symmetry
“U-Net: Convolutional Networks for Biomedical Image Segmentation”, Ronneberger et al., MICCAI 2015
PSPNet
▪ Pyramid Pooling Module
  ▪ new module to capture global scene context
▪ 82.6 mean IoU on PASCAL VOC
“Pyramid Scene Parsing Network”, Zhao et al., CVPR 2017
ICNet for Real-Time Semantic Segmentation
▪ Apply heavier CNN to small resolution
“ICNet for Real-Time Semantic Segmentation on High-Resolution Images”, Zhao et al., ECCV 2018