COS 429: Computer Vision
Lecture 23: Deep Learning: Segmentation
COS429 : 12.12.16 : Andras Ferencz
Thanks: most of these slides shamelessly adapted from Stanford CS231n: Convolutional Neural Networks for Visual Recognition
Fei-Fei Li, Andrej Karpathy, Justin Johnson http://cs231n.stanford.edu/
Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016
Computer Vision Tasks
Single object: Classification (CAT), Classification + Localization (CAT)
Multiple objects: Object Detection (CAT, DOG, DUCK), Instance Segmentation (CAT, DOG, DUCK)
Simple Recipe for Classification + Localization
Step 2: Attach new fully-connected “regression head” to the network
Image -> Convolution and Pooling -> Final conv feature map
“Classification head”: Fully-connected layers -> Class scores
“Regression head”: Fully-connected layers -> Box coordinates
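The two-head layout can be sketched in a few lines of NumPy. This is only an illustration, not the slides' implementation: the feature size, the class count, the single linear layer per head, and the squared-error box loss are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
FEATURE_DIM = 512   # length of the flattened final conv feature map
NUM_CLASSES = 3     # e.g. cat, dog, duck

# One linear layer per head stands in for each stack of fully-connected layers.
W_cls = rng.normal(scale=0.01, size=(FEATURE_DIM, NUM_CLASSES))
W_box = rng.normal(scale=0.01, size=(FEATURE_DIM, 4))

def forward(features):
    """Map shared conv features to (class scores, box coordinates)."""
    class_scores = features @ W_cls   # classification head
    box_coords = features @ W_box     # regression head: 4 box numbers
    return class_scores, box_coords

def l2_box_loss(pred_box, true_box):
    """Squared-error loss; an assumed choice for box regression."""
    return float(np.sum((pred_box - true_box) ** 2))

features = rng.normal(size=FEATURE_DIM)  # stand-in for the conv trunk output
scores, box = forward(features)
loss = l2_box_loss(box, np.array([0.2, 0.3, 0.5, 0.4]))
```

Both heads read the same feature vector, so the convolutional trunk is computed once per image.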
Sliding Window: Overfeat
Network input: 3 x 221 x 221
Larger image: 3 x 257 x 257
Classification scores P(cat), one per window position:
0.5 0.75
0.6 0.8
Sliding Window: Overfeat
Network input: 3 x 221 x 221
Larger image: 3 x 257 x 257
Greedily merge boxes and scores (details in paper)
Classification score: P(cat) = 0.8
Sliding Window: Overfeat
In practice use many sliding window locations and multiple scales
Window positions + score maps -> Box regression outputs -> Final predictions
Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
Efficient Sliding Window: Overfeat
Efficient sliding window by converting fully-connected layers into convolutions
Image: 3 x 221 x 221
Convolution + pooling -> Feature map: 1024 x 5 x 5
Classification head: 5 x 5 conv -> 4096 x 1 x 1 -> 1 x 1 conv -> 1024 x 1 x 1 -> 1 x 1 conv -> Class scores: 1000 x 1 x 1
Regression head: 5 x 5 conv -> 4096 x 1 x 1 -> 1 x 1 conv -> 1024 x 1 x 1 -> 1 x 1 conv -> Box coordinates: (4 x 1000) x 1 x 1
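A small NumPy check of the conversion idea, under stated assumptions: the channel counts are shrunk (8 and 16 stand in for 1024 and 4096), the convolution is a naive loop, and the larger image is assumed to yield a 6 x 6 feature map. It shows that a fully-connected layer over a 5 x 5 feature map holds the same weights as a 5 x 5 convolution, and that the convolution yields a 2 x 2 output on the larger map.

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid' cross-correlation. x: (C_in, H, W); w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    _, h, width = x.shape
    out = np.zeros((c_out, h - k + 1, width - k + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
C, K, HIDDEN = 8, 5, 16            # small stand-ins for 1024, 5 and 4096

# A fully-connected layer acting on a flattened C x 5 x 5 feature map...
W_fc = rng.normal(size=(HIDDEN, C * K * K))
# ...holds the same weights as HIDDEN filters of shape C x 5 x 5.
W_conv = W_fc.reshape(HIDDEN, C, K, K)

small = rng.normal(size=(C, 5, 5))       # feature map of a training-size image
fc_out = W_fc @ small.reshape(-1)
conv_out = conv2d_valid(small, W_conv)   # shape (HIDDEN, 1, 1)

large = rng.normal(size=(C, 6, 6))       # assumed feature map of a larger image
dense_out = conv2d_valid(large, W_conv)  # shape (HIDDEN, 2, 2)
```

Each of the four entries of `dense_out` equals what the fully-connected layer would produce on the matching 5 x 5 window, but overlapping windows share their computation.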
Efficient Sliding Window: Overfeat
Training time: Small image, 1 x 1 classifier output
Test time: Larger image, 2 x 2 classifier output, only extra compute at yellow regions
Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
Computer Vision Tasks
Classification, Classification + Localization, Object Detection, Instance Segmentation
Region Proposals
● Find “blobby” image regions that are likely to contain objects
● “Class-agnostic” object detector
● Look for “blob-like” regions
Region Proposals: Selective Search
Bottom-up segmentation, merging regions at multiple scales
Convert regions to boxes
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
Slide credit: Ross Girshick
Fast R-CNN
R-CNN Problem: Slow at test-time due to independent forward passes of the CNN
Solution: Share computation of convolutional layers between proposals for an image
R-CNN Problems:
- Post-hoc training: CNN not updated in response to final classifiers and regressors
- Complex training pipeline
Solution: Just train the whole system end-to-end all at once!
Fast R-CNN: Region of Interest Pooling
Hi-res input image: 3 x 800 x 600 with region proposal
Convolution and Pooling
Hi-res conv features: C x H x W with region proposal
RoI conv features: C x h x w for region proposal
Fully-connected layers
Fully-connected layers expect low-res conv features: C x h x w
Can back propagate similar to max pooling
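A minimal sketch of RoI pooling, assuming the region proposal has already been projected onto the conv feature map and is given in integer feature-map coordinates. The region is divided into an h x w grid and each grid cell is max-pooled, so any proposal size maps to the fixed C x h x w input the fully-connected layers expect. The function name and the bin-edge rounding are illustrative choices.

```python
import numpy as np

def roi_max_pool(features, roi, out_h, out_w):
    """Max-pool the RoI of a (C, H, W) feature map to a fixed (C, out_h, out_w).

    roi = (y0, x0, y1, x1) in feature-map coordinates, end-exclusive.
    """
    y0, x0, y1, x1 = roi
    region = features[:, y0:y1, x0:x1]
    c, h, w = region.shape
    # Bin edges that split the region into an out_h x out_w grid.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.empty((c, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Guarantee every bin covers at least one cell.
            y_hi = max(ys[i + 1], ys[i] + 1)
            x_hi = max(xs[j + 1], xs[j] + 1)
            pooled[:, i, j] = region[:, ys[i]:y_hi, xs[j]:x_hi].max(axis=(1, 2))
    return pooled

features = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
pooled = roi_max_pool(features, roi=(1, 2, 7, 8), out_h=2, out_w=2)
```

Because each output value is the max over one bin, the gradient flows back only to the winning location in that bin, as in ordinary max pooling.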
Faster R-CNN: Training
In the paper: Ugly pipeline
- Use alternating optimization to train the RPN (Region Proposal Network), then Fast R-CNN with RPN proposals, etc.
- More complex than it has to be
Since publication: Joint training! One network, four losses
- RPN classification (anchor good / bad)
- RPN regression (anchor -> proposal)
- Fast R-CNN classification (over classes)
- Fast R-CNN regression (proposal -> box)
Slide credit: Ross Girshick
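The four-loss objective can be written as one scalar. The sketch below is for a single anchor and a single proposal; the equal weighting of the terms and the use of softmax cross-entropy and smooth L1 are assumptions, and all function names are illustrative.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def smooth_l1(pred, target):
    """Smooth L1 loss summed over the 4 box offsets."""
    diff = np.abs(pred - target)
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum())

def joint_loss(rpn_logits, rpn_label, rpn_box, rpn_box_target,
               cls_logits, cls_label, det_box, det_box_target):
    """Sum of the four losses with equal weights (an assumption)."""
    return (cross_entropy(rpn_logits, rpn_label)      # anchor good / bad
            + smooth_l1(rpn_box, rpn_box_target)      # anchor -> proposal
            + cross_entropy(cls_logits, cls_label)    # over classes
            + smooth_l1(det_box, det_box_target))     # proposal -> box

total = joint_loss(np.array([0.0, 0.0]), 1, np.zeros(4), np.zeros(4),
                   np.array([0.0, 0.0, 0.0]), 2,
                   np.zeros(4), np.array([2.0, 0.0, 0.0, 0.0]))
```

Training on the sum lets gradients from all four terms update the shared convolutional layers in one pass.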
Faster R-CNN: Results

| | R-CNN | Fast R-CNN | Faster R-CNN |
|---|---|---|---|
| Test time per image (with proposals) | 50 seconds | 2 seconds | 0.2 seconds |
| (Speedup) | 1x | 25x | 250x |
| mAP (VOC 2007) | 66.0 | 66.9 | 66.9 |
Object Detection State-of-the-art: ResNet 101 + Faster R-CNN + some extras
He et al, “Deep Residual Learning for Image Recognition”, arXiv 2015
ImageNet Detection 2013 - 2015
YOLO: You Only Look Once
Detection as Regression
Divide image into S x S grid
Within each grid cell predict:
- B Boxes: 4 coordinates + confidence
- Class scores: C numbers
Regression from image to 7 x 7 x (5 * B + C) tensor (that is, S = 7)
Direct prediction using a CNN
Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015
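The size of the regression target follows directly from the counts above. In this sketch the values B = 2 and C = 20 are assumed for illustration, and the layout of each cell's vector (boxes first, then class scores) is also an assumption.

```python
import numpy as np

S, B, C = 7, 2, 20   # grid size, boxes per cell, classes (B and C are assumed)

def yolo_output_shape(s, b, c):
    """Each cell predicts b boxes of (4 coordinates + confidence) plus c class scores."""
    return (s, s, 5 * b + c)

def split_cell(cell, b, c):
    """Split one cell's vector into its boxes and class scores (assumed layout)."""
    boxes = cell[:5 * b].reshape(b, 5)   # each row: 4 coordinates + confidence
    class_scores = cell[5 * b:]
    return boxes, class_scores

prediction = np.zeros(yolo_output_shape(S, B, C))  # stand-in for the CNN output
boxes, class_scores = split_cell(prediction[3, 4], B, C)
```

With these values the network regresses a 7 x 7 x 30 tensor per image in a single forward pass.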
YOLO: You Only Look Once
Detection as Regression
Faster than Faster R-CNN, but not as accurate
Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015
Classification
Today
Object Detection
Classification + Localization
Segmentation
Today
Semantic Segmentation
Label every pixel!
Don’t differentiate instances (cows)
Classic computer vision problem
Figure credit: Shotton et al, “TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context”, IJCV 2007
Instance Segmentation
Figure credit: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
Detect instances, give category, label pixels
“simultaneous detection and segmentation” (SDS)
Lots of recent work (MS-COCO)
Semantic Segmentation
Extract patch
Run through a CNN
Classify center pixel (COW)
Repeat for every pixel
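The patch-based recipe above can be sketched as a loop over pixels. The classifier here is a hypothetical stand-in (a brightness threshold), not a CNN, and the image is a single-channel toy example; only the control flow is meant to match the slide.

```python
import numpy as np

def classify_patch(patch):
    """Hypothetical stand-in for the CNN: label 1 if the patch is bright, else 0."""
    return int(patch.mean() > 0.5)

def segment_by_patches(image, patch_size):
    """Label each pixel by classifying the patch centred on it (patch_size odd)."""
    half = patch_size // 2
    # Reflect-pad so border pixels also get a full patch.
    padded = np.pad(image, half, mode="reflect")
    labels = np.zeros(image.shape, dtype=int)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            patch = padded[y:y + patch_size, x:x + patch_size]
            labels[y, x] = classify_patch(patch)
    return labels

image = np.zeros((8, 8))
image[:, 4:] = 1.0                      # right half is "object"
labels = segment_by_patches(image, patch_size=3)
```

The loop calls the classifier once per pixel, and neighbouring patches overlap almost entirely, which is why this approach is expensive with a real CNN.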
Semantic Segmentation
Run “fully convolutional” network to get all pixels at once
Smaller output due to pooling
Semantic Segmentation: Multi-Scale
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Resize image to multiple scales
Run one CNN per scale
Upscale outputs and concatenate
External “bottom-up” segmentation
Combine everything for final outputs
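The first three steps (resize, run a CNN per scale, upscale and concatenate) can be sketched as follows. Everything here is a simplification under stated assumptions: `toy_cnn` is a hypothetical placeholder for the per-scale network, resizing uses average pooling and nearest-neighbour upsampling, and the external bottom-up segmentation step is omitted.

```python
import numpy as np

def downsample(image, factor):
    """Shrink a (H, W) image by averaging factor x factor blocks (H, W divisible)."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(features, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return features.repeat(factor, axis=1).repeat(factor, axis=2)

def toy_cnn(image):
    """Hypothetical stand-in for a per-scale CNN: returns 2 feature channels."""
    return np.stack([image, image ** 2])

def multiscale_features(image, factors=(1, 2, 4)):
    per_scale = []
    for f in factors:
        feats = toy_cnn(downsample(image, f))   # run the CNN at this scale
        per_scale.append(upsample(feats, f))    # bring back to full resolution
    return np.concatenate(per_scale, axis=0)    # stack along channels

image = np.random.default_rng(0).random((16, 16))
features = multiscale_features(image)
```

Each pixel ends up with a feature vector that mixes fine detail from the full-resolution pass with wider context from the coarser passes.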
Semantic Segmentation: Refinement
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Apply CNN once to get labels
Apply AGAIN to refine labels
And again!
Same CNN weights: recurrent convolutional network
More iterations improve results
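A toy sketch of the refinement loop, only to show the weight sharing. The "CNN" is a hypothetical single 1 x 1 convolution over the image channels stacked with the previous class scores; feeding the previous scores back in this way is an assumption about the general idea, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 3

# One shared set of weights: a per-pixel linear map (a 1 x 1 convolution)
# from [image channels + previous class scores] to new class scores.
W = rng.normal(scale=0.1, size=(NUM_CLASSES, 3 + NUM_CLASSES))

def toy_cnn(image, prev_scores):
    """Hypothetical stand-in for the CNN; reuses the same W on every call."""
    stacked = np.concatenate([image, prev_scores], axis=0)   # (3 + K, H, W)
    return np.einsum("kc,chw->khw", W, stacked)

def refine(image, iterations):
    scores = np.zeros((NUM_CLASSES,) + image.shape[1:])      # no labels yet
    for _ in range(iterations):
        scores = toy_cnn(image, scores)                      # apply again
    return scores.argmax(axis=0)                             # per-pixel labels

image = rng.random((3, 8, 8))
labels = refine(image, iterations=3)
```

Because `W` never changes between iterations, unrolling the loop gives a recurrent network in the sense the slide describes.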
Semantic Segmentation: Upsampling
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Learnable upsampling!
Skip connections = Better results
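A skip connection of this kind can be sketched as an element-wise sum of upsampled coarse scores with scores from an earlier, higher-resolution layer. The nearest-neighbour upsampling below is a stand-in for the learnable upsampling, and the 21-class score maps and their sizes are assumed for illustration.

```python
import numpy as np

def upsample2x(scores):
    """Nearest-neighbour 2x upsampling of (K, H, W) class scores."""
    return scores.repeat(2, axis=1).repeat(2, axis=2)

def fuse_with_skip(coarse_scores, fine_scores):
    """Upsample coarse scores 2x and add the finer layer's scores."""
    return upsample2x(coarse_scores) + fine_scores

rng = np.random.default_rng(0)
coarse = rng.normal(size=(21, 4, 4))   # scores from the deepest layer
fine = rng.normal(size=(21, 8, 8))     # scores from an earlier, higher-res layer
fused = fuse_with_skip(coarse, fine)
```

The deep layer contributes what is in the image, and the earlier layer contributes where, which is one way to read the improved results.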
Learnable Upsampling: “Deconvolution”
Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Dot product between filter and input
Typical 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4 Output: 2 x 2
Dot product between filter and input
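The two output sizes quoted above follow from the standard convolution size formula. A short check, with `conv_output_size` as an illustrative helper name:

```python
def conv_output_size(n, kernel, stride, pad):
    """Output width of a convolution over an n-wide input with zero padding."""
    return (n + 2 * pad - kernel) // stride + 1

stride1 = conv_output_size(4, kernel=3, stride=1, pad=1)   # 4
stride2 = conv_output_size(4, kernel=3, stride=2, pad=1)   # 2
```

Stride 2 moves the filter two input pixels for every output pixel, which is why the 4 x 4 input shrinks to 2 x 2.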
3 x 3 “deconvolution”, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Same as backward pass for normal convolution!
“Deconvolution” is a bad name, already defined as “inverse of convolution”
Better names: convolution transpose, backward strided convolution, 1/2 strided convolution, upconvolution
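A single-channel NumPy sketch of the operation, assuming square inputs and a square filter. Each input value scales a copy of the filter, the copies are summed where they overlap, and the result is cropped to the requested output size. The final check illustrates the "transpose" name: the inner product of conv(u) with v equals the inner product of u with the transposed convolution of v, which is the relationship the backward pass of a normal convolution relies on.

```python
import numpy as np

def conv2d(x, w, stride, pad):
    """Single-channel strided convolution (cross-correlation) with zero padding."""
    k = w.shape[0]
    xp = np.pad(x, pad)
    out_size = (x.shape[0] + 2 * pad - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = xp[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(window * w)      # dot product between filter and input
    return out

def conv2d_transpose(x, w, stride, pad, out_size):
    """Transpose of conv2d: each input value scales a copy of the filter."""
    k = w.shape[0]
    n = x.shape[0]
    canvas_size = max((n - 1) * stride + k, out_size + pad)
    canvas = np.zeros((canvas_size, canvas_size))
    for i in range(n):
        for j in range(n):
            # Input gives weight for filter; sum where output overlaps.
            canvas[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return canvas[pad:pad + out_size, pad:pad + out_size]   # crop the padding

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3))
u = rng.normal(size=(4, 4))
v = rng.normal(size=(2, 2))

down = conv2d(u, w, stride=2, pad=1)                      # 4 x 4 -> 2 x 2
up = conv2d_transpose(v, w, stride=2, pad=1, out_size=4)  # 2 x 2 -> 4 x 4
# Transpose check: <conv(u), v> equals <u, conv_transpose(v)>.
adjoint_gap = abs(np.sum(down * v) - np.sum(u * up))
```

Note that `conv2d_transpose` does not undo `conv2d`; it only restores the spatial size, which is why "deconvolution" is a misleading name.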
Semantic Segmentation: Upsampling
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Normal VGG “Upside down” VGG
6 days of training on Titan X…
Instance Segmentation
Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014
Similar to R-CNN, but with segments
![Page 59: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/59.jpg)
Instance Segmentation
Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014
External Segment proposals
Similar to R-CNN, but with segments
![Page 60: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/60.jpg)
Instance Segmentation
Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014
External Segment proposals
Similar to R-CNN
![Page 61: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/61.jpg)
Instance Segmentation
Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014
External Segment proposals
Mask out background with mean image
Similar to R-CNN, but with segments
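The "mask out background with mean image" step can be illustrated with a short sketch. It covers only the masking operation, not the rest of the pipeline; the function name `mask_background`, the nested-list grayscale image, and the 0/1 mask format are all assumptions made for brevity.

```python
def mask_background(image, mask, mean_image):
    """Keep pixels inside the segment; replace the rest with the mean image.

    image, mask and mean_image are nested lists of the same shape.
    mask holds 1 (or True) for pixels that belong to the segment proposal.
    """
    return [
        [px if keep else mean_px
         for px, keep, mean_px in zip(img_row, mask_row, mean_row)]
        for img_row, mask_row, mean_row in zip(image, mask, mean_image)
    ]


image = [[10, 20], [30, 40]]
mask = [[1, 0], [0, 1]]
mean_image = [[5, 5], [5, 5]]
print(mask_background(image, mask, mean_image))
```

After mean subtraction, the background pixels become zero, so the network sees only the segment's appearance inside the cropped box.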
![Page 62: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/62.jpg)
Instance Segmentation
Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014
External Segment proposals
Mask out background with mean image
Similar to R-CNN, but with segments
![Page 63: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/63.jpg)
Instance Segmentation
Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014
External Segment proposals
Mask out background with mean image
Similar to R-CNN, but with segments
![Page 64: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/64.jpg)
Instance Segmentation: Cascades
Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
Similar to Faster R-CNN
Won COCO 2015 challenge (with ResNet)
![Page 65: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/65.jpg)
Instance Segmentation: Cascades
Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
Similar to Faster R-CNN
Region proposal network (RPN)
Won COCO 2015 challenge (with ResNet)
![Page 66: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/66.jpg)
Instance Segmentation: Cascades
Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
Similar to Faster R-CNN
Won COCO 2015 challenge (with ResNet)
Region proposal network (RPN)
Reshape boxes to fixed size, figure / ground logistic regression
Mask out background, predict object class
Learn entire model end-to-end!
![Page 67: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/67.jpg)
Instance Segmentation: Cascades
Figure panels: Predictions, Ground truth
Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
![Page 68: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/68.jpg)
Segmentation Overview
Semantic segmentation:
Classify all pixels
Fully convolutional models, downsample then upsample
Learnable upsampling: fractionally strided convolution
Skip connections can help
Instance segmentation:
Detect instance, generate mask
Similar pipelines to object detection
![Page 69: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/69.jpg)
Quick overview of Other Topics
![Page 70: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/70.jpg)
Recurrent Neural Networks (RNN)
Vanilla Neural Networks
![Page 71: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/71.jpg)
Recurrent Neural Networks (RNN)
e.g. Image Captioning: image -> sequence of words
![Page 72: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/72.jpg)
Recurrent Neural Networks (RNN)
e.g. Sentiment Classification: sequence of words -> sentiment
![Page 73: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/73.jpg)
Recurrent Neural Networks (RNN)
e.g. Machine Translation: seq of words -> seq of words
![Page 74: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/74.jpg)
Recurrent Neural Networks (RNN)
e.g. Video classification on frame level
![Page 75: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/75.jpg)
x -> RNN -> y
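The RNN block between x and y is usually written as a recurrence over a hidden state. The slide itself shows only the diagram, so the sketch below uses the standard vanilla RNN update (tanh nonlinearity, no bias terms) as an assumption; the names `rnn_step`, `W_xh`, `W_hh` and `W_hy` are illustrative rather than taken from the lecture.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rnn_step(x, h_prev, W_xh, W_hh, W_hy):
    """One step of a vanilla RNN.

    h = tanh(W_xh x + W_hh h_prev), y = W_hy h
    """
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]
    h = [math.tanh(p) for p in pre]
    y = matvec(W_hy, h)
    return h, y


# The same weights are reused at every time step of the sequence.
h = [0.0]
for x_t in ([0.5], [0.1], [-0.3]):
    h, y = rnn_step(x_t, h, W_xh=[[1.0]], W_hh=[[0.5]], W_hy=[[2.0]])
    print(h, y)
```

Because the hidden state is fed back in at each step, the same function handles input and output sequences of any length, which is what the captioning, sentiment and translation examples rely on.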
![Page 76: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/76.jpg)
Character RNN during training
Samples at successive stages: train more -> train more -> train more
![Page 77: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/77.jpg)
![Page 78: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/78.jpg)
Generated C code
![Page 79: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/79.jpg)
quote detection cell
Searching for interpretable cells
![Page 80: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/80.jpg)
Sequential Processing of fixed inputs
Multiple Object Recognition with Visual Attention, Ba et al.
![Page 81: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/81.jpg)
Sequential Processing of fixed outputs
DRAW: A Recurrent Neural Network For Image Generation, Gregor et al.
![Page 82: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/82.jpg)
Image Captioning
Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick
![Page 83: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/83.jpg)
Convolutional Neural Network
Recurrent Neural Network
![Page 84: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/84.jpg)
Soft Attention for Captioning
Image: H x W x 3 -> CNN -> Features: L x D
h0 -> a1 (distribution over L locations)
a1 and features -> z1 (weighted combination of features; weighted features: D)
z1, y1 (first word) -> h1
h1 -> a2 (distribution over L locations), d1 (distribution over vocab)
z2, y2 -> h2
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
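The soft attention step, a distribution over L locations followed by a weighted combination of the L x D features, can be sketched as below. How the per-location scores are computed from the hidden state is not spelled out on the slide, so the sketch simply takes them as given; `softmax` and `soft_attend` are illustrative names.

```python
import math


def softmax(scores):
    """Turn L raw scores into a distribution over L locations."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attend(features, scores):
    """features: L rows of D values; scores: L unnormalised scores.

    Returns the attention weights a (length L) and the weighted feature
    vector z (length D), z[j] = sum_i a[i] * features[i][j].
    """
    a = softmax(scores)
    d = len(features[0])
    z = [sum(a[i] * features[i][j] for i in range(len(features)))
         for j in range(d)]
    return a, z


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # L = 3 locations, D = 2
a, z = soft_attend(features, scores=[0.0, 0.0, 0.0])
print(a, z)
```

Because z is a smooth function of the weights, gradients flow through the attention step, which is what makes the soft variant trainable with ordinary backpropagation.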
![Page 85: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/85.jpg)
Soft Attention for Captioning
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Soft attention
Hard attention
![Page 86: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/86.jpg)
Soft Attention for Captioning
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
![Page 87: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/87.jpg)
Spatial Transformer Networks
Input image: H x W x 3
Box Coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3
Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015
Can we make this function differentiable?
![Page 88: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/88.jpg)
Spatial Transformer Networks
Input image: H x W x 3
Box Coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3
Can we make this function differentiable?
Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015
Idea: Function mapping pixel coordinates (xt, yt) of output to pixel coordinates (xs, ys) of input
Repeat for all pixels in output to get a sampling grid
Then use bilinear interpolation to compute output
Network attends to input by predicting θ
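The coordinate mapping, sampling grid and bilinear interpolation can be sketched as follows. This is a minimal single-channel illustration under stated assumptions: the transform is a 2 x 3 affine matrix `theta`, coordinates are normalised to [-1, 1], and samples outside the image count as zero. The function names are hypothetical and do not correspond to any library.

```python
import math


def affine_grid(theta, out_h, out_w):
    """Map each output pixel (xt, yt) to an input location (xs, ys)."""
    grid = []
    for r in range(out_h):
        yt = -1.0 + 2.0 * r / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for c in range(out_w):
            xt = -1.0 + 2.0 * c / (out_w - 1) if out_w > 1 else 0.0
            xs = theta[0][0] * xt + theta[0][1] * yt + theta[0][2]
            ys = theta[1][0] * xt + theta[1][1] * yt + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid


def bilinear_sample(image, xs, ys):
    """Sample a grayscale image at normalised (xs, ys)."""
    h, w = len(image), len(image[0])
    x = (xs + 1.0) * (w - 1) / 2.0  # back to pixel coordinates
    y = (ys + 1.0) * (h - 1) / 2.0
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(x - xx)) * (1.0 - abs(y - yy))
                value += weight * image[yy][xx]
    return value


def spatial_transform(image, theta, out_h, out_w):
    """Apply the sampling grid to produce the transformed output."""
    return [[bilinear_sample(image, xs, ys) for xs, ys in row]
            for row in affine_grid(theta, out_h, out_w)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Scale by 0.5 and shift by 0.5: attends to the bottom-right quadrant.
theta = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
print(spatial_transform(image, theta, 2, 2))
```

Each output value is a weighted sum of four input pixels with weights that vary smoothly in (xs, ys), so the output is differentiable with respect to theta almost everywhere. That is the property the question "can we make this function differentiable?" is after.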
![Page 89: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/89.jpg)
Spatial Transformer Networks
Input: Full image
A small localization network predicts transform θ
Grid generator uses θ to compute sampling grid
Sampler uses bilinear interpolation to produce output
Output: Region of interest from input
![Page 90: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected](https://reader035.vdocuments.mx/reader035/viewer/2022070920/5fb8f24a2a15712b727e24ff/html5/thumbnails/90.jpg)
Spatial Transformer Networks
Differentiable “attention / transformation” module
Insert spatial transformers into a classification network and it learns to attend and transform the input