modern convolutional object detectorspeople.ee.duke.edu/~lcarin/kevin9.29.2017.pdf ·...
TRANSCRIPT
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Modern Convolutional Object DetectorsFaster R-CNN, R-FCN, SSD
29 September 2017
Presented by: Kevin Liang
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Papers Presented
Faster R-CNN: Towards Real-Time Object Detection withRegion Proposal Networks (NIPS 2015)Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
R-FCN: Object Detection via Region-based FullyConvolutional Networks (NIPS 2016)Jifeng Dai, Yi Li, Kaiming He, Jian Sun
SSD: Single Shot MultiBox Detector (ECCV 2016)Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
Cheng-Yang Fu, Alexander C Berg
Speed/accuracy trade-offs for modern convolutional objectdetectors (CVPR 2017)Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara,
Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama,
Kevin Murphy
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Object Detection Models
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Section 1
Background
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Classification (top) vs Detection (bottom)
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Two-stage Detection Paradigm
Many algorithms break up detection into two components:
Box ProposalObject Classification
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Regions with CNN features (R-CNN): 2013
Pros:
Demonstrated ImageNet classification + AlexNet learnedlearned features also applicable for detection
Significant jump in mAP performance on PASCAL VOC overstate-of-the-art
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Regions with CNN features (R-CNN) (continued)
Cons:
Proposals generated by external algorithm (Selective Search)
SVM and bounding box regressors trained separately; savingfeatures to train these is memory intensive
Patches extracted from raw image; overlapping proposals leadto repeated CNN computation for classification
Detection takes ∼47 s/image (GPU)
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Fast R-CNN
Single pass of entire image through CNN; patches extractedfrom convolutional feature maps
Final classification and bounding box regression trained jointlywith CNN
Test time detection is 213x as fast as R-CNN
Proposals still generated by external algorithm
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Two-stage Detection Paradigm - Revisited
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Section 2
Faster R-CNN
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Two-stage Detection Paradigm - Revisited
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Faster R-CNN: Model Architecture
Box proposal andclassification shareconvolutionalcomputation
Patches extractedfrom convolutionalfeature map
Detection ∼5 fps
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Faster R-CNN: Region Proposal Network
Sliding window passed over final convolutional feature maps
Box proposals are parameterized relative to fixed sizereference boxes (termed ”anchors”)
Multiple scales and and aspect ratios (eg 1x1, 2x1, 1x2, 2x2,etc.)
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Faster R-CNN: Training
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Faster R-CNN: Loss
RPN Loss:
L(pi, ti) =1
Ncls
∑i
Lcls(pi, p∗i ) + λ
1
Nreg
∑i
p∗iLreg(ti, t∗i ) (1)
Bounding box parameterization:
tx = (x− xa)/wa, ty = (y − ya)/hatw = log(w/wa), th = log(h/ha)
t∗x = (x∗ − xa)/wa, t∗y = (y∗ − ya)/hat∗w = log(w∗/wa), t∗h = log(h∗/ha)
Localization loss:
Lreg(t, t∗) =
∑j∈x,y,w,h
smoothL1(tj − t∗j ),
smoothL1(x) =
{0.5x2 if |x| < 1
|x| − 0.5 otherwise
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Faster R-CNN: ResNet
ResNet weights:
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Section 3
R-FCN
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Region-based Fully Convolutional Networks: Inspiration
Fast and Faster R-CNN save time by sharing computation ofrepeated convolutional features for object classification andregion proposals, respectively
However, Faster R-CNN still contains several unshared fullyconnected layers that must be computed for each of hundredsof proposals
Dilemma of opposing needs for translation invariance(classification) and translation variance (localization)⇒ Leads to unnatural insertion of region proposals betweenconvolutional layers
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Region-based Fully Convolutional Networks: ModelArchitecture
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Region-based Fully Convolutional Networks: Visualization
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Region-based Fully Convolutional Networks: Benefits andContributions
2.5-20x faster than Faster R-CNN
Comparable accuracy
Easy Online Hard Example Mining (OHEM)
Use of atrous convolutions
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Section 4
SSD
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Two-stage Detection Paradigm - Revisited Again
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Single Shot Multibox Detector: Model Architecture
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Single Shot Multibox Detector: Default boxes
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Section 5
Speed/Accuracy Comparison
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
CNNs and Meta Architectures Summary
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Accuracy vs Speed
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Accuracy/Speed vs Box proposals
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Conclusions
Found a speed-accuracy performance ”frontier”
For two-stage set-ups, more box proposals leads to betterperformance, but at the cost of speed and significantlydiminishing returns.
General correlation of convolutional feature extractorperformance on ImageNet classification and performance aspart of an object detector.
Higher resolution images lead to higher quality localization,but at the cost of speed and memory. A similar trade-offexists with respect to convolutional feature extractor outputstride (atrous convolutions).
Won 2016 MS COCO object detection challenge byensembling these implementations.
Background Faster R-CNN R-FCN SSD Speed/Accuracy Comparison
Open-Sourced TensorFlow Implementations