Detection: From R-CNN to Fast R-CNN

Reporter: Liliang Zhang

Page 1 of 25

Object Detection: Intuition

Detection ≈ Localization + Classification


Outline

• R-CNN
• SPP-Net
• Fast R-CNN


R-CNN: Pipeline Overview

Step 1. Input an image.
Step 2. Use selective search to obtain ~2k proposals.
Step 3. Warp each proposal and apply the CNN to extract its features.
Step 4. Use class-specific SVMs to score each proposal.
Step 5. Rank the proposals and use NMS to get the bboxes.
Step 6. Use class-specific regressors to refine the bboxes' positions.

Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR14
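
The ranking and NMS step of the pipeline can be sketched as follows. This is a minimal greedy NMS in NumPy, not the R-CNN code itself: the helper names `iou` and `nms`, the `[x1, y1, x2, y2]` box layout, and the 0.3 overlap threshold are all assumptions made for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # survivors for the next round
    return keep

# Two heavily overlapping boxes and one separate box (toy data)
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

In this toy example the second box overlaps the first with an IoU of about 0.82, so it is suppressed and only the first and third boxes survive. In a per-class setting, this would be run once per class on that class's scores.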


R-CNN: Performance on PASCAL VOC07

• AlexNet (T-Net): 58.5 mAP

• VGG-Net (O-Net): 66.0 mAP


R-CNN: Limitation

• Too slow! (13 s/image on a GPU or 53 s/image on a CPU, and VGG-Net is 7x slower still.)

• Proposals need to be warped to a fixed size.


SPP-Net: Motivation

• Cropping may lose some information about the object.

• Warping may change the object's appearance.

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, TPAMI15


SPP-Net: Spatial Pyramid Pooling (SPP) Layer

• FC layers need a fixed-length input, while conv layers can adapt to arbitrary input sizes.

• Thus we need a bridge between the conv and FC layers.

• Here comes the SPP layer.
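
To make the "bridge" concrete, here is a minimal NumPy sketch of spatial pyramid pooling: a conv feature map of any height and width is max-pooled into a fixed grid of bins at several levels, and the results are concatenated into a fixed-length vector. The function name `spp_pool` and the pyramid levels (1, 2, 4) are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def spp_pool(feat, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into n x n bins per level, then concatenate."""
    c, h, w = feat.shape
    pooled = []
    for n in levels:
        # Bin edges that cover the whole map, whatever its size
        ys = np.linspace(0, h, n + 1)
        xs = np.linspace(0, w, n + 1)
        for i in range(n):
            y0, y1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            for j in range(n):
                x0, x1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
                pooled.append(feat[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)  # length C * sum(n * n for n in levels)

rng = np.random.default_rng(0)
small = spp_pool(rng.standard_normal((8, 6, 9)))    # 8 channels, 6 x 9 map
large = spp_pool(rng.standard_normal((8, 13, 17)))  # 8 channels, 13 x 17 map
```

Both `small` and `large` have length 8 × (1 + 4 + 16) = 168, so the FC layers that follow see the same input size regardless of the feature map's spatial extent.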


SPP-Net: Training for Detection (1)

[Figure: an image pyramid is passed through the conv layers to produce a pyramid of Conv5 feature maps]

Step 1. Generate an image pyramid and extract the conv feature map of the whole image at each scale.


SPP-Net: Training for Detection(2)

• Step 2. For each proposal, walk the image pyramid and find the scale at which the projected proposal has a pixel count closest to 224x224 (for scale invariance in training).

• Step 3. Find the corresponding region in the Conv5 feature map and use the SPP layer to pool it to a fixed size.

• Step 4. Once all the proposals' features are extracted, fine-tune the FC layers only.

• Step 5. Train the class-specific SVMs.
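
Steps 2 and 3 above can be sketched in plain Python. Everything named here is an assumption for illustration: the helper names `pick_scale` and `project_to_conv5`, the candidate scales (length of the resized image's shorter side), the Conv5 stride of 16, and the simple floor/ceil rounding, which may differ from the exact mapping used in the paper.

```python
import math

def pick_scale(box, image_size, scales=(480, 576, 688, 864, 1200), target=224):
    """Pick the pyramid scale at which the proposal's area is closest to target x target.

    box is (x1, y1, x2, y2) in original image pixels; image_size is (width, height).
    """
    x1, y1, x2, y2 = box
    short_side = min(image_size)
    best_scale, best_gap = None, float("inf")
    for s in scales:
        ratio = s / short_side  # resize factor for this pyramid level
        area = (x2 - x1) * (y2 - y1) * ratio * ratio
        gap = abs(area - target * target)
        if gap < best_gap:
            best_scale, best_gap = s, gap
    return best_scale

def project_to_conv5(box, ratio, stride=16):
    """Map an image-space box to Conv5 coordinates at the chosen scale (simplified rounding)."""
    x1, y1, x2, y2 = (v * ratio / stride for v in box)
    return (math.floor(x1), math.floor(y1), math.ceil(x2), math.ceil(y2))

box = (100, 100, 200, 200)          # a 100 x 100 proposal
scale = pick_scale(box, (500, 375))  # image is 500 wide, 375 tall
window = project_to_conv5(box, scale / 375)
```

For this toy proposal the chosen scale is 864, where the proposal covers roughly 230 x 230 pixels, and the Conv5 window is `(14, 14, 29, 29)`. That window is what the SPP layer would then pool to a fixed size.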


SPP-Net: Testing for Detection

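• Almost the same as R-CNN, except for Step 3: features are pooled from the shared Conv5 feature map instead of warping each proposal and running the CNN on it.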


SPP-Net: Performance

• Speed: 64x faster than R-CNN using one scale, and 24x faster using a five-scale pyramid.

• mAP: +1.2 mAP vs R-CNN


SPP-Net: Limitation

1. Training is a multi-stage pipeline.

2. Training is expensive in space and time.

[Figure: the pipeline stages (conv layers, FC layers, SVM, regressor), with intermediate features stored between stages]


Fast R-CNN: Motivation

Joint training!

Ross Girshick. Fast R-CNN, arXiv tech report


Fast R-CNN: Joint Training Framework

Join the feature extractor, classifier, and regressor in a unified framework.
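
For reference, the joint objective in the Fast R-CNN paper can be written as a single multi-task loss per RoI. The notation below follows the paper as best recalled and is not taken from the slide: $p$ is the predicted class distribution, $u$ the ground-truth class (with $u = 0$ for background), $t^u$ the predicted box offsets for class $u$, $v$ the regression target, and $\lambda$ a balancing weight.

```latex
L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\text{loc}}(t^u, v),
\qquad L_{\text{cls}}(p, u) = -\log p_u
```

The indicator $[u \ge 1]$ switches the box-regression term off for background RoIs, so a single backward pass updates the feature extractor, classifier, and regressor together.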


Fast R-CNN: RoI pooling layer

≈ a single-scale SPP layer
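
A minimal NumPy sketch of RoI pooling, under stated assumptions: the function name `roi_pool`, the 7 x 7 output grid, the feature stride of 16, and the rounding of RoI coordinates are illustrative choices, not details given on the slide. The RoI is projected onto the feature map and max-pooled into one fixed-size grid.

```python
import numpy as np

def roi_pool(feat, roi, out_size=7, stride=16):
    """Max-pool the RoI region of a (C, H, W) feature map into an out_size x out_size grid.

    roi is (x1, y1, x2, y2) in image pixels; stride maps image pixels to feature cells.
    """
    c, h, w = feat.shape
    x1, y1, x2, y2 = (int(round(v / stride)) for v in roi)
    # Clamp to the feature map and keep at least one cell in each direction
    x1, y1 = min(max(x1, 0), w - 1), min(max(y1, 0), h - 1)
    x2, y2 = min(max(x2, x1), w - 1), min(max(y2, y1), h - 1)
    roi_w, roi_h = x2 - x1 + 1, y2 - y1 + 1
    out = np.empty((c, out_size, out_size), dtype=feat.dtype)
    for i in range(out_size):
        ys = y1 + int(np.floor(i * roi_h / out_size))
        ye = y1 + int(np.ceil((i + 1) * roi_h / out_size))
        for j in range(out_size):
            xs = x1 + int(np.floor(j * roi_w / out_size))
            xe = x1 + int(np.ceil((j + 1) * roi_w / out_size))
            out[:, i, j] = feat[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 20, 30))          # toy Conv5 map: 4 channels, 20 x 30
pooled = roi_pool(feat, (32, 48, 240, 208))      # RoI given in image pixels
tiny = roi_pool(feat, (0, 0, 16, 16))            # a very small RoI still yields 7 x 7
```

Every RoI, large or small, comes out as a `(C, 7, 7)` tensor, which is what lets all proposals share one set of FC layers.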


Fast R-CNN: Regression Loss

A smooth L1 loss, which is less sensitive to outliers than the L2 loss.
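
A short NumPy sketch of why this holds. The smooth L1 definition below (quadratic for |x| < 1, linear beyond) follows the Fast R-CNN paper; the `half_squared` comparison function and the sample residuals are assumptions chosen for illustration.

```python
import numpy as np

def smooth_l1(x):
    """0.5 * x^2 when |x| < 1, otherwise |x| - 0.5 (elementwise)."""
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def half_squared(x):
    """An L2-style loss, 0.5 * x^2, used here only for comparison."""
    x = np.asarray(x, dtype=float)
    return 0.5 * x ** 2

residuals = np.array([0.5, 1.0, 10.0])  # the last one plays the role of an outlier
robust = smooth_l1(residuals)            # grows linearly for large residuals
quadratic = half_squared(residuals)      # grows quadratically for large residuals
```

For the residual of 10, smooth L1 gives 9.5 while the squared loss gives 50, and the gradient magnitudes are 1 versus 10. A single badly localized box therefore pulls on the regressor far less under smooth L1.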


Fast R-CNN: Scale Invariance

Two options: image pyramids (multi-scale) vs. brute force (single scale)

[Figure: image → conv → Conv5 feature map]

• In practice, a single scale is good enough. (This is the main reason it can be 10x faster than SPP-Net.)


Fast R-CNN: Other tricks

• SVD on FC layers: 30% speed-up at test time with a small drop in performance.

• Which layers to fine-tune? Freezing the shallow conv layers reduces training time with a small drop in performance.

• Data augmentation: using VOC12 as an additional training set can boost mAP by ~3%.
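
The SVD trick from the first bullet can be sketched in NumPy: one large FC weight matrix is replaced by two thinner ones obtained from a truncated SVD. The function name `compress_fc`, the layer sizes, and the rank `k` are illustrative assumptions; the toy matrix is built to be exactly low-rank so that the approximation is exact here, which real FC weights are not.

```python
import numpy as np

def compress_fc(W, k):
    """Factor W (out x in) into A (out x k) and B (k x in) via truncated SVD, W ≈ A @ B."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k]                  # second, smaller layer
    B = S[:k, None] * Vt[:k]      # first, smaller layer (singular values folded in)
    return A, B

rng = np.random.default_rng(0)
# Toy FC weights of rank 8: 64 outputs, 128 inputs
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 128))
A, B = compress_fc(W, k=8)

x = rng.standard_normal(128)
y_full = W @ x            # 64 * 128 = 8192 multiplies
y_fast = A @ (B @ x)      # 8 * (64 + 128) = 1536 multiplies
```

The parameter and multiply count drops from `out * in` to `k * (out + in)`. With real weights, a smaller `k` trades accuracy for speed, which matches the small performance drop mentioned above.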


Fast R-CNN: Performance

• Without data augmentation, mAP improves by only 0.9 on VOC07.

• But training and testing have been greatly sped up (training 9x, testing 213x vs. R-CNN).

• Without data augmentation, mAP improves by 2.3 on VOC12.


Fast R-CNN: Discussion about the number of proposals

Are more proposals always better?

NO!


Thanks
