Source: cs229.stanford.edu/proj2015/172_poster.pdf



Pedestrian Detection with R-CNN

Matthew Chen

Department of Computer Science, Stanford University

Pedestrian Image Examples

Figure 1: Original image on top left. Positive selective search bounding boxes on top right.

Figure 2: Warped Proposal Images

Figure 3: Negative Examples for training

Figure 4: Example detection output from the AlexNet architecture

Dataset

Statistic         Test   Train
Num Images        774    4952
Avg Pedestrians   7      7
Avg Proposals     2278   2953
Pos Proposals     230    111
Neg Proposals     2048   2842

Table 1: Data statistics split up by training and test sets

The dataset is composed of several video sequences produced by a camera on a moving platform. Each frame has hand-labelled annotations denoting the bounding boxes of pedestrians. Overall there are seven video sequences with a combined 4,534 frames. The data was split into a training set and a test set, with the frames from five sequences (3,105 frames) in the training set and two sequences (1,429 frames) in the test set, as shown in Table 1. The annotations are incomplete in that they do not strictly label every pedestrian in a given image. Usually only pedestrians that occupy more than some subjective fraction of the screen are labelled. As a result, the training set may contain some false negatives.

Figure 5: Comparison between selective search and Edgebox proposal methods. [Plot: RMSE (y-axis, 0 to 30) by dataset (test, train) for each proposal method (edgebox, ss).]

Methods

Figure 6: Pedestrian detection pipeline. [Diagram: Raw Image -> Proposal BBs -> Pedestrian Detector -> Non-Maximal Suppression, with edges labelled img, BBs, and scores.]

The complete pipeline from video frame to bounding box output is shown in Figure 6. We start with a given video sequence and split it into frames. We then run an algorithm to generate proposal bounding boxes, in this case selective search, and cache the proposals for reuse throughout the process. We pass these bounding boxes, along with the original image, to the detector, which is our convolutional neural network. The CNN produces softmax scores for each bounding box, which are used in the final non-maximal suppression step.
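The final non-maximal suppression step can be illustrated with a minimal sketch. This is not the poster's implementation: the function names, the (x1, y1, x2, y2) box format, and the overlap threshold of 0.3 are all assumptions chosen for illustration. The sketch uses the common greedy variant, which keeps the highest-scoring box and drops any remaining box that overlaps a kept box too heavily.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: return indices of kept boxes, highest score first.

    iou_threshold is a hypothetical value; the poster does not state one.
    """
    # Visit boxes in order of decreasing detector score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Made-up example: two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

In this made-up example the second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes remain.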

Figure 7: Runtime per image on test set. [Plot: runtime per image in seconds (y-axis, 0 to 60) by architecture (alexnet, alexnet_pretrained, cifarnet, logitnet).]

Results

Figure 8: Precision-recall curve for proposal box detections. [Plot: precision (y-axis) against recall (x-axis, 0 to 1) for each architecture (alexnet, alexnet_pretrained, cifarnet, logitnet).]

Figure 9: Miss rate to false positives curve. [Plot: miss rate (y-axis, 0.7 to 1.0) against false positives (x-axis, 0 to 100) for each architecture (alexnet, alexnet_pretrained, cifarnet, logitnet).]

Discussion

We find that AlexNet with pretrained weights from ImageNet performs best at moderate prediction score thresholds, while a slightly larger variant of AlexNet, trained from scratch, performs better at higher acceptance thresholds. There is plenty of future work to be done to improve these results. The main limiting factor was the lack of GPU support, which could have significantly sped up computation (see the runtimes in Figure 7). Additional tuning of hyperparameters such as overlap thresholds may also help.
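To make the role of the score threshold concrete, the following sketch computes precision, recall, and miss rate for one acceptance threshold. It is an illustration only: the function name, the scores, and the labels are made up and do not come from the poster's experiments. Miss rate is taken here as one minus recall.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision, recall and miss rate when accepting scores >= threshold.

    labels holds 1 for a true pedestrian proposal and 0 otherwise.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Convention: precision is 1.0 when nothing is accepted.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, 1.0 - recall


# Hypothetical detector scores and ground-truth labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3]
labels = [1, 0, 1, 1, 0]

moderate = precision_recall_at(scores, labels, 0.5)
strict = precision_recall_at(scores, labels, 0.85)
print(moderate, strict)
```

Raising the threshold in this toy data removes the one false positive but also drops a true detection, which is the kind of trade-off the precision-recall and miss-rate curves trace out across thresholds.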