pedestrian detection with r-cnn - machine learningcs229.stanford.edu/proj2015/172_poster.pdf ·...

Pedestrian Detection with R-CNN
Matthew Chen

Department of Computer Science
Stanford University

Pedestrian Image Examples

Figure 1: Original image on top left. Positive selective search bounding boxes on top right.

Figure 2: Warped Proposal Images

Figure 3: Negative Examples for training

Figure 4: Example detection output from the Alexnet architecture

Dataset

Statistic          Test    Train
Num Images          774     4952
Avg Pedestrians       7        7
Avg Proposals      2278     2953
Pos Proposals       230      111
Neg Proposals      2048     2842

Table 1: Data statistics split up by training and test sets

The dataset is composed of several video sequences captured by a camera on a moving platform. Each frame has hand-labelled annotations denoting bounding boxes of pedestrians. Overall there are seven video sequences with a combined 4,534 frames. The data was split into a training set and a test set, with the frames from five sequences (3,105 frames) in the training set and two sequences (1,429 frames) in the test set, as shown in Table 1. The annotations are not complete, in that they do not strictly label all pedestrians in a given image: usually only pedestrians that take up more than a certain subjective fraction of the frame are labelled. As a result, the training set may contain some false negatives.

[Bar chart: RMSE (y-axis) by dataset (test, train), grouped by proposal method (edgebox, ss)]

Figure 5: Comparison between the selective search and Edgebox proposal methods

Methods

Raw Image -> (img) -> Proposal BBs -> (BBs) -> Pedestrian Detector -> (scores) -> Non-Maximal Suppression

Figure 6: Pedestrian detection pipeline

The complete pipeline from video frame to bounding box output is shown in Figure 6. We start with a given video sequence and split it up into frames. Then we run an algorithm to generate proposal bounding boxes; in this case we use selective search, and we cache its output for use across the process. We pass these bounding boxes, along with the original image, to the detector, which is our convolutional neural network. The CNN produces softmax scores for each bounding box, which are used in the final non-maximal suppression step.
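The final step of the pipeline can be illustrated with a minimal sketch of greedy non-maximal suppression. This is not the poster's implementation: the function names, the (x1, y1, x2, y2) box format, and the default values of score_threshold and overlap_threshold are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[Box],
    scores: Sequence[float],
    score_threshold: float = 0.5,    # hypothetical default
    overlap_threshold: float = 0.3,  # hypothetical default
) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # Discard low-scoring proposals, then visit the rest by descending score.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= overlap_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(10, 10, 50, 100), (12, 12, 52, 102), (200, 20, 240, 110)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

Here the second box overlaps the first heavily and has a lower score, so it is suppressed, while the third box is far away and is kept. Raising overlap_threshold keeps more overlapping detections; lowering it suppresses more aggressively.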

[Bar chart: runtime per image in seconds (y-axis) by architecture (alexnet, alexnet_pretrained, cifarnet, logitnet)]

Figure 7: Runtime per image on test set

Results

[Plot: precision (y-axis) vs. recall (x-axis), one curve per architecture (alexnet, alexnet_pretrained, cifarnet, logitnet)]

Figure 8: Precision-recall curve for proposal box detections

[Plot: miss rate (y-axis) vs. false positives (x-axis), one curve per architecture (alexnet, alexnet_pretrained, cifarnet, logitnet)]

Figure 9: Miss rate to false positives curve

Discussion

We find that Alexnet with weights pretrained on Imagenet performs best at moderate prediction score thresholds, while a slightly larger variant of Alexnet, trained from scratch, performs better at higher acceptance thresholds. There is plenty of future work to be done to improve these results. The main limiting factor was the lack of GPU support, which could have significantly sped up computation (see the runtimes in Figure 7). Additional tuning of hyperparameters such as overlap thresholds may also help.
