Pedestrian Detection with R-CNN
Matthew Chen
Department of Computer Science
Stanford University
Pedestrian Image Examples
Figure 1: Original image on top left. Positive selective search bounding boxes on top right.
Figure 2: Warped Proposal Images
Figure 3: Negative Examples for training
Figure 4: Example detection output from the AlexNet architecture
Dataset
Statistic          Test   Train
Num Images          774    4952
Avg Pedestrians       7       7
Avg Proposals      2278    2953
Pos Proposals       230     111
Neg Proposals      2048    2842
Table 1: Data statistics split up by training and test sets
The dataset is composed of several video sequences produced by a camera on a moving platform. Each frame has hand-labelled annotations denoting the bounding boxes of pedestrians. Overall there are seven video sequences with a combined 4,534 frames. The data was split into a training set and a test set, with the frames from five sequences (3,105 frames) in the training set and two sequences (1,429 frames) in the test set, as shown in Table 1. The annotations are incomplete in that they do not strictly label every pedestrian in a given image. Typically only pedestrians that occupy more than some subjective fraction of the frame are labelled, so some unlabelled pedestrians may act as false negatives in the training set.
Figure 5: Comparison between the selective search and Edgebox proposal methods (x-axis: Dataset, test vs. train; y-axis: RMSE; legend, Proposal Method: edgebox, ss)
Methods
Figure 6: Pedestrian detection pipeline (Raw Image -> Proposal BBs -> Pedestrian Detector -> Non-Maximal Suppression; edge labels: img, BBs, scores)
The complete pipeline from video frame to bounding box output is shown in Figure 6. We start with a given video sequence and split it into frames. We then run an algorithm to generate proposal bounding boxes (in this case selective search) and cache the proposals for use across the process. We pass these bounding boxes, along with the original image, to the detector, which is our convolutional neural network. The CNN produces softmax scores for each bounding box, which are used in the final non-maximal suppression step.
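The final step of the pipeline can be illustrated with a minimal sketch of greedy non-maximal suppression. This is not the poster's implementation: the function names `iou` and `non_max_suppression`, the `(x1, y1, x2, y2)` box format, and the overlap threshold of 0.3 are all assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Greedy NMS (hypothetical helper, threshold value is an assumption).

    Visits boxes from highest to lowest score and keeps a box only if it
    does not overlap any already-kept box by more than iou_threshold.
    Returns the indices of the kept boxes in descending score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with an IoU of roughly 0.68, so it is suppressed, while the third box does not overlap either and is kept.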
Figure 7: Runtime per image on test set (x-axis: Architecture, one of alexnet, alexnet_pretrained, cifarnet, logitnet; y-axis: Run time per Image (sec))
Results
Figure 8: Precision-recall curve for proposal box detections (x-axis: Recall; y-axis: Precision; legend, Architecture: alexnet, alexnet_pretrained, cifarnet, logitnet)
Figure 9: Miss rate to false positives curve (x-axis: False Positives; y-axis: Miss rate; legend, Architecture: alexnet, alexnet_pretrained, cifarnet, logitnet)
Discussion
We find that AlexNet with pretrained weights from ImageNet performs best at moderate prediction score thresholds, while a slightly larger variant of AlexNet, trained from scratch, performs better at higher acceptance thresholds. There is plenty of future work to be done to improve these results. The main limiting factor was the lack of GPU support, which could have significantly sped up computation given the runtimes shown in Figure 7. Additional tuning of hyperparameters such as overlap thresholds may also help.
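To make the role of the score threshold concrete, here is a minimal sketch of how precision, recall, and miss rate change as the acceptance threshold moves. The function name `detection_metrics` and the toy scores and labels are made up for illustration; they are not the poster's data or evaluation code.

```python
def detection_metrics(scores, labels, threshold):
    """Precision, recall, miss rate, and false-positive count at a threshold.

    scores: detector confidence per proposal box.
    labels: 1 if the box is a true pedestrian, 0 otherwise.
    A box counts as a detection when its score is >= threshold.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "miss_rate": 1.0 - recall,  # fraction of pedestrians not detected
        "false_positives": fp,
    }


# Toy example (invented values): five scored proposals, three true pedestrians.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
moderate = detection_metrics(scores, labels, threshold=0.5)
strict = detection_metrics(scores, labels, threshold=0.85)
```

On this toy data, raising the threshold removes the one false positive but also misses more pedestrians, which is the trade-off traced by the precision-recall and miss-rate curves.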