pedestrian detection with rcnn - machine learningcs229.stanford.edu/proj2015/172_report.pdf ·...

Pedestrian Detection with RCNN

Matthew ChenDepartment of Computer Science

Stanford [email protected]

Abstract

In this paper we evaluate the effectiveness of us-ing a Region-based Convolutional Neural Net-work approach to the problem of pedestrian de-tection. Our dataset is composed of manuallyannotated video sequences from the ETH vi-sion lab. Using selective search as our proposalmethod, we evaluate the performance of severalneural network architectures as well as a baselinelogistic regression unit. We find that the best re-sult was split between using the AlexNet archi-tecture with weights pre-trained on ImageNet aswell as a variant of this network trained fromscratch.

1 Introduction

Pedestrian tracking has numerous applicationsfrom autonomous vehicles to surveillance. Tra-ditionally many detection systems were based offof hand tuned features before being fed into alearning algorithm. Here we take advantage ofrecent work in Convolutional Neural Networksto pose the problem as a classification and local-ization task.

In particular we will explore the use of Regionbased Convolutional Neural Networks. The pro-cess starts with separating the video into frameswhich will be processed individually. For each

frame we generate class independent proposalboxes, in our case with a method called selec-tive search. Then we train a deep neural net-work classifier to classify the proposals as eitherpedestrian or background.

2 Related Work

This paper many follows an approach knownas Regions Convolutional Neural Networks in-troduced in [6]. This method tackles the prob-lem classifying and localizing objects in an im-age by running detection on a series of pro-posal boxes. These proposal boxes are generallyprecomputed offline using low level class inde-pendent segmentation methods such as selectivesearch [12], though recent work has incorporatedthis process into the neural network pipeline [11].Given the proposals, a deep convolutional neuralnetwork is trained to generate features which arefed into class specific SVM classifiers. This ap-proach as proved successful in localizing a largeclass of items for the PASCAL VOC challenge.

For the architecture of our CNN we test abaseline logistic method and compare it to re-sults from implementations of well known CNNs.These CNNs include Cifarnet which was devel-oped in [9] for the cifar-10 dataset and Alexnetwhich won the 2012 Imagenet challenge [10]. Ad-ditonally we look at the effect of using pretrained

1

Figure 1: Original image on top left. Positiveselective search bounding boxes on top right.Warped background and pedestrian image onbottom left and right respectively

weights for Alexnet and fine tuning the last layer.

The eth pedestrian tracking dataset was es-tablished through through a sequence of papers[4] [3] [5]. These papers use additional informa-tion collected including stereo vision and odom-etry data as additional sources of informationfor their models. We are only using monocularcamera data for each of the collected video se-quences.

3 Dataset and Features

The dataset is composed of several sequences ofvideos produced by a camera on a moving plat-form. Each frame has hand labelled annotationsdenoting bounding boxes of pedestrians. Overallthere are seven sequences of videos with a com-bined 4,534 frames. The data was split into atraining and test set where we have the framesfrom five sequences (3,105 frames) in the train-

ing set and two sequences (1,429 frames) in thetest set as shown in Table 1. The annotationsare not complete in that they do not strictly la-bel all pedestrians in a given image. It is usuallythe case that only pedestrians which take up acertain subjective threshold of the screen are la-belled. This leads to what could be some falsenegatives in the training set.

Statistic Test Train

Num Images 774 4952Avg Pedestrians 7 7Avg Proposals 2278 2953Pos Proposals 230 111Neg Proposals 2048 2842

Table 1: Data statistics split up by training andtest sets

Only a subset of the data was used. For thetwo sequences in the test set, the annotationswere sparse in that they were recorded only onevery fourth frame. Thus we included only theframes which annotations existed. Additionally,for training, we set the number of proposals usedproportional to the number of positive proposals.Specifically we set the ratio at 1:2 positive pedes-trian images to negative background images.

4 Methods

The complete pipeline from video frame tobounding box output is shown in Figure 2. Westart with a given video sequence and split it upby frames. Then we run a algorithm to generateproposal bounding boxes, in this case we use se-lective search [12], which we cache for use acrossthe process. We pass these bounding boxes alongwith the original image to the detector whichis our convolutional neural network. The CNN

2

RawImage

ProposalBBs

PedestrianDetec-

tor

Non-MaximalSupres-

sion

img

BBs

scores

Figure 2: Pedestrian detection pipeline

produces softmax scores for each bounding boxwhich are used in the final non-maximal suppres-sion step.

4.1 Proposal Boxes

We tested two different proposal box methodwhich were popular in the literature [8]. Theedge box method [14] actually performed betterin terms of the average intersection over union(IOU) across all images in the set as shown inFigure 3. However we opted to use selectivesearch as it proposed fewer boxes and hence in-creased the runtime of our algorithm. We startedby precomputing proposal boxes for all frames inthe dataset. Then we created an image processorto preprocess and warp these proposals, whichvaried in size and aspect ratio, into a fixed size toinput into our neural network. The preprocess-ing involved mean subtraction and whitening foreach frame. Using these sub-sampled images wetrained a convolutional neural network on top ofthis data to classify a proposal as either back-ground or pedestrian.

0.0

0.2

0.4

0.6

test train

Dataset

IOU

Proposal Methodedgeboxss

Figure 3: Comparison of Selective Search andEdgeBox proposal algorithms

Given the proposals and ground truth bound-ing boxes we could then generate a training setwhere the proposals were labeled based on theiroverlap with the ground truth images. We used athreshold of 0.3 so that images which had at leastthis overlap with a given ground truth boundingbox were considered positive examples and therest negative.

4.2 Detector

We start by baselining our results relative to alogistic regression network that we train on ourimages. Additonally we experiment with variousneural network architectures and measure theirperformance on our task. We start with usingthe Cifarnet architecture which takes image at a32x32x3 scale [9].

For the alexnet pretrained architecture wemaintained the exact architecture specified bythe initial paper with the exception of the lastlayer which was replaced by a softmax functionwith two outputs initialized with random weights[10].The softmax function generates what can beinterpreted as the probability for each class andis defined as follows.

3

0.2

0.4

0.6

0.00 0.25 0.50 0.75 1.00

Recall

Pre

cisi

on

Architecturealexnetalexnet_pretrainedcifarnetlogitnet

Figure 4: Precision Recall curves for variousmethods on test set

p(y = i | x; θ) =eθ

Ti x∑k

j=1 eθTj x

Next implement a simplified variant of thealexnet architecture in which we remove groupconvolution and local response normalizationlayers. We modify the middle convolutional lay-ers maintain full spacial depth form the previouslayers . Doing so simplifies the implementationas grouping was used primarily due to memoryconstraints from training on two separate GPUsin the original implementation. These networkswere built using the tensorflow library [1] andtrained running on multiple CPUs in parallel(the exact number of CPUs and specificationsvaried as this was run across multiple servers).

5 Results

After training our net for 4000 iterations, inwhich each iteration was composed a random

0.7

0.8

0.9

1.0

0 25 50 75 100

False Positives

Mis

s ra

te

Architecturealexnetalexnet_pretrainedcifarnetlogitnet

Figure 5: Miss rate to false positives by method

proportional sub-sample of 50 images across theentire dataset, we tested each net on the com-plete test set. Figure 4 shows a comparison ofthe precision and recall curves of each net whenadjusting threshold for classifying the image aspedestrian as opposed to background. The preci-sion and recall is calculated on the test set usingthe same pedestrian overlap threshold to denotepositive examples for proposal boxes.

For AlexNet with pre-trained weights, all theweights besides the fully connected layers andthe last convolutional layer were frozen at ini-tialization for fine tuning. All other networkswere trained from scratch with weights initial-ized from a truncated normal distribution withstandard deviation proportional to input size.

We find that Alexnet with pretrained weightsfrom Imagenet performs the best in moderateprediction score thresholds while a slightly largervariant of Alexnet, trained from scratch, per-forms better at higher acceptance thresholds.The Cifarnet performance is not that far behind,which is interesting as it as order of magnitude

4

Figure 6: Example final output of our algorithm

fewer parameters and takes input images of size32x32x3 compared to 227x227x3 for alexnet.

For the non-maximal supression step we aimedfor a minimal threshold for our bounding boxesand used 0.05. We then did a final evaluation ofour algorithm by looking at the miss rate to falsepositive curve shown in Figure 5. We can seethat our best performance still has a 70 percentmiss rate. An example of the final output for agiven frame is shown in Figure 6.

6 Discussion

We find that AlexNet with pre-trained weightsfrom Imagenet and fine tuning of the last layersperforms the best on a majority of the thresh-old levels. Overall our approach still has a missrate that is too high for many real world appli-cations. This rate can be due to several factors.First is the proposal box method. As we haveshown the best bounding boxes only had a meanIOU ratio of around 0.6 which would serve asan upper bound on the accuracy of our over-all system. Next it is likely the case that ourneural networks could have benefited from addi-

tional training time to reach convergence as thenumber of iterations we used was relatively smallcompared to the amount of time it took to trainon imagenet and other large vision benchmarks.Additional hyperparameter tuning of parameterssuch as regularization on fully connected layerswould also likely improve the results.

The use of Region-based convolutional neu-ral networks for the task of pedestrian detectiondoes show promise. Our example output imageshows the technique produces reasonable bound-ing boxes. However additional work needs to bedone to improve the overall performance of thesystem.

7 Future Work

The main constraint of the work presented inthis paper was lack of GPU support for run-ning these models due to resource contraints.Most CNN training is currently implemented us-ing one or multiple GPUs which should have aorder of magnitude speed up. Training on aGPU would allow for more iterations throughthe dataset to increase the change of convergenceas well as runtime comparisons to current RCNNresults in other domains. Additionally the addedcomputational power would enable us to use alarger pedestrian dataset such as [2]. Trainingon a large dataset such as this one would allowthe nets to generalize better. This is especiallytrue for the larger network architectures whichrequire larger training sets corresponding to thelarger number of parameters to tune.

5

References

[1] Martın Abadi et al. “TensorFlow: Large-scale machine learning on heterogeneoussystems, 2015”. In: Software available fromtensorflow. org ().

[2] Piotr Dollar et al. “Pedestrian detection:An evaluation of the state of the art”.In: Pattern Analysis and Machine Intelli-gence, IEEE Transactions on 34.4 (2012),pp. 743–761.

[3] A. Ess et al. “A Mobile Vision System forRobust Multi-Person Tracking”. In: IEEEConference on Computer Vision and Pat-tern Recognition (CVPR’08). IEEE Press,2008.

[4] Andreas Ess, Bastian Leibe, and Luc VanGool. “Depth and appearance for mo-bile scene analysis”. In: Computer Vision,2007. ICCV 2007. IEEE 11th Interna-tional Conference on. IEEE. 2007, pp. 1–8.

[5] Andreas Ess et al. “Moving obstacle de-tection in highly dynamic scenes”. In:Robotics and Automation, 2009. ICRA’09.IEEE International Conference on. IEEE.2009, pp. 56–63.

[6] Ross Girshick. “Fast R-CNN”. In: Inter-national Conference on Computer Vision(ICCV). 2015.

[7] Ross Girshick et al. “Rich feature hierar-chies for accurate object detection and se-mantic segmentation”. In: Computer Vi-sion and Pattern Recognition (CVPR),2014 IEEE Conference on. IEEE. 2014,pp. 580–587.

[8] Jan Hosang, Rodrigo Benenson, andBernt Schiele. “How good are detec-tion proposals, really?” In: arXiv preprintarXiv:1406.6962 (2014).

[9] Alex Krizhevsky and Geoffrey Hinton.Learning multiple layers of features fromtiny images. 2009.

[10] Alex Krizhevsky, Ilya Sutskever, and Ge-offrey E Hinton. “Imagenet classificationwith deep convolutional neural networks”.In: Advances in neural information pro-cessing systems. 2012, pp. 1097–1105.

[11] Shaoqing Ren et al. “Faster R-CNN: To-wards Real-Time Object Detection withRegion Proposal Networks”. In: Advancesin Neural Information Processing Systems(NIPS). 2015.

[12] Koen EA Van de Sande et al. “Segmenta-tion as selective search for object recogni-tion”. In: Computer Vision (ICCV), 2011IEEE International Conference on. IEEE.2011, pp. 1879–1886.

[13] Stefan Van Der Walt, S Chris Colbert, andGael Varoquaux. “The NumPy array: astructure for efficient numerical computa-tion”. In: Computing in Science & Engi-neering 13.2 (2011), pp. 22–30.

[14] C Lawrence Zitnick and Piotr Dollar.“Edge boxes: Locating object proposalsfrom edges”. In: Computer Vision–ECCV2014. Springer, 2014, pp. 391–405.

6

pedestrian detection with rcnn - machine learningcs229.stanford.edu/proj2015/172_report.pdf ·...

Documents