
Learning Scene-Specific Pedestrian Detectors without Real Data

Hironori Hattori, Sony Corporation

[email protected]

Vishnu Naresh Boddeti, Kris Kitani, Takeo Kanade, The Robotics Institute, Carnegie Mellon University

[email protected], [email protected], [email protected]

Abstract

We consider the problem of designing a scene-specific pedestrian detector in a scenario where we have no real pedestrian data (i.e., neither labeled nor unlabeled). Our approach infers the potential appearance of pedestrians using geometric scene information and a customizable database of virtual simulations of pedestrian appearance and motion. We propose an efficient discriminative learning method that generates a spatially-varying pedestrian appearance model that takes into account the perspective geometry of the scene. Our method is able to learn a unique pedestrian classifier customized for every possible location in the scene. Our experiments show that our method, using purely synthetic data, is able to greatly outperform models trained on real scene-specific data when data is limited.

1. Introduction

Consider the scenario in which a new surveillance system is installed in a novel location and an image-based pedestrian detector must be trained without access to real scene-specific pedestrian data. A similar situation may arise when a new imaging system (e.g., a custom camera with unique lens distortion) has been designed and must be able to detect pedestrians without the expensive process of collecting data with the new imaging device. One can use a generic pedestrian detection algorithm trained over copious amounts of real data to work robustly across many scenes. However, generic models are not always best-suited for detection in specific scenes. In many surveillance scenarios, it is more important to have a customized pedestrian detection model that is optimized for a single scene. Creating a situation-specific pedestrian detector can potentially provide better performance. Optimizing for a single scene, however, often requires a labor-intensive process of collecting labeled data: drawing bounding boxes around pedestrians recorded with a particular camera in a specific scene. The process also takes time, as recorded video data must be manually mined for various instances of clothing, size, pose and location to build a robust pedestrian appearance model. Our goal is to develop a method for training a ‘data-free’ scene-specific pedestrian detector which outperforms generic detection algorithms (e.g., HOG-SVM [2], DPM [3]).

The above scenarios are ill-posed zero-instance learning problems, where an image-based pedestrian detector must be created without access to real data. Fortunately, in these scenarios we have access to two important pieces of information: (1) the camera’s calibration parameters, and (2) the scene geometry (in the form of the static locations of objects, the ground plane and walls). In this work, we show that with this information, it is possible to generate geometrically accurate simulations of pedestrian appearance as training data to act as a proxy for the real data. This allows us to learn a highly accurate scene-specific pedestrian detector. Moreover, we show that by using this ‘data-free’ technique (i.e., one that does not require real pedestrian data), we are still able to train a scene-specific pedestrian detector that outperforms state-of-the-art baselines.
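As a concrete illustration of how the calibration parameters and scene geometry together determine pedestrian appearance, the sketch below projects a pedestrian of known height standing at a ground-plane point into the image with a pinhole-camera model, yielding the location-specific bounding box (and hence scale) at which a detector must fire. The intrinsics, pedestrian dimensions, and coordinate convention are illustrative assumptions, not values from the paper.

```python
import numpy as np

def project(K, X):
    """Project a 3D point X (in camera coordinates) to pixel coordinates."""
    x = K @ np.asarray(X, dtype=float)
    return x[:2] / x[2]

def pedestrian_bbox(K, foot_xyz, height_m=1.7, width_m=0.6):
    """Approximate image bounding box for a pedestrian standing at foot_xyz.

    Assumes camera coordinates with y pointing down, so 'up' in the world
    is -y; this convention, like the body dimensions, is an assumption.
    """
    foot = np.asarray(foot_xyz, dtype=float)
    head = foot + np.array([0.0, -height_m, 0.0])
    f_px = project(K, foot)
    h_px = project(K, head)
    # Scale the body width by the same depth-dependent factor as the height.
    px_per_m = abs(f_px[1] - h_px[1]) / height_m
    half_w = 0.5 * width_m * px_per_m
    return (f_px[0] - half_w, h_px[1], f_px[0] + half_w, f_px[1])

# Hypothetical 640x480 camera: focal length 500 px, principal point at center.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# Pedestrian 10 m in front of the camera, feet 1.5 m below the optical axis.
box = pedestrian_bbox(K, (0.0, 1.5, 10.0))
```

For a calibrated scene, iterating `pedestrian_bbox` over a grid of walkable ground-plane points gives the geometry of the synthetic renderings described above.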

2. Approach

The key idea of our approach is to maximize the geometric information about the scene to compensate for the lack of real training data. A geometrically consistent method for synthetic data generation has the following advantages. (1) An image-based pedestrian detector can be trained on a wider range of pedestrian appearance. Instead of waiting and collecting real data, it is now possible to generate large amounts of simulated data over a wide range of pedestrian appearance (e.g., clothing, height, weight, gender) on demand. (2) Pedestrian data can be generated for any location in the scene. Taken to the extreme, a synthetic data-generation framework can be used to learn a customized pedestrian appearance model at every possible location (pixel) in the scene. (3) The location of static objects in the scene can be incorporated into data synthesis to pre-emptively train for occlusion.
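Advantage (3) can be made concrete with simple rectangle geometry: once a rendered pedestrian and the scene's static obstacles are projected into the image, the occluded fraction of each training sample is known in advance. A minimal sketch, assuming axis-aligned (x0, y0, x1, y1) boxes (the box format and example numbers are hypothetical):

```python
def occluded_fraction(ped_box, obstacle_boxes):
    """Largest fraction of the pedestrian box covered by a single obstacle.

    Illustrative sketch: treats boxes as axis-aligned (x0, y0, x1, y1)
    rectangles and ignores overlap among the obstacles themselves.
    """
    px0, py0, px1, py1 = ped_box
    ped_area = (px1 - px0) * (py1 - py0)
    best = 0.0
    for ox0, oy0, ox1, oy1 in obstacle_boxes:
        iw = max(0.0, min(px1, ox1) - max(px0, ox0))  # intersection width
        ih = max(0.0, min(py1, oy1) - max(py0, oy0))  # intersection height
        best = max(best, iw * ih / ped_area)
    return best

# A pedestrian box whose lower half is covered by an obstacle (made-up numbers):
frac = occluded_fraction((0, 0, 10, 20), [(0, 10, 10, 30)])  # → 0.5
```

Binning synthetic samples by this fraction lets the training set cover the occlusion patterns each location can actually exhibit.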

In our proposed approach, we simultaneously learn hundreds of pedestrian detectors for a single scene using millions of synthetic pedestrian images. Since our approach is purely dependent on synthetic data, the algorithm requires no real-world data. The main contributions of our work are




Figure 1. Overview: For every grid location, geometrically correct renderings of pedestrians are synthetically generated using known scene information such as camera calibration parameters, obstacles (red), walls (blue) and walkable areas (green). We learn a field of location-specific pedestrian detectors that are jointly trained and are run in parallel at every grid location at inference.

as follows: (1) the first work to learn a scene-and-location-specific, geometry-aware pedestrian detection model using purely synthetic data, and (2) an efficient and scalable algorithm for discriminatively learning a large number of scene-and-location-specific pedestrian detectors. Our algorithmic framework uses highly efficient correlation filters [1] as its basic detection unit and globally optimizes each model by taking into account the appearance of a pedestrian over a small spatial region. At inference, we note that our approach requires the same number of operations as the sliding-window approach.
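For readers unfamiliar with correlation filters, the basic detection unit admits a fast closed-form solution in the Fourier domain. The sketch below is a simplified single-channel MOSSE-style filter, shown only as a stand-in for the multi-channel, jointly optimized filters of [1]; the regularization value is an arbitrary assumption.

```python
import numpy as np

def train_corr_filter(patches, target, lam=1e-2):
    """Closed-form correlation filter (simplified, single channel).

    patches: (n, H, W) training images; target: (H, W) desired response
    (e.g., a peak at the pedestrian center). lam is a small regularizer
    whose value here is an arbitrary choice.
    """
    G = np.fft.fft2(target)
    num = np.zeros_like(G)
    den = np.full_like(G, lam)
    for p in patches:
        F = np.fft.fft2(p)
        num += G * np.conj(F)  # accumulate cross power spectra
        den += F * np.conj(F)  # accumulate auto power spectra
    return num / den           # filter in the Fourier domain

def apply_corr_filter(H_hat, patch):
    """Correlate a patch with the trained filter; returns the response map."""
    return np.real(np.fft.ifft2(H_hat * np.fft.fft2(patch)))
```

Training one such filter per grid location, with a joint objective that encourages nearby filters to vary smoothly, recovers the spirit of the approach above.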

3. Experimental Evaluation

Experimental evaluation showed that our model outperforms several baseline approaches (classical pedestrian detection models, hybrid synthetic-real models, and a baseline which leverages the scene geometry and camera calibration parameters at the inference stage) both in terms of image-plane localization and localization in 3D for scene-specific pedestrian detection. More importantly, our experiments also yield a surprising result: our method, using purely synthetic data, is able to outperform models trained on real scene-specific data when data is limited. Figure 3 shows the 3D trajectories of our proposed approach and Fig. 2 shows the precision-recall curves for 2D and 3D localization.
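The two localization criteria can be stated precisely: a 2D detection counts as correct when its intersection-over-union (IoU) with a ground-truth box reaches the overlap threshold (0.70 in Fig. 2), and a 3D detection when its ground-plane distance to the ground truth is within 50 cm. A minimal sketch of the 2D test (box format and example numbers are hypothetical):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def is_correct_2d(det, gt, thresh=0.70):
    """2D localization criterion: IoU must reach the overlap threshold."""
    return iou(det, gt) >= thresh

# A detection shifted by 2 px against a 10x20 ground-truth box:
ok = is_correct_2d((2, 0, 12, 20), (0, 0, 10, 20))  # → False (IoU ≈ 0.67)
```

The strict 0.70 threshold rewards the scale accuracy that the geometry-aware renderings provide, which is one reason location-specific models fare well under it.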

Figure 2. PR curves for 2D (top) and 3D (bottom) localization. The average precision (AP) of each method, recovered from the plot legends:

                     2D localization              3D localization
                  (overlap ratio = 0.70)        (distance = 50 cm)
 Method      Town Ctr  PETS 2006  CMUSRD    Town Ctr  PETS 2006  CMUSRD
 G            0.3852    0.2524    0.6814     0.3797    0.5537    0.7367
 G+           0.3872    0.2703    0.6846     0.3823    0.5469    0.7433
 DPM          0.3524    0.3352    0.5304     0.2482    0.3847    0.5727
 DPM+         0.5280    0.4841    0.5393     0.4020    0.6544    0.6064
 SS           0.3004    0.2261    0.7467     0.4003    0.5759    0.9172
 SSV          0.2915    0.1505    0.5864     0.3493    0.5389    0.8273
 SSV+         0.3449    0.1512    0.6175     0.3876    0.5217    0.8340
 Ours         0.6997    0.5922    0.8211     0.7397    0.8493    0.9176

Figure 3. 3D localization trajectories of our method (red) and ground truth (black).

4. Conclusions

Synthesis-based training techniques are well suited for the current paradigm of data-hungry object detectors. We presented a purely synthetic approach to training scene-and-location-specific pedestrian detectors. We showed that by leveraging the parameters of the camera and the known geometric layout of the scene, we are able to learn customized pedestrian models for every part of the scene. Our experiments demonstrate that such a model outperforms several baseline approaches in terms of both 2D and 3D localization. Although we have focused primarily on the use of scene geometry for synthesis, it is only the first step in maximizing prior scene knowledge for synthesis.

References

[1] V. N. Boddeti, T. Kanade, and B. V. K. Vijaya Kumar. Correlation filters for object alignment. In CVPR, 2013.

[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.