

Shape-Based Human Detection and Segmentation via Hierarchical Part-Template Matching

Zhe Lin, Member, IEEE, and Larry S. Davis, Fellow, IEEE

Abstract—We propose a shape-based, hierarchical part-template matching approach to simultaneous human detection and segmentation combining local part-based and global shape-template-based schemes. The approach relies on the key idea of matching a part-template tree to images hierarchically to detect humans and estimate their poses. For learning a generic human detector, a pose-adaptive feature computation scheme is developed based on a tree matching approach. Instead of traditional concatenation-style image location-based feature encoding, we extract features adaptively in the context of human poses and train a kernel-SVM classifier to separate human/nonhuman patterns. Specifically, the features are collected in the local context of poses by tracing around the estimated shape boundaries. We also introduce an approach to multiple occluded human detection and segmentation based on an iterative occlusion compensation scheme. The output of our learned generic human detector can be used as an initial set of human hypotheses for the iterative optimization. We evaluate our approaches on three public pedestrian data sets (INRIA, MIT-CBCL, and USC-B) and two crowded sequences from Caviar Benchmark and Munich Airport data sets.

Index Terms—Generic human detector, part-template tree, hierarchical part-template matching, pose-adaptive descriptor, occlusion analysis.


1 INTRODUCTION

HUMAN detection is a fundamental problem in video surveillance. It can provide an initialization for human segmentation. More importantly, robust human tracking and identification are highly dependent on reliable detection and segmentation in each frame since better segmentation can be used to estimate more accurate and discriminative appearance models. Although the problem of human detection has been well studied in vision, it still remains challenging due to highly articulated body postures, viewpoint changes, varying illumination conditions, occlusion, and background clutter. Combinations of these factors result in large variability of human shapes and appearances in images. Our goal is to develop a robust and efficient approach to detect and segment (possibly partially occluded) humans under varying poses and camera viewpoints.

1.1 Previous Work

Many approaches have been introduced for human detection over the last decade and significant progress has been made in terms of robustness and efficiency.

Most commonly, human detection is formulated as a binary sliding window classification problem, i.e., a fixed-size window is scanned over an image pyramid and bounding boxes are localized around humans based on some nonmaximum suppression procedure. In terms of visual cues, shape has been the most dominant cue for detecting humans in still images due to large appearance variability. Motion has been widely used as a complementary cue for detecting humans in videos.
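The sliding-window formulation above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the window size, stride, pyramid scale, and IoU threshold are illustrative assumptions.

```python
import numpy as np

def image_pyramid(img, scale=1.2, min_size=64):
    """Yield progressively downscaled copies of a grayscale image."""
    while min(img.shape) >= min_size:
        yield img
        h, w = img.shape
        new_h, new_w = int(h / scale), int(w / scale)
        # Nearest-neighbor resize keeps the sketch dependency-free.
        rows = (np.arange(new_h) * h / new_h).astype(int)
        cols = (np.arange(new_w) * w / new_w).astype(int)
        img = img[np.ix_(rows, cols)]

def sliding_windows(img, win=(128, 64), stride=8):
    """Yield (y, x) top-left corners of fixed-size detection windows."""
    h, w = img.shape
    for y in range(0, h - win[0] + 1, stride):
        for x in range(0, w - win[1] + 1, stride):
            yield y, x

def nms(boxes, scores, iou_thr=0.5):
    """Greedy nonmaximum suppression over (y1, x1, y2, x2) boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        yy1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        xx1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        yy2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        xx2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(yy2 - yy1, 0, None) * np.clip(xx2 - xx1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thr]  # suppress heavily overlapping boxes
    return keep
```

The classifier score for each window would be computed between the scan and the suppression step; any of the cited detectors (HOG+SVM, boosted cascades) slots into that position.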

In the object category detection literature, shape modeling or shape feature extraction schemes can be roughly classified into two categories. The first category models human shapes globally or densely over image locations, e.g., an overcomplete set of Haar wavelet features in [1], rectangular features in [2], histograms of oriented gradients (HOGs) in [3], locally deformable Markov models in [4], or covariance descriptors in [5]. Global, dense feature-based approaches such as [3], [5] are designed to tolerate some degree of occlusions and shape articulations with a large number of samples and have been demonstrated to achieve excellent performance with well-aligned, more or less fully visible training data. In [6], [7], [8], human shapes are more directly modeled as a set of global shape templates organized in a hierarchical tree. Due to the nature of global modeling, these approaches require a large number of positive training data.

The second category models an object shape using sparse local features or as a collection of visual parts. Local feature-based approaches learn body part and/or full-body detectors based on sparse interest points and descriptors [9], [10], [11], from predefined pools of local curve segments [12], [13], a contour segment network [14], k-adjacent segments [15], or edgelets [16]. Part-based approaches model an object shape as a rigid or deformable configuration of visual parts [9], [17], [16], [18]. Part-based representations have been shown

604 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 4, APRIL 2010

. Z. Lin is with the Advanced Technology Labs, Adobe Systems Incorporated, 345 Park Avenue, San Jose, CA 95110. E-mail: [email protected].

. L.S. Davis is with the Department of Computer Science, University of Maryland, 3301 A.V. Williams Building, College Park, MD 20742. E-mail: [email protected].

Manuscript received 16 Oct. 2008; revised 2 July 2009; accepted 25 Nov. 2009; published online 22 Dec. 2009. Recommended for acceptance by A. Srivastava, J.N. Damon, I.L. Dryden, and I.H. Jermyn. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMISI-2008-10-0709. Digital Object Identifier no. 10.1109/TPAMI.2009.204.

0162-8828/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society

to be very effective for handling partial occlusions. In [17], several part detectors are trained separately for each body part and combined with a second-level classifier. Mikolajczyk et al. [9] use local features for part detection and assemble the part detections probabilistically. Wu and Nevatia [16] introduce edgelet features for human detection. They extend this approach to a general object detection and segmentation approach by designing local shape-based classifiers [19]. Shet et al. [18] propose a logical reasoning-based method for efficiently assembling part detections. One problem with these part-based detection approaches is that, in very cluttered images, too many detection hypotheses may be generated and a robust assembly method (e.g., boosting) is thus needed to combine these detections.

Deformable part or patch-based approaches such as [20], [21], [22], [23] established an elegant, probabilistic framework for modeling object shape articulation. They have been very successful in many vision applications such as handwritten digit recognition, face detection, and object categorization. These approaches commonly represent an object as a collection of parts and enforce certain constraints on the spatial configuration of parts to handle object shape articulation. For example, Amit and Trouvé [23] proposed a nice, part-based object recognition framework (called patchwork of parts), which is capable of performing object detection and handling occlusion precisely. They applied it successfully to handwritten digit recognition and face detection. Recently, Felzenszwalb et al. [24] combined a deformable part-based model with support vector machines (SVMs) for detecting highly articulated objects in challenging real-world images. This approach achieved very promising results on many object categories in the PASCAL challenges.

Global approaches require training a single, strong classifier; hence, they are much simpler than part-based approaches, where each part needs to be trained separately and an additional assembly classifier must also be trained. On the other hand, part (or local feature)-based approaches can more readily handle partial occlusions and are more flexible in dealing with shape articulations compared to global approaches. The shape modeling schemes can also be combined with appearance cues for simultaneous detection and segmentation [7], [25], [26]. Shape cues are also combined with motion cues for human detection in [27], [28], and for simultaneous detection and segmentation in [29].

From a learning perspective, there are generative and discriminative approaches for learning object detectors. Some generative approaches construct a tree-based data structure for efficient shape matching [6], [7], [8]. Typically, a score is computed for each detection window based on template matching and nearest neighbor search. The most commonly used discriminative approaches include cascaded adaboost classifiers [2], [5], [27], [30] and support vector machines [3], [28], [31]. Dalal and Triggs [3] introduced HOG features and provided an extensive experimental evaluation using linear and Gaussian-kernel SVMs as the test classifiers. Later, Zhu et al. [30] improved its computational efficiency significantly by utilizing a boosted cascade of rejectors. Recently, Tuzel et al. [5] reported significantly better detection performance than [3] on the INRIA data set. They use covariance matrices as image descriptors and classify patterns on Riemannian manifolds. Similarly, Maji et al. [31] also demonstrate promising results

using multilevel HOG descriptors and faster (histogram intersection) kernel SVM classification. In [32], twofold adaboost classifiers are adopted for simultaneous part selection and pedestrian classification. Wu and Nevatia [33] combine different features in a single classification framework. Pang et al. [34] introduced a multiple instance learning-based scheme for better aligning positive examples in training images.

For surveillance scenarios, motion blob information provides very reliable cues for human detection. Numerous approaches have been introduced for detecting and tracking humans using motion blob information. These blob-based approaches are computationally more efficient than purely shape-based generic approaches, but have a common problem that the results crucially depend on background subtraction or motion segmentation. These approaches are mostly developed for detecting and tracking humans under occlusion. Some earlier methods [35], [36] model the human tracking problem by a multiblob observation likelihood given a human configuration. Zhao and Nevatia [37] introduce an MCMC-based optimization approach to human segmentation from foreground blobs. They detect heads by analyzing edges surrounding binary foreground blobs, formulate the segmentation problem in a Bayesian framework, and optimize by modeling jump and diffusion dynamics in MCMC to traverse the complex solution space. Following this work, Smith et al. [38] propose a similar transdimensional MCMC model to track multiple humans using particle filters. Later, an EM-based approach was proposed by Rittscher et al. [39] for foreground blob segmentation. Zhao et al. [40] use a part-based human body model to fit binary blobs and track humans.

Our motivations are summarized as follows: First, hierarchical template matching [6], [8] is a convenient way to efficiently integrate detection and segmentation of shapes, but it is computationally expensive due to the necessity of collecting and matching with a large number of global shape templates. Next, previous discriminative approaches such as [3], [5], [33] mostly train a binary classifier on a large number of positive and negative samples where humans are roughly center-aligned. These approaches represent appearances by concatenating information along 2D image coordinates for capturing spatially recurring local shape events in training data. However, due to highly articulated human poses and varying viewing angles, a very large number of (well-aligned) training samples are required; moreover, the inclusion of information from whole images inevitably makes them sensitive to biases in training data (in the worst case, significant negative effects can occur from arbitrary image regions); consequently, the generalization capability of the trained classifier can be compromised. Finally, although recent deformable part-based models such as [24] and multiple instance-based learning schemes such as [41] are very effective for localizing objects, they are limited in estimating more precise shape and pose segmentations from the detection process due to simple rectangular part-shape assumptions.

1.2 Overview of Our Approach

In this paper, we tackle the difficult problem of simultaneously detecting and segmenting multiple (possibly partially occluded) humans. For achieving the goal, we


propose a hierarchical part-template matching approach [42] and combine it with discriminative learning for building a generic human detector. Given an input still image, the detector returns human bounding boxes as well as precise shape segmentation masks. Our approach takes advantage of both local part-based and global template-based human detectors by decomposing global shape models and constructing a part-template tree to model human shapes flexibly and efficiently. Shape observations (edges or local gradient orientations) are matched to the part-template tree efficiently to determine a reliable set of human detection hypotheses. Shapes and poses are estimated automatically through synthesis of part detections. Our approach is similar to [24], [41], [43] in the spirit of allowing part deformations, but our model detects and segments humans simultaneously and explicitly handles partial occlusions.

Using the hierarchical part-template matching scheme, we extract features adaptively in the local context of poses, i.e., we propose a pose-adaptive feature extraction method [44] for better discriminating humans from nonhumans. The intuition is that pose-adaptive features produce much better spatial repeatability of local shape events. Specifically, we segment human poses on both positive and negative samples1 and extract features adaptively in local neighborhoods of pose contours, i.e., in the pose context. The set of all possible pose instances is mapped to a canonical pose such that points on an arbitrary pose contour have one-to-one correspondences to points in the canonical pose. This ensures that our extracted feature descriptors correspond well to each other, and are also invariant to varying poses.

For extending the tree matching algorithm to multiple occluded humans, a set of detection hypotheses is estimated by our generic human detector and is iteratively refined/optimized through matching score reevaluation and fine occlusion analysis. For meeting the requirements of real-time surveillance systems, we also combined the approach with background subtraction to improve efficiency, where region information provided by foreground blobs is combined with shape information from the original image for improving robustness. Results show that our approach achieves state-of-the-art accuracy on several detection benchmarks, and the segmentation results are very good. Our approach is robust to noisy background subtractions or motion segmentations in detecting multiple humans from videos. Our extended approach outputs a set of human bounding boxes, occlusion ordering, as well as precise shape segmentations given a static video sequence.

Our main contributions are summarized as follows:

. A part-template tree model and its automatic learning algorithm are introduced for simultaneous human detection and pose segmentation. The approach combines popular local part-based object detectors with global shape-template-based schemes.

. A fast hierarchical part-template matching algorithm is proposed to estimate human shapes and poses by matching local image cues such as gradient magnitudes and/or orientations. Human shapes and poses are represented by part-based parametric models.

. Estimated optimal poses are used to impose spatial priors (for possible humans) for encoding pose-adaptive features in the local pose contexts. One-to-one correspondence is established between sets of contour points of an arbitrary pose and a canonical pose.

. The tree matching algorithm is also extended to handle multiple occluded human detection and segmentation in challenging surveillance scenarios. We estimate human configuration by performing optimization in a greedy fashion based on an iterative process of shape matching score reevaluation and fine occlusion analysis.

For better understanding of our approaches in subsequent sections, we summarize notations in Table 1.

The remainder of the paper is organized as follows: Section 2 introduces the details of our hierarchical part-template matching approach. Section 3 describes the pose-adaptive feature extraction method for learning generic human detectors. Section 4 addresses our extended approach to multiple human detection and segmentation. Section 5 briefly introduces the incorporation of motion and


TABLE 1
A Table of Notations

1. For negative samples, pose estimation is forced to proceed even though no person is in them.

geometry cues to the framework. Section 6 presents experiments and evaluations. Finally, Section 7 concludes the paper and discusses possible future extensions.

2 HIERARCHICAL PART-TEMPLATE MATCHING

We introduce a hierarchical shape matching approach to efficiently search for rough human poses (shapes) and compute a matching score for each candidate detection window. For achieving this, we take advantage of local part-based and global shape-template-based approaches by combining them in a unified top-down and bottom-up search scheme. Specifically, we extend the hierarchical template matching method in [6], [8] by decomposing the global shape models into parts and constructing a new part-template-based tree that captures appearance correlations between part models from the training database of human shapes.

2.1 Generating the Part-Template Tree Model

Due to the high degree of freedom of the human pose space, learning precise part-based human pose/shape models directly from real images would require a very large ground truth silhouette data set, and obtaining aligned parts for all training examples is very challenging. We alleviate this problem by a two-phase pose learning scheme, which works well even without any ground truth silhouettes.2 We first generate a flexible set of global shape models by part synthesis using a simple pose generator and then construct a part-template hierarchy using a body part decomposer. Parallelogram-shaped parts are spatially combined for modeling rough human part shapes. We obtain part-template models by decomposing the global shape models spatially and construct a part-template tree to capture human pose variation. Next, we refine the synthesized tree model by learning from a small set of real pose images.

2.1.1 Synthesizing Global Shape Models

For capturing the articulation of human poses, we represent the human body with six part regions: head, torso, a pair of upper legs, and a pair of lower legs. Global shape models are synthesized by combining these part regions spatially, i.e., as a union of these part region instances. The shape of each part region is modeled by a horizontal parallelogram characterized by its center location (2 dofs), size (2 dofs), and orientation parameters (1 dof). Thus, the number of degrees of freedom for the whole body is ideally 5 × 6 = 30. We generate these global models mainly in order to obtain an automatically pose-aligned, compact set of part-template models (characterized by both edge and region information) by spatial decomposition.
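The parallelogram part model can be sketched in code. This is a minimal illustration under stated assumptions: orientation is realized as a horizontal shear, and the raster size and function names are hypothetical, not the paper's implementation.

```python
import numpy as np

def parallelogram_mask(shape, center, size, shear_deg):
    """Binary mask of a horizontal parallelogram part region.

    center=(cy, cx) and size=(height, width) cover the 4 location/size
    dofs; shear_deg is the 1 orientation dof: each pixel row is shifted
    horizontally in proportion to its vertical offset from the center.
    """
    h, w = shape
    cy, cx = center
    ph, pw = size
    ys, xs = np.mgrid[0:h, 0:w]
    dy = ys - cy
    # Horizontal shear implements the part orientation.
    dx = xs - cx - dy * np.tan(np.radians(shear_deg))
    return (np.abs(dy) <= ph / 2) & (np.abs(dx) <= pw / 2)

def synthesize_body(shape, parts):
    """A global shape model is the union of its part region instances."""
    mask = np.zeros(shape, dtype=bool)
    for center, size, shear in parts:
        mask |= parallelogram_mask(shape, center, size, shear)
    return mask
```

Supplying six (center, size, shear) tuples, one per part region, yields one global shape model as a binary silhouette.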

For simplifying the initial tree construction, global shapes are synthesized using only six degrees of freedom (head position, torso width, orientations of upper/lower legs) given the torso position as the reference, and the remaining parameters (or dofs) are regarded as hidden variables estimated in the online detection/test phase. Heads and torsos are simplified to vertical rectangles (fixed orientations) and the rectangular shapes are further modified to

have rounded shapes at corners to make them more realistic. The selected six parameters are evenly quantized into {3, 2, 3, 3, 3, 3} values in their ranges,3 respectively. Denser quantization and larger ranges would improve accuracy, but we found that the above quantization and ranges are sufficient for handling standing humans. Finally, instances of part regions are independently combined to form 3 × 2 × 3 × 3 × 3 × 3 = 486 global shape models. Examples of generated global shape models are shown in Fig. 1a.
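The independent combination of quantized parameter values is a plain Cartesian product; the sketch below only reproduces the counting, with the quantization levels taken from the text:

```python
from itertools import product

# Quantization levels for the six synthesis dofs (head position,
# torso width, and four leg orientations), as stated in the text.
levels = [3, 2, 3, 3, 3, 3]

# Each global shape model is one tuple of quantized parameter indices.
global_models = list(product(*(range(n) for n in levels)))
assert len(global_models) == 486  # 3 * 2 * 3 * 3 * 3 * 3
```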

2.1.2 Generating Parts by Decomposition

In order to obtain parts from the global shape models, we first binarize the global shape models in Fig. 1a to obtain silhouettes (Fig. 1b) and extract boundaries of the silhouettes (Fig. 1c). Next, silhouettes and boundaries are decomposed into three parts (head-torso (ht), upper legs (ul), and lower legs (ll)) as shown in Figs. 1b and 1c. This will generate a set of part regions and a set of part shapes (curves). By decomposing the global models into local part models, we allow deformation of each part and of the spatial relations between parts to handle a wider range of human shape articulations. The parameters of the three parts ht, ul, ll are denoted by θht, θul, and θll, where each parameter θj = (θj^ind, θj^loc) consists of the index θj^ind of the corresponding part (e.g., θul^ind = 3 means the third part-template) and the location parameter θj^loc, which encodes the part's location relative to a detection window. The number of global shape models is determined by the degrees of freedom of the part regions (before synthesis), whereas the number of part-templates is equal to the number of nodes in the tree, so it is much smaller than the number of global models.
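The per-part parameterization θj = (θj^ind, θj^loc) maps naturally onto a small record type. The field names and the example values below are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PartParam:
    """theta_j = (theta_j^ind, theta_j^loc): which part-template is
    selected, and where it sits relative to the detection window."""
    index: int            # theta^ind: id of the part-template
    loc: Tuple[int, int]  # theta^loc: (dy, dx) offset in the window

# A full-body pose hypothesis is one parameter per decomposed part
# (values here are hypothetical).
hypothesis = {
    "ht": PartParam(index=2, loc=(0, 0)),     # head-torso
    "ul": PartParam(index=3, loc=(64, 2)),    # upper legs
    "ll": PartParam(index=17, loc=(96, -1)),  # lower legs
}
```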

2.1.3 Constructing an Initial Tree Model Using Parts

Given the set of indexed parts, a part-template tree is constructed by placing the decomposed part regions and


Fig. 1. (a) Generation of global shape models by part synthesis, (b) decomposition of global silhouette and boundary models into region, and (c) shape part-templates.

2. In this case, we can impose a uniform prior on the branching probability distribution for the tree.

3. The range of each parameter is determined empirically, e.g., the angle of a single leg part is in the range (−30°, 30°).

boundary fragments into a tree as illustrated in Fig. 2. The tree edges (or links) are determined automatically from the relation between part indices in the global shape models. The tree has four layers denoted by L0, L1, L2, and L3, where L0 is the (empty) root node, L1 consists of side-view head-torso templates L1,i, i = 1, 2, 3 and front/back-view head-torso templates L1,i, i = 4, 5, 6, and similarly, L2 and L3 consist of upper and lower leg poses for side and front/back views. Hence, each part-template in the tree represents an instance of a part, and can also be viewed as a parametric model, where part location and sizes are the model parameters.

The number of part-templates in the tree is directly determined by the number of generated global shape models. A much larger set of part-templates can be obtained by finer discretization and wider ranges of parameters when generating the global shape models, but the matching performance only slightly improves with a linear increase in computational requirements w.r.t. the number of templates in the tree. As shown in the figure, the initially generated tree consists of 186 part-templates, i.e., 6 ht models, 18 ul models, and 162 ll models, organized hierarchically based on the layout of human body parts in a top-to-bottom manner. Due to the tree structure, a fast hierarchical shape (or pose) matching scheme can be applied using the model. For example, using hierarchical part-template matching (which is explained later), we only need to match 24 part-templates to account for the complexity of matching 486 global shape models using the method in [6], so it is extremely fast. This initial part-template tree model is visualized in Fig. 2. The initial tree structure and part-templates are constructed from synthetic silhouettes as described above, but the tree is refined by learning from real images, as described in the following section.
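One way the stated counts fit together (assuming each head-torso node selects among the 9 leg-orientation combinations of its view, with upper-leg templates shared across the 3 head-torso nodes per view; the sharing is our reading, not spelled out in the text) is:

```python
# Choices examined per layer during one greedy top-down descent:
# 6 ht candidates, then 9 ul children of the chosen ht, then 9 ll
# children of the chosen ul.
CHOICES = [6, 9, 9]

# Distinct templates stored per layer (ul templates shared across
# the 3 ht nodes of each view: 2 views * 9 = 18).
UNIQUE = [6, 18, 162]

paths = 1
for b in CHOICES:
    paths *= b               # one root-to-leaf path per global model

templates = sum(UNIQUE)      # part-templates held in the tree
greedy_cost = sum(CHOICES)   # templates matched per detection window

assert paths == 486          # global shape models
assert templates == 186      # stored part-templates
assert greedy_cost == 24     # matched instead of 486 full templates
```

Under this reading, the 24-versus-486 speedup quoted in the text is just the sum of per-layer choices replacing their product.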

In Fig. 2, any tree path (from the root node to a leaf node) uniquely determines a global shape model, and a unique tree path is associated with each of the global shape models.

2.2 Learning the Part-Template Tree

The set of all global shape models is formed by enumerating all possible part (or parallelogram) parameters, so the constructed part-template tree does not contain any prior statistics from real human silhouettes. We therefore learn the

priors on the appearance probabilities of part-templates by matching them with real human silhouettes. Since we place the part-template models in a tree configuration, the prior is estimated as conditional probability distributions (branching probabilities at each internal tree node).

In order to more efficiently and reliably estimate human shapes and poses by matching the tree to an input image, we learn the probabilities associated with the edges of the part-template tree model using real images. Specifically, the learning is performed by matching the tree to a set of human silhouette images. The goal is to explicitly estimate branching probability distributions, i.e., conditional distributions of tree edges (arrows in Fig. 2 connecting nodes between two consecutive layers) in each of the tree layers for handling the observed range of shape articulation. For example, given the empty node at Layer L0, we estimate the probability distribution of the six ht models at Layer L1 based on their occurrence frequencies in the set of training silhouettes.

The silhouette training set consists of 404 (plus mirrored versions) binary silhouette images (white foreground and black background). Those binary silhouette images were obtained by manually segmenting a subset of positive image patches of the INRIA person database. Each of the training silhouette images is passed through the tree from the root node to a leaf node for identifying the optimal path (corresponding to the largest matching score). The optimal path is estimated using a greedy search algorithm by selecting a locally optimum branch at each node. This can be replaced by a more sophisticated dynamic programming algorithm, but we found that the greedy search works very well and is much faster. The matching score for each node is computed as the degree of coverage between the part-template at the current node and the observed silhouette (i.e., the proportion of pixels inside the hypothesis consistent with the training silhouette). This process is repeated for all the training silhouettes and an optimal path is estimated for every silhouette. Then, based on the set of paths, a branching probability distribution (conditional distribution of branching edges) is estimated for each node based on occurrence frequencies.
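The greedy path search and the frequency-based prior estimation can be sketched as follows. The dictionary encoding of the tree and masks is an illustrative assumption; only the coverage score and the counting follow the description above.

```python
import numpy as np
from collections import Counter, defaultdict

def coverage(template, silhouette):
    """Proportion of template pixels consistent with the silhouette."""
    return (template & silhouette).sum() / max(template.sum(), 1)

def greedy_path(tree, masks, silhouette):
    """Descend from the root, taking the best-covering child at each node."""
    path, node = [], "root"
    while tree.get(node):  # stop at a leaf
        node = max(tree[node],
                   key=lambda c: coverage(masks[c], silhouette))
        path.append(node)
    return path

def branching_priors(tree, masks, silhouettes):
    """Branching distributions from occurrence frequencies of optimal paths."""
    counts = defaultdict(Counter)
    for s in silhouettes:
        parent = "root"
        for child in greedy_path(tree, masks, s):
            counts[parent][child] += 1
            parent = child
    return {p: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for p, cnt in counts.items()}
```

Running `branching_priors` over the silhouette training set fills in the conditional distribution at every internal node; leaves whose subtrees never win simply receive no probability mass.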

Given the learned branching probabilities, any tree path from the root to a leaf can be associated with three probability values (for the three parts ht, ul, and ll, respectively). Each tree node now carries a binary image of the part-template, its boundary sample point coordinates, and a branching probability distribution obtained from the tree learning step. Since there is a one-to-one correspondence between a path and a global shape model, each global shape model is now represented by a soft coverage map. We accumulate the soft coverage maps for all paths to compute the average global shape for the learned tree. Fig. 3 validates our tree learning approach by showing that the learned average global shape is very similar to the average of all training silhouettes.

2.3 Hierarchical Part-Template Matching

Given a test image, we use a scanning window approach for estimating the optimal pose for each candidate detection window. For each window, we match the learned tree to image observations (edges or edge orientations) to estimate the optimal tree path and associated location parameters for the nodes of the path. Similarly to the model used for tree learning, the overall matching score for a particular detection window is simply modeled as a summation of the matching scores of all nodes along that path. But, differently from the learning, the matching in the test phase is performed on edges or gradient orientations instead of silhouettes, and the score of each node is computed as the product of the part-template matching score and the prior probability of that node learned in the training phase. We qualitatively verified that incorporating the branching prior probability gives better robustness to pose estimation. Another difference is that, in the test phase, we allow the part-template locations to be shifted locally and an optimal location is estimated for each part-template, as opposed to fixing part-template locations. This is similar to the approaches of Amit and Trouve [23] and Felzenszwalb et al. [24], where each part can be adjusted to its locally optimal location.

608 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 4, APRIL 2010

Fig. 2. An illustration of the part-template tree model. The different part-templates are obtained by decomposing the global shape models in a top-down manner. Each part in the tree has a default location/size and is characterized by both shape and region information.

We match individual part-templates and compute part-template matching scores using a method similar to Chamfer matching [6]. The matching score of a sample point on the part-template contour is measured by edge orientation matching. The goal is to estimate the optimal human pose (corresponding to a tree path and location estimates of each part-template on the path) which is most consistent with the image observations.

The optimization problem can be solved by dynamic programming to achieve globally optimal solutions, but this algorithm is computationally too expensive to densely scan all hypothetical windows. For efficiency, we use a fast K-fold greedy search algorithm: we keep scores for all nodes (k = 1, 2, ..., K) in Layer L1 instead of estimating the best k, and a greedy procedure is individually performed for each of those K nodes (or threads).
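A minimal sketch of the K-fold greedy search: run an independent greedy descent from each of the K Layer-L1 candidates and keep the best-scoring full path. Per-node scores are summed here for simplicity; the full system also weights each node's matching score by its learned branching prior, and the tree and score function below are toy placeholders:

```python
# Sketch of the K-fold greedy search: one greedy descent (thread) per
# Layer-L1 node, keep the path with the best accumulated score. Scores
# are summed per node; the full system also weights each node by its
# learned branching prior.

def k_fold_greedy(layer1_nodes, children, score):
    best_path, best_score = None, float("-inf")
    for root in layer1_nodes:
        path, node, total = [root], root, score(root)
        while children.get(node):
            node = max(children[node], key=score)   # locally best branch
            path.append(node)
            total += score(node)
        if total > best_score:
            best_path, best_score = path, total
    return best_path
```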

Pose model parameters estimated by the hierarchical part-template matching algorithm are directly used for pose segmentation by part synthesis (region connection). Fig. 4 shows the process of global pose (shape) segmentation by part-template synthesis.

3 POSE-ADAPTIVE DESCRIPTORS

For applying our part-template tree model and hierarchical part-template matching algorithm to discriminative human detection, we introduce a pose-adaptive feature computation method for detecting humans in images using standard machine learning techniques such as SVM and Boosting.

3.1 Overview of the Approach

In our training and testing data sets, positive samples all consist of 128×64 image patches. Negative samples are randomly selected from raw (person-free) images; positive samples are cropped (from annotated images) such that persons are roughly aligned in location and scale. For each training or testing sample, we first compute a set of histograms of (gradient magnitude-weighted) edge orientations for nonoverlapping 8×8 rectangular regions (or cells) evenly distributed over the image. Motivated by the success of HOG descriptors [3] for object detection, we employ coarse spatial and fine orientation quantization to encode the histograms, and normalization is performed on groups of locally connected cells, i.e., blocks. Then, given the orientation histograms and a candidate detection window, hierarchical part-template matching is performed to estimate the optimal pose. Given the pose and shape estimates, the block features closest to each pose contour point are collected; finally, the histograms of the collected blocks are concatenated in the order of pose correspondence to form our feature descriptor. As in [3], each block (consisting of four histograms) is normalized before collecting features to reduce sensitivity to illumination changes. The one-to-one point correspondence from an arbitrary pose model to the canonical one reduces the sensitivity of the extracted descriptors to pose variations. Fig. 5 shows an illustration of our feature extraction process. Details of dense feature extraction and the tree-based pose inference are described in the following sections.

3.2 Low-Level Feature Representation

For pedestrian detection, histograms of oriented gradients (HOGs) [3] exhibited superior performance in separating image patches into human/nonhuman. These descriptors ignore spatial information locally and, hence, are very robust to small alignment errors. We use a very similar representation as our low-level feature description, i.e., (gradient magnitude-weighted) edge orientation histograms.

Fig. 3. A comparison of (a) the average of all training silhouettes and (b) the average of the soft probability maps for all tree paths.

Fig. 4. An illustration of shape/pose segmentation. (a) Best part-template estimates (three images on the left side designate a path from L0 to L3) are combined to produce the final global shape and pose segmentations (two images on the right side). (b) Example pose (shape) segmentation on positive and negative examples.

Fig. 5. Overview of our feature extraction method. (a) A training or testing image. (b) Part-template detections. (c) Pose/shape segmentation. (d) Cells overlaid onto pose contours. (e) Orientation histograms and cells overlapping with the pose boundary. (f) Block centers relevant to the descriptor.

Given an input image, we calculate gradient magnitudes |G| and edge orientations O using simple difference operators (−1, 0, 1) and (−1, 0, 1)ᵀ in the horizontal x and vertical y directions, respectively. We quantize the image region into local 8×8 nonoverlapping cells, each represented by a histogram of (unsigned) edge orientations (each surrounding pixel contributes a gradient magnitude-weighted vote to the histogram bins). Edge orientations are quantized into N_b = 9 orientation bins [kπ/N_b, (k+1)π/N_b), where k = 0, 1, ..., N_b − 1. For reducing aliasing and discontinuity effects, we also use trilinear interpolation as in [3] to vote the gradient magnitudes into both the spatial and orientation dimensions. Additionally, each set of neighboring 2×2 cells forms a block. This results in overlapping blocks, where each cell is contained in multiple blocks. For reducing illumination sensitivity, we normalize the group of histograms in each block using L2 normalization with a small regularization constant ε to avoid division by zero. Fig. 6 shows visualizations of extracted low-level features for two example image patches.
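The low-level stage above can be sketched as follows. For brevity, this version uses hard binning instead of the paper's trilinear interpolation, and the function names (`cell_histograms`, `l2_normalize_blocks`) are ours:

```python
import numpy as np

# Sketch of the low-level stage: 8x8 cells, 9 unsigned orientation bins,
# gradient magnitude-weighted votes, and L2 block normalization. Votes
# use hard binning here rather than the paper's trilinear interpolation.

def cell_histograms(img, cell=8, nbins=9):
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]      # (-1, 0, 1) filter
    gy[1:-1, :] = img[2:, :] - img[:-2, :]      # (-1, 0, 1)^T filter
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi            # unsigned orientation
    bins = np.minimum((ang / (np.pi / nbins)).astype(int), nbins - 1)
    ch, cw = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((ch, cw, nbins))
    for y in range(ch * cell):
        for x in range(cw * cell):
            hist[y // cell, x // cell, bins[y, x]] += mag[y, x]
    return hist

def l2_normalize_blocks(hist, eps=1e-6):
    """L2-normalize each 2x2 group of neighboring cells (one block)."""
    ch, cw, nb = hist.shape
    blocks = np.zeros((ch - 1, cw - 1, 4 * nb))
    for y in range(ch - 1):
        for x in range(cw - 1):
            v = hist[y:y + 2, x:x + 2].ravel()
            blocks[y, x] = v / np.sqrt(np.sum(v * v) + eps * eps)
    return blocks
```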

The above computation results in our low-level feature representation consisting of a set of raw (cell) histograms (gradient magnitude-weighted) and a set of normalized block descriptors indexed by image locations. We use both normalized and unnormalized histograms in our descriptor computation algorithm. Unnormalized (raw) histograms carry gradient magnitude information, so they are used for pose matching (matching histograms against edge orientations of the pose contours in the tree), while normalized histograms are used for the final descriptors based on the estimated poses.

Note that, in our implementation, all positive and negative training examples are of size 128×64, so, during the training phase, there is no scale variation. All features (edge gradients and orientations) are computed at the same scale at training. During testing, a pyramid of the input image is constructed and a window of the same size, 128×64, is scanned to handle scale variation. The (downsampling) scale factor used to construct the image pyramid is set to 1.1.
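The pyramid construction reduces to enumerating scale factors until the fixed 128×64 window no longer fits; a small sketch (the function name is ours):

```python
# Sketch of the test-time scale enumeration: downsample by a factor of
# 1.1 until the fixed 128x64 window no longer fits in the resized image.

def pyramid_scales(h, w, win=(128, 64), factor=1.1):
    scales, s = [], 1.0
    while h / s >= win[0] and w / s >= win[1]:
        scales.append(s)
        s *= factor
    return scales
```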

3.3 Pose Inference on the Low-Level Features

Given a low-level feature representation for a detection window and the learned part-template tree, an optimal tree path is estimated based on the hierarchical part-template matching scheme introduced in the previous section. The part-template score is measured by an average of gradient magnitudes of corresponding orientation bins in each block of our low-level feature representation. Here, the score is measured by orientation consistency instead of distances between edges as in traditional chamfer matching [6]. The score of each template point is measured using location-based lookup tables for speed. Magnitudes from neighboring histogram bins are weighted to reduce orientation biases and regularize the matching scores of each template point.

More formally, let O(t) denote the edge orientation at contour point t; its corresponding orientation bin index B(t) is computed as B(t) = ⌊O(t)/(π/9)⌋ (where ⌊x⌋ denotes the maximum integer less than or equal to x). For each contour point t, we first identify the closest 8×8 cell in the detection window. Let H = {h_i} (where i denotes the histogram bin index) be the (unnormalized) orientation histogram at the contour location t; the matching score f(t) at t is computed as

f(t) = Σ_{b=−δ}^{δ} w(b) h_{B(t)+b},    (1)

where δ is a neighborhood range and w(b) is a symmetric weight distribution.⁴ Given a part-template T (a collection of boundary sample points on the part-template), the matching score f(T) of the part-template is computed as the average point score along the template:

f(T) = (1/|T|) Σ_{t∈T} f(t),    (2)

where |T| denotes the number of sample points on part-template T. The orientation histograms are stored as a lookup table, so the computation is very similar to that of the chamfer distance.
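Equations (1) and (2) translate almost directly into code. The sketch below assumes one precomputed histogram per contour point (a simplification of the paper's location-indexed lookup tables) and wraps bin indices circularly, consistent with the π-periodicity of unsigned orientations:

```python
# Direct implementation of (1) and (2). Each contour point's histogram is
# passed in explicitly, a simplification of the paper's location-indexed
# lookup tables; bin indices wrap circularly (orientations are mod pi).

DELTA = 1                          # neighborhood range (footnote 4)
W = {-1: 0.25, 0: 0.5, 1: 0.25}    # symmetric weights (footnote 4)

def point_score(hist, bin_idx, nbins=9):
    """f(t) = sum_b w(b) * h_{B(t)+b}."""
    return sum(W[b] * hist[(bin_idx + b) % nbins]
               for b in range(-DELTA, DELTA + 1))

def template_score(hists, bin_idxs):
    """f(T): average point score over the template's sample points."""
    scores = [point_score(h, b) for h, b in zip(hists, bin_idxs)]
    return sum(scores) / len(scores)
```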

3.4 Representation Using Pose-Adaptive Descriptors

In our implementation, the global shape models (consisting of three part-template types) are represented as a set of boundary points with corresponding edge orientations. The number of these model points ranges from 118 to 172. In order to obtain a unified (constant-dimensional) description of images with these different-dimensional pose models, and to establish a one-to-one correspondence between contour points of different poses (Fig. 7), we map the boundary points of any pose model to those of a canonical pose model. The canonical pose model is assumed to be occlusion-free so that all contour points are visible. For human upper bodies (head and torso), the boundaries are uniformly sampled into eight left-side and eight right-side locations, and the point correspondence is established between poses based on vertical y coordinates and side (left or right) information. For lower bodies (legs), the boundaries are uniformly sampled vertically, with four locations at each y value (inner leg sample points are sampled 5 pixels apart from outer sample points in the horizontal direction). Fig. 5f shows an example of how the sampled locations are distributed.

Fig. 6. Examples of two training samples and visualization of corresponding (unnormalized and L2-normalized) edge orientation histograms.

Fig. 7. An illustration of pose alignment by one-to-one contour point correspondence. Only a subset of key contour points is shown here.

4. For simplicity, we use δ = 1, and w(1) = w(−1) = 0.25, w(0) = 0.5 in our experiments.

Associated with each of these sample locations is a 36-dimensional feature vector (the L2-normalized histogram of edge orientations of its closest 2×2 block in the image). Hence, this mapping procedure generates a (8×2 + 7×4)×36 = 1,584-dimensional feature descriptor. Fig. 5 illustrates the feature extraction method. Note that only a subset of blocks is relevant to the descriptor, and a block might be duplicated several times based on the number of contour points lying inside the block.
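Descriptor assembly is then a concatenation in canonical-point order. A sketch assuming a 128×64 window (a 15×7 block grid after 2×2 grouping) and a naive clamped nearest-block lookup, which is our simplification of the paper's closest-block selection:

```python
import numpy as np

# Sketch of descriptor assembly: each of the 44 canonical contour sample
# points contributes the L2-normalized 36-d histogram of its closest 2x2
# block, concatenated in canonical-point order. The clamped integer
# lookup below is a naive stand-in for the closest-block selection.

N_POINTS = 8 * 2 + 7 * 4          # 16 upper-body + 28 leg sample points

def pose_descriptor(sample_points, blocks, cell=8):
    """Concatenate the closest block descriptor (36-d) per sample point."""
    feats = []
    for (y, x) in sample_points:
        by = min(max(y // cell, 0), blocks.shape[0] - 1)
        bx = min(max(x // cell, 0), blocks.shape[1] - 1)
        feats.append(blocks[by, bx])
    return np.concatenate(feats)
```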

4 DETECTING AND SEGMENTING MULTIPLE OCCLUDED HUMANS

The pose-adaptive descriptors discussed in the previous section are mainly developed for the purpose of detecting fully visible humans in images. However, real-world images can be crowded, and it is common for humans to occlude each other significantly. This is especially true in visual surveillance scenarios, where videos are usually captured in crowded public places, e.g., shopping malls, airports, etc. In these complex cases, our generic detector based on pose-adaptive features can be used to provide an initial set of human hypotheses (by reducing thresholds to ensure low miss rates), and then more detailed occlusion analysis and optimization can be performed.

4.1 Initial Hypotheses

The generic human detector trained using our pose-adaptive features and a discriminative classifier such as an SVM can provide a reliable set of initial human hypotheses for detecting humans in still images. However, for crowded videos, a set of simpler part detectors can be more accurate than a single full-body detector due to severe occlusion between humans. Any portion of a human body can be occluded and sometimes only a very small part is visible. Hence, here, we introduce an alternative method for generating initial human hypotheses for surveillance scenarios.

Hierarchical part-template matching provides estimates of the pose model parameters for every detection window. We define a score function F_w (a function of the weight vector w) capable of evaluating image responses for any part or part combination. The function is modeled as a weighted sum of individual part responses (matching scores) f_j, j ∈ {ht, ul, ll} (computed based on (2)):

F_w = Σ_j w_j f_j,    (3)

where the weight distribution (or weight vector) w = {w_j, j = ht, ul, ll} (Σ_j w_j = 1) characterizes different parts or part combinations. For example, {w_ht = w_ul = w_ll = 1/3} corresponds to a full-body detector and {w_ht = 0, w_ul = w_ll = 1/2} corresponds to a leg detector. By applying a detection threshold τ to the score function F_w, we form seven part or part-combination detectors (as shown in Fig. 8), and if the head-torso is decomposed further into head-shoulder and torso, the number of detectors can be as high as 15. Suppose that we use M part detectors, D_m, m = 1, 2, ..., M, corresponding to M weight vectors w_m, m = 1, 2, ..., M. Individual part matching scores f_j are computed based on (2).
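Equation (3) is a convex combination of the three part matching scores; a sketch with the two example weight vectors from the text (part scores in the usage below are illustrative):

```python
# Sketch of the part-combination score (3): a convex combination of the
# three part matching scores f_ht, f_ul, f_ll. Part names follow the
# text; any numeric scores shown in tests are illustrative.

PARTS = ("ht", "ul", "ll")

def combined_score(f, w):
    """F_w = sum_j w_j * f_j, with the weights summing to 1."""
    assert abs(sum(w.values()) - 1.0) < 1e-9
    return sum(w[j] * f[j] for j in PARTS)

full_body = {"ht": 1 / 3, "ul": 1 / 3, "ll": 1 / 3}   # full-body detector
legs_only = {"ht": 0.0, "ul": 0.5, "ll": 0.5}         # leg detector
```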

In practice, we build a pyramid from the input image and use our sliding-window-based generic human detector to reduce the search space to a small subset, and boost it by searching for additional hypotheses using the above part/part-combination detectors. We threshold each of the response maps (of full-body matching scores) using a constant global detection threshold τ (which is adjustable for trading off precision and recall of detections), merge nearby weak responses into strong responses, and adaptively select modes. This step can also be performed by local maximum selection after smoothing the likelihood image. The union of the maxima forms the set of human hypotheses:

u = {u_1, ..., u_N} = {(x_1, θ*(x_1)), ..., (x_N, θ*(x_N))}.    (4)

We compute full-body matching scores (assuming no occlusion) for all of these hypotheses and denote them by L(u_i), i = 1, 2, ..., N. More specifically, L(u_i) is computed as the average matching score of all of u_i's part-template contour points. Note that u is an unordered set (no occlusion ordering information).

4.2 Objective Function

Using the set of initial detection hypotheses u as the image observation, we model multiple occluded human detection as the problem of maximizing an objective function Ψ (similar to joint likelihood maximization in [16], [35], [36], [37]):

c* = arg max_c Ψ(u|c),    (5)


Fig. 8. An illustration of part or part-combination detectors. The rectangular regions associated with different detectors are drawn in different colors. We also experimented with adding additional elementary part detectors such as head detectors and head-shoulder detectors (built by decomposing the head-torso models vertically), since they are useful when humans are significantly occluded. Including these models improves results slightly, but our experimental results are based on the simple model consisting of elementary part detectors for the head-torso, upper legs, and lower legs.

where c = {c_1, c_2, ..., c_n} denotes an ordered human configuration,⁵ n denotes the number of humans in the configuration, and c_i = (x_i, θ*_i) denotes an individual hypothesis which consists of an image location x_i and the corresponding human model parameters θ*_i (optimal part-template indices and locations obtained from hierarchical part-template matching). We constrain the search space to the initial set of hypotheses, i.e., c ⊆ u. The goal is to choose the optimal subset of the initial hypotheses u and its occlusion ordering so that Ψ is maximized.

Given an ordered human configuration c, we can generate its occlusion map I_occ by overlaying the regions of the global shape estimates (see Fig. 9d for an example of the occlusion map). Assuming independence between the observations u_i given the configuration c, and treating the template matching scores as pseudolikelihoods, (5) can be decomposed (similarly to a probability decomposition) as follows:

Ψ(u|c) = Ψ(u_1, ..., u_N | c) = ∏_{i=1}^{N} Ψ(u_i|c).    (6)

Due to possible occlusions, we cannot directly use the full-body matching score L(u_i) to model Ψ(u_i|c). Instead, we need to globally reevaluate the matching score of each hypothesis u_i based on the occlusion map I_occ; that is, if u_i ∈ c, we compute its matching score based only on unoccluded (or visible) parts. Similarly to the occlusion handling scheme in [23], we perform occlusion compensation precisely at the pixel level via an occlusion map generated from precise human segmentations. This occlusion compensation-based score reevaluation scheme is effective in rejecting most false alarms while retaining true detections. Specifically, if u_i ∈ c, there exists j such that u_i = c_j, and consequently, the occlusion-compensated matching score L(u_i|I_occ) is defined as the average matching score of u_i's shape template points lying inside c_j's visible region; otherwise, if u_i ∉ c, L(u_i|I_occ) is set to the constant detection threshold τ. Based on the above modeling, the objective function can be rewritten as

Ψ(u|c) = ∏_{u_i∈c} L(u_i|I_occ) · ∏_{u_i∉c} τ.    (7)

Now, given any ordered set of human hypotheses (configuration), the objective function can be exactly evaluated.⁶
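Evaluating (7) for a candidate configuration might look as follows. The occlusion-compensated score L(u_i|I_occ) is abstracted as a callable here, since it depends on the occlusion map rendered for the configuration:

```python
# Sketch of evaluating (7): chosen hypotheses contribute their
# occlusion-compensated scores; all others contribute the constant
# detection threshold tau. The compensated score is abstracted as a
# callable because it depends on the rendered occlusion map.

def objective(hypotheses, config, score_given_occlusion, tau):
    """Psi(u|c) = prod_{u_i in c} L(u_i|I_occ) * prod_{u_i not in c} tau."""
    val = 1.0
    chosen = set(config)
    for u in hypotheses:
        val *= score_given_occlusion(u, config) if u in chosen else tau
    return val
```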

4.3 Optimization

Maximizing the objective function in (7) is an NP-hard combinatorial optimization problem. To simplify the problem, we sort the hypotheses in decreasing order of vertical (or y) coordinate as in [16]. This is valid for many surveillance videos under the ground plane assumption, since the camera is typically looking obliquely down toward the scene. For notational simplicity, we assume u_1, u_2, ..., u_N is such an ordered list. Starting from an empty set, the optimization is performed by iteratively adding a hypothesis to maximize the objective function in a greedy fashion. Our approach currently only adds hypotheses; although we experimented with using both addition and removal of hypotheses, it did not improve the results on our test data set. This is mainly because humans are roughly standing on a plane in these data sets. Adding hypothesis removal could be a useful extension of the algorithm for more complex occlusion cases, for example, when the ground plane assumption is not valid.

An example of the detection and segmentation process is shown in Fig. 9. Note that initial false detections are rejected in the final detection based on likelihood reevaluation, and the occlusion map is accumulated to form the final segmentation.

Algorithm 1. Optimization algorithm

Given an ordered list of initial human hypotheses u_1, u_2, ..., u_N,
initialize the configuration c as the empty set, the occlusion map I_occ as empty (white image), and the objective function as Ψ(u|c) = 0.
for i = 1 : N
    if Ψ(u|u_i ∪ c) > Ψ(u|c),
        (1) insert u_i into c, i.e., u_i ∪ c → c;
        (2) update the occlusion map I_occ using the current c;
endfor
return the configuration c and the occlusion map I_occ.
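A Python rendering of Algorithm 1, with the objective (7) and the occlusion-map update abstracted behind a single `objective` callable (our simplification):

```python
# Python rendering of Algorithm 1: scan depth-ordered hypotheses and keep
# each one only if it increases the objective. The objective callable is
# assumed to evaluate (7) on a trial configuration (and, in the full
# system, to re-render the occlusion map for it).

def greedy_config(ordered_hypotheses, objective):
    config, best = [], 0.0          # Psi initialized to 0 as in Algorithm 1
    for u in ordered_hypotheses:    # sorted by decreasing y coordinate
        trial = config + [u]
        score = objective(trial)
        if score > best:            # accept only improving additions
            config, best = trial, score
    return config
```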

5 COMBINING WITH CALIBRATION AND BACKGROUND SUBTRACTION

We can also combine the shape-based detector with background subtraction and calibration in a unified system.

5.1 Scene-to-Camera Calibration

If we assume that humans are moving on a ground plane, ground plane homography information can be estimated offline and used to efficiently control the search for humans instead of searching over all scales at all positions. A similar idea combining calibration and segmentation has been explored by Hoiem et al. [45]. To obtain a mapping between head points and foot points in the image, i.e., to estimate the expected vertical axes of humans, we simplify the calibration process by estimating the homography between the head plane and the foot plane in the image [39]. We assume that humans are standing upright on an approximate ground plane viewed by a camera that is distant relative to the scene scale, and that the camera is located higher than a typical person's height. We define the homography mapping as f = P_f^h : F ↦ H, where F, H ∈ ℙ². Under the above assumptions, the mapping f is a one-to-one correspondence, so that, given an offline-estimated 3×3 matrix P_f^h, we can estimate the expected location of the corresponding head point p_h = f(p_f) given an arbitrary foot point p_f in the image. The homography matrix is estimated by the least-squares method using four or more pairs of foot and head points preannotated in some frames. An example of the homography mapping is shown in Fig. 10.

Fig. 9. An example of the detection process without background subtraction. (a) Initial set of human detection hypotheses, (b) human shape segmentations, (c) detection result, and (d) segmentation result (final occlusion map).

5. An ordered set of hypotheses with occlusion ordering, i.e., for any i < j, c_j does not occlude c_i.

6. Note that L in the above equations denotes scores (which can also be regarded as probabilities) but not log-probabilities or log-likelihoods.
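The least-squares homography fit can be sketched with a standard DLT-style solver; this generic SVD formulation is our assumption, since the paper only states that a least-squares estimate is used:

```python
import numpy as np

# DLT-style least-squares fit of the 3x3 foot-to-head homography from
# four or more annotated foot/head point pairs. The generic SVD solver
# is our assumption; the paper only specifies a least-squares estimate.

def fit_homography(foot_pts, head_pts):
    """Stack two constraint rows per pair and take the SVD null vector."""
    A = []
    for (x, y), (u, v) in zip(foot_pts, head_pts):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 3)    # defined up to scale

def map_point(P, pt):
    """Apply the homography with homogeneous division: p_h = f(p_f)."""
    x, y, w = P @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)
```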

5.2 Combining with Background Subtraction

Given the calibration information and the binary foreground image from background subtraction, we estimate what we refer to as the binary foot candidate regions R_foot as follows: we first find all foot candidate pixels x with foreground coverage density λ_x larger than a threshold γ. Given the estimated human vertical axis v_x at the foot candidate pixel x, λ_x is defined as the proportion of foreground pixels in an adaptive rectangular window W(x, (w_0, h_0)) determined by the foot candidate pixel x. The foot candidate regions R_foot are defined as R_foot = {x | λ_x ≥ γ}. The window coverage is efficiently calculated using integral images [2]. We detect edges in the augmented foreground regions R_afg, which are generated from the foot candidate regions R_foot by taking the union of the rectangular regions determined by each foot candidate pixel p_f ∈ R_foot, adaptively based on the estimated human vertical axes. Fig. 11 shows an example.
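The coverage-density test maps naturally onto integral images. The sketch below simplifies the adaptive window to an axis-aligned (h0, w0) box above each candidate foot pixel, ignoring the estimated vertical axes:

```python
import numpy as np

# Sketch of the foot-candidate test: coverage density of the foreground
# mask inside a rectangle above each pixel, computed in O(1) per query
# from an integral image. The adaptive window is simplified to an
# axis-aligned (h0, w0) box, ignoring the estimated vertical axes.

def integral_image(mask):
    return mask.astype(np.int64).cumsum(0).cumsum(1)

def box_sum(ii, y0, x0, y1, x1):
    """Rectangle sum over [y0:y1, x0:x1] from the integral image."""
    s = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        s -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        s -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s

def foot_candidates(mask, h0, w0, thresh):
    """R_foot = {x : coverage density of the window above x >= thresh}."""
    ii = integral_image(mask)
    H, W = mask.shape
    out = np.zeros_like(mask, dtype=bool)
    for y in range(h0 - 1, H):
        for x in range(W - w0 + 1):
            dens = box_sum(ii, y - h0 + 1, x, y + 1, x + w0) / (h0 * w0)
            out[y, x] = dens >= thresh
    return out
```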

6 EXPERIMENTAL RESULTS

We first present results using our generic human detector on two public pedestrian data sets and then discuss results of our multiple occluded human detector on three crowded image and video data sets.

6.1 Detection and Segmentation Using Pose-Adaptive Descriptors

We evaluate our generic human detector mainly using the INRIA person data set⁷ [3] and the MIT-CBCL pedestrian data set⁸ [1], [17]. The MIT-CBCL data set contains 924 front/back view positive images (no negative images), and the INRIA data set contains 2,416 positive training samples and 1,218 negative training images, plus 1,132 positive testing samples and 453 negative testing images. Compared to the MIT data set, the INRIA data set is much more challenging due to significant pose articulations, occlusion, clutter, and viewpoint and illumination changes.

6.1.1 Detection Performance

We evaluate our detection performance and compare it with other approaches using Detection-Error-Trade-off (DET) curves, i.e., plots of miss rate versus false positives per window (FPPW).

Training. We first extract pose-adaptive descriptors for the set of 2,416 positive and 12,180 negative samples and batch-train a discriminative classifier for the initial training algorithm. We use the publicly available LIBSVM tool [46] for binary classification (RBF kernel) with parameters tuned to C = 8,000, γ = 0.04 (as the default classifier). These parameter values are estimated via twofold cross validation by evenly partitioning the original training data into validation-train and validation-test sets.

For improving performance, we perform one round of bootstrapping to retrain the initial detector.


Fig. 10. Simplified scene-to-camera calibration. (a) Interpretation of the foot-to-head plane homography mapping. (b) An example of the homography mapping. Fifty sample foot points are chosen randomly, and corresponding head points and human vertical axes are estimated and superimposed on the image.

Fig. 11. An example of the detection process with background subtraction. (a) Adaptive rectangular window. (b) Foot candidate regions R_foot (lighter regions). (c) Object-level (foot-candidate) likelihood map from the hierarchical part-template matching (where red represents higher probabilities and blue represents lower probabilities). (d) The set of human hypotheses overlaid on the Canny edge map in the augmented foreground region (green boxes represent higher likelihoods and red boxes represent lower likelihoods). (e) Final human detection result. (f) Final human segmentation result.

7. http://lear.inrialpes.fr/data.
8. http://cbcl.mit.edu/software-datasets/PedestrianData.html.

We densely scan 1,218 (plus mirrored versions) person-free photos with 8-pixel strides in the horizontal/vertical directions and a 1.2 (downsampling) scale factor (until the resized image can no longer contain a detection window) to bootstrap false positive windows. This process generates 41,667 "hard" samples out of the examined windows. These samples are normalized to 128×64 and added to the original 12,180 negative training samples, and the whole training process is performed again.

Testing. For evaluation on the MIT data set, we chose its first 724 image patches as positive training samples and 12,180 training images from the INRIA data set as negative training samples. The test set contains 200 positive samples from the MIT data set and 1,200 negative samples from the INRIA data set. As a result, we achieve a 100 percent true positive rate and a 0.00 percent false positive rate even without retraining. Direct comparisons on the MIT data set are difficult since there are no negative samples and no separation of training and testing samples in this data set. Indirect comparisons show that our result on this data set is similar to the performance achieved previously in [3].

For the INRIA data set, we evaluated our detection performance on 1,132 positive image patches and 453 negative images. Negative test images are scanned exhaustively in the same way as in retraining. The detailed comparison of our detector with the current state-of-the-art detectors on the INRIA data set is plotted using the DET curves shown in Fig. 12. The comparison shows that our approach is comparable to the state-of-the-art human detectors. The dimensionality of our features is less than half of that used in HOG-SVM [3], but we achieve better performance. Another advantage of our approach is that it is capable of not only detecting but also segmenting human shapes and poses. In this regard, our approach can be further improved because our current pose model is very simple and can be extended to cover a much wider range of articulations. Fig. 13 shows examples of detections on whole images and examples of false negatives and false positives from our experiments. Note that FNs are mostly due to unusual poses, illumination conditions, or significant occlusions; FPs mostly appear in highly textured samples (such as trees) and structures resembling human shapes. Fig. 14 shows qualitative comparisons of our pose-adaptive descriptors with HOG descriptors [3] for detecting humans in natural images. Our detector successfully detected very difficult poses that the HOG-based detector missed.⁹

Fig. 13. Detection results. (a) Example detections on the INRIA test images; nearby windows are merged based on distances. (b) and (c) Examples of false negatives (FNs) and false positives (FPs) generated by our detector.

Fig. 12. Detection performance evaluation on the INRIA data set. (a) The proposed approach (tested at a single scale) is compared to Kernel HOG-SVM [3], Linear HOG-SVM [3], Cascaded HOG [30], and Classification on Riemannian Manifolds [5]. The results of [3] are copied from the original paper, and the results of [5], [30] are obtained by running their original detectors on the same test data. (b) Performance comparison w.r.t. the number of negative windows scanned. (c) Distribution of confidence values for positive and negative test windows.

Fig. 14. Qualitative comparisons of our pose-insensitive (adaptive) descriptor (PID) with the HOG descriptor.

Fig. 15. Example results of pose segmentation.

6.1.2 Segmentation Performance

Fig. 15 shows some qualitative results of our pose/shape segmentation algorithm on the INRIA data set. Our pose model and probabilistic hierarchical part-template matching algorithm give very accurate segmentations for most images in the MIT-CBCL data set and on over 80 percent of the 3,548 training/testing images in the INRIA data set. Significantly poor pose estimation and segmentation are observed in about 10 percent of the images in the INRIA data set, and most of those poor segmentations are due to very difficult poses and significant misalignment of the humans.

Our detection and segmentation system is implemented in C++ and the current running time (on a machine with a 2.2 GHz CPU and 3 GB memory) is as follows: both pose segmentation and feature extraction for 800 windows take less than 0.2 second, classifying 800 windows with the RBF-kernel SVM classifier takes less than 10 seconds, initial classifier training takes about 10 minutes, and retraining takes about two hours. The computational overhead is only due to the kernel SVM classifier, which can be replaced with a much faster boosted cascade of classifiers [2] (which we have implemented recently and which runs at three frames/second on a 320×240 image scanning 800 windows); this is comparable to [5] (reported as less than 1 second scanning 3,000 windows).

6.2 Detection and Segmentation of Multiple Occluded Humans

In order to quantitatively evaluate the performance of our detector, we use the overlap measure defined in [10]. The overlap measure is calculated as the smaller of the two area ratios between the overlap region and the ground-truth-annotated region/detection region. If the overlap measure of a detection is larger than a threshold of 0.5, we regard the detection as correct.
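The overlap measure can be sketched for axis-aligned boxes; the (x, y, w, h) box parameterization is our assumption:

```python
# Sketch of the overlap test from [10]: the smaller of the two ratios
# |A∩B|/|A| and |A∩B|/|B| for a ground-truth box and a detection box,
# both given as (x, y, w, h), thresholded at 0.5.

def overlap_measure(gt, det):
    ax, ay, aw, ah = gt
    bx, by, bw, bh = det
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return min(inter / (aw * ah), inter / (bw * bh))

def is_correct(gt, det, thresh=0.5):
    return overlap_measure(gt, det) >= thresh
```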


Fig. 16. Detection and segmentation results (without background subtraction) for USC pedestrian data set-B.

9. The source code of the generic human detector training and testing algorithms is publicly available at http://terpconnect.umd.edu/~zhelin/PidUMD.html.

Fig. 17. Performance evaluation on three data sets. (a) Evaluation of detection performance on USC pedestrian data set-B (54 images with 271 humans). Results of [16] and [18] are copied for comparison purposes. (b) Evaluation of detection performance on two test sequences from the Munich Airport data set and the Caviar data set.

6.2.1 Results without Background Subtraction

We compared our human detector with Wu and Nevatia [16] and Shet et al. [18] on USC pedestrian data set-B [16], which contains 54 gray-scale images with 271 humans. In these images, humans are heavily occluded by each other and partially out of the frame in some images. Note that no background subtraction is provided for these images. Fig. 16 shows some example results of our detector and Fig. 17a shows the comparison result as ROC curves. Our detector obtained better detection performance than the others when allowing more than 10 false alarms out of a total of 271 humans, while the detection rate decreased significantly when the number of false alarms was reduced to 6 out of 271. Proper handling of the edge sharing problem would further reduce the number of false alarms while maintaining the detection rates. The running time of [16] for processing a 384×288 image is reported as about one frame/second on a Pentium 2.8 GHz machine, while our current running time for a same-sized image is two frames/second on a Pentium 2 GHz machine.

6.2.2 Results with Background Subtraction

We also evaluated our detector on two challenging surveillance video sequences using background subtraction. The first test sequence (1,590 frames) is selected from the Caviar Benchmark data set10 and the second one (4,836 frames) is selected from the Munich Airport data set collected by Siemens Corporate Research.11 The foreground regions detected from background subtraction are very noisy and inaccurate in many frames. From the example results in Fig. 18, we can see that our proposed approach achieves good performance in accurately detecting humans and segmenting their boundaries even under severe occlusion and very inaccurate background subtraction. Also, from the results, we can see that the shape estimates automatically obtained from our approach are quite accurate. Some misaligned shape estimates are generated, mainly due to low contrast and/or background clutter.

We evaluated the detection performance quantitatively on 200 selected frames from each video sequence. Fig. 17b shows the ROC curves for the two sequences. Most false alarms are generated by cluttered background areas incorrectly detected as foreground by background subtraction. Misdetections (false negatives) are mostly due to the lack of edge segments in the augmented foreground region or complete occlusion between humans. Our system is implemented in C++ and currently runs at about two frames/second (without background subtraction) and five frames/second (with background subtraction) for 384×288 video frames on a Pentium-M 2 GHz machine.

6.2.3 Integration with a Region-Based Tracker

We also integrated our multiple occluded human detector with a regional affine-invariant tracker [47] and evaluated the integrated multiperson tracker on the CLEAR07 data set.12 The detector was used to provide a reliable set of human hypotheses for visual tracking. Details of the integration method are described in [48]. Quantitative evaluation of person tracking on a large amount of challenging video data showed that our integrated detection and tracking system achieved the best result in the competition of 2D person tracking.

7 CONCLUSIONS

A hierarchical part-template matching approach, combining local part-based and global shape-template-based schemes, is employed to match human shapes with images to detect and segment humans simultaneously. Based on the shape matching approach, we first introduced a pose-adaptive image descriptor for learning a discriminative classifier for the challenging problems of detecting and segmenting humans in images. The descriptor is computed adaptively based on human poses instead of concatenating features along 2D image locations as in previous approaches.

616 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 4, APRIL 2010

Fig. 18. Detection and segmentation results (with background subtraction) for (a) the Caviar data set and (b) the Munich Airport data set.

10. http://homepage.inf.ed.ac.uk/rbf/CAVIAR/.
11. The selected data can be downloaded from ftp://ftp.umiacs.umd.edu/pub/zhelin/iccv07/dataset.
12. http://www.clear-evaluation.org.

Specifically, we estimate the poses using a fast hierarchical matching algorithm based on a learned part-template tree. Given the pose estimate, the descriptor is formed by concatenating local features along the pose boundaries using a one-to-one point correspondence between detected and canonical poses. The pose-adaptive descriptors are used to train discriminative classifiers to learn generic human detectors. For applying the tree matching approach to multiple occluded human detection in crowded surveillance scenarios, we also introduced an approach to iteratively optimize the human configuration and occlusion ordering.
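The descriptor construction summarized above can be sketched as follows. This is only an illustrative approximation under stated assumptions: we stand in a generic per-pixel feature map and simple mean pooling for the paper's local features, and the function and variable names are ours. The essential property shown is that features are sampled at boundary points of the estimated pose, in a fixed canonical ordering, and concatenated into one vector:

```python
import numpy as np

def pose_adaptive_descriptor(feature_map, boundary_points, patch=8):
    """Sketch: sample a local feature (here, a mean-pooled patch of an
    arbitrary per-pixel feature map of shape HxWxC) at each boundary point
    of the estimated pose and concatenate in canonical order. The actual
    local features and sampling scheme in the paper differ."""
    h, w = feature_map.shape[:2]
    half = patch // 2
    feats = []
    for (x, y) in boundary_points:  # one-to-one canonical ordering assumed
        x0, x1 = max(0, x - half), min(w, x + half)
        y0, y1 = max(0, y - half), min(h, y + half)
        feats.append(feature_map[y0:y1, x0:x1].mean(axis=(0, 1)))
    return np.concatenate([np.atleast_1d(f) for f in feats])
```

Because the sampling points follow the estimated shape boundary rather than a fixed grid, two humans in different poses still yield descriptors whose entries correspond to the same body-boundary locations, which is what makes the representation pose-adaptive.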

The results demonstrate that the proposed part-template tree model captures the articulations of the human body and detects humans robustly and efficiently. Although our approach can handle the majority of standing human poses, many of our misdetections are still due to pose estimation failures. This suggests that the detection performance could be further improved by extending the part-template tree model to handle more difficult poses and to cope with alignment errors in positive training images. We are also investigating the addition of color and texture statistics to the local contextual descriptor to improve the detection and segmentation performance.

ACKNOWLEDGMENTS

The support of the US Defense Advanced Research Projects Agency (DARPA) under project VIRAT (subcontractor to Kitware, Inc.) is gratefully acknowledged. The authors would like to thank Siemens Corporate Research for providing the Munich Airport Surveillance Video data set for experiments.

REFERENCES

[1] C. Papageorgiou, T. Evgeniou, and T. Poggio, “A Trainable Pedestrian Detection System,” Proc. Symp. Intelligent Vehicles, pp. 241-246, 1998.

[2] P. Viola and M. Jones, “Rapid Object Detection Using a Boosted Cascade of Simple Features,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.

[3] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 886-893, 2005.

[4] Y. Wu, T. Yu, and G. Hua, “A Statistical Field Model for Pedestrian Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 1023-1030, 2005.

[5] O. Tuzel, F. Porikli, and P. Meer, “Human Detection via Classification on Riemannian Manifold,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[6] D.M. Gavrila and V. Philomin, “Real-Time Object Detection for SMART Vehicles,” Proc. IEEE Int’l Conf. Computer Vision, vol. 1, pp. 87-93, 1999.

[7] L. Zhao and L.S. Davis, “Closely Coupled Object Detection and Segmentation,” Proc. IEEE Int’l Conf. Computer Vision, pp. 454-461, 2005.

[8] D.M. Gavrila, “A Bayesian, Exemplar-Based Approach to Hierarchical Shape Matching,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 8, pp. 1408-1421, Aug. 2007.

[9] K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human Detection Based on a Probabilistic Assembly of Robust Part Detectors,” Proc. European Conf. Computer Vision, vol. 1, pp. 69-82, 2004.

[10] B. Leibe, E. Seemann, and B. Schiele, “Pedestrian Detection in Crowded Scenes,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 878-885, 2005.

[11] E. Seemann, B. Leibe, and B. Schiele, “Multi-Aspect Detection of Articulated Objects,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 1582-1588, 2006.

[12] J. Shotton, A. Blake, and R. Cipolla, “Contour-Based Learning for Object Detection,” Proc. IEEE Int’l Conf. Computer Vision, vol. 1, pp. 503-510, 2005.

[13] A. Opelt, A. Pinz, and A. Zisserman, “A Boundary-Fragment-Model for Object Detection,” Proc. European Conf. Computer Vision, vol. 2, pp. 575-588, 2006.

[14] V. Ferrari, T. Tuytelaars, and L.V. Gool, “Object Detection by Contour Segment Networks,” Proc. European Conf. Computer Vision, vol. 3, pp. 14-28, 2006.

[15] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, “Groups of Adjacent Contour Segments for Object Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 36-51, Jan. 2008.

[16] B. Wu and R. Nevatia, “Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors,” Proc. IEEE Int’l Conf. Computer Vision, pp. 90-97, 2005.

[17] A. Mohan, C. Papageorgiou, and T. Poggio, “Example-Based Object Detection in Images by Components,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 4, pp. 349-361, Apr. 2001.

[18] V.D. Shet, J. Neumann, V. Ramesh, and L.S. Davis, “Bilattice-Based Logical Reasoning for Human Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[19] B. Wu and R. Nevatia, “Simultaneous Object Detection and Segmentation by Boosting Local Shape Feature Based Classifier,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[20] R. Fergus, P. Perona, and A. Zisserman, “Object Class Recognition by Unsupervised Scale Invariant Learning,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 264-271, 2003.

[21] H. Schneiderman and T. Kanade, “Object Detection Using Statistics of Parts,” Int’l J. Computer Vision, vol. 56, no. 3, pp. 151-177, 2004.

[22] P.F. Felzenszwalb and D.P. Huttenlocher, “Pictorial Structures for Object Recognition,” Int’l J. Computer Vision, vol. 61, no. 1, pp. 55-79, 2005.

[23] Y. Amit and A. Trouve, “POP: Patchwork of Parts Models for Object Recognition,” Int’l J. Computer Vision, vol. 75, no. 2, pp. 267-282, 2007.

[24] P.F. Felzenszwalb, D. McAllester, and D. Ramanan, “A Discriminatively Trained, Multiscale, Deformable Part Model,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008.

[25] M.P. Kumar, P.H.S. Torr, and A. Zisserman, “Obj Cut,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 18-25, 2005.

[26] J. Winn and J. Shotton, “The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 37-44, 2006.

[27] P. Viola, M. Jones, and D. Snow, “Detecting Pedestrians Using Patterns of Motion and Appearance,” Proc. IEEE Int’l Conf. Computer Vision, pp. 734-741, 2003.

[28] N. Dalal, B. Triggs, and C. Schmid, “Human Detection Using Oriented Histograms of Flow and Appearance,” Proc. European Conf. Computer Vision, vol. 2, pp. 428-441, 2006.

[29] V. Sharma and J.W. Davis, “Integrating Appearance and Motion Cues for Simultaneous Detection and Segmentation of Pedestrians,” Proc. IEEE Int’l Conf. Computer Vision, pp. 1-8, 2007.

[30] Q. Zhu, S. Avidan, M.-C. Yeh, and K.-T. Cheng, “Fast Human Detection Using a Cascade of Histograms of Oriented Gradients,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 1491-1498, 2006.

[31] S. Maji, A.C. Berg, and J. Malik, “Classification Using Intersection Kernel Support Vector Machines Is Efficient,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008.

[32] P. Sabzmeydani and G. Mori, “Detecting Pedestrians by Learning Shapelet Features,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[33] B. Wu and R. Nevatia, “Optimizing Discrimination-Efficiency Tradeoff in Integrating Heterogeneous Local Features for Object Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008.

[34] J. Pang, Q. Huang, and S. Jiang, “Multiple Instance Boost Using Graph Embedding Based Decision Stump for Pedestrian Detection,” Proc. European Conf. Computer Vision, vol. 4, pp. 541-552, 2008.


[35] H. Tao, H. Sawhney, and R. Kumar, “A Sampling Algorithm for Detecting and Tracking Multiple Objects,” Proc. IEEE Int’l Conf. Computer Vision Workshop Vision Algorithms, pp. 53-68, 1999.

[36] M. Isard and J. MacCormick, “BraMBLe: A Bayesian Multiple-Blob Tracker,” Proc. IEEE Int’l Conf. Computer Vision, vol. 1, pp. 34-41, 2001.

[37] T. Zhao and R. Nevatia, “Tracking Multiple Humans in Crowded Environment,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 406-413, 2004.

[38] K. Smith, D.G. Perez, and J.M. Odobez, “Using Particles to Track Varying Numbers of Interacting People,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 962-969, 2005.

[39] J. Rittscher, P.H. Tu, and N. Krahnstoever, “Simultaneous Estimation of Segmentation and Shape,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 486-493, 2005.

[40] Q. Zhao, J. Kang, H. Tao, and W. Hua, “Part Based Human Tracking in a Multiple Cues Fusion Framework,” Proc. Int’l Conf. Pattern Recognition, vol. 1, pp. 450-455, 2006.

[41] P. Dollar, B. Babenko, S. Belongie, P. Perona, and Z. Tu, “Multiple Component Learning for Object Detection,” Proc. European Conf. Computer Vision, vol. 2, pp. 211-224, 2008.

[42] Z. Lin, L.S. Davis, D. Doermann, and D. DeMenthon, “Hierarchical Part-Template Matching for Human Detection and Segmentation,” Proc. IEEE Int’l Conf. Computer Vision, pp. 1-8, 2007.

[43] D. Tran and D.A. Forsyth, “Configuration Estimates Improve Pedestrian Finding,” Proc. Conf. Advances in Neural Information Processing Systems, 2007.

[44] Z. Lin and L.S. Davis, “A Pose Invariant Descriptor for Human Detection and Segmentation,” Proc. European Conf. Computer Vision, vol. 4, pp. 423-436, 2008.

[45] D. Hoiem, A. Efros, and M. Hebert, “Putting Objects in Perspective,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2137-2144, 2006.

[46] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[47] S. Tran and L.S. Davis, “Robust Object Tracking with Regional Affine Invariant Features,” Proc. IEEE Int’l Conf. Computer Vision, pp. 1-8, 2007.

[48] S. Tran, Z. Lin, D. Harwood, and L.S. Davis, “UMD_VDT, an Integration of Detection and Tracking Methods for Multiple Human Tracking,” Proc. CLEAR Workshop, pp. 179-190, 2007.

Zhe Lin received the BEng degree in automatic control from the University of Science and Technology of China in 2002, the MS degree in electrical engineering from the Korea Advanced Institute of Science and Technology in 2004, and the PhD degree in electrical and computer engineering from the University of Maryland, College Park, in 2009. He has been a research intern at Microsoft Live Labs Research. He is currently working as a research scientist at the Advanced Technology Labs, Adobe Systems Incorporated, San Jose, California. His research interests include object detection and recognition, content-based image and video retrieval, and human motion tracking and activity analysis. He is a member of the IEEE and the IEEE Computer Society.

Larry S. Davis received the BA degree from Colgate University in 1970 and the MS and PhD degrees in computer science from the University of Maryland in 1974 and 1976, respectively. He is currently a professor in the Institute for Advanced Computer Studies and the Computer Science Department, as well as the chair of that department, at the University of Maryland. From 1977 to 1981, he was an assistant professor in the Department of Computer Science at the University of Texas at Austin. He returned to the University of Maryland as an associate professor in 1981. From 1985 to 1994, he was the director of the University of Maryland Institute for Advanced Computer Studies. He is known for his research in computer vision and high-performance computing. He has published more than 100 papers in journals and has supervised more than 20 PhD students. He is an associate editor of the International Journal of Computer Vision and an area editor for Computer Models for Image Processing: Image Understanding. He has served as the program or general chair for most of the field’s major conferences and workshops, including the Fifth International Conference on Computer Vision, the 2004 Computer Vision and Pattern Recognition Conference, and the 11th International Conference on Computer Vision. He is a fellow of the IEEE and a member of the IEEE Computer Society.

