Object Detection Grammars


Pedro Felzenszwalb, Brown University

Joint work with Ross Girshick and David McAllester


Object category detection in computer vision

Goal: detect all instances of each category in an image: pedestrians, cars, monkeys, trees, lamp posts, etc.

The challenge

[Pascal VOC images]

Objects in each category vary greatly in appearance

Deformable part models (DPM)

• Model an object by a collection of parts arranged in a deformable configuration

• Use mixture models to handle more significant variation

Deformable models

• Can take us a long way...

• But not all the way

Structure variation

• Objects in rich categories have variable structure

• These are NOT deformations

• Mixture of deformable models?

- too many combined choices

Grammar/compositional models

• Some parts should be optional

- A person could have a hat or not

• There should be subtypes (mixtures) at the part level

- A person could wear a skirt or pants

- A mouth can be smiling or frowning

• Parts are recursively objects

- People have faces and faces have eyes

• [Geman, Potter, Chi 2002], [Jin, Geman, 2006], [Zhu, Mumford, 2006], [Wang, Athitsos, Sclaroff, Betke, 2008]

- person -> face, trunk, arms, lower-part

- face -> eyes, nose, mouth

- face -> hat, eyes, nose, mouth

- hat -> baseball-cap

- hat -> sombrero

- lower-part -> shoe, shoe, legs

- lower-part -> bare-foot, bare-foot, legs

- legs -> pants

- legs -> skirt
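As a toy illustration (a hypothetical Python encoding, not from the original slides), productions like these can be written down as data, with each nonterminal mapping to its alternative right-hand sides:

```python
# A toy version of the person grammar above: each nonterminal maps to a
# list of alternative right-hand sides (the choices in a derivation).
GRAMMAR = {
    "person":     [["face", "trunk", "arms", "lower-part"]],
    "face":       [["eyes", "nose", "mouth"],
                   ["hat", "eyes", "nose", "mouth"]],
    "hat":        [["baseball-cap"], ["sombrero"]],
    "lower-part": [["shoe", "shoe", "legs"],
                   ["bare-foot", "bare-foot", "legs"]],
    "legs":       [["pants"], ["skirt"]],
}

def expand(symbol):
    """Recursively expand a symbol into one possible bag of terminals,
    taking the first alternative at every choice point."""
    if symbol not in GRAMMAR:          # terminal: no rules to apply
        return [symbol]
    bag = []
    for child in GRAMMAR[symbol][0]:   # pick an alternative (here: the first)
        bag.extend(expand(child))
    return bag

print(expand("person"))
# ['eyes', 'nose', 'mouth', 'trunk', 'arms', 'shoe', 'shoe', 'pants']
```

Each alternative right-hand side is one of the choices a derivation can make; a real model attaches a score and a spatial relation to every rule.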

[Figure: derivation tree for a person, expanding person into face (eyes, nose, mouth), trunk, arms, and lower-part (shoe, shoe, legs → pants)]

Previous work

• Compositional Engine: Jin, Geman, CVPR 2006

• Hidden State Shape Models: Wang, Athitsos, Sclaroff, Betke, PAMI 2008


Fig. 1. Three object classes that exhibit variable shape structure: branches with leaves, hair combs, and hand contours. Such classes can be naturally modeled with a Hidden State Shape Model (HSSM).

…both cases, the object belongs to the class "branch," even if typical instances of this class are branches without missing leaves.

• Some object parts can appear in alternative ways. Examples are the parts of articulated objects, such as the hands in Fig. 1, where each finger appears either totally extended, partially bent, or completely hidden.

Object classes of variable shape structure are frequently encountered in both man-made and natural objects. Blood vessels in the retina, airway ducts in the lung, and dendrites in nerve tissue are examples of biological objects with variable structure. Detecting such objects is important for tasks like diagnosing lung cancer or diseases of the retina. Roadways and waterways in aerial images are also examples of objects that have variable structure. To model object classes of variable shape structure, we have introduced Hidden State Shape Models (HSSMs) [1], a generalization of Hidden Markov Models (HMMs) [2]. In HSSMs, alternative appearances of each object part are modeled by different model states. Using HSSMs, an object in an image can be localized and the shape structure of the object can be simultaneously identified by stepwise registration of these model states with the parts of the object. The computational complexity of this registration process is polynomial in the total number of model states and the total number of observed features in the image, even in the presence of a significant amount of clutter.

The method proposed in this paper builds on top of the original HSSM method [1]. While the previous method [1] assumed that the scale of the object was known, the proposed method can handle more general realistic scenarios, where the object's scale in the image is not known a priori. In particular, the main contributions we make in this paper are summarized as follows: A unified probabilistic framework is formulated for object localization and structure identification…



comes from studies of the diminishing numbers of new parts that are needed to represent objects in a sequential learning task (Krempp et al. [24]), as well as from the successes of multiple-object recognition systems built on a common substrate of lower-level parts [2, 37].

Context. It is often observed that segmentation can be ambiguous, if not impossible, in the absence of the contextual information provided through recognition. Similarly, reliable edge and boundary detection is notoriously difficult when attempted in a purely bottom-up framework, without more global contextual constraints that help to disambiguate, for example, texture, shadow, and occlusion boundaries (cf. [7, 38]). By little more than their nature, hierarchical models (as in [15, 16, 21, 29, 33]) embody multi-level contextual constraints.

Efficient Representation. Barlow [3] proposed suspicious coincidences as a possible principle for discovering meaningful groupings, such as the grouping of features into parts, parts into objects, or objects into scenes. A new label, "tree", "telephone", "desk", makes for a more efficient representation by virtue of "explaining" an otherwise "suspicious coincidence" in the arrangement of features and parts. Much earlier, Laplace [25] made a similar observation, arguing that a likelihood principle was sufficient to provide a gradient towards meaningful grouping. These notions of grouping are closely related to the notion of efficient representation, in that the introduction of a label for an otherwise unlikely grouping of parts amounts to an enhanced encoding and a shorter description length (as discussed for example by Bienenstock et al. [5]). By this connection, hierarchical description is a close cousin of Rissanen's Minimum Description Length principle [30].

Biology. Fodor and Pylyshyn [12] have questioned the biological relevance of the (nonparametric-type) learning algorithms employed in most neural network models. They argue that these models lack a fundamental feature of human cognition: they are not compositional. The principle of compositionality holds that humans perceive and organize information as a syntactically constrained hierarchy of reusable parts. The prototypical formulation was introduced by Chomsky [8] as a system of formal grammars. Indeed, language itself is the prototypical compositional system, with evident hierarchy, syntax, and reusability. In the visual world, physical objects and scenes decompose naturally into a hierarchy of meaningful and generic parts, and it is perhaps no coincidence that there is an apparent hierarchical structure in the ventral visual pathways of the more highly evolved visual systems [32, 39, 41].

In §2, following the formulation proposed in [17], we develop a prior probability model on hierarchically organized image interpretations ("composition machine"). We begin with a Markov structure, in the spirit of a Bayesian network, and later perturb this distribution in order to achieve greater selectivity. An application to license-plate reading is explored in §3, and some conclusions and speculation are offered in §4.

2. Model: Composition Machine

Composition systems are generative, probabilistic image models that embody a hierarchy of part/whole relationships. Generative probabilistic models include Bayesian networks [13, 23, 34, 36], linear and nonlinear filtering [11], Markov random fields [9, 31, 46], and probabilistic context-free grammars [19]. Compositional systems are distinct from these models in that they are non-Markovian. On the one hand this makes computation substantially more difficult, but on the other hand, non-Markovian models are more selective and thereby, in principle, capable of smaller type II error probabilities (probabilities of false alarms).

Figure 1. Vertical slice through a "composition machine". Each row extends to a two-dimensional sheet of "bricks". See text for details.

Markov Backbone. Figure 1 depicts the "Markov backbone", which is a generative, hierarchical model equipped with a Markov structure on a directed acyclic graph. Starting at the bottom, the image pixels are represented by a one-dimensional string of nodes, corresponding to a one-dimensional slice through the two-dimensional pixel array. Hidden (model) variables are associated with two-dimensional sheets of nodes that sit "above" the image array; these variables are called bricks (as in Lego bricks) to emphasize their re-usability across legitimate configurations. The layer of bricks that sits immediately above the image array are called terminal bricks, and as we shall see, are associated with local image filters.

Bricks represent semantic variables, like edges, strokes, junctions, shapes, and various parts and objects. Assignments will vary from application to application; Figure 2 indicates the assignments for the application to license-plate reading.

(an ID is misread if any of its characters are misread).

Search Strategies. Bottom-up seeding, as described in §2, is slow, even if candidates are heavily pruned during the bottom-up (indexing) pass. Although it is indeed a coarse-to-fine exploration of I, the overwhelming majority of the calculations of likelihood are unnecessary in that they could be eliminated, before the fact, if the goal were to find instances of a particular object (e.g. find and read license plates).

These observations suggest a more efficient ctf strategy: traverse the bricks associated with the objects of interest (top-level bricks in the license-plate system). For each brick, perform a depth-first search for an instantiation. Lower-level bricks might be visited multiple times. Hence, for each brick, a list of instantiations is maintained and re-used every time that brick appears in the computation. Computation passes immediately to the terminal bricks, and the search remains coarse-to-fine in the sense discussed in §2. Yet many of the terminal bricks, indeed the vast majority, are never visited. Furthermore, the algorithm admits easily to multi-threading or implementation on a multi-processor system.

A simplified version of depth-first search was implemented. Top-level (license-plate) bricks are instantiated by a pair of bricks: a license-plate number (chosen from the penultimate layer) and a license-plate boundary (chosen from the third layer from the top). For each top-layer brick, the possible children among the license-plate boundaries were first explored. Although there were some false-positive boundaries (one is seen in Figure 6), only a small fraction of the image needed to be further explored for the corresponding license number. The result was a many-fold improvement in computation speed with no loss in performance. It is likely that a fully implemented depth-first search would further improve computational efficiency.

Observations. How important is the non-Markovian perturbation? It is straightforward to run the composition machine with and without the perturbation term. What is more, the states of intermediate bricks signal detections of intermediate structures (such as characters, strings, and boundaries), and can therefore be assessed, in and of themselves, by their recognition performance.

We consistently find a substantial drop in performance, at all levels of recognition, from characters up to license plates, when running the Markov backbone in place of the full compositional (non-Markovian) system. For example, although we have not run the full data set under the Markov backbone, a random sample points to a substantial drop in detection performance, from the current 98+% of correctly read plates to something closer to 90%, as well as the appearance of some false detections at the license-plate level.

A different kind of experiment bears on the justification of hierarchical structure, per se. As formulated in §2, an interpretation amounts to an annotation of a scene in terms of a multitude of parts and objects. (See Figure 5 for the top 25 parts and objects participating in a particular interpretation.) Consider now a highly simplified version of the license-plate composition machine, consisting of only the bottom two layers. The system can be used to detect characters in images. An alternative use of the character models embodied in the compositional structure would be to test at each location for the presence of a particular character, against the alternative that neither the character nor any part of the character is present. In other words, an all-or-none test instead of a test for character against the compound alternative of background or part(s). (We restrict ourselves to the character layer because the all-or-none test is nearly computationally prohibitive.) We find that recognition performance, as measured for example by the ROC curve, suffers substantially when we force an all-or-none decision. We will have more to say about this observation shortly.

Figure 4. Extracted plate region of sample images

4. Concluding Remarks

The machine was "built by hand," but possibly some of it could be inferred directly from data. For example, it is not difficult to imagine parameter estimation schemes that would employ labeled or unlabeled data to statistically adjust the brick-based probabilities (ε^β_0, ε^β_1, ..., ε^β_{nβ}; see §2), or the relational distributions (p^c_β and p^0_β) that govern the attribute likelihood ratios. On the other hand, learning the architecture itself, including the selection of bricks, children sets, and attribute functions, is an enormously challenging problem. We have little to say on this matter except to speculate that such a system would probably have to be inferred bottom-up, one layer at a time, perhaps based upon the principle of "suspicious coincidences" articulated by Barlow in his theory of unsupervised learning [3].

We believe that there is an important connection between reusability and the persistent gap between human and machine performance in vision. As every practitioner knows, the computer vision problem would be far easier if "background" could be reasonably modeled as some kind of sim…

Object detection grammars

• Generalization of deformable part models (tree-structured pictorial structures)

• Object defined by a stochastic grammar

- Each derivation has a different set of parts

- Productions capture spatial relationships between parts and sub-parts

- Terminals model local image data

(A tractable compositional framework)

Relationship to pictorial structures / DPM

• Pictorial structure

- parts (local appearance)

- springs (spatial relationships)

- parts and springs form a graph --- structure is fixed

• Object detection grammar

- Grammar generates tree of symbols --- structure is variable

- Location of symbol is related to location of parent

- Appearance model associated with each terminal

Object detection grammars

• Set of terminal symbols T

- (templates)

• Set of nonterminal symbols N

- (objects/parts)

• Set of placements Ω within an image

• Placed symbol X(ω)

- X ∈ T ∪ N

- ω ∈ Ω

eye((100,80),10)

face((90,10),50)

ω might be (x,y) position and scale
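As a concrete sketch (a hypothetical Python encoding), a placed symbol pairs a symbol name from T ∪ N with a placement ω given as a position and a scale:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    x: int       # position within the feature map
    y: int
    scale: int   # level in the feature pyramid

@dataclass(frozen=True)
class PlacedSymbol:
    symbol: str        # element of T ∪ N, e.g. "face" or "eye"
    omega: Placement   # where the symbol is placed

# The examples from the slide: eye((100,80),10) and face((90,10),50).
eye = PlacedSymbol("eye", Placement(100, 80, 10))
face = PlacedSymbol("face", Placement(90, 10, 50))
```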

Production rules

• Productions define expansions of nonterminals into bags of symbols

• We expand a nonterminal into a bag of terminals by repeatedly applying productions

- There are choices along the way

- Expansion score = sum of scores of productions used along the way

- X(ω) ~~s~~> { A1(ω1), ... , An(ωn) } (sequence of expansions)

- Leads to a derivation tree

X(ω) --s--> { Y1(ω1), ... , Yn(ωn) }

(a placed nonterminal expands, with score s, into a bag of placed symbols)

Appearance for terminals

• Each terminal has an appearance model

- Defined by a scoring function f(A,ω,I)

- Score for placing terminal A at position ω within image I

[Figure: filter FA applied to image I, producing the score f(A,ω,I)]

f(A,ω,I) might be the response of a HOG filter FA at position ω within I
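Concretely, with a precomputed HOG feature pyramid, such a score is a dot product between the filter and the sub-window of features at ω. A minimal numpy sketch (names are illustrative, and the window is assumed to lie inside the feature map):

```python
import numpy as np

def filter_score(F_A, feature_pyramid, omega):
    """Score for placing terminal A (with filter F_A) at placement omega.
    F_A: (h, w, d) array of HOG filter weights.
    feature_pyramid: list of (H, W, d) HOG feature maps, one per scale.
    omega: (x, y, scale) placement; the filter window must fit in the map."""
    x, y, scale = omega
    h, w, _ = F_A.shape
    window = feature_pyramid[scale][y:y + h, x:x + w, :]
    return float(np.sum(F_A * window))   # dot product = filter response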

Appearance for nonterminals

• Extend the appearance model from terminals to nonterminals

• Best expansion of X(ω) into a bag of placed terminals

- Takes into account

1) expansion score

2) appearance model of placed terminals at their placements

• Detect objects (any symbol) by finding high scoring placements

f(X,ω,I) = max over expansions X(ω) ~~s~~> { A1(ω1), ... , An(ωn) } of ( s + ∑i f(Ai,ωi,I) )
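This recursion translates almost line for line into code. A minimal sketch, assuming hypothetical helpers productions(X, ω) (the placed productions with X(ω) on the left-hand side) and appearance(A, ω, I) (the terminal score); a real implementation shares work across placements with dynamic programming rather than recomputing recursively:

```python
def score(X, omega, I, productions, appearance, is_terminal):
    """f(X, omega, I): best score over all expansions of X(omega)
    into a bag of placed terminals."""
    if is_terminal(X):
        return appearance(X, omega, I)
    best = float("-inf")
    # productions(X, omega) yields (s, [(Y1, w1), ..., (Yn, wn)]) pairs.
    for s, children in productions(X, omega):
        total = s + sum(score(Y, w, I, productions, appearance, is_terminal)
                        for Y, w in children)
        best = max(best, total)
    return best
```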

Isolated deformation grammars

• Productions defined by two kinds of schemas

• Structure schema

- One production for each placement ω

• Deformation schema

- One production for each ω and displacement δ

• Leads to efficient algorithm for computing scores f(X,ω,I)

X(ω) --s--> { Y1(ω+δ1), ... , Yn(ω+δn) }

X(ω) --s(δ)--> { Y(ω+δ) }
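For a deformation schema the inner maximization ranges over displacements δ. A brute-force sketch over a bounded window (hypothetical names; real implementations use generalized distance transforms to compute this for all ω in linear time):

```python
def deformation_score(f_Y, omega, alpha, radius=4):
    """Best score for a deformation rule X(w) --s(d)--> { Y(w+d) } with
    s(d) = alpha . phi(d) and phi(d) = (dx, dy, dx*dx, dy*dy).
    f_Y(w) returns the child's score at placement w = (x, y, scale).
    The learned alpha typically makes s(d) a penalty growing with |d|."""
    x, y, scale = omega
    best = float("-inf")
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            s = (alpha[0] * dx + alpha[1] * dy
                 + alpha[2] * dx * dx + alpha[3] * dy * dy)
            best = max(best, s + f_Y((x + dx, y + dy, scale)))
    return best
```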

Face grammar

…domain D = Ω × Δ of the parameter (ω, δ) in a deformation rule.

An isolated deformation grammar defines a "factored" model such that the position of a symbol in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE, EYE, EYE′, MOUTH, MOUTH′},
T = {FACE.FILTER, EYE.FILTER, SMILE.FILTER, FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

∀ω : FACE(ω) --0--> { FACE.FILTER(ω), EYE′(ω+δl), EYE′(ω+δr), MOUTH′(ω+δm) }.

Here δl, δr, δm are constants that specify the ideal displacement between each part and the face. Note that EYE′ appears twice in the right hand side under different ideal displacements, to account for the two eyes in a face. We can move the parts that make up a face from their ideal locations using deformation rules

∀ω, δ : EYE′(ω) --−||δ||²--> { EYE(ω+δ) },
∀ω, δ : MOUTH′(ω) --−||δ||²--> { MOUTH(ω+δ) }.

Finally we associate templates with the part nonterminals using structural rules

∀ω : EYE(ω) --0--> { EYE.FILTER(ω) },
∀ω : MOUTH(ω) --s--> { SMILE.FILTER(ω) },
∀ω : MOUTH(ω) --f--> { FROWN.FILTER(ω) }.

The last two rules specify two different templates that can be used for the mouth, at different scores that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree such that: (1) each leaf v has an associated placed terminal A(ω); (2) each internal node v has an associated placed nonterminal X(ω), a placed production schema, and a value z for the schema leading to a placed production with X(ω) in the left hand side and the placed symbols associated…


1) Face defined by global template and parts


2) Parts can move relative to their ideal location


3) Parts defined by templates
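Putting the three pieces together, the face grammar can be written in the rule-as-data style sketched earlier. This is a hypothetical encoding: the displacement constants and the smile/frown scores below are made-up placeholders, not learned values:

```python
# Each rule: (lhs, score, [(child, ideal_displacement), ...]).
# "deform" rules additionally pay -||delta||^2 for moving their child.
DL, DR, DM = (-8, -10), (8, -10), (0, 12)   # assumed ideal part offsets
s_smile, s_frown = 0.5, -0.5                # assumed prevalence scores

FACE_GRAMMAR = [
    ("FACE",   0.0,      [("FACE.FILTER", (0, 0)), ("EYE'", DL),
                          ("EYE'", DR), ("MOUTH'", DM)]),
    ("EYE'",   "deform", [("EYE", None)]),    # score -||delta||^2
    ("MOUTH'", "deform", [("MOUTH", None)]),  # score -||delta||^2
    ("EYE",    0.0,      [("EYE.FILTER", (0, 0))]),
    ("MOUTH",  s_smile,  [("SMILE.FILTER", (0, 0))]),
    ("MOUTH",  s_frown,  [("FROWN.FILTER", (0, 0))]),
]
```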

Learning

• z is an expansion of X(ω) into a bag of terminals

• w is a vector of model parameters

- Score of each structure schema

- Deformation parameters of each deformation schema

- Appearance model for each terminal (HOG template)

• w can be trained using a Latent SVM

f(X,ω,I) = max over expansions X(ω) ~~s~~> { A1(ω1), ... , An(ωn) } of ( s + ∑i f(Ai,ωi,I) )

f(X,ω,I) = max over z of w · ϕ(I, z)
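A heavily simplified sketch of the latent SVM alternation, with hypothetical interfaces phi(I, z) and best_z(w, I); the actual training procedure also mines hard negative examples and solves the convex subproblem exactly rather than by a few subgradient steps:

```python
import numpy as np

def latent_svm_train(positives, negatives, phi, best_z, dim,
                     lam=0.01, C=1.0, rounds=5, lr=1e-3, epochs=10):
    """Sketch of latent SVM training (hypothetical interfaces).
    phi(I, z): feature vector for image I with expansion z.
    best_z(w, I): highest-scoring expansion of the start symbol under w."""
    w = np.zeros(dim)
    for _ in range(rounds):
        # Fix the latent expansion for each positive; with these fixed,
        # the objective is convex in w (the semi-convexity argument).
        pos = [(phi(I, best_z(w, I)), +1.0) for I in positives]
        for _ in range(epochs):
            # For negatives, the max over z keeps the loss convex, so the
            # latent value may be recomputed under the current w.
            neg = [(phi(I, best_z(w, I)), -1.0) for I in negatives]
            for x, y in pos + neg:
                violated = y * np.dot(w, x) < 1.0
                grad = lam * w - (C * y * x if violated else 0.0)
                w -= lr * grad   # subgradient step on hinge loss + L2
    return w
```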

Building a person grammar

• Consider mixture of DPMs learned from PASCAL VOC data

• Components differ in how much of the person is visible

• Components have independent parameters

- Inefficient use of training data

• We can build a grammar that

- Allows more flexibility in modeling visibility

- Shares parts among different interpretations

Person detection grammar [NIPS 2011]

• Instantiation includes a variable number of parts

- 1,...,k and occluder if k < 6

• Parts can translate relative to each other

• Parts have subtypes

• Parts have deformable sub-parts (not shown)

• Beats all other methods on PASCAL 2010 (49.5 AP)


[Figure: example detections and derived filters. Panels show Parts 1-6 (no occlusion), Parts 1-4 & occluder, and Parts 1-2 & occluder; filters for Parts 1-6 and the occluder are shown in two subtypes.]

Figure 1: Shallow grammar model. This figure illustrates a shallow version of our grammar model (Section 2.1). This model has six person parts and an occlusion model ("occluder"), each of which comes in one of two subtypes. A detection places one subtype of each visible part at a location and scale in the image. If the derivation does not place all parts it must place the occluder. Parts are allowed to move relative to each other but are constrained by deformation penalties.

We consider models with productions specified by two kinds of schemas (a schema is a template for generating productions). A structure schema specifies one production for each placement ω ∈ Ω,

X(ω) --s--> { Y1(ω+δ1), ... , Yn(ω+δn) }.   (3)

Here the δi specify constant displacements within the feature map pyramid. Structure schemas can be used to define decompositions of objects into other objects.

Let Δ be the set of possible displacements within a single scale of a feature map pyramid. A deformation schema specifies one production for each placement ω ∈ Ω and displacement δ ∈ Δ,

X(ω) --α·φ(δ)--> { Y(ω+δ) }.   (4)

Here φ(δ) is a feature vector and α is a vector of deformation parameters. Deformation schemas can be used to define deformable models. We define φ(δ) = (dx, dy, dx², dy²) so that deformation scores are quadratic functions of the displacements.

The parameters of our models are defined by a weight vector w with entries for the score of each structure schema, the deformation parameters of each deformation schema and the filter coefficients associated with each terminal. Then score(T) = w · Φ(T) where Φ(T) is the sum of (sparse) feature vectors associated with each placed terminal and production in T.

2.1 A grammar model for detecting people

Each component in the person model learned by the voc-release4 system [12] is tuned to detect people under a prototypical visibility pattern. Based on this observation we designed, by hand, the structure of a grammar that models visibility by using structural variability and optional parts. For clarity, we begin by describing a shallow model (Figure 1) that places all filters at the same resolution in the feature map pyramid. After explaining this model, we describe a deeper model that includes deformable subparts at higher resolutions.

Fine-grained occlusion. Our grammar model has a start symbol Q that can be expanded using one of six possible structure schemas. These choices model different degrees of visibility ranging from heavy occlusion (only the head and shoulders are visible) to no occlusion at all.

Beyond modeling fine-grained occlusion patterns when compared to the mixture models from [12] or [7], our grammar model is also richer in the following ways. In Section 5 we show that each of these aspects improves detection performance.

Occlusion model. If a person is occluded, then there must be some cause for the occlusion: either the edge of the image or an occluding object such as a desk or dinner table. We model the cause of occlusion through an occlusion object that has a non-trivial appearance model.


Building the model

• Type in a manually defined grammar

• Learn parameters from bounding box annotations

- Production scores

- Deformation models

- Templates (appearance model) for terminals


Part subtypes. The mixture model from [12] has two subtypes for each mixture component. The subtypes are forced to be mirror images of each other and correspond roughly to left-facing people and right-facing people. Our grammar model has two subtypes for each part, which are also forced to be mirror images of each other. But in the case of our grammar model, the decision of which part subtype to instantiate at detection time is independent for each part.

The shallow person grammar model is defined by the following grammar. The indices p (for part), t (for subtype), and k have the following ranges: p ∈ {1, ..., 6}, t ∈ {L, R} and k ∈ {1, ..., 5}.

Q(ω) --sk--> { Y1(ω+δ1), ... , Yk(ω+δk), O(ω+δk+1) }
Q(ω) --s6--> { Y1(ω+δ1), ... , Y6(ω+δ6) }
Yp(ω) --0--> { Yp,t(ω) }
Yp,t(ω) --αp,t·φ(δ)--> { Ap,t(ω+δ) }
O(ω) --0--> { Ot(ω) }
Ot(ω) --αt·φ(δ)--> { At(ω+δ) }

The grammar has a start symbol Q with six alternate choices that derive people under varying degrees of visibility (occlusion). Each part has a corresponding nonterminal Yp that is placed at some ideal position relative to Q. Derivations with occlusion include the occlusion symbol O. A derivation selects a subtype and displacement for each visible part. The parameters of the grammar (production scores, deformation parameters and filters) are learned with the discriminative procedure described in Section 4. Figure 1 illustrates the filters in the resulting model and some example detections.

Deeper model. We extend the shallow model by adding deformable subparts at two scales: (1) the same as, and (2) twice the resolution of the start symbol Q. When detecting large objects, high-resolution subparts capture fine image details. However, when detecting small objects, high-resolution subparts cannot be used because they "fall off the bottom" of the feature map pyramid. The model uses derivations with low-resolution subparts when detecting small objects.

We begin by replacing the productions from Yp,t in the grammar above, and then adding new productions. Recall that p indexes the top-level parts and t indexes subtypes. In the following schemas, the indices r (for resolution) and u (for subpart) have the ranges: r ∈ {H, L}, u ∈ {1, ..., Np}, where Np is the number of subparts in a top-level part Yp.

Yp,t(ω) --αp,t·φ(δ)--> { Zp,t(ω+δ) }
Zp,t(ω) --0--> { Ap,t(ω), Wp,t,r,1(ω+δp,t,r,1), ... , Wp,t,r,Np(ω+δp,t,r,Np) }
Wp,t,r,u(ω) --αp,t,r,u·φ(δ)--> { Ap,t,r,u(ω+δ) }

We note that as in [22] our model has hierarchical deformations. The part terminal Ap,t can move relative to Q, and the subpart terminal Ap,t,r,u can move relative to Ap,t.

The displacements δp,t,H,u place the symbols Wp,t,H,u one octave below Zp,t in the feature map pyramid. The displacements δp,t,L,u place the symbols Wp,t,L,u at the same scale as Zp,t. We add subparts to the first two top-level parts (p = 1 and 2), with the number of subparts set to N1 = 3 and N2 = 2. We find that adding additional subparts does not improve detection performance.

2.2 Inference and test time detection

Inference involves finding high scoring derivations. At test time, because images may contain multiple instances of an object class, we compute the maximum scoring derivation rooted at Q(ω), for each ω ∈ Ω. This can be done efficiently using a standard dynamic programming algorithm [11].

We retain only those derivations that score above a threshold, which we set low enough to ensure high recall. We use box(T) to denote a detection window associated with a derivation T. Given a set of candidate detections, we apply non-maximal suppression to produce a final set of detections.

To define box(T) we assign a detection window size, in feature map coordinates, to each production schema that can be applied to the start symbol. This leads to detections with one of six possible aspect ratios, depending on which production was used in the first step of the derivation. The absolute location and size of a detection depends on the placement of Q. For the first five production schemas, the ideal location of the occlusion part, O, is outside of box(T).
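The non-maximal suppression step can be sketched as the standard greedy procedure; the exact overlap criterion used by voc-release-style code may differ from the IoU test assumed here:

```python
def nms(detections, overlap_threshold=0.5):
    """Greedy non-maximal suppression. detections: list of
    (score, (x1, y1, x2, y2)). Keeps the highest-scoring box, discards
    boxes that overlap a kept box too much, and repeats."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    keep = []
    for score, box in sorted(detections, reverse=True):  # best first
        if all(iou(box, kept) <= overlap_threshold for _, kept in keep):
            keep.append((score, box))
    return keep
```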


Detections with person grammar

Qualitative results 1

(a) Full visibility (b) Occlusion boundaries

Figure: Example detections. Parts are blue. The occlusion part, if used, is dashed cyan. (a) Detections of fully visible people. (b) Examples where the occlusion part detects an occlusion boundary.

Qualitative results 2

(a) Early termination (b) Mistakes

Figure: Example detections. Parts are blue. The occlusion part, if used, is dashed cyan. (a) Detections where there is no occlusion, but a partial person is appropriate. (b) Mistakes, where the model did not detect occlusion properly.


Evolution of models

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb, University of Chicago

David McAllester, Toyota Technological Institute at Chicago

Deva Ramanan, UC Irvine

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.


Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.


The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1-3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by "conceptually weaker" models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining "hard negative" examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a…


DPM [FMR 2008]: AP = 0.27
DPM, 2 components [FGMR 2010]: AP = 0.36
DPM, 6 components (voc-release4): AP = 0.43

Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each "pixel" shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image. (e) Its computed R-HOG descriptor. (f,g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.

would help to improve the detection results in more general situations.

Acknowledgments. This work was supported by the European Union research projects ACEMEDIA and PASCAL. We thank Cordelia Schmid for many useful comments. SVM-Light [10] provided reliable training of large-scale SVMs.

References

[1] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. The 8th ICCV, Vancouver, Canada, pages 454-461, 2001.

[2] V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool. Efficient pedestrian detection: a test case for svm based categorization. Workshop on Cognitive Vision, 2002. Available online: http://www.vision.ethz.ch/cogvis02/.

[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. CVPR, Hilton Head Island, South Carolina, USA, pages 66-75, 2000.

[4] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. Intl. Workshop on Automatic Face and Gesture Recognition, IEEE Computer Society, Zurich, Switzerland, pages 296-301, June 1995.

[5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pages 100-105, October 1996.

[6] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82-98, 1999.

[7] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: the protector+ system. Proc. of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004.

[8] D. M. Gavrila and V. Philomin. Real-time object detection for smart vehicles. CVPR, Fort Collins, Colorado, USA, pages 87-93, 1999.

[9] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45-68, 2001.

[10] T. Joachims. Making large-scale svm learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. The MIT Press, Cambridge, MA, USA, 1999.

[11] Y. Ke and R. Sukthankar. Pca-sift: A more distinctive representation for local image descriptors. CVPR, Washington, DC, USA, pages 66-75, 2004.

[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.

[13] R. K. McConnell. Method of and apparatus for pattern recognition, January 1986. U.S. Patent No. 4,567,610.

[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 2004. Accepted.

[15] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63-86, 2004.

[16] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. The 8th ECCV, Prague, Czech Republic, volume I, pages 69-81, 2004.

[17] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23(4):349-361, April 2001.

[18] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1):15-33, 2000.

[19] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. The 7th ECCV, Copenhagen, Denmark, volume IV, pages 700-714, 2002.

[20] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151-177, 2004.

[21] E. L. Schwartz. Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics, 25(4):181-194, 1977.

[22] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. The 9th ICCV, Nice, France, volume 1, pages 734-741, 2003.

HOG [DT 05]: AP = 0.16


[Figure 1 panels: derived filters for Subtype 1 and Subtype 2 of Parts 1-6 and the Occluder, with example detections placing Parts 1-6 (no occlusion), Parts 1-4 & occluder, and Parts 1-2 & occluder.]

Figure 1: Shallow grammar model. This figure illustrates a shallow version of our grammar model (Section 2.1). This model has six person parts and an occlusion model ("occluder"), each of which comes in one of two subtypes. A detection places one subtype of each visible part at a location and scale in the image. If the derivation does not place all parts it must place the occluder. Parts are allowed to move relative to each other but are constrained by deformation penalties.

We consider models with productions specified by two kinds of schemas (a schema is a template for generating productions). A structure schema specifies one production for each placement $\omega \in \Omega$,

$$X(\omega) \xrightarrow{\;s\;} \{\, Y_1(\omega \oplus \delta_1), \ldots, Y_n(\omega \oplus \delta_n) \,\}. \qquad (3)$$

Here the $\delta_i$ specify constant displacements within the feature map pyramid. Structure schemas can be used to define decompositions of objects into other objects.

Let $\Delta$ be the set of possible displacements within a single scale of a feature map pyramid. A deformation schema specifies one production for each placement $\omega \in \Omega$ and displacement $\delta \in \Delta$,

$$X(\omega) \xrightarrow{\;\alpha \cdot \phi(\delta)\;} \{\, Y(\omega \oplus \delta) \,\}. \qquad (4)$$

Here $\phi(\delta)$ is a feature vector and $\alpha$ is a vector of deformation parameters. Deformation schemas can be used to define deformable models. We define $\phi(\delta) = (dx, dy, dx^2, dy^2)$ so that deformation scores are quadratic functions of the displacements.

The parameters of our models are defined by a weight vector $w$ with entries for the score of each structure schema, the deformation parameters of each deformation schema and the filter coefficients associated with each terminal. Then $\mathrm{score}(T) = w \cdot \Phi(T)$, where $\Phi(T)$ is the sum of (sparse) feature vectors associated with each placed terminal and production in $T$.
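To make the scoring concrete, here is a minimal sketch of evaluating a fixed derivation tree: each structure production contributes its score $s$, each deformation production contributes $\alpha \cdot \phi(\delta)$, and each placed terminal contributes a filter response. The `Node` class, `filter_response` and the way production scores are stored are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import List, Tuple

def phi(delta):
    """Deformation features: phi(delta) = (dx, dy, dx^2, dy^2)."""
    dx, dy = delta
    return np.array([dx, dy, dx * dx, dy * dy], dtype=float)

def deformation_score(alpha, delta):
    """Score on a deformation production: alpha . phi(delta)."""
    return float(np.dot(alpha, phi(delta)))

@dataclass
class Node:
    symbol: str
    placement: Tuple[int, int]      # (x, y); scale omitted for brevity
    production_score: float = 0.0   # s, or alpha . phi(delta), for the production used here
    children: List["Node"] = field(default_factory=list)

def filter_response(filt, feat, placement):
    """Dot product of a terminal's filter with a subwindow of the feature map."""
    x, y = placement
    h, w = filt.shape[:2]
    return float(np.sum(filt * feat[y:y + h, x:x + w]))

def score_tree(node, filters, feat):
    """score(T) = w . Phi(T): production scores plus terminal filter responses."""
    if not node.children:           # a placed terminal
        return filter_response(filters[node.symbol], feat, node.placement)
    return node.production_score + sum(score_tree(c, filters, feat) for c in node.children)
```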

2.1 A grammar model for detecting people

Each component in the person model learned by the voc-release4 system [12] is tuned to detect people under a prototypical visibility pattern. Based on this observation we designed, by hand, the structure of a grammar that models visibility by using structural variability and optional parts. For clarity, we begin by describing a shallow model (Figure 1) that places all filters at the same resolution in the feature map pyramid. After explaining this model, we describe a deeper model that includes deformable subparts at higher resolutions.

Fine-grained occlusion. Our grammar model has a start symbol Q that can be expanded using one of six possible structure schemas. These choices model different degrees of visibility ranging from heavy occlusion (only the head and shoulders are visible) to no occlusion at all.
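One plausible reading of these six schemas, written in the same arrow notation as the example grammar (the part indices are an illustrative assumption; part 1 roughly covers the head and shoulders):

Q → part1, occluder
Q → part1, part2, occluder
Q → part1, part2, part3, occluder
Q → part1, ..., part4, occluder
Q → part1, ..., part5, occluder
Q → part1, ..., part6

The last schema places all six parts and no occluder; every other schema must place the occluder, as required by Figure 1.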

Beyond modeling fine-grained occlusion patterns, our grammar model is also richer than the mixture models from [12] or [7] in the following ways. In Section 5 we show that each of these aspects improves detection performance.

Occlusion model. If a person is occluded, then there must be some cause for the occlusion: either the edge of the image or an occluding object such as a desk or dinner table. We model the cause of occlusion through an occlusion object that has a non-trivial appearance model.


Grammar [GFM 2011]: AP = 0.47

Summary

• The big challenge is modeling appearance variation

• Object detection grammars can express many types of models

- Models with variable structure

- Models with repeated/shared parts

- etc. -- think of it as a programming language

• General implementation

- Isolated deformation grammar + HOG filters + LSVM training

• As in NLP, learning grammar structure is an open problem

Object detection with grammar models
Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown University), David McAllester (TTI-Chicago)

Problem overview

Object detection grammars

Objects from rich categories have diverse structural variation

There are too many combinations!

[Example annotations: helmet, occluded left side · ski cap, no face, truncated · pirate hat, dresses, long hair · truncation, holding glass · truncation]

Our approach: compositional models defined by grammars

Case study: person detection

Learning from weakly labeled data

Experimental results

[image source: PASCAL VOC 2009]

Example grammar for people

person → face, trunk, arms, lower-part

face → eyes, nose, mouth

face → hat, eyes, nose, mouth

hat → baseball-cap

hat → sombrero

lower-part → shoe, shoe, legs

lower-part → bare-foot, bare-foot, legs

legs → pants

legs → skirt
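One way to take "think of it as a programming language" literally is to write these rules down as data. A minimal sketch: `PERSON_GRAMMAR` is a hypothetical name, and the representation (each nonterminal mapped to its alternative right-hand sides) is our own choice, not the authors' code.

```python
# Nonterminal -> list of alternative right-hand sides (bags of symbols).
PERSON_GRAMMAR = {
    "person":     [["face", "trunk", "arms", "lower-part"]],
    "face":       [["eyes", "nose", "mouth"],
                   ["hat", "eyes", "nose", "mouth"]],
    "hat":        [["baseball-cap"], ["sombrero"]],
    "lower-part": [["shoe", "shoe", "legs"],
                   ["bare-foot", "bare-foot", "legs"]],
    "legs":       [["pants"], ["skirt"]],
}
# Symbols that never appear on a left-hand side (eyes, nose, mouth, shoe, ...)
# are terminals, scored against the image by filters.
```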

Object detection grammar formalism

Set of nonterminal symbols $N$
Set of terminal symbols $T$
Set of placements $\Omega$ (a placement might be an $(x, y)$ position and scale)
Placed symbols $X(\omega)$ with $X \in T \cup N$, $\omega \in \Omega$; for example face((90,10),50) or eye((110,80),10)
Set of production rules of the form

$$X(\omega) \xrightarrow{\;s\;} \{\, Y_1(\omega_1), \ldots, Y_n(\omega_n) \,\}$$

in which a placed nonterminal rewrites, with score $s$, to a bag of placed symbols.

Placed terminals are scored by $f(A, \omega, I)$: applying the filter $F_A$ to the image $I$ yields a score map. Nonterminals are scored by computing

$$f(X, \omega, I) = \max_{X(\omega) \xrightarrow{\;s\;} \{A_1(\omega_1), \ldots, A_n(\omega_n)\}} \Big[\, s + \sum_i f(A_i, \omega_i, I) \,\Big].$$
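A minimal sketch of this max-sum recursion, assuming a finite placement set and a table of productions keyed by placed nonterminal. All names (`best_score`, `terminal_score`, `productions`) are illustrative; practical implementations compute whole score maps bottom-up and use distance transforms for deformation rules rather than memoizing placements one at a time.

```python
def best_score(symbol, placement, productions, terminal_score, cache=None):
    """f(X, w, I): best derivation score for a placed symbol."""
    if cache is None:
        cache = {}
    key = (symbol, placement)
    if key in cache:
        return cache[key]
    # productions maps (symbol, placement) to a list of rules
    # (s, [(child_symbol, child_placement), ...]); terminals have no rules.
    rules = productions.get(key, [])
    if not rules:
        value = terminal_score(symbol, placement)   # filter response at this placement
    else:
        value = max(
            s + sum(best_score(a, wp, productions, terminal_score, cache)
                    for a, wp in children)
            for s, children in rules
        )
    cache[key] = value
    return value
```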

Dalal & Triggs, CVPR 2005: AP 0.12
Felzenszwalb, McAllester & Ramanan, CVPR 2008: AP 0.27
Felzenszwalb, Girshick, McAllester & Ramanan, PAMI 2010: AP 0.36
Felzenszwalb, Girshick & McAllester, voc-release4: AP 0.42

More mixture components?

[Poster figure: derived filters for Subtype 1 and Subtype 2 of Parts 1-6 and the Occluder, with example detections placing Parts 1-6 (no occlusion), Parts 1-4 & occluder, and Parts 1, 2 & occluder.]

✸ Fine-grained occlusion
✸ Part sharing
✸ Nontrivial model of the stuff that causes occlusion
✸ Part subtypes

Full model: deformable subparts at multiple resolutions. AP 0.47 (PASCAL VOC 2007)

[Derivation tree diagram for the person grammar: person → face, trunk, arms, lower-part; face → eyes, nose, mouth; lower-part → shoe, shoe, legs; legs → pants.]

Weak-label structural SVM (WL-SSVM)

Goal: train a function $f$ using different label and output spaces, with label $y \in Y$ and output $s \in S$. We connect labels to outputs by a loss function. WL-SSVM generalizes the structural SVM and the latent structural SVM.

Function to learn:
$$f(x) = \operatorname*{argmax}_{s \in S(x)} w \cdot \Phi(x, s)$$

Objective to minimize:
$$E(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} L'(w, x_i, y_i)$$

Surrogate loss, a difference of two loss-augmented predictions:
$$L'(w, x, y) = \max_{s \in S(x)} \big[\, w \cdot \Phi(x, s) + L_{\mathrm{margin}}(y, s) \,\big] \;-\; \max_{s \in S(x)} \big[\, w \cdot \Phi(x, s) - L_{\mathrm{output}}(y, s) \,\big]$$
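A minimal sketch of the surrogate loss $L'$, assuming a finite, enumerable output space $S(x)$; `outputs`, `features`, `l_margin` and `l_output` are illustrative stand-ins, not names from the authors' code.

```python
import numpy as np

def wl_ssvm_loss(w, x, y, outputs, features, l_margin, l_output):
    """L'(w, x, y): difference of two loss-augmented maximizations over S(x)."""
    cand = list(outputs(x))
    scores = [float(np.dot(w, features(x, s))) for s in cand]
    # max_s [ w . Phi(x, s) + L_margin(y, s) ]
    margin_term = max(sc + l_margin(y, s) for sc, s in zip(scores, cand))
    # max_s [ w . Phi(x, s) - L_output(y, s) ]
    output_term = max(sc - l_output(y, s) for sc, s in zip(scores, cand))
    # Nonnegative whenever L_margin and L_output are nonnegative.
    return margin_term - output_term
```

Choosing $S(x) = Y$ and $L_{\mathrm{margin}} = L_{\mathrm{output}}$ equal to the task loss recovers the structural SVM; keeping latent structure such as part placements inside the output $s$ while labels specify only bounding boxes gives latent-SVM-style training.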

[Example detections: full visibility · occlusion boundary · early termination · mistakes]

PASCAL VOC 2010 'person' category results

Method   Grammar  +bbox  +context  UoC-TTI  +bbox  +context  Poselets
AP       47.5     47.6   49.5      44.4     45.2   47.5      48.5

Comparison of training objectives on PASCAL VOC 2007

Objective  Grammar LSVM  Grammar WL-SSVM  Mixture LSVM  Mixture WL-SSVM
AP         45.3          46.7             42.6          43.2

Comparison of model structure choices on PASCAL VOC 2007

Model  Full model  -subtypes  -occluder  -subparts
AP     46.7        45.5       45.5       40.6