


Parsing Occluded People by Flexible Compositions

Xianjie Chen
University of California, Los Angeles
Los Angeles, CA 90095
[email protected]

Alan Yuille
University of California, Los Angeles
Los Angeles, CA 90095
[email protected]

Abstract

This paper presents an approach to parsing humans when there is significant occlusion. We model humans using a graphical model which has a tree structure building on recent work [?, ?] and exploit the prior that, even in the presence of occlusion, the visible nodes form a connected subtree of the graphical model. We call each connected subtree a flexible composition of object parts. This involves a novel method for learning occlusion cues. During inference we need to search over a mixture of different flexible models. By exploiting part sharing, we show that this inference can be done extremely efficiently, requiring only twice as many computations as searching for the entire object (i.e., not modeling occlusion). We evaluate our model on the standard benchmarked "We Are Family" Stickmen dataset and obtain significant performance improvements over the best alternative algorithms.

1. Introduction

Parsing humans into parts is an important visual task with many applications such as activity recognition [?, ?]. A common approach is to formulate this task in terms of graphical models where the graph nodes and edges represent human parts and their spatial relationships respectively. This approach is becoming successful on benchmarked datasets [?, ?]. But in many real world situations many human parts are occluded. Standard methods are partially robust to occlusion by, for example, using a latent variable to indicate whether a part is present and paying a penalty if the part is not detected, but they are not designed to deal with significant occlusion. One of these models [?] will be used in this paper as a base model, and we will compare to it.

In this paper, we observe that part occlusions often occur in regular patterns. The visible parts of a human tend to consist of a subset of connected parts even when there is significant occlusion (see Figures 1 and 2(a)). In the terminology of graphical models, the visible (non-occluded) nodes form a connected subtree of the full graphical model (following current models, for simplicity, we assume that the graphical model is treelike). This prior is not always valid (i.e., the visible parts of humans may form two or more connected subsets), but our analysis (see Section 6.4) suggests it is often true. In any case, we will restrict ourselves to it in this paper, since verifying that some isolated pieces of body parts belong to the same person is still very difficult for vision methods, especially in challenging scenes where multiple people occlude one another (see Figure 2).

[Figure 1: panels labeled "Full Graph" and "Flexible Compositions"]
Figure 1: An illustration of flexible compositions. Each connected subtree of the full graph (including the full graph itself) is a flexible composition. The flexible compositions that lack certain parts are suitable for people with those parts occluded.

To formulate our approach we build on the base model [?], which is the state of the art on several benchmarked datasets [?, ?, ?], but is not designed for dealing with significant occlusion. We explicitly model occlusions using the prior above. This means that we have a mixture of models where the number of components equals the number of connected subtrees of the graph, which we call flexible compositions; see Figure 1. The number of flexible compositions can be large (for a simple chain-like model consisting of N parts, there are N(N + 1)/2 possible compositions). Our approach exploits the fact that there is often local evidence for the presence of occlusions; see Figure 2(b). We propose a novel approach which learns occlusion cues that can break the links/edges between adjacent parts in the graphical model. It is well known, of course, that there are local cues such as T-junctions which can indicate local occlusions. But although these occlusion cues have been used by some models (e.g., [?, ?]), they are not standard in graphical models of objects.

We show that efficient inference can be done for our model by exploiting the sharing of computation between different flexible models. Indeed, the complexity is only doubled compared to recent models where occlusion is not explicitly modeled. This rapid inference also enables us to train the model efficiently from labeled data.

We illustrate our algorithm on the standard benchmarked "We Are Family" Stickmen (WAF) dataset [?] for parsing humans when significant occlusion is present. We show strong performance with significant improvement over the best existing method [?] and also outperform our base model [?]. We perform diagnostic experiments to verify our prior that the visible parts of a human tend to consist of a subset of connected parts even when there is significant occlusion, and quantify the effect of different aspects of our model.

2. Related work

Graphical models of objects have a long history [?, ?]. Our work is most closely related to the recent work of Yang and Ramanan [?] and Chen and Yuille [?], which we use as our base model and will compare to. Other relevant work includes [?, ?, ?, ?].

Occlusion modeling also has a long history. Psychophysical studies (e.g., Kanizsa [?]) show that T-junctions are a useful cue for occlusion. But with a few exceptions, e.g., [?], there has been little attempt to model the spatial patterns of occlusions. Instead it is more common to design models so that they are robust in the presence of occlusion, so that the model is not penalized very much if an object part is missing. Girshick et al. [?] and Supervised-DPM [?] model the occluded part (background) using extra templates, and they rely on a root part (i.e., the holistic object) that never takes the status of "occluded". When there is significant occlusion, modeling the root part itself is difficult. Ghiasi et al. [?] advocate modeling the occlusion area (background) using more templates (a mixture of templates), and localize every body part. It may be plausible to "guess" the occluded keypoints of a face (e.g., [?, ?]), but this seems impossible for human body parts, due to highly flexible human poses. Eichner and Ferrari [?] handle occlusion by modeling interactions between people, which assumes the occlusion is due to other people.

Our approach effectively models object occlusion using a mixture of models to deal with different occlusion patterns. There is considerable work which models objects using mixtures to deal with different configurations; see Poselets [?], which uses many mixtures to deal with different object configurations, and deformable part models (DPMs) [?], where mixtures are used to deal with different viewpoints.

To ensure efficient inference, we exploit the fact that parts are shared between different flexible compositions. This sharing of parts has been used in other work, e.g., [?]. Other work that exploits part sharing includes compositional models [?] and AND-OR graphs [?, ?].

3. The Model

We represent human pose by a graphical model G = (V, E), where the nodes V correspond to the parts (or joints) and the edges E indicate which parts are directly related. For simplicity, we impose that the graph structure forms a K-node tree, where K = |V|. The pixel location of part i is denoted by l_i = (x, y), for i ∈ {1, . . . , K}.

To model the spatial relationship between neighboring parts (i, j) ∈ E, we follow the base model [?] to discretize the pairwise spatial relationships into a set indexed by t_{ij}, which corresponds to a mixture of different spatial relationships.

In order to handle people with different degrees of occlusion, we specify a binary occlusion decoupling variable γ_{ij} ∈ {0, 1} for each edge (i, j) ∈ E, which enables the subtree T_j = (V(T_j), E(T_j)) rooted at part j to be decoupled from the graph at part i (the subtree does not contain part i, i.e., i ∉ V(T_j)). This results in a set of flexible compositions of the graph, indexed by the set C_G. These compositions share the nodes and edges with the full graph G, and each of them forms a tree graph (see Figure 1). The compositions that lack certain parts are suitable for people with those parts occluded.
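To make the set C_G concrete, the following Python sketch (our illustration, not from the paper) enumerates the flexible compositions of a tree by deciding, for each edge, whether to decouple the child subtree. Every connected subtree has a unique topmost node, so collecting the compositions topped at each node covers C_G exactly once; for a chain of N parts this yields the N(N + 1)/2 compositions mentioned in the introduction. Note the paper's inference never enumerates C_G explicitly; this is only to fix ideas.

def compositions_below(children, v):
    """All connected subtrees (frozensets of nodes) whose topmost node
    is v, built by choosing, for every child edge, either to decouple
    the child's subtree (gamma = 1) or keep one of its compositions."""
    options = [frozenset([v])]
    for c in children.get(v, []):
        merged = []
        for part in options:
            merged.append(part)                       # decouple child c
            for sub in compositions_below(children, c):
                merged.append(part | sub)             # keep child c
        options = merged
    return options

def all_flexible_compositions(children, root):
    """The set C_G: every connected subtree of the full tree."""
    stack, comps = [root], []
    while stack:
        v = stack.pop()
        comps.extend(compositions_below(children, v))
        stack.extend(children.get(v, []))
    return comps

# A 4-part chain 1-2-3-4 has 4*5/2 = 10 flexible compositions.
chain = {1: [2], 2: [3], 3: [4]}
assert len(all_flexible_compositions(chain, 1)) == 10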

In this paper, we exploit the prior that body parts tend to be connected even in the presence of occlusion, and do not consider the cases when people are separated into isolated pieces, which is very difficult. Handling these cases typically requires non-tree models, e.g., [?], which do not have exact and efficient inference algorithms. Moreover, verifying whether some isolated pieces of people belong to the same person is still very difficult for vision methods, especially in challenging scenes where multiple people usually occlude one another (see Figure 2(a)).

For each flexible composition G_c = (V_c, E_c), c ∈ C_G, we will define a score function F(l, t, G_c | I, G) as a sum of appearance terms, pairwise relational terms, occlusion decoupling terms and decoupling bias terms. Here I denotes the image, l = {l_i | i ∈ V} is the set of locations of the parts, and t = {t_{ij}, t_{ji} | (i, j) ∈ E} is the set of spatial relationships.
Appearance Terms: The appearance terms make use of the local image measurement within the patch I(l_i) to provide evidence for part i to lie at location l_i. They are of the form:

A(l_i | I) = w_i φ(i | I(l_i); θ),   (1)


[Figure 2: panels (a) and (b)]
Figure 2: Motivation. (a): In real world scenes, people are usually significantly occluded (or truncated). Requiring the model to localize a fixed set of body parts while ignoring the fact that different people have different degrees of occlusion (or truncation) is problematic. (b): The absence of body part evidence can help to predict occlusion, e.g., the right wrist of the lady in brown can be inferred as occluded because of the absence of a suitable wrist near the elbow. However, absence of evidence is not evidence of absence. It can fail in some challenging scenes; for example, even though the left arm of the lady in brown is completely occluded, there is still strong image evidence of a suitable elbow and wrist at the plausible locations due to the confusion caused by nearby people (e.g., the lady in green). In both situations, the local image measurements near the occlusion boundary (i.e., around the right elbow and left shoulder), e.g., in an image patch, can reliably provide evidence of occlusion.

where φ(·|·; θ) is the (scalar-valued) appearance term with θ as its parameters (specified in Section 3.1), and w_i is a scalar weight parameter.
Image Dependent Pairwise Relational (IDPR) Terms: We follow the base model [?] to use image dependent pairwise relational (IDPR) terms, which give stronger spatial constraints between neighboring parts (i, j) ∈ E. Stronger spatial constraints reduce the confusion from nearby people and the cluttered background, which helps to better infer occlusion.

More formally, the relative positions between parts i and j are discretized into several types t_{ij} ∈ {1, . . . , T_{ij}} (i.e., a mixture of different relationships) with corresponding mean relative positions r_{ij}^{t_{ij}} plus small deformations which are modeled by the standard quadratic deformation term. They are given by:

R(l_i, l_j, t_{ij}, t_{ji} | I) = ⟨w_{ij}^{t_{ij}}, ψ(l_j − l_i − r_{ij}^{t_{ij}})⟩ + w_{ij} ϕ_s(t_{ij}, γ_{ij} = 0 | I(l_i); θ)
                               + ⟨w_{ji}^{t_{ji}}, ψ(l_i − l_j − r_{ji}^{t_{ji}})⟩ + w_{ji} ϕ_s(t_{ji}, γ_{ji} = 0 | I(l_j); θ),   (2)

where ψ(∆l = [∆x, ∆y]) = [∆x, ∆x², ∆y, ∆y²]ᵀ are the standard quadratic deformation features, and ϕ_s(·, γ_{ij} = 0 | ·; θ) is the Image Dependent Pairwise Relational (IDPR) term with θ as its parameters (specified in Section 3.1). IDPR terms are only defined when both parts i and j are visible (i.e., γ_{ij} = 0 and γ_{ji} = 0). Here w_{ij}^{t_{ij}}, w_{ij}, w_{ji}^{t_{ji}}, w_{ji} are the weight parameters, the notation ⟨·, ·⟩ specifies dot product, and boldface indicates a vector.
Image Dependent Occlusion Decoupling (IDOD) Terms: These IDOD terms capture our intuition that a visible part i near the occlusion boundary (and thus a leaf node in each flexible composition) can reliably provide occlusion evidence using only local image measurement (see Figure 2(b) and Figure 3). More formally, the occlusion decoupling score for decoupling the subtree T_j from the full graph at part i is given by:

D_{ij}(γ_{ij} = 1, l_i | I) = w_{ij} ϕ_d(γ_{ij} = 1 | I(l_i); θ),   (3)

where ϕ_d(γ_{ij} = 1 | ·; θ) is the Image Dependent Occlusion Decoupling (IDOD) term with θ as its parameters (specified in Section 3.1), and γ_{ij} = 1 indicates that subtree T_j is decoupled from the full graph. Here w_{ij} is the scalar weight parameter shared with the IDPR term.
Decoupling Bias Term: The decoupling bias term captures our intuition that the absence of evidence for a suitable body part can help to predict occlusion. We specify a scalar bias term b_i for each part i as a learned measure for the absence of good part appearance, and also the absence of suitable spatial coupling with neighboring parts (our spatial constraints are also image dependent).

The decoupling bias term for decoupling the subtree T_j = (V(T_j), E(T_j)) from the full graph at part i is defined as the sum of all the bias terms associated with the parts in the subtree, i.e., k ∈ V(T_j). It is of the form:

B_{ij} = ∑_{k ∈ V(T_j)} b_k   (4)

The Model Score: The model score for a person is the maximum score over all the flexible compositions c ∈ C_G; therefore the index c of the flexible composition is also a random variable that needs to be estimated, which is different from standard graphical models with a single graph structure.


[Figure 3: legend entries — Lower Arm, Upper Arm, Elbow, Occluders]
Figure 3: Different occlusion decoupling and spatial relationships between the elbow and its neighbors, i.e., the wrist and shoulder. The local image measurement around a part (e.g., the elbow) can reliably predict the relative positions of its neighbors when they are not occluded, which is demonstrated in the base model [?]. In the case when the neighboring parts are occluded, the local image measurement can also reliably provide evidence for the occlusion.

The score F(l, t, G_c | I, G) for each flexible composition c ∈ C_G is a function of the locations l, the pairwise spatial relation types t, the index of the flexible composition c, the structure of the full graph G, and the input image I. It is given by:

F(l, t, G_c | I, G) = ∑_{i ∈ V_c} A(l_i | I) + ∑_{(i,j) ∈ E_c} R(l_i, l_j, t_{ij}, t_{ji} | I) + ∑_{(i,j) ∈ E_c^d} (B_{ij} + D_{ij}(γ_{ij} = 1, l_i | I)),   (5)

where E_c^d = {(i, j) ∈ E | i ∈ V_c, j ∉ V_c} is the set of edges that are decoupled to generate the composition G_c. See Section 5 for the learning of the model parameters.

3.1. Deep Convolutional Neural Network (DCNN) for Image Dependent Terms

Our model has three kinds of terms that depend on the local image patches: the appearance terms, IDPR terms and IDOD terms. This requires us to have a method that can efficiently extract information from a local image patch I(l_i) for the presence of the part i, as well as the occlusion decoupling evidence γ_{ij} = 1 of its neighbors j ∈ N(i), where j ∈ N(i) if, and only if, (i, j) ∈ E. When a neighboring part j is not occluded, i.e., γ_{ij} = 0, we also need to extract information for the pairwise spatial relationship type t_{ij} between parts i and j.

Extending the base model [?], we learn the distribution for the state variables i, t_{ij}, γ_{ij} conditioned on the image patches I(l_i). We will first define the state space of this distribution.

Let g be the random variable which denotes which part is present, g = i for i ∈ {1, . . . , K}, or g = 0 if no part is present (i.e., the background). We define m_{g N(g)} = {m_{gk} | k ∈ N(g)} to be the random variable that determines the pairwise occlusion decoupling and spatial relationships between part g and all its neighbors N(g); it takes values in M_{g N(g)}. If g = i has one neighbor j (e.g., the wrist), then M_{i N(i)} = {0, 1, . . . , T_{ij}}, where the value 0 represents that part j is occluded, i.e., γ_{ij} = 1, and a value v ≠ 0 represents that part j is not occluded and has the corresponding spatial relationship type with part i, i.e., γ_{ij} = 0, t_{ij} = v. If g = i has two neighbors j and k (e.g., the elbow), then M_{i N(i)} = {0, 1, . . . , T_{ij}} × {0, 1, . . . , T_{ik}} (Figure 3 illustrates the space M_{i N(i)} for the elbow when T_{ik} = T_{ij} = 6). If g = 0, then we define M_{0 N(0)} = {0}.

The full space can be written as:

U = ∪_{g=0}^{K} {g} × M_{g N(g)}   (6)

The size of the space is |U| = ∑_{g=0}^{K} |M_{g N(g)}|. Each element in this space corresponds to the background, or to a part with a particular occlusion decoupling configuration of all its neighbors and the types of its pairwise spatial relationships with its visible neighbors.
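A small sketch of the counting in Equation 6, under the assumption (used later in the experiments) of a common number of relation types T for every edge, so that |M_{i N(i)}| = (T + 1)^{|N(i)|}:

def state_space_size(neighbors, T):
    """|U| = 1 (background) + sum over parts of (T + 1)^(#neighbors):
    each neighbor is either occluded (value 0) or takes one of T
    spatial relation types (Equation 6 with T_ij = T for all edges)."""
    size = 1                                    # g = 0, background
    for nbrs in neighbors.values():
        size += (T + 1) ** len(nbrs)
    return size

# Toy 3-part chain (shoulder-elbow-wrist) with T = 6 relation types:
# the endpoints contribute 7 each, the middle part 7*7 = 49, so |U| = 64.
chain_nbrs = {"shoulder": ["elbow"],
              "elbow": ["shoulder", "wrist"],
              "wrist": ["elbow"]}
assert state_space_size(chain_nbrs, 6) == 1 + 7 + 49 + 7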

With the space of the distribution defined, we use a single Deep Convolutional Neural Network (DCNN) [?], which is efficient and shares features between different parts, to learn the conditional probability distribution p(g, m_{g N(g)} | I(l_i); θ). See Section 5 for more details.

We specify the appearance terms φ(·|·; θ), IDPR terms ϕ_s(·, γ_{ij} = 0 | ·; θ) and IDOD terms ϕ_d(γ_{ij} = 1 | ·; θ) in terms of p(g, m_{g N(g)} | I(l_i); θ) by marginalization:

φ(i | I(l_i); θ) = log(p(g = i | I(l_i); θ))   (7)
ϕ_s(t_{ij}, γ_{ij} = 0 | I(l_i); θ) = log(p(m_{ij} = t_{ij} | g = i, I(l_i); θ))   (8)
ϕ_d(γ_{ij} = 1 | I(l_i); θ) = log(p(m_{ij} = 0 | g = i, I(l_i); θ))   (9)
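A sketch of how the marginalizations in Equations 7-9 could be read off a |U|-way softmax vector. The flat vector p and the label-layout map index are our hypothetical stand-ins for the DCNN output and its label ordering:

import numpy as np

def terms_from_softmax(p, index):
    """p: |U|-way softmax output; index[(g, m)]: position in p of the
    element of U for part g with neighbor-state tuple m. Returns the
    three log-terms of Equations 7-9 as closures."""
    def p_part(i):                               # p(g = i | patch)
        return sum(p[u] for (g, m), u in index.items() if g == i)

    def phi(i):                                  # Eq. 7, appearance
        return np.log(p_part(i))

    def phi_s(i, j_slot, t):                     # Eq. 8, IDPR
        num = sum(p[u] for (g, m), u in index.items()
                  if g == i and m[j_slot] == t)
        return np.log(num / p_part(i))

    def phi_d(i, j_slot):                        # Eq. 9, IDOD: m_ij = 0
        return phi_s(i, j_slot, 0)

    return phi, phi_s, phi_d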

4. Inference

To estimate the optimal configuration for each person, we search for the flexible composition c ∈ C_G with the configurations of the locations l and types t that maximize the model score: (c*, l*, t*) = arg max_{c,l,t} F(l, t, G_c | I, G).

Let C_G^i ⊂ C_G be the subset of the flexible compositions that have node i present (obviously, ∪_{i∈V} C_G^i = C_G). We will first consider the compositions that have the part with index 1 present, i.e., C_G^1.

For all the flexible compositions c ∈ C_G^1, we set part 1 as the root. We will use dynamic programming to compute the best score over all these flexible compositions for each root location l_1.

After setting the root, let K(i) be the set of children of part i in the full graph (K(i) = ∅ if part i is a leaf). We use the following algorithm to compute the maximum score of all the flexible compositions c ∈ C_G^1:

S_i(l_i | I) = A(l_i | I) + ∑_{k ∈ K(i)} m_{ki}(l_i | I)   (10)

B_{ij} = b_j + ∑_{k ∈ K(j)} B_{jk}   (11)

m_{ki}(l_i | I) = max_{γ_{ik}} ((1 − γ_{ik}) · m_{ki}^s(l_i | I) + γ_{ik} · m_{ki}^d(l_i | I))   (12)

m_{ki}^s(l_i | I) = max_{l_k, t_{ik}, t_{ki}} R(l_i, l_k, t_{ik}, t_{ki} | I) + S_k(l_k | I)   (13)

m_{ki}^d(l_i | I) = D_{ik}(γ_{ik} = 1, l_i | I) + B_{ik},   (14)

where S_i(l_i | I) is the score of the subtree T_i with part i at each location l_i, and is computed by collecting the messages from all its children k ∈ K(i). Each child computes two kinds of messages, m_{ki}^s(l_i | I) and m_{ki}^d(l_i | I), that convey information to the parent for deciding whether to decouple it (and the subtree it leads), i.e., Equation 12.

Intuitively, the message computed by Equation 13 measures how well we can find a child part k that not only shows strong evidence of part k (e.g., an elbow) and couples well with the other parts in the subtree T_k (i.e., S_k(l_k | I)), but is also suitable for part i at location l_i based on the local image measurement (encoded in the IDPR terms). The message computed by Equation 14 measures the evidence for decoupling T_k by combining the local image measurements around part i (encoded in the IDOD terms) and the learned occlusion decoupling bias.
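To make Equations 10-14 concrete, here is a minimal sketch of the one-pass DP over discretized locations (our illustration, not the paper's code). A, R_msg, D, and b are hypothetical stand-ins for the learned terms, with locations discretized to a length-L grid:

import numpy as np

def dp_scores(children, root, A, R_msg, D, b):
    """One-pass DP of Equations 10-14. A[i]: length-L appearance
    scores; R_msg(k, i, S_k): length-L array holding the max over
    l_k, t_ik, t_ki of R + S_k (the paper accelerates this inner max
    with generalized distance transforms); D[(i, k)]: length-L IDOD
    scores for decoupling T_k at part i; b[i]: decoupling bias."""
    S, B = {}, {}

    def visit(i):
        S[i] = A[i].copy()                       # Eq. 10, unary part
        B[i] = b[i]                              # Eq. 11, bias of T_i
        for k in children.get(i, []):
            visit(k)
            B[i] += B[k]
            m_s = R_msg(k, i, S[k])              # Eq. 13, keep child
            m_d = D[(i, k)] + B[k]               # Eq. 14, decouple child
            S[i] += np.maximum(m_s, m_d)         # Eq. 12, best of the two
        return S[i]

    return visit(root)                           # S_root per root location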

The following lemma states that each S_i(l_i | I) computes the maximum score over the set of flexible compositions C_{T_i}^i that lie within the subtree T_i and have part i at l_i. In other words, we consider an object that is composed only of the parts in the subtree T_i (i.e., T_i is the full graph), and C_{T_i}^i is the set of flexible compositions of the graph T_i that have part i present. Since at the root part (i.e., i = 1) we have T_1 = G, once the messages are passed to the root part, S_1(l_1 | I) gives the best score for all the flexible compositions in the full graph c ∈ C_G^1 that have part 1 at l_1.

Lemma 1.

S_i(l_i | I) = max_{c ∈ C_{T_i}^i} { max_{l_{/i}, t} F(l_i, l_{/i}, t, G_c | I, T_i) }   (15)

Proof. We will prove the lemma by induction from leaf to root.
Basis: The proof is trivial when node i is a leaf node.
Inductive step: Assume the lemma holds for each child k ∈ K(i) of node i. Since we do not consider the case that people are separated into isolated pieces, each flexible composition at node i (i.e., c ∈ C_{T_i}^i) is composed of part i and the flexible compositions from the children (i.e., C_{T_k}^k, k ∈ K(i)) that are not decoupled. Since the graph is a tree, the best scores of the flexible compositions from each child can be computed separately, by S_k(l_k | I), k ∈ K(i), as assumed. These scores are then passed to node i (Equation 13). At node i the algorithm can choose to decouple a child for a better score (Equation 12). Therefore, the best score at node i is also computed by the algorithm. By induction, the lemma holds for all the nodes.

By Lemma 1, we can efficiently compute the best score for all the compositions with part 1 present, i.e., c ∈ C_G^1, at each location of part 1 by dynamic programming (DP). These scores can be thresholded to generate multiple estimations with part 1 present in an image. The corresponding configurations of locations and types can be recovered by the standard backward pass of DP, stopping at occlusion decoupling, i.e., γ_{ik} = 1 in Equation 12. All the decoupled parts are inferred as occluded and thus do not have location or pairwise type configurations.

Since ∪_{i∈V} C_G^i = C_G, we can get the best score for all the flexible compositions of the full graph G by computing the best score for each subset C_G^i, i ∈ V. More formally:

max_{c ∈ C_G, l, t} F(l, t, G_c | I, G) = max_{i ∈ V} (max_{c ∈ C_G^i, l, t} F(l, t, G_c | I, G))   (16)

This can be done by repeating the DP procedure K times, letting each part take its turn as the root. However, it turns out the messages on each edge only need to be computed twice, once for each direction. This allows us to implement an efficient message passing algorithm, which is of twice (instead of K times) the complexity of the standard one-pass DP, to get the best score for all the flexible compositions.
Computation: As discussed above, the inference is of twice the complexity of the standard one-pass DP. Moreover, the max operation over the locations l_k in Equation 13, which is a quadratic function of l_k, can be accelerated by the generalized distance transforms [?]. The resulting approach is very efficient, taking O(2 T² L K) time once the image dependent terms are computed, where T is the number of spatial relation types, L is the total number of locations, and K is the total number of parts in the model. This analysis assumes that all the pairwise spatial relationships have the same number of types, i.e., T_{ij} = T_{ji} = T, ∀(i, j) ∈ E.
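For intuition on what the distance transform accelerates, here is the naive O(L²) form of the inner maximization of Equation 13 on a 1-D location grid (a sketch with illustrative deformation weights; the generalized distance transform [?] produces the same array in O(L) per relation type):

import numpy as np

def naive_quadratic_message(child_score, w, r, L):
    """Naive O(L^2) message: msg[l_i] = max over l_k of
    child_score[l_k] - w[0]*d - w[1]*d**2, with d = l_k - l_i - r.
    `w` (deformation penalty weights, w[1] > 0) and `r` (mean offset)
    are illustrative stand-ins for the learned w_ik and r_ik."""
    msg = np.full(L, -np.inf)
    for li in range(L):
        for lk in range(L):
            d = lk - li - r
            msg[li] = max(msg[li], child_score[lk] - w[0]*d - w[1]*d*d)
    return msg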


The computation of the image dependent terms is also efficient. They are computed over all the locations by a single DCNN. The DCNN is applied in a sliding window fashion by considering the fully-connected layers as 1 × 1 convolutions [?], which naturally shares the computations common to overlapping regions.
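A small numpy sketch of this trick (our illustration of the idea in [?]): applying a fully-connected layer at every spatial location of a feature map is exactly a 1 × 1 convolution, so one forward pass scores all patch locations at once:

import numpy as np

def fc_as_1x1_conv(feat, W, bias):
    """Apply a fully-connected layer (W: [C_out, C_in], bias: [C_out])
    at every spatial location of feat ([H, W, C_in]) simultaneously,
    i.e., as a 1 x 1 convolution over the feature map."""
    H, Wd, C_in = feat.shape
    out = feat.reshape(-1, C_in) @ W.T + bias    # [H*W, C_out]
    return out.reshape(H, Wd, -1)

# At a single location this reproduces the ordinary fc layer:
rng = np.random.default_rng(0)
f = rng.standard_normal((5, 7, 16))
W = rng.standard_normal((8, 16))
b0 = rng.standard_normal(8)
assert np.allclose(fc_as_1x1_conv(f, W, b0)[2, 3], W @ f[2, 3] + b0)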

5. Learning

We learn our model parameters from images containing occluded people. The visibility of each part (or joint) is labeled, and the locations of the visible parts are annotated. We adopt a supervised approach to learn the model by first deriving the occlusion decoupling labels γ_{ij} and type labels t_{ij} from the annotations.

Our model consists of three sets of parameters: the mean relative positions r = {r_{ij}^{t_{ij}}, r_{ji}^{t_{ji}} | (i, j) ∈ E} of the different pairwise spatial relation types; the parameters θ for the image dependent terms, i.e., the appearance terms, IDPR and IDOD terms; and the weight parameters w (i.e., w_i, w_{ij}^{t_{ij}}, w_{ij}, w_{ji}^{t_{ji}}, w_{ji}) and bias parameters b (i.e., b_k). They are learnt separately: by the K-means algorithm for r, by DCNN for θ, and by a linear Support Vector Machine (SVM) [?] for w and b.
Derive Labels and Learn Mean Relative Positions: The ground-truth annotations give the part visibility labels v^n, and locations l^n for the visible parts of each person instance n ∈ {1, . . . , N}. For each pair of neighboring parts (i, j) ∈ E, we derive γ_{ij}^n = 1 if and only if part i is visible but part j is not, i.e., v_i^n = 1 and v_j^n = 0. Let d_{ij} be the relative position from part i to its neighbor j when both of them are visible. We cluster the relative positions over the training set {d_{ij}^n | v_i^n = 1, v_j^n = 1} to get T_{ij} clusters (in the experiments T_{ij} = 8 for all pairwise relations). Each cluster corresponds to a set of instances of part i that share a similar spatial relationship with its visible neighboring part j. Therefore, we define each cluster as a pairwise spatial relation type t_{ij} from part i to j in our model, and the type label t_{ij}^n for each training instance is derived from its cluster index. The mean relative position r_{ij}^{t_{ij}} associated with each type is defined as the center of each cluster. In our experiments, we use K-means with K = T_{ij} to do the clustering (a sketch is given below).
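A sketch of this label-derivation step; we use scikit-learn's KMeans for illustration, which is our assumption rather than the paper's stated implementation:

import numpy as np
from sklearn.cluster import KMeans

def derive_types(d_ij, T=8, seed=0):
    """Cluster relative positions d_ij ([N, 2] array over training
    instances where both parts are visible) into T spatial relation
    types. Returns the per-instance type labels t_ij^n (cluster
    indices) and the mean relative positions r_ij^t (cluster centers)."""
    km = KMeans(n_clusters=T, n_init=10, random_state=seed).fit(d_ij)
    return km.labels_, km.cluster_centers_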

Parameters of Image Dependent Terms: After deriving the occlusion decoupling labels and pairwise spatial type labels, each local image patch I(l^n) centered at an annotated (visible) part location is labeled with a category label g^n ∈ {1, . . . , K} that indicates which part is present, and also with the type labels m_{g^n N(g^n)}^n that indicate its pairwise occlusion decoupling and spatial relationships with all its neighbors. In this way, we get a set of labeled patches {I(l^n), g^n, m_{g^n N(g^n)}^n | v_{g^n}^n = 1} from the visible parts of each labeled person, and also a set of background patches {I(l^n), 0, 0} sampled from negative images, which do not contain people.

Given the labeled part patches and background patches, we train a |U|-way DCNN classifier by standard stochastic gradient descent using the softmax loss. The final |U|-way softmax output is defined as our conditional probability distribution, i.e., p(g, m_{g N(g)} | I(l_i); θ). See Section 6.2 for the details of our network.
Weight and Bias Parameters: Given the derived occlusion decoupling labels γ_{ij}, we can associate each labeled pose with a flexible composition c^n. For the poses that are separated into several isolated compositions, we use the composition with the largest number of parts. The location of each visible part in the associated composition c^n is given by the ground-truth annotation, and its pairwise spatial types are derived as above. We can then compute the model score of each labeled pose as a linear function of the parameters β = [w, b], so we use a linear SVM to learn these parameters:

min_{β, ξ}  (1/2)⟨β, β⟩ + C ∑_n ξ_n
s.t.  ⟨β, Φ(c^n, I^n, l^n, t^n)⟩ + b_0 ≥ 1 − ξ_n,  ∀n ∈ pos
      ⟨β, Φ(c^n, I^n, l^n, t^n)⟩ + b_0 ≤ −1 + ξ_n,  ∀n ∈ neg,

where b_0 is the scalar SVM bias, C is the cost parameter, and Φ(c^n, I^n, l^n, t^n) is a sparse feature vector representing the n-th example, formed by concatenating the image dependent terms (calculated from the learnt DCNN), the spatial deformation features, and constant 1s for the bias terms. The constraints encourage the positive examples (pos) to score higher than 1 (the margin) and the negative examples (neg), which we mine from the negative images using the inference method described above, to score lower than −1. The objective function penalizes violations using the slack variables ξ_n.
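A simplified sketch of this step using scikit-learn's LinearSVC, which optimizes an equivalent soft-margin objective; the paper's exact solver and the hard-negative mining loop are not reproduced here:

import numpy as np
from sklearn.svm import LinearSVC

def learn_weights(phi_pos, phi_neg, C=1.0):
    """Fit the linear parameters beta = [w, b] of the model score from
    feature vectors Phi (rows): labeled poses as positives, mined
    detections from negative images as negatives."""
    X = np.vstack([phi_pos, phi_neg])
    y = np.hstack([np.ones(len(phi_pos)), -np.ones(len(phi_neg))])
    svm = LinearSVC(C=C, loss="hinge", fit_intercept=True).fit(X, y)
    return svm.coef_.ravel(), svm.intercept_[0]  # beta, SVM bias b_0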

6. Experiments

This section describes our experimental setup, presents comparison benchmark results, and gives diagnostic experiments.

6.1. Dataset and Evaluation Metrics

We perform experiments on the standard benchmarked dataset: "We Are Family" Stickmen (WAF) [?], which contains challenging group photos where several people often occlude one another (see Figure 5). The dataset contains 525 images with 6 people each on average, and is officially split into 350 images for training and 175 images for testing. Following [?, ?], we use the negative training images from the INRIAPerson dataset [?] (these images do not contain people).


[Figure 4: DCNN architecture. Input 54×54×3 → conv1 7×7 + max-pool 3×3 → 18×18×32 → conv2 5×5 + max-pool 2×2 → 9×9×128 → conv3 3×3 → conv4 3×3 → conv5 3×3 (each 9×9×128) → fc6 (4096) → fc7 (4096) → |U|-way output.]
Figure 4: An illustration of the DCNN architecture used in our experiments. It consists of five convolutional layers (conv), two max-pooling layers (max) and three fully-connected layers (fc) with a final |U|-way softmax output. We use the rectification (ReLU) non-linearity, and the dropout technique described in [?].

We evaluate our method using the official toolkit of the dataset [?] to allow comparison with previous work. The toolkit implements a version of the occlusion-aware Percentage of Correct Parts (PCP) metric, where an estimated part is considered correctly localized if the average distance between its endpoints (joints) and the ground-truth is less than 50% of the length between the ground-truth annotated endpoints, and an occluded body part is considered correct if and only if the part is also annotated as occluded in the ground-truth.

We also evaluate the Accuracy of Occlusion Prediction (AOP) by considering occlusion prediction over all people's parts as a binary classification problem. AOP does not measure how well a part is localized, but shows the percentage of parts whose visibility status is correctly estimated.
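Our reading of the two metrics as a sketch (the official WAF toolkit remains the authoritative implementation):

import numpy as np

def part_correct(est_vis, est_pts, gt_vis, gt_pts):
    """Occlusion-aware PCP test for one part (stick). An occluded part
    is correct iff it is also predicted occluded; a visible part is
    correct if predicted visible and the mean endpoint error is below
    50% of the ground-truth stick length. est_pts, gt_pts: [2, 2]
    arrays of the two endpoints."""
    if not gt_vis:
        return not est_vis
    if not est_vis:
        return False
    err = np.linalg.norm(est_pts - gt_pts, axis=1).mean()
    return err < 0.5 * np.linalg.norm(gt_pts[0] - gt_pts[1])

def aop(est_vis_all, gt_vis_all):
    """Accuracy of Occlusion Prediction: fraction of parts whose
    visibility flag is predicted correctly, ignoring localization."""
    return float(np.mean(np.array(est_vis_all) == np.array(gt_vis_all)))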

6.2. Implementation detail

DCNN Architecture: The layer configuration of our network is summarized in Figure 4. In our experiments, the patch size of each part is 54 × 54. We pre-process each image patch by subtracting the mean pixel value over all the pixels of the training patches. We use the Caffe [?] implementation of DCNN.
Data Augmentation: We augment the training data by rotating and horizontally flipping the positive training examples to increase the number of training part patches with different spatial configurations relative to their neighbors. We follow [?, ?] in increasing the number of parts by adding the midway points between annotated parts, which results in 15 parts on the WAF dataset. Increasing the number of parts produces more training patches for the DCNN, which helps to reduce overfitting. Also, covering a person with more parts is good for modeling foreshortening [?].
Part-based Non-Maximum Suppression: Using the proposed inference algorithm, a single piece of image evidence for a part can be used multiple times in different estimations. This may produce duplicated estimations for the same person. We use a greedy part-based non-maximum suppression [?] to prevent this. Each estimation has an associated score; we sort the estimations by score, start from the highest scoring one, and remove those whose parts overlap significantly with the corresponding parts of any previously selected estimation. In the experiments, we require the intersection over union between the corresponding parts in different estimations to be less than 60%.
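A sketch of the greedy part-based NMS just described; iou is a hypothetical helper computing intersection over union between two corresponding part boxes:

def part_nms(estimations, iou, thresh=0.6):
    """Greedy part-based non-maximum suppression. `estimations`: list
    of (score, parts) tuples, where parts is a list with None for
    occluded parts. Keep an estimation unless some part overlaps a
    kept estimation's corresponding part by more than thresh."""
    kept = []
    for score, parts in sorted(estimations, key=lambda e: -e[0]):
        suppressed = any(
            iou(p, q) > thresh
            for _, kparts in kept
            for p, q in zip(parts, kparts)
            if p is not None and q is not None   # skip occluded parts
        )
        if not suppressed:
            kept.append((score, parts))
    return kept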

Method             AOP   Torso  Head  U.arms  L.arms  mPCP
Ours               84.9  88.5   98.5  77.2    71.3    80.7
Multi-Person [?]   80.0  86.1   97.6  68.2    48.1    69.4
Ghiasi et al. [?]  74.0  -      -     -       -       63.6
One-Person [?]     73.9  83.2   97.6  56.7    28.6    58.6

Table 1: Comparison of PCP and AOP on the WAF dataset. Our method improves the PCP performance on all parts, and significantly outperforms the best previously published result [?] by 11.3% on mean PCP and 4.9% on AOP.

Other Settings: We use the same number of spatial relationship types for all pairs of neighbors in our experiments; they are set to 8, i.e., T_{ij} = T_{ji} = 8, ∀(i, j) ∈ E.

6.3. Benchmark results

Table 1 compares the performance of our method with the state of the art methods using the PCP and AOP metrics on the WAF benchmark. Our method improves the PCP performance on all parts, and significantly outperforms the best previously published result [?] by 11.3% on mean PCP and 4.9% on AOP. Figure 5 shows some estimation results of our model on the WAF dataset.

6.4. Diagnostic Experiments

Prior Verification: We analyze the test set of the WAF dataset using the ground truth annotation, and find that 95.1% of the people instances have their visible parts forming a connected subtree. This verifies our prior that the visible parts of a human tend to form a connected subtree, even in the presence of significant occlusion.
Term Analysis: We design the following experiments to better understand each design component in our model.

Firstly, our model is designed to handle different degrees of occlusion by efficiently searching over a large number of flexible compositions. When we do not consider occlusion and use a single composition (i.e., the full graph), our model reduces to the base model [?]. So we perform a diagnostic experiment by using the base model [?] on the WAF dataset, which infers every part as visible and localizes all of them.

Secondly, we model occlusion by combining the cue from the absence of evidence for a body part with the local image measurement around the occlusion boundary, which is encoded in the IDOD terms. So we perform a second diagnostic experiment by removing the IDOD terms (i.e., setting ϕ_d(γ_{ij} = 1 | ·; θ) = 0 in Equation 3). In this case, the model handles occlusion only by exploiting the cue from the absence of evidence for a body part.


Figure 5: Results on the WAF dataset. We show the parts that are inferred as visible, and thus have estimated configurations, by our model.

Method          AOP   Torso  Head  U.arms  L.arms  mPCP
Base Model [?]  73.9  81.4   92.6  63.6    47.6    66.1
FC              82.0  87.0   98.6  72.7    67.5    77.7
FC+IDOD         84.9  88.5   98.5  77.2    71.3    80.7

Table 2: Diagnostic experiments: PCP and AOP results on the WAF dataset. Using flexible compositions (i.e., FC) significantly improves our base model [?] by 11.6% on PCP and 8.1% on AOP. Adding IDOD terms (FC+IDOD, i.e., the full model) further improves our PCP performance to 80.7% and AOP performance to 84.9%, which is significantly higher than the state of the art methods.

We show the PCP and AOP performance of the diagnostic experiments in Table 2. As shown, flexible compositions (i.e., FC) significantly outperform a single composition (i.e., the base model [?]), and adding IDOD terms improves the performance significantly (see the caption for details).

7. Conclusion

This paper develops a new graphical model for parsing people. We introduce, and experimentally verify on the WAF dataset (see Section 6.4), a novel prior that the visible body parts of a human tend to form a connected subtree, which we define as a flexible composition, even in the presence of significant occlusion. This is equivalent to modeling people as a mixture of flexible compositions. We define novel occlusion cues and learn them from data. We show that very efficient inference can be done for our model by exploiting part sharing, so that computing over all the flexible compositions takes only twice the computation of the base model [?]. We evaluate on the WAF dataset and show that we significantly outperform the current state of the art methods [?, ?]. We also show a big improvement over our base model, which does not model occlusion explicitly.


8. Acknowledgements

This research has been supported by grants ONR N00014-12-1-0883 and ARO 62250-CS. The GPUs used in this research were generously donated by the NVIDIA Corporation.

References

[1] H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In European Conference on Computer Vision (ECCV), 2012.
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In International Conference on Computer Vision (ICCV), 2009.
[3] X. P. Burgos-Artizzu, P. Perona, and P. Dollar. Robust face landmark estimation under occlusion. In International Conference on Computer Vision (ICCV), 2013.
[4] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Computer Vision and Pattern Recognition (CVPR), 2014.
[5] X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems (NIPS), 2014.
[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 1995.
[7] J. Coughlan, A. Yuille, C. English, and D. Snow. Efficient deformable template detection and localization without user initialization. Computer Vision and Image Understanding, 2000.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005.
[9] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In European Conference on Computer Vision (ECCV), 2010.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
[11] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision (IJCV), 2005.
[12] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Computer Vision and Pattern Recognition (CVPR), 2008.
[13] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 1973.
[14] G. Ghiasi and C. C. Fowlkes. Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In Computer Vision and Pattern Recognition (CVPR), 2014.
[15] G. Ghiasi, Y. Yang, D. Ramanan, and C. C. Fowlkes. Parsing occluded people. In Computer Vision and Pattern Recognition (CVPR), 2014.
[16] R. B. Girshick, P. F. Felzenszwalb, and D. A. McAllester. Object detection with grammar models. In Advances in Neural Information Processing Systems (NIPS), 2011.
[17] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
[18] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In British Machine Vision Conference (BMVC), 2010.
[19] G. Kanizsa. Organization in Vision: Essays on Gestalt Perception. Praeger, New York, 1979.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[21] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In Computer Vision and Pattern Recognition (CVPR), 2013.
[22] B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In Computer Vision and Pattern Recognition (CVPR), 2010.
[23] B. Sapp and B. Taskar. MODEC: Multimodal decomposable models for human pose estimation. In Computer Vision and Pattern Recognition (CVPR), 2013.
[24] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR), 2014.
[25] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems (NIPS), 2014.
[26] Z. Tu and S.-C. Zhu. Parsing images into regions, curves, and curve groups. International Journal of Computer Vision (IJCV), 2006.
[27] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In Computer Vision and Pattern Recognition (CVPR), 2013.
[28] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.
[29] A. Yao, J. Gall, and L. Van Gool. Coupled action recognition and pose estimation from multiple views. International Journal of Computer Vision (IJCV), 2012.
[30] L. Zhu, Y. Chen, Y. Lu, C. Lin, and A. Yuille. Max margin AND/OR graph learning for parsing the human body. In Computer Vision and Pattern Recognition (CVPR), 2008.
[31] L. Zhu, Y. Chen, A. Torralba, W. Freeman, and A. Yuille. Part and appearance sharing: Recursive compositional models for multi-view multi-object detection. In Computer Vision and Pattern Recognition (CVPR), 2010.
[32] S.-C. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2006.