
Learning Layered Motion Segmentations of Video

M. Pawan Kumar, P.H.S. Torr (Dept. of Computing, Oxford Brookes University)
A. Zisserman (Dept. of Eng. Science, University of Oxford)

{pkmudigonda,philiptorr}@brookes.ac.uk, [email protected]

http://cms.brookes.ac.uk/research/visiongroup    http://www.robots.ox.ac.uk/~vgg

Abstract

We present an unsupervised approach for learning a layered representation of a scene from a video for motion segmentation. Our method is applicable to any video containing piecewise parametric motion. The learnt model is a composition of layers, which consist of one or more segments. The shape of each segment is represented using a binary matte and its appearance is given by the RGB value for each point belonging to the matte. Included in the model are the effects of image projection, lighting, and motion blur. Furthermore, spatial continuity is explicitly modelled, resulting in contiguous segments. Unlike previous approaches, our method does not use reference frame(s) for initialization. The two main contributions of our method are: (i) a novel algorithm for obtaining the initial estimate of the model by dividing the scene into rigidly moving components using efficient loopy belief propagation; and (ii) refining the initial estimate using αβ-swap and α-expansion algorithms, which guarantee a strong local minimum. Results are presented on several classes of objects with different types of camera motion, e.g. videos of a human walking shot with static or translating cameras. We compare our method with the state of the art and demonstrate significant improvements.

1 Introduction

We present an approach for learning a layered representation from a video for motion segmentation. Our method is applicable to any video containing piecewise parametric motion, e.g. piecewise homography, without any restrictions on camera motion. It also accounts for the effects of occlusion, lighting and motion blur. For example, Fig. 1 shows one such sequence where a layered representation can be learnt and used to segment the walking person from the static background.

Many different approaches for motion segmentation have been reported in the literature. Important issues are: (i) whether the methods are feature-based or dense; (ii) whether they model occlusion; (iii) whether they model spatial continuity; (iv) whether they apply to multiple frames (i.e. a video sequence); and (v) whether they are independent of which frames are used for initialization.


Figure 1: Four intermediate frames of a 31 frame video sequence of a person walking sideways where the camera is static. Given the sequence, the model which best describes the person and the background is learnt in an unsupervised manner. Note that the arm always partially occludes the torso.

A comprehensive survey of feature-based methods can be found in [19]. Most of these methods rely on computing a homography corresponding to the motion of a planar object. This limits their application to a restricted set of scenes and motions. Dense methods [2, 6, 18, 22] overcome this deficiency by computing pixel-wise motion. However, many dense approaches do not model occlusion, which can lead to overcounting of data when obtaining the segmentation, e.g. see [2, 6].

Chief amongst the methods which do model occlusion are those that use a layered representation [21]. One such approach, described in [22], divides a scene into (almost) planar regions for occlusion reasoning. Torr et al. [18] extend this representation by allowing for parallax disparity. However, these methods rely on a keyframe for the initial estimation. Other approaches [8, 23] overcome this problem by using layered flexible sprites. A flexible sprite is a 2D appearance map and matte (mask) of an object which is allowed to deform from frame to frame according to pure translation. Winn et al. [25] extend the model to handle affine deformations. However, these methods do not enforce spatial continuity, i.e. they assume each pixel is labelled independently of its neighbours. This leads to non-contiguous segmentation when the foreground and background are similar in appearance (see Fig. 19(b)). Most of these approaches, namely those described in [6, 8, 18, 21, 22], use either EM or variational methods for learning the parameters of the model, which makes them prone to local minima.

Wills et al. [24] noted the importance of spatial continuity when learning the regions in a layered representation. Given an initial estimate, they learn the shape of the regions using the powerful α-expansion algorithm [5], which guarantees a strong local minimum. However, their method does not deal with more than 2 views. In our earlier work [10], we described a similar motion segmentation approach to [24] for a video sequence. Like [16], this automatically learns a model of an object. However, the method depends on a keyframe to obtain an initial estimate of the model. This has the disadvantage that points not visible in the keyframe are not included in the model, which leads to incomplete segmentation.

In this paper, we present a model which does not suffer from the problems mentioned above, i.e. (i) it models occlusion; (ii) it models spatial continuity; (iii) it handles multiple frames; and (iv) it is learnt independent of keyframes. An initial estimate of the model is obtained based on a method to estimate image motion with discontinuities using a new efficient loopy belief propagation algorithm. Despite the use of piecewise parametric motion (similar to feature-based approaches), this allows us to learn the


Figure 2: The top row shows the various layers of a human model (the latent image in this case). Each layer consists of one or more segments whose appearance is shown. The shape of each segment is represented by a binary matte (not shown in the image). Any frame j can be generated using this representation by assigning appropriate values to its parameters and latent variables. The background is not shown.

model for a wide variety of scenes. Given the initial estimate, the shape of the segments, along with the layering, are learnt by minimizing an objective function using αβ-swap and α-expansion algorithms [5]. Results are demonstrated on several classes of objects with different types of camera motion.

In the next section, we describe the layered representation. In section 3, we present a five stage approach to learn the parameters of the layered representation from a video. Such a model is particularly suited for applications like motion segmentation. Results are presented in section 4. Preliminary versions of this article have appeared in [10, 11]. The input videos used in this work together with the description and output of our approach are available at http://www.robots.ox.ac.uk/~vgg/research/moseg/.

2 Layered Representation

We introduce the model for a layered representation which describes the scene as a composition of layers. Any frame of a video can be generated from our model by assigning appropriate values to its parameters and latent variables as illustrated in Fig. 2. While the parameters of the model define the latent image, the latent variables describe how to generate the frames using the latent image (see table 1). Together, they also define the probability of the frame being generated.

The latent image is defined as follows. It consists of a set of n_P segments, which


Input
    D        Data (RGB values of all pixels in every frame of a video).
    n_F      Number of frames.

Parameters
    n_P      Number of segments p_i, including the background.
    Θ_Mi     Matte for segment p_i.
    Θ_M      Set of all mattes, i.e. {Θ_Mi, i = 1, ..., n_P}.
    Θ_Ai     Appearance parameter for segment p_i.
    Θ_A      Set of all appearance parameters, i.e. {Θ_Ai, i = 1, ..., n_P}.
    H_i      Histogram specifying the distribution of the RGB values for p_i.
    l_i      Layer number of segment p_i.

Latent Variables
    Θ^j_Ti   Transformation {t_x, t_y, s_x, s_y, φ} of segment p_i to frame j.
    Θ^j_Li   Lighting variables {a^j_i, b^j_i} of segment p_i to frame j.

    Θ        {n_P, Θ_M, Θ_A, H_i, l_i; Θ_T, Θ_L}.

Table 1: Parameters and latent variables of the layered representation.

are 2D patterns (specified by their shape and appearance) along with their layering. The layering determines the occlusion ordering. Thus, each layer contains a number of non-overlapping segments. We denote the ith segment of the latent image as p_i. The shape of a segment p_i is modelled as a binary matte Θ_Mi of size equal to the frame of the video such that Θ_Mi(x) = 1 for a point x belonging to segment p_i (denoted by x ∈ p_i) and Θ_Mi(x) = 0 otherwise.

The appearance Θ_Ai(x) is the RGB value of points x ∈ p_i. We denote the set of mattes and appearance parameters for all segments as Θ_M and Θ_A respectively. The distribution of the RGB values Θ_Ai(x) for all points x ∈ p_i is specified using a histogram H_i for each segment p_i. In order to model the layers, we assign a (not necessarily unique) layer number l_i to each segment p_i such that segments belonging to the same layer share a common layer number. Each segment p_i can partially or completely occlude segment p_k if and only if l_i > l_k. In summary, the latent image is defined by the mattes Θ_M, the appearance Θ_A, the histograms H_i and the layer numbers l_i of the n_P segments.

When generating frame j, we start from a latent image and map each point x ∈ p_i to x′ using the transformation Θ^j_Ti. This implies that points belonging to the same segment move according to a common transformation. The generated frame is then obtained by compositing the transformed segments in descending order of their layer numbers. For this paper, each transformation has five degrees of freedom: rotation, translations and anisotropic scale factors. The model accounts for the effects of lighting conditions on the appearance of a segment p_i using latent variables Θ^j_Li = {a^j_i, b^j_i}, where a^j_i and b^j_i are 3-dimensional vectors. The change in appearance of the segment p_i in frame j due to lighting conditions is modelled as

    d(x′) = diag(a^j_i) · Θ_Ai(x) + b^j_i.    (1)
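A minimal numpy sketch of equation (1), assuming the segment appearance and the lighting variables are stored as arrays (the array shapes and function name are illustrative, not from the paper):

import numpy as np

def apply_lighting(appearance, a, b):
    """Per-channel affine lighting change of equation (1):
    d(x') = diag(a) * Theta_Ai(x) + b, applied to every point of a segment.

    appearance : (N, 3) RGB values Theta_Ai(x) for the N points of segment p_i
    a, b       : (3,) lighting variables a_i^j and b_i^j for frame j
    """
    return appearance * a + b  # broadcasting implements diag(a) * colour + b

# toy usage: darken the red channel and add a constant offset to blue
theta_A = np.array([[200.0, 150.0, 100.0], [180.0, 140.0, 90.0]])
d = apply_lighting(theta_A, a=np.array([0.8, 1.0, 1.0]), b=np.array([0.0, 0.0, 10.0]))
print(d)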


The motion of segment p_i from frame j−1 to frame j, denoted by m^j_i, can be determined using the transformations Θ^{j−1}_Ti and Θ^j_Ti. This allows us to take into account the change in appearance due to motion blur as

    c(x′) = ∫_0^T d(x′ − m^j_i(t)) dt,    (2)

where T is the total exposure time when capturing the frame.
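Equation (2) can be approximated by averaging the lit appearance over a few samples of the within-exposure motion. A hedged sketch (the linear interpolation of the motion over the exposure and the sample count are assumptions made for illustration):

import numpy as np

def motion_blurred(d_values, positions, motion, n_samples=8):
    """Discrete approximation of equation (2): average d(x' - m_i^j(t)) over
    the exposure, assuming the motion is linear in t over [0, T].

    d_values  : (H, W, 3) lit appearance d(x') sampled on the frame grid
    positions : (N, 2) integer (row, col) coordinates x' of the segment's points
    motion    : (2,) total displacement m_i^j over the exposure (rows, cols)
    """
    acc = np.zeros((positions.shape[0], 3))
    H, W = d_values.shape[:2]
    for t in np.linspace(0.0, 1.0, n_samples):
        shifted = np.round(positions - t * motion).astype(int)
        shifted[:, 0] = np.clip(shifted[:, 0], 0, H - 1)
        shifted[:, 1] = np.clip(shifted[:, 1], 0, W - 1)
        acc += d_values[shifted[:, 0], shifted[:, 1]]
    return acc / n_samples  # c(x'), the blurred colour at each point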

Posterior of the model: We represent the set of all parameters and latent variables of the layered representation as Θ = {n_P, Θ_M, Θ_A, H_i, l_i; Θ_T, Θ_L} (summarized in table 1). Given data D, i.e. the n_F frames of a video, the posterior probability of the model is given by

    Pr(Θ|D) = (1/Z) exp(−Ψ(Θ|D)),    (3)

where Z is the partition function. The energy Ψ(Θ|D) has the form

    Ψ(Θ|D) = Σ_{i=1}^{n_P} Σ_{x ∈ Θ_Mi} [ A_i(x; Θ, D) + λ1 Σ_{y ∈ N(x)} ( B_i(x, y; Θ, D) + λ2 P_i(x, y; Θ) ) ],    (4)

where N(x) is the neighbourhood of x. For this paper, we define N(x) as the 8-neighbourhood of x across all mattes Θ_Mi of the layered representation (see Fig. 3). As will be seen in § 3.3, this allows us to learn the model efficiently by minimizing the energy Ψ(Θ|D) using multi-way graph cuts. However, a larger neighbourhood can be used for each point at the cost of more computation time. Note that minimizing the energy Ψ(Θ|D) is equivalent to maximizing the posterior Pr(Θ|D) since the partition function Z is independent of Θ.

The energy of the layered representation has two components: (i) the data log likelihood term, which consists of the appearance term A_i(x; Θ, D) and the contrast term B_i(x, y; Θ, D), and (ii) the prior P_i(x, y; Θ). The appearance term measures the consistency of motion and colour distribution of a point x. The contrast and the prior terms encourage spatially continuous segments whose boundaries lie on edges in the frames. Their relative weight to the appearance term is given by λ1. The weight λ2 specifies the relative importance of the prior to the contrast term. An extension of Markov random fields (MRF) described in [12], which we call Contrast-dependent random fields (CDRF), allows a probabilistic interpretation of the energy Ψ(Θ|D) as shown in Fig. 4. We note, however, that unlike MRF it is not straightforward to generate the frames from CDRF since it is a discriminative model (due to the presence of the contrast term B_i(x, y)). We return to this when we provide a Conditional random field formulation of the energy Ψ(Θ|D). We begin by describing the three terms of the energy in detail.

Appearance: We denote the observed RGB values at point x′ = Θ^j_Ti(x) (i.e. the image of the point x in frame j) by I^j_i(x). The generated RGB values of the point x′


Figure 3: The top row shows two segments of the human model. The unfilled circles represent two of the neighbouring points of the filled circle. The neighbourhood is defined across all mattes. We show one neighbour which lies on the same matte (i.e. the torso) and another neighbour which lies on a different matte (i.e. the upper arm). The bottom row shows two frames of the video along with the projections of the points on the segments. Note that the neighbouring point on the torso is occluded by the neighbouring point on the upper arm in the second frame.

Figure 4: Contrast-dependent random field (CDRF) formulation of the energy of the layered representation containing two segments p_i and p_k. The appearance term A_i(x; Θ, D), shown in red, connects the data D (specifically the set of RGB values I^j_i(x)) with the matte Θ_Mi(x). The prior P_i(x, z; Θ) between two neighbouring points x and z, which encourages spatial continuity, connects Θ_Mi(x) and Θ_Mi(z). The data dependent contrast term B_i(x, y; Θ, D), which cannot be included in the prior, is shown as the blue diagonal connections. Extending the formulation to more than two segments is trivial. Note that some of these diagonal connections are not shown, i.e. those connecting the other neighbouring points of x and y to Θ_Mk(y) and Θ_Mi(x) respectively, for the sake of clarity of the figure.


are described in equation (2). The appearance term for a point x is given by

    A_i(x; Θ, D) = Σ_{j=1}^{n_F} − log Pr(I^j_i(x)|Θ).    (5)

For a point x ∈ p_i, i.e. Θ_Mi(x) = 1, the likelihood of I^j_i(x) consists of two factors: (i) consistency with the colour distribution of the segment, which is the conditional probability of I^j_i(x) given x ∈ p_i and is computed using histogram H_i, and (ii) consistency of motion, which measures how well the generated RGB values c^j_i(x′) match the observed values I^j_i(x) (i.e. how well does the latent image project into the frames). Thus,

    Pr(I^j_i(x)|Θ) ∝ Pr(I^j_i(x)|H_i) exp(−µ (c^j_i(x′) − I^j_i(x))²),    (6)

where µ is some scaling factor. We use µ = 1 in our experiments.

Note that in the above equation we assume that the point x is visible in frame j. When x is occluded by some other segment, we assign

    Pr(I^j_i(x)|Θ) = κ1.    (7)

There is also a constant penalty for all points x which do not belong to a segment p_i to avoid trivial solutions, i.e.

    A_i(x; Θ, D) = κ2,  x ∉ p_i.    (8)

One might argue that motion consistency alone would provide sufficient discrimination between different segments. However, consider a video sequence with a homogeneous background (e.g. a brown horse walking on green grass). In such cases, motion consistency would not be able to distinguish between assigning some blocks of the background (i.e. the green grass) to either the horse segment or the grass segment. Fig. 5 shows such a scenario where a block of the green background is assigned to different segments. However, the colour distribution of each segment would provide better discrimination. For example, the likelihood of a green point belonging to the first (mostly brown) segment would be low according to the colour distribution of that segment. Empirically, we found that using both motion and colour consistency provides much better results than using either of them alone.
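The appearance term of equations (5)-(8) can be sketched as follows; the histogram lookup, the visibility mask and the constants are simplified stand-ins for the quantities defined above:

import numpy as np

def appearance_term(I, c, hist_prob, visible, in_segment,
                    mu=1.0, kappa1=0.1, kappa2=1.0):
    """Per-point appearance cost A_i(x) summed over frames, equations (5)-(8).

    I          : (nF, N, 3) observed colours I_i^j(x) at the projected points
    c          : (nF, N, 3) generated colours c_i^j(x') from the model
    hist_prob  : (nF, N) Pr(I_i^j(x) | H_i) looked up from the colour histogram
    visible    : (nF, N) bool, False where x is occluded in frame j
    in_segment : (N,) bool, Theta_Mi(x) = 1
    """
    # likelihood of a visible point: colour-distribution factor times motion factor
    motion = np.exp(-mu * np.sum((c - I) ** 2, axis=-1))
    lik = hist_prob * motion
    lik = np.where(visible, lik, kappa1)          # occluded points get kappa1 (eq. 7)
    A = -np.log(np.maximum(lik, 1e-30)).sum(axis=0)
    return np.where(in_segment, A, kappa2)        # points outside p_i pay kappa2 (eq. 8)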

Contrast and Prior: As noted above, the contrast and prior terms encourage spatially continuous segments whose boundaries lie on image edges. For clarity, we first describe the two terms separately while noting that their effect should be understood together. We subsequently describe their joint behaviour (see table 2).

Contrast: The contrast term encourages the projection of the boundary between segments to lie on image edges and has the form

    B_i(x, y; Θ, D) = { −γ_ik(x, y)  if x ∈ p_i, y ∈ p_k, i ≠ k,
                        0            if x ∈ p_i, y ∈ p_k, i = k.    (9)



Figure 5: (a)-(b) Two possible latent images of the model consisting of two segments. Unlike (a), the latent image shown in (b) assigns a block of green points to the first segment of the model. (c)-(d) Two frames of the video sequence. The brown region translates to the left while the green region remains stationary. Note that when only consistency of motion is used in equation (6), the appearance term A_i(x; Θ, D) for x ∈ p_i remains the same for both the latent images. This is because the green block provides the same match scores for both transformations (i.e. stationary and translation to the left) since it maps onto green pixels in the frames. However, when consistency of appearance is also used, this green block would be more likely to belong to the second segment (which is entirely green) instead of the first segment (which is mostly brown). Hence, the appearance term A_i(x; Θ, D) would favour the first latent image.

The term γ_ik(x, y) is chosen such that it has a large value when x and y project onto image edges. For this paper, we use

    γ_ik(x, y) = ( 1 − exp( −g²_ik(x, y) / (2σ²) ) ) · 1 / dist(x, y),    (10)

where

    g_ik(x, y) = (1/n_F) Σ_{j=1}^{n_F} |I^j_i(x) − I^j_k(y)|.    (11)

In other words, g_ik(x, y) measures the difference between the RGB values I^j_i(x) and I^j_k(y) throughout the video sequence. The term dist(x, y), i.e. the Euclidean distance between x and y, gives more weight to the 4-neighbourhood of x than the rest of the 8-neighbourhood. The value of σ in equation (10) determines how the energy Ψ(Θ|D) is penalized since the penalty is high when g_ik(x, y) < σ and small when g_ik(x, y) > σ.
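A small sketch of equations (10) and (11); with σ = 5 it reproduces the contrast values listed in Table 2 below (−0.054 for g = σ/3 and −0.9889 for g = 3σ). The norm used for |·| over the RGB channels is an assumption made here for illustration:

import numpy as np

SIGMA = 5.0

def g_ik(I_x, I_y):
    """Equation (11): average difference of the observed colours of x and y
    over the nF frames.  I_x, I_y : (nF, 3) arrays; the per-frame difference is
    taken as the L2 norm over channels (an illustrative choice)."""
    return np.linalg.norm(I_x - I_y, axis=-1).mean()

def gamma_ik(g, dist=1.0, sigma=SIGMA):
    """Equation (10): edge-strength term, down-weighted by the distance
    between x and y (dist = 1 for the 4-neighbourhood, sqrt(2) otherwise)."""
    return (1.0 - np.exp(-g ** 2 / (2.0 * sigma ** 2))) / dist

# the two i != k rows of Table 2 (4-neighbourhood, so dist = 1)
print(-gamma_ik(SIGMA / 3))   # ~ -0.054
print(-gamma_ik(3 * SIGMA))   # ~ -0.9889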


Segments    g_ik(x, y)    Contrast    Prior    Pairwise Potential
i = k       σ/3           0           0        0
i = k       3σ            0           0        0
i ≠ k       σ/3           −0.054      1.2      1.146
i ≠ k       3σ            −0.9889     1.2      0.2111

Table 2: Pairwise terms for points x ∈ p_i and y ∈ p_k, where y belongs to the 4-neighbourhood of x. In the first column, i = k implies that x and y belong to the same segment and i ≠ k implies that x and y belong to different segments. The second column lists the value of g_ik(x, y). The third, fourth and fifth columns are computed using equation (9), equation (12) and their weighted sum B_i + λ2 P_i respectively.

Thus σ should be sufficiently large to allow for the variation in RGB values within a segment. In our experiments, we use σ = 5. Note that similar contrast terms have been applied successfully in various applications, e.g. image segmentation [4] and image restoration [5].

MRF Prior: The prior is specified by an Ising model, i.e.

    P_i(x, y; Θ) = { τ  if x ∈ p_i, y ∈ p_k, i ≠ k,
                     0  if x ∈ p_i, y ∈ p_k, i = k.    (12)

In other words, the prior encourages spatial continuity by assigning a constant penalty to any pair of neighbouring pixels x and y which belong to different segments.

CRF formulation: The energy of the model can also be formulated using a conditional random field (CRF) [13]. Within the CRF framework, the posterior of the model is given by

    Pr(Θ|D) = (1/Z) exp( − Σ_{i=1}^{n_P} ( Σ_x Φ_i(x) + λ1 Σ_{x,y} Φ_i(x, y) ) ),    (13)

where Φ_i(x) and Φ_i(x, y) are called the unary and pairwise potentials respectively. The above formulation is equivalent to equation (3) for an appropriate choice of the potentials, i.e.

    Φ_i(x) = A_i(x; Θ, D),    (14)

and

    Φ_i(x, y) = B_i(x, y; Θ, D) + λ2 P_i(x, y; Θ)
              = { λ2 τ − γ_ik(x, y)  if x ∈ p_i, y ∈ p_k, i ≠ k,
                  0                  if x ∈ p_i, y ∈ p_k, i = k.

Note that the CRF is a discriminative model since the pairwise potential Φ_i(x, y|D) is data dependent. Hence, unlike the generative MRF model, it is not straightforward to generate frames using the CRF.
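A sketch of how the CRF energy (the exponent in equation (13)) could be evaluated for a given labelling. This is a simplified single-label-per-point version (the full model lets a point belong to more than one matte), and the shapes, edge list and example pairwise function are illustrative rather than the paper's implementation:

import numpy as np

def crf_energy(unary, pairwise, labels, edges, lambda1=1.0):
    """Energy of a labelling under equations (13)-(14):
    sum_x Phi(x) + lambda1 * sum_{x,y} Phi(x, y).

    unary    : (N, nP) array, unary[x, i] = A_i(x; Theta, D)
    pairwise : callable (x, y, i, k) -> B + lambda2 * P for that pair of points
    labels   : (N,) segment index assigned to each point
    edges    : list of (x, y) neighbouring point indices
    """
    e = unary[np.arange(len(labels)), labels].sum()
    e += lambda1 * sum(pairwise(x, y, labels[x], labels[y]) for x, y in edges)
    return e

# example pairwise potential matching the expression above (tau = 1.2, lambda2 = 1;
# gamma is a placeholder edge strength, in practice gamma_ik(x, y) from eq. (10))
def pairwise(x, y, i, k, tau=1.2, lambda2=1.0, gamma=0.5):
    return 0.0 if i == k else lambda2 * tau - gamma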


In all our experiments, we use λ1 = λ2 = 1 and τ = 1.2. As will be seen, these values are suitable to encourage spatially contiguous segments whose boundaries lie on image edges. Empirically, they were found to provide good segmentations for all our input videos. Table 2 shows the values of the pairwise terms for two neighbouring points x ∈ p_i and y ∈ p_k for this choice of the weights. We consider two cases for the term g_ik(x, y) defined in equation (11): (i) g_ik(x, y) = σ/3, i.e. x and y have similar appearance; (ii) g_ik(x, y) = 3σ, which implies x and y have different appearance (as is the case with neighbouring pixels lying on image edges). As can be seen from the table, the value of the pairwise potential is small when boundaries of the segment lie on image edges (i.e. when i ≠ k and g_ik(x, y) = 3σ).

For two points x and y belonging to different segments, the minimum value of the pairwise potential Φ_i(x, y) depends only on τ (since γ_ik(x, y) is always less than 1). Unlike the CRF framework, this fact comes across clearly in the CDRF formulation which forces us to treat the prior and the contrast term separately. In our case, the value of τ is chosen such that the minimum value of Φ_i(x, y) is always greater than 0.2. In other words, the penalty for assigning x and y to different segments is at least 0.2. This prevents speckles appearing in the estimated segments by encouraging contiguous regions (i.e. regions which minimize the length of the boundary between segments). For example, consider the case where x differs in appearance from all its neighbours due to noise in the video frames. It would be undesirable to assign x to a different segment from all its neighbours. Such an assignment would be discouraged since x would have to incur a penalty of at least 0.2 from all its neighbours.

In the next section, we describe a five stage approach to obtain the layered representation (i.e. Θ) of an object, given data D, by minimizing the energy Ψ(Θ|D) (i.e. maximizing Pr(Θ|D)). The method described is applicable to any scene with piecewise parametric motion.

3 Learning Layered Segmentation

1. Rigidly moving components are identified between every pair of consecutive frames by computing the image motion (§ 3.1).

2. An initial estimate of Θ is obtained by combining these components (§ 3.2).

3. The parameters Θ_A and latent variables Θ_T and Θ_L are kept constant and the mattes Θ_M are optimized using αβ-swap and α-expansion algorithms [5]. The layer numbers l_i are obtained (§ 3.3).

4. Using the refined values of Θ_M, the appearance parameters Θ_A are updated (§ 3.4).

5. Finally, the transformations Θ_T and lighting variables Θ_L are re-estimated, keeping Θ_M and Θ_A unchanged (§ 3.5).

Table 3: Estimating the parameters and latent variables of the layered representation.

Given a video, our objective is to estimate Θ (i.e. the latent image, the transformations and the lighting variables) of the layered representation. Our approach takes inspiration from the highly successful interactive image segmentation algorithm of Boykov and Jolly [4] in which the user provides a small number of object and background seed pixels. The appearance model learnt from these seed pixels then provides sufficient information to obtain reliable segmentation by minimizing an objective function similar to equation (4). In our case, the seed pixels are provided by a rough motion segmentation obtained by computing the image motion (see § 3.1 and § 3.2). These seed pixels are sufficient to bootstrap the method to minimize equation (4) to obtain reliable segmentations. This is one of the key intuitions behind our method.

We obtain the layered representation Θ in five stages. In the first stage, image motion is computed between every pair of consecutive frames to obtain rigidly moving components. An initial estimate of Θ is then found in the second stage using these components. This provides us with the seed pixels for each segment. In the remaining stages, we alternate between holding some parameters and latent variables constant and optimizing the rest as illustrated in table 3.

Our method makes use of two inference algorithms for CRFs: loopy belief propagation (LBP) and graph cuts. LBP is particularly useful for applications such as estimating motion fields where each site of the CRF has a large number of labels (i.e. equal to the number of possible motions from one frame to the next). However, when refining the model, the number of labels is small (i.e. equal to the number of segments) and hence efficient inference can be performed using graph cuts. As will be seen, we take advantage of the strengths of both algorithms. We begin by describing our approach for computing image motion.

3.1 Two frame motion segmentation

In this section, we describe a novel, efficient algorithm to obtain rigidly moving components between a pair of frames by computing the image motion. This is a simple two frame motion segmentation method that is used to initialize the more complex multiframe one described later. We use the term components here to distinguish them from the segments finally found. The method is robust to changes in appearance due to lighting and motion blur. The set of components obtained from all pairs of consecutive frames in a video sequence are later combined to get the initial estimate of the segments (see § 3.2). This avoids the problem of finding only those segments which are present in one keyframe of the video.

In order to identify points that move rigidly together from frame j to j + 1 in the given video D, we need to determine the transformation that maps each point x in frame j to its position in frame j + 1 (i.e. the image motion). However, at this stage we are only interested in obtaining a coarse estimate of the components as they will be refined later using graph cuts. This allows us to reduce the complexity of the problem by dividing frame j into uniform patches f_k of size m × m pixels and determining their transformations φ_k. However, using a large value of m may result in merging of two components. We use m = 3 for all our experiments, which was found to offer a good compromise between efficiency and accuracy.

The components are obtained in two stages: (i) finding a set of putative transformations φ_k for each patch in frame j; (ii) selecting from those initial transformations the best joint set of transformations over all patches in the frame. As the size of the patch is only 3 × 3 and we restrict ourselves to consecutive frames, it is sufficient to use transformations defined by a scale ρ_k, rotation θ_k and translation t_k, i.e. φ_k = {ρ_k, θ_k, t_k}.

Finding putative transformations: We define a CRF over the patches of frame j such that each site n_k of the CRF represents a patch f_k. Each label s_k of site n_k corresponds to a putative transformation φ_k. Note that this is a different CRF from the one described in the previous section, which models the energy of the layered representation using the unary potentials Φ_i(x) and the pairwise potentials Φ_i(x, y). It is a simpler one which we will solve in order to provide initialization for the layered representation.

In the present CRF, the likelihood ψ(s_k) of a label measures how well the patch f_k matches frame j + 1 after undergoing transformation φ_k. The neighbourhood N_k of each site n_k is defined as its 4-neighbourhood. As we are interested in finding rigidly moving components, we specify the pairwise term ψ(s_k, s_l) such that it encourages neighbouring patches to move rigidly together. The joint probability of the transformations is given by

    Pr(φ) = (1/Z2) Π_k [ ψ(s_k) Π_{n_l ∈ N_k} ψ(s_k, s_l) ],    (15)

where Z2 is the partition function of the CRF and φ is the set of transformations {φ_k, ∀k}.

By taking advantage of the fact that large scaling, translations and rotations are not expected between consecutive frames, we restrict ourselves to a small number of putative transformations. Specifically, we vary the scale ρ_k from 0.7 to 1.3 in steps of 0.3, the rotation θ_k from −0.3 to 0.3 radians in steps of 0.15 and the translations t_k in the vertical and horizontal directions from −5 to 5 pixels and −10 to 10 pixels respectively in steps of 1. Thus, the total number of transformations is 3465.
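The putative-transformation grid described above can be enumerated directly; a short sketch (variable names are illustrative) that also confirms the count of 3465:

import itertools

scales = [0.7, 1.0, 1.3]                                  # rho_k
rotations = [-0.30, -0.15, 0.0, 0.15, 0.30]               # theta_k (radians)
ty = range(-5, 6)                                          # vertical shift, pixels
tx = range(-10, 11)                                        # horizontal shift, pixels

transformations = [
    {"scale": s, "rot": r, "t": (x, y)}
    for s, r, y, x in itertools.product(scales, rotations, ty, tx)
]
assert len(transformations) == 3465                        # 3 * 5 * 11 * 21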

The likelihood of patch f_k undergoing transformation φ_k is modelled as ψ(s_k) ∝ exp(L(f_k, φ_k)). The term L(f_k, φ_k) is the normalized cross-correlation between frame j + 1 and an n × n window around the patch f_k, transformed according to φ_k. When calculating L(f_k, φ_k) in this manner, the n × n window is subjected to different degrees of motion blur according to the motion specified by φ_k, and the best match score is chosen. This, along with the use of normalized cross-correlation, makes the likelihood estimation robust to lighting changes and motion blur. In all our experiments, we used n = 5. Since the appearance of a patch does not change drastically between consecutive frames, normalized cross-correlation provides reliable match scores. Unlike [10], we do not discard the transformations resulting in a low match score. However, it will be seen later that this does not significantly increase the amount of time required for finding the minimum mean squared error (MMSE) estimate of the transformations due to our computationally efficient method.
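A sketch of the likelihood ψ(s_k) ∝ exp(L(f_k, φ_k)), with L taken as the normalized cross-correlation of two equally sized windows; the geometric warp and the motion-blur variants of the window are omitted here and the window is assumed to be already transformed:

import numpy as np

def ncc(window, target):
    """Normalized cross-correlation between two equally sized patches."""
    w = window - window.mean()
    t = target - target.mean()
    denom = np.sqrt((w ** 2).sum() * (t ** 2).sum())
    return float((w * t).sum() / denom) if denom > 0 else 0.0

def likelihood(window, target):
    """psi(s_k) proportional to exp(L(f_k, phi_k)) for one putative transformation."""
    return np.exp(ncc(window, target))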

We want to assign the pairwise terms ψ(s_k, s_l) such that neighbouring patches f_k and f_l which do not move rigidly together are penalized. However, we would be willing to take the penalty when determining the MMSE estimate if it results in better match scores. Furthermore, we expect two patches separated by an edge to be more likely to move non-rigidly since they might belong to different segments. Thus, we define the pairwise terms by a Potts model such that

    ψ(s_k, s_l) = { exp(1)              if rigid motion,
                    exp(ζ ∇(f_k, f_l))  otherwise,    (16)

where ∇(f_k, f_l) is the average of the gradients of the neighbouring pixels x ∈ f_k and y ∈ f_l, i.e. along the boundary shared by f_k and f_l. The term ζ specifies how much penalty is assigned for two neighbouring patches not moving rigidly together. We choose ζ such that it scales ζ ∇(f_k, f_l) to lie between 0 and 1.

To handle occlusion, an additional label s_o is introduced for each site n_k which represents the patch f_k being occluded in frame j + 1. The corresponding likelihoods and pairwise terms ψ(s_o), ψ(s_k, s_o), ψ(s_o, s_k) and ψ(s_o, s_o) are modelled as constants for all k. In our experiments, we used the values 0.1, 0.5, 0.5 and 0.8 respectively. The higher value for ψ(s_o, s_o) specifies that two neighbouring patches tend to get occluded simultaneously.

Obtaining the transformations: The best joint set of transformations for all patches is found by maximizing the probability Pr(φ) defined in equation (15). We use sum-product loopy belief propagation (LBP) [15] to find the posterior probability of a patch f_k undergoing transformation φ_k. This provides us with the MMSE estimate of the transformations.

The two main limitations of LBP are its large memory requirements and its computational inefficiency. We overcome these problems by developing a novel coarse to fine LBP algorithm. This algorithm groups similar labels of the CRF to obtain a smaller number of representative labels, thereby reducing the memory requirements. The time complexity of LBP is also reduced using the method described in [7]. Details of the algorithm can be found in Appendix A.

Once the transformations for all the patches of frame j have been determined, we cluster the points moving rigidly together to obtain rigid components. Components with size less than 100 pixels are merged with surrounding components. We repeat this process for all pairs of consecutive frames of the video. The kth component of frame j is represented as a set of points C^j_k. Fig. 6 shows the result of our approach on four pairs of consecutive frames. Fig. 7 shows the advantage of modelling motion blur when computing the likelihood of a patch f_k undergoing a transformation φ_k. In the next section, we describe how an initial estimate of the layered representation is obtained using the rigid pairwise components.
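One way to sketch the grouping of rigidly moving patches into components is via connected components of patches that were assigned the same transformation label; treating "same label" as "moving rigidly" is a simplification, and the merging of components below 100 pixels into a surrounding component is omitted for brevity:

import numpy as np
from scipy import ndimage

def rigid_components(label_map, patch_size=3, min_pixels=100):
    """label_map : (Hp, Wp) MMSE transformation label of each m x m patch.
    Returns an integer component map over patches; patches sharing a label and
    touching each other form one rigidly moving component."""
    components = np.zeros_like(label_map)
    next_id = 1
    for lab in np.unique(label_map):
        comp, n = ndimage.label(label_map == lab)      # 4-connected by default
        for c in range(1, n + 1):
            mask = comp == c
            if mask.sum() * patch_size ** 2 >= min_pixels:
                components[mask] = next_id
                next_id += 1
    return components   # 0 marks patches left to be merged with a neighbour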

3.2 Initial estimation of the model over multiple frames

In this section, we describe a method to get an initial estimate of Θ. The method consists of two stages: (i) combining rigidly moving components to obtain the number of segments and the initial estimate of their shape parameter Θ_Mi; (ii) computing the remaining parameters and latent variables, i.e. Θ_Ai, Θ^j_Ti and Θ^j_Li.

Combining rigid components: Given the set of all pairwise components, we want to determine the number of segments n_P present in the entire video sequence and


Figure 6: Results of obtaining the MMSE estimates of the transformations. The first two columns show consecutive frames of a video. The third column shows the reconstruction of the second frame obtained by mapping the patches of the first frame according to the transformations obtained using coarse-to-fine efficient LBP. Points which are occluded in the first frame but are present in the second frame would be missing from the reconstruction. These points are shown in red. Fragments moving rigidly are clustered together to obtain the components shown in the fourth column.

obtain an initial estimate of their shape. The task is made difficult due to the following problems: (i) a segment may undergo different transformations (scaling, rotation and translation) from one frame to the next, which need to be recovered by establishing a correspondence over components; (ii) the transformations φ_k of the components (found in § 3.1) may not be accurate enough to obtain the required correspondence; and (iii) each component may overlap with one or more segments, thereby providing us with multiple estimates of the shape parameter Θ_Mi.

Fig. 8 shows an example of the problem of combining rigid components. All the components shown in Fig. 8(a) contain the 'torso' segment, which undergoes different transformations in each of the four frames. In order to recover the shape of the torso, we need to establish a correspondence over the components (i.e. find the components which contain the torso and determine the transformations between them). Further, the method for obtaining the correspondence should be robust to errors in the transformations φ_k. Finally, the initial estimate of the shape of the torso needs to be determined



Figure 7: Effects of modelling motion blur. (a)-(b) Two consecutive frames from the video shown in Fig. 1. (c) The image motion computed without modelling motion blur. The reconstruction of the second frame, obtained by mapping the patches of the first frame according to the transformations obtained, indicates that incorrect transformations are found around the feet (see e.g. the shoes) where there is considerable motion blur. Note that the pixels marked red are those that are occluded in the first frame but present in the second frame. (d) Results after modelling motion blur. Accurate transformations for the patches belonging to the shoes are obtained by accounting for the change in appearance due to motion blur.

from the four estimates provided by the components.

We overcome the first problem (i.e. establishing correspondence over components) by associating the components from one frame to the next using the transformations φ_k. This association is considered transitive, thereby establishing a correspondence of components throughout the video sequence.

However, as noted above (in problem (ii)), this correspondence may not be accurate due to errors in the transformations φ_k. For example, an error in the transformation may result in a component containing the torso corresponding to a background component. Hence, we need to make our method more robust to errors in φ_k. To this end, we attempt to cluster the components such that each cluster contains only those components which belong to one segment. Clearly, such clusters would provide correspondence over the components. Note that we rely on every segment of the scene being detected as an individual component in at least one frame.

In order to cluster the components, we measure the similarity of each component C^j_k in frame j with all the components of frame l that lie close to the component corresponding to C^j_k in frame l (i.e. not just to the corresponding component). This makes the method robust to small errors in φ_k. The similarity of two components is


Figure 8: (a) Examples of rigid pairwise components obtained for the video sequence shown in Fig. 1 which contain the 'torso' segment. The components are shown to the right of the corresponding frames. The initial estimate of the torso is obtained by establishing a correspondence over these components. (b) The smallest component (i.e. the top left one) is used as the initial estimate of the shape of the torso segment.

measured using normalized cross-correlation (to account for changes in appearance due to lighting conditions) over all corresponding pixels.

The number of segments is identified by clustering similar components together using agglomerative clustering. Agglomerative clustering starts by treating each component as a separate cluster. At each step, the two most similar clusters (i.e. the clusters containing the two most similar components) are combined together. The algorithm is terminated when the similarity of all pairs of clusters falls below a certain threshold. We simply let components containing more than one segment lie in a cluster representing one of these segments. For example, the top right component shown in Fig. 8 may belong to the cluster representing either the 'head' or the 'torso'.
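A minimal sketch of this agglomerative clustering step, assuming the component-to-component similarities have already been computed and passed in as a matrix; the threshold value is illustrative:

import numpy as np

def agglomerative_cluster(similarity, threshold=0.8):
    """similarity : (n, n) symmetric matrix of component similarities
    (e.g. normalized cross-correlation over corresponding pixels).
    Returns a list of clusters, each a list of component indices."""
    clusters = [[i] for i in range(similarity.shape[0])]
    while len(clusters) > 1:
        # similarity of two clusters = similarity of their two most similar members
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(similarity[i, j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:               # stop when no pair is similar enough
            break
        a, b = pair
        clusters[a].extend(clusters.pop(b))
    return clusters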

Finally, we address the third problem (i.e. the overlapping of components with multiple segments) by choosing the smallest component of each cluster to define the shape Θ_Mi of the segment p_i, as shown in Fig. 8. This avoids using a component containing more than one segment to define the shape of a segment. However, this implies that the initial estimate will often be smaller than (or equal to) the ground truth and thus needs to be expanded as described in § 3.3.

The above method for combining rigid components to obtain segments is similar to the method described by Ramanan and Forsyth [16], who cluster rectangular fragments found in a video to obtain parts of an object. However, they rely on finding parallel lines of contrast to define the fragments, which restricts their method to a small class of objects and videos. In contrast, our method obtains rigidly moving components by computing image motion and hence is applicable to any video containing piecewise parametric motion.

The initial shape estimates of the segments, excluding the background, obtained by our method are shown in Fig. 9. Note that all the segments of the person visible in the video have been found.


Figure 9: Result of clustering all pairwise rigid components of the video sequence shown in Fig. 1. (a) Components obtained for four pairs of consecutive frames. (b) Initial estimate of the shape of the segments obtained by choosing the smallest component in each cluster.

Initial estimation of the model: Once the mattes Θ_Mi are found, we need to determine the initial estimate of the remaining parameters and latent variables of the model. The transformations Θ^j_Ti are obtained using φ_k and the component clusters. The appearance parameter Θ_Ai(x) is given by the mean of I^j_i(x) over all frames j. The histograms H_i are computed using the RGB values Θ_Ai(x) for all points x ∈ p_i. As the size of the segment is small (and hence the number of such RGB values is small), the histogram is implemented using only 15 bins each for R, G and B. The lighting variables a^j_i and b^j_i are calculated in a least squares manner using Θ_Ai(x) and I^j_i(x), for all x ∈ p_i. The motion variables m^j_i are given by Θ^j_Ti and Θ^{j−1}_Ti. This initial estimate of the model is then refined by optimizing each parameter or latent variable while keeping the others unchanged. We start by optimizing the shape parameters Θ_M as described in the next section.
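A sketch of this per-segment initialisation: the appearance is the mean over frames, the colour histogram uses 15 bins per channel, and the lighting variables are fitted per channel by least squares (the array shapes and colour range are illustrative assumptions):

import numpy as np

def init_segment(I):
    """I : (nF, N, 3) observed colours I_i^j(x) of the segment's N points."""
    appearance = I.mean(axis=0)                                   # Theta_Ai(x)
    hist, _ = np.histogramdd(appearance, bins=(15, 15, 15),
                             range=[(0, 256)] * 3)                # H_i
    lighting = [fit_lighting(appearance, I[j]) for j in range(I.shape[0])]
    return appearance, hist, lighting

def fit_lighting(appearance, frame_colours):
    """Least-squares fit of a_i^j, b_i^j per channel:
    frame_colours ~ a * appearance + b (equation (1), ignoring motion blur)."""
    a = np.empty(3)
    b = np.empty(3)
    for ch in range(3):
        A = np.stack([appearance[:, ch], np.ones(len(appearance))], axis=1)
        (a[ch], b[ch]), *_ = np.linalg.lstsq(A, frame_colours[:, ch], rcond=None)
    return a, b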

3.3 Refining shape

In this section, we describe a method to refine the estimate of the shape parameters Θ_M and determine the layer numbers l_i using the αβ-swap and α-expansion algorithms [5]. Given an initial coarse estimate of the segments, we iteratively improve their shape using consistency of motion and texture over the entire video sequence. The refinement is carried out such that it minimizes the energy Ψ(Θ|D) of the model given in equation (4).

To this end, we take advantage of efficient algorithms for multi-way graph cuts which minimize an energy function over point labellings h of the form

    Ψ̂ = Σ_{x ∈ X} D_x(h_x) + Σ_{x,y ∈ N} V_{x,y}(h_x, h_y),    (17)

under fairly broad constraints on D and V (which are satisfied by the energy of the layered representation) [9]. Here D_x(h_x) is the cost for assigning the label h_x to point x and V_{x,y}(h_x, h_y) is the cost for assigning labels h_x and h_y to the neighbouring points x and y respectively.

Specifically, we make use of two algorithms: αβ-swap and α-expansion [5]. The αβ-swap algorithm iterates over pairs of segments, p_α and p_β. At each iteration, it refines the mattes of p_α and p_β by swapping the values of Θ_Mα(x) and Θ_Mβ(x) for some points x. The α-expansion algorithm iterates over segments p_α. At each iteration, it assigns Θ_Mα(x) = 1 for some points x. Note that α-expansion never reduces the number of points with label α.

In our previous work [10], we described an approach for refining the shape parameters of the LPS model where all the segments are restricted to lie in one reference frame. In other words, each point on the reference frame has a unique label, i.e. it belongs to only one segment. In that case, it was sufficient to refine one segment at a time using the α-expansion algorithm alone to correctly relabel all the wrongly labelled points. For example, consider a point x ∈ p_i which was wrongly labelled as belonging to p_k. During the expansion move where α = i, the point x would be relabelled to p_i (and hence it would not belong to p_k). Since the shape of each segment in the layered representation is modelled using a separate matte, this restriction no longer holds true (i.e. each point x can belong to more than one segment). Thus, performing only α-expansion would incorrectly relabel the point x to belong to both p_i and p_k (and not to p_i alone). We overcome this problem by performing αβ-swap over pairs of segments. During the swap move when α = i and β = k, the point x would be relabelled to p_i and would no longer belong to p_k. The α-expansion algorithm would then grow the segments, allowing them to overlap (e.g. the segments in Fig. 10 grow due to the α-expansion algorithm). Therefore, refining the shape parameters Θ_Mi of the layered representation requires both the αβ-swap and α-expansion algorithms.

A standard way to minimize the energy of the layered representation would be to fix the layer numbers of the segments and refine their shape by performing α-expansion for each segment and αβ-swap for each pair of segments. The process would be repeated for all possible assignments of layer numbers and the assignment which results in the minimum energy would be chosen. However, this would be computationally inefficient because of the large number of possible layer numbers for each segment. In order to reduce the complexity of the algorithm, we make use of the fact that only those segments which overlap with each other are required to determine the layering.

We define the limit L_i of a segment p_i as the set of points x whose distance from p_i is at most 40% of the current size of p_i. Given segment p_i, let p_k be a segment such that the limit L_i of p_i overlaps with p_k in at least one frame j of the video. Such a segment p_k is said to be surrounding the segment p_i. The number of surrounding segments p_k is quite small for objects such as humans and animals which are restricted in motion. For example, the head segment of the person shown in Fig. 1 is surrounded by only the torso and the background segments.
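The limit L_i can be sketched with a distance transform of the matte; how the "40% of the current size" is measured is not spelt out above, so a square-root-of-area proxy is used here purely as an assumption:

import numpy as np
from scipy.ndimage import distance_transform_edt

def segment_limit(matte):
    """matte : (H, W) binary matte Theta_Mi.  Returns the limit L_i: all points
    whose distance from the segment is at most 40% of its size (size taken here
    as sqrt(area), an illustrative choice)."""
    size = np.sqrt(matte.sum())
    dist = distance_transform_edt(~matte.astype(bool))   # distance to the segment
    return dist <= 0.4 * size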


We iterate over segments and refine the shape of one segment p_i at a time. At each iteration, we perform an αβ-swap for p_i and each of its surrounding segments p_k. This relabels all the points which were wrongly labelled as belonging to p_i. We then perform an α-expansion to expand p_i to include those points x in its limit which move rigidly with p_i. During the iteration refining p_i, we consider three possibilities for p_i and its surrounding segment p_k: l_i = l_k, l_i > l_k or l_i < l_k. Recall that if l_i < l_k, we assign Pr(I^j_i(x)|Θ) = κ1 for frames j where x is occluded by a point in p_k. We choose the option which results in the minimum value of Ψ(Θ|D). This determines the occlusion ordering among surrounding segments. We stop iterating when further reduction of Ψ(Θ|D) is not possible. This provides us with a refined estimate of Θ_M along with the layer numbers l_i of the segments. Since the neighbourhood of each point x is small (see Fig. 3), graph cuts can be performed efficiently. The graph constructions for both the αβ-swap and α-expansion algorithms are provided in Appendix B.

Fig. 10 shows the refined shape parameters of the segments obtained by the above method using the initial estimates. Results indicate that reliable shape parameters can be learnt even while using a small neighbourhood. Note that though the torso is partially occluded by the arm and the back leg is partially occluded by the front leg in every frame, their complete shape has been learnt using individual binary mattes for each segment. Next, the appearance parameters corresponding to the refined shape parameters are obtained.

3.4 Updating appearance

Once the mattes Θ_Mi of the segments are obtained, the appearance of a point x ∈ p_i, i.e. Θ_Ai(x), is calculated as the mean of I^j_i(x) over all frames j. The histograms H_i are recomputed using the RGB values Θ_Ai(x) for all points x ∈ p_i. Fig. 11 shows the appearance of the parts of the human model learnt using the video in Fig. 1. The refined shape and appearance parameters help in obtaining a better estimate for the transformations as described in the next section.

3.5 Refining the transformations

Finally, the transformations Θ_T and the lighting variables Θ_L are refined by searching over putative transformations around the initial estimate, for all segments at each frame j. For each putative transformation, the variables {a^j_i, b^j_i} are calculated in a least squares manner. The variables which result in the smallest SSD are chosen. When refining the transformation, we searched for putative transformations by considering translations up to 5 pixels in steps of 1, scales 0.9, 1.0 and 1.1, and rotations between −0.15 and 0.15 radians in steps of 0.15 radians around the initial estimate. Fig. 12 shows the rotation and translation in the y-axis of the upper arm closest to the camera in Fig. 1 obtained after refining the transformations.
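A sketch of this refinement search; for brevity only the translational part of the grid is enumerated and the warp is a pure pixel shift with wrap-around (the scale/rotation loop and a proper warp would follow the same pattern), so it is an illustration of the search-fit-score structure rather than the paper's implementation:

import numpy as np

def refine_translation(appearance_img, frame, matte, search=5):
    """Search translations within +/- search pixels of the current estimate.
    For each candidate, fit the lighting by least squares and keep the smallest SSD.

    appearance_img : (H, W, 3) segment appearance rendered at its current pose
    frame          : (H, W, 3) observed frame
    matte          : (H, W) binary matte of the segment at its current pose
    """
    best = (None, np.inf)
    pts = matte.astype(bool)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            shifted = np.roll(frame, shift=(-dy, -dx), axis=(0, 1))
            src = appearance_img[pts]
            obs = shifted[pts]
            # per-channel least-squares lighting fit, then SSD of the residual
            ssd = 0.0
            for ch in range(3):
                A = np.stack([src[:, ch], np.ones(len(src))], axis=1)
                coef, *_ = np.linalg.lstsq(A, obs[:, ch], rcond=None)
                ssd += np.sum((A @ coef - obs[:, ch]) ** 2)
            if ssd < best[1]:
                best = ((dy, dx), ssd)
    return best   # ((dy, dx), ssd) of the best candidate translation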

The model Θ obtained using the five stage approach described above can be used iteratively to refine the estimation of the layered representation. However, we found that it does not result in a significant improvement over the initial estimate as the parameters and latent variables do not change much from one iteration to the other. In the next section, we describe a method to refine the segmentation of each frame.


Figure 10: Result of refining the mattes of the layered representation of a person using multi-way graph cuts. The shape of the head is re-estimated after one iteration. The next iteration refines the torso segment. Subsequent iterations refine the half limbs one at a time. Note that the size of the mattes is equal to that of a frame of the video but smaller mattes are shown here for clarity.

3.6 Refining the segmentation of frames

Our model maps the segments onto a frame using only simple geometric transformations. This would result in gaps in the generated frame when the motion of segments cannot be accurately defined by such transformations. In order to deal with this, we refine the segmentation of each frame by relabelling the points around the boundary of segments. Note that this step is performed only to obtain more accurate segmentations and does not change the values of any parameters or latent variables. The relabelling is performed using the α-expansion algorithm. The cost D_x(h_x) of assigning a point x around the boundary of p_i to p_i is the inverse log likelihood of its observed RGB values in that frame given by the histogram H_i. The cost V_{x,y}(h_x, h_y) of assigning two different labels h_x and h_y to neighbouring points x and y is directly proportional to B_i(x, y; Θ, D) for that frame. Fig. 13 shows an example where the gaps in the segmentation are filled using the above method.


Figure 11: Appearance of the parts learnt for the human model as described in § 3.4.


Figure 12: (a) Rotation of the upper arm obtained after refining the transformations as described in § 3.5. During the first half of the video, the arm swings away from the body, while in the second half it rotates towards the body. Clearly, this motion has been captured in the learnt rotations. (b) Translation of the upper arm in the horizontal direction. The person moves from the right to the left of the scene with almost constant velocity as indicated by the learnt translations. Note that the transformations are refined individually for each frame and are therefore not necessarily smooth.



Figure 13: Result of refining the segmentation. (a) The segmentation obtained by compositing the transformed segments in descending order of the layer numbers. (b) The refined segmentation obtained using α-expansion (see text). Note that the gaps in the segmentation that appear in (a), e.g. between the upper and lower half of the arm, have been filled.

4 Results

We now present results for motion segmentation using the learnt layered representation of the scene. The method is applied to different types of object classes (such as jeep, humans and cows), foreground motion (pure translation, piecewise similarity transforms) and camera motion (static and translating) with static backgrounds. We use the same weight values in all our experiments.

Figs. 14-16 show the segmentations obtained by generating frames using the learnt representation by projecting all segments other than those belonging to layer 0 (i.e. the background). Figs. 14(a) and 14(b) show the result of our approach on simple scenarios where each layer of the scene consists of segments which are undergoing pure translation. Despite having a lot of flexibility in the putative transformations by allowing for various rotations and scales, the initial estimation recovers the correct transformations, i.e. those containing only translation. Note that the transparent windshield of the jeep is (correctly) not recovered in the M.A.S.H. sequence as the background layer can be seen through it. For the sequence shown in Fig. 14(a) the method proves robust to changes in lighting conditions and it learns the correct layering for the segments corresponding to the two people.

Fig. 15(a) and 15(b) show the motion segmentation obtained for two videos, each of a person walking. In both cases, the body is divided into the correct number of segments (head, torso and seven visible half-limbs). Our method recovers well from occlusion in these cases. For such videos, the feet of a person are problematic as they tend to move non-rigidly with the leg in some frames. Indeed, the feet are missing from the segmentations of some of the frames.


(a)

(b)

Figure 14: Motion Segmentation Results I. In each case, the left image shows the various segments obtained in different colours. The top row shows the original video sequence while the segmentation results are shown in the bottom row. (a): A 40 frame sequence taken from a still camera (courtesy Nebojsa Jojic [8]). The scene contains two people undergoing pure translation in front of a static background. The results show that the layering is learnt correctly. (b): A 10 frame video sequence taken from 'M.A.S.H.'. The video contains a jeep undergoing translation and slight out-of-plane rotation against a static background while the camera pans to track the jeep.

Note that the grass in Fig. 15(b) has similar intensity to the person's trousers, and there is some error in the transformations of the legs.

Fig. 16(a) and 16(b) are the segmentations of a cow walking. Again, the body of the cow is divided into the correct number of segments (head, torso and eight half-limbs). The cow in Fig. 16(a) undergoes a slight out-of-plane rotation in some frames, which causes some bits of grass to be pulled into the segmentation. The video shown in Fig. 16(b) is taken from a poor quality analog camera. However, our algorithm proves robust enough to obtain the correct segmentation. Note that when relabelling the points around the boundary of segments, some parts of the background which are similar in appearance to the cow get included in the segmentation.

Our approach can also be used to segment objects present at different depths when the camera is translating. This is because their transformations with respect to the camera will differ. Fig. 17 shows one such example using the well-known garden sequence. Note that the correct number of objects has been found and a good segmentation is obtained.


(a)

(b)

Figure 15: Motion Segmentation Results II. (a): A 31 frame sequence taken from a still camera (courtesy Hedvig Sidenbladh [17]). The scene consists of a person walking against a static background. The correct layering of the various segments of the person is learnt. The ground truth used for comparison is also shown in the third row. (b): A 57 frame sequence taken from a translating camera of a person walking against a static background (courtesy Ankur Agarwal [1]). Again, the correct layering of the segments is learnt.

Timing: The initial estimation takes approximately 5 minutes for every pair of frames: 3 minutes for computing the likelihood of the transformations and 2 minutes for MMSE estimation using LBP. The shape parameters of the segments are refined by minimizing the energy Ψ(Θ|D) as described in § 3.3. The graph cut algorithms used have, in practice, a time complexity which is linear in the number of points in the binary matte Θ_Mi. It takes less than 1 minute to refine the shape of each segment. Most of the time is taken up in calculating the various terms which define the energy Ψ(Θ|D), as shown in equation (4). Since the algorithm provides a good initial estimate, it converges after at most 2 iterations through each segment. All timings provided are for a C++ implementation on a 2.4 GHz processor.

Ground truth comparison: The segmentation performance of our method was assessed using eight manually segmented frames (four each from the challenging sequences shown in Fig. 15(a) and 16(b)).


(a)

(b)

Figure 16: Motion Segmentation Results III. (a): A 44 frame sequence of a cow walking taken from a translating camera. All the segments, along with their layering, are learnt. (b): A 30 frame sequence of a cow walking against a static (homogeneous) background (courtesy Derek Magee [14]). The video is taken from a still analog camera which introduces a lot of noise. The results obtained using our approach (row 2) and the ground truth used for comparison (row 3) are also shown.

Out of 80901 ground truth foreground pixels and 603131 ground truth background pixels in these frames, 79198 (97.89%) and 595054 (98.66%) respectively were present in the generated frames. Most errors were due to the assumption of piecewise parametric motion and to similar foreground and background pixels.

Sensitivity to weights: When determining the rigidity of two transformations, or when clustering patches to obtain components, we allow the translations to vary by one pixel in the x and y directions to account for errors introduced by the discretization of the putative transformations. Fig. 18 shows the effect of not allowing for slight variations in the translations. As expected, it oversegments the body of the person. However, allowing for more variation does not undersegment, as different components move quite non-rigidly for a large class of scenes and camera motions.
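For illustration, the tolerance test could look like the sketch below, assuming each putative transformation is a tuple of discretized parameters (tx, ty, rotation, scale); the actual parametrization and clustering code are not given here, so the names are hypothetical:

def same_rigid_motion(t1, t2, tol=1):
    # Two discretized transformations are treated as defining the same rigid
    # motion if their translations differ by at most `tol` pixels in x and y
    # and the remaining parameters agree.
    (tx1, ty1, rot1, s1), (tx2, ty2, rot2, s2) = t1, t2
    return (abs(tx1 - tx2) <= tol and abs(ty1 - ty2) <= tol
            and rot1 == rot2 and s1 == s2)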

Our model explicitly accounts for spatial continuity using the weights λ1 and λ2, as described in equation (4). Recall that λ1 and λ2 are the weights given to the contrast and the prior terms, which encourage the boundaries of segments to lie on image edges.


Figure 17: Segmenting objects. The top row shows some frames from the 29 frame garden sequence taken from a translating camera. The scene contains four objects, namely the sky, the house, the field and the tree, at different depths, which are learnt correctly. The bottom row shows the appearance and shape of the segmented objects.

Figure 18: Results of finding rigidly moving components for four frames from the video shown in Fig. 1 without tolerating slight variations in translation. This oversegments the body of the person, thereby resulting in a large number of incorrect segments. However, the algorithm is robust to larger tolerances, as neighbouring components move quite non-rigidly.

Fig. 19 shows the effect of setting λ1 and λ2 to zero, thereby not modelling spatial continuity. Note that this significantly deteriorates the quality of the segmentation when the background is homogeneous.

Sensitivity to length of sequence: We tested our approach by varying the number of frames used from the video sequence shown in Fig. 1. Since the number of segments (and their initial shape) is found using rigidly moving components, using fewer frames tends to undersegment the object. For example, given 10 frames of a video where the two half-limbs of an arm move rigidly together, our method would detect the arm as one segment. Fig. 20 shows the initial estimate of the segments obtained for a varying number of input frames. Note that the video sequence contains two half-periods of motion (i.e. the person takes two steps forward, first with the left leg and then with the right leg). As expected, the algorithm undersegments the body when the full period of motion is not considered. By the twenty-fourth frame, i.e. just after the beginning of the second half-period, all visible segments are detected due to the difference in their transformations.


(a) (b)

(c) (d)

Figure 19: Encouraging spatial continuity. (a) Result obtained by setting λ1 and λ2 to zero. The method works well for the simple case of the video shown in Fig. 14(a), where the foreground and background differ significantly. When compared with ground truth, 93.1% of foreground pixels and 99.8% of background pixels were labelled correctly. (b) By encouraging spatial continuity, a small improvement is observed (95.4% of foreground pixels and 99.9% of background pixels were present in the generated frame). (c) For the more difficult case shown in Fig. 14(b), the segmentation starts to include parts of the homogeneous background when spatial continuity is not enforced. Only 91.7% of foreground pixels and 94.1% of background pixels are generated, compared to 95% of foreground pixels and 99.8% of background pixels correctly obtained when encouraging spatial continuity (shown in (d)).

Using more than twenty-four frames does not change the number of segments obtained. However, the initial estimate of the segments changes, as smaller components are found in subsequent frames (see § 3.2).

5 Summary and Discussion

The algorithm proposed in this paper achieves good motion segmentation results. Why is this? We believe the reasons are twofold. Incremental improvements in the Computer Vision field have now ensured that: (i) we can use an appropriate model which accounts for motion, changes in appearance, layering and spatial continuity; the model is neither too loose, so as to undersegment, nor too tight, so as to oversegment; and (ii) we have more powerful modern algorithmic methods, such as LBP and graph cuts, which avoid local minima better than previous approaches.

However, there is still more to do.


Figure 20: Results of obtaining the initial estimate of the segments for a varying number of input frames. The refined estimates of the shape obtained using the method described in § 3.3 are also shown. During the first four frames, only two segments are detected, i.e. the body and a leg. In the next four frames, the arm close to the camera and the other leg are detected. The half-limbs which constitute this arm and leg are detected using 11 frames of the video sequence. When 24 frames are used, all 9 visible segments of the body are detected. The initial estimate of the segments and the refined estimate of their shapes for the entire video sequence are shown in Fig. 10.

As is standard in methods using a layered representation, we have assumed that the visual aspects of the objects and the layering of the segments do not change throughout the video sequence. At the very least, we need to extend the model to handle the varying visual aspects of objects present in the scene, e.g. front, back and 3/4 views, in addition to the side views. The restriction of rigid motion within a segment can be relaxed using non-parametric motion models.

For our current implementation, we have set the values of the weights λ1 and λ2 and the constants κ1 and κ2 empirically. Although these values provide good results for a large class of videos, it would be interesting to learn them using ground truth segmentations (similar to [3] for image segmentation).


Acknowledgments. This work was supported by the EPSRC research grant EP/C006631/1(P) and the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

Appendix A

Efficient Coarse to Fine Loopy Belief Propagation

Loopy belief propagation (LBP) is a message passing algorithm similar to the one proposed by Pearl [15] for graphical models with no loops. For the sake of completeness, we describe the algorithm below.

The message that site n_k sends to its neighbour n_l at iteration t is given by

m^t_{kl}(s_l) = \sum_{s_k} \psi(s_k, s_l)\, \psi(s_k) \prod_{n_d \in N_k \setminus n_l} m^{t-1}_{dk}(s_k).   (18)

All messages are initialized to 1, i.e. m^0_{kl}(s_l) = 1 for all k and l. The belief (posterior) of a patch f_k undergoing transformation ϕ_k after T iterations is given by

b(s_k) = \psi(s_k) \prod_{n_l \in N_k} m^T_{lk}(s_k).   (19)

The termination criterion is that the rate of change of all beliefs falls below a certain threshold. The label s*_k that maximizes b(s_k) is selected for each patch, thus providing an estimate of the image motion.
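To make the updates concrete, here is a minimal Python sketch of equations (18) and (19). It assumes the CRF is supplied as dictionaries of unary potentials, pairwise tables (given for both directions of each edge) and symmetric neighbour lists; the message normalization is an addition for numerical stability, and all names are illustrative rather than the paper's implementation.

import numpy as np

def loopy_bp(unary, pairwise, neighbours, max_iters=20, tol=1e-4):
    # unary: dict site -> (H,) array of psi(s_k); pairwise: dict (k, l) -> (H, H)
    # array of psi(s_k, s_l); neighbours: dict site -> list of neighbouring sites.
    H = len(next(iter(unary.values())))
    msgs = {(k, l): np.ones(H) for k in neighbours for l in neighbours[k]}  # eq. (18) init
    beliefs = {k: np.full(H, 1.0 / H) for k in unary}
    for _ in range(max_iters):
        new_msgs = {}
        for (k, l) in msgs:
            # psi(s_k) times the product of incoming messages, excluding the one from n_l.
            prod_in = unary[k].copy()
            for d in neighbours[k]:
                if d != l:
                    prod_in *= msgs[(d, k)]
            m = pairwise[(k, l)].T @ prod_in          # sum over s_k, as in equation (18)
            new_msgs[(k, l)] = m / m.sum()            # normalization (added for stability)
        msgs = new_msgs
        new_beliefs = {}
        for k in unary:
            b = unary[k].copy()
            for l in neighbours[k]:
                b *= msgs[(l, k)]                     # equation (19)
            new_beliefs[k] = b / b.sum()
        change = max(np.abs(new_beliefs[k] - beliefs[k]).max() for k in unary)
        beliefs = new_beliefs
        if change < tol:                              # rate-of-change termination criterion
            break
    return {k: int(np.argmax(beliefs[k])) for k in unary}, beliefs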

The time complexity of LBP is O(nH^2), where n is the number of sites in the CRF and H is the number of labels per site, which makes it computationally infeasible for large H. However, since the pairwise terms of the CRF are defined by a Potts model as shown in equation (16), the runtime of LBP can be reduced to O(nH′), where H′ ≪ H^2, using the method described in [7]. Briefly, we can speed up the computation of the message m^t_{kl} by precomputing the terms which are common in m^t_{kl}(s_l) for all labels s_l, as follows:

T = \sum_{s_k} \psi(s_k) \prod_{n_d \in N_k \setminus n_l} m^{t-1}_{dk}(s_k).   (20)

To compute the message m^t_{kl}(s_l) for a particular label s_l, we now consider only those labels s_k which define a rigid motion with s_l. These labels are denoted by the set C_k(s_l). Specifically, let

T_c = \sum_{s_k \in C_k(s_l)} \psi(s_k) \prod_{n_d \in N_k \setminus n_l} m^{t-1}_{dk}(s_k),   (21)

which can be computed efficiently by summing |C_k(s_l)| ≪ H terms. Clearly, the message m^t_{kl}(s_l) defined in equation (18) is equivalent to

m^t_{kl}(s_l) = T_c \exp(1) + (T - T_c) \exp(\zeta \nabla(f_k, f_l)).   (22)
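As a sketch (with illustrative names, not the paper's code), the saving can be realised by computing the full sum T once per pair of sites and revisiting only the small rigid subset C_k(s_l) per label:

import numpy as np

def potts_message(prod_in, rigid_sets, zeta_grad_kl):
    # prod_in[s_k] = psi(s_k) * product over n_d in N_k \ n_l of m^{t-1}_{dk}(s_k).
    # rigid_sets[s_l] = indices of the labels in C_k(s_l), i.e. those defining a
    # rigid motion with s_l. zeta_grad_kl = zeta * grad(f_k, f_l) for this edge.
    H = len(prod_in)
    T = prod_in.sum()                                  # equation (20), computed once
    msg = np.empty(H)
    for s_l in range(H):
        T_c = prod_in[rigid_sets[s_l]].sum()           # equation (21), |C_k(s_l)| terms
        msg[s_l] = T_c * np.exp(1.0) + (T - T_c) * np.exp(zeta_grad_kl)   # equation (22)
    return msg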


Thus, the messages can be computed efficiently in O(nH′) time, where H′ ≪ H^2. Another limitation of LBP is that it has memory requirements of O(nH). To overcome this problem, we use a variation of the coarse to fine strategy suggested in [20]. This allows us to solve O(log(H)/log(h)) problems of h labels instead of one problem of H labels, where h ≪ H. Thus, the memory requirements are reduced to O(nh). The time complexity is reduced further from O(nH) to O(log(H)nh/log(h)).

(a) (b)

Figure 21: Coarse to fine loopy belief propagation. (a) An example CRF with 12 sites and 20 labels per site. (b) A set of 5 labels is grouped into one representative label (shown as a square), thereby resulting in a coarser CRF with 4 labels per site. Inference is performed on this CRF using efficient LBP. In this case, the best r = 2 representative labels (shown as red squares) are chosen for each site and expanded. This results in a CRF with 10 labels per site. The process of grouping the labels is continued until we obtain the most likely label for each site.

The basic idea of the coarse to fine strategy is to group together similar labels (differing only slightly in translation) to obtain h representative labels φ_k (see Fig. 21). We now define a CRF where each site n_k has h labels S_k such that

\psi(S_k) = \max_{\varphi_k \in \phi_k} \psi(s_k)  and  \psi(S_k, S_l) = \max_{\varphi_k \in \phi_k,\, \varphi_l \in \phi_l} \psi(s_k, s_l).

Using LBP on this CRF, we obtain the posterior for each representative transformation. We choose the best r representative transformations (unlike [20], which chooses only the best) with the highest beliefs for each site. These transformations are again divided into h representative transformations. Note that these h transformations are less coarse than the ones used previously. We repeat this process until we obtain the most likely transformation for each patch f_k. In our experiments, we use h = 165 and r = 20. LBP was found to converge within 20 iterations at each stage of the coarse to fine strategy.
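A rough sketch of the grouping and expansion steps is given below, assuming the fine labels of a site have already been partitioned into index groups; the partitioning itself and the call to LBP are omitted, and all names are illustrative:

import numpy as np

def coarsen_unary(psi_k, groups):
    # psi(S_k) = max over the fine labels inside each representative group.
    return np.array([psi_k[g].max() for g in groups])

def coarsen_pairwise(psi_kl, groups_k, groups_l):
    # psi(S_k, S_l) = max over the fine labels inside the two groups.
    return np.array([[psi_kl[np.ix_(gk, gl)].max() for gl in groups_l]
                     for gk in groups_k])

def expand_best(groups, beliefs_k, r=20):
    # Keep the fine labels of the r representatives with the highest beliefs;
    # these are re-partitioned into finer groups in the next round.
    best = np.argsort(beliefs_k)[::-1][:r]
    return np.concatenate([groups[i] for i in best])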

Appendix B

Refining the shape of the segments by minimizing the energy of the model (defined in equation (4)) requires a series of graph cuts. Below, we provide the graph constructions required for both the αβ-swap and the α-expansion algorithms (see § 3.3).

Graph construction for αβ-swap

The αβ-swap algorithm swaps the assignments of certain points x which have label α to β and vice versa. In our case, it attempts to relabel points which were incorrectly assigned to segments p_α or p_β. We now present the graph construction required for performing αβ-swap such that it minimizes the energy of the layered representation. For clarity, we only consider the case when there are two neighbouring points x and y. The complete graph can be obtained by concatenating the graphs for all pairs of neighbouring points [9].

Each of the two points x and y is represented by one vertex in the graph (shown in blue in Fig. 22). In addition, there are two special vertices called the source and the sink (shown in brown and green respectively) which represent the labels α and β. Recall that the unary potential for assigning point x to segment p_α is A_α(x; Θ, D). Similarly, the unary potential for assigning x to segment p_β is A_β(x; Θ, D).

The pairwise potentials, given by equation (15), for all four possible assignments of the two points x and y are summarized in Fig. 22. Here, γ′_ik(x, y) = λ2 τ − γ_ik(x, y) is the total cost (contrast plus prior) for assigning points x and y to (different) segments p_α and p_β. The corresponding graph construction, also shown in Fig. 22, is obtained using the method described in [9].

           y ∈ p_α           y ∈ p_β
x ∈ p_α    0                 γ′_αβ(x, y)
x ∈ p_β    γ′_βα(x, y)       0

Figure 22: Graph construction for αβ-swap. The table on the left summarizes the pairwise potentials for two points x and y. The figure on the right shows the corresponding graph construction. Here, C1 and C2 are the (1,2)th and (2,1)th (i.e. the non-zero) elements of the table respectively.
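To make the table concrete, the following sketch simply enumerates the four possible assignments of the two points and returns the cheapest one; the actual algorithm obtains the same minimum via the minimum cut of the graph in Fig. 22, and the variable names are illustrative:

from itertools import product

def best_swap_labelling(A_alpha, A_beta, gamma_ab, gamma_ba):
    # A_alpha[p], A_beta[p]: unary potentials of assigning point p to p_alpha or
    # p_beta; gamma_ab, gamma_ba: the non-zero pairwise entries of the table,
    # i.e. gamma'_{alpha,beta}(x, y) and gamma'_{beta,alpha}(x, y).
    unary = {'alpha': A_alpha, 'beta': A_beta}
    pair = {('alpha', 'beta'): gamma_ab, ('beta', 'alpha'): gamma_ba,
            ('alpha', 'alpha'): 0.0, ('beta', 'beta'): 0.0}
    best_labels, best_energy = None, None
    for hx, hy in product(('alpha', 'beta'), repeat=2):
        energy = unary[hx]['x'] + unary[hy]['y'] + pair[(hx, hy)]
        if best_energy is None or energy < best_energy:
            best_labels, best_energy = (hx, hy), energy
    return best_labels, best_energy

# Example: best_swap_labelling({'x': 2.0, 'y': 0.5}, {'x': 0.3, 'y': 1.0}, 0.7, 0.7)
# returns (('beta', 'beta'), 1.3), since assigning both points to p_beta is cheapest.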

Graph construction for α-expansion

The α-expansion algorithm relabels some points x to α. In other words, it attempts to assign the points belonging to p_α, which were missed by the initial estimate, to the segment p_α. Again, we show the graph construction for only two neighbouring points x and y for clarity.

Similar to the αβ-swap case, the unary potential of assigning x to p_α is A_α(x; Θ, D). Recall that the potential of not assigning a point x to the segment α is given by the constant κ2 (see equation (8)).

The pairwise potentials for all four possible assignments of the two points x and y are summarized in Fig. 23. Note that, in accordance with the energy of the model, the pairwise potentials are summed over all segments containing the points x or y. Using the source and sink vertices to represent the labels α and not-α (denoted by ∼α) respectively, the corresponding graph construction, shown in Fig. 23, can be obtained by the method described in [9].

           y ∈ p_α                          y ∉ p_α
x ∈ p_α    0                                Σ_{i: y ∈ p_i} γ′_αi(x, y)
x ∉ p_α    Σ_{i: x ∈ p_i} γ′_iα(x, y)       0

Figure 23: Graph construction for α-expansion. The table on the left summarizes the pairwise potentials for two points x and y. The figure on the right shows the corresponding graph construction. Again, C1 and C2 are the (1,2)th and (2,1)th elements of the table respectively.
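The only new ingredient relative to the swap case is the summation over segments in the off-diagonal entries of the table, which might be assembled along the following lines (the callback gamma_prime and the argument names are hypothetical):

def summed_pairwise_alpha_i(gamma_prime, segments_containing_y, x, y):
    # The (1,2)th entry of the table in Fig. 23: the cost of keeping x in p_alpha
    # while y stays outside it, summed over all segments p_i that contain y.
    # gamma_prime(i, x, y) is assumed to return gamma'_{alpha,i}(x, y).
    return sum(gamma_prime(i, x, y) for i in segments_containing_y)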

References

[1] A. Agarwal and B. Triggs. Tracking articulated motion using a mixture of autoregressive models. In ECCV, pages III:54–65, 2004.
[2] M. Black and D. Fleet. Probabilistic detection and tracking of motion discontinuities. IJCV, 38:231–245, 2000.
[3] A. Blake, C. Rother, M. Brown, P. Perez, and P. H. S. Torr. Interactive image segmentation using an adaptive GMMRF model. In ECCV, pages I:428–441, 2004.
[4] Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, pages I:105–112, 2001.
[5] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE PAMI, 23(11):1222–1239, 2001.


[6] D. Cremers and S. Soatto. Variational space-time motion segmentation. In ICCV, pages II:886–892, 2003.
[7] P. F. Felzenszwalb and D. P. Huttenlocher. Fast algorithms for large state space HMMs with applications to web usage analysis. In NIPS, 2003.
[8] N. Jojic and B. Frey. Learning flexible sprites in video layers. In CVPR, volume 1, pages 199–206, 2001.
[9] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE PAMI, 26(2):147–159, 2004.
[10] M. P. Kumar, P. H. S. Torr, and A. Zisserman. Learning layered pictorial structures from video. In ICVGIP, pages 148–153, 2004.
[11] M. P. Kumar, P. H. S. Torr, and A. Zisserman. Learning layered motion segmentations of video. In ICCV, volume I, pages 33–40, 2005.
[12] M. P. Kumar, P. H. S. Torr, and A. Zisserman. OBJ CUT. In CVPR, pages 18–25, 2005.
[13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labelling sequence data. In ICML, 2001.
[14] D. R. Magee and R. D. Boyle. Detecting lameness using re-sampling condensation and multi-stream cyclic hidden Markov models. IVC, 20(8):581–594, June 2002.
[15] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[16] D. Ramanan and D. A. Forsyth. Using temporal coherence to build models of animals. In ICCV, pages 338–345, 2003.
[17] H. Sidenbladh and M. J. Black. Learning the statistics of people in images and video. IJCV, 54(1):181–207, September 2003.
[18] P. H. S. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. IEEE PAMI, 23(3):297–304, 2001.
[19] P. H. S. Torr and A. Zisserman. Feature based methods for structure and motion estimation. In W. Triggs, A. Zisserman, and R. Szeliski, editors, International Workshop on Vision Algorithms, pages 278–295, 1999.
[20] G. Vogiatzis, P. H. S. Torr, S. Seitz, and R. Cipolla. Reconstructing relief surfaces. In BMVC, pages 117–126, 2004.
[21] J. Wang and E. Adelson. Representing moving images with layers. IEEE Trans. on IP, 3(5):625–638, 1994.
[22] Y. Weiss and E. Adelson. A unified mixture framework for motion segmentation. In CVPR, pages 321–326, 1996.


[23] C. Williams and M. Titsias. Greedy learning of multiple objects in images using robust statistics and factorial learning. Neural Computation, 16(5):1039–1062, 2004.
[24] J. Wills, S. Agarwal, and S. Belongie. What went where. In CVPR, pages I:37–44, 2003.
[25] J. Winn and A. Blake. Generative affine localisation and tracking. In NIPS, pages 1505–1512, 2004.
