
Outline: Introduction, Method overview, Spatial consistency, Discriminative clustering, Optimization, Results

Multiple cosegmentation

Armand Joulin, Francis Bach and Jean Ponce.

INRIA - École Normale Supérieure

April 25, 2012

Segmentation

Segmentation is a classical and fundamental vision problem.

Problem: Many possible solutions.




Existing solutions

Supervised segmentation:

Need ground truth for every class of object.
Cannot deal with an unknown object.

P. Krähenbühl and V. Koltun (NIPS’11)

Interactive segmentation (scribbles or a bounding box)

Need human interaction for each image.

GrabCut




Cosegmentation

Dividing a set of images by using shared information.

No prior information.

But: common foreground and different background.




Cosegmentation

Previous methods (Rother et al. 2006, Singh and Hochbaum 2009, ...) only work with 2 images and the exact same object.

The first presented method works on multiple images and on an object class.

The second one extends it to multiple images and multiple object classes.




Cosegmentation is also an ill-posed problem.

In natural images, objects are linked to their environment,

so the background is often also common to all the images.





Solutions:

Use user interaction on some images,

Segment the background into meaningful regions.




The goals of our approach

Our method should:

Handle multiple images.
Work on any kind of object or "stuff".
Segment the "background" into meaningful regions.
Use no prior information, while being easy to extend to interactive cosegmentation.




Method goals

Local consistency: maximizing spatial consistency within a particular image.

Figure: Image space.

Separation of the classes: maximizing the separability of the K classes between different images.

Figure: Feature space.

Our framework: unsupervised discriminative clustering.




Problem notations

Each image i is reduced to a subsampled grid of pixels.

For the n-th pixel, we denote by:

x_n its d-dimensional feature vector,
y_n the K-vector such that y_{nk} = 1 if the n-th pixel is in the k-th class and 0 otherwise.
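As a small illustration of this notation, the label vectors y_n can be stacked into an N x K indicator matrix. The sketch below assumes integer class labels as input; the function name and the toy labels are hypothetical and not part of the original slides.

```python
import numpy as np

def one_hot_labels(labels, num_classes):
    """Build the N x K matrix y with y[n, k] = 1 iff pixel n is in class k."""
    labels = np.asarray(labels)
    y = np.zeros((labels.shape[0], num_classes))
    y[np.arange(labels.shape[0]), labels] = 1.0
    return y

# Hypothetical toy example: N = 5 pixels, K = 3 classes.
y = one_hot_labels([0, 2, 1, 2, 0], num_classes=3)
```

Each row of y sums to one, which is the constraint written y 1_K = 1_N later in the talk.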




Spatial consistency

Figure: Image space.

Normalized Cut (Shi and Malik, 2000):

The similarity between two pixels is measured by an RBF kernel on their positions p_n and their colors c_n.

For an image i, our similarity matrix is:

$$W^i_{nm} = \exp\left(-\lambda_p \|p_n - p_m\|_2^2 - \lambda_c \|c_n - c_m\|^2\right).$$
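The similarity matrix for one image can be sketched in a few lines of NumPy. This is a dense toy version under assumed parameter values (lambda_p, lambda_c and the 2x2 pixel grid are made up for illustration); a practical implementation would likely keep W sparse by connecting only nearby pixels.

```python
import numpy as np

def similarity_matrix(positions, colors, lambda_p, lambda_c):
    """W[n, m] = exp(-lambda_p * ||p_n - p_m||^2 - lambda_c * ||c_n - c_m||^2)."""
    p = np.asarray(positions, dtype=float)
    c = np.asarray(colors, dtype=float)
    # Pairwise squared Euclidean distances via broadcasting.
    dp = ((p[:, None, :] - p[None, :, :]) ** 2).sum(axis=-1)
    dc = ((c[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-lambda_p * dp - lambda_c * dc)

# Hypothetical 2x2 pixel grid: top row red, bottom row blue.
positions = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
colors = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
W = similarity_matrix(positions, colors, lambda_p=0.5, lambda_c=1.0)
```

Neighbouring pixels of the same color get a higher similarity than neighbouring pixels of different colors, which is what the spatial term later rewards.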




The Laplacian matrix is L = I − D^{-1/2} W D^{-1/2}, where D is the diagonal matrix composed of the row sums of W.

We thus have the following term in our cost function:

$$E_B(y) = \frac{\mu}{N} \operatorname{tr}(y^\top L y).$$
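The following sketch computes the normalized Laplacian and the spatial term E_B for two labelings of a toy 4-pixel similarity matrix. The matrix W, the value of mu and the two labelings are assumptions chosen only to show that a labeling that respects the similarity structure has lower energy.

```python
import numpy as np

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2}, with D the diagonal matrix of row sums of W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def spatial_energy(y, L, mu):
    """E_B(y) = (mu / N) * tr(y^T L y)."""
    return mu / y.shape[0] * np.trace(y.T @ L @ y)

# Hypothetical similarities: pixels {0, 1} are alike, pixels {2, 3} are alike.
W = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
L = normalized_laplacian(W)

y_consistent = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
y_scattered = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
e_consistent = spatial_energy(y_consistent, L, mu=1.0)
e_scattered = spatial_energy(y_scattered, L, mu=1.0)
```

Here e_consistent is smaller than e_scattered, since the first labeling groups similar pixels together.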




Discriminative clustering

Figure: Feature space.

Discriminative classifier:

Given the labels y, we solve the following problem:

$$E_U(y) = \min_{A \in \mathbb{R}^{K \times d},\; b \in \mathbb{R}^K} \; \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, A\phi(x_n) + b) + \frac{\lambda}{2K} \|A\|_F^2,$$

Notations:

φ is a non-linear mapping of the features,
ℓ is a loss function.



Mapping approximation

Our discriminative clustering framework works with positive definite kernels.

We use the χ2 kernel matrix K:

$$K_{nm} = \exp\left(-\lambda_h \sum_{d=1}^{D} \frac{(x_{nd} - x_{md})^2}{x_{nd} + x_{md}}\right),$$

This is equivalent to applying a mapping φ from the feature space to a high-dimensional Hilbert space F, such that:

$$K_{nm} = \langle \phi(x_n), \phi(x_m) \rangle.$$
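A minimal NumPy sketch of the χ2 kernel matrix is given below, assuming non-negative histogram features. The small constant eps is an added assumption that guards against division by zero when a bin is empty in both histograms; the toy histograms and lambda_h = 1 are made up.

```python
import numpy as np

def chi2_kernel(X, lambda_h, eps=1e-12):
    """K[n, m] = exp(-lambda_h * sum_d (x_nd - x_md)^2 / (x_nd + x_md))."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    summ = X[:, None, :] + X[None, :, :]
    # eps (an assumption) makes bins that are empty in both histograms contribute 0.
    dist = (diff ** 2 / (summ + eps)).sum(axis=-1)
    return np.exp(-lambda_h * dist)

# Hypothetical 3-bin histograms for three pixels.
X = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
K = chi2_kernel(X, lambda_h=1.0)
```

Identical histograms give a kernel value of 1, and the value decays as the histograms differ.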






Loss function

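We choose the soft-max loss function because it is suited to multiclass problems and is related to probabilistic models:

$$\ell(y_n, A\phi(x_n) + b) = -\sum_{k=1}^{K} y_{nk} \log\left(\frac{\exp(a_k^\top \phi(x_n) + b_k)}{\sum_{l=1}^{K} \exp(a_l^\top \phi(x_n) + b_l)}\right),$$

where a_k^T denotes the k-th row of A.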
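The soft-max loss can be sketched as follows, averaged over pixels as in the unary term. The explicit feature matrix Phi stands in for the mapped features φ(x_n) and is an assumption for illustration (the talk works with kernels); all sizes and random values are made up.

```python
import numpy as np

def softmax_loss(y, scores):
    """Mean soft-max loss; y is N x K one-hot, scores[n] = A phi(x_n) + b."""
    # Subtract the row maximum for numerical stability (does not change the loss).
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(y * log_probs).sum(axis=1).mean()

rng = np.random.default_rng(0)
N, d, K = 6, 4, 3
Phi = rng.normal(size=(N, d))            # hypothetical explicit features phi(x_n)
A = np.zeros((K, d))                     # one row a_k per class
b = np.zeros(K)
y = np.eye(K)[rng.integers(0, K, size=N)]
loss = softmax_loss(y, Phi @ A.T + b)
```

With all parameters at zero, every class gets probability 1/K, so the loss equals log K regardless of the labels.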




Find the set of labels y which leads to the best data separation into K classes:

$$\min_{y \in \{0,1\}^{N \times K},\; y 1_K = 1_N} \;\; \min_{A \in \mathbb{R}^{K \times d},\; b \in \mathbb{R}^K} E_U(y, A, b)$$

Problem: assigning the same label to all the pixels gives a perfect separation.




Cluster size balancing

Two solutions:

Add linear constraints on the number of elements per class.
Encourage the proportion of points per class to be uniform.

We choose the second: it requires no additional parameters and has a probabilistic interpretation.

$$H(y) = -\sum_{i \in I} \sum_{k=1}^{K} \left(\frac{1}{N} \sum_{n \in N_i} y_{nk}\right) \log\left(\frac{1}{N} \sum_{n \in N_i} y_{nk}\right),$$

where i is an image and N_i denotes the set of pixels in image i (and, by abuse of notation, its size).

Note: In a weakly supervised setting (e.g., interactive segmentation), this term can be modified to take prior knowledge into account.
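The class-balancing term can be sketched as below. The array image_index (mapping each pixel to its image) and the toy labelings are assumptions for illustration, and eps is added only to avoid log(0) for classes that are absent from an image.

```python
import numpy as np

def class_balance_entropy(y, image_index, eps=1e-12):
    """H(y) = -sum_i sum_k p_ik * log(p_ik), with p_ik = (1/N) * sum_{n in N_i} y_nk."""
    y = np.asarray(y, dtype=float)
    image_index = np.asarray(image_index)
    N = y.shape[0]
    H = 0.0
    for i in np.unique(image_index):
        p = y[image_index == i].sum(axis=0) / N
        H -= np.sum(p * np.log(p + eps))   # absent classes contribute 0
    return H

# Hypothetical setup: 2 images with 4 pixels each, K = 2 classes.
image_index = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_balanced = np.eye(2)[[0, 0, 1, 1, 0, 0, 1, 1]]
y_degenerate = np.eye(2)[[0, 0, 0, 0, 0, 0, 0, 0]]
H_balanced = class_balance_entropy(y_balanced, image_index)
H_degenerate = class_balance_entropy(y_degenerate, image_index)
```

The balanced labeling has the larger entropy, so subtracting H(y) from the cost penalizes the degenerate solution in which all pixels share one label.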




Overall problem

Combining the unary and binary terms with the class balancing term, we obtain the following problem:

$$\min_{y \in \{0,1\}^{N \times K},\; y 1_K = 1_N} \left[\min_{A \in \mathbb{R}^{K \times d},\; b \in \mathbb{R}^K} E_U(y, A, b)\right] + E_B(y) - H(y).$$




Probabilistic interpretation

We introduce t_n in {0, 1}^{|I|}, indicating which image pixel n belongs to, and z_n in {1, ..., M}, giving some observable information for each pixel n.

The label y is a latent variable of the observable information z given x (x → y → z ← t), inducing an “explaining away” phenomenon: the label y_n and the variable t_n compete to explain the observable information z_n.




More precisely, we suppose a bilinear model:

$$P(z_{nm} = 1 \mid t_{ni} = 1, y_{nk} = 1) = y_{nk} \, G^{ik}_m \, t_{ni},$$

where $\sum_{m=1}^{N} G^{ik}_m = 1$, and an exponential family model for Y = (y_1, ..., y_N) given X = (x_1, ..., x_N) with unary parameters (A, b) and binary parameters L.




Our cost function is the mean-field variational approximation of the following (regularized) negative conditional log-likelihood of Z = (z_1, ..., z_N) given X and T = (t_1, ..., t_N) for our model:

$$\min_{A \in \mathbb{R}^{K \times d},\; b \in \mathbb{R}^K,\; G \in \mathbb{R}^{N \times K|I|},\; G^\top 1_N = 1,\; G \ge 0} \; -\frac{1}{N} \sum_{n=1}^{N} \log\big(p(z_n \mid x_n, t_n)\big) + \frac{\lambda}{2K} \|A\|_F^2.$$

Z can encode “must-link” and “must-not-link” constraints between pixels (e.g., superpixels).




EM procedure

$$\min_{y \in \{0,1\}^{N \times K},\; y 1_K = 1_N} \left[\min_{A \in \mathbb{R}^{K \times d},\; b \in \mathbb{R}^K} E_U(y, A, b)\right] + E_B(y) - H(y).$$

This cost function is not jointly convex in y and (A, b).

However, it is convex in y and in (A, b) separately.

We alternately optimize over each variable while fixing the other:

We use L-BFGS for (A, b).
We use projected gradient descent for y.
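The y-update can be illustrated with a small projected gradient sketch. It makes several assumptions that are not stated in the slides: y is relaxed so that each row lies on the probability simplex, only the spatial term E_B is minimized as a stand-in for the full cost, and the projection uses the standard sort-based simplex projection. The step size uses the fact that the eigenvalues of the normalized Laplacian are at most 2.

```python
import numpy as np

def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto {w : w >= 0, sum(w) = 1}."""
    n, k = V.shape
    U = -np.sort(-V, axis=1)                          # rows sorted in decreasing order
    css = np.cumsum(U, axis=1) - 1.0
    cond = U - css / np.arange(1, k + 1) > 0
    rho = k - 1 - np.argmax(cond[:, ::-1], axis=1)    # last index where cond holds
    theta = css[np.arange(n), rho] / (rho + 1)
    return np.maximum(V - theta[:, None], 0.0)

def spatial_energy(y, L, mu):
    """E_B(y) = (mu / N) * tr(y^T L y), used here as a stand-in objective."""
    return mu / y.shape[0] * np.trace(y.T @ L @ y)

def projected_gradient(L, y0, mu, n_steps=50):
    """Gradient steps on E_B followed by projection of each row onto the simplex."""
    N = y0.shape[0]
    step = N / (4.0 * mu)          # 1 / (Lipschitz bound of the gradient)
    y = y0.copy()
    for _ in range(n_steps):
        grad = 2.0 * mu / N * (L @ y)
        y = project_rows_to_simplex(y - step * grad)
    return y

# Hypothetical toy problem: random symmetric similarities, N = 8 pixels, K = 3.
rng = np.random.default_rng(0)
N, K, mu = 8, 3, 1.0
R = rng.random((N, N))
W = (R + R.T) / 2.0
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
L = np.eye(N) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
y0 = rng.dirichlet(np.ones(K), size=N)
y = projected_gradient(L, y0, mu)
```

In the full method, the gradient would also include the unary and entropy terms, and this step would alternate with an L-BFGS update of (A, b) for fixed y.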




The initialization

Since our problem is not convex, a good initialization is crucial.

We propose a quadratic convex approximation related to Joulin et al. (CVPR’10).

The quadratic approximation may lead to poor solutions, so we also use random initializations.




Initialization: Quadratic approximation

The second-order Taylor expansion of our cost function is:

$$J(y) = \frac{K}{2} \left[\operatorname{tr}(y y^\top C) + \frac{2\mu}{NK} \operatorname{tr}(y y^\top L) - \frac{1}{N} \operatorname{tr}(y y^\top \Pi_I)\right],$$

where $C = \frac{1}{N} \Pi_N \left(I - \Phi (N\lambda I_K + \Phi^\top \Pi_N \Phi)^{-1} \Phi^\top\right) \Pi_N$ is related to the reweighted ridge regression classifier (Joulin et al., CVPR’10).

This is not convex because of the last term, which can be replaced by the following linear constraints:

$$\sum_{n \in N_i} y_{nk} \le 0.9\, N_i; \qquad \sum_{j \in I \setminus i} \sum_{n \in N_j} y_{nk} \ge 0.1\, (N - N_i).$$

We then obtain a formulation similar to Joulin et al. (CVPR’10).




Results

Binary segmentation (foreground/background) on MSRC:

High variability in foreground and background.
Around 30 images per class.
We use SIFT features.

Multiclass cosegmentation on iCoseg:

Low variability in the images (same illumination, ...).
Around 10 images per class.
We use color histograms.

Some extensions:

GrabCut.
Weakly supervised problems.
Video key frames.




Binary cosegmentation

Class     Ours   Kim et al. (ICCV’11)   Joulin et al. (CVPR’10)
Bike      43.3   29.9                   42.3
Bird      47.7   29.9                   33.2
Car       59.7   37.1                   59.0
Cat       31.9   24.4                   30.1
Chair     39.6   28.7                   37.6
Cow       52.7   33.5                   45.0
Dog       41.8   33.0                   41.3
Face      70.0   33.2                   66.2
Flower    51.9   40.2                   50.9
House     51.0   32.2                   50.5
Plane     21.6   25.1                   21.7
Sheep     66.3   60.8                   60.4
Sign      58.9   43.2                   55.2
Tree      67.0   61.2                   60.0
Average   50.2   36.6                   46.7




Multiclass cosegmentation

Class             K   Ours   Joulin et al. (CVPR’10)   Kim et al. (ICCV’11)
Baseball player   5   62.2   53.5                      51.1
Brown bear        3   75.6   78.5                      40.4
Elephant          4   65.5   51.2                      43.5
Ferrari           4   65.2   63.2                      60.5
Football player   5   51.1   38.8                      38.3
Helicopter        3   43.3   67.8                      7.3
Kite Panda        2   57.8   58.0                      66.2
Monk              2   77.6   76.9                      71.3
Panda             3   55.9   49.1                      39.4
Skating           2   64.0   47.2                      51.1
Stonehenge        3   86.3   85.4                      64.6
Plane             3   45.8   39.2                      25.2
Face              3   70.5   56.4                      33.2
Average               64.8   58.1                      48.7




Extensions

GrabCut:

Weakly supervised learning with image tags ({plane, sheep, sky, grass}).

Video shot segmentation:




Limitations

Number of classes:

Each class must be in each image (because of the entropy term).

Running time: about half an hour to one hour (MATLAB implementation).





Thank you.

