deepak pathak philipp kr ahenb¨ uhl trevor darrell¨ abstractdeepak pathak philipp kr ahenb¨ uhl...

Constrained Convolutional Neural Networks forWeakly Supervised Segmentation

Deepak Pathak Philipp Krahenbuhl Trevor DarrellUniversity of California, Berkeley

{pathak,philkr,trevor}@cs.berkeley.edu

Abstract

We present an approach to learn a dense pixel-wise la-beling from image-level tags. Each image-level tag imposesconstraints on the output labeling of a Convolutional Neu-ral Network classifier. We propose a novel loss function tooptimize for any set of linear constraints on the output space(i.e. predicted label distribution) of a Convolutional NeuralNetwork. Our loss formulation is easy to optimize and canbe incorporated directly into standard stochastic gradientdescent optimization. The key idea is to phrase the train-ing objective as a biconvex optimization for linear models,which we then relax to nonlinear deep networks. Extensiveexperiments demonstrate the generality of our new learn-ing framework. The constrained loss yields state-of-the-artresults on weakly supervised semantic image segmentation.We further demonstrate that adding slightly more supervi-sion can greatly improve the performance of the learningalgorithm.

1. Introduction

In recent years, standard computer vision tasks, suchas recognition or classification, have made tremendousprogress. This is primarily due to the widespread adop-tion of Convolutional Neural Networks (CNNs) [11,18,19].Existing models excel by their capacity to take advantageof massive amounts of fully supervised training data [28].This reliance on full supervision is a major limitation onscalability with respect to the number of classes or tasks.For structured prediction problems, such as semantic seg-mentation, fully supervised, i.e. pixel-level, labels are bothexpensive and time consuming to obtain. Summarizationof the semantic-labels in terms of weak supervision, e.g.image-level tags or bounding box annotations, is often lesscostly. Leveraging the full potential of these weak annota-tions is challenging, and existing approaches are susceptibleto diverging into bad local optima from which recovery isdifficult [6, 15, 25].

NetworkOutput

LatentDistribution

ConstrainedRegion

Predicted MaskClass Heat Maps

Input Image

CNN

Figure 1: We train convolutional neural networks from a setof linear constraints on the output variables. The networkoutput is encouraged to follow a latent probability distribu-tion, which lies in the constraint manifold. The resultingloss is easy to optimize and can incorporate arbitrary linearconstraints.

In this paper, we present a framework to incorporateweak supervision into the learning procedure through a se-ries of linear constraints. In general, it is easier to expresssimple constraints on the output space than to craft regu-larizers or adhoc training procedures to bias learning. Insemantic segmentation, such constraints can describe theexistence and expected distribution of labels from image-level tags. For example, given a car is present in an image,a certain fraction of pixels should be labeled as car.

We propose a novel loss function to optimize CNNs with

1

arX

iv:1

506.

0364

8v1

[cs

.CV

] 1

1 Ju

n 20

15

arbitrary linear constraints on the structured output space ofpixel labels. Direct evaluation of a constrained net is madedifficult by the non-convex nature of deep nets. Our keyinsight is to model a distribution over latent “ground truth”labels, while the output of the deep net follows this latentdistribution as closely as possible. This allows us to en-force the constraints on the latent distribution instead of thenetwork output, which greatly simplifies the resulting op-timization problem. The resulting objective is a biconvexproblem for linear models. For deep nonlinear models, it re-sults in an alternating convex and gradient based optimiza-tion which can be naturally integrated into standard stochas-tic gradient descent (SGD). As illustrated in Figure 1, aftereach iteration the output is pulled toward the closest pointon a constrained manifold that encourages better semanticsegmentation output by the convnet. This closest point (i.e.the latent distribution) is inferred from the network output.Our optimization defines a Constrained CNN (CCNN) thatis jointly guided by weak annotations and trained end-to-end.

Weakly supervised learning seeks to capture the signalthat is common to all the positives but absent from all thenegatives. This is challenging due to nuisance variablessuch as pose, occlusion, and intra-class variation. Pin-heiro et al. [26] and Pathak et al. [25] extend the multiple-instance learning framework to semantic semantic segmen-tation using convolutional networks. Their methods iter-atively boost well-predicted classifier outputs while sup-pressing erroneous segmentations according to image-leveltags. Both algorithms are very sensitive to the initialization,and rely on carefully pretrained classifiers for all layers inthe convolutional network. Our constrained optimization ismuch less sensitive and recovers a good solution from ran-dom initializations of the classification layer.

Recently, Papandreou et al. [23] showed promising re-sults by adding an adaptive bias into the multi-instancelearning framework. Their algorithm boosts classes knownto be present in the image and suppresses all others. Weshow that this simple heuristic can be viewed as an specialcase of constrained optimization, where the adaptive biascontrols the constraint satisfaction. However the constraintsthat can be modeled by this adaptive bias are quite limitedand cannot leverage the full power of weak labels. In thispaper we show how to apply more general linear constraintswhich lead to better segmentation performance.

Constrained optimization problems have long been ap-proximated by artificial neural networks [36]. These mod-els are usually non-parametric, and solve just a single in-stance of a linear program, while we use constraints to learnparameters for modern large scale deep networks. Platt etal. [27] shows how to optimize equality constraints on theoutput of a neural network. However the resulting objec-tive is highly non-convex, which makes a direct minimiza-

tion hard. In this paper, we show how to decompose a con-strained objective into a biconvex problem for linear modelsand an alternating convex and gradient-based optimizationfor nonlinear deep networks.

We evaluate our optimization on the problem of multi-class semantic image segmentation with varying levels ofweak supervision. Our approach achieves state of the artperformance on Pascal VOC 2012 compared to other weaklearning approaches. It does not require pixel-level labelsfor any objects during training time, but infers them directlyfrom the image-level tags. We show that our constrainedoptimization framework can incorporate additional formsof weak supervision, such as a rough estimate of the sizeof an object. The proposed technique is general, and canincorporate many forms of weak supervision.

2. Related Work

Learning with weak labels, often termed as Multiple In-stance Learning [8], has been addressed using various ap-proaches. Most frequently, it is formulated as a maximiz-ing margin problem in a binary classification task. It hasalso been attempted using boosting [1,37], Noisy-OR mod-els [14] etc. In the max-margin classification, the problemis non-convex and solved as an alternating minimization ofa biconvex objective [2]. The task is to predict classifica-tion labels of individual instances contained in the bags,where only bag-level labels are known during training. MI-SVM [2] and LSVM [10, 35] are some of the classic meth-ods in this paradigm, which are extension of SVMs to MILscenario. This setting naturally applies to weakly-labeleddetection i.e. using image-level labels [31, 34] and personidentification [24]. However, most of these approaches aresensitive to initial weights of the detector and possible loca-tion hypotheses [6, 24]. There are several heuristics whichhave been proposed to handle these issues [22, 30, 31]. Weconsider this setting for dense pixel segmentation, and useconstraints while starting from random classifier weights.

Traditionally, the problem of weak segmentation orscene parsing with image level labels has mostly been ad-dressed using graphical models, and parametric structuredmodels [32, 33, 38]. Most of these works exploit low-levelinformation of image to connect regions similar in appear-ance. [32]. Chen et al. [5] exploit top-down segmentationpriors based on visual subcategories for object discovery.Recently, there has been a series of work in tackling thisproblem using convolutional neural networks [23, 25, 26]which we discussed in Section 1.

3. Preliminaries

We define a pixelwise labeling for an image I as a set ofrandom variables X = {x0, . . . , xn} where n is the num-ber of pixles in an image. xi ∈ L takes one of m dis-

tag : “Train”

B"A"

| A | | B | suchthat

FCN

Figure 2: Overview of our weak learning pipeline. Inputimage is passed through Fully convolutional neural network(FCN) which produces an output labeling. The model istrained such that the output labeling follows a set of simplelinear constraints imposed by image level tags.

crete labels L = {1, . . . ,m}. CNN models a probabilitydistribution Q(X|θ, I) over those random variables, whereθ are the parameters of the network. The distribution iscommonly modeled as a product of independent marginalsQ(X|θ, I) =

∏i qi(xi|θ, I), where each of the marginal

represents a softmax probability:

qi(xi|θ, I) =1

Ziexp

(fi(xi; θ, I)

)(1)

where Zi =∑l∈L exp

(fi(l; θ, I)

)is the partition func-

tion of a pixel i. The function fi represents the real-valuedscore of the neural network. A higher score corresponds toa higher likelihood.

Standard learning algorithms aim to maximize the like-lihood of the observed training data under the model. Thisrequires full knowledge of the ground truth labeling, whichis not available in the weakly supervised setting. In the nextsection, we show how to optimize the parameters of a CNNusing some high-level constraints on the distribution of out-put labeling. An overview of this is given in Figure 2. InSection 5, we then present a few examples of useful con-straints for weak labeling.

4. Constrained Optimization

For notational convenience, let ~QI be the vectorizedform of network output Q(X|θ, I). The constrained CNNoptimization can be framed as:

find θ

subject to∑

I

AI ~QI ≥ ~bI , (2)

where AI ∈ Rk×nm and ~bI ∈ Rk enforce k individual lin-ear constraints on the output distribution of the convnet onimage I . For notational simplicity, we derive our inferencealgorithm for a single image with A = AI , ~b = ~bI and~Q = ~QI . The entire derivation generalizes to an arbitrary

number of images and constraints. Constraints include forexample lower and upper bounds on the expected numberof foreground and background pixel labels in a scene. Formore examples, see Section 5. In the first part of this sec-tion, we assume that all constraints are satisfiable, meaningthere always exists a parameter setting θ such that A~Q ≥ ~b.In Section 4.3, we lift this assumption by adding slack vari-ables to each of the constraints.

While objective (2) is convex in the network output Q, itis generally not convex with respect to the network param-eters θ. For any non-linear function Q, the matrix A can bechosen such that the constraint is an upper or lower boundto Q, one of which is non-convex. This makes a direct opti-mization hard. As a matter of fact, not even log-linear mod-els, such as logistic regression, can be directly optimizedunder this objective. Alternatively one could optimize theLagrangian dual of (2). However this is computationallyvery expensive, as we would need to optimize an entire con-volutional neural network in an inner loop of a dual descentalgorithm.

In order to efficiently optimize objective (2), we intro-duce a latent probability distribution P (X) over the seman-tic labels X . We constrain P (X) to lie in the feasibilityregion of the constraint objective, while removing the con-straints on the network output Q. We then encourage P andQ to model the same probability distribution, by minimiz-ing their respective KL-divergence. The resulting objectiveis defined as

minimizeθ,P

−HP − EX∼P [logQ(X|θ)]

subject to A~P ≥ ~b,∑

X

P (X) = 1, (3)

where HP = −∑X P (X) logP (X) is the entropy of la-tent distribution and ~P is the vectorized version of P (X). Ifthe constraints in (2) are satisfiable then objectives (2) and(3) are equivalent with a minimum at P = Q. This equalityimplies that P (X) can be modeled as a product of indepen-dent marginals P (X) =

∏i pi(xi) without loss of general-

ity, with a minimum at pi(xi) = qi(xi|θ). A detailed proofis provided in the supplementary material.

The new objective is much easier to optimize, as it de-couples the constraints from the network output. For fixednetwork parameters θ, the objective is convex in P . For afixed latent distribution P , the objective reduces to a stan-dard cross entropy loss which is optimized using stochasticgradient descent.

In the remainder of this section, we derive an algorithmto optimize objective (3) using block coordinate descent.Section 4.1 solves the constrained optimization for P whilekeeping the network parameters θ fixed. Section 4.2 thenincorporates this optimization into standard stochastic gra-dient descent, keeping P fixed while optimizing for θ. Each

step is guaranteed to decrease the overall energy of objec-tive (3), converging to a good local optimum. At the end ofthis section, we show how to handle constraints that are notdirectly satisfiable by adding a slack variable to the loss.

4.1. Latent distribution optimization

We first show how to optimize objective (3) with respectto P while keeping the convnet output fixed. The objec-tive function is convex with linear constraints, which im-plies that Slaters condition and hence strong duality holdsas long as the constraints are satisfiable [3]. We can there-fore optimize objective (3) by maximizing its dual function

L(λ) = λ>~b−n∑

i=1

log∑

l∈Lexp

(fi(l; θ) +A>i;lλ

), (4)

where λ ≥ 0 are the dual variables pertaining to the in-equality constraints and fi(l; θ) is the score of the convnetclassifier for pixel i and label l. Ai;l is the column of Acorresponding to pi(l). A detailed derivation of this dualfunction is provided in the supplementary material.

The dual function is concave and can be optimized glob-ally using projected gradient ascent [3]. The gradient of thedual function is given by ∂

∂λL(λ) = ~b−A~P , which resultsinto

pi(xi) =1

Ziexp

(fi(xi; θ) +A>i;xi

λ),

where Zi =∑l exp

(fi(l; θ) + A>i;lλ

)is the local partition

function ensuring that the distribution pi(xi) sums to onefor ∀xi ∈ L. Intuitively, the projected gradient descent al-gorithm increases the dual variables for all constraints thatare not satisfied. Those dual variables in turn adjust thedistribution pi to fulfill the constraints. The projected dualgradient descent algorithm usually converges within fewerthan 50 iterations, making the optimization highly efficient.

Next, we show how to incorporate this estimate forP (X) into the standard stochastic gradient descent algo-rithm.

4.2. SGD

For a fixed latent distribution P , objective (3) reduces tothe standard cross entropy loss

L(θ) = −∑

i

∑

xi

pi(xi) log qi(xi|θ). (5)

The gradient of this loss function is given by∂

∂ ~fi(xi)L(θ) = ~qi(xi|θ)− ~pi(xi). For linear models,

the loss function (5) is convex and can be optimized usingany gradient based optimization. For multi-layer deepnetworks, we optimize it using back-propagation andstochastic gradient descent (SGD) with momentum, asimplemented in Caffe [16].

Q(0)

Q(1)

P (1)

P (0)

ConstrainedRegion

SGD

Figure 3: Illustration of our alternating convex optimizationand gradient descent optimization for t = 0. At each iter-ation t we compute a latent probability distribution P (t) asthe closest point in the constrained region. We then updatethe convnet parameters to follow P (t) as closely at possibleusing Stochastic Gradient Descent (SGD), which takes theconvnet output from Q(t) to Q(t+1).

Theoretically, we would need to keep the latent distribu-tion P fixed for a few iterations of SGD until the objectivevalue decreases. Otherwise we are not strictly guaranteedthat the overall objective (3) decreases. However, in prac-tice we found inferring a new latent distribution at everystep of SGD does not hurt the performance and leads to afaster convergence.

In summary, we optimize objective (3) using SGD ateach iteration we infer a latent distribution P which definesboth our loss and loss gradient. Figure 3 shows an overviewof the training procedure. For more details, see Section 6.

Up to this point, we assumed that all the constraints aresimultaneously satisfiable. While this might hold for care-fully chosen constraints, our optimization should be robustto arbitrary linear constraints. In the next section, we relaxthis assumption by adding a slack variable to the constraintset and show that this slack variable can then be easily inte-grated into the optimization.

4.3. Constraints with slack variable

We relax objective (3) by adding a slack ξ ∈ Rk to thelinear constraints. The slack is regularized using a hingeloss with weight β ∈ Rk. It results into the following opti-mization:

minimizeθ,P,ξ

−HP − EX∼P [logQ(x|θ)] + βT ξ

subject to A~P ≥ ~b−ξ,∑

X

P (X)=1, ξ ≥ 0. (6)

This objective is now guaranteed to be satisfiable for anyassignment to P and any linear constraint. Similar to Equa-tion 4, this is optimized using projected dual coordinate as-cent. The dual objective function is exactly same as Equa-tion 4. The weighting term of the hinge loss β merely actsas an upper bound on the dual variable i.e. 0 ≤ λ ≤ β. Adetailed derivation of this loss is given in the supplementarymaterial.

This slack relaxed loss allows the optimization to ignorecertain constraints if they become too hard to enforce. Italso trades off between various competing constraints.

5. Constraints for Weak Semantic Segmenta-tion

We now describe all constraints we use for our weaklysupervised semantic segmentation. For each training imageI , we are given a set of image-level labels LI . Our con-straints affect different parts of the output space dependingon the image-level labels. All the constraints are comple-mentary, and each constraint exploits the set of image-levellabels differently.

Suppression constraint The most natural constraint is tosuppress any label l that does not appear in the image.

n∑

i=1

pi(l) ≤ 0 ∀l/∈LI.

This constraint alone is not sufficient, as a solution involv-ing all background labels satisfies it perfectly. We can easilyaddress this by adding a lower-bound constraint for labelspresent in an image.

Foreground constraint

al ≤n∑

i=1

pi(l) ∀l∈LI.

This foreground constraint is very similar to the commonlyused multiple instance learning (MIL) paradigm, where atleast one pixel is constrained to be positive [2, 15, 25, 26].Unlike MIL, our foreground constraint can encourage mul-tiple pixels to take a specific foreground label by increasingal. In practice, we set al = 0.05n with a slack of β = 2,where n is the number of outputs of our network.

While this foreground constraint encourages some of thepixels to take a specific label, it is often not strong enoughto encourage all pixels within an object to take the cor-rect label. We could increase al to encourage more fore-ground labels, but this would over-emphasize small objects.A more natural solution is to constrain the total number offoreground labels in the output, which is equivalent to con-straining the overall area of the background label.

Background constraint

a0 ≤n∑

i=1

pi(0) ≤ b0.

Here l = 0 is assumed to be the background label. Weapply both a lower and upper bound on the backgroundlabel. This indirectly controls the minimum and maxi-mum combined area of all foreground labels. We founda0 = 0.3n and b0 = 0.7n to work well in practice.

The above constraints are all complementary and ensurethat the final labeling follows the image-level labels LI asclosely as possible. However, it is important to note that ourmethod is quite robust to wide range of constraint boundsdue to the presence of slack variable.

If we have access to the rough size of an object, we canalso exploit this information during training. In our experi-ments, we show that substantial gains can be made by sim-ply knowing if a certain object class covers more or lessthan 10% of the image.

Size constraint We exploit the size constraint two fold:We boost all classes larger than 10% of the image by settingal = 0.1n. We also put an upper bound constraint on theclasses that are guaranteed to be small

n∑

i=1

pi(l) ≤ bl.

In practice, a threshold bl < 0.01n works slightly betterthan a tight threshold.

We will evaluate the individual constraints in Section 7.The EM-Adapt algorithm of Papandreou et al. [23] can

be seen as a special case of a constrained optimization prob-lem with just suppression and foreground constraints. Theadaptive bias parameters then correspond to the Lagrangiandual variables λ of our constrained optimization. How-ever in the original algorithm of Papandreou et al., the con-straints are not strictly enforced especially when some ofthem conflict. We show that a principled optimization ofthose constraints leads to a substantial increase in perfor-mance.

6. Implementation DetailsIn this section, we discuss the overall pipeline of our al-

gorithm applied for semantic image segmentation. We con-sider the weakly supervised setting i.e. only image-levellabels are present during training. At test time, the task is topredict semantic segmentation mask for a given image.

Learning The convolutional neural network architectureused in our experiments is derived from VGG 16-layer net-work [29]. It was pre-trained on Imagenet 1K class dataset,

Method bgnd

aero

bike

bird

boat

bottl

e

bus

car

cat

chai

r

cow

tabl

e

dog

hors

e

mob

ile

pers

on

plan

t

shee

p

sofa

trai

n

tv mIoU

MIL-FCN [25] - - - - - - - - - - - - - - - - - - - - - 24.9MIL-Base [26] 37.0 10.4 12.4 10.8 05.3 05.7 25.2 21.1 25.15 04.8 21.5 08.6 29.1 25.1 23.6 25.5 12.0 28.4 08.9 22.0 11.6 17.8MIL-Base w/ ILP [26] 73.2 25.4 18.2 22.7 21.5 28.6 39.5 44.7 46.6 11.9 40.4 11.8 45.6 40.1 35.5 35.2 20.8 41.7 17.0 34.7 30.4 32.6EM-Adapt w/o CRF [23] 63.4 27.1 16.4 25.6 20.5 27.7 46.8 42.1 42.3 13.6 31.5 23.9 38.1 31.9 39.7 31.1 21.7 32.9 22.3 38.4 29.1 31.7EM-Adapt [23] 65.0 27.9 17.0 26.5 21.2 29.2 48.0 44.8 43.8 15.0 33.8 25.0 39.9 34.0 41.3 31.8 22.9 35.2 23.2 39.3 30.4 33.1

CCNN w/o CRF 64.2 23.2 16.9 22.2 18.9 34.5 46.8 44.9 44.4 14.9 33.2 22.4 40.5 31.4 42.2 38.7 28.6 31.3 21.3 37.8 34.1 33.0CCNN 65.9 23.8 17.6 22.8 19.4 36.2 47.3 46.9 47.0 16.3 36.1 22.2 43.2 33.7 44.9 39.8 29.9 33.4 22.2 38.8 36.3 34.5

Table 1: Comparison of weakly supervised semantic segmentation methods on PASCAL VOC 2012 validation set.

and achieved winning performance on ILSVRC14. We castthe fully connected layers into convolutions in a similarfashion as suggested in [20], and the last fc8 layer with1K outputs is replaced by that containing 21 outputs cor-responding to 20 object classes in Pascal VOC and back-ground class. The overall network stride of this fully con-volutional network is 32s. However, we observe that theslightly modified architecture with the denser 8s networkstride proposed in [4] gives better results in the weaklysupervised training. Unlike [25, 26], we do not learn anyweights of the last layer from Imagenet. Apart from the ini-tial pretraining, all parameters are finetuned only on PascalVOC. We initialize the weights of the last layer with randomGaussian noise.

The FCN takes in arbitrarily sized images and producescoarse heatmaps corresponding to each class in the dataset.We apply the convex constrained optimization on thesecoarse heatmaps, reducing the computational cost. The con-strained optimization for single image takes less than 30mson a CPU single core, and could be accelerated using aGPU. The network is trained using SGD with momentum.We follow [20] and train our models with a batch size of 1,momentum of 0.99 and an initial learning rate of 1e − 6.We train for 60000 iterations, which corresponds to roughly5 epochs. The learning rate is decreased by a factor of 0.1every 20000 iterations. We found this setup to outperform abatch size of 20 with momentum of 0.9 [4].

Inference At inference time, we optionally apply a fullyconnected conditional random field model [17] to refine thefinal segmentation. The fully connected CRF has 5 pa-rameters controlling the weight and extent of the pairwiseterm. Training those parameters requires fully annotatedtraining data. We used 100 images of the training set tocross-validate the CRF parameters using grid-search. How-ever, those numbers should be considered with caution asthey use fully supervised training data.

7. ExperimentsWe analyze and compare the performance of our con-

strained optimization for varying levels of supervision:

image-level tags and additional supervision such as objectsize information. The objective is to learn models to predictdense multi-class semantic segmentation i.e. pixel-wise la-beling for any new image. We use the provided supervi-sion with few simple spatial constraints on the output, anddon’t use any additional low-level graph-cut based meth-ods in training. The goal is to demonstrate the strength oftraining with constrained outputs, and how it helps with in-creasing levels of supervision.

7.1. Dataset

We evaluate Constrained CNNs (CCNNs) for the task ofsemantic image segmentation on PASCAL VOC dataset [9].The dataset contains pixel level labels for 20 object classesand a separate background class. For a fair comparison toprior work, we use the similar setup to train all models.Training is performed on the union of VOC 2012 train setand the larger dataset collected by Hariharan et al. [12] sum-ming upto a total of 10,582 training images. The VOC12Validation set containing a total of 1449 images is keptheld-out during ablation studies. The VGG network archi-tecture used in our algorithm was pre-trained on ILSVRCdataset [28] for classification task of 1K classes [29].

Results are reported in the form of standard intersectionover union (IoU) metric, also known as Jaccard Index. It isdefined per class as the percentage of pixels predicted cor-rectly out of total pixels labeled or classified as that class.Ablation studies and comparison with baseline methods forboth the weak settings are presented in the following sub-sections.

7.2. Training from image-level tags

We start by training our model using just image-leveltags. We obtain these tags from the presence of a classin the pixel-wise ground truth segmentation masks. Sincesome of the baseline methods report results on the VOC12validation set, we present the performance on both valida-tion and test set. Some methods boost their performanceby using a Dense CRF model [17] to post process the finaloutput labeling. To allow for a fair comparison, we presentresults both with and without a Dense CRF.

Method aero

bike

bird

boat

bottl

e

bus

car

cat

chai

r

cow

tabl

e

dog

hors

e

mob

ile

pers

on

plan

t

shee

p

sofa

trai

n

tv mIoU

Fully Supervised:SDS [13] 63.3 25.7 63.0 39.8 59.2 70.9 61.4 54.9 16.8 45.0 48.2 50.5 51.0 57.7 63.3 31.8 58.7 31.2 55.7 48.5 51.6FCN-8s [20] 76.8 34.2 68.9 49.4 60.3 75.3 74.7 77.6 21.4 62.5 46.8 71.8 63.9 76.5 73.9 45.2 72.4 37.4 70.9 55.1 62.2TTIC Zoomout [21] 81.9 35.1 78.2 57.4 56.5 80.5 74.0 79.8 22.4 69.6 53.7 74.0 76.0 76.6 68.8 44.3 70.2 40.2 68.9 55.3 64.4DeepLab-CRF [4] 78.4 33.1 78.2 55.6 65.3 81.3 75.5 78.6 25.3 69.2 52.7 75.2 69.0 79.1 77.6 54.7 78.3 45.1 73.3 56.2 66.4

Weakly Supervised:CCNN w/ tags 21.3 17.7 22.8 17.9 38.3 51.3 43.9 51.4 15.6 38.4 17.4 46.5 38.6 53.3 40.6 34.3 36.8 20.1 32.9 38.0 35.5CCNN w/ size 42.3 24.5 56.0 30.6 39.0 58.8 52.7 54.8 14.6 48.4 34.2 52.7 46.9 61.1 44.8 37.4 48.8 30.6 47.7 41.7 45.1

Table 2: Results on PASCAL VOC 2012 test. We compare our results to the fully supervised state of the art.

Table 1 compares all contemporary weak segmentationmethods. Our proposed method, CCNN, outperforms allprior methods for weakly labeled semantic segmentationby a significant margin. MIL-FCN [25] is an extension oflearning based on maximum scoring instance based MILto multi-class segmentation. The algorithm proposed byPinheiro et al. [26] introduces a soft version of MIL. It istrained on 0.7 million images for 21 classes taken fromILSVRC13, which is 70 times more data than all other ap-proaches used. They achieve boost in performance by re-ranking the pixel probabilities with the image-level priors(ILP) i.e. the probability of class to be present in the image.This suppresses the negative classes, and smooths out thepredicted segmentation mask. For the EM-Adapt [23] algo-rithm, we reproduced the models using their publicly avail-able implementation1. Note that unconstrained MIL basedapproach require the final 21-class classifier to be well ini-tialized for reasonable performance. While our constraintoptimization can handle arbitrary random initializations.

We also directly compare our algorithm against the EM-Adapt results as reported in Papandreou et al. [23] for weaksegmentation. However, their training procedure uses ran-dom crops of both the original image and the segmentationmask. The weak labels are then computed from those ran-dom crops. This introduces limited information about thespatial location of the weak tags. Taken to the extreme,a 1 × 1 output crop reduces to full supervision. We thuspresent this result in the next subsection on incorporatingincreasing supervision.

7.3. Training with additional supervision

We now consider slightly more supervision than just theimage-level tags. Firstly, we consider the training with tagson random crops of original image, following Papandreouet al. [23]. We evaluate our constrained optimization onthe EM-Adapt architecture using random crops, and com-pare to the result obtained from their released caffemodel asshown in Table 3. Using limited spatial information our al-

1https://bitbucket.org/deeplab/deeplab-public/overview

Figure 4: Ablation study with varying amount of fully su-pervised images. Our model makes good use of the addi-tional supervision.

gorithm slightly outperforms EM-Adapt, mainly due to themore powerful background constraints. Note that the differ-ence is not as striking as in the pure weak label setting. Webelieve this is due to the fact that the spatial information incombination with the foreground prior emulates the upperbound constraint on background, as a random crop is likelyto contain much fewer labels.

Method Training Supervision mIoU mIoUw/o CRF

EM-Adapt [23] Tags w/ random crops 33.9 35.4CCNN Tags w/ random crops 34.2 35.8

EM-Adapt [23] Tags w/ object sizes – –CCNN Tags w/ object sizes 40.5 44.1

Table 3: Results using additional supervision during train-ing evaluated on the VOC 2012 validation set.

More interestingly however, a simple size constraint canimprove the performance much more dramatically. For eachlabel, we use one additional bit of information: whethera certain class occupies more than 10% of the image or

(a) Original image (b) Ground truth (c) Image tags (d) Image tags + size

Figure 5: Qualitative results on the VOC 2012 dataset for different levels of supervision. We show the original image, groundtruth, our trained classifier with image level tags and with size constraints. Note that the size constraints localize the objectsmuch better than just image level tags at the cost of missing small objects in few examples.

not. We incorporate this weak supervision into the size con-straints described in Section 5. As shown in Table 3, us-ing this one additional bit of information greatly increasesthe accuracy. Unfortunately, EM-Adapt heuristic cannot di-rectly incorporate this more meaningful size constraint.

Table 2 reports our results on PASCAL VOC 2012 testserver and compares it to fully supervised approaches. As afinal experiment we add fully supervised images in additionto our weak objective, as depicted in Figure 4. Qualitativeresults are shown in Figure 5.

7.4. Discussion

We further experimented with bounding box constraints.We constrain 75% of pixels within a bounding box to takea specific label, while we suppress any labels outside thebounding box. This additional supervision allows us toboost the IoU accuracy to 54%. This number is compet-itive with a baseline for which we train a model on allpixels within a bounding box, which gives 52.3% [23].However it is not yet competitive with more sophisticatedsystems that use more segmentation information within

bounding boxes [7, 23]. Those systems perform at roughly58.5− 62.0% IoU accuracy. We believe the key to this per-formance is a stronger use of the pixel level segmentationinformation. We are currently exploring various avenues toincorporate this segmentation information into the trainingprocess, either through CRF models [23] or object propos-als [7].

8. ConclusionWe presented a constrained optimization framework to

optimize convolutional neural networks. The framework isgeneral and can incorporate arbitrary linear constrains. Itnaturally integrates into standard Stochastic Gradient De-scent, and can easily be used in publicly available frame-works such as Caffe [16].

We showed that constraints are a natural way to describethe desired output space of a labeling and can reduce theamount of strong supervision convolutional neural networksrequire.

References[1] K. Ali and K. Saenko. Confidence-rated multiple instance

boosting for object detection. In CVPR, 2014. 2[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vec-

tor machines for multiple-instance learning. In NIPS, pages561–568, 2002. 2, 5

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cam-bridge University Press, 2004. 4

[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, andA. L. Yuille. Semantic image segmentation with deep con-volutional nets and fully connected crfs. arXiv preprintarXiv:1412.7062, 2014. 6, 7

[5] X. Chen, A. Shrivastava, and A. Gupta. Enriching visualknowledge bases via object discovery and segmentation. InCVPR, 2014. 2

[6] R. G. Cinbis, J. Verbeek, C. Schmid, et al. Multi-fold miltraining for weakly supervised object localization. In CVPR,2014. 1, 2

[7] J. Dai, K. He, and J. Sun. Boxsup: Exploiting boundingboxes to supervise convolutional networks for semantic seg-mentation. arXiv preprint arXiv:1503.01640, 2015. 9

[8] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solv-ing the multiple instance problem with axis-parallel rectan-gles. Artificial intelligence, 1997. 2

[9] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams,J. Winn, and A. Zisserman. The pascal visual object classeschallenge: A retrospective. IJCV, 111(1):98–136, 2014. 6

[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ra-manan. Object detection with discriminatively trained part-based models. IEEE Tran. PAMI, 2010. 2

[11] K. Fukushima. Neocognitron: A self-organizing neu-ral network model for a mechanism of pattern recogni-tion unaffected by shift in position. Biological cybernetics,36(4):193–202, 1980. 1

[12] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik.Semantic contours from inverse detectors. In ICCV, 2011. 6

[13] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simul-taneous detection and segmentation. In ECCV, 2014. 7

[14] D. Heckerman. A tractable inference algorithm for diagnos-ing multiple diseases. arXiv preprint arXiv:1304.1511, 2013.2

[15] J. Hoffman, D. Pathak, T. Darrell, and K. Saenko. Detectordiscovery in the wild: Joint multiple instance and represen-tation learning. In CVPR, 2015. 1, 5

[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B.Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolu-tional architecture for fast feature embedding. In ACM Mul-timedia, MM, pages 675–678, 2014. 4, 9

[17] P. Krahenbuhl and V. Koltun. Efficient inference in fully con-nected CRFs with gaussian edge potentials. In NIPS, 2011.6

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. InNIPS, pages 1097–1105, 2012. 1

[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E.Howard, W. Hubbard, and L. D. Jackel. Backpropagationapplied to handwritten zip code recognition. Neural compu-tation, 1(4):541–551, 1989. 1

[20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutionalnetworks for semantic segmentation. In CVPR, 2015. 6, 7

[21] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich.Feedforward semantic segmentation with zoom-out features.arXiv preprint arXiv:1412.0774, 2014. 7

[22] M. Pandey and S. Lazebnik. Scene recognition and weaklysupervised object localization with deformable part-basedmodels. In Proc. ICCV, 2011. 2

[23] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille.Weakly-and semi-supervised learning of a dcnn for seman-tic image segmentation. arXiv preprint arXiv:1502.02734,2015. 2, 5, 6, 7, 8, 9

[24] D. Pathak, S. N. Satyavolu, and V. P. Namboodiri. Whereis my friend? - person identification in social networks. InAutomatic Face and Gesture Recognition (FG), 2015. 2

[25] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fullyconvolutional multi-class multiple instance learning. arXivpreprint arXiv:1412.7144, 2014. 1, 2, 5, 6, 7

[26] P. O. Pinheiro and R. Collobert. Weakly supervised semanticsegmentation with convolutional networks. arXiv preprintarXiv:1411.6228, 2014. 2, 5, 6, 7

[27] J. C. Platt and A. H. Barr. Constrained differential optimiza-tion for neural networks. 1988. 2

[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recog-nition challenge. IJCV, 2015. 1, 6

[29] K. Simonyan and A. Zisserman. Very deep convolutionalnetworks for large-scale image recognition. In ICLR, 2015.5, 6

[30] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discoveryof mid-level discriminative patches. In ECCV. 2012. 2

[31] H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui,and T. Darrell. On learning to localize objects with minimalsupervision. In ICML, 2014. 2

[32] A. Vezhnevets and J. M. Buhmann. Towards weakly super-vised semantic segmentation by means of multiple instanceand multitask learning. In CVPR, 2010. 2

[33] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly su-pervised structured output learning for semantic segmenta-tion. In CVPR, 2012. 2

[34] C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervisedobject localization with latet category learning. In ECCV,2014. 2

[35] C.-N. J. Yu and T. Joachims. Learning structural svms withlatent variables. In ICML, 2009. 2

[36] S. H. Zak, V. Upatising, and S. Hui. Solving linear program-ming problems with neural networks: a comparative study.IEEE transactions on neural networks/a publication of theIEEE Neural Networks Council, 6(1):94–104, 1994. 2

[37] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instanceboosting for object detection. In Advances in neural infor-mation processing systems, 2005. 2

[38] L. Zhang, Y. Gao, Y. Xia, K. Lu, J. Shen, and R. Ji. Repre-sentative discovery of structure cues for weakly-supervisedimage segmentation. Multimedia, IEEE Transactions on,16(2):470–479, 2014. 2

Supplementary Material:Constrained Convolutional Neural Networks for

Weakly Supervised Segmentation

Deepak Pathak Philipp Krahenbuhl Trevor DarrellUniversity of California, Berkeley

{pathak,philkr,trevor}@cs.berkeley.edu

This document provides detailed derivation for the re-sults used in the paper.

In the paper, we optimize constraints on CNN output byintroducing latent probability distribution P (X). Recall theoverall main objective as follows:

minimizeθ,P

−HP −EX∼P [logQ(X|θ)]︸︷︷︸HP |Q

subject to A~P ≥ ~b,∑

X

P (X) = 1, (1)

where HP = −∑X P (X) logP (X) is the entropy of la-tent distribution,HP |Q is the cross entropy and ~P is the vec-torized version of P (X). In this supplementary material,we will show how to minimize the objective with respect toP . For the complete block coordinate descent minimizationalgorithm with respect to both P and θ, see the main paper.

Note that the objective function in (1) is KL Divergenceof network output distribution Q(X|θ) from P (X), whichis convex. Objective (1) is convex optimization problem,since all constraints are linear. Furthermore, Slaters condi-tion holds as long as the constraints are satisfiable and hencewe have strong duality [1]. First, we will use this strong du-ality to show that the minimum of (1) is a fully factorizeddistribution. We then derive the dual function (equation (4)in main paper), and finally extend the analysis for the casewhen objective is relaxed by adding slack variable.

1. Latent Distribution

In this section, we show that the latent label distribu-tion P (X) can be modeled as the product of independentmarginals without loss of generality. This is equivalent toshowing that the latent distribution that achieves the globaloptimal value factorizes, while keeping the network param-eters θ fixed. First, we simplify the cross entropy term in

the objective function in (1) as follows:

HP |Q = −EX∼P [logQ(X|θ)]

= −EX∼P[n∑

i=1

log qi(xi|θ)]

= −n∑

i=1

EX∼P [log qi(xi|θ)]

= −n∑

i=1

Exi∼P [log qi(xi|θ)]

= −n∑

i=1

∑

l∈LP (xi = l) log qi(l|θ) (2)

We used the linearity of expectation and the fact that qiis independent of any variable Xj for j 6= i to sim-plify the objective, as shown above. Here, P (xi = l) =∑X:xi=l

P (X) is the marginal distribution.Let’s now look at the Lagrangian dual function

L(P, λ, ν) =−HP +HP |Q

+ λ>(~b−A~P ) + ν

(∑

X

P (X)− 1

)

=−HP +HP |Q −∑

i,l

λ>Ai;l ~P (xi = l)

+ λ>~b+ ν

(∑

X

P (X)− 1

)

=−HP −n∑

i=1

∑

l∈LP (xi = l)

(log qi(l|θ) +ATi;lλ

)

︸︷︷︸HP |Q

+ λT~b+ ν

(∑

X

P (X)− 1

), (3)

where HP |Q is a biased cross entropy term and Ai;l is thecolumn of A corresponding to pi(l).. Here we use the fact

1

arX

iv:1

506.

0364

8v1

[cs

.CV

] 1

1 Ju

n 20

15

that the linear constraints are formulated on the marginals torephrase the dual objective. We will now show this objec-tive can be rephrased as a KL-divergence between a biaseddistribution Q and P .

The biased cross entropy term can be rephrasedas a cross entropy between a distributionQ(X|θ, λ) =

∏i qi(xi|θ, λ), where qi(xi|θ, λ) =

1Ziqi(xi|θ) exp(A>i;xi

λ) is the biased CNN distribution and

Zi is a local partition function ensuring qi sums to 1. Thispartition function is defined as

Zi =∑

l

exp(log qi(l|θ) +A>i;lλ

)

The cross entropy between P and Q is then defined as

HP |Q = −∑

X

P (X) log Q(X|θ, λ)

= −∑

i

∑

l

P (xi= l) log qi(l|θ, λ)

= −∑

i

∑

l

P (xi= l)(log qi(l|θ, λ)+A>i;lλ−log Zi)

= HP |Q +∑

i

log Zi (4)

This allows us to rephrase (3) in terms of a KL-divergence between P and Q

L(P, λ, ν)

=−HP +HP |Q −∑

i

log Zi+λT~b+ν

(∑

X

P (X)− 1

)

=D(P ||Q)− C + λT~b+ ν

(∑

X

P (X)− 1

), (5)

whereC =∑i log Zi is a constant that depends on the local

partition functions of Q.The primal objective (1) can be phrased as

minP maxλ≥0,ν L(P, λ, ν) which is equivalent to thedual objective maxλ≥0,ν minP L(P, λ, ν), due to strongduality.

Maximizing the dual objective can be phrased as maxi-mizing a dual function

L(λ) =λ>~b−C+maxν

minP

D(P‖Q)+ν

(∑

X

P (X)− 1

)

=λ>~b−C+ minP :

∑X P (X)=1

D(P‖Q)

︸︷︷︸0

=λ>~b−∑

i

log∑

l

exp(log qi(l|θ) +A>i;lλ

), (6)

where the maximization of ν can be rephrased as a con-straint on P i.e.

∑X P (X) = 1. Maximizing (6) is equiv-

alent to minimizing the original constraint objective (1).

Factorization The KL-divergenceD(P ||Q) is minimizedat P = Q =

∏i qi, hence the minimal value of P fully

factorizes over all variables for any assignment to the dualvariables λ.

Dual function Using the definition qi(l|θ) =1Zi

exp(fi(l; θ)) we can define the dual function withrespect to fi

L(λ)=λ>~b−∑

i

log∑

l

exp(fi(l; θ)+A>i;lλ) +

∑

i

logZi

︸︷︷︸const.

,

where the log partition function is constant and falls out inthe optimization.

2. Optimizing Constraints with Slack VariableThe slack relaxed loss function is given by

minimizeθ,P,ξ

−HP −EX∼P [logQ(X|θ)]︸︷︷︸HP |Q

+β>ξ (7)

subject to A~P ≥ ~b− ξ, ξ ≥ 0,∑

X

P (X) = 1

For any β ≤ 0, a value of ξ → ∞ will minimize the ob-jective and hence invalidate the corresponding constraints.Thus, for the remainder of the section we assume thatβ > 0. The Lagrangian dual to this loss is defined as

L(P, λ, ν, γ) =−HP +HP |Q + β>ξ + λ>(~b−A~P − ξ)

+ ν

(∑

X

P (X)− 1

)− γ>ξ. (8)

We know that the dual variable γ ≥ 0 is strictly non-negative, as well as

∂

∂ξL(P, λ, ν, γ) = β − λ− γ = 0. (9)

This leads to the following constraint on λ:

0 ≤ γ = β − λ.Hence the slack weight forms an upper bound on λ ≤ β.Substituting Equation (9) into Equation (8) reduces the dualobjective to the non-slack objective in Equation (3), and therest of the derivation is equivalent.

References[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cam-

bridge University Press, 2004. 1

deepak pathak philipp kr ahenb¨ uhl trevor darrell¨ abstractdeepak pathak philipp kr ahenb¨ uhl...

Documents