
Solving constrained combinatorial optimization problems via MAP inference without high-order penalties

Zhen Zhang¹  Qinfeng Shi²  Julian McAuley³  Wei Wei¹  Yanning Zhang¹  Rui Yao⁴  Anton van den Hengel²

¹ School of Computer Science and Engineering, Northwestern Polytechnical University, China
² School of Computer Science, The University of Adelaide, Australia
³ Computer Science and Engineering Department, University of California, San Diego, USA
⁴ School of Computer Science and Technology, China University of Mining and Technology, China

Abstract

Solving constrained combinatorial optimization problems via MAP inference is often achieved by introducing extra potential functions for each constraint. This can result in very high order potentials, e.g. a 2nd-order objective with pairwise potentials and a quadratic constraint over all N variables would correspond to an unconstrained objective with an order-N potential. This limits the practicality of such an approach, since inference with high order potentials is tractable only for a few special classes of functions.

We propose an approach which is able to solve constrained combinatorial problems using belief propagation without increasing the order. For example, in our scheme the 2nd-order problem above remains order 2 instead of order N. Experiments on applications including foreground detection, image reconstruction, quadratic knapsack, and the M-best solutions problem demonstrate the effectiveness and efficiency of our method. Moreover, we show several situations in which our approach outperforms commercial solvers like CPLEX and others designed for specific constrained MAP inference problems.

Introduction

Maximum a posteriori (MAP) inference for graphical models can be used to solve unconstrained combinatorial optimization problems, or constrained problems by introducing extra potential functions for each constraint (Ravanbakhsh, Rabbany, and Greiner 2014; Ravanbakhsh and Greiner 2014; Frey and Dueck 2007; Bayati, Shah, and Sharma 2005; Werner 2008). The main limitation of this approach is that it often results in very high order potentials. Problems with pairwise potentials, for example, are very common, and adding a quadratic constraint (order 2) over N variables results in an objective function of order N. Optimizing over such high-order potentials is tractable only for a few special classes of functions (Tarlow, Givoni, and Zemel 2010; Potetz and Lee 2008; Komodakis and Paragios 2009; Mézard, Parisi, and Zecchina 2002; Aguiar et al. 2011), such as linear functions.

Recently, Lim, Jung, and Kohli (2014) proposed cutting-plane based methods to handle constrained problems without introducing very high order potentials. However, their approaches require exact solutions of a series of unconstrained MAP inference problems, which is, in general, intractable. Thus their approaches are again only applicable to a particular class of potentials and constraints.

To tackle general constrained combinatorial problems, our overall idea is to formulate the unconstrained combinatorial problem as a linear program (LP) with local marginal polytope constraints only (which corresponds to a classical MAP inference problem), and then add the "real" constraints from the original combinatorial problem to the existing LP to form a new LP. Duality of the new LP absorbs the "real" constraints naturally, and yields a convenient message passing procedure. The proposed algorithm is guaranteed to find feasible solutions for a quite general set of constraints.

We apply our method to problems including foreground detection, image reconstruction, quadratic knapsack, and the M-best solutions problem, and show several situations in which it outperforms the commercial optimization solver CPLEX. We also test our method against more restrictive approaches, including Aguiar et al. (2011) and Lim, Jung, and Kohli (2014), on the subsets of our applications to which they are applicable. Our method outperforms these methods in most cases, even in settings that favor them.

Preliminaries

Here we consider factor graphical models with discrete variables. Denote the graph G = (V, C), where V is the set of nodes and C is a collection of subsets of V. Each c ∈ C is called a cluster. If we associate one random variable x_i with each node, and let x = [x_i]_{i∈V}, then it is often assumed that the joint distribution of x belongs to the exponential family p(x) = (1/Z) exp[∑_{c∈C} θ_c(x_c)], where x_c denotes the vector [x_i]_{i∈c}. The real-valued function θ_c(x_c) is known as a potential function. Without loss of generality we make the following assumption to simplify the derivation:

Assumption 1. For convenience we assume that: (1) C includes every node, i.e., ∀i ∈ V, {i} ∈ C; (2) C is closed under intersection, i.e., ∀c_1, c_2 ∈ C, it is true that c_1 ∩ c_2 ∈ C.

MAP and its LP relaxations. The goal of MAP inference is to find the most likely assignment of values to the random variable x given its joint distribution, which is equivalent to x* = argmax_x ∑_{c∈C} θ_c(x_c). The MAP inference problem is NP-hard in general (Shimony 1994). Under Assumption 1, by introducing µ = [µ_c(x_c)]_{c∈C} corresponding to the elements of C, one arrives at the following LP relaxation (Globerson and Jaakkola 2007):

    max_{µ∈L} ∑_{c∈C} ∑_{x_c} µ_c(x_c) θ_c(x_c),    (1)

    L = { µ | ∀c ∈ C, x_c: µ_c(x_c) ≥ 0, ∑_{x_c} µ_c(x_c) = 1;  ∀c, s ∈ C, s ⊂ c, x_s: ∑_{x_c \ x_s} µ_c(x_c) = µ_s(x_s) }.

The Problem

We consider the following constrained combinatorial problem with K constraints:

    max_x ∑_{c∈C} θ_c(x_c),   s.t. ∑_{c∈C} φ^k_c(x_c) ≤ 0, k ∈ K,    (2)

where K = {1, 2, ..., K} and the φ^k_c(x_c) are real-valued functions. Since θ_c(x_c) and φ^k_c(x_c) can be any real-valued functions, problem (2) represents a large range of combinatorial problems. We now provide a few examples.

Example 1 (ℓ0 norm constraint). Given b ∈ R_+, the constraint ‖x‖_0 ≤ b can be reformulated as

    ∑_{i∈V} φ_i(x_i) ≤ 0,   φ_i(x_i) = 1(x_i ≠ 0) − b/|V|,    (3)

where 1(S) denotes the indicator function (1(S) = 1 if S is true, 0 otherwise). This type of constraint is often used in applications where sparse signals or variables are expected.

Example 2 (Sparse gradient constraint). Some approaches to image reconstruction (e.g. Maleh, Gilbert, and Strauss (2007)) exploit a constraint reflecting an expectation that image gradients are sparse, such as ∑_{{i,j}∈C} ‖x_i − x_j‖_0 ≤ b, where b is a threshold. The constraint can be rewritten as

    ∑_{i∈V} φ_i(x_i) + ∑_{{i,j}∈C} φ_{i,j}(x_{{i,j}}) ≤ 0,    (4)
    φ_i(x_i) = −b/|V|,   φ_{i,j}(x_{{i,j}}) = 1(x_i ≠ x_j).
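To make the per-cluster form of these constraints concrete, the following sketch (our own illustrative code with hypothetical helper names, not part of the paper) encodes Example 1 and Example 2 as dictionaries of functions φ_c and evaluates ∑_c φ_c(x_c) for a candidate assignment:

```python
import numpy as np

def l0_constraint(num_nodes, b):
    """Example 1: ||x||_0 <= b written as node-wise terms phi_i(x_i) = 1(x_i != 0) - b/|V|."""
    return {(i,): (lambda xi, b=b, n=num_nodes: float(xi != 0) - b / n)
            for i in range(num_nodes)}

def sparse_gradient_constraint(edges, num_nodes, b):
    """Example 2: sum_{ij} ||x_i - x_j||_0 <= b written as node and edge terms."""
    phi = {(i,): (lambda xi, b=b, n=num_nodes: -b / n) for i in range(num_nodes)}
    phi.update({(i, j): (lambda xi, xj: float(xi != xj)) for (i, j) in edges})
    return phi

def constraint_value(phi, x):
    """Evaluate sum_c phi_c(x_c); the constraint is satisfied iff the result is <= 0."""
    return sum(f(*[x[i] for i in cluster]) for cluster, f in phi.items())

x = np.array([0, 2, 2, 0, 1])                                  # toy assignment on 5 nodes
print(constraint_value(l0_constraint(5, b=3), x) <= 0)         # ||x||_0 = 3 <= 3
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(constraint_value(sparse_gradient_constraint(edges, 5, b=2), x))  # > 0 means infeasible
```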

Example 3 (Assignment difference constraint). Image segmentation (Batra et al. 2012) and the M-best MAP solutions problem (Fromer and Globerson 2009) often require the difference between a new solution x and a given assignment a to be greater than a positive integer b, i.e. ‖x − a‖_0 ≥ b. This can be reformulated as

    ∑_{i∈V} φ_i(x_i) ≤ 0,   φ_i(x_i) = −1(x_i ≠ a_i) + b/|V|.    (5)

For b = 1, it can also be reformulated as

    ∑_i φ_i(x_i) + ∑_{{i,j}∈T} φ_{i,j}(x_i, x_j) ≤ 0,    (6)
    φ_i(x_i) = 1 − d_i if x_i = a_i, and 0 if x_i ≠ a_i;   φ_{ij}(x_i, x_j) = 1 if x_{{i,j}} = a_{{i,j}}, and 0 otherwise,

where T is an arbitrary spanning tree of G, and d_i is the degree of node i in T (Fromer and Globerson 2009).

Example 4 (QKP). The Quadratic Knapsack Problem with multiple constraints (Wang, Kochenberger, and Glover 2012) is stated as

    max_{x∈{0,1}^{|V|}} ∑_{i,j∈V} x_i c_{ij} x_j,   s.t. ∑_{i∈V} w^k_i x_i ≤ b_k, k ∈ K,

where all c_{ij} and w^k_i are non-negative real numbers. The above QKP can be reformulated as

    max_x ∑_{i∈V} θ_i(x_i) + ∑_{{i,j}∈E} θ_{ij}(x_i, x_j),   s.t. ∑_{i∈V} φ^k_i(x_i) ≤ 0,   φ^k_i(x_i) = w^k_i x_i − b_k/|V|, k ∈ K,

where θ_i(x_i) = c_{ii} x_i, E = {{i, j} | c_{ij} + c_{ji} > 0}, and θ_{ij}(x_i, x_j) = (c_{ij} + c_{ji}) x_i x_j.
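As an illustration of Example 4 (our own sketch, with hypothetical helper names), a small QKP instance (c, w, b) can be mapped to unary/pairwise potentials θ and node-wise constraint functions φ^k as follows:

```python
import numpy as np

def qkp_to_mrf(c, w, b):
    """Reformulate max x^T c x  s.t. w_k^T x <= b_k (x binary) as in Example 4.

    c: (n, n) nonnegative matrix; w: (K, n) nonnegative weights; b: (K,) budgets."""
    n = c.shape[0]
    theta_unary = [lambda xi, cii=c[i, i]: cii * xi for i in range(n)]
    edges = {(i, j): c[i, j] + c[j, i]
             for i in range(n) for j in range(i + 1, n) if c[i, j] + c[j, i] > 0}
    theta_pair = {e: (lambda xi, xj, cij=cij: cij * xi * xj) for e, cij in edges.items()}
    phi = [[lambda xi, wki=w[k, i], bk=b[k]: wki * xi - bk / n for i in range(n)]
           for k in range(w.shape[0])]
    return theta_unary, theta_pair, phi

def objective(theta_unary, theta_pair, x):
    return sum(f(x[i]) for i, f in enumerate(theta_unary)) + \
           sum(f(x[i], x[j]) for (i, j), f in theta_pair.items())

def feasible(phi, x):
    # constraint k is satisfied iff sum_i phi^k_i(x_i) <= 0
    return all(sum(phi_k[i](x[i]) for i in range(len(x))) <= 1e-9 for phi_k in phi)

rng = np.random.default_rng(0)
n, K = 4, 2
c = np.triu(rng.uniform(0, 1, (n, n)))                 # toy instance (not a grid)
w, b = rng.uniform(0, 1, (K, n)), np.array([1.0, 1.5])
theta_u, theta_p, phi = qkp_to_mrf(c, w, b)
x = np.array([1, 0, 1, 0])
print(objective(theta_u, theta_p, x), feasible(phi, x))
```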

Belief Propagation with Constraints without the High-Order Penalties

Problem (2) is NP-hard in general, thus we first show how to relax the problem.

LP Relaxations

We consider the following LP relaxation for (2):

    µ* = argmax_{µ∈L} ∑_{c∈C} ∑_{x_c} µ_c(x_c) θ_c(x_c),   s.t. ∑_{c∈C} ∑_{x_c} φ^k_c(x_c) µ_c(x_c) ≤ 0, k ∈ K.    (7)

The key trick here is to reformulate the general constraint in (2) as a constraint linear in µ.
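For intuition only, the sketch below (ours, not the paper's solver) writes relaxation (7) down explicitly for a two-node toy model (one edge, binary states, a single ℓ0-style constraint) and hands it to a generic LP solver; the point is that the constraint enters as a single row that is linear in µ, not that one should solve (7) this way at scale.

```python
import numpy as np
from scipy.optimize import linprog

# Two binary nodes {1, 2} joined by one edge; the pseudomarginals are ordered as
# mu = [mu1(0), mu1(1), mu2(0), mu2(1), mu12(00), mu12(01), mu12(10), mu12(11)].
theta1 = np.array([0.0, 1.0])                       # toy unary potentials
theta2 = np.array([0.0, 1.0])
theta12 = np.array([[0.5, 0.0], [0.0, 0.5]])        # pairwise potential, favors agreement
theta = np.concatenate([theta1, theta2, theta12.ravel()])

# l0 constraint ||x||_0 <= 1 as in Example 1: phi_i(x_i) = 1(x_i != 0) - 1/2,
# which is linear in mu: sum_i sum_{x_i} phi_i(x_i) mu_i(x_i) <= 0.
phi_row = np.array([-0.5, 0.5, -0.5, 0.5, 0, 0, 0, 0])

A_eq = [[1, 1, 0, 0, 0, 0, 0, 0],                   # normalization of mu1, mu2, mu12
        [0, 0, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1, 1, 1],
        [-1, 0, 0, 0, 1, 1, 0, 0],                  # sum_{x2} mu12(x1, x2) = mu1(x1)
        [0, -1, 0, 0, 0, 0, 1, 1],
        [0, 0, -1, 0, 1, 0, 1, 0],                  # sum_{x1} mu12(x1, x2) = mu2(x2)
        [0, 0, 0, -1, 0, 1, 0, 1]]
b_eq = [1, 1, 1, 0, 0, 0, 0]

res = linprog(c=-theta, A_ub=[phi_row], b_ub=[0.0],
              A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
print("relaxed optimum:", -res.fun)
print("node marginals:", res.x[:4])
```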

In previous belief propagation schemes for constrained MAP inference (e.g. Potetz and Lee (2008); Tarlow, Givoni, and Zemel (2010); Duchi et al. (2006)), extra potential functions as well as clusters are introduced for each constraint. However, this approach only works for very specific classes of constraints. For general constraints, for some k from K, computing the max-marginal between an introduced cluster and a node requires solving an integer program of the form

    max_{x∈{x′ | x′_i = x_i}} I_∞(∑_{c∈C} φ^k_c(x_c) ≤ 0) + ∑_{i∈V} ϑ_i(x_i),    (8)

where ϑ_i(x_i) is an arbitrary real-valued function, and I_∞(S) maps the statement S to zero if S is true, and −∞ otherwise. Unfortunately, even under quite strong assumptions (e.g. C only involves nodes), (8) is challenging to solve due to its NP-hardness.

The Dual Problem

Letting γ = [γ_k]_{k∈K} be the Lagrange multipliers corresponding to the additional constraints in (7), the LP relaxation (7) has the following dual problem:

    min_{θ̂∈Λ(θ), γ≥0} ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − ∑_{k∈K} γ_k φ^k_c(x_c) ],    (9)

    Λ(θ) = { θ̂ | ∑_{c∈C} θ̂_c(x_c) = ∑_{c∈C} θ_c(x_c) for all x },

Figure 1: Illustration of one iteration of BP for one constraint. Given γ_min and γ_max, we can conveniently determine whether γ = (γ_min + γ_max)/2 is greater or less than the unknown γ*. Here δg(γ) ≤ 0, thus γ ≤ γ*; hence we update γ_min = γ.

where θ̂ = [θ̂_c(x_c)]_{c∈C} is known as a 'reparametrization' (Kolmogorov 2006) of θ = [θ_c(x_c)]_{c∈C}, and Λ(θ) is the set of all reparametrizations of θ.

To derive the belief propagation scheme, we first consider MAP inference with a single constraint; later we show how to generalize to multiple constraints. With only one constraint, (9) becomes

    min_{θ̂∈Λ(θ), γ≥0} ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − γφ_c(x_c) ].    (10)

The optimization problem (10) can be reformulated as

    min_{γ≥0} g(γ),   g(γ) = min_{θ̂∈Λ(θ)} ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − γφ_c(x_c) ].    (11)

Problem (11) can be efficiently optimized due to the following property of g(γ):

Proposition 1. g(γ) is piecewise linear and convex.

Belief Propagation for One Constraint

In this section, we show how (11) can be efficiently solved via belief propagation. For convenience in the derivation, we define

    b^γ_c(x_c) = θ̂_c(x_c) − γφ_c(x_c),   ∀c ∈ C, x_c ∈ X_c.    (12)

We can perform the following procedure to compute a sub-gradient of g(γ):

    θ̂^{⋆,γ} = argmin_{θ̂∈Λ(θ)} ∑_{c∈C} max_{x_c} b^γ_c(x_c),    (13a)
    x^{γ,⋆}_c = argmax_{x_c} b^{γ,⋆}_c(x_c);   δg(γ) = −∑_{c∈C} φ_c(x^{γ,⋆}_c),    (13b)

where for all c we use the notation b^{γ,⋆}_c(x_c) = θ̂^{⋆,γ}_c(x_c) − γφ_c(x_c) (when there are multiple maximizers or minimizers, randomly picking one suffices). Then for δg(γ) the following proposition holds:

Proposition 2. δg(γ) is a sub-gradient of g(γ).

Computing δg(γ) requires the optimal reparametrization θ̂^⋆. It can be obtained via sub-gradients (Schwing et al. 2012) or smoothing techniques (Meshi, Jaakkola, and Globerson 2012; Meshi and Globerson 2011). However, these methods are often slower than belief propagation (BP) on non-smooth objectives, e.g. Max-Product Linear Programming (MPLP) (Sontag, Globerson, and Jaakkola 2011). For the best performance in practice we use BP on non-smooth objectives to approximately compute θ̂^⋆ (a detailed comparison is provided in the supplementary file).

Algorithm 1: BP for one constraint
  input: G = (V, C); θ_c(x_c), φ_c(x_c), c ∈ C; γ_min, γ_max; ε.
  output: γ, x*.
  1: f_max = −∞
  2: repeat
  3:   γ = (γ_min + γ_max)/2
  4:   approximately compute θ̂^{⋆,γ} in (13) by BP with the updating rules in (15)
  5:   x^{⋆,γ}_c = argmax_{x_c} [ θ̂^{⋆,γ}_c(x_c) − γφ_c(x_c) ]
  6:   δg(γ) = −∑_{c∈C} φ_c(x^{⋆,γ}_c)
  7:   if δg(γ) ≥ 0 then γ_max = γ, else γ_min = γ
  8:   decode x_i = argmax_{x_i} b^{⋆,γ}_i(x_i)
  9:   if x is feasible and ∑_{c∈C} θ_c(x_c) > f_max
 10:     then x* = x, f_max = ∑_{c∈C} θ_c(x_c)
 11: until |γ_min − γ_max| < ε or δg(γ) = 0

Reparametrization Update Rules. For a fixed γ, (10) becomes a standard dual MAP inference problem. Thus we propose a coordinate descent belief propagation scheme, a variant of MPLP (Globerson and Jaakkola 2007), to optimize (13). In each step, we make θ̂_c(x_c) and θ̂_s(x_s), s ≺ c, for some c ∈ C flexible and keep the others fixed (s ≺ c denotes s ∈ {s′ | s′ ∈ C, s′ ⊂ c, ∄t′ ∈ C s.t. s′ ⊂ t′ ⊂ c}). Using b^γ_c(x_c) defined in (12), the sub-optimization problem becomes

    min_{θ̂_c, θ̂_s}  max_{x_c} b^γ_c(x_c) + ∑_{s≺c} max_{x_s} b^γ_s(x_s),   s.t. θ̂ ∈ Λ(θ).    (14)

Fortunately, a closed-form solution of (14) exists:

    b^{γ,⋆}_s(x_s) = max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ] / |{s′ | s′ ≺ c}|,
    b^{γ,⋆}_c(x_c) = b^γ_c(x_c) + ∑_{s≺c} b^γ_s(x_s) − ∑_{s≺c} b^{γ,⋆}_s(x_s),    (15)
    θ̂^⋆_s(x_s) = b^{γ,⋆}_s(x_s) + γφ_s(x_s),   θ̂^⋆_c(x_c) = b^{γ,⋆}_c(x_c) + γφ_c(x_c).
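For a pairwise cluster c = {i, j} with children s ∈ {{i}, {j}}, update (15) reduces to an MPLP-style edge update on the γ-augmented beliefs b^γ. A minimal numpy sketch (our notation, assuming beliefs stored as dense arrays; illustrative only):

```python
import numpy as np

def edge_update(b_i, b_j, b_ij):
    """One application of (15) to a pairwise cluster c = {i, j}.

    b_i, b_j: 1-D arrays of node beliefs b^gamma_s; b_ij: 2-D array b^gamma_c."""
    total = b_ij + b_i[:, None] + b_j[None, :]     # b^gamma_c + sum_{s < c} b^gamma_s
    b_i_new = total.max(axis=1) / 2.0              # max over x_j, divided by |{s : s < c}| = 2
    b_j_new = total.max(axis=0) / 2.0              # max over x_i
    b_ij_new = total - b_i_new[:, None] - b_j_new[None, :]
    return b_i_new, b_j_new, b_ij_new

# toy check: the sum of maxes never increases under the update (coordinate descent)
rng = np.random.default_rng(1)
b_i, b_j, b_ij = rng.normal(size=3), rng.normal(size=3), rng.normal(size=(3, 3))
before = b_ij.max() + b_i.max() + b_j.max()
b_i, b_j, b_ij = edge_update(b_i, b_j, b_ij)
after = b_ij.max() + b_i.max() + b_j.max()
print(before, ">=", after)
```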

We iteratively update θ̂_c(x_c) and θ̂_s(x_s) to θ̂^⋆_c(x_c) and θ̂^⋆_s(x_s), s ≺ c, for all c ∈ C in order to approximately compute θ̂^⋆.

By Proposition 2, we can apply (approximate) projected sub-gradient descent to optimize g(γ). However, it is known that sub-gradient descent is very slow, usually with a convergence rate of O(1/ε²). The next proposition enables us to optimize g(γ) much more efficiently:

Proposition 3. Let γ* = argmin_γ g(γ). For any γ ≥ 0, we have that (1) if δg(γ) ≥ 0, then γ ≥ γ*; (2) if δg(γ) ≤ 0, then γ ≤ γ*.



Algorithm 2: BP for K constraints
  input: G = (V, C); θ_c(x_c), c ∈ C; φ^k_c(x_c), k ∈ K, c ∈ C; γ^min_k, γ^max_k, k ∈ K; ε.
  output: x*.
  1: initialize θ̂ = θ, γ = 0, f_max = −∞
  2: repeat
  3:   for each c ∈ C, update θ̂_c(x_c) and θ̂_s(x_s), s ≺ c, as in (15)
  4:   for k ∈ K do
  5:     repeat γ_k = (γ^min_k + γ^max_k)/2
  6:       if δh_k(γ_k) < 0 then γ^min_k = γ_k, else γ^max_k = γ_k
  7:     until |γ^min_k − γ^max_k| ≤ ε
  8:   x_i = argmax_{x_i} [ θ̂_i(x_i) − ∑^K_{k=1} γ_k φ^k_i(x_i) ]
  9:   if x is feasible and ∑_{c∈C} θ_c(x_c) > f_max
 10:     then x* = x, f_max = ∑_{c∈C} θ_c(x_c)
 11: until the decrease of the dual objective is less than 10⁻⁶

Proposition 3 allows us to determine whether a given γ is greater or less than the unknown γ*, which determines the search direction when optimizing g(γ). Thus, given search boundaries γ_min and γ_max, we can use the binary search procedure in Algorithm 1 to determine the optimal γ*. Figure 1 illustrates one iteration of Algorithm 1.
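For intuition, the toy sketch below reproduces the outer loop of Algorithm 1 (doubling of γ_max for initialization, then binary search guided by the sign of δg(γ)), with the inner reparametrization step replaced by brute-force maximization of the γ-augmented objective; this stands in for BP only on a tiny model and is not the authors' implementation.

```python
import itertools
import numpy as np

def toy_model():
    """3-node chain, 3 states per node, one l0-style constraint sum_i phi_i(x_i) <= 0."""
    rng = np.random.default_rng(0)
    theta_u = rng.uniform(0, 1, (3, 3))                     # theta_i(x_i)
    theta_p = {(0, 1): rng.uniform(0, 1, (3, 3)),           # theta_ij(x_i, x_j)
               (1, 2): rng.uniform(0, 1, (3, 3))}
    phi = lambda x: sum(float(xi != 0) for xi in x) - 1.0   # ||x||_0 <= 1
    return theta_u, theta_p, phi

def objective(theta_u, theta_p, x):
    return sum(theta_u[i, xi] for i, xi in enumerate(x)) + \
           sum(t[x[i], x[j]] for (i, j), t in theta_p.items())

def subgradient(theta_u, theta_p, phi, gamma):
    """delta g(gamma) = -phi(x) at a maximizer of the gamma-augmented objective (brute force)."""
    best = max(itertools.product(range(3), repeat=3),
               key=lambda x: objective(theta_u, theta_p, x) - gamma * phi(x))
    return -phi(best), best

theta_u, theta_p, phi = toy_model()
g_lo, g_hi, x_star, f_star = 0.0, 0.1, None, -np.inf
while subgradient(theta_u, theta_p, phi, g_hi)[0] < 0:      # initialization: double gamma_max
    g_lo, g_hi = g_hi, 2 * g_hi
for _ in range(40):                                          # binary search (Proposition 3)
    gamma = 0.5 * (g_lo + g_hi)
    dg, x = subgradient(theta_u, theta_p, phi, gamma)
    if phi(x) <= 0 and objective(theta_u, theta_p, x) > f_star:
        x_star, f_star = x, objective(theta_u, theta_p, x)   # keep the best feasible decoding
    if dg >= 0:
        g_hi = gamma
    else:
        g_lo = gamma
print("gamma* ~", gamma, "best feasible x:", x_star, "value:", f_star)
```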

Decoding and Initializing. We decode an integer solution by x_i = argmax_{x_i} b^{⋆,γ}_i(x_i), which is possible due to Assumption 1. In practice, we initialize γ_min = 0 and γ_max = 0.1. If δg(γ_max) < 0, we update γ_min = γ_max and increase γ_max by a factor of 2; we iterate until we find some γ_max s.t. δg(γ_max) ≥ 0. Since binary search takes logarithmic time, performance is not sensitive to the initial γ_max. Although our decoding and initialization scheme is quite simple, it is guaranteed that Algorithm 1 will return feasible solutions for a quite general set of constraints:

Proposition 4. For constraints of the form ∑_{i∈V} φ_i(x_i) ≤ 0, if the feasible set is nonempty, then with the above initialization strategy Algorithm 1 must return some feasible x*.

The approximate b^{⋆,γ} is computed via Max-Product Linear Programming (MPLP) (Sontag et al. 2008). In our comparison, we find that in Algorithm 2 the approximate δg(γ) results in the same solution as the exact δg(γ), but with much higher speed. For projected sub-gradient methods, a similar phenomenon is also observed. A typical time comparison is shown in Figure 6.

Belief Propagation for K Constraints

With K constraints, optimizing (9) amounts to finding the optimal {γ_k}^K_{k=1} and their reparametrizations. Naively applying Algorithm 1 to (9) by iterating through all K constraints can be expensive; to overcome this, we use coordinate descent to optimize (9). Fixing all of θ̂ and γ except for a particular γ_k, the sub-optimization problem becomes

    min_{γ_k≥0} h_k(γ_k) = ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − ∑_{k′∈K} γ_{k′} φ^{k′}_c(x_c) ].    (16)

Similar to g(γ) for one constraint, a sub-gradient of h_k(γ_k) can be computed via

    δh_k(γ_k) = −∑_{c∈C} φ^k_c(x̂_c),   x̂_c = argmax_{x_c} [ θ̂_c(x_c) − ∑_{k′∈K} γ_{k′} φ^{k′}_c(x_c) ].    (17)

Thus the sub-problem (16) can be solved via a binary search in Algorithm 2, which is a variant of Algorithm 1. When updating reparametrizations, the updating scheme for one constraint can be directly applied. As a result, we run one iteration of BP to update θ̂ and then update γ by binary search. Both updates result in a decrease of the dual objective, and thus the proposed algorithm produces a convergent sequence of dual values. The following proposition provides the condition under which Algorithms 1 and 2 obtain exact solutions:

Proposition 5. Let θ̂ and γ be the solution provided by Algorithm 1 or 2. The solution is exact if (1) ∃x̂ s.t. ∀c ∈ C, x̂_c ∈ argmax_{x_c} [ θ̂_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ], and (2) ∀k ∈ K, γ_k ∑_{c∈C} φ^k_c(x̂_c) = 0.
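Algorithm 2 interleaves one pass of reparametrization updates with a binary search on each γ_k. The toy sketch below keeps only the coordinate-descent loop over the multipliers and replaces the BP step with exhaustive maximization (so it illustrates the γ update only, under that simplifying assumption); all names are ours.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n_nodes, n_states, K = 3, 2, 2
theta = rng.uniform(0, 1, (n_nodes, n_states))               # unary-only toy model
phis = [lambda x, w=rng.uniform(0, 1, n_nodes), b=1.0: float(w @ np.array(x) - b)
        for _ in range(K)]                                    # K knapsack-style constraints

def augmented_argmax(gammas):
    """Maximizer of sum_i theta_i(x_i) - sum_k gamma_k phi_k(x), by enumeration."""
    return max(itertools.product(range(n_states), repeat=n_nodes),
               key=lambda x: sum(theta[i, xi] for i, xi in enumerate(x))
                             - sum(g * p(x) for g, p in zip(gammas, phis)))

gammas = np.zeros(K)
for outer in range(20):                                       # outer loop of Algorithm 2
    for k in range(K):                                        # binary search on gamma_k, cf. (16)
        lo, hi = 0.0, 10.0                                    # fixed search interval for the toy
        for _ in range(30):
            gammas[k] = 0.5 * (lo + hi)
            x = augmented_argmax(gammas)
            if -phis[k](x) < 0:                               # delta h_k(gamma_k) < 0
                lo = gammas[k]
            else:
                hi = gammas[k]
x = augmented_argmax(gammas)
print("gammas:", gammas, "decoded x:", x, "feasible:", all(p(x) <= 0 for p in phis))
```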

Note that the condition takes into account the functions φ^k_c(x_c) from the constraints, which differs from the result for standard MAP inference problems. Furthermore, when exact solutions are not attained, the gap between the dual optimum and the true MAP value can be estimated by the following proposition.

Proposition 6. Let θ̂ and γ be the solution provided by our algorithm. If there exists some feasible x̂ s.t. x̂_c ∈ argmax_{x_c} [ θ̂_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ], then the integrality gap of the proposed relaxation (7) is at most −∑^K_{k=1} γ_k ∑_{c∈C} φ^k_c(x̂_c).

Experiments

We consider four applications: foreground detection, image reconstruction, diverse M-best solutions, and quadratic knapsack. We compare against four competitors: Lim's method (Lim, Jung, and Kohli 2014), Alternating Directions Dual Decomposition (AD3) (Aguiar et al. 2011), the commercial optimizer CPLEX, and the approximate projected sub-gradient method, which is a byproduct of Proposition 2 and is often very slow. For Lim's cutting-plane method (Lim, Jung, and Kohli 2014), we have two variants. One version requires exact solutions, and we use the branch and bound method of Sun et al. (2012) to exactly solve the series of MAP inference problems (for submodular potentials, using graph-cuts instead of branch and bound would be significantly faster). The other version works on the relaxed problem (7), and we use the belief propagation scheme in (15) inside the cutting-plane scheme. The former is referred to as Lim and the latter as Lim-Relax. To implement the approximate projected sub-gradient method, we simply replace the update scheme for γ in Algorithm 1 with γ_{t+1} = (γ_t + α δg(γ_t)/(t ‖δg(γ_t)‖_2))_+, where (·)_+ denotes projection onto the non-negative reals. The maximum number of iterations for sub-gradient descent is 200 and α = 10.

Figure 2: Foreground detection results. Left two images: background image and image with foreground object. Right two images: results from the MRF without and with constraint (3). Bottom: running time vs. grid size (CPLEX ran out of memory for all n ≥ 20).

We use the AD3 package released by its authors, and set η = 0.1 according to the example provided in the package. All other methods except CPLEX are implemented in single-threaded C++ with a MATLAB mex interface. For CPLEX, we use the interior point algorithm. Since the solution of the relaxed problem (7) may not be integral, an integer solution is decoded via x_i = argmax_{x_i} µ*_i(x_i), where µ* = [µ*_c(x_c)]_{c∈C} is the solution from CPLEX.

Evaluation. We are primarily interested in the speed and accuracy of the proposed algorithms, so we compare the running time and the normalized decoded primal. Letting x be a decoded solution from our method (or one of its competitors), the normalized decoded primal is defined as

    ∑_{c∈C} θ_c(x_c) / ∑_{c∈C} max_{x_c} [ θ̂*_c(x_c) − ∑_{k∈K} γ*_k φ^k_c(x_c) ],    (18)

where θ̂* and γ* are a solution of (9). Experimentally, since the dual optimum might not be attained, we take the minimum dual objective value among all algorithms as the dual optimal value in (9).
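Computing metric (18) is mechanical once the decoded assignment and a dual solution (θ̂*, γ*) are available; a small sketch in our own notation (clusters as tuples of nodes, potentials as tables), for illustration only:

```python
def normalized_decoded_primal(theta, phi, gamma, theta_bar, x):
    """Eq. (18): primal value of decoded x divided by the dual objective at (theta_bar, gamma).

    theta, theta_bar, phi[k]: dicts mapping a cluster (tuple of nodes) to a table indexed
    by the cluster's joint state; gamma: list of multipliers; x: dict node -> state."""
    primal = sum(t[tuple(x[i] for i in c)] for c, t in theta.items())
    dual = sum(max(theta_bar[c][xc] - sum(g * phi_k[c][xc] for g, phi_k in zip(gamma, phi))
                   for xc in theta_bar[c])            # max over the cluster's assignments
               for c in theta_bar)
    return primal / dual

# toy usage: one node {0} and one pair {0, 1}, two states each, a single constraint
theta = {(0,): {(0,): 0.0, (1,): 1.0},
         (0, 1): {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}}
phi = [{(0,): {(0,): -0.5, (1,): 0.5}, (0, 1): {k: 0.0 for k in theta[(0, 1)]}}]
print(normalized_decoded_primal(theta, phi, [0.0], theta, {0: 1, 1: 1}))
```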

Foreground Detection

Here the task is to distinguish a foreground object from the reference background model. The problem can be formulated as a MAP inference problem of the form ∑_{i∈V} θ_i(x_i) + ∑_{{i,j}∈E} θ_{ij}(x_i, x_j). Since the foreground object is often assumed to be sparse, we add constraint (3) with b = 1300. We tested on the Change Detection dataset (Goyette et al. 2012). Potentials θ_i and θ_ij are obtained as in Schick, Bauml, and Stiefelhagen (2012).

The example given in Figure 2 shows that constraint (3) significantly reduces false positives. For this problem, Lim and Lim-Relax are equivalent. The running times on a 320 × 240 image are 19s for our method, 55s for sub-gradient, 33s for Lim et al.'s method, 24s for AD3, and 73.9s for CPLEX. All methods find the optimal solution of the relaxed problem. Furthermore, our method generates the best decoded integer solution, followed by Lim et al.'s method and CPLEX; the sub-gradient method and AD3 do not find a feasible integer solution.

To further investigate, we generate synthetic data at different scales. Several n × n grids are generated (n ∈ {10, 20, ..., 50}), and b in constraint (3) is set to n²/10. All unary and pairwise potentials are sampled uniformly from [0, 1]. For each n, we use 6 different random seeds to generate instances. Furthermore, to tighten the LP relaxation, square clusters with all-zero potentials are added as in Fromer and Globerson (2009). A running time comparison is shown in Figure 2, which shows that our method performs the best in terms of speed. In terms of solution quality, all algorithms except AD3 (Aguiar et al. 2011) converge to exact solutions for 26 of the 30 problems. The AD3 solver often failed to obtain feasible solutions, which might be caused by the fact that its LP relaxation (Eq. (5) in Aguiar et al. (2011)) is looser than that of the other methods.

Image Reconstruction

The task here is to reconstruct an image from a corrupted measurement. The problem can be formulated as a MAP inference problem (Li 2009),

    ∑_{i∈V} [ −(x_i − d_i)² ] + λ ∑_{{i,j}∈E} [ −(x_i − x_j)² ],

where d_i is the observed grayscale value of pixel i, E is the edge set, and each x_i takes one of 16 states. As in Maleh, Gilbert, and Strauss (2007), we use the constraint

    ∑_{{i,j}∈E} ‖x_i − x_j‖_0 ≤ b    (19)

to smooth the image. Here λ = 0.25 and b = 3000. An example reconstruction result is shown in Figure 3.

Figure 3: Image reconstruction results. Top: part of the ISO 12233 chart with noise; reconstruction results without constraints; and reconstruction results from our algorithm with constraint (19). Bottom: running time and normalized decoded primal vs. grid size n (i.e., decoded primal / dual optimum vs. n; higher is better).

AD3 is unable to handle constraints like (19), and is excluded. The running time is 338s for our algorithm, 546s for Lim-Relax, 680s for CPLEX, and 2051s for sub-gradient. Lim et al.'s method did not return a feasible solution after 1 hour. All methods attain similar dual objectives (the relative error is within 0.07%). For the decoded primal, our method and sub-gradient yield similar results (the relative error is within 0.15%). The decoded primal of Lim-Relax is about 1.2% larger than ours, and CPLEX fails to obtain a feasible solution.

Figure 4: Left: running time comparison for the 100-best solutions side-chain problems (Yanover, Meltzer, and Weiss 2006). For each method m we report (running time of the fastest method)/(running time of m), so that the fastest method always has a normalized running time of 1.0. Right: average normalized decoded primal vs. average running time for the series of constrained inference problems with K = 1, 2, 3, 4 constraints in the diverse M-best solutions problem. Error bars are standard deviations.

To further investigate, we test on synthetic data as in the previous section with constraint (19). CPLEX ran out of memory for all n ≥ 20, thus its results are excluded. The other algorithms converge to similar dual objective values (with a relative error less than 0.23%), and thus we compare running times and normalized decoded primals in Figure 3. Our method always outperforms the others in running time. For solution quality, our method attains a similar decoded primal to the sub-gradient method, and the other two methods converge to worse decoded primals.

Diverse M-best Solutions Problem

Diverse M-best solutions (Batra et al. 2012) is a variant of the M-best solutions problem in which we require the ℓ0 norm of the difference between the m-th and the top (m − 1) solutions to be larger than a threshold b > 1 (setting b = 1 recovers the original M-best solutions problem).

Here we consider 30 inference problems from the side-chain dataset (Yanover, Meltzer, and Weiss 2006). As in Fromer and Globerson (2009), we add dummy higher order clusters to tighten the LP relaxation for MAP inference. We use Spanning Tree Inequalities and Partitioning for Enumerating Solutions (STRIPES) (Fromer and Globerson 2009), where the M-best solutions problem is approached by solving a series of second-best-solution problems. For each second-best problem, one inequality constraint (as in (6)) is added to ensure that the solution differs from the MAP solution. We refer to Fromer and Globerson (2009) for details.

Batra (2012) provides another fast M-best solver via belief propagation, though their approach does not handle the added higher order clusters. The method of Lim, Jung, and Kohli (2014) proves too slow for M-best solutions problems on the side-chain dataset (it did not finish after several days). AD3 is also excluded, as it is unable to handle constraints like (6). Among all 30 side-chain problems, CPLEX fails for two, and all other methods find the top 100 solutions. A running time comparison is shown in Figure 4, from which we can see that our method achieves the best running time for 28 problems, and the second best on two.

Again we test on synthetic data, and again dummy higher-order clusters are added to tighten the LP relaxation. We consider the diverse 5-best solutions problem with parameter b = 12. In contrast to M-best problems, constraints (5) can be handled by AD3. However, the addition of higher order clusters significantly increases the running time of AD3, and thus we only consider n = 10. The diverse 5-best solutions problem reduces to a series of constrained inference problems with K constraints, K = 1, ..., 4. The running time and solution quality are compared in Figure 4. Our method always outperforms the others in speed, and achieves solution quality similar to sub-gradient and Lim-Relax. The method from Lim, Jung, and Kohli (2014) is excluded, since it did not return a result after 24 hours. CPLEX often failed to obtain a feasible decoded solution and is excluded from Figure 4.

Quadratic Knapsack Problem

We consider the Quadratic Knapsack Problem (QKP) with multiple constraints and 40K, 160K, and 360K variables. Variables are organized as an n × n grid, with entry c_ij > 0 only if i and j are adjacent in the grid (we consider i to be adjacent with itself). All non-zero c_ij are sampled uniformly from [0, 1]. In each problem, 5 constraints are added, where the w^k_i are sampled from [0, 1] and b_k is sampled from [∑_{i∈V} w^k_i / 2, ∑_{i∈V} w^k_i]. We use 6 different random seeds to generate the data.

In QKP problems, Lim and Lim-Relax are equivalent, thus we only report Lim. From the table in Figure 5, our method usually performs the best in speed. In terms of the normalized decoded primal, our method performs the best in 2 cases, and second best in 1 case. CPLEX and AD3 often fail to find a feasible decoded integer solution. Typical objective plots are shown in Figure 5. The scalability of the proposed approach w.r.t. the number of constraints K is also investigated by adding different numbers of constraints to QKP problems. Typical results are shown in Figure 5.

Conclusion and Future Work

Solving constrained combinatorial problems by MAP inference is often achieved by introducing an extra potential function for each constraint, which results in very high order potential functions. We presented an approach to solve constrained combinatorial problems using belief propagation in the dual without increasing the order, which not only reduces the computational complexity but also improves accuracy.

Currently the theoretical guarantees for our local decoding scheme are still loose, though in practice our simple scheme often provides feasible solutions with moderate accuracy. In our future research, we plan to provide tighter theoretical guarantees for our local decoding scheme, and to provide better decoding schemes for multi-constraint problems.

Acknowledgements

This work was supported by NSFC Projects 61301192, 61671385, 61301193, 61231016, and 61303123, and ARC Projects DP140102270 and DP160100703. Rui Yao's participation is supported by NSFC Project 61402483.


Figure 5: Experimental results on QKP. Left plot: decoded primal (dashed) and dual (solid) vs. time for QKP with 160K variables. AD3 and CPLEX do not decode a primal in every iteration, and plots of their primals are excluded. Right plot: typical running time vs. number of constraints K for QKP with 160K variables. Table: comparison of running time and normalized decoded primal on QKP.

               |V|     CPLEX    Sub-grad.   Lim       AD3      Ours
Av. time (s)   40K     66       214         823       159      91
               160K    8985     886         4326      749      364
               360K    4594     3950        9247      1999     852
Av. primal     40K     0.6664   0.9974      0.9999    0.6687   0.9998
               160K    0.1667   0.9947      0.99993   0.1667   0.99995
               360K    0.5000   0.9989      0.99992   0.5006   0.99998

References

Aguiar, P.; Xing, E. P.; Figueiredo, M.; Smith, N. A.; and Martins, A. 2011. An augmented Lagrangian approach to constrained MAP inference. In ICML.

Batra, D.; Yadollahpour, P.; Guzman-Rivera, A.; and Shakhnarovich, G. 2012. Diverse M-best solutions in Markov random fields. In ECCV, 1–16.

Batra, D. 2012. An efficient message-passing algorithm for the M-best MAP problem. In UAI.

Bayati, M.; Shah, D.; and Sharma, M. 2005. Maximum weight matching via max-product belief propagation. In ISIT, 1763–1767. IEEE.

Duchi, J.; Tarlow, D.; Elidan, G.; and Koller, D. 2006. Using combinatorial optimization within max-product belief propagation. In NIPS.

Frey, B. J., and Dueck, D. 2007. Clustering by passing messages between data points. Science.

Fromer, M., and Globerson, A. 2009. An LP view of the M-best MAP problem. In NIPS.

Globerson, A., and Jaakkola, T. 2007. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS.

Goyette, N.; Jodoin, P.-M.; Porikli, F.; Konrad, J.; and Ishwar, P. 2012. Changedetection.net: A new change detection benchmark dataset. In CVPRW, 1–8. IEEE.

Kolmogorov, V. 2006. Convergent tree-reweighted message passing for energy minimization. TPAMI.

Komodakis, N., and Paragios, N. 2009. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In CVPR, 2985–2992. IEEE.

Kumar, M. P.; Kolmogorov, V.; and Torr, P. H. 2009. An analysis of convex relaxations for MAP estimation of discrete MRFs. JMLR 10:71–106.

Li, S. Z. 2009. Markov random field modeling in image analysis. Springer Science & Business Media.

Lim, Y.; Jung, K.; and Kohli, P. 2014. Efficient energy minimization for enforcing label statistics. TPAMI.

Maleh, R.; Gilbert, A. C.; and Strauss, M. J. 2007. Sparse gradient image reconstruction done faster. In ICIP, volume 2, II–77. IEEE.

Meshi, O., and Globerson, A. 2011. An alternating direction method for dual MAP LP relaxation. Machine Learning and Knowledge Discovery in Databases 470–483.

Meshi, O.; Jaakkola, T.; and Globerson, A. 2012. Convergence rate analysis of MAP coordinate minimization algorithms. In NIPS, 3023–3031.

Mézard, M.; Parisi, G.; and Zecchina, R. 2002. Analytic and algorithmic solution of random satisfiability problems. Science 297(5582):812–815.

Potetz, B., and Lee, T. S. 2008. Efficient belief propagation for higher-order cliques using linear constraint nodes. CVIU 112(1):39–54.

Ravanbakhsh, S., and Greiner, R. 2014. Perturbed message passing for constraint satisfaction problems. arXiv:1401.6686.

Ravanbakhsh, S.; Rabbany, R.; and Greiner, R. 2014. Augmentative message passing for traveling salesman problem and graph partitioning. arXiv:1406.0941.

Schick, A.; Bauml, M.; and Stiefelhagen, R. 2012. Improving foreground segmentations with probabilistic superpixel Markov random fields. In CVPRW.

Schwing, A. G.; Hazan, T.; Pollefeys, M.; and Urtasun, R. 2012. Globally convergent dual MAP LP relaxation solvers using Fenchel-Young margins. In NIPS.

Shimony, S. 1994. Finding MAPs for belief networks is NP-hard. AI 68(2):399–410.

Sontag, D.; Meltzer, T.; Globerson, A.; Weiss, Y.; and Jaakkola, T. 2008. Tightening LP relaxations for MAP using message-passing. In UAI, 503–510. AUAI Press.

Sontag, D.; Globerson, A.; and Jaakkola, T. 2011. Introduction to dual decomposition for inference. In Optimization for Machine Learning. MIT Press.

Sun, M.; Telaprolu, M.; Lee, H.; and Savarese, S. 2012. Efficient and exact MAP-MRF inference using branch and bound. In AISTATS, 1134–1142.

Tarlow, D.; Givoni, I. E.; and Zemel, R. S. 2010. HOP-MAP: Efficient message passing with high order potentials. In AISTATS, 812–819.

Wang, H.; Kochenberger, G.; and Glover, F. 2012. A computational study on the quadratic knapsack problem with multiple constraints. Computers & Operations Research 39(1):3–11.

Werner, T. 2008. High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In CVPR, 1–8. IEEE.

Yanover, C.; Meltzer, T.; and Weiss, Y. 2006. Linear programming relaxations and belief propagation: an empirical study. JMLR 7:1887–1907.


Supplementary file

NP-hardness of Problem (8)

First we introduce the following lemma.

Lemma 1. The knapsack problem

    max_{x∈{0,1}^N} ∑^N_{i=1} c_i x_i,   s.t. ∑^N_{i=1} w_i x_i ≤ b,    (20)

is NP-hard, where N is a positive integer and all c_i and w_i are positive real numbers.

Proof. See the book: Kellerer, Hans, Ulrich Pferschy, and David Pisinger. Introduction to NP-Completeness of knapsack problems. Springer Berlin Heidelberg, 2004.

Now we show that the knapsack problem (20) is a special case of the optimization problem (8). Let C = {{1}, {2}, ..., {N}, {N + 1}}, and let

    φ_i(x_i) = w_i x_i − b/N, ∀i = 1, ..., N;   φ_{N+1}(x_{N+1}) = 0,    (21a)
    ϑ_i(x_i) = c_i x_i, ∀i = 1, ..., N;   ϑ_{N+1}(x_{N+1}) = 0.    (21b)

Then (20) can be reformulated as

    max_{x∈{x′ | x′_{N+1} = 0}} I_∞(∑^{N+1}_{i=1} φ_i(x_i) ≤ 0) + ∑^{N+1}_{i=1} ϑ_i(x_i),    (22)

which is a special case of (8). Thus the problem (8) must be NP-hard.

Derivation of Dual

Firstly, removal of redundant constraints in (7) yields

    µ* = argmax_µ ∑_{c∈C} ∑_{x_c} µ_c(x_c) θ_c(x_c),    (23a)
    s.t. ∀c ∈ C, x_c: µ_c(x_c) ≥ 0, ∑_{x_c} µ_c(x_c) = 1,    (23b)
    ∀c ∈ C, s ≺ c, x_s: ∑_{x_c \ x_s} µ_c(x_c) = µ_s(x_s),    (23c)
    ∑_{c∈C} ∑_{x_c} φ^k_c(x_c) µ_c(x_c) ≤ 0, k ∈ K.    (23d)

Let λ_{c→s}(x_s) be the Lagrange multiplier corresponding to (23c) and γ_k the Lagrange multiplier corresponding to (23d); then by standard Lagrangian duality the dual problem of (23) is

    min_{λ, γ≥0} max_µ  ∑_{c∈C} ∑_{x_c} µ_c(x_c) θ_c(x_c)
                        − ∑_{c∈C} ∑_{s≺c} ∑_{x_s} [ ∑_{x_c \ x_s} µ_c(x_c) − µ_s(x_s) ] λ_{c→s}(x_s)
                        − ∑^K_{k=1} γ_k ∑_{c∈C} ∑_{x_c} µ_c(x_c) φ^k_c(x_c),    (24)
    s.t. ∀c ∈ C, x_c: µ_c(x_c) ≥ 0, ∑_{x_c} µ_c(x_c) = 1.

Rearrangement of variables results in the following equivalent problem:

    min_{λ, γ≥0} ∑_{c∈C} max_{µ_c(x_c)} ∑_{x_c} µ_c(x_c) [ θ_c(x_c) − ∑_{s≺c} λ_{c→s}(x_s) + ∑_{r∈C, c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ],    (25)
    s.t. ∀c ∈ C, x_c: µ_c(x_c) ≥ 0, ∑_{x_c} µ_c(x_c) = 1.

It is obvious that the above problem is equivalent to

    min_{λ, γ≥0} ∑_{c∈C} max_{x_c} [ θ_c(x_c) − ∑_{s≺c} λ_{c→s}(x_s) + ∑_{r∈C, c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ].    (26)

Then by the work of Kolmogorov (2006) (Lemma 6.3), (26) is equivalent to

    min_{θ̂, γ≥0} ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ],   s.t. ∑_{c∈C} θ̂_c(x_c) = ∑_{c∈C} θ_c(x_c).    (27)
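As a quick sanity check of weak duality in (27): for any γ ≥ 0 and any reparametrization, the dual objective upper-bounds the optimum of the constrained problem. A brute-force toy verification (our own sketch; unary-only model with θ̂ = θ):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
theta = rng.uniform(0, 1, (3, 2))              # 3 nodes, 2 states, unary-only toy model
phi = rng.uniform(0, 1, (3, 2))
phi[:, 0] -= 1.0                               # state 0 reduces the constraint, so x = 0 is feasible

assignments = list(itertools.product(range(2), repeat=3))
feasible = [x for x in assignments if sum(phi[i, xi] for i, xi in enumerate(x)) <= 0]
primal_opt = max(sum(theta[i, xi] for i, xi in enumerate(x)) for x in feasible)

for gamma in [0.0, 0.5, 1.0, 5.0]:
    dual_val = sum((theta[i] - gamma * phi[i]).max() for i in range(3))   # theta_hat = theta
    assert dual_val >= primal_opt - 1e-9       # weak duality as in (27)
    print(gamma, dual_val, ">=", primal_opt)
```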

Proof of Proposition 1

Proposition 1. g(γ) is a piecewise linear and convex function of γ.

Proof. It is trivial that g(γ) is a piecewise linear function of γ, thus we only prove that g(γ) is a convex function of γ. For arbitrary γ¹ ≥ 0, γ² ≥ 0, we let

    θ̂¹ = argmin_{θ̂} ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − γ¹φ_c(x_c) ],   s.t. ∑_{c∈C} θ̂_c(x_c) = ∑_{c∈C} θ_c(x_c),    (28a)
    θ̂² = argmin_{θ̂} ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − γ²φ_c(x_c) ],   s.t. ∑_{c∈C} θ̂_c(x_c) = ∑_{c∈C} θ_c(x_c).    (28b)

Then ∀t ∈ [0, 1], we have

    t g(γ¹) + (1 − t) g(γ²)
    = t ∑_{c∈C} max_{x_c} [ θ̂¹_c(x_c) − γ¹φ_c(x_c) ] + (1 − t) ∑_{c∈C} max_{x_c} [ θ̂²_c(x_c) − γ²φ_c(x_c) ]
    = ∑_{c∈C} max_{x_c} t [ θ̂¹_c(x_c) − γ¹φ_c(x_c) ] + ∑_{c∈C} max_{x_c} (1 − t) [ θ̂²_c(x_c) − γ²φ_c(x_c) ].    (29)

By the definition of θ̂¹ and θ̂² we have

    ∑_{c∈C} [ t θ̂¹_c(x_c) + (1 − t) θ̂²_c(x_c) ] = t ∑_{c∈C} θ_c(x_c) + (1 − t) ∑_{c∈C} θ_c(x_c) = ∑_{c∈C} θ_c(x_c).    (30)

As a result, by (29) and (30) we have

    t g(γ¹) + (1 − t) g(γ²)
    ≥ ∑_{c∈C} max_{x_c} { t θ̂¹_c(x_c) + (1 − t) θ̂²_c(x_c) − [ t γ¹ + (1 − t) γ² ] φ_c(x_c) }
    ≥ min_{θ̂} ∑_{c∈C} max_{x_c} { θ̂_c(x_c) − [ t γ¹ + (1 − t) γ² ] φ_c(x_c) },   s.t. ∑_{c∈C} θ̂_c(x_c) = ∑_{c∈C} θ_c(x_c)
    = g(t γ¹ + (1 − t) γ²),    (31)

which completes the proof.

Proof of Proposition 2

Proposition 2. δg(γ) is a sub-gradient of g(γ).

Proof. For arbitrary γ ≥ 0, we let

    θ̂^{γ,∗} = argmin_{θ̂} ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − γφ_c(x_c) ],   s.t. ∑_{c∈C} θ̂_c(x_c) = ∑_{c∈C} θ_c(x_c).    (32)

Then for some γ⁰ ≥ 0 we have

    g(γ) − g(γ⁰) = ∑_{c∈C} max_{x_c} [ θ̂^{γ,∗}_c(x_c) − γφ_c(x_c) ] − ∑_{c∈C} max_{x_c} [ θ̂^{γ⁰,∗}_c(x_c) − γ⁰φ_c(x_c) ].    (33)

∀c ∈ C, we let

    x^{γ⁰,∗}_c = argmax_{x_c} [ θ̂^{γ⁰,∗}_c(x_c) − γ⁰φ_c(x_c) ].    (34)

Then, by the fact that θ̂^{γ,∗} and θ̂^{γ⁰,∗} are reparametrizations of θ, we have

    g(γ) − g(γ⁰) = ∑_{c∈C} max_{x_c} [ θ̂^{γ,∗}_c(x_c) − γφ_c(x_c) ] − ∑_{c∈C} [ θ̂^{γ⁰,∗}_c(x^{γ⁰,∗}_c) − γ⁰φ_c(x^{γ⁰,∗}_c) ]
    ≥ ∑_{c∈C} [ θ̂^{γ,∗}_c(x^{γ⁰,∗}_c) − γφ_c(x^{γ⁰,∗}_c) ] − ∑_{c∈C} [ θ̂^{γ⁰,∗}_c(x^{γ⁰,∗}_c) − γ⁰φ_c(x^{γ⁰,∗}_c) ]
    = −(γ − γ⁰) ∑_{c∈C} φ_c(x^{γ⁰,∗}_c) = (γ − γ⁰) δg(γ⁰),    (35)

which completes the proof.

Proof of Proposition 3

Proposition 3. Let γ* = argmin_γ g(γ). For arbitrary γ ≥ 0 we have that (1) if δg(γ) ≥ 0, then γ ≥ γ*; (2) if δg(γ) ≤ 0, then γ ≤ γ*.

Proof. By Proposition 2, δg(γ) is a sub-gradient of g(γ). Then for arbitrary γ ≥ 0 we have

    (γ* − γ) δg(γ) ≤ g(γ*) − g(γ) ≤ 0.    (36)

Thus if δg(γ) ≥ 0 we must have γ* − γ ≤ 0, and if δg(γ) ≤ 0 we must have γ* − γ ≥ 0, which completes the proof.

Proof of Proposition 4

Decoding and Initializing. We decode an integer solution by x_i = argmax_{x_i} b^{⋆,γ}_i(x_i), which is possible due to Assumption 1. In practice, we initialize γ_min = 0 and γ_max = 0.1. If δg(γ_max) < 0, we update γ_min = γ_max and increase γ_max by a factor of 2; we iterate until we find some γ_max s.t. δg(γ_max) ≥ 0. Since binary search takes logarithmic time, performance is not sensitive to the initial γ_max.

Algorithm 3: Initializing procedure for Algorithm 1
  input: G = (V, C); θ_c(x_c), φ_c(x_c), c ∈ C; ε.
  output: γ_min, γ_max, x̄.
  1: γ_min = 0, γ_max = 0.1
  2: f_max = −∞
  3: repeat
  4:   approximately compute θ̂^{⋆,γ_max} in (13) by BP with the updating rules in (15)
  5:   x^{⋆,γ_max}_c = argmax_{x_c} [ θ̂^{⋆,γ_max}_c(x_c) − γ_max φ_c(x_c) ]
  6:   δg(γ_max) = −∑_{c∈C} φ_c(x^{⋆,γ_max}_c)
  7:   decode x by x_i = argmax_{x_i} [ θ̂^{⋆,γ_max}_i(x_i) − γ_max φ_i(x_i) ]
  8:   if x is feasible and ∑_{c∈C} θ_c(x_c) > f_max then
  9:     update f_max = ∑_{c∈C} θ_c(x_c)
 10:     update x̄ to x
 11:   end
 12:   if δg(γ_max) ≥ 0 then break
 13:   else γ_min = γ_max; γ_max = 2 γ_max
 14: until true

Proposition 4. For constraints of the form ∑_{i∈V} φ_i(x_i) ≤ 0, if the feasible set is nonempty, then with the above initializing strategy Algorithm 1 must return some feasible x*.
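The bracketing logic of Algorithm 3 can be written against an abstract sub-gradient oracle; the sketch below (our own, with a hypothetical `delta_g` callable) returns an interval [γ_min, γ_max] containing γ*:

```python
def bracket_gamma(delta_g, gamma_max=0.1, growth=2.0, max_doublings=60):
    """Grow gamma_max until delta_g(gamma_max) >= 0, so [gamma_min, gamma_max] brackets gamma*.

    delta_g: callable returning a sub-gradient of g at gamma (Proposition 2)."""
    gamma_min = 0.0
    for _ in range(max_doublings):
        if delta_g(gamma_max) >= 0:
            return gamma_min, gamma_max
        gamma_min, gamma_max = gamma_max, growth * gamma_max
    raise RuntimeError("no bracket found; the constraint may be infeasible")

# toy usage: g(gamma) = max(1 - gamma, 0.2 * gamma) has its minimum at gamma* = 1/1.2,
# and a valid sub-gradient is -1 left of gamma* and +0.2 right of it.
lo, hi = bracket_gamma(lambda g: -1.0 if g < 1 / 1.2 else 0.2)
print(lo, "<= gamma* <=", hi)
```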

Proof. In this proposition, we consider the optimization problem

    argmax_x ∑_{c∈C} θ_c(x_c),   s.t. ∑_{i∈V} φ_i(x_i) ≤ 0.    (37)

For arbitrary γ, if δg(γ) ≥ 0, then by the definition of δg(γ) there must exist some θ̂ ∈ Λ(θ) s.t.

    x̄_i = argmax_{x_i} [ θ̂_i(x_i) − γφ_i(x_i) ],   δg(γ) = −∑_{i∈V} φ_i(x̄_i) ≥ 0.    (38)

Thus x̄ is feasible for (37). Hence, using the initializing scheme, if we find some γ_max s.t. δg(γ_max) ≥ 0, then a corresponding feasible solution exists. As a result, we only need to prove that such a γ_max always exists. By the fact that the feasible set is nonempty, we have

    min_x ∑_{i∈V} φ_i(x_i) = ∑_{i∈V} min_{x_i} φ_i(x_i) ≤ 0.    (39)

Thus, for a sufficiently large γ and arbitrary θ̂ ∈ Λ(θ), there exists x̄ s.t.

    x̄_i = argmax_{x_i} [ θ̂_i(x_i) − γφ_i(x_i) ] = argmin_{x_i} φ_i(x_i),    (40)

and such an x̄ is always feasible. As a result, by the above initializing scheme and Algorithm 1, we can always find a feasible x̄.

Proof of Proposition 5

Proposition 5. Let θ̂ and γ be the solution provided by Algorithm 1 or 2. The solution is exact if
1. ∃x̂ s.t. ∀c ∈ C, x̂_c ∈ argmax_{x_c} [ θ̂_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ],
2. ∀k ∈ K, γ_k ∑_{c∈C} φ^k_c(x̂_c) = 0.

Proof. By the two conditions, we have

    ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ] = max_x ∑_{c∈C} [ θ̂_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ]
    = ∑_{c∈C} [ θ̂_c(x̂_c) − ∑^K_{k=1} γ_k φ^k_c(x̂_c) ] = ∑_{c∈C} θ_c(x̂_c),    (41)

and by the second condition x̂ is feasible, which completes the proof.

Proof of Proposition 6

Proposition 6. Let θ̂ and γ be the solution provided by Algorithm 1 or 2. If there exists some feasible x̂ s.t. x̂_c ∈ argmax_{x_c} [ θ̂_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ], then the integrality gap of the proposed relaxation (7) is at most −∑^K_{k=1} γ_k ∑_{c∈C} φ^k_c(x̂_c), i.e.

    ∑_{c∈C} ∑_{x_c} µ*_c(x_c) θ_c(x_c) − ∑_{c∈C} θ_c(x*_c) ≤ −∑^K_{k=1} γ_k ∑_{c∈C} φ^k_c(x̂_c),

where µ* is a solution of (7) and x* is a solution of (2).

Proof. By duality and the feasibility of x̂, we have

    ∑_{c∈C} ∑_{x_c} µ*_c(x_c) θ_c(x_c) − ∑_{c∈C} θ_c(x*_c)
    ≤ ∑_{c∈C} max_{x_c} [ θ̂_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ] − ∑_{c∈C} θ_c(x*_c)
    = ∑_{c∈C} [ θ̂_c(x̂_c) − ∑^K_{k=1} γ_k φ^k_c(x̂_c) ] − ∑_{c∈C} θ_c(x*_c)
    ≤ ∑_{c∈C} [ θ̂_c(x̂_c) − ∑^K_{k=1} γ_k φ^k_c(x̂_c) ] − ∑_{c∈C} θ_c(x̂_c)
    = −∑^K_{k=1} γ_k ∑_{c∈C} φ^k_c(x̂_c),

which completes the proof.

Approximate bound for submodular potentials

For submodular potentials and constraints, we have the following proposition:

Proposition 7. If ∀c ∈ C and ∀γ the augmented potentials θ_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) are submodular, then (7) is equivalent to min_{γ≥0} max_x ∑_{c∈C} [ θ_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ].

Proof. By Kumar, Kolmogorov, and Torr (2009), if ∀c ∈ C and ∀γ the augmented potentials θ_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) are submodular, then we must have

    max_x ∑_{c∈C} [ θ_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ] = max_{µ∈M_L(G)} ∑_{c∈C} ∑_{x_c} µ_c(x_c) [ θ_c(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ],    (42)

where the feasible set M_L(G) is defined as

    M_L(G) = { µ | ∀c ∈ C: ∑_{x_c} µ_c(x_c) = 1;  ∀c, s ∈ C, s ⊂ c: ∑_{x_c \ x_s} µ_c(x_c) = µ_s(x_s) }.    (43)

Thus the proposition must hold.

Derivation of Belief Propagation Scheme

In this section, we derive the proposed belief propagation scheme (15). The outline is as follows: (1) derive the message updating rules; (2) from the derived message updating rules, derive a simpler reparametrization updating rule.

Message Updating Rules

We derive the message updating rules by applying coordinate descent to the dual objective (26) with fixed γ (note that (26) is equivalent to the dual objective (9)). With fixed γ, the sub-optimization problem becomes

    min_λ ∑_{c∈C} max_{x_c} [ θ_c(x_c) − ∑_{s≺c} λ_{c→s}(x_s) + ∑_{r∈C, c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ].    (44)

In each step of the belief propagation, we pick one particular c and update λ_{c→s}(x_s), s ≺ c, with all other messages fixed. The sub-optimization problem becomes

    min_{λ_{c→s}(x_s), s≺c}  max_{x_c} [ θ_c(x_c) − ∑_{s≺c} λ_{c→s}(x_s) + ∑_{r∈C, c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ]
                             + ∑_{s≺c} max_{x_s} [ θ_s(x_s) − ∑_{t≺s} λ_{s→t}(x_t) + ∑_{c′∈C, s≺c′} λ_{c′→s}(x_s) − ∑^K_{k=1} γ_k φ^k_s(x_s) ].    (45)

The closed-form solution of the sub-optimization problem (45) is captured in the following proposition (for brevity we write n_c = |{s′ | s′ ≺ c}|).

Proposition 8. One closed-form solution of (45) is

    λ^⋆_{c→s}(x_s) = λ_{c→s}(x_s) − b^γ_s(x_s) + (1/n_c) max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ],   ∀s ≺ c.    (46)

First we introduce the following lemmas as preliminaries to the proof of Proposition 8. The first lemma will also be used in the derivation of the reparametrization updating rules.

Lemma 2. For the sub-optimization problem (45), we have that ∀s ≺ c

    θ_s(x_s) − ∑_{t≺s} λ_{s→t}(x_t) + ∑_{s≺c′, c′≠c} λ_{c′→s}(x_s) + λ^⋆_{c→s}(x_s) − ∑^K_{k=1} γ_k φ^k_s(x_s)
    = (1/n_c) max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ].    (47)

Proof. By definition we have

    θ_s(x_s) − ∑_{t≺s} λ_{s→t}(x_t) + ∑_{s≺c′, c′≠c} λ_{c′→s}(x_s) + λ^⋆_{c→s}(x_s) − ∑^K_{k=1} γ_k φ^k_s(x_s)
    = θ_s(x_s) − ∑_{t≺s} λ_{s→t}(x_t) + ∑_{s≺c′, c′≠c} λ_{c′→s}(x_s) − ∑^K_{k=1} γ_k φ^k_s(x_s)
      + λ_{c→s}(x_s) − b^γ_s(x_s) + (1/n_c) max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ]
    = θ_s(x_s) − ∑_{t≺s} λ_{s→t}(x_t) + ∑_{s≺c′} λ_{c′→s}(x_s) − ∑^K_{k=1} γ_k φ^k_s(x_s)
      − b^γ_s(x_s) + (1/n_c) max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ]
    = b^γ_s(x_s) − b^γ_s(x_s) + (1/n_c) max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ]
    = (1/n_c) max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ].    (48)

Lemma 3. For the sub-optimization problem (45), with λ^⋆_{c→s} as in (46), we have

    max_{x_c} [ θ_c(x_c) − ∑_{s≺c} λ^⋆_{c→s}(x_s) + ∑_{c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ] = 0.    (49)

Proof. By definition we have

    θ_c(x_c) − ∑_{s≺c} λ^⋆_{c→s}(x_s) + ∑_{c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c)
    = θ_c(x_c) − ∑_{s≺c} [ λ_{c→s}(x_s) − b^γ_s(x_s) + (1/n_c) max_{x_c \ x_s} ( b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ) ]
      + ∑_{c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c)
    = b^γ_c(x_c) + ∑_{s≺c} b^γ_s(x_s) − (1/n_c) ∑_{s≺c} max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ].    (50)

It is trivial that, for every s ≺ c and every x_c,

    b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ≤ max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ],

so the maximum over x_c of (50) is at most 0. On the other hand, since max_{x_c \ x_s}[·] ≤ max_{x_c}[·], the maximum over x_c of (50) is at least

    max_{x_c} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ] − (1/n_c) ∑_{s≺c} max_{x_c} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ] = 0.

As a result, the maximum over x_c of (50) equals 0, which proves (49).

Now we prove Proposition 8.

Proof of Proposition 8. Obviously, a lower bound on the objective of (45) is obtained by moving the separate maximizations inside a single maximization:

    max_{x_c} [ θ_c(x_c) − ∑_{s≺c} λ_{c→s}(x_s) + ∑_{r∈C, c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ]
    + ∑_{s≺c} max_{x_s} [ θ_s(x_s) − ∑_{t≺s} λ_{s→t}(x_t) + ∑_{c′∈C, s≺c′} λ_{c′→s}(x_s) − ∑^K_{k=1} γ_k φ^k_s(x_s) ]
    ≥ max_{x_c} [ b^γ_c(x_c) + ∑_{s≺c} b^γ_s(x_s) ].    (51)

Now we show that this lower bound is attained by the closed-form solution (46). By Lemmas 2 and 3,

    ∑_{s≺c} max_{x_s} [ θ_s(x_s) − ∑_{t≺s} λ_{s→t}(x_t) + ∑_{s≺c′, c′≠c} λ_{c′→s}(x_s) + λ^⋆_{c→s}(x_s) − ∑^K_{k=1} γ_k φ^k_s(x_s) ]
    + max_{x_c} [ θ_c(x_c) − ∑_{s≺c} λ^⋆_{c→s}(x_s) + ∑_{c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c) ]
    = (1/n_c) ∑_{s≺c} max_{x_s} max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ] + 0
    = max_{x_c} [ b^γ_c(x_c) + ∑_{s≺c} b^γ_s(x_s) ],    (52)

which means that (46) is a solution of (45).

Reparametrization Updating Rules

In this section, we derive the reparametrization updating rules. By Lemma 2, the optimal reparametrized belief b^{γ,⋆}_s(x_s) corresponding to (45) is

    b^{γ,⋆}_s(x_s) = θ_s(x_s) − ∑_{t≺s} λ_{s→t}(x_t) + ∑_{s≺c′, c′≠c} λ_{c′→s}(x_s) + λ^⋆_{c→s}(x_s) − ∑^K_{k=1} γ_k φ^k_s(x_s)
    = (1/n_c) max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ].    (53)

By definition, the optimal b^{γ,⋆}_c(x_c) corresponding to (45) is

    b^{γ,⋆}_c(x_c) = θ_c(x_c) − ∑_{s≺c} λ^⋆_{c→s}(x_s) + ∑_{c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c)
    = θ_c(x_c) − ∑_{s≺c} [ λ_{c→s}(x_s) − b^γ_s(x_s) + (1/n_c) max_{x_c \ x_s} ( b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ) ]
      + ∑_{c≺r} λ_{r→c}(x_c) − ∑^K_{k=1} γ_k φ^k_c(x_c)
    = b^γ_c(x_c) + ∑_{s≺c} b^γ_s(x_s) − (1/n_c) ∑_{s≺c} max_{x_c \ x_s} [ b^γ_c(x_c) + ∑_{s′≺c} b^γ_{s′}(x_{s′}) ]
    = b^γ_c(x_c) + ∑_{s≺c} b^γ_s(x_s) − ∑_{s≺c} b^{γ,⋆}_s(x_s).    (54)

Furthermore, the reparametrization update can be carried out efficiently as follows:

    b^γ_c(x_c) ← b^γ_c(x_c) + ∑_{s≺c} b^γ_s(x_s),   ∀x_c,    (55a)
    b^γ_s(x_s) ← max_{x_c \ x_s} b^γ_c(x_c) / |{s′ | s′ ≺ c}|,   ∀x_s,    (55b)
    b^γ_c(x_c) ← b^γ_c(x_c) − ∑_{s≺c} b^γ_s(x_s),   ∀x_c,    (55c)
    θ̂*_s(x_s) = b^γ_s(x_s) + γφ_s(x_s),   θ̂*_c(x_c) = b^γ_c(x_c) + γφ_c(x_c).    (55d)

Figure 6: Running time comparison of exact δg(γ) and approximate δg(γ) on M-best problems. The plots show the results on problem 1a6m from the side-chain dataset. The two strategies result in the same solution, but approximately computing δg(γ) can significantly improve the speed. Left: time comparison for Algorithm 1. Right: time comparison for projected sub-gradients.

Comparison of Different Belief Propagation Schemes

In this section, we compare two different approaches to optimizing the sub-problem (13). The first approach is to use the proposed BP to approximately optimize (13). The second approach is to use a smoothing technique to smooth (13) and then use sum-product to compute an ε-optimal θ̂^⋆.

The performance of the two approaches is compared on M-best problems from the side-chain dataset (Yanover, Meltzer, and Weiss 2006); in the comparison, we solve the M-best problems using the STRIPES framework (details of the dataset and the STRIPES framework can be found in Section 5.3). We found that the two approaches result in the same solution, but the former uses much less time than the latter. Typical comparison results are shown in Figure 6.