Working Paper

Constraint Aggregation Principle in Convex Optimization

Yuri M. Ermoliev
Arkadii V. Kryazhimskii
Andrzej Ruszczyński

WP-95-015
February 1995 (revised September 1995)

IIASA  International Institute for Applied Systems Analysis, A-2361 Laxenburg, Austria
Telephone: 43 2236 807    Fax: 43 2236 71313    E-Mail: [email protected]


Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute, its National Member Organizations, or other organizations supporting the work.

Abstract

A general constraint aggregation technique is proposed for convex optimization problems. At each iteration a set of convex inequalities and linear equations is replaced by a single surrogate inequality formed as a linear combination of the original constraints. After solving the simplified subproblem, new aggregation coefficients are calculated and the iteration continues. This general aggregation principle is incorporated into a number of specific algorithms. Convergence of the new methods is proved and speed of convergence analyzed. Next, a dual interpretation of the method is provided and application to decomposable problems is discussed. Finally, a numerical illustration is given.

Key words: Nonsmooth optimization, surrogate constraints, subgradient methods, decomposition.


Constraint Aggregation Principle in Convex Optimization

Yuri M. Ermoliev
Arkadii V. Kryazhimskii
Andrzej Ruszczyński

1  Introduction

In this article we consider the problem of minimizing a convex non-differentiable function subject to a large number of linear and nonlinear (convex) constraints. The study of such problems is motivated by new challenges arising in the analysis of large scale systems with the lack of smooth behavioral properties. Our objective is to develop a general constraint aggregation principle which can be used in a variety of methods for solving convex optimization problems of the form

    min f(x)                                (1.1)
    h_j(x) ≤ 0,  j = 1, …, m,               (1.2)
    x ∈ X.                                  (1.3)

We assume throughout this paper that the functions f : ℝ^n → ℝ and h_j : ℝ^n → ℝ, j = 1, …, m, are convex and X ⊂ ℝ^n is convex and compact. We also assume that the feasible set defined by (1.2)-(1.3) is non-empty, which guarantees that the problem has an optimal solution.

We are especially interested in the case when the number of constraints m is very large, and the number of variables n is large, too. Problems with very many constraints arise in a natural way in various application areas, like stochastic programming problems with constraints that have to hold with probability one, in optimal control problems with state constraints, etc. (see, e.g., [5]).

For very large problems the direct application of such procedures as bundle methods (see, e.g., [7, 6]) may become difficult. An alternative approach is provided by non-monotone subgradient methods dating back to [1, 14, 18], which found numerous applications and extensions to various classes of problems, such as discrete optimization, design of robust decomposition procedures, general problems of stochastic optimization, min-max and more general game-theoretical problems (see [2, 12, 13, 19]). However, treating general convex constraints within subgradient methods may also be difficult. This requires either projection (impractical for large m and n) or additional penalty/multiplier iterations, which for general f and h_j have (so far) only theoretical importance in this framework.

We are going to follow a different path. We shall assume that the structure of X is simple and the main difficulty comes from the large number of constraints (1.2). To overcome this, we shall replace the original problem by a sequence of problems in which the complicating constraints (1.2) are represented by one surrogate inequality

    Σ_{j=1}^m s_j^k h_j(x) ≤ 0,             (1.4)

where s^k ≥ 0 are iteratively modified aggregation coefficients. In this way a substantial simplification of (1.1)-(1.3) is achieved, because (1.4) inherits linearity or differentiability properties of (1.2). We shall show how to update the aggregation coefficients and how to use the solutions of the simplified problems to arrive at the solution of (1.1)-(1.3). It is worth noting here that our approach is fundamentally different from the aggregation provided by Lagrange multipliers and also works for problems for which duality does not hold. It is also different from the mostly heuristic aggregation methods discussed in [16].

In section 2 we shall consider the simplified version of the problem having only linear constraints and we shall develop a continuous time feedback rule using constraint aggregation. In section 3 a reformulation of this procedure is presented, which allows for various extensions and generalizations. We shall prove a simple lemma which is an analog of the Lyapunov function approach in the study of convergence of non-monotone dynamic (iterative) processes. Using this result, the convergence of the aggregation procedure in the simplified case is proved, under the assumption that the problem of minimizing f subject to the surrogate constraint can be solved easily. In section 4 this assumption is relaxed: it is shown that it is sufficient to make a step in the direction of the subgradient and project the result on the non-stationary surrogate constraint. Section 5 generalizes the aggregation principle to convex non-linear inequalities. Two aggregation schemes are analyzed: direct aggregation (1.4) and aggregation combined with linearization, which replaces (1.2) with one linear inequality. In section 6, we follow this idea further and develop a class of linearization methods with aggregation. Section 7 is devoted to the analysis of the speed of convergence. We derive estimates of the rate of convergence and we evaluate the complexity of the method for linear programming problems. In section 8 we interpret the method from the dual viewpoint and we sketch out a possible application to decomposition techniques. Finally, in section 9 we provide a numerical illustration of the properties of the method.

2  The continuous time procedure

Let us consider the simplified version of the problem:

    min f(x)                                (2.1)
    Ax = b,                                 (2.2)
    x ∈ X.                                  (2.3)

Following [10], we consider the dynamic system

    ẋ(t) = u(t),  x(0) = 0,                 (2.4)

operating on the time interval ℝ₊ = [0, ∞). Let us assume that

    u(t) ∈ X                                (2.5)

for all t ≥ 0. The control function u(·) will be constructed in such a way that for the corresponding trajectory x(·) of the system (2.4) the ratio x(t)/t will approach the optimal solution set X* of (2.1)-(2.3) as t → ∞.

In general, for a feedback control

    u(t) = U(t, x(t))

the closed system (2.4) may have no trajectories understood as solutions of the above differential equation (if, for example, U is discontinuous). We shall, therefore, redefine the notion of trajectory using the concept of Δ-trajectories coming from the theory of differential games [9].

For any Δ > 0 the Δ-trajectory x_Δ(·) under the feedback rule U is defined as follows:

    x_Δ(t) = x_Δ(t_k) + U(t_k, x_Δ(t_k))(t − t_k),      (2.6)

where x_Δ(0) = 0, t_k = kΔ, t ∈ [t_k, t_{k+1}], k = 0, 1, 2, …. A function x(·) : ℝ₊ → ℝ^n is called a trajectory under U if for every bounded subinterval of ℝ₊, x(·) is the uniform limit of a sequence of x_Δ(·) with Δ → 0.

Let us note that (2.6) yields

    x_Δ(t)/t = (1/t) ∫₀ᵗ ẋ_Δ(τ) dτ.

Hence, by (2.5) and the convexity and closedness of X,

    x_Δ(t)/t ∈ X.

A feedback ensuring that the above ratio approaches X* is defined as follows:

    U(t, x) ∈ arg min { f(u) : u ∈ X, ⟨Ax − tb, Au − b⟩ ≤ 0 }.      (2.7)

Note that U(t, x) ≠ ∅, since the feasible set (2.2)-(2.3) is non-empty. Let the constant K_A be such that |Ax − b|² ≤ K_A for all x ∈ X.

Theorem 2.1. For any x_Δ(t) under the feedback (2.7) the following inequalities hold for all t ≥ 0:

    |A(x_Δ(t)/t) − b|² ≤ K_A Δ/t,
    f(x_Δ(t)/t) ≤ f*,

where f* denotes the optimal value of (2.1)-(2.3).

Proof. Since ⟨Ax_Δ(t_k) − t_k b, Au(t_k) − b⟩ ≤ 0, for t ∈ [t_k, t_{k+1}] we have

    |Ax_Δ(t) − tb|²
      = |Ax_Δ(t_k) − t_k b|² + 2⟨Ax_Δ(t_k) − t_k b, Au(t_k) − b⟩(t − t_k) + |Au(t_k) − b|²(t − t_k)²
      ≤ |Ax_Δ(t_k) − t_k b|² + K_A (t − t_k)².        (2.8)

The first assertion follows then by induction.

Next, for t ∈ [t_k, t_{k+1}],

    ∫₀ᵗ f(ẋ_Δ(τ)) dτ ≤ ∫₀^{t_k} f(ẋ_Δ(τ)) dτ + f*(t − t_k),

because f(u(t_k)) ≤ f*. By induction,

    ∫₀ᵗ f(ẋ_Δ(τ)) dτ ≤ t f*.

Dividing by t and using the convexity of f we get

    f(x_Δ(t)/t) ≤ (1/t) ∫₀ᵗ f(ẋ_Δ(τ)) dτ ≤ f*,

which proves the second assertion. □

3  The basic iterative procedure

Let us adopt the following notation in the continuous time procedure: x^k = x_Δ(t_k)/t_k, t_k = kΔ, α_k = 1/(k + 1), k = 0, 1, 2, … (for k = 0 we set x⁰ = 0). Then (2.6)-(2.7) can be rewritten as

    x^{k+1} = x^k + α_k (u^k − x^k),  k = 0, 1, 2, …,      (3.1)

where u^k is the solution of the subproblem

    min f(u)                                (3.2)
    ⟨Ax^k − b, Au − b⟩ ≤ 0,                 (3.3)
    u ∈ X.                                  (3.4)

This reformulation is the starting point for our developments. Procedure (3.1)-(3.4) can be viewed as an iterative constraint aggregation method. The initial equality constraints (2.2) are replaced by a sequence of non-stationary scalar inequalities (3.3). Obviously, (3.3) is a relaxation of (2.2), so u^k exists and f(u^k) ≤ f*. It is rather clear that the sequence {x^k} can converge to the optimal solution set X* only if |Ax^k − b|² → 0 as k → ∞. This can be analyzed by the following key inequality:

    |Ax^{k+1} − b|² = |(1 − α_k)(Ax^k − b) + α_k(Au^k − b)|²
      = (1 − α_k)²|Ax^k − b|² + 2α_k(1 − α_k)⟨Ax^k − b, Au^k − b⟩ + α_k²|Au^k − b|²
      ≤ (1 − 2α_k)|Ax^k − b|² + 2K_A α_k²,            (3.5)

where K_A is an upper bound on |Ax − b|² in X.

The convergence of various iterative algorithms analyzed in this paper can be proved by using the following simple lemma (similar results along these lines can be found in [1, 13]).

Lemma 3.1. Let the sequences {W_k}, {α_k}, {β_k} and {γ_k} satisfy the inequality

    0 ≤ W_{k+1} ≤ W_k − α_k β_k + γ_k.      (3.6)

If
(i) liminf β_k ≥ 0;
(ii) for every subsequence {k_i} ⊂ ℕ one has [liminf W_{k_i} > 0] ⇒ [liminf β_{k_i} > 0];
(iii) α_k ≥ 0, lim α_k = 0, Σ_{k=0}^∞ α_k = ∞;
(iv) lim γ_k/α_k = 0;
then lim_{k→∞} W_k = 0.

Proof. Suppose that liminf W_k > 0. Then, by (ii), liminf β_k = β > 0, and (3.6) for large k yields W_{k+1} ≤ W_k − α_k β/2 + γ_k ≤ W_k − α_k β/4. This contradicts (iii). Therefore there is a subsequence {k_i} such that W_{k_i} → 0. Suppose that there is another subsequence {s_j} such that W_{s_j} ≥ ε > 0 for j = 0, 1, 2, …. With no loss of generality we may assume that k_1 < s_1 < k_2 < s_2 < …. By (i), (iii) and (iv), for all sufficiently large j there must exist indices r_j ∈ [k_j, s_j] such that W_{r_j} > ε/2 and W_{r_j+1} > W_{r_j}. But then, by (ii), liminf β_{r_j} > 0 and we obtain a contradiction with (3.6) for large j. □

We are now ready to prove the convergence of our method.

Theorem 3.2. Consider the algorithm (3.1)-(3.4) and assume that α_k ∈ [0, 1], lim_{k→∞} α_k = 0, Σ_{k=0}^∞ α_k = ∞ and x⁰ ∈ X. Then every accumulation point of the sequence {x^k} is a solution of (2.1)-(2.3).

Proof. By (3.5) and Lemma 3.1, taking W_k = |Ax^k − b|², β_k = 2W_k and γ_k = 2K_A α_k², we obtain |Ax^k − b| → 0 as k → ∞. Since X is compact, we must have

    liminf f(x^k) ≥ f*.

Next, by the convexity of f,

    f(x^{k+1}) ≤ (1 − α_k) f(x^k) + α_k f(u^k) ≤ (1 − α_k) f(x^k) + α_k f*.

Thus,

    max(f(x^{k+1}) − f*, 0) ≤ (1 − α_k) max(f(x^k) − f*, 0).      (3.7)

Using Lemma 3.1 again, with W_k = β_k = max(0, f(x^k) − f*), we obtain f(x^k) → f*. Our assertion follows then from the compactness of X. □
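The basic procedure is straightforward to prototype. Below is a minimal sketch, not taken from the paper: the toy data (c, A, b), the box X = [0, 1]^4 and the use of SciPy's linprog for the subproblem (3.2)-(3.4) are all our own illustrative assumptions.

```python
# Minimal sketch of the basic aggregation method (3.1)-(3.4) on a toy LP
#   min <c, x>  s.t.  Ax = b,  x in X = [0, 1]^4.
# The data below are invented for illustration; the subproblem has a single
# surrogate inequality and is solved here with SciPy's linprog.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0, 3.0, 4.0])
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])              # optimal solution x* = (1,0,1,0), f* = 4

x = np.zeros(4)                       # x^0 = 0, as in section 3
for k in range(1000):
    s = A @ x - b                     # residual defining the surrogate (3.3)
    # subproblem: min <c, u>  s.t.  <A^T s, u> <= <s, b>,  u in X
    res = linprog(c, A_ub=(A.T @ s).reshape(1, -1), b_ub=[s @ b],
                  bounds=[(0.0, 1.0)] * 4)
    x = x + (res.x - x) / (k + 1.0)   # harmonic stepsize alpha_k = 1/(k+1)
```

Since each surrogate inequality is a relaxation of Ax = b, the iterates keep f(x^k) ≤ f* throughout, while the residual |Ax^k − b| decays at the rate established in section 7.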

Clearly, in view of the compactness of X we also have dist(x^k, X*) → 0, where X* is the set of optimal solutions of (2.1)-(2.3). This general remark applies to many other results in this paper.

The stepsizes in the basic procedure can also be generated in a more systematic way:

    α_k = arg min_{0≤τ≤1} |(1 − τ)(Ax^k − b) + τ(Au^k − b)|².      (3.8)

Theorem 3.3. Consider the algorithm (3.1)-(3.4) and assume that f(x⁰) ≤ f* and the stepsizes α_k are defined by (3.8). Then every accumulation point of the sequence {x^k} is a solution of (2.1)-(2.3).

Proof. From the key inequality (3.5) we obtain

    |Ax^{k+1} − b|² ≤ min_{0≤τ≤1} [ (1 − 2τ)|Ax^k − b|² + 2K_A τ² ]
                   ≤ ( 1 − |Ax^k − b|²/(2K_A) ) |Ax^k − b|².

Therefore |Ax^k − b|² → 0. Since f(x^k) ≤ f* for all k, the required result follows. □

We end this section by stressing that the theoretical framework provided by the basic procedure can be modified and extended in different ways.

First, it is not necessary to generate exactly one surrogate constraint. We can aggregate the constraints in groups as follows. Let

    A_l x = b_l,  l = 1, …, L,              (3.9)

be (possibly overlapping) subgroups of constraints (2.2), such that each row of (2.2) is represented at least once. Then we can replace (3.3) by L surrogate constraints:

    ⟨A_l x^k − b_l, A_l u − b_l⟩ ≤ 0,  l = 1, …, L.

Indeed, our key inequality (3.5) remains valid for each subgroup |A_l x^k − b_l|², l = 1, …, L, and all our results follow. The optimizing stepsize (3.8) should be replaced, of course, by

    α_k = arg min_{0≤τ≤1} Σ_{l=1}^L |(1 − τ)(A_l x^k − b_l) + τ(A_l u^k − b_l)|².     (3.10)

Secondly, we can keep in the auxiliary subproblem as many previous surrogate constraints as we wish. They are all relaxations of the original constraints and do not cut off points of the true feasible set. Therefore f(u^k) ≤ f* for all k, so our results remain true.

4  The subgradient projection method

The basic procedure requires the solution of the auxiliary subproblems (3.2)-(3.4), which may still prove difficult. We shall show that it is sufficient to use individual subgradients of f within the constraint aggregation scheme to achieve convergence.

Consider the same procedure as before:

    x^{k+1} = (1 − α_k) x^k + α_k u^k,       (4.1)

but with u^k generated in a simpler fashion:

    u^k = Π_k(x^k − ρ_k g^k).                (4.2)

Here g^k ∈ ∂f(x^k) and Π_k(·) denotes the orthogonal projection onto the set

    X_k = { x ∈ X : ⟨Ax^k − b, Ax − b⟩ ≤ 0 }.

Theorem 4.1. Assume α_k ∈ [0, 1], ρ_k ≥ 0, α_k → 0, ρ_k → 0, Σ_{k=0}^∞ α_k = ∞, Σ_{k=0}^∞ α_k ρ_k = ∞. Then every accumulation point of the sequence generated by the method (4.1)-(4.2) is a solution of (2.1)-(2.3).

Proof. For every optimal solution x* we have

    |x^{k+1} − x*|² ≤ ( (1 − α_k)|x^k − x*| + α_k|u^k − x*| )²
      = (1 − α_k)²|x^k − x*|² + 2α_k(1 − α_k)|x^k − x*||u^k − x*| + α_k²|u^k − x*|².   (4.3)

Since x* ∈ X_k, from the non-expansiveness of Π_k(·) one obtains

    |u^k − x*|² = |Π_k(x^k − ρ_k g^k) − Π_k(x*)|² ≤ |x^k − x*|² − 2ρ_k⟨g^k, x^k − x*⟩ + Cρ_k²,

where C is an upper bound on |g^k|². Thus

    |u^k − x*|² ≤ |x^k − x*|² − 2ρ_k δ_k + Cρ_k²        (4.4)

with δ_k = f(x^k) − f*. Taking the square root of both sides and using the concavity of √· we get

    |u^k − x*| ≤ |x^k − x*| ( 1 − (2ρ_k δ_k − Cρ_k²)/|x^k − x*|² )^{1/2}
              ≤ |x^k − x*| − (2ρ_k δ_k − Cρ_k²)/(2|x^k − x*|).

Multiplication by |x^k − x*| yields

    |u^k − x*||x^k − x*| ≤ |x^k − x*|² − ρ_k δ_k + Cρ_k²/2.     (4.5)

Combining (4.3), (4.4) and (4.5), after elementary simplifications, for every x* ∈ X* one has

    |x^{k+1} − x*|² ≤ |x^k − x*|² − 2α_k ρ_k δ_k + Cα_k ρ_k².

Then also (after minimizing both sides with respect to x*)

    dist(x^{k+1}, X*)² ≤ dist(x^k, X*)² − 2α_k ρ_k δ_k + Cα_k ρ_k².     (4.6)

Similarly to Theorem 3.2, |Ax^k − b| → 0. Therefore the sequence {δ_k} satisfies conditions (i) and (ii) of Lemma 3.1 with W_k = dist(x^k, X*)². By virtue of Lemma 3.1 (with α_k ρ_k playing the role of α_k), dist(x^k, X*) → 0 as k → ∞. □
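The paper does not specify how the projection Π_k is to be computed. When X is a box, one possible implementation (our assumption, not the paper's prescription) is Dykstra's alternating projection algorithm for the intersection of the box with the surrogate halfspace {x : ⟨a, x⟩ ≤ β}:

```python
# Sketch (assumption: X is a box and the surrogate constraint is a single
# halfspace {x : <a, x> <= beta}): Dykstra's alternating projections compute
# the orthogonal projection onto the intersection, as needed for Pi_k in (4.2).
import numpy as np

def project_box(x, lo, hi):
    return np.clip(x, lo, hi)

def project_halfspace(x, a, beta):
    viol = a @ x - beta
    if viol <= 0.0:
        return x
    return x - (viol / (a @ a)) * a

def project_intersection(z, lo, hi, a, beta, iters=200):
    """Dykstra's algorithm for the projection onto box ∩ halfspace."""
    x = z.astype(float).copy()
    p = np.zeros_like(x)      # correction term attached to the box
    q = np.zeros_like(x)      # correction term attached to the halfspace
    for _ in range(iters):
        y = project_box(x + p, lo, hi)
        p = x + p - y
        x = project_halfspace(y + q, a, beta)
        q = y + q - x
    return x

# example: project (2, 2) onto {x in [0,1]^2 : x1 + x2 <= 1}
x_proj = project_intersection(np.array([2.0, 2.0]), np.zeros(2), np.ones(2),
                              np.array([1.0, 1.0]), 1.0)
```

Unlike plain alternating projections, Dykstra's correction terms make the limit the true orthogonal projection onto the intersection, which is what the non-expansiveness argument in the proof of Theorem 4.1 relies on.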

5  Convex inequalities

To show how the aggregation principle extends to nonlinear constraints, let us consider the problem

    min f(x)                                (5.1)
    h_j(x) ≤ 0,  j = 1, …, m,               (5.2)
    x ∈ X.                                  (5.3)

We assume that f : ℝ^n → ℝ is convex, X ⊂ ℝ^n is compact and convex, and the functions h_j : ℝ^n → ℝ, j = 1, …, m, are convex. Denote h_j⁺(x) = max(0, h_j(x)), h(x) = (h_1(x), …, h_m(x)) and h⁺(x) = (h_1⁺(x), …, h_m⁺(x)).

The basic iterative procedure (3.1)-(3.4) can be adapted to our problem as follows:

    x^{k+1} = x^k + α_k (u^k − x^k),  k = 0, 1, 2, …,      (5.4)

where u^k is the solution of the subproblem

    min f(u)                                (5.5)
    ⟨h⁺(x^k), h(u)⟩ ≤ 0,                    (5.6)
    u ∈ X.                                  (5.7)

Note that if linear constraints Ax = b are written as inequalities Ax − b ≤ 0 and −Ax + b ≤ 0, then the surrogate inequalities (5.6) and (3.3) are identical.

Theorem 5.1. Consider the algorithm (5.4)-(5.7) and assume that α_k ∈ [0, 1], lim_{k→∞} α_k = 0, Σ_{k=0}^∞ α_k = ∞ and x⁰ ∈ X. Then every accumulation point of the sequence {x^k} is a solution of (5.1)-(5.3).

Proof. Let us derive a counterpart of the key inequality (3.5). By the convexity of h(·),

    h((1 − α_k)x^k + α_k u^k) ≤ (1 − α_k)h(x^k) + α_k h(u^k) ≤ (1 − α_k)h⁺(x^k) + α_k h(u^k).

In the above vector inequality, for the indices j for which the left hand side is positive, the right hand side has a larger absolute value. Therefore

    |h⁺((1 − α_k)x^k + α_k u^k)|² ≤ |(1 − α_k)h⁺(x^k) + α_k h(u^k)|²
      ≤ (1 − α_k)²|h⁺(x^k)|² + α_k²|h(u^k)|²,

where in the last inequality we used (5.6). Consequently,

    |h⁺(x^{k+1})|² ≤ (1 − 2α_k)|h⁺(x^k)|² + 2K_h α_k²,      (5.8)

with K_h being an upper bound on |h(x)|² in X. By Lemma 3.1, lim_{k→∞} h⁺(x^k) = 0. The remaining part of the proof is identical with the proof of Theorem 3.2. □

Again, similarly to Theorem 3.3, the stepsizes in the basic procedure (5.4) can be generated in a more systematic way:

    α_k = arg min_{0≤τ≤1} |h⁺((1 − τ)x^k + τu^k)|².      (5.9)

Theorem 5.2. Consider the algorithm (5.4)-(5.7) and assume that f(x⁰) ≤ f* and the stepsizes α_k are defined by (5.9). Then every accumulation point of the sequence {x^k} is a solution of (5.1)-(5.3).

Proof. From the inequality (5.8) we obtain

    |h⁺(x^{k+1})|² ≤ min_{0≤τ≤1} [ (1 − 2τ)|h⁺(x^k)|² + 2K_h τ² ]
                   ≤ ( 1 − |h⁺(x^k)|²/(2K_h) ) |h⁺(x^k)|².      (5.10)

Therefore |h⁺(x^k)|² → 0. Since f(x^k) ≤ f* for all k, the required result follows. □

For continuously differentiable constraint functions h(·) the subproblems can be further simplified by replacing (5.6) with the inequality

    ⟨h⁺(x^k), h(x^k) + ∇h(x^k)(u − x^k)⟩ ≤ 0,      (5.11)

where ∇h(·) denotes the Jacobian of h(·). Let us note that by the convexity of h(·), (5.11) is still a relaxation of (5.2).

Theorem 5.3. Consider the algorithm (5.4), (5.5), (5.11), (5.7) and assume that α_k ∈ [0, 1], lim_{k→∞} α_k = 0, Σ_{k=0}^∞ α_k = ∞ and x⁰ ∈ X. Then every accumulation point of the sequence {x^k} is a solution of (5.1)-(5.3).

Proof. We shall again derive an inequality similar to (5.8). By the uniform continuity of ∇h(·) in X,

    h(x^k + α_k(u^k − x^k)) ≤ h(x^k) + α_k ∇h(x^k)(u^k − x^k) + o(α_k; u^k, x^k)
      ≤ (1 − α_k)h⁺(x^k) + α_k ( h(x^k) + ∇h(x^k)(u^k − x^k) ) + o(α_k; u^k, x^k),

where o(τ; u, x)/τ → 0, when τ → 0, uniformly over u, x ∈ X. Proceeding exactly as in the proof of Theorem 5.1 and using (5.11), we obtain

    |h⁺(x^{k+1})|² ≤ (1 − 2α_k)|h⁺(x^k)|² + o₁(α_k),     (5.12)

with o₁(τ)/τ → 0, when τ → 0. Applying Lemma 3.1, we get lim_{k→∞} h⁺(x^k) = 0. The remaining part of the proof is analogous to the proof of Theorem 3.2. □

Similar results can be obtained for the subgradient projection method

    x^{k+1} = (1 − α_k)x^k + α_k u^k,        (5.13)

where

    u^k = Π_k(x^k − ρ_k g^k),                (5.14)

g^k ∈ ∂f(x^k) and Π_k(·) denotes the orthogonal projection onto the set

    X_k = { x ∈ X : ⟨h⁺(x^k), h(x)⟩ ≤ 0 }        (5.15)

(if the aggregation scheme (5.6) is used), or onto the set

    X_k = { x ∈ X : ⟨h⁺(x^k), h(x^k) + ∇h(x^k)(x − x^k)⟩ ≤ 0 }      (5.16)

(if linearization of smooth constraints is employed). In the identical way as Theorems 5.1 and 5.3 we obtain the following results.

Theorem 5.4. Assume α_k ∈ [0, 1], ρ_k ≥ 0, α_k → 0, ρ_k → 0, Σ_{k=0}^∞ α_k = ∞, Σ_{k=0}^∞ α_k ρ_k = ∞. Then every accumulation point of the sequence generated by the method (5.13)-(5.15) is a solution of (5.1)-(5.3). If, additionally, the constraint functions h_j, j = 1, …, m, are continuously differentiable, every accumulation point of the sequence generated by the method (5.13), (5.14), (5.16) is a solution of (5.1)-(5.3).

Let us suppose now that the optimal value f* in (5.1)-(5.3) is known. Then we can reformulate the problem as a system of inequalities:

    f(x) − f* ≤ 0,                           (5.17)
    h_j(x) ≤ 0,  j = 1, …, m,                (5.18)
    x ∈ X.                                   (5.19)

We can, formally, think of it as a problem of minimizing a function identically equal to 0 subject to (5.17)-(5.19). Our aggregation procedure (5.4)-(5.7) replaces our problem by a sequence of simple systems having the form:

    max(0, f(x^k) − f*) (f(u) − f*) + Σ_{j∈J_k} h_j(x^k) h_j(u) ≤ 0,
    u ∈ X,

where J_k = { j : h_j(x^k) > 0 }. Linearization of smooth constraints is also possible here.

Again, similarly to the case of linear constraints, we can consider (possibly overlapping) subgroups of constraints

    h_l(x) ≤ 0,  l = 1, …, L,

and generate surrogate constraints separately for each group:

    ⟨h_l⁺(x^k), h_l(u)⟩ ≤ 0,  l = 1, …, L.

Since the key inequality (5.8) holds for each subgroup, all results of this section remain true. The rule (5.9) should be replaced, of course, by

    α_k = arg min_{0≤τ≤1} Σ_{l=1}^L |h_l⁺((1 − τ)x^k + τu^k)|².

We can also keep some past aggregates in the auxiliary subproblems.
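As a concrete illustration of the linearized scheme (5.4), (5.5), (5.11), consider the following sketch on an invented toy problem with two quadratic constraints. Since the surrogate (5.11) is a single linear inequality and the objective is linear, each subproblem is a small LP, solved here with SciPy's linprog (our choice of solver, not the paper's):

```python
# Sketch (toy data invented for illustration): linearized aggregation
# (5.4), (5.5), (5.11) applied to
#   min -x1 - x2   s.t.  x1^2 + x2^2 - 1 <= 0,  (x1 - 0.5)^2 + x2^2 - 1 <= 0,
# over the box X = [-2, 2]^2.  Only the first disk is active at the optimum
# x* = (1/sqrt(2), 1/sqrt(2)), with f* = -sqrt(2).
import numpy as np
from scipy.optimize import linprog

f_coef = np.array([-1.0, -1.0])

def h(x):        # constraint values h_1, h_2
    return np.array([x[0]**2 + x[1]**2 - 1.0,
                     (x[0] - 0.5)**2 + x[1]**2 - 1.0])

def jac(x):      # Jacobian of h
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [2.0 * (x[0] - 0.5), 2.0 * x[1]]])

x = np.zeros(2)
for k in range(2000):
    hp = np.maximum(h(x), 0.0)           # h^+(x^k)
    a = jac(x).T @ hp                    # surrogate (5.11) written as a.u <= beta
    beta = a @ x - hp @ h(x)
    res = linprog(f_coef, A_ub=a.reshape(1, -1), b_ub=[beta],
                  bounds=[(-2.0, 2.0)] * 2)
    x = x + (res.x - x) / (k + 1.0)      # harmonic stepsize
```

Note how the aggregation weights hp concentrate automatically on the violated constraints; the inactive second disk contributes nothing near the solution.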

6  The linearization procedure

Let us consider the problem (5.1)-(5.3) again, but with the additional assumption that the functions f and h_j, j = 1, …, m, are convex and continuously differentiable. The set X is assumed to be convex and compact. We shall extend the idea of linearization used in (5.11) to the objective function, constructing the following procedure:

    x^{k+1} = x^k + α_k (u^k − x^k),  k = 0, 1, 2, …,      (6.1)

where u^k is the solution of the subproblem

    min ⟨∇f(x^k), u − x^k⟩                   (6.2)
    ⟨h⁺(x^k), h(x^k) + ∇h(x^k)(u − x^k)⟩ ≤ 0,      (6.3)
    u ∈ X.                                   (6.4)

The above algorithm is a modification of the classical Frank-Wolfe method of nonlinear programming, but with constraint linearization, aggregation and with non-monotone stepsize selection.

Theorem 6.1. Assume that α_k ∈ [0, 1], α_k → 0, Σ_{k=0}^∞ α_k = ∞. Then every accumulation point of the sequence generated by the method (6.1)-(6.4) is a solution of (5.1)-(5.3).

Proof. Proceeding exactly as in the proof of Theorem 5.3, we obtain h⁺(x^k) → 0 as k → ∞. Next, by the differentiability of f,

    f(x^{k+1}) ≤ f(x^k) + α_k⟨∇f(x^k), u^k − x^k⟩ + o(α_k; x^k, u^k),     (6.5)

where o(τ; x, u)/τ → 0 when τ → 0, uniformly for x, u ∈ X. Since every solution x* is feasible for (6.2)-(6.4),

    ⟨∇f(x^k), u^k − x^k⟩ ≤ ⟨∇f(x^k), x* − x^k⟩ ≤ f* − f(x^k).

Combining the last two inequalities we get

    f(x^{k+1}) − f* ≤ (1 − α_k)(f(x^k) − f*) + o(α_k; x^k, u^k).

By Lemma 3.1, max(0, f(x^k) − f*) → 0 as k → ∞. The proof is complete. □

Let us note that if X has a simple structure, e.g.

    X = { x ∈ ℝ^n : x_j^min ≤ x_j ≤ x_j^max, j = 1, …, n },

then (6.2)-(6.4) can be solved analytically.

Aggregation and linearization in subgroups, as already discussed in sections 3 and 5, is also possible here.

Finally, it is worth mentioning that in a similar fashion one can derive a family of algorithms, using different direction-finding subproblems, such as trust-region methods, methods with regularization and averaging (see, e.g., [17]), etc.
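A sketch of the procedure (6.1)-(6.4) on an invented smooth toy problem (the data and the use of SciPy's linprog for the direction-finding subproblem are our own illustrative assumptions; the minimizer of this toy instance is x* = (1, 0) with f* = 1):

```python
# Sketch (toy data invented): the Frank-Wolfe-type method with aggregation,
# (6.1)-(6.4), on
#   min (x1 - 2)^2 + x2^2   s.t.  x1^2 + x2^2 - 1 <= 0,
#                                 (x1 - 0.5)^2 + x2^2 - 1 <= 0,
# over X = [-2, 2]^2.  Both objective and constraints are linearized, so
# each direction-finding subproblem is an LP.
import numpy as np
from scipy.optimize import linprog

def f(x):
    return (x[0] - 2.0)**2 + x[1]**2

def grad_f(x):
    return np.array([2.0 * (x[0] - 2.0), 2.0 * x[1]])

def h(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0,
                     (x[0] - 0.5)**2 + x[1]**2 - 1.0])

def jac_h(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [2.0 * (x[0] - 0.5), 2.0 * x[1]]])

x = np.zeros(2)
for k in range(3000):
    hp = np.maximum(h(x), 0.0)
    a = jac_h(x).T @ hp                  # surrogate (6.3) written as a.u <= beta
    beta = a @ x - hp @ h(x)
    res = linprog(grad_f(x), A_ub=a.reshape(1, -1), b_ub=[beta],
                  bounds=[(-2.0, 2.0)] * 2)
    x = x + (res.x - x) / (k + 1.0)      # harmonic stepsize
```

As Theorem 6.1 predicts, both the constraint violation |h⁺(x^k)| and the optimality gap shrink along the iterations, although the method is non-monotone in f.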

7  Rate of convergence and complexity

Let us now analyze the speed of convergence of the methods with constraint aggregation in the case of harmonic stepsizes:

    α_k = γ/(k + 1),  k = 0, 1, 2, …,        (7.1)

where 0 < γ ≤ 1. We start from the following technical lemma.

Lemma 7.1. Let the sequence {α_k} be defined by (7.1).
(i) If the sequence {v_k} satisfies for k = 0, 1, 2, … the inequality

    0 ≤ v_{k+1} ≤ (1 − 2α_k)v_k + Cα_k²      (7.2)

and v_0 ≤ C/2, then for all k ≥ 0 and all p ∈ [2γ(1 − γ), 2γ) such that p ≤ 1 one has

    v_k ≤ M(p)/(k + 1)^p,                    (7.3)

where M(p) = Cγ²/(2γ − p).
(ii) If the sequence {v_k} satisfies for k = 0, 1, 2, … the inequality

    0 ≤ v_{k+1} ≤ (1 − α_k)v_k,              (7.4)

then for all k ≥ 0

    v_k ≤ v_0/(k + 1)^γ.                     (7.5)

Proof. For k = 0 (7.3) is true, because p ≥ 2γ(1 − γ) implies v_0 ≤ C/2 ≤ M(p). Assuming that (7.3) holds for k, we shall prove it for k + 1. From (7.2) we get

    v_{k+1} ≤ ( 1 − 2γ/(k + 1) ) M(p)/(k + 1)^p + Cγ²/(k + 1)²
      = ( 1 − p/(k + 1) ) M(p)/(k + 1)^p + Cγ²/(k + 1)² − (2γ − p) M(p)/(k + 1)^{p+1}
      ≤ ( 1 − p/(k + 1) ) M(p)/(k + 1)^p,          (7.6)

where in the last inequality we used the definition of M(p). Let us now note that the concavity of the logarithm implies

    ln( 1 − p/(k + 1) ) + p ln( 1 + 1/(k + 1) ) ≤ 0,

so

    1 − p/(k + 1) ≤ ( (k + 1)/(k + 2) )^p.       (7.7)

Using this inequality in (7.6) we conclude that

    v_{k+1} ≤ M(p)/(k + 2)^p,

which completes the induction step for (7.3).

To prove (7.5) (which is trivial for k = 0), we use (7.4) and (7.7) with p = γ to get

    v_{k+1} ≤ ( 1 − γ/(k + 1) ) v_0/(k + 1)^γ ≤ v_0/(k + 2)^γ.

By induction (7.5) holds for all k ≥ 0. □

We can now derive rate of convergence estimates for our basic procedure.

Theorem 7.2. Consider the problem (5.1)-(5.3) and the method (5.4)-(5.7) with the stepsizes (7.1). Then for all k ≥ 0 and all p ∈ [2γ(1 − γ), 2γ) such that p ≤ 1 the sequence {x^k} satisfies the inequalities

    |h⁺(x^k)|² ≤ M_h(p)/(k + 1)^p            (7.8)

and

    f(x^k) − f* ≤ M_f/(k + 1)^γ,             (7.9)

where M_h(p) = 2K_h γ²/(2γ − p), K_h = max_{x∈X} |h(x)|², and M_f = max(f(x⁰) − f*, 0).

Proof. The result can be obtained from inequalities (5.8) and (3.7) by applying Lemma 7.1 with v_k = |h⁺(x^k)|² and v_k = max(0, f(x^k) − f*). □

It is worth noting that the rate of convergence estimates (7.8) and (7.9) are invariant with respect to the scaling of the variables and of the constraint and objective functions.

A similar result can be obtained for the method with constraint linearization.

Theorem 7.3. Consider problem (5.1)-(5.3) and the method (5.4), (5.5), (5.11), (5.7) with the stepsizes (7.1). Assume additionally that the Jacobian ∇h(·) is Lipschitz continuous on X. Then the estimates (7.8) and (7.9) hold with M_h(p) = (K_h + LD²)γ²/(2γ − p), where K_h = max_{x∈X} |h(x)|², L is the Lipschitz constant of ∇h(·) and D is the diameter of X.

Proof. It is sufficient to observe that in (5.12) one has o₁(α_k) = (K_h + LD²)α_k². Application of Lemma 7.1 with v_k = |h⁺(x^k)|² yields the required result. □

It is also clear that the estimate (7.8) holds for the constraint violation in the subgradient projection method and in the linearization method. In the latter case, however, we can again estimate the speed of convergence of objective values.
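The bounds of Lemma 7.1 can be sanity-checked numerically by running the recursions (7.2) and (7.4) with equality (a check, not a proof; the constants below are arbitrary test values of our choosing):

```python
# Numeric sanity check of Lemma 7.1: iterate the recursions (7.2) and (7.4)
# with equality and compare against the bounds (7.3) and (7.5).
C, gamma, p = 2.0, 1.0, 1.0        # arbitrary constants with p in [2g(1-g), 2g)
M = C * gamma**2 / (2.0 * gamma - p)

v = C / 2.0                        # part (i): v_0 <= C/2
for k in range(10000):
    assert v <= M / (k + 1)**p + 1e-12           # bound (7.3)
    alpha = gamma / (k + 1)
    v = (1.0 - 2.0 * alpha) * v + C * alpha**2   # recursion (7.2) with equality

g2, w = 0.5, 1.0                   # part (ii) with a different gamma
for k in range(10000):
    assert w <= 1.0 / (k + 1)**g2 + 1e-12        # bound (7.5) with v_0 = 1
    w = (1.0 - g2 / (k + 1)) * w                 # recursion (7.4) with equality
```

In the first recursion the quantity v·(k + 1) creeps up toward M from below, which suggests that the O(1/k^p) estimate (7.3) is essentially tight for this stepsize rule.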

Theorem 7.4. Consider problem (5.1)-(5.3) and the method (6.1)-(6.4) with the stepsizes (7.1). Assume additionally that the Jacobian ∇h(·) and the gradient ∇f(·) are Lipschitz continuous on X. Then the estimates (7.8) hold with M_h(p) = (K_h + L_h D²)γ²/(2γ − p), where K_h = max_{x∈X} |h(x)|², L_h is the Lipschitz constant of ∇h(·) and D is the diameter of X. Moreover, we have for every p ∈ (0, γ) the estimate

    f(x^k) − f* ≤ M_f(p)/(k + 1)^p,          (7.10)

where M_f(p) = max(f(x⁰) − f*, L_f D²/(γ − p)) and L_f is the Lipschitz constant of ∇f(·).

Proof. The estimate of the constraint violation can be proved as in Theorem 7.3. To prove (7.10) we observe that in the inequality (6.5) one has

    o(τ; u, x) ≤ L_f D² τ².

Application of Lemma 7.1 to this inequality (with 2γ replaced by γ) yields the required result. □

Equally strong results can be obtained for the method with stepsize selection via directional minimization (5.9).

Theorem 7.5. Consider the algorithm (5.4)-(5.7) and assume that f(x⁰) ≤ f* and the stepsizes α_k are defined by (5.9). Then the following estimate holds:

    |h⁺(x^k)|² ≤ 2K_h/(k + 1),  k = 0, 1, 2, …      (7.11)

Proof. For k = 0 our assertion is true. Assuming that it holds for k, we shall prove it for k + 1. If |h⁺(x^k)|² ≤ 2K_h/(k + 2), from (5.10) we get |h⁺(x^{k+1})|² ≤ |h⁺(x^k)|² ≤ 2K_h/(k + 2). It remains to consider the case |h⁺(x^k)|² ∈ [2K_h/(k + 2), 2K_h/(k + 1)]. But then, by (5.10), one has

    |h⁺(x^{k+1})|² ≤ ( 1 − 1/(k + 2) ) · 2K_h/(k + 1) = 2K_h/(k + 2),

as required. □

Let us now pass to the complexity analysis of the basic procedure (3.1)-(3.4) in the case of the linear programming problem

    min ⟨c, x⟩                               (7.12)
    Ax = b,                                  (7.13)
    x^min ≤ x ≤ x^max.                       (7.14)

If stepsizes (7.1) are used, we assume for simplicity that γ = 1, so ⟨c, x^k⟩ ≤ f* for all k ≥ 1. Suppose that we need to find an approximate solution such that

    |Ax^k − b|² ≤ ε,

where ε > 0 is some selected accuracy. Then it follows from (7.8) with p = 1 that it is sufficient to carry out

    k(ε) = 2K_A/ε − 1

iterations, where K_A is the upper bound on |Ax − b|² subject to (7.14). By Theorem 7.5, the same estimate is true when stepsizes are generated by minimizing the constraint violation, as in (3.8).

Each iteration consists in solving the linear program

    min ⟨c, u⟩
    ⟨Aᵀs^k, u⟩ ≤ ⟨s^k, b⟩,
    x^min ≤ u ≤ x^max,

where s^k = Ax^k − b. Forming the surrogate constraint costs m(2n + 1) multiplications. The subproblem requires at most 2n multiplications and divisions to solve. Consequently, the effort to find a solution with accuracy ε can be bounded by a polynomial function of the problem's dimensions m and n:

    E(m, n, ε) ≤ 2K_A(2mn + 2n + m)/ε.

For sparse problems we get a better bound by noting that the surrogate constraint is formed in 2ν + m multiplications, where ν is the number of nonzeros of A. Of course, the quality of these bounds depends very much on the constant K_A.

8  The dual viewpoint

So far we did not assume any constraint qualification conditions; the algorithms discussed in the previous sections converge without that (if a solution exists). Let us now look closer at the properties of the basic procedure (3.1)-(3.4) for the problem (2.1)-(2.3) under the following assumption.

Constraint Qualification Condition. The set X is polyhedral, or at some feasible point x̃ one has

    ri(X − x̃) ∩ { d : Ad = 0 } ≠ ∅,

where X − x̃ = { x − x̃ : x ∈ X }.

Let us consider the Lagrangian

    L(x, λ) = f(x) + ⟨λ, Ax − b⟩             (8.1)

and the dual function

    D(λ) = min_{x∈X} L(x, λ).

Under the Constraint Qualification Condition, there exists λ* such that

    D(λ*) = max_{λ∈ℝ^m} D(λ) = f*.           (8.2)

This result can be obtained as follows. At first, we note that at the optimal solution x* one has −g ∈ N_C(x*), where g ∈ ∂f(x*) and N_C(x*) is the normal cone to the feasible set defined by (2.2)-(2.3) (see [15]). Then we use convexity and the Constraint Qualification Condition to get N_C(x*) = N_X(x*) + { Aᵀλ : λ ∈ ℝ^m }, so −g − Aᵀλ* ∈ N_X(x*) for some λ*. The latter is equivalent to x* being the minimizer of L(·, λ*) in X, which in turn yields (8.2).

The Constraint Qualification Condition also guarantees (in the same way) the existence of an optimal Lagrange multiplier μ_k corresponding to the surrogate constraint (3.3) at each iteration of the method. Let us define

    s^k = Ax^k − b                           (8.3)

and

    λ^k = μ_k s^k.                           (8.4)

It follows that u^k minimizes the Lagrangian for (3.2)-(3.4), which turns out to be the Lagrangian (8.1) at λ^k:

    u^k ∈ arg min_{u∈X} [ f(u) + μ_k⟨Ax^k − b, Au − b⟩ ] = arg min_{u∈X} L(u, λ^k).

Therefore, our basic algorithm in the dual space can be interpreted as follows:

    λ^k = arg max { D(λ) : λ = μ s^k, μ ≥ 0 },       (8.5)
    w^k = Au^k − b ∈ ∂D(λ^k),                        (8.6)
    s^{k+1} = (1 − α_k)s^k + α_k w^k.                (8.7)

We also have

    f(u^k) = min_{u∈X} L(u, λ^k) = D(λ^k).           (8.8)

We can use this result to prove an interesting property of the sequence {λ^k} generated by (8.5)-(8.7).

Theorem 8.1. Let the Constraint Qualification Condition be satisfied. Then for the basic procedure (3.1)-(3.4) with α_k ∈ [0, 1], lim_{k→∞} α_k = 0 and Σ_{k=0}^∞ α_k = ∞ we have

    limsup_{k→∞} D(λ^k) = f* = max_{λ∈ℝ^m} D(λ)

and

    lim_{k→∞} s^k = 0.

Proof. By the convexity of f,

    f(x^{k+1}) ≤ f(x^k) − α_k ( f(x^k) − f(u^k) ).

Suppose that limsup f(u^k) = v < f*. By Theorem 3.2, lim f(x^k) = f*, so for large k the last inequality yields

    f(x^{k+1}) ≤ f(x^k) − α_k (f* − v)/2.

Since Σ α_k = ∞, we obtain f(x^k) → −∞, a contradiction. Therefore we must have v = f* and the required result follows from (8.8) and (8.2). □
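For linear f over a box X the dual function can be evaluated coordinatewise, which makes the dual interpretation easy to experiment with. A sketch (toy data invented; by weak duality every value D(λ) is a lower bound on f*):

```python
# Sketch (toy data invented; f linear, X a box): evaluating the dual function
#   D(lambda) = min_{x in X} [ <c, x> + <lambda, Ax - b> ],
# which is separable over the box.  The multipliers lambda^k = mu_k s^k of
# (8.4)-(8.5) live on the ray spanned by the residual s^k.
import numpy as np

A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 2.0, 3.0, 4.0])     # f* = 4 at x* = (1,0,1,0), X = [0,1]^4

def dual(lam, lo=0.0, hi=1.0):
    coef = c + A.T @ lam               # reduced costs
    x = np.where(coef >= 0.0, lo, hi)  # coordinatewise minimizer over the box
    return coef @ x - lam @ b
```

For this toy instance D(λ) ≤ 4 for every λ (weak duality), with equality attained, e.g., at λ = (−1, −3), in agreement with (8.2).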

A stronger result can be obtained for the sequence of averages.

Theorem 8.2. Let the Constraint Qualification Condition be satisfied. Consider the basic procedure (3.1)-(3.4) with α_k ∈ [0, 1], lim_{k→∞} α_k = 0 and Σ_{k=0}^∞ α_k = ∞, and the sequence {λ̄^k} defined as follows:

    λ̄^{k+1} = (1 − α_k)λ̄^k + α_k λ^k,  k = 0, 1, …,

where the sequence {λ^k} is generated by (8.5)-(8.7). Then

    lim_{k→∞} D(λ̄^k) = f* = max_{λ∈ℝ^m} D(λ).

Proof. From the concavity of the dual function,

    D(λ̄^{k+1}) ≥ (1 − α_k)D(λ̄^k) + α_k D(λ^k).

The convexity of f yields

    f(x^{k+1}) ≤ (1 − α_k)f(x^k) + α_k f(u^k).

Subtracting the last two inequalities and noting that D(λ^k) = f(u^k) we obtain

    f(x^{k+1}) − D(λ̄^{k+1}) ≤ (1 − α_k) ( f(x^k) − D(λ̄^k) ).

By Lemma 3.1,

    lim_{k→∞} max( 0, f(x^k) − D(λ̄^k) ) = 0.

Since f(x^k) → f* by Theorem 3.2, the required result follows. □

It is obvious that analogous results hold for the case of convex inequalities discussed in section 5, in both versions of aggregation: the nonlinear one (5.6) and the linearized one (5.11). We focused here on linear constraints because of the application to decomposable problems of the form:

    min Σ_{i=1}^I f_i(x_i)                   (8.9)
    Σ_{i=1}^I A_i x_i = b                    (8.10)
    x_i ∈ X_i,  i = 1, …, I.                 (8.11)

We assume that the functions f_i : ℝ^{n_i} → ℝ are convex, the sets X_i ⊂ ℝ^{n_i} are convex and compact, A_i are matrices of dimension m × n_i, and b ∈ ℝ^m, i = 1, …, I.

Our approach converts (8.9)-(8.11) into a sequence of decomposable subproblems with only one linking constraint:

    min Σ_{i=1}^I f_i(u_i)                   (8.12)

\[
\sum_{i=1}^{I} \langle s^k, A_i u_i\rangle \le \langle s^k, b\rangle, \tag{8.13}
\]
\[
u_i \in X_i, \quad i = 1, \ldots, I, \tag{8.14}
\]
where $s^k = \sum_{i=1}^{I} A_i x_i^k - b$. These subproblems can be easily solved by duality and decomposition:
\[
\max_{\theta \ge 0} d_k(\theta)
\]
with
\[
d_k(\theta) = -\theta\langle s^k, b\rangle + \sum_{i=1}^{I} d_k^i(\theta),
\]
\[
d_k^i(\theta) = \min_{u_i \in X_i}\left[ f_i(u_i) + \theta \langle s^k, A_i u_i\rangle \right].
\]
In this way we avoid complicated coordination procedures for updating the multipliers associated with the linking constraints (8.10).

For non-separable functions $f$ appearing as objectives in place of (8.9), we may use the projection method, whose subproblems consist in minimizing the separable objective $\sum_{i=1}^{I} |u_i - \bar x_i^k|^2$, where $\bar x^k = x^k - \gamma_k g^k$, $g^k \in \partial f(x^k)$, subject to the constraints (8.13)-(8.14). Again, these subproblems are easily decomposable.

When the objective $f$ is smooth, the subproblems of the linearization method have decomposable objectives of the form $\sum_{i=1}^{I} \langle \nabla_{x_i} f(x^k), u_i - x_i^k\rangle$ and constraints (8.13)-(8.14). In many special cases all these subproblems can be solved analytically.

9 Illustrative example

To illustrate the behavior of the constraint aggregation technique, let us consider the dual of the transportation problem:
\[
\max\left[ \sum_{i=1}^{N} s_i w_i - \sum_{j=1}^{N} d_j v_j \right] \tag{9.1}
\]
\[
w_i - v_j \le a_{ij}, \quad i = 1, \ldots, N, \ j = 1, \ldots, N, \tag{9.2}
\]
where $w_i$, $i = 1, \ldots, N$, are the unknown potentials of the source nodes, $v_j$, $j = 1, \ldots, N$, are the unknown potentials of the destination nodes, $a_{ij}$ denotes the transportation cost from source $i$ to destination $j$, and $s_i$ and $d_j$ are the amounts available at source $i$ and required at destination $j$, respectively ($\sum_{i=1}^{N} s_i = \sum_{j=1}^{N} d_j$). A particular instance of this problem, known as Lemaréchal's problem TR48, has $N = 48$ (note that $n = 2N$) and leads (after eliminating the variables $v_j$) to a difficult nonsmooth optimization problem with a piecewise linear objective [11].
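The decomposition above can be sketched in a few lines. Purely as an illustration (and under assumptions not made in the paper: linear block objectives $f_i(u_i) = c_i^T u_i$ and boxes $X_i = [0,1]^{n_i}$, so that each block minimum is analytic), the dual function $d_k(\theta)$ separates, and the scalar concave maximization over $\theta \ge 0$ can be handled by a crude search:

```python
import numpy as np

def d_k(theta, c_blocks, A_blocks, s, b):
    """Dual function of the single-constraint subproblem (8.12)-(8.14),
    assuming f_i(u_i) = c_i.u_i and X_i = [0,1]^{n_i}: each block
    minimum is attained coordinatewise at 0 or 1."""
    val = -theta * (s @ b)
    for c_i, A_i in zip(c_blocks, A_blocks):
        reduced = c_i + theta * (A_i.T @ s)     # block-i reduced costs
        val += np.minimum(reduced, 0.0).sum()   # set u_ij = 1 where reduced cost < 0
    return val

# hypothetical data: three one-dimensional blocks, two linking equations
c_blocks = [np.array([1.0]), np.array([1.0]), np.array([2.0])]
A_blocks = [np.array([[1.0], [0.0]]),
            np.array([[1.0], [1.0]]),
            np.array([[0.0], [1.0]])]
b = np.array([1.0, 1.0])
s = np.array([-1.0, -1.0])                      # residual s^k at x^k = 0

thetas = np.linspace(0.0, 4.0, 401)             # crude grid search over theta >= 0
best = max(d_k(t, c_blocks, A_blocks, s, b) for t in thetas)
print(best)
```

Since $d_k$ is concave and piecewise linear here, the grid maximum (1.0 on this data) equals the optimal value of the surrogate subproblem, as LP strong duality predicts; each block evaluation is independent, which is exactly what makes the coordination trivial.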

We shall compare two methods on this small but interesting problem.

Full aggregation

This is the basic method as described in Section 3. At each step it forms one surrogate constraint
\[
\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_{ij}^k (w_i - v_j) \le \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_{ij}^k a_{ij},
\]
with aggregation coefficients defined as normalized residuals:
\[
\lambda_{ij}^k = \frac{h_{ij}^+(w^k, v^k)}{\sum_{i=1}^{N}\sum_{j=1}^{N} h_{ij}^+(w^k, v^k)},
\]
where $h_{ij}^+(w^k, v^k) = \max\left(0,\ w_i^k - v_j^k - a_{ij}\right)$. Together with that, we leave in the auxiliary subproblem all constraints active at the previous iteration, and we always use a box constraint on the potentials:
\[
0 \le w_i \le M, \quad i = 1, \ldots, N,
\]
\[
0 \le v_j \le M, \quad j = 1, \ldots, N,
\]
with a large $M$ (we used $M = 3000$). This is necessary to guarantee the existence of solutions of the auxiliary subproblems (3.2)-(3.4). In total, we have up to $2N+1$ surrogate constraints: one new and no more than $2N$ active cuts from the previous iteration.

Partial aggregation

In this method we aggregate the constraints in groups corresponding to individual nodes. For every source node $i$ we construct a surrogate constraint
\[
w_i - \sum_{j=1}^{N} \mu_{ij}^k v_j \le \sum_{j=1}^{N} \mu_{ij}^k a_{ij}, \tag{9.3}
\]
where the aggregation coefficients $\mu_{ij}^k$, $j = 1, \ldots, N$, are normalized residuals:
\[
\mu_{ij}^k = \frac{h_{ij}^+(w^k, v^k)}{\sum_{j=1}^{N} h_{ij}^+(w^k, v^k)}.
\]
Note that (9.3) uses an average flow cost from node $i$ and an average potential at the destination nodes seen from $i$. Similarly, for each destination node $j$ we construct a surrogate constraint
\[
\sum_{i=1}^{N} \nu_{ij}^k w_i - v_j \le \sum_{i=1}^{N} \nu_{ij}^k a_{ij},
\]

with
\[
\nu_{ij}^k = \frac{h_{ij}^+(w^k, v^k)}{\sum_{i=1}^{N} h_{ij}^+(w^k, v^k)}.
\]
Again, we use an average flow cost to $j$ and an average source potential seen from $j$. Additionally, we keep those surrogate constraints that were active at the previous step, so in total we may have up to $4N$ aggregates at each step: $2N$ new ones and at most $2N$ active cuts from the previous iteration.

To solve the subproblems we used the CPLEX 2.1 callable library [20]. In the update (3.1) three stepsize rules were employed.

1. Harmonic stepsizes $\tau_k = 1/(k+1)$.
2. Optimal stepsizes defined by (3.10).
3. A heuristic rule that uses $\tau_k = 1$ if this decreases the Euclidean norm of the constraint residuals, and $\tau_k = 1/(k+1)$ otherwise.

The results are illustrated in Figures 9.1 and 9.2. Note that (9.1)-(9.2) is a maximization problem, and therefore the objective value in Figure 9.1 decreases as a function of the iteration number.

We see that the basic method with one surrogate constraint is slow in every case. This is mainly due to the very large size of the bounding box. Subproblem solutions were frequently far away from the true feasible set and, consequently, very short stepsizes had to be used. Even with active cuts remaining in the subproblem, the directions generated were not good enough to speed up convergence considerably.

Aggregation in subgroups is much better in our case, especially when the subgroups are logically related (a similar experiment with randomly chosen subgroups failed). The results for this version are not sensitive to the size of the bounding box.

As could be expected, the optimal stepsize rule performed significantly better than the harmonic rule. The partial aggregation method with the optimal stepsize rule found the exact solution of the problem. Surprisingly, our heuristic rule, with only two possible stepsize values, $1$ and $1/(k+1)$, also found the exact solution relatively quickly.
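The normalized-residual weights used by both variants are cheap to form. As a small sketch with hypothetical data (and with the convention, assumed here, that a node whose constraints are all satisfied simply contributes zero weights):

```python
import numpy as np

def partial_aggregates(w, v, a):
    """Aggregation weights for the transportation dual: h[i, j] is the
    violation max(0, w_i - v_j - a_ij); normalizing each row of h gives
    the coefficients mu of the source-node surrogates (9.3), and
    normalizing each column gives the coefficients nu of the
    destination-node surrogates."""
    h = np.maximum(0.0, w[:, None] - v[None, :] - a)
    row = h.sum(axis=1, keepdims=True)
    col = h.sum(axis=0, keepdims=True)
    mu = np.divide(h, row, out=np.zeros_like(h), where=row > 0)
    nu = np.divide(h, col, out=np.zeros_like(h), where=col > 0)
    return mu, nu

# hypothetical 2x2 example: source 0 violates both of its constraints
w = np.array([3.0, 0.0])
v = np.array([0.0, 0.0])
a = np.array([[1.0, 2.0],
              [5.0, 5.0]])
mu, nu = partial_aggregates(w, v, a)
print(mu[0])   # weights (2/3, 1/3) of the surrogate for source node 0
```

The full-aggregation coefficients $\lambda_{ij}^k$ are the same residuals normalized by the global sum $\sum_{i}\sum_{j} h_{ij}^+$ instead of the row or column sums.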
In fact, the final residual norm for the two best methods was of order $10^{-12}$ (the plots in Figure 9.2 are cut at level $10^{-6}$ to save space).

Obviously, our example can be solved by other techniques as well. The bundle method applied to the piecewise linear reformulation of the problem requires from 180 to 579 iterations, depending on the parameter settings [8], while the primal and the dual simplex methods from the CPLEX library [20] solve the problem in 138 and 157 iterations, respectively. Of course, it is not our objective at this stage of research to compare different methods; we just wanted to illustrate the properties of the aggregation principle on a simple and easy-to-understand example known from the literature.

We believe that the aggregation rule discussed in this paper may have the potential of reducing the number of constraints in very large convex optimization problems. Clearly,

[Figure 9.1 here: the objective value (vertical axis, 0 to 8,000,000) against the iteration number (horizontal axis, 0 to 50) for five variants: one aggregate with harmonic steps, one aggregate with optimal steps, n aggregates with harmonic steps, n aggregates with heuristic steps, and n aggregates with optimal steps.]

Figure 9.1: The objective value $f(w^k, v^k)$ as a function of the iteration number $k$.

[Figure 9.2 here: the residual norm (vertical axis, logarithmic scale from $10^{-6}$ to $10^{6}$) against the iteration number (horizontal axis, 0 to 50) for the same five variants.]

Figure 9.2: The norm of the residual vector $\left(\sum_{i=1}^{N}\sum_{j=1}^{N} \left(h_{ij}^+(w^k, v^k)\right)^2\right)^{1/2}$ as a function of the iteration number $k$.

more experience needs to be gained to fully assess its capabilities in special application areas, like stochastic programming and optimal control. We hope to analyse these topics in the near future.

New intriguing questions arise in connection with the possibility of applying our aggregation idea to integer programming problems, where earlier concepts of surrogate constraints proved particularly useful (see, e.g., [3, 4]).

Acknowledgement

The authors express their gratitude to Mr. Rafał Różycki for his help in the implementation of the method. Thanks are also offered to two anonymous referees for their very constructive comments.


References

[1] Yu.M. Ermoliev, "Methods for solving nonlinear extremal problems", Kibernetika 4 (1966) 1-17 (in Russian).

[2] Yu.M. Ermoliev, Methods of Stochastic Programming, Nauka, Moscow, 1976 (in Russian).

[3] F. Glover, "Surrogate constraint duality in mathematical programming", Operations Research 23 (1975) 434-451.

[4] F. Glover and D. Klingman, "Layering strategies for creating exploitable structure in linear and integer programs", Mathematical Programming 40 (1988) 165-181.

[5] R. Hettich and K.O. Kortanek, "Semi-infinite programming: theory, methods, and applications", SIAM Review 35 (1993) 380-429.

[6] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms, Springer-Verlag, Berlin, 1993.

[7] K.C. Kiwiel, Methods of Descent for Nondifferentiable Optimization, Springer-Verlag, Berlin, 1985.

[8] K.C. Kiwiel, "Proximity control in bundle methods for convex nondifferentiable minimization", Mathematical Programming 46 (1990) 105-122.

[9] N.N. Krasovskii and A.I. Subbotin, Game-Theoretical Control Problems, Springer-Verlag, Berlin, 1988.

[10] A.V. Kryazhimskii, "Convex optimization via feedbacks", Working Paper WP-94-109, IIASA, Laxenburg, 1994.

[11] C. Lemaréchal and R. Mifflin, eds., Nonsmooth Optimization, Pergamon Press, Oxford, 1978.

[12] V.S. Mikhalevich, A.M. Gupal and V.I. Norkin, Methods for Non-Convex Optimization, Nauka, Moscow, 1987 (in Russian).

[13] E.A. Nurminski, Numerical Methods for Convex Optimization, Nauka, Moscow, 1991 (in Russian).

[14] B.T. Polyak, "A general method for solving extremal problems", Doklady Akad. Nauk SSSR 174 (1967) 33-36 (in Russian).

[15] R.T. Rockafellar, "Lagrange multipliers and duality", SIAM Review 35 (1993) 183-238.

[16] D.F. Rogers, R.D. Plante, R.T. Wong and J.R. Evans, "Aggregation and disaggregation techniques and methodology in optimization", Operations Research 39 (1991) 553-582.

[17] A. Ruszczyński, "A linearization method for nonsmooth stochastic programming problems", Mathematics of Operations Research 12 (1987) 32-49.

[18] N.Z. Shor, "On the structure of algorithms for numerical solution of optimal planning and design problems", PhD Dissertation, Kiev, 1964 (in Russian).

[19] N.Z. Shor, Minimization Methods for Non-Differentiable Functions, Springer-Verlag, Berlin, 1985.

[20] Using the CPLEX Callable Library and CPLEX Mixed Integer Library, CPLEX Optimization, Incline Village, 1993.