
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 11, JUNE 1, 2016 2949

A Parallel Decomposition Method for Nonconvex Stochastic Multi-Agent Optimization Problems

Yang Yang, Member, IEEE, Gesualdo Scutari, Senior Member, IEEE, Daniel P. Palomar, Fellow, IEEE, and Marius Pesavento, Member, IEEE

Abstract—This paper considers the problem of minimizing the expected value of a (possibly nonconvex) cost function parameterized by a random (vector) variable, when the expectation cannot be computed accurately (e.g., because the statistics of the random variables are unknown and/or the computational complexity is prohibitive). Classical stochastic gradient methods for solving this problem may suffer from slow convergence. In this paper, we propose a stochastic parallel Successive Convex Approximation-based (best-response) algorithm for general nonconvex stochastic sum-utility optimization problems, which arise naturally in the design of multi-agent networks. The proposed novel decomposition approach enables all users to update their optimization variables in parallel by solving a sequence of strongly convex subproblems, one for each user. Almost sure convergence to stationary points is proved. We then customize the algorithmic framework to solve the stochastic sum rate maximization problem over single-input-single-output (SISO) frequency-selective interference channels, multiple-input-multiple-output (MIMO) interference channels, and MIMO multiple-access channels. Numerical results corroborate that the proposed algorithms can converge faster than state-of-the-art stochastic gradient schemes while achieving the same (or better) sum-rates.

Index Terms—Distributed algorithms, multi-agent systems, stochastic optimization, successive convex approximation.

I. INTRODUCTION

WIRELESS networks are composed of multiple users that may have different objectives and generate interference when no orthogonal multiplexing scheme is imposed to regulate

Manuscript received July 27, 2015; revised November 27, 2015; accepted November 27, 2015. Date of publication February 18, 2016; date of current version April 19, 2016. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Rui Zhang. The work of Y. Yang and M. Pesavento is supported by the Seventh Framework Programme for Research of the European Commission under grant number ADEL-619647. The work of G. Scutari is supported by the USA National Science Foundation under Grants CIF 1564044 and CAREER Award No. 1555850, and ONR Grant No. N00014-16-1-2244. The work of D. P. Palomar is supported by the Hong Kong RGC 16207814 research grant. Part of this work has been presented at the Fourteenth IEEE Workshop on Signal Processing Advances in Wireless Communications (SPAWC), June 2013 [1].

Y. Yang is with Intel Deutschland GmbH, 85579 Neubiberg, Germany (e-mail: [email protected]).

G. Scutari is with the Department of Industrial Engineering and Cyber Center (Discovery Park), Purdue University, West Lafayette, IN 47907-5400 USA (e-mail: [email protected]).

D. P. Palomar is with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: [email protected]).

M. Pesavento is with the Communication Systems Group, Darmstadt University of Technology, Darmstadt 64283, Germany (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2016.2531627

the transmissions; examples are peer-to-peer networks, cognitive radio systems, and ad-hoc networks. A common design of such multi-user systems is to optimize the (weighted) sum of the users' objective functions. This formulation, however, requires knowledge of the system parameters, such as the users' channel states. In practice this information is either difficult to acquire (e.g., when the parameters are rapidly changing) or imperfect due to estimation and signaling errors. In such scenarios, it is convenient to focus on the optimization of the long-term performance of the network, measured in terms of the expected value of the sum-utility function, parametrized by the random system parameters. In this paper, we consider the frequently encountered difficult case in which (the expected value of) the social function is nonconvex and the expectation cannot be computed (either numerically or in closed form). Such a system design naturally falls into the class of stochastic optimization problems [2], [3].

Gradient methods for unconstrained stochastic nonconvex optimization problems have been studied in [4]–[6], where almost sure convergence to stationary points has been established under some technical conditions; see, e.g., [5]. The extension of these methods to constrained optimization problems is not straightforward; in fact, the descent-based convergence analysis developed for unconstrained gradient methods no longer applies to their projected counterpart (due to the presence of the projection operator). Convergence of stochastic gradient projection methods has been proved only for convex objective functions [4], [7], [8].

To cope with nonconvexity, gradient averaging seems to be an essential step to ensure convergence; indeed, stochastic conditional gradient methods for nonconvex constrained problems hinge on this idea [9]–[12]: at each iteration, the new update of the variables is based on the average of the current and past gradient samples. Under some technical conditions, the averaged sample gradient eventually resembles the nominal (but unavailable) gradient of the (stochastic) objective function [9], [13]; convergence analysis can then be built on results from deterministic nonlinear programming.

Numerical experiments for large classes of problems show that plain gradient-like methods usually converge slowly. Some acceleration techniques have been proposed in the literature [8], [14], but only for strongly convex objective functions. Here we are interested in nonconvex (constrained) stochastic problems. Moreover, (proximal, accelerated) stochastic gradient-based schemes use only first-order information of the objective function (or its realizations); recently it was shown [15]–[17] that for deterministic nonconvex optimization problems, exploiting the structure of the function by replacing

1053-587X © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


its linearization with a "better" approximant can enhance empirical convergence speed. In this paper we aim at bringing this idea into the context of stochastic optimization problems.

Our main contribution is to develop a new, broad algorithmic framework for the computation of stationary solutions of a wide class of nonconvex stochastic optimization problems, encompassing many multi-agent system designs of practical interest. The essential idea underlying the proposed approach is to decompose the original nonconvex stochastic problem into a sequence of (simpler) deterministic subproblems. In this case, the objective function is replaced by suitably chosen sample convex approximations; the subproblems can then be solved in a parallel and distributed fashion across the users. Other key features of the proposed framework are: i) it is very flexible in the choice of the approximant of the nonconvex objective function, which need not necessarily be its first-order approximation, as in classical (proximal) gradient schemes; ii) it encompasses a gamut of algorithms that differ in cost per iteration, communication overhead, and convergence speed, while all converging under the same conditions; and iii) it can be successfully used to robustify the algorithms proposed in [15] for deterministic optimization problems when only inexact estimates of the system parameters are available, which makes them applicable to more realistic scenarios. As illustrative examples, we customize the proposed algorithms to some resource allocation problems in wireless communications, namely: the sum-rate maximization problems over MIMO Interference Channels (ICs) and Multiple Access Channels (MACs). The resulting algorithms outperform existing (gradient-based) methods both theoretically and numerically.

The proposed decomposition technique hinges on successive convex approximation (SCA) methods, and it is a nontrivial generalization to stochastic (nonconvex) optimization problems of the solution method proposed in [15] for deterministic optimization problems. We remark that [15] is not applicable to stochastic problems wherein the expected value of the objective function cannot be computed analytically, which is the case for the classes of problems studied in this paper. In fact, as shown also numerically (cf. Section IV.D), when applied to sample functions of stochastic optimization problems, the scheme in [15] may either not converge or converge to limit points that are not even stationary solutions of the stochastic optimization problem. Finally, since the scheme proposed in this paper is substantially different from that in [15], a further contribution of this work is establishing a new type of convergence analysis (see Appendix A) that conciliates random and SCA strategies; this analysis is also of interest per se and could bring further developments.

An SCA framework for stochastic optimization problems has also been proposed in a recent, independent submission [18]; however, the proposed method differs from [18] in many features. Firstly, the iterative algorithm proposed in [18] is based on a majorization-minimization approach, thus requiring that the convex approximation be a tight global upper bound of the (sample) objective function. This requirement, which is fundamental for the convergence of the schemes in [18], is no longer needed in the proposed algorithm. This represents a turning point in the design of distributed stochastic SCA-based methods, substantially enlarging the class of (large-scale) stochastic nonconvex problems solvable using the proposed framework. Secondly, even when the aforementioned upper-bound constraint can be met, it is not always guaranteed that the resulting convex (sample) subproblems are decomposable across the users, implying that a centralized implementation might be required in [18]; the proposed schemes instead naturally lead to a parallel and distributed implementation. Thirdly, the proposed methods converge under weaker conditions than those in [18]. Fourthly, numerical results on several test problems show that the proposed scheme outperforms [18]; see Section IV.

Finally, within the classes of approximation methods for stochastic optimization problems, it is worth mentioning the Sample Average Approximation (SAA) approach [18]–[21]: the "true" (stochastic) objective function is approximated by an ensemble average, and the resulting deterministic optimization problem is then solved by an appropriate numerical procedure. When the original objective function is nonconvex, the resulting SAA problem is nonconvex too, which makes the computation of its globally optimal solution at each step a difficult, if not impossible, task. Therefore, SAA-based methods are generally used to solve stochastic convex optimization problems.

The rest of the paper is organized as follows. Section II formulates the problem along with some motivating applications. The novel stochastic decomposition framework is introduced in Section III; customizations of the framework to some representative applications are discussed in Section IV. Finally, Section VI draws some conclusions.

II. PROBLEM FORMULATION

We consider the design of a multi-agent system composed of $I$ users; each user $i$ has his own strategy vector $\mathbf{x}_i$ to optimize, which belongs to the convex feasible set $\mathcal{X}_i$. The variables of the remaining users are denoted by $\mathbf{x}_{-i} \triangleq (\mathbf{x}_j)_{j \neq i}$, and the joint strategy set of all users is the Cartesian product set $\mathcal{X} \triangleq \mathcal{X}_1 \times \cdots \times \mathcal{X}_I$. The stochastic social optimization problem is formulated as:

$$\begin{array}{cl} \underset{\mathbf{x} \triangleq (\mathbf{x}_i)_{i=1}^{I}}{\text{minimize}} & U(\mathbf{x}) \triangleq \mathbb{E}\!\left[\displaystyle\sum_{j \in \mathcal{J}} f_j(\mathbf{x}, \boldsymbol{\xi})\right] \\[2mm] \text{subject to} & \mathbf{x}_i \in \mathcal{X}_i, \quad i = 1, \ldots, I, \end{array} \tag{1}$$

where $\mathcal{J} \triangleq \{1, \ldots, J\}$, with $J$ being the number of functions; each cost function $f_j(\mathbf{x}, \boldsymbol{\xi})$ depends on the joint strategy vector $\mathbf{x}$ and a random vector $\boldsymbol{\xi}$, defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with $\Omega$ being the sample space, $\mathcal{F}$ being the $\sigma$-algebra generated by subsets of $\Omega$, and $\mathbb{P}$ being a probability measure defined on $\mathcal{F}$, which need not be known. Note that the optimization variables can be complex-valued; in such a case, all the gradients of real-valued functions are intended to be conjugate gradients [22], [23].

Assumptions: We make the following assumptions:
a) Each $\mathcal{X}_i$ is compact and convex;
b) Each $f_j(\bullet, \boldsymbol{\xi})$ is continuously differentiable on $\mathcal{X}$, for any given $\boldsymbol{\xi}$, and the gradient is Lipschitz continuous with constant $L_{\nabla f_j}(\boldsymbol{\xi})$. Furthermore, the gradient of $U(\mathbf{x})$ is Lipschitz continuous with constant $L_{\nabla U}$.

These assumptions are quite standard and are satisfied by a large class of problems. Note that the existence of a solution to (1) is


guaranteed by Assumption (a). Since $U(\mathbf{x})$ is not assumed to be jointly convex in $\mathbf{x}$, (1) is generally nonconvex. Some instances of (1) satisfying the above assumptions are briefly listed next.

Example #1: Consider the maximization of the ergodic sum-rate over frequency-selective ICs:

$$\begin{array}{cl} \underset{\mathbf{p}_1, \ldots, \mathbf{p}_I}{\text{maximize}} & \mathbb{E}\!\left[\displaystyle\sum_{i=1}^{I} \sum_{n=1}^{N} \log\!\left(1 + \frac{|h_{ii}^n|^2\, p_i^n}{\sigma_i^{n\,2} + \sum_{j \neq i} |h_{ij}^n|^2\, p_j^n}\right)\right] \\[2mm] \text{subject to} & \mathbf{p}_i \in \mathcal{P}_i \triangleq \left\{\mathbf{p}_i : \mathbf{p}_i \ge \mathbf{0},\ \textstyle\sum_{n=1}^{N} p_i^n \le P_i\right\}, \quad i = 1, \ldots, I, \end{array} \tag{2}$$

where $\mathbf{p}_i \triangleq (p_i^n)_{n=1}^{N}$, with $p_i^n$ being the transmit power of user $i$ on subchannel (subcarrier) $n$, $N$ is the number of parallel subchannels, $P_i$ is the total power budget, $h_{ij}^n$ is the channel coefficient from transmitter $j$ to receiver $i$ on subchannel $n$, and $\sigma_i^{n\,2}$ is the variance of the thermal noise over subchannel $n$ at receiver $i$. The expectation is taken over the channel coefficients $\mathbf{h} \triangleq (h_{ij}^n)_{i,j,n}$.

Example #2: The maximization of the ergodic sum-rate over MIMO ICs also falls into the class of problems (1):

$$\begin{array}{cl} \underset{\mathbf{Q}_1, \ldots, \mathbf{Q}_I}{\text{maximize}} & \mathbb{E}\!\left[\displaystyle\sum_{i=1}^{I} \log\det\!\left(\mathbf{I} + \mathbf{H}_{ii}\,\mathbf{Q}_i\,\mathbf{H}_{ii}^H\,\mathbf{R}_i(\mathbf{Q}_{-i}, \mathbf{H})^{-1}\right)\right] \\[2mm] \text{subject to} & \mathbf{Q}_i \succeq \mathbf{0},\quad \operatorname{tr}(\mathbf{Q}_i) \le P_i, \quad i = 1, \ldots, I, \end{array} \tag{3}$$

where $\mathbf{R}_i(\mathbf{Q}_{-i}, \mathbf{H}) \triangleq \mathbf{R}_{N_i} + \sum_{j \neq i} \mathbf{H}_{ij}\,\mathbf{Q}_j\,\mathbf{H}_{ij}^H$ is the covariance matrix of the thermal noise $\mathbf{R}_{N_i}$ (assumed to be full rank) plus the multi-user interference, $P_i$ is the total power budget, and the expectation in (3) is taken over the channels $\mathbf{H} \triangleq (\mathbf{H}_{ij})_{i,j}$.

Example #3: Another application of interest is the maximization of the ergodic sum-rate over MIMO MACs:

$$\begin{array}{cl} \underset{\mathbf{Q}_1, \ldots, \mathbf{Q}_I}{\text{maximize}} & \mathbb{E}\!\left[\log\det\!\left(\mathbf{I} + \displaystyle\sum_{i=1}^{I} \mathbf{H}_i\,\mathbf{Q}_i\,\mathbf{H}_i^H\right)\right] \\[2mm] \text{subject to} & \mathbf{Q}_i \succeq \mathbf{0},\quad \operatorname{tr}(\mathbf{Q}_i) \le P_i, \quad i = 1, \ldots, I. \end{array} \tag{4}$$

This is a special case of (1) where the utility function is concave in $\mathbf{Q} \triangleq (\mathbf{Q}_i)_{i=1}^{I}$, and the expectation in (4) is taken over the channels $\mathbf{H} \triangleq (\mathbf{H}_i)_{i=1}^{I}$.

Example #4: The algorithmic framework that will be introduced shortly can be successfully used also to robustify distributed iterative algorithms solving deterministic (nonconvex) social problems in the presence of inexact estimates of the system parameters. More specifically, consider for example the following sum-cost minimization multi-agent problem:

$$\underset{\mathbf{x}}{\text{minimize}}\ \ \sum_{j=1}^{J} f_j(\mathbf{x}_1, \ldots, \mathbf{x}_I) \qquad \text{subject to}\ \ \mathbf{x}_i \in \mathcal{X}_i,\ \ i = 1, \ldots, I, \tag{5}$$

where each $f_i(\mathbf{x})$ is uniformly convex in $\mathbf{x}_i$. An efficient distributed algorithm converging to stationary solutions of (5) has been recently proposed in [15]: at each iteration $t$, given the current iterate $\mathbf{x}^t$, every agent $i$ minimizes (w.r.t. $\mathbf{x}_i \in \mathcal{X}_i$) the following convexified version of the social function:

$$f_i(\mathbf{x}_i, \mathbf{x}_{-i}^t) + \boldsymbol{\pi}_i(\mathbf{x}^t)^H (\mathbf{x}_i - \mathbf{x}_i^t) + \tau_i\, \|\mathbf{x}_i - \mathbf{x}_i^t\|^2,$$

where $\boldsymbol{\pi}_i(\mathbf{x}^t)$ stands for $\sum_{j \neq i} \nabla_{\mathbf{x}_i^*} f_j(\mathbf{x}^t)$, and $\nabla_{\mathbf{x}_i^*} f_j(\mathbf{x}^t)$ is the (conjugate) gradient of $f_j$ w.r.t. $\mathbf{x}_i$ evaluated at $\mathbf{x}^t$. The evaluation of the above function requires the exact knowledge of $\nabla_{\mathbf{x}_i^*} f_j(\mathbf{x}^t)$ for all $j \neq i$. In practice, however, only a noisy estimate of $\nabla_{\mathbf{x}_i^*} f_j(\mathbf{x}^t)$ is available [24]–[26]. In such cases, convergence of pricing-based algorithms [15], [27]–[29] is no longer guaranteed. We will show in Section IV.C that the proposed framework can be readily applied to robustify (and make convergent) pricing-based schemes, such as those in [15], [27]–[29].

Since the class of problems (1) is in general nonconvex (possibly NP-hard [30]), the focus of this paper is to design distributed solution methods for computing stationary solutions (possibly local minima) of (1). The major goal is to devise parallel (nonlinear) best-response schemes that converge even when the expected value in (1) cannot be computed accurately and only sample values of $\boldsymbol{\xi}$ are available.
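To make the last point concrete, the following sketch (a toy sample cost with illustrative names, not taken from the paper) shows the situation the framework is designed for: the social function is nonlinear in the random parameter, so its expected value has no closed form and is accessible only through Monte Carlo samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample cost for two users with scalar strategies x = (x1, x2): a
# rate-like function that is nonlinear in the random parameter xi, so
# U(x) = E[f(x, xi)] has no closed form for a generic distribution of xi.
# (Both f and the distribution below are illustrative, not from the paper.)
def f(x, xi):
    return -np.log(1.0 + xi * x[0] / (1.0 + xi * x[1]))

x = np.array([0.3, 0.7])

# The expectation can only be approached through sample averages:
samples = rng.lognormal(mean=0.0, sigma=1.0, size=200000)
U_estimate = np.mean([f(x, xi) for xi in samples])
print(U_estimate)  # Monte Carlo estimate of U(x); the exact value is unavailable
```

Sample-average approximations of this kind are exactly what the SAA approach mentioned in Section I uses; the decomposition proposed next avoids forming such large ensemble averages and instead processes one realization per iteration.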

III. A NOVEL PARALLEL STOCHASTIC DECOMPOSITION

The social problem (1) faces two main issues: i) the nonconvexity of the objective functions; and ii) the impossibility of estimating accurately the expected value. To deal with these difficulties, we propose a decomposition scheme that consists in solving a sequence of parallel strongly convex subproblems (one for each user), where the objective function of user $i$ is obtained from $U(\mathbf{x})$ by replacing the expected value with a suitably chosen incremental sample estimate of it and linearizing the nonconvex part. More formally, at iteration $t$, a random vector $\boldsymbol{\xi}^t$ is realized,¹ and user $i$ solves the following problem: given $\mathbf{x}^t$ and $\boldsymbol{\xi}^t$, let

$$\hat{\mathbf{x}}_i(\mathbf{x}^t, \boldsymbol{\xi}^t) \triangleq \underset{\mathbf{x}_i \in \mathcal{X}_i}{\arg\min}\; \hat{f}_i(\mathbf{x}_i; \mathbf{x}^t, \boldsymbol{\xi}^t), \tag{6a}$$

with the surrogate function $\hat{f}_i$ defined as

$$\begin{aligned} \hat{f}_i(\mathbf{x}_i; \mathbf{x}^t, \boldsymbol{\xi}^t) \triangleq\; & \rho^t \sum_{j \in \mathcal{S}_i^t} f_j(\mathbf{x}_i, \mathbf{x}_{-i}^t, \boldsymbol{\xi}^t) + \rho^t\, \boldsymbol{\pi}_i(\mathbf{x}^t, \boldsymbol{\xi}^t)^H (\mathbf{x}_i - \mathbf{x}_i^t) \\ & + (1 - \rho^t)\, (\mathbf{f}_i^{t-1})^H (\mathbf{x}_i - \mathbf{x}_i^t) + \tau_i\, \|\mathbf{x}_i - \mathbf{x}_i^t\|^2, \end{aligned} \tag{6b}$$

where the pricing vector $\boldsymbol{\pi}_i(\mathbf{x}^t, \boldsymbol{\xi}^t)$ is given by

$$\boldsymbol{\pi}_i(\mathbf{x}^t, \boldsymbol{\xi}^t) \triangleq \sum_{j \in \bar{\mathcal{S}}_i^t} \nabla_{\mathbf{x}_i^*} f_j(\mathbf{x}^t, \boldsymbol{\xi}^t), \tag{6c}$$

and $\mathbf{f}_i^t$ is an accumulation vector updated recursively according to

$$\mathbf{f}_i^t = (1 - \rho^t)\, \mathbf{f}_i^{t-1} + \rho^t \Big(\boldsymbol{\pi}_i(\mathbf{x}^t, \boldsymbol{\xi}^t) + \sum_{j \in \mathcal{S}_i^t} \nabla_{\mathbf{x}_i^*} f_j(\mathbf{x}^t, \boldsymbol{\xi}^t)\Big), \tag{6d}$$

with $\{\rho^t\} \subset (0, 1]$ being a sequence to be properly chosen. Here $\boldsymbol{\xi}^0, \ldots, \boldsymbol{\xi}^t$ are realizations of random vectors defined

¹With slight abuse of notation, throughout the paper we use the same symbol to denote both the random vector and its realizations.


Algorithm 1: Stochastic parallel decomposition algorithm

Data: $\tau_i > 0$ for all $i$, $\{\gamma^t\}$, $\{\rho^t\}$, $\mathbf{x}^0 \in \mathcal{X}$, $\mathbf{f}_i^{-1} = \mathbf{0}$; set $t = 0$.

(S.1): If $\mathbf{x}^t$ satisfies a suitable termination criterion: STOP.

(S.2): The random vector $\boldsymbol{\xi}^t$ is realized; for all $i$, compute $\hat{\mathbf{x}}_i(\mathbf{x}^t, \boldsymbol{\xi}^t)$ (cf. (6)).

(S.3): For all $i$, update $\mathbf{x}_i^{t+1}$ according to
$$\mathbf{x}_i^{t+1} = \mathbf{x}_i^t + \gamma^{t+1}\left(\hat{\mathbf{x}}_i(\mathbf{x}^t, \boldsymbol{\xi}^t) - \mathbf{x}_i^t\right), \quad i = 1, \ldots, I.$$

(S.4): For all $i$, update $\mathbf{f}_i^t$ according to (6d).

(S.5): $t \leftarrow t + 1$, and go to (S.1).

on $(\Omega, \mathcal{F})$, at iterations $0, \ldots, t$, respectively. The other symbols in (6) are defined as follows:

• In (6b): $\mathcal{S}_i^t$ is any subset of $\mathcal{C}_i \triangleq \{j : f_j(\bullet, \mathbf{x}_{-i}, \boldsymbol{\xi}) \text{ is convex in } \mathbf{x}_i, \ \forall \mathbf{x}_{-i}, \boldsymbol{\xi}\}$, the set of indices of functions that are convex in $\mathbf{x}_i$;

• In (6c): $\bar{\mathcal{S}}_i^t$ denotes the complement of $\mathcal{S}_i^t$, i.e., $\bar{\mathcal{S}}_i^t \triangleq \{1, \ldots, J\} \setminus \mathcal{S}_i^t$; thus, it contains (at least) the indices of the functions $f_j$ that are nonconvex in $\mathbf{x}_i$, given $\mathbf{x}_{-i}$ and $\boldsymbol{\xi}$;

• In (6c)–(6d): $\nabla_{\mathbf{x}_i^*} f_j(\mathbf{x}, \boldsymbol{\xi})$ is the gradient of $f_j(\mathbf{x}, \boldsymbol{\xi})$ w.r.t. $\mathbf{x}_i^*$ (the complex conjugate of $\mathbf{x}_i$). Note that, since $f_j(\mathbf{x}, \boldsymbol{\xi})$ is real-valued, $\nabla_{\mathbf{x}_i^*} f_j(\mathbf{x}, \boldsymbol{\xi}) = \left(\nabla_{\mathbf{x}_i} f_j(\mathbf{x}, \boldsymbol{\xi})\right)^*$.

Given $\hat{\mathbf{x}}(\mathbf{x}^t, \boldsymbol{\xi}^t) \triangleq (\hat{\mathbf{x}}_i(\mathbf{x}^t, \boldsymbol{\xi}^t))_{i=1}^{I}$, $\mathbf{x}^{t+1}$ is updated according to

$$\mathbf{x}^{t+1} = \mathbf{x}^t + \gamma^{t+1}\left(\hat{\mathbf{x}}(\mathbf{x}^t, \boldsymbol{\xi}^t) - \mathbf{x}^t\right), \tag{7}$$

where $\gamma^{t+1} \in (0, 1]$. Note that the iterate $\mathbf{x}^t$ is a function of the past history of the algorithm up to iteration $t$ (we omit this dependence for notational simplicity):

$$\mathcal{F}^t \triangleq \left\{\mathbf{x}^0, \ldots, \mathbf{x}^{t-1}, \boldsymbol{\xi}^0, \ldots, \boldsymbol{\xi}^{t-1}\right\}.$$

Since $\boldsymbol{\xi}^0, \ldots, \boldsymbol{\xi}^{t-1}$ are random vectors, $\hat{\mathbf{x}}(\mathbf{x}^t, \boldsymbol{\xi}^t)$ and $\mathbf{x}^{t+1}$ are random vectors as well.

The subproblems (6a) have an interesting interpretation: each user minimizes a sample convex approximation of the original nonconvex stochastic function. The first term in (6b) preserves the convex component (or part of it, if $\mathcal{S}_i^t \subset \mathcal{C}_i$) of the sample social function. The second term in (6b)—the vector $\boldsymbol{\pi}_i(\mathbf{x}^t, \boldsymbol{\xi}^t)$—comes from the linearization of (at least) the nonconvex part. The vector $\mathbf{f}_i^{t-1}$ in the third term represents the incremental estimate of $\nabla_{\mathbf{x}_i^*} U(\mathbf{x})$ (which is not available), as one can readily check by substituting (6c) into (6d):

$$\mathbf{f}_i^t = (1 - \rho^t)\,\mathbf{f}_i^{t-1} + \rho^t\, \nabla_{\mathbf{x}_i^*} \sum_{j=1}^{J} f_j(\mathbf{x}^t, \boldsymbol{\xi}^t) = \sum_{m=0}^{t} \rho^m \Bigg(\prod_{l=m+1}^{t} (1 - \rho^l)\Bigg) \nabla_{\mathbf{x}_i^*} \sum_{j=1}^{J} f_j(\mathbf{x}^m, \boldsymbol{\xi}^m), \tag{8}$$

with $\mathbf{f}_i^{-1} \triangleq \mathbf{0}$. Roughly speaking, the goal of this third term is to estimate on-the-fly the unknown $\nabla_{\mathbf{x}_i^*} U(\mathbf{x})$ from its samples collected over the iterations; based on (8), such an estimate is expected to become more and more accurate as $t$ increases, provided that the sequence $\{\rho^t\}$ is properly chosen (this statement is made rigorous shortly in Theorem 1). The last quadratic term in (6b) is the proximal regularization, whose numerical benefits are well understood [31].
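The averaging mechanism behind this incremental estimate can be checked numerically. The following sketch (a toy one-dimensional example with illustrative names, not from the paper) holds the iterate fixed to isolate the recursion and shows the accumulation variable approaching the true, unavailable gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar sample gradient (illustrative, not from the paper):
# grad(x, xi) = 2*x + xi with E[xi] = 0, so the true gradient at x is 2*x.
def sample_grad(x, xi):
    return 2.0 * x + xi

x = 1.0        # iterate held fixed, to isolate the averaging step
f_hat = 0.0    # accumulation variable, cf. (6d)/(8)

for t in range(1, 20001):
    rho = 1.0 / t**0.6            # diminishing weight, in the spirit of (12)
    xi = rng.standard_normal()    # realization of the random parameter
    f_hat = (1 - rho) * f_hat + rho * sample_grad(x, xi)

print(abs(f_hat - 2.0 * x))  # small: the estimate tracks the true gradient
```

Because the weights $\rho^t$ are diminishing but not summable, old noisy samples are progressively forgotten while fresh ones are still incorporated, which is precisely why the estimate improves as $t$ grows.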

Given (6), we define the "best-response" mapping as: given $\boldsymbol{\xi}$,

$$\mathcal{X} \ni \mathbf{y} \mapsto \hat{\mathbf{x}}(\mathbf{y}, \boldsymbol{\xi}) \triangleq \left(\hat{\mathbf{x}}_i(\mathbf{y}, \boldsymbol{\xi})\right)_{i=1}^{I}. \tag{9}$$

Note that $\hat{\mathbf{x}}(\bullet, \boldsymbol{\xi})$ is well-defined for any given $\boldsymbol{\xi}$ because the objective function in (6) is strongly convex with constant $\tau_i$:

$$\operatorname{Re}\left\{\left(\mathbf{x}_i - \mathbf{z}_i\right)^H \left(\nabla_{\mathbf{x}_i^*} \hat{f}_i(\mathbf{x}_i; \mathbf{y}, \boldsymbol{\xi}) - \nabla_{\mathbf{x}_i^*} \hat{f}_i(\mathbf{z}_i; \mathbf{y}, \boldsymbol{\xi})\right)\right\} \ge \tau_i\, \|\mathbf{x}_i - \mathbf{z}_i\|^2, \quad \forall\, \mathbf{x}_i, \mathbf{z}_i \in \mathcal{X}_i. \tag{10}$$

The proposed decomposition scheme is formally described in Algorithm 1, and its convergence properties are stated in Theorem 1, under the following standard boundedness assumptions on the instantaneous gradient errors [24], [32].

Assumption (c): The instantaneous gradient is unbiased with bounded variance; that is, the following holds almost surely (with $F(\mathbf{x}, \boldsymbol{\xi}) \triangleq \sum_{j=1}^{J} f_j(\mathbf{x}, \boldsymbol{\xi})$ and $\mathcal{F}^t$ the past history of the algorithm up to iteration $t$):

$$\mathbb{E}\!\left[\nabla_{\mathbf{x}} F(\mathbf{x}^t, \boldsymbol{\xi}^t) \mid \mathcal{F}^t\right] = \nabla_{\mathbf{x}} U(\mathbf{x}^t) \quad \text{and} \quad \mathbb{E}\!\left[\left\|\nabla_{\mathbf{x}} F(\mathbf{x}^t, \boldsymbol{\xi}^t) - \nabla_{\mathbf{x}} U(\mathbf{x}^t)\right\|^2 \mid \mathcal{F}^t\right] < \infty.$$

This assumption is readily satisfied if the random variables $\{\boldsymbol{\xi}^t\}$ are bounded and identically distributed.

Theorem 1: Given problem (1) under Assumptions (a)–(c), suppose that $\tau_i > 0$ in (6b) and that the step-sizes $\{\gamma^t\}$ and $\{\rho^t\}$ are chosen so that

i) $\gamma^t \in (0, 1]$ and $\rho^t \in (0, 1]$, with $\gamma^t \to 0$ and $\rho^t \to 0$; (11a)

ii) $\sum_t \gamma^t = +\infty$ and $\sum_t (\gamma^t)^2 < +\infty$; (11b)

iii) $\sum_t \rho^t = +\infty$, $\sum_t (\rho^t)^2 < +\infty$, and $\lim_{t \to \infty} \gamma^t / \rho^t = 0$; (11c)

iv) $\lim_{t \to \infty} \gamma^t\, L_{\nabla f}(\boldsymbol{\xi}^t) = 0$ almost surely. (11d)

Then, every limit point of the sequence $\{\mathbf{x}^t\}$ generated by Algorithm 1 (at least one such point exists) is a stationary point of (1) almost surely.

Proof: See Appendix A.

On Assumption (c): The boundedness condition is stated in terms of the conditional expectation of the (random) gradient error. Compared with [18], Assumption (c) is weaker, because [18] requires that every realization of the (random) gradient error be bounded.

On Condition (11d): The condition has the following interpretation: all increasing subsequences of $\{L_{\nabla f}(\boldsymbol{\xi}^t)\}$ must grow slower than $1/\gamma^t$. We will discuss later in Section IV how this assumption is satisfied for specific applications. Note that if $L_{\nabla f}(\boldsymbol{\xi})$ is uniformly bounded for any $\boldsymbol{\xi}$ (which is indeed the case if $\boldsymbol{\xi}$ is a bounded random vector), then (11d) is trivially satisfied.

On Algorithm 1: To the best of our knowledge, Algorithm 1 is the first parallel best-response (i.e., nongradient-like)


scheme for nonconvex stochastic sum-utility problems in the form (1): all the users update their strategies in parallel (possibly with a memory) by solving a sequence of decoupled (strongly) convex subproblems (cf. (6)). It performs empirically better than classical stochastic gradient-based schemes at no extra signaling cost, because the convexity of the objective function, if any, is better exploited. Numerical experiments on specific applications confirm this intuition; see Section IV. Moreover, by choosing different instances of the set $\mathcal{S}_i^t$ in (6b), one obtains convex subproblems that may exhibit a different trade-off between cost per iteration and convergence speed. Finally, it is guaranteed to converge under very weak assumptions (e.g., weaker than those in [18]) while offering some flexibility in the choice of the free parameters (cf. Theorem 1).

Diminishing Stepsize Rules: Convergence is guaranteed if a diminishing stepsize rule satisfying (11) is chosen. An instance of (11) is, e.g., the following:

$$\gamma^t = \frac{1}{t^{\alpha}}, \qquad \rho^t = \frac{1}{t^{\beta}}, \qquad \tfrac{1}{2} < \beta < \alpha \le 1. \tag{12}$$

Roughly speaking, (11) says that the stepsizes $\gamma^t$ and $\rho^t$, while diminishing (with $\gamma^t$ decreasing faster than $\rho^t$), need not go to zero too fast. This kind of stepsize rule is in the same spirit as those used to guarantee convergence of gradient methods with error; see [33] for more details.

Implementation Issues: In order to compute the best-response, each user $i$ needs to know $\{f_j(\mathbf{x}_i, \mathbf{x}_{-i}^t, \boldsymbol{\xi}^t)\}_{j \in \mathcal{S}_i^t}$ and the pricing vector $\boldsymbol{\pi}_i(\mathbf{x}^t, \boldsymbol{\xi}^t)$. The signaling required to acquire this information is generally problem-dependent. If the problem under consideration does not have any specific structure, the most natural message-passing strategy is to communicate directly $\mathbf{x}_{-i}^t$ and $\boldsymbol{\pi}_i(\mathbf{x}^t, \boldsymbol{\xi}^t)$. However, in many specific applications significantly reduced signaling may be required; see Section IV for some examples. Note that the signaling is of the same spirit as that of pricing-based algorithms proposed in the literature for the maximization of deterministic sum-utility functions [15], [29]; no extra communication is required to update $\mathbf{f}_i^t$: once the new pricing vector is available, the recursive update (6d) for the "incremental" gradient is based on a local accumulation register keeping track of the last iterate $\mathbf{f}_i^{t-1}$. Note also that, thanks to the simultaneous nature of the proposed scheme, the overall communication overhead is expected to be less than that required to implement sequential schemes, such as the deterministic schemes in [29].
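The overall scheme can be sketched end-to-end on a toy problem. The example below (illustrative names and a scalar strategy per user; not the paper's code) runs the steps of Algorithm 1 on the single-convex-function special case $U(\mathbf{x}) = \mathbb{E}[(x_1 + x_2 - \xi)^2]$ with $\mathcal{X}_i = [0, 1]$ and $\mathbb{E}[\xi] = 1$, where each user keeps its own sample term and the per-user surrogate has a closed-form minimizer:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance (illustrative, not from the paper): two users, scalar
# strategies in [0, 1], single sample function f(x, xi) = (x1 + x2 - xi)^2
# with E[xi] = 1; f is convex in each x_i separately.
tau = 1.0
x = np.zeros(2)   # x^0
f_hat = 0.0       # accumulation variable f^{t-1}; here df/dx1 = df/dx2,
                  # so a single register serves both users

for t in range(1, 5001):
    gamma, rho = 1.0 / t**0.9, 1.0 / t**0.6   # stepsizes satisfying (12)
    xi = 1.0 + 0.1 * rng.standard_normal()    # realization of the random parameter

    # Parallel best-responses: each user i minimizes over [0, 1] the strongly
    # convex surrogate rho*f(x_i, x_{-i}^t, xi) + (1 - rho)*f_hat*(x_i - x_i^t)
    # + tau*(x_i - x_i^t)^2, which has a closed-form (clipped) minimizer.
    x_hat = np.empty(2)
    for i in range(2):
        b = x[1 - i]  # the other user's current strategy, held fixed
        x_hat[i] = np.clip((2*rho*(xi - b) - (1 - rho)*f_hat + 2*tau*x[i])
                           / (2*rho + 2*tau), 0.0, 1.0)

    f_hat = (1 - rho) * f_hat + rho * 2*(x[0] + x[1] - xi)  # cf. (6d)
    x = x + gamma * (x_hat - x)                             # cf. (7)

print(x[0] + x[1])  # ≈ 1, i.e., the iterates approach a minimizer of U
```

Every point with $x_1 + x_2 = 1$ minimizes the toy social function, and by symmetry the iterates settle near $(0.5, 0.5)$; note that only one realization of $\xi$ is processed per iteration, as in Algorithm 1.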

A. Some Special Cases

We customize next the proposed general algorithmic framework to specific instances of problem (1) arising naturally in many applications.

1) Stochastic Proximal Conditional Gradient Methods: Quite interestingly, the proposed decomposition technique resembles classical stochastic conditional gradient schemes [4] when one chooses in (6b) $\mathcal{S}_i^t = \emptyset$ for all $i$ and $t$, resulting in the following surrogate function:

$$\hat{f}_i(\mathbf{x}_i; \mathbf{x}^t, \boldsymbol{\xi}^t) = \rho^t\, \boldsymbol{\pi}_i(\mathbf{x}^t, \boldsymbol{\xi}^t)^H (\mathbf{x}_i - \mathbf{x}_i^t) + (1 - \rho^t)\, (\mathbf{f}_i^{t-1})^H (\mathbf{x}_i - \mathbf{x}_i^t) + \tau_i\, \|\mathbf{x}_i - \mathbf{x}_i^t\|^2, \tag{13}$$

with $\mathbf{f}_i^t$ updated according to (8). Note that traditional stochastic conditional gradient methods [9] do not have the proximal regularization term in (13). However, it is worth mentioning that, for some of the applications introduced in Section II, it is just the presence of the proximal term that allows one to compute the best-response resulting from the minimization of (13) in closed form; see Section IV.B.

2) Stochastic Best-Response Algorithm for Single (Convex)

Functions: Suppose that the social function in (1) is a single function $U(\mathbf{x}) = \mathbb{E}[f(\mathbf{x}_1, \ldots, \mathbf{x}_I, \boldsymbol{\xi})]$, with $f(\bullet, \mathbf{x}_{-i}, \boldsymbol{\xi})$ convex in each $\mathbf{x}_i$ (but not necessarily jointly), for any given $\boldsymbol{\xi}$. This optimization problem is a special case of the general formulation (1), with $J = 1$ and $f_1 = f$. Since $f(\bullet, \mathbf{x}_{-i}, \boldsymbol{\xi})$ is componentwise convex, a natural choice for the surrogate functions is setting $\mathcal{S}_i^t = \{1\}$ for all $i$ and $t$, resulting in the following:

$$\hat{f}_i(\mathbf{x}_i; \mathbf{x}^t, \boldsymbol{\xi}^t) = \rho^t f(\mathbf{x}_i, \mathbf{x}_{-i}^t, \boldsymbol{\xi}^t) + (1 - \rho^t)\, (\mathbf{f}_i^{t-1})^H (\mathbf{x}_i - \mathbf{x}_i^t) + \tau_i\, \|\mathbf{x}_i - \mathbf{x}_i^t\|^2, \tag{14}$$

where $\mathbf{f}_i^t$ is updated according to $\mathbf{f}_i^t = (1 - \rho^t)\,\mathbf{f}_i^{t-1} + \rho^t\, \nabla_{\mathbf{x}_i^*} f(\mathbf{x}^t, \boldsymbol{\xi}^t)$. Convergence conditions are still given by Theorem 1. It is worth mentioning that the same choice comes out naturally when $f(\bullet, \boldsymbol{\xi})$ is uniformly jointly convex; in such a case the proposed algorithm converges (in the sense of Theorem 1) to the global optimum of $U$. An interesting application of this algorithm is the maximization of the ergodic sum-rate over MIMO MACs in (4), resulting in the first convergent simultaneous stochastic MIMO Iterative Waterfilling algorithm in the literature; see Section IV.C.

3) Stochastic Pricing Algorithms: Suppose that $J = I$ and

each (implying that is uniformly convexon ). By taking each for all , the surrogate functionin (6b) reduces to

(15)

where and . This is the generalization of the deterministic pricing algorithms [15], [29] to stochastic optimization problems. Examples of this class of problems are the ergodic sum-rate maximization problems over SISO and MIMO ICs formulated in (2)–(3); see Sections IV.A and IV.B.

4) Stochastic DC Programming: A stochastic DC programming problem is formulated as

(16)

where both and are uniformly convex functions on for any given . A natural choice of the surrogate functions for (16) is to linearize the concave part of the sample sum-utility function, resulting in the following


where and

Comparing the surrogate functions (14)–(16) with (13), one can appreciate the potential advantage of the proposed algorithm over classical gradient-based methods: the proposed schemes preserve the (partial) convexity of the original sample function, while gradient-based methods use only first-order approximations. The proposed algorithmic framework is thus of the best-response type, and empirically it yields faster convergence than gradient-based methods. The improvement in practical convergence speed will be illustrated numerically in the next section.
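As a minimal illustration of this best-response pattern (the scalar objective, noise level, proximal weight, and stepsize exponents below are illustrative choices, not those used in the paper), consider minimizing E[(x − ξ)²] + sin(x): the convex sample term is kept in the surrogate, the nonconvex part is linearized through a recursively averaged gradient estimate [cf. (8)], and the iterate is smoothed with a diminishing stepsize:

```python
import numpy as np

# Toy instance: minimize E[(x - xi)^2] + sin(x) over x, with xi ~ N(0, 0.1^2).
# The convex sample term (x - xi)^2 is kept in the surrogate; the nonconvex
# part sin(x) is linearized around the iterate using a recursively averaged
# gradient estimate (cf. the averaging update (8)).
rng = np.random.default_rng(1)
tau = 1.0        # proximal weight making the surrogate strongly convex
x, d = 2.0, 0.0  # iterate and averaged gradient estimate of the nonconvex part

for t in range(1, 50001):
    rho = t ** -0.6    # averaging weight (diminishing, non-summable)
    gamma = t ** -0.9  # stepsize (decays faster than rho)
    xi = 0.1 * rng.standard_normal()
    d = (1 - rho) * d + rho * np.cos(x)   # averaged estimate of d/dx sin(x)
    # best-response: argmin_z (z - xi)^2 + d*(z - x) + (tau/2)*(z - x)^2
    z = (2 * xi + tau * x - d) / (2 + tau)
    x = x + gamma * (z - x)               # smoothed iterate update
```

Because of the proximal term, the surrogate minimizer `z` has a closed form, and the iterate settles near the stationarity condition 2x + cos(x) = 0 of the expected objective.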

IV. APPLICATIONS

In this section, we customize the proposed algorithmic framework to some of the applications introduced in Section II, and compare the resulting algorithms with both classical stochastic gradient algorithms and state-of-the-art schemes proposed for the specific problems under consideration. Numerical results clearly show that the proposed algorithms compare favorably with state-of-the-art schemes.

A. Sum-Rate Maximization Over Frequency-Selective ICs

Consider the sum-rate maximization problem over frequency-selective ICs, as introduced in (2). Since the instantaneous rate of each user ,

is uniformly strongly concave in , a natural choice for the surrogate function is the one in (15), wherein

is kept unchanged while is linearized. This leads to the following best-response functions

(17a)

where with

The variable is updated according to . Note that in (17a) can be computed in closed form [15]:

(18)

where

and is the Lagrange multiplier such that , and it can be found efficiently using a standard bisection method.

The overall stochastic pricing-based algorithm is then

given by Algorithm 1 with the best-response mapping defined in (18); convergence is guaranteed under conditions i)–iv) in Theorem 1. Note that the theorem is trivially satisfied using stepsize rules as required in i)–iii) [e.g., (12)]; the only condition that needs further consideration is condition iv). We can assume without loss of generality (w.l.o.g.) that the sequence of Lipschitz constants is monotonically increasing at a rate no slower than (we can always limit the discussion to such a subsequence). For any , define and assume w.l.o.g. that . Note that the Lipschitz constant is upper bounded by the maximum eigenvalue of the augmented Hessian of [34], and a monotonically increasing maximum eigenvalue means that the channel coefficients become larger and larger (this can be verified by explicitly calculating the augmented Hessian of ; details are omitted due to the page limit). Since , we can infer that a monotonically increasing channel-coefficient magnitude is an event of probability 0. Therefore, condition (11d) is satisfied.

Numerical Results: We simulated a SISO frequency-selective IC under the following setting: the number of users is either five or twenty; an equal power budget and white Gaussian noise variance are assumed for all users; the SNR of each user is set to 10 dB; the instantaneous parallel subchannels are generated according to , where (generated by the MATLAB command ) is fixed while is generated at each using , with being the noise level.

We considered in Fig. 1 the following algorithms: i) the proposed stochastic best-response pricing algorithm (with for all , and for ); at each iteration, the users' best-responses have a closed-form solution, see (18); ii) the stochastic conditional gradient method [9] (with , and for ); in each iteration, a linear problem must be solved; iii) the stochastic gradient projection method proposed in [26] (with and for ); at each iteration, the users' updates have a closed-form solution.

Note that the stepsizes are tuned such that all algorithms achieve their best empirical convergence speed.
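The best-response (18) is a waterfilling-type solution whose multiplier is found by bisection. As a minimal illustration of that inner step, the sketch below solves the plain waterfilling subproblem — maximize Σₙ log(aₙ + pₙ) subject to p ≥ 0 and Σₙ pₙ = P — with hypothetical inputs; the pricing and proximal terms of (17a) are omitted, and the bisection is run directly on the water level (the reciprocal of the Lagrange multiplier):

```python
import numpy as np

def waterfilling(a, P, iters=100):
    """Waterfilling: maximize sum_n log(a_n + p_n) subject to p_n >= 0 and
    sum_n p_n = P. The KKT solution is p_n = max(0, w - a_n), where the
    water level w is found by bisection on the power-budget equation."""
    a = np.asarray(a, dtype=float)
    lo, hi = a.min(), a.max() + P      # total power is 0 at lo and >= P at hi
    for _ in range(iters):
        w = 0.5 * (lo + hi)
        if np.maximum(0.0, w - a).sum() > P:
            hi = w                     # too much power: lower the water level
        else:
            lo = w                     # too little power: raise the water level
    return np.maximum(0.0, 0.5 * (lo + hi) - a)
```

For example, `waterfilling([1.0, 2.0, 3.0], 3.0)` allocates the budget to the strongest subchannels first.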


Fig. 1. Sum-rate versus iteration in frequency-selective ICs. (a) Ergodic sum-rate versus iterations, (b) achievable sum-rate versus iterations.

In Fig. 1, for all the algorithms, we plot two merit functions versus the iteration index, namely: i) the ergodic sum-rate, defined as (with the expected value estimated by the sample mean of 1000 independent realizations); and ii) the "achievable" sum-rate, defined as , which represents the sum-rate that is actually achieved in practice (it is the time average of the instantaneous (random) sum-rate). The experiment shows that for "small" systems (e.g., five active users), all algorithms perform quite well; the proposed scheme is just slightly faster. However, when the number of users increases (e.g., from 5 to 20), all other (gradient-like) algorithms suffer from slow convergence. Quite interestingly, the proposed scheme also demonstrates good scalability: its convergence speed is not notably affected by the number of users, which makes it applicable to more realistic scenarios. The faster convergence of the proposed stochastic best-response pricing algorithm comes from a better exploitation of the partial convexity in the problem than what more classical gradient algorithms achieve, which validates the main idea of this paper.

B. Sum-Rate Maximization Over MIMO ICs

In this example we customize Algorithm 1 to solve the sum-rate maximization problem over MIMO ICs (3). Defining

and following an approach similar to that used in the SISO case, the best-response of each user becomes (cf. (15)):

(19a)

where is given by

(19b)

with and . Then is updated by (6d), which becomes

(19c)

We can then apply Algorithm 1 based on the best-response , whose convergence is guaranteed if the stepsizes are chosen according to Theorem 1.

In contrast to the SISO case, the best-response in (19a) does not have a closed-form solution. A standard option to compute is to use general-purpose solvers for strongly convex optimization problems. By exploiting the structure of problem (19), we propose next an efficient iterative algorithm converging to , wherein the subproblems solved at each step have a closed-form solution.

Second-Order Dual Method for Problem (19a): To begin with, for notational simplicity, we rewrite (19a) in the following general form:

(20)

where and is defined in (3). Let be the eigenvalue/eigenvector decomposition of , where is unitary and is diagonal with the diagonal entries arranged in decreasing order. It can be shown that (20) is equivalent to the following problem:

(21)

where , and . We now partition into two blocks, its positive-definite and zero parts ( is partitioned accordingly):


where , and and have the same dimensions. Problem (21) can then be rewritten as:

(22)

Note that, since , by definition must belong to as well. Using this observation and introducing the slack variable , (22) is equivalent to

(23)

In the following, we solve (23) via dual decomposition (note that the duality gap is zero). Denoting by the matrix of multipliers associated with the linear constraints , the (partial) Lagrangian function of (23) is:

The dual problem is then

with

(24)

(25)

Problem (24) is quadratic and has a closed-form solution (see Lemma 2 below). Similarly, if , (25) can be solved in closed form, up to a Lagrange multiplier that can be found efficiently by bisection; see, e.g., ([29], Table I). In our setting, however, in (25) is not necessarily negative definite. Nevertheless, the next lemma provides a closed-form expression of [and ].

Lemma 2: Given (24) and (25) in the setting above, the following hold:
i) in (24) is given by

(26)

where denotes the projection of onto the cone of positive semidefinite matrices, and is the multiplier such that , which can be found by bisection;

ii) in (25) is unique and is given by

(27)

where is the generalized eigenvalue decomposition of , and is the multiplier such that ; it can be found by bisection over , with and .

Proof: See Appendix B.
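The projection in (26) is the Euclidean projection onto the positive semidefinite cone, which amounts to eigendecomposing and clipping the negative eigenvalues at zero. A minimal sketch (omitting the constraint-dependent multiplier of Lemma 2, which the outer bisection would adjust):

```python
import numpy as np

def proj_psd(X):
    """Euclidean projection of a Hermitian matrix onto the PSD cone:
    keep the eigenvectors and clip the negative eigenvalues at zero."""
    w, V = np.linalg.eigh(X)                       # ascending eigenvalues
    return (V * np.maximum(w, 0.0)) @ V.conj().T   # rebuild with clipped spectrum
```

For instance, projecting [[1, 2], [2, 1]] (eigenvalues 3 and −1) discards the negative mode and returns a rank-one PSD matrix.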

Since is unique, is differentiable, with conjugate gradient [22]

One can then solve the dual problem using standard (proximal) gradient-based methods; see, e.g., [34]. As a matter of fact, is twice continuously differentiable, and its augmented Hessian matrix [22] is given by ([34], Section 4.2.4):

with

and . Since , it follows that , and the following second-order Newton method can be used to update the dual variable :

The convergence speed of Newton's method is typically fast; in particular, a superlinear convergence rate can be achieved when is close to ([34], Proposition 1.4.1).

As a final remark on efficient solution methods for computing , note that one can also apply the proximal conditional gradient method introduced in (13), which is based on a full linearization of the social function plus a proximal regularization term:

(28)

where is the Lagrange multiplier, which can be found efficiently by the bisection method. Note that (28) differs from more traditional conditional stochastic gradient methods [9] by the presence of the proximal regularization, thanks to which one can solve (28) in closed form (cf. Lemma 2).

The above examples, (19) and (28), clearly show the flexibility of the proposed scheme: choosing different instances of the set leads to convex subproblems exhibiting different trade-offs between cost per iteration and practical convergence speed. Roughly speaking, when the number of iterations matters, one can opt for the approximation problem (19). On the other hand, when the cost per iteration is the priority, one can instead choose the approximation problem (28).

Practical Implementations: The proposed algorithm is fairly distributed: once the pricing matrix is given, to compute the best-response, each user only needs to locally estimate the covariance matrix of the interference plus noise. Note that both the computation of and the update of can be implemented locally by each user. The estimation of the pricing matrix requires, however, some signaling among


Fig. 2. Sum-rate versus iteration in a 50-user MIMO IC. (a) Ergodic sum-rate versus iterations, (b) achievable sum-rate versus iterations.

nearby receivers. Interestingly, the pricing expression, and thus the resulting signaling overhead necessary to compute it, coincides with that of [29] (where a sequential algorithm is proposed for the deterministic maximization of the sum-rate over MIMO ICs) and of the stochastic gradient projection method in [26]. We remark that the signaling to compute (19b) is lower than in [18], wherein signaling exchange is required twice in a single iteration (once in the computation of and again in that of ; see [18] for more details) to transmit among users the auxiliary variables, which are of the same dimensions as .

Numerical Results: We considered the same scenario as in the SISO case (cf. Section IV.A) with the following differences: i) there are 50 users; ii) the channels are matrices generated according to , where is given while is realization dependent and generated by , with noise level ; and iii) the number of transmit and receive antennas is four. We simulate the following algorithms:
• The proposed stochastic best-response pricing algorithm (19) (with for all ) under two stepsize rules, namely: Stepsize 1 (empirically optimal): and for ; and Stepsize 2: and for . For both stepsize rules we set . The best-response is computed using the second-order dual method, whose convergence has been observed in a few iterations;

• The proposed stochastic proximal gradient method (28) with and the same stepsize as the stochastic best-response pricing algorithm. The users' best-responses have a closed-form expression;

• The stochastic conditional gradient method [9] (with , and for ). In each iteration, a linear problem must be solved;

• The stochastic weighted minimum mean-square-error(SWMMSE) method [18]. The convex subproblems to besolved at each iteration have a closed-form solution.

Similarly to the SISO IC case, we consider both the ergodic sum-rate and the achievable sum-rate. In Fig. 2 we plot both objective functions versus the iteration index. It is clear from the figures that the proposed best-response pricing and proximal gradient algorithms outperform current schemes in terms of both convergence speed and achievable (ergodic or instantaneous) sum-rate. Note also that the best-response pricing algorithm is very scalable compared with the other algorithms. Finally, it is interesting to note that the proposed stochastic proximal gradient algorithm outperforms the conditional stochastic gradient method in terms of both convergence speed and cost per iteration. This is mainly due to the presence of the proximal regularization term in (19a).

Note that, in order to achieve a satisfactory convergence speed, some tuning of the free parameters in the stepsize rules is typically required for all algorithms. Comparing the convergence behavior under the two different sets of stepsize rules, we see from Fig. 2(a) that, as expected, the proposed best-response pricing and proximal gradient algorithms under the faster-decreasing Stepsize 2 converge more slowly than they do under Stepsize 1, but the difference is relatively small, and the proposed algorithms still converge to a larger sum-rate in a smaller number of iterations than current schemes do. Hence this offers some extra tolerance in the stepsizes and makes the proposed algorithms quite applicable in practice.

C. Sum-Rate Maximization Over MIMO MACs

In this example we consider the sum-rate maximization problem over MIMO MACs, as introduced in (4). This problem has been studied in [36] using standard convex optimization techniques, under the assumption that the statistics of the CSI are available and the expected value of the sum-rate function in (4) can be computed analytically. When this assumption does not hold, we can turn to the proposed algorithm with proper customization: define

A natural choice for the best-response of each user in eachiteration of Algorithm 1 is (cf. (14)):

(29)



Fig. 3. Sum-rate versus iteration in MIMO MAC. (a) Ergodic sum-rate versus iterations, (b) achievable sum-rate versus iterations.

and is updated as

while .

Note that since the instantaneous sum-rate function is jointly concave in for any , the ergodic sum-rate function is concave in the 's, and thus Algorithm 1 will converge (in the sense of Theorem 1) to the globally optimal solution of (4). To the best of our knowledge, this is the first example of a stochastic approximation algorithm based on best-response dynamics rather than gradient responses.

Numerical Results: We compare the proposed best-response method (29) (whose solution is computed using the second-order dual method in Section IV.B) with the stochastic conditional gradient method [9] and the stochastic gradient projection method [8]. System parameters (including the stepsize rules) are set as for the MIMO IC example in Section IV.B. In Fig. 3 we plot both the ergodic sum-rate and the achievable sum-rate versus the iteration index. This figure clearly shows that Algorithm 1 outperforms the conditional gradient method and the gradient projection method in terms of convergence speed, and the performance gap increases as the number of users increases. This is because the proposed algorithm is a best-response-type scheme, which thus exploits the concavity of each user's rate function better than gradient methods do. Note also that the proposed method exhibits good scalability properties.

D. Distributed Deterministic Algorithms With Errors

The developed framework can also be used to robustify some algorithms proposed for the deterministic counterpart of the multi-agent optimization problem (1), when only noisy estimates of the users' objective functions are available. As a specific example, we show next how to robustify the deterministic best-response-based pricing algorithm proposed in [15]. Consider the deterministic optimization problem introduced in (5). The main iterate of the best-response algorithm [15] is given by (7), but with each defined as

(30)

where . In many applications (see, e.g., [24]–[26]), however, only a noisy estimate of is available, denoted by . A heuristic is then to replace in (30) the exact with its noisy estimate . The limitation of this approach, albeit natural, is that convergence of the resulting scheme is no longer guaranteed.

If is unbiased, i.e., [24], [25], then, capitalizing on the proposed framework, we can readily deal with estimation errors while guaranteeing convergence. In particular, it is sufficient to modify (30) as follows:

(31)

where is updated according to . Algorithm 1 based on the best-response (31) is then guaranteed to converge to a stationary solution of (5), in the sense specified by Theorem 1.

As a case study, we next consider the maximization of the deterministic sum-rate over MIMO ICs in the presence of pricing estimation errors:

(32)

Then (31) becomes:

(33)

where is a noisy estimate of given by (19b),2 and is updated according to . Given

2 is always negative definite by definition [29], but may not be so. However, it is reasonable to assume to be Hermitian.


Fig. 4. Maximization of deterministic sum-rate over MIMO IC under noisy parameter estimation: sum-rate versus iteration.

, the main iterate of the algorithm becomes . Almost sure convergence to a stationary point of the deterministic optimization problem (32) is guaranteed by Theorem 1. Note that if the channel matrices are full column-rank, one can also set in (33) all , and compute (33) in closed form (cf. Lemma 2).

Numerical Results: We consider the maximization of the deterministic sum-rate (32) over a 5-user MIMO IC. The other system parameters (including the stepsize rules) are set as in the numerical example in Section IV.B. The noisy estimate of the nominal price matrix [defined in (19b)] is , where is first generated as in Section IV.B and then only its Hermitian part is kept; the noise level is set to 0.05. We compare the following algorithms: i) the proposed robust pricing method, i.e., Algorithm 1 based on the best-response defined in (33); and ii) the plain pricing method as proposed in [15] (cf. (30)). Note that the variable update in both algorithms has a closed-form solution. We also include as a benchmark the sum-rate achieved by the plain pricing method (30) when there is no estimation noise (i.e., perfect is available). In Fig. 4 we plot the deterministic sum-rate in (32) versus the iteration index . As expected, Fig. 4 shows that the plain pricing method [15] is not robust to pricing estimation errors, whereas the proposed robustification performs well. For instance, the rate achieved by the proposed method is about 50% larger than that of [15], and is observed to reach the benchmark value (achieved by the plain pricing method when there is no estimation noise). This is due to the fact that the proposed robustification filters out the estimation noise. Note that the limit point generated by the proposed scheme (33) is a stationary solution of the deterministic problem (32).
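The robustification above hinges on the averaging step in (31), which combines unbiased noisy estimates with a diminishing weight and thereby filters out the zero-mean noise. A toy sketch of this mechanism (the nominal values, noise level, and weight exponent are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
true_price = np.array([1.0, -2.0, 0.5])  # hypothetical nominal quantity
est = np.zeros_like(true_price)          # running average of noisy estimates

for t in range(1, 5001):
    rho = t ** -0.6                                    # diminishing weight
    noisy = true_price + 0.05 * rng.standard_normal(3)  # unbiased noisy estimate
    est = (1.0 - rho) * est + rho * noisy               # averaging, cf. (31)
```

While each raw sample fluctuates at the noise level, the averaged estimate converges to the nominal value, which is what restores convergence of the best-response iterates.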

V. A MORE GENERAL SCA FRAMEWORK

The key idea behind the choice of the surrogate function in (6) is to convexify the nonconvex part of the sample sum-utility function via partial linearization of . It is not difficult to show that one can generalize this idea and replace the surrogate in (6) with a more general function. For example, one can use in Algorithm 1 the following sample best-response function

(34)

where (cf. (6)), and is any surrogate function satisfying the following technical conditions:
(A1) is uniformly strongly convex and continuously differentiable on for all given and ;
(A2) is Lipschitz continuous on ;
(A3) .

All the convergence results presented so far remain valid (cf. Theorem 1). To the best of our knowledge, this is the first SCA framework for nonconvex stochastic optimization problems; it offers great flexibility to tailor the surrogate function to individual problems while guaranteeing convergence, which makes it appealing for a wide range of applications.
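Conditions (A1)–(A3) are easy to check for partial-linearization surrogates of the kind used throughout the paper. The sketch below builds such a surrogate for a toy function (the specific convex part g, nonconvex part h, and proximal weight τ are illustrative) and numerically verifies the gradient-consistency condition (A3) at the base point:

```python
import numpy as np

tau = 1.0  # proximal weight; any tau > 0 makes the surrogate strongly convex (A1)

def g(x):       # convex part, kept unchanged in the surrogate
    return np.sum(x ** 2)

def h(x):       # nonconvex part, to be linearized
    return np.sum(np.sin(x))

def grad_h(x):
    return np.cos(x)

def surrogate(x, y):
    """Partial-linearization surrogate: keep g, linearize h at the base
    point y, and add a proximal term (cf. (34) and (A1)-(A3))."""
    return g(x) + grad_h(y) @ (x - y) + 0.5 * tau * np.sum((x - y) ** 2)

# numerical check of (A3): grad_x surrogate(x; y) at x = y equals grad f(y)
y = np.array([0.3, -1.2, 2.0])
eps = 1e-6
num_grad = np.array([
    (surrogate(y + eps * e, y) - surrogate(y - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])
true_grad = 2 * y + np.cos(y)   # grad g(y) + grad h(y)
```

The proximal term vanishes from the gradient at the base point, so the surrogate direction agrees with the true gradient there, which is exactly what (A3) requires.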

VI. CONCLUSION

In this paper, we have proposed a novel best-response-based solution method for general stochastic nonconvex multi-agent optimization problems and analyzed its convergence properties. The proposed novel decomposition enables all users to update their optimization variables in parallel by solving a sequence of strongly convex subproblems, which makes the algorithm very appealing for distributed implementation in several practical systems. We have then customized the general framework to solve special classes of problems and applications, including the stochastic maximization of the sum-rate over frequency-selective ICs, MIMO ICs, and MACs. Extensive experiments have provided solid evidence of the superiority of the proposed schemes, in terms of both achievable sum-rate and practical convergence, with respect to state-of-the-art stochastic algorithms.

APPENDIX

A. Proof of Theorem 1

We first introduce the following two preliminary results.

Lemma 3: Given problem (1) under Assumptions (a)–(c), suppose that the stepsizes and are chosen according to (11). Let be the sequence generated by Algorithm 1. Then the following holds:

Proof: This lemma is a consequence of ([10], Lemma 1). To see this, we just need to verify that all the technical conditions therein are satisfied by the problem at hand. Specifically, Condition (a) of ([10], Lemma 1) is satisfied because the 's are closed and bounded in view of Assumption (a). Condition (b) of ([10], Lemma 1) is exactly Assumption (c). Conditions (c)–(d) come from the stepsize rules i)–ii) in (11) of Theorem 1. Condition (e) of ([10], Lemma 1) comes from the Lipschitz property of from Assumption (b) and stepsize rule iii) in (11) of Theorem 1.


Lemma 4: Given problem (1) under Assumptions (a)–(c), suppose that the stepsizes and are chosen according to (11). Let be the sequence generated by Algorithm 1. Then there exists a constant such that

and w.p.1.

Proof: We assume w.l.o.g. that ; for notational simplicity, we define , for and . It follows from the first-order optimality conditions that [22]

(35a)

(35b)

Setting in (35a) and in (35b), and adding the two inequalities, we have

(36)

The first term in (36) can be lower bounded as follows:

(37a)

where in (37a) we used (8). Invoking the Lipschitz continuity of , we can get a lower bound for (37a):

(37b)

(37c)

(37d)

where (37c) comes from the Lipschitz continuity of , with and , and in (37d) we used the boundedness of the constraint set ( for some and all ) and the Lipschitz continuity of .

The second term in (36) can be bounded as:

(38)

where the inequality follows from the definition of and the (uniform) convexity of the functions . Combining the inequalities (36), (37d), and (38), we have

which leads to the desired (asymptotic) Lipschitz property:

with and


In view of Lemma 3 and (11d), it is easy to check that w.p.1.

Proof of Theorem 1: Invoking the first-order optimalityconditions of (6), we have

which, together with the convexity of , leads to

(39)

It follows from the descent lemma on that

(40)

where in the last inequality we used (39). Let us show by contradiction that w.p.1. Suppose with positive probability. Then we can find a realization such that, at the same time, for all and ; we focus next on such a realization. Using , the inequality (40) is equivalent to

(41)

Since , there exists a sufficiently large such that

(42)

Therefore, it follows from (41) and (42) that

(43)

which, in view of , contradicts the boundedness of . Therefore it must be w.p.1.

We now prove that w.p.1. Assume with some positive probability. We focus next on a realization along which , and , where is defined in Lemma 4. It follows from and that there exists a such that (with ) for infinitely many and also for infinitely many . Therefore, one can always find an infinite set of indices, say , having the following property: for any , there exists an integer such that

(44)

Given the above bounds, the following holds: for all ,

(45)

implying that

(46)

Proceeding as in (45), we also have: for all

which leads to

(47)

where the second inequality follows from (44). It follows from (47) that there exists a such that, for sufficiently large ,

(48)

Hereafter we assume w.l.o.g. that (48) holds for all (in fact, one can always restrict to a proper subsequence).

We now show that (46) contradicts the convergence of . Invoking (40), we have: for all ,

(49)


and for

(50)

where the last inequality follows from (44). Adding (49) and (50) over and , for sufficiently large (so that and ), we have

(51)

where (a) follows from ; (b) is due to (48); and in (c) we used . Since converges, it must be that , which contradicts (46). Therefore, it must be w.p.1.

Finally, let us prove that every limit point of the sequence is a stationary solution of (1). Let be the limit point of the convergent subsequence . Taking the limit of (35) over the index set , we have

(52)

where the last equality follows from: i) [cf. Lemma 3]; ii) ; and iii) the following:

(53)

where (53) follows from the Lipschitz continuity of , the fact that , and (11d). Adding (52) over , we get the desired first-order optimality condition: for all . Therefore, is a stationary point of (1).

B. Proof of Lemma 2

We prove only (27). Since (25) is a convex optimization problem and has a nonempty interior, strong duality holds for (25) [37]. The dual function of (25) is

(54)

where . Denote by the optimal solution of the maximization problem in (54), for any given feasible . It is easy to see that if , so is feasible if and only if , i.e.,

and is ([29], Prop. 1)

where is the generalized eigenvalue decomposition of . Invoking ([38], Corollary 28.1.1), the uniqueness of comes from the uniqueness of , which was proved in [39].

Now we prove that . First, note that . Based on the eigenvalue decomposition , the following inequalities hold:

where . In other words, is upper bounded by the optimal value of the following problem:

(55)

When , it is not difficult to verify that the optimal variable of (55) is , and thus . We proceed by discussing two complementary cases: and .

If : since and the primal value is also 0, there is no duality gap. From the definition of a saddle point ([37], Section 5.4), is a dual optimal variable.

If : assume . Then is the optimal variable in (25) and the optimal value of (25) is 0, but this would lead to a nonzero duality gap and thus contradict the optimality of . Therefore .

REFERENCES

[1] Y. Yang, G. Scutari, and D. P. Palomar, "Parallel stochastic decomposition algorithms for multi-agent systems," in Proc. IEEE 14th Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Jun. 2013, pp. 180–184.

[2] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, Sep. 1951.

[3] H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. New York, NY, USA: Springer-Verlag, 2003, vol. 35.

[4] B. T. Poljak, Introduction to Optimization. New York, NY, USA: Optimization Software, 1987.

[5] D. P. Bertsekas and J. N. Tsitsiklis, "Gradient convergence in gradient methods with errors," SIAM J. Optim., vol. 10, no. 3, pp. 627–642, 2000.


[6] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Trans. Autom. Control, vol. 31, no. 9, pp. 803–812, Sep. 1986.

[7] Y. Ermoliev, "On the method of generalized stochastic gradients and quasi-Fejer sequences," Cybernetics, vol. 5, no. 2, pp. 208–220, 1972.

[8] F. Yousefian, A. Nedić, and U. V. Shanbhag, "On stochastic gradient and subgradient methods with adaptive steplength sequences," Automatica, vol. 48, no. 1, pp. 56–67, Jan. 2012.

[9] Y. Ermoliev and P. I. Verchenko, "A linearization method in limiting extremal problems," Cybernetics, vol. 12, no. 2, pp. 240–245, 1977.

[10] A. Ruszczyński, "Feasible direction methods for stochastic programming problems," Math. Program., vol. 19, no. 1, pp. 220–229, Dec. 1980.

[11] A. Ruszczyński, "A merit function approach to the subgradient method with averaging," Optim. Methods Softw., vol. 23, no. 1, pp. 161–172, Feb. 2008.

[12] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, Jan. 2009.

[13] A. M. Gupal and L. G. Bazhenov, "Stochastic analog of the conjugant-gradient method," Cybernetics, vol. 8, no. 1, pp. 138–140, 1972.

[14] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM J. Control Optim., vol. 30, no. 4, pp. 838–855, Jul. 1992.

[15] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, "Decomposition by partial linearization: Parallel optimization of multi-agent systems," IEEE Trans. Signal Process., vol. 62, no. 3, pp. 641–656, Feb. 2014.

[16] F. Facchinei, G. Scutari, and S. Sagratella, "Parallel selective algorithms for nonconvex big data optimization," IEEE Trans. Signal Process., vol. 63, no. 7, pp. 1874–1889, Nov. 2015.

[17] A. Daneshmand, F. Facchinei, V. Kungurtsev, and G. Scutari, "Hybrid random/deterministic parallel algorithms for convex and nonconvex big data optimization," IEEE Trans. Signal Process., vol. 63, no. 15, pp. 3914–3929, Aug. 2015.

[18] M. Razaviyayn, M. Sanjabi, and Z.-Q. Luo, "A stochastic successive minimization method for nonsmooth nonconvex optimization with applications to transceiver design in wireless communication networks," Jun. 2013 [Online]. Available: http://arxiv.org/abs/1307.4457

[19] S. M. Robinson, "Analysis of sample-path optimization," Math. Oper. Res., vol. 21, no. 3, pp. 513–528, Aug. 1996.

[20] J. Linderoth, A. Shapiro, and S. Wright, "The empirical behavior of sampling methods for stochastic programming," Ann. Oper. Res., vol. 142, no. 1, pp. 215–241, Feb. 2006.

[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.

[22] G. Scutari, F. Facchinei, J.-S. Pang, and D. P. Palomar, "Real and complex monotone communication games," IEEE Trans. Inf. Theory, vol. 60, no. 7, pp. 4197–4231, Jul. 2014.

[23] A. Hjørungnes, Complex-Valued Matrix Derivatives With Applications in Signal Processing and Communications. Cambridge, U.K.: Cambridge Univ. Press, 2011.

[24] J. Zhang, D. Zheng, and M. Chiang, "The impact of stochastic noisy feedback on distributed network utility maximization," IEEE Trans. Inf. Theory, vol. 54, no. 2, pp. 645–665, Feb. 2008.

[25] M. Hong and A. Garcia, "Averaged iterative water-filling algorithm: Robustness and convergence," IEEE Trans. Signal Process., vol. 59, no. 5, pp. 2448–2454, May 2011.

[26] P. Di Lorenzo, S. Barbarossa, and M. Omilipo, "Distributed sum-rate maximization over finite rate coordination links affected by random failures," IEEE Trans. Signal Process., vol. 61, no. 3, pp. 648–660, Feb. 2013.

[27] J. Huang, R. A. Berry, and M. L. Honig, "Distributed interference compensation for wireless networks," IEEE J. Sel. Areas Commun., vol. 24, no. 5, pp. 1074–1084, 2006.

[28] C. Shi, R. A. Berry, and M. L. Honig, "Monotonic convergence of distributed interference pricing in wireless networks," in Proc. IEEE Int. Symp. Inf. Theory, Jun. 2009, pp. 1619–1623.

[29] S.-J. Kim and G. B. Giannakis, "Optimal resource allocation for MIMO ad hoc cognitive radio networks," IEEE Trans. Inf. Theory, vol. 57, no. 5, pp. 3117–3131, May 2011.

[30] Z.-Q. Luo and S. Zhang, “Dynamic spectrum management: Com-plexity and duality,” IEEE J. Sel. Topics Signal Process., vol. 2, no. 1,pp. 57–73, Feb. 2008.

[31] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computa-tion: Numerical Methods. Englwood Cliffs, NJ, USA: Prentice-Hall,1989.

[32] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Incremental stochasticsubgradient algorithms for convex optimization,” SIAM J. Optim., vol.20, no. 2, pp. 691–717, Jan. 2009.

[33] K. Srivastava and A. Nedić, “Distributed asynchronous constrainedstochastic optimization,” IEEE J. Sel. Topics Signal Process., vol. 5,no. 4, pp. 772–790, Aug. 2011.

[34] D. P. Bertsekas, Nonlinear Programming. Belmont, MA, USA:Athena Scientific, 1999.

[35] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar, Convex Analysis andOptimization. Belmont, MA, USA: Athena Scientific, 2003.

[36] R. Zhang, M. Mohseni, and J. Cioffi, “Power region for fading multiple-access channel with multiple antennas,” in Proc. IEEE Int. Symp. Inf. Theory, 2006, pp. 2709–2713.

[37] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.

[38] R. T. Rockafellar, Convex Analysis. Princeton, NJ, USA: Princeton Univ. Press, 1970.

[39] W. Yu, W. Rhee, S. Boyd, and J. Cioffi, “Iterative water-filling for Gaussian vector multiple-access channels,” IEEE Trans. Inf. Theory, vol. 50, no. 1, pp. 145–152, Jan. 2004.

Yang Yang (S’09–M’13) received the B.S. degree from the School of Information Science and Engineering, Southeast University, Nanjing, China, in 2009, and the Ph.D. degree from the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, in 2013. From November 2013 to November 2015, he was a postdoctoral research associate with the Communication Systems Group, Darmstadt University of Technology, Darmstadt, Germany. He joined Intel Deutschland GmbH as a research scientist in December 2015.

His research interests are in distributed solution methods in convex optimization, nonlinear programming, and game theory, with applications in communication networks, signal processing, and financial engineering.

Gesualdo Scutari (S’05–M’06–SM’11) received the Electrical Engineering and Ph.D. degrees (both with honors) from the University of Rome “La Sapienza,” Rome, Italy, in 2001 and 2005, respectively. He is an Associate Professor with the Department of Industrial Engineering, Purdue University, West Lafayette, IN, USA, and the Scientific Director for the area of Big-Data Analytics at the Cyber Center (Discovery Park) at Purdue University. He previously held several research appointments, namely, at the University of California at Berkeley, Berkeley, CA, USA; Hong Kong University of Science and Technology, Hong Kong; University of Rome “La Sapienza,” Rome, Italy; and University of Illinois at Urbana-Champaign, Urbana, IL, USA. His research interests include theoretical and algorithmic issues related to big data optimization, equilibrium programming, and their applications to signal processing, medical imaging, machine learning, and networking.

Dr. Scutari is an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING and served as an Associate Editor of the IEEE SIGNAL PROCESSING LETTERS. He served on the IEEE Signal Processing Society Technical Committee on Signal Processing for Communications (SPCOM). He was the recipient of the Best Student Paper Award at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2006, the 2013 NSF Faculty Early Career Development (CAREER) Award, the 2013 UB Young Investigator Award, the 2015 Anna Maria Molteni Award for Mathematics and Physics from the Italian Scientists and Scholars in North America Foundation (ISSNAF), and the 2015 IEEE Signal Processing Society Young Author Best Paper Award.


Daniel P. Palomar (S’99–M’03–SM’08–F’12) received the Electrical Engineering and Ph.D. degrees from the Technical University of Catalonia (UPC), Barcelona, Spain, in 1998 and 2003, respectively. He is a Professor in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology (HKUST), Hong Kong, which he joined in 2006. Since 2013, he has been a Fellow of the Institute for Advanced Study (IAS) at HKUST. He previously held several research appointments, namely, at King’s College London (KCL), London, U.K.; Stanford University, Stanford, CA; Telecommunications Technological Center of Catalonia (CTTC), Barcelona, Spain; Royal Institute of Technology (KTH), Stockholm, Sweden; University of Rome “La Sapienza,” Rome, Italy; and Princeton University, Princeton, NJ. His current research interests include applications of convex optimization theory, game theory, and variational inequality theory to financial systems, big data systems, and communication systems.

Dr. Palomar is an IEEE Fellow and a recipient of a 2004/2006 Fulbright Research Fellowship, the 2004 and 2015 (co-author) Young Author Best Paper Awards by the IEEE Signal Processing Society, the 2002/2003 best Ph.D. prize in Information Technologies and Communications by the Technical University of Catalonia (UPC), the 2002/2003 Rosina Ribalta first prize for the Best Doctoral Thesis in Information Technologies and Communications by the Epson Foundation, and the 2004 prize for the Best Doctoral Thesis in Advanced Mobile Communications by the Vodafone Foundation.

He is a Guest Editor of the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING 2016 Special Issue on Financial Signal Processing and Machine Learning for Electronic Trading, and has been an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION THEORY and of the IEEE TRANSACTIONS ON SIGNAL PROCESSING, a Guest Editor of the IEEE SIGNAL PROCESSING MAGAZINE 2010 Special Issue on Convex Optimization for Signal Processing, the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS 2008 Special Issue on Game Theory in Communication Systems, and the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS 2007 Special Issue on Optimization of MIMO Transceivers for Realistic Communication Networks.

Marius Pesavento (M’00) received the Dipl.-Ing. and M.Eng. degrees from Ruhr-Universität Bochum, Germany, and McMaster University, Hamilton, ON, Canada, in 1999 and 2000, respectively, and the Dr.-Ing. degree in electrical engineering from Ruhr-Universität Bochum in 2005. Between 2005 and 2007, he was a Research Engineer at FAG Industrial Services GmbH, Aachen, Germany. From 2007 to 2009, he was the Director of the Signal Processing Section at mimoOn GmbH, Duisburg, Germany. In 2010, he became an Assistant Professor for Robust Signal Processing, and in 2013 a Full Professor for Communication Systems, at the Department of Electrical Engineering and Information Technology, Darmstadt University of Technology, Darmstadt, Germany. His research interests are in the areas of robust signal processing and adaptive beamforming, high-resolution sensor array processing, multiantenna and multiuser communication systems, distributed, sparse, and mixed-integer optimization techniques for signal processing and communications, statistical signal processing, spectral analysis, and parameter estimation. Dr. Pesavento was a recipient of the 2003 ITG/VDE Best Paper Award, the 2005 Young Author Best Paper Award of the IEEE TRANSACTIONS ON SIGNAL PROCESSING, and the 2010 Best Paper Award of the CROWNCOM conference. He is a member of the Editorial Board of the EURASIP Signal Processing Journal and an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING. He is currently serving his second term as a member of the Sensor Array and Multichannel (SAM) Technical Committee of the IEEE Signal Processing Society (SPS).