
KKT Conditions, First-Order and Second-Order Optimization, and Distributed Optimization: Tutorial and Survey

Benyamin Ghojogh BGHOJOGH@UWATERLOO.CA

Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Ali Ghodsi ALI.GHODSI@UWATERLOO.CA

Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science, Data Analytics Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray KARRAY@UWATERLOO.CA

Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Mark Crowley MCROWLEY@UWATERLOO.CA

Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Abstract

This is a tutorial and survey paper on Karush-Kuhn-Tucker (KKT) conditions, first-order and second-order numerical optimization, and distributed optimization. After a brief review of the history of optimization, we start with some preliminaries on properties of sets, norms, functions, and concepts of optimization. Then, we introduce the optimization problem, standard optimization problems (including linear programming, quadratic programming, and semidefinite programming), and convex problems. We also introduce some techniques such as eliminating inequality, equality, and set constraints, adding slack variables, and the epigraph form. We introduce the Lagrangian function, dual variables, KKT conditions (including primal feasibility, dual feasibility, weak and strong duality, complementary slackness, and the stationarity condition), and solving optimization by the method of Lagrange multipliers. Then, we cover first-order optimization including gradient descent, line-search, convergence of gradient methods, momentum, steepest descent, and backpropagation. Other first-order methods are explained, such as the accelerated gradient method, stochastic gradient descent, mini-batch gradient descent, stochastic average gradient, stochastic variance reduced gradient, AdaGrad, RMSProp, and the Adam optimizer, proximal methods (including proximal mapping, the proximal point algorithm, and the proximal gradient method), and constrained gradient methods (including the projected gradient method, projection onto convex sets, and the Frank-Wolfe method). We also cover non-smooth and ℓ1 optimization methods including lasso regularization, the convex conjugate, the Huber function, soft-thresholding, coordinate descent, and subgradient methods. Then, we explain second-order methods including Newton's method for unconstrained, equality constrained, and inequality constrained problems. We explain the interior-point method, barrier methods, Wolfe conditions for line-search, fast solving of systems of equations (including decomposition methods and conjugate gradient), and quasi-Newton's methods (including BFGS, LBFGS, Broyden, DFP, and SR1 methods). Sequential convex programming for non-convex optimization is also introduced. Finally, we explain distributed optimization including alternating optimization, dual decomposition methods, the augmented Lagrangian, and the alternating direction method of multipliers (ADMM). We also introduce some techniques for using ADMM for many constraints and variables.

arXiv:2110.01858v1 [math.OC] 5 Oct 2021


Contents

1 Introduction
2 Notations and Preliminaries
   2.1 Preliminaries on Sets and Norms
   2.2 Preliminaries on Functions
   2.3 Preliminaries on Optimization
   2.4 Preliminaries on Derivative
3 Optimization Problems
   3.1 Standard Problems
      3.1.1 General Optimization Problem
      3.1.2 Convex Optimization Problem
      3.1.3 Linear Programming
      3.1.4 Quadratic Programming
      3.1.5 Quadratically Constrained Quadratic Programming (QCQP)
      3.1.6 Second-Order Cone Programming (SOCP)
      3.1.7 Semidefinite Programming (SDP)
      3.1.8 Optimization Toolboxes
   3.2 Eliminating Constraints and Equivalent Problems
      3.2.1 Eliminating Inequality Constraints
      3.2.2 Eliminating Equality Constraints
      3.2.3 Adding Equality Constraints
      3.2.4 Eliminating Set Constraints
      3.2.5 Adding Slack Variables
      3.2.6 Epigraph Form
4 Karush-Kuhn-Tucker (KKT) Conditions
   4.1 The Lagrangian Function
      4.1.1 Lagrangian and Dual Variables
      4.1.2 Sign of Terms in Lagrangian
      4.1.3 Interpretation of Lagrangian
      4.1.4 Lagrange Dual Function
   4.2 Primal Feasibility
   4.3 Dual Feasibility
   4.4 The Dual Problem, Weak and Strong Duality, and Slater's Condition
   4.5 Complementary Slackness
   4.6 Stationarity Condition
   4.7 KKT Conditions
   4.8 Solving Optimization by Method of Lagrange Multipliers
5 First-Order Optimization: Gradient Methods
   5.1 Gradient Descent
      5.1.1 Step of Update
      5.1.2 Line-Search
      5.1.3 Backtracking Line-Search
      5.1.4 Convergence Criterion
      5.1.5 Convergence Analysis for Gradient Descent
      5.1.6 Gradient Descent with Momentum
      5.1.7 Steepest Descent
      5.1.8 Backpropagation
   5.2 Accelerated Gradient Method
   5.3 Stochastic Gradient Methods
      5.3.1 Stochastic Gradient Descent
      5.3.2 Mini-batch Stochastic Gradient Descent
   5.4 Stochastic Average Gradient Methods
      5.4.1 Stochastic Average Gradient
      5.4.2 Stochastic Variance Reduced Gradient
      5.4.3 Adapting Learning Rate with AdaGrad, RMSProp, and Adam
   5.5 Proximal Methods
      5.5.1 Proximal Mapping and Projection
      5.5.2 Proximal Point Algorithm
      5.5.3 Proximal Gradient Method
   5.6 Gradient Methods for Constrained Problems
      5.6.1 Projected Gradient Method
      5.6.2 Projection Onto Convex Sets (POCS) and Averaged Projections
      5.6.3 Frank-Wolfe Method
6 Non-smooth and L1 Norm Optimization Methods
   6.1 Lasso Regularization
   6.2 Convex Conjugate
      6.2.1 Convex Conjugate
      6.2.2 Huber Function: Smoothing L1 Norm by Convex Conjugate
   6.3 Soft-thresholding and Proximal Methods
   6.4 Coordinate Descent
      6.4.1 Coordinate Method
      6.4.2 L1 Norm Optimization
   6.5 Subgradient Methods
      6.5.1 Subgradient
      6.5.2 Subgradient Method
      6.5.3 Stochastic Subgradient Method
      6.5.4 Projected Subgradient Method
7 Second-Order Optimization: Newton's Method
   7.1 Newton's Method from the Newton-Raphson Root Finding Method
   7.2 Newton's Method for Unconstrained Optimization
   7.3 Newton's Method for Equality Constrained Optimization
   7.4 Interior-Point and Barrier Methods: Newton's Method for Inequality Constrained Optimization
   7.5 Wolfe Conditions and Line-Search in Newton's Method
   7.6 Fast Solving System of Equations in Newton's Method
      7.6.1 Decomposition Methods
      7.6.2 Conjugate Gradient Method
      7.6.3 Nonlinear Conjugate Gradient Method
   7.7 Quasi-Newton's Methods
      7.7.1 Hessian Approximation
      7.7.2 Quasi-Newton's Algorithms
8 Non-convex Optimization by Sequential Convex Programming
   8.1 Convex Approximation
      8.1.1 Convex Approximation by Taylor Series Expansion
      8.1.2 Convex Approximation by Particle Method
      8.1.3 Convex Approximation by Quasi-linearization
   8.2 Trust Region
      8.2.1 Formulation of Trust Region
      8.2.2 Updating Trust Region
9 Distributed Optimization
   9.1 Alternating Optimization
   9.2 Dual Ascent and Dual Decomposition Methods
   9.3 Augmented Lagrangian Method (Method of Multipliers)
   9.4 Alternating Direction Method of Multipliers (ADMM)
      9.4.1 ADMM Algorithm
      9.4.2 Simplifying Equations in ADMM
   9.5 ADMM Algorithm for General Optimization Problems and Any Number of Variables
      9.5.1 Distributed Optimization
      9.5.2 Making Optimization Problem Distributed
10 Additional Notes
   10.1 Cutting-Plane Methods
   10.2 Ellipsoid Method
   10.3 Minimax and Maximin Problems
   10.4 Riemannian Optimization
   10.5 Metaheuristic Optimization
11 Conclusion
A Proofs for Section 2
   A.1 Proof for Lemma 5
   A.2 Proof for Lemma 6
   A.3 Proof for Lemma 7
   A.4 Proof for Lemma 8
   A.5 Proof for Lemma 9
B Proofs for Section 5
   B.1 Proof for Lemma 13
   B.2 Proof for Theorem 1
   B.3 Proof for Theorem 2
C Proofs for Section 7
   C.1 Proof for Theorem 8


1. Introduction

– KKT conditions and numerical optimization: Numerical optimization has application in various fields of science. Many optimization methods can be explained in terms of the Karush-Kuhn-Tucker (KKT) conditions (Kjeldsen, 2000), proposed in (Karush, 1939; Kuhn & Tucker, 1951). The KKT conditions deal with the primal and dual problems, with primal and dual variables, respectively, where the optimum of the dual problem is a lower bound on the primal optimum. In an unconstrained problem, if setting the gradient of the cost function to zero gives a closed-form solution, the optimization is done; however, if there is no closed-form solution, we should use numerical optimization, which finds the solution iteratively and gradually. Besides, if the optimization is constrained, constrained numerical optimization should be used. Numerical optimization methods can be divided into first-order and second-order methods.

– History of first-order optimization: First-order methods are based on the gradient, while second-order methods make use of the Hessian, or an approximation of the Hessian, as well as the gradient. The most well-known first-order method is gradient descent, first suggested by Cauchy in 1847 (Lemarechal, 2012) and Hadamard in 1908 (Hadamard, 1908), whose convergence was later analyzed in (Curry, 1944). Backpropagation, for training neural networks, was proposed in (Rumelhart et al., 1986); it is gradient descent used with the chain rule. It was found in the 1980s that gradient descent does not have the optimal convergence rate. Therefore, the Accelerated Gradient Method (AGM) was proposed by Nesterov (Nesterov, 1983; 1988; 2005), which has the optimal convergence rate among gradient methods. Stochastic methods were also proposed for large-scale optimization when we have a dataset of points. They randomly sample points or batches of points for use in gradient methods. Stochastic Gradient Descent (SGD), first proposed in (Robbins & Monro, 1951), was first used for machine learning in (Bottou et al., 1998). Stochastic Average Gradient (SAG) (Roux et al., 2012) and Stochastic Variance Reduced Gradient (SVRG) (Johnson & Zhang, 2013) are two other example methods in this category. Some techniques, such as AdaGrad (Duchi et al., 2011), Root Mean Square Propagation (RMSProp) (Tieleman & Hinton, 2012), and Adaptive Moment Estimation (Adam) (Kingma & Ba, 2014), have also been proposed for adapting the learning rate in stochastic optimization.

– History of proximal methods: Another family of optimization methods are the proximal methods (Parikh & Boyd, 2014), which are based on the Moreau-Yosida regularization (Moreau, 1965; Yosida, 1965). Some proximal methods are the proximal point algorithm (Rockafellar, 1976) and the proximal gradient method (Nesterov, 2013). The proximal mapping can also be used for constrained gradient methods such as the projected gradient method (Iusem, 2003). Another effective first-order method for constrained problems is the Frank-Wolfe method (Frank & Wolfe, 1956).

– History of non-smooth optimization: Optimization of non-smooth functions is also very important, especially because of the use of the ℓ1 norm for sparsity in many applications. Some techniques for ℓ1 norm optimization are ℓ1 norm approximation by the Huber function (Huber, 1992), soft-thresholding, which is the proximal mapping of the ℓ1 norm, and coordinate descent (Wright, 2015), which can be used for ℓ1 norm optimization (Wu & Lange, 2008). Subgradient methods, including the stochastic subgradient method (Shor, 1998) and the projected subgradient method (Alber et al., 1998), can also be used for non-smooth optimization.

– History of second-order optimization: Second-order methods use the Hessian, the inverse of the Hessian, or approximations of them. The family of second-order methods can be called Newton's methods, which are based on the Newton-Raphson method (Stoer & Bulirsch, 2013). Constrained second-order problems can be solved using the interior-point method, first proposed in (Dikin, 1967). The interior-point method is also called the barrier method (Boyd & Vandenberghe, 2004; Nesterov, 2018) and the Sequential Unconstrained Minimization Technique (SUMT) (Fiacco & McCormick, 1967). The interior-point method is very powerful and is often the main method for solving optimization problems in optimization toolboxes such as CVX (Grant et al., 2009).
Second-order methods are usually faster than first-order methods because they use the Hessian information. However, computing the Hessian or an approximation of it is time-consuming and difficult. This might be the reason why most machine learning algorithms, such as backpropagation for neural networks (Rumelhart et al., 1986), use first-order methods. Note, however, that a few machine learning algorithms, such as logistic regression and Sammon mapping (Sammon, 1969), use second-order optimization.
The update of the solution in either first-order or second-order methods can be stated as a system of linear equations. For large-scale optimization, Newton's method becomes slow and intractable. Therefore, decomposition methods (Golub & Van Loan, 2013), the conjugate gradient method (Hestenes & Stiefel, 1952), and the nonlinear conjugate gradient method (Fletcher & Reeves, 1964; Polak & Ribiere, 1969; Hestenes & Stiefel, 1952; Dai & Yuan, 1999) can be used to approximate the solution of the system of equations. Truncated Newton's methods (Nash, 2000), used for large-scale optimization, usually use conjugate gradient. Another approach for approximating Newton's method for large-scale data is the quasi-Newton's method (Nocedal & Wright, 2006, Chapter 6), which approximates the Hessian or inverse Hessian matrix. The well-known algorithms for the quasi-Newton's method are Broyden-Fletcher-Goldfarb-Shanno (BFGS) (Fletcher, 1987; Dennis Jr & Schnabel, 1996), limited-memory BFGS (LBFGS) (Nocedal, 1980; Liu & Nocedal, 1989), Davidon-Fletcher-Powell (DFP) (Davidon, 1991; Fletcher, 1987), the Broyden method (Broyden, 1965), and Symmetric Rank-one (SR1) (Conn et al., 1991).

– History of line-search: Both first-order and second-order optimization methods have a step size parameter to move along their descent direction. This step size can be calculated at every iteration using line-search methods. Well-known line-search methods are the backtracking or Armijo line-search (Armijo, 1966) and the Wolfe conditions (Wolfe, 1969).

– Standard problems: The terms "programming" and "program" are sometimes used to mean "optimization" and "optimization problem", respectively, in the literature. Convex optimization, or convex programming, began to develop in the 1940s (Tikhomirov, 1996). There exist some standard forms for convex problems, which are linear programming, quadratic programming, quadratically constrained quadratic programming, second-order cone programming, and semidefinite programming (SDP). An important method for solving linear programs was the simplex method, proposed in 1947 (Dantzig, 1983). SDP is also important because the standard convex problems can be stated as special cases of SDP and then may be solved using the interior-point method.

– History of non-convex optimization: There also exist methods for non-convex optimization. These methods are either local or global. The local methods are faster but find a local solution depending on the initial solution. The global methods, however, find the global solution but are slower. Examples of local and global non-convex methods are Sequential Convex Programming (SCP) (Dinh & Diehl, 2010) and branch and bound (Land & Doig, 1960), respectively. SCP uses a trust region (Conn et al., 2000) and solves a sequence of convex approximations of the problem. It is related to Sequential Quadratic Programming (SQP) (Boggs & Tolle, 1995), which is used for constrained nonlinear optimization. Branch and bound methods use a binary tree structure for optimizing a non-convex cost function.

– History of distributed optimization: Distributed optimization has two benefits. First, it makes the problem able to run in parallel on several servers. Second, it can be used to solve problems with multiple optimization variables. Especially for the second reason, it has been widely used in machine learning and signal processing. The two most well-known distributed optimization approaches are alternating optimization (Jain & Kar, 2017; Li et al., 2019) and the Alternating Direction Method of Multipliers (ADMM) (Gabay & Mercier, 1976; Glowinski & Marrocco, 1976; Boyd et al., 2011). Alternating optimization alternates between optimizing over the variables one by one, iteratively. ADMM is based on dual decomposition (Dantzig & Wolfe, 1960; Benders, 1962; Everett III, 1963) and the augmented Lagrangian (Hestenes, 1969; Powell, 1969). ADMM has also been generalized for multiple variables and constraints (Giesen & Laue, 2016; 2019).

– History of iteratively decreasing the feasible set: Cutting-plane methods remove a part of the feasible set at every iteration, where the removed part does not contain the minimizer. The feasible set gets smaller and smaller until it converges to the solution. The most well-known cutting-plane method is the Analytic Center Cutting-Plane Method (ACCPM) (Goffin & Vial, 1993; Nesterov, 1995; Atkinson & Vaidya, 1995). The ellipsoid method (Shor, 1977; Yudin & Nemirovski, 1976; 1977a;b) has a similar idea, but it removes half of an ellipsoid around the current solution at every iteration. The ellipsoid method was initially applied to linear programming (Khachiyan, 1979).

– History of other optimization approaches: There exist some other approaches for optimization. In this paper, for brevity, we do not explain the theory of these other approaches and we merely focus on classical optimization. Riemannian optimization (Absil et al., 2009; Boumal, 2020) is the extension of Euclidean optimization to the cases where the optimization variable lies on a possibly curvy Riemannian manifold (Hosseini & Sra, 2020b; Hu et al., 2020), such as the symmetric positive definite (Sra & Hosseini, 2015), quotient (Lee, 2013), Grassmann (Bendokat et al., 2020), and Stiefel (Edelman et al., 1998) manifolds.
Metaheuristic optimization (Talbi, 2009), in the field of soft computing, is a family of methods that find the optimum of a cost function using efficient, rather than brute-force, search. They use both local and global searches for exploitation and exploration of the cost function, respectively. They can be used in highly non-convex optimization with many constraints, where classical optimization is difficult and slow to perform. These methods contain nature-inspired optimization (Yang, 2010), evolutionary computing (Simon, 2013), and particle-based optimization. Two fundamental metaheuristic methods are the genetic algorithm (Holland et al., 1992) and particle swarm optimization (Kennedy & Eberhart, 1995).

– Important books on optimization: Some important books on optimization are Boyd's book (Boyd & Vandenberghe, 2004), Nocedal's book (Nocedal & Wright, 2006), Nesterov's books (Nesterov, 1998; 2003; 2018) (the book (Nesterov, 2003) is a good reference on first-order methods), Beck's book (Beck, 2017), and some other books (Dennis Jr & Schnabel, 1996; Avriel, 2003; Chong & Zak, 2004; Bubeck, 2014; Jain & Kar, 2017).
In this paper, we introduce and explain these optimization methods and approaches.

Required Background for the Reader
This paper assumes that the reader has general knowledge of calculus and linear algebra.

2. Notations and Preliminaries
2.1. Preliminaries on Sets and Norms
Definition 1 (Interior and boundary of set). Consider a set D in a metric space Rd. The point x ∈ D is an interior point of the set if:

∃ ε > 0 such that {y | ‖y − x‖2 ≤ ε} ⊆ D.

The interior of the set, denoted by int(D), is the set containing all the interior points of the set. The closure of the set is defined as cl(D) := Rd \ int(Rd \ D). The boundary of the set is defined as bd(D) := cl(D) \ int(D). An open (resp. closed) set does not (resp. does) contain its boundary. The closure of a set can also be defined as the smallest closed set containing the set. In other words, the closure of a set is the union of the interior and boundary of the set.

Definition 2 (Convex set and convex hull). A set D is a convex set if it completely contains the line segment between any two points in the set D:

∀ x, y ∈ D, 0 ≤ t ≤ 1 =⇒ tx + (1 − t)y ∈ D.

The convex hull of a (not necessarily convex) set D is the smallest convex set containing the set D. If a set is convex, it is equal to its convex hull.

Definition 3 (Minimum, maximum, infimum, and supremum). A minimum and maximum of a function f : Rd → R, f : x 7→ f(x), with domain D, are defined as:

min_x f(x) ≤ f(y), ∀ y ∈ D,
max_x f(x) ≥ f(y), ∀ y ∈ D,

respectively. The minimum and maximum of a function belong to the range of the function. The infimum and supremum are the greatest lower bound and least upper bound of the function, respectively:

inf_x f(x) := max{z ∈ R | z ≤ f(x), ∀ x ∈ D},
sup_x f(x) := min{z ∈ R | z ≥ f(x), ∀ x ∈ D}.

Depending on the function, the infimum and supremum of a function may or may not belong to the range of the function. Fig. 1 shows some examples of minimum, maximum, infimum, and supremum. The minimum and maximum of a function are also the infimum and supremum of the function, respectively, but the converse is not necessarily true. If the minimum and maximum of a function are the minimum and maximum over the entire domain of the function, they are the global minimum and global maximum, respectively. See Fig. 2 for examples of global minimum and maximum.

Figure 1. Minimum, maximum, infimum, and supremum of example functions.

Figure 2. Examples of stationary points such as local and global extreme points, strict and non-strict extreme points, and a saddle point.

Lemma 1 (Inner product). Consider two vectors x = [x1, . . . , xd]ᵀ ∈ Rd and y = [y1, . . . , yd]ᵀ ∈ Rd. Their inner product, also called the dot product, is:

⟨x, y⟩ = xᵀy = Σ_{i=1}^{d} xi yi.

We also have an inner product between matrices X, Y ∈ Rd1×d2. Let Xi,j denote the (i, j)-th element of the matrix X. The inner product of X and Y is:

⟨X, Y⟩ = tr(XᵀY) = Σ_{i=1}^{d1} Σ_{j=1}^{d2} Xi,j Yi,j,

where tr(·) denotes the trace of a matrix.

Definition 4 (Norm). A function ‖ · ‖ : Rd → R, ‖ · ‖ : x 7→ ‖x‖, is a norm if it satisfies:

1. ‖x‖ ≥ 0,∀x

2. ‖ax‖ = |a| ‖x‖,∀x and all scalars a

3. ‖x‖ = 0 if and only if x = 0

4. Triangle inequality: ‖x+ y‖ ≤ ‖x‖+ ‖y‖.

Definition 5 (Important norms). Some important norms for a vector x = [x1, . . . , xd]ᵀ are as follows. The ℓp norm is:

‖x‖p := (|x1|^p + · · · + |xd|^p)^{1/p},

where p ≥ 1 and | · | denotes the absolute value. Two well-known ℓp norms are the ℓ1 norm and the ℓ2 norm (also called the Euclidean norm), with p = 1 and p = 2, respectively. The ℓ∞ norm, also called the infinity norm, the maximum norm, or the Chebyshev norm, is:

‖x‖∞ := max{|x1|, . . . , |xd|}.

For a matrix X ∈ Rd1×d2, the ℓp norm is:

‖X‖p := sup_{y ≠ 0} ‖Xy‖p / ‖y‖p.

A special case of this is the ℓ2 norm, also called the spectral norm or the Euclidean norm. The spectral norm is related to the largest singular value of the matrix:

‖X‖2 = sup_{y ≠ 0} ‖Xy‖2 / ‖y‖2 = √(λmax(XᵀX)) = σmax(X),

where λmax(XᵀX) and σmax(X) denote the largest eigenvalue of XᵀX and the largest singular value of X, respectively. Other special cases are the maximum-absolute-column-sum norm (p = 1) and the maximum-absolute-row-sum norm (p = ∞):

‖X‖1 = sup_{y ≠ 0} ‖Xy‖1 / ‖y‖1 = max_{1 ≤ j ≤ d2} Σ_{i=1}^{d1} |Xi,j|,
‖X‖∞ = sup_{y ≠ 0} ‖Xy‖∞ / ‖y‖∞ = max_{1 ≤ i ≤ d1} Σ_{j=1}^{d2} |Xi,j|.

The formulation of the Frobenius norm for a matrix is similar to the formulation of the ℓ2 norm for a vector:

‖X‖F := √( Σ_{i=1}^{d1} Σ_{j=1}^{d2} Xi,j² ),

Figure 3. The unit balls, in R2, for (a) the ℓ1 norm, (b) the ℓ2 norm, and (c) the ℓ∞ norm.

where Xi,j denotes the (i, j)-th element of X.
The ℓ2,1 norm of a matrix X is:

‖X‖2,1 := Σ_{i=1}^{d1} √( Σ_{j=1}^{d2} Xi,j² ).

The Schatten ℓp norm of a matrix X is:

‖X‖p := ( Σ_{i=1}^{min(d1,d2)} (σi(X))^p )^{1/p},

where σi(X) denotes the i-th singular value of X. A special case of the Schatten norm, with p = 1, is called the nuclear norm or the trace norm (Fan, 1951):

‖X‖∗ := Σ_{i=1}^{min(d1,d2)} σi(X) = tr(√(XᵀX)),

which is the summation of the singular values of the matrix. Note that, similar to the use of the ℓ1 norm of a vector for sparsity, the nuclear norm is also used to impose sparsity on a matrix.
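As a quick numerical illustration of Definition 5, the following minimal Python sketch (assuming NumPy is available; the data are arbitrary random examples) evaluates the vector and matrix norms above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)          # a vector in R^5
X = rng.standard_normal((4, 3))     # a matrix in R^{4x3}

# Vector norms: l1, l2 (Euclidean), and l_infinity (Chebyshev).
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf))

# Matrix norms induced by vector norms:
# p=2 -> spectral norm (largest singular value),
# p=1 -> maximum absolute column sum, p=inf -> maximum absolute row sum.
print(np.linalg.norm(X, 2), np.linalg.svd(X, compute_uv=False).max())
print(np.linalg.norm(X, 1), np.abs(X).sum(axis=0).max())
print(np.linalg.norm(X, np.inf), np.abs(X).sum(axis=1).max())

# Frobenius norm and nuclear norm (sum of singular values).
print(np.linalg.norm(X, 'fro'), np.sqrt((X ** 2).sum()))
print(np.linalg.norm(X, 'nuc'), np.linalg.svd(X, compute_uv=False).sum())

# The l_{2,1} norm: sum of the l2 norms of the rows.
print(np.linalg.norm(X, axis=1).sum())
```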

Lemma 2. We have:

‖x‖2² = xᵀx = ⟨x, x⟩,
‖X‖F² = tr(XᵀX) = ⟨X, X⟩,

which are convex and in quadratic forms.

Definition 6 (Unit ball). The unit ball for a norm ‖ · ‖ is:

B := {x ∈ Rd | ‖x‖ ≤ 1}.

The unit balls for some of the norms are shown in Fig. 3.

Definition 7 (Dual norm). Let ‖ · ‖ be a norm on Rd. Its dual norm is:

‖x‖∗ := sup{xᵀy | ‖y‖ ≤ 1}. (1)

Note that the notation ‖ · ‖∗ should not be confused with the nuclear norm despite the similarity of notation.


Lemma 3 (Hölder's (Hölder, 1889) and Cauchy-Schwarz inequalities (Steele, 2004)). Let p, q ∈ [1, ∞] and:

1/p + 1/q = 1. (2)

These p and q are called the Hölder conjugates of each other. According to Hölder's inequality, for functions f(·) and g(·), we have ‖fg‖1 ≤ ‖f‖p ‖g‖q. A corollary of Hölder's inequality is that Eq. (2) holds if the norms ‖ · ‖p and ‖ · ‖q are dual of each other. Hölder's inequality states that:

|xᵀy| ≤ ‖x‖p ‖y‖q,

where p and q satisfy Eq. (2). A special case of Hölder's inequality is the Cauchy-Schwarz inequality, stated as |xᵀy| ≤ ‖x‖2 ‖y‖2.

According to Eq. (2), we have:

‖ · ‖ = ‖ · ‖p =⇒ ‖ · ‖∗ = ‖ · ‖p/(p−1), ∀ p ∈ [1, ∞]. (3)

For example, the dual norm of ‖ · ‖2 is ‖ · ‖2 again, and the dual norm of ‖ · ‖1 is ‖ · ‖∞.
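A small numerical sanity check of Hölder's and Cauchy-Schwarz's inequalities, and of the dual-norm relation (3), can look like the following Python sketch (assuming NumPy; the choice p = 3 with conjugate q = 3/2 and the random vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(6), rng.standard_normal(6)

p = 3.0
q = p / (p - 1.0)   # Holder conjugate of p, so that 1/p + 1/q = 1

lhs = abs(x @ y)
print(lhs <= np.linalg.norm(x, p) * np.linalg.norm(y, q))   # Holder's inequality
print(lhs <= np.linalg.norm(x, 2) * np.linalg.norm(y, 2))   # Cauchy-Schwarz (p = q = 2)

# Dual-norm relation (3): the dual of the l1 norm is the l_infinity norm.
# The sup in Eq. (1) over the l1 unit ball is attained at a signed vertex e_i.
i = int(np.argmax(np.abs(x)))
y_star = np.zeros_like(x)
y_star[i] = np.sign(x[i])
print(np.isclose(x @ y_star, np.linalg.norm(x, np.inf)))    # the two values coincide
```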

Definition 8 (Cone and dual cone). A set K ⊆ Rd is a cone if:

1. it contains the origin, i.e., 0 ∈ K,

2. K is a convex set,

3. for each x ∈ K and λ ≥ 0, we have λx ∈ K.

The dual cone of a cone K is:

K∗ := {y | yᵀx ≥ 0, ∀ x ∈ K}.

An example cone and its dual are depicted in Fig. 4-a.

Definition 9 (Proper cone (Boyd & Vandenberghe, 2004)). A convex cone K ⊆ Rd is a proper cone if:

1. K is closed, i.e., it contains its boundary,

2. K is solid, i.e., its interior is non-empty,

3. K is pointed, i.e., it contains no line. In other words,it is not a two-sided cone around the origin.

Definition 10 (Generalized inequality (Boyd & Vandenberghe, 2004)). A generalized inequality, defined by a proper cone K, is:

x ⪰K y ⇐⇒ x − y ∈ K.

The strict version is x ≻K y ⇐⇒ x − y ∈ int(K). Note that x ⪰K y can also be stated as x − y ⪰K 0. An example of a generalized inequality is shown in Fig. 4-b.

Figure 4. (a) A cone K and its dual cone K∗. Note that the borders of the dual cone are perpendicular to the borders of the cone, as shown in this figure. (b) An example of the generalized inequality x ⪰K y. As shown, the vector (x − y) belongs to the cone K.

Definition 11 (Important examples of generalized inequality). The generalized inequality defined by the non-negative orthant, K = Rd+, is the default inequality for vectors x = [x1, . . . , xd]ᵀ, y = [y1, . . . , yd]ᵀ:

x ⪰ y ⇐⇒ x ⪰_{Rd+} y.

It means component-wise inequality:

x ⪰ y ⇐⇒ xi ≥ yi, ∀ i ∈ {1, . . . , d}.

The generalized inequality defined by the positive semi-definite cone, K = Sd+, is the default inequality for symmetric matrices X, Y ∈ Sd:

X ⪰ Y ⇐⇒ X ⪰_{Sd+} Y.

It means (X − Y) is positive semi-definite. Note that if the inequality is strict, i.e., X ≻ Y, it means that (X − Y) is positive definite. In conclusion, x ⪰ 0 means all elements of the vector x are non-negative, and X ⪰ 0 means the matrix X is positive semi-definite.
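To make Definition 11 concrete, a minimal Python check of both generalized inequalities (assuming NumPy; the vectors and matrices are arbitrary examples) might look as follows:

```python
import numpy as np

# Non-negative orthant: x >= y holds iff every component of (x - y) is non-negative.
x = np.array([3.0, 2.0, 5.0])
y = np.array([1.0, 2.0, 4.0])
print(np.all(x - y >= 0))            # True, so x dominates y component-wise

# Positive semi-definite cone: X >= Y holds iff (X - Y) has no negative eigenvalues.
X = np.array([[4.0, 1.0], [1.0, 3.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
eigvals = np.linalg.eigvalsh(X - Y)  # eigenvalues of the symmetric difference
print(np.all(eigvals >= 0))          # True, so (X - Y) is positive semi-definite
```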

2.2. Preliminaries on Functions
Definition 12 (Fixed point). A fixed point of a function f(·) is a point x which is mapped to itself by the function, i.e., f(x) = x.


Figure 5. Two definitions of a convex function: (a) Eq. (4), meaning that the function value at the point αx + (1 − α)y is less than or equal to the hyper-line αf(x) + (1 − α)f(y), and (b) Eq. (5), meaning that the function value f(x) falls above the hyper-line f(y) + ∇f(y)ᵀ(x − y), ∀ x, y ∈ D.

Definition 13 (Convex function). A function f(·) with domain D is convex if:

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), (4)

∀ x, y ∈ D, where α ∈ [0, 1]. Eq. (4) is depicted in Fig. 5-a.
Moreover, if the function f(·) is differentiable, it is convex if:

f(x) ≥ f(y) + ∇f(y)ᵀ(x − y), (5)

∀ x, y ∈ D. Eq. (5) is depicted in Fig. 5-b.
Moreover, if the function f(·) is twice differentiable, it is convex if its second-order derivative is positive semi-definite:

∇²f(x) ⪰ 0, (6)

∀ x ∈ D.

Each of Eqs. (4), (5), and (6) is a definition of a convex function. Note that if the direction of the inequality is reversed in Eq. (4) or Eq. (5), or if ⪰ is changed to ⪯ in Eq. (6), the function is concave.
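The three characterizations (4), (5), and (6) can be checked numerically on a sample convex function. The short Python sketch below (assuming NumPy; the quadratic f(x) = xᵀAx with a positive semi-definite A and the random points are arbitrary choices) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M @ M.T                      # A = M M^T is positive semi-definite, so f is convex

f = lambda x: x @ A @ x          # f(x) = x^T A x
grad = lambda x: 2 * A @ x       # gradient of f
hess = 2 * A                     # Hessian of f (constant for a quadratic)

x, y, alpha = rng.standard_normal(3), rng.standard_normal(3), 0.3

# Eq. (4): f(alpha x + (1 - alpha) y) <= alpha f(x) + (1 - alpha) f(y)
print(f(alpha * x + (1 - alpha) * y) <= alpha * f(x) + (1 - alpha) * f(y))

# Eq. (5): f lies above its tangent plane at y
print(f(x) >= f(y) + grad(y) @ (x - y))

# Eq. (6): the Hessian is positive semi-definite
print(np.all(np.linalg.eigvalsh(hess) >= -1e-12))
```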

Definition 14 (Strongly convex function). A differentiable function f(·) with domain D is µ-strongly convex if:

f(x) ≥ f(y) + ∇f(y)ᵀ(x − y) + (µ/2) ‖x − y‖2², (7)

∀ x, y ∈ D, for some µ > 0.
Moreover, if the function f(·) is twice differentiable, it is µ-strongly convex if its second-order derivative satisfies:

yᵀ∇²f(x)y ≥ µ ‖y‖2², (8)

∀ x, y ∈ D, for some µ > 0. A strongly convex function has a unique minimizer. See Fig. 6 for the difference between convex and strongly convex functions.

Figure 6. Comparison of strongly convex and convex functions. A strongly convex function has only one strict minimizer, while a convex function can have multiple minimizers with equal function values.

Definition 15 (Hölder and Lipschitz smoothness). A function f(·) with domain D belongs to a Hölder space H(α, L), with smoothness parameter α and radius L for the ball (as the space can be seen as a ball), if:

|f(x) − f(y)| ≤ L ‖x − y‖2^α, ∀ x, y ∈ D. (9)

The Hölder space relates to local smoothness. A function in this space is called Hölder smooth (or Hölder continuous). A function is Lipschitz smooth (or Lipschitz continuous) if it is Hölder smooth with α = 1:

|f(x) − f(y)| ≤ L ‖x − y‖2, ∀ x, y ∈ D. (10)

The parameter L is called the Lipschitz constant. A function with Lipschitz smoothness (with Lipschitz constant L) is called L-smooth.

Hölder and Lipschitz smoothness are used in many convergence and correctness proofs for optimization (e.g., see (Liu et al., 2021)). The following lemma, which is based on the fundamental theorem of calculus, is widely used in proofs of optimization methods.

Lemma 4 (Fundamental theorem of calculus for multivariate functions). Consider a differentiable function f(·) with domain D. For any x, y ∈ D, we have:

f(y) = f(x) + ∇f(x)ᵀ(y − x) + ∫₀¹ (∇f(x + t(y − x)) − ∇f(x))ᵀ (y − x) dt
     = f(x) + ∇f(x)ᵀ(y − x) + o(y − x), (11)

where o(·) is the small-o complexity.

Lemma 5 (Corollary of the fundamental theorem of calculus). Consider a differentiable function f(·), with domain D, whose gradient is L-smooth:

‖∇f(x) − ∇f(y)‖2 ≤ L ‖x − y‖2, ∀ x, y ∈ D. (12)

For any x, y ∈ D, we have:

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2) ‖y − x‖2². (13)


Proof. Proof is available in Appendix A.1.
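As an illustration of Lemma 5, the sketch below (Python with NumPy; the quadratic test function and the sample points are arbitrary) verifies Eqs. (12) and (13) using L = λmax(Q), which is the Lipschitz constant of the gradient of f(x) = (1/2)xᵀQx:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
Q = M @ M.T + np.eye(4)          # symmetric positive definite

f = lambda x: 0.5 * x @ Q @ x    # f(x) = (1/2) x^T Q x
grad = lambda x: Q @ x           # its gradient; Lipschitz constant L = lambda_max(Q)
L = np.linalg.eigvalsh(Q).max()

x, y = rng.standard_normal(4), rng.standard_normal(4)

# Eq. (12): the gradient is L-smooth.
print(np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-12)

# Eq. (13): f is bounded above by its first-order model plus (L/2)||y - x||^2.
upper_bound = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
print(f(y) <= upper_bound + 1e-12)
```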

The following lemma is useful for proofs of convergenceof first-order methods.

Lemma 6. Consider a convex and differentiable function f(·), with domain D, whose gradient is L-smooth (see Eq. (12)). We have:

f(y) − f(x) ≤ ∇f(y)ᵀ(y − x) − (1/(2L)) ‖∇f(y) − ∇f(x)‖2², (14)

(∇f(y) − ∇f(x))ᵀ(y − x) ≥ (1/L) ‖∇f(y) − ∇f(x)‖2². (15)

Proof. Proof is available in Appendix A.2.

2.3. Preliminaries on Optimization
Definition 16 (Local and global minimizers). A point x ∈ D is a local minimizer of the function f(·) if and only if:

∃ ε > 0 : ∀ y ∈ D, ‖y − x‖2 ≤ ε =⇒ f(x) ≤ f(y), (16)

meaning that in an ε-neighborhood of x, the value of the function is minimum at x. A point x ∈ D is a global minimizer of the function f(·) if and only if:

f(x) ≤ f(y), ∀ y ∈ D. (17)

See Fig. 2 for examples of a local minimizer and maximizer.

Definition 17 (Strict minimizers). In Eqs. (16) and (17), if we have f(x) < f(y) rather than f(x) ≤ f(y), the minimizer is a strict local or global minimizer, respectively. See Fig. 2 for examples of strict/non-strict minimizers and maximizers.

Lemma 7 (Minimizer in convex function). In a convex function, any local minimizer is a global minimizer.

Proof. Proof is available in Appendix A.3.

Corollary 1. In a convex function, every local minimizer is a global minimizer. Intuitively, a convex function is like a multi-dimensional bowl with a single basin.

Lemma 8 (Gradient of a convex function at the minimizer point). When the function f(·) is convex and differentiable, a point x∗ is a minimizer if and only if ∇f(x∗) = 0.

Proof. Proof is available in Appendix A.4.

Definition 18 (Stationary, extremum, and saddle points). In a general (not necessarily convex) function f(·), a point x∗ is a stationary point if and only if ∇f(x∗) = 0. When passing through a saddle point, the sign of the second derivative flips to the opposite sign. Minimizer and maximizer points (locally or globally) minimize and maximize the function, respectively. A saddle point is neither a minimizer nor a maximizer, although the gradient at a saddle point is zero. Both minimizers and maximizers are also called extremum points. As Fig. 2 shows, a stationary point can be either a minimizer, a maximizer, or a saddle point of the function.

Lemma 9 (First-order optimality condition (Nesterov, 2018, Theorem 1.2.1)). If x∗ is a local minimizer of a differentiable function f(·), then:

∇f(x∗) = 0. (18)

Note that if f(.) is convex, this equation is a necessary andsufficient condition for a minimizer.

Proof. Proof is available in Appendix A.5.

Note that if setting the derivative to zero, i.e., Eq. (18), gives a closed-form solution for x∗, the optimization is done. Otherwise, one should start with a randomly initialized solution and iteratively update it using the gradient. First-order or second-order methods can be used for iterative optimization (see Sections 5 and 7).

Definition 19 (Arguments of minimization and maximization). In the domain of a function, the point which minimizes (resp. maximizes) the function f(·) is the argument of the minimization (resp. maximization) of the function. The minimizer and maximizer of a function are denoted by arg min_x f(x) and arg max_x f(x), respectively.

Remark 1. We can convert maximization to minimization and vice versa:

max_x f(x) = − min_x (−f(x)),
min_x f(x) = − max_x (−f(x)). (19)

We have similar conversions for the arguments of maximization and minimization; however, because the sign of the optimal function value does not matter for the argument, there is no negative sign in front of the maximization and minimization:

arg max_x f(x) = arg min_x (−f(x)),
arg min_x f(x) = arg max_x (−f(x)). (20)

Definition 20 (Convergence rates). If a sequence ε0, ε1, . . . , εk, εk+1, . . . converges, its convergence rate has one of the following cases:

lim_{k→∞} εk+1/εk = 0 : superlinear rate,
lim_{k→∞} εk+1/εk ∈ (0, 1) : linear rate,
lim_{k→∞} εk+1/εk = 1 : sublinear rate. (21)
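The ratio in Eq. (21) can be estimated numerically from the tail of an error sequence. The Python sketch below (assuming NumPy; the three example sequences are arbitrary stand-ins for the error histories of different algorithms) illustrates the three cases:

```python
import numpy as np

def rate_ratio(errors):
    """Estimate lim eps_{k+1}/eps_k from the last few terms of an error sequence."""
    errors = np.asarray(errors, dtype=float)
    return (errors[-5:] / errors[-6:-1]).mean()

k = np.arange(1, 60)
superlinear = 0.5 ** (2.0 ** np.arange(1, 8))   # e.g. a Newton-like error history: ratio tends to 0
linear = 0.9 ** k                               # e.g. a linearly convergent method: ratio in (0, 1)
sublinear = 1.0 / k                             # e.g. a sublinear rate: ratio tends to 1

print(rate_ratio(superlinear))  # close to 0   -> superlinear
print(rate_ratio(linear))       # close to 0.9 -> linear
print(rate_ratio(sublinear))    # close to 1   -> sublinear
```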


2.4. Preliminaries on Derivative
Remark 2 (Dimensionality of derivative). Consider a function f : Rd1 → Rd2, f : x 7→ f(x). The derivative of the function f(x) ∈ Rd2 with respect to (w.r.t.) x ∈ Rd1 has dimensionality (d1 × d2). This is because tweaking every element of x ∈ Rd1 can change every element of f(x) ∈ Rd2. The (i, j)-th element of the (d1 × d2)-dimensional derivative states the amount of change in the j-th element of f(x) resulting from changing the i-th element of x.
Note that one can use the transpose of the derivative as the derivative. This is fine as long as the dimensionality of the other terms in the optimization equations coincides (i.e., they are all transposed). In that case, the dimensionality of the derivative is (d2 × d1), where the (i, j)-th element of the derivative states the amount of change in the i-th element of f(x) resulting from changing the j-th element of x.
Some examples of derivatives are as follows.

• If the function is f : R → R, f : x 7→ f(x), the derivative (∂f(x)/∂x) ∈ R is a scalar because changing the scalar x can change the scalar f(x).

• If the function is f : Rd → R, f : x 7→ f(x), the derivative (∂f(x)/∂x) ∈ Rd is a vector because changing every element of the vector x can change the scalar f(x).

• If the function is f : Rd1×d2 → R, f : X 7→ f(X), the derivative (∂f(X)/∂X) ∈ Rd1×d2 is a matrix because changing every element of the matrix X can change the scalar f(X).

• If the function is f : Rd1 → Rd2, f : x 7→ f(x), the derivative (∂f(x)/∂x) ∈ Rd1×d2 is a matrix because changing every element of the vector x can change every element of the vector f(x).

• If the function is f : Rd1×d2 → Rd3, f : X 7→ f(X), the derivative (∂f(X)/∂X) is a (d1 × d2 × d3)-dimensional tensor because changing every element of the matrix X can change every element of the vector f(X).

• If the function is f : Rd1×d2 → Rd3×d4, f : X 7→ f(X), the derivative (∂f(X)/∂X) is a (d1 × d2 × d3 × d4)-dimensional tensor because changing every element of the matrix X can change every element of the matrix f(X).

In other words, the derivative of a scalar w.r.t. a scalar is a scalar. The derivative of a scalar w.r.t. a vector is a vector. The derivative of a scalar w.r.t. a matrix is a matrix. The derivative of a vector w.r.t. a vector is a matrix. The derivative of a vector w.r.t. a matrix is a rank-3 tensor. The derivative of a matrix w.r.t. a matrix is a rank-4 tensor.

Definition 21 (Gradient, Jacobian, and Hessian). Consider a function f : Rd → R, f : x 7→ f(x). In optimizing the function f, the derivative of the function w.r.t. its variable x is called the gradient, denoted by:

∇f(x) := ∂f(x)/∂x ∈ Rd.

The second derivative of the function w.r.t. its variable is called the Hessian matrix, denoted by:

B = ∇²f(x) := ∂²f(x)/∂x² ∈ Rd×d.

The Hessian matrix is symmetric. If the function is convex, its Hessian matrix is positive semi-definite.
If the function is multi-dimensional, i.e., f : Rd1 → Rd2, f : x 7→ f(x), the gradient becomes a matrix:

J := [∂f/∂x1, . . . , ∂f/∂xd1]ᵀ ∈ Rd1×d2, with (i, j)-th entry Ji,j = ∂fj/∂xi,

where x = [x1, . . . , xd1]ᵀ and f(x) = [f1, . . . , fd2]ᵀ. This matrix derivative is called the Jacobian matrix.

Corollary 2 (Technique for calculating derivative). According to the size of the derivative, we can easily calculate derivatives. For finding the correct derivative of a multiplication of matrices (or vectors), one can temporarily assume some dimensionality for every matrix and find the correct order of the matrices in the derivative. Let X ∈ Ra×b. An example of calculating a derivative is:

Ra×b ∋ ∂/∂X (tr(AXB)) = AᵀBᵀ = (BA)ᵀ. (22)

This is calculated as explained in the following. We assume A ∈ Rc×a and B ∈ Rb×c so that the matrix multiplication AXB is valid and AXB ∈ Rc×c, because the argument of the trace should be a square matrix. The derivative ∂(tr(AXB))/∂X has size Ra×b because tr(AXB) is a scalar and X is (a × b)-dimensional. We know that the derivative should be some multiplication of A and B because tr(AXB) is linear w.r.t. X. Now, we should find their order in the multiplication. Based on the assumed sizes of A and B, we see that AᵀBᵀ has the desired size and these matrices can be multiplied by each other. Hence, this is the correct derivative.

Lemma 10 (Derivative of matrix w.r.t. matrix). As explained in Remark 2, the derivative of a matrix w.r.t. another matrix is a tensor. Working with tensors is difficult; hence, we can use the Kronecker product for representing the tensor as a matrix. This is the Magnus-Neudecker convention (Magnus & Neudecker, 1985), in which all matrices are vectorized. For example, if X ∈ Ra×b, A ∈ Rc×a, and B ∈ Rb×d, we have:

R(cd)×(ab) ∋ ∂/∂X (AXB) = Bᵀ ⊗ A, (23)

where ⊗ denotes the Kronecker product.
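The two identities above are easy to check numerically. The Python sketch below (assuming NumPy; all matrices are random examples) verifies Eq. (22) by finite differences and Eq. (23) through the column-major vectorization identity vec(AXB) = (Bᵀ ⊗ A) vec(X):

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, c, d = 3, 4, 2, 5
A = rng.standard_normal((c, a))
X = rng.standard_normal((a, b))
B = rng.standard_normal((b, c))

# Eq. (22): d tr(AXB) / dX = A^T B^T, checked entry-wise by finite differences.
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(a):
    for j in range(b):
        Xp = X.copy(); Xp[i, j] += eps
        numeric[i, j] = (np.trace(A @ Xp @ B) - np.trace(A @ X @ B)) / eps
print(np.allclose(numeric, A.T @ B.T, atol=1e-4))

# Eq. (23): vec(AXB) = (B^T kron A) vec(X), with column-major (Fortran-order) vec.
B2 = rng.standard_normal((b, d))
vec = lambda M: M.flatten(order='F')
print(np.allclose(vec(A @ X @ B2), np.kron(B2.T, A) @ vec(X)))
```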


Remark 3 (Chain rule in matrix derivatives). When having composite functions (i.e., a function of a function), we use the chain rule for the derivative. When we have the derivative of a matrix w.r.t. a matrix, this chain rule can get difficult, but we can handle it by checking the compatibility of dimensions in the matrix multiplications. We should use Lemma 10 and the vectorization technique, in which the matrix is vectorized. Let vec(·) denote vectorization of an Ra×b matrix to an Rab vector. Also, let vec⁻¹_{a×b}(·) be de-vectorization of an Rab vector to an Ra×b matrix.
For the purpose of this tutorial, here we calculate a derivative by the chain rule as an example:

f(S) = tr(ASB), S = CM̄D, M̄ = M / ‖M‖F²,

where A ∈ Rc×a, S ∈ Ra×b, B ∈ Rb×c, C ∈ Ra×d, M̄ ∈ Rd×d, D ∈ Rd×b, and M ∈ Rd×d. We have:

Ra×b ∋ ∂f(S)/∂S = (BA)ᵀ   (by Eq. (22)),

Rab×d² ∋ ∂S/∂M̄ = Dᵀ ⊗ C   (by Eq. (23)),

Rd²×d² ∋ ∂M̄/∂M = (1/‖M‖F⁴) (‖M‖F² I_{d²} − 2 M ⊗ M) = (1/‖M‖F²) (I_{d²} − (2/‖M‖F²) M ⊗ M),

where the last step is because of the formula for the derivative of a fraction, and I_{d²} is a (d² × d²)-dimensional identity matrix. Finally, by the chain rule, we have:

Rd×d ∋ ∂f/∂M = vec⁻¹_{d×d}( (∂M̄/∂M)ᵀ (∂S/∂M̄)ᵀ vec(∂f(S)/∂S) ).

Note that the chain rule in matrix derivatives is usually stated right to left in the matrix multiplications, while transposes are used for the matrices in the multiplication.

More formulas for matrix derivatives can be found in the matrix cookbook (Petersen & Pedersen, 2012) and similar resources. Here, we discussed only real derivatives. When working with complex data (with an imaginary part), we need complex derivatives. The reader can refer to (Hjorungnes & Gesbert, 2007) and (Chong, 2021, Chapter 7, Complex Derivatives) for techniques in complex derivatives.

3. Optimization Problems
3.1. Standard Problems
Here, we review the standard forms of convex optimization and explain why these forms are important. Note that the term "programming" refers to solving optimization problems.

3.1.1. GENERAL OPTIMIZATION PROBLEM

Consider the function f : Rd → R, f : x 7→ f(x). Let the domain of the function be D, where x ∈ D, x ∈ Rd. Consider the following unconstrained minimization of a cost function f(·):

minimize_x f(x), (24)

where x is called the optimization variable and the function f(·) is called the objective function or the cost function. This is an unconstrained problem where the optimization variable x needs only to be in the domain of the function, i.e., x ∈ D, while minimizing the function f(·).
The optimization problem can be constrained, where the optimization variable x should satisfy some equality and/or inequality constraints, in addition to being in the domain of the function, while minimizing the function f(·). Consider a constrained optimization problem where we want to minimize the function f(x) while satisfying m1 inequality constraints and m2 equality constraints:

minimize_x f(x)
subject to yi(x) ≤ 0, i ∈ {1, . . . , m1},
           hi(x) = 0, i ∈ {1, . . . , m2}, (25)

where f(x) is the objective function, every yi(x) ≤ 0 is an inequality constraint, and every hi(x) = 0 is an equality constraint. Note that if some of the inequality constraints are not in the form yi(x) ≤ 0, we can restate them as:

yi(x) ≥ 0 =⇒ −yi(x) ≤ 0,
yi(x) ≤ c =⇒ yi(x) − c ≤ 0.

Therefore, all inequality constraints can be written in the form yi(x) ≤ 0. Furthermore, according to Eq. (19), if the optimization problem (25) is a maximization problem rather than a minimization, we can convert it to a minimization by multiplying its objective function by −1:

maximize_x f(x) subject to constraints  ≡  minimize_x (−f(x)) subject to constraints. (26)

Definition 22 (Feasible point). The point x is feasible for the optimization problem (25) if:

x ∈ D, and yi(x) ≤ 0, ∀ i ∈ {1, . . . , m1}, and hi(x) = 0, ∀ i ∈ {1, . . . , m2}. (27)

The constrained optimization problem can also be stated as:

minimize_x f(x)
subject to x ∈ S, (28)

where S is the feasible set of the constraints.
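For a concrete feel of problem (25), the sketch below solves a tiny constrained instance numerically with SciPy's general-purpose solver (a hedged example, not a method advocated in this paper; the objective, constraints, and the SLSQP choice are arbitrary). Note that SciPy encodes inequality constraints as g(x) ≥ 0, so the form yi(x) ≤ 0 used here is passed as −yi(x) ≥ 0:

```python
import numpy as np
from scipy.optimize import minimize

# Objective f(x) = (x1 - 1)^2 + (x2 - 2)^2
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

# One inequality constraint y1(x) = x1 + x2 - 2 <= 0 and one equality h1(x) = x1 - x2 = 0.
constraints = [
    {"type": "ineq", "fun": lambda x: -(x[0] + x[1] - 2.0)},  # -y1(x) >= 0
    {"type": "eq",   "fun": lambda x: x[0] - x[1]},           # h1(x) = 0
]

result = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(result.x)   # expected to be close to [1, 1], the feasible minimizer
```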


3.1.2. CONVEX OPTIMIZATION PROBLEM
A convex optimization problem is of the form:

minimize_x f(x)
subject to yi(x) ≤ 0, i ∈ {1, . . . , m1},
           Ax = b, (29)

where the functions f(·) and yi(·), ∀i, are all convex functions and the equality constraints are affine functions. The feasible set of a convex problem is a convex set.

3.1.3. LINEAR PROGRAMMING
A linear programming problem is of the form:

minimize_x cᵀx + d
subject to Gx ⪯ h,
           Ax = b, (30)

where the objective function and equality constraints are affine functions. The feasible set of a linear programming problem is a polyhedron, while the cost is planar (affine). A survey of linear programming methods is available in the book (Dantzig, 1963). One of the well-known methods for solving linear programming is the simplex method, which initially appeared in 1947 (Dantzig, 1983). The simplex method moves between the vertices of the feasible polyhedron, until convergence, for minimizing the objective function. It is efficient, and its proposal was a breakthrough in the field of optimization.
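A minimal numerical example of the form (30), using SciPy's linear-programming interface (a hedged sketch; the data are arbitrary, and bounds=(None, None) is passed because linprog assumes x ⪰ 0 by default):

```python
import numpy as np
from scipy.optimize import linprog

# minimize c^T x subject to G x <= h and A x = b
c = np.array([1.0, 2.0])
G = np.array([[-1.0, 0.0], [0.0, -1.0]])   # encodes x1 >= 0, x2 >= 0
h = np.array([0.0, 0.0])
A = np.array([[1.0, 1.0]])                 # x1 + x2 = 1
b = np.array([1.0])

res = linprog(c, A_ub=G, b_ub=h, A_eq=A, b_eq=b, bounds=(None, None))
print(res.x, res.fun)   # expected: x = [1, 0] with optimal value 1
```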

3.1.4. QUADRATIC PROGRAMMING
A quadratic programming problem is of the form:

minimize_x (1/2) xᵀPx + qᵀx + r
subject to Gx ⪯ h,
           Ax = b, (31)

where P ⪰ 0 (which is the second derivative of the objective function) is a symmetric positive semi-definite matrix, the objective function is quadratic, and the equality constraints are affine functions. The feasible set of a quadratic programming problem is a polyhedron, while the cost is curvy (quadratic).
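The form (31) can be solved directly with a modeling toolbox. The sketch below uses CVXPY (assuming the cvxpy package is installed; the data are a random feasible instance, not taken from this paper):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n = 3
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)            # symmetric positive definite
q = rng.standard_normal(n)
G = np.vstack([np.eye(n), -np.eye(n)])
h = np.ones(2 * n)                 # box constraint -1 <= x <= 1
A = np.ones((1, n))
b = np.array([1.0])                # entries of x sum to 1

x = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.quad_form(x, P) + q @ x)
constraints = [G @ x <= h, A @ x == b]
problem = cp.Problem(objective, constraints)
problem.solve()
print(x.value, problem.value)
```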

3.1.5. QUADRATICALLY CONSTRAINED QUADRATIC PROGRAMMING (QCQP)
A QCQP problem is of the form:

minimize_x (1/2) xᵀPx + qᵀx + r
subject to (1/2) xᵀM_i x + s_iᵀx + z_i ≤ 0, i ∈ {1, . . . , m1},
           Ax = b, (32)

where P, M_i ⪰ 0, ∀i, the objective function and the inequality constraints are quadratic, and the equality constraints are affine functions. The feasible set of a QCQP problem is the intersection of m1 ellipsoids and an affine set, while the cost is curvy (quadratic).

3.1.6. SECOND-ORDER CONE PROGRAMMING (SOCP)

A SOCP problem is of the form:

minimize_x   f>x
subject to   ‖Aix + bi‖2 ≤ ci>x + di,  i ∈ {1, . . . , m1},
             Fx = g,        (33)

where each inequality constraint requires the norm of an affine function to be less than or equal to an affine function. The constraint ‖Aix + bi‖2 − ci>x − di ≤ 0 is called the second-order cone constraint; the shape of the second-order cone is like an ice-cream cone.

3.1.7. SEMIDEFINITE PROGRAMMING (SDP)

An SDP problem is of the form:

minimize_X   tr(CX)
subject to   X ⪰ 0,
             tr(DiX) ≤ ei,  i ∈ {1, . . . , m1},
             tr(AiX) = bi,  i ∈ {1, . . . , m2},        (34)

where the optimization variable X belongs to the positive semidefinite cone S^d_+, tr(.) denotes the trace of a matrix, C, Di, Ai ∈ S^d, ∀i, and S^d denotes the cone of (d × d) symmetric matrices. The trace terms may be written in summation forms. Note that tr(C>X) is the inner product of the two matrices C and X, and if the matrix C is symmetric, this inner product is equal to tr(CX).
Another form for SDP is:

minimize_x   c>x
subject to   ( ∑_{i=1}^{d} xi Fi ) + G ⪯ 0,
             Ax = b,        (35)

where x = [x1, . . . , xd]>, G, Fi ∈ S^d, ∀i, and A, b, and c are constant matrices/vectors.

3.1.8. OPTIMIZATION TOOLBOXES

All the standard optimization forms can be restated as SDP because their constraints can be written as belonging to some cones (see Definitions 10 and 11); hence, they are special cases of SDP. The interior-point method, or the barrier method, introduced in Section 7.4, can be used for solving various optimization problems including SDP (Nesterov & Nemirovskii, 1994; Boyd & Vandenberghe, 2004). Optimization toolboxes such as CVX (Grant et al., 2009) often use the interior-point method (see Section 7.4) for solving optimization problems such as SDP. Note that the interior-point method is iterative, and solving SDP is usually time consuming, especially for large matrices. If the optimization problem is a convex optimization problem (e.g., SDP is a convex problem), it has only one local optimum, which is the global optimum (see Corollary 1).
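As an illustration, the following sketch poses a small quadratic program of the form (31) in CVXPY, a Python modeling package in the same spirit as the CVX toolbox mentioned above; the problem data P, q, G, h, A, b are made-up assumptions for the example, not taken from the text.

```python
import numpy as np
import cvxpy as cp

# Hypothetical problem data for a small QP of the form (31).
P = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
q = np.array([1.0, -1.0])
G = np.array([[1.0, 1.0], [-1.0, 0.0]])
h = np.array([1.0, 0.0])
A = np.array([[1.0, -1.0]])
b = np.array([0.5])

x = cp.Variable(2)
objective = cp.Minimize(0.5 * cp.quad_form(x, P) + q @ x)
constraints = [G @ x <= h, A @ x == b]
prob = cp.Problem(objective, constraints)
prob.solve()   # the backend solver is typically an interior-point-style method

print("optimal value:", prob.value)
print("optimal x:", x.value)
```

The same modeling pattern (variable, objective, constraints, solve) applies to the LP, QCQP, SOCP, and SDP forms above.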

3.2. Eliminating Constraints and Equivalent Problems
Here, we review some useful techniques for converting optimization problems to their equivalent forms.

3.2.1. ELIMINATING INEQUALITY CONSTRAINTS

As discussed in Section 7.4, we can eliminate the inequality constraints by embedding them into the objective function using indicator or barrier functions.

3.2.2. ELIMINATING EQUALITY CONSTRAINTS

Consider the optimization problem (55). We can eliminate the equality constraints, Ax = b, as explained in the following. Let A ∈ R^{m2×d} with m2 < d, and let N(A) := {x ∈ R^d | Ax = 0} denote the null space of the matrix A. We have:

∀z ∈ N(A), ∃u ∈ R^{d−m2}, C ∈ R^{d×(d−m2)} :  Col(C) = N(A),  z = Cu,

where Col(.) is the column space or range of a matrix. Therefore, we can say:

∀z ∈ N(A) :  A(x − z) = Ax − Az = Ax − 0 = Ax
=⇒  A(x − z) = Ax = b
=⇒  x = A†b + z = A†b + Cu,        (36)

where A† := A>(AA>)−1 is the pseudo-inverse of the matrix A. Putting Eq. (36) in problem (55) changes the optimization variable and eliminates the equality constraint:

minimize_u   f(Cu + A†b)
subject to   yi(Cu + A†b) ≤ 0,  i ∈ {1, . . . , m1}.        (37)

If u∗ is the solution to this problem, the solution to problem (55) is x∗ = Cu∗ + A†b.
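A small numerical sketch of this reparameterization: SciPy's null_space gives a basis C for N(A) and the pseudo-inverse gives the particular solution A†b, so any feasible x can be written as A†b + Cu. The matrices below are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy.linalg import null_space

# Hypothetical equality constraints Ax = b with A in R^{m2 x d}, m2 < d.
A = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0,  1.0]])      # m2 = 2, d = 3
b = np.array([1.0, 2.0])

C = null_space(A)                     # d x (d - m2) basis of N(A)
x_particular = np.linalg.pinv(A) @ b  # A† b, a particular solution

# Any feasible x is x = A† b + C u for some u in R^{d - m2}.
u = np.array([0.7])                   # arbitrary choice of the free variable
x = x_particular + C @ u

print(np.allclose(A @ x, b))          # True: the equality constraint holds for every u
```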

3.2.3. ADDING EQUALITY CONSTRAINTS

Conversely, we can convert the problem:

minimize_{ {xi}_{i=0}^{m1} }   f(Ax0 + b0)
subject to   yi(Axi + bi) ≤ 0,  i ∈ {1, . . . , m1},        (38)

to:

minimize_{ {ui, xi}_{i=0}^{m1} }   f(u0)
subject to   yi(ui) ≤ 0,  i ∈ {1, . . . , m1},
             ui = Axi + bi,  i ∈ {0, 1, . . . , m1},        (39)

by change of variables.

3.2.4. ELIMINATING SET CONSTRAINTS

As discussed in Section 5.6.1, we can convert problem (28) to problem (151) by using the indicator function. That problem can be solved iteratively where, at every iteration, the solution is updated (by first- or second-order methods) without the set constraint and then the updated solution of the iteration is projected onto the set. This procedure is repeated until convergence.

3.2.5. ADDING SLACK VARIABLES

Consider the following problem with inequality constraints:

minimize_x   f(x)
subject to   yi(x) ≤ 0,  i ∈ {1, . . . , m1}.        (40)

Using the so-called slack variables, denoted by {ξi ∈ R+}_{i=1}^{m1}, we can convert this problem to the following problem:

minimize_{x, {ξi}_{i=1}^{m1}}   f(x)
subject to   yi(x) + ξi = 0,  i ∈ {1, . . . , m1},
             ξi ≥ 0,  i ∈ {1, . . . , m1}.        (41)

The slack variables should be non-negative because the inequality constraints are less than or equal to zero.

3.2.6. EPIGRAPH FORM

We can convert the optimization problem (25) to its epigraph form:

minimize_{x,t}   t
subject to   f(x) − t ≤ 0,
             yi(x) ≤ 0,  i ∈ {1, . . . , m1},
             hi(x) = 0,  i ∈ {1, . . . , m2},        (42)

because we can minimize an upper-bound t on the objective function rather than minimizing the objective function itself. Likewise, for a maximization problem, we can maximize a lower-bound on the objective function rather than maximizing the objective function. The upper-/lower-bound does not necessarily need to be t; it can be any upper-/lower-bound function for the objective function. This is a useful technique because sometimes optimizing an upper-/lower-bound function is simpler than optimizing the objective function itself.

4. Karush-Kuhn-Tucker (KKT) Conditions
Many optimization algorithms reduce to, and can be explained by, the Karush-Kuhn-Tucker (KKT) conditions. Therefore, the KKT conditions are fundamental requirements for optimization. In this section, we explain these conditions.


4.1. The Lagrangian Function

4.1.1. LAGRANGIAN AND DUAL VARIABLES

Definition 23 (Lagrangian and dual variables). The Lagrangian function for the optimization problem (25) is L : R^d × R^{m1} × R^{m2} → R, with domain D × R^{m1} × R^{m2}, defined as:

L(x,λ,ν) := f(x) + ∑_{i=1}^{m1} λi yi(x) + ∑_{i=1}^{m2} νi hi(x)
          = f(x) + λ>y(x) + ν>h(x),        (43)

where {λi}_{i=1}^{m1} and {νi}_{i=1}^{m2} are the Lagrange multipliers, also called the dual variables, corresponding to the inequality and equality constraints, respectively. Note that λ := [λ1, . . . , λm1]> ∈ R^{m1}, ν := [ν1, . . . , νm2]> ∈ R^{m2}, y(x) := [y1(x), . . . , ym1(x)]> ∈ R^{m1}, and h(x) := [h1(x), . . . , hm2(x)]> ∈ R^{m2}. Eq. (43) is also called the Lagrange relaxation of the optimization problem (25).

4.1.2. SIGN OF TERMS IN LAGRANGIAN

In some papers, the plus sign behind ∑_{i=1}^{m2} νi hi(x) is replaced with a minus sign. As hi(x) is an equality constraint, its sign is not important in the Lagrangian function. However, the sign of the term ∑_{i=1}^{m1} λi yi(x) is important because the sign of an inequality constraint matters. We will discuss the sign of {λi}_{i=1}^{m1} later. Moreover, according to Eq. (26), if the problem (25) is a maximization problem rather than a minimization, the Lagrangian function is L(x,λ,ν) = −f(x) + ∑_{i=1}^{m1} λi yi(x) + ∑_{i=1}^{m2} νi hi(x) instead of Eq. (43).

4.1.3. INTERPRETATION OF LAGRANGIAN

We can interpret the Lagrangian using penalties. As Eq. (25) states, we want to minimize the objective function f(x). We create a cost function consisting of the objective function. The optimization problem has constraints, so its constraints should also be satisfied while minimizing the objective function. Therefore, we penalize the cost function if the constraints are not satisfied. For this, we can add the constraints to the objective function as regularization (or penalty) terms and minimize the regularized cost. The dual variables λ and ν can be seen as the regularization parameters which weight the penalties against the objective function f(x). This regularized cost function is the Lagrangian function or the Lagrangian relaxation of the problem (25). Minimization of the regularized cost function minimizes the function f(x) while trying to satisfy the constraints.

4.1.4. LAGRANGE DUAL FUNCTION

Definition 24 (Lagrange dual function). The Lagrange dual function (also called the dual function) g : R^{m1} × R^{m2} → R is defined as:

g(λ,ν) := inf_{x∈D} L(x,λ,ν)
        = inf_{x∈D} ( f(x) + ∑_{i=1}^{m1} λi yi(x) + ∑_{i=1}^{m2} νi hi(x) ).        (44)

Note that the dual function g is a concave function. We will see later, in Section 4.4, that we maximize this concave function in a so-called dual problem.

4.2. Primal Feasibility
Definition 25 (The optimal point and the optimum). The solution of the optimization problem (25) is the optimal point, denoted by x∗. The minimum function value at this solution, i.e., f∗ := f(x∗), is called the optimum of problem (25).

The optimal point x∗ is one of the feasible points, namely the one which minimizes the function f(.) subject to the constraints in problem (25). Hence, the optimal point is a feasible point and, according to Eq. (27), we have:

yi(x∗) ≤ 0,  ∀i ∈ {1, . . . , m1},        (45)
hi(x∗) = 0,  ∀i ∈ {1, . . . , m2}.        (46)

These are called the primal feasibility conditions.
The optimal point x∗ minimizes the Lagrangian function because the Lagrangian is the relaxation of the optimization problem to an unconstrained problem (see Section 4.1.3). On the other hand, according to Eq. (44), the dual function is the minimum of the Lagrangian w.r.t. x. Hence, we can write the dual function as:

g(λ,ν) (44)= inf_{x∈D} L(x,λ,ν) = L(x∗,λ,ν).        (47)

4.3. Dual Feasibility
Lemma 11 (Dual function as a lower bound). If λ ⪰ 0, then the dual function is a lower bound for f∗, i.e., g(λ,ν) ≤ f∗.

Proof. Let λ ⪰ 0, which means λi ≥ 0, ∀i. Consider a feasible x for problem (25). According to Eq. (27), we have:

L(x,λ,ν) (43)= f(x) + ∑_{i=1}^{m1} λi yi(x) + ∑_{i=1}^{m2} νi hi(x) ≤ f(x),        (48)

because λi ≥ 0 and yi(x) ≤ 0 make every term of the first sum non-positive, while hi(x) = 0 makes the second sum vanish. Therefore, we have:

f(x) (48)≥ L(x,λ,ν) ≥ inf_{x∈D} L(x,λ,ν) (44)= g(λ,ν).

Hence, the dual function is a lower bound for the function value at all feasible points. As the optimal point x∗ is a feasible point, the dual function is a lower bound for f∗. Q.E.D.

Corollary 3 (Nonnegativity of dual variables for inequality constraints). From Lemma 11, we conclude that, for the dual function to be a lower bound for the optimum, the dual variables {λi}_{i=1}^{m1} for the inequality constraints (less than or equal to zero) should be non-negative, i.e.:

λ ⪰ 0  or  λi ≥ 0,  ∀i ∈ {1, . . . , m1}.        (49)

Note that if the inequality constraints are greater than or equal to zero, we should have λi ≤ 0, ∀i, because yi(x) ≥ 0 =⇒ −yi(x) ≤ 0. In this paper, we assume that the inequality constraints are less than or equal to zero. If some of the inequality constraints are greater than or equal to zero, we convert them to less than or equal to zero by multiplying them by −1.

The inequalities in Eq. (49) are called the dual feasibility.

4.4. The Dual Problem, Weak and Strong Duality, and Slater's Condition

According to Lemma 11, the dual function is a lower bound for the optimum, i.e., g(λ,ν) ≤ f∗. We want to find the best lower bound, so we maximize g(λ,ν) w.r.t. the dual variables λ and ν. Moreover, Eq. (49) says that the dual variables for the inequalities must be nonnegative. Hence, we have the following optimization problem:

maximize_{λ,ν}   g(λ,ν)
subject to   λ ⪰ 0.        (50)

The problem (50) is called the Lagrange dual optimization problem for problem (25). The problem (25) is also referred to as the primal optimization problem. The variable of problem (25), i.e., x, is called the primal variable, while the variables of problem (50), i.e., λ and ν, are called the dual variables. Let the solutions of the dual problem be denoted by λ∗ and ν∗. We denote g∗ := g(λ∗,ν∗) = sup_{λ,ν} g.

Definition 26 (Weak and strong duality). For all convex and nonconvex problems, the optimum of the dual problem is a lower bound for the optimum of the primal problem:

g∗ ≤ f∗,  i.e.,  g(λ∗,ν∗) ≤ f(x∗).        (51)

This is called weak duality. For some optimization problems, we have strong duality, which is when the optimum of the dual problem is equal to the optimum of the primal problem:

g∗ = f∗,  i.e.,  g(λ∗,ν∗) = f(x∗).        (52)

Strong duality usually holds for convex optimization problems.

Figure 7. Illustration of weak duality and strong duality.

Figure 8. Progress of iterative optimization: (a) gradual minimization of the primal function and maximization of the dual function and (b) the primal optimum and dual optimum reach each other and become equal if strong duality holds.

Corollary 4. Eqs. (51) and (52) show that the optimum of the dual function, g∗, always provides a lower bound for the optimum of the primal function, f∗.

The primal optimization problem, i.e., Eq. (25), is a minimization, so its cost function is like a bowl, as illustrated in Fig. 7. The dual optimization problem, i.e., Eq. (50), is a maximization, so its cost function is like a reversed bowl, as shown in Fig. 7. The domains of the primal and dual problems are the domain of the primal variable x and the domain of the dual variables λ and ν, respectively. As the figure shows, the optimal x∗ corresponds to the optimal λ∗ and ν∗. As shown in the figure, there is a possible nonnegative gap between the two bowls. In the best case, this gap is zero. If the gap is zero, we have strong duality; otherwise, only weak duality holds.
If the optimization is iterative, the solution is updated iteratively until convergence. First-order and second-order numerical optimization methods, which we will introduce later, are iterative. In optimization, the series of primal solutions and dual solutions converge to the primal optimum and the dual optimum, respectively. The function values converge to the local minimum and the dual function values converge to the optimal (maximum) dual function value. Let the superscript (k) denote the value of a variable at iteration k. We have:

x(0), x(1), x(2), . . . → x∗,
ν(0), ν(1), ν(2), . . . → ν∗,
λ(0), λ(1), λ(2), . . . → λ∗,
f(x(0)) ≥ f(x(1)) ≥ f(x(2)) ≥ · · · ≥ f(x∗),
g(λ(0),ν(0)) ≤ g(λ(1),ν(1)) ≤ · · · ≤ g(λ∗,ν∗).        (53)

Hence, the value of the primal function goes down while the value of the dual function goes up. As Fig. 8 depicts, they reach each other if strong duality holds; otherwise, there will be a gap between them after convergence. Note that if the optimization problem is a convex problem, the eventually found solution is the global solution; otherwise, the solution is local.

Corollary 5. As every iteration of a numerical optimization must satisfy either weak or strong duality, the dual function at every iteration always provides a lower bound for the primal function at that iteration:

g(λ(k),ν(k)) ≤ f(x(k)),  ∀k.        (54)

Lemma 12 (Slater's condition (Slater, 1950)). For a convex optimization problem in the form:

minimize_x   f(x)
subject to   yi(x) ≤ 0,  i ∈ {1, . . . , m1},
             Ax = b,        (55)

we have strong duality if it is strictly feasible, i.e.:

∃x ∈ int(D) :  yi(x) < 0,  ∀i ∈ {1, . . . , m1},  Ax = b.        (56)

In other words, for at least one point in the interior of the domain (not on the boundary of the domain), all the inequality constraints hold strictly. This is called Slater's condition.

4.5. Complementary Slackness
Assume that the problem has strong duality, the primal optimal point is x∗, and the dual optimal variables are λ∗ and ν∗. We have:

f(x∗) (52)= g(λ∗,ν∗)
       (44)= inf_{x∈D} ( f(x) + ∑_{i=1}^{m1} λ∗i yi(x) + ∑_{i=1}^{m2} ν∗i hi(x) )
       (a)= f(x∗) + ∑_{i=1}^{m1} λ∗i yi(x∗) + ∑_{i=1}^{m2} ν∗i hi(x∗)
       (b)= f(x∗) + ∑_{i=1}^{m1} λ∗i yi(x∗)
       (c)≤ f(x∗),        (57)

where (a) is because x∗ is the primal optimal solution of problem (25) and it minimizes the Lagrangian, (b) is because x∗ is a feasible point and satisfies hi(x∗) = 0 in Eq. (27), and (c) is because λ∗i ≥ 0 according to Eq. (49) and the feasible x∗ satisfies yi(x∗) ≤ 0 in Eq. (27), so we have:

λ∗i yi(x∗) ≤ 0,  ∀i ∈ {1, . . . , m1}.        (58)

From Eq. (57), we have:

f(x∗) = f(x∗) + ∑_{i=1}^{m1} λ∗i yi(x∗) ≤ f(x∗)
=⇒  ∑_{i=1}^{m1} λ∗i yi(x∗) = 0  (58)=⇒  λ∗i yi(x∗) = 0, ∀i.

Therefore, the product of every optimal dual variable λ∗i with yi(.) evaluated at the optimal primal solution x∗ must be zero. This is called complementary slackness:

λ∗i yi(x∗) = 0,  ∀i ∈ {1, . . . , m1}.        (59)

These conditions can be restated as:

λ∗i > 0 =⇒ yi(x∗) = 0,        (60)
yi(x∗) < 0 =⇒ λ∗i = 0,        (61)

which means that, for an inequality constraint, if the optimal dual variable is nonzero, its inequality function evaluated at the primal optimum must be zero; and if the inequality function evaluated at the primal optimum is nonzero, its optimal dual variable must be zero.

4.6. Stationarity Condition
As explained before, the Lagrangian function can be interpreted as a regularized cost function to be minimized. Hence, the constrained optimization problem (25) is converted to minimization of the Lagrangian function, Eq. (43), which is an unconstrained optimization problem:

minimize_x   L(x,λ,ν).        (62)

Note that this problem gives the dual function according to Eq. (44). As this is an unconstrained problem, its optimization is easy. We can find its minimum by setting its derivative w.r.t. x, denoted by ∇xL, to zero:

∇xL(x,λ,ν) = 0  (43)=⇒  ∇xf(x) + ∑_{i=1}^{m1} λi ∇xyi(x) + ∑_{i=1}^{m2} νi ∇xhi(x) = 0.        (63)

This equation is called the stationarity condition because it shows that the gradient of the Lagrangian w.r.t. x should vanish (n.b. a stationary point of a function is a point where the derivative of the function is zero). This condition holds for all dual variables and not just for the optimal dual variables. We can claim that the gradient of the Lagrangian w.r.t. x should vanish because the dual function, defined in Eq. (44), should exist.

4.7. KKT Conditions
We derived the primal feasibility, dual feasibility, complementary slackness, and stationarity conditions. These four conditions are called the Karush-Kuhn-Tucker (KKT) conditions (Karush, 1939; Kuhn & Tucker, 1951). The primal optimal variable x∗ and the dual optimal variables λ∗ = [λ∗1, . . . , λ∗m1]> and ν∗ = [ν∗1, . . . , ν∗m2]> must satisfy the KKT conditions. We summarize the KKT conditions in the following:

1. Stationarity condition:

∇xL(x,λ,ν) = ∇xf(x) + ∑_{i=1}^{m1} λi ∇xyi(x) + ∑_{i=1}^{m2} νi ∇xhi(x) = 0.        (64)

2. Primal feasibility:

yi(x∗) ≤ 0,  ∀i ∈ {1, . . . , m1},        (65)
hi(x∗) = 0,  ∀i ∈ {1, . . . , m2}.        (66)

3. Dual feasibility:

λ ⪰ 0  or  λi ≥ 0,  ∀i ∈ {1, . . . , m1}.        (67)

4. Complementary slackness:

λ∗i yi(x∗) = 0,  ∀i ∈ {1, . . . , m1}.        (68)

As listed above, the KKT conditions impose constraints on the optimal dual variables of the inequality constraints because the signs of the inequalities are important.
Recall the dual problem (50). The constraint in this problem is already satisfied by the dual feasibility in the KKT conditions. Hence, we can ignore the constraint of the dual problem (as it is automatically satisfied by dual feasibility):

maximize_{λ,ν}   g(λ,ν),        (69)

which should give us λ∗, ν∗, and g∗ = g(λ∗,ν∗). This is an unconstrained optimization problem and, for solving it, we should set the derivatives of g(λ,ν) w.r.t. λ and ν to zero:

∇λ g(λ,ν) = 0  (47)=⇒  ∇λ L(x∗,λ,ν) = 0,        (70)
∇ν g(λ,ν) = 0  (47)=⇒  ∇ν L(x∗,λ,ν) = 0.        (71)

Note that setting the derivatives of the Lagrangian w.r.t. the dual variables to zero always gives back the corresponding constraints of the primal optimization problem. Eqs. (64), (70), and (71) state that the primal and dual residuals must be zero. Finally, Eqs. (44) and (69) can be summarized into the following max-min optimization problem:

sup_{λ,ν} g(λ,ν)  (44)=  sup_{λ,ν} inf_x L(x,λ,ν) = L(x∗,λ∗,ν∗).        (72)

The reason for the name KKT is as follows (Kjeldsen, 2000). In 1951, Kuhn and Tucker published an important paper proposing the conditions (Kuhn & Tucker, 1951). However, it was later found out that there was a master's thesis by Karush, from 1939, at the University of Chicago, Illinois (Karush, 1939). That thesis had also proposed the conditions; however, researchers including Kuhn and Tucker were not aware of it. Therefore, these conditions were named after all three of them.

4.8. Solving Optimization by the Method of Lagrange Multipliers
We can solve the optimization problem (25) using duality and the KKT conditions. This technique is also called the method of Lagrange multipliers. For this, we should follow these steps:

1. We write the Lagrangian as in Eq. (43).

2. We consider the dual function defined in Eq. (44) and solve:

x† := arg min_x L(x,λ,ν).        (73)

This is an unconstrained problem and, according to Eqs. (44) and (64), we solve it by taking the derivative of the Lagrangian w.r.t. x and setting it to zero, i.e., ∇xL(x,λ,ν) set= 0. This gives us the dual function, according to Eq. (43):

g(λ,ν) = L(x†,λ,ν).        (74)


3. We consider the dual problem, defined in Eq. (50), which is simplified to Eq. (69) because of Eq. (67). This gives us the optimal dual variables λ∗ and ν∗:

(λ∗,ν∗) := arg max_{λ,ν} g(λ,ν).        (75)

This is an unconstrained problem and, according to Eqs. (70) and (71), we solve it by taking the derivatives of the dual function w.r.t. λ and ν and setting them to zero, i.e., ∇λg(λ,ν) set= 0 and ∇νg(λ,ν) set= 0. The optimum dual value is obtained as:

g∗ = max_{λ,ν} g(λ,ν) = g(λ∗,ν∗).        (76)

4. We put the optimal dual variables λ∗ and ν∗ into Eq. (64) to find the optimal primal variable:

x∗ := arg min_x L(x,λ∗,ν∗).        (77)

This is an unconstrained problem and we solve it by taking the derivative of the Lagrangian at the optimal dual variables w.r.t. x and setting it to zero, i.e., ∇xL(x,λ∗,ν∗) set= 0. The optimum primal value is obtained as:

f∗ = min_x L(x,λ∗,ν∗) = L(x∗,λ∗,ν∗).        (78)
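To make these steps concrete, the following sketch carries them out symbolically with SymPy on a toy problem that is not from the paper: minimize x1² + x2² subject to the single equality constraint x1 + x2 = 1 (no inequality constraints, so the only dual variable is ν).

```python
import sympy as sp

# Toy problem (illustrative assumption): minimize x1^2 + x2^2 s.t. x1 + x2 - 1 = 0.
x1, x2, nu = sp.symbols('x1 x2 nu', real=True)
f = x1**2 + x2**2
h = x1 + x2 - 1

# Step 1: Lagrangian, Eq. (43) with no inequality constraints.
L = f + nu * h

# Step 2: stationarity in x gives x†(nu) and the dual function g(nu).
stationary = sp.solve([sp.diff(L, x1), sp.diff(L, x2)], [x1, x2])
g = sp.simplify(L.subs(stationary))          # concave in nu

# Step 3: maximize the dual function: dg/dnu = 0.
nu_star = sp.solve(sp.diff(g, nu), nu)[0]

# Step 4: plug nu* back into the stationarity solution to get x*.
x_star = {v: expr.subs(nu, nu_star) for v, expr in stationary.items()}
print(nu_star, x_star)                       # nu* = -1, x* = {x1: 1/2, x2: 1/2}
```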

5. First-Order Optimization: Gradient Methods

5.1. Gradient Descent
Gradient descent is one of the fundamental first-order methods. It was first suggested by Cauchy in 1847 (Lemarechal, 2012) and by Hadamard in 1908 (Hadamard, 1908), and its convergence was later analyzed in (Curry, 1944). In the following, we introduce this method.

5.1.1. STEP OF UPDATE

Consider the unconstrained optimization problem (24). Here, we denote x∗ := arg min_x f(x) and f∗ := min_x f(x) = f(x∗). In numerical optimization for unconstrained problems, we start with a random feasible initial point and iteratively update it by a step ∆x:

x(k+1) := x(k) + ∆x,        (79)

until we converge to (or get sufficiently close to) the desired optimal point x∗. Note that the step ∆x is also denoted by p in the literature, i.e., p := ∆x. Let the function f(.) be differentiable and its gradient be L-smooth. If we set x = x(k) and y = x(k+1) = x(k) + ∆x in Eq. (13), we have:

f(x(k) + ∆x) ≤ f(x(k)) + ∇f(x(k))>∆x + (L/2)‖∆x‖₂²
=⇒  f(x(k) + ∆x) − f(x(k)) ≤ ∇f(x(k))>∆x + (L/2)‖∆x‖₂².        (80)

Until reaching the minimum, we want to decrease the cost function f(.) in every iteration; hence, we desire:

f(x(k) + ∆x) − f(x(k)) < 0.        (81)

According to Eq. (80), one way to achieve Eq. (81) is:

∇f(x(k))>∆x + (L/2)‖∆x‖₂² < 0.

Hence, we should minimize ∇f(x(k))>∆x + (L/2)‖∆x‖₂² w.r.t. ∆x:

minimize_{∆x}   ∇f(x(k))>∆x + (L/2)‖∆x‖₂².        (82)

This function is convex w.r.t. ∆x and we can optimize it by setting its derivative to zero:

∂/∂∆x (∇f(x(k))>∆x + (L/2)‖∆x‖₂²) = ∇f(x(k)) + L∆x  set= 0  =⇒  ∆x = −(1/L)∇f(x(k)).        (83)

Using Eq. (83) in Eq. (80) gives:

f(x(k) + ∆x) − f(x(k)) ≤ −(1/(2L))‖∇f(x(k))‖₂² ≤ 0,

which satisfies Eq. (81). Eq. (83) means that it is better to move toward a scaled negative gradient for updating the solution. This inspires the name of the algorithm, which is gradient descent.
The problem is that often we either do not know the Lipschitz constant L or it is hard to compute. Hence, rather than Eq. (83), we use:

∆x = −η∇f(x(k)),  i.e.,  x(k+1) := x(k) − η∇f(x(k)),        (84)

where η > 0 is the step size, also called the learning rate in the data science literature. Note that if the optimization problem is a maximization rather than a minimization, the step should be ∆x = η∇f(x(k)) rather than Eq. (84). In that case, the method is called gradient ascent.
Using Eq. (84) in Eq. (80) gives:

f(x(k) + ∆x) − f(x(k)) ≤ −η‖∇f(x(k))‖₂² + (L/2)η²‖∇f(x(k))‖₂²
                        = η((L/2)η − 1)‖∇f(x(k))‖₂².        (85)

If x(k) is not a stationary point, we have ‖∇f(x(k))‖₂² > 0. Noticing η > 0, for satisfying Eq. (81), we must set:

(L/2)η − 1 < 0  =⇒  η < 2/L.        (86)


On the other hand, we can minimize Eq. (85) by setting its derivative w.r.t. η to zero:

∂/∂η (−η‖∇f(x(k))‖₂² + (L/2)η²‖∇f(x(k))‖₂²) = −‖∇f(x(k))‖₂² + Lη‖∇f(x(k))‖₂²
= (−1 + Lη)‖∇f(x(k))‖₂²  set= 0  =⇒  η = 1/L.

If we set:

η < 1/L,        (87)

then Eq. (85) becomes:

f(x(k) + ∆x) − f(x(k)) ≤ −(1/L)‖∇f(x(k))‖₂² + (1/(2L))‖∇f(x(k))‖₂² = −(1/(2L))‖∇f(x(k))‖₂² < 0
=⇒  f(x(k+1)) ≤ f(x(k)) − (1/(2L))‖∇f(x(k))‖₂².        (88)

Eq. (87) means that there should be an upper-bound, dependent on the Lipschitz constant, on the step size. Hence, L is still required. Eq. (88) shows that every iteration of gradient descent decreases the cost function:

f(x(k+1)) ≤ f(x(k)),        (89)

and the amount of this decrease depends on the norm of the gradient at that iteration. In conclusion, the series of solutions converges to the optimal solution while the function value decreases iteratively until reaching the local minimum:

x(0), x(1), x(2), . . . → x∗,
f(x(0)) ≥ f(x(1)) ≥ f(x(2)) ≥ · · · ≥ f(x∗).

If the optimization problem is a convex problem, the solution is the global solution; otherwise, the solution is local.

5.1.2. LINE-SEARCH

As was shown in Section 5.1.1, the step size of gradient descent requires knowledge of the Lipschitz constant for the smoothness of the gradient. Hence, we can find a suitable step size η by a search, which is named line-search. In the line-search of every optimization iteration, we start with η = 1 and halve it, η ← η/2, if it does not satisfy Eq. (81) with step ∆x = −η∇f(x(k)):

f(x(k) − η∇f(x(k))) < f(x(k)).        (90)

This halving of the step size is repeated until this equation is satisfied, i.e., until we have a decrease in the objective function. Note that this decrease will happen when the step size becomes small enough to satisfy Eq. (87). The algorithm of gradient descent with line-search is shown in Algorithm 1. As this algorithm shows, line-search has its own internal iterations inside every iteration of gradient descent.

1   Initialize x(0)
2   for iteration k = 0, 1, . . . do
3       Initialize η := 1
4       for iteration τ = 1, 2, . . . do
5           Check Eq. (90) or (92)
6           if not satisfied then
7               η ← η/2
8           else
9               x(k+1) := x(k) − η∇f(x(k))
10              break the loop
11      Check the convergence criterion
12      if converged then
13          return x(k+1)

Algorithm 1: Gradient descent with line-search

Lemma 13 (Time complexity of line-search). In the worst case, line-search takes (log L / log 2) iterations until Eq. (90) is satisfied.

Proof. Proof is available in Appendix B.1.

5.1.3. BACKTRACKING LINE-SEARCH

A more sophisticated line-search method is the Armijo line-search (Armijo, 1966), also called backtracking line-search. Rather than Eq. (90), it checks whether the cost function is sufficiently decreased:

f(x(k) + p) ≤ f(x(k)) + c p>∇f(x(k)),        (91)

where c ∈ (0, 0.5] is the parameter of the Armijo line-search and p = ∆x is the search direction for the update. The value of c should be small, e.g., c = 10−4 (Nocedal & Wright, 2006). This condition is called the Armijo condition or the Armijo-Goldstein condition. In gradient descent, the search direction is p = ∆x = −η∇f(x(k)) according to Eq. (84). Hence, for gradient descent, it checks:

f(x(k) − η∇f(x(k))) ≤ f(x(k)) − c η ‖∇f(x(k))‖₂².        (92)

The algorithm of gradient descent with Armijo line-search is shown in Algorithm 1. Note that we can have a more sophisticated line-search with the Wolfe conditions (Wolfe, 1969). This will be introduced in Section 7.5.
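As a sketch of Algorithm 1, the following Python code runs gradient descent with the halving line-search of Eq. (90) (the Armijo check of Eq. (92) could be substituted in the same place); the quadratic test function and the tolerance are assumptions for illustration.

```python
import numpy as np

def gradient_descent_line_search(f, grad, x0, max_iter=1000, tol=1e-8):
    """Gradient descent where each iteration halves eta until Eq. (90) holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:          # convergence criterion: small gradient
            break
        eta = 1.0
        while f(x - eta * g) >= f(x):         # Eq. (90): require a decrease
            eta *= 0.5                        # halve the step size
        x = x - eta * g                       # Eq. (84)
    return x

# Hypothetical test problem: f(x) = 0.5 x^T A x - b^T x with A positive definite.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x_star = gradient_descent_line_search(f, grad, x0=[0.0, 0.0])
print(x_star, np.linalg.solve(A, b))          # the two should be close
```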

5.1.4. CONVERGENCE CRITERION

For all numerical optimization methods, including gradient descent, there exist several convergence criteria for stopping the updates of the solution and terminating the optimization. Some of them are:

• Small norm of the gradient: ‖∇f(x(k+1))‖2 ≤ ε, where ε is a small positive number. The reason for this criterion is the first-order optimality condition (see Lemma 9).

• Small change of the cost function: |f(x(k+1)) − f(x(k))| ≤ ε.

• Small change of the gradient: ‖∇f(x(k+1)) − ∇f(x(k))‖ ≤ ε.

• Reaching the maximum desired number of iterations, denoted by max_k: the optimization is terminated when k + 1 ≥ max_k.

5.1.5. CONVERGENCE ANALYSIS FOR GRADIENT DESCENT

We showed in Eq. (88) that the cost function value is decreased by the gradient descent iterations. The following theorem provides the convergence rate of gradient descent.

Theorem 1 (Convergence rate and iteration complexity of gradient descent). Consider a differentiable function f(.), with domain D, whose gradient is L-smooth (see Eq. (12)). Starting from the initial point x(0), after t iterations of gradient descent, we have:

min_{0≤k≤t} ‖∇f(x(k))‖₂² ≤ 2L(f(x(0)) − f∗) / (t + 1),        (93)

where f∗ is the minimum of the cost function. In other words, after t iterations, we have:

∃x(k) :  ‖∇f(x(k))‖₂² = O(1/t),        (94)

which means the squared norm of the gradient has sublinear convergence (see Definition 20) in gradient descent. Moreover, after:

t ≥ 2L(f(x(0)) − f∗)/ε − 1,        (95)

iterations, gradient descent is guaranteed to satisfy ‖∇f(x(k))‖₂² ≤ ε. Hence, the iteration complexity of gradient descent is O(1/ε).

Proof. Proof is available in Appendix B.2.

The above theorem provides the convergence rate of gradient descent for a general function. If the function is convex, we can simplify this convergence rate further, as stated in the following.

Theorem 2 (Convergence rate of gradient descent for convex functions). Consider a convex and differentiable function f(.), with domain D, whose gradient is L-smooth (see Eq. (12)). Starting from the initial point x(0), after t iterations of gradient descent, we have:

f(x(t+1)) − f∗ ≤ 2L‖x(0) − x∗‖₂² / (t + 1),        (96)

where f∗ is the minimum of the cost function and x∗ is the minimizer. In other words, after t iterations, we have:

f(x(t)) − f∗ = O(1/t),        (97)

which means the distance of the convex function value to its optimum has sublinear convergence (see Definition 20) in gradient descent. The iteration complexity is the same as Eq. (95).

Proof. Proof is available in Appendix B.3.

Theorem 3 (Convergence rate of gradient descent for strongly convex functions). Consider a µ-strongly convex and differentiable function f(.), with domain D, whose gradient is L-smooth (see Eq. (12)). Starting from the initial point x(0), after t iterations, the convergence rate and iteration complexity of gradient descent are:

f(x(t)) − f∗ ≤ (1 − µ/L)^t (f(x(0)) − f∗)  =⇒  f(x(t)) − f∗ = O((1 − µ/L)^t),        (98)

t = O(log(1/ε)),        (99)

respectively, where f∗ is the minimum of the cost function. It means that gradient descent has a linear convergence rate (see Definition 20) for strongly convex functions.

Note that some convergence proofs and analyses for gradient descent can be found in (Gower, 2018).

5.1.6. GRADIENT DESCENT WITH MOMENTUM

Gradient descent and other first-order methods can have a momentum term. Momentum, proposed in (Rumelhart et al., 1986), makes the change of the solution ∆x a little similar to the previous change of the solution. Hence, the change adds a history of the previous change to Eq. (84):

(∆x)(k) := α(∆x)(k−1) − η(k)∇f(x(k)),        (100)

where α > 0 is the momentum parameter which weights the importance of the history compared to the descent direction. We use this (∆x)(k) in Eq. (79) for updating the solution. Because of its faithfulness to the track of previous updates, momentum reduces the amount of oscillation of the updates in gradient descent optimization.

5.1.7. STEEPEST DESCENT

Steepest descent is similar to gradient descent, but there is a difference between them. In steepest descent, we move toward the negative gradient as much as possible to reach the smallest function value that can be achieved at every iteration. Hence, the step size at iteration k of steepest descent is calculated as (Chong & Zak, 2004):

η(k) := arg min_η f( x(k) − η∇f(x(k)) ),        (101)


Figure 9. Neurons in three layers of a neural network.

and then the solution is updated using Eq. (84) as in gradient descent.
Another interpretation of steepest descent is as follows, according to (Boyd & Vandenberghe, 2004, Chapter 9.4). The first-order Taylor expansion of the function is f(x + v) ≈ f(x) + ∇f(x)>v. Hence, the step in the normalized steepest descent, at iteration k, is obtained as:

∆x = arg min_v {∇f(x(k))>v  |  ‖v‖2 ≤ 1},        (102)

which is used in Eq. (79) for updating the solution.

5.1.8. BACKPROPAGATION

Backpropagation (Rumelhart et al., 1986) is the most well-known optimization method used in neural networks. It is actually gradient descent with the chain rule in derivatives, because of having layers of parameters. Consider Fig. 9, which shows three neurons in three layers of a network. Let xji denote the weight connecting neuron i to neuron j. Let ai and zi be the output of neuron i before and after applying its activation function σi(.) : R → R, respectively. In other words, zi := σi(ai).
According to the structure of the neural network, we have ai = ∑_ℓ xiℓ zℓ, which sums over the neurons of layer ℓ. By the chain rule, the gradient of the error e w.r.t. the weight between neurons ℓ and i is:

∂e/∂xiℓ = (∂e/∂ai) × (∂ai/∂xiℓ)  (a)=  δi × zℓ,        (103)

where (a) is because ai = ∑_ℓ xiℓ zℓ and we define δi := ∂e/∂ai. If layer i is the last layer, δi can be computed by the derivative of the error (loss function) w.r.t. the output. However, if i is in one of the hidden layers, δi is computed by the chain rule as:

δi = ∂e/∂ai = ∑_j ( (∂e/∂aj) × (∂aj/∂ai) ) = ∑_j ( δj × (∂aj/∂ai) ).        (104)

The term ∂aj/∂ai is calculated by the chain rule as:

∂aj/∂ai = (∂aj/∂zi) × (∂zi/∂ai)  (a)=  xji σ′(ai),        (105)

where (a) is because aj = ∑_i xji zi and zi = σ(ai), and σ′(.) denotes the derivative of the activation function. Putting Eq. (105) in Eq. (104) gives:

δi = σ′(ai) ∑_j (δj xji).

Putting this equation in Eq. (103) gives:

∂e/∂xiℓ = zℓ σ′(ai) ∑_j (δj xji).        (106)

Backpropagation uses the gradient in Eq. (106) for updating the weights xiℓ, ∀i, ℓ, by gradient descent:

xiℓ(k+1) := xiℓ(k) − η(k) ∂e/∂xiℓ.

This tunes the weights from the last layer to the first layer in every iteration of optimization.

5.2. Accelerated Gradient Method
It was shown in the literature that gradient descent is not optimal in convergence rate and can be improved. It was at that time that Nesterov proposed the Accelerated Gradient Method (AGM) (Nesterov, 1983) to make the convergence rate of gradient descent optimal (Nesterov, 2003, Chapter 2.2). AGM is also called Nesterov's accelerated gradient method or the Fast Gradient Method (FGM). A series of Nesterov's papers improved AGM (Nesterov, 1983; 1988; 2005; 2013).
Consider a sequence γ(k) which satisfies:

∏_{i=0}^{k} (1 − γ(i)) ≥ (γ(k))²,  ∀k ≥ 0,  γ(k) ∈ [0, 1].        (107)

An example sequence satisfying this condition is γ(0) = γ(1) = γ(2) = γ(3) = 0 and γ(k) = 2/k, ∀k ≥ 4. AGM updates the solution iteratively as (Nesterov, 1983):

x(k+1) := y(k) − η(k)∇f(y(k)), (108)

y(k+1) := (1− γ(k))x(k+1) + γ(k)x(k), (109)

until convergence.

Theorem 4 (Convergence rate of AGM for convex functions (Nesterov, 1983, under Eq. 7), (Bubeck, 2014, Theorem 3.19)). Consider a convex and differentiable function f(.), with domain D, whose gradient is L-smooth (see Eq. (12)). Starting from the initial point x(0), after t iterations of AGM, we have:

f(x(t+1)) − f∗ ≤ 2L‖x(0) − x∗‖₂² / (t + 1)² = O(1/t²),        (110)

where f∗ is the minimum of the cost function and x∗ is the minimizer. It means the distance of the convex function value to its optimum has sublinear convergence (see Definition 20) in AGM.

Comparing Eqs. (96) and (110) shows that AGM converges much faster than gradient descent. A book chapter on AGM is (Bubeck, 2014, Section 3.7). Various versions of AGM have been unified in (Tseng, 2008). Moreover, the connection of AGM with ordinary differential equations has been investigated in (Su et al., 2016).
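The following sketch implements the updates (108)-(109) with the example sequence γ(k) given above; the fixed step size η = 1/L and the quadratic test function are assumptions for illustration.

```python
import numpy as np

def accelerated_gradient(grad, x0, L, gamma, max_iter=500):
    """Nesterov's AGM, Eqs. (108)-(109), with step size eta = 1/L."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    for k in range(max_iter):
        x_new = y - (1.0 / L) * grad(y)            # Eq. (108)
        y = (1 - gamma(k)) * x_new + gamma(k) * x  # Eq. (109)
        x = x_new
    return x

# Example sequence satisfying Eq. (107): gamma(k) = 0 for k <= 3, 2/k afterwards.
gamma = lambda k: 0.0 if k <= 3 else 2.0 / k

# Hypothetical smooth convex test problem: f(x) = 0.5 ||Ax - b||^2.
A = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)                     # Lipschitz constant of the gradient

print(accelerated_gradient(grad, [0.0, 0.0], L, gamma))
print(np.linalg.lstsq(A, b, rcond=None)[0])        # reference solution
```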

5.3. Stochastic Gradient Methods
5.3.1. STOCHASTIC GRADIENT DESCENT

Assume we have a dataset of n data points, {ai ∈ R^d}_{i=1}^{n}, and their labels {li ∈ R}_{i=1}^{n}. Let the cost function f(.) be decomposed into a summation of n terms {fi(x)}_{i=1}^{n}. Some well-known examples for the cost function terms are:

• Least squares error: fi(x) = 0.5 (ai>x − li)²,
• Absolute error: fi(x) = |ai>x − li|,
• Hinge loss (for li ∈ {−1, 1}): fi(x) = max(0, 1 − li ai>x),
• Logistic loss (for li ∈ {−1, 1}): fi(x) = log(1 + exp(−li ai>x)).

The optimization problem (24) becomes:

minimize_x   (1/n) ∑_{i=1}^{n} fi(x).        (111)

In this case, the full gradient is the average gradient, i.e.:

∇f(x) = (1/n) ∑_{i=1}^{n} ∇fi(x),        (112)

so Eq. (83) becomes ∆x = −(1/(Ln)) ∑_{i=1}^{n} ∇fi(x(k)). This is what gradient descent uses in Eq. (79) for updating the solution at every iteration. However, calculation of this full gradient is time-consuming and inefficient for large values of n, especially as it needs to be recalculated at every iteration. Stochastic Gradient Descent (SGD), also called the stochastic gradient method, approximates gradient descent stochastically and samples (i.e., bootstraps) one of the points at every iteration for updating the solution. Hence, it uses:

x(k+1) := x(k) − η(k)∇fi(x(k)), (113)

rather than Eq. (84). The idea of stochastic approximation was first proposed in (Robbins & Monro, 1951). It was first used for machine learning in (Bottou et al., 1998).
As Eq. (113) states, SGD often uses an adaptive step size which changes in every iteration. The step size can be decreasing because in the initial iterations, where we are far away from the optimal solution, the step size can be large; however, it should be small in the last iterations, which are supposed to be close to the optimal solution. Some well-known adaptations of the step size are:

η(k) := 1/k,    η(k) := 1/√k,    η(k) := η.        (114)

Theorem 5 (Convergence rates for SGD). Consider a function f(x) = ∑_{i=1}^{n} fi(x) which is bounded below and where each fi is differentiable. Let the domain of the function f(.) be D and its gradient be L-smooth (see Eq. (12)). Assume E[‖∇fi(x(k))‖₂² | x(k)] ≤ β², where β is a constant. Depending on the step size, the convergence rate of SGD is:

O(1/log t)     if η(k) = 1/k,        (115)
O(log t/√t)    if η(k) = 1/√k,        (116)
O(1/t + η)     if η(k) = η,        (117)

where t denotes the iteration index. If the functions fi are µ-strongly convex, then the convergence rate of SGD is:

O(1/t)                 if η(k) = 1/(µk),        (118)
O((1 − µ/L)^t + η)     if η(k) = η.        (119)

Eqs. (117) and (119) show that, with a fixed step size η, SGD converges sublinearly for a non-convex function and linearly for a strongly convex function (see Definition 20) in the initial iterations. However, in the late iterations, it stagnates in a neighborhood around the optimal point and never reaches it. Hence, SGD has lower accuracy than gradient descent. The advantage of SGD over gradient descent is that each of its iterations is much faster than an iteration of gradient descent because of fewer computations for the gradient. This faster pace of every iteration shows off more when n is huge. In summary, SGD has fast convergence to a low-accuracy optimal point.
It is noteworthy that the full gradient is not available in SGD to use for checking convergence, as discussed in Section 5.1.4. One can use the other criteria in that section or merely check the norm of the gradient for the sampled point. Moreover, note that SGD can be used with the line-search methods, too. SGD can also use a momentum term (see Section 5.1.6).
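As a sketch, the following SGD loop implements Eq. (113) with the decaying step size η(k) = 1/√k from Eq. (114) for the least squares terms listed above; the synthetic data are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least squares data (illustrative assumption): l_i = a_i^T x_true + noise.
n, d = 1000, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
l = A @ x_true + 0.01 * rng.normal(size=n)

def grad_fi(x, i):
    """Gradient of the least squares term fi(x) = 0.5 (a_i^T x - l_i)^2."""
    return (A[i] @ x - l[i]) * A[i]

x = np.zeros(d)
for k in range(1, 20001):
    i = rng.integers(n)              # sample one point uniformly at random
    eta = 1.0 / np.sqrt(k)           # decaying step size, Eq. (114)
    x = x - eta * grad_fi(x, i)      # Eq. (113)

print(np.linalg.norm(x - x_true))    # small: SGD reaches a neighborhood of the optimum
```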

5.3.2. MINI-BATCH STOCHASTIC GRADIENT DESCENT

Gradient descent uses the entire set of n data points and SGD uses one randomly sampled point at every iteration. For large datasets, gradient descent is very slow and intractable in every iteration, while SGD will need a significant number of iterations to roughly cover all data. Besides, SGD has low accuracy in convergence to the optimal point. We can have a middle case where we use a batch of b randomly sampled points at every iteration. This method is named mini-batch SGD or the hybrid deterministic-stochastic gradient method. This batch-wise approach is wise for large datasets because of the mentioned problems that gradient descent and SGD face in big data optimization (Bottou et al., 2018).
Usually, before the start of optimization, the n data points are randomly divided into ⌊n/b⌋ batches of size b. This is equivalent to simple random sampling for sampling points into batches without replacement. We denote the dataset by D (where |D| = n) and the i-th batch by Bi (where |Bi| = b). The batches are disjoint:

⋃_{i=1}^{⌊n/b⌋} Bi = D,        (120)
Bi ∩ Bj = ∅,  ∀i, j ∈ {1, . . . , ⌊n/b⌋},  i ≠ j.        (121)

Another less-used approach for making batches is to sample points for a batch during optimization. This is equivalent to bootstrapping for sampling points into batches with replacement. In this case, the batches are not disjoint anymore and Eqs. (120) and (121) do not hold.

Definition 27 (Epoch). In mini-batch SGD, when all ⌊n/b⌋ batches of data have been used for optimization once, an epoch is completed. After completion of an epoch, the next epoch is started, and epochs are repeated until convergence of the optimization.

In mini-batch SGD, if the k-th iteration of optimization is using the k′-th batch, the update of the solution is done as:

x(k+1) := x(k) − η(k) (1/b) ∑_{i∈Bk′} ∇fi(x(k)).        (122)

The scale factor 1/b is sometimes dropped for simplicity. Mini-batch SGD is used extensively in machine learning, especially in neural networks (Bottou et al., 1998; Goodfellow et al., 2016). Because of dividing the data into batches, mini-batch SGD can be run on parallel servers as a distributed optimization method.

Theorem 6 (Convergence rates for mini-batch SGD). Consider a function f(x) = ∑_{i=1}^{n} fi(x) which is bounded below and where each fi is differentiable. Let the domain of the function f(.) be D and its gradient be L-smooth (see Eq. (12)), and assume η(k) = η is fixed. The batch-wise gradient is an approximation of the full gradient with some error et at the t-th iteration:

(1/b) ∑_{i∈Bt′} ∇fi(x(t)) = ∇f(x(t)) + et.        (123)

The convergence rate of mini-batch SGD for non-convex and convex functions is:

O(1/t + ‖et‖₂²),        (124)

where t denotes the iteration index. If the functions fi are µ-strongly convex, then the convergence rate of mini-batch SGD is:

O((1 − µ/L)^t + ‖et‖₂²).        (125)

According to Eq. (123), the expected error of mini-batch SGD at the t-th iteration is:

E[‖et‖₂²] = E[ ‖∇f(x(t)) − (1/b) ∑_{i∈Bt′} ∇fi(x(t))‖₂² ],        (126)

which is the variance of the estimation. If we sample the batches without replacement (i.e., sampling batches by simple random sampling before the start of optimization) or with replacement (i.e., bootstrapping during optimization), the expected error is (Ghojogh et al., 2020, Proposition 3):

E[‖et‖₂²] = (1 − b/n) σ²/b,        (127)
E[‖et‖₂²] = σ²/b,        (128)

respectively, where σ² is the variance of the whole dataset. According to Eqs. (127) and (128), the accuracy of mini-batch SGD with sampling without and with replacement increases as b → n and b → ∞, respectively. However, this increase makes every iteration slower, so there is a trade-off between accuracy and speed. Also, comparing Eqs. (124) and (125) with Eqs. (94) and (98), while noticing Eqs. (127) and (128), shows that the convergence rate of mini-batch SGD gets closer to that of gradient descent as the batch size increases.
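The following sketch implements Eq. (122) with disjoint batches formed once before optimization, as in Eqs. (120)-(121), and loops over epochs (Definition 27); the synthetic least squares data, batch size, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least squares data.
n, d, b = 1000, 5, 32
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
l = A @ x_true + 0.01 * rng.normal(size=n)

def batch_gradient(x, idx):
    """(1/b) times the sum of per-sample least squares gradients over one batch, Eq. (122)."""
    Ab, lb = A[idx], l[idx]
    return Ab.T @ (Ab @ x - lb) / len(idx)

# Split shuffled indices into floor(n/b) disjoint batches, Eqs. (120)-(121).
perm = rng.permutation(n)
batches = [perm[i * b:(i + 1) * b] for i in range(n // b)]

x = np.zeros(d)
eta = 0.05                                   # fixed step size (assumption)
for epoch in range(30):                      # one epoch = one pass over all batches
    for idx in batches:
        x = x - eta * batch_gradient(x, idx)

print(np.linalg.norm(x - x_true))
```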

5.4. Stochastic Average Gradient Methods
5.4.1. STOCHASTIC AVERAGE GRADIENT

SGD is faster than gradient descent, but its problem is its lower accuracy compared to gradient descent. Stochastic Average Gradient (SAG) (Roux et al., 2012) keeps a trade-off between accuracy and speed. Consider the optimization problem (111). Let ∇fi(x(k)) be the gradient of fi(.), evaluated at the point x(k), at iteration k. According to Eqs. (84) and (112), gradient descent updates the solution as:

x(k+1) := x(k) − (η(k)/n) ∑_{i=1}^{n} ∇fi(x(k)).

SAG randomly samples one of the points and updates its gradient among the gradient terms. If the sampled point is the j-th one, we have:

x(k+1) := x(k) − (η(k)/n) ( ∇fj(x(k)) − ∇fj(x(k−1)) + ∑_{i=1}^{n} ∇fi(x(k−1)) ).        (129)

In other words, we subtract the j-th gradient from the summation of all n gradients of the previous iteration (k − 1), i.e., ∑_{i=1}^{n} ∇fi(x(k−1)) − ∇fj(x(k−1)); then, we add back the new j-th gradient of this iteration by adding ∇fj(x(k)).

Theorem 7 (Convergence rate for SAG (Roux et al., 2012, Proposition 1)). Consider a function f(x) = ∑_{i=1}^{n} fi(x) which is bounded below and where each fi is differentiable. Let the domain of the function f(.) be D and its gradient be L-smooth (see Eq. (12)). The convergence rate of SAG is O(1/t), where t denotes the iteration index.

Comparing the convergence rates of SAG, gradient descent, and SGD shows that SAG has the same rate order as gradient descent, although it usually needs some more iterations to converge. Practical experiments have shown that SAG requires a lot of parameter fine-tuning to perform well. Some other variants of SAG are the optimization of a finite sum of smooth convex functions (Schmidt et al., 2017) and its second-order version named Stochastic Average Newton (SAN) (Chen et al., 2021).
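A sketch in the spirit of the SAG update (129): a table of the most recently computed per-sample gradients is kept, and at each iteration only the sampled entry is refreshed before taking an averaged step; the least squares data and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least squares data.
n, d = 200, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
l = A @ x_true + 0.01 * rng.normal(size=n)

grad_fi = lambda x, i: (A[i] @ x - l[i]) * A[i]

x = np.zeros(d)
G = np.zeros((n, d))        # memory of the last gradient computed for each sample
G_sum = G.sum(axis=0)       # running sum of the stored gradients
eta = 0.1

for k in range(20000):
    j = rng.integers(n)                 # sample one point
    g_new = grad_fi(x, j)
    G_sum += g_new - G[j]               # replace the old j-th gradient in the sum
    G[j] = g_new
    x = x - (eta / n) * G_sum           # averaged step, in the spirit of Eq. (129)

print(np.linalg.norm(x - x_true))
```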

5.4.2. STOCHASTIC VARIANCE REDUCED GRADIENT

Another effective first-order method is the Stochastic Variance Reduced Gradient (SVRG) (Johnson & Zhang, 2013), which updates the solution according to Algorithm 2. As this algorithm shows, the update of the solution is similar to SAG (see Eq. (129)), but in every outer iteration, it updates the solution m times. SVRG is an efficient method and its convergence rate is similar to that of SAG. It is shown in (Johnson & Zhang, 2013) that both SAG and SVRG reduce the variance of the solution in optimization. Recently, SVRG has been used for semidefinite programming optimization (Zeng et al., 2021).

1   Initialize x(0)
2   for iteration k = 1, 2, . . . do
3       x̃ := x(k−1)
4       ∇f(x̃) := (1/n) ∑_{i=1}^{n} ∇fi(x̃)   (see Eq. (112))
5       x(0) := x̃
6       for iteration τ = 0, 1, . . . , m − 1 do
7           Randomly sample j from {1, . . . , n}
8           x(τ+1) := x(τ) − η(τ) (∇fj(x(τ)) − ∇fj(x̃) + ∇f(x̃))
9       x(k) := x(m)

Algorithm 2: The SVRG algorithm

5.4.3. ADAPTING LEARNING RATE WITH ADAGRAD, RMSPROP, AND ADAM

Consider the optimization problem (111). We can adapt the learning rate in stochastic gradient methods. In the following, we introduce the three most well-known methods for adapting the learning rate, which are AdaGrad, RMSProp, and Adam.

– Adaptive Gradient (AdaGrad): The AdaGrad method (Duchi et al., 2011) updates the solution iteratively as:

x(k+1) := x(k) − η(k)G−1∇fi(x(k)), (130)

where G is a (d × d) diagonal matrix whose (j, j)-th element is:

G(j, j) := √( ε + ∑_{τ=0}^{k} (∇j fiτ(x(τ)))² ),        (131)

where ε ≥ 0 is for stability (making G full rank), iτ is the randomly sampled point (from {1, . . . , n}) at iteration τ, and ∇j fiτ(.) is the partial derivative of fiτ(.) w.r.t. the j-th element of its d-dimensional input. Putting Eq. (131) in Eq. (130) simplifies AdaGrad to:

xj(k+1) := xj(k) − ( η(k) / √( ε + ∑_{τ=0}^{k} (∇j fiτ(x(τ)))² ) ) ∇j fik(x(k)).        (132)

AdaGrad keeps a history of the gradients of the sampled points and uses them in the update. During the iterations so far, if a dimension has changed significantly, it dampens the learning rate for that dimension (see the inverse in Eq. (130)); hence, it gives more weight to changing the dimensions which have not changed noticeably.

– Root Mean Square Propagation (RMSProp): RMSProp was first proposed in (Tieleman & Hinton, 2012), which is unpublished. It is an improved version of Rprop (resilient backpropagation) (Riedmiller & Braun, 1992), which uses the sign of the gradient in optimization. Inspired by the momentum in Eq. (100), RMSProp updates a scalar variable v as (Hinton et al., 2012):

v(k+1) := γ v(k) + (1 − γ)‖∇fi(x(k))‖₂²,        (133)

where γ ∈ [0, 1] is the forgetting factor (e.g., γ = 0.9). Then, it uses this v to weight the learning rate:

x(k+1) := x(k) − ( η(k) / √(ε + v(k+1)) ) ∇fi(x(k)),        (134)


where ε ≥ 0 is for stability to avoid division by zero. Comparing Eqs. (132) and (134) shows that RMSProp has a similar form to AdaGrad.

– Adaptive Moment Estimation (Adam): The Adam optimizer (Kingma & Ba, 2014) improves over RMSProp by adding a momentum term (see Section 5.1.6). It updates the scalar v and the vector m ∈ R^d as:

m(k+1) := γ1 m(k) + (1 − γ1)∇fi(x(k)),        (135)
v(k+1) := γ2 v(k) + (1 − γ2)‖∇fi(x(k))‖₂²,        (136)

where γ1, γ2 ∈ [0, 1]. It normalizes these variables as:

m̂(k+1) := (1/(1 − γ1^k)) m(k+1),    v̂(k+1) := (1/(1 − γ2^k)) v(k+1).

Then, it updates the solution as:

x(k+1) := x(k) − ( η(k) / √(ε + v̂(k+1)) ) m̂(k+1),        (137)

which is stochastic gradient descent with momentum while using RMSProp-style scaling. The convergence of the RMSProp and Adam methods has been discussed in (Zou et al., 2019). The Adam optimizer is one of the most widely used optimizers in neural networks.
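A compact sketch of the Adam updates (135)-(137) as described above (with the scalar v of Eq. (136)); the hyperparameter values and the least squares objective are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least squares data.
n, d = 500, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
l = A @ x_true + 0.01 * rng.normal(size=n)

grad_fi = lambda x, i: (A[i] @ x - l[i]) * A[i]

x = np.zeros(d)
m, v = np.zeros(d), 0.0
gamma1, gamma2, eta, eps = 0.9, 0.999, 0.05, 1e-8

for k in range(1, 20001):
    i = rng.integers(n)
    g = grad_fi(x, i)
    m = gamma1 * m + (1 - gamma1) * g                   # Eq. (135)
    v = gamma2 * v + (1 - gamma2) * g @ g               # Eq. (136): squared gradient norm
    m_hat = m / (1 - gamma1 ** k)                       # bias correction
    v_hat = v / (1 - gamma2 ** k)
    x = x - eta / np.sqrt(eps + v_hat) * m_hat          # Eq. (137)

print(np.linalg.norm(x - x_true))
```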

5.5. Proximal Methods
5.5.1. PROXIMAL MAPPING AND PROJECTION

Definition 28 (Proximal mapping/operator (Parikh & Boyd, 2014)). The proximal mapping or proximal operator of a convex function g(.) is:

proxg(x) := arg min_u ( g(u) + (1/2)‖u − x‖₂² ).        (138)

In case the function g(.) is scaled by a scalar λ (e.g., this often holds in Eq. (148), where λ can scale g(.) as the regularization parameter), the proximal mapping is defined as:

proxλg(x) := arg min_u ( g(u) + (1/(2λ))‖u − x‖₂² ).        (139)

The proximal mapping is related to the Moreau-Yosida regularization defined below.

Definition 29 (Moreau-Yosida regularization or Moreau envelope (Moreau, 1965; Yosida, 1965)). The Moreau-Yosida regularization or Moreau envelope of a function g(.) is:

Mλg(x) := inf_u ( g(u) + (1/2)‖u − x‖₂² ).        (140)

This Moreau-Yosida regularized function has the same minimizer as the function g(.) (Lemarechal & Sagastizabal, 1997).

Figure 10. Projection of point x onto a set S.

Lemma 14 (Moreau decomposition (Moreau, 1962)). We always have the following decomposition, named the Moreau decomposition:

x = proxg(x) + proxg∗(x),        (141)
x = proxλg(x) + λ prox_{(1/λ)g∗}(x/λ),        (142)

where g(.) is a function in a space and g∗(.) is its corresponding function in the dual space (e.g., if g(.) is a norm, g∗(.) is its dual norm, or if g(.) is projection onto a cone, g∗(.) is projection onto the dual cone).

Lemma 15 (Projection onto a set). Consider an indicator function I(.) which is zero if its condition is satisfied and infinite otherwise. The proximal mapping of the indicator function of a convex set S, i.e., I(x ∈ S), is the projection of the point x onto the set S. Hence, the projection of x onto the set S, denoted by ΠS(x), is defined as:

ΠS(x) := prox_{I(·∈S)}(x) = arg min_{u∈S} ( (1/2)‖u − x‖₂² ).        (143)

As Fig. 10 shows, this projection simply maps the point x to the closest point of the set S; hence, the vector connecting the points x and ΠS(x) is orthogonal to the set S.

Proof.

prox_{I(·∈S)}(x) = arg min_u ( I(u ∈ S) + (1/2)‖u − x‖₂² )  (a)=  arg min_{u∈S} ( (1/2)‖u − x‖₂² ),

where (a) is because I(u ∈ S) becomes infinity if u ∉ S. Q.E.D.

Corollary 6 (Moreau decomposition for a norm). If the function is a scaled norm, g(.) = λ‖.‖, we have from rearranging Eq. (142) that:

proxλ‖.‖(x) = x − λ ΠB(x/λ),        (144)

where B is the unit ball of the dual norm (see Definition 6).


Derivations of the proximal operator for various g(.) functions are available in (Beck, 2017, Chapter 6). Here, we review the proximal mapping of some of the most commonly used functions. If g(x) = 0, the proximal mapping becomes an identity mapping:

proxλ0(x) (139)= arg min_u ( (1/(2λ))‖u − x‖₂² ) = x.

Lemma 16 (Proximal mapping of the ℓ2 norm (Beck, 2017, Example 6.19)). The proximal mapping of the ℓ2 norm is:

proxλ‖.‖2(x) = ( 1 − λ/max(‖x‖2, λ) ) x
             = { (1 − λ/‖x‖2) x   if ‖x‖2 ≥ λ,
                 0                if ‖x‖2 < λ.        (145)

Proof. Let g(.) = ‖.‖2 and let B be the unit ℓ2 ball (see Definition 6), because ℓ2 is the dual norm of ℓ2 according to Eq. (3). We have:

ΠB(x) = { x/‖x‖2   if ‖x‖2 ≥ 1,
          x        if ‖x‖2 < 1,

=⇒ proxλ‖.‖2(x) (144)= x − λ ΠB(x/λ)
                     = { (1 − λ/‖x‖2) x       if ‖x‖2 ≥ λ,
                         x − λ(x/λ) = 0       if ‖x‖2 < λ.

Q.E.D.

Lemma 17 (Proximal mapping of the ℓ1 norm (Beck, 2017, Example 6.8)). Let xj denote the j-th element of x = [x1, . . . , xd]> ∈ R^d and let [proxλ‖.‖1(x)]j denote the j-th element of the d-dimensional mapping proxλ‖.‖1(x). The j-th element of the proximal mapping of the ℓ1 norm is:

[proxλ‖.‖1(x)]j = max(0, |xj| − λ) sign(xj)
               = sλ(xj) := { xj − λ   if xj ≥ λ,
                             0        if |xj| < λ,
                             xj + λ   if xj ≤ −λ,        (146)

for all j ∈ {1, . . . , d}. Eq. (146) is called the soft-thresholding function, denoted here by sλ(.). It is depicted in Fig. 11.

Proof. Let g(.) = ‖.‖1 and let B be the unit ℓ∞ ball (see Definition 6), because ℓ∞ is the dual norm of ℓ1 according to Eq. (3). The j-th element of the projection is:

[ΠB(x)]j = { 1     if xj ≥ 1,
             xj    if |xj| < 1,
             −1    if xj ≤ −1,

=⇒ [proxλ‖.‖1(x)]j (144)= xj − λ [ΠB(x/λ)]j
                        = { xj − λ   if xj ≥ λ,
                            0        if |xj| < λ,
                            xj + λ   if xj ≤ −λ.

Q.E.D.
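A direct sketch of the soft-thresholding operator sλ(.) of Eq. (146), written element-wise with NumPy; it doubles as the proximal mapping of λ‖.‖1. The example vector is an arbitrary assumption.

```python
import numpy as np

def soft_threshold(x, lam):
    """Element-wise soft-thresholding, Eq. (146): the prox of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([2.0, -0.3, 0.05, -1.5])
print(soft_threshold(x, 0.5))   # [ 1.5, -0. ,  0. , -1. ]
```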

Figure 11. Soft-thresholding function.

5.5.2. PROXIMAL POINT ALGORITHM

The term g(u) + (1/(2λ))‖u − x‖₂² in Eq. (139) is strongly convex; hence, the proximal point, proxλg(x), is unique.

Lemma 18. The point x∗ minimizes the function f(.) if and only if x∗ = proxλf(x∗). In other words, the optimal point x∗ is a fixed point of the proxλf(.) operator (see Definition 12).

Consider the optimization problem (24). The proximal point algorithm, also called proximal minimization, was proposed in (Rockafellar, 1976). It finds the optimal point of this problem by iteratively updating the solution as:

x(k+1) := proxλf(x(k))
        (139)= arg min_u ( f(u) + (1/(2λ))‖u − x(k)‖₂² ),        (147)

until convergence, where λ can be seen as a parameter related to the step size. In other words, the proximal point algorithm applies gradient descent on the Moreau envelope Mλf(x) (recall Eq. (140)) rather than on the function f(.) itself.

5.5.3. PROXIMAL GRADIENT METHOD

– Composite Problems: Consider the following optimization problem:

minimize_x   f(x) + g(x),        (148)

where f(x) is a smooth function and g(x) is a convex function which is not necessarily smooth. According to the following definition, this is a composite optimization problem.

Definition 30 (Composite objective function (Nesterov, 2013)). In optimization, if a function is stated as a summation of two terms, f(x) + g(x), it is called a composite function and its optimization is a composite optimization problem.

Composite problems are widely used in machine learning and regularized problems because f(x) can be the cost function to be minimized while g(x) is the penalty or regularization term (Ghojogh & Crowley, 2019).


– Proximal Gradient Method for Composite Optimization: For solving problem (148), we can approximate the function f(.) by its quadratic approximation around a point x, because it is smooth (differentiable):

f(u) ≈ f(x) + ∇f(x)>(u − x) + (1/(2η))‖u − x‖₂²,

where we have replaced ∇²f(x) with the scaled identity matrix (1/η)I. Hence, the solution of problem (148) can be approximated as:

x = arg min_u ( f(u) + g(u) )
  ≈ arg min_u ( f(x) + ∇f(x)>(u − x) + (1/(2η))‖u − x‖₂² + g(u) )
  = arg min_u ( (1/(2η))‖u − (x − η∇f(x))‖₂² + g(u) ).        (149)

The first term in Eq. (149) keeps the solution close to the solution of gradient descent for minimizing the function f(.) (see Eq. (84)), and the second term in Eq. (149) makes the function g(.) small.
The proximal gradient method, also called proximal gradient descent, uses Eq. (149) for solving the composite problem (148). It was first proposed in (Nesterov, 2013) and also in (Beck & Teboulle, 2009) for g = ‖.‖1. It finds the optimal point by iteratively updating the solution as:

x(k+1) (149):= arg min_u ( (1/(2η(k)))‖u − (x(k) − η(k)∇f(x(k)))‖₂² + g(u) )
       (139)= prox_{η(k)g}( x(k) − η(k)∇f(x(k)) ),        (150)

until convergence, where η(k) is the step size, which can be fixed or found by line-search. In Eq. (148), the function g(.) can be a regularization term such as the ℓ2 or ℓ1 norm. In these cases, we use Lemmas 16 and 17 for calculating Eq. (150). The convergence rate of the proximal gradient method is discussed in (Schmidt et al., 2011). A distributed version of this method is also proposed in (Chen & Ozdaglar, 2012).
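Combining Eq. (150) with the soft-thresholding of Eq. (146) gives the following sketch of proximal gradient descent for a lasso-type composite problem with f(x) = 0.5‖Ax − l‖₂² and g(x) = λ‖x‖1; the data and parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse regression problem: only a few entries of x_true are nonzero.
n, d = 100, 20
A = rng.normal(size=(n, d))
x_true = np.zeros(d)
x_true[:3] = [2.0, -1.0, 0.5]
l = A @ x_true + 0.01 * rng.normal(size=n)

lam = 0.5                                       # regularization parameter
eta = 1.0 / np.linalg.norm(A.T @ A, 2)          # step size <= 1/L for f(x) = 0.5||Ax - l||^2

soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)   # Eq. (146)

x = np.zeros(d)
for _ in range(500):
    grad = A.T @ (A @ x - l)                    # gradient of the smooth part f
    x = soft(x - eta * grad, eta * lam)         # Eq. (150): prox of (eta*lam)*||.||_1

print(np.round(x, 2))                           # most entries should be exactly zero
```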

5.6. Gradient Methods for Constrained Problems
5.6.1. PROJECTED GRADIENT METHOD

The projected gradient method (Iusem, 2003), also called the gradient projection method and projected gradient descent, considers g(x) to be the indicator function I(x ∈ S) in problem (148). In other words, the optimization problem is a constrained problem as in problem (28), which can be restated as:

minimize_x   f(x) + I(x ∈ S),        (151)

because the indicator function becomes infinity if its condition is not satisfied. According to Eq. (150), the solution is updated as:

x(k+1) (150):= prox_{η(k)I(·∈S)}( x(k) − η(k)∇f(x(k)) )
       (143)= ΠS( x(k) − η(k)∇f(x(k)) ).        (152)

In other words, the projected gradient method performs a step of gradient descent and then projects the solution onto the constraint set. This procedure is repeated until convergence of the solution.
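The following sketch implements Eq. (152) for a simple case where S is the Euclidean unit ball, so that the projection ΠS has a closed form; the quadratic objective is an illustrative assumption.

```python
import numpy as np

def project_unit_ball(x):
    """Projection onto S = {x : ||x||_2 <= 1} (a simple closed-form special case)."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

# Hypothetical smooth objective: f(x) = 0.5 ||x - c||^2 with c outside the ball.
c = np.array([2.0, 2.0])
grad = lambda x: x - c

x = np.zeros(2)
eta = 0.2
for _ in range(200):
    x = project_unit_ball(x - eta * grad(x))   # Eq. (152): gradient step, then projection

print(x)   # converges to c / ||c||_2, the closest feasible point to c
```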

Lemma 19 (Projection onto the cone of orthogonal matrices (Parikh & Boyd, 2014, Section 6.7.2)). A function g : R^{d1×d2} → R is orthogonally invariant if g(UXV>) = g(X) for all U ∈ R^{d1×d1}, X ∈ R^{d1×d2}, and V ∈ R^{d2×d2}, where U and V are orthogonal matrices.
Let g be a convex and orthogonally invariant function which works on the singular values of a matrix variable X ∈ R^{d1×d2}, i.e., g = g̃ ∘ σ, where the function σ(X) gives the vector of singular values of X. In this case, we have:

proxλ,g(X) = U diag( proxλ,g̃(σ(X)) ) V>,        (153)

where diag(.) makes a diagonal matrix with its input as the diagonal, and U ∈ R^{d1×d1} and V ∈ R^{d2×d2} are the matrices of left and right singular vectors of X, respectively.
Consider the constraint for projection onto the cone of orthogonal matrices, i.e., X>X = I. In this constraint, the function g deals with the singular values of X. The reason is that, from the Singular Value Decomposition (SVD) of X, we have: X SVD= UΣV>  =⇒  X>X = VΣU>UΣV> (a)= VΣ²V> set= I  =⇒  VΣ²V>V = V (b)=⇒ VΣ² = V  =⇒  Σ = I, where (a) and (b) are because U and V are orthogonal matrices. Therefore, the constraint X>X = I (i.e., projecting onto the cone of orthogonal matrices) can be modeled by Eq. (153), which is simplified to setting all singular values of X to one:

proxλ,g(X) = ΠO(X) = UIV>,        (154)

where I ∈ R^{d1×d2} is a rectangular identity matrix and O denotes the cone of orthogonal matrices. If the constraint is scaled orthogonality, i.e., X>X = λI with λ as the scale, the projection sets all singular values to √λ, giving U(√λ I)V> = √λ UIV>.

Although the projected gradient method is most often used in the form of Eq. (152), there are a few other variants of projected gradient methods, such as (Drummond & Iusem, 2004):

y^(k) := Π_S( x^(k) − η^(k)∇f(x^(k)) ),    (155)
x^(k+1) := x^(k) + γ^(k)(y^(k) − x^(k)),    (156)


where η^(k) and γ^(k) are positive step sizes at iteration k. In this alternating approach, we find an additional variable y by gradient descent followed by projection, and then update x to get close to the found y while staying close to the previous solution by line-search.

5.6.2. PROJECTION ONTO CONVEX SETS (POCS) AND AVERAGED PROJECTIONS

Assume we want to project a point onto the intersection of c closed convex sets, i.e., ⋂_{j=1}^c S_j. We can model this by an optimization problem with a fake objective function:

minimize_{x ∈ R^d}
subject to   x ∈ S₁, . . . , x ∈ S_c.    (157)

Projection Onto Convex Sets (POCS) solves this problem, similarly to the projected gradient method, by projecting onto the sets one-by-one (Bauschke & Borwein, 1996):

x^(k+1) := Π_{S₁}(Π_{S₂}(. . . Π_{S_c}(x^(k)) . . . )),    (158)

and repeating it until convergence. Another similar method for solving problem (157) is averaged projections, which updates the solution as:

x^(k+1) := (1/c) ( Π_{S₁}(x^(k)) + · · · + Π_{S_c}(x^(k)) ).    (159)
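As a small illustration of Eqs. (158) and (159), the following Python sketch (ours, NumPy assumed) implements POCS and averaged projections for generic projection operators; the two example sets (a half-space and a ball) are hypothetical.

import numpy as np

def pocs(projections, x0, n_iter=100):
    # Eq. (158): apply the projections one-by-one, repeatedly.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        for proj in projections:
            x = proj(x)
    return x

def averaged_projections(projections, x0, n_iter=100):
    # Eq. (159): average the projections onto all the sets at every iteration.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = np.mean([proj(x) for proj in projections], axis=0)
    return x

# Example sets: the half-space {x : x[0] >= 1} and the l2-ball of radius 2.
proj_halfspace = lambda x: np.array([max(x[0], 1.0), x[1]])
proj_ball = lambda x: x if np.linalg.norm(x) <= 2 else 2 * x / np.linalg.norm(x)
x_pocs = pocs([proj_halfspace, proj_ball], np.array([5.0, 5.0]))
x_avg = averaged_projections([proj_halfspace, proj_ball], np.array([5.0, 5.0]))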

5.6.3. FRANK-WOLFE METHOD

The Frank-Wolfe method, also called the conditional gradient method and the reduced gradient algorithm, was first proposed in (Frank & Wolfe, 1956) and can be used for solving the constrained problem (28) using the gradient of the objective function (Levitin & Polyak, 1966). It updates the solution as:

y^(k) := arg min_{y∈S} ∇f(x^(k))^⊤ y,    (160)
x^(k+1) := (1 − γ^(k)) x^(k) + γ^(k) y^(k),    (161)

until convergence, where γ^(k) is the step size at iteration k. Eq. (160) finds the direction to move toward at the iteration, and Eq. (161) updates the solution while staying close to the previous solution by line-search. A stochastic version of the Frank-Wolfe method is proposed in (Reddi et al., 2016).
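The following Python sketch (ours, NumPy assumed) illustrates Eqs. (160)-(161) for the special case where S is an ℓ₁-ball, for which the linear minimization in Eq. (160) has a simple closed form; the step-size schedule γ^(k) = 2/(k+2) is a common choice, not mandated by the text.

import numpy as np

def frank_wolfe_l1(grad_f, radius, x0, n_iter=200):
    # Eqs. (160)-(161) with S the l1-ball of the given radius; the linear
    # minimization of Eq. (160) over the l1-ball picks one signed vertex.
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        g = grad_f(x)
        j = np.argmax(np.abs(g))
        y = np.zeros_like(x)
        y[j] = -radius * np.sign(g[j])       # arg min_{y in S} g^T y
        gamma = 2.0 / (k + 2.0)              # common diminishing step size
        x = (1 - gamma) * x + gamma * y
    return x

# Example: minimize ||x - c||^2 over the l1-ball of radius 1.
c = np.array([0.8, -0.3, 0.1])
x_star = frank_wolfe_l1(lambda x: 2 * (x - c), 1.0, np.zeros(3))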

6. Non-smooth and L1 Norm Optimization Methods

6.1. Lasso Regularization
The ℓ₁ norm can be used for sparsity (Ghojogh & Crowley, 2019). We explain the reason in the following. Sparsity is very useful and effective because of the betting on sparsity principle (Friedman et al., 2001) and Occam's razor (Domingos, 1999). If x = [x₁, . . . , x_d]^⊤, for having sparsity, we should use subset selection for the regularization of a cost function Ω₀(x):

minimize_x   Ω(x) := Ω₀(x) + λ ‖x‖₀,    (162)

where:

‖x‖₀ := Σ_{j=1}^d I(x_j ≠ 0),   with   I(x_j ≠ 0) = { 0 if x_j = 0,  1 if x_j ≠ 0 },    (163)

is the "ℓ₀" norm, which is not a norm (hence the quotation marks) because it does not satisfy the norm properties (Boyd & Vandenberghe, 2004). The "ℓ₀" norm counts the number of non-zero elements, so when we penalize it, it means that we want to have sparser solutions with many zero entries. According to (Donoho, 2006), the convex relaxation of the "ℓ₀" norm (subset selection) is the ℓ₁ norm. Therefore, we write the regularized optimization as:

minimize_x   Ω(x) := Ω₀(x) + λ ‖x‖₁.    (164)

Note that ℓ₁ regularization is also referred to as lasso (least absolute shrinkage and selection operator) regularization (Tibshirani, 1996; Hastie et al., 2019). Different methods exist for solving optimization problems containing the ℓ₁ norm, such as its approximation by the Huber function (Huber, 1992), proximal algorithms and soft-thresholding (Parikh & Boyd, 2014), coordinate descent (Wright, 2015; Wu & Lange, 2008), and subgradient methods. In the following, we explain these methods.

6.2. Convex Conjugate
6.2.1. CONVEX CONJUGATE

Consider Fig. 12, showing a line which supports the function f, meaning that it is tangent to the function and the function upper-bounds it. In other words, if the line moved any higher, it would intersect the function at more than one point. Now let the support line be multi-dimensional, i.e., a support hyperplane. For this tangent support hyperplane with slope y ∈ R^d and intercept β ∈ R, we should have:

y^⊤x + β = f(x) ⟹ β = f(x) − y^⊤x.

We want the smallest intercept for the support hyperplane:

β* = min_{x∈R^d} ( f(x) − y^⊤x ) = − max_{x∈R^d} ( y^⊤x − f(x) ),

where the last equality follows from Eq. (19). We define f*(y) := −β* to obtain the convex conjugate, defined below.

Definition 31 (Convex conjugate of a function). The convex conjugate of a function f(.) is defined as:

f*(y) := sup_{x∈R^d} ( y^⊤x − f(x) ).    (165)

The convex conjugate of a function is always convex, even if the function itself is not convex, because it is a point-wise maximum of affine functions.


Figure 12. Supporting line (or hyper-plane) to the function.

Lemma 20 (Conjugate of the convex conjugate). The conjugate of the convex conjugate of a function is:

f**(x) = sup_{y∈dom(f*)} ( x^⊤y − f*(y) ).    (166)

It is always a lower-bound for the function, i.e., f**(x) ≤ f(x). If the function f(.) is convex, we have f**(x) = f(x); hence, for a convex function, we have:

f(x) = sup_{y∈dom(f*)} ( x^⊤y − f*(y) ).    (167)

Lemma 21 (Gradient in terms of the convex conjugate). For any function f(.), we have:

∇f(x) = arg max_{y∈dom(f*)} ( x^⊤y − f*(y) ).    (168)

6.2.2. HUBER FUNCTION: SMOOTHING L1 NORM BY CONVEX CONJUGATE

Lemma 22 (The convex conjugate of the ℓ₁ norm). The convex conjugate of f(.) = ‖.‖₁ is:

f*(y) = { 0 if ‖y‖_∞ ≤ 1,   ∞ otherwise }.    (169)

Proof. We can write the ℓ₁ norm as:

f(x) = ‖x‖₁ = max_{‖z‖_∞≤1} x^⊤z.

Using this in Eq. (165) results in Eq. (169). Q.E.D.

According to Eq. (168), we have ∇f(x) = arg max_{‖y‖_∞≤1} x^⊤y. For x = 0, we have ∇f(x) = arg max_{‖y‖_∞≤1} 0, which has many solutions. Therefore, at x = 0, the ℓ₁ norm is not differentiable and not smooth because the gradient at that point is not unique. We can smooth the ℓ₁ norm at x = 0 using the convex conjugate. Let x = [x₁, . . . , x_d]^⊤. As we have f(x) = ‖x‖₁ = Σ_{j=1}^d |x_j|, we can use the convex conjugate for every dimension f(x_j) = |x_j|:

f*(y_j) = { 0 if |y_j| ≤ 1,   ∞ otherwise }.    (170)

According to Eq. (167), we have:

|x_j| = sup_{y_j∈R} ( x_j y_j − f*(y_j) ) = max_{|y_j|≤1} x_j y_j,

where the last equality follows from Eq. (170). This is not unique for x_j = 0. Hence, we add a µ-strongly convex function to the above equation to make the solution unique at x_j = 0 as well. This added term is named the proximity function, defined below.

Definition 32 (Proximity function (Banaschewski & Maranda, 1961)). A proximity function p(y) for a closed convex set S ⊆ dom(p) is a function which is continuous and strongly convex. We can change Eq. (167) to:

f(x) ≈ f_µ(x) := sup_{y∈dom(f*)} ( x^⊤y − f*(y) − µ p(y) ),    (171)

where µ > 0.

Using Eq. (171), we can have:

|x_j| ≈ sup_{y_j∈R} ( x_j y_j − f*(y_j) − (µ/2) y_j² ) = max_{|y_j|≤1} ( x_j y_j − (µ/2) y_j² )
      = { x_j²/(2µ)  if |x_j| ≤ µ,    |x_j| − µ/2  if |x_j| > µ },

where the first equality on the second line uses Eq. (170). This approximation to the ℓ₁ norm, which is differentiable everywhere, including at x_j = 0, is named the Huber function, defined below. Note that the Huber function is the Moreau envelope of the absolute value (see Definition 29).

Definition 33 (Huber and pseudo-Huber functions (Huber, 1992)). The Huber function and the pseudo-Huber function are:

h_µ(x) = { x²/(2µ)  if |x| ≤ µ,    |x| − µ/2  if |x| > µ },    (172)

h_µ(x) = √( (x/µ)² + 1 ) − 1,    (173)

respectively, where µ > 0. The derivative of these functions is easily calculated. For example, the derivative of the Huber function is:

∇h_µ(x) = { x/µ  if |x| ≤ µ,    sign(x)  if |x| > µ }.

The Huber and pseudo-Huber functions are shown for different µ values in Fig. 13. As this figure shows, in contrast to the ℓ₁ norm or absolute value, these two functions are smooth, so they approximate the ℓ₁ norm smoothly. This figure also shows that the Huber function is always upper-bounded by the absolute value (ℓ₁ norm); however, this does not hold for the pseudo-Huber function. We can also see that


Figure 13. (a) Comparison of ℓ₁ and ℓ₂ norms in R¹, (b) comparison of the ℓ₁ norm (i.e., absolute value in R¹) and the Huber function, and (c) comparison of the ℓ₁ norm (i.e., absolute value in R¹) and the pseudo-Huber function.

the Huber function gives a better approximation than the pseudo-Huber function; however, its calculation is harder than that of the pseudo-Huber function because it is a piece-wise function (compare Eqs. (172) and (173)). Moreover, the figure shows that a smaller positive value of µ gives a better approximation, although it makes the calculation of the Huber and pseudo-Huber functions harder.
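The following short Python sketch (ours, NumPy assumed) evaluates the Huber function of Eq. (172), the pseudo-Huber function of Eq. (173), and the Huber derivative, for experimentation with different values of µ.

import numpy as np

def huber(x, mu):
    # Eq. (172): quadratic near zero, linear minus mu/2 in the tails.
    return np.where(np.abs(x) <= mu, x ** 2 / (2 * mu), np.abs(x) - mu / 2)

def pseudo_huber(x, mu):
    # Eq. (173): a smooth approximation with no case distinction.
    return np.sqrt((x / mu) ** 2 + 1) - 1

def huber_grad(x, mu):
    # Derivative of the Huber function given after Eq. (173).
    return np.where(np.abs(x) <= mu, x / mu, np.sign(x))

x = np.linspace(-3, 3, 7)
print(huber(x, 0.5), pseudo_huber(x, 0.5), huber_grad(x, 0.5))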

6.3. Soft-thresholding and Proximal Methods
Proximal mapping was introduced in Section 5.5. We can use the proximal mapping of non-smooth functions to solve non-smooth optimization by the proximal methods introduced in Section 5.5. For example, we can solve an optimization problem containing the ℓ₁ norm in its objective function using Eq. (146). That equation, named soft-thresholding, is the proximal mapping of the ℓ₁ norm. Then, we can use any of the proximal methods, such as the proximal point method and the proximal gradient method. For solving the regularized problem (164), which optimizes a composite function, we can use the proximal gradient method introduced in Section 5.5.

6.4. Coordinate Descent
6.4.1. COORDINATE METHOD

Assume x = [x₁, . . . , x_d]^⊤. For solving Eq. (24), the coordinate method (Wright, 2015) updates the dimensions (coordinates) of the solution one-by-one and not all dimensions together at once:

x₁^(k+1) := arg min_{x₁} f(x₁, x₂^(k), x₃^(k), . . . , x_d^(k)),
x₂^(k+1) := arg min_{x₂} f(x₁^(k+1), x₂, x₃^(k), . . . , x_d^(k)),
...
x_d^(k+1) := arg min_{x_d} f(x₁^(k+1), x₂^(k+1), x₃^(k+1), . . . , x_d),    (174)

until convergence of all dimensions of the solution. Note that the update of every dimension uses the latest update of previously updated dimensions. The order of updates of the dimensions does not matter. The idea of the coordinate descent algorithm is similar to the idea of Gibbs sampling (Geman & Geman, 1984; Ghojogh et al., 2020), where we work on the dimensions of the variable one by one.
If we use a step of gradient descent (i.e., Eq. (84)) for every one of the above updates, the method is named coordinate descent. If we use the proximal gradient method (i.e., Eq. (150)) for every update in the coordinate method, the method is named proximal coordinate descent. Note that we can group some of the dimensions (features) together and alternate between updating the blocks (groups) of features. That method is named block coordinate descent. The convergence analyses of the coordinate descent and block coordinate descent methods can be found in (Luo & Tseng, 1992; 1993) and (Tseng, 2001), respectively. They show that if the function f(.) is continuous, proper, and closed, the coordinate descent method converges to a stationary point. There also exist some faster variants of coordinate descent, named accelerated coordinate descent (Lee & Sidford, 2013; Fercoq & Richtarik, 2015).
Similar to SGD, the full gradient is not available in coordinate descent to use for checking convergence, as discussed in Section 5.1.4. One can use the other criteria in that section. Moreover, note that SGD can be used with the line-search methods, too. Although coordinate descent methods are very simple and have been shown to work properly for ℓ₁ norm optimization (Wu & Lange, 2008), they have not sufficiently attracted the attention of researchers in the field of optimization (Wright, 2015).

6.4.2. L1 NORM OPTIMIZATION

The coordinate descent method can be used for ℓ₁ norm (lasso) optimization (Wu & Lange, 2008) because every coordinate of the ℓ₁ norm is an absolute value (‖x‖₁ = Σ_{j=1}^d |x_j| for x = [x₁, . . . , x_d]^⊤) and the derivative of the absolute value is a simple sign function (note that we have a subgradient for the absolute value at zero, which will be introduced in Section 6.5.1). One of the well-known ℓ₁ optimization methods is lasso regression (Tibshirani, 1996; Friedman et al., 2001; Hastie et al., 2019):

minimize_β   (1/2)‖y − Xβ‖₂² + λ‖β‖₁,    (175)

where y ∈ R^n are the labels, X = [x₁, . . . , x_d] ∈ R^{n×d} are the observations, β = [β₁, . . . , β_d]^⊤ ∈ R^d are the regression coefficients, and λ is the regularization parameter. The lasso regression is sparse, which is effective because of the reasons explained in Section 6.1.
Let c denote the objective function in Eq. (175). The objective function can be simplified as 0.5(y^⊤y − 2β^⊤X^⊤y + β^⊤X^⊤Xβ) + λ‖β‖₁. We can write the part of this objective that depends on the j-th coefficient, denoted by c_j, as:

c_j = (1/2)( y^⊤y − 2 x_j^⊤y β_j + β_j x_j^⊤x_j β_j + 2 β_j x_j^⊤ X_{−j} β_{−j} ) + λ|β_j|,

where X_{−j} := [x₁, . . . , x_{j−1}, x_{j+1}, . . . , x_d] and β_{−j} := [β₁, . . . , β_{j−1}, β_{j+1}, . . . , β_d]^⊤. For coordinate descent, we need the gradient of the objective function w.r.t. every coordinate. The derivatives of the other terms of the objective w.r.t. β_j are zero, so we only need c_j for the derivative w.r.t. β_j. Taking the derivative of c_j w.r.t. β_j and setting it to zero gives:

∂c/∂β_j = ∂c_j/∂β_j = x_j^⊤x_j β_j + x_j^⊤(X_{−j}β_{−j} − y) + λ sign(β_j) set= 0

⟹ β_j = S_{λ/‖x_j‖₂²}( x_j^⊤(y − X_{−j}β_{−j}) / (x_j^⊤x_j) )

      = { x_j^⊤(y − X_{−j}β_{−j})/‖x_j‖₂² − λ/‖x_j‖₂²   if  x_j^⊤(y − X_{−j}β_{−j})/(x_j^⊤x_j) ≥ λ/‖x_j‖₂²,
          0                                                 if  |x_j^⊤(y − X_{−j}β_{−j})/(x_j^⊤x_j)| < λ/‖x_j‖₂²,
          x_j^⊤(y − X_{−j}β_{−j})/‖x_j‖₂² + λ/‖x_j‖₂²   if  x_j^⊤(y − X_{−j}β_{−j})/(x_j^⊤x_j) ≤ −λ/‖x_j‖₂² },

which is a soft-thresholding function (see Eq. (146)), where S_τ(.) denotes soft-thresholding with threshold τ. Therefore, coordinate descent for ℓ₁ optimization finds the soft-thresholding solution, the same as the proximal mapping. We can use this soft-thresholding in coordinate descent, where we update the β_j's in Eq. (174) rather than the x_j's.
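The derivation above translates directly into a cyclic coordinate descent loop for the lasso problem (175); the following Python sketch (ours, NumPy assumed) is a minimal version with a fixed number of sweeps and no convergence check.

import numpy as np

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    # Cyclic coordinate descent for problem (175); every coordinate update is
    # the soft-thresholding solution derived above.
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0)          # ||x_j||_2^2 for every column
    for _ in range(n_sweeps):
        for j in range(d):
            # residual with the j-th coordinate removed: y - X_{-j} beta_{-j}
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta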

6.5. Subgradient Methods
6.5.1. SUBGRADIENT

Figure 14. Subgradients: (a) the two extreme subgradients at the non-smooth point and (b) some examples of the subgradients in the subdifferential at the non-smooth point.

We know that the convex conjugate f*(y) is always convex. If the convex conjugate f*(y) is strongly convex, then we have only one gradient according to Eq. (168). However, if the convex conjugate is only convex and not strongly convex, Eq. (168) might have several solutions, so the gradient may not be unique. For the points at which the function does not have a unique gradient, we can have a set of subgradients, defined below.

Definition 34 (Subgradient). Consider a convex function f(.) with domain D. The vector g ∈ R^d is a subgradient of f(.) at x ∈ D if it satisfies:

f(y) ≥ f(x) + g^⊤(y − x),   ∀y ∈ D.    (176)

As Fig. 14 shows, if the function is not smooth at a point, it has multiple subgradients at that point. In contrast, there is only one subgradient (which is the gradient) at a point at which the function is smooth.

Definition 35 (Subdifferential). The subdifferential of a convex function f(.), with domain D, at a point x ∈ D is the set of all subgradients at that point:

∂f(x) := { g | g^⊤(y − x) ≤ f(y) − f(x),  ∀y ∈ D },    (177)

where the inequality is Eq. (176). The subdifferential is a closed convex set. Every subgradient is a member of the subdifferential, i.e., g ∈ ∂f(x). An example subdifferential is shown in Fig. 14.

An example of the subgradient is the subdifferential of the absolute value, f(.) = |.|:

∂f(x) = { {1} if x > 0,   [−1, 1] if x = 0,   {−1} if x < 0 }.    (178)

The subgradient of the absolute value is equal to the gradient for x < 0 and x > 0, while there exists a set of subgradients at x = 0 because the absolute value is not smooth at that point.
We can also compute the subgradient of the ℓ₁ norm because we have f(x) = ‖x‖₁ = Σ_{i=1}^d |x_i| = Σ_{i=1}^d f_i(x_i) for x = [x₁, . . . , x_d]^⊤. We take Eq. (178) as the subdifferential of the i-th dimension, denoted by ∂f_i(x_i). Hence, for f(x) = ‖x‖₁, we have ∂f(x) = ∂f₁(x₁) × · · · × ∂f_d(x_d), where × denotes the Cartesian product of sets.
We can obtain the first-order optimality condition using subgradients by generalizing Lemma 9 as follows.


Lemma 23 (First-order optimality condition with subgradient). If x* is a local minimizer of a function f(.), then:

0 ∈ ∂f(x*).    (179)

Note that if f(.) is convex, this equation is a necessary and sufficient condition for a minimizer.

Proof. According to Eq. (176), we have f(y) ≥ f(x*) + g^⊤(y − x*), ∀y. If we have g = 0 ∈ ∂f(x*), then f(y) ≥ f(x*) + 0^⊤(y − x*) = f(x*), which means that x* is a minimizer. Q.E.D.

The following lemma can be useful for the calculation of the subdifferential of functions.

Lemma 24. Some useful properties for the calculation of the subdifferential of functions:

• For a smooth function, or at points where the function is smooth, the subdifferential has only one member, which is the gradient: ∂f(x) = {∇f(x)}.

• Linear combination: If f(x) = Σ_{i=1}^n a_i f_i(x) with a_i ≥ 0, then ∂f(x) = Σ_{i=1}^n a_i ∂f_i(x).

• Affine transformation: If f(x) = f₀(Ax + b), then ∂f(x) = A^⊤ ∂f₀(Ax + b).

• Point-wise maximum: Suppose f(x) = max{f₁(x), . . . , f_n(x)} where the f_i's are differentiable. Let I(x) := {i | f_i(x) = f(x)} state which functions attain the maximum value at the point x. At any point other than an intersection point of the functions (where f is smooth), the subgradient is g = ∇f_i(x) for i ∈ I(x). At an intersection point of two functions (where f is not smooth), e.g., f_i(x) = f_{i+1}(x), we have:

∂f(x) = { g | g = t∇f_i(x) + (1 − t)∇f_{i+1}(x),  ∀t ∈ [0, 1] }.

6.5.2. SUBGRADIENT METHOD

The subgradient method, first proposed in (Shor, 2012), is used for solving the unconstrained optimization problem (24) where the function f(.) is not smooth, i.e., not differentiable, everywhere in its domain. It iteratively updates the solution as:

x^(k+1) := x^(k) − η^(k) g^(k),    (180)

where g^(k) is any subgradient of the function f(.) at the point x^(k), i.e., g^(k) ∈ ∂f(x^(k)), and η^(k) is the step size at iteration k. Comparing this update with Eq. (84) shows that gradient descent is a special case of the subgradient method, because for a smooth function the gradient is the only member of the subdifferential set (see Lemma 24); hence, the only subgradient is the gradient.
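A minimal Python sketch of the update (180) (ours, NumPy assumed) is given below, using the diminishing step size η^(k) = 1/(k+1) and the ℓ₁ norm as an example non-smooth objective; the step-size schedule and the example are our illustrative choices.

import numpy as np

def subgradient_method(subgrad_f, x0, n_iter=1000):
    # Eq. (180) with the diminishing step size eta_k = 1/(k+1).
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        x = x - (1.0 / (k + 1)) * subgrad_f(x)
    return x

# Example: minimize f(x) = ||x||_1; sign(x) is a valid subgradient (it picks
# the subgradient 0 at the non-smooth point x = 0). The iterates approach 0.
x_star = subgradient_method(np.sign, np.array([3.0, -2.0, 0.5]))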

6.5.3. STOCHASTIC SUBGRADIENT METHOD

Consider the optimization problem (111) where at least one of the f_i(.) functions is not smooth. The stochastic subgradient method (Shor, 1998) randomly samples one of the points to update the solution in every iteration:

x^(k+1) := x^(k) − η^(k) g_i^(k),    (181)

where g_i^(k) ∈ ∂f_i(x^(k)). Comparing this with Eq. (113) shows that stochastic gradient descent is a special case of the stochastic subgradient method, because for a smooth function the gradient is the only member of the subdifferential set (see Lemma 24). Note that there is another method, also named the stochastic subgradient method, which uses a noisy unbiased subgradient for robustness to noisy data (Boyd & Mutapcic, 2008). Here, our focus was on random sampling of the point and not on the noise.
We can also have a mini-batch stochastic subgradient method, which is a generalization of mini-batch SGD to non-smooth functions. In this case, Eq. (122) is changed to:

x^(k+1) := x^(k) − η^(k) (1/b) Σ_{i∈B_k} g_i^(k).    (182)

Note that, if the function is not smooth, we can also use the subgradient instead of the gradient in other stochastic methods such as SAG and SVRG, which were introduced before. For this, we use g_i^(k) rather than ∇f_i(x^(k)) in these methods.

6.5.4. PROJECTED SUBGRADIENT METHOD

Consider the optimization problem (28). If the function f(.) is not smooth, we can use the projected subgradient method (Alber et al., 1998), which generalizes the projected gradient method introduced in Section 5.6.1. Similarly to Eq. (152), it iteratively updates the solution as:

x^(k+1) := Π_S( x^(k) − η^(k) g^(k) ),    (183)

until convergence of the solution.

7. Second-Order Optimization: Newton's Method

7.1. Newton's Method from the Newton-Raphson Root-Finding Method

We can find the root of a function f : x ↦ f(x) by solving the equation f(x) set= 0. The root of the function can be found iteratively, where we get closer to the root over the iterations. One of the iterative root-finding methods is the Newton-Raphson method (Stoer & Bulirsch, 2013). In every iteration, it finds the next solution as:

x^(k+1) := x^(k) − f(x^(k)) / ∇f(x^(k)),    (184)


where ∇f(x^(k)) is the derivative of the function w.r.t. x. According to Eq. (18), in unconstrained optimization, we can find the extremum (minimum or maximum) of the function by setting its derivative to zero, i.e., ∇f(x) set= 0. Recall that Eq. (184) was used for solving f(x) set= 0. Therefore, for solving Eq. (18), we can replace f(x) with ∇f(x) in Eq. (184):

x^(k+1) := x^(k) − η^(k) ∇f(x^(k)) / ∇²f(x^(k)),    (185)

where ∇²f(x^(k)) is the second derivative of the function w.r.t. x and we have included a step size at iteration k, denoted by η^(k) > 0. This step size can be either fixed or adaptive. If x is multivariate, i.e., x ∈ R^d, Eq. (185) is written as:

x^(k+1) := x^(k) − η^(k) (∇²f(x^(k)))^{−1} ∇f(x^(k)),    (186)

where ∇f(x^(k)) ∈ R^d is the gradient of the function w.r.t. x and ∇²f(x^(k)) ∈ R^{d×d} is the Hessian matrix w.r.t. x. Because of the second derivative or the Hessian, this optimization method is a second-order method. This method is called Newton's method.

7.2. Newton's Method for Unconstrained Optimization
Consider the following optimization problem:

minimize_x   f(x),    (187)

where f(.) is a convex function. Iterative optimization can be first-order or second-order. Iterative optimization updates the solution iteratively as in Eq. (79). The update continues until ∆x becomes very small, which is the convergence of optimization. In first-order optimization, the updating step is ∆x := −∇f(x). Near the optimal point x*, the gradient is very small, so the second-order Taylor series expansion of the function becomes:

f(x) ≈ f(x*) + ∇f(x*)^⊤(x − x*) + (1/2)(x − x*)^⊤∇²f(x*)(x − x*)
     ≈ f(x*) + (1/2)(x − x*)^⊤∇²f(x*)(x − x*),    (188)

because ∇f(x*) ≈ 0.

This shows that the function is almost quadratic near the optimal point. Following this intuition, Newton's method uses the Hessian ∇²f(x) in its updating step:

∆x := −∇²f(x)^{−1} ∇f(x).    (189)

In the literature, this equation is sometimes restated as:

∇²f(x) ∆x := −∇f(x).    (190)
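As an illustration of Eqs. (186) and (190), the following Python sketch (ours, NumPy assumed) performs Newton updates by solving the linear system rather than inverting the Hessian; the quadratic example is ours.

import numpy as np

def newton_method(grad_f, hess_f, x0, eta=1.0, n_iter=20):
    # Eq. (186): x <- x - eta * (Hessian)^{-1} gradient, computed by solving
    # the linear system of Eq. (190) instead of inverting the Hessian.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        delta = np.linalg.solve(hess_f(x), -grad_f(x))
        x = x + eta * delta
    return x

# Example: f(x) = 0.5 x^T A x - b^T x with A positive definite; one full
# Newton step already reaches the exact minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = newton_method(lambda x: A @ x - b, lambda x: A, np.zeros(2))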

7.3. Newton's Method for Equality Constrained Optimization

The optimization problem may have equality constraints:

minimize_x   f(x)
subject to   Ax = b.    (191)

After a step of update by p = ∆x, this optimization becomes:

minimize_x   f(x + p)
subject to   A(x + p) = b.    (192)

The Lagrangian of this optimization problem is:

L = f(x + p) + ν^⊤(A(x + p) − b),

where ν is the dual variable. The second-order Taylor series expansion of the function f(x + p) is:

f(x + p) ≈ f(x) + ∇f(x)^⊤p + (1/2) p^⊤∇²f(x) p.    (193)

Substituting this into the Lagrangian gives:

L ≈ f(x) + ∇f(x)^⊤p + (1/2) p^⊤∇²f(x) p + ν^⊤(A(x + p) − b).

According to Eqs. (64) and (71) in the KKT conditions, the primal and dual residuals must be zero:

∇_x L = ∇f(x) + ∇²f(x)^⊤ p + p^⊤∇³f(x) p + A^⊤ν set= 0
      ⟹ ∇²f(x)^⊤ p + A^⊤ν = −∇f(x)   (using p^⊤∇³f(x) p ≈ 0),    (194)

∇_ν L = A(x + p) − b =(a) Ap set= 0,    (195)

where we set p^⊤∇³f(x)p ≈ 0 because this higher-order term is negligible near the optimal point (cf. Eq. (18)), and (a) is because of the constraint Ax − b = 0 in problem (191). Eqs. (194) and (195) can be written as a system of equations:

[ ∇²f(x)^⊤   A^⊤ ] [ p ]   =   [ −∇f(x) ]
[ A           0   ] [ ν ]       [ 0      ].    (196)

Solving this system of equations gives the desired step p (i.e., ∆x) for updating the solution at the iteration.

– Starting with a Non-feasible Initial Point: Newton's method can even start with a non-feasible point which does not satisfy all the constraints. If the initial point for optimization is not a feasible point, i.e., Ax − b ≠ 0, Eq. (195) becomes:

∇_ν L = A(x + p) − b set= 0 ⟹ Ap = −(Ax − b).    (197)


Hence, for the first iteration, we solve the following system rather than Eq. (196):

[ ∇²f(x)^⊤   A^⊤ ] [ p ]   =   − [ ∇f(x)   ]
[ A           0   ] [ ν ]         [ Ax − b ],    (198)

and we can use Eq. (196) for the rest of the iterations, because the next points will be in the feasible set (we force the solutions to satisfy Ax = b).

7.4. Interior-Point and Barrier Methods: Newton's Method for Inequality Constrained Optimization

The optimization problem may have inequality constraints:

minimize_x   f(x)
subject to   y_i(x) ≤ 0,  i ∈ {1, . . . , m₁},
             Ax = b.    (199)

We can solve constrained optimization problems using barrier methods, also known as interior-point methods (Nesterov & Nemirovskii, 1994; Potra & Wright, 2000; Boyd & Vandenberghe, 2004; Wright, 2005). Interior-point methods were first proposed by (Dikin, 1967). The interior-point method is also referred to as the Unconstrained Minimization Technique (UMT) or Sequential UMT (SUMT) (Fiacco & McCormick, 1967) because it converts the problem to an unconstrained problem and solves it iteratively.
The barrier or interior-point methods convert inequality constrained problems to equality constrained or unconstrained problems. Ideally, we can do this conversion using the indicator function I(.), which is zero if its input condition is satisfied and is infinity otherwise (n.b. the indicator function in the optimization literature is not like the indicator in data science, which is one if its input condition is satisfied and zero otherwise). The problem is converted to:

minimize_x   f(x) + Σ_{i=1}^{m₁} I(y_i(x) ≤ 0)
subject to   Ax = b.    (200)

The indicator function is not differentiable because it is not smooth:

I(y_i(x) ≤ 0) := { 0 if y_i(x) ≤ 0,   ∞ if y_i(x) > 0 }.    (201)

Hence, we can approximate it with differentiable functions called barrier functions (Boyd & Vandenberghe, 2004; Nesterov, 2018). One such barrier function is the logarithm, named the logarithmic barrier or log barrier for short. It approximates the indicator function by:

I(y_i(x) ≤ 0) ≈ −(1/t) log(−y_i(x)),    (202)

where t > 0 (usually a large number such as t = 10⁶) and the approximation becomes more accurate as t → ∞. It changes the problem to:

minimize_x   f(x) − (1/t) Σ_{i=1}^{m₁} log(−y_i(x))
subject to   Ax = b.    (203)

This optimization problem is an equality constrained optimization problem, which we already explained how to solve. Note that there exist many approximations for the barrier; one of the most widely used is the logarithmic barrier.
The iterative solutions of the interior-point method satisfy Eq. (53) and follow Fig. 8. If the optimization problem is a convex problem, the solution of the interior-point method is the global solution; otherwise, the solution is local. The interior-point and barrier methods are used in many optimization toolboxes such as CVX (Grant et al., 2009).

– Accuracy of the log barrier method: In the following theorem, we discuss the accuracy of the log barrier method.

Theorem 8 (On the sub-optimality of the log-barrier method). Let the optima of problems (199) and (203) be denoted by f* and f*_r, respectively. We have:

f* − m₁/t ≤ f*_r ≤ f*,    (204)

meaning that the optimum of problem (203) is no more than m₁/t away from the optimum of problem (199).

Proof. Proof is available in Appendix C.1.

Theorem 8 indicates that as t → ∞, the log-barrier method becomes more accurate; i.e., the solution of problem (203) gets closer to the solution of problem (199). This is expected because the approximation in Eq. (202) gets more accurate by increasing t. Note that by increasing t, the optimization becomes more accurate but harder to solve and slower to converge.
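The following Python sketch (ours, NumPy assumed) illustrates the barrier idea of problem (203) on a problem without equality constraints; for simplicity it uses plain gradient descent with a feasibility backtrack as the inner solver instead of Newton's method, and the parameter values are arbitrary illustrative choices.

import numpy as np

def log_barrier_method(grad_f, constraints, x0, t0=1.0, mu=10.0,
                       outer=6, inner=200, eta=1e-2):
    # Problem (203) without equality constraints: minimize
    # f(x) - (1/t) * sum_i log(-y_i(x)) for an increasing sequence of t.
    # constraints is a list of (y_i, grad_y_i) pairs with y_i(x0) < 0.
    x, t = np.asarray(x0, dtype=float), t0
    for _ in range(outer):
        for _ in range(inner):
            g = grad_f(x)
            for y_i, grad_y_i in constraints:
                g = g - (1.0 / t) * grad_y_i(x) / y_i(x)   # gradient of the barrier term
            step, x_new = eta, x - eta * g
            while any(np.any(y_i(x_new) >= 0) for y_i, _ in constraints):
                step *= 0.5                                # keep the iterate strictly feasible
                x_new = x - step * g
            x = x_new
        t *= mu                                            # tighten the barrier
    return x

# Example: minimize (x - 2)^2 subject to x <= 1, i.e., y(x) = x - 1 <= 0.
x_star = log_barrier_method(lambda x: 2 * (x - 2),
                            [(lambda x: x - 1, lambda x: np.ones_like(x))],
                            x0=np.array([0.0]))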

7.5. Wolfe Conditions and Line-Search in Newton's Method

In Sections 5.1.2 and 5.1.3, we introduced line-search for gradient descent. We have line-search for second-order optimization, too. Line-search for second-order optimization checks two conditions, called the Wolfe conditions (Wolfe, 1969). For finding a suitable step size η^(k) at iteration k of optimization, the Wolfe conditions are checked. Here, we do not include the step size η in p = ∆x, and the step at iteration k is p^(k) = −∇²f(x^(k))^{−1}∇f(x^(k)) according to Eq. (189). The


Wolfe conditions are:

f(x^(k) + η^(k)p^(k)) ≤ f(x^(k)) + c₁ η^(k) p^(k)⊤∇f(x^(k)),    (205)
− p^(k)⊤∇f(x^(k) + η^(k)p^(k)) ≤ − c₂ p^(k)⊤∇f(x^(k)),    (206)

where 0 < c₁ < c₂ < 1 are the parameters of the Wolfe conditions. It is recommended in (Nocedal & Wright, 2006) to use c₁ = 10⁻⁴ and c₂ = 0.9. The first condition is the Armijo condition (Armijo, 1966), which ensures the step size η^(k) decreases the function value sufficiently. The second condition is the curvature condition, which ensures the step size η^(k) decreases the function slope sufficiently. In quasi-Newton's method (introduced later in Section 7.7), the curvature condition makes sure the approximation of the Hessian matrix remains positive definite. The Armijo and curvature conditions give an upper-bound and a lower-bound on the step size, respectively. There also exists a strong curvature condition:

|p^(k)⊤∇f(x^(k) + η^(k)p^(k))| ≤ c₂ |p^(k)⊤∇f(x^(k))|,    (207)

which can be used instead of the curvature condition. Note that the Wolfe conditions can also be used for line-search in first-order methods.
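A simple way to search for a step size satisfying Eqs. (205) and (206) is to bracket it, shrinking when the Armijo condition fails and growing when the curvature condition fails; the following Python sketch (ours, NumPy assumed) is one such minimal implementation, not the exact procedure of (Nocedal & Wright, 2006).

import numpy as np

def wolfe_line_search(f, grad_f, x, p, c1=1e-4, c2=0.9, max_iter=50):
    # Search for eta satisfying the Armijo condition (205) and the curvature
    # condition (206): shrink the step when (205) fails, grow when (206) fails.
    fx, slope = f(x), p @ grad_f(x)          # slope = p^T grad f(x) < 0 for descent
    lo, hi, eta = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + eta * p) > fx + c1 * eta * slope:        # Armijo (205) violated
            hi = eta
        elif p @ grad_f(x + eta * p) < c2 * slope:        # curvature (206) violated
            lo = eta
        else:
            return eta
        eta = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * eta
    return eta

# Example: quadratic f with the Newton direction p at the current point x.
A, b = np.array([[3.0, 1.0], [1.0, 2.0]]), np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
x = np.zeros(2)
p = np.linalg.solve(A, -grad_f(x))
eta = wolfe_line_search(f, grad_f, x, p)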

7.6. Fast Solving of the System of Equations in Newton's Method

In the unconstrained Newton's method, the update of the solution, which is Eq. (190), is in the form of a system of linear equations. In the constrained Newton's method (and hence, in the interior-point method), the update of the solution, which is Eq. (196), is also in the form of a system of linear equations. Therefore, every iteration of these optimization methods reduces to a system of equations such as:

M z = q,    (208)

where we need to calculate z. Therefore, optimization toolboxes, such as CVX (Grant et al., 2009), solve a system of equations to find the solution at every iteration. If the dimensionality of data or the number of constraints is large, the number of equations and the size of matrices in the system of equations increase. Solving a large system of equations is very time-consuming. Hence, some methods have been developed to accelerate solving the system of equations. Here, we review some of these methods.

7.6.1. DECOMPOSITION METHODS

We can use various matrix decomposition/factorization methods (Golub & Van Loan, 2013) for decomposing the coefficient matrix M and solving the system of equations (208). We review some of them here.

– LU decomposition: We use LU decomposition to decompose M = PLU. We have:

M z = P L U z = q,

where we define w₂ := U z and w₁ := L w₂. Hence, we can solve the system of equations as:

1. LU decomposition: M = PLU
2. Permutation: solve P w₁ = q to find w₁
3. Forward substitution: solve L w₂ = w₁ to find w₂
4. Back substitution: solve U z = w₂ to find z

– Cholesky decomposition: In most optimization problems, the coefficient matrix is positive definite. For example, in Eq. (190), the coefficient matrix is the Hessian, which is positive definite. Therefore, we can use Cholesky decomposition to decompose M = LL^⊤. We have:

M z = L L^⊤ z = q,

where we define w₁ := L^⊤ z. Hence, we can solve the system of equations as follows (a small numerical sketch is given after these steps):

1. Cholesky decomposition: M = LL^⊤
2. Forward substitution: solve L w₁ = q to find w₁
3. Back substitution: solve L^⊤ z = w₁ to find z
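A minimal numerical sketch of these three steps (ours, assuming NumPy and SciPy are available) follows.

import numpy as np
from scipy.linalg import solve_triangular

# Solve M z = q for symmetric positive definite M via M = L L^T.
M = np.array([[4.0, 1.0], [1.0, 3.0]])
q = np.array([1.0, 2.0])

L = np.linalg.cholesky(M)                    # step 1: Cholesky decomposition
w1 = solve_triangular(L, q, lower=True)      # step 2: forward substitution L w1 = q
z = solve_triangular(L.T, w1, lower=False)   # step 3: back substitution L^T z = w1
assert np.allclose(M @ z, q)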

– Schur complement: Assume the system of equations can be divided into blocks:

M z = [ M₁₁  M₁₂ ] [ z₁ ]   =   [ q₁ ]
      [ M₂₁  M₂₂ ] [ z₂ ]       [ q₂ ].

From this, we have:

M₁₁z₁ + M₁₂z₂ = q₁ ⟹ z₁ = M₁₁^{−1}(q₁ − M₁₂z₂),
M₂₁z₁ + M₂₂z₂ = q₂ ⟹ M₂₁(M₁₁^{−1}(q₁ − M₁₂z₂)) + M₂₂z₂ = q₂
⟹ (M₂₂ − M₂₁M₁₁^{−1}M₁₂) z₂ = q₂ − M₂₁M₁₁^{−1}q₁.

The term (M₂₂ − M₂₁M₁₁^{−1}M₁₂) is the Schur complement (Zhang, 2006) of the block matrix M₁₁ in the matrix M. We assume that the block matrix M₁₁ is not singular, so its inverse exists. We use the Schur complement to solve the system of equations as:


1. Calculate M₁₁^{−1}M₁₂ and M₁₁^{−1}q₁
2. Calculate M̃ := M₂₂ − M₂₁(M₁₁^{−1}M₁₂) and q̃ := q₂ − M₂₁(M₁₁^{−1}q₁)
3. Solve M̃ z₂ = q̃ (as derived above) to find z₂
4. Solve M₁₁ z₁ = q₁ − M₁₂z₂ (as derived above) to find z₁

7.6.2. CONJUGATE GRADIENT METHOD

The conjugate gradient method, proposed in (Hestenes & Stiefel, 1952), iteratively solves Eq. (208), often faster than solving it directly. Its advantage shows itself especially when the matrices are very large. A good reference on conjugate gradient is (Kelley, 1995, Chapter 2). It is noteworthy that truncated Newton's methods (Nash, 2000), which approximate the Hessian for large-scale optimization, usually use conjugate gradient as their approximation method for calculating the search direction.

Definition 36 (Conjugate vectors). Two non-zero vectors x and y are conjugate if x^⊤My = 0, where M ≻ 0.

Definition 37 (Krylov subspace (Krylov, 1931)). The order-r Krylov subspace, denoted by K_r, is spanned by the following bases:

K_r(M, q) = span{ q, Mq, M²q, . . . , M^{r−1}q }.    (209)

The solution to Eq. (208) satisfies z = M^{−1}q. Therefore, the solution to Eq. (208) lies in the Krylov subspace. The conjugate gradient method approximates this solution lying in the Krylov subspace. Every iteration of conjugate gradient can be seen as a projection onto the Krylov subspace.
According to Eq. (18), the solution to Eq. (208) minimizes the function:

f(z) = (1/2) z^⊤M z − z^⊤q,

because Eq. (208) sets the gradient of this function to zero, i.e., ∇f(z) = Mz − q. Conjugate gradient iteratively solves Eq. (208) as:

z^(k+1) := z^(k) + η^(k) p^(k).    (210)

It starts by moving in the direction of the negative gradient, as gradient descent does. Then, it uses conjugate directions derived from the gradient. This is the reason for the name of this method.
At iteration k, the residual (error) for fulfilling Eq. (208) is:

r^(k) := q − M z^(k) = −∇f(z^(k)).    (211)

We also have:

r^(k+1) − r^(k) = (q − M z^(k+1)) − (q − M z^(k)) = M(z^(k) − z^(k+1)) = −η^(k) M p^(k),

where the last equality follows from Eq. (210).

1   Initialize: z^(0)
2   r^(0) := −∇f(z^(0)) = q − M z^(0),  p^(0) := r^(0)
3   for iteration k = 0, 1, . . . do
4       η^(k) := (p^(k)⊤ r^(k)) / (p^(k)⊤ M p^(k))
5       z^(k+1) := z^(k) + η^(k) p^(k)
6       r^(k+1) := r^(k) − η^(k) M p^(k)
7       if ‖r^(k+1)‖₂ is small then
8           Break the loop.
9       β^(k+1) := (r^(k+1)⊤ r^(k+1)) / (r^(k)⊤ r^(k)) = ‖r^(k+1)‖₂² / ‖r^(k)‖₂²
10      p^(k+1) := r^(k+1) + β^(k+1) p^(k)
11  Return z^(k+1)

Algorithm 3: The conjugate gradient method to solve Eq. (208).

Initially, the direction is this residual, p^(0) = r^(0), as in gradient descent (see Eq. (84)). If we take the derivative of f(z^(k+1)) = f(z^(k) + η^(k)p^(k)) w.r.t. η^(k) (using Eq. (210)), we have:

∂f(z^(k) + η^(k)p^(k)) / ∂η^(k) set= 0 ⟹ η^(k) = p^(k)⊤(q − M z^(k)) / (p^(k)⊤ M p^(k)) = p^(k)⊤ r^(k) / (p^(k)⊤ M p^(k)),

where the last equality follows from Eq. (211).

The conjugate gradient method is shown in Algorithm 3. As this algorithm shows, the update direction p is found by a linear combination of the residual (which was initialized by the negative gradient, as in gradient descent) and the previous direction. The weight of the previous direction in this linear combination is β, which gets smaller if the residual of this step is much smaller than the residual of the previous iteration. This formula, β = (r^(k+1)⊤r^(k+1))/(r^(k)⊤r^(k)), is also used in the Fletcher-Reeves nonlinear conjugate gradient method (Fletcher & Reeves, 1964), introduced in the next section. The conjugate gradient method returns z as an approximation to the solution of Eq. (208).
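The following Python sketch (ours, NumPy assumed) is a direct transcription of Algorithm 3.

import numpy as np

def conjugate_gradient(M, q, z0=None, tol=1e-10, max_iter=None):
    # Algorithm 3: iteratively solve M z = q for symmetric positive definite M.
    z = np.zeros(len(q)) if z0 is None else np.asarray(z0, dtype=float)
    r = q - M @ z                            # residual, Eq. (211)
    p = r.copy()
    max_iter = max_iter if max_iter is not None else len(q)
    for _ in range(max_iter):
        Mp = M @ p
        eta = (p @ r) / (p @ Mp)             # step size along the direction p
        z = z + eta * p
        r_new = r - eta * Mp
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)     # weight of the previous direction
        p = r_new + beta * p
        r = r_new
    return z

M = np.array([[4.0, 1.0], [1.0, 3.0]])
q = np.array([1.0, 2.0])
z = conjugate_gradient(M, q)                 # agrees with np.linalg.solve(M, q)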

7.6.3. NONLINEAR CONJUGATE GRADIENT METHOD

As we saw, conjugate gradient is for solving linear equations. Nonlinear Conjugate Gradient (NCG) generalizes conjugate gradient to nonlinear functions. Recall that conjugate gradient solves Eq. (208):

M z = q ⟹ M^⊤M z = M^⊤q ⟹ 2M^⊤(M z − q) = 0.

The goal of NCG is to find the minimum of the quadratic function:

f(z) = ‖M z − q‖₂²,    (212)

using its gradient. The gradient of this function is ∇f(z) = 2M^⊤(Mz − q), which was found above.

Algorithm 4: The NCG method to find the minimum of Eq. (212).
1   Initialize: z^(0)
2   r^(0) := −∇f(z^(0)),  p^(0) := r^(0)
3   for iteration k = 0, 1, . . . do
4       r^(k+1) := −∇f(z^(k+1))
5       Compute β^(k+1) by one of Eqs. (213)
6       p^(k+1) := r^(k+1) + β^(k+1) p^(k)
7       η^(k+1) := arg min_η f(z^(k+1) + η p^(k+1))
8       z^(k+1) := z^(k) + η^(k+1) p^(k+1)
9   Return z^(k+1)

The NCG method is shown in Algorithm 4, which is very similar to Algorithm 3. It uses steepest descent for updating the solution (see Eq. (3)). The direction for the update is found by a linear combination of the residual, initialized by the negative gradient as in gradient descent, and the direction of the previous iteration. Several formulas exist for the weight β of the previous direction in the linear combination:

β₁^(k+1) := (r^(k+1)⊤ r^(k+1)) / (r^(k)⊤ r^(k)) = ‖r^(k+1)‖₂² / ‖r^(k)‖₂²,
β₂^(k+1) := (r^(k+1)⊤ (r^(k+1) − r^(k))) / (r^(k)⊤ r^(k)),
β₃^(k+1) := − (r^(k+1)⊤ (r^(k+1) − r^(k))) / (p^(k)⊤ (r^(k+1) − r^(k))),
β₄^(k+1) := − (r^(k)⊤ r^(k)) / (p^(k)⊤ (r^(k+1) − r^(k))).    (213)

The β₁, β₂, β₃, and β₄ are the formulas of the Fletcher-Reeves (Fletcher & Reeves, 1964), Polak-Ribiere (Polak & Ribiere, 1969), Hestenes-Stiefel (Hestenes & Stiefel, 1952), and Dai-Yuan (Dai & Yuan, 1999) methods, respectively. In all these formulas, β gets smaller if the residual of the next iteration is much smaller than the previous residual. The NCG method returns z as an approximation to the minimizer of the nonlinear function in Eq. (212).

7.7. Quasi-Newton's Methods
7.7.1. HESSIAN APPROXIMATION

Similar to what we did in Eq. (188), we can approximate the function at the updated solution by its second-order Taylor series:

f(x^(k+1)) = f(x^(k) + p^(k)) ≈ f(x^(k)) + ∇f(x^(k))^⊤ p^(k) + (1/2) p^(k)⊤ B^(k) p^(k),

where p = ∆x is the step and B^(k) = ∇²f(x^(k)) is the Hessian matrix at iteration k. Taking the derivative of this equation w.r.t. p gives:

∇f(x^(k) + p^(k)) ≈ ∇f(x^(k)) + B^(k) p^(k).    (214)

This equation is called the secant equation. Setting this derivative to zero, for optimization, gives:

B^(k) p^(k) = −∇f(x^(k)) ⟹ p^(k) = −H^(k) ∇f(x^(k)),    (215)

where H^(k) := (B^(k))^{−1} is the inverse of the Hessian matrix. This equation is the previously found Eqs. (189) and (190). Note that although the letter H seems to suggest the Hessian, the literature uses H to denote the (approximation of the) inverse of the Hessian. Considering the step size, we can write the step s^(k) as:

R^d ∋ s^(k) := ∆x = x^(k+1) − x^(k) = −η^(k) H^(k) ∇f(x^(k)),    (216)

where the step size is found by the Wolfe conditions in line-search (see Section 7.5).
Computation of the Hessian matrix or its inverse is usually expensive in Newton's method. One way to approximate the Newton's solution at every iteration is the conjugate gradient method, which was introduced before. Another approach for approximating the solution is the quasi-Newton's methods, which approximate the (inverse) Hessian matrix. The quasi-Newton's methods approximate the Hessian based on the step s^(k), the difference of gradients between iterations:

R^d ∋ y^(k) := ∇f(x^(k+1)) − ∇f(x^(k)),    (217)

and the previous approximated Hessian B^(k) or its inverse H^(k).
Some papers approximate the Hessian by a diagonal matrix (Lee & Verleysen, 2007, Appendix C.1), (Andrei, 2019). In this situation, the inverse of the approximated Hessian is simply the inverse of its diagonal elements. Some methods use a dense approximation for the Hessian matrix; examples are the BFGS, DFP, Broyden, and SR1 methods, which will be introduced in the following. Some other methods, such as LBFGS introduced later, approximate the Hessian matrix only by a scalar.

7.7.2. QUASI-NEWTON’S ALGORITHMS

The most well-known algorithm for quasi-Newton's method is Broyden-Fletcher-Goldfarb-Shanno (BFGS) (Fletcher, 1987; Dennis Jr & Schnabel, 1996). Limited-memory BFGS (LBFGS) (Nocedal, 1980; Liu & Nocedal, 1989) is a simplified version of BFGS which utilizes less memory. Some other quasi-Newton's methods are Davidon-Fletcher-Powell (DFP) (Davidon, 1991; Fletcher, 1987), the Broyden method (Broyden, 1965), and Symmetric Rank-one (SR1) (Conn et al., 1991). In the following, we review the approximations of the Hessian matrix and its inverse by the different methods. More explanation on these methods can be found in (Nocedal & Wright, 2006, Chapter 6).
We define:

R ∋ ρ^(k) := 1 / (y^(k)⊤ s^(k)),    (218)

R^{d×d} ∋ V^(k) := I − ρ^(k) y^(k) s^(k)⊤ = I − (y^(k) s^(k)⊤) / (y^(k)⊤ s^(k)),    (219)

where I is the identity matrix.

– BFGS: The approximations in the BFGS method are:

B^(k+1) := B^(k) + ρ^(k) y^(k) y^(k)⊤ − (B^(k) s^(k) s^(k)⊤ B^(k)⊤) / (s^(k)⊤ B^(k) s^(k)),
H^(k+1) := V^(k)⊤ H^(k) V^(k) + ρ^(k) s^(k) s^(k)⊤.    (220)

– DFP: The approximations in the DFP method are:

B^(k+1) := V^(k) B^(k) V^(k)⊤ + ρ^(k) y^(k) y^(k)⊤,
H^(k+1) := H^(k) + ρ^(k) s^(k) s^(k)⊤ − (H^(k) y^(k) y^(k)⊤ H^(k)⊤) / (y^(k)⊤ H^(k) y^(k)).    (221)

– Broyden: The approximations in the Broyden method are:

B^(k+1) := B^(k) + ((y^(k) − B^(k) s^(k)) s^(k)⊤) / (s^(k)⊤ s^(k)),
H^(k+1) := H^(k) + ((s^(k) − H^(k) y^(k)) s^(k)⊤ H^(k)) / (s^(k)⊤ H^(k) y^(k)).    (222)

– SR1: The approximations in the SR1 method are:

B^(k+1) := B^(k) + ((y^(k) − B^(k) s^(k))(y^(k) − B^(k) s^(k))^⊤) / ((y^(k) − B^(k) s^(k))^⊤ s^(k)),
H^(k+1) := H^(k) + ((s^(k) − H^(k) y^(k))(s^(k) − H^(k) y^(k))^⊤) / ((s^(k) − H^(k) y^(k))^⊤ y^(k)).    (223)

Comparing Eqs. (220) and (221) shows that the BFGS and DFP methods are dual of each other. Experiments have shown that BFGS often outperforms DFP (Avriel, 2003).
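The following Python sketch (ours, NumPy assumed) implements the inverse-Hessian update of Eq. (220) inside a bare-bones BFGS loop; for simplicity it uses a unit step size, whereas a practical implementation would choose the step by the Wolfe conditions of Section 7.5. The example problem is illustrative.

import numpy as np

def bfgs_inverse_update(H, s, y):
    # Eq. (220): update of the inverse-Hessian approximation H from the step
    # s = x^(k+1) - x^(k) and the gradient difference y, using Eqs. (218)-(219).
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

def bfgs(grad_f, x0, n_iter=100, tol=1e-8):
    # Bare-bones BFGS loop with a fixed unit step size.
    x = np.asarray(x0, dtype=float)
    H, g = np.eye(len(x)), grad_f(x)
    for _ in range(n_iter):
        p = -H @ g                           # quasi-Newton direction, Eq. (215)
        x_new = x + p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol or y @ s <= 1e-12:
            break
        H = bfgs_inverse_update(H, s, y)
    return x

# Example: minimize 0.5 x^T A x - b^T x.
A, b = np.array([[3.0, 1.0], [1.0, 2.0]]), np.array([1.0, -1.0])
x_star = bfgs(lambda x: A @ x - b, np.zeros(2))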

– LBFGS: The above methods, including BFGS, approximate the inverse Hessian matrix by a dense (d × d) matrix. For large d, storing this matrix is very memory-consuming. Hence, LBFGS (Nocedal, 1980; Liu & Nocedal, 1989) was proposed, which uses much less memory than BFGS. In LBFGS, the base approximation of the inverse Hessian is a scalar multiplied by the identity matrix, i.e., H^(k) := γ^(k) I; therefore, it approximates the (d × d) matrix with a scalar. It uses a memory of the m previous variables and recursively calculates the updating direction of the solution. In other words, it has recursive unfoldings which approximate the descent directions in optimization. The number of recursions is a small integer, for example m = 10; hence, not much memory is used. By m recursions on Eq. (220), LBFGS approximates the inverse Hessian as (Liu & Nocedal, 1989, Algorithm 2.1):

H^(k+1) := (V^(k)⊤ . . . V^(k−m)⊤) H^(0) (V^(k−m) . . . V^(k))
         + ρ^(k−m) (V^(k)⊤ . . . V^(k−m+1)⊤) s^(k−m) s^(k−m)⊤ (V^(k−m+1) . . . V^(k))
         + ρ^(k−m+1) (V^(k)⊤ . . . V^(k−m+2)⊤) s^(k−m+1) s^(k−m+1)⊤ (V^(k−m+2) . . . V^(k))
         + · · · + ρ^(k) s^(k) s^(k)⊤.

The LBFGS algorithm can be implemented as shown in Algorithm 5 (Hosseini & Sra, 2020a), which is based on (Nocedal & Wright, 2006, Chapter 6). As this algorithm shows, every iteration of optimization calls a recursive function for up to m recursions and uses the stored previous m pieces of memory to calculate the direction p for updating the solution. As Eq. (215) states, the initial updating direction is the Newton's method direction, which is p = −H^(0)∇f(x^(0)).

Algorithm 5: The LBFGS algorithm
1    Initialize the solution x^(0)
2    H^(0) := (1/‖∇f(x^(0))‖₂) I
3    for k = 0, 1, . . . (until convergence) do
4        p^(k) ← GetDirection(−∇f(x^(k)), k, 1)
5        η^(k) ← line-search with the Wolfe conditions
6        x^(k+1) := x^(k) + η^(k) p^(k)
7        s^(k) := x^(k+1) − x^(k) = η^(k) p^(k)
8        y^(k) := ∇f(x^(k+1)) − ∇f(x^(k))
9        γ^(k+1) := (s^(k)⊤ y^(k)) / (y^(k)⊤ y^(k))
10       H^(k+1) := γ^(k+1) I
11       Store y^(k), s^(k), and H^(k+1)
12   return x^(k+1)
13
14   // recursive function:
15   Function GetDirection(p, k, n_recursion)
16       if k > 0 then
17           // do up to m recursions:
18           if n_recursion > m then
19               return p
20           ρ^(k−1) := 1 / (y^(k−1)⊤ s^(k−1))
21           p := p − ρ^(k−1) (s^(k−1)⊤ p) y^(k−1)
22           p := GetDirection(p, k − 1, n_recursion + 1)
23           return p − ρ^(k−1) (y^(k−1)⊤ p) s^(k−1) + ρ^(k−1) (s^(k−1)⊤ s^(k−1)) p
24       else
25           return H^(0) p

8. Non-convex Optimization by Sequential Convex Programming

Consider the optimization problem (25) where the functions f(.), y_i(.), and h_i(.) are not necessarily convex. The methods explained so far can also work for non-convex problems, but they do not guarantee finding the global optimum. They can find local minimizers, which depend on the random initial solution. For example, the optimization landscape of a neural network is highly nonlinear, but backpropagation (see Section 5.1.8) works very well for it. The reason for this is explained in this way: every layer of a neural network pulls data to the feature space, as in kernels (Ghojogh et al., 2021). In the high-dimensional feature space, all local minimizers are almost global minimizers because the local minimum values are almost equal in that space (Feizi et al., 2017). Also see (Soltanolkotabi et al., 2018; Allen-Zhu et al., 2019a;b) to understand why backpropagation optimization works well even in highly non-convex optimization. Note that another approach for highly non-convex optimization is metaheuristic optimization, which will be briefly introduced in Section 10.5.
As was explained, the already introduced first-order and second-order optimization methods can work fairly well for non-convex problems by finding local minimizers depending on the initial solution. However, there exist some specific methods for non-convex programming, divided into two categories. Local optimization methods are faster but do not guarantee finding the global minimizer. Global optimization methods find the global minimizer but are usually slow to find the answer (Duchi et al., 2018). Sequential Convex Programming (SCP) (Dinh & Diehl, 2010) is an example of the local optimization methods. It is based on a sequence of convex approximations of the non-convex problem. It is closely related to Sequential Quadratic Programming (SQP) (Boggs & Tolle, 1995), which is used for constrained nonlinear optimization. The branch and bound method, first proposed in (Land & Doig, 1960), is an example of the global optimization methods. It divides the optimization landscape, i.e., the feasible set, into local parts by a binary tree and solves the optimization in every part. It checks whether the solution of a part is the global solution or not. In the following, we explain SCP, which is a faster but local method.

8.1. Convex Approximation
SCP iteratively solves a convex problem where, at every iteration, it approximates the non-convex problem (25) with a convex problem, based on the current solution, and restricts the variable to be in a so-called trust region (Conn et al., 2000). The trust region makes sure that the variable stays in a locally convex region of the optimization problem. At iteration k of SCP, we solve the following convex problem:

minimize_x   f̂(x)
subject to   ŷ_i(x) ≤ 0,  i ∈ {1, . . . , m₁},
             ĥ_i(x) = 0,  i ∈ {1, . . . , m₂},
             x ∈ T^(k),    (224)

where f̂(.), ŷ_i(.), and ĥ_i(.) are convex approximations of the functions f(.), y_i(.), and h_i(.), respectively (we use hats to distinguish the approximations from the original functions), and T^(k) is the trust region at iteration k. This approximated convex problem is itself solved iteratively using one of the previously introduced methods, such as the interior-point method. There exist several approaches for convex approximation of the functions. In the following, we introduce some of these approaches.

8.1.1. CONVEX APPROXIMATION BY TAYLOR SERIES EXPANSION

The non-convex functions f(.), y_i(.), and h_i(.) can be approximated by affine functions (i.e., first-order Taylor series expansion) to become convex. For example, the function f(.) is approximated as:

f̂(x) = f(x^(k)) + ∇f(x^(k))^⊤(x − x^(k)).    (225)

The functions can also be approximated by quadratic functions (i.e., second-order Taylor series expansion) to become convex. For example, the function f(.) is approximated as:

f̂(x) = f(x^(k)) + ∇f(x^(k))^⊤(x − x^(k)) + (1/2)(x − x^(k))^⊤ P (x − x^(k)),    (226)

where P = Π_{S^d₊}(∇²f(x^(k))) is the projection of the Hessian onto the cone of symmetric positive semi-definite matrices. This projection is performed by setting the negative eigenvalues of the Hessian to zero. The same approaches can be used for approximation of the functions y_i(.) and h_i(.) using first- or second-order Taylor expansions.

8.1.2. CONVEX APPROXIMATION BY PARTICLE METHOD

We can approximate the functions f(.), y_i(.), and h_i(.) in the domain of the trust region using regression. This approach is named the particle method (Duchi et al., 2018). Let {x_i ∈ T^(k)}_{i=1}^m be m points which lie in the trust region. We can use least-squares quadratic regression to make the functions convex in the trust region:

minimize_{a∈R, b∈R^d, P∈S^d₊₊}   Σ_{i=1}^m ( (1/2)(x_i − x^(k))^⊤ P (x_i − x^(k)) + b^⊤(x_i − x^(k)) + a − f(x_i) )²
subject to   P ⪰ 0.    (227)

Then, the function f(.) is replaced by its convex approximation f̂(x) = (1/2)(x − x^(k))^⊤ P (x − x^(k)) + b^⊤(x − x^(k)) + a. The same approach can be used for approximation of the functions y_i(.) and h_i(.).

8.1.3. CONVEX APPROXIMATION BY QUASI-LINEARIZATION

Another approach for convex approximation of the functions f(.), y_i(.), and h_i(.) is quasi-linearization. We should state the function f(.) in the form f(x) = A(x)x + c(x). For example, we can use the second-order Taylor series expansion to do this:

f(x) ≈ (1/2) x^⊤Px + b^⊤x + a = ((1/2)Px + b)^⊤ x + a,

so we use A(x) := ((1/2)Px + b)^⊤ and c(x) := a, which depend on the Taylor expansion of f(x). Hence, the convex approximation of the function f(.) can be:

f̂(x) = A(x^(k))x + c(x^(k)) = ((1/2)Px^(k) + b)^⊤ x + a.    (228)

The same approach can be used for approximation of the functions y_i(.) and h_i(.).

8.2. Trust Region
8.2.1. FORMULATION OF TRUST REGION

The trust region can be a box around the point at that iteration:

T^(k) := { x  |  |x_j − x_j^(k)| ≤ ρ_j,  ∀j ∈ {1, . . . , d} },    (229)

where x_j and x_j^(k) are the j-th elements of x and x^(k), respectively, and ρ_j is the bound of the box for the j-th dimension. Another option for the trust region is an ellipse around the point, i.e., a quadratic trust region:

T^(k) := { x  |  (x − x^(k))^⊤ P^{−1} (x − x^(k)) ≤ ρ },    (230)

where P ∈ S^d₊₊ (i.e., P is symmetric positive definite) and ρ > 0 is the radius of the ellipse.

8.2.2. UPDATING TRUST REGION

The trust region gets updated in every iteration of SCP. In the following, we explain how the trust region can be updated. First, we embed the constraints in the objective function of problem (25):

minimize_x   φ(x) := f(x) + λ ( Σ_{i=1}^{m₁} (max(y_i(x), 0))² + Σ_{i=1}^{m₂} |h_i(x)|² ),    (231)

where λ > 0 is the regularization parameter. This is called the exact penalty method (Di Pillo, 1994) because it penalizes violation of the constraints. For a large enough regularization parameter (which gives importance to violation of constraints), the solution of problem (231) is exactly equal to the solution of problem (25). That is the reason for the term "exact" in the name "exact penalty method". Similar to Eq. (231), we define:

φ̂(x) := f̂(x) + λ ( Σ_{i=1}^{m₁} (max(ŷ_i(x), 0))² + Σ_{i=1}^{m₂} |ĥ_i(x)|² ),    (232)

for the problem (224), where the hatted functions are the convex approximations. At iteration k of SCP, let x̂ be the solution of the convex approximated problem (224), obtained using any method such as the interior-point method. We calculate the predicted and exact decreases, which are δ̂ := φ̂(x^(k)) − φ̂(x̂) and δ := φ(x^(k)) − φ(x̂), respectively. Two cases may happen:

• We have progress in optimization if αδ̂ ≤ δ, where 0 < α < 1 (e.g., α = 0.1). In this case, we accept the approximate solution, i.e., x^(k+1) := x̂, and we increase the size of the trust region for the next iteration of SCP by ρ^(k+1) := βρ^(k), where β ≥ 1 (e.g., β = 1.1).

• We do not have progress in optimization if αδ̂ > δ. In this case, we reject the approximate solution, i.e., x^(k+1) := x^(k), and we decrease the size of the trust region for the next iteration of SCP by ρ^(k+1) := γρ^(k), where 0 < γ < 1 (e.g., γ = 0.5).

In summary, the trust region is expanded if we find a good solution; otherwise, it is made smaller.

9. Distributed Optimization
9.1. Alternating Optimization
When we have several optimization variables, we can alternate between optimizing over each of these variables. This technique is called alternating optimization in the literature (Li et al., 2019) (also see (Jain & Kar, 2017, Chapter 4)). Consider the following multivariate optimization problem:

minimize_{{x_i}_{i=1}^m}   f(x₁, . . . , x_m),    (233)


where the objective function depends on m variables. Alternating optimization alternates between updating every variable while assuming the other variables are constant, set to their last updated values. After random feasible initialization, it updates the solutions as (Li et al., 2019):

x₁^(k+1) := arg min_{x₁} f(x₁, x₂^(k), . . . , x_{m−1}^(k), x_m^(k)),
x₂^(k+1) := arg min_{x₂} f(x₁^(k+1), x₂, . . . , x_{m−1}^(k), x_m^(k)),
...
x_m^(k+1) := arg min_{x_m} f(x₁^(k+1), x₂^(k+1), . . . , x_{m−1}^(k+1), x_m),

until convergence. Any optimization method, including first-order and second-order methods, can be used for each of the optimization lines above. In most cases, alternating optimization is robust to changing the order of updates of the variables.

Remark 4. If the function f(x₁, . . . , x_m) is decomposable in terms of the variables, i.e., if we have f(x₁, . . . , x_m) = Σ_{i=1}^m f_i(x_i), the alternating optimization can be simplified to:

x₁^(k+1) := arg min_{x₁} f₁(x₁),
x₂^(k+1) := arg min_{x₂} f₂(x₂),
...
x_m^(k+1) := arg min_{x_m} f_m(x_m),

because the other terms become constant in each optimization. The above updates mean that if the function is completely decomposable in terms of the variables, the updates of the variables are independent and can be done independently. Hence, in that case, alternating optimization is reduced to m independent optimization problems, each of which can be solved by any optimization method such as the first-order and second-order methods.

Proximal alternating optimization uses the proximal operator, Eq. (139), for the minimization to keep the updated solution close to the solution of the previous iteration (Li et al., 2019):

x₁^(k+1) := arg min_{x₁} ( f(x₁, x₂^(k), . . . , x_{m−1}^(k), x_m^(k)) + (1/(2λ))‖x₁ − x₁^(k)‖₂² ),
x₂^(k+1) := arg min_{x₂} ( f(x₁^(k+1), x₂, . . . , x_{m−1}^(k), x_m^(k)) + (1/(2λ))‖x₂ − x₂^(k)‖₂² ),
...
x_m^(k+1) := arg min_{x_m} ( f(x₁^(k+1), x₂^(k+1), . . . , x_{m−1}^(k+1), x_m) + (1/(2λ))‖x_m − x_m^(k)‖₂² ).

The alternating optimization methods can also be used for constrained problems:

minimize_{{x_i}_{i=1}^m}   f(x₁, . . . , x_m)
subject to   x_i ∈ S_i,  ∀i ∈ {1, . . . , m}.    (234)

In this case, every line of the optimization is a constrained problem:

x₁^(k+1) := arg min_{x₁} f(x₁, x₂^(k), . . . , x_{m−1}^(k), x_m^(k))   s.t. x₁ ∈ S₁,
x₂^(k+1) := arg min_{x₂} f(x₁^(k+1), x₂, . . . , x_{m−1}^(k), x_m^(k))   s.t. x₂ ∈ S₂,
...
x_m^(k+1) := arg min_{x_m} f(x₁^(k+1), x₂^(k+1), . . . , x_{m−1}^(k+1), x_m)   s.t. x_m ∈ S_m.

Any constrained optimization method can be used for each of the optimization lines above; some examples are the projected gradient method, proximal methods, and interior-point methods. Finally, it is noteworthy that practical experiments have shown there is usually no need to run a complete optimization until convergence for every step of the alternating optimization, either unconstrained or constrained. Often, a single step of updating, such as one step of gradient descent or of the projected gradient method, is enough for the whole algorithm to work.
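The following small Python sketch (ours) illustrates alternating optimization on a two-variable quadratic whose block minimizers are available in closed form; the example function and its minimizers are illustrative.

def alternating_optimization(blocks, block_minimizers, n_iter=50):
    # Alternate over the variables: block i is re-optimized by
    # block_minimizers[i] while all other blocks keep their latest values.
    blocks = list(blocks)
    for _ in range(n_iter):
        for i, minimize_i in enumerate(block_minimizers):
            blocks[i] = minimize_i(blocks)
    return blocks

# Example: f(x1, x2) = (x1 - 1)^2 + (x2 + 2)^2 + 0.5 * x1 * x2, where each
# scalar block has a closed-form minimizer given the other block.
minimizers = [lambda b: 1.0 - 0.25 * b[1],     # arg min over x1 with x2 fixed
              lambda b: -2.0 - 0.25 * b[0]]    # arg min over x2 with x1 fixed
x1, x2 = alternating_optimization([0.0, 0.0], minimizers)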

9.2. Dual Ascent and Dual Decomposition Methods
Consider the following problem:

minimize_x   f(x)
subject to   Ax = b.    (235)

We follow the method of multipliers as discussed in Section 4.8. The Lagrangian is:

L(x, ν) = f(x) + ν^⊤(Ax − b).

The dual function is:

g(ν) = inf_x L(x, ν).    (236)


The dual problem maximizes g(ν):

ν* = arg max_ν g(ν),    (237)

and the optimal primal variable is:

x* = arg min_x L(x, ν*).    (238)

For solving Eq. (237), we take the derivative of the dual function w.r.t. the dual variable:

∇_ν g(ν) = ∇_ν ( inf_x L(x, ν) ) = ∇_ν ( f(x*) + ν^⊤(Ax* − b) ) = Ax* − b,

where the first equality uses Eq. (236) and the second uses Eq. (238).

The dual problem is a maximization problem, so we can use gradient ascent (see Section 5.1) for iteratively updating the dual variable with this gradient. We can alternate between updating the optimal primal and dual variables:

x^(k+1) := arg min_x L(x, ν^(k)),    (239)
ν^(k+1) := ν^(k) + η^(k)(Ax^(k+1) − b),    (240)

where k is the iteration index and η^(k) is the step size (also called the learning rate) at iteration k. Eq. (239) can be performed by any optimization method: we compute the gradient of L(x, ν^(k)) w.r.t. x and, if setting this gradient to zero does not give x in closed form, we can use gradient descent (see Section 5.1) to perform Eq. (239). Some papers approximate Eq. (239) by one or a few steps of gradient descent rather than a complete gradient descent until convergence. If using one step, we can write Eq. (239) as:

x^(k+1) := x^(k) − γ ∇_x L(x^(k), ν^(k)),    (241)

where γ > 0 is the step size. It has been shown empirically that even one step of gradient descent for Eq. (239) works properly for the whole alternating algorithm.
We continue the iterations until convergence of the primal and dual variables to stable values. When we get close to convergence, we have (Ax^(k+1) − b) → 0, so the dual variable is no longer updated according to Eq. (240). This means that after convergence, we have (Ax^(k+1) − b) ≈ 0, so the constraint in Eq. (235) is satisfied. In other words, the update of the dual variable in Eq. (240) takes care of satisfying the constraint. This method is known as the dual ascent method because it uses gradient ascent for updating the dual variable.
If the objective function can be distributed and decomposed over b blocks {x_i}_{i=1}^b, i.e.:

f(x) = f₁(x₁) + · · · + f_b(x_b),

we can have b Lagrangian functions whose summation (up to the constant term −ν^⊤b) is the total Lagrangian:

L_i(x_i, ν) := f_i(x_i) + ν^⊤ A_i x_i,
L(x, ν) = Σ_{i=1}^b L_i(x_i, ν) − ν^⊤ b,

where A_i denotes the block of columns of A acting on x_i.

We can divide Eq. (239) into b updates, each for one of the blocks:

x_i^(k+1) := arg min_{x_i} L_i(x_i, ν^(k)),  ∀i ∈ {1, . . . , b},    (242)
ν^(k+1) := ν^(k) + η^(k)(Ax^(k+1) − b).    (243)

This is called dual decomposition, developed from decomposition techniques such as the Dantzig-Wolfe decomposition (Dantzig & Wolfe, 1960), Benders' decomposition (Benders, 1962), and Lagrangian decomposition (Everett III, 1963). The dual decomposition methods can divide a problem into sub-problems and solve them in parallel. Hence, they can be used for big data, but they are usually slow to converge.
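A minimal Python sketch of dual decomposition, Eqs. (242)-(243), is given below (ours, NumPy assumed); the block minimizers are supplied in closed form for a simple example problem.

import numpy as np

def dual_decomposition(A_blocks, b, block_argmins, eta=0.05, n_iter=300):
    # Eqs. (242)-(243): each block minimizes its own Lagrangian term (here in
    # closed form, and in principle in parallel), then the shared dual
    # variable nu is updated by gradient ascent on the dual function.
    nu = np.zeros(len(b))
    for _ in range(n_iter):
        xs = [argmin_i(nu) for argmin_i in block_argmins]        # parallel block updates
        Ax = sum(A_i @ x_i for A_i, x_i in zip(A_blocks, xs))
        nu = nu + eta * (Ax - b)                                  # dual ascent step
    return xs, nu

# Example: minimize 0.5||x1||^2 + 0.5||x2||^2 subject to x1 + x2 = b.
b = np.array([2.0, -4.0])
block_argmins = [lambda nu: -nu, lambda nu: -nu]   # arg min_x 0.5||x||^2 + nu^T x
(x1, x2), nu = dual_decomposition([np.eye(2), np.eye(2)], b, block_argmins)
# Both x1 and x2 converge to b/2.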

9.3. Augmented Lagrangian Method (Method of Multipliers)

Assume we regularize the objective function in Eq. (235) by a penalty on not satisfying the constraint:

minimize_x   f(x) + (ρ/2)‖Ax − b‖₂²
subject to   Ax = b,    (244)

where ρ > 0 is the regularization parameter.

Definition 38 (Augmented Lagrangian (Hestenes, 1969; Powell, 1969)). The Lagrangian for problem (244) is:

L_ρ(x, ν) := f(x) + ν^⊤(Ax − b) + (ρ/2)‖Ax − b‖₂².    (245)

This Lagrangian is called the augmented Lagrangian for problem (235).

We can use this augmented Lagrangian in Eqs. (239) and (240):

x^(k+1) := arg min_x L_ρ(x, ν^(k)),    (246)
ν^(k+1) := ν^(k) + ρ(Ax^(k+1) − b),    (247)

where we use ρ as the step size for updating the dual variable. This method is called the augmented Lagrangian method or the method of multipliers (Hestenes, 1969; Powell, 1969; Bertsekas, 1982).


9.4. Alternating Direction Method of Multipliers (ADMM)

ADMM (Gabay & Mercier, 1976; Glowinski & Marrocco, 1976; Boyd et al., 2011) has been used in many recent machine learning and signal processing papers. The usefulness and goal of using ADMM (and other distributed methods) are two-fold: (1) it makes the problem distributed and parallelizable over several servers, and (2) it makes it possible to solve an optimization problem with multiple variables.

9.4.1. ADMM ALGORITHM

Consider the following problem:

\text{minimize}_{x_1, x_2} \quad f_1(x_1) + f_2(x_2)
\text{subject to} \quad A x_1 + B x_2 = c,    (248)

which is an optimization over the two variables x_1 and x_2. The augmented Lagrangian for this problem is:

L_\rho(x_1, x_2, \nu) = f_1(x_1) + f_2(x_2) + \nu^\top (A x_1 + B x_2 - c) + \frac{\rho}{2} \|A x_1 + B x_2 - c\|_2^2.    (249)

We can alternate between updating the primal variables x_1 and x_2 and the dual variable \nu until convergence of these variables:

x_1^{(k+1)} := \arg\min_{x_1} L_\rho(x_1, x_2^{(k)}, \nu^{(k)}),    (250)
x_2^{(k+1)} := \arg\min_{x_2} L_\rho(x_1^{(k+1)}, x_2, \nu^{(k)}),    (251)
\nu^{(k+1)} := \nu^{(k)} + \rho (A x_1^{(k+1)} + B x_2^{(k+1)} - c).    (252)

Note that the order of updating the primal and dual variables is important: the dual variable should be updated after the primal variables, but the order of updating the primal variables is not important. This method is called the Alternating Direction Method of Multipliers (ADMM) (Gabay & Mercier, 1976; Glowinski & Marrocco, 1976). A good survey on ADMM is (Boyd et al., 2011). As was explained before, Eqs. (250) and (251) can be performed by any optimization method, such as calculating the gradient of the augmented Lagrangian w.r.t. x_1 and x_2, respectively, and using a few (or even one) iterations of gradient descent for each of these equations.

9.4.2. SIMPLIFYING EQUATIONS IN ADMM

The last term in the augmented Lagrangian, Eq. (249), can be restated as:

\nu^\top (A x_1 + B x_2 - c) + \frac{\rho}{2} \|A x_1 + B x_2 - c\|_2^2
= \nu^\top (A x_1 + B x_2 - c) + \frac{\rho}{2} \|A x_1 + B x_2 - c\|_2^2 + \frac{1}{2\rho} \|\nu\|_2^2 - \frac{1}{2\rho} \|\nu\|_2^2
= \frac{\rho}{2} \Big( \|A x_1 + B x_2 - c\|_2^2 + \frac{1}{\rho^2} \|\nu\|_2^2 + \frac{2}{\rho} \nu^\top (A x_1 + B x_2 - c) \Big) - \frac{1}{2\rho} \|\nu\|_2^2
\overset{(a)}{=} \frac{\rho}{2} \Big\| A x_1 + B x_2 - c + \frac{1}{\rho} \nu \Big\|_2^2 - \frac{1}{2\rho} \|\nu\|_2^2
\overset{(b)}{=} \frac{\rho}{2} \big\| A x_1 + B x_2 - c + u \big\|_2^2 - \frac{1}{2\rho} \|\nu\|_2^2,

where (a) is because of expanding the square of a sum of two terms, and (b) is because we define u := (1/\rho)\nu. The last term -(1/(2\rho))\|\nu\|_2^2 is constant w.r.t. the primal variables x_1 and x_2, so we can drop that term from the Lagrangian when updating the primal variables. Hence, the Lagrangian can be restated as:

L_\rho(x_1, x_2, u) = f_1(x_1) + f_2(x_2) + \frac{\rho}{2} \big\| A x_1 + B x_2 - c + u \big\|_2^2 + \text{constant}.    (253)

For updating x_1 and x_2, the terms f_2(x_2) and f_1(x_1) are constant, respectively, and can be dropped (because here the arg min is important and not the minimum value). Hence, Eqs. (250), (251), and (252) can be restated as:

x_1^{(k+1)} := \arg\min_{x_1} \Big( f_1(x_1) + \frac{\rho}{2} \big\| A x_1 + B x_2^{(k)} - c + u^{(k)} \big\|_2^2 \Big),    (254)
x_2^{(k+1)} := \arg\min_{x_2} \Big( f_2(x_2) + \frac{\rho}{2} \big\| A x_1^{(k+1)} + B x_2 - c + u^{(k)} \big\|_2^2 \Big),    (255)
u^{(k+1)} := u^{(k)} + A x_1^{(k+1)} + B x_2^{(k+1)} - c.    (256)

Again, Eqs. (254) and (255) can be performed by one or a few steps of gradient descent or by any other optimization method. The convergence of ADMM for non-convex and non-smooth functions has been analyzed in (Wang et al., 2019).
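The following is a minimal sketch of these scaled-form updates for quadratic f_1 and f_2, for which the two arg min steps have closed forms; the problem data, \rho, and iteration count are illustrative assumptions.

```python
import numpy as np

# Scaled-form ADMM (Eqs. (254)-(256)) for:
#   minimize 0.5*||x1 - a1||^2 + 0.5*||x2 - a2||^2   subject to   A x1 + B x2 = c.
# The data, rho, and iteration count are illustrative assumptions.
rng = np.random.default_rng(2)
d, p = 4, 2
A, B = rng.standard_normal((p, d)), rng.standard_normal((p, d))
a1, a2, c = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(p)

rho = 1.0
x1, x2, u = np.zeros(d), np.zeros(d), np.zeros(p)
for k in range(200):
    # Eq. (254): x1-update; for a quadratic f1 the minimizer is available in closed form.
    x1 = np.linalg.solve(np.eye(d) + rho * A.T @ A,
                         a1 - rho * A.T @ (B @ x2 - c + u))
    # Eq. (255): x2-update.
    x2 = np.linalg.solve(np.eye(d) + rho * B.T @ B,
                         a2 - rho * B.T @ (A @ x1 - c + u))
    # Eq. (256): scaled dual update (u = nu / rho).
    u = u + (A @ x1 + B @ x2 - c)

print("primal residual:", np.linalg.norm(A @ x1 + B @ x2 - c))
```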

9.5. ADMM Algorithm for General Optimization Problems and Any Number of Variables

9.5.1. DISTRIBUTED OPTIMIZATION

ADMM can be extended to several equality and inequality constraints for several optimization variables (Giesen & Laue, 2016; 2019). Consider the following optimization problem with m optimization variables and an equality and an inequality constraint for every variable:

\text{minimize}_{\{x_i\}_{i=1}^m} \quad \sum_{i=1}^m f_i(x_i)
\text{subject to} \quad y_i(x_i) \le 0, \quad i \in \{1, \ldots, m\},
\quad\quad\quad\quad\;\; h_i(x_i) = 0, \quad i \in \{1, \ldots, m\}.    (257)


We can convert every inequality constraint to an equality constraint by this technique (Giesen & Laue, 2016; 2019):

y_i(x_i) \le 0 \;\equiv\; y'_i(x_i) := \big( \max(0, y_i(x_i)) \big)^2 = 0.

Hence, the problem becomes:

\text{minimize}_{\{x_i\}_{i=1}^m} \quad \sum_{i=1}^m f_i(x_i)
\text{subject to} \quad y'_i(x_i) = 0, \quad i \in \{1, \ldots, m\},
\quad\quad\quad\quad\;\; h_i(x_i) = 0, \quad i \in \{1, \ldots, m\}.

Having dual variables \lambda = [\lambda_1, \ldots, \lambda_m]^\top and \nu = [\nu_1, \ldots, \nu_m]^\top and regularization parameter \rho > 0, the augmented Lagrangian for this problem is:

L_\rho(\{x_i\}_{i=1}^m, \lambda, \nu) = \sum_{i=1}^m f_i(x_i) + \sum_{i=1}^m \lambda_i y'_i(x_i) + \sum_{i=1}^m \nu_i h_i(x_i) + \frac{\rho}{2} \sum_{i=1}^m \big( y'_i(x_i) \big)^2 + \frac{\rho}{2} \sum_{i=1}^m \big( h_i(x_i) \big)^2
= \sum_{i=1}^m f_i(x_i) + \lambda^\top y'(x) + \nu^\top h(x) + \frac{\rho}{2} \|y'(x)\|_2^2 + \frac{\rho}{2} \|h(x)\|_2^2,    (258)

where \mathbb{R}^m \ni y'(x) := [y'_1(x_1), \ldots, y'_m(x_m)]^\top and \mathbb{R}^m \ni h(x) := [h_1(x_1), \ldots, h_m(x_m)]^\top. Updating the primal and dual variables is performed as (Giesen & Laue, 2016; 2019):

x_i^{(k+1)} := \arg\min_{x_i} L_\rho(x_i, \lambda_i^{(k)}, \nu_i^{(k)}), \quad \forall i \in \{1, \ldots, m\},
\lambda^{(k+1)} := \lambda^{(k)} + \rho\, y'(x^{(k+1)}),
\nu^{(k+1)} := \nu^{(k)} + \rho\, h(x^{(k+1)}).

Note that, as the Lagrangian is completely decomposable over the indices i, the optimization for every i-th primal or dual variable does not depend on the other indices; in other words, the terms of the other indices become constant for every index. The last terms in the augmented Lagrangian, Eq. (258), can be restated as:

\lambda^\top y'(x) + \nu^\top h(x) + \frac{\rho}{2} \|y'(x)\|_2^2 + \frac{\rho}{2} \|h(x)\|_2^2
= \lambda^\top y'(x) + \frac{\rho}{2} \|y'(x)\|_2^2 + \frac{1}{2\rho} \|\lambda\|_2^2 - \frac{1}{2\rho} \|\lambda\|_2^2 + \nu^\top h(x) + \frac{\rho}{2} \|h(x)\|_2^2 + \frac{1}{2\rho} \|\nu\|_2^2 - \frac{1}{2\rho} \|\nu\|_2^2
= \frac{\rho}{2} \Big( \|y'(x)\|_2^2 + \frac{1}{\rho^2} \|\lambda\|_2^2 + \frac{2}{\rho} \lambda^\top y'(x) \Big) - \frac{1}{2\rho} \|\lambda\|_2^2 + \frac{\rho}{2} \Big( \|h(x)\|_2^2 + \frac{1}{\rho^2} \|\nu\|_2^2 + \frac{2}{\rho} \nu^\top h(x) \Big) - \frac{1}{2\rho} \|\nu\|_2^2
= \frac{\rho}{2} \Big\| y'(x) + \frac{1}{\rho} \lambda \Big\|_2^2 - \frac{1}{2\rho} \|\lambda\|_2^2 + \frac{\rho}{2} \Big\| h(x) + \frac{1}{\rho} \nu \Big\|_2^2 - \frac{1}{2\rho} \|\nu\|_2^2
\overset{(a)}{=} \frac{\rho}{2} \big\| y'(x) + u_\lambda \big\|_2^2 + \frac{\rho}{2} \big\| h(x) + u_\nu \big\|_2^2 - \text{constant},

where (a) is because we define u_\lambda := (1/\rho)\lambda and u_\nu := (1/\rho)\nu. Hence, the Lagrangian can be restated as:

L_\rho(\{x_i\}_{i=1}^m, u_\lambda, u_\nu) = \sum_{i=1}^m f_i(x_i) + \frac{\rho}{2} \big\| y'(x) + u_\lambda \big\|_2^2 + \frac{\rho}{2} \big\| h(x) + u_\nu \big\|_2^2 + \text{constant}
= \sum_{i=1}^m f_i(x_i) + \frac{\rho}{2} \sum_{i=1}^m \big[ (y'_i(x_i) + u_{\lambda,i})^2 + (h_i(x_i) + u_{\nu,i})^2 \big] + \text{constant},

where u_{\lambda,i} = (1/\rho)\lambda_i and u_{\nu,i} = (1/\rho)\nu_i are the i-th elements of u_\lambda and u_\nu, respectively. Hence, the variable updates can be restated as:

x_i^{(k+1)} := \arg\min_{x_i} \Big( f_i(x_i) + \frac{\rho}{2} \big[ (y'_i(x_i) + u_{\lambda,i}^{(k)})^2 + (h_i(x_i) + u_{\nu,i}^{(k)})^2 \big] \Big), \quad \forall i \in \{1, \ldots, m\},    (259)
u_{\lambda,i}^{(k+1)} := u_{\lambda,i}^{(k)} + y'_i(x_i^{(k+1)}), \quad \forall i \in \{1, \ldots, m\},    (260)
u_{\nu,i}^{(k+1)} := u_{\nu,i}^{(k)} + h_i(x_i^{(k+1)}), \quad \forall i \in \{1, \ldots, m\}.    (261)
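As a rough illustration of Eqs. (259)-(261), the following sketch runs the per-block updates on a toy problem with scalar blocks and a single inequality constraint per block, approximating each arg min by a few gradient steps; the functions, constraint, \rho, step size, and iteration counts are all illustrative assumptions.

```python
import numpy as np

# Per-block updates (Eqs. (259)-(261)) on a toy problem with scalar blocks:
#   f_i(x_i) = (x_i - t_i)^2  and the inequality  y_i(x_i) = x_i - 1 <= 0,
# converted to the equality  y'_i(x_i) = (max(0, x_i - 1))^2 = 0.
# The equality constraints h_i are omitted, and all data and parameters below
# (targets t_i, rho, step size, iteration counts) are illustrative assumptions.
targets = np.array([0.5, 2.0, 3.0])
m, rho, step = len(targets), 5.0, 0.02

def y_prime(x):       # y'_i(x) = (max(0, x - 1))^2
    return max(0.0, x - 1.0) ** 2

def y_prime_grad(x):  # derivative of y'_i w.r.t. x
    return 2.0 * max(0.0, x - 1.0)

x = np.zeros(m)
u_lam = np.zeros(m)   # scaled dual variables u_{lambda,i}
for k in range(300):
    for i in range(m):
        # Eq. (259): approximate the arg min over x_i by a few gradient steps on
        #   f_i(x_i) + (rho / 2) * (y'_i(x_i) + u_{lambda,i})^2.
        for _ in range(50):
            grad = 2.0 * (x[i] - targets[i]) \
                   + rho * (y_prime(x[i]) + u_lam[i]) * y_prime_grad(x[i])
            x[i] -= step * grad
    # Eq. (260): scaled dual update for each block.
    u_lam += np.array([y_prime(x[i]) for i in range(m)])

print(x)  # blocks with t_i > 1 are pushed toward the feasible boundary x_i = 1
```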

– Use of ADMM for Distributed Optimization: ADMM is one of the most well-known algorithms for distributed optimization. If the problem can be divided into several disjoint blocks (i.e., several primal variables), we can solve the optimization for each primal variable on a separate core or server (see Eq. (259) for every i). Hence, in every iteration of ADMM, the update of the primal variables can be performed in parallel by distributed servers. At the end of each iteration, the updated primal variables are gathered in a central server so that the update of the dual variable(s) is performed (see Eqs. (260) and (261)). Then, the updated dual variable(s) is sent back to the distributed servers so that they update their primal variables. This procedure is repeated until convergence of the primal and dual variables. In this sense, ADMM is performed similarly to the approach of federated learning (Konecny et al., 2015; Li et al., 2020).

9.5.2. MAKING AN OPTIMIZATION PROBLEM DISTRIBUTED

We can convert a non-distributed optimization problem to a distributed optimization problem and solve it using ADMM. Many recent machine learning and signal processing papers use this technique.

– Univariate optimization problem: Consider a regular non-distributed problem with one optimization variable x:

\text{minimize}_x \quad \sum_{i=1}^m f_i(x)
\text{subject to} \quad y_i(x) \le 0, \quad i \in \{1, \ldots, m\},
\quad\quad\quad\quad\;\; h_i(x) = 0, \quad i \in \{1, \ldots, m\}.    (262)

This problem can be stated as:

\text{minimize}_{\{x_i\}_{i=1}^m} \quad \sum_{i=1}^m f_i(x_i)
\text{subject to} \quad y_i(x_i) \le 0, \quad i \in \{1, \ldots, m\},
\quad\quad\quad\quad\;\; h_i(x_i) = 0, \quad i \in \{1, \ldots, m\},
\quad\quad\quad\quad\;\; x_i = z, \quad i \in \{1, \ldots, m\},    (263)

where we introduce m variables \{x_i\}_{i=1}^m and use the trick x_i = z, \forall i, to make them all equal to one variable. Eq. (263) is similar to Eq. (257) except that it has 2m equality constraints rather than m equality constraints. Hence, we can use ADMM updates similar to Eqs. (259), (260), and (261), but with a slight change because of the additional m constraints. We introduce m new dual variables for the constraints x_i = z, \forall i, and update those dual variables as well as the other variables. The augmented Lagrangian also has some additional terms for the new constraints. We do not write down the Lagrangian and the ADMM updates because of their similarity to the previous equations. This is a good technique to make a problem distributed, use ADMM for solving it, and solve it on parallel servers.

– Multivariate optimization problem: Consider a regular non-distributed problem with multiple optimization variables \{x_i\}_{i=1}^m:

\text{minimize}_{\{x_i\}_{i=1}^m} \quad \sum_{i=1}^m f_i(x_i)
\text{subject to} \quad x_i \in \mathcal{S}_i, \quad i \in \{1, \ldots, m\},    (264)

where x_i \in \mathcal{S}_i can be any constraint such as belonging to a set \mathcal{S}_i, an equality constraint, or an inequality constraint.

We can embed the constraint in the objective function using an indicator function:

\text{minimize}_{\{x_i\}_{i=1}^m} \quad \sum_{i=1}^m \big( f_i(x_i) + \phi_i(x_i) \big),

where \phi_i(x_i) := \mathbb{I}(x_i \in \mathcal{S}_i) is zero if x_i \in \mathcal{S}_i and is infinity otherwise. This problem can be stated as:

\text{minimize}_{\{x_i\}_{i=1}^m, \{z_i\}_{i=1}^m} \quad \sum_{i=1}^m \big( f_i(x_i) + \phi_i(z_i) \big)
\text{subject to} \quad x_i = z_i, \quad i \in \{1, \ldots, m\},    (265)

where we introduce a variable z_i for every x_i, use the introduced variable for the second term in the objective function, and equate them in the constraint. As the constraints x_i - z_i = 0, \forall i, are equality constraints, we can use Eqs. (254), (255), and (256) as the ADMM updates for this problem:

x_i^{(k+1)} := \arg\min_{x_i} \Big( f_i(x_i) + \frac{\rho}{2} \big\| x_i - z_i^{(k)} + u_i^{(k)} \big\|_2^2 \Big), \quad \forall i \in \{1, \ldots, m\},    (266)
z_i^{(k+1)} := \arg\min_{z_i} \Big( \phi_i(z_i) + \frac{\rho}{2} \big\| x_i^{(k+1)} - z_i + u_i^{(k)} \big\|_2^2 \Big), \quad \forall i \in \{1, \ldots, m\},    (267)
u_i^{(k+1)} := u_i^{(k)} + x_i^{(k+1)} - z_i^{(k+1)}, \quad \forall i \in \{1, \ldots, m\}.

Comparing Eqs. (266) and (267) with Eq. (139) shows that these ADMM updates can be written as proximal mappings:

x_i^{(k+1)} := \text{prox}_{\frac{1}{\rho} f_i}\big( z_i^{(k)} - u_i^{(k)} \big), \quad \forall i \in \{1, \ldots, m\},
z_i^{(k+1)} := \text{prox}_{\frac{1}{\rho} \phi_i}\big( x_i^{(k+1)} + u_i^{(k)} \big), \quad \forall i \in \{1, \ldots, m\},    (268)
u_i^{(k+1)} := u_i^{(k)} + x_i^{(k+1)} - z_i^{(k+1)}, \quad \forall i \in \{1, \ldots, m\},

if we notice that \|x_i^{(k+1)} - z_i + u_i^{(k)}\|_2^2 = \|z_i - x_i^{(k+1)} - u_i^{(k)}\|_2^2. Note that in many papers, such as (Otero et al., 2018), we only have m = 1; in that case, we only have two primal variables x and z.
According to Lemma 15, as the function \phi_i(\cdot) is an indicator function, Eq. (268) can be implemented by projection onto the set \mathcal{S}_i:

z_i^{(k+1)} := \Pi_{\mathcal{S}_i}\big( x_i^{(k+1)} + u_i^{(k)} \big), \quad \forall i \in \{1, \ldots, m\}.

As an example, assume the variables are all matrices, so we have X_i, Z_i, and U_i. If the set \mathcal{S}_i is the set of orthogonal matrices, the constraint X_i \in \mathcal{S}_i would be X_i^\top X_i = I. In this case, the update of the matrix variable Z_i is done by setting the singular values of (X_i^{(k+1)} + U_i^{(k)}) to one (see Lemma 19).
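As a sketch of this projection step, the following hypothetical snippet projects a square matrix onto the set of orthogonal matrices by setting its singular values to one via an SVD; the matrices used here are random placeholders, not quantities from the original text.

```python
import numpy as np

def project_onto_orthogonal(M):
    """Project a square matrix onto the set of orthogonal matrices
    by setting all of its singular values to one (SVD-based projection)."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

# Hypothetical z-update / projection step for one block i:
rng = np.random.default_rng(3)
X_next, U_dual = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
Z_next = project_onto_orthogonal(X_next + U_dual)
print(np.allclose(Z_next.T @ Z_next, np.eye(3)))  # True: Z_next is orthogonal
```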


10. Additional Notes

There exist some other optimization methods which we have not covered in this paper, for brevity. Here, we review some of them.

10.1. Cutting-Plane Methods

Cutting-plane methods, also called localization methods, are a family of methods which start with a large feasible set containing the solution to be found. Then, they iteratively reduce the feasible set by cutting off some piece of it (Boyd & Vandenberghe, 2007). For example, a cutting-plane method starts with a polyhedron feasible set. At every iteration, it finds a plane which divides the feasible set into two disjoint parts, one of which contains the ultimate solution. It gets rid of the part without the solution and reduces the volume of the polyhedron. This is repeated until the polyhedron feasible set becomes very small and converges to the solution. This is somewhat a generalization of the bisection method, also called the binary search method, which was used for root-finding (Burden & Faires, 1963) but was later used for convex optimization. The bisection method halves the feasible set and removes the part without the solution at every iteration (see Algorithm 6). Some of the important cutting-plane methods are the center of gravity method, the Maximum Volume Ellipsoid (MVE) cutting-plane method, the Chebyshev center cutting-plane method, and the Analytic Center Cutting-Plane Method (ACCPM) (Goffin & Vial, 1993; Nesterov, 1995; Atkinson & Vaidya, 1995). Similar to subgradient methods, cutting-plane methods can be used for optimizing non-smooth functions.

10.2. Ellipsoid Method

The ellipsoid method was developed by several people (Wolfe, 1980; Rebennack, 2009). It was proposed in (Shor, 1977; Yudin & Nemirovski, 1976; 1977a;b) and was initially applied to linear programming in a famous paper (Khachiyan, 1979). It is similar to cutting-plane methods in cutting off some part of the feasible set iteratively. At every iteration, it finds an ellipsoid centered at the current solution:

\mathcal{E}(x^{(k)}, P) := \{ z \mid (z - x^{(k)})^\top P^{-1} (z - x^{(k)}) \le 1 \},

where P \in \mathbb{S}^d_{++} (i.e., symmetric positive definite). It removes the half of the ellipsoid which does not contain the solution. Then, another ellipsoid is found at the updated solution. This is repeated until the ellipsoid of the iteration is very small and converges to the solution.

10.3. Minimax and Maximin Problems

Consider a function of two variables, f(x_1, x_2), and the following optimization problem:

\text{minimize}_{x_1} \Big( \text{maximize}_{x_2} \; f(x_1, x_2) \Big).    (269)

Algorithm 6: The bisection algorithm
  Input: l and u
  for iteration k = 0, 1, \ldots do
      x^{(k+1)} := (l + u)/2
      if \nabla f(x^{(k+1)}) < 0 then
          l := x^{(k+1)}
      else
          u := x^{(k+1)}
      Check the convergence criterion
      if converged then
          return x^{(k+1)}
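A minimal Python version of Algorithm 6 for minimizing a univariate convex differentiable function is sketched below; the example function, the interval, and the tolerance are illustrative assumptions.

```python
def bisection_minimize(grad, l, u, tol=1e-8, max_iter=100):
    """Algorithm 6 (bisection): shrink the interval [l, u] using the sign of the
    derivative of a univariate convex function until the interval is small."""
    for _ in range(max_iter):
        x = (l + u) / 2.0
        if grad(x) < 0:      # the minimizer lies to the right of x
            l = x
        else:                # the minimizer lies to the left of (or at) x
            u = x
        if u - l < tol:      # convergence criterion
            break
    return (l + u) / 2.0

# Illustrative example: f(x) = (x - 2)^2 with derivative 2*(x - 2); the minimizer is x = 2.
x_star = bisection_minimize(lambda x: 2.0 * (x - 2.0), l=-10.0, u=10.0)
print(x_star)  # approximately 2.0
```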

In this problem, we want to minimize the function w.r.t. one of the variables and maximize it w.r.t. the other variable. This optimization problem is called the minimax problem. We can change the order of this problem to have the so-called maximin problem:

\text{maximize}_{x_1} \Big( \text{minimize}_{x_2} \; f(x_1, x_2) \Big).    (270)

Note that, under certain conditions, the minimax and maximin problems are equivalent if the variables of maximization (or minimization) stay the same. In other words, under some conditions, we have (Du & Pardalos, 2013):

\text{minimize}_{x_1} \Big( \text{maximize}_{x_2} \; f(x_1, x_2) \Big) = \text{maximize}_{x_2} \Big( \text{minimize}_{x_1} \; f(x_1, x_2) \Big).

In the minimax and maximin problems, the two variables have conflicting or contrastive desires; one of them wants to maximize the function while the other wants to minimize it. Hence, these problems are widely used in the field of game theory as important strategies (Aumann & Maschler, 1972).

10.4. Riemannian Optimization

In this paper, we covered optimization methods in the Euclidean space. The Euclidean optimization methods can be slightly revised to have optimization on (possibly curvy) Riemannian manifolds. Riemannian optimization (Absil et al., 2009; Boumal, 2020) optimizes a cost function while the variable lies on a Riemannian manifold \mathcal{M}. The optimization variable in Riemannian optimization is usually a matrix rather than a vector; hence, Riemannian optimization is also called optimization on matrix manifolds. The Riemannian optimization problem is formulated as:

\text{minimize}_X \quad f(X)
\text{subject to} \quad X \in \mathcal{M}.    (271)

Most of the Euclidean first-order and second-order optimization methods have their Riemannian optimization variants obtained by some changes in the formulation of the methods. In the Riemannian optimization methods, the solution lies on the manifold. At every iteration, the descent direction is calculated in the tangent space of the manifold at the current solution. Then, the updated solution in the tangent space is retracted (i.e., projected) onto the curvy manifold (Hosseini & Sra, 2020b; Hu et al., 2020). This procedure is repeated until convergence to the final solution. Some well-known Riemannian manifolds which are used for optimization are the Symmetric Positive Definite (SPD) (Sra & Hosseini, 2015), quotient (Lee, 2013), Grassmann (Bendokat et al., 2020), and Stiefel (Edelman et al., 1998) manifolds.

10.5. Metaheuristic Optimization

In this paper, we covered classical optimization methods. There are some other optimization methods, called metaheuristic optimization (Talbi, 2009), which are a family of methods finding the optimum of a cost function using efficient, and not brute-force, search. They fall in the field of soft computing and can be used in highly non-convex optimization with many constraints, where classical optimization is somewhat difficult and slow to perform. Some very well-known categories of metaheuristic optimization methods are nature-inspired optimization (Yang, 2010), evolutionary computing (Simon, 2013), and particle-based optimization. The general shared idea of metaheuristic methods is as follows. We perform local search by some particles in various parts of the feasible set. Wherever a smaller cost is found, we tend to do more local search in that area; although, we should also keep exploring other areas because that better area might be just a local optimum. Hence, local and global searches are used for exploitation and exploration of the cost function, respectively.
There exist many metaheuristic methods. Two popular and fundamental ones are the Genetic Algorithm (GA) (Holland et al., 1992) and Particle Swarm Optimization (PSO) (Kennedy & Eberhart, 1995). GA, which is an evolutionary method inspired by natural selection, uses chromosomes as particles, where the particles tend to cross-over (or marry) with better particles in terms of a smaller cost function. This results in a better next generation of particles. Mutations are also applied for global search and exploration. Iterating the creation of new generations results in very good particles finding the optimum.
PSO is a nature-inspired method whose particles can be seen as a school of fish or a flock of birds. For better understanding, assume the particles are humans digging the ground in a vast area to find treasure. They find various things which differ in terms of value. When someone finds a roughly good thing, people tend to move toward that person but also tend to search more than before around themselves at the same time. Hence, they move in a combined direction tending a little more toward the better found object. The reason that they still dig around themselves is that the found object may not be the real treasure (i.e., it may be a local optimum), so they also search around for the sake of exploration.

11. Conclusion

This paper was a tutorial and survey paper on KKT conditions, numerical optimization (both first-order and second-order methods), and distributed optimization. We covered various optimization algorithms in this paper. Reading it can be useful for different people in different fields of science and engineering. We did not assume much about the reader's background and explained the methods in detail.

Acknowledgement

The authors hugely thank Prof. Stephen Boyd for his great courses Convex Optimization 1 and 2 of Stanford University available on YouTube (the course Convex Optimization 1 mostly focuses on second-order and interior-point methods, and the course Convex Optimization 2 focuses on more advanced non-convex optimization, non-smooth optimization, the ellipsoid method, distributed optimization, proximal algorithms, and some first-order methods). They also thank Prof. Kimon Fountoulakis, Prof. James Geelen, Prof. Oleg Michailovich, Prof. Massoud Babaie-Zadeh, Prof. Lieven Vandenberghe, Prof. Mark Schmidt, Prof. Ryan Tibshirani, Prof. Reshad Hosseini, Prof. Saeed Sharifian, and some other professors whose lectures partly covered some materials and proofs mentioned in this tutorial paper. The great books of Prof. Yurii Nesterov (Nesterov, 1998; 2003) were also influential on some of the proofs.

A. Proofs for Section 2

A.1. Proof for Lemma 5

f(y) \overset{(11)}{=} f(x) + \nabla f(x)^\top (y - x) + \int_0^1 \big( \nabla f(x + t(y - x)) - \nabla f(x) \big)^\top (y - x)\, dt
\overset{(a)}{\le} f(x) + \nabla f(x)^\top (y - x) + \int_0^1 \big\| \nabla f(x + t(y - x)) - \nabla f(x) \big\|_2 \|y - x\|_2\, dt
\overset{(b)}{\le} f(x) + \nabla f(x)^\top (y - x) + \int_0^1 L t \|y - x\|_2^2\, dt
= f(x) + \nabla f(x)^\top (y - x) + L \|y - x\|_2^2 \int_0^1 t\, dt
= f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|_2^2,

where (a) is because of the Cauchy-Schwarz inequality and (b) is because, according to Eq. (12), we have \|\nabla f(x + t(y - x)) - \nabla f(x)\|_2 \le L \|x + t(y - x) - x\|_2 = L t \|y - x\|_2. Q.E.D.


A.2. Proof for Lemma 6

– Proof for the first equation: As f(\cdot) is convex, according to Eq. (5),

f(z) - f(y) \ge \nabla f(y)^\top (z - y) \implies f(y) - f(z) \le \nabla f(y)^\top (y - z).    (272)
f(y) - f(x) = \big( f(y) - f(z) \big) + \big( f(z) - f(x) \big).    (273)

Also, according to Eq. (13), we have:

f(z) - f(x) \le \nabla f(x)^\top (z - x) + \frac{L}{2} \|z - x\|_2^2.    (274)

Using Eqs. (272) and (274) in Eq. (273) gives:

f(y) - f(x) \le \nabla f(y)^\top (y - z) + \nabla f(x)^\top (z - x) + \frac{L}{2} \|z - x\|_2^2.

We can minimize this upper bound (the right-hand side) by setting its derivative w.r.t. z to zero. It gives:

z = x - \frac{1}{L} \big( \nabla f(x) - \nabla f(y) \big).

Putting this in the upper-bound gives:

f(y) - f(x) \le \nabla f(y)^\top (y - x) + \frac{1}{L} \nabla f(y)^\top \big( \nabla f(x) - \nabla f(y) \big) - \frac{1}{L} \nabla f(x)^\top \big( \nabla f(x) - \nabla f(y) \big) + \frac{L}{2} \cdot \frac{1}{L^2} \|\nabla f(x) - \nabla f(y)\|_2^2
\overset{(a)}{=} \nabla f(y)^\top (y - x) - \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_2^2 + \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2
= \nabla f(y)^\top (y - x) - \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2,

where (a) is because \|\nabla f(x) - \nabla f(y)\|_2^2 = (\nabla f(x) - \nabla f(y))^\top (\nabla f(x) - \nabla f(y)).
– Proof for the second equation: If we exchange the points x and y in Eq. (14), we have:

f(x) - f(y) \le \nabla f(x)^\top (x - y) - \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_2^2.    (275)

Adding Eqs. (14) and (275) gives:

0 \le \big( \nabla f(y) - \nabla f(x) \big)^\top (y - x) - \frac{1}{L} \|\nabla f(y) - \nabla f(x)\|_2^2.

Q.E.D.

A.3. Proof for Lemma 7

Consider any y \in \mathcal{D}. We define z := \alpha y + (1 - \alpha) x. We choose a small enough \alpha to have \|z - x\|_2 \le \varepsilon. Hence:

\varepsilon \ge \|z - x\|_2 = \|\alpha y + (1 - \alpha) x - x\|_2 = \|\alpha y - \alpha x\|_2 = \alpha \|y - x\|_2 \implies \alpha \le \frac{\varepsilon}{\|y - x\|_2}.

As \alpha \in [0, 1], we should have 0 \le \alpha \le \min(\varepsilon / \|y - x\|_2, 1). As x is a local minimizer, according to Eq. (16), we have \exists\, \varepsilon > 0 : \forall z \in \mathcal{D}, \|z - x\|_2 \le \varepsilon \implies f(x) \le f(z). As the function is convex, according to Eq. (4), we have f(z) = f(\alpha y + (1 - \alpha) x) \le \alpha f(y) + (1 - \alpha) f(x) (n.b. we have exchanged the variables x and y in Eq. (4)). Hence, overall, we have:

f(x) \le f(z) \le \alpha f(y) + (1 - \alpha) f(x)
\implies f(x) - (1 - \alpha) f(x) \le \alpha f(y)
\implies \alpha f(x) \le \alpha f(y) \implies f(x) \le f(y), \quad \forall y \in \mathcal{D}.

So, x is the global minimizer. Q.E.D.

A.4. Proof for Lemma 8

– Proof for the side (x^* is a minimizer \implies \nabla f(x^*) = 0): According to the definition of the directional derivative, we have:

\nabla f(x)^\top (y - x) = \lim_{t \to 0} \frac{f(x + t(y - x)) - f(x)}{t}.

For a minimizer x^*, we have f(x^*) \le f(x^* + t(y - x^*)) because it minimizes f(\cdot) in a neighborhood around it (n.b. x^* + t(y - x^*) is in a neighborhood of x^* because t tends to zero). Hence, we have:

0 \le \lim_{t \to 0} \frac{f(x^* + t(y - x^*)) - f(x^*)}{t} = \nabla f(x^*)^\top (y - x^*).

As y can be any point in the domain \mathcal{D}, we can choose it to be y = x^* - \nabla f(x^*), so we have \nabla f(x^*) = x^* - y and therefore:

0 \le \nabla f(x^*)^\top (y - x^*) = -\nabla f(x^*)^\top \nabla f(x^*) = -\|\nabla f(x^*)\|_2^2 \le 0 \implies \nabla f(x^*) = 0.

– Proof for the side (\nabla f(x^*) = 0 \implies x^* is a minimizer): As the function f(\cdot) is convex, according to Eq. (5), we have:

f(y) \ge f(x^*) + \nabla f(x^*)^\top (y - x^*), \quad \forall y \in \mathcal{D}.
As we have \nabla f(x^*) = 0, we can say f(y) \ge f(x^*), \forall y. So, x^* is the global minimizer.


A.5. Proof for Lemma 9

As x^* is a local minimizer, according to Eq. (16), we have \exists\, \varepsilon > 0 : \forall y \in \mathcal{D}, \|y - x^*\|_2 \le \varepsilon \implies f(x^*) \le f(y). Also, by Eq. (11), we have f(y) = f(x^*) + \nabla f(x^*)^\top (y - x^*) + o(y - x^*). From these two, we have:

f(x^*) \le f(x^*) + \nabla f(x^*)^\top (y - x^*) + o(y - x^*) \implies \nabla f(x^*)^\top (y - x^*) \ge 0.

As y can be any point in the domain \mathcal{D}, we can choose it to be y = x^* - \nabla f(x^*), so we have \nabla f(x^*) = x^* - y and therefore \nabla f(x^*)^\top (y - x^*) = -\|\nabla f(x^*)\|_2^2 \ge 0. Hence, \nabla f(x^*) = 0. Q.E.D.

B. Proofs for Section 5

B.1. Proof for Lemma 13

Because we halve the step size every time, after \tau internal iterations of line-search, we have:

\eta^{(\tau)} := \Big( \frac{1}{2} \Big)^\tau \eta^{(0)} = \Big( \frac{1}{2} \Big)^\tau,

where \eta^{(0)} = 1 is the initial step size. According to Eq. (87), we have:

\eta^{(\tau)} = \Big( \frac{1}{2} \Big)^\tau < \frac{1}{L} \overset{(a)}{\implies} \log_{\frac{1}{2}} \Big( \frac{1}{2} \Big)^\tau > \log_{\frac{1}{2}} \frac{1}{L}
\implies \tau > \log_{\frac{1}{2}} \frac{1}{L} = \frac{\log \frac{1}{L}}{\log \frac{1}{2}} = -\frac{\log \frac{1}{L}}{\log 2} = -\frac{1}{\log 2} (\log 1 - \log L) = \frac{\log L}{\log 2}.

Q.E.D.

B.2. Proof for Theorem 1

We re-arrange Eq. (88):

\|\nabla f(x^{(k)})\|_2^2 \le 2L \big( f(x^{(k)}) - f(x^{(k+1)}) \big), \quad \forall k,
\implies \sum_{k=0}^{t} \|\nabla f(x^{(k)})\|_2^2 \le 2L \sum_{k=0}^{t} \big( f(x^{(k)}) - f(x^{(k+1)}) \big).    (276)

The right-hand side of Eq. (276) is a telescopic summation:

\sum_{k=0}^{t} \big( f(x^{(k)}) - f(x^{(k+1)}) \big) = f(x^{(0)}) - f(x^{(1)}) + f(x^{(1)}) - f(x^{(2)}) + \cdots + f(x^{(t)}) - f(x^{(t+1)}) = f(x^{(0)}) - f(x^{(t+1)}).

The left-hand side of Eq. (276) is at least the summation of its smallest term (t + 1) times:

(t + 1) \min_{0 \le k \le t} \|\nabla f(x^{(k)})\|_2^2 \le \sum_{k=0}^{t} \|\nabla f(x^{(k)})\|_2^2.

Overall, Eq. (276) becomes:

(t + 1) \min_{0 \le k \le t} \|\nabla f(x^{(k)})\|_2^2 \le 2L \big( f(x^{(0)}) - f(x^{(t+1)}) \big).    (277)

As f^* is the minimum of the function, we have:

f(x^{(t+1)}) \ge f^* \implies 2L \big( f(x^{(0)}) - f(x^{(t+1)}) \big) \le 2L \big( f(x^{(0)}) - f^* \big).

Hence, Eq. (277) becomes:

(t + 1) \min_{0 \le k \le t} \|\nabla f(x^{(k)})\|_2^2 \le 2L \big( f(x^{(0)}) - f^* \big),

which gives Eq. (93). The right-hand side of Eq. (93) is of the order O(1/t), resulting in Eq. (94). Moreover, for convergence, we desire:

\min_{0 \le k \le t} \|\nabla f(x^{(k)})\|_2^2 \le \epsilon \overset{(93)}{\implies} \frac{2L \big( f(x^{(0)}) - f^* \big)}{t + 1} \le \epsilon,

which gives Eq. (95) by re-arranging for t. Q.E.D.

B.3. Proof for Theorem 2

\|x^{(k+1)} - x^*\|_2^2 \overset{(a)}{=} \Big\| x^{(k)} - x^* - \frac{1}{L} \nabla f(x^{(k)}) \Big\|_2^2
= \Big( x^{(k)} - x^* - \frac{1}{L} \nabla f(x^{(k)}) \Big)^\top \Big( x^{(k)} - x^* - \frac{1}{L} \nabla f(x^{(k)}) \Big)
= \|x^{(k)} - x^*\|_2^2 - \frac{2}{L} (x^{(k)} - x^*)^\top \nabla f(x^{(k)}) + \frac{1}{L^2} \|\nabla f(x^{(k)})\|_2^2,    (278)

where (a) is because of Eqs. (79) and (83). Moreover, according to Eq. (15), we have:
\big( \nabla f(x^{(k)}) - \nabla f(x^*) \big)^\top (x^{(k)} - x^*) \ge \frac{1}{L} \|\nabla f(x^{(k)}) - \nabla f(x^*)\|_2^2
\overset{(18)}{\implies} \nabla f(x^{(k)})^\top (x^{(k)} - x^*) \ge \frac{1}{L} \|\nabla f(x^{(k)})\|_2^2
\implies -\frac{2}{L} \nabla f(x^{(k)})^\top (x^{(k)} - x^*) \le -\frac{2}{L^2} \|\nabla f(x^{(k)})\|_2^2.    (279)

Using Eq. (279) in Eq. (278) gives:

\|x^{(k+1)} - x^*\|_2^2 \le \|x^{(k)} - x^*\|_2^2 - \frac{2}{L^2} \|\nabla f(x^{(k)})\|_2^2 + \frac{1}{L^2} \|\nabla f(x^{(k)})\|_2^2 = \|x^{(k)} - x^*\|_2^2 - \frac{1}{L^2} \|\nabla f(x^{(k)})\|_2^2.

This shows that, at every iteration of gradient descent, the distance of the point to x^* is decreasing; hence:

\|x^{(k)} - x^*\|_2^2 \le \|x^{(0)} - x^*\|_2^2.    (280)


As the function is convex, according to Eq. (5), we have:

f(x^*) \ge f(x^{(k)}) + \nabla f(x^{(k)})^\top (x^* - x^{(k)})
\implies f(x^{(k)}) - f(x^*) \le \nabla f(x^{(k)})^\top (x^{(k)} - x^*)
\overset{(a)}{\le} \|\nabla f(x^{(k)})\|_2 \|x^{(k)} - x^*\|_2 \overset{(280)}{\le} \|\nabla f(x^{(k)})\|_2 \|x^{(0)} - x^*\|_2
\implies -\frac{1}{2L} \frac{\big( f(x^{(k)}) - f(x^*) \big)^2}{\|x^{(0)} - x^*\|_2^2} \ge -\frac{1}{2L} \|\nabla f(x^{(k)})\|_2^2,    (281)

where (a) is because of the Cauchy-Schwarz inequality. Also, from Eq. (88), we have:

f(x^{(k+1)}) \le f(x^{(k)}) - \frac{1}{2L} \|\nabla f(x^{(k)})\|_2^2
\implies f(x^{(k+1)}) - f(x^*) \le f(x^{(k)}) - f(x^*) - \frac{1}{2L} \|\nabla f(x^{(k)})\|_2^2
\overset{(281)}{\le} f(x^{(k)}) - f(x^*) - \frac{1}{2L} \frac{\big( f(x^{(k)}) - f(x^*) \big)^2}{\|x^{(0)} - x^*\|_2^2}.    (282)

We define \delta_k := f(x^{(k)}) - f(x^*) and \mu := 1/(2L \|x^{(0)} - x^*\|_2^2). According to Eq. (88), we have:

\delta_{k+1} \le \delta_k \implies \frac{\mu \delta_k}{\delta_{k+1}} \ge \mu.    (283)

Eq. (282) can be restated as:

\delta_{k+1} \le \delta_k - \mu \delta_k^2 \implies \frac{\mu \delta_k}{\delta_{k+1}} \le \frac{1}{\delta_{k+1}} - \frac{1}{\delta_k}
\overset{(283)}{\implies} \mu \le \frac{1}{\delta_{k+1}} - \frac{1}{\delta_k} \implies \mu \sum_{k=0}^{t} (1) \le \sum_{k=0}^{t} \Big( \frac{1}{\delta_{k+1}} - \frac{1}{\delta_k} \Big).

The last term is a telescopic summation; hence:

\mu (t + 1) \le \sum_{k=0}^{t} \Big( \frac{1}{\delta_{k+1}} - \frac{1}{\delta_k} \Big) = \frac{1}{\delta_1} - \frac{1}{\delta_0} + \frac{1}{\delta_2} - \frac{1}{\delta_1} + \cdots + \frac{1}{\delta_{t+1}} - \frac{1}{\delta_t} = \frac{1}{\delta_{t+1}} - \frac{1}{\delta_0}
\overset{(a)}{\le} \frac{1}{\delta_{t+1}} \implies \delta_{t+1} \le \frac{1}{\mu (t + 1)}
\implies f(x^{(t+1)}) - f^* \le \frac{2L \|x^{(0)} - x^*\|_2^2}{t + 1},

where (a) is because \delta_0 = f(x^{(0)}) - f(x^*) \ge 0, since f^* is the minimum of the convex function. Q.E.D.

C. Proofs for Section 7

C.1. Proof for Theorem 8

The Lagrangian for problem (203) is:

L(x, \nu) = f(x) - \frac{1}{t} \sum_{i=1}^{m_1} \log(-y_i(x)) + \nu^\top (Ax - b).

According to Eq. (73), for parameter t, x^*(t) minimizes the Lagrangian:

\nabla_x L(x^*(t), \nu) = \nabla_x f(x^*(t)) + \sum_{i=1}^{m_1} \frac{-1}{t\, y_i(x^*(t))} \nabla_x y_i(x^*(t)) + A^\top \nu \overset{\text{set}}{=} 0
\implies \nabla_x L(x^*(t), \nu) = \nabla_x f(x^*(t)) + \sum_{i=1}^{m_1} \lambda_i^*(t) \nabla_x y_i(x^*(t)) + A^\top \nu = 0,    (284)

where we define \lambda_i^*(t) := -1/(t\, y_i(x^*(t))). Let \lambda^*(t) := [\lambda_1^*(t), \ldots, \lambda_{m_1}^*(t)]^\top and let \nu^*(t) be the optimal dual variable \nu for parameter t. We take the integral of Eq. (284) w.r.t. x to retrieve the Lagrangian again:

L(x, \lambda^*(t), \nu^*(t)) = f(x) + \sum_{i=1}^{m_1} \lambda_i^*(t)\, y_i(x) + \nu^*(t)^\top (Ax - b).    (285)

The dual function is:

g(\lambda, \nu) = \inf_x L(x, \lambda^*(t), \nu^*(t)).

The dual problem is:

\sup_{\lambda, \nu} g(\lambda, \nu) = \sup_{\lambda, \nu} \inf_x L(x, \lambda^*(t), \nu^*(t)) = L(x^*(t), \lambda^*(t), \nu^*(t))
\overset{(285)}{=} f(x^*(t)) + \sum_{i=1}^{m_1} \lambda_i^*(t)\, y_i(x^*(t)) + \nu^*(t)^\top \underbrace{(A x^*(t) - b)}_{=\,0}
\overset{(a)}{=} f(x^*(t)) - \frac{1}{t} \sum_{i=1}^{m_1} (1) = f(x^*(t)) - \frac{m_1}{t} = f^* - \frac{m_1}{t},

where (a) is because we had defined \lambda_i^*(t) = -1/(t\, y_i(x^*(t))), and A x^*(t) - b = 0 because the point is feasible. According to Eq. (51) for weak duality, we have \sup_{\lambda, \nu} g(\lambda, \nu) \le f_r^*, where f_r^* is the optimum of problem (203); hence, f^* - (m_1/t) \le f_r^*. On the other hand, as f_r^* is the optimum, we have:

f_r^* \overset{(203)}{=} f(x^*) - \frac{1}{t} \sum_{i=1}^{m_1} \log(-y_i(x^*)) \le f(x^*) = f^*.

Hence, overall, Eq. (204) is obtained. Q.E.D.


References

Absil, P-A, Mahony, Robert, and Sepulchre, Rodolphe. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

Alber, Ya I, Iusem, Alfredo N., and Solodov, Mikhail V. Onthe projected subgradient method for nonsmooth convexoptimization in a Hilbert space. Mathematical Program-ming, 81(1):23–35, 1998.

Allen-Zhu, Zeyuan, Li, Yuanzhi, and Liang, Yingyu.Learning and generalization in overparameterized neuralnetworks, going beyond two layers. Advances in neuralinformation processing systems, 2019a.

Allen-Zhu, Zeyuan, Li, Yuanzhi, and Song, Zhao.A convergence theory for deep learning via over-parameterization. In International Conference on Ma-chine Learning, pp. 242–252, 2019b.

Andrei, Neculai. A diagonal quasi-Newton updatingmethod for unconstrained optimization. Numerical Al-gorithms, 81(2):575–590, 2019.

Armijo, Larry. Minimization of functions having Lipschitzcontinuous first partial derivatives. Pacific Journal ofmathematics, 16(1):1–3, 1966.

Atkinson, David S and Vaidya, Pravin M. A cutting planealgorithm for convex programming that uses analyticcenters. Mathematical Programming, 69(1):1–43, 1995.

Aumann, Robert J and Maschler, Michael. Some thoughtson the minimax principle. Management Science, 18(5-part-2):54–63, 1972.

Avriel, Mordecai. Nonlinear programming: analysis andmethods. Courier Corporation, 2003.

Banaschewski, Bernhard and Maranda, Jean-Marie. Prox-imity functions. Mathematische Nachrichten, 23(1):1–37, 1961.

Bauschke, Heinz H and Borwein, Jonathan M. On projec-tion algorithms for solving convex feasibility problems.SIAM review, 38(3):367–426, 1996.

Beck, Amir. First-order methods in optimization. SIAM,2017.

Beck, Amir and Teboulle, Marc. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.SIAM journal on imaging sciences, 2(1):183–202, 2009.

Benders, Jacques F. Partitioning procedures for solvingmixed-variables programming problems. Numerischemathematik, 4(1):238–252, 1962.

Bendokat, Thomas, Zimmermann, Ralf, and Absil, P-A. A Grassmann manifold handbook: Basic ge-ometry and computational aspects. arXiv preprintarXiv:2011.13699, 2020.

Bertsekas, Dimitri P. The method of multipliers for equal-ity constrained problems. Constrained optimization andLagrange multiplier methods, pp. 96–157, 1982.

Boggs, Paul T and Tolle, Jon W. Sequential quadratic pro-gramming. Acta numerica, 4:1–51, 1995.

Bottou, Leon, Curtis, Frank E, and Nocedal, Jorge. Op-timization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018.

Bottou, Leon et al. Online learning and stochastic approx-imations. On-line learning in neural networks, 17(9):142, 1998.

Boumal, Nicolas. An introduction to optimization onsmooth manifolds. Available online, 2020.

Boyd, Stephen and Mutapcic, Almir. Stochastic subgra-dient methods. Technical report, Lecture Notes forEE364b, Stanford University, 2008.

Boyd, Stephen and Vandenberghe, Lieven. Convex opti-mization. Cambridge university press, 2004.

Boyd, Stephen and Vandenberghe, Lieven. Localizationand cutting-plane methods. Technical report, StanfordEE 364b lecture notes, 2007.

Boyd, Stephen, Parikh, Neal, and Chu, Eric. Distributedoptimization and statistical learning via the alternatingdirection method of multipliers. Now Publishers Inc,2011.

Broyden, Charles G. A class of methods for solving non-linear simultaneous equations. Mathematics of compu-tation, 19(92):577–593, 1965.

Bubeck, Sebastien. Convex optimization: Algorithms andcomplexity. arXiv preprint arXiv:1405.4980, 2014.

Burden, Richard L. and Faires, J. Douglas. NumericalAnalysis. PWS Publishers, 1963.

Chen, Annie I and Ozdaglar, Asuman. A fast distributedproximal-gradient method. In 2012 50th Annual AllertonConference on Communication, Control, and Computing(Allerton), pp. 601–608. IEEE, 2012.

Chen, Jiabin, Yuan, Rui, Garrigos, Guillaume, and Gower,Robert M. SAN: Stochastic average Newton algo-rithm for minimizing finite sums. arXiv preprintarXiv:2106.10520, 2021.


Chong, Edwin KP and Zak, Stanislaw H. An introductionto optimization. John Wiley & Sons, 2004.

Chong, Yidong D. Complex methods for the sciences.Technical report, Nanyang Technological University,2021.

Conn, Andrew R, Gould, Nicholas IM, and Toint, Ph L.Convergence of quasi-newton matrices generated by thesymmetric rank one update. Mathematical program-ming, 50(1):177–195, 1991.

Conn, Andrew R, Gould, Nicholas IM, and Toint,Philippe L. Trust region methods. SIAM, 2000.

Curry, Haskell B. The method of steepest descent fornon-linear minimization problems. Quarterly of AppliedMathematics, 2(3):258–261, 1944.

Dai, Yu-Hong and Yuan, Yaxiang. A nonlinear conju-gate gradient method with a strong global convergenceproperty. SIAM Journal on optimization, 10(1):177–182,1999.

Dantzig, George. Linear programming and extensions.Princeton university press, 1963.

Dantzig, George B. Reminiscences about the origins oflinear programming. In Mathematical Programming TheState of the Art, pp. 78–86. Springer, 1983.

Dantzig, George B and Wolfe, Philip. Decomposition prin-ciple for linear programs. Operations research, 8(1):101–111, 1960.

Davidon, William C. Variable metric method for minimiza-tion. SIAM Journal on Optimization, 1(1):1–17, 1991.

Dennis Jr, John E and Schnabel, Robert B. Numericalmethods for unconstrained optimization and nonlinearequations. SIAM, 1996.

Di Pillo, Gianni. Exact penalty methods. In Algorithms forContinuous Optimization, pp. 209–253. Springer, 1994.

Dikin, I.I. Iterative solution of problems of linear andquadratic programming. In Doklady Akademii Nauk,volume 174, pp. 747–748. Russian Academy of Sci-ences, 1967.

Dinh, Quoc Tran and Diehl, Moritz. Local convergenceof sequential convex programming for nonconvex op-timization. In Recent Advances in Optimization andits Applications in Engineering, pp. 93–102. Springer,2010.

Domingos, Pedro. The role of Occam’s razor in knowledgediscovery. Data mining and knowledge discovery, 3(4):409–425, 1999.

Donoho, David L. For most large underdetermined systemsof linear equations the minimal `1-norm solution is alsothe sparsest solution. Communications on Pure and Ap-plied Mathematics: A Journal Issued by the Courant In-stitute of Mathematical Sciences, 59(6):797–829, 2006.

Drummond, LM Grana and Iusem, Alfredo N. A projectedgradient method for vector optimization problems. Com-putational Optimization and applications, 28(1):5–29,2004.

Du, Ding-Zhu and Pardalos, Panos M. Minimax and appli-cations, volume 4. Springer Science & Business Media,2013.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptivesubgradient methods for online learning and stochasticoptimization. Journal of machine learning research, 12(7), 2011.

Duchi, John, Boyd, Stephen, and Mattingley, Jacob. Se-quential convex programming. Technical report, Notesfor EE364b, Stanford University, 2018.

Edelman, Alan, Arias, Tomas A, and Smith, Steven T. Thegeometry of algorithms with orthogonality constraints.SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

Everett III, Hugh. Generalized Lagrange multiplier methodfor solving problems of optimum allocation of resources.Operations research, 11(3):399–417, 1963.

Fan, Ky. Maximum properties and inequalities for theeigenvalues of completely continuous operators. Pro-ceedings of the National Academy of Sciences of theUnited States of America, 37(11):760, 1951.

Feizi, Soheil, Javadi, Hamid, Zhang, Jesse, and Tse, David.Porcupine neural networks:(almost) all local optima areglobal. arXiv preprint arXiv:1710.02196, 2017.

Fercoq, Olivier and Richtarik, Peter. Accelerated, paral-lel, and proximal coordinate descent. SIAM Journal onOptimization, 25(4):1997–2023, 2015.

Fiacco, Anthony V and McCormick, Garth P. The se-quential unconstrained minimization technique (SUMT)without parameters. Operations Research, 15(5):820–827, 1967.

Fletcher, Reeves and Reeves, Colin M. Function minimiza-tion by conjugate gradients. The computer journal, 7(2):149–154, 1964.

Fletcher, Roger. Practical methods of optimization. JohnWiley & Sons, 1987.


Frank, Marguerite and Wolfe, Philip. An algorithm forquadratic programming. Naval research logistics quar-terly, 3(1-2):95–110, 1956.

Friedman, Jerome, Hastie, Trevor, Tibshirani, Robert, et al.The elements of statistical learning, volume 1. Springerseries in statistics New York, 2001.

Gabay, Daniel and Mercier, Bertrand. A dual algorithm forthe solution of nonlinear variational problems via finiteelement approximation. Computers & mathematics withapplications, 2(1):17–40, 1976.

Geman, Stuart and Geman, Donald. Stochastic relaxation,Gibbs distributions, and the Bayesian restoration of im-ages. IEEE Transactions on pattern analysis and ma-chine intelligence, (6):721–741, 1984.

Ghojogh, Benyamin and Crowley, Mark. The the-ory behind overfitting, cross validation, regulariza-tion, bagging, and boosting: tutorial. arXiv preprintarXiv:1905.12787, 2019.

Ghojogh, Benyamin, Nekoei, Hadi, Ghojogh, Aydin, Kar-ray, Fakhri, and Crowley, Mark. Sampling algorithms,from survey sampling to Monte Carlo methods: Tutorialand literature review. arXiv preprint arXiv:2011.00901,2020.

Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, andCrowley, Mark. Reproducing kernel Hilbert space, Mer-cer’s theorem, eigenfunctions, Nystrom method, and useof kernels in machine learning: Tutorial and survey.arXiv preprint arXiv:2106.08443, 2021.

Giesen, Joachim and Laue, Soren. Distributed convex op-timization with many convex constraints. arXiv preprintarXiv:1610.02967, 2016.

Giesen, Joachim and Laue, Soren. Combining ADMM andthe augmented Lagrangian method for efficiently han-dling many constraints. In International Joint Confer-ence on Artificial Intelligence, pp. 4525–4531, 2019.

Glowinski, R and Marrocco, A. Finite element approxi-mation and iterative methods of solution for 2-D non-linear magnetostatic problems. In Proceeding of Inter-national Conference on the Computation of Electromag-netic Fields (COMPUMAG), 1976.

Goffin, Jean-Louis and Vial, Jean-Philippe. On the com-putation of weighted analytic centers and dual ellipsoidswith the projective algorithm. Mathematical Program-ming, 60(1):81–92, 1993.

Golub, Gene H and Van Loan, Charles F. Matrix computa-tions, volume 3. JHU press, 2013.

Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron.Deep learning. MIT press, 2016.

Gower, Robert M. Convergence theorems for gradient de-scent. Lecture notes for Statistical Optimization, 2018.

Grant, Michael, Boyd, Stephen, and Ye, Yinyu. CVX: Mat-lab software for disciplined convex programming, 2009.

Hadamard, Jacques. Memoire sur le probleme d’analyse re-latif a l’equilibre des plaques elastiques encastrees, vol-ume 33. Imprimerie nationale, 1908.

Hastie, Trevor, Tibshirani, Robert, and Wainwright, Mar-tin. Statistical learning with sparsity: the lasso and gen-eralizations. Chapman and Hall/CRC, 2019.

Hestenes, Magnus R. Multiplier and gradient methods.Journal of optimization theory and applications, 4(5):303–320, 1969.

Hestenes, Magnus Rudolph and Stiefel, Eduard. Methodsof conjugate gradients for solving linear systems, vol-ume 49. NBS Washington, DC, 1952.

Hinton, Geoffrey, Srivastava, Nitish, and Swersky, Kevin.Neural networks for machine learning lecture 6aoverview of mini-batch gradient descent. Technical re-port, Department of Computer Science, University ofToronto, 2012.

Hjorungnes, Are and Gesbert, David. Complex-valued ma-trix differentiation: Techniques and key results. IEEETransactions on Signal Processing, 55(6):2740–2746,2007.

Holder, Otto. Ueber einen mittelwerthabsatz. Nachrichtenvon der Konigl. Gesellschaft der Wissenschaften undder Georg-Augusts-Universitat zu Gottingen, 1889:38–47, 1889.

Holland, John Henry et al. Adaptation in natural and artifi-cial systems: an introductory analysis with applicationsto biology, control, and artificial intelligence. MIT press,1992.

Hosseini, Reshad and Sra, Suvrit. An alternative to EM forGaussian mixture models: batch and stochastic Rieman-nian optimization. Mathematical Programming, 181(1):187–223, 2020a.

Hosseini, Reshad and Sra, Suvrit. Recent advances instochastic Riemannian optimization. Handbook of Vari-ational Methods for Nonlinear Geometric Data, pp.527–554, 2020b.

Hu, Jiang, Liu, Xin, Wen, Zai-Wen, and Yuan, Ya-Xiang.A brief introduction to manifold optimization. Journalof the Operations Research Society of China, 8(2):199–248, 2020.


Huber, Peter J. Robust estimation of a location parameter.In Breakthroughs in statistics, pp. 492–518. Springer,1992.

Iusem, Alfredo N. On the convergence properties ofthe projected gradient method for convex optimiza-tion. Computational & Applied Mathematics, 22:37–52,2003.

Jain, Prateek and Kar, Purushottam. Non-convex op-timization for machine learning. arXiv preprintarXiv:1712.07897, 2017.

Johnson, Rie and Zhang, Tong. Accelerating stochasticgradient descent using predictive variance reduction. Ad-vances in neural information processing systems, 26:315–323, 2013.

Karush, William. Minima of functions of several vari-ables with inequalities as side constraints. Master’s the-sis, Department of Mathematics, University of Chicago,Chicago, Illinois, 1939.

Kelley, Carl T. Iterative methods for linear and nonlinearequations. SIAM, 1995.

Kennedy, James and Eberhart, Russell. Particle swarmoptimization. In Proceedings of ICNN’95-internationalconference on neural networks, volume 4, pp. 1942–1948. IEEE, 1995.

Khachiyan, Leonid Genrikhovich. A polynomial algorithmin linear programming. In Doklady Akademii Nauk, vol-ume 244, pp. 1093–1096. Russian Academy of Sciences,1979.

Kingma, Diederik P and Ba, Jimmy. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980, 2014.

Kjeldsen, Tinne Hoff. A contextualized historical analysisof the Kuhn–Tucker theorem in nonlinear programming:the impact of world war II. Historia mathematica, 27(4):331–361, 2000.

Konecny, Jakub, McMahan, Brendan, and Ramage,Daniel. Federated optimization: Distributed op-timization beyond the datacenter. arXiv preprintarXiv:1511.03575, 2015.

Krylov, AN. On the numerical solution of equation bywhich are determined in technical problems the frequen-cies of small vibrations of material systems. News Acad.Sci. USSR, 7:491–539, 1931.

Kuhn, Harold W and Tucker, Albert W. Nonlinear pro-gramming. In Berkeley Symposium on MathematicalStatistics and Probability, pp. 481–492. Berkeley: Uni-versity of California Press, 1951.

Land, Ailsa H and Doig, Alison G. An automatic methodfor solving discrete programming problems. Economet-rica, 28(3):497–520, 1960.

Lee, John A and Verleysen, Michel. Nonlinear dimension-ality reduction. Springer Science & Business Media,2007.

Lee, John M. Quotient manifolds. In Introduction toSmooth Manifolds, pp. 540–563. Springer, 2013.

Lee, Yin Tat and Sidford, Aaron. Efficient accelerated co-ordinate descent methods and faster algorithms for solv-ing linear systems. In 2013 ieee 54th annual symposiumon foundations of computer science, pp. 147–156. IEEE,2013.

Lemarechal, Claude. Cauchy and the gradient method. DocMath Extra, 251(254):10, 2012.

Lemarechal, Claude and Sagastizabal, Claudia. Practicalaspects of the Moreau–Yosida regularization: Theoreti-cal preliminaries. SIAM journal on optimization, 7(2):367–385, 1997.

Levitin, Evgeny S and Polyak, Boris T. Constrained min-imization methods. USSR Computational mathematicsand mathematical physics, 6(5):1–50, 1966.

Li, Qiuwei, Zhu, Zhihui, and Tang, Gongguo. Alternat-ing minimizations converge to second-order optimal so-lutions. In International Conference on Machine Learn-ing, pp. 3935–3943, 2019.

Li, Tian, Sahu, Anit Kumar, Talwalkar, Ameet, and Smith,Virginia. Federated learning: Challenges, methods, andfuture directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.

Liu, Dong C and Nocedal, Jorge. On the limited memoryBFGS method for large scale optimization. Mathemati-cal programming, 45(1):503–528, 1989.

Liu, Yusha, Wang, Yining, and Singh, Aarti. Smooth banditoptimization: Generalization to Holder space. In Inter-national Conference on Artificial Intelligence and Statis-tics, pp. 2206–2214, 2021.

Luo, Zhi-Quan and Tseng, Paul. On the convergence ofthe coordinate descent method for convex differentiableminimization. Journal of Optimization Theory and Ap-plications, 72(1):7–35, 1992.

Luo, Zhi-Quan and Tseng, Paul. Error bounds and con-vergence analysis of feasible descent methods: a generalapproach. Annals of Operations Research, 46(1):157–178, 1993.


Magnus, Jan R and Neudecker, Heinz. Matrix differentialcalculus with applications to simple, hadamard, and kro-necker products. Journal of Mathematical Psychology,29(4):474–492, 1985.

Moreau, Jean Jacques. Decomposition orthogonale d’unespace Hilbertien selon deux cones mutuellement po-laires. Comptes rendus hebdomadaires des seances del’Academie des sciences, 255:238–240, 1962.

Moreau, Jean Jacques. Proximite et dualite dans un es-pace hilbertien. Bulletin de la Societe mathematique deFrance, 93:273–299, 1965.

Nash, Stephen G. A survey of truncated-Newton methods.Journal of computational and applied mathematics, 124(1-2):45–59, 2000.

Nesterov, Yurii. A method for solving the convex program-ming problem with convergence rate O(1/kˆ2). In Dokl.Akad. Nauk SSSR, volume 269, pp. 543–547, 1983.

Nesterov, Yurii. On an approach to the construction of op-timal methods of minimization of smooth convex func-tions. Ekonomika i Mateaticheskie Metody, 24(3):509–517, 1988.

Nesterov, Yurii. Cutting plane algorithms from analyticcenters: efficiency estimates. Mathematical Program-ming, 69(1):149–176, 1995.

Nesterov, Yurii. Introductory lectures on convex program-ming volume I: Basic course. Lecture notes, 3(4):5,1998.

Nesterov, Yurii. Introductory lectures on convex optimiza-tion: A basic course, volume 87. Springer Science &Business Media, 2003.

Nesterov, Yurii. Smooth minimization of non-smooth func-tions. Mathematical programming, 103(1):127–152,2005.

Nesterov, Yurii. Gradient methods for minimizing com-posite functions. Mathematical Programming, 140(1):125–161, 2013.

Nesterov, Yurii. Lectures on convex optimization, volume137. Springer, 2018.

Nesterov, Yurii and Nemirovskii, Arkadii. Interior-pointpolynomial algorithms in convex programming. SIAM,1994.

Nocedal, Jorge. Updating quasi-Newton matrices with lim-ited storage. Mathematics of computation, 35(151):773–782, 1980.

Nocedal, Jorge and Wright, Stephen. Numerical optimiza-tion. Springer Science & Business Media, 2 edition,2006.

Otero, Daniel, La Torre, Davide, Michailovich, Oleg V, andVrscay, Edward R. Alternate direction method of mul-tipliers for unconstrained structural similarity-based op-timization. In International Conference Image Analysisand Recognition, pp. 20–29. Springer, 2018.

Parikh, Neal and Boyd, Stephen. Proximal algorithms.Foundations and Trends in optimization, 1(3):127–239,2014.

Petersen, Kaare Brandt and Pedersen, Michael Syskind.The matrix cookbook. Technical University of Denmark,15, 2012.

Polak, Elijah and Ribiere, Gerard. Note sur la conver-gence de methodes de directions conjuguees. ESAIM:Mathematical Modelling and Numerical Analysis-Modelisation Mathematique et Analyse Numerique, 3(R1):35–43, 1969.

Potra, Florian A and Wright, Stephen J. Interior-pointmethods. Journal of computational and applied math-ematics, 124(1-2):281–302, 2000.

Powell, Michael JD. A method for nonlinear constraintsin minimization problems. Optimization, pp. 283–298,1969.

Rebennack, Steffen. Ellipsoid method. Encyclopedia ofOptimization, pp. 890–899, 2009.

Reddi, Sashank J, Sra, Suvrit, Poczos, Barnabas, andSmola, Alex. Stochastic Frank-Wolfe methods for non-convex optimization. In 2016 54th Annual AllertonConference on Communication, Control, and Comput-ing (Allerton), pp. 1244–1251. IEEE, 2016.

Riedmiller, Martin and Braun, Heinrich. Rprop-a fast adap-tive learning algorithm. In Proceedings of the Interna-tional Symposium on Computer and Information ScienceVII, 1992.

Robbins, Herbert and Monro, Sutton. A stochastic approx-imation method. The annals of mathematical statistics,pp. 400–407, 1951.

Rockafellar, R Tyrrell. Monotone operators and the proxi-mal point algorithm. SIAM journal on control and opti-mization, 14(5):877–898, 1976.

Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. Astochastic gradient method with an exponential conver-gence rate for finite training sets. In Advances in NeuralInformation Processing Systems, volume 25, 2012.


Rumelhart, David E, Hinton, Geoffrey E, and Williams,Ronald J. Learning representations by back-propagatingerrors. nature, 323(6088):533–536, 1986.

Sammon, John W. A nonlinear mapping for data structureanalysis. IEEE Transactions on computers, 100(5):401–409, 1969.

Schmidt, Mark, Roux, Nicolas Le, and Bach, Francis. Con-vergence rates of inexact proximal-gradient methods forconvex optimization. arXiv preprint arXiv:1109.2415,2011.

Schmidt, Mark, Le Roux, Nicolas, and Bach, Francis. Min-imizing finite sums with the stochastic average gradient.Mathematical Programming, 162(1-2):83–112, 2017.

Shor, Naum Zuselevich. Cut-off method with space exten-sion in convex programming problems. Cybernetics, 13(1):94–96, 1977.

Shor, Naum Zuselevich. Nondifferentiable optimizationand polynomial problems, volume 24. Springer Science& Business Media, 1998.

Shor, Naum Zuselevich. Minimization methods for non-differentiable functions, volume 3. Springer Science &Business Media, 2012.

Simon, Dan. Evolutionary optimization algorithms. JohnWiley & Sons, 2013.

Slater, Morton. Lagrange multipliers revisited. Technicalreport, Cowles Commission Discussion Paper: Mathe-matics 403, Yale University, 1950.

Soltanolkotabi, Mahdi, Javanmard, Adel, and Lee, Ja-son D. Theoretical insights into the optimization land-scape of over-parameterized shallow neural networks.IEEE Transactions on Information Theory, 65(2):742–769, 2018.

Sra, Suvrit and Hosseini, Reshad. Conic geometric opti-mization on the manifold of positive definite matrices.SIAM Journal on Optimization, 25(1):713–739, 2015.

Steele, J Michael. The Cauchy-Schwarz master class:an introduction to the art of mathematical inequalities.Cambridge University Press, 2004.

Stoer, Josef and Bulirsch, Roland. Introduction to numer-ical analysis, volume 12. Springer Science & BusinessMedia, 2013.

Su, Weijie, Boyd, Stephen, and Candes, Emmanuel J. Adifferential equation for modeling Nesterov’s acceler-ated gradient method: Theory and insights. The Journalof Machine Learning Research, 17(1):5312–5354, 2016.

Talbi, El-Ghazali. Metaheuristics: from design to imple-mentation, volume 74. John Wiley & Sons, 2009.

Tibshirani, Robert. Regression shrinkage and selection viathe lasso. Journal of the Royal Statistical Society: SeriesB (Methodological), 58(1):267–288, 1996.

Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average ofits recent magnitude. COURSERA: Neural networks formachine learning, 4(2):26–31, 2012.

Tikhomirov, Vladimir M. The evolution of methods of con-vex optimization. The American Mathematical Monthly,103(1):65–71, 1996.

Tseng, Paul. Convergence of a block coordinate descentmethod for nondifferentiable minimization. Journal ofoptimization theory and applications, 109(3):475–494,2001.

Tseng, Paul. On accelerated proximal gradient methodsfor convex-concave optimization. Technical report, MITUniversity, 2008.

Wang, Yu, Yin, Wotao, and Zeng, Jinshan. Global conver-gence of ADMM in nonconvex nonsmooth optimization.Journal of Scientific Computing, 78(1):29–63, 2019.

Wolfe, Philip. Convergence conditions for ascent methods.SIAM review, 11(2):226–235, 1969.

Wolfe, Philip. Invited note—some references for the ellip-soid algorithm. Management Science, 26(8):747–749,1980.

Wright, Margaret. The interior-point revolution in opti-mization: history, recent developments, and lasting con-sequences. Bulletin of the American mathematical soci-ety, 42(1):39–56, 2005.

Wright, Stephen J. Coordinate descent algorithms. Mathe-matical Programming, 151(1):3–34, 2015.

Wu, Tong Tong and Lange, Kenneth. Coordinate descentalgorithms for lasso penalized regression. The Annals ofApplied Statistics, 2(1):224–244, 2008.

Yang, Xin-She. Nature-inspired metaheuristic algorithms.Luniver press, 2010.

Yosida, Kosaku. Functional analysis. Springer Berlin Hei-delberg, 1965.

Yudin, DB and Nemirovski, Arkadi S. Informational com-plexity and efficient methods for the solution of convexextremal problems. Ekon Math Metod, English transla-tion: Matekon, 13(2):22–45, 1976.


Yudin, DB and Nemirovski, AS. Evaluation of the informa-tional complexity of mathematical programming prob-lems. Ekon Math Metod, English translation: Matekon,13(2):3–24, 1977a.

Yudin, DB and Nemirovski, AS. Optimization methodsadapting to the “significant” dimension of the problem.Autom Telemekhanika, English translation: Automationand Remote Control, 38(4):513–524, 1977b.

Zeng, Jinshan, Zha, Yixuan, Ma, Ke, and Yao,Yuan. On stochastic variance reduced gradientmethod for semidefinite optimization. arXiv preprintarXiv:2101.00236, 2021.

Zhang, Fuzhen. The Schur complement and its applica-tions, volume 4. Springer Science & Business Media,2006.

Zou, Fangyu, Shen, Li, Jie, Zequn, Zhang, Weizhong, andLiu, Wei. A sufficient condition for convergences ofAdam and RMSProp. In Proceedings of the IEEE/CVFConference on Computer Vision and Pattern Recogni-tion, pp. 11127–11135, 2019.