
First Order Optimization Methods

Lecture 1 - Introduction and Basic Results

Marc Teboulle

School of Mathematical SciencesTel Aviv University

PGMO Lecture Series

January 25-26, 2017 Ecole Polytechnique, Paris


Goals

• Deriving simple methods capable of solving very large-scale problems

• Amenable to theoretical analysis: convergence/complexity

Exploit problem structures and data information: convex and nonconvex models

3 ELEMENTARY PRINCIPLES

• Approximation • Regularization • Decomposition


Outline

PGMO Lecture Series, Fondation Mathématique Jacques Hadamard and Ecole Polytechnique

1. Introduction: Basic Schemes/Examples

2. The Proximal Map

3. Proximal Gradient Methods

4. Fast Proximal Gradient

5. Smoothing Methods

6. Lagrangian and Decomposition Methods

7. FOM Beyond Lipschitz Gradient Continuity

8. FOM beyond Convexity

Applications and examples illustrating ideas and methods

Marc Teboulle First Order Optimization Methods Lecture 1 - Introduction and Basic Results 3 / 22

Opening Remark and Credit

More than 386 years ago... In 1629, Fermat suggested the following:

• Given f, solve for x:

[ (f(x + d) − f(x)) / d ]_{d=0} = 0

...We can hardly expect to find a more general method to get the maximum or minimum points on a curve...

Pierre de Fermat


Simple Minimization Methods

Practical Side – Simplicity/Scalability

• Simple computational operations: additions, multiplications

• Explicit iterations

• Avoid nested optimization schemes/control-correction of accumulated errors

• Minimal storage of data

Theoretical Side – Convergence/Complexity Analysis

• Free from heuristic choices of extra parameters

• Versatile mathematical analytic tools, broadly applicable... and painless!

• Complexity: nearly independent of dimension

• Performance: reasonable for medium accuracy

Natural Candidates: Schemes based on First Order Methods


First-Order Methods

First-order methods are iterative algorithms that exploit only information on the objective function and its gradient (or a subgradient).

• A main drawback: can be very slow at producing high-accuracy solutions... but they share many advantages:

• Require minimal data information

• Often lead to very simple and "cheap" iterative schemes

• Provable complexity/efficiency nearly independent of dimension

• Suitable for large-scale problems when high accuracy is not crucial. [In many large-scale applications, the data is anyway corrupted or known only roughly.]


First Order Based Algorithms

Widely used in applications....

• Clustering analysis: the k-means algorithm

• Neuro-computing: the backpropagation algorithm

• Statistical estimation: the EM (Expectation-Maximization) algorithm

• Machine learning: SVM, regularized regression, etc.

• Signal and image processing: sparse recovery, denoising and deblurring schemes, total variation minimization...

• Matrix minimization problems: low rank, matrix completion, network sensor localization...

• ...and much more


Algorithms - Short Historical Development

• Fixed point methods [Babylonian times!/Heron for square roots, Picard, Banach, Weiszfeld '37]

• Coordinate descent/Alternating Minimization [Gauss-Seidel 1798]

• Gradient/subgradient methods [Cauchy 1847, Shor '61, Rosen '63, Frank-Wolfe '56, Polyak '62]

• Stochastic gradients [Robbins and Monro '51]

• Proximal algorithms [Martinet '70, Rockafellar '76, Passty '79]

• Penalty/barrier methods [Courant '49, Fiacco-McCormick '66]

• Augmented Lagrangians and splitting [Arrow-Hurwicz '58, Hestenes-Powell '69, Goldstein-Tretyakov '72, Rockafellar '74, Lions-Mercier '79, Fortin-Glowinski '76, Bertsekas '82]

• Extragradient methods for VI [Korpelevich '76, Konnov '80]

• Optimal gradient schemes [Nemirovski-Yudin '81, Nesterov '83]


Anecdote: The World’s Simplest Impossible problem

[From C. Moler (1990)]

Problem: The average of two numbers is 3. What are the numbers?

• Typical answers: (2,4), (1,5), (-3,9)... These already ask for "structure": equal distance from the average, integer numbers...

• Why not (2.71828, 3.28172)!?

• A nice one: (3,3)... it has "minimal norm", and it is unique!

• Simplest: (6,0) or (0,6)? A sparse one! But here, lack of uniqueness!

This simple problem captures the essence of many ill-posed/underdetermined problems in applications.

Additional requirements/constraints have to be specified to make this a reasonable mathematical/computational task, and they often lead to interesting optimization models.
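The two structured answers above can be computed rather than guessed. A minimal Python sketch (our own illustration, not part of the lecture): the minimum-norm point on the constraint set spreads the mass evenly, while a sparse point concentrates it on one coordinate.

```python
# Two "answers" to the underdetermined problem (x1 + x2)/2 = 3,
# illustrating how extra structure selects a solution.

def min_norm_solution(avg, n=2):
    """Minimum Euclidean-norm point on the affine set sum(x) = n*avg:
    by symmetry (projecting 0 onto the hyperplane), every coordinate
    equals the average."""
    return [avg] * n

def a_sparse_solution(avg, n=2):
    """A 1-sparse point on the same set: put the whole mass on one
    coordinate. Not unique -- any coordinate works."""
    return [n * avg] + [0.0] * (n - 1)

print(min_norm_solution(3))   # [3, 3]
print(a_sparse_solution(3))   # [6, 0.0]
```

Both candidates satisfy the averaging constraint; which one is "right" depends entirely on the structure we impose, exactly as the anecdote suggests.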


Linear Inverse Problems - Source for Optimization

Problem: Find x ∈ C ⊂ E which "best" solves A(x) ≈ b, where A : E → F and b (the observable output) are known.

Approach via Optimization – Regularization Models

• ρ(x) is a "regularizer" (one function, or a sum of functions; convex or nonconvex)
• d(b, A(x)) is some "proximity" measure from b to A(x)

min{ρ(x) : A(x) = b, x ∈ C}  or  min{ρ(x) : d(b, A(x)) ≤ ε, x ∈ C}

min{d(b, A(x)) : ρ(x) ≤ δ, x ∈ C}  or  min{d(b, A(x)) + µρ(x) : x ∈ C}, µ > 0

• The choices of ρ(·) and d(·, ·) depend on the application at hand.

• Nonsmooth, and often nonconvex, regularizers ρ are useful to describe desired features.

• Intensive research activity over the past 50 years.

• Today even more so, with emerging new technologies and increasing computer power.
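To make the penalized model min{d(b, A(x)) + µρ(x)} concrete, here is a scalar sketch of our own (the particular choices of d and ρ are ours, not prescribed by the slides): d the squared error and ρ(x) = x², a convex quadratic regularizer, for which the minimizer has a closed form obtained by setting the derivative to zero.

```python
# Scalar instance of min (a*x - b)^2 + mu*x^2:
# 2a(ax - b) + 2*mu*x = 0  =>  x = a*b / (a^2 + mu).

def ridge_scalar(a, b, mu):
    """argmin over x of (a*x - b)**2 + mu * x**2, for mu >= 0."""
    return a * b / (a * a + mu)

x0 = ridge_scalar(1.0, 4.0, 0.0)   # mu -> 0 recovers the pure data fit x = b/a
x1 = ridge_scalar(1.0, 4.0, 3.0)   # a larger mu shrinks the solution toward 0
print(x0, x1)  # 4.0 1.0
```

The parameter µ trades data fidelity against the regularizer, the same trade-off the four model formulations above express via constraints or penalties.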


Example: Sparsity is a Common Desired Feature/Structure

It arises in many applications:

• Sparse learning: feature selection, support vector machines, PCA, ...

• Compressive sensing: recover a signal from few measurements

• Truss topology design: remove bars that are not needed

• Image processing: denoising, deblurring, ... and much more

Example. Let d(b, A(x)) := ‖b − A(x)‖2, ρ(x) := ‖x‖0.

Find x ∈ Rd which is sparsest, or at least δ-sparse:

min{‖x‖0 : ‖b − A(x)‖2 ≤ ε, x ∈ Rd};  min{‖b − A(x)‖2 : ‖x‖0 ≤ δ, x ∈ Rd}

where ‖x‖0 denotes the number of nonzero components of x.

This can be Hard (despite the convex objective/constraint!).

Approaches

• Convex relaxation/approximation: replace ‖x‖0 by a more tractable object. The l1-norm ‖x‖1 has been well known (since the 70's) to promote sparsity. Nonconvex (concave) approximations are also relevant.

• Tackle the nonconvex problem directly, "as is"? More on this later...
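As a small numerical illustration of the l1 relaxation (a toy example of ours, not from the slides): over the solution set of a single underdetermined equation, the l1-minimal point lands on a sparse solution.

```python
# Solutions of x1 + 2*x2 = 2 form a line; parametrize it as
# x = (2 - 2t, t) and minimize ‖x‖1 over a fine grid of t.

def l0(x, tol=1e-9):
    """‖x‖0: number of (numerically) nonzero components."""
    return sum(1 for xi in x if abs(xi) > tol)

def l1(x):
    """‖x‖1: sum of absolute values."""
    return sum(abs(xi) for xi in x)

best = min(((2 - 2 * t / 100, t / 100) for t in range(-300, 301)), key=l1)
print(best, l1(best), l0(best))  # the l1 minimizer (0.0, 1.0) is 1-sparse
```

Here minimizing ‖x‖1 picks the 1-sparse solution (0, 1), a tiny instance of the sparsity-promoting behavior the slide refers to.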


Example: Image Processing

Natural images are not random! Large areas of nearly constant intensity/color separated by sharp edges. Pixels are noisy, and the objective is to find a "good-looking" nearby image.

• g = λ‖L(·)‖1 - the l1-regularized problem:

min_x {f(x) + λ‖Lx‖1};  f(x) := d(b, Ax)

L - identity, differential operator, wavelet transform...

• g = TV(·) - Total Variation-based regularization: grid points (i, j) should have the same intensity as their neighbors [ROF 1992].

min_x {f(x) + λ TV(x)}

1-dim TV: TV(x) = ∑i |xi − xi+1|

2-dim TV:

isotropic: TV(x) = ∑i ∑j √((xi,j − xi+1,j)² + (xi,j − xi,j+1)²)

anisotropic: TV(x) = ∑i ∑j (|xi,j − xi+1,j| + |xi,j − xi,j+1|)
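The TV formulas above translate directly into code. A minimal pure-Python sketch of ours (boundary differences are simply omitted, one common convention):

```python
import math

def tv1d(x):
    """1-D total variation: sum over i of |x_i - x_{i+1}|."""
    return sum(abs(a - b) for a, b in zip(x, x[1:]))

def tv2d_aniso(X):
    """Anisotropic 2-D TV: sum of |vertical| + |horizontal| differences."""
    m, n = len(X), len(X[0])
    s = 0.0
    for i in range(m):
        for j in range(n):
            if i + 1 < m:
                s += abs(X[i][j] - X[i + 1][j])
            if j + 1 < n:
                s += abs(X[i][j] - X[i][j + 1])
    return s

def tv2d_iso(X):
    """Isotropic 2-D TV: sum of sqrt(vertical^2 + horizontal^2)."""
    m, n = len(X), len(X[0])
    s = 0.0
    for i in range(m):
        for j in range(n):
            dv = X[i + 1][j] - X[i][j] if i + 1 < m else 0.0
            dh = X[i][j + 1] - X[i][j] if j + 1 < n else 0.0
            s += math.sqrt(dv * dv + dh * dh)
    return s

# A piecewise-constant signal has small TV; an oscillating one does not.
print(tv1d([1, 1, 1, 5, 5]))  # 4
print(tv1d([1, 3, 1, 3, 1]))  # 8
```

This is why TV penalties favor the large flat regions with sharp edges described above: jumps are paid for once, while oscillations are paid for repeatedly.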


An Icon... The Gradient Method - Cauchy 1847

We begin with the simplest problem: unconstrained minimization of a continuously differentiable function f on E (a finite-dimensional Euclidean space):

(U) min{f (x) : x ∈ E}.

The basic gradient method generates a sequence {xk} via

x0 ∈ E, xk = xk−1 − tk∇f (xk−1) (k ≥ 1)

with a suitable step size tk > 0: fixed, or chosen via a backtracking line search.

• This is a descent method:

x+ = x + td; 〈d, ∇f(x)〉 < 0, d := −∇f(x) ≠ 0.

• Explicit discretization of dx(t)/dt + ∇f(x(t)) = 0, x(0) = x0:

(xk − xk−1)/h = −∇f(xk−1), (increment h > 0).
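The scheme xk = xk−1 − tk∇f(xk−1) with a fixed step, sketched in Python on a toy scalar quadratic (our own illustration; the choice of f, step size, and iteration count are assumptions):

```python
def grad_descent(grad, x0, t, iters):
    """Basic gradient method x_k = x_{k-1} - t * grad(x_{k-1}) with a
    fixed step size t > 0 (scalar case for simplicity)."""
    x = x0
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# f(x) = (x - 2)^2, so grad f(x) = 2*(x - 2); minimizer x* = 2.
x = grad_descent(lambda x: 2 * (x - 2), x0=10.0, t=0.25, iters=60)
print(x)  # converges to ~2.0
```

With t = 0.25 the update is x ← 0.5x + 1, so the error is halved at every step: a linear rate, as defined later in the lecture.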


Quick Recall : Backtracking Line Search – BLS

• Backtracking: start with t, try t/2, t/4, . . . until sufficient decrease.

• The BLS procedure warrants sufficient decrease and rules out too-short steps.

A simple inexact line search to find tk for descent methods along the ray xk + tdk, approximating

tk ∈ argmin_{t≥0} φ(t) ≡ f(xk + tdk), (e.g., for GM d := −∇f(x)).

1. Initialize: choose an initial guess t > 0 and α, β ∈ (0, 1). Set tk = t.

2. Take tk ← βtk (e.g., with β = 1/2) until

(∗) f(xk + tkdk) < f(xk) + αtk〈∇f(xk), dk〉
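The BLS procedure above can be sketched in a few lines of Python (our own scalar sketch; the test function and parameter values are assumptions):

```python
def backtracking(f, grad_x, x, d, t=1.0, alpha=0.5, beta=0.5):
    """Backtracking line search: shrink t by beta until the sufficient
    decrease condition (*)
        f(x + t*d) < f(x) + alpha * t * <grad f(x), d>
    holds. Scalar sketch; grad_x is the scalar gradient at x."""
    while not f(x + t * d) < f(x) + alpha * t * grad_x * d:
        t *= beta
    return t

f = lambda x: x ** 2
g = lambda x: 2 * x
x = 1.0
d = -g(x)                    # descent direction: <grad, d> = -4 < 0
t = backtracking(f, g(x), x, d)
print(t, x + t * d)          # accepted step and the resulting point
```

Starting from t = 1, the loop rejects t = 1 and t = 1/2 and accepts t = 1/4, illustrating how too-long steps are cut back while a sufficient decrease is still enforced.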


Building First Order Based Schemes: Basic Old Idea

Pick an adequate approximate model

To minimize f, given some y, approximate f with some q(·, y): f(x) ≈ q(x, y)

Solve "somehow" the resulting approximate model:

xk+1 = argmin_x q(x, xk), k = 0, 1, . . .

Finding good approximation models relies on data information and the structure of the given problem.

1. Linearize + regularize

2. Linearize only + use info on C : e.g., C compact.

3. Use optimality conditions e.g., to identify a fixed point iteration.

4. Any other device/approach that can produce a useful approximation


A Quadratic Approximation Model

• Simplest case: unconstrained minimization of f ∈ C¹:

(U) min{f(x) : x ∈ E}.

• Simplest idea: use the quadratic model

qt(x, y) := f(y) + 〈x − y, ∇f(y)〉 + (1/2t)‖x − y‖², t > 0.

Namely, use the linearized part of f at some given point y, regularized with a quadratic proximity term that measures the "local error" of the approximation.

• This leads to a (strongly) convex approximation of (U):

(Ut) min {qt(x, y) : x ∈ E} .

• Now, fixing y := xk−1 ∈ E, the unique minimizer xk solves (Utk):

xk = argmin{qtk(x, xk−1) : x ∈ E}.

• Therefore, the optimality condition yields exactly the gradient method:

∇qtk(xk, xk−1) = 0 ⇒ tk∇f(xk−1) + xk − xk−1 = 0 ⇒ xk = xk−1 − tk∇f(xk−1).
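One can check numerically that the minimizer of qt is exactly the gradient step. A scalar Python sketch of our own (the choices of f, y, and t are arbitrary): grid search over the model qt recovers y − t f'(y).

```python
# q_t(x, y) = f(y) + (x - y) f'(y) + (1/(2t)) (x - y)^2  (scalar case)
def qt(x, y, f, fp, t):
    return f(y) + (x - y) * fp(y) + (x - y) ** 2 / (2 * t)

f = lambda x: x ** 4
fp = lambda x: 4 * x ** 3
y, t = 1.5, 0.1
grad_step = y - t * fp(y)            # closed-form minimizer of q_t(., y)

# Brute-force grid minimization of the model confirms it.
grid = [y - 2 + k / 1000 for k in range(4001)]
x_star = min(grid, key=lambda x: qt(x, y, f, fp, t))
print(grad_step, x_star)             # the two agree to grid resolution
```

This is the content of the optimality-condition derivation above: minimizing the regularized linearization is the gradient step.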


Constrained Minimization: Gradient Projection Method

Simple algebra:

qt(x, y) = f(y) + 〈x − y, ∇f(y)〉 + (1/2t)‖x − y‖²
         = (1/2t)‖x − (y − t∇f(y))‖² − (t/2)‖∇f(y)‖² + f(y).

This allows us to pass easily to the constrained problem (P) min{f(x) : x ∈ C}.

• Ignoring the constant terms (in y := xk−1) leads to solving (P) via the GPM:

xk = argmin_{x∈C} (1/2)‖x − (xk−1 − tk∇f(xk−1))‖² ≡ PC(xk−1 − tk∇f(xk−1))

where PC stands for the orthogonal projection map: PC(x) := argmin_{z∈C} ‖z − x‖².

This reveals a more general operation - the Proximal Map - central to FOM...
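The GPM iteration xk = PC(xk−1 − tk∇f(xk−1)) in a minimal scalar Python sketch of our own (toy example: C an interval, for which the projection is just a clip):

```python
def proj_interval(x, lo, hi):
    """Orthogonal projection onto C = [lo, hi]."""
    return min(max(x, lo), hi)

def grad_proj(grad, proj, x0, t, iters):
    """Gradient projection: x_k = P_C(x_{k-1} - t * grad(x_{k-1}))."""
    x = x0
    for _ in range(iters):
        x = proj(x - t * grad(x))
    return x

# min (x - 2)^2 over C = [0, 1]: the unconstrained minimizer 2 is
# infeasible, and GPM converges to the boundary point x* = 1.
x = grad_proj(lambda x: 2 * (x - 2),
              lambda z: proj_interval(z, 0.0, 1.0),
              x0=0.0, t=0.25, iters=50)
print(x)  # 1.0
```

The gradient step keeps pulling toward 2 and the projection pulls back to C, so the iterates settle at the constrained minimizer on the boundary.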


Nonsmooth Problem: Subgradient (Projection) Method

Let g be nonsmooth (admitting a subgradient; e.g., g convex), and consider the constrained NSO problem:

(P) inf{g(x) : x ∈ C}; C ⊂ E closed convex

As was done for gradient projection through the quadratic approximation, we replace the gradient in the linearization with a subgradient element:

Subgradient Scheme: [Shor (63), Polyak (65)]

γk−1 ∈ ∂g(xk−1), xk = PC (xk−1 − tkγk−1), (tk > 0, a stepsize)

• Note: the subgradient scheme is not a descent method.

• γ ∈ ∂g(x∗) does not imply γ = 0.

• Example in R: g(x) = |x|, with ∂g(0) = [−1, 1].
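The example g(x) = |x| also shows the non-descent behavior in practice. A minimal Python sketch of ours, with C = R (so PC is the identity) and diminishing steps tk = 1/k:

```python
def subgrad_abs(x):
    """A subgradient of g(x) = |x|: sign(x), choosing 0 at x = 0
    (any element of [-1, 1] is valid there)."""
    return (x > 0) - (x < 0)

def subgradient_method(x0, steps):
    """x_k = x_{k-1} - t_k * gamma_{k-1}. Not a descent method: the
    iterates can overshoot 0 and come back; we also track the best
    value seen, as in the complexity results for this scheme."""
    x, best = x0, abs(x0)
    for t in steps:
        x = x - t * subgrad_abs(x)
        best = min(best, abs(x))
    return x, best

x, best = subgradient_method(2.0, [1 / k for k in range(1, 50)])
print(x, best)  # oscillates around 0; the best value keeps improving
```

Because a nonzero subgradient gives no descent guarantee, convergence is driven by the diminishing step sizes, and the natural quantity to report is the best iterate so far.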


Convergence Rates / Iteration Complexity of Algorithms

Classical rates of convergence for sequences {sk} converging to zero, such as {‖∇F(xk)‖}, {F(xk) − Fopt}, {‖xk − x∗‖}, are:

Sublinear: sk ≤ c1/√k, c2/k, c3/k²

Linear: sk+1 ≤ (1 − c)sk, for some c ∈ (0, 1).

Fix an accuracy ε; we then ask how many iterations N are needed to get sN ≤ ε:

N = O(1/ε^p) (p = 2, 1, 1/2) for the sublinear rates, and N = O((1/c) log(1/ε)) for the linear rate.

For example, to solve a problem approximately to a given accuracy ε > 0, i.e., to find an x satisfying F(x) − Fopt ≤ ε, suppose that for all k ≥ 1 we have

F(xk) − Fopt ≤ Γ/k^θ, (Γ > 0, θ > 0).

Then the iteration complexity is O(ε^{−1/θ}).

The constant Γ usually depends on the distance from the initial point to an optimal solution:

R² := max{‖x∗ − x0‖² : x∗ ∈ X∗}.
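The rate-to-complexity translation can be made concrete. A small Python sketch of ours computing the smallest N for each regime (the constants are arbitrary illustrative choices):

```python
import math

def iters_sublinear(Gamma, theta, eps):
    """Smallest N with Gamma / N**theta <= eps, i.e. N = O(eps**(-1/theta))."""
    return math.ceil((Gamma / eps) ** (1 / theta))

def iters_linear(c, s0, eps):
    """Smallest N with (1 - c)**N * s0 <= eps: N = O((1/c) * log(1/eps))."""
    return math.ceil(math.log(eps / s0) / math.log(1 - c))

print(iters_sublinear(1.0, 1.0, 1e-3))  # the O(1/eps) regime
print(iters_sublinear(1.0, 0.5, 1e-3))  # the O(1/eps^2) regime is much worse
print(iters_linear(0.5, 1.0, 1e-3))     # a linear rate needs only a handful
```

The gap between roughly 10^3, 10^6, and about 10 iterations for the same target accuracy is exactly why the distinction between sublinear and linear rates matters.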

Complexity for Basic FOM - A Quick Preview

For minimizing a function to accuracy ε:

⊕ Smooth nonconvex problems: measured through the norm of the gradient

• Gradient method: min1≤k≤N ‖∇F(xk)‖ ≤ ε requires N = O(1/ε²) (idem for GPM)

⊕ Smooth convex problems: measured through function values F(xN) − F∗

• Gradient (Projection) Method: N = O(1/ε)

⊕ Nonsmooth Lipschitz convex problems: measured through "best" function values

• Subgradient (Projection) Method: N = O(1/ε²), with F_best^N := min1≤k≤N F(xk)

We will see that these schemes

• Can be analyzed as special cases of a more general optimization model.

• Have complexity that can be improved by exploiting structure/data.

• Can be extended/combined through various approaches to generate schemes that can handle more complex situations (constraints, structures, function classes...).

Marc Teboulle First Order Optimization Methods Lecture 1 - Introduction and Basic Results 20 / 22


Before we start.. Quick Recalls on Convex Functions

I Throughout, E, F, X ... stand for a finite dimensional real vector space, here mostly a Euclidean space with inner product 〈·, ·〉, endowed with the norm ‖x‖ = √〈x, x〉.

I Let f : E→ (−∞,+∞] be a proper, closed (lsc) convex function, with dom f = {x | f (x) < +∞} its effective domain.

I Proper: dom f ≠ ∅ and f (x) > −∞, ∀x ∈ E.

I Closed and Convex: its epigraph is a closed convex set,

epi f := {(x, α) ∈ E× R | α ≥ f (x)}.

I Extended valued functions are useful for handling constraints:

inf{h(x) : x ∈ C} ⇐⇒ inf{f (x) : x ∈ E}, f := h + δC,

where δC (x) = 0 if x ∈ C and +∞ if x /∈ C is the indicator of C.


Recalls - Subdifferentiability of Convex Functions

I g ∈ E is a subgradient of f at x if:

f (z) ≥ f (x) + 〈g, z− x〉, ∀z ∈ E.

I Subdifferential of f at x = set of all subgradients:

∂f (x) = {g ∈ E | f (z) ≥ f (x) + 〈g, z− x〉, ∀z ∈ E}.

I ∂f (x) is a closed convex set (possibly empty), as an infinite intersection of closed half-spaces.

I If x ∈ int dom f , then ∂f (x) is nonempty and bounded.

I When f is differentiable, ∂f (x) ≡ {∇f (x)} ≡ {f ′(x)}.

I f ∗(y) = sup{〈x, y〉 − f (x) : x ∈ E} is its convex conjugate.

I Fenchel Theorem: For f proper convex, for any x, y ∈ E:

〈x, y〉 = f (x) + f ∗(y) ⇐⇒ y ∈ ∂f (x) ⇐⇒ x ∈ ∂f ∗(y),

where the second equivalence requires f closed.
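A quick numerical sanity check of the subgradient inequality (a sketch, not from the slides): for f (x) = |x| one has ∂f (0) = [−1, 1], so g = 0.5 satisfies f (z) ≥ f (0) + g·z for all z, while g = 1.5 does not:

```python
# f(x) = |x|; at x = 0 the subdifferential is [-1, 1].
f = lambda x: abs(x)

def is_subgradient(g, x, grid):
    # Check the subgradient inequality f(z) >= f(x) + g*(z - x) on a grid.
    return all(f(z) >= f(x) + g * (z - x) - 1e-12 for z in grid)

grid = [i / 10.0 for i in range(-50, 51)]
ok = is_subgradient(0.5, 0.0, grid)    # 0.5 in [-1, 1]: valid
bad = is_subgradient(1.5, 0.0, grid)   # 1.5 outside [-1, 1]: fails, e.g. at z = 1
```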

First Order Optimization Methods

Lecture 2 - The Proximal Map

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

PGMO Lecture Series

January 25-26, 2017 Ecole Polytechnique, Paris

Lecture 2 - The Proximal Map

The proximal map (Moreau 1965) is a key player in FOM.

Definition. Given g : E→ (−∞,∞] a closed, proper and convex function, the proximal mapping of g is defined by

prox_g(x) = argmin_{u∈E} { g(u) + (1/2)‖u− x‖² }.

A Natural Extension of the Projection Operator

I If C ⊆ E is nonempty, closed and convex, then prox_{δC} = PC:

prox_{δC}(x) = argmin_{u∈E} { δC(u) + (1/2)‖u− x‖² } = argmin_{u∈C} ‖u− x‖² = PC(x).

To study properties of prox, we first recall some basic results.
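To illustrate the definition, a brute-force 1-D prox on a grid (illustration only, not how prox is computed in practice; all names are mine): for g = δ_[0,1] it recovers the projection P_C, as in the slide above.

```python
def prox_numeric(g, x, lo=-10.0, hi=10.0, n=2001):
    # Brute-force prox_g(x) = argmin_u { g(u) + (1/2)(u - x)^2 } on a 1-D grid.
    best_u, best_v = None, float("inf")
    for i in range(n):
        u = lo + (hi - lo) * i / (n - 1)
        v = g(u) + 0.5 * (u - x) ** 2
        if v < best_v:
            best_u, best_v = u, v
    return best_u

# For g = delta_C with C = [0, 1], the prox is the projection P_C.
delta_C = lambda u: 0.0 if 0.0 <= u <= 1.0 else float("inf")
p = prox_numeric(delta_C, 2.5)   # P_C(2.5) = 1.0
```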

Strongly Convex Functions - Recalls

I A function f : E→ (−∞,∞] is called σ-strongly convex with modulus σ > 0 if for any x, y ∈ dom(f) and λ ∈ [0, 1]:

f(λx + (1− λ)y) ≤ λf(x) + (1− λ)f(y)− (σ/2)λ(1− λ)‖x− y‖².

I f : E→ (−∞,∞] is σ-strongly convex iff f(·)− (σ/2)‖·‖² is convex.

I The sum of a convex function and a σ-strongly convex function is σ-strongly convex.

First Order Characterization. Let f : E → (−∞,∞] and σ > 0. The following three claims are equivalent:

(i) f is σ-strongly convex.

(ii) For any x ∈ dom(∂f), y ∈ dom(f) and gx ∈ ∂f(x): f(y) ≥ f(x) + 〈gx, y − x〉+ (σ/2)‖y − x‖².

(iii) For any x, y ∈ dom(∂f) and gx ∈ ∂f(x), gy ∈ ∂f(y): 〈gx − gy, x− y〉 ≥ σ‖x− y‖².


Minimization of Strongly Convex Functions - Recalls

Theorem. Let f : E→ (−∞,∞] be a proper closed and σ-strongly convex function (σ > 0). Then

(a) f has a unique minimizer x∗.

(b) f(x)− f(x∗) ≥ (σ/2)‖x− x∗‖² for all x ∈ dom(f).

Proof. Existence: Take x0 ∈ dom(∂f) and g ∈ ∂f(x0). Then, completing the square, for all x ∈ E:

f(x) ≥ f(x0) + 〈g, x− x0〉+ (σ/2)‖x− x0‖² = f(x0)− (1/(2σ))‖g‖² + (σ/2)‖x− (x0 − (1/σ)g)‖².

Thus, Lev(f, f(x0)) := {x : f(x) ≤ f(x0)} ⊆ B‖·‖[x0 − (1/σ)g, (1/σ)‖g‖].

Therefore Lev(f, f(x0)) is compact, and together with the closedness of f this implies that a minimizer exists.

Notation: B‖·‖[x0, δ] := {x : ‖x− x0‖ ≤ δ}.


Proof Contd.

I To show uniqueness, let x and x̄ be minimizers of f, so that f(x) = f(x̄) = fopt. Then

fopt ≤ f((1/2)x + (1/2)x̄) ≤ (1/2)f(x) + (1/2)f(x̄)− (σ/8)‖x− x̄‖² = fopt − (σ/8)‖x− x̄‖²,

showing that x = x̄ (proving part (a)).

I Let x∗ be the unique minimizer of f.

I By Fermat's optimality condition, 0 ∈ ∂f(x∗), and hence using the subgradient inequality for the σ-strongly convex f, we get for all x ∈ E:

f(x)− f(x∗) ≥ 〈0, x− x∗〉+ (σ/2)‖x− x∗‖² = (σ/2)‖x− x∗‖².

We are ready to establish the main properties of the prox map.

Prox Theorem – Uniqueness and Characterization

Prox Theorem. Let g : E → (−∞,∞] be a proper closed and convex function. Then,

(i) (Uniqueness) u := prox_g(x) is a singleton for any x ∈ E.

(ii) (Characterization) x− u ∈ ∂g(u) ⇐⇒ u = (I + ∂g)⁻¹(x).

(iii) (Subgradient via prox) z ∈ ∂g(x) ⇐⇒ x = prox_g(z + x). In particular, x is a minimizer of g iff x = prox_g(x).

Proof. (i) For any x ∈ E, u := prox_g(x) is by definition the minimizer of the proper closed and 1-strongly convex function

u → g(u) + (1/2)‖u− x‖².

Therefore the minimizer prox_g(x) exists and is unique.

(ii) Fermat's optimality condition for u is satisfied if and only if

0 ∈ ∂g(u) + u− x ⇔ u = (I + ∂g)⁻¹(x).

(iii) z ∈ ∂g(x) ⇔ z + x ∈ ∂g(x) + x ⇔ x = prox_g(z + x). Now, x is a minimizer of g iff 0 ∈ ∂g(x); setting z = 0 proves the claim.


The Proximal Inequality – Fundamental

Proximal Inequality. Let g : E→ (−∞,∞] be a proper closed and convex function. For x ∈ E, let u = prox_g(x). Then, for any y ∈ E,

g(u)− g(y) ≤ 〈x− u, u− y〉 = (1/2){‖y − x‖² − ‖y − u‖² − ‖u− x‖²}.

Proof. Since x− u ∈ ∂g(u), the subgradient inequality yields, for any y ∈ E:

g(u)− g(y) ≤ 〈x− u, u− y〉.

The second expression follows from the identity

2〈x− u, u−w〉 = ‖w − x‖² − ‖w − u‖² − ‖u− x‖², ∀x, u,w ∈ E.


Examples

I Computing prox_g can be very hard...!

I But for many useful special cases it is easy...

I If g ≡ δC (C ⊆ E closed and convex), then

prox_g(x) = argmin_u { δC(u) + (1/2)‖u− x‖² } ≡ PC(x).

For some useful sets C it is easy to compute the orthogonal projection PC:

I Affine sets, simple polyhedral sets (halfspace, Rⁿ₊, [l, u]ⁿ),

I l2, l1, l∞ balls,

I Ice cream cone (SOC), semidefinite cone Sⁿ₊,

I Simplex and spectrahedron (simplex in Sⁿ).
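Among the "easy" projections listed above, the projection onto the unit simplex ∆n admits a classical O(n log n) sort-and-threshold formula; a sketch (not from the slides):

```python
def project_simplex(v):
    # Euclidean projection onto the unit simplex {x : x >= 0, sum(x) = 1},
    # via the classical sort-and-threshold formula: x = [v - theta]_+.
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        if ui - (cumsum - 1.0) / i > 0:
            theta = (cumsum - 1.0) / i
    return [max(vi - theta, 0.0) for vi in v]

p1 = project_simplex([2.0, 0.0])    # -> [1.0, 0.0]
p2 = project_simplex([0.5, 0.5])    # already on the simplex -> [0.5, 0.5]
```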


Other Basic Examples of Proximal Mappings

prox_{th}(x) = argmin_{u∈E} { h(u) + (1/(2t))‖u− x‖² }, t > 0.

I Constant function h ≡ c for some c ∈ R: prox_{th}(x) = x, the identity map.

I Affine function h(x) = 〈a, x〉+ b, a ∈ E, b ∈ R: prox_{th}(x) = x− ta.

I Quadratic function h(x) = (1/2)xᵀAx + bᵀx + c (A ∈ Sⁿ₊, b ∈ Rⁿ, c ∈ R):

prox_{th}(x) = (I + tA)⁻¹(x− tb).

I Euclidean norm h(x) = ‖x‖₂:

prox_{th}(x) = (1− t/‖x‖₂)x if ‖x‖₂ ≥ t, and 0 otherwise.

I Euclidean distance d(x) = min_{y∈C} ‖x− y‖₂ (C closed convex):

prox_{td}(x) = θPC(x) + (1− θ)x, where θ = t/d(x) if d(x) ≥ t, and θ = 1 else.
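The Euclidean-norm prox above (block soft thresholding) in code, as a sketch with an illustrative function name:

```python
import math

def prox_norm(x, t):
    # prox of t*||.||_2: shrink x toward the origin by t, or map it to 0.
    nx = math.sqrt(sum(xi * xi for xi in x))
    if nx >= t:
        s = 1.0 - t / nx
        return [s * xi for xi in x]
    return [0.0] * len(x)

p = prox_norm([3.0, 4.0], 2.0)   # ||x||_2 = 5 >= t = 2, so scale by 1 - 2/5 = 0.6
```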


Basic Calculus Rules for Prox

f(x)                              prox_f(x)                                       assumptions
∑ᵢ₌₁ᵐ fᵢ(xᵢ)                      prox_{f1}(x1) × · · · × prox_{fm}(xm)           (separable f)
g(λx + a)                         (1/λ)[prox_{λ²g}(a + λx) − a]                   λ ≠ 0, a ∈ E, g proper
λg(x/λ)                           λ prox_{g/λ}(x/λ)                               λ > 0, g proper
g(x) + (c/2)‖x‖² + 〈a, x〉 + γ     prox_{g/(c+1)}((x − a)/(c + 1))                 a ∈ E, c > 0, γ ∈ R, g proper
g(A(x) + b)                       x + (1/α)Aᵀ(prox_{αg}(A(x) + b) − A(x) − b)     b ∈ Rᵐ, A : V → Rᵐ, g closed proper convex, A ◦ Aᵀ = αI, α > 0

Note: In general, the prox of the composition

(g ◦ A)(x) = g(A(x))

is NOT tractable!..

Examples of Prox Computations – Summary

f                           dom f     prox_f                                  assumptions
(1/2)xᵀAx + bᵀx + c         Rⁿ        (A + I)⁻¹(x − b)                        A ∈ Sⁿ₊₊, b ∈ Rⁿ, c ∈ R
λ‖x‖                        E         [1 − λ/‖x‖]₊ x                          Euclidean norm, λ > 0
λ‖x‖₁                       Rⁿ        [|x| − λe]₊ ◦ sgn(x)                    λ > 0
−λ ∑ⱼ₌₁ⁿ log xⱼ             Rⁿ₊₊      ((xⱼ + √(xⱼ² + 4λ))/2)ⱼ₌₁ⁿ             λ > 0
δC(x)                       E         PC(x)                                   C ⊆ E
λσC(x)                      E         x − λPC(x/λ)                            C closed and convex
λ‖x‖                        E         x − λP_{B‖·‖∗[0,1]}(x/λ)                arbitrary norm
λ max{x1, . . . , xn}       Rⁿ        x − λP_{∆n}(x/λ)                        λ > 0
λ dC(x)                     E         x + min{λ/dC(x), 1}(PC(x) − x)          C closed convex
(λ/2)dC(x)²                 E         (λ/(λ+1))PC(x) + (1/(λ+1))x             C closed convex

We show the prox of the l1-norm.

Prox of l1-Norm

I g(x) = λ‖x‖₁ (λ > 0).

I g(x) = ∑ᵢ₌₁ⁿ ϕ(xᵢ), where ϕ(t) = λ|t|.

I Reduces to a 1-D problem: prox_ϕ(s) = argmin_t { λ|t| + (1/2)(t − s)² }.

I prox_ϕ(s) = Tλ(s), where Tλ, the soft thresholding operator, is defined as

Tλ(y) = [|y| − λ]₊ sgn(y) = { y − λ if y ≥ λ; 0 if |y| < λ; y + λ if y ≤ −λ }.

(Figure: graph of the soft thresholding operator Tλ on [−2, 2].)

I By the separability of the l1-norm, prox_g(x) = (Tλ(xⱼ))ⱼ₌₁ⁿ. Extending the definition of the soft thresholding operator componentwise, we write

prox_g(x) = Tλ(x) ≡ (Tλ(xⱼ))ⱼ₌₁ⁿ = [|x| − λe]₊ ⊙ sgn(x).
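The componentwise soft thresholding operator Tλ in code, as a direct transcription of the formula above (function name illustrative):

```python
def soft_threshold(x, lam):
    # Componentwise T_lam(x) = [|x| - lam]_+ * sgn(x), the prox of lam*||.||_1.
    return [
        max(abs(xi) - lam, 0.0) * (1.0 if xi > 0 else -1.0 if xi < 0 else 0.0)
        for xi in x
    ]

y = soft_threshold([3.0, -0.5, 1.0], 1.0)
```

Components with |xi| ≤ λ are zeroed, the rest are shrunk toward 0 by λ.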


Prox of a Composition |aᵀx|

I f(x) = |aᵀx| (0 ≠ a ∈ Rⁿ).

I f(x) = g(aᵀx), where g(t) = |t|.

I prox_{λg} = Tλ.

I prox_f(x) = x + (1/‖a‖²)(T_{‖a‖²}(aᵀx) − aᵀx)a.

Unfortunately, in general the prox of f(x) = g(Ax) = ‖Ax‖₁ does not admit an explicit formula.


More Properties: Firm Nonexpansivity/Nonexpansivity

Theorem. Let h be proper closed and convex. For any x, y ∈ E:

(i) 〈x− y, prox_h(x)− prox_h(y)〉 ≥ ‖prox_h(x)− prox_h(y)‖².

(ii) ‖prox_h(x)− prox_h(y)‖ ≤ ‖x− y‖.

Proof.

I Denote u = prox_h(x), v = prox_h(y).

I x− u ∈ ∂h(u), y − v ∈ ∂h(v).

I By the subgradient inequality,

h(v) ≥ h(u) + 〈x− u, v − u〉,
h(u) ≥ h(v) + 〈y − v, u− v〉.

I Summing the two inequalities, we obtain 〈(x− u)− (y − v), u− v〉 ≥ 0.

I Thus, 〈u− v, x− y〉 ≥ ‖u− v‖².

I (ii) follows from the Cauchy–Schwarz inequality.
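A grid check of firm nonexpansivity, as a sketch (not from the slides), for h = |·|, whose prox is the soft threshold T₁:

```python
def prox_abs(x):
    # prox of h = |.| : soft thresholding with threshold 1.
    return max(abs(x) - 1.0, 0.0) * (1.0 if x >= 0 else -1.0)

# Firm nonexpansivity: (x - y)(prox(x) - prox(y)) >= (prox(x) - prox(y))^2.
grid = [i / 4.0 for i in range(-20, 21)]
firm = all(
    (x - y) * (prox_abs(x) - prox_abs(y)) >= (prox_abs(x) - prox_abs(y)) ** 2 - 1e-12
    for x in grid for y in grid
)
```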

Moreau Decomposition

Theorem. Let f : E → (−∞,∞] be a closed, proper convex function. Then for any x ∈ E,

prox_f(x) + prox_{f∗}(x) = x.

Proof.

I Let x ∈ E, u = prox_f(x).

I Then x− u ∈ ∂f(u)

I iff u ∈ ∂f∗(x− u) [by the Fenchel Theorem, (∂f)⁻¹ = ∂f∗]

I iff x− u = prox_{f∗}(x).

I Thus, prox_f(x) + prox_{f∗}(x) = u + (x− u) = x.

A direct consequence (extended Moreau decomposition) with scaling (λ > 0):

prox_{λf}(x) + λ prox_{f∗/λ}(x/λ) = x.
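A numerical check of the decomposition, as a sketch (not from the slides): for f(x) = |x| one has f∗ = δ_[−1,1], so prox_f is the soft threshold T₁ and prox_{f∗} is the projection onto [−1, 1]; their sum must reproduce x:

```python
def prox_f(x):
    # f = |.|  =>  prox_f = soft thresholding T_1.
    return max(abs(x) - 1.0, 0.0) * (1.0 if x >= 0 else -1.0)

def prox_fstar(x):
    # f* = indicator of [-1, 1]  =>  prox_{f*} = projection onto [-1, 1].
    return min(max(x, -1.0), 1.0)

# Moreau decomposition: prox_f(x) + prox_{f*}(x) = x, for any x.
checks = [prox_f(x) + prox_fstar(x) == x for x in (-2.5, -0.3, 0.0, 0.7, 4.0)]
```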


Example: Prox of Support Functions

The support function of a set C, defined by

σC(u) := sup{〈u, v〉 : v ∈ C},

is essentially "built-in" in most optimization problems. For instance, consider the general NLP with m constraints described through g : Rⁿ → Rᵐ:

inf{f(x) : g(x) ≤ 0, x ∈ Rⁿ} ⇐⇒ inf_x {f(x) + σ_{Rᵐ₊}(g(x))}.

Recall that σC = δ∗C and σ∗C = δ∗∗C = δC (C closed convex nonempty).

Proposition (Prox of support function). Let C be a nonempty closed and convex set, and let λ > 0. Then

prox_{λσC}(x) = x− λPC(x/λ).

Proof. By the extended Moreau decomposition formula,

prox_{λσC}(x) = x− λ prox_{λ⁻¹σ∗C}(x/λ) = x− λ prox_{λ⁻¹δC}(x/λ) = x− λPC(x/λ).


Examples

Thus, computing the prox of the support function of a set C reduces to computing a projection onto C:

I prox_{λ‖·‖α}(x) = x− λP_{B‖·‖α,∗[0,1]}(x/λ) (‖·‖α an arbitrary norm, with dual unit ball B‖·‖α,∗[0, 1]).

I prox_{λ‖·‖∞}(x) = x− λP_{B‖·‖₁[0,1]}(x/λ).

I prox_{λ max(·)}(x) = x− λP_{∆n}(x/λ).
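For instance, with the box C = [0, 1]ⁿ the support function is σC(x) = ∑ⱼ [xⱼ]₊, and the formula prox_{λσC}(x) = x − λPC(x/λ) is easy to check numerically (a sketch with an illustrative function name; projection onto the box is componentwise clamping):

```python
def prox_support_box(x, lam):
    # prox of lam * sigma_C with C = [0,1]^n: componentwise x - lam * P_C(x / lam).
    return [xi - lam * min(max(xi / lam, 0.0), 1.0) for xi in x]

# Direct 1-D check of prox of lam*[t]_+ at s: s - lam if s > lam; 0 if 0 <= s <= lam; s if s < 0.
p = prox_support_box([3.0, -2.0, 0.5], 1.0)   # -> [2.0, -2.0, 0.0]
```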

References

I J. J. Moreau, Proximité et dualité dans un espace hilbertien, Bull. Soc. Math. France (1965).

I H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces (2011).

I P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul. (2005).

First Order Optimization Methods

Lecture 3 - The Proximal Gradient Method

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

PGMO Lecture Series

January 25-26, 2017 Ecole Polytechnique, Paris

Lecture 3 - The Proximal Gradient Method

A Basic Useful Model: Composite Minimization

Composite convex model:

(P) min{F(x) ≡ f(x) + g(x) : x ∈ E}

I f - convex, C¹ with Lipschitz gradient.

I g - convex, extended valued, nonsmooth.

I X∗ ≠ ∅ is the optimal set of (P), with optimal value denoted Fopt ≡ F(x∗).

This simple model is rich enough to encompass most convex optimization models arising in a wide array of disparate applications: image science, signal processing, machine learning, etc.

Some Prototype Examples

I Unconstrained smooth convex minimization (g ≡ 0):

min{f(x) : x ∈ E}

I Constrained smooth convex minimization (g ≡ δ(·|C), C closed and convex):

min{f(x) : x ∈ C}

I Constrained nonsmooth convex minimization (f ≡ 0):

min{g(x) : x ∈ E} (g extended valued)

I Nonsmooth l1-regularized convex problems (g(x) ≡ λ‖x‖₁, λ > 0):

min{f(x) + λ‖x‖₁ : x ∈ E}

Smoothness - A Central Assumption in FOM

Definition. Let L ≥ 0. A differentiable function f : E → R is said to have Lipschitz continuous gradient with Lipschitz constant L ≥ 0 (or simply to be L-smooth) if

‖∇f(x)−∇f(y)‖∗ ≤ L‖x− y‖ for all x, y ∈ E.

Recall: Given a norm ‖·‖ (not necessarily the endowed Euclidean norm), the dual norm is defined by

‖u‖∗ = max_v {〈u, v〉 : ‖v‖ ≤ 1}, u ∈ E.

I The class of L-smooth functions is denoted by C¹,¹_L.

I If a function is L1-smooth, then it is also L2-smooth for any L2 ≥ L1.

Examples:

I f(x) = 〈a, x〉+ b, a ∈ E, b ∈ R (0-smooth).

I f(x) = (1/2)xᵀAx + bᵀx + c, A ∈ Sⁿ, b ∈ Rⁿ, c ∈ R (‖A‖₂-smooth if Rⁿ is endowed with the l2-norm).

I f(x) = (1/2)d²C(x) (f : E→ R) (1-smooth).

A Key Inequality for Smooth f – The Descent Lemma

Lemma. Let f : E→ R be L-smooth. Then

|f(y)− f(x)− 〈∇f(x), y − x〉| ≤ (L/2)‖x− y‖², ∀x, y.

In particular, for f convex we can drop the absolute value; this is usually called the "Descent Lemma".

Proof. By the fundamental theorem of calculus:

I f(y)− f(x) = ∫₀¹ 〈∇f(x + t(y − x)), y − x〉 dt.

I f(y)− f(x) = 〈∇f(x), y − x〉+ ∫₀¹ 〈∇f(x + t(y − x))−∇f(x), y − x〉 dt.

I Hence

|f(y)− f(x)− 〈∇f(x), y − x〉| = |∫₀¹ 〈∇f(x + t(y − x))−∇f(x), y − x〉 dt|
≤ ∫₀¹ ‖∇f(x + t(y − x))−∇f(x)‖∗ · ‖y − x‖ dt   (Cauchy–Schwarz)
≤ ∫₀¹ tL‖y − x‖² dt = (L/2)‖y − x‖².   (Lipschitz continuity)
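A numerical illustration of the lemma, as a sketch (not from the slides), for the L-smooth quadratic f(x) = x²: here ∇f(x) = 2x is 2-Lipschitz, so L = 2 and the gap f(y) − f(x) − f′(x)(y − x) equals (y − x)², exactly the bound (L/2)(x − y)²:

```python
f = lambda x: x * x
grad = lambda x: 2.0 * x
L = 2.0

def descent_gap(x, y):
    # |f(y) - f(x) - <grad f(x), y - x>|, to compare with (L/2)||x - y||^2.
    return abs(f(y) - f(x) - grad(x) * (y - x))

pairs = [(-1.3, 2.0), (0.0, 5.0), (3.0, 3.5)]
ok = all(descent_gap(x, y) <= L / 2 * (x - y) ** 2 + 1e-12 for x, y in pairs)
```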


Characterizations of L-smoothness - Useful Inequalities

Theorem. Let f : E → R be a convex function, differentiable over E, and let L > 0. Then the following claims are equivalent:

(i) f is L-smooth.

(ii) f(y) ≤ f(x) + 〈∇f(x), y − x〉+ (L/2)‖x− y‖² for all x, y ∈ E.

(iii) f(y) ≥ f(x) + 〈∇f(x), y − x〉+ (1/(2L))‖∇f(x)−∇f(y)‖∗² for all x, y ∈ E.

(iv) 〈∇f(x)−∇f(y), x− y〉 ≥ (1/L)‖∇f(x)−∇f(y)‖∗² for all x, y ∈ E.

Moreover, for f twice continuously differentiable over Rⁿ, we have f ∈ C¹,¹_L (w.r.t. the l2-norm) iff

‖∇²f(x)‖₂ ≤ L ⇐⇒ λmax(∇²f(x)) ≤ L for any x ∈ Rⁿ.

Proof. See refs.

The Basic Idea

Instead of minimizing directly

min_{x∈E} f(x) + g(x),

approximate f by a regularized linear approximation, keeping the nonsmooth g untouched. Thus, given xk and tk > 0, the next point is:

x_{k+1} = argmin_x { f(xk) +∇f(xk)ᵀ(x− xk) + (1/(2tk))‖x− xk‖² + g(x) }
        = argmin_x { g(x) + (1/(2tk))‖x− (xk − tk∇f(xk))‖² }.

Proximal Gradient Iteration

x_{k+1} = prox_{tk g}(xk − tk∇f(xk))


Basic Examples

I Gradient Method (g = 0, unconstrained minimization):

x^{k+1} = x^k − t_k∇f(x^k)

I Gradient Projection Method (g = δ_C, constrained convex minimization):

x^{k+1} = P_C(x^k − t_k∇f(x^k))

I Iterative Soft-Thresholding Algorithm (ISTA) (g = λ‖·‖₁):

x^{k+1} = T_{λt_k}(x^k − t_k∇f(x^k)), where T_α(u) = [|u| − αe]₊ ⊙ sgn(u).


The Proximal Gradient Method - PGM

For simplicity, we analyze PGM with a constant stepsize rule.

Proximal Gradient Method with Constant Stepsize
Input: L = L_f, a Lipschitz constant of ∇f.
Step 0. Take x⁰ ∈ E.
Step k. (k ≥ 0) Compute

x^{k+1} = T_L(x^k) := prox_{g/L}(x^k − (1/L)∇f(x^k)).

I When the Lipschitz constant L_f is unknown or not easily computable, this can be resolved with an easy backtracking stepsize rule, and all the forthcoming results also hold in that case (see refs. for details).
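A compact sketch of PGM with constant stepsize 1/L, specialized (our illustrative choice, not from the lecture) to f(x) = ½‖Ax − b‖² and g = λ‖·‖₁, so that prox_{g/L} is soft-thresholding with threshold λ/L:

```python
import numpy as np

# Proximal gradient method with constant stepsize 1/L for
# F(x) = 0.5 ||Ax - b||^2 + lam * ||x||_1  (illustrative choice of f, g).
def prox_grad(A, b, lam, iters=500):
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - (A.T @ (A @ x - b)) / L      # gradient step on f
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox step on g
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
lam = 0.5
x = prox_grad(A, b, lam)
g = A.T @ (A @ x - b)    # at a minimizer, -g lies in lam * subdiff of ||.||_1
```

The optimality condition 0 ∈ ∇f(x*) + λ∂‖x*‖₁ implies ‖∇f(x*)‖_∞ ≤ λ, which the final iterate satisfies up to a small tolerance.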

Main Pillar in the Analysis: The Prox-Grad Inequality

Theorem. Let L ≥ L_f. Then for any x, y ∈ E,

F(x) − F(T_L(y)) ≥ (L/2)‖x − T_L(y)‖² − (L/2)‖x − y‖² + ℓ_f(x, y),

where ℓ_f(x, y) := f(x) − f(y) − 〈∇f(y), x − y〉.

Note: Valid for nonconvex f. When f is convex, ℓ_f(x, y) ≥ 0.

Proof. We use the shorthand notation y⁺ = T_L(y) = prox_{g/L}(y − (1/L)∇f(y)).

I Using the prox inequality proven in a past lecture,

g(x) − g(ξ) ≥ 〈z − ξ, x − ξ〉,  ξ := prox_g(z),

it follows that (1/L)g(x) − (1/L)g(y⁺) ≥ 〈y − (1/L)∇f(y) − y⁺, x − y⁺〉.

I Therefore, rearranging, we have:

g(x) − g(y⁺) ≥ L〈y − y⁺, x − y⁺〉 + 〈∇f(y), y⁺ − x〉
             = L〈y − y⁺, x − y⁺〉 + 〈∇f(y), y⁺ − y〉 + 〈∇f(y), y − x〉.


Proof End

g(x) − g(y⁺) ≥ L〈y − y⁺, x − y⁺〉 + 〈∇f(y), y⁺ − y〉 + 〈∇f(y), y − x〉,

and using the Descent Lemma for L ≥ L_f we have

f(x) − f(y⁺) ≥ f(x) − f(y) − 〈∇f(y), y⁺ − y〉 − (L/2)‖y⁺ − y‖².

Adding the above, with ℓ_f(x, y) := f(x) − f(y) − 〈∇f(y), x − y〉, we get

F(x) − F(y⁺) ≥ ℓ_f(x, y) + L〈y − y⁺, x − y⁺〉 − (L/2)‖y⁺ − y‖².

Using the identity 2〈y − y⁺, x − y⁺〉 = ‖x − y⁺‖² + ‖y − y⁺‖² − ‖y − x‖², we obtain

F(x) − F(y⁺) ≥ (L/2)‖x − y⁺‖² − (L/2)‖x − y‖² + ℓ_f(x, y).


A Useful Side Remark: The Gradient Map

In terms of the gradient map G_L : E → E defined by

G_L(x) := L(x − T_L(x)),

the PGM simply reads "like" a gradient scheme:

x^{k+1} = T_L(x^k) ≡ x^k − (1/L)G_L(x^k).

This is a convenient device: it parallels the gradient, and can be used (e.g., with f nonconvex) alongside the analysis developed next to easily derive the complexity in terms of ‖G_L(x^k)‖ announced earlier.

I Clearly G_L(·) reduces to the gradient ∇f(x) when g = 0 (hence the name!)

I G_L(x∗) = 0 iff x∗ = prox_{g/L}(x∗ − (1/L)∇f(x∗)). Therefore, by the prox property,

x∗ − (1/L)∇f(x∗) − x∗ ∈ (1/L)∂g(x∗) ⟺ 0 ∈ ∇f(x∗) + ∂g(x∗).

I That is, for nonconvex f, x∗ is a critical point of F = f + g.
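A tiny check of the stationarity characterization G_L(x*) = 0, using (our choice, for illustration) f(x) = ½‖x − c‖² (so L_f = 1) and g = λ‖·‖₁, where the minimizer is soft-thresholding of c in closed form:

```python
import numpy as np

# Gradient map G_L(x) = L(x - T_L(x)) for f(x) = 0.5 ||x - c||^2
# (grad f = x - c, L_f = 1) and g = lam * ||.||_1; it vanishes at the minimizer.
L, lam = 1.0, 0.3
c = np.array([1.0, -0.1, 0.5])

def T_L(x):
    z = x - (x - c) / L                                        # gradient step
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # prox step

G_L = lambda x: L * (x - T_L(x))

x_star = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)         # known minimizer
```

Here G_L(x_star) is exactly zero, while G_L at a non-stationary point (e.g. the origin) is not.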

Sufficient Decrease Lemma

As an immediate consequence of the prox-grad inequality, we obtain (simply set y = x) the sufficient decrease property:

Corollary. Let L ≥ L_f. Then for any x ∈ E,

F(x) − F(T_L(x)) ≥ (L/2)‖x − T_L(x)‖².

Remark. Recall that the sufficient decrease property holds for nonconvex f (ℓ_f(x, x) ≡ 0 in the prox-grad inequality).

In terms of the gradient map G_L(x) ≡ L(x − T_L(x)) defined above, this reads

F(x) − F(T_L(x)) ≥ (1/2L)‖G_L(x)‖²,

from which the rate O(1/√N) for min_{0≤k≤N} ‖G_L(x^k)‖ can be easily deduced... Try!


O(1/k) Rate of Convergence of Proximal Gradient

Theorem. Suppose that f is convex. Let {x^k}_{k≥0} be the sequence generated by the proximal gradient method. Then for any x∗ ∈ X∗ and k ≥ 1,

F(x^k) − F_opt ≤ L_f‖x⁰ − x∗‖² / (2k).

I Thus, to solve (M), the proximal gradient method converges at a sublinear rate in function values.

I Note: The sequence {x^k} can be proven to converge to a solution x∗ with a stepsize in (0, 2/L_f); see refs.
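The bound is easy to verify numerically. A sketch with g = 0 and a strongly convex quadratic (our test problem), where x* = A⁻¹b is known in closed form:

```python
import numpy as np

# Check F(x^k) - F_opt <= L_f ||x^0 - x*||^2 / (2k) for the gradient
# method (PGM with g = 0) on f(x) = 0.5 x^T A x - b^T x.
rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
A = B.T @ B + np.eye(6)              # positive definite
b = rng.standard_normal(6)
L = np.linalg.eigvalsh(A).max()

F = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
F_opt = F(x_star)

x0 = np.zeros(6)
x = x0.copy()
bound_holds = True
for k in range(1, 201):
    x = x - (A @ x - b) / L          # gradient step with stepsize 1/L
    bound = L * np.linalg.norm(x0 - x_star) ** 2 / (2 * k)
    bound_holds = bound_holds and (F(x) - F_opt <= bound + 1e-12)
```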


Proof of Rate for Prox-Grad: x^{n+1} = T_L(x^n)

F(x) − F(T_L(y)) ≥ (L/2)‖x − T_L(y)‖² − (L/2)‖x − y‖² + ℓ_f(x, y).

Proof. Set L = L_f, x = x∗ and y = x^n in the prox-grad inequality (ℓ_f(x∗, x^n) ≥ 0):

I F(x∗) − F(x^{n+1}) ≥ (L_f/2)‖x∗ − x^{n+1}‖² − (L_f/2)‖x∗ − x^n‖².

I Summing over n = 0, 1, . . . , k − 1 we obtain

Σ_{n=0}^{k−1} [F(x∗) − F(x^{n+1})] ≥ (L_f/2){‖x∗ − x^k‖² − ‖x∗ − x⁰‖²}.

I (★) Σ_{n=0}^{k−1} F(x^{n+1}) − kF_opt ≤ (L_f/2)‖x∗ − x⁰‖² − (L_f/2)‖x∗ − x^k‖² ≤ (L_f/2)‖x∗ − x⁰‖².

I By the sufficient decrease Corollary one has F(x^{n+1}) ≤ F(x^n) ⇒

I Σ_{n=0}^{k−1} n(F(x^{n+1}) − F(x^n)) = −Σ_{n=0}^{k−1} F(x^{n+1}) + kF(x^k) ≤ 0.

I Consequently, adding the above to (★): F(x^k) − F_opt ≤ L_f‖x∗ − x⁰‖²/(2k).


Iteration Complexity of PGM

Recall: An ε-optimal solution of problem (P) is a vector x satisfying

F(x) − F_opt ≤ ε.

We now ask: how many iterations are required to satisfy F(x^k) − F_opt ≤ ε?

For PGM we have proved: F(x^k) − F_opt ≤ L_f‖x⁰ − x∗‖²/(2k) ≡ L_fR²/(2k),

with R defined by R² := max{‖x∗ − x⁰‖² : x∗ ∈ X∗}. Therefore,

Theorem [O(1/ε) complexity of PGM]. For any k satisfying

k ≥ ⌈L_fR²/(2ε)⌉,

it holds that F(x^k) − F_opt ≤ ε.


Special Cases

I With g ≡ 0 and g = δ_C in our model (P), PGM recovers the basic gradient and gradient projection methods, respectively.

I Therefore, both methods have complexity O(1/ε).

I Let us look at the case when f = 0 in (P).

I In that case, we have to solve a general convex nonsmooth problem, and PGM reduces to the Proximal Minimization Algorithm - PMA described next.


Proximal Minimization Algorithm - PMA

Set f ≡ 0 in (P), i.e., we solve the general convex nonsmooth problem

min{g(x) : x ∈ E}.

PGM reduces to the Proximal Minimization Algorithm (Martinet (70)):

x⁰ ∈ E,  x^k = argmin_{x∈E} { g(x) + (1/2t_k)‖x − x^{k−1}‖² },  t_k > 0.

This is an implicit discretization of 0 ∈ dx(t)/dt + ∂g(x(t)), x(0) = x⁰.

Theorem. Let {x^k} be the sequence generated by PMA, and set σ_k = Σ_{s=1}^{k} t_s. Then

g(x^k) − g(x) ≤ ‖x⁰ − x‖²/(2σ_k),  ∀x ∈ E.

In particular, if σ_k → ∞ then g(x^k) ↓ g∗ = inf_x g(x), and if X∗ ≠ ∅, then x^k converges to some point in X∗.

Consequence: The complexity of PMA is O(1/ε) — "better" than the subgradient method's O(1/ε²). But, no free lunch: it requires g to be prox-friendly!
Very useful when combined with duality → Augmented Lagrangian Methods.
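A one-dimensional illustration (ours, not from the lecture) with g(x) = |x|, whose prox is soft-thresholding: each PMA step with constant stepsize t shrinks the iterate by t, so g(x^k) decreases monotonically to g* = 0:

```python
# PMA for g(x) = |x| with constant stepsize t: the prox step
# x_k = prox_{t|.|}(x_{k-1}) soft-thresholds the previous iterate.
t = 0.1
x = 5.0
vals = []
for k in range(100):
    x = max(abs(x) - t, 0.0) * (1.0 if x >= 0 else -1.0)  # prox step
    vals.append(abs(x))  # g(x_k)
```

The theorem's bound g(x^k) − g(0) ≤ ‖x⁰‖²/(2σ_k) with σ_k = kt is loose here: the iterate actually reaches the minimizer 0 after finitely many steps.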


Strongly Convex Case - Linear Rate for Prox-Grad

Theorem. Suppose that f is σ-strongly convex (σ > 0). Let {x^k}_{k≥0} be the sequence generated by the proximal gradient method. Then for any x∗ ∈ X∗ and k ≥ 0,

(a) ‖x^{k+1} − x∗‖² ≤ (1 − σ/L_f)‖x^k − x∗‖².

(b) ‖x^k − x∗‖² ≤ (1 − σ/L_f)^k‖x⁰ − x∗‖².

(c) F(x^{k+1}) − F_opt ≤ (L_f/2)(1 − σ/L_f)^{k+1}‖x⁰ − x∗‖².
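Part (a) can be checked numerically; a sketch with g = 0 on a σ-strongly convex quadratic with known σ and L_f (a diagonal test problem of our choosing):

```python
import numpy as np

# Verify ||x^{k+1} - x*||^2 <= (1 - sigma/L_f) ||x^k - x*||^2 for gradient
# descent (PGM with g = 0) on f(x) = 0.5 x^T A x - b^T x, A = diag(1, 4, 10).
A = np.diag([1.0, 4.0, 10.0])        # eigenvalues give sigma = 1, L_f = 10
b = np.array([1.0, -2.0, 3.0])
sigma, L = 1.0, 10.0
x_star = np.linalg.solve(A, b)

x = np.zeros(3)
contraction_holds = True
for _ in range(50):
    x_new = x - (A @ x - b) / L      # gradient step with stepsize 1/L_f
    lhs = np.linalg.norm(x_new - x_star) ** 2
    rhs = (1 - sigma / L) * np.linalg.norm(x - x_star) ** 2
    contraction_holds = contraction_holds and (lhs <= rhs + 1e-12)
    x = x_new
```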

Proof

I Plugging L = L_f, x = x∗ and y = x^k in the prox-grad inequality:

F(x∗) − F(x^{k+1}) ≥ (L_f/2)‖x∗ − x^{k+1}‖² − (L_f/2)‖x∗ − x^k‖² + ℓ_f(x∗, x^k).

I ℓ_f(x∗, x^k) = f(x∗) − f(x^k) − 〈∇f(x^k), x∗ − x^k〉 ≥ (σ/2)‖x^k − x∗‖²  ← [f is strongly convex]

I F(x∗) − F(x^{k+1}) ≥ (L_f/2)‖x∗ − x^{k+1}‖² − ((L_f − σ)/2)‖x∗ − x^k‖².

I Since F(x∗) − F(x^{k+1}) ≤ 0,

‖x^{k+1} − x∗‖² ≤ (1 − σ/L_f)‖x^k − x∗‖²,

establishing part (a). Part (b) follows immediately by recursion on (a).

I (c) follows from the third inequality above written as:

F(x^{k+1}) − F_opt ≤ ((L_f − σ)/2)‖x^k − x∗‖² − (L_f/2)‖x^{k+1} − x∗‖²
                  ≤ (L_f/2)(1 − σ/L_f)‖x^k − x∗‖²
                  ≤ (L_f/2)(1 − σ/L_f)^{k+1}‖x⁰ − x∗‖².   [using (b)]


Complexity of PGM - Strongly Convex Case

A direct result of the rate analysis. We let κ := L_f/σ.

Theorem [O(κ log(1/ε)) complexity of strongly convex PGM]. For any k ≥ 1 satisfying

k ≥ κ log(1/ε) + κ log(L_fR²/2),

it holds that F(x^k) − F_opt ≤ ε.

References

[1] A. Beck and M. Teboulle, Fast Gradient-Based Algorithms for Constrained Total Variation Image Denoising and Deblurring Problems. IEEE Trans. Image Processing, 18, 2419–2434 (2009).

[2] A. Beck and M. Teboulle, Gradient-Based Algorithms with Applications to Signal Recovery Problems. In: Convex Optimization in Signal Processing and Communications, Y. Eldar and D. Palomar (eds.), Cambridge University Press (2010).

[3] P. L. Combettes and V. R. Wajs, Signal Recovery by Proximal Forward-Backward Splitting. Multiscale Model. Simul. (2005).

First Order Optimization Methods

Lecture 4 - Fast Proximal Gradient (FISTA)

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

PGMO Lecture Series

January 25-26, 2017 Ecole Polytechnique, Paris

Marc Teboulle First Order Optimization Methods Lecture 4 - Fast Proximal Gradient (FISTA) 1 / 25

Lecture 4 – Fast Proximal Gradient Method (FISTA)

The previous explicit methods are simple but might be too slow.

I Subgradient methods: rate of O(1/√k).

I Gradient / prox-gradient methods: rate of O(1/k).

Can we do better to solve the composite nonsmooth problem (P)?

(P) min{F(x) := f(x) + g(x) : x ∈ E}.

• Same underlying assumptions:

(A) g : E → (−∞, ∞] is proper, closed and convex.

(B) f : E → R is L_f-smooth and convex.

(C) The optimal set X∗ of (P) is nonempty, with optimal value denoted F_opt.

Can we devise a method with:
♠ the same computational effort/simplicity as prox-grad,
♠ a faster global rate of convergence?


Yes we Can...

Answer: Yes, through an "equally simple" scheme.

• The Idea: instead of making a step of the form

x^{k+1} = prox_{g/L_k}(x^k − (1/L_k)∇f(x^k)),

we will consider a step of the same form at an auxiliary point y^k:

x^{k+1} = prox_{g/L_k}(y^k − (1/L_k)∇f(y^k)),

where y^k is a special linear combination of x^k, x^{k−1}.


FISTA for Composite Nonsmooth Model (P)

FISTA – Input: convex (f, g) with f L_f-smooth.
Initialization: set y⁰ = x⁰ ∈ E and t₀ = 1.
General step: for any k = 0, 1, 2, . . . execute the following steps:

(a) Pick L_k > 0.

(b) x^{k+1} = T_{L_k}(y^k) ≡ prox_{g/L_k}(y^k − (1/L_k)∇f(y^k)).

(c) t_{k+1} = (1 + √(1 + 4t_k²))/2.

(d) y^{k+1} = x^{k+1} + ((t_k − 1)/t_{k+1})(x^{k+1} − x^k).

I The main computation, (b), is the same as in prox-grad; (c)-(d) are marginal.

I With t_k ≡ 1 for all k, we recover ProxGrad.
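A direct transcription of steps (a)-(d) with constant L_k = L_f, specialized (our illustrative choice, not from the lecture) to f(x) = ½‖Ax − b‖² and g = λ‖·‖₁:

```python
import numpy as np

# FISTA with constant stepsize for F(x) = 0.5 ||Ax - b||^2 + lam * ||x||_1.
def fista(A, b, lam, iters=1000):
    L = np.linalg.norm(A, 2) ** 2                    # L_f for f = 0.5||Ax-b||^2
    x = y = np.zeros(A.shape[1])
    t = 1.0                                          # t_0 = 1
    for _ in range(iters):
        z = y - (A.T @ (A @ y - b)) / L              # (b) prox-grad step at y
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2     # (c) update t_k
        y = x_new + ((t - 1) / t_new) * (x_new - x)  # (d) momentum step
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.5
x = fista(A, b, lam)
g = A.T @ (A @ x - b)    # optimality requires ||grad f(x*)||_inf <= lam
```

Each iteration costs one gradient and one prox evaluation, exactly as in PGM; the momentum step (d) is what buys the O(1/k²) rate.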

Some Technical Remarks

I Stepsize rules. Here we will use the constant stepsize L_k = L_f for all k.

I As mentioned before, when L_f is unknown, we can simply use a backtracking procedure; see refs. [1,2] for details.

I The sequence {t_k} is determined as the positive root of the quadratic equation

t_{k+1}² − t_{k+1} = t_k²,  solved by t_{k+1} = (1 + √(1 + 4t_k²))/2.

I It is easy to show by induction that t_k ≥ (k + 2)/2 for all k ≥ 0.

I Therefore, 0 < t_k^{−1} ≤ 1 for all k.
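The two key properties of {t_k} can be checked directly (a small sketch of ours):

```python
import math

# The FISTA sequence t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2 with t_0 = 1:
# verify t_k >= (k + 2)/2 and the defining relation t_{k+1}^2 - t_{k+1} = t_k^2.
t = 1.0
props_hold = True
for k in range(100):
    props_hold = props_hold and (t >= (k + 2) / 2)
    t_next = (1 + math.sqrt(1 + 4 * t * t)) / 2
    props_hold = props_hold and abs(t_next**2 - t_next - t**2) < 1e-8
    t = t_next
```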

O(1/k²) Rate of Convergence of FISTA

Theorem. Let {x^k}_{k≥0} be the sequence generated by FISTA with the constant stepsize L_k ≡ L_f. Then for any x∗ ∈ X∗ and k ≥ 1,

F(x^k) − F_opt ≤ 2L_f‖x⁰ − x∗‖²/(k + 1)².

Thus, the complexity of FISTA is O(1/√ε), a square-root factor better than PGM!


Main Pillar in Analysis - A Special Recursion

Lemma (Recursion). The sequences {x^k, y^k} generated by FISTA with the constant stepsize rule satisfy, for every k ≥ 1,

(2/L)t_{k−1}²v_k − (2/L)t_k²v_{k+1} ≥ ‖u^{k+1}‖² − ‖u^k‖²,

where v_k := F(x^k) − F_opt and u^k := t_{k−1}x^k − (t_{k−1} − 1)x^{k−1} − x∗.

Proof. We apply the fundamental prox-grad inequality

F(x) − F(T_L(y)) ≥ (L/2)‖x − T_L(y)‖² − (L/2)‖x − y‖² + ℓ_f(x, y).

Take x = t_k^{−1}x∗ + (1 − t_k^{−1})x^k and y = y^k, L ≡ L_f; here x^{k+1} = T_L(y^k):

F(t_k^{−1}x∗ + (1 − t_k^{−1})x^k) − F(x^{k+1})
  ≥ (L/2)‖x^{k+1} − (t_k^{−1}x∗ + (1 − t_k^{−1})x^k)‖² − (L/2)‖y^k − (t_k^{−1}x∗ + (1 − t_k^{−1})x^k)‖²
  = (L/2t_k²)‖t_kx^{k+1} − (x∗ + (t_k − 1)x^k)‖² − (L/2t_k²)‖t_ky^k − (x∗ + (t_k − 1)x^k)‖².  (1)


Proof of Recursion Lemma Contd.

I By the convexity of F, F(t_k^{−1}x∗ + (1 − t_k^{−1})x^k) ≤ t_k^{−1}F(x∗) + (1 − t_k^{−1})F(x^k).

I Using the notation v_k ≡ F(x^k) − F_opt,

F(t_k^{−1}x∗ + (1 − t_k^{−1})x^k) − F(x^{k+1}) ≤ (1 − t_k^{−1})v_k − v_{k+1}.  (2)

I Using the relation given by FISTA, y^k = x^k + ((t_{k−1} − 1)/t_k)(x^k − x^{k−1}),

‖t_ky^k − (x∗ + (t_k − 1)x^k)‖² = ‖t_{k−1}x^k − (x∗ + (t_{k−1} − 1)x^{k−1})‖².  (3)

I Combining (1), (2) and (3),

(t_k² − t_k)v_k − t_k²v_{k+1} ≥ (L/2)‖u^{k+1}‖² − (L/2)‖u^k‖²,

where u^k ≡ t_{k−1}x^k − (x∗ + (t_{k−1} − 1)x^{k−1}).

I Since t_k² − t_k = t_{k−1}²,

(2/L)t_{k−1}²v_k − (2/L)t_k²v_{k+1} ≥ ‖u^{k+1}‖² − ‖u^k‖².


Proof of Theorem: O(1/k²) Rate for FISTA

Using the Recursion Lemma:

(2/L)t_{k−1}²v_k − (2/L)t_k²v_{k+1} ≥ ‖u^{k+1}‖² − ‖u^k‖².

I Written as ‖u^{k+1}‖² + (2/L)t_k²v_{k+1} ≤ ‖u^k‖² + (2/L)t_{k−1}²v_k, ⇒

I ‖u^k‖² + (2/L)t_{k−1}²v_k ≤ ‖u¹‖² + (2/L)t₀²v₁ = ‖x¹ − x∗‖² + (2/L)(F(x¹) − F_opt).

I Therefore,

(2/L)t_{k−1}²v_k ≤ ‖x¹ − x∗‖² + (2/L)(F(x¹) − F_opt).

I Now, substituting x = x∗, y = y⁰ and L in the prox-grad inequality, we get

(2/L)(F(x∗) − F(x¹)) ≥ ‖x¹ − x∗‖² − ‖y⁰ − x∗‖².

I Since y⁰ = x⁰: ‖x¹ − x∗‖² + (2/L)(F(x¹) − F_opt) ≤ ‖x⁰ − x∗‖².

I Combining the last two displayed inequalities and using t_{k−1} ≥ (k + 1)/2 (recall v_k = F(x^k) − F_opt) yields the result,

F(x^k) − F_opt ≤ 2L‖x⁰ − x∗‖²/(k + 1)².


Alternative Choice for t_k

▸ For the proof of the O(1/k²) rate, it is enough to require that {t_k}_{k≥0} satisfies
  (a) t_k ≥ (k + 2)/2;
  (b) t²_{k+1} − t_{k+1} ≤ t²_k. (See the proof of the last inequality in the Lemma.)
▸ The choice t_k = (k + 2)/2 also satisfies these two properties. (a) is obvious. (b) holds since

    t²_{k+1} − t_{k+1} = t_{k+1}(t_{k+1} − 1) = ((k + 3)/2)·((k + 1)/2) = (k² + 4k + 3)/4
                       ≤ (k² + 4k + 4)/4 = ((k + 2)/2)² = t²_k.
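Both conditions can be checked numerically for the standard FISTA recursion and for the simple choice t_k = (k + 2)/2. A minimal sketch (our code, not from the lecture; function names are ours):

```python
import math

def fista_t_sequence(n):
    """Standard FISTA sequence: t_0 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    t = [1.0]
    for _ in range(n):
        t.append((1.0 + math.sqrt(1.0 + 4.0 * t[-1] ** 2)) / 2.0)
    return t

def satisfies_conditions(t):
    """Check (a) t_k >= (k + 2)/2 and (b) t_{k+1}^2 - t_{k+1} <= t_k^2."""
    a = all(t[k] >= (k + 2) / 2 - 1e-12 for k in range(len(t)))
    b = all(t[k + 1] ** 2 - t[k + 1] <= t[k] ** 2 + 1e-9 for k in range(len(t) - 1))
    return a and b

print(satisfies_conditions(fista_t_sequence(100)))              # True (standard rule, with equality in (b))
print(satisfies_conditions([(k + 2) / 2 for k in range(101)]))  # True (simple choice)
```

For the standard recursion, (b) holds with equality by construction, which is why either sequence yields the same O(1/k²) bound.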


ISTA/FISTA - l1-Regularization

Consider the model

    min_{x∈Rⁿ} f(x) + λ‖x‖₁,

▸ λ > 0;
▸ f : Rⁿ → R convex and L_f-smooth.

Iterative Shrinkage/Thresholding Algorithm (ISTA):

    x^{k+1} = T_{λ/L_f}( x^k − (1/L_f) ∇f(x^k) ).

Fast Iterative Shrinkage/Thresholding Algorithm (FISTA):

    (a) x^{k+1} = T_{λ/L_f}( y^k − (1/L_f) ∇f(y^k) );
    (b) t_{k+1} = (1 + √(1 + 4t²_k)) / 2;
    (c) y^{k+1} = x^{k+1} + ((t_k − 1)/t_{k+1}) (x^{k+1} − x^k).
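Here T_α is the soft-thresholding (shrinkage) operator, the prox map of α‖·‖₁, applied componentwise: T_α(x)_i = sign(x_i)·max(|x_i| − α, 0). A minimal sketch (our code):

```python
import numpy as np

def soft_threshold(x, alpha):
    """Prox of alpha*||.||_1: componentwise shrinkage sign(x_i) * max(|x_i| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)
```

For example, soft_threshold([3, -0.5, 1], 1) returns [2, 0, 0]: components below the threshold are zeroed, the rest are shrunk toward zero, which is what produces sparse iterates.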

l1-Regularized Least Squares

Consider the problem

    min_{x∈Rⁿ} (1/2)‖Ax − b‖²₂ + λ‖x‖₁,

▸ A ∈ R^{m×n}, b ∈ R^m and λ > 0.
▸ Fits (P) with f(x) = (1/2)‖Ax − b‖²₂ and g(x) = λ‖x‖₁.
▸ f is L_f-smooth with L_f = ‖AᵀA‖₂ = λ_max(AᵀA).

ISTA:

    x^{k+1} = T_{λ/L_k}( x^k − (1/L_k) Aᵀ(Ax^k − b) ).

FISTA:

    (a) x^{k+1} = T_{λ/L_k}( y^k − (1/L_k) Aᵀ(Ay^k − b) );
    (b) t_{k+1} = (1 + √(1 + 4t²_k)) / 2;
    (c) y^{k+1} = x^{k+1} + ((t_k − 1)/t_{k+1}) (x^{k+1} − x^k).
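The two schemes above, with the constant stepsize L_k ≡ L_f, can be sketched as follows (our code, not the lecture's; names are ours):

```python
import numpy as np

def soft_threshold(x, alpha):
    """Prox of alpha*||.||_1 (componentwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def ista(A, b, lam, n_iter):
    """ISTA for (1/2)||Ax - b||^2 + lam*||x||_1 with constant stepsize 1/L_f."""
    L = np.linalg.norm(A.T @ A, 2)              # L_f = lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

def fista(A, b, lam, n_iter):
    """FISTA for the same model: gradient step at the extrapolated point y^k."""
    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

On random instances of the kind used in the numerical example below, 200 FISTA iterations typically reach a far smaller objective gap than 200 ISTA iterations, matching the O(1/k²) versus O(1/k) rates.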

Numerical Example I

▸ Test on the l1-regularized least-squares problem.
▸ λ = 1.
▸ A ∈ R^{100×110}; the components of A were independently generated from a standard normal distribution.
▸ The "true" vector is x_true = e₃ − e₇.
▸ b = Ax_true.
▸ Ran 200 iterations of ISTA and FISTA with x⁰ = e.

Function Values

[Figure: F(x^k) − F_opt versus iteration number (log scale, 10⁻¹² to 10⁴) over 200 iterations, for ISTA and FISTA.]

FISTA is NOT a monotone algorithm!

Solutions

[Figure: components of the recovered vectors (indices 0–110, values in [−1, 1]) for ISTA (left) and FISTA (right).]

Example 2: Wavelet-Based Image Deblurring

    min_x (1/2)‖Ax − b‖² + λ‖x‖₁

▸ Image of size 512×512.
▸ The matrix A is dense (Gaussian blur composed with the inverse of a two-stage Haar wavelet transform).
▸ All problems solved with fixed λ and Gaussian noise.

Deblurring of the Cameraman

[Figure: original image (left); blurred and noisy image (right).]

1000 Iterations of ISTA versus 200 of FISTA

[Figure: deblurred images after 1000 ISTA iterations (left) and 200 FISTA iterations (right).]

Original Versus Deblurring via FISTA

[Figure: original image (left); FISTA reconstruction after 1000 iterations (right).]

Function Value Errors F(x^k) − F(x*)

[Figure: F(x^k) − F(x*) versus iteration number (log scale, 10⁻⁸ to 10¹) over 10000 iterations, for ISTA and FISTA.]

FISTA in the Strongly Convex Case

▸ Assume that f is σ-strongly convex for some σ > 0.
▸ The proximal gradient method attains an ε-optimal solution after an order of O(κ log(1/ε)) iterations (κ = L_f/σ).
▸ A natural question: how does the complexity result change when using FISTA?
▸ Answered by incorporating a restarting mechanism into FISTA, which improves the complexity to O(√κ log(1/ε)).

Restarted FISTA
Initialization: pick z^{−1} ∈ E and a positive integer N. Set z⁰ = T_{L_f}(z^{−1}).
General step (k ≥ 0):
▸ run N iterations of FISTA with constant stepsize (L_k ≡ L_f) and input (f, g, z^k), obtaining a sequence {xⁿ}_{n=0}^N;
▸ set z^{k+1} = x^N.


Restarted FISTA – Complexity O(√κ log(1/ε))

Theorem [O(√κ log(1/ε)) complexity of restarted FISTA]. Suppose that f is σ-strongly convex (σ > 0). Let {z^k}_{k≥0} be the sequence generated by the restarted FISTA method employed with N = ⌈√(8κ) − 1⌉. Let R be an upper bound on ‖z^{−1} − x*‖. Then:

(a) F(z^k) − F_opt ≤ (L_f R²/2) (1/2)^k;

(b) after k iterations of FISTA with k satisfying

    k ≥ √(8κ) ( log(1/ε)/log(2) + log(L_f R²)/log(2) ),

an ε-optimal solution is obtained at the end of the last completed cycle:

    F(z^{⌊k/N⌋}) − F_opt ≤ ε.

Sketch of Proof of (a)

(b) follows easily from (a). To prove (a), N iterations of FISTA give:

▸ F(z^{n+1}) − F_opt ≤ 2L_f ‖z^n − x*‖² / (N + 1)².
▸ F(z^n) − F_opt ≥ (σ/2)‖z^n − x*‖²   (F is σ-strongly convex).
▸ Hence, F(z^{n+1}) − F_opt ≤ 4κ (F(z^n) − F_opt) / (N + 1)².
▸ Since N ≥ √(8κ) − 1, it follows that 4κ/(N + 1)² ≤ 1/2, and hence

    F(z^{n+1}) − F_opt ≤ (1/2)(F(z^n) − F_opt).

▸ ⇒ F(z^k) − F_opt ≤ (1/2)^k (F(z⁰) − F_opt).
▸ It remains to show that F(z⁰) − F_opt ≤ L_f R²/2 to complete the proof of (a).
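The restart scheme can be sketched for the l1-regularized least-squares model with A of full column rank, so that f is σ-strongly convex with σ = λ_min(AᵀA). A minimal sketch (our construction, not the lecture's code):

```python
import numpy as np

def soft_threshold(x, alpha):
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def fista_cycle(A, b, lam, L, x0, N):
    """N FISTA iterations with constant stepsize 1/L, (re)started at x0."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(N):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

def restarted_fista(A, b, lam, n_cycles):
    """Restarted FISTA: cycles of length N = ceil(sqrt(8*kappa) - 1), momentum reset each cycle."""
    eig = np.linalg.eigvalsh(A.T @ A)
    L, sigma = eig[-1], eig[0]                        # L_f and strong convexity modulus
    N = int(np.ceil(np.sqrt(8.0 * L / sigma) - 1.0))  # cycle length from the theorem
    z = soft_threshold(A.T @ b / L, lam / L)          # z^0 = T_L(z^{-1}) with z^{-1} = 0
    for _ in range(n_cycles):
        z = fista_cycle(A, b, lam, L, z, N)
    return z
```

By part (a) of the theorem, each completed cycle at least halves the objective gap, so the objective is nonincreasing across cycles.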

Applications/Limitations of FISTA for (P)

    (P) min{ f(x) + g(x) : x ∈ E }

The smooth convex function can be of any type f ∈ C^{1,1} with an available gradient. As long as the prox of the nonsmooth function g,

    argmin_{x∈E} { g(x) + (L/2) ‖x − (y − (1/L)∇f(y))‖² },

can be computed analytically or easily/efficiently, FISTA is useful and very efficient.

FISTA solves some interesting models in

▸ signal/image recovery problems;
▸ matrix minimization problems arising in many machine learning models, e.g., nuclear norm regularization, multi-task learning, matrix classification, matrix completion problems.

References

Lecture based on [1] and [2].

1. A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. (2009).
2. A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal-recovery problems, in Convex Optimization in Signal Processing and Communications (2010).
3. Y. Nesterov, Gradient methods for minimizing composite functions, Math. Program. (2013).

First Order Optimization Methods

Lecture 5 - Smoothing Methods

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

PGMO Lecture Series

January 25-26, 2017, Ecole Polytechnique, Paris

Lecture 5 – Smoothing Methods

▸ Basic idea: Given a nonsmooth objective J, transform

    (NSO) min_{x∈X} J(x)  −→  (SO) min_{x∈X} J_µ(x),  J_µ smooth,

  where µ > 0 is a smoothing parameter such that lim_{µ→0⁺} J_µ(x) = J(x).
▸ Motivation + Good News: Smooth problems are often easier to handle and can be solved more efficiently.
▸ Bad News:
  ♠ One needs to solve a sequence of problems with µ ↘ 0.
  ♠ Convergence results/approximate solutions are not for the original nonsmooth objective J, but for the approximate smoothed problem J_µ.
▸ What we would like: Results for the original nonsmooth problem.


FOM for convex nonsmooth problems

▸ As seen, in general, nonsmooth convex optimization problems can be solved with complexity O(1/ε²).
▸ FISTA requires O(1/√ε) iterations to obtain an ε-optimal solution of the composite smooth+nonsmooth model f + g, when g is prox-friendly.

Can we reduce this gap? Can we solve NSO in ~O(1/ε) iterations via one fixed smoothed counterpart problem? Answer: Yes!

▸ We will show how FISTA can be used to devise a method for more general nonsmooth convex problems.
▸ We will see that an ε-optimal solution of the original nonsmooth problem can be obtained with an improved complexity of O(1/ε), by solving one adequate smoothed counterpart.


Model and Idea

The model under consideration is

    (P) min{ f(x) + h(x) + g(x) : x ∈ E }.

▸ f is L_f-smooth and convex;
▸ g is proper, closed, convex, and "proximable";
▸ h is real-valued and convex (but not "proximable").

The Idea - Partial Smoothing

▸ Solving (P) with FISTA with smooth/nonsmooth parts (f, g + h) is not practical.
▸ The idea is to find a smooth approximation of h, say h_µ, and solve the problem via FISTA with smooth and nonsmooth parts taken as (f + h_µ, g).
▸ This simple idea is the basis for the improved O(1/ε) complexity.
▸ We need to study in more detail the notions of smooth approximations and smoothability.


Smooth Approximations and Smoothability

▸ Definition. A convex function h : E → R is called (α, β)-smoothable (α, β > 0) if for any µ > 0 there exists a convex differentiable function h_µ : E → R such that
  (a) h_µ(x) ≤ h(x) ≤ h_µ(x) + βµ for all x ∈ E;
  (b) h_µ is (α/µ)-smooth.
▸ The function h_µ is called a (1/µ)-smooth approximation of h with parameters (α, β).

Let us see two basic examples.

Examples

▹ Smoothing the l2-norm. h(x) = ‖x‖₂ (E = Rⁿ). For any µ > 0,

    √(‖x‖²₂ + µ²) − µ ≤ ‖x‖₂ ≤ (√(‖x‖²₂ + µ²) − µ) + µ.

Thus, h_µ(x) ≡ √(‖x‖²₂ + µ²) − µ satisfies (a) with β = 1, and one easily verifies that h_µ is (1/µ)-smooth; hence h is (1, 1)-smoothable.

▹ Smoothing the "max" via the Log-Sum-Exp function. h(x) = max{x₁, x₂, …, xₙ} (E = Rⁿ). For any µ > 0, h_µ(x) = µ log(Σᵢ₌₁ⁿ e^{xᵢ/µ}) − µ log n is a smooth approximation of h with parameters (1, log n) ⇒ h is (1, log n)-smoothable.

One way to see this, recall:

    ‖z‖_∞ = max{|z₁|, …, |zₙ|} = lim_{p→∞} (Σᵢ₌₁ⁿ |zᵢ|^p)^{1/p}.

Set µ := 1/p, µ → 0⁺, and |zᵢ| = e^{xᵢ}, pass to the log, and also note that

    µ log( (Σᵢ₌₁ⁿ e^{xᵢ/µ}) / n ) ≤ max{x₁, …, xₙ}.

Another (and more general) way is via asymptotic (recession) functions and generalized means; see ref. [T. 2007].
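The sandwich h_µ(x) ≤ h(x) ≤ h_µ(x) + µ log n for the log-sum-exp smoothing can be verified numerically. A minimal sketch (our code; the max-shift is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def lse_smooth_max(x, mu):
    """h_mu(x) = mu*log(sum_i exp(x_i/mu)) - mu*log(n), smooth approximation of max."""
    m = np.max(x)  # shift by the max for numerical stability; exact, since it cancels in the log
    return mu * np.log(np.sum(np.exp((x - m) / mu))) + m - mu * np.log(len(x))

rng = np.random.default_rng(1)
for _ in range(1000):
    x, mu = rng.standard_normal(5), 0.1
    h, h_mu = np.max(x), lse_smooth_max(x, mu)
    assert h_mu <= h + 1e-12                    # h_mu(x) <= h(x)
    assert h <= h_mu + mu * np.log(5) + 1e-12   # h(x) <= h_mu(x) + (log n) * mu
```

Both inequalities follow from e^{max/µ} ≤ Σᵢ e^{xᵢ/µ} ≤ n·e^{max/µ}; the check above confirms them on random data.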

Basic Properties and Algebra for Smoothable Functions

Proposition.

▸ A nonnegative linear combination of smoothable functions is smoothable: if h₁ and h₂ are (α₁, β₁)- and (α₂, β₂)-smoothable, then γ₁h₁ + γ₂h₂ (γ₁, γ₂ ≥ 0) is smoothable with parameters (γ₁α₁ + γ₂α₂, γ₁β₁ + γ₂β₂).
▸ A linear change of variables of a smoothable function is smoothable: let A : E → V be a linear map and b ∈ V. If h is (α, β)-smoothable with smooth approximation h_µ, then q(x) ≡ h(A(x) + b), with smooth approximation

    q_µ(x) ≡ h_µ(A(x) + b),

  is smoothable with parameters (α‖A‖², β).

Proof: Follows easily from the definition.

Smooth Approximation of Piecewise Affine Functions

▸ Let q(x) = max_{i=1,…,m} {aᵢᵀx + bᵢ}, where aᵢ ∈ Rⁿ and bᵢ ∈ R for each i = 1, 2, …, m.
▸ q(x) = g(Ax + b), where g(y) = max{y₁, y₂, …, y_m}, A is the matrix whose rows are a₁ᵀ, a₂ᵀ, …, a_mᵀ, and b = (b₁, b₂, …, b_m)ᵀ.
▸ Let µ > 0. g_µ(y) = µ log(Σᵢ₌₁ᵐ e^{yᵢ/µ}) − µ log m is a (1/µ)-smooth approximation of g with parameters (1, log m).
▸ Therefore,

    q_µ(x) ≡ g_µ(Ax + b) = µ log(Σᵢ₌₁ᵐ e^{(aᵢᵀx + bᵢ)/µ}) − µ log m

  is a (1/µ)-smooth approximation of q with parameters (‖A‖²₂, log m).

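The smoothed piecewise-affine function q_µ has the closed-form gradient ∇q_µ(x) = Aᵀw, where w is the softmax of (Ax + b)/µ. A minimal sketch (our code), with the gradient verified against finite differences:

```python
import numpy as np

def smoothed_max_affine(A, b, x, mu):
    """q_mu(x) = mu*log(sum_i exp((a_i^T x + b_i)/mu)) - mu*log(m), and its gradient A^T w,
    where w are the softmax weights of (Ax + b)/mu."""
    y = A @ x + b
    m = y.max()                              # shift for numerical stability
    e = np.exp((y - m) / mu)
    val = mu * np.log(e.sum()) + m - mu * np.log(len(y))
    grad = A.T @ (e / e.sum())
    return val, grad
```

One can check both the (1, log m) sandwich against max_i(aᵢᵀx + bᵢ) and the gradient formula by central differences.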

The Moreau Envelope

Definition. Given a proper closed convex function f : E → (−∞, ∞] and µ > 0, the Moreau envelope of f is the function

    M_f^µ(x) = min_{u∈E} { f(u) + (1/2µ)‖u − x‖² }.

▸ By the prox theorem, the minimization problem defining the Moreau envelope has a unique solution, given by prox_{µf}(x). Therefore,

    M_f^µ(x) = f(prox_{µf}(x)) + (1/2µ)‖x − prox_{µf}(x)‖².

We will show that the parameter µ plays the role of a smoothing parameter. First, some examples.

Examples

▸ Indicators. Suppose that f = δ_C, where C ⊆ E is a nonempty closed convex set. Then prox_f = P_C and

    M_f^µ(x) = δ_C(P_C(x)) + (1/2µ)‖x − P_C(x)‖².

  Therefore,

    M_{δ_C}^µ = (1/2µ) d²_C.

▸ Euclidean norms. f(x) = ‖x‖. Then for any µ > 0 and x ∈ E,

    prox_{µf}(x) = (1 − µ/max{‖x‖, µ}) x.

  Therefore,

    M_f^µ(x) = ‖prox_{µf}(x)‖ + (1/2µ)‖x − prox_{µf}(x)‖²
             = { (1/2µ)‖x‖²,  ‖x‖ ≤ µ,
                 ‖x‖ − µ/2,   ‖x‖ > µ }  =: H_µ(x).

H_µ is known as the Huber function in statistics.
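The identity H_µ = M_f^µ for f = ‖·‖ can be checked numerically by evaluating the envelope through the prox formula. A minimal sketch (our code):

```python
import numpy as np

def huber(x, mu):
    """H_mu(x): (1/(2 mu))||x||^2 if ||x|| <= mu, else ||x|| - mu/2."""
    nx = np.linalg.norm(x)
    return nx * nx / (2.0 * mu) if nx <= mu else nx - mu / 2.0

def moreau_env_norm(x, mu):
    """M^mu_f for f = ||.||, via prox_{mu f}(x) = (1 - mu/max(||x||, mu)) x."""
    p = (1.0 - mu / max(np.linalg.norm(x), mu)) * x
    return np.linalg.norm(p) + np.linalg.norm(x - p) ** 2 / (2.0 * mu)
```

Both branches agree: for ‖x‖ ≤ µ the prox collapses to 0 and the envelope is the quadratic piece; for ‖x‖ > µ it shrinks x by µ along its direction and the envelope is the linear piece.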


Huber Function

H_µ gets smoother as µ increases.

[Figure: H_µ on [−5, 5] for µ = 0.1, 1, 4.]

Calculus of Moreau Envelopes

▸ Let f : E → (−∞, ∞] be a proper closed convex function, and let λ, µ > 0. Then

    λ M_f^µ(x) = M_{λf}^{µ/λ}(x).

▸ Suppose that E = E₁ × E₂ × ⋯ × E_m, and let f : E → (−∞, ∞] be given by

    f(x₁, x₂, …, x_m) = Σᵢ₌₁ᵐ fᵢ(xᵢ),  x₁ ∈ E₁, x₂ ∈ E₂, …, x_m ∈ E_m,

  with fᵢ : Eᵢ → (−∞, ∞] a proper closed convex function for each i. Then for any µ > 0,

    M_f^µ(x₁, x₂, …, x_m) = Σᵢ₌₁ᵐ M_{fᵢ}^µ(xᵢ).

Example: Let f(x) = ‖x‖₁ = Σᵢ₌₁ⁿ g(xᵢ), where g(t) = |t|. Thus,

    M_f^µ(x) = Σᵢ₌₁ⁿ M_g^µ(xᵢ) = Σᵢ₌₁ⁿ H_µ(xᵢ).
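The separability rule in the example can be cross-checked against a brute-force evaluation of the envelope's defining minimization. A minimal sketch (our code; the grid minimization is only a verification device):

```python
import numpy as np

def huber_scalar(t, mu):
    """Scalar Huber function H_mu(t)."""
    return t * t / (2.0 * mu) if abs(t) <= mu else abs(t) - mu / 2.0

def moreau_l1(x, mu):
    """Moreau envelope of ||.||_1 via separability: a sum of scalar Huber terms."""
    return sum(huber_scalar(t, mu) for t in x)

def moreau_l1_bruteforce(x, mu):
    """Direct coordinatewise minimization of |u| + (u - t)^2/(2 mu) on a fine grid."""
    total = 0.0
    for t in x:
        u = np.linspace(t - 3.0, t + 3.0, 60001)
        total += np.min(np.abs(u) + (u - t) ** 2 / (2.0 * mu))
    return total
```

The two values agree up to the grid resolution, confirming M_{‖·‖₁}^µ(x) = Σᵢ H_µ(xᵢ).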

A Key Property: The Moreau Envelope is (1/µ)-Smooth

Theorem. Let f : E → (−∞, ∞] be a proper closed convex function, and let µ > 0. Then:

(a) M_f^µ admits the following equivalent "dual" formulation:

    M_f^µ(x) = max_{y∈E*} { ⟨y, x⟩ − f*(y) − (µ/2)‖y‖² };

(b) M_f^µ is (1/µ)-smooth over E;

(c) ∇M_f^µ(x) = (1/µ)(x − prox_{µf}(x)).

To prove the key property (b), we recall a fundamental result on the relation between conjugacy, strong convexity, and smoothness.
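The gradient formula (c) can be sanity-checked by finite differences on the envelope of the Euclidean norm (the Huber function). A minimal sketch (our code):

```python
import numpy as np

def prox_mu_norm(x, mu):
    """prox_{mu f}(x) for f = ||.||_2."""
    return (1.0 - mu / max(np.linalg.norm(x), mu)) * x

def moreau_norm(x, mu):
    """Moreau envelope of the Euclidean norm, evaluated via the prox."""
    p = prox_mu_norm(x, mu)
    return np.linalg.norm(p) + np.linalg.norm(x - p) ** 2 / (2.0 * mu)

def grad_moreau_norm(x, mu):
    """Part (c) of the theorem: grad M^mu_f(x) = (x - prox_{mu f}(x)) / mu."""
    return (x - prox_mu_norm(x, mu)) / mu
```

Central differences on moreau_norm reproduce grad_moreau_norm to numerical precision, consistent with the (1/µ)-smoothness claim.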

Conjugate of Strongly Convex and Smoothness

Theorem (Conjugate and Smoothness). Let σ > 0. If f : E → (−∞, ∞] is a proper closed σ-strongly convex function, then f* : E → R is (1/σ)-smooth.

Proof. Let f be proper closed and σ-strongly convex.

▸ By Fenchel: y ∈ ∂f(x) ⇔ x ∈ ∂f*(y). Compactly, we can write ∂f*(y) = argmax_{x∈E} {⟨x, y⟩ − f(x)} for any y, and by strong convexity the maximizer x is unique. Thus, ∂f*(y) is a singleton for every y, and hence f* is differentiable over E*.
▸ Take y₁, y₂ ∈ E* and denote v₁ = ∇f*(y₁), v₂ = ∇f*(y₂).
▸ Fenchel again: y₁ ∈ ∂f(v₁), y₂ ∈ ∂f(v₂). Then, by strong monotonicity of ∂f,

    ⟨y₁ − y₂, v₁ − v₂⟩ ≥ σ‖v₁ − v₂‖²,
    i.e., ⟨y₁ − y₂, ∇f*(y₁) − ∇f*(y₂)⟩ ≥ σ‖∇f*(y₁) − ∇f*(y₂)‖².

▸ By the Cauchy–Schwarz inequality, ‖∇f*(y₁) − ∇f*(y₂)‖ ≤ (1/σ)‖y₁ − y₂‖*.

Reverse: If f : E → R is a (1/σ)-smooth convex function, then f* is σ-strongly convex.


Proof of the (1/µ)-Smoothness of the Moreau Envelope

To prove (a), we simply use standard duality to get the dual representation of M_f^µ:

    M_f^µ(x) = max_{y∈E*} { ⟨y, x⟩ − f*(y) − (µ/2)‖y‖² } = (f* + (µ/2)‖·‖²)*(x).

With this dual representation, we can then prove (b) by applying the previous theorem:

▸ f* + (µ/2)‖·‖² is µ-strongly convex.
▸ Hence, by the previous theorem, its conjugate, which is precisely M_f^µ, is (1/µ)-smooth, proving (b).

The proof of (c) follows from calculus; see refs.


Examples - Smoothing via Moreau Envelope

▸ The squared distance. Let C ⊆ E be a nonempty closed convex set. Recall that (1/2)d²_C = M_{δ_C}^1. Then (1/2)d²_C is 1-smooth and

    ∇((1/2)d²_C)(x) = x − prox_{δ_C}(x) = x − P_C(x).

▸ Norms. H_µ = M_f^µ, where f(x) = ‖x‖. Then H_µ is (1/µ)-smooth and

    ∇H_µ(x) = (1/µ)(x − prox_{µf}(x)) = (1/µ)(x − (1 − µ/max{‖x‖, µ})x)
            = { (1/µ)x,   ‖x‖ ≤ µ,
                x/‖x‖,    ‖x‖ > µ. }

Lipschitz Convex Functions are Smoothable

Theorem. Let h : E → R be convex and Lipschitz with constant ℓ_h, i.e.,

    |h(x) − h(y)| ≤ ℓ_h‖x − y‖  for all x, y ∈ E.

Then h is (1, ℓ²_h/2)-smoothable.

Norms are typical in applications: ℓ_h = √n for the l1-norm and ℓ_h = 1 for the l2-norm.

Proof. Fix any µ > 0. Then h_µ := M_h^µ is (1/µ)-smooth, and hence α = 1. It remains to get our "sandwich" M_h^µ ≤ h ≤ M_h^µ + βµ with β = ℓ²_h/2.

▸ For any x ∈ E,

    M_h^µ(x) = min_{u∈E} { h(u) + (1/2µ)‖u − x‖² } ≤ h(x) + (1/2µ)‖x − x‖² = h(x).

▸ Let g_x ∈ ∂h(x). Then, since h is Lipschitz convex, we have ‖g_x‖ ≤ ℓ_h, and

    M_h^µ(x) − h(x) = min_{u∈E} { h(u) − h(x) + (1/2µ)‖u − x‖² }
                    ≥ min_{u∈E} { ⟨g_x, u − x⟩ + (1/2µ)‖u − x‖² }
                    = −(µ/2)‖g_x‖² ≥ −(ℓ²_h/2)µ.

Examples of Smoothable Lipschitz Convex Functions

▶ (Smooth approximation of the l2-norm) Let h(x) = ‖x‖₂ (over Rⁿ). Then h is convex and Lipschitz with constant ℓ_h = 1. Therefore,

M_µh(x) = H_µ(x) = { (1/2µ)‖x‖₂², if ‖x‖₂ ≤ µ;  ‖x‖₂ − µ/2, if ‖x‖₂ > µ }

is a (1/µ)-smooth approximation of h with parameters (1, 1/2).

▶ (Smooth approximation of the l1-norm) Let h(x) = ‖x‖₁ (over Rⁿ). Then h is convex and Lipschitz with constant ℓ_h = √n. Hence,

M_µh(x) = Σ_{i=1}^n H_µ(x_i)

is a (1/µ)-smooth approximation of h with parameters (1, n/2).

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 18 / 27
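As a numerical sanity check on the (1, n/2)-smoothability claim for the l1-norm, the following sketch (pure Python, helper names ours) builds M_µh as a sum of scalar Huber terms and verifies the sandwich M_µh ≤ h ≤ M_µh + (n/2)µ on random points:

```python
import random

def huber(t, mu):
    # Moreau envelope of |.|: quadratic inside the band, linear outside
    return t * t / (2 * mu) if abs(t) <= mu else abs(t) - mu / 2

def l1_envelope(x, mu):
    # Moreau envelope of ||.||_1 separates coordinate-wise
    return sum(huber(xi, mu) for xi in x)

random.seed(0)
n, mu = 5, 0.3
for _ in range(100):
    x = [random.uniform(-2, 2) for _ in range(n)]
    h = sum(abs(xi) for xi in x)
    m = l1_envelope(x, mu)
    # sandwich with alpha = 1, beta = n/2:  M_mu h <= h <= M_mu h + (n/2) mu
    assert m <= h + 1e-12 and h <= m + (n / 2) * mu + 1e-12
print("sandwich holds")
```

The per-coordinate gap |t| − H_µ(t) is at most µ/2, which is where the β = n/2 parameter comes from after summing over n coordinates.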

Back to Algorithms - Model and Assumptions

Main model:

(P) min_{x∈E} { H(x) ≡ f(x) + h(x) + g(x) }.

(A) f : E → R is L_f-smooth.

(B) h : E → R is (α, β)-smoothable (α, β > 0). For any µ > 0, h_µ denotes a (1/µ)-smooth approximation of h with parameters (α, β).

(C) g : E → (−∞, ∞] is proper, closed and convex.

(D) H has bounded level sets.

(E) The optimal set of (P) is nonempty and denoted by X*. The optimal value of the problem is denoted by H_opt.

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 19 / 27

The S-FISTA Method - Partial Smoothing

▶ The idea is to consider the following smoothed version of (P):

(P_µ) min_{x∈E} { H_µ(x) ≡ f(x) + h_µ(x) + g(x) },  with F_µ(x) := f(x) + h_µ(x),

for some µ > 0, and solve it using FISTA with constant stepsize.

▶ A Lipschitz constant of ∇F_µ is L_f + α/µ; the stepsize is taken as 1/(L_f + α/µ).

S-FISTA
Input: x⁰ ∈ dom(g), µ > 0.
Initialization: set y⁰ = x⁰, t₀ = 1; construct h_µ - a (1/µ)-smooth approximation of h with parameters (α, β); set F_µ = f + h_µ, L = L_f + α/µ.
General step: for any k = 0, 1, 2, ... execute the following steps:

(a) x^{k+1} = prox_{(1/L)g}( y^k − (1/L)∇F_µ(y^k) );

(b) t_{k+1} = ( 1 + √(1 + 4t²_k) )/2;

(c) y^{k+1} = x^{k+1} + ( (t_k − 1)/t_{k+1} )( x^{k+1} − x^k ).

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 20 / 27

O(1/ε) Complexity of S-FISTA

Theorem. Let ε > 0 and let {x^k}_{k≥0} be the sequence generated by S-FISTA with smoothing parameter

µ = √(α/β) · ε / ( √(αβ) + √(αβ + L_f ε) ).

Then for any k satisfying

k ≥ √(αβΓ) (1/ε) + √(L_f Γ) (1/√ε),

it holds that H(x^k) − H_opt ≤ ε.

▶ Here Γ > 0 is a constant (e.g., as in FISTA, which depends on x⁰, x*, R).

▶ Γ is not needed to run the algorithm...

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 21 / 27

Proof of Theorem - For you....

▶ Let µ > 0. Set F = f + h_µ. Then L_F = L_f + α/µ. Applying S-FISTA:

(∗) H_µ(x^k) − H*_µ ≤ ( L_f + α/µ ) Γ / k²

▶ h_µ smoothable ⇒ ∃β s.t.: H_µ(x) ≤ H(x) ≤ H_µ(x) + βµ

▶ In particular, we thus get

H* ≥ H*_µ,  H(x^k) ≤ H_µ(x^k) + βµ

▶ Hence,

H(x^k) − H* ≤ H_µ(x^k) − H*_µ + βµ ≤ L_f Γ/k² + ( αΓ/k² )(1/µ) + βµ.

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 22 / 27

Proof Contd.

▶ H(x^k) − H* ≤ L_f Γ/k² + ( αΓ/k² )(1/µ) + βµ.

▶ Minimizing the RHS with respect to µ:

(∗) µ = √(αΓ/β) · (1/k)

▶ H(x^k) − H* ≤ L_f Γ/k² + 2√(αβΓ) (1/k).

▶ Thus to get H(x^k) − H* ≤ ε we require

(L_f Γ)(1/k²) + 2√(αβΓ)(1/k) ≤ ε ⇒ k ≥ √(αβΓ) (1/ε) + √(L_f Γ) (1/√ε)

Plugging the lower bound on k into µ given in (∗) yields the result.

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 23 / 27

Example

Consider the problem

(P2) min_{x∈Rⁿ} { (1/2)‖Ax − b‖₂² + ‖Dx‖₁ + λ‖x‖₁ },

where A ∈ R^{m×n}, b ∈ R^m, D ∈ R^{p×n} and λ > 0.

▶ Fits model (P) with f(x) = (1/2)‖Ax − b‖₂², h(x) = ‖Dx‖₁, g(x) = λ‖x‖₁.
▶ All assumptions are met:
  ▶ f is convex and L_f-smooth with L_f = ‖A‖₂².
  ▶ g is closed and convex with an easy prox.
  ▶ h is (‖D‖₂², p/2)-smoothable. Indeed,
    ▶ Set h(x) := q(Dx), where q : R^p → R is given by q(y) = ‖y‖₁.
    ▶ For any µ > 0, q_µ(y) = M_µq(y) = Σ_{i=1}^p H_µ(y_i) is a (1/µ)-smooth approximation of q with parameters (1, p/2), hence the claim with h_µ(x) ≡ q_µ(Dx).
▶ To obtain an ε-optimal solution of problem (P2), we can then use the S-FISTA method on

min_{x∈Rⁿ} { F_µ(x) + λ‖x‖₁ },  where F_µ(x) = f(x) + h_µ(x)

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 24 / 27

Example Contd.

▶ For the smooth part F_µ(x) = f(x) + M_µq(Dx), we have

∇F_µ(x) = ∇f(x) + Dᵀ∇M_µq(Dx)
        = ∇f(x) + (1/µ) Dᵀ( Dx − prox_{µq}(Dx) )
        = ∇f(x) + (1/µ) Dᵀ( Dx − T_µ(Dx) ),

and then we get L_µ = ‖A‖₂² + ‖D‖₂²/µ.

▶ We then set µ = √(α/β) · ε / ( √(αβ) + √(αβ + L_f ε) ), plugging in α = ‖D‖₂², β = p/2, L_f = ‖A‖₂² from above.

▶ We know that the nonsmooth part g(x) = λ‖x‖₁ admits an easy prox (the soft-thresholding operation), and hence S-FISTA can be easily applied.

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 25 / 27

S-FISTA for the Example

S-FISTA for solving (P2).
Initialization: Set y⁰ = x⁰ ∈ Rⁿ, t₀ = 1; µ and L = L_µ as above.
General step: for any k = 0, 1, 2, ... execute the following steps:

(a) x^{k+1} = T_{λ/L}( y^k − (1/L)( ∇f(y^k) + (1/µ) Dᵀ( Dy^k − T_µ(Dy^k) ) ) );

(b) t_{k+1} = ( 1 + √(1 + 4t²_k) )/2;

(c) y^{k+1} = x^{k+1} + ( (t_k − 1)/t_{k+1} )( x^{k+1} − x^k ).

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 26 / 27
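The scheme above can be sketched in a few lines. This is our own illustrative implementation (function and variable names hypothetical), tested on the degenerate case A = D = I, where (P2) reduces to min (1/2)‖x − b‖² + (1 + λ)‖x‖₁, whose exact solution is the soft-threshold of b at level 1 + λ:

```python
import numpy as np

def soft(x, t):
    # soft-thresholding: prox of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def s_fista(A, b, D, lam, eps, iters):
    """S-FISTA sketch for (P2): min 0.5||Ax-b||^2 + ||Dx||_1 + lam||x||_1.
    Smoothing and step parameters follow the slides."""
    n, p = A.shape[1], D.shape[0]
    Lf = np.linalg.norm(A, 2) ** 2                   # f is Lf-smooth
    alpha, beta = np.linalg.norm(D, 2) ** 2, p / 2.0
    ab = np.sqrt(alpha * beta)
    mu = np.sqrt(alpha / beta) * eps / (ab + np.sqrt(alpha * beta + Lf * eps))
    L = Lf + alpha / mu                              # Lipschitz const of grad F_mu
    x = y = np.zeros(n)
    t = 1.0
    for _ in range(iters):
        Dy = D @ y
        # grad F_mu(y) = A^T(Ay - b) + (1/mu) D^T(Dy - T_mu(Dy))
        grad = A.T @ (A @ y - b) + D.T @ (Dy - soft(Dy, mu)) / mu
        x_new = soft(y - grad / L, lam / L)          # prox of (lam/L)||.||_1
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

b = np.array([3.0, -0.2, 1.0, -2.0])
x = s_fista(np.eye(4), b, np.eye(4), 0.5, eps=1e-3, iters=30000)
print(np.round(x, 2))   # close to soft(b, 1.5) = [1.5, 0, 0, -0.5]
```

Note the many iterations: with ε = 10⁻³ the smoothing parameter µ is tiny, so L is large and the O(1/ε) bound of the theorem bites, exactly as the complexity analysis predicts.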

References

Lecture based on [1].

1. A. Beck and M. Teboulle, Smoothing and first order methods: a unified framework. SIAM J. Optim. (2012).

2. M. Teboulle, A unified continuous optimization framework for center-based clustering methods. Journal of Machine Learning Research (2007).

3. Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program. (2005).

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 27 / 27

First Order Optimization Methods

Lecture 6 - Lagrangian and Decomposition Methods

Marc Teboulle

School of Mathematical Sciences
Tel Aviv University

PGMO Lecture Series

January 25-26, 2017, Ecole Polytechnique, Paris

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 1 / 26

Lecture 6 - Lagrangians and Decomposition Methods

• The Model: Nonsmooth Convex

(P) p* = inf{ ϕ(x) ≡ f(x) + g(Ax) : x ∈ Rⁿ },

▶ Now f, g are convex, extended-valued, and both nonsmooth.
▶ A : Rⁿ → R^m is a given linear map.

Problem (P) is equivalent to (via the standard splitting-variables trick):

(P) p* = inf{ f(x) + g(z) : Ax = z, x ∈ Rⁿ, z ∈ R^m }

Rockafellar ('76) has shown that the Proximal Algorithm can be applied to the dual and primal-dual formulations of (P) to produce:

▶ The Multipliers Method (Augmented Lagrangian Method).
▶ The Proximal Method of Multipliers (PMM).
▶ Largely ignored over the last 20 years..... Recently a very strong revival in image science, machine learning, etc., with many algorithms all rooted in - and variants of - the PMM.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 2 / 26

The PMM - Proximal Method of Multipliers

PMM: Generate (x^k, z^k) and dual multiplier y^k via

(x^{k+1}, z^{k+1}) ∈ argmin_{x,z} { f(x) + g(z) + ⟨y^k, Ax − z⟩ + (c/2)‖Ax − z‖² + q_k(x, z) },
y^{k+1} = y^k + c( Ax^{k+1} − z^{k+1} ),  (c > 0).

▶ The Augmented Lagrangian = Penalized Lagrangian

L_c(x, z, y) := f(x) + g(z) + ⟨y, Ax − z⟩ [the Lagrangian] + (c/2)‖Ax − z‖²,  (c > 0).

▶ q_k(x, z) := (1/2)( ‖x − x^k‖²_{M₁} + ‖z − z^k‖²_{M₂} ) is the additional primal proximal term.

▶ The choice of M₁ ∈ Sⁿ₊, M₂ ∈ S^m₊ is used to conveniently describe/analyze several variants of the PMM.

▶ M₁ = M₂ ≡ 0 recovers the Multipliers Method (PMA on the dual).

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 3 / 26

Proximal Method of Multipliers - Key Difficulty

▶ Main computational step in PMM: minimize w.r.t. (x, z) the proximal Augmented Lagrangian:

f(x) + g(z) + ⟨y^k, Ax − z⟩ + (c/2)‖Ax − z‖² + q_k(x, z).

▶ The quadratic coupling term ‖Ax − z‖² destroys the separability between x and z, preventing separate minimization in (x, z).

▶ In many applications, separate minimization is often crucial!

Many strategies are available to overcome this difficulty:

▶ Approximate Minimization - linearize the quadratic term ‖Ax − z‖² w.r.t. (x, z).

▶ Alternating Minimization - a la "Gauss-Seidel" in (x, z).

▶ Mixture of the above - partial linearization with respect to one variable, combined with alternating minimization over the other variable.

▶ These strategies produce various useful variants of the PMM.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 4 / 26

A Prototype: Alternating Direction PMM (AD-PMM)

Eliminate the coupling in (x, z) via alternating minimization steps.

(AD-PMM) Alternating Direction Proximal Method of Multipliers

1. Start with any (x⁰, z⁰, y⁰) ∈ Rⁿ × R^m × R^m and c > 0.

2. For k = 0, 1, ... generate the sequence {x^k, z^k, y^k} as follows:

x^{k+1} ∈ argmin { f(x) + (c/2)‖Ax − z^k + c⁻¹y^k‖² + (1/2)‖x − x^k‖²_{M₁} },
z^{k+1} = argmin { g(z) + (c/2)‖Ax^{k+1} − z + c⁻¹y^k‖² + (1/2)‖z − z^k‖²_{M₂} },
y^{k+1} = y^k + c( Ax^{k+1} − z^{k+1} ).

▶ Useful when the (x, z) steps are "easy" to implement exactly or inexactly (e.g., via the strategies just mentioned); for instance, the x-step is often not easy!

▶ Adequate choices of M₁, M₂ ⪰ 0 allow deriving various algorithms, next slide.

▶ Allows for convergence/complexity via a unifying analysis for many - old and new - algorithms.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 5 / 26
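With M₁ = M₂ = 0, AD-PMM is the classical ADM, and the steps above become very concrete. The sketch below (our own toy instance, names hypothetical) runs it on the splitting min (1/2)‖x − b‖² + λ‖z‖₁ s.t. x = z, i.e. A = I, whose solution is the soft-threshold of b:

```python
import numpy as np

def soft(x, t):
    # soft-thresholding: prox of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def adm_lasso_identity(b, lam, c=1.0, iters=500):
    """Classical ADM (AD-PMM with M1 = M2 = 0) on
    min 0.5||x-b||^2 + lam||z||_1  s.t.  x = z  (A = I)."""
    x = z = y = np.zeros(b.size)
    for _ in range(iters):
        # x-step: quadratic in x, closed form of the argmin
        x = (b + c * z - y) / (1.0 + c)
        # z-step: prox of (lam/c)||.||_1
        z = soft(x + y / c, lam / c)
        # multiplier update
        y = y + c * (x - z)
    return x, z

b = np.array([2.0, -0.3, 1.4])
x, z = adm_lasso_identity(b, lam=1.0)
print(np.round(z, 3))  # x and z both approach soft(b, 1) = [1, 0, 0.4]
```

Both sub-steps are trivial here because A = I; with a general A the x-step requires solving a linear system coupled through AᵀA, which is exactly the difficulty the linearized variants on the next slide address.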

Some Typical Schemes from AD-PMM

(1) Classical ADM (Alternating Direction Method of Multipliers): M₁ = M₂ = 0
▶ Alternates minimization of the standard Augmented Lagrangian L_c.

(2) Partially Regularized ADMM: M₁ ≻ 0, M₂ ⪰ 0
▶ For example, one can use M₁ := τ⁻¹Iₙ and M₂ = 0.
▶ Allows proving the convergence of the sequence {x^k} without any assumption on the matrix A.

(3) Mixed Strategy: Linearization and Alternating Minimization
▶ Linearization w.r.t. x, combined with AM in z. This is achieved by choosing:
▶ M₁ := τ⁻¹Iₙ − cAᵀA ⪰ 0 ⇔ cτ‖A‖² ≤ 1;  M₂ := 0.
▶ Produces the recent primal-dual algorithm [Chambolle-Pock (2010)].

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 6 / 26

Some Definitions/Recalls: Saddle Point and Dual

The schemes are primal-dual. The Lagrangian associated to (P) is

ℓ(x, z, y) = f(x) + g(z) + ⟨y, Ax − z⟩,  y ∈ R^m the multiplier for Ax = z.

Assumption on (P): The Lagrangian ℓ associated to (P) has a saddle point, i.e., there exists (x*, z*, y*) such that:

ℓ(x*, z*, y) ≤ ℓ(x*, z*, y*) ≤ ℓ(x, z, y*),  ∀ (x, z, y) ∈ Rⁿ × R^m × R^m.

It is well known that (x*, z*, y*) is a saddle point of the Lagrangian ℓ if and only if (x*, z*) is a solution of (P), y* is a solution of (D), and

p* = ℓ(x*, z*, y*) = inf_{x,z} [ sup_y ℓ(x, z, y) ] = sup_y [ inf_{x,z} ℓ(x, z, y) ] = d*.

The number p* = ℓ(x*, z*, y*) ∈ R is the saddle value of ℓ.

Recall: Since (P) is convex, under the standard constraint qualification, e.g.,

there exists x ∈ ri(dom f), z ∈ ri(dom g), such that Ax = z,

strong duality holds, and the dual optimal set S_D is nonempty, convex and compact.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 7 / 26

Approximate Saddle Point

Definition (ε-saddle point for a general convex-concave K)
Let ε > 0. A point (u_ε, v_ε) is called an ε-saddle point of a convex-concave function K if

sup { K(u_ε, v) − K(u, v_ε) : u ∈ S_P, v ∈ S_D } ≤ ε,

where, associated to the saddle function K, we define S_P (S_D) as the optimal solution set of the primal (dual) problem:

(P) inf_{u∈Rⁿ} [ r(u) = sup_{v∈R^d} K(u, v) ]  and  (D) sup_{v∈R^d} [ q(v) = inf_{u∈Rⁿ} K(u, v) ].

Our case: K(u, v) ≡ ℓ(x, z, y) = f(x) + g(z) + ⟨y, Ax − z⟩, (u := (x, z), v := y).
This measures the closeness of a candidate solution to the optimal set in terms of the duality gap, and our main task is to find an upper bound for

K(u_ε, v) − K(u, v_ε).

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 8 / 26

Ergodic Complexity for AD-PMM is O(1/N)

Recall: For any sequence {w^p} and any N ≥ 1, the ergodic sequence is

w̄^N := (1/N) Σ_{p=1}^N w^p.

Typical Result [for AD-PMM and other variants, Shefi-T. (2014)]
Let {x^k, z^k, y^k} be the sequence generated by (AD-PMM). Then the ergodic sequence (x̄^N, z̄^N, ȳ^N) with N = O(1/ε) is an ε-saddle point of ℓ, and hence yields both an ε-optimal primal solution and an ε-optimal dual solution of (P)-(D), respectively.

The more precise statement follows next...

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 9 / 26

Global Rate of Convergence/Complexity for AD-PMM

Theorem [for AD-PMM and other variants, Shefi-T. (2014)]. Let {x^k, z^k, y^k} be the sequence generated by (AD-PMM). Then the ergodic sequence (x̄^N, z̄^N, ȳ^N) satisfies, for all N ≥ 1:

▶ ℓ(x̄^N, z̄^N, y) − p* ≤ C(x*, z*, y)/N,  ∀y ∈ R^m

▶ Under compactness of the optimal primal solution set S_P, we have

max{ ℓ(x̄^N, z̄^N, y) − ℓ(x, z, ȳ^N) : (x, z) ∈ S_P, y ∈ S_D } ≤ c*/N,

where

C(x, z, y) := (1/2){ ‖x − x⁰‖²_{M₁} + ‖z − z⁰‖²_{M₂} + c⁻¹‖y − y⁰‖² } + (c/2)‖Ax − z⁰‖²

and c* = max{ C(x, z, y) : (x, z) ∈ S_P, y ∈ S_D }.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 10 / 26

Issues with PMM

PMM-based schemes are not free of potential problems, raising practical and theoretical issues:

▶ The penalty parameter c is unknown: trial/error runs, fine tuning, heuristics, ....

▶ The iteration complexity bound, as just seen, depends on c!

▶ Difficult to extend to a sum of m > 2 convex composite functions with linear maps:

min_x { f(x) + Σ_{i=1}^m g_i(A_i x) }

Any alternatives?..

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 11 / 26

Alternative Approach

Current methods which admit an O(1/ε) efficiency estimate:

▶ PMM-based: just discussed, with its potential drawbacks.

▶ Smoothing methods: as discussed in the previous lecture.
  ▶ Assume availability of a smoothable function.
  ▶ Require a smoothing parameter, in terms of the accuracy, fixed in advance.

Our Goal: An algorithm for a broad class of structured nonsmooth problems that achieves the efficiency estimate O(1/ε) and:

▶ Removes difficulties with current methods.

▶ Is flexible enough to be applied to more general scenarios.

▶ Involves simple computational tasks.

▶ Approach: Saddle-Point Reformulation of a Structured Model.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 12 / 26

A Saddle Point Approach/Reformulation

Our problem is of the form

(P) p* = min{ F(u) + G(Au) : u ∈ Rⁿ },

where F, G are both nonsmooth, closed, proper and convex, and A : Rⁿ → R^m is a given linear map.

Use data info: Since G is closed and convex, G ≡ G**, namely,

G(z) = sup_v { ⟨z, v⟩ − G*(v) }.

Therefore, (P) admits the min-max convex-concave representation:

min_u max_v { K(u, v) := F(u) + ⟨Au, v⟩ − G*(v) }.

Proposed alternative approach: an algorithm via the SP reformulation for a broad class of problems that is very flexible and allows tackling many scenarios.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 13 / 26

A Class of Structured Convex-Concave Saddle-Point Models

(M) min_{u∈Rⁿ} max_{v∈R^d} { K(u, v) := f(u) + ⟨u, Av⟩ − g(v) },

Data Information

(i) f : Rⁿ → R is convex and C^{1,1}_{L_f}: ‖∇f(u₁) − ∇f(u₂)‖ ≤ L_f‖u₁ − u₂‖, ∀u₁, u₂.

(ii) g_i : R^{d_i} → (−∞, +∞], i = 1, 2, ..., m, is a proper, lower semicontinuous (lsc) and convex function (possibly nonsmooth), and we let g : R^d → (−∞, +∞],

g(v) := Σ_{i=1}^m g_i(v_i);  d := Σ_{i=1}^m d_i;  v := (v₁, v₂, ..., v_m) ∈ R^d.

(iii) A_i : R^{d_i} → Rⁿ, i = 1, 2, ..., m, is a linear map, and we let A : R^d → Rⁿ be the linear map defined by Av = Σ_{i=1}^m A_i v_i.

Assumption: K(·, ·) has a saddle point, i.e., there exists (u*, v*) ∈ Rⁿ × R^d s.t.

K(u*, v) ≤ K(u*, v*) ≤ K(u, v*),  ∀ u ∈ Rⁿ, v ∈ R^d.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 14 / 26

A Proximal Alternating Predictor Corrector (PAPC)

(M) min_{u∈Rⁿ} max_{v∈R^d} { K(u, v) := f(u) + ⟨u, Av⟩ − g(v) },

The algorithm blends duality, predictor-corrector steps, and proximal operations, described next.

Features of PAPC - fully exploits the structures and data info of a problem:

▶ PAPC avoids the computationally difficult task of the prox of a composition with a linear map, (g ∘ A)(x) = g(Ax). It only asks for the prox of g(·).

▶ Can be easily applied to minimization problems with a sum of such composite terms in the objective/constraints.

▶ Constraints on the variable v are built in, thanks to g being extended-valued.

▶ Constraints on the variable u and/or the presence of nonsmooth primal objectives will be easily handled via the Dual Transportation Trick.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 15 / 26

The PAPC Method

PAPC Initialization: (u⁰, v⁰) ∈ Rⁿ × R^d, τ > 0, and σ_i > 0, i = 1, 2, ..., m.
For k = 1, 2, ...,

p^k = u^{k−1} − τ( Av^{k−1} + ∇f(u^{k−1}) ),
♣ v_i^k = prox_{σ_i g_i}( v_i^{k−1} + σ_i A_iᵀ p^k ),  i = 1, 2, ..., m,
u^k = u^{k−1} − τ( Av^k + ∇f(u^{k−1}) ).

Output: ū^N = (1/N) Σ_{k=1}^N u^k,  v̄^N = (1/N) Σ_{k=1}^N v^k.

▶ ♣ The v-step "decomposes" according to the structure. Only the prox of each g_i(·) is needed, not that of g(A_i x)!

▶ The parameters (τ, σ_i) are defined in terms of the problem's data L_f, A_i.

▶ Each iteration requires one application of A and of Aᵀ and one evaluation of ∇f.

Notation: S_i := σ_i⁻¹ I_{d_i}, i = 1, 2, ..., m;  S := Diag[S₁, S₂, ..., S_m];  d = Σ_{i=1}^m d_i.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 16 / 26
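The three steps above are easy to run on a toy instance. The sketch below (our own construction, names hypothetical) takes m = 1, A = I and g the indicator of {‖v‖_∞ ≤ λ}, the conjugate of λ‖·‖₁, so the underlying primal problem is min (1/2)‖u − b‖² + λ‖u‖₁ with known solution soft(b, λ). The step sizes satisfy τL_f ≤ 1 and S − τAᵀA ≻ 0:

```python
import numpy as np

def papc_toy(b, lam, tau=1.0, sigma=0.9, iters=4000):
    """PAPC sketch on K(u,v) = 0.5||u-b||^2 + <u, v> - g(v), m = 1, A = I,
    g = indicator of {||v||_inf <= lam} (conjugate of lam||.||_1).
    Conditions: tau*Lf = 1 <= 1 and sigma*tau*||A||^2 = 0.9 < 1."""
    u = v = np.zeros(b.size)
    for _ in range(iters):
        grad = u - b                            # gradient of f at u^{k-1}
        p = u - tau * (v + grad)                # predictor step
        v = np.clip(v + sigma * p, -lam, lam)   # prox of g = projection
        u = u - tau * (v + grad)                # corrector step
    return u

b = np.array([3.0, -0.2, 1.0])
u = papc_toy(b, lam=1.0)
print(np.round(u, 3))  # approaches soft-threshold of b: [2, 0, 0]
```

The prox in the v-step is just a componentwise clip because g is an indicator, illustrating the feature stressed on the previous slide: only the prox of g itself is needed, never the prox of a composition with A.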

Tools for Analysis... Simple

We want to find an ε-saddle point of K. To prove an iteration complexity, our main task is to find an upper bound for

Γ_k(u, v) := K(u^k, v) − K(u, v^k)  for all u ∈ Rⁿ, v ∈ R^d.

The analysis relies on the Descent Lemma for f and a weighted prox inequality:

Lemma (Weighted proximal inequality)
Let h : R^p → (−∞, ∞] be a proper, lsc and convex function. Given M ≻ 0 and x ∈ R^p, let

z = argmin_{ξ∈R^p} { h(ξ) + (1/2)‖ξ − x‖²_M }.

Then, for all ξ ∈ R^p, h(z) − h(ξ) ≤ ⟨ξ − z, M(z − x)⟩.

We will bound Γ_k(u, v) by bounding the following two terms and adding them:

K(u^k, v^k) − K(u, v^k)  and  K(u^k, v) − K(u^k, v^k).

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 17 / 26
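The weighted proximal inequality is just the subgradient optimality condition −M(z − x) ∈ ∂h(z) rearranged. A quick numerical check (our own scalar instance: h = |·|, M = m > 0, so z is a soft-threshold of x at level 1/m):

```python
import random

def soft(t, a):
    # prox of a|.|: soft-threshold at level a
    return (abs(t) - a) * (1 if t > 0 else -1) if abs(t) > a else 0.0

random.seed(1)
for _ in range(200):
    m = random.uniform(0.1, 5.0)
    x = random.uniform(-3, 3)
    # z = argmin |xi| + (m/2)(xi - x)^2 = soft(x, 1/m)
    z = soft(x, 1.0 / m)
    for _ in range(20):
        xi = random.uniform(-4, 4)
        # the lemma: h(z) - h(xi) <= <xi - z, M (z - x)>
        assert abs(z) - abs(xi) <= (xi - z) * m * (z - x) + 1e-9
print("weighted prox inequality verified")
```

This is the one-line tool behind Lemma 2 on the next slide, applied there with M = S and h built from g and the linear term.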

Key Estimates for PAPC

Let {(p^k, u^k, v^k)}_{k∈N} be a sequence generated by the PAPC algorithm.

Lemma 1. For every u ∈ Rⁿ and any k ∈ N, we have

K(u^k, v^k) − K(u, v^k) ≤ (1/2τ)( ‖u − u^{k−1}‖² − ‖u − u^k‖² ) − (1/2)( 1/τ − L_f )‖u^k − u^{k−1}‖².

Proof. Use the Descent Lemma for f plus some algebra...

Lemma 2. Assume that G := S − τAᵀA ≻ 0. For every v ∈ R^d and any k ∈ N, we have

K(u^k, v) − K(u^k, v^k) ≤ (1/2)( ‖v − v^{k−1}‖²_G − ‖v − v^k‖²_G − ‖v^k − v^{k−1}‖²_G ).

Proof. Use the weighted proximal inequality for v^k and the relations between p^k, u^k given in PAPC.

Thus, if G := S − τAᵀA ≻ 0 and τL_f ≤ 1, then for every u ∈ Rⁿ and v ∈ R^d:

Γ_k(u, v) = K(u^k, v) − K(u, v^k) ≤ (1/2τ)( ‖u − u^{k−1}‖² − ‖u − u^k‖² ) + (1/2)( ‖v − v^{k−1}‖²_G − ‖v − v^k‖²_G ).

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 18 / 26

The PAPC Method - Convergence Results

Notation: Let σ := max_{1≤i≤m} σ_i.

Theorem. Let {(p^k, u^k, v^k)}_{k∈N} be a sequence generated by the PAPC algorithm with τL_f ≤ 1 and στ Σ_{i=1}^m ‖A_i‖² ≤ 1.

(a) Global Rate of Convergence - Ergodic

K(ū^N, v) − K(u, v̄^N) ≤ ( τ⁻¹‖u − u⁰‖² + ‖v − v⁰‖²_G ) / (2N).

The bound constant is in terms of the data (L_f, A_i) - no extra parameters.

(b) Sequential Convergence: The sequence {(u^k, v^k)}_{k∈N} converges to a saddle point (u*, v*) of K.

Proof Sketch. (a) Sum the obtained bound on Γ_k, and use Jensen's inequality. (b) Use (a) and standard analysis tools.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 19 / 26
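To make the scheme concrete, here is a minimal numerical sketch of the PAPC iteration as I read it from reference [2] (predictor p^k, dual proximal step v^k, corrector u^k; the exact update order is an assumption to be checked against the paper), applied to a toy saddle point min_u max_{‖y‖∞≤λ} { ½‖u − c‖² + ⟨u, y⟩ }, i.e. min_u ½‖u − c‖² + λ‖u‖₁, whose solution is soft-thresholding of c. The chosen step sizes respect the theorem's conditions τLf ≤ 1 and στ‖A‖² ≤ 1 (here A = I, Lf = 1).

```python
import numpy as np

def papc(grad_f, prox_g_sigma, A, u0, v0, tau, sigma, iters=2000):
    """Sketch of the PAPC predictor-corrector iteration for
    min_u max_v f(u) + <Au, v> - g(v); prox_g_sigma is prox_{sigma*g}."""
    u, v = u0.copy(), v0.copy()
    for _ in range(iters):
        p = u - tau * (grad_f(u) + A.T @ v)      # predictor step p^k
        v = prox_g_sigma(v + sigma * (A @ p))    # dual proximal step v^k
        u = u - tau * (grad_f(u) + A.T @ v)      # corrector step u^k
    return u, v

# Toy problem: min_u 0.5*||u - c||^2 + lam*||u||_1, written as a saddle
# point via the conjugate of lam*||.||_1 (indicator of the l_inf ball).
c, lam = np.array([2.0, -0.5, 0.1]), 1.0
A = np.eye(3)
u, v = papc(grad_f=lambda u: u - c,
            prox_g_sigma=lambda y: np.clip(y, -lam, lam),  # projection onto [-lam, lam]^n
            A=A, u0=np.zeros(3), v0=np.zeros(3),
            tau=0.9, sigma=1.0)  # tau*Lf = 0.9 <= 1, sigma*tau*||A||^2 = 0.9 <= 1
print(u)  # close to soft-thresholding of c: [1.0, 0.0, 0.0]
```

Only a gradient of f, one multiplication by A and Aᵀ, and one cheap prox are used per iteration, which is the point of the method.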

The Dual Transportation Trick
For Handling Constraints and More...

Main tool: any proper closed convex function h satisfies h** = h, i.e.,

  h(u) ≡ max_v { ⟨u, v⟩ − h*(v) }.

I Constraints in u: Let U be nonempty closed convex. Then:

  min_{u∈U⊂R^p} F(u) = min_{u∈R^p} { F(u) + δ_U(u) } = min_{u∈R^p} max_{v∈R^p} { F(u) + ⟨u, v⟩ − δ*_U(v) }.

I If F is also nonsmooth, then we can "further transport":

  min_{u∈U⊂R^p} F(u) = min_{u∈R^p} max_{v1,v2∈R^p} { ⟨u, v1⟩ − F*(v1) + ⟨u, v2⟩ − δ*_U(v2) }.

I Both formulations fit exactly model (M) with m = 1 and m = 2, respectively.
I Separability in v is preserved.
I Knowledge of F*, δ*_U is not needed to apply PAPC!.. Thanks to Moreau's identity, prox_F + prox_{F*} = I.
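The last bullet can be checked numerically. A small sketch (function names are mine; unit prox step assumed): Moreau's identity prox_F(x) + prox_{F*}(x) = x lets us evaluate the prox of a conjugate without ever forming F*. For F = ‖·‖₁, prox_F is soft-thresholding and F* is the indicator of the ℓ∞ unit ball, whose prox is a componentwise clip; the two computations agree.

```python
import numpy as np

def soft_threshold(x, t=1.0):
    """prox of t*||.||_1: shrink each entry toward 0 by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_conjugate(x, prox_f):
    """Moreau's identity (unit step): prox_{f*}(x) = x - prox_f(x)."""
    return x - prox_f(x)

x = np.array([2.0, 0.5, -3.0, -0.2])
via_moreau = prox_conjugate(x, soft_threshold)
direct = np.clip(x, -1.0, 1.0)   # prox of the l_inf-ball indicator = projection
print(via_moreau, direct)        # identical: [1.0, 0.5, -1.0, -0.2]
```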



PAPC Applies to Many Important Models

Thanks to the DTT, PAPC can be easily applied to many important models.

♣ Convex Problems with Sums of Composite Convex Functions with Linear Maps

1. min_{u∈R^p} { F(u) + Σ_{i=1}^m H_i(B_i u) }.

2. min_{x_i} { Σ_{i=1}^m ψ(x_i) : Σ_{i=1}^m M_i x_i = b }.

3. min_{u∈R^p} { F(u) : Σ_{i=1}^m H_i(B_i u) ≤ α }.

For all these models, PAPC

I Decomposes nicely according to the given structure.
I Removes the difficult task of "computing the prox of a convex function composed with an affine map".
I Parameters are determined from the problem's data.

Application Example: Fused Lasso Regression

Suppose that the training set consists of N training examples {(a_i, b_i)}_{i=1}^N, where a_i ∈ R^n are the feature vectors and b_i ∈ {−1, +1} are the respective labels. The logistic model assumes

  P(b|a) = 1 / (1 + exp(−b(wᵀa + v))),

where v ∈ R and w ∈ R^n are the parameters of the model. The likelihood function of the training set is then given by

  L(v, w; {(a_i, b_i)}_{i=1}^N) = Π_{i=1}^N 1 / (1 + exp(−b_i(wᵀa_i + v))).

I In "lasso" regression we assume that the vector w has a bounded ℓ1 norm (sparse).
I In "fused" regression we further ask that (Dw)_i := w_{i+1} − w_i have a bounded ℓ1 norm, i.e., a "staircase-like" pattern, e.g., relevant when the model parameters come in some natural order.
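As a sketch (variable names are mine), the average negative log-likelihood of this model and its gradient, the smooth part that will be handed to a first-order method, can be coded and checked against finite differences:

```python
import numpy as np

def lavg(v, w, A, b):
    """Average negative log-likelihood of the logistic model.
    A: (N, n) feature matrix, b: labels in {-1, +1}."""
    z = b * (A @ w + v)
    return np.mean(np.logaddexp(0.0, -z))   # log(1 + exp(-z)), computed stably

def lavg_grad(v, w, A, b):
    """Gradients of lavg with respect to v and w."""
    z = b * (A @ w + v)
    s = -b / (1.0 + np.exp(z))              # per-sample d/dt log(1+exp(-b*t))
    return np.mean(s), A.T @ s / len(b)

# Finite-difference check of the first w-coordinate of the gradient.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = np.where(rng.random(50) < 0.5, -1.0, 1.0)
v, w = 0.3, rng.standard_normal(5)
gv, gw = lavg_grad(v, w, A, b)
eps = 1e-6
fd = (lavg(v, w + eps * np.eye(5)[0], A, b) - lavg(v, w, A, b)) / eps
print(abs(fd - gw[0]))   # tiny: analytic and numeric gradients agree
```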

Fused Lasso Regression - A Saddle-Point Formulation

The problem is to find the MLE - a convex nonsmooth constrained problem:

  (FLR)  min_{v∈R, w∈R^n} { lavg(v, w) : ‖w‖₁ ≤ s₁, ‖Dw‖₁ ≤ s₂ },

  lavg(v, w) := −(1/N) log L(v, w; {(a_i, b_i)}_{i=1}^N) = (1/N) Σ_{i=1}^N log(1 + exp(−b_i(wᵀa_i + v))).

Using the "Dual transportation trick" we get the SP formulation

  (FLR′)  min_{v∈R, w∈R^n} max_{y∈R^n, z∈R^{n−1}} { lavg(v, w) + ⟨w, y⟩ − s₁‖y‖∞ + ⟨Dw, z⟩ − s₂‖z‖∞ }.

I All the required prox operations in PAPC are easy to compute: projections onto the ℓ1-ball, which can be done in O(d).
I PAPC requires only one evaluation of the computationally expensive gradient ∇lavg per iteration.
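The first bullet relies on Euclidean projection onto an ℓ1-ball. A self-contained sketch of the classical sort-and-threshold scheme, which is O(n log n) (the linear-time bound on the slide refers to more involved pivot-based variants):

```python
import numpy as np

def project_l1_ball(x, radius=1.0):
    """Euclidean projection of x onto {y : ||y||_1 <= radius}."""
    if np.abs(x).sum() <= radius:
        return x.copy()                      # already inside the ball
    u = np.sort(np.abs(x))[::-1]             # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(x) + 1)
    rho = np.max(np.nonzero(u - (css - radius) / k > 0)[0])  # last valid index
    theta = (css[rho] - radius) / (rho + 1)  # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

p1 = project_l1_ball(np.array([2.0, 0.0]), 1.0)   # -> [1.0, 0.0]
p2 = project_l1_ball(np.array([1.0, 1.0]), 1.0)   # -> [0.5, 0.5]
print(p1, p2)
```

The projection onto a ball of radius s is just this routine with `radius=s`, which is exactly what the y- and z-updates of PAPC on (FLR′) reduce to after Moreau's identity.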

Fused Lasso Regression - Numerical Illustration

I 1000 samples from the logistic model corresponding to v = 0 and the vector w ∈ R^100 defined by

  w_i = 0.1 for i ∈ {11, ..., 14};  −0.2 for i ∈ {41, ..., 44};  0.3 for i ∈ {91, ..., 94};  0 otherwise.

I The a_i ∈ R^100 (i = 1, 2, ..., 1000) were generated with each component drawn independently from the standard normal distribution.

I The labels b_i ∈ {−1, +1} were then randomly chosen according to the logistic model

  P(b|a) = 1 / (1 + exp(−b(wᵀa + v))).

I L_lavg := Σ_{i=1}^N ( ‖a_i‖² + 1 ) / (4N).
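A sketch reproducing this experimental setup (the random seed and variable names are mine; indices on the slide are 1-based):

```python
import numpy as np

rng = np.random.default_rng(42)
N, n = 1000, 100

# Piecewise-constant ground truth w as specified on the slide.
w = np.zeros(n)
w[10:14] = 0.1     # i in {11, ..., 14}
w[40:44] = -0.2    # i in {41, ..., 44}
w[90:94] = 0.3     # i in {91, ..., 94}
v = 0.0

# Features: i.i.d. standard normal; labels: drawn from the logistic model.
A = rng.standard_normal((N, n))
p = 1.0 / (1.0 + np.exp(-(A @ w + v)))       # P(b = +1 | a)
b = np.where(rng.random(N) < p, 1.0, -1.0)

# Lipschitz constant of the gradient of lavg, as on the slide.
L_lavg = np.sum(np.sum(A**2, axis=1) + 1.0) / (4.0 * N)
print(L_lavg)   # roughly (n + 1) / 4 = 25.25 for standard normal features
```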

Fused Lasso Regression - Numerical Results

[Figure: two panels plotting the 100 model parameters. Left panel: true model, PAPC (ergodic) solution after 1000 iterations, and exact solution. Right panel: true model and classical lasso estimate.]

I Left: model parameters estimated by PAPC versus the exact solution via CVX.

I Right: estimated model using classical lasso. Thus, adding the "fused" constraints considerably increases the quality and interpretability of the recovered model.

References – Lecture based on [1] and [2]

1. R. Shefi and M. Teboulle. Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM J. Optimization, 24, 269–297, 2014.

2. Y. Drori, S. Sabach and M. Teboulle. A simple algorithm for a class of nonsmooth convex-concave saddle-point problems. Operations Research Letters, 43, 209–214, 2015.

3. D. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, Boston, 1982.

4. R. Glowinski and P. Le Tallec. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, volume 9. SIAM, 1989.

5. A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision, 40(1), 120–145, 2011.

6. R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5), 877–898, 1976.

7. R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67, 91–108, 2005.