
First Order Optimization Methods

Lecture 1 - Introduction and Basic Results

Marc Teboulle

School of Mathematical SciencesTel Aviv University

PGMO Lecture Series

January 25-26, 2017 Ecole Polytechnique, Paris


Goals

• Deriving simple methods capable of solving very large-scale problems

• Amenable to theoretical analysis: convergence/complexity

Exploit problem structures and data information: convex and nonconvex models

3 ELEMENTARY PRINCIPLES

• Approximation • Regularization • Decomposition


Outline

PGMO Lecture Series, Fondation Mathématique Jacques Hadamard and Ecole Polytechnique

1. Introduction: Basic Schemes/Examples

2. The Proximal Map

3. Proximal Gradient Methods

4. Fast Proximal Gradient

5. Smoothing Methods

6. Lagrangian and Decomposition Methods

7. FOM Beyond Lipschitz Gradient Continuity

8. FOM beyond Convexity

Applications and examples illustrating ideas and methods

Marc Teboulle First Order Optimization Methods Lecture 1 - Introduction and Basic Results 3 / 22

Opening Remark and Credit

More than 386 years ago... In 1629, Fermat suggested the following:

• Given f, solve for x:

[ (f(x + d) − f(x)) / d ]_{d=0} = 0

...We can hardly expect to find a more general method to get the maximum or minimum points on a curve...

Pierre de Fermat


Simple Minimization Methods

Practical Side – Simplicity/Scalability

• Simple computational operations: additions, multiplications

• Explicit iterations

• Avoid nested optimization schemes/control-correction of accumulated errors

• Minimal storage of data

Theoretical Side – Convergence/Complexity Analysis

• Free from heuristic choices of extra parameters

• Versatile mathematical analytic tools, broadly applicable... and painless!

• Complexity: nearly independent of dimension

• Performance: reasonable for medium accuracy

Natural Candidates: Schemes based on First Order Methods


First-Order Methods

First-order methods are iterative algorithms that exploit only information on the objective function and its gradient (or a subgradient).

• A main drawback: can be very slow at producing high-accuracy solutions... but they share many advantages:

• Require minimal data information

• Often lead to very simple and "cheap" iterative schemes

• Provable complexity/efficiency nearly independent of dimension

• Suitable for large-scale problems when high accuracy is not crucial. [In many large-scale applications, the data is anyway corrupted or known only roughly.]


First Order Based Algorithms

Widely used in applications....

• Clustering analysis: the k-means algorithm

• Neuro-computing: the backpropagation algorithm

• Statistical estimation: the EM (Expectation-Maximization) algorithm

• Machine learning: SVM, regularized regression, etc.

• Signal and image processing: sparse recovery, denoising and deblurring schemes, total variation minimization...

• Matrix minimization problems: low rank, matrix completion, network sensor localization...

• ...and much more


Algorithms - Short Historical Development

• Fixed point methods [Babylonian times!/Heron for square roots, Picard, Banach, Weiszfeld '37]

• Coordinate descent/Alternating Minimization [Gauss-Seidel 1798]

• Gradient/subgradient methods [Cauchy 1847, Shor '61, Rosen '63, Frank-Wolfe '56, Polyak '62]

• Stochastic gradients [Robbins and Monro '51]

• Proximal algorithms [Martinet '70, Rockafellar '76, Passty '79]

• Penalty/barrier methods [Courant '49, Fiacco-McCormick '66]

• Augmented Lagrangians and splitting [Arrow-Hurwicz '58, Hestenes-Powell '69, Goldstein-Tretyakov '72, Rockafellar '74, Lions-Mercier '79, Fortin-Glowinski '76, Bertsekas '82]

• Extragradient methods for VI [Korpelevich '76, Konnov '80]

• Optimal gradient schemes [Nemirovski-Yudin '81, Nesterov '83]


Anecdote: The World’s Simplest Impossible problem

[From C. Moler (1990)]

Problem: The average of two numbers is 3. What are the numbers?

• Typical answers: (2,4), (1,5), (-3,9)... These already ask for "structure": equal distance from the average, integer numbers...

• Why not (2.71828, 3.28172)!?

• A nice one: (3,3)... it has "minimal norm", and it is unique!

• Simplest: (6,0) or (0,6)? A sparse one! But here, lack of uniqueness!

This simple problem captures the essence of many ill-posed/underdetermined problems in applications.

Additional requirements/constraints have to be specified to make this a reasonable mathematical/computational task, and they often lead to interesting optimization models.
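The two structured answers above can be computed rather than guessed. A minimal Python sketch (our own illustration, not part of the lecture): the minimum-norm point on the constraint set spreads the mass evenly, while a sparse point concentrates it on one coordinate.

```python
# Two "answers" to the underdetermined problem (x1 + x2)/2 = 3,
# illustrating how extra structure selects a solution.

def min_norm_solution(avg, n=2):
    """Minimum Euclidean-norm point on the affine set sum(x) = n*avg:
    by symmetry (projecting 0 onto the hyperplane), every coordinate
    equals the average."""
    return [avg] * n

def a_sparse_solution(avg, n=2):
    """A 1-sparse point on the same set: put the whole mass on one
    coordinate. Not unique -- any coordinate works."""
    return [n * avg] + [0.0] * (n - 1)

print(min_norm_solution(3))   # [3, 3]
print(a_sparse_solution(3))   # [6, 0.0]
```

Both candidates satisfy the averaging constraint; which one is "right" depends entirely on the structure we impose, exactly as the anecdote suggests.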


Linear Inverse Problems - Source for Optimization

Problem: Find x ∈ C ⊂ E which "best" solves A(x) ≈ b, where A : E → F and b (the observable output) are known.

Approach via Optimization – Regularization Models

• ρ(x) is a "regularizer" (one function, or a sum of functions; convex or nonconvex)
• d(b, A(x)) is some "proximity" measure from b to A(x)

min{ρ(x) : A(x) = b, x ∈ C}  or  min{ρ(x) : d(b, A(x)) ≤ ε, x ∈ C}

min{d(b, A(x)) : ρ(x) ≤ δ, x ∈ C}  or  min{d(b, A(x)) + µρ(x) : x ∈ C}, µ > 0

• The choices of ρ(·) and d(·, ·) depend on the application at hand.

• Nonsmooth, and often nonconvex, regularizers ρ are useful to describe desired features.

• Intensive research activity over the past 50 years.

• Today even more so, with emerging new technologies and increasing computer power.
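To make the penalized model min{d(b, A(x)) + µρ(x)} concrete, here is a scalar sketch of our own (the particular choices of d and ρ are ours, not prescribed by the slides): d the squared error and ρ(x) = x², a convex quadratic regularizer, for which the minimizer has a closed form obtained by setting the derivative to zero.

```python
# Scalar instance of min (a*x - b)^2 + mu*x^2:
# 2a(ax - b) + 2*mu*x = 0  =>  x = a*b / (a^2 + mu).

def ridge_scalar(a, b, mu):
    """argmin over x of (a*x - b)**2 + mu * x**2, for mu >= 0."""
    return a * b / (a * a + mu)

x0 = ridge_scalar(1.0, 4.0, 0.0)   # mu -> 0 recovers the pure data fit x = b/a
x1 = ridge_scalar(1.0, 4.0, 3.0)   # a larger mu shrinks the solution toward 0
print(x0, x1)  # 4.0 1.0
```

The parameter µ trades data fidelity against the regularizer, the same trade-off the four model formulations above express via constraints or penalties.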


Example: Sparsity is a Common Desired Feature/Structure

It arises in many applications:

• Sparse learning: feature selection, support vector machines, PCA, ...

• Compressive sensing: recover a signal from few measurements

• Truss topology design: remove bars that are not needed

• Image processing: denoising, deblurring, ... and much more

Example. Let d(b, A(x)) := ‖b − A(x)‖2, ρ(x) := ‖x‖0.

Find x ∈ Rd which is sparsest, or at least δ-sparse:

min{‖x‖0 : ‖b − A(x)‖2 ≤ ε, x ∈ Rd};  min{‖b − A(x)‖2 : ‖x‖0 ≤ δ, x ∈ Rd}

where ‖x‖0 denotes the number of nonzero components of x.

This can be Hard (despite the convex objective/constraint!).

Approaches

• Convex relaxation/approximation: replace ‖x‖0 by a more tractable object. The l1-norm ‖x‖1 has been well known (since the 70's) to promote sparsity. Nonconvex (concave) approximations are also relevant.

• Tackle the nonconvex problem directly, "as is"? More on this later...
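As a small numerical illustration of the l1 relaxation (a toy example of ours, not from the slides): over the solution set of a single underdetermined equation, the l1-minimal point lands on a sparse solution.

```python
# Solutions of x1 + 2*x2 = 2 form a line; parametrize it as
# x = (2 - 2t, t) and minimize ‖x‖1 over a fine grid of t.

def l0(x, tol=1e-9):
    """‖x‖0: number of (numerically) nonzero components."""
    return sum(1 for xi in x if abs(xi) > tol)

def l1(x):
    """‖x‖1: sum of absolute values."""
    return sum(abs(xi) for xi in x)

best = min(((2 - 2 * t / 100, t / 100) for t in range(-300, 301)), key=l1)
print(best, l1(best), l0(best))  # the l1 minimizer (0.0, 1.0) is 1-sparse
```

Here minimizing ‖x‖1 picks the 1-sparse solution (0, 1), a tiny instance of the sparsity-promoting behavior the slide refers to.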


Example: Image Processing

Natural images are not random! Large areas of nearly constant intensity/color separated by sharp edges. Pixels are noisy, and the objective is to find a "good-looking" nearby image.

• g = λ‖L(·)‖1 - the l1-regularized problem:

min_x {f(x) + λ‖Lx‖1};  f(x) := d(b, Ax)

L - identity, differential operator, wavelet transform...

• g = TV(·) - Total Variation-based regularization: grid points (i, j) should have the same intensity as their neighbors [ROF 1992].

min_x {f(x) + λ TV(x)}

1-dim TV: TV(x) = ∑i |xi − xi+1|

2-dim TV:

isotropic: TV(x) = ∑i ∑j √((xi,j − xi+1,j)² + (xi,j − xi,j+1)²)

anisotropic: TV(x) = ∑i ∑j (|xi,j − xi+1,j| + |xi,j − xi,j+1|)
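The TV formulas above translate directly into code. A minimal pure-Python sketch of ours (boundary differences are simply omitted, one common convention):

```python
import math

def tv1d(x):
    """1-D total variation: sum over i of |x_i - x_{i+1}|."""
    return sum(abs(a - b) for a, b in zip(x, x[1:]))

def tv2d_aniso(X):
    """Anisotropic 2-D TV: sum of |vertical| + |horizontal| differences."""
    m, n = len(X), len(X[0])
    s = 0.0
    for i in range(m):
        for j in range(n):
            if i + 1 < m:
                s += abs(X[i][j] - X[i + 1][j])
            if j + 1 < n:
                s += abs(X[i][j] - X[i][j + 1])
    return s

def tv2d_iso(X):
    """Isotropic 2-D TV: sum of sqrt(vertical^2 + horizontal^2)."""
    m, n = len(X), len(X[0])
    s = 0.0
    for i in range(m):
        for j in range(n):
            dv = X[i + 1][j] - X[i][j] if i + 1 < m else 0.0
            dh = X[i][j + 1] - X[i][j] if j + 1 < n else 0.0
            s += math.sqrt(dv * dv + dh * dh)
    return s

# A piecewise-constant signal has small TV; an oscillating one does not.
print(tv1d([1, 1, 1, 5, 5]))  # 4
print(tv1d([1, 3, 1, 3, 1]))  # 8
```

This is why TV penalties favor the large flat regions with sharp edges described above: jumps are paid for once, while oscillations are paid for repeatedly.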


An Icon... The Gradient Method - Cauchy 1847

We begin with the simplest problem: unconstrained minimization of a continuously differentiable function f on E (a finite-dimensional Euclidean space):

(U) min{f (x) : x ∈ E}.

The basic gradient method generates a sequence {xk} via

x0 ∈ E, xk = xk−1 − tk∇f (xk−1) (k ≥ 1)

with a suitable step size tk > 0: fixed, or chosen via a backtracking line search.

• This is a descent method:

x+ = x + td; 〈d, ∇f(x)〉 < 0, d := −∇f(x) ≠ 0.

• Explicit discretization of dx(t)/dt + ∇f(x(t)) = 0, x(0) = x0:

(xk − xk−1)/h = −∇f(xk−1), (increment h > 0).
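The scheme xk = xk−1 − tk∇f(xk−1) with a fixed step, sketched in Python on a toy scalar quadratic (our own illustration; the choice of f, step size, and iteration count are assumptions):

```python
def grad_descent(grad, x0, t, iters):
    """Basic gradient method x_k = x_{k-1} - t * grad(x_{k-1}) with a
    fixed step size t > 0 (scalar case for simplicity)."""
    x = x0
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# f(x) = (x - 2)^2, so grad f(x) = 2*(x - 2); minimizer x* = 2.
x = grad_descent(lambda x: 2 * (x - 2), x0=10.0, t=0.25, iters=60)
print(x)  # converges to ~2.0
```

With t = 0.25 the update is x ← 0.5x + 1, so the error is halved at every step: a linear rate, as defined later in the lecture.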


Quick Recall : Backtracking Line Search – BLS

• Backtracking: start with t, try t/2, t/4, . . . until sufficient decrease.

• The BLS procedure warrants sufficient decrease and rules out too-short steps.

A simple inexact line search to find tk for descent methods along the ray xk + tdk, approximating

tk ∈ argmin_{t≥0} φ(t) ≡ f(xk + tdk), (e.g., for GM d := −∇f(x)).

1. Initialize: choose an initial guess t > 0 and α, β ∈ (0, 1). Set tk = t.

2. Take tk ← βtk (e.g., with β = 1/2) until

(∗) f(xk + tkdk) < f(xk) + αtk〈∇f(xk), dk〉
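The BLS procedure above can be sketched in a few lines of Python (our own scalar sketch; the test function and parameter values are assumptions):

```python
def backtracking(f, grad_x, x, d, t=1.0, alpha=0.5, beta=0.5):
    """Backtracking line search: shrink t by beta until the sufficient
    decrease condition (*)
        f(x + t*d) < f(x) + alpha * t * <grad f(x), d>
    holds. Scalar sketch; grad_x is the scalar gradient at x."""
    while not f(x + t * d) < f(x) + alpha * t * grad_x * d:
        t *= beta
    return t

f = lambda x: x ** 2
g = lambda x: 2 * x
x = 1.0
d = -g(x)                    # descent direction: <grad, d> = -4 < 0
t = backtracking(f, g(x), x, d)
print(t, x + t * d)          # accepted step and the resulting point
```

Starting from t = 1, the loop rejects t = 1 and t = 1/2 and accepts t = 1/4, illustrating how too-long steps are cut back while a sufficient decrease is still enforced.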


Building First Order Based Schemes: Basic Old Idea

Pick an adequate approximate model

To minimize f, given some y, approximate f with some q(·, y): f(x) ≈ q(x, y)

Solve "somehow" the resulting approximate model:

xk+1 = argmin_x q(x, xk), k = 0, 1, . . .

Finding good approximation models relies on data information and the structure of the given problem.

1. Linearize + regularize

2. Linearize only + use info on C : e.g., C compact.

3. Use optimality conditions e.g., to identify a fixed point iteration.

4. Any other device/approach that can produce a useful approximation


A Quadratic Approximation Model

• Simplest case: unconstrained minimization of f ∈ C¹:

(U) min{f(x) : x ∈ E}.

• Simplest idea: use the quadratic model

qt(x, y) := f(y) + 〈x − y, ∇f(y)〉 + (1/2t)‖x − y‖², t > 0.

Namely, use the linearized part of f at some given point y, regularized with a quadratic proximity term that measures the "local error" of the approximation.

• This leads to a (strongly) convex approximation of (U):

(Ut) min {qt(x, y) : x ∈ E} .

• Now, fixing y := xk−1 ∈ E, the unique minimizer xk solves (Utk):

xk = argmin{qtk(x, xk−1) : x ∈ E}.

• Therefore, the optimality condition yields exactly the gradient method:

∇qtk(xk, xk−1) = 0 ⇒ tk∇f(xk−1) + xk − xk−1 = 0 ⇒ xk = xk−1 − tk∇f(xk−1).
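One can check numerically that the minimizer of qt is exactly the gradient step. A scalar Python sketch of our own (the choices of f, y, and t are arbitrary): grid search over the model qt recovers y − t f'(y).

```python
# q_t(x, y) = f(y) + (x - y) f'(y) + (1/(2t)) (x - y)^2  (scalar case)
def qt(x, y, f, fp, t):
    return f(y) + (x - y) * fp(y) + (x - y) ** 2 / (2 * t)

f = lambda x: x ** 4
fp = lambda x: 4 * x ** 3
y, t = 1.5, 0.1
grad_step = y - t * fp(y)            # closed-form minimizer of q_t(., y)

# Brute-force grid minimization of the model confirms it.
grid = [y - 2 + k / 1000 for k in range(4001)]
x_star = min(grid, key=lambda x: qt(x, y, f, fp, t))
print(grad_step, x_star)             # the two agree to grid resolution
```

This is the content of the optimality-condition derivation above: minimizing the regularized linearization is the gradient step.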


Constrained Minimization: Gradient Projection Method

Simple algebra:

qt(x, y) = f(y) + 〈x − y, ∇f(y)〉 + (1/2t)‖x − y‖²
         = (1/2t)‖x − (y − t∇f(y))‖² − (t/2)‖∇f(y)‖² + f(y).

This allows us to pass easily to the constrained problem (P) min{f(x) : x ∈ C}.

• Ignoring the constant terms (in y := xk−1) leads to solving (P) via the GPM:

xk = argmin_{x∈C} (1/2)‖x − (xk−1 − tk∇f(xk−1))‖² ≡ PC(xk−1 − tk∇f(xk−1))

where PC stands for the orthogonal projection map: PC(x) := argmin_{z∈C} ‖z − x‖².

This reveals a more general operation - the Proximal Map - central to FOM...
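The GPM iteration xk = PC(xk−1 − tk∇f(xk−1)) in a minimal scalar Python sketch of our own (toy example: C an interval, for which the projection is just a clip):

```python
def proj_interval(x, lo, hi):
    """Orthogonal projection onto C = [lo, hi]."""
    return min(max(x, lo), hi)

def grad_proj(grad, proj, x0, t, iters):
    """Gradient projection: x_k = P_C(x_{k-1} - t * grad(x_{k-1}))."""
    x = x0
    for _ in range(iters):
        x = proj(x - t * grad(x))
    return x

# min (x - 2)^2 over C = [0, 1]: the unconstrained minimizer 2 is
# infeasible, and GPM converges to the boundary point x* = 1.
x = grad_proj(lambda x: 2 * (x - 2),
              lambda z: proj_interval(z, 0.0, 1.0),
              x0=0.0, t=0.25, iters=50)
print(x)  # 1.0
```

The gradient step keeps pulling toward 2 and the projection pulls back to C, so the iterates settle at the constrained minimizer on the boundary.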


Nonsmooth Problem: Subgradient (Projection) Method

Let g be nonsmooth (admitting a subgradient; e.g., g convex), and consider the constrained NSO problem:

(P) inf{g(x) : x ∈ C}; C ⊂ E closed convex

As was done for gradient projection through the quadratic approximation, we replace the gradient in the linearization with a subgradient element:

Subgradient Scheme: [Shor (63), Polyak (65)]

γk−1 ∈ ∂g(xk−1), xk = PC (xk−1 − tkγk−1), (tk > 0, a stepsize)

• Note: the subgradient scheme is not a descent method.

• γ ∈ ∂g(x∗) does not imply γ = 0.

• Example in R: g(x) = |x|, with ∂g(0) = [−1, 1].
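The example g(x) = |x| also shows the non-descent behavior in practice. A minimal Python sketch of ours, with C = R (so PC is the identity) and diminishing steps tk = 1/k:

```python
def subgrad_abs(x):
    """A subgradient of g(x) = |x|: sign(x), choosing 0 at x = 0
    (any element of [-1, 1] is valid there)."""
    return (x > 0) - (x < 0)

def subgradient_method(x0, steps):
    """x_k = x_{k-1} - t_k * gamma_{k-1}. Not a descent method: the
    iterates can overshoot 0 and come back; we also track the best
    value seen, as in the complexity results for this scheme."""
    x, best = x0, abs(x0)
    for t in steps:
        x = x - t * subgrad_abs(x)
        best = min(best, abs(x))
    return x, best

x, best = subgradient_method(2.0, [1 / k for k in range(1, 50)])
print(x, best)  # oscillates around 0; the best value keeps improving
```

Because a nonzero subgradient gives no descent guarantee, convergence is driven by the diminishing step sizes, and the natural quantity to report is the best iterate so far.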


Convergence Rates / Iteration Complexity of Algorithms

Classical rates of convergence for sequences {sk} converging to zero, such as {‖∇F(xk)‖}, {F(xk) − Fopt}, {‖xk − x∗‖}, are:

Sublinear: sk ≤ c1/√k, c2/k, c3/k²

Linear: sk+1 ≤ (1 − c)sk, for some c ∈ (0, 1).

Fix an accuracy ε; we then ask how many iterations N are needed to get sN ≤ ε:

N = O(1/ε^p) (p = 2, 1, 1/2) for the sublinear rates, and N = O((1/c) log(1/ε)) for the linear rate.

For example, to solve a problem approximately to a given accuracy ε > 0, i.e., to find an x satisfying F(x) − Fopt ≤ ε, suppose that for all k ≥ 1 we have

F(xk) − Fopt ≤ Γ/k^θ, (Γ > 0, θ > 0).

Then the iteration complexity is O(ε^{−1/θ}).

The constant Γ usually depends on the distance from the initial point to an optimal solution:

R² := max{‖x∗ − x0‖² : x∗ ∈ X∗}.
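The rate-to-complexity translation can be made concrete. A small Python sketch of ours computing the smallest N for each regime (the constants are arbitrary illustrative choices):

```python
import math

def iters_sublinear(Gamma, theta, eps):
    """Smallest N with Gamma / N**theta <= eps, i.e. N = O(eps**(-1/theta))."""
    return math.ceil((Gamma / eps) ** (1 / theta))

def iters_linear(c, s0, eps):
    """Smallest N with (1 - c)**N * s0 <= eps: N = O((1/c) * log(1/eps))."""
    return math.ceil(math.log(eps / s0) / math.log(1 - c))

print(iters_sublinear(1.0, 1.0, 1e-3))  # the O(1/eps) regime
print(iters_sublinear(1.0, 0.5, 1e-3))  # the O(1/eps^2) regime is much worse
print(iters_linear(0.5, 1.0, 1e-3))     # a linear rate needs only a handful
```

The gap between roughly 10^3, 10^6, and about 10 iterations for the same target accuracy is exactly why the distinction between sublinear and linear rates matters.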

Complexity for Basic FOM - A Quick Preview

For minimizing a function to accuracy ε:

⊕ Smooth nonconvex problems: measured through the norm of the gradient

• Gradient method: min1≤k≤N ‖∇F(xk)‖ ≤ ε requires N = O(1/ε²) (idem for GPM)

⊕ Smooth convex problems: measured through function values F(xN) − F∗

• Gradient (Projection) Method: N = O(1/ε)

⊕ Nonsmooth Lipschitz convex problems: measured through "best" function values

• Subgradient (Projection) Method: N = O(1/ε²), with F_best^N := min1≤k≤N F(xk)

We will see that these schemes

• Can be analyzed as special cases of a more general optimization model.

• Have complexity that can be improved by exploiting structure/data.

• Can be extended/combined through various approaches to generate schemes that can handle more complex situations (constraints, structures, function classes...).

Marc Teboulle First Order Optimization Methods Lecture 1 - Introduction and Basic Results 20 / 22


Before we start.. Quick Recalls on Convex Functions

I Throughout, E, F, X ... stand for a finite dimensional real vector space, here mostly a Euclidean space with inner product 〈·, ·〉, endowed with the norm ‖x‖ = √〈x, x〉.

I Let f : E→ (−∞,+∞] be a proper, closed (lsc) convex function, with dom f = {x | f (x) < +∞} its effective domain.

I Proper: dom f ≠ ∅ and f (x) > −∞, ∀x ∈ E.

I Closed and Convex: its epigraph is a closed convex set,

epi f := {(x, α) ∈ E× R | α ≥ f (x)}.

I Extended valued functions are useful for handling constraints:

inf{h(x) : x ∈ C} ⇐⇒ inf{f (x) : x ∈ E}, f := h + δC,

where δC (x) = 0 if x ∈ C and +∞ if x /∈ C is the indicator of C.


Recalls - Subdifferentiability of Convex Functions

I g ∈ E is a subgradient of f at x if:

f (z) ≥ f (x) + 〈g, z− x〉, ∀z ∈ E.

I Subdifferential of f at x = set of all subgradients:

∂f (x) = {g ∈ E | f (z) ≥ f (x) + 〈g, z− x〉, ∀z ∈ E}.

I ∂f (x) is a closed convex set (possibly empty), as an infinite intersection of closed half-spaces.

I If x ∈ int dom f , then ∂f (x) is nonempty and bounded.

I When f is differentiable, ∂f (x) ≡ {∇f (x)} ≡ {f ′(x)}.

I f ∗(y) = sup{〈x, y〉 − f (x) : x ∈ E} is its convex conjugate.

I Fenchel Theorem: For f proper convex, for any x, y ∈ E:

〈x, y〉 = f (x) + f ∗(y) ⇐⇒ y ∈ ∂f (x) ⇐⇒ x ∈ ∂f ∗(y),

where the second equivalence requires f closed.
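A quick numerical sanity check of the subgradient inequality (a sketch, not from the slides): for f (x) = |x| one has ∂f (0) = [−1, 1], so g = 0.5 satisfies f (z) ≥ f (0) + g·z for all z, while g = 1.5 does not:

```python
# f(x) = |x|; at x = 0 the subdifferential is [-1, 1].
f = lambda x: abs(x)

def is_subgradient(g, x, grid):
    # Check the subgradient inequality f(z) >= f(x) + g*(z - x) on a grid.
    return all(f(z) >= f(x) + g * (z - x) - 1e-12 for z in grid)

grid = [i / 10.0 for i in range(-50, 51)]
ok = is_subgradient(0.5, 0.0, grid)    # 0.5 in [-1, 1]: valid
bad = is_subgradient(1.5, 0.0, grid)   # 1.5 outside [-1, 1]: fails, e.g. at z = 1
```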

First Order Optimization Methods

Lecture 2 - The Proximal Map

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

PGMO Lecture Series

January 25-26, 2017 Ecole Polytechnique, Paris

Lecture 2 - The Proximal Map

The proximal map (Moreau 1965) is a key player in FOM.

Definition. Given g : E→ (−∞,∞] a closed, proper and convex function, the proximal mapping of g is defined by

prox_g(x) = argmin_{u∈E} { g(u) + (1/2)‖u− x‖² }.

A Natural Extension of the Projection Operator

I If C ⊆ E is nonempty, closed and convex, then prox_{δC} = PC:

prox_{δC}(x) = argmin_{u∈E} { δC(u) + (1/2)‖u− x‖² } = argmin_{u∈C} ‖u− x‖² = PC(x).

To study properties of prox, we first recall some basic results.
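To illustrate the definition, a brute-force 1-D prox on a grid (illustration only, not how prox is computed in practice; all names are mine): for g = δ_[0,1] it recovers the projection P_C, as in the slide above.

```python
def prox_numeric(g, x, lo=-10.0, hi=10.0, n=2001):
    # Brute-force prox_g(x) = argmin_u { g(u) + (1/2)(u - x)^2 } on a 1-D grid.
    best_u, best_v = None, float("inf")
    for i in range(n):
        u = lo + (hi - lo) * i / (n - 1)
        v = g(u) + 0.5 * (u - x) ** 2
        if v < best_v:
            best_u, best_v = u, v
    return best_u

# For g = delta_C with C = [0, 1], the prox is the projection P_C.
delta_C = lambda u: 0.0 if 0.0 <= u <= 1.0 else float("inf")
p = prox_numeric(delta_C, 2.5)   # P_C(2.5) = 1.0
```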

Strongly Convex Functions - Recalls

I A function f : E→ (−∞,∞] is called σ-strongly convex with modulus σ > 0 if for any x, y ∈ dom(f) and λ ∈ [0, 1]:

f(λx + (1− λ)y) ≤ λf(x) + (1− λ)f(y)− (σ/2)λ(1− λ)‖x− y‖².

I f : E→ (−∞,∞] is σ-strongly convex iff f(·)− (σ/2)‖·‖² is convex.

I The sum of a convex function and a σ-strongly convex function is σ-strongly convex.

First Order Characterization. Let f : E → (−∞,∞] and σ > 0. The following three claims are equivalent:

(i) f is σ-strongly convex.

(ii) For any x ∈ dom(∂f), y ∈ dom(f) and gx ∈ ∂f(x): f(y) ≥ f(x) + 〈gx, y − x〉+ (σ/2)‖y − x‖².

(iii) For any x, y ∈ dom(∂f) and gx ∈ ∂f(x), gy ∈ ∂f(y): 〈gx − gy, x− y〉 ≥ σ‖x− y‖².


Minimization of Strongly Convex Functions - Recalls

Theorem. Let f : E→ (−∞,∞] be a proper closed and σ-strongly convex function (σ > 0). Then

(a) f has a unique minimizer x∗.

(b) f(x)− f(x∗) ≥ (σ/2)‖x− x∗‖² for all x ∈ dom(f).

Proof. Existence: Take x0 ∈ dom(∂f) and g ∈ ∂f(x0). Then, completing the square, for all x ∈ E:

f(x) ≥ f(x0) + 〈g, x− x0〉+ (σ/2)‖x− x0‖² = f(x0)− (1/(2σ))‖g‖² + (σ/2)‖x− (x0 − (1/σ)g)‖².

Thus, Lev(f, f(x0)) := {x : f(x) ≤ f(x0)} ⊆ B‖·‖[x0 − (1/σ)g, (1/σ)‖g‖].

Therefore Lev(f, f(x0)) is compact, and together with the closedness of f this implies that a minimizer exists.

Notation: B‖·‖[x0, δ] := {x : ‖x− x0‖ ≤ δ}.


Proof Contd.

I To show uniqueness, let x and x̄ be minimizers of f, so that f(x) = f(x̄) = fopt. Then

fopt ≤ f((1/2)x + (1/2)x̄) ≤ (1/2)f(x) + (1/2)f(x̄)− (σ/8)‖x− x̄‖² = fopt − (σ/8)‖x− x̄‖²,

showing that x = x̄ (proving part (a)).

I Let x∗ be the unique minimizer of f.

I By Fermat's optimality condition, 0 ∈ ∂f(x∗), and hence using the subgradient inequality for the σ-strongly convex f, we get for all x ∈ E:

f(x)− f(x∗) ≥ 〈0, x− x∗〉+ (σ/2)‖x− x∗‖² = (σ/2)‖x− x∗‖².

We are ready to establish the main properties of the prox map.

Prox Theorem – Uniqueness and Characterization

Prox Theorem. Let g : E → (−∞,∞] be a proper closed and convex function. Then,

(i) (Uniqueness) u := prox_g(x) is a singleton for any x ∈ E.

(ii) (Characterization) x− u ∈ ∂g(u) ⇐⇒ u = (I + ∂g)⁻¹(x).

(iii) (Subgradient via prox) z ∈ ∂g(x) ⇐⇒ x = prox_g(z + x). In particular, x is a minimizer of g iff x = prox_g(x).

Proof. (i) For any x ∈ E, u := prox_g(x) is by definition the minimizer of the proper closed and 1-strongly convex function

u → g(u) + (1/2)‖u− x‖².

Therefore the minimizer prox_g(x) exists and is unique.

(ii) Fermat's optimality condition for u is satisfied if and only if

0 ∈ ∂g(u) + u− x ⇔ u = (I + ∂g)⁻¹(x).

(iii) z ∈ ∂g(x) ⇔ z + x ∈ ∂g(x) + x ⇔ x = prox_g(z + x). Now, x is a minimizer of g iff 0 ∈ ∂g(x); setting z = 0 proves the claim.


The Proximal Inequality – Fundamental

Proximal Inequality. Let g : E→ (−∞,∞] be a proper closed and convex function. For x ∈ E, let u = prox_g(x). Then, for any y ∈ E,

g(u)− g(y) ≤ 〈x− u, u− y〉 = (1/2){‖y − x‖² − ‖y − u‖² − ‖u− x‖²}.

Proof. Since x− u ∈ ∂g(u), the subgradient inequality yields, for any y ∈ E:

g(u)− g(y) ≤ 〈x− u, u− y〉.

The second expression follows from the identity

2〈x− u, u−w〉 = ‖w − x‖² − ‖w − u‖² − ‖u− x‖², ∀x, u,w ∈ E.


Examples

I Computing prox_g can be very hard...!

I But for many useful special cases it is easy...

I If g ≡ δC (C ⊆ E closed and convex), then

prox_g(x) = argmin_u { δC(u) + (1/2)‖u− x‖² } ≡ PC(x).

For some useful sets C it is easy to compute the orthogonal projection PC:

I Affine sets, simple polyhedral sets (halfspace, Rⁿ₊, [l, u]ⁿ),

I l2, l1, l∞ balls,

I Ice cream cone (SOC), semidefinite cone Sⁿ₊,

I Simplex and spectrahedron (simplex in Sⁿ).
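Among the "easy" projections listed above, the projection onto the unit simplex ∆n admits a classical O(n log n) sort-and-threshold formula; a sketch (not from the slides):

```python
def project_simplex(v):
    # Euclidean projection onto the unit simplex {x : x >= 0, sum(x) = 1},
    # via the classical sort-and-threshold formula: x = [v - theta]_+.
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        if ui - (cumsum - 1.0) / i > 0:
            theta = (cumsum - 1.0) / i
    return [max(vi - theta, 0.0) for vi in v]

p1 = project_simplex([2.0, 0.0])    # -> [1.0, 0.0]
p2 = project_simplex([0.5, 0.5])    # already on the simplex -> [0.5, 0.5]
```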


Other Basic Examples of Proximal Mappings

prox_{th}(x) = argmin_{u∈E} { h(u) + (1/(2t))‖u− x‖² }, t > 0.

I Constant function h ≡ c for some c ∈ R: prox_{th}(x) = x, the identity map.

I Affine function h(x) = 〈a, x〉+ b, a ∈ E, b ∈ R: prox_{th}(x) = x− ta.

I Quadratic function h(x) = (1/2)xᵀAx + bᵀx + c (A ∈ Sⁿ₊, b ∈ Rⁿ, c ∈ R):

prox_{th}(x) = (I + tA)⁻¹(x− tb).

I Euclidean norm h(x) = ‖x‖₂:

prox_{th}(x) = (1− t/‖x‖₂)x if ‖x‖₂ ≥ t, and 0 otherwise.

I Euclidean distance d(x) = min_{y∈C} ‖x− y‖₂ (C closed convex):

prox_{td}(x) = θPC(x) + (1− θ)x, where θ = t/d(x) if d(x) ≥ t, and θ = 1 else.
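The Euclidean-norm prox above (block soft thresholding) in code, as a sketch with an illustrative function name:

```python
import math

def prox_norm(x, t):
    # prox of t*||.||_2: shrink x toward the origin by t, or map it to 0.
    nx = math.sqrt(sum(xi * xi for xi in x))
    if nx >= t:
        s = 1.0 - t / nx
        return [s * xi for xi in x]
    return [0.0] * len(x)

p = prox_norm([3.0, 4.0], 2.0)   # ||x||_2 = 5 >= t = 2, so scale by 1 - 2/5 = 0.6
```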


Basic Calculus Rules for Prox

f(x)                              prox_f(x)                                       assumptions
∑ᵢ₌₁ᵐ fᵢ(xᵢ)                      prox_{f1}(x1) × · · · × prox_{fm}(xm)           (separable f)
g(λx + a)                         (1/λ)[prox_{λ²g}(a + λx) − a]                   λ ≠ 0, a ∈ E, g proper
λg(x/λ)                           λ prox_{g/λ}(x/λ)                               λ > 0, g proper
g(x) + (c/2)‖x‖² + 〈a, x〉 + γ     prox_{g/(c+1)}((x − a)/(c + 1))                 a ∈ E, c > 0, γ ∈ R, g proper
g(A(x) + b)                       x + (1/α)Aᵀ(prox_{αg}(A(x) + b) − A(x) − b)     b ∈ Rᵐ, A : V → Rᵐ, g closed proper convex, A ◦ Aᵀ = αI, α > 0

Note: In general, the prox of the composition

(g ◦ A)(x) = g(A(x))

is NOT tractable!..

Examples of Prox Computations – Summary

f                           dom f     prox_f                                  assumptions
(1/2)xᵀAx + bᵀx + c         Rⁿ        (A + I)⁻¹(x − b)                        A ∈ Sⁿ₊₊, b ∈ Rⁿ, c ∈ R
λ‖x‖                        E         [1 − λ/‖x‖]₊ x                          Euclidean norm, λ > 0
λ‖x‖₁                       Rⁿ        [|x| − λe]₊ ◦ sgn(x)                    λ > 0
−λ ∑ⱼ₌₁ⁿ log xⱼ             Rⁿ₊₊      ((xⱼ + √(xⱼ² + 4λ))/2)ⱼ₌₁ⁿ             λ > 0
δC(x)                       E         PC(x)                                   C ⊆ E
λσC(x)                      E         x − λPC(x/λ)                            C closed and convex
λ‖x‖                        E         x − λP_{B‖·‖∗[0,1]}(x/λ)                arbitrary norm
λ max{x1, . . . , xn}       Rⁿ        x − λP_{∆n}(x/λ)                        λ > 0
λ dC(x)                     E         x + min{λ/dC(x), 1}(PC(x) − x)          C closed convex
(λ/2)dC(x)²                 E         (λ/(λ+1))PC(x) + (1/(λ+1))x             C closed convex

We show the prox of the l1-norm.

Prox of l1-Norm

I g(x) = λ‖x‖₁ (λ > 0).

I g(x) = ∑ᵢ₌₁ⁿ ϕ(xᵢ), where ϕ(t) = λ|t|.

I Reduces to a 1-D problem: prox_ϕ(s) = argmin_t { λ|t| + (1/2)(t − s)² }.

I prox_ϕ(s) = Tλ(s), where Tλ, the soft thresholding operator, is defined as

Tλ(y) = [|y| − λ]₊ sgn(y) = { y − λ if y ≥ λ; 0 if |y| < λ; y + λ if y ≤ −λ }.

(Figure: graph of the soft thresholding operator Tλ on [−2, 2].)

I By the separability of the l1-norm, prox_g(x) = (Tλ(xⱼ))ⱼ₌₁ⁿ. Extending the definition of the soft thresholding operator componentwise, we write

prox_g(x) = Tλ(x) ≡ (Tλ(xⱼ))ⱼ₌₁ⁿ = [|x| − λe]₊ ⊙ sgn(x).
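The componentwise soft thresholding operator Tλ in code, as a direct transcription of the formula above (function name illustrative):

```python
def soft_threshold(x, lam):
    # Componentwise T_lam(x) = [|x| - lam]_+ * sgn(x), the prox of lam*||.||_1.
    return [
        max(abs(xi) - lam, 0.0) * (1.0 if xi > 0 else -1.0 if xi < 0 else 0.0)
        for xi in x
    ]

y = soft_threshold([3.0, -0.5, 1.0], 1.0)
```

Components with |xi| ≤ λ are zeroed, the rest are shrunk toward 0 by λ.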


Prox of a Composition |aᵀx|

I f(x) = |aᵀx| (0 ≠ a ∈ Rⁿ).

I f(x) = g(aᵀx), where g(t) = |t|.

I prox_{λg} = Tλ.

I prox_f(x) = x + (1/‖a‖²)(T_{‖a‖²}(aᵀx) − aᵀx)a.

Unfortunately, in general the prox of f(x) = g(Ax) = ‖Ax‖₁ does not admit an explicit formula.


More Properties: Firm Nonexpansivity/Nonexpansivity

Theorem. Let h be proper closed and convex. For any x, y ∈ E:

(i) 〈x− y, prox_h(x)− prox_h(y)〉 ≥ ‖prox_h(x)− prox_h(y)‖².

(ii) ‖prox_h(x)− prox_h(y)‖ ≤ ‖x− y‖.

Proof.

I Denote u = prox_h(x), v = prox_h(y).

I x− u ∈ ∂h(u), y − v ∈ ∂h(v).

I By the subgradient inequality,

h(v) ≥ h(u) + 〈x− u, v − u〉,
h(u) ≥ h(v) + 〈y − v, u− v〉.

I Summing the two inequalities, we obtain 〈(x− u)− (y − v), u− v〉 ≥ 0.

I Thus, 〈u− v, x− y〉 ≥ ‖u− v‖².

I (ii) follows from the Cauchy–Schwarz inequality.
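A grid check of firm nonexpansivity, as a sketch (not from the slides), for h = |·|, whose prox is the soft threshold T₁:

```python
def prox_abs(x):
    # prox of h = |.| : soft thresholding with threshold 1.
    return max(abs(x) - 1.0, 0.0) * (1.0 if x >= 0 else -1.0)

# Firm nonexpansivity: (x - y)(prox(x) - prox(y)) >= (prox(x) - prox(y))^2.
grid = [i / 4.0 for i in range(-20, 21)]
firm = all(
    (x - y) * (prox_abs(x) - prox_abs(y)) >= (prox_abs(x) - prox_abs(y)) ** 2 - 1e-12
    for x in grid for y in grid
)
```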

Moreau Decomposition

Theorem. Let f : E → (−∞,∞] be a closed, proper convex function. Then for any x ∈ E,

prox_f(x) + prox_{f∗}(x) = x.

Proof.

I Let x ∈ E, u = prox_f(x).

I Then x− u ∈ ∂f(u)

I iff u ∈ ∂f∗(x− u) [by the Fenchel Theorem, (∂f)⁻¹ = ∂f∗]

I iff x− u = prox_{f∗}(x).

I Thus, prox_f(x) + prox_{f∗}(x) = u + (x− u) = x.

A direct consequence (extended Moreau decomposition) with scaling (λ > 0):

prox_{λf}(x) + λ prox_{f∗/λ}(x/λ) = x.
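A numerical check of the decomposition, as a sketch (not from the slides): for f(x) = |x| one has f∗ = δ_[−1,1], so prox_f is the soft threshold T₁ and prox_{f∗} is the projection onto [−1, 1]; their sum must reproduce x:

```python
def prox_f(x):
    # f = |.|  =>  prox_f = soft thresholding T_1.
    return max(abs(x) - 1.0, 0.0) * (1.0 if x >= 0 else -1.0)

def prox_fstar(x):
    # f* = indicator of [-1, 1]  =>  prox_{f*} = projection onto [-1, 1].
    return min(max(x, -1.0), 1.0)

# Moreau decomposition: prox_f(x) + prox_{f*}(x) = x, for any x.
checks = [prox_f(x) + prox_fstar(x) == x for x in (-2.5, -0.3, 0.0, 0.7, 4.0)]
```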


Example: Prox of Support Functions

The support function of a set C, defined by

σC(u) := sup{〈u, v〉 : v ∈ C},

is essentially "built-in" in most optimization problems. For instance, consider the general NLP with m constraints described through g : Rⁿ → Rᵐ:

inf{f(x) : g(x) ≤ 0, x ∈ Rⁿ} ⇐⇒ inf_x {f(x) + σ_{Rᵐ₊}(g(x))}.

Recall that σC = δ∗C and σ∗C = δ∗∗C = δC (C closed convex nonempty).

Proposition (Prox of support function). Let C be a nonempty closed and convex set, and let λ > 0. Then

prox_{λσC}(x) = x− λPC(x/λ).

Proof. By the extended Moreau decomposition formula,

prox_{λσC}(x) = x− λ prox_{λ⁻¹σ∗C}(x/λ) = x− λ prox_{λ⁻¹δC}(x/λ) = x− λPC(x/λ).


Examples

Thus, computing the prox of the support function of a set C reduces to computing a projection onto C:

I prox_{λ‖·‖α}(x) = x− λP_{B‖·‖α,∗[0,1]}(x/λ) (‖·‖α an arbitrary norm, with dual unit ball B‖·‖α,∗[0, 1]).

I prox_{λ‖·‖∞}(x) = x− λP_{B‖·‖₁[0,1]}(x/λ).

I prox_{λ max(·)}(x) = x− λP_{∆n}(x/λ).
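For instance, with the box C = [0, 1]ⁿ the support function is σC(x) = ∑ⱼ [xⱼ]₊, and the formula prox_{λσC}(x) = x − λPC(x/λ) is easy to check numerically (a sketch with an illustrative function name; projection onto the box is componentwise clamping):

```python
def prox_support_box(x, lam):
    # prox of lam * sigma_C with C = [0,1]^n: componentwise x - lam * P_C(x / lam).
    return [xi - lam * min(max(xi / lam, 0.0), 1.0) for xi in x]

# Direct 1-D check of prox of lam*[t]_+ at s: s - lam if s > lam; 0 if 0 <= s <= lam; s if s < 0.
p = prox_support_box([3.0, -2.0, 0.5], 1.0)   # -> [2.0, -2.0, 0.0]
```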

References

I J. J. Moreau, Proximité et dualité dans un espace hilbertien, Bull. Soc. Math. France (1965).

I H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces (2011).

I P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul. (2005).

First Order Optimization Methods

Lecture 3 - The Proximal Gradient Method

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

PGMO Lecture Series

January 25-26, 2017 Ecole Polytechnique, Paris

Lecture 3 - The Proximal Gradient Method

A Basic Useful Model: Composite Minimization

Composite convex model:

(P) min{F(x) ≡ f(x) + g(x) : x ∈ E}

I f - convex, C¹ with Lipschitz gradient.

I g - convex, extended valued, nonsmooth.

I X∗ ≠ ∅ is the optimal set of (P), with optimal value denoted Fopt ≡ F(x∗).

This simple model is rich enough to encompass most convex optimization models arising in a wide array of disparate applications: image science, signal processing, machine learning, etc.

Some Prototype Examples

I Unconstrained smooth convex minimization (g ≡ 0):

min{f(x) : x ∈ E}

I Constrained smooth convex minimization (g ≡ δ(·|C), C closed and convex):

min{f(x) : x ∈ C}

I Constrained nonsmooth convex minimization (f ≡ 0):

min{g(x) : x ∈ E} (g extended valued)

I Nonsmooth l1-regularized convex problems (g(x) ≡ λ‖x‖₁, λ > 0):

min{f(x) + λ‖x‖₁ : x ∈ E}

Smoothness - A Central Assumption in FOM

Definition. Let L ≥ 0. A differentiable function f : E → R is said to have Lipschitz continuous gradient with Lipschitz constant L ≥ 0 (or simply to be L-smooth) if

‖∇f(x)−∇f(y)‖∗ ≤ L‖x− y‖ for all x, y ∈ E.

Recall: Given a norm ‖·‖ (not necessarily the endowed Euclidean norm), the dual norm is defined by

‖u‖∗ = max_v {〈u, v〉 : ‖v‖ ≤ 1}, u ∈ E.

I The class of L-smooth functions is denoted by C¹,¹_L.

I If a function is L1-smooth, then it is also L2-smooth for any L2 ≥ L1.

Examples:

I f(x) = 〈a, x〉+ b, a ∈ E, b ∈ R (0-smooth).

I f(x) = (1/2)xᵀAx + bᵀx + c, A ∈ Sⁿ, b ∈ Rⁿ, c ∈ R (‖A‖₂-smooth if Rⁿ is endowed with the l2-norm).

I f(x) = (1/2)d²C(x) (f : E→ R) (1-smooth).

A Key Inequality for Smooth f – The Descent Lemma

Lemma. Let f : E→ R be L-smooth. Then

|f(y)− f(x)− 〈∇f(x), y − x〉| ≤ (L/2)‖x− y‖², ∀x, y.

In particular, for f convex we can drop the absolute value; this is usually called the "Descent Lemma".

Proof. By the fundamental theorem of calculus:

I f(y)− f(x) = ∫₀¹ 〈∇f(x + t(y − x)), y − x〉 dt.

I f(y)− f(x) = 〈∇f(x), y − x〉+ ∫₀¹ 〈∇f(x + t(y − x))−∇f(x), y − x〉 dt.

I Hence

|f(y)− f(x)− 〈∇f(x), y − x〉| = |∫₀¹ 〈∇f(x + t(y − x))−∇f(x), y − x〉 dt|
≤ ∫₀¹ ‖∇f(x + t(y − x))−∇f(x)‖∗ · ‖y − x‖ dt   (Cauchy–Schwarz)
≤ ∫₀¹ tL‖y − x‖² dt = (L/2)‖y − x‖².   (Lipschitz continuity)
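A numerical illustration of the lemma, as a sketch (not from the slides), for the L-smooth quadratic f(x) = x²: here ∇f(x) = 2x is 2-Lipschitz, so L = 2 and the gap f(y) − f(x) − f′(x)(y − x) equals (y − x)², exactly the bound (L/2)(x − y)²:

```python
f = lambda x: x * x
grad = lambda x: 2.0 * x
L = 2.0

def descent_gap(x, y):
    # |f(y) - f(x) - <grad f(x), y - x>|, to compare with (L/2)||x - y||^2.
    return abs(f(y) - f(x) - grad(x) * (y - x))

pairs = [(-1.3, 2.0), (0.0, 5.0), (3.0, 3.5)]
ok = all(descent_gap(x, y) <= L / 2 * (x - y) ** 2 + 1e-12 for x, y in pairs)
```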


Characterizations of L-smoothness - Useful Inequalities

Theorem. Let f : E → R be a convex function, differentiable over E, and let L > 0. Then the following claims are equivalent:

(i) f is L-smooth.

(ii) f(y) ≤ f(x) + 〈∇f(x), y − x〉+ (L/2)‖x− y‖² for all x, y ∈ E.

(iii) f(y) ≥ f(x) + 〈∇f(x), y − x〉+ (1/(2L))‖∇f(x)−∇f(y)‖∗² for all x, y ∈ E.

(iv) 〈∇f(x)−∇f(y), x− y〉 ≥ (1/L)‖∇f(x)−∇f(y)‖∗² for all x, y ∈ E.

Moreover, for f twice continuously differentiable over Rⁿ, we have f ∈ C¹,¹_L (w.r.t. the l2-norm) iff

‖∇²f(x)‖₂ ≤ L ⇐⇒ λmax(∇²f(x)) ≤ L for any x ∈ Rⁿ.

Proof. See refs.

The Basic Idea

Instead of minimizing directly

min_{x∈E} f(x) + g(x),

approximate f by a regularized linear approximation, keeping the nonsmooth g untouched. Thus, given xk and tk > 0, the next point is:

x_{k+1} = argmin_x { f(xk) +∇f(xk)ᵀ(x− xk) + (1/(2tk))‖x− xk‖² + g(x) }
        = argmin_x { g(x) + (1/(2tk))‖x− (xk − tk∇f(xk))‖² }.

Proximal Gradient Iteration

x_{k+1} = prox_{tk g}(xk − tk∇f(xk))


Basic Examples

I Gradient Method (g = 0, unconstrained minimization):

x^{k+1} = x^k − t_k∇f(x^k)

I Gradient Projection Method (g = δ_C, constrained convex minimization):

x^{k+1} = P_C(x^k − t_k∇f(x^k))

I Iterative Soft-Thresholding Algorithm (ISTA) (g = λ‖·‖₁):

x^{k+1} = T_{λt_k}(x^k − t_k∇f(x^k)), where T_α(u) = [|u| − αe]₊ ⊙ sgn(u).


The Proximal Gradient Method - PGM

For simplicity, we analyze PGM with a constant stepsize rule.

Proximal Gradient Method with Constant Stepsize
Input: L = L_f, a Lipschitz constant of ∇f.
Step 0. Take x⁰ ∈ E.
Step k. (k ≥ 0) Compute

x^{k+1} = T_L(x^k) := prox_{g/L}(x^k − (1/L)∇f(x^k)).

I When the Lipschitz constant L_f is unknown or not easily computable, this can be resolved with an easy backtracking stepsize rule, and all the forthcoming results also hold in that case (see refs. for details).
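A compact sketch of PGM with constant stepsize 1/L, specialized (our illustrative choice, not from the lecture) to f(x) = ½‖Ax − b‖² and g = λ‖·‖₁, so that prox_{g/L} is soft-thresholding with threshold λ/L:

```python
import numpy as np

# Proximal gradient method with constant stepsize 1/L for
# F(x) = 0.5 ||Ax - b||^2 + lam * ||x||_1  (illustrative choice of f, g).
def prox_grad(A, b, lam, iters=500):
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - (A.T @ (A @ x - b)) / L      # gradient step on f
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox step on g
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
lam = 0.5
x = prox_grad(A, b, lam)
g = A.T @ (A @ x - b)    # at a minimizer, -g lies in lam * subdiff of ||.||_1
```

The optimality condition 0 ∈ ∇f(x*) + λ∂‖x*‖₁ implies ‖∇f(x*)‖_∞ ≤ λ, which the final iterate satisfies up to a small tolerance.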

Main Pillar in the Analysis: The Prox-Grad Inequality

Theorem. Let L ≥ L_f. Then for any x, y ∈ E,

F(x) − F(T_L(y)) ≥ (L/2)‖x − T_L(y)‖² − (L/2)‖x − y‖² + ℓ_f(x, y),

where ℓ_f(x, y) := f(x) − f(y) − 〈∇f(y), x − y〉.

Note: Valid for nonconvex f. When f is convex, ℓ_f(x, y) ≥ 0.

Proof. We use the shorthand notation y⁺ = T_L(y) = prox_{g/L}(y − (1/L)∇f(y)).

I Using the prox inequality proven in a past lecture,

g(x) − g(ξ) ≥ 〈z − ξ, x − ξ〉,  ξ := prox_g(z),

it follows that (1/L)g(x) − (1/L)g(y⁺) ≥ 〈y − (1/L)∇f(y) − y⁺, x − y⁺〉.

I Therefore, rearranging, we have:

g(x) − g(y⁺) ≥ L〈y − y⁺, x − y⁺〉 + 〈∇f(y), y⁺ − x〉
             = L〈y − y⁺, x − y⁺〉 + 〈∇f(y), y⁺ − y〉 + 〈∇f(y), y − x〉.


Proof End

g(x) − g(y⁺) ≥ L〈y − y⁺, x − y⁺〉 + 〈∇f(y), y⁺ − y〉 + 〈∇f(y), y − x〉,

and using the Descent Lemma for L ≥ L_f we have

f(x) − f(y⁺) ≥ f(x) − f(y) − 〈∇f(y), y⁺ − y〉 − (L/2)‖y⁺ − y‖².

Adding the above, with ℓ_f(x, y) := f(x) − f(y) − 〈∇f(y), x − y〉, we get

F(x) − F(y⁺) ≥ ℓ_f(x, y) + L〈y − y⁺, x − y⁺〉 − (L/2)‖y⁺ − y‖².

Using the identity 2〈y − y⁺, x − y⁺〉 = ‖x − y⁺‖² + ‖y − y⁺‖² − ‖y − x‖², we obtain

F(x) − F(y⁺) ≥ (L/2)‖x − y⁺‖² − (L/2)‖x − y‖² + ℓ_f(x, y).


A Useful Side Remark: The Gradient Map

In terms of the gradient map G_L : E → E defined by

G_L(x) := L(x − T_L(x)),

the PGM simply reads "like" a gradient scheme:

x^{k+1} = T_L(x^k) ≡ x^k − (1/L)G_L(x^k).

This is a convenient device: it parallels the gradient, and can be used (e.g., with f nonconvex) alongside the analysis developed next to easily derive the complexity in terms of ‖G_L(x^k)‖ announced earlier.

I Clearly G_L(·) reduces to the gradient ∇f(x) when g = 0 (hence the name!)

I G_L(x∗) = 0 iff x∗ = prox_{g/L}(x∗ − (1/L)∇f(x∗)). Therefore, by the prox property,

x∗ − (1/L)∇f(x∗) − x∗ ∈ (1/L)∂g(x∗) ⟺ 0 ∈ ∇f(x∗) + ∂g(x∗).

I That is, for nonconvex f, x∗ is a critical point of F = f + g.
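A tiny check of the stationarity characterization G_L(x*) = 0, using (our choice, for illustration) f(x) = ½‖x − c‖² (so L_f = 1) and g = λ‖·‖₁, where the minimizer is soft-thresholding of c in closed form:

```python
import numpy as np

# Gradient map G_L(x) = L(x - T_L(x)) for f(x) = 0.5 ||x - c||^2
# (grad f = x - c, L_f = 1) and g = lam * ||.||_1; it vanishes at the minimizer.
L, lam = 1.0, 0.3
c = np.array([1.0, -0.1, 0.5])

def T_L(x):
    z = x - (x - c) / L                                        # gradient step
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # prox step

G_L = lambda x: L * (x - T_L(x))

x_star = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)         # known minimizer
```

Here G_L(x_star) is exactly zero, while G_L at a non-stationary point (e.g. the origin) is not.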

Sufficient Decrease Lemma

As an immediate consequence of the prox-grad inequality, we obtain (simply set y = x) the sufficient decrease property:

Corollary. Let L ≥ L_f. Then for any x ∈ E,

F(x) − F(T_L(x)) ≥ (L/2)‖x − T_L(x)‖².

Remark. Recall that the sufficient decrease property holds for nonconvex f (ℓ_f(x, x) ≡ 0 in the prox-grad inequality).

In terms of the gradient map G_L(x) ≡ L(x − T_L(x)) defined above, this reads

F(x) − F(T_L(x)) ≥ (1/2L)‖G_L(x)‖²,

from which the rate O(1/√N) for min_{0≤k≤N} ‖G_L(x^k)‖ can be easily deduced... Try!


O(1/k) Rate of Convergence of Proximal Gradient

Theorem. Suppose that f is convex. Let {x^k}_{k≥0} be the sequence generated by the proximal gradient method. Then for any x∗ ∈ X∗ and k ≥ 1,

F(x^k) − F_opt ≤ L_f‖x⁰ − x∗‖² / (2k).

I Thus, to solve (M), the proximal gradient method converges at a sublinear rate in function values.

I Note: The sequence {x^k} can be proven to converge to a solution x∗ with a stepsize in (0, 2/L_f); see refs.
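The bound is easy to verify numerically. A sketch with g = 0 and a strongly convex quadratic (our test problem), where x* = A⁻¹b is known in closed form:

```python
import numpy as np

# Check F(x^k) - F_opt <= L_f ||x^0 - x*||^2 / (2k) for the gradient
# method (PGM with g = 0) on f(x) = 0.5 x^T A x - b^T x.
rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
A = B.T @ B + np.eye(6)              # positive definite
b = rng.standard_normal(6)
L = np.linalg.eigvalsh(A).max()

F = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
F_opt = F(x_star)

x0 = np.zeros(6)
x = x0.copy()
bound_holds = True
for k in range(1, 201):
    x = x - (A @ x - b) / L          # gradient step with stepsize 1/L
    bound = L * np.linalg.norm(x0 - x_star) ** 2 / (2 * k)
    bound_holds = bound_holds and (F(x) - F_opt <= bound + 1e-12)
```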


Proof of Rate for Prox-Grad: x^{n+1} = T_L(x^n)

F(x) − F(T_L(y)) ≥ (L/2)‖x − T_L(y)‖² − (L/2)‖x − y‖² + ℓ_f(x, y).

Proof. Set L = L_f, x = x∗ and y = x^n in the prox-grad inequality (ℓ_f(x∗, x^n) ≥ 0):

I F(x∗) − F(x^{n+1}) ≥ (L_f/2)‖x∗ − x^{n+1}‖² − (L_f/2)‖x∗ − x^n‖².

I Summing over n = 0, 1, . . . , k − 1 we obtain

Σ_{n=0}^{k−1} [F(x∗) − F(x^{n+1})] ≥ (L_f/2){‖x∗ − x^k‖² − ‖x∗ − x⁰‖²}.

I (★) Σ_{n=0}^{k−1} F(x^{n+1}) − kF_opt ≤ (L_f/2)‖x∗ − x⁰‖² − (L_f/2)‖x∗ − x^k‖² ≤ (L_f/2)‖x∗ − x⁰‖².

I By the sufficient decrease Corollary one has F(x^{n+1}) ≤ F(x^n) ⇒

I Σ_{n=0}^{k−1} n(F(x^{n+1}) − F(x^n)) = −Σ_{n=0}^{k−1} F(x^{n+1}) + kF(x^k) ≤ 0.

I Consequently, adding the above to (★): F(x^k) − F_opt ≤ L_f‖x∗ − x⁰‖²/(2k).


Iteration Complexity of PGM

Recall: An ε-optimal solution of problem (P) is a vector x satisfying

F(x) − F_opt ≤ ε.

We now ask: how many iterations are required to satisfy F(x^k) − F_opt ≤ ε?

For PGM we have proved: F(x^k) − F_opt ≤ L_f‖x⁰ − x∗‖²/(2k) ≡ L_fR²/(2k),

with R defined by R² := max{‖x∗ − x⁰‖² : x∗ ∈ X∗}. Therefore,

Theorem [O(1/ε) complexity of PGM]. For any k satisfying

k ≥ ⌈L_fR²/(2ε)⌉,

it holds that F(x^k) − F_opt ≤ ε.


Special Cases

I With g ≡ 0 and g = δ_C in our model (P), PGM recovers the basic gradient and gradient projection methods, respectively.

I Therefore, both methods have complexity O(1/ε).

I Let us look at the case when f = 0 in (P).

I In that case, we have to solve a general convex nonsmooth problem, and PGM reduces to the Proximal Minimization Algorithm - PMA described next.


Proximal Minimization Algorithm - PMA

Set f ≡ 0 in (P), i.e., we solve the general convex nonsmooth problem

min{g(x) : x ∈ E}.

PGM reduces to the Proximal Minimization Algorithm (Martinet (70)):

x⁰ ∈ E,  x^k = argmin_{x∈E} { g(x) + (1/2t_k)‖x − x^{k−1}‖² },  t_k > 0.

This is an implicit discretization of 0 ∈ dx(t)/dt + ∂g(x(t)), x(0) = x⁰.

Theorem. Let {x^k} be the sequence generated by PMA, and set σ_k = Σ_{s=1}^{k} t_s. Then

g(x^k) − g(x) ≤ ‖x⁰ − x‖²/(2σ_k),  ∀x ∈ E.

In particular, if σ_k → ∞ then g(x^k) ↓ g∗ = inf_x g(x), and if X∗ ≠ ∅, then x^k converges to some point in X∗.

Consequence: The complexity of PMA is O(1/ε) — "better" than the subgradient method's O(1/ε²). But, no free lunch: it requires g to be prox-friendly!
Very useful when combined with duality → Augmented Lagrangian Methods.
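A one-dimensional illustration (ours, not from the lecture) with g(x) = |x|, whose prox is soft-thresholding: each PMA step with constant stepsize t shrinks the iterate by t, so g(x^k) decreases monotonically to g* = 0:

```python
# PMA for g(x) = |x| with constant stepsize t: the prox step
# x_k = prox_{t|.|}(x_{k-1}) soft-thresholds the previous iterate.
t = 0.1
x = 5.0
vals = []
for k in range(100):
    x = max(abs(x) - t, 0.0) * (1.0 if x >= 0 else -1.0)  # prox step
    vals.append(abs(x))  # g(x_k)
```

The theorem's bound g(x^k) − g(0) ≤ ‖x⁰‖²/(2σ_k) with σ_k = kt is loose here: the iterate actually reaches the minimizer 0 after finitely many steps.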


Strongly Convex Case - Linear Rate for Prox-Grad

Theorem. Suppose that f is σ-strongly convex (σ > 0). Let {x^k}_{k≥0} be the sequence generated by the proximal gradient method. Then for any x∗ ∈ X∗ and k ≥ 0,

(a) ‖x^{k+1} − x∗‖² ≤ (1 − σ/L_f)‖x^k − x∗‖².

(b) ‖x^k − x∗‖² ≤ (1 − σ/L_f)^k‖x⁰ − x∗‖².

(c) F(x^{k+1}) − F_opt ≤ (L_f/2)(1 − σ/L_f)^{k+1}‖x⁰ − x∗‖².
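Part (a) can be checked numerically; a sketch with g = 0 on a σ-strongly convex quadratic with known σ and L_f (a diagonal test problem of our choosing):

```python
import numpy as np

# Verify ||x^{k+1} - x*||^2 <= (1 - sigma/L_f) ||x^k - x*||^2 for gradient
# descent (PGM with g = 0) on f(x) = 0.5 x^T A x - b^T x, A = diag(1, 4, 10).
A = np.diag([1.0, 4.0, 10.0])        # eigenvalues give sigma = 1, L_f = 10
b = np.array([1.0, -2.0, 3.0])
sigma, L = 1.0, 10.0
x_star = np.linalg.solve(A, b)

x = np.zeros(3)
contraction_holds = True
for _ in range(50):
    x_new = x - (A @ x - b) / L      # gradient step with stepsize 1/L_f
    lhs = np.linalg.norm(x_new - x_star) ** 2
    rhs = (1 - sigma / L) * np.linalg.norm(x - x_star) ** 2
    contraction_holds = contraction_holds and (lhs <= rhs + 1e-12)
    x = x_new
```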

Proof

I Plugging L = L_f, x = x∗ and y = x^k in the prox-grad inequality:

F(x∗) − F(x^{k+1}) ≥ (L_f/2)‖x∗ − x^{k+1}‖² − (L_f/2)‖x∗ − x^k‖² + ℓ_f(x∗, x^k).

I ℓ_f(x∗, x^k) = f(x∗) − f(x^k) − 〈∇f(x^k), x∗ − x^k〉 ≥ (σ/2)‖x^k − x∗‖²  ← [f is strongly convex]

I F(x∗) − F(x^{k+1}) ≥ (L_f/2)‖x∗ − x^{k+1}‖² − ((L_f − σ)/2)‖x∗ − x^k‖².

I Since F(x∗) − F(x^{k+1}) ≤ 0,

‖x^{k+1} − x∗‖² ≤ (1 − σ/L_f)‖x^k − x∗‖²,

establishing part (a). Part (b) follows immediately by recursion on (a).

I (c) follows from the third inequality above written as:

F(x^{k+1}) − F_opt ≤ ((L_f − σ)/2)‖x^k − x∗‖² − (L_f/2)‖x^{k+1} − x∗‖²
                  ≤ (L_f/2)(1 − σ/L_f)‖x^k − x∗‖²
                  ≤ (L_f/2)(1 − σ/L_f)^{k+1}‖x⁰ − x∗‖².   [using (b)]


Complexity of PGM - Strongly Convex Case

A direct result of the rate analysis. We let κ := L_f/σ.

Theorem [O(κ log(1/ε)) complexity of strongly convex PGM]. For any k ≥ 1 satisfying

k ≥ κ log(1/ε) + κ log(L_fR²/2),

it holds that F(x^k) − F_opt ≤ ε.

References

[1] A. Beck and M. Teboulle, Fast Gradient-Based Algorithms for Constrained Total Variation Image Denoising and Deblurring Problems. IEEE Trans. Image Processing, 18, 2419–2434 (2009).

[2] A. Beck and M. Teboulle, Gradient-Based Algorithms with Applications to Signal Recovery Problems. In: Convex Optimization in Signal Processing and Communications, Y. Eldar and D. Palomar (eds.), Cambridge University Press (2010).

[3] P. L. Combettes and V. R. Wajs, Signal Recovery by Proximal Forward-Backward Splitting. Multiscale Model. Simul. (2005).

First Order Optimization Methods

Lecture 4 - Fast Proximal Gradient (FISTA)

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

PGMO Lecture Series

January 25-26, 2017 Ecole Polytechnique, Paris

Marc Teboulle First Order Optimization Methods Lecture 4 - Fast Proximal Gradient (FISTA) 1 / 25

Lecture 4 – Fast Proximal Gradient Method (FISTA)

The previous explicit methods are simple but might be too slow.

I Subgradient methods: rate of O(1/√k).

I Gradient / prox-gradient methods: rate of O(1/k).

Can we do better to solve the composite nonsmooth problem (P)?

(P) min{F(x) := f(x) + g(x) : x ∈ E}.

• Same underlying assumptions:

(A) g : E → (−∞, ∞] is proper, closed and convex.

(B) f : E → R is L_f-smooth and convex.

(C) The optimal set X∗ of (P) is nonempty, with optimal value denoted F_opt.

Can we devise a method with:
♠ the same computational effort/simplicity as prox-grad,
♠ a faster global rate of convergence?


Yes we Can...

Answer: Yes, through an "equally simple" scheme.

• The Idea: instead of making a step of the form

x^{k+1} = prox_{g/L_k}(x^k − (1/L_k)∇f(x^k)),

we will consider a step of the same form at an auxiliary point y^k:

x^{k+1} = prox_{g/L_k}(y^k − (1/L_k)∇f(y^k)),

where y^k is a special linear combination of x^k, x^{k−1}.


FISTA for Composite Nonsmooth Model (P)

FISTA – Input: convex (f, g) with f L_f-smooth.
Initialization: set y⁰ = x⁰ ∈ E and t₀ = 1.
General step: for any k = 0, 1, 2, . . . execute the following steps:

(a) Pick L_k > 0.

(b) x^{k+1} = T_{L_k}(y^k) ≡ prox_{g/L_k}(y^k − (1/L_k)∇f(y^k)).

(c) t_{k+1} = (1 + √(1 + 4t_k²))/2.

(d) y^{k+1} = x^{k+1} + ((t_k − 1)/t_{k+1})(x^{k+1} − x^k).

I The main computation, (b), is the same as in prox-grad; (c)-(d) are marginal.

I With t_k ≡ 1 for all k, we recover ProxGrad.
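A direct transcription of steps (a)-(d) with constant L_k = L_f, specialized (our illustrative choice, not from the lecture) to f(x) = ½‖Ax − b‖² and g = λ‖·‖₁:

```python
import numpy as np

# FISTA with constant stepsize for F(x) = 0.5 ||Ax - b||^2 + lam * ||x||_1.
def fista(A, b, lam, iters=1000):
    L = np.linalg.norm(A, 2) ** 2                    # L_f for f = 0.5||Ax-b||^2
    x = y = np.zeros(A.shape[1])
    t = 1.0                                          # t_0 = 1
    for _ in range(iters):
        z = y - (A.T @ (A @ y - b)) / L              # (b) prox-grad step at y
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2     # (c) update t_k
        y = x_new + ((t - 1) / t_new) * (x_new - x)  # (d) momentum step
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.5
x = fista(A, b, lam)
g = A.T @ (A @ x - b)    # optimality requires ||grad f(x*)||_inf <= lam
```

Each iteration costs one gradient and one prox evaluation, exactly as in PGM; the momentum step (d) is what buys the O(1/k²) rate.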

Some Technical Remarks

I Stepsize rules. Here we will use the constant stepsize L_k = L_f for all k.

I As mentioned before, when L_f is unknown, we can simply use a backtracking procedure; see refs. [1,2] for details.

I The sequence {t_k} is determined as the positive root of the quadratic equation

t_{k+1}² − t_{k+1} = t_k²,  solved by t_{k+1} = (1 + √(1 + 4t_k²))/2.

I It is easy to show by induction that t_k ≥ (k + 2)/2 for all k ≥ 0.

I Therefore, 0 < t_k^{−1} ≤ 1 for all k.
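The two key properties of {t_k} can be checked directly (a small sketch of ours):

```python
import math

# The FISTA sequence t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2 with t_0 = 1:
# verify t_k >= (k + 2)/2 and the defining relation t_{k+1}^2 - t_{k+1} = t_k^2.
t = 1.0
props_hold = True
for k in range(100):
    props_hold = props_hold and (t >= (k + 2) / 2)
    t_next = (1 + math.sqrt(1 + 4 * t * t)) / 2
    props_hold = props_hold and abs(t_next**2 - t_next - t**2) < 1e-8
    t = t_next
```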

O(1/k²) Rate of Convergence of FISTA

Theorem. Let {x^k}_{k≥0} be the sequence generated by FISTA with the constant stepsize L_k ≡ L_f. Then for any x∗ ∈ X∗ and k ≥ 1,

F(x^k) − F_opt ≤ 2L_f‖x⁰ − x∗‖²/(k + 1)².

Thus, the complexity of FISTA is O(1/√ε), a square-root factor better than PGM!


Main Pillar in Analysis - A Special Recursion

Lemma (Recursion). The sequences {x^k, y^k} generated by FISTA with the constant stepsize rule satisfy, for every k ≥ 1,

(2/L)t_{k−1}²v_k − (2/L)t_k²v_{k+1} ≥ ‖u^{k+1}‖² − ‖u^k‖²,

where v_k := F(x^k) − F_opt and u^k := t_{k−1}x^k − (t_{k−1} − 1)x^{k−1} − x∗.

Proof. We apply the fundamental prox-grad inequality

F(x) − F(T_L(y)) ≥ (L/2)‖x − T_L(y)‖² − (L/2)‖x − y‖² + ℓ_f(x, y).

Take x = t_k^{−1}x∗ + (1 − t_k^{−1})x^k and y = y^k, L ≡ L_f; here x^{k+1} = T_L(y^k):

F(t_k^{−1}x∗ + (1 − t_k^{−1})x^k) − F(x^{k+1})
  ≥ (L/2)‖x^{k+1} − (t_k^{−1}x∗ + (1 − t_k^{−1})x^k)‖² − (L/2)‖y^k − (t_k^{−1}x∗ + (1 − t_k^{−1})x^k)‖²
  = (L/2t_k²)‖t_kx^{k+1} − (x∗ + (t_k − 1)x^k)‖² − (L/2t_k²)‖t_ky^k − (x∗ + (t_k − 1)x^k)‖².  (1)


Proof of Recursion Lemma Contd.

I By the convexity of F, F(t_k^{−1}x∗ + (1 − t_k^{−1})x^k) ≤ t_k^{−1}F(x∗) + (1 − t_k^{−1})F(x^k).

I Using the notation v_k ≡ F(x^k) − F_opt,

F(t_k^{−1}x∗ + (1 − t_k^{−1})x^k) − F(x^{k+1}) ≤ (1 − t_k^{−1})v_k − v_{k+1}.  (2)

I Using the relation given by FISTA, y^k = x^k + ((t_{k−1} − 1)/t_k)(x^k − x^{k−1}),

‖t_ky^k − (x∗ + (t_k − 1)x^k)‖² = ‖t_{k−1}x^k − (x∗ + (t_{k−1} − 1)x^{k−1})‖².  (3)

I Combining (1), (2) and (3),

(t_k² − t_k)v_k − t_k²v_{k+1} ≥ (L/2)‖u^{k+1}‖² − (L/2)‖u^k‖²,

where u^k ≡ t_{k−1}x^k − (x∗ + (t_{k−1} − 1)x^{k−1}).

I Since t_k² − t_k = t_{k−1}²,

(2/L)t_{k−1}²v_k − (2/L)t_k²v_{k+1} ≥ ‖u^{k+1}‖² − ‖u^k‖².


Proof of Theorem: O(1/k²) Rate for FISTA

Using the Recursion Lemma:

(2/L)t_{k−1}²v_k − (2/L)t_k²v_{k+1} ≥ ‖u^{k+1}‖² − ‖u^k‖².

I Written as ‖u^{k+1}‖² + (2/L)t_k²v_{k+1} ≤ ‖u^k‖² + (2/L)t_{k−1}²v_k, ⇒

I ‖u^k‖² + (2/L)t_{k−1}²v_k ≤ ‖u¹‖² + (2/L)t₀²v₁ = ‖x¹ − x∗‖² + (2/L)(F(x¹) − F_opt).

I Therefore,

(2/L)t_{k−1}²v_k ≤ ‖x¹ − x∗‖² + (2/L)(F(x¹) − F_opt).

I Now, substituting x = x∗, y = y⁰ and L in the prox-grad inequality, we get

(2/L)(F(x∗) − F(x¹)) ≥ ‖x¹ − x∗‖² − ‖y⁰ − x∗‖².

I Since y⁰ = x⁰: ‖x¹ − x∗‖² + (2/L)(F(x¹) − F_opt) ≤ ‖x⁰ − x∗‖².

I Combining the last two displayed inequalities and using t_{k−1} ≥ (k + 1)/2 (recall v_k = F(x^k) − F_opt) yields the result,

F(x^k) − F_opt ≤ 2L‖x⁰ − x∗‖²/(k + 1)².


Alternative Choice for t_k

▸ For the proof of the O(1/k²) rate, it is enough to require that {t_k}_{k≥0} satisfies
  (a) t_k ≥ (k + 2)/2;
  (b) t²_{k+1} − t_{k+1} ≤ t²_k. (See the proof of the last inequality in the Lemma.)
▸ The choice t_k = (k + 2)/2 also satisfies these two properties. (a) is obvious. (b) holds since

    t²_{k+1} − t_{k+1} = t_{k+1}(t_{k+1} − 1) = ((k + 3)/2)·((k + 1)/2) = (k² + 4k + 3)/4
                       ≤ (k² + 4k + 4)/4 = ((k + 2)/2)² = t²_k.
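Both conditions can be checked numerically for the standard FISTA recursion and for the simple choice t_k = (k + 2)/2. A minimal sketch (our code, not from the lecture; function names are ours):

```python
import math

def fista_t_sequence(n):
    """Standard FISTA sequence: t_0 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    t = [1.0]
    for _ in range(n):
        t.append((1.0 + math.sqrt(1.0 + 4.0 * t[-1] ** 2)) / 2.0)
    return t

def satisfies_conditions(t):
    """Check (a) t_k >= (k + 2)/2 and (b) t_{k+1}^2 - t_{k+1} <= t_k^2."""
    a = all(t[k] >= (k + 2) / 2 - 1e-12 for k in range(len(t)))
    b = all(t[k + 1] ** 2 - t[k + 1] <= t[k] ** 2 + 1e-9 for k in range(len(t) - 1))
    return a and b

print(satisfies_conditions(fista_t_sequence(100)))              # True (standard rule, with equality in (b))
print(satisfies_conditions([(k + 2) / 2 for k in range(101)]))  # True (simple choice)
```

For the standard recursion, (b) holds with equality by construction, which is why either sequence yields the same O(1/k²) bound.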


ISTA/FISTA - l1-Regularization

Consider the model

    min_{x∈Rⁿ} f(x) + λ‖x‖₁,

▸ λ > 0;
▸ f : Rⁿ → R convex and L_f-smooth.

Iterative Shrinkage/Thresholding Algorithm (ISTA):

    x^{k+1} = T_{λ/L_f}( x^k − (1/L_f) ∇f(x^k) ).

Fast Iterative Shrinkage/Thresholding Algorithm (FISTA):

    (a) x^{k+1} = T_{λ/L_f}( y^k − (1/L_f) ∇f(y^k) );
    (b) t_{k+1} = (1 + √(1 + 4t²_k)) / 2;
    (c) y^{k+1} = x^{k+1} + ((t_k − 1)/t_{k+1}) (x^{k+1} − x^k).
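Here T_α is the soft-thresholding (shrinkage) operator, the prox map of α‖·‖₁, applied componentwise: T_α(x)_i = sign(x_i)·max(|x_i| − α, 0). A minimal sketch (our code):

```python
import numpy as np

def soft_threshold(x, alpha):
    """Prox of alpha*||.||_1: componentwise shrinkage sign(x_i) * max(|x_i| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)
```

For example, soft_threshold([3, -0.5, 1], 1) returns [2, 0, 0]: components below the threshold are zeroed, the rest are shrunk toward zero, which is what produces sparse iterates.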

l1-Regularized Least Squares

Consider the problem

    min_{x∈Rⁿ} (1/2)‖Ax − b‖²₂ + λ‖x‖₁,

▸ A ∈ R^{m×n}, b ∈ R^m and λ > 0.
▸ Fits (P) with f(x) = (1/2)‖Ax − b‖²₂ and g(x) = λ‖x‖₁.
▸ f is L_f-smooth with L_f = ‖AᵀA‖₂ = λ_max(AᵀA).

ISTA:

    x^{k+1} = T_{λ/L_k}( x^k − (1/L_k) Aᵀ(Ax^k − b) ).

FISTA:

    (a) x^{k+1} = T_{λ/L_k}( y^k − (1/L_k) Aᵀ(Ay^k − b) );
    (b) t_{k+1} = (1 + √(1 + 4t²_k)) / 2;
    (c) y^{k+1} = x^{k+1} + ((t_k − 1)/t_{k+1}) (x^{k+1} − x^k).
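The two schemes above, with the constant stepsize L_k ≡ L_f, can be sketched as follows (our code, not the lecture's; names are ours):

```python
import numpy as np

def soft_threshold(x, alpha):
    """Prox of alpha*||.||_1 (componentwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def ista(A, b, lam, n_iter):
    """ISTA for (1/2)||Ax - b||^2 + lam*||x||_1 with constant stepsize 1/L_f."""
    L = np.linalg.norm(A.T @ A, 2)              # L_f = lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

def fista(A, b, lam, n_iter):
    """FISTA for the same model: gradient step at the extrapolated point y^k."""
    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

On random instances of the kind used in the numerical example below, 200 FISTA iterations typically reach a far smaller objective gap than 200 ISTA iterations, matching the O(1/k²) versus O(1/k) rates.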

Numerical Example I

▸ Test on the l1-regularized least-squares problem.
▸ λ = 1.
▸ A ∈ R^{100×110}; the components of A were independently generated from a standard normal distribution.
▸ The "true" vector is x_true = e₃ − e₇.
▸ b = Ax_true.
▸ Ran 200 iterations of ISTA and FISTA with x⁰ = e.

Function Values

[Figure: F(x^k) − F_opt versus iteration number (log scale, 10⁻¹² to 10⁴) over 200 iterations, for ISTA and FISTA.]

FISTA is NOT a monotone algorithm!

Solutions

[Figure: components of the recovered vectors (indices 0–110, values in [−1, 1]) for ISTA (left) and FISTA (right).]

Example 2: Wavelet-Based Image Deblurring

    min_x (1/2)‖Ax − b‖² + λ‖x‖₁

▸ Image of size 512×512.
▸ The matrix A is dense (Gaussian blur composed with the inverse of a two-stage Haar wavelet transform).
▸ All problems solved with fixed λ and Gaussian noise.

Deblurring of the Cameraman

[Figure: original image (left); blurred and noisy image (right).]

1000 Iterations of ISTA versus 200 of FISTA

[Figure: deblurred images after 1000 ISTA iterations (left) and 200 FISTA iterations (right).]

Original Versus Deblurring via FISTA

[Figure: original image (left); FISTA reconstruction after 1000 iterations (right).]

Function Value Errors F(x^k) − F(x*)

[Figure: F(x^k) − F(x*) versus iteration number (log scale, 10⁻⁸ to 10¹) over 10000 iterations, for ISTA and FISTA.]

FISTA in the Strongly Convex Case

▸ Assume that f is σ-strongly convex for some σ > 0.
▸ The proximal gradient method attains an ε-optimal solution after an order of O(κ log(1/ε)) iterations (κ = L_f/σ).
▸ A natural question: how does the complexity result change when using FISTA?
▸ Answered by incorporating a restarting mechanism into FISTA, which improves the complexity to O(√κ log(1/ε)).

Restarted FISTA
Initialization: pick z^{−1} ∈ E and a positive integer N. Set z⁰ = T_{L_f}(z^{−1}).
General step (k ≥ 0):
▸ run N iterations of FISTA with constant stepsize (L_k ≡ L_f) and input (f, g, z^k), obtaining a sequence {xⁿ}_{n=0}^N;
▸ set z^{k+1} = x^N.


Restarted FISTA – Complexity O(√κ log(1/ε))

Theorem [O(√κ log(1/ε)) complexity of restarted FISTA]. Suppose that f is σ-strongly convex (σ > 0). Let {z^k}_{k≥0} be the sequence generated by the restarted FISTA method employed with N = ⌈√(8κ) − 1⌉. Let R be an upper bound on ‖z^{−1} − x*‖. Then:

(a) F(z^k) − F_opt ≤ (L_f R²/2) (1/2)^k;

(b) after k iterations of FISTA with k satisfying

    k ≥ √(8κ) ( log(1/ε)/log(2) + log(L_f R²)/log(2) ),

an ε-optimal solution is obtained at the end of the last completed cycle:

    F(z^{⌊k/N⌋}) − F_opt ≤ ε.

Sketch of Proof of (a)

(b) follows easily from (a). To prove (a), N iterations of FISTA give:

▸ F(z^{n+1}) − F_opt ≤ 2L_f ‖z^n − x*‖² / (N + 1)².
▸ F(z^n) − F_opt ≥ (σ/2)‖z^n − x*‖²   (F is σ-strongly convex).
▸ Hence, F(z^{n+1}) − F_opt ≤ 4κ (F(z^n) − F_opt) / (N + 1)².
▸ Since N ≥ √(8κ) − 1, it follows that 4κ/(N + 1)² ≤ 1/2, and hence

    F(z^{n+1}) − F_opt ≤ (1/2)(F(z^n) − F_opt).

▸ ⇒ F(z^k) − F_opt ≤ (1/2)^k (F(z⁰) − F_opt).
▸ It remains to show that F(z⁰) − F_opt ≤ L_f R²/2 to complete the proof of (a).
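The restart scheme can be sketched for the l1-regularized least-squares model with A of full column rank, so that f is σ-strongly convex with σ = λ_min(AᵀA). A minimal sketch (our construction, not the lecture's code):

```python
import numpy as np

def soft_threshold(x, alpha):
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def fista_cycle(A, b, lam, L, x0, N):
    """N FISTA iterations with constant stepsize 1/L, (re)started at x0."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(N):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

def restarted_fista(A, b, lam, n_cycles):
    """Restarted FISTA: cycles of length N = ceil(sqrt(8*kappa) - 1), momentum reset each cycle."""
    eig = np.linalg.eigvalsh(A.T @ A)
    L, sigma = eig[-1], eig[0]                        # L_f and strong convexity modulus
    N = int(np.ceil(np.sqrt(8.0 * L / sigma) - 1.0))  # cycle length from the theorem
    z = soft_threshold(A.T @ b / L, lam / L)          # z^0 = T_L(z^{-1}) with z^{-1} = 0
    for _ in range(n_cycles):
        z = fista_cycle(A, b, lam, L, z, N)
    return z
```

By part (a) of the theorem, each completed cycle at least halves the objective gap, so the objective is nonincreasing across cycles.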

Applications/Limitations of FISTA for (P)

    (P) min{ f(x) + g(x) : x ∈ E }

The smooth convex function can be of any type f ∈ C^{1,1} with an available gradient. As long as the prox of the nonsmooth function g,

    argmin_{x∈E} { g(x) + (L/2) ‖x − (y − (1/L)∇f(y))‖² },

can be computed analytically or easily/efficiently, FISTA is useful and very efficient.

FISTA solves some interesting models in

▸ signal/image recovery problems;
▸ matrix minimization problems arising in many machine learning models, e.g., nuclear norm regularization, multi-task learning, matrix classification, matrix completion problems.

References

Lecture based on [1] and [2].

1. A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. (2009).
2. A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal-recovery problems, in Convex Optimization in Signal Processing and Communications (2010).
3. Y. Nesterov, Gradient methods for minimizing composite functions, Math. Program. (2013).

First Order Optimization Methods

Lecture 5 - Smoothing Methods

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

PGMO Lecture Series

January 25-26, 2017, Ecole Polytechnique, Paris

Lecture 5 – Smoothing Methods

▸ Basic idea: Given a nonsmooth objective J, transform

    (NSO) min_{x∈X} J(x)  −→  (SO) min_{x∈X} J_µ(x),  J_µ smooth,

  where µ > 0 is a smoothing parameter such that lim_{µ→0⁺} J_µ(x) = J(x).
▸ Motivation + Good News: Smooth problems are often easier to handle and can be solved more efficiently.
▸ Bad News:
  ♠ One needs to solve a sequence of problems with µ ↘ 0.
  ♠ Convergence results/approximate solutions are not for the original nonsmooth objective J, but for the approximate smoothed problem J_µ.
▸ What we would like: Results for the original nonsmooth problem.


FOM for convex nonsmooth problems

▸ As seen, in general, nonsmooth convex optimization problems can be solved with complexity O(1/ε²).
▸ FISTA requires O(1/√ε) iterations to obtain an ε-optimal solution of the composite smooth+nonsmooth model f + g, when g is prox-friendly.

Can we reduce this gap? Can we solve NSO in ~O(1/ε) iterations via one fixed smoothed counterpart problem? Answer: Yes!

▸ We will show how FISTA can be used to devise a method for more general nonsmooth convex problems.
▸ We will see that an ε-optimal solution of the original nonsmooth problem can be obtained with an improved complexity of O(1/ε), by solving one adequate smoothed counterpart.


Model and Idea

The model under consideration is

    (P) min{ f(x) + h(x) + g(x) : x ∈ E }.

▸ f is L_f-smooth and convex;
▸ g is proper, closed, convex, and "proximable";
▸ h is real-valued and convex (but not "proximable").

The Idea - Partial Smoothing

▸ Solving (P) with FISTA with smooth/nonsmooth parts (f, g + h) is not practical.
▸ The idea is to find a smooth approximation of h, say h_µ, and solve the problem via FISTA with smooth and nonsmooth parts taken as (f + h_µ, g).
▸ This simple idea is the basis for the improved O(1/ε) complexity.
▸ We need to study in more detail the notions of smooth approximations and smoothability.


Smooth Approximations and Smoothability

▸ Definition. A convex function h : E → R is called (α, β)-smoothable (α, β > 0) if for any µ > 0 there exists a convex differentiable function h_µ : E → R such that
  (a) h_µ(x) ≤ h(x) ≤ h_µ(x) + βµ for all x ∈ E;
  (b) h_µ is (α/µ)-smooth.
▸ The function h_µ is called a (1/µ)-smooth approximation of h with parameters (α, β).

Let us see two basic examples.

Examples

▹ Smoothing the l2-norm. h(x) = ‖x‖₂ (E = Rⁿ). For any µ > 0,

    √(‖x‖²₂ + µ²) − µ ≤ ‖x‖₂ ≤ (√(‖x‖²₂ + µ²) − µ) + µ.

Thus, h_µ(x) ≡ √(‖x‖²₂ + µ²) − µ satisfies (a) with β = 1, and one easily verifies that h_µ is (1/µ)-smooth; hence h is (1, 1)-smoothable.

▹ Smoothing the "max" via the Log-Sum-Exp function. h(x) = max{x₁, x₂, …, xₙ} (E = Rⁿ). For any µ > 0, h_µ(x) = µ log(Σᵢ₌₁ⁿ e^{xᵢ/µ}) − µ log n is a smooth approximation of h with parameters (1, log n) ⇒ h is (1, log n)-smoothable.

One way to see this, recall:

    ‖z‖_∞ = max{|z₁|, …, |zₙ|} = lim_{p→∞} (Σᵢ₌₁ⁿ |zᵢ|^p)^{1/p}.

Set µ := 1/p, µ → 0⁺, and |zᵢ| = e^{xᵢ}, pass to the log, and also note that

    µ log( (Σᵢ₌₁ⁿ e^{xᵢ/µ}) / n ) ≤ max{x₁, …, xₙ}.

Another (and more general) way is via asymptotic (recession) functions and generalized means; see ref. [T. 2007].
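The sandwich h_µ(x) ≤ h(x) ≤ h_µ(x) + µ log n for the log-sum-exp smoothing can be verified numerically. A minimal sketch (our code; the max-shift is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def lse_smooth_max(x, mu):
    """h_mu(x) = mu*log(sum_i exp(x_i/mu)) - mu*log(n), smooth approximation of max."""
    m = np.max(x)  # shift by the max for numerical stability; exact, since it cancels in the log
    return mu * np.log(np.sum(np.exp((x - m) / mu))) + m - mu * np.log(len(x))

rng = np.random.default_rng(1)
for _ in range(1000):
    x, mu = rng.standard_normal(5), 0.1
    h, h_mu = np.max(x), lse_smooth_max(x, mu)
    assert h_mu <= h + 1e-12                    # h_mu(x) <= h(x)
    assert h <= h_mu + mu * np.log(5) + 1e-12   # h(x) <= h_mu(x) + (log n) * mu
```

Both inequalities follow from e^{max/µ} ≤ Σᵢ e^{xᵢ/µ} ≤ n·e^{max/µ}; the check above confirms them on random data.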

Basic Properties and Algebra for Smoothable Functions

Proposition.

▸ A nonnegative linear combination of smoothable functions is smoothable: if h₁ and h₂ are (α₁, β₁)- and (α₂, β₂)-smoothable, then γ₁h₁ + γ₂h₂ (γ₁, γ₂ ≥ 0) is smoothable with parameters (γ₁α₁ + γ₂α₂, γ₁β₁ + γ₂β₂).
▸ A linear change of variables of a smoothable function is smoothable: let A : E → V be a linear map and b ∈ V. If h is (α, β)-smoothable with smooth approximation h_µ, then q(x) ≡ h(A(x) + b), with smooth approximation

    q_µ(x) ≡ h_µ(A(x) + b),

  is smoothable with parameters (α‖A‖², β).

Proof: Follows easily from the definition.

Smooth Approximation of Piecewise Affine Functions

▸ Let q(x) = max_{i=1,…,m} {aᵢᵀx + bᵢ}, where aᵢ ∈ Rⁿ and bᵢ ∈ R for each i = 1, 2, …, m.
▸ q(x) = g(Ax + b), where g(y) = max{y₁, y₂, …, y_m}, A is the matrix whose rows are a₁ᵀ, a₂ᵀ, …, a_mᵀ, and b = (b₁, b₂, …, b_m)ᵀ.
▸ Let µ > 0. g_µ(y) = µ log(Σᵢ₌₁ᵐ e^{yᵢ/µ}) − µ log m is a (1/µ)-smooth approximation of g with parameters (1, log m).
▸ Therefore,

    q_µ(x) ≡ g_µ(Ax + b) = µ log(Σᵢ₌₁ᵐ e^{(aᵢᵀx + bᵢ)/µ}) − µ log m

  is a (1/µ)-smooth approximation of q with parameters (‖A‖²₂, log m).

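The smoothed piecewise-affine function q_µ has the closed-form gradient ∇q_µ(x) = Aᵀw, where w is the softmax of (Ax + b)/µ. A minimal sketch (our code), with the gradient verified against finite differences:

```python
import numpy as np

def smoothed_max_affine(A, b, x, mu):
    """q_mu(x) = mu*log(sum_i exp((a_i^T x + b_i)/mu)) - mu*log(m), and its gradient A^T w,
    where w are the softmax weights of (Ax + b)/mu."""
    y = A @ x + b
    m = y.max()                              # shift for numerical stability
    e = np.exp((y - m) / mu)
    val = mu * np.log(e.sum()) + m - mu * np.log(len(y))
    grad = A.T @ (e / e.sum())
    return val, grad
```

One can check both the (1, log m) sandwich against max_i(aᵢᵀx + bᵢ) and the gradient formula by central differences.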

The Moreau Envelope

Definition. Given a proper closed convex function f : E → (−∞, ∞] and µ > 0, the Moreau envelope of f is the function

    M_f^µ(x) = min_{u∈E} { f(u) + (1/2µ)‖u − x‖² }.

▸ By the prox theorem, the minimization problem defining the Moreau envelope has a unique solution, given by prox_{µf}(x). Therefore,

    M_f^µ(x) = f(prox_{µf}(x)) + (1/2µ)‖x − prox_{µf}(x)‖².

We will show that the parameter µ plays the role of a smoothing parameter. First, some examples.

Examples

▸ Indicators. Suppose that f = δ_C, where C ⊆ E is a nonempty closed convex set. Then prox_f = P_C and

    M_f^µ(x) = δ_C(P_C(x)) + (1/2µ)‖x − P_C(x)‖².

  Therefore,

    M_{δ_C}^µ = (1/2µ) d²_C.

▸ Euclidean norms. f(x) = ‖x‖. Then for any µ > 0 and x ∈ E,

    prox_{µf}(x) = (1 − µ/max{‖x‖, µ}) x.

  Therefore,

    M_f^µ(x) = ‖prox_{µf}(x)‖ + (1/2µ)‖x − prox_{µf}(x)‖²
             = { (1/2µ)‖x‖²,  ‖x‖ ≤ µ,
                 ‖x‖ − µ/2,   ‖x‖ > µ }  =: H_µ(x).

H_µ is known as the Huber function in statistics.
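The identity H_µ = M_f^µ for f = ‖·‖ can be checked numerically by evaluating the envelope through the prox formula. A minimal sketch (our code):

```python
import numpy as np

def huber(x, mu):
    """H_mu(x): (1/(2 mu))||x||^2 if ||x|| <= mu, else ||x|| - mu/2."""
    nx = np.linalg.norm(x)
    return nx * nx / (2.0 * mu) if nx <= mu else nx - mu / 2.0

def moreau_env_norm(x, mu):
    """M^mu_f for f = ||.||, via prox_{mu f}(x) = (1 - mu/max(||x||, mu)) x."""
    p = (1.0 - mu / max(np.linalg.norm(x), mu)) * x
    return np.linalg.norm(p) + np.linalg.norm(x - p) ** 2 / (2.0 * mu)
```

Both branches agree: for ‖x‖ ≤ µ the prox collapses to 0 and the envelope is the quadratic piece; for ‖x‖ > µ it shrinks x by µ along its direction and the envelope is the linear piece.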


Huber Function

H_µ gets smoother as µ increases.

[Figure: H_µ on [−5, 5] for µ = 0.1, 1, 4.]

Calculus of Moreau Envelopes

▸ Let f : E → (−∞, ∞] be a proper closed convex function, and let λ, µ > 0. Then

    λ M_f^µ(x) = M_{λf}^{µ/λ}(x).

▸ Suppose that E = E₁ × E₂ × ⋯ × E_m, and let f : E → (−∞, ∞] be given by

    f(x₁, x₂, …, x_m) = Σᵢ₌₁ᵐ fᵢ(xᵢ),  x₁ ∈ E₁, x₂ ∈ E₂, …, x_m ∈ E_m,

  with fᵢ : Eᵢ → (−∞, ∞] a proper closed convex function for each i. Then for any µ > 0,

    M_f^µ(x₁, x₂, …, x_m) = Σᵢ₌₁ᵐ M_{fᵢ}^µ(xᵢ).

Example: Let f(x) = ‖x‖₁ = Σᵢ₌₁ⁿ g(xᵢ), where g(t) = |t|. Thus,

    M_f^µ(x) = Σᵢ₌₁ⁿ M_g^µ(xᵢ) = Σᵢ₌₁ⁿ H_µ(xᵢ).
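The separability rule in the example can be cross-checked against a brute-force evaluation of the envelope's defining minimization. A minimal sketch (our code; the grid minimization is only a verification device):

```python
import numpy as np

def huber_scalar(t, mu):
    """Scalar Huber function H_mu(t)."""
    return t * t / (2.0 * mu) if abs(t) <= mu else abs(t) - mu / 2.0

def moreau_l1(x, mu):
    """Moreau envelope of ||.||_1 via separability: a sum of scalar Huber terms."""
    return sum(huber_scalar(t, mu) for t in x)

def moreau_l1_bruteforce(x, mu):
    """Direct coordinatewise minimization of |u| + (u - t)^2/(2 mu) on a fine grid."""
    total = 0.0
    for t in x:
        u = np.linspace(t - 3.0, t + 3.0, 60001)
        total += np.min(np.abs(u) + (u - t) ** 2 / (2.0 * mu))
    return total
```

The two values agree up to the grid resolution, confirming M_{‖·‖₁}^µ(x) = Σᵢ H_µ(xᵢ).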

A Key Property: The Moreau Envelope is (1/µ)-Smooth

Theorem. Let f : E → (−∞, ∞] be a proper closed convex function, and let µ > 0. Then:

(a) M_f^µ admits the following equivalent "dual" formulation:

    M_f^µ(x) = max_{y∈E*} { ⟨y, x⟩ − f*(y) − (µ/2)‖y‖² };

(b) M_f^µ is (1/µ)-smooth over E;

(c) ∇M_f^µ(x) = (1/µ)(x − prox_{µf}(x)).

To prove the key property (b), we recall a fundamental result on the relation between conjugacy, strong convexity, and smoothness.
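The gradient formula (c) can be sanity-checked by finite differences on the envelope of the Euclidean norm (the Huber function). A minimal sketch (our code):

```python
import numpy as np

def prox_mu_norm(x, mu):
    """prox_{mu f}(x) for f = ||.||_2."""
    return (1.0 - mu / max(np.linalg.norm(x), mu)) * x

def moreau_norm(x, mu):
    """Moreau envelope of the Euclidean norm, evaluated via the prox."""
    p = prox_mu_norm(x, mu)
    return np.linalg.norm(p) + np.linalg.norm(x - p) ** 2 / (2.0 * mu)

def grad_moreau_norm(x, mu):
    """Part (c) of the theorem: grad M^mu_f(x) = (x - prox_{mu f}(x)) / mu."""
    return (x - prox_mu_norm(x, mu)) / mu
```

Central differences on moreau_norm reproduce grad_moreau_norm to numerical precision, consistent with the (1/µ)-smoothness claim.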

Conjugate of Strongly Convex and Smoothness

Theorem (Conjugate and Smoothness). Let σ > 0. If f : E → (−∞, ∞] is a proper closed σ-strongly convex function, then f* : E → R is (1/σ)-smooth.

Proof. Let f be proper closed and σ-strongly convex.

▸ By Fenchel: y ∈ ∂f(x) ⇔ x ∈ ∂f*(y). Compactly, we can write ∂f*(y) = argmax_{x∈E} {⟨x, y⟩ − f(x)} for any y, and by strong convexity the maximizer x is unique. Thus, ∂f*(y) is a singleton for every y, and hence f* is differentiable over E*.
▸ Take y₁, y₂ ∈ E* and denote v₁ = ∇f*(y₁), v₂ = ∇f*(y₂).
▸ Fenchel again: y₁ ∈ ∂f(v₁), y₂ ∈ ∂f(v₂). Then, by strong monotonicity of ∂f,

    ⟨y₁ − y₂, v₁ − v₂⟩ ≥ σ‖v₁ − v₂‖²,
    i.e., ⟨y₁ − y₂, ∇f*(y₁) − ∇f*(y₂)⟩ ≥ σ‖∇f*(y₁) − ∇f*(y₂)‖².

▸ By the Cauchy–Schwarz inequality, ‖∇f*(y₁) − ∇f*(y₂)‖ ≤ (1/σ)‖y₁ − y₂‖*.

Reverse: If f : E → R is a (1/σ)-smooth convex function, then f* is σ-strongly convex.


Proof of the (1/µ)-Smoothness of the Moreau Envelope

To prove (a), we simply use standard duality to get the dual representation of M_f^µ:

    M_f^µ(x) = max_{y∈E*} { ⟨y, x⟩ − f*(y) − (µ/2)‖y‖² } = (f* + (µ/2)‖·‖²)*(x).

With this dual representation, we can then prove (b) by applying the previous theorem:

▸ f* + (µ/2)‖·‖² is µ-strongly convex.
▸ Hence, by the previous theorem, its conjugate, which is precisely M_f^µ, is (1/µ)-smooth, proving (b).

The proof of (c) follows from calculus; see refs.


Examples - Smoothing via Moreau Envelope

▸ The squared distance. Let C ⊆ E be a nonempty closed convex set. Recall that (1/2)d²_C = M_{δ_C}^1. Then (1/2)d²_C is 1-smooth and

    ∇((1/2)d²_C)(x) = x − prox_{δ_C}(x) = x − P_C(x).

▸ Norms. H_µ = M_f^µ, where f(x) = ‖x‖. Then H_µ is (1/µ)-smooth and

    ∇H_µ(x) = (1/µ)(x − prox_{µf}(x)) = (1/µ)(x − (1 − µ/max{‖x‖, µ})x)
            = { (1/µ)x,   ‖x‖ ≤ µ,
                x/‖x‖,    ‖x‖ > µ. }

Lipschitz Convex Functions are Smoothable

Theorem. Let h : E → R be convex and Lipschitz with constant ℓ_h, i.e.,

    |h(x) − h(y)| ≤ ℓ_h‖x − y‖  for all x, y ∈ E.

Then h is (1, ℓ²_h/2)-smoothable.

Norms are typical in applications: ℓ_h = √n for the l1-norm and ℓ_h = 1 for the l2-norm.

Proof. Fix any µ > 0. Then h_µ := M_h^µ is (1/µ)-smooth, and hence α = 1. It remains to get our "sandwich" M_h^µ ≤ h ≤ M_h^µ + βµ with β = ℓ²_h/2.

▸ For any x ∈ E,

    M_h^µ(x) = min_{u∈E} { h(u) + (1/2µ)‖u − x‖² } ≤ h(x) + (1/2µ)‖x − x‖² = h(x).

▸ Let g_x ∈ ∂h(x). Then, since h is Lipschitz convex, we have ‖g_x‖ ≤ ℓ_h, and

    M_h^µ(x) − h(x) = min_{u∈E} { h(u) − h(x) + (1/2µ)‖u − x‖² }
                    ≥ min_{u∈E} { ⟨g_x, u − x⟩ + (1/2µ)‖u − x‖² }
                    = −(µ/2)‖g_x‖² ≥ −(ℓ²_h/2)µ.

Examples of Smoothable Lipschitz Convex Functions

▶ (Smooth approximation of the l2-norm) Let h(x) = ‖x‖₂ (over Rⁿ). Then h is convex and Lipschitz with constant ℓ_h = 1. Therefore,

M_µh(x) = H_µ(x) = { (1/2µ)‖x‖₂², if ‖x‖₂ ≤ µ;  ‖x‖₂ − µ/2, if ‖x‖₂ > µ }

is a (1/µ)-smooth approximation of h with parameters (1, 1/2).

▶ (Smooth approximation of the l1-norm) Let h(x) = ‖x‖₁ (over Rⁿ). Then h is convex and Lipschitz with constant ℓ_h = √n. Hence,

M_µh(x) = Σ_{i=1}^n H_µ(x_i)

is a (1/µ)-smooth approximation of h with parameters (1, n/2).

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 18 / 27
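As a numerical sanity check on the (1, n/2)-smoothability claim for the l1-norm, the following sketch (pure Python, helper names ours) builds M_µh as a sum of scalar Huber terms and verifies the sandwich M_µh ≤ h ≤ M_µh + (n/2)µ on random points:

```python
import random

def huber(t, mu):
    # Moreau envelope of |.|: quadratic inside the band, linear outside
    return t * t / (2 * mu) if abs(t) <= mu else abs(t) - mu / 2

def l1_envelope(x, mu):
    # Moreau envelope of ||.||_1 separates coordinate-wise
    return sum(huber(xi, mu) for xi in x)

random.seed(0)
n, mu = 5, 0.3
for _ in range(100):
    x = [random.uniform(-2, 2) for _ in range(n)]
    h = sum(abs(xi) for xi in x)
    m = l1_envelope(x, mu)
    # sandwich with alpha = 1, beta = n/2:  M_mu h <= h <= M_mu h + (n/2) mu
    assert m <= h + 1e-12 and h <= m + (n / 2) * mu + 1e-12
print("sandwich holds")
```

The per-coordinate gap |t| − H_µ(t) is at most µ/2, which is where the β = n/2 parameter comes from after summing over n coordinates.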

Back to Algorithms - Model and Assumptions

Main model:

(P) min_{x∈E} { H(x) ≡ f(x) + h(x) + g(x) }.

(A) f : E → R is L_f-smooth.

(B) h : E → R is (α, β)-smoothable (α, β > 0). For any µ > 0, h_µ denotes a (1/µ)-smooth approximation of h with parameters (α, β).

(C) g : E → (−∞, ∞] is proper, closed and convex.

(D) H has bounded level sets.

(E) The optimal set of (P) is nonempty and denoted by X*. The optimal value of the problem is denoted by H_opt.

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 19 / 27

The S-FISTA Method - Partial Smoothing

▶ The idea is to consider the following smoothed version of (P):

(P_µ) min_{x∈E} { H_µ(x) ≡ f(x) + h_µ(x) + g(x) },  with F_µ(x) := f(x) + h_µ(x),

for some µ > 0, and solve it using FISTA with constant stepsize.

▶ A Lipschitz constant of ∇F_µ is L_f + α/µ; the stepsize is taken as 1/(L_f + α/µ).

S-FISTA
Input: x⁰ ∈ dom(g), µ > 0.
Initialization: set y⁰ = x⁰, t₀ = 1; construct h_µ - a (1/µ)-smooth approximation of h with parameters (α, β); set F_µ = f + h_µ, L = L_f + α/µ.
General step: for any k = 0, 1, 2, ... execute the following steps:

(a) x^{k+1} = prox_{(1/L)g}( y^k − (1/L)∇F_µ(y^k) );

(b) t_{k+1} = ( 1 + √(1 + 4t²_k) )/2;

(c) y^{k+1} = x^{k+1} + ( (t_k − 1)/t_{k+1} )( x^{k+1} − x^k ).

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 20 / 27

O(1/ε) Complexity of S-FISTA

Theorem. Let ε > 0 and let {x^k}_{k≥0} be the sequence generated by S-FISTA with smoothing parameter

µ = √(α/β) · ε / ( √(αβ) + √(αβ + L_f ε) ).

Then for any k satisfying

k ≥ √(αβΓ) (1/ε) + √(L_f Γ) (1/√ε),

it holds that H(x^k) − H_opt ≤ ε.

▶ Here Γ > 0 is a constant (e.g., as in FISTA, which depends on x⁰, x*, R).

▶ Γ is not needed to run the algorithm...

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 21 / 27

Proof of Theorem - For you....

▶ Let µ > 0. Set F = f + h_µ. Then L_F = L_f + α/µ. Applying S-FISTA:

(∗) H_µ(x^k) − H*_µ ≤ ( L_f + α/µ ) Γ / k²

▶ h_µ smoothable ⇒ ∃β s.t.: H_µ(x) ≤ H(x) ≤ H_µ(x) + βµ

▶ In particular, we thus get

H* ≥ H*_µ,  H(x^k) ≤ H_µ(x^k) + βµ

▶ Hence,

H(x^k) − H* ≤ H_µ(x^k) − H*_µ + βµ ≤ L_f Γ/k² + ( αΓ/k² )(1/µ) + βµ.

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 22 / 27

Proof Contd.

▶ H(x^k) − H* ≤ L_f Γ/k² + ( αΓ/k² )(1/µ) + βµ.

▶ Minimizing the RHS with respect to µ:

(∗) µ = √(αΓ/β) · (1/k)

▶ H(x^k) − H* ≤ L_f Γ/k² + 2√(αβΓ) (1/k).

▶ Thus to get H(x^k) − H* ≤ ε we require

(L_f Γ)(1/k²) + 2√(αβΓ)(1/k) ≤ ε ⇒ k ≥ √(αβΓ) (1/ε) + √(L_f Γ) (1/√ε)

Plugging the lower bound on k into µ given in (∗) yields the result.

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 23 / 27

Example

Consider the problem

(P2) min_{x∈Rⁿ} { (1/2)‖Ax − b‖₂² + ‖Dx‖₁ + λ‖x‖₁ },

where A ∈ R^{m×n}, b ∈ R^m, D ∈ R^{p×n} and λ > 0.

▶ Fits model (P) with f(x) = (1/2)‖Ax − b‖₂², h(x) = ‖Dx‖₁, g(x) = λ‖x‖₁.
▶ All assumptions are met:
  ▶ f is convex and L_f-smooth with L_f = ‖A‖₂².
  ▶ g is closed and convex with an easy prox.
  ▶ h is (‖D‖₂², p/2)-smoothable. Indeed,
    ▶ Set h(x) := q(Dx), where q : R^p → R is given by q(y) = ‖y‖₁.
    ▶ For any µ > 0, q_µ(y) = M_µq(y) = Σ_{i=1}^p H_µ(y_i) is a (1/µ)-smooth approximation of q with parameters (1, p/2), hence the claim with h_µ(x) ≡ q_µ(Dx).
▶ To obtain an ε-optimal solution of problem (P2), we can then use the S-FISTA method on

min_{x∈Rⁿ} { F_µ(x) + λ‖x‖₁ },  where F_µ(x) = f(x) + h_µ(x)

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 24 / 27

Example Contd.

▶ For the smooth part F_µ(x) = f(x) + M_µq(Dx), we have

∇F_µ(x) = ∇f(x) + Dᵀ∇M_µq(Dx)
        = ∇f(x) + (1/µ) Dᵀ( Dx − prox_{µq}(Dx) )
        = ∇f(x) + (1/µ) Dᵀ( Dx − T_µ(Dx) ),

and then we get L_µ = ‖A‖₂² + ‖D‖₂²/µ.

▶ We then set µ = √(α/β) · ε / ( √(αβ) + √(αβ + L_f ε) ), plugging in α = ‖D‖₂², β = p/2, L_f = ‖A‖₂² from above.

▶ We know that the nonsmooth part g(x) = λ‖x‖₁ admits an easy prox (the soft-thresholding operation), and hence S-FISTA can be easily applied.

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 25 / 27

S-FISTA for the Example

S-FISTA for solving (P2).
Initialization: Set y⁰ = x⁰ ∈ Rⁿ, t₀ = 1; µ and L = L_µ as above.
General step: for any k = 0, 1, 2, ... execute the following steps:

(a) x^{k+1} = T_{λ/L}( y^k − (1/L)( ∇f(y^k) + (1/µ) Dᵀ( Dy^k − T_µ(Dy^k) ) ) );

(b) t_{k+1} = ( 1 + √(1 + 4t²_k) )/2;

(c) y^{k+1} = x^{k+1} + ( (t_k − 1)/t_{k+1} )( x^{k+1} − x^k ).

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 26 / 27
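The scheme above can be sketched in a few lines. This is our own illustrative implementation (function and variable names hypothetical), tested on the degenerate case A = D = I, where (P2) reduces to min (1/2)‖x − b‖² + (1 + λ)‖x‖₁, whose exact solution is the soft-threshold of b at level 1 + λ:

```python
import numpy as np

def soft(x, t):
    # soft-thresholding: prox of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def s_fista(A, b, D, lam, eps, iters):
    """S-FISTA sketch for (P2): min 0.5||Ax-b||^2 + ||Dx||_1 + lam||x||_1.
    Smoothing and step parameters follow the slides."""
    n, p = A.shape[1], D.shape[0]
    Lf = np.linalg.norm(A, 2) ** 2                   # f is Lf-smooth
    alpha, beta = np.linalg.norm(D, 2) ** 2, p / 2.0
    ab = np.sqrt(alpha * beta)
    mu = np.sqrt(alpha / beta) * eps / (ab + np.sqrt(alpha * beta + Lf * eps))
    L = Lf + alpha / mu                              # Lipschitz const of grad F_mu
    x = y = np.zeros(n)
    t = 1.0
    for _ in range(iters):
        Dy = D @ y
        # grad F_mu(y) = A^T(Ay - b) + (1/mu) D^T(Dy - T_mu(Dy))
        grad = A.T @ (A @ y - b) + D.T @ (Dy - soft(Dy, mu)) / mu
        x_new = soft(y - grad / L, lam / L)          # prox of (lam/L)||.||_1
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

b = np.array([3.0, -0.2, 1.0, -2.0])
x = s_fista(np.eye(4), b, np.eye(4), 0.5, eps=1e-3, iters=30000)
print(np.round(x, 2))   # close to soft(b, 1.5) = [1.5, 0, 0, -0.5]
```

Note the many iterations: with ε = 10⁻³ the smoothing parameter µ is tiny, so L is large and the O(1/ε) bound of the theorem bites, exactly as the complexity analysis predicts.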

References

Lecture based on [1].

1. A. Beck and M. Teboulle, Smoothing and first order methods: a unified framework. SIAM J. Optim. (2012).

2. M. Teboulle, A unified continuous optimization framework for center-based clustering methods. Journal of Machine Learning Research (2007).

3. Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program. (2005).

Marc Teboulle First Order Optimization Methods Lecture 5 - Smoothing Methods 27 / 27

First Order Optimization Methods

Lecture 6 - Lagrangian and Decomposition Methods

Marc Teboulle

School of Mathematical Sciences
Tel Aviv University

PGMO Lecture Series

January 25-26, 2017, Ecole Polytechnique, Paris

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 1 / 26

Lecture 6 - Lagrangians and Decomposition Methods

• The Model: Nonsmooth Convex

(P) p* = inf{ ϕ(x) ≡ f(x) + g(Ax) : x ∈ Rⁿ },

▶ Now f, g are convex, extended-valued, and both nonsmooth.
▶ A : Rⁿ → R^m is a given linear map.

Problem (P) is equivalent to (via the standard splitting-variables trick):

(P) p* = inf{ f(x) + g(z) : Ax = z, x ∈ Rⁿ, z ∈ R^m }

Rockafellar ('76) has shown that the Proximal Algorithm can be applied to the dual and primal-dual formulations of (P) to produce:

▶ The Multipliers Method (Augmented Lagrangian Method).
▶ The Proximal Method of Multipliers (PMM).
▶ Largely ignored over the last 20 years..... Recently a very strong revival in image science, machine learning, etc., with many algorithms all rooted in - and variants of - the PMM.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 2 / 26

The PMM - Proximal Method of Multipliers

PMM: Generate (x^k, z^k) and dual multiplier y^k via

(x^{k+1}, z^{k+1}) ∈ argmin_{x,z} { f(x) + g(z) + ⟨y^k, Ax − z⟩ + (c/2)‖Ax − z‖² + q_k(x, z) },
y^{k+1} = y^k + c( Ax^{k+1} − z^{k+1} ),  (c > 0).

▶ The Augmented Lagrangian = Penalized Lagrangian

L_c(x, z, y) := f(x) + g(z) + ⟨y, Ax − z⟩ [the Lagrangian] + (c/2)‖Ax − z‖²,  (c > 0).

▶ q_k(x, z) := (1/2)( ‖x − x^k‖²_{M₁} + ‖z − z^k‖²_{M₂} ) is the additional primal proximal term.

▶ The choice of M₁ ∈ Sⁿ₊, M₂ ∈ S^m₊ is used to conveniently describe/analyze several variants of the PMM.

▶ M₁ = M₂ ≡ 0 recovers the Multipliers Method (PMA on the dual).

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 3 / 26

Proximal Method of Multipliers - Key Difficulty

▶ Main computational step in PMM: minimize w.r.t. (x, z) the proximal Augmented Lagrangian:

f(x) + g(z) + ⟨y^k, Ax − z⟩ + (c/2)‖Ax − z‖² + q_k(x, z).

▶ The quadratic coupling term ‖Ax − z‖² destroys the separability between x and z, preventing separate minimization in (x, z).

▶ In many applications, separate minimization is often crucial!

Many strategies are available to overcome this difficulty:

▶ Approximate Minimization - linearize the quadratic term ‖Ax − z‖² w.r.t. (x, z).

▶ Alternating Minimization - a la "Gauss-Seidel" in (x, z).

▶ Mixture of the above - partial linearization with respect to one variable, combined with alternating minimization over the other variable.

▶ These strategies produce various useful variants of the PMM.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 4 / 26

A Prototype: Alternating Direction PMM (AD-PMM)

Eliminate the coupling in (x, z) via alternating minimization steps.

(AD-PMM) Alternating Direction Proximal Method of Multipliers

1. Start with any (x⁰, z⁰, y⁰) ∈ Rⁿ × R^m × R^m and c > 0.

2. For k = 0, 1, ... generate the sequence {x^k, z^k, y^k} as follows:

x^{k+1} ∈ argmin { f(x) + (c/2)‖Ax − z^k + c⁻¹y^k‖² + (1/2)‖x − x^k‖²_{M₁} },
z^{k+1} = argmin { g(z) + (c/2)‖Ax^{k+1} − z + c⁻¹y^k‖² + (1/2)‖z − z^k‖²_{M₂} },
y^{k+1} = y^k + c( Ax^{k+1} − z^{k+1} ).

▶ Useful when the (x, z) steps are "easy" to implement exactly or inexactly (e.g., via the strategies just mentioned); for instance, the x-step is often not easy!

▶ Adequate choices of M₁, M₂ ⪰ 0 allow deriving various algorithms, next slide.

▶ Allows for convergence/complexity via a unifying analysis for many - old and new - algorithms.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 5 / 26
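With M₁ = M₂ = 0, AD-PMM is the classical ADM, and the steps above become very concrete. The sketch below (our own toy instance, names hypothetical) runs it on the splitting min (1/2)‖x − b‖² + λ‖z‖₁ s.t. x = z, i.e. A = I, whose solution is the soft-threshold of b:

```python
import numpy as np

def soft(x, t):
    # soft-thresholding: prox of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def adm_lasso_identity(b, lam, c=1.0, iters=500):
    """Classical ADM (AD-PMM with M1 = M2 = 0) on
    min 0.5||x-b||^2 + lam||z||_1  s.t.  x = z  (A = I)."""
    x = z = y = np.zeros(b.size)
    for _ in range(iters):
        # x-step: quadratic in x, closed form of the argmin
        x = (b + c * z - y) / (1.0 + c)
        # z-step: prox of (lam/c)||.||_1
        z = soft(x + y / c, lam / c)
        # multiplier update
        y = y + c * (x - z)
    return x, z

b = np.array([2.0, -0.3, 1.4])
x, z = adm_lasso_identity(b, lam=1.0)
print(np.round(z, 3))  # x and z both approach soft(b, 1) = [1, 0, 0.4]
```

Both sub-steps are trivial here because A = I; with a general A the x-step requires solving a linear system coupled through AᵀA, which is exactly the difficulty the linearized variants on the next slide address.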

Some Typical Schemes from AD-PMM

(1) Classical ADM (Alternating Direction Method of Multipliers): M₁ = M₂ = 0
▶ Alternates minimization of the standard Augmented Lagrangian L_c.

(2) Partially Regularized ADMM: M₁ ≻ 0, M₂ ⪰ 0
▶ For example, one can use M₁ := τ⁻¹Iₙ and M₂ = 0.
▶ Allows proving the convergence of the sequence {x^k} without any assumption on the matrix A.

(3) Mixed Strategy: Linearization and Alternating Minimization
▶ Linearization w.r.t. x, combined with AM in z. This is achieved by choosing:
▶ M₁ := τ⁻¹Iₙ − cAᵀA ⪰ 0 ⇔ cτ‖A‖² ≤ 1;  M₂ := 0.
▶ Produces the recent primal-dual algorithm [Chambolle-Pock (2010)].

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 6 / 26

Some Definitions/Recalls: Saddle Point and Dual

The schemes are primal-dual. The Lagrangian associated to (P) is

ℓ(x, z, y) = f(x) + g(z) + ⟨y, Ax − z⟩,  y ∈ R^m the multiplier for Ax = z.

Assumption on (P): The Lagrangian ℓ associated to (P) has a saddle point, i.e., there exists (x*, z*, y*) such that:

ℓ(x*, z*, y) ≤ ℓ(x*, z*, y*) ≤ ℓ(x, z, y*),  ∀ (x, z, y) ∈ Rⁿ × R^m × R^m.

It is well known that (x*, z*, y*) is a saddle point of the Lagrangian ℓ if and only if (x*, z*) is a solution of (P), y* is a solution of (D), and

p* = ℓ(x*, z*, y*) = inf_{x,z} [ sup_y ℓ(x, z, y) ] = sup_y [ inf_{x,z} ℓ(x, z, y) ] = d*.

The number p* = ℓ(x*, z*, y*) ∈ R is the saddle value of ℓ.

Recall: Since (P) is convex, under the standard constraint qualification, e.g.,

there exists x ∈ ri(dom f), z ∈ ri(dom g), such that Ax = z,

strong duality holds, and the dual optimal set S_D is nonempty, convex and compact.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 7 / 26

Approximate Saddle Point

Definition (ε-saddle point for a general convex-concave K)
Let ε > 0. A point (u_ε, v_ε) is called an ε-saddle point of a convex-concave function K if

sup { K(u_ε, v) − K(u, v_ε) : u ∈ S_P, v ∈ S_D } ≤ ε,

where, associated to the saddle function K, we define S_P (S_D) as the optimal solution set of the primal (dual) problem:

(P) inf_{u∈Rⁿ} [ r(u) = sup_{v∈R^d} K(u, v) ]  and  (D) sup_{v∈R^d} [ q(v) = inf_{u∈Rⁿ} K(u, v) ].

Our case: K(u, v) ≡ ℓ(x, z, y) = f(x) + g(z) + ⟨y, Ax − z⟩, (u := (x, z), v := y).
This measures the closeness of a candidate solution to the optimal set in terms of the duality gap, and our main task is to find an upper bound for

K(u_ε, v) − K(u, v_ε).

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 8 / 26

Ergodic Complexity for AD-PMM is O(1/N)

Recall: For any sequence {w^p} and any N ≥ 1, the ergodic sequence is

w̄^N := (1/N) Σ_{p=1}^N w^p.

Typical Result [for AD-PMM and other variants, Shefi-T. (2014)]
Let {x^k, z^k, y^k} be the sequence generated by (AD-PMM). Then the ergodic sequence (x̄^N, z̄^N, ȳ^N) with N = O(1/ε) is an ε-saddle point of ℓ, and hence yields both an ε-optimal primal solution and an ε-optimal dual solution of (P)-(D), respectively.

The more precise statement follows next...

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 9 / 26

Global Rate of Convergence/Complexity for AD-PMM

Theorem [for AD-PMM and other variants, Shefi-T. (2014)]. Let {x^k, z^k, y^k} be the sequence generated by (AD-PMM). Then the ergodic sequence (x̄^N, z̄^N, ȳ^N) satisfies, for all N ≥ 1:

▶ ℓ(x̄^N, z̄^N, y) − p* ≤ C(x*, z*, y)/N,  ∀y ∈ R^m

▶ Under compactness of the optimal primal solution set S_P, we have

max{ ℓ(x̄^N, z̄^N, y) − ℓ(x, z, ȳ^N) : (x, z) ∈ S_P, y ∈ S_D } ≤ c*/N,

where

C(x, z, y) := (1/2){ ‖x − x⁰‖²_{M₁} + ‖z − z⁰‖²_{M₂} + c⁻¹‖y − y⁰‖² } + (c/2)‖Ax − z⁰‖²

and c* = max{ C(x, z, y) : (x, z) ∈ S_P, y ∈ S_D }.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 10 / 26

Issues with PMM

PMM-based schemes are not free of potential problems, raising practical and theoretical issues:

▶ The penalty parameter c is unknown: trial/error runs, fine tuning, heuristics, ....

▶ The iteration complexity bound, as just seen, depends on c!

▶ Difficult to extend to a sum of m > 2 convex composite functions with linear maps:

min_x { f(x) + Σ_{i=1}^m g_i(A_i x) }

Any alternatives?..

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 11 / 26

Alternative Approach

Current methods which admit an O(1/ε) efficiency estimate:

▶ PMM-based: just discussed, with its potential drawbacks.

▶ Smoothing methods: as discussed in the previous lecture.
  ▶ Assume availability of a smoothable function.
  ▶ Require a smoothing parameter, in terms of the accuracy, fixed in advance.

Our Goal: An algorithm for a broad class of structured nonsmooth problems that achieves the efficiency estimate O(1/ε) and:

▶ Removes difficulties with current methods.

▶ Is flexible enough to be applied to more general scenarios.

▶ Involves simple computational tasks.

▶ Approach: Saddle-Point Reformulation of a Structured Model.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 12 / 26

A Saddle Point Approach/Reformulation

Our problem is of the form

(P) p* = min{ F(u) + G(Au) : u ∈ Rⁿ },

where F, G are both nonsmooth, closed, proper and convex, and A : Rⁿ → R^m is a given linear map.

Use data info: Since G is closed and convex, G ≡ G**, namely,

G(z) = sup_v { ⟨z, v⟩ − G*(v) }.

Therefore, (P) admits the min-max convex-concave representation:

min_u max_v { K(u, v) := F(u) + ⟨Au, v⟩ − G*(v) }.

Proposed alternative approach: an algorithm via the SP reformulation for a broad class of problems that is very flexible and allows tackling many scenarios.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 13 / 26

A Class of Structured Convex-Concave Saddle-Point Models

(M) min_{u∈Rⁿ} max_{v∈R^d} { K(u, v) := f(u) + ⟨u, Av⟩ − g(v) },

Data Information

(i) f : Rⁿ → R is convex and C^{1,1}_{L_f}: ‖∇f(u₁) − ∇f(u₂)‖ ≤ L_f‖u₁ − u₂‖, ∀u₁, u₂.

(ii) g_i : R^{d_i} → (−∞, +∞], i = 1, 2, ..., m, is a proper, lower semicontinuous (lsc) and convex function (possibly nonsmooth), and we let g : R^d → (−∞, +∞],

g(v) := Σ_{i=1}^m g_i(v_i);  d := Σ_{i=1}^m d_i;  v := (v₁, v₂, ..., v_m) ∈ R^d.

(iii) A_i : R^{d_i} → Rⁿ, i = 1, 2, ..., m, is a linear map, and we let A : R^d → Rⁿ be the linear map defined by Av = Σ_{i=1}^m A_i v_i.

Assumption: K(·, ·) has a saddle point, i.e., there exists (u*, v*) ∈ Rⁿ × R^d s.t.

K(u*, v) ≤ K(u*, v*) ≤ K(u, v*),  ∀ u ∈ Rⁿ, v ∈ R^d.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 14 / 26

A Proximal Alternating Predictor Corrector (PAPC)

(M) min_{u∈Rⁿ} max_{v∈R^d} { K(u, v) := f(u) + ⟨u, Av⟩ − g(v) },

The algorithm blends duality, predictor-corrector steps, and proximal operations, described next.

Features of PAPC - fully exploits the structures and data info of a problem:

▶ PAPC avoids the computationally difficult task of the prox of a composition with a linear map, (g ∘ A)(x) = g(Ax). It only asks for the prox of g(·).

▶ Can be easily applied to minimization problems with a sum of such composite terms in the objective/constraints.

▶ Constraints on the variable v are built in, thanks to g being extended-valued.

▶ Constraints on the variable u and/or the presence of nonsmooth primal objectives will be easily handled via the Dual Transportation Trick.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 15 / 26

The PAPC Method

PAPC Initialization: (u⁰, v⁰) ∈ Rⁿ × R^d, τ > 0, and σ_i > 0, i = 1, 2, ..., m.
For k = 1, 2, ...,

p^k = u^{k−1} − τ( Av^{k−1} + ∇f(u^{k−1}) ),
♣ v_i^k = prox_{σ_i g_i}( v_i^{k−1} + σ_i A_iᵀ p^k ),  i = 1, 2, ..., m,
u^k = u^{k−1} − τ( Av^k + ∇f(u^{k−1}) ).

Output: ū^N = (1/N) Σ_{k=1}^N u^k,  v̄^N = (1/N) Σ_{k=1}^N v^k.

▶ ♣ The v-step "decomposes" according to the structure. Only the prox of each g_i(·) is needed, not that of g(A_i x)!

▶ The parameters (τ, σ_i) are defined in terms of the problem's data L_f, A_i.

▶ Each iteration requires one application of A and of Aᵀ and one evaluation of ∇f.

Notation: S_i := σ_i⁻¹ I_{d_i}, i = 1, 2, ..., m;  S := Diag[S₁, S₂, ..., S_m];  d = Σ_{i=1}^m d_i.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 16 / 26
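The three steps above are easy to run on a toy instance. The sketch below (our own construction, names hypothetical) takes m = 1, A = I and g the indicator of {‖v‖_∞ ≤ λ}, the conjugate of λ‖·‖₁, so the underlying primal problem is min (1/2)‖u − b‖² + λ‖u‖₁ with known solution soft(b, λ). The step sizes satisfy τL_f ≤ 1 and S − τAᵀA ≻ 0:

```python
import numpy as np

def papc_toy(b, lam, tau=1.0, sigma=0.9, iters=4000):
    """PAPC sketch on K(u,v) = 0.5||u-b||^2 + <u, v> - g(v), m = 1, A = I,
    g = indicator of {||v||_inf <= lam} (conjugate of lam||.||_1).
    Conditions: tau*Lf = 1 <= 1 and sigma*tau*||A||^2 = 0.9 < 1."""
    u = v = np.zeros(b.size)
    for _ in range(iters):
        grad = u - b                            # gradient of f at u^{k-1}
        p = u - tau * (v + grad)                # predictor step
        v = np.clip(v + sigma * p, -lam, lam)   # prox of g = projection
        u = u - tau * (v + grad)                # corrector step
    return u

b = np.array([3.0, -0.2, 1.0])
u = papc_toy(b, lam=1.0)
print(np.round(u, 3))  # approaches soft-threshold of b: [2, 0, 0]
```

The prox in the v-step is just a componentwise clip because g is an indicator, illustrating the feature stressed on the previous slide: only the prox of g itself is needed, never the prox of a composition with A.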

Tools for Analysis... Simple

We want to find an ε-saddle point of K. To prove an iteration complexity, our main task is to find an upper bound for

Γ_k(u, v) := K(u^k, v) − K(u, v^k)  for all u ∈ Rⁿ, v ∈ R^d.

The analysis relies on the Descent Lemma for f and a weighted prox inequality:

Lemma (Weighted proximal inequality)
Let h : R^p → (−∞, ∞] be a proper, lsc and convex function. Given M ≻ 0 and x ∈ R^p, let

z = argmin_{ξ∈R^p} { h(ξ) + (1/2)‖ξ − x‖²_M }.

Then, for all ξ ∈ R^p, h(z) − h(ξ) ≤ ⟨ξ − z, M(z − x)⟩.

We will bound Γ_k(u, v) by bounding the following two terms and adding them:

K(u^k, v^k) − K(u, v^k)  and  K(u^k, v) − K(u^k, v^k).

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 17 / 26
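The weighted proximal inequality is just the subgradient optimality condition −M(z − x) ∈ ∂h(z) rearranged. A quick numerical check (our own scalar instance: h = |·|, M = m > 0, so z is a soft-threshold of x at level 1/m):

```python
import random

def soft(t, a):
    # prox of a|.|: soft-threshold at level a
    return (abs(t) - a) * (1 if t > 0 else -1) if abs(t) > a else 0.0

random.seed(1)
for _ in range(200):
    m = random.uniform(0.1, 5.0)
    x = random.uniform(-3, 3)
    # z = argmin |xi| + (m/2)(xi - x)^2 = soft(x, 1/m)
    z = soft(x, 1.0 / m)
    for _ in range(20):
        xi = random.uniform(-4, 4)
        # the lemma: h(z) - h(xi) <= <xi - z, M (z - x)>
        assert abs(z) - abs(xi) <= (xi - z) * m * (z - x) + 1e-9
print("weighted prox inequality verified")
```

This is the one-line tool behind Lemma 2 on the next slide, applied there with M = S and h built from g and the linear term.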

Key Estimates for PAPC

Let {(p^k, u^k, v^k)}_{k∈N} be a sequence generated by the PAPC algorithm.

Lemma 1. For every u ∈ Rⁿ and any k ∈ N, we have

K(u^k, v^k) − K(u, v^k) ≤ (1/2τ)( ‖u − u^{k−1}‖² − ‖u − u^k‖² ) − (1/2)( 1/τ − L_f )‖u^k − u^{k−1}‖².

Proof. Use the Descent Lemma for f plus some algebra...

Lemma 2. Assume that G := S − τAᵀA ≻ 0. For every v ∈ R^d and any k ∈ N, we have

K(u^k, v) − K(u^k, v^k) ≤ (1/2)( ‖v − v^{k−1}‖²_G − ‖v − v^k‖²_G − ‖v^k − v^{k−1}‖²_G ).

Proof. Use the weighted proximal inequality for v^k and the relations between p^k, u^k given in PAPC.

Thus, if G := S − τAᵀA ≻ 0 and τL_f ≤ 1, then for every u ∈ Rⁿ and v ∈ R^d:

Γ_k(u, v) = K(u^k, v) − K(u, v^k) ≤ (1/2τ)( ‖u − u^{k−1}‖² − ‖u − u^k‖² ) + (1/2)( ‖v − v^{k−1}‖²_G − ‖v − v^k‖²_G ).

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 18 / 26

The PAPC Method - Convergence Results

Notation: Let σ := max_{1≤i≤m} σ_i.

Theorem. Let {(p^k, u^k, v^k)}_{k∈N} be a sequence generated by the PAPC algorithm with τL_f ≤ 1 and στ Σ_{i=1}^m ‖A_i‖² ≤ 1.

(a) Global Rate of Convergence - Ergodic

K(ū^N, v) − K(u, v̄^N) ≤ ( τ⁻¹‖u − u⁰‖² + ‖v − v⁰‖²_G ) / (2N).

The bound constant is in terms of the data (L_f, A_i) - no extra parameters.

(b) Sequential Convergence: The sequence {(u^k, v^k)}_{k∈N} converges to a saddle point (u*, v*) of K.

Proof Sketch. (a) Sum the obtained bound on Γ_k, and use Jensen's inequality. (b) Use (a) and standard analysis tools.

Marc Teboulle First Order Optimization Methods Lecture 6 - Lagrangian and Decomposition Methods 19 / 26
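To make the scheme concrete, here is a minimal numerical sketch of the PAPC iteration as I read it from reference [2] (predictor p^k, dual proximal step v^k, corrector u^k; the exact update order is an assumption to be checked against the paper), applied to a toy saddle point min_u max_{‖y‖∞≤λ} { ½‖u − c‖² + ⟨u, y⟩ }, i.e. min_u ½‖u − c‖² + λ‖u‖₁, whose solution is soft-thresholding of c. The chosen step sizes respect the theorem's conditions τLf ≤ 1 and στ‖A‖² ≤ 1 (here A = I, Lf = 1).

```python
import numpy as np

def papc(grad_f, prox_g_sigma, A, u0, v0, tau, sigma, iters=2000):
    """Sketch of the PAPC predictor-corrector iteration for
    min_u max_v f(u) + <Au, v> - g(v); prox_g_sigma is prox_{sigma*g}."""
    u, v = u0.copy(), v0.copy()
    for _ in range(iters):
        p = u - tau * (grad_f(u) + A.T @ v)      # predictor step p^k
        v = prox_g_sigma(v + sigma * (A @ p))    # dual proximal step v^k
        u = u - tau * (grad_f(u) + A.T @ v)      # corrector step u^k
    return u, v

# Toy problem: min_u 0.5*||u - c||^2 + lam*||u||_1, written as a saddle
# point via the conjugate of lam*||.||_1 (indicator of the l_inf ball).
c, lam = np.array([2.0, -0.5, 0.1]), 1.0
A = np.eye(3)
u, v = papc(grad_f=lambda u: u - c,
            prox_g_sigma=lambda y: np.clip(y, -lam, lam),  # projection onto [-lam, lam]^n
            A=A, u0=np.zeros(3), v0=np.zeros(3),
            tau=0.9, sigma=1.0)  # tau*Lf = 0.9 <= 1, sigma*tau*||A||^2 = 0.9 <= 1
print(u)  # close to soft-thresholding of c: [1.0, 0.0, 0.0]
```

Only a gradient of f, one multiplication by A and Aᵀ, and one cheap prox are used per iteration, which is the point of the method.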

The Dual Transportation Trick
For Handling Constraints and More...

Main tool: any proper closed convex function h satisfies h** = h, i.e.,

  h(u) ≡ max_v { ⟨u, v⟩ − h*(v) }.

I Constraints in u: Let U be nonempty closed convex. Then:

  min_{u∈U⊂R^p} F(u) = min_{u∈R^p} { F(u) + δ_U(u) } = min_{u∈R^p} max_{v∈R^p} { F(u) + ⟨u, v⟩ − δ*_U(v) }.

I If F is also nonsmooth, then we can "further transport":

  min_{u∈U⊂R^p} F(u) = min_{u∈R^p} max_{v1,v2∈R^p} { ⟨u, v1⟩ − F*(v1) + ⟨u, v2⟩ − δ*_U(v2) }.

I Both formulations fit exactly model (M) with m = 1 and m = 2, respectively.
I Separability in v is preserved.
I Knowledge of F*, δ*_U is not needed to apply PAPC!.. Thanks to Moreau's identity, prox_F + prox_{F*} = I.
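The last bullet can be checked numerically. A small sketch (function names are mine; unit prox step assumed): Moreau's identity prox_F(x) + prox_{F*}(x) = x lets us evaluate the prox of a conjugate without ever forming F*. For F = ‖·‖₁, prox_F is soft-thresholding and F* is the indicator of the ℓ∞ unit ball, whose prox is a componentwise clip; the two computations agree.

```python
import numpy as np

def soft_threshold(x, t=1.0):
    """prox of t*||.||_1: shrink each entry toward 0 by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_conjugate(x, prox_f):
    """Moreau's identity (unit step): prox_{f*}(x) = x - prox_f(x)."""
    return x - prox_f(x)

x = np.array([2.0, 0.5, -3.0, -0.2])
via_moreau = prox_conjugate(x, soft_threshold)
direct = np.clip(x, -1.0, 1.0)   # prox of the l_inf-ball indicator = projection
print(via_moreau, direct)        # identical: [1.0, 0.5, -1.0, -0.2]
```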



PAPC Applies to Many Important Models

Thanks to the DTT, PAPC can be easily applied to many important models.

♣ Convex Problems with Sums of Composite Convex Functions with Linear Maps

1. min_{u∈R^p} { F(u) + Σ_{i=1}^m H_i(B_i u) }.

2. min_{x_i} { Σ_{i=1}^m ψ(x_i) : Σ_{i=1}^m M_i x_i = b }.

3. min_{u∈R^p} { F(u) : Σ_{i=1}^m H_i(B_i u) ≤ α }.

For all these models, PAPC

I Decomposes nicely according to the given structure.
I Removes the difficult task of "computing the prox of a convex function composed with an affine map".
I Parameters are determined from the problem's data.

Application Example: Fused Lasso Regression

Suppose that the training set consists of N training examples {(a_i, b_i)}_{i=1}^N, where a_i ∈ R^n are the feature vectors and b_i ∈ {−1, +1} are the respective labels. The logistic model assumes

  P(b|a) = 1 / (1 + exp(−b(wᵀa + v))),

where v ∈ R and w ∈ R^n are the parameters of the model. The likelihood function of the training set is then given by

  L(v, w; {(a_i, b_i)}_{i=1}^N) = Π_{i=1}^N 1 / (1 + exp(−b_i(wᵀa_i + v))).

I In "lasso" regression we assume that the vector w has a bounded ℓ1 norm (sparse).
I In "fused" regression we further ask that (Dw)_i := w_{i+1} − w_i have a bounded ℓ1 norm, i.e., a "staircase-like" pattern, e.g., relevant when the model parameters come in some natural order.
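As a sketch (variable names are mine), the average negative log-likelihood of this model and its gradient, the smooth part that will be handed to a first-order method, can be coded and checked against finite differences:

```python
import numpy as np

def lavg(v, w, A, b):
    """Average negative log-likelihood of the logistic model.
    A: (N, n) feature matrix, b: labels in {-1, +1}."""
    z = b * (A @ w + v)
    return np.mean(np.logaddexp(0.0, -z))   # log(1 + exp(-z)), computed stably

def lavg_grad(v, w, A, b):
    """Gradients of lavg with respect to v and w."""
    z = b * (A @ w + v)
    s = -b / (1.0 + np.exp(z))              # per-sample d/dt log(1+exp(-b*t))
    return np.mean(s), A.T @ s / len(b)

# Finite-difference check of the first w-coordinate of the gradient.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = np.where(rng.random(50) < 0.5, -1.0, 1.0)
v, w = 0.3, rng.standard_normal(5)
gv, gw = lavg_grad(v, w, A, b)
eps = 1e-6
fd = (lavg(v, w + eps * np.eye(5)[0], A, b) - lavg(v, w, A, b)) / eps
print(abs(fd - gw[0]))   # tiny: analytic and numeric gradients agree
```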

Fused Lasso Regression - A Saddle-Point Formulation

The problem is to find the MLE - a convex nonsmooth constrained problem:

  (FLR)  min_{v∈R, w∈R^n} { lavg(v, w) : ‖w‖₁ ≤ s₁, ‖Dw‖₁ ≤ s₂ },

  lavg(v, w) := −(1/N) log L(v, w; {(a_i, b_i)}_{i=1}^N) = (1/N) Σ_{i=1}^N log(1 + exp(−b_i(wᵀa_i + v))).

Using the "Dual transportation trick" we get the SP formulation

  (FLR′)  min_{v∈R, w∈R^n} max_{y∈R^n, z∈R^{n−1}} { lavg(v, w) + ⟨w, y⟩ − s₁‖y‖∞ + ⟨Dw, z⟩ − s₂‖z‖∞ }.

I All the required prox operations in PAPC are easy to compute: projections onto the ℓ1-ball, which can be done in O(d).
I PAPC requires only one evaluation of the computationally expensive gradient ∇lavg per iteration.
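The first bullet relies on Euclidean projection onto an ℓ1-ball. A self-contained sketch of the classical sort-and-threshold scheme, which is O(n log n) (the linear-time bound on the slide refers to more involved pivot-based variants):

```python
import numpy as np

def project_l1_ball(x, radius=1.0):
    """Euclidean projection of x onto {y : ||y||_1 <= radius}."""
    if np.abs(x).sum() <= radius:
        return x.copy()                      # already inside the ball
    u = np.sort(np.abs(x))[::-1]             # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(x) + 1)
    rho = np.max(np.nonzero(u - (css - radius) / k > 0)[0])  # last valid index
    theta = (css[rho] - radius) / (rho + 1)  # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

p1 = project_l1_ball(np.array([2.0, 0.0]), 1.0)   # -> [1.0, 0.0]
p2 = project_l1_ball(np.array([1.0, 1.0]), 1.0)   # -> [0.5, 0.5]
print(p1, p2)
```

The projection onto a ball of radius s is just this routine with `radius=s`, which is exactly what the y- and z-updates of PAPC on (FLR′) reduce to after Moreau's identity.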

Fused Lasso Regression - Numerical Illustration

I 1000 samples from the logistic model corresponding to v = 0 and the vector w ∈ R^100 defined by

  w_i = 0.1 for i ∈ {11, ..., 14};  −0.2 for i ∈ {41, ..., 44};  0.3 for i ∈ {91, ..., 94};  0 otherwise.

I The a_i ∈ R^100 (i = 1, 2, ..., 1000) were generated with each component drawn independently from the standard normal distribution.

I The labels b_i ∈ {−1, +1} were then randomly chosen according to the logistic model

  P(b|a) = 1 / (1 + exp(−b(wᵀa + v))).

I L_lavg := Σ_{i=1}^N ( ‖a_i‖² + 1 ) / (4N).
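A sketch reproducing this experimental setup (the random seed and variable names are mine; indices on the slide are 1-based):

```python
import numpy as np

rng = np.random.default_rng(42)
N, n = 1000, 100

# Piecewise-constant ground truth w as specified on the slide.
w = np.zeros(n)
w[10:14] = 0.1     # i in {11, ..., 14}
w[40:44] = -0.2    # i in {41, ..., 44}
w[90:94] = 0.3     # i in {91, ..., 94}
v = 0.0

# Features: i.i.d. standard normal; labels: drawn from the logistic model.
A = rng.standard_normal((N, n))
p = 1.0 / (1.0 + np.exp(-(A @ w + v)))       # P(b = +1 | a)
b = np.where(rng.random(N) < p, 1.0, -1.0)

# Lipschitz constant of the gradient of lavg, as on the slide.
L_lavg = np.sum(np.sum(A**2, axis=1) + 1.0) / (4.0 * N)
print(L_lavg)   # roughly (n + 1) / 4 = 25.25 for standard normal features
```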

Fused Lasso Regression - Numerical Results

[Figure: two panels plotting the 100 model parameters. Left panel: true model, PAPC (ergodic) solution after 1000 iterations, and exact solution. Right panel: true model and classical lasso estimate.]

I Left: model parameters estimated by PAPC versus the exact solution via CVX.

I Right: estimated model using classical lasso. Thus, adding the "fused" constraints considerably increases the quality and interpretability of the recovered model.

References – Lecture based on [1] and [2]

1. R. Shefi and M. Teboulle. Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM J. Optimization, 24, 269–297, 2014.

2. Y. Drori, S. Sabach and M. Teboulle. A simple algorithm for a class of nonsmooth convex-concave saddle-point problems. Operations Research Letters, 43, 209–214, 2015.

3. D. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, Boston, 1982.

4. R. Glowinski and P. Le Tallec. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, volume 9. SIAM, 1989.

5. A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision, 40(1), 120–145, 2011.

6. R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5), 877–898, 1976.

7. R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67, 91–108, 2005.