Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

International Conference on Machine Learning
Peter Richtárik and Mark Schmidt
July 2015




Context: Big Data and Big Models

We are collecting data at unprecedented rates.

Seen across many fields of science and engineering.
Not gigabytes, but terabytes or petabytes (and beyond).

Machine learning can use big data to fit richer models:

Bioinformatics. Computer vision. Speech recognition. Product recommendation. Machine translation.


Common Framework: Empirical Risk Minimization

The most common framework is empirical risk minimization:

\min_{x \in \mathbb{R}^P} \frac{1}{N} \sum_{i=1}^{N} L(x, a_i, b_i) + \lambda r(x)

(data-fitting term + regularizer)

We have N observations a_i (and possibly labels b_i).
We want to find optimal parameters x^*.

Examples range from squared error with 2-norm regularization,

\min_{x \in \mathbb{R}^P} \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2}(a_i^T x - b_i)^2 + \frac{\lambda}{2}\|x\|^2,

but also conditional random fields and deep neural networks.

Main practical challenges:
Designing/learning good features a_i.
Efficiently solving the problem when N or P are very large.
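As a concrete sketch of the regularized least-squares example above (the data below is synthetic and illustrative; rows of `A` play the role of the a_i^T), the objective can be evaluated directly and, when P is small, minimized in closed form:

```python
import numpy as np

def ridge_objective(x, A, b, lam):
    # (1/N) * sum_i (1/2) * (a_i^T x - b_i)^2 + (lam/2) * ||x||^2
    r = A @ x - b
    return 0.5 * np.mean(r ** 2) + 0.5 * lam * (x @ x)

def ridge_solve(A, b, lam):
    # Setting the gradient (1/N) A^T (A x - b) + lam * x to zero
    # gives a P-by-P linear system (feasible only for small P).
    N, P = A.shape
    return np.linalg.solve(A.T @ A / N + lam * np.eye(P), A.T @ b / N)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))   # N = 100 observations, P = 5 features
b = rng.standard_normal(100)
x_star = ridge_solve(A, b, lam=0.1)
```

For large P, the O(P^3) direct solve is exactly what the first-order methods discussed later avoid.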


Motivation: Why Learn about Convex Optimization?

Why learn about large-scale optimization?

Optimization is at the core of many ML algorithms.
We can't solve huge problems with traditional techniques.

Why in particular learn about convex optimization?

Convex problems are among the only continuous problems we can solve efficiently.
You can do a lot with convex models.
(least squares, lasso, generalized linear models, SVMs, CRFs, etc.)

Empirically effective non-convex methods are often based on methods with good properties for convex objectives.
(functions are locally convex around minimizers)

Tools from convex analysis are being extended to non-convex settings.



How hard is real-valued optimization?

How long does it take to find an ε-optimal minimizer of a real-valued function

\min_{x \in \mathbb{R}^n} f(x)?

General function: impossible!

We need to make some assumptions about the function.

Assume f is Lipschitz-continuous (it cannot change too quickly):

|f(x) - f(y)| \le L \|x - y\|.

After t iterations, the error of any algorithm is \Omega(1/t^{1/n}).
(and grid search is nearly optimal)

Optimization is hard, but assumptions make a big difference.
(we went from impossible to very slow)
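To make the grid-search claim concrete, here is a minimal one-dimensional sketch (the function and constants are illustrative): for an L-Lipschitz f on an interval, a grid with spacing 2ε/L puts some grid point within ε/L of a minimizer, hence within ε of the optimal value. In n dimensions the number of grid points grows like (L/2ε)^n, which matches the slow rate above.

```python
import numpy as np

def grid_search_1d(f, L, eps, lo=0.0, hi=1.0):
    # For an L-Lipschitz f on [lo, hi], a grid with spacing 2*eps/L
    # guarantees some grid point is within eps of the minimum value.
    m = int(np.ceil((hi - lo) * L / (2 * eps))) + 1
    xs = np.linspace(lo, hi, m)
    vals = [f(x) for x in xs]
    return xs[int(np.argmin(vals))]

f = lambda x: abs(2 * x - 0.7)          # 2-Lipschitz, minimum value 0 at x = 0.35
x_hat = grid_search_1d(f, L=2.0, eps=0.01)
```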


Convex Functions: Three Characterizations

A function f is convex if for all x and y we have

f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y), for \theta \in [0, 1].

The function lies below the linear interpolation between x and y.

This implies that all local minima are global minima.

[Figures: a convex function lying below the chord between f(x) and f(y), with f(0.5x + 0.5y) below 0.5 f(x) + 0.5 f(y); a non-convex function with non-global local minima.]
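The definition also suggests a simple randomized sanity check (necessary but not sufficient; the function choices below are illustrative):

```python
import numpy as np

def convex_on_samples(f, dim=3, trials=1000, seed=0):
    # Randomized check of the definition on sampled x, y, theta:
    # f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y).
    # Passing does not prove convexity; failing disproves it.
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.standard_normal(dim), rng.standard_normal(dim)
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            return False
    return True

passes_convex = convex_on_samples(lambda z: z @ z)       # ||x||^2 is convex
passes_concave = convex_on_samples(lambda z: -(z @ z))   # -||x||^2 is strictly concave
```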


Convex Functions: Three Characterizations

A function f is convex if for all x and y we have

f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y), for \theta \in [0, 1].

A differentiable function f is convex if for all x and y we have

f(y) \ge f(x) + \nabla f(x)^T (y - x).

The function lies globally above its tangent at x.

So if \nabla f(y) = 0, then y is a global minimizer.

[Figure: the function lying above the tangent line f(x) + \nabla f(x)^T (y - x).]


Convex Functions: Three Characterizations

A function f is convex if for all x and y we have

f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y), for \theta \in [0, 1].

A differentiable function f is convex if for all x and y we have

f(y) \ge f(x) + \nabla f(x)^T (y - x).

A twice-differentiable function f is convex if for all x we have

\nabla^2 f(x) \succeq 0,

i.e., all eigenvalues of the Hessian are non-negative.

The function is flat or curved upwards in every direction.

This is usually the easiest way to show a function is convex.
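As an illustration of the second-order test, consider f(x) = log Σ_j e^{x_j}. Its Hessian is diag(p) - p p^T with p the softmax of x (a standard identity), and its eigenvalues can be checked numerically; the test point below is arbitrary:

```python
import numpy as np

def logsumexp_hessian(x):
    # Hessian of f(x) = log(sum_j exp(x_j)) is diag(p) - p p^T,
    # where p = softmax(x). This matrix is positive semidefinite.
    p = np.exp(x - x.max())   # shift for numerical stability
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

H = logsumexp_hessian(np.array([1.0, -2.0, 0.5]))
eigs = np.linalg.eigvalsh(H)
```

Note the smallest eigenvalue is exactly zero (the all-ones vector is in the null space), so f is convex but not strongly convex.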


Examples of Convex Functions

Some simple convex functions:

f(x) = c
f(x) = a^T x
f(x) = a x^2 + b (for a > 0)
f(x) = \exp(ax)
f(x) = x \log x (for x > 0)
f(x) = \|x\|^2
f(x) = \|x\|_p
f(x) = \max_i x_i

Some other notable examples:

f(x, y) = \log(e^x + e^y)
f(X) = -\log \det X (for X positive-definite)
f(x, Y) = x^T Y^{-1} x (for Y positive-definite)


Operations that Preserve Convexity

1 Non-negative weighted sum:

f(x) = \theta_1 f_1(x) + \theta_2 f_2(x).

2 Composition with an affine mapping:

g(x) = f(Ax + b).

3 Pointwise maximum:

f(x) = \max_i f_i(x).

Show that least-residual problems are convex for any \ell_p-norm:

f(x) = \|Ax - b\|_p.

We know that \|\cdot\|_p is a (convex) norm and we compose it with an affine map, so convexity follows from (2).


Operations that Preserve Convexity

Show that SVMs are convex:

f(x) = \frac{1}{2}\|x\|^2 + C \sum_{i=1}^{n} \max\{0, 1 - b_i a_i^T x\}.

The first term has Hessian I \succeq 0; for the second term, use (3) on the two (convex) arguments of each max, then use (1) to put it all together.
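A minimal sketch of this objective (the data below is synthetic and illustrative), whose convexity can also be spot-checked numerically along segments:

```python
import numpy as np

def svm_objective(x, A, b, C):
    # (1/2) * ||x||^2 + C * sum_i max(0, 1 - b_i * a_i^T x)
    margins = b * (A @ x)
    return 0.5 * (x @ x) + C * np.maximum(0.0, 1.0 - margins).sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4))       # 20 examples, 4 features
b = np.sign(rng.standard_normal(20))   # labels in {-1, +1}
x, y = rng.standard_normal(4), rng.standard_normal(4)
```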


Outline

1 Motivation

2 Gradient Method

3 Stochastic Subgradient

4 Finite-Sum Methods

5 Non-Smooth Objectives


Motivation for Gradient Methods

We can solve convex optimization problems in polynomial time by interior-point methods.

But these solvers require O(P²) or worse cost per iteration.

Infeasible for applications where P may be in the billions.

Large-scale problems have renewed interest in gradient methods:

x^{t+1} = x^t − α_t ∇f(x^t).

Only O(P) cost per iteration! But how many iterations are needed?
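A minimal sketch of this iteration, using NumPy on a toy least-squares objective f(x) = ½‖Ax − b‖² (our own choice of problem and step size α_t = 1/L, not prescribed by the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

def grad(x):
    # gradient of f(x) = 0.5 ||Ax - b||^2; costs O(P) beyond the matrix products
    return A.T @ (A @ x - b)

L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient: ||A||_2^2
x = np.zeros(10)
for t in range(500):
    x = x - grad(x) / L        # x_{t+1} = x_t - alpha_t * grad f(x_t)
```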


Logistic Regression with 2-Norm Regularization

Let's consider logistic regression with 2-norm regularization:

f(x) = Σ_{i=1}^n log(1 + exp(−b_i x^T a_i)) + (λ/2)‖x‖².

Objective f is convex.

First term is Lipschitz-continuous, second term is not.

But we have

μI ⪯ ∇²f(x) ⪯ LI,

for some L and μ. (L ≤ (1/4)‖A‖₂² + λ, μ ≥ λ.)

We say that the gradient is Lipschitz-continuous.

We say that the function is strongly convex.
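These eigenvalue bounds can be verified numerically. The sketch below builds the Hessian of the regularized logistic loss on toy data (our own construction) and checks μ ≥ λ and L ≤ (1/4)‖A‖₂² + λ:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 5, 0.1
A = rng.standard_normal((n, p))        # rows a_i
b = rng.choice([-1.0, 1.0], size=n)    # labels b_i

def hessian(x):
    s = 1.0 / (1.0 + np.exp(-b * (A @ x)))  # sigma(b_i a_i^T x)
    D = s * (1 - s)                          # each entry lies in (0, 1/4]
    return A.T @ (D[:, None] * A) + lam * np.eye(p)

L_bound = 0.25 * np.linalg.norm(A, 2) ** 2 + lam
x = rng.standard_normal(p)
eigs = np.linalg.eigvalsh(hessian(x))
assert eigs.min() >= lam - 1e-10       # strong convexity: mu >= lambda
assert eigs.max() <= L_bound + 1e-10   # Lipschitz gradient: L <= ||A||^2/4 + lambda
```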


Properties of Lipschitz-Continuous Gradient

From Taylor's theorem, for some z we have:

f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(z)(y − x).

Use that ∇²f(z) ⪯ LI:

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖².

Global quadratic upper bound on the function value.

We get a variant of the gradient method if we set x^{t+1} to the minimizing value of y:

x^{t+1} = x^t − (1/L)∇f(x^t).

Plugging this value in:

f(x^{t+1}) ≤ f(x^t) − (1/2L)‖∇f(x^t)‖².

Guaranteed decrease of the objective.

[Figure: f(y) lies between the linear lower bound f(x) + ∇f(x)^T(y − x) and the quadratic upper bound f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖².]

Stochastic vs. deterministic methods

Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, θ^T Φ(x_i)) + μΩ(θ).

Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f_i′(θ_{t−1}).

Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1}).
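The two updates can be compared on a toy least-squares finite sum (data, step sizes, and iteration counts are our own choices, not from the slides). Batch uses all n gradients per step; SGD uses one, with a decaying step size:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 5
A = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
b = A @ theta_true + 0.1 * rng.standard_normal(n)
theta_star = np.linalg.lstsq(A, b, rcond=None)[0]

def grad_full(theta):          # g'(theta): averages all n gradients, O(n) per step
    return A.T @ (A @ theta - b) / n

def grad_one(theta, i):        # f_i'(theta): a single data point, O(1) per step
    return (A[i] @ theta - b[i]) * A[i]

theta_b = np.zeros(p)
for t in range(200):
    theta_b -= 0.5 * grad_full(theta_b)

theta_s = np.zeros(p)
for t in range(200 * n):       # same total number of single-example gradients
    i = rng.integers(n)
    theta_s -= 0.05 / (1 + t / 1000) * grad_one(theta_s, i)
```

Batch converges to the minimizer up to numerical precision; SGD hovers in a noise ball around it whose radius shrinks with the step size.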


Properties of Strong-Convexity

From Taylor's theorem, for some z we have:

f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(z)(y − x).

Use that ∇²f(z) ⪰ μI:

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (μ/2)‖y − x‖².

Global quadratic lower bound on the function value.

[Figure: f(y) lies above the quadratic lower bound f(x) + ∇f(x)^T(y − x) + (μ/2)‖y − x‖².]

Minimize both sides in terms of y:

f(x*) ≥ f(x) − (1/2μ)‖∇f(x)‖².

Upper bound on how far we are from the solution.
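This suboptimality bound can also be checked numerically on a strongly convex toy quadratic (our construction; μ is the smallest eigenvalue of Q):

```python
import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((6, 6))
Q = M @ M.T + np.eye(6)
b = rng.standard_normal(6)

f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
mu = np.linalg.eigvalsh(Q).min()       # strong-convexity constant
f_star = f(np.linalg.solve(Q, b))      # exact minimum value

for _ in range(100):
    x = rng.standard_normal(6)
    g = grad(x)
    # f(x) - f(x*) <= (1/(2 mu)) ||grad f(x)||^2
    assert f(x) - f_star <= g @ g / (2 * mu) + 1e-8
```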


Linear Convergence of Gradient Descent

We have bounds involving x^{t+1} and x*:

f(x^{t+1}) ≤ f(x^t) − (1/2L)‖∇f(x^t)‖²,    f(x*) ≥ f(x^t) − (1/2μ)‖∇f(x^t)‖².

[Figure: guaranteed progress from the quadratic upper bound vs. maximum suboptimality from the quadratic lower bound, with f(x⁺) marked.]

Combine them to get

f(x^{t+1}) − f(x*) ≤ (1 − μ/L)[f(x^t) − f(x*)].

This gives a linear convergence rate:

f(x^t) − f(x*) ≤ (1 − μ/L)^t [f(x^0) − f(x*)].

Each iteration multiplies the error by a fixed amount. (Very fast if μ/L is not too close to one.)
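The linear rate can be observed empirically. The sketch below runs gradient descent with step 1/L on a strongly convex toy quadratic (our choice) and checks the (1 − μ/L)^t bound at every iteration:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((10, 10))
Q = M @ M.T + 0.5 * np.eye(10)
b = rng.standard_normal(10)

f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
eigs = np.linalg.eigvalsh(Q)
mu, L = eigs.min(), eigs.max()
f_star = f(np.linalg.solve(Q, b))

x = np.zeros(10)
gap0 = f(x) - f_star
for t in range(1, 51):
    x = x - grad(x) / L
    # f(x_t) - f* <= (1 - mu/L)^t (f(x_0) - f*)
    assert f(x) - f_star <= (1 - mu / L) ** t * gap0 + 1e-10
```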


Maximum Likelihood Logistic Regression

What about maximum-likelihood logistic regression?

f(x) = Σ_{i=1}^n log(1 + exp(−b_i x^T a_i)).

We now only have

0 ⪯ ∇²f(x) ⪯ LI.

Convexity only gives a linear upper bound on f(x*):

f(x*) ≤ f(x) + ∇f(x)^T (x* − x).


If some x* exists, we have the sublinear convergence rate:

f(x^t) − f(x*) = O(1/t).

(Compare to the much slower Ω(t^{−1/N}) for general Lipschitz functions in dimension N.)

If f is convex, then f(x) + λ‖x‖² is strongly convex.


Gradient Method: Practical Issues

In practice, searching for the step size (line search) is usually much faster than α = 1/L (and doesn't require knowledge of L).

Basic Armijo backtracking line search:
1. Start with a large value of α.
2. Divide α in half until we satisfy (a typical value is γ = 10⁻⁴)

f(x^{t+1}) ≤ f(x^t) − γα‖∇f(x^t)‖².

Practical methods may use Wolfe conditions (so α isn't too small), and/or use interpolation to propose trial step sizes. (With good interpolation, ≈ 1 evaluation of f per iteration.)

Also, check your derivative code!

∇_i f(x) ≈ (f(x + δe_i) − f(x)) / δ.

For large-scale problems you can check a random direction d:

∇f(x)^T d ≈ (f(x + δd) − f(x)) / δ.


Accelerated Gradient Method

Is this the best algorithm under these assumptions?

Algorithm | Assumptions     | Rate
Gradient  | Convex          | O(1/t)
Nesterov  | Convex          | O(1/t²)
Gradient  | Strongly-Convex | O((1 − μ/L)^t)
Nesterov  | Strongly-Convex | O((1 − √(μ/L))^t)

Nesterov's accelerated gradient method:

x_{t+1} = y_t − α_t ∇f(y_t),
y_{t+1} = x_{t+1} + β_t (x_{t+1} − x_t),

for appropriate α_t, β_t.

Rate is nearly optimal for a dimension-independent algorithm.

Similar to heavy-ball/momentum and conjugate gradient.

For logistic regression and many other losses, we can get linear convergence without strong convexity [Luo & Tseng, 1993].
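A minimal sketch of the accelerated iteration for the strongly convex case, using the standard constant choices α_t = 1/L and β_t = (√L − √μ)/(√L + √μ) (our choice; the slide leaves α_t, β_t unspecified), on a toy quadratic:

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((20, 20))
Q = M @ M.T + np.eye(20)
c = rng.standard_normal(20)

f = lambda x: 0.5 * x @ Q @ x - c @ x
grad = lambda x: Q @ x - c
eigs = np.linalg.eigvalsh(Q)
mu, L = eigs.min(), eigs.max()
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))  # constant momentum

x = y = np.zeros(20)
for t in range(300):
    x_next = y - grad(y) / L             # gradient step from the extrapolated point
    y = x_next + beta * (x_next - x)     # momentum extrapolation
    x = x_next
```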


Newton's Method

The oldest differentiable optimization method is Newton's (also called IRLS for functions of the form f(Ax)).

Modern form uses the update

x^{t+1} = x^t − αd,

where d is a solution to the system

∇²f(x^t) d = ∇f(x^t).

(Assumes ∇²f(x^t) ≻ 0.)

Equivalent to minimizing the quadratic approximation:

f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/2α)‖y − x‖²_{∇²f(x)}.

(Recall that ‖x‖²_H = x^T H x.)

We can generalize the Armijo condition to

f(x^{t+1}) ≤ f(x^t) + γα∇f(x^t)^T d.

Has a natural step length of α = 1 (always accepted when close to a minimizer).
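A sketch of Newton's method with the natural step α = 1 on a regularized logistic loss (toy data and the ridge term λ = 1 are our own choices; the regularizer keeps the Hessian positive definite so the Newton system is solvable):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, lam = 100, 4, 1.0
A = rng.standard_normal((n, p))
b = rng.choice([-1.0, 1.0], size=n)

def f(x):
    return np.sum(np.log1p(np.exp(-b * (A @ x)))) + 0.5 * lam * x @ x

def grad(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))   # sigma(-b_i a_i^T x)
    return -A.T @ (b * s) + lam * x

def hess(x):
    s = 1.0 / (1.0 + np.exp(-b * (A @ x)))
    D = s * (1 - s)
    return A.T @ (D[:, None] * A) + lam * np.eye(p)

x = np.zeros(p)
for t in range(20):
    d = np.linalg.solve(hess(x), grad(x))   # solve the Newton system for d
    x = x - d                               # natural step alpha = 1
```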


[Figure sequence (pages 96-103): a 1-D function f(x), the gradient step x − α f′(x), the quadratic approximation Q(x), and the Newton step x − α H⁻¹f′(x), which jumps to the minimizer of Q(x).]

Convergence Rate of Newton's Method

If ∇²f(x) is Lipschitz-continuous and ∇²f(x) ⪰ µI, then close to x* Newton's method has local superlinear convergence:

f(x^{t+1}) − f(x*) ≤ ρ_t [f(x^t) − f(x*)], with lim_{t→∞} ρ_t = 0.

It converges very fast: use it if you can!

But it requires solving the linear system ∇²f(x)d = ∇f(x).

Global rates are available under various assumptions (cubic regularization / acceleration / self-concordance).

Newton's Method: Practical Issues

There are many practical variants of Newton's method:

Modify the Hessian to be positive-definite.

Only compute the Hessian every m iterations.

Only use the diagonal of the Hessian.

Quasi-Newton: update a (diagonal plus low-rank) approximation of the Hessian (BFGS, L-BFGS).

Hessian-free: compute d inexactly using Hessian-vector products:

∇²f(x)d = lim_{δ→0} [∇f(x + δd) − ∇f(x)] / δ.

Barzilai-Borwein: choose a step size that acts like the Hessian over the last iteration:

α = (x^{t+1} − x^t)^T (∇f(x^{t+1}) − ∇f(x^t)) / ‖∇f(x^{t+1}) − ∇f(x^t)‖².

Another related method is nonlinear conjugate gradient.
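The Hessian-free idea can be sketched by pairing a finite-difference Hessian-vector product with conjugate gradient; the quadratic test problem and every name below are hypothetical.

```python
import numpy as np

def hvp(grad, x, v, delta=1e-6):
    """Finite-difference Hessian-vector product:
    H v is approximated by (grad(x + delta*v) - grad(x)) / delta."""
    return (grad(x + delta * v) - grad(x)) / delta

def cg(matvec, rhs, tol=1e-10, max_iter=100):
    """Plain conjugate gradient for matvec(d) = rhs, matvec symmetric PD."""
    d = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = matvec(p)
        step = rs / (p @ Hp)
        d += step * p
        r -= step * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Hypothetical quadratic test: f(x) = 0.5 x^T A x - b^T x, so grad(x) = A x - b
# and the exact Newton step from any x lands on the minimizer of f.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)                  # symmetric positive-definite Hessian
b = rng.standard_normal(5)
grad = lambda x: A @ x - b

x = rng.standard_normal(5)
d = cg(lambda v: hvp(grad, x, v), grad(x))   # inexact Newton direction
```

The solve never forms the Hessian: each CG iteration costs one extra gradient evaluation, which is what makes the approach attractive at scale.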


Outline

1 Motivation

2 Gradient Method

3 Stochastic Subgradient

4 Finite-Sum Methods

5 Non-Smooth Objectives

Big-N Problems

Recall the regularized empirical risk minimization problem:

min_{x∈R^P} (1/N) Σ_{i=1}^N L(x, a_i, b_i) + λr(x)

(data-fitting term + regularizer)

What if the number of training examples N is very large?

E.g., ImageNet has more than 14 million annotated images.
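As a concrete (hypothetical) instantiation of this objective, a sketch with logistic loss L(x, a_i, b_i) = log(1 + exp(−b_i a_iᵀx)) and ℓ2 regularizer r(x) = 0.5‖x‖²; the synthetic data and all names are illustration only.

```python
import numpy as np

# Hypothetical instantiation: logistic loss with labels b_i in {-1, +1}
# and an l2 regularizer r(x) = 0.5 * ||x||^2.
def erm_objective(x, A, b, lam):
    margins = b * (A @ x)                        # b_i (a_i^T x), one per example
    loss = np.mean(np.log1p(np.exp(-margins)))   # (1/N) sum_i L(x, a_i, b_i)
    return loss + lam * 0.5 * (x @ x)            # + lambda * r(x)

rng = np.random.default_rng(0)
N, P = 100, 4                                    # tiny synthetic data set
A = rng.standard_normal((N, P))
b = np.sign(A @ np.ones(P) + 0.1 * rng.standard_normal(N))
val = erm_objective(np.zeros(P), A, b, lam=0.1)  # at x = 0 the loss is log 2
```

Each evaluation makes one O(NP) pass over the data, which is exactly the cost that becomes prohibitive when N is huge.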

Stochastic vs. Deterministic Gradient Methods

We consider minimizing f(x) = (1/N) Σ_{i=1}^N f_i(x).

Deterministic gradient method [Cauchy, 1847]:

x_{t+1} = x_t − α_t ∇f(x_t) = x_t − (α_t/N) Σ_{i=1}^N ∇f_i(x_t).

Iteration cost is linear in N.
Convergence with constant α_t or a line search.

Stochastic gradient method [Robbins & Monro, 1951]: select i at random from {1, 2, ..., N} and update

x_{t+1} = x_t − α_t ∇f_i(x_t).

This gives an unbiased estimate of the true gradient:

E[∇f_i(x)] = (1/N) Σ_{i=1}^N ∇f_i(x) = ∇f(x).

Iteration cost is independent of N.
Convergence requires α_t → 0.
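The two updates can be contrasted on a synthetic least-squares instance of the finite sum; the data, step sizes, and iteration counts below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 5
A = rng.standard_normal((N, P))
b = A @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)

# f(x) = (1/N) sum_i f_i(x) with f_i(x) = 0.5 * (a_i^T x - b_i)^2
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # gradient of f_i: one example
full_grad = lambda x: A.T @ (A @ x - b) / N      # gradient of f: all N examples

L = np.linalg.eigvalsh(A.T @ A / N).max()        # Lipschitz constant of the gradient

# Deterministic gradient method: constant step 1/L, N gradients per iteration.
x_det = np.zeros(P)
for t in range(200):
    x_det -= (1.0 / L) * full_grad(x_det)

# Stochastic gradient method: one random gradient per iteration, decreasing step.
x_sto = np.zeros(P)
for t in range(200 * N):                         # same total gradient budget
    i = rng.integers(N)
    x_sto -= (0.05 / (1.0 + t / N)) * grad_i(x_sto, i)

f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])
```

With the same total number of individual-gradient evaluations, the deterministic method reaches high accuracy while the stochastic method settles into a small neighborhood of the optimum, matching the rate discussion that follows.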

[Figure (page 116): batch gradient descent θ_t = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f_i′(θ_{t−1}) vs. stochastic gradient descent θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1}), for g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, θᵀΦ(x_i)) + µΩ(θ).]

Stochastic vs. Deterministic Gradient Methods

Stochastic iterations are N times cheaper, but how many iterations are needed?

Assumption       Deterministic          Stochastic
Convex           O(1/t²)                O(1/√t)
Strongly convex  O((1 − √(µ/L))^t)      O(1/t)

Stochastic methods have a low iteration cost but a slow convergence rate:
the rate is sublinear even in the strongly-convex case, and these bounds are unimprovable if only an unbiased gradient is available.

Stochastic vs. Deterministic Convergence Rates

Plot of convergence rates in the strongly-convex case:

[Figure: log(excess cost) vs. time; the stochastic method makes rapid progress early, while the deterministic method converges linearly. Goal = best of both worlds: a linear rate with O(1) iteration cost.]

Stochastic methods will be superior in low-accuracy / limited-time situations.

Stochastic vs. Deterministic for Non-Smooth Objectives

Consider the binary support vector machine objective:

f(x) = Σ_{i=1}^n max{0, 1 − b_i(x^T a_i)} + (λ/2)‖x‖².

Rates for subgradient methods on non-smooth objectives:

Assumption       Deterministic   Stochastic
Convex           O(1/√t)         O(1/√t)
Strongly convex  O(1/t)          O(1/t)

Other black-box methods (e.g., cutting plane) are not faster.

For non-smooth problems:
Stochastic methods have the same rate as in the smooth case.
Deterministic methods are not faster than the stochastic method.
So use the stochastic subgradient method (iterations are n times cheaper).

Sub-Gradients and Sub-Differentials

Recall that for differentiable convex functions we have

f(y) ≥ f(x) + ∇f(x)^T(y − x), ∀x, y.

A vector d is a subgradient of a convex function f at x if

f(y) ≥ f(x) + d^T(y − x), ∀y.

At a differentiable x:
the only subgradient is ∇f(x).

At a non-differentiable x:
we have a set of subgradients, called the sub-differential ∂f(x).

Note that 0 ∈ ∂f(x) iff x is a global minimum.
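The subgradient inequality can be checked numerically; a hypothetical grid-based sketch for f(x) = |x| at the kink x = 0.

```python
import numpy as np

f = np.abs                           # the non-smooth test function f(x) = |x|
ys = np.linspace(-2.0, 2.0, 401)     # grid of points y to test the inequality on

def is_subgradient(d, x):
    """Check f(y) >= f(x) + d*(y - x) on the whole grid (up to rounding)."""
    return bool(np.all(f(ys) >= f(x) + d * (ys - x) - 1e-12))

# At the kink x = 0, exactly the slopes in [-1, 1] should qualify.
inside = [is_subgradient(d, 0.0) for d in (-1.0, -0.5, 0.0, 0.5, 1.0)]
outside = [is_subgradient(d, 0.0) for d in (-1.5, 1.5)]
```

The check that d = 0 qualifies at x = 0 is the "0 ∈ ∂f(x) iff global minimum" condition in action.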

[Figure sequence (pages 126-133): at a differentiable point, the tangent line f(x) + ∇f(x)^T(y − x) lies below f; at a kink, many supporting lines (subgradients) lie below f.]

Sub-Differential of Absolute Value and Max Functions

Sub-differential of the absolute value function:

∂|x| = { {1}       x > 0
         {−1}      x < 0
         [−1, 1]   x = 0

(the sign of the variable if non-zero; anything in [−1, 1] at 0)

[Figure sequence (pages 135-143): f(x) = |x| with supporting lines of every slope in [−1, 1] at x = 0.]

Sub-differential of the max function:

∂ max{f1(x), f2(x)} = { {∇f1(x)}                                 f1(x) > f2(x)
                        {∇f2(x)}                                 f2(x) > f1(x)
                        {θ∇f1(x) + (1 − θ)∇f2(x) : θ ∈ [0, 1]}   f1(x) = f2(x)

(any convex combination of the gradients of the argmax)
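Applying the max rule to the hinge loss f_i(x) = max{0, 1 − b_i(a_iᵀx)} from the SVM objective gives a simple subgradient routine; this sketch and all its names are hypothetical.

```python
import numpy as np

def hinge_subgrad(x, a_i, b_i, theta=0.0):
    """A subgradient of f_i(x) = max{0, 1 - b_i * (a_i^T x)} via the max rule:
    the gradient of the active piece, or a convex combination at a tie."""
    margin = 1.0 - b_i * (a_i @ x)
    if margin > 0:                       # hinge piece active: gradient -b_i * a_i
        return -b_i * a_i
    if margin < 0:                       # zero piece active: gradient 0
        return np.zeros_like(a_i)
    return theta * (-b_i * a_i)          # tie: theta * grad_f1 + (1 - theta) * 0

a = np.array([2.0, -1.0])                # hypothetical example (a_i, b_i = 1)
x0 = np.array([0.0, 0.0])                # margin 1 > 0: hinge piece active
x1 = np.array([1.0, 0.0])                # margin -1 < 0: zero piece active
```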

Subgradient and Stochastic Subgradient Methods

The basic subgradient method:

x^{t+1} = x^t − αd, for some d ∈ ∂f(x^t).

The steepest-descent choice is d = argmin_{d∈∂f(x)} ‖d‖.
(often hard to compute, but easy for ℓ1-regularization)

With other choices, the step may increase the objective even for small α.

But ‖x^{t+1} − x*‖ ≤ ‖x^t − x*‖ for small enough α.

For convergence, we require α → 0.

The basic stochastic subgradient method:

x^{t+1} = x^t − αd, for some d ∈ ∂f_i(x^t), with i drawn at random from {1, 2, ..., N}.
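A minimal sketch of the basic subgradient method with a decreasing step on a non-smooth toy problem f(x) = ‖x − c‖₁; the target c, step schedule, and iteration count are hypothetical.

```python
import numpy as np

c = np.array([1.0, -2.0])
f = lambda x: np.abs(x - c).sum()     # non-smooth exactly at the solution x* = c

x = np.zeros(2)
for t in range(1, 10001):
    d = np.sign(x - c)                # d in the sub-differential (0 at kinks)
    x = x - (1.0 / np.sqrt(t)) * d    # decreasing step: alpha_t -> 0
```

With a fixed step the iterates would oscillate around the kink at c with amplitude proportional to the step, which is why the decreasing schedule is required for convergence.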

Page 150: Modern Convex Optimization Methods for Large-Scale Empirical … · MotivationGradient MethodStochastic SubgradientFinite-Sum MethodsNon-Smooth Objectives Modern Convex Optimization

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Stochastic Subgradient Methods in Practice

The theory says to use the decreasing sequence α_t = 1/(µt):

i_t = rand(1, 2, …, N),   α_t = 1/(µt),   x^{t+1} = x^t − α_t f'_{i_t}(x^t).

This gives O(1/t) for smooth objectives and O(log(t)/t) for non-smooth objectives.

Except for some special cases, you should not do this:

The initial steps are huge: usually µ = O(1/N) or O(1/√N).

The later steps are tiny: 1/t gets small very quickly.

The convergence rate is not robust to mis-specification of µ.

There is no adaptation to 'easier' problems than the worst case.

Tricks that can improve theoretical and practical properties:

1. Use smaller initial step sizes that go to zero more slowly.

2. Take a (weighted) average of the iterates or gradients:

x̄_t = Σ_{i=1}^t ω_i x^i,   d̄_t = Σ_{i=1}^t δ_i d^i.
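Both tricks can be sketched together on a toy problem (the objective, the α_t = 1/√t schedule, and the uniform iterate average are illustrative assumptions, not the slides' exact choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite sum: f(x) = (1/N) sum_i |x - a_i|, minimized at the median of a.
a = rng.normal(3.0, 1.0, size=1000)

x = 0.0
x_avg = 0.0
T = 5000
for t in range(1, T + 1):
    i = rng.integers(len(a))          # pick one random term
    d = np.sign(x - a[i])             # subgradient of |x - a_i|
    alpha = 1.0 / np.sqrt(t)          # decays more slowly than 1/(mu * t)
    x = x - alpha * d
    x_avg += (x - x_avg) / t          # running (uniform) average of iterates
```

The last iterate x keeps oscillating at the scale of α_t, while the averaged iterate x_avg settles down much more smoothly.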

Speeding up Stochastic Subgradient Methods

Works that support using larger steps and averaging:

Rakhlin et al. [2011]: averaging the later iterates achieves O(1/t) in the non-smooth case.

Nesterov [2007], Xiao [2010]: gradient averaging improves constants ('dual averaging') and finds the non-zero variables with sparse regularizers.

Bach & Moulines [2011]: α_t = O(1/t^β) for β ∈ (0.5, 1) is more robust than α_t = O(1/t).

Nedic & Bertsekas [2000]: a constant step size (α_t = α) achieves the rate

E[f(x^t)] − f(x^*) ≤ (1 − 2µα)^t (f(x^0) − f(x^*)) + O(α).

Polyak & Juditsky [1992]: in the smooth case, iterate averaging is asymptotically optimal, achieving the same rate as an optimal stochastic Newton method.

Stochastic Newton Methods?

Should we use accelerated or Newton-like stochastic methods?

These do not improve the convergence rate, but some positive results exist:

Ghadimi & Lan [2010]: acceleration can improve the dependence on L and µ, improving performance at the start or when the noise is small.

Duchi et al. [2010]: Newton-like methods can improve regret bounds.

Bach & Moulines [2013]: a Newton-like method achieves O(1/t) without strong convexity (under an extra self-concordance assumption).

Outline

1 Motivation

2 Gradient Method

3 Stochastic Subgradient

4 Finite-Sum Methods

5 Non-Smooth Objectives

Big-N Problems

Recall the regularized empirical risk minimization problem:

min_{x ∈ R^P} (1/N) Σ_{i=1}^N L(x, a_i, b_i) + λr(x)

(data-fitting term + regularizer)

Stochastic methods: O(1/t) convergence, but only 1 gradient per iteration. These rates are unimprovable for general stochastic objectives.

Deterministic methods: O(ρ^t) convergence, but N gradients per iteration. The faster rate is possible because N is finite.

For minimizing finite sums, can we design a better method?

Motivation for Hybrid Methods

Stochastic vs. deterministic methods.

Goal = best of both worlds: a linear rate with O(1) iteration cost.

[Figure: log(excess cost) against time, comparing the stochastic and deterministic methods, together with a hybrid method combining the two.]

Hybrid Deterministic-Stochastic

Approach 1: control the sample size.

The FG method uses all N gradients,

∇f(x^t) = (1/N) Σ_{i=1}^N ∇f_i(x^t).

The SG method approximates it with 1 sample,

∇f_{i_t}(x^t) ≈ (1/N) Σ_{i=1}^N ∇f_i(x^t).

A common variant is to use a larger sample B^t,

(1/|B^t|) Σ_{i ∈ B^t} ∇f_i(x^t) ≈ (1/N) Σ_{i=1}^N ∇f_i(x^t).

Approach 1: Batching

The SG method with a sample B^t uses the iteration

x^{t+1} = x^t − (α_t / |B^t|) Σ_{i ∈ B^t} ∇f_i(x^t).

For a fixed sample size |B^t|, the rate is sublinear.

The gradient error decreases as the sample size |B^t| increases, so it is common to gradually increase |B^t| [Bertsekas & Tsitsiklis, 1996].

We can choose |B^t| to achieve a linear convergence rate:

Early iterations are cheap, like SG iterations.

Later iterations can use a Newton-like method.
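A minimal sketch of the batching idea on a toy least-squares finite sum (the geometric growth factor 1.1, the fixed step size, and the problem itself are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares finite sum: f(x) = (1/2N) * sum_i (a_i^T x - b_i)^2.
N, P = 1000, 5
A = rng.normal(size=(N, P))
x_true = rng.normal(size=P)
b = A @ x_true

x = np.zeros(P)
batch = 2                                   # small initial sample size
alpha = 0.1                                 # fixed step size (illustrative)
for t in range(300):
    idx = rng.choice(N, size=min(batch, N), replace=False)
    # Sampled approximation of the full gradient (1/N) sum_i grad f_i(x):
    grad = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
    x -= alpha * grad
    batch = int(batch * 1.1) + 1            # geometrically growing sample size
```

Early iterations touch only a handful of examples, while later iterations effectively become full-gradient steps once the sample covers the data set.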

Evaluation on Chain-Structured CRFs

Results on a chain-structured conditional random field:

[Figure: objective minus optimal value (log scale, 10^2 to 10^5) against passes through the data (0 to 100), comparing Stochastic with step sizes 1e-01, 1e-02, and 1e-03, Hybrid, and Deterministic.]

Stochastic Average Gradient

Growing |B^t| eventually requires an O(N) iteration cost.

Can we have a rate of O(ρ^t) with only 1 gradient evaluation per iteration?

YES! The stochastic average gradient (SAG) algorithm [Le Roux et al., 2012]:

Randomly select i_t from {1, 2, …, N} and compute ∇f_{i_t}(x^t), then set

x^{t+1} = x^t − (α_t / N) Σ_{i=1}^N y_i^t.

Memory: y_i^t = ∇f_i(x^t) from the last iteration t where i was selected.

A stochastic variant of the incremental aggregated gradient (IAG) method [Blatt et al., 2007].

Assumes the gradients of the non-selected examples don't change; the assumption becomes accurate as ‖x^{t+1} − x^t‖ → 0.
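A minimal SAG sketch on a toy least-squares finite sum, maintaining the gradient memory y_i and its running sum so that each iteration costs one gradient evaluation (the problem is an illustrative assumption; the step size 1/16L is the one from the slides, with L taken as max_i ‖a_i‖²):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite sum: f(x) = (1/N) sum_i f_i(x) with f_i(x) = (1/2)(a_i^T x - b_i)^2.
N, P = 500, 5
A = rng.normal(size=(N, P))
x_true = rng.normal(size=P)
b = A @ x_true                       # so every f_i is minimized at x_true

L = (A ** 2).sum(axis=1).max()       # Lipschitz constant of the worst grad f_i
alpha = 1.0 / (16 * L)               # step size used in the SAG rate

x = np.zeros(P)
y = np.zeros((N, P))                 # memory: y[i] = last stored grad f_i
g = np.zeros(P)                      # running sum of the stored gradients
for t in range(200 * N):             # 200 passes through the data
    i = rng.integers(N)
    grad_i = A[i] * (A[i] @ x - b[i])    # fresh gradient of f_i at x
    g += grad_i - y[i]                   # keep g = sum_i y[i] in O(P) time
    y[i] = grad_i
    x -= (alpha / N) * g                 # x <- x - (alpha/N) * sum_i y_i^t
```

The update g += grad_i − y[i] is the key trick: the average of the stored gradients is refreshed in O(P) time instead of O(NP).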

Convergence Rate of SAG

If each ∇f_i is L-Lipschitz continuous and f is strongly convex, then with α_t = 1/16L SAG satisfies

E[f(x^t) − f(x^*)] ≤ (1 − min{µ/16L, 1/8N})^t C,

where

C = [f(x^0) − f(x^*)] + (4L/N)‖x^0 − x^*‖² + σ²/16L.

A linear convergence rate, but with only 1 gradient per iteration.

For well-conditioned problems, a constant reduction per pass:

(1 − 1/8N)^N ≤ exp(−1/8) = 0.8825.

For ill-conditioned problems, almost the same rate as the deterministic method (but with iterations that are N times cheaper).

Rate of Convergence Comparison

Assume that N = 700000, L = 0.25, µ = 1/N:

Gradient method has rate ((L − µ)/(L + µ))² = 0.99998.

Accelerated gradient method has rate (1 − √(µ/L)) = 0.99761.

SAG (N iterations) has rate (1 − min{µ/16L, 1/8N})^N = 0.88250.

Fastest possible first-order method: ((√L − √µ)/(√L + √µ))² = 0.99048.

SAG beats two lower bounds:

The stochastic gradient bound (of O(1/t)).

The deterministic gradient bound (for typical L, µ, and N).

Number of ∇f_i evaluations to reach accuracy ε:

Stochastic: O((L/µ)(1/ε)).

Gradient: O(N(L/µ) log(1/ε)).

Accelerated: O(N√(L/µ) log(1/ε)).

SAG: O(max{N, L/µ} log(1/ε)).
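The four rate constants above follow directly from the stated N, L, and µ, and can be checked with a few lines of arithmetic:

```python
import math

# Check the slide's rate constants for N = 700000, L = 0.25, mu = 1/N.
N, L = 700000, 0.25
mu = 1.0 / N

gradient_rate = ((L - mu) / (L + mu)) ** 2                          # ~0.99998
accelerated_rate = 1.0 - math.sqrt(mu / L)                          # ~0.99761
sag_rate_per_pass = (1.0 - min(mu / (16 * L), 1.0 / (8 * N))) ** N  # ~0.88250
first_order_lower = ((math.sqrt(L) - math.sqrt(mu))
                     / (math.sqrt(L) + math.sqrt(mu))) ** 2         # ~0.99048
```

Note that for these values min{µ/16L, 1/8N} = 1/8N, so the SAG per-pass rate is essentially (1 − 1/8N)^N ≈ exp(−1/8).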

Page 183: Modern Convex Optimization Methods for Large-Scale Empirical … · MotivationGradient MethodStochastic SubgradientFinite-Sum MethodsNon-Smooth Objectives Modern Convex Optimization

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Rate of Convergence ComparisonAssume that N = 700000, L = 0.25, µ = 1/N:

Gradient method has rate(

L−µL+µ

)2

= 0.99998.

Accelerated gradient method has rate(1−

õL

)= 0.99761.

SAG (N iterations) has rate(1−min

µ

16L ,1

8N

)N= 0.88250.

Fastest possible first-order method:(√

L−√µ√L+√µ

)2

= 0.99048.

SAG beats two lower bounds:Stochastic gradient bound (of O(1/t)).Deterministic gradient bound (for typical L, µ, and N).

Number of f ′i evaluations to reach ε:Stochastic: O( L

µ (1/ε)).

Gradient: O(N Lµ log(1/ε)).

Accelerated: O(N√

Lµ log(1/ε)).

SAG: O(maxN, Lµ log(1/ε)).

Page 184: Modern Convex Optimization Methods for Large-Scale Empirical … · MotivationGradient MethodStochastic SubgradientFinite-Sum MethodsNon-Smooth Objectives Modern Convex Optimization

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Rate of Convergence ComparisonAssume that N = 700000, L = 0.25, µ = 1/N:

Gradient method has rate(

L−µL+µ

)2

= 0.99998.

Accelerated gradient method has rate(1−

õL

)= 0.99761.

SAG (N iterations) has rate(1−min

µ

16L ,1

8N

)N= 0.88250.

Fastest possible first-order method:(√

L−√µ√L+√µ

)2

= 0.99048.

SAG beats two lower bounds:Stochastic gradient bound (of O(1/t)).Deterministic gradient bound (for typical L, µ, and N).

Number of f ′i evaluations to reach ε:Stochastic: O( L

µ (1/ε)).

Gradient: O(N Lµ log(1/ε)).

Accelerated: O(N√

Lµ log(1/ε)).

SAG: O(maxN, Lµ log(1/ε)).

Page 185: Modern Convex Optimization Methods for Large-Scale Empirical … · MotivationGradient MethodStochastic SubgradientFinite-Sum MethodsNon-Smooth Objectives Modern Convex Optimization

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Rate of Convergence ComparisonAssume that N = 700000, L = 0.25, µ = 1/N:

Gradient method has rate(

L−µL+µ

)2

= 0.99998.

Accelerated gradient method has rate(1−

õL

)= 0.99761.

SAG (N iterations) has rate(1−min

µ

16L ,1

8N

)N= 0.88250.

Fastest possible first-order method:(√

L−√µ√L+√µ

)2

= 0.99048.

SAG beats two lower bounds:Stochastic gradient bound (of O(1/t)).Deterministic gradient bound (for typical L, µ, and N).

Number of f ′i evaluations to reach ε:Stochastic: O( L

µ (1/ε)).

Gradient: O(N Lµ log(1/ε)).

Accelerated: O(N√

Lµ log(1/ε)).

SAG: O(maxN, Lµ log(1/ε)).

Page 186: Modern Convex Optimization Methods for Large-Scale Empirical … · MotivationGradient MethodStochastic SubgradientFinite-Sum MethodsNon-Smooth Objectives Modern Convex Optimization

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Rate of Convergence ComparisonAssume that N = 700000, L = 0.25, µ = 1/N:

Gradient method has rate(

L−µL+µ

)2

= 0.99998.

Accelerated gradient method has rate(1−

õL

)= 0.99761.

SAG (N iterations) has rate(1−min

µ

16L ,1

8N

)N= 0.88250.

Fastest possible first-order method:(√

L−√µ√L+√µ

)2

= 0.99048.

SAG beats two lower bounds:Stochastic gradient bound (of O(1/t)).Deterministic gradient bound (for typical L, µ, and N).

Number of f ′i evaluations to reach ε:Stochastic: O( L

µ (1/ε)).

Gradient: O(N Lµ log(1/ε)).

Accelerated: O(N√

Lµ log(1/ε)).

SAG: O(maxN, Lµ log(1/ε)).

Page 187: Modern Convex Optimization Methods for Large-Scale Empirical … · MotivationGradient MethodStochastic SubgradientFinite-Sum MethodsNon-Smooth Objectives Modern Convex Optimization

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Rate of Convergence ComparisonAssume that N = 700000, L = 0.25, µ = 1/N:

Gradient method has rate(

L−µL+µ

)2

= 0.99998.

Accelerated gradient method has rate(1−

õL

)= 0.99761.

SAG (N iterations) has rate(1−min

µ

16L ,1

8N

)N= 0.88250.

Fastest possible first-order method:(√

L−√µ√L+√µ

)2

= 0.99048.

SAG beats two lower bounds:Stochastic gradient bound (of O(1/t)).Deterministic gradient bound (for typical L, µ, and N).

Number of f ′i evaluations to reach ε:Stochastic: O( L

µ (1/ε)).

Gradient: O(N Lµ log(1/ε)).

Accelerated: O(N√

Lµ log(1/ε)).

SAG: O(maxN, Lµ log(1/ε)).

Comparing Deterministic and Stochastic Methods

quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236)

[Figure: objective minus optimum versus effective passes on the two datasets, comparing AFG, L-BFGS, SG, ASG, and IAG.]

SAG Compared to FG and SG Methods

quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236)

[Figure: objective minus optimum versus effective passes on the two datasets, comparing AFG, L-BFGS, SG, ASG, IAG, and SAG-LS.]

Other Linearly-Convergent Stochastic Methods

Newer stochastic algorithms are now available with linear rates:

Stochastic dual coordinate ascent [Shalev-Shwartz & Zhang, 2013].
Incremental surrogate optimization [Mairal, 2013].
Stochastic variance-reduced gradient (SVRG) [Johnson & Zhang, 2013; Konecny & Richtarik, 2013; Mahdavi et al., 2013; Zhang et al., 2013].
SAGA [Defazio et al., 2014].

SVRG has a much lower memory requirement (later in talk).

There are also non-smooth extensions (last part of talk).

SAG Implementation Issues

Basic SAG algorithm:

while(1)
    Sample i from {1, 2, ..., N}.
    Compute f′_i(x).
    d = d − y_i + f′_i(x).
    y_i = f′_i(x).
    x = x − (α/N) d.

Practical variants of the basic algorithm allow:

Regularization.
Sparse gradients.
Automatic step-size selection.
Termination criterion.
Acceleration [Lin et al., 2015].
Adaptive non-uniform sampling [Schmidt et al., 2013]: sample gradients that change quickly more often.
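The basic loop above can be sketched in a few lines; this is a minimal illustration, assuming a user-supplied gradient oracle grad_i(i, x) for the individual f_i (the function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def sag(grad_i, x0, N, alpha, iters, rng=None):
    """Basic SAG: keep a table y of the last gradient seen for each i,
    maintain their running sum d, and step along the average d/N."""
    rng = np.random.default_rng(rng)
    x = x0.astype(float).copy()
    y = np.zeros((N, x.size))   # stored gradient for each function f_i
    d = np.zeros_like(x)        # running sum of the stored gradients
    for _ in range(iters):
        i = rng.integers(N)
        g = grad_i(i, x)
        d += g - y[i]           # replace i's old gradient in the sum
        y[i] = g
        x -= (alpha / N) * d    # step along the average stored gradient
    return x
```

On a consistent least-squares problem (f_i(x) = ½(a_iᵀx − b_i)²) this recovers the exact solution, since the residual at the minimizer is zero.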

SAG with Adaptive Non-Uniform Sampling

protein (n = 145751, p = 74) and sido (n = 12678, p = 4932)

[Figure: objective minus optimum versus effective passes on the two datasets, comparing IAG, AGD, L-BFGS, SG-C, ASG-C, PCD-L, DCA, and SAG.]

Datasets where SAG had the worst relative performance.

SAG with Non-Uniform Sampling

protein (n = 145751, p = 74) and sido (n = 12678, p = 4932)

[Figure: the same comparison with SAG-LS (Lipschitz sampling) added.]

Lipschitz sampling helps a lot.

Minimizing Finite Sums: Dealing with the Memory

A major disadvantage of SAG is the memory requirement.

There are several ways to avoid this:

Use mini-batches (only store gradient of the mini-batch).
Use structure in the objective:
For f_i(x) = L(a_iᵀx), only need to store N values of a_iᵀx.
For CRFs, only need to store marginals of parts.

[Figure: objective minus optimal versus effective passes on optical character and named-entity recognition tasks, comparing L-BFGS, Pegasos, SG, AdaGrad, ASG, Hybrid, SAG, SAG-NUS, SAG-NUS*, OEG, and SMD.]

If the above don't work, use SVRG...
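The structure trick for linear losses can be made concrete: since f_i′(x) = L′(a_iᵀx) a_i, SAG only needs one stored scalar per example rather than a full p-dimensional gradient. A hedged sketch (the loss_grad oracle and names are illustrative):

```python
import numpy as np

def sag_linear(A, loss_grad, x0, alpha, iters, rng=None):
    """SAG for f_i(x) = L(a_i^T x): store one scalar g[i] = L'(a_i^T x)
    per example instead of a p-dimensional gradient, since
    f_i'(x) = L'(a_i^T x) * a_i."""
    rng = np.random.default_rng(rng)
    N, p = A.shape
    x = x0.astype(float).copy()
    g = np.zeros(N)          # stored scalars L'(a_i^T x)
    d = np.zeros(p)          # running sum of the stored gradients g[i] * a_i
    for _ in range(iters):
        i = rng.integers(N)
        gi = loss_grad(A[i] @ x, i)    # new scalar L'(a_i^T x)
        d += (gi - g[i]) * A[i]        # update the gradient sum
        g[i] = gi
        x -= (alpha / N) * d
    return x
```

Memory drops from O(Np) to O(N) scalars plus the data matrix, with the same iterates as the full-storage version.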

Stochastic Variance-Reduced Gradient

SVRG algorithm:

Start with x_0.
for s = 0, 1, 2, ...
    d_s = (1/N) Σ_{i=1}^N f′_i(x_s)
    x^0 = x_s
    for t = 1, 2, ..., m
        Randomly pick i_t ∈ {1, 2, ..., N}
        x^t = x^{t−1} − α_t (f′_{i_t}(x^{t−1}) − f′_{i_t}(x_s) + d_s)
    x_{s+1} = x^t for random t ∈ {1, 2, ..., m}

Requires 2 gradients per iteration and occasional full passes, but only requires storing d_s and x_s.
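The loop above translates almost line for line into code; this is a minimal sketch assuming a gradient oracle grad_i(i, x), and it takes the last inner iterate as the next snapshot (a common practical choice; the slides use a random t):

```python
import numpy as np

def svrg(grad_i, x0, N, alpha, outer, m, rng=None):
    """SVRG: each outer epoch computes a full gradient d_s at the
    snapshot x_s, then takes m variance-corrected stochastic steps."""
    rng = np.random.default_rng(rng)
    xs = x0.astype(float).copy()
    for _ in range(outer):
        ds = sum(grad_i(i, xs) for i in range(N)) / N   # full gradient pass
        x = xs.copy()
        for _ in range(m):
            i = rng.integers(N)
            # Unbiased direction whose variance shrinks as x, xs -> x*.
            x -= alpha * (grad_i(i, x) - grad_i(i, xs) + ds)
        xs = x    # last iterate as the new snapshot
    return xs
```

Unlike SAG, nothing is stored per example: only the snapshot x_s and its full gradient d_s.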


Outline

1 Motivation

2 Gradient Method

3 Stochastic Subgradient

4 Finite-Sum Methods

5 Non-Smooth Objectives

Motivation: Sparse Regularization

Recall the regularized empirical risk minimization problem:

min_{x ∈ R^P} (1/N) Σ_{i=1}^N L(x, a_i, b_i) + λ r(x)
(data-fitting term + regularizer)

Often, the regularizer r is used to encourage a sparsity pattern in x.

For example, ℓ1-regularized least squares,

min_x ‖Ax − b‖² + λ‖x‖₁,

regularizes and encourages sparsity in x.

The objective is non-differentiable when any x_i = 0.

Subgradient methods are optimal (slow) black-box methods.

Faster methods for specific non-smooth problems?

Smoothing Approximations of Non-Smooth Functions

Smoothing: replace non-smooth f with smooth f_ε.

Apply a fast method for smooth optimization.

Smooth approximation to the absolute value:

|x| ≈ √(x² + ν).

Smoothing Approximations of Non-Smooth Functions

Smoothing: replace non-smooth f with smooth f_ε.

Apply a fast method for smooth optimization.

Smooth approximation to the absolute value:

|x| ≈ √(x² + ν).

Smooth approximation to the max function:

max{a, b} ≈ log(exp(a) + exp(b)).

Smooth approximation to the hinge loss:

max{0, 1 − x} ≈ { 0 if x ≥ 1;  (1 − x)² if t < x < 1;  (1 − t)² + 2(1 − t)(t − x) if x ≤ t }.

Generic smoothing strategy: strongly-convex regularization of the convex conjugate [Nesterov, 2005].
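The first two approximations come with standard sandwich bounds, √(x² + ν) ∈ [|x|, |x| + √ν] and log(eᵃ + eᵇ) ∈ [max{a, b}, max{a, b} + log 2], which a small sketch can check (the function names are illustrative):

```python
import math

def smooth_abs(x, nu):
    # |x| <= sqrt(x^2 + nu) <= |x| + sqrt(nu); differentiable everywhere.
    return math.sqrt(x * x + nu)

def smooth_max(a, b):
    # max{a,b} <= log(e^a + e^b) <= max{a,b} + log 2.
    m = max(a, b)                    # shift for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))
```

Shrinking ν tightens the absolute-value approximation but increases the Lipschitz constant of the gradient, which is the smoothing trade-off discussed next.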

Discussion of Smoothing Approach

Nesterov [2005] shows that:

The gradient method on the smoothed problem has the O(1/√t) subgradient rate.
The accelerated gradient method has the faster O(1/t) rate.

In practice:

Slowly decrease the level of smoothing (often difficult to tune).
Use faster algorithms like L-BFGS, SAG, or SVRG.

You can get the O(1/t) rate for min_x max_i{f_i(x)}, with each f_i convex and smooth, using the mirror-prox method [Nemirovski, 2004].

See also Chambolle & Pock [2010].

Converting to Constrained Optimization

Re-write the non-smooth problem as a constrained problem.

The problem

min_x f(x) + λ‖x‖₁

is equivalent to the problem

min_{x⁺ ≥ 0, x⁻ ≥ 0} f(x⁺ − x⁻) + λ Σ_i (x⁺_i + x⁻_i),

or the problems

min_{−y ≤ x ≤ y} f(x) + λ Σ_i y_i,    min_{‖x‖₁ ≤ γ} f(x) + λγ.

These are smooth objectives with 'simple' constraints:

min_{x ∈ C} f(x).

Optimization with Simple Constraints

Recall: gradient descent minimizes a quadratic approximation:

x^{t+1} = argmin_y { f(x^t) + ∇f(x^t)ᵀ(y − x^t) + (1/(2α_t)) ‖y − x^t‖² }.

Consider minimizing subject to simple constraints:

x^{t+1} = argmin_{y ∈ C} { f(x^t) + ∇f(x^t)ᵀ(y − x^t) + (1/(2α_t)) ‖y − x^t‖² }.

This is equivalent to projecting the gradient descent step:

x^GD_t = x^t − α_t ∇f(x^t),
x^{t+1} = argmin_{y ∈ C} ‖y − x^GD_t‖.
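Combining this with the earlier ℓ1 reformulation gives a complete method: run projected gradient on the split x = x⁺ − x⁻ with x⁺, x⁻ ≥ 0, where projecting onto the nonnegative orthant is just clipping at zero. A minimal sketch for the least-squares data-fitting term (illustrative code, not from the slides):

```python
import numpy as np

def projected_gradient_l1(A, b, lam, alpha, iters):
    """Solve min_x 0.5*||Ax-b||^2 + lam*||x||_1 via the bound-constrained
    split x = xp - xn, xp >= 0, xn >= 0: each iteration is a gradient
    step followed by projection onto the nonnegative orthant."""
    p = A.shape[1]
    xp = np.zeros(p)
    xn = np.zeros(p)
    for _ in range(iters):
        g = A.T @ (A @ (xp - xn) - b)                  # gradient of smooth part
        xp = np.maximum(xp - alpha * (g + lam), 0.0)   # project: clip at zero
        xn = np.maximum(xn - alpha * (-g + lam), 0.0)
    return xp - xn
```

For an orthonormal A this converges to the soft-thresholding solution sign(b_i) · max{|b_i| − λ, 0}, which gives an easy correctness check.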

Gradient Projection

[Figure: contours of f(x) in the (x1, x2) plane with a feasible set; the gradient step x − αf′(x) leaves the feasible set and is projected back onto it to give x⁺.]

Page 236: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Discussion of Projected Gradient

Projected gradient has the same rate as the gradient method!

Can do many of the same tricks (e.g., line-search, acceleration, Barzilai-Borwein, SAG, SVRG).

For projected Newton, you need to do an expensive projection under ‖ · ‖_{H_t}.

Two-metric projection methods allow a Newton-like strategy for bound constraints. Inexact Newton methods allow a Newton-like strategy for optimizing costly functions with simple constraints.


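The projected-gradient iteration described above fits in a few lines. A minimal NumPy sketch (illustrative, not from the slides), using a box constraint whose Euclidean projection is simply a component-wise clip:

```python
import numpy as np

def projected_gradient(grad, x0, lo, hi, alpha=0.1, iters=200):
    # Projected-gradient iteration x <- P_C(x - alpha * grad(x))
    # for the box C = {x : lo <= x <= hi}, whose projection is a clip.
    x = np.clip(np.asarray(x0, dtype=float), lo, hi)
    for _ in range(iters):
        x = np.clip(x - alpha * grad(x), lo, hi)
    return x

# Minimize 0.5*||x - c||^2 over [0, 1]^2 with c outside the box:
# the constrained minimizer is the projection of c onto the box.
c = np.array([2.0, -0.5])
x_star = projected_gradient(lambda x: x - c, np.zeros(2), 0.0, 1.0)
```

Line-search, acceleration, or Barzilai-Borwein step sizes can be dropped into the same loop without changing the projection step.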
Page 239: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Projection Onto Simple Sets

Projections onto simple sets:

Non-negativity: argmin_{y ≥ 0} ‖y − x‖ = max{x, 0}.

Bound constraints: argmin_{l ≤ y ≤ u} ‖y − x‖ = max{l, min{x, u}}.

Hyperplane: argmin_{aᵀy = b} ‖y − x‖ = x + (b − aᵀx) a/‖a‖².

Halfspace: argmin_{aᵀy ≥ b} ‖y − x‖ = x if aᵀx ≥ b, and x + (b − aᵀx) a/‖a‖² if aᵀx < b.

Norm ball: argmin_{‖y‖ ≤ τ} ‖y − x‖ = x if ‖x‖ ≤ τ, and τx/‖x‖ otherwise.

Linear-time algorithm for the ℓ1-norm ball ‖y‖₁ ≤ τ.

Linear-time algorithm for the probability simplex y ≥ 0, Σᵢ yᵢ = 1.

Intersection of simple sets: Dykstra's algorithm.

We can solve large instances of problems with these constraints.


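The closed-form projections listed above translate directly into code. A minimal NumPy sketch (illustrative, not from the slides); `project_simplex` uses the O(n log n) sort-based variant, with the linear-time variant obtained by replacing the sort with a selection step:

```python
import numpy as np

def project_halfspace(x, a, b):
    # argmin_{a^T y >= b} ||y - x||: shift along a only if the constraint is violated.
    ax = a @ x
    return x if ax >= b else x + (b - ax) * a / (a @ a)

def project_l2_ball(x, tau):
    # argmin_{||y|| <= tau} ||y - x||: rescale only if x is outside the ball.
    nrm = np.linalg.norm(x)
    return x if nrm <= tau else tau * x / nrm

def project_simplex(x):
    # Projection onto {y : y >= 0, sum(y) = 1}; O(n log n) via sorting.
    u = np.sort(x)[::-1]                 # sorted in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, x.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(x - css[rho] / (rho + 1.0), 0.0)
```

Each function is a direct transcription of the corresponding formula, so they cost O(n) (or O(n log n) for the simplex) per call.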
Page 247: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Proximal-Gradient Method

A generalization of projected gradient is proximal gradient.

The proximal-gradient method addresses problems of the form

min_x f(x) + r(x),

where f is smooth but r is a general convex function.

Applies the proximity operator of r to a gradient descent step on f:

x_GD^t = x^t − α_t ∇f(x^t),
x^{t+1} = argmin_y { (1/2)‖y − x_GD^t‖² + α_t r(y) }.

Equivalent to using the approximation

x^{t+1} = argmin_y { f(x^t) + ∇f(x^t)ᵀ(y − x^t) + (1/(2α_t))‖y − x^t‖² + r(y) }.

Convergence rates are still the same as for minimizing f.


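Once the proximity operator of r is available, the update above is a two-line loop. A minimal sketch (illustrative; the function names are my own), using the indicator of the non-negative orthant, whose prox is a clip, so the iteration reduces to projected gradient:

```python
import numpy as np

def proximal_gradient(grad_f, prox_r, x0, alpha, iters=200):
    # Proximal-gradient iteration: x <- prox_{alpha*r}(x - alpha * grad f(x)).
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = prox_r(x - alpha * grad_f(x), alpha)
    return x

def prox_nonneg(v, t):
    # Prox of the indicator of {x >= 0} is projection, independent of t.
    return np.maximum(v, 0.0)

# Minimize 0.5*||x - c||^2 subject to x >= 0; the answer is max(c, 0).
c = np.array([1.0, -2.0])
x_star = proximal_gradient(lambda x: x - c, prox_nonneg, np.zeros(2), alpha=0.5)
```

Swapping in a different `prox_r` (soft-thresholding, a norm-ball projection, ...) changes the regularizer without touching the loop.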
Page 250: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Proximal Operator, Iterative Soft Thresholding

The proximal operator is the solution to

prox_r[y] = argmin_{x ∈ R^P} { r(x) + (1/2)‖x − y‖² }.

For L1-regularization, we obtain iterative soft-thresholding:

x^{t+1} = softThresh_{αλ}[x^t − α∇f(x^t)].

Example with λ = 1:

Input      Threshold   Soft-Threshold
 0.6715     0            0
−1.2075    −1.2075      −0.2075
 0.7172     0            0
 1.6302     1.6302       0.6302
 0.4889     0            0


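The soft-threshold operator has a one-line closed form; this sketch (illustrative, not from the slides) reproduces the table's example with threshold αλ = 1:

```python
import numpy as np

def soft_threshold(x, t):
    # Entries with |x_i| <= t become exactly 0; the rest shrink toward 0 by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([0.6715, -1.2075, 0.7172, 1.6302, 0.4889])
z = soft_threshold(x, 1.0)   # matches the Soft-Threshold column of the table
```

The exact zeros produced here are what make iterative soft-thresholding yield sparse iterates.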
Page 255: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Special case of Projected-Gradient Methods

Projected-gradient methods are another special case:

r(x) = 0 if x ∈ C, ∞ if x ∉ C,

gives

x^{t+1} = project_C[x^t − α∇f(x^t)].

[Figure, built up over pages 255–261: from the iterate x on the contours of f(x), the gradient step x − αf′(x) leaves the feasible set and is projected back onto it to obtain x+.]


Page 262: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Exact Proximal-Gradient Methods

For what problems can we apply these methods?

We can efficiently compute the proximity operator for:
1 L1-regularization.
2 Group ℓ1-regularization.
3 Lower and upper bounds.
4 A small number of linear constraints.
5 Probability constraints.
6 A few other simple regularizers/constraints.

Can solve these non-smooth/constrained problems as fast as smooth/unconstrained problems!

We can again do many of the same tricks (line-search, acceleration, Barzilai-Borwein, two-metric projection, inexact proximal operators, SAG, SVRG).


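For group ℓ1-regularization (item 2 above), the proximity operator shrinks each group as a block rather than entry-wise. An illustrative sketch (names and grouping are my own):

```python
import numpy as np

def group_soft_threshold(x, groups, t):
    # Prox of t * sum_g ||x_g||: each group shrinks toward 0 as a block,
    # and groups with norm at most t are zeroed out entirely (group sparsity).
    out = np.zeros_like(x, dtype=float)
    for g in groups:
        idx = np.asarray(g)
        nrm = np.linalg.norm(x[idx])
        if nrm > t:
            out[idx] = (1.0 - t / nrm) * x[idx]
    return out

x = np.array([3.0, 4.0, 0.1, -0.2])
z = group_soft_threshold(x, [[0, 1], [2, 3]], 1.0)  # second group vanishes as a block
```

With singleton groups this reduces to ordinary soft-thresholding.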
Page 271: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Alternating Direction Method of Multipliers

The alternating direction method of multipliers (ADMM) solves:

min_{Ax + By = c} f(x) + r(y).

Alternate between prox-like operators with respect to f and r.

Can introduce constraints to convert to this form:

min_x f(Ax) + r(x) ⇔ min_{x = Ay} f(x) + r(y),

min_x f(x) + r(Bx) ⇔ min_{y = Bx} f(x) + r(y).

If the prox cannot be computed exactly: linearized ADMM.


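As a concrete instance of the alternating prox-like updates, here is an illustrative ADMM sketch for the lasso, min 0.5‖Ax − b‖² + λ‖z‖₁ subject to x − z = 0 (a standard special case of the template above, not taken from the slides; the dense inverse is a small-problem shortcut):

```python
import numpy as np

def admm_lasso(A, b, lam=1.0, rho=1.0, iters=200):
    # ADMM for: min 0.5*||Ax - b||^2 + lam*||z||_1  subject to  x - z = 0.
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))   # factor once; sketch for small n
    Atb = A.T @ b
    for _ in range(iters):
        x = M @ (Atb + rho * (z - u))                                     # ridge-like x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)   # soft-threshold z-update
        u = u + x - z                                                     # scaled dual update
    return z

# With A = I the lasso solution is soft-thresholding of b.
A = np.eye(2)
b = np.array([1.5, 0.2])
z_star = admm_lasso(A, b, lam=1.0)
```

Each half-step is a prox: the x-update is the prox of the smooth term (a linear solve), the z-update is the prox of the ℓ1 term.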
Page 275: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Frank-Wolfe Method

In some cases the projected gradient step

x^{t+1} = argmin_{y ∈ C} { f(x^t) + ∇f(x^t)ᵀ(y − x^t) + (1/(2α_t))‖y − x^t‖² },

may be hard to compute (e.g., dual of max-margin Markov networks).

The Frank-Wolfe method simply uses:

x^{t+1} = argmin_{y ∈ C} { f(x^t) + ∇f(x^t)ᵀ(y − x^t) },

requires compact C; the next iterate is a convex combination of x^t and this linear minimizer.

Each iterate can be written as a convex combination of vertices of C.

O(1/t) rate for smooth convex objectives; some linear convergence results for smooth and strongly-convex objectives. [Jaggi, 2013]


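Over the ℓ1-ball the linear subproblem has a closed-form vertex solution, so Frank-Wolfe needs no projection at all. An illustrative sketch (not from the slides) with the standard 2/(t + 2) open-loop step size:

```python
import numpy as np

def frank_wolfe_l1(grad, x0, tau, iters=1000):
    # Frank-Wolfe over C = {x : ||x||_1 <= tau}. The linear subproblem
    # min_{s in C} grad^T s is solved by a signed, scaled coordinate vector.
    x = np.asarray(x0, dtype=float)
    for t in range(iters):
        g = grad(x)
        i = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[i] = -tau * np.sign(g[i])
        gamma = 2.0 / (t + 2.0)            # standard open-loop step size
        x = (1.0 - gamma) * x + gamma * s  # convex combination stays in C
    return x

# min 0.5*||x - c||^2 over the unit l1-ball; c is feasible, so x approaches c.
c = np.array([0.3, -0.4])
x_star = frank_wolfe_l1(lambda x: x - c, np.zeros(2), 1.0)
```

Because every update mixes in one vertex, the iterate stays a sparse convex combination of vertices of C.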
Page 278: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Alternatives to Quadratic/Linear Surrogates

Mirror descent uses the iterations [Beck & Teboulle, 2003]

x^{t+1} = argmin_{y ∈ C} { f(x^t) + ∇f(x^t)ᵀ(y − x^t) + (1/(2α_t)) D(x^t, y) },

where D is a Bregman divergence:

D = ‖x^t − y‖² (gradient method).

D = ‖x^t − y‖²_H (Newton's method).

D = Σᵢ x^t_i log(x^t_i / y_i) − Σᵢ (x^t_i − y_i) (exponentiated gradient).

Mairal [2013, 2014] considers general surrogate optimization:

x^{t+1} = argmin_{y ∈ C} g(y),

where g upper-bounds f, g(x^t) = f(x^t), ∇g(x^t) = ∇f(x^t), and ∇g − ∇f is Lipschitz-continuous.

Get O(1/k) and linear convergence rates depending on g − f.


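With the KL divergence, the mirror-descent update over the probability simplex becomes a multiplicative (exponentiated-gradient) update followed by renormalization. An illustrative sketch (not from the slides):

```python
import numpy as np

def exponentiated_gradient(grad, x0, alpha=0.5, iters=300):
    # Mirror descent with the KL Bregman divergence over the probability
    # simplex: multiply by exp(-alpha * gradient), then renormalize.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x * np.exp(-alpha * grad(x))
        x /= x.sum()
    return x

# Minimize <c, x> over the simplex: mass concentrates on the smallest c_i.
c = np.array([0.3, 0.1, 0.7])
x_star = exponentiated_gradient(lambda x: c, np.ones(3) / 3)
```

The multiplicative form keeps the iterates strictly positive and on the simplex, so no explicit projection is ever needed.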
Page 280: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Dual Methods

Strongly-convex problems have smooth duals.

Solve the dual instead of the primal.

SVM non-smooth strongly-convex primal:

min_x C Σ_{i=1}^N max{0, 1 − b_i a_iᵀx} + (1/2)‖x‖².

SVM smooth dual:

min_{0 ≤ α ≤ C} (1/2) αᵀAAᵀα − Σ_{i=1}^N α_i.

A smooth bound-constrained problem:

Two-metric projection (efficient Newton-like method).

Randomized coordinate descent (part 2 of this talk).


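Since the dual's constraints form a box, even plain projected gradient applies to it. An illustrative sketch (not from the slides), assuming the labels are folded into the quadratic as Q = diag(b) A Aᵀ diag(b), and using 1/trace(Q) as a safe step size:

```python
import numpy as np

def svm_dual_projected_gradient(X, y, C=1.0, iters=500):
    # Projected gradient on the smooth SVM dual
    #   min_{0 <= alpha <= C} 0.5 * alpha^T Q alpha - sum(alpha),
    # with Q = diag(y) X X^T diag(y); the box projection is just a clip.
    Ay = y[:, None] * X
    Q = Ay @ Ay.T
    eta = 1.0 / np.trace(Q)        # safe step: trace(Q) bounds the largest eigenvalue
    alpha = np.zeros(len(y))
    for _ in range(iters):
        alpha = np.clip(alpha - eta * (Q @ alpha - 1.0), 0.0, C)
    w = Ay.T @ alpha               # recover the primal weight vector
    return alpha, w

# Tiny separable example.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual_projected_gradient(X, y)
```

Randomized coordinate descent (part 2 of this talk) replaces the full-gradient step with single-coordinate updates on the same dual.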
Page 282: Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)

Motivation Gradient Method Stochastic Subgradient Finite-Sum Methods Non-Smooth Objectives

Summary

Summary:

Part 1: Convex functions have special properties that allow us to efficiently minimize them.

Part 2: Gradient-based methods allow elegant scaling with the dimensionality of the problem.

Part 3: Stochastic-gradient methods allow scaling with the number of training examples, at the cost of a slower convergence rate.

Part 4: For finite datasets, SAG fixes the convergence rate of stochastic gradient methods, and SVRG fixes the memory problem of SAG.

Part 5: These building blocks can be extended to solve a variety of constrained and non-smooth problems.