
Page 1

Randomized Smoothing Techniques in Optimization

John Duchi

Based on joint work with Peter Bartlett, Michael Jordan, Martin Wainwright, Andre Wibisono

Stanford University

Information Systems Laboratory Seminar, October 2014

Page 2

Problem Statement

Goal: solve the following problem

minimize f(x)

subject to x ∈ X

Page 3

Problem Statement

Goal: solve the following problem

minimize f(x)

subject to x ∈ X

Often, we will assume

f(x) := \frac{1}{n} \sum_{i=1}^n F(x; ξ_i) \quad \text{or} \quad f(x) := E[F(x; ξ)]

Page 4

Gradient Descent

Goal: solve

minimize f(x)

Technique: go down the slope,

x_{t+1} = x_t − α ∇f(x_t)

[Figure: f(x) with its linear lower bound f(y) + ⟨∇f(y), x − y⟩ at the point (y, f(y))]
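As a concrete illustration of the update rule, a minimal gradient-descent sketch in Python (the quadratic objective, step size, and iteration count are illustrative assumptions, not part of the slides):

import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_steps=100):
    """Go down the slope: x_{t+1} = x_t - alpha * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - alpha * grad_f(x)
    return x

# Illustrative use on the smooth convex quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
x_hat = gradient_descent(lambda x: x, x0=np.ones(5))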

Page 5

When is optimization easy?

Easy problem: function is convex, nice and smooth

Page 6

When is optimization easy?

Easy problem: function is convex, nice and smooth

Not so easy problem: function is non-smooth

Page 7

When is optimization easy?

Easy problem: function is convex, nice and smooth

Not so easy problem: function is non-smooth

Even harder problems:

◮ We cannot compute gradients ∇f(x)

◮ Function f is non-convex and non-smooth

Page 8

Example 1: Robust regression

◮ Data in pairs ξ_i = (a_i, b_i) ∈ R^d × R
◮ Want to estimate b_i ≈ a_i^⊤ x

Page 9

Example 1: Robust regression

◮ Data in pairs ξ_i = (a_i, b_i) ∈ R^d × R
◮ Want to estimate b_i ≈ a_i^⊤ x
◮ To avoid outliers, minimize

f(x) = \frac{1}{n} \sum_{i=1}^n |a_i^⊤ x − b_i| = \frac{1}{n} ‖Ax − b‖_1

[Figure: the absolute-value loss plotted against a_i^⊤ x − b_i]
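A minimal sketch of this objective and one of its subgradients in Python (the random data and dimensions are illustrative only):

import numpy as np

def robust_regression_objective(x, A, b):
    """f(x) = (1/n) * ||Ax - b||_1, the absolute-loss (robust) regression objective."""
    return np.abs(A @ x - b).mean()

def robust_regression_subgradient(x, A, b):
    """One subgradient of f at x: (1/n) * A^T sign(Ax - b)."""
    return A.T @ np.sign(A @ x - b) / len(b)

# Illustrative data: n = 100 pairs (a_i, b_i) with a_i in R^5.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
print(robust_regression_objective(np.zeros(5), A, b))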

Page 10

Example 2: Protein Structure Prediction

[Figure: (A) the peptide sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV with six numbered positions; (B) a graph on nodes 1-6 representing its structure (Grace et al., PNAS 2004)]

Page 11

Protein Structure Prediction

Featurize each edge e in the graph with a vector ξ_e. Labels ν are matchings in the graph; the set V is all matchings.

[Figure: graph on nodes 1-6 with edge weights x^⊤ ξ_{1→2}, x^⊤ ξ_{1→3}, x^⊤ ξ_{4→6}, ...]

Page 12

Protein Structure Prediction

Featurize each edge e in the graph with a vector ξ_e. Labels ν are matchings in the graph; the set V is all matchings.

[Figure: graph on nodes 1-6 with edge weights x^⊤ ξ_{1→2}, x^⊤ ξ_{1→3}, x^⊤ ξ_{4→6}, ...]

Goal: Learn weights x so that

\arg\max_{ν̂ ∈ V} \Big\{ \sum_{e ∈ ν̂} ξ_e^⊤ x \Big\} = ν

i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is correct

Page 13

Protein Structure Prediction

[Figure: graph on nodes 1-6 with edge weights x^⊤ ξ_{1→2}, x^⊤ ξ_{1→3}, x^⊤ ξ_{4→6}, ...]

Goal: Learn weights x so that

\arg\max_{ν̂ ∈ V} \Big\{ \sum_{e ∈ ν̂} ξ_e^⊤ x \Big\} = ν

i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is correct

Page 14

Protein Structure Prediction

[Figure: graph on nodes 1-6 with edge weights x^⊤ ξ_{1→2}, x^⊤ ξ_{1→3}, x^⊤ ξ_{4→6}, ...]

Goal: Learn weights x so that

\arg\max_{ν̂ ∈ V} \Big\{ \sum_{e ∈ ν̂} ξ_e^⊤ x \Big\} = ν

i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is correct

Loss function: L(ν, ν̂) is the number of disagreements between the matchings

F(x; \{ξ, ν\}) := \max_{ν̂ ∈ V} \Big( L(ν, ν̂) + x^⊤ \sum_{e ∈ ν̂} ξ_e − x^⊤ \sum_{e ∈ ν} ξ_e \Big).

Page 15

When is optimization easy?

Easy problem: function is convex, nice and smooth

Not so easy problem: function is non-smooth

Even harder problems:

◮ We cannot compute gradients ∇f(x)

◮ Function f is non-convex and non-smooth

Page 16

One technique to address many of these

Instead of only using f(x) and ∇f(x) to solve

minimize f(x),

get more global information

Page 17

One technique to address many of these

Instead of only using f(x) and ∇f(x) to solve

minimize f(x),

get more global information

Let Z be a random variable, and for small u, look at f near the points

f(x + uZ),

where u is small

Page 18

One technique to address many of these

Instead of only using f(x) and ∇f(x) to solve

minimize f(x),

get more global information

Let Z be a random variable, and for small u, look at f near the points

f(x + uZ),

where u is small

[Figure: perturbations of size u around the current point]

Page 19

One technique to address many of these

Instead of only using f(x) and ∇f(x) to solve

minimize f(x),

get more global information

Let Z be a random variable, and for small u, look at f near the points

f(x + uZ),

where u is small

[Figure: perturbations of size u around the current point]

Page 20

Three instances

I. Solving previously unsolvable problems [Burke, Lewis, Overton 2005]
◮ Non-smooth, non-convex problems

Page 21

Three instances

I. Solving previously unsolvable problems [Burke, Lewis, Overton 2005]
◮ Non-smooth, non-convex problems

II. Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014]
◮ Smooth and non-smooth zero-order stochastic and non-stochastic optimization problems

Page 22

Three instances

I. Solving previously unsolvable problems [Burke, Lewis, Overton 2005]
◮ Non-smooth, non-convex problems

II. Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014]
◮ Smooth and non-smooth zero-order stochastic and non-stochastic optimization problems

III. Parallelism: really fast solutions for large-scale problems [D., Bartlett, Wainwright 2013]
◮ Smooth and non-smooth stochastic optimization problems

Page 23

Instance I: Gradient Sampling Algorithm

Problem: Solve

\text{minimize}_{x ∈ X} \; f(x)

where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]

Page 24

Instance I: Gradient Sampling Algorithm

Problem: Solve

\text{minimize}_{x ∈ X} \; f(x)

where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]

At each iteration t,

◮ Draw Z_1, ..., Z_m i.i.d. with ‖Z_i‖ ≤ 1
◮ Set g_{i,t} = ∇f(x_t + u Z_i)
◮ Set the gradient g_t as

g_t = \arg\min_g \Big\{ ‖g‖_2^2 : g = \sum_i λ_i g_{i,t}, \; λ ≥ 0, \; \sum_i λ_i = 1 \Big\}

◮ Update x_{t+1} = x_t − α g_t, where α > 0 is chosen by line search

[Figure: the sampled points x_t + u Z_i inside a ball of radius u around x_t]
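A minimal sketch of one such iteration in Python. The minimum-norm element of the convex hull is found with a small quadratic program via scipy; the crude candidate-step search, the sampling radius, and the solver choice are illustrative stand-ins, not the full Burke-Lewis-Overton method (which also adapts u and enforces a sufficient-decrease condition).

import numpy as np
from scipy.optimize import minimize

def min_norm_in_hull(grads):
    """Solve min_lambda ||sum_i lambda_i g_i||_2^2 s.t. lambda >= 0, sum_i lambda_i = 1."""
    m = len(grads)
    G = grads @ grads.T  # Gram matrix of the sampled gradients
    res = minimize(lambda lam: lam @ G @ lam, np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
                   method="SLSQP")
    return res.x @ grads

def gradient_sampling_step(f, grad_f, x, u=0.1, m=20, step_sizes=(1.0, 0.5, 0.1, 0.01)):
    """One iteration: sample gradients in a ball of radius u, combine, crude line search."""
    Z = np.random.randn(m, x.size)
    Z /= np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)  # enforce ||Z_i|| <= 1
    g_t = min_norm_in_hull(np.stack([grad_f(x + u * z) for z in Z]))
    return min((x - a * g_t for a in step_sizes), key=f)  # keep the best candidate step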

Page 25

Instance I: Gradient Sampling Algorithm

Problem: Solve

\text{minimize}_{x ∈ X} \; f(x)

where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]

Define the set

G_u(x) := \mathrm{Conv}\big\{ ∇f(x′) : ‖x′ − x‖_2 ≤ u, \; ∇f(x′) \text{ exists} \big\}

Proposition (Burke, Lewis, Overton): There exist cluster points x̄ of the sequence x_t, and for any such cluster point,

0 ∈ G_u(x̄)

Page 26

Instance II: Zero Order Optimization

Problem: We want to solve

\text{minimize}_{x ∈ X} \; f(x) = E[F(x; ξ)]

but we are only allowed to observe function values f(x) (or F(x; ξ))

Page 27

Instance II: Zero Order Optimization

Problem: We want to solve

\text{minimize}_{x ∈ X} \; f(x) = E[F(x; ξ)]

but we are only allowed to observe function values f(x) (or F(x; ξ))

Idea: Approximate the gradient by function differences

f′(y) ≈ g_u := \frac{f(y + u) − f(y)}{u}

[Figure: f with the difference approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩, where u_1 ≪ u_0]

Page 28

Instance II: Zero Order Optimization

Problem: We want to solve

\text{minimize}_{x ∈ X} \; f(x) = E[F(x; ξ)]

but we are only allowed to observe function values f(x) (or F(x; ξ))

Idea: Approximate the gradient by function differences

f′(y) ≈ g_u := \frac{f(y + u) − f(y)}{u}

◮ Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro

[Figure: f with the difference approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩, where u_1 ≪ u_0]

Page 29

Instance II: Zero Order Optimization

Problem: We want to solve

\text{minimize}_{x ∈ X} \; f(x) = E[F(x; ξ)]

but we are only allowed to observe function values f(x) (or F(x; ξ))

Idea: Approximate the gradient by function differences

f′(y) ≈ g_u := \frac{f(y + u) − f(y)}{u}

◮ Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro
◮ Can randomized perturbations give insights?

[Figure: f with the difference approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩, where u_1 ≪ u_0]

Page 30

Stochastic Gradient Descent

Algorithm: At iteration t

◮ Choose a random ξ, set

g_t = ∇F(x_t; ξ)

◮ Update

x_{t+1} = x_t − α g_t
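A minimal sketch of these updates in Python; the least-squares loss, step size, and iterate averaging (used in the guarantee on the following slides) are illustrative assumptions:

import numpy as np

def sgd(grad_F, xis, x0, alpha=0.01, num_steps=1000, seed=0):
    """Stochastic gradient descent: pick a random sample xi, step along -grad_F(x; xi)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    iterates = [x]
    for _ in range(num_steps):
        xi = xis[rng.integers(len(xis))]   # choose a random xi
        x = x - alpha * grad_F(x, xi)      # x_{t+1} = x_t - alpha * g_t
        iterates.append(x)
    return np.mean(iterates, axis=0)       # averaged iterate x_hat_T

# Illustrative use: F(x; (a, b)) = 0.5 * (a^T x - b)^2, so grad_F(x; (a, b)) = (a^T x - b) a.
rng = np.random.default_rng(1)
data = [(rng.normal(size=3), rng.normal()) for _ in range(50)]
x_hat = sgd(lambda x, ab: (ab[0] @ x - ab[1]) * ab[0], data, x0=np.zeros(3))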

Page 31

Stochastic Gradient Descent

Algorithm: At iteration t

◮ Choose a random ξ, set

g_t = ∇F(x_t; ξ)

◮ Update

x_{t+1} = x_t − α g_t

Page 32

Stochastic Gradient Descent

Algorithm: At iteration t

◮ Choose a random ξ, set

g_t = ∇F(x_t; ξ)

◮ Update

x_{t+1} = x_t − α g_t

Theorem (Russians): Let x̂_T = \frac{1}{T} \sum_{t=1}^T x_t and assume R ≥ ‖x^* − x_1‖_2 and G^2 ≥ E[‖g_t‖_2^2]. Then

E[f(x̂_T) − f(x^*)] ≤ RG \cdot \frac{1}{\sqrt{T}}

Page 33

Stochastic Gradient Descent

Algorithm: At iteration t

◮ Choose a random ξ, set

g_t = ∇F(x_t; ξ)

◮ Update

x_{t+1} = x_t − α g_t

Theorem (Russians): Let x̂_T = \frac{1}{T} \sum_{t=1}^T x_t and assume R ≥ ‖x^* − x_1‖_2 and G^2 ≥ E[‖g_t‖_2^2]. Then

E[f(x̂_T) − f(x^*)] ≤ RG \cdot \frac{1}{\sqrt{T}}

Note: the dependence on G is important

Page 34

Derivative-free gradient descent

E[f(x̂_T) − f(x^*)] ≤ RG \cdot \frac{1}{\sqrt{T}}

Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate?

Page 35

Derivative-free gradient descent

E[f(x̂_T) − f(x^*)] ≤ RG \cdot \frac{1}{\sqrt{T}}

Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate?

First idea for a gradient estimator:

◮ Sample Z ∼ μ satisfying E_μ[ZZ^⊤] = I_{d×d}
◮ Gradient estimator at x:

g = \frac{f(x + uZ) − f(x)}{u} Z

Perform gradient descent using these g

Page 36

Two-point gradient estimates

◮ At any point x and any direction z, for small u > 0

\frac{f(x + uz) − f(x)}{u} ≈ f′(x, z) := \lim_{h ↓ 0} \frac{f(x + hz) − f(x)}{h}

◮ If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩
◮ If E[ZZ^⊤] = I, then E[f′(x, Z) Z] = E[ZZ^⊤ ∇f(x)] = ∇f(x)

Page 37

Two-point gradient estimates

◮ At any point x and any direction z, for small u > 0

\frac{f(x + uz) − f(x)}{u} ≈ f′(x, z) := \lim_{h ↓ 0} \frac{f(x + hz) − f(x)}{h}

◮ If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩
◮ If E[ZZ^⊤] = I, then E[f′(x, Z) Z] = E[ZZ^⊤ ∇f(x)] = ∇f(x)

[Figure: individual random estimates (left); their average ≈ ∇f (right)]

Page 38

Two-point stochastic gradient: differentiable functions

To solve the d-dimensional problem

\text{minimize}_{x ∈ X ⊂ R^d} \; f(x) := E[F(x; ξ)]

Algorithm: Iterate

◮ Draw ξ according to its distribution, draw Z ∼ μ with Cov(Z) = I
◮ Set u_t = u/t and

g_t = \frac{F(x_t + u_t Z; ξ) − F(x_t; ξ)}{u_t} Z

◮ Update x_{t+1} = x_t − α g_t
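A minimal sketch of this iteration in Python (the choice of F, the step size, and the iterate averaging are illustrative assumptions):

import numpy as np

def two_point_zero_order_sgd(F, sample_xi, x0, u=1.0, alpha=0.01, num_steps=1000, seed=0):
    """Zero-order SGD: estimate the gradient from two evaluations of F(.; xi)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    iterates = [x]
    for t in range(1, num_steps + 1):
        xi = sample_xi(rng)
        Z = rng.normal(size=x.size)                    # Z ~ N(0, I), so Cov(Z) = I
        u_t = u / t                                    # shrinking perturbation u_t = u / t
        g = (F(x + u_t * Z, xi) - F(x, xi)) / u_t * Z  # two-point gradient estimate
        x = x - alpha * g
        iterates.append(x)
    return np.mean(iterates, axis=0)                   # averaged iterate x_hat_T

# Illustrative use: F(x; xi) = 0.5 * ||x - xi||^2 with xi ~ N(0, I) in R^10.
x_hat = two_point_zero_order_sgd(lambda x, xi: 0.5 * np.sum((x - xi) ** 2),
                                 lambda rng: rng.normal(size=10), x0=np.ones(10))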

Page 39

Two-point stochastic gradient: differentiable functions

To solve the d-dimensional problem

\text{minimize}_{x ∈ X ⊂ R^d} \; f(x) := E[F(x; ξ)]

Algorithm: Iterate

◮ Draw ξ according to its distribution, draw Z ∼ μ with Cov(Z) = I
◮ Set u_t = u/t and

g_t = \frac{F(x_t + u_t Z; ξ) − F(x_t; ξ)}{u_t} Z

◮ Update x_{t+1} = x_t − α g_t

Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x^* − x_1‖_2 and E[‖∇F(x; ξ)‖_2^2] ≤ G^2 for all x, then

E[f(x̂_T) − f(x^*)] ≤ RG \cdot \frac{\sqrt{d}}{\sqrt{T}} + O\Big( u^2 \frac{\log T}{T} \Big).

Page 40

Comparisons to knowing the gradient

Convergence rate scaling

RG \cdot \frac{1}{\sqrt{T}} \quad \text{versus} \quad RG \cdot \frac{\sqrt{d}}{\sqrt{T}}

Page 41

Comparisons to knowing the gradient

Convergence rate scaling

RG \cdot \frac{1}{\sqrt{T}} \quad \text{versus} \quad RG \cdot \frac{\sqrt{d}}{\sqrt{T}}

◮ To achieve ε-accuracy, 1/ε^2 versus d/ε^2

Page 42

Comparisons to knowing the gradient

Convergence rate scaling

RG \cdot \frac{1}{\sqrt{T}} \quad \text{versus} \quad RG \cdot \frac{\sqrt{d}}{\sqrt{T}}

◮ To achieve ε-accuracy, 1/ε^2 versus d/ε^2

[Figure: optimality gap versus iterations for SGD and the zero-order method]

Page 43

Comparisons to knowing the gradient

Convergence rate scaling

RG \cdot \frac{1}{\sqrt{T}} \quad \text{versus} \quad RG \cdot \frac{\sqrt{d}}{\sqrt{T}}

◮ To achieve ε-accuracy, 1/ε^2 versus d/ε^2

[Figures: optimality gap versus iterations (left) and versus rescaled iterations (right) for SGD and the zero-order method]

Page 44

Two-point stochastic gradient: non-differentiable functions

Problem: If f is non-differentiable, “kinks” make estimates too large

Page 45

Two-point stochastic gradient: non-differentiable functions

Problem: If f is non-differentiable, “kinks” make estimates too large

Page 46

Two-point stochastic gradient: non-differentiable functions

Problem: If f is non-differentiable, “kinks” make estimates too large

Example: Let f(x) = ‖x‖_2. Then if E[ZZ^⊤] = I_{d×d}, at x = 0

E\big[ ‖(f(uZ) − 0) Z / u‖_2^2 \big] = E\big[ ‖Z‖_2^2 \, ‖Z‖_2^2 \big] ≥ d^2

Page 47

Two-point stochastic gradient: non-differentiable functions

Problem: If f is non-differentiable, “kinks” make estimates too large

Example: Let f(x) = ‖x‖_2. Then if E[ZZ^⊤] = I_{d×d}, at x = 0

E\big[ ‖(f(uZ) − 0) Z / u‖_2^2 \big] = E\big[ ‖Z‖_2^2 \, ‖Z‖_2^2 \big] ≥ d^2

Idea: More randomization!

[Figure: two perturbation scales u_1 and u_2]

Page 48

Two-point stochastic gradient: non-differentiable functions

Problem: If f is non-differentiable, “kinks” make estimates too large

Proposition (D., Jordan, Wainwright, Wibisono): If Z_1, Z_2 are N(0, I_{d×d}) or uniform on ‖z‖_2 ≤ \sqrt{d}, then

g := \frac{f(x + u_1 Z_1 + u_2 Z_2) − f(x + u_1 Z_1)}{u_2} Z_2

satisfies

E[g] = ∇f_{u_1}(x) + O(u_2/u_1) \quad \text{and} \quad E[‖g‖_2^2] ≤ d \Big( \sqrt{\tfrac{u_2}{u_1}}\, d + \log(2d) \Big).

Page 49

Two-point stochastic gradient: non-differentiable functions

Problem: If f is non-differentiable, “kinks” make estimates too large

Proposition (D., Jordan, Wainwright, Wibisono): If Z_1, Z_2 are N(0, I_{d×d}) or uniform on ‖z‖_2 ≤ \sqrt{d}, then

g := \frac{f(x + u_1 Z_1 + u_2 Z_2) − f(x + u_1 Z_1)}{u_2} Z_2

satisfies

E[g] = ∇f_{u_1}(x) + O(u_2/u_1) \quad \text{and} \quad E[‖g‖_2^2] ≤ d \Big( \sqrt{\tfrac{u_2}{u_1}}\, d + \log(2d) \Big).

Note: If u_2/u_1 → 0, the scaling is linear in d

Page 50

Two-point sub-gradient: non-differentiable functions

To solve the d-dimensional problem

minimize f(x) subject to x ∈ X ⊂ R^d

Algorithm: Iterate

◮ Draw Z_1 ∼ μ and Z_2 ∼ μ with Cov(Z) = I
◮ Set u_{t,1} = u/t, u_{t,2} = u/t^2, and

g_t = \frac{f(x_t + u_{t,1} Z_1 + u_{t,2} Z_2) − f(x_t + u_{t,1} Z_1)}{u_{t,2}} Z_2

◮ Update x_{t+1} = x_t − α g_t
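A minimal sketch of this iteration in Python. The slide's divisor is read as u_{t,2}, matching the estimator in the proposition above; the step size, iterate averaging, and test function are illustrative assumptions.

import numpy as np

def double_smoothed_zero_order_method(f, x0, u=1.0, alpha=0.01, num_steps=1000, seed=0):
    """Two-point estimator with two perturbation scales, for non-differentiable f."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    iterates = [x]
    for t in range(1, num_steps + 1):
        Z1, Z2 = rng.normal(size=x.size), rng.normal(size=x.size)
        u1, u2 = u / t, u / t ** 2        # u_{t,1} = u/t, u_{t,2} = u/t^2
        # Difference around the pre-smoothed point x + u1 * Z1, divided by the inner scale u2.
        g = (f(x + u1 * Z1 + u2 * Z2) - f(x + u1 * Z1)) / u2 * Z2
        x = x - alpha * g
        iterates.append(x)
    return np.mean(iterates, axis=0)      # averaged iterate x_hat_T

# Illustrative use on the non-smooth "kink" example above, f(x) = ||x||_2.
x_hat = double_smoothed_zero_order_method(np.linalg.norm, x0=np.ones(10))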

Page 51

Two-point sub-gradient: non-differentiable functions

To solve the d-dimensional problem

minimize f(x) subject to x ∈ X ⊂ R^d

Algorithm: Iterate

◮ Draw Z_1 ∼ μ and Z_2 ∼ μ with Cov(Z) = I
◮ Set u_{t,1} = u/t, u_{t,2} = u/t^2, and

g_t = \frac{f(x_t + u_{t,1} Z_1 + u_{t,2} Z_2) − f(x_t + u_{t,1} Z_1)}{u_{t,2}} Z_2

◮ Update x_{t+1} = x_t − α g_t

Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x^* − x_1‖_2 and ‖∂f(x)‖_2^2 ≤ G^2 for all x, then

E[f(x̂_T) − f(x^*)] ≤ RG \cdot \frac{\sqrt{d \log d}}{\sqrt{T}} + O\Big( u \frac{\log T}{T} \Big).

Page 52

Two-point sub-gradient: non-differentiable functions

To solve the d-dimensional problem

\text{minimize}_{x ∈ X ⊂ R^d} \; f(x) := E[F(x; ξ)]

Algorithm: Iterate

◮ Draw ξ according to its distribution, draw Z_1 ∼ μ and Z_2 ∼ μ with Cov(Z) = I
◮ Set u_{t,1} = u/t, u_{t,2} = u/t^2, and

g_t = \frac{F(x_t + u_{t,1} Z_1 + u_{t,2} Z_2; ξ) − F(x_t + u_{t,1} Z_1; ξ)}{u_{t,2}} Z_2

◮ Update x_{t+1} = x_t − α g_t

Page 53

Two-point sub-gradient: non-differentiable functions

To solve the d-dimensional problem

\text{minimize}_{x ∈ X ⊂ R^d} \; f(x) := E[F(x; ξ)]

Algorithm: Iterate

◮ Draw ξ according to its distribution, draw Z_1 ∼ μ and Z_2 ∼ μ with Cov(Z) = I
◮ Set u_{t,1} = u/t, u_{t,2} = u/t^2, and

g_t = \frac{F(x_t + u_{t,1} Z_1 + u_{t,2} Z_2; ξ) − F(x_t + u_{t,1} Z_1; ξ)}{u_{t,2}} Z_2

◮ Update x_{t+1} = x_t − α g_t

Corollary (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x^* − x_1‖_2 and E[‖∇F(x; ξ)‖_2^2] ≤ G^2 for all x, then

E[f(x̂_T) − f(x^*)] ≤ RG \cdot \frac{\sqrt{d \log d}}{\sqrt{T}} + O\Big( u \frac{\log T}{T} \Big).

Page 54

Wrapping up zero-order gradient methods

◮ If gradients are available, convergence rates of \sqrt{1/T}
◮ If only zero-order information is available, in both the smooth and non-smooth cases, convergence rates of \sqrt{d/T}
◮ Time to ε-accuracy: 1/ε^2 ↦ d/ε^2

Page 55

Wrapping up zero-order gradient methods

◮ If gradients are available, convergence rates of \sqrt{1/T}
◮ If only zero-order information is available, in both the smooth and non-smooth cases, convergence rates of \sqrt{d/T}
◮ Time to ε-accuracy: 1/ε^2 ↦ d/ε^2
◮ Sharpness: In the stochastic case, no algorithm can do better than those we have provided. That is, there is a lower bound for all zero-order algorithms of

RG \cdot \frac{\sqrt{d}}{\sqrt{T}}.

Page 56

Wrapping up zero-order gradient methods

◮ If gradients are available, convergence rates of \sqrt{1/T}
◮ If only zero-order information is available, in both the smooth and non-smooth cases, convergence rates of \sqrt{d/T}
◮ Time to ε-accuracy: 1/ε^2 ↦ d/ε^2
◮ Sharpness: In the stochastic case, no algorithm can do better than those we have provided. That is, there is a lower bound for all zero-order algorithms of

RG \cdot \frac{\sqrt{d}}{\sqrt{T}}.

◮ Open question: Non-stochastic lower bounds? (Sebastian Pokutta, next week.)

Page 57

Instance III: Parallelization and fast algorithms

Goal: solve the following problem

minimize f(x)

subject to x ∈ X

where

f(x) := \frac{1}{n} \sum_{i=1}^n F(x; ξ_i) \quad \text{or} \quad f(x) := E[F(x; ξ)]

Page 58

Stochastic Gradient Descent

Problem: Tough to compute

f(x) = \frac{1}{n} \sum_{i=1}^n F(x; ξ_i).

Instead: At iteration t

◮ Choose a random ξ_i, set

g_t = ∇F(x_t; ξ_i)

◮ Update

x_{t+1} = x_t − α g_t

Page 59

Stochastic Gradient Descent

Problem: Tough to compute

f(x) = \frac{1}{n} \sum_{i=1}^n F(x; ξ_i).

Instead: At iteration t

◮ Choose a random ξ_i, set

g_t = ∇F(x_t; ξ_i)

◮ Update

x_{t+1} = x_t − α g_t

Page 60

What everyone “knows” we should do

Obviously: get a lower-variance estimate of the gradient.

Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use

g_t = \frac{1}{m} \sum_{j=1}^m g_{j,t}
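A minimal sketch of such an averaged (mini-batch) gradient estimate in Python; the function names and sampling scheme are illustrative.

import numpy as np

def averaged_stochastic_gradient(grad_F, x, xis, m, rng):
    """g_t = (1/m) * sum_j g_{j,t}, where each g_{j,t} is an unbiased estimate of grad f(x_t)."""
    batch = [xis[i] for i in rng.integers(len(xis), size=m)]  # sample m data points
    return np.mean([grad_F(x, xi) for xi in batch], axis=0)   # average their gradients

# e.g. g_t = averaged_stochastic_gradient(grad_F, x, data, m=32, rng=np.random.default_rng(0))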

Page 61

What everyone “knows” we should do

Obviously: get a lower-variance estimate of the gradient.

Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use

g_t = \frac{1}{m} \sum_{j=1}^m g_{j,t}

Page 62

What everyone “knows” we should do

Obviously: get a lower-variance estimate of the gradient.

Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use

g_t = \frac{1}{m} \sum_{j=1}^m g_{j,t}

Page 63

What everyone “knows” we should do

Obviously: get a lower-variance estimate of the gradient.

Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use

g_t = \frac{1}{m} \sum_{j=1}^m g_{j,t}

Page 64

Problem: only works for smooth functions.

Page 65

Non-smooth problems we care about:

◮ Classification

F(x; ξ) = F(x; (a, b)) = \big[ 1 − b \, x^⊤ a \big]_+

◮ Robust regression

F(x; (a, b)) = \big| b − x^⊤ a \big|

Page 66

Non-smooth problems we care about:

◮ Classification

F(x; ξ) = F(x; (a, b)) = \big[ 1 − b \, x^⊤ a \big]_+

◮ Robust regression

F(x; (a, b)) = \big| b − x^⊤ a \big|

◮ Structured prediction (ranking, parsing, learning matchings)

F(x; \{ξ, ν\}) = \max_{ν̂ ∈ V} \big[ L(ν, ν̂) + x^⊤ Φ(ξ, ν̂) − x^⊤ Φ(ξ, ν) \big]

Page 67

Difficulties of non-smooth

Intuition: Gradient is poor indicator of global structure

Page 68

Better global estimators

Idea: Ask for subgradients from multiple points

Page 69

Better global estimators

Idea: Ask for subgradients from multiple points

Page 70

The algorithm

Normal approach: sample ξ at random,

g_{j,t} ∈ ∂F(x_t; ξ).

Our approach: add noise to x,

g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ)

Decrease the magnitude u_t over time

[Figure: the current iterate x_t]

Page 71

The algorithm

Normal approach: sample ξ at random,

g_{j,t} ∈ ∂F(x_t; ξ).

Our approach: add noise to x,

g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ)

Decrease the magnitude u_t over time

[Figure: the iterate x_t with sampling radius u_t]

Page 72

The algorithm

Normal approach: sample ξ at random,

g_{j,t} ∈ ∂F(x_t; ξ).

Our approach: add noise to x,

g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ)

Decrease the magnitude u_t over time

[Figure: the iterate x_t with sampling radius u_t]
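A minimal sketch of this perturbed, averaged subgradient estimate in Python; the hinge-loss example and the sampling details are illustrative assumptions.

import numpy as np

def smoothed_minibatch_subgradient(subgrad_F, x, xis, u_t, m, rng):
    """Average of subgradients taken at randomly perturbed points x_t + u_t * Z_j."""
    grads = []
    for i in rng.integers(len(xis), size=m):
        Z = rng.normal(size=x.size)                    # Z_j ~ N(0, I)
        grads.append(subgrad_F(x + u_t * Z, xis[i]))   # g_{j,t} in dF(x_t + u_t Z_j; xi)
    return np.mean(grads, axis=0)

# Illustrative subgradient of the hinge loss F(x; (a, b)) = [1 - b * a^T x]_+ from the earlier slide.
def hinge_subgradient(x, ab):
    a, b = ab
    return -b * a if 1.0 - b * (a @ x) > 0 else np.zeros_like(x)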

Page 73

Algorithm

Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). We maintain a query point and an exploration point

Page 74

Algorithm

Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). We maintain a query point and an exploration point

I. Get the query point and gradients:

y_t = (1 − θ_t) x_t + θ_t z_t

Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation

g_t = \frac{1}{m} \sum_{j=1}^m g_{j,t}, \qquad g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})

Page 75

Algorithm

Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). We maintain a query point and an exploration point

I. Get the query point and gradients:

y_t = (1 − θ_t) x_t + θ_t z_t

Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation

g_t = \frac{1}{m} \sum_{j=1}^m g_{j,t}, \qquad g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})

II. Solve for the exploration point

z_{t+1} = \arg\min_{x ∈ X} \Big\{ \underbrace{\sum_{τ=0}^{t} \frac{1}{θ_τ} ⟨g_τ, x⟩}_{\text{approximate } f} + \underbrace{\frac{1}{2 α_t} ‖x‖_2^2}_{\text{regularize}} \Big\}

Page 76

Algorithm

Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). We maintain a query point and an exploration point

I. Get the query point and gradients:

y_t = (1 − θ_t) x_t + θ_t z_t

Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation

g_t = \frac{1}{m} \sum_{j=1}^m g_{j,t}, \qquad g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})

II. Solve for the exploration point

z_{t+1} = \arg\min_{x ∈ X} \Big\{ \underbrace{\sum_{τ=0}^{t} \frac{1}{θ_τ} ⟨g_τ, x⟩}_{\text{approximate } f} + \underbrace{\frac{1}{2 α_t} ‖x‖_2^2}_{\text{regularize}} \Big\}

III. Interpolate

x_{t+1} = (1 − θ_t) x_t + θ_t z_{t+1}
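A minimal sketch of steps I-III in Python, taking X = R^d so that step II has a closed form; the weight schedule θ_t = 2/(t + 2), the step-size schedule, and the shrinking u_t are standard accelerated-method choices assumed here for illustration, not necessarily the paper's exact settings.

import numpy as np

def accelerated_randomized_smoothing(subgrad_F, sample_xi, x0, u=1.0, alpha=0.1, m=10, num_steps=100, seed=0):
    """Sketch of the query/exploration scheme, with X = R^d so step II is unconstrained."""
    rng = np.random.default_rng(seed)
    x = z = np.asarray(x0, dtype=float)
    weighted_grad_sum = np.zeros_like(x)
    for t in range(num_steps):
        theta_t = 2.0 / (t + 2)                  # assumed accelerated-method weights
        u_t = u / (t + 1)                        # shrinking smoothing radius
        y = (1 - theta_t) * x + theta_t * z      # I. query point
        g = np.mean([subgrad_F(y + u_t * rng.normal(size=x.size), sample_xi(rng))
                     for _ in range(m)], axis=0)  #    averaged smoothed subgradients
        weighted_grad_sum += g / theta_t
        alpha_t = alpha / np.sqrt(t + 1)         # assumed step-size schedule
        z = -alpha_t * weighted_grad_sum         # II. exploration point (unconstrained argmin)
        x = (1 - theta_t) * x + theta_t * z      # III. interpolate
    return x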

Page 77

Theoretical Results

Objective:

\text{minimize}_{x ∈ X} \; f(x) \quad \text{where} \quad f(x) = E[F(x; ξ)]

using m gradient samples for the stochastic gradients.

Page 78

Theoretical Results

Objective:

\text{minimize}_{x ∈ X} \; f(x) \quad \text{where} \quad f(x) = E[F(x; ξ)]

using m gradient samples for the stochastic gradients.

Non-strongly convex objectives:

f(x_T) − f(x^*) = O\Big( \frac{C}{T} + \frac{1}{\sqrt{Tm}} \Big)

Page 79

Theoretical Results

Objective:

\text{minimize}_{x ∈ X} \; f(x) \quad \text{where} \quad f(x) = E[F(x; ξ)]

using m gradient samples for the stochastic gradients.

Non-strongly convex objectives:

f(x_T) − f(x^*) = O\Big( \frac{C}{T} + \frac{1}{\sqrt{Tm}} \Big)

λ-strongly convex objectives:

f(x_T) − f(x^*) = O\Big( \frac{C}{T^2} + \frac{1}{λ T m} \Big)

Page 80

A few remarks on distributing

Convergence rate:

f(x_T) − f(x^*) = O\Big( \frac{1}{T} + \frac{1}{\sqrt{Tm}} \Big)

◮ If communication is expensive, use larger batch sizes m:

(a) Communication cost is c
(b) n computers with batch size m
(c) S total update steps

[Figure: n machines, each computing a batch of size m, with communication cost c]

Page 81

A few remarks on distributing

Convergence rate:

f(x_T) − f(x^*) = O\Big( \frac{1}{T} + \frac{1}{\sqrt{Tm}} \Big)

◮ If communication is expensive, use larger batch sizes m:

(a) Communication cost is c
(b) n computers with batch size m
(c) S total update steps

Backsolve: after T = S(m + c) units of time, the error is

O\Big( \frac{m + c}{T} + \frac{1}{\sqrt{Tn}} \cdot \sqrt{\frac{m + c}{m}} \Big)

[Figure: n machines, each computing a batch of size m, with communication cost c]
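The backsolve is a direct substitution. A short derivation, under the assumption (suggested by the setup above) that with n machines of batch size m the error after S update steps is O(1/S + 1/\sqrt{Snm}):

% Each step takes m + c time units, so T = S(m + c), i.e. S = T/(m + c). Substituting:
\[
O\!\Big(\frac{1}{S} + \frac{1}{\sqrt{Snm}}\Big)
= O\!\Big(\frac{m+c}{T} + \frac{\sqrt{m+c}}{\sqrt{Tnm}}\Big)
= O\!\Big(\frac{m+c}{T} + \frac{1}{\sqrt{Tn}}\cdot\sqrt{\frac{m+c}{m}}\Big).
\]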

Page 82

Experimental results

Page 83

Iteration complexity simulations

Define T(ε, m) = \min\{ t ∈ N : f(x_t) − f(x^*) ≤ ε \}, and solve the robust regression problem

f(x) = \frac{1}{n} \sum_{i=1}^n |a_i^⊤ x − b_i| = \frac{1}{n} ‖Ax − b‖_1

[Figures: iterations to ε-optimality versus the number m of gradient samples, comparing the actual T(ε, m) with the predicted T(ε, m) (two problem instances, log-log axes)]

Page 84

Robustness to stepsize and smoothing

◮ Two parameters: smoothing parameter u, stepsize η

[Figure: optimality gap f(x̂) + ϕ(x̂) − f(x^*) − ϕ(x^*) over a range of stepsizes η and smoothing values 1/u]

Plot: optimality gap after 2000 iterations on a synthetic SVM problem

f(x) + ϕ(x) := \frac{1}{n} \sum_{i=1}^n \big[ 1 − ξ_i^⊤ x \big]_+ + \frac{λ}{2} ‖x‖_2^2

Page 85

Text Classification

Reuters RCV1 dataset, time to ε-optimal solution for

\frac{1}{n} \sum_{i=1}^n \big[ 1 − ξ_i^⊤ x \big]_+ + \frac{λ}{2} ‖x‖_2^2

[Figures: mean time (seconds) to optimality gaps of 0.004 (left) and 0.002 (right) versus the number of worker threads, for batch sizes 10 and 20]

Page 86

Text Classification

Reuters RCV1 dataset, optimization speed for minimizing

\frac{1}{n} \sum_{i=1}^n \big[ 1 − ξ_i^⊤ x \big]_+ + \frac{λ}{2} ‖x‖_2^2

[Figure: optimality gap versus time (seconds), comparing Acc (1), Acc (2), Acc (3), Acc (4), Acc (6), Acc (8), and Pegasos]

Page 87

Parsing

Penn Treebank dataset, learning weights for a hypergraph parser (here ξ is a sentence, ν ∈ V is a parse tree)

\frac{1}{n} \sum_{i=1}^n \max_{ν̂ ∈ V} \big[ L(ν_i, ν̂) + x^⊤ (Φ(ξ_i, ν̂) − Φ(ξ_i, ν_i)) \big] + \frac{λ}{2} ‖x‖_2^2.

[Figures: optimality gap versus iterations (left) and versus time in seconds (right), for curves labeled 1, 2, 4, 6, 8, and RDA]

Page 88

Is smoothing necessary?

Solve the multiple-median problem

f(x) = \frac{1}{n} \sum_{i=1}^n ‖x − ξ_i‖_1, \qquad ξ_i ∈ \{−1, 1\}^d.

Compare with standard stochastic gradient:

[Figure: iterations to ε-optimality versus the number m of gradient samples, for the smoothed and unsmoothed methods]

Page 89

Discussion

◮ Randomized smoothing allows
  ◮ Stationary points of non-convex, non-smooth problems
  ◮ Optimal solutions in zero-order problems (including non-smooth)
  ◮ Parallelization for non-smooth problems

Page 90

Discussion

◮ Randomized smoothing allows
  ◮ Stationary points of non-convex, non-smooth problems
  ◮ Optimal solutions in zero-order problems (including non-smooth)
  ◮ Parallelization for non-smooth problems

◮ Current experiments in consultation with Google for large-scale parsing/translation tasks
◮ Open questions: non-stochastic optimality guarantees? True zero-order optimization?

Page 91

Discussion

◮ Randomized smoothing allows
  ◮ Stationary points of non-convex, non-smooth problems
  ◮ Optimal solutions in zero-order problems (including non-smooth)
  ◮ Parallelization for non-smooth problems

◮ Current experiments in consultation with Google for large-scale parsing/translation tasks
◮ Open questions: non-stochastic optimality guarantees? True zero-order optimization?

Thanks!

Page 92

Discussion

◮ Randomized smoothing allows
  ◮ Stationary points of non-convex, non-smooth problems
  ◮ Optimal solutions in zero-order problems (including non-smooth)
  ◮ Parallelization for non-smooth problems

◮ Current experiments in consultation with Google for large-scale parsing/translation tasks
◮ Open questions: non-stochastic optimality guarantees? True zero-order optimization?

References:

◮ Randomized Smoothing for Stochastic Optimization (D., Bartlett, Wainwright). SIAM Journal on Optimization, 22(2), pages 674-701.
◮ Optimal rates for zero-order convex optimization: the power of two function evaluations (D., Jordan, Wainwright, Wibisono). arXiv:1312.2139 [math.OC].

Available on my webpage (http://web.stanford.edu/~jduchi)
