Randomized Smoothing Techniques in Optimization

John Duchi

Based on joint work with Peter Bartlett, Michael Jordan,

Martin Wainwright, Andre Wibisono

Stanford University

Information Systems Laboratory Seminar, October 2014

Problem Statement

Goal: solve the following problem

minimize f(x)

subject to x ∈ X

Problem Statement

Goal: solve the following problem

minimize f(x)

subject to x ∈ X

Often, we will assume

f(x) := (1/n) ∑_{i=1}^n F(x; ξ_i)   or   f(x) := E[F(x; ξ)]

Gradient Descent

Goal: solve

minimize f(x)

Technique: go down the slope,

x_{t+1} = x_t − α ∇f(x_t)

[Figure: f(x) and its tangent-line lower bound f(y) + 〈∇f(y), x − y〉 at the point (y, f(y)).]
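As a concrete illustration (not from the slides), here is a minimal gradient-descent sketch on a smooth least-squares objective; the data A, b and the step size α are hypothetical choices.

```python
import numpy as np

# Minimal gradient descent sketch on f(x) = 0.5 * ||A x - b||^2 (smooth, convex).
# A, b, and the step size are illustrative, not from the slides.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad_f(x):
    # Gradient of 0.5 * ||A x - b||^2
    return A.T @ (A @ x - b)

x = np.zeros(5)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
for _ in range(200):
    x = x - alpha * grad_f(x)
```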

When is optimization easy?

Easy problem: function is convex, nice, and smooth

Not so easy problem: function is non-smooth

Even harder problems:

◮ We cannot compute gradients ∇f(x)

◮ Function f is non-convex and non-smooth

Example 1: Robust regression

◮ Data in pairs ξ_i = (a_i, b_i) ∈ R^d × R
◮ Want to estimate b_i ≈ a_i⊤x
◮ To avoid outliers, minimize

f(x) = (1/n) ∑_{i=1}^n |a_i⊤x − b_i| = (1/n) ‖Ax − b‖₁
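A short sketch (my own, not from the slides) of this robust-regression objective and one of its subgradients; the arrays A and b stand for hypothetical data.

```python
import numpy as np

# Robust regression: f(x) = (1/n) * ||A x - b||_1.
# A subgradient is (1/n) * A^T sign(A x - b) (any value in [-1, 1] at exact zeros).
def objective(x, A, b):
    return np.mean(np.abs(A @ x - b))

def subgradient(x, A, b):
    n = A.shape[0]
    return A.T @ np.sign(A @ x - b) / n
```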

Example 2: Protein Structure Prediction

[Figure: amino-acid sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV with six residues labeled 1–6, and two candidate bonding structures (A) and (B) shown as graphs (Grace et al., PNAS 2004).]

Protein Structure Prediction

Featurize each edge e in the graph with a vector ξ_e. Labels ν are matchings in the graph; the set V is all matchings.

[Figure: a 6-node graph with candidate edges weighted by x⊤ξ_{1→2}, x⊤ξ_{1→3}, ..., x⊤ξ_{4→6}.]

Goal: Learn weights x so that

argmax_{ν̂∈V} { ∑_{e∈ν̂} ξ_e⊤x } = ν,

i.e. learn x so that the maximum matching in the graph with edge weights x⊤ξ_e is correct.

Loss function: L(ν, ν̂) is the number of disagreements between the matchings:

F(x; {ξ, ν}) := max_{ν̂∈V} ( L(ν, ν̂) + x⊤ ∑_{e∈ν̂} ξ_e − x⊤ ∑_{e∈ν} ξ_e ).

When is optimization easy?

Easy problem: function is convex, nice, and smooth

Not so easy problem: function is non-smooth

Even harder problems:

◮ We cannot compute gradients ∇f(x)

◮ Function f is non-convex and non-smooth

One technique to address many of these

Instead of only using f(x) and ∇f(x) to solve

minimize f(x),

get more global information

Let Z be a random variable, and for small u look at f near x:

f(x + uZ),

where u is small.

[Figure: points sampled within distance u of x.]

Three instances

I. Solving previously unsolvable problems [Burke, Lewis, Overton 2005]
◮ Non-smooth, non-convex problems

II. Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014]
◮ Smooth and non-smooth zero-order, stochastic and non-stochastic optimization problems

III. Parallelism: really fast solutions for large-scale problems [D., Bartlett, Wainwright 2013]
◮ Smooth and non-smooth stochastic optimization problems

Instance I: Gradient Sampling Algorithm

Problem: Solve

minimize_{x∈X} f(x)

where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]

At each iteration t,

◮ Draw Z₁, ..., Z_m i.i.d. with ‖Z_i‖ ≤ 1
◮ Set g_i^t = ∇f(x_t + uZ_i)
◮ Set the gradient g_t as

g_t = argmin_g { ‖g‖₂² : g = ∑_i λ_i g_i^t, λ ≥ 0, ∑_i λ_i = 1 }

◮ Update x_{t+1} = x_t − α g_t, where α > 0 is chosen by line search

[Figure: gradients sampled from a ball of radius u around x_t.]
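A sketch of one gradient-sampling iteration as described above, assuming ∇f can be evaluated at the perturbed points. The slides use a line search for α and (implicitly) a quadratic program for the minimum-norm convex combination; this sketch substitutes a fixed step and a simple Frank-Wolfe inner loop for illustration.

```python
import numpy as np

def min_norm_convex_combination(G, iters=200):
    """Approximate argmin over the simplex of ||G @ lam||_2^2 via Frank-Wolfe.
    G has shape (d, m): one sampled gradient per column."""
    m = G.shape[1]
    lam = np.ones(m) / m
    for k in range(iters):
        grad = G.T @ (G @ lam)          # gradient of 0.5 * ||G lam||^2 w.r.t. lam
        i = int(np.argmin(grad))        # best simplex vertex
        step = 2.0 / (k + 2.0)
        lam = (1 - step) * lam + step * np.eye(m)[i]
    return G @ lam

def gradient_sampling_step(x, grad_f, u=0.1, m=10, alpha=0.1, rng=None):
    """One gradient-sampling iteration: sample m points with ||Z_i|| <= 1,
    combine their gradients into a minimum-norm direction, and step."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    Z = rng.standard_normal((m, d))
    Z /= np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)  # enforce ||Z_i|| <= 1
    G = np.stack([grad_f(x + u * z) for z in Z], axis=1)
    g = min_norm_convex_combination(G)
    return x - alpha * g  # fixed step here; the slides choose alpha by line search
```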

Instance I: Gradient Sampling Algorithm

Define the set

G_u(x) := Conv{ ∇f(x′) : ‖x′ − x‖₂ ≤ u, ∇f(x′) exists }

Proposition (Burke, Lewis, Overton): There exist cluster points x̄ of the sequence x_t, and for any such cluster point,

0 ∈ G_u(x̄)

Instance II: Zero Order Optimization

Problem: We want to solve

minimize_{x∈X} f(x) = E[F(x; ξ)]

but we are only allowed to observe function values f(x) (or F(x; ξ)).

Idea: Approximate the gradient by function differences,

f′(y) ≈ g_u := (f(y + u) − f(y)) / u

◮ Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro
◮ Can randomized perturbations give insights?

[Figure: f with two secant approximations f(y) + 〈g_{u₀}, x − y〉 and f(y) + 〈g_{u₁}, x − y〉, where u₁ ≪ u₀.]

Stochastic Gradient Descent

Algorithm: At iteration t

◮ Choose a random ξ, set

g_t = ∇F(x_t; ξ)

◮ Update

x_{t+1} = x_t − α g_t

Theorem (Russians): Let x̂_T = (1/T) ∑_{t=1}^T x_t and assume R ≥ ‖x* − x₁‖₂ and G² ≥ E[‖g_t‖₂²]. Then

E[f(x̂_T) − f(x*)] ≤ RG · 1/√T

Note: The dependence on G is important.
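A minimal sketch of this stochastic (sub)gradient loop, instantiated on the earlier robust-regression objective f(x) = (1/n)‖Ax − b‖₁, with a hypothetical fixed step size rather than a tuned one.

```python
import numpy as np

def sgd(A, b, alpha=0.01, T=1000, rng=None):
    """Stochastic subgradient descent for f(x) = (1/n) * sum_i |a_i^T x - b_i|."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    x = np.zeros(d)
    iterates = []
    for t in range(T):
        i = rng.integers(n)                   # choose a random xi_i
        g = np.sign(A[i] @ x - b[i]) * A[i]   # (sub)gradient of |a_i^T x - b_i|
        x = x - alpha * g
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)          # averaged iterate x_hat_T
```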

Derivative-free gradient descent

E[f(x̂_T) − f(x*)] ≤ RG · 1/√T

Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate?

First-idea gradient estimator:

◮ Sample Z ∼ µ satisfying E_µ[ZZ⊤] = I_{d×d}
◮ Gradient estimator at x:

g = (f(x + uZ) − f(x)) / u · Z

Perform gradient descent using these g.
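A sketch of this single-difference estimator with Z drawn from N(0, I), so that E[ZZ⊤] = I as required; u is a small, user-chosen parameter.

```python
import numpy as np

def zero_order_gradient(f, x, u=1e-2, rng=None):
    """Estimate grad f(x) from two function values:
    g = (f(x + u Z) - f(x)) / u * Z, with Z ~ N(0, I) so E[Z Z^T] = I."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal(x.shape)
    return (f(x + u * Z) - f(x)) / u * Z
```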

Two-point gradient estimates

◮ At any point x and any direction z, for small u > 0,

(f(x + uz) − f(x)) / u ≈ f′(x, z) := lim_{h↓0} (f(x + hz) − f(x)) / h

◮ If ∇f(x) exists, f′(x, z) = 〈∇f(x), z〉
◮ If E[ZZ⊤] = I, then E[f′(x, Z)Z] = E[ZZ⊤∇f(x)] = ∇f(x)

[Figure: individual random estimates (left); their average ≈ ∇f (right).]

Two-point stochastic gradient: differentiable functions

To solve the d-dimensional problem

minimize_{x∈X⊂R^d} f(x) := E[F(x; ξ)]

Algorithm: Iterate

◮ Draw ξ according to its distribution, draw Z ∼ µ with Cov(Z) = I
◮ Set u_t = u/t and

g_t = (F(x_t + u_t Z; ξ) − F(x_t; ξ)) / u_t · Z

◮ Update x_{t+1} = x_t − α g_t

Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x₁‖₂ and E[‖∇F(x; ξ)‖₂²] ≤ G² for all x, then

E[f(x̂_T) − f(x*)] ≤ RG · √d/√T + O(u² log T / T).
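A sketch of the two-point scheme above; F and sample_xi are hypothetical stand-ins for the stochastic objective and its data distribution, and α is a fixed step size rather than the tuned choice in the theorem.

```python
import numpy as np

def two_point_zero_order_sgd(F, sample_xi, d, u=1.0, alpha=0.1, T=1000, rng=None):
    """Zero-order stochastic optimization: at step t, draw xi and Z ~ N(0, I),
    form g_t = (F(x + u_t Z; xi) - F(x; xi)) / u_t * Z with u_t = u / t,
    and take a (sub)gradient step."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        xi = sample_xi(rng)
        Z = rng.standard_normal(d)
        u_t = u / t
        g = (F(x + u_t * Z, xi) - F(x, xi)) / u_t * Z
        x = x - alpha * g
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)  # averaged iterate x_hat_T
```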

Comparisons to knowing gradient

Convergence rate scaling:

RG · 1/√T   versus   RG · √d/√T

◮ To achieve ε-accuracy: 1/ε² versus d/ε² iterations

[Figure: optimality gap versus iterations for SGD and the zero-order method, and the same comparison with the zero-order iteration count rescaled.]

Two-point stochastic gradient: non-differentiable functions

Problem: If f is non-differentiable, “kinks” make estimates too large

Example: Let f(x) = ‖x‖₂. Then if E[ZZ⊤] = I_{d×d}, at x = 0,

E[‖(f(uZ) − 0) Z/u‖₂²] = E[‖Z‖₂² · ‖Z‖₂²] ≥ d²

Idea: More randomization!

[Figure: two perturbation scales u₁ and u₂ around the kink.]

Two-point stochastic gradient: non-differentiable functions

Proposition (D., Jordan, Wainwright, Wibisono): If Z₁, Z₂ are N(0, I_{d×d}) or uniform on {‖z‖₂ ≤ √d}, then

g := (f(x + u₁Z₁ + u₂Z₂) − f(x + u₁Z₁)) / u₂ · Z₂

satisfies

E[g] = ∇f_{u₁}(x) + O(u₂/u₁)   and   E[‖g‖₂²] ≤ d (√(u₂/u₁) · d + log(2d)).

Note: If u₂/u₁ → 0, the scaling is linear in d.

Two-point sub-gradient: non-differentiable functions

To solve the d-dimensional problem

minimize f(x) subject to x ∈ X ⊂ R^d

Algorithm: Iterate

◮ Draw Z₁ ∼ µ and Z₂ ∼ µ with Cov(Z) = I
◮ Set u_{t,1} = u/t, u_{t,2} = u/t², and

g_t = (f(x_t + u_{t,1}Z₁ + u_{t,2}Z₂) − f(x_t + u_{t,1}Z₁)) / u_{t,2} · Z₂

◮ Update x_{t+1} = x_t − α g_t

Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x₁‖₂ and ‖∂f(x)‖₂² ≤ G² for all x, then

E[f(x̂_T) − f(x*)] ≤ RG · √(d log d)/√T + O(u log T / T).

Two-point sub-gradient: non-differentiable functions

To solve the d-dimensional problem

minimize_{x∈X⊂R^d} f(x) := E[F(x; ξ)]

Algorithm: Iterate

◮ Draw ξ according to its distribution, draw Z₁ ∼ µ and Z₂ ∼ µ with Cov(Z) = I
◮ Set u_{t,1} = u/t, u_{t,2} = u/t², and

g_t = (F(x_t + u_{t,1}Z₁ + u_{t,2}Z₂; ξ) − F(x_t + u_{t,1}Z₁; ξ)) / u_{t,2} · Z₂

◮ Update x_{t+1} = x_t − α g_t

Corollary (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x₁‖₂ and E[‖∇F(x; ξ)‖₂²] ≤ G² for all x, then

E[f(x̂_T) − f(x*)] ≤ RG · √(d log d)/√T + O(u log T / T).
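A sketch of the double-perturbation variant above, again with hypothetical F and sample_xi and a fixed step size; the two scales u_{t,1} = u/t and u_{t,2} = u/t² follow the algorithm statement.

```python
import numpy as np

def double_smoothed_zero_order_sgd(F, sample_xi, d, u=1.0, alpha=0.1, T=1000, rng=None):
    """Two-point estimator with two perturbation scales (for non-smooth F):
    g_t = (F(x + u1 Z1 + u2 Z2; xi) - F(x + u1 Z1; xi)) / u2 * Z2,
    with u1 = u / t and u2 = u / t**2 so that u2 / u1 -> 0."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        xi = sample_xi(rng)
        Z1 = rng.standard_normal(d)
        Z2 = rng.standard_normal(d)
        u1, u2 = u / t, u / t**2
        g = (F(x + u1 * Z1 + u2 * Z2, xi) - F(x + u1 * Z1, xi)) / u2 * Z2
        x = x - alpha * g
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)
```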

Wrapping up zero-order gradient methods

◮ If gradients are available, convergence rate of √(1/T)
◮ If only zero-order information is available, in both the smooth and non-smooth case, convergence rate of √(d/T)
◮ Time to ε-accuracy: 1/ε² ↦ d/ε²
◮ Sharpness: In the stochastic case, no algorithm can do better than those we have provided; that is, there is a lower bound for all zero-order algorithms of

RG · √d/√T.

◮ Open question: Non-stochastic lower bounds? (Sebastian Pokutta, next week.)

Instance III: Parallelization and fast algorithms

Goal: solve the following problem

minimize f(x)

subject to x ∈ X

where

f(x) := (1/n) ∑_{i=1}^n F(x; ξ_i)   or   f(x) := E[F(x; ξ)]

Stochastic Gradient Descent

Problem: Tough to compute

f(x) = (1/n) ∑_{i=1}^n F(x; ξ_i).

Instead: At iteration t

◮ Choose random ξ_i, set

g_t = ∇F(x_t; ξ_i)

◮ Update

x_{t+1} = x_t − α g_t

What everyone “knows” we should do

Obviously: get a lower-variance estimate of the gradient.

Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use

g_t = (1/m) ∑_{j=1}^m g_{j,t}

Problem: only works for smooth functions.
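For concreteness, a tiny sketch of the averaged minibatch gradient g_t = (1/m) ∑_j g_{j,t}; grad_F is a hypothetical per-sample gradient oracle.

```python
import numpy as np

def minibatch_gradient(grad_F, x, xis):
    """Average m stochastic gradients: g = (1/m) * sum_j grad_F(x; xi_j)."""
    return np.mean([grad_F(x, xi) for xi in xis], axis=0)
```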

Non-smooth problems we care about:

◮ Classification

F(x; ξ) = F(x; (a, b)) = [1 − b·x⊤a]₊

◮ Robust regression

F(x; (a, b)) = |b − x⊤a|

◮ Structured prediction (ranking, parsing, learning matchings)

F(x; {ξ, ν}) = max_{ν̂∈V} [ L(ν, ν̂) + x⊤Φ(ξ, ν̂) − x⊤Φ(ξ, ν) ]

Difficulties of non-smooth

Intuition: Gradient is poor indicator of global structure

Better global estimators

Idea: Ask for subgradients from multiple points

The algorithm

Normal approach: sample ξ at random,

g_{j,t} ∈ ∂F(x_t; ξ).

Our approach: add noise to x,

g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ)

Decrease the magnitude u_t over time.

[Figure: subgradients sampled from a radius-u_t ball around x_t.]
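A sketch of the smoothed minibatch (sub)gradient used here: each of the m subgradients is taken at a randomly perturbed copy of the current point; subgrad_F is a hypothetical per-sample subgradient oracle.

```python
import numpy as np

def smoothed_minibatch_subgradient(subgrad_F, x, xis, u_t, rng=None):
    """g_t = (1/m) * sum_j subgrad_F(x + u_t * Z_j; xi_j), with Z_j ~ N(0, I).
    Averaging subgradients at perturbed points approximates the gradient of the
    smoothed function E_Z[f(x + u_t Z)]."""
    rng = np.random.default_rng() if rng is None else rng
    grads = [subgrad_F(x + u_t * rng.standard_normal(x.shape), xi) for xi in xis]
    return np.mean(grads, axis=0)
```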

Algorithm

Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Maintain a query point and an exploration point.

I. Get query point and gradients:

y_t = (1 − θ_t) x_t + θ_t z_t

Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation

g_t = (1/m) ∑_{j=1}^m g_{j,t},   g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})

II. Solve for the exploration point

z_{t+1} = argmin_{x∈X} { ∑_{τ=0}^t (1/θ_τ) 〈g_τ, x〉 + (1/(2α_t)) ‖x‖₂² }

(the sum approximates f; the quadratic term regularizes)

III. Interpolate:

x_{t+1} = (1 − θ_t) x_t + θ_t z_{t+1}
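A sketch of this accelerated scheme in the unconstrained case X = R^d, where step II has the closed form z_{t+1} = −α_t ∑_{τ≤t} g_τ/θ_τ; the schedules θ_t = 2/(t+2) and u_t = u/(t+1), and the fixed α, are illustrative assumptions rather than the slides' tuned choices.

```python
import numpy as np

def accelerated_smoothed_method(subgrad_F, sample_xis, d, m=10,
                                u=1.0, alpha=0.1, T=200, rng=None):
    """Accelerated randomized-smoothing sketch (unconstrained case X = R^d)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(d)
    z = np.zeros(d)
    weighted_grad_sum = np.zeros(d)   # sum_tau g_tau / theta_tau
    for t in range(T):
        theta = 2.0 / (t + 2.0)       # a standard accelerated schedule (assumption)
        u_t = u / (t + 1.0)
        y = (1 - theta) * x + theta * z
        # I. averaged subgradient at randomly perturbed copies of the query point y
        xis = sample_xis(rng, m)
        g = np.mean([subgrad_F(y + u_t * rng.standard_normal(d), xi) for xi in xis],
                    axis=0)
        # II. exploration point: closed-form minimizer of the regularized model
        #     (fixed alpha used here in place of a time-varying alpha_t)
        weighted_grad_sum += g / theta
        z = -alpha * weighted_grad_sum
        # III. interpolate
        x = (1 - theta) * x + theta * z
    return x
```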

Theoretical Results

Objective:

minimize_{x∈X} f(x)   where f(x) = E[F(x; ξ)],

using m gradient samples for the stochastic gradients.

Non-strongly convex objectives:

f(x_T) − f(x*) = O( C/T + 1/√(Tm) )

λ-strongly convex objectives:

f(x_T) − f(x*) = O( C/T² + 1/(λTm) )

A few remarks on distributing

Convergence rate:

f(x_T) − f(x*) = O( 1/T + 1/√(Tm) )

◮ If communication is expensive, use larger batch sizes m:

(a) communication cost is c
(b) n computers with batch size m
(c) S total update steps

Backsolve: after T = S(m + c) units of time, the error is

O( (m + c)/T + 1/√(Tn) · √((m + c)/m) )

[Figure: n machines each computing minibatches of size m, with per-round communication cost c.]

Experimental results

Iteration complexity simulations

Define T(ε, m) = min{ t ∈ N : f(x_t) − f(x*) ≤ ε }, and solve the robust regression problem

f(x) = (1/n) ∑_{i=1}^n |a_i⊤x − b_i| = (1/n) ‖Ax − b‖₁

[Figure: iterations to ε-optimality versus the number m of gradient samples, actual T(ε, m) against predicted T(ε, m), on log-log axes.]

Robustness to stepsize and smoothing

◮ Two parameters: smoothing parameter u, stepsize η

[Figure: optimality gap f(x̂) + φ(x̂) − f(x*) − φ(x*) as a function of the stepsize η and 1/u.]

Plot: optimality gap after 2000 iterations on a synthetic SVM problem

f(x) + φ(x) := (1/n) ∑_{i=1}^n [1 − ξ_i⊤x]₊ + (λ/2) ‖x‖₂²

Text Classification

Reuters RCV1 dataset, time to ε-optimal solution for

(1/n) ∑_{i=1}^n [1 − ξ_i⊤x]₊ + (λ/2) ‖x‖₂²

[Figure: mean time (seconds) to optimality gaps of 0.004 and 0.002 versus the number of worker threads, for batch sizes 10 and 20.]

Text Classification

Reuters RCV1 dataset, optimization speed for minimizing

(1/n) ∑_{i=1}^n [1 − ξ_i⊤x]₊ + (λ/2) ‖x‖₂²

[Figure: optimality gap versus time (seconds) for the accelerated method with 1, 2, 3, 4, 6, and 8 workers, compared against Pegasos.]

Parsing

Penn Treebank dataset, learning weights x for a hypergraph parser (here ξ is a sentence, ν ∈ V is a parse tree):

(1/n) ∑_{i=1}^n max_{ν̂∈V} [ L(ν_i, ν̂) + x⊤(Φ(ξ_i, ν̂) − Φ(ξ_i, ν_i)) ] + (λ/2) ‖x‖₂².

[Figure: optimality gap f(x) + φ(x) − f(x*) − φ(x*) versus iterations and versus time (seconds) for 1, 2, 4, 6, and 8 workers, compared against RDA.]

Is smoothing necessary?

Solve the multiple-median problem

f(x) = (1/n) ∑_{i=1}^n ‖x − ξ_i‖₁,   ξ_i ∈ {−1, 1}^d.

Compare with standard stochastic gradient:

[Figure: iterations to ε-optimality versus the number m of gradient samples, smoothed versus unsmoothed.]

Discussion

◮ Randomized smoothing allows:
  ◮ Stationary points of non-convex, non-smooth problems
  ◮ Optimal solutions in zero-order problems (including non-smooth)
  ◮ Parallelization for non-smooth problems
◮ Current experiments in consultation with Google for large-scale parsing/translation tasks
◮ Open questions: non-stochastic optimality guarantees? True zero-order optimization?

Thanks!

References:

◮ Randomized Smoothing for Stochastic Optimization (D., Bartlett, Wainwright). SIAM Journal on Optimization, 22(2), pages 674–701.
◮ Optimal rates for zero-order convex optimization: the power of two function evaluations (D., Jordan, Wainwright, Wibisono). arXiv:1312.2139 [math.OC].

Available on my webpage (http://web.stanford.edu/~jduchi)
