Randomized Smoothing Techniques in Optimization
John Duchi
Based on joint work with Peter Bartlett, Michael Jordan,
Martin Wainwright, Andre Wibisono
Stanford University
Information Systems Laboratory Seminar, October 2014
Problem Statement
Goal: solve the following problem
  minimize f(x)
  subject to x ∈ X
Often, we will assume
  f(x) := (1/n) ∑_{i=1}^n F(x; ξ_i)   or   f(x) := E[F(x; ξ)]
Gradient Descent
Goal: solve
minimize f(x)
Technique: go down the slope,
  x_{t+1} = x_t − α ∇f(x_t)
[Figure: f(x) and its linear lower bound f(y) + ⟨∇f(y), x − y⟩ at the point (y, f(y))]
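As a concrete reference point, here is a minimal sketch of this update in Python; the quadratic example objective, the stepsize, and the iteration count are illustrative choices, not from the talk.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_steps=100):
    """Plain gradient descent: x_{t+1} = x_t - alpha * grad_f(x_t).

    A minimal sketch; `grad_f`, the constant stepsize `alpha`, and the
    quadratic example below are illustrative choices.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - alpha * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 * ||x - c||_2^2, whose gradient is x - c.
c = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: x - c, x0=np.zeros(2))
```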
When is optimization easy?
Easy problem: function is convex, nice and smooth
Not so easy problem: function is non-smooth
Even harder problems:
◮ We cannot compute gradients ∇f(x)
◮ Function f is non-convex and non-smooth
Example 1: Robust regression
◮ Data in pairs ξ_i = (a_i, b_i) ∈ R^d × R
◮ Want to estimate b_i ≈ a_i^⊤ x
◮ To avoid outliers, minimize
  f(x) = (1/n) ∑_{i=1}^n |a_i^⊤ x − b_i| = (1/n) ‖Ax − b‖_1
[Figure: the absolute loss plotted against the residual a_i^⊤ x − b_i]
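A minimal sketch of evaluating this objective, assuming the data are given as a numpy matrix A and vector b:

```python
import numpy as np

def robust_regression_loss(x, A, b):
    """Average absolute-deviation loss f(x) = (1/n) * ||Ax - b||_1.

    A sketch of the objective above; A (n x d) and b (n,) are assumed to be
    numpy arrays.
    """
    residuals = A @ x - b
    return np.mean(np.abs(residuals))
```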
Example 2: Protein Structure Prediction
[Figure: a cysteine-rich protein sequence with six numbered cysteines and two candidate bonding patterns (A) and (B), each a matching between sequence positions (Grace et al., PNAS 2004)]
Protein Structure Prediction
[Figure: a 6-node graph whose edges e carry weights x^⊤ξ_e, e.g. x^⊤ξ_{1→2}, x^⊤ξ_{1→3}, x^⊤ξ_{4→6}]
Featurize edges e in the graph: vector ξ_e. Labels ν are matchings in the graph, and the set V is all matchings.
Goal: Learn weights x so that
  argmax_{ν̂∈V} { ∑_{e∈ν̂} ξ_e^⊤ x } = ν,
i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ξ_e is correct.
Loss function: L(ν, ν̂) is the number of disagreements between the matchings
  F(x; {ξ, ν}) := max_{ν̂∈V} ( L(ν, ν̂) + x^⊤ ∑_{e∈ν̂} ξ_e − x^⊤ ∑_{e∈ν} ξ_e ).
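For concreteness, a brute-force sketch of this loss-augmented objective; the helper names (edge_features, candidate_matchings) and the enumeration over V are hypothetical stand-ins for a real combinatorial matching oracle.

```python
import numpy as np

def structured_hinge_loss(x, edge_features, true_matching, candidate_matchings):
    """Loss-augmented structured hinge loss for matchings (brute force).

    F(x; {xi, nu}) = max over nu_hat of [ L(nu, nu_hat)
                      + x^T sum_{e in nu_hat} xi_e - x^T sum_{e in nu} xi_e ].

    A sketch only: `edge_features` maps an edge to its feature vector xi_e,
    `candidate_matchings` enumerates the set V, and the disagreement count is
    a Hamming-style stand-in for L. In practice the max is computed with a
    matching solver, not enumeration.
    """
    def score(matching):
        return sum(x @ edge_features[e] for e in matching)

    def disagreements(nu, nu_hat):
        return len(set(nu) ^ set(nu_hat))  # edges in exactly one matching

    true_score = score(true_matching)
    return max(disagreements(true_matching, nu_hat) + score(nu_hat) - true_score
               for nu_hat in candidate_matchings)
```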
When is optimization easy?
Easy problem: function is convex, nice and smooth
Not so easy problem: function is non-smooth
Even harder problems:
◮ We cannot compute gradients ∇f(x)
◮ Function f is non-convex and non-smooth
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
  minimize f(x),
get more global information.
Let Z be a random variable and, for small u > 0, look at f at the nearby points
  f(x + uZ).
[Figure: a ball of radius u around x over which f is sampled]
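A minimal sketch of the smoothed surrogate this induces, f_u(x) = E[f(x + uZ)], estimated by Monte Carlo; the Gaussian choice of Z and the sample count are illustrative assumptions.

```python
import numpy as np

def smoothed_value(f, x, u, num_samples=1000, rng=None):
    """Monte Carlo estimate of the smoothed function f_u(x) = E[f(x + u Z)],
    here with Z ~ N(0, I). A sketch; the Gaussian Z and the sample count are
    illustrative choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    zs = rng.standard_normal((num_samples, x.size))
    return np.mean([f(x + u * z) for z in zs])

# Example: smoothing the non-smooth f(x) = ||x||_1 near the kink at 0.
f = lambda x: np.abs(x).sum()
print(smoothed_value(f, np.zeros(3), u=0.1))
```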
Three instances
I. Solving previously unsolvable problems [Burke, Lewis, Overton 2005]
  ◮ Non-smooth, non-convex problems
II. Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014]
  ◮ Smooth and non-smooth zero-order stochastic and non-stochastic optimization problems
III. Parallelism: really fast solutions for large-scale problems [D., Bartlett, Wainwright 2013]
  ◮ Smooth and non-smooth stochastic optimization problems
Instance I: Gradient Sampling Algorithm
Problem: Solve
  minimize_{x∈X} f(x)
where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]
At each iteration t,
◮ Draw Z_1, . . . , Z_m i.i.d. with ‖Z_i‖ ≤ 1
◮ Set g_{i,t} = ∇f(x_t + uZ_i)
◮ Set the gradient g_t as
  g_t = argmin_g { ‖g‖_2^2 : g = ∑_i λ_i g_{i,t}, λ ≥ 0, ∑_i λ_i = 1 }
◮ Update x_{t+1} = x_t − αg_t, where α > 0 is chosen by line search
[Figure: gradients sampled from a ball of radius u around the iterate]
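A sketch of one iteration's direction computation under these steps; the uniform-in-ball sampling and the small SLSQP solve for the minimum-norm convex combination are illustrative implementation choices, and grad_f is an assumed gradient oracle.

```python
import numpy as np
from scipy.optimize import minimize

def gradient_sampling_direction(grad_f, x, u=0.1, m=20, rng=None):
    """One direction-finding step in the gradient sampling spirit (sketch).

    Sample m points in a radius-u ball around x, collect gradients there, and
    return the minimum-norm convex combination of those gradients. The SLSQP
    solve and the sampling scheme are illustrative, not prescribed by the talk.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    # Sample Z_i uniformly in the unit ball: random direction times U^(1/d).
    dirs = rng.standard_normal((m, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = rng.uniform(size=(m, 1)) ** (1.0 / d)
    grads = np.array([grad_f(x + u * (r * z)) for r, z in zip(radii, dirs)])

    # Minimum-norm element of the convex hull of the sampled gradients:
    # minimize ||G^T lambda||_2^2 over the simplex {lambda >= 0, sum = 1}.
    def objective(lam):
        return np.sum((grads.T @ lam) ** 2)

    lam0 = np.ones(m) / m
    res = minimize(objective, lam0, method="SLSQP",
                   bounds=[(0.0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    return grads.T @ res.x  # g_t; update x with a line-searched step along -g_t
```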
Instance I: Gradient Sampling Algorithm
Problem: Solve
  minimize_{x∈X} f(x)
where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]. Define the set
  G_u(x) := Conv{ ∇f(x′) : ‖x′ − x‖_2 ≤ u, ∇f(x′) exists }
Proposition (Burke, Lewis, Overton): There exist cluster points x̄ of the sequence x_t, and for any such cluster point,
  0 ∈ G_u(x̄)
Instance II: Zero Order Optimization
Problem: We want to solve
  minimize_{x∈X} f(x) = E[F(x; ξ)]
but we are only allowed to observe function values f(x) (or F(x; ξ)).
Idea: Approximate the gradient by function differences
  f′(y) ≈ g_u := (f(y + u) − f(y)) / u
◮ Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro
◮ Can randomized perturbations give insights?
[Figure: f with the secant approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩ for u_1 ≪ u_0]
Stochastic Gradient Descent
Algorithm: At iteration t
◮ Choose random ξ, set
  g_t = ∇F(x_t; ξ)
◮ Update
  x_{t+1} = x_t − αg_t
Theorem (Russians): Let x̂_T = (1/T) ∑_{t=1}^T x_t and assume R ≥ ‖x∗ − x_1‖_2 and G^2 ≥ E[‖g_t‖_2^2]. Then
  E[f(x̂_T) − f(x∗)] ≤ RG/√T
Note: the dependence on G is important
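A minimal sketch of this procedure with iterate averaging, which is what the bound on E[f(x̂_T) − f(x∗)] refers to; grad_F, the data container, and the constant stepsize are illustrative assumptions.

```python
import numpy as np

def sgd(grad_F, data, x0, alpha=0.01, num_steps=1000, rng=None):
    """Stochastic gradient descent with averaging of the iterates.

    `grad_F(x, xi)` returns a stochastic gradient of F(x; xi); the averaged
    iterate x_hat_T is returned. A sketch with a constant stepsize chosen for
    illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    x_sum = np.zeros_like(x)
    for _ in range(num_steps):
        xi = data[rng.integers(len(data))]   # sample one data point
        x = x - alpha * grad_F(x, xi)
        x_sum += x
    return x_sum / num_steps                 # x_hat_T = average iterate
```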
Derivative-free gradient descent
  E[f(x̂_T) − f(x∗)] ≤ RG/√T
Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate?
First idea, a gradient estimator:
◮ Sample Z ∼ µ satisfying E_µ[ZZ^⊤] = I_{d×d}
◮ Gradient estimator at x:
  g = (f(x + uZ) − f(x)) / u · Z
Perform gradient descent using these g
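A sketch of this estimator with a Gaussian perturbation (so that E[ZZ^⊤] = I); the function handle f and the value of u are assumptions for illustration.

```python
import numpy as np

def zero_order_gradient(f, x, u=1e-2, rng=None):
    """Single-sample zero-order gradient estimate g = (f(x + u Z) - f(x)) Z / u,
    with Z ~ N(0, I) so that E[Z Z^T] = I. A sketch; the Gaussian choice of Z
    and the value of u are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(np.asarray(x).size)
    return (f(x + u * z) - f(x)) / u * z

# Plug into a gradient-descent loop in place of the true gradient:
# x = x - alpha * zero_order_gradient(f, x)
```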
Two-point gradient estimates
◮ At any point x and any direction z, for small u > 0,
  (f(x + uz) − f(x)) / u ≈ f′(x, z) := lim_{h↓0} (f(x + hz) − f(x)) / h
◮ If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩
◮ If E[ZZ^⊤] = I, then E[f′(x, Z)Z] = E[ZZ^⊤∇f(x)] = ∇f(x)
[Figure: random single-sample estimates whose average ≈ ∇f]
Two-point stochastic gradient: differentiable functions
To solve the d-dimensional problem
  minimize_{x∈X⊂R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
◮ Draw ξ according to its distribution, draw Z ∼ µ with Cov(Z) = I
◮ Set u_t = u/t and
  g_t = (F(x_t + u_t Z; ξ) − F(x_t; ξ)) / u_t · Z
◮ Update x_{t+1} = x_t − αg_t
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x∗ − x_1‖_2 and E[‖∇F(x; ξ)‖_2^2] ≤ G^2 for all x, then
  E[f(x̂_T) − f(x∗)] ≤ RG·√d/√T + O(u^2 · (log T)/T).
Comparisons to knowing gradient
Convergence rate scaling
  RG/√T versus RG·√d/√T
◮ To achieve ε-accuracy, 1/ε^2 versus d/ε^2
[Figures: optimality gap vs. iterations and vs. rescaled iterations for SGD and the zero-order method]
Two-point stochastic gradient: non-differentiable functions
Problem: If f is non-differentiable, “kinks” make estimates too large
Example: Let f(x) = ‖x‖_2. Then if E[ZZ^⊤] = I_{d×d}, at x = 0
  E[‖(f(uZ) − 0)Z/u‖_2^2] = E[‖Z‖_2^2 ‖Z‖_2^2] ≥ d^2
(by Jensen's inequality, since E[‖Z‖_2^2] = d)
Idea: More randomization!
[Figure: two nested perturbation radii u_1 and u_2 around the kink]
Two-point stochastic gradient: non-differentiable functions
Problem: If f is non-differentiable, “kinks” make estimates too large
Proposition (D., Jordan, Wainwright, Wibisono): If Z_1, Z_2 are N(0, I_{d×d}) or uniform on ‖z‖_2 ≤ √d, then
  g := (f(x + u_1 Z_1 + u_2 Z_2) − f(x + u_1 Z_1)) / u_2 · Z_2
satisfies
  E[g] = ∇f_{u_1}(x) + O(u_2/u_1)   and   E[‖g‖_2^2] ≤ d(√(u_2/u_1)·d + log(2d)).
Note: If u_2/u_1 → 0, the scaling is linear in d
Two-point sub-gradient: non-differentiable functions
To solve the d-dimensional problem
  minimize f(x) subject to x ∈ X ⊂ R^d
Algorithm: Iterate
◮ Draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
◮ Set u_{t,1} = u/t, u_{t,2} = u/t^2, and
  g_t = (f(x_t + u_{t,1} Z_1 + u_{t,2} Z_2) − f(x_t + u_{t,1} Z_1)) / u_{t,2} · Z_2
◮ Update x_{t+1} = x_t − αg_t
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x∗ − x_1‖_2 and ‖∂f(x)‖_2^2 ≤ G^2 for all x, then
  E[f(x̂_T) − f(x∗)] ≤ RG·√(d log d)/√T + O(u·(log T)/T).
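A sketch of the doubly perturbed difference estimator used in this algorithm, with Gaussian Z_1, Z_2; the particular values of u_1 and u_2 are illustrative.

```python
import numpy as np

def two_point_subgradient_estimate(f, x, u1=1e-1, u2=1e-3, rng=None):
    """Doubly smoothed two-point gradient estimate for non-smooth f:

        g = (f(x + u1*Z1 + u2*Z2) - f(x + u1*Z1)) / u2 * Z2,

    with Z1, Z2 ~ N(0, I). The outer perturbation u1*Z1 moves the base point
    away from kinks; taking u2 << u1 keeps the bias O(u2/u1). A sketch; the
    Gaussian perturbations and the values of u1, u2 are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(x).size
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    base = x + u1 * z1
    return (f(base + u2 * z2) - f(base)) / u2 * z2
```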
Two-point sub-gradient: non-differentiable functions
To solve the d-dimensional problem
  minimize_{x∈X⊂R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
◮ Draw ξ according to its distribution, draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
◮ Set u_{t,1} = u/t, u_{t,2} = u/t^2, and
  g_t = (F(x_t + u_{t,1} Z_1 + u_{t,2} Z_2; ξ) − F(x_t + u_{t,1} Z_1; ξ)) / u_{t,2} · Z_2
◮ Update x_{t+1} = x_t − αg_t
Corollary (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x∗ − x_1‖_2 and E[‖∇F(x; ξ)‖_2^2] ≤ G^2 for all x, then
  E[f(x̂_T) − f(x∗)] ≤ RG·√(d log d)/√T + O(u·(log T)/T).
Wrapping up zero-order gradient methods
◮ If gradients are available, convergence rates of √(1/T)
◮ If only zero-order information is available, in both the smooth and non-smooth cases, convergence rates of √(d/T)
◮ Time to ε-accuracy: 1/ε^2 ↦ d/ε^2
◮ Sharpness: In the stochastic case, no algorithm can do better than those we have provided. That is, there is a lower bound for all zero-order algorithms of
  RG·√d/√T.
◮ Open question: Non-stochastic lower bounds? (Sebastian Pokutta, next week.)
Instance III: Parallelization and fast algorithms
Goal: solve the following problem
  minimize f(x)
  subject to x ∈ X
where
  f(x) := (1/n) ∑_{i=1}^n F(x; ξ_i)   or   f(x) := E[F(x; ξ)]
Stochastic Gradient Descent
Problem: Tough to compute
  f(x) = (1/n) ∑_{i=1}^n F(x; ξ_i).
Instead: At iteration t
◮ Choose random ξ_i, set
  g_t = ∇F(x_t; ξ_i)
◮ Update
  x_{t+1} = x_t − αg_t
What everyone “knows” we should do
Obviously: get a lower-variance estimate of the gradient.
Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) ∑_{j=1}^m g_{j,t}
Problem: this only works for smooth functions.
Non-smooth problems we care about:
◮ Classification
  F(x; ξ) = F(x; (a, b)) = [1 − b·x^⊤a]_+
◮ Robust regression
  F(x; (a, b)) = |b − x^⊤a|
◮ Structured prediction (ranking, parsing, learning matchings)
  F(x; {ξ, ν}) = max_{ν̂∈V} [ L(ν, ν̂) + x^⊤Φ(ξ, ν̂) − x^⊤Φ(ξ, ν) ]
Difficulties of non-smoothness
Intuition: the gradient is a poor indicator of global structure
Better global estimators
Idea: Ask for subgradients from multiple points
The algorithm
Normal approach: sample ξ at random,
  g_{j,t} ∈ ∂F(x_t; ξ).
Our approach: add noise to x,
  g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ)
Decrease the magnitude u_t over time
[Figure: iterate x_t with sampling radius u_t]
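A minimal sketch of this smoothed mini-batch subgradient; subgrad_F is an assumed subgradient oracle, and the Gaussian perturbation and batch size are illustrative.

```python
import numpy as np

def smoothed_minibatch_subgradient(subgrad_F, x, data, u, m=10, rng=None):
    """Average of m subgradients taken at randomly perturbed points,

        g_t = (1/m) * sum_j g_j,   g_j in dF(x + u * Z_j; xi_j),

    with Z_j ~ N(0, I) and xi_j sampled from the data. A sketch; the Gaussian
    perturbation and the batch size m are illustrative choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(x).size
    g = np.zeros(d)
    for _ in range(m):
        xi = data[rng.integers(len(data))]
        z = rng.standard_normal(d)
        g += subgrad_F(x + u * z, xi)
    return g / m
```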
Algorithm
Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Maintain a query point and an exploration point.
I. Get the query point and gradients:
  y_t = (1 − θ_t)x_t + θ_t z_t
Sample ξ_{j,t} and Z_{j,t}, and compute the gradient approximation
  g_t = (1/m) ∑_{j=1}^m g_{j,t},   g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})
II. Solve for the exploration point
  z_{t+1} = argmin_{x∈X} { ∑_{τ=0}^t (1/θ_τ) ⟨g_τ, x⟩   [approximate f]   +   (1/(2α_t)) ‖x‖_2^2   [regularize] }
III. Interpolate
  x_{t+1} = (1 − θ_t)x_t + θ_t z_{t+1}
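A schematic sketch of these three steps under simplifying assumptions (X = R^d so step II has a closed form, θ_t = 2/(t+2), constant α_t, and u_t = u_0/(t+1)); these choices are illustrative, not the stepsizes from the analysis, and subgrad_F is an assumed subgradient oracle.

```python
import numpy as np

def accelerated_randomized_smoothing(subgrad_F, data, x0, alpha=0.1, u0=1.0,
                                     m=10, num_steps=100, rng=None):
    """Schematic version of the three-step accelerated scheme above.

    Simplifying assumptions for the sketch: X = R^d (so step II reduces to
    z_{t+1} = -alpha * sum_tau g_tau / theta_tau), theta_t = 2/(t+2), constant
    alpha, and u_t = u0/(t+1).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    dual_sum = np.zeros_like(x)              # running sum of g_tau / theta_tau
    for t in range(num_steps):
        theta, u = 2.0 / (t + 2), u0 / (t + 1)
        y = (1 - theta) * x + theta * z      # I. query point
        g = np.zeros_like(x)
        for _ in range(m):                   #    smoothed mini-batch gradient
            xi = data[rng.integers(len(data))]
            g += subgrad_F(y + u * rng.standard_normal(x.size), xi)
        g /= m
        dual_sum += g / theta
        z = -alpha * dual_sum                # II. exploration point (X = R^d)
        x = (1 - theta) * x + theta * z      # III. interpolate
    return x
```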
Theoretical Results
Objective:
  minimize_{x∈X} f(x) where f(x) = E[F(x; ξ)]
using m gradient samples for the stochastic gradients.
Non-strongly convex objectives:
  f(x_T) − f(x∗) = O(C/T + 1/√(Tm))
λ-strongly convex objectives:
  f(x_T) − f(x∗) = O(C/T^2 + 1/(λTm))
A few remarks on distributing
Convergence rate:
  f(x_T) − f(x∗) = O(1/T + 1/√(Tm))
◮ If communication is expensive, use larger batch sizes m:
  (a) Communication cost is c
  (b) n computers with batch size m
  (c) S total update steps
Backsolve: after T = S(m + c) units of time, the error is
  O((m + c)/T + 1/√(Tn) · √((m + c)/m))
[Figure: n machines, each computing a batch of size m, with communication cost c]
Experimental results
Iteration complexity simulations
Define T(ε, m) = min{t ∈ N | f(x_t) − f(x∗) ≤ ε}, and solve the robust regression problem
  f(x) = (1/n) ∑_{i=1}^n |a_i^⊤ x − b_i| = (1/n) ‖Ax − b‖_1
[Figure: iterations to ε-optimality vs. the number m of gradient samples, actual T(ε, m) and predicted T(ε, m)]
Robustness to stepsize and smoothing
◮ Two parameters: smoothing parameter u, stepsize η
[Figure: optimality gap f(x̂) + φ(x̂) − f(x∗) − φ(x∗) over a range of stepsizes η and smoothing parameters 1/u]
Plot: optimality gap after 2000 iterations on a synthetic SVM problem
  f(x) + φ(x) := (1/n) ∑_{i=1}^n [1 − ξ_i^⊤x]_+ + (λ/2)‖x‖_2^2
Text Classification
Reuters RCV1 dataset, time to an ε-optimal solution for
  (1/n) ∑_{i=1}^n [1 − ξ_i^⊤x]_+ + (λ/2)‖x‖_2^2
[Figures: mean time (seconds) to optimality gaps of 0.004 and 0.002 vs. number of worker threads, for batch sizes 10 and 20]
Text Classification
Reuters RCV1 dataset, optimization speed for minimizing
  (1/n) ∑_{i=1}^n [1 − ξ_i^⊤x]_+ + (λ/2)‖x‖_2^2
[Figure: optimality gap vs. time (s) for the accelerated method with 1, 2, 3, 4, 6, and 8 threads, compared with Pegasos]
Parsing
Penn Treebank dataset, learning weights for a hypergraph parser (here ξ is a sentence and ν ∈ V is a parse tree)
  (1/n) ∑_{i=1}^n max_{ν̂∈V} [ L(ν_i, ν̂) + x^⊤(Φ(ξ_i, ν̂) − Φ(ξ_i, ν_i)) ] + (λ/2)‖x‖_2^2.
[Figures: optimality gap f(x) + φ(x) − f(x∗) − φ(x∗) vs. iterations and vs. time (seconds) for 1, 2, 4, 6, and 8 threads, compared with RDA]
Is smoothing necessary?
Solve the multiple-median problem
  f(x) = (1/n) ∑_{i=1}^n ‖x − ξ_i‖_1,   ξ_i ∈ {−1, 1}^d.
Compare with standard stochastic gradient:
[Figure: iterations to ε-optimality vs. the number m of gradient samples, smoothed vs. unsmoothed]
Discussion
◮ Randomized smoothing allows
  ◮ Stationary points of non-convex, non-smooth problems
  ◮ Optimal solutions in zero-order problems (including non-smooth)
  ◮ Parallelization for non-smooth problems
◮ Current experiments in consultation with Google for large-scale parsing/translation tasks
◮ Open questions: non-stochastic optimality guarantees? True zero-order optimization?
Thanks!
References:
◮ Randomized Smoothing for Stochastic Optimization (D., Bartlett, Wainwright). SIAM Journal on Optimization, 22(2), pages 674–701.
◮ Optimal rates for zero-order convex optimization: the power of two function evaluations (D., Jordan, Wainwright, Wibisono). arXiv:1312.2139 [math.OC].
Available on my webpage (http://web.stanford.edu/~jduchi)