TRANSCRIPT
Randomized Algorithms for Large-scale Convex Optimization
Lijun Zhang
LAMDA group, Nanjing University, China
The 2nd Chinese Workshop on Evolutionary Computation and Learning
Outline
1 Introduction
2 Stochastic Optimization: Background; Mixed Gradient Descent
3 Stochastic Approximation: Background; Dual Random Projection
4 Conclusions and Future Work
Big Data
[Image: https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg]
Supervised Learning by Optimization
Supervised Learning
Input
A set of training data (x_i, y_i) ∈ R^d × R, i = 1, …, n
A set of hypotheses w ∈ W ⊆ R^d
Output
A hypothesis w_* ∈ W that minimizes the testing error of the predictor x ↦ x^⊤ w_*

Empirical Risk Minimization

\min_{w \in \mathcal{W}} \; f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \Omega(w)

ℓ(·, ·) is a loss, e.g., the hinge loss ℓ(u, v) = max(0, 1 − uv)
Ω(·) is a regularizer, e.g., λ‖w‖₂² or λ‖w‖₁
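To make the objective concrete, here is a minimal NumPy sketch of the empirical risk using the slide's example choices (hinge loss and the ℓ2 regularizer λ‖w‖₂²); the function name and the row-wise data layout are illustrative assumptions, not from the talk.

```python
import numpy as np

def erm_objective(w, X, y, lam):
    """f(w) = (1/n) * sum_i max(0, 1 - y_i * x_i^T w) + lam * ||w||_2^2.

    X has shape (n, d) with rows x_i^T; y has shape (n,), labels in {-1, +1}.
    """
    margins = y * (X @ w)                     # y_i * x_i^T w for every instance
    hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss l(u, v) = max(0, 1 - uv)
    return hinge.mean() + lam * np.dot(w, w)  # empirical risk + regularizer
```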
The Challenges

Large-scale Convex Optimization

\min_{w \in \mathcal{W}} \; f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \Omega(w)

Gradient Descent (GD)
1: for t = 1, 2, …, T do
2:   w'_{t+1} = w_t - \eta_t \left( \frac{1}{n} \sum_{i=1}^{n} \nabla\ell(y_i, x_i^\top w_t) + \nabla\Omega(w_t) \right)
3:   w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})
4: end for

Computational Cost
Time Complexity: O(nd) + O(poly(d))
Space Complexity: O(nd)
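Below is a runnable sketch of projected GD under illustrative assumptions the slide leaves open: logistic loss, Ω(w) = λ‖w‖₂², and W an ℓ2 ball of radius R so the projection Π_W has a closed form.

```python
import numpy as np

def project_l2_ball(w, R):
    """Euclidean projection onto W = {w : ||w||_2 <= R}."""
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

def gradient_descent(X, y, lam, R, eta=0.5, T=100):
    """Projected GD; every iteration touches all n instances, hence O(nd) time."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)
        # gradient of (1/n) * sum_i log(1 + exp(-y_i x_i^T w))
        g_loss = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        g_reg = 2.0 * lam * w  # gradient of Omega(w) = lam * ||w||_2^2
        w = project_l2_ball(w - eta * (g_loss + g_reg), R)
    return w
```

Keeping the full data matrix X in memory for the averaged gradient is exactly what produces the O(nd) space cost quoted above.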
Randomized Algorithms
Random Sampling based Algorithms
aim to address the large-scale challenge, i.e., large n
select a subset of training data randomly
referred to as Stochastic Optimization
Random Projection based Algorithms
aim to address the high-dimensional challenge, i.e., large d
reduce the dimensionality by random projection
referred to as Stochastic Approximation
Stochastic Gradient Descent (SGD)

The Algorithm
1: for t = 1, 2, …, T do
2:   Select a training instance (x_i, y_i) randomly
3:   w'_{t+1} = w_t - \eta_t \nabla\ell(y_i, x_i^\top w_t)
4:   w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})
5: end for

Advantages
Time Complexity: O(d) + O(poly(d))
Space Complexity: O(d)

Limitations
The iteration complexity is much higher than that of GD
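A matching sketch of projected SGD, with the same illustrative assumptions as the GD sketch above (logistic loss, ℓ2-ball constraint); the decreasing step size η_t = 1/√t anticipates the convex case discussed on the following slides.

```python
import numpy as np

def sgd(X, y, R, T, seed=0):
    """Projected SGD: one random instance per step, so each step costs O(d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                        # random training instance
        eta = 1.0 / np.sqrt(t)                     # decreasing step size
        margin = y[i] * (X[i] @ w)
        g = -y[i] * X[i] / (1.0 + np.exp(margin))  # stochastic gradient of the loss
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > R:                               # projection onto the l2 ball W
            w *= R / norm
    return w
```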
The Problem

Iteration Complexity
The number of iterations T needed to ensure

f(w_T) - \min_{w \in \mathcal{W}} f(w) \le \epsilon

Comparisons between GD and SGD

        Convex & Smooth   Strongly Convex & Smooth
GD      O(1/√ε)           O(log(1/ε))
SGD     O(1/ε²)           O(1/ε)

Note: 1/ε² > 1/ε > 1/√ε ≫ log(1/ε); for ε = 10⁻⁶ this reads 10¹² > 10⁶ > 10³ ≫ 6.
Motivations

Reason for the Slow Convergence Rate
The step size of SGD is a decreasing sequence:
η_t = 1/√t for convex functions
η_t = 1/t for strongly convex functions

Reason for the Decreasing Step Size

w'_{t+1} = w_t - \eta_t \nabla\ell(y_i, x_i^\top w_t)

Stochastic gradients introduce a constant error: their variance does not vanish even as w_t approaches the optimum, so the step size must decay to average the noise out.

The Key Idea
Control the variance of the stochastic gradients
Choose a fixed step size η_t
Mixed Gradient Descent I

Mixed Gradient at w_t

m(w_t) = \nabla\ell(y_t, x_t^\top w_t) - \nabla\ell(y_t, x_t^\top w_0) + \nabla f(w_0)

where (x_t, y_t) is a random sample, w_0 is an initial solution, and

\nabla f(w_0) = \frac{1}{n} \sum_{i=1}^{n} \nabla\ell(y_i, x_i^\top w_0)

The Properties of the Mixed Gradient
It is still an unbiased estimate of the true gradient:

\mathbb{E}[m(w_t)] = \frac{1}{n} \sum_{i=1}^{n} \nabla\ell(y_i, x_i^\top w_t) = \nabla f(w_t)

Its variance is controlled by the distance to w_0, since for an L-smooth loss

\|\nabla\ell(y_t, x_t^\top w_t) - \nabla\ell(y_t, x_t^\top w_0)\|_2 \le L \|w_t - w_0\|_2
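Both properties are easy to check numerically. A small sketch (logistic loss chosen purely as an example): averaging m(w_t) over all n instances recovers ∇f(w_t) exactly, and the correction term is small whenever w_t stays close to w_0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))

def grad_i(w, i):
    """Gradient of the logistic loss on instance i (an assumed example loss)."""
    return -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))

def full_grad(w):
    return np.mean([grad_i(w, i) for i in range(n)], axis=0)

w0 = rng.standard_normal(d)
wt = w0 + 0.1 * rng.standard_normal(d)  # w_t close to w_0

g0 = full_grad(w0)
mixed = np.array([grad_i(wt, i) - grad_i(w0, i) + g0 for i in range(n)])
print(np.linalg.norm(mixed.mean(axis=0) - full_grad(wt)))  # ~0: unbiased
print(np.abs(mixed - full_grad(wt)).max())  # small: deviation tied to ||w_t - w_0||
```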
Mixed Gradient Descent II

The Algorithm (NIPS 2013)
1: Compute the true gradient at w_0:
   \nabla f(w_0) = \frac{1}{n} \sum_{i=1}^{n} \nabla\ell(y_i, x_i^\top w_0)
2: for t = 1, 2, …, T do
3:   Select a training instance (x_t, y_t) randomly
4:   Compute the mixed gradient at w_t:
     m(w_t) = \nabla\ell(y_t, x_t^\top w_t) - \nabla\ell(y_t, x_t^\top w_0) + \nabla f(w_0)
5:   w'_{t+1} = w_t - \eta_t m(w_t)
6:   w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})
7: end for
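A minimal sketch of this loop, assuming logistic loss and W = R^d (so the projection is the identity); the full EMGD algorithm of [Zhang et al., 2013a] proceeds in epochs and periodically resets the anchor w_0, which this single-pass sketch omits.

```python
import numpy as np

def mixed_gradient_descent(X, y, eta=0.5, T=1000, seed=0):
    """One pass of the slide's loop with a fixed step size and anchor w_0 = 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    def grad_i(w, i):  # logistic-loss gradient on instance i
        return -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))

    w0 = np.zeros(d)
    g0 = np.mean([grad_i(w0, i) for i in range(n)], axis=0)  # true gradient at w_0
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)                    # random instance (x_t, y_t)
        m = grad_i(w, i) - grad_i(w0, i) + g0  # mixed gradient
        w = w - eta * m                        # fixed step size
    return w
```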
Theoretical Guarantees

Theorem 1 ([Zhang et al., 2013a])
Suppose the objective function is smooth and strongly convex, with condition number κ. To find an ε-optimal solution, mixed gradient descent needs

        True Gradients   Stochastic Gradients
MGD     O(log(1/ε))      O(κ² log(1/ε))

In contrast, SGD needs O(1/ε) stochastic gradients.

Extensions
For unbounded domains, O(κ² log(1/ε)) can be improved to O(κ log(1/ε)) [Johnson and Zhang, 2013]
For smooth and convex functions, O(log(1/ε)) true gradients and O(1/ε) stochastic gradients are needed [Mahdavi et al., 2013]
Experimental Results I

Reuters Corpus Volume I (RCV1) data set
The optimization error
[Figure: training loss minus the optimum (log scale, roughly 10⁻¹⁰ to 10⁻⁵) versus the number of gradients (0 to 200), comparing EMGD and SGD.]
Experimental Results II

Reuters Corpus Volume I (RCV1) data set
The variance of the mixed gradient
[Figure: variance (log scale, roughly 10⁻¹⁵ to 10⁻⁵) versus the number of gradients (0 to 200), comparing EMGD and SGD.]
The Power of Random Projection

Random Projection
A dimensionality reduction method:

x \in \mathbb{R}^d \;\longrightarrow\; A^\top x \in \mathbb{R}^m

where A \in \mathbb{R}^{d \times m} and A_{ij} \sim \mathcal{N}(0, 1/m)

Theorem 1 (Johnson-Lindenstrauss Lemma [Achlioptas, 2003])
Given ε > 0 and an integer n, let m be a positive integer such that m = Ω(ε⁻² log n). For every set P of n points in R^d there exists f : R^d → R^m such that for all x_i, x_j ∈ P,

(1 - \epsilon)\|x_i - x_j\|^2 \le \|f(x_i) - f(x_j)\|^2 \le (1 + \epsilon)\|x_i - x_j\|^2.
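A quick empirical check of the lemma with a Gaussian A as defined above (all sizes are arbitrary choices for the demo): pairwise squared distances survive the projection up to a small multiplicative distortion.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d, m, n = 10_000, 500, 50
P = rng.standard_normal((d, n))               # n points in R^d (columns)

A = rng.standard_normal((d, m)) / np.sqrt(m)  # A_ij ~ N(0, 1/m)
Q = A.T @ P                                   # projected points in R^m

ratios = [
    np.sum((Q[:, i] - Q[:, j]) ** 2) / np.sum((P[:, i] - P[:, j]) ** 2)
    for i, j in combinations(range(n), 2)
]
print(min(ratios), max(ratios))  # both close to 1: the (1 +/- eps) band
```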
Optimization after Random Projection I

The Primal Problem in R^d

\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \frac{\lambda}{2} \|w\|^2

Traditional Approach
1 Reduce the dimensionality: \hat{x}_i = A^\top x_i \in \mathbb{R}^m
2 Solve the primal problem in R^m:

\min_{z \in \mathbb{R}^m} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, z^\top \hat{x}_i) + \frac{\lambda}{2} \|z\|^2

3 Compute \hat{w} \in \mathbb{R}^d by \hat{w} = A z_*
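A sketch of the three steps, assuming the squared loss ℓ(y, u) = (u − y)²/2 so that step 2 has a closed form; for a general loss one would run gradient descent in R^m instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 2000, 100
X = rng.standard_normal((d, n))  # columns are the instances x_i
y = rng.standard_normal(n)
lam = 0.1

# Step 1: reduce the dimensionality, x_hat_i = A^T x_i
A = rng.standard_normal((d, m)) / np.sqrt(m)
X_hat = A.T @ X  # m x n

# Step 2: solve the m-dimensional primal; with the squared loss the optimality
# condition is the ridge system (X_hat X_hat^T / n + lam I) z = X_hat y / n
z_star = np.linalg.solve(X_hat @ X_hat.T / n + lam * np.eye(m), X_hat @ y / n)

# Step 3: map the low-dimensional solution back to R^d
w_hat = A @ z_star
```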
Optimization after Random Projection II

Advantages
Time complexity is reduced from O(nd) to O(nm)
Space complexity is reduced from O(nd) to O(nm)
It becomes feasible to run gradient descent, which converges fast

The Limitation
\hat{w} = A z_* is not a good approximation of

w_* = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \frac{\lambda}{2} \|w\|^2
Optimization after Random Projection III

Proposition 1 (Distance of a Random Subspace to a Fixed Point [Vershynin, 2009])
Let E \in G_{d,m} be a random subspace (codim E = d - m), and let x be a unit-length vector, arbitrary but fixed. Then, for any ε > 0,

\Pr\left( \mathrm{dist}(x, E) \le \epsilon \sqrt{\frac{d - m}{d}} \right) \le (c\epsilon)^{d - m},

where c is a universal constant.

Consequently, with probability at least 1 - 2^{-(d - m)},

\|\hat{w} - w_*\|_2 \ge \frac{1}{2c} \sqrt{\frac{d - m}{d}} \, \|w_*\|_2
Motivations I

The Primal Problem in R^d

\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \frac{\lambda}{2} \|w\|^2   (P1)

The Dual Problem

\max_{\alpha \in \Omega^n} \; -\sum_{i=1}^{n} \ell_*(\alpha_i) - \frac{1}{2n\lambda} (\alpha \circ y)^\top X^\top X (\alpha \circ y)   (D1)

where ℓ_* is the convex conjugate of the loss, X = [x_1, …, x_n] ∈ R^{d×n}, and ∘ denotes the element-wise product.

Proposition 2
Let w_* ∈ R^d and α_* ∈ R^n be the solutions to (P1) and (D1). Then

w_* = -\frac{1}{\lambda n} X (\alpha_* \circ y), \qquad [\alpha_*]_i = \ell'\left(y_i, x_i^\top w_*\right), \; i = 1, \dots, n.
Motivations II

The Primal Problem in R^m

\min_{z \in \mathbb{R}^m} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, z^\top \hat{x}_i) + \frac{\lambda}{2} \|z\|^2   (P2)

The Dual Problem

\max_{\alpha \in \Omega^n} \; -\sum_{i=1}^{n} \ell_*(\alpha_i) - \frac{1}{2\lambda n} (\alpha \circ y)^\top X^\top A A^\top X (\alpha \circ y)   (D2)

Proposition 3
Let z_* ∈ R^m and α_* ∈ R^n be the solutions to (P2) and (D2). Then

z_* = -\frac{1}{\lambda n} A^\top X (\alpha_* \circ y), \qquad [\alpha_*]_i = \ell'\left(y_i, \hat{x}_i^\top z_*\right), \; i = 1, \dots, n.
Motivations III
The Big Picture
Primal-Primal, Primal-Dual, Dual-Dual
Motivations IV
Optimization after Random Projection
Primal Solution → Primal Solution
Dual Random Projection
Use Dual Solutions to Bridge Primal Solutions
Primal Solution → Dual Solution → Primal Solution
Dual Random Projection

The Algorithm (COLT 2013 & IEEE Trans. Inf. Theory 2014)
1 Reduce the dimensionality: \hat{x}_i = A^\top x_i \in \mathbb{R}^m
2 Solve the low-dimensional problem

\min_{z \in \mathbb{R}^m} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, z^\top \hat{x}_i) + \frac{\lambda}{2} \|z\|^2

3 Construct the dual solution \hat{\alpha}_* \in \mathbb{R}^n by

[\hat{\alpha}_*]_i = \ell'\left(y_i, \hat{x}_i^\top z_*\right), \; i = 1, \dots, n

4 Compute \hat{w} \in \mathbb{R}^d by

\hat{w} = -\frac{1}{\lambda n} X (\hat{\alpha}_* \circ y)
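An end-to-end sketch of the four steps on synthetic low-rank data. Assumptions beyond the slide: the logistic loss in margin form ℓ(y, u) = log(1 + e^{-yu}), with ℓ' taken at the margin so that the α ∘ y convention applies, and a plain full-gradient solver for both subproblems; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, m, lam = 1000, 500, 10, 100, 0.01

# Synthetic data: X is d x n with rank r and unit-norm columns x_i
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))
X /= np.linalg.norm(X, axis=0)
y = np.sign(X.T @ rng.standard_normal(d))

def solve_primal(X, y, lam, eta=1.0, T=2000):
    """Full GD on (1/n) sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2) ||w||^2."""
    w = np.zeros(X.shape[0])
    for _ in range(T):
        alpha = -1.0 / (1.0 + np.exp(y * (X.T @ w)))  # l'(y_i, x_i^T w)
        w -= eta * (X @ (alpha * y) / len(y) + lam * w)
    return w

# Step 1: random projection, x_hat_i = A^T x_i
A = rng.standard_normal((d, m)) / np.sqrt(m)
X_hat = A.T @ X

# Step 2: solve the low-dimensional problem
z_star = solve_primal(X_hat, y, lam)

# Step 3: construct the dual solution from z_*
alpha_hat = -1.0 / (1.0 + np.exp(y * (X_hat.T @ z_star)))

# Step 4: recover the high-dimensional solution
w_drp = -X @ (alpha_hat * y) / (lam * n)

# Compare against the naive recovery w = A z_* and the true optimum w_*
w_naive = A @ z_star
w_star = solve_primal(X, y, lam)
for name, w in [("DRP", w_drp), ("naive", w_naive)]:
    print(name, np.linalg.norm(w - w_star) / np.linalg.norm(w_star))
```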
Theoretical Guarantees

Low-rank Assumption
r = rank(X) ≪ min(d, n).

Theorem 2 ([Zhang et al., 2013b], [Zhang et al., 2014])
For any 0 < ε ≤ 1/2, with probability at least 1 - δ, we have

\|\hat{w} - w_*\|_2 \le \frac{\epsilon}{1 - \epsilon} \|w_*\|_2,

provided

m \ge \frac{(r + 1) \log(2r/\delta)}{c\epsilon^2},

where the constant c is at least 1/4.

Implication
To accurately recover w_*, the number of required random projections is Ω(r log r).
Experimental Results I

A 20,000 × 50,000 data matrix with rank 10.
The reconstruction error
[Figure: relative error versus the number of random projections m (500 to 1500), comparing DRP with the naive approach.]
Experimental Results II

A 20,000 × 50,000 data matrix with rank 10.
The running time
[Figure: running time in seconds versus m (500 to 1500), comparing DRP, RP, and the baseline.]
Conclusions and Future Work

Summary
Based on random sampling, we propose a Mixed Gradient Descent (MGD) algorithm which improves the convergence rate significantly.
Based on random projection, we propose a Dual Random Projection (DRP) algorithm which can recover the optimal solution accurately.

Future Work
Extend MGD to distributed environments
Relax the assumptions in Dual Random Projection [Yang et al., 2015]
Extend DRP to more problems, such as sparse learning
Thanks!
Reference I
Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.

Mahdavi, M., Zhang, L., and Jin, R. (2013). Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems 26 (NIPS), pages 674–682.

Vershynin, R. (2009). Lectures in geometric functional analysis. Technical report, University of Michigan.

Yang, T., Zhang, L., Jin, R., and Zhu, S. (2015). Theory of dual-sparse regularized randomized reduction. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
Reference II
Zhang, L., Mahdavi, M., and Jin, R. (2013a). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988.

Zhang, L., Mahdavi, M., Jin, R., Yang, T., and Zhu, S. (2013b). Recovering the optimal solution by dual random projection. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), pages 135–157.

Zhang, L., Mahdavi, M., Jin, R., Yang, T., and Zhu, S. (2014). Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 60(11):7300–7316.