Large-scale Stochastic Optimization
11-741/641/441 (Spring 2016)
Hanxiao Liu (hanxiaol@cs.cmu.edu)
March 24, 2016
Outline

1. Gradient Descent (GD)
2. Stochastic Gradient Descent (SGD)
   - Formulation
   - Comparisons with GD
3. Useful large-scale SGD solvers
   - Support Vector Machines
   - Matrix Factorization
4. Random topics
   - Variance reduction
   - Implementation trick
   - Other variants
Risk Minimization

$\{(x_i, y_i)\}_{i=1}^n$: training data $\overset{\text{i.i.d.}}{\sim} \mathcal{D}$.

$$\min_f \; \underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\, \ell(f(x), y)}_{\text{true risk}} \;\Longrightarrow\; \min_f \; \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)}_{\text{empirical risk}} \qquad (1)$$

$$\Longrightarrow\; \min_w \; \frac{1}{n}\sum_{i=1}^n \ell(f_w(x_i), y_i) \qquad (2)$$

| Algorithm | $\ell(f_w(x_i), y_i)$ |
|---|---|
| Logistic Regression | $\ln(1 + e^{-y_i w^\top x_i}) + \frac{\lambda}{2}\lVert w\rVert^2$ |
| SVMs | $\max(0, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\lVert w\rVert^2$ |
Gradient Descent (GD)

Define $\ell_i(w) \overset{\text{def}}{=} \ell(f_w(x_i), y_i)$ and $\ell(w) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \ell_i(w)$.

Training objective:

$$\min_w \; \ell(w) \qquad (3)$$

Gradient update: $w^{(k)} = w^{(k-1)} - \eta_k \nabla \ell(w^{(k-1)})$

- $\eta_k$: pre-specified or determined via backtracking
- Can be easily generalized to handle nonsmooth losses:
  1. gradient → subgradient
  2. proximal gradient method (for structured $\ell(w)$)

Question of interest: how fast does GD converge?
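As a minimal sketch (not from the slides; NumPy and a toy least-squares objective are assumptions), the gradient update above with a fixed step size, checked against the closed-form minimizer:

```python
import numpy as np

def gradient_descent(grad, w0, eta, iters):
    """Plain GD: w^(k) = w^(k-1) - eta * grad(w^(k-1)), fixed step size."""
    w = w0.copy()
    for _ in range(iters):
        w = w - eta * grad(w)
    return w

# Toy objective: l(w) = (1/2) ||A w - b||^2, so grad l(w) = A^T (A w - b).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)
grad = lambda w: A.T @ (A @ w - b)

# eta is chosen below 1/L, where L is the largest eigenvalue of A^T A.
w_hat = gradient_descent(grad, np.zeros(5), eta=1e-2, iters=500)
w_star = np.linalg.lstsq(A, b, rcond=None)[0]  # closed-form minimizer, for reference
```

Since the objective is a strongly convex quadratic, the iterates contract geometrically toward `w_star`, matching the $O(c^k)$ rate discussed next.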
Convergence

Theorem (GD convergence). If $\ell$ is both convex and differentiable,¹

$$\ell(w^{(k)}) - \ell(w^*) \le \begin{cases} \dfrac{\lVert w^{(0)} - w^*\rVert_2^2}{2\eta k} = O\!\left(\tfrac{1}{k}\right) & \text{in general} \\[1ex] \dfrac{c^k L \lVert w^{(0)} - w^*\rVert_2^2}{2} = O(c^k) & \text{if } \ell \text{ is strongly convex} \end{cases} \qquad (4)$$

To achieve $\ell(w^{(k)}) - \ell(w^*) \le \rho$, GD needs $O(\tfrac{1}{\rho})$ iterations in general, and $O(\log\tfrac{1}{\rho})$ iterations under strong convexity.

¹ The step size $\eta$ must be no larger than $\tfrac{1}{L}$, where $L$ is the Lipschitz constant satisfying $\lVert \nabla\ell(a) - \nabla\ell(b)\rVert_2 \le L\lVert a - b\rVert_2$ for all $a, b$.
GD Efficiency

Why not be happy with GD?

- Fast convergence ≠ high efficiency.

$$w^{(k)} = w^{(k-1)} - \eta_k \nabla \ell(w^{(k-1)}) \qquad (5)$$

$$\phantom{w^{(k)}} = w^{(k-1)} - \eta_k \nabla\left[\frac{1}{n}\sum_{i=1}^n \ell_i(w^{(k-1)})\right] \qquad (6)$$

- Per-iteration complexity is $O(n)$, which is extremely large: a single pass over all the data may take forever.
- A cheaper GD? SGD.
Stochastic Gradient Descent

Approximate the full gradient via an unbiased estimator:

$$w^{(k)} = w^{(k-1)} - \eta_k \nabla\left(\frac{1}{n}\sum_{i=1}^n \ell_i(w^{(k-1)})\right) \qquad (7)$$

$$\approx \underbrace{w^{(k-1)} - \eta_k \nabla\left(\frac{1}{|B|}\sum_{i\in B} \ell_i(w^{(k-1)})\right)}_{\text{mini-batch SGD}^2}, \quad B \overset{\text{unif}}{\sim} \{1, 2, \ldots, n\} \qquad (8)$$

$$\approx \underbrace{w^{(k-1)} - \eta_k \nabla \ell_i(w^{(k-1)})}_{\text{pure SGD}}, \quad i \overset{\text{unif}}{\sim} \{1, 2, \ldots, n\} \qquad (9)$$

Trade-off: lower computation cost vs. larger variance.

² When using GPUs, $|B|$ usually depends on the memory budget.
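The mini-batch update (8) can be sketched as follows (not from the slides; NumPy, the decaying step size $\eta_k = \eta_0/\sqrt{k}$, and the noiseless toy problem are assumptions):

```python
import numpy as np

def minibatch_sgd(stoch_grad, n, w0, eta0, steps, batch_size, seed=0):
    """Mini-batch SGD as in (8): sample B uniformly, step with eta_k = eta0/sqrt(k)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for k in range(1, steps + 1):
        B = rng.integers(0, n, size=batch_size)          # B ~ uniform{1..n}
        w = w - (eta0 / np.sqrt(k)) * stoch_grad(w, B)   # unbiased gradient estimate
    return w

# Toy: l_i(w) = (1/2)(x_i^T w - y_i)^2 with noiseless labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
stoch_grad = lambda w, B: X[B].T @ (X[B] @ w - y[B]) / len(B)  # batch-averaged gradient

w_hat = minibatch_sgd(stoch_grad, n=1000, w0=np.zeros(3), eta0=0.5,
                      steps=600, batch_size=32)
```

Each step costs $O(|B|)$ instead of $O(n)$; the decaying step size damps the gradient variance, illustrating the trade-off noted above.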
GD vs. SGD

For strongly convex $\ell(w)$, according to [Bottou, 2012]:

| Optimizer | GD | SGD | Winner |
|---|---|---|---|
| Time per iteration | $O(n)$ | $O(1)$ | SGD |
| Iterations to accuracy $\rho$ | $O(\log\frac{1}{\rho})$ | $O(\frac{1}{\rho})$ | GD |
| Time to accuracy $\rho$ | $O(n\log\frac{1}{\rho})$ | $O(\frac{1}{\rho})$ | Depends |
| Time to "generalization error" $\varepsilon$ | $O(\frac{1}{\varepsilon^{1/\alpha}}\log\frac{1}{\varepsilon})$ | $O(\frac{1}{\varepsilon})$ | SGD |

where $\frac{1}{2} \le \alpha \le 1$.
SVMs Solver: Pegasos

[Shalev-Shwartz et al., 2011]

Recall

$$\ell_i(w) = \max(0, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\lVert w\rVert^2 \qquad (10)$$

$$= \begin{cases} \frac{\lambda}{2}\lVert w\rVert^2 & y_i w^\top x_i \ge 1 \\ 1 - y_i w^\top x_i + \frac{\lambda}{2}\lVert w\rVert^2 & y_i w^\top x_i < 1 \end{cases} \qquad (11)$$

Therefore

$$\nabla \ell_i(w) = \begin{cases} \lambda w & y_i w^\top x_i \ge 1 \\ \lambda w - y_i x_i & y_i w^\top x_i < 1 \end{cases} \qquad (12)$$
SVMs in 10 Lines

Algorithm 1: Pegasos, an SGD solver for SVMs

```
Input: n, λ, T
Initialize: w^(1) ← 0
for k = 1, 2, ..., T do
    i ~ uniform{1, 2, ..., n}
    η_k ← 1/(λk)
    if y_i w^(k)·x_i < 1 then
        w^(k+1) ← w^(k) - η_k (λ w^(k) - y_i x_i)
    else
        w^(k+1) ← w^(k) - η_k λ w^(k)
end
Output: w^(T+1)
```
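Algorithm 1 translates almost line-for-line into code. A minimal sketch (not from the slides; NumPy and the separable toy data are assumptions):

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Pegasos: SGD for max(0, 1 - y_i w.x_i) + (lam/2)||w||^2, eta_k = 1/(lam*k)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for k in range(1, T + 1):
        i = rng.integers(n)            # i ~ uniform{1..n}
        eta = 1.0 / (lam * k)          # eta_k = 1/(lambda k)
        if y[i] * (w @ X[i]) < 1:      # hinge active: gradient is lam*w - y_i x_i
            w = w - eta * (lam * w - y[i] * X[i])
        else:                          # only the regularizer contributes
            w = w - eta * lam * w
    return w

# Linearly separable toy data: two Gaussian blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+2.0, 1.0, size=(100, 2)),
               rng.normal(-2.0, 1.0, size=(100, 2))])
y = np.array([1] * 100 + [-1] * 100)

w = pegasos(X, y, lam=0.1, T=5000)
acc = np.mean(np.sign(X @ w) == y)
```

Note the two branches are exactly the two cases of the subgradient (12); the $1/(\lambda k)$ step size is what gives Pegasos its convergence guarantee.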
Empirical Comparisons

SGD vs. batch solvers³ on RCV1 (47,152 features; 781,265 training examples):

| Algorithm | Time (secs) | Primal Obj | Test Error |
|---|---|---|---|
| SMO (SVMlight) | ≈ 16,000 | 0.2275 | 6.02% |
| Cutting Plane (SVMperf) | ≈ 45 | 0.2275 | 6.02% |
| SGD | < 1 | 0.2275 | 6.02% |

Where is the magic?

³ http://leon.bottou.org/projects/sgd
The Magic

- SGD would take a long time to solve the problem to high accuracy.
- But there is no need to solve the problem super accurately in order to get good generalization ability.

Source: http://leon.bottou.org/slides/largescale/lstut.pdf
SGD for Matrix Factorization

The idea of SGD can be trivially extended to MF:⁴

$$\ell(U, V) = \frac{1}{|O|} \sum_{(a,b)\in O} \underbrace{\ell_{a,b}(u_a, v_b)}_{\text{e.g. } (r_{ab} - u_a^\top v_b)^2} \qquad (13)$$

SGD updating rule: for each user-item pair $(a, b) \sim O$,

$$u_a^{(k)} = u_a^{(k-1)} - \eta_k \nabla \ell_{a,b}(u_a^{(k-1)}) \qquad (14)$$

$$v_b^{(k)} = v_b^{(k-1)} - \eta_k \nabla \ell_{a,b}(v_b^{(k-1)}) \qquad (15)$$

This is the building block for distributed SGD for MF.

⁴ We omit the regularization term for simplicity.
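The per-rating updates (14)-(15) can be sketched as follows (not from the slides; NumPy, the small initialization, and the fully observed rank-2 toy matrix are assumptions; the constant factor 2 of the squared-loss gradient is absorbed into $\eta$):

```python
import numpy as np

def mf_sgd(obs, n_users, n_items, rank, eta, epochs, seed=0):
    """Per-rating SGD for the squared loss (r_ab - u_a . v_b)^2, as in (14)-(15).
    No regularizer, matching the slides' simplification."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_users, rank))
    V = 0.1 * rng.normal(size=(n_items, rank))
    for _ in range(epochs):
        for a, b, r in obs:
            err = r - U[a] @ V[b]               # residual for this (user, item) pair
            ua_old = U[a].copy()
            U[a] = U[a] + eta * err * V[b]      # u_a step: minus eta * grad_{u_a}
            V[b] = V[b] + eta * err * ua_old    # v_b step uses the pre-update u_a
    return U, V

# Toy: exact rank-2 matrix, every entry observed.
rng = np.random.default_rng(3)
U0 = rng.normal(size=(30, 2))
V0 = rng.normal(size=(20, 2))
R = U0 @ V0.T
obs = [(a, b, R[a, b]) for a in range(30) for b in range(20)]

U, V = mf_sgd(obs, n_users=30, n_items=20, rank=2, eta=0.01, epochs=100)
rmse = np.sqrt(np.mean((U @ V.T - R) ** 2))
```

Each update touches only one row of $U$ and one row of $V$, which is exactly what makes the distributed schemes of [Gemulla et al., 2011] possible.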
Empirical Comparisons

On Netflix [Gemulla et al., 2011] (plot omitted):

- DSGD: distributed SGD
- ALS: alternating least squares (one of the state-of-the-art batch solvers)
- DGD: distributed GD
SGD Revisited

Can we do even better?

Bottleneck of SGD: high variance in $\nabla \ell_i(w)$.

- Less effective gradient steps.
- The presence of variance forces $\lim_{k\to\infty} \eta_k = 0$ for convergence, which means slower progress.

Variance reduction: SVRG [Johnson and Zhang, 2013], SAG, SDCA, ...
Stochastic Variance Reduced Gradient

- $\bar{w}$: a snapshot of $w$ (to be updated every few cycles)
- $\bar{\mu} = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(\bar{w})$

Key idea: use $\nabla \ell_i(\bar{w})$ to "cancel" the volatility in $\nabla \ell_i(w)$:

$$\frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w) = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w) - \bar{\mu} + \bar{\mu} \qquad (16)$$

$$= \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w) - \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(\bar{w}) + \bar{\mu} \qquad (17)$$

$$\approx \nabla \ell_i(w) - \nabla \ell_i(\bar{w}) + \bar{\mu}, \quad i \sim \{1, 2, \ldots, n\} \qquad (18)$$

A desirable property: $\nabla \ell_i(w^{(k)}) - \nabla \ell_i(\bar{w}) + \bar{\mu} \to 0$ as both $w^{(k)}$ and $\bar{w}$ approach $w^*$.
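The estimator (18) plugs directly into the SGD loop. A minimal sketch (not from the slides; NumPy, the epoch length `m`, and the toy least-squares problem are assumptions):

```python
import numpy as np

def svrg(grad_i, n, w0, eta, epochs, m, seed=0):
    """SVRG: each inner step uses grad_i(w) - grad_i(w_bar) + mu_bar, as in (18),
    where w_bar is a snapshot and mu_bar is the full gradient at w_bar."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        w_bar = w.copy()                       # snapshot, refreshed once per epoch
        mu_bar = np.mean([grad_i(w_bar, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_bar, i) + mu_bar  # variance-reduced estimate
            w = w - eta * g                    # a constant step size suffices
    return w

# Toy least squares: l_i(w) = (1/2)(x_i^T w - y_i)^2.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
y = X @ w_true
grad_i = lambda w, i: X[i] * (X[i] @ w - y[i])

w_hat = svrg(grad_i, n=500, w0=np.zeros(5), eta=0.05, epochs=10, m=1000)
```

Because the gradient estimate vanishes as the iterate and the snapshot approach $w^*$, the step size need not decay, which is what restores a linear convergence rate.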
Results

Image classification using neural networks [Johnson and Zhang, 2013] (plot omitted).
Implementation Trick

For $\ell_i(w) = \psi_i(w) + \frac{\lambda}{2}\lVert w\rVert^2$:

$$w^{(k+1)} \leftarrow \underbrace{(1 - \eta\lambda)\, w^{(k)}}_{\text{shrink } w} - \underbrace{\eta \nabla \psi_i(w^{(k)})}_{\text{highly sparse}} \qquad (19)$$

The shrinking operation takes $O(p)$ time, which is unfortunate. The trick⁵ is to recast $w$ as $w = c \cdot w'$:

$$c^{(k+1)} \cdot w'^{(k+1)} \leftarrow \underbrace{(1 - \eta\lambda)\, c^{(k)}}_{\text{scalar update}} \cdot \underbrace{\left( w'^{(k)} - \frac{\eta \nabla \psi_i(c^{(k)} w'^{(k)})}{(1 - \eta\lambda)\, c^{(k)}} \right)}_{\text{sparse update}} \qquad (20)$$

More SGD tricks can be found in [Bottou, 2012].

⁵ http://blog.smola.org/post/940672544/fast-quadratic-regularization-for-online-learning
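A toy check that the scaled update (20) reproduces the dense update (19) exactly (not from the slides; NumPy and the dict-of-nonzeros gradient representation are assumptions):

```python
import numpy as np

def lazy_reg_step(c, w_prime, sparse_grad, eta, lam):
    """One update in the c * w' representation, following (20): shrink the scalar c
    in O(1), then apply the sparse psi-gradient (a {index: value} dict) divided by
    the new scale, so that c * w' tracks the dense iterate exactly."""
    c = (1.0 - eta * lam) * c               # O(1) scalar shrink
    for j, g in sparse_grad.items():        # O(nnz) sparse correction
        w_prime[j] -= eta * g / c
    return c, w_prime

# Compare against the dense update (19) on a small example.
eta, lam = 0.1, 0.5
w = np.arange(1.0, 7.0)                     # dense iterate
c, w_prime = 1.0, w.copy()                  # scaled representation: w = c * w'
sparse_grad = {2: 3.0, 4: -1.0}             # grad of psi_i, nonzero in 2 coordinates

w_dense = (1.0 - eta * lam) * w             # (19): O(p) shrink ...
for j, g in sparse_grad.items():
    w_dense[j] -= eta * g                   # ... plus the sparse gradient step

c, w_prime = lazy_reg_step(c, w_prime, sparse_grad, eta, lam)
```

The per-step cost drops from $O(p)$ to $O(\mathrm{nnz}(\nabla\psi_i))$; in practice one also renormalizes $c$ occasionally to avoid it underflowing after many shrinks.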
Popular SGD Variants

A non-exhaustive list:

1. AdaGrad [Duchi et al., 2011]
2. Momentum [Rumelhart et al., 1988]
3. Nesterov's method [Nesterov et al., 1994]
4. AdaDelta, a refinement of AdaGrad [Zeiler, 2012]
5. Rprop & RMSprop [Tieleman and Hinton, 2012]: ignore the magnitude of the gradient

All are empirically found effective in solving nonconvex problems (e.g., deep neural nets).

Demos⁶: Animations 0, 1, 2, 3

⁶ https://www.reddit.com/r/MachineLearning/comments/2gopfa/visualizing_gradient_optimization_techniques/cklhott
Summary

Today's talk:

1. GD: expensive, accurate gradient evaluation
2. SGD: cheap, noisy gradient evaluation
3. SGD-based solvers (SVMs, MF)
4. Variance reduction techniques

Remarks about SGD:

- Extremely handy for large problems.
- Only one of many handy tools; alternatives include quasi-Newton methods (BFGS), coordinate descent, ADMM, CG, etc., depending on the problem structure.
References

Bordes, A., Bottou, L., and Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic gradient descent. The Journal of Machine Learning Research, 10:1737-1754.

Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421-436. Springer.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159.

Gemulla, R., Nijkamp, E., Haas, P. J., and Sismanis, Y. (2011). Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69-77. ACM.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323.
Nesterov, Y., Nemirovskii, A., and Ye, Y. (1994). Interior-Point Polynomial Algorithms in Convex Programming, volume 13. SIAM.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5:3.

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4.

Zeiler, M. D. (2012). AdaDelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Zhang, T. Modern optimization techniques for big data machine learning.