
Stochastic Optimization for Large-scale Machine Learning

Lijun Zhang

LAMDA Group, Nanjing University, China

The 13th Chinese Workshop on Machine Learning and Applications

http://cs.nju.edu.cn/zlj

Outline

1 Introduction: Definition, Advantages

2 Time Reduction: Background, Mixed Gradient Descent

3 Space Reduction: Background, Stochastic Proximal Gradient Descent

4 Conclusion

What is Stochastic Optimization? (I)

Definition 1: A Special Objective [Nemirovski et al., 2009]

min_{w ∈ W} f(w) = E_ξ[ℓ(w, ξ)] = ∫_Ξ ℓ(w, ξ) dP(ξ)

where ξ is a random variable

It is possible to generate an i.i.d. sample ξ_1, ξ_2, . . .

Risk Minimization

min_{w ∈ W} f(w) = E_{(x,y)}[ℓ(w, (x, y))]

ℓ(·, ·) is a loss function, e.g., the hinge loss ℓ(u, v) = max(0, 1 − uv)

Training data (x_1, y_1), . . . , (x_n, y_n) are i.i.d.

What is Stochastic Optimization? (II)

Definition 2: A Special Access Model [Hazan and Kale, 2011]

min_{w ∈ W} f(w)

There exists a stochastic oracle o(·) that produces unbiased (sub)gradients:

E[o(w)] ∈ ∂f(w)  or  E[o(w)] = ∇f(w)

Empirical Risk Minimization

min_{w ∈ W} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)

Sampling (x_t, y_t) uniformly at random from {(x_i, y_i)}_{i=1}^n gives

E[∂ℓ(y_t, x_t^⊤ w)] ⊆ ∂ ((1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w))
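To make the access model concrete, here is a minimal sketch of a stochastic first-order oracle for ERM (an added illustration, not from the slides); the logistic loss, the data, and all names are assumptions chosen for the example.

import numpy as np

def stochastic_oracle(w, X, y, rng):
    """Return an unbiased estimate of the ERM gradient (1/n) sum_i grad l(y_i, x_i^T w).

    Here l is the logistic loss l(y, z) = log(1 + exp(-y z)); sampling one example
    uniformly at random makes the returned gradient unbiased in expectation.
    """
    n = X.shape[0]
    i = rng.integers(n)                      # uniform index in {0, ..., n-1}
    z = y[i] * X[i] @ w
    return -y[i] * X[i] / (1.0 + np.exp(z))  # gradient of the logistic loss at (x_i, y_i)

# Usage: averaging many oracle calls approximates the full gradient.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(rng.standard_normal(100))
w = np.zeros(5)
full_grad = np.mean([-y[i] * X[i] / (1.0 + np.exp(y[i] * X[i] @ w)) for i in range(100)], axis=0)
avg_oracle = np.mean([stochastic_oracle(w, X, y, rng) for _ in range(50000)], axis=0)
print(np.allclose(full_grad, avg_oracle, atol=1e-2))  # approximately True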


Why Stochastic Optimization? (I)

[Figure: "World of Data" infographic illustrating the scale of modern data]
https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg

Why Stochastic Optimization? (II)

Empirical Risk Minimization

min_{w ∈ W ⊆ R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)

n is the number of training data
d is the dimensionality

Deterministic Optimization—Gradient Descent (GD)

1: for t = 1, 2, . . . , T do
2:   w'_{t+1} = w_t − η_t ((1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_t))
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for

The Challenge

Time complexity per iteration: O(nd) + O(poly(d))
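To make the per-iteration O(nd) cost concrete, here is a minimal projected gradient descent sketch for an ERM problem of this form (an added illustration, not the slides' code); the squared loss, the norm-ball constraint, and all names are assumptions chosen for the example.

import numpy as np

def project_onto_ball(w, radius=1.0):
    # Euclidean projection onto W = {w : ||w|| <= radius}
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def gradient_descent(X, y, eta=0.1, T=100):
    """Projected GD for min_w (1/2n) sum_i (y_i - x_i^T w)^2 over a norm ball.

    Each iteration touches every row of X, hence the O(nd) cost per step.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        grad = X.T @ (X @ w - y) / n        # full gradient: one pass over all n examples
        w = project_onto_ball(w - eta * grad)
    return w

# Usage on synthetic data (assumed example, not the slides' experiments).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
w_true = project_onto_ball(rng.standard_normal(20))
y = X @ w_true + 0.01 * rng.standard_normal(1000)
print(np.linalg.norm(gradient_descent(X, y) - w_true))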

Why Stochastic Optimization? (III)

Empirical Risk Minimization

min_{w ∈ W ⊆ R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)

Stochastic Optimization—Stochastic Gradient Descent (SGD)

1: for t = 1, 2, . . . , T do
2:   Sample a training example (x_t, y_t) randomly
3:   w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w'_{t+1})
5: end for

The Advantage—Time Reduction

Time complexity per iteration: O(d) + O(poly(d))
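For contrast, a minimal SGD sketch for the same assumed least-squares problem (again an added illustration, not the slides' code): each step touches a single example, so the per-iteration work drops from O(nd) to O(d).

import numpy as np

def sgd(X, y, eta0=0.5, T=5000, radius=1.0, seed=0):
    """Projected SGD for min_w (1/2n) sum_i (y_i - x_i^T w)^2 over a norm ball.

    Each iteration uses one randomly sampled example, hence O(d) work per step.
    A decreasing step size eta_t = eta0 / sqrt(t) is used, as in the convex case.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                        # sample one training example
        grad_i = (X[i] @ w - y[i]) * X[i]          # stochastic gradient from that example
        w = w - (eta0 / np.sqrt(t)) * grad_i
        norm = np.linalg.norm(w)                   # projection back onto the ball W
        if norm > radius:
            w *= radius / norm
    return w

Switching to eta_t = O(1/t) under strong convexity, or averaging the iterates, matches the step-size choices discussed later in the talk.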

Why Stochastic Optimization? (IV)

Optimization over Large Matrices

min_{W ∈ W ⊆ R^{m×n}} F(W)

Matrix completion, distance metric learning, multi-task learning, multi-class learning

Deterministic Optimization—Gradient Descent (GD)

1: for t = 1, 2, . . . , T do
2:   W'_{t+1} = W_t − η_t ∇F(W_t)
3:   W_{t+1} = Π_W(W'_{t+1})
4: end for

The Challenge

Space complexity per iteration: O(mn)

Why Stochastic Optimization? (V)

Optimization over Large Matrices

min_{W ∈ W ⊆ R^{m×n}} F(W)

Stochastic Optimization—Stochastic Gradient Descent (SGD)

1: for t = 1, 2, . . . , T do
2:   Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:   W'_{t+1} = W_t − η_t G_t
4:   W_{t+1} = Π_W(W'_{t+1})
5: end for

The Advantage—Space Reduction

Space complexity per iteration could be: O((m + n)r)
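As one concrete instance of a low-rank stochastic gradient (an added, assumed example, not from the slides): for matrix completion with squared loss on observed entries, sampling one observed entry gives an unbiased, rank-one gradient that can be stored as an outer product in O(m + n) space.

import numpy as np

def rank_one_stochastic_gradient(W, observed, rng):
    """Unbiased rank-1 stochastic gradient for F(W) = (1/|Omega|) sum_{(i,j,v) in Omega} (W_ij - v)^2 / 2.

    `observed` is a list of (i, j, value) triples; the gradient of one uniformly sampled
    term is (W_ij - v) e_i e_j^T, returned here in factored form (u, v_vec) with
    G_t = np.outer(u, v_vec), so it never needs O(mn) storage.
    """
    i, j, v = observed[rng.integers(len(observed))]
    residual = W[i, j] - v
    m, n = W.shape
    u = np.zeros(m); u[i] = residual       # left factor  (length m)
    v_vec = np.zeros(n); v_vec[j] = 1.0    # right factor (length n)
    return u, v_vec                        # G_t = outer(u, v_vec), rank one

If the iterate W_t is itself kept in factored form, the update W_{t+1} = W_t − η_t G_t can likewise be maintained without ever materializing an m×n matrix.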


Stochastic Gradient Descent (SGD)

Empirical Risk Minimization

min_{w ∈ W ⊆ R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)

The Algorithm

1: for t = 1, 2, . . . , T do
2:   Sample a training example (x_t, y_t) randomly
3:   w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w'_{t+1})
5: end for

Advantage vs. Disadvantage

Time complexity per iteration is low: O(d) + O(poly(d))
The iteration complexity is much higher than that of GD

The Problem

Iteration Complexity

The number of iterations T needed to ensure

f(w_T) − min_{w ∈ W} f(w) ≤ ε

Iteration Complexity of GD and SGD (e.g., ε = 10^−6)

        Convex & Smooth              Strongly Convex & Smooth
GD      O(1/√ε)   ≈ 10^3             O(√κ log(1/ε))   ≈ 6√κ
SGD     O(1/ε^2)  ≈ 10^12            O(1/(µε))        ≈ 10^6/µ

Total Time Complexity of GD and SGD (e.g., ε = 10^−6)

        Convex & Smooth              Strongly Convex & Smooth
GD      O(n/√ε)   ≈ 10^3 n           O(n√κ log(1/ε))  ≈ 6n√κ
SGD     O(1/ε^2)  ≈ 10^12            O(1/(µε))        ≈ 10^6/µ

SGD may be slower than GD if ε is very small.
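As a quick worked check of the numbers in the tables (an added example; it assumes, as the 6√κ entry suggests, that the logarithm is taken base 10):

import math

eps = 1e-6
print(1 / math.sqrt(eps))    # GD, convex & smooth: 1e3 iterations
print(1 / eps ** 2)          # SGD, convex & smooth: 1e12 iterations
print(math.log10(1 / eps))   # log(1/eps) = 6, so GD needs about 6*sqrt(kappa) iterations
                             # (6*n*sqrt(kappa) total time), while SGD needs about
                             # 1e6/mu in the strongly convex & smooth column.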


Motivations

Reason for the Slow Convergence Rate

The step size of SGD is a decreasing sequence:
η_t = 1/√t for convex functions
η_t = 1/t for strongly convex functions

Reason for the Decreasing Step Size

w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)

Stochastic gradients introduce a constant-level error (their variance does not vanish).

The Key Idea

Control the variance of the stochastic gradients
Choose a fixed step size η

Mixed Gradient Descent (I)

Mixed Gradient at w_t

m(w_t) = ∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0) + ∇f(w_0)

where (x_t, y_t) is a random sample, w_0 is an initial solution, and

∇f(w_0) = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_0)

Properties of the Mixed Gradient

It is still an unbiased estimate of the true gradient:

E[m(w_t)] = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_t) = ∇f(w_t)

Its variance is controlled by the distance to w_0:

‖∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0)‖_2 ≤ L‖w_t − w_0‖_2

Mixed Gradient Descent (II)

The Algorithm [Zhang et al., 2013]

1: Compute the true gradient at w_0:
   ∇f(w_0) = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_0)
2: for t = 1, 2, . . . , T do
3:   Sample a training example (x_t, y_t) randomly
4:   Compute the mixed gradient at w_t:
     m(w_t) = ∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0) + ∇f(w_0)
5:   w'_{t+1} = w_t − η m(w_t)
6:   w_{t+1} = Π_W(w'_{t+1})
7: end for
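A minimal sketch of the mixed-gradient update in code (an added illustration with squared loss; the full algorithm of [Zhang et al., 2013] is epoch-based and periodically updates the anchor w_0, and the projection Π_W is omitted here for brevity):

import numpy as np

def mixed_gradient_descent(X, y, eta=0.1, T=2000, seed=0):
    """Mixed/variance-reduced gradient updates for least squares with a fixed anchor w_0.

    m(w_t) = grad_i(w_t) - grad_i(w_0) + full_grad(w_0) is unbiased and its variance
    shrinks as w_t approaches w_0, which is what allows a fixed step size eta.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w0 = np.zeros(d)
    full_grad_w0 = X.T @ (X @ w0 - y) / n          # one full pass: the "true gradient" at w_0
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)
        g_wt = (X[i] @ w - y[i]) * X[i]            # stochastic gradient at w_t
        g_w0 = (X[i] @ w0 - y[i]) * X[i]           # same example, evaluated at w_0
        m = g_wt - g_w0 + full_grad_w0             # mixed gradient
        w = w - eta * m                            # fixed step size
    return w

Periodically resetting w_0 to the current iterate gives the epoch-based variance-reduction scheme analysed in [Johnson and Zhang, 2013].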

Theoretical Guarantees

Theorem 1 ([Zhang et al., 2013])

Suppose the objective function is smooth and strongly convex. To find an ε-optimal solution, mixed gradient descent needs

        True Gradients        Stochastic Gradients
MGD     O(log(1/ε))           O(κ^2 log(1/ε))

In contrast, SGD needs O(1/(µε)) stochastic gradients.

Extensions

For an unbounded domain, O(κ^2 log(1/ε)) can be improved to O(κ log(1/ε)) [Johnson and Zhang, 2013]

For smooth and convex functions, O(log(1/ε)) true gradients and O(1/ε) stochastic gradients suffice [Mahdavi et al., 2013]

Experimental Results (I)

Reuters Corpus Volume I (RCV1) data set
The optimization error

[Figure: Training Loss − Optimum vs. # of Gradients, comparing EMGD and SGD]

Experimental Results (II)

Reuters Corpus Volume I (RCV1) data set
The variance of the mixed gradient

[Figure: Variance vs. # of Gradients, comparing EMGD and SGD]


Stochastic Gradient Descent (SGD)

Nuclear Norm Regularization (Composite Optimization)

min_{W ∈ R^{m×n}} F(W) = f(W) + λ‖W‖_*

The optimal solution W_* is low-rank

The Algorithm [Avron et al., 2012]

1: for t = 1, 2, . . . , T do
2:   Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:   W'_{t+1} = W_t − η_t G_t
4:   W_{t+1} = Π_r(W'_{t+1})   (truncate small singular values)
5: end for

Efficient Implementation

W_{t+1} = Π_r(W_t − η_t G_t) can be computed by incremental SVD with O((m + n)r) space and O((m + n)r^2) time

Advantage vs. Disadvantage

Space complexity per iteration is low: O((m + n)r)
There is no theoretical guarantee, due to the truncation
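A minimal sketch of the rank-r truncation step Π_r (an added illustration using a full SVD for clarity; the incremental SVD mentioned on the slide avoids ever forming the m×n matrix):

import numpy as np

def truncate_rank(W, r):
    """Pi_r(W): keep only the top-r singular values/vectors of W.

    Returned in factored form (U, s, Vt) so downstream code can keep the iterate
    in O((m + n) r) memory instead of materializing U @ diag(s) @ Vt.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

# Usage: one truncated SGD-style update (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((50, 40))
G = np.outer(rng.standard_normal(50), rng.standard_normal(40))   # a rank-1 stochastic gradient
U, s, Vt = truncate_rank(W - 0.1 * G, r=5)
W_next = U @ np.diag(s) @ Vt          # only for checking; would stay factored in practice
print(np.linalg.matrix_rank(W_next))  # 5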


Stochastic Proximal Gradient Descent (SPGD)

Nuclear Norm Regularization

min_{W ∈ R^{m×n}} F(W) = f(W) + λ‖W‖_*

The Algorithm [Zhang et al., 2015b]

1: for t = 1, 2, . . . , T do
2:   Generate a low-rank stochastic gradient G_t of f(·) at W_t
3:   W_{t+1} = argmin_{W ∈ R^{m×n}} (1/2)‖W − (W_t − η_t G_t)‖_F^2 + η_t λ‖W‖_*
4: end for
5: return W_{T+1}

Efficient Implementation

The proximal step can also be solved by incremental SVD with O((m + n)r) space and O((m + n)r^2) time

Advantages

Space complexity per iteration is still low: O((m + n)r)
It is supported by solid theoretical guarantees
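The proximal step in line 3 has a closed form: soft-thresholding the singular values of W_t − η_t G_t. A minimal sketch (an added illustration, again using a full SVD for clarity rather than the incremental SVD of the slide):

import numpy as np

def nuclear_prox(A, tau):
    """argmin_W 0.5 * ||W - A||_F^2 + tau * ||W||_*  via singular value thresholding."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)                 # soft-threshold the singular values
    keep = s_shrunk > 0                                 # the result is automatically low-rank
    return U[:, keep] * s_shrunk[keep] @ Vt[keep, :]

def spgd_step(W, G, eta, lam):
    """One SPGD update: W_{t+1} = prox_{eta*lam*||.||_*}(W_t - eta * G_t)."""
    return nuclear_prox(W - eta * G, eta * lam)

Because the thresholding zeroes out small singular values, the iterates stay low-rank without the heuristic truncation of the previous algorithm.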

Theoretical Guarantees

Theorem 2 (General convex functions [Zhang et al., 2015b])

Assume E[‖G_t‖_F^2] is upper bounded. Setting η_t = 1/√T, we have

E[F(W_T) − F(W_*)] ≤ O(log T / √T)

where W_* is the optimal solution.

Theorem 3 (Strongly convex functions [Zhang et al., 2015b])

Suppose f(·) is strongly convex, and E[‖G_t‖_F^2] is upper bounded. Setting η_t = 1/(µt), we have

E[F(W_T) − F(W_*)] ≤ O(log T / T)

where W_* is the optimal solution.

Application to Kernel PCA

The Traditional Algorithm for Kernel PCA

1 Construct the kernel matrix K ∈ R^{n×n} with K_ij = κ(x_i, x_j)
2 Calculate the top eigenvectors and eigenvalues of K

The Challenge

Space complexity is O(n^2)

The Question

Is it possible to find the top eigensystem of K without constructing K explicitly?

Yes, by SPGD.
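For reference, a minimal sketch of the traditional route (an added, assumed example): the n×n kernel matrix is formed explicitly, which is exactly the O(n^2) memory cost that the stochastic approach below avoids.

import numpy as np

def kernel_pca_naive(X, k, gamma=1.0):
    """Top-k kernel PCA with an explicit RBF kernel matrix (O(n^2) memory)."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # n x n kernel matrix
    # (Centering of K in feature space is omitted for brevity.)
    evals, evecs = np.linalg.eigh(K)                     # eigenvalues in ascending order
    return evals[-k:][::-1], evecs[:, -k:][:, ::-1]      # top-k eigenvalues / eigenvectors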

Application to Kernel PCA (I)

Nuclear Norm Regularized Least Squares

min_{W ∈ R^{n×n}} (1/2)‖W − K‖_F^2 + λ‖W‖_*

The top eigensystem of K can be recovered from the optimal solution W_*

Proximal Gradient Descent

1: for t = 1, 2, . . . , T do
2:   Calculate the gradient G_t = W_t − K
3:   W_{t+1} = argmin_{W ∈ R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F^2 + η_t λ‖W‖_*
4: end for

Application to Kernel PCA (II)

Nuclear Norm Regularized Least Squares

min_{W ∈ R^{n×n}} (1/2)‖W − K‖_F^2 + λ‖W‖_*

Stochastic Proximal Gradient Descent [Zhang et al., 2015a]

1: for t = 1, 2, . . . , T do
2:   Calculate a low-rank stochastic gradient G_t = W_t − ξ
3:   W_{t+1} = argmin_{W ∈ R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F^2 + η_t λ‖W‖_*
4: end for

ξ is low-rank and E[ξ] = K, obtained, e.g., via
  Random Fourier features [Rahimi and Recht, 2008]
  Constructing columns/rows of K randomly
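As one way to build the low-rank unbiased estimate ξ (an added sketch of the column-sampling option under assumed names and an RBF kernel; the random Fourier feature route would look different):

import numpy as np

def rbf_column(X, j, gamma=1.0):
    """Column j of the RBF kernel matrix K, computed on the fly (O(nd) time, O(n) memory)."""
    d2 = np.sum((X - X[j])**2, axis=1)
    return np.exp(-gamma * d2)

def sampled_kernel_estimate(X, s, gamma=1.0, seed=0):
    """Unbiased low-rank estimate xi of K from s uniformly sampled columns.

    xi = (n / (2s)) * sum_{j in J} (k_j e_j^T + e_j k_j^T), so E[xi] = K and rank(xi) <= 2s.
    The dense n x n matrix is returned only for clarity; in practice one keeps the factors.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    J = rng.choice(n, size=s, replace=False)
    xi = np.zeros((n, n))
    for j in J:
        k_j = rbf_column(X, j, gamma)
        xi[:, j] += k_j            # k_j placed as column j
        xi[j, :] += k_j            # and as row j, to keep the estimate symmetric
    return (n / (2 * s)) * xi

In practice ξ would be kept in its factored form and passed to the proximal step through the incremental SVD described earlier.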

Experimental Results (I)

The Magic data set
The rank of each iterate

[Figure: Rank(W_t) vs. t for λ ∈ {10, 100} and k ∈ {5, 50}]

Experimental Results (II)

The Magic data set
The optimization error

[Figure: ‖W_t − W_*‖_F^2 / n^2 vs. t for λ ∈ {10, 100} and k ∈ {5, 50}]

Conclusion and Future Work

Time Reduction by Stochastic Optimization

A novel algorithm, Mixed Gradient Descent (MGD), is proposed to reduce the iteration complexity
Future work: extend MGD to other scenarios, such as the distributed setting and non-convex functions

Space Reduction by Stochastic Optimization

A simple algorithm based on Stochastic Proximal Gradient Descent (SPGD) is proposed to reduce the space complexity
Future work: apply it to more applications, such as matrix completion and multi-task learning

Special Issue on "Multi-instance Learning in Pattern Recognition and Vision"

URL: http://www.journals.elsevier.com/pattern-recognition/call-for-papers/special-issue-on-multiple-instance-learning-in-pattern-recog

Dates:
First submission date: Mar. 1, 2016
Final paper notification: Dec. 1, 2016

Guest editors:
Jianxin Wu, Nanjing University
Xiang Bai, Huazhong University of Science and Technology
Marco Loog, Delft University of Technology
Fabio Roli, University of Cagliari
Zhi-Hua Zhou, Nanjing University

References

Avron, H., Kale, S., Kasiviswanathan, S., and Sindhwani, V. (2012). Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proceedings of the 29th International Conference on Machine Learning, pages 1231–1238.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.

Mahdavi, M., Zhang, L., and Jin, R. (2013). Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems 26, pages 674–682.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184.

Zhang, L., Mahdavi, M., and Jin, R. (2013). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015a). Stochastic optimization for kernel PCA. Submitted.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015b). Stochastic proximal gradient descent for nuclear norm regularization. ArXiv e-prints, arXiv:1511.01664.

Thanks!