
Randomized Algorithms for Large-scale Convex Optimization

Lijun Zhang

LAMDA group, Nanjing University, China

The 2nd Chinese Workshop on Evolutionary Computation and Learning


Outline

1 Introduction

2 Stochastic Optimization
  Background
  Mixed Gradient Descent

3 Stochastic Approximation
  Background
  Dual Random Projection

4 Conclusions and Future Work


Big Data

https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg


Supervised Learning by Optimization

Supervised Learning

Input

A set of training data $\{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$

A set of hypotheses $w \in \mathcal{W} \subseteq \mathbb{R}^d$

Output

A hypothesis $w_* \in \mathcal{W}$ whose predictor $x \mapsto x^\top w_*$ minimizes the testing error

Empirical Risk Minimization

$$\min_{w \in \mathcal{W}} \; f(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \Omega(w)$$

ℓ(·, ·) is a loss, e.g., hinge loss ℓ(u, v) = max(0, 1 − uv)

Ω(·) is a regularizer, e.g., $\lambda\|w\|_2^2$ or $\lambda\|w\|_1$
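As a concrete reference point, here is a minimal numpy sketch of this objective, assuming the hinge loss and the $\lambda\|w\|_2^2$ regularizer; the names (`X`, `y`, `lam`) are illustrative, not from the slides.

```python
# Minimal sketch of the regularized empirical risk, assuming hinge loss and an
# L2 regularizer; X has shape (n, d), y is in {-1, +1}^n, lam plays the role of lambda.
import numpy as np

def empirical_risk(w, X, y, lam):
    margins = y * (X @ w)                        # y_i * (x_i^T w)
    hinge = np.maximum(0.0, 1.0 - margins)       # l(u, v) = max(0, 1 - u v)
    return hinge.mean() + lam * np.dot(w, w)     # (1/n) sum l(...) + lambda ||w||_2^2
```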


The Challenges

Large-scale Convex Optimization

$$\min_{w \in \mathcal{W}} \; f(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \Omega(w)$$

Gradient Descent (GD)

1: for t = 1, 2, . . . , T do
2:   $w'_{t+1} = w_t - \eta_t \left( \frac{1}{n}\sum_{i=1}^{n} \nabla\ell(y_i, x_i^\top w_t) + \nabla\Omega(w_t) \right)$
3:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
4: end for

Computational Cost

Time Complexity: O(nd) + O(poly(d)) per iteration

Space Complexity: O(nd)
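A hedged sketch of this loop, assuming a smooth logistic loss (so the gradient is defined everywhere), the regularizer $\lambda\|w\|_2^2$, and an $\ell_2$ ball of radius `R` as the feasible set $\mathcal{W}$; the step size and loss choice are illustrative assumptions, not the slide's.

```python
# Projected gradient descent sketch; every iteration touches all n examples (O(nd)).
import numpy as np

def project_l2_ball(w, R):
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

def gradient_descent(X, y, lam, R=10.0, eta=0.5, T=100):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        p = -y / (1.0 + np.exp(y * (X @ w)))      # d/du log(1 + exp(-y u)) at u = x^T w
        grad = X.T @ p / n + 2.0 * lam * w        # full gradient of loss + lambda ||w||^2
        w = project_l2_ball(w - eta * grad, R)    # w_{t+1} = Pi_W(w_t - eta * grad)
    return w
```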


Randomized Algorithms

Random Sampling based Algorithms

aim to address the large-scale challenge, i.e., large n

select a subset of training data randomly

referred to as Stochastic Optimization

Random Projection based Algorithms

aim to address the high-dimensional challenge, i.e., large d

reduce the dimensionality by random projection

referred to as Stochastic Approximation


Stochastic Gradient Descent (SGD)

The Algorithm

1: for t = 1, 2, . . . , T do
2:   Select a training instance $(x_i, y_i)$ randomly
3:   $w'_{t+1} = w_t - \eta_t \nabla\ell(y_i, x_i^\top w_t)$
4:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
5: end for

Advantages

Time Complexity: O(d) + O(poly(d)) per iteration

Space Complexity: O(d)

Limitations

The iteration complexity is much higher than that of GD
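A sketch of the loop above under illustrative assumptions (logistic loss, decreasing step size $\eta_t = \eta_0/\sqrt{t}$, projection onto an $\ell_2$ ball of radius `R`); as on the slide, each step uses only the loss gradient of a single random example.

```python
# SGD sketch: one random example per iteration, so each step costs O(d) plus the projection.
import numpy as np

def sgd(X, y, R=10.0, eta0=0.5, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                                        # select (x_i, y_i) randomly
        g = (-y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))) * X[i]     # grad of l(y_i, x_i^T w)
        w = w - (eta0 / np.sqrt(t)) * g                            # decreasing step size
        norm = np.linalg.norm(w)                                   # project back onto W
        if norm > R:
            w *= R / norm
    return w
```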


The Problem

Iteration Complexity

The number of iterations $T$ needed to ensure $f(w_T) - \min_{w \in \mathcal{W}} f(w) \le \epsilon$

Comparisons between GD and SGD

        Convex & Smooth           Strongly Convex & Smooth
GD      $O(1/\sqrt{\epsilon})$    $O(\log\frac{1}{\epsilon})$
SGD     $O(1/\epsilon^2)$         $O(1/\epsilon)$

Note: $\frac{1}{\epsilon^2} > \frac{1}{\epsilon} > \frac{1}{\sqrt{\epsilon}} \gg \log\frac{1}{\epsilon}$, e.g., $10^{12} > 10^{6} > 10^{3} \gg 6$ for $\epsilon = 10^{-6}$.


Motivations

Reason for the Slow Convergence Rate

The step size of SGD is a decreasing sequence

$\eta_t = 1/\sqrt{t}$ for convex functions

$\eta_t = 1/t$ for strongly convex functions

Reason for the Decreasing Step Size

$$w'_{t+1} = w_t - \eta_t \nabla\ell(y_i, x_i^\top w_t)$$

Stochastic gradients introduce an error that does not vanish, so the step size must decrease

The key idea

Control the variance of stochastic gradients

Choose a fixed step size ηt


Mixed Gradient Descent I

Mixed Gradient of $w_t$

$$m(w_t) = \nabla\ell(y_t, x_t^\top w_t) - \nabla\ell(y_t, x_t^\top w_0) + \nabla f(w_0)$$

where $(x_t, y_t)$ is a random sample, $w_0$ is an initial solution, and

$$\nabla f(w_0) = \frac{1}{n}\sum_{i=1}^{n} \nabla\ell(y_i, x_i^\top w_0)$$

The Properties of Mixed Gradient

It is still an unbiased estimate of the true gradient:

$$\mathbb{E}[m(w_t)] = \frac{1}{n}\sum_{i=1}^{n} \nabla\ell(y_i, x_i^\top w_t) = \nabla f(w_t)$$

The variance is controlled by the distance to $w_0$:

$$\left\|\nabla\ell(y_t, x_t^\top w_t) - \nabla\ell(y_t, x_t^\top w_0)\right\|_2 \le L\,\|w_t - w_0\|_2$$
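A quick numeric check of both properties on synthetic data (an illustrative sketch, assuming a logistic loss; nothing here comes from the slides): averaging $m(w_t)$ over all choices of the random index recovers $\nabla f(w_t)$ exactly, and the deviation of $m(w_t)$ from $\nabla f(w_t)$ shrinks as $w_t$ approaches $w_0$.

```python
# Verify that the mixed gradient is unbiased and that its error is tied to ||w_t - w_0||.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def per_example_grads(w):                  # rows: gradient of log(1 + exp(-y_i x_i^T w))
    p = -y / (1.0 + np.exp(y * (X @ w)))
    return p[:, None] * X

w0 = rng.normal(size=d)
wt = w0 + 0.1 * rng.normal(size=d)
G0, Gt = per_example_grads(w0), per_example_grads(wt)
mixed = Gt - G0 + G0.mean(axis=0)          # m(w_t) for every possible sampled index

print(np.allclose(mixed.mean(axis=0), Gt.mean(axis=0)))          # unbiased: True
print(np.linalg.norm(mixed - Gt.mean(axis=0), axis=1).max())     # small when w_t is near w_0
```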


Mixed Gradient Descent II

The Algorithm (NIPS 2013)

1: Compute the true gradient at $w_0$: $\nabla f(w_0) = \frac{1}{n}\sum_{i=1}^{n} \nabla\ell(y_i, x_i^\top w_0)$
2: for t = 1, 2, . . . , T do
3:   Select a training instance $(x_t, y_t)$ randomly
4:   Compute the mixed gradient at $w_t$: $m(w_t) = \nabla\ell(y_t, x_t^\top w_t) - \nabla\ell(y_t, x_t^\top w_0) + \nabla f(w_0)$
5:   $w'_{t+1} = w_t - \eta_t\, m(w_t)$
6:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
7: end for
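A sketch of this loop under illustrative assumptions (logistic loss, a fixed step size, projection omitted for brevity); the full gradient at the anchor point $w_0$ is computed once, and each stochastic step uses the variance-reduced mixed gradient.

```python
# Mixed gradient descent sketch: one full-gradient pass at w0, then cheap stochastic steps.
import numpy as np

def loss_grad(w, x, y):
    # gradient of log(1 + exp(-y * x^T w)) with respect to w
    return (-y / (1.0 + np.exp(y * (x @ w)))) * x

def mixed_gradient_descent(X, y, w0, eta=0.1, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    grad_f_w0 = sum(loss_grad(w0, X[i], y[i]) for i in range(n)) / n   # true gradient at w0
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)                                            # random instance (x_t, y_t)
        m = loss_grad(w, X[i], y[i]) - loss_grad(w0, X[i], y[i]) + grad_f_w0
        w = w - eta * m                                                # fixed step size eta
    return w
```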


Theoretical Guarantees

Theorem 1 ([Zhang et al., 2013a])

Suppose the objective function is smooth and strongly convex. To find an $\epsilon$-optimal solution, mixed gradient descent needs

        True Gradient                  Stochastic Gradient
MGD     $O(\log\frac{1}{\epsilon})$    $O(\kappa^2 \log\frac{1}{\epsilon})$

In contrast, SGD needs $O(1/\epsilon)$ stochastic gradients.

Extensions

For an unbounded domain, $O(\kappa^2 \log(1/\epsilon))$ can be improved to $O(\kappa \log(1/\epsilon))$ [Johnson and Zhang, 2013]

For smooth and convex functions, $O(\log(1/\epsilon))$ true gradients and $O(1/\epsilon)$ stochastic gradients are needed [Mahdavi et al., 2013]


Experimental Results I

Reuters Corpus Volume I (RCV1) data set

The optimization error

[Figure: training loss minus optimum vs. number of gradients, comparing EMGD and SGD]


Experimental Results II

Reuters Corpus Volume I (RCV1) data set

The variance of mixed gradient

[Figure: variance of the gradient estimate vs. number of gradients, comparing EMGD and SGD]


The Power of Random Projection

Random Projection

A dimensionality reduction method:

$$x \in \mathbb{R}^d \;\longrightarrow\; A^\top x \in \mathbb{R}^m$$

where $A \in \mathbb{R}^{d \times m}$ and $A_{ij} \sim \mathcal{N}(0, 1/m)$

Theorem 1 (Johnson and Lindenstrauss Lemma [Achlioptas, 2003])

Given $\epsilon > 0$ and an integer $n$, let $m$ be a positive integer such that $m = \Omega(\epsilon^{-2}\log n)$. For every set $P$ of $n$ points in $\mathbb{R}^d$, there exists $f: \mathbb{R}^d \to \mathbb{R}^m$ such that for all $x_i, x_j \in P$,

$$(1-\epsilon)\|x_i - x_j\|^2 \le \|f(x_i) - f(x_j)\|^2 \le (1+\epsilon)\|x_i - x_j\|^2.$$
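A small sketch of this map, assuming Gaussian entries $A_{ij} \sim \mathcal{N}(0, 1/m)$ so that $\mathbb{E}\|A^\top x\|^2 = \|x\|^2$; the function and argument names are illustrative.

```python
# Gaussian random projection: map each row x_i in R^d to A^T x_i in R^m.
import numpy as np

def random_projection(X, m, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))   # A_ij ~ N(0, 1/m)
    return X @ A                                          # rows are A^T x_i
```

With $m = O(\epsilon^{-2}\log n)$, the pairwise distances among the $n$ rows are preserved up to a factor $1 \pm \epsilon$ with high probability, as the lemma states.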


Optimization after Random Projection I

The Primal Problem in $\mathbb{R}^d$

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \frac{\lambda}{2}\|w\|^2$$

Traditional Approach

1 Reduce the dimensionality: $\hat{x}_i = A^\top x_i \in \mathbb{R}^m$

2 Solve the primal problem in $\mathbb{R}^m$:

$$\min_{z \in \mathbb{R}^m} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, z^\top \hat{x}_i) + \frac{\lambda}{2}\|z\|^2$$

3 Compute $\hat{w} \in \mathbb{R}^d$ by $\hat{w} = A z_*$
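A sketch of this three-step pipeline under illustrative assumptions (logistic loss, low-dimensional problem solved by plain gradient descent); `A @ z_star` is the naive recovered solution whose weakness is discussed next.

```python
# Naive "optimization after random projection": project, solve in R^m, map back with A.
import numpy as np

def solve_low_dim(Xp, y, lam, eta=0.5, T=500):
    # gradient descent on (1/n) sum log(1 + exp(-y z^T xhat)) + (lam/2) ||z||^2
    n, m = Xp.shape
    z = np.zeros(m)
    for _ in range(T):
        p = -y / (1.0 + np.exp(y * (Xp @ z)))
        z -= eta * (Xp.T @ p / n + lam * z)
    return z

def naive_recovery(X, y, A, lam):
    Xp = X @ A                             # step 1: xhat_i = A^T x_i
    z_star = solve_low_dim(Xp, y, lam)     # step 2: solve the primal problem in R^m
    return A @ z_star                      # step 3: w_hat = A z_star
```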


Optimization after Random Projection II

Advantages

Time complexity is reduced from O(nd) to O(nm)

Space complexity is reduced from O(nd) to O(nm)

It is possible to run gradient descent which converges fast

The Limitation

$\hat{w} = A z_*$ is not a good approximation of

$$w_* = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \frac{\lambda}{2}\|w\|^2$$


Optimization after Random Projection III

Proposition 1 (Distance of a Random Subspace to a Fixed Point [Vershynin, 2009])

Let $E \in G_{d,m}$ be a random subspace (codim $E = d - m$). Let $x$ be a unit-length vector, arbitrary but fixed. Then

$$\Pr\left(\mathrm{dist}(x, E) \le \epsilon\sqrt{\frac{d-m}{d}}\right) \le (c\epsilon)^{d-m} \quad \text{for any } \epsilon > 0,$$

where $c$ is a universal constant.

With probability at least $1 - 2^{-(d-m)}$, we have

$$\|\hat{w} - w_*\|_2 \ge \frac{1}{2c}\sqrt{\frac{d-m}{d}}\,\|w_*\|_2$$


Motivations I

The Primal Problem in $\mathbb{R}^d$

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, x_i^\top w) + \frac{\lambda}{2}\|w\|^2 \qquad \text{(P1)}$$

The Dual Problem

$$\max_{\alpha \in \Omega^n} \; -\sum_{i=1}^{n} \ell_*(\alpha_i) - \frac{1}{2\lambda n}(\alpha \circ y)^\top X^\top X (\alpha \circ y) \qquad \text{(D1)}$$

where $\ell_*$ is the conjugate of the loss and $\circ$ denotes the element-wise product.

Proposition 2

Let $w_* \in \mathbb{R}^d$ and $\alpha_* \in \mathbb{R}^n$ be the solutions to (P1) and (D1). Then

$$w_* = -\frac{1}{\lambda n} X(\alpha_* \circ y), \qquad [\alpha_*]_i = \ell'\!\left(y_i,\, x_i^\top w_*\right), \; i = 1, \dots, n.$$


Motivations II

The Primal Problem in $\mathbb{R}^m$

$$\min_{z \in \mathbb{R}^m} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, z^\top \hat{x}_i) + \frac{\lambda}{2}\|z\|^2 \qquad \text{(P2)}$$

The Dual Problem

$$\max_{\alpha \in \Omega^n} \; -\sum_{i=1}^{n} \ell_*(\alpha_i) - \frac{1}{2\lambda n}(\alpha \circ y)^\top X^\top A A^\top X (\alpha \circ y) \qquad \text{(D2)}$$

Proposition 3

Let $z_* \in \mathbb{R}^m$ and $\hat{\alpha}_* \in \mathbb{R}^n$ be the solutions to (P2) and (D2). Then

$$z_* = -\frac{1}{\lambda n} A^\top X(\hat{\alpha}_* \circ y), \qquad [\hat{\alpha}_*]_i = \ell'\!\left(y_i,\, \hat{x}_i^\top z_*\right), \; i = 1, \dots, n.$$


Motivations III

The Big Picture

[Diagram: Primal-Primal, Primal-Dual, and Dual-Dual relations]


Motivations IV

Optimization after Random Projection

Primal Solution → Primal Solution


Dual Random Projection

Use Dual Solutions to Bridge Primal Solutions

Primal Solution → Dual Solution → Primal Solution


Dual Random Projection

The Algorithm (COLT 2013 & IEEE Trans. Inf. Theory 2014)

1 Reduce the dimensionality: $\hat{x}_i = A^\top x_i \in \mathbb{R}^m$

2 Solve the low-dimensional problem

$$\min_{z \in \mathbb{R}^m} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, z^\top \hat{x}_i) + \frac{\lambda}{2}\|z\|^2$$

3 Construct the dual solution $\hat{\alpha}_* \in \mathbb{R}^n$ by

$$[\hat{\alpha}_*]_i = \ell'\!\left(y_i,\, \hat{x}_i^\top z_*\right), \quad i = 1, \dots, n$$

4 Compute $\hat{w} \in \mathbb{R}^d$ by

$$\hat{w} = -\frac{1}{\lambda n} X(\hat{\alpha}_* \circ y)$$
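A sketch of steps 3 and 4 under illustrative assumptions: logistic loss, with $\ell'$ read as the derivative of the loss with respect to the margin $y_i\,\hat{x}_i^\top z_*$ (an assumption about the slide's notation). It takes the low-dimensional solution $z_*$ from step 2, e.g. from a solver like the `solve_low_dim` sketch above, as input.

```python
# Dual random projection recovery: build the dual variables from the low-dim solution,
# then map them back to a full-dimensional primal solution using the original data X.
import numpy as np

def drp_recovery(X, y, A, z_star, lam):
    n = X.shape[0]
    margins = y * ((X @ A) @ z_star)            # y_i * (xhat_i^T z_star)
    alpha = -1.0 / (1.0 + np.exp(margins))      # step 3: derivative of log(1 + exp(-margin))
    return -(X.T @ (alpha * y)) / (lam * n)     # step 4: w_hat = -(1/(lam n)) X (alpha o y)
```

Compared with the naive recovery $\hat{w} = A z_*$, the only extra work is one pass over the original data in step 4.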


Theoretical Guarantees

Low-rank Assumption

$r = \mathrm{rank}(X) \ll \min(d, n)$.

Theorem 2 ([Zhang et al., 2013b] [Zhang et al., 2014])

For any $0 < \epsilon \le 1/2$, with probability at least $1 - \delta$, we have

$$\|\hat{w} - w_*\|_2 \le \frac{\epsilon}{1-\epsilon}\,\|w_*\|_2,$$

provided

$$m \ge \frac{(r+1)\log(2r/\delta)}{c\,\epsilon^2},$$

where the constant $c$ is at least $1/4$.

Implication

To accurately recover $w_*$, the number of required random projections is $\Omega(r \log r)$.


Experimental Results I

A 20,000 × 50,000 data matrix with rank 10.

The reconstruction error

[Figure: relative recovery error vs. m, comparing DRP and the naive approach]


Experimental Results II

A 20,000 × 50,000 data matrix with rank 10.

The running time

[Figure: running time in seconds vs. m, comparing DRP, RP, and the baseline]


Conclusions and Future Work

Summary

Based on random sampling, we propose a Mixed Gradient Descent (MGD) algorithm, which improves the convergence rate significantly.

Based on random projection, we propose a Dual Random Projection (DRP) algorithm, which can recover the optimal solution accurately.

Future Work

Extend MGD to distributed environments

Relax the assumptions in Dual Random Projection [Yang et al., 2015]

Extend DRP to more problems, such as sparse learning


Thanks!


Reference I


Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.

Mahdavi, M., Zhang, L., and Jin, R. (2013). Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems 26 (NIPS), pages 674–682.

Vershynin, R. (2009). Lectures in geometric functional analysis. Technical report, University of Michigan.

Yang, T., Zhang, L., Jin, R., and Zhu, S. (2015). Theory of dual-sparse regularized randomized reduction. In Proceedings of the 32nd International Conference on Machine Learning (ICML).


Reference II

Zhang, L., Mahdavi, M., and Jin, R. (2013a). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988.

Zhang, L., Mahdavi, M., Jin, R., Yang, T., and Zhu, S. (2013b). Recovering the optimal solution by dual random projection. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), pages 135–157.

Zhang, L., Mahdavi, M., Jin, R., Yang, T., and Zhu, S. (2014). Random projections for classification: A recovery approach. IEEE Transactions on Information Theory (TIT), 60(11):7300–7316.
