
Mining Big Data

Lijun Zhang ([email protected])
Nanjing University, China
December 23, 2015
http://cs.nju.edu.cn/zlj


Outline

1 Introduction
   Big Data
   Supervised Learning
2 Stochastic Optimization
   Time Reduction
   Space Reduction
3 Distributed Optimization
   Distributed Gradient Descent
   ADMM
4 Online Learning
   Full-Information Setting
   Bandit Setting
5 Summary


Big Data

https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg


The Four V’s of Big Data

http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg


Supervised Learning (I)

Training

Input: a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$

Classification vs. Regression
Batch Setting vs. Online Setting

Output

A function $g(\cdot)$ such that $y_i \approx g(x_i)$ for all $i$

Testing

Given a test point $x$, predict its label by $g(x)$

Assumption

Training and testing data are independent and identically distributed (i.i.d.)


Supervised Learning (II)

Loss Function

Measures the discrepancy between $y$ and $g(x)$

Binary loss: $\ell(u, v) = \mathbf{1}(uv < 0)$
Hinge loss: $\ell(u, v) = \max(0, 1 - uv)$
Squared loss: $\ell(u, v) = (u - v)^2$

Empirical Risk

$$\frac{1}{n} \sum_{i=1}^{n} \ell(y_i, g(x_i))$$

Risk

$$\mathbb{E}_{(x,y)}\left[\ell(y, g(x))\right]$$

Our goal is to minimize the risk, not merely the empirical risk.
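These quantities translate directly into code. A minimal Python sketch (NumPy; the linear predictor and the synthetic data are illustrative assumptions, not from the slides):

import numpy as np

def binary_loss(u, v):
    # 1(uv < 0): counts a mistake when label and prediction disagree in sign
    return float(u * v < 0)

def hinge_loss(u, v):
    return max(0.0, 1.0 - u * v)

def squared_loss(u, v):
    return (u - v) ** 2

def empirical_risk(loss, X, y, g):
    # (1/n) * sum_i loss(y_i, g(x_i)) over the training set
    return np.mean([loss(yi, g(xi)) for xi, yi in zip(X, y)])

# Usage: empirical risk of a linear predictor under the hinge loss
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)
y = np.sign(X @ w + 0.1 * rng.normal(size=100))
print(empirical_risk(hinge_loss, X, y, lambda x: x @ w))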


What is Stochastic Optimization? (I)

Definition 1: A Special Objective [Nemirovski et al., 2009]

$$\min_{w \in \mathcal{W}} \ f(w) = \mathbb{E}_{\xi}\left[\ell(w, \xi)\right] = \int_{\Xi} \ell(w, \xi) \, dP(\xi)$$

where $\xi$ is a random variable

It is possible to generate an i.i.d. sample $\xi_1, \xi_2, \ldots$

Risk Minimization

$$\min_{w \in \mathcal{W}} \ f(w) = \mathbb{E}_{(x,y)}\left[\ell(y, x^\top w)\right]$$

$\ell(\cdot, \cdot)$ is a loss function, e.g., the hinge loss $\ell(u, v) = \max(0, 1 - uv)$

Training data $(x_1, y_1), \ldots, (x_n, y_n)$ are i.i.d.


What is Stochastic Optimization? (II)

Definition 2: A Special Access Model [Hazan and Kale, 2011]

$$\min_{w \in \mathcal{W}} \ f(w)$$

There exists a stochastic oracle that produces an unbiased gradient estimate $m(\cdot)$:

$$\mathbb{E}[m(w)] = \nabla f(w)$$

Empirical Risk Minimization

$$\min_{w \in \mathcal{W}} \ f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w)$$

Sampling $(x_t, y_t)$ from $\{(x_i, y_i)\}_{i=1}^{n}$ uniformly at random gives

$$\mathbb{E}\left[\nabla \ell(y_t, x_t^\top w)\right] = \nabla \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w)$$


Time Reduction (I)

Empirical Risk Minimization

$$\min_{w \in \mathcal{W} \subseteq \mathbb{R}^d} \ f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w)$$

$n$ is the number of training examples

$d$ is the dimensionality

Deterministic Optimization—Gradient Descent (GD)

1: for $t = 1, 2, \ldots, T$ do
2:   $w'_{t+1} = w_t - \eta_t \left( \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(y_i, x_i^\top w_t) \right)$
3:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
4: end for

The Challenge

Time complexity per iteration: $O(nd) + O(\mathrm{poly}(d))$
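A minimal projected gradient descent sketch in Python (NumPy; the squared loss and the $\ell_2$-ball constraint set are illustrative assumptions):

import numpy as np

def project_l2_ball(w, radius=1.0):
    # Projection onto W = {w : ||w||_2 <= radius}
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def gradient_descent(X, y, T=100, eta=0.1):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        # Full gradient of the empirical risk under the squared loss: costs O(nd)
        grad = X.T @ (X @ w - y) / n
        w = project_l2_ball(w - eta * grad)
    return w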


Time Reduction (II)

Empirical Risk Minimization

$$\min_{w \in \mathcal{W} \subseteq \mathbb{R}^d} \ f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w)$$

Stochastic Optimization—Stochastic Gradient Descent (SGD)

1: for $t = 1, 2, \ldots, T$ do
2:   Sample a training example $(x_t, y_t)$ uniformly at random
3:   $w'_{t+1} = w_t - \eta_t \nabla \ell(y_t, x_t^\top w_t)$
4:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
5: end for

The Advantage—Time Reduction

Time complexity per iteration: $O(d) + O(\mathrm{poly}(d))$
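The same setting with the full gradient replaced by a single-example stochastic gradient (a sketch under the same assumptions, reusing project_l2_ball from the GD snippet above):

import numpy as np

def sgd(X, y, T=1000, eta=0.1):
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for t in range(T):
        i = rng.integers(n)              # uniform sampling keeps the gradient unbiased
        grad = (X[i] @ w - y[i]) * X[i]  # squared-loss gradient: O(d), independent of n
        w = project_l2_ball(w - eta * grad)
    return w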


Space Reduction (I)

Nuclear Norm Regularized Optimization over Matrices

$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} \ F(W) = f(W) + \lambda \|W\|_*$$

Matrix completion, multi-class classification

Both $m$ and $n$ can be very large

The optimal solution $W_*$ is low-rank

Collaborative Filtering—Matrix Completion

$$\min_{W \in \mathbb{R}^{m \times n}} \ F(W) = \sum_{(i,j) \in \Omega} (W_{ij} - M_{ij})^2 + \lambda \|W\|_*$$

There are $m$ users and $n$ items

$M \in \mathbb{R}^{m \times n}$ is the underlying user-item rating matrix

$\Omega$ is the set of observed indices
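For concreteness, a sketch of evaluating this objective in Python (NumPy; representing $\Omega$ as parallel index arrays is an illustrative choice):

import numpy as np

def mc_objective(W, M, rows, cols, lam):
    # Squared error over the observed entries (i, j) in Omega ...
    residual = W[rows, cols] - M[rows, cols]
    # ... plus the nuclear norm penalty (the sum of singular values)
    return np.sum(residual ** 2) + lam * np.linalg.norm(W, ord='nuc')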


Space Reduction (II)

Nuclear Norm Regularized Optimization over Matrices

$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} \ F(W) = f(W) + \lambda \|W\|_*$$

Matrix completion, multi-class classification

Both $m$ and $n$ can be very large

The optimal solution $W_*$ is low-rank

Deterministic Optimization—Gradient Descent (GD)

1: for $t = 1, 2, \ldots, T$ do
2:   $W'_{t+1} = W_t - \eta_t \nabla F(W_t)$
3:   $W_{t+1} = \Pi_{\mathcal{W}}(W'_{t+1})$
4: end for

The Challenge

Space complexity per iteration: O(mn)


Space Reduction (III)

Nuclear Norm Regularized Optimization over Matrices

$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} \ F(W) = f(W) + \lambda \|W\|_*$$

Stochastic Proximal Gradient Descent (SPGD) [Zhang et al., 2015]

1: for $t = 1, 2, \ldots, T$ do
2:   Generate a low-rank stochastic gradient $G_t$ of $f(\cdot)$ at $W_t$
3:   $W_{t+1} = \operatorname*{argmin}_{W \in \mathbb{R}^{m \times n}} \ \frac{1}{2} \left\| W - (W_t - \eta_t G_t) \right\|_F^2 + \eta_t \lambda \|W\|_*$
4: end for
5: return $W_{T+1}$

The Advantage—Space Reduction

Space complexity per iteration is low: $O((m + n)r)$, where $r$ is the rank of the iterates
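The update in step 3 is the proximal operator of the nuclear norm, which has a closed form via soft-thresholding of singular values. A sketch (NumPy; a dense SVD is used for clarity, whereas the actual space saving relies on keeping $G_t$ and the iterates in low-rank factored form):

import numpy as np

def prox_nuclear(Z, tau):
    # argmin_W 0.5*||W - Z||_F^2 + tau*||W||_*  =  soft-threshold the singular values of Z
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    keep = s > 0
    return (U[:, keep] * s[keep]) @ Vt[keep]  # low-rank once small values are thresholded

def spgd_step(W, G, eta, lam):
    # One SPGD iteration: gradient step on f, then the nuclear-norm prox with threshold eta*lam
    return prox_nuclear(W - eta * G, eta * lam)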



Limitations

Iteration Complexity

The number of iterations T to ensure

$$f(w_T) - \min_{w \in \mathcal{W}} f(w) \le \epsilon$$

Iteration Complexity of GD and SGD ($\epsilon = 10^{-6}$)

       Convex & Smooth                    Strongly Convex & Smooth
GD     $O(1/\sqrt{\epsilon})$ = $10^3$    $O(\sqrt{\kappa} \log\frac{1}{\epsilon})$ = $6\sqrt{\kappa}$
SGD    $O(1/\epsilon^2)$ = $10^{12}$      $O(\frac{1}{\mu\epsilon})$ = $10^6/\mu$

Total Time Complexity of GD and SGD ($\epsilon = 10^{-6}$)

       Convex & Smooth                    Strongly Convex & Smooth
GD     $O(n/\sqrt{\epsilon})$ = $10^3 n$  $O(n\sqrt{\kappa} \log\frac{1}{\epsilon})$ = $6n\sqrt{\kappa}$
SGD    $O(1/\epsilon^2)$ = $10^{12}$      $O(\frac{1}{\mu\epsilon})$ = $10^6/\mu$


Introduction (I)

Empirical Risk Minimization in the Distributed Setting

$$\min_{w \in \mathcal{W}} \ f(w) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \ell(y_i^j, w^\top x_i^j)$$

$(x_1^j, y_1^j), \ldots, (x_{n_j}^j, y_{n_j}^j)$ are the training data on the $j$-th machine


Introduction (II)

General Distributed Optimization

$$\min_{w \in \mathcal{W}} \ f(w) = \sum_{j=1}^{k} f_j(w)$$

For empirical risk minimization, we have

$$f_j(w) = \sum_{i=1}^{n_j} \ell(y_i^j, w^\top x_i^j)$$


Distributed Gradient Descent

Gradient Descent

1: for $t = 1, 2, \ldots, T$ do
2:   $w'_{t+1} = w_t - \eta_t \left( \sum_{j=1}^{k} \nabla f_j(w_t) \right)$
3:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
4: end for

Distributed Gradient Descent

1: for $t = 1, 2, \ldots, T$ do
2:   Server: send $w_t$ to each worker
3:   Each worker $j$: calculate $\nabla f_j(w_t)$ and send it to the server
4:   Server: $w'_{t+1} = w_t - \eta_t \left( \sum_{j=1}^{k} \nabla f_j(w_t) \right)$
5:   Server: $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
6: end for
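A single-process sketch of one round, with a list of data partitions standing in for the workers (Python/NumPy; the squared loss, the unit-ball projection, and the in-process "communication" are illustrative assumptions; a real deployment would replace the loop over partitions with network messages):

import numpy as np

def worker_gradient(Xj, yj, w):
    # Local gradient of f_j under the squared loss (a sum, matching f = sum_j f_j)
    return Xj.T @ (Xj @ w - yj)

def distributed_gd_round(partitions, w, eta):
    # Server broadcasts w; each "worker" returns its local gradient; the server aggregates
    grad = sum(worker_gradient(Xj, yj, w) for Xj, yj in partitions)
    w = w - eta * grad
    return w / max(1.0, np.linalg.norm(w))  # projection onto the unit l2 ball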


Reformulation of the Distributed Optimization (I)

General Distributed Optimization

$$\min_{w \in \mathcal{W}} \ f(w) = \sum_{j=1}^{k} f_j(w)$$

Global Consensus Problem

$$\min_{y \in \mathcal{W},\ w_j = y,\ j=1,\ldots,k} \ \sum_{j=1}^{k} f_j(w_j)$$

The Lagrangian

$$\max_{\lambda_1, \ldots, \lambda_k} \ \min_{y \in \mathcal{W},\ w_1, \ldots, w_k} \ \sum_{j=1}^{k} \left( f_j(w_j) - \langle \lambda_j, w_j - y \rangle \right)$$


Reformulation of the Distributed Optimization (II)

The Augmented Lagrangian [Roux et al., 2012]

$$\max_{\lambda_1, \ldots, \lambda_k} \ \min_{y \in \mathcal{W},\ w_1, \ldots, w_k} \ \sum_{j=1}^{k} \left( f_j(w_j) - \langle \lambda_j, w_j - y \rangle + \frac{1}{2\gamma} \|w_j - y\|_2^2 \right)$$

Alternating Direction Method of Multipliers (ADMM)

1: for $t = 1, 2, \ldots, T$ do
2:   Server: send $y_t$ and $\lambda_j^t$ to worker $j = 1, \ldots, k$
3:   Each worker $j$: calculate $w_j^{t+1}$ and send it to the server:
     $$w_j^{t+1} = \operatorname*{argmin}_{w} \ f_j(w) + \frac{1}{2\gamma} \|y_t + \gamma \lambda_j^t - w\|_2^2, \quad j = 1, \ldots, k$$
4:   Server: $y_{t+1} = \Pi_{\mathcal{W}}\left( \frac{1}{k} \sum_{j=1}^{k} (w_j^{t+1} - \gamma \lambda_j^t) \right)$
5:   Server: $\lambda_j^{t+1} = \lambda_j^t + (y_{t+1} - w_j^{t+1}), \quad j = 1, \ldots, k$
6: end for
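A minimal single-process sketch of one ADMM round for quadratic local objectives $f_j(w) = \frac{1}{2}\|X_j w - y_j\|_2^2$, where the worker subproblem has a closed form (NumPy; the quadratic losses and taking $\mathcal{W} = \mathbb{R}^d$, so that the projection is the identity, are illustrative assumptions):

import numpy as np

def worker_update(Xj, yj, y_t, lam_j, gamma):
    # argmin_w 0.5*||Xj w - yj||^2 + (1/(2*gamma))*||y_t + gamma*lam_j - w||^2
    d = Xj.shape[1]
    A = Xj.T @ Xj + np.eye(d) / gamma
    b = Xj.T @ yj + (y_t + gamma * lam_j) / gamma
    return np.linalg.solve(A, b)

def admm_round(partitions, y_t, lams, gamma):
    ws = [worker_update(Xj, yj, y_t, lam, gamma) for (Xj, yj), lam in zip(partitions, lams)]
    y_next = np.mean([w - gamma * lam for w, lam in zip(ws, lams)], axis=0)  # W = R^d: identity projection
    lams = [lam + (y_next - w) for w, lam in zip(ws, lams)]  # dual update as in step 5
    return y_next, lams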


Challenges of Distributed Optimization

Communication
Synchronization


Online Learning (I)

Online Learning [Shalev-Shwartz, 2011]

1: for $t = 1, 2, \ldots, T$ do
2:   Receive a question $x_t \in \mathcal{X}$
3:   Predict $p_t \in \mathcal{D}$
4:   Receive the true answer $y_t \in \mathcal{Y}$
5:   Suffer loss $\ell(y_t, p_t)$
6: end for


Online Learning (II)

An Example—Online Classification

1: for $t = 1, 2, \ldots, T$ do
2:   Receive a question $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$
3:   Predict $p_t \in \mathcal{D} = \{0, 1\}$
4:   Receive the true answer $y_t \in \mathcal{Y} = \{0, 1\}$
5:   Suffer loss $\ell(y_t, p_t) = |y_t - p_t|$
6: end for
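A direct transcription of this protocol in Python (the data stream and the perceptron-style predictor are illustrative assumptions, not part of the slides):

import numpy as np

def online_classification(stream, d, T):
    # stream yields (x_t, y_t) pairs with y_t in {0, 1}
    w = np.zeros(d)
    cumulative_loss = 0.0
    for t, (x, y) in zip(range(T), stream):
        p = float(x @ w > 0)           # predict p_t in {0, 1}
        cumulative_loss += abs(y - p)  # suffer loss |y_t - p_t|
        w += (2 * y - 1) * x           # perceptron-style update after the answer is revealed
    return w, cumulative_loss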


Online Learning (III)

Cumulative Loss

$$\text{Cumulative Loss} = \sum_{t=1}^{T} \ell(y_t, p_t)$$


Online Learning (IV)

Regret with respect to a hypothesis class $\mathcal{H}$

$$\text{Regret} = \sum_{t=1}^{T} \ell(y_t, p_t) - \min_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell(y_t, h(x_t))$$

where $h: \mathcal{X} \mapsto \mathcal{D}$


Online Convex Optimization (I)

Online Convex Optimization [Shalev-Shwartz, 2011]

1: for $t = 1, 2, \ldots, T$ do
2:   Predict a vector $w_t \in \mathcal{W} \subseteq \mathbb{R}^d$
3:   Receive a convex loss function $f_t: \mathcal{W} \to \mathbb{R}$
4:   Suffer loss $f_t(w_t)$
5: end for


Online Convex Optimization (II)

An Example—Online SVM

1: for $t = 1, 2, \ldots, T$ do
2:   Predict a vector $w_t \in \mathcal{W} \subseteq \mathbb{R}^d$
3:   Receive a training instance $(x_t, y_t)$
4:   Suffer loss $f_t(w_t) = \ell(y_t, x_t^\top w_t) = \max(0, 1 - y_t x_t^\top w_t)$
5: end for


Online Convex Optimization (III)

Regret

$$\text{Regret} = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w)$$


Online Gradient Descent

Full-Information Setting

The learner observes not only $f_t(w_t)$ but the entire function $f_t(\cdot)$.

Online Gradient Descent [Zinkevich, 2003, Hazan et al., 2007]

1: for $t = 1, 2, \ldots, T$ do
2:   Predict a vector $w_t \in \mathcal{W} \subseteq \mathbb{R}^d$
3:   Receive a convex loss function $f_t: \mathcal{W} \to \mathbb{R}$
4:   Suffer loss $f_t(w_t)$
5:   $w_{t+1} = w_t - \eta_t \nabla f_t(w_t)$
6: end for

Regret Bound [Zinkevich, 2003, Hazan et al., 2007]

$$\text{Regret} = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w) = O(\sqrt{T}) \ \text{or} \ O(\log T)$$
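A sketch of OGD on the online SVM losses above, with the step size $\eta_t = 1/\sqrt{t}$ that yields the $O(\sqrt{T})$ bound for convex losses (Python/NumPy; the hinge loss and the unit-ball projection are illustrative choices):

import numpy as np

def ogd(stream, d, T):
    # stream yields (x_t, y_t) with y_t in {-1, +1}; f_t(w) = max(0, 1 - y_t <x_t, w>)
    w = np.zeros(d)
    for t, (x, y) in zip(range(1, T + 1), stream):
        g = -y * x if 1.0 - y * (x @ w) > 0 else np.zeros(d)  # subgradient of f_t at w_t
        w = w - g / np.sqrt(t)                                # eta_t = 1/sqrt(t)
        w = w / max(1.0, np.linalg.norm(w))                   # project back onto W
    return w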


Bandit Setting

Bandit Setting

The learner only observes $f_t(w_t)$.

Generation Process of the Loss Functions

Stochastic: $f_1, \ldots, f_T$ are sampled i.i.d.
Adversarial
  Oblivious: $f_t$ is independent of $w_1, \ldots, w_{t-1}$
  Nonoblivious: $f_t$ may depend on $w_1, \ldots, w_{t-1}$

Types of the Loss Functions

Convex

Linear

Multi-armed Bandit


Multi-armed Bandit

http://blog.mynaweb.com/wp-content/uploads/2013/02/Bandits.png


Stochastic Multi-armed Bandit

Setting

There are $K$ arms
Successive pulls of arm $i$ yield rewards $X_1^i, X_2^i, \ldots$
The rewards are i.i.d. according to an unknown distribution $D_i$ with unknown mean $\mu_i$

The Learning Process

1: for $t = 1, 2, \ldots, T$ do
2:   Pull arm $i \in [K]$
3:   Receive reward $X^i$ sampled from distribution $D_i$
4: end for

How to maximize the rewards? Always pull arm $i = \operatorname*{argmax}_{i \in [K]} \mu_i$


Exploitation-Exploration Dilemma

A Naive Algorithm

Exploration
  1 Pull each arm $m = 100$ times
  2 Calculate the estimated mean $\hat{\mu}_i$ of each arm $i$

Exploitation
  1 Always pull arm $i = \operatorname*{argmax}_{i \in [K]} \hat{\mu}_i$


Upper Confidence Bounds

Exploitation-Exploration Tradeoff by Upper Confidence Bounds (UCB) [Auer et al., 2002]

1: for $t = 1, 2, \ldots, T$ do
2:   Pull arm $i = \operatorname*{argmax}_{i \in [K]} u_i$ (WHP, $u_i \ge \mu_i$)
3:   Receive reward $X^i$ sampled from distribution $D_i$
4:   Update the upper bound $u_i$ for arm $i$
5: end for
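A sketch of the UCB1 instantiation from Auer et al. [2002], where the upper bound for arm $i$ after $n_i$ pulls is its empirical mean plus $\sqrt{2 \ln t / n_i}$ (Python; rewards in $[0, 1]$ are assumed):

import math
import random

def ucb1(arms, T):
    # arms: list of zero-argument callables, each returning a reward in [0, 1]
    K = len(arms)
    counts, means = [0] * K, [0.0] * K
    for i in range(K):  # initialize by pulling each arm once
        counts[i], means[i] = 1, arms[i]()
    for t in range(K + 1, T + 1):
        # u_i = empirical mean + exploration bonus; an upper bound on mu_i WHP
        i = max(range(K), key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # running-average update
    return means, counts

# Usage: two Bernoulli arms with means 0.4 and 0.6
print(ucb1([lambda: float(random.random() < 0.4), lambda: float(random.random() < 0.6)], 10000))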


Summary

Stochastic Optimization

Stochastic Gradient Descent (SGD)—Time Reduction

Stochastic Proximal Gradient Descent (SPGD)—Space Reduction

Distributed Optimization

Distributed Gradient Descent

Alternating Direction Method of Multipliers (ADMM)

Online Learning

Full-Information Setting—Online Gradient Descent

Bandit Setting—Exploitation-Exploration Dilemma


Reference I

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421-436.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609.

Roux, N. L., Schmidt, M., and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2672-2680.


Reference II

Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015). Stochastic proximal gradient descent for nuclear norm regularization. ArXiv e-prints, arXiv:1511.01664.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-936.
