
Mining Big Data

Lijun Zhang ([email protected])
Nanjing University, China
December 23, 2015
http://cs.nju.edu.cn/zlj


Outline

1 Introduction
   Big Data
   Supervised Learning
2 Stochastic Optimization
   Time Reduction
   Space Reduction
3 Distributed Optimization
   Distributed Gradient Descent
   ADMM
4 Online Learning
   Full-Information Setting
   Bandit Setting
5 Summary


Big Data

https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg


The Four V’s of Big Data

http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg


Supervised Learning (I)

Training

Input: a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$

Classification vs. Regression
Batch Setting vs. Online Setting

Output

A function $g(\cdot)$ such that $y_i \approx g(x_i)$ for all $i$

Testing

Given a test point $x$, predict its label by $g(x)$

Assumption

Training and testing data are independent and identically distributed (i.i.d.)


Supervised Learning (II)

Loss Function

Measures the discrepancy between $y$ and $g(x)$

Binary loss: $\ell(u, v) = \mathbf{1}(uv < 0)$
Hinge loss: $\ell(u, v) = \max(0, 1 - uv)$
Squared loss: $\ell(u, v) = (u - v)^2$

Empirical Risk

$$\frac{1}{n} \sum_{i=1}^{n} \ell(y_i, g(x_i))$$

Risk

$$\mathbb{E}_{(x,y)}\left[\ell(y, g(x))\right]$$

Our goal is to minimize the risk, not merely the empirical risk.
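These quantities translate directly into code. A minimal Python sketch (NumPy; the linear predictor and the synthetic data are illustrative assumptions, not from the slides):

import numpy as np

def binary_loss(u, v):
    # 1(uv < 0): counts a mistake when label and prediction disagree in sign
    return float(u * v < 0)

def hinge_loss(u, v):
    return max(0.0, 1.0 - u * v)

def squared_loss(u, v):
    return (u - v) ** 2

def empirical_risk(loss, X, y, g):
    # (1/n) * sum_i loss(y_i, g(x_i)) over the training set
    return np.mean([loss(yi, g(xi)) for xi, yi in zip(X, y)])

# Usage: empirical risk of a linear predictor under the hinge loss
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)
y = np.sign(X @ w + 0.1 * rng.normal(size=100))
print(empirical_risk(hinge_loss, X, y, lambda x: x @ w))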


What is Stochastic Optimization? (I)

Definition 1: A Special Objective [Nemirovski et al., 2009]

$$\min_{w \in \mathcal{W}} \ f(w) = \mathbb{E}_{\xi}\left[\ell(w, \xi)\right] = \int_{\Xi} \ell(w, \xi) \, dP(\xi)$$

where $\xi$ is a random variable

It is possible to generate an i.i.d. sample $\xi_1, \xi_2, \ldots$

Risk Minimization

$$\min_{w \in \mathcal{W}} \ f(w) = \mathbb{E}_{(x,y)}\left[\ell(y, x^\top w)\right]$$

$\ell(\cdot, \cdot)$ is a loss function, e.g., the hinge loss $\ell(u, v) = \max(0, 1 - uv)$

Training data $(x_1, y_1), \ldots, (x_n, y_n)$ are i.i.d.


What is Stochastic Optimization? (II)

Definition 2: A Special Access Model [Hazan and Kale, 2011]

$$\min_{w \in \mathcal{W}} \ f(w)$$

There exists a stochastic oracle that produces an unbiased gradient estimate $m(\cdot)$:

$$\mathbb{E}[m(w)] = \nabla f(w)$$

Empirical Risk Minimization

$$\min_{w \in \mathcal{W}} \ f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w)$$

Sampling $(x_t, y_t)$ from $\{(x_i, y_i)\}_{i=1}^{n}$ uniformly at random gives

$$\mathbb{E}\left[\nabla \ell(y_t, x_t^\top w)\right] = \nabla \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w)$$


Time Reduction (I)

Empirical Risk Minimization

$$\min_{w \in \mathcal{W} \subseteq \mathbb{R}^d} \ f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w)$$

$n$ is the number of training examples

$d$ is the dimensionality

Deterministic Optimization—Gradient Descent (GD)

1: for $t = 1, 2, \ldots, T$ do
2:   $w'_{t+1} = w_t - \eta_t \left( \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(y_i, x_i^\top w_t) \right)$
3:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
4: end for

The Challenge

Time complexity per iteration: $O(nd) + O(\mathrm{poly}(d))$
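A minimal projected gradient descent sketch in Python (NumPy; the squared loss and the $\ell_2$-ball constraint set are illustrative assumptions):

import numpy as np

def project_l2_ball(w, radius=1.0):
    # Projection onto W = {w : ||w||_2 <= radius}
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def gradient_descent(X, y, T=100, eta=0.1):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        # Full gradient of the empirical risk under the squared loss: costs O(nd)
        grad = X.T @ (X @ w - y) / n
        w = project_l2_ball(w - eta * grad)
    return w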


Time Reduction (II)

Empirical Risk Minimization

$$\min_{w \in \mathcal{W} \subseteq \mathbb{R}^d} \ f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w)$$

Stochastic Optimization—Stochastic Gradient Descent (SGD)

1: for $t = 1, 2, \ldots, T$ do
2:   Sample a training example $(x_t, y_t)$ uniformly at random
3:   $w'_{t+1} = w_t - \eta_t \nabla \ell(y_t, x_t^\top w_t)$
4:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
5: end for

The Advantage—Time Reduction

Time complexity per iteration: $O(d) + O(\mathrm{poly}(d))$
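The same setting with the full gradient replaced by a single-example stochastic gradient (a sketch under the same assumptions, reusing project_l2_ball from the GD snippet above):

import numpy as np

def sgd(X, y, T=1000, eta=0.1):
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for t in range(T):
        i = rng.integers(n)              # uniform sampling keeps the gradient unbiased
        grad = (X[i] @ w - y[i]) * X[i]  # squared-loss gradient: O(d), independent of n
        w = project_l2_ball(w - eta * grad)
    return w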


Space Reduction (I)

Nuclear Norm Regularized Optimization over Matrices

$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} \ F(W) = f(W) + \lambda \|W\|_*$$

Matrix completion, multi-class classification

Both $m$ and $n$ can be very large

The optimal solution $W_*$ is low-rank

Collaborative Filtering—Matrix Completion

$$\min_{W \in \mathbb{R}^{m \times n}} \ F(W) = \sum_{(i,j) \in \Omega} (W_{ij} - M_{ij})^2 + \lambda \|W\|_*$$

There are $m$ users and $n$ items

$M \in \mathbb{R}^{m \times n}$ is the underlying user-item rating matrix

$\Omega$ is the set of observed indices
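For concreteness, a sketch of evaluating this objective in Python (NumPy; representing $\Omega$ as parallel index arrays is an illustrative choice):

import numpy as np

def mc_objective(W, M, rows, cols, lam):
    # Squared error over the observed entries (i, j) in Omega ...
    residual = W[rows, cols] - M[rows, cols]
    # ... plus the nuclear norm penalty (the sum of singular values)
    return np.sum(residual ** 2) + lam * np.linalg.norm(W, ord='nuc')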


Space Reduction (II)

Nuclear Norm Regularized Optimization over Matrices

$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} \ F(W) = f(W) + \lambda \|W\|_*$$

Matrix completion, multi-class classification

Both $m$ and $n$ can be very large

The optimal solution $W_*$ is low-rank

Deterministic Optimization—Gradient Descent (GD)

1: for $t = 1, 2, \ldots, T$ do
2:   $W'_{t+1} = W_t - \eta_t \nabla F(W_t)$
3:   $W_{t+1} = \Pi_{\mathcal{W}}(W'_{t+1})$
4: end for

The Challenge

Space complexity per iteration: O(mn)


Space Reduction (III)

Nuclear Norm Regularized Optimization over Matrices

$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} \ F(W) = f(W) + \lambda \|W\|_*$$

Stochastic Proximal Gradient Descent (SPGD) [Zhang et al., 2015]

1: for $t = 1, 2, \ldots, T$ do
2:   Generate a low-rank stochastic gradient $G_t$ of $f(\cdot)$ at $W_t$
3:   $W_{t+1} = \operatorname*{argmin}_{W \in \mathbb{R}^{m \times n}} \ \frac{1}{2} \left\| W - (W_t - \eta_t G_t) \right\|_F^2 + \eta_t \lambda \|W\|_*$
4: end for
5: return $W_{T+1}$

The Advantage—Space Reduction

Space complexity per iteration is low: $O((m + n)r)$, where $r$ is the rank of the iterates
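The update in step 3 is the proximal operator of the nuclear norm, which has a closed form via soft-thresholding of singular values. A sketch (NumPy; a dense SVD is used for clarity, whereas the actual space saving relies on keeping $G_t$ and the iterates in low-rank factored form):

import numpy as np

def prox_nuclear(Z, tau):
    # argmin_W 0.5*||W - Z||_F^2 + tau*||W||_*  =  soft-threshold the singular values of Z
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    keep = s > 0
    return (U[:, keep] * s[keep]) @ Vt[keep]  # low-rank once small values are thresholded

def spgd_step(W, G, eta, lam):
    # One SPGD iteration: gradient step on f, then the nuclear-norm prox with threshold eta*lam
    return prox_nuclear(W - eta * G, eta * lam)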



Limitations

Iteration Complexity

The number of iterations T to ensure

$$f(w_T) - \min_{w \in \mathcal{W}} f(w) \le \epsilon$$

Iteration Complexity of GD and SGD ($\epsilon = 10^{-6}$)

       Convex & Smooth                    Strongly Convex & Smooth
GD     $O(1/\sqrt{\epsilon})$ = $10^3$    $O(\sqrt{\kappa} \log\frac{1}{\epsilon})$ = $6\sqrt{\kappa}$
SGD    $O(1/\epsilon^2)$ = $10^{12}$      $O(\frac{1}{\mu\epsilon})$ = $10^6/\mu$

Total Time Complexity of GD and SGD ($\epsilon = 10^{-6}$)

       Convex & Smooth                    Strongly Convex & Smooth
GD     $O(n/\sqrt{\epsilon})$ = $10^3 n$  $O(n\sqrt{\kappa} \log\frac{1}{\epsilon})$ = $6n\sqrt{\kappa}$
SGD    $O(1/\epsilon^2)$ = $10^{12}$      $O(\frac{1}{\mu\epsilon})$ = $10^6/\mu$


Introduction (I)

Empirical Risk Minimization in the Distributed Setting

$$\min_{w \in \mathcal{W}} \ f(w) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \ell(y_i^j, w^\top x_i^j)$$

$(x_1^j, y_1^j), \ldots, (x_{n_j}^j, y_{n_j}^j)$ are the training data on the $j$-th machine


Introduction (II)

General Distributed Optimization

$$\min_{w \in \mathcal{W}} \ f(w) = \sum_{j=1}^{k} f_j(w)$$

For empirical risk minimization, we have

$$f_j(w) = \sum_{i=1}^{n_j} \ell(y_i^j, w^\top x_i^j)$$


Distributed Gradient Descent

Gradient Descent

1: for $t = 1, 2, \ldots, T$ do
2:   $w'_{t+1} = w_t - \eta_t \left( \sum_{j=1}^{k} \nabla f_j(w_t) \right)$
3:   $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
4: end for

Distributed Gradient Descent

1: for $t = 1, 2, \ldots, T$ do
2:   Server: send $w_t$ to each worker
3:   Each worker $j$: calculate $\nabla f_j(w_t)$ and send it to the server
4:   Server: $w'_{t+1} = w_t - \eta_t \left( \sum_{j=1}^{k} \nabla f_j(w_t) \right)$
5:   Server: $w_{t+1} = \Pi_{\mathcal{W}}(w'_{t+1})$
6: end for
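A single-process sketch of one round, with a list of data partitions standing in for the workers (Python/NumPy; the squared loss, the unit-ball projection, and the in-process "communication" are illustrative assumptions; a real deployment would replace the loop over partitions with network messages):

import numpy as np

def worker_gradient(Xj, yj, w):
    # Local gradient of f_j under the squared loss (a sum, matching f = sum_j f_j)
    return Xj.T @ (Xj @ w - yj)

def distributed_gd_round(partitions, w, eta):
    # Server broadcasts w; each "worker" returns its local gradient; the server aggregates
    grad = sum(worker_gradient(Xj, yj, w) for Xj, yj in partitions)
    w = w - eta * grad
    return w / max(1.0, np.linalg.norm(w))  # projection onto the unit l2 ball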


Reformulation of the Distributed Optimization (I)

General Distributed Optimization

$$\min_{w \in \mathcal{W}} \ f(w) = \sum_{j=1}^{k} f_j(w)$$

Global Consensus Problem

$$\min_{y \in \mathcal{W},\ w_j = y,\ j=1,\ldots,k} \ \sum_{j=1}^{k} f_j(w_j)$$

The Lagrangian

$$\max_{\lambda_1, \ldots, \lambda_k} \ \min_{y \in \mathcal{W},\ w_1, \ldots, w_k} \ \sum_{j=1}^{k} \left( f_j(w_j) - \langle \lambda_j, w_j - y \rangle \right)$$


Reformulation of the Distributed Optimization (II)

The Augmented Lagrangian [Roux et al., 2012]

$$\max_{\lambda_1, \ldots, \lambda_k} \ \min_{y \in \mathcal{W},\ w_1, \ldots, w_k} \ \sum_{j=1}^{k} \left( f_j(w_j) - \langle \lambda_j, w_j - y \rangle + \frac{1}{2\gamma} \|w_j - y\|_2^2 \right)$$

Alternating Direction Method of Multipliers (ADMM)

1: for $t = 1, 2, \ldots, T$ do
2:   Server: send $y_t$ and $\lambda_j^t$ to worker $j = 1, \ldots, k$
3:   Each worker $j$: calculate $w_j^{t+1}$ and send it to the server:
     $$w_j^{t+1} = \operatorname*{argmin}_{w} \ f_j(w) + \frac{1}{2\gamma} \|y_t + \gamma \lambda_j^t - w\|_2^2, \quad j = 1, \ldots, k$$
4:   Server: $y_{t+1} = \Pi_{\mathcal{W}}\left( \frac{1}{k} \sum_{j=1}^{k} (w_j^{t+1} - \gamma \lambda_j^t) \right)$
5:   Server: $\lambda_j^{t+1} = \lambda_j^t + (y_{t+1} - w_j^{t+1}), \quad j = 1, \ldots, k$
6: end for
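A minimal single-process sketch of one ADMM round for quadratic local objectives $f_j(w) = \frac{1}{2}\|X_j w - y_j\|_2^2$, where the worker subproblem has a closed form (NumPy; the quadratic losses and taking $\mathcal{W} = \mathbb{R}^d$, so that the projection is the identity, are illustrative assumptions):

import numpy as np

def worker_update(Xj, yj, y_t, lam_j, gamma):
    # argmin_w 0.5*||Xj w - yj||^2 + (1/(2*gamma))*||y_t + gamma*lam_j - w||^2
    d = Xj.shape[1]
    A = Xj.T @ Xj + np.eye(d) / gamma
    b = Xj.T @ yj + (y_t + gamma * lam_j) / gamma
    return np.linalg.solve(A, b)

def admm_round(partitions, y_t, lams, gamma):
    ws = [worker_update(Xj, yj, y_t, lam, gamma) for (Xj, yj), lam in zip(partitions, lams)]
    y_next = np.mean([w - gamma * lam for w, lam in zip(ws, lams)], axis=0)  # W = R^d: identity projection
    lams = [lam + (y_next - w) for w, lam in zip(ws, lams)]  # dual update as in step 5
    return y_next, lams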


Challenges of Distributed Optimization

Communication
Synchronization


Online Learning (I)

Online Learning [Shalev-Shwartz, 2011]

1: for $t = 1, 2, \ldots, T$ do
2:   Receive a question $x_t \in \mathcal{X}$
3:   Predict $p_t \in \mathcal{D}$
4:   Receive the true answer $y_t \in \mathcal{Y}$
5:   Suffer loss $\ell(y_t, p_t)$
6: end for


Online Learning (II)

An Example—Online Classification

1: for $t = 1, 2, \ldots, T$ do
2:   Receive a question $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$
3:   Predict $p_t \in \mathcal{D} = \{0, 1\}$
4:   Receive the true answer $y_t \in \mathcal{Y} = \{0, 1\}$
5:   Suffer loss $\ell(y_t, p_t) = |y_t - p_t|$
6: end for
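A direct transcription of this protocol in Python (the data stream and the perceptron-style predictor are illustrative assumptions, not part of the slides):

import numpy as np

def online_classification(stream, d, T):
    # stream yields (x_t, y_t) pairs with y_t in {0, 1}
    w = np.zeros(d)
    cumulative_loss = 0.0
    for t, (x, y) in zip(range(T), stream):
        p = float(x @ w > 0)           # predict p_t in {0, 1}
        cumulative_loss += abs(y - p)  # suffer loss |y_t - p_t|
        w += (2 * y - 1) * x           # perceptron-style update after the answer is revealed
    return w, cumulative_loss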


Online Learning (III)

Cumulative Loss

$$\text{Cumulative Loss} = \sum_{t=1}^{T} \ell(y_t, p_t)$$


Online Learning (IV)

Regret with respect to a hypothesis class $\mathcal{H}$

$$\text{Regret} = \sum_{t=1}^{T} \ell(y_t, p_t) - \min_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell(y_t, h(x_t))$$

where $h: \mathcal{X} \mapsto \mathcal{D}$


Online Convex Optimization (I)

Online Convex Optimization [Shalev-Shwartz, 2011]

1: for $t = 1, 2, \ldots, T$ do
2:   Predict a vector $w_t \in \mathcal{W} \subseteq \mathbb{R}^d$
3:   Receive a convex loss function $f_t: \mathcal{W} \to \mathbb{R}$
4:   Suffer loss $f_t(w_t)$
5: end for


Online Convex Optimization (II)

An Example—Online SVM

1: for $t = 1, 2, \ldots, T$ do
2:   Predict a vector $w_t \in \mathcal{W} \subseteq \mathbb{R}^d$
3:   Receive a training instance $(x_t, y_t)$
4:   Suffer loss $f_t(w_t) = \ell(y_t, x_t^\top w_t) = \max(0, 1 - y_t x_t^\top w_t)$
5: end for


Online Convex Optimization (III)

Regret

$$\text{Regret} = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w)$$


Online Gradient Descent

Full-Information Setting

The learner observes not only $f_t(w_t)$ but the entire function $f_t(\cdot)$.

Online Gradient Descent [Zinkevich, 2003, Hazan et al., 2007]

1: for $t = 1, 2, \ldots, T$ do
2:   Predict a vector $w_t \in \mathcal{W} \subseteq \mathbb{R}^d$
3:   Receive a convex loss function $f_t: \mathcal{W} \to \mathbb{R}$
4:   Suffer loss $f_t(w_t)$
5:   $w_{t+1} = w_t - \eta_t \nabla f_t(w_t)$
6: end for

Regret Bound [Zinkevich, 2003, Hazan et al., 2007]

$$\text{Regret} = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w) = O(\sqrt{T}) \ \text{or} \ O(\log T)$$
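A sketch of OGD on the online SVM losses above, with the step size $\eta_t = 1/\sqrt{t}$ that yields the $O(\sqrt{T})$ bound for convex losses (Python/NumPy; the hinge loss and the unit-ball projection are illustrative choices):

import numpy as np

def ogd(stream, d, T):
    # stream yields (x_t, y_t) with y_t in {-1, +1}; f_t(w) = max(0, 1 - y_t <x_t, w>)
    w = np.zeros(d)
    for t, (x, y) in zip(range(1, T + 1), stream):
        g = -y * x if 1.0 - y * (x @ w) > 0 else np.zeros(d)  # subgradient of f_t at w_t
        w = w - g / np.sqrt(t)                                # eta_t = 1/sqrt(t)
        w = w / max(1.0, np.linalg.norm(w))                   # project back onto W
    return w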


Bandit Setting

Bandit Setting

The learner only observes $f_t(w_t)$.

Generation Process of the Loss Functions

Stochastic: $f_1, \ldots, f_T$ are sampled i.i.d.
Adversarial
  Oblivious: $f_t$ is independent of $w_1, \ldots, w_{t-1}$
  Nonoblivious: $f_t$ may depend on $w_1, \ldots, w_{t-1}$

Types of the Loss Functions

Convex

Linear

Multi-armed Bandit


Multi-armed Bandit

http://blog.mynaweb.com/wp-content/uploads/2013/02/Bandits.png


Stochastic Multi-armed Bandit

Setting

There are $K$ arms
Successive pulls of arm $i$ yield rewards $X_1^i, X_2^i, \ldots$
The rewards are i.i.d. according to an unknown distribution $D_i$ with unknown mean $\mu_i$

The Learning Process

1: for $t = 1, 2, \ldots, T$ do
2:   Pull arm $i \in [K]$
3:   Receive reward $X^i$ sampled from distribution $D_i$
4: end for

How to maximize the rewards? Always pull arm $i = \operatorname*{argmax}_{i \in [K]} \mu_i$


Exploitation-Exploration Dilemma

A Naive Algorithm

Exploration
  1 Pull each arm $m = 100$ times
  2 Calculate the estimated mean $\hat{\mu}_i$ of each arm $i$

Exploitation
  1 Always pull arm $i = \operatorname*{argmax}_{i \in [K]} \hat{\mu}_i$


Upper Confidence Bounds

Exploitation-Exploration Tradeoff by Upper Confidence Bounds (UCB) [Auer et al., 2002]

1: for $t = 1, 2, \ldots, T$ do
2:   Pull arm $i = \operatorname*{argmax}_{i \in [K]} u_i$ (WHP, $u_i \ge \mu_i$)
3:   Receive reward $X^i$ sampled from distribution $D_i$
4:   Update the upper bound $u_i$ for arm $i$
5: end for
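A sketch of the UCB1 instantiation from Auer et al. [2002], where the upper bound for arm $i$ after $n_i$ pulls is its empirical mean plus $\sqrt{2 \ln t / n_i}$ (Python; rewards in $[0, 1]$ are assumed):

import math
import random

def ucb1(arms, T):
    # arms: list of zero-argument callables, each returning a reward in [0, 1]
    K = len(arms)
    counts, means = [0] * K, [0.0] * K
    for i in range(K):  # initialize by pulling each arm once
        counts[i], means[i] = 1, arms[i]()
    for t in range(K + 1, T + 1):
        # u_i = empirical mean + exploration bonus; an upper bound on mu_i WHP
        i = max(range(K), key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # running-average update
    return means, counts

# Usage: two Bernoulli arms with means 0.4 and 0.6
print(ucb1([lambda: float(random.random() < 0.4), lambda: float(random.random() < 0.6)], 10000))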


Summary

Stochastic Optimization

Stochastic Gradient Descent (SGD)—Time Reduction

Stochastic Proximal Gradient Descent (SPGD)—Space Reduction

Distributed Optimization

Distributed Gradient Descent

Alternating Direction Method of Multipliers (ADMM)

Online Learning

Full-Information Setting—Online Gradient Descent

Bandit Setting—Exploitation-Exploration Dilemma


Reference I

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421-436.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609.

Roux, N. L., Schmidt, M., and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2672-2680.


Reference II

Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015). Stochastic proximal gradient descent for nuclear norm regularization. ArXiv e-prints, arXiv:1511.01664.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-936.
