Tutorial: Part 2 - Princeton University Computer Science · 2019-07-31
TRANSCRIPT
Tutorial: PART 2
Online Convex Optimization: A Game-Theoretic Approach to Learning
Elad Hazan (Princeton University), Satyen Kale (Yahoo Research)
Exploiting curvature: logarithmic regret
• Some natural problems admit better regret:
  • Least squares linear regression
  • Soft-margin SVM
  • Portfolio selection
• α-strong convexity: ∇²f(x) ⪰ αI  (e.g. f(x) = ‖x‖²)
• α-exp-concavity: ∇²f(x) ⪰ α∇f(x)∇f(x)⊤  (e.g. f(x) = −log⟨a, x⟩)
Online Soft-‐Margin SVM
• Features a ∈ ℝⁿ, labels b ∈ {−1, 1}
• Hypothesis class H is a set of linear predictors:
  • Parameters: weight vectors x ∈ ℝⁿ
  • Prediction of weight vector x for feature vector a is a⊤x
• Loss f_t(x) = max(0, 1 − b_t·⟨a_t, x⟩) + λ‖x‖²  (hinge term: convex, non-smooth; regularizer: strongly convex)
• Thm [Hazan, Agarwal, Kale 2007]: OGD with step sizes η_t ≈ 1/t has O(log T) regret
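As a concrete sketch of the theorem above (not from the slides): OGD on the soft-margin SVM losses, where the λ‖x‖² term makes each loss strongly convex, so step sizes of order 1/t suffice. The data stream, seed, and constant in η_t are illustrative assumptions.

```python
import numpy as np

def online_svm_ogd(examples, lam=0.1):
    """OGD on f_t(x) = max(0, 1 - b_t * <a_t, x>) + lam * ||x||^2.

    The regularizer makes each f_t strongly convex, so step sizes of
    order 1/t (here eta_t = 1/(lam * t)) give O(log T) regret.
    """
    d = len(examples[0][0])
    x = np.zeros(d)
    for t, (a, b) in enumerate(examples, start=1):
        # Subgradient of the hinge term, plus gradient of lam * ||x||^2.
        g = (-b * a if b * a.dot(x) < 1 else np.zeros(d)) + 2 * lam * x
        x = x - g / (lam * t)  # eta_t = 1/(lam * t)
    return x

# Toy stream: the label is the sign of the first coordinate.
rng = np.random.default_rng(0)
stream = [(a, 1.0 if a[0] > 0 else -1.0)
          for a in rng.standard_normal((500, 3))]
w = online_svm_ogd(stream)
```

After 500 rounds the learned weight vector should separate the toy stream well, since the comparator that puts all weight on the first coordinate has small hinge loss.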
Universal Portfolio Selection
• Loss function f_t(x) = −log⟨r_t, x⟩
• Convex, but no strongly convex component
• But exp-concave: exp(−f_t(x)) = ⟨r_t, x⟩  (a concave function: actually, linear)
• Definition: f is called α-exp-concave if exp(−αf(x)) is concave
• Theorem [Hazan, Agarwal, Kale 2007]: For exp-concave losses, there is an algorithm, Online Newton Step, that obtains O(log T) regret
Online Newton Step algorithm
• Use x₁ = arbitrary point in K in round 1
• For t > 1: use x_t defined by the following equations:
  ∇_{t−1} = ∇f_{t−1}(x_{t−1})
  C_{t−1} = Σ_{τ=1}^{t−1} ∇_τ∇_τ⊤   (≈ cumulative Hessian)
  g_{t−1} = Σ_{τ=1}^{t−1} (∇_τ∇_τ⊤x_τ − c∇_τ)   (≈ cumulative gradient step)
  y_t = C_{t−1}^{−1} g_{t−1}   (≈ Newton update)
  x_t = argmin_{x∈K} (x − y_t)⊤C_{t−1}(x − y_t)
• c is a constant depending on α, other bounds
Regularization: follow the regularized leader, online mirror descent
Follow-the-leader
• Regret = Σ_t f_t(x_t) − min_{x*∈K} Σ_t f_t(x*)
• Most natural: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} f_i(x)
• The be-the-leader algorithm, x′_t = argmin_{x∈K} Σ_{i=1}^{t} f_i(x) = x_{t+1}, has zero regret [Kalai-Vempala, 2005]
• So if f_t(x_t) ≈ f_t(x_{t+1}), we get a regret bound
• Instability: f_t(x_t) − f_t(x_{t+1}) can be large!
Fixing FTL: Follow-‐The-‐Regularized-‐Leader (FTRL)
• First, it suffices to replace f_t by a linear function: ⟨∇f_t(x_t), x⟩
• Next, add regularization:
  x_t = argmin_{x∈K} Σ_{i=1}^{t−1} ∇f_i(x_i)⊤x + (1/η)R(x)
• R(x) is a strongly convex function, η is a regularization constant
• Under certain conditions, this ensures stability:
  ∇f_t(x_t)·x_t − ∇f_t(x_t)·x_{t+1} ≈ O(η)
Specific instances of FTRL
• R(x) = ‖x‖²
• Here, x_t = argmin_{x∈K} Σ_{i=1}^{t−1} ∇f_i(x_i)⊤x + (1/η)R(x) = Π_K(−η Σ_{i=1}^{t−1} ∇f_i(x_i)), where Π_K(y) = argmin_{x∈K} ‖y − x‖
• Almost like OGD: starting with y₁ = 0, for t = 1, 2, …
  x_t = Π_K(y_t)
  y_{t+1} = y_t − η∇f_t(x_t)
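The Euclidean instance above can be sketched in a few lines (an illustrative implementation, not from the slides): each iterate is the projection of −η times the cumulative gradient, the "lazy" form of OGD.

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto K = {x : ||x||_2 <= radius}."""
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

def ftrl_euclidean(grads, eta=0.1, radius=1.0):
    """FTRL with R(x) = ||x||^2 over the Euclidean ball: each iterate is
    the projection of -eta times the cumulative gradient ("lazy" OGD)."""
    cum = np.zeros(len(grads[0]))
    xs = []
    for g in grads:
        xs.append(project_ball(-eta * cum, radius))
        cum = cum + np.asarray(g)
    return xs

# 50 rounds of the constant gradient e_1: iterates march toward -e_1
# and then stay pinned to the boundary of the ball.
xs = ftrl_euclidean([np.array([1.0, 0.0])] * 50)
```

The only difference from eager OGD is that the cumulative gradient is kept unprojected and the projection is applied afresh each round; on a ball the two coincide until the boundary is hit.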
Specific instances of FTRL
• Experts setting: K = Δ_n, distributions over experts
• f_t(x) = ⟨c_t, x⟩, where c_t is the vector of losses
• R(x) = Σ_i x_i log x_i: negative entropy
• Here, x_t = argmin_{x∈K} Σ_{i=1}^{t−1} ∇f_i(x_i)⊤x + (1/η)R(x) = exp(−η Σ_{i=1}^{t−1} c_i)/Z_t   (entrywise exponential; Z_t is a normalization constant)
• Exactly RWM! Starting with y₁ = uniform distribution, for t = 1, 2, …
  x_t = Π̃_K(y_t)   (“projection” = normalization)
  y_{t+1} = y_t ⊙ exp(−ηc_t)   (entrywise product with entrywise exponential)
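The entropic instance above is the multiplicative weights update; a minimal sketch (losses and η are illustrative):

```python
import numpy as np

def rwm(loss_vectors, eta=0.5):
    """Entropic FTRL on the simplex = Randomized Weighted Majority:
    x_t is proportional to exp(-eta * (c_1 + ... + c_{t-1}))."""
    n = len(loss_vectors[0])
    y = np.ones(n) / n            # y_1 = uniform distribution
    xs = []
    for c in loss_vectors:
        xs.append(y / y.sum())    # "projection" = normalization
        y = y * np.exp(-eta * np.asarray(c))  # entrywise update
    return xs

# Expert 0 always incurs loss 0, expert 1 loss 1: mass shifts to expert 0.
xs = rwm([np.array([0.0, 1.0])] * 20)
```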
FTRL ⇔ Online Mirror Descent
x_t = argmin_{x∈K} Σ_{i=1}^{t−1} ∇f_i(x_i)⊤x + (1/η)R(x)
⇔
x_t = Π^R_K(y_t)
y_{t+1} = (∇R)^{−1}(∇R(y_t) − η∇f_t(x_t))
Bregman projection: Π^R_K(y) = argmin_{x∈K} B_R(x‖y), where
B_R(x‖y) := R(x) − R(y) − ∇R(y)⊤(x − y)
Advanced regularization: follow the perturbed leader, AdaGrad
Randomized regularization: Follow-‐The-‐Perturbed-‐Leader
• Special case of OCO, Online Linear Optimization: f_t(x) = ⟨c_t, x⟩
• Follow-The-Perturbed-Leader [Kalai-Vempala 2005]:
  x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c_i⊤x + p⊤x
• Regularization p⊤x: vector p drawn u.a.r. from [−√T, √T]ⁿ
• Expected regret is the same if p is chosen afresh every round
• Re-randomizing every round ⇒ x_t = x_{t+1} w.p. at least 1 − 1/√T
• ⇒ regret = O(√T)
Follow-‐The-‐Perturbed-‐Leader
• Algorithm: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c_i⊤x + p⊤x, where p is a random vector with entries in [−√T, √T]
• Implementation requires only linear optimization over K, easier than the projections needed by OMD
• O(√T) regret bound holds for arbitrary sets K: convexity unnecessary!
• Can be extended to convex losses using the gradient trick + using the expected perturbed leader in each round
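A sketch of FTPL over a finite decision set (the expert vertices and loss stream are illustrative assumptions): note that each round needs only a linear optimization, here a minimum over vertices.

```python
import numpy as np

def ftpl(loss_vectors, vertices, rng):
    """Follow-The-Perturbed-Leader over a finite vertex set.

    One perturbation p ~ Uniform[-sqrt(T), sqrt(T)]^n is drawn up front;
    each round only linear optimization over K is needed (here, a min
    over the vertices of cumulative-loss-plus-perturbation).
    """
    T, n = len(loss_vectors), len(loss_vectors[0])
    p = rng.uniform(-np.sqrt(T), np.sqrt(T), size=n)
    cum = np.zeros(n)
    plays = []
    for c in loss_vectors:
        plays.append(min(vertices, key=lambda v: (cum + p).dot(v)))
        cum = cum + np.asarray(c)
    return plays

# Experts setting: vertices of the simplex; expert 0 is always better.
rng = np.random.default_rng(3)
e = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
plays = ftpl([np.array([0.0, 1.0])] * 100, e, rng)
```

Once expert 1's cumulative loss exceeds the perturbation range, every subsequent round plays expert 0, which is the stability the regret analysis exploits.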
Adaptive Regularization: AdaGrad
• Consider use of OGD in learning a generalized linear model
  • For parameter x ∈ ℝⁿ and example (a, b) ∈ ℝⁿ × ℝ, prediction is some function of ⟨a, x⟩
  • Thus, ∇f_t(x) = ℓ′(⟨a_t, x⟩, b_t)·a_t
• OGD update: x_{t+1} = x_t − η∇f_t(x_t) = x_t − ηℓ′(⟨a_t, x_t⟩, b_t)·a_t
  • Thus, all features are treated equally in updating the parameter vector
• In typical text classification tasks, feature vectors a_t are very sparse ⇒ slow learning!
• Adaptive regularization: implements per-feature learning rates
AdaGrad (diagonal form) [Duchi, Hazan, Singer ’10]
• Set x₁ ∈ K arbitrarily
• For t = 1, 2, …:
  1. use x_t, obtain f_t
  2. compute x_{t+1} as follows:
     G_t = diag(Σ_{i=1}^{t} ∇f_i(x_i)∇f_i(x_i)⊤)
     y_{t+1} = x_t − ηG_t^{−1/2}∇f_t(x_t)
     x_{t+1} = argmin_{x∈K} (y_{t+1} − x)⊤G_t(y_{t+1} − x)
• Regret bound: O(Σ_i √(Σ_t ∇f_t(x_t)_i²))
• Infrequently occurring, or small-scale, features have small influence on regret (and therefore on convergence to the optimal parameter)
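A minimal sketch of the diagonal update above (not the paper's code): the test function, scales, and ε guard are illustrative assumptions, and K = ℝᵈ so the G_t-weighted projection is a no-op.

```python
import numpy as np

def adagrad_diag(x_star, scales, T=500, eta=1.0, eps=1e-8):
    """Diagonal AdaGrad on f(x) = sum_i scales[i] * (x[i] - x_star[i])^2,
    with K = R^d (so the G_t-weighted projection is a no-op).

    Coordinate i is scaled by 1/sqrt(G[i]), the root of its own cumulative
    squared gradients; eps guards division by zero (an implementation
    detail, not on the slide).
    """
    d = len(x_star)
    x = np.zeros(d)
    G = np.zeros(d)  # diagonal of sum_t grad_t grad_t^T
    for _ in range(T):
        grad = 2 * scales * (x - x_star)
        G += grad ** 2
        x = x - eta * grad / (np.sqrt(G) + eps)
    return x

# One steep coordinate and one shallow one: with a single global step
# size, one of them would be badly served; per-feature rates handle both.
x_hat = adagrad_diag(np.array([1.0, -3.0]), np.array([10.0, 0.01]))
```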
Bandit Convex Optimization
Bandit Optimization
• Motivating example: ad selection on websites
• Simplest form: in each round t
  • Choose one ad i_t out of n
  • Simultaneously, the user chooses a loss vector c_t
    (c_t(i) = −revenue if ad i would be clicked if shown, 0 otherwise)
  • Observe the loss of the chosen ad, c_t(i_t)
• Essentially, the experts problem with partial feedback
  • Modeled as online linear optimization over the n-simplex of distributions
  • Available choices are usually called “arms” (from multi-armed bandits)
  • Randomization necessary, so consider expected regret:
    Regret = Σ_{t=1}^{T} c_t(i_t) − min_i Σ_{t=1}^{T} c_t(i)
Constructing cost estimators
• Exploration:
  • Choose arm i_t from {1, 2, …, n} u.a.r.
  • Observe c_t(i_t); set c̃_t = n·c_t(i_t)·e_{i_t}, so E[c̃_t] = c_t
• Exploitation: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x)
• Algorithm:
  1. w.p. q explore (as before), create estimator
  2. o/w exploit
• Regret is essentially (ignoring dependence on n) bounded by
  ≈ q·T + (1 − q)·(1/q)·Regret_FTRL = O(qT + √T/q) = O(T^{2/3})
• Separate explore-exploit template: can we do better?
Simultaneous exploration-‐exploitation [Auer, Cesa-‐Bianchi, Freund, Schapire 2002]
• Exploration + Exploitation:
  • Choose arm i_t from the distribution x_t
  • Observe c_t(i_t); set c̃_t = (c_t(i_t)/x_t(i_t))·e_{i_t}, so E[c̃_t] = c_t
• x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x)
• With R = negative entropy, this algorithm (called EXP3) has O(√T) regret
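A sketch of EXP3 as described above, combining the importance-weighted estimator with the multiplicative-weights (entropic FTRL) update; the loss matrix, η, and seed are illustrative assumptions.

```python
import numpy as np

def exp3(loss_matrix, eta, rng):
    """EXP3: sample arm i_t ~ x_t, build the unbiased estimator
    c_tilde = (c_t(i_t)/x_t(i_t)) e_{i_t}, and feed it to the entropic
    FTRL (multiplicative weights) update.  Returns total incurred loss."""
    T, n = loss_matrix.shape
    w = np.ones(n)
    total = 0.0
    for t in range(T):
        x = w / w.sum()
        i = rng.choice(n, p=x)
        total += loss_matrix[t, i]
        c_tilde = np.zeros(n)
        c_tilde[i] = loss_matrix[t, i] / x[i]   # E[c_tilde] = c_t
        w *= np.exp(-eta * c_tilde)
    return total

rng = np.random.default_rng(2)
T, n = 2000, 3
losses = rng.uniform(0.5, 1.0, size=(T, n))
losses[:, 0] = 0.1                       # arm 0 is clearly best
alg_loss = exp3(losses, eta=0.05, rng=rng)
```

The best arm accumulates loss 0.1·T = 200 here; EXP3 should land within a sublinear margin of that.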
Bandit Linear Optimization
• Generalize from multi-armed bandits
• K: convex set in ℝⁿ
• Linear cost functions: f_t(x) = ⟨c_t, x⟩
• Approach:
  1. (Exploration) construct unbiased estimators: E[c̃_t] = c_t
  2. (Exploitation) plug into FTRL: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x)
Bandit Linear Optimization
• K: convex set in ℝⁿ; linear cost functions: f_t(x) = ⟨c_t, x⟩
• Assume K contains e₁, e₂, …, e_n, an “exploration basis”
• In round t:
  1. W.p. q choose i_t ∈ {1, 2, …, n} u.a.r. and use e_{i_t}. Observe c_t(i_t) and set c̃_t = (c_t(i_t)/(q/n))·e_{i_t}, so E[c̃_t] = c_t
  2. W.p. 1 − q use x_t obtained from FTRL, x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x), and set c̃_t = 0
• Gives a regret bound of O(T^{2/3})
Geometric regularization via barriers [Abernethy, Hazan, Rakhlin 2008]
• R = self-concordant barrier for K
• Dikin ellipsoid at any point x lies in K: {y : (y − x)⊤∇²R(x)(y − x) ≤ 1} ⊆ K
• Algorithm: run FTRL, x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x), but play a random endpoint of the axes of the Dikin ellipsoid at x_t
  • ≡ using x_t in expectation (exploitation)
  • the 2n endpoints suffice to construct c̃_t (exploration)
• Thm: Regret = O(√T)
Bandit Online Convex Optimization [Flaxman, Kalai, McMahan 2005]
• f_t: arbitrary convex functions
• Approach:
  1. Construct an estimator c̃_t of ∇f_t(x_t), using ∇f(x) ≈ (1/δ)·E_{u∈S^{n−1}}[f(x + δu)u]
  2. Plug into FTRL: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x)
• Near the boundary, estimators have large variance; hence x_t and x_{t+1} can be very different
• Regret bound = O(T^{3/4})
• Recent developments: O(√T) regret possible!
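The one-point gradient estimator above can be checked numerically; a sketch with illustrative choices of f, δ, and sample count. (The slide writes the 1/δ scaling; the extra dimension factor n used below is what makes the estimator unbiased for the gradient of the δ-smoothed version of f.)

```python
import numpy as np

def sphere_gradient_estimate(f, x, delta, num_samples, rng):
    """One-point bandit gradient estimate: average of f(x + delta*u) * u
    over uniform u on the unit sphere S^{n-1}, scaled by n/delta."""
    n = len(x)
    U = rng.standard_normal((num_samples, n))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform on S^{n-1}
    vals = np.array([f(x + delta * u) for u in U])
    return (n / delta) * (vals[:, None] * U).mean(axis=0)

# For f(x) = ||x||^2, smoothing only adds a constant, so the smoothed
# gradient equals the true gradient 2x.
rng = np.random.default_rng(4)
g = sphere_gradient_estimate(lambda z: z.dot(z), np.array([1.0, 0.0]),
                             delta=0.5, num_samples=50_000, rng=rng)
```

Note the 1/δ factor in the variance: shrinking δ reduces smoothing bias but inflates estimator variance, which is the tension behind the O(T^{3/4}) rate.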
Projection-‐free algorithms
Matrix completion
In round t:
• Obtain a (user, movie) pair, and output a recommended rating
• Then observe the true rating, and suffer the square loss
Comparison class: all matrices with bounded trace norm

Bounded trace norm matrices
• Trace norm of a matrix = sum of its singular values
• K = { X : X is a matrix with trace norm at most D }
• Optimizing over K promotes low-rank solutions
• The Online Matrix Completion problem is an OCO problem over K
  • FTRL, OMD, etc. require computing projections onto K
  • Expensive! Requires computing an eigendecomposition of the matrix to be projected: an O(n³) operation
• But: linear optimization over K is easier
  • Requires computing only the top singular vector; can typically be done in O(sparsity) time
Projections → linear optimization
1. Matrix completion (K = bounded trace norm matrices): projection = eigendecomposition; linear optimization = largest singular vector computation
2. Online routing (K = flow polytope): projection = conic optimization over the flow polytope; linear optimization = shortest path computation
3. Rotations (K = rotation matrices): projection = conic optimization over the rotations set; linear optimization = Wahba’s algorithm (linear time)
4. Matroids (K = matroid polytope): projection = convex optimization via the ellipsoid method; linear optimization = greedy algorithm (nearly linear time!)
Projection-‐Free Online Learning
• Is online learning possible given access to a linear optimization oracle rather than a projection oracle?
• Yes: projection can be implemented via polynomially many linear optimizations
  • But for efficiency, we would like only a few (1 or 2) calls to the linear optimization oracle per round
• Follow-The-Perturbed-Leader does the job for linear loss functions
• What about general convex loss functions? Is OCO possible with only 1 or 2 calls to a linear optimization oracle per round?
Conditional Gradient algorithm [Frank, Wolfe ’56]
Convex optimization problem: min_{x∈K} f(x)
Assume: f is smooth and convex; linear optimization over K is easy
Algorithm:
  v_t = argmin_{x∈K} ∇f(x_t)⊤x
  x_{t+1} = x_t + η_t(v_t − x_t)
1. At iteration t: x_t is a convex combination of at most t vertices (sparsity)
2. No learning rate tuning: η_t ≈ 1/t (independent of diameter, gradients, etc.)
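The two-line iteration above is easy to instantiate; a sketch on an illustrative problem (the target c and the η_t = 2/(t+2) constant, a standard choice of order 1/t, are assumptions):

```python
import numpy as np

def frank_wolfe(grad_f, linear_opt, x0, T=200):
    """Conditional gradient: each step calls only the linear optimization
    oracle, v_t = argmin_{v in K} grad_f(x_t).v, then moves
    x_{t+1} = x_t + eta_t (v_t - x_t) with eta_t = 2/(t+2)."""
    x = x0
    for t in range(T):
        v = linear_opt(grad_f(x))
        x = x + (2.0 / (t + 2.0)) * (v - x)
    return x

# Minimize ||x - c||^2 over the probability simplex; the optimum is c.
c = np.array([0.2, 0.5, 0.3])

def simplex_linear_opt(g):
    # Linear optimization over the simplex: the best vertex e_i.
    v = np.zeros(len(g))
    v[int(np.argmin(g))] = 1.0
    return v

x_hat = frank_wolfe(lambda x: 2 * (x - c), simplex_linear_opt, np.ones(3) / 3)
```

Since every update is a convex combination of the current point and a vertex, the iterates stay in K without any projection, which is the whole point of the method.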
Online Conditional Gradient [Hazan, Kale ’12]
• Set x₁ ∈ K arbitrarily
• For t = 1, 2, …:
  1. Use x_t, obtain f_t
  2. Compute x_{t+1} as follows:
     v_t = argmin_{x∈K} (Σ_{i=1}^{t} ∇f_i(x_i) + λ_t x_t)⊤x
     x_{t+1} = (1 − t^{−α})x_t + t^{−α}v_t
• Theorem: Regret = O(T^{3/4})
Linearly-convergent CG [Garber, Hazan 2013]
Theorem: For polytopes, strongly-convex and smooth losses:
1. Offline: optimality gap after t steps: exp(−ct)
2. Online: Regret = O(√T)
Directions for future research
1. Efficient algorithms & optimal dependence on dimension for bandit OCO
2. Faster online matrix algorithms
3. Optimal regret / running-time tradeoffs
4. Optimal-regret and projection-free algorithms for general convex sets
5. Time series / dependence / adaptivity (vs. static regret)
6. Adding state / online Reinforcement Learning / MDPs / stochastic games [lots of existing work…]
7. Tensor/spectral learning in the online setting
8. Game-theoretic aspects: stronger notions, approachability, …
Bibliography & more information, see:http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm
Thank you!