Tutorial: Part 2 - Princeton University Computer Science · 2019-07-31
TRANSCRIPT
Tutorial: PART 2
Online Convex Optimization: A Game-Theoretic Approach to Learning
Elad Hazan (Princeton University), Satyen Kale (Yahoo Research)
Exploiting curvature: logarithmic regret
• Some natural problems admit better regret:
  • Least squares linear regression
  • Soft-margin SVM
  • Portfolio selection
• α-strong convexity: ∇²f(x) ⪰ αI  (e.g. f(x) = ‖x‖²)
• α-exp-concavity: ∇²f(x) ⪰ α∇f(x)∇f(x)⊤  (e.g. f(x) = −log⟨a, x⟩)
Online Soft-‐Margin SVM
• Features a ∈ ℝⁿ, labels b ∈ {−1, 1}
• Hypothesis class H is a set of linear predictors:
  • Parameters: weight vectors x ∈ ℝⁿ
  • Prediction of weight vector x for feature vector a is a⊤x
• Loss f_t(x) = max(0, 1 − b_t·⟨a_t, x⟩) + λ‖x‖²  (hinge term: convex, non-smooth; regularizer: strongly convex)
• Thm [Hazan, Agarwal, Kale 2007]: OGD with step sizes η_t ≈ 1/t has O(log T) regret
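As a concrete sketch of the theorem above (not from the slides): OGD on the soft-margin SVM losses, where the λ‖x‖² term makes each loss strongly convex, so step sizes of order 1/t suffice. The data stream, seed, and constant in η_t are illustrative assumptions.

```python
import numpy as np

def online_svm_ogd(examples, lam=0.1):
    """OGD on f_t(x) = max(0, 1 - b_t * <a_t, x>) + lam * ||x||^2.

    The regularizer makes each f_t strongly convex, so step sizes of
    order 1/t (here eta_t = 1/(lam * t)) give O(log T) regret.
    """
    d = len(examples[0][0])
    x = np.zeros(d)
    for t, (a, b) in enumerate(examples, start=1):
        # Subgradient of the hinge term, plus gradient of lam * ||x||^2.
        g = (-b * a if b * a.dot(x) < 1 else np.zeros(d)) + 2 * lam * x
        x = x - g / (lam * t)  # eta_t = 1/(lam * t)
    return x

# Toy stream: the label is the sign of the first coordinate.
rng = np.random.default_rng(0)
stream = [(a, 1.0 if a[0] > 0 else -1.0)
          for a in rng.standard_normal((500, 3))]
w = online_svm_ogd(stream)
```

After 500 rounds the learned weight vector should separate the toy stream well, since the comparator that puts all weight on the first coordinate has small hinge loss.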
Universal Portfolio Selection
• Loss function f_t(x) = −log⟨r_t, x⟩
• Convex, but no strongly convex component
• But exp-concave: exp(−f_t(x)) = ⟨r_t, x⟩  (a concave function: actually, linear)
• Definition: f is called α-exp-concave if exp(−αf(x)) is concave
• Theorem [Hazan, Agarwal, Kale 2007]: For exp-concave losses, there is an algorithm, Online Newton Step, that obtains O(log T) regret
Online Newton Step algorithm
• Use x₁ = arbitrary point in K in round 1
• For t > 1: use x_t defined by the following equations:
  ∇_{t−1} = ∇f_{t−1}(x_{t−1})
  C_{t−1} = Σ_{τ=1}^{t−1} ∇_τ∇_τ⊤   (≈ cumulative Hessian)
  g_{t−1} = Σ_{τ=1}^{t−1} (∇_τ∇_τ⊤x_τ − c∇_τ)   (≈ cumulative gradient step)
  y_t = C_{t−1}^{−1} g_{t−1}   (≈ Newton update)
  x_t = argmin_{x∈K} (x − y_t)⊤C_{t−1}(x − y_t)
• c is a constant depending on α, other bounds
Regularization: follow the regularized leader, online mirror descent
Follow-the-leader
• Regret = Σ_t f_t(x_t) − min_{x*∈K} Σ_t f_t(x*)
• Most natural: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} f_i(x)
• The be-the-leader algorithm, x′_t = argmin_{x∈K} Σ_{i=1}^{t} f_i(x) = x_{t+1}, has zero regret [Kalai-Vempala, 2005]
• So if f_t(x_t) ≈ f_t(x_{t+1}), we get a regret bound
• Instability: f_t(x_t) − f_t(x_{t+1}) can be large!
Fixing FTL: Follow-‐The-‐Regularized-‐Leader (FTRL)
• First, it suffices to replace f_t by a linear function: ⟨∇f_t(x_t), x⟩
• Next, add regularization:
  x_t = argmin_{x∈K} Σ_{i=1}^{t−1} ∇f_i(x_i)⊤x + (1/η)R(x)
• R(x) is a strongly convex function, η is a regularization constant
• Under certain conditions, this ensures stability:
  ∇f_t(x_t)·x_t − ∇f_t(x_t)·x_{t+1} ≈ O(η)
Specific instances of FTRL
• R(x) = ‖x‖²
• Here, x_t = argmin_{x∈K} Σ_{i=1}^{t−1} ∇f_i(x_i)⊤x + (1/η)R(x) = Π_K(−η Σ_{i=1}^{t−1} ∇f_i(x_i)), where Π_K(y) = argmin_{x∈K} ‖y − x‖
• Almost like OGD: starting with y₁ = 0, for t = 1, 2, …
  x_t = Π_K(y_t)
  y_{t+1} = y_t − η∇f_t(x_t)
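The Euclidean instance above can be sketched in a few lines (an illustrative implementation, not from the slides): each iterate is the projection of −η times the cumulative gradient, the "lazy" form of OGD.

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto K = {x : ||x||_2 <= radius}."""
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

def ftrl_euclidean(grads, eta=0.1, radius=1.0):
    """FTRL with R(x) = ||x||^2 over the Euclidean ball: each iterate is
    the projection of -eta times the cumulative gradient ("lazy" OGD)."""
    cum = np.zeros(len(grads[0]))
    xs = []
    for g in grads:
        xs.append(project_ball(-eta * cum, radius))
        cum = cum + np.asarray(g)
    return xs

# 50 rounds of the constant gradient e_1: iterates march toward -e_1
# and then stay pinned to the boundary of the ball.
xs = ftrl_euclidean([np.array([1.0, 0.0])] * 50)
```

The only difference from eager OGD is that the cumulative gradient is kept unprojected and the projection is applied afresh each round; on a ball the two coincide until the boundary is hit.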
Specific instances of FTRL
• Experts setting: K = Δ_n, distributions over experts
• f_t(x) = ⟨c_t, x⟩, where c_t is the vector of losses
• R(x) = Σ_i x_i log x_i: negative entropy
• Here, x_t = argmin_{x∈K} Σ_{i=1}^{t−1} ∇f_i(x_i)⊤x + (1/η)R(x) = exp(−η Σ_{i=1}^{t−1} c_i)/Z_t   (entrywise exponential; Z_t is a normalization constant)
• Exactly RWM! Starting with y₁ = uniform distribution, for t = 1, 2, …
  x_t = Π̃_K(y_t)   (“projection” = normalization)
  y_{t+1} = y_t ⊙ exp(−ηc_t)   (entrywise product with entrywise exponential)
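The entropic instance above is the multiplicative weights update; a minimal sketch (losses and η are illustrative):

```python
import numpy as np

def rwm(loss_vectors, eta=0.5):
    """Entropic FTRL on the simplex = Randomized Weighted Majority:
    x_t is proportional to exp(-eta * (c_1 + ... + c_{t-1}))."""
    n = len(loss_vectors[0])
    y = np.ones(n) / n            # y_1 = uniform distribution
    xs = []
    for c in loss_vectors:
        xs.append(y / y.sum())    # "projection" = normalization
        y = y * np.exp(-eta * np.asarray(c))  # entrywise update
    return xs

# Expert 0 always incurs loss 0, expert 1 loss 1: mass shifts to expert 0.
xs = rwm([np.array([0.0, 1.0])] * 20)
```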
FTRL ⇔ Online Mirror Descent
x_t = argmin_{x∈K} Σ_{i=1}^{t−1} ∇f_i(x_i)⊤x + (1/η)R(x)
⇔
x_t = Π^R_K(y_t)
y_{t+1} = (∇R)^{−1}(∇R(y_t) − η∇f_t(x_t))
Bregman projection: Π^R_K(y) = argmin_{x∈K} B_R(x‖y), where
B_R(x‖y) := R(x) − R(y) − ∇R(y)⊤(x − y)
Advanced regularization: follow the perturbed leader, AdaGrad
Randomized regularization: Follow-‐The-‐Perturbed-‐Leader
• Special case of OCO, Online Linear Optimization: f_t(x) = ⟨c_t, x⟩
• Follow-The-Perturbed-Leader [Kalai-Vempala 2005]:
  x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c_i⊤x + p⊤x
• Regularization p⊤x: vector p drawn u.a.r. from [−√T, √T]ⁿ
• Expected regret is the same if p is chosen afresh every round
• Re-randomizing every round ⇒ x_t = x_{t+1} w.p. at least 1 − 1/√T
• ⇒ regret = O(√T)
Follow-‐The-‐Perturbed-‐Leader
• Algorithm: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c_i⊤x + p⊤x, where p is a random vector with entries in [−√T, √T]
• Implementation requires only linear optimization over K, easier than the projections needed by OMD
• O(√T) regret bound holds for arbitrary sets K: convexity unnecessary!
• Can be extended to convex losses using the gradient trick + using the expected perturbed leader in each round
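A sketch of FTPL over a finite decision set (the expert vertices and loss stream are illustrative assumptions): note that each round needs only a linear optimization, here a minimum over vertices.

```python
import numpy as np

def ftpl(loss_vectors, vertices, rng):
    """Follow-The-Perturbed-Leader over a finite vertex set.

    One perturbation p ~ Uniform[-sqrt(T), sqrt(T)]^n is drawn up front;
    each round only linear optimization over K is needed (here, a min
    over the vertices of cumulative-loss-plus-perturbation).
    """
    T, n = len(loss_vectors), len(loss_vectors[0])
    p = rng.uniform(-np.sqrt(T), np.sqrt(T), size=n)
    cum = np.zeros(n)
    plays = []
    for c in loss_vectors:
        plays.append(min(vertices, key=lambda v: (cum + p).dot(v)))
        cum = cum + np.asarray(c)
    return plays

# Experts setting: vertices of the simplex; expert 0 is always better.
rng = np.random.default_rng(3)
e = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
plays = ftpl([np.array([0.0, 1.0])] * 100, e, rng)
```

Once expert 1's cumulative loss exceeds the perturbation range, every subsequent round plays expert 0, which is the stability the regret analysis exploits.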
Adaptive Regularization: AdaGrad
• Consider use of OGD in learning a generalized linear model
  • For parameter x ∈ ℝⁿ and example (a, b) ∈ ℝⁿ × ℝ, prediction is some function of ⟨a, x⟩
  • Thus, ∇f_t(x) = ℓ′(⟨a_t, x⟩, b_t)·a_t
• OGD update: x_{t+1} = x_t − η∇f_t(x_t) = x_t − ηℓ′(⟨a_t, x_t⟩, b_t)·a_t
  • Thus, all features are treated equally in updating the parameter vector
• In typical text classification tasks, feature vectors a_t are very sparse ⇒ slow learning!
• Adaptive regularization: implements per-feature learning rates
AdaGrad (diagonal form) [Duchi, Hazan, Singer ’10]
• Set x₁ ∈ K arbitrarily
• For t = 1, 2, …:
  1. use x_t, obtain f_t
  2. compute x_{t+1} as follows:
     G_t = diag(Σ_{i=1}^{t} ∇f_i(x_i)∇f_i(x_i)⊤)
     y_{t+1} = x_t − ηG_t^{−1/2}∇f_t(x_t)
     x_{t+1} = argmin_{x∈K} (y_{t+1} − x)⊤G_t(y_{t+1} − x)
• Regret bound: O(Σ_i √(Σ_t ∇f_t(x_t)_i²))
• Infrequently occurring, or small-scale, features have small influence on regret (and therefore on convergence to the optimal parameter)
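A minimal sketch of the diagonal update above (not the paper's code): the test function, scales, and ε guard are illustrative assumptions, and K = ℝᵈ so the G_t-weighted projection is a no-op.

```python
import numpy as np

def adagrad_diag(x_star, scales, T=500, eta=1.0, eps=1e-8):
    """Diagonal AdaGrad on f(x) = sum_i scales[i] * (x[i] - x_star[i])^2,
    with K = R^d (so the G_t-weighted projection is a no-op).

    Coordinate i is scaled by 1/sqrt(G[i]), the root of its own cumulative
    squared gradients; eps guards division by zero (an implementation
    detail, not on the slide).
    """
    d = len(x_star)
    x = np.zeros(d)
    G = np.zeros(d)  # diagonal of sum_t grad_t grad_t^T
    for _ in range(T):
        grad = 2 * scales * (x - x_star)
        G += grad ** 2
        x = x - eta * grad / (np.sqrt(G) + eps)
    return x

# One steep coordinate and one shallow one: with a single global step
# size, one of them would be badly served; per-feature rates handle both.
x_hat = adagrad_diag(np.array([1.0, -3.0]), np.array([10.0, 0.01]))
```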
Bandit Convex Optimization
Bandit Optimization
• Motivating example: ad selection on websites
• Simplest form: in each round t
  • Choose one ad i_t out of n
  • Simultaneously, the user chooses a loss vector c_t
    (c_t(i) = −revenue if ad i would be clicked if shown, 0 otherwise)
  • Observe the loss of the chosen ad, c_t(i_t)
• Essentially, the experts problem with partial feedback
  • Modeled as online linear optimization over the n-simplex of distributions
  • Available choices are usually called “arms” (from multi-armed bandits)
  • Randomization necessary, so consider expected regret:
    Regret = Σ_{t=1}^{T} c_t(i_t) − min_i Σ_{t=1}^{T} c_t(i)
Constructing cost estimators
• Exploration:
  • Choose arm i_t from {1, 2, …, n} u.a.r.
  • Observe c_t(i_t); set c̃_t = n·c_t(i_t)·e_{i_t}, so E[c̃_t] = c_t
• Exploitation: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x)
• Algorithm:
  1. w.p. q explore (as before), create estimator
  2. o/w exploit
• Regret is essentially (ignoring dependence on n) bounded by
  ≈ q·T + (1 − q)·(1/q)·Regret_FTRL = O(qT + √T/q) = O(T^{2/3})
• Separate explore-exploit template: can we do better?
Simultaneous exploration-‐exploitation [Auer, Cesa-‐Bianchi, Freund, Schapire 2002]
• Exploration + Exploitation:
  • Choose arm i_t from the distribution x_t
  • Observe c_t(i_t); set c̃_t = (c_t(i_t)/x_t(i_t))·e_{i_t}, so E[c̃_t] = c_t
• x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x)
• With R = negative entropy, this algorithm (called EXP3) has O(√T) regret
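A sketch of EXP3 as described above, combining the importance-weighted estimator with the multiplicative-weights (entropic FTRL) update; the loss matrix, η, and seed are illustrative assumptions.

```python
import numpy as np

def exp3(loss_matrix, eta, rng):
    """EXP3: sample arm i_t ~ x_t, build the unbiased estimator
    c_tilde = (c_t(i_t)/x_t(i_t)) e_{i_t}, and feed it to the entropic
    FTRL (multiplicative weights) update.  Returns total incurred loss."""
    T, n = loss_matrix.shape
    w = np.ones(n)
    total = 0.0
    for t in range(T):
        x = w / w.sum()
        i = rng.choice(n, p=x)
        total += loss_matrix[t, i]
        c_tilde = np.zeros(n)
        c_tilde[i] = loss_matrix[t, i] / x[i]   # E[c_tilde] = c_t
        w *= np.exp(-eta * c_tilde)
    return total

rng = np.random.default_rng(2)
T, n = 2000, 3
losses = rng.uniform(0.5, 1.0, size=(T, n))
losses[:, 0] = 0.1                       # arm 0 is clearly best
alg_loss = exp3(losses, eta=0.05, rng=rng)
```

The best arm accumulates loss 0.1·T = 200 here; EXP3 should land within a sublinear margin of that.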
Bandit Linear Optimization
• Generalize from multi-armed bandits
• K: convex set in ℝⁿ
• Linear cost functions: f_t(x) = ⟨c_t, x⟩
• Approach:
  1. (Exploration) construct unbiased estimators: E[c̃_t] = c_t
  2. (Exploitation) plug into FTRL: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x)
Bandit Linear Optimization
• K: convex set in ℝⁿ; linear cost functions: f_t(x) = ⟨c_t, x⟩
• Assume K contains e₁, e₂, …, e_n, an “exploration basis”
• In round t:
  1. W.p. q choose i_t ∈ {1, 2, …, n} u.a.r. and use e_{i_t}. Observe c_t(i_t) and set c̃_t = (c_t(i_t)/(q/n))·e_{i_t}, so E[c̃_t] = c_t
  2. W.p. 1 − q use x_t obtained from FTRL, x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x), and set c̃_t = 0
• Gives a regret bound of O(T^{2/3})
Geometric regularization via barriers [Abernethy, Hazan, Rakhlin 2008]
• R = self-concordant barrier for K
• Dikin ellipsoid at any point x lies in K: {y : (y − x)⊤∇²R(x)(y − x) ≤ 1} ⊆ K
• Algorithm: run FTRL, x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x), but play a random endpoint of the axes of the Dikin ellipsoid at x_t
  • ≡ using x_t in expectation (exploitation)
  • the 2n endpoints suffice to construct c̃_t (exploration)
• Thm: Regret = O(√T)
Bandit Online Convex Optimization [Flaxman, Kalai, McMahan 2005]
• f_t: arbitrary convex functions
• Approach:
  1. Construct an estimator c̃_t of ∇f_t(x_t), using ∇f(x) ≈ (1/δ)·E_{u∈S^{n−1}}[f(x + δu)u]
  2. Plug into FTRL: x_t = argmin_{x∈K} Σ_{i=1}^{t−1} c̃_i⊤x + (1/η)R(x)
• Near the boundary, estimators have large variance; hence x_t and x_{t+1} can be very different
• Regret bound = O(T^{3/4})
• Recent developments: O(√T) regret possible!
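The one-point gradient estimator above can be checked numerically; a sketch with illustrative choices of f, δ, and sample count. (The slide writes the 1/δ scaling; the extra dimension factor n used below is what makes the estimator unbiased for the gradient of the δ-smoothed version of f.)

```python
import numpy as np

def sphere_gradient_estimate(f, x, delta, num_samples, rng):
    """One-point bandit gradient estimate: average of f(x + delta*u) * u
    over uniform u on the unit sphere S^{n-1}, scaled by n/delta."""
    n = len(x)
    U = rng.standard_normal((num_samples, n))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform on S^{n-1}
    vals = np.array([f(x + delta * u) for u in U])
    return (n / delta) * (vals[:, None] * U).mean(axis=0)

# For f(x) = ||x||^2, smoothing only adds a constant, so the smoothed
# gradient equals the true gradient 2x.
rng = np.random.default_rng(4)
g = sphere_gradient_estimate(lambda z: z.dot(z), np.array([1.0, 0.0]),
                             delta=0.5, num_samples=50_000, rng=rng)
```

Note the 1/δ factor in the variance: shrinking δ reduces smoothing bias but inflates estimator variance, which is the tension behind the O(T^{3/4}) rate.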
Projection-‐free algorithms
Matrix completion
In round t:
• Obtain a (user, movie) pair, and output a recommended rating
• Then observe the true rating, and suffer the square loss
Comparison class: all matrices with bounded trace norm

Bounded trace norm matrices
• Trace norm of a matrix = sum of its singular values
• K = { X : X is a matrix with trace norm at most D }
• Optimizing over K promotes low-rank solutions
• The Online Matrix Completion problem is an OCO problem over K
  • FTRL, OMD, etc. require computing projections onto K
  • Expensive! Requires computing an eigendecomposition of the matrix to be projected: an O(n³) operation
• But: linear optimization over K is easier
  • Requires computing only the top singular vector; can typically be done in O(sparsity) time
Projections → linear optimization
1. Matrix completion (K = bounded trace norm matrices): projection = eigendecomposition; linear optimization = largest singular vector computation
2. Online routing (K = flow polytope): projection = conic optimization over the flow polytope; linear optimization = shortest path computation
3. Rotations (K = rotation matrices): projection = conic optimization over the rotations set; linear optimization = Wahba’s algorithm (linear time)
4. Matroids (K = matroid polytope): projection = convex optimization via the ellipsoid method; linear optimization = greedy algorithm (nearly linear time!)
Projection-‐Free Online Learning
• Is online learning possible given access to a linear optimization oracle rather than a projection oracle?
• Yes: projection can be implemented via polynomially many linear optimizations
  • But for efficiency, we would like only a few (1 or 2) calls to the linear optimization oracle per round
• Follow-The-Perturbed-Leader does the job for linear loss functions
• What about general convex loss functions? Is OCO possible with only 1 or 2 calls to a linear optimization oracle per round?
Conditional Gradient algorithm [Frank, Wolfe ’56]
Convex optimization problem: min_{x∈K} f(x)
Assume: f is smooth and convex; linear optimization over K is easy
Algorithm:
  v_t = argmin_{x∈K} ∇f(x_t)⊤x
  x_{t+1} = x_t + η_t(v_t − x_t)
1. At iteration t: x_t is a convex combination of at most t vertices (sparsity)
2. No learning rate tuning: η_t ≈ 1/t (independent of diameter, gradients, etc.)
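The two-line iteration above is easy to instantiate; a sketch on an illustrative problem (the target c and the η_t = 2/(t+2) constant, a standard choice of order 1/t, are assumptions):

```python
import numpy as np

def frank_wolfe(grad_f, linear_opt, x0, T=200):
    """Conditional gradient: each step calls only the linear optimization
    oracle, v_t = argmin_{v in K} grad_f(x_t).v, then moves
    x_{t+1} = x_t + eta_t (v_t - x_t) with eta_t = 2/(t+2)."""
    x = x0
    for t in range(T):
        v = linear_opt(grad_f(x))
        x = x + (2.0 / (t + 2.0)) * (v - x)
    return x

# Minimize ||x - c||^2 over the probability simplex; the optimum is c.
c = np.array([0.2, 0.5, 0.3])

def simplex_linear_opt(g):
    # Linear optimization over the simplex: the best vertex e_i.
    v = np.zeros(len(g))
    v[int(np.argmin(g))] = 1.0
    return v

x_hat = frank_wolfe(lambda x: 2 * (x - c), simplex_linear_opt, np.ones(3) / 3)
```

Since every update is a convex combination of the current point and a vertex, the iterates stay in K without any projection, which is the whole point of the method.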
Online Conditional Gradient [Hazan, Kale ’12]
• Set x₁ ∈ K arbitrarily
• For t = 1, 2, …:
  1. Use x_t, obtain f_t
  2. Compute x_{t+1} as follows:
     v_t = argmin_{x∈K} (Σ_{i=1}^{t} ∇f_i(x_i) + λ_t x_t)⊤x
     x_{t+1} = (1 − t^{−α})x_t + t^{−α}v_t
• Theorem: Regret = O(T^{3/4})
Linearly-convergent CG [Garber, Hazan 2013]
Theorem: For polytopes, strongly-convex and smooth losses:
1. Offline: optimality gap after t steps: exp(−ct)
2. Online: Regret = O(√T)
Directions for future research
1. Efficient algorithms & optimal dependence on dimension for bandit OCO
2. Faster online matrix algorithms
3. Optimal regret / running-time tradeoffs
4. Optimal-regret and projection-free algorithms for general convex sets
5. Time series / dependence / adaptivity (vs. static regret)
6. Adding state / online Reinforcement Learning / MDPs / stochastic games [lots of existing work…]
7. Tensor/spectral learning in the online setting
8. Game-theoretic aspects: stronger notions, approachability, …
Bibliography & more information, see:http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm
Thank you!