L. Vandenberghe ECE236C (Spring 2019)
13. Generalized distances and mirror descent
• Bregman distance
• properties
• Bregman proximal mapping
• mirror descent
13.1
Motivation: proximal gradient method
proximal gradient step for minimizing $f(x) = g(x) + h(x)$ (page 4.4):
\[
x_{k+1} = \operatorname{prox}_{t_k h}\bigl(x_k - t_k \nabla g(x_k)\bigr)
= \operatorname*{argmin}_u \Bigl( h(u) + g(x_k) + \nabla g(x_k)^T (u - x_k) + \frac{1}{2t_k}\|u - x_k\|_2^2 \Bigr)
\]
Interpretation: the quadratic term represents
• a penalty that forces $x_{k+1}$ to be close to $x_k$, where the linearization of g is accurate
• an approximation of the error term in the linearization of g at $x_k$
Generalized distances and mirror descent 13.2
Generalized proximal gradient method
replace $\frac{1}{2}\|u - x\|_2^2$ with a generalized distance $d(u, x)$:
\[
x_{k+1} = \operatorname*{argmin}_u \Bigl( h(u) + g(x_k) + \nabla g(x_k)^T (u - x_k) + \frac{1}{t_k}\, d(u, x_k) \Bigr)
\]
Potential benefits
1. “pre-conditioning”: use a more accurate model of g(u) around $x_k$, ideally
\[
\frac{1}{t_k}\, d(u, x_k) \approx g(u) - g(x_k) - \nabla g(x_k)^T (u - x_k)
\]
2. make the generalized proximal mapping (the minimizer u) easier to compute
the goal of 1 is to reduce the number of iterations; the goal of 2 is to reduce the cost per iteration
Generalized distances and mirror descent 13.3
Bregman distance
\[
d(x, y) = \phi(x) - \phi(y) - \nabla\phi(y)^T (x - y)
\]
• φ is convex and continuously differentiable on int(dom φ)
• domain of φ may include its boundary or a subset of its boundary
• we define the domain of d as dom d = dom φ × int(dom φ)
• φ is called the kernel function or distance-generating function
[Figure: graph of φ with its tangent line at (y, φ(y)); the Bregman distance d(x, y) is the vertical gap at x between the point (x, φ(x)) and the tangent line.]
other properties of φ will sometimes be required, and will be mentioned explicitly (e.g., strict convexity)
Generalized distances and mirror descent 13.4
Immediate properties
\[
d(x, y) = \phi(x) - \phi(y) - \nabla\phi(y)^T (x - y)
\]
• d(x, y) is convex in x for fixed y
• d(x, y) ≥ 0, with equality if x = y
• if φ is strictly convex, then d(x, y) = 0 only if x = y
• d(x, y) ≠ d(y, x) in general
to emphasize lack of symmetry, d is also called a directed distance or divergence
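These properties are easy to check numerically. Below is a minimal sketch (not from the slides; the helper names are ours) that builds a Bregman distance from a kernel and its gradient, instantiated with the entropy kernel:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman distance d(x, y) = phi(x) - phi(y) - grad_phi(y)^T (x - y)."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# entropy kernel phi(x) = sum_i x_i log x_i, with gradient log(x) + 1
phi = lambda x: np.sum(x * np.log(x))
grad_phi = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(phi, grad_phi, x, y))  # nonnegative
print(bregman(phi, grad_phi, x, x))  # zero when the arguments coincide
print(bregman(phi, grad_phi, y, x))  # differs from d(x, y): not symmetric
```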
Generalized distances and mirror descent 13.5
Examples
Squared Euclidean distance (with dom φ = $\mathbf{R}^n$)
\[
\phi(x) = \frac{1}{2} x^T x, \qquad \nabla\phi(x) = x, \qquad d(x, y) = \frac{1}{2}\|x - y\|_2^2
\]
General quadratic kernel (with dom φ = $\mathbf{R}^n$)
\[
\phi(x) = \frac{1}{2} x^T A x, \qquad \nabla\phi(x) = Ax, \qquad d(x, y) = \frac{1}{2}(x - y)^T A (x - y)
\]
• A is symmetric positive definite
• in some applications, A is positive semidefinite, but not positive definite
Generalized distances and mirror descent 13.6
Examples
Relative entropy (with dom φ = $\mathbf{R}^n_+$)
\[
\phi(x) = \sum_{i=1}^n x_i \log x_i, \qquad
\nabla\phi(x) = \begin{bmatrix} \log x_1 + 1 \\ \vdots \\ \log x_n + 1 \end{bmatrix}
\]
\[
d(x, y) = \sum_{i=1}^n \Bigl( x_i \log\frac{x_i}{y_i} - x_i + y_i \Bigr)
\]
Logistic loss divergence (with dom φ = $[0, 1]^n$)
\[
\phi(x) = \sum_{i=1}^n \bigl( x_i \log x_i + (1 - x_i)\log(1 - x_i) \bigr), \qquad
\nabla\phi(x) = \begin{bmatrix} \log(x_1/(1 - x_1)) \\ \vdots \\ \log(x_n/(1 - x_n)) \end{bmatrix}
\]
\[
d(x, y) = \sum_{i=1}^n \Bigl( x_i \log\frac{x_i}{y_i} + (1 - x_i) \log\frac{1 - x_i}{1 - y_i} \Bigr)
\]
Generalized distances and mirror descent 13.7
Examples
Hellinger divergence (with dom φ = $[-1, 1]^n$)
\[
\phi(x) = -\sum_{i=1}^n \sqrt{1 - x_i^2}, \qquad
\nabla\phi(x) = \begin{bmatrix} x_1/\sqrt{1 - x_1^2} \\ \vdots \\ x_n/\sqrt{1 - x_n^2} \end{bmatrix}
\]
\[
d(x, y) = \sum_{i=1}^n \left( -\sqrt{1 - x_i^2} + \frac{1 - x_i y_i}{\sqrt{1 - y_i^2}} \right)
\]
Generalized distances and mirror descent 13.8
Examples
Logarithmic barrier (with dom φ = $\mathbf{R}^n_{++}$)
\[
\phi(x) = -\sum_{i=1}^n \log x_i, \qquad
\nabla\phi(x) = \begin{bmatrix} -1/x_1 \\ \vdots \\ -1/x_n \end{bmatrix}, \qquad
d(x, y) = \sum_{i=1}^n \Bigl( \frac{x_i}{y_i} - \log\frac{x_i}{y_i} - 1 \Bigr)
\]
d(x, y) is sometimes called the Itakura–Saito divergence
Inverse barrier (with dom φ = $\mathbf{R}^n_{++}$)
\[
\phi(x) = \sum_{i=1}^n \frac{1}{x_i}, \qquad
\nabla\phi(x) = \begin{bmatrix} -1/x_1^2 \\ \vdots \\ -1/x_n^2 \end{bmatrix}, \qquad
d(x, y) = \sum_{i=1}^n \frac{1}{y_i} \left( \sqrt{\frac{x_i}{y_i}} - \sqrt{\frac{y_i}{x_i}} \right)^2
\]
Generalized distances and mirror descent 13.9
Bregman distances for symmetric matrices
\[
d(X, Y) = \phi(X) - \phi(Y) - \operatorname{tr}\bigl(\nabla\phi(Y)(X - Y)\bigr)
\]
• kernel φ is a convex function on Sn, differentiable on int (dom φ)
• domain of d is dom d = dom φ × int (dom φ)
Relative entropy (with dom φ = $\mathbf{S}^n_{++}$)
\[
\phi(X) = -\log\det X, \qquad \nabla\phi(X) = -X^{-1}
\]
\[
d(X, Y) = \operatorname{tr}(XY^{-1}) - \log\det(XY^{-1}) - n
\]
• d(X,Y ) is relative entropy between normal distributions N(0,X) and N(0,Y )
• also known as Kullback–Leibler divergence
Generalized distances and mirror descent 13.10
Bregman distances for symmetric matrices
Matrix entropy (with dom φ = $\mathbf{S}^n_{++}$):
\[
\phi(X) = \operatorname{tr}(X \log X), \qquad \nabla\phi(X) = I + \log X
\]
\[
d(X, Y) = \operatorname{tr}(X \log X - X \log Y - X + Y)
\]
• the matrix logarithm log X is defined as
\[
\log X = \sum_{i=1}^n (\log\lambda_i)\, q_i q_i^T
\]
if X has eigendecomposition $X = \sum_i \lambda_i q_i q_i^T$
• d(X,Y ) is also known as quantum relative entropy
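As an illustration, here is a small numerical sketch (ours, not from the slides) of the quantum relative entropy, computing the matrix logarithm through an eigendecomposition as defined above:

```python
import numpy as np

def logm_sym(X):
    """Matrix logarithm of a symmetric positive definite X via eigendecomposition."""
    lam, Q = np.linalg.eigh(X)
    return (Q * np.log(lam)) @ Q.T  # Q diag(log lambda) Q^T

def quantum_rel_entropy(X, Y):
    """d(X, Y) = tr(X log X - X log Y - X + Y) for X, Y in S^n_{++}."""
    return np.trace(X @ logm_sym(X) - X @ logm_sym(Y) - X + Y)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
X = A @ A.T + np.eye(4)  # random positive definite matrices
Y = B @ B.T + np.eye(4)
print(quantum_rel_entropy(X, Y))  # nonnegative
print(quantum_rel_entropy(X, X))  # approximately zero
```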
Generalized distances and mirror descent 13.11
Outline
• Bregman distance
• properties
• Bregman proximal mapping
• mirror descent
Three-point identity
for all x ∈ dom φ and y, z ∈ int(dom φ),
\[
d(x, z) = d(x, y) + d(y, z) + (\nabla\phi(y) - \nabla\phi(z))^T (x - y)
\]
• easily verified by substituting the definition of d
• if d is not symmetric, the order of the arguments of d in the identity matters
• generalizes the familiar identity for the squared Euclidean distance:
\[
\frac{1}{2}\|x - z\|_2^2 = \frac{1}{2}\|x - y\|_2^2 + \frac{1}{2}\|y - z\|_2^2 + (y - z)^T (x - y)
\]
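A quick numerical check of the identity (a sketch of ours, using the relative entropy distance from page 13.7):

```python
import numpy as np

# relative entropy distance and the gradient of its kernel phi(x) = sum_i x_i log x_i
d = lambda x, y: np.sum(x * np.log(x / y) - x + y)
grad_phi = lambda u: np.log(u) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
z = np.array([0.1, 0.6, 0.3])

lhs = d(x, z)
rhs = d(x, y) + d(y, z) + (grad_phi(y) - grad_phi(z)) @ (x - y)
print(np.isclose(lhs, rhs))  # True: the three-point identity holds
```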
Generalized distances and mirror descent 13.12
Strongly convex kernel
we will sometimes assume that φ is strongly convex (page 1.19):
\[
\phi(x) \ge \phi(y) + \nabla\phi(y)^T (x - y) + \frac{\mu}{2}\|x - y\|^2
\]
• µ > 0 is the strong convexity constant of φ for the norm ‖ · ‖
• for twice differentiable φ, this is equivalent to
\[
v^T \nabla^2\phi(x)\, v \ge \mu \|v\|^2 \quad \text{for all } x \in \operatorname{int}(\operatorname{dom}\phi) \text{ and all } v
\]
(see page 1.18)
• strong convexity of φ implies that
\[
d(x, y) = \phi(x) - \phi(y) - \nabla\phi(y)^T (x - y) \ge \frac{\mu}{2}\|x - y\|^2
\]
Generalized distances and mirror descent 13.13
Regularization with Bregman distance
for given y ∈ int(dom φ) and convex f, consider
\[
\text{minimize} \quad f(x) + d(x, y)
\]
• equivalently, minimize $f(x) + \phi(x) - \nabla\phi(y)^T x$
• the feasible set is dom f ∩ dom φ
Optimality condition: x̂ ∈ dom f ∩ int(dom φ) is optimal if and only if
\[
f(x) + d(x, y) \ge f(\hat{x}) + d(\hat{x}, y) + d(x, \hat{x}) \quad \text{for all } x \in \operatorname{dom} f \cap \operatorname{dom}\phi \tag{1}
\]
Equivalent optimality condition: x̂ ∈ dom f ∩ int(dom φ) is optimal if and only if
\[
\nabla\phi(y) - \nabla\phi(\hat{x}) \in \partial f(\hat{x}) \tag{2}
\]
Generalized distances and mirror descent 13.14
Proof: we derive optimality conditions for the problem
\[
\text{minimize} \quad g(x) + \phi(x) \tag{3}
\]
with g convex, and apply the results to $g(x) = f(x) - \nabla\phi(y)^T x$
• optimality condition: x̂ ∈ dom g ∩ int(dom φ) is optimal for (3) if and only if
\[
g(x) \ge g(\hat{x}) - \nabla\phi(\hat{x})^T (x - \hat{x}) \quad \text{for all } x \in \operatorname{dom} g \cap \operatorname{dom}\phi \tag{4}
\]
combined with the three-point identity, this gives the optimality condition (1)
• equivalent optimality condition: x̂ ∈ dom g ∩ int(dom φ) is optimal if and only if
\[
-\nabla\phi(\hat{x}) \in \partial g(\hat{x}) \tag{5}
\]
applied to $g(x) = f(x) - \nabla\phi(y)^T x$, this gives the optimality condition (2)
Generalized distances and mirror descent 13.15
Proof: [Diagram: a cycle of implications between optimality of x̂, condition (4), and condition (5): a shows (4) implies optimality, c shows optimality implies (5), b shows (5) implies (4).]
• implication a follows from convexity of φ: if (4) holds, then for all feasible x,
\[
g(x) + \phi(x) \ge g(\hat{x}) + \phi(x) - \nabla\phi(\hat{x})^T (x - \hat{x}) \ge g(\hat{x}) + \phi(\hat{x})
\]
• implication b: by the definition of subgradient, (5) can be written as
\[
g(x) \ge g(\hat{x}) - \nabla\phi(\hat{x})^T (x - \hat{x}) \quad \text{for all } x \in \operatorname{dom} g
\]
• we prove c by contradiction: suppose that for some x ∈ dom g,
\[
g(x) < g(\hat{x}) - \nabla\phi(\hat{x})^T (x - \hat{x})
\]
define v = x − x̂; for small positive t, by convexity of g and Taylor’s theorem,
\[
\begin{aligned}
g(\hat{x} + tv) + \phi(\hat{x} + tv)
&\le g(\hat{x}) + t\bigl(g(x) - g(\hat{x})\bigr) + \phi(\hat{x} + tv) \\
&= g(\hat{x}) + \phi(\hat{x}) + t\bigl(g(x) - g(\hat{x}) + \nabla\phi(\hat{x})^T v\bigr) + O(t^2) \\
&< g(\hat{x}) + \phi(\hat{x})
\end{aligned}
\]
so x̂ + tv has a lower objective value than x̂, contradicting optimality
Generalized distances and mirror descent 13.16
Outline
• Bregman distance
• properties
• Bregman proximal mapping
• mirror descent
Bregman proximal mapping
for convex f and Bregman kernel φ, define
\[
\operatorname{prox}^d_f(y, a)
= \operatorname*{argmin}_x \bigl( f(x) + a^T x + d(x, y) \bigr)
= \operatorname*{argmin}_x \bigl( f(x) + (a - \nabla\phi(y))^T x + \phi(x) \bigr)
\]
• the first argument y must be in int(dom φ)
• the second argument a can take any value
• we’ll use this only if for every y and a, a unique minimizer x ∈ int(dom φ) exists
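When no closed form is available, the minimizer can be computed numerically. The sketch below (our own construction, not from the slides; it assumes f and φ are differentiable and dom φ is the positive orthant) evaluates the definition directly with scipy:

```python
import numpy as np
from scipy.optimize import minimize

def bregman_prox(f, phi, grad_phi, y, a, x0):
    """Numerically evaluate prox^d_f(y, a) = argmin_x f(x) + a^T x + d(x, y)."""
    obj = lambda x: f(x) + a @ x + phi(x) - phi(y) - grad_phi(y) @ (x - y)
    # bounds keep the iterates inside dom phi = R^n_{++}
    res = minimize(obj, x0, bounds=[(1e-12, None)] * len(y))
    return res.x

# example: f(x) = (1/2)||x - c||_2^2 with the entropy kernel
phi = lambda x: np.sum(x * np.log(x))
grad_phi = lambda x: np.log(x) + 1.0
c = np.array([0.5, 1.0, 1.5])
f = lambda x: 0.5 * np.sum((x - c) ** 2)

y = np.ones(3)
a = np.zeros(3)
print(bregman_prox(f, phi, grad_phi, y, a, x0=y))
```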
Generalized distances and mirror descent 13.17
Example: quadratic kernel
\[
\phi(x) = \frac{1}{2}\|x\|_2^2, \qquad d(x, y) = \frac{1}{2}\|x - y\|_2^2
\]
the Bregman proximal mapping can be expressed in terms of the standard $\operatorname{prox}_f$:
\[
\operatorname{prox}^d_f(y, a)
= \operatorname*{argmin}_x \bigl( f(x) + a^T x + d(x, y) \bigr)
= \operatorname*{argmin}_x \Bigl( f(x) + a^T x + \frac{1}{2}\|x - y\|_2^2 \Bigr)
= \operatorname{prox}_f(y - a)
\]
closedness of f ensures existence and uniqueness (see page 6.2)
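As a quick illustration (our own sketch, not from the slides), take $f(x) = \lambda\|x\|_1$, whose standard prox is componentwise soft-thresholding; the Bregman prox with the quadratic kernel is then soft-thresholding applied to y − a:

```python
import numpy as np

def soft_threshold(v, lam):
    """prox of f(x) = lam * ||x||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

lam = 0.5
y = np.array([1.0, -0.2, 0.8])
a = np.array([0.3, 0.1, -0.4])

# quadratic kernel: prox^d_f(y, a) = prox_f(y - a)
print(soft_threshold(y - a, lam))
```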
Generalized distances and mirror descent 13.18
Example: relative entropy
\[
\phi(x) = \sum_{i=1}^n x_i \log x_i, \qquad
d(x, y) = \sum_{i=1}^n \bigl( x_i \log(x_i/y_i) - x_i + y_i \bigr)
\]
• we take f = δC, the indicator of the probability simplex C = {x ⪰ 0 | $\mathbf{1}^T x = 1$}
• the Bregman proximal mapping is
\[
\operatorname{prox}^d_f(y, a)
= \operatorname*{argmin}_{\mathbf{1}^T x = 1} \Bigl( a^T x + \sum_{i=1}^n x_i \log(x_i/y_i) \Bigr)
= \frac{1}{\sum_{i=1}^n y_i e^{-a_i}}
\begin{bmatrix} y_1 e^{-a_1} \\ \vdots \\ y_n e^{-a_n} \end{bmatrix}
\]
• for every y ≻ 0 and a, the minimizer in the definition exists, is unique, and is positive
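In code, the closed-form mapping is a one-liner up to normalization. A sketch (ours), applied to the data of the figure on the next page:

```python
import numpy as np

def entropy_prox(y, a):
    """Bregman prox of the simplex indicator with the entropy kernel:
    argmin of a^T x + sum_i x_i log(x_i / y_i) over the probability simplex."""
    w = y * np.exp(-(a - a.min()))  # the shift avoids overflow; normalization removes it
    return w / w.sum()

y = np.array([0.1, 0.3, 0.6])
a = np.array([-0.540, 0.585, -0.045])
x_hat = entropy_prox(y, a)
print(x_hat, x_hat.sum())  # a positive point summing to one
```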
Generalized distances and mirror descent 13.19
Example: relative entropy
[Figure: two panels over the probability simplex with vertices (1,0,0), (0,1,0), (0,0,1); left panel: contour lines of φ(x); right panel: contour lines of d(x, y) around y, with −a and the minimizer x̂ marked.]
the right-hand figure shows
\[
\hat{x} = \operatorname{prox}^d_f(y, a) = \operatorname*{argmin}_{x \in C} \bigl( a^T x + d(x, y) \bigr)
\]
for y = (0.1, 0.3, 0.6) and a = (−0.540, 0.585, −0.045)
Generalized distances and mirror descent 13.20
Optimality condition
apply the optimality conditions for the Bregman-regularized problem (page 13.14) to
\[
\operatorname{prox}^d_f(y, a) = \operatorname*{argmin}_x \bigl( f(x) + a^T x + d(x, y) \bigr)
\]
suppose x̂ ∈ dom f ∩ int(dom φ)
• first condition: $\hat{x} = \operatorname{prox}^d_f(y, a)$ if and only if
\[
f(x) + a^T x + d(x, y) \ge f(\hat{x}) + a^T \hat{x} + d(\hat{x}, y) + d(x, \hat{x})
\quad \text{for all } x \in \operatorname{dom} f \cap \operatorname{dom}\phi
\]
• second condition: $\hat{x} = \operatorname{prox}^d_f(y, a)$ if and only if
\[
\nabla\phi(y) - \nabla\phi(\hat{x}) - a \in \partial f(\hat{x})
\]
Generalized distances and mirror descent 13.21
Outline
• Bregman distance
• properties
• Bregman proximal mapping
• mirror descent
Mirror descent
\[
\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & x \in C \end{array}
\]
• f is a convex function, C is a convex subset of dom f
• we assume f is subdifferentiable on C
Algorithm: choose x0 ∈ C ∩ int(dom φ) and repeat
\[
x_{k+1} = \operatorname*{argmin}_{x \in C} \bigl( t_k g_k^T x + d(x, x_k) \bigr), \qquad k = 0, 1, \ldots
\]
where $g_k$ is any subgradient of f at $x_k$
the update can be written as $x_{k+1} = \operatorname{prox}^d_{\delta_C}(x_k, t_k g_k)$, where $\delta_C$ is the indicator of C
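In code, mirror descent is a short loop around this Bregman proximal mapping. A generic sketch (ours; prox_C stands for $\operatorname{prox}^d_{\delta_C}$ for the chosen kernel, e.g., the entropy prox on the simplex from page 13.19):

```python
def mirror_descent(subgrad, prox_C, x0, steps):
    """Generic mirror descent: x_{k+1} = prox^d_{delta_C}(x_k, t_k g_k).
    subgrad(x) returns a subgradient of f at x; steps is the sequence t_0, t_1, ..."""
    x = x0
    for t in steps:
        x = prox_C(x, t * subgrad(x))
    return x
```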
Generalized distances and mirror descent 13.22
Mirror descent with quadratic kernel
\[
x_{k+1} = \operatorname*{argmin}_{x \in C} \bigl( t_k g_k^T x + d(x, x_k) \bigr)
\]
for $d(x, y) = \frac{1}{2}\|x - y\|_2^2$, this is the projected subgradient method:
\[
x_{k+1}
= \operatorname*{argmin}_{x \in C} \Bigl( t_k g_k^T x + \frac{1}{2}\|x - x_k\|_2^2 \Bigr)
= \operatorname*{argmin}_{x \in C} \frac{1}{2}\|x - x_k + t_k g_k\|_2^2
= P_C(x_k - t_k g_k)
\]
where $P_C$ is Euclidean projection on C
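For instance (our own example, not from the slides), with box constraints C = {x : l ⪯ x ⪯ u} the Euclidean projection is a componentwise clip, and the quadratic-kernel mirror descent step becomes:

```python
import numpy as np

def projected_subgradient_step(x, g, t, l, u):
    """Mirror descent step with d(x, y) = (1/2)||x - y||_2^2 and a box C:
    x+ = P_C(x - t g), where P_C clips componentwise to [l, u]."""
    return np.clip(x - t * g, l, u)
```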
Generalized distances and mirror descent 13.23
Assumptions
• the problem on page 13.22 has optimal value $f^\star$ and an optimal solution $x^\star \in C \cap \operatorname{dom}\phi$
• f is Lipschitz continuous on C with respect to some norm ‖ · ‖:
\[
|f(x) - f(y)| \le G\|x - y\| \quad \text{for all } x, y \in C
\]
this is equivalent to $\|g\|_* \le G$ for all x ∈ C and g ∈ ∂f(x)
(the proof extends the proof for the Euclidean norm on page 3.4)
• φ is 1-strongly convex on C, with respect to the same norm ‖ · ‖:
\[
d(x, y) \ge \frac{1}{2}\|x - y\|^2 \quad \text{for all } x \in \operatorname{dom}\phi \cap C \text{ and } y \in \operatorname{int}(\operatorname{dom}\phi) \cap C
\]
Generalized distances and mirror descent 13.24
Analysis
• apply the optimality condition on page 13.21 with $x = x^\star$, $y = x_i$, $\hat{x} = x_{i+1}$:
\[
\begin{aligned}
d(x^\star, x_{i+1})
&\le d(x^\star, x_i) - d(x_{i+1}, x_i) + t_i g_i^T (x_i - x_{i+1}) + t_i g_i^T (x^\star - x_i) \\
&\le d(x^\star, x_i) - d(x_{i+1}, x_i) + \|t_i g_i\|_* \|x_{i+1} - x_i\| + t_i g_i^T (x^\star - x_i) \\
&\le d(x^\star, x_i) - d(x_{i+1}, x_i) + \frac{1}{2}\|x_{i+1} - x_i\|^2 + \frac{1}{2}\|t_i g_i\|_*^2 + t_i g_i^T (x^\star - x_i)
\end{aligned}
\]
the last step is the arithmetic-geometric mean inequality
• apply strong convexity of the kernel and the definition of subgradient:
\[
d(x^\star, x_{i+1}) \le d(x^\star, x_i) + \frac{1}{2}\|t_i g_i\|_*^2 + t_i \bigl( f^\star - f(x_i) \bigr)
\]
• define $f_{\text{best},k} = \min_{i=0,\ldots,k} f(x_i)$ and combine the inequalities for $i = 0, \ldots, k$:
\[
\Bigl( \sum_{i=0}^k t_i \Bigr) \bigl( f_{\text{best},k} - f^\star \bigr)
\le d(x^\star, x_0) - d(x^\star, x_{k+1}) + \frac{1}{2} \sum_{i=0}^k \|t_i g_i\|_*^2
\le d(x^\star, x_0) + \frac{1}{2} \sum_{i=0}^k \|t_i g_i\|_*^2
\]
Generalized distances and mirror descent 13.25
Step size selection
\[
f_{\text{best},k} - f^\star
\le \frac{d(x^\star, x_0)}{\sum_{i=0}^k t_i} + \frac{\sum_{i=0}^k \|t_i g_i\|_*^2}{2 \sum_{i=0}^k t_i}
\le \frac{d(x^\star, x_0)}{\sum_{i=0}^k t_i} + \frac{G^2 \sum_{i=0}^k t_i^2}{2 \sum_{i=0}^k t_i}
\]
• diminishing step size: $f_{\text{best},k} \to f^\star$ if
\[
t_i \to 0, \qquad \sum_{i=0}^\infty t_i = \infty
\]
(see page 3.7)
• optimal step size for a fixed number of iterations k, if we know that $d(x^\star, x_0) \le D$:
\[
t_i = \frac{\sqrt{2D}}{\|g_i\|_* \sqrt{k + 1}}, \qquad
f_{\text{best},k} - f^\star \le \frac{G\sqrt{2D}}{\sqrt{k + 1}}
\]
(see page 3.10)
Generalized distances and mirror descent 13.26
Entropic mirror descent
apply mirror descent with the relative entropy distance and
\[
C = \{ x \in \mathbf{R}^n \mid x \succeq 0, \; \mathbf{1}^T x = 1 \}
\]
Algorithm: choose $x_0 \succ 0$ with $\mathbf{1}^T x_0 = 1$, and repeat
\[
x_{k+1} = \frac{1}{s^T x_k} (s \circ x_k)
\qquad \text{where } s = \bigl( e^{-t_k g_{k,1}}, \ldots, e^{-t_k g_{k,n}} \bigr)
\]
• $g_k$ is any subgradient of f at $x_k$
• ◦ denotes the component-wise vector product
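The update is a direct transcription into code. A sketch (ours), with the usual shift of the exponent for numerical stability, which the normalization cancels:

```python
import numpy as np

def entropic_md_step(x, g, t):
    """Entropic mirror descent update: x+ is proportional to x o exp(-t g)."""
    e = -t * g
    w = x * np.exp(e - e.max())  # shifting by the max avoids overflow
    return w / w.sum()
```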
Generalized distances and mirror descent 13.27
Convergence
in the analysis on page 13.26,
• if we choose $x_0 = (1/n)\mathbf{1}$, then we can take $D = \log n$:
\[
d(x^\star, x_0) = \log n + \sum_{i=1}^n x^\star_i \log x^\star_i \le \log n
\]
• $\phi(x) = \sum_i x_i \log x_i$ is 1-strongly convex for ‖ · ‖1 on C: by the Cauchy–Schwarz inequality,
\[
v^T \nabla^2\phi(x)\, v = \sum_{i=1}^n \frac{v_i^2}{x_i} \ge \|v\|_1^2
\qquad \text{if } x \succ 0, \; \mathbf{1}^T x = 1
\]
(Cauchy–Schwarz gives $\|v\|_1 = \sum_i (|v_i|/\sqrt{x_i})\sqrt{x_i} \le (\sum_i v_i^2/x_i)^{1/2} (\sum_i x_i)^{1/2}$)
• with the optimal step size for k iterations,
\[
f_{\text{best},k} - f^\star \le \frac{G\sqrt{2\log n}}{\sqrt{k + 1}}
\]
where G is the Lipschitz constant of f for the ‖ · ‖1-norm
Generalized distances and mirror descent 13.28
Example
\[
\begin{array}{ll}
\text{minimize} & \|Ax - b\|_1 \\
\text{subject to} & x \succeq 0, \; \mathbf{1}^T x = 1
\end{array}
\]
• subgradient: $g = A^T \operatorname{sign}(Ax - b)$, so $\|g\|_\infty \le G = \max_j \sum_i |A_{ij}|$
• example with randomly generated $A \in \mathbf{R}^{1000 \times 500}$, $b \in \mathbf{R}^{1000}$
[Figure: relative accuracy $(f_{\text{best},k} - f^\star)/f^\star$ versus iteration k, on a log scale from $10^{-4}$ to $10^{-1}$, for step sizes $t_k = 0.01/\sqrt{k+1}$ and $t_k = 0.1/(k+1)$.]
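A runnable sketch of this experiment (with our own randomly generated data, so the numbers will not match the figure exactly), using the entropic update and the step size $t_k = 0.01/\sqrt{k+1}$ from the figure; $f^\star$ is unknown here, so we simply report $f_{\text{best},k}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 500
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: np.linalg.norm(A @ x - b, 1)
subgrad = lambda x: A.T @ np.sign(A @ x - b)

x = np.ones(n) / n             # x0 = (1/n) 1, so that d(x*, x0) <= log n
f_best = f(x)
for k in range(1000):
    t = 0.01 / np.sqrt(k + 1)  # diminishing step size from the figure
    w = x * np.exp(-t * subgrad(x))
    x = w / w.sum()
    f_best = min(f_best, f(x))
print(f_best)
```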
Generalized distances and mirror descent 13.29