
L. Vandenberghe ECE236C (Spring 2019)

13. Generalized distances and mirror descent

• Bregman distance

• properties

• Bregman proximal mapping

• mirror descent

13.1


Motivation: proximal gradient method

proximal gradient step for minimizing f (x) = g(x) + h(x) (page 4.4):

xk+1 = prox_{tk h}(xk − tk ∇g(xk))

     = argmin_u ( h(u) + g(xk) + ∇g(xk)T(u − xk) + (1/(2tk)) ‖u − xk‖₂² )

Interpretation: the quadratic term represents

• a penalty that forces xk+1 to be close to xk , where linearization of g is accurate

• an approximation of the error term in the linearization of g at xk

Generalized distances and mirror descent 13.2


Generalized proximal gradient method

replace (1/2)‖u − x‖₂² with a generalized distance d(u, x):

xk+1 = argmin_u ( h(u) + g(xk) + ∇g(xk)T(u − xk) + (1/tk) d(u, xk) )

Potential benefits

1. “pre-conditioning”: use a more accurate model of g(u) around x, ideally

(1/tk) d(u, xk) ≈ g(u) − g(xk) − ∇g(xk)T(u − xk)

2. make the generalized proximal mapping (minimizer u) easier to compute

goal of 1 is to reduce number of iterations; goal of 2 is to reduce cost per iteration

Generalized distances and mirror descent 13.3


Bregman distance

d(x, y) = φ(x) − φ(y) − ∇φ(y)T(x − y)

• φ is convex and continuously differentiable on int(dom φ)

• domain of φ may include its boundary or a subset of its boundary

• we define the domain of d as dom d = dom φ × int(dom φ)

• φ is called the kernel function or distance-generating function

[figure: graph of φ with points (y, φ(y)) and (x, φ(x)); d(x, y) is the vertical gap at x between φ and the tangent line through (y, φ(y))]

other properties of φ will be required later, but will be mentioned explicitly (e.g., strict convexity)

Generalized distances and mirror descent 13.4


Immediate properties

d(x, y) = φ(x) − φ(y) − ∇φ(y)T(x − y)

• d(x, y) is convex in x for fixed y

• d(x, y) ≥ 0, with equality if x = y

• if φ is strictly convex, then d(x, y) = 0 only if x = y

• d(x, y) ≠ d(y, x) in general

to emphasize lack of symmetry, d is also called a directed distance or divergence

Generalized distances and mirror descent 13.5


Examples

Squared Euclidean distance (with dom φ = Rn)

φ(x) = (1/2) xT x, ∇φ(x) = x, d(x, y) = (1/2) ‖x − y‖₂²

General quadratic kernel (with dom φ = Rn)

φ(x) = (1/2) xT Ax, ∇φ(x) = Ax, d(x, y) = (1/2) (x − y)T A(x − y)

• A is symmetric positive definite

• in some applications, A is positive semidefinite, but not positive definite

Generalized distances and mirror descent 13.6
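As a numerical sanity check of the quadratic-kernel formula, the sketch below (my construction; the matrix A and test points are arbitrary) evaluates d(x, y) from the definition and compares it with (1/2)(x − y)T A(x − y):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # d(x, y) = phi(x) - phi(y) - grad_phi(y)^T (x - y)
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4.0 * np.eye(4)         # symmetric positive definite kernel matrix

phi = lambda x: 0.5 * x @ A @ x       # general quadratic kernel
grad_phi = lambda x: A @ x

x = rng.standard_normal(4)
y = rng.standard_normal(4)
d = bregman(phi, grad_phi, x, y)
closed = 0.5 * (x - y) @ A @ (x - y)  # closed form from the slide
```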


Examples

Relative entropy (with dom φ = Rn+)

φ(x) = ∑_{i=1}^n xi log xi, ∇φ(x) = (log x1 + 1, …, log xn + 1)

d(x, y) = ∑_{i=1}^n ( xi log(xi/yi) − xi + yi )

Logistic loss divergence (with dom φ = [0,1]n)

φ(x) = ∑_{i=1}^n ( xi log xi + (1 − xi) log(1 − xi) ), ∇φ(x) = (log(x1/(1 − x1)), …, log(xn/(1 − xn)))

d(x, y) = ∑_{i=1}^n ( xi log(xi/yi) + (1 − xi) log((1 − xi)/(1 − yi)) )

Generalized distances and mirror descent 13.7
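The entropy-kernel formulas can be checked the same way; this sketch (my construction; the positive test points are arbitrary) compares the Bregman definition with the closed-form relative entropy:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # d(x, y) = phi(x) - phi(y) - grad_phi(y)^T (x - y)
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

phi = lambda x: np.sum(x * np.log(x))        # negative entropy kernel
grad_phi = lambda x: np.log(x) + 1.0

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 2.0, 5)
y = rng.uniform(0.1, 2.0, 5)

d = bregman(phi, grad_phi, x, y)
closed = np.sum(x * np.log(x / y) - x + y)   # relative entropy formula
```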


Examples

Hellinger divergence (with dom φ = [−1,1]n)

φ(x) = −∑_{i=1}^n √(1 − xi²), ∇φ(x) = (x1/√(1 − x1²), …, xn/√(1 − xn²))

d(x, y) = ∑_{i=1}^n ( −√(1 − xi²) + (1 − xi yi)/√(1 − yi²) )

Generalized distances and mirror descent 13.8


Examples

Logarithmic barrier (with dom φ = Rn++)

φ(x) = −∑_{i=1}^n log xi, ∇φ(x) = (−1/x1, …, −1/xn), d(x, y) = ∑_{i=1}^n ( xi/yi − log(xi/yi) − 1 )

d(x, y) is sometimes called Itakura–Saito divergence

Inverse barrier (with dom φ = Rn++)

φ(x) = ∑_{i=1}^n 1/xi, ∇φ(x) = (−1/x1², …, −1/xn²), d(x, y) = ∑_{i=1}^n (1/yi) ( √(xi/yi) − √(yi/xi) )²

Generalized distances and mirror descent 13.9


Bregman distances for symmetric matrices

d(X,Y ) = φ(X) − φ(Y ) − tr(∇φ(Y )(X − Y ))

• kernel φ is a convex function on Sn, differentiable on int (dom φ)

• domain of d is dom d = dom φ × int (dom φ)

Relative entropy (with dom φ = Sn++)

φ(X) = −log det X, ∇φ(X) = −X⁻¹

d(X, Y) = tr(XY⁻¹) − log det(XY⁻¹) − n

• d(X,Y ) is relative entropy between normal distributions N(0,X) and N(0,Y )

• also known as Kullback–Leibler divergence

Generalized distances and mirror descent 13.10
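A small numerical sketch of the log-det divergence (my construction; the matrices are random positive definite), checking nonnegativity and consistency with the Bregman definition:

```python
import numpy as np

def logdet_div(X, Y):
    # d(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n
    n = X.shape[0]
    Z = np.linalg.solve(Y, X)        # Y^{-1} X has the same trace and det as X Y^{-1}
    return np.trace(Z) - np.log(np.linalg.det(Z)) - n

rng = np.random.default_rng(2)

def random_spd(n):
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)   # well-conditioned positive definite matrix

X, Y = random_spd(4), random_spd(4)
dXY = logdet_div(X, Y)               # nonnegative, zero only when X = Y

# same value from d(X, Y) = phi(X) - phi(Y) - tr(grad_phi(Y) (X - Y))
breg = (-np.log(np.linalg.det(X)) + np.log(np.linalg.det(Y))
        + np.trace(np.linalg.solve(Y, X - Y)))
```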


Bregman distances for symmetric matrices

Matrix entropy (with dom φ = Sn++):

φ(X) = tr(X log X), ∇φ(X) = I + log X

d(X,Y ) = tr(X log X − X logY − X + Y )

• matrix logarithm log X is defined as

log X = ∑_{i=1}^n (log λi) qi qiT

if X has eigendecomposition X = ∑_{i=1}^n λi qi qiT

• d(X,Y ) is also known as quantum relative entropy

Generalized distances and mirror descent 13.11


Outline

• Bregman distance

• properties

• Bregman proximal mapping

• mirror descent


Three-point identity

for all x ∈ dom φ and y, z ∈ int(dom φ),

d(x, z) = d(x, y) + d(y, z) + (∇φ(y) − ∇φ(z))T(x − y)

• easily verified by substituting the definition of d

• if d is not symmetric, order of the arguments of d in the identity matters

• generalizes the familiar identity for squared Euclidean distance:

(1/2)‖x − z‖₂² = (1/2)‖x − y‖₂² + (1/2)‖y − z‖₂² + (y − z)T(x − y)

Generalized distances and mirror descent 13.12
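The three-point identity is easy to verify numerically; this sketch (my construction; entropy kernel with arbitrary positive points) checks it to machine precision:

```python
import numpy as np

grad_phi = lambda x: np.log(x) + 1.0           # entropy-kernel gradient

def d(x, y):
    # relative entropy: sum_i (x_i log(x_i/y_i) - x_i + y_i)
    return np.sum(x * np.log(x / y) - x + y)

rng = np.random.default_rng(3)
x, y, z = (rng.uniform(0.1, 2.0, 6) for _ in range(3))

lhs = d(x, z)
rhs = d(x, y) + d(y, z) + (grad_phi(y) - grad_phi(z)) @ (x - y)
```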


Strongly convex kernel

we will sometimes assume that φ is strongly convex (page 1.19):

φ(x) ≥ φ(y) + ∇φ(y)T(x − y) + (µ/2) ‖x − y‖²

• µ > 0 is strong convexity constant of φ for the norm ‖ · ‖

• for twice differentiable φ, this is equivalent to

vT∇²φ(x)v ≥ µ‖v‖² for all x ∈ int(dom φ) and v

(see page 1.18)

• strong convexity of φ implies that

d(x, y) = φ(x) − φ(y) − ∇φ(y)T(x − y) ≥ (µ/2) ‖x − y‖²

Generalized distances and mirror descent 13.13


Regularization with Bregman distance

for given y ∈ int(dom φ) and convex f , consider

minimize f (x) + d(x, y)

• equivalently, minimize f (x) + φ(x) − ∇φ(y)T x

• feasible set is dom f ∩ dom φ

Optimality condition: x̂ ∈ dom f ∩ int(dom φ) is optimal if and only if

f (x) + d(x, y) ≥ f (x̂) + d(x̂, y) + d(x, x̂) for all x ∈ dom f ∩ dom φ (1)

Equivalent optimality condition: x̂ ∈ dom f ∩ int(dom φ) is optimal if and only if

∇φ(y) − ∇φ(x̂) ∈ ∂ f (x̂) (2)

Generalized distances and mirror descent 13.14


Proof: we derive optimality conditions for the problem

minimize g(x) + φ(x) (3)

with g convex, and apply the results to g(x) = f (x) − ∇φ(y)T x

• optimality condition: x̂ ∈ dom g ∩ int (dom φ) is optimal for (3) if and only if

g(x) ≥ g(x̂) − ∇φ(x̂)T(x − x̂) for all x ∈ dom g ∩ dom φ (4)

combined with the 3-point identity this gives the optimality condition (1)

• equivalent optimality condition: x̂ ∈ dom g ∩ int (dom φ) is optimal if and only if

− ∇φ(x̂) ∈ ∂g(x̂) (5)

applied to g(x) = f (x) − ∇φ(y)T x this gives the optimality condition (2)

Generalized distances and mirror descent 13.15


Proof: optimality of x̂, condition (4), and condition (5) are equivalent; we show the cycle of implications

(4) ⇒ optimality of x̂ (implication a), (5) ⇒ (4) (implication b), optimality of x̂ ⇒ (5) (implication c)

• implication a follows from convexity of φ: if (4) holds, then for all feasible x,

g(x) + φ(x) ≥ g(x̂) + φ(x) − ∇φ(x̂)T(x − x̂) ≥ g(x̂) + φ(x̂)

• implication b: by definition of subgradient, (5) can be written as

g(x) ≥ g(x̂) − ∇φ(x̂)T(x − x̂) for all x ∈ dom g

• we prove c by contradiction: suppose that for some x ∈ dom g

g(x) < g(x̂) − ∇φ(x̂)T(x − x̂)

define v = x − x̂; for small positive t, by convexity of g and Taylor’s theorem,

g(x̂ + tv) + φ(x̂ + tv) ≤ g(x̂) + t(g(x) − g(x̂)) + φ(x̂ + tv)

= g(x̂) + φ(x̂) + t(g(x) − g(x̂) + ∇φ(x̂)Tv) +O(t2)

< g(x̂) + φ(x̂)

Generalized distances and mirror descent 13.16


Outline

• Bregman distance

• properties

• Bregman proximal mapping

• mirror descent


Bregman proximal mapping

for convex f and Bregman kernel φ, define

prox^d_f(y, a) = argmin_x ( f (x) + aT x + d(x, y) )

             = argmin_x ( f (x) + (a − ∇φ(y))T x + φ(x) )

• first argument y must be in int(dom φ)

• second argument a can take any value

• we’ll use this only if for every y and a, a unique minimizer x ∈ int(dom φ) exists

Generalized distances and mirror descent 13.17


Example: quadratic kernel

φ(x) = (1/2)‖x‖₂², d(x, y) = (1/2)‖x − y‖₂²

Bregman proximal mapping can be expressed in terms of standard prox f :

prox^d_f(y, a) = argmin_x ( f (x) + aT x + d(x, y) )

             = argmin_x ( f (x) + aT x + (1/2)‖x − y‖₂² )

             = prox_f(y − a)

closedness of f ensures existence and uniqueness (see page 6.2)

Generalized distances and mirror descent 13.18


Example: relative entropy

φ(x) = ∑_{i=1}^n xi log xi, d(x, y) = ∑_{i=1}^n ( xi log(xi/yi) − xi + yi )

• we take f = δC, the indicator of the probability simplex C = {x ⪰ 0 | 1T x = 1}

• Bregman proximal mapping is

prox^d_f(y, a) = argmin_{1T x=1} ( aT x + ∑_{i=1}^n xi log(xi/yi) )

             = (1 / ∑_{i=1}^n yi e^{−ai}) (y1 e^{−a1}, …, yn e^{−an})

• for every y ≻ 0 and a, the minimizer in the definition exists, is unique, and is positive

Generalized distances and mirror descent 13.19
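The closed-form simplex prox above can be sketched and sanity-checked as follows (my construction; the check compares against random feasible points rather than proving optimality):

```python
import numpy as np

def prox_entropy_simplex(y, a):
    # argmin over {x >= 0, 1^T x = 1} of a^T x + sum_i x_i log(x_i / y_i)
    s = y * np.exp(-a)
    return s / s.sum()

rng = np.random.default_rng(4)
y = rng.uniform(0.1, 1.0, 4); y /= y.sum()
a = rng.standard_normal(4)

xhat = prox_entropy_simplex(y, a)
obj = lambda x: a @ x + np.sum(x * np.log(x / y))

worst = 0.0                              # xhat should beat every random feasible x
for _ in range(200):
    x = rng.uniform(1e-3, 1.0, 4); x /= x.sum()
    worst = max(worst, obj(xhat) - obj(x))
```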


Example: relative entropy

[figures: two views of the probability simplex with vertices (1,0,0), (0,1,0), (0,0,1); left, contour lines of φ(x); right, contour lines of d(x, y), with y, x̂, and the direction −a marked]

right-hand figure shows

x̂ = prox^d_f(y, a) = argmin ( aT x + d(x, y) )

for y = (0.1, 0.3, 0.6) and a = (−0.540, 0.585, −0.045)

Generalized distances and mirror descent 13.20


Optimality condition

apply the optimality conditions for Bregman-regularized problem (page 13.14) to

prox^d_f(y, a) = argmin_x ( f (x) + aT x + d(x, y) )

suppose x̂ ∈ dom f ∩ int(dom φ)

• first condition: x̂ = prox^d_f(y, a) if and only if

f (x) + aT x + d(x, y) ≥ f (x̂) + aT x̂ + d(x̂, y) + d(x, x̂)

for all x ∈ dom f ∩ dom φ

• second condition: x̂ = prox^d_f(y, a) if and only if

∇φ(y) − ∇φ(x̂) − a ∈ ∂ f (x̂)

Generalized distances and mirror descent 13.21


Outline

• Bregman distance

• properties

• Bregman proximal mapping

• mirror descent


Mirror descent

minimize f (x)
subject to x ∈ C

• f is a convex function, C is a convex subset of dom f

• we assume f is subdifferentiable on C

Algorithm: choose x0 ∈ C ∩ int(dom φ) and repeat

xk+1 = argmin_{x∈C} ( tk gkT x + d(x, xk) ), k = 0, 1, . . .

gk is any subgradient of f at xk

update can be written as xk+1 = prox^d_{δC}(xk, tk gk) where δC is the indicator of C

Generalized distances and mirror descent 13.22


Mirror descent with quadratic kernel

xk+1 = argmin_{x∈C} ( tk gkT x + d(x, xk) )

for d(x, y) = (1/2)‖x − y‖₂², this is the projected subgradient method:

xk+1 = argmin_{x∈C} ( tk gkT x + (1/2)‖x − xk‖₂² )

     = argmin_{x∈C} (1/2)‖x − xk + tk gk‖₂²

     = PC(xk − tk gk)

where PC is Euclidean projection on C

Generalized distances and mirror descent 13.23


Assumptions

• problem on page 13.22 has optimal value f⋆, optimal solution x⋆ ∈ C ∩ dom φ

• f is Lipschitz continuous on C with respect to some norm ‖ · ‖

| f (x) − f (y)| ≤ G‖x − y‖ for all x, y ∈ C

this is equivalent to ‖g‖∗ ≤ G for all x ∈ C and g ∈ ∂ f (x)

(proof extends proof for Euclidean norm on page 3.4)

• φ is 1-strongly convex on C, with respect to the same norm ‖ · ‖:

d(x, y) ≥ (1/2)‖x − y‖² for all x ∈ dom φ ∩ C and y ∈ int(dom φ) ∩ C

Generalized distances and mirror descent 13.24


Analysis

• apply the optimality condition on page 13.21 with x = x⋆, y = xi, x̂ = xi+1:

d(x⋆, xi+1) ≤ d(x⋆, xi) − d(xi+1, xi) + ti giT(xi − xi+1) + ti giT(x⋆ − xi)

           ≤ d(x⋆, xi) − d(xi+1, xi) + ‖ti gi‖∗ ‖xi+1 − xi‖ + ti giT(x⋆ − xi)

           ≤ d(x⋆, xi) − d(xi+1, xi) + (1/2)‖xi+1 − xi‖² + (1/2)‖ti gi‖∗² + ti giT(x⋆ − xi)

last step is the arithmetic-geometric mean inequality

• apply strong convexity of the kernel and the definition of subgradient:

d(x⋆, xi+1) ≤ d(x⋆, xi) + (1/2)‖ti gi‖∗² + ti ( f⋆ − f (xi))

• define fbest,k = min_{i=0,…,k} f (xi) and combine the inequalities for i = 0, . . . , k:

( ∑_{i=0}^k ti ) ( fbest,k − f⋆) ≤ d(x⋆, x0) − d(x⋆, xk+1) + (1/2) ∑_{i=0}^k ‖ti gi‖∗²

                                ≤ d(x⋆, x0) + (1/2) ∑_{i=0}^k ‖ti gi‖∗²

Generalized distances and mirror descent 13.25


Step size selection

fbest,k − f⋆ ≤ d(x⋆, x0) / ( ∑_{i=0}^k ti ) + ( ∑_{i=0}^k ‖ti gi‖∗² ) / ( 2 ∑_{i=0}^k ti )

            ≤ d(x⋆, x0) / ( ∑_{i=0}^k ti ) + ( G² ∑_{i=0}^k ti² ) / ( 2 ∑_{i=0}^k ti )

• diminishing step size: fbest,k → f⋆ if

ti → 0, ∑_{i=0}^∞ ti = ∞

(see page 3.7)

• optimal step size for a fixed number of iterations k, if we know that d(x⋆, x0) ≤ D:

ti = √(2D) / ( ‖gi‖∗ √(k + 1) ), fbest,k − f⋆ ≤ G √(2D) / √(k + 1)

(see page 3.10)

Generalized distances and mirror descent 13.26


Entropic mirror descent

apply mirror descent with relative entropy distance and

C = {x ∈ Rn | x ⪰ 0, 1T x = 1}

Algorithm: choose x0 ≻ 0 with 1T x0 = 1, and repeat

xk+1 = (1 / sT xk) (s ◦ xk) where s = ( e^{−tk gk,1}, . . . , e^{−tk gk,n} )

• gk is any subgradient of f at xk

• ◦ denotes component-wise vector product

Generalized distances and mirror descent 13.27
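The algorithm above can be sketched in a few lines (my construction; the linear objective f(x) = cT x is an illustrative choice whose optimal value over the simplex is min_i ci, which makes convergence easy to check):

```python
import numpy as np

def entropic_mirror_descent(f, subgrad, n, step, iters):
    # mirror descent on the probability simplex with relative entropy distance
    x = np.full(n, 1.0 / n)                  # x0 = (1/n) 1
    xbest, fbest = x, f(x)
    for k in range(iters):
        g = subgrad(x)
        x = x * np.exp(-step(k) * g)         # multiplicative (exponentiated) update
        x = x / x.sum()                      # renormalize: x stays in the simplex
        if f(x) < fbest:
            xbest, fbest = x, f(x)
    return xbest, fbest

c = np.array([0.9, 0.3, 0.7, 0.5])           # optimal value over simplex is c.min()
xb, fb = entropic_mirror_descent(lambda x: c @ x, lambda x: c,
                                 n=4, step=lambda k: 0.5 / np.sqrt(k + 1),
                                 iters=500)
```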


Convergence

in the analysis on page 13.26

• if we choose x0 = (1/n)1, then we can take D = log n:

d(x⋆, x0) = log n + ∑_{i=1}^n x⋆i log x⋆i ≤ log n

• φ(x) = ∑_{i=1}^n xi log xi is 1-strongly convex for ‖ · ‖1 on C: by Cauchy–Schwarz,

vT∇²φ(x)v = ∑_{i=1}^n vi²/xi ≥ ‖v‖1² if x ≻ 0, 1T x = 1

• with the optimal step size for k iterations,

fbest,k − f⋆ ≤ G √(2 log n) / √(k + 1)

where G is the Lipschitz constant of f for the ‖ · ‖1-norm

Generalized distances and mirror descent 13.28
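The 1-strong convexity for ‖ · ‖1 implies d(x, y) ≥ (1/2)‖x − y‖1² on the simplex, which is Pinsker's inequality; a random numerical check (my construction):

```python
import numpy as np

def kl(x, y):
    # on the simplex, d(x, y) reduces to sum_i x_i log(x_i / y_i)
    return np.sum(x * np.log(x / y))

rng = np.random.default_rng(5)
min_gap = np.inf
for _ in range(1000):
    x = rng.uniform(1e-3, 1.0, 5); x /= x.sum()
    y = rng.uniform(1e-3, 1.0, 5); y /= y.sum()
    gap = kl(x, y) - 0.5 * np.linalg.norm(x - y, 1) ** 2
    min_gap = min(min_gap, gap)          # should never be meaningfully negative
```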


Example

minimize ‖Ax − b‖1
subject to x ⪰ 0, 1T x = 1

• subgradient g = AT sign(Ax − b), so ‖g‖∞ ≤ G = max_j ∑_i |Aij|

• example with randomly generated A ∈ R1000×500, b ∈ R1000

[figure: ( fbest,k − f⋆)/ f⋆ versus k (log scale, 10⁻⁴ to 10⁻¹, k = 0 to 1000) for step sizes tk = 0.01/√(k + 1) and tk = 0.1/(k + 1)]

Generalized distances and mirror descent 13.29
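The subgradient and the bound on G can be checked numerically; this sketch (my construction, with smaller random dimensions than the slide's example) verifies ‖g‖∞ ≤ max_j ∑_i |Aij| at random feasible points:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((100, 50))
b = rng.standard_normal(100)

# Lipschitz bound for the l1-norm: G = max_j sum_i |A_ij|
G = np.max(np.sum(np.abs(A), axis=0))

max_norm = 0.0
for _ in range(100):
    x = rng.uniform(0.0, 1.0, 50); x /= x.sum()
    g = A.T @ np.sign(A @ x - b)         # subgradient of ||Ax - b||_1 at x
    max_norm = max(max_norm, np.linalg.norm(g, np.inf))
```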


References

Generalized distances

• Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithms, and Applications (1997).

• M. Basseville, Distance measures for statistical data processing—An annotated bibliography, Signal Processing (2013).

Mirror descent

• A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization (1983).

• A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters (2003).

• A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, I: General-purpose methods. In S. Sra, S. Nowozin, S. J. Wright, editors, Optimization for Machine Learning (2012).

• A. Beck, First-Order Methods in Optimization (2017), chapter 9.

Generalized distances and mirror descent 13.30