Subgradient Methods
• subgradient method and stepsize rules
• convergence results and proof
• projected subgradient method
• projected subgradient method for dual
• optimal network flow
Prof. S. Boyd, EE392o, Stanford University
Subgradient method
subgradient method is a simple algorithm to minimize a nondifferentiable convex function f
x(k+1) = x(k) − αkg(k)
• x(k) is the kth iterate
• g(k) is any subgradient of f at x(k)
• αk > 0 is the kth step size
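As a minimal sketch, the update above can be written in a few lines of Python; the objective f, the subgradient oracle, and the step-size schedule here are inputs supplied by the caller, not part of the slides:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, iters=500):
    # Minimize a nondifferentiable convex f via x <- x - alpha_k * g.
    # subgrad(x) may return any subgradient of f at x; step maps k -> alpha_k.
    # Since f(x^(k)) need not decrease, we track the best point seen so far.
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(iters):
        g = subgrad(x)             # any g^(k) in the subdifferential at x^(k)
        x = x - step(k) * g        # x^(k+1) = x^(k) - alpha_k g^(k)
        if f(x) < f_best:
            f_best, x_best = f(x), x.copy()
    return x_best, f_best
```

For example, minimizing f(x) = |x| with the diminishing steps αk = 0.5/(k+1) drives the best value toward 0.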
Step size rules
step sizes are fixed ahead of time
• constant step size: αk = h (constant)
• constant step length: αk = h/‖g(k)‖2 (so ‖x(k+1) − x(k)‖2 = h)
• square summable but not summable: step sizes satisfy

∑_{k=1}^∞ αk² < ∞,  ∑_{k=1}^∞ αk = ∞

• nonsummable diminishing: step sizes satisfy

lim_{k→∞} αk = 0,  ∑_{k=1}^∞ αk = ∞
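These four rules amount to one-line schedules; a sketch (the constants h, a, b below are illustrative choices, not values prescribed by the slides):

```python
import math

# Hypothetical constants h, a, b chosen for illustration.
def constant_step_size(k, h=0.01):
    return h                                  # alpha_k = h

def constant_step_length(k, g_norm, h=0.01):
    return h / g_norm                         # so ||x^(k+1) - x^(k)||_2 = h

def square_summable(k, a=0.1, b=0.0):
    return a / (b + k + 1)                    # sum alpha_k^2 < inf, sum alpha_k = inf

def nonsummable_diminishing(k, a=0.1):
    return a / math.sqrt(k + 1)               # alpha_k -> 0, sum alpha_k = inf
```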
Convergence results
• we assume ‖g‖₂ ≤ G for all g ∈ ∂f (equivalent to a Lipschitz condition on f), and p⋆ = inf_x f(x) > −∞
• define f_best(k) = min_{i=0,…,k} f(x(i)), the best value found in k iterations, and f̄ = lim_{k→∞} f_best(k)
• constant step size: f̄ − p⋆ ≤ G²h/2, i.e., converges to G²h/2-suboptimal
(converges to p⋆ if f differentiable, h small enough)
• constant step length: f̄ − p⋆ ≤ Gh/2, i.e., converges to Gh/2-suboptimal
• diminishing step size rule: f̄ = p⋆, i.e., converges
Convergence proof
key quantity: Euclidean distance to the optimal set, not the function value
let x⋆ be any minimizer of f

‖x(k+1) − x⋆‖₂² = ‖x(k) − αk g(k) − x⋆‖₂²
 = ‖x(k) − x⋆‖₂² − 2αk g(k)ᵀ(x(k) − x⋆) + αk²‖g(k)‖₂²
 ≤ ‖x(k) − x⋆‖₂² − 2αk(f(x(k)) − p⋆) + αk²‖g(k)‖₂²

using p⋆ = f(x⋆) ≥ f(x(k)) + g(k)ᵀ(x⋆ − x(k))
applying recursively, and using ‖g(i)‖₂ ≤ G, we get

‖x(k+1) − x⋆‖₂² ≤ ‖x(1) − x⋆‖₂² − 2 ∑_{i=1}^k αi (f(x(i)) − p⋆) + G² ∑_{i=1}^k αi²

now we use

∑_{i=1}^k αi (f(x(i)) − p⋆) ≥ (f_best(k) − p⋆) (∑_{i=1}^k αi)

to get

f_best(k) − p⋆ ≤ (‖x(1) − x⋆‖₂² + G² ∑_{i=1}^k αi²) / (2 ∑_{i=1}^k αi).
constant step size: for αk = h we get

f_best(k) − p⋆ ≤ (‖x(1) − x⋆‖₂² + G²kh²) / (2kh)

right-hand side converges to G²h/2 as k → ∞
square summable but not summable step sizes:

suppose step sizes satisfy

∑_{k=1}^∞ αk² < ∞,  ∑_{k=1}^∞ αk = ∞

then

f_best(k) − p⋆ ≤ (‖x(1) − x⋆‖₂² + G² ∑_{i=1}^k αi²) / (2 ∑_{i=1}^k αi)

as k → ∞, the numerator converges to a finite number and the denominator converges to ∞, so f_best(k) → p⋆
Example: Piecewise linear minimization
minimize f(x) = max_{i=1,…,m} (aiᵀx + bi)

to find a subgradient of f: find an index j for which

ajᵀx + bj = max_{i=1,…,m} (aiᵀx + bi)

and take g = aj

subgradient method: x(k+1) = x(k) − αk aj
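A sketch of this piecewise-linear case, using the constant step length rule from the slides (the problem data A, b and the parameters here are supplied by the caller):

```python
import numpy as np

def pwl_subgradient(A, b, x0, iters=500, h=0.02):
    # Minimize f(x) = max_i (a_i^T x + b_i) by the subgradient method,
    # with constant step length alpha_k = h / ||g||_2.
    x = np.asarray(x0, dtype=float)
    f = lambda z: np.max(A @ z + b)
    f_best, x_best = f(x), x.copy()
    for k in range(iters):
        j = np.argmax(A @ x + b)          # index of an active term
        g = A[j]                          # g = a_j is a subgradient at x
        x = x - (h / np.linalg.norm(g)) * g
        if f(x) < f_best:
            f_best, x_best = f(x), x.copy()
    return x_best, f_best
```

For instance, with A = [[1], [−1]] and b = 0, f(x) = |x|, and the best value found approaches 0 (to within roughly the step length h).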
problem instance with n = 10 variables, m = 100 terms
constant step length, h = 0.05, 0.02, 0.005
[plot: f(x(k)) − p⋆ versus k, for h = 0.05, 0.02, 0.005]
f_best(k) − p⋆ and upper bound, constant step length h = 0.02
[plot: bound and f_best(k) − p⋆ versus k]
constant step size h = 0.05, 0.02, 0.005
[plot: f(x(k)) − p⋆ versus k, for h = 0.05, 0.02, 0.005]
diminishing step rule αk = 0.1/√k and square summable step size rule αk = 0.1/k
[plot: f(x(k)) − p⋆ versus k]
f_best(k) − p⋆ and upper bound, diminishing step size rule αk = 0.1/√k
[plot: bound and f_best(k) − p⋆ versus k]
constant step length h = 0.02, diminishing step size rule αk = 0.1/√k, and square summable step rule αk = 0.1/k
[plot: f_best(k) − p⋆ versus k, for the three rules]
Projected subgradient method
solves constrained optimization problem
minimize f(x)
subject to x ∈ C,
where f : Rn → R, C ⊆ Rn are convex
projected subgradient method is given by
x(k+1) = P (x(k) − αkg(k)),
P is (Euclidean) projection on C, and g(k) ∈ ∂f(x(k))
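A minimal sketch of the projected update, assuming the caller supplies the Euclidean projection P onto C (for a box, this is simple clipping):

```python
import numpy as np

def projected_subgradient(subgrad, proj, x0, step, iters=500):
    # x^(k+1) = P(x^(k) - alpha_k g^(k)); proj is Euclidean projection onto C.
    x = proj(np.asarray(x0, dtype=float))
    for k in range(iters):
        g = subgrad(x)
        x = proj(x - step(k) * g)
    return x
```

For example, minimizing f(x) = |x − 2| over C = [−1, 1] (projection = clipping) converges to the constrained minimizer x = 1.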
same convergence results:
• for constant step size, converges to a neighborhood of optimal (for f differentiable and h small enough, converges)
• for diminishing nonsummable step sizes, converges

key idea: projection does not increase distance to x⋆
approximate projected subgradient: P only needs to satisfy
P (u) ∈ C, ‖P (u)− z‖2 ≤ ‖u− z‖2 for any z ∈ C
Projected subgradient for dual problem
(convex) primal:
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, …, m

solve dual problem

maximize g(λ)
subject to λ ⪰ 0

via projected subgradient method:

λ(k+1) = (λ(k) − αk h)+,  h ∈ ∂(−g)(λ(k))
Subgradient of negative dual function
assume f0 is strictly convex, and denote, for λ ⪰ 0,

x∗(λ) = argmin_z (f0(z) + λ1 f1(z) + · · · + λm fm(z))

so g(λ) = f0(x∗(λ)) + λ1 f1(x∗(λ)) + · · · + λm fm(x∗(λ))

a subgradient of −g at λ is given by hi = −fi(x∗(λ))

projected subgradient method for dual:

x(k) = x∗(λ(k)),  λi(k+1) = (λi(k) + αk fi(x(k)))+
note:
• primal iterates x(k) are not feasible, but become feasible in the limit
• subgradient of −g directly gives the violation of the primal constraints
• dual function values g(λ(k)) converge to p⋆
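The dual update above can be sketched as follows; the Lagrangian minimizer oracle argmin_L and the constraint functions are assumptions supplied by the caller (with f0 strictly convex so the minimizer is unique):

```python
import numpy as np

def dual_subgradient(argmin_L, fis, lam0, step, iters=200):
    # Projected subgradient method on the dual of
    #   minimize f0(x)  subject to  fi(x) <= 0, i = 1, ..., m.
    # argmin_L(lam) returns x*(lam), the minimizer of the Lagrangian;
    # fis(x) returns the vector (f1(x), ..., fm(x)).
    lam = np.asarray(lam0, dtype=float)
    for k in range(iters):
        x = argmin_L(lam)                              # x^(k) = x*(lam^(k))
        lam = np.maximum(lam + step(k) * fis(x), 0.0)  # (lam + alpha_k f(x))_+
    return lam, argmin_L(lam)
```

As a toy check: minimize x² subject to 1 − x ≤ 0, where x∗(λ) = λ/2; the iteration converges to λ = 2, x = 1.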
Example: Optimal network flow
• connected directed graph with n links, p nodes
• variable xj denotes the flow or traffic on arc j (can be < 0)
• given external source (or sink) flow si at node i, 1Ts = 0
• flow conservation: Ax = s, where A ∈ Rp×n is node incidence matrix
• φj : R→ R convex flow cost function for link j
optimal (single commodity) network flow problem:

minimize ∑_{j=1}^n φj(xj)
subject to Ax = s
Dual network flow problem
Lagrangian is

L(x, ν) = ∑_{j=1}^n φj(xj) + νᵀ(s − Ax)
 = ∑_{j=1}^n (φj(xj) − ∆νj xj) + νᵀs

• we interpret νi as potential at node i
• ∆νj denotes potential difference across link j
dual function is

q(ν) = inf_x L(x, ν) = ∑_{j=1}^n inf_{xj} (φj(xj) − ∆νj xj) + νᵀs
 = −∑_{j=1}^n φj∗(∆νj) + νᵀs

(φj∗ is conjugate function of φj)

dual network flow problem:

maximize q(ν) = −∑_{j=1}^n φj∗(∆νj) + νᵀs
Optimal network flow via dual
assume φj strictly convex, and denote

x∗j(∆νj) = argmin_{xj} (φj(xj) − ∆νj xj)

if ν⋆ is an optimal solution of the dual network flow problem,

x⋆j = x∗j(∆νj⋆)

is the optimal flow
Electrical network analogy
• electrical network with node incidence matrix A, nonlinear resistors in branches
• variable xj is the current flow in branch j
• source si is external current injected at node i (must sum to zero)
• flow conservation equation Ax = s is Kirchhoff's Current Law (KCL)
• dual variables are node potentials; ∆νj is jth branch voltage
• branch current-voltage characteristic is xj = x∗j(∆νj)
then, current and potentials in circuit are optimal flows and dual variables
Subgradient of negative dual function
a subgradient of the negative dual function −q at ν is
g = Ax∗(∆ν)− s
ith component is gi = aiᵀ x∗(∆ν) − si, which is the flow excess at node i
Subgradient method for dual
subgradient method applied to dual can be expressed as:
xj := x∗j(∆νj)
gi := aiᵀ x − si
νi := νi − αgi
interpretation:
• optimize each flow, given potential difference, without regard for flow conservation
• evaluate flow excess
• update potentials to correct flow excesses
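The three steps above can be sketched directly; here x_star maps the potential differences ∆ν = Aᵀν to the per-link minimizers, and in the usage below we use the illustrative quadratic cost φj(xj) = xj²/2 (for which x∗j(∆νj) = ∆νj), not the queueing cost from the later slides:

```python
import numpy as np

def network_flow_dual(A, s, x_star, nu0, alpha, iters=500):
    # Dual subgradient method for: minimize sum_j phi_j(x_j) s.t. Ax = s.
    # A: node incidence matrix, s: source vector (1^T s = 0),
    # x_star: maps dv = A^T nu to per-link minimizers (phi_j strictly convex).
    nu = np.asarray(nu0, dtype=float)
    for k in range(iters):
        x = x_star(A.T @ nu)   # optimize each flow given potential differences
        g = A @ x - s          # flow excess at each node
        nu = nu - alpha * g    # update potentials to correct flow excess
    return nu, x_star(A.T @ nu)
```

On a two-node, one-link network with unit demand, the flow converges to the feasible value x = 1 and the primal residual ‖Ax − s‖ goes to zero.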
Example: Minimum queueing delay
flow cost function
φj(xj) = |xj| / (cj − |xj|),  dom φj = (−cj, cj)

where cj > 0 are given link capacities

(φj(xj) gives the expected waiting time in a queue with exponential arrivals at rate xj, exponential service at rate cj)

conjugate is

φj∗(y) = 0 for |y| ≤ 1/cj,  φj∗(y) = (√(cj|y|) − 1)² for |y| > 1/cj
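These two formulas are easy to code, and the closed-form conjugate can be sanity-checked against its definition φ∗(y) = sup_x (xy − φ(x)) on a fine grid; a sketch for the scalar case c = 1:

```python
import numpy as np

def phi(x, c=1.0):
    # queueing-delay flow cost |x| / (c - |x|), with dom phi = (-c, c)
    return np.abs(x) / (c - np.abs(x))

def phi_conj(y, c=1.0):
    # closed-form conjugate: 0 for |y| <= 1/c, (sqrt(c|y|) - 1)^2 otherwise
    a = np.abs(y)
    return np.where(a <= 1.0 / c, 0.0, (np.sqrt(c * a) - 1.0) ** 2)
```

Maximizing xy − φ(x) numerically over a dense grid in (−1, 1) reproduces φ∗(y) to within grid resolution.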
cost function φ(x) (left) and its conjugate φ∗(y) (right), c = 1
[plots: φ(x) versus x (left), φ∗(y) versus y (right)]
(note conjugate is differentiable)
x∗j(∆νj), for cj = 1
[plot: x∗j(∆νj) versus ∆νj]
gives flow as function of potential difference across link
A specific example
network with 5 nodes, 7 links, capacities cj = 1

[network diagram]
Optimal flow
optimal flows shown as width of arrows; optimal dual variables shown in nodes; potential differences shown on links

[annotated network diagram]
Convergence of dual function
constant stepsize rule, α = 0.1, 1, 2, 3
[plot: q(ν(k)) versus k, for α = 0.1, 1, 2, 3]
for α = 1, 2, converges to p⋆ = 2.48 in about 40 iterations
Convergence of primal residual
[plot: ‖Ax(k) − s‖ versus k, for α = 0.1, 1, 2, 3]
convergence of dual function, nonsummable diminishing stepsize rules
[plot: q(ν(k)) versus k, for α = 1/k, 5/k, 1/√k, 5/√k]
convergence of primal residual, nonsummable diminishing stepsize rules
[plot: ‖Ax(k) − s‖ versus k, for α = 1/k, 5/k, 1/√k, 5/√k]
Convergence of dual variables
ν(k) versus iteration number k, constant stepsize rule α = 2
[plot: ν1, ν2, ν3, ν4 versus k]
(ν5 is fixed as zero)