Homepage: maths.nju.edu.cn/hebma

Optimization Problems in Image Processing
and Splitting Contraction Methods for Their Solution:
A Unified Framework of Splitting Methods for Convex Optimization
with Multiple Separable Objective Functions

Bingsheng He
Department of Mathematics, Southern University of Science and Technology
Department of Mathematics, Nanjing University
Homepage: maths.nju.edu.cn/hebma

Summer School on Mathematical Image Processing and Analysis, July 2019
1 Convex optimization problems with three separable objective functions

    min  θ1(x) + θ2(y) + θ3(z)
    s.t. Ax + By + Cz = b,
         x ∈ X, y ∈ Y, z ∈ Z.    (1.1)
Background extraction of surveillance video (II)

The original surveillance video has missing information and additive noise:

    P_Ω(D) = P_Ω(X + Y) + noise,

where P_Ω indicates the missing data and Z denotes the noise/outliers.

Model:

    min { ‖X‖_* + τ‖Y‖_1 + ‖P_Ω(Z)‖²_F  |  X + Y − Z = D }

Here D is the observed video, the low-rank X is the background, and the sparse Y is the foreground.
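Models of this kind suit splitting methods because each separate term has an inexpensive proximal operator. As an illustrative sketch (ours, not part of the slides; the function names are hypothetical), here are the two proximal maps behind the nuclear-norm and ℓ1 terms:

```python
import numpy as np

def prox_l1(Y, tau):
    # Soft-thresholding: arg min_X  tau*||X||_1 + 0.5*||X - Y||_F^2
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def prox_nuclear(Y, tau):
    # Singular-value thresholding: arg min_X  tau*||X||_* + 0.5*||X - Y||_F^2
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

In a splitting method for the model above, the X-subproblem reduces to `prox_nuclear` and the Y-subproblem to `prox_l1`.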
Image decomposition with degradations. The target image for decomposition contains degradations, e.g., blur, missing pixels, etc.:

    f = K(u + div v) + z,    K — degradation operator, z — noise/outlier.

Model:

    min { ‖∇u‖_1 + τ‖v‖_∞ + ‖z‖²_2  |  K(u + div v) + z = f }

Here f is the target image, u the cartoon part, and v the texture part.
2 Mathematical Background

Two basic concepts: variational inequalities (VI) and the proximal point algorithm (PPA)
Lemma 1 Let X ⊂ ℝ^n be a closed convex set, let θ(x) and f(x) be convex functions, and let f(x) be differentiable. Assume that the solution set of the minimization problem min{θ(x) + f(x) | x ∈ X} is nonempty. Then,

    x* ∈ arg min{ θ(x) + f(x) | x ∈ X }    (2.1a)

if and only if

    x* ∈ X,  θ(x) − θ(x*) + (x − x*)ᵀ∇f(x*) ≥ 0,  ∀ x ∈ X.    (2.1b)
2.1 Linearly constrained convex optimization and VI
The Lagrangian function of the problem (1.1) is
L3(x, y, z, λ) = θ1(x) + θ2(y) + θ3(z)− λT (Ax+By + Cz − b).
The saddle point (x*, y*, z*, λ*) ∈ X × Y × Z × ℝ^m of L3(x, y, z, λ) satisfies

    L3(x*, y*, z*, λ) ≤ L3(x*, y*, z*, λ*) ≤ L3(x, y, z, λ*),
        ∀ λ ∈ ℝ^m, ∀ (x, y, z) ∈ X × Y × Z.

In other words, for any saddle point (x*, y*, z*, λ*), we have

    x* ∈ arg min{ L3(x, y*, z*, λ*) | x ∈ X },
    y* ∈ arg min{ L3(x*, y, z*, λ*) | y ∈ Y },
    z* ∈ arg min{ L3(x*, y*, z, λ*) | z ∈ Z },
    λ* ∈ arg max{ L3(x*, y*, z*, λ) | λ ∈ ℝ^m }.

According to Lemma 1, the saddle point is a solution of the following VI:

    x* ∈ X,   θ1(x) − θ1(x*) + (x − x*)ᵀ(−Aᵀλ*) ≥ 0,  ∀ x ∈ X,
    y* ∈ Y,   θ2(y) − θ2(y*) + (y − y*)ᵀ(−Bᵀλ*) ≥ 0,  ∀ y ∈ Y,
    z* ∈ Z,   θ3(z) − θ3(z*) + (z − z*)ᵀ(−Cᵀλ*) ≥ 0,  ∀ z ∈ Z,
    λ* ∈ ℝ^m, (λ − λ*)ᵀ(Ax* + By* + Cz* − b) ≥ 0,  ∀ λ ∈ ℝ^m.
Its compact form is the following variational inequality:

    w* ∈ Ω,  θ(u) − θ(u*) + (w − w*)ᵀF(w*) ≥ 0,  ∀ w ∈ Ω,    (2.2)

where

    w = (x, y, z, λ),  u = (x, y, z),
    F(w) = ( −Aᵀλ, −Bᵀλ, −Cᵀλ, Ax + By + Cz − b ),

and

    θ(u) = θ1(x) + θ2(y) + θ3(z),  Ω = X × Y × Z × ℝ^m.

Note that the operator F is monotone, because

    (w − w̄)ᵀ(F(w) − F(w̄)) ≥ 0;  here, in fact, (w − w̄)ᵀ(F(w) − F(w̄)) = 0.    (2.3)
2.2 Preliminaries of PPA for Variational Inequalities
The optimality condition of the problem (1.1) is characterized as a mixed monotone
variational inequality:
w∗ ∈ Ω, θ(u)− θ(u∗) + (w − w∗)TF (w∗) ≥ 0, ∀w ∈ Ω. (2.4)
PPA for the monotone mixed VI in H-norm

For given w^k, find the proximal point w^{k+1} in H-norm which satisfies

    w^{k+1} ∈ Ω,  θ(u) − θ(u^{k+1}) + (w − w^{k+1})ᵀ{ F(w^{k+1}) + H(w^{k+1} − w^k) } ≥ 0,  ∀ w ∈ Ω,    (2.5)

where H is a symmetric positive definite matrix.

Convergence property of the proximal point algorithm in H-norm:

    ‖w^{k+1} − w*‖²_H ≤ ‖w^k − w*‖²_H − ‖w^k − w^{k+1}‖²_H.    (2.6)
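As a hedged numerical check (ours, not from the slides): take Ω = ℝ^n, θ ≡ 0 and an affine monotone operator F(w) = Aw + q, for which the PPA step (2.5) reduces to the linear system (A + H)w^{k+1} = Hw^k − q. The contraction (2.6) can then be verified directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Monotone but non-symmetric affine operator F(w) = A w + q:
# A = S + K with S symmetric positive definite and K skew-symmetric.
S = rng.standard_normal((n, n)); S = S @ S.T + n * np.eye(n)
K = rng.standard_normal((n, n)); K = K - K.T
A, q = S + K, rng.standard_normal(n)

H = 2.0 * np.eye(n)                    # any symmetric positive definite H
w_star = np.linalg.solve(A, -q)        # F(w*) = 0: the solution of the VI on R^n
h2 = lambda v: float(v @ H @ v)        # squared H-norm

w = rng.standard_normal(n)
for _ in range(30):
    w_new = np.linalg.solve(A + H, H @ w - q)   # PPA step (2.5)
    # the contraction property (2.6), up to round-off
    assert h2(w_new - w_star) <= h2(w - w_star) - h2(w - w_new) + 1e-9
    w = w_new

assert np.allclose(w, w_star, atol=1e-8)
```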
2.3 Splitting Methods in a Unified Framework
We study the algorithms under the guidance of variational inequalities.
w∗ ∈ Ω, θ(u)− θ(u∗) + (w − w∗)TF (w∗) ≥ 0, ∀w ∈ Ω. (2.7)
Algorithms in a unified framework
[Prediction Step.] With given v^k, find a vector w̃^k ∈ Ω such that

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (v − ṽ^k)ᵀQ(v^k − ṽ^k),  ∀ w ∈ Ω,    (2.8a)

where the matrix Q is not necessarily symmetric, but Qᵀ + Q is positive definite.

[Correction Step.] The new iterate v^{k+1} is given by

    v^{k+1} = v^k − αM(v^k − ṽ^k).    (2.8b)
Convergence Conditions

For the matrices Q and M, there is a positive definite matrix H such that

    HM = Q.    (2.9a)

Moreover, the matrix

    G = Qᵀ + Q − αMᵀHM    (2.9b)

is positive semi-definite.

Convergence using the unified framework

Theorem 1 Let {v^k} be the sequence generated by a method for the problem (2.7), and let w̃^k be obtained in the k-th iteration. If v^k, v^{k+1} and w̃^k satisfy the conditions in the unified framework, then we have

    ‖v^{k+1} − v*‖²_H ≤ ‖v^k − v*‖²_H − α‖v^k − ṽ^k‖²_G,  ∀ v* ∈ V*.    (2.10)
The main conclusion of Theorem 1:

    ‖v^{k+1} − v*‖²_H ≤ ‖v^k − v*‖²_H − α‖v^k − ṽ^k‖²_G,  ∀ v* ∈ V*.

This is an inequality of the same type as that of the PPA (2.6), so the methods of this class can be called PPA-like.

For the unified framework and the proof of its convergence, see the following references:

• B.S. He and X.M. Yuan, A class of ADMM-based algorithms for three-block separable convex programming. Comput. Optim. Appl. 70 (2018), 791–826.

• B.S. He, My 20 years with the alternating direction method of multipliers (in Chinese), Operations Research Transactions, 22(1), pp. 1–31, 2018.

PPA-like methods share a simple, unified construction; the crucial point is the choice of the matrices, which determines the accuracy and speed.
3 Two special prediction-correction methods

We study the optimization algorithms under the guidance of the variational inequality

    w* ∈ Ω,  θ(u) − θ(u*) + (w − w*)ᵀF(w*) ≥ 0,  ∀ w ∈ Ω.    (3.1)

3.1 Algorithm I: Q = H, H is positive definite

[Prediction Step.] With given v^k, find a vector w̃^k ∈ Ω such that

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (v − ṽ^k)ᵀH(v^k − ṽ^k),  ∀ w ∈ Ω,    (3.2a)

where the matrix H is symmetric and positive definite.

[Correction Step.] The new iterate v^{k+1} is given by

    v^{k+1} = v^k − α(v^k − ṽ^k),  α ∈ (0, 2).    (3.2b)

Here H is a symmetric positive definite matrix; the prediction step imposes requirements on the parameters.
The sequence {v^k} generated by the prediction-correction method (3.2) satisfies

    ‖v^{k+1} − v*‖²_H ≤ ‖v^k − v*‖²_H − α(2 − α)‖v^k − ṽ^k‖²_H,  ∀ v* ∈ V*.

The above inequality is the key for the convergence analysis!

It is an inequality of the same type as (2.6), so the method has PPA-like convergence.

Set α = 1 in (3.2b); then the prediction (3.2a) becomes: find w^{k+1} ∈ Ω such that

    θ(u) − θ(u^{k+1}) + (w − w^{k+1})ᵀF(w^{k+1}) ≥ (v − v^{k+1})ᵀH(v^k − v^{k+1}),  ∀ w ∈ Ω.

The generated sequence {v^k} satisfies

    ‖v^{k+1} − v*‖²_H ≤ ‖v^k − v*‖²_H − ‖v^k − v^{k+1}‖²_H,  ∀ v* ∈ V*.

This is again an inequality of the same type as (2.6); in this case the method is a PPA in the essential variables v.
3.2 Algorithm II: Q is the sum of two matrices

[Prediction Step.] With given v^k, find a vector w̃^k ∈ Ω such that

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (v − ṽ^k)ᵀQ(v^k − ṽ^k),  ∀ w ∈ Ω,    (3.3a)

where

    Q = D + K,    (3.3b)

D is a block-diagonal positive definite matrix and K is skew-symmetric, so Qᵀ + Q = 2D.

[Correction Step.] For the positive definite matrix D, the new iterate v^{k+1} is given by

    v^{k+1} = v^k − γα*_k M(v^k − ṽ^k),    (3.4a)

where M = D⁻¹Q, γ ∈ (0, 2), and the optimal step size is given by

    α*_k = ‖v^k − ṽ^k‖²_D / ‖M(v^k − ṽ^k)‖²_D.    (3.4b)
Since MᵀDM = MᵀQ, we have

    ‖M(v^k − ṽ^k)‖²_D = [M(v^k − ṽ^k)]ᵀ[Q(v^k − ṽ^k)],

and thus

    α*_k = ‖v^k − ṽ^k‖²_D / ( [M(v^k − ṽ^k)]ᵀ[Q(v^k − ṽ^k)] ).

This step size is easy to compute.

The sequence {v^k} generated by the prediction-correction Algorithm II satisfies

    ‖v^{k+1} − v*‖²_D ≤ ‖v^k − v*‖²_D − γ(2 − γ)α*_k‖v^k − ṽ^k‖²_D,  ∀ v* ∈ V*.

This is an inequality of the same type as (2.6); the prediction-correction scheme has PPA-like convergence.

Thus, all the methods discussed in this talk are of proximal-point type (PPA-like).
Convergence of the prediction-correction method II

Lemma 2 For given v^k, let the predictor w̃^k be generated by (3.3a); then we have

    (v^k − v*)ᵀQ(v^k − ṽ^k) ≥ ‖v^k − ṽ^k‖²_D,    (3.5)

where Q is given in the right-hand side of (3.3a) and D is given in (3.3b).

Proof. Setting w = w* in (3.3a), we get

    (ṽ^k − v*)ᵀQ(v^k − ṽ^k) ≥ θ(ũ^k) − θ(u*) + (w̃^k − w*)ᵀF(w̃^k).    (3.6)

Because

    (w̃^k − w*)ᵀF(w̃^k) = (w̃^k − w*)ᵀF(w*)

and

    θ(ũ^k) − θ(u*) + (w̃^k − w*)ᵀF(w*) ≥ 0,

the right-hand side of (3.6) is non-negative. Thus, we have

    [(v^k − v*) − (v^k − ṽ^k)]ᵀQ(v^k − ṽ^k) ≥ 0

and

    (v^k − v*)ᵀQ(v^k − ṽ^k) ≥ (v^k − ṽ^k)ᵀQ(v^k − ṽ^k).    (3.7)

For the right-hand side of the above inequality, by using Q = D + K and the skew-symmetry of K, we obtain

    (v^k − ṽ^k)ᵀQ(v^k − ṽ^k) = (v^k − ṽ^k)ᵀ(D + K)(v^k − ṽ^k) = ‖v^k − ṽ^k‖²_D.

The lemma is proved.
Theorem 2 For given v^k, let the predictor w̃^k be generated by (3.3a). If the new iterate is given by

    v^{k+1}(α) = v^k − αM(v^k − ṽ^k),    (3.8)

then we have

    ‖v^{k+1}(α) − v*‖²_D ≤ ‖v^k − v*‖²_D − q^II_k(α),  ∀ v* ∈ V*,    (3.9)

where

    q^II_k(α) = 2α‖v^k − ṽ^k‖²_D − α²‖M(v^k − ṽ^k)‖²_D.    (3.10)

Proof. First, we define the profit function

    ϑ^II_k(α) = ‖v^k − v*‖²_D − ‖v^{k+1}(α) − v*‖²_D.    (3.11)

Thus, it follows from (3.8) that

    ϑ^II_k(α) = ‖v^k − v*‖²_D − ‖(v^k − v*) − αM(v^k − ṽ^k)‖²_D
              = 2α(v^k − v*)ᵀDM(v^k − ṽ^k) − α²‖M(v^k − ṽ^k)‖²_D.

By using DM = Q and (3.5), we get

    ϑ^II_k(α) ≥ 2α‖v^k − ṽ^k‖²_D − α²‖M(v^k − ṽ^k)‖²_D = q^II_k(α).

q^II_k(α) reaches its maximum at α*_k, which is given by (3.4b).
[Figure: the quadratic lower bound q(α) and the profit ϑ(α) as functions of α, with the maximizer α* and the relaxed step γα* marked; an illustration of the choice γ ∈ [1, 2).]
Since we take α = γα*_k, it follows from (3.10) that

    q^II_k(γα*_k) = 2γα*_k‖v^k − ṽ^k‖²_D − γ²(α*_k)²‖M(v^k − ṽ^k)‖²_D.    (3.12)

By using (3.4b), we get

    (α*_k)²‖M(v^k − ṽ^k)‖²_D
        = α*_k · ( ‖v^k − ṽ^k‖²_D / ‖M(v^k − ṽ^k)‖²_D ) · ‖M(v^k − ṽ^k)‖²_D
        = α*_k‖v^k − ṽ^k‖²_D.

Substituting this in (3.12), we get q^II_k(γα*_k) = γ(2 − γ)α*_k‖v^k − ṽ^k‖²_D, and hence

    ‖v^{k+1} − v*‖²_D ≤ ‖v^k − v*‖²_D − γ(2 − γ)α*_k‖v^k − ṽ^k‖²_D,  ∀ v* ∈ V*.
4 Applications for separable problems

This section presents various applications of the proposed algorithms for the separable convex optimization problem

    min{ θ1(x) + θ2(y) | Ax + By = b, x ∈ X, y ∈ Y }.    (4.1)

Its VI form is

    w* ∈ Ω,  θ(u) − θ(u*) + (w − w*)ᵀF(w*) ≥ 0,  ∀ w ∈ Ω,    (4.2)

where

    w = (x, y, λ),  u = (x, y),  F(w) = ( −Aᵀλ, −Bᵀλ, Ax + By − b ),    (4.3a)

and

    θ(u) = θ1(x) + θ2(y),  Ω = X × Y × ℝ^m.    (4.3b)
The augmented Lagrangian function of the problem (4.1) is

    Lβ(x, y, λ) = θ1(x) + θ2(y) − λᵀ(Ax + By − b) + (β/2)‖Ax + By − b‖².    (4.4)

Solving the problem (4.1) by ADMM, the k-th iteration begins with given (y^k, λ^k) and offers the new iterate (y^{k+1}, λ^{k+1}) via

    (ADMM)
    x^{k+1} = arg min{ Lβ(x, y^k, λ^k) | x ∈ X },    (4.5a)
    y^{k+1} = arg min{ Lβ(x^{k+1}, y, λ^k) | y ∈ Y },    (4.5b)
    λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} − b).    (4.5c)

With

    w = (x, y, λ),  v = (y, λ)  and  V* = { (y*, λ*) | (x*, y*, λ*) ∈ Ω* },

the ADMM iterates satisfy

    ‖v^{k+1} − v*‖²_H ≤ ‖v^k − v*‖²_H − ‖v^k − v^{k+1}‖²_H,
    with  H = [ βBᵀB  0 ; 0  (1/β)I_m ].
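A runnable toy sketch (ours; the quadratic objectives θ1(x) = ½‖x − p‖², θ2(y) = ½‖y − r‖² and the matrices are hypothetical test data) that runs the scheme (4.5) and checks the contraction in the H-norm stated above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 3.0 * np.eye(n)   # keep B well conditioned
p, r, b = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
beta = 1.0

# KKT system of min 0.5||x-p||^2 + 0.5||y-r||^2 s.t. Ax + By = b gives w*
Z = np.zeros((n, n))
KKT = np.block([[np.eye(n), Z, -A.T], [Z, np.eye(n), -B.T], [A, B, Z]])
x_s, y_s, lam_s = np.split(np.linalg.solve(KKT, np.concatenate([p, r, b])), 3)
v_star = np.concatenate([y_s, lam_s])

H = np.block([[beta * B.T @ B, Z], [Z, np.eye(n) / beta]])
h2 = lambda v: float(v @ H @ v)        # squared H-norm

y, lam = np.zeros(n), np.zeros(n)
for _ in range(500):
    # (4.5a) x-subproblem, (4.5b) y-subproblem, (4.5c) multiplier update
    x = np.linalg.solve(np.eye(n) + beta * A.T @ A,
                        p + A.T @ lam - beta * A.T @ (B @ y - b))
    y_new = np.linalg.solve(np.eye(n) + beta * B.T @ B,
                            r + B.T @ lam - beta * B.T @ (A @ x - b))
    lam_new = lam - beta * (A @ x + B @ y_new - b)
    v, v_new = np.concatenate([y, lam]), np.concatenate([y_new, lam_new])
    # contraction in the H-norm, up to round-off
    assert h2(v_new - v_star) <= h2(v - v_star) - h2(v - v_new) + 1e-8
    y, lam = y_new, lam_new

assert np.allclose(A @ x + B @ y - b, 0, atol=1e-6)
```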
We now design prediction formulas that meet the requirements of the unified framework.
4.1 ADMM in PPA-sense

In order to solve the separable convex optimization problem (4.1), we construct a method whose prediction step is

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (v − ṽ^k)ᵀH(v^k − ṽ^k),  ∀ w ∈ Ω,    (4.6a)

where

    H = [ (1 + δ)βBᵀB  −Bᵀ ; −B  (1/β)I_m ],  (a small δ > 0, say δ = 0.05).    (4.6b)

Since H is positive definite, we can use the update form of Algorithm I to produce the new iterate v^{k+1} = (y^{k+1}, λ^{k+1}). (In the algorithm [2], we took δ = 0.)
The concrete form of (4.6) is

    θ1(x) − θ1(x̃^k) + (x − x̃^k)ᵀ{ −Aᵀλ̃^k } ≥ 0,
    θ2(y) − θ2(ỹ^k) + (y − ỹ^k)ᵀ{ −Bᵀλ̃^k + (1 + δ)βBᵀB(ỹ^k − y^k) − Bᵀ(λ̃^k − λ^k) } ≥ 0,
    (Ax̃^k + Bỹ^k − b) − B(ỹ^k − y^k) + (1/β)(λ̃^k − λ^k) = 0.

In fact, the prediction can be arranged by

    x̃^k = arg min{ Lβ(x, y^k, λ^k) | x ∈ X },    (4.7a)
    λ̃^k = λ^k − β(Ax̃^k + By^k − b),    (4.7b)
    ỹ^k = arg min{ θ2(y) − yᵀBᵀ[2λ̃^k − λ^k] + ((1 + δ)/2)β‖B(y − y^k)‖²  |  y ∈ Y }.    (4.7c)

This prediction is close to the classical ADMM; applying the correction (3.2b) speeds up the convergence.

The underlined part is F(w̃^k):

    F(w) = ( −Aᵀλ, −Bᵀλ, Ax + By − b ).
4.2 Linearized ADMM-Like Method
When the subproblem (4.7c) is hard to solve, we replace the term ((1 + δ)/2)β‖B(y − y^k)‖² with (s/2)‖y − y^k‖².

By using this linearized version of (4.7), the prediction step becomes

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (v − ṽ^k)ᵀH(v^k − ṽ^k),  ∀ w ∈ Ω,    (4.8)

where

    H = [ sI  −Bᵀ ; −B  (1/β)I_m ],
    replacing  [ (1 + δ)βBᵀB  −Bᵀ ; −B  (1/β)I_m ]  in (4.6).    (4.9)
The concrete formula of (4.8) is

    θ1(x) − θ1(x̃^k) + (x − x̃^k)ᵀ{ −Aᵀλ̃^k } ≥ 0,
    θ2(y) − θ2(ỹ^k) + (y − ỹ^k)ᵀ{ −Bᵀλ̃^k + s(ỹ^k − y^k) − Bᵀ(λ̃^k − λ^k) } ≥ 0,
    (Ax̃^k + Bỹ^k − b) − B(ỹ^k − y^k) + (1/β)(λ̃^k − λ^k) = 0.    (4.10)

The underlined part is F(w̃^k):

    F(w) = ( −Aᵀλ, −Bᵀλ, Ax + By − b ).
Then, we use

    v^{k+1} = v^k − α(v^k − ṽ^k),  α ∈ (0, 2),

to update the new iterate v^{k+1}.

How to implement the prediction? To get a w̃^k which satisfies (4.10), we need only use the following procedure:

    x̃^k = arg min{ Lβ(x, y^k, λ^k) | x ∈ X },
    λ̃^k = λ^k − β(Ax̃^k + By^k − b),
    ỹ^k = arg min{ θ2(y) − yᵀBᵀ[2λ̃^k − λ^k] + (s/2)‖y − y^k‖²  |  y ∈ Y }.

When ((1 + δ)/2)β‖B(y − y^k)‖² is replaced by (s/2)‖y − y^k‖², convergence is guaranteed only for s > β‖BᵀB‖.

For a given β > 0, the requirement s > β‖BᵀB‖ means that an overly large s will slow down the convergence.
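A hedged sketch (ours; the same hypothetical quadratic test problem θ1(x) = ½‖x − p‖², θ2(y) = ½‖y − r‖² as before) of this linearized prediction with the correction v^{k+1} = v^k − α(v^k − ṽ^k):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
p, r, b = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
beta = 1.0
s = 1.1 * beta * np.linalg.norm(B.T @ B, 2)   # convergence needs s > beta*||B'B||
alpha = 1.5                                    # relaxation step in (0, 2)

# exact solution of min 0.5||x-p||^2 + 0.5||y-r||^2 s.t. Ax + By = b
Z = np.zeros((n, n))
KKT = np.block([[np.eye(n), Z, -A.T], [Z, np.eye(n), -B.T], [A, B, Z]])
x_s, y_s, lam_s = np.split(np.linalg.solve(KKT, np.concatenate([p, r, b])), 3)

y, lam = np.zeros(n), np.zeros(n)
for _ in range(2000):
    # prediction: x-minimization, multiplier update, linearized y-step
    x_t = np.linalg.solve(np.eye(n) + beta * A.T @ A,
                          p + A.T @ lam - beta * A.T @ (B @ y - b))
    lam_t = lam - beta * (A @ x_t + B @ y - b)
    y_t = (r + B.T @ (2.0 * lam_t - lam) + s * y) / (1.0 + s)
    # correction: v^{k+1} = v^k - alpha (v^k - vtilde^k)
    y, lam = y - alpha * (y - y_t), lam - alpha * (lam - lam_t)

assert np.allclose(y, y_s, atol=1e-4) and np.allclose(lam, lam_s, atol=1e-4)
```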
4.3 Method without the requirement s > β‖BᵀB‖

When the matrix BᵀB is ill-conditioned and linearization is still necessary, we adopt the following approach.

For solving the same problem, we give the following prediction:

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (v − ṽ^k)ᵀQ(v^k − ṽ^k),  ∀ w ∈ Ω,    (4.11a)

where

    Q = [ sI  Bᵀ ; −B  (1/β)I_m ] = D + K.    (4.11b)

Because

    D = [ sI  0 ; 0  (1/β)I_m ]  and  K = [ 0  Bᵀ ; −B  0 ],

based on this prediction, the new iterate can be produced by formula (3.4) of Algorithm II.
How to implement the prediction? The concrete formula of (4.11) is

    θ1(x) − θ1(x̃^k) + (x − x̃^k)ᵀ{ −Aᵀλ̃^k } ≥ 0,
    θ2(y) − θ2(ỹ^k) + (y − ỹ^k)ᵀ{ −Bᵀλ̃^k + s(ỹ^k − y^k) + Bᵀ(λ̃^k − λ^k) } ≥ 0,
    (Ax̃^k + Bỹ^k − b) − B(ỹ^k − y^k) + (1/β)(λ̃^k − λ^k) = 0.

This can be implemented by

    x̃^k = arg min{ Lβ(x, y^k, λ^k) | x ∈ X },
    λ̃^k = λ^k − β(Ax̃^k + By^k − b),
    ỹ^k = arg min{ θ2(y) − yᵀBᵀλ^k + (s/2)‖y − y^k‖²  |  y ∈ Y }.

The y-subproblem is easy. For any given β > 0, an arbitrary s > 0 can be used.

The underlined part is F(w̃^k):

    F(w) = ( −Aᵀλ, −Bᵀλ, Ax + By − b ).
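A hedged sketch of this variant (ours; the same hypothetical quadratic test problem), combining the prediction above with the correction (3.4a) and the optimal step size α*_k from (3.4b):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
p, r, b = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
beta, s, gamma = 1.0, 0.5, 1.5        # any s > 0 is admissible here

# exact solution of min 0.5||x-p||^2 + 0.5||y-r||^2 s.t. Ax + By = b
Z = np.zeros((n, n))
KKT = np.block([[np.eye(n), Z, -A.T], [Z, np.eye(n), -B.T], [A, B, Z]])
x_s, y_s, lam_s = np.split(np.linalg.solve(KKT, np.concatenate([p, r, b])), 3)

D = np.block([[s * np.eye(n), Z], [Z, np.eye(n) / beta]])
Q = np.block([[s * np.eye(n), B.T], [-B, np.eye(n) / beta]])
M = np.linalg.solve(D, Q)             # M = D^{-1} Q

y, lam = np.zeros(n), np.zeros(n)
for _ in range(5000):
    x_t = np.linalg.solve(np.eye(n) + beta * A.T @ A,
                          p + A.T @ lam - beta * A.T @ (B @ y - b))
    lam_t = lam - beta * (A @ x_t + B @ y - b)
    y_t = (r + B.T @ lam + s * y) / (1.0 + s)     # y-step uses the old multiplier
    d = np.concatenate([y - y_t, lam - lam_t])    # v^k - vtilde^k
    if np.linalg.norm(d) < 1e-14:
        break
    alpha = float(d @ D @ d) / float((M @ d) @ (Q @ d))   # optimal step (3.4b)
    v = np.concatenate([y, lam]) - gamma * alpha * (M @ d)
    y, lam = v[:n], v[n:]

assert np.allclose(y, y_s, atol=1e-4) and np.allclose(lam, lam_s, atol=1e-4)
```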
For convex optimization problems with separable objective functions, we have proposed three prediction-correction methods in §4:

• If no difficulty arises in solving the subproblems, we suggest the method in §4.1.
• If the y-subproblem must be linearized and the matrix BᵀB is well-conditioned, we suggest the method in §4.2.
• If linearization is necessary but the matrix BᵀB is ill-conditioned, we suggest the method in §4.3.

We hope this framework can be helpful for the computation of practical problems.
5 Solving convex optimization problems with three separable objective functions

The Lagrangian function of this problem is

    L(x, y, z, λ) = θ1(x) + θ2(y) + θ3(z) − λᵀ(Ax + By + Cz − b).

The augmented Lagrangian function is

    L3β(x, y, z, λ) = L(x, y, z, λ) + (β/2)‖Ax + By + Cz − b‖².

The directly extended ADMM:

    x^{k+1} = arg min{ L3β(x, y^k, z^k, λ^k) | x ∈ X },
    y^{k+1} = arg min{ L3β(x^{k+1}, y, z^k, λ^k) | y ∈ Y },
    z^{k+1} = arg min{ L3β(x^{k+1}, y^{k+1}, z, λ^k) | z ∈ Z },
    λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b).    (5.12)

For m ≥ 3 blocks, this directly extended ADMM is not necessarily convergent [4].
♣ Thanks to the colleagues who paid attention to our related work.
Extended ADMM: our counterexample published in Math. Program. (2016). For the three-block problem

    min{ θ1(x) + θ2(y) + θ3(z) | Ax + By + Cz = b, x ∈ X, y ∈ Y, z ∈ Z },

take θ1(x) = θ2(y) = θ3(z) = 0, X = Y = Z = ℝ, a nonsingular matrix A = [A, B, C] ∈ ℝ^{3×3}, and b = 0 ∈ ℝ³.

Based on such data one can construct an example showing that the directly extended ADMM is not necessarily convergent.

This counterexample is mainly of theoretical significance.

Problem for further study: in practical three-block problems, the coefficient matrix A = [A, B, C] often contains an identity block, i.e., A = [A, B, I]. For the directly extended ADMM applied to such practical three-block problems, convergence has neither been proved nor disproved by a counterexample; this question remains open and is of interest to us.
To illustrate with examples:

• The alternating direction method of multipliers (ADMM) is convergent for the problem
  min{ θ1(x) + θ2(y) | Ax + By = b, x ∈ X, y ∈ Y }.

• Changing the equality constraint to an inequality constraint, the problem becomes
  min{ θ1(x) + θ2(y) | Ax + By ≤ b, x ∈ X, y ∈ Y }.

• This can be converted into a three-block equality-constrained problem:
  min{ θ1(x) + θ2(y) + 0 | Ax + By + z = b, x ∈ X, y ∈ Y, z ≥ 0 }.

• For the directly extended ADMM applied to problems of this kind, convergence has to date neither been proved nor has a counterexample been found.

Based on the above understanding, we propose improved methods for the three-block problem. Note that we impose no additional conditions on the problem and no extra restrictions on β.
5.1 ADMM with Gaussian back substitution

Taking the output (y^{k+1}, z^{k+1}) of (5.12) as the predictor and α ∈ (0, 1), the correction formula is

    ( y^{k+1} ; z^{k+1} ) := ( y^k ; z^k ) − α [ I  −(BᵀB)⁻¹BᵀC ; 0  I ] ( y^k − y^{k+1} ; z^k − z^{k+1} ).    (5.13)

If the next iteration only needs (By^{k+1}, Cz^{k+1}, λ^{k+1}), we can use

    ( By^{k+1} ; Cz^{k+1} ) := ( By^k ; Cz^k ) − α [ I  −I ; 0  I ] ( B(y^k − y^{k+1}) ; C(z^k − z^{k+1}) ).

• B. S. He, M. Tao and X.M. Yuan, Alternating direction method with Gaussian back substitution for separable convex programming, SIAM Journal on Optimization 22(2012), 313-340.

Treating y and z one after the other may seem unfair to the latter; the next method treats them equally.
5.2 ADMM + Prox-Parallel Splitting ALM

Simply updating y and z in parallel does not guarantee convergence:

    x^{k+1} = arg min{ L3β(x, y^k, z^k, λ^k) | x ∈ X },
    y^{k+1} = arg min{ L3β(x^{k+1}, y, z^k, λ^k) | y ∈ Y },
    z^{k+1} = arg min{ L3β(x^{k+1}, y^k, z, λ^k) | z ∈ Z },
    λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b).

The y- and z-subproblems are solved in parallel; without further treatment the scheme may diverge, so proximal terms are added (τ > 1):

    x^{k+1} = arg min{ L3β(x, y^k, z^k, λ^k) | x ∈ X },
    y^{k+1} = arg min{ L3β(x^{k+1}, y, z^k, λ^k) + (τ/2)β‖B(y − y^k)‖²  |  y ∈ Y },
    z^{k+1} = arg min{ L3β(x^{k+1}, y^k, z, λ^k) + (τ/2)β‖C(z − z^k)‖²  |  z ∈ Z },
    λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b).
The above method is equivalent to:

    x^{k+1} = arg min{ θ1(x) + (β/2)‖Ax + By^k + Cz^k − b − (1/β)λ^k‖²  |  x ∈ X },
    λ^{k+1/2} = λ^k − β(Ax^{k+1} + By^k + Cz^k − b),
    y^{k+1} = arg min{ θ2(y) − (λ^{k+1/2})ᵀBy + (μβ/2)‖B(y − y^k)‖²  |  y ∈ Y },
    z^{k+1} = arg min{ θ3(z) − (λ^{k+1/2})ᵀCz + (μβ/2)‖C(z − z^k)‖²  |  z ∈ Z },
    λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b),    (5.14)

where μ > 2; for example, one can take μ = 2.01.

• B. He, M. Tao and X. Yuan, A splitting method for separable convex programming. IMA J. Numerical Analysis, 35(2015), 394-426.

To allow the subproblems to be solved freely in parallel and still keep convergence, the proximal terms prevent each variable from moving too far from its previous value.
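A runnable sketch of scheme (5.14) with μ = 2.01 (ours; hypothetical quadratic objectives θ_i(·) = ½‖· − p_i‖² and well-conditioned test matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
# hypothetical well-conditioned test matrices and data
A = np.eye(n) + 0.5 * rng.standard_normal((n, n))
B = np.eye(n) + 0.5 * rng.standard_normal((n, n))
C = np.eye(n) + 0.5 * rng.standard_normal((n, n))
p1, p2, p3, b = (rng.standard_normal(n) for _ in range(4))
beta, mu = 1.0, 2.01                  # mu > 2, as required

# exact solution of min sum_i 0.5||.-p_i||^2 s.t. Ax + By + Cz = b (KKT system)
I, Z = np.eye(n), np.zeros((n, n))
KKT = np.block([[I, Z, Z, -A.T], [Z, I, Z, -B.T], [Z, Z, I, -C.T], [A, B, C, Z]])
sol = np.linalg.solve(KKT, np.concatenate([p1, p2, p3, b]))
x_s, y_s, z_s, lam_s = np.split(sol, 4)

x = y = z = lam = np.zeros(n)
for _ in range(5000):
    x = np.linalg.solve(I + beta * A.T @ A,
                        p1 + A.T @ lam - beta * A.T @ (B @ y + C @ z - b))
    lam_half = lam - beta * (A @ x + B @ y + C @ z - b)
    # the y- and z-steps are independent of each other: solvable in parallel
    y_new = np.linalg.solve(I + mu * beta * B.T @ B,
                            p2 + B.T @ lam_half + mu * beta * B.T @ (B @ y))
    z_new = np.linalg.solve(I + mu * beta * C.T @ C,
                            p3 + C.T @ lam_half + mu * beta * C.T @ (C @ z))
    y, z = y_new, z_new
    lam = lam - beta * (A @ x + B @ y + C @ z - b)

assert np.allclose(A @ x + B @ y + C @ z, b, atol=1e-4)
assert np.allclose(y, y_s, atol=1e-4) and np.allclose(z, z_s, atol=1e-4)
```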
This method was adopted by Osher's research group:
• E. Esser, M. Moller, S. Osher, G. Sapiro and J. Xin, A convex model for
non-negative matrix factorization and dimensionality reduction on physical
space, IEEE Trans. Imag. Process., 21(7), 3239-3252, 2012.
[Embedded in the original slides: scanned pages of the paper above, IEEE Trans. Image Process., 21(7), July 2012, pp. 3239-3252. The recoverable content:]

• Title page (p. 3239): "A Convex Model for Nonnegative Matrix Factorization and Dimensionality Reduction on Physical Space," by Ernie Esser, Michael Möller, Stanley Osher, Guillermo Sapiro, and Jack Xin. The paper proposes a collaborative convex framework for factoring a data matrix into a nonnegative product with a sparse coefficient matrix, with applications to hyperspectral endmember and abundance identification and to blind source separation of nuclear magnetic resonance data.

• From Sec. V-B, "Numerical Optimization" (p. 3247): "Since the convex functional for the extended model (15) is slightly more complicated, it is convenient to use a variant of ADMM that allows the functional to be split into more than two parts. The method proposed by He et al. in [34] is appropriate for this application."

• From Sec. V-C (p. 3248): "Due to the different algorithm used to solve the extended model, there is an additional numerical parameter, which for this application must be greater than two according to [34]. We set [it] equal to 2.01."

• Their reference [34] is: B. He, M. Tao, and X. Yuan, "A splitting method for separable convex programming with linking linear constraints," Tech. Rep., 2011.
ifornia, Los Angeles (UCLA), where he is currentlya Professor in the Department of Mathematics andthe Director of Special Projects at the Institute forPure and Applied Mathematics. His current interestsmainly involve information science, which includes
image processing, compressed sensing, and machine learning. His website iswww.math.ucla.edu/~sjo. His recent papers are on the website for the UCLAComputational and Applied Mathematics Reports at http://www.math.ucla.edu/applied/cam/index.shtml.Dr. Osher is a member of the National Academy of Sciences and the Amer-
ican Academy of Arts and Sciences, and he is one of the top 25 most highlycited researchers in both mathematics and computer science. He has receivednumerous academic honors and has cofounded three successful companies, eachbased largely on his own (joint) research.
Guillermo Sapiro (M’95–SM’03) was born in Mon-tevideo, Uruguay, on April 3, 1966. He received theB.Sc. (summa cum laude), M.Sc., and Ph.D. degreesfrom the Technion-Israel Institute of Technology,Haifa, Israel, in 1989, 1991, and 1993, respectively.After postdoctoral research at the Massachusetts
Institute of Technology, Cambridge, he becamea member of the Technical Staff at the researchfacilities of HP Laboratories, Palo Alto, CA. Heis currently with the Department of Electrical andComputer Engineering, University of Minnesota,
Minneapolis, where he holds the position of Distinguished McKnight Univer-sity Professor and Vincentine Hermes-Luh Chair in Electrical and ComputerEngineering. He works on differential geometry and geometric partial differ-ential equations both in theory and applications in computer vision, computergraphics, medical imaging, and image analysis. He has recently co-edited aspecial issue of IEEE TRANSACTIONS ON IMAGE PROCESSING in this topic and asecond one in the Journal of Visual Communication and Image Representation.He is the founding Editor-in-Chief of the SIAM Journal on Imaging Sciences.He has authored and coauthored numerous papers in this area and has written abook published by Cambridge University Press in January 2001.Dr. Sapiro is a member of the Society of Industrial and Applied Mathematics
(SIAM). He was the recipient of the Gutwirth Scholarship for Special Excel-lence in Graduate Studies in 1991, the Ollendorff Fellowship for Excellence inVision and Image Understanding Work in 1992, the Rothschild Fellowship forPostdoctoral Studies in 1993, the Office of Naval Research Young InvestigatorAward in 1998, the Presidential Early Career Awards for Scientist and Engineers(PECASE) in 1998, the National Science Foundation Career Award in 1999, andthe National Security Science and Engineering Faculty Fellowship in 2010.
Jack Xin received the B.S. degree in computationalmathematics from Peking University, Beijing, China,in 1985 and the M.S. and Ph.D. degrees in appliedmathematics from New York University, New York,in 1988 and 1990, respectively.He was a Postdoctoral Fellow at the University
of California at Berkeley and Princeton University,Princeton, NJ, in 1991 and 1992, respectively. Hewas an Assistant Professor and Associate Professorof mathematics at the University of Arizona, Tucson,from 1991 to 1999. He was a Professor of mathe-
matics from 1999 to 2005 at the University of Texas, Austin. Since 2005, he hasbeen a Professor of mathematics in the Department of Mathematics, Center forHearing Research, Institute for Mathematical Behavioral Sciences, and Centerfor Mathematical and Computational Biology at University of California,Irvine. His research interests include applied analysis, computational methodsand their applications in nonlinear and multiscale dynamics in fluids, and signalprocessing.Dr. Xin is a Fellow of the John S. Guggenheim Foundation.
41
6 Optimal Parameters
Some linearly constrained convex optimization problems

1. Linearly constrained convex optimization: min{θ(x) | Ax = b, x ∈ X}

2. Convex optimization problem with a separable objective function:
   min{θ1(x) + θ2(y) | Ax + By = b, x ∈ X, y ∈ Y}

3. Convex optimization problem with 3 separable objective functions:
   min{θ1(x) + θ2(y) + θ3(z) | Ax + By + Cz = b, x ∈ X, y ∈ Y, z ∈ Z}

There are some crucial parameters:

• the crucial parameter in the so-called linearized ALM for the first problem;
• the crucial parameter in the so-called linearized ADMM for the second problem;
• the crucial proximal parameter in the proximal parallel ADMM-like method for the convex optimization problem with 3 separable objective functions.
42
6.1 Linearized Augmented Lagrangian Method
Consider the following convex optimization problem:
min{θ(x) | Ax = b, x ∈ X}. (6.1)
The augmented Lagrangian function of the problem (6.1) is
Lβ(x, λ) = θ(x) − λ^T(Ax − b) + (β/2)‖Ax − b‖².
Starting with a given λ^k, the k-th iteration of the Augmented Lagrangian Method [15, 19] produces the new iterate w^{k+1} = (x^{k+1}, λ^{k+1}) via

(ALM)   x^{k+1} = argmin{Lβ(x, λ^k) | x ∈ X},                (6.2a)
        λ^{k+1} = λ^k − γβ(Ax^{k+1} − b),  γ ∈ (0, 2).       (6.2b)
In the classical ALM, the optimization subproblem (6.2a) is
min{θ(x) + (β/2)‖Ax − (b + (1/β)λ^k)‖² | x ∈ X}.
Sometimes, because of the structure of the matrix A, it is desirable to simplify the
subproblem (6.2a). Notice the following:
43
• Ignoring the constant term in the objective function of Lβ(x, λ^k), we have

  argmin{Lβ(x, λ^k) | x ∈ X}
  = argmin{θ(x) − (λ^k)^T(Ax − b) + (β/2)‖Ax − b‖² | x ∈ X}
  = argmin{θ(x) − (λ^k)^T(Ax − b) + (β/2)‖(Ax^k − b) + A(x − x^k)‖² | x ∈ X}
  = argmin{θ(x) − x^T A^T[λ^k − β(Ax^k − b)] + (β/2)‖A(x − x^k)‖² | x ∈ X}.  (6.3)
• In the so-called linearized ALM, the term (β/2)‖A(x − x^k)‖² is replaced
  with (r/2)‖x − x^k‖². In this way, the x-subproblem becomes

  x^{k+1} = argmin{θ(x) − x^T A^T[λ^k − β(Ax^k − b)] + (r/2)‖x − x^k‖² | x ∈ X}.  (6.4)

In fact, the linearized ALM simplifies the quadratic term (β/2)‖A(x − x^k)‖².
44
In comparison with (6.3), the simplified x-subproblem (6.4) is equivalent to

x^{k+1} = argmin{Lβ(x, λ^k) + (1/2)‖x − x^k‖²_{D_A} | x ∈ X},  (6.5)

where

D_A = rI − βA^T A.  (6.6)

In order to ensure convergence, it was required that r > β‖A^T A‖.
Thus, the mathematical form of the linearized ALM can be written as

x^{k+1} = argmin{Lβ(x, λ^k) + (1/2)‖x − x^k‖²_{D_A} | x ∈ X},  (6.7a)
λ^{k+1} = λ^k − γβ(Ax^{k+1} − b),  γ ∈ (0, 2),               (6.7b)

where D_A is defined by (6.6).
A large parameter r in (6.6) leads to slow convergence!
45
Recent Advance. Bingsheng He, Feng Ma, Xiaoming Yuan:
Optimal proximal augmented Lagrangian method and its application to full Jacobian
splitting for multi-block separable convex minimization problems, IMA Journal of
Numerical Analysis, 39 (2019).

Our new result in the above paper: for the matrix D_A in (6.7a) with the form (6.6),

• if r > ((2+γ)/4)β‖A^T A‖ is used in the method (6.7), it is still convergent;
• if r < ((2+γ)/4)β‖A^T A‖ is used in the method (6.7), there exists a divergent example.
46
Especially, when γ = 1:

x^{k+1} = argmin{Lβ(x, λ^k) + (1/2)‖x − x^k‖²_{D_A} | x ∈ X},  (6.8a)
λ^{k+1} = λ^k − β(Ax^{k+1} − b).                               (6.8b)

According to our new result, for the matrix D_A in (6.8a) with the form (6.6),

• if r > (3/4)β‖A^T A‖ is taken in the method (6.8), it is still convergent;
• if r < (3/4)β‖A^T A‖ is taken in the method (6.8), there exists a divergent example.

0.75 is the threshold factor of r in the matrix D_A for the linearized ALM (6.8)!
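As a quick sanity check of the γ = 1 scheme (6.8), the sketch below runs the linearized ALM on a tiny equality-constrained least-distance problem with r = 0.8·β‖A^T A‖ — below the classical bound β‖A^T A‖ but above the new threshold 0.75·β‖A^T A‖. The problem data, iteration count, and tolerances are illustrative choices, not part of the slides.

```python
# Linearized ALM (6.8) on a toy QP:  min 0.5*||x - c||^2  s.t.  A x = b.
# The x-step (6.4) has the closed form  x = (c + A^T q + r*x_k)/(1 + r),
# with q = lam_k - beta*(A x_k - b).  All data below are illustrative.

def matvec(M, v):
    return [sum(Mi[j] * v[j] for j in range(len(v))) for Mi in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b = [1.0, 2.0]
c = [1.0, 1.0, 1.0]
At = transpose(A)

beta = 1.0
norm_AtA = 3.0                  # largest eigenvalue of A^T A for this A
r = 0.8 * beta * norm_AtA       # above the threshold 0.75*beta*||A^T A||

x = [0.0, 0.0, 0.0]
lam = [0.0, 0.0]
for _ in range(20000):
    res = [ri - bi for ri, bi in zip(matvec(A, x), b)]      # A x - b
    q = [li - beta * ri for li, ri in zip(lam, res)]        # lam - beta*(Ax - b)
    Atq = matvec(At, q)
    x = [(ci + aq + r * xi) / (1.0 + r) for ci, aq, xi in zip(c, Atq, x)]
    res = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    lam = [li - beta * ri for li, ri in zip(lam, res)]      # gamma = 1 update

# By the KKT conditions the exact solution is x* = (1/3, 4/3, 2/3).
print(x)
```

One convergent toy run of course only illustrates the result; the theorem covers all instances, and the divergence for r below the threshold is shown by a specifically constructed example in the paper.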
47
6.2 Linearized ADMM
Consider the convex optimization problem with separable objective function:
min{θ1(x) + θ2(y) | Ax + By = b, x ∈ X, y ∈ Y}. (6.9)
The augmented Lagrangian function of the problem (6.9) is
L²β(x, y, λ) = θ1(x) + θ2(y) − λ^T(Ax + By − b) + (β/2)‖Ax + By − b‖².
Starting with a given (y^k, λ^k), the k-th iteration of the classical ADMM [6, 7]
generates the new iterate w^{k+1} = (x^{k+1}, y^{k+1}, λ^{k+1}) via

(ADMM)  x^{k+1} = argmin{L²β(x, y^k, λ^k) | x ∈ X},           (6.10a)
        y^{k+1} = argmin{L²β(x^{k+1}, y, λ^k) | y ∈ Y},        (6.10b)
        λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} − b).            (6.10c)
In (6.10a) and (6.10b), the optimization subproblems are

min{θ1(x) + (β/2)‖Ax − p^k‖² | x ∈ X}  and  min{θ2(y) + (β/2)‖By − q^k‖² | y ∈ Y},
48
respectively. We assume that one of the minimization subproblems (without loss
of generality, say, (6.10b)) should be simplified. Notice the following:
• Writing out L²β(x^{k+1}, y, λ^k) and ignoring the constant term in the
  objective function, we have

  argmin{L²β(x^{k+1}, y, λ^k) | y ∈ Y}
  = argmin{θ2(y) − (λ^k)^T(Ax^{k+1} + By − b) + (β/2)‖Ax^{k+1} + By − b‖² | y ∈ Y}
  = argmin{θ2(y) − (λ^k)^T By + (β/2)‖(Ax^{k+1} + By^k − b) + B(y − y^k)‖² | y ∈ Y}
  = argmin{θ2(y) − y^T B^T[λ^k − β(Ax^{k+1} + By^k − b)] + (β/2)‖B(y − y^k)‖² | y ∈ Y}.  (6.11)
• In the so-called linearized ADMM, the term (β/2)‖B(y − y^k)‖² is replaced
  with (s/2)‖y − y^k‖². Thus, the y-subproblem becomes

49

  y^{k+1} = argmin{θ2(y) − y^T B^T[λ^k − β(Ax^{k+1} + By^k − b)] + (s/2)‖y − y^k‖² | y ∈ Y}.  (6.12)

In fact, the linearized ADMM simplifies the quadratic term (β/2)‖B(y − y^k)‖².
In comparison with (6.11), the simplified y-subproblem (6.12) is equivalent to

y^{k+1} = argmin{L²β(x^{k+1}, y, λ^k) + (1/2)‖y − y^k‖²_{D_B} | y ∈ Y},  (6.13)

where

D_B = sI − βB^T B.  (6.14)

In order to ensure convergence, it was required that s > β‖B^T B‖.
Thus, the mathematical form of the linearized ADMM can be written as

x^{k+1} = argmin{L²β(x, y^k, λ^k) | x ∈ X},                                 (6.15a)
y^{k+1} = argmin{L²β(x^{k+1}, y, λ^k) + (1/2)‖y − y^k‖²_{D_B} | y ∈ Y},    (6.15b)
λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} − b),                                 (6.15c)

50

where D_B is defined by (6.14).
A large parameter s leads to slow convergence of the linearized ADMM.
Recent advance: optimal choice of the linearization factor (conclusions of online report 6228)
Recent Advance. Bingsheng He, Feng Ma, Xiaoming Yuan:
Optimal Linearized Alternating Direction Method of Multipliers for Convex
Programming. http://www.optimization-online.org/DB_HTML/2017/09/6228.html

Our new result in the above paper: for the matrix D_B in (6.15b) with the form (6.14),

• if s > (3/4)β‖B^T B‖ is taken in the method (6.15), it is still convergent;
• if s < (3/4)β‖B^T B‖ is taken in the method (6.15), there exists a divergent example.
0.75 is the threshold factor of s in the matrix D_B for the linearized ADMM (6.15)!
Notice that the matrix D_B defined in (6.14) is indefinite for s ∈ (0.75β‖B^T B‖, β‖B^T B‖)!
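The sketch below illustrates the point on a tiny two-block problem with A = B = I (so ‖B^T B‖ = 1) and s = 0.8β — inside the "indefinite" range (0.75β, β) where D_B is indefinite but the method still converges. The problem data and iteration count are illustrative assumptions, not from the slides.

```python
# Linearized ADMM (6.15) on a toy problem:
#   min 0.5*||x - c1||^2 + 0.5*||y - c2||^2   s.t.  x + y = b   (A = B = I),
# so ||B^T B|| = 1 and we take s = 0.8*beta, i.e. D_B = s*I - beta*I indefinite.

beta = 1.0
s = 0.8 * beta                      # 0.75*beta < s < beta

c1, c2, b = [1.0, 0.0], [0.0, 1.0], [2.0, 0.0]
x, y, lam = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]

for _ in range(20000):
    # x-step (6.15a), exact:  (1 + beta) x = c1 + lam + beta*(b - y_k)
    x = [(c1i + li + beta * (bi - yi)) / (1.0 + beta)
         for c1i, li, bi, yi in zip(c1, lam, b, y)]
    # linearized y-step (6.12):  (1 + s) y = c2 + t + s*y_k,
    # with t = lam - beta*(x^{k+1} + y_k - b)
    t = [li - beta * (xi + yi - bi) for li, xi, yi, bi in zip(lam, x, y, b)]
    y = [(c2i + ti + s * yi) / (1.0 + s) for c2i, ti, yi in zip(c2, t, y)]
    # multiplier update (6.15c)
    lam = [li - beta * (xi + yi - bi) for li, xi, yi, bi in zip(lam, x, y, b)]

# KKT: lam* = (b - c1 - c2)/2 = (0.5, -0.5), x* = c1 + lam*, y* = c2 + lam*.
print(x, y)
```

As before, one convergent run only illustrates the theorem; the divergence for s below the threshold comes from a specially constructed example.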
51
6.3 Parameter improvements in the method for the problem with 3 separable objective functions
For the problem with three separable objective functions
min{θ1(x) + θ2(y) + θ3(z) | Ax + By + Cz = b, x ∈ X, y ∈ Y, z ∈ Z},  (6.16)
the augmented Lagrangian function is
L³β(x, y, z, λ) = θ1(x) + θ2(y) + θ3(z) − λ^T(Ax + By + Cz − b) + (β/2)‖Ax + By + Cz − b‖².
Using the direct extension of ADMM to solve the problem (6.16), the formula is

x^{k+1} = argmin{L³β(x, y^k, z^k, λ^k) | x ∈ X},
y^{k+1} = argmin{L³β(x^{k+1}, y, z^k, λ^k) | y ∈ Y},
z^{k+1} = argmin{L³β(x^{k+1}, y^{k+1}, z, λ^k) | z ∈ Z},
λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b).  (6.17)
Unfortunately, the direct extension (6.17) is not necessarily convergent [4] !
52
ADMM + Parallel Splitting ALM (the y- and z-subproblems are handled in parallel):
x^{k+1} = argmin{L³β(x, y^k, z^k, λ^k) | x ∈ X},
y^{k+1} = argmin{L³β(x^{k+1}, y, z^k, λ^k) | y ∈ Y},
z^{k+1} = argmin{L³β(x^{k+1}, y^k, z, λ^k) | z ∈ Z},
λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b).

Treating the y- and z-subproblems in parallel, however, convergence cannot be guaranteed.
ADMM + Parallel-Prox Splitting ALM

Adding suitable proximal regularization terms (with τ > 1) to the parallel y- and z-subproblems guarantees convergence:
x^{k+1} = argmin{L³β(x, y^k, z^k, λ^k) | x ∈ X},                               (6.18a)
y^{k+1} = argmin{L³β(x^{k+1}, y, z^k, λ^k) + (τβ/2)‖B(y − y^k)‖² | y ∈ Y},
z^{k+1} = argmin{L³β(x^{k+1}, y^k, z, λ^k) + (τβ/2)‖C(z − z^k)‖² | z ∈ Z},     (6.18b)
λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b).                          (6.18c)
53
Notice that (6.18b) can be written as

(y^{k+1}, z^{k+1}) = argmin{L³β(x^{k+1}, y, z, λ^k) + (1/2)‖(y − y^k; z − z^k)‖²_{D_BC} | y ∈ Y, z ∈ Z},

where

D_BC = β ( τB^T B    −B^T C
           −C^T B    τC^T C ).  (6.19)

D_BC is positive semidefinite when τ ≥ 1.

However, the matrix D_BC is indefinite for τ ∈ (0, 1).
In other words, the scheme (6.18) can be rewritten as

x^{k+1} = argmin{L³β(x, y^k, z^k, λ^k) | x ∈ X},
(y^{k+1}, z^{k+1}) = argmin{L³β(x^{k+1}, y, z, λ^k) + (1/2)‖(y − y^k; z − z^k)‖²_{D_BC} | y ∈ Y, z ∈ Z},
λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b),
54
The algorithm (6.18) can be rewritten in an equivalent form (with µ = τ + 1 > 2):

x^{k+1} = argmin{θ1(x) + (β/2)‖Ax + By^k + Cz^k − b − (1/β)λ^k‖² | x ∈ X},
λ^{k+1/2} = λ^k − β(Ax^{k+1} + By^k + Cz^k − b),
y^{k+1} = argmin{θ2(y) − (λ^{k+1/2})^T By + (µβ/2)‖B(y − y^k)‖² | y ∈ Y},
z^{k+1} = argmin{θ3(z) − (λ^{k+1/2})^T Cz + (µβ/2)‖C(z − z^k)‖² | z ∈ Z},
λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} + Cz^{k+1} − b).  (6.20)
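To make (6.20) concrete, here is a minimal scalar sketch assuming A = B = C = 1 and quadratic θᵢ, with µ = 2.01 > 2 as the convergence theory available at the time required. The problem data are illustrative, not from the slides.

```python
# Scheme (6.20) on a scalar 3-block toy problem:
#   min 0.5*(x-c1)^2 + 0.5*(y-c2)^2 + 0.5*(z-c3)^2   s.t.  x + y + z = b,
# with A = B = C = 1 and mu = 2.01 (> 2).

beta, mu = 1.0, 2.01
c1, c2, c3, b = 0.0, 1.0, 2.0, 6.0

x = y = z = lam = 0.0
for _ in range(20000):
    # x-step:  (1 + beta) x = c1 + lam + beta*(b - y_k - z_k)
    x = (c1 + lam + beta * (b - y - z)) / (1.0 + beta)
    lam_half = lam - beta * (x + y + z - b)           # lambda^{k+1/2}
    # y- and z-steps, computed in parallel from the same lam_half:
    #   (1 + mu*beta) y = c2 + lam_half + mu*beta*y_k   (and likewise for z)
    y_new = (c2 + lam_half + mu * beta * y) / (1.0 + mu * beta)
    z_new = (c3 + lam_half + mu * beta * z) / (1.0 + mu * beta)
    y, z = y_new, z_new
    lam = lam - beta * (x + y + z - b)

# KKT: lam* = (b - c1 - c2 - c3)/3 = 1, so (x*, y*, z*) = (1, 2, 3).
print(x, y, z, lam)
```

Note how the y- and z-updates use only λ^{k+1/2} and the previous iterates, so they are independent of each other and could run concurrently.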
The related publication:

• B. He, M. Tao and X. Yuan, A splitting method for separable convex programming, IMA J. Numerical Analysis, 31(2015), 394-426.

In the above paper, in order to ensure convergence, it was required that
τ > 1 (in (6.18)), which is equivalent to µ > 2 (in (6.20)).
55
This method was adopted by Osher's research group:
• E. Esser, M. Moller, S. Osher, G. Sapiro and J. Xin, A convex model for
non-negative matrix factorization and dimensionality reduction on physical
space, IEEE Trans. Imag. Process., 21(7), 3239-3252, 2012.
[Slides show excerpts from the above paper (IEEE Trans. Image Processing, vol. 21, no. 7, July 2012): Fig. 6, the results of the extended model applied to an RGB image, and the text noting that the ADMM-like method of their reference [34] involves an additional numerical parameter that, for their application, must be greater than two, which they set to 2.01.]

Thus, Osher's research group utilized the iterative formula (6.20); following our previous paper, they set
previous paper, they set
µ = 2.01, which is only slightly larger than 2.
A large parameter µ (or τ) leads to slow convergence.
56
Recent advance: optimal choice of the proximal factor (conclusions of online report 6235)
Recent Advance. Bingsheng He, Xiaoming Yuan: On the Optimal Proximal
Parameter of an ADMM-like Splitting Method for Separable Convex Programming.
http://www.optimization-online.org/DB_HTML/2017/10/6235.html
Our new assertion: in (6.18),

• if τ > 0.5, the method is still convergent;
• if τ < 0.5, there exists a divergent example.

Equivalently, in (6.20):

• if µ > 1.5, the method is still convergent;
• if µ < 1.5, there exists a divergent example.
For the convex optimization problem (6.16) with three separable objective functions, the parameters in the equivalent methods (6.18) and (6.20):

• 0.5 is the threshold factor of the parameter τ in (6.18)!
• 1.5 is the threshold factor of the parameter µ in (6.20)!
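The new threshold can be illustrated numerically: the sketch below runs the same scalar toy version of (6.20) as before (A = B = C = 1, quadratic θᵢ, illustrative data) with µ = 1.6 — below the old requirement µ > 2, yet above the new threshold 1.5 — and still reaches the solution. One convergent run only illustrates the assertion; the divergence for µ < 1.5 is shown by a specifically constructed example in the paper.

```python
# Scalar version of (6.20) with mu = 1.6:
# below the old requirement mu > 2, above the new threshold mu > 1.5.

def run_620(mu, beta=1.0, iters=20000):
    # toy data:  min sum 0.5*(.-c_i)^2  s.t.  x + y + z = b  (A = B = C = 1)
    c1, c2, c3, b = 0.0, 1.0, 2.0, 6.0
    x = y = z = lam = 0.0
    for _ in range(iters):
        x = (c1 + lam + beta * (b - y - z)) / (1.0 + beta)
        lam_half = lam - beta * (x + y + z - b)
        y, z = ((c2 + lam_half + mu * beta * y) / (1.0 + mu * beta),
                (c3 + lam_half + mu * beta * z) / (1.0 + mu * beta))
        lam -= beta * (x + y + z - b)
    return x, y, z

x, y, z = run_620(1.6)
# The exact solution of the toy problem is (x*, y*, z*) = (1, 2, 3).
print(x, y, z)
```

The same routine with any µ > 1.5 should behave similarly on this instance, while smaller µ means a smaller proximal term and, in general, faster convergence.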
57
References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3(1) (2010), 1-122.
[2] X.J. Cai, G.Y. Gu, B.S. He and X.M. Yuan, A proximal point algorithm revisit on the alternating direction method of multipliers, Science China Mathematics, 56 (2013), 2179-2186.
[3] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, J. Math. Imaging Vision, 40 (2011), 120-145.
[4] C. H. Chen, B. S. He, Y. Y. Ye and X. M. Yuan, The direct extension of ADMM for multi-block convex
minimization problems is not necessarily convergent, Mathematical Programming, Series A 2016.
[5] E. Esser, M. Moller, S. Osher, G. Sapiro and J. Xin, A convex model for non-negative matrix factorization and
dimensionality reduction on physical space, IEEE Trans. Imag. Process., 21(7), 3239-3252, 2012.
[6] D. Gabay, Applications of the method of multipliers to variational inequalities, Augmented Lagrange Methods:
Applications to the Solution of Boundary-valued Problems, edited by M. Fortin and R. Glowinski, North Holland,
Amsterdam, The Netherlands, 1983, pp. 299–331.
[7] R. Glowinski, Numerical Methods for Nonlinear Variational Problems, Springer-Verlag, New York, Berlin,
Heidelberg, Tokyo, 1984.
[8] D. Hallac, Ch. Wong, S. Diamond, A. Sharang, R. Sosic, S. Boyd and J. Leskovec, SnapVX: A Network-Based
Convex Optimization Solver, Journal of Machine Learning Research 18 (2017) 1-5.
[9] B. S. He, H. Liu, Z.R. Wang and X.M. Yuan, A strictly contractive Peaceman-Rachford splitting method for
58
convex programming, SIAM Journal on Optimization 24(2014), 1011-1040.
[10] B. S. He, M. Tao and X.M. Yuan, Alternating direction method with Gaussian back substitution for separable
convex programming, SIAM Journal on Optimization 22(2012), 313-340.
[11] B.S. He, M. Tao and X.M. Yuan, A splitting method for separable convex programming, IMA J. Numerical
Analysis 31(2015), 394-426.
[12] B.S. He, H. Yang, and S.L. Wang, Alternating directions method with self-adaptive penalty parameters for
monotone variational inequalities, JOTA 23(2000), 349–368.
[13] B.S. He and X.M. Yuan, On the O(1/t) convergence rate of the alternating direction method, SIAM J. Numerical Analysis, 50(2012), 700-709.
[14] B.S. He and X.M. Yuan, Convergence analysis of primal-dual algorithms for a saddle-point problem: From
contraction perspective, SIAM J. Imag. Science 5(2012), 119-149.
[15] M. R. Hestenes, Multiplier and gradient methods, JOTA 4, 303-320, 1969.
[16] B. Martinet, Régularisation d'inéquations variationnelles par approximations successives, Rev. Française d'Inform. Recherche Opér., 4 (1970), 154-159.
[17] A. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM J. Optim., 15 (2004), 229-251.
[18] J. Nocedal and S. J. Wright, Numerical Optimization, Springer Verlag, New York, 1999.
[19] M. J. D. Powell, A method for nonlinear constraints in minimization problems, in Optimization, R. Fletcher, ed.,
Academic Press, New York, NY, pp. 283-298, 1969.
59
Thank you very much for your attention !
60
Thank you very much for reading !