
L. Vandenberghe EE236C (Spring 2016)

3. Conjugate gradient method

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• applications in nonlinear optimization

3-1

Unconstrained quadratic minimization

minimize   $f(x) = \tfrac{1}{2}x^T A x - b^T x$

with $A \in \mathbf{S}^n_{++}$

• equivalent to solving the linear equation $Ax = b$

• the residual $r = b - Ax$ is the negative gradient: $r = -\nabla f(x)$

Conjugate gradient method (CG)

• invented by Hestenes and Stiefel around 1951

• the most widely used iterative method for solving $Ax = b$, with $A \succ 0$

• can be extended to non-quadratic unconstrained minimization

Conjugate gradient method 3-2

Krylov subspaces

Definition: a sequence of nested subspaces (K0 ⊆ K1 ⊆ K2 ⊆ · · · )

  $K_0 = \{0\}, \qquad K_k = \mathrm{span}\{b, Ab, \ldots, A^{k-1}b\} \;\text{ for } k \geq 1$

if $K_{k+1} = K_k$, then $K_i = K_k$ for all $i \geq k$

Key property: $A^{-1}b \in K_n$ (even when $K_n \neq \mathbf{R}^n$)

• from the Cayley-Hamilton theorem,

  $p(A) = A^n + a_1 A^{n-1} + \cdots + a_n I = 0$

  where $p(\lambda) = \det(\lambda I - A) = \lambda^n + a_1\lambda^{n-1} + \cdots + a_{n-1}\lambda + a_n$

• therefore

  $A^{-1}b = -\frac{1}{a_n}\left(A^{n-1}b + a_1 A^{n-2}b + \cdots + a_{n-1}b\right)$

Conjugate gradient method 3-3
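The key property is easy to check numerically. Below is a minimal NumPy sketch (the test matrix, right-hand side, and all variable names are our own illustration, not from the slides): it forms the characteristic-polynomial coefficients and verifies that the indicated combination of $b, Ab, \ldots, A^{n-1}b$ solves $Ax = b$.

```python
import numpy as np

# Check A^{-1} b in K_n via the Cayley-Hamilton argument above.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # symmetric positive definite test matrix
b = rng.standard_normal(n)

# characteristic polynomial lambda^n + a1 lambda^{n-1} + ... + an
coeffs = np.poly(np.linalg.eigvalsh(A))   # [1, a1, ..., an]
a = coeffs[1:]

# powers[i] = A^i b
powers = [b]
for _ in range(n - 1):
    powers.append(A @ powers[-1])

# A^{-1} b = -(1/an) (A^{n-1} b + a1 A^{n-2} b + ... + a_{n-1} b)
x = -(powers[n - 1] + sum(a[i] * powers[n - 2 - i] for i in range(n - 1))) / a[n - 1]
print(np.allclose(A @ x, b))              # True
```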

Krylov sequence

  $x^{(k)} = \underset{x \in K_k}{\mathrm{argmin}}\; f(x), \qquad k \geq 0$

• from the previous page, $x^{(n)} = A^{-1}b$

• CG is a recursive method for computing the Krylov sequence $x^{(0)}, x^{(1)}, \ldots$

• we will see there is a simple two-term recurrence

  $x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)}) + \gamma_k (x^{(k)} - x^{(k-1)})$

Example

  $A = \begin{bmatrix} 1 & 0 \\ 0 & 10 \end{bmatrix}, \qquad b = \begin{bmatrix} 10 \\ 10 \end{bmatrix}$

(figure: the Krylov sequence plotted in the $(x_1, x_2)$-plane)

Conjugate gradient method 3-4

Residuals of Krylov sequence

• optimality conditions in the definition of the Krylov sequence:

  $x^{(k)} \in K_k, \qquad \nabla f(x^{(k)}) = Ax^{(k)} - b \in K_k^{\perp}$

• hence, the residual $r_k = b - Ax^{(k)}$ satisfies

  $r_k \in K_{k+1}, \qquad r_k \in K_k^{\perp}$

  (the first property follows from $b \in K_1$ and $x^{(k)} \in K_k$)

the (nonzero) residuals form an orthogonal basis for the Krylov subspaces:

  $K_k = \mathrm{span}\{r_0, r_1, \ldots, r_{k-1}\}, \qquad r_i^T r_j = 0 \;\;(i \neq j)$

Conjugate gradient method 3-5

Conjugate directions

the ‘steps’ $v_i = x^{(i)} - x^{(i-1)}$ in the Krylov sequence satisfy

  $v_i^T A v_j = 0 \;\text{ for } i \neq j, \qquad v_i^T A v_i = v_i^T r_{i-1}$

(proof on next page)

• the vectors $v_i$ are ‘conjugate’: orthogonal for the inner product $\langle v, w \rangle = v^T A w$

• in particular, if $v_i \neq 0$, it is independent of $v_1, \ldots, v_{i-1}$

the (nonzero) vectors $v_i$ form a ‘conjugate’ basis for the Krylov subspaces:

  $K_k = \mathrm{span}\{v_1, v_2, \ldots, v_k\}, \qquad v_i^T A v_j = 0 \;\;(i \neq j)$

Conjugate gradient method 3-6
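The orthogonality of the residuals (page 3-5) and the conjugacy of the steps (page 3-6) can be checked numerically by computing the Krylov sequence directly from its definition $x^{(k)} = \mathrm{argmin}_{x \in K_k} f(x)$. A sketch with arbitrary test data (not from the slides):

```python
import numpy as np

# Compute the Krylov sequence from its definition and verify:
#   r_i^T r_j = 0 and v_i^T A v_j = 0 for i != j, and x(n) = A^{-1} b.
rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                # symmetric positive definite
b = rng.standard_normal(n)

X = [np.zeros(n)]                          # x(0) = 0
for k in range(1, n + 1):
    # Krylov basis [b, Ab, ..., A^{k-1} b]; with x = K c, the minimizer of f
    # over K_k solves (K^T A K) c = K^T b
    K = np.column_stack([np.linalg.matrix_power(A, i) @ b for i in range(k)])
    c = np.linalg.solve(K.T @ A @ K, K.T @ b)
    X.append(K @ c)

R = [b - A @ x for x in X]                       # residuals r_0, ..., r_n
V = [X[i] - X[i - 1] for i in range(1, n + 1)]   # steps v_1, ..., v_n

Rg = np.array([[R[i] @ R[j] for j in range(n)] for i in range(n)])
Vg = np.array([[V[i] @ A @ V[j] for j in range(n)] for i in range(n)])
print(np.max(np.abs(Rg - np.diag(np.diag(Rg)))))   # off-diagonals ~ 0
print(np.max(np.abs(Vg - np.diag(np.diag(Vg)))))   # off-diagonals ~ 0
print(np.linalg.norm(A @ X[n] - b))                # x(n) = A^{-1} b (page 3-4)
```

Building an explicit Krylov basis like this is only for illustration; it becomes numerically poor as $k$ grows, which is one reason the recursive updates derived next are preferred.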

Proof of properties on page 3-6 (assume j < i)

• $v_i^T A v_j = 0$ because

  $v_j = x^{(j)} - x^{(j-1)} \in K_j \subseteq K_{i-1}$

  and $Av_i = A(x^{(i)} - x^{(i-1)}) = -r_i + r_{i-1} \in K_{i-1}^{\perp}$

• the expression $v_i^T A v_i = v_i^T r_{i-1}$ follows from the fact that $t = 1$ minimizes

  $f(x^{(i-1)} + t v_i) = f(x^{(i-1)}) + \tfrac{1}{2}t^2 v_i^T A v_i - t\, v_i^T r_{i-1}$

  (since $x^{(i)} = x^{(i-1)} + v_i$ minimizes $f$ over the entire subspace $K_i$)

Conjugate gradient method 3-7

Conjugate vectors

instead of $v_i$, we will work with a sequence $p_i$ of scaled vectors $v_i$:

  $p_i = \frac{\|r_{i-1}\|_2^2}{v_i^T r_{i-1}}\, v_i$

• the scaling factor is chosen to satisfy $r_{i-1}^T p_i = \|r_{i-1}\|_2^2$; equivalently,

  $-\nabla f(x^{(i-1)})^T p_i = \|\nabla f(x^{(i-1)})\|_2^2$

• using $v_i^T A v_i = v_i^T r_{i-1}$ (page 3-6), we can write the scaling factor as

  $\frac{\|r_{i-1}\|_2^2}{v_i^T r_{i-1}} = \frac{\|r_{i-1}\|_2^2}{v_i^T A v_i} = \frac{p_i^T A p_i}{\|r_{i-1}\|_2^2}$

• with this notation we can write the update as

  $x^{(i)} = x^{(i-1)} + \alpha p_i, \qquad \alpha = \frac{\|r_{i-1}\|_2^2}{p_i^T A p_i}$

Conjugate gradient method 3-8

Recursion for pk

$p_k \in K_k = \mathrm{span}\{p_1, p_2, \ldots, p_{k-1}, r_{k-1}\}$, so we can express $p_k$ as

  $p_1 = \delta r_0, \qquad p_k = \delta r_{k-1} + \beta p_{k-1} + \sum_{i=1}^{k-2} \gamma_i p_i \quad (k > 1)$

• $\gamma_1 = \cdots = \gamma_{k-2} = 0$: take inner products with $Ap_j$ for $j \leq k-2$, and use

  $p_j^T A p_i = 0 \;\text{ for } j \neq i, \qquad p_j^T A r_{k-1} = 0$

  (second equality because $Ap_j \in K_{j+1} \subseteq K_{k-1}$ and $r_{k-1} \in K_{k-1}^{\perp}$)

• $\delta = 1$: take the inner product with $r_{k-1}$ and use $r_{k-1}^T p_k = \|r_{k-1}\|_2^2$

• hence, $p_k = r_{k-1} + \beta p_{k-1}$; the inner product with $Ap_{k-1}$ shows that

  $\beta = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}}$

Conjugate gradient method 3-9

Basic conjugate gradient algorithm

Initialize: x(0) = 0, r0 = b

For k = 1, 2, . . .

1. if $k = 1$, take $p_k = r_0$; otherwise, take

   $p_k = r_{k-1} + \beta p_{k-1}$ where $\beta = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}}$

2. compute

   $\alpha = \frac{\|r_{k-1}\|_2^2}{p_k^T A p_k}, \qquad x^{(k)} = x^{(k-1)} + \alpha p_k, \qquad r_k = b - Ax^{(k)}$

   if $r_k$ is sufficiently small, return $x^{(k)}$

Conjugate gradient method 3-10
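A direct NumPy transcription of this basic algorithm (function and variable names and the stopping threshold are our own choices). Written this way, each iteration performs several products with $A$, which is exactly what the improvements on the next page eliminate.

```python
import numpy as np

def cg_basic(A, b, tol=1e-10, maxiter=None):
    """Basic CG of page 3-10, transcribed literally (illustrative sketch)."""
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = b.copy()                   # r_0 = b since x(0) = 0
    p = r.copy()
    for k in range(1, maxiter + 1):
        if k > 1:
            # beta = -p_{k-1}^T A r_{k-1} / p_{k-1}^T A p_{k-1}
            beta = -(p @ (A @ r)) / (p @ (A @ p))
            p = r + beta * p
        alpha = (r @ r) / (p @ (A @ p))
        x = x + alpha * p
        r = b - A @ x              # residual recomputed from scratch
        if np.linalg.norm(r) <= tol:
            break
    return x
```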

Improvements

Step 2: compute the residual recursively:

  $r_k = r_{k-1} - \alpha A p_k$

Step 1: simplify the expression for $\beta$ by using

  $r_{k-1} = r_{k-2} - \frac{\|r_{k-2}\|_2^2}{p_{k-1}^T A p_{k-1}}\, A p_{k-1}$

taking the inner product with $r_{k-1}$ gives

  $\beta = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}} = \frac{\|r_{k-1}\|_2^2}{\|r_{k-2}\|_2^2}$

this reduces the number of matrix-vector products to one per iteration (the product $Ap_k$)

Conjugate gradient method 3-11

Conjugate gradient algorithm

Initialize: x(0) = 0, r0 = b

For k = 1, 2, . . .

1. if $k = 1$, take $p_k = r_0$; otherwise, take

   $p_k = r_{k-1} + \beta p_{k-1}$ where $\beta = \frac{\|r_{k-1}\|_2^2}{\|r_{k-2}\|_2^2}$

2. compute

   $\alpha = \frac{\|r_{k-1}\|_2^2}{p_k^T A p_k}, \qquad x^{(k)} = x^{(k-1)} + \alpha p_k, \qquad r_k = r_{k-1} - \alpha A p_k$

   if $r_k$ is sufficiently small, return $x^{(k)}$

Conjugate gradient method 3-12
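A NumPy sketch of this algorithm, with the recursive residual and the simplified $\beta$, so each iteration uses a single product with $A$ (names, the stopping rule, and the test problem are our own choices; for large sparse problems one would supply a routine that applies $A$ instead of a dense matrix).

```python
import numpy as np

def cg(A, b, tol=1e-10, maxiter=None):
    """CG algorithm of page 3-12 (illustrative sketch)."""
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = b.copy()
    p = r.copy()
    rho = r @ r                          # ||r_{k-1}||_2^2
    for _ in range(maxiter):
        if np.sqrt(rho) <= tol:
            break
        Ap = A @ p                       # the single matrix-vector product
        alpha = rho / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap               # r_k = r_{k-1} - alpha A p_k
        rho_new = r @ r
        p = r + (rho_new / rho) * p      # beta = ||r_k||^2 / ||r_{k-1}||^2
        rho = rho_new
    return x

# example: a 200x200 SPD system (arbitrary test data)
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)
b = rng.standard_normal(200)
print(np.linalg.norm(A @ cg(A, b) - b))  # small residual
```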

Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• applications in nonlinear optimization

Notation

  minimize  $f(x) = \tfrac{1}{2}x^T A x - b^T x$

Optimal value

  $f(x^\star) = -\tfrac{1}{2} b^T A^{-1} b = -\tfrac{1}{2}\|x^\star\|_A^2$

Suboptimality at $x$

  $f(x) - f^\star = \tfrac{1}{2}\|x - x^\star\|_A^2$

Relative error measure

  $\tau = \frac{f(x) - f^\star}{f(0) - f^\star} = \frac{\|x - x^\star\|_A^2}{\|x^\star\|_A^2}$

here, $\|u\|_A = (u^T A u)^{1/2}$ is the $A$-weighted norm

Conjugate gradient method 3-13

Error after k steps

• $x^{(k)} \in K_k = \mathrm{span}\{b, Ab, A^2 b, \ldots, A^{k-1}b\}$, so $x^{(k)}$ can be expressed as

  $x^{(k)} = \sum_{i=1}^{k} c_i A^{i-1} b = p(A)\, b$

  where $p(\lambda) = \sum_{i=1}^{k} c_i \lambda^{i-1}$ is some polynomial of degree $k-1$ or less

• $x^{(k)}$ minimizes $f(x)$ over $K_k$; hence

  $2\left(f(x^{(k)}) - f^\star\right) = \inf_{x \in K_k} \|x - x^\star\|_A^2 = \inf_{\deg p < k} \left\|(p(A) - A^{-1})b\right\|_A^2$

we now use the eigenvalue decomposition of $A$ to bound this quantity

Conjugate gradient method 3-14

Error and spectrum of A

• eigenvalue decomposition of $A$:

  $A = Q\Lambda Q^T = \sum_{i=1}^{n} \lambda_i q_i q_i^T \qquad (Q^T Q = I, \;\; \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n))$

• define $d = Q^T b$

the expression on the previous page simplifies to

  $2\left(f(x^{(k)}) - f^\star\right) = \inf_{\deg p < k} \left\|(p(A) - A^{-1})b\right\|_A^2$

  $\qquad = \inf_{\deg p < k} \left\|(p(\Lambda) - \Lambda^{-1})d\right\|_\Lambda^2$

  $\qquad = \inf_{\deg p < k} \sum_{i=1}^{n} \frac{(\lambda_i p(\lambda_i) - 1)^2 d_i^2}{\lambda_i}$

  $\qquad = \inf_{\deg q \leq k,\; q(0)=1} \sum_{i=1}^{n} \frac{q(\lambda_i)^2 d_i^2}{\lambda_i}$

Conjugate gradient method 3-15

Error bounds

Absolute error

  $f(x^{(k)}) - f^\star \;\leq\; \left(\sum_{i=1}^{n} \frac{d_i^2}{2\lambda_i}\right) \inf_{\deg q \leq k,\; q(0)=1}\; \left(\max_{i=1,\ldots,n} q(\lambda_i)^2\right)$

  $\qquad = \tfrac{1}{2}\|x^\star\|_A^2 \; \inf_{\deg q \leq k,\; q(0)=1}\; \left(\max_{i=1,\ldots,n} q(\lambda_i)^2\right)$

(the equality follows from $\sum_i d_i^2/\lambda_i = b^T A^{-1} b = \|x^\star\|_A^2$)

Relative error

  $\tau_k = \frac{\|x^{(k)} - x^\star\|_A^2}{\|x^\star\|_A^2} \;\leq\; \inf_{\deg q \leq k,\; q(0)=1}\; \left(\max_{i=1,\ldots,n} q(\lambda_i)^2\right)$

Conjugate gradient method 3-16

Convergence rate and spectrum of A

• if $A$ has $m$ distinct eigenvalues $\gamma_1, \ldots, \gamma_m$, CG terminates in $m$ steps:

  $q(\lambda) = \frac{(-1)^m}{\gamma_1 \cdots \gamma_m}(\lambda - \gamma_1)\cdots(\lambda - \gamma_m)$

  satisfies $\deg q = m$, $q(0) = 1$, $q(\lambda_1) = \cdots = q(\lambda_n) = 0$; therefore $\tau_m = 0$

• if the eigenvalues are clustered in $m$ groups, then $\tau_m$ is small

  one can find $q(\lambda)$ of degree $m$, with $q(0) = 1$, that is small on the spectrum

• if $x^\star$ is a linear combination of $m$ eigenvectors, CG terminates in $m$ steps

  take $q$ of degree $m$ with $q(\lambda_i) = 0$ where $d_i \neq 0$; then

  $\sum_{i=1}^{n} \frac{q(\lambda_i)^2 d_i^2}{\lambda_i} = 0$

Conjugate gradient method 3-17
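A numerical illustration of the first point: when $A$ has $m$ distinct eigenvalues, the Krylov iterate $x^{(m)}$ already equals $A^{-1}b$. The construction and data below are our own test setup, with $x^{(m)}$ computed directly from its definition.

```python
import numpy as np

# A is 100x100 but has only m = 4 distinct eigenvalues.
rng = np.random.default_rng(3)
n, m = 100, 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal eigenvectors
eigs = np.repeat([1.0, 2.0, 5.0, 10.0], n // m)      # m distinct eigenvalues
A = (Q * eigs) @ Q.T
b = rng.standard_normal(n)

# x(m) = argmin of f over K_m: with x = K c, solve (K^T A K) c = K^T b
K = np.column_stack([np.linalg.matrix_power(A, i) @ b for i in range(m)])
c = np.linalg.solve(K.T @ A @ K, K.T @ b)
x_m = K @ c
print(np.linalg.norm(A @ x_m - b))     # ~ 0 after only m = 4 steps
```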

Other bounds

we omit the proofs of the following results

• in terms of the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$:

  $\tau_k \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}$

  derived by taking for $q$ a Chebyshev polynomial on $[\lambda_{\min}, \lambda_{\max}]$

• in terms of the sorted eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$:

  $\tau_k \leq \left(\frac{\lambda_k - \lambda_n}{\lambda_k + \lambda_n}\right)^{2}$

  derived by taking $q$ with roots at $\lambda_1, \ldots, \lambda_{k-1}$ and $(\lambda_k + \lambda_n)/2$

Conjugate gradient method 3-18
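A small sketch that evaluates the condition-number bound and the iteration count it predicts for a target accuracy (the sample values of $\kappa$ and the tolerance are arbitrary illustrations):

```python
import numpy as np

# tau_k <= 2 ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))^k
def cg_bound(kappa, k):
    r = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    return 2 * r**k

# smallest k for which the bound drops below a target tau
def iterations_predicted(kappa, tau=1e-6):
    r = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    return int(np.ceil(np.log(tau / 2) / np.log(r)))

for kappa in [10, 100, 1000]:
    print(kappa, cg_bound(kappa, 25), iterations_predicted(kappa))
```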

Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• applications in nonlinear optimization

Conjugate gradient method as iterative method

In exact arithmetic

• CG was originally proposed as a direct (non-iterative) method

• in theory, terminates in at most n steps

In practice

• due to rounding errors, CG method can take many more than n steps (or fail)

• CG is now used as an iterative method

• with luck (a good spectrum of $A$), a good approximation is reached in a small number of steps

• attractive if matrix-vector products are inexpensive

Conjugate gradient method 3-19

Preconditioning

• make a change of variables $y = Bx$ with $B$ nonsingular, and apply CG to

  $B^{-T}AB^{-1}y = B^{-T}b$

• if the spectrum of $B^{-T}AB^{-1}$ is clustered, PCG converges fast

• trade-off between enhanced convergence and the cost of the extra computation

• the matrix $C = B^T B$ is called the preconditioner

Examples

• diagonal: $C = \mathrm{diag}(A_{11}, A_{22}, \ldots, A_{nn})$

• incomplete or approximate Cholesky factorization of A

• good preconditioners are often application-dependent

Conjugate gradient method 3-20
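A quick numerical look at the first example: for the diagonal preconditioner $C = \mathrm{diag}(A)$ one can take $B = \mathrm{diag}(A_{11}^{1/2}, \ldots, A_{nn}^{1/2})$ and compare the condition number of $A$ with that of $B^{-T}AB^{-1}$. The badly scaled test matrix below is our own construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
M = rng.standard_normal((n, n))
S = np.diag(np.logspace(0, 3, n))          # poor row/column scaling
A = S @ (M @ M.T + n * np.eye(n)) @ S      # SPD but badly conditioned

Binv = np.diag(1.0 / np.sqrt(np.diag(A)))  # B^{-1} for C = diag(A)
A_tilde = Binv @ A @ Binv                  # B^{-T} A B^{-1}

print(np.linalg.cond(A))                   # large
print(np.linalg.cond(A_tilde))             # much smaller after diagonal scaling
```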

Naive implementation

define $\tilde{A} = B^{-T}AB^{-1}$ and apply the algorithm of page 3-12 to $\tilde{A}y = B^{-T}b$

Initialize: $y^{(0)} = 0$, $\tilde{r}_0 = B^{-T}b$

For $k = 1, 2, \ldots$

1. if $k = 1$, take $\tilde{p}_k = \tilde{r}_0$; otherwise, take

   $\tilde{p}_k = \tilde{r}_{k-1} + \beta \tilde{p}_{k-1}$ where $\beta = \frac{\|\tilde{r}_{k-1}\|_2^2}{\|\tilde{r}_{k-2}\|_2^2}$

2. compute

   $\alpha = \frac{\|\tilde{r}_{k-1}\|_2^2}{\tilde{p}_k^T \tilde{A}\tilde{p}_k}, \qquad y^{(k)} = y^{(k-1)} + \alpha \tilde{p}_k, \qquad \tilde{r}_k = \tilde{r}_{k-1} - \alpha \tilde{A}\tilde{p}_k$

   if $\tilde{r}_k$ is sufficiently small, return $B^{-1}y^{(k)}$

Conjugate gradient method 3-21

Improvements

• instead of $y^{(k)}$, $\tilde{p}_k$, compute the iterates and steps in the original coordinates:

  $x^{(k)} = B^{-1}y^{(k)}, \qquad p_k = B^{-1}\tilde{p}_k$

• compute the residuals in the original coordinates:

  $r_k = B^T \tilde{r}_k = b - Ax^{(k)}$

• compute the squared residual norms as

  $\|\tilde{r}_{k-1}\|_2^2 = r_{k-1}^T C^{-1} r_{k-1}$

• the extra work per iteration is solving one equation to compute $C^{-1}r_{k-1}$

Conjugate gradient method 3-22

Preconditioned conjugate gradient algorithm

Initialize: x(0) = 0, r0 = b

For k = 1, 2, . . .

1. solve the equation $Cs_k = r_{k-1}$

2. if $k = 1$, take $p_k = s_k$; otherwise, take

   $p_k = s_k + \beta p_{k-1}$ where $\beta = \frac{r_{k-1}^T s_k}{r_{k-2}^T s_{k-1}}$

3. compute

   $\alpha = \frac{r_{k-1}^T s_k}{p_k^T A p_k}, \qquad x^{(k)} = x^{(k-1)} + \alpha p_k, \qquad r_k = r_{k-1} - \alpha A p_k$

   if $r_k$ is sufficiently small, return $x^{(k)}$

Conjugate gradient method 3-23
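A NumPy sketch of this preconditioned algorithm; the preconditioner is supplied as a routine solve_C that returns $C^{-1}r$ (names, the stopping rule, and the test problem are our own choices).

```python
import numpy as np

def pcg(A, b, solve_C, tol=1e-10, maxiter=None):
    """Preconditioned CG of page 3-23 (illustrative sketch)."""
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = b.copy()
    s = solve_C(r)                       # step 1: solve C s_k = r_{k-1}
    p = s.copy()
    rho = r @ s                          # r_{k-1}^T s_k
    for _ in range(maxiter):
        if np.linalg.norm(r) <= tol:
            break
        Ap = A @ p
        alpha = rho / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        s = solve_C(r)
        rho_new = r @ s
        p = s + (rho_new / rho) * p      # beta = r_{k-1}^T s_k / r_{k-2}^T s_{k-1}
        rho = rho_new
    return x

# example with the diagonal preconditioner C = diag(A) (arbitrary test data)
rng = np.random.default_rng(5)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)
b = rng.standard_normal(200)
d = np.diag(A)
print(np.linalg.norm(A @ pcg(A, b, lambda r: r / d) - b))   # small residual
```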

Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• applications in nonlinear optimization

Applications in optimization

Nonlinear conjugate gradient methods

• extend linear CG method to nonquadratic functions

• local convergence similar to linear CG

• limited global convergence theory

Inexact and truncated Newton methods

• use conjugate gradient method to compute (approximate) Newton step

• less reliable than exact Newton methods, but can handle very large problems

Conjugate gradient method 3-24

Nonlinear conjugate gradient

minimize f(x)

(f convex and differentiable)

Modifications needed to extend the linear CG algorithm of page 3-12:

• replace $r_k = b - Ax^{(k)}$ with $-\nabla f(x^{(k)})$

• determine $\alpha$ by a line search

Conjugate gradient method 3-25

Fletcher-Reeves CG algorithm

CG algorithm of page 3-12 modified to minimize non-quadratic convex f

Initialize: choose x(0)

For k = 1, 2, . . .

1. if $k = 1$, take $p_1 = -\nabla f(x^{(0)})$; otherwise, take

   $p_k = -\nabla f(x^{(k-1)}) + \beta_k p_{k-1}$ where $\beta_k = \frac{\|\nabla f(x^{(k-1)})\|_2^2}{\|\nabla f(x^{(k-2)})\|_2^2}$

2. update $x^{(k)} = x^{(k-1)} + \alpha_k p_k$ where

   $\alpha_k = \underset{\alpha}{\mathrm{argmin}}\; f(x^{(k-1)} + \alpha p_k)$

if ∇f(x(k)) is sufficiently small, return x(k)

Conjugate gradient method 3-26
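A Python sketch of the Fletcher-Reeves algorithm. The exact line search of step 2 is approximated here by a generic one-dimensional numerical minimization (scipy.optimize.minimize_scalar), an implementation convenience rather than part of the method; the test function is an arbitrary smooth convex example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, grad, x0, tol=1e-6, maxiter=1000):
    """Fletcher-Reeves CG of page 3-26 (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                   # first iteration: gradient step
    for _ in range(maxiter):
        if np.linalg.norm(g) <= tol:
            break
        alpha = minimize_scalar(lambda a: f(x + a * p)).x   # approximate exact line search
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)     # ||grad_k||^2 / ||grad_{k-1}||^2
        p = -g_new + beta * p
        g = g_new
    return x

# example: f(x) = sum_i exp(e_i^T x) for a few rows e_i (arbitrary convex test problem)
E = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
f = lambda x: np.sum(np.exp(E @ x))
grad = lambda x: E.T @ np.exp(E @ x)
x_min = fletcher_reeves(f, grad, np.array([1.0, 1.0]))
print(x_min, np.linalg.norm(grad(x_min)))    # gradient norm ~ 0 at the solution
```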

Some observations

Interpretation

• first iteration is a gradient step

• general update is gradient step with momentum term

  $x^{(k)} = x^{(k-1)} - \alpha_k \nabla f(x^{(k-1)}) + \frac{\alpha_k \beta_k}{\alpha_{k-1}}\left(x^{(k-1)} - x^{(k-2)}\right)$

• it is common to restart the algorithm periodically by taking a gradient step

Line search

• with exact line search, reduces to linear CG for quadratic f

• exact line search in step 2 implies $\nabla f(x^{(k)})^T p_k = 0$

• therefore, in step 1, $p_k$ is a descent direction at $x^{(k-1)}$:

  $\nabla f(x^{(k-1)})^T p_k = -\|\nabla f(x^{(k-1)})\|_2^2 < 0$

Conjugate gradient method 3-27

Variations

Polak-Ribière: in step 1, compute $\beta$ from

  $\beta = \frac{\nabla f(x^{(k-1)})^T\left(\nabla f(x^{(k-1)}) - \nabla f(x^{(k-2)})\right)}{\|\nabla f(x^{(k-2)})\|_2^2}$

Hestenes-Stiefel:

  $\beta = \frac{\nabla f(x^{(k-1)})^T\left(\nabla f(x^{(k-1)}) - \nabla f(x^{(k-2)})\right)}{p_{k-1}^T\left(\nabla f(x^{(k-1)}) - \nabla f(x^{(k-2)})\right)}$

the formulas are equivalent for quadratic $f$ and exact line search

Conjugate gradient method 3-28
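For reference, the three $\beta$ formulas with a common signature, so any of them can be swapped into the fletcher_reeves sketch above (helper names are ours, not from the slides); here $g = \nabla f(x^{(k-1)})$, $g_{\mathrm{old}} = \nabla f(x^{(k-2)})$, and $p = p_{k-1}$.

```python
# p is unused by the first two formulas; it is kept for a uniform signature.
def beta_fletcher_reeves(g, g_old, p):
    return (g @ g) / (g_old @ g_old)

def beta_polak_ribiere(g, g_old, p):
    return (g @ (g - g_old)) / (g_old @ g_old)

def beta_hestenes_stiefel(g, g_old, p):
    return (g @ (g - g_old)) / (p @ (g - g_old))
```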

Interpretation as restarted BFGS method

BFGS update (page 2-5) with $H_{k-1} = I$:

  $H_k^{-1} = I + \left(1 + \frac{y^T y}{s^T y}\right)\frac{s s^T}{y^T s} - \frac{y s^T + s y^T}{y^T s}$

where $y = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})$ and $s = x^{(k)} - x^{(k-1)}$

• $\nabla f(x^{(k)})^T s = 0$ if $x^{(k)}$ is determined by exact line search

• the quasi-Newton step in iteration $k$ is

  $-H_k^{-1}\nabla f(x^{(k)}) = -\nabla f(x^{(k)}) + \frac{y^T \nabla f(x^{(k)})}{y^T s}\, s$

  this is the Hestenes-Stiefel update

nonlinear CG can be interpreted as L-BFGS with $m = 1$

Conjugate gradient method 3-29

References

• S. Boyd, Lecture notes for EE364b, Convex Optimization II.

• G. H. Golub and C. F. Van Loan, Matrix Computations (1996), chapter 10.

• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapter 5.

• H. A. van der Vorst, Iterative Krylov Methods for Large Linear Systems (2003).

Conjugate gradient method 3-30