L. Vandenberghe EE236C (Spring 2016)
3. Conjugate gradient method
• conjugate gradient method for linear equations
• convergence analysis
• conjugate gradient method as iterative method
• applications in nonlinear optimization
3-1
Unconstrained quadratic minimization
minimize   $f(x) = \frac{1}{2} x^T A x - b^T x$

with $A \in \mathbf{S}^n_{++}$
• equivalent to solving the linear equation $Ax = b$
• the residual $r = b - Ax$ is the negative gradient: $r = -\nabla f(x)$
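a quick numerical check of the two bullets above (a sketch on a random example of my own, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # random symmetric positive definite A
b = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda z: 0.5 * z @ A @ z - b @ z

# finite-difference gradient of f at x (exact up to rounding, f is quadratic)
eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

print(np.allclose(fd_grad, A @ x - b, atol=1e-5))   # True: grad f(x) = Ax - b
print(np.allclose(b - A @ x, -fd_grad, atol=1e-5))  # True: r = b - Ax = -grad f(x)
```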
Conjugate gradient method (CG)
• invented by Hestenes and Stiefel around 1951
• the most widely used iterative method for solving $Ax = b$, with $A \succ 0$
• can be extended to non-quadratic unconstrained minimization
Conjugate gradient method 3-2
Krylov subspaces
Definition: a sequence of nested subspaces ($\mathcal{K}_0 \subseteq \mathcal{K}_1 \subseteq \mathcal{K}_2 \subseteq \cdots$)

$$\mathcal{K}_0 = \{0\}, \qquad \mathcal{K}_k = \mathrm{span}\{b, Ab, \ldots, A^{k-1}b\} \quad \text{for } k \geq 1$$

if $\mathcal{K}_{k+1} = \mathcal{K}_k$, then $\mathcal{K}_i = \mathcal{K}_k$ for all $i \geq k$

Key property: $A^{-1}b \in \mathcal{K}_n$ (even when $\mathcal{K}_n \neq \mathbf{R}^n$)
• from the Cayley-Hamilton theorem,

$$p(A) = A^n + a_1 A^{n-1} + \cdots + a_n I = 0$$

where $p(\lambda) = \det(\lambda I - A) = \lambda^n + a_1 \lambda^{n-1} + \cdots + a_{n-1}\lambda + a_n$

• therefore (note $a_n = p(0) = \det(-A) \neq 0$ since $A \succ 0$)

$$A^{-1}b = -\frac{1}{a_n} \left( A^{n-1}b + a_1 A^{n-2}b + \cdots + a_{n-1} b \right)$$
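a small numerical illustration of the key property (my own random example):

```python
import numpy as np

n = 6
rng = np.random.default_rng(1)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # random symmetric positive definite A
b = rng.standard_normal(n)

# columns b, Ab, ..., A^{n-1} b span K_n
K = np.column_stack([np.linalg.matrix_power(A, i) @ b for i in range(n)])

# coefficients c with K c = A^{-1} b are found by solving (A K) c = b
c = np.linalg.solve(A @ K, b)
print(np.allclose(K @ c, np.linalg.solve(A, b), atol=1e-6))   # True: A^{-1} b in K_n
```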
Conjugate gradient method 3-3
Krylov sequence
$$x^{(k)} = \operatorname*{argmin}_{x \in \mathcal{K}_k} f(x), \qquad k \geq 0$$

• from the previous page, $x^{(n)} = A^{-1}b$

• CG is a recursive method for computing the Krylov sequence $x^{(0)}, x^{(1)}, \ldots$

• we will see there is a simple two-term recurrence

$$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)}) + \gamma_k (x^{(k)} - x^{(k-1)})$$
Example

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 10 \end{bmatrix}, \qquad b = \begin{bmatrix} 10 \\ 10 \end{bmatrix}$$

(figure: contour lines of $f$ in the $(x_1, x_2)$-plane, with the iterates of the Krylov sequence)
Conjugate gradient method 3-4
Residuals of Krylov sequence
• optimality conditions in the definition of the Krylov sequence:

$$x^{(k)} \in \mathcal{K}_k, \qquad \nabla f(x^{(k)}) = Ax^{(k)} - b \in \mathcal{K}_k^{\perp}$$

• hence, the residual $r_k = b - Ax^{(k)}$ satisfies

$$r_k \in \mathcal{K}_{k+1}, \qquad r_k \in \mathcal{K}_k^{\perp}$$

(the first property follows from $b \in \mathcal{K}_1$ and $x^{(k)} \in \mathcal{K}_k$)

the (nonzero) residuals form an orthogonal basis for the Krylov subspaces:

$$\mathcal{K}_k = \mathrm{span}\{r_0, r_1, \ldots, r_{k-1}\}, \qquad r_i^T r_j = 0 \quad (i \neq j)$$
Conjugate gradient method 3-5
Conjugate directions
the 'steps' $v_i = x^{(i)} - x^{(i-1)}$ in the Krylov sequence satisfy

$$v_i^T A v_j = 0 \quad \text{for } i \neq j, \qquad v_i^T A v_i = v_i^T r_{i-1}$$

(proof on next page)

• the vectors $v_i$ are 'conjugate': orthogonal for the inner product $\langle v, w \rangle = v^T A w$

• in particular, if $v_i \neq 0$, it is linearly independent of $v_1, \ldots, v_{i-1}$

the (nonzero) vectors $v_i$ form a 'conjugate' basis for the Krylov subspaces:

$$\mathcal{K}_k = \mathrm{span}\{v_1, v_2, \ldots, v_k\}, \qquad v_i^T A v_j = 0 \quad (i \neq j)$$
Conjugate gradient method 3-6
Proof of the properties on page 3-6 (assume $j < i$)

• $v_i^T A v_j = 0$ because

$$v_j = x^{(j)} - x^{(j-1)} \in \mathcal{K}_j \subseteq \mathcal{K}_{i-1}$$

and

$$A v_i = A(x^{(i)} - x^{(i-1)}) = -r_i + r_{i-1} \in \mathcal{K}_{i-1}^{\perp}$$

• the expression $v_i^T A v_i = v_i^T r_{i-1}$ follows from the fact that $t = 1$ minimizes

$$f(x^{(i-1)} + t v_i) = f(x^{(i-1)}) + \frac{1}{2} t^2 v_i^T A v_i - t\, v_i^T r_{i-1}$$

(since $x^{(i)} = x^{(i-1)} + v_i$ minimizes $f$ over the entire subspace $\mathcal{K}_i$)
Conjugate gradient method 3-7
Conjugate vectors
instead of $v_i$, we will work with a sequence $p_i$ of scaled vectors:

$$p_i = \frac{\|r_{i-1}\|_2^2}{v_i^T r_{i-1}}\, v_i$$

• the scaling factor is chosen to satisfy $r_{i-1}^T p_i = \|r_{i-1}\|_2^2$; equivalently,

$$-\nabla f(x^{(i-1)})^T p_i = \|\nabla f(x^{(i-1)})\|_2^2$$

• using $v_i^T A v_i = v_i^T r_{i-1}$ (page 3-6), we can write the scaling factor as

$$\frac{\|r_{i-1}\|_2^2}{v_i^T r_{i-1}} = \frac{\|r_{i-1}\|_2^2}{v_i^T A v_i} = \frac{p_i^T A p_i}{\|r_{i-1}\|_2^2}$$

• with this notation we can write the update as

$$x^{(i)} = x^{(i-1)} + \alpha p_i, \qquad \alpha = \frac{\|r_{i-1}\|_2^2}{p_i^T A p_i}$$
Conjugate gradient method 3-8
Recursion for pk
$p_k \in \mathcal{K}_k = \mathrm{span}\{p_1, p_2, \ldots, p_{k-1}, r_{k-1}\}$, so we can express $p_k$ as

$$p_1 = \delta r_0, \qquad p_k = \delta r_{k-1} + \beta p_{k-1} + \sum_{i=1}^{k-2} \gamma_i p_i \quad (k > 1)$$

• $\gamma_1 = \cdots = \gamma_{k-2} = 0$: take inner products with $Ap_j$ for $j \leq k-2$, and use

$$p_j^T A p_i = 0 \quad \text{for } j \neq i, \qquad p_j^T A r_{k-1} = 0$$

(second equality because $Ap_j \in \mathcal{K}_{j+1} \subseteq \mathcal{K}_{k-1}$ and $r_{k-1} \in \mathcal{K}_{k-1}^{\perp}$)

• $\delta = 1$: take the inner product with $r_{k-1}$ and use $r_{k-1}^T p_k = \|r_{k-1}\|_2^2$

• hence, $p_k = r_{k-1} + \beta p_{k-1}$; the inner product with $Ap_{k-1}$ shows that

$$\beta = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}}$$
Conjugate gradient method 3-9
Basic conjugate gradient algorithm
Initialize: $x^{(0)} = 0$, $r_0 = b$

For $k = 1, 2, \ldots$

1. if $k = 1$, take $p_k = r_0$; otherwise, take

$$p_k = r_{k-1} + \beta p_{k-1} \qquad \text{where } \beta = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}}$$

2. compute

$$\alpha = \frac{\|r_{k-1}\|_2^2}{p_k^T A p_k}, \qquad x^{(k)} = x^{(k-1)} + \alpha p_k, \qquad r_k = b - A x^{(k)}$$

if $r_k$ is sufficiently small, return $x^{(k)}$
Conjugate gradient method 3-10
Improvements
Step 2: compute the residual recursively:

$$r_k = r_{k-1} - \alpha A p_k$$

Step 1: simplify the expression for $\beta$ by using

$$r_{k-1} = r_{k-2} - \frac{\|r_{k-2}\|_2^2}{p_{k-1}^T A p_{k-1}}\, A p_{k-1}$$

taking the inner product with $r_{k-1}$ gives

$$\beta = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}} = \frac{\|r_{k-1}\|_2^2}{\|r_{k-2}\|_2^2}$$
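in more detail (a worked version of the inner-product step, using the orthogonality of the residuals from page 3-5):

$$\|r_{k-1}\|_2^2 = r_{k-1}^T r_{k-2} - \frac{\|r_{k-2}\|_2^2}{p_{k-1}^T A p_{k-1}}\, r_{k-1}^T A p_{k-1} = -\frac{\|r_{k-2}\|_2^2}{p_{k-1}^T A p_{k-1}}\, p_{k-1}^T A r_{k-1}$$

since $r_{k-1}^T r_{k-2} = 0$; rearranging gives the second expression for $\beta$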
this reduces the number of matrix-vector products to one per iteration (the product $Ap_k$)
Conjugate gradient method 3-11
Conjugate gradient algorithm
Initialize: $x^{(0)} = 0$, $r_0 = b$

For $k = 1, 2, \ldots$

1. if $k = 1$, take $p_k = r_0$; otherwise, take

$$p_k = r_{k-1} + \beta p_{k-1} \qquad \text{where } \beta = \frac{\|r_{k-1}\|_2^2}{\|r_{k-2}\|_2^2}$$

2. compute

$$\alpha = \frac{\|r_{k-1}\|_2^2}{p_k^T A p_k}, \qquad x^{(k)} = x^{(k-1)} + \alpha p_k, \qquad r_k = r_{k-1} - \alpha A p_k$$

if $r_k$ is sufficiently small, return $x^{(k)}$
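the algorithm translates almost line for line into numpy; below is a sketch (the function name `cg`, the tolerance handling, and the in-place updates are my choices):

```python
import numpy as np

def cg(A, b, tol=1e-8, maxiter=None):
    """Conjugate gradient method for Ax = b, A symmetric positive definite."""
    n = len(b)
    if maxiter is None:
        maxiter = n
    x = np.zeros(n)                   # x^(0) = 0
    r = np.array(b, dtype=float)      # r_0 = b
    p = r.copy()                      # k = 1: p_1 = r_0
    rs = r @ r                        # ||r_{k-1}||_2^2
    for _ in range(maxiter):
        if np.sqrt(rs) < tol:         # r_{k-1} sufficiently small
            break
        Ap = A @ p                    # the one matrix-vector product per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p                # x^(k) = x^(k-1) + alpha p_k
        r -= alpha * Ap               # r_k = r_{k-1} - alpha A p_k
        rs_new = r @ r
        p = r + (rs_new / rs) * p     # beta = ||r_k||^2 / ||r_{k-1}||^2
        rs = rs_new
    return x

# example: for A = diag(1, 10), b = (10, 10) (page 3-4),
# cg returns approximately (10, 1) = A^{-1} b in two steps
```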
Conjugate gradient method 3-12
Outline
• conjugate gradient method for linear equations
• convergence analysis
• conjugate gradient method as iterative method
• applications in nonlinear optimization
Notation
minimize   $f(x) = \frac{1}{2} x^T A x - b^T x$

Optimal value

$$f(x^\star) = -\frac{1}{2} b^T A^{-1} b = -\frac{1}{2} \|x^\star\|_A^2$$

Suboptimality at $x$

$$f(x) - f^\star = \frac{1}{2} \|x - x^\star\|_A^2$$

Relative error measure

$$\tau = \frac{f(x) - f^\star}{f(0) - f^\star} = \frac{\|x - x^\star\|_A^2}{\|x^\star\|_A^2}$$

here, $\|u\|_A = (u^T A u)^{1/2}$ is the $A$-weighted norm
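for the experiments that follow, $\tau$ is easy to compute directly (a small sketch; the dense solve for $x^\star$ is only sensible for test problems):

```python
import numpy as np

def tau(x, A, b):
    """Relative error ||x - x*||_A^2 / ||x*||_A^2."""
    x_star = np.linalg.solve(A, b)
    e = x - x_star
    return (e @ A @ e) / (x_star @ A @ x_star)
```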
Conjugate gradient method 3-13
Error after k steps
• $x^{(k)} \in \mathcal{K}_k = \mathrm{span}\{b, Ab, A^2 b, \ldots, A^{k-1}b\}$, so $x^{(k)}$ can be expressed as

$$x^{(k)} = \sum_{i=1}^{k} c_i A^{i-1} b = p(A)\, b$$

where $p(\lambda) = \sum_{i=1}^{k} c_i \lambda^{i-1}$ is some polynomial of degree $k-1$ or less

• $x^{(k)}$ minimizes $f(x)$ over $\mathcal{K}_k$; hence

$$2(f(x^{(k)}) - f^\star) = \inf_{x \in \mathcal{K}_k} \|x - x^\star\|_A^2 = \inf_{\deg p < k} \left\| (p(A) - A^{-1}) b \right\|_A^2$$

we now use the eigenvalue decomposition of $A$ to bound this quantity
Conjugate gradient method 3-14
Error and spectrum of A
• eigenvalue decomposition of $A$:

$$A = Q \Lambda Q^T = \sum_{i=1}^{n} \lambda_i q_i q_i^T \qquad (Q^T Q = I, \quad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n))$$

• define $d = Q^T b$

the expression on the previous page simplifies to

$$\begin{aligned}
2(f(x^{(k)}) - f^\star) &= \inf_{\deg p < k} \left\| (p(A) - A^{-1}) b \right\|_A^2 \\
&= \inf_{\deg p < k} \left\| (p(\Lambda) - \Lambda^{-1}) d \right\|_\Lambda^2 \\
&= \inf_{\deg p < k} \sum_{i=1}^{n} \frac{(\lambda_i p(\lambda_i) - 1)^2 d_i^2}{\lambda_i} \\
&= \inf_{\deg q \leq k,\; q(0)=1} \sum_{i=1}^{n} \frac{q(\lambda_i)^2 d_i^2}{\lambda_i}
\end{aligned}$$

(in the last step, $q(\lambda) = 1 - \lambda p(\lambda)$)
Conjugate gradient method 3-15
Error bounds
Absolute error

$$f(x^{(k)}) - f^\star \leq \left( \sum_{i=1}^{n} \frac{d_i^2}{2\lambda_i} \right) \inf_{\deg q \leq k,\; q(0)=1} \left( \max_{i=1,\ldots,n} q(\lambda_i)^2 \right) = \frac{1}{2} \|x^\star\|_A^2 \inf_{\deg q \leq k,\; q(0)=1} \left( \max_{i=1,\ldots,n} q(\lambda_i)^2 \right)$$

(the equality follows from $\sum_i d_i^2/\lambda_i = b^T A^{-1} b = \|x^\star\|_A^2$)

Relative error

$$\tau_k = \frac{\|x^{(k)} - x^\star\|_A^2}{\|x^\star\|_A^2} \leq \inf_{\deg q \leq k,\; q(0)=1} \left( \max_{i=1,\ldots,n} q(\lambda_i)^2 \right)$$
Conjugate gradient method 3-16
Convergence rate and spectrum of A
• if $A$ has $m$ distinct eigenvalues $\gamma_1, \ldots, \gamma_m$, CG terminates in $m$ steps:

$$q(\lambda) = \frac{(-1)^m}{\gamma_1 \cdots \gamma_m} (\lambda - \gamma_1) \cdots (\lambda - \gamma_m)$$

satisfies $\deg q = m$, $q(0) = 1$, $q(\lambda_1) = \cdots = q(\lambda_n) = 0$; therefore $\tau_m = 0$

• if the eigenvalues are clustered in $m$ groups, then $\tau_m$ is small

one can find $q(\lambda)$ of degree $m$, with $q(0) = 1$, that is small on the spectrum

• if $x^\star$ is a linear combination of $m$ eigenvectors, CG terminates in $m$ steps

take $q$ of degree $m$ with $q(\lambda_i) = 0$ where $d_i \neq 0$; then

$$\sum_{i=1}^{n} \frac{q(\lambda_i)^2 d_i^2}{\lambda_i} = 0$$
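a numerical illustration of the first claim, using the `cg` sketch from page 3-12 and the `tau` helper from page 3-13 (the test matrix is my own construction):

```python
import numpy as np

n, m = 50, 3
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal Q
eigs = np.tile([1.0, 5.0, 10.0], n // m + 1)[:n]   # only m = 3 distinct eigenvalues
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)

x = cg(A, b, tol=0.0, maxiter=m)   # run exactly m = 3 CG iterations
print(tau(x, A, b))                # tiny (~1e-30): tau_m = 0 up to rounding
```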
Conjugate gradient method 3-17
Other bounds
we omit the proofs of the following results

• in terms of the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$:

$$\tau_k \leq 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k$$

derived by taking for $q$ a shifted and scaled Chebyshev polynomial on $[\lambda_{\min}, \lambda_{\max}]$

• in terms of the sorted eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$:

$$\tau_k \leq \left( \frac{\lambda_k - \lambda_n}{\lambda_k + \lambda_n} \right)^2$$

derived by taking $q$ with roots at $\lambda_1, \ldots, \lambda_{k-1}$ and $(\lambda_k + \lambda_n)/2$
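the first bound is easy to tabulate (a small sketch; `cg_bound` is my name for it):

```python
import numpy as np

def cg_bound(kappa, k):
    """Bound on tau_k: 2 * ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))**k."""
    s = np.sqrt(kappa)
    return 2.0 * ((s - 1.0) / (s + 1.0)) ** k

# e.g. cg_bound(100, 50) is about 8.8e-5: even with kappa = 100,
# fifty iterations guarantee a small relative error
```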
Conjugate gradient method 3-18
Outline
• conjugate gradient method for linear equations
• convergence analysis
• conjugate gradient method as iterative method
• applications in nonlinear optimization
Conjugate gradient method as iterative method
In exact arithmetic
• CG was originally proposed as a direct (non-iterative) method
• in theory, terminates in at most n steps
In practice
• due to rounding errors, the CG method can take many more than n steps (or fail)
• CG is now used as an iterative method
• with luck (a good spectrum of A), it gives a good approximation in a small number of steps
• attractive if matrix-vector products are inexpensive
Conjugate gradient method 3-19
Preconditioning
• make the change of variables $y = Bx$ with $B$ nonsingular, and apply CG to

$$B^{-T} A B^{-1} y = B^{-T} b$$

• if the spectrum of $B^{-T} A B^{-1}$ is clustered, PCG converges fast
• trade-off between enhanced convergence and the cost of the extra computation
• the matrix $C = B^T B$ is called the preconditioner
Examples
• diagonal: $C = \mathrm{diag}(A_{11}, A_{22}, \ldots, A_{nn})$
• incomplete or approximate Cholesky factorization of A
• good preconditioners are often application-dependent
Conjugate gradient method 3-20
Naive implementation
define $\tilde{A} = B^{-T} A B^{-1}$ and apply the algorithm of page 3-12 to $\tilde{A} y = B^{-T} b$

Initialize: $y^{(0)} = 0$, $\tilde{r}_0 = B^{-T} b$

For $k = 1, 2, \ldots$

1. if $k = 1$, take $\tilde{p}_k = \tilde{r}_0$; otherwise, take

$$\tilde{p}_k = \tilde{r}_{k-1} + \beta \tilde{p}_{k-1} \qquad \text{where } \beta = \frac{\|\tilde{r}_{k-1}\|_2^2}{\|\tilde{r}_{k-2}\|_2^2}$$

2. compute (recall $\tilde{A} = B^{-T} A B^{-1}$)

$$\alpha = \frac{\|\tilde{r}_{k-1}\|_2^2}{\tilde{p}_k^T \tilde{A} \tilde{p}_k}, \qquad y^{(k)} = y^{(k-1)} + \alpha \tilde{p}_k, \qquad \tilde{r}_k = \tilde{r}_{k-1} - \alpha \tilde{A} \tilde{p}_k$$

if $\tilde{r}_k$ is sufficiently small, return $B^{-1} y^{(k)}$
Conjugate gradient method 3-21
Improvements
• instead of $y^{(k)}$, $\tilde{p}_k$, compute the iterates and steps in the original coordinates:

$$x^{(k)} = B^{-1} y^{(k)}, \qquad p_k = B^{-1} \tilde{p}_k$$

• compute the residuals in the original coordinates:

$$r_k = B^T \tilde{r}_k = b - A x^{(k)}$$

• compute the squared residual norms as

$$\|\tilde{r}_{k-1}\|_2^2 = r_{k-1}^T C^{-1} r_{k-1}$$

• the extra work per iteration is solving one equation to compute $C^{-1} r_{k-1}$
Conjugate gradient method 3-22
Preconditioned conjugate gradient algorithm
Initialize: $x^{(0)} = 0$, $r_0 = b$

For $k = 1, 2, \ldots$

1. solve the equation $C s_k = r_{k-1}$

2. if $k = 1$, take $p_k = s_k$; otherwise, take

$$p_k = s_k + \beta p_{k-1} \qquad \text{where } \beta = \frac{r_{k-1}^T s_k}{r_{k-2}^T s_{k-1}}$$

3. compute

$$\alpha = \frac{r_{k-1}^T s_k}{p_k^T A p_k}, \qquad x^{(k)} = x^{(k-1)} + \alpha p_k, \qquad r_k = r_{k-1} - \alpha A p_k$$

if $r_k$ is sufficiently small, return $x^{(k)}$
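a numpy sketch of the algorithm, specialized for simplicity to the diagonal preconditioner of page 3-20, so that step 1 is an elementwise division (the interface is my choice):

```python
import numpy as np

def pcg(A, b, d, tol=1e-8, maxiter=None):
    """Preconditioned CG for Ax = b with diagonal preconditioner C = diag(d)."""
    n = len(b)
    if maxiter is None:
        maxiter = n
    x = np.zeros(n)
    r = np.array(b, dtype=float)      # r_0 = b
    s = r / d                         # step 1: solve C s_1 = r_0
    p = s.copy()                      # k = 1: p_1 = s_1
    rs = r @ s                        # r_{k-1}^T s_k
    for _ in range(maxiter):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap               # r_k = r_{k-1} - alpha A p_k
        s = r / d                     # solve C s_{k+1} = r_k
        rs_new = r @ s
        p = s + (rs_new / rs) * p     # beta = r_k^T s_{k+1} / r_{k-1}^T s_k
        rs = rs_new
    return x

# usage with the diagonal (Jacobi) preconditioner of page 3-20: pcg(A, b, np.diag(A))
```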
Conjugate gradient method 3-23
Outline
• conjugate gradient method for linear equations
• convergence analysis
• conjugate gradient method as iterative method
• applications in nonlinear optimization
Applications in optimization
Nonlinear conjugate gradient methods
• extend linear CG method to nonquadratic functions
• local convergence similar to linear CG
• limited global convergence theory
Inexact and truncated Newton methods
• use conjugate gradient method to compute (approximate) Newton step
• less reliable than exact Newton methods, but handle very large problems
Conjugate gradient method 3-24
Nonlinear conjugate gradient
minimize   $f(x)$

($f$ convex and differentiable)

Modifications needed to extend the linear CG algorithm of page 3-12:

• replace $r_k = b - A x^{(k)}$ with $-\nabla f(x^{(k)})$
• determine $\alpha$ by line search
Conjugate gradient method 3-25
Fletcher-Reeves CG algorithm
the CG algorithm of page 3-12, modified to minimize a non-quadratic convex $f$

Initialize: choose $x^{(0)}$

For $k = 1, 2, \ldots$

1. if $k = 1$, take $p_1 = -\nabla f(x^{(0)})$; otherwise, take

$$p_k = -\nabla f(x^{(k-1)}) + \beta_k p_{k-1} \qquad \text{where } \beta_k = \frac{\|\nabla f(x^{(k-1)})\|_2^2}{\|\nabla f(x^{(k-2)})\|_2^2}$$

2. update $x^{(k)} = x^{(k-1)} + \alpha_k p_k$ where

$$\alpha_k = \operatorname*{argmin}_{\alpha} f(x^{(k-1)} + \alpha p_k)$$

if $\nabla f(x^{(k)})$ is sufficiently small, return $x^{(k)}$
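a numpy sketch of the algorithm; the backtracking (Armijo) line search stands in for the exact minimization in step 2, and the restart safeguard is my addition (cf. the remark about restarts on the next page):

```python
import numpy as np

def fletcher_reeves(f, grad, x0, tol=1e-6, maxiter=1000):
    """Fletcher-Reeves CG for minimizing a smooth convex f (a sketch)."""
    x = np.array(x0, dtype=float)
    g = grad(x)
    p = -g                                  # first iteration is a gradient step
    for _ in range(maxiter):
        if np.linalg.norm(g) < tol:
            break
        gp = g @ p
        if gp >= 0:                         # safeguard: restart with a gradient step
            p, gp = -g, -(g @ g)
        # backtracking line search in place of the exact minimization in step 2
        alpha, fx = 1.0, f(x)
        while alpha > 1e-12 and f(x + alpha * p) > fx + 1e-4 * alpha * gp:
            alpha *= 0.5
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves formula for beta
        p = -g_new + beta * p
        g = g_new
    return x
```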
Conjugate gradient method 3-26
Some observations
Interpretation
• the first iteration is a gradient step
• the general update is a gradient step with a momentum term:

$$x^{(k)} = x^{(k-1)} - \alpha_k \nabla f(x^{(k-1)}) + \frac{\alpha_k \beta_k}{\alpha_{k-1}} (x^{(k-1)} - x^{(k-2)})$$

• it is common to restart the algorithm periodically by taking a gradient step

Line search

• with exact line search, the method reduces to linear CG for quadratic $f$
• exact line search in step 2 implies $\nabla f(x^{(k)})^T p_k = 0$
• therefore in step 1, $p_k$ is a descent direction at $x^{(k-1)}$:

$$\nabla f(x^{(k-1)})^T p_k = -\|\nabla f(x^{(k-1)})\|_2^2 < 0$$
Conjugate gradient method 3-27
Variations
Polak-Ribière: in step 1, compute $\beta$ from

$$\beta = \frac{\nabla f(x^{(k-1)})^T \left( \nabla f(x^{(k-1)}) - \nabla f(x^{(k-2)}) \right)}{\|\nabla f(x^{(k-2)})\|_2^2}$$

Hestenes-Stiefel:

$$\beta = \frac{\nabla f(x^{(k-1)})^T \left( \nabla f(x^{(k-1)}) - \nabla f(x^{(k-2)}) \right)}{p_{k-1}^T \left( \nabla f(x^{(k-1)}) - \nabla f(x^{(k-2)}) \right)}$$

the formulas are equivalent for quadratic $f$ and exact line search
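as drop-in replacements for the beta computation in the Fletcher-Reeves sketch of page 3-26 (function names are mine; `g_new`, `g_old` are consecutive gradients and `p_old` the previous direction):

```python
def beta_pr(g_new, g_old, p_old):
    """Polak-Ribiere (p_old unused; kept for a uniform interface)."""
    y = g_new - g_old
    return (g_new @ y) / (g_old @ g_old)

def beta_hs(g_new, g_old, p_old):
    """Hestenes-Stiefel."""
    y = g_new - g_old
    return (g_new @ y) / (p_old @ y)
```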
Conjugate gradient method 3-28
Interpretation as restarted BFGS method
BFGS update (page 2-5) with $H_{k-1} = I$:

$$H_k^{-1} = I + \left( 1 + \frac{y^T y}{s^T y} \right) \frac{s s^T}{y^T s} - \frac{y s^T + s y^T}{y^T s}$$

where $y = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})$ and $s = x^{(k)} - x^{(k-1)}$

• $\nabla f(x^{(k)})^T s = 0$ if $x^{(k)}$ is determined by exact line search

• the quasi-Newton step in iteration $k$ is

$$-H_k^{-1} \nabla f(x^{(k)}) = -\nabla f(x^{(k)}) + \frac{y^T \nabla f(x^{(k)})}{y^T s}\, s$$

this is the Hestenes-Stiefel update

nonlinear CG can be interpreted as L-BFGS with $m = 1$
Conjugate gradient method 3-29
References
• S. Boyd, Lecture notes for EE364b, Convex Optimization II.
• G. H. Golub and C. F. Van Loan, Matrix Computations (1996), chapter 10.
• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapter 5.
• H. A. van der Vorst, Iterative Krylov Methods for Large Linear Systems (2003).
Conjugate gradient method 3-30