
CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 10(2), 139–156 (FEBRUARY 1998)

Conjugate gradient solution of linear equations

PER BRINCH HANSEN∗

School of Computer and Information Science, Syracuse University, Syracuse, NY 13244, USA

SUMMARY
The conjugate gradient method is an ingenious method for iterative solution of sparse linear equations. It is now a standard benchmark for parallel scientific computing. In the author's opinion, the apparent mystery of this method is largely due to the inadequate way in which it is presented in textbooks. This tutorial explains conjugate gradients by deriving the computational steps from elementary mathematical concepts. The computation is illustrated by a numerical example and an algorithmic outline. © 1998 John Wiley & Sons, Ltd.

Concurrency: Pract. Exper., Vol. 10(2), 139–156 (1998)

1. INTRODUCTION

In physics, partial differential equations describe phenomena that vary continuously in space and time. Discrete forms of these equations are often large systems of linear equations. Fortunately, each equation typically involves just a few unknowns. For such sparse systems, direct numerical methods, such as Gaussian elimination, are extremely wasteful, since most of the arithmetic is performed on zero operands. It is much more efficient to use iterative methods.

The conjugate gradient method is one of the best iterative methods for solving large systems of sparse linear equations. Sadly enough, it is described so poorly in textbooks that it remains a mystery to many scientists, who use it without understanding it.

Let me paraphrase the essence of a typical introduction to the method in standard texts on numerical analysis:

If a real matrix A is symmetric and positive definite, then it can be shown that minimizing the function

P(x) = 0.5 x · Ax − x · b

is equivalent to solving the linear system Ax = b (see Exercise 1). The conjugate gradient method minimizes P(x) iteratively using conjugate search directions, p0, ..., pk, where

pj · Apk = 0    for j < k

The algorithm shown below includes definitions of two scaling factors, fk and gk. The following theorem concerning the orthogonality of vectors used by the algorithm is proven by induction in [somebody else's book].

∗Correspondence to: Per Brinch Hansen, School of Computer and Information Science, Syracuse University, Syracuse, NY 13244, USA. (e-mail: [email protected])

CCC 1040–3108/98/020139–18$17.50    Received 29 October 1996
© 1998 John Wiley & Sons, Ltd.    Revised 5 December 1996

This form of presentation is unacceptable to an inquisitive reader. Unfamiliar equations and algorithms appear out of nowhere, and theorems are proven about these miraculous objects without any background or motivation. The reader is even asked to fill in gaps in the presentation by solving exercises.

After reading surrogate 'explanations' like that in more than a dozen standard texts on numerical analysis, I was left with a list of questions that I really wanted the authors to answer:

1. How does the function P(x) arise in scientific and engineering computation?
2. Why is it necessary and reasonable to assume that the matrix A is symmetric and positive definite?
3. Why is minimization of P(x) equivalent to solving linear equations?
4. Why is the method based on conjugate gradient vectors (instead of the simpler concept of orthogonal vectors)?
5. Where do the peculiar scaling factors in the algorithm come from?
6. How does one derive such an algorithm from elementary mathematical concepts (instead of analyzing the finished product)?

Well, these are the questions I try to answer in this tutorial. It is written for computer scientists (like me) who are not very familiar with computational physics. I assume that you know linear algebra and elementary calculus.

2. SCALAR PRODUCTS

The algorithms discussed here require the computation of scalar products of vectors. The following is a brief reminder of this operation.

If x and y are n-dimensional vectors

x = [x1 x2 ··· xn]T    y = [y1 y2 ··· yn]T

the scalar product x · y is defined by

x · y = xTy = x1y1 + x2y2 + ··· + xnyn

Scalar products satisfy the usual algebraic laws of multiplication.

Two vectors x and y are orthogonal (or perpendicular) if x · y = 0.

The Euclidean norm (or length) of a vector x is

‖x‖ = √(x1² + x2² + ··· + xn²)

Obviously,

x · x = ‖x‖² > 0    for x ≠ 0

where 0 denotes the null vector [0 0 ··· 0]T.

We will also need scalar products of the form x · Ay, where A is a symmetric matrix, A = AT. Since

xTAy = (Ay)Tx = yTATx = yTAx

we have the important algebraic law

x · Ay = y · Ax    for A = AT    (1)
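As a concrete illustration (an addition to this text, not part of the original), the following minimal Python sketch evaluates a scalar product, a norm and the symmetry law (1); the vectors and matrix are arbitrary examples.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 5.0, 6.0])
    A = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])    # a symmetric matrix, A = A^T

    print(x @ y)                      # scalar product x · y = 32.0
    print(np.linalg.norm(x))          # Euclidean norm of x
    print(x @ (A @ y), y @ (A @ x))   # law (1): x · Ay = y · Ax for symmetric A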

3. STATIC EQUILIBRIUM

The following example is adapted from Gilbert Strang's delightful book on applied mathematics[1]. Figure 1 shows a mechanical system consisting of two identical point masses on two identical springs with negligible masses. The system is suspended from a fixed support.

Figure 1. Masses on springs (spring forces f1, f2; gravitational forces b1, b2; elongations e1, e2; mass co-ordinates d + x1 and 2d + x2 measured from the support)

If the masses were zero, the end points of each spring would be separated by a fixed distance d. Under the influence of gravitation, the springs are elongated until the masses have co-ordinates d + x1 and 2d + x2 relative to the support. The problem is to determine the displacements x1 and x2 of the masses.

The elongation of the springs is defined as follows:

e1 = (d + x1) − d = x1

e2 = (2d + x2) − (d + x1) − d = x2 − x1

These linear equations can be combined into a single matrix equation,

[e1]   [ 1  0] [x1]
[e2] = [−1  1] [x2]

or

e = Lx

where e is the elongation vector, x is the displacement vector and L is the lower triangular matrix shown above.

According to Hooke's law, the spring forces, f1 and f2, are proportional to the elongation; that is,

f1 = e1    f2 = e2

assuming that the spring constants are 1.


The corresponding matrix equation is

[f1]   [e1]
[f2] = [e2]

or

f = e

where f is the internal force vector and e is the elongation vector.

The gravitational forces on the masses are denoted b1 and b2. In static equilibrium, the net force on each mass is zero. Consequently,

b1 = f1 − f2    b2 = f2

In matrix form,

[b1]   [1 −1] [f1]
[b2] = [0  1] [f2]

or

b = Uf

where b is the external force vector, f is the internal force vector and U is the upper triangular matrix shown above.

Now, in equilibrium the external work done by the gravitational forces equals the internal work done by the spring forces[2]:

b1x1 + b2x2 = f1e1 + f2e2

In terms of scalar products

b · x = f · e

Since

b · x = f · e = fTLx = (LTf)Tx = LTf · x

we have

b = LTf

Consequently, U is the transpose of L,

U = LT

and

b = Uf = LTf = LTe = LTLx

In other words, in static equilibrium, we have

Ax = b

where

A = LTL

A is known as the stiffness matrix, x is the displacement vector and b is the external force vector of the mechanical system.


For the system in Figure 1 we have

    [1 −1] [ 1  0]   [ 2 −1]
A = [0  1] [−1  1] = [−1  1]

If you assume that gravitation exerts a unit force on each mass in Figure 1, then

b = [1]
    [1]

and the displacement vector x is defined by the linear system

[ 2 −1]     [1]
[−1  1] x = [1]

which has the solution

x = [2]
    [3]
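This small system is easy to check numerically. The following Python sketch (an illustration added here, not part of the original text) builds the stiffness matrix A = LTL and solves Ax = b.

    import numpy as np

    L = np.array([[ 1.0, 0.0],
                  [-1.0, 1.0]])    # lower triangular matrix from e = Lx
    A = L.T @ L                    # stiffness matrix [[2, -1], [-1, 1]]
    b = np.array([1.0, 1.0])       # unit gravitational force on each mass

    x = np.linalg.solve(A, b)
    print(x)                       # [2. 3.] -- the displacement vector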

4. POTENTIAL ENERGY

The work required to extend a spring elongation e by de is the spring force e times de. So the total potential energy of the two springs is

Ps = ∫0^e1 e de + ∫0^e2 e de = 0.5(e1² + e2²)

Since the two masses are lowered, their total potential energy is negative:

Pm = −(b1x1 + b2x2)

The total potential energy of the mechanical system is

P(x) = Ps + Pm

In terms of vectors and matrices,

Ps = 0.5 e · e = 0.5 Lx · Lx = 0.5 x · LTLx = 0.5 x · Ax

and

Pm = −x · b

So

P(x) = 0.5 x · Ax − x · b


5. POSITIVE-DEFINITE MATRICES

Matrix A is symmetric since

AT = (LTL)T = LTL = A

The matrix has another crucial property:

x · Ax = xTLTLx = Lx · Lx = ‖Lx‖² = ‖e‖² > 0

for any nonzero displacement vector x.

A real symmetric matrix A is called positive definite if

x · Ax > 0    for any real vector x ≠ 0    (2)

Such matrices occur in many practical applications. The physical nature of a problem often makes it obvious that a matrix is positive definite. In our example, 0.5 x · Ax represents spring energy, which is positive for any nonzero displacement x.
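A quick numerical check of definition (2) for the stiffness matrix (my own illustration; the random test vectors are arbitrary):

    import numpy as np

    L = np.array([[1.0, 0.0], [-1.0, 1.0]])
    A = L.T @ L                           # A = L^T L is symmetric
    rng = np.random.default_rng(0)
    for _ in range(5):
        x = rng.standard_normal(2)        # an arbitrary nonzero vector
        print(x @ A @ x > 0)              # True every time: A is positive definite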

6. A MINIMIZATION PRINCIPLE

Let me now extend the two-dimensional example to the general n-dimensional case. Given a symmetric, positive-definite n × n matrix A and a vector b,

    [a11 a12 ··· a1n]        [b1]
A = [a12 a22 ··· a2n]    b = [b2]
    [ ···  ···  ··· ]        [···]
    [a1n a2n ··· ann]        [bn]

we must find the vector

    [x1]
x = [x2]
    [···]
    [xn]

that minimizes the 'energy' function

P(x) = 0.5 x · Ax − x · b    (3)

If P has a minimum at a point x, then the partial derivatives must be zero at that point. The partial derivatives are the components of the vector

∇P(x) = [∂P/∂x1  ∂P/∂x2  ···  ∂P/∂xn]T


known as the gradient of P at x.

Since

P(x) = 0.5 x1(a11x1 + a12x2 + ··· + a1nxn) − x1b1
     + 0.5 x2(a12x1 + a22x2 + ··· + a2nxn) − x2b2
       ···
     + 0.5 xn(a1nx1 + a2nx2 + ··· + annxn) − xnbn

we have

         [a11x1 + a12x2 + ··· + a1nxn − b1]
∇P(x) =  [a12x1 + a22x2 + ··· + a2nxn − b2]
         [               ···              ]
         [a1nx1 + a2nx2 + ··· + annxn − bn]

In short,

∇P(x) = Ax − b    (4)

A vector x that minimizes (or maximizes) P(x) must have a gradient vector equal to 0:

∇P(x) = Ax − b = 0

To show that P(x) is minimized, we need to consider how P changes when you move from an arbitrary point x to another point x + ∆x, where ∆x is an arbitrary nonzero vector:

P(x + ∆x) − P(x) = (x + ∆x) · [0.5 A(x + ∆x) − b] − x · (0.5 Ax − b)

This can also be expressed as follows:

P(x + ∆x) − P(x) = ∆x · (Ax − b) + 0.5 ∆x · A∆x    (5)

using the algebraic law (1): x · A∆x = ∆x · Ax, for the symmetric matrix A.

If x is the solution to the linear system Ax = b, where A is positive definite, then

P(x + ∆x) − P(x) = 0.5 ∆x · A∆x    by (5)
                 > 0               by (2)

In short, P(x) increases if you move away from the solution x in any direction ∆x. Consequently, the function P reaches its minimum (rather than its maximum) at point x.

So, if you can find the vector x that minimizes P, then you have also found the solution to the linear system Ax = b. For a mechanical system, Strang[1] expressed this minimization principle as follows: 'At equilibrium, the displacement vector x minimizes the potential energy P.'
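For the example from Section 3 this principle is easy to verify numerically (the check below is my own illustration): the gradient Ax − b vanishes at the solution x = [2 3]T, and P grows in every direction away from it.

    import numpy as np

    A = np.array([[2.0, -1.0], [-1.0, 1.0]])
    b = np.array([1.0, 1.0])

    def P(x):
        # the 'energy' function (3): P(x) = 0.5 x·Ax - x·b
        return 0.5 * x @ A @ x - x @ b

    x_star = np.array([2.0, 3.0])          # solution of Ax = b
    print(A @ x_star - b)                  # the gradient (4) is the zero vector here
    print(P(x_star))                       # -2.5, the minimum value of P
    # by (5), moving away from the solution increases P in any direction
    for dx in (np.array([0.1, 0.0]), np.array([-0.3, 0.2])):
        print(P(x_star + dx) > P(x_star))  # True, True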

7. LINE MINIMIZATION

I will discuss two iterative methods that solve linear equations by minimizing the function P(x). These methods are known as the gradient method and the conjugate gradient method. Since they are very similar, I will begin by explaining the common idea behind these algorithms.


Both methods compute a sequence of approximations x1, x2, ..., xk to the exact solution x until the residual vector

rk = b − Axk    (6)

is sufficiently close to zero. Here, and in the following, an indexed vector, such as xk, denotes the kth approximation to the vector x.

Starting from an arbitrary point, say x0 = 0, where the residual r0 = b, we select a search direction p0, and move some distance f0 in that direction to the next point:

x1 = x0 + f0p0

Using x = x0 and ∆x = f0p0 in (5), we obtain

P(x1) − P(x0) = −f0 p0 · r0 + 0.5 f0² p0 · Ap0

We choose a distance f0 that minimizes P in the direction p0. This idea is known as line minimization.

Now,

∂P(x1)/∂f0 = −p0 · r0 + f0 p0 · Ap0 = 0

for

f0 = (p0 · r0)/(p0 · Ap0)

Consequently,

P(x1) − P(x0) = −f0 p0 · r0 + 0.5 f0 p0 · r0
              = −0.5 f0 p0 · r0
              = −0.5 (p0 · r0)²/(p0 · Ap0)

Since matrix A is positive definite, the denominator p0 · Ap0 > 0. Consequently,

P(x1) − P(x0) < 0    if p0 · r0 ≠ 0

In other words, no matter where you start from, line minimization always reduces P, provided the direction p0 is not orthogonal to the residual r0.

The gradient and conjugate gradient methods differ only in their choice of search directions. To make these iterations converge towards the exact solution x, we must ensure that the search directions and residuals satisfy the above constraint.

At point x1, we select another search direction p1 and move a distance f1 in that direction to point x2 = x1 + f1p1, and so on.

In general, the (k + 1)st step of the iteration uses the following computational rules:

qk = Apk    (7)

fk = (pk · rk)/(pk · qk)    (8)

xk+1 = xk + fkpk    (9)


rk+1 = rk − fkqk    (10)

Equation (10) is obtained as follows:

rk+1 = b − Axk+1                     by (6)
     = (b − Axk) − A(xk+1 − xk)
     = rk − fkApk                    by (6), (9)
     = rk − fkqk                     by (7)

To ensure convergence, we assume that

pk · rk ≠ 0    (11)

Since A is positive definite, we know that pk · qk > 0. It then follows from (8) and (11) that the scaling factor fk is nonzero:

fk ≠ 0    (12)

Figure 2 shows a generic algorithm based on iterative line minimization.

Figure 2. Generic algorithm

In numerical examples, I will use the tolerance

eps = 0.001 ∗ ‖b‖

recommended by Jennings[3].

We must now decide how to initialize and update the search direction p.
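The paper presents the generic iteration of Figure 2 as an algorithmic outline. The Python sketch below is my own reconstruction of that outline from rules (6)–(10); the function name and the update_direction hook are illustrative choices, not the paper's notation.

    import numpy as np

    def line_minimization(A, b, update_direction, x0=None):
        # Generic iteration built from rules (6)-(10); the rule for choosing
        # the next search direction is passed in as a parameter.
        x = np.zeros(len(b)) if x0 is None else x0.copy()
        r = b - A @ x                       # residual (6)
        p = r.copy()                        # initial search direction p0 = r0
        eps = 0.001 * np.linalg.norm(b)     # Jennings' tolerance
        while np.linalg.norm(r) > eps:
            q = A @ p                       # (7)
            f = (p @ r) / (p @ q)           # (8)
            x = x + f * p                   # (9)
            r = r - f * q                   # (10)
            p = update_direction(r, p, q)   # gradient or conjugate gradient rule
        return x

Different choices of the update_direction rule give the gradient method of Section 8 and the conjugate gradient method of Section 9.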

8. THE GRADIENT METHOD

The simplest method is the gradient method, also known as steepest descent. It was introduced in 1847 by Augustin Cauchy.


In the neighborhood of any point x = xk, a continuous function P(x) increases most rapidly in the direction of the local gradient ∇P(xk). And, it decreases most rapidly in the opposite direction[4].

The gradient method uses the downhill gradient as the current search direction:

pk = −∇P(xk) = b − Axk    by (4)

In other words, the search direction is always the current residual

pk = rk    (13)

Figure 3 shows an algorithmic outline of the gradient method. Since p = r, the variable p is superfluous. The variable rn = ‖r‖ holds the norm of the residual.

Since they are identical, the search direction is obviously not perpendicular to the residual. Consequently, the steepest descent method always converges towards the exact solution to the linear system.

Figure 3. The gradient method
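Figure 3 gives the paper's outline; in terms of the generic sketch at the end of Section 7, the gradient method simply reuses the residual as the next direction (the code below is my own illustration):

    import numpy as np
    # uses line_minimization from the sketch in Section 7

    A = np.array([[2.0, -1.0], [-1.0, 1.0]])
    b = np.array([1.0, 1.0])

    steepest_descent = lambda r, p, q: r.copy()    # pk = rk, equation (13)
    x = line_minimization(A, b, steepest_descent)
    print(x)    # close to [2, 3] after about ten steps (compare Table 1)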

Consider now the numerical example discussed in Section 3. The iteration terminates when rn ≤ eps, where eps = 0.001 ∗ ‖b‖ ≈ 0.001. The results shown in Table 1 were produced by a computer and rounded to three decimal places in the printing.

Table 1. Steepest descent iteration

  k      0       1       2       3     ···     7       8       9      10
  qk   1.000  −3.000   0.200    ···    ···  −0.024   0.002  −0.005
       0.000   2.000   0.000    ···    ···   0.016   0.000   0.003
  fk   2.000   0.400   2.000    ···    ···   0.400   2.000   0.400
  xk   0.000   2.000   1.600   2.000   ···   2.000   1.997   2.000   1.999
       0.000   2.000   2.400   2.800   ···   2.992   2.995   2.998   2.999
  rk   1.000  −1.000   0.200  −0.200   ···  −0.008   0.002  −0.002   0.000
       1.000   1.000   0.200   0.200   ···   0.008   0.002   0.002   0.000
  rnk  1.414   1.414   0.283   0.283   ···   0.011   0.002   0.002   0.000

Figure 4 is a graphic representation of the same iteration. Each search direction is perpendicular to the previous direction, since

rk+1 · rk = (rk − fkqk) · rk    by (10)
          = 0                   by (8), (13)

Notice the slow convergence as the iteration zig-zags across the plane in smaller and smaller steps (most of them too small to be shown).

Figure 4. Steepest descent convergence (the iterates x0, x1, x2, x3, ..., x10 zig-zag towards the solution x)

For a two-dimensional problem, a graph of P(x) looks like a valley completely surrounded by hills. Steepest descent is like skiing straight downhill, as far as you can go, and then skiing downhill again after making a right angle turn, and so on, until you reach the bottom of the valley. If the valley is long and narrow, this technique may force you to go back and forth across the valley in many small steps.

The convergence criterion (11) expresses the obvious: you cannot make progress if you ski parallel to the hilltops instead of going downhill.

The gradient method is too slow. It serves mainly as a pedagogical introduction to conjugate gradients.

9. THE CONJUGATE GRADIENT METHOD

The generic algorithm converges for any search direction that satisfies (11). This freedom can be exploited to let the iteration select a sequence of search directions

p0, p1, ..., pk

that are linearly independent. Since at most n vectors can be linearly independent in n-dimensional space, this method is theoretically guaranteed to find the exact solution x in n steps (or less). This is the ingenious idea behind the conjugate gradient method of


Hestenes and Stiefel[5]. (Figure 4 shows that the gradient method does not select linearly independent search directions.)

9.1. Conjugate directions

Let me for a moment assume that the search directions are not linearly independent. In that case, at least one of them, say pk, must be a linear combination of the rest:

pk = c0p0 + c1p1 + ··· + ck−1pk−1    (14)

where the cs denote real constants. Consequently,

pk · qk = c0p0 · qk + c1p1 · qk + ··· + ck−1pk−1 · qk    (15)

Consider now search directions with the following property:

pj · qk = 0    for j < k    (16)

that is,

pj · Apk = 0

Two vectors pj and pk with this property are said to be conjugate with respect to the matrix A, or just A-orthogonal.

For conjugate search directions, the right side of (15) is zero, but, since A is positive definite, the left side, pk · qk = pk · Apk, is greater than zero for pk ≠ 0. This contradiction implies that the search directions must be linearly independent.

p0 = r0 (17)

as the initial search direction. However, each of the following search directions:

pk+1 = rk+1 + gkpk (18)

is a weighted sum of the current residual and the previous search direction.

Before I select the scaling factor gk, I will reformulate the algebraic law (1) in terms of the present terminology:

pj · qk = pk · qj    (19)

To make the next search direction and the current one conjugate, we proceed as follows:

pk · qk+1 = pk+1 · qk             by (19)
          = (rk+1 + gkpk) · qk    by (18)

So

pk · qk+1 = 0    (20)


for

gk = −(rk+1 · qk)/(pk · qk)    (21)

In the Appendix, I prove the more general result (16) that the current search direction is A-orthogonal to each of the previous directions.

9.2. Convergence

According to (11), the iteration converges if pk · rk ≠ 0. Consider first the scalar product of the previous search direction and the current residual:

pk−1 · rk = pk−1 · (rk−1 − fk−1qk−1)    by (10)

or, by (8),

pk−1 · rk = 0    (22)

In general, the current residual turns out to be orthogonal to all previous search directions:

pj · rk = 0    for j < k    (23)

(The Appendix includes a proof of this invariant.)

Since

pk · rk = (rk + gk−1pk−1) · rk    by (18)

we obtain the following identity by (22):

pk · rk = rk · rk    (24)

Since rk · rk > 0 for rk ≠ 0, the conjugate gradient method converges towards the exact solution according to (11).

9.3. Residuals

For two successive residuals, we have

rk · rk+1 = rk · (rk − fkqk) by (10)

The product

rk · qk = (pk − gk−1pk−1) · qk    by (18)

can be simplified using (20):

rk · qk = pk · qk    (25)

Consequently,

rk · rk+1 = pk · rk − fkpk · qk    by (24), (25)

or, by (8),

rk · rk+1 = 0    (26)


So each residual is orthogonal to the previous one. However, in contrast to the gradient method, it can be shown that every residual is orthogonal to all previous residuals:

rj · rk = 0    for j < k    (27)

(See the Appendix.)

Consequently, the conjugate gradient method can also be viewed as an iteration that constructs a sequence of orthogonal vectors. Since no more than n vectors can be orthogonal in n dimensions, we see again that the iteration must terminate after at most n steps.

9.4. Scaling factors

For computational purposes, it is useful to derive alternative definitions of the scaling factors fk and gk.

Using (8) and (24), we obtain

fk = (pk · rk)/(pk · qk) = (rk · rk)/(pk · qk)    (28)

To redefine gk, I rewrite the numerator in (21) as follows:

rk+1 · qk = rk+1 · (rk − rk+1)/fk    by (10), (12)
          = −rk+1 · rk+1/fk          by (26)

and replace the denominator in the same equation by

pk · qk = (rk + gk−1pk−1) · (rk − rk+1)/fk    by (10), (12), (18)
        = rk · rk/fk                          by (23), (26)

This substitution combined with (21) provides two equivalent definitions of gk:

gk = −(rk+1 · qk)/(pk · qk) = (rk+1 · rk+1)/(rk · rk)    (29)

9.5. Algorithm

This insight leads to the algorithm shown in Figure 5. In theory, the conjugate gradient method terminates with the exact solution after n steps. However, due to rounding errors, it is wise to let termination be controlled by the residual norm. For large problems, the method often finds an accurate solution in less than n steps.

I now return to the numerical example discussed earlier. Table 2 shows that the conjugate gradient algorithm finds the exact solution x = [2 3]T in two steps (since n = 2). Figure 6 is a graphic representation of the iteration.
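Figure 5 below gives the paper's algorithmic outline. The following Python sketch is my own rendering of the same steps, using the scaling factors (28) and (29); the function name is an illustrative choice.

    import numpy as np

    def conjugate_gradient(A, b, x0=None):
        x = np.zeros(len(b)) if x0 is None else x0.copy()
        r = b - A @ x                        # residual (6)
        p = r.copy()                         # p0 = r0, equation (17)
        eps = 0.001 * np.linalg.norm(b)      # Jennings' tolerance
        while np.linalg.norm(r) > eps:
            q = A @ p                        # (7)
            f = (r @ r) / (p @ q)            # (28)
            x = x + f * p                    # (9)
            r_new = r - f * q                # (10)
            g = (r_new @ r_new) / (r @ r)    # (29)
            p = r_new + g * p                # (18)
            r = r_new
        return x

    A = np.array([[2.0, -1.0], [-1.0, 1.0]])
    b = np.array([1.0, 1.0])
    print(conjugate_gradient(A, b))          # [2. 3.] in two steps (compare Table 2)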

Figure 5. The conjugate gradient method

Table 2. Conjugate gradient iteration

  k      0       1       2
  qk   1.000  −2.000
       0.000   2.000
  fk   2.000   0.500
  xk   0.000   2.000   2.000
       0.000   2.000   3.000
  rk   1.000  −1.000   0.000
       1.000   1.000   0.000
  gk   1.000   0.000
  pk   1.000   0.000   0.000
       1.000   2.000   0.000
  rnk  1.414   1.414   0.000

Figure 6. Conjugate gradient convergence (the iterates x0, x1 and x2 reach the solution x in two steps)

10. CONCLUSIONS

I have explained the conjugate gradient method for solving linear equations by deriving the computational steps from elementary mathematical concepts. The computation was illustrated by a numerical example and an algorithmic outline. Although I now understand this ingenious method, I am still not completely satisfied with my explanation. I would like to see more of the algebra replaced by intuitive geometric insight.

For dense linear equations, the conjugate gradient method requires more arithmetic operations than Gaussian elimination. Its main advantage is that it converges quickly for many sparse systems[6]. I have not discussed the convergence rate of conjugate gradients, or how it can be improved by a technique known as preconditioning. These refinements are discussed in most standard texts, including[1,3,7]. Golub[8] is an annotated bibliography of papers about conjugate gradients.

When you understand how the conjugate gradient method works on sequential computers, parallelization of the algorithm seems relatively straightforward. The parallel computation is dominated by the time-consuming multiplication of the sparse matrix A and the direction vector p. Parallel implementation of the conjugate gradient method is discussed in [9–11]. The method is now a standard benchmark for parallel scientific computing[12].

I did not expect to spend so much time on this method. I was forced to do it by incomprehensible standard texts that just tell you what the computational steps are without revealing how they follow from simple ideas.

I will end this journey of rediscovery by quoting the painter Robert Henri[13]: 'Low art is just telling things; as, There is the night. High art gives the feel of the night.' Indeed!

ACKNOWLEDGEMENTS

I thank Erik Hemmingsen and the anonymous reviewer for their helpful comments. This work was supported by the National Science Foundation under grant number CCR-9311759.

APPENDIX: PROOF OF ORTHOGONALITY

The following is an inductive proof of the invariants (16), (23) and (27) for the conjugate gradient method. In formulating this proof, I found Ortega[10] helpful.

Basis:

p0 · q1 = 0 by (20)

p0 · r1 = 0 by (22)

r0 · r1 = 0 by (26)

Induction step:

For each invariant, the induction step consists of assuming that all three invariants are true for j < k and proving that the given invariant then also holds for j ≤ k.


Invariant (16):

1. For j = k,

pk · qk+1 = 0    by (20)

2. For j < k,

pj · qk+1 = pk+1 · qj                 by (19)
          = (rk+1 + gkpk) · qj        by (18)
          = rk+1 · qj                 by (16), (19)
          = rk+1 · (rj − rj+1)/fj     by (10), (12)
          = 0                         by (27)

Invariant (23):

1. For j = k,

pk · rk+1 = 0    by (22)

2. For j < k,

pj · rk+1 = pj · (rk − fkqk)    by (10)
          = 0                   by (16), (23)

Invariant (27):

1. For j = k,

rk · rk+1 = 0    by (26)

2. For j < k,

rj · rk+1 = rj · (rk − fkqk)           by (10)
          = −fkrj · qk                 by (27)
          = −fk(pj − gj−1pj−1) · qk    by (18)
          = 0                          by (16)
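As a numerical spot-check of invariants (16), (23) and (27), the following sketch (my own illustration) runs a few conjugate gradient steps on an arbitrary random symmetric positive-definite system and tests the orthogonality relations:

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((6, 6))
    A = M.T @ M + 6 * np.eye(6)          # symmetric positive-definite matrix
    b = rng.standard_normal(6)

    x = np.zeros(6)
    r = b - A @ x
    p = r.copy()
    ps, qs, rs = [], [], [r.copy()]      # ps[k] = pk, qs[k] = qk, rs[k] = rk
    for _ in range(4):
        q = A @ p                        # (7)
        f = (r @ r) / (p @ q)            # (28)
        x = x + f * p                    # (9)
        r_new = r - f * q                # (10)
        g = (r_new @ r_new) / (r @ r)    # (29)
        ps.append(p.copy()); qs.append(q.copy()); rs.append(r_new.copy())
        p, r = r_new + g * p, r_new      # (18)

    k = 3
    for j in range(k):
        print(abs(ps[j] @ qs[k]) < 1e-10,    # (16): pj · Apk = 0 for j < k
              abs(ps[j] @ rs[k]) < 1e-10,    # (23): pj · rk = 0 for j < k
              abs(rs[j] @ rs[k]) < 1e-10)    # (27): rj · rk = 0 for j < k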

REFERENCES

1. G. Strang, Introduction to Applied Mathematics, Wellesley-Cambridge Press, Wellesley, MA, 1986.
2. R. Feynman, R. B. Leighton and M. L. Sands, The Feynman Lectures on Physics, Addison-Wesley, Redwood City, CA, 1989.
3. A. Jennings and J. J. McKeown, Matrix Computation, 2nd edn, John Wiley & Sons, NY, 1992.
4. R. Courant and F. John, Introduction to Calculus and Analysis, Springer-Verlag, NY, 1989.
5. M. R. Hestenes and E. Stiefel, 'Methods of conjugate gradients for solving linear systems', J. Res. Natl. Bur. Stand., 49, 409–436 (1952).
6. J. K. Reid, 'On the method of conjugate gradients for the solution of large sparse systems of linear equations', in J. K. Reid (Ed.), Large Sparse Sets of Linear Equations, Academic Press, NY, 1971.
7. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd edn, The Johns Hopkins University Press, Baltimore, MD, 1989.
8. G. Golub and D. O'Leary, 'Some history of the conjugate gradient and Lanczos algorithms: 1948–1976', SIAM Rev., 31, 50–102 (1989).
9. G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon and D. W. Walker, Solving Problems on Concurrent Processors, Vol. I, Prentice Hall, Englewood Cliffs, NJ, 1988.
10. J. M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, Plenum Press, NY, 1988.
11. A. Gupta, V. Kumar and A. Sameh, 'Performance and scalability of preconditioned conjugate gradient methods on parallel computers', IEEE Trans. Parallel Distrib. Syst., 6, 455–469 (1995).
12. D. H. Bailey, E. Barszcz, L. Dagum and H. D. Simon, 'NAS parallel benchmark results', IEEE Parallel Distrib. Technol., 1, 43–51 (1993).
13. R. Henri, The Art Spirit, Harper & Row, NY, 1951, p. 265.
