
Numerical methods for

elliptic partial differential equations

Arnold Reusken


Preface

This is a book on the numerical approximation of partial differential equations. On the next page we give an overview of the structure of this book:


Elliptic boundary value problems (chapter 1):
• Poisson equation: scalar, symmetric, elliptic.
• Convection-diffusion equation: scalar, nonsymmetric, singularly perturbed.
• Stokes equation: system, symmetric, indefinite.

Weak formulation (chapter 2)

Finite element method −→
• Basic principles (chapter 3); application to Poisson equation.
• Streamline-diffusion FEM (chapter 4); application to convection-diffusion equation.
• FEM for Stokes equation (chapter 5).

Iterative methods −→
• Basics on linear iterative methods (chapter 6).
• Preconditioned CG method (chapter 7); application to Poisson equation.
• Krylov subspace methods (chapter 8); application to convection-diffusion equation.
• Multigrid methods (chapter 9).
• Iterative methods for saddle-point problems (chapter 10); application to Stokes equation.

Adaptivity −→
• A posteriori error estimation (chapter ).
• Grid refinement techniques (chapter ).


Contents

1 Introduction to elliptic boundary value problems
   1.1 Preliminaries on function spaces and domains
   1.2 Scalar elliptic boundary value problems
      1.2.1 Formulation of the problem
      1.2.2 Examples
      1.2.3 Existence, uniqueness, regularity
   1.3 The Stokes equations

2 Weak formulation
   2.1 Introduction
   2.2 Sobolev spaces
      2.2.1 The spaces Wm(Ω) based on weak derivatives
      2.2.2 The spaces Hm(Ω) based on completion
      2.2.3 Properties of Sobolev spaces
   2.3 General results on variational formulations
   2.4 Minimization of functionals and saddle-point problems
   2.5 Variational formulation of scalar elliptic problems
      2.5.1 Introduction
      2.5.2 Elliptic BVP with homogeneous Dirichlet boundary conditions
      2.5.3 Other boundary conditions
      2.5.4 Regularity results
      2.5.5 Riesz-Schauder theory
   2.6 Weak formulation of the Stokes problem
      2.6.1 Proof of the inf-sup property
      2.6.2 Regularity of the Stokes problem
      2.6.3 Other boundary conditions

3 Galerkin discretization and finite element method
   3.1 Galerkin discretization
   3.2 Examples of finite element spaces
      3.2.1 Simplicial finite elements
      3.2.2 Rectangular finite elements
   3.3 Approximation properties of finite element spaces
   3.4 Finite element discretization of scalar elliptic problems
      3.4.1 Error bounds in the norm ‖ · ‖1
      3.4.2 Error bounds in the norm ‖ · ‖L2
   3.5 Stiffness matrix
      3.5.1 Mass matrix
   3.6 Isoparametric finite elements
   3.7 Nonconforming finite elements

4 Finite element discretization of a convection-diffusion problem
   4.1 Introduction
   4.2 A variant of the Cea-lemma
   4.3 A one-dimensional hyperbolic problem and its finite element discretization
   4.4 The convection-diffusion problem reconsidered
      4.4.1 Well-posedness of the continuous problem
      4.4.2 Finite element discretization
      4.4.3 Stiffness matrix for the convection-diffusion problem

5 Finite element discretization of the Stokes problem
   5.1 Galerkin discretization of saddle-point problems
   5.2 Finite element discretization of the Stokes problem
      5.2.1 Error bounds
      5.2.2 Other finite element spaces

6 Linear iterative methods
   6.1 Introduction
   6.2 Basic linear iterative methods
   6.3 Convergence analysis in the symmetric positive definite case
   6.4 Rate of convergence of the SOR method
   6.5 Convergence analysis for regular matrix splittings
      6.5.1 Perron theory for positive matrices
      6.5.2 Regular matrix splittings
   6.6 Application to scalar elliptic problems

7 Preconditioned Conjugate Gradient method
   7.1 Introduction
   7.2 Conjugate Gradient method
   7.3 Introduction to preconditioning
   7.4 Preconditioning based on a linear iterative method
   7.5 Preconditioning based on incomplete LU factorizations
      7.5.1 LU factorization
      7.5.2 Incomplete LU factorization
      7.5.3 Modified incomplete Cholesky method
   7.6 Problem based preconditioning
   7.7 Preconditioned Conjugate Gradient Method

8 Krylov Subspace Methods
   8.1 Introduction
   8.2 The Conjugate Gradient method reconsidered
   8.3 MINRES method
   8.4 GMRES type of methods
   8.5 Bi-CG type of methods

9 Multigrid methods
   9.1 Introduction
   9.2 Multigrid for a one-dimensional model problem
   9.3 Multigrid for scalar elliptic problems
   9.4 Convergence analysis
      9.4.1 Introduction
      9.4.2 Approximation property
      9.4.3 Smoothing property
      9.4.4 Multigrid contraction number
      9.4.5 Convergence analysis for symmetric positive definite problems
   9.5 Multigrid for convection-dominated problems
   9.6 Nested Iteration
   9.7 Numerical experiments
   9.8 Algebraic multigrid methods
   9.9 Nonlinear multigrid

10 Iterative methods for saddle-point problems
   10.1 Block diagonal preconditioning
   10.2 Application to the Stokes problem

A Functional Analysis
   A.1 Different types of spaces
   A.2 Theorems from functional analysis

B Linear Algebra
   B.1 Notions from linear algebra
   B.2 Theorems from linear algebra


Chapter 1

Introduction to elliptic boundary value problems

In this chapter we introduce the classical formulation of scalar elliptic problems and of the Stokes equations. Some results known from the literature on existence and uniqueness of a classical solution will be presented. Furthermore, we briefly discuss the issue of regularity.

1.1 Preliminaries on function spaces and domains

The boundary value problems that we consider in this book will be posed on domains Ω ⊂ Rn, n = 1, 2, 3. In the remainder we always assume that

Ω is open, bounded and connected.

Moreover, the boundary of Ω should satisfy certain smoothness conditions that will be introduced in this section. For this we need so-called Hölder spaces.

By Ck(Ω), k ∈ N, we denote the space of functions f : Ω → R for which all (partial) derivatives

D^ν f := ∂^|ν| f / (∂x1^ν1 · · · ∂xn^νn),   ν = (ν1, . . . , νn), |ν| = ν1 + . . . + νn,

of order |ν| ≤ k are continuous functions on Ω. The space Ck(Ω̄), k ∈ N, consists of all functions in Ck(Ω) ∩ C(Ω̄) for which all derivatives of order ≤ k have continuous extensions to Ω̄. Since Ω̄ is compact, the functional

f → max_{|ν|≤k} max_{x∈Ω̄} |D^ν f(x)| = max_{|ν|≤k} ‖D^ν f‖_{∞,Ω̄} =: ‖f‖_{Ck(Ω̄)}

defines a norm on Ck(Ω̄). The space (Ck(Ω̄), ‖ · ‖_{Ck(Ω̄)}) is a Banach space (cf. Appendix A.1). Note that f → max_{|ν|≤k} ‖D^ν f‖_{∞,Ω} does not define a norm on Ck(Ω).
For f : Ω → R we define its support by

supp(f) := {x ∈ Ω | f(x) ≠ 0}.

The space Ck_0(Ω), k ∈ N, consists of all functions in Ck(Ω) which have a compact support in Ω, i.e., supp(f) ⊂ Ω. The functional f → max_{|ν|≤k} ‖D^ν f‖_{∞,Ω} defines a norm on Ck_0(Ω), but


(Ck_0(Ω), ‖ · ‖_{Ck(Ω)}) is not a Banach space.

For a compact set D ⊂ Rn and λ ∈ (0, 1] we introduce the quantity

[f]_{λ,D} := sup{ |f(x) − f(y)| / ‖x − y‖^λ | x, y ∈ D, x ≠ y }   for f : D → R.

We write f ∈ C0,λ(Ω̄) and say that f is Hölder continuous in Ω̄ with exponent λ if [f]_{λ,Ω̄} < ∞. A norm on the space C0,λ(Ω̄) is defined by

f → ‖f‖_{C(Ω̄)} + [f]_{λ,Ω̄}.

We write f ∈ C0,λ(Ω) and say that f is Hölder continuous in Ω with exponent λ if for arbitrary compact subsets D ⊂ Ω the property [f]_{λ,D} < ∞ holds. An important special case is λ = 1: the space C0,1(Ω̄) [or C0,1(Ω)] consists of all Lipschitz continuous functions on Ω̄ [Ω]. The space Ck,λ(Ω̄) [Ck,λ(Ω)], k ∈ N, λ ∈ (0, 1], consists of those functions in Ck(Ω̄) [Ck(Ω)] for which all derivatives D^ν f of order |ν| = k are elements of C0,λ(Ω̄) [C0,λ(Ω)]. On Ck,λ(Ω̄) we define a norm by

f → ‖f‖_{Ck(Ω̄)} + Σ_{|α|=k} [D^α f]_{λ,Ω̄}.

Note that

Ck,λ(Ω̄) ⊂ Ck(Ω̄)   for all k ∈ N, λ ∈ (0, 1],
Ck,λ2(Ω̄) ⊂ Ck,λ1(Ω̄)   for all k ∈ N, 0 < λ1 ≤ λ2 ≤ 1,

and similarly with Ω̄ replaced by Ω. We use the notation Ck,0(Ω) := Ck(Ω) [Ck,0(Ω̄) := Ck(Ω̄)].
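As a concrete illustration of the seminorm [f]_{λ,D} (our own example, not part of the text), one can estimate it numerically for f(x) = √x on D = [0, 1] by sampling pairs of grid points:

```python
import numpy as np
from itertools import combinations

# Numerical estimate (our own example) of the Hoelder seminorm
#   [f]_{lambda,D} = sup |f(x) - f(y)| / |x - y|^lambda
# for f(x) = sqrt(x) on D = [0, 1], by sampling pairs of grid points.
def holder_seminorm(f, lam, pts):
    return max(abs(f(s) - f(t)) / abs(s - t) ** lam
               for s, t in combinations(pts, 2))

f = np.sqrt
pts = np.linspace(0.0, 1.0, 201)

# sqrt is Hoelder continuous with exponent 1/2: the quotient is bounded (by 1)
print(holder_seminorm(f, 0.5, pts))
# but sqrt is not Lipschitz (lambda = 1): near 0 the quotient behaves like
# 1/sqrt(h) for grid spacing h and grows without bound under refinement.
print(holder_seminorm(f, 1.0, pts))
```

The bounded quotient for λ = 1/2 and the growing quotient for λ = 1 reflect that √x ∈ C0,1/2([0, 1]) while √x is not Lipschitz continuous near 0.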

Remark 1.1.1 The inclusion Ck+1(Ω̄) ⊂ Ck,λ(Ω̄), λ ∈ (0, 1], is in general not true. Consider n = 2 and Ω = {(x, y) | −1 < x < 1, −1 < y < √|x|}. The function

f(x, y) = (sign x) y^{3/2}  if y > 0,   f(x, y) = 0  otherwise,

belongs to C1(Ω̄), but f ∉ C0,λ(Ω̄) if λ ∈ (3/4, 1].

Based on these Hölder spaces we can now characterize smoothness of the boundary ∂Ω.

Definition 1.1.2 For k ∈ N, λ ∈ [0, 1] the property ∂Ω ∈ Ck,λ (the boundary is of class Ck,λ) holds if at each point x0 ∈ ∂Ω there are a ball B = {x ∈ Rn | ‖x − x0‖ < δ}, δ > 0, and a bijection ψ : B → E ⊂ Rn such that

ψ(B ∩ Ω) ⊂ Rn+ := {x ∈ Rn | xn > 0},   (1.1a)
ψ(B ∩ ∂Ω) ⊂ ∂Rn+,   (1.1b)
ψ ∈ Ck,λ(B), ψ−1 ∈ Ck,λ(E).   (1.1c)

For the case n = 2 this is illustrated in Figure 1.

Figure 1


A very important special case is ∂Ω ∈ C0,1. In this case all transformations ψ (and their inverses) must be Lipschitz continuous functions, and we then call Ω a Lipschitz domain. This holds, for example, if ∂Ω consists of different patches which are graphs of smooth functions (e.g., polynomials) and at the interface between different patches the interior angles are bounded away from zero. In Figure 2 we give an illustration for the two-dimensional case.

Figure 2

A domain Ω is convex if for arbitrary x, y ∈ Ω the inclusion {tx + (1 − t)y | t ∈ [0, 1]} ⊂ Ω holds.

In almost all theoretical analyses presented in this book it suffices to have ∂Ω ∈ C0,1. Moreover, the domains used in practice usually satisfy this condition. Therefore, in the remainder of this book we always consider such domains, unless explicitly stated otherwise.

Assumption 1.1.3 In this book we assume that the domain Ω ⊂ Rn is such that

Ω is open, connected and bounded,

∂Ω is of class C0,1.

One can show that if this assumption holds then Ck+1(Ω̄) ⊂ Ck,1(Ω̄) (cf. Remark 1.1.1).

1.2 Scalar elliptic boundary value problems

1.2.1 Formulation of the problem

On C2(Ω) we define a linear second order differential operator L as follows:

Lu = Σⁿ_{i,j=1} aij ∂²u/(∂xi∂xj) + Σⁿ_{i=1} bi ∂u/∂xi + cu,   (1.2)

with aij, bi and c given functions on Ω. Because ∂²u/(∂xi∂xj) = ∂²u/(∂xj∂xi) we may assume, without loss of generality, that

aij(x) = aji(x)

holds for all x ∈ Ω. Corresponding to the differential operator L we can define a partial differential equation

Lu = f,   (1.3)

with f a given function on Ω. In (1.2) the part containing the second derivatives only, i.e.

Σⁿ_{i,j=1} aij ∂²u/(∂xi∂xj),

is called the principal part of L. Related to this principal part we have the n × n symmetric matrix

A(x) = (aij(x))_{1≤i,j≤n}.   (1.4)

11

Page 12: Numerical methods for elliptic partial differential ... · Introduction to elliptic boundary value problems In this chapter we introduce the classical formulation of scalar elliptic

Note that due to the symmetry of A the eigenvalues are real. These eigenvalues, which may depend on x ∈ Ω, are denoted by

λ1(x) ≤ λ2(x) ≤ . . . ≤ λn(x).

Hyperbolicity, parabolicity, or ellipticity of the differential operator L is determined by these eigenvalues. The operator L, or the partial differential equation in (1.2), is called elliptic at the point x ∈ Ω if all eigenvalues of A(x) have the same sign. The operator L and the corresponding differential equation are called elliptic if L is elliptic at every x ∈ Ω. Note that this property is determined by the principal part of L only.

Remark 1.2.1 If the operator L is elliptic, then we may assume that all eigenvalues of the matrix A(x) in (1.4) are positive:

0 < λ1(x) ≤ λ2(x) ≤ . . . ≤ λn(x)   for all x ∈ Ω.

The operator L (and the corresponding boundary value problem) is called uniformly elliptic if inf{λ1(x) | x ∈ Ω} > 0 holds. Note that if the operator L is elliptic with coefficients aij ∈ C(Ω̄), then the function x → A(x) is continuous on the compact set Ω̄ and hence L is uniformly elliptic. Using

Σⁿ_{i,j=1} aij(x) ξi ξj = ξᵀ A(x) ξ ≥ λ1(x) ξᵀξ,

we obtain that the operator L is uniformly elliptic if and only if there exists a constant α0 > 0 such that

Σⁿ_{i,j=1} aij(x) ξi ξj ≥ α0 ξᵀξ   for all ξ ∈ Rn and all x ∈ Ω.
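In practice one can probe uniform ellipticity numerically by sampling the smallest eigenvalue λ1(x) over the domain. A minimal sketch (the coefficient matrix A(x, y) below is a made-up example of our own, not from the text):

```python
import numpy as np

# Sampling check (our own sketch) of uniform ellipticity: the smallest
# eigenvalue lambda_1(x) of A(x) must stay bounded away from 0 on Omega.
# The coefficient matrix below is a hypothetical 2-D example.
def A(x, y):
    return np.array([[2.0 + np.sin(x), 0.5],
                     [0.5, 1.0 + 0.5 * np.cos(y)]])

pts = np.linspace(0.0, 1.0, 50)
# eigvalsh returns eigenvalues in ascending order, so index 0 is lambda_1(x)
lam1 = min(np.linalg.eigvalsh(A(x, y))[0] for x in pts for y in pts)

# The observed minimum can serve as the constant alpha_0 on this sample.
print(lam1 > 0.0)
```

Such a sample check is of course no proof; it only suggests a candidate value for α0 on the sampled points.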

We obtain a boundary value problem when we combine the partial differential equation in (1.3) with certain boundary conditions for the unknown function u. For ease we restrict ourselves to problems with Dirichlet boundary conditions, i.e., we impose

u = g on ∂Ω,

with g a given function on ∂Ω. Other types of boundary conditions are the so-called Neumann boundary condition, i.e., a condition on the normal derivative ∂u/∂n on ∂Ω, and the mixed boundary condition, which is a linear combination of a Dirichlet and a Neumann boundary condition. Summarizing, we consider a linear second order Dirichlet boundary value problem in Ω ⊂ Rn:

Lu = Σⁿ_{i,j=1} aij ∂²u/(∂xi∂xj) + Σⁿ_{i=1} bi ∂u/∂xi + cu = f   in Ω,   (1.5a)
u = g   on ∂Ω,   (1.5b)

where (aij(x))1≤i,j≤n is such that the problem is elliptic. A solution u of (1.5) is called a classical solution if u ∈ C2(Ω) ∩ C(Ω̄). The functions (aij(x))1≤i,j≤n, (bi(x))1≤i≤n and c(x) are called the coefficients of the operator L.


1.2.2 Examples

We assume n = 2, i.e. a problem with two independent variables, say x1 = x and x2 = y. Then the differential operator is given by

Lu = a11 ∂²u/∂x² + 2a12 ∂²u/(∂x∂y) + a22 ∂²u/∂y² + b1 ∂u/∂x + b2 ∂u/∂y + cu.

In this case we have λ1(x)λ2(x) = det(A(x)) and the ellipticity condition can be formulated as

a11(x, y) a22(x, y) − a12(x, y)² > 0,   (x, y) ∈ Ω.

Examples of elliptic equations are the Laplace equation

∆u := ∂²u/∂x² + ∂²u/∂y² = 0   in Ω,

the Poisson equation (cf. Poisson [72])

−∆u = f in Ω, (1.6)

the reaction-diffusion equation

−∆u + cu = f   in Ω,   (1.7)

and the convection-diffusion equation

−ε∆u + b1 ∂u/∂x + b2 ∂u/∂y = f   in Ω, ε > 0.   (1.8)

If we add Dirichlet boundary conditions to the Poisson equation in (1.6), we obtain the classical Dirichlet problem for Poisson’s equation:

−∆u = f   in Ω,
u = g   on ∂Ω.   (1.9)

Remark 1.2.2 We briefly comment on the convection-diffusion equation in (1.8). If |ε/b1| ≪ 1 or |ε/b2| ≪ 1 (in a part of the domain) then the diffusion term −ε∆u can be seen as a perturbation of the convection term b1 ∂u/∂x + b2 ∂u/∂y (in a part of the domain). The convection-diffusion equation is of elliptic type. However, for ε = 0 we obtain the so-called reduced equation, which is of hyperbolic type. In view of this the convection-diffusion equation with |ε/b1| ≪ 1 or |ε/b2| ≪ 1 is called a singularly perturbed equation. The fact that the elliptic equation (1.8) is then in some sense close to a hyperbolic equation results in special phenomena that do not occur in diffusion-dominated problems (such as (1.9)). For example, in a convection-dominated problem (e.g., an equation as in (1.8) with ε ≪ 1 and bi = 1, i = 1, 2) the solution u shows a behaviour in which most of the information is transported in certain directions (“streamlines”). So we observe a behaviour as in the hyperbolic problem (ε = 0), in which the solution satisfies an ordinary differential equation along each characteristic. Another phenomenon is the occurrence of boundary layers. If we combine the equation in (1.8) with Dirichlet boundary conditions on ∂Ω then in general these boundary conditions are not appropriate for the hyperbolic problem (ε = 0). As a result, if |ε/b1| ≪ 1 or |ε/b2| ≪ 1 we often observe that on a part of the boundary (corresponding to the outflow boundary in the hyperbolic problem) there is a small neighbourhood in which the solution u varies very rapidly. Such a neighbourhood is called a boundary layer.


For a detailed analysis of singularly perturbed convection-diffusion equations we refer to Roos et al. [76]. An illustration of the two phenomena described above is given in Section ??. Finally we note that for the numerical solution of a problem with a singularly perturbed equation special tools are needed, both with respect to the discretization of the problem and the iterative solver for the discrete problem.

1.2.3 Existence, uniqueness, regularity

For the elliptic boundary value problems introduced above, a first important topic that should be addressed concerns the existence and uniqueness of a solution. If a unique solution exists then another issue is the smoothness of the solution and how this smoothness depends on the data (source term, boundary condition, coefficients). Such smoothness results are called regularity properties of the problem. The topic of existence, uniqueness and regularity has been, and still is, the subject of many mathematical studies. We will not treat these topics here. We only give a few references to standard books in this field: Gilbarg and Trudinger [39], Miranda [64], Lions and Magenes [60], Hackbusch [45], [47]. We note that for the classical formulation of an elliptic boundary value problem it is often rather hard to establish satisfactory results on existence, uniqueness or regularity. In Section 2.5 we will discuss the variational (or weak) formulation of elliptic boundary value problems. In that setting, additional tools for the analysis of existence, uniqueness and regularity are available and (much) more results are known.

Example 1.2.3 The reaction-diffusion equation can be used to show that a solution of an elliptic Dirichlet boundary value problem as in (1.5) need not be unique. Consider the problem in (1.7) on Ω = (0, 1)², with f = 0 and c(x, y) = −(µπ)² − (νπ)², µ, ν ∈ N, combined with zero Dirichlet boundary conditions. Then both u(x, y) ≡ 0 and u(x, y) = sin(µπx) sin(νπy) are solutions of this boundary value problem.
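This nonuniqueness is easy to verify symbolically; a short check (our own illustration, with the sample choice µ = 2, ν = 3):

```python
import sympy as sp

# Symbolic check (our own illustration) that both u = 0 and
# u = sin(mu*pi*x) sin(nu*pi*y) solve -Laplace(u) + c*u = 0 with
# c = -(mu*pi)^2 - (nu*pi)^2; sample choice mu = 2, nu = 3.
x, y = sp.symbols('x y')
mu, nu = 2, 3
c = -(mu * sp.pi) ** 2 - (nu * sp.pi) ** 2
u = sp.sin(mu * sp.pi * x) * sp.sin(nu * sp.pi * y)

residual = -(sp.diff(u, x, 2) + sp.diff(u, y, 2)) + c * u
print(sp.simplify(residual))        # 0
# u vanishes on the boundary of (0,1)^2, e.g. on the edges x = 0 and x = 1:
print(u.subs(x, 0), u.subs(x, 1))   # 0 0
```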

Example 1.2.4 Even for very simple elliptic problems a classical solution may not exist. Consider

−a u′′ = 1   in (0, 1),
u(0) = u(1) = 0,

with a(x) = 1 for 0 ≤ x ≤ 0.5 and a(x) = 2 for 0.5 < x ≤ 1. Clearly the second derivative of a solution u of this problem cannot be continuous at x = 0.5.

We present a typical result from the literature on existence and uniqueness of a classical solution. For this we need a certain condition on ∂Ω. The domain Ω is said to satisfy the exterior sphere condition if for every x0 ∈ ∂Ω there exists a ball B such that B ∩ Ω̄ = {x0}. Note that this condition is fulfilled, for example, if Ω is convex or if ∂Ω ∈ C2,0.


Theorem 1.2.5 ([39], Theorem 6.13) Consider the boundary value problem (1.5) and assume that

(i) L is uniformly elliptic,

(ii) Ω satisfies the exterior sphere condition,

(iii) the coefficients of L and the function f belong to C0,λ(Ω), λ ∈ (0, 1),

(iv) c ≤ 0 holds,

(v) the boundary data are continuous: g ∈ C(∂Ω).

Then the problem (1.5) has a unique classical solution u ∈ C2,λ(Ω) ∩ C(Ω̄).

With respect to regularity of the solution it is important to distinguish between interior smoothness (i.e., in Ω) and global smoothness (i.e., in Ω̄). A typical result on interior smoothness is given in the next theorem:

Theorem 1.2.6 ([39], Theorem 6.17) Let u ∈ C2(Ω) be a solution of (1.5). Suppose that L is elliptic and that there are k ∈ N, λ ∈ (0, 1) such that the coefficients of L and the function f are in Ck,λ(Ω). Then u ∈ Ck+2,λ(Ω) holds. If the coefficients and f are in C∞(Ω), then u ∈ C∞(Ω).

This result shows that the interior regularity depends on the smoothness of the coefficients and of the right-hand side f, but does not depend on the smoothness of the boundary (data). A result on global regularity is given in:

Theorem 1.2.7 ([39], Theorem 6.19) Let u ∈ C2(Ω) ∩ C(Ω̄) be a classical solution of (1.5). Suppose that L is uniformly elliptic and that there are k ∈ N, λ ∈ (0, 1) such that the coefficients of L and the function f are in Ck,λ(Ω̄), ∂Ω ∈ Ck+2,λ. Assume that g can be extended on Ω̄ such that g ∈ Ck+2,λ(Ω̄). Then u ∈ Ck+2,λ(Ω̄) holds.

For a global regularity result as in the previous theorem to hold, the smoothness of the boundary (data) is important. In practice one often has a domain with a boundary consisting of the union of straight lines (in 2D) or planes (3D). Then the previous theorem does not apply and the global regularity of the solution can be rather low, as is shown in the next example.

Example 1.2.8 [from [47], p. 13] We consider (1.9) with Ω = (0, 1) × (0, 1), f ≡ 0, g(x, y) = x² (so g ∈ C(∂Ω), g ∈ C∞(Ω̄)). Then Theorem 1.2.5 guarantees the existence of a unique classical solution u ∈ C2(Ω) ∩ C(Ω̄). However, u is not an element of C2(Ω̄).
Proof. Assume that u ∈ C2(Ω̄) holds. From this and −∆u = 0 in Ω it follows that ∆u = 0 in Ω̄ holds. From u = g = x² on ∂Ω we get uxx(x, 0) = 2 for x ∈ [0, 1] and uyy(0, y) = 0 for y ∈ [0, 1]. It follows that ∆u(0, 0) = 2 must hold, which yields a contradiction.

1.3 The Stokes equations

The n-dimensional Navier-Stokes equations model the motion of an incompressible viscous medium. They can be derived using basic principles from continuum mechanics (cf. [43]). The unknowns are the velocity field u(x) = (u1(x), . . . , un(x)) and the pressure p(x), x ∈ Ω. If one


considers a steady-state situation then these Navier-Stokes equations, in dimensionless quantities, are as follows:

−ν∆ui + Σⁿ_{j=1} uj ∂ui/∂xj + ∂p/∂xi = fi   in Ω, 1 ≤ i ≤ n,   (1.10a)

Σⁿ_{j=1} ∂uj/∂xj = 0   in Ω,   (1.10b)

with ν > 0 a parameter that is related to the viscosity of the medium. Using the notation ∆u := (∆u1, . . . , ∆un)ᵀ, div u := Σⁿ_{j=1} ∂uj/∂xj, f = (f1, . . . , fn)ᵀ, we obtain the more compact formulation

−ν∆u + (u · ∇)u + ∇p = f   in Ω,   (1.11a)

div u = 0   in Ω.   (1.11b)

Note that the pressure p is determined only up to a constant by these Navier-Stokes equations. The problem has to be completed with suitable boundary conditions. One simple possibility is to take homogeneous Dirichlet boundary conditions for u, i.e., u = 0 on ∂Ω. If in the Navier-Stokes equations the nonlinear convection term (u · ∇)u is neglected, which can be justified in situations where the viscosity parameter ν is large, one obtains the Stokes equations. From a simple rescaling argument (replace u by (1/ν)u) it follows that without loss of generality in the Stokes equations we can assume ν = 1. Summarizing, we obtain the following Stokes problem:

−∆u + ∇p = f in Ω, (1.12a)

div u = 0 in Ω, (1.12b)

u = 0 on ∂Ω. (1.12c)

This is a stationary boundary value problem, consisting of n + 1 coupled partial differential equations for the unknowns (u1, . . . , un) and p.

In [2] the notion of ellipticity is generalized to systems of partial differential equations. It can be shown (cf. [2, 47]) that the Stokes equations indeed form an elliptic system. We do not discuss existence and uniqueness of a classical solution of the Stokes problem. In chapter 2 we discuss the variational formulation of the Stokes problem. For this formulation the issue of existence, uniqueness and regularity of a solution is treated in section 2.6.


Chapter 2

Weak formulation

2.1 Introduction

For solving a boundary value problem it can be (very) advantageous to consider a generalization of the classical problem formulation, in which larger function spaces are used and a “weaker” solution (explained below) is allowed. This results in the variational formulation (also called weak formulation) of a boundary value problem. In this section we consider an introductory example which illustrates that even for a very simple boundary value problem the choice of an “appropriate solution space” is an important issue. This example also serves as a motivation for the introduction of the Sobolev spaces in section 2.2.

Consider the following elliptic two-point boundary value problem:

−(au′)′ = 1 in (0, 1), (2.1a)

u(0) = u(1) = 0. (2.1b)

We assume that the coefficient a is an element of C1([0, 1]) and that a(x) > 0 holds for all x ∈ [0, 1]. This problem then has a unique solution in the space

V1 = {v ∈ C2([0, 1]) | v(0) = v(1) = 0}.

This solution is given by

u(x) = ∫₀ˣ (−t + c)/a(t) dt,   c := (∫₀¹ t/a(t) dt) (∫₀¹ 1/a(t) dt)⁻¹,

which may be checked by substitution in (2.1). If one multiplies the equation (2.1a) by an arbitrary function v ∈ V1, integrates both the left- and the right-hand side and then applies partial integration, one can show that u ∈ V1 is the solution of (2.1) if and only if

∫₀¹ a(x) u′(x) v′(x) dx = ∫₀¹ v(x) dx   for all v ∈ V1.   (2.2)
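The closed-form solution above can be checked symbolically; a sketch (our own illustration; the coefficient a(x) = 1 + x is a sample choice of ours, not from the text):

```python
import sympy as sp

# Symbolic check (our own illustration) of the closed-form solution of
#   -(a u')' = 1 on (0,1),  u(0) = u(1) = 0,
# for the sample coefficient a(x) = 1 + x (an assumption, not from the text).
x, t = sp.symbols('x t')
a = 1 + t  # a written in the integration variable t

c = sp.integrate(t / a, (t, 0, 1)) / sp.integrate(1 / a, (t, 0, 1))
u = sp.integrate((-t + c) / a, (t, 0, x))

ax = a.subs(t, x)
residual = sp.simplify(-sp.diff(ax * sp.diff(u, x), x))
print(residual)                                              # 1, i.e. -(a u')' = 1
print(sp.simplify(u.subs(x, 0)), sp.simplify(u.subs(x, 1)))  # 0 0
```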

This variational problem can be reformulated as a minimization problem. For this we introduce the notion of a bilinear form.


Definition 2.1.1 Let X be a vector space. A mapping k : X × X → R is called a bilinear form if for arbitrary α, β ∈ R and u, v, w ∈ X the following holds:

k(αu + βv, w) = α k(u, w) + β k(v, w),
k(u, αv + βw) = α k(u, v) + β k(u, w).

The bilinear form is symmetric if k(u, v) = k(v, u) holds for all u, v ∈ X.

Lemma 2.1.2 Let X be a vector space and k : X × X → R a symmetric bilinear form which is positive, i.e., k(v, v) > 0 for all v ∈ X, v ≠ 0. Let f : X → R be a linear functional. Define J : X → R by

J(v) = (1/2) k(v, v) − f(v).

Then J(u) < J(v) for all v ∈ X, v ≠ u, holds if and only if

k(u, v) = f(v) for all v ∈ X. (2.3)

Moreover, there exists at most one minimizer u of J(·).

Proof. Take u, w ∈ X, t ∈ R. Note that

J(u + tw) − J(u) = t ( k(u, w) − f(w) ) + (1/2) t² k(w, w) =: g(t; u, w).

“⇒”. If J(u) < J(v) for all v ∈ X, v ≠ u, then the function t → g(t; u, w) must be strictly positive for all w ∈ X \ {0} and t ∈ R \ {0}. It follows that k(u, w) − f(w) = 0 for all w ∈ X.
“⇐”. From (2.3) it follows that J(u + tw) − J(u) = (1/2) t² k(w, w) for all w ∈ X, t ∈ R. For v ≠ u, w = v − u, t = 1 this yields J(v) − J(u) = (1/2) k(v − u, v − u) > 0.

We finally prove the uniqueness of the minimizer. Assume that k(u_i, v) = f(v) for all v ∈ X and for i = 1, 2. It follows that k(u₁ − u₂, v) = 0 for all v ∈ X. For the choice v = u₁ − u₂ we get k(u₁ − u₂, u₁ − u₂) = 0. From the property that the bilinear form k(·, ·) is positive we conclude u₁ = u₂.
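As an aside, lemma 2.1.2 already contains the finite-dimensional fact underlying the CG method mentioned later. A sketch with hypothetical data (a small SPD matrix K, so that k(u, v) = uᵀKv and f(v) = fᵀv):

```python
import numpy as np

# Finite-dimensional analogue of lemma 2.1.2: for symmetric positive definite K,
# J(v) = 1/2 v^T K v - f^T v is minimized exactly at the solution of K u = f.
K = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])      # symmetric, positive definite
f = np.array([1.0, 0.0, 1.0])

def J(v):
    return 0.5 * v @ K @ v - f @ v

u = np.linalg.solve(K, f)            # k(u, v) = f(v) for all v  <=>  K u = f

rng = np.random.default_rng(0)
for _ in range(100):                 # J(u) < J(v) for every v != u
    w = rng.standard_normal(3)
    assert J(u) < J(u + w)

v = u + np.array([0.3, -0.2, 0.5])   # the gap is 1/2 k(v-u, v-u), cf. (2.4)
gap = J(v) - J(u)
assert np.isclose(gap, 0.5 * (v - u) @ K @ (v - u))
```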

Note that in this lemma we do not claim existence of a solution.
For the minimizer u (if it exists) the relation

J(v) − J(u) = (1/2) k(v − u, v − u) for all v ∈ X (2.4)

holds. We now return to the example and take

X = V₁,  k(u, v) = ∫_0^1 a(x)u′(x)v′(x) dx,  f(v) = ∫_0^1 v(x) dx.

Note that all assumptions of lemma 2.1.2 are fulfilled. It then follows that the unique solution of (2.1) (or, equivalently, of (2.2)) is also the unique minimizer of the functional

J(v) = ∫_0^1 [ (1/2) a(x) v′(x)² − v(x) ] dx. (2.5)

We consider a case in which the coefficient a is only piecewise continuous (and not differentiable at all x ∈ (0, 1)). Then the problem in (2.1) is not well-defined. However, the definitions of the


bilinear form k(·, ·) and of the functional J(·) still make sense. We now analyze a minimization problem with a functional as in (2.5) in which the coefficient a is piecewise constant:

a(x) = 1 if x ∈ [0, 1/2],  a(x) = 2 if x ∈ (1/2, 1].

We show that for this problem the choice of an appropriate solution space is a delicate issue. Note that due to lemma 2.1.2 the minimization problem in X = V₁ has a corresponding equivalent variational formulation as in (2.3). With our choice of the coefficient a the functional J(·) takes the form

J(v) := ∫_0^{1/2} [ (1/2) v′(x)² − v(x) ] dx + ∫_{1/2}^1 [ v′(x)² − v(x) ] dx. (2.6)

This functional is well-defined on the space V₁. The functional J, however, is also well-defined if v is only differentiable, and even if we allow v to be nondifferentiable at x = 1/2. We introduce the spaces

V₂ = { v ∈ C¹([0, 1]) | v(0) = v(1) = 0 },
V₃ = { v ∈ C¹([0, 1/2]) ∩ C¹([1/2, 1]) ∩ C([0, 1]) | v(0) = v(1) = 0 },
V₄ = { v ∈ C¹([0, 1/2]) ∩ C¹([1/2, 1]) | v(0) = v(1) = 0 }.

Note that V₁ ⊂ V₂ ⊂ V₃ ⊂ V₄ and that on all these spaces the functional J(·) is well-defined. Moreover, with X = V_i, i = 1, . . . , 4, and

k(u, v) = ∫_0^{1/2} u′(x)v′(x) dx + ∫_{1/2}^1 2u′(x)v′(x) dx,  f(v) = ∫_0^1 v(x) dx, (2.7)

all assumptions of lemma 2.1.2 are fulfilled. We define a (natural) norm on these spaces:

|||w|||² := ∫_0^{1/2} w′(x)² dx + ∫_{1/2}^1 w′(x)² dx. (2.8)

One easily checks that this indeed defines a norm on the space V₄ and thus also on the subspaces V_i, i = 1, 2, 3. Furthermore, this norm is induced by the scalar product

(w, v)₁ := ∫_0^{1/2} w′(x)v′(x) dx + ∫_{1/2}^1 w′(x)v′(x) dx (2.9)

on V₄, and

|||w|||² ≤ k(w, w) ≤ 2 |||w|||² for all w ∈ V₄ (2.10)

holds (the bounds follow from 1 ≤ a(x) ≤ 2). We show that in the space V₃ the minimization problem has a unique solution.

Lemma 2.1.3 The problem min_{v∈V₃} J(v) has a unique solution u given by

u(x) = −(1/2)x² + (5/12)x         if 0 ≤ x ≤ 1/2,
u(x) = −(1/4)x² + (5/24)x + 1/24  if 1/2 ≤ x ≤ 1. (2.11)


Proof. Note that u ∈ V₃ and even u ∈ C^∞([0, 1/2]) ∩ C^∞([1/2, 1]). We use the notation u′_L(1/2) = lim_{x↑1/2} u′(x) and similarly for u′_R(1/2). We apply lemma 2.1.2 with X = V₃. For arbitrary v ∈ V₃ we have

k(u, v) − f(v) = ∫_0^{1/2} ( u′(x)v′(x) − v(x) ) dx + ∫_{1/2}^1 ( 2u′(x)v′(x) − v(x) ) dx
             = u′_L(1/2) v(1/2) − ∫_0^{1/2} ( u″(x) + 1 ) v(x) dx
               − 2u′_R(1/2) v(1/2) − ∫_{1/2}^1 ( 2u″(x) + 1 ) v(x) dx. (2.12)

Due to u″(x) = −1 on [0, 1/2], u″(x) = −1/2 on [1/2, 1] and u′_L(1/2) − 2u′_R(1/2) = 0 we obtain k(u, v) = f(v) for all v ∈ V₃. From lemma 2.1.2 we conclude that u is the unique minimizer in V₃.

Thus with X = V₃ a minimizer u exists and the relation (2.4) takes the form

J(v) − J(u) = (1/2) k(v − u, v − u),

with k(·, ·) as in (2.7). Due to (2.10) the norm ||| · ||| can be used as a measure for the distance from the minimum (i.e. J(v) − J(u)):

(1/2) |||v − u|||² ≤ J(v) − J(u) ≤ |||v − u|||². (2.13)

Before we turn to the minimization problems in the spaces V₁ and V₂ we first present a useful lemma.

Lemma 2.1.4 Define W := { v ∈ C^∞([0, 1]) | v(0) = v(1) = 0 }. For every u ∈ V₃ there is a sequence (u_n)_{n≥1} in W such that

lim_{n→∞} |||u_n − u||| = 0. (2.14)

Proof. Take u ∈ V₃ and define ū(x) := u′(x) for all x ∈ [0, 1], x ≠ 1/2, ū(1/2) a fixed arbitrary value, and ū(−x) := ū(x) for all x ∈ [0, 1]. Then ū is even and ū ∈ L²((−1, 1)). From Fourier analysis it follows that there is a sequence

ū_n(x) = Σ_{k=0}^n a_k cos(kπx),  n ∈ N,

such that

lim_{n→∞} ‖ū − ū_n‖²_{L²} = lim_{n→∞} ∫_{−1}^1 ( ū(x) − ū_n(x) )² dx = 0.

Note that due to the fact that u is continuous and u(0) = u(1) = 0 we get a₀ = (1/2) ∫_{−1}^1 ū(x) dx = ∫_0^{1/2} u′(x) dx + ∫_{1/2}^1 u′(x) dx = 0. Define u_n(x) = Σ_{k=1}^n (a_k/(kπ)) sin(kπx) for n ≥ 1. Then u_n ∈ W, u_n′ = ū_n and

|||u − u_n|||² ≤ ‖ū − ū_n‖²_{L²}

holds. Thus it follows that lim_{n→∞} |||u_n − u||| = 0.
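The approximation constructed in this proof can be observed numerically — an editorial sketch in which the cosine coefficients of the even extension of u′ are computed by quadrature:

```python
import numpy as np

# Sketch of lemma 2.1.4: approximate the minimizer u from (2.11) by
# u_n(x) = sum_{k=1}^n (a_k/(k*pi)) sin(k*pi*x), where the a_k are the
# Fourier cosine coefficients of the even extension of u'.
x, dx = np.linspace(0.0, 1.0, 40001, retstep=True)
du = np.where(x <= 0.5, -x + 5.0 / 12.0, -x / 2.0 + 5.0 / 24.0)  # u'(x), jumps at 1/2

def integ(f):                                  # trapezoidal rule on [0, 1]
    return float(np.sum(0.5 * (f[1:] + f[:-1])) * dx)

def seminorm_err(n):
    """||| u - u_n |||, with |||w|||^2 = int_0^{1/2} w'^2 + int_{1/2}^1 w'^2."""
    dun = np.zeros_like(x)
    for k in range(1, n + 1):
        ak = 2.0 * integ(du * np.cos(k * np.pi * x))   # cosine coefficient a_k
        dun += ak * np.cos(k * np.pi * x)              # u_n'(x)
    return np.sqrt(integ((du - dun) ** 2))

errs = [seminorm_err(n) for n in (2, 8, 32, 128)]
assert all(e2 < e1 for e1, e2 in zip(errs, errs[1:]))  # error decreases with n
```

Note that the error decreases even though each u_n is C^∞ while u′ has a jump: only the ||| · |||-norm, not C¹-convergence, is obtained.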


Lemma 2.1.5 Let u ∈ V₃ be given by (2.11). For i = 1, 2 the following holds:

inf_{v∈V_i} J(v) = min_{v∈V₃} J(v) = J(u). (2.15)

Proof. Take i = 1 or i = 2. I := inf_{v∈V_i} J(v) is defined as the greatest lower bound of J(v) for v ∈ V_i. From V₃ ⊃ V_i it follows that J(u) ≤ I holds. Suppose that J(u) < I holds, i.e. we have δ := I − J(u) > 0. Due to W ⊂ V_i and lemma 2.1.4 there is a sequence (u_n)_{n≥1} in V_i such that lim_{n→∞} |||u − u_n||| = 0 holds. Using (2.13) we obtain

J(u_n) = J(u) + ( J(u_n) − J(u) ) ≤ I − δ + |||u_n − u|||².

So for n sufficiently large we have J(u_n) < I, which on the other hand is not possible because I is a lower bound of J(v) for v ∈ V_i. We conclude that J(u) = I holds.

The result in this lemma shows that the infimum of J(v) for v ∈ V₂ is equal to J(u) and thus, using (2.13), it follows that the minimizer u ∈ V₃ can be approximated to any accuracy, measured in the norm ||| · |||, by elements from the smaller space V₂. The question arises why in the minimization problem the space V₃ is used and not the seemingly more natural space V₂. The answer to this question is formulated in the following lemma.

Lemma 2.1.6 There does not exist ū ∈ V₂ such that J(ū) ≤ J(v) for all v ∈ V₂.

Proof. Suppose that such a minimizer, say ū ∈ V₂, exists. From lemma 2.1.5 we then obtain

J(ū) = min_{v∈V₂} J(v) = inf_{v∈V₂} J(v) = J(u),

with u as in (2.11) the minimizer in V₃. Note that ū ∈ V₃. From lemma 2.1.2 it follows that the minimizer in V₃ is unique and thus ū = u must hold. But then ū = u ∉ V₂, which yields a contradiction.

The same arguments as in the proof of this lemma can be used to show that in the smaller space V₁ there also does not exist a minimizer. Based on these results the function u ∈ V₃ is called the weak solution of the minimization problem in V₂. From (2.15) we see that for solving the minimization problem in the space V₂, in the sense that one wants to compute inf_{v∈V₂} J(v), it is natural to consider the minimization problem in the larger space V₃.

We now consider the even larger space V₄ and show that the minimization problem still makes sense (i.e. has a unique solution). However, the minimum value does not equal inf_{v∈V₂} J(v).

Lemma 2.1.7 The problem min_{v∈V₄} J(v) has a unique solution ũ given by

ũ(x) = −(1/2)x(x − 1) if 0 ≤ x ≤ 1/2,
ũ(x) = −(1/4)x(x − 1) if 1/2 < x ≤ 1.

Proof. Note that ũ ∈ V₄ holds. We apply lemma 2.1.2. Recall the relation (2.12):

k(ũ, v) − f(v) = ũ′_L(1/2) v_L(1/2) − ∫_0^{1/2} ( ũ″(x) + 1 ) v(x) dx
               − 2ũ′_R(1/2) v_R(1/2) − ∫_{1/2}^1 ( 2ũ″(x) + 1 ) v(x) dx,

where now the one-sided values v_L(1/2), v_R(1/2) occur because v ∈ V₄ may be discontinuous at x = 1/2. From ũ″(x) = −1 on [0, 1/2], ũ″(x) = −1/2 on [1/2, 1] and ũ′_L(1/2) = ũ′_R(1/2) = 0 it follows that k(ũ, v) = f(v) for all v ∈ V₄. We conclude that ũ is the unique minimizer in V₄.

A straightforward calculation yields the following values for the minima of the functional J(·) in V₃ and in V₄, respectively:

J(u) = −(11/12)·(1/32) = −11/384,   J(ũ) = −1/32.

From this we see that, opposite to u, we should not call ũ a weak solution of the minimization problem in V₂, because for ũ we have J(ũ) < inf_{v∈V₂} J(v).
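The two minimum values can be reproduced by exact rational integration of piecewise polynomials — an editorial sketch in which polynomials are stored as coefficient lists [c0, c1, ...] meaning c0 + c1·x + ...:

```python
from fractions import Fraction as F

def pderiv(p):                             # derivative of a coefficient list
    return [F(k) * c for k, c in enumerate(p)][1:] or [F(0)]

def pmul(p, q):                            # product of two polynomials
    r = [F(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def pint(p, lo, hi):                       # exact definite integral over [lo, hi]
    antider = [F(0)] + [c / (k + 1) for k, c in enumerate(p)]
    ev = lambda x: sum(c * x**k for k, c in enumerate(antider))
    return ev(hi) - ev(lo)

def Jfun(pL, pR):                          # J from (2.6), v = (pL on [0,1/2], pR on [1/2,1])
    h, one = F(1, 2), F(1)
    dL, dR = pderiv(pL), pderiv(pR)
    left  = F(1, 2) * pint(pmul(dL, dL), F(0), h) - pint(pL, F(0), h)
    right = pint(pmul(dR, dR), h, one) - pint(pR, h, one)
    return left + right

u  = ([F(0), F(5, 12), F(-1, 2)], [F(1, 24), F(5, 24), F(-1, 4)])  # (2.11)
ut = ([F(0), F(1, 2), F(-1, 2)], [F(0), F(1, 4), F(-1, 4)])        # V4 minimizer

assert Jfun(*u)  == F(-11, 384)
assert Jfun(*ut) == F(-1, 32)
assert Jfun(*ut) < Jfun(*u)        # the V4 minimum lies strictly below J(u)
```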

We conclude the discussion of this example with a few remarks on issues that play an important role in the remainder of this book.

1) Both the theoretical analysis and the numerical solution methods treated in this book heavily rely on the variational formulation of the elliptic boundary value problem (as, for example, in (2.2)). In section 2.3 general results on existence, uniqueness and stability of variational problems are presented. In the sections 2.5 and 2.6 these are applied to variational formulations of elliptic boundary value problems. The finite element discretization method treated in chapter 3 is based on the variational formulation of the boundary value problem. The derivation of the conjugate gradient (CG) iterative method, discussed in chapter 7, is based on the assumption that the given (discrete) problem can be formulated as a minimization problem with a functional J very similar to the one in lemma 2.1.2.

2) The bilinear form used in the weak formulation often has properties similar to those of an inner product, cf. (2.7), (2.9), (2.10). To take advantage of this one should formulate the problem in an inner product space. Then the structure of the space (inner product) fits nicely to the variational problem.

3) To guarantee that a “weak solution” actually lies in the solution space one should use a space that is “large enough” but “not too large”. This can be realized by completion of the space in which the problem is formulated. The concept of completion is explained in section 2.2.2.

The conditions discussed in the remarks 2) and 3) lead to Hilbert spaces, i.e. inner product spaces that are complete. The Hilbert spaces that are appropriate for elliptic boundary value problems are the Sobolev spaces. These are treated in section 2.2.

2.2 Sobolev spaces

The Hölder spaces C^{k,λ}(Ω) that are used in the classical formulation of elliptic boundary value problems in chapter 1 are Banach spaces but not Hilbert spaces. In this section we introduce Sobolev spaces. All Sobolev spaces are Banach spaces. Some of these are Hilbert spaces. In our treatment of elliptic boundary value problems we only need these Hilbert spaces and thus we restrict ourselves to the presentation of this subset of Sobolev Hilbert spaces. A very general treatment of Sobolev spaces is given in [1].


2.2.1 The spaces W^m(Ω) based on weak derivatives

We take u ∈ C¹(Ω) and φ ∈ C₀^∞(Ω). Since φ vanishes identically outside some compact subset of Ω, one obtains by partial integration in the variable x_j:

∫_Ω (∂u(x)/∂x_j) φ(x) dx = − ∫_Ω u(x) (∂φ(x)/∂x_j) dx,

and thus

∫_Ω D^α u(x) φ(x) dx = − ∫_Ω u(x) D^α φ(x) dx,  |α| = 1,

holds. Repeated application of this result yields the fundamental Green’s formula

∫_Ω D^α u(x) φ(x) dx = (−1)^{|α|} ∫_Ω u(x) D^α φ(x) dx
for all φ ∈ C₀^∞(Ω), u ∈ C^k(Ω), k = 1, 2, . . . and |α| ≤ k. (2.16)

Based on this formula we introduce the notion of a weak derivative:

Definition 2.2.1 Consider u ∈ L²(Ω) and |α| > 0. If there exists v ∈ L²(Ω) such that

∫_Ω v(x) φ(x) dx = (−1)^{|α|} ∫_Ω u(x) D^α φ(x) dx for all φ ∈ C₀^∞(Ω), (2.17)

then v is called the αth weak derivative of u and is denoted by D^α u := v.

Two elementary results are given in the next lemma.

Lemma 2.2.2 If for u ∈ L²(Ω) the αth weak derivative exists then it is unique (in the usual Lebesgue sense). If u ∈ C^k(Ω) then for 0 < |α| ≤ k the αth weak derivative and the classical αth derivative coincide.

Proof. The second statement follows from the first one and Green’s formula (2.16). We now prove the uniqueness. Assume that v_i ∈ L²(Ω), i = 1, 2, both satisfy (2.17). Then it follows that

∫_Ω ( v₁(x) − v₂(x) ) φ(x) dx = ⟨v₁ − v₂, φ⟩_{L²} = 0 for all φ ∈ C₀^∞(Ω).

Since C₀^∞(Ω) is dense in L²(Ω) this implies that ⟨v₁ − v₂, φ⟩_{L²} = 0 for all φ ∈ L²(Ω) and thus v₁ − v₂ = 0 (a.e.).

Remark 2.2.3 As a warning we note that if the classical derivative of u, say D^α u, exists almost everywhere in Ω and D^α u ∈ L²(Ω), this does not guarantee the existence of the αth weak derivative. This is shown by the following example:

Ω = (−1, 1),  u(x) = 0 if x < 0,  u(x) = 1 if x ≥ 0.

The classical first derivative of u on Ω \ {0} is u′(x) = 0. However, the weak first derivative of u as defined in 2.2.1 does not exist.

A further noticeable observation is the following: if u ∈ C^∞(Ω) ∩ C(Ω̄) then u does not always have a first weak derivative. This is shown by the example Ω = (0, 1), u(x) = √x. The only candidate for the first weak derivative of u is v(x) = 1/(2√x). However, v ∉ L²(Ω).
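The first counterexample can be seen numerically — an editorial sketch: integration by parts against any test function φ shows that a weak derivative of the step function would have to act like a point mass at 0, which no L² function can do.

```python
import numpy as np

# Sketch of remark 2.2.3: for the step function u on (-1, 1),
# -int_{-1}^{1} u(x) phi'(x) dx = phi(0) for every test function phi,
# so a weak derivative in L2 would have to be a Dirac delta at 0.
x, dx = np.linspace(-1.0, 1.0, 200001, retstep=True)
u = (x >= 0).astype(float)                 # u = 0 for x < 0, u = 1 for x >= 0

def integ(f):                              # trapezoidal rule on the grid
    return float(np.sum(0.5 * (f[1:] + f[:-1])) * dx)

def phi(t):                                # a C^infinity bump, support in (-1, 1)
    out = np.zeros_like(t)
    m = np.abs(t) < 1.0
    out[m] = np.exp(-1.0 / (1.0 - t[m] ** 2))
    return out

dphi = np.gradient(phi(x), dx)             # phi' by finite differences
lhs = -integ(u * dphi)                     # -int_{-1}^1 u phi' dx
assert np.isclose(lhs, np.exp(-1.0), atol=1e-4)   # equals phi(0) = e^{-1}
```

Repeating this with bumps centered elsewhere reproduces φ(x₀) only for x₀ = 0, i.e. the delta behavior at the jump.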


The Sobolev space W^m(Ω), m = 1, 2, . . . , consists of all functions in L²(Ω) for which all αth weak derivatives with |α| ≤ m exist:

W^m(Ω) := { u ∈ L²(Ω) | D^α u exists for all 0 < |α| ≤ m }. (2.18)

Remark 2.2.4 By definition, for u ∈ W^m(Ω), Green’s formula

∫_Ω D^α u(x) φ(x) dx = (−1)^{|α|} ∫_Ω u(x) D^α φ(x) dx for all φ ∈ C₀^∞(Ω), |α| ≤ m,

holds.

For m = 0 we define W⁰(Ω) := L²(Ω). In W^m(Ω) a natural inner product and corresponding norm are defined by

⟨u, v⟩_m := Σ_{|α|≤m} ⟨D^α u, D^α v⟩_{L²},   ‖u‖_m := ⟨u, u⟩_m^{1/2},   u, v ∈ W^m(Ω). (2.19)

It is easy to verify that ⟨·, ·⟩_m defines an inner product on W^m(Ω). We now formulate a main result:

Theorem 2.2.5 The space (W^m(Ω), ⟨·, ·⟩_m) is a Hilbert space.

Proof. We must show that the space W^m(Ω) with the norm ‖ · ‖_m is complete. For m = 0 this is trivial. We consider m ≥ 1. First note that for v ∈ W^m(Ω):

‖v‖²_m = Σ_{|α|≤m} ‖D^α v‖²_{L²}. (2.20)

Let (u_k)_{k≥1} be a Cauchy sequence in W^m(Ω). From (2.20) it follows that if ‖u_k − u_ℓ‖_m ≤ ε then ‖D^α u_k − D^α u_ℓ‖_{L²} ≤ ε for all 0 ≤ |α| ≤ m. Hence, (D^α u_k)_{k≥1} is a Cauchy sequence in L²(Ω) for all |α| ≤ m. Since L²(Ω) is complete it follows that there exists a unique u^{(α)} ∈ L²(Ω) with lim_{k→∞} D^α u_k = u^{(α)} in L²(Ω). For |α| = 0 this yields lim_{k→∞} u_k = u^{(0)} in L²(Ω). For 0 < |α| ≤ m and arbitrary φ ∈ C₀^∞(Ω) we obtain

⟨u^{(α)}, φ⟩_{L²} = lim_{k→∞} ⟨D^α u_k, φ⟩_{L²} = lim_{k→∞} (−1)^{|α|} ⟨u_k, D^α φ⟩_{L²} = (−1)^{|α|} ⟨u^{(0)}, D^α φ⟩_{L²}. (2.21)

From this it follows that u^{(α)} ∈ L²(Ω) is the αth weak derivative of u^{(0)}. We conclude that u^{(0)} ∈ W^m(Ω) and

lim_{k→∞} ‖u_k − u^{(0)}‖²_m = lim_{k→∞} Σ_{|α|≤m} ‖D^α u_k − D^α u^{(0)}‖²_{L²} = Σ_{|α|≤m} lim_{k→∞} ‖D^α u_k − u^{(α)}‖²_{L²} = 0.

This shows that the Cauchy sequence (u_k)_{k≥1} in W^m(Ω) has its limit point in W^m(Ω) and thus this space is complete.


Similar constructions can be applied if we replace the Hilbert space L²(Ω) by the Banach space L^p(Ω), 1 ≤ p < ∞, of measurable functions for which ‖u‖_p := ( ∫_Ω |u(x)|^p dx )^{1/p} is bounded. This results in Sobolev spaces which are usually denoted by W^m_p(Ω). For notational simplicity we deleted the index p = 2 in our presentation. For p ≠ 2 the Sobolev space W^m_p(Ω) is a Banach space but not a Hilbert space. In this book we only need the Sobolev space with p = 2 as defined in (2.18). For p ≠ 2 we refer to the literature, e.g. [1].

2.2.2 The spaces H^m(Ω) based on completion

In this section we introduce the Sobolev spaces using a different technique, namely one based on the concept of completion. We recall a basic result (cf. Appendix A.1).

Lemma 2.2.6 Let (Z, ‖·‖) be a normed space. Then there exists a Banach space (X, ‖·‖∗) such that Z ⊂ X, ‖x‖ = ‖x‖∗ for all x ∈ Z, and Z is dense in X. The space X is called the completion of Z. This space is unique, except for isometric (i.e., norm preserving) isomorphisms.

Here we consider the function space

Z_m := { u ∈ C^∞(Ω) | ‖u‖_m < ∞ }, (2.22)

endowed with the norm ‖ · ‖_m as defined in (2.19), i.e., we want to construct the completion of (Z_m, ‖ · ‖_m). For m = 0 this results in L²(Ω), since C^∞(Ω) is dense in L²(Ω). Hence, we only consider m ≥ 1. Note that Z_m ⊂ W^m(Ω) and that this embedding is continuous. One can apply the general result of lemma 2.2.6 which then defines the completion of the space Z_m. However, here we want to present a more constructive approach which reveals some interesting relations between this completion and the space W^m(Ω).
First, note that due to (2.20) a Cauchy sequence (u_k)_{k≥1} in Z_m is also a Cauchy sequence in L²(Ω), and thus to every such sequence there corresponds a unique u ∈ L²(Ω) with lim_{k→∞} ‖u_k − u‖_{L²} = 0. The space V_m ⊃ Z_m is defined as follows:

V_m := { u ∈ L²(Ω) | lim_{k→∞} ‖u_k − u‖_{L²} = 0 for a Cauchy sequence (u_k)_{k≥1} in Z_m }.

One easily verifies that V_m is a vector space.

Lemma 2.2.7 V_m is the closure of Z_m in the space W^m(Ω).

Proof. Take u ∈ V_m and let (u_k)_{k≥1} be a Cauchy sequence in Z_m with lim_{k→∞} ‖u_k − u‖_{L²} = 0. From (2.20) it follows that (D^α u_k)_{k≥1}, 0 < |α| ≤ m, are Cauchy sequences in L²(Ω). Let u^{(α)} := lim_{k→∞} D^α u_k in L²(Ω). As in (2.21) one shows that u^{(α)} is the αth weak derivative D^α u of u. Using D^α u = lim_{k→∞} D^α u_k in L²(Ω), for 0 < |α| ≤ m, we get

lim_{k→∞} ‖u_k − u‖²_m = lim_{k→∞} Σ_{|α|≤m} ‖D^α u_k − D^α u‖²_{L²} = 0.

Since (u_k)_{k≥1} is a sequence in Z_m we have shown that V_m is the closure of Z_m in W^m(Ω).

On the space V_m we can take the same inner product (and corresponding norm) as used in the space W^m(Ω) (cf. (2.19)). From lemma 2.2.7 and the fact that in the space Z_m the norm is the same as the norm of W^m(Ω) it follows that (V_m, ‖ · ‖_m) is the completion of (Z_m, ‖ · ‖_m).


Since the norm ‖ · ‖_m is induced by an inner product we have that (V_m, ⟨·, ·⟩_m) is a Hilbert space. This defines the Sobolev space

H^m(Ω) := (V_m, ⟨·, ·⟩_m) = completion of (Z_m, ⟨·, ·⟩_m).

It is clear from lemma 2.2.7 that

H^m(Ω) ⊂ W^m(Ω)

holds. A fundamental result is the following:

Theorem 2.2.8 The equality H^m(Ω) = W^m(Ω) holds.

Proof. The first proof of this result was presented in [63]. A proof can also be found in [1, 65].

We see that the construction using weak derivatives (space W^m(Ω)) and the one based on completion (space H^m(Ω)) result in the same Sobolev space. In the remainder we will only use the notation H^m(Ω).

The result in theorem 2.2.8 holds for arbitrary open sets Ω in R^n. If the domain satisfies certain very mild smoothness conditions (it suffices to have assumption 1.1.3) one can prove a somewhat stronger result that we will need further on:

Theorem 2.2.9 Let H̃^m(Ω) be the completion of the space (C^∞(Ω̄), ⟨·, ·⟩_m). Then

H̃^m(Ω) = H^m(Ω) = W^m(Ω)

holds.

Proof. We refer to [1].

Note that C^∞(Ω̄) ⊊ Z_m and thus H̃^m(Ω) results from the completion of a smaller space than H^m(Ω).

Remark 2.2.10 If assumption 1.1.3 is not satisfied then it may happen that H̃^m(Ω) ⊊ W^m(Ω) holds. Consider the example

Ω = { (x, y) ∈ R² | x ∈ (−1, 0) ∪ (0, 1), y ∈ (0, 1) }

and take u(x, y) = 1 if x > 0, u(x, y) = 0 if x < 0. Then D^{(1,0)}u = D^{(0,1)}u = 0 on Ω and thus u ∈ W¹(Ω). However, one can verify that there does not exist a sequence (φ_k)_{k≥1} in C¹(Ω̄) such that

‖u − φ_k‖²₁ = ‖u − φ_k‖²_{L²} + Σ_{|α|=1} ‖D^α φ_k‖²_{L²} → 0 for k → ∞.

Hence, C¹(Ω̄) is not dense in W¹(Ω), i.e., H̃¹(Ω) ≠ W¹(Ω). The equality H¹(Ω) = W¹(Ω), however, does hold.

The completion can also be defined if in (2.22) the space C^∞(Ω) is replaced by the smaller space C₀^∞(Ω). This yields another class of important Sobolev spaces:

H^m₀(Ω) := completion of the space (C₀^∞(Ω), ⟨·, ·⟩_m). (2.23)

The space H^m₀(Ω) is a Hilbert space that is in general strictly smaller than H^m(Ω).


Remark 2.2.11 In general we have H¹₀(Ω) ⊊ H¹(Ω). Consider, as a simple example, Ω = (0, 1), u(x) = x. Then u ∈ H¹(Ω) but for arbitrary φ ∈ C₀^∞(Ω) we have

‖u − φ‖²₁ = ‖u − φ‖²_{L²} + ‖u′ − φ′‖²_{L²}
          ≥ ∫_0^1 ( u′(x) − φ′(x) )² dx = ∫_0^1 ( 1 − 2φ′(x) + φ′(x)² ) dx
          ≥ 1 − 2 ∫_0^1 φ′(x) dx = 1 − 2 ( φ(1) − φ(0) ) = 1.

Hence u ∉ H¹₀(Ω), the closure of C₀^∞(Ω) in the norm ‖ · ‖₁.
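The uniform gap in this remark can be observed numerically — an editorial sketch using a one-parameter family of admissible test functions (the family φ = s·sin²(πx) is an arbitrary choice for illustration):

```python
import numpy as np

# Sketch of remark 2.2.11: for u(x) = x on (0,1) and any phi with
# phi(0) = phi(1) = 0, one has ||u - phi||_1^2 >= 1, so u is not in H^1_0.
x, dx = np.linspace(0.0, 1.0, 100001, retstep=True)

def integ(f):                               # trapezoidal rule on [0, 1]
    return float(np.sum(0.5 * (f[1:] + f[:-1])) * dx)

u = x.copy()
for s in (0.0, 0.5, 1.0, 2.0):
    phi = s * np.sin(np.pi * x) ** 2        # phi(0) = phi(1) = 0
    dphi = s * np.pi * np.sin(2.0 * np.pi * x)
    h1sq = integ((u - phi) ** 2) + integ((1.0 - dphi) ** 2)
    assert h1sq >= 1.0 - 1e-9               # the H^1 distance never drops below 1
```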

The technique of completion can also be applied if instead of ‖ · ‖_m one uses the norm ‖u‖_{m,p} = ( Σ_{|α|≤m} ‖D^α u‖^p_{L^p} )^{1/p}, 1 ≤ p < ∞. This results in Sobolev spaces denoted by H^m_p(Ω). For p = 2 we have H^m₂(Ω) = H^m(Ω). For p ≠ 2 these spaces are Banach spaces but not Hilbert spaces. A result as in theorem 2.2.8 also holds for p ≠ 2: H^m_p(Ω) = W^m_p(Ω).

We now formulate a result on a certain class of piecewise smooth functions which form a subset of the Sobolev space H^m(Ω). This subset plays an important role in the finite element method that will be presented in chapter 3.

Theorem 2.2.12 Assume that Ω can be partitioned as Ω̄ = ∪_{i=1}^N Ω̄_i, with Ω_i ∩ Ω_j = ∅ for all i ≠ j, and that for all Ω_i the assumption 1.1.3 is fulfilled. For m ∈ N, m ≥ 1, define

V_m = { u ∈ L²(Ω) | u|_{Ω_i} ∈ C^m(Ω̄_i) for all i = 1, . . . , N }.

For u ∈ V_m the following holds:

u ∈ H^m(Ω) ⇔ u ∈ C^{m−1}(Ω̄).

Proof. First we need some notation. Let Γ_i := ∂Ω_i. The outward unit normal on Γ_i is denoted by n^{(i)}. Let Γ_{iℓ} := Γ_i ∩ Γ_ℓ (= Γ_{ℓi}) and let γ_int denote the set of all those intersections Γ_{iℓ} with meas_{n−1}(Γ_{iℓ}) > 0 (in 2D with triangles: intersections along sides are taken into account but intersections at vertices are not). Similarly, Γ_{i0} := Γ_i ∩ ∂Ω and γ_b is the set of all Γ_{i0} with meas_{n−1}(Γ_{i0}) > 0. For Γ_{iℓ} ∈ γ_int let n^{(iℓ)} be the unit normal pointing outward from Ω_i (thus n^{(ℓi)} = −n^{(iℓ)}). Finally, for u ∈ V₁ let

[u]_{iℓ}(x) = lim_{t↓0} ( u(x + t n^{(iℓ)}) − u(x + t n^{(ℓi)}) ),  x ∈ Γ_{iℓ} ∈ γ_int,

be the jump of u across Γ_{iℓ}.
We now consider the case m = 1, i.e., u ∈ V₁. Let v ∈ L²(Ω) be given by v(x) = ∂u(x)/∂x_k for x ∈ Ω_i, i = 1, . . . , N. For arbitrary φ ∈ C₀^∞(Ω) we have

∫_Ω u(x) (∂φ(x)/∂x_k) dx = Σ_{i=1}^N ∫_{Ω_i} u(x) (∂φ(x)/∂x_k) dx
= − Σ_{i=1}^N ∫_{Ω_i} v(x) φ(x) dx + Σ_{i=1}^N ∫_{Γ_i} u(x) φ(x) n^{(i)}_k ds
= − ∫_Ω v(x) φ(x) dx + Σ_{i=1}^N ∫_{Γ_i} u(x) φ(x) n^{(i)}_k ds. (2.24)


For the last term in this expression we have

Σ_{i=1}^N ∫_{Γ_i} u(x) φ(x) n^{(i)}_k ds = Σ_{Γ_{iℓ}∈γ_int} ∫_{Γ_{iℓ}} [u]_{iℓ} φ(x) n^{(iℓ)}_k ds + Σ_{Γ_{i0}∈γ_b} ∫_{Γ_{i0}} u(x) φ(x) n^{(i)}_k ds =: R_int + R_b.

We have R_b = 0 because φ(x) = 0 on ∂Ω.
If u ∈ H¹(Ω) holds then the weak derivative ∂u/∂x_k must be equal to v (for all k = 1, . . . , n). From

∫_Ω u(x) (∂φ(x)/∂x_k) dx = − ∫_Ω (∂u(x)/∂x_k) φ(x) dx = − ∫_Ω v(x) φ(x) dx, ∀ φ ∈ C₀^∞(Ω),

it follows that R_int = 0 must hold for all φ ∈ C₀^∞(Ω). This implies that the jump of u across Γ_{iℓ} is zero and thus u ∈ C(Ω̄) holds. Conversely, if u ∈ C(Ω̄) then R_int = 0 and from the relation (2.24) it follows that the weak derivative ∂u/∂x_k exists. Since k is arbitrary we conclude u ∈ H¹(Ω). This completes the proof for the case m = 1.
For m > 1 we use an induction argument. Assume that the statement holds for m. We consider m + 1. Take u ∈ V_{m+1} and assume that u ∈ H^{m+1}(Ω) holds. Take an arbitrary multi-index α with |α| ≤ m − 1. Classical derivatives will be denoted by D̄^β and weak ones by D^β (with β a multi-index, as for α). From the induction hypothesis we obtain w := D^α u ∈ C(Ω̄). From u ∈ H^{m+1}(Ω) it follows that D^β w ∈ H¹(Ω) for |β| ≤ 1. Furthermore, for these β values we also have, due to u ∈ V_{m+1}, that D^β w = D̄^β w ∈ C¹(Ω̄_i) for i = 1, . . . , N. From the result for m = 1 it now follows that D̄^β w is continuous across the internal interfaces Γ_{iℓ} and thus D̄^β w ∈ C(Ω̄) holds. We conclude that D̄^α D̄^β u ∈ C(Ω̄) for all |α| ≤ m − 1, |β| ≤ 1, i.e., u ∈ C^m(Ω̄). Conversely, if u ∈ V_{m+1} and u ∈ C^m(Ω̄) then D̄^α u ∈ C(Ω̄) for |α| ≤ m. From the result for m = 1 it follows that D̄^α u ∈ H¹(Ω) for all |α| ≤ m and thus u ∈ H^{m+1}(Ω) holds.

2.2.3 Properties of Sobolev spaces

There is an extensive literature on the theory of Sobolev spaces, see for example Adams [1], Marti [61], Nečas [65], Wloka [99], Alt [3], and the references therein. In this section we collect a few results that will be needed further on.

A first important question concerns the smoothness of functions from H^m(Ω) in the classical (i.e., C^k(Ω̄)) sense. For example, one can show that if Ω ⊂ R then all functions from H¹(Ω) must be continuous on Ω. This, however, is not true for the two dimensional case, as the following example shows:

Example 2.2.13 In this example we show that functions in H¹(Ω), with Ω ⊂ R², are not necessarily continuous on Ω. With r := (x₁² + x₂²)^{1/2} let B(0, α) := { (x₁, x₂) ∈ R² | r < α } for α > 0. We take Ω = B(0, 1/2). Below we also use Ω_δ := Ω \ B(0, δ) with 0 < δ < 1/2. On Ω we define the function u by u(0, 0) := 0, u(x₁, x₂) := ln(ln(1/r)) otherwise. Using polar coordinates one obtains

∫_Ω u(x)² dx = lim_{δ↓0} ∫_{Ω_δ} u(x)² dx = 2π lim_{δ↓0} ∫_δ^{1/2} [ln(ln(1/r))]² r dr < ∞,


so u ∈ L²(Ω) holds. Note that u ∈ C^∞(Ω \ {0}). For the first derivatives we have

∫_Ω Σ_{i=1,2} ( ∂u(x)/∂x_i )² dx = 2π lim_{δ↓0} ∫_δ^{1/2} 1/(r² (ln r)²) · r dr = 2π/ln 2.

It follows that the classical first derivatives ∂u/∂x_i exist a.e. on Ω and are elements of L²(Ω). Note, however, remark 2.2.3. For arbitrary φ ∈ C₀^∞(Ω) we have, using Green’s formula on Ω_δ:

∫_{Ω_δ} u(x) (∂φ(x)/∂x₁) dx = ∫_{∂B(0,δ)} u(x) φ(x) n_{x₁} ds − ∫_{Ω_δ} (∂u(x)/∂x₁) φ(x) dx.

Note that

lim_{δ↓0} | ∫_{∂B(0,δ)} u(x) φ(x) n_{x₁} ds | ≤ lim_{δ↓0} 2πδ ‖φ‖_∞ |ln(ln(1/δ))| = 0.

So we have

∫_Ω u(x) (∂φ(x)/∂x₁) dx = − ∫_Ω (∂u(x)/∂x₁) φ(x) dx.

We conclude that ∂u/∂x₁ is the generalized partial derivative with respect to the variable x₁. The same argument yields an analogous result for the derivative w.r.t. x₂. We conclude that u ∈ H¹(Ω). It is clear that u is not continuous on Ω.

We now formulate an important general result which relates smoothness in the Sobolev sense (weak derivatives) to classical smoothness properties.
For normed spaces X and Y a linear operator I : X → Y is called a continuous embedding if I is continuous and injective. Such a continuous embedding is denoted by X ↪ Y. Furthermore, for x ∈ X the corresponding element Ix ∈ Y is usually denoted by x, too (X ↪ Y is formally replaced by X ⊂ Y).

Theorem 2.2.14 If m − n/2 > k (recall: Ω ⊂ R^n) then there exist continuous embeddings

H^m(Ω) ↪ C^k(Ω̄), (2.25a)
H^m₀(Ω) ↪ C^k(Ω̄). (2.25b)

Proof. Given in [1].

It is trivial that for m ≥ 0 there are continuous embeddings H^{m+1}(Ω) ↪ H^m(Ω) and H^{m+1}₀(Ω) ↪ H^m₀(Ω). In the next theorem a result on compactness of embeddings (cf. Appendix A.1) is formulated. We recall that if X, Y are Banach spaces then a continuous embedding X ↪ Y is compact if and only if each bounded sequence in X has a subsequence which is convergent in Y.


Theorem 2.2.15 The continuous embeddings

H^{m+1}(Ω) ↪ H^m(Ω)  for m ≥ 0, (2.26a)
H^{m+1}₀(Ω) ↪ H^m₀(Ω)  for m ≥ 0, (2.26b)
H^m(Ω) ↪ C^k(Ω̄)  for m − n/2 > k, (2.26c)
H^m₀(Ω) ↪ C^k(Ω̄)  for m − n/2 > k (2.26d)

are compact.

Proof. We sketch the idea of the proof. In [1] it is shown that the embeddings

H¹(Ω) ↪ L²(Ω),  H¹₀(Ω) ↪ L²(Ω)

are compact. This proves the results in (2.26a) and (2.26b) for m = 0. The results in (2.26a) and (2.26b) for m ≥ 1 can easily be derived from this as follows. Let (u_k)_{k≥1} be a bounded sequence in H^{m+1}(Ω) (m ≥ 1). Then (D^α u_k)_{k≥1} is a bounded sequence in H¹(Ω) for |α| ≤ m. Thus this sequence has a subsequence (D^α u_{k′})_{k′≥1} that converges in L²(Ω). Hence, the subsequence (u_{k′})_{k′≥1} converges in H^m(Ω). This proves the compactness of the embedding H^{m+1}(Ω) ↪ H^m(Ω). The result in (2.26b) for m ≥ 1 can be shown in the same way.
With a similar shift argument one can easily show that it suffices to prove the results in (2.26c) and (2.26d) for the case k = 0. The analysis for the case k = 0 is based on the following general result (which is easy to prove): if X, Y, Z are normed spaces with continuous embeddings I₁ : X → Y, I₂ : Y → Z and if at least one of these embeddings is compact, then the continuous embedding I₂I₁ : X → Z is compact. For m − n/2 > 0 there exist µ, λ ∈ (0, 1) with 0 < λ < µ < m − n/2. The following continuous embeddings exist:

H^m(Ω) ↪ C^{0,µ}(Ω̄) ↪ C^{0,λ}(Ω̄) ↪ C(Ω̄).

In this sequence only the first embedding is nontrivial. This one is proved in [1], theorem 5.4. Furthermore, from [1] theorem 1.31 it follows that the second embedding is compact. We conclude that for m − n/2 > 0 the embedding H^m(Ω) ↪ C(Ω̄) is compact. This then yields the result in (2.26c) for k = 0. The same line of reasoning can be used to show that (2.26d) holds.

The result in the following theorem is a basic inequality that will be used frequently.

Theorem 2.2.16 (Poincaré-Friedrichs inequality) There exists a constant C that depends only on diam(Ω) such that

‖u‖_{L²} ≤ C ( Σ_{|α|=1} ‖D^α u‖²_{L²} )^{1/2} for all u ∈ H¹₀(Ω).

Proof. Because C₀^∞(Ω) is dense in H¹₀(Ω) it is sufficient to prove the inequality for u ∈ C₀^∞(Ω). Without loss of generality we can assume that (0, . . . , 0) ∈ Ω. Let a > 0 be such that Ω ⊂ [−a, a]^n =: E. Take u ∈ C₀^∞(Ω) and extend this function by zero outside Ω. Note that

u(x₁, . . . , x_n) = u(−a, x₂, . . . , x_n) + ∫_{−a}^{x₁} (∂u(t, x₂, . . . , x_n)/∂t) dt.

Since u(−a, x₂, . . . , x_n) = 0 we obtain, using the Cauchy-Schwarz inequality,

u(x)² = ( ∫_{−a}^{x₁} 1 · (∂u(t, x₂, . . . , x_n)/∂t) dt )²
      ≤ ∫_{−a}^{x₁} 1 dt · ∫_{−a}^{x₁} ( ∂u(t, x₂, . . . , x_n)/∂t )² dt
      ≤ 2a ∫_{−a}^a ( ∂u(t, x₂, . . . , x_n)/∂t )² dt  for x ∈ E.

Note that the latter term does not depend on x₁. Integration with respect to the variable x₁ results in

∫_{−a}^a u(x₁, . . . , x_n)² dx₁ ≤ 4a² ∫_{−a}^a ( D^{(1,0,...,0)} u(x) )² dx₁,

and integration with respect to the other variables gives

∫_E u(x)² dx ≤ 4a² ∫_E ( D^{(1,0,...,0)} u(x) )² dx ≤ 4a² Σ_{|α|=1} ‖D^α u‖²_{L²},

and thus the desired result is proved.
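In one dimension the proof's constant C = 2a can be tested directly — an editorial sketch on Ω = (0, 1) ⊂ [−1, 1] (so a = 1), using the sample family u = sin(kπx), for which both norms are known in closed form:

```python
import numpy as np

# Sketch of the Poincare-Friedrichs inequality in 1D: ||u||_{L2} <= 2a ||u'||_{L2}
# for u vanishing on the boundary of Omega = (0,1), with a = 1. For
# u(x) = sin(k*pi*x): ||u||_{L2}/||u'||_{L2} = 1/(k*pi), well below the bound 2.
x, dx = np.linspace(0.0, 1.0, 100001, retstep=True)

def integ(f):
    return float(np.sum(0.5 * (f[1:] + f[:-1])) * dx)

for k in range(1, 6):
    u = np.sin(k * np.pi * x)               # u(0) = u(1) = 0
    du = k * np.pi * np.cos(k * np.pi * x)  # u'
    ratio = np.sqrt(integ(u * u)) / np.sqrt(integ(du * du))
    assert np.isclose(ratio, 1.0 / (k * np.pi), atol=1e-6)
    assert ratio <= 2.0                     # the constant from the proof
```

The sharp constant in 1D is in fact 1/π (k = 1); the proof's 2a = 2 is a convenient overestimate.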

Corollary 2.2.17 For u ∈ H^m(Ω) define

|u|²_m := Σ_{|α|=m} ‖D^α u‖²_{L²}. (2.27)

There exists a constant C such that

|u|_m ≤ ‖u‖_m ≤ C |u|_m for all u ∈ H^m₀(Ω),

i.e., | · |_m and ‖ · ‖_m are equivalent norms on H^m₀(Ω).

Proof. The inequality |u|_m ≤ ‖u‖_m is trivial. For m = 1 the inequality ‖u‖₁ ≤ C |u|₁ directly follows from the Poincaré-Friedrichs inequality. For u ∈ H²₀(Ω) we obtain ‖u‖²₂ = ‖u‖²₁ + |u|²₂ ≤ C²|u|²₁ + |u|²₂. Application of the Poincaré-Friedrichs inequality to D^α u ∈ H¹₀(Ω), |α| = 1, yields

|u|²₁ = Σ_{|α|=1} ‖D^α u‖²_{L²} ≤ c Σ_{|α|=1} Σ_{|β|=1} ‖D^β D^α u‖²_{L²} ≤ c |u|²₂.

Thus ‖u‖₂ ≤ C |u|₂ holds. For m > 2 the same reasoning is applicable.

In the weak formulation of elliptic boundary value problems one must treat boundary conditions. For this the next result will be needed.

Theorem 2.2.18 (Trace operator) There exists a unique bounded linear operator

γ : H¹(Ω) → L²(∂Ω),  ‖γ(u)‖_{L²(∂Ω)} ≤ c‖u‖₁, (2.28)

with the property that for all u ∈ C¹(Ω̄) the equality γ(u) = u|_{∂Ω} holds.


Proof. We define γ : C¹(Ω̄) → L²(∂Ω) by γ(u) = u|_{∂Ω} and will show that

‖γ(u)‖_{L²(∂Ω)} ≤ c‖u‖₁   for all u ∈ C¹(Ω̄) (2.29)

holds. The desired result then follows from the extension theorem A.2.3. We give a proof of (2.29) for the two-dimensional case. The general case can be treated in the same way. In a neighbourhood of x ∈ ∂Ω we take a local coordinate system (ξ, η) such that locally the boundary can be represented as

Γ_loc = { (ξ, ψ(ξ)) | ξ ∈ [−a, a] }   with a > 0, ψ ∈ C^{0,1}([−a, a]),

and a small strip below the graph of ψ is contained in Ω:

S := { (ξ, η) | ξ ∈ [−a, a], η ∈ [ψ(ξ) − b, ψ(ξ)) } ⊂ Ω,

for b > 0 sufficiently small. Take u ∈ C¹(Ω̄). Note that

u(ξ, ψ(ξ)) = u(ξ, t) + ∫_t^{ψ(ξ)} ∂u(ξ, η)/∂η dη   for t ∈ [ψ(ξ) − b, ψ(ξ)].

Using the inequality (α + β)² ≤ 2(α² + β²) and the Cauchy-Schwarz inequality yields

u(ξ, ψ(ξ))² ≤ 2u(ξ, t)² + 2( ∫_t^{ψ(ξ)} 1 · ∂u(ξ, η)/∂η dη )²
           ≤ 2u(ξ, t)² + 2(ψ(ξ) − t) ∫_t^{ψ(ξ)} ( ∂u(ξ, η)/∂η )² dη
           ≤ 2u(ξ, t)² + 2b ∫_{ψ(ξ)−b}^{ψ(ξ)} ( ∂u(ξ, η)/∂η )² dη.

In this last expression only the first term on the right-hand side depends on t. Integration over t ∈ [ψ(ξ) − b, ψ(ξ)] results in

b u(ξ, ψ(ξ))² ≤ 2 ∫_{ψ(ξ)−b}^{ψ(ξ)} u(ξ, t)² dt + 2b² ∫_{ψ(ξ)−b}^{ψ(ξ)} ( ∂u(ξ, η)/∂η )² dη
             = 2 ∫_{ψ(ξ)−b}^{ψ(ξ)} u(ξ, η)² + b² ( ∂u(ξ, η)/∂η )² dη.

Integration over ξ ∈ [−a, a] and division by b gives

∫_{−a}^{a} u(ξ, ψ(ξ))² dξ ≤ 2 ∫_S b⁻¹ u(ξ, η)² + b ( ∂u(ξ, η)/∂η )² dη dξ.

If ψ ∈ C¹([−a, a]) then for the local arc length variable s on Γ_loc we have

ds = √(1 + ψ′(ξ)²) dξ.

Since 0 ≤ ψ′(ξ)² ≤ C for ξ ∈ [−a, a] we obtain

∫_{Γ_loc} u(s)² ds ≤ C ∫_{−a}^{a} u(ξ, ψ(ξ))² dξ ≤ C ( b⁻¹‖u‖²_{L²(S)} + b|u|²_{1,S} ) ≤ C‖u‖²_{1,S}. (2.30)


If ψ is only Lipschitz continuous on [−a, a] then ψ′ exists almost everywhere on [−a, a] and |ψ′(ξ)| is bounded (Rademacher's theorem). Hence, the same argument can be applied. Finally note that ∂Ω can be covered by a finite number of local parts Γ_loc. Addition of the local inequalities in (2.30) then yields the result in (2.29).

The operator defined in theorem 2.2.18 is called the trace operator. For u ∈ H¹(Ω) the function γ(u) ∈ L²(∂Ω) represents the boundary "values" of u and is called the trace of u. For γ(u) one often uses the notation u|_{∂Ω}. For example, for u ∈ H¹(Ω), the identity u|_{∂Ω} = 0 means that γ(u) = 0 in the L²(∂Ω) sense.

The space range(γ) can be shown to be dense in L²(∂Ω) but is strictly smaller than L²(∂Ω). For a characterization of this subspace one can use a Sobolev space with a broken index:

H^{1/2}(∂Ω) = range(γ) = { v ∈ L²(∂Ω) | ∃ u ∈ H¹(Ω) : v = γ(u) }. (2.31)

The space H^{1/2}(∂Ω) is a Hilbert space with properties similar to those of the usual Sobolev spaces. We do not discuss this topic here, since we will not need this space in the remainder.

Using the trace operator one can give another natural characterization of the space H¹₀(Ω):

Theorem 2.2.19 The equality

H¹₀(Ω) = { u ∈ H¹(Ω) | u|_{∂Ω} = 0 }

holds.

Proof. We only prove "⊂". For a proof of "⊃" we refer to [47], theorem 6.2.42, or [1], remark 7.54. First note that { u ∈ H¹(Ω) | u|_{∂Ω} = 0 } = ker(γ). Furthermore, C₀^∞(Ω) ⊂ ker(γ) and the trace operator γ : H¹(Ω) → L²(∂Ω) is continuous. From the latter it follows that ker(γ) is closed in H¹(Ω). This yields

H¹₀(Ω) = closure of C₀^∞(Ω) in ‖·‖₁ ⊂ closure of ker(γ) in ‖·‖₁ = ker(γ),

and this proves "⊂".

Finally, we collect a few results on Green's formulas that hold in Sobolev spaces. For notational simplicity the function arguments x are deleted in the integrals, and in boundary integrals like, for example, ∫_{∂Ω} γ(u)γ(v) ds, we delete the trace operator γ.


Theorem 2.2.20 The following identities hold, with n = (n₁, . . . , nₙ) the outward unit normal on ∂Ω and H^m := H^m(Ω):

∫_Ω u ∂v/∂xᵢ dx = −∫_Ω (∂u/∂xᵢ) v dx + ∫_{∂Ω} u v nᵢ ds,   u, v ∈ H¹, 1 ≤ i ≤ n, (2.32a)

∫_Ω ∆u v dx = −∫_Ω ∇u · ∇v dx + ∫_{∂Ω} (∇u · n) v ds,   u ∈ H², v ∈ H¹, (2.32b)

∫_Ω u div v dx = −∫_Ω ∇u · v dx + ∫_{∂Ω} u v · n ds,   u ∈ H¹, v ∈ (H¹)ⁿ. (2.32c)

Proof. These results immediately follow from the corresponding formulas in C^∞(Ω̄), the continuity of the trace operator, and a density argument based on theorem 2.2.9.
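In one dimension the identities (2.32a)-(2.32c) all reduce to the familiar integration-by-parts formula ∫₀¹ u v′ dx = −∫₀¹ u′ v dx + u(1)v(1) − u(0)v(0). A minimal numerical sketch (the functions u, v are examples chosen here, not from the text):

```python
import math

# Numerical check (illustration, not from the text) of the 1D analogue of
# Green's formula (2.32a): for smooth u, v on [0, 1],
#   int u v' dx = -int u' v dx + u(1)v(1) - u(0)v(0).
u  = lambda x: math.sin(x)
du = lambda x: math.cos(x)
v  = lambda x: x*x
dv = lambda x: 2*x

def trapezoid(f, a, b, n=100000):
    h = (b - a) / n
    return h * (0.5*f(a) + 0.5*f(b) + sum(f(a + i*h) for i in range(1, n)))

lhs = trapezoid(lambda x: u(x)*dv(x), 0.0, 1.0)
rhs = -trapezoid(lambda x: du(x)*v(x), 0.0, 1.0) + u(1.0)*v(1.0) - u(0.0)*v(0.0)
assert abs(lhs - rhs) < 1e-8
```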

For the dual space of H^m₀(Ω) the following notation is used:

H^{−m}(Ω) := ( H^m₀(Ω) )′. (2.33)

The norm on this space is denoted by ‖ · ‖_{−m}:

‖φ‖_{−m} := sup_{v∈H^m₀(Ω)} |φ(v)| / ‖v‖_m,   φ ∈ H^{−m}(Ω).

2.3 General results on variational formulations

In section 2.1 we already gave an example of a variational problem. In the previous section we introduced Hilbert spaces that will be used for the variational formulation of elliptic boundary value problems in sections 2.5 and 2.6. In this section we present some general existence and uniqueness results for variational problems. These results will play a key role in the analysis of well-posedness of the weak formulations of elliptic boundary value problems. They will also be used in the discretization error analysis for the finite element method in chapter 3.

A remark on notation: for elements of a Hilbert space we use boldface notation (e.g., u), elements of the dual space (i.e., bounded linear functionals) are denoted by f, g, etc., and for linear operators between spaces we use capitals (e.g., L).

Let H₁ and H₂ be Hilbert spaces. A bilinear form k : H₁ × H₂ → R is continuous if there is a constant Γ such that for all x ∈ H₁, y ∈ H₂:

|k(x,y)| ≤ Γ‖x‖_{H₁}‖y‖_{H₂}. (2.34)

For a continuous bilinear form k : H₁ × H₂ → R we define its norm by ‖k‖ = sup{ |k(x,y)| : ‖x‖_{H₁} = 1, ‖y‖_{H₂} = 1 }. A fundamental result is given in the following theorem:


Theorem 2.3.1 Let H₁, H₂ be Hilbert spaces and k : H₁ × H₂ → R be a continuous bilinear form. For f ∈ H₂′ consider the variational problem:

find u ∈ H₁ such that k(u,v) = f(v) for all v ∈ H₂. (2.35)

The following two statements are equivalent:

1. For arbitrary f ∈ H₂′ the problem (2.35) has a unique solution u ∈ H₁, and ‖u‖_{H₁} ≤ c‖f‖_{H₂′} holds with a constant c independent of f.

2. The conditions (2.36) and (2.37) hold:

∃ ε > 0 : sup_{v∈H₂} k(u,v)/‖v‖_{H₂} ≥ ε‖u‖_{H₁}   for all u ∈ H₁, (2.36)

∀ v ∈ H₂, v ≠ 0, ∃ u ∈ H₁ : k(u,v) ≠ 0. (2.37)

Moreover, for the constants c and ε one can take c = 1/ε.

Proof. We introduce the linear continuous operator L : H₁ → H₂′,

(Lu)(v) := k(u,v). (2.38)

Note that for all u ∈ H₁

‖Lu‖_{H₂′} = sup_{v∈H₂} (Lu)(v)/‖v‖_{H₂} = sup_{v∈H₂} k(u,v)/‖v‖_{H₂}. (2.39)

Furthermore, u ∈ H₁ satisfies (2.35) if and only if Lu = f holds.
"1. ⇒ 2." From 1. it follows that L : H₁ → H₂′ is bijective. For arbitrary u ∈ H₁ and f := Lu we have

‖u‖_{H₁} ≤ c‖f‖_{H₂′} = c‖Lu‖_{H₂′} = c sup_{v∈H₂} k(u,v)/‖v‖_{H₂}.

From this it follows that (2.36) holds with ε = 1/c.
Take a fixed v ∈ H₂, v ≠ 0. The linear functional w → ⟨v,w⟩_{H₂} is an element of H₂′. There exists u ∈ H₁ such that k(u,w) = ⟨v,w⟩_{H₂} for all w ∈ H₂. Taking w = v yields k(u,v) = ‖v‖²_{H₂} > 0. Hence, (2.37) holds.
"1. ⇐ 2." Let u ∈ H₁ be such that Lu = 0. Then k(u,v) = (Lu)(v) = 0 for all v ∈ H₂. From condition (2.36) it follows that u = 0. We conclude that L : H₁ → H₂′ is injective. Let R(L) ⊂ H₂′ be the range of L and L⁻¹ : R(L) → H₁ the inverse mapping. From (2.39) and (2.36) it follows that ‖Lu‖_{H₂′} ≥ ε‖u‖_{H₁} for all u ∈ H₁ and thus

‖L⁻¹f‖_{H₁} ≤ (1/ε)‖f‖_{H₂′}   for all f ∈ R(L). (2.40)

Hence the inverse mapping is bounded. From corollary A.2.6 it follows that R(L) is closed in H₂′. Assume that R(L) ≠ H₂′. Then there exists g ∈ R(L)^⊥, g ≠ 0. Let J : H₂′ → H₂ be the Riesz isomorphism. For arbitrary u ∈ H₁ we get

0 = ⟨g, Lu⟩_{H₂′} = ⟨Jg, JLu⟩_{H₂} = (Lu)(Jg) = k(u, Jg).

This is a contradiction to (2.37). We conclude that R(L) = H₂′ and thus L : H₁ → H₂′ is bijective. From (2.40) we obtain, with u := L⁻¹f, ‖u‖_{H₁} ≤ (1/ε)‖f‖_{H₂′} for arbitrary f ∈ H₂′.


Remark 2.3.2 The condition (2.37) can also be formulated as follows:

[ v ∈ H₂ such that k(u,v) = 0 for all u ∈ H₁ ] ⇒ v = 0.

The condition (2.36) is equivalent to

∃ ε > 0 : inf_{u∈H₁\{0}} sup_{v∈H₂} k(u,v) / (‖u‖_{H₁}‖v‖_{H₂}) ≥ ε, (2.41)

and is often called the inf-sup condition. In the finite-dimensional case with dim(H₁) = dim(H₂) < ∞ this condition implies the result in (2.37), and thus is necessary and sufficient for existence and uniqueness, as can be seen from the following. Let L : H₁ → H₂′ be as in (2.38). If dim(H₁) = dim(H₂′) < ∞ we have

L is bijective ⇔ L is injective ⇔ inf_{u≠0} ‖Lu‖_{H₂′}/‖u‖_{H₁} > 0 ⇔ inf_{u≠0} sup_v k(u,v)/(‖u‖_{H₁}‖v‖_{H₂}) > 0. (2.42)

The latter condition seems to be weaker than the inf-sup condition in (2.41), since ε > 0 is required there. However, in the finite-dimensional case it is easy to show, using a compactness argument, that these two conditions are equivalent. In infinite-dimensional Hilbert spaces the inf-sup condition (2.41) is in general strictly stronger than the one in (2.42).
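In the finite-dimensional case just discussed, with k(u,v) = vᵀKu on R² and Euclidean norms, sup_v k(u,v)/‖v‖ = ‖Ku‖, so the inf-sup constant ε in (2.41) is the smallest singular value of K. A small numerical sketch (the matrix K is an arbitrary example chosen here, not from the text):

```python
import math

# Finite-dimensional illustration (not from the text): for k(u, v) = v^T K u on
# R^2 with Euclidean norms, sup_v k(u, v)/|v| = |K u|, so the inf-sup constant
# in (2.41) equals the smallest singular value of K.
K = [[1.0, 2.0], [3.0, 4.0]]  # an arbitrary invertible example matrix

def matvec(A, x):
    return [A[0][0]*x[0] + A[0][1]*x[1], A[1][0]*x[0] + A[1][1]*x[1]]

def norm(x):
    return math.hypot(x[0], x[1])

# Smallest singular value via the eigenvalues of K^T K (2x2 closed form).
KtK = [[K[0][0]**2 + K[1][0]**2, K[0][0]*K[0][1] + K[1][0]*K[1][1]],
       [K[0][0]*K[0][1] + K[1][0]*K[1][1], K[0][1]**2 + K[1][1]**2]]
tr = KtK[0][0] + KtK[1][1]
det = KtK[0][0]*KtK[1][1] - KtK[0][1]*KtK[1][0]
sigma_min = math.sqrt((tr - math.sqrt(tr**2 - 4*det)) / 2)

# Brute-force inf over unit vectors u of sup_v k(u, v)/|v| = |K u|.
brute = min(norm(matvec(K, [math.cos(t), math.sin(t)]))
            for t in [2*math.pi*i/100000 for i in range(100000)])

assert abs(brute - sigma_min) < 1e-3
```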

As we saw in the analysis above, it is natural to identify the bounded bilinear form k : H₁ × H₂ → R with a bounded linear operator L : H₁ → H₂′ via (Lu)(v) = k(u,v). The result in theorem 2.3.1 is a reformulation of the following result that can be found in functional analysis textbooks. Let L : H₁ → H₂′ be a bounded linear operator. Then L is an isomorphism if and only if the following two conditions hold:

(a) L is injective and R(L) is closed in H₂′,

(b) L′ : H₂ → H₁′ defined by (L′v)(u) = (Lu)(v) is injective.

These two conditions correspond to (2.36) and (2.37), respectively.

The following lemma will be used in the analysis below.

Lemma 2.3.3 Let H₁, H₂ be Hilbert spaces and k : H₁ × H₂ → R be a continuous bilinear form. For every u ∈ H₁ there exists a unique w ∈ H₂ such that

⟨w,v⟩_{H₂} = k(u,v)   for all v ∈ H₂.

Furthermore, if (2.36) is satisfied, then ‖w‖_{H₂} ≥ ε‖u‖_{H₁} holds, with ε > 0 as in (2.36).

Proof. Take u ∈ H₁. Then g : v → k(u,v) defines a continuous linear functional on H₂. From the Riesz representation theorem it follows that there exists a unique w ∈ H₂ such that ⟨w,v⟩_{H₂} = g(v) = k(u,v) for all v ∈ H₂, and ‖g‖_{H₂′} = ‖w‖_{H₂}. Using (2.36) we obtain

‖w‖_{H₂} = ‖g‖_{H₂′} = sup_{v∈H₂} g(v)/‖v‖_{H₂} = sup_{v∈H₂} k(u,v)/‖v‖_{H₂} ≥ ε‖u‖_{H₁},

and thus the result is proved.


Definition 2.3.4 Let H be a Hilbert space. A bilinear form k : H × H → R is called H-elliptic if there exists a constant γ > 0 such that

k(u,u) ≥ γ‖u‖²_H   for all u ∈ H.

As an immediate consequence of theorem 2.3.1 we obtain the following:

Theorem 2.3.5 (Lax-Milgram) Let H be a Hilbert space and k : H × H → R a continuous H-elliptic bilinear form with ellipticity constant γ. Then for every f ∈ H′ there exists a unique u ∈ H such that

k(u,v) = f(v)   for all v ∈ H.

Furthermore, the inequality ‖u‖_H ≤ (1/γ)‖f‖_{H′} holds.

This theorem will play an important role in the analysis of well-posedness of the weak formulation of scalar elliptic problems in section 2.5.
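A finite-dimensional sketch of the Lax-Milgram bound (the SPD matrix A and functional f are examples chosen here, not from the text): on H = R² with the Euclidean inner product, k(u,v) = vᵀAu for a symmetric positive definite A is H-elliptic with γ = λ_min(A), and the stability bound ‖u‖ ≤ (1/γ)‖f‖ can be checked directly.

```python
import math

# Finite-dimensional sketch (illustration, not from the text) of theorem 2.3.5:
# on H = R^2 take k(u, v) = v^T A u with a symmetric positive definite A. The
# ellipticity constant gamma is the smallest eigenvalue of A, and the unique
# solution of k(u, v) = f^T v for all v is u = A^{-1} f.
A = [[2.0, 1.0], [1.0, 3.0]]   # SPD example matrix
f = [1.0, 0.0]

# gamma = smallest eigenvalue of A (2x2 closed form)
tr = A[0][0] + A[1][1]
det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
gamma = (tr - math.sqrt(tr**2 - 4*det)) / 2

# u = A^{-1} f by Cramer's rule
u = [(A[1][1]*f[0] - A[0][1]*f[1]) / det,
     (-A[1][0]*f[0] + A[0][0]*f[1]) / det]

norm_u = math.hypot(u[0], u[1])
norm_f = math.hypot(f[0], f[1])

# ellipticity: k(u, u) >= gamma * |u|^2 for the computed u
k_uu = sum(u[i]*A[i][j]*u[j] for i in range(2) for j in range(2))
assert k_uu >= gamma * norm_u**2 - 1e-12
# Lax-Milgram stability bound
assert norm_u <= norm_f / gamma + 1e-12
```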

In the remainder of this section we analyze well-posedness for a variational problem which has a special saddle-point structure. This structure is such that the analysis applies to the Stokes problem. This application is treated in section 2.6. Let V and M be Hilbert spaces and

a : V × V → R,  b : V × M → R

be continuous bilinear forms. For f₁ ∈ V′, f₂ ∈ M′ we define the following variational problem: find (φ,λ) ∈ V × M such that

a(φ,ψ) + b(ψ,λ) = f₁(ψ)   for all ψ ∈ V, (2.43a)
b(φ,µ) = f₂(µ)   for all µ ∈ M. (2.43b)

We now define H := V × M and

k : H × H → R,  k(u,v) := a(φ,ψ) + b(φ,µ) + b(ψ,λ),  with u := (φ,λ), v := (ψ,µ). (2.44)

On H we use the product norm ‖u‖²_H = ‖φ‖²_V + ‖λ‖²_M for u = (φ,λ) ∈ H. If we define f ∈ H′ = V′ × M′ by f(ψ,µ) = f₁(ψ) + f₂(µ), then the problem (2.43) can be reformulated in the setting of theorem 2.3.1 as follows:

find u ∈ H such that k(u,v) = f(v) for all v ∈ H. (2.45)

Below we will analyze the well-posedness of the problem (2.45) and thus of (2.43). In particular, we will derive conditions on the bilinear forms a(·,·) and b(·,·) that are necessary and sufficient for existence and uniqueness of a solution. The main result is presented in theorem 2.3.10. We start with a few useful results. We need the following null space:

V₀ := { φ ∈ V | b(φ,λ) = 0 for all λ ∈ M }. (2.46)

Note that both V₀ and V₀^⊥ := { φ ∈ V | ⟨φ,ψ⟩_V = 0 for all ψ ∈ V₀ } are closed subspaces of V. These subspaces are Hilbert spaces if we use the scalar product ⟨·,·⟩_V. As usual the corresponding dual spaces are denoted by V₀′ and (V₀^⊥)′.

We introduce an inf-sup condition for the bilinear form b(·,·):

∃ β > 0 : sup_{ψ∈V} b(ψ,λ)/‖ψ‖_V ≥ β‖λ‖_M   for all λ ∈ M. (2.47)

Remark 2.3.6 Due to V = V₀ ⊕ V₀^⊥ and b(ψ,λ) = 0 for all ψ ∈ V₀, λ ∈ M, the condition in (2.47) is equivalent to the condition

∃ β > 0 : sup_{ψ∈V₀^⊥} b(ψ,λ)/‖ψ‖_V ≥ β‖λ‖_M   for all λ ∈ M. (2.48)

Lemma 2.3.7 Assume that (2.47) holds. For g ∈ (V₀^⊥)′ the variational problem

find λ ∈ M such that b(ψ,λ) = g(ψ) for all ψ ∈ V₀^⊥ (2.49)

has a unique solution. Furthermore, ‖λ‖_M ≤ (1/β)‖g‖_{(V₀^⊥)′} holds.

Proof. We apply theorem 2.3.1 with H₁ = M, H₂ = V₀^⊥ and k(λ,ψ) = b(ψ,λ) (note the interchange of arguments). From (2.48) it follows that condition (2.36) is fulfilled. Take ψ ∈ V₀^⊥, ψ ≠ 0. Then ψ ∉ V₀ and thus there exists λ ∈ M with b(ψ,λ) ≠ 0. Hence, the second condition (2.37) is also satisfied. Application of theorem 2.3.1 yields that the problem (2.49) has a unique solution λ ∈ M, and ‖λ‖_M ≤ (1/β)‖g‖_{(V₀^⊥)′} holds.

Note that, in contrast to (2.36), in (2.47) we take the supremum over the first argument of the bilinear form. In the following lemma we formulate a result in which the supremum is taken over the second argument:

Lemma 2.3.8 The condition (2.47) (or (2.48)) implies:

∃ β > 0 : sup_{λ∈M} b(ψ,λ)/‖λ‖_M ≥ β‖ψ‖_V   for all ψ ∈ V₀^⊥. (2.50)

Proof. Take ψ̃ ∈ V₀^⊥, ψ̃ ≠ 0, and define g ∈ (V₀^⊥)′ by g(ψ) = ⟨ψ, ψ̃⟩_V for ψ ∈ V₀^⊥. From lemma 2.3.7 it follows that there exists a unique λ̃ ∈ M such that

b(ψ, λ̃) = g(ψ)   for all ψ ∈ V₀^⊥,

and ‖λ̃‖_M ≤ (1/β)‖g‖_{(V₀^⊥)′} = (1/β)‖ψ̃‖_V holds. We conclude that

sup_{λ∈M} b(ψ̃,λ)/‖λ‖_M ≥ b(ψ̃, λ̃)/‖λ̃‖_M = g(ψ̃)/‖λ̃‖_M = ‖ψ̃‖²_V/‖λ̃‖_M ≥ β‖ψ̃‖_V

holds for arbitrary ψ̃ ∈ V₀^⊥.

In general, (2.50) does not imply (2.47). Take, for example, the bilinear form that is identically zero, i.e., b(ψ,λ) = 0 for all ψ ∈ V, λ ∈ M. Then (2.47) does not hold. However, since V₀ = V and thus V₀^⊥ = {0}, it follows that (2.50) does hold.

Application of lemma 2.3.3 in combination with the inf-sup properties in (2.47) and (2.50) yields the following corollary.


Corollary 2.3.9 Assume that (2.47) holds. For every λ ∈ M there exists a unique ξ ∈ V₀^⊥ such that

⟨ξ,ψ⟩_V = b(ψ,λ)   for all ψ ∈ V₀^⊥.

Furthermore, ‖ξ‖_V ≥ β‖λ‖_M holds. For every ψ ∈ V₀^⊥ there exists a unique ν ∈ M such that

⟨ν,λ⟩_M = b(ψ,λ)   for all λ ∈ M.

Furthermore, ‖ν‖_M ≥ β‖ψ‖_V holds.

Proof. The first part follows by applying lemma 2.3.3 with H₁ = M, H₂ = V₀^⊥, k(λ,ψ) = b(ψ,λ) in combination with (2.48). For the second part we use lemma 2.3.3 with H₁ = V₀^⊥, H₂ = M, k(ψ,λ) = b(ψ,λ) in combination with (2.50).

In the following main theorem we present necessary and sufficient conditions on the bilinear forms a(·,·) and b(·,·) such that the saddle-point problem (2.43) has a unique solution which depends continuously on the data.

Theorem 2.3.10 Let V, M be Hilbert spaces and a : V × V → R, b : V × M → R be continuous bilinear forms. Define H := V × M and let k : H × H → R be the continuous bilinear form defined in (2.44). For f ∈ H′ = V′ × M′ consider the variational problem

find u ∈ H such that k(u,v) = f(v) for all v ∈ H. (2.51)

The following two statements are equivalent:

1. For arbitrary f ∈ H′ the problem (2.51) has a unique solution u ∈ H, and ‖u‖_H ≤ c‖f‖_{H′} holds with a constant c independent of f.

2. The inf-sup condition (2.47) holds and the conditions (2.52a), (2.52b) are satisfied:

∃ δ > 0 : sup_{ψ∈V₀} a(φ,ψ)/‖ψ‖_V ≥ δ‖φ‖_V   for all φ ∈ V₀, (2.52a)

∀ ψ ∈ V₀, ψ ≠ 0, ∃ φ ∈ V₀ : a(φ,ψ) ≠ 0. (2.52b)

Moreover, if the second statement holds, then for c in the first statement one can take c = (β + 2‖a‖)² δ⁻¹ β⁻².

Proof. From theorem 2.3.1 (with H₁ = H₂ = H) it follows that statement 1 is equivalent to

1′. For H₁ = H₂ = H the conditions (2.36) and (2.37) are satisfied.

We now prove that the statements 1′ and 2 are equivalent. We recall the condition (2.36):

sup_{(ψ,µ)∈H} [a(φ,ψ) + b(φ,µ) + b(ψ,λ)] / (‖ψ‖²_V + ‖µ‖²_M)^{1/2} ≥ ε (‖φ‖²_V + ‖λ‖²_M)^{1/2}   for all (φ,λ) ∈ H. (2.53)

Define u = (φ,λ), v = (ψ,µ) and k(u,v) as in (2.44). We have to prove: (2.53), (2.37) ⇔ (2.47), (2.52a), (2.52b). This is done in the following five steps: a) (2.53) ⇒ (2.47); b) (2.53), (2.47) ⇒ (2.52a); c) (2.47), (2.37) ⇒ (2.52b); d) (2.47), (2.52a) ⇒ (2.53); e) (2.52a), (2.52b) ⇒ (2.37).

a) If in (2.53) we take φ = 0 we obtain

sup_{ψ∈V} b(ψ,λ)/‖ψ‖_V = sup_{(ψ,µ)∈H} b(ψ,λ)/(‖ψ‖²_V + ‖µ‖²_M)^{1/2} ≥ ε‖λ‖_M   for all λ ∈ M,

and thus the inf-sup condition (2.47) holds.

b) Take φ₀ ∈ V₀. The functional g : ψ → −a(φ₀,ψ), ψ ∈ V₀^⊥, is linear and bounded, i.e., g ∈ (V₀^⊥)′. Application of lemma 2.3.7 yields that there exists λ̃ ∈ M such that b(ψ, λ̃) = −a(φ₀,ψ) for all ψ ∈ V₀^⊥. In (2.53) we take (φ,λ) = (φ₀, λ̃). Every ψ ∈ V is decomposed as ψ = ψ₀ + ψ^⊥ with ψ₀ ∈ V₀, ψ^⊥ ∈ V₀^⊥. Using b(φ₀,µ) + b(ψ, λ̃) = b(ψ^⊥, λ̃) we obtain from (2.53)

sup_{(ψ,µ)∈H} [a(φ₀,ψ₀) + a(φ₀,ψ^⊥) + b(ψ^⊥, λ̃)] / (‖ψ‖²_V + ‖µ‖²_M)^{1/2} ≥ ε(‖φ₀‖²_V + ‖λ̃‖²_M)^{1/2}.

From this we get, using b(ψ^⊥, λ̃) = −a(φ₀,ψ^⊥) for all ψ^⊥ ∈ V₀^⊥:

sup_{ψ₀∈V₀} a(φ₀,ψ₀)/‖ψ₀‖_V = sup_{(ψ,µ)∈H} a(φ₀,ψ₀)/(‖ψ‖²_V + ‖µ‖²_M)^{1/2} ≥ ε(‖φ₀‖²_V + ‖λ̃‖²_M)^{1/2} ≥ ε‖φ₀‖_V,

and thus the condition (2.52a) holds.

c) Take ψ₀ ∈ V₀, ψ₀ ≠ 0. The functional g : φ → −a(φ,ψ₀), φ ∈ V₀^⊥, is an element of (V₀^⊥)′. From lemma 2.3.7 it follows that there exists µ̃ ∈ M such that b(ψ, µ̃) = −a(ψ,ψ₀) for all ψ ∈ V₀^⊥. In condition (2.37) we take v = (ψ₀, µ̃). Then there exists u = (φ,λ) ∈ H such that k(u,v) ≠ 0, i.e.,

a(φ,ψ₀) + b(φ, µ̃) + b(ψ₀,λ) = a(φ,ψ₀) + b(φ, µ̃) ≠ 0.

Decompose φ as φ = φ₀ + φ^⊥, φ₀ ∈ V₀, φ^⊥ ∈ V₀^⊥, and use the definition of µ̃ to get

0 ≠ a(φ,ψ₀) + b(φ, µ̃) = a(φ₀,ψ₀) + a(φ^⊥,ψ₀) + b(φ^⊥, µ̃) = a(φ₀,ψ₀).

Hence, the result in (2.52b) holds.

d) Let u = (φ,λ) ∈ H be given. We decompose φ as φ = φ₀ + φ^⊥, φ₀ ∈ V₀, φ^⊥ ∈ V₀^⊥. We assume that λ ≠ 0, φ^⊥ ≠ 0. From corollary 2.3.9 it follows that:

∃ ξ ∈ V₀^⊥ : ⟨ξ,ψ⟩_V = b(ψ,λ) ∀ ψ ∈ V₀^⊥;  ‖ξ‖_V ≥ β‖λ‖_M, (2.54)

∃ ν ∈ M : ⟨ν,µ⟩_M = b(φ^⊥,µ) ∀ µ ∈ M;  ‖ν‖_M ≥ β‖φ^⊥‖_V. (2.55)

From assumption (2.52a) it follows that there exist δ > 0 and ψ₀ ∈ V₀ with ‖ψ₀‖_V = 1 such that a(φ₀,ψ₀) ≥ δ‖φ₀‖_V holds. We now introduce

ψ := α₁ψ₀ + ξ̂,  ξ̂ := ξ/‖ξ‖_V,  α₁ ∈ R,
µ := α₂ν̂,  ν̂ := ν/‖ν‖_M,  α₂ ∈ R.


Note that ‖ψ‖²_V + ‖µ‖²_M = α₁² + 1 + α₂². We obtain:

sup_{v∈H} k(u,v)/‖v‖_H ≥ sup_{α₁,α₂} [a(φ,ψ) + b(φ,µ) + b(ψ,λ)] / (1 + α₁² + α₂²)^{1/2}
= sup_{α₁,α₂} [a(φ₀,ψ) + a(φ^⊥,ψ) + b(φ^⊥,µ) + b(ξ̂,λ)] / (1 + α₁² + α₂²)^{1/2}
= sup_{α₁,α₂} [a(φ₀,ψ) + a(φ^⊥,ψ) + ⟨ν,µ⟩_M + ⟨ξ,ξ̂⟩_V] / (1 + α₁² + α₂²)^{1/2}
= sup_{α₁,α₂} [α₁ a(φ₀,ψ₀) + a(φ₀,ξ̂) + a(φ^⊥,ψ) + α₂‖ν‖_M + ‖ξ‖_V] / (1 + α₁² + α₂²)^{1/2}
≥ sup_{α₁,α₂} [(α₁δ − ‖a‖)‖φ₀‖_V + (α₂β − ‖a‖(α₁ + 1))‖φ^⊥‖_V + β‖λ‖_M] / (1 + α₁² + α₂²)^{1/2}.

We take α₁, α₂ such that α₁δ − ‖a‖ = β and α₂β − ‖a‖(α₁ + 1) = β. This results in

(α₁, α₂) = ((‖a‖ + β)/(δβ)) (β, ‖a‖ + δ).

Note that δ ≤ ‖a‖, α₁ ≥ 1, α₂ ≥ 1, and thus (1 + α₁² + α₂²)^{1/2} ≤ α₁ + α₂ ≤ (β + 2‖a‖)²/(δβ). We conclude that

sup_{v∈H} k(u,v)/‖v‖_H ≥ (δβ²/(β + 2‖a‖)²) (‖φ₀‖_V + ‖φ^⊥‖_V + ‖λ‖_M) ≥ (δβ²/(β + 2‖a‖)²) ‖u‖_H (2.56)

holds. Using a continuity argument the same result holds if λ = 0 or φ^⊥ = 0. Hence condition (2.53) holds with ε = δβ²/(β + 2‖a‖)².

e) Take φ ∈ V₀, φ ≠ 0. From (2.52) and theorem 2.3.1 with H₁ = H₂ = V₀, k(u,v) = a(u,v), it follows that there exists a unique ξ ∈ V₀ such that a(ξ,ψ) = ⟨φ,ψ⟩_V for all ψ ∈ V₀ and ‖ξ‖_V ≤ δ⁻¹‖φ‖_{V₀′} = δ⁻¹‖φ‖_V. If we take ψ = φ we obtain

sup_{ψ∈V₀} a(ψ,φ)/‖ψ‖_V ≥ a(ξ,φ)/‖ξ‖_V = ‖φ‖²_V/‖ξ‖_V ≥ δ‖φ‖_V. (2.57)

We introduce the adjoint bilinear form (with u = (φ,λ), v = (ψ,µ)):

k̃(u,v) := k(v,u) = ã(φ,ψ) + b(φ,µ) + b(ψ,λ),  ã(φ,ψ) := a(ψ,φ).

From (2.57) it follows that

sup_{ψ∈V₀} ã(φ,ψ)/‖ψ‖_V ≥ δ‖φ‖_V   for all φ ∈ V₀.

Using this one can prove, with exactly the same arguments as in part d), that

sup_{v∈H} k̃(u,v)/‖v‖_H ≥ (δβ²/(β + 2‖a‖)²) ‖u‖_H   for all u ∈ H

holds. Thus for every u ∈ H, u ≠ 0, there exists v ∈ H such that k̃(u,v) ≠ 0, i.e., k(v,u) ≠ 0. This shows that (2.37) holds and completes the proof of a)–e). The final statement in the theorem follows from the final result in theorem 2.3.1 and the choice of ε in part d).


Remark 2.3.11 The final result in theorem 2.3.10 predicts that if we scale such that ‖a‖ = 1, then the stability constant c = (β + 2‖a‖)² δ⁻¹ β⁻² is large when the values of the constants δ and β are much smaller than one. We now give an example with ‖a‖ = 1 in which the stability deteriorates like δ⁻¹β⁻² for δ ↓ 0, β ↓ 0. This shows that the behaviour c ∼ δ⁻¹β⁻² for the stability constant is sharp.
Take V = R² with the Euclidean norm ‖ · ‖₂, M = R, and let e₁ = (1 0)ᵀ, e₂ = (0 1)ᵀ be the standard basis vectors in R². For fixed β > 0, δ > 0 we introduce the bilinear forms

b(ψ,λ) = β e₁ᵀψ λ,  ψ ∈ R², λ ∈ R,
a(φ,ψ) = φᵀAψ,  A := (0 1; 1 δ),  φ, ψ ∈ R².

We then have V₀ = span(e₂) and a simple computation yields

sup_{ψ∈V} b(ψ,λ)/‖ψ‖₂ = β|λ|   for all λ,
sup_{ψ∈V₀} a(φ,ψ)/‖ψ‖₂ = δ‖φ‖₂   for all φ ∈ V₀.

With u = (φ,λ), v = (ψ,µ) ∈ R³ we have

k(u,v) = a(φ,ψ) + b(φ,µ) + b(ψ,λ) = uᵀCv,  C := (0 1 β; 1 δ 0; β 0 0).

We consider the functional f(v) = f₁(ψ) + f₂(µ) = µ = (0 0 1)v with norm ‖f‖_{H′} = 1. The unique solution u ∈ V × M = R³ such that k(u,v) = f(v) for all v ∈ V × M is the unique solution of Cu = (0 0 1)ᵀ. Hence

u = ( 1/β, −1/(δβ), 1/(δβ²) )ᵀ.

From this it follows that for all 0 < β ≤ 1, 0 < δ ≤ 1 we have

(1/(δβ²)) ‖f‖_{H′} ≤ ‖u‖_H ≤ (√3/(δβ²)) ‖f‖_{H′}.
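The formulas in this remark can be verified numerically for sample values of β and δ; a small sketch (the particular values are chosen here for illustration):

```python
import math

# Quick numerical check (not part of the text) of the example in remark 2.3.11:
# for sample values of beta and delta, the vector
# u = (1/beta, -1/(delta*beta), 1/(delta*beta^2))^T solves C u = (0, 0, 1)^T,
# and its norm grows like 1/(delta*beta^2).
beta, delta = 0.1, 0.05
C = [[0.0, 1.0, beta],
     [1.0, delta, 0.0],
     [beta, 0.0, 0.0]]
u = [1/beta, -1/(delta*beta), 1/(delta*beta**2)]

Cu = [sum(C[i][j]*u[j] for j in range(3)) for i in range(3)]
assert abs(Cu[0]) < 1e-9 and abs(Cu[1]) < 1e-9 and abs(Cu[2] - 1.0) < 1e-9

norm_u = math.sqrt(sum(x*x for x in u))
bound = 1/(delta*beta**2)
assert bound <= norm_u <= math.sqrt(3)*bound
```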

Important sufficient conditions for well-posedness of the problem (2.43) are formulated in the following corollary.


Corollary 2.3.12 For arbitrary f₁ ∈ V′, f₂ ∈ M′ consider the variational problem (2.43): find (φ,λ) ∈ V × M such that

a(φ,ψ) + b(ψ,λ) = f₁(ψ)   for all ψ ∈ V, (2.58a)
b(φ,µ) = f₂(µ)   for all µ ∈ M. (2.58b)

Assume that the bilinear forms a(·,·) and b(·,·) are continuous and satisfy the following two conditions:

∃ β > 0 : sup_{ψ∈V} b(ψ,λ)/‖ψ‖_V ≥ β‖λ‖_M ∀ λ ∈ M (inf-sup condition), (2.59a)
∃ γ > 0 : a(φ,φ) ≥ γ‖φ‖²_V ∀ φ ∈ V (V-ellipticity). (2.59b)

Then the conditions (2.47) and (2.52) (with δ = γ) are satisfied and the problem (2.58) has a unique solution (φ,λ). Moreover, the stability bound ‖(φ,λ)‖_H ≤ ((β + 2‖a‖)²/(γβ²)) ‖(f₁, f₂)‖_{H′} holds.

Proof. Apply theorem 2.3.10.

2.4 Minimization of functionals and saddle-point problems

In the variational problems treated in the previous section we did not assume symmetry of the bilinear forms. In this section we introduce certain symmetry properties and show that in that case equivalent alternative problem formulations can be derived.

First we reconsider the case of a continuous bilinear form k : H × H → R that is H-elliptic. This situation is considered in the Lax-Milgram lemma 2.3.5. In addition we now assume that the bilinear form is symmetric: k(u,v) = k(v,u) for all u, v.

Theorem 2.4.1 Let H be a Hilbert space and k : H × H → R a continuous H-elliptic symmetric bilinear form. For f ∈ H′ let u ∈ H be the unique solution of the variational problem

k(u,v) = f(v)   for all v ∈ H. (2.60)

Then u is the unique minimizer of the functional

J(v) := ½ k(v,v) − f(v). (2.61)


Proof. From the Lax-Milgram theorem it follows that the variational problem (2.60) has a unique solution u ∈ H. For arbitrary z ∈ H, z ≠ 0, we have, with ellipticity constant γ > 0:

J(u + z) = ½ k(u + z, u + z) − f(u + z)
         = ½ k(u,u) − f(u) + k(u,z) − f(z) + ½ k(z,z)
         = J(u) + ½ k(z,z) ≥ J(u) + ½ γ‖z‖²_H > J(u).

This proves the desired result.
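On a finite-dimensional space the theorem can be checked directly; a small sketch (the SPD matrix A and functional f are examples chosen here, not from the text):

```python
import random

# Numerical illustration (not from the text) of theorem 2.4.1 on H = R^2:
# with k(v, w) = w^T A v for a symmetric positive definite A, the solution u of
# A u = f minimizes J(v) = (1/2) v^T A v - f^T v.
A = [[2.0, 1.0], [1.0, 3.0]]   # SPD example matrix
f = [1.0, 0.0]
u = [0.6, -0.2]                 # solves A u = f (checked below)

def J(v):
    quad = sum(v[i]*A[i][j]*v[j] for i in range(2) for j in range(2))
    return 0.5*quad - (f[0]*v[0] + f[1]*v[1])

# u solves the linear system, hence the variational problem k(u, v) = f(v)
assert all(abs(sum(A[i][j]*u[j] for j in range(2)) - f[i]) < 1e-12 for i in range(2))

# J(u + z) > J(u) for every nonzero perturbation z tried
random.seed(0)
for _ in range(100):
    z = [random.uniform(-1, 1), random.uniform(-1, 1)]
    assert J([u[0] + z[0], u[1] + z[1]]) > J(u)
```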

We now reconsider the variational problem (2.43) and the result formulated in corollary 2.3.12.

Theorem 2.4.2 For arbitrary f₁ ∈ V′, f₂ ∈ M′ consider the variational problem (2.43): find (φ,λ) ∈ V × M such that

a(φ,ψ) + b(ψ,λ) = f₁(ψ)   for all ψ ∈ V, (2.62a)
b(φ,µ) = f₂(µ)   for all µ ∈ M. (2.62b)

Assume that the bilinear forms a(·,·) and b(·,·) are continuous and satisfy the conditions (2.47) and (2.52). In addition we assume that a(·,·) is symmetric. Define the functional L : V × M → R by

L(ψ,µ) = ½ a(ψ,ψ) + b(ψ,µ) − f₁(ψ) − f₂(µ).

Then the unique solution (φ,λ) of (2.62) is also the unique element in V × M for which

L(φ,µ) ≤ L(φ,λ) ≤ L(ψ,λ)   for all ψ ∈ V, µ ∈ M (2.63)

holds.

Proof. From theorem 2.3.10 it follows that the problem (2.62) has a unique solution. Take a fixed element (φ̃, λ̃) ∈ V × M. We will prove:

L(φ̃,µ) ≤ L(φ̃, λ̃) ∀ µ ∈ M ⇔ b(φ̃,ν) = f₂(ν) ∀ ν ∈ M,
L(φ̃, λ̃) ≤ L(ψ, λ̃) ∀ ψ ∈ V ⇔ a(φ̃,ψ) + b(ψ, λ̃) = f₁(ψ) ∀ ψ ∈ V. (2.64)

From this it follows that (φ̃, λ̃) satisfies (2.63) if and only if (φ̃, λ̃) is a solution of (2.62). This then proves the statement of the theorem. We now prove (2.64). Note that

L(φ̃,µ) ≤ L(φ̃, λ̃) ∀ µ ∈ M
⇔ b(φ̃,µ) − f₂(µ) ≤ b(φ̃, λ̃) − f₂(λ̃) ∀ µ ∈ M
⇔ b(φ̃,ν) ≤ f₂(ν) ∀ ν ∈ M
⇔ b(φ̃,ν) = f₂(ν) ∀ ν ∈ M.

From this the first result in (2.64) follows. For the second result we first note

L(φ̃, λ̃) ≤ L(ψ, λ̃) ∀ ψ ∈ V
⇔ ½ a(φ̃, φ̃) + b(φ̃, λ̃) − f₁(φ̃) ≤ ½ a(ψ,ψ) + b(ψ, λ̃) − f₁(ψ) ∀ ψ ∈ V
⇔ −½ a(ξ, ξ) ≤ a(φ̃, ξ) + b(ξ, λ̃) − f₁(ξ) ∀ ξ ∈ V.


Now note that ξ → a(ξ, ξ) is a quadratic term and ξ → a(φ̃, ξ) + b(ξ, λ̃) − f₁(ξ) is linear. A scaling argument now yields

L(φ̃, λ̃) ≤ L(ψ, λ̃) ∀ ψ ∈ V
⇔ 0 ≤ a(φ̃, ξ) + b(ξ, λ̃) − f₁(ξ) ∀ ξ ∈ V
⇔ 0 = a(φ̃, ξ) + b(ξ, λ̃) − f₁(ξ) ∀ ξ ∈ V,

and thus the second result in (2.64) holds.

Note that if a(·,·) is symmetric then (2.52a) implies (2.52b). Due to the property (2.63) the problem (2.62) with a symmetric bilinear form a(·,·) is called a saddle-point problem.

2.5 Variational formulation of scalar elliptic problems

2.5.1 Introduction

We reconsider the example of section 2.1, i.e., the two-point boundary value problem

−(au′)′ = 1 in (0, 1), (2.65a)
u(0) = u(1) = 0, (2.65b)

with a(x) > 0 for all x ∈ [0, 1]. Let V₁, k(·,·) and f(·) be as defined in section 2.1:

V₁ = { v ∈ C²([0, 1]) | v(0) = v(1) = 0 },

k(u, v) = ∫₀¹ a(x)u′(x)v′(x) dx,  f(v) = ∫₀¹ v(x) dx.

The two-point boundary value problem has a corresponding variational formulation:

find u ∈ V₁ such that k(u, v) = f(v) for all v ∈ V₁. (2.66)

One easily checks that u ∈ V₁ solves (2.65) iff u is a solution of (2.66). Hence, if the problem (2.65) has a solution u ∈ C²([0, 1]), this must also be the unique (due to lemma 2.1.2) solution of (2.66). As in section 2.1 we now consider this problem with a discontinuous piecewise constant function a. Then the classical formulation (2.65) is not well-defined, whereas the variational problem does make sense. However, in section 2.1 it is shown that the problem (2.66) has no solution (the space V₁ is "too small"). Since in the bilinear form k(·,·) only first derivatives occur, the larger space V₂ := { v ∈ C¹([0, 1]) | v(0) = v(1) = 0 } seems to be more appropriate. This leads to the weaker variational formulation:

find u ∈ V₂ such that k(u, v) = f(v) for all v ∈ V₂. (2.67)

However, it is shown in section 2.1 that the problem (2.67) still has no solution. The key step is to take the completion of the space V₂ (or, equivalently, of V₁):

H¹₀((0, 1)) = C₀^∞([0, 1])^{‖·‖₁} = V₁^{‖·‖₁} = V₂^{‖·‖₁}.


Thus we consider

find u ∈ H¹₀((0, 1)) such that k(u, v) = f(v) for all v ∈ V₂.

Both the bilinear form k(·,·) and f(·) are continuous on H¹₀((0, 1)), and thus this problem is equivalent to the variational problem

find u ∈ H¹₀((0, 1)) such that k(u, v) = f(v) for all v ∈ H¹₀((0, 1)). (2.68)

From the Lax-Milgram lemma 2.3.5 it follows that there exists a unique solution (which is usually called the weak solution) of the variational problem (2.68). For this existence and uniqueness result it is essential that we used the Sobolev space H¹₀((0, 1)), which is a Hilbert space. In section 2.1 we considered a space V₃ with V₂ ⊂ V₃ ⊂ H¹₀((0, 1)) and showed that the function u given in (2.11) solves the variational problem in the space V₃. Due to V₃^{‖·‖₁} = H¹₀((0, 1)) this function u is also the unique solution of (2.68).

We summarize the fundamental steps discussed in this section in the following diagram:

(2.65) → weaker formulation, due to reduction of the order of differentiation → (2.67) → weaker formulation, due to completion of the space → (2.68).

A very similar approach can be applied to a large class of elliptic boundary value problems, as will be shown in the following sections.

Remark 2.5.1 For the weak formulation in (2.68) to have a unique solution it is important that the bilinear form is elliptic. The following example illustrates this. Consider (2.65) with a(x) = √x. Then the solution of this problem is given by u(x) = (2/3)√x (1 − x) (as can be checked by substitution). Note that u ∉ V_2. Since u ∈ C^2((0,1)) ∩ C([0,1]), this is the classical solution of (2.65), cf. section 1.2. However, due to ∫_0^1 u'(x)^2 dx = ∞ it follows that u ∉ H^1((0,1)) and thus the weak formulation as in (2.68) does not have a solution.
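The blow-up of the energy in this remark can be observed numerically. The sketch below (pure Python; the midpoint quadrature and the truncation points are illustrative choices, and we assume, as the flux computation suggests, that (2.65) is −(a(x)u'(x))' = 1 with homogeneous Dirichlet conditions) first confirms that √x u'(x) = 1/3 − x, so that −(√x u')' = 1, and then shows that the truncated energy ∫_ε^1 u'(x)^2 dx grows like (1/9) log(1/ε) as ε ↓ 0:

```python
import math

def u(x):        # u(x) = (2/3) sqrt(x) (1 - x)
    return 2.0 / 3.0 * math.sqrt(x) * (1.0 - x)

def uprime(x):
    return 1.0 / (3.0 * math.sqrt(x)) - math.sqrt(x)

# flux check: sqrt(x) u'(x) = 1/3 - x, hence -(sqrt(x) u')' = 1
for x in (0.1, 0.5, 0.9):
    assert abs(math.sqrt(x) * uprime(x) - (1.0 / 3.0 - x)) < 1e-12

def truncated_energy(eps, m=200000):
    # midpoint rule for int_eps^1 u'(x)^2 dx
    h = (1.0 - eps) / m
    return h * sum(uprime(eps + (i + 0.5) * h) ** 2 for i in range(m))

# the truncated energy grows like (1/9) log(1/eps): u is not in H^1((0,1))
for eps in (1e-2, 1e-4, 1e-6):
    print(eps, round(truncated_energy(eps), 3))
```

Each halving of the exponent of ε adds roughly (1/9)·log(100) ≈ 0.51 to the energy, consistent with the divergence of ∫_0^1 u'(x)^2 dx.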

2.5.2 Elliptic BVP with homogeneous Dirichlet boundary conditions

In this section we derive and analyze variational formulations for a class of scalar elliptic boundary value problems. We consider a linear second order differential operator of the form

Lu = −∑_{i,j=1}^n ∂/∂x_i ( a_{ij} ∂u/∂x_j ) + ∑_{i=1}^n b_i ∂u/∂x_i + c u.    (2.69)

Note that this form differs from the one in (1.2). If the coefficients a_{ij} are differentiable, then due to

∑_{i,j=1}^n ∂/∂x_i ( a_{ij} ∂u/∂x_j ) = ∑_{i,j=1}^n a_{ij} ∂^2u/(∂x_i ∂x_j) + ∑_{i,j=1}^n (∂a_{ij}/∂x_i)(∂u/∂x_j)

the operator L can be written in the form as in (1.2), with the same c as in (2.69) but with a_{ij} and b_i in (1.2) replaced by −a_{ij} and b_i − ∑_{j=1}^n ∂a_{ji}/∂x_j, respectively.
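In the one-dimensional case the identity above reduces to (a u')' = a u'' + a' u'. A quick numerical sanity check (the coefficient a(x) = 1 + x^2 and the function u(x) = sin x are arbitrary smooth choices, not taken from the text) compares a central difference approximation of the divergence form with the analytically expanded non-divergence form:

```python
import math

a  = lambda x: 1.0 + x * x          # illustrative smooth diffusion coefficient
u  = lambda x: math.sin(x)
up = lambda x: math.cos(x)          # u'

def divergence_form(x, h=1e-5):
    # central difference for (a u')'(x)
    return (a(x + h) * up(x + h) - a(x - h) * up(x - h)) / (2.0 * h)

def nondivergence_form(x):
    # a u'' + a' u'  with  a'(x) = 2x  and  u''(x) = -sin x
    return a(x) * (-math.sin(x)) + 2.0 * x * math.cos(x)

for x in (0.3, 1.1, 2.5):
    assert abs(divergence_form(x) - nondivergence_form(x)) < 1e-6
```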


As in section 1.2.1 the coefficients that determine the principal part of the operator are collected in a matrix

A(x) = (a_{ij}(x))_{1≤i,j≤n}.

We assume that the problem is uniformly elliptic:

∃ α_0 > 0 : ξ^T A(x) ξ ≥ α_0 ξ^T ξ for all ξ ∈ R^n, x ∈ Ω.    (2.70)

We use the notation b(x) = (b_1(x), …, b_n(x))^T. In this section we only discuss the case with homogeneous Dirichlet boundary conditions, i.e., we consider the following elliptic boundary value problem:

Lu = f in Ω,    (2.71a)
u = 0 on ∂Ω.    (2.71b)

We now derive a (weaker) variational formulation of this problem along the same lines as in the previous section. For this derivation we assume that the equation (2.71a) has a solution u in the space V := { u ∈ C^2(Ω̄) | u = 0 on ∂Ω }. Multiplication of (2.71a) with v ∈ C^∞_0(Ω) and partial integration imply that u also satisfies

∫_Ω ∇u^T A ∇v + b · ∇u v + c u v dx = ∫_Ω f v dx.

Based on this, we introduce a bilinear form and a linear functional:

k(u, v) = ∫_Ω ∇u^T A ∇v + b · ∇u v + c u v dx,   f(v) = ∫_Ω f v dx.    (2.72)

We conclude that a solution u ∈ V also solves the following variational problem:

find u ∈ V such that
k(u, v) = f(v) for all v ∈ C^∞_0(Ω).

Note that in the bilinear form no derivatives of order higher than one occur. This motivates the use of spaces obtained by completion w.r.t. the norm ‖·‖_1 and leads to the Sobolev space H^1_0(Ω), the completion of C^∞_0(Ω) w.r.t. ‖·‖_1. One may check that C^∞_0(Ω) ⊂ V ⊂ H^1_0(Ω) and thus the completion of V w.r.t. ‖·‖_1 equals H^1_0(Ω). We thus obtain the following.

The variational formulation of (2.71) is given by:

find u ∈ H^1_0(Ω) such that
k(u, v) = f(v) for all v ∈ H^1_0(Ω),    (2.73)

with k(·, ·) and f(·) as in (2.72).

It is easy to verify that if the problem (2.73) has a smooth solution u ∈ C^2(Ω̄) and if the coefficients are sufficiently smooth, then u is also a solution of (2.71). In this sense this problem is the correct weak formulation.


Remark 2.5.2 There is a subtle reason why in the derivation of the weak formulation we used the test space C^∞_0(Ω) and not C^∞(Ω̄). The reason for this choice is closely related to the type of boundary condition. In the situation here we have prescribed boundary values which are automatically fulfilled in the space V and also (after completion) in H^1_0(Ω). Therefore, the differential equation should be "tested" in the form ∫_Ω (Lu − f) v dx = 0 only in the interior of Ω, i.e., with functions v that are zero on the boundary. Hence we take v ∈ C^∞_0(Ω). In problems with other types of boundary conditions it may be necessary to take test functions v ∈ C^∞(Ω̄). This will be further explained in section 2.5.3.

We now analyze existence and uniqueness of the variational problem (2.73) by means of the Lax-Milgram lemma 2.3.5. We use the following mild smoothness assumptions concerning the coefficients in the differential operator:

a_{ij} ∈ L^∞(Ω) ∀ i, j,   b_i ∈ H^1(Ω) ∩ L^∞(Ω) ∀ i,   c ∈ L^∞(Ω).    (2.74)

Theorem 2.5.3 Let (2.70) and (2.74) hold and assume that the condition

−(1/2) div b + c ≥ 0 a.e. in Ω

is fulfilled. Then for every f ∈ L^2(Ω) the variational problem (2.73) with f(v) := ∫_Ω f v dx has a unique solution u. Moreover, the inequality

‖u‖_1 ≤ C ‖f‖_{L^2}

holds with a constant C independent of f.

Proof. We use the Lax-Milgram lemma and the fact that ‖·‖_1 and |·|_1, defined by |u|_1^2 = ∑_{|α|=1} ‖D^α u‖_{L^2}^2, are equivalent norms on H^1_0(Ω). From

|f(v)| = |∫_Ω f v dx| ≤ ‖f‖_{L^2} ‖v‖_{L^2} ≤ ‖f‖_{L^2} ‖v‖_1

it follows that f(·) defines a bounded linear functional on H^1_0(Ω). We now check the boundedness of the bilinear form k(·, ·) for u, v ∈ H^1_0(Ω):

|k(u, v)| ≤ |∑_{i,j=1}^n ∫_Ω a_{ij} (∂u/∂x_j)(∂v/∂x_i) dx| + |∑_{i=1}^n ∫_Ω b_i (∂u/∂x_i) v dx| + |∫_Ω c u v dx|
≤ ∑_{i,j=1}^n ‖a_{ij}‖_{L^∞} ‖∂u/∂x_j‖_{L^2} ‖∂v/∂x_i‖_{L^2} + ∑_{i=1}^n ‖b_i‖_{L^∞} ‖∂u/∂x_i‖_{L^2} ‖v‖_{L^2} + ‖c‖_{L^∞} ‖u‖_{L^2} ‖v‖_{L^2}
≤ ∑_{i,j=1}^n ‖a_{ij}‖_{L^∞} ‖u‖_1 ‖v‖_1 + ∑_{i=1}^n ‖b_i‖_{L^∞} ‖u‖_1 ‖v‖_{L^2} + ‖c‖_{L^∞} ‖u‖_{L^2} ‖v‖_{L^2}
≤ C ‖u‖_1 ‖v‖_1.

Note that C^∞_0(Ω) is dense in H^1_0(Ω), the bilinear form is continuous, and ‖·‖_1 and |·|_1 are equivalent norms. Hence, for the ellipticity, k(u, u) ≥ γ ‖u‖_1^2 (with γ > 0), to hold it suffices to


show k(u, u) ≥ γ |u|_1^2 for all u ∈ C^∞_0(Ω).

Take u ∈ C^∞_0(Ω). From the uniform ellipticity assumption (2.70) it follows (with ξ = ∇u) that

∑_{i,j=1}^n ∫_Ω a_{ij} (∂u/∂x_j)(∂u/∂x_i) dx ≥ α_0 ∫_Ω ∑_{j=1}^n (∂u/∂x_j)^2 dx = α_0 |u|_1^2,

with α_0 > 0. Using partial integration we obtain

∑_{i=1}^n ∫_Ω b_i (∂u/∂x_i) u dx = (1/2) ∑_{i=1}^n ∫_Ω b_i ∂(u^2)/∂x_i dx = −(1/2) ∫_Ω div b u^2 dx.

Collecting these results we get

k(u, u) ≥ α_0 |u|_1^2 + ∫_Ω ( −(1/2) div b + c ) u^2 dx.

The desired result follows from the assumption −(1/2) div b + c ≥ 0 (a.e.).

We now formulate two important special cases.

Corollary 2.5.4 For every f ∈ L^2(Ω) the Poisson equation (in variational formulation)

find u ∈ H^1_0(Ω) such that
∫_Ω ∇u · ∇v dx = ∫_Ω f v dx for all v ∈ H^1_0(Ω)

has a unique solution. Moreover, ‖u‖_1 ≤ C ‖f‖_{L^2} holds with a constant C independent of f.
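For Ω = (0,1) the Poisson problem of this corollary can be discretized with piecewise linear finite elements (this anticipates chapter 3; the uniform mesh, the Thomas algorithm and the manufactured right-hand side f ≡ 1 with exact solution u(x) = x(1−x)/2 are illustrative choices, not taken from the text). In 1D, with exact load integrals, the Galerkin solution is even nodally exact:

```python
def solve_tridiag(a, b, c, d):
    # Thomas algorithm: a = sub-, b = main, c = super-diagonal, d = rhs
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def poisson_fem(n, f):
    # linear finite elements on a uniform grid for -u'' = f, u(0) = u(1) = 0
    h = 1.0 / n
    nodes = [i * h for i in range(1, n)]          # interior nodes
    main = [2.0 / h] * (n - 1)                    # stiffness: tridiag(-1, 2, -1)/h
    off  = [-1.0 / h] * (n - 1)
    load = [h * f(x) for x in nodes]              # exact load for constant f
    return nodes, solve_tridiag(off, main, off, load)

nodes, uh = poisson_fem(16, lambda x: 1.0)
for x, v in zip(nodes, uh):
    assert abs(v - 0.5 * x * (1.0 - x)) < 1e-12   # nodally exact for f = 1
```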

Corollary 2.5.5 For f ∈ L^2(Ω), ε > 0 and b_i ∈ H^1(Ω) ∩ L^∞(Ω), i = 1, …, n, consider the convection-diffusion problem (in variational formulation)

find u ∈ H^1_0(Ω) such that
ε ∫_Ω ∇u · ∇v dx + ∫_Ω b · ∇u v dx = ∫_Ω f v dx for all v ∈ H^1_0(Ω).

If div b ≤ 0 holds (a.e.), then this problem has a unique solution, and ‖u‖_1 ≤ C ‖f‖_{L^2} holds with a constant C independent of f.

We note that the condition div b ≤ 0 holds, for example, if all b_i, i = 1, …, n, are constants. In the singular perturbation case it may happen that the stability constant deteriorates: C = C(ε) ↑ ∞ as ε ↓ 0.
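The deterioration C = C(ε) can be made explicit in one dimension. For the model problem −εu'' + u' = 1 on (0,1) with u(0) = u(1) = 0 (a standard boundary-layer example, not taken from the text), the exact solution and its H^1-seminorm are available in closed form, and ε|u|_1^2 tends to 1/2, i.e. ‖u‖_1 grows like (2ε)^{−1/2} while ‖f‖_{L^2} = 1:

```python
import math

def h1_seminorm_sq(eps):
    # |u|_1^2 for u(x) = x - (exp((x-1)/eps) - q)/(1 - q), q = exp(-1/eps),
    # the exact solution of -eps u'' + u' = 1, u(0) = u(1) = 0
    q = math.exp(-1.0 / eps)
    return (1.0 + q) / (2.0 * eps * (1.0 - q)) - 1.0

# ||f||_{L2} = 1 while |u|_1 ~ (2 eps)^{-1/2}: the stability constant blows up
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    ratio = eps * h1_seminorm_sq(eps)
    assert 0.39 < ratio < 0.5          # eps |u|_1^2 -> 1/2 as eps -> 0
```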

2.5.3 Other boundary conditions

In this section we consider variational formulations of elliptic problems with boundary conditions that are not of homogeneous Dirichlet type. For simplicity we only discuss the case L = −∆. The corresponding results for general second order differential operators (L as in (2.69)) are very similar.

Inhomogeneous Dirichlet boundary conditions. First we treat the Poisson equation with Dirichlet boundary data φ that are not identically zero:

−∆u = f in Ω,
u = φ on ∂Ω.


Assume that this problem has a solution in the space V = { u ∈ C^2(Ω̄) | u = φ on ∂Ω }. After completion (w.r.t. ‖·‖_1) this yields the space { u ∈ H^1(Ω) | γ(u) = φ }, where γ is the trace operator. As in the previous section the boundary conditions are automatically fulfilled in this space, and thus we take test functions v ∈ C^∞_0(Ω) (cf. remark 2.5.2). Multiplication of the differential equation with such a function v, partial integration and completion with respect to ‖·‖_1 result in the following variational problem:

find u ∈ { u ∈ H^1(Ω) | u|_{∂Ω} = φ } such that
∫_Ω ∇u · ∇v dx = ∫_Ω f v dx for all v ∈ H^1_0(Ω).    (2.75)

If u and ū are solutions of (2.75) then u − ū ∈ H^1_0(Ω) and ∫_Ω ∇(u − ū) · ∇v dx = 0 for all v ∈ H^1_0(Ω). Taking v = u − ū it follows that u = ū, and thus we have at most one solution. To prove existence we introduce a transformed problem. For the identity u|_{∂Ω} = φ (⇔ γ(u) = φ) to make sense, we assume that the boundary data φ are such that φ ∈ range(γ). Then there exists u_0 ∈ H^1(Ω) such that u_0|_{∂Ω} = φ.

Lemma 2.5.6 Assume f ∈ L^2(Ω) and φ ∈ range(γ). Take u_0 ∈ H^1(Ω) such that γ(u_0) = φ. Then u solves the variational problem (2.75) iff w = u − u_0 solves the following:

find w ∈ H^1_0(Ω) such that
∫_Ω ∇w · ∇v dx = ∫_Ω f v dx − ∫_Ω ∇u_0 · ∇v dx for all v ∈ H^1_0(Ω).    (2.76)

Proof. Trivial.

Note that f(v) = ∫_Ω f v dx − ∫_Ω ∇u_0 · ∇v dx defines a continuous linear functional on H^1_0(Ω). The Lax-Milgram lemma yields existence of a solution of (2.76) and thus of (2.75).
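In one dimension the lifting argument of lemma 2.5.6 can be carried out verbatim in a finite element code (a sketch; the linear lifting u_0 and the data f ≡ 1, u(0) = 2, u(1) = −1 are illustrative choices). Note that for a lifting with constant u_0' the correction term ∫ u_0' v' dx vanishes for every interior hat function, since ∫ φ_i' dx = 0:

```python
def solve_tridiag(a, b, c, d):
    # Thomas algorithm: a = sub-, b = main, c = super-diagonal, d = rhs
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def dirichlet_fem(n, f, alpha, beta):
    # -u'' = f on (0,1), u(0) = alpha, u(1) = beta, via the lifting
    # u0(x) = alpha (1-x) + beta x: solve for w = u - u0 in H^1_0
    h = 1.0 / n
    u0 = lambda x: alpha * (1.0 - x) + beta * x
    nodes = [i * h for i in range(1, n)]          # interior nodes
    main = [2.0 / h] * (n - 1)
    off  = [-1.0 / h] * (n - 1)
    # rhs = int f v dx - int u0' v' dx; the lifting term vanishes here,
    # since u0' is constant and int phi_i' dx = 0 for interior hats
    load = [h * f(x) for x in nodes]
    w = solve_tridiag(off, main, off, load)
    return nodes, [wi + u0(x) for x, wi in zip(nodes, w)]

nodes, uh = dirichlet_fem(10, lambda x: 1.0, 2.0, -1.0)
exact = lambda x: 2.0 - 3.0 * x + 0.5 * x * (1.0 - x)  # solves this BVP
for x, v in zip(nodes, uh):
    assert abs(v - exact(x)) < 1e-12
```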

Natural boundary conditions. We now consider a problem in which also (normal) derivatives of u occur in the boundary condition:

−∆u = f in Ω,    (2.77a)
∂u/∂n + β u = φ on ∂Ω,    (2.77b)

with β ∈ R a constant and ∂u/∂n = (n · ∇)u = ∇u · n the normal derivative at the boundary. For this problem the following difficulty arises, related to the (normal) derivative in the boundary condition. For u ∈ H^1(Ω) the weak derivative D^α u, |α| = 1, is an element of L^2(Ω). It can be shown that it is not possible to define unambiguously v|_{∂Ω} for v ∈ L^2(Ω). In other words, there is no trace operator which in a satisfactory way defines ∂u/∂n|_{∂Ω} for u ∈ H^1(Ω). This is the reason why for the solution u we search in the space H^1(Ω), which does not take the boundary conditions into account. Due to this, for the derivation of an appropriate weak formulation, we multiply (2.77a) with test functions v ∈ C^∞(Ω̄) (and not C^∞_0(Ω), cf. remark 2.5.2). This results in

∫_Ω ∇u · ∇v dx − ∫_{∂Ω} ∇u · n v ds = ∫_Ω f v dx for all v ∈ C^∞(Ω̄).

We now use the boundary condition (2.77b) and then obtain

∫_Ω ∇u · ∇v dx + β ∫_{∂Ω} u v ds = ∫_Ω f v dx + ∫_{∂Ω} φ v ds for all v ∈ C^∞(Ω̄).


This results in the following variational problem:

find u ∈ H^1(Ω) such that
∫_Ω ∇u · ∇v dx + β ∫_{∂Ω} u v ds = ∫_Ω f v dx + ∫_{∂Ω} φ v ds ∀ v ∈ H^1(Ω).    (2.78)

It is easy to verify that if the problem (2.78) has a smooth solution u ∈ C^2(Ω̄) and if φ is sufficiently smooth, then u is also a solution of (2.77). In this sense this problem is the correct weak formulation.

Note that now the space H^1(Ω) is used and not H^1_0(Ω). The space H^1(Ω) does not contain any information concerning the boundary condition (2.77b). The boundary data are part of the bilinear form used in (2.78). In the case of Dirichlet boundary conditions, as in (2.73) and (2.75), this is the other way around: the solution space is such that the boundary conditions are automatically fulfilled, and the boundary data do not influence the bilinear form. The latter class of boundary conditions are called essential boundary conditions (these are a priori fulfilled by the choice of the solution space). Boundary conditions as in (2.77b) are called natural boundary conditions (these are "automatically" fulfilled if the variational problem is solved).
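The distinction between essential and natural boundary conditions is visible in a discretization: for (2.77) the data β and φ modify the stiffness matrix and the right-hand side, while the finite element space keeps all nodal values, including those on the boundary, as unknowns. A 1D sketch (−u'' = f on (0,1) with −u'(0) + βu(0) = φ_0 and u'(1) + βu(1) = φ_1; the data below are manufactured so that u(x) = x + 2 is the exact solution):

```python
def solve_tridiag(a, b, c, d):
    # Thomas algorithm: a = sub-, b = main, c = super-diagonal, d = rhs
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def robin_fem(n, beta, phi0, phi1, f):
    # linear FEM for -u'' = f on (0,1) with the natural conditions
    # -u'(0) + beta u(0) = phi0,  u'(1) + beta u(1) = phi1:
    # beta enters the matrix, phi enters the rhs, the space stays H^1
    h = 1.0 / n
    nodes = [i * h for i in range(n + 1)]
    main = [1.0 / h + beta] + [2.0 / h] * (n - 1) + [1.0 / h + beta]
    off  = [-1.0 / h] * (n + 1)
    load = [h * f(x) for x in nodes]
    load[0]  = 0.5 * h * f(nodes[0]) + phi0
    load[-1] = 0.5 * h * f(nodes[-1]) + phi1
    return nodes, solve_tridiag(off, main, off, load)

# manufactured test: u(x) = x + 2 gives -u'' = 0,
# -u'(0) + 1*u(0) = 1 and u'(1) + 1*u(1) = 4
nodes, uh = robin_fem(8, 1.0, 1.0, 4.0, lambda x: 0.0)
for x, v in zip(nodes, uh):
    assert abs(v - (x + 2.0)) < 1e-10
```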

We now analyze existence and uniqueness of the variational formulation. For this we need two variants of the Poincaré-Friedrichs inequality:

Lemma 2.5.7 There exist constants C_1 and C_2 such that

‖u‖_1^2 ≤ C_1 ( |u|_1^2 + ∫_{∂Ω} u^2 ds ) for all u ∈ H^1(Ω),    (2.79a)
‖u‖_1^2 ≤ C_2 ( |u|_1^2 + |∫_Ω u dx|^2 ) for all u ∈ H^1(Ω).    (2.79b)

[Note that for u ∈ H^1_0(Ω) the first result reduces to the Poincaré-Friedrichs inequality.]

Proof. For i = 1, 2 we define q_i : H^1(Ω) → R by

q_1(u) = ∫_{∂Ω} u^2 ds,   q_2(u) = |∫_Ω u dx|^2.

Then q_i is continuous on H^1(Ω) (for i = 1 this follows from the continuity of the trace operator), q_i(αu) = α^2 q_i(u) for all α ∈ R, and if u is equal to a constant, say c, then q_i(u) = q_i(c) = 0 iff c = 0. Assume that there does not exist a constant C such that ‖u‖_1^2 ≤ C ( |u|_1^2 + q_i(u) ) for all u ∈ H^1(Ω). Then there exists a sequence (u_k)_{k≥1} in H^1(Ω) such that

1 = ‖u_k‖_1^2 ≥ k ( |u_k|_1^2 + q_i(u_k) ) for all k.    (2.80)

This sequence is bounded in H^1(Ω). Since the embedding H^1(Ω) → L^2(Ω) is compact, there exists a subsequence (u_{k(ℓ)})_{ℓ≥1} that converges in L^2(Ω):

lim_{ℓ→∞} u_{k(ℓ)} = u in L^2(Ω).

From (2.80) it follows that lim_{ℓ→∞} |u_{k(ℓ)}|_1 = 0 and thus

lim_{ℓ→∞} D^α u_{k(ℓ)} = 0, if |α| = 1, in L^2(Ω).


From this we conclude that

u ∈ H^1(Ω),   lim_{ℓ→∞} u_{k(ℓ)} = u in H^1(Ω),   D^α u = 0 (a.e.) if |α| = 1.

Hence, u must be constant (a.e.) on Ω, say u = c. From (2.80) we obtain

lim_{ℓ→∞} q_i(u_{k(ℓ)}) = 0.

Using the continuity of q_i it follows that q_i(c) = q_i(u) = 0 and thus c = 0. This yields a contradiction:

1 = lim_{ℓ→∞} ‖u_{k(ℓ)}‖_1^2 = ‖u‖_1^2 = ‖c‖_1^2 = 0.

Thus the results are proved.

Using this lemma we can prove the following result for the variational problem in (2.78):

Theorem 2.5.8 Consider the variational problem (2.78) with β > 0, f ∈ L^2(Ω) and φ ∈ L^2(∂Ω). This problem has a unique solution u and the inequality

‖u‖_1 ≤ C ( ‖f‖_{L^2} + ‖φ‖_{L^2(∂Ω)} )

holds with a constant C independent of f and φ.

Proof. For v ∈ H^1(Ω) define the linear functional

g : v ↦ ∫_Ω f v dx + ∫_{∂Ω} φ v ds.

Using the continuity of the trace operator we obtain

|g(v)| ≤ ‖f‖_{L^2} ‖v‖_{L^2} + ‖φ‖_{L^2(∂Ω)} ‖γ(v)‖_{L^2(∂Ω)} ≤ c ( ‖f‖_{L^2} + ‖φ‖_{L^2(∂Ω)} ) ‖v‖_1

and thus g ∈ (H^1(Ω))′. Define k(u, v) = ∫_Ω ∇u · ∇v dx + β ∫_{∂Ω} u v ds. The continuity of this bilinear form follows from

|k(u, v)| ≤ |u|_1 |v|_1 + β ‖γ(u)‖_{L^2(∂Ω)} ‖γ(v)‖_{L^2(∂Ω)} ≤ |u|_1 |v|_1 + Cβ ‖u‖_1 ‖v‖_1 ≤ C ‖u‖_1 ‖v‖_1.

The ellipticity can be concluded from the result in (2.79a):

k(u, u) = |u|_1^2 + β ∫_{∂Ω} u^2 ds ≥ C ‖u‖_1^2 for all u ∈ H^1(Ω),

with C > 0. Application of the Lax-Milgram lemma completes the proof.

We now analyze the problem with pure Neumann boundary conditions, i.e., β = 0 in (2.77b). Clearly, for this problem we cannot have uniqueness: if u is a solution then for any constant c the function u + c is also a solution. Moreover, for existence of a solution the data f and φ must satisfy a certain condition. Assume that u ∈ H^2(Ω) is a solution of (2.77) for the case β = 0; then

∫_Ω f dx = −∫_Ω ∆u dx = ∫_Ω ∇u · ∇1 dx − ∫_{∂Ω} ∇u · n ds = −∫_{∂Ω} φ ds


must hold. This motivates the introduction of the compatibility relation

∫_Ω f dx + ∫_{∂Ω} φ ds = 0.    (2.81)

To obtain uniqueness, for the solution space we take a subspace of H^1(Ω) consisting of functions u with ⟨u, 1⟩_{L^2} = ⟨u, 1⟩_1 = 0:

H^1_*(Ω) := { u ∈ H^1(Ω) | ∫_Ω u dx = 0 }.

Since this is a closed subspace of H^1(Ω), it is a Hilbert space. Instead of (2.78) we now consider:

find u ∈ H^1_*(Ω) such that
∫_Ω ∇u · ∇v dx = ∫_Ω f v dx + ∫_{∂Ω} φ v ds for all v ∈ H^1(Ω).    (2.82)

For this problem we have existence and uniqueness:

Theorem 2.5.9 Consider the variational problem (2.82) with f ∈ L^2(Ω), φ ∈ L^2(∂Ω) and assume that the compatibility relation (2.81) holds. Then this problem has a unique solution u and the inequality

‖u‖_1 ≤ C ( ‖f‖_{L^2} + ‖φ‖_{L^2(∂Ω)} )

holds with a constant C independent of f and φ.

Proof. For v ∈ H^1(Ω) define the linear functional

g : v ↦ ∫_Ω f v dx + ∫_{∂Ω} φ v ds.

The continuity of this functional is shown in the proof of theorem 2.5.8. Define k(u, v) = ∫_Ω ∇u · ∇v dx. The continuity of this bilinear form is trivial. For u ∈ H^1_*(Ω) we have ∫_Ω u dx = 0 and thus, using the result in (2.79b), we get

k(u, u) = |u|_1^2 + |∫_Ω u dx|^2 ≥ C ‖u‖_1^2 for all u ∈ H^1_*(Ω)

with a constant C > 0. Hence, the bilinear form is H^1_*(Ω)-elliptic. From the Lax-Milgram lemma it follows that there exists a unique solution u ∈ H^1_*(Ω) such that k(u, v) = g(v) for all v ∈ H^1_*(Ω). Note that k(u, 1) = 0 and, due to the compatibility relation, g(1) = 0. It follows that for the solution u we have k(u, v) = g(v) for all v ∈ H^1(Ω).
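For the pure Neumann problem the mean-zero constraint defining H^1_*(Ω) can be imposed discretely with a Lagrange multiplier (one possible device among several; the 1D test case with u(x) = cos(πx), f(x) = π^2 cos(πx) and φ = 0, which satisfies the compatibility relation (2.81), is manufactured for illustration):

```python
import math

def gauss_solve(A, b):
    # dense Gaussian elimination with partial pivoting
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            m = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= m * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def neumann_fem(n, f, phi0, phi1):
    # -u'' = f on (0,1), -u'(0) = phi0, u'(1) = phi1; the mean-zero
    # constraint (mirroring H^1_*) is added as a Lagrange multiplier
    h = 1.0 / n
    nodes = [i * h for i in range(n + 1)]
    m = n + 2                                   # n+1 nodal unknowns + multiplier
    A = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    w = [0.5 * h] + [h] * (n - 1) + [0.5 * h]   # trapezoid weights for int u dx
    for i in range(n + 1):
        if i > 0:
            A[i][i] += 1.0 / h; A[i][i - 1] -= 1.0 / h
        if i < n:
            A[i][i] += 1.0 / h; A[i][i + 1] -= 1.0 / h
        A[i][n + 1] = A[n + 1][i] = w[i]        # constraint: int u dx = 0
        b[i] = w[i] * f(nodes[i])
    b[0] += phi0
    b[n] += phi1
    return nodes, gauss_solve(A, b)[: n + 1]

f = lambda x: math.pi ** 2 * math.cos(math.pi * x)  # compatible data, phi = 0
nodes, uh = neumann_fem(32, f, 0.0, 0.0)
err = max(abs(v - math.cos(math.pi * x)) for x, v in zip(nodes, uh))
assert err < 0.05                               # O(h^2) accuracy expected
```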

Remark 2.5.10 For the case β < 0 it may happen that the problem (2.77) has a nontrivial kernel (and thus we do not have uniqueness). Moreover, in general this kernel is not as simple as for the case β = 0. As an example, consider the problem

−u″(x) = 0 for x ∈ (0, 1),
−u′(0) − 2u(0) = 0,
u′(1) − 2u(1) = 0.

All functions in span(u_*) with u_*(x) = 2x − 1 are solutions of this problem.
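The kernel claimed in this remark is easy to verify directly (a one-line check; linearity then gives all of span(u_*)):

```python
# u*(x) = 2x - 1 has u*'' = 0, so -u'' = 0 holds; check the boundary rows
ustar  = lambda x: 2.0 * x - 1.0
dustar = 2.0                                # constant derivative u*'

assert -dustar - 2.0 * ustar(0.0) == 0.0    # -u'(0) - 2 u(0) = 0
assert  dustar - 2.0 * ustar(1.0) == 0.0    #  u'(1) - 2 u(1) = 0
```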


Mixed boundary conditions. It may happen that in a boundary value problem both natural and essential boundary values occur. We discuss a typical example. Let Γ_e and Γ_n be parts of the boundary ∂Ω such that meas_{n−1}(Γ_e) > 0, meas_{n−1}(Γ_n) > 0, Γ_e ∩ Γ_n = ∅, Γ_e ∪ Γ_n = ∂Ω. Now consider the following boundary value problem:

−∆u = f in Ω,    (2.83a)
u = 0 on Γ_e,    (2.83b)
∂u/∂n = φ on Γ_n.    (2.83c)

The Dirichlet (= essential) boundary conditions are fulfilled by the choice of the solution space:

H^1_{Γ_e}(Ω) := { u ∈ H^1(Ω) | γ(u) = 0 on Γ_e }.

The Neumann (= natural) boundary conditions will be part of the linear functional used in the variational problem. A similar derivation as for the previous examples results in the following variational problem:

find u ∈ H^1_{Γ_e}(Ω) such that
∫_Ω ∇u · ∇v dx = ∫_Ω f v dx + ∫_{Γ_n} φ v ds for all v ∈ H^1_{Γ_e}(Ω).    (2.84)

One easily verifies that if this problem has a smooth solution u ∈ C^2(Ω̄), then u is a solution of the problem (2.83).

Remark 2.5.11 For the proof of existence and uniqueness we need the following Poincaré-Friedrichs inequality in the space H^1_{Γ_e}(Ω):

∃ C > 0 : ‖u‖_{L^2} ≤ C |u|_1 for all u ∈ H^1_{Γ_e}(Ω).    (2.85)

For a proof of this result we refer to the literature, e.g. [3], Remark 5.16.

Theorem 2.5.12 The variational problem (2.84) with f ∈ L^2(Ω) and φ ∈ L^2(∂Ω) has a unique solution u and the inequality

‖u‖_1 ≤ C ( ‖f‖_{L^2} + ‖φ‖_{L^2(∂Ω)} )

holds with a constant C independent of f and φ.

Proof. For v ∈ H^1_{Γ_e}(Ω) define the linear functional

g : v ↦ ∫_Ω f v dx + ∫_{Γ_n} φ v ds.

The continuity of this linear functional can be shown as in the proof of theorem 2.5.8. Define the bilinear form k(u, v) = ∫_Ω ∇u · ∇v dx. The continuity of k(·, ·) is trivial. From (2.85) it follows that k(u, u) ≥ C ‖u‖_1^2 for all u ∈ H^1_{Γ_e}(Ω), with C > 0. Hence the bilinear form is H^1_{Γ_e}(Ω)-elliptic. Application of the Lax-Milgram lemma completes the proof.


2.5.4 Regularity results

In this section we present a few results from the literature on global smoothness of the solution. First the notion of H^m-regularity is introduced. For ease of presentation we restrict ourselves to elliptic boundary value problems with homogeneous Dirichlet boundary conditions.

Definition 2.5.13 (H^m-regularity.) Let k : H^1_0(Ω) × H^1_0(Ω) → R be a continuous elliptic bilinear form and f ∈ H^{−1}(Ω) = (H^1_0(Ω))′. For the unique solution u ∈ H^1_0(Ω) of the problem

k(u, v) = f(v) for all v ∈ H^1_0(Ω)

the inequality

‖u‖_1 ≤ C ‖f‖_{−1}

holds with a constant C independent of f. This property is called H^1-regularity of the variational problem. If for some m > 1 and f ∈ H^{m−2}(Ω) the unique solution u of

k(u, v) = ∫_Ω f v dx for all v ∈ H^1_0(Ω)

satisfies

‖u‖_m ≤ C ‖f‖_{m−2}

with a constant C independent of f, then the variational problem is said to be H^m-regular.

The result in the next theorem is an analogue of the result in theorem 1.2.7, but now the smoothness is measured using Sobolev norms instead of Hölder norms.

Theorem 2.5.14 ([39], Theorem 8.13) Assume that u ∈ H^1_0(Ω) is a solution of (2.73) (for existence of u: see theorem 2.5.3). For some m ∈ N assume that ∂Ω ∈ C^{m+2} and:

for m = 0 : f ∈ L^2(Ω), a_{ij} ∈ C^{0,1}(Ω̄) ∀ i, j, b_i ∈ L^∞(Ω) ∀ i, c ∈ L^∞(Ω),
for m ≥ 1 : f ∈ H^m(Ω), a_{ij} ∈ C^{m,1}(Ω̄) ∀ i, j, b_i ∈ C^{m−1,1}(Ω̄) ∀ i, c ∈ C^{m−1,1}(Ω̄).

Then u ∈ H^{m+2}(Ω) holds and

‖u‖_{m+2} ≤ C ( ‖u‖_{L^2} + ‖f‖_m )    (2.86)

with a constant C independent of f and u.

Corollary 2.5.15 Assume that the assumptions of theorem 2.5.3 and of theorem 2.5.14 are fulfilled. Then the variational problem (2.73) is H^{m+2}-regular.

Proof. Due to theorem 2.5.3 the problem has a unique solution u and ‖u‖_1 ≤ C ‖f‖_{L^2} holds. Now combine this with the result in (2.86):

‖u‖_{m+2} ≤ C ( ‖u‖_{L^2} + ‖f‖_m ) ≤ C_1 ( ‖f‖_{L^2} + ‖f‖_m ) ≤ 2 C_1 ‖f‖_m

with a constant C_1 independent of f.


Note that in these regularity results there is a severe condition on the smoothness of the boundary. For example, for m = 0, i.e., H^2-regularity, we have the condition ∂Ω ∈ C^2. In practice, this assumption often does not hold. For convex domains one can prove H^2-regularity without assuming such a strong smoothness condition on ∂Ω. The following result is due to [53]:

Theorem 2.5.16 Let Ω be convex. Suppose that the assumptions of theorem 2.5.3 hold and in addition a_{ij} ∈ C^{0,1}(Ω̄) for all i, j. Then the unique solution of (2.73) satisfies

‖u‖_2 ≤ C ‖f‖_{L^2}

with a constant C independent of f, i.e., the variational problem (2.73) is H^2-regular.

We note that very similar regularity results hold for elliptic problems with natural boundary conditions (as in (2.77b)). In problems with mixed boundary conditions, however, one in general has less regularity.

2.5.5 Riesz-Schauder theory

In this section we show that for the variational formulation of the convection-diffusion problem results on existence and uniqueness can be derived that avoid the condition −(1/2) div b + c ≥ 0 as used in theorem 2.5.3. The analysis is based on the so-called Riesz-Schauder theory and uses results on compact embeddings of Sobolev spaces ([47], Thm. 7.2.14).

... in preparation ...

2.6 Weak formulation of the Stokes problem

We recall the classical formulation of the Stokes problem with homogeneous Dirichlet boundary conditions:

−∆u + ∇p = f in Ω,    (2.87a)
div u = 0 in Ω,    (2.87b)
u = 0 on ∂Ω.    (2.87c)

In this section we derive a variational formulation of this problem and prove existence and uniqueness of a weak solution.

From the formulation of the Stokes problem it is clear that the pressure p is determined only up to a constant. In order to eliminate this degree of freedom we introduce the additional requirement

⟨p, 1⟩_{L^2} = ∫_Ω p dx = 0.

Assume that the Stokes problem has a solution u ∈ V := { u ∈ C^2(Ω̄)^n | u = 0 on ∂Ω }, p ∈ M := { p ∈ C^1(Ω̄) | ∫_Ω p dx = 0 }. Then (u, p) also solves the following variational problem: find (u, p) ∈ V × M such that

∫_Ω ∇u · ∇v dx − ∫_Ω p div v dx = ∫_Ω f · v dx ∀ v ∈ C^∞_0(Ω)^n,
∫_Ω q div u dx = 0 ∀ q ∈ M,    (2.88)


with ∫_Ω ∇u · ∇v dx = ⟨∇u, ∇v⟩_{L^2} := ∑_{i=1}^n ⟨∇u_i, ∇v_i⟩_{L^2}. We introduce the bilinear forms and the linear functional

a(u, v) := ∫_Ω ∇u · ∇v dx,    (2.89a)
b(v, q) := −∫_Ω q div v dx,    (2.89b)
f(v) := ∫_Ω f · v dx.    (2.89c)

Note that no derivatives of the pressure occur in (2.88). To obtain a weak formulation in appropriate Hilbert spaces we apply the completion principle. For the velocity we use completion w.r.t. ‖·‖_1 and for the pressure we use the norm ‖·‖_{L^2}: the completion of V (equivalently, of C^∞_0(Ω)^n) w.r.t. ‖·‖_1 is H^1_0(Ω)^n, and the completion of M w.r.t. ‖·‖_{L^2} is L^2_0(Ω) := { p ∈ L^2(Ω) | ∫_Ω p dx = 0 }.

This results in the following weak formulation of the Stokes problem, with V := H^1_0(Ω)^n, M := L^2_0(Ω):

Find (u, p) ∈ V × M such that

a(u, v) + b(v, p) = f(v) for all v ∈ V,    (2.90a)
b(u, q) = 0 for all q ∈ M.    (2.90b)

Lemma 2.6.1 Suppose that u ∈ C^2(Ω̄)^n and p ∈ C^1(Ω̄) satisfy (2.90). Then (u, p) is a solution of (2.87).

Proof. Using partial integration it follows that

∫_Ω ( −∆u + ∇p − f ) · v dx = 0 for all v ∈ C^∞_0(Ω)^n

and thus −∆u + ∇p = f in Ω. Note that by Green's formula we have ∫_Ω div u dx = 0 for u ∈ H^1_0(Ω)^n. Hence in (2.90b) we can take q = div u, which yields ∫_Ω (div u)^2 dx = 0 and thus div u = 0 in Ω.

To show the well-posedness of the variational formulation of the Stokes problem we apply corollary 2.3.12. For this we need the following inf-sup condition, which will be proved in section 2.6.1:

∃ β > 0 : sup_{v ∈ H^1_0(Ω)^n} ( ∫_Ω q div v dx ) / ‖v‖_1 ≥ β ‖q‖_{L^2} ∀ q ∈ L^2_0(Ω).    (2.91)

Using this property we obtain a fundamental result on well-posedness of the variational Stokes problem:


Theorem 2.6.2 For every f ∈ L^2(Ω)^n the Stokes problem (2.90) has a unique solution (u, p) ∈ V × M. Moreover, the inequality

‖u‖_1 + ‖p‖_{L^2} ≤ C ‖f‖_{L^2}

holds with a constant C independent of f.

Proof. We can apply corollary 2.3.12 with V, M, a(·, ·), b(·, ·) as defined above and f_1(v) = ∫_Ω f · v dx, f_2 = 0. The continuity of b(·, ·) on V × M follows from

|b(v, q)| = |∫_Ω q div v dx| ≤ ‖q‖_{L^2} ‖div v‖_{L^2} ≤ √n ‖q‖_{L^2} ‖v‖_1 for (v, q) ∈ V × M.

The inf-sup condition is given in (2.91). Note that the minus sign in (2.89b) does not play a role for the inf-sup condition in (2.91). The continuity of a(·, ·) on V × V is clear. The V-ellipticity follows from the Poincaré-Friedrichs inequality:

a(u, u) = ∑_{i=1}^n |u_i|_1^2 ≥ c ∑_{i=1}^n ‖u_i‖_1^2 = c ‖u‖_1^2 for all u ∈ V.

Application of corollary 2.3.12 yields the desired result.

Remark 2.6.3 Note that the bilinear form a(·, ·) is symmetric and thus we can apply theorem 2.4.2. This shows that the variational formulation of the Stokes problem is equivalent to a saddle-point problem.

2.6.1 Proof of the inf-sup property

In this section we derive the fundamental inf-sup property in (2.91). First we note that a function f ∈ L^2(Ω) induces a bounded linear functional on H^1_0(Ω), given by

u ↦ ∫_Ω f(x) u(x) dx, u ∈ H^1_0(Ω),   ‖f‖_{−1} = sup_{u ∈ H^1_0(Ω)} |⟨f, u⟩_{L^2}| / ‖u‖_1.

We now define the first (partial) derivative of an L^2-function (in the sense of distributions). For f ∈ L^2(Ω) the mapping

F : u ↦ −∫_Ω f(x) D^α u(x) dx, |α| = 1, u ∈ H^1_0(Ω),

defines a bounded linear functional on H^1_0(Ω). This functional is denoted by F =: D^α f ∈ H^{−1}(Ω) = (H^1_0(Ω))′. Its norm is defined by

‖D^α f‖_{−1} := sup_{u ∈ H^1_0(Ω)} |(D^α f)(u)| / ‖u‖_1 = sup_{u ∈ H^1_0(Ω)} |⟨f, D^α u⟩_{L^2}| / ‖u‖_1.


Based on these partial derivatives we define

∇f = ( ∂f/∂x_1, …, ∂f/∂x_n ),
‖∇f‖_{−1}^2 = ∑_{i=1}^n ‖∂f/∂x_i‖_{−1}^2 = ∑_{|α|=1} ‖D^α f‖_{−1}^2.

In the next theorem we present a rather deep result from analysis. Its proof (for the case ∂Ω ∈ C^{0,1}) is long and technical.

Theorem 2.6.4 There exists a constant C such that for all p ∈ L^2(Ω):

‖p‖_{L^2} ≤ C ( ‖p‖_{−1} + ‖∇p‖_{−1} ).

Proof. We refer to [65], lemma 7.1, or [31], remark III.3.1.

Remark 2.6.5 From the definitions of ‖p‖_{−1} and ‖∇p‖_{−1} it immediately follows that ‖p‖_{−1} ≤ ‖p‖_{L^2} and ‖∇p‖_{−1} ≤ ‖p‖_{L^2} for all p ∈ L^2(Ω). Hence, using theorem 2.6.4, it follows that ‖·‖_{L^2} and ‖·‖_{−1} + ‖∇·‖_{−1} are equivalent norms on L^2(Ω). This can be seen as a (nontrivial) extension of the (trivial) result that on H^m(Ω), m ≥ 1, the norms ‖·‖_m and ‖·‖_{m−1} + ‖∇·‖_{m−1} are equivalent.

From the result in theorem 2.6.4 we obtain the following:

Lemma 2.6.6 There exists a constant C such that

‖p‖_{L^2} ≤ C ‖∇p‖_{−1} for all p ∈ L^2_0(Ω).

Proof. Suppose that this result does not hold. Then there exists a sequence (p_k)_{k≥1} in L^2_0(Ω) such that

1 = ‖p_k‖_{L^2} ≥ k ‖∇p_k‖_{−1} for all k.    (2.92)

From the fact that the continuous embedding H^1_0(Ω) → L^2(Ω) is compact, it follows that L^2(Ω) = L^2(Ω)′ → (H^1_0(Ω))′ = H^{−1}(Ω) is a compact embedding. Hence there exists a subsequence (p_{k(ℓ)})_{ℓ≥1} that is a Cauchy sequence in H^{−1}(Ω). From (2.92) and theorem 2.6.4 it follows that (p_{k(ℓ)})_{ℓ≥1} is a Cauchy sequence in L^2(Ω) and thus there exists p ∈ L^2(Ω) such that

lim_{ℓ→∞} p_{k(ℓ)} = p in L^2(Ω).    (2.93)

From (2.92) we get lim_{ℓ→∞} ‖∇p_{k(ℓ)}‖_{−1} = 0 and thus

lim_{ℓ→∞} (∂p_{k(ℓ)}/∂x_i)(φ) = 0 for all φ ∈ C^∞_0(Ω) and all i = 1, …, n.

In combination with (2.93) this yields

0 = lim_{ℓ→∞} (∂p_{k(ℓ)}/∂x_i)(φ) = −lim_{ℓ→∞} ⟨p_{k(ℓ)}, ∂φ/∂x_i⟩_{L^2} = −⟨p, ∂φ/∂x_i⟩_{L^2} for i = 1, …, n.


Hence p ∈ H^1(Ω) and ∇p = 0. It follows that p is equal to a constant (a.e.), say p = c. From (2.93) and p_{k(ℓ)} ∈ L^2_0(Ω) it follows that ∫_Ω p dx = ∫_Ω c dx = 0 and thus c = 0. This results in a contradiction:

1 = lim_{ℓ→∞} ‖p_{k(ℓ)}‖_{L^2} = ‖p‖_{L^2} = ‖c‖_{L^2} = 0,

and thus the proof is complete.

Theorem 2.6.7 The inf-sup property (2.91) holds.

Proof. From lemma 2.6.6 it follows that there exists c > 0 such that ‖∇q‖_{−1} ≥ c ‖q‖_{L^2} for all q ∈ L^2_0(Ω). Hence, for a suitable k with 1 ≤ k ≤ n we have

‖∂q/∂x_k‖_{−1} ≥ (c/√n) ‖q‖_{L^2} for all q ∈ L^2_0(Ω).

Thus there exists v ∈ H^1_0(Ω) with ‖v‖_1 = 1 and

|(∂q/∂x_k)(v)| = |∫_Ω q (∂v/∂x_k) dx| ≥ (1/2)(c/√n) ‖q‖_{L^2} for all q ∈ L^2_0(Ω).

For v = (v_1, …, v_n) ∈ H^1_0(Ω)^n defined by v_k = v, v_i = 0 for i ≠ k we have

sup_{w ∈ H^1_0(Ω)^n} ( ∫_Ω q div w dx ) / ‖w‖_1 = sup_{w ∈ H^1_0(Ω)^n} |∫_Ω q div w dx| / ‖w‖_1 ≥ |∫_Ω q div v dx| / ‖v‖_1 = |∫_Ω q (∂v/∂x_k) dx| / ‖v‖_1 ≥ (1/2)(c/√n) ‖q‖_{L^2}

for all q ∈ L^2_0(Ω). This completes the proof.

2.6.2 Regularity of the Stokes problem

We present two results from the literature concerning regularity of the Stokes problem. The first result is proved in [26, 56]:

Theorem 2.6.8 Let (u, p) ∈ H^1_0(Ω)^n × L^2_0(Ω) be the solution of the Stokes problem (2.90). For m ≥ 0 assume that f ∈ H^m_0(Ω)^n and ∂Ω ∈ C^{m+2}. Then u ∈ H^{m+2}(Ω)^n, p ∈ H^{m+1}(Ω) and the inequality

‖u‖_{m+2} + ‖p‖_{m+1} ≤ C ‖f‖_m    (2.94)

holds, with a constant C independent of f.

If the property in (2.94) holds then the Stokes problem is said to be H^{m+2}-regular. Note that even for H^2-regularity (i.e., m = 0) one needs the assumption ∂Ω ∈ C^2, which in practice is often not fulfilled. For convex domains this assumption can be avoided (as in theorem 2.5.16). The following result is presented in [54] (only n = 2) and in [30] (for n ≥ 2):

Theorem 2.6.9 Let (u, p) ∈ H^1_0(Ω)^n × L^2_0(Ω) be the solution of the Stokes problem (2.90). Suppose that Ω is convex. Then u ∈ H^2(Ω)^n, p ∈ H^1(Ω) and the inequality

‖u‖_2 + ‖p‖_1 ≤ C ‖f‖_{L^2}

holds, with a constant C independent of f.


2.6.3 Other boundary conditions

For a Stokes problem with nonhomogeneous Dirichlet boundary conditions, say u = g on ∂Ω, a compatibility condition is needed:

∫_{∂Ω} g · n ds = 0.    (2.95)

... other boundary conditions ...

... in preparation ...


Chapter 3

Galerkin discretization and finite element method

3.1 Galerkin discretization

We consider a variational problem as in theorem 2.3.1, i.e., for f ∈ H ′2 the variational problemis given by:

find u ∈ H1 such that k(u,v) = f(v) for all v ∈ H2. (3.1)

We assume that the bilinear form k(·, ·) is continuous:

∃ M : k(u,v) ≤ M ‖u‖H1 ‖v‖H2 for all u ∈ H1, v ∈ H2 (3.2)

and that the conditions (2.36) and (2.37) from theorem 2.3.1 hold:

∃ ε > 0 : sup_{v∈H2} k(u,v)/‖v‖H2 ≥ ε ‖u‖H1 for all u ∈ H1, (3.3)

∀ v ∈ H2, v ≠ 0, ∃ u ∈ H1 : k(u,v) ≠ 0. (3.4)

From theorem 2.3.1 we know that for a continuous bilinear form the conditions (3.3) and (3.4) are necessary and sufficient for well-posedness of the variational problem in (3.1). The Galerkin discretization of the problem (3.1) is based on the following simple idea. We assume finite dimensional subspaces H1,h ⊂ H1, H2,h ⊂ H2 (note: in concrete cases the index h will correspond to some mesh size parameter) and consider the finite dimensional variational problem

find uh ∈ H1,h such that k(uh,vh) = f(vh) for all vh ∈ H2,h. (3.5)

This problem is called the Galerkin discretization of (3.1) (in H1,h × H2,h). We now discuss the well-posedness of this Galerkin discretization. First note that the continuity of k : H1,h × H2,h → R follows from (3.2). From theorem 2.3.1 it follows that we need the conditions (3.3) and (3.4) with Hi replaced by Hi,h, i = 1, 2. However, because Hi,h is finite dimensional we only need (3.3), since this implies (3.4) (see remark 2.3.2). Thus we formulate the following (discrete) inf-sup condition in the space H1,h × H2,h:

∃ εh > 0 : sup_{vh∈H2,h} k(uh,vh)/‖vh‖H2 ≥ εh ‖uh‖H1 for all uh ∈ H1,h. (3.6)


We now prove two fundamental results:

Theorem 3.1.1 (Cea lemma) Let (3.2), (3.3), (3.4), (3.6) hold. Then the variational problem (3.1) and its Galerkin discretization (3.5) have unique solutions u and uh, respectively. Furthermore, the inequality

‖u − uh‖H1 ≤ (1 + M/εh) inf_{vh∈H1,h} ‖u − vh‖H1 (3.7)

holds.

Proof. The result on existence and uniqueness follows from theorem 2.3.1 and the fact that in the finite dimensional case (3.3) implies (3.4). From (3.1) and (3.5) it follows that

k(u − uh,vh) = 0 for all vh ∈ H2,h. (3.8)

For arbitrary vh ∈ H1,h we have, due to (3.6), (3.8), (3.2):

‖vh − uh‖H1 ≤ (1/εh) sup_{wh∈H2,h} k(vh − uh, wh)/‖wh‖H2 = (1/εh) sup_{wh∈H2,h} k(vh − u, wh)/‖wh‖H2 ≤ (M/εh) ‖vh − u‖H1.

From this and the triangle inequality

‖u − uh‖H1 ≤ ‖u − vh‖H1 + ‖vh − uh‖H1 for all vh ∈ H1,h

the result follows.

The result in this theorem simplifies if we consider the important special case H1 = H2 =: H, H1,h = H2,h =: Hh and assume that the bilinear form k(·, ·) is elliptic on H.

Corollary 3.1.2 Consider the case H1 = H2 =: H and H1,h = H2,h =: Hh. Assume that (3.2) holds and that the bilinear form k(·, ·) is H-elliptic with ellipticity constant γ. Then the variational problem (3.1) and its Galerkin discretization (3.5) have unique solutions u and uh, respectively. Furthermore, the inequality

‖u − uh‖H ≤ (M/γ) inf_{vh∈Hh} ‖u − vh‖H (3.9)

holds.

Proof. Because k(·, ·) is H-elliptic the conditions (3.3) (with ε = γ), (3.4) and (3.6) (with εh = γ) are satisfied. From theorem 3.1.1 we conclude that unique solutions u and uh exist. Using k(u − uh, vh) = 0 for all vh ∈ Hh and the ellipticity and continuity we get for arbitrary vh ∈ Hh:

‖u − uh‖²H ≤ (1/γ) k(u − uh, u − uh) = (1/γ) k(u − uh, u − vh) ≤ (M/γ) ‖u − uh‖H ‖u − vh‖H.


Hence the inequality in (3.9) holds.

In chapter 4 and chapter 5 we will use theorem 3.1.1 in the discretization error analysis. In the remainder of this chapter we only consider cases with H1 = H2 =: H, H1,h = H2,h =: Hh and H-elliptic bilinear forms, such that corollary 3.1.2 can be applied.

An improvement of the bound in (3.9) can be obtained if k(·, ·) is symmetric:

Corollary 3.1.3 Assume that the conditions as in corollary 3.1.2 are satisfied. If in addition the bilinear form k(·, ·) is symmetric, the inequality

‖u − uh‖H ≤ √(M/γ) inf_{vh∈Hh} ‖u − vh‖H (3.10)

holds.

Proof. Introduce the norm |||v||| := k(v,v)^{1/2} on H. Note that

√γ ‖v‖H ≤ |||v||| ≤ √M ‖v‖H for all v ∈ H.

The space (H, ||| · |||) is a Hilbert space and, due to |||v|||² = k(v,v) and k(u,v) ≤ |||u||| |||v|||, the bilinear form has ellipticity constant and continuity constant w.r.t. the norm ||| · ||| both equal to 1. Application of corollary 3.1.2 in the space (H, ||| · |||) yields

|||u − uh||| ≤ inf_{vh∈Hh} |||u − vh|||

and thus we obtain

‖u − uh‖H ≤ (1/√γ) |||u − uh||| ≤ (1/√γ) inf_{vh∈Hh} |||u − vh||| ≤ √(M/γ) inf_{vh∈Hh} ‖u − vh‖H,

which completes the proof.
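Corollary 3.1.3 rests on the fact that for a symmetric elliptic bilinear form the Galerkin solution is the best approximation of u in the energy norm ||| · |||. This can be illustrated numerically on a small, randomly generated symmetric problem; the matrices below are hypothetical stand-ins for the Gram matrix of k(·, ·) in a basis of a 5-dimensional space H and for a basis of a 2-dimensional subspace Hh:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: K = Gram matrix of a symmetric H-elliptic form k(.,.),
# f = right-hand side functional, columns of V span the subspace Hh.
A = rng.standard_normal((5, 5))
K = A @ A.T + 5.0 * np.eye(5)        # symmetric positive definite
f = rng.standard_normal(5)
V = rng.standard_normal((5, 2))

u = np.linalg.solve(K, f)            # "continuous" solution
c = np.linalg.solve(V.T @ K @ V, V.T @ f)
uh = V @ c                           # Galerkin solution in Hh

def energy(w):                       # |||w||| = k(w, w)^(1/2)
    return np.sqrt(w @ K @ w)

# Galerkin orthogonality: k(u - uh, vh) = 0 for all vh in Hh
ortho = np.linalg.norm(V.T @ K @ (u - uh))

# best-approximation property: perturbing uh within Hh never decreases the
# energy-norm error |||u - vh|||
err = energy(u - uh)
worse = min(energy(u - V @ (c + 0.1 * rng.standard_normal(2)))
            for _ in range(100))
```

Any random perturbation of the Galerkin coefficients yields an energy error at least as large as `err`, which is exactly the statement behind (3.10).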

Assume H1 = H2 = H and H1,h = H2,h = Hh. For the actual computation of the solution uh of the Galerkin discretization we need a basis of the space Hh. Let {φi}1≤i≤N be a basis of Hh, i.e., every vh ∈ Hh has a unique representation

vh = Σ_{j=1}^{N} vj φj with v := (v1, . . . , vN)T ∈ RN.

The Galerkin discretization can be reformulated as:

find v ∈ RN such that Σ_{j=1}^{N} k(φj, φi) vj = f(φi) ∀ i = 1, . . . , N. (3.11)

This yields the linear system of equations

Kv = b , with Kij = k(φj ,φi), bi = f(φi), 1 ≤ i, j ≤ N. (3.12)
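For a concrete illustration of (3.11)-(3.12), consider the weak form of −u″ = f on (0, 1) with u(0) = u(1) = 0, i.e., k(u, v) = ∫ u′v′ dx and f(v) = ∫ f v dx, discretized with a hand-picked (hypothetical) two-dimensional subspace spanned by φ1 = x(1 − x) and φ2 = x²(1 − x); the entries Kij and bi are evaluated by Gauss quadrature:

```python
import numpy as np

# Gauss-Legendre quadrature on (0,1), exact for the polynomial integrands below
x, w = np.polynomial.legendre.leggauss(8)
x = 0.5 * (x + 1.0)
w = 0.5 * w

phi  = [x * (1 - x), x**2 * (1 - x)]          # basis functions at quad points
dphi = [1 - 2 * x, 2 * x - 3 * x**2]          # their derivatives
f = lambda t: 2.0 * np.ones_like(t)           # then the exact solution is u = x(1-x)

N = 2
K = np.array([[np.sum(w * dphi[i] * dphi[j]) for j in range(N)]
              for i in range(N)])             # K_ij = k(phi_j, phi_i), cf. (3.12)
b = np.array([np.sum(w * f(x) * phi[i]) for i in range(N)])
v = np.linalg.solve(K, b)                     # coefficient vector of (3.12)

uh = lambda t: v[0] * t * (1 - t) + v[1] * t**2 * (1 - t)
```

Since the exact solution x(1 − x) lies in the subspace, the Galerkin solution reproduces it: v ≈ (1, 0), so uh(0.3) ≈ 0.21.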


In the remainder of this chapter we discuss concrete choices for the space Hh, namely the so-called finite element spaces. These spaces turn out to be very suitable for the Galerkin discretization of scalar elliptic boundary value problems. Finite element spaces can also be used for the Galerkin discretization of the Stokes problem; this topic is treated in chapter 5. Once a space Hh is known one can investigate approximation properties of this space and derive bounds for inf_{vh∈Hh} ‖u − vh‖H (with u the weak solution of the elliptic boundary value problem), cf. section 3.3. Due to the Cea lemma we then have a bound for the discretization error ‖u − uh‖H (see section 3.4). In Part III of this book (iterative methods) we discuss techniques that can be used for solving the linear system in (3.12).

3.2 Examples of finite element spaces

In this section we introduce finite element spaces that are appropriate for the Galerkin discretization of elliptic boundary value problems. We only present the main principles. An extensive treatment of finite element techniques can be found in, for example, [27], [28], [21].

To simplify the presentation we only consider finite element methods for elliptic boundary value problems in Rn with n ≤ 3. Starting point for the finite element approach is a subdivision of the domain Ω into a finite number of subsets T. Such a subdivision is called a triangulation and is denoted by Th = {T}. For the subsets T we only allow:

T is an n-simplex (i.e., interval, triangle, tetrahedron), or T is an n-rectangle. (3.13)

Furthermore, the triangulation Th = {T} should be such that

Ω̄ = ∪_{T∈Th} T, (3.14a)

int T1 ∩ int T2 = ∅ for all T1, T2 ∈ Th, T1 ≠ T2, (3.14b)

any edge [face] of any T1 ∈ Th is either a subset of ∂Ω or an edge [face] of another T2 ∈ Th. (3.14c)

Definition 3.2.1 A triangulation that satisfies (3.13) and (3.14) is called admissible.

Note that a triangulation can be admissible only if the domain Ω is polygonal (i.e., ∂Ω consists of lines and/or planes). If the domain is not polygonal we can approximate it by a polygonal domain Ωh and construct an admissible triangulation of Ωh (see ...) or use isoparametric finite elements (section 3.6).

We consider a family of admissible triangulations denoted by {Th}. Let hT := diam(T) for T ∈ Th. The index parameter h of Th is taken such that

h = max{hT | T ∈ Th}.

Furthermore, for T ∈ Th we define

ρT := sup{diam(B) | B is a ball contained in T}, σT := hT/ρT ∈ [1, ∞).


Definition 3.2.2 A family of admissible triangulations {Th} is called regular if

1. The parameter h approaches zero: inf{h | Th ∈ {Th}} = 0,

2. ∃ σ : σT = hT/ρT ≤ σ for all T ∈ Th and all Th ∈ {Th}.

A family of admissible triangulations {Th} is called quasi-uniform if

∃ σ : h/ρT ≤ σ for all T ∈ Th and all Th ∈ {Th}.
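For triangles (n = 2) the quantities hT, ρT and σT are directly computable: hT is the longest edge and ρT the diameter of the inscribed circle. A small stdlib-only sketch (the sample vertices are assumptions of the sketch):

```python
import math

def shape_regularity(a, b, c):
    """sigma_T = h_T / rho_T for the triangle with vertices a, b, c, where
    h_T = diam(T) (longest edge) and rho_T = diameter of the inscribed ball."""
    d = math.dist
    e1, e2, e3 = d(a, b), d(b, c), d(c, a)
    s = 0.5 * (e1 + e2 + e3)                               # semi-perimeter
    area = math.sqrt(s * (s - e1) * (s - e2) * (s - e3))   # Heron's formula
    rho = 2.0 * area / s                                   # incircle diameter
    return max(e1, e2, e3) / rho

sigma = shape_regularity((0, 0), (1, 0), (0, 1))           # right isosceles: 1 + sqrt(2)
sigma_thin = shape_regularity((0, 0), (1, 0), (0.5, 0.01)) # nearly degenerate: large sigma_T
```

A regular family keeps σT uniformly bounded, which rules out triangles like the second one as h → 0.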

3.2.1 Simplicial finite elements

We now introduce a very important class of finite element spaces. Let {Th} be a family of admissible triangulations of Ω consisting only of n-simplices. The space of polynomials in Rn of degree less than or equal to k is denoted by Pk, i.e., p ∈ Pk is of the form

p(x) = Σ_{|α|≤k} γα x1^{α1} x2^{α2} · · · xn^{αn}, γα ∈ R.

The dimension of Pk is

dim Pk = (n + k)! / (n! k!).

The spaces of simplicial finite elements are given by

X0h := { v ∈ L2(Ω) | v|T ∈ P0 for all T ∈ Th }, (3.15a)

Xkh := { v ∈ C(Ω̄) | v|T ∈ Pk for all T ∈ Th }, k ≥ 1. (3.15b)

Thus these spaces consist of piecewise polynomials which, for k ≥ 1, are continuous on Ω.

Remark 3.2.3 From theorem 2.2.12 it follows that Xkh ⊂ H1(Ω) for all k ≥ 1.

We will also need simplicial finite element spaces with functions that are zero on ∂Ω:

Xkh,0 := Xkh ∩ H10(Ω), k ≥ 1. (3.16)

3.2.2 Rectangular finite elements

Let {Th} be a family of admissible triangulations consisting only of n-rectangles. The space of polynomials in Rn of degree less than or equal to k with respect to each of the variables is denoted by Qk, i.e., p ∈ Qk is of the form

p(x) = Σ_{0≤αi≤k} γα x1^{α1} x2^{α2} · · · xn^{αn}, γα ∈ R.

The dimension of Qk is

dim Qk = (k + 1)^n.

The spaces of rectangular finite elements are given by

Q0h := { v ∈ L2(Ω) | v|T ∈ Q0 for all T ∈ Th }, (3.17a)

Qkh := { v ∈ C(Ω̄) | v|T ∈ Qk for all T ∈ Th }, k ≥ 1, (3.17b)

Qkh,0 := Qkh ∩ H10(Ω), k ≥ 1. (3.17c)
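The two dimension formulas can be checked directly; note that Pk ⊂ Qk, consistent with dim Pk ≤ dim Qk:

```python
from math import comb

def dim_Pk(n, k):
    # dim P_k = (n+k)! / (n! k!): polynomials of total degree <= k in n variables
    return comb(n + k, k)

def dim_Qk(n, k):
    # dim Q_k = (k+1)^n: degree <= k in each of the n variables separately
    return (k + 1) ** n
```

For example, quadratics on a triangle (n = 2, k = 2) have 6 degrees of freedom per element, while biquadratics on a rectangle have 9.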


3.3 Approximation properties of finite element spaces

In this section, for u ∈ H2(Ω), we derive bounds for the approximation error inf_{vh∈Hh} ‖u − vh‖1 with Hh = Xkh or Hh = Qkh (note that Hh depends on the parameter k). The main idea of the analysis is as follows. First we will introduce an interpolation operator Ikh : C(Ω̄) → Hh. Recall that we assumed n ≤ 3. The Sobolev embedding theorem 2.2.14 yields

Hm(Ω) ↪ C(Ω̄) for m ≥ 2

and thus the interpolation operator is well-defined for u ∈ Hm(Ω), m ≥ 2. We will prove interpolation error bounds of the form (cf. theorem 3.3.9)

‖u − Ikh u‖t ≤ c h^{m−t} |u|m for 2 ≤ m ≤ k + 1, t = 0, 1.

This implies (corollary 3.3.10)

inf_{vh∈Hh} ‖u − vh‖t ≤ c h^{m−t} |u|m for 2 ≤ m ≤ k + 1, t = 0, 1.

We first introduce the interpolation operators IkX : C(Ω̄) → Xkh and IkQ : C(Ω̄) → Qkh. Then we formulate some useful results that will be applied to prove the main result in theorem 3.3.9. We start with the definition of an interpolation operator IkX : C(Ω̄) → Xkh. For the description of this operator the so-called barycentric coordinates are useful:

Definition 3.3.1 Let T be a nondegenerate n-simplex and aj ∈ Rn, j = 1, . . . , n+1, its vertices. Then T can be described by

T = { Σ_{j=1}^{n+1} λj aj | 0 ≤ λj ≤ 1 ∀ j, Σ_{j=1}^{n+1} λj = 1 }. (3.18)

To every x ∈ T there corresponds a unique (n+1)-tuple (λ1, . . . , λn+1) as in (3.18). These λj, 1 ≤ j ≤ n + 1, are called the barycentric coordinates of x ∈ T. The mapping x → (λ1, . . . , λn+1) is affine.

Using these barycentric coordinates we define the set

Lk(T) := { Σ_{j=1}^{n+1} λj aj | λj ∈ {0, 1/k, . . . , (k−1)/k, 1} ∀ j, Σ_{j=1}^{n+1} λj = 1 },

which is called the principal lattice of order k (in T). Examples for n = 2 and n = 3 are given in figure ...

This principal lattice can be used to determine a unique polynomial p ∈ Pk:

Lemma 3.3.2 Let T be a nondegenerate n-simplex. Then any polynomial p ∈ Pk is uniquely determined by its values on the principal lattice Lk(T).

Proof. For example, in [67].
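The principal lattice is easy to enumerate: run over all multi-indices (m1, . . . , mn+1) of nonnegative integers summing to k and set λj = mj/k. The number of lattice points then equals dim Pk = (n+k)!/(n! k!), which is the counting behind lemma 3.3.2. A sketch for a sample reference triangle (the vertex coordinates are an assumption of the sketch):

```python
from itertools import product

def principal_lattice(vertices, k):
    """Points of L_k(T): barycentric combinations with lambda_j in {0, 1/k, ..., 1}
    summing to 1, for an n-simplex given by its n+1 vertices."""
    n1 = len(vertices)                    # n + 1 vertices
    dim = len(vertices[0])
    pts = []
    for m in product(range(k + 1), repeat=n1):
        if sum(m) != k:
            continue
        lam = [mj / k for mj in m]        # barycentric coordinates
        pts.append(tuple(sum(l * v[i] for l, v in zip(lam, vertices))
                         for i in range(dim)))
    return pts

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
pts = principal_lattice(tri, 3)           # 10 points = dim P_3 in two variables
```

The 10 distinct points match dim P3 = 5!/(2!·3!) = 10 for n = 2, so prescribing values there determines a cubic polynomial uniquely.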

Let Th = {T} be an admissible triangulation of Ω consisting only of n-simplices. For u ∈ C(Ω̄) we define a corresponding function IkXu ∈ L2(Ω) by piecewise polynomial interpolation on each simplex T ∈ Th:

∀ T ∈ Th : (IkXu)|T ∈ Pk such that (IkXu)(xj) = u(xj) ∀ xj ∈ Lk(T). (3.19)

The piecewise polynomial function IkXu is continuous on Ω̄:

Lemma 3.3.3 For k ≥ 1 and u ∈ C(Ω̄) we have IkXu ∈ Xkh.

Proof. By definition we have that (IkXu)|T ∈ Pk. Thus we only have to show that IkXu is continuous across interfaces between adjacent n-simplices T1, T2 ∈ Th. For n = 1 this is trivial, since the endpoints a1, a2 of a 1-simplex [a1, a2] are used as interpolation points. We now consider n = 2. Define Γ := T1 ∩ T2 and pi := (IkXu)|Ti, i = 1, 2. Note that k + 1 points of the principal lattice lie on the face Γ:

Lk(T1) ∩ Γ = Lk(T2) ∩ Γ =: {x1, . . . , xk+1} with xi ≠ xj for i ≠ j.

Since these xj are interpolation points we have that p1(xj) = p2(xj) = u(xj) for j = 1, . . . , k + 1. The functions (pi)|Γ are one-dimensional polynomials of degree k. We conclude that (p1)|Γ = (p2)|Γ holds, and thus IkXu is continuous across the interface Γ. The case n = 3 (or even n ≥ 3) can be treated similarly.

For the space of rectangular finite elements, Qkh, an interpolation operator IkQ : C(Ω̄) → Qkh can be defined in a very similar way. For this we introduce a uniform grid on a rectangle in Rn. For a given interval [a, b] a uniform grid with mesh size (b − a)/k is given by

Gk[a,b] := { a + j(b − a)/k | 0 ≤ j ≤ k }.

On an n-rectangle T = ∏_{i=1}^{n} [ai, bi] we define a uniform lattice by

Lk(T) := ∏_{i=1}^{n} Gk[ai,bi].

Using a tensor product argument it follows that any polynomial p ∈ Qk, k ≥ 1, is uniquely determined by its values on the set Lk(T). Let Th = {T} be an admissible triangulation of Ω consisting only of n-rectangles. For u ∈ C(Ω̄) we define a corresponding function IkQu ∈ L2(Ω) by piecewise polynomial interpolation on each n-rectangle T ∈ Th:

∀ T ∈ Th : (IkQu)|T ∈ Qk such that (IkQu)(xj) = u(xj) ∀ xj ∈ Lk(T). (3.20)

With similar arguments as used in the proof of lemma 3.3.3 one can show the following:

Lemma 3.3.4 For k ≥ 1 and u ∈ C(Ω̄) we have IkQu ∈ Qkh.

For the analysis of the interpolation error we begin with two elementary lemmas.

Lemma 3.3.5 Let T̂, T ⊂ Rn be two sets as in (3.13) and F(x̂) = Bx̂ + c an affine mapping such that F(T̂) = T. Then the following inequalities hold:

‖B‖2 ≤ hT/ρT̂ , ‖B−1‖2 ≤ hT̂/ρT.


Proof. We will prove the first inequality. The second one then follows from the first one by using F−1(T) = T̂ with F−1(x) = B−1x − B−1c. Note that

‖B‖2 = (1/ρT̂) max{ ‖Bx̂‖2 | x̂ ∈ Rn, ‖x̂‖2 = ρT̂ }. (3.21)

Let B(â; ρT̂) be a ball with centre â and diameter ρT̂ that is contained in T̂. Take x̂ ∈ Rn with ‖x̂‖2 = ρT̂. For ŷ1 = â + ½x̂ ∈ T̂ and ŷ2 = â − ½x̂ ∈ T̂ we have

x̂ = ŷ1 − ŷ2, F(ŷi) ∈ T, i = 1, 2,

and thus

‖Bx̂‖2 = ‖B(ŷ1 − ŷ2)‖2 = ‖F(ŷ1) − F(ŷ2)‖2 ≤ hT. (3.22)

From (3.21) and (3.22) we obtain ‖B‖2 ≤ hT/ρT̂.
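The bound ‖B‖2 ≤ hT/ρT̂ can be checked numerically for a concrete affine map from the unit 2-simplex T̂ to a sample target triangle T (the target vertices are an assumption of the sketch):

```python
import numpy as np

That = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # unit 2-simplex
T = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])      # sample target triangle

# F(xhat) = B xhat + c with F(That) = T: the columns of B are the images
# of the unit basis vectors, c is the image of the origin.
B = np.column_stack([T[1] - T[0], T[2] - T[0]])
c = T[0]

norm_B = np.linalg.norm(B, 2)                            # largest singular value
h_T = max(np.linalg.norm(p - q) for p in T for q in T)   # diam(T), longest edge
s_hat = 0.5 * (1.0 + 1.0 + np.sqrt(2.0))                 # semi-perimeter of That
rho_hat = 2.0 * 0.5 / s_hat                              # incircle diameter of That
```

Here B = diag(2, 1), so ‖B‖2 = 2, while hT/ρT̂ ≈ 3.82, in agreement with the lemma.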

Lemma 3.3.6 Let K and K̂ be Lipschitz domains in Rn that are affine equivalent:

F(K̂) = K with F(x̂) = Bx̂ + c, det B ≠ 0.

For m ≥ 0, v ∈ Hm(K) define v̂ := v ∘ F : K̂ → R. Then v̂ ∈ Hm(K̂) and there exists a constant C such that

|v̂|m,K̂ ≤ C ‖B‖2^m |det B|^{−1/2} |v|m,K for all v ∈ Hm(K), (3.23a)

|v|m,K ≤ C ‖B−1‖2^m |det B|^{1/2} |v̂|m,K̂ for all v ∈ Hm(K). (3.23b)

Proof. Since C∞(K) is dense in Hm(K) it suffices to prove (3.23a) for v ∈ C∞(K). For m = 0 this result follows from

|v̂|²0,K̂ = ∫_K̂ v̂(x̂)² dx̂ = ∫_K v(x)² |det B|^{−1} dx = |det B|^{−1} |v|²0,K.

For the case m ≥ 1 we need some basic results on Fréchet derivatives. For v ∈ C∞(K) the Fréchet derivative Dmv(x) : Rn × . . . × Rn → R is an m-linear form. Let ej be the j-th basis vector in Rn. For |α| = m and for suitable i1, . . . , im ∈ {1, . . . , n} we have

Dαv(x) = ∂^{|α|}v(x) / (∂x1^{α1} . . . ∂xn^{αn}) = ∂^{|α|}v(x) / (∂xi1 . . . ∂xim) = Dmv(x)(ei1, . . . , eim) (3.24)

(note the subtle difference in notation between Dα and Dm). Let E be an m-linear form on Rn. Then both

‖E‖2 := max_{yi∈Rn} |E(y1, . . . , ym)| / (‖y1‖2 . . . ‖ym‖2) and ‖E‖∗ := max_{1≤ij≤n} |E(ei1, . . . , eim)|

define norms on the space of m-linear forms on Rn. Using the norm equivalence property it follows that there exists a constant c, independent of E, such that

‖E‖∗ ≤ ‖E‖2 ≤ c ‖E‖∗.


If we take E = Dmv(x) and use (3.24) we get

max_{|α|=m} |Dαv(x)| ≤ ‖Dmv(x)‖2 ≤ c max_{|α|=m} |Dαv(x)|. (3.25)

The chain rule applied to v̂(x̂) = v(Bx̂ + c), with x = Bx̂ + c, results in

Dmv̂(x̂)(y1, . . . , ym) = Dmv(x)(By1, . . . , Bym)

and thus

‖Dmv̂(x̂)‖2 ≤ ‖B‖2^m ‖Dmv(x)‖2. (3.26)

Combination of (3.25) and (3.26) yields

max_{|α|=m} |Dαv̂(x̂)| ≤ c ‖B‖2^m max_{|α|=m} |Dαv(x)|.

Using this we finally obtain

|v̂|²m,K̂ = Σ_{|α|=m} ∫_K̂ Dαv̂(x̂)² dx̂ ≤ C ∫_K̂ ( max_{|α|=m} |Dαv̂(x̂)| )² dx̂
≤ C ‖B‖2^{2m} ∫_K̂ ( max_{|α|=m} |Dαv(x)| )² dx̂
= C ‖B‖2^{2m} ∫_K max_{|α|=m} |Dαv(x)|² |det B|^{−1} dx
≤ C ‖B‖2^{2m} |det B|^{−1} Σ_{|α|=m} ∫_K |Dαv(x)|² dx = C ‖B‖2^{2m} |det B|^{−1} |v|²m,K.

This proves the result in (3.23a). The result in (3.23b) follows from (3.23a) and F−1(K) = K̂ with F−1(x) = B−1x − B−1c.

The following result is a generalization of the Poincaré–Friedrichs inequality in (2.79b) and will be used in the proof of theorem 3.3.8.

Lemma 3.3.7 Let K be a Lipschitz domain in Rn. There exists a constant C such that

‖u‖²m ≤ C ( |u|²m + Σ_{|α|≤m−1} ( ∫_K Dαu dx )² ) for all u ∈ Hm(K).

(Here | · |m and ‖ · ‖m denote Sobolev (semi)norms on the domain K.)

Proof. For m = 1 this result is given in (2.79b). From the result in (2.79b) it also follows that

‖u‖²L2 ≤ C ( |u|²1 + ( ∫_K u dx )² ) for all u ∈ H1(K). (3.27)

We introduce the notation (for u ∈ Hm(K)):

βℓ := Σ_{|α|=ℓ} ‖Dαu‖²L2(K), δℓ := Σ_{|α|=ℓ} ( ∫_K Dαu dx )², ℓ = 0, . . . , m.


Note that for ℓ ≤ m − 1 we have Σ_{|α|=ℓ} |Dαu|²1 = βℓ+1. Using this and the inequality (3.27) with u replaced by Dαu we get for ℓ ≤ m − 1:

βℓ = Σ_{|α|=ℓ} ‖Dαu‖²L2(K) ≤ C ( βℓ+1 + Σ_{|α|=ℓ} ( ∫_K Dαu dx )² ) = C (βℓ+1 + δℓ).

From this it follows that

‖u‖²m = Σ_{ℓ=0}^{m} βℓ ≤ C ( βm + Σ_{ℓ=0}^{m−1} δℓ ) = C ( |u|²m + Σ_{|α|≤m−1} ( ∫_K Dαu dx )² ),

which completes the proof.

The next theorem, due to Bramble and Hilbert [20], is a fundamental one:

Theorem 3.3.8 Let K be a Lipschitz domain in Rn and Y a Banach space. Suppose L : Hm(K) → Y, m ≥ 1, is a bounded linear operator such that

L(p) = 0 for all p ∈ Pm−1.

Then there exists a constant C such that

‖Lu‖Y ≤ C |u|m for all u ∈ Hm(K). (3.28)

Proof. First note that

‖Lu‖Y = ‖L(u − p)‖Y ≤ ‖L‖ ‖u − p‖m for all p ∈ Pm−1. (3.29)

Let p(x) = Σ_{|α|≤m−1} γα x1^{α1} . . . xn^{αn} ∈ Pm−1. For any given u ∈ Hm(K) one can show that the coefficients γα can be taken such that

∫_K Dαp dx = ∫_K Dαu dx for |α| ≤ m − 1

holds (hint: the ordering |α| = m − 1, |α| = m − 2, . . . yields a linear system for the coefficients γα with a nonsingular lower triangular matrix). Using the result in lemma 3.3.7 we obtain

‖u − p‖²m ≤ C ( |u − p|²m + Σ_{|α|≤m−1} ( ∫_K Dα(u − p) dx )² ) = C |u − p|²m = C |u|²m. (3.30)

Combination of (3.29) and (3.30) completes the proof.

We now present a main result on the interpolation error:


Theorem 3.3.9 Let {Th} be a regular family of triangulations of Ω consisting of n-simplices and let Xkh be the corresponding finite element space as in (3.15b). For 2 ≤ m ≤ k + 1 and t ∈ {0, 1} the following holds:

‖u − IkXu‖t ≤ C h^{m−t} |u|m for all u ∈ Hm(Ω). (3.31)

Let {Th} be a regular family of triangulations of Ω consisting of n-rectangles and let Qkh be the corresponding finite element space as in (3.17b). For 2 ≤ m ≤ k + 1 and t ∈ {0, 1} the following holds:

‖u − IkQu‖t ≤ C h^{m−t} |u|m for all u ∈ Hm(Ω). (3.32)

The constants C in (3.31) and (3.32) are independent of u and of Th ∈ {Th}.

Proof. We will prove the result in (3.31). Very similar arguments can be used to show that the result in (3.32) holds. Take 2 ≤ m ≤ k + 1. The constants C used below are all uniform with respect to u ∈ Hm(Ω) and Th ∈ {Th}. We will show that for all ℓ ∈ {0, 1}

|u − IkXu|ℓ ≤ C h^{m−ℓ} |u|m for all u ∈ Hm(Ω)

holds, with | · |0 := ‖ · ‖L2. The result in (3.31) then follows from this and from ‖v‖²1 = |v|²0 + |v|²1. Due to

|u − IkXu|²ℓ = Σ_{T∈Th} |u − IkXu|²ℓ,T

it suffices to prove for ℓ ∈ {0, 1} and for arbitrary T ∈ Th:

|u − IkXu|ℓ,T ≤ C h^{m−ℓ} |u|m,T for all u ∈ Hm(Ω). (3.33)

Let T̂ be the unit n-simplex and F : T̂ → T an affine transformation F(x̂) = Bx̂ + c such that F(T̂) = T. Due to the fact that the family {Th} is regular, there exists a constant C such that

‖B‖2 ‖B−1‖2 ≤ c hT/ρT ≤ C. (3.34)

Note that ‖p‖∗ := Σ_{x∈Lk(T̂)} |p(x)| defines a norm on Pk. Since all norms on Pk are equivalent there exists a constant C such that

‖p‖m,T̂ ≤ C ‖p‖∗ for all p ∈ Pk. (3.35)

The continuous embedding Hm(T̂) ↪ C(T̂) yields:

∃ C : ‖v‖∞,T̂ ≤ C ‖v‖m,T̂ for all v ∈ Hm(T̂). (3.36)

Let ÎkX : C(T̂) → Pk be the interpolation operator on the unit n-simplex as defined in (3.19) (with T = T̂). We then have

(IkXu) ∘ F = ÎkX(u ∘ F) = ÎkX û with û := u ∘ F. (3.37)

Define the linear operator L := id − ÎkX : Hm(T̂) → Hm(T̂). For this operator we have Lp = 0 for all p ∈ Pk and thus, due to m ≤ k + 1, Lp = 0 for all p ∈ Pm−1. Furthermore, using (3.35) and (3.36) we get

‖Lv‖m,T̂ ≤ ‖v‖m,T̂ + ‖ÎkX v‖m,T̂ ≤ ‖v‖m,T̂ + C ‖ÎkX v‖∗ ≤ ‖v‖m,T̂ + C Σ_{x∈Lk(T̂)} |v(x)| ≤ ‖v‖m,T̂ + C ‖v‖∞,T̂ ≤ C ‖v‖m,T̂.

Thus we can apply theorem 3.3.8, which yields

‖v − ÎkX v‖m,T̂ ≤ C |v|m,T̂ for all v ∈ Hm(T̂). (3.38)

For u ∈ Hm(Ω) we obtain, using lemma 3.3.5, lemma 3.3.6 and the results in (3.34), (3.37), (3.38):

|u − IkXu|ℓ,T ≤ C ‖B−1‖2^ℓ |det B|^{1/2} |(u − IkXu) ∘ F|ℓ,T̂
= C ‖B−1‖2^ℓ |det B|^{1/2} |û − ÎkX û|ℓ,T̂
≤ C ‖B−1‖2^ℓ |det B|^{1/2} ‖û − ÎkX û‖m,T̂
≤ C ‖B−1‖2^ℓ |det B|^{1/2} |û|m,T̂ ≤ C ‖B−1‖2^ℓ ‖B‖2^m |u|m,T
= C ( ‖B−1‖2 ‖B‖2 )^ℓ ‖B‖2^{m−ℓ} |u|m,T
≤ C ‖B‖2^{m−ℓ} |u|m,T ≤ C h^{m−ℓ} |u|m,T.

This proves the result in (3.33).

Corollary 3.3.10 Under the same assumptions as in theorem 3.3.9 we have

inf_{vh∈Xkh} ‖u − vh‖t ≤ C h^{m−t} |u|m, (3.39)

inf_{vh∈Qkh} ‖u − vh‖t ≤ C h^{m−t} |u|m. (3.40)

Furthermore, the results in (3.39) and (3.40) hold for u ∈ Hm(Ω) ∩ H10(Ω) with Xkh, Qkh replaced by Xkh,0 and Qkh,0, respectively.

Proof. The first part is clear. The second part follows from the fact that for u ∈ Hm(Ω) ∩ H10(Ω) we have IkXu ∈ Xkh,0 and IkQu ∈ Qkh,0.
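The case k = 1, t = 0, m = 2 of corollary 3.3.10 predicts ‖u − Ihu‖L2 = O(h²) for the piecewise linear nodal interpolant. This rate can be observed in 1D; the smooth test function below is an assumption of the sketch:

```python
import numpy as np

def interp_L2_error(u, h):
    """L2 error of piecewise linear nodal interpolation of u on [0,1], mesh size h."""
    nodes = np.arange(0.0, 1.0 + h / 2, h)
    gx, gw = np.polynomial.legendre.leggauss(4)   # 4-point Gauss rule per element
    err2 = 0.0
    for a, b in zip(nodes[:-1], nodes[1:]):
        x = 0.5 * (b - a) * gx + 0.5 * (a + b)
        w = 0.5 * (b - a) * gw
        Iu = u(a) + (u(b) - u(a)) * (x - a) / h   # linear interpolant on [a, b]
        err2 += np.sum(w * (u(x) - Iu) ** 2)
    return np.sqrt(err2)

u = lambda x: np.sin(np.pi * np.asarray(x))       # smooth, so u is in H^2
e1 = interp_L2_error(u, 1.0 / 8)
e2 = interp_L2_error(u, 1.0 / 16)
rate = np.log2(e1 / e2)    # observed order: ~2, matching O(h^{m-t}) with m = 2, t = 0
```

Halving h reduces the L2 interpolation error by roughly a factor of 4, i.e., the observed order is about 2.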

We now prove so-called local and global inverse inequalities. These results can be used to bound the H1-norm of a finite element function in terms of its L2-norm.

Lemma 3.3.11 (inverse inequalities) Let {Th} be a regular family of triangulations of Ω consisting of n-simplices (n-rectangles) and let Vh := Xkh (= Qkh) be the corresponding finite element space. For m ≥ 0 there exists a constant c independent of h such that

|vh|m+1,T ≤ c hT^{−1} |vh|m,T for all T ∈ Th and all vh ∈ Vh.

If in addition the family of triangulations is quasi-uniform, then there exists a constant c independent of h such that

|vh|1 ≤ c h^{−1} ‖vh‖L2 for all vh ∈ Vh.


Proof. We consider the case of simplices. The other case can be treated very similarly. For T ∈ Th let F(x̂) = BT x̂ + c be an affine transformation such that F(T̂) = T, where T̂ is the unit simplex. Note that on the finite dimensional space Pk(T̂) all norms are equivalent. Using lemma 3.3.6 we get, with v̂h = vh ∘ F,

|vh|m+1,T ≤ c ‖BT−1‖2^{m+1} |det BT|^{1/2} |v̂h|m+1,T̂ ≤ c hT^{−m−1} |det BT|^{1/2} |v̂h|m+1,T̂ ≤ c hT^{−m−1} |det BT|^{1/2} |v̂h|m,T̂ ≤ c hT^{−1} |vh|m,T,

which proves the local inverse inequality. Note that vh ∈ H1(Ω) (remark 3.2.3). Thus for m = 0 we can sum up these local results, and using the quasi-uniformity assumption (i.e., hT^{−1} ≤ c h^{−1}) we then obtain

|vh|²1 = Σ_{T∈Th} |vh|²1,T ≤ c Σ_{T∈Th} hT^{−2} |vh|²0,T ≤ c h^{−2} Σ_{T∈Th} |vh|²0,T = c h^{−2} ‖vh‖²L2,

and thus the global inverse inequality is proved.
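In 1D the global inverse inequality can be made concrete: for piecewise linears on a uniform grid an elementary elementwise computation (an addition of this sketch, not taken from the text) gives the explicit constant |vh|1 ≤ √12 h^{−1} ‖vh‖L2. A numerical check with random nodal values:

```python
import numpy as np

rng = np.random.default_rng(1)

def norms_P1(vals, h):
    """|v_h|_1 and ||v_h||_L2 of the piecewise linear function on a uniform grid
    of [0,1] with nodal values vals (endpoints included)."""
    d = np.diff(vals)
    semi_H1 = np.sqrt(np.sum(d ** 2) / h)                    # integral of (v_h')^2
    a, b = vals[:-1], vals[1:]
    L2 = np.sqrt(h / 3.0 * np.sum(a ** 2 + a * b + b ** 2))  # exact elementwise mass
    return semi_H1, L2

ratios = []
for n in (16, 32, 64):
    h = 1.0 / n
    s, l = norms_P1(rng.standard_normal(n + 1), h)
    ratios.append(s * h / l)    # h * |v_h|_1 / ||v_h||_L2 stays bounded as h -> 0
```

The scaled ratio h |vh|1 / ‖vh‖L2 stays bounded (here by √12 ≈ 3.46) even for oscillatory random data, which is exactly the content of the global inverse inequality.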

3.4 Finite element discretization of scalar elliptic problems

In this section we consider the Galerkin discretization of the scalar elliptic problem

find u ∈ H10(Ω) such that k(u, v) = f(v) for all v ∈ H10(Ω) (3.41)

with a bilinear form and right-hand side as in (2.72), i.e.:

k(u, v) = ∫_Ω ∇uᵀA∇v + b · ∇u v + c u v dx, f(v) = ∫_Ω f v dx, (3.42a)

with −½ div b + c ≥ 0 a.e. in Ω, (3.42b)

∃ α0 > 0 : ξᵀA(x)ξ ≥ α0 ξᵀξ for all ξ ∈ Rn, x ∈ Ω, (3.42c)

aij ∈ L∞(Ω) ∀ i, j, bi ∈ H1(Ω) ∩ L∞(Ω) ∀ i, c ∈ L∞(Ω). (3.42d)

For the Galerkin discretization we use finite element subspaces Hh = Xkh,0 or Hh = Qkh,0. We prove bounds for the discretization error ‖u − uh‖1 (section 3.4.1) and ‖u − uh‖L2 (section 3.4.2).

3.4.1 Error bounds in the norm ‖ · ‖1

We first consider the Galerkin discretization of (3.41) with simplicial finite elements. Let {Th} be a regular family of triangulations of Ω consisting of n-simplices and Xkh,0, k ≥ 1, the corresponding finite element space as in (3.16). The discrete problem is given by:

find uh ∈ Xkh,0 such that k(uh, vh) = f(vh) for all vh ∈ Xkh,0. (3.43)


We have the following result concerning the discretization error:

Theorem 3.4.1 Assume that the conditions (3.42b)-(3.42d) are fulfilled and that the solution u ∈ H10(Ω) of (3.41) lies in Hm(Ω) with m ≥ 2. Let uh be the solution of (3.43). For 2 ≤ m ≤ k + 1 the following holds:

‖u − uh‖1 ≤ C h^{m−1} |u|m,

with a constant C independent of u and of Th ∈ {Th}.

Proof. From the proof of theorem 2.5.3 it follows that the bilinear form k(·, ·) is continuous and H10(Ω)-elliptic. From corollary 3.1.2 it follows that the continuous and discrete problems have unique solutions and that

‖u − uh‖1 ≤ C inf_{vh∈Xkh,0} ‖u − vh‖1

holds. Now apply corollary 3.3.10 with t = 1.
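Theorem 3.4.1 with k = 1 and m = 2 predicts first-order convergence in the H1-norm. A self-contained 1D sketch for −u″ = f with exact solution u = sin(πx) and P1 elements; the lumped load vector is a simplification of the sketch, not of the text:

```python
import numpy as np

def solve_poisson_P1(n):
    """P1 finite element solution of -u'' = pi^2 sin(pi x) on (0,1),
    u(0) = u(1) = 0, on a uniform mesh with n elements."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    # stiffness matrix in the nodal basis: tridiag(-1, 2, -1)/h
    K = (np.diag(2.0 * np.ones(n - 1)) - np.diag(np.ones(n - 2), 1)
         - np.diag(np.ones(n - 2), -1)) / h
    b = h * np.pi ** 2 * np.sin(np.pi * x[1:-1])     # lumped load vector
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(K, b)
    return x, u, h

def h1_seminorm_error(x, u, h):
    # |u - u_h|_1 by 3-point Gauss per element; exact u' = pi cos(pi x)
    gx, gw = np.polynomial.legendre.leggauss(3)
    err2 = 0.0
    for i in range(len(x) - 1):
        t = 0.5 * h * gx + 0.5 * (x[i] + x[i + 1])
        w = 0.5 * h * gw
        duh = (u[i + 1] - u[i]) / h                  # constant slope per element
        err2 += np.sum(w * (np.pi * np.cos(np.pi * t) - duh) ** 2)
    return np.sqrt(err2)

errs = [h1_seminorm_error(*solve_poisson_P1(n)) for n in (8, 16)]
rate = np.log2(errs[0] / errs[1])    # observed order: ~1, matching O(h^{m-1}), m = 2
```

Halving the mesh size halves the H1 error, in agreement with the bound ‖u − uh‖1 ≤ C h |u|2.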

A very similar result holds for the Galerkin discretization with rectangular finite elements. Let {Th} be a regular family of triangulations of Ω consisting of n-rectangles and Qkh,0, k ≥ 1, the corresponding finite element space as in (3.17c). The discrete problem is given by:

find uh ∈ Qkh,0 such that k(uh, vh) = f(vh) for all vh ∈ Qkh,0. (3.44)

We have the following result concerning the discretization error:

Theorem 3.4.2 Assume that the conditions in (3.42b)-(3.42d) are fulfilled and that the solution u ∈ H10(Ω) of (3.41) lies in Hm(Ω) with m ≥ 2. Let uh be the solution of (3.44). For 2 ≤ m ≤ k + 1 the following holds:

‖u − uh‖1 ≤ C h^{m−1} |u|m,

with a constant C independent of u and of Th ∈ {Th}.

Proof. The same arguments as in the proof of theorem 3.4.1 can be used.

Note that in the preceding two theorems we used the smoothness assumption u ∈ H10(Ω) ∩ Hm(Ω) with m ≥ 2. Sufficient conditions for this to hold are given in section 2.5.4, theorem 2.5.14 and theorem 2.5.16. In the literature one can find discretization error bounds for the case when u is less regular, i.e., u ∈ H10(Ω) but u ∉ H2(Ω) (cf., for example, [?]). One simple result for the case of minimal smoothness (u ∈ H10(Ω) only) is given in:

Theorem 3.4.3 Assume that the conditions of theorem 2.5.3 are fulfilled. Let uh be the solution of (3.43). Then we have:

lim_{h→0} ‖u − uh‖1 = 0.


Proof. Define V := H10(Ω) ∩ H2(Ω) and note that V is dense in H10(Ω) w.r.t. ‖ · ‖1. Take ε > 0. From corollary 3.1.2 we obtain

‖u − uh‖1 ≤ C inf_{vh∈Xkh,0} ‖u − vh‖1. (3.45)

There exists v ∈ V such that

‖u − v‖1 ≤ ε/(2C). (3.46)

From corollary 3.3.10 it follows that ‖v − IkXv‖1 ≤ C h |v|2, and thus for h sufficiently small we have

‖v − IkXv‖1 ≤ ε/(2C). (3.47)

Combination of (3.45), (3.46) and (3.47) yields

‖u − uh‖1 ≤ C ‖u − IkXv‖1 ≤ C ( ‖u − v‖1 + ‖v − IkXv‖1 ) ≤ ε

and thus the result is proved.

Remark 3.4.4 Comment on results for cases with other boundary conditions ....

3.4.2 Error bounds in the norm ‖ · ‖L2

In this section we derive a bound for the discretization error in the L2-norm. For the analysis we will need a duality argument, i.e., an argument in which the dual problem of the given variational problem (3.41) plays a role. For k(·, ·) and f(·) as in (3.42a) we define the dual problem by:

find u ∈ H10(Ω) such that k(v, u) = f(v) for all v ∈ H10(Ω). (3.48)

Note that if k(·, ·) is continuous and H10(Ω)-elliptic then this dual problem has a unique solution. The dual problem is said to be H2-regular (cf. section 2.5.4) if

∃ C : ‖u‖2 ≤ C ‖f‖L2 for all f ∈ L2(Ω).

The following result concerning the finite element discretization error holds:

Theorem 3.4.5 Suppose that the assumptions of theorem 3.4.1 [theorem 3.4.2] are fulfilled and that the dual problem (3.48) is H2-regular. For 2 ≤ m ≤ k + 1 the inequality

‖u − uh‖L2 ≤ C h^m |u|m

holds, with a constant C independent of u and of Th ∈ {Th}.

Proof. We give the proof for the case of simplicial finite elements. Exactly the same arguments can be used for rectangular finite elements. The bilinear form k(·, ·) is continuous and H10(Ω)-elliptic, and thus the problem (3.41), its Galerkin discretization and the dual problem (3.48) are uniquely solvable. Define eh := u − uh and note that eh ∈ H10(Ω). Let û ∈ H10(Ω) ∩ H2(Ω) be the solution of the dual problem

k(v, û) = ∫_Ω eh v dx for all v ∈ H10(Ω).

Using the Galerkin orthogonality, k(eh, vh) = 0 for all vh ∈ Xkh,0, we get

‖eh‖²L2 = ∫_Ω eh² dx = k(eh, û) = k(eh, û − IkX û) ≤ C ‖eh‖1 ‖û − IkX û‖1. (3.49)

From corollary 3.3.10 and the H2-regularity of the dual problem we obtain

‖û − IkX û‖1 ≤ C h |û|2 ≤ C h ‖eh‖L2. (3.50)

Combining (3.49) and (3.50) results in

‖eh‖L2 ≤ C h ‖eh‖1.

Now apply theorem 3.4.1 [theorem 3.4.2].
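The gain of one power of h in theorem 3.4.5 is visible numerically: with P1 elements (k = 1, m = 2) the L2 error decays like h² while the H1 error only decays like h. A self-contained sketch for the hypothetical 1D model problem −u″ = π² sin(πx), u = sin(πx), with a lumped load vector (a simplification of the sketch):

```python
import numpy as np

def solve_poisson_P1(n):
    # P1 FEM for -u'' = pi^2 sin(pi x) on (0,1), u(0) = u(1) = 0, uniform mesh
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    K = (np.diag(2.0 * np.ones(n - 1)) - np.diag(np.ones(n - 2), 1)
         - np.diag(np.ones(n - 2), -1)) / h
    b = h * np.pi ** 2 * np.sin(np.pi * x[1:-1])       # lumped load vector
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(K, b)
    return x, u, h

def l2_error(x, u, h):
    # ||u - u_h||_L2 by 4-point Gauss quadrature per element
    gx, gw = np.polynomial.legendre.leggauss(4)
    err2 = 0.0
    for i in range(len(x) - 1):
        t = 0.5 * h * gx + 0.5 * (x[i] + x[i + 1])
        w = 0.5 * h * gw
        uh = u[i] + (u[i + 1] - u[i]) * (t - x[i]) / h
        err2 += np.sum(w * (np.sin(np.pi * t) - uh) ** 2)
    return np.sqrt(err2)

errs = [l2_error(*solve_poisson_P1(n)) for n in (8, 16)]
rate = np.log2(errs[0] / errs[1])   # observed order: ~2, one power better than in H1
```

The observed L2 order is about 2, matching ‖u − uh‖L2 ≤ C h² |u|2 from the duality argument.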

Remark 3.4.6 Comment on sufficient conditions for H2-regularity of the dual problem ...

3.5 Stiffness matrix

In this section we consider the discrete problem in (3.43) with a bilinear form and right-hand side as in (3.42). We will discuss properties of the linear system described in (3.12). For this we need a suitable basis of the finite element space Xkh,0. The following lemma gives a general tool for constructing a basis in some finite element space.

Lemma 3.5.1 Let H be a finite dimensional vector space. Assume that for i = 1, . . . , N we have φi ∈ H and ℓi ∈ H′ such that the following conditions are satisfied:

ℓi(φi) ≠ 0 for all i, ℓi(φj) = 0 for all i ≠ j, (3.51a)

for all v ∈ H, v ≠ 0 : ℓi(v) ≠ 0 for some i. (3.51b)

Then (φi)1≤i≤N forms a basis of H.

Proof. Let α1, . . . , αN be such that Σ_{j=1}^{N} αj φj = 0. Using (3.51a) we get

0 = ℓi( Σ_{j=1}^{N} αj φj ) = Σ_{j=1}^{N} αj ℓi(φj) = αi ℓi(φi) for i = 1, . . . , N,

and thus αi = 0 for all i. This yields that the φi, i = 1, . . . , N, are linearly independent. Hence, N ≤ k := dim(H) = dim(H′) holds. We now show that N ≥ k holds, too. Let v1, . . . , vk be a basis of H. Define the matrix L ∈ RN×k by Lij = ℓi(vj). Let x ∈ Rk be such that Lx = 0. We then have

and thus αi = 0 for all i. This yields that φi, i = 1, . . . , N , are independent. Hence, N ≤ k :=dim(H) = dim(H ′) holds. We now show that N ≥ k holds, too. Let v1, . . . ,vk be a basis of H.Define the matrix L ∈ RN×k by Lij = ℓi(vj). Let x ∈ Rk be such that Lx = 0. We then have

k∑

j=1

ℓi(vj)xj = ℓi(

k∑

j=1

xjvj) = 0 for all i = 1, . . . , N


Using (3.51b) this yields ∑_{j=1}^k x_j v_j = 0 and thus x = 0. Hence, L has full column rank and thus N ≥ k holds.

The set (ℓ_i)_{1≤i≤N} as in (3.51) forms a basis of H′ and is called the dual basis of (φ_i)_{1≤i≤N}. We now construct a so-called nodal basis of the space of simplicial finite elements X^k_{h,0}. We will associate a basis function to each interpolation point in the principal lattice of T ∈ T_h that does not lie on ∂Ω. To make this more precise, for an admissible triangulation T_h consisting of n-simplices, we introduce the grid

    ∪_{T∈T_h} { x_j ∈ L_k(T) | x_j ∉ ∂Ω } =: { x_1, …, x_N } =: V

with x_i ≠ x_j for all i ≠ j. For each x_i ∈ V we define a corresponding function φ_i as follows:

    ∀ T ∈ T_h:  (φ_i)|_T ∈ P_k  and  ∀ x_j ∈ L_k(T):  φ_i(x_j) = { 0 if x_j ≠ x_i,  1 if x_j = x_i }.   (3.52)

From lemma 3.3.3 it follows that for all k ≥ 1 we have φ_i ∈ X^k_{h,0}. Thus we have a collection of functions (φ_i)_{1≤i≤N} with the properties:

    φ_i ∈ X^k_{h,0};  ∀ x_j ∈ V:  φ_i(x_j) = δ_{ij},  1 ≤ i ≤ N.   (3.53)

Lemma 3.5.2 The functions (φ_i)_{1≤i≤N} form a basis of X^k_{h,0}.

Proof. Introduce the linear functionals ℓ_i ∈ (X^k_{h,0})′:

    ℓ_i(u) = u(x_i)   for u ∈ X^k_{h,0},  x_i ∈ V,  i = 1, …, N.

One easily verifies that for φ_i, ℓ_i, i = 1, …, N, the conditions of lemma 3.5.1 are satisfied.

Due to the property φ_i(x_j) = δ_{ij} the functions φ_i are called nodal basis functions. In exactly the same way one can construct nodal basis functions for other finite element spaces like, for example, X^k_h, Q^k_{h,0}, Q^k_h.

We consider the discrete problem (3.43) and use the nodal basis (φ_i)_{1≤i≤N} of X^k_{h,0} to reformulate this problem as explained in (3.11)-(3.12). This results in the linear system of equations

    K_h v_h = b_h,  with (K_h)_{ij} = k(φ_j, φ_i),  (b_h)_i = f(φ_i),  1 ≤ i, j ≤ N.   (3.54)

The matrix K_h is called the stiffness matrix. In the remainder of this section we derive some important properties of this matrix that will play an important role in chapters 6-9, where we discuss iterative solution methods for the linear system in (3.54). Below we assume that the bilinear form k(·,·) satisfies the conditions (3.42b)-(3.42d).
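To make (3.54) concrete, the following sketch (Python with NumPy; the 1D model problem −u″ = 1 on (0,1) with linear finite elements on a uniform grid is an assumption used purely for illustration) assembles K_h and b_h and solves the system. For this particular 1D problem the Galerkin solution happens to coincide with the exact solution at the nodes.

```python
import numpy as np

# Assemble and solve K_h v_h = b_h for -u'' = 1 on (0,1), u(0) = u(1) = 0,
# with linear finite elements on a uniform grid (illustrative special case).
N = 50                                   # number of interior nodes
h = 1.0 / (N + 1)
e = np.ones(N - 1)
K = (2.0 * np.eye(N) - np.diag(e, 1) - np.diag(e, -1)) / h
b = h * np.ones(N)                       # (b_h)_i = ∫ f φ_i dx with f ≡ 1
v = np.linalg.solve(K, b)

x = np.linspace(h, 1 - h, N)             # interior nodes x_i = i h
u_exact = 0.5 * x * (1 - x)              # exact solution of -u'' = 1
# In 1D the P1 Galerkin solution is exact at the nodes for this problem.
assert np.max(np.abs(v - u_exact)) < 1e-10
```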

The stiffness matrix is sparse. We introduce

    q_row(K_h) = maximum number of nonzero entries per row of K_h,
    q_col(K_h) = maximum number of nonzero entries per column of K_h.


Lemma 3.5.3 Let {T_h} be a regular family of triangulations consisting of n-simplices and for each T_h let K_h be the stiffness matrix defined in (3.54). There exists a constant q independent of T_h ∈ {T_h} such that

    max{ q_row(K_h), q_col(K_h) } ≤ q.

Proof. Take a fixed i with 1 ≤ i ≤ N. Define the set of simplices

    N_{x_i} := { T ∈ T_h | x_i ∈ L_k(T) };

note that supp(φ_i) = ∪_{T∈N_{x_i}} T. From the assumption that we have a regular family of triangulations it follows that

    |N_{x_i}| ≤ M   (3.55)

with a constant M independent of i and of T_h ∈ {T_h}. Assume that (K_h)_{ij} ≠ 0. Using the fact that we have a nodal basis it follows that x_j is a lattice point of a simplex in N_{x_i}. Using (3.55) we get that the number of lattice points in N_{x_i} can be bounded by a constant, say q, independent of i and of T_h ∈ {T_h}. Hence q_row(K_h) ≤ q holds. The same argument with i and j interchanged yields the bound for q_col(K_h).

Note that the constant q depends on the degree k used in the finite element space X^k_{h,0}. The result of this lemma shows that the number of nonzero entries in the N×N matrix K_h is bounded by qN. If h ↓ 0 then N → ∞, and the number of nonzero entries in K_h remains proportional to N (instead of growing like N²). Therefore the stiffness matrix is said to be sparse.

The stiffness matrix is positive definite.

Lemma 3.5.4 For the stiffness matrix defined in (3.54) we have:

    K_h + K_h^T is symmetric positive definite.

Proof. Take v ∈ R^N, v ≠ 0 and define u = ∑_{j=1}^N v_j φ_j ∈ X^k_{h,0}. Note that u ≠ 0. Using the fact that the bilinear form is elliptic we get

    v^T (K_h + K_h^T) v = 2 v^T K_h v = 2 k(u, u) > 0,

and thus the symmetric matrix K_h + K_h^T is positive definite.

As a direct consequence we have:

Corollary 3.5.5 If in (3.42) we have b = 0 then the bilinear form k(·,·) is symmetric and the stiffness matrix K_h is symmetric positive definite.

The stiffness matrix is ill-conditioned. We now derive sharp bounds for the condition number of the stiffness matrix. We restrict ourselves to the case b = 0 in (3.42). Then the stiffness matrix K_h is symmetric positive definite and its spectral condition number is given by

    κ(K_h) = ‖K_h‖_2 ‖K_h^{-1}‖_2 = λ_max(K_h) / λ_min(K_h).

We first give a result (due to [89]) on diagonal scaling of a symmetric positive definite matrix. We use the notation D_A := diag(A) for a square matrix A.


Lemma 3.5.6 Let A ∈ R^{N×N} be a symmetric positive definite matrix and let q be such that q_row(A) ≤ q. For any nonsingular diagonal matrix D ∈ R^{N×N} we have

    κ( D_A^{-1/2} A D_A^{-1/2} ) ≤ q κ(DAD).

Proof. Define Â := D_A^{-1/2} A D_A^{-1/2} and note that this matrix is symmetric positive definite with diag(Â) = I. Let Â = LL^T be the Cholesky factorization of Â and let e_i be the i-th standard basis vector in R^N. Then we have

    ‖L^T e_i‖_2^2 = 〈L^T e_i, L^T e_i〉 = 〈Â e_i, e_i〉 = Â_ii = 1,
    |Â_ij| = |〈L^T e_j, L^T e_i〉| ≤ ‖L^T e_j‖_2 ‖L^T e_i‖_2 = 1,

and thus ‖Â‖_2 ≤ ‖Â‖_∞ ≤ q holds. Now let D ∈ R^{N×N} be an arbitrary nonsingular diagonal matrix and define D̂ := D_A^{1/2} D, so that DAD = D̂ Â D̂. We obtain:

    κ(Â) = ‖Â‖_2 ‖Â^{-1}‖_2 ≤ q ‖L^{-T} L^{-1}‖_2 = q ‖L^{-1}‖_2^2
         ≤ q ‖L^{-1} D̂^{-1}‖_2^2 max_j |D̂_jj|^2 = q ‖L^{-1} D̂^{-1}‖_2^2 max_j ‖L^T D̂ e_j‖_2^2
         ≤ q ‖L^{-1} D̂^{-1}‖_2^2 ‖L^T D̂‖_2^2 = q ‖D̂^{-1} Â^{-1} D̂^{-1}‖_2 ‖D̂ Â D̂‖_2 = q κ(D̂ Â D̂) = q κ(DAD).

This shows that the desired result holds.
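Lemma 3.5.6 says that symmetric scaling with D_A is, up to the factor q, the best possible diagonal scaling. A small numerical illustration (Python with NumPy; a generic badly scaled SPD matrix is used, not a stiffness matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
# A generic SPD matrix B with moderate condition number ...
G = rng.standard_normal((50, 50))
B = G @ G.T + 50.0 * np.eye(50)
# ... made badly scaled by a symmetric diagonal scaling with a wide range.
S = np.diag(10.0 ** rng.uniform(-4.0, 4.0, 50))
A = S @ B @ S

# Symmetric scaling with D_A = diag(A), as in lemma 3.5.6.
D_half = np.diag(1.0 / np.sqrt(np.diag(A)))
A_hat = D_half @ A @ D_half

# Diagonal scaling removes the (purely scaling-induced) ill-conditioning.
assert np.linalg.cond(A_hat) < 1e-3 * np.linalg.cond(A)
assert np.linalg.cond(A_hat) < 100.0
```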

The result in this lemma shows that for the sparse symmetric positive definite stiffness matrix K_h the symmetric scaling with the diagonal matrix D_{K_h} is in a certain sense optimal. Hence, we investigate the condition number of the scaled matrix

    K̂_h := D_{K_h}^{-1/2} K_h D_{K_h}^{-1/2}.   (3.56)

The following result is based on the analysis presented in [9].

Theorem 3.5.7 Suppose b = 0 in (3.42). Let {T_h} be a regular family of triangulations consisting of n-simplices and for each T_h let K_h be the stiffness matrix defined in (3.54). Then there exists a constant C independent of T_h ∈ {T_h} such that

    κ(K̂_h) ≤ C N (1 + log(h/h_min))   if n = 2,
    κ(K̂_h) ≤ C N^{2/3}                if n = 3,   (3.57)

with h_min := min{ h_T | T ∈ T_h }.

Proof. We need the following embedding results (cf. [1], theorem 5.4):

    H^1(Ω) ↪ L^6(Ω)   for n = 3,   (3.58)
    H^1(Ω) ↪ L^q(Ω)   for n = 2, 1 ≤ q < ∞.   (3.59)

For the embedding in (3.59) one can analyze the dependence of the norm of the embedding operator on q. This results in (cf. [9]):

    ‖u‖_{L^q} ≤ C √q ‖u‖_1   for all u ∈ H^1(Ω), 1 ≤ q < ∞,   (3.60)

with a constant C independent of u and q. Note that if for c_1 > 0, c_2 we have

    c_1 〈D_{K_h} v, v〉 ≤ 〈K_h v, v〉 ≤ c_2 〈D_{K_h} v, v〉   for all v ∈ R^N   (3.61)


then κ(K̂_h) ≤ c_2/c_1 holds.

For v ∈ R^N we define u := ∑_{i=1}^N v_i φ_i. Note that each nodal basis function φ_i is associated to a grid point x_i such that φ_i(x_i) = 1 and φ_i(x_j) = 0 for j ≠ i. The set of grid points (x_i)_{1≤i≤N} is denoted by V. We have

    〈D_{K_h} v, v〉 = ∑_{x_i∈V} (D_{K_h})_{ii} u(x_i)^2 = ∑_{x_i∈V} k(φ_i, φ_i) u(x_i)^2.   (3.62)

There are constants d_1 > 0 and d_2 such that d_1 |φ_i|_1^2 ≤ k(φ_i, φ_i) ≤ d_2 |φ_i|_1^2 for all i. Using the lemmas 3.3.5 and 3.3.6 one can show that there are constants d_1 > 0 and d_2 independent of T_h ∈ {T_h} such that

    d_1 h_T^{-2} |T| ≤ ∑_{x_i∈T} k(φ_i, φ_i) ≤ d_2 h_T^{-2} |T|   for all T ∈ T_h.   (3.63)

Combination of (3.62) and (3.63) yields

    d_1 ∑_{T∈T_h} h_T^{-2} ‖u‖_{0,T}^2 ≤ 〈D_{K_h} v, v〉 ≤ d_2 ∑_{T∈T_h} h_T^{-2} ‖u‖_{0,T}^2   (3.64)

with constants d_1 > 0 and d_2 independent of T_h ∈ {T_h} and of v ∈ R^N. For T ∈ T_h let F(x) = Bx + c be an affine mapping with F(T̂) = T, where T̂ is the unit n-simplex; we write û := u ∘ F. From

    〈K_h v, v〉 = k(u, u) ≤ C |u|_1^2 = C ∑_{T∈T_h} |u|_{1,T}^2
        ≤ C ∑_{T∈T_h} |û|_{1,T̂}^2 h_T^{-2} |det B| ≤ C ∑_{T∈T_h} ‖û‖_{0,T̂}^2 h_T^{-2} |det B|
        ≤ C ∑_{T∈T_h} h_T^{-2} ‖u‖_{0,T}^2 ≤ C 〈D_{K_h} v, v〉   (3.65)

it follows that the second inequality in (3.61) holds with a constant c_2 independent of v and of T_h ∈ {T_h}. We now consider the first inequality in (3.61). First note that for arbitrary α > 2, β ≤ 0 and w ∈ L^α(Ω) we have, using the discrete Hölder inequality:

    ∑_{T∈T_h} h_T^{β/α} ( ∫_T w^α dx )^{2/α}
        ≤ ( ∑_{T∈T_h} h_T^{β/(α−2)} )^{(α−2)/α} ( ∑_{T∈T_h} ∫_T w^α dx )^{2/α}
        ≤ h_min^{β/α} ( ∑_{T∈T_h} 1 )^{(α−2)/α} ‖w‖_{L^α(Ω)}^2
        ≤ C h_min^{β/α} N^{(α−2)/α} ‖w‖_{L^α(Ω)}^2.   (3.66)

We now distinguish the cases n = 3 and n = 2. First we treat n = 3. We use the Hölder inequality to get

    〈D_{K_h} v, v〉 ≤ C ∑_{T∈T_h} h_T^{-2} ‖u‖_{0,T}^2 = C ∑_{T∈T_h} ∫_T h_T^{-2} u^2 dx
        ≤ C ∑_{T∈T_h} ( ∫_T h_T^{-2p} dx )^{1/p} ( ∫_T u^{2q} dx )^{1/q}   ( 1/p + 1/q = 1 )
        ≤ C ∑_{T∈T_h} h_T^{(3−2p)/p} ( ∫_T u^{2q} dx )^{1/q}.


Now take p = 3/2, q = 3 and apply (3.66) with β = 0, α = 6. This yields

    〈D_{K_h} v, v〉 ≤ C ∑_{T∈T_h} ( ∫_T u^6 dx )^{1/3} ≤ C N^{2/3} ‖u‖_{L^6(Ω)}^2.

We use the embedding result (3.58) and thus obtain

    〈D_{K_h} v, v〉 ≤ C N^{2/3} ‖u‖_1^2 ≤ C N^{2/3} k(u, u) = C N^{2/3} 〈K_h v, v〉.   (3.67)

Combination of the results in (3.65) and (3.67) proves the result in (3.57) for n = 3. We now consider n = 2. Using the Hölder inequality it follows that for p > 1:

    ‖u‖_{0,T}^2 ≤ ( ∫_T u^{2p} dx )^{1/p} ( ∫_T 1 dx )^{1−1/p} ≤ C h_T^{2−2/p} ( ∫_T u^{2p} dx )^{1/p}.

Using this we get

    〈D_{K_h} v, v〉 ≤ C ∑_{T∈T_h} h_T^{-2} ‖u‖_{0,T}^2 ≤ C ∑_{T∈T_h} h_T^{-2/p} ( ∫_T u^{2p} dx )^{1/p}.

We apply (3.66) with α = 2p > 2, β = −4, and use the result in (3.60). This yields

    〈D_{K_h} v, v〉 ≤ C h_min^{-2/p} N^{(p−1)/p} ‖u‖_{L^{2p}(Ω)}^2 ≤ C p h_min^{-2/p} N^{(p−1)/p} ‖u‖_1^2.

Note that |Ω| ≤ ∑_{T∈T_h} h_T^2 ≤ C N h^2 and thus N ≥ C h^{-2}. We then obtain

    〈D_{K_h} v, v〉 ≤ C p (h/h_min)^{2/p} N ‖u‖_1^2 ≤ C p (h/h_min)^{2/p} N 〈K_h v, v〉.

The constant C can be chosen independent of p. For p = max{2, log(h/h_min)} we have p (h/h_min)^{2/p} ≤ C (1 + log(h/h_min)), and thus

    〈D_{K_h} v, v〉 ≤ C ( 1 + log(h/h_min) ) N 〈K_h v, v〉.   (3.68)

Combination of the results in (3.65) and (3.68) proves the result (3.57) for n = 2.

Remark 3.5.8 In [9] one can find an example which shows that for n = 2 the logarithmic term can not be avoided. If the family of triangulations is quasi-uniform then h/h_min ≤ σ for a constant σ independent of T_h ∈ {T_h}, and furthermore N = O(h^{-2}) for n = 2 and N^{2/3} = O(h^{-2}) for n = 3. Hence, for the quasi-uniform case we have κ(K̂_h) ≤ C h^{-2} for n = 2, 3. Moreover, in this case the diagonal of K_h is well-conditioned, κ(D_{K_h}) = O(1), and thus the scaling in (3.56) is not essential. We emphasize that for the general case of a regular (possibly non quasi-uniform) family of triangulations the scaling is essential: a result as in (3.57) does in general not hold for the unscaled matrix K_h. Finally we note that for the quasi-uniform case it is not difficult to prove that there exists a constant C > 0 independent of T_h ∈ {T_h} such that κ(K̂_h) ≥ C h^{-2} holds, both for n = 2 and n = 3.


3.5.1 Mass matrix

Apart from the stiffness matrix the so-called mass matrix also plays an important role in finite elements. This matrix depends on the choice of the basis in the finite element space but not on the bilinear form k(·,·). Let (φ_i)_{1≤i≤N} be the nodal basis of the finite element space X^k_{h,0} as defined in (3.52). The mass matrix M_h ∈ R^{N×N} is given by

    (M_h)_{ij} = ∫_Ω φ_i φ_j dx = 〈φ_i, φ_j〉_{L^2}.   (3.69)

Note that this matrix is symmetric positive definite. As for the stiffness matrix we use a diagonal scaling with the diagonal matrix D_{M_h} := diag(M_h). The next result shows that the scaled mass matrix is uniformly well-conditioned:

Theorem 3.5.9 Let {T_h} be a regular family of triangulations consisting of n-simplices and for each T_h let M_h be the mass matrix defined in (3.69). Then there exists a constant C independent of T_h ∈ {T_h} such that

    κ( D_{M_h}^{-1/2} M_h D_{M_h}^{-1/2} ) ≤ C.

Proof. Take T_h ∈ {T_h}. For v ∈ R^N we define u := ∑_{i=1}^N v_i φ_i. The constants that appear in the proof are independent of T_h and of v. For each T ∈ T_h let F : T̂ → T be an affine transformation between the unit simplex T̂ and T; furthermore, û := u ∘ F. We use the index set

    I_T := { i | T ⊂ supp(φ_i) }.

Note that |I_T| is uniformly bounded. We have

    〈M_h v, v〉 = 〈u, u〉_{L^2} = ∑_{T∈T_h} ‖u‖_{0,T}^2.   (3.70)

The nodal point associated to φ_i is denoted by x_i (1 ≤ i ≤ N). Using lemma 3.3.6 and the norm equivalence property in the space P_k(T̂) it follows that there exist constants c_1 > 0 and c_2 such that

    c_1 ‖u‖_{0,T}^2 ≤ |T| ∑_{z_i∈L_k(T)} u(z_i)^2 ≤ c_2 ‖u‖_{0,T}^2,

and thus, using u(x_i) = v_i, we get

    c_1 ‖u‖_{0,T}^2 ≤ |T| ∑_{i∈I_T} v_i^2 ≤ c_2 ‖u‖_{0,T}^2.

Define d_i := |supp(φ_i)|. For i ∈ I_T the quantity |T| d_i^{-1} is uniformly (w.r.t. T) bounded both from below by a strictly positive constant and from above (by 1). If we combine this with the result in (3.70) we get (with different constants c_1 > 0, c_2):

    c_1 〈M_h v, v〉 ≤ ∑_{T∈T_h} ∑_{i∈I_T} d_i v_i^2 ≤ c_2 〈M_h v, v〉.

Hence

    c_1 〈M_h v, v〉 ≤ ∑_{i=1}^N d_i v_i^2 ≤ c_2 〈M_h v, v〉


with c_1 > 0. Note that

    (D_{M_h})_{ii} = 〈M_h e_i, e_i〉 = ∫_{supp(φ_i)} φ_i^2 dx,

thus there are constants c_1 > 0, c_2 independent of i such that c_1 d_i ≤ (D_{M_h})_{ii} ≤ c_2 d_i. We then obtain

    c_1 〈M_h v, v〉 ≤ 〈D_{M_h} v, v〉 ≤ c_2 〈M_h v, v〉

with c_1 > 0. Thus the result is proved.

Corollary 3.5.10 Let {T_h} be a quasi-uniform family of triangulations consisting of n-simplices and for each T_h let M_h be the mass matrix defined in (3.69). Then there exists a constant C independent of T_h ∈ {T_h} such that

    κ(M_h) ≤ C.

Proof. Note that

    (M_h)_{ii} = ∫_{supp(φ_i)} φ_i^2 dx.

Using this in combination with the quasi-uniformity of {T_h} it follows that the spectral condition number of D_{M_h} = diag(M_h) is uniformly bounded. Now apply theorem 3.5.9.

3.6 Isoparametric finite elements

See Handbook Ciarlet, chapter 6.

3.7 Nonconforming finite elements


Chapter 4

Finite element discretization of a convection-diffusion problem

4.1 Introduction

In this chapter we consider the convection-diffusion boundary value problem

    −ε∆u + b·∇u + cu = f   in Ω,
    u = 0   on ∂Ω,

with a constant ε ∈ (0,1], b_i ∈ H^1(Ω) ∩ L^∞(Ω) for all i, c ∈ L^∞(Ω) and f ∈ L^2(Ω). Furthermore, we also assume the smoothness property div b ∈ L^∞(Ω). The weak formulation of the problem is analyzed in section 2.5.2. We introduce

    k(u, v) = ∫_Ω ( ε∇u·∇v + (b·∇u)v + cuv ) dx,   f(v) = ∫_Ω f v dx.   (4.1)

The weak formulation of this convection-diffusion problem is as follows:

    find u ∈ H^1_0(Ω) such that k(u, v) = f(v) for all v ∈ H^1_0(Ω).   (4.2)

In theorem 2.5.3 it is shown that if we assume

    −(1/2) div b + c ≥ 0   in Ω,   (4.3)

then this variational problem has a unique solution. In this chapter we treat the finite element discretization of the problem (4.2) for the convection-dominated case, i.e., ε ≪ max_i ‖b_i‖_{L^∞}. Then the problem is singularly perturbed and the standard finite element method in general yields a poor approximation of the continuous solution. A significant improvement results if one introduces suitable artificial stabilization terms in the Galerkin discretization. Many different stabilization techniques exist, which leads to the large class of so-called stabilized finite element methods known in the literature. In section 4.4.2 we will explain and analyze one very popular method from this class, namely the streamline diffusion finite element method (SDFEM). In section 4.3 we consider a simple one-dimensional convection-diffusion equation to illustrate a few basic phenomena related to standard finite element discretization. To gain a better understanding of the (poor) behaviour of the standard finite element method in the convection-dominated case we reconsider its discretization error analysis in section 4.4.


In the remainder of this section we briefly discuss the topic of regularity of the variational problem (4.2). In section 2.5.4 regularity results of the form ‖u‖_m ≤ C‖f‖_{m−2}, m = 1, 2, …, with a constant C independent of f, are presented (under smoothness assumptions on the coefficients and on the domain). In the convection-dominated case it is of interest to analyze the dependence of the (stability) constant C on ε. An important result of this analysis is given in the following theorem.

Theorem 4.1.1 Assume that

    −(1/2) div b(x) + c(x) ≥ β_0 > 0   a.e. in Ω.   (4.4)

Then the solution u of (4.2) satisfies

    ε^{1/2} ‖u‖_1 + ‖u‖_{L^2} ≤ C ‖f‖_{L^2}   (4.5)

with a constant C independent of f and of ε. Furthermore, if the regularity property u ∈ H^2(Ω) holds, then the inequality

    ε^{3/2} ‖u‖_2 ≤ C ‖f‖_{L^2}   (4.6)

holds, with a constant C independent of f and of ε.

Proof. Using partial integration, (4.4) and the Poincaré-Friedrichs inequality we get

    k(u, u) = ε|u|_1^2 + ∫_Ω ( −(1/2) div b + c ) u^2 dx
            ≥ ε|u|_1^2 + β_0 ‖u‖_{L^2}^2 ≥ c ( ε‖u‖_1^2 + ‖u‖_{L^2}^2 )

with c > 0 independent of ε. In combination with

    k(u, u) = f(u) ≤ ‖f‖_{L^2} ‖u‖_{L^2} ≤ ‖f‖_{L^2} ( ε‖u‖_1^2 + ‖u‖_{L^2}^2 )^{1/2}

this yields

    ε^{1/2}‖u‖_1 + ‖u‖_{L^2} ≤ √2 ( ε‖u‖_1^2 + ‖u‖_{L^2}^2 )^{1/2} ≤ √2 c^{-1} ‖f‖_{L^2},

and thus the result in (4.5) holds. If u ∈ H^2(Ω) then the equality −ε∆u + b·∇u + cu = f holds (where all derivatives are weak ones). Hence, using (4.5) and ε ≤ 1, we obtain

    ε‖∆u‖_{L^2} ≤ ‖f‖_{L^2} + ‖b‖_{L^∞}‖u‖_1 + ‖c‖_{L^∞}‖u‖_{L^2} ≤ c ε^{-1/2} ‖f‖_{L^2}   (4.7)

with a constant c independent of f and ε. We use the following result (lemma 8.1 in [57]):

    ∃ c : ‖v‖_2 ≤ c ( ‖∆v‖_{L^2} + ‖v‖_{L^2} )   for all v ∈ H^2(Ω).

Combination of this with (4.7) and (4.5) yields

    ε^{3/2}‖u‖_2 ≤ c ( ε^{3/2}‖∆u‖_{L^2} + ε^{3/2}‖u‖_{L^2} ) ≤ c ‖f‖_{L^2},

and thus the result (4.6) holds.


Remark 4.1.2 The constants in (4.5) and (4.6) depend on β_0 in (4.4). For the analysis the assumption β_0 > 0 is essential. For the case β_0 ≥ 0 a slight modification of the analysis results in a stability bound

    ε^{1/2} ‖u‖_1 + √β_0 ‖u‖_{L^2} ≤ C min{ ε^{-1/2}, β_0^{-1/2} } ‖f‖_{L^2},

with a constant C independent of f, ε and β_0.

The results in theorem 4.1.1 indicate that derivatives of the solution u (e.g., ‖u‖_1) grow as ε ↓ 0. This is due to the fact that in general such a convection-diffusion problem has boundary and internal layers in which the solution (or some of its derivatives) varies exponentially. For an analysis of these boundary layers we refer to the literature, e.g. [76]. In certain special cases it is possible to obtain bounds on the derivative in streamline direction which are significantly better than the general bound ‖u‖_1 ≤ C ε^{-1/2} ‖f‖_{L^2} in (4.5). We now present two such results. The first one is for a relatively simple one-dimensional problem, whereas the second one is related to a two-dimensional convection-diffusion problem with Neumann boundary conditions on the outflow boundary.

Theorem 4.1.3 For f ∈ L^2([0,1]) consider the following problem (with weak derivatives):

    −ε u″(x) + u′(x) = f(x)   for x ∈ (0,1),   u(0) = u(1) = 0,

with ε ∈ (0,1]. The unique solution u satisfies:

    max{ ‖u‖_{L^∞}, ε‖u′‖_{L^∞} } ≤ (1 − e^{-1})^{-1} ‖f‖_{L^1},   (4.8)
    ‖u′‖_{L^1} ≤ 2 (1 − e^{-1})^{-1} ‖f‖_{L^1}.   (4.9)

Proof. We reformulate the differential equation in the equivalent form

    ( e^{-x/ε} u′(x) )′ = −(1/ε) e^{-x/ε} f(x) =: g(x).

In textbooks on ordinary differential equations (e.g. [95]) it is shown that the solution can be represented using a Green's function. For this we introduce the two fundamental solutions u_1(x) = 1 − e^{x/ε} and u_2(x) = 1 − e^{(x−1)/ε} (note: u_1 and u_2 satisfy the homogeneous differential equation and u_1(0) = u_2(1) = 0). The solution is given by

    u(x) = ∫_0^1 G(x,t) g(t) dt,   G(x,t) := ε/(1 − e^{-1/ε}) · { u_1(t)u_2(x) if t ≤ x,  u_1(x)u_2(t) if t > x }.

We use C := (1 − e^{-1/ε})^{-1}. Note that C ≤ (1 − e^{-1})^{-1} for ε ∈ (0,1]. Using g(t) = −(1/ε) e^{-t/ε} f(t) we get (for x ∈ [0,1])

    u(x) = C u_2(x) ∫_0^x (1 − e^{-t/ε}) f(t) dt − C u_1(x) e^{-x/ε} ∫_x^1 e^{(x−t)/ε} ( 1 − e^{(t−1)/ε} ) f(t) dt.

Note that |u_2(x)| ≤ 1 and |u_1(x)| e^{-x/ε} ≤ 1. We obtain

    |u(x)| ≤ C ∫_0^x |f(t)| dt + C ∫_x^1 |f(t)| dt = C ‖f‖_{L^1}.   (4.10)


From

    u′(x) = C u_2′(x) ∫_0^x (1 − e^{-t/ε}) f(t) dt − C u_1′(x) ∫_x^1 e^{-t/ε} ( 1 − e^{(t−1)/ε} ) f(t) dt
          = −(C/ε) e^{(x−1)/ε} ∫_0^x (1 − e^{-t/ε}) f(t) dt + (C/ε) ∫_x^1 e^{(x−t)/ε} ( 1 − e^{(t−1)/ε} ) f(t) dt

we get

    |u′(x)| ≤ (C/ε) ∫_0^x |f(t)| dt + (C/ε) ∫_x^1 |f(t)| dt = (C/ε) ‖f‖_{L^1}.   (4.11)

Combination of (4.10) and (4.11) proves the result in (4.8). We also have

    ∫_0^1 |u′(x)| dx ≤ (C/ε) ∫_0^1 e^{(x−1)/ε} ∫_0^x (1 − e^{-t/ε}) |f(t)| dt dx
                     + (C/ε) ∫_0^1 e^{x/ε} ∫_x^1 e^{-t/ε} ( 1 − e^{(t−1)/ε} ) |f(t)| dt dx.

For the first term on the right hand side we have

    (1/ε) ∫_0^1 e^{(x−1)/ε} ∫_0^x (1 − e^{-t/ε}) |f(t)| dt dx =: (1/ε) ∫_0^1 e^{(x−1)/ε} F(x) dx
        = F(1) − ∫_0^1 e^{(x−1)/ε} (1 − e^{-x/ε}) |f(x)| dx
        ≤ F(1) ≤ ∫_0^1 |f(t)| dt = ‖f‖_{L^1}.

The second term can be treated similarly. This then yields

    ∫_0^1 |u′(x)| dx ≤ 2C ‖f‖_{L^1},

and thus the result in (4.9).
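The bounds (4.8) and (4.9) can be checked numerically. The sketch below (Python with NumPy; the special right hand side f ≡ 1, for which the problem has the closed-form solution u(x) = x − (e^{x/ε} − 1)/(e^{1/ε} − 1), is an assumption used only for this check) samples the exact solution and verifies both ε-independent bounds, with ‖f‖_{L^1} = 1.

```python
import numpy as np

def u_exact(x, eps):
    """Solution of -eps*u'' + u' = 1 on (0,1), u(0) = u(1) = 0:
    u(x) = x - (e^{x/eps} - 1)/(e^{1/eps} - 1), written overflow-safely."""
    layer = (np.exp((x - 1.0) / eps) * (1.0 - np.exp(-x / eps))
             / (1.0 - np.exp(-1.0 / eps)))
    return x - layer

C = 1.0 / (1.0 - np.exp(-1.0))            # the constant (1 - e^{-1})^{-1}
x = np.linspace(0.0, 1.0, 200001)
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    u = u_exact(x, eps)
    assert np.max(np.abs(u)) <= C                  # (4.8): ||u||_inf <= C ||f||_L1
    assert np.sum(np.abs(np.diff(u))) <= 2.0 * C   # (4.9): ||u'||_L1 <= 2C ||f||_L1
```

Note that the total variation of the sampled values is used as a (lower) approximation of ‖u′‖_{L^1}, so the second assertion is a valid check of (4.9).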

Note that in (4.9) we have a bound on the derivative, measured in the L^1-norm (which is weaker than the L^2-norm), that is independent of ε. Similar results for a more general one-dimensional convection-diffusion problem are given in [76] (section 1.1.2) and [38]. We now present a result for a two-dimensional problem.

We now present a result for a two-dimensional problem.

Theorem 4.1.4 For f ∈ L^2(Ω), Ω := (0,1)^2, and a constant c ≥ 0, consider the convection-diffusion problem

    −ε∆u + u_x + cu = f   in Ω,
    ∂u/∂x = 0   on Γ_E := { (x,y) ∈ ∂Ω | x = 1 },
    u = 0   on ∂Ω \ Γ_E.   (4.12)

This problem has a unique solution u ∈ H^2(Ω) and the inequality

    c‖u‖_{L^2} + ‖u_x‖_{L^2} ≤ 2‖f‖_{L^2}   (4.13)

holds.


Proof. First note that the weak formulation of this problem has a unique solution u ∈ H^1(Ω). Using the fact that Ω is convex it follows that u ∈ H^2(Ω) holds, and thus the problem (with weak derivatives) in (4.12) has a unique solution u ∈ H^2(Ω). From the differential equation we get

    ‖u_x‖_{L^2}^2 = 〈f, u_x〉_{L^2} + ε〈u_yy, u_x〉_{L^2} + ε〈u_xx, u_x〉_{L^2} − c〈u, u_x〉_{L^2}.

Using Green's formulas and the boundary conditions for the solution u we obtain, with Γ_W := { (x,y) ∈ ∂Ω | x = 0 }:

    〈u_yy, u_x〉_{L^2} = −(1/2) 〈∂/∂x (u_y)^2, 1〉_{L^2} = −(1/2) ∫_{Γ_E} u_y^2 dy ≤ 0,
    〈u_xx, u_x〉_{L^2} = (1/2) 〈∂/∂x (u_x)^2, 1〉_{L^2} = −(1/2) ∫_{Γ_W} u_x^2 dy ≤ 0,
    〈u, u_x〉_{L^2} = ∫_{Γ_E} u^2 dy − 〈u_x, u〉_{L^2},  and thus 〈u, u_x〉_{L^2} ≥ 0.

Hence, we have

    ‖u_x‖_{L^2}^2 ≤ 〈f, u_x〉_{L^2} ≤ ‖f‖_{L^2}‖u_x‖_{L^2}.   (4.14)

Testing the differential equation with u (instead of u_x) yields

    c‖u‖_{L^2}^2 = 〈f, u〉_{L^2} − ε‖∇u‖_{L^2}^2 − 〈u_x, u〉_{L^2}.

This yields

    c‖u‖_{L^2}^2 ≤ ‖f‖_{L^2}‖u‖_{L^2}.   (4.15)

Combination of (4.14) and (4.15) completes the proof.

We note that for this problem a similar ε-independent bound for the derivative u_y does not hold. A sharp inequality of the form ‖u_y‖_{L^2} ≤ ε^{-1/2} ‖f‖_{L^2} can be shown. Furthermore, for the uniform bound on ‖u_x‖_{L^2} in (4.13) to hold it is essential that we consider the convection-diffusion problem with Neumann boundary conditions at the outflow boundary. Due to this there is no exponential boundary layer at the outflow boundary.

4.2 A variant of the Cea-lemma

In the analysis of the finite element method in chapter 3 the basic Cea-lemma plays an important role. In the analysis of finite element methods for convection-dominated elliptic problems we will need a variant of this lemma, which is presented in this section and is based on a basic lemma given in [94]:

Lemma 4.2.1 Let U be a normed linear space with norm ‖·‖ and V a subspace of U. Let s(·,·) be a continuous bilinear form on U × V and t(·,·) a bilinear form on U × V such that for all u ∈ U the functional v ↦ t(u, v) is bounded on V. Define r := s + t and assume that r is V-elliptic. Let c_0 > 0 and c_1 be such that

    r(v, v) ≥ c_0 ‖v‖^2   for all v ∈ V,   (4.16)
    s(u, v) ≤ c_1 ‖u‖ ‖v‖   for all u ∈ U, v ∈ V.   (4.17)


On U we define the semi-norm ‖u‖_* := sup_{v∈V} t(u, v)/‖v‖. Then the following holds:

    |r(u, v)| ≤ max{c_1, 1} ( ‖u‖ + ‖u‖_* ) ‖v‖   for all u ∈ U, v ∈ V,   (4.18)

    sup_{v∈V} r(u, v)/‖v‖ ≥ ( c_0 / (1 + c_0 + c_1) ) ( ‖u‖ + ‖u‖_* )   for all u ∈ V.   (4.19)

Proof. For u ∈ U, v ∈ V we have

    |r(u, v)| ≤ |s(u, v)| + |t(u, v)| ≤ c_1‖u‖‖v‖ + ‖u‖_*‖v‖ ≤ max{c_1, 1} ( ‖u‖ + ‖u‖_* ) ‖v‖,

and thus (4.18) holds. We now consider (4.19). Take a fixed u ∈ V and θ ∈ (0,1). Then there exists v_θ ∈ V such that ‖v_θ‖ = 1 and θ‖u‖_* ≤ t(u, v_θ). Note that

    r(u, v_θ) = s(u, v_θ) + t(u, v_θ) ≥ θ‖u‖_* − c_1‖u‖,

and thus for w_θ := u + ( c_0‖u‖/(1 + c_1) ) v_θ ∈ V we obtain

    r(u, w_θ) = r(u, u) + ( c_0‖u‖/(1 + c_1) ) r(u, v_θ)
              ≥ c_0‖u‖^2 + ( c_0‖u‖/(1 + c_1) ) ( θ‖u‖_* − c_1‖u‖ )
              = ( c_0/(1 + c_1) ) ( ‖u‖ + θ‖u‖_* ) ‖u‖.   (4.20)

Furthermore,

    ‖w_θ‖ ≤ ‖u‖ + c_0‖u‖/(1 + c_1) = ( (1 + c_0 + c_1)/(1 + c_1) ) ‖u‖   (4.21)

holds. Combination of (4.20) and (4.21) yields

    r(u, w_θ)/‖w_θ‖ ≥ ( c_0/(1 + c_0 + c_1) ) ( ‖u‖ + θ‖u‖_* ).

Because w_θ ∈ V and θ ∈ (0,1) is arbitrary this proves the result in (4.19).

We emphasize that the seminorm ‖·‖_* on U depends on the bilinear form t(·,·) and on the subspace V. Also note that in (4.18) we have a boundedness result on U × V, whereas in (4.19) we have an infsup bound on V × V.

Using this lemma we can derive the following variant of the Cea-lemma (theorem 3.1.1).

Theorem 4.2.2 Let the conditions of lemma 4.2.1 be satisfied. Take f ∈ U′ and assume that there exist u ∈ U, v ∈ V such that

    r(u, w) = f(w)   for all w ∈ U,   (4.22a)
    r(v, w) = f(w)   for all w ∈ V.   (4.22b)

Then the following holds:

    ‖u − v‖ + ‖u − v‖_* ≤ C inf_{w∈V} ( ‖u − w‖ + ‖u − w‖_* )   (4.23)

    with C := 1 + max{c_1, 1} (1 + c_0 + c_1)/c_0.   (4.24)


Proof. Let w ∈ V be arbitrary. Using (4.18), (4.19) and the Galerkin property r(u − v, z) = 0 for all z ∈ V we get

    ‖v − w‖ + ‖v − w‖_* ≤ ( (1 + c_0 + c_1)/c_0 ) sup_{z∈V} r(v − w, z)/‖z‖
        = ( (1 + c_0 + c_1)/c_0 ) sup_{z∈V} r(u − w, z)/‖z‖
        ≤ max{c_1, 1} ( (1 + c_0 + c_1)/c_0 ) ( ‖u − w‖ + ‖u − w‖_* ).

Using this and the triangle inequality

    ‖u − v‖ + ‖u − v‖_* ≤ ‖v − w‖ + ‖v − w‖_* + ‖u − w‖ + ‖u − w‖_*

we obtain the result.

In this theorem there are significant differences compared to the Cea-lemma. For example, in theorem 4.2.2 we do not assume that U (or V) is a Hilbert space and we do not assume an infsup property for the bilinear form r(·,·) on U × V (only on V × V, cf. (4.19)). On the other hand, in theorem 4.2.2 we assume existence of solutions in U and V, cf. (4.22), whereas in the Cea-lemma existence and uniqueness of solutions follow from assumptions on continuity and infsup properties of the bilinear form.

4.3 A one-dimensional hyperbolic problem and its finite element discretization

If in a convection-diffusion problem with a bilinear form as in (4.1) one formally takes ε = 0, one obtains a hyperbolic differential operator. In this section we give a detailed treatment of a very simple one-dimensional hyperbolic problem. We show well-posedness of this problem and explain why a standard finite element discretization method suffers from an instability. Furthermore, a stabilization technique is introduced that results in a finite element method with much better properties. In section 4.4 essentially the same analysis is applied to the finite element discretization of the convection-diffusion problem (4.1)-(4.2).

We consider the hyperbolic problem

    b u′(x) + u(x) = f(x),   x ∈ I := (0,1),   b > 0 a given constant,
    u(0) = 0.   (4.25)

For the weak formulation we introduce the Hilbert spaces H_1 = { v ∈ H^1(I) | v(0) = 0 } and H_2 = L^2(I). The norm on H_1 is ‖v‖_1^2 = ‖v′‖_{L^2}^2 + ‖v‖_{L^2}^2. On H_1 × H_2 we define the bilinear form

    k(u, v) = ∫_0^1 ( b u′ v + u v ) dx.

Theorem 4.3.1 Let f ∈ L^2(I). There exists a unique u ∈ H_1 such that

    k(u, v) = 〈f, v〉_{L^2}   for all v ∈ H_2.   (4.26)

Moreover, ‖u‖_1 ≤ c‖f‖_{L^2} holds with c independent of f.


Proof. We apply theorem 2.3.1. The bilinear form k(·,·) is continuous on H_1 × H_2:

    |k(u, v)| ≤ b‖u′‖_{L^2}‖v‖_{L^2} + ‖u‖_{L^2}‖v‖_{L^2} ≤ √2 max{1, b} ‖u‖_1 ‖v‖_{L^2},   u ∈ H_1, v ∈ H_2.

For u ∈ H_1 we have

    sup_{v∈H_2} k(u, v)/‖v‖_{L^2} = sup_{v∈H_2} 〈bu′ + u, v〉_{L^2}/‖v‖_{L^2} = ‖bu′ + u‖_{L^2}
        = ( b^2‖u′‖_{L^2}^2 + ‖u‖_{L^2}^2 + 2b〈u′, u〉_{L^2} )^{1/2}.

Using u(0) = 0 we get 〈u′, u〉_{L^2} = u(1)^2 − 〈u, u′〉_{L^2} and thus 〈u′, u〉_{L^2} ≥ 0. Hence we get

    sup_{v∈H_2} k(u, v)/‖v‖_{L^2} ≥ min{1, b} ‖u‖_1   for all u ∈ H_1,

i.e., the infsup condition (2.36) in theorem 2.3.1 is satisfied. We now consider the condition (2.37) in this theorem. Let v ∈ H_2 be such that k(u, v) = 0 for all u ∈ H_1. This implies b∫_0^1 u′v dx = −∫_0^1 uv dx for all u ∈ C_0^∞(I), and thus v ∈ H^1(I) with v′ = (1/b)v (weak derivative). Using this we obtain

    −∫_0^1 uv dx = b∫_0^1 u′v dx = b u(1)v(1) − b∫_0^1 uv′ dx = b u(1)v(1) − ∫_0^1 uv dx   for all u ∈ H_1,

and thus u(1)v(1) = 0 for all u ∈ H_1. This implies v(1) = 0. Using this and bv′ − v = 0 yields

    ‖v‖_{L^2}^2 = 〈v, v〉_{L^2} + 〈bv′ − v, v〉_{L^2} = b〈v′, v〉_{L^2} = (b/2)( v(1)^2 − v(0)^2 ) = −(b/2) v(0)^2 ≤ 0.

This implies v = 0 and thus condition (2.37) is satisfied. Application of theorem 2.3.1 now yields existence and uniqueness of a solution u ∈ H_1 and

    ‖u‖_1 ≤ c sup_{v∈H_2} 〈f, v〉_{L^2}/‖v‖_{L^2} = c ‖f‖_{L^2},

which completes the proof.

For the discretization of this problem we use a Galerkin method with a standard finite element space. To simplify the notation we use a uniform grid and consider only linear finite elements. Let h = 1/N, x_i = ih, 0 ≤ i ≤ N, and

    X_h = { v ∈ C(I) | v(0) = 0, v|_{[x_i, x_{i+1}]} ∈ P_1 for 0 ≤ i ≤ N−1 }.

Note that X_h ⊂ H_1 and X_h ⊂ H_2. The discretization is as follows:

    determine u_h ∈ X_h such that k(u_h, v_h) = 〈f, v_h〉_{L^2} for all v_h ∈ X_h.   (4.27)

For the error analysis of this method we apply the Cea-lemma (theorem 3.1.1). The conditions (3.2), (3.3), (3.4) in theorem 3.1.1 have been shown to hold in the proof of theorem 4.3.1. It remains to verify the discrete infsup condition:

    ∃ ε_h > 0 :  sup_{v_h∈X_h} k(u_h, v_h)/‖v_h‖_{L^2} ≥ ε_h ‖u_h‖_1   for all u_h ∈ X_h.   (4.28)

Related to this we give the following lemma:


Lemma 4.3.2 The infsup property (4.28) holds with ε_h = c h, c > 0 independent of h.

Proof. For u_h ∈ X_h we have 〈u_h′, u_h〉_{L^2} = (1/2) u_h(1)^2 ≥ 0 and thus

    sup_{v_h∈X_h} k(u_h, v_h)/‖v_h‖_{L^2} ≥ k(u_h, u_h)/‖u_h‖_{L^2} = ( b〈u_h′, u_h〉_{L^2} + ‖u_h‖_{L^2}^2 )/‖u_h‖_{L^2} ≥ ‖u_h‖_{L^2}.

Now apply an inverse inequality, cf. lemma 3.3.11, ‖v_h′‖_{L^2} ≤ c h^{-1} ‖v_h‖_{L^2} for all v_h ∈ X_h, resulting in ‖u_h‖_{L^2} ≥ (1/2)‖u_h‖_{L^2} + c h ‖u_h′‖_{L^2} ≥ c h ‖u_h‖_1 with a constant c > 0 independent of h.

Remark 4.3.3 The result in the previous lemma is sharp in the sense that the best (i.e. largest) infsup constant ε_h in (4.28) in general satisfies ε_h ≤ c h. This can be deduced from a numerical experiment or from a technical analytical derivation. Here we present results of a numerical experiment. We consider the continuous and discrete problems as in (4.26), (4.27) with b = 1. Discretization of the bilinear forms (u,v) ↦ 〈u′, v〉_{L^2}, (u,v) ↦ 〈u, v〉_{L^2} and (u,v) ↦ 〈u′, v′〉_{L^2} in the finite element space X_h (with respect to the nodal basis) results in the N×N matrices

    C_h = (1/2) tridiag(−1, 0, 1)   with (C_h)_{N,N} = 1/2,
    M_h = (h/6) tridiag(1, 4, 1)    with (M_h)_{N,N} = h/3,   (4.29)
    A_h = (1/h) tridiag(−1, 2, −1)  with (A_h)_{N,N} = 1/h.

(The last diagonal entries differ from the interior ones because the basis function at x_N = 1 is a half hat function.)

Note that

infuh∈Xh

supvh∈Xh

k(uh, vh)

‖uh‖1‖vh‖L2

= infx∈RN

supy∈RN

〈Chx+Mhx, y〉2〈(Ah +Mh)x, x〉

12

2 〈Mhy, y〉12

2

= infx∈RN

‖M−12

h (Ch +Mh)x‖2

‖(Ah +Mh)12x‖2

= infz∈RN

‖M−12

h (Ch +Mh)(Ah +Mh)− 1

2 z‖2

‖z‖2

= ‖(Ah +Mh)12 (Ch +Mh)

−1M12

h ‖−12 =: εh.

A (MATLAB) computation of the quantity q(h) := εh/h yields: q( 110 ) = 1.3944, q( 1

50 ) =1.3987, q( 1

250 ) = 1.3988. Hence, in this case εh is proportional to h.
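The quantity εh can be reproduced with a short script. Below is a minimal sketch in Python/NumPy (instead of MATLAB), assuming the matrices Ch, Mh, Ah are assembled as in (4.29) with the standard mass matrix for linear elements:

```python
import numpy as np

def fem_matrices(N):
    """Nodal-basis matrices for linear elements on [0,1] with v(0)=0, h=1/N."""
    h = 1.0 / N
    e = np.ones(N - 1)
    C = 0.5 * (np.diag(e, 1) - np.diag(e, -1))
    C[-1, -1] = 0.5                         # boundary contribution at x=1
    M = (h / 6.0) * (4 * np.eye(N) + np.diag(e, 1) + np.diag(e, -1))
    M[-1, -1] = h / 3.0                     # half element at the right end
    A = (1.0 / h) * (2 * np.eye(N) - np.diag(e, 1) - np.diag(e, -1))
    A[-1, -1] = 1.0 / h
    return C, M, A

def sym_sqrt(S):
    """Square root of a symmetric positive definite matrix via eigendecomposition."""
    lam, Q = np.linalg.eigh(S)
    return Q @ np.diag(np.sqrt(lam)) @ Q.T

def infsup_constant(N):
    """eps_h = || (A+M)^{1/2} (C+M)^{-1} M^{1/2} ||_2^{-1}, cf. remark 4.3.3."""
    C, M, A = fem_matrices(N)
    B = sym_sqrt(A + M) @ np.linalg.inv(C + M) @ sym_sqrt(M)
    return 1.0 / np.linalg.norm(B, 2)

for N in (10, 50):
    print(N, infsup_constant(N) * N)        # q(h) = eps_h / h
```

If εh is indeed proportional to h, the printed ratios q(h) settle near a constant as N grows.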

Using the infsup result of the previous lemma we obtain the following corollaries.

Corollary 4.3.4 The discrete problem (4.27) has a unique solution uh ∈ Xh and the following stability bound holds:

\[ \|u_h\|_{L^2} \le \|f\|_{L^2}. \tag{4.30} \]


Proof. Existence and uniqueness of a solution follows from continuity and ellipticity of the bilinear form k(·, ·) on Xh × Xh. From

\[ \|u_h\|_{L^2}^2 \le k(u_h, u_h) = \langle f, u_h\rangle_{L^2} \le \|f\|_{L^2}\|u_h\|_{L^2} \]

we obtain the stability result.

Note that this stability result for the discrete problem is weaker than the one for the continuous problem in theorem 4.3.1.

Corollary 4.3.5 Let u ∈ H1 and uh ∈ Xh be the solutions of (4.26) and (4.27), respectively. From theorem 3.1.1 we obtain the error bound

\[ \|u - u_h\|_1 \le c\, h^{-1} \inf_{v_h \in X_h} \|u - v_h\|_1, \]

with a constant c independent of h. If u ∈ H2(I) holds, we obtain

\[ \|u - u_h\|_1 \le c\, \|u''\|_{L^2}. \tag{4.31} \]

Remark 4.3.6 Experiment. Is the result in (4.31) sharp? Expectation (for suitable f):

\[ |u - u_h|_1 \sim c, \qquad \|u - u_h\|_{L^2} \le c\, h. \]

These results show that, due to the deterioration of the infsup stability constant εh for h ↓ 0, the discretization with standard linear finite elements is not satisfactory. A heuristic explanation for this instability phenomenon can be given via the matrix Ch in (4.29) that represents the finite element discretization of u → u′. The differential equation (in strong form) is u′ = −(1/b)u + (1/b)f on (0, 1), which is a first order ordinary differential equation. The initial condition is given by u(0) = 0. For discretization of u′(xi) we use (cf. Ch) the central difference (1/2h)(u(xi+1) − u(xi−1)). Thus for the approximation of u′ at “time” x = xi we use u at the future “time” x = xi+1, which is an unnatural approach.

We now turn to the question how a better finite element discretization for this very simple problem can be developed. One possibility is to use suitable different finite element spaces H1,h ⊂ H1 and H2,h ⊂ H2. This leads to a so-called Petrov-Galerkin method. We do not treat such methods here, but refer to the literature, for example [76]. From an implementation point of view it is convenient to use only one finite element space instead of two different ones. We will show how a satisfactory discretization with (only) the space Xh of linear finite elements can be obtained by using the concept of stabilization.

A first stabilized method is based on the following observation. If u ∈ H1 satisfies (4.26), then

\[ \int_0^1 (b u' + u)\, b v'\, dx = \langle f, b v'\rangle_{L^2} \quad \text{for all } v \in H_1 \tag{4.32} \]

also holds. By adding this equation to the one in (4.26) it follows that the solution u ∈ H1 satisfies

\[ \langle b u' + u,\, b v' + v\rangle_{L^2} = \langle f,\, b v' + v\rangle_{L^2} \quad \text{for all } v \in H_1. \tag{4.33} \]


Based on this we introduce the notation

\[ k_1(u, v) := \langle b u' + u,\, b v' + v\rangle_{L^2}, \quad u, v \in H_1, \]
\[ f_1(v) := \langle f,\, b v' + v\rangle_{L^2}, \quad v \in H_1. \]

The bilinear form k1(·, ·) is continuous on H1 × H1 and f1 is continuous on H1. Moreover, k1(·, ·) is symmetric and using 〈v′, v〉L2 = ½ v(1)² ≥ 0 for v ∈ H1, we get

\[ k_1(v, v) = b^2 \|v'\|_{L^2}^2 + \|v\|_{L^2}^2 + 2b\langle v', v\rangle_{L^2} \ge \min\{b^2, 1\}\, \|v\|_1^2 \quad \text{for } v \in H_1, \tag{4.34} \]

and thus k1(·, ·) is elliptic on H1. The discrete problem is as follows:

determine uh ∈ Xh such that k1(uh, vh) = f1(vh) for all vh ∈ Xh. (4.35)

Due to Xh ⊂ H1 and the H1-ellipticity of k1(·, ·) this problem has a unique solution uh ∈ Xh. For the discretization error we obtain the following result.

Lemma 4.3.7 Let u ∈ H1 be the solution of (4.26) (or (4.33)) and uh the solution of (4.35). The following holds:

\[ \|u - u_h\|_1 \le c \inf_{v_h \in X_h} \|u - v_h\|_1 \]

with a constant c independent of h. If u ∈ H2(I) then

\[ \|u - u_h\|_1 \le c\, h\, \|u''\|_{L^2} \tag{4.36} \]

holds with a constant c independent of h.

Proof. Apply corollary 3.1.2.

From (4.34) and f1(v) ≤ max{b, 1}√2 ‖f‖L2‖v‖1 it follows that the discrete problem has the stability property ‖uh‖1 ≤ c ‖f‖L2, which is similar to the stability property of the continuous solution given in theorem 4.3.1 and (significantly) better than the one for the original discrete problem, cf. (4.30). This explains why the discretization in (4.35) is called a stabilized finite element method. From (4.34) one can see that k1(·, ·) contains an (artificial) diffusion term that is not present in k(·, ·). Note that the bounds in lemma 4.3.7 are significantly better than the ones in corollary 4.3.5.

If u ∈ H2(I) then from (4.36) we have the L2-error bound

\[ \|u - u_h\|_{L^2} \le c\, h\, \|u''\|_{L^2}. \tag{4.37} \]

In section 3.4.2, for elliptic problems, a duality argument is used to derive an L2-error bound of the order h² for linear finite elements. Such a duality argument can not be applied to hyperbolic problems due to the fact that the H2-regularity assumption that is crucial in the analysis in section 3.4.2 is usually not satisfied for hyperbolic problems. In the following remark this is made clear for the simple hyperbolic problem that is treated in this section.

Remark 4.3.8 Consider the problem in (4.25) with b = 1 and \( f(x) = \frac{1-\beta}{\beta} e^{-\beta x} - \frac{1}{\beta} \), with a constant β ≥ 1. Substitution shows that the solution is given by \( u(x) = \frac{1}{\beta}(e^{-\beta x} - 1) \). Note that u, f ∈ C∞(I). Further elementary computations yield

\[ \|f\|_{L^2} \le 2, \qquad \|u''\|_{L^2} \ge \tfrac14 \sqrt{\beta}. \]

Hence a bound ‖u″‖L2 ≤ c ‖f‖L2 with a constant c independent of f ∈ L2(I) can not hold, i.e., this problem is not H2-regular.


We now generalize the stabilized finite element method presented above and show that using this generalization we can derive a method with an H1-error bound of the order h (as in (4.36)) and an improved L2-error bound of the order h^{3/2}. This generalization is obtained by adding δ-times, with δ a parameter in [0, 1], the equation in (4.32) to the one in (4.26). This shows that the solution u ∈ H1 of (4.26) also satisfies

\[ k_\delta(u, v) = f_\delta(v) \quad \text{for all } v \in H_1, \text{ with} \tag{4.38a} \]
\[ k_\delta(u, v) := \langle b u' + u,\; \delta b v' + v\rangle_{L^2}, \qquad f_\delta(v) := \langle f,\; \delta b v' + v\rangle_{L^2}. \tag{4.38b} \]

Note that for δ = 0 we have the original variational formulation and that δ = 1 results in the problem (4.33). For δ ≠ 1 the bilinear form kδ(·, ·) is not symmetric. For all δ ∈ [0, 1] we have fδ ∈ H1′. The discrete problem is as follows:

determine uh ∈ Xh such that kδ(uh, vh) = fδ(vh) for all vh ∈ Xh. (4.39)

The discrete solution uh (if it exists) depends on δ. We investigate how δ can be chosen such that the discretization error (bound) is minimized. For this analysis we use the abstract results in section 4.2. We write

\[ k_\delta(u, v) = s_\delta(u, v) + t_\delta(u, v), \quad u, v \in H_1, \text{ with} \]
\[ s_\delta(u, v) = \delta\langle b u', b v'\rangle_{L^2} + \langle u, v\rangle_{L^2}, \qquad t_\delta(u, v) = \delta\langle u, b v'\rangle_{L^2} + \langle b u', v\rangle_{L^2}. \]

The bilinear form sδ(·, ·) defines a scalar product on H1. We introduce the norm and the seminorm (cf. lemma 4.2.1)

\[ |||u|||_\delta := s_\delta(u, u)^{\frac12}, \qquad \|u\|_{*,h,\delta} := \sup_{v_h \in X_h} \frac{t_\delta(u, v_h)}{|||v_h|||_\delta}, \quad \text{for } u \in H_1. \]

Note that

\[ t_\delta(u, u) = b(\delta + 1)\langle u', u\rangle_{L^2} = \tfrac12 b(\delta + 1)u(1)^2 \ge 0 \quad \text{for all } u \in H_1, \tag{4.40} \]

and

\[ \tfrac{1}{\sqrt{2}}\big( b\sqrt{\delta}\,|u|_1 + \|u\|_{L^2} \big) \le |||u|||_\delta \le b\sqrt{\delta}\,|u|_1 + \|u\|_{L^2} \quad \text{for all } u \in H_1. \tag{4.41} \]

Lemma 4.3.9 For all δ ∈ [0, 1] the continuous problem (4.38) and the discrete problem (4.39) have unique solutions u and uh, respectively. The discrete solution satisfies the stability bound

\[ b\sqrt{\delta}\,|u_h|_1 + \|u_h\|_{L^2} \le 2\|f\|_{L^2}. \]

Proof. For δ = 0 the existence and uniqueness of solutions is given in theorem 4.3.1 and corollary 4.3.4. The stability result for δ = 0 also follows from corollary 4.3.4. For δ > 0 we obtain, using (4.40),

\[ k_\delta(u, u) = \delta b^2 \|u'\|_{L^2}^2 + \|u\|_{L^2}^2 + t_\delta(u, u) \ge \gamma \|u\|_1^2 \quad \text{for } u \in H_1, \text{ with } \gamma := \min\{\delta b^2, 1\} > 0, \]

and

\[ k_\delta(u, v) \le \big( b\|u'\|_{L^2} + \|u\|_{L^2} \big)\big( \delta b\|v'\|_{L^2} + \|v\|_{L^2} \big) \le c\, \|u\|_1 \|v\|_1 \quad \text{for } u, v \in H_1. \]


Hence kδ(·, ·) is elliptic and continuous on H1. The Lax-Milgram lemma implies that both the continuous and the discrete problem have a unique solution. For v ∈ H1 we have, cf. (4.40),

\[ k_\delta(v, v) = s_\delta(v, v) + t_\delta(v, v) \ge |||v|||_\delta^2. \tag{4.42} \]

Furthermore, using δ ≤ 1 we get

\[ f_\delta(v) \le \|f\|_{L^2}\big( b\delta|v|_1 + \|v\|_{L^2} \big) \le \|f\|_{L^2}\big( b\sqrt{\delta}\,|v|_1 + \|v\|_{L^2} \big) \le \sqrt{2}\,\|f\|_{L^2}\, |||v|||_\delta \quad \text{for } v \in H_1. \]

This yields |||uh|||²δ ≤ kδ(uh, uh) = fδ(uh) ≤ √2 ‖f‖L2 |||uh|||δ and thus

\[ b\sqrt{\delta}\,|u_h|_1 + \|u_h\|_{L^2} \le \sqrt{2}\, |||u_h|||_\delta \le 2\|f\|_{L^2}, \]

which completes the proof.

Lemma 4.3.10 Let u and uh be the solutions of (4.38) and (4.39), respectively. The following error bound holds:

\[ |||u - u_h|||_\delta + \|u - u_h\|_{*,h,\delta} \le 4 \inf_{v_h \in X_h} \big( |||u - v_h|||_\delta + \|u - v_h\|_{*,h,\delta} \big). \tag{4.43} \]

Proof. To derive this error bound we use theorem 4.2.2 with U = H1, V = Xh, r(·, ·) = kδ(·, ·), ‖ · ‖ = ||| · |||δ, ‖ · ‖∗ = ‖ · ‖∗,h,δ. We verify the corresponding conditions in lemma 4.2.1. The bilinear form sδ(·, ·) is continuous on U = H1: sδ(u, v) ≤ |||u|||δ |||v|||δ. Hence (4.17) is satisfied with c1 = 1. For u ∈ H1 the functional v → tδ(u, v) is clearly continuous on V = Xh. From (4.42) it follows that condition (4.16) is satisfied with c0 = 1. Application of theorem 4.2.2 yields

\[ |||u - u_h|||_\delta + \|u - u_h\|_{*,h,\delta} \le C \inf_{v_h \in X_h} \big( |||u - v_h|||_\delta + \|u - v_h\|_{*,h,\delta} \big), \]

with C = 1 + max{c₁, 1}(1 + c₀ + c₁)/c₀ = 4.

For the Sobolev space H1 we have H1 ⊂ C(I) and thus the nodal interpolation

\[ I_X : H_1 \to X_h, \qquad (I_X u)(x_i) = u(x_i), \quad 0 \le i \le N, \]

is well-defined.

Theorem 4.3.11 Let u ∈ H1 and uh ∈ Xh be the solutions of (4.38) and (4.39), respectively. For all δ ∈ [0, 1] the error bound

\[ b\sqrt{\delta}\,|u - u_h|_1 + \|u - u_h\|_{L^2} \le C \Big( b\sqrt{\delta}\,|u - I_X u|_1 + \big(1 + \min\{\tfrac{b}{h}, \tfrac{1}{\sqrt{\delta}}\}\big)\|u - I_X u\|_{L^2} \Big) \]

holds with a constant C independent of h, δ, b and u.

Proof. From lemma 4.3.10 and (4.41) we obtain

\[ b\sqrt{\delta}\,|u - u_h|_1 + \|u - u_h\|_{L^2} \le 4\sqrt{2}\big( b\sqrt{\delta}\,|u - I_X u|_1 + \|u - I_X u\|_{L^2} + \|u - I_X u\|_{*,h,\delta} \big). \tag{4.44} \]

Define eh := u − IXu, and note that eh(0) = eh(1) = 0. Thus we have

\[ \langle e_h', v_h\rangle_{L^2} = -\langle e_h, v_h'\rangle_{L^2} \quad \text{for all } v_h \in X_h. \]


Using this and the inverse inequality |vh|1 ≤ c h⁻¹‖vh‖L2 for all vh ∈ Xh we obtain

\[
\|e_h\|_{*,h,\delta} = \sup_{v_h \in X_h} \frac{t_\delta(e_h, v_h)}{|||v_h|||_\delta}
= \sup_{v_h \in X_h} \frac{b(1-\delta)\langle e_h, v_h'\rangle_{L^2}}{\big(\delta b^2 |v_h|_1^2 + \|v_h\|_{L^2}^2\big)^{\frac12}}
\le \sup_{v_h \in X_h} \frac{b\,\|e_h\|_{L^2}\, |v_h|_1}{\big(\delta + c h^2 b^{-2}\big)^{\frac12}\, b\, |v_h|_1}
\le c \min\big\{\tfrac{1}{\sqrt{\delta}}, \tfrac{b}{h}\big\}\, \|e_h\|_{L^2}, \tag{4.45}
\]

with c independent of δ, h and b. The result follows from combination of (4.44) and (4.45).

Corollary 4.3.12 Let u and uh be as in theorem 4.3.11 and assume that u ∈ H2(I). Then the following error bound holds for δ ∈ [0, 1]:

\[ b\sqrt{\delta}\,|u - u_h|_1 + \|u - u_h\|_{L^2} \le C h \Big[ h + b\sqrt{\delta} + b \min\big\{1, \tfrac{h}{b\sqrt{\delta}}\big\} \Big] \|u''\|_{L^2} \tag{4.46} \]

with a constant C independent of h, δ, b and u.

The term between square brackets in (4.46) is minimal for h ≤ b if we take

\[ \delta = \delta_{\mathrm{opt}} = \frac{h}{b}. \tag{4.47} \]

We consider three cases:

δ = 0 (no stabilization): Then we get ‖u − uh‖L2 ≤ c h‖u″‖L2. This bound for the L2-error is better than the one that follows from (4.31), cf. also remark 4.3.6.

δ = 1 (full stabilization): Then we obtain ‖u − uh‖1 ≤ c h‖u″‖L2, which is the same bound as in lemma 4.3.7.

δ = δopt (optimal value): This results in

\[ |u - u_h|_1 \le c\, h\, \|u''\|_{L^2}, \qquad \|u - u_h\|_{L^2} \le c\, h^{\frac32}\, \|u''\|_{L^2}. \tag{4.48} \]

Hence, the bound for the norm | · |1 is the same as for δ = 1, but we have an improvement in the L2-error bound.

From these discretization error results and from the stability result in lemma 4.3.9 we see that δ = 0 leads to poor accuracy and poor stability properties. The best stability property is for the case δ = 1. A somewhat weaker stability property but a better approximation property is obtained for δ = δopt. For δ = δopt we have a good compromise between sufficient stability and high approximation quality. Finding such a compromise is a topic that is important in all stabilized finite element methods.

Remark 4.3.13 Experiments to show dependence of errors on δ. Is the bound in (4.46) sharp?
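An experiment of the kind asked for in remark 4.3.13 can be sketched as follows in Python/NumPy. The manufactured solution u(x) = x² (so f = 2bx + x² for b = 1) is an assumption made for this illustration and is not from the text; the matrices are the nodal-basis matrices of (4.29):

```python
import numpy as np

def solve_stabilized(N, delta, b=1.0):
    """Solve the stabilized problem (4.39) for b u' + u = f on (0,1), u(0)=0.

    Manufactured solution u(x) = x**2, hence f(x) = 2*b*x + x**2 (an assumption
    for this illustration). Returns the maximal nodal error.
    """
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)
    e = np.ones(N - 1)
    # nodal-basis matrices for the unknowns at x_1..x_N, cf. (4.29)
    A = (1.0 / h) * (2 * np.eye(N) - np.diag(e, 1) - np.diag(e, -1)); A[-1, -1] = 1.0 / h
    C = 0.5 * (np.diag(e, 1) - np.diag(e, -1)); C[-1, -1] = 0.5
    M = (h / 6.0) * (4 * np.eye(N) + np.diag(e, 1) + np.diag(e, -1)); M[-1, -1] = h / 3.0
    # k_delta(u,v) = delta*b^2 <u',v'> + b <u',v> + delta*b <u,v'> + <u,v>
    K = delta * b**2 * A + b * C + delta * b * C.T + M
    # f_delta(phi_i) = <f, delta*b*phi_i' + phi_i>, 2-point Gauss per element
    # (exact here: quadratic f times linear basis functions)
    f = lambda t: 2.0 * b * t + t**2
    F = np.zeros(N + 1)
    g = 0.5 / np.sqrt(3.0)
    for k in range(N):
        for w, s in ((0.5, 0.5 - g), (0.5, 0.5 + g)):
            t = x[k] + s * h
            F[k]     += w * h * f(t) * (-delta * b / h + (1.0 - s))
            F[k + 1] += w * h * f(t) * ( delta * b / h + s)
    uh = np.linalg.solve(K, F[1:])
    return np.max(np.abs(uh - x[1:]**2))

for delta in (0.0, 1.0 / 50, 1.0):   # delta_opt = h/b for N = 50, b = 1
    print(delta, solve_stabilized(50, delta))
```

Comparing the printed errors for δ = 0, δ = δopt and δ = 1 over several grids gives an impression of how sharp the bound (4.46) is.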

4.4 The convection-diffusion problem reconsidered

In chapter 3 we treated the finite element discretization of the variational problem in (4.2). Under the assumption (4.3) we have

\[ k(u, u) \ge \varepsilon |u|_1^2 \quad \text{for all } u \in H_0^1(\Omega), \tag{4.49} \]
\[ k(u, v) \le M |u|_1 |v|_1 \quad \text{for all } u, v \in H_0^1(\Omega), \tag{4.50} \]


with a constant M independent of ε. Now recall the standard Galerkin discretization with linear finite elements, i.e.: uh ∈ X¹h,0 such that

\[ k(u_h, v_h) = f(v_h) \quad \text{for all } v_h \in X_{h,0}^1. \]

From the analysis in chapter 3 (corollary 3.1.2 and corollary 3.3.10) we obtain the discretization error bound

\[ |u - u_h|_1 \le \frac{M}{\varepsilon} \inf_{v_h \in X_{h,0}^1} |u - v_h|_1 \le C\, \frac{M}{\varepsilon}\, h\, |u|_2, \tag{4.51} \]

provided u ∈ H2(Ω). The constant C is independent of h and ε. We can apply the duality argument used in section 3.4.2 to derive a bound for the error in the L2-norm. The dual problem has the same form as in (4.1)-(4.2) but with b replaced by −b. Assume that (4.4) also holds with b replaced by −b and that the solution of the dual problem lies in H2(Ω) (the latter is true if Ω is convex). Then a regularity result as in (4.6) holds for the solution of the dual problem. Using this we obtain

\[ \|u - u_h\|_{L^2} \le C\, \varepsilon^{-\frac52} h^2 |u|_2, \tag{4.52} \]

with a constant C independent of h and ε. Even if |u|2 remains bounded for ε ↓ 0 (no boundary layers) the bounds in (4.51) and (4.52) tend to infinity for ε ↓ 0. These bounds, however, are too pessimistic and do not reflect important phenomena that are observed in numerical experiments. For example, from experiments it is known that the standard linear finite element discretization yields satisfactory results if h ≈ ε and ε ↓ 0. This, however, is not predicted by the bounds in (4.51) and (4.52).

Below we present a refined analysis based on the same approach as in section 4.3 which results in better bounds for the discretization error. These bounds reflect important properties of the standard Galerkin finite element discretization applied to the convection-diffusion problem in (4.2) and show the effect of introducing a stabilization. In section 4.4.1 we consider well-posedness of the problem in (4.2). In section 4.4.2 we analyze a stabilized finite element method for this problem.

4.4.1 Well-posedness of the continuous problem

The (sharp) result in (4.49) shows that in the norm | · |1 (or equivalently ‖ · ‖1) the convection-diffusion problem is not uniformly well-posed for ε ↓ 0. In this section we introduce other norms in which the problem is uniformly well-posed. For a better understanding of the analysis we first present results for a two-dimensional hyperbolic problem:

Remark 4.4.1 We consider a two-dimensional variant of the hyperbolic problem treated in section 4.3. Let Ω := (0, 1)², b := (1 0)ᵀ, f ∈ L2(Ω). The continuous problem is as follows: determine u such that

\[
\begin{aligned}
\mathbf{b} \cdot \nabla u + u &= f \quad &&\text{in } \Omega, \\
u &= 0 \quad &&\text{on } \Gamma_W := \{\, (x, y) \in \partial\Omega \mid x = 0 \,\}.
\end{aligned} \tag{4.53}
\]

Let H1 be the space of functions u ∈ L2(Ω) for which the weak derivative ux = ∂u/∂x exists: H1 := { u ∈ L2(Ω) | ux ∈ L2(Ω) }. This space with the scalar product

\[ \langle u, v\rangle_{H_1} = \langle u, v\rangle_{L^2} + \langle \mathbf{b} \cdot \nabla u, \mathbf{b} \cdot \nabla v\rangle_{L^2} = \langle u, v\rangle_{L^2} + \langle u_x, v_x\rangle_{L^2} \tag{4.54} \]

is a Hilbert space (follows from the same arguments as for the Sobolev space H1(Ω)). Take H2 := L2(Ω). The bilinear form corresponding to the problem in (4.53) is

\[ k(u, v) = \langle \mathbf{b} \cdot \nabla u, v\rangle_{L^2} + \langle u, v\rangle_{L^2}, \quad u \in H_1,\; v \in H_2. \]

Using the same arguments as in the proof of theorem 4.3.1 one can show that there exists a unique u ∈ H1 such that

\[ k(u, v) = \langle f, v\rangle_{L^2} \quad \text{for all } v \in H_2 \]

and that ‖u‖H1 ≤ c ‖f‖L2 holds with a constant c independent of f ∈ L2(Ω). Thus this hyperbolic problem is well-posed in the space H1 × H2. Note that the stability result is similar to the one for the convection-diffusion problem in theorem 4.1.4.

We now turn to the convection-diffusion problem as in (4.1)-(4.2). As in section 4.3 the analysis uses the abstract setting given in section 4.2. We will need the following assumption:

\[ \text{There are constants } \beta_0, c_b \text{ such that } -\tfrac12 \operatorname{div} \mathbf{b} + c \ge \beta_0 \ge 0, \quad \|c\|_{L^\infty} \le c_b \beta_0. \tag{4.55} \]

We take cb := 0 if β0 = 0. Note that this assumption is somewhat stronger than the one in (4.3) but still covers the important special case div b = 0, c constant and c ≥ 0.

Theorem 4.4.2 Consider the variational problem in (4.2) and assume that (4.55) holds. For u ∈ H¹₀(Ω) define the (semi-)norms

\[ |||u|||_\varepsilon := \big( \varepsilon |u|_1^2 + \beta_0 \|u\|_{L^2}^2 \big)^{\frac12}, \tag{4.56a} \]
\[ \|\mathbf{b} \cdot \nabla u\|_{-\varepsilon} = \|u\|_* := \sup_{v \in H_0^1(\Omega)} \frac{\int_\Omega \mathbf{b} \cdot \nabla u\; v\, dx}{|||v|||_\varepsilon}. \tag{4.56b} \]

Then we have the continuity bound

\[ k(u, v) \le \max\{c_b, 1\} \big( |||u|||_\varepsilon + \|\mathbf{b} \cdot \nabla u\|_{-\varepsilon} \big)\, |||v|||_\varepsilon \quad \text{for all } u, v \in H_0^1(\Omega), \tag{4.57} \]

and the infsup result

\[ \sup_{v \in H_0^1(\Omega)} \frac{k(u, v)}{|||v|||_\varepsilon} \ge \frac{1}{2 + \max\{c_b, 1\}} \big( |||u|||_\varepsilon + \|\mathbf{b} \cdot \nabla u\|_{-\varepsilon} \big) \quad \text{for all } u \in H_0^1(\Omega). \tag{4.58} \]

Proof. We apply lemma 4.2.1 with U = V = H¹₀(Ω), norm ‖ · ‖ = ||| · |||ε and

\[ s(u, v) = \int_\Omega \varepsilon \nabla u \cdot \nabla v + c\, u v\, dx, \qquad t(u, v) = \int_\Omega \mathbf{b} \cdot \nabla u\; v\, dx. \]

For given u ∈ H¹₀(Ω) we have

\[ |t(u, v)| \le c\, \|v\|_{L^2} \le c\, |v|_1 \le \frac{c}{\sqrt{\varepsilon}}\, |||v|||_\varepsilon \]

and thus v → t(u, v) is bounded on V. Note that k(u, v) = s(u, v) + t(u, v) holds. For u ∈ H¹₀(Ω) we have

\[
k(u, u) = \int_\Omega \varepsilon \nabla u \cdot \nabla u + \mathbf{b} \cdot \nabla u\; u + c u^2\, dx
= \int_\Omega \varepsilon \nabla u \cdot \nabla u + \big( -\tfrac12 \operatorname{div} \mathbf{b} + c \big) u^2\, dx
\ge \int_\Omega \varepsilon \nabla u \cdot \nabla u + \beta_0 u^2\, dx = |||u|||_\varepsilon^2
\]

and thus (4.16) is satisfied with c0 = 1. Furthermore, for all u, v ∈ H¹₀(Ω) we have

\[
|s(u, v)| \le \varepsilon |u|_1 |v|_1 + \|c\|_{L^\infty} \|u\|_{L^2} \|v\|_{L^2}
\le \big( \varepsilon |u|_1^2 + c_b \beta_0 \|u\|_{L^2}^2 \big)^{\frac12} \big( \varepsilon |v|_1^2 + c_b \beta_0 \|v\|_{L^2}^2 \big)^{\frac12}
\le \max\{c_b, 1\}\, |||u|||_\varepsilon |||v|||_\varepsilon.
\]

Hence (4.17) holds with c1 = max{cb, 1}. The results in (4.18) and (4.19) then yield (4.57) and (4.58), respectively.

The result in this theorem can be interpreted as follows. Let H1 be the space H¹₀(Ω) endowed with the norm ||| · |||ε + ‖b · ∇ · ‖−ε and H2 the space H¹₀(Ω) with the norm ||| · |||ε. Note that these norms are problem dependent. The spaces H1 and H2 are Hilbert spaces. Using the linear operator L : H1 → H2′, L(u)(v) := k(u, v), the variational problem (4.2) can be reformulated as follows: find u ∈ H1 such that Lu = f. From the results in theorem 2.3.1 and theorem 4.4.2 it follows that L is an isomorphism and that the inequalities

\[ \|L\|_{H_2' \leftarrow H_1} \le \max\{c_b, 1\}, \qquad \|L^{-1}\|_{H_1 \leftarrow H_2'} \le 2 + \max\{c_b, 1\} \]

hold. Hence, the operator L : H1 → H2′ is well-conditioned uniformly in ε. In this sense the norms ||| · |||ε + ‖b · ∇ · ‖−ε and ||| · |||ε are natural for the convection-diffusion problem if one is interested in the case ε ↓ 0. If we take β0 > 0 and in the definition of the norms in (4.56) formally put ε = 0 then using a density argument it follows that ‖b · ∇u‖−ε=0 = β0^{−1/2}‖b · ∇u‖L2. Furthermore |||u|||ε=0 = √β0 ‖u‖L2. The resulting norms in the spaces H1 and H2 are precisely those used in the well-posedness of the hyperbolic problem in remark 4.4.1.

The infsup bound in the previous theorem implies the following stability result for the variational problem (4.2).

Corollary 4.4.3 Consider the variational problem in (4.2) and assume that (4.55) holds with β0 > 0. Then the inequality

\[ \sqrt{\varepsilon}\, |u|_1 + \sqrt{\beta_0}\, \|u\|_{L^2} + \sqrt{2}\, \|\mathbf{b} \cdot \nabla u\|_{-\varepsilon} \le \sqrt{2}\,\big( 2 + \max\{c_b, 1\} \big)\, \beta_0^{-\frac12}\, \|f\|_{L^2} \tag{4.59} \]

holds.


Proof. From k(u, v) = 〈f, v〉L2 for all v ∈ H¹₀(Ω) and (4.58) we obtain

\[
|||u|||_\varepsilon + \|\mathbf{b} \cdot \nabla u\|_{-\varepsilon}
\le \big( 2 + \max\{c_b, 1\} \big) \sup_{v \in H_0^1(\Omega)} \frac{\langle f, v\rangle_{L^2}}{|||v|||_\varepsilon}
\le \big( 2 + \max\{c_b, 1\} \big) \sup_{v \in H_0^1(\Omega)} \frac{\|f\|_{L^2}\|v\|_{L^2}}{\big( \varepsilon |v|_1^2 + \beta_0 \|v\|_{L^2}^2 \big)^{\frac12}}
\le \big( 2 + \max\{c_b, 1\} \big)\, \beta_0^{-\frac12}\, \|f\|_{L^2}.
\]

Furthermore, note that

\[ |||u|||_\varepsilon \ge \frac{1}{\sqrt{2}} \big( \sqrt{\varepsilon}\, |u|_1 + \sqrt{\beta_0}\, \|u\|_{L^2} \big) \]

holds.

From this corollary it follows that for the case β0 > 0 the inequality ε^{1/2}‖u‖1 + ‖u‖L2 ≤ C‖f‖L2 holds with a constant C independent of f and ε. This result is proved in theorem 4.1.1, too. However, from corollary 4.4.3 we also obtain

\[ \|\mathbf{b} \cdot \nabla u\|_{-\varepsilon} \le C \|f\|_{L^2} \tag{4.60} \]

with C independent of f and ε. Hence, we have a bound on the derivative in streamline direction. Taking (formally) ε = 0 we obtain a bound ‖u‖L2 + ‖b · ∇u‖L2 ≤ C‖f‖L2, which is (for the example b = (1 0)ᵀ) the same as the stability bound ‖u‖H1 ≤ c‖f‖L2 derived in remark 4.4.1.

For ε = 0 and β0 > 0 the norm ‖ · ‖−ε is equivalent to the L2-norm. A result that relates the norm ‖ · ‖−ε to the more tractable L2-norm also for ε > 0 is given in the following lemma.

Lemma 4.4.4 Let {Th} be a regular quasi-uniform family of triangulations of Ω consisting of n-simplices and let Vh := X^k_{h,0} ⊂ H¹₀(Ω) be the corresponding finite element space. Let Ph : L2(Ω) → Vh be the L2-orthogonal projection on Vh:

\[ \langle P_h w, v_h\rangle_{L^2} = \langle w, v_h\rangle_{L^2} \quad \text{for all } w \in L^2(\Omega),\; v_h \in V_h. \]

Assume that β0 > 0. There exists a constant C > 0 independent of h and ε such that for 0 ≤ ε ≤ h²:

\[ C \|P_h w\|_{L^2} \le \|w\|_{-\varepsilon} \le \beta_0^{-\frac12} \|w\|_{L^2} \quad \text{for all } w \in L^2(\Omega). \tag{4.61} \]

Proof. The second inequality in (4.61) follows from

\[ \|w\|_{-\varepsilon} = \sup_{v \in H_0^1(\Omega)} \frac{\langle w, v\rangle_{L^2}}{\big( \varepsilon |v|_1^2 + \beta_0 \|v\|_{L^2}^2 \big)^{\frac12}} \le \beta_0^{-\frac12} \|w\|_{L^2}. \]

For the first inequality we need the global inverse inequality from lemma 3.3.11: |vh|1 ≤ c h⁻¹‖vh‖L2 for all vh ∈ Vh. Using this inequality we get

\[
\|w\|_* = \sup_{v \in H_0^1(\Omega)} \frac{\langle w, v\rangle_{L^2}}{|||v|||_\varepsilon}
\ge \frac{\langle w, P_h w\rangle_{L^2}}{|||P_h w|||_\varepsilon}
= \frac{\|P_h w\|_{L^2}^2}{\big( \varepsilon |P_h w|_1^2 + \beta_0 \|P_h w\|_{L^2}^2 \big)^{\frac12}}
\ge \Big( \frac{c^2 \varepsilon}{h^2} + \beta_0 \Big)^{-\frac12} \|P_h w\|_{L^2} \ge C \|P_h w\|_{L^2},
\]

and thus the first inequality in (4.61) holds, too.


4.4.2 Finite element discretization

We now analyze the Galerkin finite element discretization of the convection-diffusion problem. For ease of presentation we only consider simplicial finite elements. The case with rectangular finite elements can be treated analogously. Let {Th} be a regular family of triangulations of Ω consisting of n-simplices and let

\[ V_h := X_{h,0}^k \subset H_0^1(\Omega), \quad k \ge 1, \]

be the corresponding finite element space. The standard discretization is as follows:

determine uh ∈ Vh such that k(uh, vh) = 〈f, vh〉L2 for all vh ∈ Vh. (4.62)

We now use the same stabilization approach as in section 4.3. Assume that for the solution u ∈ H¹₀(Ω) of the convection-diffusion problem we have u ∈ H2(Ω). Then

\[ \int_\Omega \big( -\varepsilon \Delta u + \mathbf{b} \cdot \nabla u + c u \big) v\, dx = \langle f, v\rangle_{L^2} \quad \text{for all } v \in H_0^1(\Omega) \]

holds, but also for arbitrary δ ∈ ℝ:

\[ \int_\Omega \big( -\varepsilon \Delta u + \mathbf{b} \cdot \nabla u + c u \big)\, \delta\, \mathbf{b} \cdot \nabla v\, dx = \langle f,\, \delta\, \mathbf{b} \cdot \nabla v\rangle_{L^2} \quad \text{for all } v \in H_0^1(\Omega). \]

Adding these equations it follows that the solution u satisfies

\[ \langle -\varepsilon \Delta u + \mathbf{b} \cdot \nabla u + c u,\; v + \delta\, \mathbf{b} \cdot \nabla v\rangle_{L^2} = \langle f,\; v + \delta\, \mathbf{b} \cdot \nabla v\rangle_{L^2} \quad \text{for all } v \in H_0^1(\Omega), \]

or, equivalently,

\[ k(u, v) + \delta \langle -\varepsilon \Delta u + \mathbf{b} \cdot \nabla u + c u,\; \mathbf{b} \cdot \nabla v\rangle_{L^2} = \langle f, v\rangle_{L^2} + \delta \langle f, \mathbf{b} \cdot \nabla v\rangle_{L^2} \quad \text{for all } v \in H_0^1(\Omega). \]

This leads to the following discretization: determine uh ∈ Vh such that

\[ k_\delta(u_h, v_h) = f_\delta(v_h) \quad \text{for all } v_h \in V_h, \tag{4.63a} \]
\[ \text{with } k_\delta(u_h, v_h) := k(u_h, v_h) + \sum_{T \in \mathcal{T}_h} \delta_T \langle -\varepsilon \Delta u_h + \mathbf{b} \cdot \nabla u_h + c u_h,\; \mathbf{b} \cdot \nabla v_h\rangle_T, \tag{4.63b} \]
\[ f_\delta(v_h) := \langle f, v_h\rangle_{L^2} + \sum_{T \in \mathcal{T}_h} \delta_T \langle f, \mathbf{b} \cdot \nabla v_h\rangle_T. \tag{4.63c} \]

We use the notation 〈·, ·〉T = 〈·, ·〉L2(T). In (4.63) we consider a sum Σ_{T∈Th}〈·, ·〉T instead of 〈·, ·〉L2 because for uh ∈ Vh the second derivatives in ∆uh are well-defined in each T ∈ Th but ∆uh is not well-defined across edges (faces) in the triangulation. In (4.63) we use a (stabilization) parameter δT for each T ∈ Th instead of one global parameter δ. This offers the possibility to adapt the stabilization parameter to the local mesh size and thus obtain better results if the triangulation is strongly nonuniform.

The continuous solution u ∈ H¹₀(Ω) satisfies

\[ k_\delta(u, v) = f_\delta(v) \quad \text{for all } v \in H_0^1(\Omega), \tag{4.64} \]

provided u ∈ H2(T) for all T ∈ Th. In the remainder we assume that u has this regularity property. If δT = 0 for all T we have the standard (unstabilized) method as in (4.62). In the


remainder of this section we present an error analysis of the discretization (4.63) along the same lines as in section 4.3. We use the abstract analysis in section 4.2 with the spaces

\[ U := \{\, v \in H_0^1(\Omega) \mid v|_T \in H^2(T) \text{ for all } T \in \mathcal{T}_h \,\}, \qquad V = V_h. \]

Note that U depends on Th. We split kδ(·, ·) as follows:

\[ k_\delta(u, v) = s_\delta(u, v) + t_\delta(u, v), \quad u, v \in U, \]
\[ s_\delta(u, v) := \varepsilon \langle \nabla u, \nabla v\rangle_{L^2} + \langle c u, v\rangle_{L^2} + \sum_{T \in \mathcal{T}_h} \delta_T \langle \mathbf{b} \cdot \nabla u, \mathbf{b} \cdot \nabla v\rangle_T, \]
\[ t_\delta(u, v) := \langle \mathbf{b} \cdot \nabla u, v\rangle_{L^2} + \sum_{T \in \mathcal{T}_h} \delta_T \langle -\varepsilon \Delta u + c u, \mathbf{b} \cdot \nabla v\rangle_T. \]

We only consider δT ≥ 0. Then ‖ · ‖ = ||| · |||δ defines a norm on U:

\[ |||u|||_\delta^2 := \varepsilon |u|_1^2 + \beta_0 \|u\|_{L^2}^2 + \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla u\|_T^2, \quad u \in U. \]

We also use the seminorm

\[ \|u\|_{*,h,\delta} := \sup_{v_h \in V_h} \frac{t_\delta(u, v_h)}{|||v_h|||_\delta}, \quad u \in U. \tag{4.65} \]

We will apply theorem 4.2.2. For this we have to verify the corresponding conditions in lemma 4.2.1. First note that vh → tδ(u, vh) is trivially bounded on Vh and thus the seminorm in (4.65) is well-defined. The conditions (4.16)-(4.17) in lemma 4.2.1 are verified in the following two lemmas. We always assume that (4.55) holds.

Lemma 4.4.5 The bilinear form sδ(·, ·) is continuous on U × U:

\[ s_\delta(u, v) \le \max\{c_b, 1\}\, |||u|||_\delta\, |||v|||_\delta \quad \text{for all } u, v \in U. \]

Proof. The result follows from:

\[
s_\delta(u, v) \le \varepsilon |u|_1 |v|_1 + c_b \beta_0 \|u\|_{L^2} \|v\|_{L^2} + \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla u\|_T \|\mathbf{b} \cdot \nabla v\|_T
\]
\[
\le \max\{c_b, 1\} \Big( \varepsilon |u|_1^2 + \beta_0 \|u\|_{L^2}^2 + \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla u\|_T^2 \Big)^{\frac12} \Big( \varepsilon |v|_1^2 + \beta_0 \|v\|_{L^2}^2 + \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla v\|_T^2 \Big)^{\frac12}
= \max\{c_b, 1\}\, |||u|||_\delta\, |||v|||_\delta.
\]

Below we use a local inverse inequality (lemma 3.3.11):

\[ |v_h|_{m,T} \ge \mu_{\mathrm{inv}}\, h_T\, |v_h|_{m+1,T} \quad \text{for all } v_h \in V_h,\; m = 0, 1,\; T \in \mathcal{T}_h, \tag{4.66} \]

with a constant µinv > 0 independent of h and T.

Lemma 4.4.6 If

\[ 0 \le \delta_T \le \frac12 \min\Big\{ \frac{1}{\beta_0 c_b^2},\; \mu_{\mathrm{inv}}^2\, \frac{h_T^2}{\varepsilon} \Big\} \quad \text{for all } T \in \mathcal{T}_h \tag{4.67} \]

holds then the bilinear form kδ(·, ·) is elliptic on Vh:

\[ k_\delta(v_h, v_h) \ge \frac12\, |||v_h|||_\delta^2 \quad \text{for all } v_h \in V_h. \]


Proof. Using 〈b · ∇vh, vh〉L2 = −½〈div b vh, vh〉L2 and (4.55) we obtain

\[
k_\delta(v_h, v_h) \ge \varepsilon |v_h|_1^2 + \beta_0 \|v_h\|_{L^2}^2 + \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla v_h\|_T^2 + \sum_{T \in \mathcal{T}_h} \delta_T \langle -\varepsilon \Delta v_h + c v_h, \mathbf{b} \cdot \nabla v_h\rangle_T
= |||v_h|||_\delta^2 + \sum_{T \in \mathcal{T}_h} \delta_T \langle -\varepsilon \Delta v_h + c v_h, \mathbf{b} \cdot \nabla v_h\rangle_T. \tag{4.68}
\]

For a bound on the last term in (4.68) we use ‖∆vh‖T ≤ µinv⁻¹ hT⁻¹ |vh|1,T, √δT ≤ (1/√2) µinv hT ε^{−1/2} and √δT ≤ (1/√2) β0^{−1/2} cb⁻¹:

\[
\Big| \sum_{T \in \mathcal{T}_h} \delta_T \langle -\varepsilon \Delta v_h + c v_h, \mathbf{b} \cdot \nabla v_h\rangle_T \Big|
\le \sum_{T \in \mathcal{T}_h} \delta_T \Big( \varepsilon \mu_{\mathrm{inv}}^{-1} h_T^{-1} |v_h|_{1,T} \|\mathbf{b} \cdot \nabla v_h\|_T + \beta_0 c_b \|v_h\|_T \|\mathbf{b} \cdot \nabla v_h\|_T \Big)
\]
\[
\le \sum_{T \in \mathcal{T}_h} \Big[ \big( \sqrt{\varepsilon}\, |v_h|_{1,T} \big)\Big( \tfrac{1}{\sqrt{2}} \sqrt{\delta_T}\, \|\mathbf{b} \cdot \nabla v_h\|_T \Big) + \big( \sqrt{\beta_0}\, \|v_h\|_T \big)\Big( \tfrac{1}{\sqrt{2}} \sqrt{\delta_T}\, \|\mathbf{b} \cdot \nabla v_h\|_T \Big) \Big]
\]
\[
\le \frac12 \sum_{T \in \mathcal{T}_h} \Big[ \varepsilon |v_h|_{1,T}^2 + \beta_0 \|v_h\|_T^2 + \delta_T \|\mathbf{b} \cdot \nabla v_h\|_T^2 \Big]
= \frac12 \Big( \varepsilon |v_h|_1^2 + \beta_0 \|v_h\|_{L^2}^2 + \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla v_h\|_T^2 \Big)
= \frac12\, |||v_h|||_\delta^2.
\]

Using this in (4.68) proves the result.

In the condition (4.67) the bound β0⁻¹ cb⁻² should be taken ∞ if β0 = 0 or cb = 0.

From the ellipticity result in the previous lemma we see that if we take δT = δ > 0 such that (4.67) is satisfied, a term δ‖b · ∇vh‖²L2 is added in the ellipticity lower bound |||vh|||²δ which enhances stability. The bilinear form corresponding to this additional term is (u, v) → δ〈b · ∇u, b · ∇v〉L2, which models diffusion in the streamline direction b. Therefore the stabilized method (4.63) with δT > 0 is called the streamline diffusion finite element method, SDFEM.

Remark 4.4.7 If Vh is the space of piecewise linear finite elements then (∆vh)|T = 0. Inspection of the proof shows that the result of the lemma holds with the condition (4.67) replaced by the weaker condition 0 ≤ δT ≤ ½ β0⁻¹ cb⁻².

Corollary 4.4.8 If (4.67) is satisfied then the discrete problem (4.63a) has a unique solution uh ∈ Vh. For β0 > 0 we have the stability bound

\[ \sqrt{\varepsilon}\, |u_h|_1 + \sqrt{\beta_0}\, \|u_h\|_{L^2} + \Big( \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla u_h\|_T^2 \Big)^{\frac12} \le 2\sqrt{3} \Big( \frac{1}{\sqrt{\beta_0}} + \sqrt{\delta_h} \Big) \|f\|_{L^2}, \quad \text{with } \delta_h := \max_{T \in \mathcal{T}_h} \delta_T. \]

Proof. The bilinear form kδ(·, ·) is trivially bounded on the finite dimensional space Vh. Lemma 4.4.6 yields Vh-ellipticity of the bilinear form. Existence of a unique solution follows from the Lax-Milgram lemma. For the left-hand side of the stability inequality we have

\[ \sqrt{\varepsilon}\, |u_h|_1 + \sqrt{\beta_0}\, \|u_h\|_{L^2} + \Big( \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla u_h\|_T^2 \Big)^{\frac12} \le \sqrt{3}\, |||u_h|||_\delta. \]


Lemma 4.4.6 yields

\[ |||u_h|||_\delta^2 \le 2 k_\delta(u_h, u_h) = 2 f_\delta(u_h). \]

Combining this with

\[
f_\delta(u_h) = \langle f, u_h\rangle_{L^2} + \sum_{T \in \mathcal{T}_h} \delta_T \langle f, \mathbf{b} \cdot \nabla u_h\rangle_T
\le \|f\|_{L^2} \|u_h\|_{L^2} + \Big( \sum_{T \in \mathcal{T}_h} \delta_T \|f\|_T^2 \Big)^{\frac12} \Big( \sum_{T \in \mathcal{T}_h} \delta_T \|\mathbf{b} \cdot \nabla u_h\|_T^2 \Big)^{\frac12}
\]
\[
\le \frac{1}{\sqrt{\beta_0}}\, \|f\|_{L^2}\, |||u_h|||_\delta + \sqrt{\delta_h}\, \|f\|_{L^2}\, |||u_h|||_\delta = \Big( \frac{1}{\sqrt{\beta_0}} + \sqrt{\delta_h} \Big) \|f\|_{L^2}\, |||u_h|||_\delta
\]

completes the proof.

Remark 4.4.9 As an example consider the case with linear finite elements, δT = δ for all T and β0 = 1. Then the stability result of this corollary takes the form

\[ \sqrt{\varepsilon}\, |u_h|_1 + \|u_h\|_{L^2} + \sqrt{\delta}\, \|\mathbf{b} \cdot \nabla u_h\|_{L^2} \le 2\sqrt{3} \big( 1 + \sqrt{\delta} \big) \|f\|_{L^2}, \quad \text{for } \delta \in \big[0, \tfrac12 c_b^{-2}\big]. \tag{4.69} \]

Note the similarity of this result with the one in corollary 4.4.3 (for the continuous problem) and in lemma 4.3.9 (for the stabilized finite element method applied to the 1D hyperbolic problem).

From the results in corollary 4.4.3 and (4.69) we see that one obtains the strongest stability result if δT is chosen as large as possible (maximal stabilization). In section 4.3 it is shown that smaller values for the stabilization parameter may lead to smaller discretization errors. Below we give an analysis on how to choose the parameter δT such that the (bound for the) discretization error is minimized.
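As a concrete illustration of the constraint (4.67), a per-element parameter choice can be sketched as follows. The helper below is hypothetical (not from the text); it combines the two caps of (4.67) with an h/b-type proposal in the spirit of (4.47), and the constants µinv, cb, β0 must be supplied by the user:

```python
import math

def delta_T(h_T, eps, b_norm, beta0, c_b, mu_inv):
    """Per-element SDFEM parameter: h_T/b proposal capped by condition (4.67).

    All arguments are user-supplied constants; this helper is an illustrative
    sketch, not a prescription from the text.
    """
    cap_inv = 0.5 * mu_inv**2 * h_T**2 / eps                 # second bound in (4.67)
    # first bound in (4.67); taken as infinity if beta0 = 0 or c_b = 0
    cap_reac = math.inf if beta0 == 0 or c_b == 0 else 0.5 / (beta0 * c_b**2)
    proposal = h_T / b_norm                                  # cf. delta_opt = h/b in (4.47)
    return min(proposal, cap_inv, cap_reac)

# example: a strongly convection-dominated element (eps << h_T)
print(delta_T(h_T=0.01, eps=1e-6, b_norm=1.0, beta0=1.0, c_b=0.0, mu_inv=0.5))
```

In the convection-dominated regime the caps from (4.67) are inactive and the h/b proposal is returned; in the diffusion-dominated regime the inverse-inequality cap proportional to h_T²/ε takes over.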

Application of theorem 4.2.2 yields:

Theorem 4.4.10 Assume that (4.67) is satisfied. For the discrete solution uh of (4.63a) we have the error bound

\[ |||u - u_h|||_\delta + \|u - u_h\|_{*,h,\delta} \le C \inf_{v_h \in V_h} \big( |||u - v_h|||_\delta + \|u - v_h\|_{*,h,\delta} \big) \tag{4.70} \]

with C = 1 + max{cb, 1}(3 + 2 max{cb, 1}).

Proof. From lemma 4.4.5 and lemma 4.4.6 it follows that the conditions (4.16) and (4.17) are satisfied with c0 = ½, c1 = max{cb, 1}. Now we use theorem 4.2.2.

The norm ||| · |||δ is given in terms of the usual L2- and H1-norm. For the right-hand side in (4.70) we need a bound on ‖u − vh‖∗,h,δ. Such a result is given in the following lemma. We will need the assumption

\[ \|\operatorname{div} \mathbf{b}\|_{L^\infty} \le \gamma_0 \beta_0 \tag{4.71} \]

(β0 as in (4.55)). This can always be satisfied (for suitable γ0) if β0 > 0. For the case β0 = 0 this implies that div b = 0 must hold.

Lemma 4.4.11 Let δT be such that (4.67) holds and assume that (4.55) and (4.71) are satisfied. For u ∈ U the following estimate holds:

\[ \|u\|_{*,h,\delta} \le \sqrt{\varepsilon}\, |u|_1 + \sqrt{\beta_0}\,(1 + \gamma_0)\, \|u\|_{L^2} + \Big( \sum_{T \in \mathcal{T}_h} \min\Big\{ \delta_T^{-1},\; \frac{\|\mathbf{b}\|_{\infty,T}^2}{\varepsilon + \mu_{\mathrm{inv}}^2 h_T^2 \beta_0} \Big\} \|u\|_T^2 \Big)^{\frac12}. \]


Proof. By definition we have

‖u‖∗,h,δ = supvh∈Vh

〈b · ∇u, vh〉L2 +∑

T∈ThδT 〈−ε∆u+ cu,b · ∇vh〉T

|||vh|||δ. (4.72)

We first treat the second term in the numerator. Using the inverse inequality (4.66) and the bounds δT^{1/2} ≤ (1/√2) µinv hT ε^{−1/2}, δT^{1/2} ≤ (1/√2) β0^{−1/2} cb^{−1} we get

| Σ_{T∈Th} δT 〈−ε∆u + cu, b·∇vh〉T | ≤ Σ_{T∈Th} δT ( ε|u|2,T + cb β0 ‖u‖T ) ‖b·∇vh‖T
  ≤ Σ_{T∈Th} (1/√2) ( √ε |u|1,T + √β0 ‖u‖T ) √δT ‖b·∇vh‖T
  ≤ [ Σ_{T∈Th} (1/2) ( √ε |u|1,T + √β0 ‖u‖T )² ]^{1/2} |||vh|||δ
  ≤ ( ε|u|1² + β0 ‖u‖L2² )^{1/2} |||vh|||δ
  ≤ ( √ε |u|1 + √β0 ‖u‖L2 ) |||vh|||δ .   (4.73)

For the first term in the numerator in (4.72) we obtain, using partial integration,

|〈b·∇u, vh〉L2| ≤ |〈u, (div b)vh〉L2| + |〈u, b·∇vh〉L2|
  ≤ γ0 β0 ‖u‖L2 ‖vh‖L2 + |〈u, b·∇vh〉L2|
  ≤ γ0 √β0 ‖u‖L2 |||vh|||δ + |〈u, b·∇vh〉L2| .   (4.74)

We write |||vh|||δ² = Σ_{T∈Th} [ ε|vh|1,T² + β0 ‖vh‖T² + δT ‖b·∇vh‖T² ] =: Σ_{T∈Th} ξT². For the last term in (4.74) we have

|〈u, b·∇vh〉L2| = | Σ_{T∈Th} 〈u, b·∇vh〉T | ≤ Σ_{T∈Th} ‖u‖T ‖b·∇vh‖T .

From ‖b·∇vh‖T ≤ δT^{−1/2} ξT and

‖b·∇vh‖T ≤ ‖b‖∞,T |vh|1,T = [ ‖b‖∞,T / (ε + µ²inv hT² β0)^{1/2} ] ( ε|vh|1,T² + µ²inv hT² β0 |vh|1,T² )^{1/2}
  ≤ [ ‖b‖∞,T / (ε + µ²inv hT² β0)^{1/2} ] ( ε|vh|1,T² + β0 ‖vh‖T² )^{1/2} ≤ [ ‖b‖∞,T / (ε + µ²inv hT² β0)^{1/2} ] ξT ,

we get

|〈u, b·∇vh〉L2| ≤ Σ_{T∈Th} ‖u‖T min{ δT^{−1/2}, ‖b‖∞,T / (ε + µ²inv hT² β0)^{1/2} } ξT
  ≤ [ Σ_{T∈Th} min{ δT^{−1}, ‖b‖²∞,T / (ε + µ²inv hT² β0) } ‖u‖T² ]^{1/2} |||vh|||δ .

Using this in (4.74) we get

|〈b·∇u, vh〉L2| / |||vh|||δ ≤ γ0 √β0 ‖u‖L2 + [ Σ_{T∈Th} min{ δT^{−1}, ‖b‖²∞,T / (ε + µ²inv hT² β0) } ‖u‖T² ]^{1/2} .


Combining this with the results in (4.72) and (4.73) completes the proof.

For the estimation of the approximation error in (4.70) we use an interpolation operator (e.g., nodal interpolation)

IVh : H → Vh = X^k_{h,0} (k ≥ 1),

that satisfies

‖u − IVh u‖T ≤ c hT^m |u|m,T   (4.75a)
|u − IVh u|1,T ≤ c hT^{m−1} |u|m,T ,   (4.75b)

for u ∈ Hm(Ω), with 2 ≤ m ≤ k + 1. A main discretization error result is given in the following theorem.

Theorem 4.4.12 Assume that (4.55) and (4.71) hold and that δT is such that (4.67) is satisfied. Let u be the solution of (4.2) and assume that u ∈ Hm(Ω). Let uh ∈ Vh = X^k_{h,0} be the solution of the discrete problem (4.63). For 2 ≤ m ≤ k + 1 the discretization error bound

√ε |u − uh|1 + √β0 ‖u − uh‖L2 + ( Σ_{T∈Th} δT ‖b·∇(u − uh)‖T² )^{1/2}
  ≤ C ( √ε h^{m−1} + √β0 (1 + γ0) h^m ) |u|m   (4.76)
  + C ( Σ_{T∈Th} [ δT ‖b‖²∞,T hT^{2m−2} + min{ δT^{−1}, ‖b‖²∞,T / (ε + µ²inv hT² β0) } hT^{2m} ] |u|m,T² )^{1/2} ,   (4.77)

holds, with C independent of u, h, ε, β0, δT and b.

Proof. We apply theorem 4.4.10. For the left-hand side in (4.70) we have

|||u − uh|||δ + ‖u − uh‖∗,h,δ ≥ (1/√3) [ √ε |u − uh|1 + √β0 ‖u − uh‖L2 + ( Σ_{T∈Th} δT ‖b·∇(u − uh)‖T² )^{1/2} ] .

For the right-hand side in (4.70) we obtain, using ‖b·∇(u − vh)‖T ≤ ‖b‖∞,T |u − vh|1,T and lemma 4.4.11:

inf_{vh∈Vh} ( |||u − vh|||δ + ‖u − vh‖∗,h,δ ) ≤ |||u − IVh u|||δ + ‖u − IVh u‖∗,h,δ
  ≤ C ( √ε |u − IVh u|1 + √β0 (1 + γ0) ‖u − IVh u‖L2 )
  + C ( Σ_{T∈Th} [ δT ‖b‖²∞,T |u − IVh u|1,T² + min{ δT^{−1}, ‖b‖²∞,T / (ε + µ²inv hT² β0) } ‖u − IVh u‖T² ] )^{1/2} .

Using the approximation error bounds in (4.75) we obtain the result.

Note that this theorem covers the cases δT = 0 for all T (i.e., no stabilization) and β0 = 0. To gain more insight we consider a special case:

Remark 4.4.13 We take b = (1 0)^T, c ≡ 1 (hence β0 = 1, γ0 = 0), δT = δ for all T and m = 2. Then the estimate in the previous theorem takes the form

√ε |u − uh|1 + ‖u − uh‖L2 + √δ ‖∂(u − uh)/∂x‖L2 ≤ C h ( √ε + h + √δ + min{ h/√δ , 1/√(ε/h² + 1) } ) |u|2 ,


with C independent of u, h, ε and δ. For ε ↓ 0 this result is very similar to the one in corollary 4.3.12 for the one-dimensional hyperbolic problem.
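To see how the δ-dependence of such a bound behaves, here is a small numerical sketch (illustrative only: the constant C and |u|2 are set to 1, and the min-term is written as min{h/√δ, 1/√(ε/h² + 1)}) that evaluates the δ-dependent factor for several values of δ in the convection-dominated regime ε ≪ h:

```python
import numpy as np

# Illustrative sketch (not from the text): evaluate the delta-dependent
# factor sqrt(eps) + h + sqrt(delta) + min(h/sqrt(delta), 1/sqrt(eps/h^2 + 1))
# from the bound in remark 4.4.13, with C and |u|_2 set to 1.
def bound_factor(eps, h, delta):
    unstab = 1.0 / np.sqrt(eps / h**2 + 1.0)
    stab = min(h / np.sqrt(delta), unstab) if delta > 0 else unstab
    return np.sqrt(eps) + h + np.sqrt(delta) + stab

eps, h = 1e-6, 1e-2                      # convection-dominated: eps << h
for delta in (0.0, h**2, h, 10.0 * h):
    print(f"delta = {delta:8.2e}  factor = {bound_factor(eps, h, delta):.3f}")
```

The factor is smallest for δ of order h, which balances √δ against h/√δ; this is the trade-off that motivates looking for an optimal stabilization parameter.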

For the case without stabilization we obtain the following.

Corollary 4.4.14 If the assumptions of theorem 4.4.12 are fulfilled we have the following discretization error bounds for the case δT = 0 for all T:

|u − uh|1 ≤ C ( 1 + h/ε ) h^{m−1} |u|m   (4.78)
‖u − uh‖L2 ≤ C h^{m−1} |u|m if β0 > 0,   (4.79)

with a constant C independent of u, h and ε.

For ε ↓ 0 these bounds are much better than the ones in (4.51) and (4.52) which resulted from the standard finite element error analysis. Moreover, the results in (4.78), (4.79) reflect important properties of the standard Galerkin finite element discretization that are observed in practice:

• If h ≲ ε holds then (4.78) yields |u − uh|1 ≤ C h^{m−1}|u|m with C independent of h and ε, which is an optimal discretization error bound. This explains the fact that for h ≲ ε the standard Galerkin finite element method applied to the convection-diffusion problem usually yields an accurate discretization. Note, however, that for ε ≪ 1 the condition h ≲ ε is very unfavourable in view of computational costs.

• For fixed h, even if u is smooth (i.e., |u|m bounded for ε ↓ 0) the H1-error bound in (4.78) tends to infinity for ε ↓ 0. Thus, if the analysis is sharp, we expect that an instability phenomenon can occur for ε ↓ 0.

• For the case β0 > 0 we have the suboptimal bound h^{m−1}|u|m for the L2-norm of the discretization error (for an optimal bound we should have h^m|u|m). If u is smooth (|u|m bounded for ε ↓ 0) the discretization error in ‖·‖L2 will be arbitrarily small if h is sufficiently small, even if ε ≪ h. Hence, for the case β0 > 0 the L2-norm of the discretization error cannot show an instability phenomenon as described in the previous item for the H1-norm. Note, however, that the L2-norm is weaker than the H1-norm and in particular allows a more “oscillatory behaviour” of the error.

We now turn to the question whether the results can be improved by choosing a suitable value for the stabilization parameter δT. For the term between square brackets in (4.77) we have

δT ‖b‖²∞,T hT^{2m−2} + min{ δT^{−1}, ‖b‖²∞,T / (ε + µ²inv hT² β0) } hT^{2m} ≤ gT(δT) ‖b‖∞,T hT^{2m−2} ,   (4.80)

with gT(δ) := δ ‖b‖∞,T + min{ 1/(δ ‖b‖∞,T) , ‖b‖∞,T/ε } hT². For ε ≤ hT ‖b‖∞,T the function gT attains its minimum at δ = hT ‖b‖∞,T^{−1}. Remember the condition on δT in (4.67). This leads to the parameter choice

δT,opt := ξT hT / ‖b‖∞,T , with ξT := min{ 1 , µ²inv hT ‖b‖∞,T / (2ε) } .   (4.81)

If ‖b‖∞,T = 0 we take δT,opt = 0. Note that δT,opt ≤ hT ‖b‖∞,T^{−1} and thus for hT sufficiently small the condition δT,opt ≤ (1/2) β0^{−1} cb^{−2} in (4.67) is satisfied. The second condition in (4.67) is satisfied due to the definition of δT,opt in (4.81). If ξT = 1 we have

gT(δT,opt) ≤ δT,opt ‖b‖∞,T + hT² / (δT,opt ‖b‖∞,T) = 2 hT ,


and if ξT < 1 this implies (hT/ε) ‖b‖∞,T ≤ 2 µinv^{−2} and thus

gT(δT,opt) ≤ δT,opt ‖b‖∞,T + (‖b‖∞,T/ε) hT² ≤ (1 + 2 µinv^{−2}) hT .

Hence, for δT = δT,opt we obtain the following bound for the δT-dependent term in (4.77):

( Σ_{T∈Th} [ δT ‖b‖²∞,T hT^{2m−2} + min{ δT^{−1}, ‖b‖²∞,T / (ε + µ²inv hT² β0) } hT^{2m} ] |u|m,T² )^{1/2} ≤ C ‖b‖L∞^{1/2} h^{m−1/2} |u|m .

Using this in theorem 4.4.12 we obtain the following corollary.

Corollary 4.4.15 Let the assumptions be as in theorem 4.4.12. For δT = δT,opt we get the estimate

√ε |u − uh|1 + √β0 ‖u − uh‖L2 + ( Σ_{T∈Th} δT,opt ‖b·∇(u − uh)‖T² )^{1/2}
  ≤ C ( √ε + √β0 (1 + γ0) h + ‖b‖L∞^{1/2} √h ) h^{m−1} |u|m .   (4.82)

This implies

|u − uh|1 ≤ C ( 1 + [√β0 (1 + γ0)/√ε] h + [‖b‖L∞^{1/2}/√ε] √h ) h^{m−1} |u|m ,   (4.83a)

‖u − uh‖L2 ≤ C ( √ε/√β0 + (1 + γ0) h + [‖b‖L∞^{1/2}/√β0] √h ) h^{m−1} |u|m if β0 > 0,   (4.83b)

( Σ_{T∈Th} δT,opt ‖b·∇(u − uh)‖T² )^{1/2} ≤ C ( √ε + √β0 (1 + γ0) h + ‖b‖L∞^{1/2} √h ) h^{m−1} |u|m .   (4.83c)

The constants C are independent of u, h, ε, β0, δT and b.

Some comments related to these discretization error bounds:

• The bound in (4.83a) is of the form c (1 + √(h/ε)) h^{m−1}|u|m and thus better than the one in (4.78) if ε ≪ h.

• The bound in (4.83b) is of the form c (√ε + √h) h^{m−1}|u|m and thus better than the one in (4.79) if ε ≪ h. For ε ↓ 0 we have a bound of the form c h^{m−1/2}|u|m which, for m = 2, is similar to the bound in (4.48) for the one-dimensional hyperbolic problem.

• The result in (4.83c) shows a control on the streamline derivative of the discretization error. Such a control is not present in the case δT = 0 (no stabilization). If ξT = 1 for all T (i.e., ε ≤ (1/2) µ²inv ‖b‖∞,T hT) and hT ≥ c0 h with c0 > 0 independent of T and h we obtain

‖b·∇(u − uh)‖L2 ≤ c ( √(ε/h) + 1 ) h^{m−1} |u|m ,

and thus an optimal bound of the form ‖b·∇(u − uh)‖L2 ≤ c h^{m−1}|u|m if ε ≤ h.


• In (4.82) there is a correct scaling of ε, β0 and b. Note that δT,opt = ξT hT/‖b‖∞,T has a scaling w.r.t. ‖b‖∞,T that is the same as in the one-dimensional hyperbolic problem in (4.47).

• In case of linear finite elements the condition on δT in (4.67) can be simplified to δT ≤ (1/2) β0^{−1} cb^{−2}, cf. remark 4.4.7. Due to this one can take ξT = 1 in (4.81) and thus δT,opt = hT/‖b‖∞,T. In the general case (quadratic or higher order finite elements) one does not use δT,opt as in (4.81) in practice, because µinv is not known. Instead one often takes the simplified form

δT,opt := ξT hT / ‖b‖∞,T , with ξT := min{ 1 , hT ‖b‖∞,T / (2ε) } ,

in which, if necessary, ‖b‖∞,T is replaced by an approximation.
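As an implementation sketch (the helper name and call signature are our own, not from the text), this simplified per-element parameter choice reads:

```python
# Sketch of the simplified SDFEM parameter choice above (helper name is
# hypothetical): delta_T = xi_T * h_T / ||b||_{inf,T} with
# xi_T = min(1, h_T*||b||_{inf,T}/(2*eps)), and delta_T = 0 where b vanishes.
def sdfem_delta(h_T, b_inf_T, eps):
    if b_inf_T == 0.0:
        return 0.0
    xi_T = min(1.0, h_T * b_inf_T / (2.0 * eps))
    return xi_T * h_T / b_inf_T

# Convection-dominated element: xi_T = 1, so delta_T = h_T / ||b||_{inf,T}.
print(sdfem_delta(0.01, 1.0, 1e-6))   # 0.01
# Diffusion-dominated element: delta_T shrinks to the h_T^2/(2*eps) scale.
print(sdfem_delta(0.01, 1.0, 1.0))
```

The two regimes mirror the discussion above: for ε ≪ hT‖b‖∞,T one recovers δT = hT/‖b‖∞,T, while for ε large the parameter (and hence the stabilization) is switched off at the rate hT²/(2ε).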

4.4.3 Stiffness matrix for the convection-diffusion problem

Stiffness matrix for SDFEM: nonsymmetric. Condition number? (Example: hyperbolic problem with SDFEM.)


Chapter 5

Finite element discretization of the Stokes problem

5.1 Galerkin discretization of saddle-point problems

We recall the abstract variational formulation of a saddle-point problem as introduced in section 2.3, (2.43). Let V and M be Hilbert spaces and

a : V × V → R , b : V × M → R

be continuous bilinear forms. For f1 ∈ V′, f2 ∈ M′ we consider the following variational problem: find (φ, λ) ∈ V × M such that

a(φ,ψ) + b(ψ,λ) = f1(ψ) for all ψ ∈ V (5.1a)

b(φ,µ) = f2(µ) for all µ ∈M (5.1b)

We define H := V × M and

k : H × H → R , k(U, V) := a(φ, ψ) + b(φ, µ) + b(ψ, λ) , with U := (φ, λ), V := (ψ, µ)   (5.2)

If we define F (ψ,µ) = f1(ψ) + f2(µ) then the problem (5.1) can be reformulated as follows:

find U ∈ H such that k(U,V) = F (V) for all V ∈ H (5.3)

We note that in this section we use the notation U, V, F instead of the (more natural) notation u, v, f that is used in chapters 2 and 3. The reason for this is that the latter symbols are confusing in view of the notation used in section 5.2. For the Galerkin discretization of this variational problem we introduce finite dimensional subspaces Vh and Mh:

Vh ⊂ V, Mh ⊂M, Hh := Vh ×Mh ⊂ H

The Galerkin discretization is as follows:

find Uh ∈ Hh such that k(Uh,Vh) = F (Vh) for all Vh ∈ Hh (5.4)

An equivalent formulation is: find (φh,λh) ∈ Vh ×Mh such that

a(φh,ψh) + b(ψh,λh) = f1(ψh) for all ψh ∈ Vh (5.5a)

b(φh,µh) = f2(µh) for all µh ∈Mh (5.5b)


For the discretization error we have a variant of the Céa lemma 3.1.1:

Theorem 5.1.1 Consider the variational problem (5.1) with continuous bilinear forms a(·, ·) and b(·, ·) that satisfy:

∃ β > 0 : sup_{ψ∈V} b(ψ, λ)/‖ψ‖V ≥ β ‖λ‖M ∀ λ ∈ M   (5.6a)
∃ γ > 0 : a(φ, φ) ≥ γ ‖φ‖V² ∀ φ ∈ V   (5.6b)
∃ βh > 0 : sup_{ψh∈Vh} b(ψh, λh)/‖ψh‖V ≥ βh ‖λh‖M ∀ λh ∈ Mh   (5.6c)

Then the problem (5.1) and its Galerkin discretization (5.5) have unique solutions (φ, λ) and (φh, λh), respectively. Furthermore the inequality

‖φ − φh‖V + ‖λ − λh‖M ≤ C ( inf_{ψh∈Vh} ‖φ − ψh‖V + inf_{µh∈Mh} ‖λ − µh‖M )

holds, with C = √2 ( 1 + γ^{−1} βh^{−2} (2‖a‖ + ‖b‖)³ ).

Proof. We shall apply lemma 3.1.1 to the variational problem (5.1) and its Galerkin discretization (5.5). Hence, we have to verify the conditions (3.2), (3.3), (3.4), (3.6). First note that

|k(U, V)| ≤ ( ‖a‖ + ‖b‖ ) ‖U‖H ‖V‖H

holds and thus the condition (3.2) is satisfied with M = ‖a‖ + ‖b‖. Due to the assumptions (5.6a) and (5.6b) it follows from corollary 2.3.12 and theorem 2.3.10 that the conditions (3.3), (3.4) are satisfied. Due to (5.6b) and (5.6c) it follows from corollary 2.3.12 and theorem 2.3.10, with V and M replaced by Vh and Mh, respectively, that the condition (3.6) is fulfilled, too. Moreover, from the final statement in theorem 2.3.10 we obtain that (3.6) holds with

εh = γ βh² ( βh + 2‖a‖ )^{−2} ≥ γ βh² ( ‖b‖ + 2‖a‖ )^{−2}

Application of lemma 3.1.1 yields

( ‖φ − φh‖V² + ‖λ − λh‖M² )^{1/2} ≤ ( 1 + M/εh ) inf_{(ψh,µh)∈Vh×Mh} ( ‖φ − ψh‖V² + ‖λ − µh‖M² )^{1/2}

From this and the inequalities α + β ≤ √2 √(α² + β²) ≤ √2 (α + β), for α ≥ 0, β ≥ 0, the result follows.

Remark 5.1.2 The condition (5.6c) implies dim(Vh) ≥ dim(Mh). This can be shown by the following argument. Let (ψj)1≤j≤m be a basis of Vh and (λi)1≤i≤k a basis of Mh. Define the matrix B ∈ R^{k×m} by

Bij = b(ψj, λi)

From (5.6c) it follows that for every λh ∈ Mh, λh ≠ 0, there exists ψh ∈ Vh such that b(ψh, λh) ≠ 0. Thus for every y ∈ R^k, y ≠ 0, there exists x ∈ R^m such that y^T B x ≠ 0, i.e., x^T B^T y ≠ 0. This implies that all columns of B^T, and thus all rows of B, are independent. A necessary condition for this is k ≤ m.


5.2 Finite element discretization of the Stokes problem

We recall the variational formulation of the Stokes problem (with homogeneous Dirichlet boundary conditions) given in section 2.6: find (u, p) ∈ V × M such that

a(u,v) + b(v, p) = f(v) for all v ∈ V (5.7a)

b(u, q) = 0 for all q ∈M (5.7b)

with

V := H1_0(Ω)^n , M := L2_0(Ω) ,
a(u, v) := ∫_Ω ∇u · ∇v dx ,
b(v, q) := −∫_Ω q div v dx ,
f(v) := ∫_Ω f · v dx .

For the Galerkin discretization of this problem we use the simplicial finite element spaces defined in section 3.2.1, i.e., for a given family {Th} of admissible triangulations of Ω we define the pair of spaces:

(Vh, Mh) := ( (X^k_{h,0})^n , X^{k−1}_h ∩ L2_0(Ω) ) , k ≥ 1   (5.8)

A short discussion concerning other finite element spaces that can be used for the Stokes problem is given in section 5.2.2. For k ≥ 2, the spaces in (5.8) are called Hood-Taylor finite elements [52]. Note that for k = 1 the pressure space X^0_h ∩ L2_0(Ω) contains discontinuous functions, whereas for k ≥ 2 all functions in the pressure space are continuous. The Stokes problem fits in the general setting presented in section 5.1. The discrete problem is as in (5.5): find (uh, ph) such that

a(uh,vh) + b(vh, ph) = f(vh) for all vh ∈ Vh (5.9a)

b(uh, qh) = 0 for all qh ∈Mh (5.9b)

From the analysis in section 2.6 it follows that the conditions (5.6a) and (5.6b) in theorem 5.1.1 are satisfied. The following remark shows that the condition in (5.6c), which is often called the discrete inf-sup condition, needs a careful analysis:

Remark 5.2.1 Take n = 2, Ω = (0, 1)² and a uniform triangulation Th of Ω that is defined as follows. For N ∈ N and h := 1/(N+1) the domain Ω is subdivided into squares with sides of length h and vertices in the set { (ih, jh) | 0 ≤ i, j ≤ N + 1 }. The triangulation Th is obtained by subdividing every square into two triangles by inserting a diagonal from (ih, jh) to ((i+1)h, (j+1)h). The spaces (Vh, Mh) are defined as in (5.8) with k = 1. The space Vh has dimension 2N² and dim(Mh) = 2(N+1)² − 1. From dim(Vh) < dim(Mh) and remark 5.1.2 it follows that the condition (5.6c) does not hold.
The same argument applies to the three-dimensional case with a uniform triangulation of (0, 1)³ consisting of tetrahedra (every cube is subdivided into 6 tetrahedra). In this case we have dim(Vh) = 3N³ and dim(Mh) = 6(N+1)³ − 1.
We now show that also the lowest order rectangular finite elements in general do not satisfy (5.6c). For this we consider n = 2, Ω = (0, 1)² and use a triangulation consisting of squares

Tij := [ih, (i+1)h] × [jh, (j+1)h] , 0 ≤ i, j ≤ N , h := 1/(N+1)


We assume that N is odd. The corresponding lowest order rectangular finite element spaces (Vh, Mh) = (Q1_{h,0}, Q0_h ∩ L2_0(Ω)) are defined in (3.17). We define ph ∈ Mh by

(ph)|Tij = (−1)^{i+j} (checkerboard function)

For uh ∈ Vh we use the notation uh = (u, v), u(ih, jh) =: ui,j, v(ih, jh) =: vi,j. Then we have:

∫_{Tij} ph div uh dx = (−1)^{i+j} ∫_{∂Tij} uh · n ds
  = (−1)^{i+j} (h/2) [ (u_{i+1,j+1} + u_{i+1,j}) − (u_{i,j+1} + u_{i,j}) + (v_{i+1,j+1} + v_{i,j+1}) − (v_{i+1,j} + v_{i,j}) ]

Using (uh)|∂Ω = 0 we get, for 0 ≤ k ≤ N + 1,

Σ_{j=0}^{N} (−1)^j (u_{k,j+1} + u_{k,j}) = 0 , Σ_{i=0}^{N} (−1)^i (v_{i+1,k} + v_{i,k}) = 0

and thus

b(uh, ph) = −∫_Ω ph div uh dx = −Σ_{i,j=0}^{N} ∫_{Tij} ph div uh dx = 0

for arbitrary uh ∈ Vh. We conclude that there exists ph ∈ Mh, ph ≠ 0, such that sup_{vh∈Vh} b(vh, ph) = 0 and thus the discrete inf-sup condition does not hold for the pair (Vh, Mh).
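The cancellation in this argument can be checked numerically. The sketch below (illustrative, with random interior nodal values) evaluates b(uh, ph) via the elementwise boundary-integral formula above, and also verifies that for N odd the checkerboard function has zero mean, so that ph ∈ L2_0(Ω):

```python
import numpy as np

# Numerical check of remark 5.2.1 (illustrative, random data): for the
# checkerboard pressure p|Tij = (-1)^(i+j) and any Q1 velocity with zero
# boundary values, the sum of the elementwise boundary integrals vanishes.
N = 5                                   # N odd: the checkerboard has zero mean
h = 1.0 / (N + 1)
rng = np.random.default_rng(1)
u = rng.standard_normal((N + 2, N + 2))
v = rng.standard_normal((N + 2, N + 2))
for w in (u, v):                        # impose u_h = 0 on the boundary
    w[0, :] = w[-1, :] = w[:, 0] = w[:, -1] = 0.0

b_uh_ph = 0.0
for i in range(N + 1):
    for j in range(N + 1):
        b_uh_ph -= (-1) ** (i + j) * (h / 2) * (
            (u[i+1, j+1] + u[i+1, j]) - (u[i, j+1] + u[i, j])
            + (v[i+1, j+1] + v[i, j+1]) - (v[i+1, j] + v[i, j]))

mean_ph = sum((-1) ** (i + j) for i in range(N + 1) for j in range(N + 1))
print(abs(b_uh_ph) < 1e-12, mean_ph)    # True 0
```

Up to floating-point rounding, b(uh, ph) = 0 for every admissible uh, exactly as the telescoping argument predicts.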

Definition 5.2.2 Let {Th} be a family of admissible triangulations of Ω. Suppose that to every Th ∈ {Th} there correspond finite element spaces Vh ⊂ V and Mh ⊂ M. The pair (Vh, Mh) is called stable if there exists a constant β > 0 independent of Th ∈ {Th} such that

sup_{vh∈Vh} b(vh, qh)/‖vh‖1 ≥ β ‖qh‖L2 for all qh ∈ Mh   (5.10)

5.2.1 Error bounds

In this section we derive error bounds for the discretization of the Stokes problem with Hood-Taylor finite elements. We will prove that for k = 2 the Hood-Taylor spaces are stable. Theorem 5.1.1 is used to derive error bounds.
In the analysis of this finite element method we will need an approximation operator which is applicable to functions u ∈ H1(Ω) and yields a “reasonable” approximation of u in the subspace of continuous piecewise linear functions. Such an operator was introduced by Clément in [29] and is denoted by ICX (Clément operator).
Let {Th} be a regular family of triangulations of Ω consisting of n-simplices and X1_{h,0} the corresponding finite element space of continuous piecewise linear functions. For the definition of the Clément operator we need the nodal basis of this finite element space. Let {xi}_{1≤i≤N} be the set of vertices in Th that lie in the interior of Ω. To every xi we associate a basis function φi ∈ X1_{h,0} with the property φi(xi) = 1, φi(xj) = 0 for all j ≠ i. Then {φi}_{1≤i≤N} forms a basis of the space X1_{h,0}. We define a neighbourhood of the point xi by

ω_{xi} := supp(φi) = ∪ { T ∈ Th | xi ∈ T }


and a neighbourhood of T ∈ Th by

ωT := ∪ { ω_{xi} | xi ∈ T }

The local L2-projection Pi : L2(ω_{xi}) → P0 is defined by:

Pi v = |ω_{xi}|^{−1} ∫_{ω_{xi}} v dx

The Clément operator ICX : H1_0(Ω) → X1_{h,0} is defined by

ICX u = Σ_{i=1}^{N} (Pi u) φi   (5.11)

For this operator the following approximation properties hold:

Theorem 5.2.3 (Clément operator) There exists a constant C independent of Th ∈ {Th} such that for all u ∈ H1_0(Ω) and all T ∈ Th:

‖ICX u‖1,T ≤ C ‖u‖1,ωT   (5.12a)
‖u − ICX u‖0,T ≤ C hT ‖u‖1,ωT   (5.12b)
‖u − ICX u‖0,∂T ≤ C hT^{1/2} ‖u‖1,ωT   (5.12c)

Proof. We refer to [29] and [13].
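As a concrete one-dimensional sketch of the construction (all helper names are our own, not from the text): the coefficient of each hat function φi is the mean of u over supp(φi) = [x_{i−1}, x_{i+1}], i.e. the local L2-projection onto constants as in (5.11), and the L2-error decreases under mesh refinement, consistent with (5.12b):

```python
import numpy as np

def trapezoid(y, x):
    # composite trapezoidal rule
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# 1D sketch of the Clement quasi-interpolant (5.11): the nodal coefficient
# at x_i is the mean of u over the patch [x_{i-1}, x_{i+1}] = supp(phi_i);
# boundary values are set to zero (the space X^1_{h,0}).
def clement_1d(u, N, xs):
    x = np.linspace(0.0, 1.0, N + 2)           # nodes, h = 1/(N+1)
    vals = np.zeros(N + 2)
    for i in range(1, N + 1):
        patch = (xs >= x[i - 1]) & (xs <= x[i + 1])
        vals[i] = trapezoid(u(xs[patch]), xs[patch]) / (x[i + 1] - x[i - 1])
    return np.interp(xs, x, vals)              # the piecewise linear result

u = lambda t: np.sin(np.pi * t)                # a smooth function in H^1_0(0,1)
xs = np.linspace(0.0, 1.0, 4001)               # grid aligned with all meshes
errs = [np.sqrt(trapezoid((u(xs) - clement_1d(u, N, xs)) ** 2, xs))
        for N in (7, 15, 31)]
print(errs)                                    # L2-errors decrease with h
```

Note that, unlike nodal interpolation, this construction never evaluates u pointwise, which is exactly why it is applicable to arbitrary H1-functions.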

Variants of this operator are discussed in [13, 81, 14]. Results as in theorem 5.2.3 also hold if H1_0(Ω) and X1_{h,0} are replaced by H1(Ω) and X1_h, respectively.
Using the Clément operator one can reformulate the stability condition (5.10) in another form that turns out to be easier to handle. This reformulation is given in [93] and applies to a large class of finite element spaces. Here we only present a simplified variant that applies to the Hood-Taylor finite element spaces. We will need the mesh-dependent norm

‖qh‖1,h := ( Σ_{T∈Th} hT² ‖∇qh‖0,T² )^{1/2} , qh ∈ X1_h ∩ L2_0(Ω)

Theorem 5.2.4 Let {Th} be a regular family of triangulations. The Hood-Taylor pair of finite element spaces (Vh, Mh) as in (5.8), k ≥ 2, is stable iff there exists a constant β∗ > 0 independent of Th ∈ {Th} such that

sup_{vh∈Vh} b(vh, qh)/‖vh‖1 ≥ β∗ ‖qh‖1,h for all qh ∈ Mh   (5.13)

Proof. For T ∈ Th let F(x̂) = Bx̂ + c be an affine mapping such that F(T̂) = T, where T̂ is the unit n-simplex. Using the lemmas 3.3.5 and 3.3.6 and dim(Pk) < ∞ we get, for qh ∈ Mh and q̂h := qh ◦ F:

‖∇qh‖0,T² = |qh|1,T² ≤ C ‖B^{−1}‖2² |det B| |q̂h|1,T̂² ≤ C ‖B^{−1}‖2² |det B| ‖q̂h‖0,T̂² ≤ C hT^{−2} ‖qh‖0,T²


with a constant C independent of qh and of T. This yields ‖qh‖1,h ≤ C ‖qh‖L2 and thus the stability property implies (5.13).
Assume that (5.13) holds. Take an arbitrary qh ∈ Mh with ‖qh‖L2 = 1. The constants C below are independent of qh and of Th ∈ {Th}. From the inf-sup property of the continuous problem it follows that there exists β > 0, independent of qh, and v ∈ H1_0(Ω)^n such that

‖v‖1 = 1 , b(v, qh) ≥ β

We apply the Clément operator to the components of v:

wh := ICX v = ( ICX v1, . . . , ICX vn ) ∈ ( X1_{h,0} )^n ⊂ Vh

From theorem 5.2.3 we get

‖wh‖1 ≤ C ‖v‖1 = C , Σ_{T∈Th} hT^{−2} ‖v − wh‖0,T² ≤ C ‖v‖1² = C

From this we obtain

|b(wh − v, qh)| = | ∫_Ω qh div(wh − v) dx | = | Σ_{T∈Th} ∫_T ∇qh · (wh − v) dx |
  ≤ Σ_{T∈Th} ‖∇qh‖0,T ‖wh − v‖0,T
  ≤ ( Σ_{T∈Th} hT² ‖∇qh‖0,T² )^{1/2} ( Σ_{T∈Th} hT^{−2} ‖wh − v‖0,T² )^{1/2}
  ≤ C ‖qh‖1,h   (5.14)

Define ξ := sup_{vh∈Vh} b(vh, qh)/‖vh‖1. From (5.13) we have ‖qh‖1,h ≤ ξ/β∗. Using this in combination with the result in (5.14) we obtain

ξ ≥ b(wh, qh)/‖wh‖1 ≥ C b(wh, qh) = C b(v, qh) + C b(wh − v, qh)
  ≥ C ( β − C ‖qh‖1,h ) ≥ C ( β − C ξ/β∗ )

and thus ξ ≥ β̃ for a suitable constant β̃ > 0 independent of qh and of Th ∈ {Th}.

Theorem 5.2.5 Let {Th} be a regular family of triangulations consisting of simplices. We assume that every T ∈ Th has at least one vertex in the interior of Ω. Then the Hood-Taylor pair of finite element spaces with k = 2 is stable.

Proof. We consider only n ∈ {2, 3}. Take qh ∈ Mh, qh ≠ 0. The constants used in the proof are independent of Th ∈ {Th} and of qh. The set of edges in Th is denoted by E. This set is partitioned into edges which are in the interior of Ω and edges which are part of ∂Ω:


E = Eint ∪ Ebnd. For every E ∈ E, mE denotes the midpoint of the edge E. Every E ∈ Eint with endpoints a1, a2 ∈ R^n is assigned a vector tE := a1 − a2. For E ∈ Ebnd we define tE := 0. Since qh ∈ X1_h the function x → tE · ∇qh(x) is continuous across E, for E ∈ Eint. We define

wE := ( tE · ∇qh(mE) ) tE , for E ∈ E

Due to lemma 3.3.2 a unique wh ∈ ( X2_{h,0} )^n is defined by

wh(xi) = 0 if xi is a vertex of T ∈ Th , wh(xi) = wE if xi = mE for E ∈ E

For each T ∈ Th the set of edges of T is denoted by ET. By using quadrature we see that for any p ∈ P2 which is zero at the vertices of T we have

∫_T p(x) dx = [ |T| / (2n − 1) ] Σ_{E∈ET} p(mE)

We obtain:

−∫_Ω qh div wh dx = ∫_Ω ∇qh · wh dx = Σ_{T∈Th} (∇qh)|T · ∫_T wh dx
  = Σ_{T∈Th} [ |T| / (2n − 1) ] (∇qh)|T · Σ_{E∈ET} wh(mE)
  = Σ_{T∈Th} [ |T| / (2n − 1) ] Σ_{E∈ET} ( tE · ∇qh(mE) )²   (5.15)

Using the fact that (∇qh)|T is constant one easily checks that

C1 hT² ‖(∇qh)|T‖2² ≤ Σ_{E∈ET} ( tE · ∇qh(mE) )² ≤ C2 hT² ‖(∇qh)|T‖2²   (5.16)

with C1 > 0 and C2 independent of T. Combining this with (5.15) we get, using |T| ‖(∇qh)|T‖2² = ‖∇qh‖0,T²,

−∫_Ω qh div wh dx ≥ C Σ_{T∈Th} [ |T| / (2n − 1) ] hT² ‖(∇qh)|T‖2²
  ≥ C Σ_{T∈Th} hT² ‖∇qh‖0,T² = C ‖qh‖1,h²   (5.17)

Let ÊT̂ be the set of all edges of the unit n-simplex T̂. In the space { v ∈ P2 | v is zero at the vertices of T̂ } the norms ‖v‖1,T̂ and ( Σ_{Ê∈ÊT̂} v(mÊ)² )^{1/2} are equivalent. Using this componentwise for the vector-function ŵh := wh ◦ F we get:

|wh|1,T² ≤ C hT^{−2} |T| |ŵh|1,T̂² ≤ C hT^{−2} |T| ‖ŵh‖1,T̂²
  ≤ C hT^{−2} |T| Σ_{Ê∈ÊT̂} ‖ŵh(mÊ)‖2² = C hT^{−2} |T| Σ_{E∈ET} ‖wE‖2²


Summation over all simplices T yields, using (5.16),

‖wh‖1² ≤ C |wh|1² ≤ C Σ_{T∈Th} |wh|1,T² ≤ C Σ_{T∈Th} hT^{−2} |T| Σ_{E∈ET} ‖wE‖2²
  = C Σ_{T∈Th} hT^{−2} |T| Σ_{E∈ET} ( tE · ∇qh(mE) )² ‖tE‖2²
  ≤ C Σ_{T∈Th} hT² ‖∇qh‖0,T² = C ‖qh‖1,h²   (5.18)

From (5.17) and (5.18) we obtain

b(wh, qh)/‖wh‖1 ≥ C ‖qh‖1,h

with a constant C > 0 independent of qh and of Th ∈ {Th}. Now apply theorem 5.2.4.

One can also prove stability for higher order Hood-Taylor finite elements:

Theorem 5.2.6 Let {Th} be a regular family of triangulations as in theorem 5.2.5. Then the Hood-Taylor pair of finite element spaces with k ≥ 3 is stable.

Proof. We refer to the literature: [15, 16, 22].

Remark 5.2.7 The condition that every T ∈ Th has at least one vertex in the interior of Ω is a mild one. Let S := { T ∈ Th | T has no vertex in the interior of Ω }. If S ≠ ∅ then a suitable bisection of each T ∈ S (and of one of the neighbours of T) results in a modified admissible triangulation for which the condition is satisfied. In certain cases the condition can be avoided (for example, for n = 2, k = 2, 3, cf. [87]) or replaced by another similar assumption on the geometry of the triangulation (cf. remark 3.2 in [16]). An example which shows that the stability result does in general not hold without an assumption on the geometry of the triangulation is given in [16], remark 3.3.

For the discretization of the Stokes problem with Hood-Taylor finite elements we have the following bound for the discretization error:

Theorem 5.2.8 Let {Th} be a regular family of triangulations as in theorem 5.2.5. Consider the discrete Stokes problem (5.9) with Hood-Taylor finite element spaces as in (5.8), k ≥ 2. Suppose that the continuous solution (u, p) lies in Hm(Ω)^n × H^{m−1}(Ω) with m ≥ 2. For m ≤ k + 1 the following holds:

‖u − uh‖1 + ‖p − ph‖L2 ≤ C h^{m−1} ( |u|m + |p|m−1 )

with a constant C independent of Th ∈ {Th} and of (u, p).

Proof. We apply theorem 5.1.1 with V = H1_0(Ω)^n, M = L2_0(Ω) and (Vh, Mh) the pair of Hood-Taylor finite element spaces. From the analysis in section 2.6 it follows that the conditions (5.6a) and (5.6b) are satisfied. From theorem 5.2.5 or theorem 5.2.6 it follows that the discrete inf-sup property (5.6c) holds with a constant βh independent of Th. Hence we have

‖u − uh‖1 + ‖p − ph‖L2 ≤ C ( inf_{vh∈Vh} ‖u − vh‖1 + inf_{qh∈Mh} ‖p − qh‖L2 )   (5.19)


For the first term on the right-hand side we use (componentwise) the result of corollary 3.3.10. This yields:

inf_{vh∈Vh} ‖u − vh‖1 ≤ C h^{m−1} |u|m   (5.20)

with a constant C independent of u and of Th ∈ {Th}. Using p ∈ L2_0(Ω) it follows that

inf_{qh∈Mh} ‖p − qh‖L2 = inf_{qh∈X^{k−1}_h} ‖p − qh‖L2

For m = 2 we can use the Clément operator of theorem 5.2.3 and for m ≥ 3 the result in corollary 3.3.10 to bound the approximation error for the pressure:

inf_{qh∈Mh} ‖p − qh‖L2 ≤ C h^{m−1} |p|m−1   (5.21)

Combination of (5.19), (5.20) and (5.21) completes the proof.

Sufficient conditions for (u, p) ∈ H2(Ω)^n × H1(Ω) to hold are given in section 2.6.2.
As in section 3.4.2 one can derive an L2-error bound for the velocity using a duality argument. For this we have to assume H2-regularity of the Stokes problem (cf. section 2.6.2):

Theorem 5.2.9 Consider the Stokes problem and its Hood-Taylor discretization as described in theorem 5.2.8. In addition we assume that the Stokes problem is H2-regular. Then for 2 ≤ m ≤ k + 1 the inequality

‖u − uh‖L2 ≤ C h^m ( |u|m + |p|m−1 )

holds, with C independent of Th ∈ {Th} and of (u, p).

Proof. The variational Stokes problem can be reformulated as in (5.3):

find U ∈ H such that k(U, V) = F(V) for all V ∈ H

with H = H1_0(Ω)^n × L2_0(Ω), k(·, ·) as in (5.2), U = (u, p), F(V) = F((v, q)) = ∫_Ω f · v dx. Let eh = U − Uh = (u − uh, p − ph) be the discretization error. We consider the dual problem with data u − uh: let Û = (û, p̂) ∈ H be such that

k(Û, V) = k(V, Û) = Fe(V) := ∫_Ω (u − uh) · v dx ∀ V = (v, q) ∈ H

For V = eh we obtain, using the Galerkin orthogonality of eh:

‖u − uh‖L2² = Fe(eh) = k(eh, Û) = k(eh, Û − wh) ∀ wh ∈ Hh = Vh × Mh

and thus

‖u − uh‖L2² ≤ C ‖eh‖H inf_{wh∈Hh} ‖Û − wh‖H ≤ C ‖eh‖H ( inf_{vh∈Vh} ‖û − vh‖1 + inf_{qh∈Mh} ‖p̂ − qh‖L2 )

For the second term on the right-hand side we can use approximation results as in (5.20), (5.21). In combination with the H2-regularity this yields

inf_{vh∈Vh} ‖û − vh‖1 + inf_{qh∈Mh} ‖p̂ − qh‖L2 ≤ C h ( |û|2 + |p̂|1 ) ≤ C h ‖u − uh‖L2

For ‖eh‖H ≤ ‖u − uh‖1 + ‖p − ph‖L2 we can use the result in theorem 5.2.8. Thus we conclude that

‖u − uh‖L2² ≤ C h^{m−1} ( |u|m + |p|m−1 ) h ‖u − uh‖L2

holds.


5.2.2 Other finite element spaces

In this section we briefly discuss some other pairs of finite element spaces that are used in practice for solving Stokes (and Navier-Stokes) problems.

Rectangular finite elements.
Let {Th} be a regular family of triangulations consisting of n-rectangles. The pair of rectangular finite element spaces is given by (cf. (3.17)):

(Vh, Mh) = ( (Q^k_{h,0})^n , Q^{k−1}_h ∩ L2_0(Ω) ) , k ≥ 1

In remark 5.2.1 it is shown that for k = 1 this pair in general will not be stable. In [11] it is proved that the pair (Vh, Mh) with k = 2 is stable both for n = 2 and n = 3. In [87] it is proved that for all k ≥ 2 the pair (Vh, Mh) is stable if n = 2. In these stable cases one can prove discretization error bounds as in theorem 5.2.8 and theorem 5.2.9. The analysis is very similar to the one presented for the case of simplicial finite elements in section 5.2.1.

Mini-element.
Let {Th} be a regular family of triangulations consisting of simplices. For every T ∈ Th we can define a so-called “bubble” function

bT(x) = ∏_{i=1}^{n+1} λi(x) for x ∈ T , bT(x) = 0 otherwise

with λi(x), i = 1, . . . , n + 1, the barycentric coordinates of x ∈ T. Define the space of bubble functions B := span{ bT | T ∈ Th }. The mini-element, introduced in [4], is given by the pair

(Vh, Mh) = ( (X1_{h,0} ⊕ B)^n , X1_h ∩ L2_0(Ω) )

This element is stable, cf. [40, 4]. An advantage of this element compared to, for example, the Hood-Taylor element with k = 2 is that the implementation of the former is relatively easy. This is due to the following. The unknowns associated to the bubble basis functions can be eliminated by a simple local technique (so-called static condensation) and the remaining unknowns for the velocity and pressure basis functions are associated to the same set of points, namely the vertices of the simplices. In case of Hood-Taylor elements (k = 2) one also needs the midpoints of edges for some of the velocity unknowns. Hence, the data structures for the mini-element are relatively simple. A disadvantage of the mini-element is its low accuracy (only P1 for the velocity).
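The bubble function is easy to write down explicitly. The sketch below (illustrative; the reference triangle with vertices (0,0), (1,0), (0,1) is our own choice, any simplex works via its own barycentric coordinates) evaluates bT = λ1 λ2 λ3 and illustrates that it is positive inside T, vanishes on ∂T, and attains its maximum (1/3)³ at the barycenter:

```python
import numpy as np

# Sketch: bubble function b_T = lambda_1 * lambda_2 * lambda_3 on the
# reference triangle with vertices (0,0), (1,0), (0,1).
def bubble(x, y):
    lam = np.array([1.0 - x - y, x, y])   # barycentric coordinates of (x, y)
    return float(np.prod(lam))

print(bubble(1/3, 1/3))   # maximum (1/3)^3 at the barycenter
print(bubble(0.5, 0.0))   # 0.0 on the edge y = 0
print(bubble(0.2, 0.3))   # positive in the interior
```

Because bT vanishes on ∂T, each bubble is supported on a single element; this locality is what makes the static condensation mentioned above a purely elementwise operation.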

IsoP2 − P1 element.
This element is a variant of the Hood-Taylor element with k = 2. Let {Th} be a regular family of triangulations consisting of simplices. Given Th we construct a refinement T_{h/2} by dividing each n-simplex T ∈ Th, n = 2 or n = 3, into 2^n subsimplices by connecting the midpoints of the edges of T. Note that for n = 3 this construction is not unique. The space of continuous functions which are piecewise linear on the simplices in T_{h/2} and zero on ∂Ω is denoted by X1_{h/2,0}. The isoP2 − P1 element consists of the pair of spaces

(Vh, Mh) = ( (X1_{h/2,0})^n , X1_h ∩ L2_0(Ω) )

Both for n = 2 and n = 3 this pair is stable. This can be shown using the analysis of section 5.2.1. The proofs of theorem 5.2.4 and of theorem 5.2.5 apply, with minor modifications,

124

Page 125: Numerical methods for elliptic partial differential ... · Introduction to elliptic boundary value problems In this chapter we introduce the classical formulation of scalar elliptic

to the isoP2 −P1 pair, too. In the discrete velocity space Vh the degrees of freedom (unknowns)are associated to the vertices and the midpoints of edges of T ∈ Th. This is the same as for thediscrete velocity space in the Hood-Taylor pair with k = 2. This explains the name isoP2 −P1.Note, however, that the accuracy for the velocity for the isoP2 − P1 element is only O(h) inthe norm ‖ · ‖1, whereas for the Hood-Taylor pair with k = 2 one has O(h2) in the norm ‖ · ‖1

(provided the solution is sufficiently smooth).

Nonconforming Crouzeix-Raviart: in preparation.

In certain situations, if the pair (V_h, M_h) of finite element spaces is not stable, one can still successfully apply these spaces for the discretization of the Stokes problem, provided one uses an appropriate stabilization technique. We do not discuss this topic here. An overview of some useful stabilization methods is given in [73], section 9.4.


Chapter 6

Linear iterative methods

The discretization of elliptic boundary value problems like the Poisson equation or the Stokes equations results in a large sparse linear system of equations. For the numerical solution of such a system iterative methods are applied. Important classes of iterative methods are treated in the chapters 7-10. In this chapter we present some basic results on linear iterative methods and discuss some classical iterative methods like, for example, the Jacobi and Gauss-Seidel methods. In our applications these methods turn out to be very inefficient and thus not very suitable for practical use. However, these methods play a role in the more advanced (and more efficient) methods treated in the chapters 7-10. Furthermore, these basic iterative methods can be used to explain important notions such as convergence rate and efficiency. Standard references for a detailed analysis of basic iterative methods are Varga [92] and Young [100]. We also refer to Hackbusch [46, 48] and Axelsson [6] for an extensive analysis of these methods.

In the remainder of this chapter we consider a (large sparse) linear system of equations

Ax = b (6.1)

with a nonsingular matrix A ∈ R^{n×n}. The solution of this system is denoted by x^*.

6.1 Introduction

We consider a given iterative method, denoted by x^{k+1} = Ψ(x^k), k ≥ 0, for solving the system in (6.1). We define the error as

e^k = x^k − x^* ,  k ≥ 0.

The iterative method is called a linear iterative method if there exists a matrix C (depending on the particular method but independent of k) such that for the errors we have the recursion

e^{k+1} = C e^k ,  k ≥ 0.  (6.2)

The matrix C is called the iteration matrix of the method. In the next section we will see that basic iterative methods are linear. Also the multigrid methods discussed in chapter 9 are linear. The Conjugate Gradient method, however, is a nonlinear iterative method (cf. chapter 7). From (6.2) it follows that e^k = C^k e^0 for all k, and thus lim_{k→∞} e^k = 0 for arbitrary e^0 if and only if lim_{k→∞} C^k = 0. Based on this, the linear iterative method with iteration matrix C is called convergent if

lim_{k→∞} C^k = 0 .  (6.3)


An important characterization of convergence is related to the spectral radius of the iteration matrix. To derive this characterization we first need two lemmas.

Lemma 6.1.1 For all B ∈ R^{n×n} and all ε > 0 there exists a matrix norm ‖ · ‖_* on R^{n×n} such that

‖B‖_* ≤ ρ(B) + ε .

Proof. For the given matrix B there exists a nonsingular matrix T ∈ C^{n×n} which transforms B to its Jordan normal form:

\[
T^{-1} B T = J, \qquad J = \operatorname{blockdiag}(J_\ell)_{1 \le \ell \le m},
\]

with $J_\ell = (\lambda_\ell)$ or

\[
J_\ell = \begin{pmatrix}
\lambda_\ell & 1 & & \\
 & \ddots & \ddots & \\
 & & \ddots & 1 \\
 & & & \lambda_\ell
\end{pmatrix},
\qquad \lambda_\ell \in \sigma(B), \; 1 \le \ell \le m .
\]

For the given ε > 0 define D_ε := diag(ε, ε^2, . . . , ε^n) ∈ R^{n×n} and T_ε := T D_ε, J_ε := D_ε^{-1} J D_ε. Note that J_ε has the same form as J, only with the entries 1 on the codiagonal replaced by ε. For C ∈ R^{n×n} define

\[
\|C\|_* := \|T_\varepsilon^{-1} C T_\varepsilon\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |(T_\varepsilon^{-1} C T_\varepsilon)_{ij}| .
\]

This defines a matrix norm on R^{n×n}. Furthermore,

\[
\|B\|_* = \|T_\varepsilon^{-1} B T_\varepsilon\|_\infty = \|J_\varepsilon\|_\infty \le \max_{\lambda \in \sigma(B)} |\lambda| + \varepsilon = \rho(B) + \varepsilon .
\]

This proves the result of the lemma.

Lemma 6.1.2 (Stable matrices) For B ∈ R^{n×n} the following holds:

lim_{k→∞} B^k = 0 if and only if ρ(B) < 1.

Proof. “⇐”. Take ε > 0 such that ρ(B) + ε < 1 holds and let ‖ · ‖_* be the matrix norm defined in lemma 6.1.1. Then

‖B^k‖_* ≤ ‖B‖_*^k ≤ (ρ(B) + ε)^k

holds. Hence, lim_{k→∞} ‖B^k‖_* = 0 and thus lim_{k→∞} B^k = 0.
“⇒”. Let ‖C‖_∞ := max{ ‖Cx‖_∞ / ‖x‖_∞ | x ∈ C^n, x ≠ 0 } be the complex maximum norm on C^{n×n}. Take λ ∈ σ(C) and v ∈ C^n, v ≠ 0, such that Cv = λv and |λ| = ρ(C). Then |λ| ‖v‖_∞ = ‖λv‖_∞ = ‖Cv‖_∞ ≤ ‖C‖_∞ ‖v‖_∞. From this it follows that ρ(C) ≤ ‖C‖_∞ holds for arbitrary C ∈ C^{n×n}. From lim_{k→∞} B^k = 0 we get lim_{k→∞} ‖B^k‖_∞ = 0 and thus, due to ρ(B)^k = ρ(B^k) ≤ ‖B^k‖_∞, we have lim_{k→∞} ρ(B)^k = 0. Thus ρ(B) < 1 must hold.


Corollary 6.1.3 For any B ∈ R^{n×n} and any matrix norm ‖ · ‖ on R^{n×n} we have

ρ(B) ≤ ‖B‖ .

Proof. If ρ(B) = 0 then B = 0 and the result holds. For ρ(B) ≠ 0 define $\tilde B := \rho(B)^{-1} B$, so $\rho(\tilde B) = 1$. Assume that ρ(B) > ‖B‖. Then $1 = \rho(\tilde B) > \|\tilde B\|$ holds and thus $\lim_{k\to\infty} \|\tilde B\|^k = 0$. Using $\|\tilde B^k\| \le \|\tilde B\|^k$ this yields $\lim_{k\to\infty} \tilde B^k = 0$. From lemma 6.1.2 we conclude $\rho(\tilde B) < 1$, which gives a contradiction.

From lemma 6.1.2 we obtain the following result:

Theorem 6.1.4 A linear iterative method is convergent if and only if for the corresponding iteration matrix C we have ρ(C) < 1.

If ρ(C) < 1 then the iterative method converges, and the spectral radius ρ(C) even yields a quantitative result for the rate of convergence. To see this, we first formulate a lemma:

Lemma 6.1.5 For any matrix norm ‖ · ‖ on R^{n×n} and any B ∈ R^{n×n} the following equality holds:

lim_{k→∞} ‖B^k‖^{1/k} = ρ(B).

Proof. From corollary 6.1.3 we get ρ(B)^k = ρ(B^k) ≤ ‖B^k‖ and thus

ρ(B) ≤ ‖B^k‖^{1/k} for all k ≥ 1.  (6.4)

Take arbitrary ε > 0 and define $\tilde B := (\rho(B) + \varepsilon)^{-1} B$. Then $\rho(\tilde B) < 1$ and thus $\lim_{k\to\infty} \tilde B^k = 0$. Hence there exists k_0 such that for all k ≥ k_0 we have $\|\tilde B^k\| \le 1$, i.e., (ρ(B) + ε)^{-k} ‖B^k‖ ≤ 1. We get

‖B^k‖^{1/k} ≤ ρ(B) + ε for all k ≥ k_0.  (6.5)

From (6.4) and (6.5) it follows that lim_{k→∞} ‖B^k‖^{1/k} = ρ(B).
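The lemma can be checked numerically. In the sketch below the 2×2 matrix is an illustrative choice (not from the text) whose norm is much larger than its spectral radius, so the convergence of ‖B^k‖^{1/k} to ρ(B) is clearly visible.

```python
import numpy as np

# Illustrative check of lemma 6.1.5: for this (hypothetical) nonsymmetric
# matrix, ||B||_2 is large but ||B^k||^(1/k) still tends to rho(B) = 0.5.
B = np.array([[0.5, 10.0],
              [0.0,  0.4]])
rho = max(abs(np.linalg.eigvals(B)))      # spectral radius, here 0.5

for k in (1, 10, 100):
    val = np.linalg.norm(np.linalg.matrix_power(B, k), 2) ** (1.0 / k)
    print(k, val)                          # decreases towards rho(B) = 0.5
```

This also illustrates the remark below: for small k the contraction number ‖B‖ (here about 10) is the relevant measure, while the asymptotic behavior is governed by ρ(B).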

For the error e^k = C^k e^0 we have

\[
\max_{e^0 \in \mathbb{R}^n} \frac{\|e^k\|}{\|e^0\|} \le \frac{1}{e}
\;\Leftrightarrow\;
\max_{e^0 \in \mathbb{R}^n} \frac{\|C^k e^0\|}{\|e^0\|} \le \frac{1}{e}
\;\Leftrightarrow\;
\|C^k\| \le \frac{1}{e}
\;\Leftrightarrow\;
\|C^k\|^{1/k} \le \Bigl(\frac{1}{e}\Bigr)^{1/k} .
\]

From lemma 6.1.5 we have that, for k large enough,

‖C^k‖^{1/k} ≈ ρ(C).

Hence, to reduce the norm of an arbitrary starting error ‖e^0‖ by a factor 1/e we need asymptotically (i.e. for k large enough) approximately (− ln(ρ(C)))^{-1} iterations. Based on this we call − ln(ρ(C)) the asymptotic convergence rate of the iterative method (in the literature, e.g. Hackbusch [48], sometimes ρ(C) is called the asymptotic convergence rate).

The quantity ‖C‖ is called the contraction number of the iterative method. Note that

‖e^k‖ ≤ ‖C‖ ‖e^{k−1}‖ for all k ≥ 1

holds, and ρ(C) ≤ ‖C‖.


From these results we conclude that ρ(C) is a reasonable measure for the rate of convergence, provided k is large enough. For k “small” it may be better to use the contraction number as a measure for the rate of convergence. Note that the asymptotic convergence rate does not depend on the norm ‖ · ‖. In some situations, measuring the rate of convergence using the contraction number or using the asymptotic rate of convergence is the same. For example, if we use the Euclidean norm and if C is symmetric, then

ρ(C) = ‖C‖_2  (6.6)

holds. However, in other situations, for example if C is strongly nonsymmetric, one can have ρ(C) ≪ ‖C‖.

To measure the quality (efficiency) of an iterative method one has to consider the following two aspects:

• The arithmetic costs per iteration. This can be quantified in flops needed for one iteration.

• The rate of convergence. This can be quantified using − ln(ρ(C)) (asymptotic convergence rate) or ‖C‖ (contraction number).

To be able to compare iterative methods the notion of complexity is introduced. We assume:

• A given linear system.

• A given error reduction factor R, i.e. we wish to reduce the norm of an arbitrary starting error by a factor R.

The complexity of an iterative method is then defined as the order of magnitude of the number of flops needed to obtain an error reduction by a factor R for the given problem. In this notion the arithmetic costs per iteration and the rate of convergence are combined. The quality of different methods for a given problem (class) can be compared by means of this complexity concept. Examples of this are given in section 6.6.

6.2 Basic linear iterative methods

In this section we introduce classical linear iterative schemes, namely the Richardson, (damped) Jacobi, Gauss-Seidel and SOR methods. For the convergence analysis that is presented in the sections 6.3 and 6.5 it is convenient to put these methods in the general framework of so-called matrix splittings (cf. Varga [92]).

We first show how a splitting of the matrix A in a natural way results in a linear iterative method. We assume a splitting of the matrix A such that

A = M − N , where  (6.7a)

M is nonsingular, and  (6.7b)

for arbitrary y we can solve Mx = y with relatively low costs.  (6.7c)

For the solution x^* of the system in (6.1) we have

M x^* = N x^* + b .


The splitting of A results in the following matrix splitting iterative method. For a given starting vector x^0 ∈ R^n we define

M x^{k+1} = N x^k + b ,  k ≥ 0.  (6.8)

This can also be written as

x^{k+1} = x^k − M^{-1}(A x^k − b).  (6.9)

From (6.9) it follows that for the error e^k = x^k − x^* we have

e^{k+1} = (I − M^{-1}A) e^k .

Hence the iteration in (6.8), (6.9) is a linear iterative method with iteration matrix

C = I − M^{-1}A = M^{-1}N.  (6.10)

The condition in (6.7c) is introduced to obtain a method in (6.8) for which the arithmetic costs per iteration are acceptable. Below we will see that the above-mentioned classical iterative methods can be derived using a suitable matrix splitting. These methods satisfy the conditions in (6.7b), (6.7c), but unfortunately, when applied to discrete elliptic boundary value problems, the convergence rates of these methods are in general very low. This is illustrated in section 6.6.
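The splitting iteration (6.9) can be sketched in a few lines. The 2×2 system below is a hypothetical illustration (it is not from the text), and M = diag(A) is the Jacobi choice discussed next; any M satisfying (6.7b), (6.7c) could be substituted.

```python
import numpy as np

# Minimal sketch of the splitting iteration (6.9): x <- x - M^{-1}(A x - b),
# which is equivalent to solving M x^{k+1} = N x^k + b with N = M - A.
def splitting_iteration(A, M, b, x0, nsteps):
    x = x0.copy()
    for _ in range(nsteps):
        x = x - np.linalg.solve(M, A @ x - b)
    return x

A = np.array([[4.0, -1.0], [-1.0, 4.0]])
b = np.array([3.0, 3.0])                 # exact solution x* = (1, 1)
M = np.diag(np.diag(A))                  # Jacobi splitting M = D
x = splitting_iteration(A, M, b, np.zeros(2), 50)
# the error obeys e^{k+1} = (I - M^{-1}A) e^k, so x is close to (1, 1)
```

For this matrix the iteration matrix C = I − M^{-1}A has spectral radius 1/4, so the error is reduced by a factor 4 per step.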

Richardson method
The simplest linear iterative method is the Richardson method:

x^0 a given starting vector ,
x^{k+1} = x^k − ω(A x^k − b) ,  k ≥ 0 ,  (6.11)

with a parameter ω ∈ R, ω ≠ 0. The iteration matrix of this method is given by C_ω = I − ωA.

Jacobi method
A second classical and very simple method is due to Jacobi. We introduce the notation

D := diag(A) ,  A = D − L − U ,  (6.12)

with L a lower triangular matrix with zero entries on the diagonal and U an upper triangular matrix with zero entries on the diagonal. We assume that A has only nonzero entries on the diagonal, so D is nonsingular. The method of Jacobi is the iterative method as in (6.8) based on the matrix splitting

M := D ,  N := L + U .

The method of Jacobi is as follows:

x^0 a given starting vector ,
D x^{k+1} = (L + U) x^k + b ,  k ≥ 0 .

This can also be formulated row by row:

x^0 a given starting vector ,
a_{ii} x^{k+1}_i = − Σ_{j≠i} a_{ij} x^k_j + b_i ,  i = 1, 2, . . . , n ,  k ≥ 0 .  (6.13)

From this we see that in the method of Jacobi we solve the ith equation (Σ_{j=1}^{n} a_{ij} x_j = b_i) for the ith unknown x_i, using values for the other unknowns (x_j, j ≠ i) computed in the previous iteration. The iteration can also be represented as

x^{k+1} = (I − D^{-1}A) x^k + D^{-1} b ,  k ≥ 0 .

In the Jacobi method the computational costs per iteration are “low”, namely comparable to one matrix-vector multiplication Ax, i.e. cn flops (due to the sparsity of A). We introduce a variant of the Jacobi method in which a parameter is used. This method is given by

x^{k+1} = (I − ω D^{-1}A) x^k + ω D^{-1} b ,  k ≥ 0 ,  (6.14)

with a given real parameter ω ≠ 0. This method corresponds to the splitting

M = (1/ω) D ,  N = (1/ω − 1) D + L + U .  (6.15)

For ω = 1 we obtain the Jacobi method. For ω ≠ 1 this method is called the damped Jacobi method (“damped” due to the fact that in practice one usually takes ω ∈ (0, 1)).
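A row-wise sweep of the (damped) Jacobi method following (6.13)/(6.14) can be sketched as below; the small SPD test system is an illustrative choice, not from the text.

```python
import numpy as np

# One (damped) Jacobi sweep following (6.13)/(6.14). For omega = 1 this is
# the plain Jacobi method: all entries are updated from the old iterate x.
def jacobi_sweep(A, b, x, omega=1.0):
    n = len(b)
    x_new = np.empty_like(x)
    for i in range(n):
        s = sum(A[i, j] * x[j] for j in range(n) if j != i)
        x_new[i] = (1.0 - omega) * x[i] + omega * (b[i] - s) / A[i, i]
    return x_new

A = np.array([[4.0, -1.0], [-1.0, 4.0]])
b = np.array([3.0, 3.0])                 # exact solution (1, 1)
x = np.zeros(2)
for _ in range(40):
    x = jacobi_sweep(A, b, x)
```

Note that the sweep writes into a separate vector x_new: in contrast to Gauss-Seidel below, no updated value is used within the same sweep.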

Gauss-Seidel method
This method is based on the matrix splitting

M := D − L ,  N := U .

This results in the method:

x^0 a given starting vector ,
(D − L) x^{k+1} = U x^k + b ,  k ≥ 0 .

This can be formulated row-wise:

x^0 a given starting vector ,
a_{ii} x^{k+1}_i = − Σ_{j<i} a_{ij} x^{k+1}_j − Σ_{j>i} a_{ij} x^k_j + b_i ,  i = 1, . . . , n ,  k ≥ 0 .  (6.16)

For the Gauss-Seidel method to be feasible we assume that D is nonsingular. In the Jacobi method (6.13) for the computation of x^{k+1}_i (i.e. for solving the ith equation for the ith unknown x_i) we use the values x^k_j, j ≠ i, whereas in the Gauss-Seidel method (6.16) for the computation of x^{k+1}_i we use x^{k+1}_j, j < i, and x^k_j, j > i. The iteration matrix of the Gauss-Seidel method is given by

C = (D − L)^{-1} U = I − (D − L)^{-1} A .
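The row-wise form (6.16) translates into an in-place sweep: updated entries are used immediately. A sketch, again on an illustrative SPD system:

```python
import numpy as np

# In-place Gauss-Seidel sweep following (6.16): the new values x_j, j < i,
# already overwrite the old ones and are used for the later rows.
def gauss_seidel_sweep(A, b, x):
    n = len(b)
    for i in range(n):
        s = sum(A[i, j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i, i]
    return x

A = np.array([[4.0, -1.0], [-1.0, 4.0]])
b = np.array([3.0, 3.0])                 # exact solution (1, 1)
x = np.zeros(2)
for _ in range(20):
    gauss_seidel_sweep(A, b, x)
```

For this matrix the Gauss-Seidel iteration matrix has spectral radius 1/16, the square of the Jacobi value 1/4 (cf. corollary 6.4.4 below).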

SOR method
The Gauss-Seidel method in (6.16) can be rewritten as

x^0 a given starting vector ,
x^{k+1}_i = x^k_i − ( Σ_{j<i} a_{ij} x^{k+1}_j + Σ_{j≥i} a_{ij} x^k_j − b_i ) / a_{ii} ,  i = 1, . . . , n ,  k ≥ 0 .


From this representation it is clear how x^{k+1}_i can be obtained by adding a certain correction term to x^k_i. We now introduce a method in which this correction term is multiplied by a parameter ω > 0:

x^0 a given starting vector ,
x^{k+1}_i = x^k_i − ω ( Σ_{j<i} a_{ij} x^{k+1}_j + Σ_{j≥i} a_{ij} x^k_j − b_i ) / a_{ii} ,  i = 1, . . . , n ,  k ≥ 0 .  (6.17)

This method is the Successive Overrelaxation method (SOR). The terminology “over”-relaxation is used because in general one should take ω > 1 (cf. theorem 6.4.3 below). For ω = 1 the SOR method reduces to the Gauss-Seidel method. In matrix-vector notation the SOR method is as follows:

D x^{k+1} = (1 − ω) D x^k + ω (L x^{k+1} + U x^k + b) ,

or, equivalently,

(1/ω)(D − ωL) x^{k+1} = (1/ω)[(1 − ω) D + ωU] x^k + b .

From this it is clear that the SOR method is also a matrix splitting iterative method, corresponding to the splitting (cf. (6.15))

M := (1/ω) D − L ,  N := (1/ω − 1) D + U .

The iteration matrix is given by

C = C_ω = I − M^{-1}A = I − ((1/ω) D − L)^{-1} A .

For the SOR method the arithmetic costs per iteration are comparable to those of the Gauss-Seidel method.
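An SOR sweep following (6.17) differs from the Gauss-Seidel sweep only by the factor ω in front of the correction; the system and the value ω = 1.1 below are illustrative choices.

```python
import numpy as np

# SOR sweep following (6.17): the Gauss-Seidel correction for row i is
# multiplied by the relaxation parameter omega.
def sor_sweep(A, b, x, omega):
    n = len(b)
    for i in range(n):
        # residual of row i; x[j] for j < i is already updated in place
        r = sum(A[i, j] * x[j] for j in range(n)) - b[i]
        x[i] = x[i] - omega * r / A[i, i]
    return x

A = np.array([[4.0, -1.0], [-1.0, 4.0]])
b = np.array([3.0, 3.0])                 # exact solution (1, 1)
x = np.zeros(2)
for _ in range(30):
    sor_sweep(A, b, x, omega=1.1)
```

Note that the sum over all j includes the diagonal term a_{ii} x_i, which matches the term Σ_{j≥i} in (6.17).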

The Symmetric Successive Overrelaxation method (SSOR) is a variant of the SOR method. One SSOR iteration consists of two SOR steps. In the first step we apply an SOR iteration as in (6.17), and in the second step we again apply an SOR iteration, but now with the reversed ordering of the unknowns. In formulas we thus have:

\[
x^{k+\frac12}_i = x^k_i - \omega \Bigl( \sum_{j<i} a_{ij} x^{k+\frac12}_j + \sum_{j\ge i} a_{ij} x^k_j - b_i \Bigr) \big/ a_{ii} \,, \qquad i = 1, 2, \ldots, n,
\]
\[
x^{k+1}_i = x^{k+\frac12}_i - \omega \Bigl( \sum_{j>i} a_{ij} x^{k+1}_j + \sum_{j\le i} a_{ij} x^{k+\frac12}_j - b_i \Bigr) \big/ a_{ii} \,, \qquad i = n, \ldots, 1 .
\]

This method results if we use a matrix splitting with

\[
M = \frac{1}{\omega(2-\omega)}\,(D - \omega L)\,D^{-1}\,(D - \omega U) .
\tag{6.18}
\]

Although the arithmetic costs for one SSOR iteration seem to be significantly higher than for one SOR iteration, one can implement SSOR in such a way that the costs per iteration are approximately the same as for SOR (cf. [68]). In many cases the rate of convergence of both methods is about the same. Often, in the SSOR method the sensitivity of the convergence rate with respect to variation in the relaxation parameter ω is much lower than in the SOR method (cf. Axelsson and Barker [7]). Finally we note that if A is symmetric positive definite then the matrix M in (6.18) is symmetric positive definite, too (such a property does not hold for the SOR method). Due to this property the SSOR method can be used as a preconditioner for the Conjugate Gradient method. This is further explained in chapter 7.
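The symmetry claim about M in (6.18) is easy to check numerically. In the sketch below the tridiagonal SPD matrix and the value ω = 1.2 are illustrative choices; since A is symmetric we have U = L^T, so M is symmetric by construction.

```python
import numpy as np

# Sketch checking that the SSOR splitting matrix M from (6.18) is symmetric
# positive definite when A is SPD and omega lies in (0, 2).
A = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -1.0],
              [ 0.0, -1.0,  4.0]])
D = np.diag(np.diag(A))
L = -np.tril(A, -1)                      # from A = D - L - U
U = -np.triu(A, 1)                       # U = L^T since A is symmetric
w = 1.2
M = (D - w * L) @ np.linalg.inv(D) @ (D - w * U) / (w * (2.0 - w))
# M equals M^T and all its eigenvalues are positive
```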


6.3 Convergence analysis in the symmetric positive definite case

For the classical linear iterative methods we derive convergence results for the case that A is symmetric positive definite. Recall that for square symmetric matrices B and C we use the notation B < C (B ≤ C) if C − B is positive definite (semi-definite). We start with an elementary lemma:

Lemma 6.3.1 Let B ∈ R^{n×n} be a symmetric positive definite matrix. The smallest and largest eigenvalues of B are denoted by λ_min(B) and λ_max(B), respectively. The following holds:

ρ(I − ωB) < 1 iff 0 < ω < 2/λ_max(B) ,  (6.19)

min_ω ρ(I − ωB) = ρ(I − ω_opt B) = 1 − 2/(κ(B) + 1)  for  ω_opt = 2/(λ_min(B) + λ_max(B)) .  (6.20)

Proof. The eigenvalues of I − ωB are given by { 1 − ωλ | λ ∈ σ(B) }. Define ω_opt as in (6.20). We then have

\[
\rho(I - \omega B) = \max\{\, |1 - \omega\lambda_{\min}(B)| \,,\; |1 - \omega\lambda_{\max}(B)| \,\}
= \begin{cases}
1 - \omega\lambda_{\max}(B) & \text{if } \omega \le 0, \\
1 - \omega\lambda_{\min}(B) & \text{if } 0 \le \omega \le \omega_{\mathrm{opt}}, \\
\omega\lambda_{\max}(B) - 1 & \text{if } \omega_{\mathrm{opt}} \le \omega .
\end{cases}
\]

Hence ρ(I − ωB) < 1 iff ω > 0 and ωλ_max(B) − 1 < 1. This proves the result in (6.19). The result in (6.20) follows from

\[
\min_\omega \rho(I - \omega B) = 1 - \omega_{\mathrm{opt}}\lambda_{\min}(B)
= 1 - \frac{2\lambda_{\min}(B)}{\lambda_{\min}(B) + \lambda_{\max}(B)}
= 1 - \frac{2}{\kappa(B) + 1} .
\]

As an immediate consequence of this lemma we get a convergence result for the Richardson method.

Corollary 6.3.2 Let A be symmetric positive definite. For the iteration matrix of the Richardson method, C_ω = I − ωA, we have

ρ(C_ω) < 1 iff 0 < ω < 2/λ_max(A) ,  (6.21)

min_ω ρ(C_ω) = ρ(C_{ω_opt}) = 1 − 2/(κ(A) + 1)  for  ω_opt = 2/(λ_min(A) + λ_max(A)) .  (6.22)
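The corollary can be checked directly on a small SPD matrix. The 2×2 matrix below, with eigenvalues 1 and 3 (so κ(A) = 3, ω_opt = 1/2), is an illustrative choice:

```python
import numpy as np

# Sketch checking (6.21)-(6.22) for an illustrative SPD matrix with
# eigenvalues 1 and 3: omega_opt = 2/(1+3) = 0.5, rho(C_opt) = 1 - 2/4 = 0.5.
A = np.array([[2.0, 1.0], [1.0, 2.0]])

def rho(w):
    return max(abs(np.linalg.eigvals(np.eye(2) - w * A)))

w_opt = 2.0 / (1.0 + 3.0)
print(rho(w_opt))   # approximately 0.5, the optimum 1 - 2/(kappa+1)
print(rho(0.7))     # omega outside (0, 2/3): spectral radius >= 1
```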

We now consider the Jacobi method. From theorem 6.1.4 we obtain that this method is convergent if and only if ρ(I − D^{-1}A) < 1 holds. A simple example shows that the method of Jacobi does not converge for every symmetric positive definite matrix A: consider

\[
A = \begin{pmatrix} \tfrac32 & 1 & 1 \\ 1 & \tfrac32 & 1 \\ 1 & 1 & \tfrac32 \end{pmatrix},
\]

with spectrum σ(A) = { 1/2, 7/2 } (the eigenvalue 1/2 has multiplicity two). Then ρ(I − D^{-1}A) = |1 − (7/2)/(3/2)| = 4/3.
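The computation for this counterexample can be verified numerically:

```python
import numpy as np

# The 3x3 counterexample above: A is SPD (eigenvalues 1/2, 1/2, 7/2),
# yet the Jacobi iteration matrix has spectral radius 4/3 > 1.
A = 0.5 * np.eye(3) + np.ones((3, 3))    # diagonal 3/2, off-diagonal 1
C = np.eye(3) - A / 1.5                  # I - D^{-1}A with D = (3/2) I
rho = max(abs(np.linalg.eigvals(C)))     # close to 4/3: Jacobi diverges
```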

From the analysis in section 6.5.2 (theorems 6.5.12 and 6.5.13) it follows that if A is symmetric positive definite and a_{ij} ≤ 0 for all i ≠ j, then the Jacobi method is convergent. A convergence result for the damped Jacobi method can be derived in which the assumption a_{ij} ≤ 0 is avoided:

Theorem 6.3.3 Let A be a symmetric positive definite matrix. For the iteration matrix of the damped Jacobi method, C_ω = I − ω D^{-1}A, we have

ρ(C_ω) < 1 iff 0 < ω < 2/λ_max(D^{-1}A) ,  (6.23)

min_ω ρ(C_ω) = ρ(C_{ω_opt}) = 1 − 2/(κ(D^{-1}A) + 1)  for  ω_opt = 2/(λ_min(D^{-1}A) + λ_max(D^{-1}A)) .  (6.24)

Proof. Note that D^{-1/2} A D^{-1/2} is symmetric positive definite. Apply lemma 6.3.1 with B = D^{-1/2} A D^{-1/2} and note that σ(D^{-1/2} A D^{-1/2}) = σ(D^{-1}A).

If the matrix A is the stiffness matrix resulting from a finite element discretization of a scalar elliptic boundary value problem then in general the condition number κ(D^{-1}A) is very large (namely ∼ h^{-2}, with h a mesh size parameter). The result in the previous theorem shows that for such problems the rate of convergence of the (damped) Jacobi method is very low. This is illustrated in section 6.6.

For the convergence analysis of both the Gauss-Seidel and the SOR method the following lemma is useful.

Lemma 6.3.4 Let A be symmetric positive definite and assume that M is such that

M + M^T > A .  (6.25)

Then M is nonsingular and

ρ(I − M^{-1}A) ≤ ‖I − M^{-1}A‖_A < 1

holds.

Proof. Assume that Mx = 0. Then 〈(M + M^T)x, x〉 = 0, and using assumption (6.25) this implies x = 0. Hence M is nonsingular. We introduce C := I − A^{1/2} M^{-1} A^{1/2} and note that

‖I − M^{-1}A‖_A = ‖C‖_2 = ρ(C^T C)^{1/2} .

From M + M^T > A it follows that

\[
A^{\frac12}(M^{-T} + M^{-1})A^{\frac12} = A^{\frac12} M^{-T}(M + M^T) M^{-1} A^{\frac12} > A^{\frac12} M^{-T} A M^{-1} A^{\frac12} .
\]


Using this we get

\[
0 \le C^T C = \bigl(I - A^{\frac12} M^{-T} A^{\frac12}\bigr)\bigl(I - A^{\frac12} M^{-1} A^{\frac12}\bigr)
= I - A^{\frac12}(M^{-T} + M^{-1})A^{\frac12} + A^{\frac12} M^{-T} A M^{-1} A^{\frac12} < I ,
\]

and thus ρ(C^T C) < 1 holds.

Using this lemma we immediately obtain a main convergence result for the Gauss-Seidel method.

Theorem 6.3.5 Let A be symmetric positive definite. Then we have

ρ(C) ≤ ‖C‖_A < 1  with  C := I − (D − L)^{-1}A ,

and thus the Gauss-Seidel method is convergent.

Proof. The Gauss-Seidel method corresponds to the matrix splitting A = M − N with M = D − L. Note that M + M^T = D + (D − L − L^T) = D + A > A holds. Application of lemma 6.3.4 proves the result.

We now consider the SOR method. Recall that this method corresponds to the matrix splitting A = M − N with M = (1/ω) D − L.

Theorem 6.3.6 Let A be symmetric positive definite. Then for ω ∈ (0, 2) we have

ρ(C_ω) ≤ ‖C_ω‖_A < 1  with  C_ω := I − ((1/ω) D − L)^{-1}A ,

and thus the SOR method is convergent.

Proof. For M = (1/ω) D − L we have

\[
M + M^T = \frac{2}{\omega} D - L - L^T = \Bigl(\frac{2 - \omega}{\omega}\Bigr) D + A > A \quad \text{if } \omega \in (0, 2).
\]

Application of lemma 6.3.4 proves the result.

In the following lemma we show that for every matrix A (i.e., not necessarily symmetric) with a nonsingular diagonal the SOR method with ω ∉ (0, 2) is not convergent.

Lemma 6.3.7 Let A ∈ R^{n×n} be a matrix with a_{ii} ≠ 0 for all i. For the iteration matrix C_ω = I − ((1/ω) D − L)^{-1}A of the SOR method we have

ρ(C_ω) ≥ |1 − ω|  for all ω ≠ 0 .

Proof. Define $\tilde L := D^{-1}L$ and $\tilde U := D^{-1}U$. Then we have

\[
C_\omega = I - (I - \omega \tilde L)^{-1}\,\omega\,(I - \tilde L - \tilde U) = (I - \omega \tilde L)^{-1}\bigl((1 - \omega)I + \omega \tilde U\bigr) .
\]

Hence, $\det(C_\omega) = \det\bigl[(I - \omega\tilde L)^{-1}\bigl((1-\omega)I + \omega\tilde U\bigr)\bigr] = (1 - \omega)^n$. Let {λ_i | 1 ≤ i ≤ n} = σ(C_ω) be the spectrum of the iteration matrix. Then, due to the fact that the determinant of a matrix equals the product of its eigenvalues, we get $\prod_{i=1}^{n} |\lambda_i| = |1 - \omega|^n$. Thus there must be an eigenvalue with modulus at least |1 − ω|.

Jacobi for the nonsymmetric case .....in preparation.

Block-Jacobi method ....in preparation.


6.4 Rate of convergence of the SOR method

The result in theorem 6.3.6 shows that in the symmetric positive definite case the SOR method is convergent if we take ω ∈ (0, 2). This result, however, does not quantify the rate of convergence. Moreover, it is not clear how the rate of convergence depends on the choice of the parameter ω. It is known that for certain problems a suitable choice of the parameter ω can result in an SOR method which has a much higher rate of convergence than the Jacobi and Gauss-Seidel methods. This is illustrated in example 6.6.4. However, the relation between the rate of convergence and the parameter ω is strongly problem dependent, and for most problems it is not known how a good (i.e. close to optimal) value for the parameter ω can be determined. In this section we present an analysis which, for a relatively small class of block-tridiagonal matrices, shows the dependence of the spectral radius of the SOR iteration matrix on the parameter ω. For related (more general) results we refer to the literature, e.g., Young [100], Hageman and Young [50], Varga [92]. For a more recent treatment we refer to Hackbusch [48]. We start with a technical lemma. Recall the decomposition A = D − L − U, with D = diag(A), and L and U strictly lower and upper triangular matrices, respectively.

Lemma 6.4.1 Consider A = D − L − U with det(A) ≠ 0. Assume that A has the block-tridiagonal structure

\[
A = \begin{pmatrix}
D_{11} & A_{12} & & \\
A_{21} & \ddots & \ddots & \\
 & \ddots & \ddots & A_{k-1,k} \\
 & & A_{k,k-1} & D_{kk}
\end{pmatrix},
\qquad D_{ii} \in \mathbb{R}^{n_i \times n_i}, \; 1 \le i \le k,
\tag{6.26}
\]

with diag(A) = blockdiag(D_{11}, . . . , D_{kk}). Then the eigenvalues of z D^{-1}L + (1/z) D^{-1}U are independent of z ∈ C, z ≠ 0.

Proof. For z ∈ C, z ≠ 0, define G_z := z D^{-1}L + (1/z) D^{-1}U. Note that

\[
G_z = - \begin{pmatrix}
0 & \frac{1}{z} D_{11}^{-1} A_{12} & & \\
z D_{22}^{-1} A_{21} & 0 & \ddots & \\
 & \ddots & \ddots & \frac{1}{z} D_{k-1,k-1}^{-1} A_{k-1,k} \\
 & & z D_{kk}^{-1} A_{k,k-1} & 0
\end{pmatrix}.
\]

Introduce T_z := blockdiag(I_1, z I_2, . . . , z^{k−1} I_k) with I_i the n_i × n_i identity matrix. Now note that

T_z^{-1} G_z T_z = G_1 = D^{-1}L + D^{-1}U .

This similarity transformation with T_z does not change the spectrum, and thus σ(G_z) = σ(D^{-1}L + D^{-1}U) holds for all z ≠ 0. The latter spectrum is independent of z.

We collect a few properties in the following lemma.


Lemma 6.4.2 Let A be as in lemma 6.4.1. Let C_J = I − D^{-1}A and C_ω = I − ((1/ω) D − L)^{-1}A be the iteration matrices of the Jacobi and SOR method, respectively. The following holds:

(a) ξ ∈ σ(C_J) ⇔ −ξ ∈ σ(C_J).

(b) 0 ∈ σ(C_ω) ⇒ ω = 1.

(c) For λ ≠ 0, ω ≠ 0 we have  λ ∈ σ(C_ω) ⇔ (λ + ω − 1)/(ω λ^{1/2}) ∈ σ(C_J).

Proof. With $\tilde L := D^{-1}L$ and $\tilde U := D^{-1}U$ we have $C_J = \tilde L + \tilde U$ and $C_\omega = (I - \omega\tilde L)^{-1}\bigl((1-\omega)I + \omega\tilde U\bigr)$. From lemma 6.4.1 with z = 1 and z = −1 we have $\sigma(\tilde L + \tilde U) = \sigma(-\tilde L - \tilde U)$, and thus $\xi \in \sigma(\tilde L + \tilde U) \Leftrightarrow -\xi \in \sigma(-\tilde L - \tilde U) = \sigma(\tilde L + \tilde U)$ holds, which proves (a). If $0 \in \sigma(C_\omega)$ then we have

\[
0 = \det\bigl( (I - \omega\tilde L)^{-1}\bigl((1 - \omega)I + \omega\tilde U\bigr) \bigr) = \det\bigl( (1 - \omega)I + \omega\tilde U \bigr) = (1 - \omega)^n
\]

and thus ω = 1, i.e., the result in (b) holds. For λ ∈ σ(C_ω), λ ≠ 0 and ω ≠ 0, we have

\[
\begin{aligned}
\det(C_\omega - \lambda I) &= \det\bigl( (I - \omega\tilde L)^{-1}\bigl[(1 - \omega)I + \omega\tilde U - \lambda(I - \omega\tilde L)\bigr] \bigr) \\
&= \det\bigl( \omega\lambda^{\frac12}\bigl(\lambda^{\frac12}\tilde L + \lambda^{-\frac12}\tilde U\bigr) - (\lambda + \omega - 1)I \bigr) \\
&= \omega^n \lambda^{\frac{n}{2}} \det\Bigl( \bigl(\lambda^{\frac12}\tilde L + \lambda^{-\frac12}\tilde U\bigr) - \frac{\lambda + \omega - 1}{\omega\lambda^{\frac12}}\, I \Bigr) .
\end{aligned}
\]

Using lemma 6.4.1 we get (for λ ≠ 0, ω ≠ 0):

\[
\lambda \in \sigma(C_\omega) \;\Leftrightarrow\; \frac{\lambda + \omega - 1}{\omega\lambda^{\frac12}} \in \sigma\bigl(\lambda^{\frac12}\tilde L + \lambda^{-\frac12}\tilde U\bigr) = \sigma(C_J) ,
\]

which proves the result in (c).

Now we can prove a main result on the rate of convergence of the SOR method.

Theorem 6.4.3 Let A be as in lemma 6.4.1 and C_J, C_ω the iteration matrices of the Jacobi and SOR method, respectively. Assume that all eigenvalues of C_J are real and that μ := ρ(C_J) < 1 (i.e. the method of Jacobi is convergent). Define

\[
\omega_{\mathrm{opt}} := \frac{2}{1 + \sqrt{1 - \mu^2}} = 1 + \Bigl( \frac{\mu}{1 + \sqrt{1 - \mu^2}} \Bigr)^2 .
\tag{6.27}
\]

The following holds:

\[
\rho(C_\omega) = \begin{cases}
\frac14 \bigl( \omega\mu + \sqrt{\omega^2\mu^2 - 4(\omega - 1)} \bigr)^2 & \text{for } 0 < \omega \le \omega_{\mathrm{opt}} , \\
\omega - 1 & \text{for } \omega_{\mathrm{opt}} \le \omega < 2 ,
\end{cases}
\tag{6.28}
\]

and

\[
\omega_{\mathrm{opt}} - 1 = \rho(C_{\omega_{\mathrm{opt}}}) < \rho(C_\omega) < 1 \quad \text{for all } \omega \in (0, 2), \; \omega \ne \omega_{\mathrm{opt}}
\tag{6.29}
\]

holds.


Proof. We only consider ω ∈ (0, 2). Introduce $\tilde L := D^{-1}L$ and $\tilde U := D^{-1}U$. First we treat the case where there exists ω ∈ (0, 2) such that ρ(C_ω) = 0, i.e., C_ω = 0. This implies ω = 1, $\tilde U = 0$, μ = 0 and ω_opt = 1. From $\tilde U = 0$ we get $C_\omega = (1 - \omega)(I - \omega\tilde L)^{-1}$, which yields ρ(C_ω) = |1 − ω|. One now easily verifies that for this case the results in (6.28) and (6.29) hold. We now consider the case with ρ(C_ω) > 0 for all ω ∈ (0, 2). Take λ ∈ σ(C_ω), λ ≠ 0. From lemma 6.4.2 it follows that

\[
\frac{\lambda + \omega - 1}{\omega\lambda^{\frac12}} = \xi \in \sigma(C_J) \subset [-\mu, \mu] .
\]

A simple computation yields

\[
\lambda = \frac14 \bigl( \omega|\xi| \pm \sqrt{\omega^2\xi^2 - 4(\omega - 1)} \bigr)^2 .
\tag{6.30}
\]

We first consider ω with ω_opt ≤ ω < 2. Then ω²ξ² − 4(ω − 1) ≤ ω²μ² − 4(ω − 1) ≤ 0, and thus from (6.30) we obtain

\[
|\lambda| = \frac14 \bigl( \omega^2\xi^2 - (\omega^2\xi^2 - 4(\omega - 1)) \bigr) = \omega - 1 .
\]

Hence in this case all eigenvalues of C_ω have modulus ω − 1, and this implies ρ(C_ω) = ω − 1, which proves the second part of (6.28). We now consider ω with 0 < ω ≤ ω_opt and thus ω²μ² − 4(ω − 1) ≥ 0. If ξ is such that ω²ξ² − 4(ω − 1) ≥ 0, then (6.30) yields

\[
|\lambda| = \frac14 \bigl( \omega|\xi| \pm \sqrt{\omega^2\xi^2 - 4(\omega - 1)} \bigr)^2 .
\]

The maximum value is attained for the “+” sign and for ξ = ±μ, resulting in

\[
|\lambda| = \frac14 \bigl( \omega\mu + \sqrt{\omega^2\mu^2 - 4(\omega - 1)} \bigr)^2 .
\tag{6.31}
\]

There may be eigenvalues λ ∈ σ(C_ω) that correspond to ξ ∈ σ(C_J) with ω²ξ² − 4(ω − 1) < 0. As shown above, such ξ yield corresponding λ ∈ σ(C_ω) with |λ| = ω − 1. Due to

\[
\frac14 \bigl( \omega\mu + \sqrt{\omega^2\mu^2 - 4(\omega - 1)} \bigr)^2 \ge \frac14 \omega^2\mu^2 \ge \omega - 1
\]

we conclude that the maximum value for |λ| is attained in the case (6.31), and thus

\[
\rho(C_\omega) = \frac14 \bigl( \omega\mu + \sqrt{\omega^2\mu^2 - 4(\omega - 1)} \bigr)^2 ,
\]

which proves the first part of (6.28). An elementary computation shows that for ω ∈ (0, 2) the function ω → ρ(C_ω) as defined in (6.28) is continuous, monotonically decreasing on (0, ω_opt] and monotonically increasing on [ω_opt, 2). Moreover, both for ω ↓ 0 and ω ↑ 2 the function value tends to 1. From this the result in (6.29) follows.

In (6.27) we see that ω_opt > 1 holds, which motivates the name “over”-relaxation. Note that we do not require symmetry of the matrix A. However, we do assume that the eigenvalues of C_J are real. A sufficient condition for the latter to hold is that A is symmetric. For different values of μ the function ω → ρ(C_ω) defined in (6.28) is shown in figure 6.1.


Figure 6.1: The function ω → ρ(C_ω) from (6.28), shown for μ = 0.6, 0.9 and 0.95 (spectral radius of the SOR iteration matrix versus ω ∈ (0, 2)).

Corollary 6.4.4 If we take ω = 1 then the SOR method is the same as the Gauss-Seidel method. Hence, if A satisfies the assumptions in theorem 6.4.3 we obtain from (6.28)

ρ(C_1) = μ² = ρ(C_J)² .

Thus − ln(ρ(C_1)) = −2 ln(ρ(C_J)), i.e., the asymptotic convergence rate of the Gauss-Seidel method is twice that of the Jacobi method.

Assume that for μ = ρ(C_J) we have μ = 1 − δ with δ ≪ 1. From theorem 6.4.3 we obtain (provided A fulfills the conditions of this theorem) the following estimate related to the convergence of the SOR method:

\[
\rho(C_{\omega_{\mathrm{opt}}}) = (1 - \delta)^2 \bigl( 1 + \sqrt{2\delta}\,\sqrt{1 - \delta/2} \bigr)^{-2} = 1 - 2\sqrt{2\delta} + O(\delta) .
\]

Hence the method of Jacobi has an asymptotic convergence rate − ln(μ) = − ln(1 − δ) ≈ δ, and the SOR method has an asymptotic convergence rate − ln(ρ(C_{ω_opt})) ≈ − ln(1 − 2√(2δ)) ≈ 2√(2δ). Note that for δ small we have 2√(2δ) ≫ δ, and thus the SOR method has a significantly higher rate of convergence than the method of Jacobi.

6.5 Convergence analysis for regular matrix splittings

We present a general convergence analysis for so-called regular matrix splitting methods, due to Varga [92]. For this analysis we need some fundamental results on the largest eigenvalue of a positive matrix and its corresponding eigenvector. These results, due to Perron [71], are presented in section 6.5.1. In this section, for B, C ∈ R^{n×n} we use the notation B ≥ C (B > C) iff b_{ij} ≥ c_{ij} (b_{ij} > c_{ij}) for all i, j. The same ordering notation is used for vectors. For B ∈ R^{n×n} we define |B| = (|b_{ij}|)_{1≤i,j≤n}, and similarly for vectors.


6.5.1 Perron theory for positive matrices

For a matrix A ∈ R^{n×n} an eigenvalue λ ∈ σ(A) for which |λ| = ρ(A) holds is not necessarily real. If we assume A > 0 then it can be shown that ρ(A) ∈ σ(A) holds and, moreover, that the corresponding eigenvector is strictly positive. These and other related results, due to Perron [71], are given in lemma 6.5.2, theorem 6.5.3 and theorem 6.5.5. We start the analysis with an elementary lemma.

Lemma 6.5.1 For B, C ∈ R^{n×n} the following holds:

0 ≤ B ≤ C ⇒ ρ(B) ≤ ρ(C) .

Proof. From 0 ≤ B ≤ C we get 0 ≤ B^k ≤ C^k for all k. Hence, ‖B^k‖∞ ≤ ‖C^k‖∞ for all k. Recall that for arbitrary A ∈ Rn×n we have ρ(A) = lim_{k→∞} ‖A^k‖∞^{1/k} (cf. lemma 6.1.5). Using this we get ρ(B) ≤ ρ(C).
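The lemma is easy to observe numerically; the 2 × 2 matrices below are an arbitrary example of mine with 0 ≤ B ≤ C entrywise:

```python
import numpy as np

B = np.array([[1.0, 0.5],
              [0.0, 2.0]])
C = np.array([[1.0, 1.0],
              [0.5, 2.0]])          # C >= B >= 0 entrywise

def spectral_radius(M):
    return max(abs(np.linalg.eigvals(M)))

print(spectral_radius(B), spectral_radius(C))   # rho(B) <= rho(C)
```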

Lemma 6.5.2 Take A ∈ Rn×n with A > 0. For λ ∈ σ(A) with |λ| = ρ(A) and w ∈ Cn, w ≠ 0, with Aw = λw the relation

A|w| = ρ(A)|w|

holds, i.e., ρ(A) is an eigenvalue of A.

Proof. With these λ and w we have

ρ(A)|w| = |λ||w| = |λw| = |Aw| ≤ |A||w| = A|w| (6.32)

Assume that we have "<" in (6.32). Then there exists α > ρ(A) such that α|w| ≤ A|w| and thus A^k|w| ≥ α^k|w| for all k ∈ N. This yields ‖A^k‖∞ ≥ ‖A^k|w|‖∞ / ‖|w|‖∞ ≥ α^k and thus ρ(A) = lim_{k→∞} ‖A^k‖∞^{1/k} ≥ α, which is a contradiction with α > ρ(A). We conclude that in (6.32) equality must hold, i.e., A|w| = ρ(A)|w|.

Theorem 6.5.3 (Perron) For A ∈ Rn×n with A > 0 the following holds:

ρ(A) > 0 is an eigenvalue of A (6.33a)

There exists a vector v > 0 such that Av = ρ(A)v (6.33b)

If Aw = ρ(A)w holds, then w ∈ span(v) with v from (6.33b) (6.33c)

Proof. From lemma 6.5.2 we obtain that there exists w ≠ 0 such that

A|w| = ρ(A)|w| (6.34)

holds. Thus ρ(A) is an eigenvalue of A. The eigenvector |w| from (6.34) contains at least one entry that is strictly positive. Due to this and A > 0 we have that A|w| > 0, which due to (6.34) implies ρ(A) > 0 and |w| > 0. From this the results in (6.33a) and (6.33b) follow.
Assume that there exists x ≠ 0 independent of v such that Ax = ρ(A)x. For arbitrary 1 ≤ k ≤ n define α = x_k/v_k and z = x − αv. Note that z_k = 0 and, due to the assumption that x and v are independent, z ≠ 0. We also have Az = ρ(A)z. From lemma 6.5.2 we get A|z| = ρ(A)|z|,


which results in a contradiction, because (A|z|)_k > 0 and ρ(A)(|z|)_k = 0. Thus the result in (6.33c) is proved.

The eigenvalue ρ(A) and corresponding eigenvector v > 0 (which is unique up to scaling) are called the Perron root and Perron vector. If instead of A > 0 we only assume A ≥ 0, then the results (6.33a) and (6.33b) hold with ">" replaced by "≥", as is shown in the following corollary. Clearly, for A ≥ 0 the result (6.33c) does not always hold (take A = 0).
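The Perron root and vector can be made concrete with a small positive matrix (the matrix below is a hypothetical example of mine): the eigenvalue of maximal modulus is real and positive, and its eigenvector can be scaled to be strictly positive.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                 # A > 0 entrywise
lams, vecs = np.linalg.eig(A)
k = int(np.argmax(np.abs(lams)))
perron_root = lams[k].real                 # = rho(A), cf. (6.33a)
v = vecs[:, k].real
v = v / v[int(np.argmax(np.abs(v)))]       # rescale the eigenvector; now v > 0, cf. (6.33b)
print(perron_root, v)
```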

Corollary 6.5.4 For A ∈ Rn×n with A ≥ 0 the following holds:

ρ(A) is an eigenvalue of A (6.35a)

There exists a nonzero vector v ≥ 0 such that Av = ρ(A)v (6.35b)

Proof. For ε ∈ (0, 1] define Aε := (aij + ε)_{1≤i,j≤n}. From theorem 6.5.3 it follows that for λε := ρ(Aε) there exists a vector vε > 0 such that Aεvε = λεvε holds. We scale vε such that ‖vε‖∞ = 1. Then, for all ε this vector is contained in the compact set S := {x ∈ Rn | ‖x‖∞ = 1}. Hence there exists a decreasing sequence 1 > ε1 > ε2 > ..., with lim_{j→∞} εj = 0 and lim_{j→∞} v_{εj} = v ∈ S. Thus v ≠ 0 and from v_{εj} > 0 for all j it follows that v ≥ 0. Note that 0 ≤ A ≤ A_{εi} ≤ A_{εj} for all i ≥ j. Using lemma 6.5.1 we get ρ(A) ≤ λ_{εi} ≤ λ_{εj} for all i ≥ j. From this it follows that

lim_{i→∞} λ_{εi} = λ ≥ ρ(A) (6.36)

Taking the limit i → ∞ in the equation A_{εi}v_{εi} = λ_{εi}v_{εi} yields Av = λv, and thus λ is an eigenvalue of A with v ≠ 0. This implies λ ≤ ρ(A). In combination with (6.36) this yields λ = ρ(A), which completes the proof.

In the next theorem we present a few further results for the Perron root of a positive matrix.

Theorem 6.5.5 (Perron) For A ∈ Rn×n with A > 0 the following holds:

ρ(A) is a simple eigenvalue of A (note that this implies (6.33c)) (6.37a)

For all λ ∈ σ(A), λ ≠ ρ(A), we have |λ| < ρ(A) (6.37b)

No nonnegative eigenvector belongs to any other eigenvalue than ρ(A) (6.37c)

Proof. We use the Jordan form A = TΛT^{−1} (cf. Appendix B) with Λ a matrix of the form Λ = blockdiag(Λi)_{1≤i≤s} and

Λi = [ λi  1            ]
     [     λi  ⋱        ]
     [          ⋱   1   ]
     [              λi  ]   ∈ R^{ki×ki}, 1 ≤ i ≤ s,

with all λi ∈ σ(A). Due to (6.33c) we know that the eigenspace corresponding to the eigenvalue ρ(A) is one-dimensional. Thus there is only one block Λi with λi = ρ(A). Let the ordering of the blocks in Λ be such that the first block Λ1 corresponds to the eigenvalue λ1 = ρ(A). We will now show that its dimension must be k1 = 1. Let ej be the j-th basis vector in Rn and define t := Te1, t̃ := T^{−T}e_{k1}. From ATe1 = TΛe1 we get At = ρ(A)t and thus t is the Perron vector of A. This implies t > 0. Note that A^T T^{−T}e_{k1} = T^{−T}Λ^T e_{k1} and thus A^T t̃ = ρ(A)t̃.


Since A^T > 0 this implies that t̃ is the Perron vector of A^T and thus t̃ > 0. Using that both t and t̃ are strictly positive we get 0 < t̃^T t = e_{k1}^T T^{−1}Te1 = e_{k1}^T e1. This can only be true if k1 = 1. We conclude that there is only one Jordan block corresponding to ρ(A) and that this block has the size 1 × 1, i.e., ρ(A) is a simple eigenvalue.
We now consider (6.37b). Let w ∈ Cn, w ≠ 0, λ = e^{iφ}ρ(A) (i.e., |λ| = ρ(A)) be such that Aw = λw. From lemma 6.5.2 we get that A|w| = ρ(A)|w| and from (6.33c) it follows that |w| > 0 holds. We introduce ψk, rk ∈ R, with rk > 0, such that wk = rk e^{iψk}, 1 ≤ k ≤ n, and D := diag(e^{iψk})_{1≤k≤n}. Then D|w| = w holds and thus

AD|w| = Aw = λw = λD|w| = e^{iφ}ρ(A)D|w| = e^{iφ}DA|w|

This yields

(e^{−iφ}D^{−1}AD − A)|w| = 0

Consider the k-th row of this identity:

Σ_{j=1}^{n} (e^{−i(φ+ψk−ψj)} − 1) a_{kj}|w_j| = 0

Due to a_{kj}|w_j| > 0 for all j this can only be true if e^{−i(φ+ψk−ψj)} − 1 = 0 for all j = 1, ..., n. We take j = k and thus obtain e^{−iφ} = 1, hence λ = e^{iφ}ρ(A) = ρ(A). This shows that (6.37b) holds.
We finally prove (6.37c). Assume Aw = λw with a nonzero vector w ≥ 0 and λ ≠ ρ(A). Application of theorem 6.5.3 to A^T implies that there exists a vector x > 0 such that A^T x = ρ(A^T)x = ρ(A)x. Note that x^T Aw = λx^T w and x^T Aw = w^T A^T x = ρ(A)w^T x. This implies (λ − ρ(A))w^T x = 0 and thus, because w^T x > 0, we obtain λ = ρ(A), which contradicts λ ≠ ρ(A). This completes the proof of the theorem.

From corollary 6.5.4 we know that for A ≥ 0 there exists an eigenvector v ≥ 0 corresponding to the eigenvalue ρ(A). Under the stronger assumption that A ≥ 0 and irreducible (cf. Appendix B), this vector must be strictly positive (as for the case A > 0). This and other related results for nonnegative irreducible matrices are due to Frobenius [37].

Theorem 6.5.6 (Frobenius) Let A ∈ Rn×n be irreducible and A ≥ 0. Then the following holds:

ρ(A) > 0 is a simple eigenvalue of A (6.38a)

There exists a vector v > 0 such that Av = ρ(A)v (6.38b)

No nonnegative eigenvector belongs to any other eigenvalue than ρ(A) (6.38c)

Proof. Given in, for example, theorem 4.8 in Fiedler [34].

6.5.2 Regular matrix splittings

A class of special matrix splittings consists of so-called regular splittings. In this section we will discuss the corresponding matrix splitting methods. In particular we show that for a regular splitting the corresponding iterative method is convergent. We also show how basic iterative methods like the Jacobi or Gauss-Seidel method fit into this setting.


Definition 6.5.7 A matrix splitting A = M − N is called a regular splitting if

M is regular, M−1 ≥ 0 and M ≥ A (6.39)

Recall that the iteration matrix of a matrix splitting method (based on the splitting A = M − N) is given by C = I − M−1A = M−1N.

Theorem 6.5.8 Assume that A−1 ≥ 0 holds and that A = M − N is a regular splitting. Then

ρ(C) = ρ(M−1N) = ρ(A−1N)/(1 + ρ(A−1N)) < 1

holds.

Proof. The matrices I − C = M−1A and I + A−1N = A−1M are nonsingular. We use the identities

A−1N = (I − C)−1C (6.40)

C = (I + A−1N)−1A−1N (6.41)

Because C ≥ 0 we can apply corollary 6.5.4. Hence there exists a nonzero vector v ≥ 0 such that Cv = ρ(C)v. Due to the fact that I − C is nonsingular we have ρ(C) ≠ 1. From (6.40) we get

A−1Nv = (I − C)−1Cv = (ρ(C)/(1 − ρ(C))) v (6.42)

From this and A−1 ≥ 0, N ≥ 0, v ≥ 0 we conclude A−1Nv ≥ 0 and ρ(C) < 1. From (6.42) it also follows that ρ(C)/(1 − ρ(C)) is a positive eigenvalue of A−1N. This implies ρ(C)/(1 − ρ(C)) ≤ ρ(A−1N), which can be reformulated as

ρ(C) ≤ ρ(A−1N)/(1 + ρ(A−1N)) (6.43)

From A−1N ≥ 0 and corollary 6.5.4 it follows that there exists a nonzero vector w ≥ 0 such that A−1Nw = ρ(A−1N)w. Using (6.41) we get

Cw = (I + A−1N)−1A−1Nw = (ρ(A−1N)/(1 + ρ(A−1N))) w

Thus ρ(A−1N)/(1 + ρ(A−1N)) is a positive eigenvalue of C. This yields

ρ(A−1N)/(1 + ρ(A−1N)) ≤ ρ(C) (6.44)

Combination of (6.43) and (6.44) completes the proof.
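The identity of Theorem 6.5.8 can be verified numerically. In the sketch below (my own choice of example) A is the matrix tridiag(−1, 2, −1), which satisfies A−1 ≥ 0, and M = D = diag(A) gives the regular splitting of the Jacobi method:

```python
import numpy as np

n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiag(-1, 2, -1)
M = np.diag(np.diag(A))                                 # Jacobi splitting A = M - N
N = M - A

def rho(X):
    return max(abs(np.linalg.eigvals(X)))

rho_C = rho(np.linalg.solve(M, N))      # rho(M^{-1} N)
rho_AN = rho(np.linalg.solve(A, N))     # rho(A^{-1} N)
print(rho_C, rho_AN / (1 + rho_AN))     # equal, and both expressions are < 1
```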

From the fact that the function x ↦ x/(1 + x) is increasing and using lemma 6.5.1 one immediately obtains the following result.

Corollary 6.5.9 Assume that A−1 ≥ 0 holds and that A = M1 − N1 = M2 − N2 are two regular splittings with N1 ≤ N2. Then

ρ(I − M1^{−1}A) ≤ ρ(I − M2^{−1}A) < 1

holds.


For the application of these general results to concrete matrix splitting methods it is convenient to introduce the following class of matrices.

Definition 6.5.10 (M-matrix) A matrix A ∈ Rn×n is called an M-matrix if it is nonsingular and has the following two properties:

A−1 ≥ 0 (6.45a)

aij ≤ 0 for all i 6= j (6.45b)

Consider an M-matrix A and let sk ≥ 0 be the k-th column of A−1. From the identity Ask = ek (the k-th basis vector) it follows that akk(sk)k = 1 − Σ_{j≠k} akj(sk)j ≥ 1 holds, and thus akk > 0. Hence, in an M-matrix all diagonal entries are strictly positive. Another property that we will need further on is given in the following lemma.
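For instance, the standard finite-difference matrix tridiag(−1, 2, −1) is an M-matrix: its off-diagonal entries are nonpositive and its inverse is entrywise nonnegative. A small check (my own illustration, not from the text):

```python
import numpy as np

n = 6
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Ainv = np.linalg.inv(A)
off_diag = A - np.diag(np.diag(A))
print((Ainv >= -1e-12).all(),        # A^{-1} >= 0, condition (6.45a)
      (off_diag <= 0).all(),         # a_ij <= 0 for i != j, condition (6.45b)
      np.diag(A).min())              # diagonal entries strictly positive, as argued above
```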

Lemma 6.5.11 Let A be an M-matrix. Assume that the matrix B has the properties bij ≤ 0 for all i ≠ j and B ≥ A. Then B is an M-matrix, too. Furthermore, the inequalities

0 ≤ B−1 ≤ A−1

hold.

Proof. Let DA := diag(A) and DB := diag(B). Because A is an M-matrix we have that DA is nonsingular, and from B ≥ A it follows that DB is nonsingular, too. Note that NA := DA − A ≥ 0. We conclude that A = DA − NA is a regular splitting and from theorem 6.5.8 it follows that ρ(CA) < 1 with CA := I − DA^{−1}A. Furthermore, with CB := I − DB^{−1}B we have 0 ≤ CB ≤ CA and thus ρ(CB) ≤ ρ(CA) < 1 holds. Thus we have the representations A−1 = (Σ_{k=0}^{∞} CA^k) DA^{−1} and B−1 = (Σ_{k=0}^{∞} CB^k) DB^{−1}. From the latter and CB ≥ 0, DB^{−1} ≥ 0 we obtain B−1 ≥ 0 and we can conclude that B is an M-matrix. The inequality B−1 ≤ A−1 follows by using CB ≤ CA, DB^{−1} ≤ DA^{−1}.

There is an extensive literature on properties of M-matrices, cf. [12], [34]. A few results are given in the following theorem.

Theorem 6.5.12 For A ∈ Rn×n the following results hold:

(a) If A is irreducibly diagonally dominant and aii > 0 for all i, aij ≤ 0 for all i ≠ j, then A is an M-matrix.

(b) Assume that aij ≤ 0 for all i ≠ j. Then A is an M-matrix if and only if all eigenvalues of A have positive real part.

(c) Assume that aij ≤ 0 for all i ≠ j. Then A is an M-matrix if A + A^T is positive definite (this follows from (b)).

(d) If A is symmetric positive definite and aij ≤ 0 for all i ≠ j, then A is an M-matrix (this follows from (b)).

(e) If A is a symmetric M-matrix then A is symmetric positive definite (this follows from (b)).

(f) If A is an M-matrix and B results from A after a Gaussian elimination step without pivoting, then B is an M-matrix, too (i.e. Gaussian elimination without pivoting preserves the M-matrix property).


Proof. A proof can be found in [12].

We now show that for M-matrices the Jacobi and Gauss-Seidel methods correspond to regular splittings. Recall the decomposition A = D − L − U.

Theorem 6.5.13 Let A be an M-matrix. Then both MJ := D and MGS := D − L result in regular splittings. Furthermore

ρ(I − (D− L)−1A) ≤ ρ(I − D−1A) < 1 (6.46)

holds.

Proof. In the proof of lemma 6.5.11 it is shown that the method of Jacobi corresponds to a regular splitting. For the Gauss-Seidel method note that MGS = D − L has only nonpositive off-diagonal entries and MGS − A = U ≥ 0. From lemma 6.5.11 it follows that MGS is an M-matrix, hence MGS^{−1} ≥ 0 holds. Thus the Gauss-Seidel method corresponds to a regular splitting, too. Now note that NGS := U ≤ NJ := L + U holds and thus corollary 6.5.9 yields the result in (6.46).

This result shows that for an M-matrix both the Jacobi and the Gauss-Seidel method are convergent. Moreover, the asymptotic convergence rate of the Gauss-Seidel method is at least as high as that of the Jacobi method. If A is the result of the discretization of an elliptic boundary value problem then often the arithmetic costs per iteration are comparable for both methods. In such cases the Gauss-Seidel method is usually more efficient than the method of Jacobi.
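The conclusion (6.46) can be observed numerically for the model matrix tridiag(−1, 2, −1) (my own sketch); for this matrix one even finds ρ(C_GS) = ρ(C_J)², matching the factor-two relation between the asymptotic convergence rates noted earlier:

```python
import numpy as np

n = 10
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
D = np.diag(np.diag(A))
L = -np.tril(A, -1)                 # decomposition A = D - L - U with L, U >= 0
U = -np.triu(A, 1)

def rho(X):
    return max(abs(np.linalg.eigvals(X)))

rho_J = rho(np.linalg.solve(D, L + U))      # Jacobi iteration matrix
rho_GS = rho(np.linalg.solve(D - L, U))     # Gauss-Seidel iteration matrix
print(rho_J, rho_GS)                        # rho_GS <= rho_J < 1
```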

The SOR method corresponds to a splitting A = Mω − N with Mω = (1/ω)D − L. If A is an M-matrix then for ω > 1 the matrix Mω − A has strictly negative diagonal entries and thus this is not a regular splitting. For ω ∈ (0, 1] one can apply the same arguments as in the proof of theorem 6.5.13 to show that for an M-matrix A the SOR method corresponds to a regular splitting, and

ρ(I − Mω1^{−1}A) ≤ ρ(I − Mω2^{−1}A) < 1 for all 0 < ω2 ≤ ω1 ≤ 1

holds.

6.6 Application to scalar elliptic problems

In this section we apply basic iterative methods to discrete scalar elliptic model problems. We recall the weak formulation of the Poisson equation and the convection-diffusion problem:

find u ∈ H^1_0(Ω) such that ∫_Ω ∇u · ∇v dx = ∫_Ω f v dx for all v ∈ H^1_0(Ω),

find u ∈ H^1_0(Ω) such that ε ∫_Ω ∇u · ∇v dx + ∫_Ω b · ∇u v dx = ∫_Ω f v dx for all v ∈ H^1_0(Ω),

with ε > 0 and b = (b1, b2) with given constants b1 ≥ 0, b2 ≥ 0. We take Ω = (0, 1)². We use nested uniform triangulations with mesh size parameter h = 2^{−k}/20, k = 1, 2, 3, 4. These problems are discretized using the finite element method with piecewise linear finite elements. For the convection-diffusion problem, we use the streamline-diffusion stabilization technique (for


the convection-dominated case). The resulting discrete problems are denoted by (P) (Poisson problem) and (CD) (convection-diffusion problem).

Example 6.6.1 (Model problem (P)) For the Poisson equation we obtain a stiffness matrix A that is symmetric positive definite and for which κ(D−1A) = O(h−2) holds. In Table 6.1 we show the results for the method of Jacobi applied to this problem with different values of h. For the starting vector we take x0 = 0. We use the Euclidean norm ‖·‖2. By # we denote the number of iterations needed to reduce the norm of the starting error by a factor R = 10³. We observe that when we halve the mesh size h we need approximately four times as many iterations. This is in agreement with κ(D−1A) = O(h−2) and the result in theorem 6.3.3.

  h   1/40   1/80   1/160   1/320
  #   2092   8345   33332   133227

Table 6.1: Method of Jacobi applied to problem (P).

We take a reduction factor R = 10³ and consider model problem (P). Then the complexity of the method of Jacobi is cn² flops (c depends on R but is independent of n). For model problem (P) there are methods that have complexity cn^α with α < 2. In particular α = 3/2 for the SOR method, α = 5/4 for the preconditioned Conjugate Gradient method (chapter 7) and α = 1 for the multigrid method (chapter 9). It is clear that if n is large a reduction of the exponent α will result in a significant gain in efficiency; for example, for h = 1/320 we have n ≈ h^{−2} ≈ 10⁵ and n² ≈ 10¹⁰. Also note that α = 1 is a lower bound, because for one matrix-vector multiplication Ax we already need cn flops.
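The O(h⁻²) growth of the iteration count is easy to reproduce for the 1D analogue of problem (P) (a small sketch of mine, with the same reduction factor R = 10³): doubling n, i.e. halving h, roughly quadruples the number of Jacobi iterations.

```python
import numpy as np

def jacobi_count(n, R=1e3):
    # Jacobi iterations needed to reduce the error norm by the factor R
    # for the 1D model matrix tridiag(-1, 2, -1).
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x_exact = np.ones(n)
    b = A @ x_exact
    x = np.zeros(n)
    e0 = np.linalg.norm(x - x_exact)
    for k in range(1, 200000):
        x = x + (b - A @ x) / 2.0          # D = 2I, so D^{-1}(b - Ax) = (b - Ax)/2
        if np.linalg.norm(x - x_exact) <= e0 / R:
            return k
    return None

k1, k2 = jacobi_count(20), jacobi_count(40)
print(k1, k2, k2 / k1)
```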

Example 6.6.2 (Model problem (P)) In Table 6.2 we show results for the situation as described in example 6.6.1, but now for the Gauss-Seidel method instead of the method of Jacobi. For this model problem with R = 10³ the Gauss-Seidel method has a complexity cn², which is of the same order of magnitude as for the method of Jacobi.

  h   1/40   1/80   1/160   1/320
  #   1056   4193   16706   66694

Table 6.2: Gauss-Seidel method applied to problem (P).

Example 6.6.3 (Model problem (CD)) It is important to note that in the Gauss-Seidel method the results depend on the ordering of the unknowns, whereas in the method of Jacobi the resulting iterates are independent of the ordering. We consider model problem (CD) with b1 = cos(π/6), b2 = sin(π/6). We take R = 10³ and h = 1/160. Using an ordering of the grid points (and corresponding unknowns) from left to right in the domain (0, 1)² we obtain the results in Table 6.3. When we use the reversed node ordering we get the results shown in Table 6.4.

These results illustrate a rather general phenomenon: if a problem is convection-dominated then for the Gauss-Seidel method it is advantageous to use a node ordering corresponding (as much as possible) to the direction in which information is transported.


  ε   10⁰     10⁻²   10⁻⁴
  #   17197   856    14

Table 6.3: Gauss-Seidel method applied to problem (CD), left-to-right ordering.

  ε   10⁰     10⁻²   10⁻⁴
  #   17220   1115   285

Table 6.4: Gauss-Seidel method applied to problem (CD), reversed ordering.

Example 6.6.4 We consider the model problem (P) as in example 6.6.1, with h = 1/160. In Figure 6.2 we show, for different values of the parameter ω, the corresponding number of SOR iterations (#) needed for an error reduction by a factor R = 10³. The same experiment is performed for the model problem (CD) as in example 6.6.3 with h = 1/160, ε = 10⁻². The results are shown in Figure 6.3. Note that with a suitable value for ω an enormous reduction in the number of iterations needed can be achieved. Also note the rapid change in the number of iterations (#) close to the optimal ω value.

Figure 6.2: SOR method applied to model problem (P) (number of iterations #, log scale from 10² to 10⁵, versus ω ∈ [1, 2]).
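The behaviour seen in the figures — a sharp minimum of the iteration count near ω_opt — can be reproduced on a small scale by computing ρ(C_ω) directly (my own sketch for the 1D model matrix, where ω_opt = 2/(1 + sin(π/(n+1))) is known in closed form; this is not the 2D experiment of the figures):

```python
import numpy as np

n = 20
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
D = np.diag(np.diag(A))
L = -np.tril(A, -1)

def rho_sor(om):
    M = D / om - L                            # M_omega = (1/omega) D - L
    C = np.linalg.solve(M, M - A)             # iteration matrix I - M_omega^{-1} A
    return max(abs(np.linalg.eigvals(C)))

omegas = np.arange(1.0, 2.0, 0.01)
radii = [rho_sor(om) for om in omegas]
best = omegas[int(np.argmin(radii))]
omega_opt = 2 / (1 + np.sin(np.pi / (n + 1)))  # classical SOR theory for this matrix
print(best, omega_opt)                         # grid minimum sits next to omega_opt
```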


Figure 6.3: SOR method applied to model problem (CD) (number of iterations #, log scale from 10¹ to 10⁴, versus ω ∈ [0.9, 1.9]).


Chapter 7

Preconditioned Conjugate Gradient method

7.1 Introduction

In this chapter we discuss the Conjugate Gradient method (CG) for the iterative solution of sparse linear systems with a symmetric positive definite matrix.

In section 7.2 we introduce and analyze the CG method. This method is based on the formulation of the discrete problem as a minimization problem. The CG method is nonlinear and of a different type than the basic iterative methods discussed in chapter 6. The CG method is not suitable for solving strongly nonsymmetric problems, as for example a discretized convection-diffusion problem with a dominating convection. Many variants of CG have been developed which are applicable to linear systems with a nonsymmetric matrix. A few of these methods are treated in chapter 8. In the CG method and in the variants for nonsymmetric problems the resulting iterates are contained in a so-called Krylov subspace, which explains the terminology "Krylov subspace methods". A detailed treatment of these Krylov subspace methods is given in Saad [78]. An important concept related to all these Krylov subspace methods is the so-called preconditioning technique. This will be explained in section 7.3.

7.2 Conjugate Gradient method

In section 2.4 it is shown that to a variational problem with a symmetric elliptic bilinear form there corresponds a canonical minimization problem. Similarly, to a linear system with a symmetric positive definite matrix there corresponds a natural minimization problem. We consider a system of equations

Ax = b (7.1)

with A ∈ Rn×n symmetric positive definite. The unique solution of this problem is denoted by x∗. In this chapter we use the notation

〈y, x〉 = y^T x , 〈y, x〉_A = y^T Ax , ‖x‖_A = 〈x, x〉_A^{1/2} , (x, y ∈ Rn). (7.2)

Since A is symmetric positive definite the bilinear form 〈·,·〉_A defines an inner product on Rn. This inner product is called the A-inner product or energy inner product. We define the functional F : Rn → R by

F(x) = (1/2)〈x, Ax〉 − 〈x, b〉 . (7.3)


For this F we have

DF(x) = ∇F(x) = Ax − b and D²F(x) = A .

So F is a quadratic functional with a second derivative (Hessian) which is positive definite. Hence F has a unique minimizer and the gradient of F is equal to zero at this minimizer. Thus we obtain:

min{F(x) | x ∈ Rn} = F(x∗) , (7.4)

i.e. minimization of the functional F yields the unique solution x∗ of the system in (7.1). This result is an analogue of the one discussed for symmetric bilinear forms in section 2.4. In this section we consider two methods that are based on (7.4) and in which first certain search directions are determined and then a line search is applied. Such methods are of the following form:

x0 a given starting vector,
xk+1 = xk + αopt(xk, pk)pk , k ≥ 0.        (7.5)

In (7.5), pk ≠ 0 is the search direction at xk and αopt(xk, pk) is the optimal steplength at xk in the direction pk. The vector Axk − b = ∇F(xk) is called the residual (at xk) and denoted by

rk := Axk − b. (7.6)

From the definition of F we obtain the identity

F(xk + αpk) = F(xk) + α〈pk, Axk − b〉 + (1/2)α²〈pk, Apk〉 =: ψ(α) .

The function ψ : R → R is quadratic and ψ′′(α) > 0 holds. So ψ has a unique minimum at αopt iff ψ′(αopt) = 0. This results in the following formula for αopt:

αopt(xk, pk) := −〈pk, rk〉/〈pk, Apk〉 . (7.7)

For the residuals we have the recursion

rk+1 = rk + αopt(xk,pk)Apk , k ≥ 0. (7.8)

For ψ′(0), i.e. the derivative of F at xk in the direction pk, we have ψ′(0) = 〈pk, rk〉. The direction pk with ‖pk‖2 = 1 for which the modulus of this derivative is maximal is given by pk = rk/‖rk‖2. This follows from |ψ′(0)| = |〈pk, rk〉| ≤ ‖pk‖2‖rk‖2, in which we have equality only if pk = γrk (γ ∈ R). The sign and length of pk are irrelevant because the "right sign" and the "optimal length" are determined by the steplength parameter αopt. With the choice pk = rk we obtain the Steepest Descent method:

x0 a given starting vector,
xk+1 = xk − (〈rk, rk〉/〈rk, Ark〉) rk .

In general the Steepest Descent method converges only slowly. The reason for this is already clear from a simple example with n = 2. We take

A = [ λ1  0  ]
    [ 0   λ2 ] ,   0 < λ1 < λ2 ,   b = (0, 0)^T (hence, x∗ = (0, 0)^T).


Figure 7.1: Steepest Descent method (iterates x0, x1, x2, x3 zigzagging across the level lines).

The function F(x1, x2) = (1/2)〈x, Ax〉 − 〈x, b〉 = (1/2)(λ1x1² + λ2x2²) has level lines Nc = {(x1, x2) ∈ R² | F(x1, x2) = c} which are ellipses. Assume that λ2 ≫ λ1 holds (so κ(A) ≫ 1). Then the ellipses are stretched in the x1-direction, as is shown in Figure 7.1, and convergence is very slow.
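The zigzag behaviour of Figure 7.1 is easy to reproduce (my own sketch): for A = diag(1, 100), b = 0 and the worst-case starting vector x0 = (100, 1)^T the error is reduced by only the factor (κ − 1)/(κ + 1) ≈ 0.98 per step, so even this 2 × 2 system needs on the order of a thousand Steepest Descent steps.

```python
import numpy as np

A = np.diag([1.0, 100.0])            # kappa(A) = 100
b = np.zeros(2)                      # exact solution x* = 0
x = np.array([100.0, 1.0])           # worst-case starting vector

k = 0
while np.linalg.norm(x) > 1e-6 and k < 10000:
    r = A @ x - b                    # residual r = grad F(x), cf. (7.6)
    x = x - (r @ r) / (r @ (A @ r)) * r
    k += 1
print(k)
```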

We now introduce the CG method along similar lines as in Hackbusch [48]. To be able to formulate the weakness of the Steepest Descent method we introduce the following notion of optimality. Let V be a subspace of Rn.

y is called optimal for the subspace V if F(y) = min_{z∈V} F(y + z) (7.9)

So y is optimal for V if on the hyperplane y + V the functional F is minimal at y. Assume a given y and subspace V. Let d1, ..., ds be a basis of V and for c ∈ R^s define g(c) = F(y + Σ_{i=1}^{s} ci di). Then y is optimal for V iff ∇g(0) = 0 holds. Note that

∂g/∂ci (0) = 〈∇F(y), di〉 = 〈Ay − b, di〉

Hence we obtain the following:

y optimal for V ⇔ 〈di, Ay − b〉 = 0 for i = 1, ..., s. (7.10)

In the Steepest Descent method we have pk = rk. From (7.7) and (7.8) we obtain

〈rk, rk+1〉 = 〈rk, rk〉 − (〈rk, rk〉/〈rk, Ark〉)〈rk, Ark〉 = 0 .

Using (7.10) we conclude that in the Steepest Descent method xk+1 is optimal for the subspace span{pk}. This is also clear from Fig. 7.1: for example, x3 is optimal for the subspace spanned by the search direction x3 − x2. From Fig. 7.1 it is also clear that xk is not optimal for the subspace spanned by all previous search directions. For example, x3 can be improved in the search direction p1 = γ(x2 − x1): for α = αopt(x3, p1) we have F(x4) = F(x3 + αp1) < F(x3). Now consider a start with p0 = r0 and thus x1 = x0 + αopt(x0, p0)p0 (as in Steepest Descent). We assume that the second search direction p1 is chosen such that 〈p1, Ap0〉 = 0 holds. Due to


the fact that A is symmetric positive definite we have that p1 and p0 are independent. Define x2 = x1 + αopt(x1, p1)p1. Note that now 〈p0, b − Ax2〉 = 0 and also 〈p1, b − Ax2〉 = 0, and thus (cf. (7.10)) x2 is optimal for span{p0, p1}. For the special case n = 2, as in the example shown in Figure 7.1, we have span{p0, p1} = R². Hence x2 is optimal for R², which implies x2 = x∗! This is illustrated in Fig. 7.2. We have constructed search directions p0, p1 and an iterand x2 such that x2 is optimal for the

Figure 7.2: Conjugate Gradient method (iterates x0, x1, x2; for n = 2 the second iterate x2 equals x∗).

two-dimensional subspace span{p0, p1}. This leads to the basic idea behind the Conjugate Gradient (CG) method: we shall use search directions such that xk is optimal for the k-dimensional subspace span{p0, p1, ..., pk−1}. In the Steepest Descent method the iterand xk is optimal only for the one-dimensional subspace span{pk−1}. This difference results in much faster convergence of the CG iterands as compared to the iterands in the Steepest Descent method.

We will now show how to construct appropriate search directions such that this optimality property holds. Moreover, we derive a method for the construction of these search directions with low computational costs.
As in the Steepest Descent method, we start with p0 = r0 and x1 as in (7.5). Recall that x1 is optimal for span{p0}. Assume that for a given k with 1 ≤ k < n, linearly independent search directions p0, ..., pk−1 are given such that xk as in (7.5) is optimal for span{p0, ..., pk−1}. We introduce the notation

Vk = span{p0, ..., pk−1}

and assume that xk ≠ x∗, i.e., rk ≠ 0 (if xk = x∗ we do not need a new search direction). We will show how pk can be taken such that xk+1, defined as in (7.5), is optimal for span{p0, p1, ..., pk} =: Vk+1. We choose pk such that

pk ⊥_A Vk, i.e. pk ∈ Vk^{⊥A} (7.11)

holds. This A-orthogonality condition does not determine a unique search direction pk. The Steepest Descent method above was based on the observation that rk = ∇F(xk) is the direction of steepest descent at xk. Therefore we use this direction to determine the new search direction. A unique new search direction pk is given by the following:

pk ∈ Vk^{⊥A} such that ‖pk − rk‖_A = min_{p ∈ Vk^{⊥A}} ‖p − rk‖_A (7.12)


The definition of p1 is illustrated in Fig. 7.3.

Figure 7.3: Definition of a search direction in CG: p1 is the A-orthogonal projection of r1 on V1^{⊥A}, with V1 = span{p0}.

Note that pk is the A-orthogonal projection of rk on Vk^{⊥A}. This yields the following formula for the search direction pk:

pk = rk − Σ_{j=0}^{k−1} (〈pj, rk〉_A / 〈pj, pj〉_A) pj = rk − Σ_{j=0}^{k−1} (〈pj, Ark〉 / 〈pj, Apj〉) pj . (7.13)

We assumed that xk is optimal for Vk and that rk ≠ 0. From the former we get that 〈pj, rk〉 = 0 for j = 0, ..., k − 1, i.e., rk ⊥ Vk (note that here we have ⊥ and not ⊥A). Using rk ≠ 0 we conclude that rk ∉ Vk and thus from (7.13) it follows that pk ∉ Vk. Hence, pk is linearly independent of p0, ..., pk−1 and

Vk+1 = span{p0, ..., pk} has dimension k + 1. (7.14)

Given this new search direction the new iterand is defined by

xk+1 = xk + αopt(xk,pk)pk (7.15)

with αopt as in (7.7). Using the definition of αopt we obtain

〈pk, b − Axk+1〉 = −〈pk, rk〉 − αopt(xk, pk)〈pk, Apk〉 = 0.

Due to (7.11) and the optimality of xk for the subspace Vk (cf. also (7.10)) we have for j < k

〈pj ,b − Axk+1〉 = 〈pj ,b − Axk〉 − αopt(xk,pk)〈pj ,Apk〉 = 0.

Using (7.10) we conclude that xk+1 is optimal for the subspace Vk+1! The search directions pk defined as in (7.13) (with p0 := r0) and the iterands as in (7.15) define the Conjugate Gradient method. This method was introduced in Hestenes and Stiefel [51].

We now derive some important properties of the CG method.

Theorem 7.2.1 Let x0 ∈ Rn be given and m < n be such that for k = 0, 1, ..., m we have xk ≠ x∗ and pk, xk+1 as in (7.13), (7.15). Define Vk = span{p0, ..., pk−1} (0 ≤ k ≤ m + 1).


Then the following holds for all k = 1, ..., m + 1:

dim(Vk) = k (7.16a)

xk ∈ x0 + Vk (7.16b)

F(xk) = min{F(x) | x ∈ x0 + Vk} (7.16c)

Vk = span{r0, ..., rk−1} = span{r0, Ar0, ..., A^{k−1}r0} (7.16d)

〈pj, rk〉 = 0 for all j = 0, 1, ..., k − 1 (7.16e)

〈rj, rk〉 = 0 for all j = 0, 1, ..., k − 1 (7.16f)

pk ∈ span{rk, pk−1} (for k ≤ m) (7.16g)

Proof. The result in (7.16a) is shown in the derivation of the method, cf. (7.14). The result in (7.16b) can be shown by induction using xk = xk−1 + αopt(xk−1, pk−1)pk−1. The construction of the search directions and new iterands in the CG method is such that xk is optimal for Vk, i.e., F(xk) = min{F(xk + w) | w ∈ Vk}. Using xk ∈ x0 + Vk this can be rewritten as F(xk) = min{F(x0 + w) | w ∈ Vk}, which proves the result in (7.16c).
We introduce the notation Rk = span{r0, ..., rk−1} and prove Vk ⊂ Rk by induction. For k = 1 this holds due to p0 = r0. Assume that it holds for some k ≤ m. Since Vk+1 = span{Vk, pk} and Vk ⊂ Rk ⊂ Rk+1, we only have to show pk ∈ Rk+1. From (7.13) it follows that pk ∈ span{p0, ..., pk−1, rk} = span{Vk, rk} ⊂ Rk+1, which completes the induction argument. Using dim(Vk) = k it follows that Vk = Rk must hold. Hence the first equality in (7.16d) is proved. We introduce the notation Wk = span{r0, Ar0, ..., A^{k−1}r0} and prove Rk ⊂ Wk by induction. For k = 1 this is trivial. Assume that for some k ≤ m, Rk ⊂ Wk holds. Due to Rk+1 = span{Rk, rk} and Rk ⊂ Wk ⊂ Wk+1 we only have to show rk ∈ Wk+1. Note that rk = rk−1 + αopt(xk−1, pk−1)Apk−1 and rk−1 ∈ Rk ⊂ Wk ⊂ Wk+1, Apk−1 ∈ AVk = ARk ⊂ AWk ⊂ Wk+1. Thus rk ∈ Wk+1 holds, which completes the induction. Due to dim(Rk) = k it follows that Rk = Wk must hold. Hence the second equality in (7.16d) is proved.
The search directions and iterands are such that xk is optimal for Vk = span{p0, ..., pk−1}. From (7.10) we get 〈pj, rk〉 = 0 for j = 0, ..., k − 1 and thus (7.16e) holds. Due to Vk = span{r0, ..., rk−1} this immediately yields (7.16f), too. To prove (7.16g) we use the formula (7.13). Note that rj+1 = rj + αopt(xj, pj)Apj and thus Apj ∈ span{rj+1, rj}. From this and (7.16f) it follows that for j ≤ k − 2 we have 〈pj, rk〉_A = 〈Apj, rk〉 = 0. Thus in the sum in (7.13) all terms with j ≤ k − 2 are zero.

The result in (7.16g) is very important for an efficient implementation of the CG method. Combining this result with the formula given in (7.13) we immediately obtain that in the summation in (7.13) there is only one nonzero term, i.e. for pk we have the formula

pk = rk − (〈pk−1, Ark〉/〈pk−1, Apk−1〉) pk−1 . (7.17)

From (7.17) we see that we have a simple and cheap two term recursion for the search directionsin the CG method. Combination of (7.5),(7.7),(7.8) and (7.17) results in the following CG


algorithm:

  x0 a given starting vector;  r0 = Ax0 − b
  for k ≥ 0 (if rk ≠ 0):
      pk = rk − ( 〈rk, Apk−1〉 / 〈pk−1, Apk−1〉 ) pk−1   (if k = 0 then p0 := r0)
      xk+1 = xk + αopt(xk,pk)pk   with αopt(xk,pk) = −〈pk, rk〉 / 〈pk, Apk〉
      rk+1 = rk + αopt(xk,pk)Apk
                                                                   (7.18)

Some manipulations result in the following alternative formulas for pk and αopt:

  pk = −rk + ( 〈rk, rk〉 / 〈rk−1, rk−1〉 ) pk−1 ,
  αopt(xk,pk) = 〈rk, rk〉 / 〈pk, Apk〉 .
                                                                   (7.19)

Using these formulas in (7.18) results in a slightly more efficient algorithm. The subspace Vk = span{r0, Ar0, . . . , A^{k−1}r0} in (7.16d) is called the Krylov subspace of dimension k corresponding to r0, denoted by Kk(A; r0).
The CG method is of a different type than the basic iterative methods discussed in chapter 6. One important difference is that the CG method is nonlinear. The error propagation xk+1 − x∗ = Ψ(xk − x∗) is determined by a nonlinear function Ψ, and thus there does not exist an error iteration matrix (as in the case of basic iterative methods) which determines the convergence behaviour. Related to this, in the CG method we often observe a phenomenon called superlinear convergence. This type of convergence behaviour is illustrated in Example 7.2.3. For a detailed analysis of this phenomenon we refer to Van der Sluis and Van der Vorst [90].
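The algorithm (7.18) with the formulas (7.19) can be sketched in a few lines of NumPy. This is our own illustration (the function name `cg` and the stopping criterion are not from the text); note that the text works with the residual rk = Axk − b, so for the signs in (7.19) to be consistent one starts with p0 = −r0.

```python
import numpy as np

def cg(A, b, x0, tol=1e-12, maxiter=None):
    # CG sketch following (7.18) with the formulas (7.19);
    # sign convention r_k = A x_k - b as in the text.
    x = np.array(x0, dtype=float)
    r = A @ x - b
    p = -r                            # p_0 = -r_0 for the variant (7.19)
    rho = r @ r                       # <r_k, r_k>
    maxiter = len(b) if maxiter is None else maxiter
    for _ in range(maxiter):
        if np.sqrt(rho) <= tol:
            break
        Ap = A @ p
        alpha = rho / (p @ Ap)        # alpha_opt from (7.19)
        x += alpha * p
        r += alpha * Ap               # r_{k+1} = r_k + alpha_opt A p_k
        rho, rho_old = r @ r, rho
        p = -r + (rho / rho_old) * p  # cheap two-term recursion for p_k
    return x

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])
x = cg(A, b, np.zeros(2))
```

In exact arithmetic the iteration terminates after at most n steps (property (7.16c)); for this 2×2 example two iterations already give the exact solution.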

Another difference between CG and basic iterative methods is that the CG method yields the exact solution x∗ in at most n iterations. This follows from the property in (7.16c). However, in practice this will not occur due to the effect of rounding errors. Moreover, in practical applications n is usually very large and for efficiency reasons one does not want to apply n CG iterations.

We now discuss the arithmetic costs per iteration and the rate of convergence of the CG method.If we use the CG algorithm with the formulas in (7.19) then in one iteration we have to computeone matrix-vector multiplication, two inner products and a few vector updates, i.e. (if A is asparse matrix) we need cn flops. The costs per iteration are of the same order of magnitude asfor the Jacobi, Gauss-Seidel and SOR method.With respect to the rate of convergence of the CG method we formulate the following theorem.

Theorem 7.2.2 Define P∗k := { p ∈ Pk | p(0) = 1 }. Let xk, k ≥ 0, be the iterands of the CG method and ek = xk − x∗. The following holds:

  ‖ek‖A = min_{pk∈P∗k} ‖pk(A)e0‖A                                  (7.20)
        ≤ min_{pk∈P∗k} max_{λ∈σ(A)} |pk(λ)| ‖e0‖A                  (7.21)
        ≤ 2 ( (√κ(A) − 1) / (√κ(A) + 1) )^k ‖e0‖A                  (7.22)


Proof. From (7.16b) we get ek ∈ e0 + Vk. And due to Vk = span{r0, . . . , rk−1} and (7.16f) we have Aek = rk ⊥ Vk and thus ek ⊥A Vk. This implies ‖ek‖A = min_{vk∈Vk} ‖e0 − vk‖A. Note that vk ∈ Vk can be represented as

  vk = Σ_{j=0}^{k−1} ξj A^j r0 = Σ_{j=0}^{k−1} ξj A^{j+1} e0 .

Hence,

  ‖ek‖A = min_{ξ∈R^k} ‖ e0 − Σ_{j=0}^{k−1} ξj A^{j+1} e0 ‖A = min_{pk∈P∗k} ‖pk(A)e0‖A .

This proves the result in (7.20). The result in (7.21) follows from

  ‖pk(A)e0‖A ≤ ‖pk(A)‖A ‖e0‖A = max_{λ∈σ(A)} |pk(λ)| ‖e0‖A .

Let I = [λmin, λmax] with λmin and λmax the extreme eigenvalues of A. From the results above we have

  ‖ek‖A ≤ min_{pk∈P∗k} max_{λ∈I} |pk(λ)| ‖e0‖A .

The min-max quantity in this upper bound can be analyzed using Chebychev polynomials, defined by T0(x) = 1, T1(x) = x, Tm+1(x) = 2xTm(x) − Tm−1(x) for m ≥ 1. These polynomials have the representation

  Tk(x) = ½ [ (x + √(x² − 1))^k + (x + √(x² − 1))^{−k} ]           (7.23)

and for any interval [a, b] with b < 1 they have the following property:

  min_{pk∈Pk, pk(1)=1} max_{x∈[a,b]} |pk(x)| = 1 / Tk( (2 − a − b) / (b − a) ) .

We introduce qk(x) = pk(1 − x) and then get

  min_{pk∈P∗k} max_{λ∈I} |pk(λ)| = min_{pk∈P∗k} max_{x∈[1−λmax, 1−λmin]} |pk(1 − x)|
                                 = min_{qk∈Pk, qk(1)=1} max_{x∈[1−λmax, 1−λmin]} |qk(x)|
                                 = 1 / Tk( (λmax + λmin) / (λmax − λmin) )
                                 = 1 / Tk( (κ(A) + 1) / (κ(A) − 1) ) .

Using the representation (7.23) we get

  Tk( (κ(A) + 1) / (κ(A) − 1) ) ≥ ½ ( (κ(A) + 1)/(κ(A) − 1) + √( ((κ(A) + 1)/(κ(A) − 1))² − 1 ) )^k
                                = ½ ( (√κ(A) + 1) / (√κ(A) − 1) )^k ,

which then yields the bound in (7.22).

So if we measure errors using the A-norm and neglect the factor 2 in (7.22), it follows that on average per iteration the error is reduced by a factor (√κ(A) − 1)/(√κ(A) + 1). In this bound one can observe a clear relation between κ(A) and the rate of convergence of the CG method: a larger condition number results in a lower rate of convergence. For κ(A) ≫ 1 the reduction factor is of the form 1 − 2/√κ(A), which is significantly better than the bounds for


the contraction numbers of the Richardson and (damped) Jacobi methods, which are of the form 1 − c/κ(A). For the case κ(A) ∼ ch^{−2} the latter takes the form 1 − ch², whereas for CG we have an (average) reduction factor 1 − ch. Often the bound in (7.22) is rather pessimistic, because the phenomenon of superlinear convergence is not expressed in this bound.
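The ingredients of the proof above are easy to check numerically. The sketch below (our own illustration; the helper names are not from the text) verifies that the closed-form representation (7.23) agrees with the three-term recursion, and that the lower bound ½((√κ + 1)/(√κ − 1))^k used in the last step holds:

```python
import numpy as np

def cheb_rec(k, x):
    # Three-term recursion T_0 = 1, T_1 = x, T_{m+1} = 2x T_m - T_{m-1}
    t0, t1 = 1.0, x
    if k == 0:
        return t0
    for _ in range(k - 1):
        t0, t1 = t1, 2.0 * x * t1 - t0
    return t1

def cheb_closed(k, x):
    # Representation (7.23), valid for x >= 1
    s = x + np.sqrt(x * x - 1.0)
    return 0.5 * (s**k + s**(-k))

kappa = 100.0
x = (kappa + 1.0) / (kappa - 1.0)   # the argument appearing in the proof
t8 = cheb_closed(8, x)
```

For this argument x one has x + √(x² − 1) = (√κ + 1)/(√κ − 1), which is exactly the algebraic simplification used to arrive at (7.22).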

For a further theoretical analysis of the CG method we refer to Axelsson and Barker [7],Golub and Van Loan [41] and Hackbusch [48].

Example 7.2.3 (Poisson model problem) We apply the CG method to the discrete Poisson equation from section 6.6. First we discuss the complexity of the CG method for this model problem. In this case we have κ(A) ≈ ch^{−2}. Using (7.22) it follows that (in the A-norm) the error is reduced with approximately a factor

  ρ := (√κ(A) − 1) / (√κ(A) + 1) ≈ 1 − 2c^{−1}h                    (7.24)

per iteration. The arithmetic costs are cn flops per iteration. So for a reduction of the error with a factor R we need approximately (−lnR / ln(1 − 2c^{−1}h)) · cn ≈ ch^{−1}n ≈ cn^{3/2} flops. We conclude that the complexity is of the same order of magnitude as for the SOR method with the optimal value for the relaxation parameter. However, note that, opposite to the SOR method, in the CG method we do not have the problem of choosing a suitable parameter value. In Table 7.1 we show results which can be compared with the results in section 6.6. We use the Euclidean norm and # denotes the number of iterations needed to reduce the starting error with a factor R = 10³.

  h    1/40   1/80   1/160   1/320
  #      65    130     262     525

Table 7.1: CG method applied to Poisson equation.

In figure 7.4 we illustrate the phenomenon of superlinear convergence in the CG method. For the case h = 1/160 we show the actual error reduction in the A-norm, i.e.

  ρk := ‖xk − x∗‖A / ‖xk−1 − x∗‖A ,

in the first 250 iterations. The factor (√κ(A) − 1)/(√κ(A) + 1) has the value ρ = 0.96 (horizontal line in figure 7.4). There is a clear decreasing tendency of ρk during the iteration process. For large values of k, ρk is significantly smaller than ρ. Finally, we note that an irregular convergence behaviour as in figure 7.4 is typical for the CG method.

7.3 Introduction to preconditioning

In this section we consider the general concept of preconditioning and discuss a few preconditioning techniques. Consider a (sparse) system Ax = b, A ∈ R^{n×n} (not necessarily symmetric positive definite), for which an approximation W ≈ A is available with the following properties:

  Wx = y can be solved with "low" computational costs (cn flops).  (7.25a)
  κ(W^{−1}A) < κ(A).                                               (7.25b)


Figure 7.4: Error reduction of CG applied to Poisson problem.

An approximation W with these properties is called a preconditioner for A. In (7.25a) it is implicitly assumed that the matrix W does not contain many more nonzero entries than the matrix A, i.e. W is a sparse matrix, too. In the sections below three popular techniques for constructing preconditioners will be explained. In section 7.7 results of numerical experiments are given which show that using an appropriate preconditioner one can improve the efficiency of an iterative method significantly. The combination of a given iterative method (e.g. CG) with a preconditioner results in a so-called preconditioned iterative method (e.g. PCG in section 7.7). As an introductory example, to explain the basic idea of preconditioned iterative methods, we assume that both A and W are symmetric positive definite and show how the basic Richardson iterative method (which is not used in practice) can be combined with a preconditioner W. We consider the Richardson method with parameter value ω = 1/ρ(A), i.e.:

  xk+1 = xk − ω(Axk − b)   with ω := 1/ρ(A)                        (7.26)

For the iteration matrix C of this method we have

  ρ(C) = ρ(I − ωA) = max{ |1 − λ/λmax(A)| : λ ∈ σ(A) } = 1 − λmin(A)/λmax(A) = 1 − 1/κ(A).   (7.27)

When we apply the same method to the preconditioned system

  Ãx = b̃ ,   Ã := W^{−1}A ,   b̃ := W^{−1}b,

we obtain

  xk+1 = xk − ω̃(Ãxk − b̃) = xk − ω̃W^{−1}(Axk − b)   with ω̃ = 1/ρ(Ã).   (7.28)


This method is called the preconditioned Richardson method. Note that if we assume that (an estimate of) ρ(Ã) is known, then we do not need the preconditioned matrix Ã in this method. In (7.28) we have to compute z := W^{−1}(Axk − b), i.e., solve Wz = Axk − b. Due to the condition in (7.25a), z can be computed with acceptable arithmetic costs. For the spectral radius of the iteration matrix C̃ of the preconditioned method we obtain, using σ(Ã) = σ(W^{−1}A) = σ(W^{−1/2}AW^{−1/2}) ⊂ (0, ∞),

  ρ(C̃) = ρ(I − ω̃Ã) = max{ |1 − λ/λmax(Ã)| : λ ∈ σ(Ã) } = 1 − λmin(Ã)/λmax(Ã) = 1 − 1/κ(Ã).   (7.29)

From (7.27) and (7.29) we conclude that if κ(W^{−1}A) ≪ κ(A) (cf. (7.25b)), then ρ(C̃) ≪ ρ(C) and the convergence of the preconditioned method will be much faster than for the original one. Note that for W = diag(A) the preconditioned Richardson method coincides with the damped Jacobi method.
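This effect can be reproduced in a small experiment. The sketch below is our own illustration (the test matrix and all names are assumptions, not from the text): for a symmetric positive definite matrix with strongly varying diagonal, the Richardson method with W = diag(A), i.e. damped Jacobi, converges much faster than the unpreconditioned iteration.

```python
import numpy as np

def richardson(A, b, W=None, iters=100):
    # (Preconditioned) Richardson iteration (7.26)/(7.28) with
    # omega = 1/rho(W^{-1}A); dense linear algebra, for illustration only.
    n = len(b)
    W = np.eye(n) if W is None else W
    omega = 1.0 / max(abs(np.linalg.eigvals(np.linalg.solve(W, A))))
    x = np.zeros(n)
    for _ in range(iters):
        x = x - omega * np.linalg.solve(W, A @ x - b)   # W z = A x - b
    return x

# SPD test matrix with diagonal 1,...,n and weak coupling
n = 20
A = np.diag(np.arange(1.0, n + 1)) + 0.1 * (np.eye(n, k=1) + np.eye(n, k=-1))
b = np.ones(n)
xstar = np.linalg.solve(A, b)
err_plain = np.linalg.norm(richardson(A, b) - xstar)
err_jacobi = np.linalg.norm(richardson(A, b, W=np.diag(np.diag(A))) - xstar)
```

Here κ(A) ≈ 20 while κ(W^{−1}A) is close to 1, so after 100 iterations the damped Jacobi variant has reduced the error to machine precision while the plain Richardson iteration has barely converged.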

7.4 Preconditioning based on a linear iterative method

In this section we explain how a preconditioner W can be obtained from a given (basic) lineariterative method. Recall the general form of a linear iterative method:

xk+1 = xk − M−1(Axk − b). (7.30)

If one uses this iterative method for preconditioning, then W := M is taken as the preconditioner for A. If the method (7.30) converges, then W is a reasonable approximation of A in the sense that ρ(I − W^{−1}A) < 1. The iteration in (7.30) corresponds to an iterative method and thus M^{−1}y (y ∈ R^n) can be computed with acceptable arithmetic costs. Hence the condition in (7.25a), with W = M, is satisfied.

Related to the implementation of such a preconditioner we note the following. In an iterative method the matrix M is usually not used in its implementation (cf. Gauss-Seidel or SOR), i.e. the iteration (7.30) can be implemented without explicitly computing M. The solution of Wx = y, i.e. of Mx = y, is the result of (7.30) with k = 0, x0 = 0, b = y. From this it follows that the computation of the solution of Wx = y can be implemented by performing one iteration of the iterative method applied to Az = y with starting vector 0.
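For example, for the Gauss-Seidel method, M is the lower triangular part of A, and one sweep on Az = y starting from z = 0 returns exactly M^{−1}y. A minimal sketch of this (the function names are our own):

```python
import numpy as np

def gauss_seidel_sweep(A, y, z0):
    # One forward Gauss-Seidel sweep for A z = y
    z = np.array(z0, dtype=float)
    for i in range(len(y)):
        z[i] = (y[i] - A[i, :i] @ z[:i] - A[i, i+1:] @ z[i+1:]) / A[i, i]
    return z

def solve_with_W(A, y):
    # Solve W z = y by one iteration of the method applied to A z = y,
    # started from z = 0 (here W = M = lower triangular part of A)
    return gauss_seidel_sweep(A, y, np.zeros_like(y))

rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant demo matrix
y = rng.random(n)
z_onestep = solve_with_W(A, y)
```

With a zero starting vector the terms involving z[i+1:] vanish, so the sweep reduces to forward substitution with M = tril(A); the result therefore coincides with solving Mz = y directly.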

A bound for κ(W−1A) (cf. (7.25b)) is presented in the following lemma.

Lemma 7.4.1 We assume that A and M are symmetric positive definite matrices and that the method in (7.30) is convergent, i.e. ρ(C) < 1 with C := I − M^{−1}A. Then the following holds:

  κ(M^{−1}A) ≤ (1 + ρ(C)) / (1 − ρ(C))                             (7.31)

Proof. Because A and M are symmetric positive definite it follows that

  σ(M^{−1}A) = σ(M^{−1/2}AM^{−1/2}) ⊂ (0, ∞) .

Using ρ(I − M^{−1}A) < 1 we obtain that σ(M^{−1}A) ⊂ (0, 2). The eigenvalues of M^{−1}A are denoted by µi:

  0 < µ1 ≤ µ2 ≤ . . . ≤ µn < 2 .


Hence ρ(C) = max{ |1 − µ1|, |1 − µn| } holds and

  κ(M^{−1}A) = µn/µ1 = (1 + (µn − 1)) / (1 − (1 − µ1)) ≤ (1 + |1 − µn|) / (1 − |1 − µ1|) ≤ (1 + ρ(C)) / (1 − ρ(C)) .

With respect to the bound in (7.31) we note that the function x ↦ (1 + x)/(1 − x) increases monotonically on [0, 1). In the introductory example above we have seen that it is favourable to have a small value for κ(M^{−1}A). In (7.31) we have a bound on κ(M^{−1}A) that decreases if ρ(C) decreases. This indicates that the higher the convergence rate of the iterative method in (7.30), the better the quality of M as a preconditioner for A.
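The bound (7.31) can be checked numerically for the Jacobi choice M = D on the one-dimensional analogue of the model problem, for which equality actually holds. The script below is our own illustration:

```python
import numpy as np

# Check of (7.31) for M = D (Jacobi) on the 1D Poisson matrix
n = 30
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M = np.diag(np.diag(A))
# Here M^{-1}A = A/2 is symmetric, so eigvalsh applies
mu = np.sort(np.linalg.eigvalsh(np.linalg.solve(M, A)))
kappa = mu[-1] / mu[0]
rho = max(abs(1.0 - mu[0]), abs(1.0 - mu[-1]))   # rho(C), C = I - M^{-1}A
```

For this matrix µ1 = 1 − cos(π/(n+1)) and µn = 1 + cos(π/(n+1)), so both sides of (7.31) equal (1 + cos)/(1 − cos) and the bound is sharp.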

Example 7.4.2 (Discrete Poisson equation) We consider the matrix A resulting from the finite element discretization of the Poisson equation as described in section 6.6. If we use the method of Jacobi, i.e. M = D, then ρ(C) ≈ 1 − ch² holds and (7.31) results in

  κ(D^{−1}A) ≲ 2c^{−1}h^{−2} .                                     (7.32)

In this model problem the eigenvalues are known and it can be shown that the exponent −2 in (7.32) is sharp. When we use the SSOR basic iterative method in (7.30), with M as in (6.18) and with an appropriate value for ω, we have ρ(C) ≈ 1 − ch and thus

  κ(M^{−1}A) ≲ 2c^{−1}h^{−1} .                                     (7.33)

Hence, in this example, for the SSOR preconditioner the quantity κ(M^{−1}A) is significantly smaller than for the Jacobi preconditioner. So in a preconditioned Richardson method or in a preconditioned conjugate gradient method (cf. Example 7.7.1) the SSOR preconditioner results in a method with a higher rate of convergence than the Jacobi preconditioner.

7.5 Preconditioning based on incomplete LU factorizations

In this section we discuss a very popular preconditioning technique, which is based on the classical Gaussian elimination principle. Using the Gaussian elimination method, in combination with partial pivoting (row permutations) if necessary, results in an LU factorization of the matrix A. The LU factorization of A can be used for solving a linear system with matrix A. However, for the large sparse systems that we consider, this solution method is inefficient. In an incomplete factorization method the Gaussian elimination is only partly performed, which then yields an approximate LU factorization A ≈ LU with L and U sparse. Here we only explain the basic concepts of incomplete factorization methods. For an extensive discussion of this topic we refer to Axelsson [6], Saad [78] and to Bruaset [25] (the latter contains many references related to this subject).


7.5.1 LU factorization

The direct method of Gaussian elimination for solving a system Ax = b is closely related to the LU factorization of A. We recall the following: for every square matrix A there exists a permutation matrix P, a lower triangular matrix L with diag(L) = I, and an upper triangular matrix U such that the factorization

  PA = LU                                                          (7.34)

holds. If A is nonsingular, then for given P these L and U are unique. To simplify the discussionwe only consider the case P = I, i.e. we do not use pivoting. It is known that a factorization as in(7.34) with P = I exists if A is symmetric positive definite or if A is an M-matrix. Many differentalgorithms for the computation of an LU factorization exist (cf. Golub and Van Loan [41]). Astandard technique is presented in the following algorithm, in which aij is overwritten by lij ifi > j and by uij otherwise.

LU factorization.
  For k = 1, . . . , n−1
    If akk = 0 then quit else
    For i = k+1, . . . , n
      η := aik/akk ;  aik := η ;
      For j = k+1, . . . , n
        aij := aij − η akj .
                                                                   (7.35)

Clearly, the Gaussian elimination process fails if we encounter a zero pivot. In the "if" condition in (7.35) it is checked whether the pivot in the kth elimination step is equal to zero. If this condition is never true, the Gaussian elimination algorithm (7.35) yields an LU decomposition as in (7.34) with P = I. In the kth step of the Gaussian elimination process we eliminate the nonzero entries below the diagonal in the kth column. Due to this the entries in the lower right (n − k) × (n − k) block of the matrix change. This corresponds to the assignment aij := aij − ηakj, with a loop over i and j. In the assignment aik := η, with a loop over i, the values of lik, i > k, are computed. Finally note that in the kth step of the elimination process the entries amj, with 1 ≤ m ≤ k and j ≥ m, do not change; these are the entries umj (1 ≤ m ≤ k, j ≥ m) of the matrix U.
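Algorithm (7.35) translates almost line by line into NumPy; the sketch below (function name and final check are our own) overwrites A in the same way, storing L in the strict lower triangle and U in the rest:

```python
import numpy as np

def lu_inplace(A):
    """LU factorization following (7.35): after the call, the strict lower
    triangle holds L (unit diagonal implied), the remainder holds U."""
    A = A.astype(float)                     # work on a copy
    n = A.shape[0]
    for k in range(n - 1):
        if A[k, k] == 0.0:
            raise ValueError("zero pivot; pivoting would be required")
        for i in range(k + 1, n):
            A[i, k] /= A[k, k]                  # l_ik = eta
            A[i, k+1:] -= A[i, k] * A[k, k+1:]  # a_ij := a_ij - eta a_kj
    return A

B = np.array([[4., 3., 0.], [6., 3., 1.], [0., 2., 5.]])
F = lu_inplace(B)
L = np.tril(F, -1) + np.eye(3)
U = np.triu(F)
```

The factors satisfy LU = B, which is the defining property of the decomposition (7.34) with P = I.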

Another possible implementation of Gaussian elimination is based on solving the n² equations A = LU for the n² unknowns (lij)_{1≤j<i≤n}, (uij)_{1≤i≤j≤n}. These n² equations are given by

  aik = Σ_{j=1}^{min(i,k)} lij ujk ,   1 ≤ i, k ≤ n .              (7.36)

This yields the following explicit formulas for lik and uik:

  lik = ( aik − Σ_{j=1}^{k−1} lij ujk ) / ukk ,   1 ≤ k < i ≤ n ,  (7.37)
  uik = aik − Σ_{j=1}^{i−1} lij ujk ,             1 ≤ i ≤ k ≤ n .  (7.38)

Thus we can compute L and U row by row, i.e. we take i = 1, 2, . . . , n and for i fixed we compute lik by (7.37) with k = 1, . . . , i−1 and then uik by (7.38) with k = i, . . . , n. We discuss a simple implementation of this row wise Gaussian elimination process. We take i fixed and introduce the notation

  a^{(m)}_{ik} := aik − Σ_{j=1}^{m−1} lij ujk ,   1 ≤ k, m ≤ n.    (7.39)

Note that a^{(1)}_{ik} = aik and

  uik = a^{(i)}_{ik}          for k ≥ i,
  lik = a^{(k)}_{ik} / ukk    for k < i,
  a^{(m+1)}_{ik} = a^{(m)}_{ik} − lim umk .
                                                                   (7.40)

Using these formulas the entries lik and uik can be computed as follows. Note that u1k = a1k for k = 1, . . . , n. Assume that the rows 1, . . . , i−1 of L and U have been computed; then lik, 1 ≤ k < i, and uik, i ≤ k ≤ n, are determined by

  For k = 1, . . . , i−1
    lik = a^{(k)}_{ik} / ukk
    For j = k+1, . . . , n
      a^{(k+1)}_{ij} = a^{(k)}_{ij} − lik ukj
  For k = i, . . . , n
    uik = a^{(i)}_{ik}

As in (7.35) we can overwrite the matrix A, and we then obtain the following algorithm, which is commonly used for a row-contiguous data structure:

Row wise LU factorization.
  For i = 2, . . . , n
    For k = 1, . . . , i−1
      η := aik/akk ;  aik := η ;
      For j = k+1, . . . , n
        aij := aij − η akj .
                                                                   (7.41)

For ease of presentation we deleted the statement "If akk = 0 then quit". If in both algorithms, (7.35) and (7.41), a zero pivot does not occur (i.e. akk = 0 is never true), then both algorithms yield identical LU factorizations. For certain classes of matrices, for example symmetric matrices or matrices having a band structure, there exist Gaussian elimination algorithms which take advantage of special properties of the matrix. Such specialized algorithms enhance efficiency. A well-known example is the Cholesky decomposition method, in which for a symmetric positive definite matrix A a factorization A = LL^T is computed (here L is lower triangular, but diag(L) is not necessarily equal to I). Based on the formula in (7.36) the following algorithm is obtained:

Cholesky factorization.
  For k = 1, . . . , n
    akk := √( akk − Σ_{j=1}^{k−1} a²kj )
    For i = k+1, . . . , n
      aik := ( aik − Σ_{j=1}^{k−1} aij akj ) / akk
                                                                   (7.42)
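A direct transcription of (7.42), overwriting the lower triangle of A (the function name and test matrix are our own):

```python
import numpy as np

def cholesky(A):
    # Cholesky sketch following (7.42); returns L with A = L L^T
    A = np.array(A, dtype=float)            # work on a copy
    n = A.shape[0]
    for k in range(n):
        A[k, k] = np.sqrt(A[k, k] - A[k, :k] @ A[k, :k])
        for i in range(k + 1, n):
            A[i, k] = (A[i, k] - A[i, :k] @ A[k, :k]) / A[k, k]
    return np.tril(A)

Apd = np.array([[4., 2., 0.], [2., 5., 1.], [0., 1., 3.]])
L = cholesky(Apd)
```

Since only the lower triangle of a symmetric matrix is accessed, roughly half the work and storage of the general LU algorithm suffices.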


To obtain a stable Gaussian elimination algorithm it is important to use a (partial) pivoting strategy. We do not discuss this topic here but refer to the literature, e.g. Golub and Van Loan [41]. We note that if the matrix A is symmetric positive definite or weakly diagonally dominant, then a straightforward implementation of Gaussian elimination is stable even without using pivoting. For example, the Cholesky algorithm as in (7.42), applied to a symmetric positive definite matrix, is stable.

7.5.2 Incomplete LU factorization

In general, in a Gaussian elimination process applied to a sparse matrix an unacceptable amountof fill-in is created in the matrices L and U. In this section we describe a simple way to avoidexcessive fill-in, resulting in an incomplete LU factorization. We note that in this section weagain use the notation L to denote a lower triangular matrix with diag(L) = I and U to denotean upper triangular matrix. These L and U, however, may differ from the L and U used in theLU decomposition discussed in Section 7.5.1.

We introduce the graph of the matrix A:

  G(A) := { (i, j) | 1 ≤ i, j ≤ n and aij ≠ 0 } .

Let S be a subset of the index set { (i, j) | 1 ≤ i, j ≤ n }. We call this subset the sparsity pattern and we assume:

  { (i, i) | 1 ≤ i ≤ n } ⊂ S ,   G(A) ⊂ S .                        (7.43)

In our applications the matrices A are such that all diagonal entries are nonzero and thus (7.43)reduces to the condition G(A) ⊂ S. We now simply enforce sparsity of L and U by settingevery entry in L and U to zero if the corresponding index is outside the sparsity pattern, i.e.we introduce the condition:

lij = uij = 0 if (i, j) /∈ S. (7.44)

We apply Gaussian elimination and we require sparsity of L and U as formulated in (7.44). This then yields an incomplete LU factorization. As for the (complete) LU factorization method discussed in Section 7.5.1, several different implementations of an incomplete factorization method exist. We present a few well-known algorithms. We assume that no zero pivot occurs in the algorithms below. Theorem 7.5.3 gives sufficient conditions on the matrix A such that this assumption is fulfilled.

We start with the incomplete Cholesky factorization A = LLT − R based on algorithm (7.42).For this algorithm to make sense, the matrix A should be symmetric positive definite. Topreserve symmetry we assume that the sparsity pattern is symmetric, i.e. if (i, j) ∈ S then(j, i) ∈ S, too. We use a formulation in which aij is overwritten by lij if i ≥ j.

Incomplete Cholesky factorization.
  For k = 1, . . . , n
    akk := √( akk − Σ_{j=1}^{k−1} a²kj )
    For i = k+1, . . . , n
      If (i, k) ∈ S then
        aik := ( aik − Σ_{j=1}^{k−1} aij akj ) / akk
                                                                   (7.45)

The sums in this algorithm should be taken only over those j for which the corresponding indexes(k, j) and (i, j) are in S.


The first thorough analysis of incomplete factorization techniques is given in Meijerink and Van der Vorst [62]. In that paper a modified (i.e. incomplete) version of algorithm (7.35) is considered:

Incomplete LU factorization.
  For k = 1, . . . , n−1
    For j = k+1, . . . , n   If (k, j) ∉ S then akj := 0   (∗)
    For i = k+1, . . . , n   If (i, k) ∉ S then aik := 0   (∗)
    For i = k+1, . . . , n
      η := aik/akk ;  aik := η ;
      For j = k+1, . . . , n
        aij := aij − η akj .
                                                                   (7.46)

Compared to algorithm (7.35) only the lines (∗) have been added. In these lines certain entries in the kth row of U and in the kth column of L are set to zero, according to the condition (7.44). Algorithm (7.46) has a simple structure and is easy to analyze (cf. theorem 7.5.1). However, this algorithm is a rather inefficient implementation of incomplete factorization. Below we reformulate the algorithm, resulting in a significantly more efficient implementation, given in algorithm (7.56).

Theorem 7.5.1 Assume that algorithm (7.46) does not break down (i.e. akk = 0 is never true).Then this algorithm results in an incomplete factorization A = LU − R with

lij = uji = 0 for 1 ≤ i < j ≤ n , (7.47)

lij = uij = 0 for (i, j) /∈ S , (7.48)

rij = 0 for (i, j) ∈ S. (7.49)

Proof. The result in (7.47) holds because L (U) is lower (upper) triangular. By construction (cf. lines (∗) in the algorithm) the result in (7.48) holds. It remains to prove the result in (7.49). The standard basis vector with value 1 in the mth entry is denoted by em. By vm = (v^1_m, v^2_m, . . . , v^n_m)^T we denote a generic n-vector with v^i_m = 0 for i ≤ m. Note that a standard (complete) Gaussian elimination as in algorithm (7.35) can be represented in matrix formulation as:

  A1 = A
  For k = 1, . . . , n−1
    Ak+1 = Lk Ak ,   with Lk of the form Lk = I + vk e^T_k .       (7.50)

The matrices Ak+1 have the property (Ak+1)ij = 0 if i > j and j ≤ k. Then U := An = Ln−1 Ln−2 · · · L1 A holds. Using L^{−1}_k = I − vk e^T_k we obtain the LU factorization

  A = L^{−1}_1 L^{−1}_2 · · · L^{−1}_{n−1} U = ( I − Σ_{k=1}^{n−1} vk e^T_k ) U =: LU .   (7.51)

The kth stage of algorithm (7.46) consists of two parts. First the kth row and kth column are modified by setting certain entries to zero (lines (∗) in (7.46)) and then a standard Gaussian elimination step as in (7.50) is applied to the modified matrix. In matrix formulation this yields:

  A1 = A                                                           (7.52a)
  For k = 1, . . . , n−1
    Ãk = Ak + Rk , with                                            (7.52b)
    Rk of the form Rk = vk e^T_k + ek v^T_k , and                  (7.52c)
    (Rk)ij = 0 for all (i, j) ∈ S ,                                (7.52d)
    Ak+1 = Lk Ãk , with                                            (7.52e)
    Lk of the form Lk = I + vk e^T_k .                             (7.52f)

Again, the matrix Ak+1 has the property (Ak+1)ij = 0 if i > j and j ≤ k. The three vectors vk that occur in (7.52c) and (7.52f) may all be different. From the form of Rk and Lm (cf. (7.52c), (7.52f)) we obtain

  Lm Rk = Rk   for m < k .                                         (7.53)

Now note that for the resulting upper triangular matrix U := An we get, using (7.53) and the notation L̂ := Ln−1 Ln−2 · · · L1:

  U = Ln−1 Ãn−1 = Ln−1 An−1 + Ln−1 Rn−1
    = Ln−1 Ln−2 Ãn−2 + L̂ Rn−1
    = Ln−1 Ln−2 An−2 + L̂ Rn−2 + L̂ Rn−1
    = . . . = L̂ A + L̂ (R1 + R2 + . . . + Rn−1) .

As in (7.51) we define L := L̂^{−1} = L^{−1}_1 L^{−1}_2 · · · L^{−1}_{n−1} = I − Σ_{k=1}^{n−1} vk e^T_k and we get

  LU = A + Σ_{j=1}^{n−1} Rj .

Hence A = LU − R with R := Σ_{j=1}^{n−1} Rj, and the result in (7.49) follows from (7.52d).

We can use the results in theorem 7.5.1 to derive a much more efficient implementation of algorithm (7.46). Using the condition in (7.44) (or (7.48)) for the incomplete LU factorization, we obtain for L = (lij)_{1≤i,j≤n}, U = (uij)_{1≤i,j≤n}:

  lij = uji = 0   for 1 ≤ i < j ≤ n ,
  lij = uij = 0   for (i, j) ∉ S ,
  lii = 1         for 1 ≤ i ≤ n.
                                                                   (7.54)

By |S| we denote the number of elements in the sparsity pattern S. After using (7.54) there are still |S| entries in L and U which have to be determined. From (7.49) we deduce that

  aij = (LU)ij   for (i, j) ∈ S .                                  (7.55)

This yields |S| (nonlinear) equations for these unknown entries of L and U. We now follow the line of reasoning as in (7.36)-(7.41) for the complete LU factorization. From (7.55) we obtain (cf. (7.36)):

  aik = Σ_{j=1}^{min(i,k)} lij ujk   if 1 ≤ i, k ≤ n and (i, k) ∈ S .


This yields explicit formulas for lik and uik (cf. (7.37)):

  lik = ( aik − Σ_{j=1}^{k−1} lij ujk ) / ukk   if 1 ≤ k < i ≤ n and (i, k) ∈ S ,
  uik = aik − Σ_{j=1}^{i−1} lij ujk             if 1 ≤ i ≤ k ≤ n and (i, k) ∈ S.

Thus we can compute L and U row by row. We take i fixed and use the notation as in (7.39). This yields (cf. (7.40)):

  uik = a^{(i)}_{ik}          if k ≥ i and (i, k) ∈ S,
  lik = a^{(k)}_{ik} / ukk    if k < i and (i, k) ∈ S,
  a^{(m+1)}_{ik} = a^{(m)}_{ik} − lim umk .

Using these formulas the entries lik and uik can be computed as follows. Note that for k = 1, . . . , n, u1k = a1k if (1, k) ∈ S and u1k = 0 otherwise. Assume that the rows 1, . . . , i−1 of L and U have been computed; then lik and uik are determined by

  For k = 1, . . . , i−1
    If (i, k) ∈ S then lik = a^{(k)}_{ik} / ukk
    For j = k+1, . . . , n
      a^{(k+1)}_{ij} = a^{(k)}_{ij} − lik ukj   (∗)
  For k = i, . . . , n
    If (i, k) ∈ S then uik = a^{(i)}_{ik}

In the computation of lik and uik we use a^{(m)}_{ik} only for (i, k) ∈ S. Hence the update in line (∗) is needed only if (i, j) ∈ S. Again we use a formulation in which we overwrite the matrix A, and we obtain the following incomplete version of algorithm (7.41):

Incomplete row wise LU factorization.
  For i = 2, . . . , n
    For k = 1, . . . , i−1   If (i, k) ∈ S then
      η := aik/akk ;  aik := η ;
      For j = k+1, . . . , n   If (i, j) ∈ S then
        aij := aij − η akj .
                                                                   (7.56)
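A sketch of (7.56) with the common choice S = G(A) (often called ILU(0)). The dense-matrix representation and the names below are our own simplifications; a serious implementation would of course use a sparse data structure. By (7.55), the product LU reproduces A exactly on the pattern S:

```python
import numpy as np

def ilu0(A):
    """Incomplete row wise LU sketch following (7.56) with S = G(A).
    Returns F holding L (strict lower part, unit diagonal) and U."""
    F = A.astype(float).copy()
    S = A != 0                     # sparsity pattern = nonzero pattern of A
    n = A.shape[0]
    for i in range(1, n):
        for k in range(i):
            if S[i, k]:
                F[i, k] /= F[k, k]             # l_ik
                for j in range(k + 1, n):
                    if S[i, j]:                # update only inside S
                        F[i, j] -= F[i, k] * F[k, j]
    return F

A = np.array([[ 4., -1.,  0., -1.],
              [-1.,  4., -1.,  0.],
              [ 0., -1.,  4., -1.],
              [-1.,  0., -1.,  4.]])   # weakly diag. dominant M-matrix
F = ilu0(A)
L = np.tril(F, -1) + np.eye(4)
U = np.triu(F)
```

Since A is an M-matrix, theorem 7.5.3 guarantees that no zero pivot occurs, and the defining property (7.55), (LU)ij = aij on S, can be verified directly.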

Remark 7.5.2 The results in theorem 7.5.1 and the construction (7.54)-(7.56) following that theorem show that if algorithm (7.46) does not break down, then algorithm (7.56) and algorithm (7.46) are equivalent, in the sense that these two algorithms yield the same incomplete LU factorization. Moreover, the derivation in (7.54)-(7.56) implies that if an incomplete LU factorization exists which satisfies (7.47)-(7.49), then these L (with diag(L) = I) and U are unique. Note that the implementation in (7.56) is more efficient than the implementation (7.46). In the latter algorithm it may well happen that certain assignments aij := aij − ηakj in the j-loop are superfluous, since for a higher value of k (in the k-loop) these previously computed values are set to zero.


As stated in theorem 7.5.1 and remark 7.5.2, a unique incomplete LU factorization which satisfies (7.47)-(7.49) exists if algorithm (7.46) does not break down. The following result is proved in [62], theorem 2.3.

Theorem 7.5.3 If A is an M-matrix then algorithm (7.46) does not break down. If A is inaddition symmetric positive definite, and the pattern S is symmetric, algorithm (7.45) does notbreak down.

If A is an M-matrix, then the unique incomplete LU factorization can be computed using, forexample, algorithm (7.46) or algorithm (7.56). If A is in addition symmetric positive definite,and the pattern S is symmetric, then we can use algorithm (7.45), too. With respect to thestability of an incomplete LU factorization we give the following result, which is proved inMeijerink and Van der Vorst [62] theorem 3.2.

Theorem 7.5.4 Let A be an M-matrix. The incomplete LU factorization of A as described in theorem 7.5.1 is denoted by A = L̃Ũ − R. The complete LU factorization of A is denoted by A = LU. Then

  |l̃ij| ≤ |lij|   for all 1 ≤ i, j ≤ n

holds. Hence, the construction of an incomplete LU factorization is at least as stable as the construction, without any pivoting, of a complete LU factorization. If A is in addition symmetric, then the construction of an incomplete Cholesky factorization is at least as stable as the construction of a complete Cholesky factorization.

Proof . Given in Meijerink and Van der Vorst [62].

Even if A is an M-matrix the construction of a complete LU factorization may suffer from instabilities. However, if the matrix is weakly diagonally dominant, a complete LU factorization can be computed, without any pivoting, in a stable way. We conclude that for a weakly diagonally dominant M-matrix (e.g. the matrices of our model problems) the construction of an incomplete LU factorization (using (7.56) or (7.45)) is a stable process.

We can use the incomplete LU decomposition to construct a basic iterative method. Such amethod is obtained by taking M := LU, with L and U as in theorem 7.5.1, and applying theiteration xk+1 = xk − M−1(Axk − b). For this method we have to compute an incomplete LUfactorization of the given matrix A and per iteration we have to solve a system of the formLUx = y. The latter can be done with low computational costs by a forward and backwardsubstitution process. The iteration matrix of this method is given by I−(LU)−1A. With respectto the convergence of this iterative method we give the following theorem.
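The solve with M = LU amounts to one forward and one backward substitution per iteration. A sketch of this (our own illustration: the factors below are constructed artificially so that LU ≈ A and ρ(I − (LU)^{−1}A) < 1, and all names are assumptions):

```python
import numpy as np

def forward_subst(L, y):
    # Solve L z = y with L unit lower triangular
    z = np.array(y, dtype=float)
    for i in range(len(y)):
        z[i] -= L[i, :i] @ z[:i]
    return z

def backward_subst(U, z):
    # Solve U x = z with U upper triangular
    x = np.array(z, dtype=float)
    for i in range(len(z) - 1, -1, -1):
        x[i] = (x[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

def iterate_with_lu_preconditioner(A, b, L, U, x0, iters=60):
    # x_{k+1} = x_k - (LU)^{-1}(A x_k - b)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x -= backward_subst(U, forward_subst(L, A @ x - b))
    return x

n = 4
L = np.eye(n) + np.tril(0.3 * np.ones((n, n)), -1)
U = np.triu(np.ones((n, n))) + 3.0 * np.eye(n)
A = L @ U + 0.05 * np.eye(n)     # A close to L U, so the iteration converges
b = np.ones(n)
x = iterate_with_lu_preconditioner(A, b, L, U, np.zeros(n))
```

Both triangular solves cost O(number of nonzeros of L and U) operations, which is the "low computational costs" requirement (7.25a) for sparse factors.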

Theorem 7.5.5 Assume that A is an M-matrix. For the incomplete LU factorization as in theorem 7.5.1 we have:

ρ(I − (LU)−1A) < 1.
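As an illustration of this basic iterative method, here is a minimal sketch. For simplicity M is taken as the tridiagonal part of A (standing in for the product LU of an incomplete factorization), and the systems with M are solved with a dense solver instead of the forward/backward substitutions used in practice; all names are ours.

```python
import numpy as np

def basic_iteration(A, M, b, x0, nsteps):
    # Basic iterative method x^{k+1} = x^k - M^{-1}(A x^k - b);
    # with M = LU the solve would be a forward/backward substitution.
    x = x0.astype(float).copy()
    for _ in range(nsteps):
        x = x - np.linalg.solve(M, A @ x - b)
    return x

# Weakly diagonally dominant M-matrix; M is its tridiagonal part.
A = np.array([[4., -1., 0., -1.],
              [-1., 4., -1., 0.],
              [0., -1., 4., -1.],
              [-1., 0., -1., 4.]])
M = np.triu(np.tril(A, 1), -1)
b = np.ones(4)
x = basic_iteration(A, M, b, np.zeros(4), 60)
```

Since the iteration matrix I − M−1A has spectral radius well below one here, a few dozen iterations reduce the error to machine precision.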

7.5.3 Modified incomplete Cholesky method

In the modified incomplete Cholesky method the fill-in entries which are neglected in the loop over i in algorithm (7.45) are taken into account, in the sense that these are moved to the diagonal. Moving these entries to the corresponding diagonal elements does not cause any additional fill-in. The algorithm is as follows:

Modified incomplete Cholesky factorization.

For k = 1, . . . , n
    akk := (akk − Σ_{j=1}^{k−1} a²kj)^(1/2)
    For i = k + 1, . . . , n
        If (i, k) ∈ S then
            aik := (aik − Σ_{j=1}^{k−1} aij akj)/akk
        else
            aii := aii − (aik − Σ_{j=1}^{k−1} aij akj)/akk
                                                            (7.57)

Again the sums in this algorithm should only be taken over those j for which the corresponding indices are in S. One can prove that this algorithm, if it does not break down, yields an incomplete factorization

A = LLT + R

with

    lij = 0 if (i, j) ∉ S ,
    (LLT )ij = aij if (i, j) ∈ S, i ≠ j ,
    Σ_{j=1}^{n} (LLT )ij = Σ_{j=1}^{n} aij for all i .

It is known that for certain problems this “lumping” strategy (moving in each row certain entries to the diagonal) can improve the quality of the incomplete Cholesky preconditioner significantly. This is illustrated in numerical experiments for the Poisson equation in section 7.7.
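The lumping strategy can be sketched in code. The version below is a right-looking (outer-product) formulation of the same idea — each dropped fill-in update is moved onto the diagonal of its row — and is not a verbatim transcription of (7.57); function name and test matrix are ours.

```python
import numpy as np

def mic0(A):
    # Modified incomplete Cholesky on the pattern S = G(A) (nonzeros of A),
    # right-looking form: dropped fill-in is lumped onto the diagonal.
    A = A.astype(float).copy()
    n = A.shape[0]
    S = A != 0
    L = np.zeros((n, n))
    for k in range(n):
        L[k, k] = np.sqrt(A[k, k])
        for i in range(k + 1, n):
            if S[i, k]:
                L[i, k] = A[i, k] / L[k, k]
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                upd = L[i, k] * L[j, k]
                if S[i, j] or i == j:
                    A[i, j] -= upd
                else:
                    A[i, i] -= upd   # move dropped fill-in to the diagonal
    return L

A0 = np.array([[4., 1., 1.],
               [1., 4., 0.],
               [1., 0., 4.]])
L = mic0(A0)
```

For this 3 × 3 example the row sums of LLT coincide with those of A and the on-pattern off-diagonal entries are reproduced exactly, while the diagonal entries differ — exactly the properties listed above.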

7.6 Problem based preconditioning

In preparation

7.7 Preconditioned Conjugate Gradient Method

A convergence analysis of the CG method results in a bound for the rate of convergence of the CG method as in (7.22), which depends on the spectral condition number of the matrix A. A larger condition number will result in a lower rate of convergence. In Section 7.3 we discussed the concept of preconditioning. In this section we apply this preconditioning technique to the CG method.

Let B be a given regular n × n-matrix. Consider the following transformation of the original problem given in (7.1):

    Ãx̃ = b̃ with à := B−1AB−T , x̃ := BTx , b̃ := B−1b . (7.58)

The matrix à is symmetric positive definite, so we can apply the CG method from (7.18) to the system in (7.58). This results in the following algorithm (which is not used in practice, because in general the computation of à = B−1AB−T will be too expensive):

x̃0 a given starting vector ; r̃0 = Ãx̃0 − b̃

for k ≥ 0 (if r̃k ≠ 0) :

    p̃k = −r̃k + (〈r̃k, r̃k〉/〈r̃k−1, r̃k−1〉) p̃k−1   (if k = 0 : p̃0 := −r̃0)

    x̃k+1 = x̃k + αopt(x̃k, p̃k) p̃k with αopt(x̃k, p̃k) = 〈r̃k, r̃k〉/〈p̃k, Ãp̃k〉

    r̃k+1 = r̃k + αopt(x̃k, p̃k) Ãp̃k .
                                                            (7.59)

This algorithm yields approximations x̃k for the solution x̃∗ = BTx∗ of the transformed system. To obtain an algorithm in the original variables we introduce the notation

    pk := B−T p̃k , xk := B−T x̃k , rk := Br̃k , (7.60)

    zk := B−T r̃k = B−TB−1rk = W−1rk , with W := BBT . (7.61)

Using this notation we can reformulate the algorithm in (7.59) as follows

x0 a given starting vector ; r0 := Ax0 − b

for k ≥ 0 (if rk ≠ 0) :

    solve zk from Wzk = rk

    pk := −zk + (〈zk, rk〉/〈zk−1, rk−1〉) pk−1   (if k = 0 : p0 := −z0)

    xk+1 := xk + αopt(xk,pk) pk with αopt(xk,pk) = 〈zk, rk〉/〈pk,Apk〉

    rk+1 := rk + αopt(xk,pk) Apk .
                                                            (7.62)

This algorithm, which yields approximations xk for the solution x∗ of the original system, is called the Preconditioned Conjugate Gradient method (PCG) with preconditioner W. For W = I we obtain the algorithm as in (7.18).

Note that in the algorithm in (7.62) the matrix B is involved only in W = BBT . Hence this algorithm is applicable if a symmetric positive definite matrix W is available. Such a matrix has a corresponding Cholesky decomposition W = BBT . This decomposition, however, plays a role only in the theoretical derivation of the method (cf. (7.58)) and is not needed in the algorithm. Using the identity ‖xk − x∗‖A = ‖x̃k − x̃∗‖Ã we obtain that the error reduction in (7.62), measured in ‖ · ‖A, is the same as the error reduction in (7.59), measured in ‖ · ‖Ã. Based on the result in (7.22) we have that the rate of convergence of the algorithm in (7.59), and thus of the PCG algorithm, too, is determined by

    κ(Ã) = κ(B−1AB−T ) = κ(W−1A) .

So for a significant increase of the rate of convergence due to preconditioning we should have a preconditioner W with κ(W−1A) ≪ κ(A). In the PCG algorithm in (7.62) we have to solve a system with the matrix W in every iteration. So the matrix W should be such that the solution of this system can be computed with “low” computational costs (not much more than one matrix-vector multiplication). Note that these requirements for the preconditioner are as in section 7.3.
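The algorithm (7.62) translates almost line by line into code. A minimal sketch (helper names are ours; the Jacobi preconditioner W = diag(A) in the usage example is only for illustration):

```python
import numpy as np

def pcg(A, b, solve_W, x0, tol=1e-12, maxit=500):
    # Preconditioned Conjugate Gradient method, cf. (7.62).
    # solve_W(r) returns the solution z of W z = r.
    x = x0.astype(float).copy()
    r = A @ x - b
    z = solve_W(r)
    p = -z
    zr = z @ r
    for _ in range(maxit):
        Ap = A @ p
        alpha = zr / (p @ Ap)            # alpha_opt = <z,r>/<p,Ap>
        x = x + alpha * p
        r = r + alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = solve_W(r)
        zr_new = z @ r
        p = -z + (zr_new / zr) * p       # new search direction
        zr = zr_new
    return x

# 1D Poisson model matrix with Jacobi preconditioner W = diag(A)
n = 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = pcg(A, b, lambda r: r / A.diagonal(), np.zeros(n))
```

Per iteration only one multiplication with A and one solve with W are needed, as required above.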


For the PCG method we need a symmetric positive definite preconditioner W. In section 7.3 we discussed the symmetric positive definite preconditioners W = MSSOR (SSOR preconditioning), W = LLT (incomplete Cholesky) and W = L̃L̃T (modified incomplete Cholesky). In the example below we apply the PCG method with these three preconditioners to the discrete Poisson equation.

Example 7.7.1 (Poisson model problem) We consider the discrete Poisson equation as in section 6.6. We apply the PCG method with the SSOR preconditioner. For the parameter ω in the SSOR preconditioning we use the value ωopt as in theorem 6.4.3, i.e. ω is such that the spectral radius of the iteration matrix of the SOR method is minimal. In Table 7.2 we show results that can be compared with the results in Table 7.1. We measure the error reduction in the Euclidean norm ‖ · ‖2. By # we denote the number of iterations needed to reduce the norm of the starting error by a factor R = 10³.

h                     1/40    1/80    1/160   1/320
ω = 2/(1 + sin(πh))   1.854   1.924   1.961   1.981
#                     11      16      22      32

Table 7.2: PCG with SSOR preconditioner applied to Poisson problem.

In Axelsson and Barker [7] it is shown that for this model problem the SSOR preconditioner, with an appropriate value for the parameter ω, results in κ(W−1A) ≈ ch−1 and thus (cf. (7.22)) we expect an error reduction per iteration (measured in the A-norm) with at least a factor 1 − c√h. If this is the case, then for this problem the PCG method has a complexity O(n^(5/4)). The results in Table 7.2 are consistent with such a reduction factor of the form 1 − c√h. Apparently, the choice ω = ωopt, as explained above, is appropriate. Related to this we note that in Axelsson and Barker [7] it is shown that often the rate of convergence of the PCG method with SSOR preconditioning is not very sensitive with respect to perturbations in the value of the parameter ω. This phenomenon is illustrated in Table 7.3, where we show the results of PCG with SSOR preconditioning for h = 1/160 and for several values of ω.

ω   1.80   1.85   1.90   1.95   1.96   1.97   1.98   1.99
#   33     29     26     23     22     22     23     24

Table 7.3: PCG with SSOR preconditioner applied to Poisson problem.

In Table 7.4 we show the results obtained with the incomplete Cholesky preconditioner, i.e. W = LLT , and with the modified incomplete Cholesky preconditioner, i.e. W = L̃L̃T (cf. Section 7.5). In both cases, for the sparsity pattern we used S = G(A). In the literature these algorithms are denoted by ICCG and MICCG, respectively.

The results for ICCG indicate that for the preconditioned system we have κ(W−1A) ≈ ch−2, where the constant c is better than for the unpreconditioned system with W = I. The results for MICCG indicate that for the preconditioned system we have κ(W−1A) ≈ ch−1, which is comparable to the result with SSOR preconditioning.


h          1/40   1/80   1/160   1/320
ICCG, #    20     40     79      157
MICCG, #   8      11     14      20

Table 7.4: PCG with (M)IC preconditioner applied to Poisson problem.

Remark 7.7.2 (in preparation) IC preconditioning is often more robust than MIC preconditioning, for example for problems with discontinuous coefficients.

Remark 7.7.3 (in preparation) On the eigenvalue distribution of a preconditioned system in relation to CG convergence.


Chapter 8

Krylov Subspace Methods

8.1 Introduction

In Section 7.2 the CG method for solving Ax = b has been derived as a minimization method for the functional

    F(x) = ½〈x,Ax〉 − 〈x,b〉 .

If A is symmetric positive definite then this F is a quadratic functional with a unique minimizer and minimization of F is equivalent to solving Ax = b. If A is not symmetric positive definite then the nice minimization properties of CG do not hold and it is not clear whether CG is still useful. If A is not symmetric positive definite we can still try the CG algorithm. In practice we often observe that for nonsymmetric problems in which the symmetric part (i.e. ½(A + AT )) is positive definite, the CG algorithm is still a fairly efficient solver if the skew-symmetric part (i.e. ½(A − AT )) is “small” compared to the symmetric part. In other words, the CG algorithm can be used for solving nonsymmetric problems in which the nonsymmetric part is a perturbation of a symmetric positive definite part. In problems with moderate nonsymmetry (‖A − AT ‖ ≈ ‖A + AT ‖) or with strong nonsymmetry (‖A − AT ‖ ≫ ‖A + AT ‖) the CG algorithm generally diverges. For such nonsymmetric problems other Krylov subspace methods have been developed.

Example 8.1.1 We consider the discrete convection-diffusion problem as in section 6.6 with b1 = cos(π/6), b2 = sin(π/6) and h = 1/160. We take x0 = 0 and an error reduction factor R = 10³. The CG algorithm is applied to this problem for different values of the parameter ε. The results are shown in Table 8.1. Note that for large values of ε the problem is “nearly symmetric” and the convergence behaviour of the CG method is reasonable. For smaller values of ε the nonsymmetry of the problem increases and the CG method fails.

ε   10²   10¹   10⁰   10⁻¹   10⁻²
#   190   233   322   DIV    DIV

Table 8.1: CG method applied to convection-diffusion problem.

In section 8.2 below we show that, for A symmetric positive definite, the CG method can be seen as a projection method. Using this point of view we can develop variants of the CG method which can be used for problems in which A is symmetric but indefinite or A is nonsymmetric. In recent years, many such variants have been introduced. For an overview of these methods we refer to Saad [78], Freund et al. [36], Greenbaum [42], Sleijpen and Van der Vorst [85]. We will discuss a few important methods and explain the main approaches in this field of nonsymmetric Krylov subspace methods.

8.2 The Conjugate Gradient method reconsidered

For a given nonsingular A ∈ Rn×n and given r ∈ Rn we define the Krylov subspace as follows:

    Kk(A; r) := span{r, Ar, A²r, . . . , Ak−1r} . (8.1)

In the remainder of this chapter the Krylov subspace Kk(A; r0), with r0 = Ax0 − b the starting residual, will play an important role. To avoid certain technical details, we make the following assumption concerning the starting vector x0:

Assumption 8.2.1 In the remainder of this chapter we assume that x0 is chosen such that dim(Kk(A; r0)) = k for k = 1, 2, . . . , n.

We note that in the generic case this assumption is fulfilled. Only for special choices of x0 one has dim(Kk(A; r0)) < k for k < n. We emphasize that the formulations of the algorithms which are discussed in the remainder of this chapter do not depend on this assumption.

We first reconsider the CG method applied to the problem Ax = b with A symmetric positive definite. Using the results of theorem 7.2.1 we obtain that

xk ∈ x0 + Kk(A; r0)

holds and

A(xk − x∗) = Axk − b = rk ⊥ Kk(A; r0),

or, equivalently,

〈A(xk − x∗), z〉 = 〈xk − x∗, z〉A = 0 for all z ∈ Kk(A; r0).

We conclude that xk − x0 is the A-orthogonal projection (i.e. with respect to the A-inner product 〈·, ·〉A) of the starting error x∗ − x0 on Kk(A; r0). This is illustrated in figure 8.1.

[Figure 8.1: CG as a projection method — the starting error x∗ − x0 and its projection xk − x0 on Kk(A; r0); the right angle is w.r.t. the inner product 〈·, ·〉A.]

From the observations above it follows that the CG iterate xk can be characterized as the unique solution of the following problem:

determine xk ∈ x0 + Kk(A; r0) such that
‖xk − x∗‖A = min{ ‖x − x∗‖A | x ∈ x0 + Kk(A; r0) } . (8.2)

Because xk − x∗ ⊥A z ⇔ Axk − b ⊥ z, an equivalent formulation of this problem is:

determine xk ∈ x0 + Kk(A; r0) such that
Axk − b ⊥ Kk(A; r0) . (8.3)

We will now derive an algorithm ((8.16) below), different from the CG algorithm, that can be used to solve this problem. For this algorithm and the CG algorithm the computational costs per iteration are comparable and, in exact arithmetic, these two algorithms yield the same iterands. The ideas underlying this alternative algorithm will play an important role in the derivation of algorithms for the case that A is not symmetric positive definite.

We start with a simple method for computing an orthogonal basis of the Krylov subspace, the so-called Lanczos method:

q0 := 0; q1 := r0/‖r0‖; β0 := 0;
for j ≥ 1 :
    q̃j+1 := Aqj − βj−1qj−1,
    αj := 〈q̃j+1,qj〉,
    q̃j+1 := q̃j+1 − αjqj ,
    βj := ‖q̃j+1‖,
    qj+1 := q̃j+1/βj .
                                (8.4)

With induction one easily proves:

Theorem 8.2.2 If A is symmetric then the set q1,q2, . . . ,qk forms an orthogonal basis of the Krylov subspace Kk(A; r0) (k ≤ n).
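A direct transcription of (8.4) — a sketch with our own helper names — also lets one verify the orthogonality of theorem 8.2.2 and the tridiagonal structure of the recursion coefficients numerically:

```python
import numpy as np

def lanczos(A, r0, m):
    # Lanczos method (8.4): orthonormal basis q1,...,qm of K_m(A; r0),
    # for symmetric A. Returns Q = [q1 ... qm] and the coefficients.
    n = len(r0)
    Q = np.zeros((n, m))
    q_old = np.zeros(n)                 # q0 := 0
    q = r0 / np.linalg.norm(r0)         # q1
    alpha, beta = np.zeros(m), np.zeros(m)
    beta_old = 0.0                      # beta0 := 0
    for j in range(m):
        Q[:, j] = q
        qt = A @ q - beta_old * q_old   # three-term recursion step
        alpha[j] = qt @ q
        qt = qt - alpha[j] * q
        beta[j] = np.linalg.norm(qt)
        q_old, q = q, qt / beta[j]
        beta_old = beta[j]
    return Q, alpha, beta

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
A = B + B.T                             # symmetric test matrix
Q, alpha, beta = lanczos(A, rng.standard_normal(8), 5)
```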

Note that the method uses only a three-term recursion, i.e. qj+1 can be determined from qj and qj−1, and that the costs per iteration are low. Given the basis for the Krylov subspace of dimension j we need one matrix-vector multiplication, two inner products and a few vector updates to compute the orthogonal basis for the Krylov subspace of dimension j + 1. Define

    Qj := [q1 q2 . . . qj ]   (n × j matrix with columns qi) .

The recursion in (8.4) can be rewritten as

    Aqj = βj−1qj−1 + αjqj + βjqj+1 , (8.5)

and thus, with Tk ∈ Rk×k the symmetric tridiagonal matrix with diagonal entries α1, . . . , αk and sub- and superdiagonal entries β1, . . . , βk−1, and ek := (0, 0, . . . , 0, 1)T ∈ Rk,

    AQk = QkTk + βkqk+1eTk (8.6)


holds. Due to the orthogonality of the basis we have

    QTkQk = Ik (k × k identity matrix); QTk qk+1 = 0 . (8.7)

For solving the problem (8.3) we have to compute xk ∈ x0 + Kk(A; r0) which satisfies the orthogonality property

    〈Axk − b, z〉 = 0 for all z ∈ Kk(A; r0).

This yields the condition

    QTk (Axk − b) = QTk (A(xk − x0) − r0) = 0 . (8.8)

Note that q1 = r0/‖r0‖ and QTk r0 = ‖r0‖(1, 0, . . . , 0)T =: ‖r0‖e1. Since the vector xk − x0 must be an element of Kk(A; r0) it can be represented using the basis q1,q2, . . . ,qk, i.e. there exists a yk ∈ Rk such that xk − x0 = Qkyk. Using this, the condition (8.8) can be formulated as

    QTkAQkyk = ‖r0‖e1 . (8.9)

With the results in (8.6) and (8.7) we obtain that the solution of the problem (8.3) (or (8.2)) is given by

    Tkyk = ‖r0‖e1, (8.10a)
    xk = x0 + Qkyk. (8.10b)

Note that Tk = QTkAQk is a symmetric positive definite tridiagonal k × k matrix. So the vector xk can be obtained by first solving the tridiagonal system in (8.10a) and then computing xk as in (8.10b). This, however, would result in an algorithm with high computational costs per iteration. We now show that based on (8.10a), (8.10b) an algorithm can be derived in which the iterand xk can be updated from the previous iterand xk−1 in a simple and cheap way (as in the CG algorithm). To derive this algorithm we represent Tk using its LU factorization (which exists, because Tk is symmetric positive definite):

Tk = LkUk for k = 1, 2, . . . , where Lk is the unit lower bidiagonal matrix with subdiagonal entries l2, . . . , lk and Uk is the upper bidiagonal matrix with diagonal entries u1, . . . , uk and superdiagonal entries β1, . . . , βk−1; the βi are the same as in the matrix Tk. We also introduce the notation Pk = [p1 p2 . . . pk] := QkU−1k , zk := L−1k ‖r0‖e1. From (8.10a) and (8.10b) we then obtain

    xk = x0 + QkT−1k ‖r0‖e1 = x0 + QkU−1k L−1k ‖r0‖e1 = x0 + Pkzk. (8.11)

From the k-th column in the identity PkUk = Qk one obtains pk−1βk−1 + pkuk = qk and thus the simple update formula

    pk = (1/uk)(qk − βk−1pk−1). (8.12)

From the last row in the identity Tk = LkUk we obtain lkuk−1 = βk−1 and lkβk−1 + uk = αk, i.e.

    lk = βk−1/uk−1 , uk = αk − lkβk−1 . (8.13)


If we represent zk as zk = (zk−1, ξk)T with ξk ∈ R (k = 1, 2, . . .), it follows from (8.11) that

    xk = x0 + Pk−1zk−1 + pkξk = xk−1 + ξkpk . (8.14)

Finally, from the last equation in Lkzk = ‖r0‖e1 it follows that lkξk−1 + ξk = 0 and thus

    ξk = −lkξk−1 . (8.15)

In (8.12), (8.13), (8.14) and (8.15) we have recursion formulas which allow a simple update k − 1 → k. Combining these formulas with the Lanczos algorithm (8.4) for computing qk results in the following Lanczos iterative solution method:

q0 := 0; q1 := r0/‖r0‖; β0 = p0 = l1 = 0; ξ1 = ‖r0‖;
for j ≥ 1 :
    q̃j+1 := Aqj − βj−1qj−1, αj := 〈q̃j+1,qj〉
    if j > 1 then lj = βj−1/uj−1 and ξj = −ljξj−1,
    uj = αj − ljβj−1,
    pj = (1/uj)(qj − βj−1pj−1),
    xj = xj−1 + ξjpj ,
    q̃j+1 := q̃j+1 − αjqj , βj := ‖q̃j+1‖,
    qj+1 := q̃j+1/βj .
                                (8.16)

This algorithm for computing the solution xk of (8.2) (or (8.3)) has about the same computational costs as the CG algorithm presented in section 7.2.
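In code, the update formulas of (8.16) look as follows — a sketch for symmetric positive definite A, with our own names. Note that we take the starting residual here as r0 = b − Ax0, so that with ξ1 = ‖r0‖ > 0 the iterands move towards the solution of Ax = b:

```python
import numpy as np

def lanczos_solve(A, b, x0, m):
    # Lanczos iterative solution method, cf. (8.16), for SPD A.
    # Sign convention: r0 = b - A x0 (see lead-in above).
    x = x0.astype(float).copy()
    r0 = b - A @ x
    n = len(b)
    q_old, p_old = np.zeros(n), np.zeros(n)
    q = r0 / np.linalg.norm(r0)
    beta_old, l, u = 0.0, 0.0, 1.0      # u is a placeholder until j = 1
    xi = np.linalg.norm(r0)
    for j in range(1, m + 1):
        qt = A @ q - beta_old * q_old   # Lanczos step, as in (8.4)
        alpha = qt @ q
        if j > 1:
            l = beta_old / u            # l_j = beta_{j-1}/u_{j-1}
            xi = -l * xi                # xi_j = -l_j xi_{j-1}
        u = alpha - l * beta_old        # u_j = alpha_j - l_j beta_{j-1}
        p = (q - beta_old * p_old) / u  # p_j, cf. (8.12)
        x = x + xi * p                  # x_j = x_{j-1} + xi_j p_j
        if j == m:
            break
        qt = qt - alpha * q
        beta = np.linalg.norm(qt)
        q_old, q = q, qt / beta
        beta_old, p_old = beta, p
    return x

rng = np.random.default_rng(1)
n = 6
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD model matrix
b = rng.standard_normal(n)
x = lanczos_solve(A, b, np.zeros(n), n)                  # exact after n steps
```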

In the derivation of the Lanczos iterative solution method (8.16) the following ingredients are important:

    An orthogonal basis of the Krylov subspace can be computed with
    low costs, using the Lanczos method (8.4). (8.17)

    As an approximation of the original system, the “projected” much
    smaller system Tkyk = ‖r0‖e1 in (8.10a) is solved. (8.18)

    The computation of the orthogonal basis in (8.17) and the solution
    of the projected system in (8.18) can be implemented in such a way
    that we only need simple update formulas k − 1 → k. (8.19)

In the derivation of the projected system (8.10a) the fact that we have an orthogonal basis plays a crucial role. The approach discussed above is a starting point for the development of methods which can be used in cases where A is not symmetric positive definite. In generalizing this approach to systems in which A is not symmetric positive definite one encounters the following two major difficulties:

    If A is not symmetric positive definite, then an A-inner product does
    not exist, and thus the problem (8.2) does not make sense, (8.20)

and

    if A is not symmetric, then an orthogonal basis of Kk(A; r0)
    cannot be computed with low computational costs. (8.21)

In section 8.3 we consider the case that the matrix A is not positive definite, but still symmetric (i.e. symmetric indefinite). Then we can still use the Lanczos method to compute, in a cheap way, an orthogonal basis of the Krylov subspace. To deal with the problem formulated in (8.20) one can replace the error minimization in the A-norm in (8.2) by a residual minimization in the Euclidean norm, i.e. minimize ‖Ax − b‖ over the space x0 + Kk(A; r0). For every nonsingular matrix A this residual minimization problem has a unique solution. Furthermore, as will be shown in section 8.3, this residual minimization problem can be solved with low computational costs if an orthogonal basis of the Krylov subspace is available. A well-known method for solving symmetric indefinite problems, which is based on using the Lanczos method (8.4) for computing the solution of the residual minimization problem, is the MINRES method.

In sections 8.4 and 8.5 we assume that the matrix A is not even symmetric. Then both the problem formulated in (8.20) and the problem formulated in (8.21) arise. We can deal with the problem in (8.20) as in the MINRES method, i.e. we can use residual minimization in the Euclidean norm instead of error minimization in the A-norm. It will turn out that, just as for the symmetric indefinite case, this residual minimization problem can be solved with low costs if an orthogonal basis of the Krylov subspace is available. However, due to the nonsymmetry (cf. (8.21)), for computing such an orthogonal basis we now have to use a method which is computationally much more expensive than the Lanczos method. An important method which is based on the idea of computing an orthogonal basis of the Krylov subspace and using this basis to solve the residual minimization problem is the GMRES method. We discuss this method in section 8.4.

Another important class of methods for solving nonsymmetric problems is treated in section 8.5. In these methods one does not compute the solution of an error or residual minimization problem (as is done in CG, MINRES, GMRES). Instead one tries to determine xk ∈ x0 + Kk(A; r0) which satisfies an orthogonality condition similar to the one in (8.3). It turns out that using this approach one can avoid the expensive computation of an orthogonal basis of the Krylov subspace. The main example from this class is the Bi-CG method. The Bi-CG method has led to many variants. A few popular variants are considered in section 8.5, too.

8.3 MINRES method

In this section we discuss the MINRES method (“Minimal Residual”), which can be used for problems with A symmetric and (possibly) indefinite. The method is introduced in Paige and Saunders [70]. For symmetric A the Lanczos method in (8.4) can be used to find, with low computational costs, an orthogonal basis q1,q2, . . . ,qk of the Krylov space Kk(A; r0), k = 1, 2, . . . . The recursion in (8.4) can be rewritten as

    Aqj = βj−1qj−1 + αjqj + βjqj+1 , (8.22)


and thus

    AQk = Qk+1Tk , (8.23)

where now Tk ∈ R(k+1)×k is obtained by appending the row βkeTk = (0, . . . , 0, βk) to the square tridiagonal matrix from (8.6).

Note that Tk is a (k + 1) × k matrix. Due to the orthogonality of the basis we have

    QTkQk = Ik , QTk qk+1 = 0 . (8.24)

The MINRES method is based on the following residual minimization problem:

    Given x0 ∈ Rn, determine xk ∈ x0 + Kk(A; r0) such that
    ‖Axk − b‖ = min{ ‖Ax − b‖ | x ∈ x0 + Kk(A; r0) } , (8.25)

where r0 := Ax0 − b. Note that the Euclidean norm is used and that for any regular A this minimization problem has a unique solution xk; this is illustrated in figure 8.2. Clearly, we have a projection: rk = Axk − b is the projection (with respect to 〈·, ·〉) of r0 on A(Kk(A; r0)) = span{Ar0, A²r0, . . . , Akr0}.

[Figure 8.2: Residual minimization — the starting residual Ax0 − b and its projection Axk − b on R = A(Kk(A; r0)).]

Any x ∈ Kk(A; r0) can be represented as x = −Qky with y ∈ Rk and using this we obtain:

    ‖A(x0 + x) − b‖ = ‖AQky − r0‖ = ‖Qk+1Tky − r0‖
                    = ‖Qk+1Tky − Qk+1(‖r0‖e1)‖ = ‖Tky − ‖r0‖e1‖ . (8.26)

So xk as in (8.25) can be obtained from

    ‖Tkyk − ‖r0‖e1‖ = min{ ‖Tky − ‖r0‖e1‖ | y ∈ Rk } , (8.27a)
    xk = x0 − Qkyk . (8.27b)
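Before deriving the recursive implementation, (8.27) can be checked directly with a dense least-squares solver. A naive sketch (it builds its own Lanczos basis inline; all function names are ours):

```python
import numpy as np

def minres_projected(A, b, x0, k):
    # Naive (non-recursive) realization of (8.27): build the Lanczos basis
    # Q_k and the (k+1) x k tridiagonal matrix T_k, solve the small least
    # squares problem min_y ||T_k y - ||r0|| e1||, and set x_k = x0 - Q_k y_k.
    n = len(b)
    r0 = A @ x0 - b
    Q = np.zeros((n, k + 1))
    Q[:, 0] = r0 / np.linalg.norm(r0)
    T = np.zeros((k + 1, k))
    q_old, beta_old = np.zeros(n), 0.0
    for j in range(k):
        qt = A @ Q[:, j] - beta_old * q_old
        alpha = qt @ Q[:, j]
        qt = qt - alpha * Q[:, j]
        beta = np.linalg.norm(qt)
        T[j, j] = alpha
        T[j + 1, j] = beta
        if j > 0:
            T[j - 1, j] = beta_old
        q_old = Q[:, j]
        Q[:, j + 1] = qt / beta
        beta_old = beta
    e1 = np.zeros(k + 1)
    e1[0] = np.linalg.norm(r0)
    y = np.linalg.lstsq(T, e1, rcond=None)[0]
    return x0 - Q[:, :k] @ y

rng = np.random.default_rng(2)
B = rng.standard_normal((10, 10))
A = B + B.T                      # symmetric, in general indefinite
b = rng.standard_normal(10)
xk = minres_projected(A, b, np.zeros(10), 4)
```

The residual of xk coincides with the minimal residual over x0 + Kk(A; r0), which can be verified against a brute-force minimization over the monomial Krylov basis.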

From (8.27) we see that the residual minimization problem in (8.25) leads to a least squares problem with the (k + 1) × k tridiagonal matrix Tk. Due to the structure of this matrix, Givens rotations are very suitable for solving the least squares problem in (8.27). Combination of the Lanczos algorithm (for computing an orthogonal basis) with a least squares solver based on Givens rotations results in the MINRES algorithm. We will now derive this algorithm. First we recall that for (x, y) ≠ (0, 0) a unique orthogonal Givens rotation is given by

    G = ( c s ; −s c )   (semicolons separate matrix rows)

with c² + s² = 1, such that

    G (x, y)T = (w, 0)T with w > 0 .   (8.28)

The least squares problem in (8.27a) is solved using an orthogonal transformation Vk ∈ R(k+1)×(k+1) such that

    VkTk = ( Rk ; 0 ) , Rk ∈ Rk×k upper triangular, (8.29)

where 0 denotes a zero row.

Define b̃k := Vke1 =: (bk, bk,k+1)T , with bk ∈ Rk. Then the solution of the least squares problem is given by yk = ‖r0‖R−1k bk. We show how the matrices Rk and vectors bk, k = 1, 2, . . . , can be computed using short (and thus cheap) recursions. We introduce the notation

    Gj = ( Ij−1 ∅ ∅ ; ∅ cj sj ; ∅ −sj cj ) ∈ R(j+1)×(j+1) with c²j + s²j = 1 .

Given T1 one can compute c1, s1, r1 such that

    G1 (α1, β1)T = (r1, 0)T . (8.30)

Given T2 and G1 one can compute c2, s2, r2 such that

    G2 ( G1(β1, α2)T ; β2 ) = (r2, 0)T , r2 ∈ R² . (8.31)

For k ≥ 3 and for given Tk, Gk−1, Gk−2 one can compute ck, sk, rk such that

    Gk ( Gk−1 ( Gk−2(0, βk−1)T ; αk ) ; βk ) = (rk, 0)T , rk ∈ Rk . (8.32)

Note that rk has at most three nonzero entries:

    rk = (0, . . . , 0, rk,k−2, rk,k−1, rk,k)T . (8.33)

Using these Givens transformations Gj the orthogonal transformations Vj , j ≥ 1, are defined as follows:

    V1 := G1 , Vj := Gj ( Vj−1 ∅ ; ∅ 1 ) , j ≥ 2 .

One easily checks, using induction, that

    VkTk = ( Rk ; 0 ) , Rk := ( r1 r2 . . . rk ) ,

where Rk is upper triangular with columns rj as in (8.30), (8.31), (8.32), each extended by zeros to length k.


For b̃k = Vke1 =: (bk, bk,k+1)T we have the recursion

    b̃1 = G1e1 , b̃j = ( bj−1 ; ( cj sj ; −sj cj )( bj−1,j , 0 )T ) , j ≥ 2 . (8.34)

(Notation: bj−1,j is the j-th entry of b̃j−1.) We now derive a simple recursion for the vector xk in (8.27b). Define the matrix QkR−1k =: Pk = (p1 . . . pk) with columns pj (1 ≤ j ≤ k). From PkRk = Qk and the nonzero structure of the columns of Rk (cf. (8.33)) it follows that

    p1 = q1/r1 , p2 = (q2 − r2,1p1)/r2,2 ,
    pj = (qj − rj,j−2pj−2 − rj,j−1pj−1)/rj,j , j ≥ 3 . (8.35)

Note that using (8.34) we can rewrite (8.27b) as

    xk = x0 − ‖r0‖QkR−1k bk = x0 − ‖r0‖Pkbk
       = x0 − ‖r0‖Pk−1bk−1 − ‖r0‖bk,kpk = xk−1 − ‖r0‖bk,kpk . (8.36)

This leads to the following method:

MINRES algorithm.
Given x0, compute r0 = Ax0 − b. For k = 1, 2, . . . :
    Compute qk, αk, βk using the Lanczos method.
    Compute rk using (8.30), (8.31) or (8.32) (note (8.33)).
    Compute pk using (8.35).
    Compute bk,k using (8.34).
    Compute update: xk = xk−1 − ‖r0‖bk,kpk.

Note that in each iteration of this method we need only one matrix-vector multiplication and a few relatively cheap operations, like scalar products and vector additions.

Remark 8.3.1 If for a given symmetric regular matrix A and given starting residual r0 assumption 8.2.1 does not hold, then there exists a minimal k0 < n such that AKk0(A; r0) = Kk0(A; r0). In the Lanczos method we then obtain (using exact arithmetic) βk0 = 0 and thus the iteration stops for k = k0. It can be shown that xk0 computed in the MINRES algorithm satisfies Axk0 = b and thus we have solved the linear system.

We now derive the preconditioned MINRES algorithm. For this we assume a given symmetric positive definite matrix M. Let L be such that M = LLT . We consider the preconditioned system

    L−1AL−Tz = L−1b , z = LTx .

Note that à := L−1AL−T is symmetric. For given x0 ∈ Rn we have z0 = LTx0 and the starting residual of the preconditioned problem satisfies Ãz0 − L−1b = L−1r0. We apply the Lanczos method to construct an orthogonal basis q1, . . . ,qk of the space Kk(Ã; L−1r0). We want to avoid computations with the matrices L and LT . This can be achieved if we reformulate the algorithm using the transformations

    t̃j := Lq̃j , tj := Lqj , wj := L−Tqj = M−1tj .


Using these definitions we obtain an equivalent formulation of the algorithm (8.4) applied to à with r = L−1r0, which is called the preconditioned Lanczos method:

t0 := 0; w0 := M−1r0; ‖r‖ := 〈w0, r0〉^(1/2);
t1 := r0/‖r‖; w1 := w0/‖r‖; β0 := 0;
for j ≥ 1 :
    t̃j+1 := Awj − βj−1tj−1,
    αj := 〈t̃j+1,wj〉,
    t̃j+1 := t̃j+1 − αjtj ,
    w̃j+1 := M−1t̃j+1; βj := 〈w̃j+1, t̃j+1〉^(1/2),
    tj+1 := t̃j+1/βj ,
    wj+1 := w̃j+1/βj .
                                (8.37)

Note that for M = I we obtain the algorithm (8.4) and that in each iteration a system with the matrix M must be solved. As a consequence of theorem 8.2.2 we get:

Theorem 8.3.2 The set w1,w2, ...,wk defined in algorithm (8.37) is orthogonal with respect to〈·, ·〉M and forms a basis of the Krylov subspace Kk(M−1A;M−1r0) (k ≤ n).

Proof. From theorem 8.2.2 and the definition of wj it follows that (LTwj)1≤j≤k forms an orthogonal basis of Kk(L−1AL−T ; L−1r0) with respect to the Euclidean scalar product. Note that 〈LTwj , LTwi〉 = 0 iff 〈wj ,wi〉M = 0, and LTwj ∈ Kk(L−1AL−T ; L−1r0) iff wj ∈ Kk(M−1A; M−1r0).
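A transcription of (8.37) in code — a sketch in which the systems with M are solved densely and all names are ours — lets one verify the M-orthogonality stated in theorem 8.3.2:

```python
import numpy as np

def prec_lanczos(A, M, r0, m):
    # Preconditioned Lanczos method (8.37); returns W = [w1 ... wm].
    # The w_j are orthonormal w.r.t. the M-inner product <x, y>_M.
    n = len(r0)
    t_old = np.zeros(n)                  # t0 := 0
    w = np.linalg.solve(M, r0)           # w0 := M^{-1} r0
    nr = np.sqrt(w @ r0)                 # ||r|| = <w0, r0>^{1/2}
    t = r0 / nr                          # t1
    w = w / nr                           # w1
    beta_old = 0.0
    W = np.zeros((n, m))
    for j in range(m):
        W[:, j] = w
        tt = A @ w - beta_old * t_old    # \tilde t_{j+1}
        alpha = tt @ w
        tt = tt - alpha * t
        wt = np.linalg.solve(M, tt)      # \tilde w_{j+1} = M^{-1} \tilde t_{j+1}
        beta = np.sqrt(wt @ tt)          # beta_j = <w~, t~>^{1/2}
        t_old, t = t, tt / beta
        w = wt / beta
        beta_old = beta
    return W

rng = np.random.default_rng(3)
B = rng.standard_normal((8, 8))
A = B + B.T                              # symmetric
M = np.diag(np.arange(1.0, 9.0))         # SPD preconditioner (illustrative)
W = prec_lanczos(A, M, rng.standard_normal(8), 4)
```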

Define Wj := ( w1 w2 . . . wj ) ∈ Rn×j. From theorem 8.3.2 it follows that WTk MWk = Ik. From (8.23) we obtain, using LTWk = Qk, that L−1AL−TLTWk = LTWk+1Tk holds and thus

    M−1AWk = Wk+1Tk . (8.38)

The matrix M−1A is symmetric with respect to 〈·, ·〉M . Instead of (8.25) we consider the following minimization problem:

    Given x0 ∈ Rn, compute xk ∈ x0 + Kk(M−1A; M−1r0) such that
    ‖M−1Axk − M−1b‖M = min{ ‖M−1Ax − M−1b‖M | x ∈ x0 + Kk(M−1A; M−1r0) } , (8.39)

with r0 = Ax0 − b. Using arguments as in (8.26) it follows that the solution of the minimization problem (8.39) can be obtained from

    ‖Tkyk − ‖r‖e1‖ = min{ ‖Tky − ‖r‖e1‖ | y ∈ Rk } , (8.40a)
    xk = x0 − Wkyk , (8.40b)

with ‖r‖ = 〈M−1r0, r0〉^(1/2). This problem can be solved using Givens rotations along the same lines as for the unpreconditioned case. Thus we get the following:


Preconditioned MINRES algorithm.
Given x0, compute r0 = Ax0 − b, w0 = M−1r0, ‖r‖ = 〈w0, r0〉^(1/2).
For k = 1, 2, . . . :
    Compute wk, αk, βk using the preconditioned Lanczos method (8.37).
    Compute rk using (8.30), (8.31) or (8.32) (note (8.33)).
    Compute pk using (8.35) with qk replaced by wk.
    Compute bk,k using (8.34).
    Compute update: xk = xk−1 − ‖r‖bk,kpk.

The minimization property (8.39) yields a convergence result for the preconditioned MINRES method:

Theorem 8.3.3 Let A ∈ Rn×n be symmetric and M ∈ Rn×n symmetric positive definite. For xk, k ≥ 0, computed in the preconditioned MINRES algorithm we define rk = M−1(Axk − b). The following holds:

    ‖rk‖M = min_{pk∈Pk, pk(0)=1} ‖pk(M−1A)r0‖M
          ≤ ( min_{pk∈Pk, pk(0)=1} max_{λ∈σ(M−1A)} |pk(λ)| ) ‖r0‖M . (8.41)

Proof. The equality result follows from

    ‖rk‖M = min_{pk−1∈Pk−1} ‖M−1b − M−1A( x0 + pk−1(M−1A)r0 )‖M
          = min_{pk−1∈Pk−1} ‖r0 − M−1Apk−1(M−1A)r0‖M
          = min_{pk∈Pk, pk(0)=1} ‖pk(M−1A)r0‖M .

Note that M−1A is symmetric with respect to 〈·, ·〉M , and thus

    ‖pk(M−1A)r0‖M ≤ ‖pk(M−1A)‖M ‖r0‖M = max_{λ∈σ(M−1A)} |pk(λ)| ‖r0‖M

holds.

From this result it follows that bounds on the reduction of the (preconditioned) residual can be obtained if one assumes information on the spectrum of M−1A. We present two results that are well known in the literature. Proofs, which are based on approximation properties of Chebyshev polynomials, are given in, for example, [42].

Theorem 8.3.4 Let A, M and rk be as in theorem 8.3.3. Assume that all eigenvalues of M−1A are positive. Then

‖rk‖M / ‖r0‖M ≤ 2 ( 1 − 2/(√κ(M−1A) + 1) )^k , k = 0, 1, . . .

holds.

We note that in this bound the dependence on the condition number κ(M−1A) is the same as in well-known bounds for the preconditioned CG method.


Theorem 8.3.5 Let A, M and rk be as in theorem 8.3.3. Assume that σ(M−1A) ⊂ [a, b] ∪ [c, d] with a < b < 0 < c < d and b − a = d − c. Then

‖rk‖M / ‖r0‖M ≤ 2 ( 1 − 2/(√(ad/(bc)) + 1) )^{[k/2]} , k = 0, 1, . . . (8.42)

holds.

In the special case a = −d, b = −c the reduction factor in (8.42) takes the form 1 − 2/(κ(M−1A) + 1). Note that here the dependence on κ(M−1A) is different from the positive definite case in theorem 8.3.4.
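The different dependence on κ(M−1A) in the two theorems can be made concrete numerically. The sketch below (pure Python; the value κ = 100 and the iteration counts are illustrative choices, not from the text) evaluates the bound of theorem 8.3.4 and the special-case form of (8.42):

```python
import math

def bound_definite(kappa, k):
    """Bound of theorem 8.3.4 (all eigenvalues of M^{-1}A positive)."""
    return 2.0 * (1.0 - 2.0 / (math.sqrt(kappa) + 1.0)) ** k

def bound_indefinite(kappa, k):
    """Bound (8.42) in the special case a = -d, b = -c:
    reduction factor 1 - 2/(kappa + 1) per two iterations."""
    return 2.0 * (1.0 - 2.0 / (kappa + 1.0)) ** (k // 2)

kappa = 100.0
for k in (20, 40, 80):
    print(k, bound_definite(kappa, k), bound_indefinite(kappa, k))
```

For κ = 100 the definite bound is already small after 20 iterations, while the indefinite bound decays much more slowly; this is precisely the √κ versus κ dependence noted above.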

8.4 GMRES type of methods

In this (and the following) section we do not assume that A is symmetric. We only assume that A is regular. In GMRES ("Generalized Minimal Residual") type of methods one first computes an orthogonal basis of the Krylov subspace and then, using this basis, one determines the xk satisfying the minimal residual criterion in (8.25). It can be shown (cf. Faber and Manteuffel [33]) that only for a very small class of nonsymmetric matrices it is possible to compute this xk with "low" computational costs. This is related to the fact that in general for a nonsymmetric matrix we do not have a method for computing an orthogonal basis of the Krylov subspace with low computational costs (cf. the Lanczos algorithm for the symmetric case).

In GMRES the so-called Arnoldi algorithm, introduced in Arnoldi [5], is used for computing an orthogonal basis of the Krylov subspace:

q1 := r0/‖r0‖;
for j ≥ 1 :
    qj+1 := Aqj ,
    for i = 1, . . . , j :
        hij := 〈qj+1, qi〉
        qj+1 := qj+1 − hij qi
    hj+1,j := ‖qj+1‖
    qj+1 := qj+1/hj+1,j .
(8.43)

When we put the coefficients hij (1 ≤ i ≤ j + 1 ≤ k + 1) in a matrix denoted by Hk we obtain:

Hk =
⎛ h11  h12  · · ·  · · ·  h1k    ⎞
⎜ h21  h22  h23   · · ·  h2k    ⎟
⎜      h32  h33   · · ·  h3k    ⎟
⎜           ⋱     ⋱      ⋮      ⎟
⎜  ∅            hk,k−1   hk,k   ⎟
⎝                        hk+1,k ⎠
(8.44)

This is a (k + 1) × k matrix of upper Hessenberg form. We also use the notation

Qj := [q1 q2 . . . qj ] (the n × j matrix with columns qi).


Using this notation, the Arnoldi algorithm results in

AQk = Qk+1Hk . (8.45)

The result in (8.45) is similar to the result in (8.23). However, note that the matrix Hk in (8.45) contains significantly more nonzero elements than the tridiagonal matrix Tk in (8.23).
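The Arnoldi recursion (8.43) and the identity (8.45) are easy to check numerically. The following self-contained sketch (pure Python; the small nonsymmetric test matrix is an illustrative choice, not from the text) builds the basis and the Hessenberg matrix and verifies AQk = Qk+1Hk column by column:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return dot(v, v) ** 0.5

def matvec(A, v):
    return [dot(row, v) for row in A]

def arnoldi(A, r0, k):
    """Arnoldi method (8.43): orthonormal basis Q of K_k(A; r0) plus the
    (k+1) x k upper Hessenberg matrix H with A Q_k = Q_{k+1} H_k."""
    Q = [[x / norm(r0) for x in r0]]
    H = [[0.0] * k for _ in range(k + 1)]
    for j in range(k):
        w = matvec(A, Q[j])
        for i in range(j + 1):          # orthogonalize against q_1, ..., q_j
            H[i][j] = dot(w, Q[i])
            w = [wi - H[i][j] * qi for wi, qi in zip(w, Q[i])]
        H[j + 1][j] = norm(w)
        Q.append([wi / H[j + 1][j] for wi in w])
    return Q, H

A = [[2.0, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 0.0, 2.0]]   # nonsymmetric
r0 = [1.0, 0.0, 0.0]
Q, H = arnoldi(A, r0, 2)
for j in range(2):                      # check (8.45) columnwise
    lhs = matvec(A, Q[j])
    rhs = [sum(H[i][j] * Q[i][l] for i in range(3)) for l in range(3)]
    assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```

Note that the inner loop over i = 1, . . . , j is exactly the "long" recursion that makes the algorithm expensive for large k.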

Using induction it can be shown that q1, q2, . . . , qk form an orthogonal basis of the Krylov subspace Kk(A; r0). As in the derivation of (8.27a), (8.27b) for the MINRES method, using the fact that we have an orthogonal basis, we obtain that the xk satisfying the minimal residual criterion (8.25) can be characterized by the least squares problem:

‖Hkyk − ‖r0‖e1‖ = min{ ‖Hky − ‖r0‖e1‖ | y ∈ Rk } (8.46)

xk = x0 − Qkyk . (8.47)

The GMRES algorithm has the following structure:

1. Start: choose x0; r0 := Ax0 − b; q1 := r0/‖r0‖.
2. Arnoldi method (8.43) for the computation of an orthogonal basis q1, q2, . . . , qk of Kk(A; r0).
3. Solve a least squares problem: determine yk such that
       ‖Hkyk − ‖r0‖e1‖ = min{ ‖Hky − ‖r0‖e1‖ | y ∈ Rk };
   set xk := x0 − Qkyk.
(8.48)

The GMRES method is introduced in Saad and Schultz [80]. For a detailed discussion of implementation aspects of the GMRES method we refer to that paper. In [80] it is shown that, using similar techniques as in the derivation of the MINRES method, the least squares problem in step 3 of (8.48) can be solved with low computational costs. However, step 2 of (8.48) is expensive, both with respect to memory and arithmetic work. This is due to the fact that in the kth iteration we need computations involving q1, q2, . . . , qk−1 to determine qk. To avoid computations involving all the previous basis vectors, the GMRES method with restart is often used in practice. In GMRES(m) we apply m iterations of the GMRES method as in (8.48), then we define x0 := xm and again apply m iterations of the GMRES method with this new starting vector, etc. Note that for k > m the iterands xk do not fulfill the minimal residual criterion (8.25). In Saad and Schultz [80] it is shown that (in exact arithmetic) the GMRES method cannot break down and that (as in CG) the exact solution is obtained in at most n iterations. The minimal residual criterion implies that in GMRES the residual is reduced in every iteration. These nice properties of GMRES do not hold for the GMRES(m) algorithm; a well-known difficulty with the GMRES(m) method is that it can stagnate.
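The restart logic of GMRES(m) can be sketched compactly. In the toy implementation below (pure Python; the small symmetric positive definite test system is illustrative, and the least squares problem is solved via normal equations instead of the Givens rotations used in practice), `gmres_cycle` performs steps 1–3 of (8.48) and `gmres_restarted` restarts with x0 := xm:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return dot(v, v) ** 0.5

def matvec(A, v):
    return [dot(row, v) for row in A]

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting (tiny systems only)."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def gmres_cycle(A, b, x0, m):
    """Steps 1-3 of (8.48): Arnoldi basis, small least squares, update."""
    n = len(b)
    r0 = [bi - ai for bi, ai in zip(b, matvec(A, x0))]
    beta = norm(r0)
    Q = [[ri / beta for ri in r0]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = matvec(A, Q[j])
        for i in range(j + 1):
            H[i][j] = dot(w, Q[i])
            w = [wi - H[i][j] * qi for wi, qi in zip(w, Q[i])]
        H[j + 1][j] = norm(w)
        if H[j + 1][j] < 1e-13:           # happy breakdown: K_j is invariant
            m = j + 1
            H = [row[:m] for row in H[:m + 1]]
            break
        Q.append([wi / H[j + 1][j] for wi in w])
    # min || H y - beta e1 || via normal equations H^T H y = beta H^T e1
    HtH = [[sum(H[l][i] * H[l][j] for l in range(len(H))) for j in range(m)]
           for i in range(m)]
    y = solve_dense(HtH, [beta * H[0][i] for i in range(m)])
    return [x0[l] + sum(y[j] * Q[j][l] for j in range(m)) for l in range(n)]

def gmres_restarted(A, b, x0, m, max_cycles, tol=1e-10):
    """GMRES(m): after m steps, restart with x0 := xm."""
    x = x0
    for _ in range(max_cycles):
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        if norm(r) < tol:
            break
        x = gmres_cycle(A, b, x, m)
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = gmres_restarted(A, b, [0.0, 0.0, 0.0], 2, 30)
```

The point of the restart is visible in the storage: `gmres_cycle` keeps at most m + 1 basis vectors, independently of the total number of iterations.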

Example 8.4.1 (Convection-diffusion problem) We apply the GMRES(m) method to the discrete convection-diffusion problem of section 6.6 with b1 = cos(π/6), b2 = sin(π/6) and for several values of ε, m, h. For the starting vector we take x0 = 0. In Table 8.2 we show the number of iterations needed to reduce the Euclidean norm of the starting residual with a factor 10^3.

From these results we see that for this model problem the number of iterations increases significantly when h is decreased. Also we observe a certain robustness with respect to variation in ε. Based on the results in Table 8.2 we obtain that for m "small" (i.e. 1-20) the GMRES(m) method is more efficient than for m "large" (i.e. ≫ 20).


                        m = 10   m = 20   m = 40   m = 80
ε = 10−1, h = 1/32        97       72       61       56
ε = 10−2, h = 1/32        68       75       80       59
ε = 10−4, h = 1/32        61       59       60       59
ε = 10−1, h = 1/64       270      191      146      147
ε = 10−2, h = 1/64       127      134      150      160
ε = 10−4, h = 1/64       119      114      114      114

Table 8.2: # iterations for GMRES(m).

There are other methods which are of GMRES type, in the sense that these methods (in exact arithmetic) yield iterands defined by the minimal residual criterion (8.25). These methods differ in the approach that is used for computing the minimal residual iterand. Examples of GMRES type of methods are the Generalized Conjugate Residual method (GCR) and Orthodir. These variants of GMRES seem to be less popular because for many problems they are at least as expensive as GMRES and numerically less stable. For a further discussion and comparison of GMRES type methods we refer to Saad and Schultz [79], Barrett et al. [10] and Freund et al. [36].

8.5 Bi-CG type of methods

The GMRES method is expensive (both with respect to memory and arithmetic work) due to the fact that for the computation of an orthogonal basis of the Krylov subspace we need "long" recursions (cf. Arnoldi method (8.43)). In this respect there is an essential difference with the symmetric case, because then we can use "short" recursions for the computation of an orthogonal basis (cf. (8.4)). Also note that the implementation of GMRES (using Givens rotations to solve the least squares problem) is rather complicated, compared to the implementation of the CG method.

The Bi-CG method which we discuss below is based on a generalized Lanczos method that is used for computing a "reasonable" basis of the Krylov subspace. This generalized Lanczos method uses "short" recursions (as in the Lanczos method), but the resulting basis will in general not be orthogonal. The implementation of the Bi-CG method is as simple as the implementation of the CG method.

The Bi-CG method is based on the bi-Lanczos (also called nonsymmetric Lanczos) method:

v0 := ṽ0 := 0; v1 := ṽ1 := r0/‖r0‖2; β0 = γ0 = 0;
For j ≥ 1 :
    αj := 〈Avj , ṽj〉
    wj+1 := Avj − αj vj − βj−1 vj−1
    w̃j+1 := AT ṽj − αj ṽj − γj−1 ṽj−1
    γj := ‖wj+1‖ , vj+1 := wj+1/γj ,
    βj := 〈vj+1, w̃j+1〉 , ṽj+1 := w̃j+1/βj .
(8.49)


If A = AT holds, then the two recursions in (8.49) are the same and the bi-Lanczos method reduces to the Lanczos method in (8.4). In the bi-Lanczos method it can happen that 〈vj+1, w̃j+1〉 = 0, even if vj+1 ≠ 0 and w̃j+1 ≠ 0. In that case the algorithm is not executable anymore; this is called a (serious) "breakdown". Using induction we obtain that for the two sequences of vectors generated by the bi-Lanczos method the following properties hold:

span{v1, v2, . . . , vj} = Kj(A; r0) (8.50)

span{ṽ1, ṽ2, . . . , ṽj} = Kj(AT ; r0) (8.51)

〈vi, ṽj〉 = 0 if i ≠ j , 〈vi, ṽi〉 = 1 . (8.52)

Based on (8.52) we call vi and ṽj (i ≠ j) bi-orthogonal. In general the vj (j = 1, 2, . . .) will not be orthogonal. Using the notation

Vj := [v1 . . . vj ] , Ṽj := [ṽ1 . . . ṽj ]

we obtain the identities

AVk = Vk ⎛ α1  β1         ∅   ⎞
         ⎜ γ1  α2   ⋱         ⎟
         ⎜     ⋱    ⋱    βk−1 ⎟
         ⎝ ∅      γk−1   αk   ⎠  + γk vk+1 (0 . . . 0 1)

     =: Vk Tk + γk vk+1 eTk (8.53)

and

ṼTk Vk = Ik , ṼTk vk+1 = 0 . (8.54)
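The short recursions (8.49) and the bi-orthogonality (8.52) can be verified directly. A minimal pure-Python sketch (the small nonsymmetric test matrix is an illustrative choice; a serious breakdown, βj = 0, is not handled):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return dot(v, v) ** 0.5

def matvec(A, v):
    return [dot(row, v) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def bi_lanczos(A, r0, k):
    """Bi-Lanczos (8.49): V spans K_k(A; r0), W spans K_k(A^T; r0),
    and <v_i, w_j> = delta_ij, cf. (8.52)."""
    At = transpose(A)
    s = norm(r0)
    V = [[x / s for x in r0]]           # v_1
    W = [[x / s for x in r0]]           # tilde v_1 (same start vector)
    beta_prev = gamma_prev = 0.0
    for j in range(k - 1):
        alpha = dot(matvec(A, V[j]), W[j])
        w = [a - alpha * v for a, v in zip(matvec(A, V[j]), V[j])]
        wt = [a - alpha * v for a, v in zip(matvec(At, W[j]), W[j])]
        if j > 0:
            w = [a - beta_prev * v for a, v in zip(w, V[j - 1])]
            wt = [a - gamma_prev * v for a, v in zip(wt, W[j - 1])]
        gamma = norm(w)
        vnew = [x / gamma for x in w]
        beta = dot(vnew, wt)            # zero here = serious breakdown
        V.append(vnew)
        W.append([x / beta for x in wt])
        beta_prev, gamma_prev = beta, gamma
    return V, W

A = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]]
V, W = bi_lanczos(A, [1.0, 1.0, 1.0], 3)
for i in range(3):
    for j in range(3):
        assert abs(dot(V[i], W[j]) - (1.0 if i == j else 0.0)) < 1e-10
```

Only the vectors of the two previous steps are needed per iteration; this is the "short recursion" advantage over Arnoldi, paid for by the loss of orthogonality of the vj.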

In Bi-CG we do not use a minimal residual criterion as in (8.25) but the following criterion based on an orthogonality condition:

Determine xk ∈ x0 + Kk(A; r0) such that Axk − b ⊥ Kk(AT ; r0) . (8.55)

The existence of an xk satisfying the criterion in (8.55) is not guaranteed! If the criterion (8.55) cannot be fulfilled, the Bi-CG algorithm in (8.58) below will break down. For the case that A is symmetric positive definite, the criteria in (8.55) and in (8.3) are equivalent and (in exact arithmetic) the Bi-CG algorithm will yield the same iterands as the CG algorithm.

Using (8.50)-(8.52) we see that the Bi-CG iterand xk, characterized in (8.55), satisfies

ṼTk (AVkyk − r0) = 0 , xk = x0 + Vkyk (yk ∈ Rk).

Due to the relations in (8.53), (8.54) this yields the following characterization:

Tkyk = ‖r0‖2e1 (8.56)

xk = x0 + Vkyk . (8.57)

Note that this is very similar to the characterization of the CG iterand in (8.10a), (8.10b). However, in (8.56) the tridiagonal matrix Tk need not be symmetric positive definite and Vk in general will not be orthogonal. Using an LU-decomposition of the tridiagonal matrix Tk we can


compute yk, provided Tk is nonsingular, and then determine xk. An efficient implementation of this approach can be derived along the same lines as for the Lanczos iterative method in section 8.2. This then results in the Bi-CG algorithm, introduced in Lanczos [59] (cf. also Fletcher [35]):

starting vector x0; p0 = p̃0 = r0 = r̃0 = b − Ax0; ρ0 := ‖r0‖^2
For k ≥ 0 :
    σk := 〈Apk, p̃k〉; αk := ρk/σk;
    xk+1 := xk + αk pk
    rk+1 := rk − αk Apk
    r̃k+1 := r̃k − αk AT p̃k
    ρk+1 := 〈rk+1, r̃k+1〉; βk+1 := ρk+1/ρk;
    pk+1 := rk+1 + βk+1 pk
    p̃k+1 := r̃k+1 + βk+1 p̃k
(8.58)

Note: here and in the remainder of this section the residual is defined by rk = b − Axk (instead of Axk − b).

The Bi-CG algorithm is simple and has low computational costs per iteration (compared to GMRES type methods). A disadvantage is that a breakdown can occur (ρk = 0 or σk = 0). A "near breakdown" will result in numerical instabilities. To avoid these (near) breakdowns, variants of Bi-CG have been developed that use so-called look-ahead Lanczos algorithms for computing a basis of the Krylov subspace. Also the criterion in (8.55) can be replaced by another criterion to avoid a breakdown caused by the fact that the Bi-CG iterand as in (8.55) does not exist. The combination of a look-ahead Lanczos approach and a criterion based on minimization of a "quasi-residual" is the basis of the QMR ("Quasi Minimal Residual") method. For a discussion of the look-ahead Lanczos approach and QMR we refer to Freund et al. [36].
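The algorithm (8.58) translates almost line by line into code. A minimal pure-Python sketch (the symmetric positive definite test system and the stopping tolerance are illustrative additions; breakdown ρk = 0 or σk = 0 is not handled):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return dot(v, v) ** 0.5

def matvec(A, v):
    return [dot(row, v) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def bicg(A, b, x0, tol=1e-10, maxit=50):
    """Bi-CG (8.58): one product with A and one with A^T per iteration."""
    At = transpose(A)
    x = x0[:]
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    rt, p, pt = r[:], r[:], r[:]
    rho = dot(r, rt)                      # rho_0 = ||r_0||^2
    for _ in range(maxit):
        if norm(r) < tol:
            break
        Ap = matvec(A, p)
        alpha = rho / dot(Ap, pt)         # sigma_k = <A p_k, tilde p_k>
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        Atpt = matvec(At, pt)
        rt = [ri - alpha * ai for ri, ai in zip(rt, Atpt)]
        rho_new = dot(r, rt)
        beta = rho_new / rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        pt = [ri + beta * pi for ri, pi in zip(rt, pt)]
        rho = rho_new
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = bicg(A, b, [0.0, 0.0, 0.0])
```

For a symmetric positive definite A, as in this toy test, Bi-CG reproduces the CG iterands (cf. the remark below (8.55)), so the residual drops to machine accuracy within n steps.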

For the Bi-CG method there are only very few theoretical convergence results. A variant of Bi-CG is analyzed in Bank and Chan [8]. A disadvantage of the Bi-CG method is that we need a multiplication by AT, which is often not easily available. Below we discuss variants of the Bi-CG method which only use multiplications with the matrix A (two per iteration). For many problems these methods have a higher rate of convergence than the Bi-CG method.

We introduce the BiCGSTAB method (from Van der Vorst [91]) and the CGS ("Conjugate Gradients Squared") method (from Sonneveld [86]). These methods are derived from the Bi-CG method. We assume that the Bi-CG method does not break down.

We first reformulate the Bi-CG method using a notation based on matrix polynomials. With Tk, Pk ∈ Pk defined by

T0(x) = 1, P0(x) = 1 ,

Pk(x) = Pk−1(x) − αk−1xTk−1(x) , k ≥ 1,

Tk(x) = Pk(x) + βkTk−1(x) , k ≥ 1,

with αk, βk as in (8.58), we have for the search directions pk and the residuals rk resulting from the Bi-CG method:

rk = Pk(A)r0 , (8.59)

pk = Tk(A)r0. (8.60)

190

Page 191: Numerical methods for elliptic partial differential ... · Introduction to elliptic boundary value problems In this chapter we introduce the classical formulation of scalar elliptic

Results as in (8.59), (8.60) also hold for r̃k and p̃k, with A replaced by AT and r0 replaced by r̃0. For the sequences of residuals and search directions generated by the Bi-CG method, we define related transformed sequences:

r̂k := Qk(A)rk , (8.61)

p̂k := Qk(A)pk , (8.62)

with Qk ∈ Pk .

Note that to a given transformed residual r̂k there corresponds an iterand

x̂k := A−1(b − r̂k). (8.63)

In the BiCGSTAB method and the CGS method we compute the iterands x̂k corresponding to a "suitable" polynomial Qk. These polynomials are chosen in such a way that the x̂k can be computed with simple (i.e. short) recursions involving r̂k, p̂k and A. The costs per iteration of these algorithms will be roughly the same as the costs per iteration in Bi-CG. An important advantage is that we do not need AT. Clearly, from an efficiency point of view it is favourable to have a polynomial Qk such that ‖r̂k‖ = ‖Qk(A)rk‖ ≪ ‖rk‖ holds. For obtaining a hybrid Bi-CG method one can try to find a polynomial Qk such that for the corresponding transformed quantities we have low costs per iteration (short recursions) and a (much) smaller transformed residual. The first example of such a polynomial is due to Sonneveld [86]. He proposes:

Qk(x) = Pk(x), (8.64)

with Pk the Bi-CG polynomial. The iterands x̂k corresponding to this Qk are computed in the CGS method (cf. (8.75)). Another choice is proposed in Van der Vorst [91]:

Qk(x) = (1 − ωk−1x)(1 − ωk−2x) · · · (1 − ω0x). (8.65)

The choice of the parameters ωj is discussed below (cf. (8.73)). The iterands x̂k corresponding to this Qk are computed in the BiCGSTAB algorithm.

We now show how the BiCGSTAB algorithm can be derived. First note that for the BiCGSTAB polynomial we have

Qk+1(A) = (I − ωkA)Qk(A).

From the Bi-CG algorithm and the definition of p̂k we obtain

p̂k+1 = Qk+1(A)pk+1 = Qk+1(A)rk+1 + βk+1(I − ωkA)Qk(A)pk
      = r̂k+1 + βk+1(p̂k − ωkAp̂k). (8.66)

Similarly, for the transformed residuals we obtain the recursion

r̂k+1 = (r̂k − αkAp̂k) − ωkA(r̂k − αkAp̂k). (8.67)

For the iterands related to these transformed residuals we have

x̂k+1 − x̂k = A−1(r̂k − r̂k+1) = αk p̂k + ωk(r̂k − αkAp̂k),

and thus we have the recursion

x̂k+1 = x̂k + αk p̂k + ωk(r̂k − αkAp̂k). (8.68)


Note that in (8.66), (8.67) and (8.68) we have simple recursions in which the scalars αk and βk defined in the Bi-CG algorithm are used. We now show that for these scalars one can derive other, more feasible, formulas. We consider

ρk = 〈rk, r̃k〉 .

The coefficient for the highest order term of the Bi-CG polynomial Pk is equal to (−1)^k α0α1 · · · αk−1. So we have

r̃k = Pk(AT )r̃0 = (−1)^k α0α1 · · · αk−1 (AT )^k r̃0 + wk ,

with wk ∈ Kk(AT ; r̃0). Using this in the definition of ρk and the orthogonality condition for rk in (8.55) we obtain the relation

ρk = (−1)^k α0α1 · · · αk−1 〈(AT )^k r̃0, rk〉. (8.69)

We now define the quantity

ρ̂k := 〈r̃0, r̂k〉

in which we use the transformed residual r̂k. The coefficient for the highest order term of the BiCGSTAB polynomial Qk is equal to (−1)^k ω0ω1 · · · ωk−1. So we have

Qk(AT )r̃0 = (−1)^k ω0ω1 · · · ωk−1 (AT )^k r̃0 + wk ,

with wk ∈ Kk(AT ; r̃0). Using this in the definition of ρ̂k and the orthogonality condition for rk in (8.55) we obtain the relation

ρ̂k = 〈r̃0, Qk(A)rk〉 = 〈Qk(AT )r̃0, rk〉 = (−1)^k ω0ω1 · · · ωk−1 〈(AT )^k r̃0, rk〉. (8.70)

The results in (8.69), (8.70) yield the following formula for βk+1:

βk+1 = ρk+1/ρk = −αk 〈(AT )^{k+1} r̃0, rk+1〉 / 〈(AT )^k r̃0, rk〉 = (ρ̂k+1/ρ̂k)(αk/ωk). (8.71)

Similarly, for the scalar αk defined in the Bi-CG algorithm, the formula

αk = ρ̂k / 〈Ap̂k, r̃0〉 (8.72)

can be derived. We finally discuss the choice of the parameters ωj in the BiCGSTAB polynomial. We use the notation

r̂k+1/2 := r̂k − αk Ap̂k .

The recursion for the transformed residuals can be rewritten as

r̂k+1 = r̂k+1/2 − ωk A r̂k+1/2 .

The ωk is now defined by a standard line search:

‖r̂k+1/2 − ωk A r̂k+1/2‖ = min_ω ‖r̂k+1/2 − ω A r̂k+1/2‖ .


This results in

ωk = 〈A r̂k+1/2, r̂k+1/2〉 / 〈A r̂k+1/2, A r̂k+1/2〉 . (8.73)

Using the recursions in (8.66), (8.67), (8.68), the formulas for the scalars in (8.71), (8.72) and the choice for ωk as in (8.73), we obtain the following Bi-CGSTAB algorithm, where for ease of notation we dropped the "ˆ" notation for the transformed variables.

starting vector x0; r0 = b − Ax0; choose r̃0 (e.g. = r0);
p−1 = c−1 = 0; α−1 = 0, ω−1 = ρ−1 = 1;
for k ≥ 0 :
    ρk = 〈rk, r̃0〉, βk = (αk−1/ωk−1)(ρk/ρk−1),
    pk = βk pk−1 + rk − βk ωk−1 ck−1,
    ck = Apk,
    γk = 〈ck, r̃0〉, αk = ρk/γk,
    rk+1/2 = rk − αk ck , ck+1/2 = A rk+1/2 ,
    ωk = 〈ck+1/2, rk+1/2〉 / 〈ck+1/2, ck+1/2〉,
    xk+1 = xk + αk pk + ωk rk+1/2 , rk+1 = rk+1/2 − ωk ck+1/2 .
(8.74)

This Bi-CGSTAB method is introduced in Van der Vorst [91] as a variant of Bi-CG and of the CGS method. Variants of the Bi-CGSTAB method, denoted by Bi-CGSTAB(ℓ), are discussed in Sleijpen and Fokkema [84].
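Algorithm (8.74) in executable form: a minimal pure-Python sketch using the same quantities (ck = Apk and the half-step residual rk+1/2). The test system and the tolerance check are illustrative additions, not from the text:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return dot(v, v) ** 0.5

def matvec(A, v):
    return [dot(row, v) for row in A]

def bicgstab(A, b, x0, tol=1e-10, maxit=50):
    """Bi-CGSTAB (8.74): two products with A per iteration, no A^T."""
    n = len(b)
    x = x0[:]
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    rt0 = r[:]                            # shadow residual, e.g. tilde r_0 = r_0
    p, c = [0.0] * n, [0.0] * n
    alpha, omega, rho = 0.0, 1.0, 1.0     # alpha_{-1}, omega_{-1}, rho_{-1}
    for _ in range(maxit):
        if norm(r) < tol:
            break
        rho_new = dot(r, rt0)
        beta = (alpha / omega) * (rho_new / rho)
        p = [beta * pi + ri - beta * omega * ci for pi, ri, ci in zip(p, r, c)]
        c = matvec(A, p)
        alpha = rho_new / dot(c, rt0)     # gamma_k = <c_k, tilde r_0>
        r_half = [ri - alpha * ci for ri, ci in zip(r, c)]
        c_half = matvec(A, r_half)
        denom = dot(c_half, c_half)
        if denom == 0.0:                  # half-step residual already zero
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            break
        omega = dot(c_half, r_half) / denom
        x = [xi + alpha * pi + omega * ri
             for xi, pi, ri in zip(x, p, r_half)]
        r = [ri - omega * ci for ri, ci in zip(r_half, c_half)]
        rho = rho_new
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = bicgstab(A, b, [0.0, 0.0, 0.0])
```

Note how the line search weight ωk is computed exactly as in (8.73), with r̂k+1/2 and A r̂k+1/2 already available from the recursion.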

In the CGS method we take Qk(x) = Pk(x), with Pk the Bi-CG polynomial. This results in transformed residuals satisfying the relation

r̂k = (Pk(A))^2 r0 .

This explains the name Conjugate Gradients Squared. Along the same lines as above for the Bi-CGSTAB method one can derive the following CGS algorithm from the Bi-CG algorithm (cf. Sonneveld [86]):

starting vector x0; r0 := b − Ax0; q0 := q̄−1 := 0; ρ−1 := 1;
For k ≥ 0 :
    ρk := 〈r0, rk〉; βk := ρk/ρk−1;
    wk := rk + βk qk
    q̄k := wk + βk(qk + βk q̄k−1)
    vk := A q̄k
    σk := 〈r0, vk〉; αk := ρk/σk;
    qk+1 := wk − αk vk
    rk+1 := rk − αk A(wk + qk+1)
    xk+1 := xk + αk(wk + qk+1)
(8.75)

Note that both in the Bi-CGSTAB method and in the CGS method we have relatively low costs per iteration (two matrix-vector products and a few inner products) and that we do not need AT. The fact that in the CGS polynomial we use the square of the Bi-CG polynomial often results in a rather irregular convergence behaviour (cf. Van der Vorst [91]). The Bi-CGSTAB polynomial (8.65), (8.73) is chosen such that the resulting method in general has a less irregular convergence behaviour than the CGS method.
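For comparison, the CGS recursions in executable form. This pure-Python sketch follows the standard CGS formulation; the auxiliary vector names (u, p, q) may differ from the symbols used in (8.75), and the test system and tolerance check are illustrative additions:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return dot(v, v) ** 0.5

def matvec(A, v):
    return [dot(row, v) for row in A]

def cgs(A, b, x0, tol=1e-10, maxit=50):
    """CGS (8.75): residuals r_k = P_k(A)^2 r_0; two products with A
    per iteration and no A^T."""
    n = len(b)
    x = x0[:]
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    rt0 = r[:]                       # shadow vector (here r_0 itself)
    q, p = [0.0] * n, [0.0] * n
    rho = 1.0                        # rho_{-1} := 1
    for _ in range(maxit):
        if norm(r) < tol:
            break
        rho_new = dot(rt0, r)
        beta = rho_new / rho
        u = [ri + beta * qi for ri, qi in zip(r, q)]
        p = [ui + beta * (qi + beta * pi) for ui, qi, pi in zip(u, q, p)]
        v = matvec(A, p)
        alpha = rho_new / dot(rt0, v)
        q = [ui - alpha * vi for ui, vi in zip(u, v)]
        uq = [ui + qi for ui, qi in zip(u, q)]
        r = [ri - alpha * ai for ri, ai in zip(r, matvec(A, uq))]
        x = [xi + alpha * ui for xi, ui in zip(x, uq)]
        rho = rho_new
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = cgs(A, b, [0.0, 0.0, 0.0])
```

On this benign symmetric positive definite toy problem the squared polynomial is harmless; the irregular convergence mentioned above shows up for less well-behaved (e.g. convection-dominated) problems.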


Example 8.5.1 (Convection-diffusion problem) In Figure 8.3 we show the results of the CGS method and the Bi-CGSTAB method applied to the convection-diffusion equation as in example 8.4.1, with ε = 10−2, h = 1/32. As a measure for the error reduction we use

¹⁰log(σ^d_k) := ¹⁰log(‖Axk − b‖2/‖Ax0 − b‖2) = ¹⁰log(‖Axk − b‖2/‖b‖2).

In figure 8.3 we show the values of ¹⁰log(σ^d_k) for k = 0, 1, 2, . . . , 60.

Figure 8.3: Convergence behaviour of CGS and of Bi-CGSTAB (¹⁰log(σ^d_k) for k = 0, . . . , 60; vertical axis from −12 to 12).

Note that for both methods we first have a growth of the norm of the defect (in CGS even up to 10^12!). In this example we observe that indeed the Bi-CGSTAB method has a smoother convergence behaviour than the CGS method.

We finally note that for all these nonsymmetric Krylov subspace methods the use of a suitable preconditioner is of great importance for the efficiency of the methods. There is only very little analysis in this field, and in general the choice of the preconditioner is based on "trial and error". Often variants of the ILU factorization are used as a preconditioner.

The preconditioned Bi-CGSTAB algorithm, with preconditioner W, is as follows (cf. Sleijpen and Van der Vorst [85]):

starting vector x0; r0 = b − Ax0; choose r̃0 (e.g. = r0);
p−1 = c−1 = 0; α−1 = 0, ω−1 = ρ−1 = 1;
for k ≥ 0 :
    ρk = 〈rk, r̃0〉, βk = (αk−1/ωk−1)(ρk/ρk−1),
    solve pk+1/2 from W pk+1/2 = rk − βk ωk−1 ck−1,
    pk = βk pk−1 + pk+1/2,
    ck = Apk,
    γk = 〈ck, r̃0〉, αk = ρk/γk,
    rk+1/2 = rk − αk ck,
    solve yk+1/2 from W yk+1/2 = rk+1/2,
    ck+1/2 = A yk+1/2,
    ωk = 〈ck+1/2, rk+1/2〉 / 〈ck+1/2, ck+1/2〉,
    xk+1 = xk + αk pk + ωk yk+1/2 , rk+1 = rk+1/2 − ωk ck+1/2 .
(8.76)


Chapter 9

Multigrid methods

9.1 Introduction

In this chapter we treat multigrid methods (MGM) for solving discrete scalar elliptic boundary value problems. We first briefly discuss a few important differences between multigrid methods and the iterative methods treated in the preceding chapters.

The basic iterative methods and the Krylov subspace methods use the matrix A and the right-hand side b which are the result of a discretization method. The fact that these data correspond to a certain underlying continuous boundary value problem is not used in the iterative method. However, the relation between the data (A and b) and the underlying problem can be useful for the development of a fast iterative solver. Due to the fact that A results from a discretization procedure we know, for example, that there are other matrices which, in a certain natural sense, are similar to the matrix A. These matrices result from the discretization of the underlying continuous boundary value problem on other grids than the grid corresponding to the given discrete problem Ax = b. The use of discretizations of the given continuous problem on several grids with different mesh sizes plays an important role in multigrid methods.

We will see that for a large class of discrete elliptic boundary value problems multigrid methods have a significantly higher rate of convergence than the methods treated in the preceding chapters. Often multigrid methods even have "optimal" complexity.

Due to the fact that in multigrid methods discrete problems on different grids are needed, the implementation of multigrid methods is in general (much) more involved than the implementation of, for example, Krylov subspace methods. We also note that for multigrid methods it is relatively hard to develop "black box" solvers which are applicable to a wide class of problems.

In section 9.2 we explain the main ideas of the MGM using a simple one-dimensional problem. In section 9.3 we introduce multigrid methods for discrete scalar elliptic boundary value problems. In section 9.4 we present a convergence analysis of these multigrid methods. In contrast to the basic iterative and Krylov subspace methods, in the convergence analysis we will need the underlying continuous problem. The standard multigrid method discussed in the sections 9.2-9.4 is efficient only for diffusion-dominated elliptic problems. In section 9.5 we consider modifications of standard multigrid methods which are used for convection-dominated problems. In section 9.6 we discuss the principle of nested iteration. In this approach we use computations on relatively coarse grids to obtain a good starting vector for an iterative method (not necessarily


a multigrid method). In section 9.7 we show some results of numerical experiments. In section 9.8 we discuss so-called algebraic multigrid methods. In these methods, as in basic iterative methods and Krylov subspace methods, we only use the given matrix and right-hand side, but no information on an underlying grid structure. Finally, in section 9.9 we consider multigrid techniques which can be applied directly to nonlinear elliptic boundary value problems without using a linearization technique.

For a thorough treatment of multigrid methods we refer to the monograph of Hackbusch [44]. For an introduction to multigrid methods requiring less knowledge of mathematics, we refer to Wesseling [96], Briggs [23], Trottenberg et al. [69]. A theoretical analysis of multigrid methods is presented in [19].

9.2 Multigrid for a one-dimensional model problem

In this section we consider a simple model situation to show the basic principle behind the multigrid approach. We consider the two-point boundary value model problem

−u′′(x) = f(x), x ∈ Ω := (0, 1),
u(0) = u(1) = 0 .
(9.1)

The variational formulation of this problem is: find u ∈ H10(Ω) such that

∫01 u′v′ dx = ∫01 fv dx for all v ∈ H10(Ω) .

For the discretization we introduce a sequence of nested uniform grids. For ℓ = 0, 1, 2, . . . we define

hℓ = 2^{−ℓ−1} ("mesh size") , (9.2)
nℓ = hℓ^{−1} − 1 ("number of interior grid points") , (9.3)
ξℓ,i = ihℓ , i = 0, 1, . . . , nℓ + 1 ("grid points") , (9.4)
Ωintℓ = { ξℓ,i | 1 ≤ i ≤ nℓ } ("interior grid") , (9.5)
Thℓ = { [ξℓ,i, ξℓ,i+1] | 0 ≤ i ≤ nℓ } ("triangulation") . (9.6)

The space of linear finite elements corresponding to the triangulation Thℓ is given by

X1hℓ,0 = { v ∈ C(Ω) | v|[ξℓ,i,ξℓ,i+1] ∈ P1 , i = 0, . . . , nℓ , v(0) = v(1) = 0 } .

The standard nodal basis in this space is denoted by (φi)1≤i≤nℓ. This basis induces an isomorphism

Pℓ : Rnℓ → X1hℓ,0 , Pℓx = Σ_{i=1}^{nℓ} xi φi . (9.7)

The Galerkin discretization in the space X1hℓ,0 yields a linear system

Aℓxℓ = bℓ , (Aℓ)ij = ∫01 φ′i φ′j dx , (bℓ)i = ∫01 f φi dx . (9.8)

The solution of this discrete problem is denoted by x∗ℓ. The solution of the Galerkin discretization in the space X1hℓ,0 is given by uℓ = Pℓ x∗ℓ. A simple computation shows that

Aℓ = hℓ^{−1} tridiag(−1, 2, −1) ∈ Rnℓ×nℓ .


Note that, apart from a scaling factor, the same matrix results from a standard discretization with finite differences of the problem (9.1). Clearly, in practice one should not solve the problem in (9.8) using an iterative method (a Cholesky factorization A = LLT is stable and efficient). However, we do apply a basic iterative method here, to illustrate a certain "smoothing" property which plays an important role in multigrid methods. We consider the damped Jacobi method

x^{k+1}_ℓ = x^k_ℓ − (1/2) ω hℓ (Aℓ x^k_ℓ − bℓ) with ω ∈ (0, 1] . (9.9)

The iteration matrix of this method is given by

Cℓ = Cℓ(ω) = I − (1/2) ω hℓ Aℓ .

In this simple model problem an orthogonal eigenvector basis of Aℓ, and thus of Cℓ, too, is known. This basis is closely related to the "Fourier modes":

wν(x) = sin(νπx), x ∈ [0, 1], ν = 1, 2, . . . .

Note that wν satisfies the boundary conditions in (9.1) and that −(wν)′′(x) = (νπ)^2 wν(x) holds, and thus wν is an eigenfunction of the problem in (9.1). We introduce vectors zνℓ ∈ Rnℓ, 1 ≤ ν ≤ nℓ, which correspond to the Fourier modes wν restricted to the interior grid Ωintℓ:

zνℓ := [wν(ξℓ,1), wν(ξℓ,2), . . . , wν(ξℓ,nℓ)]T .

These vectors form an orthogonal basis of Rnℓ. For ℓ = 2 we give an illustration in figure 9.1.

Figure 9.1: two discrete Fourier modes (z^1_2 and z^4_2 on [0, 1]).

To a vector zνℓ there corresponds a frequency ν. If ν < nℓ/2 holds then the vector zνℓ, or the corresponding finite element function Pℓzνℓ, is called a "low frequency mode", and if ν ≥ nℓ/2 holds then this vector [finite element function] is called a "high frequency mode". These vectors zνℓ are eigenvectors of the matrix Aℓ:

Aℓ zνℓ = (4/hℓ) sin^2(νπhℓ/2) zνℓ ,

and thus we have

Cℓ zνℓ = ( 1 − 2ω sin^2(νπhℓ/2) ) zνℓ . (9.10)

From this we obtain

‖Cℓ‖2 = max_{1≤ν≤nℓ} |1 − 2ω sin^2(νπhℓ/2)| = 1 − 2ω sin^2(πhℓ/2) = 1 − (1/2) ω π^2 hℓ^2 + O(hℓ^4) . (9.11)

From this we see that the damped Jacobi method is convergent, but that the rate of convergence will be very low for hℓ small (cf. section 6.3).

Note that the eigenvalues and the eigenvectors of Cℓ are functions of νhℓ ∈ [0, 1]:

λℓ,ν := 1 − 2ω sin^2(νπhℓ/2) =: gω(νhℓ) , with (9.12a)
gω(y) = 1 − 2ω sin^2(πy/2) (y ∈ [0, 1]) . (9.12b)

Hence, the size of the eigenvalues λℓ,ν can directly be obtained from the graph of the function gω. In figure 9.2 we show the graph of the function gω for a few values of ω.

Figure 9.2: Graph of gω for ω = 1/3, 1/2, 2/3, 1.

From the graphs

in this figure we conclude that for a suitable choice of ω we have |gω(y)| ≪ 1 if y ∈ [1/2, 1]. We choose ω = 2/3 (then |gω(1/2)| = |gω(1)| holds). Then we have |g2/3(y)| ≤ 1/3 for y ∈ [1/2, 1]. Using this and the result in (9.12a) we obtain

|λℓ,ν| ≤ 1/3 for ν ≥ nℓ/2 .

Hence:


the high frequency modes are strongly damped by the iteration matrix Cℓ.

From figure 9.2 it is also clear that the low rate of convergence of the damped Jacobi method is caused by the low frequency modes (νhℓ ≪ 1).
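The damping factors can be checked directly from (9.12a). The sketch below (the level ℓ = 5 is an arbitrary illustrative choice) evaluates λℓ,ν for ω = 2/3 and confirms that the high frequency modes (ν ≥ nℓ/2) are damped by at least a factor of about 1/3 per iteration, while the lowest mode is barely reduced:

```python
import math

def jacobi_eigenvalue(nu, h, omega):
    """lambda_{l,nu} = 1 - 2*omega*sin^2(nu*pi*h/2), cf. (9.12a)."""
    return 1.0 - 2.0 * omega * math.sin(nu * math.pi * h / 2.0) ** 2

ell = 5
h = 2.0 ** (-ell - 1)                 # h_l = 2^{-l-1}
n = round(1.0 / h) - 1                # n_l = h_l^{-1} - 1
omega = 2.0 / 3.0
high = max(abs(jacobi_eigenvalue(nu, h, omega))
           for nu in range((n + 1) // 2, n + 1))   # high frequencies
low = abs(jacobi_eigenvalue(1, h, omega))          # lowest frequency
print(high, low)   # roughly 1/3 versus a value very close to 1
```

Repeating the experiment for smaller hℓ shows that the 1/3 bound for the high frequencies is mesh-independent, whereas the lowest eigenvalue approaches 1 like 1 − O(hℓ^2); this is exactly the smoothing behaviour exploited by the two-grid method below.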

Summarizing, we draw the conclusion that in this example the damped Jacobi method will "smooth" the error. This elementary observation is of great importance for the two-grid method introduced below. In the setting of multigrid methods the damped Jacobi method is called a "smoother". The smoothing property of damped Jacobi is illustrated in figure 9.3.

Figure 9.3: Smoothing property of damped Jacobi. Left: graph of a starting error. Right: graph of the error after one damped Jacobi iteration (ω = 2/3).

It is important to note that the discussion above concerning smoothing is related to the iteration matrix Cℓ, which means that the error will be made smoother by the damped Jacobi method, but not (necessarily) the new iterand xk+1.

In multigrid methods we have to transform information from one grid to another. For that purpose we introduce so-called prolongations and restrictions. In a setting with nested finite element spaces these operators can be defined in a very natural way. Due to the nestedness the identity operator

Iℓ : X1hℓ−1,0 → X1hℓ,0 , Iℓv = v

is well-defined. This identity operator represents linear interpolation, as is illustrated for ℓ = 2 in figure 9.4.

Figure 9.4: Canonical prolongation (linear interpolation from X1h1,0 to X1h2,0).

The matrix representation of this interpolation operator is given by

pℓ : Rnℓ−1 → Rnℓ , pℓ := Pℓ^{−1} Pℓ−1 (9.13)


A simple computation yields

pℓ =
⎡ 1/2           ⎤
⎢  1            ⎥
⎢ 1/2  1/2      ⎥
⎢       1       ⎥
⎢      1/2  ⋱   ⎥
⎢            1  ⎥
⎣           1/2 ⎦  ∈ R^{nℓ×nℓ−1}   (9.14)

i.e., each column of pℓ contains the weights (1/2, 1, 1/2)^T centered at the fine grid node corresponding to a coarse grid node.

We can also restrict a given grid function vℓ on Ω^int_ℓ to a grid function on Ω^int_{ℓ−1}. An obvious approach is to use a restriction r based on simple injection:

(r_inj vℓ)(ξ) = vℓ(ξ) if ξ ∈ Ω^int_{ℓ−1} .

When used in a multigrid method, this restriction based on injection is often not satisfactory (cf. Hackbusch [44], section 3.5). A better method is obtained if a natural Galerkin property is satisfied. It can easily be verified (cf. also lemma 9.3.2) that with Aℓ, Aℓ−1 and pℓ as defined in (9.8), (9.13) we have

rℓAℓpℓ = Aℓ−1 iff rℓ = pℓ^T   (9.15)

Thus the natural Galerkin condition rℓAℓpℓ = Aℓ−1 implies the choice

rℓ = pℓ^T   (9.16)

for the restriction operator.
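For the one-dimensional model problem the Galerkin relation (9.15) can be verified directly in a few lines. The following sketch assumes the standard 1D P1 stiffness matrix and the prolongation (9.14); it checks numerically that rℓAℓpℓ = Aℓ−1 for rℓ = pℓ^T:

```python
import numpy as np

def poisson_1d_fem(n):
    # P1 finite element stiffness matrix of -u'' on n interior points
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

def prolongation(nc):
    # n_f x n_c matrix whose columns hold the weights (1/2, 1, 1/2)^T, cf. (9.14)
    p = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        p[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]
    return p

nc = 7                          # coarse level, h_{l-1} = 1/8
nf = 2 * nc + 1                 # fine level,   h_l = 1/16
A_c, A_f = poisson_1d_fem(nc), poisson_1d_fem(nf)
p = prolongation(nc)
r = p.T                         # Galerkin choice (9.16)

print(np.allclose(r @ A_f @ p, A_c))   # True: the Galerkin property (9.15) holds
```

Because the coarse hat functions lie exactly in the fine space, the identity holds up to rounding, not merely approximately.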

The two-grid method is based on the idea that a smooth error, which results from the application of one or a few damped Jacobi iterations, can be approximated fairly well on a coarser grid. We now introduce this two-grid method.

Consider Aℓx*ℓ = bℓ and let xℓ be the result of one or a few damped Jacobi iterations applied to a given starting vector x^0_ℓ. For the error eℓ := x*ℓ − xℓ we have

Aℓeℓ = bℓ − Aℓxℓ =: dℓ  ("residual" or "defect")   (9.17)

Based on the assumption that eℓ is smooth it seems reasonable to make the approximation eℓ ≈ pℓeℓ−1 with an appropriate vector (grid function) eℓ−1 ∈ R^{nℓ−1}. To determine the vector eℓ−1 we use the equation (9.17) and the Galerkin property (9.15). This results in the equation

Aℓ−1eℓ−1 = rℓdℓ

for the vector eℓ−1. Note that x*ℓ = xℓ + eℓ ≈ xℓ + pℓeℓ−1. Thus for the new iterand we take xℓ := xℓ + pℓeℓ−1. In a more compact formulation this two-grid method is as follows:

procedure TGMℓ(xℓ, bℓ)
if ℓ = 0 then x0 := A0^{-1}b0 else
begin
    xℓ := Jℓ^{ν}(xℓ, bℓ)          (∗ ν smoothing it., e.g. damped Jacobi ∗)
    dℓ−1 := rℓ(bℓ − Aℓxℓ)        (∗ restriction of defect ∗)
    eℓ−1 := Aℓ−1^{-1}dℓ−1        (∗ solve coarse grid problem ∗)
    xℓ := xℓ + pℓeℓ−1            (∗ add correction ∗)
    TGMℓ := xℓ
end;                                                            (9.18)


Often, after the coarse grid correction xℓ := xℓ + pℓeℓ−1, one or a few smoothing iterations are applied again. Smoothing before/after the coarse grid correction is called pre-/post-smoothing. Besides the smoothing property a second property which is of great importance for a multigrid method is the following:

The coarse grid system Aℓ−1eℓ−1 = dℓ−1 is of the same form as the system Aℓxℓ = bℓ.

Thus for solving the problem Aℓ−1eℓ−1 = dℓ−1 approximately we can apply the two-grid algorithm in (9.18) recursively. This results in the following multigrid method for solving Aℓx*ℓ = bℓ:

procedure MGMℓ(xℓ, bℓ)
if ℓ = 0 then x0 := A0^{-1}b0 else
begin
    xℓ := Jℓ^{ν1}(xℓ, bℓ)                  (∗ presmoothing ∗)
    dℓ−1 := rℓ(bℓ − Aℓxℓ)
    e^0_{ℓ−1} := 0;  for i = 1 to τ do e^i_{ℓ−1} := MGMℓ−1(e^{i−1}_{ℓ−1}, dℓ−1);
    xℓ := xℓ + pℓe^τ_{ℓ−1}
    xℓ := Jℓ^{ν2}(xℓ, bℓ)                  (∗ postsmoothing ∗)
    MGMℓ := xℓ
end;                                                            (9.19)

If one wants to solve the system on a given finest grid, say with level number ℓ, i.e. Aℓx*ℓ = bℓ, then we apply some iterations of MGMℓ(xℓ, bℓ).

Based on efficiency considerations (cf. section 9.3) we usually choose τ = 1 ("V-cycle") or τ = 2 ("W-cycle") in the recursive call in (9.19). For the case ℓ = 3 the structure of one multigrid iteration with τ ∈ {1, 2} is illustrated in figure 9.5.

Figure 9.5: Structure of one multigrid iteration for ℓ = 3 with τ = 1 (left) and τ = 2 (right); • marks smoothing, and the coarsest level ℓ = 0 is solved exactly.
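The recursive procedure (9.19) translates almost line by line into code. The sketch below (again for an assumed 1D model hierarchy, with damped Jacobi as the smoother J) supports both the V-cycle (τ = 1) and the W-cycle (τ = 2):

```python
import numpy as np

def poisson_1d(n):
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

def prolongation(nc):
    p = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        p[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]
    return p

def smooth(A, x, b, nu, omega=2.0 / 3.0):
    D = np.diag(A)
    for _ in range(nu):
        x = x + omega * (b - A @ x) / D
    return x

def mgm(level, A, P, x, b, tau=1, nu1=1, nu2=1):
    # A[l], P[l] hold the matrices / prolongations of the hierarchy
    if level == 0:
        return np.linalg.solve(A[0], b)        # solve exactly on coarsest grid
    x = smooth(A[level], x, b, nu1)            # presmoothing
    d = P[level].T @ (b - A[level] @ x)        # restrict the defect
    e = np.zeros(A[level - 1].shape[0])
    for _ in range(tau):                       # tau recursive calls: V- or W-cycle
        e = mgm(level - 1, A, P, e, d, tau, nu1, nu2)
    x = x + P[level] @ e                       # coarse grid correction
    return smooth(A[level], x, b, nu2)         # postsmoothing

# hierarchy with n_l = 2^{l+2} - 1 interior points, l = 0..3
sizes = [2 ** (l + 2) - 1 for l in range(4)]
A = [poisson_1d(n) for n in sizes]
P = [None] + [prolongation(nc) for nc in sizes[:-1]]

x_star = np.random.default_rng(1).standard_normal(sizes[-1])
b = A[-1] @ x_star
x = np.zeros_like(x_star)
for _ in range(8):
    x = mgm(3, A, P, x, b, tau=1)              # eight V-cycles
print(np.linalg.norm(x - x_star) / np.linalg.norm(x_star))
```

Switching `tau=1` to `tau=2` turns the V-cycle into a W-cycle with no other changes, mirroring the role of τ in (9.19).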

9.3 Multigrid for scalar elliptic problems

In this section we introduce multigrid methods which can be used for solving discretized elliptic boundary value problems. In contrast to the CG method, the applicability of multigrid methods is not restricted to (nearly) symmetric problems. Multigrid methods can also be used for solving problems which are strongly nonsymmetric (convection dominated). However, for such problems one usually has to modify the standard multigrid approach. These modifications are discussed in section 9.5.

We will introduce the two-grid and multigrid method by generalizing the approach of section 9.2 to the higher (i.e., two- and three-) dimensional case. We consider the finite element discretization of scalar elliptic boundary value problems as discussed in section 3.4. Thus the continuous variational problem is of the form

find u ∈ H^1_0(Ω) such that
k(u, v) = f(v) for all v ∈ H^1_0(Ω)   (9.20)

with a bilinear form and righthand side as in (3.42):

k(u, v) = ∫_Ω ∇u^T A∇v + b · ∇u v + cuv dx ,   f(v) = ∫_Ω fv dx

The coefficients A, b, c are assumed to satisfy the conditions in (3.42). For the discretization of this problem we use simplicial finite elements. The case with rectangular finite elements can be treated in a very similar way. Let Th be a regular family of triangulations of Ω consisting of n-simplices and X^k_{h,0}, k ≥ 1, the corresponding finite element space as in (3.16). The presentation and implementation of the multigrid method is greatly simplified if we assume a given sequence of nested finite element spaces.

Assumption 9.3.1 In the remainder of this chapter we always assume that we have a sequence Vℓ := X^k_{hℓ,0}, ℓ = 0, 1, . . ., of simplicial finite element spaces which are nested:

Vℓ ⊂ Vℓ+1 for all ℓ   (9.21)

We note that this assumption is not necessary for a successful application of multigrid methods. For a treatment of multigrid methods in case of non-nestedness we refer to [69]. The construction of a hierarchy of triangulations such that the corresponding finite element spaces are nested is discussed in chapter ??.

In Vℓ we use the standard nodal basis (φi)_{1≤i≤nℓ} as explained in section 3.5. This basis induces an isomorphism

Pℓ : R^{nℓ} → Vℓ ,   Pℓx = ∑_{i=1}^{nℓ} xiφi

The Galerkin discretization: Find uℓ ∈ Vℓ such that

k(uℓ, vℓ) = f(vℓ) for all vℓ ∈ Vℓ

can be represented as a linear system

Aℓxℓ = bℓ , with (Aℓ)ij = k(φj , φi), (bℓ)i = f(φi), 1 ≤ i, j ≤ nℓ (9.22)

Along the same lines as in the one-dimensional case we introduce a multigrid method for solving this system of equations on an arbitrary level ℓ ≥ 0. For the smoother we use one of the basic iterative methods discussed in section 6.2. For this method we use the notation

x^{k+1} = Sℓ(x^k, bℓ) = x^k − Mℓ^{-1}(Aℓx^k − bℓ) ,   k = 0, 1, . . .


The corresponding iteration matrix is denoted by

Sℓ = I − Mℓ^{-1}Aℓ

For the prolongation we use the matrix representation of the identity Iℓ : Vℓ−1 → Vℓ, i.e.,

pℓ := Pℓ^{-1}Pℓ−1   (9.23)

The choice of the restriction is based on the following elementary lemma:

Lemma 9.3.2 Let Aℓ, ℓ ≥ 0, be the stiffness matrix defined in (9.22) and pℓ as in (9.23). Then for rℓ : R^{nℓ} → R^{nℓ−1} we have:

rℓAℓpℓ = Aℓ−1 if and only if rℓ = pℓ^T

Proof. For the stiffness matrix the identity

⟨Aℓx, y⟩ = k(Pℓx, Pℓy) for all x, y ∈ R^{nℓ}

holds. From this we get

rℓAℓpℓ = Aℓ−1
⇔ ⟨Aℓpℓx, rℓ^T y⟩ = ⟨Aℓ−1x, y⟩ for all x, y ∈ R^{nℓ−1}
⇔ k(Pℓ−1x, Pℓrℓ^T y) = k(Pℓ−1x, Pℓ−1y) for all x, y ∈ R^{nℓ−1}

Using the ellipticity of k(·, ·) it now follows that

rℓAℓpℓ = Aℓ−1
⇔ Pℓrℓ^T y = Pℓ−1y for all y ∈ R^{nℓ−1}
⇔ rℓ^T y = Pℓ^{-1}Pℓ−1y = pℓy for all y ∈ R^{nℓ−1}
⇔ rℓ^T = pℓ

Thus the claim is proved.

Thus for the restriction we take:

rℓ := pℓ^T   (9.24)

Using these components we can define a multigrid method with exactly the same structure as in (9.19):

procedure MGMℓ(xℓ, bℓ)
if ℓ = 0 then x0 := A0^{-1}b0 else
begin
    xℓ := Sℓ^{ν1}(xℓ, bℓ)                  (∗ presmoothing ∗)
    dℓ−1 := rℓ(bℓ − Aℓxℓ)
    e^0_{ℓ−1} := 0;  for i = 1 to τ do e^i_{ℓ−1} := MGMℓ−1(e^{i−1}_{ℓ−1}, dℓ−1);
    xℓ := xℓ + pℓe^τ_{ℓ−1}
    xℓ := Sℓ^{ν2}(xℓ, bℓ)                  (∗ postsmoothing ∗)
    MGMℓ := xℓ
end;                                                            (9.25)


We briefly comment on some important issues related to this multigrid method.

Smoothers
For many problems basic iterative methods provide good smoothers. In particular the Gauss-Seidel method is often a very effective smoother. Other smoothers used in practice are the damped Jacobi method and the ILU method.

Prolongation and restriction
If instead of a discretization with nested finite element spaces one uses a finite difference or a finite volume method then one cannot use the approach in (9.23) to define a prolongation. However, for these cases other canonical constructions for the prolongation operator exist. We refer to Hackbusch [44], [69] or Wesseling [96] for a treatment of this topic. A general technique for the construction of a prolongation operator in case of nonnested finite element spaces is given in [17].

Arithmetic costs per iteration
We discuss the arithmetic costs of one MGMℓ iteration as defined in (9.25). For this we introduce a unit of arithmetic work on level ℓ:

WUℓ := # flops needed for the computation of Aℓxℓ − bℓ.   (9.26)

We assume:

WUℓ−1 ≲ g WUℓ with g < 1 independent of ℓ   (9.27)

Note that if Tℓ is constructed through a uniform global grid refinement of Tℓ−1 (for n = 2: subdivision of each triangle T ∈ Tℓ−1 into four smaller triangles by connecting the midpoints of the edges) then (9.27) holds with g = (1/2)^n. Furthermore we make the following assumptions concerning the arithmetic costs of each of the substeps in the procedure MGMℓ:

xℓ := Sℓ(xℓ, bℓ) :  costs ≲ WUℓ
dℓ−1 := rℓ(bℓ − Aℓxℓ)  and  xℓ := xℓ + pℓe^τ_{ℓ−1} :  total costs ≲ 2 WUℓ

For the amount of work in one multigrid V-cycle (τ = 1) on level ℓ, which is denoted by VMGℓ, we get, using ν := ν1 + ν2:

VMGℓ ≲ ν WUℓ + 2 WUℓ + VMGℓ−1 = (ν + 2) WUℓ + VMGℓ−1
     ≲ (ν + 2)(WUℓ + WUℓ−1 + . . . + WU1) + VMG0
     ≲ (ν + 2)(1 + g + . . . + g^{ℓ−1}) WUℓ + VMG0
     ≲ (ν + 2)/(1 − g) WUℓ   (9.28)

In the last inequality we assumed that the costs for computing x0 = A0^{-1}b0 (i.e., VMG0) are negligible compared to WUℓ. The result in (9.28) shows that the arithmetic costs for one V-cycle are proportional (if ℓ → ∞) to the costs of a residual computation. For example, for g = 1/8 (uniform refinement in 3D) the arithmetic costs of a V-cycle with ν1 = ν2 = 1 on level ℓ are comparable to 4½ times the costs of a residual computation on level ℓ.
For the W-cycle (τ = 2) the arithmetic costs on level ℓ are denoted by WMGℓ. We have:

WMGℓ ≲ ν WUℓ + 2 WUℓ + 2 WMGℓ−1 = (ν + 2) WUℓ + 2 WMGℓ−1
     ≲ (ν + 2)(WUℓ + 2 WUℓ−1 + 2^2 WUℓ−2 + . . . + 2^{ℓ−1} WU1) + 2^ℓ WMG0
     ≲ (ν + 2)(1 + 2g + (2g)^2 + . . . + (2g)^{ℓ−1}) WUℓ + 2^ℓ WMG0


From this we see that to obtain a bound proportional to WUℓ we have to assume

g < 1/2

Under this assumption we get for the W-cycle

WMGℓ ≲ (ν + 2)/(1 − 2g) WUℓ

(again we neglected WMG0). Similar bounds can be obtained for τ ≥ 3, provided τg < 1 holds.
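The work-unit recursions above are easy to check numerically. The following sketch (pure arithmetic; g = 1/8 and ν = 2 are the example values from the text, the level count is an assumption) accumulates the work of one V-cycle and one W-cycle and compares it with the geometric-series bounds:

```python
def mg_work(level, g, nu, tau, wu=1.0):
    # total work of one cycle started on 'level', where the finest level has
    # work unit wu = 1 and WU_{l-1} = g * WU_l; the coarsest solve is neglected
    if level == 0:
        return 0.0
    return (nu + 2) * wu + tau * mg_work(level - 1, g, nu, tau, g * wu)

g, nu, L = 0.125, 2, 12          # g = (1/2)^3: uniform refinement in 3D
v = mg_work(L, g, nu, tau=1)     # V-cycle work in units of WU_L
w = mg_work(L, g, nu, tau=2)     # W-cycle work in units of WU_L
print(v)                         # approaches (nu + 2)/(1 - g)  = 4.571...
print(w)                         # approaches (nu + 2)/(1 - 2g) = 5.333...
```

The V-cycle value reproduces the "4½ residual computations" figure quoted above for 3D uniform refinement.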

9.4 Convergence analysis

In this section we present a convergence analysis for the multigrid method. Our approach is based on the so-called approximation and smoothing properties, introduced by Hackbusch (cf. [44, 48]). For a discussion of other analyses we refer to remark 9.4.22.

9.4.1 Introduction

One easily verifies that the two-grid method is a linear iterative method. The iteration matrix of this method with ν1 presmoothing and ν2 postsmoothing iterations on level ℓ is given by

CTG,ℓ = CTG,ℓ(ν2, ν1) = Sℓ^{ν2}(I − pℓAℓ−1^{-1}rℓAℓ)Sℓ^{ν1}   (9.29)

with Sℓ = I − Mℓ^{-1}Aℓ the iteration matrix of the smoother.

Theorem 9.4.1 The multigrid method (9.25) is a linear iterative method with iteration matrix CMG,ℓ given by

CMG,0 = 0   (9.30a)
CMG,ℓ = Sℓ^{ν2}( I − pℓ(I − C^τ_{MG,ℓ−1})Aℓ−1^{-1}rℓAℓ )Sℓ^{ν1}   (9.30b)
      = CTG,ℓ + Sℓ^{ν2}pℓC^τ_{MG,ℓ−1}Aℓ−1^{-1}rℓAℓSℓ^{ν1} ,   ℓ = 1, 2, . . .   (9.30c)

Proof. The result in (9.30a) is trivial. The result in (9.30c) follows from (9.30b) and the definition of CTG,ℓ. We now prove the result in (9.30b) by induction. For ℓ = 1 it follows from (9.30a) and (9.29). Assume that the result is correct for ℓ − 1. Then MGMℓ−1(yℓ−1, zℓ−1) defines a linear iterative method and for arbitrary yℓ−1, zℓ−1 ∈ R^{nℓ−1} we have

MGMℓ−1(yℓ−1, zℓ−1) − Aℓ−1^{-1}zℓ−1 = CMG,ℓ−1(yℓ−1 − Aℓ−1^{-1}zℓ−1)   (9.31)

We rewrite the algorithm (9.25) as follows:

x1 := Sℓ^{ν1}(x^old_ℓ, bℓ)
x2 := x1 + pℓ MGM^τ_{ℓ−1}( 0, rℓ(bℓ − Aℓx1) )
x^new_ℓ := Sℓ^{ν2}(x2, bℓ)


From this we get

x^new_ℓ − x*ℓ = x^new_ℓ − Aℓ^{-1}bℓ = Sℓ^{ν2}(x2 − x*ℓ)
= Sℓ^{ν2}( x1 − x*ℓ + pℓ MGM^τ_{ℓ−1}( 0, rℓ(bℓ − Aℓx1) ) )

Now we use the result (9.31) with yℓ−1 = 0, zℓ−1 := rℓ(bℓ − Aℓx1). This yields

x^new_ℓ − x*ℓ = Sℓ^{ν2}( x1 − x*ℓ + pℓ(Aℓ−1^{-1}zℓ−1 − C^τ_{MG,ℓ−1}Aℓ−1^{-1}zℓ−1) )
= Sℓ^{ν2}( I − pℓ(I − C^τ_{MG,ℓ−1})Aℓ−1^{-1}rℓAℓ )(x1 − x*ℓ)
= Sℓ^{ν2}( I − pℓ(I − C^τ_{MG,ℓ−1})Aℓ−1^{-1}rℓAℓ )Sℓ^{ν1}(x^old_ℓ − x*ℓ)

This completes the proof.

The convergence analysis will be based on the following splitting of the two-grid iteration matrix, with ν2 = 0, i.e. no postsmoothing:

‖CTG,ℓ(0, ν1)‖2 = ‖(I − pℓAℓ−1^{-1}rℓAℓ)Sℓ^{ν1}‖2 ≤ ‖Aℓ^{-1} − pℓAℓ−1^{-1}rℓ‖2 ‖AℓSℓ^{ν1}‖2   (9.32)

In section 9.4.2 we will prove a bound of the form ‖Aℓ^{-1} − pℓAℓ−1^{-1}rℓ‖2 ≤ CA‖Aℓ‖2^{-1}. This result is called the approximation property. In section 9.4.3 we derive a suitable bound for the term ‖AℓSℓ^{ν1}‖2. This is the so-called smoothing property. In section 9.4.4 we combine these bounds with the results in (9.32) and in theorem 9.4.1. This yields bounds for the contraction number of the two-grid method and of the multigrid W-cycle. For the V-cycle a more subtle analysis is needed. This is presented in section 9.4.5. In the convergence analysis we need the following:

Assumption 9.4.2 In the sections 9.4.2–9.4.5 we assume that the family of triangulations {Thℓ} corresponding to the finite element spaces Vℓ, ℓ = 0, 1, . . ., is quasi-uniform and that hℓ−1 ≤ c hℓ with a constant c independent of ℓ.

We formulate three results that will be used in the analysis further on. First we recall the global inverse inequality that is proved in lemma 3.3.11:

|vℓ|1 ≤ c hℓ^{-1}‖vℓ‖L2 for all vℓ ∈ Vℓ

with a constant c independent of ℓ. Note that for this result we need assumption 9.4.2. We now show that, apart from a scaling factor, the isomorphism Pℓ : (R^{nℓ}, ⟨·, ·⟩) → (Vℓ, ⟨·, ·⟩L2) and its inverse are uniformly (w.r.t. ℓ) bounded:

Lemma 9.4.3 There exist constants c1 > 0 and c2 independent of ℓ such that

c1‖Pℓx‖L2 ≤ hℓ^{n/2}‖x‖2 ≤ c2‖Pℓx‖L2 for all x ∈ R^{nℓ}   (9.33)

Proof. Let Mℓ be the mass matrix, i.e., (Mℓ)ij = ⟨φi, φj⟩L2. Note the basic equality

‖Pℓx‖_{L2}^2 = ⟨Mℓx, x⟩ for all x ∈ R^{nℓ}   (9.34)

There are constants d1, d2 > 0 independent of ℓ such that

⟨φi, φj⟩L2 ≤ d1 hℓ^n for all i, j ,   ⟨φi, φi⟩L2 ≥ d2 hℓ^n for all i


From this and from the sparsity of Mℓ we obtain

d2 hℓ^n ≤ (Mℓ)ii ≤ ‖Mℓ‖2 ≤ ‖Mℓ‖∞ ≤ d1 hℓ^n   (9.35)

Using the upper bound in (9.35) in combination with (9.34) we get

‖Pℓx‖_{L2}^2 ≤ ‖Mℓ‖2‖x‖_2^2 ≤ d1 hℓ^n ‖x‖_2^2 ,

which proves the first inequality in (9.33). We now use corollary 3.5.10. This yields λmin(Mℓ) ≥ c λmax(Mℓ) with a strictly positive constant c independent of ℓ. Thus we have

λmin(Mℓ) ≥ c‖Mℓ‖2 ≥ c hℓ^n ,   c > 0, independent of ℓ

This yields

‖Pℓx‖_{L2}^2 = ⟨Mℓx, x⟩ ≥ λmin(Mℓ)‖x‖_2^2 ≥ c hℓ^n ‖x‖_2^2 ,

which proves the second inequality in (9.33).

The third preliminary result concerns the scaling of the stiffness matrix:

Lemma 9.4.4 Let Aℓ be the stiffness matrix as in (9.22). Assume that the bilinear form is such that the usual conditions (3.42) are satisfied. Then there exist constants c1 > 0 and c2 independent of ℓ such that

c1 hℓ^{n−2} ≤ ‖Aℓ‖2 ≤ c2 hℓ^{n−2}

Proof. First note that

‖Aℓ‖2 = max_{x,y∈R^{nℓ}} ⟨Aℓx, y⟩ / (‖x‖2‖y‖2)

Using the result in lemma 9.4.3, the continuity of the bilinear form and the inverse inequality we get

max_{x,y∈R^{nℓ}} ⟨Aℓx, y⟩ / (‖x‖2‖y‖2) ≤ c hℓ^n max_{vℓ,wℓ∈Vℓ} k(vℓ, wℓ) / (‖vℓ‖L2‖wℓ‖L2)
≤ c hℓ^n max_{vℓ,wℓ∈Vℓ} |vℓ|1|wℓ|1 / (‖vℓ‖L2‖wℓ‖L2) ≤ c hℓ^{n−2}

and thus the upper bound is proved. The lower bound follows from

max_{x,y∈R^{nℓ}} ⟨Aℓx, y⟩ / (‖x‖2‖y‖2) ≥ max_{1≤i≤nℓ} ⟨Aℓei, ei⟩ = max_{1≤i≤nℓ} k(φi, φi) ≥ c|φi|_1^2 ≥ c hℓ^{n−2}

The last inequality can be shown by using for T ⊂ supp(φi) the affine transformation from the unit simplex to T.

9.4.2 Approximation property

In this section we derive a bound for the first factor in the splitting (9.32). In the analysis we will use the adjoint operator Pℓ* : Vℓ → R^{nℓ}, which satisfies ⟨Pℓx, vℓ⟩L2 = ⟨x, Pℓ*vℓ⟩ for all x ∈ R^{nℓ}, vℓ ∈ Vℓ. As a direct consequence of lemma 9.4.3 we obtain

c1‖Pℓ*vℓ‖2 ≤ hℓ^{n/2}‖vℓ‖L2 ≤ c2‖Pℓ*vℓ‖2 for all vℓ ∈ Vℓ   (9.36)


with constants c1 > 0 and c2 independent of ℓ. We now formulate a main result for the convergence analysis of multigrid methods:

Theorem 9.4.5 (Approximation property) Consider Aℓ, pℓ, rℓ as defined in (9.22), (9.23), (9.24). Assume that the variational problem (9.20) is such that the usual conditions (3.42) are satisfied. Moreover, the problem (9.20) and the corresponding dual problem are assumed to be H2-regular. Then there exists a constant CA independent of ℓ such that

‖Aℓ^{-1} − pℓAℓ−1^{-1}rℓ‖2 ≤ CA‖Aℓ‖2^{-1} for ℓ = 1, 2, . . .   (9.37)

Proof. Let bℓ ∈ R^{nℓ} be given. The constants in the proof are independent of bℓ and of ℓ. Consider the variational problems:

u ∈ H^1_0(Ω) :  k(u, v) = ⟨(Pℓ*)^{-1}bℓ, v⟩L2 for all v ∈ H^1_0(Ω)
uℓ ∈ Vℓ :  k(uℓ, vℓ) = ⟨(Pℓ*)^{-1}bℓ, vℓ⟩L2 for all vℓ ∈ Vℓ
uℓ−1 ∈ Vℓ−1 :  k(uℓ−1, vℓ−1) = ⟨(Pℓ*)^{-1}bℓ, vℓ−1⟩L2 for all vℓ−1 ∈ Vℓ−1

Then

Aℓ^{-1}bℓ = Pℓ^{-1}uℓ and Aℓ−1^{-1}rℓbℓ = Pℓ−1^{-1}uℓ−1

hold. Hence we obtain, using lemma 9.4.3,

‖(Aℓ^{-1} − pℓAℓ−1^{-1}rℓ)bℓ‖2 = ‖Pℓ^{-1}(uℓ − uℓ−1)‖2 ≤ c hℓ^{-n/2}‖uℓ − uℓ−1‖L2   (9.38)

Now we apply theorem 3.4.5 and use the H2-regularity of the problem. This yields

‖uℓ − uℓ−1‖L2 ≤ ‖uℓ − u‖L2 + ‖uℓ−1 − u‖L2 ≤ c hℓ^2 |u|2 + c hℓ−1^2 |u|2 ≤ c hℓ^2 ‖(Pℓ*)^{-1}bℓ‖L2   (9.39)

Now we combine (9.38) with (9.39) and use (9.36). Then we get

‖(Aℓ^{-1} − pℓAℓ−1^{-1}rℓ)bℓ‖2 ≤ c hℓ^{2−n}‖bℓ‖2

and thus ‖Aℓ^{-1} − pℓAℓ−1^{-1}rℓ‖2 ≤ c hℓ^{2−n}. The proof is completed if we use lemma 9.4.4.

Note that in the proof of the approximation property we use the underlying continuous problem.
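The scaling (9.37) can also be observed numerically. The sketch below uses the 1D model problem as an illustrative stand-in (the theorem itself concerns FE discretizations in n dimensions; matrix and prolongation are the standard 1D choices assumed here) and checks that ‖Aℓ^{-1} − pℓAℓ−1^{-1}rℓ‖2 ‖Aℓ‖2 stays bounded over several levels:

```python
import numpy as np

def poisson_1d(n):
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

def prolongation(nc):
    p = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        p[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]
    return p

ratios = []
for l in range(3, 7):
    nc = 2 ** l - 1
    nf = 2 ** (l + 1) - 1
    A_c, A_f, p = poisson_1d(nc), poisson_1d(nf), prolongation(nc)
    # the quantity bounded by the approximation property (9.37)
    E = np.linalg.inv(A_f) - p @ np.linalg.inv(A_c) @ p.T
    ratios.append(np.linalg.norm(E, 2) * np.linalg.norm(A_f, 2))

print(ratios)   # stays bounded as the level increases, consistent with (9.37)
```

Although ‖Aℓ^{-1}‖2 and ‖Aℓ‖2 individually scale with hℓ, their combination in (9.37) remains level-independent.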

9.4.3 Smoothing property

In this section we derive inequalities of the form

‖AℓSℓ^ν‖2 ≤ g(ν)‖Aℓ‖2

where g(ν) is a monotonically decreasing function with lim_{ν→∞} g(ν) = 0. In the first part of this section we derive results for the case that Aℓ is symmetric positive definite. In the second part we discuss the general case.

Smoothing property for the symmetric positive definite case.
We start with an elementary lemma:


Lemma 9.4.6 Let B ∈ R^{m×m} be a symmetric positive definite matrix with σ(B) ⊂ (0, 1]. Then we have

‖B(I − B)^ν‖2 ≤ 1/(2(ν + 1)) for ν = 1, 2, . . .

Proof. Note that

‖B(I − B)^ν‖2 = max_{x∈σ(B)} x(1 − x)^ν ≤ max_{x∈(0,1]} x(1 − x)^ν = (1/(ν + 1)) (ν/(ν + 1))^ν

A simple computation shows that ν → (ν/(ν + 1))^ν is decreasing on [1, ∞), so (ν/(ν + 1))^ν ≤ 1/2 for all ν ≥ 1, which proves the bound.

Below for a few basic iterative methods we derive the smoothing property for the symmetric case, i.e., b = 0 in the bilinear form k(·, ·). We first consider the Richardson method:

Theorem 9.4.7 Assume that in the bilinear form we have b = 0 and that the usual conditions (3.42) are satisfied. Let Aℓ be the stiffness matrix in (9.22). For c0 ∈ (0, 1] the smoothing property

‖Aℓ(I − (c0/ρ(Aℓ))Aℓ)^ν‖2 ≤ 1/(2c0(ν + 1)) ‖Aℓ‖2 ,   ν = 1, 2, . . .

holds.

Proof. Note that Aℓ is symmetric positive definite. Apply lemma 9.4.6 with B := ωℓAℓ, ωℓ := c0 ρ(Aℓ)^{-1}. This yields

‖Aℓ(I − ωℓAℓ)^ν‖2 ≤ ωℓ^{-1} · 1/(2(ν + 1)) = 1/(2c0(ν + 1)) ρ(Aℓ) = 1/(2c0(ν + 1)) ‖Aℓ‖2

and thus the result is proved.

A similar result can be shown for the damped Jacobi method:

Theorem 9.4.8 Assume that in the bilinear form we have b = 0 and that the usual conditions (3.42) are satisfied. Let Aℓ be the stiffness matrix in (9.22) and Dℓ := diag(Aℓ). There exists an ω ∈ (0, ρ(Dℓ^{-1}Aℓ)^{-1}], independent of ℓ, such that the smoothing property

‖Aℓ(I − ωDℓ^{-1}Aℓ)^ν‖2 ≤ 1/(2ω(ν + 1)) ‖Aℓ‖2 ,   ν = 1, 2, . . .

holds.

Proof. Define the symmetric positive definite matrix B := Dℓ^{-1/2}AℓDℓ^{-1/2}. Note that

(Dℓ)ii = (Aℓ)ii = k(φi, φi) ≥ c|φi|_1^2 ≥ c hℓ^{n−2} ,   (9.40)

with c > 0 independent of ℓ and i. Using this in combination with lemma 9.4.4 we get

‖B‖2 ≤ ‖Aℓ‖2 / λmin(Dℓ) ≤ c ,   c independent of ℓ.

Hence for ω ∈ (0, 1/c] ⊂ (0, ρ(Dℓ^{-1}Aℓ)^{-1}] we have σ(ωB) ⊂ (0, 1]. Application of lemma 9.4.6, with B replaced by ωB, yields

‖Aℓ(I − ωDℓ^{-1}Aℓ)^ν‖2 ≤ ω^{-1}‖Dℓ^{1/2}‖2 ‖ωB(I − ωB)^ν‖2 ‖Dℓ^{1/2}‖2 ≤ ‖Dℓ‖2 / (2ω(ν + 1)) ≤ 1/(2ω(ν + 1)) ‖Aℓ‖2

and thus the result is proved.


Remark 9.4.9 The value of the parameter ω used in theorem 9.4.8 is such that ωρ(Dℓ^{-1}Aℓ) = ωρ(Dℓ^{-1/2}AℓDℓ^{-1/2}) ≤ 1 holds. Note that

ρ(Dℓ^{-1/2}AℓDℓ^{-1/2}) = max_{x∈R^{nℓ}} ⟨Aℓx, x⟩/⟨Dℓx, x⟩ ≥ max_{1≤i≤nℓ} ⟨Aℓei, ei⟩/⟨Dℓei, ei⟩ = 1

and thus we have ω ≤ 1. This is in agreement with the fact that in multigrid methods one usually uses a damped Jacobi method as a smoother.

We finally consider the symmetric Gauss-Seidel method. This method is the same as the SSOR method with parameter value ω = 1. Thus it follows from (6.18) that this method has the iteration matrix

Sℓ = I − Mℓ^{-1}Aℓ ,  Mℓ = (Dℓ − Lℓ)Dℓ^{-1}(Dℓ − Lℓ^T) ,   (9.41)

where we use the decomposition Aℓ = Dℓ − Lℓ − Lℓ^T with Dℓ a diagonal matrix and Lℓ a strictly lower triangular matrix.

Theorem 9.4.10 Assume that in the bilinear form we have b = 0 and that the usual conditions (3.42) are satisfied. Let Aℓ be the stiffness matrix in (9.22) and Mℓ as in (9.41). The smoothing property

‖Aℓ(I − Mℓ^{-1}Aℓ)^ν‖2 ≤ (c/(ν + 1)) ‖Aℓ‖2 ,   ν = 1, 2, . . .

holds with a constant c independent of ν and ℓ.

Proof. Note that Mℓ = Aℓ + LℓDℓ^{-1}Lℓ^T and thus Mℓ is symmetric positive definite. Define the symmetric positive definite matrix B := Mℓ^{-1/2}AℓMℓ^{-1/2}. From

0 < max_{x∈R^{nℓ}} ⟨Bx, x⟩/⟨x, x⟩ = max_{x∈R^{nℓ}} ⟨Aℓx, x⟩/⟨Mℓx, x⟩ = max_{x∈R^{nℓ}} ⟨Aℓx, x⟩ / ( ⟨Aℓx, x⟩ + ⟨Dℓ^{-1}Lℓ^T x, Lℓ^T x⟩ ) ≤ 1

it follows that σ(B) ⊂ (0, 1]. Application of lemma 9.4.6 yields

‖Aℓ(I − Mℓ^{-1}Aℓ)^ν‖2 ≤ ‖Mℓ^{1/2}‖_2^2 ‖B(I − B)^ν‖2 ≤ ‖Mℓ‖2 · 1/(2(ν + 1))

From (9.40) we have ‖Dℓ^{-1}‖2 ≤ c hℓ^{2−n}. Using the sparsity of Aℓ we obtain

‖Lℓ‖2‖Lℓ^T‖2 ≤ ‖Lℓ‖∞‖Lℓ‖1 ≤ c (max_{i,j} |(Aℓ)ij|)^2 ≤ c‖Aℓ‖_2^2

In combination with lemma 9.4.4 we then get

‖Mℓ‖2 ≤ ‖Aℓ‖2 + ‖Dℓ^{-1}‖2‖Lℓ‖2‖Lℓ^T‖2 ≤ ‖Aℓ‖2 + c hℓ^{2−n}‖Aℓ‖_2^2 ≤ c‖Aℓ‖2   (9.42)

and this completes the proof.

For the symmetric positive definite case smoothing properties have also been proved for other iterative methods. For example, in [98, 97] a smoothing property is proved for a variant of the ILU method and in [24] it is shown that the SPAI (sparse approximate inverse) preconditioner satisfies a smoothing property.

Smoothing property for the nonsymmetric case.
For the analysis of the smoothing property in the general (possibly nonsymmetric) case we cannot use lemma 9.4.6. Instead the analysis will be based on the following lemma (cf. [74, 75]):


Lemma 9.4.11 Let ‖·‖ be any induced matrix norm and assume that for B ∈ R^{m×m} the inequality ‖B‖ ≤ 1 holds. Then we have

‖(I − B)(I + B)^ν‖ ≤ 2^{ν+1} √(2/(πν)) ,   for ν = 1, 2, . . .

Proof. Note that

(I − B)(I + B)^ν = (I − B) ∑_{k=0}^{ν} (ν choose k) B^k = I − B^{ν+1} + ∑_{k=1}^{ν} [ (ν choose k) − (ν choose k−1) ] B^k

This yields

‖(I − B)(I + B)^ν‖ ≤ 2 + ∑_{k=1}^{ν} | (ν choose k) − (ν choose k−1) |

Using (ν choose k) ≥ (ν choose k−1) ⇔ k ≤ (ν + 1)/2 and the symmetry (ν choose k) = (ν choose ν−k) we get (with [·] the round down operator):

∑_{k=1}^{ν} | (ν choose k) − (ν choose k−1) |
= ∑_{k=1}^{[½(ν+1)]} [ (ν choose k) − (ν choose k−1) ] + ∑_{k=[½(ν+1)]+1}^{ν} [ (ν choose k−1) − (ν choose k) ]
= ∑_{k=1}^{[½ν]} [ (ν choose k) − (ν choose k−1) ] + ∑_{m=1}^{[½ν]} [ (ν choose m) − (ν choose m−1) ]
= 2 ∑_{k=1}^{[½ν]} [ (ν choose k) − (ν choose k−1) ] = 2 [ (ν choose [½ν]) − (ν choose 0) ]

An elementary analysis yields (cf., for example, [75])

(ν choose [½ν]) ≤ 2^ν √(2/(πν)) for ν ≥ 1

Hence ‖(I − B)(I + B)^ν‖ ≤ 2 + 2[ (ν choose [½ν]) − 1 ] = 2 (ν choose [½ν]) ≤ 2^{ν+1} √(2/(πν)).
Thus we have proved the bound.
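Both the central binomial estimate and the norm bound of lemma 9.4.11 can be checked numerically; the sketch below does so for the 2-norm and a random matrix scaled to ‖B‖2 = 1 (an illustrative instance):

```python
import math
import numpy as np

# central binomial estimate used at the end of the proof
for nu in range(1, 60):
    assert math.comb(nu, nu // 2) <= 2 ** nu * math.sqrt(2.0 / (math.pi * nu))

# norm bound of lemma 9.4.11 for a random B scaled so that ||B||_2 = 1
rng = np.random.default_rng(3)
M = rng.standard_normal((40, 40))
B = M / np.linalg.norm(M, 2)
I = np.eye(40)
for nu in (1, 2, 5, 10):
    lhs = np.linalg.norm((I - B) @ np.linalg.matrix_power(I + B, nu), 2)
    assert lhs <= 2 ** (nu + 1) * math.sqrt(2.0 / (math.pi * nu))
print("bounds of lemma 9.4.11 confirmed")
```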

Corollary 9.4.12 Let ‖·‖ be any induced matrix norm. Assume that for a linear iterative method with iteration matrix I − Mℓ^{-1}Aℓ we have

‖I − Mℓ^{-1}Aℓ‖ ≤ 1   (9.43)

Then for Sℓ := I − (1/2)Mℓ^{-1}Aℓ the following smoothing property holds:

‖AℓSℓ^ν‖ ≤ 2 √(2/(πν)) ‖Mℓ‖ ,   ν = 1, 2, . . .


Proof. Define B := I − Mℓ^{-1}Aℓ and apply lemma 9.4.11:

‖AℓSℓ^ν‖ ≤ ‖Mℓ‖ (1/2)^ν ‖(I − B)(I + B)^ν‖ ≤ 2 √(2/(πν)) ‖Mℓ‖

Remark 9.4.13 Note that in the smoother in corollary 9.4.12 we use damping with a factor 1/2. Generalizations of the results in lemma 9.4.11 and corollary 9.4.12 are given in [66, 49, 32]. In [66, 32] it is shown that the damping factor 1/2 can be replaced by an arbitrary damping factor ω ∈ (0, 1). Also note that in the smoothing property in corollary 9.4.12 we have a ν-dependence of the form ν^{-1/2}, whereas in the symmetric case this is of the form ν^{-1}. In [49] it is noted that this loss of a factor ν^{1/2} when going to the nonsymmetric case is due to the fact that complex eigenvalues may occur. Assume that Mℓ^{-1}Aℓ is a normal matrix. The assumption (9.43) implies that σ(Mℓ^{-1}Aℓ) ⊂ K := { z ∈ C | |1 − z| ≤ 1 }. We have:

‖Mℓ^{-1}Aℓ(I − (1/2)Mℓ^{-1}Aℓ)^ν‖_2^2 ≤ max_{z∈K} |z(1 − (1/2)z)^ν|^2 = max_{z∈∂K} |z(1 − (1/2)z)^ν|^2
= max_{φ∈[0,2π]} |1 + e^{iφ}|^2 |1/2 − (1/2)e^{iφ}|^{2ν}
= max_{φ∈[0,2π]} 4(1/2 + (1/2)cos φ)(1/2 − (1/2)cos φ)^ν
= max_{ξ∈[0,1]} 4ξ(1 − ξ)^ν = (4/(ν + 1)) (ν/(ν + 1))^ν

Note that the latter function of ν also occurs in the proof of lemma 9.4.6. We conclude that for the class of normal matrices Mℓ^{-1}Aℓ an estimate of the form

‖Mℓ^{-1}Aℓ(I − (1/2)Mℓ^{-1}Aℓ)^ν‖2 ≤ c/√ν ,   ν = 1, 2, . . .

is sharp with respect to the ν-dependence.

To verify the condition in (9.43) we will use the following elementary result:

Lemma 9.4.14 If E ∈ R^{m×m} is such that there exists a c > 0 with

‖Ex‖_2^2 ≤ c⟨Ex, x⟩ for all x ∈ R^m

then we have ‖I − ωE‖2 ≤ 1 for all ω ∈ [0, 2/c].

Proof. This follows from:

‖(I − ωE)x‖_2^2 = ‖x‖_2^2 − 2ω⟨Ex, x⟩ + ω^2‖Ex‖_2^2
≤ ‖x‖_2^2 − ω(2/c − ω)‖Ex‖_2^2
≤ ‖x‖_2^2 if ω(2/c − ω) ≥ 0

We now use these results to derive a smoothing property for the Richardson method.


Theorem 9.4.15 Assume that the bilinear form satisfies the usual conditions (3.42). Let Aℓ be the stiffness matrix in (9.22). There exist constants ω > 0 and c independent of ℓ such that the following smoothing property holds:

‖Aℓ(I − ω hℓ^{2−n}Aℓ)^ν‖2 ≤ (c/√ν) ‖Aℓ‖2 ,   ν = 1, 2, . . .

Proof. Using lemma 9.4.3, the inverse inequality and the ellipticity of the bilinear form we get, for arbitrary x ∈ R^{nℓ}:

‖Aℓx‖2 = max_{y∈R^{nℓ}} ⟨Aℓx, y⟩/‖y‖2 ≤ c hℓ^{n/2} max_{vℓ∈Vℓ} k(Pℓx, vℓ)/‖vℓ‖L2
≤ c hℓ^{n/2} max_{vℓ∈Vℓ} |Pℓx|1 |vℓ|1/‖vℓ‖L2 ≤ c hℓ^{n/2−1} |Pℓx|1
≤ c hℓ^{n/2−1} k(Pℓx, Pℓx)^{1/2} = c hℓ^{n/2−1} ⟨Aℓx, x⟩^{1/2}

From this and lemma 9.4.14 it follows that there exists a constant ω > 0 such that

‖I − 2ω hℓ^{2−n}Aℓ‖2 ≤ 1 for all ℓ   (9.44)

Define Mℓ := (1/(2ω)) hℓ^{n−2} I. From lemma 9.4.4 it follows that there exists a constant cM independent of ℓ such that ‖Mℓ‖2 ≤ cM‖Aℓ‖2. Application of corollary 9.4.12 proves the result of the theorem.

We now consider the damped Jacobi method.

Theorem 9.4.16 Assume that the bilinear form satisfies the usual conditions (3.42). Let Aℓ be the stiffness matrix in (9.22) and Dℓ = diag(Aℓ). There exist constants ω > 0 and c independent of ℓ such that the following smoothing property holds:

‖Aℓ(I − ωDℓ^{-1}Aℓ)^ν‖2 ≤ (c/√ν) ‖Aℓ‖2 ,   ν = 1, 2, . . .

Proof. We use the matrix norm induced by the vector norm ‖y‖D := ‖Dℓ^{1/2}y‖2 for y ∈ R^{nℓ}. Note that for B ∈ R^{nℓ×nℓ} we have ‖B‖D = ‖Dℓ^{1/2}BDℓ^{-1/2}‖2. The inequalities

‖Dℓ^{-1}‖2 ≤ c1 hℓ^{2−n} ,   κ(Dℓ) ≤ c2   (9.45)

hold with constants c1, c2 independent of ℓ. Using this in combination with lemma 9.4.3, the inverse inequality and the ellipticity of the bilinear form we get, for arbitrary x ∈ R^{nℓ}:

‖Dℓ^{-1/2}AℓDℓ^{-1/2}x‖2 = max_{y∈R^{nℓ}} ⟨AℓDℓ^{-1/2}x, Dℓ^{-1/2}y⟩/‖y‖2 = max_{y∈R^{nℓ}} k(PℓDℓ^{-1/2}x, PℓDℓ^{-1/2}y)/‖y‖2
≤ c hℓ^{-1} max_{y∈R^{nℓ}} |PℓDℓ^{-1/2}x|1 ‖PℓDℓ^{-1/2}y‖L2 / ‖y‖2
≤ c hℓ^{n/2−1} |PℓDℓ^{-1/2}x|1 ‖Dℓ^{-1/2}‖2 ≤ c |PℓDℓ^{-1/2}x|1
≤ c k(PℓDℓ^{-1/2}x, PℓDℓ^{-1/2}x)^{1/2} = c ⟨Dℓ^{-1/2}AℓDℓ^{-1/2}x, x⟩^{1/2}

From this and lemma 9.4.14 it follows that there exists a constant ω > 0 such that

‖I − 2ωDℓ^{-1}Aℓ‖D = ‖I − 2ωDℓ^{-1/2}AℓDℓ^{-1/2}‖2 ≤ 1 for all ℓ

Define Mℓ := (1/(2ω))Dℓ. Application of corollary 9.4.12 with ‖·‖ = ‖·‖D in combination with (9.45) yields

‖Aℓ(I − ωDℓ^{-1}Aℓ)^ν‖2 ≤ κ(Dℓ^{1/2}) ‖Aℓ(I − (1/2)Mℓ^{-1}Aℓ)^ν‖D ≤ (c/√ν) ‖Mℓ‖D = (c/(2ω√ν)) ‖Dℓ‖2 ≤ (c/√ν) ‖Aℓ‖2

and thus the result is proved.

9.4.4 Multigrid contraction number

In this section we prove a bound for the contraction number in the Euclidean norm of the multigrid algorithm (9.25) with τ ≥ 2. We follow the analysis in [44, 48]. Apart from the approximation and smoothing properties that have been proved in sections 9.4.2 and 9.4.3, we also need the following stability bound for the iteration matrix of the smoother:

∃ CS : ‖Sℓ^ν‖2 ≤ CS for all ℓ and ν   (9.46)

Lemma 9.4.17 Consider the Richardson method as in theorem 9.4.7 or theorem 9.4.15. In both cases (9.46) holds with CS = 1.

Proof. In the symmetric case (theorem 9.4.7) we have

‖Sℓ‖2 = ‖I − (c0/ρ(Aℓ))Aℓ‖2 = max_{λ∈σ(Aℓ)} |1 − c0λ/ρ(Aℓ)| ≤ 1

For the general case (theorem 9.4.15) we have, using (9.44):

‖Sℓ‖2 = ‖I − ω hℓ^{2−n}Aℓ‖2 = ‖(1/2)I + (1/2)(I − 2ω hℓ^{2−n}Aℓ)‖2 ≤ 1/2 + (1/2)‖I − 2ω hℓ^{2−n}Aℓ‖2 ≤ 1

Lemma 9.4.18 Consider the damped Jacobi method as in theorem 9.4.8 or theorem 9.4.16. In both cases (9.46) holds.

Proof. Both in the symmetric and nonsymmetric case we have

‖Sℓ‖D = ‖Dℓ^{1/2}(I − ωDℓ^{-1}Aℓ)Dℓ^{-1/2}‖2 ≤ 1

and thus

‖Sℓ^ν‖2 = ‖Dℓ^{-1/2}(Dℓ^{1/2}SℓDℓ^{-1/2})^ν Dℓ^{1/2}‖2 ≤ κ(Dℓ^{1/2}) ‖Sℓ‖D^ν ≤ κ(Dℓ^{1/2})

Now note that Dℓ is uniformly (w.r.t. ℓ) well-conditioned.

Treatment of symmetric Gauss-Seidel method: in preparation.


Using lemma 9.4.3 it follows that for pℓ = Pℓ^{-1}Pℓ−1 we have

Cp,1‖x‖2 ≤ ‖pℓx‖2 ≤ Cp,2‖x‖2 for all x ∈ R^{nℓ−1}   (9.47)

with constants Cp,1 > 0 and Cp,2 independent of ℓ. We now formulate a main convergence result for the multigrid method.

Theorem 9.4.19 Consider the multigrid method with iteration matrix given in (9.30) and pa-rameter values ν2 = 0, ν1 = ν > 0, τ ≥ 2. Assume that there are constants CA, CS and amonotonically decreasing function g(ν) with g(ν) → 0 for ν → ∞ such that for all ℓ:

‖A_ℓ^{−1} − p_ℓ A_{ℓ−1}^{−1} r_ℓ‖_2 ≤ C_A ‖A_ℓ‖_2^{−1}     (9.48a)

‖A_ℓ S_ℓ^ν‖_2 ≤ g(ν) ‖A_ℓ‖_2 , ν ≥ 1     (9.48b)

‖S_ℓ^ν‖_2 ≤ C_S , ν ≥ 1     (9.48c)

For any ξ∗ ∈ (0, 1) there exists a ν∗ such that for all ν ≥ ν∗

‖C_{MG,ℓ}‖_2 ≤ ξ∗ , ℓ = 0, 1, . . .

holds.

Proof. For the two-grid iteration matrix we have

‖C_{TG,ℓ}‖_2 ≤ ‖A_ℓ^{−1} − p_ℓ A_{ℓ−1}^{−1} r_ℓ‖_2 ‖A_ℓ S_ℓ^ν‖_2 ≤ C_A g(ν)

Define ξ_ℓ = ‖C_{MG,ℓ}‖_2. From (9.30) we obtain ξ₀ = 0 and for ℓ ≥ 1:

ξ_ℓ ≤ C_A g(ν) + ‖p_ℓ‖_2 ξ_{ℓ−1}^τ ‖A_{ℓ−1}^{−1} r_ℓ A_ℓ S_ℓ^ν‖_2
    ≤ C_A g(ν) + C_{p,2} C_{p,1}^{−1} ξ_{ℓ−1}^τ ‖p_ℓ A_{ℓ−1}^{−1} r_ℓ A_ℓ S_ℓ^ν‖_2
    ≤ C_A g(ν) + C_{p,2} C_{p,1}^{−1} ξ_{ℓ−1}^τ ( ‖(I − p_ℓ A_{ℓ−1}^{−1} r_ℓ A_ℓ) S_ℓ^ν‖_2 + ‖S_ℓ^ν‖_2 )
    ≤ C_A g(ν) + C_{p,2} C_{p,1}^{−1} ξ_{ℓ−1}^τ ( C_A g(ν) + C_S )
    ≤ C_A g(ν) + C_* ξ_{ℓ−1}^τ

with C_* := C_{p,2} C_{p,1}^{−1} (C_A g(1) + C_S). Elementary analysis shows that for τ ≥ 2 and any ξ∗ ∈ (0, 1) the sequence x₀ = 0, x_i = C_A g(ν) + C_* x_{i−1}^τ, i ≥ 1, is bounded by ξ∗ for g(ν) sufficiently small.
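The "elementary analysis" in the last step can be made concrete: for τ = 2 the recursion x_i = a + C x_{i−1}² (with a = C_A g(ν) and C = C_*) converges to the smaller root of x = a + C x² whenever 4aC < 1. A numerical sketch with hypothetical values for a and C:

```python
import math

# Recursion x_i = a + C * x_{i-1}^tau with a = C_A g(nu), C = C_*  (here tau = 2).
a, C, tau = 0.05, 2.0, 2          # hypothetical illustrative values
x = 0.0
for _ in range(200):
    x = a + C * x ** tau

# For tau = 2 the limit is the smaller root of x = a + C x^2:
x_star = (1 - math.sqrt(1 - 4 * a * C)) / (2 * C)
assert abs(x - x_star) < 1e-12 and x_star < 1
```

Making g(ν) (hence a) small pushes this limit, and therefore the multigrid contraction bound, below any prescribed ξ∗ ∈ (0, 1).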

Remark 9.4.20 Consider A_ℓ, p_ℓ, r_ℓ as defined in (9.22), (9.23), (9.24). Assume that the variational problem (9.20) is such that the usual conditions (3.42) are satisfied. Moreover, the problem (9.20) and the corresponding dual problem are assumed to be H²-regular. In the multigrid method we use the Richardson or the damped Jacobi method described in section 9.4.3. Then the assumptions (9.48) are fulfilled and thus for ν₂ = 0 and ν₁ sufficiently large the multigrid W-cycle has a contraction number smaller than one independent of ℓ.

Remark 9.4.21 Let C_{MG,ℓ}(ν₂, ν₁) be the iteration matrix of the multigrid method with ν₁ pre- and ν₂ postsmoothing iterations. With ν := ν₁ + ν₂ we have

ρ( C_{MG,ℓ}(ν₂, ν₁) ) = ρ( C_{MG,ℓ}(0, ν) ) ≤ ‖C_{MG,ℓ}(0, ν)‖_2

Using theorem 9.4.19 we thus get, for τ ≥ 2, a bound for the spectral radius of the iteration matrix C_{MG,ℓ}(ν₂, ν₁).


Remark 9.4.22 Note on other convergence analyses: Xu, Yserentant (quasi-uniformity not needed in BPX). Comment on regularity. Book Bramble.

9.4.5 Convergence analysis for symmetric positive definite problems

In this section we analyze the convergence of the multigrid method for the symmetric positive definite case, i.e., the stiffness matrix A_ℓ is assumed to be symmetric positive definite. This property allows a refined analysis which proves that the contraction number of the multigrid method with τ ≥ 1 (the V-cycle is included!) and ν₁ = ν₂ ≥ 1 pre- and postsmoothing iterations is bounded by a constant smaller than one independent of ℓ. The basic idea of this analysis is due to [18] and is further simplified in [44, 48].

Throughout this section we make the following

Assumption 9.4.23 In the bilinear form k(·, ·) in (9.20) we have b = 0 and the conditions (3.42) are satisfied.

Due to this the stiffness matrix A_ℓ is symmetric positive definite and we can define the energy scalar product and corresponding norm:

〈x, y〉_A := 〈A_ℓ x, y〉 , ‖x‖_A := 〈x, x〉_A^{1/2} , x, y ∈ R^{n_ℓ}

We only consider smoothers with an iteration matrix S_ℓ = I − M_ℓ^{−1} A_ℓ in which M_ℓ is symmetric positive definite. Important examples are the smoothers analyzed in section 9.4.3:

Richardson method : M_ℓ = c_0^{−1} ρ(A_ℓ) I , c_0 ∈ (0, 1]     (9.49a)
Damped Jacobi : M_ℓ = ω^{−1} D_ℓ , ω as in thm. 9.4.8     (9.49b)
Symm. Gauss-Seidel : M_ℓ = (D_ℓ − L_ℓ) D_ℓ^{−1} (D_ℓ − L_ℓ^T)     (9.49c)

For symmetric matrices B, C ∈ R^{m×m} we use the notation B ≤ C iff 〈Bx, x〉 ≤ 〈Cx, x〉 for all x ∈ R^m.

Lemma 9.4.24 For Mℓ as in (9.49) the following properties hold:

A_ℓ ≤ M_ℓ for all ℓ     (9.50a)

∃ C_M : ‖M_ℓ‖_2 ≤ C_M ‖A_ℓ‖_2 for all ℓ     (9.50b)

Proof. For the Richardson method the result is trivial. For the damped Jacobi method we have ω ∈ (0, ρ(D_ℓ^{−1} A_ℓ)^{−1}] and thus ω ρ(D_ℓ^{−1/2} A_ℓ D_ℓ^{−1/2}) ≤ 1. This yields A_ℓ ≤ ω^{−1} D_ℓ = M_ℓ. The result in (9.50b) follows from ‖D_ℓ‖_2 ≤ ‖A_ℓ‖_2. For the symmetric Gauss-Seidel method the result (9.50a) follows from M_ℓ = A_ℓ + L_ℓ D_ℓ^{−1} L_ℓ^T and the result in (9.50b) is proved in (9.42).
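The identity M_ℓ = A_ℓ + L_ℓ D_ℓ^{−1} L_ℓ^T used for the symmetric Gauss-Seidel smoother is easy to confirm numerically; the random symmetric positive definite matrix below is a stand-in for a stiffness matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 6))
A = G @ G.T + 6 * np.eye(6)              # random s.p.d. stand-in for A_l

D = np.diag(np.diag(A))
L = -np.tril(A, -1)                      # splitting A = D - L - L^T
M = (D - L) @ np.linalg.inv(D) @ (D - L.T)

assert np.allclose(M, A + L @ np.linalg.inv(D) @ L.T)   # M = A + L D^{-1} L^T
assert np.linalg.eigvalsh(M - A).min() >= -1e-10        # hence A <= M
```

Since L D^{−1} L^T is positive semidefinite whenever D has positive diagonal (which holds for any s.p.d. A), the ordering A ≤ M follows immediately from the identity.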

We introduce the following modified approximation property (we write C̃_A for its constant, to distinguish it from the constant C_A in (9.37)):

∃ C̃_A : ‖M_ℓ^{1/2} ( A_ℓ^{−1} − p_ℓ A_{ℓ−1}^{−1} r_ℓ ) M_ℓ^{1/2}‖_2 ≤ C̃_A for ℓ = 1, 2, . . .     (9.51)

We note that the standard approximation property (9.37) implies the result (9.51) if we consider the smoothers in (9.49):


Lemma 9.4.25 Consider M_ℓ as in (9.49) and assume that the approximation property (9.37) holds. Then (9.51) holds with C̃_A = C_M C_A.

Proof. Trivial.

One easily verifies that for the smoothers in (9.49) the modified approximation property (9.51) implies the standard approximation property (9.37) if κ(M_ℓ) is uniformly (w.r.t. ℓ) bounded. The latter property holds for the Richardson and the damped Jacobi method.

We will analyze the convergence of the two-grid and multigrid method using the energy scalar product. For matrices B, C ∈ R^{n_ℓ×n_ℓ} that are symmetric w.r.t. 〈·, ·〉_A we use the notation B ≤_A C iff 〈Bx, x〉_A ≤ 〈Cx, x〉_A for all x ∈ R^{n_ℓ}. Note that B ∈ R^{n_ℓ×n_ℓ} is symmetric w.r.t. 〈·, ·〉_A iff (A_ℓ B)^T = A_ℓ B holds. We also note the following elementary property for symmetric matrices B, C ∈ R^{n_ℓ×n_ℓ}:

B ≤ C ⇔ B A_ℓ ≤_A C A_ℓ     (9.52)

We now turn to the two-grid method. For the coarse grid correction we introduce the notation Q_ℓ := I − p_ℓ A_{ℓ−1}^{−1} r_ℓ A_ℓ. For symmetry reasons we only consider ν₁ = ν₂ = ½ν with ν > 0 even. The iteration matrix of the two-grid method is given by

C_{TG,ℓ} = C_{TG,ℓ}(ν) = S_ℓ^{ν/2} Q_ℓ S_ℓ^{ν/2}

Due to the symmetric positive definite setting we have the following fundamental property:

Theorem 9.4.26 The matrix Q_ℓ is an orthogonal projection w.r.t. 〈·, ·〉_A.

Proof. Follows from

Q_ℓ² = Q_ℓ and (A_ℓ Q_ℓ)^T = A_ℓ Q_ℓ

As a direct consequence we have

0 ≤_A Q_ℓ ≤_A I     (9.53)

The next lemma gives another characterization of the modified approximation property:

Lemma 9.4.27 The property (9.51) is equivalent to

0 ≤_A Q_ℓ ≤_A C̃_A M_ℓ^{−1} A_ℓ for ℓ = 1, 2, . . .     (9.54)

Proof. Using (9.52) we get

‖M_ℓ^{1/2} ( A_ℓ^{−1} − p_ℓ A_{ℓ−1}^{−1} r_ℓ ) M_ℓ^{1/2}‖_2 ≤ C̃_A for all ℓ
⇔ −C̃_A I ≤ M_ℓ^{1/2} ( A_ℓ^{−1} − p_ℓ A_{ℓ−1}^{−1} r_ℓ ) M_ℓ^{1/2} ≤ C̃_A I for all ℓ
⇔ −C̃_A M_ℓ^{−1} ≤ A_ℓ^{−1} − p_ℓ A_{ℓ−1}^{−1} r_ℓ ≤ C̃_A M_ℓ^{−1} for all ℓ
⇔ −C̃_A M_ℓ^{−1} A_ℓ ≤_A Q_ℓ ≤_A C̃_A M_ℓ^{−1} A_ℓ for all ℓ

In combination with (9.53) this proves the result.

We now present a convergence result for the two-grid method:


Theorem 9.4.28 Assume that (9.50a) and (9.51) hold. Then we have

‖C_{TG,ℓ}(ν)‖_A ≤ max_{y∈[0,1]} y (1 − C̃_A^{−1} y)^ν
  = (1 − C̃_A^{−1})^ν                      if ν ≤ C̃_A − 1 ,
  = (C̃_A/(ν+1)) (ν/(ν+1))^ν               if ν ≥ C̃_A − 1     (9.55)

Proof. Define X_ℓ := M_ℓ^{−1} A_ℓ. This matrix is symmetric w.r.t. the energy scalar product and from (9.50a) it follows that

0 ≤_A X_ℓ ≤_A I     (9.56)

holds. From lemma 9.4.27 we obtain 0 ≤_A Q_ℓ ≤_A C̃_A X_ℓ. Note that due to this, (9.56) and the fact that Q_ℓ is an A-orthogonal projection which is not identically zero we get

C̃_A ≥ 1     (9.57)

Using (9.53) we get

0 ≤_A Q_ℓ ≤_A α C̃_A X_ℓ + (1 − α) I for all α ∈ [0, 1]     (9.58)

Hence, using S_ℓ = I − X_ℓ we have

0 ≤_A C_{TG,ℓ}(ν) ≤_A (I − X_ℓ)^{ν/2} ( α C̃_A X_ℓ + (1 − α) I ) (I − X_ℓ)^{ν/2} for all α ∈ [0, 1] ,

and thus

‖C_{TG,ℓ}(ν)‖_A ≤ min_{α∈[0,1]} max_{x∈[0,1]} ( α C̃_A x + (1 − α) ) (1 − x)^ν

A minimax result (cf., for example, [83]) shows that in the previous expression the min and max operations can be interchanged. A simple computation yields

max_{x∈[0,1]} min_{α∈[0,1]} ( α C̃_A x + (1 − α) ) (1 − x)^ν
  = max{ max_{x∈[0,C̃_A^{−1}]} C̃_A x (1 − x)^ν , max_{x∈[C̃_A^{−1},1]} (1 − x)^ν }
  = max_{x∈[0,C̃_A^{−1}]} C̃_A x (1 − x)^ν = max_{y∈[0,1]} y (1 − C̃_A^{−1} y)^ν

This proves the inequality in (9.55). An elementary computation shows that the equality in (9.55) holds.
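The closed form of the maximum in (9.55) can be checked against a brute-force evaluation of y(1 − y/c)^ν on a fine grid (a verification sketch; c stands for the approximation constant, and the test values of c and ν are arbitrary choices):

```python
import numpy as np

def two_grid_bound(c, nu):
    # closed form of max_{y in [0,1]} y (1 - y/c)^nu, cf. (9.55)
    if nu <= c - 1:
        return (1 - 1 / c) ** nu
    return (c / (nu + 1)) * (nu / (nu + 1)) ** nu

y = np.linspace(0.0, 1.0, 200001)
for c, nu in [(4.0, 2), (4.0, 8), (1.5, 1)]:
    brute = (y * (1 - y / c) ** nu).max()
    assert abs(brute - two_grid_bound(c, nu)) < 1e-6
```

The case split corresponds to whether the critical point y* = c/(ν+1) of y(1 − y/c)^ν lies inside [0, 1] (ν ≥ c − 1) or the maximum is attained at y = 1.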

We now show that the approach used in the convergence analysis of the two-grid method in theorem 9.4.28 can also be used for the multigrid method. We start with an elementary result concerning a fixed point iteration that will be used in theorem 9.4.30.

Lemma 9.4.29 For given constants c > 1, ν ≥ 1 define g : [0, 1) → R by

g(ξ) =
  (1 − 1/c)^ν                                                   if 0 ≤ ξ < 1 − ν/(c−1) ,
  (c/(ν+1)) (ν/(ν+1))^ν (1 − ξ) ( 1 + (1/c) ξ/(1−ξ) )^{ν+1}     if 1 − ν/(c−1) ≤ ξ < 1     (9.59)


For τ ∈ N, τ ≥ 1, define the sequence ξ_{τ,0} = 0, ξ_{τ,i+1} = g(ξ_{τ,i}^τ) for i ≥ 0. The following holds:

∗ ξ → g(ξ) is continuous and increasing on [0, 1)

∗ For c = C̃_A, g(0) coincides with the upper bound in (9.55)

∗ g(ξ) = ξ iff ξ = c/(c + ν)

∗ The sequence (ξ_{τ,i})_{i≥0} is monotonically increasing, and ξ*_τ := lim_{i→∞} ξ_{τ,i} < 1

∗ ( (ξ*_τ)^τ , ξ*_τ ) is the first intersection point of the graphs of g(ξ) and ξ^{1/τ}

∗ c/(c + ν) = ξ*_1 ≥ ξ*_2 ≥ . . . ≥ ξ*_∞ = g(0)

Proof. Elementary calculus.

As an illustration for two pairs (c, ν) we show the graph of the function g in figure 9.6.


Figure 9.6: Function g(ξ) for ν = 2, c = 4 (left) and ν = 4, c = 4 (right).
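The fixed point ξ*_τ of lemma 9.4.29 can be computed by simply iterating ξ_{i+1} = g(ξ_i^τ). The sketch below reconstructs g from (9.59) for c = 4, ν = 2 (matching the left panel of figure 9.6) and checks the bound ξ*_τ ≤ c/(c + ν):

```python
def g(xi, c, nu):
    # the function from (9.59)
    if xi < 1 - nu / (c - 1):
        return (1 - 1 / c) ** nu
    return (c / (nu + 1)) * (nu / (nu + 1)) ** nu * (1 - xi) * (
        1 + xi / (c * (1 - xi))) ** (nu + 1)

c, nu = 4.0, 2
for tau in (1, 2, 3):
    xi = 0.0
    for _ in range(500):
        xi = g(xi ** tau, c, nu)
    assert xi <= c / (c + nu) + 1e-9     # xi*_tau <= xi*_1 = c/(c + nu)
```

Since g is increasing and the iterate starts at 0, the sequence is monotonically increasing and stays below the fixed point, so the assertion holds independently of the convergence speed.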

Theorem 9.4.30 We take ν₁ = ν₂ = ½ν (with ν ≥ 2 even) and consider the multigrid algorithm with iteration matrix C_{MG,ℓ} = C_{MG,ℓ}(ν, τ) as in (9.30). Assume that (9.50a) and (9.51) hold. For c = C̃_A and τ as in (9.30) let ξ*_τ ≤ c/(c + ν) be the fixed point defined in lemma 9.4.29. Then

‖C_{MG,ℓ}‖_A ≤ ξ*_τ

holds.

Proof. From (9.30) we have

C_{MG,ℓ} = S_ℓ^{ν/2} ( I − p_ℓ (I − C_{MG,ℓ−1}^τ) A_{ℓ−1}^{−1} r_ℓ A_ℓ ) S_ℓ^{ν/2}
         = S_ℓ^{ν/2} (Q_ℓ + R_ℓ) S_ℓ^{ν/2} , R_ℓ := p_ℓ C_{MG,ℓ−1}^τ A_{ℓ−1}^{−1} r_ℓ A_ℓ

The matrices S_ℓ and Q_ℓ are symmetric w.r.t. 〈·, ·〉_A. If C_{MG,ℓ−1} is symmetric w.r.t. 〈·, ·〉_{A_{ℓ−1}} then from

(A_ℓ R_ℓ)^T = [ (A_ℓ p_ℓ A_{ℓ−1}^{−1}) (A_{ℓ−1} C_{MG,ℓ−1}^τ) (A_{ℓ−1}^{−1} r_ℓ A_ℓ) ]^T = A_ℓ R_ℓ


it follows that R_ℓ is symmetric w.r.t. 〈·, ·〉_A, too. By induction we conclude that for all ℓ the matrices R_ℓ and C_{MG,ℓ} are symmetric w.r.t. 〈·, ·〉_A. Note that

0 ≤_A C_{MG,ℓ−1}^τ ⇔ 0 ≤ C_{MG,ℓ−1}^τ A_{ℓ−1}^{−1} ⇔ 0 ≤ p_ℓ C_{MG,ℓ−1}^τ A_{ℓ−1}^{−1} r_ℓ ⇔ 0 ≤_A R_ℓ

holds. Thus, by induction and using 0 ≤_A Q_ℓ we get

0 ≤_A Q_ℓ + R_ℓ , 0 ≤_A C_{MG,ℓ} for all ℓ     (9.60)

For ℓ ≥ 0 define ξ_ℓ := ‖C_{MG,ℓ}‖_A. Hence, 0 ≤_A C_{MG,ℓ} ≤_A ξ_ℓ I holds. For arbitrary x ∈ R^{n_ℓ} we have

〈R_ℓ x, x〉_A = 〈C_{MG,ℓ−1}^τ A_{ℓ−1}^{−1} r_ℓ A_ℓ x, A_{ℓ−1}^{−1} r_ℓ A_ℓ x〉_{A_{ℓ−1}}
            ≤ ξ_{ℓ−1}^τ 〈A_{ℓ−1}^{−1} r_ℓ A_ℓ x, A_{ℓ−1}^{−1} r_ℓ A_ℓ x〉_{A_{ℓ−1}}
            = ξ_{ℓ−1}^τ 〈x, (I − Q_ℓ) x〉_A

and thus

R_ℓ ≤_A ξ_{ℓ−1}^τ (I − Q_ℓ)     (9.61)

holds. Define X_ℓ := M_ℓ^{−1} A_ℓ. Using (9.58), (9.60) and (9.61) we get

0 ≤_A Q_ℓ + R_ℓ ≤_A (1 − ξ_{ℓ−1}^τ) Q_ℓ + ξ_{ℓ−1}^τ I
              ≤_A (1 − ξ_{ℓ−1}^τ) ( α C̃_A X_ℓ + (1 − α) I ) + ξ_{ℓ−1}^τ I for all α ∈ [0, 1]

Hence, for all α ∈ [0, 1] we have

0 ≤_A C_{MG,ℓ} ≤_A (I − X_ℓ)^{ν/2} [ (1 − ξ_{ℓ−1}^τ) ( α C̃_A X_ℓ + (1 − α) I ) + ξ_{ℓ−1}^τ I ] (I − X_ℓ)^{ν/2}

This yields

ξ_ℓ ≤ min_{α∈[0,1]} max_{x∈[0,1]} [ (1 − ξ_{ℓ−1}^τ) ( α C̃_A x + 1 − α ) + ξ_{ℓ−1}^τ ] (1 − x)^ν

As in the proof of theorem 9.4.28 we can interchange the min and max operations in the previous expression. A simple computation shows that for ξ ∈ [0, 1] we have

max_{x∈[0,1]} min_{α∈[0,1]} [ (1 − ξ) ( α C̃_A x + 1 − α ) + ξ ] (1 − x)^ν
  = max{ max_{x∈[0,C̃_A^{−1}]} ( (1 − ξ) C̃_A x + ξ ) (1 − x)^ν , max_{x∈[C̃_A^{−1},1]} (1 − x)^ν }
  = g(ξ)

where g(ξ) is the function defined in lemma 9.4.29 with c = C̃_A. Thus ξ_ℓ satisfies ξ₀ = 0 and ξ_ℓ ≤ g(ξ_{ℓ−1}^τ) for ℓ ≥ 1. Application of the results in lemma 9.4.29 completes the proof.

The bound ξ*_τ for the multigrid contraction number in theorem 9.4.30 decreases if τ increases. Moreover, for τ → ∞ the bound converges to the bound for the two-grid contraction number in theorem 9.4.28.

Corollary 9.4.31 Consider A_ℓ, p_ℓ, r_ℓ as defined in (9.22), (9.23), (9.24). Assume that the variational problem (9.20) is such that b = 0 and that the usual conditions (3.42) are satisfied. Moreover, the problem is assumed to be H²-regular. In the multigrid method we use one of the smoothers (9.49). Then the assumptions (9.50a) and (9.51) are satisfied and thus for ν₁ = ν₂ ≥ 1 the multigrid V-cycle has a contraction number (w.r.t. ‖ · ‖_A) smaller than one independent of ℓ.


9.5 Multigrid for convection-dominated problems

9.6 Nested Iteration

We consider a sequence of discretizations of a given boundary value problem, as for example in (9.22):

A_ℓ x_ℓ = b_ℓ , ℓ = 0, 1, 2, . . .

We assume that for a certain level ℓ = ℓ̄ we want to compute the solution x*_{ℓ̄} of the problem A_{ℓ̄} x_{ℓ̄} = b_{ℓ̄} using an iterative method (not necessarily a multigrid method). In the nested iteration method we use the systems on coarse grids to obtain a good starting vector x⁰_{ℓ̄} for this iterative method with relatively low computational costs. The nested iteration method for the computation of this starting vector x⁰_{ℓ̄} is as follows:

compute the solution x*₀ of A₀ x₀ = b₀
x⁰₁ := p̃₁ x*₀ (prolongation of x*₀)
x^k₁ := result of k iterations of an iterative method applied to A₁ x₁ = b₁ with starting vector x⁰₁
x⁰₂ := p̃₂ x^k₁ (prolongation of x^k₁)
x^k₂ := result of k iterations of an iterative method applied to A₂ x₂ = b₂ with starting vector x⁰₂
. . . etc. . . .
x⁰_{ℓ̄} := p̃_{ℓ̄} x^k_{ℓ̄−1} .     (9.62)

In this nested iteration method we use a prolongation p̃_ℓ : R^{n_{ℓ−1}} → R^{n_ℓ}. The nested iteration principle is based on the idea that p̃_ℓ x*_{ℓ−1} should be a reasonable approximation of x*_ℓ, because A_{ℓ−1} x*_{ℓ−1} = b_{ℓ−1} and A_ℓ x*_ℓ = b_ℓ are discretizations of the same continuous problem. With respect to the computational costs of this approach we note the following (cf. Hackbusch [44], section 5.3). For the nested iteration to be a feasible approach, the number of iterations applied on the coarse grids (i.e. k in (9.62)) should not be "too large" and the number of grid points in the union of all coarse grids (i.e. levels 0, 1, 2, . . . , ℓ̄ − 1) should be at most of the same order of magnitude as the number of grid points in the level ℓ̄ grid. Often, if one uses a multigrid solver these two conditions are satisfied. Usually in multigrid we use coarse grids such that the number of grid points decreases in a geometric fashion, and for k in (9.62) we can often take k = 1 or k = 2 due to the fact that on the coarse grids we use the multigrid method, which has a high rate of convergence.

Note that if one uses the algorithm MGM_ℓ from (9.25) as the solver on level ℓ̄ then the implementation of the nested iteration method can be done with only little additional effort because the coarse grid data structures and coarse grid operators (e.g. A_ℓ, ℓ < ℓ̄) needed in the nested iteration method are already available.

If in the nested iteration method we use a multigrid iterative solver on all levels we obtain



Figure 9.7: Multigrid and nested iteration.

the following algorithmic structure:

x*₀ := A₀^{−1} b₀ ; x^k₀ := x*₀
for ℓ = 1 to ℓ̄ do
begin
    x⁰_ℓ := p̃_ℓ x^k_{ℓ−1}
    for i = 1 to k do x^i_ℓ := MGM_ℓ(x^{i−1}_ℓ, b_ℓ)
end;     (9.63)

For the case ℓ̄ = 3 and k = 1 this method is illustrated in figure 9.7.
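The scheme (9.63) can be sketched in a few lines of generic code. Everything below (the helper names, the damped-Jacobi "solver" standing in for a multigrid cycle, and the toy two-level 1D Poisson data) is a hypothetical illustration, not an implementation from the text:

```python
import numpy as np

def nested_iteration(A, b, prolong, solve_step, k=1):
    # A, b: lists of level matrices / right-hand sides, level 0 (coarsest) first.
    # prolong[l]: prolongation matrix from level l-1 to level l (l >= 1).
    # solve_step: one iteration of some solver (a multigrid cycle in the text).
    x = np.linalg.solve(A[0], b[0])          # exact solve on the coarsest level
    for l in range(1, len(A)):
        x = prolong[l] @ x                   # starting vector on level l
        for _ in range(k):                   # k solver iterations on level l
            x = solve_step(A[l], b[l], x)
    return x

# Toy two-level 1D Poisson data (illustrative only).
def lap(n):
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def jacobi(A, b, x, omega=0.6):              # damped Jacobi as the "solver"
    return x + omega * (b - A @ x) / np.diag(A)

P = np.zeros((7, 3))                         # linear interpolation, coarse -> fine
for j in range(3):
    P[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]

A = [lap(3), lap(7)]
b = [P.T @ np.ones(7), np.ones(7)]           # coarse rhs by restriction
x = nested_iteration(A, b, [None, P], jacobi, k=2)
```

The prolonged coarse-grid solution already captures the smooth part of the fine-grid solution, so only a few solver iterations per level are needed, as discussed above.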

Remark 9.6.1 The prolongation p̃_ℓ used in the nested iteration may be the same as the prolongation p_ℓ used in the multigrid method. However, from the point of view of efficiency it is sometimes better to use in the nested iteration a prolongation p̃_ℓ that has a higher order of accuracy than the prolongation used in the multigrid method.

9.7 Numerical experiments

We consider the Poisson model problem described in section 6.6 and apply a multigrid method to this problem. In this section we present some results of numerical experiments and discuss the complexity of the multigrid method for this model problem.

Example 9.7.1 (Poisson model problem) We apply a multigrid algorithm as in (9.25) to the discrete Poisson equation described in section 6.6. For the smoother we use a Gauss-Seidel method. The starting vector is x⁰ = 0. For the parameters in the algorithm we choose ν₁ = 2, ν₂ = 0, τ = 2. We solve the discrete problem on the triangulation T_{h_ℓ} with mesh size h_ℓ ≈ 2^{−ℓ−1}. In table 9.1 we show the error reduction ‖x^{k+1}_ℓ − x*_ℓ‖₂/‖x^k_ℓ − x*_ℓ‖₂ for several values of ℓ and of k. For a better comparison with the basic iterative methods and with the CG method, we also computed the number of iterations (#) needed to reduce the norm of the starting error by a factor 10³. The results are shown in table 9.2. From these results it is clear that the contraction number is not close to one, even if the mesh size is small. In other words, the rate of convergence does not deteriorate if the mesh size h_ℓ becomes smaller. This is a crucial difference compared with basic iterative methods and CG.


ℓ | h_ℓ   | k = 1 | k = 3 | k = 5 | k = 7 | k = 9
3 | 1/16  | 0.080 | 0.055 | 0.061 | 0.067 | 0.070
4 | 1/32  | 0.056 | 0.053 | 0.058 | 0.062 | 0.066
5 | 1/64  | 0.044 | 0.055 | 0.059 | 0.062 | 0.065
6 | 1/128 | 0.043 | 0.054 | 0.058 | 0.061 | 0.063

Table 9.1: Multigrid error reduction, k = 1, 3, . . . , 9.

h_ℓ | 1/16 | 1/32 | 1/64 | 1/128
#   | 3    | 3    | 3    | 3

Table 9.2: # iterations for multigrid method.

Complexity. Consider the situation described in example 9.7.1. Then the arithmetic costs per multigrid iteration are c n_ℓ flops and the error reduction per iteration is bounded by α < 1 with α independent of ℓ (as proved in section 9.4). To obtain a reduction of a starting error by a fixed factor R we then need at most ln R/|ln α| iterations, i.e. the arithmetic costs are approximately (ln R/|ln α|) c n_ℓ flops. We conclude that the multigrid method has complexity c n_ℓ. Note that this is optimal in the sense that for one matrix-vector multiplication A_ℓ x_ℓ we already need O(n_ℓ) flops. A nice feature of multigrid methods is that such an "optimal complexity property" holds for a large class of interesting problems.
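For the numbers of example 9.7.1 this estimate is consistent with table 9.2: with an error reduction per iteration of roughly α ≈ 0.06 (table 9.1) and R = 10³,

```python
import math

R, alpha = 1e3, 0.06     # reduction target and observed contraction rate
iters = math.ceil(math.log(R) / abs(math.log(alpha)))
assert iters == 3        # matches table 9.2 on every mesh
```

i.e. ln(10³)/|ln 0.06| ≈ 6.91/2.81 ≈ 2.5, so three iterations suffice on every grid.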

With respect to the efficiency of multigrid methods we note the following. The rate of convergence will increase if ν₁ + ν₂ or τ is increased. However, in that case also the arithmetic costs per iteration will grow. Analysis of the dependence of the multigrid contraction number on ν₁, ν₂, τ and numerical experiments have shown that for many problems we obtain an efficient method if we take ν₁ + ν₂ ∈ {1, 2, 3, 4} and τ ∈ {1, 2}. In other words, in general many (> 4) smoothing iterations or more than two recursive calls in (9.25) will make a multigrid method less efficient.

Stopping criterion. In general, for the discrete solution x*_ℓ (with corresponding finite element function u_ℓ = P_ℓ x*_ℓ) we have a discretization error, so it does not make sense to solve the discrete problem to machine accuracy. For a large class of elliptic boundary value problems the following estimate for the discretization error holds: ‖u − u_ℓ‖ ≤ c h_ℓ². If in the multigrid iteration one has an arbitrary starting vector (e.g., 0) then the error reduction factor R should be taken proportional to h_ℓ^{−2}. Using the multigrid iteration MGM_ℓ one then needs approximately ln R/|ln α| ≈ ln(c h_ℓ^{−2})/|ln α| ≈ c ln n_ℓ/|ln α| iterations to obtain an approximation with the desired accuracy. Per iteration we need O(n_ℓ) flops. Hence we conclude: When we use a multigrid method for computing an approximation u_ℓ of u with accuracy comparable to the discretization error in u_ℓ, the arithmetic costs are of the order

c n_ℓ ln n_ℓ flops .     (9.64)

Multigrid and nested iteration. For an analysis of the multigrid method used in a nested iteration we refer to Hackbusch [44]. From this analysis it follows that a small fixed number of MGM_ℓ iterations (i.e. k in (9.62)) on each level ℓ ≤ ℓ̄ in the nested iteration method is sufficient to obtain an approximation x_{ℓ̄} of x*_{ℓ̄} with accuracy comparable to the discretization error in x*_{ℓ̄}.


The arithmetic costs of this combined multigrid and nested iteration method are of the order

(c/|ln α|) n_{ℓ̄} flops .     (9.65)

When we compare the costs in (9.64) with the costs in (9.65) we see that using the nested iteration approach results in a more efficient algorithm. From the work estimate in (9.65) we conclude: Using multigrid in combination with nested iteration we can compute an approximation x_{ℓ̄} of x*_{ℓ̄} with accuracy comparable to the discretization error in x*_{ℓ̄} and with arithmetic costs ≤ C n_{ℓ̄} flops (C independent of ℓ̄).

Example 9.7.2 To illustrate the behaviour of multigrid in combination with nested iteration we show numerical results for an example from Hackbusch [44]. In the Poisson problem as in example 9.7.1 we take boundary conditions and a righthand side such that the solution is given by u(x, y) = ½ y³/(x + 1), so we consider:

−∆u = −( 3y/(x + 1) + y³/(x + 1)³ ) in Ω = [0, 1]² ,
u(x, y) = ½ y³/(x + 1) on ∂Ω .

For the discretization we apply linear finite elements on a family of nested uniform triangulations with mesh size h_ℓ = 2^{−ℓ−1}. The discrete solution on level ℓ is denoted by u_ℓ. The discretization error, measured in a weighted Euclidean norm, is given in table 9.3. From these results one can

h_ℓ        | 1/8       | 1/16      | 1/32      | 1/64
‖u_ℓ − u‖₂ | 2.64·10⁻⁵ | 6.89·10⁻⁶ | 1.74·10⁻⁶ | 4.36·10⁻⁷

Table 9.3: Discretization errors.

observe a c h_ℓ² behaviour of the discretization error. We apply the nested iteration approach of section 9.6 in combination with the multigrid method. We start with a coarsest triangulation T_{h_0} with mesh size h₀ = ½ (this contains only one interior grid point). For the prolongation p̃_ℓ used in the nested iteration we take the prolongation p_ℓ as in the multigrid method (linear interpolation). When we apply only one multigrid iteration on each level ℓ ≤ ℓ̄ (i.e. k = 1 in (9.63)) we obtain approximations x⁰_ℓ (= p̃_ℓ x¹_{ℓ−1}) and x¹_ℓ of x*_ℓ (= P_ℓ^{−1} u_ℓ) (cf. figure 9.7). The errors in these approximations are given in table 9.4. In that table we also give the errors for the case with two multigrid iterations on each level (i.e., k = 2 in (9.63)). Comparing the results in table 9.4 with the discretization errors given in table 9.3 we see that we only need two multigrid iterations on each grid to compute an approximation of x*_ℓ (0 ≤ ℓ ≤ ℓ̄) with accuracy comparable to the discretization error in x*_ℓ.

9.8 Algebraic multigrid methods

9.9 Nonlinear multigrid


h_ℓ  | ‖x^i_ℓ − x*_ℓ‖₂, k = 1 | ‖x^i_ℓ − x*_ℓ‖₂, k = 2
1/8  | x⁰₂ 7.24·10⁻³          | x⁰₂ 6.47·10⁻³
     | x¹₂ 5.98·10⁻⁴          | x¹₂ 4.92·10⁻⁴
     |                        | x²₂ 2.86·10⁻⁵
1/16 | x⁰₃ 2.09·10⁻³          | x⁰₃ 1.73·10⁻³
     | x¹₃ 1.30·10⁻⁴          | x¹₃ 9.91·10⁻⁵
     |                        | x²₃ 4.91·10⁻⁶
1/32 | x⁰₄ 5.17·10⁻⁴          | x⁰₄ 4.43·10⁻⁴
     | x¹₄ 2.54·10⁻⁵          | x¹₄ 1.82·10⁻⁵
     |                        | x²₄ 8.52·10⁻⁷
1/64 | x⁰₅ 1.23·10⁻⁴          | x⁰₅ 1.12·10⁻⁴
     | x¹₅ 4.76·10⁻⁶          | x¹₅ 3.25·10⁻⁶
     |                        | x²₅ 1.47·10⁻⁷

Table 9.4: Errors in nested iteration.


Chapter 10

Iterative methods for saddle-point problems

In this chapter we discuss a class of iterative methods for solving a linear system with a matrix of the form

K = [ A   B^T ]
    [ B   0   ] ,  A ∈ R^{m×m} symmetric positive definite ,  B ∈ R^{n×m} , rank(B) = n < m     (10.1)

The so-called Schur complement matrix is given by S := B A^{−1} B^T. Note that S is symmetric positive definite. The symmetric matrix K is (strongly) indefinite:

Lemma 10.0.1 The matrix K has m strictly positive and n strictly negative eigenvalues.

Proof. From the factorization

K = [ A   0 ] [ A^{−1}   0  ] [ A   B^T ]
    [ B   I ] [ 0       −S  ] [ 0   I   ]

it follows that K is congruent to the matrix blockdiag(A^{−1}, −S), which has m strictly positive and n strictly negative eigenvalues. Now apply Sylvester's inertia theorem.
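The inertia statement of lemma 10.0.1 is easy to confirm numerically for a random instance (the matrix sizes and random data below are arbitrary; a random B ∈ R^{n×m} with n < m has full rank with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
G = rng.standard_normal((m, m))
A = G @ G.T + m * np.eye(m)                   # symmetric positive definite block
B = rng.standard_normal((n, m))               # rank(B) = n < m (generically)

K = np.block([[A, B.T], [B, np.zeros((n, n))]])
eigs = np.linalg.eigvalsh(K)
assert np.sum(eigs > 0) == m and np.sum(eigs < 0) == n   # m positive, n negative
```

By Sylvester's inertia theorem this count is independent of the particular A and B, as long as A is s.p.d. and B has full rank.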

Remark 10.0.2 Consider a linear system of the form

K ( v ; w ) = ( f₁ ; f₂ )     (10.2)

with K as in (10.1). Define the functional L : R^m × R^n → R by L(v, w) = ½〈Av, v〉 + 〈Bv, w〉 − 〈f₁, v〉 − 〈f₂, w〉. Using the same arguments as in the proof of theorem 2.4.2 one can easily show that (v*, w*) is a solution of the problem (10.2) iff

L(v*, w) ≤ L(v*, w*) ≤ L(v, w*) for all v ∈ R^m, w ∈ R^n

Due to this property the linear system (10.2) is called a saddle-point problem.

In section 8.3 we discussed the preconditioned MINRES method for solving a linear system with a symmetric indefinite matrix. This method can be applied to the system in (10.2). Recall that the preconditioner must be symmetric positive definite. In section 10.1 we analyze a particular preconditioning technique for the matrix K. In section 10.2 we apply these methods to the discrete Stokes problem.


10.1 Block diagonal preconditioning

In this section we analyze the effect of symmetric preconditioning of the matrix K in (10.1) with a block diagonal matrix

M := [ M_A   0   ]
     [ 0     M_S ] ,  M_A ∈ R^{m×m}, M_A = M_A^T > 0 ,  M_S ∈ R^{n×n}, M_S = M_S^T > 0

The preconditioned matrix is given by

K̃ = M^{−1/2} K M^{−1/2} = [ Ã   B̃^T ]
                           [ B̃   0   ] ,  Ã := M_A^{−1/2} A M_A^{−1/2} ,  B̃ := M_S^{−1/2} B M_A^{−1/2}

We first consider a very special preconditioner, which in a certain sense is optimal:

Lemma 10.1.1 For M_A = A and M_S = S we have

σ(K̃) = { ½(1 − √5) , 1 , ½(1 + √5) }

Proof. Note that

K̃ = [ I   B̃^T ]
     [ B̃   0   ] ,  B̃ = S^{−1/2} B A^{−1/2}

The matrix B̃ has a nontrivial kernel. For v ∈ ker(B̃), v ≠ 0, we have K̃ (v; 0) = (v; 0) and thus 1 ∈ σ(K̃). For µ ∈ σ(K̃), µ ≠ 1, we get

[ I   B̃^T ] ( v )     ( v )
[ B̃   0   ] ( w ) = µ ( w ) ,  w ≠ 0

This holds iff µ(µ − 1) ∈ σ(B̃B̃^T) = σ(I) = {1} and thus µ = ½(1 ± √5).

Note that from the result in (8.41) it follows that the preconditioned MINRES method with the preconditioner as in lemma 10.1.1 yields (in exact arithmetic) the exact solution in at most three iterations. In most applications (e.g., the Stokes problem) it is very costly to solve linear systems with the matrices A and S. Hence this preconditioner is not feasible. Instead we will use approximations M_A of A and M_S of S. The quality of these approximations is measured by the following spectral inequalities, with γ_A, γ_S > 0:

γ_A M_A ≤ A ≤ Γ_A M_A
γ_S M_S ≤ S ≤ Γ_S M_S     (10.3)

Using an analysis as in [77, 82] we obtain a result for the eigenvalues of the preconditioned matrix:

Theorem 10.1.2 For the matrix K̃ with preconditioners that satisfy (10.3) we have:

σ(K̃) ⊂ [ ½(γ_A − √(γ_A² + 4Γ_SΓ_A)) , ½(γ_A − √(γ_A² + 4γ_Sγ_A)) ] ∪ [ γ_A , ½(Γ_A + √(Γ_A² + 4Γ_SΓ_A)) ]


Proof. We use the following inequalities

γ_A I ≤ Ã ≤ Γ_A I     (10.4a)
γ_A A^{−1} ≤ M_A^{−1} ≤ Γ_A A^{−1}     (10.4b)
γ_S I ≤ M_S^{−1/2} S M_S^{−1/2} ≤ Γ_S I     (10.4c)

Note that B̃B̃^T = M_S^{−1/2} B M_A^{−1} B^T M_S^{−1/2}. Using (10.4b) and (10.4c) we get

γ_A γ_S I ≤ B̃B̃^T ≤ Γ_A Γ_S I     (10.5)

Take µ ∈ σ(K̃). Then µ ≠ 0 and there exists (v, w) ≠ (0, 0) such that

Ã v + B̃^T w = µ v
B̃ v = µ w     (10.6)

From v = 0 it follows that w = 0, hence, v ≠ 0 must hold. From (10.6) we obtain (Ã + (1/µ) B̃^T B̃) v = µ v and thus µ ∈ σ(Ã + (1/µ) B̃^T B̃). Note that σ(B̃^T B̃) = σ(B̃B̃^T) ∪ {0}. We first consider the case µ > 0. Using (10.5) and (10.4a) we get

γ_A I ≤ Ã + (1/µ) B̃^T B̃ ≤ ( Γ_A + (1/µ) Γ_S Γ_A ) I

and thus γ_A ≤ µ ≤ Γ_A + (1/µ) Γ_S Γ_A holds. This yields

µ ∈ [ γ_A , ½(Γ_A + √(Γ_A² + 4Γ_SΓ_A)) ]

We now consider the case µ < 0. From (10.5) and (10.4a) it follows that

Ã + (1/µ) B̃^T B̃ ≥ ( γ_A + (1/µ) Γ_S Γ_A ) I

and thus µ ≥ γ_A + (1/µ) Γ_S Γ_A. This yields µ ≥ ½(γ_A − √(γ_A² + 4Γ_SΓ_A)). Finally we derive an upper bound for µ < 0. We introduce ν := −µ > 0. From (10.6) it follows that for µ < 0, w ≠ 0 must hold. Furthermore, we have

B̃ (Ã + νI)^{−1} B̃^T w = ν w

and thus ν ∈ σ(B̃ (Ã + νI)^{−1} B̃^T). From I + νÃ^{−1} ≤ (1 + ν/γ_A) I and (10.4c) we obtain

B̃ (Ã + νI)^{−1} B̃^T = B̃ Ã^{−1/2} (I + νÃ^{−1})^{−1} Ã^{−1/2} B̃^T ≥ (1 + ν/γ_A)^{−1} B̃ Ã^{−1} B̃^T
  = (1 + ν/γ_A)^{−1} M_S^{−1/2} S M_S^{−1/2} ≥ (1 + ν/γ_A)^{−1} γ_S I

We conclude that ν ≥ (1 + ν/γ_A)^{−1} γ_S holds. Hence, for µ = −ν we get µ ≤ ½(γ_A − √(γ_A² + 4γ_Sγ_A)).

Remark 10.1.3 Note that if γ_A = Γ_A = γ_S = Γ_S = 1, i.e., M_A = A and M_S = S, we obtain σ(K̃) = { ½(1 − √5) } ∪ [ 1 , ½(1 + √5) ], which is sharp (cf. lemma 10.1.1).
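For a small random saddle-point matrix one can verify that the ideally preconditioned matrix has exactly the three eigenvalues of lemma 10.1.1 (the sizes and random data are arbitrary test choices; the inverse square root is computed via an eigendecomposition):

```python
import numpy as np

def inv_sqrt(M):
    # inverse square root of a symmetric positive definite matrix
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(2)
m, n = 5, 3
G = rng.standard_normal((m, m))
A = G @ G.T + m * np.eye(m)                      # s.p.d. block
B = rng.standard_normal((n, m))                  # full rank (generically)
S = B @ np.linalg.solve(A, B.T)                  # Schur complement B A^{-1} B^T

M = np.block([[A, np.zeros((m, n))], [np.zeros((n, m)), S]])
K = np.block([[A, B.T], [B, np.zeros((n, n))]])
Kt = inv_sqrt(M) @ K @ inv_sqrt(M)               # preconditioned matrix

eigs = np.linalg.eigvalsh(Kt)
targets = [0.5 * (1 - 5 ** 0.5), 1.0, 0.5 * (1 + 5 ** 0.5)]
assert all(min(abs(e - t) for t in targets) < 1e-8 for e in eigs)
```

The eigenvalue ½(1 − √5) and ½(1 + √5) each occur n times and 1 occurs m − n times, reflecting the kernel dimension of the preconditioned off-diagonal block.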


10.2 Application to the Stokes problem

In this section the results of the previous sections are applied to the discretized Stokes problem that is treated in section 5.2. We consider the Galerkin discretization of the Stokes problem with Hood-Taylor finite element spaces

(V_h, M_h) = ( (X^k_{h,0})^d , X^{k−1}_h ∩ L²₀(Ω) ) , k ≥ 2

Here we use the notation d for the dimension of the velocity vector (Ω ⊂ R^d). For the bases in these spaces we use standard nodal basis functions. In the velocity space V_h = (X^k_{h,0})^d the set of basis functions is denoted by (ψ_i)_{1≤i≤m}. Each ψ_i is a vector function in R^d with d − 1 components identically zero. The basis in the pressure space M_h = X^{k−1}_h ∩ L²₀(Ω) is denoted by (φ_i)_{1≤i≤n}. The corresponding isomorphisms are given by

P_{h,1} : R^m → V_h ,  P_{h,1} v = Σ_{i=1}^m v_i ψ_i

P_{h,2} : R^n → M_h ,  P_{h,2} w = Σ_{i=1}^n w_i φ_i

The stiffness matrix for the Stokes problem is given by

K = [ A   B^T ]
    [ B   0   ] ∈ R^{(m+n)×(m+n)} , with

〈A v, ṽ〉 = a(P_{h,1} v, P_{h,1} ṽ) = ∫_Ω (∇P_{h,1} v) · (∇P_{h,1} ṽ) dx for all v, ṽ ∈ R^m

〈B v, w〉 = b(P_{h,1} v, P_{h,2} w) = −∫_Ω P_{h,2} w div P_{h,1} v dx for all v ∈ R^m, w ∈ R^n

The matrix A = blockdiag(A₁, . . . , A_d) is symmetric positive definite and A₁ = . . . = A_d is the stiffness matrix corresponding to the Galerkin discretization of the Poisson equation in the space X^k_{h,0} of simplicial finite elements.

We now discuss preconditioners for the matrix A and the Schur complement S = B A^{−1} B^T. The preconditioner for the matrix A is based on a symmetric multigrid method applied to the diagonal block A₁. Let C_MG be the iteration matrix of a symmetric multigrid method applied to the matrix A₁, as defined in section 9.4.5. The matrix M_MG is defined by C_MG =: I − M_MG^{−1} A₁. This matrix, although not explicitly available, can be used as a preconditioner for A₁. For given y the vector M_MG^{−1} y is the result of one multigrid iteration with starting vector equal to zero applied to the system A₁ v = y. From the analysis in section 9.4.5 it follows that M_MG is symmetric and under certain reasonable assumptions we have σ(I − M_MG^{−1} A₁) ⊂ [0, ρ_MG] with the contraction number ρ_MG < 1 independent of the mesh size parameter h. For the preconditioner M_A of A we take

M_A := blockdiag(M_MG, . . . , M_MG) (d blocks)

For this preconditioner we then have the following spectral inequalities:

(1 − ρ_MG) M_A ≤ A ≤ M_A ,  with ρ_MG < 1 independent of h     (10.7)


For the preconditioner M_S of the Schur complement S we use the mass matrix in the pressure space, which is defined by

〈M_S w, z〉 = 〈P_{h,2} w, P_{h,2} z〉_{L²} for all w, z ∈ R^n     (10.8)

This mass matrix is symmetric positive definite and (after diagonal scaling, cf. section 3.5.1)in general well-conditioned. In practice the linear systems of the form MSw = q are solvedapproximately by applying a few iterations of an iterative solver (for example, CG). We recallthe stability property of the Hood-Taylor finite element pair (Vh,Mh) (cf. section 5.2.1):

∃ β > 0 :   sup_{u_h∈V_h} b(u_h, q_h) / ‖u_h‖₁ ≥ β ‖q_h‖_{L²}   for all q_h ∈ M_h   (10.9)

with β independent of h. Using this stability property we get the following spectral inequalities for the preconditioner M_S:

Theorem 10.2.1 Let M_S be the pressure mass matrix defined in (10.8). Assume that the stability property (10.9) holds. Then

β² M_S ≤ S ≤ d M_S   (10.10)

holds.

Proof. For w ∈ R^n we have:

max_{v∈R^m} 〈Bv, w〉 / 〈Av, v〉^{1/2} = max_{v∈R^m} 〈B A^{−1/2} v, w〉 / ‖v‖ = max_{v∈R^m} 〈v, A^{−1/2} Bᵀ w〉 / ‖v‖ = ‖A^{−1/2} Bᵀ w‖ = 〈Sw, w〉^{1/2}

Hence, for arbitrary w ∈ R^n:

〈Sw, w〉^{1/2} = max_{u_h∈V_h} b(u_h, P_{h,2}w) / |u_h|₁   (10.11)

Using this and the stability bound (10.9) we get

〈Sw, w〉^{1/2} ≥ β ‖P_{h,2}w‖_{L²} = β 〈M_S w, w〉^{1/2}

and thus the first inequality in (10.10) holds. Note that

|b(u_h, P_{h,2}w)| ≤ ‖div u_h‖_{L²} ‖P_{h,2}w‖_{L²} ≤ √d |u_h|₁ ‖P_{h,2}w‖_{L²} = √d |u_h|₁ 〈M_S w, w〉^{1/2}

holds. Combining this with (10.11) proves the second inequality in (10.10).
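The remark above that the diagonally scaled pressure mass matrix is well conditioned can be checked numerically on a 1D stand-in. The P1 mass matrix on a uniform grid and the concrete bound 3 below are assumptions of this sketch, not taken from the text.

```python
import numpy as np

def p1_mass_matrix(n):
    """P1 mass matrix (h/6) * tridiag(1, 4, 1) on a uniform grid, h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    return (h / 6.0) * (4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1))

for n in (10, 100, 1000):
    M = p1_mass_matrix(n)
    d = 1.0 / np.sqrt(np.diag(M))
    Ms = d[:, None] * M * d[None, :]   # symmetric diagonal scaling
    kappa = np.linalg.cond(Ms)
    assert kappa < 3.0                 # bound independent of the mesh size h
```

Because the condition number stays bounded for every h, a few CG iterations suffice to solve systems with the scaled mass matrix to the needed accuracy.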

Corollary 10.2.2 Suppose that for solving a discrete Stokes problem with stiffness matrix K we use a preconditioned MINRES method with preconditioners M_A (for A) and M_S (for S) as defined above. Then the inequalities (10.3) hold with constants γ_A, Γ_A, γ_S, Γ_S that are independent of h. From theorem 10.1.2 it follows that the spectrum of the preconditioned matrix K is contained in a set [a, b] ∪ [c, d] with a < b < 0 < c < d, all independent of h, and with b − a = d − c. From theorem 8.42 we then conclude that the residual reduction factor can be bounded by a constant smaller than one independent of h.
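The spectral picture in the corollary — a negative and a positive interval, both bounded away from zero — can be observed on a toy symmetric saddle point matrix. The random blocks and diagonal preconditioners below are our own stand-ins for the Stokes discretization, chosen only to exhibit the structure.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 12, 5
G = rng.standard_normal((m, m))
A = G @ G.T + m * np.eye(m)            # SPD (1,1) block
B = rng.standard_normal((n, m))        # full-rank coupling block
K = np.block([[A, B.T], [B, np.zeros((n, n))]])

S = B @ np.linalg.solve(A, B.T)        # Schur complement, SPD
d = np.concatenate([np.diag(A), np.diag(S)])
W = np.diag(1.0 / np.sqrt(d))          # M^{-1/2} for M = blockdiag(diag A, diag S)

evals = np.linalg.eigvalsh(W @ K @ W)  # symmetrically preconditioned matrix
# The inertia of K survives the congruence transformation: m positive and
# n negative eigenvalues, none at zero -- the [a, b] u [c, d] structure
# that MINRES convergence bounds rely on.
assert (evals < 0).sum() == n and (evals > 0).sum() == m
assert np.abs(evals).min() > 1e-8
```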


Appendix A

Functional Analysis

A.1 Different types of spaces

Below we give some definitions of elementary notions from functional analysis (cf. for example Kreyszig [55]). We restrict ourselves to real spaces, i.e. for the scalar field we take R.

Real vector space. A real vector space is a set X of elements, called vectors, together with the algebraic operations vector addition and multiplication of vectors by real scalars. Vector addition should be commutative and associative. Multiplication by scalars should be associative and distributive.

Example A.1.1 Examples of real vector spaces are R^n and C([a, b]).

Normed space. A normed space is a vector space X with a norm defined on it. Here a norm on a vector space X is a real-valued function on X whose value at x ∈ X is denoted by ‖x‖ and which has the properties

‖x‖ ≥ 0
‖x‖ = 0 ⇔ x = 0
‖αx‖ = |α| ‖x‖
‖x + y‖ ≤ ‖x‖ + ‖y‖     (A.1)

for arbitrary x, y ∈ X, α ∈ R.

Example A.1.2 Examples of normed spaces are

(R^n, ‖ · ‖∞) with ‖x‖∞ = max_{1≤i≤n} |x_i| ,

(R^n, ‖ · ‖₂) with ‖x‖₂² = Σ_{i=1}^n x_i² ,

(C([a, b]), ‖ · ‖∞) with ‖f‖∞ = max_{t∈[a,b]} |f(t)| ,

(C([a, b]), ‖ · ‖_{L²}) with ‖f‖_{L²} = ( ∫_a^b f(t)² dt )^{1/2} .
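These norms can also be evaluated numerically. In the sketch below (the vector and the function are our own choices) the L² norm of a continuous function is approximated by the trapezoidal rule on a fine grid:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
assert np.max(np.abs(x)) == 4.0                             # ||x||_inf
assert abs(np.sqrt(np.sum(x**2)) - np.sqrt(26.0)) < 1e-12   # ||x||_2

# f(t) = t on [0, 1]: ||f||_inf = 1 and ||f||_L2 = (int_0^1 t^2 dt)^{1/2} = 1/sqrt(3)
t = np.linspace(0.0, 1.0, 100001)
dt = t[1] - t[0]
f = t
sup_norm = np.max(np.abs(f))
l2_norm = np.sqrt(np.sum((f[:-1]**2 + f[1:]**2) / 2.0) * dt)  # trapezoidal rule
assert abs(sup_norm - 1.0) < 1e-12
assert abs(l2_norm - 1.0 / np.sqrt(3.0)) < 1e-6
```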

Banach space. A Banach space is a complete normed space. This means that in X every Cauchy sequence, in the metric defined by the norm, has a limit which is an element of X.


Example A.1.3 Examples of Banach spaces are:

(R^n, ‖ · ‖₂) ,

(R^n, ‖ · ‖∗) with any norm ‖ · ‖∗ on R^n ,

(C([a, b]), ‖ · ‖∞).

The completeness of the space in the second example follows from the fact that on a finite-dimensional space all norms are equivalent. The completeness of the space in the third example is a consequence of the following theorem: the limit of a uniformly convergent sequence of continuous functions is a continuous function.

Remark A.1.4 The space (C([a, b]), ‖ · ‖_{L²}) is not complete. Consider for example the sequence f_n ∈ C([0, 1]), n ≥ 1, defined by

f_n(t) =  0 if t ≤ 1/2 ,   n(t − 1/2) if 1/2 ≤ t ≤ 1/2 + 1/n ,   1 if 1/2 + 1/n ≤ t ≤ 1 .

Then for m, n ≥ N we have

‖f_n − f_m‖²_{L²} = ∫_0^1 |f_n(t) − f_m(t)|² dt ≤ ∫_{1/2}^{1/2+1/N} 1 dt = 1/N .

So (f_n)_{n≥1} is a Cauchy sequence. For the limit function f we would have

f(t) =  0 if 0 ≤ t ≤ 1/2 ,   1 if 1/2 + ε ≤ t ≤ 1 ,

for arbitrary ε > 0. So f cannot be continuous.
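The estimate ‖f_n − f_m‖²_{L²} ≤ 1/N from the remark can be reproduced numerically (a sketch; the L² norm is again approximated by the trapezoidal rule on a fine grid, and the particular pairs (n, m) are our own choices):

```python
import numpy as np

def f(n, t):
    """Ramp functions of Remark A.1.4: 0, then slope n on [1/2, 1/2 + 1/n], then 1."""
    return np.clip(n * (t - 0.5), 0.0, 1.0)

t = np.linspace(0.0, 1.0, 200001)
dt = t[1] - t[0]

def l2_norm_sq(g):
    return np.sum((g[:-1]**2 + g[1:]**2) / 2.0) * dt   # trapezoidal rule

for n, m, N in [(10, 25, 10), (40, 100, 40)]:
    d2 = l2_norm_sq(f(n, t) - f(m, t))
    assert d2 <= 1.0 / N + 1e-6   # the Cauchy-sequence estimate of the remark
```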

Inner product space. An inner product space is a (real) vector space X with an inner product defined on X. For such an inner product we need a mapping of X × X into R, i.e. with every pair of vectors x and y from X there is associated a scalar denoted by 〈x, y〉. This mapping is called an inner product on X if for arbitrary x, y, z ∈ X and α ∈ R the following holds:

〈x,x〉 ≥ 0 (A.2)

〈x,x〉 = 0 ⇔ x = 0 (A.3)

〈x,y〉 = 〈y,x〉 (A.4)

〈αx,y〉 = α〈x,y〉 (A.5)

〈x + y, z〉 = 〈x, z〉 + 〈y, z〉. (A.6)

An inner product defines a norm on X:

‖x‖ = √〈x, x〉 .

An inner product and the corresponding norm satisfy the Cauchy-Schwarz inequality:

|〈x,y〉| ≤ ‖x‖ ‖y‖ for all x,y ∈ X. (A.7)


Example A.1.5 Examples of inner product spaces are:

R^n with 〈x, y〉 = Σ_{i=1}^n x_i y_i ,

C([a, b]) with 〈f, g〉 = ∫_a^b f(t)g(t) dt.

Hilbert space. An inner product space which is complete is called a Hilbert space.

Example A.1.6 Examples of Hilbert spaces are:

R^n with 〈x, y〉 = Σ_{i=1}^n x_i y_i ,

L²([a, b]) with 〈f, g〉 = ∫_a^b f(t)g(t) dt.

We note that the space C([a, b]) with the inner product (and corresponding norm) as in Example A.1.5 results in the normed space (C([a, b]), ‖ · ‖_{L²}). In Remark A.1.4 it is shown that this space is not complete. Thus the inner product space C([a, b]) as in Example A.1.5 is not a Hilbert space.

Completion. Let X be a Banach space and Y a subspace of X. The closure Y̅ of Y in X is defined as the set of accumulation points of Y in X, i.e. x ∈ Y̅ if and only if there is a sequence (x_n)_{n≥1} in Y such that x_n → x. If Y̅ = Y holds, then Y is called closed (in X). The subspace Y is called dense in X if Y̅ = X.
Let (Z, ‖ · ‖) be a given normed space. Then there exists a Banach space (X, ‖ · ‖∗) (which is unique, except for isometric isomorphisms) such that Z ⊂ X, ‖x‖ = ‖x‖∗ for all x ∈ Z, and Z is dense in X. The space X is called the completion of Z.

The space L²(Ω). Let Ω be a domain in R^n. We denote by L²(Ω) the space of all Lebesgue measurable functions f : Ω → R for which

‖f‖₀ := ‖f‖_{L²} := ( ∫_Ω |f(x)|² dx )^{1/2} < ∞

In this space functions are identified that are equal almost everywhere (a.e.) on Ω. The elements of L²(Ω) are thus actually equivalence classes of functions. One writes f = 0 [f = g] if f(x) = 0 [f(x) = g(x)] a.e. in Ω. The space L²(Ω) with

〈f, g〉 = ∫_Ω f(x)g(x) dx

is a Hilbert space. The space C₀^∞(Ω) (all functions in C^∞(Ω) which have a compact support in Ω) is dense in L²(Ω): the closure of C₀^∞(Ω) with respect to ‖ · ‖₀ equals L²(Ω). In other words, the completion of the normed space (C₀^∞(Ω), ‖ · ‖₀) results in the space L²(Ω).


Dual space. Let (X, ‖ · ‖) be a normed space. The set of all bounded linear functionals f : X → R forms a real vector space. On this space we can define the norm

‖f‖ := sup{ |f(x)| / ‖x‖ : x ∈ X, x ≠ 0 } .

This results in a normed space called the dual space of X and denoted by X′.

Bounded linear operators. Let (X, ‖ · ‖_X) and (Y, ‖ · ‖_Y) be normed spaces and T : X → Y be a linear operator. The (operator) norm of T is defined by

‖T‖_{Y←X} := sup{ ‖Tx‖_Y / ‖x‖_X : x ∈ X, x ≠ 0 } .

The operator T is bounded iff ‖T‖_{Y←X} < ∞. The operator T is called an isomorphism if T is bijective (i.e., injective and surjective) and both T and T⁻¹ are bounded.

Compact linear operators. Let X and Y be Banach spaces and T : X → Y a bounded linear operator. The operator T is compact if for every bounded set A in X the image set B := T(A) is precompact in Y (this means that every sequence in B must contain a convergent subsequence).

Continuous embedding; compact embedding. Let X and Y be normed spaces. A linear operator I : X → Y is called a continuous embedding if I is bounded (or, equivalently, continuous, see theorem A.2.2) and injective. The embedding is called compact if I is continuous and compact. An equivalent characterization of a compact embedding is that for every bounded sequence (x_k)_{k≥1} in X the image sequence (I x_k)_{k≥1} has a subsequence that is a Cauchy sequence in Y.

A.2 Theorems from functional analysis

Below we give a few classical results from functional analysis (cf. for example Kreyszig [55]).

Theorem A.2.1 (Arzelà-Ascoli.) A subset K of C(Ω) is precompact (i.e., every sequence has a convergent subsequence) if and only if the following two conditions hold:

(i) ∃M : ‖f‖∞ < M for all f ∈ K (K is bounded)

(ii) ∀ ε > 0 ∃ δ > 0 : ∀ f ∈ K, |f(x) − f(y)| < ε for all x, y ∈ Ω with ‖x − y‖ < δ

(K is uniformly equicontinuous)

Theorem A.2.2 (Boundedness of linear operators.) Let X and Y be normed spaces and T : X → Y a linear mapping. Then T is bounded if and only if T is continuous.


Theorem A.2.3 (Extension of operators.) Let X be a normed space and Y a Banach space. Suppose X₀ is a dense subspace of X and T : X₀ → Y a bounded linear operator. Then there exists a unique extension T_e : X → Y with the properties

(i) Tx = T_e x for all x ∈ X₀

(ii) If (x_k)_{k≥1} ⊂ X₀, x ∈ X and lim_{k→∞} x_k = x, then T_e x = lim_{k→∞} T x_k

(iii) ‖T_e‖_{Y←X} = ‖T‖_{Y←X₀}

Theorem A.2.4 (Banach fixed point theorem.) Let (X, ‖ · ‖) be a Banach space and F : X → X a (possibly nonlinear) contraction, i.e. there is a constant γ < 1 such that for all x, y ∈ X:

‖F(x) − F(y)‖ ≤ γ ‖x − y‖ .

Then there exists a unique x ∈ X (called a fixed point) such that

F(x) = x

holds.
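A one-dimensional illustration (X = R with the absolute value; the particular map is our own example): F(x) = cos(x)/2 is a contraction with γ = 1/2, since |F′(x)| = |sin(x)|/2 ≤ 1/2, so the fixed point iteration x_{k+1} = F(x_k) converges to the same unique fixed point from any starting value.

```python
import math

def F(x):
    return math.cos(x) / 2.0   # contraction on R with gamma = 1/2

def iterate(x, steps=60):
    """Fixed point iteration x_{k+1} = F(x_k); error shrinks like gamma^k."""
    for _ in range(steps):
        x = F(x)
    return x

x = iterate(10.0)
assert abs(F(x) - x) < 1e-12       # x is (numerically) a fixed point

# Uniqueness: a different starting value leads to the same fixed point.
assert abs(iterate(-3.0) - x) < 1e-12
```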

Theorem A.2.5 (Corollary of the open mapping theorem.) Let X and Y be Banach spaces and T : X → Y a bounded linear operator which is bijective. Then T⁻¹ is bounded, i.e., T : X → Y is an isomorphism.

Corollary A.2.6 Let X and Y be Banach spaces and T : X → Y a bounded linear operator which is injective. Let R(T) = { Tx | x ∈ X } be the range of T. Then T⁻¹ : R(T) → X is bounded if and only if R(T) is closed (in Y).

Theorem A.2.7 (Orthogonal decomposition.) Let U ⊂ H be a closed subspace of a Hilbert space H. Let U⊥ = { x ∈ H | 〈x, y〉 = 0 for all y ∈ U } be the orthogonal complement of U. Then H can be decomposed as H = U ⊕ U⊥, i.e., every x ∈ H has a unique representation x = u + v, u ∈ U, v ∈ U⊥. Moreover, the identity ‖x‖² = ‖u‖² + ‖v‖² holds.

Theorem A.2.8 (Riesz representation theorem.) Let H be a Hilbert space with inner product denoted by 〈·, ·〉 and corresponding norm ‖ · ‖_H. Let f be an element of the dual space H′, with norm ‖f‖_{H′}. Then there exists a unique w ∈ H such that

f(x) = 〈w, x〉 for all x ∈ H.

Furthermore, ‖w‖_H = ‖f‖_{H′} holds. The linear operator J_H : f → w is called the Riesz isomorphism.

Corollary A.2.9 Let H be a Hilbert space with inner product denoted by 〈·, ·〉. The bilinear form

〈f, g〉_{H′} := 〈J_H f, J_H g〉 ,   f, g ∈ H′

defines a scalar product on H′. The space H′ with this scalar product is a Hilbert space.


Appendix B

Linear Algebra

B.1 Notions from linear algebra

Below we give some definitions and elementary notions from linear algebra. The collection of all real n × n matrices is denoted by R^{n×n}.

For A = (aij)1≤i,j≤n ∈ Rn×n the transpose AT is defined by AT = (aji)1≤i,j≤n.

Spectrum, spectral radius. By σ(A) we denote the spectrum of A, i.e. the collection of all eigenvalues of the matrix A. Note that in general σ(A) contains complex numbers (even for A real). We use the notation ρ(A) for the spectral radius of A ∈ R^{n×n}:

ρ(A) := max{ |λ| : λ ∈ σ(A) } .
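For instance (our own toy matrix), a real rotation matrix has the purely imaginary spectrum {i, −i} and spectral radius 1:

```python
import numpy as np

A = np.array([[0.0, -1.0], [1.0, 0.0]])   # rotation by 90 degrees
evals = np.linalg.eigvals(A)              # complex eigenvalues of a real matrix
assert np.allclose(np.sort(evals.imag), [-1.0, 1.0])
assert np.allclose(evals.real, 0.0)

rho = np.max(np.abs(evals))               # spectral radius
assert abs(rho - 1.0) < 1e-12
```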

Vector norms. On R^n we can define a norm ‖ · ‖, i.e. a real-valued function on R^n with properties as in (A.1). Important examples of such norms are:

‖x‖₁ := Σ_{i=1}^n |x_i|   (1-norm),   (B.1)

‖x‖₂ := ( Σ_{i=1}^n x_i² )^{1/2}   (2-norm or Euclidean norm),   (B.2)

‖x‖∞ := max_{1≤i≤n} |x_i|   (maximum norm).   (B.3)
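These three norms can be checked against numpy's built-in implementations (a small sketch; the vector is our own choice):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -4.0])

norm_1 = np.sum(np.abs(x))       # (B.1)
norm_2 = np.sqrt(np.sum(x**2))   # (B.2)
norm_inf = np.max(np.abs(x))     # (B.3)

assert norm_1 == np.linalg.norm(x, 1) == 10.0
assert abs(norm_2 - np.sqrt(30.0)) < 1e-12
assert abs(norm_2 - np.linalg.norm(x, 2)) < 1e-12
assert norm_inf == np.linalg.norm(x, np.inf) == 4.0
```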

Cauchy-Schwarz inequality. On R^n we can define an inner product by 〈x, y〉 := xᵀy. The norm corresponding to this inner product is the Euclidean norm (B.2). The Cauchy-Schwarz inequality (A.7) takes the form:

|xᵀy| ≤ ‖x‖₂ ‖y‖₂ for all x, y ∈ R^n.   (B.4)

Matrix norms. A matrix norm on R^{n×n} is a real-valued function whose value at A ∈ R^{n×n} is denoted by ‖A‖ and which has the properties

‖A‖ ≥ 0, and ‖A‖ = 0 iff A = 0
‖αA‖ = |α| ‖A‖
‖A + B‖ ≤ ‖A‖ + ‖B‖
‖AB‖ ≤ ‖A‖ ‖B‖     (B.5)


for all A, B ∈ R^{n×n} and all α ∈ R. A special class of matrix norms are those induced by a vector norm. For a given vector norm ‖ · ‖ on R^n we define an induced matrix norm by

‖A‖ := sup{ ‖Ax‖ / ‖x‖ : x ∈ R^n, x ≠ 0 }   for A ∈ R^{n×n}.   (B.6)

Induced by the vector norms in (B.1), (B.2), (B.3) we obtain the matrix norms ‖A‖₁, ‖A‖₂, ‖A‖∞. From the definition of the induced matrix norm it follows that

‖Ax‖ ≤ ‖A‖ ‖x‖ for all A ∈ R^{n×n}, x ∈ R^n   (B.7)

and that the properties (B.5) hold. In the same way one can define a matrix norm on C^{n×n}. In this book we will always use real induced matrix norms as defined in (B.6).

Condition number. For a nonsingular matrix A the spectral condition number is defined by

κ(A) := ‖A‖₂ ‖A⁻¹‖₂.

We note that condition numbers can be defined with respect to other matrix norms, too.

Below we introduce some notions related to special properties which matrices A ∈ R^{n×n} may have.
The matrix A is symmetric if A = Aᵀ holds. The matrix A is normal if the equality AᵀA = AAᵀ holds. Note that every symmetric matrix is normal.

A symmetric matrix A is positive definite if xᵀAx > 0 holds for all x ≠ 0. In that case, A is said to be symmetric positive definite.

A matrix A is weakly diagonally dominant if the following condition is fulfilled:

Σ_{j≠i} |a_ij| ≤ |a_ii| for all i, with strict inequality for at least one i.

The matrix A is called irreducible if there does not exist a permutation matrix Π such that ΠᵀAΠ is a two-by-two block matrix in which the (2, 1) block is a zero block. The matrix A is called irreducibly diagonally dominant if A is irreducible and weakly diagonally dominant.

A matrix A is an M-matrix if it has the following properties:

a_ij ≤ 0 for all i ≠ j ,

A is nonsingular and all the entries in A⁻¹ are ≥ 0.
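A standard example tying these notions together (our own illustration): the 1D Poisson matrix tridiag(−1, 2, −1) is weakly diagonally dominant (with strict inequality in the first and last row) and is an M-matrix, which can be verified numerically:

```python
import numpy as np

n = 6
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiag(-1, 2, -1)

offdiag_sums = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
assert np.all(offdiag_sums <= np.abs(np.diag(A)))   # weak diagonal dominance
assert np.any(offdiag_sums < np.abs(np.diag(A)))    # strict in first/last row

offdiag = A - np.diag(np.diag(A))
assert np.all(offdiag <= 0.0)                       # a_ij <= 0 for i != j
assert np.all(np.linalg.inv(A) >= -1e-12)           # entries of A^{-1} are >= 0
```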

B.2 Theorems from linear algebra

In this section we give some basic results from linear algebra. For the proofs we refer to an introductory linear algebra text, e.g. Strang [88] or Lancaster and Tismenetsky [58].

Theorem B.2.1 (Results on eigenvalues and eigenvectors) For A, B ∈ R^{n×n} the following results hold:


E1. σ(A) = σ(AT ).

E2. σ(AB) = σ(BA) , ρ(AB) = ρ(BA).

E3. A ∈ R^{n×n} is normal if and only if A has an orthogonal basis of eigenvectors. In general these eigenvectors and corresponding eigenvalues are complex, and orthogonality is meant with respect to the complex Euclidean inner product.

E4. If A is symmetric then A has an orthogonal basis of real eigenvectors. Furthermore, all eigenvalues of A are real.

E5. A is symmetric positive definite if and only if A is symmetric and σ(A) ⊂ (0,∞).

Theorem B.2.2 (Results on matrix norms) For A ∈ R^{n×n} the following results hold:

N1. ‖A‖₁ = max_{1≤j≤n} Σ_{i=1}^n |a_ij|.

N2. ‖A‖∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|.

N3. ‖A‖₂ = √(ρ(AᵀA)).

N4. If A is normal then ‖A‖2 = ρ(A).

N5. Let ‖ · ‖ be an induced matrix norm. The following inequality holds:

ρ(A) ≤ ‖A‖.

Using (N4) and (E5) we obtain the following results for the spectral condition number:

κ(A) = ρ(A) ρ(A⁻¹) ,   if A is normal ,   (B.8)

κ(A) = λ_max/λ_min ,   if A is symmetric positive definite,   (B.9)

with λ_max and λ_min the largest and smallest eigenvalue of A, respectively.
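The identities N1-N3 and (B.9) can be checked numerically (a sketch with a small symmetric positive definite matrix of our own choosing):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite

assert np.linalg.norm(A, 1) == np.max(np.sum(np.abs(A), axis=0))       # N1: max column sum
assert np.linalg.norm(A, np.inf) == np.max(np.sum(np.abs(A), axis=1))  # N2: max row sum

rho_AtA = np.max(np.abs(np.linalg.eigvals(A.T @ A)))
assert abs(np.linalg.norm(A, 2) - np.sqrt(rho_AtA)) < 1e-12            # N3

lam = np.linalg.eigvalsh(A)
assert abs(np.linalg.cond(A) - lam.max() / lam.min()) < 1e-10          # (B.9)
```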

Theorem B.2.3 (Jordan normal form) For every A ∈ R^{n×n} there exists a nonsingular matrix T such that A = TΛT⁻¹ with Λ a matrix of the form Λ = blockdiag(Λ_i)_{1≤i≤s}, where Λ_i ∈ R^{k_i×k_i}, 1 ≤ i ≤ s, is the upper bidiagonal Jordan block with λ_i on the diagonal, ones on the first superdiagonal, and zeros elsewhere, and {λ₁, . . . , λ_s} = σ(A).


Bibliography

[1] R. A. Adams. Sobolev Spaces. Academic Press, 1975.

[2] S. Agmon, A. Douglis, and L. Nirenberg. Estimates near the boundary of solutions of elliptic partial differential equations satisfying general boundary conditions II. Comm. on Pure and Appl. Math., 17:35–92, 1964.

[3] H. W. Alt. Lineare Funktionalanalysis, 2. ed. Springer, Heidelberg, 1992.

[4] D. N. Arnold, F. Brezzi, and M. Fortin. A stable finite element for the Stokes equations. Calcolo, 21:337–344, 1984.

[5] W. E. Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quart. Appl. Math., 9:17–29, 1951.

[6] O. Axelsson. Iterative Solution Methods. Cambridge University Press, NY, 1994.

[7] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Value Problems. Theory and Computation. Academic Press, Orlando, 1984.

[8] R. E. Bank and T. F. Chan. A composite step biconjugate gradient method. Numer. Math., 66:295–319, 1994.

[9] R. E. Bank and L. R. Scott. On the conditioning of finite element equations with highly refined meshes. SIAM J. Numer. Anal., 26:1383–1394, 1989.

[10] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, 1994.

[11] M. Bercovier and O. Pironneau. Error estimates for finite element solution of the Stokes problem in primitive variables. Numer. Math., 33:211–224, 1979.

[12] A. Berman and R. J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. Academic Press, NY, 1979.

[13] C. Bernardi. Optimal finite element interpolation on curved domains. SIAM J. Numer. Anal., 26:1212–1240, 1989.

[14] C. Bernardi and V. Girault. A local regularization operator for triangular and quadrilateral finite elements. SIAM J. Numer. Anal., 35:1893–1915, 1998.

[15] D. Boffi. Stability of higher-order triangular Hood-Taylor methods for the stationary Stokes equations. Math. Models Methods Appl. Sci., 4:223–235, 1994.


[16] D. Boffi. Three-dimensional finite element methods for the Stokes problem. SIAM J. Numer. Anal., 34:664–670, 1997.

[17] D. Braess, M. Dryja, and W. Hackbusch. A multigrid method for nonconforming FE-discretisations with application to non-matching grids. Computing, 63:1–25, 1999.

[18] D. Braess and W. Hackbusch. A new convergence proof for the multigrid method including the V-cycle. SIAM J. Numer. Anal., 20:967–975, 1983.

[19] J. H. Bramble. Multigrid Methods. Longman, Harlow, 1993.

[20] J. H. Bramble and S. R. Hilbert. Estimation of linear functionals on Sobolev spaces with applications to Fourier transforms and spline interpolation. SIAM J. Numer. Anal., 7:113–124, 1970.

[21] S. C. Brenner and L. R. Scott. The Mathematical Theory of Finite Element Methods. Springer, New York, 1994.

[22] F. Brezzi and R. S. Falk. Stability of higher-order Hood-Taylor methods. SIAM J. Numer. Anal., 28:581–590, 1991.

[23] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial (2nd ed.). SIAM, Philadelphia, 2000.

[24] O. Bröker, M. Grote, C. Mayer, and A. Reusken. Robust parallel smoothing for multigrid via sparse approximate inverses. SIAM J. Sci. Comput., 32:1395–1416, 2001.

[25] A. M. Bruaset. A Survey of Preconditioned Iterative Methods. Longman, Harlow, 1995.

[26] L. Cattabriga. Su un problema al contorno relativo al sistema di equazioni di Stokes. Rend. Sem. Mat. Univ. Padova, 31:308–340, 1961.

[27] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. North Holland, 1978.

[28] P. G. Ciarlet. Basic error estimates for elliptic problems. In P. G. Ciarlet and J. L. Lions, editors, Handbook of Numerical Analysis, Volume II: Finite Element Methods (Part 1). North Holland, Amsterdam, 1991.

[29] P. Clement. Approximation by finite element functions using local regularization. RAIRO Anal. Numer. (M2AN), 9(R-2):77–84, 1975.

[30] M. Dauge. Stationary Stokes and Navier-Stokes systems on two- or three-dimensional domains with corners. Part I: Linearized equations. SIAM J. Math. Anal., 20:74–97, 1989.

[31] G. Duvaut and J. L. Lions. Les Inequations en Mecanique et en Physique. Dunod, Paris, 1972.

[32] E. Ecker and W. Zulehner. On the smoothing property of multi-grid methods in the non-symmetric case. Numerical Linear Algebra with Applications, 3:161–172, 1996.

[33] V. Faber and T. Manteuffel. Orthogonal error methods. SIAM J. Numer. Anal., 20:352–362, 1984.

[34] M. Fiedler. Special Matrices and their Applications in Numerical Mathematics. Nijhoff, Dordrecht, 1986.


[35] R. Fletcher. Conjugate gradient methods for indefinite systems. In G. A. Watson, editor, Numerical Analysis Dundee 1975, Lecture Notes in Mathematics, Vol. 506, pages 73–89, Berlin, 1976. Springer.

[36] R. W. Freund, G. H. Golub, and N. M. Nachtigal. Iterative solution of linear systems. Acta Numerica, pages 57–100, 1992.

[37] G. Frobenius. Über Matrizen aus nicht negativen Elementen. Preuss. Akad. Wiss., pages 456–477, 1912.

[38] E. Gartland. Strong uniform stability and exact discretizations of a model singular perturbation problem and its finite difference approximations. Appl. Math. Comput., 31:473–485, 1989.

[39] D. Gilbarg and N. S. Trudinger. Elliptic Partial Differential Equations of Second Order. Springer, Berlin, Heidelberg, 1977.

[40] V. Girault and P.-A. Raviart. Finite Element Methods for Navier-Stokes Equations, volume 5 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1986.

[41] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 2. edition, 1989.

[42] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997.

[43] M. E. Gurtin. An Introduction to Continuum Mechanics, volume 158 of Mathematics in Science and Engineering. Academic Press, 1981.

[44] W. Hackbusch. Multigrid Methods and Applications, volume 4 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1985.

[45] W. Hackbusch. Theorie und Numerik elliptischer Differentialgleichungen. Teubner, Stuttgart, 1986.

[46] W. Hackbusch. Iterative Lösung großer schwachbesetzter Gleichungssysteme. Teubner, 1991.

[47] W. Hackbusch. Elliptic Differential Equations: Theory and Numerical Treatment, volume 18 of Springer Series in Computational Mathematics. Springer, Berlin, 1992.

[48] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied Mathematical Sciences. Springer, New York, 1994.

[49] W. Hackbusch. A note on Reusken's lemma. Computing, 55:181–189, 1995.

[50] L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, New York, 1981.

[51] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Stand., 49:409–436, 1952.

[52] P. Hood and C. Taylor. A numerical solution of the Navier-Stokes equations using the finite element technique. Comp. and Fluids, 1:73–100, 1973.


[53] J. Kadlec. On the regularity of the solution of the Poisson problem on a domain with boundary locally similar to the boundary of a convex open set. Czechoslovak Math. J., 14(89):386–393, 1964. (russ.).

[54] R. B. Kellogg and J. E. Osborn. A regularity result for the Stokes problem in a convex polygon. J. Funct. Anal., 21:397–431, 1976.

[55] E. Kreyszig. Introductory Functional Analysis with Applications. Wiley, New York, 1978.

[56] O. A. Ladyzhenskaya. Funktionalanalytische Untersuchungen der Navier-Stokesschen Gleichungen. Akademie-Verlag, Berlin, 1965.

[57] O. A. Ladyzhenskaya and N. A. Ural'tseva. Linear and Quasilinear Elliptic Equations, volume 46 of Mathematics in Science and Engineering. Academic Press, New York, London, 1968.

[58] P. Lancaster and M. Tismenetsky. The Theory of Matrices. Academic Press, Orlando, 2. edition, 1985.

[59] C. Lanczos. Solution of systems of linear equations by minimized iterations. J. Res. Natl. Bur. Stand., 49:33–53, 1952.

[60] J. L. Lions and E. Magenes. Non-homogeneous Boundary Value Problems and Applications, Vol. I. Springer, Berlin, 1972.

[61] J. T. Marti. Introduction to Sobolev Spaces and Finite Element Solution of Elliptic Boundary Value Problems. Academic Press, London, 1986.

[62] J. A. Meijerink and H. A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31:148–162, 1977.

[63] N. Meyers and J. Serrin. H=W. Proc. Nat. Acad. Sci. USA, 51:1055–1056, 1964.

[64] C. Miranda. Partial Differential Equations of Elliptic Type. Springer, Berlin, 1970.

[65] J. Necas. Les Methodes Directes en Theorie des Equations Elliptiques. Masson, Paris, 1967.

[66] O. Nevanlinna. Convergence of Iterations for Linear Equations. Birkhäuser, Basel, 1993.

[67] R. A. Nicolaides. On a class of finite elements generated by Lagrange interpolation. SIAM J. Numer. Anal., 9:435–445, 1972.

[68] W. Niethammer. The SOR method on parallel computers. Numer. Math., 56:247–254, 1989.

[69] U. Trottenberg, C. W. Oosterlee, and A. Schüller. Multigrid. Academic Press, London, 2001.

[70] C. C. Paige and M. Saunders. Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal., 12:617–629, 1975.

[71] O. Perron. Zur Theorie der Matrizen. Math. Ann., 64:248–263, 1907.

[72] S. D. Poisson. Remarques sur une equation qui se presente dans la theorie des attractions des spheroides. Nouveau Bull. Soc. Philomathique de Paris, 3:388–392, 1813.


[73] A. Quarteroni and A. Valli. Numerical Approximation of Partial Differential Equations, volume 23 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1994.

[74] A. Reusken. On maximum norm convergence of multigrid methods for two-point boundary value problems. SIAM J. Numer. Anal., 29:1569–1578, 1992.

[75] A. Reusken. The smoothing property for regular splittings. In W. Hackbusch and G. Wittum, editors, Incomplete Decompositions: (ILU)-Algorithms, Theory and Applications, volume 41 of Notes on Numerical Fluid Mechanics, pages 130–138. Vieweg, Braunschweig, 1993.

[76] H.-G. Roos, M. Stynes, and L. Tobiska. Numerical Methods for Singularly Perturbed Differential Equations, volume 24 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1996.

[77] T. Rusten and R. Winther. A preconditioned iterative method for saddle point problems. SIAM J. Matrix Anal. Appl., 13:887–904, 1992.

[78] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, London, 1996.

[79] Y. Saad and M. H. Schultz. Conjugate gradient-like algorithms for solving nonsymmetric linear systems. Math. Comp., 44:417–424, 1985.

[80] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7:856–869, 1986.

[81] L. R. Scott and S. Zhang. Finite element interpolation of nonsmooth functions satisfying boundary conditions. Math. Comp., 54:483–493, 1990.

[82] D. Silvester and A. Wathen. Fast iterative solution of stabilised Stokes systems. Part II: Using general block preconditioners. SIAM J. Numer. Anal., 31:1352–1367, 1994.

[83] M. Sion. On general minimax theorems. Pacific J. of Math., 8:171–176, 1958.

[84] G. L. G. Sleijpen and D. R. Fokkema. BiCGstab(ℓ) for linear equations involving matrices with complex spectrum. ETNA, 1:11–32, 1993.

[85] G. L. G. Sleijpen and H. van der Vorst. Optimal iteration methods for large linear systems of equations. In C. B. Vreugdenhil and B. Koren, editors, Numerical Methods for Advection-Diffusion Problems, volume 45 of Notes on Numerical Fluid Mechanics, pages 291–320. Vieweg, Braunschweig, 1993.

[86] P. Sonneveld. CGS: A fast Lanczos-type solver for nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 10:36–52, 1989.

[87] R. Stenberg. Error analysis of some finite element methods for the Stokes problem. Math. Comp., 54:495–508, 1990.

[88] G. Strang. Linear Algebra and its Applications. Harcourt Brace Jovanovich, San Diego, 3. edition, 1988.


[89] A. van der Sluis. Condition numbers and equilibration of matrices. Numer. Math., 14:14–23, 1969.

[90] A. van der Sluis and H. van der Vorst. The rate of convergence of conjugate gradients. Numer. Math., 48:543–560, 1986.

[91] H. A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 13:631–644, 1992.

[92] R. S. Varga. Matrix Iterative Analysis. Prentice Hall, Englewood Cliffs, New Jersey, 1962.

[93] R. Verfürth. Error estimates for a mixed finite element approximation of the Stokes problem. RAIRO Anal. Numer., 18:175–182, 1984.

[94] R. Verfürth. Robust a posteriori error estimates for stationary convection-diffusion equations. SIAM J. Numer. Anal., 43:1766–1782, 2005.

[95] W. Walter. Gewöhnliche Differentialgleichungen. Heidelberger Taschenbücher. Springer, Berlin, 1972.

[96] P. Wesseling. An Introduction to Multigrid Methods. Wiley, Chichester, 1992.

[97] G. Wittum. Linear iterations as smoothers in multigrid methods: Theory with applications to incomplete decompositions. Impact Comput. Sci. Eng., 1:180–215, 1989.

[98] G. Wittum. On the robustness of ILU-smoothing. SIAM J. Sci. Stat. Comp., 10:699–717, 1989.

[99] J. Wloka. Partial Differential Equations. Cambridge University Press, Cambridge, 1987.

[100] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, NY, 1971.
