Numerical Analysis and Parabolic Equations

A Second Course

Ole Østerby

Department of Computer Science, Aarhus University

Spring 2005 (June 10, 2005)


    Contents

1 Error analysis
1.1 Introduction
1.2 Number representation and computational errors
1.3 Bisection
1.4 Bisection in practice
1.5 An approximation to e
1.6 Floating-point numbers
1.7 A model for machine numbers
1.8 IEEE Standard for Binary Floating-Point Arithmetic
1.9 Computational errors
1.10 More computational errors
1.11 Condition and stability
1.12 On adding many numbers
1.13 Some good advice
1.14 Truncation errors vs. computational errors

2 The global error – theoretical aspects
2.1 Introduction
2.2 The explicit method
2.3 The initial condition
2.4 Dirichlet boundary conditions
2.5 The error for the explicit method
2.6 The implicit method
2.7 An example
2.8 Crank-Nicolson
2.9 Example continued
2.10 Upwind schemes
2.11 Boundary conditions with a derivative
2.12 A first order boundary approximation
2.13 The symmetric second order approximation
2.14 An asymmetric second order approximation

3 Estimating the global error and order
3.1 The local error
3.2 The global error
3.3 Limitations of the technique
3.4 An example
3.5 Literature

4 Two space dimensions
4.1 Introduction
4.2 The explicit method
4.3 Implicit methods
4.4 ADI methods
4.5 The Peaceman-Rachford method
4.6 Practical considerations
4.7 Stability of Peaceman-Rachford
4.8 D'Yakonov
4.9 Douglas-Rachford
4.10 Stability of Douglas-Rachford
4.11 The local error
4.12 The global error

5 Equations with mixed derivative terms
5.1 Practical considerations
5.2 Stability with mixed derivative
5.3 Stability of ADI methods
5.4 Summary

6 Two-factor models – two examples
6.1 The Brennan-Schwartz model
6.2 Practicalities
6.3 A traditional Douglas-Rachford step
6.4 The Peaceman-Rachford method
6.5 Fine points on efficiency
6.6 Convertible bonds

7 Ill-posed problems
7.1 Theory
7.2 Practice
7.3 Variable coefficients – an example

8 A free boundary problem
8.1 Introduction
8.2 The mathematical model
8.3 The boundary condition at infinity
8.4 Finite difference schemes
8.5 Varying the time steps
8.6 The implicit method
8.7 The Crank-Nicolson method
8.8 Determining the order
8.9 Efficiency of the methods


    Chapter 1

    Error analysis

    1.1 Introduction

As the main title suggests, these notes cannot stand alone but are meant as a supplement to an ordinary text book in Numerical Analysis. This chapter treats various aspects of error analysis, usually not found in such text books, but which the author has found useful in practical applications.

1.2 Number representation and computational errors

An old rule within Numerical Analysis says: No result is better than the accompanying error estimate. Therefore error analysis is an important subject, and the first step is to identify the disease and localize the sources of contamination.

The error is usually defined as the true value minus the calculated value. If x̃ is an approximation to x, then the error in x̃ is

error = x − x̃.

The error is a signed number, and it is sometimes as important to know the sign of the error as its magnitude.

Example: More than 2000 years ago Archimedes used circumscribed polygons to calculate an approximation to π: 22/7. Archimedes did not have the tools for error estimation, but by arranging his calculations carefully he made sure that the error was negative: π < 22/7. Using inscribed polygons and equally careful calculations he arrived at an approximation with positive error, such that he altogether could demonstrate:

3 10/71 < π < 3 1/7.  □

Example: exp(x) can be approximated by the first terms of the MacLaurin series, e.g.

e^x ≈ 1 + x + x²/2 + x³/6.

The error is the sum of the remainder series:

error = x⁴/24 + x⁵/120 + ⋯

If −1 < x < 0, then this series is alternating with decreasing terms, so the error has the same sign as the first term in the remainder series and is smaller in absolute value. So we have an error bound.
If x > 0 then the error is positive, but we have no immediate error bound.
If 0 < x < 1 then the first term in the remainder series will provide a useful error estimate, i.e. a number which has the right sign and the right order of magnitude without being a safe bound. □

In many cases the important quantity is the relative error, defined as

relative error = error/x, (x ≠ 0).

But where and how do these errors originate?

1. One type of error is the so-called truncation error, which appears when an infinite series defining the solution to a mathematical problem is truncated to a finite number of terms (cf. the example above). We also use the expression discretization error, which reminds us that a mathematical (continuous) problem must be discretized (made finite) in order to be attacked numerically. In many cases these two considerations amount to the same thing, and the terms are used interchangeably.

Example: To solve the differential equation y′ = f(x, y) one can use Euler's method

y_(n+1) = y_n + h·f(x_n, y_n)

where x_n = x_0 + n·h and y_n is an approximation to the solution value at x_n. The error is independent of whether we choose to consider Euler's formula as a discretized version of the differential equation or we consider the right-hand side as a truncated Taylor series for y(x_n + h). □
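As a small illustration (a sketch of my own, not from the notes; Free Pascal syntax, with f(x, y) = y so that the true solution is eˣ):

    program EulerDemo;
    { Euler's method for y' = y, y(0) = 1; the true value at x = 1 is e }
    var y, h: double;
        n, steps: longint;
    begin
      steps := 100;
      h := 1.0/steps;
      y := 1.0;
      for n := 1 to steps do
        y := y + h*y;            { y_(n+1) = y_n + h*f(x_n, y_n) }
      writeln('y(1)  = ', y:12:8);
      writeln('error = ', exp(1.0) - y:12:8)
    end.

Doubling the number of steps roughly halves the global error, as one expects for a first order method.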

But it is important to distinguish between this local error, which is committed in a single step, and the global error, which we observe in a given point x as the difference between the true solution value and the calculated value, and which is the accumulated effect of the errors in each individual step. It is clear that we can reduce the local error by reducing the step size h, but it is not quite as obvious what will happen with the global error, because there will be more steps and more contributions to the error; and these cannot be added directly because each of them propagates independently through Euler's formula.

To estimate the magnitude of the truncation error, or to find the step size or the number of terms to include in order that the truncation error becomes smaller than a desired error tolerance, is a mathematical problem. Our computers cannot always help us here.

2. Quite another type of error, which does not bother the pure mathematician, but which can have a considerable influence on our results, which are often the result of millions of unsupervised calculations, is the rounding error or computational error, which we shall take a closer look at in the next sections.

    3. Finally we have the regular blunders, such as 2 + 2 = 5, or y = x/0. Theseerrors can not be analysed; they must be eradicated.

    1.3 Bisection

From mathematics we know the theorem: If a function f is continuous in a closed interval [a, b], and f(a) and f(b) have different signs, then f has (at least) one zero in (a, b).

We should like to find this zero, and this can be done by successively computing the value of f at the midpoint, m, and replacing [a, b] by [a, m] or [m, b], depending on whether f(m) has the same sign as f(b) or not. If this process is not stopped by f(m) = 0 at some stage, then after k steps we shall have an interval of width (b − a)·2^(−k) which contains the zero. We can therefore determine the zero with arbitrary accuracy, and at any stage we have a safe error bound.
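A minimal sketch of the process in Free Pascal (the choice of f, interval, and iteration count are mine):

    program BisectDemo;
    { bisection for a continuous f with f(a)*f(b) < 0; here
      f(x) = x^3 - 2, so the zero is the cube root of 2 }
    function f(x: double): double;
    begin
      f := x*x*x - 2.0
    end;
    var a, b, m: double;
        k: integer;
    begin
      a := 1.0; b := 2.0;
      for k := 1 to 50 do
      begin
        m := a + (b - a)/2;      { midpoint as approximation + correction }
        if f(m)*f(b) > 0 then
          b := m                 { f(m) has the same sign as f(b): zero in [a,m] }
        else
          a := m                 { otherwise the zero is in [m,b] }
      end;
      writeln('zero = ', m:20:17, '   interval width = ', b - a)
    end.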


    1.4 Bisection in practice

Example: We have performed bisection on

p5(x) = x⁵ − 5.5x⁴ + 12.1x³ − 13.31x² + 7.3205x − 1.61051

with a = 0 and b = 2. This polynomial has the root 1.1 in [a, b]. We got the following results:

 i   m                     p5(m)                   error
 1   1.00000000000000000   -0.00001000000000095     0.1000000
 2   1.50000000000000000    0.01023999999999559    -0.4000000
 3   1.25000000000000000    0.00007593749999857    -0.1500000
 4   1.12500000000000000    0.00000000976562253    -0.0250000
 5   1.06250000000000000   -0.00000007415771619     0.0375000
 6   1.09375000000000000   -0.00000000000953770     0.0062500
 7   1.10937500000000000    0.00000000007241940    -0.0093750
 8   1.10156250000000000    0.00000000000000733    -0.0015625
 9   1.09765625000000000   -0.00000000000007172     0.0023438
10   1.09960937500000000   -0.00000000000000089     0.0003906
11   1.10058593750000000   -0.00000000000000044    -0.0005859
12   1.10107421875000000    0.00000000000000067    -0.0010742
13   1.10083007812500000   -0.00000000000000133    -0.0008301
 .
 .
52   1.10101168873318978   -0.00000000000000133    -0.0010117
53   1.10101168873319000    0.00000000000000000    -0.0010117
54   1.10101168873319022    0.00000000000000044    -0.0010117
55   1.10101168873319022    0.00000000000000044    -0.0010117

We can now make the following observations:
After 54 iterations there is no change/improvement.
The zero is determined to about 1.1010.
There is no significant improvement after iteration 12.
We could have stopped the process at iteration 53, since p5(m) was calculated to 0.
An error of 0.001 is rather poor when we compute with 16 decimals. □

In practice there is a limit to how fine we can subdivide, and this limit is not even a measure of the error when we, as here, make a wrong decision, since f(m) is calculated with the wrong sign at iteration 11.


    1.5 An approximation to e

From mathematics we know that

lim_(n→∞) (1 + 1/n)^n = e ≈ 2.718281828459…

We compute the subsequence corresponding to n = 2^k:

 k   1 + 2^(-k)            (1 + 2^(-k))^(2^k)
 1   1.50000000000000000   2.25000000000000000
 2   1.25000000000000000   2.44140625000000000
 3   1.12500000000000000   2.56578451395034790
 4   1.06250000000000000   2.63792849736659995
 5   1.03125000000000000   2.67699012937818237
 6   1.01562500000000000   2.69734495256509987
 .
 .
23   1.00000011920928955   2.71828166642075297
24   1.00000005960464477   2.71828174742680062
25   1.00000002980232238   2.71828178793237640
26   1.00000001490116119   2.71828180818247311
27   1.00000000745058059   2.71828180818247311
 .
 .
51   1.00000000000000044   2.71828180818247311
52   1.00000000000000022   2.71828180818247311
53   1.00000000000000000   1.00000000000000000
54   1.00000000000000000   1.00000000000000000

We observe the following:
After 26 iterations we have 8 correct digits (7 decimals).
Then there is no change for a long time.
At iteration 53 the product becomes equal to 1. □
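The table is easy to reproduce, since (1 + 2^(−k))^(2^k) can be computed from 1 + 2^(−k) by squaring k times; a sketch which should reproduce the table in double precision:

    program EApprox;
    { (1 + 2^(-k))^(2^k) by k repeated squarings }
    var base, p: double;
        k, j: integer;
    begin
      base := 1.0;
      for k := 1 to 54 do
      begin
        base := base/2;                { base = 2^(-k) }
        p := 1.0 + base;               { fl(1 + 2^(-k)) }
        for j := 1 to k do
          p := p*p;                    { square k times }
        writeln(k:3, 1.0 + base:22:17, p:22:17)
      end
    end.

At k = 53 the sum 1 + 2^(−53) is already rounded to 1, so every squaring afterwards produces 1, exactly as observed.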

These two examples show that there is a difference between usual mathematical calculations and whatever happens within a computer. In the next sections we shall take a closer look at what goes on when we use computers to perform calculations with real numbers.


    1.6 Floating-point numbers

    Most modern computers represent (a subset of) the real numbers using the so-called floating-point numbers.

    A floating-point number system is characterized by

a base β ∈ N∖{1}
a number of digits s ∈ N
a smallest exponent m ∈ Z
a largest exponent M ∈ Z

A floating-point number y can be written

y = ±d1.d2…ds · β^e

where 1 ≤ d1 ≤ β − 1, 0 ≤ dk ≤ β − 1, and m ≤ e ≤ M.

The number d1.d2…ds is interpreted as d1 + d2·β^(−1) + ⋯ + ds·β^(−s+1) and is called the mantissa or the number part, while e is called the exponent part.

Remark: Most floating-point number systems also contain the number 0 = 0.00…0 · β^e, where e can be, but does not have to be, 0. □

A floating-point number system with parameters β, s, m and M is denoted F(β, s, m, M).

On an electronic computer a practical choice is β = 2, and then the s digits can be stored in s bits. The sign takes up another bit. The exponent e can be stored as an integer, e.g. in the interval [m, M] = [−2^t, +2^t − 1], which takes t + 1 bits, such that the number altogether takes up s + t + 2 bits.

Remark: The condition d1 ≥ 1 ensures that we work with normalised floating-point numbers, and our estimates of rounding errors in the next section are only valid for these. If β = 2 it follows that d1 = 1, and this redundant information can be left out, such that we actually have one more bit available. □

Example: Fig. 1.1 shows the numbers in F(2, 3, −2, 1), corresponding to what can be represented in 6 bits. Note how the distance between adjacent floating-point numbers changes throughout the interval. Note also the large (relatively speaking) interval from 0 to the smallest positive floating-point number, β^m. □

[Figure 1.1: the numbers in F(2, 3, −2, 1)]

    The actual physical representation of digits in bits and bytes can vary quite abit(!), but the above notation captures the details essential for our considerations.

If the result of a computation becomes β^(M+1) or bigger then we have floating-point overflow, and the computations will either be stopped, or a flag will be set with information on what happened. If a number becomes smaller in absolute value than β^m, then we have underflow. The result will most often be set equal to 0, and the computations can continue, possibly with an indication that an underflow has occurred.

Example: The pocket calculator HP11C uses F(10, 10, −99, 99), while SUN Pascal is based on F(2, 53, −1022, 1023). □

It happens very seldom that meaningful calculations take us outside the limits of the exponent (overflow or underflow), so we shall disregard these in what follows and instead consider the system F(β, s).

    1.7 A model for machine numbers

Floating-point numbers are by far the most common, but not the only, way to represent something that looks like real numbers on a computer. To view things a little more generally we introduce the following model.

Let M ⊂ R be a discrete set of numbers which contains 0, and let fl : R → M be a mapping with the following properties:

x ∈ M ⇒ fl(x) = x,
∃ ε > 0 ∀ x ∈ R : |x − fl(x)| ≤ ε·|x|.

The above ε is often denoted the machine accuracy or machine-ε.

The error, x − fl(x), is called the representation error of x, and from the above property it follows that the relative representation error is bounded by ε. We furthermore have

fl(x) − x = ε₁·x, |ε₁| ≤ ε, or
fl(x) = x(1 + ε₁), |ε₁| ≤ ε,

a fundamental relation which is rather important for what follows.


Remark: This model opens wide possibilities for the choice of fl. It is quite useful, however, to require a monotonicity condition:

a, b ∈ R, a < b ⇒ fl(a) ≤ fl(b).

From this very natural requirement it follows implicitly that a real number is always represented by one of the two closest machine numbers, and therefore that the representation error is bounded by the distance between adjacent machine numbers. □

Example: Let M = F(β, s). If x ∈ R, then x can be written as

x = ±d1.d2…ds ds+1… · β^e.

Let fl be the mapping which cuts off the digits ds+1, …, such that

fl(x) = ±d1.d2…ds · β^e.

We then have

0 ≤ |error| ≤ β^(e−s+1)

and

0 ≤ |relative error| ≤ β^(e−s+1)/|x| ≤ β^(e−s+1)/β^e = β^(1−s).

Instead of this truncation of the infinite decimal fraction one might choose correct rounding to the nearest (s − 1)-decimal fraction. In this way the error bound becomes half as big; but what is far more important: the error now has mean value 0, which has a favourable effect on the accumulated error after many computations. □

Remark: Since the mantissa can vary between 1 and β(1 − β^(−s)), the relative accuracy can vary with a factor β. We wish for a uniform accuracy, and it is therefore an advantage to choose β small. β = 2 is optimal in this respect. This observation shall not detain us from using β = 10 for daily use; but β = 16, which has been used on some IBM computers, is not recommended. □

Remark: On the very first computers it was common to represent real numbers in the interval (−1, +1) as binary fractions y = ±0.d1d2…d(n−1). With n bits you could represent numbers with an absolute accuracy of 2^(−n+1), but in a very confined interval. With floating-point numbers you give up the high accuracy in a small interval for a slightly reduced accuracy in a much larger interval, since some of the n bits are now used to represent the exponent part. □

Example: Fig. 1.2 shows the numbers in F(2, 5, −1, −1), which corresponds to what can be represented with 6 bits in fixed-point notation. □

[Figure 1.2: the numbers in F(2, 5, −1, −1)]


Example: With 32 bits and fixed-point representation we have an absolute error of at most 2^(−31) ≈ 4.7·10^(−10), but only for x ∈ (−1, 1). If we use 8 bits for the exponent, then the relative representation error is bounded by 2^(−23) ≈ 1.2·10^(−7); but we can choose m = −128 and M = 127 and therefore represent numbers in the interval from 2^(−128) ≈ 2.9·10^(−39) to 2^128 ≈ 3.4·10^38. □

1.8 IEEE Standard for Binary Floating-Point Arithmetic

Throughout the years there has been a fair amount of confusion about the implementation of floating-point numbers, and many decisions seem to have been left to individual engineers and designers of arithmetic units; numerical properties have therefore played second fiddle to considerations regarding speed and design.

In 1985 the IEEE (Institute of Electrical and Electronics Engineers, USA) issued a standard for binary floating-point numbers [6], and this standard has more or less been adopted in most modern processors. The IEEE standard includes two formats, which in our notation (roughly) correspond to

Single precision: F(2, 24, −126, 127)
Double precision: F(2, 53, −1022, 1023)

The IEEE standard furthermore prescribes correct rounding, with the extra finesse that if x lies exactly between two machine numbers then x is rounded to that machine number whose least significant bit is 0. The standard also prescribes three possibilities for directional rounding: towards 0, towards +∞, and towards −∞. These come in handy when implementing interval arithmetic.

A count of the number of bits yields 33 resp. 65 bits for the two representations, and we can conclude that d1 is not supposed to be explicitly stored.

We also note that two possible values of the exponent have been taken out for special purposes. The high value has been reserved for representations of overflow: Inf (= Infinity) and NaN (= Not a Number, e.g. 0/0 or Inf − Inf). There are also rules for arithmetic with these generalized numbers, such that a program does not need to stop because of a division by 0 or any other form of floating overflow.

The large gap between 0 and β^m is also mended by allowing the mantissa at this particular exponent to be non-normalised. One drawback of this trick is that our relative error estimates of the previous section do not hold in this interval; but then again, they didn't hold before, either.

Example: Fig. 1.3 shows the numbers in an IEEE-like number system F(2, 4, −1, 1), where the leading bit in a normalised number is not stored, and where the smallest exponent is reserved for non-normalised numbers in the neighbourhood of 0. We have, however, not reserved the largest exponent for special purposes. As in the previous figures the number system corresponds to what can be represented using 6 bits. □

[Figure 1.3: the numbers in F(2, 4, −1, 1)]

The various characteristics of the floating-point number system used in one's computer, such as word length, relative representation error, etc., are often documented in the accompanying manual; but a considerable amount of information can be gained from program fragments such as the following.

    Example: On a Sun Sparc ELC the following Pascal program is run:

eps := 1; n := 0; sum := 2;
while sum > 1 do
  begin eps := eps/2; n := n + 1; sum := 1 + eps end;
writeln(n:6, eps:22:18, ' + 1 = 1');
twoeps := eps*2; n := n - 1; sum := 1 + twoeps;
writeln(n:6, twoeps:22:18, ' + 1 = ', sum:22:18, ' = x');
writeln(' ':6, eps:22:18, ' + x = ', sum + eps:22:18);

    The results are:

    53 0.00000000000000011 + 1 = 1

    52 0.00000000000000022 + 1 = 1.00000000000000022 = x

    0.00000000000000011 + x = 1.00000000000000044


    These results are in agreement with the IEEE standard for double precision:

1 ⊕ 2^(−53) = 1
1 ⊕ 2^(−52) = 1 + 2^(−52)
(1 + 2^(−52)) ⊕ 2^(−53) = 1 + 2^(−51)

The second relation shows that we actually have 53 bits at our disposal in the mantissa. The first and third relations show that rounding at 0.5 can go up or down depending on the value of the least significant bit.

Running

eps := 1; n := 0;
while eps > 0 do
  begin eps := eps/2; n := n + 1 end;

gives IEEE Warning: Underflow, and a printout of n shows that 2^(−1074) is the smallest positive machine number, and that fl(2^(−1075)) = 0.

Further investigations show that

(1 + 2^(−52)) · 2^(−1022) = (1 + 2^(−52)) · 2^(−1022)
(1 + 2^(−52)) · 2^(−1023) = 2^(−1023)

The last calculation, which is accompanied by IEEE Warning: Inexact, shows how the relative representation error becomes larger when the exponent is below −1022. In similar ways one can show that

1.25 · 2^(−1072) = 2^(−1072) + 2^(−1074)
1.25 · 2^(−1073) = 2^(−1073)

and that a computation of 2^1024 gives Infinity with IEEE Warning: Overflow, and that the largest representable number is

(1 + 2^(−1) + ⋯ + 2^(−52)) · 2^1023. □

    1.9 Computational errors

Based on the estimates of the relative representation error we shall now estimate the errors associated with addition, subtraction, multiplication and division within a machine number system.

Example: In this and later examples we shall demonstrate properties of floating-point number systems on the system F(10, 4), i.e. a common base 10 system where numbers are written with 4 significant digits. Instead of the correct way of writing, like 2.891·10^1 or 6.146·10^(−4), we shall allow ourselves to use the more reader-friendly forms 28.91 and 0.0006146. Note that these two numbers are written with 2 resp. 7 decimals, but that they both have 4 significant digits.

The numbers 1.573 and 0.1824 are valid machine numbers, but

1.573 + 0.1824 = 1.7554
1.573 − 0.1824 = 1.3906
1.573 × 0.1824 = 0.2869152
1.573 / 0.1824 = 8.6239035…

can not be represented in F(10, 4). This shortcoming is not due to floating-point numbers being a particularly clumsy way of representation, but follows directly from the fact that the set of machine numbers is finite. Our first conclusion, based on these examples, is that the set of machine numbers is not closed w.r.t. plus, minus, multiply, and divide. □

Of course we should like to perform calculations anyway within our number system, so instead we modify the arithmetic. We shall make the following

Assumption: The result of adding (subtracting, multiplying, dividing) two machine numbers on a computer is the same as the representation of the exact sum (difference, product, quotient), i.e.

a, b ∈ M ⇒ a ⊕ b = fl(a + b) = (a + b)(1 + ε₁), |ε₁| ≤ ε. □

Remark: This assumption is quite realistic. Most computers have internal registers of length 2s. They can therefore internally store a product of two s-digit numbers or a sum of two numbers whose exponents differ by at most s. But also in those cases, typically division, where an exact result cannot be represented, do we have sufficient information to find fl(a/b) correctly.

A closer analysis will show that just one extra digit (in addition to an overflow digit) is sufficient for the assumption to hold, if the arithmetic unit in the computer is designed carefully. □

It follows from the above that we can represent the result of a simple arithmetic operation involving two machine numbers with a relative error of the same order of magnitude as the representation error of the number system. But how about three numbers?

Example:

(1.418 ⊕ 2937) ⊖ 2936 = 2938 ⊖ 2936 = 2.000
1.418 ⊕ (2937 ⊖ 2936) = 1.418 ⊕ 1.000 = 2.418
1.418 ⊗ (2001 ⊖ 2000) = 1.418 ⊗ 1.000 = 1.418
(1.418 ⊗ 2001) ⊖ (1.418 ⊗ 2000) = 2837 ⊖ 2836 = 1.000  □
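F(10, 4) is easy to simulate; in the sketch below the helper fl4 (my own construction, not from the notes) rounds every result to 4 significant digits, and the program reproduces the first two lines of the example above:

    program F104Demo;
    { simulate F(10,4) arithmetic by rounding each result to
      4 significant decimal digits }
    uses math;
    function fl4(x: double): double;
    var scale: double;
    begin
      if x = 0 then
        fl4 := 0
      else begin
        scale := power(10.0, floor(log10(abs(x))) - 3);
        fl4 := round(x/scale)*scale
      end
    end;
    begin
      writeln(fl4(fl4(1.418 + 2937) - 2936):10:3);  { (1.418 + 2937) - 2936 = 2.000 }
      writeln(fl4(1.418 + fl4(2937 - 2936)):10:3)   {  1.418 + (2937 - 2936) = 2.418 }
    end.

The same helper makes it easy to check the remaining two lines and the examples in Section 1.11.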

We observe two of the most essential consequences of the fact that the machine numbers are not closed w.r.t. the simple arithmetic operations: the associative and the distributive laws do not hold for machine numbers. Moreover we note that we cannot guarantee a small relative error when more than two numbers are involved in the computations.

A closer analysis shows

a, b, c ∈ M ⇒ (a ⊕ b) ⊕ c = fl(fl(a + b) + c)
  = ((a + b)(1 + ε₁) + c)(1 + ε₁′)
  = (a + b)(1 + ε₁)(1 + ε₁′) + c(1 + ε₁′)
  = a(1 + ε₂) + b(1 + ε₂) + c(1 + ε₁′)
  = a + b + c + aε₂ + bε₂ + cε₁′

where |ε₁|, |ε₁′| ≤ ε and |ε₂| ≤ 2ε + ε².

The last expression is an ordinary (forward) error analysis, where the computed result is compared to the true value. If a, b and c have the same sign then this analysis indicates that the relative error is small, but if we have different signs then the relative error can be arbitrarily large.

The expression right above it is of a type which has proved much more useful when analysing computational errors. It expresses the fact that the computed sum of a, b, and c is equal to the true sum of three numbers, ā, b̄, and c̄, that are pretty close to a, b, and c, respectively, in the sense that they deviate by no more than the representation errors. Such a backward error analysis shows (in this case) that this computation could not really be performed much better.

    1.10 More computational errors

When analysing computational errors in arithmetic expressions involving several operations we shall encounter error factors of the form

(1 + ε₁)(1 + ε₂)/(1 + ε₃), |εᵢ| ≤ ε.

We shall now show that the essential part in such expressions is the number of ε's, and that the above expression can be written as 1 + ε̄, where |ε̄| ≤ 3ε almost.


Theorem: Let |εᵢ| ≤ ε for i = 1, 2, …, n, and 0 ≤ k ≤ n. If nε = b < 1, then there exists a θ, |θ| < 1/(1 − b), such that

u = (1 + ε₁)(1 + ε₂)⋯(1 + ε_k) / ((1 + ε_(k+1))⋯(1 + ε_n)) = 1 + nθε.

Remark: As ε usually is very small (2^(−24), 10^(−9), or the like), b will almost always be small, such that 1/(1 − b) ≈ 1 and |θ| ≈ 1. □

The proof of the theorem is split into a series of lemmas.

Lemma 1: (1 − ε)^k ≤ ∏_(i=1)^k (1 + εᵢ) ≤ (1 + ε)^k. □

Lemma 2: 0 < a < 1 ⇒ 1 + a < 1/(1 − a).

Proof: (1 + a)(1 − a) = 1 − a² < 1. □

Corollary 3: 0 < a < 1 ⇒ 1 − a < 1/(1 + a).

Lemma 4: 0 < kε < 1 ⇒ 1 − kε ≤ (1 − ε)^k < (1 + ε)^(−k).

The last inequality follows from Corollary 3, and the first inequality is trivial for k = 1. For k ≥ 2 it follows from

(1 − ε)^k = 1 − kε + (1/2)k(k − 1)ε²(1 − ξ)^(k−2) ≥ 1 − kε  (0 < ξ < ε). □

Lemma 5: 0 < kε < 1 ⇒ (1 + ε)^(−k) < (1 − ε)^(−k) ≤ 1 + kε/(1 − kε).

The first inequality follows from Lemma 2, and the second inequality follows from Lemma 4:

(1 − ε)^(−k) ≤ 1/(1 − kε) = 1 + kε/(1 − kε). □

The proof of the theorem now follows by combining these inequalities:

u ≤ (1 + ε)^k/(1 − ε)^(n−k) ≤ (1 − ε)^(−n) ≤ 1 + nε/(1 − b),

u ≥ (1 − ε)^k/(1 + ε)^(n−k) ≥ (1 − ε)^n ≥ 1 − nε. □

Inspired by the theorem we now introduce the following

Notation: By ε_n we denote a real number which satisfies |ε_n| ≤ nε/(1 − nε) ≈ nε.

Remark: This notation says nothing about the sign, and two occurrences of the same index do not necessarily correspond to the same number. We therefore have some unusual arithmetic rules for ε_n, such as

(1 + ε₁)³ = 1 + ε₃ and (1 + ε₂)/(1 + ε₂) = 1 + ε₄. □

Example: a₁ ⊗ a₂ ⊗ ⋯ ⊗ a_n = a₁a₂⋯a_n(1 + ε_(n−1)). □


Example:

a_n ⊕ a_(n−1) ⊕ ⋯ ⊕ a₁ = a_n(1 + ε_(n−1)) + a_(n−1)(1 + ε_(n−1)) + a_(n−2)(1 + ε_(n−2)) + ⋯ + a₂(1 + ε₂) + a₁(1 + ε₁)

Note that it is not possible to give an estimate for the relative error for a sum where the terms may have different signs. The backward error analysis suggests that the first terms undergo the most perturbations. It is therefore a good rule-of-thumb to start with the smallest terms. It must be mentioned here that this arrangement gives the smallest error estimate, but not necessarily the smallest error. □

Example:

a_n² ⊕ a_(n−1)² ⊕ ⋯ ⊕ a₁² = a_n²(1 + ε_n) + a_(n−1)²(1 + ε_n) + a_(n−2)²(1 + ε_(n−1)) + ⋯ + a₂²(1 + ε₃) + a₁²(1 + ε₂)
  = (a_n² + a_(n−1)² + ⋯ + a₁²)(1 + ε_n)

Since all terms are positive, it is possible here to give an estimate of the relative error. The error estimate (and the error) can be diminished further by doing a binary-tree addition, i.e. adding the terms two and two, adding the partial sums two and two, etc. □

    1.11 Condition and stability

The condition of a problem and the stability of an algorithm are two concepts which are important for the understanding of what factors matter for the accuracy of our computations.

Loosely formulated, a mapping f is said to be well conditioned (in a region Ω) if x̄ close to x implies f(x̄) close to f(x).

This definition looks a bit like the concept of continuity:

∀ε > 0 ∃δ > 0 : |x̄ − x| < δ ⇒ |f(x̄) − f(x)| < ε.

A well conditioned problem is continuous, but for a continuous problem to be well conditioned, δ must not be too small relative to ε.

We can define a condition number for the mapping f in a region Ω as

κ(f) = sup_(x,x̄∈Ω) |f(x̄) − f(x)| / |x̄ − x|.


The condition number is closely related to the concepts of Lipschitz constant and modulus of continuity for real functions.

A small condition number (κ ≈ 10) means that the problem is well conditioned, while a large condition number (κ > 1000000) means that the problem is ill conditioned, and there is a smooth transition from the very well conditioned problems to the very poorly conditioned ones.

An algorithm or computational formula f̄ is a practical realisation of the mapping f. An algorithm is stable in a region Ω if

∀x ∃x̄ close to x such that f̄(x) is close to f(x̄).

We have again stated the definition very loosely. This reflects the smooth transition from (very) stable to (clearly) unstable algorithms.

We shall not require that f̄(x) = f(x), since we have no guarantee that f(x) can be represented exactly in the number system at hand.

We note that if f̄ is stable and f is well conditioned, then f̄(x) will be close to f(x), and that is really what we want.

There is no tradition for characterizing a good or bad stability by a stability number, but we sometimes come close when we perform a backward error analysis of an algorithm.

If a problem is ill conditioned then it is a good idea to try to reformulate it, and if this seems impossible then the least we can do is to focus attention on the great sensitivity of the problem. The trouble with ill conditioned problems is a mathematical one, not a computational one, and there is very little we can do about it.

If an algorithm is unstable then we shouldn't use it, but replace it with a stable one.

Example: Finding the roots of a quadratic equation

x² − 2bx + c = 0

with coefficients b and c is an ill conditioned problem if b² ≈ c, i.e. when there is a double root or two close roots. If for instance b = 1 and c = 1, then both roots are equal to 1, but b = 1 and c = 0.9999 9999 gives the roots 0.9999 and 1.0001, so the shift of the roots is 10000 times larger than the change in c. Actually this problem has κ = ∞ for (b, c) close to (1, 1).

But it should be stressed that when the roots of the quadratic equation are not close, then the root-finding problem is a well conditioned one. This example illustrates that when there is a mathematical exception (determinant = 0, double root, …) then there is a neighbourhood around it where the condition is poor.

This example also illustrates that a problem can be ill conditioned for some values of the data and well conditioned in other places. Care should be taken to use it only in the other places. □

Example: The usual formula for the roots,

x_(1,2) = b ± √(b² − c),

is an unstable algorithm for the smaller root if this is much smaller (in absolute value) than the larger, i.e. when |c| ≪ b². If for instance b = 1 and c = 0.000001, and we use the number system F(10, 4), then fl(x₁) = 0 and fl(x₂) = 2. While the larger root is represented as well as possible, the relative error in the smaller root is 100 %. The computational formula for the larger root is stable; but for the smaller root we should rather use the fact that the product of the roots is equal to c, leading to the formula

fl(x₁) = fl(c/x₂) = 0.0000005000,

a result with a relative error less than machine-ε. We note in passing that this problem is well conditioned since κ ≈ 1/2. □

Example: Compute the average of 6.231 and 6.233 in F(10, 4). From the formula m = (a + b)/2 we get

fl(fl(6.231 + 6.233)/2) = fl(12.46/2) = 6.230.

The error is reasonably small, but we note that the computed result does not lie in the interval [a, b]. If we instead use m = a + (b − a)/2 we find

fl(6.231 + fl(fl(6.233 − 6.231)/2)) = 6.231 + 0.001 = 6.232.

We shall return to this way of rearranging the computations to achieve better stability properties in Section 1.13. □
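The remedy for the smaller root is easily expressed in code; a sketch in double precision (the values b = 1, c = 10^(−12) are chosen by me so that the cancellation is visible even with 53 bits):

    program StableRoot;
    { smaller root of x^2 - 2bx + c = 0 when |c| << b^2 (assumes b > 0) }
    var b, c, x2: double;
    begin
      b := 1.0; c := 1.0e-12;
      x2 := b + sqrt(b*b - c);                    { larger root: no cancellation }
      writeln('naive  x1 = ', b - sqrt(b*b - c)); { severe cancellation }
      writeln('stable x1 = ', c/x2)               { the product of the roots is c }
    end.

The naive formula retains only a few correct digits here, while the stable one is correct to machine accuracy.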

The bible on rounding errors, where much of the previous notation has been introduced and many results are given, is [14].


    1.12 On adding many numbers

One often encounters the problem of summing an infinite series. The finite machine accuracy can here be an advantage, because it implicitly sets a limit to how many terms of the series are needed. Even if you don't know the machine accuracy, or if you write a program to be run on different platforms, you can make the program adapt to different surroundings. One simple technique is to keep adding terms until the sum does not change any more. This technique works fine with swiftly converging series, but there are pitfalls.

We can for instance sum the harmonic series

∑_(n=1)^∞ 1/n

in single precision to 15.403 682 71 using 2^21 = 2097152 terms. We note that 1/(2^21 + 1) is so small that adding it to 15.403 682 71 gives no change. But the harmonic series is divergent and therefore has no finite sum. More generally, the fact that one term is too small does not imply that the effect of several terms cannot be felt.

We now redefine the problem to the one of summing the first 2097152 terms of the harmonic series. Can we trust the calculated sum?

 k        2^k    Sk              Dk
 1          2    1.500 000 00    1.500 000 00
 2          4    2.083 333 49    0.583 333 49
 3          8    2.717 857 36    0.634 523 87
 4         16    3.380 728 96    0.662 871 60
 5         32    4.058 495 52    0.677 766 56
 6         64    4.743 891 72    0.685 396 19
 7        128    5.433 147 43    0.689 255 71
 8        256    6.124 345 78    0.691 198 35
 9        512    6.816 517 35    0.692 171 57
10       1024    7.509 182 93    0.692 665 58
11       2048    8.202 081 68    0.692 898 75
12       4096    8.895 108 22    0.693 026 54
13       8192    9.588 195 80    0.693 087 58
14      16384   10.281 306 27    0.693 110 47
15      32768   10.974 409 10    0.693 102 84
16      65536   11.667 428 02    0.693 018 91
17     131072   12.360 085 49    0.692 657 47
18     262144   13.051 303 86    0.691 218 38
19     524288   13.737 017 63    0.685 713 77
20    1048576   14.403 683 66    0.666 666 03
21    2097152   15.403 682 71    0.999 999 05

In the table above we have listed the partial sums S_k = ∑_(n=1)^(2^k) 1/n and the differences D_k = S_k − S_(k−1). We can show that

ln 2 − 2^(−k) < D_k < ln 2 ≈ 0.693 147 18,

and these inequalities are fulfilled up to k = 14, but clearly not for k > 16. Especially the last difference is completely wrong, with a value of almost 1.0. In other words: there must be an error in the first decimal (= third digit) in S_21. Rather poor, considering that we have 7 digits machine accuracy.

But we have also broken the good rule of summing the small terms first. In the table below we have performed the sum backwards and again listed both the partial sums S_k = ∑_(n=2^(k−1)+1)^(2^21) 1/n (with S₁ comprising all the terms) and the differences D_k = S_k − S_(k+1). The latter can be compared with the similar ones in the previous table.

 k        2^k    Sk              Dk
21    2097152    0.693 266 15    0.693 266 15
20    1048576    1.386 155 01    0.692 888 86
19     524288    2.079 162 84    0.693 007 83
18     262144    2.772 186 52    0.693 023 68
17     131072    3.465 300 08    0.693 113 57
16      65536    4.158 451 08    0.693 151 00
15      32768    4.851 575 37    0.693 124 29
14      16384    5.544 691 56    0.693 116 19
13       8192    6.237 781 05    0.693 089 49
12       4096    6.930 810 45    0.693 029 40
11       2048    7.623 712 06    0.692 901 61
10       1024    8.316 372 87    0.692 660 81
 9        512    9.008 551 60    0.692 178 73
 8        256    9.699 749 95    0.691 198 35
 7        128   10.389 008 52    0.689 258 58
 6         64   11.074 402 81    0.685 394 29
 5         32   11.752 169 61    0.677 766 80
 4         16   12.415 040 97    0.662 871 36
 3          8   13.049 565 32    0.634 524 35
 2          4   13.632 898 33    0.583 333 02
 1          2   15.132 898 33    1.500 000 00

The results look better, but not overwhelmingly so. Apparently we must be satisfied with 3 correct decimals in the first differences, where more than 100000 terms participate.

In this problem, where we have many terms of the same order of magnitude, a third adding strategy can be advantageous. Arrange the terms as the leaves of a balanced binary tree (this works very well when n is a power of 2), add the terms 2 and 2, add the partial sums 2 and 2, etc. In a summation of n = 2^k terms each term participates in only k additions, and this leads to a smaller error estimate. The table below shows that it also leads to smaller errors.

 k        2^k    Sk              Dk
 1          2    1.500 000 00    1.500 000 00
 2          4    2.083 333 49    0.583 333 49
 3          8    2.717 857 36    0.634 523 87
 4         16    3.380 729 20    0.662 871 84
 5         32    4.058 495 52    0.677 766 32
 6         64    4.743 891 24    0.685 395 72
 7        128    5.433 147 43    0.689 256 19
 8        256    6.124 345 30    0.691 197 87
 9        512    6.816 516 88    0.692 171 57
10       1024    7.509 176 25    0.692 659 38
11       2048    8.202 079 77    0.692 903 52
12       4096    8.895 105 36    0.693 025 59
13       8192    9.588 191 99    0.693 086 62
14      16384   10.281 309 13    0.693 117 14
15      32768   10.974 441 53    0.693 132 40
16      65536   11.667 581 56    0.693 140 03
17     131072   12.360 725 40    0.693 143 84
18     262144   13.053 871 15    0.693 145 75
19     524288   13.747 016 91    0.693 145 75
20    1048576   14.440 163 61    0.693 146 71
21    2097152   15.133 310 32    0.693 146 71

The differences suggest that the error is of the order of 10^(−6) ≈ 2^(−20), which is close to machine-ε relative to a sum of about 15.

To check the results we have performed yet another calculation, in double precision and using binary addition. The results are given below:

summation   precision   computed sum     error
binary      double      15.133 306 70
binary      single      15.133 310 32    -0.000 003 62
backwards   single      15.132 898 33    +0.000 408 37
forwards    single      15.403 682 71    -0.270 376 01

The results show clearly that it is far better to add the small terms first than last. Even better is binary addition, however. The relative error here is about 4ε, which is rather good considering that we have added two million terms.
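A recursive sketch of the binary addition (single precision, as in the tables; the exact digits may differ between compilers and platforms):

    program BinarySum;
    var s: single;
        n: longint;
    function treesum(lo, hi: longint): single;
    { binary-tree summation of 1/n for n = lo..hi }
    var mid: longint;
    begin
      if lo = hi then
        treesum := 1.0/lo
      else begin
        mid := (lo + hi) div 2;
        treesum := treesum(lo, mid) + treesum(mid + 1, hi)
      end
    end;
    begin
      writeln('binary   : ', treesum(1, 2097152):12:8);
      s := 0;                              { backwards, for comparison }
      for n := 2097152 downto 1 do
        s := s + 1/n;
      writeln('backwards: ', s:12:8)
    end.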


    1.13 Some good advice

It is easy to give examples showing that computer calculations with real numbers can go wrong. It is more difficult (and dangerous) to give general advice on how to avoid the pitfalls. But it is rather important to do so, so we try anyway.

    1. A (good) approximation plus a (small) correction term.

In many calculations, typically iterations, we shall compute better and (hopefully) better approximations to the solution. If the calculations can be arranged as mentioned above, this will always be advantageous. For instance

a + (b − a)/2 is better than (a + b)/2,

x − (x² − a)/(2x) is better than (1/2)(x + a/x),

x₂ − y₂(x₂ − x₁)/(y₂ − y₁) is better than (y₂x₁ − x₂y₁)/(y₂ − y₁).
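In code, the second of these (Newton's method for √a) looks as follows (a sketch; starting guess and tolerance are my choices):

    program SqrtNewton;
    { Newton iteration for sqrt(a) as approximation + correction }
    var a, x, corr: double;
    begin
      a := 2.0;
      x := 1.5;                        { crude first approximation }
      repeat
        corr := (x*x - a)/(2*x);       { small correction term }
        x := x - corr                  { x - (x^2 - a)/(2x) }
      until abs(corr) <= 1e-15*abs(x);
      writeln('sqrt(2) = ', x:19:16)
    end.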

2. Add the small terms first.

When a series with decreasing terms is to be summed, it is a good idea to start with the small terms. If you must add many numbers, then you should consider binary addition.

3. Be careful when subtracting almost equal numbers.

I am not saying that such subtractions should be avoided. We have actually used them to advantage in #1. They are perfectly OK in correction terms, but we must make sure that we do not lose essential information.

4. Avoid large partial results on the road to a small final answer.

The MacLaurin series for exp(x) is not to be recommended when x = −10, where it contains terms of the order 2700, while the result is 0.000 045. A solution here is (cf. #5) to compute exp(+10), where the series can be used, and take the reciprocal.
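The effect is easy to demonstrate in single precision (a sketch; 60 terms suffice, since later terms vanish):

    program ExpDemo;
    { MacLaurin series for exp(x) in single precision:
      direct summation at x = -10 vs the reciprocal of the series at +10 }
    function expser(x: single): single;
    var term, s: single;
        n: integer;
    begin
      term := 1; s := 1;
      for n := 1 to 60 do
      begin
        term := term*x/n;        { x^n/n! }
        s := s + term
      end;
      expser := s
    end;
    begin
      writeln('direct series : ', expser(-10.0));
      writeln('reciprocal    : ', 1.0/expser(10.0));
      writeln('library value : ', exp(-10.0))
    end.

The direct sum at x = −10 is dominated by rounding errors from the large intermediate terms and typically has no correct digits left; the reciprocal of the series at +10 is accurate to single precision.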

5. Use mathematical reformulations to avoid 3. and 4.

For small values of v we have that

2 sin²(v/2) is better than 1 − cos v.

For values of v in the neighbourhood of π/2 we have that

2 sin²((π/2 − v)/2) is better than 1 − sin v,

and especially so if it is possible to find an explicit expression for π/2 − v.

6. Series expansions can supplement 5.

For small x we have that

x + x²/2 + x³/6 + ⋯ is better than e^x − 1,

x³/6 − x⁵/120 + ⋯ is better than x − sin x,

x/2 + x²/8 + ⋯ is better than 1 − √(1 − x).

7. Use integer calculations when possible.

Even when h is defined by a/n we shall often move one step too little or too many if we try to march the distance a in steps of h. It is better to keep the integer n as the primary parameter and go n steps. If we afterwards wish to halve the step size, then we can just double n and redefine the step size from this.

8. Look at the numbers once in a while.

It is a good idea to (write and) check intermediate results, e.g. while testing the program, and judge whether they make sense. Sound judgment and common sense are invaluable helpers.


    1.14 Truncation errors vs. computational errors

From Taylor's formula

f(x + h) = f(x) + h·f′(x) + (1/2)h²·f″(x) + (1/6)h³·f‴(x) + ⋯

we have that

f′(x) = (f(x + h) − f(x))/h − (1/2)h·f″(x) − (1/6)h²·f‴(x) − ⋯

When we approximate f′(x) by the divided difference, then this value has a truncation error of the form c₁h + c₂h² + ⋯

We can make the truncation error small by using a small value of h. But when h is small then f(x + h) is close to f(x), and we can expect a considerable cancellation in the difference:

fl(f(x + h) − f(x)) = (fl(f(x + h)) − fl(f(x)))(1 + ε₃)
  = (f(x + h)(1 + ε₁) − f(x)(1 + ε₂))(1 + ε₃)
  = f(x + h) − f(x) + f(x + h)ε₁ − f(x)ε₂ + (⋯)ε₃

We know that a difference between two machine numbers can be computed with a small relative error, but if the two terms are the results of computations with accompanying errors, then the cancellation can become catastrophic. If we assume that f(x) can be computed with a relative error of ε̄ = pε, where p is a small integer, then the error can be estimated by

|error| ≤ 2|f(x)|pε + |f(x + h) − f(x)|ε ≈ (2p + 1)ε,

if we assume that f(x) and f′(x) have order of magnitude 1. This error is now magnified in the division by h.

While the truncation error is a nice and differentiable function, the contribution from the computational error is more random, but with a standard deviation inversely proportional to h. The total error can therefore be expressed as

total error ≈ c₁h + d₁/h,

where c₁ is of order of magnitude 1 and d₁ of order of magnitude machine-ε. The expression on the right-hand side has its minimum for h = √(d₁/c₁), with a minimum value of 2√(c₁d₁).

From these considerations we deduce that there is a lower bound for how small values of h it is reasonable to use, that there is a lower limit for the error we can achieve with a given formula and a given machine accuracy, and that this error limit is considerably larger than the machine accuracy.


A formula where the truncation error has the form c₁hᵖ + ⋯ is said to be of order p, and for such a formula we have

total error ≈ c₁hᵖ + d₁/h.

Now the minimum occurs for h = (d₁/(p·c₁))^(1/(p+1)) ≈ ε^(1/(p+1)), with a minimum value of order of magnitude ε^(p/(p+1)).

If for instance p = 2 and ε = 10^(−15), then the optimal h will be in the neighbourhood of 10^(−5), and the optimal error about 10^(−10). We therefore prefer formulae of high order, as long as they don't involve too much extra computation or have other unfavourable properties compared to the low order formula.
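The h-tradeoff is easily observed; the sketch below tabulates the error of the first order divided difference for f = exp at x = 1 (function and point are my choice):

    program DiffDemo;
    { error of (f(x+h) - f(x))/h - f'(x) for decreasing h; f = exp, x = 1 }
    var h, err: double;
        k: integer;
    begin
      h := 1.0;
      for k := 1 to 15 do
      begin
        h := h/10;
        err := (exp(1.0 + h) - exp(1.0))/h - exp(1.0);
        writeln('h = 1e-', k:2, '   error = ', err)
      end
    end.

With ε ≈ 10^(−16) and p = 1 the error decreases until h is near 10^(−8) and then grows again, in agreement with the ε^(p/(p+1)) estimate.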


    Chapter 2

The global error – theoretical aspects

    2.1 Introduction

We study the linear, parabolic equation

u_t = b·u_xx − a·u_x + λu + φ    (2.1)

or, as we prefer to write it here,

P u = u_t − b·u_xx + a·u_x − λu = φ    (2.2)

introducing the partial differential operator P. The coefficients b, a, λ, and φ may depend on t and x. We produce a numerical solution v(t, x), and our basic assumption is that the global error can be expressed in terms of a series expansion in the step sizes k and h:

v(t, x) = u(t, x) − hc − kd − hke − h²f − k²g − ⋯    (2.3)

The auxiliary functions c, d, e, f, and g need not all be present in any particular situation. Often we shall observe that c or e or d are identically zero, such that the numerical solution is second order accurate in one or both of the step sizes.

Strictly speaking v(t, x) is only defined on a discrete set of grid points, but we shall imagine that it is possible to extend it in a differentiable manner to the whole region. Actually this can be done in many ways. The same considerations apply to the auxiliary functions, and we shall in the following see a concrete way of extending these.

We can get information on the auxiliary functions by studying the difference equations and by using Taylor expansions. We first look at the explicit scheme.

[Section 2.2, The explicit method, with equations (2.4)–(2.16), falls on a page missing from this transcript.]

The auxiliary functions are actually only defined on the grid points, but inspired by (2.12)–(2.16) it seems natural to extend them between the grid points such that these differential equations are satisfied at all points in the region. We note that each of the auxiliary functions should satisfy a differential equation very similar to the original one, the only difference lying in the inhomogeneous terms.

    2.3 The initial condition

In order to secure a unique solution to (2.1) we must impose some side conditions. One of these is an initial condition, typically of the form

u(0, x) = u_0(x), A ≤ x ≤ B,    (2.17)

where u_0(x) is a given function of x. It is natural to expect that we start our numerical solution as accurately as possible, i.e. we set v_m^0 = v(0, mh) = u_0(mh) for all grid points between A and B. But we would like to extend v between grid points as well, and the natural thing would be to set v(0, x) = u_0(x), A ≤ x ≤ B. With this assumption we see from (2.3) that

c(0, x) = d(0, x) = e(0, x) = f(0, x) = g(0, x) = ⋯ = 0, A ≤ x ≤ B.    (2.18)

    2.4 Dirichlet boundary conditions

In order to secure uniqueness we must, in addition to the initial condition, impose two boundary conditions, which could look like

u(t, A) = u_A(t), u(t, B) = u_B(t), t > 0,    (2.19)

where u_A(t) and u_B(t) are two given functions of t. Just like for the initial condition, it is natural to require v(t, x) to satisfy these conditions not only at the grid points on the boundary but on the whole boundary, and as a consequence the auxiliary functions will all assume the value 0 on the boundary:

c(t, A) = d(t, A) = e(t, A) = f(t, A) = g(t, A) = ⋯ = 0, t > 0,    (2.20)
c(t, B) = d(t, B) = e(t, B) = f(t, B) = g(t, B) = ⋯ = 0, t > 0.    (2.21)

    2.5 The error for the explicit method

If we have an initial-boundary value problem for (2.1) with Dirichlet boundary conditions, and if we use the explicit method for the numerical solution, then we have the following results for the auxiliary functions:

The differential equation (2.12) for c(t, x) is homogeneous, and so are the side conditions according to (2.18), (2.20), and (2.21). c(t, x) ≡ 0 is a solution, and by uniqueness the only one. It follows that c(t, x) ≡ 0 and therefore that there is no h-contribution to the global error in (2.3).

The differential equation (2.14) for e(t, x) is apparently inhomogeneous, but since c(t, x) ≡ 0 so is c_tt, and the equation is homogeneous after all. So are the side conditions, and we can conclude that e(t, x) ≡ 0.

The global error expression (2.3) for the explicit method therefore takes the form

v(t, x) = u(t, x) − kd − h²f − k²g − ⋯    (2.22)

and we deduce that the explicit method is indeed first order in time and second order in space.

For d we have from (2.13) that P d = (1/2)u_tt, so we must require the problem to be such that u is twice differentiable w.r.t. t. This is usually no problem, except possibly in small neighbourhoods around isolated points on the boundary.

    2.6 The implicit method

For the implicit method

(v_m^(n+1) − v_m^n)/k − b_m^(n+1) δ²v_m^(n+1) + a_m^(n+1) ∇v_m^(n+1) − λ_m^(n+1) v_m^(n+1) = φ_m^(n+1)    (2.23)

(δ² and ∇ denoting the difference approximations to the second and first x-derivatives used in the explicit scheme) it is natural to choose ((n + 1)k, mh) as the expansion point. Equations (2.8)–(2.10) stay the same, with the exception that all functions should be evaluated at ((n + 1)k, mh). In equation (2.7) three terms on the r.h.s. change sign:

(v_m^(n+1) − v_m^n)/k = u_t − (1/2)k·u_tt + (1/6)k²·u_ttt − h·c_t + (1/2)hk·c_tt − k·d_t + (1/2)k²·d_tt − hk·e_t − h²·f_t − k²·g_t + O(k³ + k²h + kh² + h³)    (2.24)

Equating terms as before we get a set of equations rather similar to (2.11)–(2.16). (2.11) and (2.12) are unchanged, there is a single sign change in (2.14), and we can still conclude that c(t, x) ≡ e(t, x) ≡ 0. The remaining equations are

k:  P d = −(1/2)u_tt    (2.25)
h²: P f = −(1/12)b·u_4x + (1/6)a·u_xxx    (2.26)
k²: P g = (1/6)u_ttt + (1/2)d_tt    (2.27)

and the error expansion for the implicit method has the same form as (2.22). Since there is a sign change in (2.25) as compared to (2.13), we can conclude that d_Im(t, x) = −d_Ex(t, x). The r.h.s. of (2.26) is the same as in (2.15), and the sign change in the r.h.s. of (2.27) is compensated by d being of opposite sign. We therefore have that f(t, x) and g(t, x) are the same for the explicit and the implicit method.

    2.7 An example

Consider

    u_t = u_xx,   -1 ≤ x ≤ 1,   t > 0,
    u(0, x) = u_0(x) = cos x,   -1 ≤ x ≤ 1,
    u(t, -1) = u(t, 1) = e^{-t} cos 1,   t > 0.

The true solution is u(t, x) = e^{-t} cos x.

For the explicit method we have

    P d = d_t - d_xx = (1/2) u_tt = (1/2) u.

For f we have similarly

    P f = f_t - f_xx = -(1/12) u_4x = -(1/12) u.

We conclude that for this problem f(t, x) = -(1/6) d(t, x) or d(t, x) = -6 f(t, x).

For the explicit method we must have k = λh² with λ ≤ 1/2. Keeping λ fixed, the leading terms in the error expansion are

    k d + h² f = -6λ h² f + h² f = (1 - 6λ) h² f.

If we choose λ = 1/2, as is common, we get the leading term of the error to be -2h² f(t, x). There is an obvious advantage in choosing λ = 1/6 in which case we obtain fourth order (in h) accuracy.

If we use the implicit method f stays the same and d changes sign, and the leading terms of the error expansion become

    6λ h² f + h² f = (6λ + 1) h² f.

With λ = 1/2 the error becomes 4h² f(t, x), i.e. twice as big (and of opposite sign) as for the explicit method. There is no value of λ that will secure a higher order accuracy.
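These predictions are easy to check numerically. The following is a minimal sketch (Python with NumPy; the function name, step sizes and end time are our own choices for illustration, not part of the text) which runs the explicit method on this problem and prints the ratio of the maximum errors when h is halved with λ kept fixed; we should see a ratio near 4 for λ = 1/2 and near 16 for λ = 1/6:

    import numpy as np

    def explicit_heat(h, lam, T=0.5):
        # FTCS for u_t = u_xx on [-1,1]; u(0,x) = cos x,
        # u(t,-1) = u(t,1) = exp(-t) cos 1; time step k = lam*h^2.
        x = np.linspace(-1.0, 1.0, int(round(2.0 / h)) + 1)
        steps = int(round(T / (lam * h * h)))
        k = T / steps                  # adjusted slightly so we land exactly on t = T
        v = np.cos(x)
        for n in range(1, steps + 1):
            v[1:-1] += (k / h**2) * (v[2:] - 2.0 * v[1:-1] + v[:-2])
            v[0] = v[-1] = np.exp(-n * k) * np.cos(1.0)
        return np.abs(v - np.exp(-T) * np.cos(x)).max()

    for lam in (0.5, 1.0 / 6.0):
        e1, e2 = explicit_heat(0.1, lam), explicit_heat(0.05, lam)
        print(f"lambda = {lam:.4f}: errors {e1:.2e} {e2:.2e}, ratio {e1/e2:.1f}")
    # expected: ratio near 4 (second order) for lambda = 1/2,
    #           near 16 (fourth order) for lambda = 1/6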


    2.8 Crank-Nicolson

The Crank-Nicolson method can be written as

    (v_m^{n+1} - v_m^n)/k - (1/2) b_m^{n+1} δ² v_m^{n+1} - (1/2) b_m^n δ² v_m^n    (2.28)
        + (1/2) a_m^{n+1} δ v_m^{n+1} + (1/2) a_m^n δ v_m^n
        - (1/2) σ_m^{n+1} v_m^{n+1} - (1/2) σ_m^n v_m^n = (1/2)(ψ_m^{n+1} + ψ_m^n).

The natural expansion point is now ((n + 1/2)k, mh). For the approximations to the x-derivatives it is worthwhile to Taylor-expand first in the x-direction to get formulas like (2.8) and (2.9) referring to (nk, mh) and ((n+1)k, mh), and then use the formula

    (1/2) u^{n+1} + (1/2) u^n = u^{n+1/2} + (1/8) k² u_tt + O(k⁴)    (2.29)

on all the individual terms. The resulting equations are

    (v_m^{n+1} - v_m^n)/k = u_t + (1/24) k² u_ttt - h c_t - k d_t - hk e_t - h² f_t - k² g_t + O(k³ + k²h + kh² + h³)    (2.30)

    (1/2)(b_m^{n+1} δ² v_m^{n+1} + b_m^n δ² v_m^n) = b_m^{n+1/2} {u_xx + (1/12) h² u_4x - h c_xx - k d_xx - hk e_xx - h² f_xx - k² g_xx} + (1/8) k² (b u_xx)_tt + O(...)    (2.31)

    (1/2)(a_m^{n+1} δ v_m^{n+1} + a_m^n δ v_m^n) = a_m^{n+1/2} {u_x + (1/6) h² u_xxx - h c_x - k d_x - hk e_x - h² f_x - k² g_x} + (1/8) k² (a u_x)_tt + O(...)    (2.32)

    (1/2)(σ_m^{n+1} v_m^{n+1} + σ_m^n v_m^n) = σ_m^{n+1/2} {u - h c - k d - hk e - h² f - k² g} + (1/8) k² (σu)_tt + O(...)    (2.33)

    (1/2)(ψ_m^{n+1} + ψ_m^n) = ψ + (1/8) k² ψ_tt + O(...)    (2.34)

We insert (2.30) - (2.34) in (2.28) and equate terms with the same powers of h and k:

    1  : P u = ψ    (2.35)
    h  : P c = 0    (2.36)
    k  : P d = 0    (2.37)
    hk : P e = 0    (2.38)


    h² : P f = -(1/12) b u_4x + (1/6) a u_xxx    (2.39)
    k² : P g = (1/24) u_ttt - (1/8) (b u_xx)_tt + (1/8) (a u_x)_tt - (1/8) (σu)_tt - (1/8) ψ_tt    (2.40)

The r.h.s. in (2.40) looks rather complicated, but if the solution to (2.1) is smooth enough such that we can differentiate (2.1) twice w.r.t. t, then we can combine the last four terms in (2.40) to -(1/8) u_ttt and the equation becomes

    k² : P g = -(1/12) u_ttt    (2.41)

If the inhomogeneous term ψ(t, x) in the equation (2.1) can be evaluated at the mid-points ((n + 1/2)k, mh) then it is tempting to use ψ_m^{n+1/2} instead of (1/2)(ψ_m^{n+1} + ψ_m^n) in (2.28). We shall then miss the term with -(1/8) ψ_tt in (2.40) and therefore not have complete advantage of the reduction leading to (2.41). Instead we shall have

    k² : P g = -(1/12) u_ttt + (1/8) ψ_tt.

It is impossible to say in general which is better, but certainly (2.41) is simpler.

Looking at equations (2.35) - (2.41) we again recognize the original equation for u in (2.35), and from (2.36) - (2.38) we may conclude that c(t, x) ≡ d(t, x) ≡ e(t, x) ≡ 0, showing that Crank-Nicolson is indeed second order in both k and h.

We also note from (2.39) that f(t, x) for Crank-Nicolson is the same function as for the explicit and the implicit method.

    2.9 Example continued

For our example we have

    P g = g_t - g_xx = -(1/12) u_ttt = (1/12) u.

We conclude that f(t, x) = -g(t, x) and that the leading terms of the error are

    h² f + k² g = (h² - k²) f.

There is a distinct advantage in choosing k = h in which case the second order terms in the error expansion will cancel.
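This cancellation can also be observed numerically: with k = h, halving both step sizes should divide the error by roughly 16 instead of 4. A minimal sketch (our own naming; a dense solve is used for brevity where a tridiagonal solver would normally be preferred):

    import numpy as np

    def cn_heat(h, T=1.0):
        # Crank-Nicolson for u_t = u_xx on [-1,1] with k = h and
        # Dirichlet data taken from the true solution exp(-t) cos x.
        x = np.linspace(-1.0, 1.0, int(round(2.0 / h)) + 1)
        M = len(x) - 2                     # number of interior points
        k = h
        r = k / (2.0 * h * h)
        A = (np.diag((1.0 + 2.0 * r) * np.ones(M))
             + np.diag(-r * np.ones(M - 1), 1)
             + np.diag(-r * np.ones(M - 1), -1))
        v = np.cos(x)
        for n in range(1, int(round(T / k)) + 1):
            bc = np.exp(-n * k) * np.cos(1.0)   # boundary value at the new level
            rhs = v[1:-1] + r * (v[2:] - 2.0 * v[1:-1] + v[:-2])
            rhs[0] += r * bc                    # implicit boundary contributions
            rhs[-1] += r * bc
            v[1:-1] = np.linalg.solve(A, rhs)
            v[0] = v[-1] = bc
        return np.abs(v - np.exp(-T) * np.cos(x)).max()

    e1, e2 = cn_heat(0.1), cn_heat(0.05)
    print(e1, e2, e1 / e2)   # ratio should be near 16 since the h^2 and k^2 terms cancel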


just consider a derivative condition on one boundary. We shall in turn study three different discretizations of the derivative in (2.48):

    (v_1^n - v_0^n)/h    (first order)    (2.49)

    (v_1^n - v_{-1}^n)/(2h)    (second order, symmetric)    (2.50)

    (-v_2^n + 4 v_1^n - 3 v_0^n)/(2h)    (second order, asymmetric)    (2.51)
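The stated orders are easy to confirm on any smooth function before the formulas are inserted in (2.48); a small sketch in Python (the test function and evaluation point are our own, and at the boundary x = 0 the symmetric formula (2.50) of course refers to the fictitious external point v_{-1}^n):

    import numpy as np

    f, fx = np.cos, lambda x: -np.sin(x)   # test function and its exact derivative
    x0 = 0.3                               # an arbitrary point

    for h in (0.1, 0.05, 0.025):
        d1 = (f(x0 + h) - f(x0)) / h                            # (2.49)
        d2 = (f(x0 + h) - f(x0 - h)) / (2 * h)                  # (2.50)
        d3 = (-f(x0 + 2*h) + 4*f(x0 + h) - 3*f(x0)) / (2 * h)   # (2.51)
        print([abs(d - fx(x0)) for d in (d1, d2, d3)])
    # the first error is halved with h, the other two are divided by four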

    2.12 A first order boundary approximation

We first use the approximation (2.49) in (2.48). If the coefficients α, β and γ depend on t they should be evaluated at t = nk:

    α v_0^n - β (v_1^n - v_0^n)/h = γ,   t > 0.    (2.52)

We now use the assumption (2.3) and Taylor-expand v_1^n around (nk, 0):

    α{u - h c - k d - hk e - h² f - k² g} - β{u_x + (1/2) h u_xx + (1/6) h² u_xxx - h c_x - (1/2) h² c_xx - k d_x - (1/2) hk d_xx - hk e_x - h² f_x - k² g_x} = γ + O(...)    (2.53)

Collecting terms with 1, h, k, hk, h², and k² as before we get

    1  : α u - β u_x = γ    (2.54)
    h  : α c - β c_x = -(1/2) β u_xx    (2.55)
    k  : α d - β d_x = 0    (2.56)
    hk : α e - β e_x = (1/2) β d_xx    (2.57)
    h² : α f - β f_x = -(1/6) β u_xxx + (1/2) β c_xx    (2.58)
    k² : α g - β g_x = 0    (2.59)

We recognize the condition (2.48) for u in (2.54). As for c, the boundary condition (2.55) is no longer homogeneous and we shall expect c to be nonzero. This holds independently of which method is used for the discretization of the equation (2.1). So if we use a first order boundary approximation we get a global error which is first order in h.


2.13 The symmetric second order approximation

We now apply the symmetric approximation (2.50) in (2.48):

    α v_0^n - β (v_1^n - v_{-1}^n)/(2h) = γ,   t > 0.    (2.60)

We again use the assumption (2.3) and Taylor-expand v_1^n and v_{-1}^n around (nk, 0):

    α{u - h c - k d - hk e - h² f - k² g} - β{u_x + (1/6) h² u_xxx - h c_x - k d_x - hk e_x - h² f_x - k² g_x} = γ + O(h³ + h²k + hk² + k³)    (2.61)

Collecting terms with 1, h, k, hk, h², and k² as before we get

    1  : α u - β u_x = γ    (2.62)
    h  : α c - β c_x = 0    (2.63)
    k  : α d - β d_x = 0    (2.64)
    hk : α e - β e_x = 0    (2.65)
    h² : α f - β f_x = -(1/6) β u_xxx    (2.66)
    k² : α g - β g_x = 0    (2.67)

We recognize the condition (2.48) for u in (2.62). We now have a homogeneous condition (2.63) for c and this will assure that c(t, x) ≡ 0 when we combine (2.60) with the explicit, the implicit or the Crank-Nicolson method. We shall also have e(t, x) ≡ 0, but in order to have d(t, x) ≡ 0 we must use the Crank-Nicolson method.

2.14 An asymmetric second order approximation

We finally apply the approximation (2.51) in (2.48):

    α v_0^n - β (-v_2^n + 4 v_1^n - 3 v_0^n)/(2h) = γ,   t > 0.    (2.68)

We again use the assumption (2.3) and Taylor-expand v_1^n and v_2^n around (nk, 0):

    α{u - h c - k d - hk e - h² f - k² g} - β{u_x - (1/3) h² u_xxx - h c_x - k d_x - hk e_x - h² f_x - k² g_x} = γ + O(h³ + h²k + hk² + k³)    (2.69)


Collecting terms with 1, h, k, hk, h², and k² as before we get

    1  : α u - β u_x = γ    (2.70)
    h  : α c - β c_x = 0    (2.71)
    k  : α d - β d_x = 0    (2.72)
    hk : α e - β e_x = 0    (2.73)
    h² : α f - β f_x = (1/3) β u_xxx    (2.74)
    k² : α g - β g_x = 0    (2.75)

All the same conclusions as for the symmetric case will hold also in this asymmetric case. One disadvantage which does not show in equations (2.70) - (2.75) is that the next h-term is now third order and therefore can be expected to interfere more than the fourth order term which is present in the symmetric case.


Chapter 3

Estimating the global error and order

    3.1 The local error

The error of a finite difference scheme for solving a partial differential equation is often given by means of the local error, which is the error committed in one step given correct starting values, or more frequently as the local truncation error, given in terms of a Taylor expansion, again for a single step and with presumed correct starting values. Rather than giving numerical values one often resorts to giving the order of the scheme in terms of the step size, such as O(h) or O(h²). The interesting issues, however, are the magnitude of the global error, i.e. the difference between the true solution and the computed value at a specified point, in a sense the cumulated value of all the local errors up to this point, and the order of this error in terms of the step size used.

    3.2 The global error

We shall first define what we mean by the global error being of order, say, O(h). Let u(x) be the true solution, and let v(x) be the computed solution. v(x) is actually only defined on a discrete set of grid points, but if need be it can be extended in a differentiable fashion. We shall assume that the computed solution can be written as

    v(x) = u(x) - h c(x) - h² d(x) - h³ f(x) - ... ,    (3.1)

where c(x), d(x) and f(x) are differentiable functions. These auxiliary functions are strictly speaking also defined only on the discrete set of grid points but can likewise be extended in a differentiable fashion to the whole interval in question if we wish.

If the function c(x) happens to be identically 0 then the method is (at least) of second order, otherwise it is of first order. Even if c(x) is not identically 0 it might very well have isolated zeroes. At such places our analysis might give results which are difficult to interpret correctly. Therefore the analysis should always be performed for a substantial set of points in order to give trustworthy results.

In the following we shall show how we, by performing calculations with various values of the step size h, can extract information not only about the true solution but also about the order and magnitude of the error.

We shall begin our analysis in one dimension and later extend it to functions of two or more variables. A calculation with step size h will thus result in

    v1 = u - h c - h² d - h³ f - ...

A second calculation with twice as large a step size gives

    v2 = u - 2h c - 4h² d - 8h³ f - ...

We can now eliminate u by subtraction:

    v1 - v2 = h c + 3h² d + 7h³ f + ...

A third calculation with 4h is necessary to retrieve information about the order:

    v3 = u - 4h c - 16h² d - 64h³ f - ... ,

whence

    v2 - v3 = 2h c + 12h² d + 56h³ f + ...

and a division gives

    (v2 - v3)/(v1 - v2) = 2 (c + 6h d + 28h² f + ...)/(c + 3h d + 7h² f + ...).    (3.2)

This quotient can be computed in all those points where we have information from all three calculations, i.e. all grid points corresponding to the last calculation. If c ≠ 0 and h is suitably small we shall observe numbers in the neighbourhood of 2 in all points, and this would indicate that the method is of first order. If c ≡ 0 and d ≠ 0, then the quotient will assume values close to 4, and if this happens for many points and not just at isolated spots then we can deduce that the method is of second order. The smaller h, the smaller the influence of the next terms in the numerator and the denominator, and the picture should become clearer.
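In code the whole test is a few lines once the three calculations are available. The following sketch (with made-up numbers for u, c and d, purely to show the mechanics) fakes v1, v2, v3 directly from the expansion (3.1) and applies (3.2):

    import numpy as np

    u, c, d = 1.7, 1.0, 1.0          # hypothetical true value and error coefficients
    h = 0.01
    v = lambda s: u - s * c - s * s * d    # "computed" value with step size s
    v1, v2, v3 = v(h), v(2 * h), v(4 * h)

    ratio = (v2 - v3) / (v1 - v2)
    print(ratio, np.log2(ratio))     # near 2, i.e. estimated order near 1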

The error in the first calculation, v1, is given by

    e1 = u - v1 = h c + h² d + h³ f + ...

If we observe many numbers of the ratio (3.2) in the neighbourhood of 2, indicating that |c| is substantially larger than h|d| and that the method therefore is of first order, then e1 is represented reasonably well by v1 - v2:

    e1 = v1 - v2 - 2h² d - 6h³ f - ... ,    (3.3)

and v1 - v2 can be used as an estimate of the error in v1.

One could choose to add this result to v1 and thereby get more accurate results. This process is called Richardson extrapolation and can be done for all grid points involved in the calculation of v2. If the error (estimate) behaves nicely we might even consider interpolating to the intermediate points and thus get extrapolated values with spacing h. Interpolation or not, we cannot at the same time, i.e. without doing some extra work, get a realistic estimate of the error in this improved value. The old estimate can of course still be used but it is expected to be rather pessimistic.

If in contrast we observe many numbers in the neighbourhood of 4 then |c| is substantially less than h|d| and is probably 0. At the same time |d| will be larger than h|f|, and the method would be of second order with an error

    e1 = u - v1 = h² d + h³ f + ...

This error will be estimated nicely by (v1 - v2)/3:

    e1 = (1/3)(v1 - v2) - (4/3) h³ f - ...    (3.4)

It is thus important to check the order before calculating an estimate of the error and certainly before making any corrections using this estimate. If in doubt it is usually safer to estimate the order on the low side. If the order is 2 and the correct error estimate therefore (v1 - v2)/3, then misjudging the order to be 1 and using v1 - v2 for the error estimate would not be terribly bad, and actually on the safe side. But if one wants to attempt Richardson extrapolation it is very important to have the right order.
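The bookkeeping can be collected in a small sketch (names and interface are our own; p is the order already confirmed by the ratio test):

    def error_estimate(v1, v2, p):
        # estimate of u - v1 from computations with steps h and 2h for a
        # method of (verified) order p; p = 1 gives v1 - v2 as in (3.3),
        # p = 2 gives (v1 - v2)/3 as in (3.4)
        return (v1 - v2) / (2**p - 1)

    def richardson(v1, v2, p):
        # extrapolated value: v1 corrected by the estimated error
        return v1 + error_estimate(v1, v2, p)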


Figure 3.1: The function w(x) = 2(1 + 2x)/(1 + x).

If our task is to compute function values with a prescribed error tolerance then the error estimates can also be used to predict a suitable step size which would satisfy this requirement, and in the second round to check that the ensuing calculations are satisfactory.

Can we trust these observations? Yes, if we really observe values of (3.2) between say 1.8 and 2.2 for all relevant grid points then the method is of first order and the first term in the remainder series dominates the rest. Discrepancies from this pattern in small areas are also allowed. They may be due to the fact that c(x) has an isolated zero. This can be checked by observing the values of v1 - v2 in a neighbourhood. These numbers, which are dominated by the term h c, will then display a change of sign. In such a region the error estimate v1 - v2 might be pessimistic, but since the absolute value at the same time is rather small this should not give reason to worry.

If a method is of first order and we choose to exploit the error estimate to adjust the calculated value (i.e. to perform Richardson extrapolation) then it might be reasonable to assume that the resulting method is of second order. This of course can be tested by repeating the above process.

Once more it should be stressed that the order must be checked before carrying on. Therefore we need an extra calculation, v4, such that we can compute three extrapolated values on the basis of which we can get information about the order. We of course expect the order to be at least 2. If the results do not conform then it might be an idea to review the calculations.


Figure 3.2: c(t, x).

What will actually happen if one performs a Richardson extrapolation based on a wrong assumption about the order? Usually not too much. If one attempts to eliminate a second order term in a first order calculation then the result will still be of first order; and if one attempts to eliminate a first order term which is not there then the absolute value of the error might double, but the result will retain its high order.

If one wants to understand in detail what might happen to the ratio (3.2) in the strange areas, i.e. how the ratio might vary when h|d| is not small compared to |c|, then one can consider the behaviour of the function

    w(x) = 2(1 + 2x)/(1 + x),    (3.5)

where x = 3h d/c.

If x is positive, then 2 < w(x) < 4, and w(x) → 4 when x → ∞. This corresponds to c and d having the same sign.
If x is small then w(x) ≈ 2.
If x is large and negative then w(x) > 4, and w(x) → 4 when x → -∞.
The situation x → ±∞ corresponds to c = 0, i.e. that the method is of second order.
The picture becomes rather blurred when x is close to -1, i.e. when c and d have opposite sign and c ≈ -3h d:

    x → -1⁻  ⇒  w → +∞
    x → -1⁺  ⇒  w → -∞


Figure 3.3: g(t, x).

    -1 < x < -1/2  ⇒  w < 0

But in these cases we are far away from |c| ≫ h|d|.

Reducing the step size by one half corresponds to reducing x by one half.

If 0 < w(x) < 4 then w(x/2) will be closer to 2.
If w(x) < 0 then 0 < w(x/2) < 2.
If 6 < w(x) then w(x/2) < 0.
If 4 < w(x) < 6 then w(x/2) > w(x).

If c and d have opposite sign and c is not dominant the picture will be rather chaotic, but a suitable reduction of h will result in a clearer picture if the fundamental assumptions are valid.
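These rules are easy to verify; a quick sketch with one x-value from each of the four regimes:

    w = lambda x: 2 * (1 + 2 * x) / (1 + x)

    for x in (0.5, -0.75, -1.5, -5.0):
        print(f"x = {x:5.2f}:  w(x) = {w(x):6.2f},  w(x/2) = {w(x / 2):6.2f}")
    # 0 < w < 4  ->  w(x/2) closer to 2    (x =  0.50:  2.67 ->  2.40)
    # w < 0      ->  w(x/2) in (0, 2)      (x = -0.75: -4.00 ->  0.80)
    # 6 < w      ->  w(x/2) < 0            (x = -1.50:  8.00 -> -4.00)
    # 4 < w < 6  ->  w(x/2) > w(x)         (x = -5.00:  4.50 ->  5.33)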

We have been rather detailed in our analysis of first order methods with a non-vanishing second order term. Quite similar analyses can be made for second and third order, for second and fourth order, or for higher orders. If the ratio (3.2) is close to 2^p then our method is of order p.

If u is a function of two or more variables then we can perform similar analyses taking one variable at a time. If say u(t, x) is a function of two variables, t and x, and v is a numerical approximation based on step sizes k and h, then our assumption would be

    v1 = u - h c - k d - h² f - k² g - ...


    t \ x 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

    0.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1

    0.2 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    0.3 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    0.4 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    0.5 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    0.6 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    0.7 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    0.8 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    0.9 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

    Figure 3.4: h-ratio for first order boundary condition.

A calculation with 2h and k gives

    v2 = u - 2h c - k d - 4h² f - k² g - ... ,

such that

    v1 - v2 = h c + 3h² f + ... ,

and we are back in the former case. We compute

    v3 = u - 4h c - k d - 16h² f - k² g - ... ,

and can check the order of approximation in h using the ratio (3.2).

For the k-dependence we compute

    v4 = u - h c - 2k d - h² f - 4k² g - ...

and

    v5 = u - h c - 4k d - h² f - 16k² g - ...

and using v1, v4 and v5 we can determine the order of approximation in k and the corresponding term in the error of v1.
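The mechanics are the same as in one dimension, only with two families of step sizes. A sketch with a hypothetical expansion in which c ≠ 0 and d ≡ 0, so that the h-ratio should come out near 2 and the k-ratio near 4:

    u, h, k = 2.0, 0.01, 0.01
    c, d, f, g = 1.0, 0.0, 1.0, 2.0              # hypothetical error coefficients
    v = lambda sh, sk: u - sh*c - sk*d - sh*sh*f - sk*sk*g

    v1, v2, v3 = v(h, k), v(2*h, k), v(4*h, k)   # vary h, keep k fixed
    v4, v5 = v(h, 2*k), v(h, 4*k)                # vary k, keep h fixed

    print((v2 - v3) / (v1 - v2))    # h-ratio: near 2, first order in h
    print((v4 - v5) / (v1 - v4))    # k-ratio: 4 here, second order in k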

These error terms can then be used to decide how the step sizes might be reduced in order to achieve a given error tolerance. Richardson extrapolation is also a possibility here to increase the order and accuracy. In particular it would be tempting to improve an O(k + h²)-method to O(k² + h²) by extrapolation in the k-direction. In order to check that such extrapolation(s) give the expected


    t \ x 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

    0.1 1.9 14.7 3.9 1.2 4.3 5.2 4.5 3.4 2.8 2.8

    0.2 1.8 17.1 2.7 4.1 4.2 4.0 4.0 4.0 4.0 4.0

    0.3 1.7 6.1 3.6 4.1 4.0 4.0 4.0 4.0 4.0 4.0

    0.4 1.5 4.2 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    0.5 1.3 3.7 4.1 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    0.6 1.2 3.5 4.1 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    0.7 1.0 3.5 4.1 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    0.8 0.9 3.5 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    0.9 0.8 3.6 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    1.0 0.8 3.7 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

Figure 3.5: k-ratio for first order boundary condition.

results, it is again necessary to supplement with further calculations (and with a more advanced numbering system for these v's).

So what is the cost in terms of work or computer time to get this extra information? We shall compare with the computational work for v1 under the assumption that the work is proportional to the number of grid points. Therefore v2 costs half as much as v1, and v3 costs one fourth. The work involved in calculating v1 - v2, v2 - v3 and their quotient, which is done for 1/4 of the grid points, will not be considered since it is assumed to be considerably less than the fundamental difference calculations.

The work involved in finding v1, v2 and v3 is therefore 1.75, i.e. an extra cost of 75%, and that is actually very inexpensive for an error estimate. If the numbers allow an extrapolation then the result of this is expected to be better than a calculation with half the step size, and we are certainly better off. If the computational work increases faster than the number of grid points then the result is even more in favour of the present method.

If u is a function of two variables with two independent step sizes then the cost of the five necessary calculations is 2.5 times the cost of v1. This is still a reasonable price to pay. Knowing the magnitude of the error and its dependence on the step sizes enables us to choose near-optimal combinations of these and thus avoid redundant calculations, and a possible extrapolation might improve the results considerably more than halving the step sizes and quadrupling the work.


    t \ x 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

    0.1 3.2 3.4 3.6 3.7 3.8 3.9 3.9 3.9 3.9 4.0

    0.2 3.4 3.5 3.6 3.7 3.7 3.8 3.8 3.8 3.9 3.9

    0.3 3.5 3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.8 3.8

    0.4 3.5 3.6 3.6 3.7 3.7 3.7 3.8 3.8 3.8 3.8

    0.5 3.6 3.6 3.6 3.7 3.7 3.7 3.8 3.8 3.8 3.8

    0.6 3.6 3.6 3.7 3.7 3.7 3.7 3.7 3.8 3.8 3.8

    0.7 3.6 3.6 3.7 3.7 3.7 3.7 3.7 3.7 3.8 3.8

    0.8 3.6 3.6 3.7 3.7 3.7 3.7 3.7 3.7 3.8 3.8

    0.9 3.6 3.6 3.7 3.7 3.7 3.7 3.7 3.7 3.8 3.8

    1.0 3.6 3.6 3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.8

    Figure 3.6: h-ratio for asymmetric second order.

    3.3 Limitations of the technique.

It is essential for the technique to give satisfactory results that the leading term in the error expansion is the dominant one. This will always be the case when the step size is small, but how can we know that the step size is small enough?

If we compute with step size h and if the result is first order as indicated in formula (3.1) then we assume that the error term h c is small. It is therefore reasonable to assume that the second term, h² d, is very small because it contains the factor h², and therefore that the ratio (3.2) will be close to 2. In this case the estimate of the error will also be reliable. If the method is second order the leading term is h² d. It is often the case with symmetric formulae that the next term is fourth order, and if h² d is small then we may assume that h⁴ g is very small. Thus the ratio (3.2) will give numbers close to 4 and we should be able to trust the error estimate.

If, however, a method is second order and the next term in the error expansion is third order, the situation is quite different. Now we can expect the third order term to interfere significantly, making the order determination difficult and extrapolation a dubious affair.

Even if we can safely determine the order we should not expect too much of the extrapolation. Going from first to second or from second to fourth order we shall usually double the number of correct decimals, but going from second to third or from fourth to sixth order this number will only increase by 50%.


    t \ x 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

    0.1 4.0 4.0 4.0 4.0 4.0 4.0 4.1 4.2 4.4 4.9

    0.2 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 3.9

    0.3 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.2

    0.4 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9

    0.5 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.1

    0.6 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    0.7 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    0.8 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    0.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    1.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

    Figure 3.7: k-ratio for asymmetric second order.

So the main area of application is to first (and second) order methods, but of course here the need is also the greatest.

    3.4 An example

We shall illustrate our techniques on a simple example involving the heat equation:

    u_t = u_xx,   0 ≤ x ≤ 1,   t ≥ 0,

with initial condition

    u(0, x) = cos x,   0 ≤ x ≤ 1,

and boundary conditions

    u_x(t, 0) = 0,   t ≥ 0,
    u(t, 1) = e^{-t} cos 1,   t ≥ 0.

The solution is u(t, x) = e^{-t} cos x.

We have solved numerically using Crank-Nicolson and wish to study the behaviour of the global error using various discretizations of the derivative boundary condition. First of all we have solved the initial-boundary value problems (using h = k = 0.025) for the functions c(t, x) and g(t, x) corresponding to the first


    t \ x 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

    0.1 15.9 15.9 15.9 16.0 16.0 16.1 16.2 16.5 17.2 18.6

    0.2 16.0 16.0 16.0 16.1 16.1 16.1 16.1 16.0 15.8 15.5

    0.3 16.1 16.1 16.1 16.1 16.0 16.0 16.0 16.0 16.1 16.5

    0.4 16.0 16.0 16.0 16.0 16.1 16.1 16.1 16.1 16.0 15.8

    0.5 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.1 16.2

    0.6 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.1 16.0 15.9

    0.7 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.1

    0.8 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 15.9

    0.9 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.1

    1.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0

    Figure 3.8: h-ratio for symmetric second order with h = k.

order boundary approximation (cf. Section 2.12) and show the results graphically in Fig. 3.2 and Fig. 3.3.

The values of c(t, x) lie between 0 and 0.28 and those of g(t, x) between 0 and 0.022, and from this we could estimate the truncation error for given values of the step sizes h and k. Or we could suggest step sizes in order to make the truncation error smaller than a given tolerance.

More often we have little knowledge of the auxiliary functions beforehand and we shall instead extract empirical knowledge from our calculations.

Using formula (3.2) and the similar one for k we check the order of the method, calculating the ratios on a 10 × 10 grid using step sizes that are 16 times smaller. The results are shown in Fig. 3.4 and Fig. 3.5 for h and k respectively.

The method is clearly first order in h with only few values deviating appreciably from 2.0. The picture is more confusing for k where the second order is only convincing for larger values of t or x. For small values of x or t, k² g(t, x) is much smaller than h c(t, x) and a greater sensitivity is to be expected here.

The values for c(t, x) as determined by v1 - v2 (see (3.3)) with h = k = 0.00625 agree within 7% with those obtained from solving the differential equation for c, and a better agreement can be obtained using smaller step sizes. The corresponding determination of g(t, x) is reasonably good when t and x are not too close to 0. In the regions where we have difficulty determining the order (cf. Fig. 3.5) we can of course have little trust in an application of